merlyn/collector
Yitong Li 2f7a818744
fix(collector): infer file extension from Content-Type for URLs without explicit extensions (#5252)
* fix(collector): infer file extension from Content-Type for URLs without explicit extensions

When downloading files from URLs like https://arxiv.org/pdf/2307.10265,
the path has no recognizable file extension. The downloaded file gets
saved without an extension (or with a nonsensical one like .10265),
causing processSingleFile to reject it with 'File extension .10265
not supported for parsing'.

Fix: after downloading, check if the filename has a supported file
extension. If not, inspect the response Content-Type header and map
it to the correct extension using the existing ACCEPTED_MIMES table.

For example, a response with Content-Type: application/pdf will cause
the file to be saved with a .pdf extension, allowing it to be processed
correctly.
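
The fallback described above can be sketched roughly as follows. This is a minimal illustration, not the code from the commit: the `ensureSupportedExtension` helper is a hypothetical name, and the `ACCEPTED_MIMES` object here is a small stand-in whose shape (MIME type mapped to a list of extensions) and entries are assumed rather than copied from the repository.

```javascript
const path = require("node:path");

// Hypothetical stand-in for the collector's ACCEPTED_MIMES table. The shape
// and entries are assumptions for illustration only.
const ACCEPTED_MIMES = {
  "application/pdf": [".pdf"],
  "text/plain": [".txt", ".md"],
  "text/html": [".html"],
};

// Every extension that appears anywhere in the table counts as supported.
const SUPPORTED_EXTENSIONS = new Set(Object.values(ACCEPTED_MIMES).flat());

// Hypothetical helper: returns the filename unchanged when it already has a
// supported extension; otherwise appends the first extension mapped to the
// response Content-Type, if that type is known.
function ensureSupportedExtension(filename, contentType) {
  const currentExt = path.extname(filename).toLowerCase();
  if (SUPPORTED_EXTENSIONS.has(currentExt)) return filename;

  // Drop parameters such as "; charset=utf-8" before the lookup.
  const mime = String(contentType || "").split(";")[0].trim().toLowerCase();
  const [inferredExt] = ACCEPTED_MIMES[mime] || [];
  if (!inferredExt) return filename; // unknown type: leave the name as-is

  return `${filename}${inferredExt}`;
}

// The arXiv case from the description: ".10265" is not a supported extension.
console.log(ensureSupportedExtension("2307.10265", "application/pdf"));
```

In this sketch the inferred extension is appended rather than substituted, so a name like `2307.10265` keeps its identifier intact; whether the actual fix appends or replaces is not stated in the commit message.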

Fixes #4513

* small refactor

---------

Co-authored-by: Timothy Carambat <rambat1010@gmail.com>
2026-03-23 09:40:22 -07:00
| Name | Last commit | Date |
| --- | --- | --- |
| __tests__/utils | fix(collector): infer file extension from Content-Type for URLs without explicit extensions (#5252) | 2026-03-23 09:40:22 -07:00 |
| extensions | chore: add ESLint to /collector (#5128) | 2026-03-05 16:25:23 -08:00 |
| hotdir | Document Processor v2 (#442) | 2023-12-14 15:14:56 -08:00 |
| middleware | chore: add ESLint to /collector (#5128) | 2026-03-05 16:25:23 -08:00 |
| processLink | fix(collector): infer file extension from Content-Type for URLs without explicit extensions (#5252) | 2026-03-23 09:40:22 -07:00 |
| processRawText | chore: add ESLint to /collector (#5128) | 2026-03-05 16:25:23 -08:00 |
| processSingleFile | Make XLSX spreadsheets visible in chat by combining sheets (#4847) | 2026-01-13 15:46:16 -08:00 |
| storage | feat: Embed on-instance Whisper model for audio/mp4 transcribing (#449) | 2023-12-15 11:20:13 -08:00 |
| utils | fix(collector): infer file extension from Content-Type for URLs without explicit extensions (#5252) | 2026-03-23 09:40:22 -07:00 |
| .env.example | Add HTTP request/response logging middleware for development mode (#4425) | 2025-09-29 13:33:15 -07:00 |
| .gitignore | Add HTTP request/response logging middleware for development mode (#4425) | 2025-09-29 13:33:15 -07:00 |
| .nvmrc | dev build with new epub2 build target and remove patch work (#4694) | 2025-11-26 17:36:34 -08:00 |
| eslint.config.mjs | chore: add ESLint to /collector (#5128) | 2026-03-05 16:25:23 -08:00 |
| index.js | Add HTTP request/response logging middleware for development mode (#4425) | 2025-09-29 13:33:15 -07:00 |
| nodemon.json | Document Processor v2 (#442) | 2023-12-14 15:14:56 -08:00 |
| package.json | chore: add ESLint CI workflow (#5160) | 2026-03-09 14:27:08 -07:00 |
| yarn.lock | chore: add ESLint to /collector (#5128) | 2026-03-05 16:25:23 -08:00 |