merlyn/collector/utils
Yitong Li 2f7a818744
fix(collector): infer file extension from Content-Type for URLs without explicit extensions (#5252)
* fix(collector): infer file extension from Content-Type for URLs without explicit extensions

When downloading files from URLs like https://arxiv.org/pdf/2307.10265,
the path has no recognizable file extension. The downloaded file gets
saved without an extension (or with a nonsensical one like .10265),
causing processSingleFile to reject it with 'File extension .10265
not supported for parsing'.

Fix: after downloading, check if the filename has a supported file
extension. If not, inspect the response Content-Type header and map
it to the correct extension using the existing ACCEPTED_MIMES table.

For example, a response with Content-Type: application/pdf will cause
the file to be saved with a .pdf extension, allowing it to be processed
correctly.
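
A minimal sketch of the fallback described above, for illustration only. The
function name inferExtension and the table ASSUMED_ACCEPTED_MIMES are
hypothetical stand-ins; the real ACCEPTED_MIMES table may have a different
shape and different entries, and the actual change may replace rather than
append the extension.

```javascript
const path = require("path");

// Hypothetical stand-in for ACCEPTED_MIMES (assumed shape: MIME type -> list of extensions).
const ASSUMED_ACCEPTED_MIMES = {
  "application/pdf": [".pdf"],
  "text/plain": [".txt", ".md"],
  "text/html": [".html"],
};

// Every extension that appears anywhere in the assumed table counts as supported.
const SUPPORTED_EXTENSIONS = new Set(
  Object.values(ASSUMED_ACCEPTED_MIMES).flat()
);

/**
 * Returns the filename unchanged if it already has a supported extension.
 * Otherwise appends the first extension mapped to the response Content-Type,
 * or returns the filename unchanged when the MIME type is unknown.
 */
function inferExtension(filename, contentType) {
  const currentExt = path.extname(filename).toLowerCase();
  if (SUPPORTED_EXTENSIONS.has(currentExt)) return filename;

  // Drop parameters such as "; charset=utf-8" before the lookup.
  const mime = (contentType || "").split(";")[0].trim().toLowerCase();
  const [ext] = ASSUMED_ACCEPTED_MIMES[mime] || [];
  return ext ? `${filename}${ext}` : filename;
}
```

Under these assumptions, a file named 2307.10265 served as application/pdf
is saved as 2307.10265.pdf. Appending keeps the arXiv identifier intact
instead of treating .10265 as an extension to be swapped out.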

Fixes #4513

* small refactor

---------

Co-authored-by: Timothy Carambat <rambat1010@gmail.com>
2026-03-23 09:40:22 -07:00
comKey [BETA] Live document sync (#1719) 2024-06-21 13:38:50 -07:00
downloadURIToFile fix(collector): infer file extension from Content-Type for URLs without explicit extensions (#5252) 2026-03-23 09:40:22 -07:00
EncryptionWorker [BETA] Live document sync (#1719) 2024-06-21 13:38:50 -07:00
extensions linting & show descriptive error for bad addtoWorkspace request body 2026-03-09 11:30:53 -07:00
files chore: add ESLint to /collector (#5128) 2026-03-05 16:25:23 -08:00
http chore: add ESLint to /collector (#5128) 2026-03-05 16:25:23 -08:00
logger patch logger for full logs 2024-07-19 18:35:41 -07:00
OCRLoader chore: add ESLint to /collector (#5128) 2026-03-05 16:25:23 -08:00
runtimeSettings Fetch, Parse, and Create Documents for Statically Hosted Files (#4398) 2025-10-01 15:49:05 -07:00
tokenizer Add tokenizer improvements via Singleton class and estimation (#3072) 2025-01-30 17:55:03 -08:00
url Add ability to auto-handle YT video URLs in uploader & chat (#4547) 2025-10-15 12:18:57 -07:00
WhisperProviders Adjust fix path to use ESM import (#4867) 2026-01-15 16:13:21 -08:00
constants.js Fetch, Parse, and Create Documents for Statically Hosted Files (#4398) 2025-10-01 15:49:05 -07:00
shell.js Adjust fix path to use ESM import (#4867) 2026-01-15 16:13:21 -08:00