* fix(collector): infer file extension from Content-Type for URLs without explicit extensions

  When downloading files from URLs like https://arxiv.org/pdf/2307.10265, the path has no recognizable file extension. The downloaded file gets saved without an extension (or with a nonsensical one like .10265), causing processSingleFile to reject it with 'File extension .10265 not supported for parsing'.

  Fix: after downloading, check if the filename has a supported file extension. If not, inspect the response Content-Type header and map it to the correct extension using the existing ACCEPTED_MIMES table. For example, a response with Content-Type: application/pdf will cause the file to be saved with a .pdf extension, allowing it to be processed correctly.

  Fixes #4513

* small refactor

---------

Co-authored-by: Timothy Carambat <rambat1010@gmail.com>

| | | |
|---|---|---|
| __tests__/utils | ||
| extensions | ||
| hotdir | ||
| middleware | ||
| processLink | ||
| processRawText | ||
| processSingleFile | ||
| storage | ||
| utils | ||
| .env.example | ||
| .gitignore | ||
| .nvmrc | ||
| eslint.config.mjs | ||
| index.js | ||
| nodemon.json | ||
| package.json | ||
| yarn.lock | ||
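
The extension-inference step described in the commit message above can be sketched roughly as follows. This is a minimal illustration, not the collector's actual code: the `ACCEPTED_MIMES` object here is a small hypothetical stand-in (the real table's shape and entries may differ), and `extensionFromContentType` and `resolveFilename` are made-up helper names. Appending the inferred extension rather than replacing the existing suffix is also an assumption.

```javascript
const path = require("node:path");

// Hypothetical stand-in for the collector's ACCEPTED_MIMES table.
// Assumed shape: MIME type -> list of file extensions.
const ACCEPTED_MIMES = {
  "application/pdf": [".pdf"],
  "text/plain": [".txt", ".md"],
  "text/html": [".html"],
};

// Every extension the (assumed) table considers parseable.
const SUPPORTED_EXTENSIONS = new Set(Object.values(ACCEPTED_MIMES).flat());

// Map a Content-Type header value to an extension, or null if unknown.
function extensionFromContentType(contentType) {
  if (!contentType) return null;
  // Drop parameters such as "; charset=utf-8" before the lookup.
  const mime = contentType.split(";")[0].trim().toLowerCase();
  const extensions = ACCEPTED_MIMES[mime];
  return Array.isArray(extensions) && extensions.length > 0
    ? extensions[0]
    : null;
}

// Keep the filename if its extension is supported; otherwise append the
// extension inferred from the response Content-Type, when there is one.
function resolveFilename(filename, contentType) {
  const currentExt = path.extname(filename).toLowerCase();
  if (SUPPORTED_EXTENSIONS.has(currentExt)) return filename;
  const inferred = extensionFromContentType(contentType);
  return inferred ? `${filename}${inferred}` : filename;
}

console.log(resolveFilename("2307.10265", "application/pdf")); // 2307.10265.pdf
console.log(resolveFilename("paper.pdf", "text/html")); // paper.pdf
```

For the arXiv URL in the commit message, `path.extname` reports `.10265`, which is not in the supported set, so the sketch falls back to the Content-Type header. A filename that already carries a supported extension is left untouched even when the header disagrees.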