merlyn

Yitong Li 2f7a818744 fix(collector): infer file extension from Content-Type for URLs without explicit extensions (#5252)	2026-03-23 09:40:22 -07:00

* fix(collector): infer file extension from Content-Type for URLs without explicit extensions

When downloading files from URLs like https://arxiv.org/pdf/2307.10265, the path has no recognizable file extension. The downloaded file gets saved without an extension (or with a nonsensical one like .10265), causing processSingleFile to reject it with 'File extension .10265 not supported for parsing'.

Fix: after downloading, check if the filename has a supported file extension. If not, inspect the response Content-Type header and map it to the correct extension using the existing ACCEPTED_MIMES table. For example, a response with Content-Type: application/pdf will cause the file to be saved with a .pdf extension, allowing it to be processed correctly.

Fixes #4513

* small refactor

---------

Co-authored-by: Timothy Carambat <rambat1010@gmail.com>
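
The fix described in the commit message can be sketched roughly as follows. This is an illustrative sketch only, not the code from the commit: the `ACCEPTED_MIMES` table below is a small hypothetical stand-in (the real table's shape and entries are not shown in this listing), `extensionOf` and `resolveFilename` are made-up helper names, and appending the inferred extension (rather than replacing the existing suffix) is an assumption.

```javascript
// Hypothetical stand-in for the collector's ACCEPTED_MIMES table.
// The real table's shape and contents may differ.
const ACCEPTED_MIMES = {
  "application/pdf": [".pdf"],
  "text/plain": [".txt", ".md"],
  "text/html": [".html"],
};

// Every extension the (assumed) table treats as parseable.
const SUPPORTED_EXTENSIONS = new Set(Object.values(ACCEPTED_MIMES).flat());

// Return the trailing ".ext" of a filename in lowercase, or "" if there is none.
function extensionOf(filename) {
  const dot = filename.lastIndexOf(".");
  return dot > 0 ? filename.slice(dot).toLowerCase() : "";
}

// Hypothetical helper: keep the filename if its extension is supported,
// otherwise append an extension inferred from the Content-Type header.
function resolveFilename(filename, contentType) {
  if (SUPPORTED_EXTENSIONS.has(extensionOf(filename))) return filename;
  // Drop parameters such as "; charset=utf-8" before the lookup.
  const mime = (contentType || "").split(";")[0].trim().toLowerCase();
  const candidates = ACCEPTED_MIMES[mime];
  // Assumption: append the first mapped extension; leave unknown types untouched.
  return candidates ? filename + candidates[0] : filename;
}
```

With this sketch, a file named `2307.10265` served as `application/pdf` would be renamed to `2307.10265.pdf`, while a file that already carries a supported extension, or whose Content-Type is not in the table, keeps its original name.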
__tests__/utils	fix(collector): infer file extension from Content-Type for URLs without explicit extensions (#5252)	2026-03-23 09:40:22 -07:00
extensions	chore: add ESLint to `/collector` (#5128)	2026-03-05 16:25:23 -08:00
hotdir	Document Processor v2 (#442)	2023-12-14 15:14:56 -08:00
middleware	chore: add ESLint to `/collector` (#5128)	2026-03-05 16:25:23 -08:00
processLink	fix(collector): infer file extension from Content-Type for URLs without explicit extensions (#5252)	2026-03-23 09:40:22 -07:00
processRawText	chore: add ESLint to `/collector` (#5128)	2026-03-05 16:25:23 -08:00
processSingleFile	Make XLSX spreadsheets visible in chat by combining sheets (#4847)	2026-01-13 15:46:16 -08:00
storage	feat: Embed on-instance Whisper model for audio/mp4 transcribing (#449)	2023-12-15 11:20:13 -08:00
utils	fix(collector): infer file extension from Content-Type for URLs without explicit extensions (#5252)	2026-03-23 09:40:22 -07:00
.env.example	Add HTTP request/response logging middleware for development mode (#4425)	2025-09-29 13:33:15 -07:00
.gitignore	Add HTTP request/response logging middleware for development mode (#4425)	2025-09-29 13:33:15 -07:00
.nvmrc	dev build with new `epub2` build target and remove patch work (#4694)	2025-11-26 17:36:34 -08:00
eslint.config.mjs	chore: add ESLint to `/collector` (#5128)	2026-03-05 16:25:23 -08:00
index.js	Add HTTP request/response logging middleware for development mode (#4425)	2025-09-29 13:33:15 -07:00
nodemon.json	Document Processor v2 (#442)	2023-12-14 15:14:56 -08:00
package.json	chore: add ESLint CI workflow (#5160)	2026-03-09 14:27:08 -07:00
yarn.lock	chore: add ESLint to `/collector` (#5128)	2026-03-05 16:25:23 -08:00