* fix(collector): infer file extension from Content-Type for URLs without explicit extensions
When downloading files from URLs like https://arxiv.org/pdf/2307.10265,
the path has no recognizable file extension. The downloaded file gets
saved without an extension (or with a nonsensical one like .10265),
causing processSingleFile to reject it with 'File extension .10265
not supported for parsing'.
Fix: after downloading, check if the filename has a supported file
extension. If not, inspect the response Content-Type header and map
it to the correct extension using the existing ACCEPTED_MIMES table.
For example, a response with Content-Type: application/pdf will cause
the file to be saved with a .pdf extension, allowing it to be processed
correctly.
Fixes #4513
* small refactor
---------
Co-authored-by: Timothy Carambat <rambat1010@gmail.com>
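A minimal sketch of the extension-inference step described in the first commit above. It is not the collector's actual code: the `MIME_TO_EXTENSIONS` map is a hypothetical stand-in for the existing ACCEPTED_MIMES table (whose real shape may differ), and `inferFilename` is an illustrative name.

```javascript
const path = require("node:path");

// Hypothetical stand-in for the ACCEPTED_MIMES table: MIME type -> extensions.
const MIME_TO_EXTENSIONS = new Map([
  ["application/pdf", [".pdf"]],
  ["text/plain", [".txt", ".md"]],
  ["text/html", [".html"]],
]);

// Every extension the (hypothetical) parser pipeline accepts.
const SUPPORTED_EXTENSIONS = new Set([...MIME_TO_EXTENSIONS.values()].flat());

// Returns the filename unchanged when it already has a supported extension;
// otherwise appends the first extension mapped from the Content-Type header.
function inferFilename(filename, contentType) {
  const ext = path.extname(filename).toLowerCase();
  if (SUPPORTED_EXTENSIONS.has(ext)) return filename;
  if (!contentType) return filename;

  // Drop header parameters such as "; charset=utf-8" before the lookup.
  const mime = contentType.split(";")[0].trim().toLowerCase();
  const candidates = MIME_TO_EXTENSIONS.get(mime);
  if (!candidates || candidates.length === 0) return filename;

  return `${filename}${candidates[0]}`;
}

// The arXiv case: path.extname sees ".10265", which is not supported,
// so the Content-Type decides the extension.
console.log(inferFilename("2307.10265", "application/pdf"));
```

Appending rather than replacing the unrecognized suffix keeps the original identifier (here `2307.10265`) intact in the saved filename; whether the real fix appends or replaces is an assumption of this sketch.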
* Adjust fix path to use ESM import
* normalize fix-path imports and usage across the app
* extract path fix logic to utils for server and collector
* add helpers
* repin strip-ansi in collector
* fix log for localWhisper
lint
* refactor localWhisper to use new custom FFMPEGWrapper class
* stub tests in GitHub Actions
* add back wavefile conversion to 16 kHz 32f to fix docker builds
* use afterEach for cleanup in ffmpeg tests
* remove unused FFMPEG_PATH env check
* use spawnSync for ffmpeg to capture and log output
* lint
* revert removal of try/catch around validateAudioFile for more helpful error msgs
* use readFileSync instead of createReadStream for less overhead
* change import to require for fix-path and stub import in tests
* refactor to singleton to preserve ffmpeg path
dev build
---------
Co-authored-by: Timothy Carambat <rambat1010@gmail.com>
* fix: remove unnecessary toLowerCase in URL validation
* test: enhance URL validation tests to preserve case sensitivity and format
* test: update URL validation tests to ensure domain normalization to lowercase while preserving path case
* small formatting
* fix filenames when downloading live URI
---------
Co-authored-by: timothycarambat <rambat1010@gmail.com>
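The URL validation behavior named in the commits above (lowercase domain, case-preserved path) can be sketched with Node's built-in WHATWG `URL` class, which lowercases the scheme and host of http(s) URLs during parsing and leaves the path and query untouched. `normalizeUrl` is a hypothetical helper name, not the function used in the codebase.

```javascript
// Hypothetical helper: returns a normalized href, or null for input that is
// not a valid http(s) URL. No manual toLowerCase() call is needed because
// the URL parser already lowercases the scheme and hostname.
function normalizeUrl(input) {
  let url;
  try {
    url = new URL(input);
  } catch {
    return null; // not parseable as an absolute URL
  }
  if (url.protocol !== "http:" && url.protocol !== "https:") return null;
  return url.href;
}

console.log(normalizeUrl("HTTPS://Example.COM/Some/Path?Q=Abc"));
// prints "https://example.com/Some/Path?Q=Abc"
```

Calling `toLowerCase()` on the whole string would also lowercase the path and query, which breaks case-sensitive resources such as video IDs or file names.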
* fix: YouTube transcript collector does not work well with non-English or non-ASR captions
* stub YT test in GitHub Actions
---------
Co-authored-by: Timothy Carambat <rambat1010@gmail.com>