merlyn/collector/utils
Marcello Fitton f7b90571be
Fetch, Parse, and Create Documents for Statically Hosted Files (#4398)
* Add capability to web scraping feature for document creation to download and parse statically hosted files

* lint

* Remove unneeded comment

* Simplify process by using the keys of ACCEPTED_MIMES to validate the response content type, which unlocks all supported file types

* Add TODO comments for future implementation of asDoc.js to handle standard MS Word files in constants.js

* Return captureAs argument to be exposed by scrapeGenericUrl and passed into getPageContent | Return explicit argument of captureAs into scrapeGenericUrl in processLink fn

* Return debug log for scrapeGenericUrl

* Change conditional to a guard clause.

* Add error handling, validation, and JSDOC to getContentType helper fn

* remove unneeded comments

* Simplify URL validation by reusing module

* Rename downloadFileToHotDir to downloadURIToFile and move it up to a global module | Add URL validation to downloadURIToFile

* refactor

* add support for webp
remove unused imports

---------

Co-authored-by: timothycarambat <rambat1010@gmail.com>
2025-10-01 15:49:05 -07:00
comKey [BETA] Live document sync (#1719) 2024-06-21 13:38:50 -07:00
downloadURIToFile Fetch, Parse, and Create Documents for Statically Hosted Files (#4398) 2025-10-01 15:49:05 -07:00
EncryptionWorker [BETA] Live document sync (#1719) 2024-06-21 13:38:50 -07:00
extensions fix: youtube transcript collector not working well with non-en or non-asr captions (#4442) 2025-09-29 13:22:50 -07:00
files add back normalization + docs link 2025-08-14 11:43:04 -07:00
http Feature/drupalwiki collector (#3693) 2025-04-21 09:17:24 -07:00
logger patch logger for full logs 2024-07-19 18:35:41 -07:00
OCRLoader feat: Add multilingual support for ocr module (#3325) 2025-02-27 12:31:17 -08:00
runtimeSettings Fetch, Parse, and Create Documents for Statically Hosted Files (#4398) 2025-10-01 15:49:05 -07:00
tokenizer Add tokenizer improvements via Singleton class and estimation (#3072) 2025-01-30 17:55:03 -08:00
url Fetch, Parse, and Create Documents for Statically Hosted Files (#4398) 2025-10-01 15:49:05 -07:00
WhisperProviders Prevent collector crash when blocked by CDN (#3373) 2025-02-28 10:27:05 -08:00
constants.js Fetch, Parse, and Create Documents for Statically Hosted Files (#4398) 2025-10-01 15:49:05 -07:00