merlyn/collector
Marcello Fitton f7b90571be
Fetch, Parse, and Create Documents for Statically Hosted Files (#4398)
* Add capability to web scraping feature for document creation to download and parse statically hosted files

* lint

* Remove unneeded comment

* Simplify the process by using the keys of ACCEPTED_MIMES to validate the response content type, which unlocks all supported file types

* Add TODO comments for future implementation of asDoc.js to handle standard MS Word files in constants.js

* Return captureAs argument to be exposed by scrapeGenericUrl and passed into getPageContent | Return explicit argument of captureAs into scrapeGenericUrl in processLink fn

* Return debug log for scrapeGenericUrl

* Change conditional to a guard clause.

* Add error handling, validation, and JSDoc to getContentType helper fn

* remove unneeded comments

* Simplify URL validation by reusing module

* Rename downloadFileToHotDir to downloadURIToFile and move it up to a global module | Add URL validation to downloadURIToFile

* refactor

* add support for webp

* remove unused imports

---------

Co-authored-by: timothycarambat <rambat1010@gmail.com>
2025-10-01 15:49:05 -07:00
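
The commit message above mentions validating the response content type against the keys of ACCEPTED_MIMES and adding URL validation before download. A minimal sketch of that idea follows, under stated assumptions: the shape and contents of `ACCEPTED_MIMES` here are a hypothetical subset, and `parseMimeType`, `isAcceptedContentType`, and `isValidHttpUrl` are illustrative names, not the collector's actual helpers.

```javascript
// Hypothetical subset of a MIME type -> file extensions map.
// The real ACCEPTED_MIMES in the collector may differ in shape and contents.
const ACCEPTED_MIMES = {
  "text/plain": [".txt", ".md"],
  "application/pdf": [".pdf"],
  "image/webp": [".webp"],
};

/**
 * Reduce a Content-Type header value to its bare, lowercased MIME type.
 * @param {unknown} headerValue - e.g. "application/pdf; charset=utf-8"
 * @returns {string|null} the MIME type, or null if the header is unusable
 */
function parseMimeType(headerValue) {
  if (typeof headerValue !== "string" || headerValue.trim() === "") return null;
  // Drop parameters such as "; charset=utf-8".
  return headerValue.split(";")[0].trim().toLowerCase();
}

/**
 * Check whether a Content-Type header names a MIME type that is a key of
 * ACCEPTED_MIMES.
 * @param {unknown} headerValue
 * @returns {boolean}
 */
function isAcceptedContentType(headerValue) {
  const mime = parseMimeType(headerValue);
  if (!mime) return false; // guard clause: nothing to validate
  return Object.prototype.hasOwnProperty.call(ACCEPTED_MIMES, mime);
}

/**
 * Accept only well-formed http(s) URLs before attempting a download.
 * @param {string} candidate
 * @returns {boolean}
 */
function isValidHttpUrl(candidate) {
  try {
    const url = new URL(candidate);
    return url.protocol === "http:" || url.protocol === "https:";
  } catch {
    return false; // new URL() throws on malformed input
  }
}
```

Keying the check on the map itself means any MIME type added to the map is accepted without touching the validation code, which is presumably what "unlocked all supported files" refers to.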
__tests__/utils/extensions/YoutubeTranscript/YoutubeLoader fix: youtube transcript collector not work well with non en or non asr caption (#4442) 2025-09-29 13:22:50 -07:00
extensions Obsidian data connector (#3798) 2025-05-12 13:45:27 -07:00
hotdir Document Processor v2 (#442) 2023-12-14 15:14:56 -08:00
middleware Add HTTP request/response logging middleware for development mode (#4425) 2025-09-29 13:33:15 -07:00
processLink Fetch, Parse, and Create Documents for Statically Hosted Files (#4398) 2025-10-01 15:49:05 -07:00
processRawText Update writeToServerDocuments to take config object (#4213) 2025-07-29 17:53:05 -07:00
processSingleFile Fetch, Parse, and Create Documents for Statically Hosted Files (#4398) 2025-10-01 15:49:05 -07:00
storage feat: Embed on-instance Whisper model for audio/mp4 transcribing (#449) 2023-12-15 11:20:13 -08:00
utils Fetch, Parse, and Create Documents for Statically Hosted Files (#4398) 2025-10-01 15:49:05 -07:00
.env.example Add HTTP request/response logging middleware for development mode (#4425) 2025-09-29 13:33:15 -07:00
.gitignore Add HTTP request/response logging middleware for development mode (#4425) 2025-09-29 13:33:15 -07:00
.nvmrc Document Processor v2 (#442) 2023-12-14 15:14:56 -08:00
index.js Add HTTP request/response logging middleware for development mode (#4425) 2025-09-29 13:33:15 -07:00
nodemon.json Document Processor v2 (#442) 2023-12-14 15:14:56 -08:00
package.json forgot 1.8.5 tag :) 2025-08-14 17:43:55 -07:00
yarn.lock Enable workflow rule for package verification (#3778) 2025-05-07 12:51:14 -07:00