Commit Graph

8 Commits

Author SHA1 Message Date
Marcello Fitton
f7b90571be
Fetch, Parse, and Create Documents for Statically Hosted Files (#4398)
* Add capability to web scraping feature for document creation to download and parse statically hosted files

* lint

* Remove unneeded comment

* Simplify the process by using the keys of ACCEPTED_MIMES to validate the response content type, which unlocks all supported file types

* Add TODO comments for future implementation of asDoc.js to handle standard MS Word files in constants.js

* Return captureAs argument to be exposed by scrapeGenericUrl and passed into getPageContent | Return explicit argument of captureAs into scrapeGenericUrl in processLink fn

* Return debug log for scrapeGenericUrl

* Change conditional to a guard clause.

* Add error handling, validation, and JSDoc to getContentType helper fn

* remove unneeded comments

* Simplify URL validation by reusing module

* Rename downloadFileToHotDir to downloadURIToFile and move it up to a global module | Add URL validation to downloadURIToFile

* refactor

* add support for webp
remove unused imports

---------

Co-authored-by: timothycarambat <rambat1010@gmail.com>
2025-10-01 15:49:05 -07:00
Timothy Carambat
1601eb986c
Enable bypass of ip limitations via ENV in collector processing (#3652)
* Enable bypass of ip limitations via ENV in collector startup
resolves #3625
connect #3626

* dev build

* bump dockerx build action

* enable runtime setting config of collector requests

* comments and linting for option passing

* unset

* unset

* update docs link

* linting and docs
2025-04-21 11:10:41 -07:00
Sean Hatfield
0bb47619dc
Allow 127.0.0.1 as valid URL for scraping (#2560)
* allow 127.0.0.1 as valid url for scraping

* update comments and lint

---------

Co-authored-by: timothycarambat <rambat1010@gmail.com>
2024-10-31 09:57:28 -07:00
timothycarambat
619f6b3884
Ignore SSL errors for web scraper
resolves #2114
2024-08-14 09:11:22 -07:00
timothycarambat
b541623c9e
add SSRF notice
2024-08-13 17:46:07 -07:00
Timothy Carambat
0db6c3b2aa
Prevent private octets from link collection for self-hosted (#626)
2024-01-19 10:49:40 -08:00
Timothy Carambat
1563a1b20f
Strict link protocol validation (#577)
2024-01-11 12:29:00 -08:00
Timothy Carambat
719521c307
Document Processor v2 (#442)
* wip: init refactor of document processor to JS

* add NodeJs PDF support

* wip: parity with python processor
feat: add pptx support

* fix: forgot files

* Remove python scripts totally

* wip:update docker to boot new collector

* add package.json support

* update dockerfile for new build

* update gitignore and linting

* add more protections on file lookup

* update package.json

* test build

* update docker commands to use cap-add=SYS_ADMIN so web scraper can run
update all scripts to reflect this
remove docker build for branch
2023-12-14 15:14:56 -08:00