Commit Graph

18 Commits

Author SHA1 Message Date
Timothy Carambat
cf3fbcbf0f
Improve URL handler for collector processes (#4504)
* Improve URL handler for collector processes

* dev build
2025-10-07 11:03:27 -07:00
timothycarambat
bdfa0328db update comment about parseOnly 2025-10-01 20:45:52 -07:00
Marcello Fitton
f7b90571be
Fetch, Parse, and Create Documents for Statically Hosted Files (#4398)
* Add capability to web scraping feature for document creation to download and parse statically hosted files

* lint

* Remove unneeded comment

* Simplified process by using key of ACCEPTED_MIMES to validate the response content type, as a result unlocked all supported files

* Add TODO comments for future implementation of asDoc.js to handle standard MS Word files in constants.js

* Return captureAs argument to be exposed by scrapeGenericUrl and passed into getPageContent | Return explicit argument of captureAs into scrapeGenericUrl in processLink fn

* Return debug log for scrapeGenericUrl

* Change conditional to a guard clause.

* Add error handling, validation, and JSDOC to getContentType helper fn

* remove unneeded comments

* Simplify URL validation by reusing module

* Rename downloadFileToHotDir to downloadURIToFile and moved up to a global module | Add URL valuidation to downloadURIToFile

* refactor

* add support for webp
remove unused imports

---------

Co-authored-by: timothycarambat <rambat1010@gmail.com>
2025-10-01 15:49:05 -07:00
Timothy Carambat
95557ee16f
Allow user to specify args for chromium process so they dont need SYS_ADMIN on container. (#4397)
* allow user to specify args for chromium process so they dont need SYS_ADMIN perms

* use arg flag content

* update console outputs
2025-09-17 16:31:08 -07:00
Jonas Stawski
b8d4cc3454
Added metadata parameter to document/upload, document/upload/{folderName}, and document/upload-link (#4342)
* Added the ability to pass in metadata to the /document/upload/{folderName} endpoint

* Added the ability to pass in metadata to the /document/upload-link endpoint

* feat: added metadata to document/upload api endpoint

* simplify optional metadata in document dev api endpoints

* lint

* patch handling of metadata in dev api

* Linting, small comments

---------

Co-authored-by: jstawskigmi <jstawski@getmyinterns.org>
Co-authored-by: shatfield4 <seanhatfield5@gmail.com>
Co-authored-by: Timothy Carambat <rambat1010@gmail.com>
2025-09-17 11:17:29 -07:00
Timothy Carambat
70a07b743b
Update writeToServerDocuments to take config object (#4213) 2025-07-29 17:53:05 -07:00
Sean Hatfield
610bdd4673
Allow custom headers in upload-link endpoint (#3695)
* allow custom headers in upload-link endpoint

* override loader.scrape to allow for passing of headers in langchain puppeteer

* lint

* Rename some variables
move positional args to named args
update documentation to reflect arg changes and funciton sigs
validate header object before attempting to end to forward to request

* update header validation for custom headers

---------

Co-authored-by: timothycarambat <rambat1010@gmail.com>
2025-04-22 12:47:12 -07:00
Timothy Carambat
b6d3a411b1
Add querySelectorAll capability to web-scraping block (#3186)
* Add `querySelectorAll` capability to web-scraping block

* patches and fallbacks

* fix styles of text in web scraping block

---------

Co-authored-by: shatfield4 <seanhatfield5@gmail.com>
2025-02-13 16:11:15 -08:00
Timothy Carambat
d1ca16f7f8
Add tokenizer improvments via Singleton class and estimation (#3072)
* Add tokenizer improvments via Singleton class
linting

* dev build

* Estimation fallback when string exceeds a fixed byte size

* Add notice to tiktoken on backend
2025-01-30 17:55:03 -08:00
Sean Hatfield
9bc01afa7d
Fix scraping failed bug in link/bulk link scrapers (#2807)
* fix scraping failed bug in link/bulk link scrapers

* reset submodule

* swap to networkidle2 as a safe mix for SPA and API-loaded pages, but also not hang on request heavy pages

* lint

---------

Co-authored-by: timothycarambat <rambat1010@gmail.com>
2024-12-11 14:01:52 -08:00
timothycarambat
ab6f03ce1c linting 2024-10-18 11:44:14 -07:00
Sean Hatfield
41522cdfb4
Handle non-ascii characters in single and bulk link scraper URLs (#2495)
handle non-ascii characters in urls
2024-10-17 17:04:00 -07:00
timothycarambat
619f6b3884 Ignore SSL errors for web scraper
resolves #2114
2024-08-14 09:11:22 -07:00
Timothy Carambat
a5bb77f97a
Agent support for @agent default agent inside workspace chat (#1093)
V1 of agent support via built-in `@agent` that can be invoked alongside normal workspace RAG chat.
2024-04-16 10:50:10 -07:00
Timothy Carambat
d52f8aafd4
689 links in citation (#715)
* Include links in citations
force ChunkSource key to retain this information
old links will be unsupported

* show special icons depending on source

* remove console log

* reset server documents writeTo
2024-02-13 14:11:57 -08:00
Timothy Carambat
b35feede87
570 document api return object (#608)
* Add support for fetching single document in documents folder

* Add document object to upload + support link scraping via API

* hotfixes for documentation

* update api docs
2024-01-16 16:04:22 -08:00
timothycarambat
daadad3859 hoist var in extensions 2023-12-20 19:41:16 -08:00
Timothy Carambat
719521c307
Document Processor v2 (#442)
* wip: init refactor of document processor to JS

* add NodeJs PDF support

* wip: partity with python processor
feat: add pptx support

* fix: forgot files

* Remove python scripts totally

* wip:update docker to boot new collector

* add package.json support

* update dockerfile for new build

* update gitignore and linting

* add more protections on file lookup

* update package.json

* test build

* update docker commands to use cap-add=SYS_ADMIN so web scraper can run
update all scripts to reflect this
remove docker build for branch
2023-12-14 15:14:56 -08:00