* Added the ability to pass in metadata to the /document/upload/{folderName} endpoint
* Added the ability to pass in metadata to the /document/upload-link endpoint
* feat: added metadata to document/upload api endpoint
* simplify optional metadata in document dev api endpoints
* lint
* patch handling of metadata in dev api
* Linting, small comments
---------
Co-authored-by: jstawskigmi <jstawski@getmyinterns.org>
Co-authored-by: shatfield4 <seanhatfield5@gmail.com>
Co-authored-by: Timothy Carambat <rambat1010@gmail.com>
* Create parse endpoint in collector (#4212)
* create parse endpoint in collector
* revert cleanup temp util call
* lint
* remove unused cleanupTempDocuments function
* revert slug change
minor change for destinations
---------
Co-authored-by: timothycarambat <rambat1010@gmail.com>
* Add parsed files table and parse server endpoints (#4222)
* add workspace_parsed_files table + parse endpoints/models
* remove dev api parse endpoint
* remove unneeded imports
* iterate over all files + remove unneeded update function + update telemetry debounce
* Upload UI/UX context window check + frontend alert (#4230)
* prompt user to embed if exceeds prompt window + handle embed + handle cancel
* add tokenCountEstimate to workspace_parsed_files + optimizations
* use util for path locations + use safeJsonParse
* add modal for user decision on overflow of context window
* lint
* dynamic fetching of provider/model combo + inject parsed documents
* remove unneeded comments
* popup ui for attaching/removing files + warning to embed + wip fetching states on update
* remove prop drilling, fetch files/limits directly in attach files popup
* rework ux of FE + BE optimizations
* fix ux of FE + BE optimizations
* Implement bidirectional sync for parsed file states
linting
small changes and comments
* move parse support to another endpoint file
simplify calls and loading of records
* button borders
* enable default users to upload parsed files but NOT embed
* delete cascade on user/workspace/thread deletion to remove parsedFileRecord
* enable bgworker with "always" jobs and optional document sync jobs
orphan document job: Will find any broken reference files to prevent overpollution of the storage folder. This will run 10s after boot and every 12hr after
* change run timeout for orphan job to 1m to allow settling before spawning a worker
* linting and cleanup pr
---------
Co-authored-by: Timothy Carambat <rambat1010@gmail.com>
* dev build
* fix tooltip hiding during embedding overflow files
* prevent crash log from ERRNO on parse files
* unused import
* update docs link
* Migrate parsed-files to GET endpoint
patch logic for grabbing models names from utils
better handling for undetermined context windows (null instead of Pos_INIFI)
UI placeholder for null context windows
* patch URL
---------
Co-authored-by: Sean Hatfield <seanhatfield5@gmail.com>
* feat: add support for custom table formatting in htmlToText conversion
* fix tables
* feat: improve plain text table formatting for AI readability
* fix options
* improve drupal wiki connector
* final fix
* adjust leading slash to match code
* linting
---------
Co-authored-by: timothycarambat <rambat1010@gmail.com>
* allow custom headers in upload-link endpoint
* override loader.scrape to allow for passing of headers in langchain puppeteer
* lint
* Rename some variables
move positional args to named args
update documentation to reflect arg changes and funciton sigs
validate header object before attempting to end to forward to request
* update header validation for custom headers
---------
Co-authored-by: timothycarambat <rambat1010@gmail.com>
* Enable bypass of ip limitations via ENV in collector startup
resolves#3625
connect #3626
* dev build
* bump dockerx build action
* enable runtime setting config of collector requests
* comments and linting for option passing
* unset
* unset
* update docs link
* linting and docs
* Add multilingual support for ocr mudule
* Add OCR langauge as server var that is passed into Collector
Support all valid tesseract language codes
Filter and parse only valid codes with fallbacks'
* persist TARGET_OCR_LANG
* update docker example env
---------
Co-authored-by: Timothy Carambat <rambat1010@gmail.com>
* Windows development environment variables support
* moved cross-env to dev dependencies
---------
Co-authored-by: Timothy Carambat <rambat1010@gmail.com>
* feat(dataconnectors): support confluence personal access token
* fix: change select option
* linting
change name on accesstype field
---------
Co-authored-by: timothycarambat <rambat1010@gmail.com>
* Add `querySelectorAll` capability to web-scraping block
* patches and fallbacks
* fix styles of text in web scraping block
---------
Co-authored-by: shatfield4 <seanhatfield5@gmail.com>
* chore: rename Github to GitHub
Signed-off-by: Adam Setch <adam.setch@outlook.com>
* chore: rename Github to GitHub
Signed-off-by: Adam Setch <adam.setch@outlook.com>
* Undo some code changes for references
---------
Signed-off-by: Adam Setch <adam.setch@outlook.com>
Co-authored-by: timothycarambat <rambat1010@gmail.com>
* Add tokenizer improvments via Singleton class
linting
* dev build
* Estimation fallback when string exceeds a fixed byte size
* Add notice to tiktoken on backend
* fix scraping failed bug in link/bulk link scrapers
* reset submodule
* swap to networkidle2 as a safe mix for SPA and API-loaded pages, but also not hang on request heavy pages
* lint
---------
Co-authored-by: timothycarambat <rambat1010@gmail.com>
* fix tree/blob github urls from branches not being loaded
* improve ux of github data connector
* lint
* patch Github URL parser to just validate with `URL` native parser
* uncheck LocalStorage of PAT for security reasons
---------
Co-authored-by: Timothy Carambat <rambat1010@gmail.com>