* Added the ability to pass in metadata to the /document/upload/{folderName} endpoint
* Added the ability to pass in metadata to the /document/upload-link endpoint
* feat: added metadata to document/upload api endpoint
* simplify optional metadata in document dev api endpoints
* lint
* patch handling of metadata in dev api
* Linting, small comments
---------
Co-authored-by: jstawskigmi <jstawski@getmyinterns.org>
Co-authored-by: shatfield4 <seanhatfield5@gmail.com>
Co-authored-by: Timothy Carambat <rambat1010@gmail.com>
* Create parse endpoint in collector (#4212)
* create parse endpoint in collector
* revert cleanup temp util call
* lint
* remove unused cleanupTempDocuments function
* revert slug change
minor change for destinations
---------
Co-authored-by: timothycarambat <rambat1010@gmail.com>
* Add parsed files table and parse server endpoints (#4222)
* add workspace_parsed_files table + parse endpoints/models
* remove dev api parse endpoint
* remove unneeded imports
* iterate over all files + remove unneeded update function + update telemetry debounce
* Upload UI/UX context window check + frontend alert (#4230)
* prompt user to embed if exceeds prompt window + handle embed + handle cancel
* add tokenCountEstimate to workspace_parsed_files + optimizations
* use util for path locations + use safeJsonParse
* add modal for user decision on overflow of context window
* lint
* dynamic fetching of provider/model combo + inject parsed documents
* remove unneeded comments
* popup ui for attaching/removing files + warning to embed + wip fetching states on update
* remove prop drilling, fetch files/limits directly in attach files popup
* rework ux of FE + BE optimizations
* fix ux of FE + BE optimizations
* Implement bidirectional sync for parsed file states
linting
small changes and comments
* move parse support to another endpoint file
simplify calls and loading of records
* button borders
* enable default users to upload parsed files but NOT embed
* delete cascade on user/workspace/thread deletion to remove parsedFileRecord
* enable bgworker with "always" jobs and optional document sync jobs
orphan document job: finds any files with broken references to prevent overpollution of the storage folder. Runs 10s after boot and every 12hr after
* change run timeout for orphan job to 1m to allow settling before spawning a worker
* linting and cleanup pr
---------
Co-authored-by: Timothy Carambat <rambat1010@gmail.com>
* dev build
* fix tooltip hiding during embedding overflow files
* prevent crash log from ERRNO on parse files
* unused import
* update docs link
* Migrate parsed-files to GET endpoint
patch logic for grabbing models names from utils
better handling for undetermined context windows (null instead of Number.POSITIVE_INFINITY)
UI placeholder for null context windows
* patch URL
---------
Co-authored-by: Sean Hatfield <seanhatfield5@gmail.com>
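The overflow check described in the commits above (token estimates on parsed files, `null` for an undetermined context window) could be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the function name `contextWindowStatus` and the returned status strings are hypothetical, and only the `tokenCountEstimate` field name comes from the commit messages.

```javascript
// Hypothetical helper: compares the summed token estimates of parsed files
// against a context window. A null window means the limit could not be determined.
function contextWindowStatus(parsedFiles, contextWindow) {
  if (contextWindow === null) return "unknown";
  const total = parsedFiles.reduce(
    (sum, file) => sum + (file.tokenCountEstimate || 0),
    0
  );
  return total > contextWindow ? "overflow" : "fits";
}
```

A caller would presumably prompt the user to embed the files on `"overflow"` and show a placeholder in the UI on `"unknown"`.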
* feat: add support for custom table formatting in htmlToText conversion
* fix tables
* feat: improve plain text table formatting for AI readability
* fix options
* improve drupal wiki connector
* final fix
* adjust leading slash to match code
* linting
---------
Co-authored-by: timothycarambat <rambat1010@gmail.com>
* allow custom headers in upload-link endpoint
* override loader.scrape to allow for passing of headers in langchain puppeteer
* lint
* Rename some variables
move positional args to named args
update documentation to reflect arg changes and function signatures
validate header object before attempting to forward it to the request
* update header validation for custom headers
---------
Co-authored-by: timothycarambat <rambat1010@gmail.com>
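As a rough sketch of what validating a custom header object before forwarding it might look like: the helper name `validatedHeaders` and the rule of keeping only non-empty keys with string values are assumptions for illustration, not the checks the collector actually performs.

```javascript
// Hypothetical helper: returns a new object containing only headers whose
// name is a non-empty string and whose value is a string.
function validatedHeaders(headers) {
  const isPlainObject =
    headers !== null && typeof headers === "object" && !Array.isArray(headers);
  if (!isPlainObject) return {};

  const valid = {};
  for (const [name, value] of Object.entries(headers)) {
    if (name.trim() === "") continue; // skip empty header names
    if (typeof value !== "string") continue; // skip non-string values
    valid[name] = value;
  }
  return valid;
}
```

Returning an empty object for bad input, rather than throwing, lets the request proceed without custom headers.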
* Enable bypass of ip limitations via ENV in collector startup
resolves #3625
connect #3626
* dev build
* bump dockerx build action
* enable runtime setting config of collector requests
* comments and linting for option passing
* unset
* unset
* update docs link
* linting and docs
* Add multilingual support for OCR module
* Add OCR language as server var that is passed into Collector
Support all valid tesseract language codes
Filter and parse only valid codes with fallbacks
* persist TARGET_OCR_LANG
* update docker example env
---------
Co-authored-by: Timothy Carambat <rambat1010@gmail.com>
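Filtering a language setting down to valid codes with a fallback could look something like the sketch below. The comma-separated input format, the `parseOcrLangs` name, the `eng` default, and the short code list are all assumptions made for illustration; Tesseract supports many more language codes than the subset shown.

```javascript
// Illustrative subset only; the real list of Tesseract codes is much longer.
const KNOWN_OCR_LANGS = new Set([
  "eng", "deu", "fra", "spa", "ita", "por", "rus", "jpn", "kor", "chi_sim",
]);
const DEFAULT_OCR_LANGS = ["eng"]; // assumed fallback

// Hypothetical parser: keeps known codes, drops duplicates and unknown
// entries, and falls back to the default when nothing valid remains.
function parseOcrLangs(raw) {
  if (typeof raw !== "string") return [...DEFAULT_OCR_LANGS];
  const codes = raw
    .split(",")
    .map((code) => code.trim().toLowerCase())
    .filter((code) => KNOWN_OCR_LANGS.has(code));
  const unique = [...new Set(codes)];
  return unique.length > 0 ? unique : [...DEFAULT_OCR_LANGS];
}
```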
* Windows development environment variables support
* moved cross-env to dev dependencies
---------
Co-authored-by: Timothy Carambat <rambat1010@gmail.com>
* feat(dataconnectors): support confluence personal access token
* fix: change select option
* linting
change name on accesstype field
---------
Co-authored-by: timothycarambat <rambat1010@gmail.com>
* Add `querySelectorAll` capability to web-scraping block
* patches and fallbacks
* fix styles of text in web scraping block
---------
Co-authored-by: shatfield4 <seanhatfield5@gmail.com>
* chore: rename Github to GitHub
Signed-off-by: Adam Setch <adam.setch@outlook.com>
* chore: rename Github to GitHub
Signed-off-by: Adam Setch <adam.setch@outlook.com>
* Undo some code changes for references
---------
Signed-off-by: Adam Setch <adam.setch@outlook.com>
Co-authored-by: timothycarambat <rambat1010@gmail.com>
* Add tokenizer improvements via Singleton class
linting
* dev build
* Estimation fallback when string exceeds a fixed byte size
* Add notice to tiktoken on backend
* fix scraping failed bug in link/bulk link scrapers
* reset submodule
* swap to networkidle2 as a safe mix for SPA and API-loaded pages that also does not hang on request-heavy pages
* lint
---------
Co-authored-by: timothycarambat <rambat1010@gmail.com>
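The estimation fallback for oversized strings mentioned above might be sketched like this. The byte limit, the four-characters-per-token heuristic, and the `countTokens` name are assumptions for illustration only; `tokenize` stands in for whatever tokenizer is actually used.

```javascript
const DEFAULT_MAX_BYTES = 1000000; // hypothetical limit
const CHARS_PER_TOKEN = 4; // rough heuristic, not an exact ratio

// Hypothetical helper: tokenizes normally for small inputs, but switches to
// a cheap length-based estimate once the UTF-8 size passes the limit.
function countTokens(text, tokenize, maxBytes = DEFAULT_MAX_BYTES) {
  if (Buffer.byteLength(text, "utf8") > maxBytes) {
    return Math.ceil(text.length / CHARS_PER_TOKEN);
  }
  return tokenize(text).length;
}
```

The trade-off is accuracy for bounded work: very large inputs get an approximate count instead of a full tokenizer pass.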
* fix tree/blob GitHub URLs from branches not being loaded
* improve ux of GitHub data connector
* lint
* patch GitHub URL parser to just validate with `URL` native parser
* uncheck LocalStorage of PAT for security reasons
---------
Co-authored-by: Timothy Carambat <rambat1010@gmail.com>
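Validating a repository URL with the native `URL` parser, including tree/blob URLs, could be sketched along these lines. The `parseGitHubUrl` name and the returned shape are hypothetical, and the sketch assumes the branch is the single path segment after `tree` or `blob`, which is not true for branch names that contain slashes.

```javascript
// Hypothetical parser: returns { owner, repo, branch } or null when the
// input is not a usable https://github.com/<owner>/<repo> URL.
function parseGitHubUrl(input) {
  let url;
  try {
    url = new URL(input); // throws a TypeError on malformed input
  } catch {
    return null;
  }
  if (url.protocol !== "https:" || url.hostname !== "github.com") return null;

  const [owner, repo, kind, ...rest] = url.pathname.split("/").filter(Boolean);
  if (!owner || !repo) return null;

  const hasRef = (kind === "tree" || kind === "blob") && rest.length > 0;
  return {
    owner,
    repo: repo.replace(/\.git$/, ""),
    branch: hasRef ? rest[0] : null,
  };
}
```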
* Updated the `GitHubRepoLoader` class to use the new import syntax and adjust the `recursiveLoader` method accordingly.
* add @langchain/community to collector package.json
* fix: Improve handling of complex ignore patterns in GitLabRepoLoader
* refactor: use ignore package for simplified ignore logic
* run yarn lint
* add @langchain/community@^0.2.23
* remove unused dep
lint
---------
Co-authored-by: Emil Rofors (aider) <emirof@gmail.com>
* Added an option to fetch issues from GitLab. Made the file fetching asynchronous to improve performance. #2334
* Fixed a typo in loadGitlabRepo.
* Convert issues to markdown.
* Fixed an issue with time estimate field names in issueToMarkdown.
* handle rate limits more gracefully + update checkbox to toggle switch
* lint
---------
Co-authored-by: Timothy Carambat <rambat1010@gmail.com>
Co-authored-by: shatfield4 <seanhatfield5@gmail.com>
* support more confluence url formats
* use pattern matching for confluence urls and manual splitting as fallback
* rework entire Confluence flow to prevent issues with custom, local, and cloud spaces
* remove dep
---------
Co-authored-by: Timothy Carambat <rambat1010@gmail.com>
* Add support for GitLab repo collection as well as GitHub repo collection
* Refactor for repo collectors to be more compact
---------
Co-authored-by: Emil Rofors <emirof@gmail.com>
* implement custom PDFLoader to remove LC dep
* remove unneeded comment
* remove pdfjs as dep and fix page splitting using pdf-parse
* linting + export rename for desktop compat
---------
Co-authored-by: timothycarambat <rambat1010@gmail.com>
* WIP replace langchain pdfloader with pdfjs and add more context to each page
* remove extras from pdfjs and just replace langchain library
* remove unneeded dep
* fix console log in docs
---------
Co-authored-by: timothycarambat <rambat1010@gmail.com>
* wip bg workers for live document sync
* Add ability to re-embed specific documents across many workspaces via background queue
bgworker is gated behind an experimental system setting flag that needs to be explicitly enabled
UI for watching/unwatching documents that are embedded.
TODO: UI to easily manage all bg tasks and see run results
TODO: UI to enable this feature and background endpoints to manage it
* create frontend views and paths
Move elements to correct experimental scope
* update migration to delete runs on removal of watched document
* Add watch support to YouTube transcripts (#1716)
* Add watch support to YouTube transcripts
refactor how sync is done for supported types
* Watch specific files in Confluence space (#1718)
Add failure-prune check for runs
* create tmp workflow modifications for beta image
* dual build
update copy of alert modals
* update job interval
* Add support for live-sync of GitHub files
* update copy for document sync feature
* hide Experimental features from UI
* update docs links
* [FEAT] Implement new settings menu for experimental features (#1735)
* implement new settings menu for experimental features
* remove unused context save bar
---------
Co-authored-by: timothycarambat <rambat1010@gmail.com>
* don't run job on boot
* unset workflow changes
* Add persistent encryption service
Relay key to collector so persistent encryption can be used
Encrypt any private data in chunkSources used for replay during resync jobs
* update JSDoc
* Linting and organization
* update modal copy for feature
---------
Co-authored-by: Sean Hatfield <seanhatfield5@gmail.com>
* chore: confluence data connector can now handle custom urls, in addition to default {subdomain}.atlassian.net ones
* chore: formatting as per yarn lint
* chore: fixing the human readable confluence url fetch baseUrl
* refactor implementation of various types of Confluence URL patterns
---------
Co-authored-by: Predrag Stojadinovic <predrag@stojadinovic.net>
Co-authored-by: Predrag Stojadinović <cope@users.noreply.github.com>
Co-authored-by: Predrag Stojadinovic <predrags@nvidia.com>
* Updated apt-packages source for devcontainer
Switched the devcontainer's package source to a different repository to
align with updated dependencies and package availability. The previous
source from 'rocker-org' is replaced with 'devcontainers-contrib', which
may offer more recent or relevant development tools.
* Centralize prettier ignores and refine config
Centralized all prettier ignore rules by removing individual
`.prettierignore` files in subprojects and updating the root
`.prettierignore` to include previously ignored patterns, ensuring
consistency across the workspace. Additionally, the prettier
configuration was refined by making the file pattern for `.config.js`
files consistent and adjusting quote styles for better readability. All
lint scripts across the project were updated to respect the centralized
ignore path, enhancing maintainability.
The consolidation simplifies the process of managing ignore rules as the
project scales, ensuring developers can focus on writing code without
worrying about divergent formatting standards. These changes also align
with introducing comprehensive linting across multiple environments to
keep the codebase clean and consistent.
This adjustment is a foundational step towards a more streamlined and
unified code base, making it easier for new contributors to adhere to
established coding standards and reducing the cognitive load associated
with managing multiple configuration files across the project.
* unset package json changes
---------
Co-authored-by: Francisco Bischoff <franzbischoff@gmail.com>
Co-authored-by: Francisco Bischoff <984592+franzbischoff@users.noreply.github.com>
* chore: confluence data connector can now handle custom urls, in addition to default {subdomain}.atlassian.net ones
* chore: formatting as per yarn lint
* chore: adding /display/ url matching to confluence data connector