121 Commits

Author SHA1 Message Date
Timothy Carambat
2dc625193e
4825 patch yt file collector api (#4904)
Patch YT links in API document collector
closes #4825
2026-01-26 14:36:21 -08:00
j0rDy
f52e2866ac
Update common.js (#4894)
* Update common.js

Added missing translations in Dutch.

* linting

---------

Co-authored-by: Timothy Carambat <rambat1010@gmail.com>
2026-01-23 17:12:17 -08:00
Timothy Carambat
4de5e30ac6
Merge commit from fork 2026-01-23 17:06:44 -08:00
Timothy Carambat
feb039ea70
Adjust fix path to use ESM import (#4867)
* Adjust fix path to use ESM import

* normalize fix-path imports and usage across the app

* extract path fix logic to utils for server and collector

* add helpers

* repin strip-ansi in collector

* fix log for localWhisper
lint
2026-01-15 16:13:21 -08:00
Timothy Carambat
092b1b45f8
Upgrade YT Scraper (#4820) 2026-01-02 15:41:22 -08:00
Sean Hatfield
6c1f8a38ce
Refactor localWhisper to use custom FFMPEGWrapper class (#4775)
* refactor localWhisper to use new custom FFMPEGWrapper class

* stub tests in github actions

* add back wavefile conversion to 16khz 32f to fix docker builds

* use afterEach for cleanup in ffmpeg tests

* remove unused FFMPEG_PATH env check

* use spawnSync for ffmpeg to capture and log output

* lint

* revert removal of try/catch around validateAudioFile for more helpful error msgs

* use readFileSync instead of createReadStream for less overhead

* change import to require for fix-path and stub import in tests

* refactor to singleton to preserve ffmpeg path
dev build

---------

Co-authored-by: Timothy Carambat <rambat1010@gmail.com>
2025-12-18 11:41:45 -08:00
Sean Hatfield
c76b0708c3
Fix pagination bug in paperless-ngx data connector (#4757)
* iterate over all pages in paperless-ngx data connector

* add error handling and data validation

* refactor to handle edge cases and null values

* catch edge case to prevent infinite loop

---------

Co-authored-by: Timothy Carambat <rambat1010@gmail.com>
2025-12-12 10:23:32 -08:00
timothycarambat
758db6b677 fix lint 2025-11-25 14:42:10 -08:00
Neha Prasad
3ecf218eea
feat: Add SSL certificate bypass support for self-hosted Confluence instances (#4219)
* Added bypassSSL parameter to constructor and implemented SSL bypass logic in fetchConfluenceData method

* Updated generateChunkSource function to include bypassSSL in the encrypted payload

* Updated the request body to include bypassSSL in the JSON payload sent to the backend

* Updated form submission to include bypassSSL parameter from the checkbox

* Added bypass_ssl: "Bypass SSL Certificate Validation" translation

* passed these parameters to fetchconfluencepage function for proper resync functionality

* allow ignore of SSL cert for Confluence

* add translations

---------

Co-authored-by: Timothy Carambat <rambat1010@gmail.com>
2025-11-25 14:32:10 -08:00
Sean Hatfield
05df4ac72b
Paperless ngx data connector (#4121)
* paperless ngx data connector

* wip resync paperless ngx

* fix generateChunkSource for resyncing paperless ngx

* lint

* Refactor Paperless-NGX connector
Fix issue with date rendering in tooltip + extended width
Move tooltip details to be column for more space

---------

Co-authored-by: Timothy Carambat <rambat1010@gmail.com>
2025-11-20 11:27:38 -08:00
Timothy Carambat
b3b261e15d
Fix loop logic for fetchNextPage use in GitLabLoader (#4662)
resolves #4626
closes #4627
2025-11-19 13:53:26 -08:00
Marcello Fitton
d3619689db
Refactor loadYouTubeTranscript() to include YouTube Video Metadata in Content When parseOnly is true (#4552)
* Enhance YouTube transcript loading to include video metadata in parsed content when parseOnly is true

* extract to function

---------

Co-authored-by: timothycarambat <rambat1010@gmail.com>
2025-10-15 15:42:00 -07:00
Timothy Carambat
5edc1bea42
Add ability to auto-handle YT video URLs in uploader & chat (#4547)
* Add ability to auto-handle YT video URLs in uploader & chat

* move YT validator to URL utils

* update comment
2025-10-15 12:18:57 -07:00
Marcello Fitton
d48c76919c
Fix: File pulling fails with uppercase URL characters (#4516)
* fix: remove unnecessary toLowerCase in URL validation

* test: enhance URL validation tests to preserve case sensitivity and format

* test: update URL validation tests to ensure domain normalization to lowercase while preserving path case

* small formatting

* fix filenames when downloading live URI

---------

Co-authored-by: timothycarambat <rambat1010@gmail.com>
2025-10-08 14:00:02 -07:00
Timothy Carambat
cf3fbcbf0f
Improve URL handler for collector processes (#4504)
* Improve URL handler for collector processes

* dev build
2025-10-07 11:03:27 -07:00
Marcello Fitton
f7b90571be
Fetch, Parse, and Create Documents for Statically Hosted Files (#4398)
* Add capability to web scraping feature for document creation to download and parse statically hosted files

* lint

* Remove unneeded comment

* Simplified process by using key of ACCEPTED_MIMES to validate the response content type, as a result unlocked all supported files

* Add TODO comments for future implementation of asDoc.js to handle standard MS Word files in constants.js

* Return captureAs argument to be exposed by scrapeGenericUrl and passed into getPageContent | Return explicit argument of captureAs into scrapeGenericUrl in processLink fn

* Return debug log for scrapeGenericUrl

* Change conditional to a guard clause.

* Add error handling, validation, and JSDOC to getContentType helper fn

* remove unneeded comments

* Simplify URL validation by reusing module

* Rename downloadFileToHotDir to downloadURIToFile and move it up to a global module | Add URL validation to downloadURIToFile

* refactor

* add support for webp
remove unused imports

---------

Co-authored-by: timothycarambat <rambat1010@gmail.com>
2025-10-01 15:49:05 -07:00
AoiYamada
8fc1f24d1b
fix: youtube transcript collector does not work well with non-en or non-asr captions (#4442)
* fix: youtube transcript collector does not work well with non-en or non-asr captions

* stub YT test in Github actions

---------

Co-authored-by: Timothy Carambat <rambat1010@gmail.com>
2025-09-29 13:22:50 -07:00
Timothy Carambat
95557ee16f
Allow user to specify args for chromium process so they don't need SYS_ADMIN on container. (#4397)
* allow user to specify args for chromium process so they don't need SYS_ADMIN perms

* use arg flag content

* update console outputs
2025-09-17 16:31:08 -07:00
timothycarambat
0200e647b8 add back normalization + docs link 2025-08-14 11:43:04 -07:00
Timothy Carambat
0fb33736da
Workspace Chat with documents overhaul (#4261)
* Create parse endpoint in collector (#4212)

* create parse endpoint in collector

* revert cleanup temp util call

* lint

* remove unused cleanupTempDocuments function

* revert slug change
minor change for destinations

---------

Co-authored-by: timothycarambat <rambat1010@gmail.com>

* Add parsed files table and parse server endpoints (#4222)

* add workspace_parsed_files table + parse endpoints/models

* remove dev api parse endpoint

* remove unneeded imports

* iterate over all files + remove unneeded update function + update telemetry debounce

* Upload UI/UX context window check + frontend alert (#4230)

* prompt user to embed if exceeds prompt window + handle embed + handle cancel

* add tokenCountEstimate to workspace_parsed_files + optimizations

* use util for path locations + use safeJsonParse

* add modal for user decision on overflow of context window

* lint

* dynamic fetching of provider/model combo + inject parsed documents

* remove unneeded comments

* popup ui for attaching/removing files + warning to embed + wip fetching states on update

* remove prop drilling, fetch files/limits directly in attach files popup

* rework ux of FE + BE optimizations

* fix ux of FE + BE optimizations

* Implement bidirectional sync for parsed file states
linting
small changes and comments

* move parse support to another endpoint file
simplify calls and loading of records

* button borders

* enable default users to upload parsed files but NOT embed

* delete cascade on user/workspace/thread deletion to remove parsedFileRecord

* enable bgworker with "always" jobs and optional document sync jobs
orphan document job: Will find any broken reference files to prevent overpollution of the storage folder. This will run 10s after boot and every 12hr after

* change run timeout for orphan job to 1m to allow settling before spawning a worker

* linting and cleanup pr

---------

Co-authored-by: Timothy Carambat <rambat1010@gmail.com>

* dev build

* fix tooltip hiding during embedding overflow files

* prevent crash log from ERRNO on parse files

* unused import

* update docs link

* Migrate parsed-files to GET endpoint
patch logic for grabbing models names from utils
better handling for undetermined context windows (null instead of Pos_INIFI)
UI placeholder for null context windows

* patch URL

---------

Co-authored-by: Sean Hatfield <seanhatfield5@gmail.com>
2025-08-11 09:26:19 -07:00
Timothy Carambat
70a07b743b
Update writeToServerDocuments to take config object (#4213) 2025-07-29 17:53:05 -07:00
timothycarambat
7692775942 minor change to XLSX parse and upload output folder 2025-07-29 17:44:47 -07:00
timothycarambat
ff34c8cefc use documentsFolder path for simplification 2025-07-16 11:14:18 -07:00
Sean Hatfield
5485c58b44
Sanitize youtube transcription file paths (#4148)
sanitize youtube transcription file paths
2025-07-14 13:53:34 -07:00
rexjohannes
14fa079953
Fix/drupal wiki (improve table & url handling) (#4097)
* feat: add support for custom table formatting in htmlToText conversion

* fix tables

* feat: improve plain text table formatting for AI readability

* fix options

* improve drupal wiki connector

* final fix

* adjust leading slash to match code

* linting

---------

Co-authored-by: timothycarambat <rambat1010@gmail.com>
2025-07-07 13:39:38 -07:00
bobbercheng
d0978fa363
Fix broken YT scraping with YT API (#4005)
* Fix broken YT scraping with YT API

* refactor youtube transcript class/add jsdoc comments

* fix test

---------

Co-authored-by: shatfield4 <seanhatfield5@gmail.com>
Co-authored-by: timothycarambat <rambat1010@gmail.com>
2025-07-07 13:06:18 -07:00
timothycarambat
3d5e8602a8 lint 2025-05-27 13:54:13 -07:00
rexjohannes
dc80d3e535
fixed drupal connector (#3893)
https://github.com/Mintplex-Labs/anything-llm/issues/3875#issuecomment-2913211343
2025-05-27 13:15:43 -07:00
Timothy Carambat
245a5969b8
normalize path on drupal to use documentsFolder constant
2025-05-27 09:25:48 -07:00
Sean Hatfield
2b274c62b7
Obsidian data connector (#3798)
* add obsidian vault data connector

* lint

* add english translations

* normalize translations

* improve file parser and reader

---------

Co-authored-by: timothycarambat <rambat1010@gmail.com>
2025-05-12 13:45:27 -07:00
timothycarambat
9d661bb96e linting 2025-05-07 09:40:31 -07:00
mr-chenguang
eff9d24cb9
feat: support fetch wikis for gitlab data connectors (#3271)
* feat: support fetch wikis for gitlab data connectors

* gitlab connector button spacing

* add docAuthor and description metadata for GitLab wiki pages

---------

Co-authored-by: shatfield4 <seanhatfield5@gmail.com>
Co-authored-by: Timothy Carambat <rambat1010@gmail.com>
2025-05-06 14:09:53 -07:00
Timothy Carambat
1601eb986c
Enable bypass of ip limitations via ENV in collector processing (#3652)
* Enable bypass of ip limitations via ENV in collector startup
resolves #3625
connect #3626

* dev build

* bump dockerx build action

* enable runtime setting config of collector requests

* comments and linting for option passing

* unset

* unset

* update docs link

* linting and docs
2025-04-21 11:10:41 -07:00
Timothy Carambat
fd4929b4d2
Feature/drupalwiki collector (#3693)
* Implement DrupalWiki collector

* Add attachment downloading and processing functionality (#3)

* linting

* Linting
Add citation image
small refactors
add URL for citation identifier

---------

Co-authored-by: em <eugen.mayer@kontextwork.de>
Co-authored-by: rexjohannes <53578137+rexjohannes@users.noreply.github.com>
Co-authored-by: Eugen Mayer <136934+EugenMayer@users.noreply.github.com>
2025-04-21 09:17:24 -07:00
Timothy Carambat
fd174cab86
Apply .git logic handler for repo URLs (#3655)
* Apply `.git` logic handler for repo URLs

* remove comment
2025-04-15 18:01:14 -07:00
Timothy Carambat
fab74037fa
Prevent collector crash when blocked by CDN (#3373)
resolves #3365
2025-02-28 10:27:05 -08:00
AbelDuan
df166eb64e
feat: Add multilingual support for ocr module (#3325)
* Add multilingual support for ocr module

* Add OCR language as server var that is passed into Collector
Support all valid tesseract language codes
Filter and parse only valid codes with fallbacks

* persist TARGET_OCR_LANG

* update docker example env

---------

Co-authored-by: Timothy Carambat <rambat1010@gmail.com>
2025-02-27 12:31:17 -08:00
t2
0eb86e2c12
for projects in gitlab subgroup (#3075) (#3247)
* for projects in gitlab subgroup (#3075)

* fix: false condition

* refactor pattern matching logic

---------

Co-authored-by: t2 <>
Co-authored-by: shatfield4 <seanhatfield5@gmail.com>
Co-authored-by: Timothy Carambat <rambat1010@gmail.com>
2025-02-17 12:25:11 -08:00
Timothy Carambat
4545ce24cd
Drop Node canvas for manual sharp conversion (#3221)
* Drop Node `canvas` for manual `sharp` conversion

* bump dev
2025-02-14 17:38:13 -08:00
mr-chenguang
6ffdbf074d
feat(dataconnectors): support confluence personal access token (#3206)
* feat(dataconnectors): support confluence personal access token

* fix: change select option

* linting
change name on accesstype field

---------

Co-authored-by: timothycarambat <rambat1010@gmail.com>
2025-02-14 12:12:01 -08:00
Timothy Carambat
89bba68219
Add OCR of image support (#3219)
* OCR PDFs as fallback in spawn thread

* wip

* build our own worker fanout and wrapper

* norm pkgs

* Add image OCR support
2025-02-14 12:07:33 -08:00
Timothy Carambat
2a9066e83a
OCR PDFs as fallback during upload (#3204)
* OCR PDFs as fallback in spawn thread

* wip

* build our own worker fanout and wrapper

* norm pkgs

* bump dev
2025-02-14 11:57:31 -08:00
Adam Setch
d63438fa61
chore: rename Github to GitHub (#3199)
* chore: rename Github to GitHub

Signed-off-by: Adam Setch <adam.setch@outlook.com>

* chore: rename Github to GitHub

Signed-off-by: Adam Setch <adam.setch@outlook.com>

* Undo some code changes for references

---------

Signed-off-by: Adam Setch <adam.setch@outlook.com>
Co-authored-by: timothycarambat <rambat1010@gmail.com>
2025-02-13 10:45:43 -08:00
Timothy Carambat
9a4df22c70
autodetect parseable text file contents (#3079) 2025-01-31 13:31:26 -08:00
Timothy Carambat
d1ca16f7f8
Add tokenizer improvements via Singleton class and estimation (#3072)
* Add tokenizer improvements via Singleton class
linting

* dev build

* Estimation fallback when string exceeds a fixed byte size

* Add notice to tiktoken on backend
2025-01-30 17:55:03 -08:00
Sean Hatfield
dd017c6cbb
Audio file validations (#2902)
* add audio file validations

* patch sharp to support wavfile parsing

---------

Co-authored-by: timothycarambat <rambat1010@gmail.com>
2024-12-30 14:48:28 -08:00
Sean Hatfield
9bc01afa7d
Fix scraping failed bug in link/bulk link scrapers (#2807)
* fix scraping failed bug in link/bulk link scrapers

* reset submodule

* swap to networkidle2 as a safe mix for SPA and API-loaded pages, but also not hang on request heavy pages

* lint

---------

Co-authored-by: timothycarambat <rambat1010@gmail.com>
2024-12-11 14:01:52 -08:00
Timothy Carambat
5e698534fe
Add plaintext file extensions (#2664) 2024-11-20 09:56:03 -08:00
Sean Hatfield
cf3b085a3a
Handle OpenAI whisper transcription edge case (#2621)
remove openai whisper transcription provider response_format option
2024-11-11 17:32:03 -08:00
Sean Hatfield
0bb47619dc
Allow 127.0.0.1 as valid URL for scraping (#2560)
* allow 127.0.0.1 as valid url for scraping

* update comments and lint

---------

Co-authored-by: timothycarambat <rambat1010@gmail.com>
2024-10-31 09:57:28 -07:00