73 Commits

Author SHA1 Message Date
Asish Kumar
91e75c27c2
fix: preserve Confluence context paths (#5415)
* fix: preserve confluence context paths

* lint and minor changes

---------

Co-authored-by: Timothy Carambat <rambat1010@gmail.com>
2026-04-13 13:10:40 -07:00
Timothy Carambat
dc0bdf112b
linting & show descriptive error for bad addtoWorkspace request body
resolves #5172
2026-03-09 11:30:53 -07:00
Maxwell Calkin
563f95167d
fix: add missing /wiki to Confluence cloud citation URLs (#5167)
fix: add /wiki to Confluence cloud page URLs in citations
2026-03-09 10:24:56 -07:00
Marcello Fitton
8f33203ade
chore: add ESLint to /collector (#5128)
* add eslint config to /collector

* prettier formatting

* fix unused

* fix undefined

* disable lines

* lockfile

---------

Co-authored-by: Timothy Carambat <rambat1010@gmail.com>
2026-03-05 16:25:23 -08:00
Timothy Carambat
d58ff0ea3e
Normalize scraper runtimeargs for bulk-scraper (#5083)
resolves #5078
closes #5079
2026-02-27 09:15:17 -08:00
Marcello Fitton
c927eda18f
fix: GitLab connector infinite loop and rate limit crash for large repos (#5021)
* Fix infinite loop and rate limit crashes

* simplify logic | add max-retries to fetchNextPage and fetchSingleFileContents

---------

Co-authored-by: shatfield4 <seanhatfield5@gmail.com>
Co-authored-by: Timothy Carambat <rambat1010@gmail.com>
2026-02-19 12:42:21 -08:00
Timothy Carambat
2dc625193e
4825 patch yt file collector api (#4904)
Patch YT links in API document collector
closes #4825
2026-01-26 14:36:21 -08:00
j0rDy
f52e2866ac
Update common.js (#4894)
* Update common.js

Added missing translations in Dutch.

* linting

---------

Co-authored-by: Timothy Carambat <rambat1010@gmail.com>
2026-01-23 17:12:17 -08:00
Timothy Carambat
4de5e30ac6
Merge commit from fork
2026-01-23 17:06:44 -08:00
Timothy Carambat
092b1b45f8
Upgrade YT Scraper (#4820)
2026-01-02 15:41:22 -08:00
Sean Hatfield
c76b0708c3
Fix pagination bug in paperless-ngx data connector (#4757)
* iterate over all pages in paperless-ngx data connector

* add error handling and data validation

* refactor to handle edge cases and null values

* catch edge case to prevent infinite loop

---------

Co-authored-by: Timothy Carambat <rambat1010@gmail.com>
2025-12-12 10:23:32 -08:00
timothycarambat
758db6b677
fix lint
2025-11-25 14:42:10 -08:00
Neha Prasad
3ecf218eea
feat: Add SSL certificate bypass support for self-hosted Confluence instances (#4219)
* Added bypassSSL parameter to constructor and implemented SSL bypass logic in fetchConfluenceData method

* Updated generateChunkSource function to include bypassSSL in the encrypted payload

* Updated the request body to include bypassSSL in the JSON payload sent to the backend

* Updated form submission to include bypassSSL parameter from the checkbox

* Added bypass_ssl: "Bypass SSL Certificate Validation" translation

* passed these parameters to fetchconfluencepage function for proper resync functionality

* allow ignore of SSL cert for Confluence

* add translations

---------

Co-authored-by: Timothy Carambat <rambat1010@gmail.com>
2025-11-25 14:32:10 -08:00
Sean Hatfield
05df4ac72b
Paperless ngx data connector (#4121)
* paperless ngx data connector

* wip resync paperless ngx

* fix generateChunkSource for resyncing paperless ngx

* lint

* Refactor Paperless-NGX connector
Fix issue with date rendering in tooltip + extended width
Move tooltip details to be column for more space

---------

Co-authored-by: Timothy Carambat <rambat1010@gmail.com>
2025-11-20 11:27:38 -08:00
Timothy Carambat
b3b261e15d
Fix loop logic for fetchNextPage use in GitLabLoader (#4662)
resolves #4626
closes #4627
2025-11-19 13:53:26 -08:00
Marcello Fitton
d3619689db
Refactor loadYouTubeTranscript() to include YouTube Video Metadata in Content When parseOnly is true (#4552)
* Enhance YouTube transcript loading to include video metadata in parsed content when parseOnly is true

* extract to function

---------

Co-authored-by: timothycarambat <rambat1010@gmail.com>
2025-10-15 15:42:00 -07:00
Timothy Carambat
5edc1bea42
Add ability to auto-handle YT video URLs in uploader & chat (#4547)
* Add ability to auto-handle YT video URLs in uploader & chat

* move YT validator to URL utils

* update comment
2025-10-15 12:18:57 -07:00
AoiYamada
8fc1f24d1b
fix: youtube transcript collector not work well with non en or non asr caption (#4442)
* fix: youtube transcript collector not work well with non en or non asr caption

* stub YT test in Github actions

---------

Co-authored-by: Timothy Carambat <rambat1010@gmail.com>
2025-09-29 13:22:50 -07:00
Timothy Carambat
70a07b743b
Update writeToServerDocuments to take config object (#4213)
2025-07-29 17:53:05 -07:00
timothycarambat
ff34c8cefc
use documentsFolder path for simplification
2025-07-16 11:14:18 -07:00
Sean Hatfield
5485c58b44
Sanitize youtube transcription file paths (#4148)
sanitize youtube transcription file paths
2025-07-14 13:53:34 -07:00
rexjohannes
14fa079953
Fix/drupal wiki (improve table & url handling) (#4097)
* feat: add support for custom table formatting in htmlToText conversion

* fix tables

* feat: improve plain text table formatting for AI readability

* fix options

* improve drupal wiki connector

* final fix

* adjust leading slash to match code

* linting

---------

Co-authored-by: timothycarambat <rambat1010@gmail.com>
2025-07-07 13:39:38 -07:00
bobbercheng
d0978fa363
Fix broken YT scraping with YT API (#4005)
* Fix broken YT scraping with YT API

* refactor youtube transcript class/add jsdoc comments

* fix test

---------

Co-authored-by: shatfield4 <seanhatfield5@gmail.com>
Co-authored-by: timothycarambat <rambat1010@gmail.com>
2025-07-07 13:06:18 -07:00
timothycarambat
3d5e8602a8
lint
2025-05-27 13:54:13 -07:00
rexjohannes
dc80d3e535
fixed drupal connector (#3893)
https://github.com/Mintplex-Labs/anything-llm/issues/3875#issuecomment-2913211343
2025-05-27 13:15:43 -07:00
Timothy Carambat
245a5969b8
normalize path on drupal to use documentsFolder constant
2025-05-27 09:25:48 -07:00
Sean Hatfield
2b274c62b7
Obsidian data connector (#3798)
* add obsidian vault data connector

* lint

* add english translations

* normalize translations

* improve file parser and reader

---------

Co-authored-by: timothycarambat <rambat1010@gmail.com>
2025-05-12 13:45:27 -07:00
timothycarambat
9d661bb96e
linting
2025-05-07 09:40:31 -07:00
mr-chenguang
eff9d24cb9
feat: support fetch wikis for gitlab data connectors (#3271)
* feat: support fetch wikis for gitlab data connectors

* gitlab connector button spacing

* add docAuthor and description metadata for GitLab wiki pages

---------

Co-authored-by: shatfield4 <seanhatfield5@gmail.com>
Co-authored-by: Timothy Carambat <rambat1010@gmail.com>
2025-05-06 14:09:53 -07:00
Timothy Carambat
fd4929b4d2
Feature/drupalwiki collector (#3693)
* Implement DrupalWiki collector

* Add attachment downloading and processing functionality (#3)

* linting

* Linting
Add citation image
small refactors
add URL for citation identifier

---------

Co-authored-by: em <eugen.mayer@kontextwork.de>
Co-authored-by: rexjohannes <53578137+rexjohannes@users.noreply.github.com>
Co-authored-by: Eugen Mayer <136934+EugenMayer@users.noreply.github.com>
2025-04-21 09:17:24 -07:00
Timothy Carambat
fd174cab86
Apply .git logic handler for repo URLs (#3655)
* Apply `.git` logic handler for repo URLs

* remove comment
2025-04-15 18:01:14 -07:00
t2
0eb86e2c12
for projects in gitlab subgroup (#3075) (#3247)
* for projects in gitlab subgroup (#3075)

* fix: false condition

* refactor pattern matching logic

---------

Co-authored-by: t2 <>
Co-authored-by: shatfield4 <seanhatfield5@gmail.com>
Co-authored-by: Timothy Carambat <rambat1010@gmail.com>
2025-02-17 12:25:11 -08:00
mr-chenguang
6ffdbf074d
feat(dataconnectors): support confluence personal access token (#3206)
* feat(dataconnectors): support confluence personal access token

* fix: change select option

* linting
change name on accesstype field

---------

Co-authored-by: timothycarambat <rambat1010@gmail.com>
2025-02-14 12:12:01 -08:00
Adam Setch
d63438fa61
chore: rename Github to GitHub (#3199)
* chore: rename Github to GitHub

Signed-off-by: Adam Setch <adam.setch@outlook.com>

* chore: rename Github to GitHub

Signed-off-by: Adam Setch <adam.setch@outlook.com>

* Undo some code changes for references

---------

Signed-off-by: Adam Setch <adam.setch@outlook.com>
Co-authored-by: timothycarambat <rambat1010@gmail.com>
2025-02-13 10:45:43 -08:00
Timothy Carambat
d1ca16f7f8
Add tokenizer improvements via Singleton class and estimation (#3072)
* Add tokenizer improvements via Singleton class
linting

* dev build

* Estimation fallback when string exceeds a fixed byte size

* Add notice to tiktoken on backend
2025-01-30 17:55:03 -08:00
Sean Hatfield
9bc01afa7d
Fix scraping failed bug in link/bulk link scrapers (#2807)
* fix scraping failed bug in link/bulk link scrapers

* reset submodule

* swap to networkidle2 as a safe mix for SPA and API-loaded pages, but also not hang on request heavy pages

* lint

---------

Co-authored-by: timothycarambat <rambat1010@gmail.com>
2024-12-11 14:01:52 -08:00
Sean Hatfield
0074ededdd
Github data connector improvements (#2439)
* fix tree/blob github urls from branches not being loaded

* improve ux of github data connector

* lint

* patch Github URL parser to just validate with `URL` native parser

* uncheck LocalStorage of PAT for security reasons

---------

Co-authored-by: Timothy Carambat <rambat1010@gmail.com>
2024-10-21 15:25:35 -07:00
timothycarambat
ab6f03ce1c
linting
2024-10-18 11:44:14 -07:00
Sean Hatfield
41522cdfb4
Handle non-ascii characters in single and bulk link scraper URLs (#2495)
handle non-ascii characters in urls
2024-10-17 17:04:00 -07:00
Timothy Carambat
30645831a1
1959 filetype filters (#2378)
* Updated the `GitHubRepoLoader` class to use the new import syntax and adjust the `recursiveLoader` method accordingly.

* add @langchain/community to collector package.json

* fix: Improve handling of complex ignore patterns in GitLabRepoLoader

* refactor: use ignore package for simplified ignore logic

* run yarn lint

* add @langchain/community@^0.2.23

* remove unused dep
lint

---------

Co-authored-by: Emil Rofors (aider) <emirof@gmail.com>
2024-09-26 12:50:35 -07:00
Blazej Owczarczyk
b2123b13b0
Added an option to fetch issues from gitlab. Made the file fetching a… (#2335)
* Added an option to fetch issues from gitlab. Made the file fetching asynchronous to improve performance. #2334

* Fixed a typo in loadGitlabRepo.

* Convert issues to markdown.

* Fixed an issue with time estimate field names in issueToMarkdown.

* handle rate limits more gracefully + update checkbox to toggle switch

* lint

---------

Co-authored-by: Timothy Carambat <rambat1010@gmail.com>
Co-authored-by: shatfield4 <seanhatfield5@gmail.com>
2024-09-26 11:45:18 -07:00
Timothy Carambat
961b567541
Add dropdown for confluence connector deployment (#2376)
2024-09-26 08:49:05 -07:00
Sean Hatfield
4488744850
Support more Confluence URL formats (#2118)
* support more confluence url formats

* use pattern matching for confluence urls and manual splitting as fallback

* rework entire Confluence flow to prevent issues with custom, local, and cloud spaces

* remove dep

---------

Co-authored-by: Timothy Carambat <rambat1010@gmail.com>
2024-09-25 16:12:17 -07:00
Sean Hatfield
5a3d55db67
Fix custom domain in confluence (#2328)
confluence custom domain fix
2024-09-19 15:36:07 -05:00
Timothy Carambat
4fa3d6d333
Load all branches in gitlab data connector (#2325)
* Fix gitlab data connector for self-hosted instances (#2315)

* Linting fix.

* Load all branches in the GitLab data connector #2319

* #2319 lint fixes.

* update fetch on fail

---------

Co-authored-by: Błażej Owczarczyk <blazeyy@gmail.com>
2024-09-19 13:34:38 -05:00
Blazej Owczarczyk
b25298c04a
Fix gitlab data connector for self-hosted instances (#2315) (#2316)
* Fix gitlab data connector for self-hosted instances (#2315)

* Linting fix.
2024-09-18 16:12:15 -05:00
timothycarambat
9aa77dfb8d
Add verbose logging to GH loader
connect #2243
2024-09-09 14:36:37 -07:00
Sean Hatfield
2797298507
Fix depth handling in bulk link scraper (#2096)
fix depth handling in bulk link scraper
2024-08-12 11:44:35 -07:00
Mehmet Ünlü
0d4560b9e4
2049 remove break that prevents fetching files from gitlab repo (#2050)
fix: remove unnecessary break

Remove unnecessary break that prevents checking next pages for blob objects.
2024-08-06 10:17:55 -07:00
Sean Hatfield
be3b0b4916
Youtube loader whitespace fix (#2051)
youtube loader whitespace fix
2024-08-06 10:16:17 -07:00