145 Commits

Author SHA1 Message Date
Sean Hatfield
2b274c62b7
Obsidian data connector (#3798)
* add obsidian vault data connector

* lint

* add english translations

* normalize translations

* improve file parser and reader

---------

Co-authored-by: timothycarambat <rambat1010@gmail.com>
2025-05-12 13:45:27 -07:00
Timothy Carambat
6fc0a6a644
Enable workflow rule for package verification (#3778)
enable workflow rule
2025-05-07 12:51:14 -07:00
timothycarambat
3f4fda86bf match openai versions across collector/backend 2025-05-07 12:30:09 -07:00
timothycarambat
9d661bb96e linting 2025-05-07 09:40:31 -07:00
mr-chenguang
eff9d24cb9
feat: support fetch wikis for gitlab data connectors (#3271)
* feat: support fetch wikis for gitlab data connectors

* gitlab connector button spacing

* add docAuthor and description metadata for GitLab wiki pages

---------

Co-authored-by: shatfield4 <seanhatfield5@gmail.com>
Co-authored-by: Timothy Carambat <rambat1010@gmail.com>
2025-05-06 14:09:53 -07:00
Sean Hatfield
610bdd4673
Allow custom headers in upload-link endpoint (#3695)
* allow custom headers in upload-link endpoint

* override loader.scrape to allow for passing of headers in langchain puppeteer

* lint

* Rename some variables
move positional args to named args
update documentation to reflect arg changes and function sigs
validate header object before attempting to forward to request

* update header validation for custom headers

---------

Co-authored-by: timothycarambat <rambat1010@gmail.com>
2025-04-22 12:47:12 -07:00
Timothy Carambat
1601eb986c
Enable bypass of ip limitations via ENV in collector processing (#3652)
* Enable bypass of ip limitations via ENV in collector startup
resolves #3625
connect #3626

* dev build

* bump dockerx build action

* enable runtime setting config of collector requests

* comments and linting for option passing

* unset

* unset

* update docs link

* linting and docs
2025-04-21 11:10:41 -07:00
Timothy Carambat
fd4929b4d2
Feature/drupalwiki collector (#3693)
* Implement DrupalWiki collector

* Add attachment downloading and processing functionality (#3)

* linting

* Linting
Add citation image
small refactors
add URL for citation identifier

---------

Co-authored-by: em <eugen.mayer@kontextwork.de>
Co-authored-by: rexjohannes <53578137+rexjohannes@users.noreply.github.com>
Co-authored-by: Eugen Mayer <136934+EugenMayer@users.noreply.github.com>
2025-04-21 09:17:24 -07:00
Timothy Carambat
fd174cab86
Apply .git logic handler for repo URLs (#3655)
* Apply `.git` logic handler for repo URLs

* remove comment
2025-04-15 18:01:14 -07:00
Timothy Carambat
fab74037fa
Prevent collector crash when blocked by CDN (#3373)
resolves #3365
2025-02-28 10:27:05 -08:00
AbelDuan
df166eb64e
feat: Add multilingual support for ocr module (#3325)
* Add multilingual support for ocr module

* Add OCR language as server var that is passed into Collector
Support all valid tesseract language codes
Filter and parse only valid codes with fallbacks

* persist TARGET_OCR_LANG

* update docker example env

---------

Co-authored-by: Timothy Carambat <rambat1010@gmail.com>
2025-02-27 12:31:17 -08:00
Kristofer Bourro
b07240deee
Windows development environment variables support (#3354)
* Windows development environment variables support

* moved cross-env to dev dependencies

---------

Co-authored-by: Timothy Carambat <rambat1010@gmail.com>
2025-02-27 10:43:31 -08:00
t2
0eb86e2c12
for projects in gitlab subgroup (#3075) (#3247)
* for projects in gitlab subgroup (#3075)

* fix: false condition

* refactor pattern matching logic

---------

Co-authored-by: t2 <>
Co-authored-by: shatfield4 <seanhatfield5@gmail.com>
Co-authored-by: Timothy Carambat <rambat1010@gmail.com>
2025-02-17 12:25:11 -08:00
Timothy Carambat
4545ce24cd
Drop Node canvas for manual sharp conversion (#3221)
* Drop Node `canvas` for manual `sharp` conversion

* bump dev
2025-02-14 17:38:13 -08:00
mr-chenguang
6ffdbf074d
feat(dataconnectors): support confluence personal access token (#3206)
* feat(dataconnectors): support confluence personal access token

* fix: change select option

* linting
change name on accesstype field

---------

Co-authored-by: timothycarambat <rambat1010@gmail.com>
2025-02-14 12:12:01 -08:00
Timothy Carambat
89bba68219
Add OCR of image support (#3219)
* OCR PDFs as fallback in spawn thread

* wip

* build our own worker fanout and wrapper

* norm pkgs

* Add image OCR support
2025-02-14 12:07:33 -08:00
Timothy Carambat
2a9066e83a
OCR PDFs as fallback during upload (#3204)
* OCR PDFs as fallback in spawn thread

* wip

* build our own worker fanout and wrapper

* norm pkgs

* bump dev
2025-02-14 11:57:31 -08:00
Timothy Carambat
b6d3a411b1
Add querySelectorAll capability to web-scraping block (#3186)
* Add `querySelectorAll` capability to web-scraping block

* patches and fallbacks

* fix styles of text in web scraping block

---------

Co-authored-by: shatfield4 <seanhatfield5@gmail.com>
2025-02-13 16:11:15 -08:00
Adam Setch
d63438fa61
chore: rename Github to GitHub (#3199)
* chore: rename Github to GitHub

Signed-off-by: Adam Setch <adam.setch@outlook.com>

* chore: rename Github to GitHub

Signed-off-by: Adam Setch <adam.setch@outlook.com>

* Undo some code changes for references

---------

Signed-off-by: Adam Setch <adam.setch@outlook.com>
Co-authored-by: timothycarambat <rambat1010@gmail.com>
2025-02-13 10:45:43 -08:00
Timothy Carambat
9a4df22c70
autodetect parseable text file contents (#3079) 2025-01-31 13:31:26 -08:00
Timothy Carambat
d1ca16f7f8
Add tokenizer improvements via Singleton class and estimation (#3072)
* Add tokenizer improvements via Singleton class
linting

* dev build

* Estimation fallback when string exceeds a fixed byte size

* Add notice to tiktoken on backend
2025-01-30 17:55:03 -08:00
Sean Hatfield
dd017c6cbb
Audio file validations (#2902)
* add audio file validations

* patch sharp to support wavfile parsing

---------

Co-authored-by: timothycarambat <rambat1010@gmail.com>
2024-12-30 14:48:28 -08:00
Sean Hatfield
9bc01afa7d
Fix scraping failed bug in link/bulk link scrapers (#2807)
* fix scraping failed bug in link/bulk link scrapers

* reset submodule

* swap to networkidle2 as a safe mix for SPA and API-loaded pages, but also not hang on request heavy pages

* lint

---------

Co-authored-by: timothycarambat <rambat1010@gmail.com>
2024-12-11 14:01:52 -08:00
Timothy Carambat
5e698534fe
Add plaintext file extensions (#2664) 2024-11-20 09:56:03 -08:00
Sean Hatfield
cf3b085a3a
Handle OpenAI whisper transcription edge case (#2621)
remove openai whisper transcription provider response_format option
2024-11-11 17:32:03 -08:00
Sean Hatfield
0bb47619dc
Allow 127.0.0.1 as valid URL for scraping (#2560)
* allow 127.0.0.1 as valid url for scraping

* update comments and lint

---------

Co-authored-by: timothycarambat <rambat1010@gmail.com>
2024-10-31 09:57:28 -07:00
timothycarambat
c870e31aaa add ino filetype to text/plain support 2024-10-28 11:44:15 -07:00
Sean Hatfield
0074ededdd
Github data connector improvements (#2439)
* fix tree/blob github urls from branches not being loaded

* improve ux of github data connector

* lint

* patch Github URL parser to just validate with `URL` native parser

* uncheck LocalStorage of PAT for security reasons

---------

Co-authored-by: Timothy Carambat <rambat1010@gmail.com>
2024-10-21 15:25:35 -07:00
timothycarambat
ab6f03ce1c linting 2024-10-18 11:44:14 -07:00
Sean Hatfield
41522cdfb4
Handle non-ascii characters in single and bulk link scraper URLs (#2495)
handle non-ascii characters in urls
2024-10-17 17:04:00 -07:00
Sean Hatfield
b658f5012d
Support XLSX files (#2403)
* support xlsx files

* lint

* create separate docs for each xlsx sheet

* lint

* use node-xlsx pkg for parsing xlsx files

* lint

* update error handling

---------

Co-authored-by: timothycarambat <rambat1010@gmail.com>
2024-10-03 13:45:23 -07:00
Timothy Carambat
93d64642f3
Add exception handling for special case files like Dockerfile and Jenkinsfile (#2410) 2024-10-02 15:13:31 -07:00
Blazej Owczarczyk
348d9c8285
Add 3GB file size limit to body parser middlewares (#2390) 2024-09-30 11:19:41 -07:00
Timothy Carambat
30645831a1
1959 filetype filters (#2378)
* Updated the `GitHubRepoLoader` class to use the new import syntax and adjust the `recursiveLoader` method accordingly.

* add @langchain/community to collector package.json

* fix: Improve handling of complex ignore patterns in GitLabRepoLoader

* refactor: use ignore package for simplified ignore logic

* run yarn lint

* add @langchain/community@^0.2.23

* remove unused dep
lint

---------

Co-authored-by: Emil Rofors (aider) <emirof@gmail.com>
2024-09-26 12:50:35 -07:00
Blazej Owczarczyk
b2123b13b0
Added an option to fetch issues from gitlab. Made the file fetching a… (#2335)
* Added an option to fetch issues from gitlab. Made the file fetching asynchronous to improve performance. #2334

* Fixed a typo in loadGitlabRepo.

* Convert issues to markdown.

* Fixed an issue with time estimate field names in issueToMarkdown.

* handle rate limits more gracefully + update checkbox to toggle switch

* lint

---------

Co-authored-by: Timothy Carambat <rambat1010@gmail.com>
Co-authored-by: shatfield4 <seanhatfield5@gmail.com>
2024-09-26 11:45:18 -07:00
Timothy Carambat
961b567541
Add dropdown for confluence connector deployment (#2376) 2024-09-26 08:49:05 -07:00
Sean Hatfield
4488744850
Support more Confluence URL formats (#2118)
* support more confluence url formats

* use pattern matching for confluence urls and manual splitting as fallback

* rework entire Confluence flow to prevent issues with custom, local, and cloud spaces

* remove dep

---------

Co-authored-by: Timothy Carambat <rambat1010@gmail.com>
2024-09-25 16:12:17 -07:00
Sean Hatfield
5a3d55db67
Fix custom domain in confluence (#2328)
confluence custom domain fix
2024-09-19 15:36:07 -05:00
Timothy Carambat
4fa3d6d333
Load all branches in gitlab data connector (#2325)
* Fix gitlab data connector for self-hosted instances (#2315)

* Linting fix.

* Load all branches in the GitLab data connector #2319

* #2319 lint fixes.

* update fetch on fail

---------

Co-authored-by: Błażej Owczarczyk <blazeyy@gmail.com>
2024-09-19 13:34:38 -05:00
Blazej Owczarczyk
b25298c04a
Fix gitlab data connector for self-hosted instances (#2315) (#2316)
* Fix gitlab data connector for self-hosted instances (#2315)

* Linting fix.
2024-09-18 16:12:15 -05:00
timothycarambat
9aa77dfb8d Add verbose logging to GH loader
connect #2243
2024-09-09 14:36:37 -07:00
timothycarambat
5f477e0dbd remove log 2024-09-06 11:37:46 -07:00
timothycarambat
619f6b3884 Ignore SSL errors for web scraper
resolves #2114
2024-08-14 09:11:22 -07:00
timothycarambat
b541623c9e add SSRF notice 2024-08-13 17:46:07 -07:00
Sean Hatfield
2797298507
Fix depth handling in bulk link scraper (#2096)
fix depth handling in bulk link scraper
2024-08-12 11:44:35 -07:00
Lea Anthony
3b6a2fd2fa
#2084 Support Go filetype (#2085)
Support Go filetype
2024-08-09 19:29:29 -07:00
Mehmet Ünlü
0d4560b9e4
2049 remove break that prevents fetching files from gitlab repo (#2050)
fix: remove unnecessary break

Remove unnecessary break that prevents checking next pages for blob objects.
2024-08-06 10:17:55 -07:00
Sean Hatfield
be3b0b4916
Youtube loader whitespace fix (#2051)
youtube loader whitespace fix
2024-08-06 10:16:17 -07:00
Timothy Carambat
04a0fc4ec9
Remove unused deps (#1938)
* Remove unused deps

* improve dependency
2024-07-25 10:21:03 -07:00
Timothy Carambat
42235fcd8a
GitLab Hosted and Local Connector (#1932)
* Add support for GitLab repo collection as well as GitHub repo collection
* Refactor for repo collectors to be more compact

---------

Co-authored-by: Emil Rofors <emirof@gmail.com>
2024-07-23 12:23:51 -07:00