merlyn

Author	SHA1	Message	Date
Timothy Carambat	fab74037fa	Prevent collector crash when blocked by CDN (#3373 ) resolves #3365	2025-02-28 10:27:05 -08:00
AbelDuan	df166eb64e	feat: Add multilingual support for ocr module (#3325 ) * Add multilingual support for ocr mudule * Add OCR langauge as server var that is passed into Collector Support all valid tesseract language codes Filter and parse only valid codes with fallbacks' * persist TARGET_OCR_LANG * update docker example env --------- Co-authored-by: Timothy Carambat <rambat1010@gmail.com>	2025-02-27 12:31:17 -08:00
Kristofer Bourro	b07240deee	Windows development environment variables support (#3354 ) * Windows development environment variables support * moved cross-env to dev dependencies --------- Co-authored-by: Timothy Carambat <rambat1010@gmail.com>	2025-02-27 10:43:31 -08:00
t2	0eb86e2c12	for projects in gitlab subgroup (#3075 ) (#3247 ) * for projects in gitlab subgroup (#3075) * fix: false condition * refactor pattern matching logic --------- Co-authored-by: t2 <> Co-authored-by: shatfield4 <seanhatfield5@gmail.com> Co-authored-by: Timothy Carambat <rambat1010@gmail.com>	2025-02-17 12:25:11 -08:00
Timothy Carambat	4545ce24cd	Drop Node `canvas` for manual `sharp` conversion (#3221 ) * Drop Node `canvas` for manual `sharp` conversion * bump dev	2025-02-14 17:38:13 -08:00
mr-chenguang	6ffdbf074d	feat(dataconnectors): support confluence personal access token (#3206 ) * feat(dataconnectors): support confluence personal access token * fix: change select option * linting change name on accesstype field --------- Co-authored-by: timothycarambat <rambat1010@gmail.com>	2025-02-14 12:12:01 -08:00
Timothy Carambat	89bba68219	Add OCR of image support (#3219 ) * OCR PDFs as fallback in spawn thread * wip * build our own worker fanout and wrapper * norm pkgs * Add image OCR support	2025-02-14 12:07:33 -08:00
Timothy Carambat	2a9066e83a	OCR PDFs as fallback during upload (#3204 ) * OCR PDFs as fallback in spawn thread * wip * build our own worker fanout and wrapper * norm pkgs * bump dev	2025-02-14 11:57:31 -08:00
Timothy Carambat	b6d3a411b1	Add `querySelectorAll` capability to web-scraping block (#3186 ) * Add `querySelectorAll` capability to web-scraping block * patches and fallbacks * fix styles of text in web scraping block --------- Co-authored-by: shatfield4 <seanhatfield5@gmail.com>	2025-02-13 16:11:15 -08:00
Adam Setch	d63438fa61	chore: rename Github to GitHub (#3199 ) * chore: rename Github to GitHub Signed-off-by: Adam Setch <adam.setch@outlook.com> * chore: rename Github to GitHub Signed-off-by: Adam Setch <adam.setch@outlook.com> * Undo some code changes for references --------- Signed-off-by: Adam Setch <adam.setch@outlook.com> Co-authored-by: timothycarambat <rambat1010@gmail.com>	2025-02-13 10:45:43 -08:00
Timothy Carambat	9a4df22c70	autodetect parseable text file contents (#3079 )	2025-01-31 13:31:26 -08:00
Timothy Carambat	d1ca16f7f8	Add tokenizer improvments via Singleton class and estimation (#3072 ) * Add tokenizer improvments via Singleton class linting * dev build * Estimation fallback when string exceeds a fixed byte size * Add notice to tiktoken on backend	2025-01-30 17:55:03 -08:00
Sean Hatfield	dd017c6cbb	Audio file validations (#2902 ) * add audio file validations * patch sharp to support wavfile parsing --------- Co-authored-by: timothycarambat <rambat1010@gmail.com>	2024-12-30 14:48:28 -08:00
Sean Hatfield	9bc01afa7d	Fix scraping failed bug in link/bulk link scrapers (#2807 ) * fix scraping failed bug in link/bulk link scrapers * reset submodule * swap to networkidle2 as a safe mix for SPA and API-loaded pages, but also not hang on request heavy pages * lint --------- Co-authored-by: timothycarambat <rambat1010@gmail.com>	2024-12-11 14:01:52 -08:00
Timothy Carambat	5e698534fe	Add plaintext file extensions (#2664 )	2024-11-20 09:56:03 -08:00
Sean Hatfield	cf3b085a3a	Handle OpenAI whisper transcription edge case (#2621 ) remove openai whisper transcription provider response_format option	2024-11-11 17:32:03 -08:00
Sean Hatfield	0bb47619dc	Allow 127.0.0.1 as valid URL for scraping (#2560 ) * allow 127.0.0.1 as valid url for scraping * update comments and lint --------- Co-authored-by: timothycarambat <rambat1010@gmail.com>	2024-10-31 09:57:28 -07:00
timothycarambat	c870e31aaa	add `ino` filetype to text/plain support	2024-10-28 11:44:15 -07:00
Sean Hatfield	0074ededdd	Github data connector improvements (#2439 ) * fix tree/blob github urls from branches not being loaded * improve ux of github data connector * lint * patch Github URL parser to just validate with `URL` native parser * uncheck LocalStorage of PAT for security reasons --------- Co-authored-by: Timothy Carambat <rambat1010@gmail.com>	2024-10-21 15:25:35 -07:00
timothycarambat	ab6f03ce1c	linting	2024-10-18 11:44:14 -07:00
Sean Hatfield	41522cdfb4	Handle non-ascii characters in single and bulk link scraper URLs (#2495 ) handle non-ascii characters in urls	2024-10-17 17:04:00 -07:00
Sean Hatfield	b658f5012d	Support XLSX files (#2403 ) * support xlsx files * lint * create seperate docs for each xlsx sheet * lint * use node-xlsx pkg for parsing xslx files * lint * update error handling --------- Co-authored-by: timothycarambat <rambat1010@gmail.com>	2024-10-03 13:45:23 -07:00
Timothy Carambat	93d64642f3	Add exception handling for special case files like `Dockerfile` and `Jenkinsfile` (#2410 )	2024-10-02 15:13:31 -07:00
Blazej Owczarczyk	348d9c8285	Add 3GB file size limit to body parser middlewares (#2390 )	2024-09-30 11:19:41 -07:00
Timothy Carambat	30645831a1	1959 filetype filters (#2378 ) * Updated the `GitHubRepoLoader` class to use the new import syntax and adjust the `recursiveLoader` method accordingly. * add @langchain/community to collector package.json * fix: Improve handling of complex ignore patterns in GitLabRepoLoader * refactor: use ignore package for simplified ignore logic * run yarn lint * add @langchain/community@^0.2.23 * remove unused dep lint --------- Co-authored-by: Emil Rofors (aider) <emirof@gmail.com>	2024-09-26 12:50:35 -07:00
Blazej Owczarczyk	b2123b13b0	Added an option to fetch issues from gitlab. Made the file fetching a… (#2335 ) * Added an option to fetch issues from gitlab. Made the file fetching asynchornous to improve performance. #2334 * Fixed a typo in loadGitlabRepo. * Convert issues to markdown. * Fixed an issue with time estimate field names in issueToMarkdown. * handle rate limits more gracefully + update checkbox to toggle switch * lint --------- Co-authored-by: Timothy Carambat <rambat1010@gmail.com> Co-authored-by: shatfield4 <seanhatfield5@gmail.com>	2024-09-26 11:45:18 -07:00
Timothy Carambat	961b567541	Add dropdown for confluence connector deployment (#2376 )	2024-09-26 08:49:05 -07:00
Sean Hatfield	4488744850	Support more Confluence URL formats (#2118 ) * support more confluence url formats * use pattern matching for confluence urls and manual splitting as fallback * rework entire Confluence flow to prevent issues with custom, local, and cloud spaces * remove dep --------- Co-authored-by: Timothy Carambat <rambat1010@gmail.com>	2024-09-25 16:12:17 -07:00
Sean Hatfield	5a3d55db67	Fix custom domain in confluence (#2328 ) confluence custom domain fix	2024-09-19 15:36:07 -05:00
Timothy Carambat	4fa3d6d333	Load all branches in gitlab data connector (#2325 ) * Fix gitlab data connector for self-hosted instances (#2315) * Linting fix. * Load all branches in the GitLab data connector #2319 * #2319 lint fixes. * update fetch on fail --------- Co-authored-by: Błażej Owczarczyk <blazeyy@gmail.com>	2024-09-19 13:34:38 -05:00
Blazej Owczarczyk	b25298c04a	Fix gitlab data connector for self-hosted instances (#2315 ) (#2316 ) * Fix gitlab data connector for self-hosted instances (#2315) * Linting fix.	2024-09-18 16:12:15 -05:00
timothycarambat	9aa77dfb8d	Add verbose logging to GH loader connect #2243	2024-09-09 14:36:37 -07:00
timothycarambat	5f477e0dbd	remove log	2024-09-06 11:37:46 -07:00
timothycarambat	619f6b3884	Ignore SSL errors for web scraper resolves #2114	2024-08-14 09:11:22 -07:00
timothycarambat	b541623c9e	add SSRF notice	2024-08-13 17:46:07 -07:00
Sean Hatfield	2797298507	Fix depth handling in bulk link scraper (#2096 ) fix depth handling in bulk link scraper	2024-08-12 11:44:35 -07:00
Lea Anthony	3b6a2fd2fa	#2084 Support Go filetype (#2085 ) Support Go filetype	2024-08-09 19:29:29 -07:00
Mehmet Ünlü	0d4560b9e4	2049 remove break that prevents fetching files from gitlab repo (#2050 ) fix: remove unnecessary break Remove unnecessary break that prevents checking next pages for blob objects.	2024-08-06 10:17:55 -07:00
Sean Hatfield	be3b0b4916	Youtube loader whitespace fix (#2051 ) youtube loader whitespace fix	2024-08-06 10:16:17 -07:00
Timothy Carambat	04a0fc4ec9	Remove unused deps (#1938 ) * Remove unused deps * improve dependency	2024-07-25 10:21:03 -07:00
Timothy Carambat	42235fcd8a	GitLab Hosted and Local Connector (#1932 ) * Add support for GitLab repo collection as well as Github Repo collection * Refactor for repo collectors to be more compact --------- Co-authored-by: Emil Rofors <emirof@gmail.com>	2024-07-23 12:23:51 -07:00
timothycarambat	f15529653f	patch logger for full logs	2024-07-19 18:35:41 -07:00
timothycarambat	cec1a3d585	append stacktraces to winston	2024-07-19 18:13:54 -07:00
Sean Hatfield	9b86bbd2b8	[FIX] PDFLoader module bug fix (#1879 ) use pdf.js by importing it from pdf-parse and fix custom PDFLoader module	2024-07-16 13:09:43 -07:00
Sean Hatfield	79656718b2	[FEAT] Create custom pdfloader (#1852 ) * implement custom PDFLoader to remove LC dep * remove unneeded comment * remove pdfjs as dep and fix page splitting using pdf-parse * linting + export rename for desktop compat --------- Co-authored-by: timothycarambat <rambat1010@gmail.com>	2024-07-11 12:26:11 -07:00
timothycarambat	8658b1e7c7	linting	2024-07-03 18:25:44 -07:00
Timothy Carambat	29c9eeaa5c	Add `winston` logging for production (#1811 )	2024-07-03 16:39:33 -07:00
Sean Hatfield	a87014822a	[REFACTOR] Improve asPDF collector processor with pdfjs (#1791 ) * WIP replace langchain pdfloader with pdfjs and add more context to each page * remove extras from pdfjs and just replace langchain library * remove unneeded dep * fix console log in docs --------- Co-authored-by: timothycarambat <rambat1010@gmail.com>	2024-07-03 14:26:48 -07:00
Sean Hatfield	f205d51fe9	[FIX] Confluence code snippet blocks not being extracted (#1804 ) implement custom confluence loader to extract code blocks properly from documents Co-authored-by: Timothy Carambat <rambat1010@gmail.com>	2024-07-03 14:00:44 -07:00
Sean Hatfield	fc375f4036	[FIX] Bulk link scraper bug fix (#1800 ) patch website depth data connector to work for other links that are not root url	2024-07-01 16:59:28 -07:00

136 Commits