michael/merlyn - merlyn - Honeywest

Author	SHA1	Message	Date
Timothy Carambat	cf3fbcbf0f	Improve URL handler for collector processes (#4504) * Improve URL handler for collector processes * dev build	2025-10-07 11:03:27 -07:00
timothycarambat	bdfa0328db	update comment about parseOnly	2025-10-01 20:45:52 -07:00
Marcello Fitton	f7b90571be	Fetch, Parse, and Create Documents for Statically Hosted Files (#4398) * Add capability to web scraping feature for document creation to download and parse statically hosted files * lint * Remove unneeded comment * Simplified process by using key of ACCEPTED_MIMES to validate the response content type, as a result unlocked all supported files * Add TODO comments for future implementation of asDoc.js to handle standard MS Word files in constants.js * Return captureAs argument to be exposed by scrapeGenericUrl and passed into getPageContent | Return explicit argument of captureAs into scrapeGenericUrl in processLink fn * Return debug log for scrapeGenericUrl * Change conditional to a guard clause. * Add error handling, validation, and JSDOC to getContentType helper fn * remove unneeded comments * Simplify URL validation by reusing module * Rename downloadFileToHotDir to downloadURIToFile and moved up to a global module | Add URL validation to downloadURIToFile * refactor * add support for webp remove unused imports --------- Co-authored-by: timothycarambat <rambat1010@gmail.com>	2025-10-01 15:49:05 -07:00
Timothy Carambat	95557ee16f	Allow user to specify args for chromium process so they don't need SYS_ADMIN on container. (#4397) * allow user to specify args for chromium process so they don't need SYS_ADMIN perms * use arg flag content * update console outputs	2025-09-17 16:31:08 -07:00
Jonas Stawski	b8d4cc3454	Added metadata parameter to document/upload, document/upload/{folderName}, and document/upload-link (#4342) * Added the ability to pass in metadata to the /document/upload/{folderName} endpoint * Added the ability to pass in metadata to the /document/upload-link endpoint * feat: added metadata to document/upload api endpoint * simplify optional metadata in document dev api endpoints * lint * patch handling of metadata in dev api * Linting, small comments --------- Co-authored-by: jstawskigmi <jstawski@getmyinterns.org> Co-authored-by: shatfield4 <seanhatfield5@gmail.com> Co-authored-by: Timothy Carambat <rambat1010@gmail.com>	2025-09-17 11:17:29 -07:00
Timothy Carambat	70a07b743b	Update `writeToServerDocuments` to take config object (#4213)	2025-07-29 17:53:05 -07:00
Sean Hatfield	610bdd4673	Allow custom headers in upload-link endpoint (#3695) * allow custom headers in upload-link endpoint * override loader.scrape to allow for passing of headers in langchain puppeteer * lint * Rename some variables move positional args to named args update documentation to reflect arg changes and function sigs validate header object before attempting to end to forward to request * update header validation for custom headers --------- Co-authored-by: timothycarambat <rambat1010@gmail.com>	2025-04-22 12:47:12 -07:00
Timothy Carambat	b6d3a411b1	Add `querySelectorAll` capability to web-scraping block (#3186) * Add `querySelectorAll` capability to web-scraping block * patches and fallbacks * fix styles of text in web scraping block --------- Co-authored-by: shatfield4 <seanhatfield5@gmail.com>	2025-02-13 16:11:15 -08:00
Timothy Carambat	d1ca16f7f8	Add tokenizer improvements via Singleton class and estimation (#3072) * Add tokenizer improvements via Singleton class linting * dev build * Estimation fallback when string exceeds a fixed byte size * Add notice to tiktoken on backend	2025-01-30 17:55:03 -08:00
Sean Hatfield	9bc01afa7d	Fix scraping failed bug in link/bulk link scrapers (#2807) * fix scraping failed bug in link/bulk link scrapers * reset submodule * swap to networkidle2 as a safe mix for SPA and API-loaded pages, but also not hang on request heavy pages * lint --------- Co-authored-by: timothycarambat <rambat1010@gmail.com>	2024-12-11 14:01:52 -08:00
timothycarambat	ab6f03ce1c	linting	2024-10-18 11:44:14 -07:00
Sean Hatfield	41522cdfb4	Handle non-ascii characters in single and bulk link scraper URLs (#2495) handle non-ascii characters in urls	2024-10-17 17:04:00 -07:00
timothycarambat	619f6b3884	Ignore SSL errors for web scraper resolves #2114	2024-08-14 09:11:22 -07:00
Timothy Carambat	a5bb77f97a	Agent support for `@agent` default agent inside workspace chat (#1093) V1 of agent support via built-in `@agent` that can be invoked alongside normal workspace RAG chat.	2024-04-16 10:50:10 -07:00
Timothy Carambat	d52f8aafd4	689 links in citation (#715) * Include links in citations force ChunkSource key to retain this information old links will be unsupported * show special icons depending on source * remove console log * reset server documents writeTo	2024-02-13 14:11:57 -08:00
Timothy Carambat	b35feede87	570 document api return object (#608) * Add support for fetching single document in documents folder * Add document object to upload + support link scraping via API * hotfixes for documentation * update api docs	2024-01-16 16:04:22 -08:00
timothycarambat	daadad3859	hoist var in extensions	2023-12-20 19:41:16 -08:00
Timothy Carambat	719521c307	Document Processor v2 (#442) * wip: init refactor of document processor to JS * add NodeJs PDF support * wip: parity with python processor feat: add pptx support * fix: forgot files * Remove python scripts totally * wip:update docker to boot new collector * add package.json support * update dockerfile for new build * update gitignore and linting * add more protections on file lookup * update package.json * test build * update docker commands to use cap-add=SYS_ADMIN so web scraper can run update all scripts to reflect this remove docker build for branch	2023-12-14 15:14:56 -08:00