michael/merlyn - merlyn - Honeywest

michael/merlyn

Author	SHA1	Message	Date
AbelDuan	df166eb64e	feat: Add multilingual support for ocr module (#3325) * Add multilingual support for ocr module * Add OCR language as server var that is passed into Collector Support all valid tesseract language codes Filter and parse only valid codes with fallbacks * persist TARGET_OCR_LANG * update docker example env --------- Co-authored-by: Timothy Carambat <rambat1010@gmail.com>	2025-02-27 12:31:17 -08:00
Timothy Carambat	2a9066e83a	OCR PDFs as fallback during upload (#3204) * OCR PDFs as fallback in spawn thread * wip * build our own worker fanout and wrapper * norm pkgs * bump dev	2025-02-14 11:57:31 -08:00
Timothy Carambat	d1ca16f7f8	Add tokenizer improvements via Singleton class and estimation (#3072) * Add tokenizer improvements via Singleton class linting * dev build * Estimation fallback when string exceeds a fixed byte size * Add notice to tiktoken on backend	2025-01-30 17:55:03 -08:00
Sean Hatfield	79656718b2	[FEAT] Create custom pdfloader (#1852) * implement custom PDFLoader to remove LC dep * remove unneeded comment * remove pdfjs as dep and fix page splitting using pdf-parse * linting + export rename for desktop compat --------- Co-authored-by: timothycarambat <rambat1010@gmail.com>	2024-07-11 12:26:11 -07:00