* Add multilingual support for ocr mudule
* Add OCR langauge as server var that is passed into Collector
Support all valid tesseract language codes
Filter and parse only valid codes with fallbacks'
* persist TARGET_OCR_LANG
* update docker example env
---------
Co-authored-by: Timothy Carambat <rambat1010@gmail.com>
* Add tokenizer improvments via Singleton class
linting
* dev build
* Estimation fallback when string exceeds a fixed byte size
* Add notice to tiktoken on backend
* implement custom PDFLoader to remove LC dep
* remove unneeded comment
* remove pdfjs as dep and fix page splitting using pdf-parse
* linting + export rename for desktop compat
---------
Co-authored-by: timothycarambat <rambat1010@gmail.com>