Commit Graph

40 Commits

Author SHA1 Message Date
Timothy Carambat
092b1b45f8
Upgrade YT Scraper (#4820) 2026-01-02 15:41:22 -08:00
Timothy Carambat
b2f49b6036
patch ESM import issue (#4819) 2026-01-02 14:11:13 -08:00
Sean Hatfield
6c1f8a38ce
Refactor localWhisper to use custom FFMPEGWrapper class (#4775)
* refactor localWhisper to use new custom FFMPEGWrapper class

* stub tests in github actions

* add back wavefile conversion to 16khz 32f to fix docker builds

* use afterEach for cleanup in ffmpeg tests

* remove unused FFMPEG_PATH env check

* use spawnSync for ffmpeg to capture and log output

* lint

* revert removal of try/catch around validateAudioFile for more helpful error msgs

* use readFileSync instead of createReadStream for less overhead

* change import to require for fix-path and stub import in tests

* refactor to singleton to preserve ffmpeg path
dev build

---------

Co-authored-by: Timothy Carambat <rambat1010@gmail.com>
2025-12-18 11:41:45 -08:00
Timothy Carambat
692fa755ee
Bump expressJS from 4.18.2 -> 4.21.2 (#4760)
Bump expressJS from 4.18.2 -> 4.21.2 to patch body-parser CVE-2024-45590 as a general maintenance task
2025-12-10 18:54:18 -08:00
Timothy Carambat
d22b7fc4e2
Remove bcrypt from collector - not used (#4747) 2025-12-09 15:23:42 -08:00
Timothy Carambat
cc7c876efc
bump body-parser patch version (#4746) 2025-12-09 15:21:22 -08:00
Timothy Carambat
cd263337f8 fix: bump version tag 2025-12-09 13:18:51 -08:00
Timothy Carambat
155900eae7
dev build with new epub2 build target and remove patch work (#4694) 2025-11-26 17:36:34 -08:00
Marcello Fitton
376c9f7f3f
Install patch-package in /collector and Apply Patch to Fix EPub Upload Bug (#4630)
* Install patch-package and postinstall-postinstall

* Implement patch to ensure title is always a string in EPub class
2025-11-19 13:17:58 -08:00
timothycarambat
71cd46ce1b 1.9.0 tag 2025-10-09 15:11:59 -07:00
timothycarambat
a4a84f9bdd forgot 1.8.5 tag :) 2025-08-14 17:43:55 -07:00
timothycarambat
c535c69345 1.8.4 tag update 2025-07-16 10:40:39 -07:00
Timothy Carambat
8001d9ddeb
update 1.8.3 tags for release (#4109)
* update 1.8.3 tags for release

* whoops, botched news
2025-07-09 12:17:56 -07:00
Timothy Carambat
64d9fbc8f0
Show app version in system settings sidebar (#4044)
* Add version tagging
resolves #4038
closes #4034
closes #4028

* add hook

* add build

* patch
2025-06-24 13:56:12 -07:00
Timothy Carambat
6fc0a6a644
Enable workflow rule for package verification (#3778)
enable workflow rule
2025-05-07 12:51:14 -07:00
timothycarambat
3f4fda86bf match openai versions across collector/backend 2025-05-07 12:30:09 -07:00
Kristofer Bourro
b07240deee
Windows development environment variables support (#3354)
* Windows development environment variables support

* moved cross-env to dev dependencies

---------

Co-authored-by: Timothy Carambat <rambat1010@gmail.com>
2025-02-27 10:43:31 -08:00
Timothy Carambat
4545ce24cd
Drop Node canvas for manual sharp conversion (#3221)
* Drop Node `canvas` for manual `sharp` conversion

* bump dev
2025-02-14 17:38:13 -08:00
Timothy Carambat
2a9066e83a
OCR PDFs as fallback during upload (#3204)
* OCR PDFs as fallback in spawn thread

* wip

* build our own worker fanout and wrapper

* norm pkgs

* bump dev
2025-02-14 11:57:31 -08:00
Sean Hatfield
dd017c6cbb
Audio file validations (#2902)
* add audio file validations

* patch sharp to support wavfile parsing

---------

Co-authored-by: timothycarambat <rambat1010@gmail.com>
2024-12-30 14:48:28 -08:00
Sean Hatfield
b658f5012d
Support XLSX files (#2403)
* support xlsx files

* lint

* create separate docs for each xlsx sheet

* lint

* use node-xlsx pkg for parsing xlsx files

* lint

* update error handling

---------

Co-authored-by: timothycarambat <rambat1010@gmail.com>
2024-10-03 13:45:23 -07:00
Timothy Carambat
30645831a1
1959 filetype filters (#2378)
* Updated the `GitHubRepoLoader` class to use the new import syntax and adjust the `recursiveLoader` method accordingly.

* add @langchain/community to collector package.json

* fix: Improve handling of complex ignore patterns in GitLabRepoLoader

* refactor: use ignore package for simplified ignore logic

* run yarn lint

* add @langchain/community@^0.2.23

* remove unused dep
lint

---------

Co-authored-by: Emil Rofors (aider) <emirof@gmail.com>
2024-09-26 12:50:35 -07:00
Timothy Carambat
04a0fc4ec9
Remove unused deps (#1938)
* Remove unused deps

* improve dependency
2024-07-25 10:21:03 -07:00
Timothy Carambat
42235fcd8a
GitLab Hosted and Local Connector (#1932)
* Add support for GitLab repo collection as well as Github Repo collection
* Refactor for repo collectors to be more compact

---------

Co-authored-by: Emil Rofors <emirof@gmail.com>
2024-07-23 12:23:51 -07:00
Sean Hatfield
79656718b2
[FEAT] Create custom pdfloader (#1852)
* implement custom PDFLoader to remove LC dep

* remove unneeded comment

* remove pdfjs as dep and fix page splitting using pdf-parse

* linting + export rename for desktop compat

---------

Co-authored-by: timothycarambat <rambat1010@gmail.com>
2024-07-11 12:26:11 -07:00
Timothy Carambat
29c9eeaa5c
Add winston logging for production (#1811) 2024-07-03 16:39:33 -07:00
Sean Hatfield
a87014822a
[REFACTOR] Improve asPDF collector processor with pdfjs (#1791)
* WIP replace langchain pdfloader with pdfjs and add more context to each page

* remove extras from pdfjs and just replace langchain library

* remove unneeded dep

* fix console log in docs

---------

Co-authored-by: timothycarambat <rambat1010@gmail.com>
2024-07-03 14:26:48 -07:00
Timothy Carambat
98cef508a6
Feature/devcontv2 (#1622)
* Updated apt-packages source for devcontainer

Switched the devcontainer's package source to a different repository to
align with updated dependencies and package availability. The previous
source from 'rocker-org' is replaced with 'devcontainers-contrib', which
may offer more recent or relevant development tools.

* Subject: Centralize prettier ignores and refine config

Body:
Centralized all prettier ignore rules by removing individual
`.prettierignore` files in subprojects and updating the root
`.prettierignore` to include previously ignored patterns, ensuring
consistency across the workspace. Additionally, the prettier
configuration was refined by making the file pattern for `.config.js`
files consistent and adjusting quote styles for better readability. All
lint scripts across the project were updated to respect the centralized
ignore path, enhancing maintainability.

The consolidation simplifies the process of managing ignore rules as the
project scales, ensuring developers can focus on writing code without
worrying about divergent formatting standards. These changes also align
with introducing comprehensive linting across multiple environments to
keep the codebase clean and consistent.

This adjustment is a foundational step towards a more streamlined and
unified code base, making it easier for new contributors to adhere to
established coding standards and reducing the cognitive load associated
with managing multiple configuration files across the project.

* unset package json changes

---------

Co-authored-by: Francisco Bischoff <franzbischoff@gmail.com>
Co-authored-by: Francisco Bischoff <984592+franzbischoff@users.noreply.github.com>
2024-06-06 12:50:42 -07:00
Timothy Carambat
547d4859ef
Bump openai package to latest (#1234)
* Bump `openai` package to latest
Tested all except localai

* bump LocalAI support with latest image

* add deprecation notice

* linting
2024-04-30 12:33:42 -07:00
Timothy Carambat
94017e2b51
bump langchain deps (#1231)
* bump langchain deps

* patch native and ollama providers remove deprecated deps

---------

Co-authored-by: shatfield4 <seanhatfield5@gmail.com>
2024-04-30 12:04:24 -07:00
Sean Hatfield
348b36bf85
[FEAT] Confluence data connector (#1181)
* WIP Confluence data connector backend

* confluence data connector complete

* confluence citations

* fix citation for confluence

* Patch Confluence integration

* fix Citation Icon for confluence

---------

Co-authored-by: timothycarambat <rambat1010@gmail.com>
2024-04-25 17:53:38 -07:00
Timothy Carambat
1f8ab0d245
Remove YoutubeLoader dependency (#1050)
* WIP data connector redesign

* new UI for data connectors complete

* remove old data connector page/cleanup imports

* cleanup of UI and imports

* Remove Youtube Transcript dep and move in-house

* lang pref default to en

---------

Co-authored-by: shatfield4 <seanhatfield5@gmail.com>
2024-04-05 16:33:01 -07:00
Timothy Carambat
4fb4aa2041
Add epub support for parsing (#1017) 2024-04-02 14:25:52 -07:00
Timothy Carambat
0ada882991
Support external transcription providers (#909)
* Support External Transcription providers

* patch files

* update docs

* fix return data
2024-03-14 15:43:26 -07:00
Timothy Carambat
0f31e43fd4
bump YT metadata lib for YT api fix rot (#888) 2024-03-11 10:57:53 -07:00
Timothy Carambat
58971e8b30
Build & Publish AnythingLLM for ARM64 and x86 (#549)
* Update build process to support multi-platform builds
Bump @lancedb/vectordb to 0.1.19 for ARM&AMD compatibility
Patch puppeteer on ARM builds because of broken chromium
resolves #539
resolves #548

---------

Co-authored-by: shatfield4 <seanhatfield5@gmail.com>
2024-01-08 16:15:01 -08:00
Timothy Carambat
ecf4295537
Add ability to grab youtube transcripts via doc processor (#470)
* Add ability to grab youtube transcripts via doc processor

* dynamic imports
swap out Github for Youtube in placeholder text
2023-12-18 17:17:26 -08:00
Timothy Carambat
452582489e
GitHub loader extension + extension support v1 (#469)
* feat: implement github repo loading
fix: purge of folders
fix: rendering of sub-files

* noshow delete on custom-documents

* Add API key support because of rate limits

* WIP for frontend of data connectors

* wip

* Add frontend form for GitHub repo data connector

* remove console.logs
block custom-documents from being deleted

* remove _meta unused arg

* Add support for ignore pathing in request
Ignore path input via tagging

* Update hint
2023-12-18 15:48:02 -08:00
Timothy Carambat
61db981017
feat: Embed on-instance Whisper model for audio/mp4 transcribing (#449)
* feat: Embed on-instance Whisper model for audio/mp4 transcribing
resolves #329

* additional logging

* add placeholder for tmp folder in collector storage
Add cleanup of hotdir and tmp on collector boot to prevent hanging files
split loading of model and file conversion into concurrency

* update README

* update model size

* update supported filetypes
2023-12-15 11:20:13 -08:00
Timothy Carambat
719521c307
Document Processor v2 (#442)
* wip: init refactor of document processor to JS

* add NodeJs PDF support

* wip: parity with python processor
feat: add pptx support

* fix: forgot files

* Remove python scripts totally

* wip: update docker to boot new collector

* add package.json support

* update dockerfile for new build

* update gitignore and linting

* add more protections on file lookup

* update package.json

* test build

* update docker commands to use cap-add=SYS_ADMIN so web scraper can run
update all scripts to reflect this
remove docker build for branch
2023-12-14 15:14:56 -08:00