Originally (maybe over a year ago) I had similar issues. But now Zitadel is in the official nixpkgs repo and one `enable = true;` option[1] away, so you shouldn't really have this issue anymore. I was able to use it pretty easily with the built-in service and Postgres service[2] (note mine is encapsulated in a NixOS container, but otherwise the inner config is all you really need).
Is there a roadmap for planned features? I wouldn't call this a "powerful tool for addressing key challenges in deploying RAG systems" right now. It seems to do the simplest version of RAG, the kind the most basic RAG tutorial teaches, with a pretty UI over it.
The key challenges I've faced around RAG are things like:
- Only works on text-based modalities (how can I use this with all types of source documents, including images?)
- Chunking "well" for the type of document (by paragraph, CSVs including the header on every chunk, tables in PDFs, diagrams, etc.). Rudimentary chunk-by-character-with-overlap is demonstrably not very good for retrieval.
- The R in RAG is really just "how can you do the best possible search for the given query?" The approach here is so simple that it definitely won't produce the best possible search results. It's missing so many known techniques right now, like:
- Generate example queries that the chunk can answer and embed those to search against (see the sketch after this list).
- Parent document retrieval
- So many newer RAG techniques have been discussed and used that are better than plain chunk-based retrieval.
- How do you differentiate "needs the whole source" vs. "find in source" questions? Think: summarize the entire PDF, vs. a specific question like how long it takes light to travel to the moon and back.
- Also other search approaches like fuzzy/lexical search, and choosing between them based on criteria like "the user query is one word, so use fuzzy search instead of semantic search". Things like that.
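To make the first two techniques concrete, here's a minimal sketch of embedding the questions a chunk can answer and then returning the parent document rather than the tiny chunk. The `embed` and `generate_questions` helpers are toy stand-ins (a real version would call an embedding model and an LLM); none of this is R2R's API.

```python
import hashlib
import math
from dataclasses import dataclass

def embed(text: str) -> list[float]:
    # Toy deterministic "embedding" so the sketch runs; swap in a real model.
    digest = hashlib.sha256(text.lower().encode()).digest()
    return [b / 255 for b in digest[:16]]

def generate_questions(chunk: str) -> list[str]:
    # Placeholder; in practice ask an LLM "what questions can this chunk answer?"
    first_words = " ".join(chunk.split()[:5])
    return [f"What does the document say about {first_words}?"]

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
    return dot / norm if norm else 0.0

@dataclass
class IndexEntry:
    vector: list[float]
    parent_doc_id: str  # retrieval hands back the parent, not the tiny chunk

index: list[IndexEntry] = []

def ingest(doc_id: str, chunks: list[str]) -> None:
    for chunk in chunks:
        # Embed the questions a chunk can answer, not just the raw chunk text.
        for question in generate_questions(chunk):
            index.append(IndexEntry(embed(question), doc_id))

def retrieve(query: str, documents: dict[str, str], k: int = 3) -> list[str]:
    query_vec = embed(query)
    ranked = sorted(index, key=lambda e: cosine(query_vec, e.vector), reverse=True)
    seen: set[str] = set()
    results: list[str] = []
    for entry in ranked:
        if entry.parent_doc_id not in seen:
            seen.add(entry.parent_doc_id)
            results.append(documents[entry.parent_doc_id])  # full parent document
        if len(results) == k:
            break
    return results
```

The same index can hold the raw chunk embeddings too; the point is that what you embed for search doesn't have to be what you hand to the generator.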
So far this platform seems to lock you into a really simple embedding pipeline that only supports the simplest chunk-based retrieval. I wouldn't use this unless there was some promise of it actually solving some of the hard challenges in RAG.
Thanks for taking the time to provide your candid feedback; I think you have made a lot of good points.
You are correct that the options in R2R are fairly simple today. Our approach here is to get input from the developer community to make sure we are on the right track before building out more novel features.
Regarding your challenges:
> - Only works on text-based modalities (how can I use this with all types of source documents, including images?)
For the immediate future R2R will likely remain focused on text, but you are right that the problem gets even more challenging when you introduce the idea of images. I'd like to start working on multi-modal soon.
> - Chunking "well" for the type of document (by paragraph, CSVs including the header on every chunk, tables in PDFs, diagrams, etc.). Rudimentary chunk-by-character-with-overlap is demonstrably not very good for retrieval.
This is very true - a short/medium-term goal of mine is to integrate some more intelligent chunking approaches, ranging from Vikp's Surya to Reducto's proprietary model. I'm also interested in exploring what can be done from the pure software side.
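On that pure software side, even stepping up from fixed character windows to layout-aware splits helps; here's a minimal sketch of paragraph-aware chunking with a size cap (the cap and merge rules are arbitrary assumptions, not something R2R ships):

```python
def chunk_by_paragraph(text: str, max_chars: int = 1000) -> list[str]:
    """Split on blank lines, then merge paragraphs until a size cap is hit."""
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks: list[str] = []
    current = ""
    for para in paragraphs:
        if current and len(current) + len(para) + 2 > max_chars:
            chunks.append(current)
            current = para
        else:
            current = f"{current}\n\n{para}" if current else para
    if current:
        chunks.append(current)
    return chunks
```

The same idea extends to repeating a CSV header on every chunk or keeping a table together with its caption, which fixed-size windows can't do.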
> - The R in RAG is really just "how can you do the best possible search for the given query?" The approach here is so simple that it definitely won't produce the best possible search results. It's missing so many known techniques right now, like:
> - Generate example queries that the chunk can answer and embed those to search against.
> - Parent document retrieval
> - So many newer RAG techniques have been discussed and used that are better than plain chunk-based retrieval.
> - How do you differentiate "needs the whole source" vs. "find in source" questions? Think: summarize the entire PDF, vs. a specific question like how long it takes light to travel to the moon and back.
I think the other approaches you outline are all worth investigating as well. There is definitely a tension we face between building and testing new experimental approaches vs. figuring out what features people need in production and implementing those.
Just so you know where we are heading - we want to make sure all the features are there for easy experimentation, but we also want to provide value in production and beyond. As an example, we are currently working on robust task orchestration to accompany our pipeline abstractions to help with ingesting large quantities of data, as this has been a pain point in our own experience and that of some of our early enterprise users.
Nice, thanks for the reply. Glad to hear you are looking into these challenges and plan to tackle some of them. Will keep my eye on the repo for some of these improvements in the future.
And totally agree, scaling out the ingestion of large quantities of data is a hard challenge as well, and it makes sense to work on that problem space too. Sounds like that is a higher priority at the moment, which is totally fine.
We are also very interested in the more novel RAG techniques, so I'm not sure that one is necessarily a higher priority than the other.
We've just gotten more immediate feedback from our early users around the difficulties of ingesting data in production and there is less ambiguity around what to build.
Out of your previous list, is there one example that you think would be most useful for the next addition to the framework?
Well, as someone building something similar, I have been looking around at how people are tackling the problem of varied index approaches for different files, and again how that can scale.
I haven't read the code on your GitHub, but the README mentions using Qdrant/pgvector. I'm curious how you will tackle having that scale to billions of files with tens/hundreds/etc. of different indexing approaches for each file. It doesn't feel tenable to keep it in a single Postgres instance, as it will just grow and grow forever.
Consider even a very simple example of multiple indexes per file: chunk sizes of 20/500/1000 combined with overlaps of 50/100/500. You suddenly have a large number of index combinations to maintain, and each is basically a full copy of the source file. (You can imagine indexes for BM25, fuzzy matching, Lucene, etc.)
You could be brute-force about it and run every single index mode for every file until a better process exists for picking only the best ones for a specific file. But even if you narrowed it down, a file could want 5 different index types searched and ranked at the retrieval step.
I want to know how people plan to shard, or otherwise make it possible, to have so many search indexes on all their data and still be able to query against all of it. Postgres will run out of space fairly quickly even on the beefiest cloud instance.
The second biggest thing is how to use all of those indexes well in the retrieval step: which indexes should be searched, and how should they be weighted, given the user query and conversation history?
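For the "combine many indexes" half of that, one common pattern is reciprocal rank fusion over whichever indexes you decided to query, with per-index weights driven by the query shape; a minimal sketch (the index names and weights are made up for illustration, and I'm not suggesting this is what R2R does):

```python
def reciprocal_rank_fusion(rankings: dict[str, list[str]], weights=None, k: int = 60) -> list[str]:
    """rankings: index name -> list of doc ids, best first.
    weights: optional per-index weight; k is the usual RRF damping constant."""
    weights = weights or {}
    scores: dict[str, float] = {}
    for index_name, doc_ids in rankings.items():
        w = weights.get(index_name, 1.0)
        for rank, doc_id in enumerate(doc_ids):
            scores[doc_id] = scores.get(doc_id, 0.0) + w / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

# e.g. fuse a semantic index, BM25, and a fuzzy-match index,
# down-weighting semantic search for a one-word query:
fused = reciprocal_rank_fusion(
    {"semantic": ["d3", "d1", "d7"], "bm25": ["d1", "d2"], "fuzzy": ["d1", "d9"]},
    weights={"semantic": 0.5, "bm25": 1.0, "fuzzy": 1.0},
)
```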
You are both right about chunking, and I think it is one of the main challenges.
Regarding more intelligent chunking approaches, I think you should give preprocess.co a try.
It's able to preprocess and chunk PDFs, Office files, and HTML content.
It follows the original document layout and considers the content semantics, so you get optimal chunks.
This is my problem with every end-to-end system I've seen around this. I find that, even building these systems from scratch, all of the hard parts are just normal data infrastructure problems. The "AI" part takes a small fraction of the effort to deliver, even when building the RAG part directly on top of huggingface/transformers.
I also have dealt with what you're describing, but IME it goes much further when going to prod. The ingestion part is even messier in ways these kinds of platforms don't seem to help with. When managing multiple tools in prod with overlapping and non-constant data sources (say, two tools that both need to know the price of a product, which can change at any time), I need both of them to be built on the same source of truth, I need that source of truth to be fed by our data infra in real time, and I need the relevant documents to be replaced in real time in a more or less atomic way.
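The shape of that "replace documents atomically" requirement is roughly the following; a toy in-memory sketch with an invented store interface (a real version would be a DB transaction or an index-alias/version swap keyed by a stable source id), just to illustrate the pattern:

```python
from threading import Lock

class DocumentStore:
    """Toy in-memory store. A real system would do the swap inside a DB
    transaction or via an alias/version flip so readers never see a
    half-updated document."""

    def __init__(self) -> None:
        self._lock = Lock()
        self._chunks: dict[str, list[str]] = {}  # source_id -> chunks

    def replace_document(self, source_id: str, new_text: str) -> None:
        # Do the expensive work (parsing, chunking, embedding) before the swap.
        new_chunks = [p.strip() for p in new_text.split("\n\n") if p.strip()]
        with self._lock:
            self._chunks[source_id] = new_chunks  # old chunks disappear in one step

    def search(self, query: str) -> list[str]:
        with self._lock:
            return [c for chunks in self._chunks.values() for c in chunks
                    if query.lower() in c.lower()]
```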
Then I have some tools with varying levels of permissioning on those overlapping data sources. Say you have two tools in a classroom: one that helps the student based on their own work, and another used by the TA or teacher to understand students' answers in a large course. They have overlapping needs on otherwise private data, and this kind of permissioning layer, which is pretty trivial in a normal webapp, has IME had to be implemented basically from scratch on top of the vector DB and retrieval system.
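The from-scratch version of that layer usually ends up being a metadata filter applied to (or ideally pushed into) the vector search; a minimal sketch, with the chunk metadata and access rules invented purely for illustration:

```python
from dataclasses import dataclass

@dataclass
class Chunk:
    text: str
    owner_id: str   # e.g. the student who wrote the work
    course_id: str

def allowed(chunk: Chunk, user_id: str, role: str, course_id: str) -> bool:
    if role in ("ta", "teacher"):
        return chunk.course_id == course_id   # staff see the whole course
    return chunk.owner_id == user_id          # students see only their own work

def permitted_results(results: list[Chunk], user_id: str, role: str, course_id: str) -> list[Chunk]:
    # In practice you express this as a metadata predicate inside the vector DB
    # query itself, so top-k isn't starved by post-filtering.
    return [c for c in results if allowed(c, user_id, role, course_id)]
```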
Then experimentation, eval, testing, and releases are the hardest and most underserved parts. It was only relatively recently that anyone even seemed to be talking about eval as a problem to aspire to solve. There's a pretty interesting and novel interplay between production ML eval (with potentially sparse data) and conventional unit testing. This is the area we had to put the most of our own thought into before I felt reasonably confident putting anything into prod.
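To make that concrete, even a tiny golden set of labeled queries run as a unit-test-style regression check goes a long way; a minimal sketch (the retriever and labeled examples are placeholders for whatever you actually run):

```python
def recall_at_k(retrieve, labeled, k: int = 5) -> float:
    """labeled: list of (query, set of relevant doc ids) pairs.
    retrieve: callable(query, k) -> list of doc ids, best first."""
    per_query = []
    for query, relevant_ids in labeled:
        retrieved = set(retrieve(query, k))
        per_query.append(len(retrieved & relevant_ids) / len(relevant_ids))
    return sum(per_query) / len(per_query)

# Run in CI like a unit test so retrieval regressions fail the build:
# assert recall_at_k(my_retriever, GOLDEN_QUERIES) >= 0.8
```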
FWIW we just built our own internal platform on top of langchain a while back; it seemed like a good balance of the right level of abstraction for our use cases, with solid productivity gains from shared effort.
I think this is a really interesting problem space, but yeah, I'm skeptical of all of these platforms as they seem to always be promising a lot more than they're delivering. It looks superficially like there has been all of this progress on tooling, but I built a production service based on vector search in 2018 and it really isn't that much easier today. It works better because the models are so much better, but the tools and frameworks don't help that much with the hard parts, to my surprise honestly.
Perhaps I'm just not the user and am being excessively critical, but I keep having to deal with execs and product people throwing these frameworks at us internally without understanding the alignment between what is hard about building these kinds of services in prod and what these kinds of tools make easier vs harder.
This is AMAZING feedback and it is in line with what I've heard from a number of builders. Thanks for sharing your experiences here.
The infra challenges are real - they are what I have struggled with most in providing high-quality support for early users. Most want to be able to reliably firehose 10-100s of GBs of data through a brittle multistep pipeline. This was something I struggled with when building AgentSearch [https://huggingface.co/datasets/SciPhi/AgentSearch-V1] with LOCAL data - so introducing the networking component only makes things that much harder.
I think we have a lot of work to do to robustly solve this problem, but I'm confident that there is an opportunity to build a framework that results in net positives for the developer.
FWIW, your feedback would be invaluable as the project continues to grow.
I'll have to try this out. I currently use Amethyst + Hammerspoon scripts for my window tooling. Like others in this thread, I find Amethyst occasionally loses track of all windows and requires a restart (especially after a monitor is connected or disconnected).
Amethyst does a decent job at the layouts I care about.
I primarily use AwesomeWM on Linux on my personal computers, which has the amazing super-key drag/resize behavior for windows. I use Hammerspoon to replicate this behavior[0][1] and it works quite well.
Eventually I want to replace Amethyst and just do everything in Hammerspoon, as it seems quite plausible to do window layouts with it. I'll give Yabai a try as well in the meantime.
This was also an issue I had, but there is actually very good support for it hidden behind an invisible feature. Simply add `*@domain` as an email address identity. When you select that as your from address, the Fastmail UI gives you an input box to use whatever address you want; you don't need to make a new identity each time. For lookup, I just use my password manager.
I agree; all of the "do it all" database tools have hurt in the long term in every system I've worked on that uses them (TypeORM, Prisma, etc.).
From what I can tell it is always better to just learn and understand the database you are using and create it by hand. You may spend a tiny bit longer on basic CRUD boilerplate, but it will benefit you greatly in the long term to have a solid native migrations setup, and a framework that simply works with your database schema rather than forcing an opinion onto your database.
With Prisma, TypeORM, etc. you are a slave to what the ORM wants the database to look like. Tools like Objection, jOOQ, etc. are much easier to work with in the long term and let you tune your database by hand, without arbitrary framework constraints.
Prisma is great if you plan on never maintaining past your MVP, so I guess it makes sense that startups use it and get stuff out the door quickly, but I don't see it as a long-term solution or as something that will ever handle the complicated situations that _will happen_ to your database in the future.
I'm with the product team at Prisma, currently focusing on migrations.
> Prisma is great if you plan on never maintaining past your MVP, so I guess it makes sense that startups use it and get stuff out the door quickly
We want Prisma to help developers get stuff out the door faster but our ultimate goal is to support developers throughout the entire application lifecycle.
We are working on improving Migrate and hope to deliver improvements over the next few months that can help change your mind about our toolkit. :)
[1]: https://search.nixos.org/options?channel=25.05&query=zitadel
[2]: https://git.joshuabell.xyz/ringofstorms/dotfiles/src/branch/...