
I'm not familiar with production search systems at scale (very curious about them, though). How do you think Algolia shards their data given that architecture? Based on their description, it seems like the search engine itself is monolithic. Maybe they're running a 3-node cluster with a monolithic index for each customer?

Interesting, do you keep a copy of repos in their indexed document form, or is that generated on the fly during indexing? Is your custom index format binary? I have no idea whether that's standard practice or whether a compressed text format is enough. I'd guess that non-binary formats would be enormous, though, and since an index is by nature fairly unique data, it probably wouldn't compress that well.

I do feel the development velocity thing. I've felt something similar on my smaller scale projects. Being able to fully re-index the corpus in less than a day definitely seems like it would provide a lot of opportunities to experiment and try stuff out without it being too costly.

Scale up in terms of what? Is the current system not indexing all of GitHub, or do you mean you want to index more things (e.g. commits, PRs, etc.)?



> How do you think Algolia shards their data given that architecture?

My guess is that Algolia's indices are sharded by customer and each cluster probably has multiple customer indices.
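
For illustration, here's a minimal sketch of what per-customer routing could look like, assuming a fixed set of clusters and hash-based assignment. The cluster names and route() function are hypothetical, not anything Algolia documents:

    # Hypothetical sketch: each customer's index lives whole on one
    # cluster, and a thin routing layer maps customers to clusters.
    import hashlib

    CLUSTERS = ["cluster-a", "cluster-b", "cluster-c"]

    def route(customer_id: str) -> str:
        """Pick the cluster that hosts this customer's monolithic index."""
        digest = hashlib.sha256(customer_id.encode()).digest()
        return CLUSTERS[int.from_bytes(digest[:8], "big") % len(CLUSTERS)]

    # A query never fans out across clusters: it goes straight to the
    # one cluster holding the customer's index, so the engine itself
    # can stay monolithic.
    print(route("acme-corp"))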

> do you keep a copy of repos in their indexed document form, or is that generated on the fly during indexing?

As mentioned in the post, the index contains the full content. Our ingest process essentially flattens git repos (which are stored as DAGs) into a list of documents to index; the prior state is diffed so only the changes are re-indexed.
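
As an illustration of that flatten-and-diff idea (a sketch, not the actual ingest code; every name here is made up): treat each (path, blob id) pair in the current tree as a document, and diff it against the previously indexed state so only changed files are touched:

    from typing import Dict, Set, Tuple

    def diff(prev: Dict[str, str], curr: Dict[str, str]) -> Tuple[Set[str], Set[str]]:
        """Given path -> blob-id maps for the previously indexed state
        and the current tree, return (paths to (re)index, paths to delete)."""
        to_index = {p for p, blob in curr.items() if prev.get(p) != blob}
        to_delete = set(prev) - set(curr)
        return to_index, to_delete

    prev = {"src/main.rs": "b1", "README.md": "b2"}
    curr = {"src/main.rs": "b3", "docs/guide.md": "b4"}
    print(diff(prev, curr))
    # ({'src/main.rs', 'docs/guide.md'}, {'README.md'})  (set order may vary)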

> Is your custom index format binary? I have no idea whether that's standard practice or whether a compressed text format is enough. I'd guess that non-binary formats would be enormous, though, and since an index is by nature fairly unique data, it probably wouldn't compress that well.

Binary formats are normal; posting lists are giant sorted blocks of numbers, so there are many techniques for compressing them. Lucene's index format is pretty well documented if you're interested in learning more (interestingly, Lucene also has a plain-text codec for debugging: https://lucene.apache.org/core/8_6_3/codecs/org/apache/lucen...).
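
To make the compression point concrete, here's a minimal sketch of delta-plus-varint coding, one of the classic posting-list techniques (real codecs like Lucene's use fancier block encodings, but the idea is the same):

    # Sorted doc ids become small gaps under delta coding, and small
    # integers fit in one or two bytes with a varint (7 payload bits
    # per byte, high bit = "more bytes follow").
    def encode(doc_ids):
        out, prev = bytearray(), 0
        for doc_id in doc_ids:      # doc_ids must be sorted ascending
            gap, prev = doc_id - prev, doc_id
            while gap >= 0x80:
                out.append((gap & 0x7F) | 0x80)
                gap >>= 7
            out.append(gap)
        return bytes(out)

    def decode(data):
        doc_ids, prev, gap, shift = [], 0, 0, 0
        for byte in data:
            gap |= (byte & 0x7F) << shift
            if byte & 0x80:
                shift += 7
            else:
                prev += gap
                doc_ids.append(prev)
                gap, shift = 0, 0
        return doc_ids

    ids = [3, 7, 128, 130, 100000]
    assert decode(encode(ids)) == ids
    print(len(encode(ids)))  # 7 bytes instead of 5 * 8-byte integers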

> Scale up in terms of what? Is the current system not indexing all of GitHub, or do you mean you want to index more things (e.g. commits, PRs, etc.)?

It's not indexing all of GitHub yet, nor do all users have access yet. Those are the things we are focusing on now. In the future, we want to support indexing branches.


Cool, thanks for all your answers!



