How do you think Algolia shards their data given that architecture?
My guess is that Algolia's indices are sharded by customer and each cluster probably has multiple customer indices.
Do you keep a copy of the indexed document form of repos, or is that produced on the fly during indexing?
As mentioned in the post, the index contains the full content. Our ingest process essentially flattens git repos (which are stored as DAGs) into a list of documents to index, diffing against the prior state to pick up changes.
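To make the flattening concrete, here's a rough Python sketch of the idea, not GitHub's actual ingest code: it walks a commit's tree with standard git plumbing and diffs against the previously indexed commit so only changed paths get re-indexed. The Document type and function names are illustrative assumptions.

    # Illustrative only: a real ingest pipeline would stream, batch, and
    # handle renames/deletes; the git commands themselves are standard.
    import subprocess
    from dataclasses import dataclass

    @dataclass
    class Document:          # hypothetical index document
        path: str            # file path within the repo
        blob_sha: str        # content-addressed blob id
        content: bytes       # full file content (the index stores full content)

    def flatten(repo: str, commit: str) -> list[Document]:
        """Flatten every blob reachable from `commit` into a document."""
        out = subprocess.check_output(
            ["git", "-C", repo, "ls-tree", "-r", commit], text=True)
        docs = []
        for line in out.splitlines():
            meta, path = line.split("\t", 1)
            _mode, otype, sha = meta.split()
            if otype != "blob":
                continue
            content = subprocess.check_output(
                ["git", "-C", repo, "cat-file", "blob", sha])
            docs.append(Document(path, sha, content))
        return docs

    def changed_paths(repo: str, prev: str, new: str) -> list[str]:
        """Diff against the previously indexed commit to find what changed."""
        out = subprocess.check_output(
            ["git", "-C", repo, "diff", "--name-only", prev, new], text=True)
        return out.splitlines()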
Is your custom index format binary? I have no idea whether that's standard practice or whether a compressed text format is enough. I'd guess non-binary formats would be enormous, though, and since an index is by its nature fairly unique data, it probably wouldn't compress that well.
Binary formats are the norm. Posting lists are giant sorted blocks of numbers, so there are a lot of techniques for compressing them. Lucene's index format is pretty well documented if you're interested in learning more (interestingly, Lucene has a text format for debugging: https://lucene.apache.org/core/8_6_3/codecs/org/apache/lucen...).
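As a toy illustration of why sorted posting lists compress well (this is delta-plus-varint encoding, a common technique, not Lucene's exact codec): store the gaps between consecutive doc ids, then encode each gap in as few bytes as possible.

    # Delta + varint encoding of a sorted posting list. Small gaps
    # (the common case for frequent terms) take a single byte each.
    def encode_postings(doc_ids):
        out = bytearray()
        prev = 0
        for doc_id in doc_ids:          # must be sorted ascending
            gap = doc_id - prev
            prev = doc_id
            while gap >= 0x80:          # varint: 7 bits per byte,
                out.append((gap & 0x7F) | 0x80)  # high bit means "more"
                gap >>= 7
            out.append(gap)
        return bytes(out)

    def decode_postings(data):
        doc_ids, prev, i = [], 0, 0
        while i < len(data):
            gap, shift = 0, 0
            while data[i] & 0x80:       # accumulate 7-bit groups
                gap |= (data[i] & 0x7F) << shift
                shift += 7
                i += 1
            gap |= data[i] << shift
            i += 1
            prev += gap                 # undo the delta encoding
            doc_ids.append(prev)
        return doc_ids

    # encode_postings([1000, 1003, 1007]) -> 4 bytes total, versus
    # 4-8 bytes per id stored raw.

Real codecs, Lucene's included, layer block-based schemes on top of this idea, but the gap-encoding intuition is the same.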
Scale up in terms of what? Is the current system not indexing all of GitHub, or do you mean you want to index more things (e.g. commits, PRs, etc.)?
It's not indexing all of GitHub yet, nor do all users have access yet; those are the things we're focusing on now. In the future, we want to support indexing branches.