Hey everyone, I'm Colin from GitHub's code search team: happy to answer any ques...

Royaljj · on Feb 7, 2023

Hi Colin, I’m curious how you search for repeated letters using the ngram index. I understand the example search for the string “limits” (find the intersection of “lim”, “imi”, “mit” and “its”). However, if the user wants to search for the string “aaaaa”, how would you go about that?

colin353 · on Feb 7, 2023

Good question. We still construct ngrams for it, exactly the same way. So for example, we might extract `aaa`, `aaa`, and `aaa`. Or we may extract `aaaa` and `aaaa`, or perhaps `aaaaa`. Then we deduplicate to find the unique ngrams and look them up in the index.

So it's possible that a document containing only `aaa` (and not the full `aaaaa`) might match our ngram search, but we double check the retrieved documents and exclude any false matches from the result set.
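
To make the deduplicate, look up, and double-check steps concrete, here is a minimal sketch. It is not GitHub's implementation: it assumes fixed-length trigrams (the reply above notes that longer ngrams such as `aaaa` may also be extracted), an in-memory dictionary as the index, and a plain substring test as the verification step. The names `trigrams`, `build_index`, `candidates`, and `search` are hypothetical, as are the sample documents.

```python
from collections import defaultdict


def trigrams(text: str) -> set[str]:
    """Return the unique 3-character substrings of text (duplicates collapse)."""
    return {text[i:i + 3] for i in range(len(text) - 2)}


def build_index(docs: dict[int, str]) -> dict[str, set[int]]:
    """Map each trigram to the set of document IDs that contain it."""
    index: dict[str, set[int]] = defaultdict(set)
    for doc_id, content in docs.items():
        for gram in trigrams(content):
            index[gram].add(doc_id)
    return index


def candidates(index: dict[str, set[int]], query: str) -> set[int]:
    """Documents containing every trigram of the query; may include false positives."""
    grams = trigrams(query)
    if not grams:
        raise ValueError("query must be at least 3 characters long")
    postings = [index.get(gram, set()) for gram in grams]
    return set.intersection(*postings)


def search(docs: dict[int, str], index: dict[str, set[int]], query: str) -> list[int]:
    """Double check each candidate against its content and drop false matches."""
    return sorted(doc_id for doc_id in candidates(index, query) if query in docs[doc_id])


# Hypothetical sample documents, keyed by document ID.
docs = {
    1: "x = 'aaaaa'",
    2: "y = 'aaa'",
    3: "rate limits apply",
    4: "its time limit",
}
index = build_index(docs)

print(trigrams("aaaaa"))              # {'aaa'}: three identical trigrams, one unique
print(candidates(index, "aaaaa"))     # {1, 2}: document 2 is a false positive
print(search(docs, index, "aaaaa"))   # [1]
print(search(docs, index, "limits"))  # [3]
```

Document 4 shows that the same issue arises without repeated letters: it contains `lim`, `imi`, `mit`, and `its`, so it survives the intersection for `limits`, and only the final check removes it.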

Manfred · on Feb 7, 2023

Have you considered using an index directly on language tokens (e.g., the abstract language tree representing the file) instead of ngrams on the source text?

colin353 · on Feb 7, 2023

We have not done this yet, but we do intend to.

Actually, our search engine is so fast that syntax highlighting the search results is often slower than finding them... so if we store the language tokens directly in the index, we'll be able to directly emit syntax highlighted snippets and make it even faster.

It may also enable some interesting search capabilities in the future, like searching within comments or by code structure.