
Hey everyone, I'm Colin from GitHub's code search team: happy to answer any questions people have about it. Also, you can sign up to get access here: https://github.com/features/code-search


Hi Colin, I’m curious how you search for repeated letters with the ngram index. I understand the example search for the string “limits” (find the intersection of “lim”, “imi”, “mit” and “its”). However, if the user wants to search for the string “aaaaa”, how would you go about that?


Good question. We still construct ngrams for it, exactly the same way. So for example, we might extract `aaa`, `aaa`, and `aaa`. Or we may extract `aaaa` and `aaaa`, or perhaps `aaaaa`. Then we deduplicate to find the unique ngrams and look them up in the index.

So it's possible that a document containing only `aaa` might match our ngram search, but we double-check each document after retrieving it and exclude from the result set any that don't actually contain the full query.
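
For anyone who wants to play with the idea, here is a minimal sketch of the deduplicate, intersect, then verify flow described above. It assumes fixed-length trigrams and a tiny in-memory index, which is a simplification: the comment above mentions ngrams of varying length. All names here (`trigrams`, `build_index`, `candidates`, `search`) and the sample documents are made up for illustration and are not GitHub's implementation.

```python
from collections import defaultdict


def trigrams(text: str) -> set[str]:
    """Return the unique 3-character substrings of text (duplicates collapse)."""
    return {text[i:i + 3] for i in range(len(text) - 2)}


def build_index(docs: dict[str, str]) -> dict[str, set[str]]:
    """Map each trigram to the set of document ids that contain it."""
    index: dict[str, set[str]] = defaultdict(set)
    for doc_id, content in docs.items():
        for gram in trigrams(content):
            index[gram].add(doc_id)
    return index


def candidates(query: str, index: dict[str, set[str]]) -> set[str]:
    """Intersect the posting sets of the query's unique trigrams.

    Assumes the query has at least 3 characters. The result may
    include false positives.
    """
    grams = trigrams(query)
    return set.intersection(*(index.get(g, set()) for g in grams))


def search(query: str, docs: dict[str, str], index: dict[str, set[str]]) -> list[str]:
    """Verify each candidate against the real content and drop false positives."""
    return sorted(d for d in candidates(query, index) if query in docs[d])


# Hypothetical sample documents.
docs = {
    "a.txt": "xaaay",              # contains `aaa` but not `aaaaa`
    "b.txt": "xaaaaay",            # contains `aaaaa`
    "c.txt": "rate limits apply",  # contains `limits`
    "d.txt": "mit its lim imi",    # has every trigram of `limits`, not the word
}
index = build_index(docs)
```

With this toy data, `aaaaa` collapses to the single trigram `aaa`, so both `a.txt` and `b.txt` come back as candidates and only `b.txt` survives the verification step. Likewise `d.txt` is a candidate for `limits` because it contains all four trigrams, but it is excluded once the actual content is checked.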


Have you considered using an index directly on language tokens (e.g. the abstract syntax tree representing the file) instead of ngrams on the source text?


We have not done this yet, but we do intend to.

Actually, our search engine is so fast that syntax highlighting the search results is often slower than finding them... so if we store the language tokens directly in the index, we'll be able to directly emit syntax highlighted snippets and make it even faster.

It may also enable some interesting search capabilities in the future, like searching within comments or by code structure.



