Talking to users of AI dev tools, we've noticed a great divide in preference between chat and tab-autocomplete. Team Chat are bothered by the invasiveness of always-present inline suggestions and prefer a convenient chat sidebar that works like a faster Google search. Team Tab feel that chat is useless in areas where they already know all the relevant syntax and libraries, but get great utility from an LLM finishing the line before they can type it themselves.
What I find most interesting is the difference in what makes each difficult to build. For chat, the model is the hard part. The best chat models are typically those with more parameters. Once you have the model running, a decent chat experience is roughly a matter of displaying text. But while open models increasingly prove useful, they may not always fit on a laptop.
For tab, a small model will do (1-15b parameters). This is great because it works locally, but the difficult part becomes building the system around the model: timing, context, and filtering. And while Copilot might afford ~1000 prompt tokens, a fast enough local experience today probably needs to remain around 500, making precision critical.
Spending the last two months building local autocomplete has taught us the following lessons:
The overall goal is to design a function `(filepath, position of cursor) -> completion` that can run in ~500ms. There are 3 key subproblems: 1) timing (when to start the LLM), 2) context (what prompt to send), and 3) filtering (when to stop).
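As a rough sketch of what we're aiming for (the types and names below are illustrative, not Continue's actual interfaces), the whole pipeline boils down to one async function:

```typescript
// Illustrative types only; names are hypothetical, not Continue's API.
interface CursorPosition {
  line: number;      // zero-based line index
  character: number; // zero-based column index
}

// Given a file and a cursor, return the completion text to render inline,
// or undefined to show nothing. The whole call should fit in ~500ms.
type TabAutocomplete = (
  filepath: string,
  position: CursorPosition
) => Promise<string | undefined>;
```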
1) Timing consists of lots of debouncing and caching. If the user types quickly, a request shouldn't be made on each keystroke. And if the LLM is already generating from a previous keystroke, that request should be kept alive as long as what the user types next still matches the beginning of its output. Also important is caching: once a completion is generated, any new prompt that just extends a previous prompt by a prefix of its completion should resolve from the cache to the rest of that completion, making for a snappy experience.
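Here's a minimal sketch of the debounce-and-cache logic, assuming a hypothetical `generate(prompt)` function that calls the LLM; the class, names, and defaults are illustrative, not Continue's actual implementation:

```typescript
// Hypothetical sketch: debounce keystrokes and reuse cached completions
// when the new prompt merely extends an old one by text the user typed.
class CompletionScheduler {
  private cache = new Map<string, string>(); // prompt -> completion
  private latestRequestId = 0;

  constructor(
    private generate: (prompt: string) => Promise<string>, // assumed LLM call
    private debounceMs = 150
  ) {}

  async request(prompt: string): Promise<string | undefined> {
    // Cache hit: the user has typed part of an earlier completion, so we
    // can return the remainder without calling the model at all.
    for (const [oldPrompt, completion] of this.cache) {
      if (prompt.startsWith(oldPrompt)) {
        const typedSince = prompt.slice(oldPrompt.length);
        if (
          completion.startsWith(typedSince) &&
          completion.length > typedSince.length
        ) {
          return completion.slice(typedSince.length);
        }
      }
    }

    // Debounce: only the most recent keystroke survives the pause.
    const requestId = ++this.latestRequestId;
    await new Promise((resolve) => setTimeout(resolve, this.debounceMs));
    if (requestId !== this.latestRequestId) return undefined; // superseded

    const completion = await this.generate(prompt);
    this.cache.set(prompt, completion);
    return completion;
  }
}
```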
2) Context requires designing a function `(filepath, position of cursor) -> ~500 token prompt` that runs in ~100ms. The tokens in this prompt can come from any file in the codebase, making the problem very open-ended. Faced with a problem of arbitrary depth like this, the first thing to ask is: how does a human developer do it?
First, we use cmd/ctrl+click to go to definition. Tree-sitter lets us quickly construct the abstract syntax tree around the cursor, and if we find, for example, that the cursor is inside of a function call, then we probably want to include the definition of that function in the prompt. The LSP (Language Server Protocol), which is what powers 'go to definition' in VS Code, can be used for this lookup. We can also include relevant type definitions, declarations, or imported files.
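As a hedged sketch of the AST step, here's what walking up from the cursor to an enclosing call might look like with the web-tree-sitter package (the grammar path, the node type, and the hand-off to the language server are assumptions; Continue's actual context engine may differ):

```typescript
import Parser from "web-tree-sitter";

// Walk up from the node under the cursor to an enclosing call expression.
// A real implementation would then resolve the callee to its definition
// (e.g. via the language server's "go to definition") and add that
// definition to the prompt.
async function findEnclosingCallee(
  source: string,
  row: number,
  column: number
): Promise<string | undefined> {
  await Parser.init();
  const parser = new Parser();
  // Assumed grammar file: ship the .wasm for the file's language.
  const language = await Parser.Language.load("tree-sitter-typescript.wasm");
  parser.setLanguage(language);

  const tree = parser.parse(source);
  let node: Parser.SyntaxNode | null =
    tree.rootNode.descendantForPosition({ row, column });

  // "call_expression" is the node type in the TypeScript grammar;
  // other grammars name it differently.
  while (node && node.type !== "call_expression") {
    node = node.parent;
  }
  return node?.childForFieldName("function")?.text;
}
```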
Second, we look at recently edited snippets. Since only a small number can be included, we rank them by Jaccard similarity to the area around the cursor: tokenize each snippet to get a set of tokens, then divide the size of the intersection by the size of the union of the two sets. This catches similar variable names and keywords.
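In code, the ranking is only a few lines; the tokenizer below is a crude identifier split, just to illustrate the idea:

```typescript
// Crude illustration of the ranking: split text into identifier-like
// tokens, then score each snippet by Jaccard similarity to the window
// of code around the cursor.
function tokenize(text: string): Set<string> {
  return new Set(text.toLowerCase().split(/[^a-z0-9_$]+/i).filter(Boolean));
}

function jaccard(a: Set<string>, b: Set<string>): number {
  if (a.size === 0 && b.size === 0) return 0;
  let intersection = 0;
  for (const token of a) {
    if (b.has(token)) intersection++;
  }
  return intersection / (a.size + b.size - intersection); // |A∩B| / |A∪B|
}

function rankSnippets(cursorWindow: string, snippets: string[]): string[] {
  const target = tokenize(cursorWindow);
  return [...snippets].sort(
    (x, y) => jaccard(tokenize(y), target) - jaccard(tokenize(x), target)
  );
}
```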
3) Filtering acknowledges that LLMs are imperfect and puts deterministic safeguards in place to prevent annoyance. It can be understood as a mapping between streams of text: given a sequence of tokens from the LLM, we pass it through multiple filters, each returning a modified sequence. A filter might stop the LLM when it repeats itself, fix indentation, or correct mismatched parentheses. The second part of filtering is a basic classifier: once we have a final completion, should it be shown at all?
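Here's a sketch of what one such filter might look like as a stream transform, using async generators (the repetition heuristic and threshold are illustrative, not Continue's exact rules):

```typescript
// Each filter maps one async stream of text chunks to another.
type StreamFilter = (input: AsyncIterable<string>) => AsyncIterable<string>;

// Example filter: stop the stream once the same non-empty line has been
// emitted several times in a row, a common failure mode for small models.
const stopOnRepeatedLines =
  (maxRepeats = 3): StreamFilter =>
  async function* (input) {
    let buffer = "";
    let lastLine: string | null = null;
    let repeats = 0;
    for await (const chunk of input) {
      buffer += chunk;
      let newlineIndex: number;
      while ((newlineIndex = buffer.indexOf("\n")) !== -1) {
        const line = buffer.slice(0, newlineIndex);
        buffer = buffer.slice(newlineIndex + 1);
        repeats = line !== "" && line === lastLine ? repeats + 1 : 0;
        lastLine = line;
        if (repeats >= maxRepeats) return; // cut the completion short
        yield line + "\n";
      }
    }
    if (buffer) yield buffer; // flush a trailing partial line
  };
```

Because every filter shares the same signature, they compose by simply wrapping one stream in the next before the result reaches the editor.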
What's amazing is that almost all of this work can and should happen locally. As laptop hardware becomes more powerful and autocomplete models are compressed even further, it seems likely that we'll wonder why autocomplete ever needed to leave your machine. If you want to see what the local experience looks like today, Continue's local, open-source tab-autocomplete is now in beta. We'd love feedback!