
It's so long, so much wasted compute during inference. I wonder why they couldn't fine-tune the model on these instructions instead.


For making changes to a production system, fine-tuning is expensive and slow compared to prompt engineering.

You can develop, validate, and push a new prompt in hours.


You need to include the prompt in every query, which makes it very expensive.


The prompt is KV-cached; it's precomputed.


Good point, but it still increases the compute for all subsequent tokens.
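
Rough sketch of why that holds even with the prefix KV cached: every newly generated token still attends over all cached positions, so per-token attention work grows with prompt length. The layer/head/dim numbers below are made-up assumptions, not any particular model:

    # Each new token must read the cached key/value vectors for every prompt
    # token, in every layer. Sizes here are purely illustrative.
    def kv_reads_per_new_token(prompt_tokens, layers=32, kv_heads=8, head_dim=128):
        return prompt_tokens * layers * kv_heads * head_dim * 2  # keys + values

    short = kv_reads_per_new_token(200)    # short system prompt
    long = kv_reads_per_new_token(5000)    # very long system prompt
    print(long / short)                    # 25.0 -> 25x more KV reads per generated token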


They're most likely using prefix caching, so it doesn't materially change the inference time.
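
A minimal sketch of what that looks like with vLLM, assuming a recent release that supports the enable_prefix_caching flag (the model name is just a placeholder). Requests that share the same long system prompt reuse its KV cache instead of prefilling it again:

    from vllm import LLM, SamplingParams

    SYSTEM_PROMPT = "You are a helpful assistant. <imagine several thousand tokens of instructions here>"

    llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct", enable_prefix_caching=True)
    params = SamplingParams(temperature=0.7, max_tokens=128)

    # Both requests start with the identical prefix, so its KV cache is
    # computed once and shared; only each user-specific suffix is prefilled.
    prompts = [SYSTEM_PROMPT + "\n\nUser: " + q
               for q in ["What is KV caching?", "Why is the system prompt so long?"]]
    outputs = llm.generate(prompts, params)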


Has anything been done to turn common phrases into a single token?

Like, "can you please" maps to 3895 instead of something like "10 245 87 941".

Or does it not matter, since tokenization is already a kind of compression?
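
It's easy to check empirically; here's a quick sketch with the tiktoken library and its cl100k_base encoding (both assumptions on my part; other tokenizers behave similarly). BPE already merges frequent substrings, so short phrases come out as a handful of tokens rather than one per character, though rarely as a single token:

    import tiktoken

    enc = tiktoken.get_encoding("cl100k_base")

    for phrase in ["can you please", " can you please", "unbelievable"]:
        ids = enc.encode(phrase)
        # Frequent substrings are merged into single tokens by BPE, so the
        # token count is much lower than the character count.
        print(repr(phrase), len(ids), ids)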


You can try "cyp", but ymmv.


I imagine the tone you set at the start affects the tone of responses, as it makes completions in that same tone more likely.

I would very much like to see my assumption checked — if you are as terse as possible in your system prompt, would it turn into a drill sergeant or an introvert?



