Harrisonv's comments | Hacker News

Perfect recall is often a function of the architecture allowing data to bleed through linkages. You can increase perfect token recall through dilated WaveNet structures, or, in the case of v5, through multi-head linear attention, which creates multiple pathways where information can skip forward in time.
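A quick back-of-the-envelope on the dilated part (illustrative Python; the kernel size and layer counts are made-up, not any real model's config): stacking causal convolutions with dilations 1, 2, 4, ... grows the receptive field exponentially with depth, so a token far in the past keeps a direct path forward.

    # Receptive field of a stack of dilated causal convolutions (WaveNet-style).
    # With kernel size k and dilations 1, 2, 4, ..., 2^(L-1), the receptive
    # field is 1 + (k - 1) * (2^L - 1): exponential in depth.

    def receptive_field(kernel_size: int, num_layers: int) -> int:
        field = 1
        for layer in range(num_layers):
            field += (kernel_size - 1) * (2 ** layer)
        return field

    for layers in (4, 8, 12):
        print(layers, receptive_field(kernel_size=2, num_layers=layers))
    # 4 -> 16, 8 -> 256, 12 -> 4096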


Try rwkv-demo-api.recursal.ai if you want to try it and don't want to wait for Gradio.


Thank you! I'm not experienced with 7B models.

3 things stand out to me:

- it's absolutely not usable for the kinds of use cases I solve with GPT-4 (code generation, information retrieval)

- it could technically swallow a 50-page PDF, but it's not able to answer questions about it (inference speed was good, but the content was garbage)

- it is OK for chatting and translations ("how is your day?")


I also tried the demo and found it pretty much useless at most things, even compared to a small 7B transformer model like Mistral.

From my admittedly quick tests, it clearly knows less than Mistral, hallucinates much more, doesn't follow instructions, and has weaker reasoning capabilities; asking it to translate a Japanese text into English gave me a badly translated summary instead of the full translation.

I don't see how this is soaring past transformers when it's clearly unable to do any of the useful tasks you can use a transformer model for today...


As written in the post, it is a base model with light instruction tuning, i.e. akin to Llama 2, not Llama 2-chat. You should evaluate it as a base model; if you evaluate it as a chat model, of course it will perform horribly.


OK, I missed that. Thanks for the clarification.


For linear transformers, the current metric is "perfect token recall" (PTR): the ability of the model to recall a randomized sequence of data. You can find the limit of a particular architecture by training a model of a particular size to echo randomized data; I believe this was touched on in the Zoology paper.

This doesn't prevent the model from retaining sequences or information beyond this metric, as information can easily be compressed in the state, but anything within that window can be perfectly recalled by the model.

Internal testing has placed the value for Eagle around the 2.5k PTR mark, while community fine-tunes done on the partial checkpoints for long-distance information gathering and memorization have been shown to easily dwarf that.
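For the curious, here is a minimal sketch of how such a PTR probe can be run (the model.generate interface is hypothetical, standing in for whatever inference API you have):

    import random

    # Perfect-token-recall probe: feed the model a random token sequence,
    # ask it to echo the sequence back, and measure the longest prefix it
    # reproduces exactly. `model.generate` is a hypothetical interface.

    def perfect_token_recall(model, vocab_size, seq_len, trials=10):
        total = 0
        for _ in range(trials):
            sequence = [random.randrange(vocab_size) for _ in range(seq_len)]
            output = model.generate(prompt=sequence, max_new_tokens=seq_len)
            matched = 0
            for want, got in zip(sequence, output):
                if want != got:
                    break
                matched += 1
            total += matched
        return total / trials  # average perfectly recalled prefix length

Sweep seq_len upward and the average recalled prefix flattens out around the architecture's PTR limit.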

Prompt processing speed benefits from the same GEMM optimizations as standard transformers, with the extra benefit that those optimizations work for batch inference as well (no need for vLLM, as memory allocation is static per agent).
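Rough numbers on why static allocation matters (dimensions are illustrative, not Eagle's actual config): a recurrent state costs fixed memory per sequence regardless of context length, while a transformer KV cache grows with it, which is exactly what schedulers like vLLM exist to manage.

    # Illustrative per-sequence memory, fp16, made-up dimensions.
    BYTES = 2
    layers, heads, head_dim = 32, 32, 128

    def rnn_state_bytes():
        # fixed matrix-valued state per layer, independent of context length
        return layers * heads * head_dim * head_dim * BYTES

    def kv_cache_bytes(context_len):
        # keys + values for every past token
        return layers * 2 * context_len * heads * head_dim * BYTES

    print(f"recurrent state: {rnn_state_bytes() / 1e6:.0f} MB (any context)")
    for ctx in (2048, 32768):
        print(f"kv cache @ {ctx} tokens: {kv_cache_bytes(ctx) / 1e6:.0f} MB")
    # the recurrent state stays ~34 MB; the KV cache hits ~1 GB at 2k
    # tokens and ~17 GB at 32k with these (made-up) dimensions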

