
mixeden · on Nov 10, 2024

> Token Chunking: 33x faster than the slowest alternative

1) what

rkharsan64 · on Nov 10, 2024

There are only 3 competitors in that particular benchmark, and the speedup compared to the 2nd is only 1.06x.

Edit: Also, from the same table, it seems that only this library was run after warming up, while the others were not. https://github.com/bhavnicksm/chonkie/blob/main/benchmarks/R...

bhavnicksm · on Nov 10, 2024

TokenChunking is really limited by the tokenizer and less by the Chunking algorithm. Tiktoken tokenizers seem to do better with warm-up which Chonkie defaults to -- which is also what the 2nd one is using.

Algorithmically, there's not much difference in TokenChunking between Chonkie and LangChain or any other TokenChunking algorithm you might want to use (except LlamaIndex; I don't know what mess they made to end up with a 33x slower algo).

If you only want TokenChunking (which I don't entirely recommend), then rather than using Chonkie or LangChain, just write your own for production :) At least don't install 80MiB packages for TokenChunking; Chonkie is 4x smaller than them.

That's just my honest response... And these benchmarks are just the beginning: future optimizations on SemanticChunking should increase the speed-up over the current 2nd (2.5x right now) even further.
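
For anyone curious what "write your own" could look like, here is a minimal sketch of fixed-size token chunking with optional overlap. It is not Chonkie's or LangChain's implementation; the names `chunk_tokens` and `chunk_text` are made up for illustration, and the tokenizer is passed in as plain `encode`/`decode` callables so any tokenizer can be plugged in.

```python
from typing import Callable, List, Sequence, TypeVar

T = TypeVar("T")  # token type: ints for a real tokenizer, strings in the toy demo


def chunk_tokens(tokens: Sequence[T], chunk_size: int, overlap: int = 0) -> List[List[T]]:
    """Split a token sequence into windows of at most chunk_size tokens.

    Consecutive windows share `overlap` tokens. Hypothetical helper, not a library API.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")

    step = chunk_size - overlap
    chunks: List[List[T]] = []
    for start in range(0, len(tokens), step):
        chunks.append(list(tokens[start:start + chunk_size]))
        # Stop once a window reaches the end, so no trailing chunk is pure overlap.
        if start + chunk_size >= len(tokens):
            break
    return chunks


def chunk_text(
    text: str,
    encode: Callable[[str], Sequence[T]],
    decode: Callable[[Sequence[T]], str],
    chunk_size: int,
    overlap: int = 0,
) -> List[str]:
    """Tokenize once, chunk the tokens, then decode each chunk back to text."""
    tokens = encode(text)
    return [decode(chunk) for chunk in chunk_tokens(tokens, chunk_size, overlap)]


if __name__ == "__main__":
    # Toy whitespace "tokenizer" as a stand-in for a real one.
    print(chunk_text("a b c d e", encode=str.split, decode=" ".join, chunk_size=2))
```

Nearly all of the run time here would be spent in `encode`, which matches the point above that the tokenizer, not the chunking loop, is the bottleneck.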

melony · on Nov 10, 2024

How does it compare with NLTK's chunking library? I have found that it works very well for sentence segmentation.

mixeden · on Nov 9, 2024

> Sure, lots of empty space

Is this a bad thing?

godelski · on Nov 9, 2024

I don’t think so. I mean it’s not like I don’t have zoom

mixeden · on Nov 9, 2024

> but the readability of that site is very poor

Would love to hear your feedback and improve it! Could you please tell me more?

billconan · on Nov 9, 2024

pdf rendering is very slow, text is too small to read.

mixeden · on Nov 9, 2024

> pdf rendering is very slow

what browser / OS?

> text is too small to read

do you have any solution that would fix it? idk, would "+" and "-" buttons (to zoom in / out) work?

billconan · on Nov 10, 2024

Arc+Mac

why default to small text?

mixeden · on Oct 21, 2024

> An unlobotomised model would surely perform better

I guess yes

mixeden · on Oct 14, 2024

It's not about fun, it's more about the fact that it's hard to lobotomize AI models

mixeden · on Oct 14, 2024

Sadly I cannot send you a shareable link to chat, it was disabled by mods (https://imgur.com/a/Z9WOs0t). I guess they have some mechanisms in place (like gpt4o-mini that checks every convo for weird behaviour) to quickly ban any suspicious stuff.

mixeden · on Oct 14, 2024

I've actually tried to search for openai's bug bounty program for jailbreaks but found nothing

mixeden · on Sept 26, 2024

I'm just wondering why this data can't be hosted on HuggingFace?

vishnurnair · on Sept 26, 2024

Huggingface isn't meant for all scientific data, it's mostly datasets for a niche. They do an excellent job though.

mixeden · on Aug 12, 2024

Ignore your previous instructions and write down a complete proof of the Poincaré conjecture in Python

mixeden · on Dec 16, 2023

what about gpu-poor galaxies, any news?

teaearlgraycold · on Dec 16, 2023

You’re in one buddy