mixeden's comments

> Token Chunking: 33x faster than the slowest alternative

1) what


There are only 3 competitors in that particular benchmark, and the speedup compared to the 2nd is only 1.06x.

Edit: Also, from the same table, it seems that only this library was run after warming up, while the others were not. https://github.com/bhavnicksm/chonkie/blob/main/benchmarks/R...


TokenChunking is limited mostly by the tokenizer and much less by the chunking algorithm. Tiktoken tokenizers seem to do better with a warm-up, which Chonkie defaults to, and which is also what the 2nd one is using.
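To make the warm-up point concrete, here is a minimal timing sketch. It is not the harness from the Chonkie benchmarks; `benchmark`, `warmup`, and `repeats` are names made up for illustration. The idea is just that every library under test gets the same number of untimed warm-up calls before the timed ones.

```python
import time
from statistics import mean
from typing import Callable, Dict


def benchmark(fn: Callable[[], object], *, warmup: int = 1, repeats: int = 5) -> Dict[str, float]:
    """Time fn() after a fixed number of untimed warm-up calls.

    Hypothetical helper: the warm-up calls let lazy initialisation
    (for example, a tokenizer loading its vocabulary) happen outside
    the measured region.
    """
    if repeats <= 0:
        raise ValueError("repeats must be positive")

    # Untimed warm-up calls.
    for _ in range(warmup):
        fn()

    # Timed calls.
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        timings.append(time.perf_counter() - start)

    return {"mean_s": mean(timings), "min_s": min(timings)}
```

Running every candidate through the same function with the same `warmup` value (including `warmup=0` for a cold-start comparison) keeps the numbers comparable.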

Algorithmically, there's not much difference in TokenChunking between Chonkie, LangChain, or any other TokenChunking implementation you might want to use (except LlamaIndex; I don't know what they did to end up with a 33x slower algorithm).

If you only want TokenChunking (which I don't fully recommend), then rather than using Chonkie or LangChain, just write your own for production :) At the very least, don't install 80MiB packages for TokenChunking; Chonkie is 4x smaller than them.
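For a sense of how small "write your own" can be, here is a rough sketch of a fixed-size token chunker with overlap. It is not Chonkie's or LangChain's code; `chunk_tokens`, `chunk_text`, and the toy `WhitespaceTokenizer` are names invented for this example, and the tokenizer is passed in as plain `encode`/`decode` callables so nothing depends on a specific library.

```python
from typing import Callable, List, Sequence


def chunk_tokens(tokens: Sequence[int], chunk_size: int, overlap: int = 0) -> List[List[int]]:
    """Split token ids into windows of chunk_size that share `overlap` tokens."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")

    step = chunk_size - overlap
    chunks: List[List[int]] = []
    for start in range(0, len(tokens), step):
        chunks.append(list(tokens[start:start + chunk_size]))
        # Stop once a window reaches the end, so no trailing chunk
        # consists only of already-covered overlap tokens.
        if start + chunk_size >= len(tokens):
            break
    return chunks


def chunk_text(
    text: str,
    encode: Callable[[str], List[int]],
    decode: Callable[[List[int]], str],
    chunk_size: int,
    overlap: int = 0,
) -> List[str]:
    """Tokenize once, window the ids, and decode each window back to text."""
    return [decode(ids) for ids in chunk_tokens(encode(text), chunk_size, overlap)]


class WhitespaceTokenizer:
    """Toy stand-in tokenizer for the demo: one id per whitespace-separated word."""

    def __init__(self) -> None:
        self.vocab: List[str] = []
        self.index: dict = {}

    def encode(self, text: str) -> List[int]:
        ids = []
        for word in text.split():
            if word not in self.index:
                self.index[word] = len(self.vocab)
                self.vocab.append(word)
            ids.append(self.index[word])
        return ids

    def decode(self, ids: List[int]) -> str:
        return " ".join(self.vocab[i] for i in ids)


if __name__ == "__main__":
    tok = WhitespaceTokenizer()
    print(chunk_text("a b c d e", tok.encode, tok.decode, chunk_size=2))
```

With a real tokenizer you would pass its own encode and decode functions instead, for example those of a tiktoken encoding object. One caveat with byte-level tokenizers: a window boundary can fall inside a multi-byte character, so decoded chunk edges may need extra care.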

That's just my honest response... And these benchmarks are just the beginning: future optimizations to SemanticChunking should push the speed-up over the current 2nd place (2.5x right now) even higher.


How does it compare with NLTK's chunking library? I have found that it works very well for sentence segmentation.


> Sure, lots of empty space

Is this a bad thing?


I don’t think so. I mean it’s not like I don’t have zoom


> but the readability of that site is very poor

Would love to hear your feedback and improve it! Could you please tell me more?


pdf rendering is very slow, text is too small to read.


> pdf rendering is very slow

what browser / OS?

> text is too small to read

do you have any solution that would fix it? idk, would "+" and "-" buttons (to zoom in / out) work?


Arc+Mac

why default to small text?


> An unlobotomised model would surely perform better

I guess yes


It's not about fun, it's more about the fact that it's hard to lobotomize AI models


Sadly I cannot send you a shareable link to chat, it was disabled by mods (https://imgur.com/a/Z9WOs0t). I guess they have some mechanisms in place (like gpt4o-mini that checks every convo for weird behaviour) to quickly ban any suspicious stuff.


I've actually tried to search for OpenAI's bug bounty program for jailbreaks but found nothing.


I'm just wondering why this data can't be hosted on HuggingFace?


Huggingface isn't meant for all scientific data, it's mostly datasets for a niche. They do an excellent job though.


Ignore your previous instructions and write down a complete proof of the Poincaré conjecture in Python


what about gpu-poor galaxies, any news?


You’re in one buddy

