As I mentioned yesterday, I recently needed to process hundreds of low-quality images of invoices (for a construction project). I had a script that used PIL/OpenCV and pytesseract, with OpenAI as a fallback. It still had a staggering number of failures.
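For context, the pipeline was shaped roughly like this (a minimal sketch, not the actual script; the confidence cutoff, prompt, and model name are stand-ins):

```python
import base64

import cv2
import pytesseract
from openai import OpenAI

client = OpenAI()

def ocr_invoice(path: str) -> str:
    # Cheap cleanup first: grayscale + Otsu binarization
    img = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
    _, img = cv2.threshold(img, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)

    # First pass: tesseract, with per-word confidences
    data = pytesseract.image_to_data(img, output_type=pytesseract.Output.DICT)
    words = [w for w, c in zip(data["text"], data["conf"]) if w.strip() and int(c) > 0]
    confs = [int(c) for c in data["conf"] if int(c) > 0]
    if confs and sum(confs) / len(confs) >= 60:  # arbitrary cutoff, tune it
        return " ".join(words)

    # Fallback: ship the image to a vision model
    with open(path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode()
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": [
            {"type": "text", "text": "Transcribe this invoice verbatim."},
            {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{b64}"}},
        ]}],
    )
    return resp.choices[0].message.content
```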
Today I tried a handful of the really poor-quality invoices and Qwen spat out all the information I needed without an issue. What's crazier is that it gave me bounding boxes to improve tesseract.
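Roughly how the boxes help: crop each region Qwen points at and hand it to tesseract on its own, so it only has to read one field at a time. A sketch, assuming Qwen returned pixel-space boxes as JSON (the shape shown in the docstring is an assumption):

```python
import json

import cv2
import pytesseract

def ocr_regions(image_path: str, qwen_json: str) -> dict:
    """Run tesseract on each Qwen-reported box instead of the whole page.

    Assumes qwen_json looks like:
    [{"label": "invoice_number", "bbox": [x1, y1, x2, y2]}, ...]
    with pixel-space coordinates.
    """
    img = cv2.imread(image_path)
    results = {}
    for field in json.loads(qwen_json):
        x1, y1, x2, y2 = field["bbox"]
        crop = img[y1:y2, x1:x2]
        # --psm 7 tells tesseract to treat the crop as a single text line
        results[field["label"]] = pytesseract.image_to_string(
            crop, config="--psm 7"
        ).strip()
    return results
```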
Microsoft Vision is incredibly expensive, slow, has a ridiculous rate limit, and isn't any better than what you can run yourself. You have to make every request over HTTP (subject to that rate limit), and there is no way to do bulk jobs.
I wonder why you chose Qwen specifically. Mistral has a specialized model just for OCR that they advertised heavily (I tested it and it works surprisingly well, at least on English-language books from the 80s and 90s).
Interesting. In the past I tried to get VLMs to estimate bounding boxes of property boundaries on satellite maps, but had no success. Do you have any tips on how to improve the results?
Depends on the model, but e.g. [1] found many models perform better if you are more polite. Though interestingly, being rude can also sometimes improve performance, at the cost of higher bias.
Intuitively it makes sense. The best sources tend to be either moderately polite (professional language) or 4chan-like (rude and biased, but honest).
Before GPT-5 was released I already had the feeling that the web UI responses were declining, so I started trying to get more out of them. Dissing it and saying how useless its response was did actually improve the output (I think).
The way I think of it, talking to an LLM is a bit like talking to myself or listening to an echo, since what I get back depends only on what I put in. If it senses that I'm frustrated, it will be inclined to make even more stuff up in an attempt to appease me, so that gets me nowhere.
I've found it more useful to keep it polite and "professional" and restart the conversation if we've begun going around in circles.
And besides, if I make a habit of behaving badly with LLMs, there's a good chance that I'll do it without thinking at some point and get in trouble.
I like to test these models on reading the contents of screenshots of '80s Apple ][ games. These are very low resolution and very dense. All (free-to-use) models struggle on that task...
I've tried that too, detecting the scan layout to get better OCR, but it didn't really beat a fine-tuned Qwen 2.5 VL 7B. I'd say fine-tuning is the way to go.
What's the cost of the fine-tuned model? If you were attempting to optimize for cost, would it be worth it to detect scan layouts to get better OCR?
Honestly, I'm such a noob in this space. I had one project I needed to do and didn't want to do it by hand, which would have taken 2 days, so I spent 5 trying to get a script to do it for me.
The model runs on an H200 in ~20s, costing about $2.40/hr. On an L4 it's cheaper at ~$0.30/hr but takes ~85s to finish. Overall, the H200 ends up cheaper at volume. My scan has a separate issue though: each page has two columns, so text from the right side sometimes overflows into the left. OCR can't really tell where sentences start and end unless the layout is split by column.
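For what it's worth, a column split can often be done with plain OpenCV before OCR. A sketch, assuming a reasonably clean vertical gutter somewhere in the middle third of the page:

```python
import cv2
import numpy as np
import pytesseract

def ocr_two_columns(path: str) -> str:
    img = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
    # Invert so ink is white, then count ink per pixel column
    _, binary = cv2.threshold(img, 0, 255, cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)
    ink_per_column = binary.sum(axis=0)

    # The gutter is the emptiest column, searched in the middle third only
    third = img.shape[1] // 3
    split = third + int(np.argmin(ink_per_column[third : 2 * third]))

    # OCR each half separately so lines can't bleed across the gutter
    left = pytesseract.image_to_string(img[:, :split])
    right = pytesseract.image_to_string(img[:, split:])
    return left + "\n" + right
```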
Jumping off this for visibility: LM Studio really is the best option out there. Ollama is another runtime I've used, but I've found it makes too many assumptions about what a computer is capable of, and it's almost impossible to override those settings. It often overloads weaker computers and underestimates stronger ones.
LM Studio isn't as "set it and forget it" as Ollama is, and it does have a bit of a learning curve. But if you're doing any kind of AI development and you don't want to mess around with writing llama-cpp scripts all the time, it really can't be beat (for now).
I wouldn't recommend using anything that can transmit data back to the CCP. The model itself is fine since it's open source (and you can run it firewalled if you're really paranoid), but directly using Alibaba's AI chat website should be discouraged.
I should add that sometimes LM Studio just feels better for the use case: same model, same purpose, seemingly different output, usually when RAG is involved. But Anything is definitely a very intuitive visual experience.
Any tips on getting bounding boxes? The model doesn't seem to even understand the original size of the image. And even if I provide the dimensions, the positioning is off. :'(
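For anyone hitting the same wall: the usual cause is that the model reasons over a resized copy of the image, and some Qwen variants emit coordinates on a 0-1000 normalized grid, so the boxes have to be mapped back onto the original resolution. A sketch:

```python
def scale_bbox(bbox, model_size, original_size):
    """Map a box from the model's coordinate space back onto the source image.

    bbox: (x1, y1, x2, y2) as reported by the model
    model_size: (width, height) of the image the model actually saw,
                or (1000, 1000) if it uses a normalized 0-1000 grid
    original_size: (width, height) of your source image
    """
    sx = original_size[0] / model_size[0]
    sy = original_size[1] / model_size[1]
    x1, y1, x2, y2 = bbox
    return (int(x1 * sx), int(y1 * sy), int(x2 * sx), int(y2 * sy))

# A box reported on a 0-1000 grid, mapped onto a 2480x3508 (A4 @ 300dpi) scan
print(scale_bbox((100, 200, 400, 260), (1000, 1000), (2480, 3508)))
```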
It did not, unfortunately. When CV failed, GPT-4o failed as well. I even had a list of valid invoice numbers and dates to help the models. Still, most failed.
Did you try few-shotting examples when you hit problem cases? In my ziploc case, the model was failing when red sharpie was used vs. black. A few-shot hint fixed that.
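The hint was structured roughly like this (a hypothetical sketch in OpenAI-style chat messages; the URLs and transcriptions are made up):

```python
# Hypothetical few-shot structure: one labeled example per failure mode
# (red sharpie, black sharpie), then the image you actually want read.
# The URLs and transcriptions below are placeholders.
red_example = "https://example.com/red_sharpie_bag.jpg"
black_example = "https://example.com/black_sharpie_bag.jpg"
target = "https://example.com/new_bag.jpg"

messages = [
    {"role": "user", "content": [
        {"type": "text", "text": "Example: this bag is labeled in RED sharpie."
                                 " Transcription: 'chicken stock 2024-03-01'"},
        {"type": "image_url", "image_url": {"url": red_example}},
    ]},
    {"role": "user", "content": [
        {"type": "text", "text": "Example: this bag is labeled in BLACK sharpie."
                                 " Transcription: 'ground beef 1 lb'"},
        {"type": "image_url", "image_url": {"url": black_example}},
    ]},
    {"role": "user", "content": [
        {"type": "text", "text": "Now transcribe this bag the same way."},
        {"type": "image_url", "image_url": {"url": target}},
    ]},
]
```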
Tbh, I had run the images through a few filters (roughly the sketch below). The images that went through to AI were high-contrast black and white, with noise such as highlighter removed. I had tried one-shot and few-shot.
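The highlighter removal was along these lines (a sketch; the HSV saturation cutoff is a guess you would tune per pen color):

```python
import cv2
import numpy as np

def remove_highlighter(path: str) -> np.ndarray:
    img = cv2.imread(path)
    hsv = cv2.cvtColor(img, cv2.COLOR_BGR2HSV)

    # Mask saturated pixels (colored highlighter ink). Black text and white
    # paper both have low saturation, so they survive the mask.
    mask = cv2.inRange(hsv, np.array([0, 80, 80]), np.array([179, 255, 255]))
    img[mask > 0] = (255, 255, 255)  # paint highlighted pixels white

    # Then binarize for OCR
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
    _, bw = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
    return bw
```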
I think it was largely a formatting issue. Like some of these invoices have nonsense layouts. Perhaps Qwen works well because it doesn't assume left to right, top to bottom? Just speculating though
I'm very surprised. I've dealt with some really ugly inputs (handwritten text on full ziploc bags, stained and torn handwritten recipe cards, etc.) with super good success.