
As I mentioned yesterday, I recently needed to process hundreds of low-quality images of invoices (for a construction project). I had a script that used PIL/OpenCV, pytesseract, and OpenAI as a fallback. It still had a staggering number of failures.

Today I tried a handful of the really poor-quality invoices and Qwen spat out all the information I needed without an issue. What's crazier is that it gave me bounding boxes to improve tesseract.
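
For reference, that kind of tesseract-first, LLM-fallback loop looks roughly like this (a minimal sketch; the confidence threshold and the ask_vision_llm helper are my illustrations, not the actual script):

    import pytesseract
    from PIL import Image

    def ocr_with_fallback(path, min_conf=60):
        img = Image.open(path)
        data = pytesseract.image_to_data(img, output_type=pytesseract.Output.DICT)
        confs = [int(c) for c in data["conf"] if int(c) >= 0]
        # keep the tesseract result if the mean word confidence is decent
        if confs and sum(confs) / len(confs) >= min_conf:
            return pytesseract.image_to_string(img)
        return ask_vision_llm(img)  # hypothetical VLM fallback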



I would recommend taking a look at this service: https://learn.microsoft.com/en-us/rest/api/computervision/re...


Microsoft Vision is incredibly expensive, has a ridiculous rate limit, is slow, and isn't any better than what you can run yourself. You have to make every request over HTTP (with that rate limit), and there is no ability to do bulk jobs.


I wonder why you chose Qwen specifically. Mistral has a specialized model just for OCR that they advertised heavily (I tested it, and it works surprisingly well, at least on English-language books from the 80s and 90s).


Mistral's model was terrible when I tested it on non-Latin characters and on anything that isn't neat printed text (e.g. handwriting).


Interesting. I have in the past tried to get bounding boxes of property boundaries on satellite maps estimated by VLLM models but had no success. Do you have any tips on how to improve the results?


With Qwen I went as stupid as I could: "Please provide the bounding box metadata for pytesseract for the above image."

And it spat it out.
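
If it helps anyone reproduce this, the boxes can then be used to crop regions before OCR, roughly like so (a sketch; the coordinates are made up, and the (left, top, right, bottom) pixel format is an assumption):

    import pytesseract
    from PIL import Image

    img = Image.open("invoice.png")
    boxes = [(40, 120, 600, 180), (40, 200, 600, 260)]  # from the VLM, illustrative
    for box in boxes:
        region = img.crop(box)  # PIL crop takes (left, upper, right, lower)
        print(pytesseract.image_to_string(region))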


It’s funny that many of us say please. I don’t think it impacts the output, but it also feels wrong without it sometimes.


Depends on the model, but e.g. [1] found that many models perform better if you are more polite. Interestingly, though, being rude can also sometimes improve performance, at the cost of higher bias.

Intuitively it makes sense: the best sources tend to be either moderately polite (professional language) or 4chan-like (rude and biased, but honest).

1: https://arxiv.org/pdf/2402.14531


When I want an LLM to be brief, I will say things like "be brief", "don't ramble", etc.

When that fails, "shut the fuck up" always seems to do the trick.


I ripped into Cursor today. It didn't change anything, but I felt better lmao


Before GPT-5 was released, I already had the feeling that the web UI responses were declining, and I started trying to get more out of them. Dissing it and saying how useless its responses were did actually improve the output (I think).


The way I think of it, talking to an LLM is a bit like talking to myself or listening to an echo, since what I get back depends only on what I put in. If it senses that I'm frustrated, it will be inclined to make even more stuff up in an attempt to appease me, so that gets me nowhere.

I've found it more useful to keep it polite and "professional" and restart the conversation if we've begun going around in circles.

And besides, if I make a habit of behaving badly with LLMs, there's a good chance that I'll do it without thinking at some point and get in trouble.


It's a good habit to build now in case AGI actually happens out of the blue.


Gemini has purpose-built post-training for bounding boxes, if you haven't tried it.

The latest update to Gemini Live does real-time bounding boxes on the objects it's talking about; it's pretty neat.
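
Prompting for boxes looks roughly like this (a sketch; the model name is illustrative, and the docs describe the boxes as [ymin, xmin, ymax, xmax] normalized to 0-1000, so you rescale to pixels yourself):

    import google.generativeai as genai
    from PIL import Image

    genai.configure(api_key="...")
    model = genai.GenerativeModel("gemini-1.5-flash")
    img = Image.open("photo.jpg")
    resp = model.generate_content(
        [img, "Return a bounding box for each object as [ymin, xmin, ymax, xmax]."]
    )
    print(resp.text)  # scale x by width/1000 and y by height/1000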


Shameless plug here for AMD's AI Dev Day: registration is open and they want feedback on what to focus on: https://www.amd.com/en/corporate/events/amd-ai-dev-day.html


Do you have some example images and the prompt you tried?


Also a documented stack setup, if you could.


I like to test these models on reading the contents of 80s Apple ][ game screenshots. These are very low resolution and very dense. All (free-to-use) models struggle with that task...


My dataset could be described in a similar way. Very low quality, very odd layouts, information density where it's completely unnecessary.

And these contractors were relatively good operators compared to most.


I've tried that too, detecting the scan layout to get better OCR, but it didn't really beat a fine-tuned Qwen 2.5 VL 7B. I'd say fine-tuning is the way to go.


What's the cost of the fine-tuned model? If you were attempting to optimize for cost, would it be worth it to detect scan layouts to get better OCR?

Honestly, I'm such a noob in this space. I had 1 project I needed to do, didn't want to do it by hand which would have taken 2 days so I spent 5 trying to get a script to do it for me.


The model runs on an H200 in ~20s, costing about $2.4/hr. On an L4 it's cheaper at ~$0.3/hr but takes ~85s to finish. Overall, the H200 ends up cheaper at volume. My scan has a separate issue though: each page has two columns, so text from the right side sometimes overflows into the left. OCR can't really tell where sentences start and end unless the layout is split by column.
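
Back-of-the-envelope per page, assuming one page per run and no batching: the H200 is $2.4/3600 × 20s ≈ $0.013 and the L4 is $0.3/3600 × 85s ≈ $0.007, so the H200 only pencils out cheaper at volume once batching raises its effective throughput. For the column problem, splitting the page before OCR can work (a sketch assuming the split falls at the horizontal midpoint, which won't hold for every layout):

    from PIL import Image

    page = Image.open("page.png")
    w, h = page.size
    left = page.crop((0, 0, w // 2, h))   # left column
    right = page.crop((w // 2, 0, w, h))  # right column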


what fine tuning approach did you use?


Just Unsloth on Colab using an A100, with the dataset on Google Drive.
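
Roughly this shape, if anyone wants to reproduce it (a sketch assuming Unsloth's FastVisionModel API; the model name and hyperparameters are illustrative, not my actual settings):

    from unsloth import FastVisionModel

    model, tokenizer = FastVisionModel.from_pretrained(
        "unsloth/Qwen2.5-VL-7B-Instruct",
        load_in_4bit=True,  # fits on a single A100
    )
    model = FastVisionModel.get_peft_model(model, r=16, lora_alpha=16)
    # ...then train with trl's SFTTrainer on the Drive-hosted dataset.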


So where did you load up Qwen, and how did you supply the PDF or photo files? I don't know how to use these models but want to learn.


LM Studio[0] is the best "i'm new here and what is this!?" tool for dipping your toes in the water.

If the model supports "vision" or "sound", that tool makes it relatively painless to take your input file + text and feed it to the model.

[0]: https://lmstudio.ai/
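
Once a vision model is loaded, LM Studio also exposes an OpenAI-compatible local server (port 1234 by default), so you can script against it; a sketch, with the model name and prompt as placeholders:

    import base64
    from openai import OpenAI

    client = OpenAI(base_url="http://localhost:1234/v1", api_key="lm-studio")
    b64 = base64.b64encode(open("invoice.png", "rb").read()).decode()
    resp = client.chat.completions.create(
        model="qwen2.5-vl-7b-instruct",
        messages=[{"role": "user", "content": [
            {"type": "text", "text": "Extract the invoice number and total."},
            {"type": "image_url",
             "image_url": {"url": f"data:image/png;base64,{b64}"}},
        ]}],
    )
    print(resp.choices[0].message.content)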


Jumping on this for visibility: LM Studio really is the best option out there. Ollama is another runtime I've used, but I've found it makes too many assumptions about what a computer is capable of, and it's almost impossible to override those settings. It often overloads weaker computers and underestimates stronger ones.

LM Studio isn't as "set it and forget it" as Ollama is, and it does have a bit of a learning curve. But if you're doing any kind of AI development and you don't want to mess around with writing llama-cpp scripts all the time, it really can't be beat (for now).


Thank you! I will give it a try and see if I can get that 4090 working a bit.


You can use their models here: chat.qwenlm.ai. It's their official website.


I wouldn't recommend using anything that can transmit data back to the CCP. The model itself is fine since it's open source (and you can run it firewalled if you're really paranoid), but directly using Alibaba's AI chat website should be discouraged.


AnythingLLM also good for that GUI experience!


I should add that sometimes LM Studio just feels better for the use case: same model, same purpose, seemingly different output, usually when RAG is involved. But AnythingLLM is definitely a very intuitive visual experience.


People actually use tesseract? It's one of the worst OCR solutions out there. Forget it.


Any tips on getting bounding boxes? The model doesn't seem to even understand the original size of the image, and even if I provide the dimensions, the positioning is off. :'(
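
I've been trying to rescale from what I assume is the model's resized coordinate space back to the original image, roughly like this, with no luck (the (x1, y1, x2, y2) box format is my assumption):

    def rescale_box(box, resized_wh, original_wh):
        # scale factors from the model's view of the image to the original
        sx = original_wh[0] / resized_wh[0]
        sy = original_wh[1] / resized_wh[1]
        x1, y1, x2, y2 = box
        return (x1 * sx, y1 * sy, x2 * sx, y2 * sy)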


Wait a moment... It gave you BOUNDING BOXES? That is awesome! That is a missing link I need for models.


I had success with Tabula; you may not need AI. But fine if it works too.
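
For PDFs with real table structure, tabula-py is about this simple (the file path is illustrative):

    import tabula

    tables = tabula.read_pdf("invoice.pdf", pages="all")  # list of DataFrames
    for df in tables:
        print(df.head())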


I would strongly emphasize:

CV != AI Vision

gpt-4o would breeze through your poor images.


It did not, unfortunately. When CV failed gpt-4o failed as well. I even had a list of valid invoice numbers & dates to help the models. Still, most failed.

Construction invoices are not great.


Did you try few-shotting examples when you hit problem cases? In my ziploc case, the model was failing when red Sharpie was used vs. black. A few-shot hint fixed that.


Tbh, I had run the images through a few filters. The images that went through to the AI were high-contrast black and white, with noise such as highlighter marks removed. I had tried one-shot and few-shot.

I think it was largely a formatting issue; some of these invoices have nonsense layouts. Perhaps Qwen works well because it doesn't assume left-to-right, top-to-bottom? Just speculating, though.
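
The filtering pass was something along these lines (my sketch, not the exact script; the thresholds and the highlighter saturation cutoff are illustrative):

    import cv2

    img = cv2.imread("invoice.jpg")
    hsv = cv2.cvtColor(img, cv2.COLOR_BGR2HSV)
    mask = cv2.inRange(hsv, (0, 80, 80), (179, 255, 255))  # saturated ink
    img[mask > 0] = (255, 255, 255)  # push highlighter marks to white
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
    bw = cv2.adaptiveThreshold(gray, 255, cv2.ADAPTIVE_THRESH_GAUSSIAN_C,
                               cv2.THRESH_BINARY, 31, 15)
    cv2.imwrite("clean.png", bw)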


I'm very surprised. I've dealt with some really ugly inputs (handwritten text on full ziploc bags, stained and torn handwritten recipe cards, etc.) with super good success.



