As I mentioned yesterday, I recently needed to process hundreds of low-quality images of invoices (for a construction project). I had a script that used PIL/OpenCV and pytesseract, with OpenAI as a fallback. It still had a staggering number of failures.
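For context, the pipeline was shaped roughly like this (a minimal sketch, not the actual script; the confidence cutoff, prompt, and model name are stand-ins):

```python
import base64

import cv2
import pytesseract
from openai import OpenAI

client = OpenAI()

def ocr_invoice(path: str) -> str:
    # Cheap cleanup first: grayscale + Otsu binarization
    img = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
    _, img = cv2.threshold(img, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)

    # First pass: tesseract, with per-word confidences
    data = pytesseract.image_to_data(img, output_type=pytesseract.Output.DICT)
    words = [w for w, c in zip(data["text"], data["conf"]) if w.strip() and int(c) > 0]
    confs = [int(c) for c in data["conf"] if int(c) > 0]
    if confs and sum(confs) / len(confs) >= 60:  # arbitrary cutoff, tune it
        return " ".join(words)

    # Fallback: ship the image to a vision model
    with open(path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode()
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": [
            {"type": "text", "text": "Transcribe this invoice verbatim."},
            {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{b64}"}},
        ]}],
    )
    return resp.choices[0].message.content
```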
Today I tried a handful of the really poor-quality invoices and Qwen spat out all the information I needed without an issue. What's crazier is that it gave me bounding boxes to improve tesseract.
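Roughly how the boxes help: crop each region Qwen points at and hand it to tesseract on its own, so it only has to read one field at a time. A sketch, assuming Qwen returned pixel-space boxes as JSON (the shape shown in the docstring is an assumption):

```python
import json

import cv2
import pytesseract

def ocr_regions(image_path: str, qwen_json: str) -> dict:
    """Run tesseract on each Qwen-reported box instead of the whole page.

    Assumes qwen_json looks like:
    [{"label": "invoice_number", "bbox": [x1, y1, x2, y2]}, ...]
    with pixel-space coordinates.
    """
    img = cv2.imread(image_path)
    results = {}
    for field in json.loads(qwen_json):
        x1, y1, x2, y2 = field["bbox"]
        crop = img[y1:y2, x1:x2]
        # --psm 7 tells tesseract to treat the crop as a single text line
        results[field["label"]] = pytesseract.image_to_string(
            crop, config="--psm 7"
        ).strip()
    return results
```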
Microsoft Vision is incredibly expensive, slow, has a ridiculous rate limit, and isn't any better than what you can run yourself. You have to make every request over HTTP (subject to that rate limit), and there is no way to do bulk jobs.
I wonder why you chose Qwen specifically. Mistral has a specialized model just for OCR that they advertised heavily (I tested it and it works surprisingly well, at least on English-language books from the 80s and 90s).
Interesting. In the past I tried to get VLMs to estimate bounding boxes of property boundaries on satellite maps, but had no success. Do you have any tips on how to improve the results?
Depends on the model, but e.g. [1] found many models perform better if you are more polite. Though interestingly, being rude can also sometimes improve performance, at the cost of higher bias.
Intuitively it makes sense. The best sources tend to be either moderately polite (professional language) or 4chan-like (rude and biased, but honest).
Before GPT-5 was released I already had the feeling that the web UI responses were declining, so I started trying to get more out of them. Dissing it and saying how useless its response was did actually improve the output (I think).
The way I think of it, talking to an LLM is a bit like talking to myself or listening to an echo, since what I get back depends only on what I put in. If it senses that I'm frustrated, it will be inclined to make even more stuff up in an attempt to appease me, so that gets me nowhere.
I've found it more useful to keep it polite and "professional" and restart the conversation if we've begun going around in circles.
And besides, if I make a habit of behaving badly with LLMs, there's a good chance that I'll do it without thinking at some point and get in trouble.
I like to test these models on reading the contents of screenshots of '80s Apple ][ games. These are very low resolution and very dense. All (free-to-use) models struggle on that task...
I've tried that too, detecting the scan layout to get better OCR, but it didn't really beat a fine-tuned Qwen 2.5 VL 7B. I'd say fine-tuning is the way to go.
What's the cost of the fine-tuned model? If you were attempting to optimize for cost, would it be worth it to detect scan layouts to get better OCR?
Honestly, I'm such a noob in this space. I had one project I needed to do and didn't want to do it by hand, which would have taken 2 days, so I spent 5 trying to get a script to do it for me.
The model runs on an H200 in ~20s, costing about $2.40/hr. On an L4 it's cheaper at ~$0.30/hr but takes ~85s to finish. Overall, the H200 ends up cheaper at volume. My scan has a separate issue though: each page has two columns, so text from the right side sometimes overflows into the left. OCR can't really tell where sentences start and end unless the layout is split by column.
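For what it's worth, a column split can often be done with plain OpenCV before OCR. A sketch, assuming a reasonably clean vertical gutter somewhere in the middle third of the page:

```python
import cv2
import numpy as np
import pytesseract

def ocr_two_columns(path: str) -> str:
    img = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
    # Invert so ink is white, then count ink per pixel column
    _, binary = cv2.threshold(img, 0, 255, cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)
    ink_per_column = binary.sum(axis=0)

    # The gutter is the emptiest column, searched in the middle third only
    third = img.shape[1] // 3
    split = third + int(np.argmin(ink_per_column[third : 2 * third]))

    # OCR each half separately so lines can't bleed across the gutter
    left = pytesseract.image_to_string(img[:, :split])
    right = pytesseract.image_to_string(img[:, split:])
    return left + "\n" + right
```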
Jumping off this for visibility: LM Studio really is the best option out there. Ollama is another runtime I've used, but I've found it makes too many assumptions about what a computer is capable of, and it's almost impossible to override those settings. It often overloads weaker computers and underestimates stronger ones.
LM Studio isn't as "set it and forget it" as Ollama is, and it does have a bit of a learning curve. But if you're doing any kind of AI development and you don't want to mess around with writing llama-cpp scripts all the time, it really can't be beat (for now).
I wouldn't recommend using anything that can transmit data back to the CCP. The model itself is fine since it's open source (and you can run it firewalled if you're really paranoid), but directly using Alibaba's AI chat website should be discouraged.
I should add that sometimes LM Studio just feels better for the use case: same model, same purpose, seemingly different output, usually when RAG is involved. But Anything is definitely a very intuitive visual experience.
Any tips on getting bounding boxes? The model doesn't seem to even understand the original size of the image. And even if I provide the dimensions, the positioning is off. :'(
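For anyone hitting the same wall: the usual cause is that the model reasons over a resized copy of the image, and some Qwen variants emit coordinates on a 0-1000 normalized grid, so the boxes have to be mapped back onto the original resolution. A sketch:

```python
def scale_bbox(bbox, model_size, original_size):
    """Map a box from the model's coordinate space back onto the source image.

    bbox: (x1, y1, x2, y2) as reported by the model
    model_size: (width, height) of the image the model actually saw,
                or (1000, 1000) if it uses a normalized 0-1000 grid
    original_size: (width, height) of your source image
    """
    sx = original_size[0] / model_size[0]
    sy = original_size[1] / model_size[1]
    x1, y1, x2, y2 = bbox
    return (int(x1 * sx), int(y1 * sy), int(x2 * sx), int(y2 * sy))

# A box reported on a 0-1000 grid, mapped onto a 2480x3508 (A4 @ 300dpi) scan
print(scale_bbox((100, 200, 400, 260), (1000, 1000), (2480, 3508)))
```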
It did not, unfortunately. When CV failed, GPT-4o failed as well. I even had a list of valid invoice numbers and dates to help the models. Still, most failed.
Did you try few-shotting examples when you hit problem cases? In my ziploc case, the model was failing when red sharpie was used vs. black. A few-shot hint fixed that.
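The hint was structured roughly like this (a hypothetical sketch in OpenAI-style chat messages; the URLs and transcriptions are made up):

```python
# Hypothetical few-shot structure: one labeled example per failure mode
# (red sharpie, black sharpie), then the image you actually want read.
# The URLs and transcriptions below are placeholders.
red_example = "https://example.com/red_sharpie_bag.jpg"
black_example = "https://example.com/black_sharpie_bag.jpg"
target = "https://example.com/new_bag.jpg"

messages = [
    {"role": "user", "content": [
        {"type": "text", "text": "Example: this bag is labeled in RED sharpie."
                                 " Transcription: 'chicken stock 2024-03-01'"},
        {"type": "image_url", "image_url": {"url": red_example}},
    ]},
    {"role": "user", "content": [
        {"type": "text", "text": "Example: this bag is labeled in BLACK sharpie."
                                 " Transcription: 'ground beef 1 lb'"},
        {"type": "image_url", "image_url": {"url": black_example}},
    ]},
    {"role": "user", "content": [
        {"type": "text", "text": "Now transcribe this bag the same way."},
        {"type": "image_url", "image_url": {"url": target}},
    ]},
]
```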
Tbh, I had run the images through a few filters (roughly the sketch below). The images that went through to AI were high-contrast black and white, with noise such as highlighter removed. I had tried one-shot and few-shot.
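The highlighter removal was along these lines (a sketch; the HSV saturation cutoff is a guess you would tune per pen color):

```python
import cv2
import numpy as np

def remove_highlighter(path: str) -> np.ndarray:
    img = cv2.imread(path)
    hsv = cv2.cvtColor(img, cv2.COLOR_BGR2HSV)

    # Mask saturated pixels (colored highlighter ink). Black text and white
    # paper both have low saturation, so they survive the mask.
    mask = cv2.inRange(hsv, np.array([0, 80, 80]), np.array([179, 255, 255]))
    img[mask > 0] = (255, 255, 255)  # paint highlighted pixels white

    # Then binarize for OCR
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
    _, bw = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
    return bw
```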
I think it was largely a formatting issue. Like some of these invoices have nonsense layouts. Perhaps Qwen works well because it doesn't assume left to right, top to bottom? Just speculating though
I'm very surprised. I've dealt with some really ugly inputs (handwritten text on full ziploc bags, stained and torn handwritten recipe cards, etc.) with super good success.