Agreed, I have seen some speedups with ONNX if I'm being honest, but the process, especially on macOS, is a bit messy. I'll try out ExecuTorch and see how it compares, cheers for the recommendation.
Caveat emptor, but this seems like an up-to-date paper on the state of bitwise reproducibility in deep learning, with a bunch of citations to other papers that go into more depth: https://arxiv.org/pdf/2510.09180
Yeah, I can see why they let it be that way, but the fact that it's pretty undefined is what bugged me. I suppose it depends on what your goals are - efficiency vs reproducibility.
Also I did run a test of FP16 vs FP32 for a large matmul on the Apple GPU and the FP16 calculation was 1.28x faster so it makes sense that they'd go for FP16 as a default.
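If you want to reproduce something similar, here's a minimal sketch of a timing harness using PyTorch's MPS backend (not necessarily what I ran; the matrix size and iteration count are arbitrary, so treat the exact ratio as illustrative):

```python
# Rough FP16 vs FP32 matmul timing on the Apple GPU via PyTorch's MPS backend.
# Numbers will vary with matrix size, dtype support and thermal state.
import time
import torch

def time_matmul(dtype, n=4096, iters=20):
    a = torch.randn(n, n, device="mps", dtype=dtype)
    b = torch.randn(n, n, device="mps", dtype=dtype)
    torch.mps.synchronize()          # make sure setup work has finished
    start = time.perf_counter()
    for _ in range(iters):
        _ = a @ b
    torch.mps.synchronize()          # wait for the GPU before stopping the clock
    return (time.perf_counter() - start) / iters

t32 = time_matmul(torch.float32)
t16 = time_matmul(torch.float16)
print(f"FP32: {t32*1e3:.1f} ms  FP16: {t16*1e3:.1f} ms  speedup: {t32/t16:.2f}x")
```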
Also, generally I think CoreML isn't the best. The best solution for ORT would probably be to introduce a pure MPS provider (https://github.com/microsoft/onnxruntime/issues/21271), but given they've already bought into CoreML, the effort may not be worth the reward for the core team. Which is fair enough, as it's a pretty mammoth task.
However, one benefit of CoreML is that it's the only way for third parties to execute on the ANE (Apple Neural Engine, aka the NPU). For some models the ANE can run even faster than the GPU/MPS and consume even less battery.
But I agree CoreML in ONNX Runtime is not perfect - most of the time when I tested models there was too much partitioning, and the whole graph ran slower compared to running the model purely in CoreML format.
To be honest it's a shame the whole thing is closed up. I guess it's to be expected from Apple, but I reckon CoreML would benefit a lot from at least exposing the internals/allowing users to define new ops.
Also, the ANE only allows some operators to be run on it, right? There's very little transparency/control over what can and can't be offloaded to it, which makes using it difficult.
Working on a blog post about default behaviour in ONNX Runtime when using the CoreML execution provider. Basically, the default args lead to your model being run in FP16, not FP32.
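A quick way to see the effect for yourself is to run the same model with the CPU EP and with the CoreML EP and compare outputs. A minimal sketch (the model path and input name are placeholders for your own model):

```python
# Compare CoreML EP output against the CPU EP on the same input; the FP16
# default shows up as a larger diff than plain FP32 run-to-run noise.
import numpy as np
import onnxruntime as ort

x = np.random.rand(1, 3, 224, 224).astype(np.float32)

cpu_sess = ort.InferenceSession("model.onnx", providers=["CPUExecutionProvider"])
coreml_sess = ort.InferenceSession(
    "model.onnx",
    providers=["CoreMLExecutionProvider", "CPUExecutionProvider"],
)

ref = cpu_sess.run(None, {"input": x})[0]
out = coreml_sess.run(None, {"input": x})[0]
print("providers:", coreml_sess.get_providers())
print("max abs diff vs CPU:", np.abs(ref - out).max())
```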
They created this CNN for exactly this task, autism diagnosis in children. I suppose this model would work for babies too.
Edit: ah, I see your point, in the paper they diagnose autism with eye contact, but your point is a task closer to what my model does. It could definitely be augmented for such a task, we'd just need to improve the accuracy. The only issue I see is that sourcing training data might be tricky, unless I partner with some institution researching this. If you know of anyone in this field I'd be happy to speak with them.
That's great! What I'm talking about is a bit different though and might be a lot easier to deploy and work on much younger subjects:
Put a tablet in front of a baby. Left half has images of gears and stuff, right half has images of people and faces. Does the baby look at the left or right half of the screen? This is actually pretty indicative of autism and easy to put into a foolproof app.
The linked GitHub repo is recording a video of an older child's face while they look at a person who is wearing a camera or something, and judging whether or not they make proper eye contact. This is thematically similar but actually really different. It requires an older kid, both for the model and the method, and is hard to actually use. Not that useful.
Intervening when still a baby is absolutely critical.
P.S., deciding which half of a tablet a baby is looking at is MUCH MUCH easier than gaze tracking. Make the tablet screen bright white around the edges. Turn brightness up. Use off-the-shelf iris tracking software. Locate the reflection of the iPad in the baby's iris. Is it on the right half or left half of the iris? Adjust for their position in the FOV and their face pose a bit, and bam, that's very accurate. Full, robust gaze tracking is a million times harder, believe me.
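To make that final step concrete, here's a hedged sketch of the left/right decision, assuming some upstream detector already gives you the iris circle and the centroid of the tablet's reflection (all names are hypothetical and the sign convention would need checking against real photos):

```python
# Hypothetical last step of the half-screen check: given an iris circle and
# the centroid of the tablet's reflection inside it, decide left vs right.
from dataclasses import dataclass

@dataclass
class Iris:
    cx: float  # iris centre x, image pixels
    cy: float  # iris centre y, image pixels
    r: float   # iris radius, pixels

def screen_half(iris: Iris, reflection_cx: float, pose_offset_px: float = 0.0) -> str:
    """reflection_cx: x of the tablet reflection's centroid in the same image.
    pose_offset_px: small correction for face pose / position in the camera FOV."""
    # Reflection position relative to the iris centre, normalised by radius.
    offset = (reflection_cx - iris.cx - pose_offset_px) / iris.r
    # NOTE: whether negative means "left half" depends on the mirror flip and
    # your coordinate frame - calibrate the sign on a few real photos.
    return "left" if offset < 0 else "right"

print(screen_half(Iris(cx=120.0, cy=88.0, r=14.0), reflection_cx=124.5))
```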
That's a cool idea, thanks for sharing! It's cool to see other uses for a model I built for a completely different task.
Are there any research papers on this type of autism diagnosis tool for babies?
To your last point, yes, I agree. Even the task I set up the model for is relatively easy compared to proper gaze tracking, I just rely on large datasets.
I suppose you could do it in the way you say and then from that gather data to eventually build out another model.
I'll for sure look into this, appreciate you sharing the idea!
Idk of any research, sorry, just going from memory from a few years ago. Feel free to lmk if you ever have any questions about mobile gaze tracking, I spent several years on it. Can you DM on here? Idk.
FYI: I got funding and gathered really big mobile phone gaze datasets and trained CNN models on them that got pretty accurate. Avg err below 1cm.
The whole thing worked like this:
Mech Turk workers got to play a mobile memory game. A number flashed on the screen at a random point for a second while the phone took a photo. Then they had to enter it. If they entered it correctly, I assumed they were looking at that point at the screen when the photo was taken and added it to the gaze dataset. Collecting clean data like this was very important for model accuracy. Data is long gone, unfortunately. Oh and the screen for the memory game was pure white, which was essential for a reason described below.
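In pseudo-Python, the acceptance rule was roughly this (capture_photo and prompt_user are hypothetical stand-ins for the game's UI, not the actual code, which is long gone):

```python
# Sketch of the memory-game acceptance rule: a photo only becomes a training
# sample if the worker typed the flashed number correctly, which is what kept
# the gaze labels clean.
import random

def run_trial(capture_photo, prompt_user):
    """One trial: flash a number at a random screen point, photograph, verify."""
    target = random.randint(10, 99)
    x, y = random.random(), random.random()                 # normalised screen position
    photo = capture_photo(show_number=target, at=(x, y))    # white screen + number
    answer = prompt_user("What number did you see?")
    if answer.strip() == str(target):
        return {"image": photo, "gaze_label": (x, y)}        # accepted sample
    return None                                              # discard: attention not verified
```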
The CNN was a cascade of several models:
First, off the shelf stuff located the face in the image.
A crop of the face was fed to an iris location model we trained. This estimated eye location and size.
Crops of the eyes were fed into 2 more cycles of iris detection, taking a smaller crop and making a more accurate estimation until the irises were located and sized to within about 1 pixel. Imagine the enhance... enhance... trope.
Then crops of the super well centered and size-normalized irises, as well as a crop of the face, were fed together into a CNN, along with metadata about the phone type.
This CNN estimated gaze location using the labels from the memory game derived data.
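In code terms the cascade was roughly this shape (a sketch only: detect_face, locate_irises and gaze_cnn are hypothetical stand-ins for the models we trained, and the crop-and-refine control flow is the point):

```python
# Rough structure of the cascade described above.
import numpy as np

def crop(image: np.ndarray, box):
    """box = (x, y, w, h) in full-image pixel coordinates."""
    x, y, w, h = [int(v) for v in box]
    return image[y:y + h, x:x + w]

def estimate_gaze(image, phone_type, detect_face, locate_irises, gaze_cnn,
                  refinement_steps=2):
    # Stage 1: off-the-shelf face detection.
    face_box = detect_face(image)
    face_crop = crop(image, face_box)

    # Stage 2: coarse iris location on the face crop, returned in image coords.
    eye_boxes = locate_irises(face_crop, origin=face_box)

    # Stage 3: enhance... enhance... - re-run the iris model on ever-tighter
    # crops until the centre/size estimate is good to about a pixel.
    iris_crops = []
    for eye_box in eye_boxes:
        for _ in range(refinement_steps):
            eye_box = locate_irises(crop(image, eye_box), origin=eye_box)[0]
        iris_crops.append(crop(image, eye_box))

    # Stage 4: final CNN sees the centred, size-normalised irises, the face
    # crop, and metadata about the phone model.
    return gaze_cnn(iris_crops, face_crop, phone_type)
```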
This worked really well, usually, in lighting the model liked. It failed unpredictably in other lighting. We tried all kinds of pre-processing to address lighting, but it was always the Achilles heel.
To my shock, I eventually realized (too late) that the CNN was learning to find the phone's reflection in the irises, and estimating the phone position relative to the gaze direction using that. So localizing the irises extremely well was crucial. Letting it know what kind of phone it was looking at, and ergo how large the reflection should appear at a certain distance, was too.
Making a model that segments out the phone or tablet's reflection in the iris is just a very efficient shortcut to do what any actually good model will learn to do anyway, and it will remove all of the lighting variation. It's the way to make gaze tracking on mobile actually work reliably without infrared. We never had time to backtrack and do this because we ran out of funding.
The MOST IMPORTANT thing here is to control what is on the phone screen. If the screen is half black, or dim, or has random darkly colored blobs, it will screw with where the model thinks the phone screen reflection begins and ends. HOWEVER, if your use case allows you to control what is on the screen so it always has, for instance, a bright white border, your problem is 10x easier. The baby autism screener would let you control that, for instance.
But anyway, like I said, to make something that just determines if the baby is looking on one side of the screen or the other, you could do the following:
1. Take maybe 1000 photos of a sampling of people watching a white tablet screen moving around in front of their face.
2. Annotate the photos by labeling the visible corners of the reflection of the tablet screen in their irises.
3. Make a simple CNN to place these corners (rough sketch below).
If you can also make a model that locates the irises extremely well, like to within 1px, then making the gaze estimate becomes sort of trivial with that plus the tablet-reflection-in-iris finder. And I promise the iris location model is easy. We trained on about 3000-4000 images of very well labeled irises (with circle drawn over them for the label) with a simple CNN and got really great sub-pixel accuracy in 2018. That plus some smoothing across camera frames was more than enough.
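For step 3, the corner model really can be tiny. A minimal sketch in PyTorch (the architecture and sizes are illustrative guesses, not what we actually trained):

```python
# Small CNN that regresses the 4 visible corners of the screen reflection
# (8 numbers) from a fixed-size iris crop.
import torch
import torch.nn as nn

class ReflectionCornerNet(nn.Module):
    def __init__(self, crop_size=64):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        )
        feat = 64 * (crop_size // 8) ** 2
        self.head = nn.Sequential(nn.Flatten(), nn.Linear(feat, 128), nn.ReLU(),
                                  nn.Linear(128, 8))   # 4 corners * (x, y)

    def forward(self, x):
        return self.head(self.features(x))

model = ReflectionCornerNet()
dummy_iris_crop = torch.rand(1, 3, 64, 64)     # stand-in for a labelled crop
corners = model(dummy_iris_crop)               # (1, 8): normalised corner coords
loss = nn.functional.smooth_l1_loss(corners, torch.rand(1, 8))  # train against labels
```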
Anyway, hope some of this helps. I know you aren't doing fine-grained gaze tracking like this but maybe something in there is useful.
Wow, this is great! You can't DM, but my email is in my blog post, in the footnotes.
Do you remember the cost of Mech Turk? It was something I wanted to use for EyesOff but could never get past the cost aspect.
I need some time to process everything you said, but the EyesOff model has pretty low accuracy at the moment. I'm sure some of these tidbits of info could help to improve the model, although my data is pretty messy in comparison. I had thought of doing more gaze tracking work for my model, but at long ranges it just breaks down completely (in my experience, happy to stand corrected if you've worked on that too).
Regarding the baby screener, I see how this approach could be very useful. If I get the time, I'll look into it a bit more and see what I can come up with. I'll let you know once I get round to it.
We paid something like $10 per hour and people loved our tasks. We paid a bit more to make sure our tasks were completed well. The main thing was just making the data collection app as efficient as possible. If you pay twice as much but collect 4x the data in the task, you doubled your efficiency.
Yeah, I think it's impossible to get good gaze accuracy without observing the device reflection in the eyes. And you will never, ever be able to deal well with edge cases like lighting, hair, glasses, asymmetrical faces, etc. There's just a fundamental information limit you can't overcome. Maybe you could get within 6 inches of accuracy? But mostly it would be learning face pose, I assume. Trying to do gaze tracking with a webcam of someone 4 feet away and half offscreen just seems Sisyphean.
Is EyesOff really an important application? I'm not sure many people would want to drain their battery running it. Just a rhetorical question, I don't know.
With the baby autism screener, the difficult part is the regulatory aspect. I might have some contacts at the Mayo Clinic who would be interested in productizing something like this though, and they could be asked about it.
If I were you, I would look at how to take a mobile photo of an iris, and artificially add the reflection of a phone screen to create a synthetic dataset (it won't look like a neat rectangle, more like a blurry fragment of one). Then train a CNN to predict the corners of the added reflection. And after that is solved, try the gaze tracking problem as an algebraic exercise. Like, think of the irises as 2 spherical mirrors. Assume their physical size. If you can locate the reflection of an object of known size in them, you should be able to work out the spatial relationships to figure out where the object being reflected is relative to the mirrors. This is hard, but is 10-100x easier than trying end-to-end gaze tracking with a single model. Also nobody in the world knows to do this, AFAIK.
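A hedged sketch of the synthetic-reflection idea, using OpenCV; every parameter here (quad size, blur, brightness) is a guess you'd want to tune against real photos:

```python
# Paste a blurred, warped white quad (the "phone screen reflection") onto a
# real iris crop and keep the corner coordinates as the training label.
import cv2
import numpy as np

def add_fake_reflection(iris_crop: np.ndarray, rng: np.random.Generator):
    """iris_crop: BGR uint8 crop centred on a real iris."""
    h, w = iris_crop.shape[:2]
    # Random small convex quad near the iris centre.
    cx, cy = w / 2 + rng.uniform(-5, 5), h / 2 + rng.uniform(-5, 5)
    corners = np.array([[cx + rng.uniform(2, 8) * sx, cy + rng.uniform(2, 6) * sy]
                        for sx, sy in [(-1, -1), (1, -1), (1, 1), (-1, 1)]],
                       dtype=np.float32)

    # Draw the quad as a bright patch, blur it, then blend it onto the crop.
    overlay = np.zeros_like(iris_crop)
    cv2.fillConvexPoly(overlay, corners.astype(np.int32), (255, 255, 255))
    overlay = cv2.GaussianBlur(overlay, (7, 7), 0)
    alpha = rng.uniform(0.3, 0.8)                       # reflection brightness
    out = cv2.addWeighted(iris_crop, 1.0, overlay, alpha, 0)
    return out, corners                                  # image + corner label
```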
Ha, that's probably why I noticed the EyesOff accuracy drops so much at longer ranges. I suppose two models would do better, but atm battery drain is a big issue.
I'm not sure if it's important or not, but the app comes from my own problems working in public so I'm happy to continue working on it. I do want to train and deploy an optimised model, something much smaller.
Sounds great, once a POC gets built I'll let you know and we can see about the clinical side.
Thanks for the tips! I'll be sure to post something and reach out if I get round to implementing such a model.
Yeah tbh I do recommend using this alongside a privacy screen for best protection. Privacy screens also suffer from the fact that they won’t block someone directly behind you from seeing the screen, so both methods have issues.
Any tips on improving accuracy? A lot of it might be due to lack of diverse images + labelling errors as I did it all manually.
I dunno, my only idea is that maybe you could use traditional face detection to find the face/eyes and then do classification on the crops (assuming you aren't doing that already?).
Right now that's pretty much what I do. I use YuNet to get faces, crop them out and run detection. It's probably a factor of not enough data/poor model choice.
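For context, the current pipeline is roughly this shape (a sketch; the YuNet model file path and the downstream classify_gaze call are placeholders for my actual model):

```python
# YuNet face detection + per-face crop, with a placeholder looking/not-looking
# classifier applied to each crop.
import cv2
import numpy as np

detector = cv2.FaceDetectorYN.create("face_detection_yunet.onnx", "", (320, 320))

def faces_looking_at_screen(frame: np.ndarray, classify_gaze) -> int:
    h, w = frame.shape[:2]
    detector.setInputSize((w, h))
    _, faces = detector.detect(frame)          # rows: x, y, w, h, 5 landmarks, score
    count = 0
    for face in (faces if faces is not None else []):
        x, y, fw, fh = face[:4].astype(int)
        crop = frame[max(y, 0):y + fh, max(x, 0):x + fw]
        if crop.size and classify_gaze(crop):  # placeholder classifier
            count += 1
    return count
```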