That's a cool idea, thanks for sharing! It's great to see other uses for a model I built for a completely different task.
Are there any research papers on this type of autism diagnostic tool for babies?
To your last point, yes, I agree. Even the task I set up the model for is relatively easy compared to proper gaze tracking; I just rely on large datasets.
I suppose you could do it the way you describe and then use that to gather data to eventually build out another model.
I'll for sure look into this, appreciate you sharing the idea!
Idk of any research, sorry, just going from memory from a few years ago. Feel free to lmk if you ever have any questions about mobile gaze tracking, I spent several years on it. Can you DM on here? Idk.
FYI: I got funding and gathered really big mobile phone gaze datasets and trained CNN models on them that got pretty accurate. Avg error below 1 cm.
The whole thing worked like this:
Mech Turk workers got to play a mobile memory game. A number flashed on the screen at a random point for a second while the phone took a photo. Then they had to enter it. If they entered it correctly, I assumed they were looking at that point on the screen when the photo was taken and added it to the gaze dataset. Collecting clean data like this was very important for model accuracy. Data is long gone, unfortunately. Oh, and the screen for the memory game was pure white, which was essential for a reason described below.
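In pseudocode, each trial looked roughly like this (a from-memory sketch; the function names and app plumbing are stand-ins, not the real code):

```python
import random

def memory_game_trial(show_number_at, take_photo, read_typed_input,
                      screen_w_cm, screen_h_cm):
    # Flash a short number at a random point on a pure-white screen,
    # snap a photo mid-flash, then ask the worker to type the number.
    target = (random.uniform(0, screen_w_cm), random.uniform(0, screen_h_cm))
    number = f"{random.randrange(100):02d}"
    show_number_at(number, target)        # visible for ~1 second
    photo = take_photo()                  # front camera fires during the flash
    if read_typed_input() == number:
        # Correct recall => the eyes were almost certainly on `target`
        # at capture time, so it's a clean training label.
        return {"image": photo, "gaze_cm": target}
    return None                           # wrong answer: discard the trial
```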
The CNN was a cascade of several models:
First, off the shelf stuff located the face in the image.
A crop of the face was fed to an iris location model we trained. This estimated eye location and size.
Crops of the eyes were fed into 2 more cycles of iris detection, taking a smaller crop and making a more accurate estimation until the irises were located and sized to within about 1 pixel. Imagine the enhance... enhance... trope.
Then crops of the super well centered and size-normalized irises, as well as a crop of the face, were fed together into a CNN, along with metadata about the phone type.
This CNN estimated gaze location using the labels from the memory game derived data.
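If it helps to picture the cascade, here's a rough PyTorch sketch of how the pieces fit together. This is purely for illustration; every layer width, crop size, and name is made up, not the real code, and the off-the-shelf face detector is omitted:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class IrisLocator(nn.Module):
    """Regresses (cx, cy, r) of one iris within a crop, normalized to [0, 1]."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, 3), nn.Sigmoid(),
        )

    def forward(self, crop):           # crop: (B, 3, 64, 64)
        return self.net(crop)          # (B, 3): cx, cy, r in crop coords

def refine_iris(locator, image, box, passes=3, size=64):
    """The 'enhance... enhance...' loop: crop, predict, re-crop tighter."""
    for _ in range(passes):
        x0, y0, w, h = box
        crop = F.interpolate(image[:, :, y0:y0 + h, x0:x0 + w],
                             size=(size, size), mode='bilinear',
                             align_corners=False)
        cx, cy, r = locator(crop)[0].tolist()
        # Map the prediction back to full-image coords, then shrink the box.
        cx_img, cy_img, r_img = x0 + cx * w, y0 + cy * h, r * max(w, h)
        half = max(4, int(2 * r_img))  # new crop is ~4 iris radii wide
        box = (int(cx_img) - half, int(cy_img) - half, 2 * half, 2 * half)
    return box                         # iris now centered to ~1 px

class GazeNet(nn.Module):
    """Fuses both iris crops, a face crop, and a phone-type embedding into
    an on-screen (x, y) gaze estimate, trained on the memory-game labels."""
    def __init__(self, n_phone_types=50):
        super().__init__()
        def tower():                   # one small conv tower per input image
            return nn.Sequential(
                nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
                nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
                nn.AdaptiveAvgPool2d(1), nn.Flatten())
        self.left, self.right, self.face = tower(), tower(), tower()
        self.phone = nn.Embedding(n_phone_types, 16)  # phone model metadata
        self.head = nn.Sequential(nn.Linear(64 * 3 + 16, 128), nn.ReLU(),
                                  nn.Linear(128, 2))  # (x, y) on screen

    def forward(self, left, right, face, phone_id):
        feats = torch.cat([self.left(left), self.right(right),
                           self.face(face), self.phone(phone_id)], dim=1)
        return self.head(feats)
```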
This worked really well, usually, in lighting the model liked. It failed unpredictably in other lighting. We tried all kinds of pre-processing to address lighting, but it was always the Achilles heel.
To my shock, I eventually realized (too late) that the CNN was learning to find the phone's reflection in the irises, and estimating the phone position relative to the gaze direction using that. So localizing the irises extremely well was crucial. Letting it know what kind of phone it was looking at, and ergo how large the reflection should appear at a certain distance, was too.
Making a model that segments out the phone or tablet's reflection in the iris is just a very efficient shortcut to do what any actually good model will learn to do anyway, and it will remove all of the lighting variation. It's the way to make gaze tracking on mobile actually work reliably without infrared. Never had time to backtrack and do this because we ran out of funding.
The MOST IMPORTANT thing here is to control what is on the phone screen. If the screen is half black, or dim, or has random darkly colored blobs, it will screw with where the model thinks the phone screen reflection begins and ends. HOWEVER, if your use case allows you to control what is on the screen so it always has, for instance, a bright white border, your problem is 10x easier. The baby autism screener, for instance, would let you control that.
But anyway, like I said, to make something that just determines if the baby is looking at one side of the screen or the other, you could do the following:
1. Take maybe 1000 photos of a sampling of people watching a white tablet screen moving around in front of their face.
2. Annotate the photos by labeling the visible corners of the tablet screen's reflection in their irises.
3. Make a simple CNN to place these corners (see the sketch below).
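For step 3, something like this small corner regressor would probably do. The per-corner visibility flag is my addition (eyelids often hide corners), and the sizes are arbitrary:

```python
import torch.nn as nn

class ReflectionCorners(nn.Module):
    """Predicts the 4 corners of the screen reflection inside an iris crop,
    each as (x, y, visibility-logit) since eyelids often occlude corners."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(128, 4 * 3))

    def forward(self, iris_crop):      # iris_crop: (B, 3, 64, 64)
        return self.net(iris_crop).view(-1, 4, 3)
```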
If you can also make a model that locates the irises extremely well, like to within 1px, then making the gaze estimate becomes sort of trivial with that plus the tablet-reflection-in-iris finder. And I promise the iris location model is easy. We trained on about 3000-4000 images of very well labeled irises (with a circle drawn over them as the label) with a simple CNN and got really great sub-pixel accuracy in 2018. That plus some smoothing across camera frames was more than enough.
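The smoothing can be as simple as an exponential moving average over the per-frame estimates. A sketch (the alpha here is arbitrary; I don't remember exactly what we used):

```python
def smooth(per_frame_estimates, alpha=0.3):
    """Exponential moving average over (cx, cy, r) iris estimates."""
    out, prev = [], None
    for est in per_frame_estimates:
        prev = est if prev is None else tuple(
            alpha * e + (1 - alpha) * p for e, p in zip(est, prev))
        out.append(prev)
    return out
```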
Anyway, hope some of this helps. I know you aren't doing fine-grained gaze tracking like this but maybe something in there is useful.
Wow, this is great! You can't DM, but my email is in my blog post, in the footnotes.
Do you remember the cost of Mech Turk? It was something I wanted to use for EyesOff but could never get past the cost aspect.
I need some time to process everything you said, but the EyesOff model has pretty low accuracy at the moment. I'm sure some of these tidbits of info could help to improve the model, although my data is pretty messy in comparison. I had thought of doing more gaze tracking work for my model, but at long ranges it just breaks down completely (in my experience, happy to stand corrected if you've worked on that too).
Regarding the baby screener, I see how this approach could be very useful. If I get the time, I'll look into it a bit more and see what I can come up with. I'll let you know once I get round to it.
We paid something like $10 per hour and people loved our tasks. We paid a bit more to make sure our tasks were completed well. The main thing was just making the data collection app as efficient as possible. If you pay twice as much but collect 4x the data in the task, you've doubled your efficiency.
Yeah, I think it's impossible to get good gaze accuracy without observing the device reflection in the eyes. And you will never, ever be able to deal well with edge cases like lighting, hair, glasses, asymmetrical faces, etc. There's just a fundamental information limit you can't overcome. Maybe you could get within 6 inches of accuracy? But mostly it would be learning face pose, I assume. Trying to do gaze tracking with a webcam of someone 4 feet away and half offscreen just seems Sisyphean.
Is EyesOff really an important application? I'm not sure many people would want to drain their battery running it. Just a rhetorical question, I don't know.
With the baby autism screener, the difficult part is the regulatory aspect. I might have some contacts at Mayo Clinic who would be interested in productizing something like this, though, and I could ask them about it.
If I were you, I would look at how to take a mobile photo of an iris, and artificially add the reflection of a phone screen to create a synthetic dataset (it won't look like a neat rectangle, more like a blurry fragment of one). Then train a CNN to predict the corners of the added reflection. And after that is solved, try the gaze tracking problem as an algebraic exercise. Like, think of the irises as 2 spherical mirrors. Assume their physical size. If you can locate the reflection of an object of known size in them, you should be able to work out the spatial relationships to figure out where the object being reflected is relative to the mirrors. This is hard, but is 10-100x easier than trying end-to-end gaze tracking with a single model. Also nobody in the world knows to do this, AFAIK.
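To make the spherical-mirror algebra concrete: the cornea acts as a convex mirror with a radius of curvature of roughly 7.8 mm, and for an object much farther away than that radius, the reflected image is scaled by about R/(2d). So if you know the screen's physical width and can measure the reflection's width, you can back out the distance. A sketch (the anatomical constants are textbook averages; using the iris diameter as a pixel-to-mm ruler is just my suggestion):

```python
CORNEA_R_MM = 7.8       # average corneal radius of curvature (convex mirror)
IRIS_DIAM_MM = 11.7     # visible iris diameter; fairly constant across adults

def reflection_px_to_mm(width_px, iris_diameter_px):
    """Use the iris as a ruler to convert a pixel measurement to mm."""
    return width_px * IRIS_DIAM_MM / iris_diameter_px

def screen_distance_mm(screen_width_mm, reflection_width_mm):
    """Convex-mirror magnification is ~ R / (2 d) when d >> R, so a screen
    of width W reflects as a glint of width w = W * R / (2 d). Invert it."""
    return screen_width_mm * CORNEA_R_MM / (2.0 * reflection_width_mm)

# e.g. a 70 mm wide phone screen whose corneal reflection measures 0.9 mm
# puts the phone at ~ 70 * 7.8 / (2 * 0.9) = 303 mm (~30 cm) from the eye.
```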
Ha, that's probably why I noticed the EyesOff accuracy drops so much at longer ranges. I suppose two models would do better, but atm battery drain is a big issue.
I'm not sure if it's important or not, but the app comes from my own problems working in public so I'm happy to continue working on it. I do want to train and deploy an optimised model, something much smaller.
Sounds great, once a POC gets built I'll let you know and we can see about the clinical side.
Thanks for the tips! I'll be sure to post something and reach out if I get round to implementing such a model.