
Anyone who's dealt with human-annotated datasets will be familiar with these kinds of errors. It's hard enough to get good, clean labels from motivated, native English-speaking annotators. Farm it out to low-paid non-native speakers, and these kinds of issues are inevitable.

Annotation isn't a low-skill/low-cost exercise. It needs serious commitment and attention to detail, and ideally it's not something you outsource (or if you do, you need an additional in-house validation pipeline to identify dirty labels).



For a sentiment / emotion classification project, we (2 founders) just ended up doing most of the labeling ourselves. It was a big grind, but given how abysmal the performance of “crowd-sourced” solutions is (e.g. Amazon Mechanical Turk), and how incredibly important the quality of these labels is for training a model, it made the most sense.

I wonder how others do this kind of thing. Assuming I have 100k text blurbs I want to classify for sentiment and/or emotion and want at least 95% accuracy, what are the options?

I’m still shocked how low the quality of Mechanical Turk was for just sentiment (positive/negative/neutral/unsure): 99% of the classifications were just random, even after we narrowed our selection of workers to higher-qualified ones.

What a giant waste of money and time that was, especially since this is supposedly the canonical use case for the platform.


> I wonder how others do this kind of thing. Assuming I have 100k text blurbs I want to classify for sentiment and/or emotion and want at least 95% accuracy, what are the options?

Others who successfully do it have exactly the same secret sauce that you do: they assign it to someone who is well-compensated and competent.

The one time I needed anything remotely like this I just took the old "nobody said programming was gonna be glamorous" adage and ran with it for two weeks. Money-wise, two weeks of programmer time sorting data manually sure beats twelve weeks of programmer time debugging models to cope with inaccurate data, and the results are orders of magnitude better.

Given the widespread understanding of how critical training data is, it's mind-boggling to me that businesses in this field try to outsource it to the lowest-paid external company they can find, thus offering the lowest possible performance incentives and pretty much losing any control they have over quality assurance. Then they proceed to spend humongous amounts of money on cleaning up and further refining the training data sets, money with which they could've hired English Lit majors in the first place, who would've given them a perfectly classified data set from the very beginning.


But then STEM would have to admit that English Lit Majors didn't waste their money /s


One of my university professors liked to quip: next time someone tells you that you're arguing semantics, ask them what semantics means. That'll teach 'em.


English Lit Majors can't find work? Try MTurk!


I was using MTurk for labeling about 10 years ago.

To see the other side, I also did a one-month stint as an MTurk worker, earning about $300.

It is absolutely horrible work. I used the MTurk subreddit to find the "decent" jobs, and I had a special Firefox extension that ranked the job givers, etc.

All jobs paid below first-world minimum wage and were incredibly depressing. I think the adult-content ones were the worst.

"Jian Yang: No! That's very boring work"

The worst thing was that you could not stop to think; you had to keep going if you wanted to make at least a few bucks an hour.

There are two solutions to the labeling issue. The first is to pay well (at least $20 an hour, no matter the location).

If the workers feel exploited you do not get good results no matter the sanctions.

The second, and even better, solution is to find people who give a shit about your task.

That is how I've done 50k lines of labeling to train Tesseract OCR on a rare font. We have a few volunteers, and they know this is important work (preserving cultural heritage, non-profit, national library, etc.).

Old reCaptcha had a similar "feel good" element.

Compare that to the newest Google reCaptcha: you know those labels are going to be used for evil at some point in the future.


I'm eager to see what nefarious things Google will do once they've mastered the art of identifying all the buses in a photo!


What tells me to start worrying is if they get very good at identifying paperclips.



their motorcycle, hills, signal light, and traffic light AI will break the world when it's over!


> To see the other side, I also did a one-month stint as an MTurk worker, earning about $300.

are you me?

I did the same thing (worked as an MTurk labeler for 2 weeks), which convinced me never to use MTurk for anything even remotely important.

I've been able to use semi-supervised approaches with actual domain experts reviewing outputs.


There are ways to detect problems with results due to things like fatigue or waning attention. And there are ways to deal with those problems (like forced break time and switching up tasks).

It's a wonder that none of that is built in. But maybe things like MTurk aren't built to maximize worker effectiveness because it costs too much. Are there better quality crowd-sourcing options?


> Compare that to the newest Google reCaptcha: you know those labels are going to be used for evil at some point in the future.

I, for one, always try at least once to get something wrong, often several times if it doesn't go through immediately, depending on the urgency of my task. I hate being made to work for someone else like this.


> I wonder how others do this kind of thing.

We did the exact same for a text classification project.

The multi-week grind was awful, but it meant (1) we had a really good understanding of our data, and (2) we discovered surprising edge cases that we would have missed otherwise.

There is a very large fixed overhead you need to pay when you start outsourcing that work, so doing it yourself stays cheaper up to a much larger scale than you'd normally expect.


> I’m still shocked how low the quality of Mechanical Turk was

I don't know about Mechanical Turk, but there is a crowdsourcing platform by Yandex. The pay is so low that the only reasonable way to earn anything is to find a task that is not properly validated and submit random answers from multiple accounts (because there are speed limits). Usually those are tasks from naive foreign companies that don't know about validation.

So if you want high quality, you need to implement proper validation, triple-check every label with different people, and not expect that someone is going to do it well for $5/hour. And maybe you should learn how a crowdsourcing service looks from the worker's side, for example by registering and trying some tasks yourself, or by reading the workers' forums.


Why shouldn’t someone do it for five dollars per hour, and do it properly lest they get fired? Seems like a very easy job and easy to supervise.


Money isn't what motivates people most of the time. This job is boring and unfulfilling; you'll get poor results unless you're offering a life-changing amount of money.


If it's such an easy job, why outsource it instead of doing it yourself?


Because there’s other stuff you need to do that you can’t outsource.


Surely if it's such an easy job, you can fit it in alongside your other tasks.

If you can't, then perhaps it's worth re-examining whether it's as easy as you think it is.


It’s easy but requires time. Like turning a page is easy but turning a million pages takes time.


That's what I'm saying. Classifying one data point is very straightforward and brings negligible value to a company. Reliably classifying hundreds of thousands of them is very complicated and not at all easily supervised. And if your company's business model is based on applying trained models to $real_world_problem, it doesn't just bring a lot of value to your company, it's literally critical for its success, just like a solid CI/CD pipeline or having a good security process.

It's attractive to think that this is just like classifying one data point over and over again. It's nothing like that, just like crossing the Atlantic from Galway to New York is nothing like kayaking around Mutton Island over and over again.


Are you trolling us here? The act of classifying is easy. It needs to be repeated many times. So you scale up by hiring many people to do it.


No, I am not trolling you here.

First of all, real-life data sets have hundreds of thousands of data points, and a single person can comfortably classify a few hundred, maybe a few thousand. So scaling it up to a point where it's easy for every person involved requires hiring a team of dozens, or even 100+ people. That is absolutely not easy, especially not on short notice, and not when it's 100% a dead-end job, so it's difficult to convince people to come do it in the first place. I have yet to meet a single company whose idea of scaling it up involved anything more elaborate than "we're gonna hire three freelancers". At 30k data points per person, that's about an order of magnitude away from "easy", and transferring that order of magnitude to the hiring process ("we're gonna hire thirty freelancers") isn't trivial at all.

Second, it is absolutely not easy to supervise. QC for classification problems is comparable to QC for other easily replicable but difficult-to-automate industrial processes, like semi-automatic manufacturing. There is ample literature on the topic, from the late industrial revolution all the way to the present, and all of it suggests that it's a very hairy problem even before taking the human part into account. Perfect verification requires replicating the classification process. Verification by sampling makes it very hard to guarantee the accuracy requirements of the model. Checking accuracy post factum poses the same problems.

This idea that labeling training data is a simple job that you can just outsource somewhere cheap is the foundation of a bad strategy. Training data accuracy is absolutely critical. If you optimize for time and cost, you get exactly what you pay for: a rushed, cheap model.


Design a machine, possibly including humans as "parts," to automatically turn those million pages. Maybe the machine is just a single human, serially, turning pages, but that's slow. There must be something better!

How easy is it to design a reliable system, considering human/machine interaction and everything we know about human behavior and constraints like human attention, potential injury, dry fingers, etc.?

How simple is the design?


Because they can make more rolling carts around in Walmart.


I mean for places where $5 per hour is an attractive wage.


> Assuming I have 100k text blurbs I want to classify for sentiment and/or emotion and want at least 95% accuracy, what are the options?

The way I think Google does it with reCAPTCHA is to request a classification for each item multiple times. If the answers differ, keep sending the item out until you get a consensus. That weeds out the responses that were basically just random clicks.
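
Concretely, that aggregation step might look something like this (a minimal sketch; the three-labels-per-item setup, the 2-of-3 consensus rule, and the consensus() helper name are all just illustrative):

    from collections import Counter

    def consensus(labels, min_votes=2):
        """Return the majority label if at least min_votes raters agree, else None."""
        if not labels:
            return None
        label, count = Counter(labels).most_common(1)[0]
        return label if count >= min_votes else None

    raw = {
        "blurb_1": ["positive", "positive", "neutral"],
        "blurb_2": ["negative", "positive", "neutral"],   # no consensus yet
    }
    resolved = {item: consensus(votes) for item, votes in raw.items()}
    needs_more_raters = [item for item, label in resolved.items() if label is None]
    print(resolved, needs_more_raters)   # blurb_2 goes back out for more labels

Items that never converge can be dropped or escalated to an in-house reviewer.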


I haven't worked in this space specifically, but next time you are thinking of outsourcing this type of work I would suggest giving some Filipino VAs a shot. You can hire fluent, sometimes native English speakers who are motivated at $4-7/hr. Actually even less but I stick to the top of that range personally. (I use OnlineJobs.ph to find people)


I'd love to chat. Want to reach out to the email in my profile? I'm the founder of a startup solving this exact problem (https://www.surgehq.ai), and previously built the human computation platforms at a couple FAANGs (precisely because this was a huge issue I always faced internally).

We work with a lot of the top AI/NLP companies and research labs, and do both the "typical" data labeling work (sentiment analysis, text categorization, etc.) and a lot more advanced stuff (e.g., search evaluation, training the new wave of large language models, adversarial labeling, etc.) -- so not just distinguishing cats and dogs, but making full use of the power of the human mind!


Good news for you: being your target audience, we actually did have you guys on our radar.

For the scale of our project, however, the price point was prohibitive.

We ended up building a small CLI tool that interactively trained the model and allowed us to focus on the most important messages (e.g. those where positive/negative sentiment scores were closest, the labels with the smallest volume, etc.).

EDIT: Looking at your website now, it seems like you also just provide good tooling for doing these types of things yourself? If that’s the case, I wouldn’t have minded paying $50-$100 for a week of access to such a tool. But $20/hr to hire someone to classify data which we would still need to audit afterwards was too much for us.


$20/hour to classify data sounds reasonable though?

If you have more time than money it might not make sense, but at that price point I could save myself a lot of time by just working a few extra hours doing SE and letting someone else do 3x that amount of labelling.


I fully agree $20/hr is reasonable; it just was too expensive for us at the time.

So in the end the whole problem boils down to “quality is (more) expensive”; but MTurk is a special case since they’re so heavily positioning themselves as “the” solution for this and they’re terrible.


Why don't they label using the same method as CAPTCHAs?

Show the same image to 10 people, and keep only the labels with high confidence (i.e. high agreement).


Because that literally costs 10x


> I wonder how others do this kind of thing. Assuming I have 100k text blurbs I want to classify for sentiment and/or emotion and want at least 95% accuracy, what are the options?

Fwiw, in a similar situation I did about half myself and farmed out the other half to my retired parents. Using trusted people was the only way I found I could get high accuracy without spending thousands of dollars. But, as I frequently tell people, the hard part of ML isn’t the model/code, it’s the training data.


I was going to recommend https://gengo.com/sentiment-analysis/

But it seems they were acquired a few years ago, so I have no idea if the quality is still the same.


You have experience with this, so you're probably the perfect person to ask: why didn't you just use the old (inaccurate) labels, perform some clustering-based operation, and re-label the clusters? Does that even make sense?


The old (inaccurate) labels were such complete and utter shit (pardon my French) that they may as well have been random.

I honestly believe most MTurkers just clicked random BS in order to complete the tasks as soon as possible.

What I ended up doing was building a Python CLI-based tool that made it extremely fast for us to classify messages; after seeding it with about 1000 classifications, we would then focus on messages along certain dimensions: e.g. “contradictions” (“positive” and “negative” scores as close as possible, or “angry” and “happy”), “least” (the labels with the smallest volume; it was surprisingly difficult to find positive and uplifting tweets, and you don’t want a dataset with 90% negative messages!), etc.

That way we worked through the data and ended up with a pretty decent dataset in about a week’s time.

No idea how others approach this type of problem, but it’s what I came up with.
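
For anyone curious, the “contradictions first” selection boils down to something like the sketch below. This is not the actual tool, just an approximation using scikit-learn; the TF-IDF features, logistic regression, and batch size are stand-in choices, and it assumes the seed labels cover at least two classes.

    import numpy as np
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression

    def most_contradictory(unlabeled_texts, labeled_texts, labels, batch=50):
        """Indices of unlabeled texts where the top two class probabilities are closest."""
        vec = TfidfVectorizer()
        X_lab = vec.fit_transform(labeled_texts)
        X_unl = vec.transform(unlabeled_texts)
        clf = LogisticRegression(max_iter=1000).fit(X_lab, labels)
        proba = clf.predict_proba(X_unl)
        top2 = np.sort(proba, axis=1)[:, -2:]      # two largest probabilities per row
        margin = top2[:, 1] - top2[:, 0]           # small margin = model can't decide
        return np.argsort(margin)[:batch]          # label these next, retrain, repeat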


>I honestly believe most MTurkers just clicked random BS in order to complete the tasks as soon as possible.

This is correct based on my (limited) Mechanical Turk experience. Most tasks pay peanuts (the minimum payout can be as low as $0.01), so the only reasonable way to make an income is to complete as many tasks as humanly possible, and doing anything other than clicking random buttons would slow workers down. I doubt paying more could overcome that, because so many people engage with the platform in bad faith.


You can filter out bad workers by preparing an additional well-labeled dataset and removing those who made even a single mistake in it. You can also give the same task to several workers and check whether they give the same label. However, this won't protect against a bot using multiple accounts that gives answers based on a hash of the question, so that the same question gets the same answer from every account.
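
A minimal sketch of that first check, the gold set (the function and variable names here are made up; min_accuracy=1.0 encodes the "even a single mistake" rule, and you can relax it):

    def passing_workers(gold, answers, min_accuracy=1.0):
        """gold: {item: true_label}; answers: {worker: {item: label}}.
        Returns the set of workers whose accuracy on the gold items meets the cutoff."""
        passed = set()
        for worker, given in answers.items():
            graded = [given[item] == label for item, label in gold.items() if item in given]
            if graded and sum(graded) / len(graded) >= min_accuracy:
                passed.add(worker)
        return passed

The same-task-to-several-workers check is just the consensus filter sketched further up the thread.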


"removing those who made even a single mistake" means you don't want humans working on this lol. Humans will always make mistakes.


Doesn't this bias the labels?


Probably does, but I don't know how much. If it's mostly okay, it's going to be easier to correct, I'd assume.


> I wonder how others do this kind of thing. Assuming I have 100k text blurbs I want to classify for sentiment and/or emotion and want at least 95% accuracy, what are the options?

Use the consensus of 3 or more annotators (or median).


But is paying 3x $5/h getting better results than hiring a local collage student for $15/h?


Won't you just wind up with a collage from either source?

Answer is yes, because those $5/h workers are likely just as educated, but from a less affluent part of the world.


> I’m still shocked how low the quality of Mechanical Turk was for just sentiment

Given that MTurk workers earn based on how quickly they complete a task, not how accurately, I'm struggling to understand how anyone would expect quality.


Completely agree on the need for serious commitment and attention!

Funnily enough, though, many ML engineers and data scientists I know (even those at Google, etc., who depend on human-annotated datasets) aren't familiar with these kinds of errors. At least in my experience, many people rarely inspect their datasets -- they run their black-box ML pipelines and compute their confusion matrices, but rarely look at their false positives/negatives to understand more viscerally where and why their models might be failing.

Or when they do see labeling errors, many people chalk it up to "oh, it's just because emotions are subjective, overall I'm sure the labels are fine" without realizing the extent of the problem, or realizing that it's fixable and their data could actually be so much better.

One of my biggest frustrations, actually, is when great engineers do notice the errors and care, and try to fix them by improving guidelines -- but often the problem isn't the guidelines themselves (in this case, for example, it's not like people don't know what JOY and ANGER are! Creating 30 pages of guidelines isn't going to help), but rather that the labeling infrastructure is broken or nonexistent from the beginning. That's why Surge AI exists, and why we're building what we're building :)


This kind of outsourcing is quite common, and at the prices I heard some time ago, you could run each utterance by 3 to 5 people, which gives you an idea of the reliability.

But the core of the problem is 27 emotions. That's really asking for trouble.
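
For reference, one standard way to turn those 3-to-5 redundant judgments into a reliability number is Fleiss' kappa. A minimal sketch (the toy counts below are made up, and it assumes the same number of raters per utterance):

    import numpy as np

    def fleiss_kappa(counts):
        """counts: (n_items, n_categories) matrix of how many raters chose each category."""
        counts = np.asarray(counts, dtype=float)
        n = counts.sum(axis=1)[0]                       # raters per item (assumed constant)
        p_j = counts.sum(axis=0) / counts.sum()         # overall category proportions
        P_i = (np.square(counts).sum(axis=1) - n) / (n * (n - 1))
        return (P_i.mean() - np.square(p_j).sum()) / (1 - np.square(p_j).sum())

    # 3 raters, 3 utterances, categories [joy, anger, neutral]
    print(round(fleiss_kappa([[3, 0, 0], [1, 2, 0], [0, 0, 3]]), 2))   # 0.65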


> But the core of the problem is 27 emotions. That's really asking for trouble.

I'm not sure I agree. The examples highlighted in the blog post aren't cases of slight mislabels (like mislabeling frustration as anger, for example). They are often labeled as the polar opposite of what they should be.

Though perhaps what you are saying here is that low-paid workers won't bother to look through a list of 27 emotions to find the right one, and thus they are more likely to label at random.


I always wonder how much of this should actually be done to the "coding" [1] standards of the social sciences. Social scientists working with qualitative data start their analysis by applying codes to the data in various specialized ways. In more rigorous studies, those codes are assigned independently by different researchers and then cross-checked. There is a lot of literature on how to come up with codes (e.g., Grounded Theory) and how to go further. I keep thinking that we need to bridge the gap between engineers and social scientists working on the same problems.

[1] https://en.wikipedia.org/wiki/Coding_(social_sciences)


In my first few months as a programmer in the '90s, I realized that human inputs were sketchy. It was a form field asking which US state someone was from. Fifty possible states, 2,500 distinct entries. Sure, there was a bit of garbage, but 95+% were recognizable states... but why the random capital letters? How do you get a space in "Wyoming"?

It was a good lesson at the start of my career that I see playing out over and over. When I see some cool demo of statistics across the country or globe, I'm more impressed by the effort of cleaning the data than the stats behind it.
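
The cleanup for that particular mess is a nice, small example of what that effort looks like. A sketch (the state list is trimmed for brevity and the fuzzy-match cutoff is arbitrary):

    import difflib

    STATES = ["Alabama", "Alaska", "Arizona", "Wyoming"]   # the full list of 50 in practice
    _CANONICAL = {s.replace(" ", "").lower(): s for s in STATES}

    def normalize_state(raw, cutoff=0.8):
        """Map a messy free-text entry to a canonical state name, or None if unrecognizable."""
        key = "".join(raw.split()).lower()                 # drop stray spaces and case
        if key in _CANONICAL:
            return _CANONICAL[key]
        close = difflib.get_close_matches(key, list(_CANONICAL), n=1, cutoff=cutoff)
        return _CANONICAL[close[0]] if close else None

    print(normalize_state("wyO ming"))   # Wyoming
    print(normalize_state("Alabma"))     # Alabama (fuzzy match catches the typo)
    print(normalize_state("N/A"))        # None: off to the garbage bucket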


Why can’t you run the dataset several times with different labellers, building up a statistical probability for each label rather than a pure guarantee? It’s worth noting that different cultures have different interpretations of emotions too (famously, Russians don’t smile much even though they are extremely helpful in my experience; I’m still not quite sure what the head wobble in India actually means, let alone some of the finer misinterpretations that were going on when I lived in Japan).


Or pay better, like other people are suggesting. If you have to "average over" 3 sets of labels, why not just pay $15 instead of $5 and save the computation, if $15 or so seems to be the threshold for getting humans to be good data labelers?


>Farm it out to low-paid non-native speakers

The paper claims:

>“All raters are native English speakers from India.”


US English != Indian English, especially if you have to actually know the cultural details behind some sentences. I bet native US speakers would have similar failure rates labeling English sentences from Indian media, because the two belong to different cultures.


The blog post highlights this specific point - "US English" and "Indian English" really aren't the same English (in fact, I'd probably go even further and state that "Reddit English" and "US English" probably aren't the same English either).

Likewise, the Common Voice English dataset isn't great for ASR training outside India, either. There's a huge proportion of Indian speakers, and their data doesn't really help train ASR systems for non-Indian accents.


Is the "right answer" a classification based on US English with knowledge of US cultural context, or is the goal to build a global sentiment data set?

You and OP ("the Indian labelers don't know how to do it correctly") seem to want the former, so it would be good to state that goal upfront.


I think it’s valid to question US-centrism broadly in datasets like this, but presumably that choice was made in selecting the Reddit sample, which is dominated by North American English.

A model is more useful if it approximates the speaker’s intent, not the listener’s interpretation. “Right” is complicated in language, but it’s hard to see how you’d use a model full of cross-cultural misunderstandings.


https://www.heritagexperiential.org/language-policy-in-india...

>English, due to its ‘lingua franca’ status, is an aspiration language for most Indians – for learning English is viewed as a ticket to economic prosperity and social status. Thus almost all private schools in India are English medium. Many public schools, due to political compulsions, have the state’s official languages as the primary school language. English is introduced as a second language from grade 5 onwards.


Perfectly logical choice if you're building a machine to replace call centers.


Can a human validate a label with less effort than it took to create it? Or maybe validating statistically is enough?


Depends what it is. I've had reasonable success with "validating" ASR transcripts by loading up the annotator's transcript, running the audio at 2x speed and just clicking "yes" or "no" to keep the good ones and bin the bad ones. It's roughly 5x faster to do this than to come up with transcripts from scratch, so if the annotators are 5x cheaper, then you come out ahead. You can go even further and pre-filter labels to discard any where inter-annotator agreement falls below some threshold (i.e. 3 people label the same piece of data, and you only include a sample when at least 2 annotators give the same label). You can also use that to discard all annotations from annotators who regularly disagree with the majority.

This is just the reality of outsourced data labelling. One thing I think is really important is to structure the compensation well, so that labellers get paid more when they do a better job. Paying per sample is a terrible idea, and even I was guilty of this: back in university I was paid $20 or so to hand-write 500 words on a resistive touchscreen to train a handwriting recognition model. I won't say I half-assed it, but I remember trying to get through it as quickly as possible to get my money and go for beer (I think I also justified it to myself on the basis that sloppy samples would help make the bounds of the dataset distribution more robust!).
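
A minimal sketch of those two filters, the 2-of-3 majority and the annotator-disagreement cutoff (the names and the 30% threshold are just illustrative):

    from collections import Counter, defaultdict

    def filter_labels(annotations, min_votes=2, max_disagreement=0.3):
        """annotations: {item: {annotator: label}}.
        Returns (item -> accepted label, annotators who disagree with the majority too often)."""
        accepted = {}
        disagreed, total = defaultdict(int), defaultdict(int)
        for item, votes in annotations.items():
            label, count = Counter(votes.values()).most_common(1)[0]
            if count < min_votes:
                continue                              # no agreement: drop the sample
            accepted[item] = label
            for annotator, given in votes.items():
                total[annotator] += 1
                if given != label:
                    disagreed[annotator] += 1
        suspect = {a for a in total if disagreed[a] / total[a] > max_disagreement}
        return accepted, suspect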


I mean, humans aren't great at reading emotions either. 30% error is probably human level.


The film director Alfred Hitchcock once commented that in a tense scene, all he needed was a character showing a more or less neutral face, and viewers would read what he needed into it.


that's a really interesting quote! I've thought a lot in the past about specific things in older movies that I enjoy and that some other people can't stand. I think this might be a big part of it.



