
US English != Indian English especially if you have to actually know the cultural details behind some sentences. I bet US native speakers would have similar failure rates at labeling English sentences from Indian media, because they belong to different cultures.


The blog post highlights this specific point - "US English" and "Indian English" really aren't the same English (in fact, I'd probably go even further and state that "Reddit English" and "US English" probably aren't the same English either).

Likewise, the Common Voice English dataset isn't great for ASR training outside India. A large proportion of its speakers are Indian, and their data doesn't do much to help train ASR systems for non-Indian accents.


Is the "right answer" a classification grounded in US English and US cultural context, or is the goal to build a global sentiment dataset?

You and OP ("the indians labelers don't know how to do it correctly") seem to want the former, so it would be good to state that goal upfront.


I think it's valid to question US-centrism in datasets like this broadly, but presumably that choice was already made in selecting the Reddit sample, which is dominated by North American English.

A model is more useful if it approximates the speaker’s intent, not the listener’s interpretation. “Right” is complicated in language, but it’s hard to see how you’d use a model full of cross-cultural misunderstandings.



