1) Honestly, I doubt it matters how randomly the sample was selected. As long as it was representative of the population, does minor variation in the selection bias make much difference?
Even if the results are 'only valid to this group of people' rather than, broadly, the entire Microsoft engineering team, I'd say it's probably reflective of the broader situation as well.
Why would a small subset of the population have salary ranges that were deeply divergent from all the other groups?
Sure, depends on the sample size, but that brings me to 2...
2) Really? I'm pretty happy to accept that when a large group of engineers gets together, collects some data, and then generates some results from it...
...they're not all completely clueless about what statistics are.
If they came to the conclusion that the results were biased, I think you'll find they did the math to justify those conclusions.
People -> not stupid. Especially not large groups of engineers.
(Also, while I'm here: what does 'statistically significant' mean anyhow? 99% +- 1%? Throw some words around without any meaning attached to them, and sure, we can totally argue about whether things are significant or not.)
Let me emphasize: '...they're not all completely clueless...'
What's the chance that one lone engineer is pedantic in math and stats?
Not 100%, sure. Not every engineer has that background, especially in computer-related fields (is stats even taught in college for CS these days?).
...but what are the odds that not one of the people involved either 1) has a pedantic streak and a strong math/stats background, or 2) bothered to do some learning and read books (like that one) while working on this or some other stats-related project?
Really?
Come on~
It beggars belief.
I'm willing to wager that a random sample of 10 Microsoft engineers will include at least one person who understands stats. Otherwise, that company is completely broken.
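The wager above is easy to put numbers on. A quick sketch, where the fraction p of engineers with a solid stats background is an invented assumption:

```python
# Back-of-envelope for the wager: if some fraction p of engineers has a
# solid stats background (p is a made-up assumption), the chance that a
# random sample of 10 contains at least one such person is 1 - (1-p)^10.
for p in (0.1, 0.2, 0.3, 0.5):
    at_least_one = 1 - (1 - p) ** 10
    print(f"p = {p:.0%}: P(at least one in 10) = {at_least_one:.1%}")
```

Even if only one engineer in five qualifies, you'd expect the sample of 10 to contain at least one of them roughly nine times out of ten.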
I would assume people who feel they are being underpaid might be more likely to respond to such a survey. And at the same time people who realize they are getting paid more might report a lower salary. This can completely skew the results.
As long as you rely on people volunteering this info you will always have such problems, no matter how small the p-value is.
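For concreteness, here's roughly what a p-value does buy you: it quantifies how likely a gap of a given size is under random chance, not under selective response. A sketch with invented salary numbers, using a simple permutation test:

```python
# A rough sketch of what "statistically significant" actually means,
# via a permutation test. All salary numbers are invented for illustration.
import random
import statistics

random.seed(0)

# Hypothetical reported salaries (in $k) for two groups from a survey.
group_a = [100, 105, 98, 110, 102, 107, 99, 104]
group_b = [95, 97, 92, 100, 94, 96, 93, 98]

observed_gap = statistics.mean(group_a) - statistics.mean(group_b)

# Null hypothesis: group labels don't matter. Shuffle the labels many
# times and count how often a gap at least this large appears by chance.
pooled = group_a + group_b
n_a = len(group_a)
extreme = 0
trials = 10_000
for _ in range(trials):
    random.shuffle(pooled)
    gap = statistics.mean(pooled[:n_a]) - statistics.mean(pooled[n_a:])
    if abs(gap) >= abs(observed_gap):
        extreme += 1

p_value = extreme / trials
print(f"observed gap: {observed_gap:.1f}k, p = {p_value:.4f}")
# Convention: p < 0.05 gets called "statistically significant", i.e. a gap
# this big would arise from label-shuffling alone less than 5% of the time.
```

Even a tiny p-value here only says the gap is unlikely to be shuffling noise; it's silent about who chose to report in the first place, which is exactly the point above.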
It can skew the results, but it can't completely invalidate them.
Remember here, we're not trying to robustly estimate populations of engineers with specific wages. We're looking at a data set you would expect to be more or less without variation, and being surprised when there is 1) variation, and 2) that variation appears to have racial/gender/whatever correlations.
Now, I get what you're arguing: the sampling is biased, so any of those seeming correlations may themselves be biased (e.g. a specific demographic consistently under-reports its income, the people who do report tend to be in the 'lower' bracket of incomes, etc.)... well, fair enough.
Caveat any results you come up with; but it's not like the data is going to be completely useless and meaningless because it's noisy.
This isn't an arbitrary academic exercise; it's a tool for people to use to evaluate their own job positions.
What's the alternative? Have no idea at all what other people are earning? If you don't have any data, you can't do anything.
Even if the data you have is noisy, it'll give you a lot more insight than nothing.
Sure, I don't endorse getting righteous and taking it up the ladder ('My <insert group here> is discriminated against!') without doing your due diligence about samples and caveats.
...but taking a spreadsheet like this to your next pay review? Your manager had better have some good answers if you find yourself at the bottom of the curve.
There's a difference between noisy and biased. As long as the data is only noisy (that is, it has some random variations) I totally agree with you that it's fine to use, and that the results should be robust. However, if there is some sort of bias that only applies to a particular subset of the samples, all bets are off.
Just as a totally imaginary example: what if men with higher incomes are more likely to share their salary information than men with lower incomes, while at the same time the situation is reversed for women?
So now you will end up with more reports of high income from men and more reports of low income from women, even if their pay distributions are exactly the same.
I am of course not saying that this is happening here, but these kinds of things would indeed completely invalidate the results.
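That imaginary example is easy to simulate. A sketch with invented numbers, where both groups draw from the exact same salary distribution but respond at different rates depending on where they sit in it:

```python
# A minimal simulation of the imaginary example above: identical pay
# distributions, but response rates that depend on where you sit in the
# distribution. All numbers are invented for illustration.
import random
import statistics

random.seed(1)

def draw_reports(n, report_prob_if_high):
    """Draw n salaries from the SAME distribution; above-average earners
    respond with probability `report_prob_if_high`, below-average earners
    with the complementary probability."""
    reports = []
    for _ in range(n):
        salary = random.gauss(100, 15)  # same true distribution for everyone
        is_high = salary > 100
        p = report_prob_if_high if is_high else 1 - report_prob_if_high
        if random.random() < p:
            reports.append(salary)
    return reports

# Men: high earners over-report; women: high earners under-report.
men = draw_reports(5000, report_prob_if_high=0.8)
women = draw_reports(5000, report_prob_if_high=0.2)

print(f"reported mean (men):   {statistics.mean(men):.1f}")
print(f"reported mean (women): {statistics.mean(women):.1f}")
# The survey shows a sizeable "gap" even though the true distributions
# are identical -- selection bias, not noise, and more data won't fix it.
```

Note that collecting more responses doesn't help: it only tightens the confidence interval around the wrong answer.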
Engineers by and large don't know much of anything about proper statistics, especially software engineers where there isn't even an intro stats course in most curriculums.