High-achieving teacher sues state over evaluation labeling her ‘ineffective’ (washingtonpost.com)
76 points by ColinWright on Nov 3, 2014 | 68 comments


> If it sounds as if it doesn’t make a lot of sense, that’s because it doesn’t.

To me, it sounds like it makes absolute sense. If you want to evaluate how good a teacher is, one of the things you need to measure is how well their students performed vs. how well those same students would have performed under the hypothetical average teacher.

The particular VAM system used to estimate this metric may be flawed or even completely broken. But at the point in the article this pull quote is located, that argument had not been made.

> VAMs are generally based on standardized test scores and do not directly measure potential teacher contributions toward other student outcomes.

> VAMs typically measure correlation, not causation: Effects – positive or negative – attributed to a teacher may actually be caused by other factors that are not captured in the model.

These are criticisms of the VAM system, but I imagine they are equally valid criticisms of other methods of teacher evaluation. I find it hard to imagine a practical system of evaluation that accounts for all background factors.

> The lawsuit shows that Lederman’s students traditionally perform much higher on math and English Language Arts standardized tests than average fourth-grade classes in the state.

This argument uses the same standardized test scores and confounding background variables that the VAM system is criticized for! Further, it seems obvious that a set of students doing well does not, in isolation, indicate their teacher is a good teacher. Students in a top set or from a great school or from a wealthy neighborhood might be expected to outperform state averages regardless of teacher quality.

My biggest problem with the article is it doesn't describe alternate methods of teacher evaluation. VAM may be flawed, but how does it compare to the other methods? If we accept that teacher evaluation based on student performance is necessary (is it? I have no idea!), what's a better way to do it?


> If you want to evaluate how good a teacher is, one of the things you need to measure is how well their students performed vs. how well those same students would have performed under the hypothetical average teacher.

But you can't measure that.

> it seems obvious that a set of students doing well does not, in isolation, indicate their teacher is a good teacher.

True. But in this case, at least, we're not talking about "in isolation". We're talking about a 17-year track record of a teacher's students doing well.

> what's a better way to do it?

When I went to school, the people who made these judgments were my parents. You can't make these judgments by formula, and you can't make them if you don't know the details of each individual case. To me, the fact that so many schools are fixated on "data-driven" student evaluations means that parents are not engaged.


>"When I went to school, the people who made these judgments were my parents. You can't make these judgments by formula, and you can't make them if you don't know the details of each individual case. To me, the fact that so many schools are fixated on "data-driven" student evaluations means that parents are not engaged."

Many parents know their children are receiving a substandard (or even damaging) 'education'; the problem they often face is that they are powerless to fire the teachers, or switch schools. Being 'engaged' does basically nothing to fix the schools in these cases. Rich parents exercise school choice by moving to affluent communities with good schools, but the poor often do not have this option.

The VAMs aim to reliably give the school administrators access to the knowledge the parents (and often principals) usually already have, as well as give cause for discipline, incentive, or firing.


> Many parents know their children are receiving a substandard (or even damaging) 'education'

Many parents believe that.

Many parents have children for whom that is true.

The overlap between those two sets, OTOH, may be smaller than you think.

> the problem they often face is that they are powerless to fire the teachers, or switch schools. Being 'engaged' does basically nothing to fix the schools in these cases.

IME, this isn't really true -- but the perception that it is true results in parents not being engaged. When parents are active in addressing perceived problems with teachers in public schools, it is very effective in getting those teachers out of the classroom. (And I've seen it happen numerous times, both to bad teachers when parents acted reasonably on real problems, and to good teachers when parents acted unreasonably out of offense that their special-snowflake children weren't getting handed grades on a silver platter. It doesn't always mean the teacher gets fired; sometimes they get laterally transferred or technically promoted to a position in the school system that is out of the classroom and away from students, and sometimes they just voluntarily leave teaching. But parent activism is quite effective at getting teachers out of the classroom.)


I could be interpreting your comment wrong, but it seems to put the onus heavily on the teacher, which seems a bit naive. My friend just started teaching in New York actually. And she specifically wanted to teach at a Title 1 school (i.e. poor) in order to reach students with less income and opportunity (she teaches an East Asian language in a predominantly non-Asian minority community). That was the idealized plan anyway. Yet, the school won't give her basic supplies. There is no lounge. No fridge. Ok, fine, that's tolerable (odd to me since I too was raised in an affluent community). But the teachers need paper. There is a locked-up supply room full of supplies, but the teachers are told they can't have access to it. Why? Nobody knows. Printing rights are curbed.

And on top of all that, the NYC DoE requires a master's. That equals debt, if you went to a "good" school. But the DoE pays crap. Oh and for some reason at one point, the DoE lowered requirements to be a principal. Three years of part-time teaching, and you could be a principal. Her principal taught dancing for 3 years, and is now purportedly fit to analyze the effectiveness of foreign language teaching methods. But anyway, that's beside the point, that's just describing the environment.

Her program is new. But instead of asking students/parents and filtering for those who might be interested in an East Asian language, the administrators decided to randomly force students into the class, regardless of their prior language history or their year. You know what you get with that? Hostile students. The kind that scream fuck you in your face. OK, fine, just "standard" difficult students. But like any other job, if your "manager" has your back, you can usually deal. But this school doesn't believe in detention. Okay... The "dream" is that if a student is causing trouble, the deans will talk to said student. Uh oh, what wasn't accounted for was the saturation of the deans' time due to trouble students. Now you got deans telling you that, sorry, they can't deal with disruptive students telling you to fuck off because they're overloaded. So now you have no fire support and you're in the trenches alone. I'm not even going to get into the gay teacher's story. Once the students caught onto that...

This doesn't even include off-the-clock work that is required, which I won't get into. I'm sorry, but as someone in the tech field, or any privatized field, the shit that teachers have to put up with is insane.

I know teachers have been demonized, but the turnover rate in NYC for teachers is apparently extremely high (I have another friend working at the DoE itself). I'll have to get a source, but I seem to recall her saying it was around 70% after 2 years. And after hearing all the ridiculous war stories, I'm not surprised. Ha, another one of my teacher friends was moved to an empty classroom. Upon asking for desks, the administrators told her "we don't know where they are" and left it at that. So she had to essentially salvage desks marked for discard.

I'm not exactly sure how VAM works, but I'm skeptical that an algorithm can model something so complex.


> I'm not exactly sure how VAM works, but I'm skeptical that an algorithm can model something so complex.

And even if it could, should an algorithm like this be obscured from view? Should we rely on "black box" algorithms like this, or should we at least insist that, if not the code, the research behind the code be released openly so that it can be held to proper scrutiny?


People are not engaged. They are fixated on numbers, models, and abstractions, with the assumption that the numbers, models, and abstractions have actual meaning tied to them. They don't, aside from creating more students that create more systems that define more numbers, models, and abstractions. Some of these students get pissed off and try to create the opposite.

This is why the humanities are important. You can have the soundest logical systems, with the most elegant mathematical models, and completely miss the point. People get so caught up in the minutiae of measurements that they forget to see the big picture. Measurements don't mean anything when the measurements are used to measure themselves.

I swear, sometimes I really wonder whether society has its head up its ass. If these systems can model 'theoretical students' under 'better conditions', then why aren't the 'theoretical student' models running the world? Oh, that's right, because there's a gigantic difference between data and theory. How do they even know what a better student is? How can anyone in their right mind define that? How can anyone even pretend to know what a great student is?

> When I went to school, the people who made these judgments were my parents. You can't make these judgments by formula, and you can't make them if you don't know the details of each individual case.

When I went to school, I used to think my grades meant something more than being a very complicated way of validating someone else's world view, in a way that tricks everyone into thinking we've made any progress at defining or understanding intelligence at all.

> To me, the fact that so many schools are fixated on "data-driven" student evaluations means that parents are not engaged.

The parents may be extremely engaged, but they just don't know whether to call their kid smart or not. If the kid complains about the book they read because it was boring, is that appropriate? If the kid programs a calculator to do 4 years of standardized test math homework, is that appropriate? Education is doing a fantastic job at driving the personality out of people by forcing them all to sit in the same box. The parents don't know, the teachers don't know, the government doesn't know, society doesn't know, yet we all pretend we know.


> But you can't measure that.
If you have a model, such as:

score_at_end_of_year = score_at_start_of_year + a * parents_income + b * parents_education_level + c * teacher_skill + random_noise

You can figure out the values of a and b by comparing the performance of different students in the same class. If you have a large enough sample size to average out the random noise, you're left with (c * teacher_skill).
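
For concreteness, here's a toy sketch of that logic on synthetic data (all variable names and coefficients are made up for illustration): fit the background terms from student-level variation, then read the teacher effect off each class's average residual.

  import numpy as np

  rng = np.random.default_rng(0)
  n_classes, class_size = 200, 25
  n = n_classes * class_size

  teacher_skill = rng.normal(0, 3, n_classes)        # one hidden value per class
  class_of = np.repeat(np.arange(n_classes), class_size)

  income = rng.normal(50, 15, n)
  education = rng.normal(14, 2, n)
  noise = rng.normal(0, 8, n)

  # "True" model from above, with a = 0.1, b = 0.5, c = 1.0
  gain = 0.1 * income + 0.5 * education + 1.0 * teacher_skill[class_of] + noise

  # Estimate a and b, ignoring the (unobserved) teacher term.
  X = np.column_stack([np.ones(n), income, education])
  coef, *_ = np.linalg.lstsq(X, gain, rcond=None)

  # The per-class mean residual is the estimate of c * teacher_skill.
  residual = gain - X @ coef
  est = np.array([residual[class_of == k].mean() for k in range(n_classes)])
  print(np.corrcoef(est, teacher_skill)[0, 1])  # high when classes are big enough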

Not a perfect method, obviously - but if the best teachers at schools full of poor kids could get more incentive pay by switching to schools full of rich kids, would that be incentivising the right thing?


> If you have a model

Then you can obviously calculate what the model outputs for any set of inputs. But that's not measuring; that's modeling. They're not the same thing. You can't measure the students' performance with a hypothetical "average" teacher because that teacher does not exist. You can only model what you think are the relevant factors involved and then input actual data into the model to see what it says.

Can such models be informative? Certainly. Are they a substitute for human judgment? No way. Yet that is what these schools appear to be doing.

> if the best teachers at schools full of poor kids could get more incentive pay by switching to schools full of rich kids, would that be incentivising the right thing?

Does that actually happen? And if it does, would this model prevent it? Once again, models are no substitute for human judgment.


> that teacher does not exist

The proposed model makes no assumptions about the "average" teacher -- it proposes a simple metric applied to student performance. Under this metric, some teachers would get a positive score because their students' scores improved more than the average. Others would get a negative score because their students' scores improved less than the average.

> would this model prevent it?

I don't know, but the goal of the model is to prevent this, or even reverse it -- that is, the best teachers would want to work with the least advantaged kids, because arbitrage -- it should be easier to make a difference for kids starting at a low performance level, and harder to make a similar difference with kids starting at a higher level.

I don't know whether this works even a little bit in practice, but the thinking sounds promising.


> The proposed model makes no assumptions about the "average" teacher

Sure it does. How else is the teacher_skill rating calibrated?

> the best teachers would want to work with the least advantaged kids

Why is this necessarily a good thing? Shouldn't there also be an incentive to have the best teachers work with the brightest students, to ensure that those students are actually challenged instead of just skating through school?


Put another way: such modeling is probably informative over the aggregate. That is, when comparing school districts to each other. But such modeling is likely not valid for an individual teacher.


Yes, and then you can figure out different values of a and b for a different teacher, different class, different year, different town, etc.

And you're really left with c*teacher_skill + noise, where the noise in the teacher_skill measure may be what results in some good teachers getting bad ratings.

In my view, empirical modeling assumes that there is some underlying regularity that can be modeled, but this assumption seems questionable. And part of the process has to be to decide an acceptable level of good teachers getting bad ratings.


Most HN readers are programmers. Just imagine this logic being applied to yourselves. Your performance is measured in a single week at the end of each year by statistically-controlled standardized tasks. Your manager's performance is then measured by the statistically-inferred effect that working for her has had on your performance given your previous year's performance and a model of how programmers' performance changes over time based on a few variables.

How well do you think that would work? Would it accurately reflect the work you had done in the previous year? Would it generate the right incentives for you to actually do good work and for your manager to help you grow?

Or is it more likely that random variation would dominate meaningful variation? That getting sick or feeling depressed that week would have markedly more impact than all the ways you really got better or worse that year? Or that it would generate perverse incentives to, say, learn about the standard task early, to prepare for the kinds of tasks you were measured on instead of the ones that produce value?

Or, for that matter, that it would destroy the motivation of everyone involved by replacing their intrinsic motivation to do well with an extrinsic motivation to get rewarded?


To be clear, most HN readers' employment and compensation are based on a manager's holistic sense of that HNer's value. We can generally be fired without cause.

If teachers want to argue that they should be in an employment regime similar to programmers, sure, I'm sympathetic. But my strong impression is that they don't want that at all.


> Your performance is measured in a single week at the end of each year by statistically-controlled standardized tasks.

My main point was that if you're going to appraise teachers based on their students' standardized test scores, it would seem to make sense to use a statistical model that tries to measure how much of an effect the teacher has on the predicted test scores of their students. I'm completely open to the argument that we shouldn't be appraising teachers based on student test scores, but that does not seem to be the position the article is advocating.


If you don't have some sort of system to extrinsically reward people with different skill levels different amounts, then the best people are eventually going to figure that out and go do something else where they'll get paid what they're worth.

If you don't believe this think about how pissed you'd be if that idiot down the hall who couldn't write good code to save his life made as much money as you do.


You seem to assume that money is the sole motivator, everywhere.

What's really going on is that you're trapped in your own world view, validated by people like you, all alike in your social circles.

My advice: at least try and get out of it a little. I'm up to my gills in the high-tech culture, working at a startup in the Silicon Valley - but my wife is a teacher. I went to a Halloween party hosted by some of her co-workers - it's a tremendous difference in terms of culture, ideals, etc. It was literally like meeting people from a different country.

Metrics aren't the final answer to every question. When dealing with people (which is what education is doing), a simple, single, narrow number used to reduce everything to a handy metric can be incredibly misleading.

Whenever I hear some folks saying that we need more metrics and more competition to make the education system better, I just want to tell them: "thank you, but sit down and shut up, cause you've no idea what you're talking about."


My wife is also a teacher(1). While I didn't go to a Halloween party with her coworkers this year I've been to quite a few other parties with teachers. So maybe don't make assumptions about my world view huh?

I never said that metrics were the final answer to every question either. Teaching is obviously an incredibly nuanced profession. If, as Michael Lewis showed us, we can't even measure baseball players' performance very well, then oh gosh it's going to be much, much harder for teachers (or programmers too!).

But that doesn't mean there aren't any differences in skill either. There are great programmers. There are not so good programmers. There are great teachers. There are bad teachers. If we deny this reality and just pay them the same amount then eventually the good ones are going to figure out they're getting screwed no matter how much they intrinsically like the gig.

1. http://www.fordham.edu/academics/programs_at_fordham_/commun...


> The particular VAM system used to estimate this metric may be flawed or even completely broken. But at the point in the article this pull quote is located, that argument had not been made.

> My biggest problem with the article is it doesn't describe alternate methods of teacher evaluation. VAM may be flawed, but how does it compare to the other methods? If we accept that teacher evaluation based on student performance is necessary (is it? I have no idea!), what's a better way to do it?

The main point I took from the article is that the VAM approach is a black box. It is data driven and we have some reason to suspect it should be useful, but we simply have no way to audit that on specific cases.

Combine that with the fact that this particular teacher is exemplary, yet got an abysmal score, we are faced with the worst possible result: Someone that should get a great score received a terrible one, and we don't know why. So we can't figure out what went wrong or how to fix it.

This isn't a new issue. Statistical models used to be more human-understandable, then over several decades neural networks and big data and so forth have become popular. They often achieve good results, but often remain partially or entirely black boxes.

It makes sense to use a black box method to recognize handwriting, for example, if it does well on average. It's fine if it messes up on some samples. But teachers deserve to be treated fairly, each and every single one.

That's the tension here - better methods on average tend to be less auditable on individual cases.


> If you want to evaluate how good a teacher is, one of the things you need to measure is how well their students performed vs. how well those same students would have performed under the hypothetical average teacher.

But underlying this assumption is that you have a good measure of student performance in the first place -- or that such a measure can easily be crafted. It's questionable if it even makes sense to quantify a student's abilities into a single catch-all number.

But even if quantification of student performance makes sense, it becomes problematic when it becomes high-stakes (i.e. teachers' payment or school funding depends upon test scores). The problem with tests (or any other metric tied to incentives) is that they are gamed over time; SATs become increasingly meaningless with the cottage industry of test-prep; and so do standardized tests become increasingly meaningless as teachers are incentivized to teach directly to them. (In politics, crime rates are gamed, jobs rates are gamed, etc.).

Were schools so terrible before standardized tests -- when administrators and other educational experts rated teachers? There were existing non-mechanical rating systems that of course predated standardized testing -- is it necessary to mechanize evaluations simply because we like the superficial illusion of objectivity, even if that objectivity bears no resemblance to the thing we are trying to measure?

Or even acknowledging that perhaps existing non-mechanized teacher performance measures may have been flawed, have standardized tests somehow improved schools? No, probably not -- teachers and students both are more miserable and are increasingly tied down by the requirements of possibly meaningless tests.


Think about it this way, using your own claim: if VAM is measuring the amount of value a teacher adds to student test performance based on an average student of similar background, and the average student in a wealthy neighborhood does well regardless of the teacher, then every teacher in every wealthy neighborhood adds no value! Does that justify firing/docking the pay of all your teachers? Obviously not. It says that the kind of value a teacher adds necessarily depends on the community. It's a qualitative difference, not a quantitative one.


> if VAM is measuring the amount of value a teacher adds to student test performance based on an average student of similar background, and the average student in a wealthy neighborhood does well regardless of the teacher, then every teacher in every wealthy neighborhood adds no value!

I'm no statistician, but I don't think this is actually true. We would expect students who are likely to do well on standardized tests under a Hypothetical Average Teacher (HAT) to benefit relatively less from a good teacher than students who are likely to do poorly on standardized tests under a HAT - at least when we use standardized test results as the metric. But presumably, any sophisticated statistical model will take this into account. You would simply adjust the model such that a unit increase in actual score vs. predicted score is "worth more" in terms of indicated teacher performance as predicted score increases. It's just a case of diminishing marginal returns, and not unique to teaching or education.

It might be harder to distinguish signal from noise as the predicted test score increases. But it still seems a better system than "the students have done well, therefore the teacher has done well".


My point is that as wealth and average test scores increase, certain kinds of value become more important in distinguishing teachers. Values which VAM does not even attempt to measure.


>"if VAM is measuring the amount of value a teacher adds to student test performance based on an average student of similar background, and the average student in a wealthy neighborhood does well regardless of the teacher, then every teacher in every wealthy neighborhood adds no value"

If you are going to make arguments against VAM, please come up with some stronger ones. Applying statistical controls for student test results in situations like these is trivial, and teachers do make a big difference. If the teachers really made no difference, why not just hire the cheapest labor available, and spend the rest of the money on extra-curricular programs?


How is that? In some communities the difference could be between learning to read and finding an appreciation for modernist poetry. Or more drastically, the difference between instilling a sense of civic duty and keeping kids out of prison. We do expect teachers to have a part in all of these things.


Speaking as someone who reads a book per week, and has spent all their life in affluent communities with highly rated educational institutions, I can confidently say that for the overwhelming majority of public schools (even those in rich suburbs), achieving functional literacy with some comprehension and analytic ability for the majority of students is a lofty goal. I do not know of any public school where the difference between a good teacher and a bad one is "between learning to read and finding an appreciation for modernist poetry"; maybe this will be a problem in the future, but it is not something we should be grappling with now.

I am not under the impression that what "keep[s] kids out of prison" is "a sense of civic duty". Good career prospects keep people out of prison; high conscientiousness may be correlated with low rates of imprisonment, but conscientiousness is also a luxury the desperate can ill afford to indulge in.


But is it done?


In NFL fandom, Football Outsiders' Defense-adjusted Value Over Average (DVOA) is perhaps the best-regarded predictive model of results. In a shoot-out on /r/nfl last year, it outdid ~20 other ratings and rankings in predicting outcomes. One can only imagine that if predicting NFL games were as politicized as school outcomes, there'd be a powerful lobby claiming DVOA "doesn't make a lot of sense".

You're right, this particular model might be bad. But something along these lines will be good, inasmuch as approximately but correctly evaluating a teacher's effect on outcomes is good. Science always wins.


It should be simple to develop a predictive model of programmer effectiveness. We can measure the number of lines of codes, number of defects committed, number of defects fixed, etc. against the average developer on the average software project and then we have a way of approximately evaluating a programmer's contributions to a software project.

Science is really hard.


See, the thing is, people are already trying really hard to do that. Practicing programmers, even.


And it's hard to do that even when the desired outcome is just good code.

Now try and do it when the desired outcome is good people (education).


Good point. Would you agree that: even if a phenomenon (for example performance of some sort of worker) is hard to model, we should still try, especially if that phenomenon provides enormous value to society?


Life is a hell of a lot more complicated than the game of football.


The people advocating for a metric ought to be the ones expected to establish that it is sensible. Nonetheless, education "reformers" have consistently been promoting metrics with enormous variation in ranking.

Of course, if your purpose is to garner the political support of people who say "it sounds like it makes absolute sense" actually measuring meaningful things is irrelevant.


> well those same students would have performed under the hypothetical average teacher.

I don't understand how you do that. I can see this hypothetical setup -- you start a genetic experiment, clone all the students' DNA. Produce the same exact # of babies. Raise them the same way, respectively, as each of the original students. Then you build a robotic teacher whose skills, personality, and experience represent the average of all the personalities, experiences and skills of all the teachers in the state (luckily you just do this once per year, and then make lots of clones, which should be easy).

Then you give the group of clones to this "average" robot teacher and you see how well the students do. (It is optional to keep the clones after this experiment, they could be used for spare parts for later maybe...).

So now you have your VAM measure and you can assign a score of 1 to 20 to this teacher.

That was all sarcastic, of course. But ok, how does this model work? Can you explain?


> I don't understand how you do that

Here's a simple way to compare someone to the average teacher.

Two tests. One at the beginning of the year, one at the end. Take the average improvement across the school, across the district, across the state. Now apply that average school-wide improvement to a classroom's original scores in the first test to see how those students should have performed on the second test. Compare that number to how they actually performed.

I came up with this in about 10 seconds, there's obviously a better way to do things, but comparing someone to the average teacher really isn't that hard to do.

The most important part is coming up with a test that actually measures understanding. Open ended questions are usually a good place to start. Many tests I have taken really require you to understand the material.
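
Something like this, as a toy sketch of the before/after comparison described above (the scores and field layout are made up for illustration):

  from statistics import mean

  def classroom_value_added(students, cohort_avg_improvement):
      """students: list of (start_score, end_score) pairs for one classroom."""
      expected = [start + cohort_avg_improvement for start, _ in students]
      actual = [end for _, end in students]
      return mean(actual) - mean(expected)

  # Example: the cohort improved 5 points on average; this class beat that.
  print(classroom_value_added([(62, 70), (55, 63), (71, 74), (48, 58)], 5.0))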


> Two tests. One at the beginning of the year, one at the end.

Is it the same test (same questions) or a different test? I would guess same type of questions.

Also, what if the students are already top performing and don't improve but actually get slightly worse? (As in, they all get 95% and then, well, a few get sick and skip a few months of school and now the class gets 94%.)

I guess I am still confused at how average expected improvement is supposed to work reliably across all those population models.

Do we assume the school-wide population of students is uniform enough to calculate a meaningful average improvement from it, which can then be applied to every classroom individually?

There are a lot of school districts that have populations with very mixed (economic, ethnic, cultural) backgrounds.

Same for the state. Some states have districts that may be very rural and undeveloped mixed in with a metropolitan area someplace across the state.


All great questions, and all things I think can be solved.

What it comes down to is, what's our other option? Does it even matter when it's nearly impossible to fire a bad teacher?


> The lawsuit shows that Lederman’s students traditionally perform much higher on math and English Language Arts standardized tests than average fourth-grade classes in the state. In 2012-13, 68.75 percent of her students met or exceeded state standards in both English and math.

Wow. WaPo is critical of VAMs and then they use a state-to-classroom comparison to show that she is effective? Doing better or worse than the state average is just about the worst measure of teacher performance. Entire school districts tend to perform well mainly as a measure of how well-to-do that school district is.

This is exactly the comparison VAMs are trying to prevent. Measuring value added as opposed to the value that was already there. To think that WaPo thinks you can measure teacher performance in such a naive state vs. classroom way really detracts from the article.


very interesting issues, but indeed the wapo article is garbage..

for those asking: http://en.wikipedia.org/wiki/Value-added_modeling


Good. See also my related <rant> below.


VAMs are an issue for the reasons mentioned in the article, but just because a teacher reliably produces the highest performing students does not mean s/he is a good teacher.

There was a teacher who was the only teacher of the highest-track Algebra II class at a local HS who had such a terrible reputation among the students that some would drop down one track in math to avoid having her. Numerous students (and their parents) complained to the administration and the official reply was: "Our top math students all came out of her class", which was a rather specious argument since all of the top math students also went into her class.

Unofficially she was close to retirement age, had seniority in the department, and nobody wanted to poke the beehive of forcefully reassigning her classes.


Shouldn't, on the other hand, the fact that "all the top math students also went into her class" be counted as positive for her? I mean, based upon your description, those who attended her classes excelled. That is good. So I guess parents didn't want to put their children into her classes because that wouldn't have worked out; but maybe the reason for that is that their children were simply unfit for the level?


Read the description carefully:

> the only teacher of the highest-track Algebra II class at a local HS

Being the only teacher of the highest track Algebra II class in the school means that anyone who wanted to take the highest track Algebra II class--which plausibly contains many of the best math students--would have to take it from this teacher.

The whole point of the kind of value added modelling that is the centre of this case is that it attempts to factor out things like the quality of the student to estimate the quality of the teacher, precisely so that bad teachers who by dint of circumstances are associated with high-performing students don't get high ratings.

The problem is that if student background counts for the greater part of performance even a good teacher may have difficulty scoring highly if they happen to get a "good" class (one that scores highly on student quality.)

On a larger scale, the anecdote we are discussing here suggests that teachers as individuals may not make that much difference to students' performance, since this bad teacher was still able to turn out the best-performing students thanks (one is supposed to presume) to selection effects alone.


In our school district, it appeared that teaching assignments were largely based on seniority and intra-district politics. It seemed as though the senior teachers most established in the hierarchy would try to get the classes and programs where the best students would be told to go, so as to seem more effective and have more funding.

Thus, in a program that was supposed to cater to the best students in the school district, we had an English teacher who was senile to the point that she could not keep track of assignments or grading or have a coherent curriculum, a History/Social Studies teacher who was primarily interested in pursuing mandatory, irrelevant projects, like expensive theatrical productions, and a Math teacher who didn't teach at all, and just had us go over homework each day.

We were top students not because of our teachers, but because we were all motivated, and were hand-picked for the classes based on test scores and prior performance. We were the children of involved parents, and a high percentage of them were professors. Students who didn't perform well could be easily thrown out by the teachers, as well. With these advantages, it was a given that their students would excel beyond other students in the district; what was not clear was whether we were actually learning as effectively as similar students elsewhere. We went there because it's where the school district told us to go, and told us we'd be given the best opportunities, but the result was primarily to make ineffective teachers look very effective.

In the end, a number of us left the entire district en masse when we realized that the system was entirely ineffective, and not to our benefit. I think that the aftermath showed just how ineffective they were. In three years, I went from being a middle school student who was not being taught at a significantly higher level than other students, and was struggling, to being a junior at a first-tier university who was excelling and had significantly better grades. At the same time the students from our former middle school were graduating high school, I was a first-year grad student. Two of my friends went similar routes, with similar experiences.

If you grab all the top students, and can throw out any student who doesn't perform well, you're obviously going to look very effective, even if you don't teach well and your students could be doing far, far better. Top students will continue to come to your classes, because you'll appear to be the most effective, while better teachers will not have a chance to succeed, because they won't have your numbers to draw interest from parents, or the ability to game them by hand-selecting students.


Reading through the report from the ASA (that doesn't really "slam" the VAM statistic but rightly points out the flaws inherent to any attempt to use statistics in areas with many confounding factors), it appears as if the VAM is usually derived thusly:

1. Calculate a regression model for a student's expected standardized test scores based off of background variables (like previous scores, socioeconomic status, etc.). This includes having teachers as variables.

2. Use the coefficient for the teacher as determined by the model to determine the teacher's "Value Added" metric.
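
For illustration only, one common way to implement those two steps is an OLS regression with a dummy (indicator) variable per teacher. The column names and file here are hypothetical, and real VAM implementations are considerably more elaborate (multiple years, shrinkage, etc.):

  import pandas as pd
  import statsmodels.formula.api as smf

  df = pd.read_csv("student_scores.csv")  # hypothetical data

  # Step 1: regress this year's score on background variables plus a
  # fixed effect for each teacher.
  model = smf.ols("score ~ prior_score + free_lunch + C(teacher_id)", data=df).fit()

  # Step 2: each teacher's dummy coefficient is their "value added"
  # relative to the omitted reference teacher.
  teacher_effects = model.params.filter(like="C(teacher_id)")
  print(teacher_effects.sort_values())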

The weaknesses in such an approach are also spelled out in the report: namely, missing background variables, lack of precision, and a lack of time to test for the effectiveness of the statistics themselves.

What's interesting is that the teacher in question was rated as "effective" the year before. The question becomes whether that was based off of her VAM score that year as well as what the standard error was on her regression coefficient. Unfortunately, the article doesn't mention any of that.


The problem with regression models is that, in skilled hands, it's easy to manipulate the results. And that is without even opening up the rat's nest that is causality.

For instance, want to raise the R^2, a value foolishly used to characterize how well the model explains the data? Add more variables. R^2 is monotonically increasing in the number of variables. So, for example, add the first letter of the teachers' middle names as an explanatory variable. R^2 will probably increase a bit.
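
A quick demonstration of that point on synthetic data, for anyone who doubts it -- an OLS fit's R^2 can only go up (or stay flat) when you toss in a pure-noise column:

  import numpy as np

  def r_squared(X, y):
      coef, *_ = np.linalg.lstsq(X, y, rcond=None)
      resid = y - X @ coef
      return 1 - resid.var() / y.var()

  rng = np.random.default_rng(1)
  n = 500
  x = rng.normal(size=n)
  y = 2 * x + rng.normal(size=n)

  X1 = np.column_stack([np.ones(n), x])
  X2 = np.column_stack([X1, rng.normal(size=n)])  # junk regressor
  print(r_squared(X1, y), r_squared(X2, y))       # second value is >= the first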

Is there homoskedasticity? How much? What did they do to reduce it?

What observations are considered outliers and dropped, and who makes that determination?

Or, want to tank a teacher's score? Assuming teachers are added as something like indicator variables, there are lots of techniques to make the standard deviation increase, allowing you to say that 0 is within the CI of B_{teacher}.

If they are using glmms -- as they probably ought to be -- there's even more room for a skilled statistician to pick outcomes, as more and more of the setup is a judgement call.

Finally, there's an open question of how well the exams were designed and if they accurately measured the student pre and post effect; there's a whole field -- psychometrics -- devoted to testing alone.


heteroskedasticity. sigh.


Perhaps I'm naive, but it seems like a model used for decisionmaking should be one that can show predictive performance - one that can predict, based on historical data about a set of students and a specific teacher, how well a teacher would do teaching that set of students. If it can't be accurate in that, how is it possible to know that it's capturing enough of the variables? And it seems that VAMs are decidedly not such a model.


It's hard to know what the system really does because the article really doesn't explain it.

I think what you're describing is Cross Validation[1]. It would work if they are predicting performance, but it sounds like the VAM system might be trying to figure out what a hypothetical "average" teacher would have achieved with the same students and comparing that to the actual teacher's performance. This is basically trying to predict how the students will do independent of the teacher, but without such a teacher there is no real way to validate the model. Perhaps if they examined students across all teachers.

The system may ultimately be more about comparing teachers to each other and not about actually determining the value provided by an individual teacher.

1: http://en.wikipedia.org/wiki/Cross-validation_(statistics)
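
As a rough sketch of how that check could look (hypothetical column names; ordinary k-fold cross-validation of the teacher-free prediction model):

  import pandas as pd
  from sklearn.linear_model import LinearRegression
  from sklearn.model_selection import cross_val_score

  df = pd.read_csv("student_scores.csv")  # hypothetical data
  X = df[["prior_score", "free_lunch", "parent_education"]]
  y = df["score"]

  # If held-out students' scores can't be predicted well from background
  # variables alone, the "what an average teacher would have achieved"
  # baseline is on shaky ground.
  scores = cross_val_score(LinearRegression(), X, y, cv=5, scoring="r2")
  print(scores.mean())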


A fundamental problem in evaluating teachers is that the value they add matures over many years.


Many of the reasons why every value-added scoring system in use is terrible: http://garyrubinstein.teachforus.org/2012/02/26/analyzing-re...


Thank you for pointing to a much better argument against this than the original link provides.


> In 2012-13, 68.75 percent of her students met or exceeded state standards in both English and math. She was labeled “effective” that year. In 2013-14, her students’ test results were very similar but she was rated “ineffective.”

That sure makes it sound like the measure is unstable. If it is then, at a minimum, output for a single year should not be used by itself, but only in a rolling average with other years. It seems unlikely that the effectiveness of a veteran teacher would change that much from year to year. Given that there was little change in outcomes (no big drop in test scores) the hypothesis of the measure being unstable seems more likely.
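
Even a toy smoothing step would blunt single-year noise; here's a sketch with made-up numbers:

  def rolling_average(scores, window=3):
      # Average each year with up to (window - 1) preceding years.
      return [sum(scores[max(0, i - window + 1):i + 1]) /
              len(scores[max(0, i - window + 1):i + 1])
              for i in range(len(scores))]

  yearly_vam = [14, 15, 13, 16, 1]    # one anomalous year at the end
  print(rolling_average(yearly_vam))  # the final value is pulled up by prior years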


Well, this is one teacher in the state of New York. Even a very stable measurement might give a few funny results every now and then.

(Of course, she's probably not the only teacher in New York with a similar story. But I don't know that there are very many.)


It always cracks me up when every high level education administrator is referred to as a "reformer." It's like referring to members of the Chinese government as "revolutionaries."


Looks like the data scientist behind this has some work to do figuring out why this case was badly classified (if true).


You're making an assumption that has no supporting evidence :)


Fitting to a statistical model superficially makes sense. But I think the details kill it.

The outcome you are measuring is the change in test score from before having a teacher and after. VAM attempts to statistically estimate the teacher's contribution to that change.

Presumably, the test is of something that theoretically the students will not know beforehand. Which means the teachers don't want students who study on their own (or participate in activities where that knowledge might be useful). And they don't want students who aren't going to learn it -- whoops, that was a leap, I meant to say who aren't going to test higher at the end. So you don't really want the top tier nor bottom tier coming into your class.

Nonspecific to VAM, but a result of standardized test results being used for anything meaningful to the teacher (salary, tenure, etc.) is that anything not on the test has an opportunity cost, and so will be omitted in favor of test prep. The more statistical validity that VAM has, the stronger this effect will be. If the teacher shows the students how to incorporate their new knowledge into a broader perspective, it may make the school's scores improve but it will screw over the next teacher in line (because the before test will be higher). So there's some peer pressure to make sure the students learn nothing that they're "supposed" to learn later.

If you consider a subject like math, what happens is that at some point many students fall behind. This makes the later topics much, much harder, because they build on what they never quite understood. A perfect teacher would figure out what balance of old and new material to give each individual student. That perfect teacher would score poorly on VAM compared to a teacher who crammed in test-specific mechanics and regurgitation, relying on dismal beginning test scores to make poor but not awful ending test scores look good. The system would gradually optimize for squeezing incremental gains out of improperly taught students.

And don't forget that the outcome is what's measured, and what's measured is crap. In football, you can look at a score (or just who won). Here, the structure is tuned to produce students who can do well on year-end tests but nothing else, certainly not on their ability to apply their knowledge to situations not likely to show up on a test.

Ok, this became more of a rant against standardized testing, but it just bothers me that adding statistical power magnifies the problems. You'd be better off throwing in a large random component, so that teachers' innate desires to teach well have a chance at winning out over gaming the system. Because even if your population of teachers is really conscientious, you're actively selecting for those most willing to play the game. And selection always wins in the end.


You're assuming the delta is based on just the prior test scores vs. this one, i.e., old 10 to new 15 and old 80 to new 85 count as the same improvement. However, statistically there is a tendency to regress toward the mean, making simply staying at 80 end up as statistical progress. I also suspect they're using a flawed model that ignores the tendency for school districts to pack high-performing teachers on top of other high-performing teachers. To correct for this you need to look at what happens when someone moves from one district to another.

PS: There is a fair amount of momentum in many subjects so teachers can impact not just this years test results, but next years as well. In the end it's really difficult to come up with a high quality model and my guess is they simply did not bother.


Well it's not like teachers only stay in their position for a year. The framework could (and should?) keep on monitoring the progress of the students down the way and feed back to the teachers' rating until they graduate. That would also increase peer pressure and collaboration between teachers.


There is a simple and easy solution. Private schools.


Unionised teachers always complain about evaluation.


Warning: <rant>

Controversial opinions. Long road with sharp curves ahead. Author spent too much time in school, as both a student and a professor. YMMV.

My guess is that the article deliberately omitted the main point: As good as the teacher was and as good as the performance of her students was, the evidence from the testing of that teacher on her 4th grade students on 4th grade work was that that teacher added relatively little to what her students already had when they entered her class. Or, in a sense, with the way their teacher evaluation system works, the good teachers of her students were in K and grades 1, 2, 3 so that by the start of grade 4 the 4th grade students were already doing really well at grade 4 work and, with the way the measurement of amount added by that 4th grade teacher was done, that teacher was seen to have added relatively little. So, adding relatively little, she was evaluated as not doing well.

Added? Why the emphasis on added? Well, suppose you are teaching 4th grade, your students come into your class struggling with 1st grade work (it can happen), you work hard, and in the one year get your students good at 1st grade work, 2nd grade work, and 3rd grade work, whew, three years of work in one year, but, still don't have the students good at 4th grade work. So, such a teacher, taught three years of work in one year, should, still, be rated very effective. So, likely such situations are the source of the interest in measuring what was added.

Now measuring the added amount is likely, in some cases, tricky, both for the testing and for the statistics. And, likely the severity and rigidity of the system kept the 4th grade teacher from moving her 4th grade students on to 5th and 6th grade work and, instead, kept them grinding away at just the 4th grade work they already knew so well that the teacher had little to add. And if she had moved her students ahead to 5th and 6th grade work, then the test, of just 4th grade work, would not show that progress and the teacher again would be measured as not adding much.

So, net, for a 4th grade teacher, a really good 3rd grade teacher is a tough act to follow!

It's sad to see such stress and struggle with K-12 teaching. We are so wound up with the goal of "no child left behind" and, to this end, coming up with systems to beat up on teachers that don't get us to that goal, that we have really poor systems of evaluation and, with high irony, fail at some basic academic tasks and have far too many false alarms. Bummer.

Or, if you can't do statistics well, then don't do statistics at all. We're better off with no statistics than with bad statistics. Have I seen some really bad, we're talking brain dead, statistics in K-12 education, up close, and personal? Yup!

Net, I wouldn't trust the Statistics and Evaluation Branch of the New York State Department of Education to get 2 + 2 = 4 correct. Why? I've seen just way too much fumbling with statistics. Or, bluntly, effective application of statistics to important real situations is mostly quite far beyond the abilities of ordinary organizations -- it's just too hard for them; they just can't get it right; they make messes and do harm.

Here's one way to slow down nearly any application of statistics: Go to some statistics texts, get the assumptions, and then demand that the assumptions be justified. One assumption? Sure, independence -- that assumption is so powerful that, in anything much closer to daily reality than quantum mechanics or dice rolling, it is essentially impossible to justify.

Looks like the goal of "no child left behind" has generated a massive bozo explosion.

My dad was a fantastic educator, and his description of the ideal in education was a student sitting at one end of a log and a good teacher sitting at the other end. Try to characterize this educational environment with statistics? As they say in New York, f'get about it.

But, really, no worries, mate: There is a safety valve -- the main source of education anyway, the home. E.g., I had a friend who went to a NYC school where most of the students knew only two words, and they could say those two with a wide variety of variations. The abbreviation of those two words was just "MF". That's all they knew. Not very articulate, but, then, usually they did get their meaning across, but, then, their meaning was not very advanced, either.

My friend? In the third grade, he was sick at home for a week with the flu, and his mother was shocked to discover that he didn't know how to read. So, in that week, she taught him. Then he knew how to read. In school? Maybe he also learned different ways to say MF.

Education? He did quite well: Got PBK at SUNY, Ph.D. at Courant, and was a Member of the Institute for Advanced Study at Princeton. His education was (1) in K-12 or (2) at home? Three guesses, the first two don't count! Or, four years, K-3, the schools couldn't teach him to read, and his mother did it in a week.

Yes, maybe by grade 12 he knew the binomial theorem, and maybe his mother didn't teach him that at home, but, really, still, his education, the real key to his education, was at home.

My dad told me about a basic book on education, Dewey, Democracy and Education. So, since I was spending so much time in school, I wanted to know why and read the book. At one point Dewey defined education -- passing down from one generation to the next, where he was quite clear that what gets passed is both the good and the bad, not just all good. Well, net, most of that passing down happens at home, and there's next to nothing K-12 can do about it.

Actually, a lot of people understand this basic fact and, thus, want education to start at birth, that is, have the government provide the basic home parenting in, shall we say, at risk situations. I believe you will find that our current President is in favor of this! In other words, he sees the at risk situations as so hopeless that for a solution it is necessary to replace the home itself. Maybe he's correct.

Sorry 'bout K-12: I trust that it really can do babysitting, that is, keep the kids off the streets and, thus, mostly out of crime and drugs, keep the sixth and seventh grade girls from getting pregnant, etc. For much more, well, in some of the at risk situations, tough to have much more; or watch the PBS

http://www.pbs.org/wgbh/pages/frontline/education-of-michell...

with "The Education of Michelle Rhee". She tried. She was good. She tried hard, really hard. In the end, she accomplished basically zip, zilch, zero and nichts, nil, nada. The teachers themselves commonly believed that the goals were just hopeless. Or, she was unable to make the K-12 schools make up for poor homes. Sorry 'bout that. But didn't we know that already?

Well, maybe George Bush believed that education happened in K-12 and that, thus, we could solve the problem of poor education by a program like No Child Left Behind in K-12. Well, W also believed that "The Iraqis are perfectly capable of governing themselves", i.e., have the country stay together, as one country, a democracy, and not split apart into Shiites, Sunnis, Kurds, and Others, with fighting, torture, atrocities, civil war, little problems like those. So, just help the Iraqis write a constitution, hold elections, and all will come together singing "Why Can't We Be Friends?". Where did W get that really strong funny stuff he'd been smoking?

Bush 41 was smart enough to stay the heck out of Baghdad. Bush 43 was not. Maybe W was not the brightest bulb on the tree. "No child left behind"? I understand: W, if your father had only had that goal in mind!

"More educational statistics, Ma!". Then, for the poor performers, "off with their heads!*. Might not make the situation much worse!

YMMV.

</rant>


YTMND.


I feel like teachers don't ever want to be evaluated on their performance, yet want raises every year. Without a system in place to get rid of bad teachers and reward good ones, education will stay broken in this country.

Nearly all professions are evaluated and you can be fired. Why should teaching be any different?

Unions are one of the main reasons things haven't gotten any better. As soon as you try to evaluate, the unions step in and stop it.


Meh. Teachers moaning about evaluation. Hold the front page.



