> If it sounds as if it doesn’t make a lot of sense, that’s because it doesn’t.
To me, it sounds like it makes absolute sense. If you want to evaluate how good a teacher is, one of the things you need to measure is how well their students performed vs. how well those same students would have performed under the hypothetical average teacher.
The particular VAM system used to estimate this metric may be flawed or even completely broken. But at the point in the article this pull quote is located, that argument had not been made.
> VAMs are generally based on standardized test scores and do not directly measure potential teacher contributions toward other student outcomes.
> VAMs typically measure correlation, not causation: Effects – positive or negative – attributed to a teacher may actually be caused by other factors that are not captured in the model.
These are criticisms of the VAM system, but I imagine they are equally valid criticisms of other methods of teacher evaluation. I find it hard to imagine a practical system of evaluation that accounts for all background factors.
> The lawsuit shows that Lederman’s students traditionally perform much higher on math and English Language Arts standardized tests than average fourth-grade classes in the state.
This argument uses the same standardized test scores and confounding background variables that the VAM system is criticized for! Further, it seems obvious that a set of students doing well does not, in isolation, indicate their teacher is a good teacher. Students in a top set or from a great school or from a wealthy neighborhood might be expected to outperform state averages regardless of teacher quality.
My biggest problem with the article is that it doesn't describe alternative methods of teacher evaluation. VAM may be flawed, but how does it compare to the other methods? If we accept that teacher evaluation based on student performance is necessary (is it? I have no idea!), what's a better way to do it?
> If you want to evaluate how good a teacher is, one of the things you need to measure is how well their students performed vs. how well those same students would have performed under the hypothetical average teacher.
But you can't measure that.
> it seems obvious that a set of students doing well does not, in isolation, indicate their teacher is a good teacher.
True. But in this case, at least, we're not talking about "in isolation". We're talking about a 17-year track record of a teacher's students doing well.
> what's a better way to do it?
When I went to school, the people who made these judgments were my parents. You can't make these judgments by formula, and you can't make them if you don't know the details of each individual case. To me, the fact that so many schools are fixated on "data-driven" student evaluations means that parents are not engaged.
>"When I went to school, the people who made these judgments were my parents. You can't make these judgments by formula, and you can't make them if you don't know the details of each individual case. To me, the fact that so many schools are fixated on "data-driven" student evaluations means that parents are not engaged."
Many parents know their children are receiving a substandard (or even damaging) 'education'; the problem they often face is that they are powerless to fire the teachers, or switch schools. Being 'engaged' does basically nothing to fix the schools in these cases. Rich parents exercise school choice by moving to affluent communities with good schools, but the poor often do not have this option.
The VAMs aim to reliably give the school administrators access to the knowledge the parents (and often principals) usually already have, as well as give cause for discipline, incentive, or firing.
> Many parents know their children are receiving a substandard (or even damaging) 'education'
Many parents believe that.
Many parents have children for whom that is true.
The overlap between those two sets, OTOH, may be smaller than you think.
> the problem they often face is that they are powerless to fire the teachers, or switch schools. Being 'engaged' does basically nothing to fix the schools in these cases.
IME, this isn't really true -- but the perception that it is true results in parents not being engaged. When parents are active in addressing perceived problems with teachers in public schools, it is very effective in getting those teachers out of the classroom. (And I've seen it happen numerous times, both to bad teachers when parents acted reasonably in response to real problems, and to good teachers when parents acted unreasonably out of offense that their special-snowflake children weren't getting handed grades on a silver platter. Sometimes it doesn't mean the teacher gets fired; sometimes they get laterally transferred or technically promoted to a position within the school system that is out of the classroom and not dealing with students, and sometimes they just voluntarily leave teaching. But parent activism is quite effective at getting teachers out of the classroom.)
I could be interpreting your comment wrong, but it seems to put the onus heavily on the teacher, which seems a bit naive. My friend just started teaching in New York, actually. And she specifically wanted to teach at a Title 1 school (i.e. poor) in order to reach students with less income and opportunity (she teaches an East Asian language in a predominantly non-Asian minority community). That was the idealized plan, anyway. Yet the school won't give her basic supplies. There is no lounge. No fridge. OK, fine, that's tolerable (odd to me, since I too was raised in an affluent community). But the teachers need paper. There is a locked-up supply room full of supplies, but the teachers are told they can't have access to it. Why? Nobody knows. Printing rights are curbed.
And on top of that all, NYC DoE requires a masters. That equals debt, if you went to a "good" school. But the DoE pays crap. Oh and for some reason at one point, the DoE lowered requirements to be a principal. 3 years of part time teaching, and you could be a principal. Her principal taught dancing for 3 years, and is now purportedly fit to analyze the effectiveness of foreign language teaching methods. But anyway, that's beside the point, that's just describing the environment.
Her program is new. But instead of asking students/parents and filtering for those who might be interested in an East Asian language, the administrators decided to randomly force students into the class, regardless of their prior language history or their year. You know what you get with that? Hostile students. The kind that scream fuck you in your face. OK, fine, just "standard" difficult students. But like any other job, if your "manager" has your back, you can usually deal. But this school doesn't believe in detention. Okay... The "dream" is that if a student is causing trouble, the deans will talk to said student. Uh oh, what wasn't accounted for was the saturation of the deans' time by trouble students. Now you've got deans telling you that, sorry, they can't deal with disruptive students telling you to fuck off, because they're overloaded. So now you have no fire support and you're in the trenches alone. I'm not even going to get into the gay teacher's story. Once the students caught onto that...
This doesn't even include off-the-clock work that is required, which I won't get into. I'm sorry, but as someone in the tech field, or any privatized field, the shit that teachers have to put up with is insane.
I know teachers have been demonized, but the turnover rate in NYC for teachers is apparently extremely high (I have another friend working at the DoE itself). I'll have to get a source, but I seem to recall her saying it was around 70% after 2 years. And after hearing all the ridiculous war stories, I'm not surprised. Ha, another one of my teacher friends was moved to an empty classroom. Upon asking for desks, the administrators told her "we don't know where they are" and left it at that. So she had to essentially salvage desks marked for discard.
I'm not exactly sure how VAM works, but I'm skeptical that an algorithm can model something so complex.
> I'm not exactly sure how VAM works, but I'm skeptical that an algorithm can model something so complex.
And even if it could, should an algorithm like this be obscured from view? Should we rely on "black box" algorithms like this, or should we at least insist that, if not the code, the research behind the code be released openly so that it can be held to proper scrutiny?
People are not engaged. They are fixated on numbers, models, and abstractions, with the assumption that the numbers, models, and abstractions have actual meaning tied to them. They don't, aside from creating more students that create more systems that define more numbers, models, and abstractions. Some of these students get pissed off and try to create the opposite.
This is why the humanities are important. You can have the soundest logical systems, with the most elegant mathematical models, and completely miss the point. People get so caught up in the minutiae of measurements that they forget to see the big picture. Measurements don't mean anything when the measurements are used to measure themselves.
I swear, sometimes I really wonder whether society has its head up its ass. If these systems can model 'theoretical students' under 'better conditions', then why aren't the 'theoretical student' models running the world? Oh, that's right, because there's a gigantic difference between data and theory. How do they even know what a better student is? How can anyone in their right mind define that? How can anyone even pretend to know what a great student is?
> When I went to school, the people who made these judgments were my parents. You can't make these judgments by formula, and you can't make them if you don't know the details of each individual case.
When I went to school, I used to think my grades meant something more than being a very complicated way of validating someone else's world view, in a way that tricks everyone into thinking we've made any progress at defining or understanding intelligence at all.
> To me, the fact that so many schools are fixated on "data-driven" student evaluations means that parents are not engaged.
The parents may be extremely engaged, but they just don't know whether to call their kid smart or not. If the kid complains about the book they read because it was boring, is that appropriate? If the kid programs a calculator to do 4 years of standardized test math homework, is that appropriate? Education is doing a fantastic job at driving the personality out of people by forcing them all to sit in the same box. The parents don't know, the teachers don't know, the government doesn't know, society doesn't know, yet we all pretend we know.
    score_at_end_of_year = score_at_start_of_year
                           + a * parents_income
                           + b * parents_education_level
                           + c * teacher_skill
                           + random_noise
You can figure out the values of a and b by comparing the performance of different students in the same class. If you have a large enough sample size to average out the random noise, you're left with (c * teacher_skill).
Not a perfect method, obviously - but if the best teachers at schools full of poor kids could get more incentive pay by switching to schools full of rich kids, would that be incentivising the right thing?
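For concreteness, here is a minimal sketch of that estimation on synthetic data, assuming NumPy; every variable and number below is invented for illustration, not taken from any real VAM:

    import numpy as np

    rng = np.random.default_rng(0)
    n = 1000                                  # students in the sample
    income = rng.normal(50_000, 15_000, n)    # parents_income
    education = rng.integers(10, 21, n)       # parents_education_level (years)
    teacher = rng.integers(0, 2, n)           # 0 = teacher A, 1 = teacher B
    true_a, true_b, true_skill = 1e-4, 0.5, 3.0

    # score_at_end_of_year - score_at_start_of_year, per the model above
    gain = (true_a * income + true_b * education
            + true_skill * teacher + rng.normal(0, 2, n))

    # Regress the gain on the background variables plus a teacher
    # indicator; with a large enough sample the random noise averages
    # out and the teacher coefficient approximates c * teacher_skill.
    X = np.column_stack([income, education, teacher, np.ones(n)])
    coef, *_ = np.linalg.lstsq(X, gain, rcond=None)
    print(coef[:3])   # roughly [1e-4, 0.5, 3.0]

Run it with n = 25 (one classroom) instead of 1000 and the recovered teacher coefficient bounces around badly, which is the sample-size caveat in a nutshell.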
Then you can obviously calculate what the model outputs for any set of inputs. But that's not measuring; that's modeling. They're not the same thing. You can't measure the students' performance with a hypothetical "average" teacher because that teacher does not exist. You can only model what you think are the relevant factors involved and then input actual data into the model to see what it says.
Can such models be informative? Certainly. Are they a substitute for human judgment? No way. Yet that is what these schools appear to be doing.
> if the best teachers at schools full of poor kids could get more incentive pay by switching to schools full of rich kids, would that be incentivising the right thing?
Does that actually happen? And if it does, would this model prevent it? Once again, models are no substitute for human judgment.
The proposed model makes no assumptions about the "average" teacher -- it proposes a simple metric applied to student performance. Under this metric, some teachers would get a positive score because their students' scores improved more than the average. Others would get a negative score because their students' scores improved less than the average.
> would this model prevent it?
I don't know, but the goal of the model is to prevent this, or even reverse it -- that is, the best teachers would want to work with the least advantaged kids, because arbitrage -- it should be easier to make a difference for kids starting at a low performance level, and harder to make a similar difference with kids starting at a higher level.
I don't know whether this works even a little bit in practice, but the thinking sounds promising.
> The proposed model makes no assumptions about the "average" teacher
Sure it does. How else is the teacher_skill rating calibrated?
> the best teachers would want to work with the least advantaged kids
Why is this necessarily a good thing? Shouldn't there also be an incentive to have the best teachers work with the brightest students, to ensure that those students are actually challenged instead of just skating through school?
Put another way: such modeling is probably informative over the aggregate. That is, when comparing school districts to each other. But such modeling is likely not valid for an individual teacher.
Yes, and then you can figure out different values of a and b for a different teacher, different class, different year, different town, etc.
And you're really left with c*teacher_skill + noise, where the noise in the teacher_skill measure may be what results in some good teachers getting bad ratings.
In my view, empirical modeling assumes that there is some underlying regularity that can be modeled, but this assumption seems questionable. And part of the process has to be to decide an acceptable level of good teachers getting bad ratings.
Most HN readers are programmers. Just imagine this logic being applied to yourselves. Your performance is measured in a single week at the end of each year by statistically-controlled standardized tasks. Your manager's performance is then measured by the statistically-inferred effect that working for her has had on your performance given your previous year's performance and a model of how programmers' performance changes over time based on a few variables.
How well do you think that would work? Would it accurately reflect the work you had done in the previous year? Would it generate the right incentives for you to actually do good work and for your manager to help you grow?
Or is it more likely that random variation would dominate meaningful variation? That getting sick or feeling depressed that week would have markedly more impact than all the ways you really got better or worse that year? Or that it would generate perverse incentives to, say, learn about the standard task early, to prepare for the kinds of tasks you were measured on instead of the ones that produce value?
Or, for that matter, that it would destroy the motivation of everyone involved by replacing their intrinsic motivation to do well with an extrinsic motivation to get rewarded?
To be clear, most HN readers' employment and compensation is based on a manager's holistic sense of that HNer's value. We can generally be fired without cause.
If teachers want to argue that they should be in an employment regime similar to programmers, sure, I'm sympathetic. But my strong impression is that they don't want that at all.
> Your performance is measured in a single week at the end of each year by statistically-controlled standardized tasks.
My main point was that if you're going to appraise teachers based on their students' standardized test scores, it would seem to make sense to use a statistical model that tries to measure how much of an effect the teacher has on the predicted test scores of their students. I'm completely open to the argument that we shouldn't be appraising teachers based on student test scores, but that does not seem to be the position the article is advocating.
If you don't have some sort of system to extrinsically reward people with different skill levels different amounts, then the best people are eventually going to figure that out and go do something else where they'll get paid what they're worth.
If you don't believe this think about how pissed you'd be if that idiot down the hall who couldn't write good code to save his life made as much money as you do.
You seem to assume that money is the sole motivator, everywhere.
What's really going on is that you're trapped in your own world view, validated by people like you, all alike in your social circles.
My advice: at least try and get out of it a little. I'm up to my gills in the high-tech culture, working at a startup in the Silicon Valley - but my wife is a teacher. I went to a Halloween party hosted by some of her co-workers - it's a tremendous difference in terms of culture, ideals, etc. It was literally like meeting people from a different country.
Metrics aren't the final answer to every question. When dealing with people (which is what education is doing), a simple, single, narrow number used to reduce everything to a handy metric can be incredibly misleading.
Whenever I hear some folks saying that we need more metrics and more competition to make the education system better, I just want to tell them: "thank you, but sit down and shut up, cause you've no idea what you're talking about."
My wife is also a teacher(1). While I didn't go to a Halloween party with her coworkers this year I've been to quite a few other parties with teachers. So maybe don't make assumptions about my world view huh?
I never said that metrics were the final answer to every question, either. Teaching is obviously an incredibly nuanced profession. If, as Michael Lewis showed us, we can't even measure baseball players' performance very well, then oh gosh, it's going to be much much harder for teachers (or programmers too!).
But that doesn't mean there aren't any differences in skill either. There are great programmers. There are not so good programmers. There are great teachers. There are bad teachers. If we deny this reality and just pay them the same amount then eventually the good ones are going to figure out they're getting screwed no matter how much they intrinsically like the gig.
> The particular VAM system used to estimate this metric may be flawed or even completely broken. But at the point in the article this pull quote is located, that argument had not been made.
> My biggest problem with the article is it doesn't describe alternate methods of teacher evaluation. VAM may be flawed, but how does it compare to the other methods? If we accept that teacher evaluation based on student performance is necessary (is it? I have no idea!), what's a better way to do it?
The main point I took from the article is that the VAM approach is a black box. It is data driven and we have some reason to suspect it should be useful, but we simply have no way to audit that on specific cases.
Combine that with the fact that this particular teacher is exemplary yet got an abysmal score, and we are faced with the worst possible result: someone who should get a great score received a terrible one, and we don't know why. So we can't figure out what went wrong or how to fix it.
This isn't a new issue. Statistical models used to be more human-understandable, then over several decades neural networks and big data and so forth have become popular. They often achieve good results, but often remain partially or entirely black boxes.
It makes sense to use a black box method to recognize handwriting, for example, if it does well on average. It's fine if it messes up on some samples. But teachers deserve to be treated fairly, each and every single one.
That's the tension here - better methods on average tend to be less auditable on individual cases.
> If you want to evaluate how good a teacher is, one of the things you need to measure is how well their students performed vs. how well those same students would have performed under the hypothetical average teacher.
But underlying this assumption is that you have a good measure of student performance in the first place -- or that such a measure can easily be crafted. It's questionable if it even makes sense to quantify a student's abilities into a single catch-all number.
But even if quantification of student performance makes sense, it becomes problematic when it becomes high-stakes (i.e. teachers' payment or school funding depends upon test scores). The problem with tests (or any other metric tied to incentives) is that they are gamed over time; SATs become increasingly meaningless with the cottage industry of test-prep; and so do standardized tests become increasingly meaningless as teachers are incentivized to teach directly to them. (In politics, crime rates are gamed, jobs rates are gamed, etc.).
Were schools so terrible before standardized tests -- when administrators and other educational experts rated teachers? There were existing non-mechanical rating systems that of course predated standardized testing -- is it necessary to mechanize evaluations simply because we like the superficial illusion of objectivity, even if that objectivity bears no resemblance to the thing we are trying to measure?
Or even acknowledging that perhaps existing non-mechanized teacher performance measures may have been flawed, have standardized tests somehow improved schools? No, probably not -- teachers and students both are more miserable and are increasingly tied down by the requirements of possibly meaningless tests.
Think about it this way, using your own claim: if VAM is measuring the amount of value a teacher adds to student test performance based on an average student of similar background, and the average student in a wealthy neighborhood does well regardless of the teacher, then every teacher in every wealthy neighborhood adds no value! Does that justify firing/docking the pay of all your teachers? Obviously not. It says that the kind of value a teacher adds necessarily depends on the community. It's a qualitative difference, not a quantitative one.
> if VAM is measuring the amount of value a teacher adds to student test performance based on an average student of similar background, and the average student in a wealthy neighborhood does well regardless of the teacher, then every teacher in every wealthy neighborhood adds no value!
I'm no statistician, but I don't think this is actually true. We would expect students who are likely to do well on standardized tests under a Hypothetical Average Teacher (HAT) to benefit relatively less from a good teacher than students who are likely to do poorly on standardized tests under a HAT - at least when we use standardized test results as the metric. But presumably, any sophisticated statistical model will take this into account. You would simply adjust the model such that a unit increase in actual score vs. predicted score is "worth more" in terms of indicated teacher performance as predicted score increases. It's just a case of diminishing marginal returns, and not unique to teaching or education.
It might be harder to distinguish signal from noise as the predicted test score increases. But it still seems a better system than "the students have done well, therefore the teacher has done well".
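As a toy illustration of that adjustment (the scaling function here is entirely invented, not taken from any real VAM):

    # Weight the residual (actual - predicted) more heavily as the
    # predicted score rises, so a one-point gain near the ceiling counts
    # for more than a one-point gain near the floor.
    def teacher_signal(actual, predicted, ceiling=100.0):
        headroom = max(ceiling - predicted, 1.0)  # shrinking room to improve
        return (actual - predicted) * (ceiling / headroom)

    print(teacher_signal(actual=55, predicted=50))  # 5 raw points -> 10.0
    print(teacher_signal(actual=95, predicted=90))  # 5 raw points -> 50.0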
My point is that as wealth and average test scores increase, certain kinds of value become more important in distinguishing teachers. Values which VAM does not even attempt to measure.
>"if VAM is measuring the amount of value a teacher adds to student test performance based on an average student of similar background, and the average student in a wealthy neighborhood does well regardless of the teacher, then every teacher in every wealthy neighborhood adds no value"
If you are going to make arguments against VAM, please come up with some stronger ones. Applying statistical controls for student test results in situations like these is trivial, and teachers do make a big difference. If the teachers really made no difference, why not just hire the cheapest labor available, and spend the rest of the money on extra-curricular programs?
How is that? In some communities the difference could be between learning to read and finding an appreciation for modernist poetry. Or more drastically, the difference between instilling a sense of civic duty and keeping kids out of prison. We do expect teachers to have a part in all of these things.
Speaking as someone who reads a book per week, and has spent all their life in affluent communities with highly rated educational institutions, I can confidently say that for the overwhelming majority of public schools (even those in rich suburbs), achieving functional literacy with some comprehension and analytic ability for the majority of students is a lofty goal. I do not know of any public school where the difference between a good teacher and a bad one is "between learning to read and finding an appreciation for modernist poetry"; maybe this will be a problem in the future, but it is not something we should be grappling with now.
I am not under the impression that what "keep[s] kids out of prison" is "a sense of civic duty". Good career prospects keep people out of prison; high conscientiousness may be correlated with low rates of imprisonment, but conscientiousness is also a luxury the desperate can ill afford to indulge in.
In NFL fandom, Football Outsiders' Defense-adjusted Value Over Average (DVOA) is perhaps the best-regarded predictive model of results. In a shoot-out on /r/nfl last year, it outdid ~20 other ratings and rankings in predicting outcomes. One can only imagine that if predicting NFL games were as politicized as school outcomes, there'd be a powerful lobby claiming DVOA "doesn't make a lot of sense".
You're right, this particular model might be bad. But something along these lines will be good, inasmuch as evaluating a teacher's effect on outcomes correctly, if approximately, is good. Science always wins.
It should be simple to develop a predictive model of programmer effectiveness. We can measure the number of lines of codes, number of defects committed, number of defects fixed, etc. against the average developer on the average software project and then we have a way of approximately evaluating a programmer's contributions to a software project.
Good point. Would you agree that: even if a phenomenon (for example performance of some sort of worker) is hard to model, we should still try, especially if that phenomenon provides enormous value to society?
The people advocating for a metric ought to be the ones expected to establish that it is sensible. Nonetheless, education "reformers" have consistently been promoting metrics with enormous variation in ranking.
Of course, if your purpose is to garner the political support of people who say "it sounds like it makes absolute sense" actually measuring meaningful things is irrelevant.
> well those same students would have performed under the hypothetical average teacher.
I don't understand how you do that. I can see this hypothetical setup -- you start a genetic experiment and clone all the students' DNA. Produce the same exact # of babies. Raise each one the same way as the corresponding original student. Then you build a robotic teacher whose skills, personality, and experience represent the average of all the personalities, experiences, and skills of all the teachers in the state (luckily you just do this once per year and then make lots of clones, which should be easy).
Then you give the group of clones to this "average" robot teacher and you see how well the students do. (It is optional to keep the clones after this experiment, they could be used for spare parts for later maybe...).
So now you have your VAM measure and you can assign a score of 1 to 20 to this teacher.
That was all sarcastic of course. But ok, how does this model work? Can you explain.
Here's a simple way to compare someone to the average teacher.
Two tests. One at the beginning of the year, one at the end. Take the average improvement across the school, across the district, across the state. Now apply that average school-wide improvement to a classroom's original scores in the first test to see how those students should have performed on the second test. Compare that number to how they actually performed.
I came up with this in about 10 seconds, there's obviously a better way to do things, but comparing someone to the average teacher really isn't that hard to do.
The most important part is coming up with a test that actually measures understanding. Open ended questions are usually a good place to start. Many tests I have taken really require you to understand the material.
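That 10-second method is simple enough to fit in a few lines; here is a sketch with invented numbers, assuming NumPy:

    import numpy as np

    # Fall and spring scores for one classroom, plus the average
    # improvement observed across the comparison pool (school,
    # district, or state).
    fall = np.array([62.0, 70.0, 55.0, 80.0])
    spring = np.array([70.0, 74.0, 66.0, 85.0])
    average_improvement = 6.0   # mean (spring - fall) across the pool

    expected_spring = fall + average_improvement     # how they "should" score
    value_added = (spring - expected_spring).mean()  # + means beat the average
    print(value_added)   # 1.0 -> one point better than the average classroom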
> Two tests. One at the beginning of the year, one at the end.
Is it the same test (same questions) or a different test? I would guess same type of questions.
Also, what if the students are already top performing and don't improve but actually get slightly worse? (As in, they all get 95% and then a few get sick and miss a few months of school, and now the class gets 94%.)
I guess I am still confused at how average expected improvement is supposed to work reliably across all those population models.
Do we assume the school-wide population of students is uniform enough to calculate a meaningful average improvement from it, one that can be applied to every classroom individually?
There are a lot of school districts with very mixed populations (economic, ethnic, and cultural backgrounds).
The same goes for the state. Some states have districts that may be very rural and undeveloped mixed in with a metropolitan area someplace across the state.
> The lawsuit shows that Lederman’s students traditionally perform much higher on math and English Language Arts standardized tests than average fourth-grade classes in the state. In 2012-13, 68.75 percent of her students met or exceeded state standards in both English and math.
Wow. WaPo is critical of VAMs, and then they use a state-to-classroom comparison to show that she is effective? Doing better or worse than the state average is just about the worst measure of teacher performance. Entire school districts tend to perform well mainly as a measure of how well-to-do that school district is.
This is exactly the comparison VAMs are trying to prevent: measuring value added, as opposed to the value that was already there. That WaPo thinks you can measure teacher performance in such a naive state-vs-classroom way really detracts from the article.
VAMs are an issue for the reasons mentioned in the article, but just because a teacher reliably produces the highest performing students does not mean s/he is a good teacher.
There was a teacher who was the only teacher of highest track Algebra II class at a local HS who had such a terrible reputation among the students that some would drop down one track in math to avoid having her. Numerous students (and their parents) complained to the administration and the official reply was: "Our top math students all came out of her class" which was a rather specious argument since all of the top math students also went into her class.
Unofficially she was close to retirement age, had seniority in the department, and nobody wanted to poke the beehive of forcefully reassigning her classes.
Shouldn't, on the other hand, the fact that "all the top math students also went into her class" be counted as positive for her? I mean, based upon your description, those who attended her classes excelled. That is good.
So I guess parents didn't want to put their children into her classes because that wouldn't have worked out; but maybe the reason for that is that their children were simply unfit for the level?
> the only teacher of highest track Algebra II class at a local HS
Being the only teacher of the highest track Algebra II class in the school means that anyone who wanted to take the highest track Algebra II class--which plausibly contains many of the best math students--would have to take it from this teacher.
The whole point of the kind of value added modelling that is the centre of this case is that it attempts to factor out things like the quality of the student to estimate the quality of the teacher, precisely so that bad teachers who by dint of circumstances are associated with high-performing students don't get high ratings.
The problem is that if student background counts for the greater part of performance even a good teacher may have difficulty scoring highly if they happen to get a "good" class (one that scores highly on student quality.)
On a larger scale, the anecdote we are discussing here suggests that teachers as individuals may not make that much difference to students' performance, since this bad teacher was still able to turn out the best-performing students thanks (one is supposed to presume) to selection effects alone.
In our school district, it appeared that teaching assignments were largely based on seniority and intra-district politics. It seemed as though the senior teachers most established in the hierarchy would try to get the classes and programs where the best students would be told to go, so as to seem more effective and have more funding.
Thus, in a program that was supposed to cater to the best students in the school district, we had an English teacher who was senile to the point that she could not keep track of assignments or grading or have a coherent curriculum, a History/Social Studies teacher who was primarily interested in pursuing mandatory, irrelevant projects, like expensive theatrical productions, and a Math teacher who didn't teach at all, and just had us go over homework each day.
We were top students not because of our teachers, but because we were all motivated, and were hand-picked for the classes based on test scores and prior performance. We were the children of involved parents, and a high percentage of them were professors. Students who didn't perform well could be easily thrown out by the teachers, as well. With these advantages, it was a given that their students would excel beyond other students in the district; what was not clear was whether we were actually learning as effectively as similar students elsewhere. We went there because it's where the school district told us to go, and told us we'd be given the best opportunities, but the result was primarily to make ineffective teachers look very effective.
In the end, a number of us left the entire district en masse when we realized that the system was entirely ineffective, and not to our benefit. I think that the aftermath showed just how ineffective they were. In three years, I went from being a middle school student who was not being taught at a significantly higher level than other students, and was struggling, to being a junior at a first-tier university who was excelling and had significantly better grades. At the same time the students from our former middle school were graduating high school, I was a first-year grad student. Two of my friends went similar routes, with similar experiences.
If you grab all the top students, and can throw out any student who doesn't perform well, you're obviously going to look very effective, even if you don't teach well and your students could be doing far, far better. Top students will continue to come to your classes, because you'll appear to be the most effective, while better teachers will not have a chance to succeed, because they won't have your numbers to draw interest from parents, or the ability to game them by hand-selecting students.
Reading through the report from the ASA (that doesn't really "slam" the VAM statistic but rightly points out the flaws inherent to any attempt to use statistics in areas with many confounding factors), it appears as if the VAM is usually derived thusly:
1. Calculate a regression model for a student's expected standardized test scores based on background variables (like previous scores, socioeconomic status, etc.). This includes having teachers as variables.
2. Use the coefficient for the teacher as determined by the model to determine the teacher's "Value Added" metric.
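A minimal sketch of those two steps, assuming NumPy and ordinary least squares with per-teacher indicator (dummy) variables; the background variables, data, and effect sizes are all invented:

    import numpy as np

    rng = np.random.default_rng(1)
    n, n_teachers = 600, 3
    prev_score = rng.normal(70, 10, n)          # previous year's score
    ses = rng.normal(0, 1, n)                   # socioeconomic index
    teacher_id = rng.integers(0, n_teachers, n)

    true_effects = np.array([-2.0, 0.0, 2.0])
    score = (0.8 * prev_score + 3.0 * ses
             + true_effects[teacher_id] + rng.normal(0, 4, n))

    # Step 1: regression with one indicator column per teacher.
    dummies = np.eye(n_teachers)[teacher_id]
    X = np.column_stack([prev_score, ses, dummies])
    coef, *_ = np.linalg.lstsq(X, score, rcond=None)

    # Step 2: each teacher's coefficient is their "Value Added" metric.
    print(coef[-n_teachers:])   # roughly [-2.0, 0.0, 2.0]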
The weaknesses in such an approach are also spelled out in the report: namely, missing background variables, lack of precision, and a lack of time to test for the effectiveness of the statistics themselves.
What's interesting is that the teacher in question was rated as "effective" the year before. The question becomes whether that was based off of her VAM score that year as well as what the standard error was on her regression coefficient. Unfortunately, the article doesn't mention any of that.
The problem with regression models is that, in skilled hands, it's easy to manipulate the results. And that is without even opening up the rat's nest that is causality.
For instance, want to raise the R^2, a value foolishly used to characterize how well the model explains the data? Add more variables. R^2 is monotonically non-decreasing in the number of variables. So, for example, add the first letter of the teachers' middle names as an explanatory variable. R^2 will probably increase a bit.
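That claim is easy to demonstrate with random data (a quick sketch, assuming NumPy; the "junk" column stands in for something like the middle-name letter):

    import numpy as np

    def r_squared(X, y):
        coef, *_ = np.linalg.lstsq(X, y, rcond=None)
        resid = y - X @ coef
        return 1 - resid.var() / y.var()

    rng = np.random.default_rng(2)
    n = 200
    y = rng.normal(size=n)
    X = np.column_stack([np.ones(n), rng.normal(size=n)])  # base model
    junk = rng.normal(size=n)   # pure noise, unrelated to y
    print(r_squared(X, y))                           # base R^2
    print(r_squared(np.column_stack([X, junk]), y))  # never lower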
Is there heteroskedasticity? How much? What did they do to correct for it?
What observations are considered outliers and dropped, and who makes that determination?
Or, want to tank a teacher's score? Assuming teachers are added as something like indicator variables, there are lots of techniques to make the standard deviation increase, allowing you to say that 0 is within the CI of B_{teacher}.
If they are using GLMMs (generalized linear mixed models) -- as they probably ought to be -- there's even more room for a skilled statistician to pick outcomes, as more and more of the setup is a judgment call.
Finally, there's an open question of how well the exams were designed and whether they accurately measured the students' pre- and post-instruction performance; there's a whole field -- psychometrics -- devoted to testing alone.
Perhaps I'm naive, but it seems like a model used for decisionmaking should be one that can show predictive performance - one that can predict, based on historical data about a set of students and a specific teacher, how well a teacher would do teaching that set of students. If it can't be accurate in that, how is it possible to know that it's capturing enough of the variables? And it seems that VAMs are decidedly not such a model.
It's hard to know what the system really does because the article really doesn't explain it.
I think what you're describing is Cross Validation[1]. It would work if they are predicting performance, but it sounds like the VAM system might be trying to figure out what a hypothetical "average" teacher would have achieved with the same students and comparing that to the actual teacher's performance. This is basically trying to predict how the students will do independent of the teacher, but without such a teacher there is no real way to validate the model. Perhaps if they examined students across all teachers.
The system may ultimately be more about comparing teachers to each other and not about actually determining the value provided by an individual teacher.
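For what a predictive check could look like in practice, here is a hedged sketch assuming scikit-learn and invented features -- hold out some students, predict their scores from background variables alone, and see how well the model does:

    import numpy as np
    from sklearn.linear_model import LinearRegression
    from sklearn.model_selection import cross_val_score

    rng = np.random.default_rng(3)
    n = 500
    X = np.column_stack([
        rng.normal(70, 10, n),   # previous year's score
        rng.normal(0, 1, n),     # socioeconomic index
    ])
    y = 0.8 * X[:, 0] + 3.0 * X[:, 1] + rng.normal(0, 4, n)  # end-of-year score

    # 5-fold cross-validated R^2: if this is low, the model isn't
    # capturing enough of the variables to trust for individual decisions.
    print(cross_val_score(LinearRegression(), X, y, cv=5, scoring="r2").mean())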
> In 2012-13, 68.75 percent of her students met or exceeded state standards in both English and math. She was labeled “effective” that year. In 2013-14, her students’ test results were very similar but she was rated “ineffective.”
That sure makes it sound like the measure is unstable. If it is then, at a minimum, output for a single year should not be used by itself, but only in a rolling average with other years. It seems unlikely that the effectiveness of a veteran teacher would change that much from year to year. Given that there was little change in outcomes (no big drop in test scores) the hypothesis of the measure being unstable seems more likely.
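A rolling average is a one-liner; a tiny sketch with invented ratings, assuming NumPy:

    import numpy as np

    yearly_rating = np.array([14.0, 15.0, 1.0, 13.0])  # e.g. out of 20
    window = 3
    smoothed = np.convolve(yearly_rating, np.ones(window) / window, mode="valid")
    print(smoothed)   # [10.0, 9.67] -- the single-year dip is damped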
It always cracks me up when every high level education administrator is referred to as a "reformer." It's like referring to members of the Chinese government as "revolutionaries."
Fitting to a statistical model superficially makes sense. But I think the details kill it.
The outcome you are measuring is the change in test score from before having a teacher and after. VAM attempts to statistically estimate the teacher's contribution to that change.
Presumably, the test is of something that theoretically the students will not know beforehand. Which means the teachers don't want students who study on their own (or participate in activities where that knowledge might be useful). And they don't want students who aren't going to learn it -- whoops, that was a leap, I meant to say who aren't going to test higher at the end. So you don't really want the top tier nor bottom tier coming into your class.
Nonspecific to VAM, but a result of standardized test results being used for anything meaningful to the teacher (salary, tenure, etc.) is that anything not on the test has an opportunity cost, and so will be omitted in favor of test prep. The more statistical validity that VAM has, the stronger this effect will be. If the teacher shows the students how to incorporate their new knowledge into a broader perspective, it may make the school's scores improve but it will screw over the next teacher in line (because the before test will be higher). So there's some peer pressure to make sure the students learn nothing that they're "supposed" to learn later.
If you consider a subject like math, what happens is that at some point many students fall behind. This makes the later topics much, much harder, because they build on what they never quite understood. A perfect teacher would figure out what balance of old and new material to give each individual student. That perfect teacher would score poorly on VAM compared to a teacher who crammed in test-specific mechanics and regurgitation, relying on dismal beginning test scores to make poor but not awful ending test scores look good. The system would gradually optimize for squeezing incremental gains out of improperly taught students.
And don't forget that the outcome is what's measured, and what's measured is crap. In football, you can look at a score (or just who won). Here, the structure is tuned to produce students who can do well on year-end tests and nothing else; it certainly doesn't capture their ability to apply their knowledge to situations not likely to show up on a test.
Ok, this became more of a rant against standardized testing, but it just bothers me that adding statistical power magnifies the problems. You'd be better off throwing in a large random component, so that teachers' innate desires to teach well have a chance at winning out over gaming the system. Because even if your population of teachers is really conscientious, you're actively selecting for those most willing to play the game. And selection always wins in the end.
You're assuming the delta is based on just the prior test scores vs. this one -- i.e., that going from 10 to 15 is the same improvement as going from 80 to 85. Statistically, though, there is a tendency to regress toward the mean, making simply staying at 80 count as statistical progress. However, I suspect they're using a flawed model that ignores the tendency for school districts to stack high-performing teachers on top of other high-performing teachers. To correct for this you need to look at what happens when someone moves from one district to another.
PS: There is a fair amount of momentum in many subjects, so teachers can impact not just this year's test results but next year's as well. In the end it's really difficult to come up with a high-quality model, and my guess is they simply did not bother.
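Regression toward the mean is easy to see in simulation (invented data, assuming NumPy): two noisy tests of the same fixed ability, no teaching effect at all, and the top scorers on test 1 still drop on test 2.

    import numpy as np

    rng = np.random.default_rng(4)
    n = 10_000
    ability = rng.normal(70, 10, n)         # fixed; nobody learns anything
    test1 = ability + rng.normal(0, 8, n)
    test2 = ability + rng.normal(0, 8, n)

    top = test1 > 85    # the "high-performing" classrooms on test 1
    print(test1[top].mean())   # ~90
    print(test2[top].mean())   # several points lower, from noise alone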
Well, it's not like teachers only stay in their position for a year. The framework could (and should?) keep on monitoring the progress of the students along the way and feed back into the teachers' ratings until they graduate. That would also increase peer pressure and collaboration between teachers.
Controversial opinions. Long road with sharp curves ahead. Author spent too much time in school, as both a student and a professor. YMMV.
My guess is that the article deliberately omitted the main point: as good as the teacher was, and as good as the performance of her students was, the evidence from the testing of that teacher on her 4th grade students on 4th grade work was that the teacher added relatively little to what her students already had when they entered her class. Or, in a sense, with the way their teacher evaluation system works, the good teachers of her students were in K and grades 1, 2, 3, so that by the start of grade 4 the students were already doing really well at grade 4 work, and, with the way the measurement of the amount added by that 4th grade teacher was done, the teacher was seen to have added relatively little. So, adding relatively little, she was evaluated as not doing well.
Added? Why the emphasis on added? Well, suppose you are teaching 4th grade, your students come into your class struggling with 1st grade work (it can happen), you work hard, and in the one year get your students good at 1st grade work, 2nd grade work, and 3rd grade work -- whew, three years of work in one year -- but still don't have the students good at 4th grade work. Such a teacher, having taught three years of work in one year, should still be rated very effective. So, likely such situations are the source of the interest in measuring what was added.
Now, measuring the added amount is likely, in some cases, tricky, both in the testing and statistically.
And likely the severity and rigidity of the system kept the 4th grade teacher from moving her 4th grade students on to 5th and 6th grade work and, instead, kept them grinding away at just the 4th grade work they already knew so well that the teacher had little to add. And if she had moved her students ahead to 5th and 6th grade work, then the test, of just 4th grade work, would not show that progress, and the teacher again would be measured as not adding much.
So, net, for a 4th grade teacher, a really good 3rd grade teacher is a tough act to follow!
It's sad to see such stress and struggle with K-12 teaching. We are so wound up with the goal of "no child left behind" and, to this end, coming up with systems to beat up on teachers that don't get us to that goal, that we have really poor systems of evaluation and, with high irony, fail at some basic academic tasks and have far too many false alarms. Bummer.
Or, if you can't do statistics well, then don't do statistics at all. We're better off with no statistics than with bad statistics. Have I seen some really bad, we're talking brain dead, statistics in K-12 education, up close and personal? Yup!
Net, I wouldn't trust the Statistics and Evaluation Branch of the New York State Department of Education to get 2 + 2 = 4 correct. Why? I've seen just way too much fumbling with statistics. Or, bluntly, effective application of statistics to important real situations is mostly quite far beyond the abilities of ordinary organizations -- it's just too hard for them; they just can't get it right; they make messes and do harm.
Here's one way to slow down nearly any application of statistics: go to some statistics texts, get the assumptions, and then demand that the assumptions be justified. One assumption? Sure, independence -- that assumption is so powerful that, in anything much closer to daily reality than quantum mechanics or dice rolling, it is essentially impossible to justify.
Looks like the goal of "no child left behind" has generated a massive bozo explosion.
My dad was a fantastic educator, and his description of the ideal in education was a student sitting at one end of a log and a good teacher sitting at the other end. Try to characterize this educational environment with statistics? As they say in New York, f'get about it.
But, really, no worries, mate: there is a safety valve -- the main source of education anyway, the home. E.g., I had a friend who went to a NYC school where most of the students knew only two words, and they could say those two with a wide variety of variations. The abbreviation of those two words was just "MF". That's all they knew. Not very articulate, but, then, usually they did get their meaning across; but, then, their meaning was not very advanced, either.
My friend? In the third grade, he was sick at home for a week with the flu, and his mother was shocked to discover that he didn't know how to read. So, in that week, she taught him. Then he knew how to read. In school? Maybe he also learned different ways to say MF.
Education? He did quite well: got PBK at SUNY, a Ph.D. at Courant, and was a Member of the Institute for Advanced Study at Princeton. Was his education (1) in K-12 or (2) at home? Three guesses, the first two don't count! Or: for four years, K-3, the schools couldn't teach him to read, and his mother did it in a week. Yes, maybe by grade 12 he knew the binomial theorem, and maybe his mother didn't teach him that at home, but, really, still, the real key to his education was at home.
My dad told me about a basic book on education, Dewey, Democracy and Education. So, since I was spending so much time in school, I wanted to know why, and read the book. At one point Dewey defined education -- passing down from one generation to the next, where he was quite clear that what gets passed is both the good and the bad, not just all good. Well, net, most of that passing down happens at home, and there's next to nothing K-12 can do about it.
Actually, a lot of people understand this basic fact and, thus, want education to start at birth, that is, have the government provide the basic home parenting in, shall we say, at-risk situations. I believe you will find that our current President is in favor of this! In other words, he sees the at-risk situations as so hopeless that for a solution it is necessary to replace the home itself. Maybe he's correct.
Sorry 'bout K-12: I trust that it really can do babysitting, that is, keep the kids off the streets and, thus, mostly out of crime and drugs, keep the sixth and seventh grade girls from getting pregnant, etc. For much more, well, in some of the at-risk situations, it's tough to have much more; or watch the PBS documentary "The Education of Michelle Rhee". She tried. She was good. She tried hard, really hard. In the end, she accomplished basically zip, zilch, zero and nichts, nil, nada. The teachers themselves commonly believed that the goals were just hopeless. Or, she was unable to make the K-12 schools make up for poor homes. Sorry 'bout that. But didn't we know that already?
Well, maybe George Bush believed that education happened in K-12 and that, thus, we could solve the problem of poor education by a program like No Child Left Behind in K-12. Well, W also believed that "The Iraqis are perfectly capable of governing themselves", i.e., have the country stay together, as one country, a democracy, and not split apart into Shiites, Sunnis, Kurds, and Others, with fighting, torture, atrocities, civil war, little problems like those. So, just help the Iraqis write a constitution, hold elections, and all will come together singing "Why Can't We Be Friends?". Where did W get that really strong funny stuff he'd been smoking? Bush 41 was smart enough to stay the heck out of Baghdad. Bush 43 was not. Maybe W was not the brightest bulb on the tree. "No child left behind"? I understand: W, if your father had only had that goal in mind!
"More educational statistics, Ma!" Then, for the poor performers, "off with their heads!" Might not make the situation much worse!
I feel like teachers don't ever want to be evaluated on their performance, yet want raises every year. Without a system in place to get rid of bad teachers and reward good ones, education will stay broken in this country.
Nearly all professions are evaluated and you can be fired. Why should teaching be any different?
Unions are one of the main reasons things haven't gotten any better. As soon as you try to evaluate, the unions step in and stop it.