The problem here is clearly the person doing the evaluation. Why do they even have a job themselves if they're so incapable of evaluating correctly?
There exists a quantitative method to correctly evaluate workers:
1. Collect each worker's work outputs and construct a training dataset.
2. Train an AI model with all work outputs combined.
3. For each worker, train a model with their respective work outputs deleted.
4. Construct a comprehensive evaluation benchmark over the full combined dataset.
5. For each worker, measure the change in benchmark performance of that worker's leave-one-out model relative to the full global model.
6. Fire the workers whose leave-one-out model yields an unexpected improvement on the benchmark; this means they were not contributing in a meaningful way to performance on the benchmark. Keep the rest. A sketch of the procedure is below.
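To make the mechanics concrete, here is a minimal sketch of steps 1 through 6. Big hedges: it assumes work output can be reduced to labeled text (exactly the assumption disputed below), a toy classifier stands in for the "AI model", and accuracy on a small fixed held-out set stands in for the "comprehensive benchmark". All names, labels, and data points are illustrative, not real.

```python
# Minimal leave-one-out sketch: train on everyone, then retrain with each
# worker's data deleted and compare benchmark accuracy. Toy data only.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Step 1: each worker's outputs as (document, quality_label) pairs --
# hypothetical stand-ins for commits, docs, tickets, etc.
worker_outputs = {
    "alice": [("fix race condition in scheduler", 1),
              ("add retry with exponential backoff", 1)],
    "bob":   [("add structured logging to worker pool", 1),
              ("comment out flaky test to unblock release", 0)],
    "carol": [("rename variable x to y", 0),
              ("copy-paste handler, same bug included", 0)],
}

# Step 4: a fixed evaluation benchmark over the same kind of data.
benchmark = [("guard shared state with a mutex", 1),
             ("delete assertion that caught the bug", 0),
             ("batch writes to cut DB round trips", 1),
             ("hardcode prod credentials", 0)]

def train(samples):
    docs, labels = zip(*samples)
    model = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
    model.fit(docs, labels)
    return model

def accuracy(model, samples):
    docs, labels = zip(*samples)
    return model.score(docs, labels)

# Step 2: the full "global" model on all work outputs combined.
pooled = [s for samples in worker_outputs.values() for s in samples]
baseline = accuracy(train(pooled), benchmark)

# Steps 3, 5, 6: retrain with each worker's data deleted and compare.
for worker, samples in worker_outputs.items():
    ablated = [s for s in pooled if s not in samples]
    delta = accuracy(train(ablated), benchmark) - baseline
    verdict = "fire" if delta > 0 else "keep"
    print(f"{worker}: benchmark delta without them {delta:+.3f} -> {verdict}")
```

Note that the naive version costs N+1 training runs for N workers, which is what the cost objection below is getting at.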
This is magical thinking with trendy AI mixed in. Every step here has massive assumptions baked in: can their output even be quantified in principle? How expensive is it? Will the model even be predictive? Can you get the cooperation of the workers, or of the companies they work for?
Every invention is considered magical thinking by some until it is made possible. Nothing I noted is actually magical or prohibited.
As for its feasibility, that's an engineering task. No cooperation is necessary from workers. It is in fact much more feasible than developing a modern LLM.
As for its cost, various optimizations are possible, as with LLMs. Also, there are high costs to the incorrect classifications management currently makes, so high that they can result in the failure of the firm.
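To give one concrete (hypothetical) example of such an optimization: instead of N+1 retraining runs, you can approximate each worker's leave-one-out effect from a single trained model using gradient dot products, in the spirit of TracIn (Pruthi et al., 2020). A rough sketch, assuming a differentiable model and per-worker data batches; everything here is illustrative, not a claim about what production tooling would look like:

```python
# One-model approximation of leave-one-out influence via gradient
# dot products (TracIn-style). Trades N retraining runs for one,
# at the cost of a much cruder approximation.
import torch

def flat_grad(model, loss):
    # Gradient of a scalar loss w.r.t. all parameters, flattened.
    grads = torch.autograd.grad(loss, list(model.parameters()))
    return torch.cat([g.reshape(-1) for g in grads])

def influence_scores(model, loss_fn, worker_batches, benchmark_batch):
    # Dot each worker's gradient with the benchmark gradient.
    # Positive: their data pulls the model toward lower benchmark loss;
    # negative: it pulls the other way. No retraining needed.
    xb, yb = benchmark_batch
    g_bench = flat_grad(model, loss_fn(model(xb), yb))
    return {w: torch.dot(flat_grad(model, loss_fn(model(xw), yw)), g_bench).item()
            for w, (xw, yw) in worker_batches.items()}

# Toy usage with random data standing in for worker outputs.
torch.manual_seed(0)
model = torch.nn.Linear(8, 2)
loss_fn = torch.nn.CrossEntropyLoss()
workers = {w: (torch.randn(4, 8), torch.randint(0, 2, (4,)))
           for w in ("alice", "bob", "carol")}
benchmark = (torch.randn(16, 8), torch.randint(0, 2, (16,)))
print(influence_scores(model, loss_fn, workers, benchmark))
```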
Please. Your employer already captures everything you produce for your work. They have your screenshots, emails, chats, git history, Zoom call transcripts, and everything else. You signed off on it, too. So save me the "this is unethical" crap. Actually, it is ethical, because it is much closer to being a blind and unbiased way of gauging utility.
This is only possible because they broke the power of labor unions. Nothing you've said convinces me that it would be unbiased or even measure something real, in which case totally arbitrary decisions made by a machine would be denying people access to basic necessities under our economic system.
Ignoring how ridiculous and impractical this idea is, it fails to capture some of the most important skills in being a developer. Framing real-world problems as code problems. Anticipating design issues. Knowing the right trade-off between solution correctness, complexity, and effort. Mentoring and accelerating others. This is barely different from leetcode interviews.
> Why do they even have a job themselves if they're so incapable of evaluating correctly?
The only way for this not to be the case is if the first person you hired is the most competent person that exists, can already do everything, and can therefore accurately measure anyone in front of them. And that goes on down the hiring chain.
In reality, you'll eventually need an expert in an area you are not an expert in, which means you won't necessarily have the insight to pick the best candidate. Maybe AI can do that someday, but definitely not today.
I also think this is what causes large orgs to slowly fail: not only is it difficult for a person to gauge when another is smarter/more capable than them, but a smarter person can look less competent to the one doing the hiring, because their answers/approach fall outside the interviewer's known solution space. So you end up with a slow net decrease in competence over time.
I've seen it in every org I've been a part of, from startups to corporate, including in myself. Trying to judge a person within a small time slot is hard. The alternatives (like take-homes, temp-to-hire, etc.) are also talent repellent. I think the most revealing method can be the Jim Keller method, where you let them nerd out on some potentially unrelated problem, but the results of that are hard to write down/justify.
If you're not educated about the field, it will of course be nonsense to you. It speaks only to the limitations of your knowledge and intelligence. Maybe stick to commenting on what you know.