How do you define "easy question" for a potential alien intelligence? In my opinion the solution, like most solutions when dealing with outliers, is to minimize the impact of outliers.
I mean, presumably that's what the preview testing stage would handle, right? It should be clear if there is a class of obviously easy questions. And if that's not clear, then it makes the scoring even worse.
And in some sense, all of these benchmarks are tied to and biased toward human utility.
I don't think ARC would be designed and scored the way it is if accommodating an alien intelligence were a primary concern. In that case, the entire benchmark itself is flawed and too concerned with human spatial priors.
There are many ways to deal with a problem. Not all of them are good. The scoring for 3 is just bad. It does too much and tells too much.
5% could mean it only answered a fraction of the problems, or it answered all of them but with more game steps than the best human score. These are wildly different outcomes with wildly different implications. A scoring methodology that allows for both is simply not a good one.
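
To make the ambiguity concrete, here's a toy sketch. The scoring rule in it is my own simplification for illustration, not the actual ARC formula: I'm assuming each task scores 0 if unsolved and otherwise the ratio of the best human step count to the agent's step count (capped at 1), averaged over all tasks. The function names and the numbers are made up.

```python
def aggregate_score(results):
    """Average per-task score under an ASSUMED rule (not the real ARC formula).

    results: list of (solved, agent_steps, human_best_steps) tuples.
    An unsolved task scores 0; a solved task scores
    min(1, human_best_steps / agent_steps).
    """
    per_task = []
    for solved, agent_steps, human_best_steps in results:
        if not solved:
            per_task.append(0.0)
        else:
            per_task.append(min(1.0, human_best_steps / agent_steps))
    return sum(per_task) / len(per_task)


def solve_rate(results):
    """Fraction of tasks solved, ignoring step counts entirely."""
    return sum(1 for solved, _, _ in results if solved) / len(results)


# Hypothetical agent A: solves 5 of 100 tasks at human-level efficiency.
agent_a = [(True, 50, 50)] * 5 + [(False, 0, 50)] * 95

# Hypothetical agent B: solves all 100 tasks but needs 20x the human steps.
agent_b = [(True, 1000, 50)] * 100

score_a = aggregate_score(agent_a)
score_b = aggregate_score(agent_b)

print(f"A: score={score_a:.2f}, solved={solve_rate(agent_a):.0%}")
print(f"B: score={score_b:.2f}, solved={solve_rate(agent_b):.0%}")
```

Under that assumed rule both agents land at roughly 5%, even though one solves 5% of the tasks and the other solves all of them. Reporting solve rate and step efficiency as separate numbers would keep the two cases apart.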