The "proper solution" here is of course not to use hard-to-parse PDFs, but to force elections to produce machine-parseable outputs. And LLMs can "fix in place" stupid solutions.
That's not hate on the author, though. I've needed to do some PDF parsing for bank statements before, but the proper long-term solution is to force banks (by law or by public interest) to provide parseable statements, not to parse the PDFs!
It's like putting LLMs to work understanding a bad codebase: that will not fix the bad codebase, it will just build on top of it.
Totally reasonable view, and one of our volunteers actually got the law in Kansas changed to mandate electronic publishing of statewide precinct results in a structured format! But finding legislative champions for this issue isn't easy.
I’ve tried using LLMs to do the exact same thing (turning precinct-level election results into a spreadsheet) and in my experience they worked rather poorly. They were less accurate than traditional OCR and, considering how many fixes I had to make, altogether slower than manual entry. The resolution of the page made an outsized difference. It’s nice that you got it to work, but I am skeptical of it as a permanent solution.
Tangentially: I appreciate what OpenElections does. However, I wish there were a similar organization that did not limit itself to officially certified results. There are already other organizations that collect precinct results post-2016, and using only official results basically limits you to 2008 and afterwards, but historical election results are the real intrigue. Not to mention that I have noticed many blatant errors in election results that have supposedly been “certified” by a state/county government. The precinct results Pennsylvania publishes, for example, are riddled with issues.
Skepticism is a necessary trait in this type of work, for sure. I will say that the performance has improved substantially in the past year, though there are still PDFs that require a lot of work.
We went with official precinct results for two main reasons: there are differences between election night and final results (some of them non-trivial) and to make the work more manageable. Agree that historical results are a real problem, and as a PA native I know only too well the errors that the state data contains, which is why we go county-by-county there.
I think that we should encourage elections to _not_ be standardized. The various polities in the USA face many different issues and should not be forced to conform to one specific way of running elections. This is a social problem and we should not cram it into a technical solution. Legibility of elections should be maintained at the local level; trying to make things legible at a national level is, in my opinion, unwanted. As much as I would like the data to be clean, people are not so clean. Even if they used slightly more structured formats than PDFs, the differences between polities must be maintained as long as they are different polities.
The way that OpenElections handles this, with 'sources' and 'data' directories, is I think a good way to bridge the gap.
Not being standardized is fine and even a positive (diversity of technology vendors is a security feature and increases confidence in elections). But producing machine readable outputs of some sort, instead of physical paper and PDFs, is clearly a positive as well.
How is it unwanted to have a standardized database of _results_? They're partly going to be used in a federal context, right?
We do this pretty decently in India: the results of pretty much every election run by the Election Commission are updated on https://results.eci.gov.in/# and it's the same for the whole country.
Elections at the local level should be governed by the locality. I do not see the need for standards at a higher level, other than for democracy to be maintained in some fashion. External data reporting certainly need not be standardized at t̶h̶e̶ ̶l̶o̶c̶a̶l̶ [sic] a higher level.
I have had to do some bank statement to CSV conversions before, and still do occasionally. https://tabula.technology/ has been invaluable for this.
In other news, any bank that does not produce a standard CSV file for their bank statements should be fined $1m per day until they do. It's ridiculous that this isn't the first option when you go to download them.
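To illustrate why a CSV export matters, here is a minimal sketch of reading a statement once it is already machine-readable. The sample data and the column names (`date`, `description`, `amount`) are assumptions made up for illustration, not any real bank's format.

```python
import csv
import io
from decimal import Decimal

# Hypothetical statement export; the column names are assumptions,
# not any real bank's format.
SAMPLE = """date,description,amount
2024-01-03,Grocery store,-42.10
2024-01-05,Salary,2500.00
2024-01-09,Electric bill,-88.45
"""


def load_statement(text):
    """Parse CSV text into (date, description, Decimal amount) tuples."""
    reader = csv.DictReader(io.StringIO(text))
    return [(r["date"], r["description"], Decimal(r["amount"])) for r in reader]


rows = load_statement(SAMPLE)
# Decimal avoids float rounding when summing currency amounts.
net = sum(amount for _, _, amount in rows)
print(net)
```

With a structured export, the whole "parser" is a few lines of standard library code, with no OCR or table detection step to get wrong.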
I did it with one of Go's PDF parsers (I think rsc has one, and ... some guy ... forked it and added some features. It's still kind of manual, but it worked great.)
oh well c'est la vie