pgPDF: The actual PDF parsing is done by poppler.
Poppler is a PDF rendering library based on the xpdf-3.0 code base.
Xpdf is based on XpdfWidget/Qt™, by Glyph & Cog.
XpdfWidget is based on the same proven code used in Glyph & Cog's XpdfViewer library.
The XpdfViewer® library / ActiveX control provides a PDF file viewer component for use in Windows applications.
Quite the rabbit hole!
Any licensing complications? Is it cross-platform? XpdfViewer seems to be proprietary and Windows-only.
Besides the licensing issues, I'm not sure the solution should be to link Poppler (which has multiple CVEs every year on average) into the database server, especially if you process untrusted data.
Seems to be a great way to gain access to the database server.
Functionally it looks useful, but if these kinds of 'helpers' catch on, there really should be a way to sandbox these 'parser' processes.
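To make the "sandbox the parser process" idea a bit more concrete, here is a minimal sketch in Python of running a parser as a separate, resource-limited child process instead of linking it into the server. The helper names (`limit_resources`, `run_parser`) and the default limits are made up for illustration, and it assumes a POSIX system. It only caps CPU time and memory; it does not restrict filesystem or network access, so it is a starting point rather than a real sandbox.

```python
import resource
import subprocess
import sys


def limit_resources(cpu_seconds, memory_bytes):
    """Build a hook that applies rlimits in the child just before exec."""
    def apply():
        for which, wanted in ((resource.RLIMIT_CPU, cpu_seconds),
                              (resource.RLIMIT_AS, memory_bytes)):
            _, hard = resource.getrlimit(which)
            # Never try to raise a limit above the existing hard limit.
            if hard != resource.RLIM_INFINITY:
                wanted = min(wanted, hard)
            resource.setrlimit(which, (wanted, wanted))
    return apply


def run_parser(argv, data, cpu_seconds=5, memory_bytes=1 << 30, timeout=10):
    """Feed `data` to an external parser on stdin and return its stdout.

    The parser runs in its own process, so a crash or runaway loop in it
    does not take the calling process down with it.
    """
    proc = subprocess.run(
        argv,
        input=data,
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        timeout=timeout,  # wall-clock limit enforced by the parent
        preexec_fn=limit_resources(cpu_seconds, memory_bytes),
        check=False,
    )
    if proc.returncode != 0:
        raise RuntimeError(f"parser exited with status {proc.returncode}")
    return proc.stdout


if __name__ == "__main__":
    # Stand-in "parser": a child Python process that upper-cases stdin.
    # With Poppler installed, argv might instead be a pdftotext command
    # line (assumption: check its man page for the stdin/stdout syntax).
    child = "import sys; sys.stdout.write(sys.stdin.read().upper())"
    print(run_parser([sys.executable, "-c", child], b"hello"))
```

The point of the design is that the untrusted input only ever touches a short-lived process with hard limits, which a container, VM or seccomp profile could then constrain further.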
Totally agree, this data should be supplied by a "page server" (analogous to a frame server in video production) over HTTP using pdf.js so it can run in a browser-based sandbox.
The risks of running this code are just way too high without an org-level security policy about what access this compromised machine would have.
I keep going back and forth trying to figure out if this is sarcasm or not. Firstly it sounds sensible, then you're talking about PDF.js in a browser sandbox?!
OK, so you're using Deno with pdf.js instead of Poppler. While JavaScript is mostly 'memory safe', using all of Deno, pdf.js and firejail makes your attack surface huge and difficult to review or constrain, and will probably tank performance on a big dataset because you have to initialize the whole stack per request. All three of those tools have had significant CVEs too, so adding more layers increases the number of CVEs you have to deal with.
I also don't see what firejail buys you when you constrain deno (or another parser) to a properly secured container or VM.
Good catch, the whole PDF parsing ecosystem is kinda grey on those things.
I tried to be extra-careful and I relied on poppler's official statement:
"Note that Poppler is licensed under the GPL, not the LGPL, so programs which call Poppler must be licensed under the GPL as well. See the section History and GPL licensing for more information."
I once needed it for some PDF experiments and put it on GitHub (this is the newest version; I did _not_ go through the old versions one by one; I just dumped the 2 newest versions)