Comment Re:Shouldn't this be easy? (Score 3, Insightful) 34
if you ever have to work with pdf beyond merely staring at one you'll realize to what extent that format is an absolute disgrace. there are a zillion tools out there to manage pdf in a zillion ways and not a single one of them gets pdf parsing and layout right 100% of the time, not even adobe's. the only thing pdf had going for it was that it wasn't msword, and that's why it spread like a virus, but we could really benefit from having a proper, truly portable (and universally adopted) document format, even at the cost of reduced functionality or scope.
but the article isn't about that. like the post above yours says, llms probably ignore the format altogether and just ocr-parse a printout. what this is about (apart from being clickbait to take you to the verge) is llms having problems processing text in general exactly the way we would want them to, which is old news. the underlying cause is that llms simply predict text and lack real understanding, which is why their output often looks surprisingly good but is often not quite ok or completely wrong. this is the limit of what their statistical model can offer; now we're trying to furnish them with other supplementary techniques.