
Journal: Testing AIs
Ok, I've mentioned a few times that I tried to get AIs (Claude, Gemini, and ChatGPT) to build an aircraft. I kinda cheated, in that I told them to re-imagine an existing aircraft (the de Havilland DH.98 Mosquito) using modern materials and modern understanding, so they weren't expected to invent a whole lot. What they came up with would run to around 700 pages of text if you were to prettify it in LaTeX. The complexity is... horrendous. The organisation is... dreadful.
Conclusions from testing so far:
1. Gemini is, at this point, absolutely out of its depth. It is really struggling with this and loses track of stuff easily, but it is still usable for verifying whether a change will work.
2. ChatGPT 5 is having a very, very hard time. It makes far fewer mistakes than Gemini, but even simple tasks (such as re-ordering the text) take 6-7 attempts before it gets them right. It is completely incapable of producing an OWL2 map of the system; the best you will ever get is a flat list of elements (there's a sketch of the difference after this list), and it can take dozens of attempts to get even that right. My confidence in any of its suggestions is low. It usually takes several rounds of verifying its suggestions against Gemini to get something both agree is acceptable.
3. Claude Pro can't load even a small fraction of the specification without running out of context space; I'd have to pay ten times as much to get a context window that suffices. But even when you hand it small fragments, it... does not do well. Like Gemini, though, it can do sanity-checking, within its limited context capacity. It was very good in the early days, much stronger than ChatGPT, but once the project hit a complexity ceiling, it stopped coping.
4. Grok can't cope at all - like Claude, there's just too much there. But, unlike Claude, it's not even any good at sanity-checking.
5. DeepSeek suffers from the same problem as Grok, but it at least makes something of an effort when given small enough fragments.
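For anyone wondering what an "OWL2 map" buys you over a flat list of elements, here is a minimal sketch in Python using rdflib. The component and property names are purely illustrative, not taken from the actual specification on GitLab:

	# Minimal, hypothetical sketch of an OWL2 component map using rdflib.
	# All class/property names below are made up for illustration; they
	# are NOT drawn from the real design documents.
	from rdflib import Graph, Namespace, RDF, RDFS, OWL
	
	MOS = Namespace("http://example.org/mosquito#")
	g = Graph()
	g.bind("mos", MOS)
	
	# Declare component classes and a simple hierarchy.
	for cls in (MOS.Airframe, MOS.WingSpar, MOS.Fuselage, MOS.Material):
	    g.add((cls, RDF.type, OWL.Class))
	g.add((MOS.WingSpar, RDFS.subClassOf, MOS.Airframe))
	g.add((MOS.Fuselage, RDFS.subClassOf, MOS.Airframe))
	
	# Typed relationships are what a flat list can't capture:
	# which component is made of which material, what attaches to what.
	g.add((MOS.madeOf, RDF.type, OWL.ObjectProperty))
	g.add((MOS.madeOf, RDFS.domain, MOS.Airframe))
	g.add((MOS.madeOf, RDFS.range, MOS.Material))
	
	# An individual spar, made of an individual material.
	g.add((MOS.carbonFibre, RDF.type, MOS.Material))
	g.add((MOS.mainSpar, RDF.type, MOS.WingSpar))
	g.add((MOS.mainSpar, MOS.madeOf, MOS.carbonFibre))
	
	print(g.serialize(format="turtle"))

The value of the ontology over the list is exactly those typed relationships (subclass-of, made-of, and so on) - the cross-cutting structure that the models kept losing track of.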
My conclusion, for right now, is that no AI out there can handle the complex interactions in an engineering project. I'm sure they're fine for very basic stuff, but they hit their limits at precisely the point where they might actually have a purpose: complex engineering projects are where humans (even brilliant ones) suffer, because there's just too much to track and too many ways things can conflict.
I'm not into spamming, but the design the AIs came up with is up on GitLab. My thought, at this point, is that it might actually be interesting to LLM engineers, because this is the sort of real-world problem that you know AIs will be used for, so they'd better work. But it might also be interesting to engineers. Programming is hard because it involves creating something new; this isn't creating something new, it's merely replacing components in something that is known, documented, and actually works, where the only things you're changing are the materials used. That's orders of magnitude easier and requires no understanding of the engineering itself.
I would say that what has been produced probably has in the region of 250 defects in it, which would actually be pretty decent if this were a clean-sheet design. An actual aircraft engineer would produce something like that in a version 0.0 design for something built completely from scratch. It's not such a good total when there are next to no novel design elements in there at all.