How would an LLM accurately determine which cases were "easy"? It doesn't reason, you know. What it does is useful and interesting, but it's essentially channeling: what's in its giant language model is the raw material, and the prompt is what starts the channeling. Because the dataset is so large, the channeling can be remarkably accurate, as long as the answer is already in some sense known and represented in that dataset.
But if the answer isn't in the dataset, it's just going to be wrong. And even if it is, whether it comes out as something useful is chancy, because what the model is doing is not synthesis, it's prediction based on a dataset. That can look a lot like synthesis, but it really isn't.
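To make "prediction based on a dataset" concrete, here's a toy sketch (nothing like a real LLM, just a word-level bigram model): generation is only picking a plausible next word from counts over a training corpus, so a prompt the corpus covers produces fluent-looking continuations, while one it doesn't cover goes nowhere. The corpus and prompts are invented purely for illustration.

```python
import random
from collections import defaultdict, Counter

# Tiny "training set": the only raw material the model will ever have.
corpus = (
    "the cat sat on the mat . the dog sat on the rug . "
    "the cat chased the dog around the mat ."
).split()

# Count how often each word follows each other word.
follows = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    follows[prev][nxt] += 1

def generate(prompt_word: str, length: int = 10) -> str:
    """Continue a prompt by sampling a likely next word, one word at a time."""
    words = [prompt_word]
    for _ in range(length):
        options = follows.get(words[-1])
        if not options:  # nothing in the dataset ever follows this word
            break
        choices, weights = zip(*options.items())
        words.append(random.choices(choices, weights=weights, k=1)[0])
    return " ".join(words)

print(generate("the"))      # fluent-looking, because "the ..." is well represented
print(generate("quantum"))  # stops immediately: the word never appears in the corpus
```

The scale is absurdly different, but the shape of the failure is the same: when the prompt lands on territory the dataset covers, the continuation looks like understanding; when it doesn't, there's nothing to channel.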