That's a really good question for future systems, as I also see that current ones lack good control over the results.
In the end, you have a similar problem to the one you had with images: an image can convey more meaning than a thousand words. What did people do? They developed Image-to-Image, ControlNets (for things like poses, normal maps, depth, segmentation, etc.), and LoRAs for styles and characters, and now you can control all the details that won't work with just a text prompt.
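For anyone unfamiliar with how LoRAs work under the hood: a minimal PyTorch sketch of the general low-rank adapter idea, not any specific library's implementation. The class name, rank, and scaling are my own assumptions for illustration.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Sketch of a LoRA adapter: wrap a frozen linear layer with a
    trainable low-rank update, y = W x + (alpha / r) * B(A(x)).
    Only A and B are trained, so the adapter stays tiny."""
    def __init__(self, base: nn.Linear, r: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False  # freeze the pretrained weights
        self.A = nn.Linear(base.in_features, r, bias=False)
        self.B = nn.Linear(r, base.out_features, bias=False)
        nn.init.normal_(self.A.weight, std=0.01)
        nn.init.zeros_(self.B.weight)  # update starts at zero, so training begins from the base model
        self.scale = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.scale * self.B(self.A(x))
```

The same trick applies to a music model's attention layers just as it does to an image model's, which is why style/character LoRAs transferred over so naturally.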
I'm not sure about the controls themselves. Musicians can probably tell us more about which factors one would want to manipulate: some, like BPM and pitch, could be conditioned on as simple numbers (see the sketch below), while it's less obvious how one could control others. A few LoRAs already exist for music styles (I've seen three for ACE-Step), but there isn't much yet. Audio-to-Audio exists as well, but if you can't sing, it might not improve your audio quality much. I haven't tried it myself, though.
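To make the "conditioned with simple numbers" idea concrete, here's a minimal sketch of how numeric controls could be embedded and added to a text prompt's conditioning vector in a diffusion-style music model. This is not ACE-Step's actual API; the class, the key encoding, and the normalization are all hypothetical.

```python
import torch
import torch.nn as nn

class ScalarConditioner(nn.Module):
    """Hypothetical sketch: map numeric controls (BPM, key) to an
    embedding that could be added to the text-prompt conditioning
    of a music generation model."""
    def __init__(self, dim: int = 512):
        super().__init__()
        # Continuous control: BPM goes through a small MLP
        self.bpm_mlp = nn.Sequential(
            nn.Linear(1, dim), nn.SiLU(), nn.Linear(dim, dim)
        )
        # Discrete control: 12 keys x major/minor as an embedding table (assumed encoding)
        self.key_emb = nn.Embedding(24, dim)

    def forward(self, bpm: torch.Tensor, key_id: torch.Tensor) -> torch.Tensor:
        # Rough normalization around a typical 120 BPM (assumption)
        bpm_norm = (bpm.unsqueeze(-1) - 120.0) / 60.0
        return self.bpm_mlp(bpm_norm) + self.key_emb(key_id)

cond = ScalarConditioner()
vec = cond(torch.tensor([128.0]), torch.tensor([3]))  # 128 BPM, one assumed key id
print(vec.shape)  # torch.Size([1, 512])
```

Something like this would handle the easy, numeric factors; the harder question is controls like groove or arrangement, which don't reduce to a scalar as cleanly.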