I agree that many low-level programming methods aren't that necessary anyhow, but there is one big area where the compiler cannot help much, and that is data layout. Big hits come from cache misses at all levels, and it's good for the programmer to be aware of this, benchmark the memory access patterns, and try to make them good (predictable, linear, clumping frequently used data together, etc). On some hardware, load-hit-stores are also something to be aware of.

A reasonable thing to do when optimizing is to fiddle with the code a bit and see what generates the best assembly. This is usually a good compromise: you still stay at a higher level and get portable code, with some gained performance on at least one platform.

Still, compilers aren't a magic wand everywhere, especially on deeply embedded or specialized hardware. One example is SPU programming. Since SPUs read and write everything from/to 16-byte-aligned addresses, the current GCC compiler generates lots of "align ptr, load, rotate, calculate, rotate, combine, store" sequences. If you want good SPU performance, dropping into assembly is indeed viable, though most of the time staying at the intrinsics level gives you an adequate compromise. And since SPUs are basically fast DSPs, many of the tasks they run are quite repetitive by nature, with a small amount of work per item and millions of items (doing vertex transforms, simulating some post-processing effect, mixing audio, etc).

But a good programmer always benchmarks first, checks the compiler output, etc., before hitting the deck with raw assembly.