One of the things that doesn't seem to be getting through in most of the media articles is how our memory system is actually set up. I'll try to describe it briefly here, starting from a single core and working outward.
At a single core, we have a 128KB multibanked scratchpad memory, which you can think of as being like an L1 cache, but smaller and lower latency. A load or store between the registers and the scratchpad takes one cycle, and so does a transfer between the scratchpad and the core's router. The scratchpad is physically addressed, with none of the extra (and, in our opinion, wasted) logic to handle address translation, which takes up a lot of area and power, especially once you multiply it across hundreds of cores and large SRAMs. Most people think the TLB logic is a fixed size regardless of SRAM size, but it is not, and it gets significantly worse once you add coherency. Remember, even a hit in the L1 cache (typically 16 to 32KB, tops) on an Intel chip still takes 4 whole cycles.
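To give a feel for what "physically addressed, no translation" means for software, here's a rough bare-metal sketch in C. The base address, core number, and helper names are purely illustrative (not our actual toolchain or memory map), but they show the shape of it: a pointer into the local scratchpad is just a pointer, with no TLB in the path.

    #include <stdint.h>

    /* Illustrative only: assume this code runs on core 5, whose 128KB
     * scratchpad starts at 5 * 128KB in the flat physical address space.
     * The macro names and this whole convention are hypothetical. */
    #define SPM_SIZE     (128u * 1024u)
    #define MY_CORE_ID   5u
    #define MY_SPM_BASE  (MY_CORE_ID * SPM_SIZE)

    /* No TLB, no tag checks: a load or store through this pointer is a
     * one-cycle access to the local banked SRAM. */
    static volatile uint32_t *const spm = (volatile uint32_t *)(uintptr_t)MY_SPM_BASE;

    static inline void spm_store(uint32_t word_index, uint32_t value) {
        spm[word_index] = value;   /* one-cycle store */
    }

    static inline uint32_t spm_load(uint32_t word_index) {
        return spm[word_index];    /* one-cycle load */
    }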
Once we get to a 16x16 grid (256 cores) on our Network on Chip, we have a total of 32MB of on-chip, one-cycle-latency scratchpad. We arrange it as a global flat address space, with every address physically mapped: core 0's scratchpad is the first 128KB of the address space, and the address space continues seamlessly through core 1, core 2, and all the way to core 255. If the address a core requests is outside its own scratchpad's range, the request goes to the router and hops across the NoC until it gets there... at one cycle of latency per hop. We have 32GB/s in each cardinal direction per router, for a total on-chip bandwidth of 8TB/s. Since it is all statically routed (a *very* important part of our entire design, whose full implications I am not revealing just yet), we have guaranteed one-cycle-per-hop latency between routers on the NoC. So even going from one corner to the other (core 0 to core 255), the maximum latency is 32 cycles... still less than the latency to the L3 cache on an Intel chip.
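If you want to play with the address math, here's a small C sketch using the numbers above. The row-major core numbering and the simple X-then-Y hop count are assumptions for illustration, not a description of our actual router, but the constants are the real ones (128KB per core, 16x16 grid).

    #include <stdint.h>
    #include <stdio.h>
    #include <stdlib.h>

    #define SPM_SIZE   (128u * 1024u)          /* 128KB scratchpad per core */
    #define GRID_DIM   16u                     /* 16x16 grid                */
    #define NUM_CORES  (GRID_DIM * GRID_DIM)   /* 256 cores, 32MB total     */

    /* Which core's scratchpad a flat physical address falls into. */
    static unsigned owner_core(uint32_t phys_addr) {
        return phys_addr / SPM_SIZE;
    }

    /* Hop count between two cores, assuming row-major numbering and
     * one-cycle-per-hop static routing (Manhattan distance on the grid). */
    static unsigned hop_count(unsigned src, unsigned dst) {
        int dx = abs((int)(src % GRID_DIM) - (int)(dst % GRID_DIM));
        int dy = abs((int)(src / GRID_DIM) - (int)(dst / GRID_DIM));
        return (unsigned)(dx + dy);
    }

    int main(void) {
        uint32_t addr = 5u * SPM_SIZE + 0x40u;   /* lands inside core 5's range */
        printf("0x%x -> core %u\n", addr, owner_core(addr));
        /* Corner to corner is 15 + 15 = 30 router hops at one cycle each,
         * which with entry/exit at the endpoints lines up with the ~32
         * cycle worst case mentioned above. */
        printf("core 0 -> core 255: %u hops\n", hop_count(0u, NUM_CORES - 1u));
        return 0;
    }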
That brings us to the chip-to-chip interconnect, which we have not been very public about, but I can say it is VERY high bandwidth: 48GB/s in each direction on all four sides of the chip, for an aggregate bandwidth of 384GB/s... compare that to 16GB/s for PCIe, or even NVIDIA's 2018/2019 plans for 80GB/s with NVLink. There are a lot of very cool things in that design that I can't go into publicly quite yet. We sacrifice distance and interoperability to get those numbers, but we think that is a worthy tradeoff for insane speed and efficiency. The other interesting thing we are looking at (we haven't fully explored the tradeoffs) is extending the flat address space across multiple chips in a larger grid.
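Just to make the bandwidth and multi-chip ideas concrete, here's a quick sketch. The chip-grid addressing below is purely hypothetical (we haven't settled those tradeoffs); it only shows what seamlessly extending the flat space across chips could mean.

    #include <stdint.h>
    #include <stdio.h>

    /* Chip-edge links: 48GB/s per direction, both directions, all four sides. */
    #define LINK_GB_S   48u
    #define DIRECTIONS  2u
    #define SIDES       4u

    /* Hypothetical multi-chip layout: chips numbered contiguously in a larger
     * grid, each with 256 cores and 128KB of scratchpad per core. */
    #define CORES_PER_CHIP  256u
    #define SPM_SIZE        (128u * 1024u)

    static uint64_t global_addr(unsigned chip, unsigned core, uint32_t offset) {
        return ((uint64_t)chip * CORES_PER_CHIP + core) * SPM_SIZE + offset;
    }

    int main(void) {
        printf("aggregate chip-edge bandwidth: %u GB/s\n",
               LINK_GB_S * DIRECTIONS * SIDES);               /* 384 GB/s */
        printf("chip 3, core 7, offset 0x100 -> 0x%llx\n",
               (unsigned long long)global_addr(3u, 7u, 0x100u));
        return 0;
    }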
To wrap up, most of the problems you mentioned here and in other comments don't fully apply, because we are not trying to replicate the inefficient protocols that are implemented (super inefficiently) in hardware today. We do want to eventually provide the same user experience and convenience that hardware caching provides, while keeping the underlying machinery abstracted away from the user. Hopefully you can understand that I can't go into full details on this, and you have every reason to be skeptical, but that does not mean we are not going to try to do it anyway.
Also, cool Apple story. Thanks :)
Happy to answer any other questions