Comment Re:Not sure whats more impressive... (Score 1) 150

It is a lot easier to go over the specifics in person or by phone/Skype/email, but what I can say right now is that you are correct in your base assumption, though I would not call it highly customized... The base of it borrows from the LLVM improvements targeted at VLIW systems over the past couple of years (which work even better for us, as we are a much more simplified/relaxed VLIW) and extends their functionality, but most of our custom work goes beyond the base backend. Technically, if you have a program that compiles with Clang, it will run fine on a single one of our cores... to do the fancier stuff, you'll need to use our additional tools (which we hope to contribute back to the community in some form in the future). Shoot me an email (thomas at rexcomputing.com) and I'd be happy to answer any questions, even before a real job application.

Comment Re:His entire premise is wrong. (Score 1) 150

Except we do have "caches"... just not hardware-managed ones, so we call them scratchpads. Our "L1" equivalent is 128KB per core (2 to 4x more than Intel), 1-cycle latency (compared to 4-cycle latency for Intel), and lower power (as little as 2/5ths the power usage). We do have that TB/s for every core to its local scratchpad, and our total aggregate bandwidth on the network on chip (core to core) is 8TB/s.

I'll throw a Seymour Cray quote in here for fun... "You can't fake memory bandwidth that isn't there"... Faking memory bandwidth is the whole point of hardware-managed caches. Software-managed memory/scratchpads actually augment memory bandwidth.
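To make the "software-managed" part concrete, here is a minimal sketch of what explicit staging looks like (my own illustration, not REX code; rex_dma_get/rex_dma_put are invented placeholders for whatever the real data-movement primitives are, and the tile size is arbitrary):

/* Minimal sketch: explicit software-managed staging through a scratchpad.   */
/* rex_dma_get()/rex_dma_put() are hypothetical placeholders; on real        */
/* hardware these would be asynchronous copies scheduled by the toolchain.   */
#include <stddef.h>
#include <string.h>

#define TILE 4096                                  /* floats per staged tile */
static float scratch_in[TILE], scratch_out[TILE];  /* would live in the 128KB scratchpad */

static void rex_dma_get(float *dst, const float *src, size_t n) { memcpy(dst, src, n * sizeof *dst); }
static void rex_dma_put(float *dst, const float *src, size_t n) { memcpy(dst, src, n * sizeof *dst); }

/* Scale a big array in DRAM one tile at a time, doing all the arithmetic    */
/* out of local SRAM: bandwidth use is planned by software, not guessed at   */
/* by a cache.                                                               */
void scale(float *dram, size_t n, float a) {
    for (size_t base = 0; base < n; base += TILE) {
        size_t len = (n - base < TILE) ? (n - base) : TILE;
        rex_dma_get(scratch_in, dram + base, len);
        for (size_t i = 0; i < len; i++)
            scratch_out[i] = a * scratch_in[i];
        rex_dma_put(dram + base, scratch_out, len);
    }
}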

Comment Re:Not sure whats more impressive... (Score 1) 150

Tensilica (now owned by Cadence) already does a shitty version of this, generating (restricted-in-scope) RTL and a compiler backend from a description of what you want. Synopsys also has an offering that covers a larger scope than Tensilica's, using some high-level synthesis (which I think is basically pseudoscience when it comes to hardware design) along with SystemVerilog stuff. We actually prototyped the hardware-generation part of what you are describing, and it works pretty decently without the limitations Tensilica imposes, but we have shelved the backend-generation part of it to focus on our "real" work.

Comment Re:But it works in RTL Simulation (Score 2) 150

1. We have already run a version of our core (and a rough version of our chip) through synthesis... There's a lot of work to be done, especially as we are in the last steps of locking down the RTL, but we are not worried about timing... we are being very conservative.

2. We already have standard cells and memory compilers. We are not amateurs.

3. We actually have solid-state physics and fabrication experience, and understand the physical constraints of wire and gate delays, leakage, etc. All of those played a very large part in our architectural design, specifically so that timing closure doesn't turn into a huge clusterfuck.

Comment Re:The 19 year old is a lunatic (Score 2) 150

One of the things that doesn't seem to be getting through in most of the media articles is how our memory system is actually set up. I'll try to describe it briefly here, starting from the single core.

At a single core, we have a 128KB multibanked scratchpad memory, which you can think of as being like an L1 cache, minus the cache management logic and with lower latency. We have one-cycle latency for a load/store between the registers and the scratchpad, and the same latency from the scratchpad to/from the core's router. That scratchpad is physically addressed, and does not have a bunch of extra (and in our opinion, wasted) logic to handle address translation, which just takes up a lot of area and power (especially once you multiply it over hundreds of cores and large SRAMs; most people think the TLB logic is a fixed size for any SRAM size, but it is not, and it gets significantly worse if you add coherency). Remember, even an L1 cache hit (typically 16 to 32KB, tops) on an Intel chip still takes 4 whole cycles.

Once we get to a 16x16 grid (256 cores) on our network on chip, we have a total of 32MB of on-chip, 1-cycle-latency scratchpad. We arrange it as a global flat address space, with all of the addresses physically mapped. What I mean by this is that core 0's scratchpad is the first 128K of the address space, and the address space continues seamlessly to core 1, core 2, and all the way to core 255. If the address requested by a core is not in its own scratchpad's range, the request goes to the router and hops across the NoC until it gets there... with a one-cycle latency per hop. We have 32GB/s in each cardinal direction per router, giving a total on-chip bandwidth of 8TB/s. Since it is all statically routed (which is a *very* important part of our entire design, and I am not revealing the full implications of that just yet), we have a guaranteed one cycle per hop between routers on the NoC. So even if you are going from one corner to the other (core 0 to core 255), it is still a max latency of 32 cycles... still less than the latency to the L3 cache on an Intel chip.
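To make the address math concrete, here is a small sketch (my own illustration; the zero base address, row-major core numbering, and XY dimension-order routing are assumptions on my part, and the pure hop count corner to corner works out to 30, with the 32-cycle figure above presumably including injection/ejection):

#include <stdint.h>
#include <stdlib.h>
#include <stdio.h>

#define SCRATCHPAD_BYTES (128 * 1024)   /* 128KB of scratchpad per core          */
#define MESH_DIM         16             /* 16x16 grid = 256 cores, 32MB on chip  */

/* Which core's scratchpad does a flat physical address fall into?              */
static unsigned owner_core(uint64_t addr) {
    return (unsigned)(addr / SCRATCHPAD_BYTES);
}

/* NoC latency in hops between two cores, assuming row-major core numbering     */
/* and dimension-order (XY) routing at one cycle per hop.                        */
static unsigned hops(unsigned src, unsigned dst) {
    int sx = (int)(src % MESH_DIM), sy = (int)(src / MESH_DIM);
    int dx = (int)(dst % MESH_DIM), dy = (int)(dst / MESH_DIM);
    return (unsigned)(abs(sx - dx) + abs(sy - dy));   /* Manhattan distance      */
}

int main(void) {
    uint64_t addr = 255ULL * SCRATCHPAD_BYTES + 64;   /* an address owned by core 255 */
    printf("address 0x%llx belongs to core %u\n",
           (unsigned long long)addr, owner_core(addr));
    printf("core 0 -> core 255: %u hops\n", hops(0, 255));   /* prints 30 */
    return 0;
}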

This gets to the chip-to-chip interconnect, which we have not been very public about, but I can say it is VERY high bandwidth (48GB/s in each direction, on all four sides of the chip, for an aggregate bandwidth of 384GB/s... compare that to 16GB/s for PCIe, or even NVIDIA's 2018/2019 plans for 80GB/s with NVLink). There are a lot of very cool things in that design, but I can't go into them publicly quite yet. We sacrifice distance and interoperability to get those numbers, but we think it is a worthy tradeoff for insane speed and efficiency. The other interesting thing we are looking at (and haven't fully explored the tradeoffs of) is extending the flat address space across multiple chips in a larger grid.

To wrap up, most of the problems you mentioned here and in other comments are not totally valid, as we are not trying to replicate the inefficient protocols that are implemented, super inefficiently, in hardware today. We want to eventually provide the same user experience and convenience that hardware caching provides, while keeping the complexity abstracted away from the user. Hopefully you can understand that I can't go into full detail on this, and you have every reason to be skeptical, but that does not mean we are not going to try anyway.

Also, cool Apple story. Thanks :)
Happy to answer any other questions.

Comment Re:Not sure whats more impressive... (Score 5, Interesting) 150

1. My personal favorite programming models for our sort of architecture would be PGAS/SPMD style, with the latter being the basis for OpenMP. PGAS gives you a lot more power in describing, and efficiently using, shared memory in an application with multiple memory regions. Since every one of our cores has 128KB of scratchpad memory, and all of those memories are part of a global flat address space, every core can access any other core's memory as if it were part of one giant contiguous memory region. That does cause some issues with memory protection, but that is a sacrifice you make for this sort of efficiency and power (we have some plans on how to address that with software... more news on that in the future). The other nice programming model we see is the Actor model... so think Erlang, but potentially also some CSP-like stuff with Go in the future (and yes, I do realize they are competing models).
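As a rough illustration of what PGAS/SPMD over the flat address space could look like, here's a sketch (purely my own; the flat space is modelled as one big array so the code actually runs on a normal machine, and the buffer layout, sizes, and neighbour pattern are invented):

#include <stdio.h>

#define NUM_CORES       256
#define FLOATS_PER_CORE (128 * 1024 / sizeof(float))   /* 128KB scratchpad per core */

/* Stand-in for the globally flat, physically mapped scratchpad space:        */
/* core c's 128KB is simply the c-th slice of one contiguous region.          */
static float scratchpads[NUM_CORES][FLOATS_PER_CORE];

#define N 1024   /* points each core owns, well inside its 128KB */

/* SPMD-style step: each "core" smooths its local points and reads one halo   */
/* element directly out of its neighbour's scratchpad, the way a PGAS program */
/* would, with no explicit message passing.                                   */
static void stencil_step(unsigned my_core) {
    float *mine  = scratchpads[my_core];
    float *right = scratchpads[(my_core + 1) % NUM_CORES];
    for (unsigned i = 0; i + 1 < N; i++)
        mine[i] = 0.5f * (mine[i] + mine[i + 1]);
    mine[N - 1] = 0.5f * (mine[N - 1] + right[0]);   /* remote read over the NoC */
}

int main(void) {
    for (unsigned c = 0; c < NUM_CORES; c++)
        scratchpads[c][0] = (float)c;
    stencil_step(0);
    printf("%f\n", scratchpads[0][N - 1]);   /* 0.5 * (0 + 1) = 0.5 */
    return 0;
}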
If you want to get the latest info as it comes out, sign up for our mailing list on our website!

Comment Re:Not sure whats more impressive... (Score 5, Informative) 150

1. We are IEEE 754 compliant, but I'm not a fan of it TBH, as it has a ridiculous number of flaws... Check out unums and John Gustafson's new book "The End of Error" (and also look up Gustafson's Law, the counterargument to the more famous Amdahl's Law; see the formulas after this list), which goes over all of those flaws and proposes a floating-point format that is superior in *every* measure.
2. The first thing we get around primarily by having ridiculous bandwidth (288 to 384GB/s aggregate chip-to-chip bandwidth)... we'll have more info out on that in the coming months. When it comes to memory movement, that's the big difficulty and what a big portion of our DARPA work is focused on, but a number of unique features of our network on chip (statically routed, non-blocking, single-cycle latency between routers, etc.) go a long way toward letting the compiler *know* that it can push data around in a given amount of time while inserting a minimal number of NOPs. There is a lot of work left, and it will not be perfect in our first iteration, but the initial customers we are working with do not require perfect auto-optimization to begin with.
3. If you think of each core as a quad-issue, in-order RISC core (think on the performance order of an ARM Cortex-A9 or potentially A15, but using a lot less power and being 64-bit), you can have one fully independent and isolated application on each core. That's one of the very nice things about a true MIMD/MPMD architecture. So we do fantastically on things that parallelize well, but you can also use our cores to run a lot of independent programs decently well.
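Since point 1 name-drops Gustafson's Law versus Amdahl's Law, here are the standard textbook forms for reference (my addition, with p the parallel fraction of the work and N the number of processors):

S_{\mathrm{Amdahl}}(N) = \frac{1}{(1 - p) + p/N}, \qquad S_{\mathrm{Gustafson}}(N) = (1 - p) + pN

Amdahl fixes the problem size, so speedup is capped at 1/(1 - p) no matter how many processors you add; Gustafson's counterargument is that in practice the problem size grows with the machine, so the serial fraction shrinks in relative terms.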

Comment Re:Not sure whats more impressive... (Score 4, Informative) 150

This is a bit old and has some inaccuracies, so I hesitate to share it, but since you can find it if you dig deep enough... here it is: http://rexcomputing.com/REX_OC...
A couple of quick things: our instruction encoding is a bit different from what's on the slide; we've brought it down to a 128-bit VLIW (32 bits per functional-unit operation), and there are some pipeline particulars we are not talking about publicly yet. We have also moved all of our compiler and toolchain development over to LLVM (so the really dense slides in there talking about GCC are mostly irrelevant).
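For concreteness, the stated widths can be pictured like this (my own sketch; the slot assignment and any field layout inside a slot are invented, only the 128-bit bundle of four 32-bit operations comes from the numbers above):

#include <stdint.h>

/* One VLIW instruction bundle: 128 bits total, four 32-bit operation slots,  */
/* one per functional unit. Which unit each slot feeds and how the bits       */
/* inside a slot are laid out are assumptions for illustration.               */
typedef struct {
    uint32_t slot[4];
} vliw_bundle_t;

_Static_assert(sizeof(vliw_bundle_t) == 16, "a bundle is exactly 128 bits");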
As mentioned in the presentation, we have some ideas for expanding the 2D mesh on the chip, including turning it into a 2D torus... our chip-to-chip interconnect allows a lot of more interesting geometries, and we are working on one with a university research lab that features a special 50-node configuration with a maximum of 2 hops between nodes. Our 96GB/s chip-to-chip bandwidth per side is also a big thing differentiating us from other chips (with the big sacrifice being the very short distance we need between chips and a lot of constraints on the packaging and the motherboard). We'll have more news on this in the future.
When it comes to sparse and dense computations, we are mostly focusing on the dense ones to start (FFTs are beautiful on our architecture). We are capable of doing well on sparse workloads too, and while those developments are in the pipeline, they will take a lot more compiler development effort.
We actually had a booth in the emerging technologies exhibition at the Supercomputing Conference in 2014, and hope to have a presence again this year.

Comment Re: By Neruos (Score 5, Informative) 150

The biggest thing is what we have tried to emphasize: we have an entirely different memory system that does away with the hardware-managed cache hierarchy. The rest of the really interesting stuff we have not publicly disclosed (yet), but I can tell you it is very different from both Kalray and Tilera.

Comment Re:Not sure whats more impressive... (Score 5, Informative) 150

I'm hugely biased as I am the founder of the referenced startup, but I figured I would point out a few key things:

1. When it comes to FLOPS per watt, we are actually aiming at a 10 to 25x increase over existing systems... The best GPUs (before you account for the power usage of the CPU required to operate them) get almost 6 double-precision GFLOPS per watt, while our chip is aiming for 64.

2. When it comes to being better than a GPU for applications, you have to remember that GPUs have abysmal memory bandwidth (being limited by PCIe's 16GB/s to the CPU) once you run out of data in the relatively small memory on the GPU. Because of this, GPUs are really only good at level 3 BLAS applications (matrix-matrix based... basically the things you would imagine GPUs were designed for, related to image/video processing; see the note after this comment). It just so happened that when the GPGPU craze started ~6/7 years ago, they had enough of an advantage over CPUs to make sense for some other applications, but in actuality GPUs do so much worse on level 1 and level 2 BLAS apps compared to the latest CPUs that they are really starting to lose their advantage (and I think they will die out for anything other than what they were originally designed for, plus some limited heavy matrix workloads... but then again, I'm biased).

3. Programming is the biggest difficulty, and will make or break our company and processor. The DARPA grant is specifically for continued research and work on our development tools, which are intended to automate the unique features of our memory system. We have some papers in the works and will be talking publicly about our *very* cool software in the next couple of months.

4. As for your mention of the Mill and running existing code well: I had a pretty good laugh. Let me preface this by saying that I find stack machines academically interesting and fun to think about, and I don't discredit the Mill team entirely; I think it is a good thing they exist. With that said, they have had barely functioning compilers for years (which they refuse to release publicly), and stack machines are notorious for having HORRIBLE support for languages like C. The fact that Out Of The Box Computing (the creators of the Mill) have been around for over 10 years and have given nothing but talks with PowerPoints (though they are clearly very intelligent and have an interesting architecture) says a lot about their future viability. I hate to be a downer like that, especially since I have found Ivan's talks interesting and he is a nice and down-to-earth guy, but I highly doubt they will ever get a chip out. I'll restate my obvious biases for the previous statement.

Feel free to ask any other questions.
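To spell out the BLAS-level argument in point 2 (standard arithmetic-intensity figures, my addition): for problems of dimension n,

BLAS level 1: O(n) flops on O(n) data, level 2: O(n^2) flops on O(n^2) data, level 3: O(n^3) flops on O(n^2) data.

Only level 3 kernels do enough arithmetic per byte moved to hide a 16GB/s PCIe link once the working set no longer fits in GPU memory; level 1 and 2 kernels are bandwidth-bound, so the transfer cost dominates.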
