It’s an open secret that the CSA Mandelbrot benchmark tool (available in my ‘basic Transputer tools‘ package) is one of my favorite benchmark and test tools when playing around with my various Transputer toys.
One fine day I thought VGA with more than 16 colo(u)rs would be nice… and the coding began. First step: Put the original source (well, already enhanced by a timer and some debugging) on GitHub.
The original CSA Mandel program uses the official 640×480 16 color VGA mode (aka 0x12) and uses its own calls for that, i.e. no external 3rd party libs. Very manly 😉 but not very colorful…
So I created the first branch (aka Mandel_3), added more “modern” command-line option handling and dived into hand-coding VBE (VESA BIOS Extensions) matters. That was very instructive and fun… and the first results showed that I didn’t just get 256 colors: draw speed increased, too 😯
Look Mom! More colors:
Running in host-mode (/t) on my P200MMX the initial screen took 6.6s vs 7.1s for 16-colors – a difference of 0.5s or 7%, which should be much higher on Transputers, or so I thought. And would this mean that bigger Transputer farms had been bottlenecked by the actual plotting of pixels?
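As a quick sanity check of the 7% figure, here is a minimal sketch of the arithmetic. The function name `relative_gain` is made up for illustration; only the two timings come from the measurement above.

```python
def relative_gain(slow_s: float, fast_s: float) -> float:
    """Fraction of the slower run's time that the faster run saves."""
    return (slow_s - fast_s) / slow_s


# Timings quoted above: host mode on the P200MMX, initial screen.
VGA16_SECONDS = 7.1   # original 16-color mode 0x12
VBE256_SECONDS = 6.6  # 256-color VBE mode

saved = VGA16_SECONDS - VBE256_SECONDS
gain = relative_gain(VGA16_SECONDS, VBE256_SECONDS)
print(f"saved {saved:.1f}s, i.e. {gain:.1%} of the 16-color runtime")
```

Measured against the slower 16-color run, the saving rounds to 7%.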
Because 256 colors and higher resolutions (up to 1280×1024 depending on your VGA card’s VESA BIOS) are fine, but even more colors are better, I branched the code a 2nd time (MANDEL_BGI) and replaced the VBE code with a BGI SVGA interface.
While Borland itself originally only supported VGA, there are 2 BGI drivers written by 3rd party developers which do support SVGA and up to 24-bit colors.
It’s commonly known that BGI is not the fastest graphics interface on planet earth… and the benchmark proved this:
I was hoping the change would have more impact when running the same on my Cube system… well it didn’t:
65x T800 (integer)
So as final conclusion, I will stay with the VBE SVGA drivers included in the V3.x code – it’s a good compromise between overall code/distribution size, comfort and speed.
The original VGA mode (0x12) will stay in the code forever to get comparable benchmark measurements – if you really need CGA/EGA/Hercules, you can always use the 2.x version.
Meet The Cube – this is the Transputer Power-House successor to the Tower of Power, which was a bit of a hacked frame-case and based on somewhat non-standard TRAM carriers with a max. capacity of just 24 size-1 TRAMs…
The Cube hardware
This time I went for something slightly bigger 😎 …A clear bow towards the Parsytec GigaCube within a GigaCluster. The Cube uses genuine INMOS B012 double-height Euro-card carriers, giving home to 16 size-1 TRAMs – Parsytec would call this a cluster and so will I.
Currently The Cube uses 4 clusters, making a perfect cube of 4x4x4 Transputers… 64 in total. Wooo-hooo, this seems to be the biggest Transputer network running on this planet (to my knowledge)
If not, there’s still room left for more 😯
Just to give you a quick preview, this is what ispy responds when run against the Cube:
32 x T800@20MHz/1MB (mainly TRAMs from MSC and ARADEX)
-> 96MB of total RAM
-> 70-130 MFLOPS (single precision)
~800MIPS combined integer power
~60Amps @5V needed (That’s 300W 😯 )
So we’re talking about 70-130 MFLOPS here – depending on which documentation you trust and what language (OCCAM vs. Fortran) and/or OS you’re using. That was quite a powerhouse back in 1990 (Cray X-MP class!)… and dwarfed by a simple Pentium III some years later 😉
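To put the totals above into per-node terms, here is a back-of-the-envelope sketch. It assumes the 64 Transputers of the 4x4x4 setup share everything evenly, which is a simplification: the supply also feeds RAM and carrier logic, and the nodes are not all clocked the same. The helper names are invented for this example.

```python
NODES = 64                      # 4 clusters x 16 size-1 TRAMs
SUPPLY_VOLTS = 5.0
SUPPLY_AMPS = 60.0              # approximate figure from the list above
MFLOPS_RANGE = (70.0, 130.0)    # single precision, depends on the source
MIPS_TOTAL = 800.0              # approximate combined integer power


def watts(volts: float, amps: float) -> float:
    """Electrical power drawn from the supply (P = U * I)."""
    return volts * amps


def per_node(total: float, nodes: int = NODES) -> float:
    """Even share of a system-wide total for a single node."""
    return total / nodes


power = watts(SUPPLY_VOLTS, SUPPLY_AMPS)
print(f"total power: {power:.0f} W, about {per_node(power):.1f} W per TRAM slot")
print(f"MFLOPS per node: {per_node(MFLOPS_RANGE[0]):.2f} to {per_node(MFLOPS_RANGE[1]):.2f}")
print(f"MIPS per node: {per_node(MIPS_TOTAL):.1f}")
```

The per-node numbers are only averages, but they make it easier to compare the whole box against a single-CPU machine.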
Just to give you a comparison with recent hardware:
Raspberry Pi Model B+ (700 MHz)
Raspberry Pi 2 Model B (1000 MHz – one core)
Short break for contemplation about getting old…
Ok, let’s go on… you want to see it. Here it is – the front, one card/cluster pulled, 3 still in. On the left the mighty ol’ 60A power supply:
Well, this is the evaluation version in a standard case, i.e. this is meant for testing and improving. I’m planning a somewhat cooler and more stylish case for the final version (read: Blinkenlights etc.).
And here’s the IMHO more interesting view… the backside. It shows the typical INMOS cabling.
As usual, I color coded some of the cables.
The green arrow points to the uplink to the host system to which The Cube is connected. Red are the daisy-chained Analyse/Reset/Error (ARE) signals. The yellow so-called jumper-cables connect some of the IMSB004 links back into the board’s network. And in the upper row (blue) four ‘edge-links’ of each board are connected to its neighbor.
This setup connects four 4×4 matrices (using my C004 dummies as discussed here) into a big 4×16 matrix. Finally I will ‘wrap’ that matrix into a torus. Yeah, there might be more clever topologies, but for now I’m fine with this.
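To illustrate what ‘wrapping’ the 4×16 matrix into a torus means, here is a minimal sketch that computes each node’s four neighbors with wrap-around. It assumes the four clusters sit side by side along the long axis and it ignores the host uplink; the real cabling may be numbered differently, and the names `torus_neighbors` and `cluster_of` are made up for this example.

```python
ROWS, COLS = 4, 16      # four 4x4 clusters placed side by side (assumed layout)
CLUSTER_WIDTH = 4       # columns per B012 carrier in this model


def torus_neighbors(row: int, col: int, rows: int = ROWS, cols: int = COLS) -> dict:
    """Return the grid position of each of the four link neighbors.

    The modulo makes the edges wrap around, which is what turns the
    flat matrix into a torus.
    """
    return {
        "north": ((row - 1) % rows, col),
        "east": (row, (col + 1) % cols),
        "south": ((row + 1) % rows, col),
        "west": (row, (col - 1) % cols),
    }


def cluster_of(col: int) -> int:
    """Index (0..3) of the carrier that holds the given column."""
    return col // CLUSTER_WIDTH


# A corner node: its north and west links wrap to the opposite edges.
for direction, (r, c) in torus_neighbors(0, 0).items():
    print(f"{direction:5s} -> row {r}, col {c:2d} (cluster {cluster_of(c)})")
```

In this model every node uses all four of its links for the grid, so the uplink to the host has to be handled separately.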
Building up power
For completeness, here’s a quick look at how things came together.
The 4 carriers/clusters with lots of size-1 TRAMs… upper right one is the C004-dummy test board (now also fully populated). Upper left is pure AM-B404 love <3
Fixing/replacing the broken power-supply (in the back), including the somewhat difficult search for a working cooling solution:
The Cube software
Well there isn’t any specific software needed to run The Cube, but it definitely cries out loud for some heavily multi-threaded stuff.
So this is running fine – using internal RAM only. On the other hand, it seems that the current power supply has some issues with, well, the electric current.
When booting Helios onto all 64/65 Transputers, which uses all of the external RAM, some of them very soon crash or go into a constant reboot-loop.
By just reducing the network definition (i.e. not pulling any Transputers) to 48, Helios boots and runs rock-solid.
Because measuring the voltage during a 64-T boot shows a solid 5.08V on all TRAM-slots, it most likely means the power supply either can’t deliver the needed amount of Amps (~60) or produces noise etc. 😥
So this is the next construction site I have to tackle.
As soon as you talk about Transputers with people who weren’t there back in 1985, you’ll be asked: “How fast are these Transputer thingies”? Then there’s a staccato of “MIPS? Whetstones? Dhrystones?” etc…
As always with benchmarks, the only valid answer is “it depends”. Concerning Transputers that’s even more true.
First, I suggest you read this Lies, Damn lies and benchmarks document from INMOS itself. It pretty much describes the dilemma and all the smoke and mirrors around that matter.
Benchmarks? It depends.
So you’ve read the above INMOS document? As you might have seen, it’s full of OCCAM code. That’s the #1 prerequisite to get fast, competitive code (as long as you’re not into Transputer assembler). From there it gets worse if you use a C compiler or even FORTRAN…
My little benchmark
Because it scales so well, works with integer as well as floating point CPUs and also runs on the x86 host while using at least the same graphic output routines, my personal benchmark is CSA’s Mandelbrot tool (DOS only).
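For readers who haven’t seen the algorithm, here is a minimal sketch of the escape-time iteration in a floating point and a fixed-point integer flavor. This is not CSA’s code: the 16-bit fraction, the iteration limit and all function names are assumptions chosen for illustration. It shows why the job farms out so well (every pixel is computed independently) and how an integer-only CPU can still do the math.

```python
FRAC_BITS = 16                  # assumed fixed-point format: 16 fractional bits
ONE = 1 << FRAC_BITS


def escape_float(cr: float, ci: float, max_iter: int = 64) -> int:
    """Iterations until z = z*z + c leaves the radius-2 circle (floating point)."""
    zr = zi = 0.0
    for n in range(max_iter):
        zr2, zi2 = zr * zr, zi * zi
        if zr2 + zi2 > 4.0:
            return n
        zr, zi = zr2 - zi2 + cr, 2.0 * zr * zi + ci
    return max_iter


def to_fixed(x: float) -> int:
    """Convert a float to the fixed-point integer representation."""
    return int(round(x * ONE))


def escape_fixed(cr: float, ci: float, max_iter: int = 64) -> int:
    """Same iteration using only integer multiplies, adds and shifts."""
    cr_f, ci_f = to_fixed(cr), to_fixed(ci)
    four = 4 * ONE
    zr = zi = 0
    for n in range(max_iter):
        zr2 = (zr * zr) >> FRAC_BITS
        zi2 = (zi * zi) >> FRAC_BITS
        if zr2 + zi2 > four:
            return n
        zr, zi = zr2 - zi2 + cr_f, ((2 * zr * zi) >> FRAC_BITS) + ci_f
    return max_iter


# Each point depends only on its own coordinates, so rows or pixels can be
# handed to any number of workers without them talking to each other.
for c in [(0.0, 0.0), (-1.0, 0.0), (2.0, 2.0)]:
    print(c, escape_float(*c), escape_fixed(*c))
```

The fixed-point version loses precision at deep zoom levels, so counts near the set’s boundary can differ from the floating point result.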
My slightly modified version is part of my Transputer Toolkit, which is downloadable here. You will need that version because I extended the code of this Mandelzoom with a high precision timer (TCHRT, shareware, can’t remove the splashscreen, sorry) when run with the “-a” parameter. You’ll need my provided default “MAN.DAT” file, which contains 2 coordinates to calculate (1st & 2nd run) to get comparable numbers.
So to bench your Transputer system start it with:
man -v -a
which runs it in VGA mode (640x480x16c), loads the coordinates from “MAN.DAT” and when done presents you with a summary screen like this:
To run it on your host’s x86 CPU, call it with “man -t -v -a”
Here are my results of the different Mandelzoom runs I made in the past. The blue background marks the host machine results, yellow are the integer timings and green is where the mucho macho things are happening.. well, sort of 😉
There are two columns for the results, the HD timer and the hand-timed runtimes. This is because some of these are from the days before I enhanced the Mandelzoom with the timer.
This table will be continuously updated of course. e.g. the last row is pretty new – what might that system be? 😯
The sources are available in my GitHub repository – so we can collaborate on enhancing and optimizing it.
HD in-program timer (s)
i386DX/33 (0kb L2)
Canceled 1st run after a quarter of Mandelbrot was done…
i386DX/33 (0kb L2) + 387
Am386/40 (0kb L2) + 387
21% faster clock but only 10.5% better result
i386DX/33 (128k L2) + 387
Am386DX/40 (128k L2) + 387
i486DX/33 (8k L1, 0k L2)
Pretty close to a single T800-20
i486DX2/66 (8k L1, 128k L2)
Very close to 2x T800-20
Pentium 133 (256kb L2)
About 8x T800-20
Pentium 200 MMX
About 9x T800-20
Downclocked, 64k L1, 256kb L2, 1M L3
Core i3-2120 3.3GHz
There’s something wrong here – needs re-run
65xT425 (48x25MHz, 16x20MHz)
Actually it was 64xT800 and one T425 forcing the calculation to integer
25% higher clock rate should result in 17.5% speedup. Incl. comm-overhead that pretty much fits
“1st run” shows that the slow ISA interface is really becoming a bottleneck