It’s an open secret that the CSA Mandelbrot benchmark tool (available in my ‘basic Transputer tools‘ package) is one of my favorite benchmark and test tools for playing around with my various Transputer toys.
One fine day I thought VGA with more than 16 colo(u)rs would be nice… and the coding began. First step: put the original source (well, already enhanced by a timer and some debugging) on GitHub.
The original CSA Mandel program uses the official 640×480 16-color VGA mode (aka 0x12) with its own BIOS calls, i.e. no external 3rd-party libs. Very manly 😉 but not very colorful…
So I created the first branch (aka Mandel_3), added more “modern” command-line option handling and dived into hand-coding VBE (VESA BIOS Extensions) matters. That was very instructive and fun… and the first results showed that I didn’t just get 256 colors now – draw speed had increased, too 😯
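For the curious: the hand-coded VBE part boils down to INT 10h function 4F02h. Here’s a minimal real-mode sketch of such a mode-set – my illustration in Borland-style C, not the actual Mandel source; mode 0x101 is the VESA-standard 640x480x256 mode:

```c
#include <dos.h>
#include <stdio.h>

/* Set a VESA BIOS Extensions (VBE) video mode via INT 10h, AX=4F02h.
   Returns 0 on success; AX must come back as 0x004F (AL=4Fh means
   "function supported", AH=00h means "call succeeded"). */
int vbe_set_mode(unsigned int mode)
{
    union REGS r;
    r.x.ax = 0x4F02;     /* VBE function 02h: set SuperVGA mode */
    r.x.bx = mode;       /* e.g. 0x101 = 640x480, 256 colors    */
    int86(0x10, &r, &r);
    return (r.x.ax == 0x004F) ? 0 : -1;
}

int main(void)
{
    if (vbe_set_mode(0x101) != 0) {
        printf("VBE mode set failed - no VESA BIOS?\n");
        return 1;
    }
    /* ...plot pixels through the banked window at A000h... */
    return 0;
}
```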
Look Mom! More colors:
Running in host-mode (/t) on my P200MMX, the initial screen took 6.6s vs. 7.1s in 16 colors – a difference of 0.5s, or 7%, which I expected to be much bigger on Transputers. And could this mean that bigger Transputer farms had been bottlenecked by the actual plotting of pixels?
Because 256 colors and higher resolutions (up to 1280×1024, depending on your VGA card’s VESA BIOS) are fine, but even more colors are better, I branched the code a 2nd time (MANDEL_BGI) and replaced the VBE code with a BGI SVGA interface.
While Borland originally only supported VGA, there are two BGI drivers written by 3rd-party developers which do support SVGA and up to 24-bit colors.
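Hooking such a 3rd-party driver into a Borland program goes through installuserdriver() from graphics.h. A minimal sketch, assuming a driver file called SVGA256.BGI whose mode 2 is 640x480x256 (check your driver’s docs – these numbers vary):

```c
#include <graphics.h>
#include <conio.h>
#include <stdio.h>

int main(void)
{
    int gd, gm, err;

    /* Register the 3rd-party BGI driver; SVGA256.BGI must be on disk.
       With no autodetect callback we set the mode number ourselves. */
    gd = installuserdriver("SVGA256", NULL);
    gm = 2;                      /* assumed: 640x480 in 256 colors */
    initgraph(&gd, &gm, "");     /* "" = look for .BGI files here  */

    err = graphresult();
    if (err != grOk) {
        printf("BGI error: %s\n", grapherrormsg(err));
        return 1;
    }
    putpixel(100, 100, 42);      /* one pixel in color 42 */
    getch();
    closegraph();
    return 0;
}
```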
It’s commonly known that BGI is not the fastest graphics interface on planet earth… and the benchmark proved this:
P200MMX (seconds)   Orig     VESA     SVGA     SVGA256
1st run             7.123    6.623    8.911    6.915
2nd run             38.258   36.635   39.717   37.725
I was hoping the change would have more impact when running the same on my Cube system… well it didn’t:
65x T800, integer (seconds)   Orig    VESA    SVGA    SVGA256
1st run                       2.323   2.288   3.940   2.383
2nd run                       8.163   8.173   8.181   8.164
So as a final conclusion, I will stay with the VBE SVGA drivers included in the V3.x code – it’s a good compromise between overall code/distribution size, comfort and speed.
The original VGA mode (0x12) will stay in the code forever to get comparable benchmark measurements – if you really need CGA/EGA/Hercules, you can always use the 2.x version.
Meet The Cube – the Transputer power-house successor to the Tower of Power, which was a bit of a hacked frame-case, based on somewhat non-standard TRAM carriers with a max capacity of just 24 size-1 TRAMs…
The Cube hardware
This time I went for something slightly bigger 😎 … a clear bow towards the Parsytec GigaCube within a GigaCluster. The Cube uses genuine INMOS B012 double-height Eurocard carriers, each giving home to 16 size-1 TRAMs – Parsytec would call this a cluster, and so will I.
Currently The Cube uses 4 clusters, making a perfect cube of 4x4x4 Transputers… 64 in total. Wooo-hooo, this seems to be the biggest Transputer network running on this planet (to my knowledge).
If not, there’s still room left for more 😯
Just to give you a quick preview, this is what ispy responds when run against the Cube:
32 x T800@20MHz/1MB (mainly TRAMs from MSC and ARADEX)
-> 96MB of total RAM
-> 70-130 MFLOPS (single precision)
~800MIPS combined integer power
~60Amps @5V needed (That’s 300W 😯 )
So we’re talking about 70-130 MFLOPS here – depending on which documentation you trust and which language (OCCAM vs. Fortran) and/or OS you’re using. That was quite a powerhouse back in 1990 (Cray X-MP class!)… and dwarfed by a simple Pentium III some years later 😉
Just to give you a comparison with recent hardware (Linpack MFLOPS):
Raspberry Pi Model B+ (700 MHz): ~40 DP MFLOPS
Raspberry Pi 2 Model B (1000 MHz, one core): ~134 DP MFLOPS
Raspberry Pi 3 Model B (1200 MHz, one core): ~176 DP MFLOPS
Short break for contemplation about getting old…
Ok, let’s go on… you want to see it. Here it is – the front, one card/cluster pulled, 3 still in. On the left the mighty ol’ 60A power supply:
Well, this is the evaluation version in a standard case, i.e. it’s meant for testing and improving. I’m planning a somewhat cooler and more stylish case for the final version (read: Blinkenlights etc.).
And here’s the IMHO more interesting view… the backside. It shows the typical INMOS cabling.
As usual, I color coded some of the cables.
The green arrow points to the uplink to the host system to which The Cube is connected. Red marks the daisy-chained Analyse/Reset/Error (ARE) signals. The yellow so-called jumper cables connect some of the IMS B004 links back into the board’s network. And in the upper row (blue), four ‘edge links’ of each board are connected to its neighbor.
This setup connects four 4×4 matrices (using my C004 dummies as discussed here) into one big 4×16 matrix. Eventually I will ‘wrap’ that matrix into a torus – the little sketch below shows the idea. Yeah, there might be more clever topologies, but for now I’m fine with this.
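To make the wiring a bit less abstract, here’s a toy sketch (my own node numbering for illustration, not the actual Cube netlist) that prints each Transputer’s four neighbors once the 4×16 matrix is wrapped into a torus:

```c
#include <stdio.h>

#define ROWS 4    /* four clusters...                    */
#define COLS 16   /* ...of 16 TRAMs each, wrapped around */

static int node(int r, int c) { return r * COLS + c; }

int main(void)
{
    int r, c;
    for (r = 0; r < ROWS; r++)
        for (c = 0; c < COLS; c++)
            printf("T%02d: N=T%02d S=T%02d W=T%02d E=T%02d\n",
                   node(r, c),
                   node((r + ROWS - 1) % ROWS, c),  /* north, wraps */
                   node((r + 1) % ROWS, c),         /* south, wraps */
                   node(r, (c + COLS - 1) % COLS),  /* west, wraps  */
                   node(r, (c + 1) % COLS));        /* east, wraps  */
    return 0;
}
```

Every node ends up with exactly four live links – which is precisely what a Transputer offers.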
Building up power
For completeness, here’s a quick look at how things came together.
The 4 carriers/clusters with lots of size-1 TRAMs… the upper-right one is the C004-dummy test board (now also fully populated). Upper left is pure AM-B404 love <3
Fixing/replacing the broken power-supply (in the back), including the somewhat difficult search for a working cooling solution:
The Cube software
Well, there isn’t any specific software needed to run The Cube, but it definitely cries out loud for some heavily multi-threaded stuff.
So the first thing definitely had to be a Mandelbrot zoom. As usual, I used my very own version with a high-precision timer, available in my Transputer Toolkit.
Here’s the quick run in real-time – you can still visually make out each Transputer delivering its result:
So this is running fine – using internal RAM only. On the other hand, it seems that the current power supply has some issues with, well, the electric current.
When booting Helios onto all 64/65 Transputers – which uses all of the external RAM – some of them soon crash or go into a constant reboot loop.
By just reducing the network definition to 48 Transputers (i.e. without physically pulling any), Helios boots and runs rock-solid.
Because the voltage measured during a 64-T boot shows a solid 5.08V on all TRAM slots, it most likely means the power supply either can’t deliver the needed amount of amps (~60) or produces noise etc. 😥
So this is the next construction site I have to tackle.
As soon as you talk about Transputers with people who weren’t there back in 1985, you’ll quickly be asked: “How fast are these Transputer thingies?” Then there’s a staccato of “MIPS? Whetstones? Dhrystones?” etc…
As always with benchmarks, the only valid answer is “it depends”. Concerning Transputers that’s even more true.
First, I suggest you read the “Lies, Damn Lies and Benchmarks” document from INMOS itself. It pretty much describes the dilemma and all the smoke and mirrors around that matter.
Benchmarks? It depends.
So you’ve read the above INMOS document? As you might have seen, it’s full of OCCAM code. That’s the #1 prerequisite to getting fast, competitive code (as long as you’re not into Transputer assembler). From there it gets worse if you use a C compiler or even FORTRAN…
My little benchmark
Because it scales so well, works with integer as well as floating-point CPUs, and also runs on the x86 host while using at least the same graphics output routines, my personal benchmark is CSA’s Mandelbrot tool (DOS only).
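The workload itself is the classic escape-time iteration; a stripped-down kernel looks roughly like the sketch below (my illustration, not the CSA source – the real program farms out whole lines to the Transputer network; maxiter and the sample coordinates are made up):

```c
#include <stdio.h>

/* Count iterations until z = z^2 + c escapes the radius-2 circle. */
static int mandel_iter(double cr, double ci, int maxiter)
{
    double zr = 0.0, zi = 0.0, zr2 = 0.0, zi2 = 0.0;
    int n = 0;
    while (n < maxiter && zr2 + zi2 <= 4.0) {
        zi  = 2.0 * zr * zi + ci;
        zr  = zr2 - zi2 + cr;
        zr2 = zr * zr;
        zi2 = zi * zi;
        n++;
    }
    return n;    /* the iteration count selects the pixel color */
}

int main(void)
{
    /* one pixel's worth of work; c = -0.5 + 0i never escapes */
    printf("%d\n", mandel_iter(-0.5, 0.0, 256));
    return 0;
}
```

On a T800 this loop runs on the on-chip FPU; a T425 has none, so the calculation falls back to integer arithmetic – exactly the integer-vs-floating-point split you’ll see in the results below.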
My slightly modified version is part of my Transputer Toolkit, which is downloadable here. You will need that version because I extended the code of this Mandelzoom with a high-precision timer (TCHRT, shareware, can’t remove the splash screen, sorry), used when run with the “-a” parameter. You’ll also need my provided default “MAN.DAT” file, which contains the 2 coordinates to calculate (1st & 2nd run), so that you get comparable numbers.
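TCHRT itself is closed shareware, so just as an illustration of the technique: high-resolution timing under plain DOS is typically done by latching the 8254 timer chip and combining its countdown value with the BIOS tick count. A rough Borland-style sketch (simplified; ignores the tick-rollover corner case):

```c
#include <dos.h>

/* Microsecond-ish timestamp under plain DOS: BIOS ticks (18.2 Hz,
   stored at 0040:006Ch) plus the 8254 channel-0 countdown (1.19318 MHz).
   Wraps after roughly an hour - fine for a benchmark sketch. */
unsigned long timestamp_us(void)
{
    unsigned int count;
    unsigned long ticks;

    disable();                      /* don't let an IRQ split the reads */
    outportb(0x43, 0x00);           /* latch counter 0                  */
    count  = inportb(0x40);         /* low byte of countdown            */
    count |= (unsigned)inportb(0x40) << 8;      /* high byte            */
    ticks  = *(unsigned long far *)MK_FP(0x0040, 0x006C);
    enable();

    /* one BIOS tick = 54925 us; one 8254 step = 0.8381 us (counts DOWN) */
    return ticks * 54925UL + (65536UL - count) * 8381UL / 10000UL;
}
```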
So to bench your Transputer system, start it with:
man -v -a
which runs it in VGA mode (640x480, 16 colors), loads the coordinates from “MAN.DAT” and, when done, presents you with a summary screen like this:
To run it on your host’s x86 CPU, call it with “man -t -v -a”.
The Results
Here are my results from the different Mandelzoom runs I made in the past. The host-machine results come first, then the integer (T425) timings, and finally the T800 rows, where the mucho macho things are happening… well, sort of 😉
There are two groups of result columns, the HD timer and the hand-timed runtimes, because some entries date from the days before I enhanced the Mandelzoom.
This table will be continuously updated, of course – e.g. the last row is pretty new. What might that system be? 😯
The sources are available in my GitHub repository – so we can collaborate on enhancing and optimizing it.
(HD timer = the in-program high-precision timer, in seconds or mm:ss.ss; hand-timed values are h:mm:ss.)

System                          | HD timer 1st | HD timer 2nd | Hand-timed 1st     | Hand-timed 2nd | Comment
i386DX/33 (0kb L2)              | 1800         | 0            | 1:30:00 (canceled) | 0              | Canceled 1st run after a quarter of the Mandelbrot was done…
i386DX/33 (0kb L2) + 387        | 588          | 3316         | 0:09:48            | 0:55:16        |
Am386/40 (0kb L2) + 387         | 490          | 2980         | 0:08:10            | 0:49:40        | 21% faster clock but only 10.5% better result
i386DX/33 (128k L2) + 387       | 274          | 1547         | 0:04:34            | 0:25:47        |
Am386DX/40 (128k L2) + 387      | 228          | 1292         | 0:03:48            | 0:21:32        |
i486DX/33 (8k L1, 0k L2)        | 01:06.24     | 368.56       |                    |                | Pretty close to a single T800-20
i486DX2/66 (8k L1, 128k L2)     | 00:33.72     | 185.51       |                    |                | Very close to 2x T800-20
Pentium 133 (256kb L2)          | 00:09.09     | 00:55.01     |                    |                | About 8x T800-20
Pentium 200 MMX                 | 00:07.13     | 00:38.06     |                    |                | About 9x T800-20
AMD K6-3+/266                   | 00:06.00     | 00:32.00     |                    |                | Downclocked, 64k L1, 256kb L2, 1M L3
Core i3-2120 3.3GHz             | 00:01.66     | 00:02.13     |                    |                | VirtualBox, 1 CPU
1x T425-20                      |              |              | 0:00:25            | 0:02:28        | There’s something wrong here – needs re-run
2x T425-20                      | 00:51.55     | 04:56.60     |                    |                |
3x T425-20                      | 00:34.42     | 03:17.81     |                    |                |
4x T425-20                      | 00:25.86     | 02:28.56     |                    |                |
5x T425-20                      | 00:20.74     | 01:58.96     |                    |                |
6x T425-20                      | 00:17.37     | 01:39.19     |                    |                |
9x T425-20                      | 11           | 62           | 0:00:11            | 0:01:02        |
13x T425-20                     | 8            | 42           | 0:00:08            | 0:00:42        |
21x T425-20                     | 5            | 27           | 0:00:05            | 0:00:27        |
25x T425-20                     | 4            | 23           | 0:00:04            | 0:00:23        |
65x T425 (48x 25MHz, 16x 20MHz) | 00:02.323    | 00:08.163    |                    |                | Actually it was 64x T800 and one T425 forcing the calculation to integer
1x T800-20                      | 01:09.13     | 06:27.18     |                    |                |
1x T800-25                      | 55           | 309          | 0:00:55            | 0:05:09        | 25% higher clock rate should result in ~17.5% speedup; incl. comm overhead that pretty much fits
2x T800-20                      | 00:35.65     | 03:13.79     |                    |                |
3x T800-20                      | 00:23.16     | 02:09.32     |                    |                |
4x T800-20                      | 00:17.43     | 01:37.04     |                    |                |
5x T800-20                      | 00:14.04     | 01:17.74     |                    |                |
6x T800-20                      | 00:11.82     | 01:04.83     |                    |                |
5x T800-25                      | 11           | 62           | 0:00:11            | 0:01:02        |
9x T800-20                      | 8            | 40           | 0:00:08            | 0:00:40        |
13x T800-20                     | 5            | 30           | 0:00:05            | 0:00:30        |
17x T800-25                     | 00:03.8      | 00:18.59     |                    |                | “1st run” shows that the slow ISA interface really becomes a bottleneck
21x T800-20                     | 4            | 18           | 0:00:04            | 0:00:18        |
33x T800-20                     | 00:02.88     | 00:11.97     |                    |                |
65x T800 (32x 25MHz, 33x 20MHz) | 00:02.21     | 00:05.74     |                    |                |