So, after the lengthy description of the DSM cards, how can we make use of them? As mentioned in the previous chapter, they shipped with an assembler and even an early version of GCC (1.3), so development is pretty straightforward.
First, you have to understand how the cards integrate themselves into an ISA/EISA system. While the three versions (8-, 16- and 32-bit) differ in some areas, the integration is more or less the same:
Each version offers a latch for controlling the card. By writing bits to that latch you activate or reset the card and define a memory window in the host's address space into which the card's dual-ported RAM is blended. The latch is accessible through an I/O port set by jumpers on the card (default 0x300).
So for the ISA cards you would, for example, write 0xC2 to that port address to reset and activate the card and use the mem-window of 0xDC000-0xDC7FF. In Turbo-Pascal this would be something like:
This gives you a 2K mem-window to exchange data between the DSM and the host (just 1K for the DSM860-8).
The EISA cards obviously use other ports, depending on the slot number, so this would be an example to do the same for a DSM860-32, this time in Turbo-C:
This would also open a mem-window at 0xDC000, this time up to 0xDCFFF, i.e. 4K long.
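The window arithmetic described above can be sketched in portable C. This is only an illustration, not the card's driver code: the type and function names are made up for this sketch, and only the window sizes (1K, 2K, 4K) and the 0xDC000 base come from the text. The segment value assumes plain real-mode addressing, where physical address = segment * 16 + offset. The latch write itself is not modelled here.

```c
#include <stdint.h>

/* Hypothetical names for illustration; not part of any DSM software. */
typedef enum { DSM860_8, DSM860_16, DSM860_32 } dsm_card;

typedef struct {
    uint32_t first;   /* first physical address of the window       */
    uint32_t last;    /* last physical address of the window        */
    uint16_t segment; /* real-mode segment that points at 'first'   */
} dsm_window;

/* Window size in bytes as stated in the text: 1K, 2K and 4K. */
uint32_t dsm_window_size(dsm_card card)
{
    switch (card) {
    case DSM860_8:  return 1024u;
    case DSM860_16: return 2048u;
    default:        return 4096u;   /* DSM860_32 */
    }
}

/* Describe the window for a given base address (e.g. 0xDC000).
   The base is assumed to be paragraph-aligned (a multiple of 16). */
dsm_window dsm_window_at(dsm_card card, uint32_t base)
{
    dsm_window w;
    w.first   = base;
    w.last    = base + dsm_window_size(card) - 1u;
    w.segment = (uint16_t)(base >> 4);
    return w;
}
```

Calling dsm_window_at(DSM860_16, 0xDC000) yields the 0xDC000-0xDC7FF range mentioned above, with segment 0xDC00 as the value a real-mode far pointer would use.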
As mentioned above, the host and the DSM card communicate through a memory window whose size depends on the DSM used. The two sides see that memory quite differently, though. That said, at least they’re both little-endian, so no byte-swapping is needed.
The 80x86 side
For the hosting PC, memory looks pretty straightforward. 1KB-4KB of RAM somewhere in ‘lower-RAM’, that’s it.
While we don’t use it, it’s worth mentioning that there’s a second memory window called “Common”. It sits at a fixed address and is shared between all cards plugged into one host. I guess you already got it: this enables easy multi-processor communication… and gives a lot of possibilities for f**k-ups.
The i860 side
The memory mapping on the i860 side is the same for the 16- and 32-bit cards: the dual-ported RAM is located at 0xD0040000 (0xC0000000 on the DSM860-8).
In any case, the i860 memory is linear, 64 bits wide and always aligned on a 64-bit boundary. This means you have to read the DP-RAM area differently depending on which card your code runs on. Here’s an example of what the DP-RAM looks like on the host and the i860 side:
Host DP-RAM in DOS ‘debug’
which would look like this on the i860 side:
So reading from and writing to the DP-RAM requires some thought from the programmer. Here are two code snippets showing the difference between reading the DP-RAM on a DSM860-8 and a DSM860-16. First the ‘8 bit version’:
And the same for the DSM860-16:
Because it reads shorts (ld.s), the DSM860-16 version has to loop just 16 times, while the 8-bit version needs 32 iterations.
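The difference in loop counts can be modelled on any machine with a small C simulation. This is not the i860 assembler from the article. It rests on one loudly stated assumption inferred from the "always on a 64-bit boundary" remark: each unit of the host bus width (one byte on the DSM860-8, one 16-bit short on the DSM860-16) occupies its own 64-bit slot on the i860 side, so consecutive units are 8 bytes apart. The function names and the SLOT_BYTES constant are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* ASSUMPTION: one host-bus unit per 64-bit slot on the i860 side. */
#define SLOT_BYTES 8u

/* DSM860-8 style: one byte per slot.
   Copies nbytes of payload and returns the number of loop iterations. */
size_t read_dp_bytes(const uint8_t *dp, uint8_t *out, size_t nbytes)
{
    size_t i;
    for (i = 0; i < nbytes; i++)
        out[i] = dp[i * SLOT_BYTES];
    return i;
}

/* DSM860-16 style: one little-endian 16-bit short per slot.
   nbytes must be even; returns the number of loop iterations. */
size_t read_dp_shorts(const uint8_t *dp, uint8_t *out, size_t nbytes)
{
    size_t i;
    for (i = 0; i < nbytes / 2; i++) {
        out[2 * i]     = dp[i * SLOT_BYTES];       /* low byte  */
        out[2 * i + 1] = dp[i * SLOT_BYTES + 1];   /* high byte */
    }
    return i;
}
```

For a 32-byte payload, the byte-wise reader runs 32 iterations and the short-wise reader 16, which matches the loop counts quoted above.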
The same applies to writing. You will find an example in the Mandelbrot program (Commented source file).
So here we go, finally a running program that shows off the power of the i860. I took the Mandelbrot example from R.D.Klein and modified it a bit. Well, quite a bit, as it was written for the DSM860-8 and provided CGA output (yuck!).
Like most “external accelerator” programs, there’s one part running on the accelerator (the i860 in this case) and one part running on the host, doing useful things with the data provided. In this case, i860 assembler code does the number-crunching for the Mandelbrot algorithm, using the i860’s ‘dual-instruction mode’, while some Turbo-Pascal code handles the display and zooming.
The latter was extended to use SVGA (640x480x256) output and to provide an interrupt-driven timer. [sourcecode package cleanup is still work in progress]
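For readers who haven't seen it, the number-crunching part boils down to the standard escape-time iteration. The following C function is a generic sketch of that algorithm, not a translation of the i860 assembler source. The function name and the max_iter parameter are chosen here for illustration.

```c
/* Escape-time iteration for one point c = (cr, ci).
   Iterates z = z*z + c starting from z = 0 and returns the number of
   iterations performed before |z| exceeds 2, or max_iter if it never does. */
int mandel_iter(double cr, double ci, int max_iter)
{
    double zr = 0.0, zi = 0.0;
    int n = 0;

    /* |z| <= 2 is tested as |z|^2 <= 4 to avoid a square root. */
    while (n < max_iter && zr * zr + zi * zi <= 4.0) {
        double t = zr * zr - zi * zi + cr;  /* real part of z*z + c      */
        zi = 2.0 * zr * zi + ci;            /* imaginary part of z*z + c */
        zr = t;
        n++;
    }
    return n;
}
```

The host side would then map the returned count to a colour for each pixel. The inner loop is almost entirely floating-point multiplies and adds, which is the kind of work the assembler version hands to the i860.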
Here are the two running full steam ahead:
Some things worth mentioning:
- The host being used here is a P1 133MHz, a bit unfair comparing that to a 40MHz i860 – OTOH they seem quite comparable when it comes down to Mandelbrot crunching speed.
- To calculate the Mandelbrot at the same speed as the Pentium (~15s), I needed five T800-20 according to my benchmarks.
- To even achieve the 8.2s of the i860 I had to run 9(!) T800-20 in parallel.
- An i486DX/33 took 66 sec to do the same (roughly 8 times slower!), while it still took 34s for an i486DX2/66!
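To put the timings in the list side by side, each one can be divided by the i860's 8.2s. The constants below are just the figures quoted above (with "~15s" taken as 15.0), and the function name is made up for this sketch.

```c
/* Wall-clock times in seconds as quoted in the list above. */
#define T_I860      8.2
#define T_PENTIUM  15.0   /* quoted as "~15s" */
#define T_486DX33  66.0
#define T_486DX2   34.0

/* How many times slower than the i860 a run of 'seconds' was. */
double slowdown_vs_i860(double seconds)
{
    return seconds / T_I860;
}
```

With these numbers the i486DX/33 comes out at roughly 8 times slower, the i486DX2/66 at roughly 4 times, and the Pentium at a bit under 2 times.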
So while all that moaning about the bad ‘programmability’ and slow context switches of the i860 is completely justified, in certain tasks that CPU was indeed a real screamer!