program move32_test;

USES Crt;

const count=1000;
var block1,block2:pointer;
    i,time:longint;
    timer:longint absolute $40:$6C;
    size:word;

procedure Move32(var source,dest;count:word);assembler;

asm
                PUSH    DS
                LDS     SI,source
                LES     DI,dest
                MOV     CX,count
                SHR     CX,1
                JNC     @@1
                MOVSB
@@1:            SHR     CX,1
                JNC     @@2
                MOVSW
@@2:            DB      66h
                REP     MOVSW
                POP     DS
end;

{ --- Mainprog --- }
begin

  clrScr;
  getmem(block1,65010);
  getmem(block2,65010);

  for size:=65000 to 65003 do
    begin

     writeln('Timing blocks of ',size,' bytes :');

     writeln('  Timing Move ...');
     time:=timer;
     for i:=1 to count do
       move(block1^,block2^,size);
     writeln('  Time for ',count,' Move''s   : ',(timer-time)/18.2:8:1,' s');

     writeln('  Timing Move32 ...');
     time:=timer;
     for i:=1 to count do
       move32(block1^,block2^,size);
     writeln('  Time for ',count,' Move32''s : ',(timer-time)/18.2:8:1,' s');

    end;

end.
{ -------------------------------------------------------- }

If you can't find anything wrong in it, test it !
Here are the results on a 486DX4-100 :

Timing blocks of 65000 bytes :
  Timing Move ...
  Time for 1000 Move's   :     11.0 s
  Timing Move32 ...
  Time for 1000 Move32's :      3.6 s
Timing blocks of 65001 bytes :
  Timing Move ...
  Time for 1000 Move's   :     11.0 s
  Timing Move32 ...
  Time for 1000 Move32's :      6.0 s
Timing blocks of 65002 bytes :
  Timing Move ...
  Time for 1000 Move's   :     11.0 s
  Timing Move32 ...
  Time for 1000 Move32's :      6.0 s
Timing blocks of 65003 bytes :
  Timing Move ...
  Time for 1000 Move's   :     11.0 s
  Timing Move32 ...
  Time for 1000 Move32's :      6.0 s

3 times faster on a 4 byte boundary and still almost twice as fast on other
addresses ! I think that's a nice score...
 EH> For REP MOVSD to work faster the values to be moved have to be on "32
bit" EH> addresses, that is: both SI and DI have to be a multiple of 4.
 EH> You didn't test for that and with the extra MOVSB and MOVSW it might well
 EH> be they are on a multiple of 4 + 1 or 3 (as the aligment of TP normally
is EH> on EVEN addresses).

You're right about that, maybe I'll work on it... some day. ;-)

 EH> Apart from that you didn't test for overlap (does the move partially
 EH> overwrite the bytes TO be moved, because then those bytes have to be
moved EH> first)

I hadn't tought about that. I'm not often moving overlapping blocks though.
Are you sure the TP Move checks for that ? (I mean, do you not only assume,
but have you tested it ? :-))
 EH> and you didn't set a direction flag so it just MIGHT be you're
 EH> moving the wrong bytes (mostly the direction flag IS upwards, but it just
 EH> might be downwards, which means you're moving the bytes BELOW "ds:si" to
 EH> "es:di").

The direction flag is assumed to be cleared in TP. Every procudere that
changes it, should clear it again. But it's not forbidden to do a CLD of
course...
 EH> A complete Move32 has to be much more complicated than this (and much
 EH> bigger, thus). Further Move is most often used to/from screen memory and
 EH> unless you got a PCI screen card 32-bits moves are not possible to screen
 EH> memory (the cpu will automatically do each 32-bit doubleword as 2 16-bits
 EH> words, as the bus is only 16 bits).

PCI (and VLB) are becoming more common today, so I don't see the problem...
I tested this with mapping the 2'nd block to $A000 in mode 13h. And I've found
these _strange_ results with a VLB card :
Timing blocks of 65000 bytes :
  Timing Move ...
  Time for 1000 Move's   :     10.7 s
  Timing Move32 ...
  Time for 1000 Move32's :      3.4 s
Timing blocks of 65001 bytes :
  Timing Move ...
  Time for 1000 Move's   :     10.7 s
  Timing Move32 ...
  Time for 1000 Move32's :      5.8 s
Timing blocks of 65002 bytes :
  Timing Move ...
  Time for 1000 Move's   :     10.6 s
  Timing Move32 ...
  Time for 1000 Move32's :      5.8 s
Timing blocks of 65003 bytes :
  Timing Move ...
  Time for 1000 Move's   :     10.6 s
  Timing Move32 ...
  Time for 1000 Move32's :      5.8 s

I always tought videoRAM was SLOWER than normal RAM ???
Do you have an explanation for this ?

