
#1
Nintendo Instruction Cycle Counts
Posted on: 2014/8/10 2:46
VUE(xpert)
Joined 2012/12/4
418 Posts
In my quest to know everything, I set my eyes on the Nintendo instructions, and their undocumented cycle counts. We know through the development manual that the MPYHW instruction takes 9 cycles, but what about the others?

What I did was write a routine in assembly that enables the instruction cache and the hardware timer with a 20-microsecond tick interval, then runs a short loop with three consecutive instances of a given instruction. At the end, I stop the timer and see how many timer ticks it took. I then subtract from that number the tick count of another run that didn't have any instructions in the loop.

I used the ADD instruction as a baseline, since it takes 1 cycle. Using MUL and DIV to verify it was working, I got the correct cycle counts:


Instr   Ticks   vs. ADD     Cycles
ADD     1966    1.0         1 cycle
MUL     26214   13.33367   13 cycles
DIV     75366   38.33469   38 cycles


Logically, I did the same thing on all the Nintendo instructions to determine their cycle counts. The counts may surprise you.

Please: If you're a coder, run your own tests to get some more sample data on these. I don't doubt my methods, but I'd like some validation from other sources.

My findings are as follows:


Instr   Ticks   vs. ADD     Cycles
MPYHW   18350   9.33367     9 cycles
CLI     23593   12.00051   12 cycles
SEI     23593   12.00051   12 cycles
XB      12452   6.33367     6 cycles
XH      2622    1.33367     1 cycle
REV     43909   22.33418   22 cycles


MPYHW takes 9 cycles as expected.

CLI and SEI both take 12 cycles for some unimaginable reason. Curious is the fact that my program didn't log an extra third of a cycle like it did for all the other instructions I tested. Someone else's testing would be appreciated to get a clearer view on this count.
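
One possible way to read the extra third: the loop holds three instances of the instruction, so multiplying each ratio by 3 gives the cycles per loop iteration relative to ADD. The sketch below does that arithmetic on the tick counts listed above. It assumes the three ADDs cost exactly 3 cycles per iteration, the function names are invented for illustration, and it says nothing about where a leftover cycle would come from.

```c
#include <assert.h>
#include <stdint.h>

/* Three copies of the instruction per loop iteration (from the first post). */
#define INSTANCES_PER_LOOP 3u

/* Cycles per loop iteration relative to ADD, rounded to nearest.
   Assumption: the three ADDs cost exactly 3 cycles per iteration. */
uint32_t cycles_per_iteration(uint32_t net_ticks, uint32_t add_ticks) {
    assert(add_ticks != 0);
    return (net_ticks * INSTANCES_PER_LOOP + add_ticks / 2) / add_ticks;
}

/* Cycles per iteration left over after removing 3 * the whole-cycle count. */
int32_t leftover_cycles(uint32_t net_ticks, uint32_t add_ticks, uint32_t cycles) {
    return (int32_t)cycles_per_iteration(net_ticks, add_ticks)
         - (int32_t)(INSTANCES_PER_LOOP * cycles);
}
```

With the numbers above, MUL (26214 ticks) comes out to 40 cycles per iteration, which is 3 x 13 with 1 left over, while CLI and SEI (23593 ticks) come out to 36, which is 3 x 12 with nothing left over.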

XB looks like it takes 6 cycles. Fair enough. But what caught my eye is that XH only takes 1 cycle. I'd have expected them to be pretty close to each other.

The REV instruction expectedly takes a while to complete. In my tests, it clocked in (pun intended) at 22 cycles.

#2
Re: Nintendo Instruction Cycle Counts
Posted on: 2014/8/10 3:53
VUE(xpert)
Joined 2012/12/4
418 Posts
As long as I'm at it...

The development manual indicates that using any register other than r0 for reg1 in the XB and XH instructions may cause problems, but regardless of which registers I specified or the values in those registers, the instructions performed correctly.

MPYHW, on the other hand, is giving me some mysterious results when bits 16-31 don't sign-extend bit 15 (just like the developer's manual says). I'm gonna have to put together a test program just for that instruction to figure out exactly what it's doing.

#3
Re: Nintendo Instruction Cycle Counts
Posted on: 2014/8/10 5:52
VUE(xpert)
Joined 2012/12/4
418 Posts
MPYHW definitely performs a faithful multiplication, but exactly how it behaves when extending off the left side of the register is something I'm gonna have to investigate further in the morning.

In the meantime, have a ROM! All controls use the left D-Pad: up and down change the value of the current digit, and left and right select the digit.

Attached file: mpyhwtest.vb (8.00 KB)

#4
Re: Nintendo Instruction Cycle Counts
Posted on: 2014/8/10 9:26
PVB Elite
Joined 2003/7/25
USA
1507 Posts
Hmm... some of those results are a bit surprising. It just seems like what's the point of having custom CPU instructions if they're not much faster than what you could do with just a few instructions in software (sure... they're a little bit more convenient, and more compact, but that hardly seems worth a CPU customization).

Quote:

Guy Perfect wrote:
Curious is the fact that my program didn't log an extra third of a cycle like it did for all the other instructions I tested. Someone else's testing would be appreciated to get a clearer view on this count.

Quote:

Guy Perfect wrote:

ADD     1966    1.0         1 cycle

What about ADD? How about running it on a larger chunk of regular V810 instructions to see if it really is an anomaly, or just that some do and some don't (you might make a connection between them)?

And just out of curiosity... why 3 instructions in a row? Do you get the same results with just 1? How about 10?

DogP

#5
Re: Nintendo Instruction Cycle Counts
Posted on: 2014/8/10 13:00
PVB Elite
Joined 2008/12/28
Slovenia
628 Posts
Regarding CLI and SEI, do we even know how long the stock V810 methods take? Specifically:

; CLI
movea 0xEFFF, $0, $10
stsr $PSW, $11
and $10, $11
ldsr $11, $PSW

; SEI
stsr $PSW, $10
ori 0x1000, $10, $10
ldsr $10, $PSW

I can't find the duration of LDSR and STSR in the V810 manual.

#6
Re: Nintendo Instruction Cycle Counts
Posted on: 2014/8/10 14:17
PVB Elite
Joined 2003/7/25
USA
1507 Posts
Quote:

HorvatM wrote:
Regarding CLI and SEI, do we even know how long the stock V810 methods take?

That's a good question... I don't see that listed in the manual. One thing I did notice is that CLI and SEI have the same opcode as EI and DI on the V830 (which is based on the V810 architecture, though not necessarily implemented the same). In the V830 case, they claim to take 4 cycles each. For comparison, LDSR and STSR take 5 cycles on the V830.

DogP

#7
Re: Nintendo Instruction Cycle Counts
Posted on: 2014/8/10 21:01
VUE(xpert)
Joined 2012/12/4
418 Posts
I put together a program to answer this once and for all. It tests all register-based instructions with a simple assembly loop:


# s32 CycleTest(s32 arg1, s32 arg2, s32 num);
vueFunction(_CycleTest)

    # r2 = 0x02000000, base address for hardware control ports
    # r6 = arg1
    # r7 = arg2
    # r8 = num, also used as the loop iterator

    # Configure the hardware timer
    MOVHI 0x0200, r0, r2
    MOV   -1, r1
    ST.B  r1, 0x0018[r2] # Count/reload low  = 0xFF
    ST.B  r1, 0x001C[r2] # Count/reload high = 0xFF

    # Enable and clear the instruction cache
    MOVEA 0x0803, r0, r1
    LDSR  r1, CHCW

    # Enable the timer with 20-microsecond ticks
    MOVEA 0x0011, r0, r1
    ST.B  r1, 0x0020[r2]

    # Execute the instruction 4 times for the given number of iterations
    .Lcycle_loop:
        MOV r6, r9
        MOV r7, r10

        # This comment is located 32 bytes into the function.

        # When the function is not modified, nothing happens in this loop

        # The following bytes are meant to be overwritten in RAM
        BR .Lcycle_end; NOP; NOP; NOP  # Written by 16- and 32-bit instructions
        BR .Lcycle_end; NOP; NOP; NOP  # Written by 32-bit instructions
        BR .Lcycle_end                 # Always present for consistency

        # End-of-loop code for 32-bit instructions (3 16-bit instructions)
        .Lcycle_end:
        ADD -1, r8
    BNZ .Lcycle_loop

    # End-of-loop label

    # Disable the timer and instruction cache
    ST.B r0, 0x0020[r2]
    LDSR r0, CHCW

    # Retrieve and return the number of timer ticks taken
    IN.B 0x0018[r2], r6   # Timer count low
    IN.B 0x001C[r2], r7   # Timer count high
    SHL  8, r7            # r7 = r7 << 8 | r6;
    OR   r6, r7
    MOV  -1, r10          # r10 = -1 - r7 & 0xFFFF;
    SUB  r7, r10
    ANDI 0xFFFF, r10, r10

    JMP [r31]

vueEnd(_CycleTest)


This function gets copied into RAM at run-time. The first two BR/NOP groups are placeholder bytes that the program overwrites with the instructions under test. There are two groups so that both 16- and 32-bit instructions fit. The final BR instruction is always present so that the loop overhead is the same in every run and only the instructions under test change the timing.

The C code that drives this looks like this:


// Gets the number of timer ticks for a loop of 4 instances of an instruction
s32 GetCount(const INST *inst, s32 num) {
    s32 len = (SIZE_CYCLETEST + 3) / 4;
    u32 arg1 = 0, arg2 = 0;
    s32 x, y, offset = 32;
    u8 func[len];
    u16 bits[2];

    // Copy the function into memory
    memcpy32(func, &CycleTest, len);

    // If we're not overwriting with an instruction, ignore this all
    if (inst != NULL) {

        // Encode the instruction into data bits and get its size
        len = FORMATS[inst->format](inst, bits);

        // Copy the instruction into the function buffer 4 times
        for (x = 0; x < 4; x++) for (y = 0; y < len; y++) {
            *(u16 *)(&func[offset]) = bits[y];
            offset += 2;
        }

        // Grab the instruction's pre-defined operands
        arg1 = inst->val1;
        arg2 = inst->val2;
    }

    // Call the function from the byte buffer
    return ((s32 (*)(u32, u32, s32)) func)(arg1, arg2, num);
}


My main function calls this function 5 times for each instruction (predefined in a const table at the top of the program) and averages the counts. It then subtracts the averaged count of a null call (no instruction overwritten) and divides the result by the count for ADD, which is known to be 1 cycle.
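
The averaging and normalization described above can be sketched roughly as follows. This is not the code from the ROM: the function names are hypothetical, rounding to the nearest whole cycle is an assumption that happens to match the listed results, and the checks further assume that the hexadecimal column in the output is the tick count after the null run has been subtracted.

```c
#include <assert.h>
#include <stdint.h>

/* Average of several timer-tick samples, rounded to nearest. */
uint32_t average_ticks(const uint32_t *samples, uint32_t count) {
    assert(count != 0);
    uint32_t sum = 0;
    for (uint32_t i = 0; i < count; i++) {
        sum += samples[i];
    }
    return (sum + count / 2) / count;
}

/* Ticks attributable to the tested instructions alone:
   the averaged run minus the averaged null (empty loop) run. */
uint32_t net_ticks(uint32_t averaged, uint32_t null_averaged) {
    assert(averaged >= null_averaged);
    return averaged - null_averaged;
}

/* Whole cycles per instruction, using ADD (1 cycle) as the unit.
   Rounding to nearest is an assumption, not taken from the ROM. */
uint32_t ticks_to_cycles(uint32_t net, uint32_t add_net) {
    assert(add_net != 0);
    return (net + add_net / 2) / add_net;
}
```

For example, with ADD at 0x051E net ticks, ticks_to_cycles(0x4148, 0x051E) gives 13 for MUL and ticks_to_cycles(0x6F5C, 0x051E) gives 22 for REV.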

The output on the hardware looks like this:


ADD (Immediate)    051E =  1 cycle
ADD (Register)     051E =  1 cycle
ADDF.S             6F5C = 22 cycles
ADDI               051F =  1 cycle
AND                051E =  1 cycle
ANDI               051E =  1 cycle
CLI                3D71 = 12 cycles
CMP (Immediate)    06ED =  1 cycle
CMP (Register)     051E =  1 cycle
CMPF.S             228F =  7 cycles
CVT.SW             4666 = 14 cycles
CVT.WS             27AE =  8 cycles
DIV                C148 = 38 cycles
DIVF.S             DFFF = 44 cycles
DIVU               B70A = 36 cycles
LDSR               28F6 =  8 cycles
MOV (Immediate)    051F =  1 cycle
MOV (Register)     051E =  1 cycle
MOVEA              051F =  1 cycle
MOVHI              051E =  1 cycle
MPYHW              2CCC =  9 cycles
MUL                4148 = 13 cycles
MULF.S             83D7 = 26 cycles
MULU               4147 = 13 cycles
NOT                051E =  1 cycle
OR                 051E =  1 cycle
ORI                051F =  1 cycle
REV                6F5C = 22 cycles
SAR (Immediate)    051E =  1 cycle
SAR (Register)     051E =  1 cycle
SEI                3D70 = 12 cycles
SETF               051E =  1 cycle
SHL (Immediate)    051E =  1 cycle
SHL (Register)     051E =  1 cycle
SHR (Immediate)    051F =  1 cycle
SHR (Register)     051E =  1 cycle
STSR               28F5 =  8 cycles
SUB                051E =  1 cycle
SUBF.S             83D7 = 26 cycles
TRNC.SW            4147 = 13 cycles
XB                 1D70 =  6 cycles
XH                 051F =  1 cycle
XOR                051E =  1 cycle
XORI               051E =  1 cycle


All instructions with documented cycle counts have the correct count, so that's a relief. The floating-point instructions fall within their given range. The undocumented cycle counts? Well, that's why I made this program.

LDSR and STSR are 8 cycles each. I was expecting 1 cycle. This is how we learn things, though. Suddenly CLI and SEI taking 12 cycles doesn't sound so bad.
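
As a rough check of that comparison, the software sequences HorvatM listed earlier in the thread can be costed with the measured counts. The sketch below simply sums the per-instruction cycles (1 for MOVEA, AND and ORI, 8 for LDSR and STSR) and assumes there is no overlap or stall between instructions, which has not been measured. The names are made up for illustration.

```c
#include <assert.h>

/* Cycle counts measured in this thread. */
enum {
    CYCLES_SIMPLE = 1,   /* MOVEA, AND, ORI */
    CYCLES_LDSR   = 8,
    CYCLES_STSR   = 8,
    CYCLES_CLI    = 12,
    CYCLES_SEI    = 12
};

/* movea / stsr / and / ldsr, summed with no overlap assumed. */
int software_cli_cycles(void) {
    return CYCLES_SIMPLE + CYCLES_STSR + CYCLES_SIMPLE + CYCLES_LDSR;
}

/* stsr / ori / ldsr, summed with no overlap assumed. */
int software_sei_cycles(void) {
    int total = CYCLES_STSR + CYCLES_SIMPLE + CYCLES_LDSR;
    assert(total > 0);
    return total;
}
```

Under those assumptions the software CLI costs 18 cycles and the software SEI costs 17, against 12 for each dedicated instruction.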

MPYHW is 9 cycles as seen before. Likewise for XB at 6 cycles, XH at 1 cycle and REV at 22 cycles.

A ROM of this program is attached to this post. After the test is finished, up and down on the left D-Pad scroll the list of instructions.

Attached file: cycletest.vb (8.00 KB)

#8
Re: Nintendo Instruction Cycle Counts
Posted on: 2014/8/10 22:10
VUE(xpert)
Joined 2012/12/4
418 Posts
I figured out the operation of MPYHW.

* reg1 is treated as a 17-bit integer, sign-extended to 32 bits in size.
* reg2 is treated as a 32-bit, signed integer.
* Multiplication happens normally, storing the result in reg2.
* r30 is not affected as it is in MUL and MULU.

Algorithm:


// On an unsigned variable
reg2 *= (reg1 & 0x0001FFFF) | ((reg1 & 0x00010000) ? 0xFFFE0000 : 0);

// On a signed variable
reg2 *= reg1 << 15 >> 15;
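
For anyone who wants to try the described behavior on a PC, here is a minimal portable C sketch. It follows the bullet points above (17-bit sign extension of reg1, ordinary 32-bit multiply, low 32 bits of the result kept) and does the multiply on unsigned values to avoid signed-overflow issues in C. The function names are hypothetical, and this is a model of the findings in this post, not a verified emulation of the hardware.

```c
#include <assert.h>
#include <stdint.h>

/* Treat the low 17 bits of value as a signed integer. */
int32_t sign_extend_17(uint32_t value) {
    uint32_t low = value & 0x0001FFFFu;
    if (low & 0x00010000u) {
        low |= 0xFFFE0000u;   /* replicate bit 16 into bits 17-31 */
    }
    return (int32_t)low;
}

/* Model of MPYHW as described above: returns the new value of reg2. */
int32_t mpyhw(int32_t reg2, int32_t reg1) {
    uint32_t multiplier = (uint32_t)sign_extend_17((uint32_t)reg1);
    uint32_t product = (uint32_t)reg2 * multiplier;   /* wraps modulo 2^32 */
    assert(sign_extend_17(multiplier) == (int32_t)multiplier);
    return (int32_t)product;
}
```

For example, mpyhw(3, 0x0001FFFF) treats reg1 as -1 and yields -3, while mpyhw(3, 0x0000FFFF) treats reg1 as 65535 and yields 196605.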