
#1
WRAM access optimizations
Posted on: 2014/8/29 6:27
Nintendoid!
Joined 2007/12/14
169 Posts
Coder, Long Time User (11 Years), App Coder
My PCM mixer runs fine on its own but slows execution down to a crawl when paired with music and rendering, so I'm looking to optimize it wherever I can, including dropping down and rewriting parts of it in assembly where practical. When examining the output of building with both -Os and -O3 in gccVB 4, though, I noticed the following peculiar pattern when accessing variables kept in WRAM:


    movhi hi(_masterMusVolume),r0,r10
    ld.b lo(_masterMusVolume)[r10],r14
    movhi hi(_noiseVolume),r0,r27
    movhi hi(_musDataStart),r0,r10
    movhi hi(_freeVSUChannelCur),r0,r25
    movhi hi(_noiseVelocity),r0,r26
    movhi hi(_noiseLeft),r0,r29
    movhi hi(_noiseRight),r0,r31
    ld.b lo(_noiseVolume)[r27],r11
    ld.w lo(_musDataStart)[r10],r18
    movhi hi(_vbTranspose),r0,r10
    ld.w lo(_vbTranspose)[r10],r10
    ld.b lo(_freeVSUChannelCur)[r25],r17
    ld.b lo(_noiseVelocity)[r26],r23
    ld.b lo(_noiseLeft)[r29],r22
    ld.b lo(_noiseRight)[r31],r12


Since WRAM on the VB is located at 0x05000000 and therefore aligned on a 64KB boundary, wouldn't it be more economical to, say, movhi hi(_WRAMStart),r0,r10 just once and then ld lo(_variable)[r10] subsequently for each WRAM access? Why doesn't the code do this or something similar?
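To make the hi()/lo() split concrete, here is a minimal host-side C sketch of how the two halves recombine. It assumes gccVB's assembler follows the v850-style convention, where lo() is used as a sign-extended 16-bit displacement and hi() is bumped by one whenever bit 15 of the address is set; that convention is an assumption here, not something confirmed for gccVB. The helper names (hi16, lo16, effective_address, same_base) are made up for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed v850-style semantics (not confirmed for gccVB):
   hi(x) is the upper 16 bits of x, plus 1 when bit 15 of x is set,
   to compensate for lo(x) being sign-extended as a displacement. */
static uint16_t hi16(uint32_t addr)
{
    return (uint16_t)((addr + 0x8000u) >> 16);
}

/* lo(x): low 16 bits of x, interpreted as a signed displacement. */
static int32_t lo16(uint32_t addr)
{
    int32_t lo = (int32_t)(addr & 0xFFFFu);
    return (lo >= 0x8000) ? lo - 0x10000 : lo;
}

/* Address reached by "movhi hi(x),r0,rN" followed by "ld lo(x)[rN]". */
static uint32_t effective_address(uint16_t hi, int32_t lo)
{
    return ((uint32_t)hi << 16) + (uint32_t)lo;
}

/* Nonzero when two addresses could share a single movhi result. */
static int same_base(uint32_t a, uint32_t b)
{
    return hi16(a) == hi16(b);
}

/* Sanity check: splitting and recombining must give the address back. */
static void check_round_trip(uint32_t addr)
{
    assert(effective_address(hi16(addr), lo16(addr)) == addr);
}
```

Under that assumption, variables in the lower 32KB of WRAM (0x05000000 to 0x05007FFF) all share hi() = 0x0500, while anything in the upper 32KB would get hi() = 0x0501, so one movhi per function would only cover variables that fall in the same half.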

I could rewrite this particular routine in assembly (it runs over 8000 times a second via the timer interrupt, so it needs to be as fast as possible) but this kind of code is generated all over the place whenever WRAM is read or written, so that to me just seems like putting a band-aid over a larger problem. Is this a bug in gccVB or is there a way to coax the compiler into generating more efficient code here?
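One C-level trick that sometimes coaxes a compiler into reusing a base register is to group related globals into a single struct and access them through one pointer, so every field becomes a small displacement off the same address. The sketch below is only an illustration: the struct name, the field types (guessed from the ld.b/ld.w widths in the listing), and the placeholder arithmetic are all assumptions, and whether gccVB actually emits a single movhi for it has not been verified.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical grouping of the mixer globals named in the listing.
   Field widths follow the ld.b / ld.w accesses; signedness is a guess. */
typedef struct {
    const uint8_t *musDataStart;
    int32_t        vbTranspose;
    int8_t         masterMusVolume;
    int8_t         noiseVolume;
    int8_t         noiseVelocity;
    int8_t         noiseLeft;
    int8_t         noiseRight;
    int8_t         freeVSUChannelCur;
} MixerState;

/* One object in WRAM instead of eight separately addressed symbols. */
static MixerState g_mixer;

/* Placeholder arithmetic, not the real mixer: the point is only that
   every access below is a fixed offset from the single pointer m. */
static int32_t noise_level_left(const MixerState *m)
{
    assert(m != NULL);
    return (int32_t)m->noiseVolume * m->noiseLeft * m->masterMusVolume;
}
```

In an interrupt handler this would be called as noise_level_left(&g_mixer), with the address of g_mixer materialized once and the fields reached by displacement.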

#2
Re: WRAM access optimizations
Posted on: 2014/8/29 14:01
VB Gamer
Joined 2008/8/31
Germany
41 Posts
Take a look at this:
http://www.planetvb.com/modules/newbb/viewtopic.php?post_id=17121

#3
Re: WRAM access optimizations
Posted on: 2014/10/14 5:43
Nintendoid!
Joined 2007/12/14
169 Posts
Thanks M.K., that should work for WRAM accesses.

Upon closer inspection I see that this pattern is also applied to other areas of memory. I found a simple example using hardware registers:


    movhi 0x200, r0, r10
    movea 0x20, r10, r10
    ld.b [r10], r11
    mov 5, r12
    andi 0xFF, r11, r11
    ori 0x10, r11, r11
    st.b r11, [r10]
    movhi 0x200, r0, r11
    movea 0x18, r11, r11
    st.b r12, [r11]
    movhi 0x200, r0, r11
    movea 0x1C, r11, r11
    st.b r0, [r11]


This is the assembly generated with -Os for the following C:


HW_REGS[TCR] |= TIMER_20US;
HW_REGS[TLR] = 0x05;
HW_REGS[THR] = 0x00;


The instruction 'movhi 0x200, r0, r11' is emitted twice even though the upper half of the address never changes between the two stores, so the second one is unnecessary. This is with -Os, which is supposed to optimize for code size. Is this something that can be worked around (without writing it by hand in asm), or is it a bug in GCC/v810?
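One possible C-level workaround is to pass the register base into the routine as a pointer, so the compiler already has the base in a register and can, in principle, reach each register with a small displacement instead of rebuilding the full address. The sketch below is a host-testable illustration only: the offsets and the TIMER_20US value are read off the listing (0x20, 0x18, 0x1C and the ori 0x10), the function name timer_setup is made up, and whether gccVB really emits displacement addressing for it is an untested assumption.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Byte offsets and bit value as they appear in the listing above;
   on real hardware the headers from the post would supply these. */
enum { TLR = 0x18, THR = 0x1C, TCR = 0x20, TIMER_20US = 0x10 };

/* Same three statements as the C in the post, but routed through one
   base pointer so each access is "regs + constant offset". */
static void timer_setup(volatile uint8_t *regs)
{
    assert(regs != NULL);
    regs[TCR] |= TIMER_20US;
    regs[TLR] = 0x05;
    regs[THR] = 0x00;
}
```

On the console this would presumably be called as timer_setup(HW_REGS); on a PC it can be exercised against a plain byte array standing in for the register block.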

#4
Re: WRAM access optimizations
Posted on: 2014/10/14 8:18
Nintendoid!
Joined 2007/12/14
169 Posts
Took a look tonight at the gcc 4.4.2 patch that's floating out there, and I think I might have an idea of what's causing this: in output_move_single...


return "movhi hi(%1),%.,%0\n\tmovea lo(%1),%0,%0";


That line appears several times, once for each case in which a 32-bit quantity needs to be loaded, and it always emits those two instructions as a fused pair. Because the pair comes out of a single output template, the compiler never gets a chance to optimize away the redundant instruction. To me this looks either like a bug or like a case that simply isn't optimized by design. I'm leaning toward the former, as it's clearly suboptimal code. Does anybody with knowledge of GCC have any ideas on how to fix it?
