A look into: AMD ISA from HLSL

After my last post on the Shader Analyser, which outputs AMD ISA (Instruction Set Architecture), I thought it was worth doing a little write-up on what exactly that is and why it may be worthwhile to take the generated code into consideration when doing low-level optimisation.

To get started we will look at a simple example pixel shader. In all of these examples we will be using shader model 5.0 and building for the Hawaii architecture.

struct PS_INPUT
{
    float4 pos : SV_POSITION;
    float4 tex : TEXCOORD0;
};


float4 psMain(PS_INPUT input) : SV_TARGET
{
    return float4(1, 1, 1, 1);
}

OK, so here we have our very basic pixel shader. It takes an input structure from the vertex shader containing a position and a texture coordinate, uses neither of them, and just writes 1.0 into each channel of the output.

So, we can look at this at three levels: the ASM that is generated by DirectX, the AMD ISA, which is the set of instructions the GPU actually executes, and the AMD IL (Intermediate Language), which is the form the driver works with in between the two. So let's take a look at each of these:

DirectX ASM

ps_5_0
dcl_globalFlags refactoringAllowed
dcl_output o0.xyzw
mov o0.xyzw, l(1.000000,1.000000,1.000000,1.000000)
ret 

Here we can see that it is declaring an output register (o0.xyzw), copying 1.0 into each channel of that register and then returning. This is as straightforward as you can get. It makes no use of the position or texcoord data passed through to the shader, as we haven't accessed them at all in the HLSL shader.

AMD ISA

shader psMain

  v_mov_b32     v0, 1.0                                     // 00000000: 7E0002F2
  v_cvt_pkrtz_f16_f32  v0, v0, v0                           // 00000004: 5E000100
  s_nop         0x0000                                      // 00000008: BF800000
  exp           mrt0, v0, v0, v0, v0 done compr vm          // 0000000C: F8001C0F 00000000
  s_endpgm                                                  // 00000014: BF810000
end

So now things are getting more complex looking, but don't worry, it is still very simple when you break it down! Here is a link to all of the instructions. So let's take a look at this line by line.

Our first instruction is v_mov_b32, which the documentation describes as:

V_MOV_B32
Single operand move instruction. Allows denorms in and out, regardless of denorm mode, in
both single and double precision designs.

This means that the instruction is just a move, pretty much like the "mov" in the DirectX ASM. Here it is moving the value 1.0 into the register v0.
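To make the hex column at the end of each ISA line a little less mysterious, here is a minimal Python sketch that pulls apart the word 7E0002F2 from the listing above. It assumes the VOP1 field layout and inline-constant numbering described in AMD's ISA documentation for this hardware generation; the function and table names are my own, and the tables only contain the few entries needed for this shader.

```python
# Hypothetical helper: decode a single VOP1 instruction word.
# Assumed layout: [31:25] = 0b0111111, [24:17] = VDST, [16:9] = OP, [8:0] = SRC0.
VOP1_OPCODES = {1: "v_mov_b32"}  # partial table, only what this post uses
INLINE_FLOATS = {240: 0.5, 241: -0.5, 242: 1.0, 243: -1.0}  # partial table


def decode_vop1(word):
    """Return a readable string for a 32-bit VOP1 encoding."""
    if (word >> 25) != 0b0111111:
        raise ValueError("not a VOP1 encoding")
    vdst = (word >> 17) & 0xFF
    op = (word >> 9) & 0xFF
    src0 = word & 0x1FF
    if src0 >= 256:
        src = f"v{src0 - 256}"           # 256..511 address vector registers
    elif src0 in INLINE_FLOATS:
        src = str(INLINE_FLOATS[src0])   # small set of built-in float constants
    else:
        src = f"src#{src0}"              # anything this sketch does not model
    name = VOP1_OPCODES.get(op, f"op#{op}")
    return f"{name} v{vdst}, {src}"


print(decode_vop1(0x7E0002F2))
```

The 1.0 never appears as a 32-bit float in the instruction stream: it is selected by a source-operand code, which is why the whole move fits in a single word.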

Next we have the more intimidatingly named v_cvt_pkrtz_f16_f32. Not to worry, again we will go to the documentation to see what this is:

v_cvt_pkrtz_f16_f32
Convert two float 32 numbers into a single register holding two packed 16-bit floats.

So in the first instruction we stored a 32-bit value of 1.0 in the register v0. Now we are going to store two 16-bit values in that same register, and both of them are going to be the value we initially stored in v0, converted to 16 bits. This gives us a register which holds two 16-bit values of 1.0. This is a little strange, but things tend to get a little bit strange the further down you go, as you start seeing things the compiler has done to make the code run more optimally on its hardware.
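As a rough illustration of what that packing produces, here is a small Python sketch. The "rtz" in the instruction name stands for round toward zero, so the sketch converts with Python's round-to-nearest half format and then steps back one unit if that rounded away from zero. The function names are hypothetical, and this only sketches the arithmetic; it is not a bit-exact model of how the hardware treats denormals or other special cases.

```python
import struct


def f32_to_f16_rtz(x):
    """Return the 16-bit half-float pattern for x, rounded toward zero."""
    try:
        bits = struct.unpack("<H", struct.pack("<e", x))[0]  # nearest-even
    except OverflowError:
        # Finite but too large for a half: rounding toward zero gives the
        # largest finite half (0x7BFF) with the sign of x.
        return (0x8000 if x < 0 else 0) | 0x7BFF
    back = struct.unpack("<e", struct.pack("<H", bits))[0]
    if abs(back) > abs(x):
        bits -= 1  # nearest rounding went away from zero, step back one ulp
    return bits


def cvt_pkrtz_f16_f32(src0, src1):
    """Pack two floats into one 32-bit value, src0 in the low 16 bits."""
    return f32_to_f16_rtz(src0) | (f32_to_f16_rtz(src1) << 16)


print(hex(cvt_pkrtz_f16_f32(1.0, 1.0)))
```

For our shader both sources are v0, so both halves of the result hold the half-float pattern for 1.0 (0x3C00).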

Our next instruction is "s_nop". If you have worked with assembly before this may be familiar: it means no operation. The description from the documentation:

s_nop
Do nothing. Repeat NOP 1..8 times based on SIMM16[2:0]. 0 = 1 time, 7 = 8 times.

Now you must be thinking that this is even more odd than before. Why on earth would a shader want to waste an instruction doing nothing? Well, this calls for us to dig further into the documentation, where we find this little bit of information:

Must add an S_NOP between two consecutive S_SETREG to the
same register.

S_SETREG is an instruction that writes data to an internal hardware register, so this could be telling us that the compiler is adding a required s_nop because the next instruction is going to write to the same register as the instruction above. However, in this case I believe the s_nop is there to pad this shader to 4 instructions, and it has been placed before the export instead of after it as an optimisation.
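The repeat rule quoted from the documentation above can be checked against the encoding BF800000 in the listing. The following Python sketch assumes the SOPP field layout from AMD's ISA documentation (a 9-bit encoding marker, a 7-bit opcode and a 16-bit immediate); the helper names and the two-entry opcode table are my own.

```python
# Hypothetical helper: decode a scalar "SOPP" instruction word.
# Assumed layout: [31:23] = 0b101111111, [22:16] = OP, [15:0] = SIMM16.
SOPP_OPCODES = {0: "s_nop", 1: "s_endpgm"}  # partial table


def decode_sopp(word):
    """Return (mnemonic, simm16) for a 32-bit SOPP encoding."""
    if (word >> 23) != 0b101111111:
        raise ValueError("not a SOPP encoding")
    op = (word >> 16) & 0x7F
    simm16 = word & 0xFFFF
    return SOPP_OPCODES.get(op, f"op#{op}"), simm16


def nop_repeat_count(simm16):
    """SIMM16[2:0] selects 1..8 repeats: 0 means once, 7 means eight times."""
    return (simm16 & 0b111) + 1


name, imm = decode_sopp(0xBF800000)
print(name, nop_repeat_count(imm))
```

With an immediate of 0x0000, the s_nop in our shader does nothing exactly once.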

The next instruction is "exp", which our documentation tells us is the export instruction for this shader program. This is where the shader writes to the render targets. This line is a little more complex than the others, so we can look at it one bit at a time.

The first bit to make sense of is the words at the end: "done compr vm". These are each individual flags. The flag "done" indicates that this is the last output to a render target from this program. "compr" tells the GPU that the data is 16 bits per component rather than 32 bits. "vm" stands for valid mask: it says that the current execution mask is the valid mask for the wavefront, and it must be set at least once per pixel shader. I will go into wavefronts and what this means in more detail in a later post.

The next part of this line to take a look at is "mrt0". This is telling the program to write into the first render target, which corresponds to the SV_TARGET output of psMain in our HLSL shader.

The last part of the line is the repeated register "v0". This is telling the program which register to read for each channel. "v0" currently holds two packed 16-bit values of 1.0, and because of the "compr" flag each read of it supplies packed 16-bit data rather than a single 32-bit value, so all four output channels end up as 1.0.
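The flags, the target and the source registers can all be read back out of the two encoding words printed next to the exp line (F8001C0F 00000000). This Python sketch assumes the export field layout from AMD's ISA documentation for this hardware generation; the function name and the dictionary keys are my own choices.

```python
# Hypothetical helper: decode the two words of an export instruction.
# Assumed layout of the first word: [3:0] = EN (channel enable mask),
# [9:4] = TGT, [10] = COMPR, [11] = DONE, [12] = VM, [31:26] = 0b111110.
# The second word holds four 8-bit source VGPR indices, VSRC0 lowest.
def decode_exp(dword0, dword1):
    if (dword0 >> 26) != 0b111110:
        raise ValueError("not an EXP encoding")
    return {
        "en": dword0 & 0xF,
        "target": (dword0 >> 4) & 0x3F,   # 0..7 are mrt0..mrt7
        "compr": bool((dword0 >> 10) & 1),
        "done": bool((dword0 >> 11) & 1),
        "vm": bool((dword0 >> 12) & 1),
        "vsrc": [(dword1 >> (8 * i)) & 0xFF for i in range(4)],
    }


fields = decode_exp(0xF8001C0F, 0x00000000)
print(fields)
```

For our shader this gives target 0 (mrt0), all four channels enabled, all three flags set and every source pointing at v0, which matches the disassembly text.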

Finally, the last line of the AMD ISA is "s_endpgm", which is obviously the instruction to end the program, but to be consistent here is the description from the documentation:

S_ENDPGM
End of program; terminate wavefront.

This is telling the GPU to end this program and terminate the wavefront. Pretty straightforward.

So you can see that the ISA is just a lower-level version of the DirectX ASM. It is doing the same things but is a little more explicit about how, and we are beginning to see the quirks of the GPU come through.

AMD IL

This will be covered in the next post!

Updated Shader Analyser With AMD IL support

After receiving some good feedback from reddit I have added the ability to compare the AMD IL that is generated by the driver. Here is the updated download.

Showing the comparison between the generated DirectX ASM and AMD IL for PS_5_0

This is done by calling the AMD program CodeXLAnalyser.exe to generate it. This .exe is in the same directory as "Shader Analyzer.exe", so if AMD updates it in the future you can just swap out the .exe. I have also added a control on the dialog to specify any AMD IL command line arguments, which can be found at the AMD developer site here. However, the information there is either incomplete or out of date, so I have included a screenshot of the full output of the help command below.