DirectCompute tutorial for Unity 2: Kernels and thread groups | Cheney Shen

Today I will be going over the core concepts for writing compute shaders in Unity.

 

At the heart of a compute shader is the kernel. This is the entry point into the shader and acts like the Main function in other programming languages. I will also cover the tiling of threads by the GPU. These tiles are also known as blocks or thread groups; DirectCompute officially refers to them as thread groups.

【The basic unit of computation is the kernel.】

 

To create a compute shader in Unity, go to the Project panel, click Create->Compute Shader, and then double-click the shader to open it in MonoDevelop for editing. Paste the following code into the newly created compute shader.

【Creating a kernel in Unity】

[Image: t001 - 01]

 

This is the bare minimum of content for a compute shader. It does nothing, of course, but it serves as a good starting point. A compute shader has to be run from a script in Unity, so we will need one of those as well. Go to the Project panel and click Create->C# Script. Name it KernelExample and paste in the following code.

【C# code that runs the kernel】

[Image: t001 - 02]

 

Now drag the script onto any game object and then attach the compute shader to the script’s shader attribute. The shader will now run in the Start function when the scene is run. Before you run the scene, however, you need to enable DX11 in Unity. Go to Edit->Project Settings->Player and tick the “Use Direct3D 11” box. You can now run the scene. The shader will do nothing, but there should also be no errors.

【How to use it in a Unity scene】

[Image: t001 - 03]

 

In the script you will see the “Dispatch” function being called. This is responsible for running the shader. Notice that the first argument is 0. This is the id of the kernel that you want to run. In the shader you will see the line “#pragma kernel CSMain1”. This declares which function in the shader is the kernel, as you may have many functions (and even many kernels) in one shader. There must be a function with the name CSMain1 in the shader or the shader will not compile.

【The Dispatch function is what runs the compute shader. One compute shader can contain many kernels, and a kernel can be selected by its name.】

 

Now notice the “[numthreads(4,1,1)]” line. This tells the GPU how many threads of the kernel to run per group. The 3 numbers relate to each dimension. A thread group can have up to 3 dimensions, and in this example we are running a 1-dimensional group with a width of 4 threads. Since the script dispatches a single group, that means we are running a total of 4 threads, and each thread runs a copy of the kernel. This is why GPUs are so fast: they can run thousands of threads at a time.

【numthreads describes how the logical X*Y*Z layout maps onto threads; the product of the three values is the number of threads in one thread group executing this kernel.】

 

Now let’s get the kernel to actually do something. Change the shader to this…

[Image: t001 - 04]

 

and the script’s Start function to this…

[Image: t001 - 05]

 

Now run the scene and you should see the numbers 0, 1, 2 and 3 printed out.

 

Don’t worry too much about the buffer for now. I will cover buffers in detail in the future; for now just know that a buffer is a place to store data and that its Release function needs to be called when you are finished with it.

【Ignore the buffer for now; it is covered later.】

 

Notice the argument added to the CSMain1 function: “int3 threadID : SV_GroupThreadID”. This is a request to the GPU to pass the thread id into the kernel when it is run. We then write the thread id into the buffer, and since we have told the GPU we are running 4 threads, the id ranges from 0 to 3, as we see from the printout.

【SV_GroupThreadID is the GPU’s identifier for a thread within its group; the result of this compute shader is simply that identifier written out.】
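
The kernel itself needs Unity and a GPU, but the way thread ids are handed out can be mimicked on the CPU. The following is a minimal Python sketch, not Unity code: run_group is a hypothetical helper that loops over the threads of one group, and a plain list stands in for the buffer.

```python
def run_group(num_threads_x):
    """Mimic one thread group of a 1D kernel declared [numthreads(num_threads_x,1,1)]."""
    buffer = [None] * num_threads_x  # stand-in for the compute buffer
    # On the GPU these iterations run in parallel; here they run one after another.
    for thread_id_x in range(num_threads_x):
        # Equivalent of writing SV_GroupThreadID.x into the buffer.
        buffer[thread_id_x] = thread_id_x
    return buffer


print(run_group(4))
```

With 4 threads the list holds 0, 1, 2 and 3, which matches what the scene prints.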

 

Now those 4 threads make up what’s called a thread group. In this case we are running 1 group of 4 threads, but you can run multiple groups of threads. Let’s run 2 groups instead of 1. Change the shader’s kernel to this…

【More than one group of GPU threads can be used to run the kernel.】

[Image: t001 - 06]

 

and the script’s Start function to this…

[Image: t001 - 07]

 

Now run the scene and you should have 0-3 printed out twice.

 

Now notice the change to the Dispatch call. The last three arguments (the 2,1,1) are the number of groups we want to run. Just like the number of threads, groups can go up to 3 dimensions, and in this case we are running 1 dimension of 2 groups. We have also had to change the kernel by adding the argument “int3 groupID : SV_GroupID”. This is a request to the GPU to pass in the group id when the kernel is run. We need this because we are now writing out 8 values: 2 groups of 4 threads. We need each thread’s position in the buffer, and the formula for this is the thread id plus the group id times the number of threads per group ( threadID.x + groupID.x*4 ).

【The above runs the kernel as two groups laid out along one dimension.】
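
To check the index formula by hand, here is a small Python sketch of the same arithmetic. It is a CPU stand-in under the assumption that each thread writes its group thread id to position threadID.x + groupID.x*4; dispatch_1d is a made-up name, not a Unity function.

```python
def dispatch_1d(num_groups_x, num_threads_x):
    """Mimic Dispatch(kernel, num_groups_x, 1, 1) for a [numthreads(num_threads_x,1,1)] kernel."""
    buffer = [None] * (num_groups_x * num_threads_x)
    for group_id_x in range(num_groups_x):        # plays the role of SV_GroupID.x
        for thread_id_x in range(num_threads_x):  # plays the role of SV_GroupThreadID.x
            # Position in the buffer: thread id plus group id times threads per group.
            index = thread_id_x + group_id_x * num_threads_x
            buffer[index] = thread_id_x
    return buffer


print(dispatch_1d(2, 4))
```

Two groups of four threads fill all 8 slots, and the values read 0 to 3 twice, as in the scene.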

 

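This is a bit awkward to write. Surely the GPU knows the thread’s position? Yes, it does. Change the shader’s kernel to this and rerun the scene.

【Written this way, however, the number of groups used by the caller is not transparent to the kernel, which is not ideal. The improvement is as follows.】

[Image: t001 - 08]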

 

The results should be the same: two sets of 0-3 printed. Notice that the group id argument has been replaced with “int3 dispatchID : SV_DispatchThreadID”. This is the same number our formula gave us, except now the GPU is computing it for us. It is the thread’s position across all the groups of threads.

【The result is the same, but the number of groups is now transparent to the shader.】
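
The relationship between the three ids can be tabulated with a short Python sketch. It relies on the definition that the dispatch thread id equals the group id times the threads per group plus the group thread id; the function name is illustrative only.

```python
def dispatch_thread_ids(num_groups_x, num_threads_x):
    """List (group id, group thread id, dispatch thread id) for every thread of a 1D dispatch."""
    ids = []
    for group_id in range(num_groups_x):
        for group_thread_id in range(num_threads_x):
            # The value the GPU supplies through SV_DispatchThreadID.x.
            dispatch_id = group_id * num_threads_x + group_thread_id
            ids.append((group_id, group_thread_id, dispatch_id))
    return ids


for group_id, group_thread_id, dispatch_id in dispatch_thread_ids(2, 4):
    print(group_id, group_thread_id, dispatch_id)
```

The last column counts straight through from 0 to 7, so it can be used directly as the buffer index.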

 

So far these have all been in 1 dimension. Let’s step things up a bit and move to 2 dimensions, and instead of rewriting the kernel let’s just add another one to the shader. It’s not uncommon for a shader to have a kernel for each dimension, all performing the same algorithm. First add this code to the shader below the previous code so there are two kernels in the shader.

【Next step: use a 2D kernel instead of running the 1D kernel twice.】

[Image: t001 - 09]

 

and change the script to this…

[Image: t001 - 10]

 

Run the scene and you will see a row printed from 0 to 7, the next row from 8 to 15, and so on up to 63.

[Image: t001 - 11]

 

Why from 0 to 63? Well, we now have 4 2D groups of threads, and each group is 4 by 4 and so has 16 threads. That gives us 64 threads in total.
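
The thread count is just two products multiplied together. A quick Python check of that arithmetic, with total_threads as a hypothetical helper name:

```python
from math import prod


def total_threads(num_threads, num_groups):
    """Threads per group (x*y*z of numthreads) times number of groups (x*y*z of Dispatch)."""
    return prod(num_threads) * prod(num_groups)


# [numthreads(4,4,1)] with Dispatch(kernel, 2, 2, 1)
print(total_threads((4, 4, 1), (2, 2, 1)))
```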

 

Notice the value we are outputting from this line: “int id = dispatchID.x + dispatchID.y * 8”. The dispatch id is the thread’s position across all the groups of threads, for each dimension. We now have 2 dimensions, so we need the thread’s global position in the buffer, and this is just the dispatch x id plus the dispatch y id times the total number of threads in the first dimension (4 * 2). This is a concept you will have to be familiar with when working with compute shaders. The reason is that buffers are always 1-dimensional, so when working in higher dimensions you need to calculate which index in the buffer the result should be written to.

【Thread breakdown: [numthreads(4,4,1)] means 16 threads per group, and computershader.Dispatch(kernel, 2, 2, 1); launches 4 thread groups in 2 dimensions, so there are 64 threads in total.】
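
The 2D case can be walked through on the CPU as well. This Python sketch assumes the kernel writes id into the buffer at position id, as the printed rows suggest; dispatch_2d is a made-up helper and the nested loops stand in for threads that the GPU would run in parallel.

```python
def dispatch_2d(num_groups, num_threads):
    """Mimic a 2D dispatch and return the filled buffer plus the row width."""
    groups_x, groups_y = num_groups
    threads_x, threads_y = num_threads
    width = groups_x * threads_x    # total threads along x (the 8 in the formula)
    height = groups_y * threads_y   # total threads along y
    buffer = [None] * (width * height)
    for gy in range(groups_y):
        for gx in range(groups_x):
            for ty in range(threads_y):
                for tx in range(threads_x):
                    dx = gx * threads_x + tx   # SV_DispatchThreadID.x
                    dy = gy * threads_y + ty   # SV_DispatchThreadID.y
                    index = dx + dy * width    # flatten 2D position to 1D
                    buffer[index] = index
    return buffer, width


buffer, width = dispatch_2d((2, 2), (4, 4))
for start in range(0, len(buffer), width):
    print(buffer[start:start + width])
```

Each printed row is 8 values wide because a row spans both groups along x.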

 

The same theory applies when working with 3 dimensions, but as it gets fiddly I will only demonstrate up to 2 dimensions. You just need to know that in 3 dimensions the buffer position is calculated as “int id = dispatchID.x + dispatchID.y * groupSizeX + dispatchID.z * groupSizeX * groupSizeY”, where each group size is the number of groups times the number of threads per group for that dimension (that is, the total number of threads along that dimension).

【The same concept extends to 3 dimensions.】
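
As a sanity check on the 3D formula, the Python sketch below flattens every position of a small grid and confirms that no two positions collide. The sizes (8 by 8 by 2) are arbitrary values chosen for illustration.

```python
def flat_index(x, y, z, size_x, size_y):
    """Flatten a 3D dispatch id; size_x and size_y are total threads along x and y."""
    return x + y * size_x + z * size_x * size_y


size_x, size_y, size_z = 8, 8, 2   # illustrative sizes, not from the tutorial
indices = [
    flat_index(x, y, z, size_x, size_y)
    for z in range(size_z)
    for y in range(size_y)
    for x in range(size_x)
]
# Every position maps to its own slot, in order, with no gaps.
print(indices == list(range(size_x * size_y * size_z)))
```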

 

You should also have an understanding of how the semantics work. Take for example this kernel argument…

【You also need to understand the semantics involved.】

 

int3 dispatchID : SV_DispatchThreadID

 

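SV_DispatchThreadID is the semantic, and it tells the GPU what value it should pass in for this argument. The name of the argument does not matter; you can call it what you want. For example, the same argument declared under a different name works exactly as above.

【The first part is the variable name, which can be changed freely without changing the behaviour. The second part is the semantic, which tells the GPU what the value means.】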

 

The variable type can also be changed. For example…

Notice that the int3 has been changed to an int. This is fine if you are only working with 1 dimension. You could also use an int2 for 2 dimensions, and you could use an unsigned int (uint) instead of an int if you choose.

【int3 can represent three dimensions; using a plain int is fine in the 1-dimensional case.】

 

Since we now have two kernels in the shader, we also need to tell the GPU which kernel we want to run when we make the Dispatch call. Each kernel is given an id in the order they appear. Our first kernel would be id 0 and the next is id 1. When the number of kernels in a shader becomes larger this can become a bit confusing, and it’s easy to set the wrong id. We can solve this by asking the shader for the kernel’s id by name. The line “int kernel = shader.FindKernel("CSMain2");” gets the id of the kernel “CSMain2”. We then use this id when setting the buffer and making the Dispatch call.

【When a compute shader has several kernels, the Dispatch method needs an id to identify which kernel to use; FindKernel can be used to look it up.】
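
To make the id ordering concrete, here is a toy Python lookup. It is only an illustration of the numbering described above, assuming ids follow the order of the “#pragma kernel” lines; it is not how Unity implements FindKernel, and find_kernel and SOURCE are invented for this sketch.

```python
SOURCE = """
#pragma kernel CSMain1
#pragma kernel CSMain2
"""


def find_kernel(shader_source, name):
    """Return the position of `name` among the #pragma kernel declarations."""
    kernels = []
    for line in shader_source.splitlines():
        parts = line.split()
        if len(parts) >= 3 and parts[0] == "#pragma" and parts[1] == "kernel":
            kernels.append(parts[2])
    return kernels.index(name)  # raises ValueError if the kernel is not declared


print(find_kernel(SOURCE, "CSMain1"), find_kernel(SOURCE, "CSMain2"))
```

Looking the id up by name means the script keeps working if kernels are later reordered in the shader.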

 

About now you may be thinking that this concept of groups of threads is a bit confusing. Why can’t I just use one group of threads? Well, you can, but there is a reason that threads are arranged into groups by the GPU. For a start, a thread group is limited in the number of threads it can have (defined by the line “[numthreads(x,y,z)]” in the shader). This limit is currently 1024 threads per group (x*y*z), but it may change with new hardware. For example, you can have a maximum of “numthreads(1024,1,1)” for 1D, “numthreads(32,32,1)” for 2D, and so on. You can, however, dispatch a very large number of groups (Direct3D 11 allows up to 65535 per dimension), and as you will often be processing data with millions of elements, the concept of thread groups is essential. Threads in a group can also share memory, and this can be used to make dramatic performance gains for certain algorithms, but I will cover that in a future post.

【You could always use just a single thread group, but threads within a group share memory, which can improve performance significantly. Also note that a group cannot exceed 1024 threads in total.】
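
In practice this means choosing a group size within the limit and then working out how many groups cover the data. The Python sketch below shows that arithmetic; the helper names and the one-million-element figure are assumptions made for the example.

```python
MAX_THREADS_PER_GROUP = 1024  # the limit quoted above


def threads_per_group(x, y=1, z=1):
    """Return x*y*z, rejecting sizes that exceed the per-group limit."""
    count = x * y * z
    if count > MAX_THREADS_PER_GROUP:
        raise ValueError(f"{count} threads exceeds the {MAX_THREADS_PER_GROUP} limit")
    return count


def groups_needed(num_elements, group_size):
    """Round up so that every element is covered by some thread."""
    return (num_elements + group_size - 1) // group_size


group_size = threads_per_group(1024)
print(groups_needed(1_000_000, group_size))
```

Because the count is rounded up, the last group usually contains threads with no element to process, so a kernel written this way would need to check its index against the element count before writing.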

 

Well, I think that about covers kernels and thread groups. There is just one more thing I want to cover: how to pass uniforms into your shader. This works the same as in Cg shaders, but there is no uniform keyword. For the most part this is relatively simple, but there are a few gotchas, so I will briefly go over it.

【One last point: how to pass global values (uniforms) to the shader.】

 

For example, if you want to pass in a float you need this line in the shader…

【Examples with numeric values】

and this line in your script…

To set a vector you need this in the shader…

and this in the script…

You can only pass in a Vector4 from the script but your uniform can be a float, float2, float3 or float4. It will be filled with the appropriate values.

 

Now here’s where it gets tricky. You can pass in arrays of values. Note that this first example won’t work; I will explain why. You need this line in your shader…

【Examples with arrays】

and this in your script…

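Now this won’t work. Whether this is by design or a bug in Unity I don’t know, although a likely cause is that HLSL constant buffers start every array element on a new 16-byte boundary, so an array of single floats is not tightly packed. You need to use vectors as uniforms for this to work. In your shader…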

and your script…

So here we have an array of two float4s, and it is set from an array of 8 floats in the script. The same principles apply when setting matrices. In your shader…

and your script…

And of course you can have arrays of matrices. In your shader…

and your script…

This same logic does not seem to apply to float2x2 or float3x3. Again, whether this is a bug or by design, I don’t know.
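
For the working array case, where an array of two float4s is set from 8 floats, the mapping can be pictured with a short Python sketch. It assumes the flat values are consumed four at a time in order; pack_float4_array is an invented name and the values are placeholders.

```python
def pack_float4_array(values):
    """Group a flat list of floats into float4-sized tuples, in order."""
    if len(values) % 4 != 0:
        raise ValueError("a float4 array needs a multiple of 4 floats")
    return [tuple(values[i:i + 4]) for i in range(0, len(values), 4)]


packed = pack_float4_array([0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0])
print(packed)
```

Under the same assumption, 16 floats would fill the four rows of a single float4x4.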

