
DirectCompute tutorial for Unity 3: Textures

The focus of today’s tutorial is on textures. These are arguably the most important feature when using DirectCompute. It is highly likely that every shader you write will use at least one texture. Unfortunately render textures in Unity are Pro only so this tutorial covers Pro only topics. The good news is that this will be the only Pro only tutorial. Textures in DirectCompute are very simple to use but there are a few traps you can fall into. Let’s start with something simple. Create a compute shader and paste in the following code.

(Textures are simple to use, but note that only the Unity Pro version supports this feature.)





Then create a C# script called TextureExample and paste in the following code.



Attach the script, bind the shader and run the scene. You should see a texture with the uv’s displayed as colors. This is what we have written into the texture with the compute shader.



Now notice the kernel argument “uint2 id : SV_DispatchThreadID” from the last tutorial. This is the thread’s position in the groups of threads and, just like with buffers, we need it to know what location in the texture to write the result into, like this: “tex[id] = result”. This time however we don’t need the “flattened” index like with buffers. We can just use the uint2 directly. This is because, unlike buffers, textures can be multidimensional. We have a 2D thread group and a 2D texture.



Now look at the declaration of the texture variable “RWTexture2D<float4> tex;”. The “RWTexture2D” part is important. This is obviously a texture but what’s with the RW? This declares that the texture is of the type “unordered access view”. This just means that you can write to any location in the texture from the shader. You are able to write to the texture but you cannot read from it. Remove the RW part and it’s just a normal texture, which you can read from but can’t write to. Just remember that you can only read from a Texture2D and only write to a RWTexture2D.
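As a rough sketch of the kind of shader described above (not the original listing), something like the following would write the uv’s into the texture. The declaration, the kernel argument, the thread group size and the “tex[id] = result” line come from the text; the kernel name CSMain and the uniform names w and h are my assumptions.

```hlsl
#pragma kernel CSMain

// Width and height of the texture, set from the script.
// The names w and h are assumed for this sketch.
float w, h;

// The RW prefix makes this an unordered access view, so the kernel can write to it.
RWTexture2D<float4> tex;

[numthreads(8,8,1)]
void CSMain(uint2 id : SV_DispatchThreadID)
{
    // Normalize the thread position to the 0 to 1 range to get the uv's.
    float2 uv = float2(id.x / w, id.y / h);

    // Store the uv's as a color at this thread's pixel.
    float4 result = float4(uv, 0, 1);
    tex[id] = result;
}
```

The uint2 dispatch id is used directly as the texture index, so no flattened index is needed.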



Now let’s look at the script. Notice how the texture is created.


There are two important things here. The first is “enableRandomWrite”. You must set this to true if you want to write into the texture. This basically says that the texture can have unordered access views. If you don’t do this nothing will happen when you run the shader and Unity won’t give you an error. It will just fail for no apparent reason. The second is the “Create” function call. You must call Create on the texture before you write into it. Again, if you don’t, nothing will happen and you won’t get an error. It just won’t work. If you are used to writing into a texture with Graphics.Blit then you may notice you don’t have to call Create. This is because Graphics.Blit checks if the texture is created and creates it if it’s not. The Dispatch function can’t do this because it does not know what textures are being written into when it is called.



Also note that the texture is released in the “OnDestroy” function. When you are finished with your render textures make sure you release them.



Now let’s look at the dispatch call.


Remember from the last tutorial that this is where we set how many groups to run. So why is the number of groups the texture size divided by 8? Look at the shader. You can see from the “[numthreads(8,8,1)]” line that each group runs 8 threads in the x dimension and 8 in the y dimension. Now we need a thread for each pixel in the texture. We have a texture 64 pixels wide, and if we divide the pixels by the number of threads per group we get the number of groups we need. We end up with 8 groups of 8 threads, which is 64 threads in total in the x dimension, and it’s the same for the y dimension. This gives us a total of 4096 threads running (64 * 64), which is the number of pixels in the texture.



Now have a look at this part of the script.


This is where we are setting the uniforms for the shader. We have to set the texture we are writing into and we also need the texture’s width and height so we can calculate the uv’s from the dispatch id. It’s common to pass variables into a shader like this but in this case we don’t need to pass the width and height. We can get them from within the shader. Change your shader to this…




and remove these two lines from your script…




Run the scene. It should look the same. Notice the “tex.GetDimensions(w, h);” line. Textures are objects. This means they have functions you can call on them.
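A minimal sketch of what the changed shader might look like, assuming the same kernel name (CSMain) as an illustration. Only the “tex.GetDimensions(w, h);” call is quoted from the text; the rest is my reconstruction of the idea.

```hlsl
#pragma kernel CSMain

RWTexture2D<float4> tex;

[numthreads(8,8,1)]
void CSMain(uint2 id : SV_DispatchThreadID)
{
    // Ask the texture for its own size instead of passing it in as uniforms.
    float w, h;
    tex.GetDimensions(w, h);

    float2 uv = float2(id.x / w, id.y / h);
    tex[id] = float4(uv, 0, 1);
}
```

With this version the width and height no longer need to be set from the script.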



In this case we are asking for the texture’s dimensions. Textures have a number of functions with a number of overloads you can call. I will go through the most common and how to use them but first we need to change the scene a bit. What we want to do now is copy the contents of our texture into another texture and then display that result.




And change the script to this…



Now bind the new shader you have created to the “shaderCopy” attribute and run the scene. The scene should look the same as before. What we are doing here is filling the first texture with its uv’s as a color like before and then copying the contents to another texture. This is so I can demonstrate the many ways to sample from a texture in a shader. Notice this line in the copy shader: “float4 t = tex[id];”. This is the simplest way to sample from a texture. Just as the dispatch id is the location you want to write to, it is also the location you want to read from. You can see here that we can sample from a texture just as if it were an array.
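A copy shader along these lines could look like the sketch below. The “float4 t = tex[id];” line is from the text; the kernel name CSMain and the destination name texCopy are assumptions made for illustration.

```hlsl
#pragma kernel CSMain

// Source texture: no RW prefix, so it can only be read.
Texture2D<float4> tex;

// Destination texture (name assumed): RW prefix, so it can be written.
RWTexture2D<float4> texCopy;

[numthreads(8,8,1)]
void CSMain(uint2 id : SV_DispatchThreadID)
{
    // Read the source pixel at this thread's position, like indexing an array.
    float4 t = tex[id];

    // Write it to the same position in the destination.
    texCopy[id] = t;
}
```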



There are other ways of doing this. For example…


Here we are accessing the mipmaps of the texture instead. Level 0 is the first mipmap, the one with the same size as the texture. The next index into the mipmap array is the location to sample at, using the dispatch id. Just note that we have not enabled mipmaps on the texture, so sampling levels other than 0 will not give a useful result.



We can do the same with the texture’s Load function.



In this case the uint3’s x and y values are the location to sample at and the z value (the 0) is the mipmap level. Both these examples work the same.
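The two alternatives just described, the mips array and the Load function, might be written as in this sketch. The kernel name and the destination texture texCopy are assumed; mips and Load are standard members of Texture2D in HLSL.

```hlsl
#pragma kernel CSMain

Texture2D<float4> tex;
RWTexture2D<float4> texCopy; // destination name assumed

[numthreads(8,8,1)]
void CSMain(uint2 id : SV_DispatchThreadID)
{
    // Variant 1: index the mipmap array. First index is the mip level,
    // second index is the pixel location.
    float4 a = tex.mips[0][id];

    // Variant 2: Load takes the location in x and y and the mip level in z.
    float4 b = tex.Load(uint3(id, 0));

    // Both read the same texel; either value could be written out.
    // Variant 1 is kept here and variant 2 is shown only for comparison.
    texCopy[id] = a;
}
```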


There is one feature of textures that you will be using a lot: the texture’s ability to filter and wrap. To do this you need to use a sampler state. If you are used to using HLSL outside of Unity you may know that you need to create a sampler object to use. This works a bit differently in Unity. Basically you have two choices for filtering (Linear or Point) and two for wrap (Clamp or Repeat). All you need to do is declare a sampler state in the shader with the word Linear or Point in the name and the word Repeat or Clamp in the name. For example you could use the name myLinearClamp or aPointRepeat, etc. I prefer an underscore then the name. Change your shader to this.

(To use a texture’s filter and wrap features you need a sampler state.)



If you run the scene it should still look the same. Notice this line: “float4 t = tex.SampleLevel(_LinearClamp, uv, 0);”. Here we are using the texture’s SampleLevel function. The function takes a sampler state, the uv’s and a mipmap level. The uv’s need to be normalized, that is, have a range of 0 to 1. Notice the SamplerState variables at the top. If you are using sampler states you most likely want bilinear filtering. If that’s the case use the _LinearClamp or _LinearRepeat sampler state.
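Here is a sketch of a copy shader using a sampler state, under a few assumptions: the kernel name CSMain and the destination name texCopy are mine, and so is the half-texel offset, which samples at texel centers so the filtered copy matches the source. The SampleLevel line and the sampler names are from the text.

```hlsl
#pragma kernel CSMain

// Unity picks the filter and wrap mode from the words in the name.
SamplerState _LinearClamp;
SamplerState _LinearRepeat;

Texture2D<float4> tex;
RWTexture2D<float4> texCopy; // destination name assumed

[numthreads(8,8,1)]
void CSMain(uint2 id : SV_DispatchThreadID)
{
    float w, h;
    texCopy.GetDimensions(w, h);

    // Normalized uv's in the 0 to 1 range. The 0.5 offset (my choice)
    // places the sample at the center of the texel.
    float2 uv = (id + 0.5) / float2(w, h);

    // Explicit mipmap level 0 as the last argument.
    float4 t = tex.SampleLevel(_LinearClamp, uv, 0);

    texCopy[id] = t;
}
```

Swapping _LinearClamp for _LinearRepeat in the SampleLevel call changes only how uv’s outside 0 to 1 are handled.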




If you are used to using HLSL in a fragment shader (as opposed to a compute shader here) you may know that you can use this function for filtering.



Notice it’s called Sample, not SampleLevel, and the mipmap parameter is gone. If you try to use this in a compute shader you will get an error because this function does not exist there. The reason why is surprisingly complicated and gives an insight into how the GPU works. Behind the scenes fragment shaders (or any shaders) work in much the same way as a compute shader, as they share the same GPU architecture. They run in threads and the threads are arranged into thread groups. Now remember that threads in a group can share memory. Fragment shaders always run in a group of at least 2 by 2 threads. When you sample from a texture the fragment shader checks what its neighbors’ uv’s are. From this it can work out what the derivatives of the uv’s are. The derivatives are just a rate of change; in areas where there is a high rate of change the texture’s higher mipmaps are used, and in areas with a low rate of change the lower mipmaps are used. This is how the GPU reduces aliasing problems and it also has the handy byproduct of reducing memory bandwidth (the higher mipmaps are smaller). A compute shader thread has no such neighboring uv’s to compare against, so no derivatives are available and you have to state the mipmap level yourself with SampleLevel.



All the examples I have given are in 2D but the same principles apply in 3D. You just need to create the texture a little differently.



This creates a 3D texture that is 64*64*64 pixels. Notice that “volumeDepth” is set to 64 and “isVolume” is set to true. Remember to set the number of groups and threads to the correct values. The dispatch id also needs to be a uint3 or int3.
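On the shader side, a 3D version might look like the following sketch. The kernel name and the choice of 8 threads per dimension are assumptions; with a 64*64*64 texture that would mean dispatching 8 groups in each dimension.

```hlsl
#pragma kernel CSMain

RWTexture3D<float4> tex;

// 8 * 8 * 8 = 512 threads per group (assumed size for this sketch).
[numthreads(8,8,8)]
void CSMain(uint3 id : SV_DispatchThreadID)
{
    float w, h, d;
    tex.GetDimensions(w, h, d);

    // Normalized 3D coordinate of this thread's voxel.
    float3 uvw = float3(id.x / w, id.y / h, id.z / d);

    // The uint3 dispatch id indexes the 3D texture directly.
    tex[id] = float4(uvw, 1);
}
```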

(Create a 3D texture and run a 3D shader.)




That about covers it for textures. Next time I will look at buffers: how to use the default buffer and the other buffer types, how to draw data from your buffers and how to set and get your buffer data.


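(Translator’s note, unrelated to the translated text above: the shader’s texture parameter is actually defined in terms of Texture, the parent class of RenderTexture, Texture2D and Texture3D. Any texture that is a subclass of Texture can therefore be used directly in the shader. This is important: you can use an imported texture to do calculations in a compute shader.)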

DirectCompute tutorial for Unity 2: Kernels and thread groups

Today I will be going over the core concepts for writing compute shaders in Unity.


At the heart of a compute shader is the kernel. This is the entry point into the shader and acts like the Main function in other programming languages. I will also cover the tiling of threads by the GPU. These tiles are also known as blocks or thread groups. DirectCompute officially refers to these tiles as thread groups.

[The basic unit of computation is the kernel.]


To create a compute shader in Unity simply go to the project panel, click Create->Compute Shader and then double click the shader to open it in MonoDevelop for editing. Paste the following code into the newly created compute shader.

[Creating a kernel in Unity.]



This is the bare minimum of content for a compute shader and will of course do nothing but will serve as a good starting point. A compute shader has to be run from a script in Unity so we will need one of those as well. Go to the project panel and click Create->C# script. Name it KernelExample and paste in the following code.

[C# code to run the kernel.]



Now drag the script onto any game object and then attach the compute shader to the shader attribute. The shader will now run in the start function when the scene is run. Before you run the scene however you need to enable dx11 in Unity. Go to Edit->Project Settings->Player and then tick the “Use Direct3D 11” box. You can now run the scene. The shader will do nothing but there should also be no errors.




In the script you will see the “Dispatch” function called. This is responsible for running the shader. Notice the first argument is a 0. This is the id of the kernel that you want to run. In the shader you will see “#pragma kernel CSMain1”. This defines which function in the shader is the kernel, as you may have many functions (and even many kernels) in one shader. There must be a function with the name CSMain1 in the shader or the shader will not compile.



Now notice the “[numthreads(4,1,1)]” line. This tells the GPU how many threads of the kernel to run per group. The 3 numbers relate to each dimension. A thread group can have up to 3 dimensions and in this example we are just running a 1 dimensional group with a width of 4 threads. That means we are running a total of 4 threads and each thread will run a copy of the kernel. This is why GPUs are so fast. They can run thousands of threads at a time.
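Putting the “#pragma kernel CSMain1” and “[numthreads(4,1,1)]” lines together, a bare minimum compute shader would look roughly like this sketch (reconstructed from the description, so treat the exact layout as an assumption):

```hlsl
// Declares which function is the kernel (the entry point).
#pragma kernel CSMain1

// Run 4 threads per group, arranged in 1 dimension.
[numthreads(4,1,1)]
void CSMain1()
{
    // Intentionally empty: the kernel runs but does nothing.
}
```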



Now let’s get the kernel to actually do something. Change the shader to this…



and the script’s Start function to this…



Now run the scene and you should see the numbers 0, 1, 2 and 3 printed out.


Don’t worry too much about the buffer for now. I will cover them in detail in the future but just know that a buffer is a place to store data and it needs to have the release function called when you are finished with it.



Notice this argument added to the CSMain1 function: “int3 threadID : SV_GroupThreadID”. This is a request to the GPU to pass the thread id into the kernel when it is run. We are then writing the thread id into the buffer, and since we have told the GPU we are running 4 threads the id ranges from 0 to 3, as we see from the print out.
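A kernel that behaves as described could be sketched like this. The argument and the thread count are from the text; the buffer name buffer1 and its element type are assumptions.

```hlsl
#pragma kernel CSMain1

// Buffer the kernel writes into (name and int element type assumed).
RWStructuredBuffer<int> buffer1;

[numthreads(4,1,1)]
void CSMain1(int3 threadID : SV_GroupThreadID)
{
    // Each of the 4 threads writes its own id (0 to 3) into its own slot.
    buffer1[threadID.x] = threadID.x;
}
```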



Now those 4 threads make up what’s called a thread group. In this case we are running 1 group of 4 threads but you can run multiple groups of threads. Let’s run 2 groups instead of 1. Change the shader’s kernel to this…




and the script’s Start function to this…



Now run the scene and you should have 0-3 printed out twice.


Now notice the change to the Dispatch function. The last three arguments (the 2,1,1) are the number of groups we want to run. Just like the number of threads, groups can go up to 3 dimensions, and in this case we are running 1 dimension of 2 groups. We have also had to change the kernel by adding the argument “int3 groupID : SV_GroupID”. This is a request to the GPU to pass in the group id when the kernel is run. The reason we need this is because we are now writing out 8 values, 2 groups of 4 threads. We now need the thread’s position in the buffer, and the formula for this is the thread id plus the group id times the number of threads per group ( threadID.x + groupID.x*4 ).
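The formula above can be sketched as a kernel like this one. The two arguments and the formula are from the text; the buffer name buffer1 is assumed, and the buffer would need room for 8 values.

```hlsl
#pragma kernel CSMain1

RWStructuredBuffer<int> buffer1; // name assumed, must hold 8 ints

[numthreads(4,1,1)]
void CSMain1(int3 threadID : SV_GroupThreadID, int3 groupID : SV_GroupID)
{
    // Position in the buffer: thread id plus group id times threads per group.
    // Group 0 fills slots 0 to 3, group 1 fills slots 4 to 7.
    buffer1[threadID.x + groupID.x*4] = threadID.x;
}
```

Since the value written is still the thread id within the group, the buffer ends up holding 0 to 3 twice.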



This is a bit awkward to write. Surely the GPU knows the thread’s position? Yes it does. Change the shader’s kernel to this and rerun the scene.




The results should be the same, two sets of 0-3 printed. Notice that the group id argument has been replaced with “int3 dispatchID : SV_DispatchThreadID”. This is the same number our formula gave us except now the GPU is doing it for us. This is the thread’s position in the groups of threads.
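With the dispatch id, the kernel might be sketched as follows (buffer name assumed as before):

```hlsl
#pragma kernel CSMain1

RWStructuredBuffer<int> buffer1; // name assumed, must hold 8 ints

[numthreads(4,1,1)]
void CSMain1(int3 threadID : SV_GroupThreadID, int3 dispatchID : SV_DispatchThreadID)
{
    // dispatchID.x is already threadID.x + groupID.x * 4,
    // so it can be used as the buffer position directly.
    buffer1[dispatchID.x] = threadID.x;
}
```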



So far these have all been in 1 dimension. Let’s step things up a bit and move to 2 dimensions, and instead of rewriting the kernel let’s just add another one to the shader. It’s not uncommon to have a kernel for each dimension in a shader performing the same algorithm. First add this code to the shader below the previous code so there are two kernels in the shader.

[Next step: use a 2D kernel instead of running the 1D kernel twice.]



and the script to this…



Run the scene and you will see a row printed from 0 to 7 and the next row 8 to 15 and so on to 63.



Why from 0 to 63? Well we now have 4 2D groups of threads and each group is 4 by 4 so has 16 threads. That gives us 64 threads in total.


Notice what value we are outputting from this line: “int id = dispatchID.x + dispatchID.y * 8”. The dispatch id is the thread’s position in the groups of threads for each dimension. We now have 2 dimensions so we need the thread’s global position in the buffer, and this is just the dispatch x id plus the dispatch y id times the total number of threads in the first dimension (4 * 2). This is a concept you will have to be familiar with when working with compute shaders. The reason is that buffers are always 1 dimensional, and when working in higher dimensions you need to calculate what index in the buffer the result should be written to.
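A 2D kernel matching this description could be sketched as below. The kernel name CSMain2 and the index line are from the text; the buffer name buffer2 is an assumption, and the sketch assumes a dispatch of 2 by 2 groups and a buffer of 64 ints.

```hlsl
#pragma kernel CSMain2

RWStructuredBuffer<int> buffer2; // name assumed, must hold 64 ints

// 4 by 4 threads per group, 16 threads in total.
[numthreads(4,4,1)]
void CSMain2(int3 dispatchID : SV_DispatchThreadID)
{
    // Flatten the 2D position. 8 is the total number of threads in x
    // (4 threads per group * 2 groups).
    int id = dispatchID.x + dispatchID.y * 8;

    buffer2[id] = id;
}
```

Each thread writes its own flattened index, which is why the values run from 0 to 63.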

[Thread explanation: [numthreads(4,4,1)] means 16 threads per group, and computershader.Dispatch(kernel, 2, 2, 1); launches 4 thread groups in 2 dimensions, so there are 64 threads in total.]


The same theory applies when working with 3 dimensions but as it gets fiddly I will only demonstrate up to 2 dimensions. You just need to know that in 3 dimensions the buffer position is calculated as “int id = dispatchID.x + dispatchID.y * groupSizeX + dispatchID.z * groupSizeX * groupSizeY” where group size is the number of groups times number of threads for that dimension.
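For completeness, the 3D formula might be used as in this sketch. Everything except the formula is assumed: the kernel name CSMain3, the buffer name buffer3, a group of 4*4*4 threads and a dispatch of 2 groups per dimension, which gives 8 threads per dimension and 512 values in total.

```hlsl
#pragma kernel CSMain3

RWStructuredBuffer<int> buffer3; // name assumed, must hold 512 ints

// Total threads per dimension: groups * threads per group (2 * 4, assumed).
static const int groupSizeX = 8;
static const int groupSizeY = 8;

[numthreads(4,4,4)]
void CSMain3(int3 dispatchID : SV_DispatchThreadID)
{
    // Flatten the 3D position into a 1D buffer index.
    int id = dispatchID.x + dispatchID.y * groupSizeX + dispatchID.z * groupSizeX * groupSizeY;

    buffer3[id] = id;
}
```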



You should also have an understanding of how the semantics work. Take for example this kernel argument…



int3 dispatchID : SV_DispatchThreadID


SV_DispatchThreadID is the semantic and tells the GPU what value it should pass in for this argument. The name of the argument does not matter. You can call it what you want. For example this argument works the same as above.



Also the variable type can be changed. For example…

See the int3 has been changed to int. This is fine if you are only working with 1 dimension. You could also just use an int2 for 2 dimensions and you could also use an unsigned int (uint) instead of an int if you choose.
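To illustrate both points, here is a small sketch where the argument has a different name and a different type but still receives the dispatch id because of the semantic. The argument name id and the buffer name buffer1 are made up for the example.

```hlsl
#pragma kernel CSMain1

RWStructuredBuffer<int> buffer1; // name assumed

[numthreads(4,1,1)]
void CSMain1(uint id : SV_DispatchThreadID)
{
    // Only the semantic matters: "id" gets the x component of the dispatch id,
    // as a single unsigned int instead of an int3.
    buffer1[id] = (int)id;
}
```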



Since we now have two kernels in the shader we also need to tell the GPU which kernel we want to run when we make the dispatch call. Each kernel is given an id in the order they appear. Our first kernel would be id 0 and the next is id 1. When the number of kernels in a shader becomes larger this can become a bit confusing and it’s easy to set the wrong id. We can solve this by asking the shader for the kernel’s id by name. This line here, “int kernel = shader.FindKernel("CSMain2");”, gets the id of the kernel “CSMain2”. We then use this id when setting the buffer and making the dispatch call.

[When a compute shader has multiple kernels, the Dispatch method needs an id to identify which kernel to use; FindKernel can be used to get it.]


About now you may be thinking that this concept of groups of threads is a bit confusing. Why can’t I just use one group of threads? Well you can, but just know that there is a reason that threads are arranged into groups by the GPU. For a start a thread group is limited in the number of threads it can have (defined by the line “[numthreads(x,y,z)]” in the shader). This limit is currently 1024 but may change with new hardware. For example you can have a maximum of “numthreads(1024,1,1)” for 1D, “numthreads(32,32,1)” for 2D and so on. You can however have any number of groups of threads, and as you will often be processing data with millions of elements the concept of thread groups is essential. Threads in a group can also share memory and this can be used to make dramatic performance gains for certain algorithms, but I will cover that in a future post.



Well I think that about covers kernels and thread groups. There is just one more thing I want to cover: how to pass uniforms into your shader. This works the same as in Cg shaders but there is no uniform keyword. For the most part this is relatively simple but there are a few “gotchas” so I will briefly go over it.



For example if you want to pass in a float you need this line in the shader…


and this line in your script…

To set a vector you need this in the shader…

and this in the script…

You can only pass in a Vector4 from the script but your uniform can be a float, float2, float3 or float4. It will be filled with the appropriate values.


Now here’s where it gets tricky. You can pass in arrays of values. Note that this first example won’t work. I will explain why. You need this line in your shader…


and this in your script…

Now this won’t work. Whether this is by design or a bug in Unity I don’t know. You need to use vectors as uniforms for this to work. In your shader…

and your script…

So here we have an array of two float4’s and it is set from an array of 8 floats from a script. The same principles apply when setting matrices. In your shader…

and your script…

And of course you can have arrays of matrices. In your shader…

and your script…

This same logic does not seem to apply to float2x2 or float3x3. Again, whether this is a bug or design I don’t know.
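On the shader side, the uniform declarations discussed in this section might look like the sketch below. All names here are made up for illustration, and the kernel exists only so the uniforms are used somewhere. On the script side the values would be set with ComputeShader functions such as SetFloat, SetVector and SetFloats.

```hlsl
#pragma kernel CSMain

// A single float, set with one value from the script.
float myFloat;

// A vector. The script passes a Vector4; a float2 or float3 here
// would simply take the components it needs.
float4 myVector;

// An array of two float4's, filled from 8 floats in the script.
float4 myFloats[2];

// A matrix (16 floats) and an array of two matrices (32 floats).
float4x4 myMatrix;
float4x4 myMatrices[2];

RWStructuredBuffer<float4> result; // output buffer, name assumed

[numthreads(1,1,1)]
void CSMain(uint3 id : SV_DispatchThreadID)
{
    // Combine the uniforms so none of them is unused.
    float4 v = myVector * myFloat + myFloats[0] + myFloats[1];
    v = mul(myMatrix, v);
    v = mul(myMatrices[1], v);

    result[0] = v;
}
```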

DirectCompute tutorial for Unity 1: Introduction




At this time Unity has just started to support Microsoft’s DirectX 11, and with DirectX 11 came the DirectCompute API. This API opens up a whole new way to use the GPU by writing compute shaders.

[GPGPU in Unity is very useful, but the lack of tutorials makes it painful for users, so the author decided to write some.]


I hope to start off with an introduction about why DirectCompute was needed and how it differs from the traditional graphics pipeline, how to write the kernels via compute shaders, how the internal GPU tiling of threads works, how to access textures, how to use the various buffer types, how to use thread synchronization and shared memory and finally how to maximize performance.


The Graphics pipeline


All developers would need to do was send the geometry that they needed to render to the GPU and OpenGL would pass it through the pipeline until the end output was pixels displayed on the screen. This pipeline was fixed, and although developers could enable and disable certain parts as well as adjust some settings there was not a lot that could be changed. For a time this was all that was needed, but as developers started to push the GPU further and demand more advanced features it became apparent that this pipeline had to become more flexible. The solution was to make certain parts of the pipeline programmable. Developers could now write their own programs to perform certain parts of the pipeline. These new programs became known as shaders (there is a technical distinction between programs and shaders but that’s another story).



This new programmable pipeline opened up whole new possibilities and without it we would not have the quality of graphics you see in the previous generation of games. While the pipeline was originally only developed for creating graphics, the new flexibility meant that the GPU could now be used to process many types of algorithms. Researchers quickly began to modify algorithms to run in the multi-threaded environment of the GPU, and areas as diverse as physics, finance, mathematics, medicine and many more began to use the GPU to process their data. The raw power of the GPU was too hard to resist and GPGPU or General Purpose Graphical Processing Unit programming was born.



General Purpose Graphical Processing Unit programming


GPGPU quickly began to go mainstream and enter industrial use. There was still one problem however. Graphics APIs were still tied down to the graphics pipeline. It didn’t matter how much flexibility shaders gave you over a certain stage of the pipeline, they still had to conform to the restrictions of the pipeline. Vertex shaders still have to output vertices and fragment shaders still have to output pixels. While great strides were made and creative workarounds were developed, it soon became apparent that things had to change if GPGPU was to develop further. An API was needed that would free developers from the shackles of the graphics pipeline and provide an environment where the raw power of the GPU could be harnessed in a non-graphics related setting.



Over the next few years GPGPU APIs started to appear and currently developers are spoilt for choice between CUDA, OpenCL and DirectCompute. These new APIs presented a whole new way to work with the GPU and are no longer bound to the traditional graphics pipeline.

[GPGPU APIs appeared to solve the problem above.]


While the desire for GPGPU was to provide a way to work with the GPU in a non-graphics related setting, the games industry was naturally just as eager to make use of the new capabilities. As games have become more realistic they have started to simulate real world physics. These computations are often complicated and demanding to process. The power and flexibility offered by GPGPU has allowed these computations to be performed in ever greater detail and scope. The original graphics APIs opened up a whole new world in 3D games which we now all take for granted. The new GPGPU APIs are now doing the same thing and bringing in a whole new era in gaming. This current generation of gaming will be dominated by the uses of these APIs, resulting in ever more realistic graphics and believable physics.



All you need to do now is learn how to use them. Stay tuned for the next part which will show you how to set up the kernels and how the tiling of threads works.