
DirectCompute tutorial for Unity 7: Counter Buffers

This tutorial is about counter buffers. All the compute buffer types in Unity have a count property. For structured buffers this is a fixed value, as they act like an array. For append buffers the count changes as elements are appended or consumed, as they act like a dynamic array.

(All buffer types have a count property.)


DirectCompute also provides a way to manually increment and decrement the count value. This gives you greater freedom over how the buffer stores its elements and should allow custom data containers to be implemented. I have seen this used to create an array of linked lists built entirely on the GPU.

(The count property tracks the current size of an append/consume buffer.)


First, create a new script and paste in the following code. The code does not do anything interesting. It just creates the buffer, runs the shader and then prints out the buffer contents.

(Code: prints the buffer contents.)



On this line you will see the creation of the buffer.




Note that the buffer’s type is counter. The buffer’s count value is also set to zero. I recommend doing this when the buffer is created, as Unity will not always create the buffer with its count set to zero. I am not sure why; I think it may be a bug.



Next create a new compute shader and paste in the following code.




First notice the way the buffer is declared.



It’s just declared as a structured buffer. There is no counter type buffer.


The buffer’s count value is then incremented here. The increment function also returns what the count value was before it was incremented.

(In the kernel, the counter is incremented and the previous count is returned:)


Then the count is stored in the buffer so we can print out the contents to check it worked.

(This way the count values can be printed out.)


If you run the scene you should see the numbers 0 – 15 printed out.



So how do you decrement the counter? You guessed it. With the decrement function.



The decrement function will also return what the count value was after it gets decremented.
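The return values of the two functions are easy to mix up, so here is a small CPU-side model in Python. It is only a sketch of the semantics described above, not Unity or HLSL code, and the class and method names (`CounterBuffer`, `increment_counter`, `decrement_counter`) are made up for the illustration.

```python
class CounterBuffer:
    """Toy CPU model of a structured buffer with a hidden counter (not Unity's API)."""

    def __init__(self, size):
        self.data = [0] * size
        self.count = 0  # mirrors setting the count to zero at creation

    def increment_counter(self):
        # Returns the value the counter held BEFORE the increment.
        previous = self.count
        self.count += 1
        return previous

    def decrement_counter(self):
        # Returns the value the counter holds AFTER the decrement.
        self.count -= 1
        return self.count


buf = CounterBuffer(16)

# Each of the 16 "threads" increments the counter and stores what it got back.
for thread_id in range(16):
    buf.data[thread_id] = buf.increment_counter()

print(sorted(buf.data))  # the values 0 through 15
print(buf.count)         # 16

last = buf.decrement_counter()
print(last)              # 15
```

On a real GPU the threads do not run in a fixed order, so which thread receives which value can differ between runs. That is why the model prints the sorted contents.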


Now let’s say you have run a shader that increments the counter and adds elements to the buffer, but you don’t know how many were added. How do you find out the count of the buffer? You may recall from the append buffer tutorial that you can use Unity’s CopyCount function to retrieve the count value. You can do the same with the counter buffer. Add this code to the end of the start function. It should print out that the buffer’s count is 16.

(Combining the append buffer approach with a counter buffer lets you read back the buffer’s current size as elements are added.)



DirectCompute tutorial for Unity 6: Consume buffers

This tutorial covers how to use consume buffers in DirectCompute. It was originally going to be combined with the append buffer tutorial, as consume and append buffers are kinda the same thing. I decided it was best to split it up because the tutorial would have been a bit too long. In the last tutorial I had to add an edit because it turns out that there are some issues with using append buffers in Unity. It looks like they do not work on some graphics cards. A bug report has been submitted and hopefully this will be fixed some time in the future. To use consume buffers you need to use append buffers, so the same issue applies to this tutorial. If the last one did not work on your card, neither will this one.

(Consume buffers, like append buffers, are not well supported on all hardware. Keep this in mind below: odd problems may be caused by the hardware.)


I also want to point out that when using append or consume buffers, a mistake in the code can cause unpredictable results when run, even if you later fix the code and run it again. If this happens, especially if the error caused the GPU to crash, it is best to restart Unity to clear the GPU context.

(After a crash it is often best to restart Unity to clear the GPU context, then try again.)


To get started you will need to add some data to an append buffer, as you can only consume data from an append buffer. Create a new C# script and add this code.




Here we are simply creating an append buffer and then adding a position to it from the “appendBufferShader” for each thread that runs.

(Create an append buffer and run appendBufferShader to add position data to it.)


We also need a shader to render the results. The “Custom/AppendExample/BufferShader” shader posted in the last tutorial can be used, so I am not going to post the code again for that. You can find it in the append buffer tutorial or just download the project files (links at the end of this tutorial).



Now attach the script to the camera, bind the material and compute shader and run the scene. You should see a grid of red points.

(The result is a grid of red points.)


We have appended some points to our buffer and next we will consume some. Add this variable to the script.



Now add these two lines under the dispatch call to the append shader.




This will run the compute shader that will consume the data from the append buffer. Create a new compute shader and then add this code to it.




Now bind this shader to the script and run the scene. You should see nothing displayed. In the console you should see the vertex count as 0. So what happened to the data?

(Bind the shader to the C# script above and run it: nothing is displayed and the vertex count is 0.)


It’s this line here that is responsible.


This removes an element from the append buffer each time it is called. Since we ran the same number of threads as there are elements in the append buffer, in the end everything was removed. Also notice that the consume function returns the value that was removed.

(Reason: each time the consume buffer takes out an element it also deletes it from the buffer, so after this shader runs the buffer is empty.)


This is fairly simple but there are a few key steps to it. Notice that the buffer needs to be declared as a consume buffer in the compute shader like so…



But notice that in the script the buffer we bound to the uniform was not of the type consume. It was an append buffer. You can see this where it was created.



There is no type consume, there is only append. How the buffer is used depends on how you declare it in the compute shader. Declare it as an “AppendStructuredBuffer” to append data to it and as a “ConsumeStructuredBuffer” to consume data from it.



Consuming data from a buffer is not without its risks. In the last tutorial I mentioned that appending more elements than the buffer’s size will cause the GPU to crash. What would happen if you consumed more elements than the buffer has? You guessed it. The GPU will crash. Always try to verify that your code is working as expected by printing out the number of elements in the buffer during testing.



Removing every element from the buffer is a good way to clear the append buffer (which also appears to be the only way to clear a buffer without recreating it), but what happens if we only remove some of the elements?


Edit – Unity 5.4 has added a ‘SetCounterValue’ function to the buffer, so you can now use that to clear an append or consume buffer.



Change the dispatch call to the consume shader to this…



Here we are only running the shader for a quarter of the elements in the buffer. But the question is, which elements will be removed? Run the scene. You will see the points displayed again but some will be missing. If you look at the console you will see that there are now 768 elements in the buffer. There were 1024, and a quarter (256) have been removed to leave 768. But there is a problem. The elements removed seem to be determined at random and will be (mostly) different each time you run the scene.



This fact reveals how append buffers work and why consume buffers have limited use. These buffers are LIFO structures. The elements are added and removed in the order the kernel is run by the GPU, but as each kernel invocation runs on its own thread the GPU can never guarantee the order in which they will run. Every time you run the scene the order in which the elements are added and removed is different.
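The LIFO behaviour can be modelled on the CPU as fixed storage plus a counter. The Python sketch below is only an illustration under assumptions: the class `AppendConsumeBuffer` is hypothetical, the 32 by 32 grid is assumed from the 1024 elements mentioned above, and a shuffle stands in for the unpredictable order in which GPU threads finish.

```python
import random


class AppendConsumeBuffer:
    """Toy model: fixed-capacity storage plus a counter, used as a stack."""

    def __init__(self, capacity):
        self.data = [None] * capacity
        self.count = 0

    def append(self, value):
        # A real GPU does not raise here; the text says it misbehaves or crashes.
        if self.count >= len(self.data):
            raise OverflowError("appended past the reserved size")
        self.data[self.count] = value
        self.count += 1

    def consume(self):
        if self.count == 0:
            raise IndexError("consumed from an empty buffer")
        self.count -= 1
        return self.data[self.count]


width = 32  # assumed grid size, 32 * 32 = 1024 elements
buf = AppendConsumeBuffer(width * width)

# Threads finish in an unpredictable order; shuffling stands in for that.
thread_ids = [(x, y) for y in range(width) for x in range(width)]
random.shuffle(thread_ids)
for tid in thread_ids:
    buf.append(tid)

# Consume a quarter of the elements, as in the modified dispatch call.
removed = [buf.consume() for _ in range(buf.count // 4)]
print(buf.count)     # 768
print(len(removed))  # 256
```

The elements that come back are the last 256 that were appended, and since the append order changes from run to run, so does the set of points that disappears.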



This does limit the use of consume buffers but does not mean they are useless. LIFO structures have never been available on the GPU before, and as long as the elements’ exact order does not matter they will allow you to perform algorithms that were impossible on the GPU in the past. DirectCompute also adds the ability to have some control over how threads are run by using thread synchronization, which will be covered in a later tutorial.

(This ordering issue does limit the use of consume buffers; take care not to depend on element order.)

DirectCompute tutorial for Unity 5: Append buffers

In today’s tutorial I will be expanding on the topic of buffers covered in the last tutorial by introducing append buffers. These buffers offer greater flexibility over structured buffers by allowing you to dynamically increase their size during run time from a compute or Cg shader.

(Append buffers are flexible: the number of elements they hold can change.)


The great thing about these buffers is that they can still be used as structured buffers, which makes the actual rendering of their contents simpler. Start off by creating a new shader and pasting in this code. This is what we will be using to draw the points we add into the buffer. Notice the buffer is just declared as a structured buffer.




Now let’s make a script to create our buffers. Create a new C# script, paste in this code and then drag it onto the main camera in a new scene.




Notice this line here…



This is the creation of an append buffer. It must be of the type “ComputeBufferType.Append”, unsurprisingly. Notice we still pass in a size for the buffer (the width * width parameter). Even though append buffers need to have their elements added from a shader, they still need to have a predefined size. Think of this as a reserved area of memory that the elements can be added to. This also raises a subtle error that can arise, which I will get to later.

(Creating an append buffer: note that an initial size still has to be set.)


The append buffer starts off empty and we need to add to it from a shader. Notice this line here…


Here we are running a compute shader to fill our buffer. Create a new compute shader and paste in this code.




This will fill our buffer with a position for each thread that runs. Nothing fancy. Notice this line here…



The “Append(pos)” is the line that actually adds a position to the buffer. Here we are only adding a position if the x and y dispatch id are even numbers. This is why append buffers are so useful. You can add to them on any conditions you wish from the shader.



The dynamic contents of an append buffer can cause a problem when rendering, however. If you remember from the last tutorial, we rendered a structured buffer using this function…



Unity’s “DrawProcedural” function needs to know the number of elements that need to be drawn. If our append buffer’s contents are dynamic, how do we know how many elements there are?

(Graphics.DrawProcedural needs to know the number of elements in advance, but the number of elements in the append buffer above is not known.)


To do that we need to use an argument buffer. Take a look at this line from the script…

(So we use an argument buffer.)


Notice it has to be of the type “ComputeBufferType.DrawIndirect”, it has 4 elements and they are integers. This buffer will be used to tell Unity how to draw the append buffer using the function “DrawProceduralIndirect”.

(This buffer tells Unity the final number of elements; drawing must use DrawProceduralIndirect instead.)


These 4 elements represent the number of vertices, the number of instances, the start vertex and the start instance. The number of instances is simply how many times to draw the buffer in a single shader pass. This would be useful for something like a forest where you have a few trees that need to be drawn many times. The start vertex and instance just allow you to adjust where to draw from.
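As a rough picture of that layout, here is a Python sketch of the four slots and of how the first one is later overwritten with the append buffer’s count. The helper `copy_count` and the value 256 are stand-ins for illustration, not Unity’s API.

```python
# Layout of the 4-integer argument buffer described above.
VERTEX_COUNT, INSTANCE_COUNT, START_VERTEX, START_INSTANCE = range(4)

# Vertex count not known yet, one instance, start at vertex 0 and instance 0.
args = [0, 1, 0, 0]


def copy_count(append_count, dst, dst_index):
    """Stand-in for copying a buffer's hidden count into another buffer.

    Unity's real call takes a byte offset; a list index is used here for
    simplicity.
    """
    dst[dst_index] = append_count


append_count = 256  # hypothetical number of elements appended by the shader
copy_count(append_count, args, VERTEX_COUNT)

print(args)  # [256, 1, 0, 0]
```

The point of the indirection is that the count never has to be read back by the script for drawing; it is only read back for printing and checking.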



These values can be set from the script. Notice this line here…




Here we are just telling Unity to draw one instance of our buffer starting from the beginning. It’s the first number that is the important one, however. It’s the number of vertices, and you can see it’s set to 0. This is because we need to get the exact number of elements that are in the append buffer. This is done with the following line…


This copies the number of elements in the append buffer into our argument buffer. At this stage it’s important to make sure everything is working correctly, so we will also get the values in the argument buffer and print out the contents like so…



Now run the scene, making sure you have bound the shader and material to the script. You should see the vertex count as 256. This is because we ran 1024 threads with the compute shader and added a position only for even x and y ids, which ends up being the 256 points you see on the screen.
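The 256 can be checked with a few lines of Python. This is just the arithmetic, assuming a 32 by 32 grid of dispatch ids (32 * 32 = 1024 threads) and the even-x, even-y condition described earlier.

```python
width = 32  # assumed grid size; 32 * 32 = 1024 threads

appended = [
    (x, y)
    for y in range(width)
    for x in range(width)
    if x % 2 == 0 and y % 2 == 0  # only even x and even y ids append
]

print(width * width)  # 1024
print(len(appended))  # 256
```

Half of the x ids and half of the y ids pass the test, so a quarter of the threads append a position.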

(Running it gives a vertex count of 256, from the 1024 threads that were run.)


Now remember I said there was a subtle error that can arise and that it has to do with the buffer’s size? When the buffer was declared we set its size as 1024 and now we have entered 256 elements into it. What would happen if we added more than 1024 elements? Nothing good. Appending more elements than the buffer’s size causes an error in the GPU’s driver. This causes all sorts of issues. The best thing that could happen is that the GPU would crash. The worst is a persistent error in the calculation of the number of elements in the buffer that can only be fixed by restarting Unity, and which means that the buffer will not be drawn correctly with no indication as to why. What’s worse is that since this is a driver issue you may not experience the issue the same way on different computers, leading to inconsistent behavior, which is always hard to fix.

(Note that an append buffer has a maximum capacity: the size it was created with. Here, appending more than 1024 elements will cause a failure.)


This is why I have printed out the number of elements in the buffer. You should always check to make sure the value is within the expected range.



Once you have copied the vertex count to the argument buffer you can then draw the buffer like so…

(With the argument buffer in place, the result can be drawn with the function below.)


You’re not limited to filling a compute buffer from a compute shader. You can also fill it from a Cg shader. This is useful if you want to save some pixels from an image for some sort of further post effect. The catch is that since you are now using the normal graphics pipeline you need to play by its rules. You need to output a fragment color into a render texture even if you don’t want to. As such, if you find yourself in that situation it means that whatever you’re doing is probably better done from a compute shader. Other than that the process works much the same, but like everything there are a few things to be careful of.



Create a new shader and paste in this code…



Notice that the fragment shader is much the same as the compute shader used previously. Here we are just adding a position to the buffer for each pixel rendered if the uv (as the id) is an even number on both the x and y axis. Don’t ever try to create a material from this shader in the editor! The reason is that Unity will try to render a preview image for the material and will promptly crash. You need to pass the shader to a script and create the material from there.



Create a new C# script and paste in the following code…



Again you will see that this is very much like the previous script using the compute shader. There are a few differences however.




Instead of the compute shader dispatch call we need to use Graphics.Blit, and we need to bind and unbind the buffer. We also need to provide a render texture as the destination for Graphics.Blit. This makes the process Pro only, unfortunately.

(Dispatch is no longer used; Graphics.Blit is used instead, and it renders into the render texture.)


Attach this script to the main camera in a new scene and run the scene after binding the material and shader to the script. It should look just like the previous scene (a grid of points) but they will be blue.




DirectCompute tutorial for Unity 4: Buffers

In DirectCompute there are two types of data structures you will be using: textures and buffers. The last tutorial covered textures and today I will be covering buffers. While textures are good for data that needs filtering or mipmapping, like color information, buffers are more suited to representing information for vertices, like positions or normals. Buffers can also easily send or retrieve data from the GPU, which is a feature that’s rather lacking in Unity.



There are 5 types of buffers you can have, structured, append, consume, counter and raw. Today I will only be covering structured buffers because this is the most commonly used type. I will try and cover the others in separate tutorials as they are a little more advanced and I need to cover some of the basics first.

(There are five buffer types you can use: structured, append, consume, counter and raw. This part only covers structured buffers.)


So let’s start off by creating a C# script, name it BufferExample and paste in the following code.



We will also need a material to draw our buffer with, so create a normal Cg shader, paste in the following code and create a material out of it.




Now to run the scene attach the script to the main camera node and bind the material to the material attribute on the script. This script has to be attached to a camera because to render a buffer we need to use the “OnPostRender” function which will only be called if the script is on a camera (EDIT – turns out you can use “OnRenderObject” if you don’t want the script attached to a camera). Run the scene and you should see a number of random red points.





Now notice the creation of the buffer from this line.


The three parameters passed to the constructor are the count, stride and type. The count is simply the number of elements in the buffer. In this case 1024 points. The stride is the size in bytes of each element in the buffer. Since each element in the buffer represents a position in 3D space we need a 3 component vector of floats. A float is 4 bytes so we need a stride of 4 * 3. I like to use the sizeof operator as I think it makes the code more clear. The last parameter is the buffer type and it is optional. If left out it creates a buffer of default type which is a structured buffer.

(Three parameters: the number of elements, the size of each element, and the type, which defaults to a structured buffer.)
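The stride arithmetic can be double-checked with Python’s `struct` module. This is only a sanity check of the byte counts for a position made of three 32-bit floats; it does not touch Unity.

```python
import struct

float_size = struct.calcsize("<f")  # size of one 32-bit float in bytes
stride = struct.calcsize("<3f")     # three floats per position
count = 1024                        # number of points in the buffer

print(float_size)      # 4
print(stride)          # 12
print(count * stride)  # 12288 bytes in total
```

If the stride passed to the buffer does not match the element layout used in the shader, elements will be read from the wrong byte offsets, so it is worth computing it rather than typing a literal.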


Now we need to pass the positions we have made to the buffer like so.



This passes the data to the GPU. Just note that passing data to and from the GPU can be a slow process, and in general it is faster to send data than to retrieve it.


Now we need to actually draw the data. Drawing buffers has to be done in the “OnPostRender” function which will only be called if the script is attached to the camera. The draw call is made using the “DrawProcedural” function like so…



There are a few key points here.

The first is that the material’s pass must be set before the DrawProcedural call. Fail to do this and nothing will be drawn. You must also bind the buffer to the material, but you only have to do this once, not every frame like I am here. Now have a look at the “DrawProcedural” function. The first parameter is the topology type; in this case I am just rendering points, but you can render the data as lines, line strips, quads or triangles. You must however order the points to match the topology. For example, if you render lines every two points in the buffer will make a line segment, and for triangles every three points will make a triangle. The next two parameters are the vertex count and the instance count. The vertex count is just the number of vertices you will be drawing, in this case the number of elements in the buffer. The instance count is how many times you want to draw the same data. Here we are just rendering the points once, but you could render them many times and have each instance in a different location.

(The material pass must be set before calling DrawProcedural, otherwise nothing is drawn.)

(DrawProcedural parameters: topology type, vertex count, instance count.)
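How the vertex count relates to what gets drawn can be worked out with a small Python sketch. The table and function names are hypothetical, and line strips are left out because they share vertices between segments.

```python
# Vertices needed to form one primitive for each list-style topology.
VERTICES_PER_PRIMITIVE = {
    "points": 1,
    "lines": 2,
    "triangles": 3,
    "quads": 4,
}


def primitive_count(topology, vertex_count):
    """Number of complete primitives formed by vertex_count vertices."""
    return vertex_count // VERTICES_PER_PRIMITIVE[topology]


def total_vertices_processed(vertex_count, instance_count):
    """Each instance draws the whole set of vertices again."""
    return vertex_count * instance_count


print(primitive_count("points", 1024))     # 1024
print(primitive_count("lines", 1024))      # 512
print(primitive_count("triangles", 1024))  # 341, one vertex is left over
print(total_vertices_processed(1024, 3))   # 3072
```

For the 1024-element buffer in this tutorial, drawing as points uses every element, while drawing as triangles would leave one vertex unused.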




Now for the material. This is pretty straightforward. You just need to declare your buffer as a uniform like so…



Since buffers are only in DirectCompute you must also set the shader target to SM5 like so…



The vertex shader must also have the argument “uint id : SV_VertexID”. This allows you to access the correct element in the buffer like so…

(With the SV_VertexID argument, the vertex shader can easily access the right element.)


Buffers are a generic data structure and don’t have to hold floats like in this example. We could use integers instead. Change the script’s start function to this…




and the shader’s uniform declaration to this…

You could even use a double, but just be aware that double precision in shaders is still not widely supported on GPUs, although it’s common on newer cards.



You are not limited to using primitives like float or int. You can also use Unity’s Vectors like so…

(You can also use Unity’s vector types!)



With the uniform as…

You can also create your own structs to use. Change your script’s start function to this, with the struct declaration above it…




and the shader to this…


This will draw each point like before but now they will also be random colors. Just be aware that you need to use a struct not a class.



When using buffers you will often find you need to copy one into another. This can be easily done with a compute shader like so…




You can see that buffer1 is copied into buffer2. Since buffers are always 1 dimensional it is best done from a 1 dimension thread group.



Just as textures are objects, so too are buffers, and they have some functions that can be called on them. You can use the load function to access a buffer’s element, just like with the subscript operator.



Buffers also have a “GetDimensions” function. This returns the number of elements in the buffer and their stride.







DirectCompute tutorial for Unity 3: Textures

The focus of today’s tutorial is on textures. These are arguably the most important feature when using DirectCompute. It is highly likely that every shader you write will use at least one texture. Unfortunately render textures in Unity are Pro only, so this tutorial covers Pro-only topics. The good news is that this will be the only Pro-only tutorial. Textures in DirectCompute are very simple to use but there are a few traps you can fall into. Let’s start with something simple. Create a compute shader and paste in the following code.

(Simple to use, but note that only Unity Pro supports this feature.)





And then a C# script called TextureExample and paste in the following code.



Attach the script, bind the shader and run the scene. You should see a texture with the uv’s displayed as colors. This is what we have written into the texture with the compute shaders.



Now notice the kernel argument “uint2 id : SV_DispatchThreadID” from the last tutorial. This is the thread’s position in the groups of threads, and just like with buffers we need this to know what location to write the results into in the texture, like this: “tex[id] = result”. This time however we don’t need the “flattened” index like with buffers. We can just use the uint2 directly. This is because, unlike buffers, textures can be multidimensional. We have a 2D thread group and a 2D texture.
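To make the difference concrete, here is a Python sketch that writes the same value into a 2D grid using the (x, y) id directly and into a 1D list using a flattened index. The row-major formula `x + y * width` is the usual convention and is assumed here, as is the uv-style payload.

```python
width, height = 64, 64


def flatten(x, y, width):
    """Row-major index of a 2D id in a 1D buffer (assumed convention)."""
    return x + y * width


texture = [[None] * width for _ in range(height)]  # 2D: index with (x, y)
buffer = [None] * (width * height)                 # 1D: needs a flat index

for y in range(height):
    for x in range(width):
        value = (x / width, y / height)  # hypothetical "uv as color" payload
        texture[y][x] = value
        buffer[flatten(x, y, width)] = value

print(flatten(3, 2, width))  # 131
```

Both containers end up holding the same data; the texture simply lets the 2D id be used without the extra arithmetic.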



Now look at the declaration of the texture variable “RWTexture2D<float4> tex;”. The “RWTexture2D” part is important. This is obviously a texture, but what’s with the RW? This declares that the texture is of the type “unordered access view”. This just means that you can write to any location in the texture from the shader. You may be able to write to the texture but you cannot read from it. Remove the RW part and it’s just a normal texture, but you can’t write to it. Just remember that you can only read from a Texture2D and only write to a RWTexture2D.



Now let’s look at the script. Notice how the texture is created.


There are two important things here. The first is “enableRandomWrite”. You must set this to true if you want to write into the texture. This basically says that the texture can have unordered access views. If you don’t do this, nothing will happen when you run the shader and Unity won’t give you an error. It will just fail for no apparent reason. The second is the “Create” function call. You must call Create on the texture before you write into it. Again, if you don’t, nothing will happen and you won’t get an error. It just won’t work. If you are used to writing into a texture with Graphics.Blit then you may notice you don’t have to call Create. This is because Graphics.Blit checks if the texture is created and creates it if it’s not. The dispatch function can’t do this because it does not know what textures are being written into when it is called.



Also note that the texture is released in the “OnDestroy” function. When you are finished with your render textures make sure you release them.



Now let’s look at the dispatch call.


Remember from the last tutorial that this is where we set how many groups to run. So why is the number of groups to run the texture size divided by 8? Look at the shader. You can see from the “[numthreads(8,8,1)]” line that we have 8 threads per group in each of the x and y dimensions. Now we need a thread to run for each pixel in the texture. We have a texture 64 pixels wide, and if we divide the pixels by the number of threads per group we get the number of groups we will need. We end up with 8 groups of 8 threads, which is 64 threads in total in the x dimension, and it’s the same for the y dimension. This gives us a total of 4096 threads running (64 * 64), which is the number of pixels in the texture.
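The group arithmetic above can be written out in Python. The rounding up in `group_count` is an addition for sizes that are not a multiple of 8 and is not something the tutorial’s script does; with rounding up, the shader would also need to skip ids that fall outside the texture.

```python
def group_count(size, threads_per_group):
    # Round up so a size that is not a multiple of the group size is still covered.
    return (size + threads_per_group - 1) // threads_per_group


texture_size = 64
threads_per_group = 8  # from [numthreads(8,8,1)]

groups_x = group_count(texture_size, threads_per_group)
groups_y = group_count(texture_size, threads_per_group)
total_threads = (groups_x * threads_per_group) * (groups_y * threads_per_group)

print(groups_x, groups_y)  # 8 8
print(total_threads)       # 4096
```

For the 64 pixel texture the result is the same as a plain division by 8.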



Now have a look at this part of the script.


This is where we are setting the uniforms for the shader. We have to set the texture we are writing into, and we also need the texture’s width and height so we can calculate the uv’s from the dispatch id. It’s common to pass variables into a shader like this, but in this case we don’t need the width and height. We can get them from within the shader. Change your shader to this…




and remove these two lines from your script…




Run the scene. It should look the same. Notice the “tex.GetDimensions(w, h);” line. Textures are objects. This means they have functions you can call from them.



In this case we are asking for the texture’s dimensions. Textures have a number of functions, with a number of overloads, that you can call. I will go through the most common and how to use them, but first we need to change the scene a bit. What we want to do now is copy the contents of our texture into another texture and then display that result.




And change the script to this…



Now bind the new shader you have created to the “shaderCopy” attribute and run the scene. The scene should look the same as before. What we are doing here is filling the first texture with its uv’s as a color like before and then copying the contents to another texture. This is so I can demonstrate the many ways to sample from a texture in a shader. Notice this line “float4 t = tex[id];” in the copy shader. This is the simplest way to sample from a texture. Just as the dispatch id is the location you want to write to, it is also the location you want to read from. You can see here that we can sample from a texture just as if it were an array.



There are other ways of doing this. For example…


Here we are accessing the mipmaps of the texture instead. Level 0 would be the first mipmap, the one with the same size as the texture. The next dimension in the mipmap array is the location to sample at, using the dispatch id. Just note that we have not enabled mipmaps on the texture, so sampling levels other than 0 gives no useful result.



We can do the same with the texture’s load function.



In this case the uint3’s x and y values are the location to sample at and the z value (the 0) is the mipmap level. Both these examples work the same.
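For reference, here is what the mipmap levels of a 64 by 64 texture would be if mipmaps were enabled. This Python sketch just halves the size until it reaches 1 by 1, which is the standard mip chain; the function name is made up.

```python
def mip_chain(width, height):
    """Sizes of every mipmap level, starting with level 0 (full size)."""
    sizes = [(width, height)]
    while width > 1 or height > 1:
        width = max(1, width // 2)
        height = max(1, height // 2)
        sizes.append((width, height))
    return sizes


chain = mip_chain(64, 64)
print(len(chain))  # 7
print(chain[0])    # (64, 64)
print(chain[-1])   # (1, 1)
```

A location that is valid at level 0 is generally out of range at a higher level, since each level has half the width and height of the one before it.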


There is one feature of textures that you will be using a lot: the texture’s ability to filter and wrap. To do this you need to use a sampler state. If you are used to using HLSL outside of Unity you may know that you need to create a sampler object to use. This works a bit differently in Unity. Basically you have two choices for filtering (Linear or Point) and two for wrap (Clamp or Repeat). All you need to do is declare a sampler state in the shader with the word Linear or Point in the name and the word Repeat or Clamp in the name. For example you could use the name myLinearClamp or aPointRepeat, etc. I prefer an underscore followed by the name. Change your shader to this.

(To use a texture’s filter and wrap features you need a sampler state; an example follows.)



If you run the scene it should still look the same. Notice this line “float4 t = tex.SampleLevel(_LinearClamp, uv, 0);”. Here we are using the texture’s SampleLevel function. The function takes a sampler state, the uv’s and a mipmap level. The uv’s need to be normalized, that is, have a range of 0 to 1. Notice the SamplerState variables at the top. If you are using sampler states you are most likely wanting to have bilinear filtering. If that’s the case use the _LinearClamp or _LinearRepeat sampler state.
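What the four combinations of filter and wrap mean can be shown with a one-dimensional Python model. This is a simplified sketch of the general technique, not of how Unity or the GPU implements it; the function names and the four-texel example are made up, and texel centres are assumed to sit at (i + 0.5) / size.

```python
import math


def address(i, size, wrap):
    """Map a texel index that may be out of range back into the texture."""
    if wrap == "repeat":
        return i % size
    return min(max(i, 0), size - 1)  # clamp


def sample_1d(texels, u, filter_mode="linear", wrap="clamp"):
    size = len(texels)
    if filter_mode == "point":
        # Nearest texel: no blending.
        return texels[address(math.floor(u * size), size, wrap)]
    # Linear: blend the two texels whose centres surround u.
    pos = u * size - 0.5
    i0 = math.floor(pos)
    t = pos - i0
    a = texels[address(i0, size, wrap)]
    b = texels[address(i0 + 1, size, wrap)]
    return a * (1.0 - t) + b * t


texels = [0.0, 1.0, 2.0, 3.0]

print(sample_1d(texels, 0.3, "point", "clamp"))    # 1.0
print(sample_1d(texels, 0.5, "linear", "clamp"))   # 1.5
print(sample_1d(texels, 0.0, "linear", "clamp"))   # 0.0
print(sample_1d(texels, 0.0, "linear", "repeat"))  # 1.5
```

The last two lines show the difference between the wrap modes at the edge: clamp reuses the edge texel, while repeat blends with the texel from the opposite side.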




If you are used to using HLSL in a fragment shader (as opposed to a compute shader here) you may know that you can use this function for filtering.



Notice it’s called Sample, not SampleLevel, and the mipmap parameter is gone. If you try to use this in a compute shader you will get an error because this function does not exist there. The reason why is surprisingly complicated and gives an insight into how the GPU works. Behind the scenes fragment shaders (or any shader) work in much the same way as a compute shader, as they share the same GPU architecture. They run in threads and the threads are arranged into thread groups. Now remember that threads in a group can share memory. Fragment shaders always run in a group of at least 2 by 2 threads. When you sample from a texture the fragment shader checks what its neighbors’ uv’s are. From this it can work out what the derivatives of the uv’s are. The derivatives are just a rate of change: in areas where there is a high rate of change the texture’s higher mipmaps are used, and in areas of low rate of change the lower mipmaps are used. This is how the GPU reduces aliasing problems, and it also has the handy byproduct of reducing memory bandwidth (the higher mipmaps are smaller).
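The idea of picking a mipmap from the uv derivatives can be sketched in Python. The formula below (log2 of the largest uv change per pixel, measured in texels) is the commonly quoted simplification and is an assumption here; real hardware differs in detail. The function name and the example uv values are made up.

```python
import math


def mip_level(duv_dx, duv_dy, tex_size, mip_count):
    """Estimate a mip level from per-pixel UV derivatives (simplified)."""
    # Convert UV derivatives into texel units.
    dx = math.hypot(duv_dx[0] * tex_size, duv_dx[1] * tex_size)
    dy = math.hypot(duv_dy[0] * tex_size, duv_dy[1] * tex_size)
    rho = max(dx, dy)
    if rho <= 1.0:
        return 0.0  # at most one texel per pixel: use the full-size level
    return min(math.log2(rho), float(mip_count - 1))


# UVs of three fragments in a 2x2 block: top-left, its right-hand neighbor
# and the fragment below it. Differences between neighbors give the derivatives.
top_left, top_right, bottom_left = (0.0, 0.0), (4 / 64, 0.0), (0.0, 4 / 64)
ddx = (top_right[0] - top_left[0], top_right[1] - top_left[1])
ddy = (bottom_left[0] - top_left[0], bottom_left[1] - top_left[1])

level_far = mip_level(ddx, ddy, tex_size=64, mip_count=7)
level_near = mip_level((1 / 64, 0.0), (0.0, 1 / 64), tex_size=64, mip_count=7)
print(level_far, level_near)
```

A compute shader thread has no such guaranteed 2 by 2 neighborhood to difference against, which is why it has to be told the level explicitly through SampleLevel.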



All the examples I have given are in 2D but the same principles apply in 3D. You just need to create the texture a little differently.



This creates a 3D texture that is 64 * 64 * 64 pixels. Notice the “volumeDepth” is set to 64 and “isVolume” is set to true. Remember to set the number of groups and threads to the correct values. The dispatch id also needs to be a uint3 or int3.

(Create a 3D texture and run a 3D shader.)




That about covers it for textures. Next time I will look at buffers: how to use the default buffer, the other buffer types, how to draw data from your buffers and how to set and get your buffer data.


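(Finally, a note unrelated to the translated text above: the texture a shader declares effectively corresponds to Texture, the parent class of RenderTexture, Texture2D and Texture3D. So any texture that is a subclass of Texture can be used directly in the shader. This is important: you can use imported textures for computation in a compute shader.)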

DirectCompute tutorial for Unity 2: Kernels and thread groups

Today I will be going over the core concepts for writing compute shaders in Unity.


At the heart of a compute shader is the kernel. This is the entry point into the shader and acts like the Main function in other programming languages. I will also cover the tiling of threads by the GPU. These tiles are also known as blocks or thread groups. DirectCompute officially refers to these tiles as thread groups.

(The basic unit of computation is the kernel.)


To create a compute shader in Unity simply go to the project panel, click Create->Compute Shader and then double click the shader to open it up in MonoDevelop for editing. Paste the following code into the newly created compute shader.

(Creating a kernel in Unity.)



This is the bare minimum of content for a compute shader and will of course do nothing but will serve as a good starting point. A compute shader has to be run from a script in Unity so we will need one of those as well. Go to the project panel and click Create->C# script. Name it KernelExample and paste in the following code.

(C# code to run the kernel.)



Now drag the script onto any game object and then attach the compute shader to the shader attribute. The shader will now run in the start function when the scene is run. Before you run the scene however you need to enable dx11 in Unity. Go to Edit->Project Settings->Player and then tick the “Use Direct3D 11” box. You can now run the scene. The shader will do nothing but there should also be no errors.


t001 - 03


In the script you will see the “Dispatch” function called. This is responsible for running the shader. Notice the first argument is a 0. This is the id of the kernel that you want to run. In the shader you will see the line “#pragma kernel CSMain1”. This defines which function in the shader is the kernel, as you may have many functions (and even many kernels) in one shader. There must be a function with the name CSMain1 in the shader or the shader will not compile.



Now notice the “[numthreads(4,1,1)]” line. This tells the GPU how many threads of the kernel to run per group. The three numbers are the size of the group in each dimension. A thread group can have up to 3 dimensions and in this example we are just running a 1 dimensional group with a width of 4 threads. That means we are running a total of 4 threads and each thread will run a copy of the kernel. This is why GPUs are so fast. They can run thousands of threads at a time.



Now let’s get the kernel to actually do something. Change the shader to this…

t001 - 04


and the script’s Start function to this…

t001 - 05


Now run the scene and you should see the numbers 0, 1, 2 and 3 printed out.


Don’t worry too much about the buffer for now. I will cover buffers in detail in the future, but just know that a buffer is a place to store data and it needs to have the release function called when you are finished with it.



Notice this argument added to the CSMain1 function: “int3 threadID : SV_GroupThreadID”. This is a request to the GPU to pass the thread id into the kernel when it is run. We are then writing the thread id into the buffer, and since we have told the GPU we are running 4 threads, the id ranges from 0 to 3 as we see from the printout.



Now those 4 threads make up what’s called a thread group. In this case we are running 1 group of 4 threads but you can run multiple groups of threads. Let’s run 2 groups instead of 1. Change the shader’s kernel to this…


t001 - 06


and the script’s Start function to this…

t001 - 07


Now run the scene and you should have 0-3 printed out twice.


Now notice the change to the dispatch function. The last three arguments (the 2,1,1) are the number of groups we want to run. Just like the number of threads, groups can go up to 3 dimensions, and in this case we are running 2 groups in 1 dimension. We have also had to change the kernel by adding the argument “int3 groupID : SV_GroupID”. This is a request to the GPU to pass in the group id when the kernel is run. The reason we need this is that we are now writing out 8 values, 2 groups of 4 threads. We now need the thread’s position in the buffer, and the formula for this is the thread id plus the group id times the number of threads per group ( threadID.x + groupID.x*4 ).
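
The formula can be checked on the CPU. What follows is only a Python sketch of the arithmetic, not shader code. It assumes the kernel writes its thread id into the buffer at the computed position, and the function name simulate_dispatch_1d is made up for this illustration.

```python
def simulate_dispatch_1d(num_groups, threads_per_group):
    """Model a 1D dispatch in which every thread writes its group-local id
    into the slot thread_id + group_id * threads_per_group."""
    buffer = [None] * (num_groups * threads_per_group)
    for group_id in range(num_groups):              # plays the role of SV_GroupID.x
        for thread_id in range(threads_per_group):  # plays the role of SV_GroupThreadID.x
            index = thread_id + group_id * threads_per_group
            buffer[index] = thread_id
    return buffer


# 2 groups of 4 threads, as in the Dispatch(0, 2, 1, 1) call described above.
result = simulate_dispatch_1d(num_groups=2, threads_per_group=4)
print(result)  # [0, 1, 2, 3, 0, 1, 2, 3]
```

The loops run one thread after another only because this is a CPU model. On the GPU the threads run in parallel and in no guaranteed order, which is why each thread has to compute its own slot.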



This is a bit awkward to write. Surely the GPU knows the thread’s position? Yes it does. Change the shader’s kernel to this and rerun the scene.


t001 - 08


The results should be the same, two sets of 0-3 printed. Notice that the group id argument has been replaced with “int3 dispatchID : SV_DispatchThreadID”. This is the same number our formula gave us except now the GPU is doing it for us. It is the thread’s position across all the groups of threads in the dispatch.
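
As a quick check that the dispatch thread id really is the hand-written formula, here is a small Python sketch of the calculation for one dimension. The helper name dispatch_thread_id and the two constants are assumptions chosen to match the 2 groups of 4 threads used so far.

```python
def dispatch_thread_id(group_id, group_thread_id, threads_per_group):
    """One dimension of SV_DispatchThreadID:
    group id * threads per group + thread id within the group."""
    return group_id * threads_per_group + group_thread_id


THREADS_PER_GROUP = 4  # matches [numthreads(4,1,1)]
NUM_GROUPS = 2         # matches the 2 groups passed to Dispatch

ids = [
    dispatch_thread_id(group, thread, THREADS_PER_GROUP)
    for group in range(NUM_GROUPS)
    for thread in range(THREADS_PER_GROUP)
]
print(ids)  # [0, 1, 2, 3, 4, 5, 6, 7]
```

Every thread gets a different id, so the ids can be used directly as buffer positions.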



So far these have all been in 1 dimension. Let’s step things up a bit and move to 2 dimensions, and instead of rewriting the kernel let’s just add another one to the shader. It’s not uncommon to have a kernel for each dimension in a shader performing the same algorithm. First add this code to the shader below the previous code so there are two kernels in the shader.

【Next step: use a 2D kernel in place of running the 1D kernel twice】

t001 - 09


and the script to this…

t001 - 10


Run the scene and you will see a row printed from 0 to 7 and the next row 8 to 15 and so on to 63.

t001 - 11


Why from 0 to 63? Well we now have four 2D groups of threads and each group is 4 by 4 so has 16 threads. That gives us 64 threads in total.


Notice what value we are outputting from this line “int id = dispatchID.x + dispatchID.y * 8”. The dispatch id is the thread’s position in the groups of threads for each dimension. We now have 2 dimensions, so we need the thread’s global position in the buffer, and this is just the dispatch x id plus the dispatch y id times the total number of threads in the first dimension (4 * 2). This is a concept you will have to be familiar with when working with compute shaders. The reason is that buffers are always 1 dimensional, so when working in higher dimensions you need to calculate the index at which the result should be written into the buffer.

【Thread explanation: [numthreads(4,4,1)] means 16 threads per group, and computershader.Dispatch(kernel, 2, 2, 1); launches four thread groups in 2 dimensions, so there are 64 threads in total.】
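
To see why the multiplier is 8, here is a Python sketch that walks every thread of a 2 by 2 dispatch of 4 by 4 groups and stores each thread’s flattened index. It is a CPU model only. The names THREADS, GROUPS, WIDTH, HEIGHT and flat_index are assumptions made for the example.

```python
THREADS = (4, 4)  # threads per group in x and y, as in [numthreads(4,4,1)]
GROUPS = (2, 2)   # groups in x and y, as in Dispatch(kernel, 2, 2, 1)

WIDTH = THREADS[0] * GROUPS[0]   # 8 threads across the first dimension
HEIGHT = THREADS[1] * GROUPS[1]  # 8 threads across the second dimension


def flat_index(x, y, width):
    """Flatten a 2D dispatch id into a position in a 1D buffer."""
    return x + y * width


buffer = [None] * (WIDTH * HEIGHT)
for group_y in range(GROUPS[1]):
    for group_x in range(GROUPS[0]):
        for thread_y in range(THREADS[1]):
            for thread_x in range(THREADS[0]):
                # The dispatch id in each dimension is group * threads + thread.
                x = group_x * THREADS[0] + thread_x
                y = group_y * THREADS[1] + thread_y
                index = flat_index(x, y, WIDTH)
                buffer[index] = index

rows = [buffer[r * WIDTH:(r + 1) * WIDTH] for r in range(HEIGHT)]
print(rows[0])  # [0, 1, 2, 3, 4, 5, 6, 7]
print(rows[1])  # [8, 9, 10, 11, 12, 13, 14, 15]
```

All 64 slots are written exactly once, which is what you want from an index formula: no gaps and no two threads landing on the same element.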


The same theory applies when working with 3 dimensions but as it gets fiddly I will only demonstrate up to 2 dimensions. You just need to know that in 3 dimensions the buffer position is calculated as “int id = dispatchID.x + dispatchID.y * groupSizeX + dispatchID.z * groupSizeX * groupSizeY” where group size is the number of groups times number of threads for that dimension.
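
For anyone who wants to check the 3 dimensional formula without writing a shader, here is a Python sketch of it together with its inverse. The function names and the 8 by 8 by 4 size are assumptions picked for the example, with size_x and size_y standing in for groupSizeX and groupSizeY.

```python
def flat_index_3d(x, y, z, size_x, size_y):
    """Flatten a 3D dispatch id into a position in a 1D buffer."""
    return x + y * size_x + z * size_x * size_y


def unflatten_3d(index, size_x, size_y):
    """Recover (x, y, z) from a flattened position."""
    z, remainder = divmod(index, size_x * size_y)
    y, x = divmod(remainder, size_x)
    return x, y, z


# Hypothetical total thread counts per dimension (groups * threads).
SIZE_X, SIZE_Y, SIZE_Z = 8, 8, 4

indices = [
    flat_index_3d(x, y, z, SIZE_X, SIZE_Y)
    for z in range(SIZE_Z)
    for y in range(SIZE_Y)
    for x in range(SIZE_X)
]

print(flat_index_3d(3, 2, 1, SIZE_X, SIZE_Y))  # 83
print(unflatten_3d(83, SIZE_X, SIZE_Y))        # (3, 2, 1)
```

Note that the size of the last dimension never appears in the formula. It only determines how large the buffer has to be.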



You should also have an understanding of how the semantics work. Take for example this kernel argument…



int3 dispatchID : SV_DispatchThreadID


SV_DispatchThreadID is the semantic and tells the GPU what value it should pass in for this argument. The name of the argument does not matter. You can call it what you want. For example this argument works the same as above.



Also the variable type can be changed. For example…

See the int3 has been changed to int. This is fine if you are only working with 1 dimension. You could also just use an int2 for 2 dimensions, and you could use an unsigned int (uint) instead of an int if you choose.



Since we now have two kernels in the shader we also need to tell the GPU which kernel we want to run when we make the dispatch call. Each kernel is given an id in the order they appear. Our first kernel would be id 0 and the next is id 1. When the number of kernels in a shader becomes larger this can become a bit confusing and it’s easy to set the wrong id. We can solve this by asking the shader for the kernel’s id by name. This line here “int kernel = shader.FindKernel("CSMain2");” gets the id of kernel “CSMain2”. We then use this id when setting the buffer and making the dispatch call.

【When a compute shader has multiple kernels, the Dispatch method needs an id to identify which kernel to use; FindKernel can be used to look it up.】
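
The idea of looking an id up by name instead of hard coding it can be modelled in a few lines of Python. This is only a toy model and not how Unity implements FindKernel. It assumes ids follow the order of the “#pragma kernel” lines, and the function name kernel_ids is made up.

```python
def kernel_ids(shader_source):
    """Map each kernel name to an id, numbered in the order
    its '#pragma kernel' line appears in the source."""
    ids = {}
    for line in shader_source.splitlines():
        parts = line.split()
        if len(parts) >= 3 and parts[0] == "#pragma" and parts[1] == "kernel":
            # setdefault keeps the first id if a name is listed twice.
            ids.setdefault(parts[2], len(ids))
    return ids


SOURCE = """
#pragma kernel CSMain1
#pragma kernel CSMain2
"""

ids = kernel_ids(SOURCE)
print(ids)             # {'CSMain1': 0, 'CSMain2': 1}
print(ids["CSMain2"])  # 1
```

The point of the lookup is that the name stays valid if another kernel is later added above it, while a hard coded id would silently start pointing at the wrong kernel.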


About now you may be thinking that this concept of groups of threads is a bit confusing. Why can’t I just use one group of threads? Well you can, but just know that there is a reason that threads are arranged into groups by the GPU. For a start a thread group is limited in the number of threads it can have (defined by the line “[numthreads(x,y,z)]” in the shader). This limit is currently 1024 but may change with new hardware. For example you can have a maximum of “numthreads(1024,1,1)” for 1D, “numthreads(32,32,1)” for 2D and so on. You can however have any number of groups of threads, and as you will often be processing data with millions of elements the concept of thread groups is essential. Threads in a group can also share memory and this can be used to make dramatic performance gains for certain algorithms, but I will cover that in a future post.
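
Working out how many groups to dispatch for a given amount of data is a rounding-up division. The Python sketch below shows the arithmetic. The 1024 limit is the figure quoted above, while the function name groups_needed and the group size of 256 are assumptions for the example.

```python
MAX_THREADS_PER_GROUP = 1024  # the limit quoted in the text


def groups_needed(element_count, threads_per_group):
    """Number of groups required so that every element gets a thread."""
    if not 1 <= threads_per_group <= MAX_THREADS_PER_GROUP:
        raise ValueError("threads_per_group must be between 1 and 1024")
    # Integer division that rounds up instead of down.
    return (element_count + threads_per_group - 1) // threads_per_group


groups = groups_needed(1_000_000, 256)
print(groups)                    # 3907
print(groups * 256 - 1_000_000)  # 192 surplus threads in the last group
```

When the element count is not a multiple of the group size, the last group has threads with nothing to do. A common way to handle this is to have the kernel compare its dispatch id against the element count and return early.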



Well I think that about covers kernels and thread groups. There is just one more thing I want to cover: how to pass uniforms into your shader. This works the same as in Cg shaders but there is no uniform keyword. For the most part this is relatively simple but there are a few “gotchas” so I will briefly go over it.



For example if you want to pass in a float you need this line in the shader…


and this line in your script…

To set a vector you need this in the shader…

and this in the script…

You can only pass in a Vector4 from the script but your uniform can be a float, float2, float3 or float4. It will be filled with the appropriate values.
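
Here is a small Python model of what “filled with the appropriate values” means. It assumes the uniform receives the leading components of the vector (x for a float, x and y for a float2, and so on). The function name fill_uniform is made up and this is not a Unity API.

```python
def fill_uniform(vector4, component_count):
    """Return the leading components that a float (1), float2 (2),
    float3 (3) or float4 (4) uniform would receive from a 4-component vector."""
    if len(vector4) != 4:
        raise ValueError("expected exactly four components")
    if not 1 <= component_count <= 4:
        raise ValueError("component_count must be between 1 and 4")
    return tuple(vector4[:component_count])


vec = (1.0, 2.0, 3.0, 4.0)
print(fill_uniform(vec, 1))  # (1.0,)
print(fill_uniform(vec, 3))  # (1.0, 2.0, 3.0)
```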


Now here’s where it gets tricky. You can pass in arrays of values. Note that this first example won’t work. I will explain why. You need this line in your shader…


and this in your script…

Now this won’t work. Whether this is by design or a bug in Unity I don’t know. You need to use vectors as uniforms for this to work. In your shader…

and your script…

So here we have an array of two float4s and it is set from an array of 8 floats from a script. The same principles apply when setting matrices. In your shader…

and your script…

And of course you can have arrays of matrices. In your shader…

and your script…

This same logic does not seem to apply to float2x2 or float3x3. Again, whether this is a bug or design I don’t know.

DirectCompute tutorial for Unity 1: Introduction




At this time Unity has just started to support Microsoft’s DirectX 11, and with DirectX 11 came the DirectCompute API. This API opens up a whole new way to use the GPU by writing compute shaders.

【GPGPU in Unity is great to use, but the lack of tutorials makes it very painful for users, so the author decided to write a series.】


I hope to start off with an introduction about why DirectCompute was needed and how it differs from the traditional graphics pipeline, how to write the kernels via compute shaders, how the internal GPU tiling of threads works, how to access textures, how to use the various buffer types, how to use thread synchronization and shared memory and finally how to maximize performance.


The Graphics pipeline


All developers would need to do was send the geometry that they needed to render to the GPU and OpenGL would pass it through the pipeline until the end output was pixels displayed on the screen. This pipeline was fixed and although developers could enable and disable certain parts as well as adjust some settings there was not a lot that could be changed. For a time this was all that was needed, but as developers started to push the GPU further and demand more advanced features it became apparent that this pipeline had to become more flexible. The solution was to make certain parts of the pipeline programmable. Developers could now write their own programs to perform certain parts of the pipeline. These new programs became known as shaders (there is a technical distinction between programs and shaders but that’s another story).



This new programmable pipeline opened up whole new possibilities and without it we would not have the quality of graphics you see in the previous generation of games. While the pipeline was originally only developed for creating graphics, the new flexibility meant that the GPU could now be used to process many types of algorithms. Researchers quickly began to modify algorithms to run in the multi-threaded environment of the GPU, and areas as diverse as physics, finance, mathematics, medicine and many more began to use the GPU to process their data. The raw power of the GPU was too hard to resist and GPGPU, or general-purpose computing on graphics processing units, was born.



General-purpose GPU programming


GPGPU quickly began to go mainstream and enter industrial use. There was still one problem however. Graphics APIs were still tied down to the graphics pipeline. It didn’t matter how much flexibility shaders gave you over a certain stage of the pipeline, they still had to conform to the restrictions of the pipeline. Vertex shaders still have to output vertices and fragment shaders still have to output pixels. While great strides were made and creative workarounds were developed, it soon became apparent that things had to change if GPGPU was to develop further. An API was needed that would free developers from the shackles of the graphics pipeline and provide an environment where the raw power of the GPU could be harnessed in a non-graphics related setting.



Over the next few years GPGPU APIs started to appear and currently developers are spoilt for choice between CUDA, OpenCL and DirectCompute. These new APIs presented a whole new way to work with the GPU and are no longer bound to the traditional graphics pipeline.

【The GPGPU APIs appeared to solve the problem described above】


While the desire for GPGPU was to provide a way to work with the GPU in a non-graphics related setting, the games industry was naturally just as eager to make use of the new capabilities. As games have become more realistic they have started to simulate real world physics. These computations are often complicated and demanding to process. The power and flexibility offered by GPGPU has allowed these computations to be performed in ever greater detail and scope. The original graphics APIs opened up a whole new world in 3D games which we now all take for granted. The new GPGPU APIs are now doing the same thing and bringing in a whole new era in gaming. This current generation of gaming will be dominated by the uses of these APIs, resulting in ever more realistic graphics and believable physics.



All you need to do now is learn how to use them. Stay tuned for the next part which will show you how to set up the kernels and how the tiling of threads works.