Tag: GPU

GPU Gems – Animation in the “Dawn” Demo

4.1 Introduction

 
 

“Dawn” is a demonstration that was created by NVIDIA Corporation to introduce the GeForce FX product line and illustrate how a high-level language (such as HLSL or Cg) could be used to create a realistic human character. The vertex shaders deform a high-resolution mesh through indexed skinning and morph targets, and they provide setup for the lighting model used in the fragment shaders. The skin and wing fragment shaders offer both range and detail that could not have been achieved before the introduction of advanced programmable graphics hardware. See Figure 4-1.

【Dawn is NVIDIA's demonstration of how HLSL can be applied to a realistic human character on its new product line: the vertex shaders handle the morphing and skinning, and the fragment shaders handle the lighting.】

 
 


Figure 4-1 A Screen Capture of the Real-Time Dawn

 
 

This chapter discusses how programmable graphics hardware was used to accelerate the animation of the Dawn character in the demo.

【This chapter covers how programmable graphics hardware was used to accelerate the animation of the Dawn character.】

 
 

 
 

 
 

4.2 Mesh Animation

 
 

Traditionally, mesh animation has been prohibitively expensive for complex meshes because it was performed on the CPU, which was already burdened with physical simulation, artificial intelligence, and other computations required by today’s applications. Newer graphics hardware has replaced the traditional fixed-function pipeline with programmable vertex and fragment shaders, and it can now alleviate some of that burden from the CPU.

【Traditional mesh animation is prohibitively expensive because it runs on the CPU, which is already loaded with plenty of other work; the vertex/fragment shaders on newer graphics hardware can take over part of that work.】

 
 

Sometimes it is still necessary to perform such operations on the CPU. Many stencil-based shadow volume techniques must traverse the transformed mesh in order to find the silhouette edges, and the generation of the dynamic shadow frustum is often best done on the CPU (see Chapter 9, “Efficient Shadow Volume Rendering”). In scenes where the character is drawn multiple times per frame into shadow buffers, glow buffers, and other such temporary surfaces, it may be better to perform the deformations on the CPU if the application becomes vertex-limited. Deciding whether to perform mesh deformations on the CPU or on the GPU should be done on a per-application or even on a per-object basis.

【Sometimes the mesh deformation still has to be done on the CPU, for example when stencil shadow volumes need to walk the transformed mesh to find silhouette edges. And when a character is rendered many times per frame into shadow/glow buffers, a vertex-limited application may be better off deforming on the CPU. Decide per application, or even per object.】

 
 

The modeling, texturing, and animation of the Dawn character were done primarily in Alias Systems’ Maya package. We therefore based our mesh animation methods on the tool set the software provides. We have since created a similar demo (“Dusk,” used to launch the GeForce FX 5900) in discreet’s 3ds max package, using the same techniques; these methods are common to a variety of modeling packages and not tied to any single workflow. The methods used in these two demos are (indexed) skinning, where vertices are influenced by a weighted array of matrices, and weighted morph targets, used to drive the emotions on Dawn’s face.

【The Dawn art assets come from Maya and drive the morphing demo described here.】

 
 

 
 

4.3 Morph Targets

 
 

Using morph targets is a common way to represent complex mesh deformation, and the NVIDIA demo team has created a variety of demos using this technique. The “Zoltar” demo and the “Yeah! The Movie” demo (content provided by Spellcraft Studio) started with 30 mesh interpolants per second, then removed mesh keys based on an accumulated error scheme. This allowed us to reduce the file size and the memory footprint—up to two-thirds of the original keys could be removed with little to no visible artifacts. In this type of mesh interpolation, there are only two interpolants active at any given time, and they are animated sequentially.

【Morph targets are a common way to represent mesh deformation; NVIDIA has built several demos around the technique.】

 
 

Alternatively, morph targets can be used in parallel. Dawn is a standard example of how this approach can be useful. Beginning with a neutral head (27,000 triangles), our artist created 50 copies of that head and modeled them into an array of morph targets, as shown in Figure 4-2. Approximately 30 of those heads corresponded to emotions (such as happy, sad, thoughtful, and so on), and 20 more were modifiers (such as left eyebrow up, right eyebrow up, smirk, and so on). In this style of animation, the morph target weights will probably not add to 1, because you may have (0.8 * happy + 1.0 * ear_wiggle), for example—Dawn is a fairy, after all.

【Morph targets can also be used in parallel. Dawn's head has 27,000 triangles, and 50 copies of it were modeled as an array of morph targets; the figure below shows some of them. The final result can be a weighted sum of several targets.】

 
 


Figure 4-2 Emotional Blend Targets (Blend Shapes)

 
 

Although such complex emotional faces could have been made entirely of blends of more elemental modifiers, our artist found it more intuitive to model the face in the pose he desired, because it is hard to model an element such as an eyebrow creasing, without seeing how the eyes, cheeks, and mouth work together. This combination also helps with hardware register limitations, described later.

【Complex emotional faces could have been assembled purely from elemental modifiers, but the artist found it more intuitive to model whole expressions, since a detail such as a creasing eyebrow is hard to judge without seeing the eyes, cheeks, and mouth together. This combination also helps with the hardware register limits described later.】

 
 

 
 

4.3.1 Morph Targets in a High-Level Language

 
 

Luckily, the implementation of morph targets in HLSL or Cg is simple. Assuming that vertexIn is our structure containing per-vertex data, applying morph targets in a linear or serial fashion is easy:

【Fortunately, implementing morph targets in HLSL or Cg is simple. Interpolating between the previous and next position keys over time looks like this:】

 
 

float4 position = (1.0f - interp) * vertexIn.prevPositionKey + interp * vertexIn.nextPositionKey;

 
 

In this code, interp is a constant input parameter in the shader, but prevPositionKey and nextPositionKey are the positions at the prior time and next time, respectively. When applying morph targets in parallel, we find the spatial difference between the morph target and the neutral pose, which results in a difference vector. We then weight that difference vector by a scalar. The result is that a weight of 1.0 will apply the per-vertex offsets to achieve that morph target, but each morph target can be applied separately. The application of each morph target is just a single “multiply-add” instruction:

【interp is a constant input, and prevPositionKey/nextPositionKey are the positions at the previous and next key times. Applying several morph targets at the same time is similar: weight each difference vector and accumulate, as below.】

 
 

// vertexIn.positionDiffN = position morph target N - neutralPosition

float4 position = neutralPosition;
position += weight0 * vertexIn.positionDiff0;
position += weight1 * vertexIn.positionDiff1;
position += weight2 * vertexIn.positionDiff2;

 
 

 
 

4.3.2 Morph Target Implementation

 
 

We wanted our morph targets to influence both the vertex position and the basis (that is, the normal, binormal, and tangent) so that they might influence the lighting performed in the fragment shader. At first it would seem that one would just execute the previous lines for position, normal, binormal, and tangent, but it is easy to run out of vertex input registers. When we wrote the “Dawn” and “Dusk” demos, the GPU could map a maximum of 16 per-vertex input attributes. The mesh must begin with the neutral position, normal, binormal, texture coordinate, bone weights, and bone indices (described later), leaving 10 inputs open for morph targets. We might have mapped the tangent as well, but we opted to take the cross product of the normal and binormal in order to save one extra input.

【We want the morph targets to affect both the vertex position and the basis (normal, binormal, tangent), which feed the lighting in the fragment shader. The catch is that the number of vertex input registers is limited: whatever the base mesh needs is gone, and what remains is what the morph targets can use. Storing only the normal and binormal and deriving the tangent with a cross product saves one input.】
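
To make the register budget concrete, here is a hypothetical Cg vertex input declaration; the attribute names mirror the ones used in this chapter, but the slot assignment is only an illustrative sketch, not the demo's actual code.

struct VertexIn
{
    float4 neutralPosition : POSITION;    // slot 1
    float4 neutralNormal   : NORMAL;      // slot 2 (.w reused for occlusion, see Section 4.3.2)
    float3 neutralBinormal : BINORMAL;    // slot 3
    float2 texCoord        : TEXCOORD0;   // slot 4
    float4 boneWeights0    : TEXCOORD1;   // slot 5
    float4 boneIndices0    : TEXCOORD2;   // slot 6
    // 10 slots remain for morph targets: five (positionDiff, normalDiff) pairs
    float4 positionDiff0   : TEXCOORD3;
    float4 normalDiff0     : TEXCOORD4;
    float4 positionDiff1   : TEXCOORD5;
    float4 normalDiff1     : TEXCOORD6;
    // ... through positionDiff4 / normalDiff4 in the remaining slots
};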

 
 

Because each difference vector takes one input, we might have 10 blend shapes that influence position, five blend shapes that influence position and normal, three position-normal-binormal blend shapes, or two position-normal-binormal-tangent blend shapes. We ultimately chose to have our vertex shader apply five blend shapes that modified the position and normal. The vertex shader would then orthonormalize the neutral tangent against the new normal (that is, subtract the collinear elements of the new normal from the neutral tangent and then normalize) and take the cross product for the binormal. Orthonormalization is a reasonable approximation for meshes that do not twist around the surface normal:

【Each difference vector takes one input, and the blend of the active ones produces the final shape. We settled on five blend shapes that modify position and normal; the new tangent is then computed as follows:】

 
 

// assumes normal is the post-morph-target result
// normalize only needed if not performed in fragment shader

float3 tangent = vertexIn.neutralTangent - dot(vertexIn.neutralTangent, normal) * normal;
tangent = normalize(tangent);

 
 

Thus, we had a data set with 50 morph targets, but only five could be active (that is, with weight greater than 0) at any given time. We did not wish to burden the CPU with copying data into the mesh every time a different blend shape became active, so we allocated a mesh with vertex channels for neutralPosition, neutralNormal, neutralBinormal, textureCoord, and 50 * (positionDiff, NormalDiff). On a per-frame basis, we merely changed the names of the vertex input attributes so that those that should be active became the valid inputs and those that were inactive were ignored. For each frame, we would find those five position and normal pairs and map those into the vertex shader, allowing all other vertex data to go unused.

【So we have 50 morph targets but only five can be active at any time. We did not want the CPU to copy data into the mesh every time a different blend shape became active, so the mesh was allocated with vertex channels for neutralPosition, neutralNormal, neutralBinormal, textureCoord, and 50 (positionDiff, normalDiff) pairs. Each frame we simply rename the vertex input attributes so the active pairs become the shader's inputs and the rest are ignored.】

 
 

Note that the .w components of the positionDiff and normalDiff were not really storing any useful interpolants. We took advantage of this fact and stored a scalar self-occlusion term in the .w of the neutralNormal and the occlusion difference in each of the normal targets. When extracting the resulting normal, we just used the .xyz modifier to the register, which allowed us to compute a dynamic occlusion term that changed based on whether Dawn’s eyes and mouth were open or closed, without any additional instructions. This provided for a soft shadow used in the lighting of her skin (as described in detail in Chapter 3, “Skin in the ‘Dawn’ Demo”).

【The .w components of positionDiff/normalDiff are not needed for the interpolation, so a self-occlusion term is stored there (the base value in neutralNormal.w and the per-target difference in each normal target's .w). Reading the result with .xyz gives the normal, and the occlusion term comes along for free, saving space and instructions.】
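
A minimal sketch of the idea, assuming the layout just described (the occlusion scalar rides along in the .w channels):

float4 n = vertexIn.neutralNormal;        // .xyz = neutral normal, .w = neutral occlusion
n += weight0 * vertexIn.normalDiff0;      // each target's .w holds its occlusion difference
n += weight1 * vertexIn.normalDiff1;
// ... remaining active targets ...
float3 normal    = normalize(n.xyz);      // the .xyz modifier extracts the geometric normal
float  occlusion = n.w;                   // dynamic self-occlusion term, effectively free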

 
 

On the content-creation side, our animator had no difficulty remaining within the limit of five active blend shapes, because he primarily animated between three or so emotional faces and then added the elemental modifiers for complexity. We separated the head mesh from the rest of the body mesh because we did not want the added work of doing the math or storing the zero difference that, say, the happy face would apply to Dawn’s elbow. The result remained seamless—despite the fact that the head was doing morph targets and skinning while the body was doing just skinning—because the outermost vertices of the face mesh were untouched by any of the emotional blend shapes. They were still modified by the skinning described next, but the weights were identical to the matching vertices in the body mesh. This ensured that no visible artifact resulted.

【On the content-creation side, five active blend shapes were plenty to hit the target expressions. The head mesh is split from the body so the body does not have to store or process zero-valued morph data; the seam stays invisible because the outer face vertices are untouched by the emotional shapes and share identical skinning weights with the matching body vertices.】

 
 

 
 

4.4 Skinning

 
 

Skinning is a method of mesh deformation in which each vertex of that mesh is assigned an array of matrices that act upon it along with weights (that should add up to 1.0) that describe how bound to that matrix the vertex should be. For example, vertices on the bicep may be acted upon only by the shoulder joint, but a vertex on the elbow may be 50 percent shoulder joint and 50 percent elbow joint, becoming 100 percent elbow joint for vertices beyond the curve of the elbow.

【Skinning is driving the mesh with a skeleton: each mesh vertex stores the matrices that influence it plus weights, and its new position is the weighted blend of those bone transforms.】

 
 

Preparing a mesh for skinning usually involves creating a neutral state for the mesh, called a bind pose. This pose keeps the arms and legs somewhat separated and avoids creases as much as possible, as shown in Figure 4-3. First, we create a transform hierarchy that matches this mesh, and then we assign matrix influences based on distance—usually with the help of animation tools, which can do this reasonably well. Almost always, the result must be massaged to handle problems around shoulders, elbows, hips, and the like. This skeleton can then be animated through a variety of techniques. We used a combination of key-frame animation, inverse kinematics, and motion capture, as supported in our content-creation tool.

【Prepare the mesh in a bind pose, a predefined neutral pose chosen to avoid creases and other awkward configurations; build a matching joint hierarchy, assign matrix influences (then hand-fix problem areas such as shoulders and elbows), and animate the skeleton with keyframes, IK, and motion capture. Skinning as described above then deforms the mesh.】

 
 


Figure 4-3 Dawn’s Bind Pose

 
 

A skinned vertex is the weighted summation of that vertex being put through its active joints, or:

【Formula: the final vertex position is a weighted sum of the vertex transformed by each influencing joint; the matrix product appears because of the joint hierarchy and the bind pose.】

$$\text{skinnedVertex} = \sum_{i} w_i \left( \mathbf{M}_i \, \mathbf{B}_i^{-1} \right) \text{vertex}, \qquad \sum_{i} w_i = 1$$

where $\mathbf{M}_i$ is the world matrix of joint $i$ and $\mathbf{B}_i^{-1}$ is the inverse of that joint's bind-pose matrix.

Conceptually, this equation takes the vertex from its neutral position into a weighted model space and back into world space for each matrix and then blends the results. The concatenated $\mathbf{M}_i \mathbf{B}_i^{-1}$ matrices are stored as constant parameters, and the matrix indices and weights are passed as vertex properties. The application of four-bone skinning looks like this:

【The computation above goes through model space and back out to world space for each matrix; the implementation is below.】

 
 

float4 skin(float4x4 bones[98],
            float4 position,
            float4 boneWeights0,
            float4 boneIndices0)
{
    float4 result = boneWeights0.x * mul(bones[boneIndices0.x], position);
    result = result + boneWeights0.y * mul(bones[boneIndices0.y], position);
    result = result + boneWeights0.z * mul(bones[boneIndices0.z], position);
    result = result + boneWeights0.w * mul(bones[boneIndices0.w], position);
    return result;
}

 
 

In the “Dawn” demo, we drive a mesh of more than 180,000 triangles with a skeleton of 98 bones. We found that four matrices per vertex was more than enough to drive the body and head, so each vertex had to have four bone indices and four bone weights stored as vertex input attributes (the last two of the 16 xyzw vertex registers mentioned in Section 4.3.2). We sorted bone weights and bone indices so that we could rewrite the vertex shader to artificially truncate the number of bones acting on the vertex if we required higher vertex performance. Note that if you do this, you must also rescale the active bone weights so that they continue to add up to 1.

【In the Dawn demo the mesh exceeds 180,000 triangles and the skeleton has 98 bones. Four bones per vertex proved more than enough; if you truncate the bone count for performance, remember to rescale the remaining weights so they still sum to one.】
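
For example, a hypothetical two-bone variant of the skin() function above might rescale the two largest (pre-sorted) weights like this:

// Sketch only: boneWeights0 is assumed to be sorted in descending order.
float2 w = boneWeights0.xy / (boneWeights0.x + boneWeights0.y);   // renormalize so w.x + w.y = 1
float4 result = w.x * mul(bones[boneIndices0.x], position)
              + w.y * mul(bones[boneIndices0.y], position);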

 
 

4.4.1 Accumulated Matrix Skinning

 
 

When skinning, one must apply the matrix and its bind pose inverse not only to the position, but also to the normal, binormal, and tangent for lighting to be correct. If your hierarchy cannot assume that scales are the same across x, y, and z, then you must apply the inverse transpose of this concatenated matrix. If scales are uniform, then the inverse is the transpose, so the matrix remains unchanged. Nonuniform scales create problems in a variety of areas, so our engine does not permit them.

【When skinning, the matrices must be applied not only to the position but to the rest of the basis as well; note that uniform scale is required here (otherwise the inverse transpose is needed).】
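
With uniform scales the basis vectors can be pushed through the same skin() routine; a sketch (direction vectors use w = 0 so translation is ignored):

float4 skinnedNormal = skin(bones, float4(normal, 0.0), boneWeights0, boneIndices0);
skinnedNormal.xyz = normalize(skinnedNormal.xyz);   // renormalize after the weighted blend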

 
 

If we call the skin function from the previous code, we must call mul for each matrix for each vertex property. In current hardware, multiplying a point by a matrix is implemented as four dot products and three adds, and vector-multiply is three dot products and two adds. Thus, four-bone skinning of position, normal, binormal, and tangent results in:

【Instruction count for four-bone skinning of position, normal, binormal, and tangent: one point-matrix multiply is the first parenthesis below; repeating it for four matrices and adding the three basis vectors gives 88 instructions in total.】

$$4 \times (4\ \text{DOTs} + 3\ \text{ADDs}) \;+\; 4 \times 3 \times (3\ \text{DOTs} + 2\ \text{ADDs}) \;=\; 28 + 60 \;=\; 88\ \text{instructions}$$
 
 

An unintuitive technique that creates the sum of the weighted matrices can be trivially implemented in HLSL or Cg as follows:

【Building the weighted sum of the matrices on the GPU instead:】

 
 

float4x4 accumulate_skin(float4x4 bones[98],
                         float4 boneWeights0,
                         float4 boneIndices0)
{
    float4x4 result = boneWeights0.x * bones[boneIndices0.x];
    result = result + boneWeights0.y * bones[boneIndices0.y];
    result = result + boneWeights0.z * bones[boneIndices0.z];
    result = result + boneWeights0.w * bones[boneIndices0.w];
    return result;
}

 
 

Although this technique does burn instructions to build the accumulated matrix (16 multiplies and 12 adds), it now takes only a single matrix multiply to skin a point or vector. Skinning the same properties as before costs:

【This reduces the total instruction count.】

$$(16\ \text{MULs} + 12\ \text{ADDs}) \;+\; (4\ \text{DOTs} + 3\ \text{ADDs}) \;+\; 3 \times (3\ \text{DOTs} + 2\ \text{ADDs}) \;=\; 28 + 7 + 15 \;=\; 50\ \text{instructions}$$
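
A sketch of how the accumulated matrix might then be used, with a single mul per property:

float4x4 skinMat = accumulate_skin(bones, boneWeights0, boneIndices0);
float4 skinnedPosition = mul(skinMat, position);
float3 skinnedNormal   = normalize(mul((float3x3)skinMat, normal));
float3 skinnedBinormal = normalize(mul((float3x3)skinMat, binormal));
float3 skinnedTangent  = normalize(mul((float3x3)skinMat, tangent));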
 
 

 
 

4.5 Conclusion

 
 

It is almost always beneficial to offload mesh animation from the CPU and take advantage of the programmable vertex pipeline offered by modern graphics hardware. Having seen the implementation of skinning and morph targets using shaders, however, it is clear that the inner loops are quite easy to implement using Streaming SIMD Extensions (SSE) instructions and the like, and that in those few cases where it is desirable to remain on the CPU, these same techniques work well.

 
 

In the case of the “Dawn” demo, morph targets were used to drive only the expressions on the head. If we had had more time, we would have used morph targets all over the body to solve problems with simple skinning. Even a well-skinned mesh has the problem that elbows, knees, and other joints lose volume when rotated. This is because the mesh bends but the joint does not get “fatter” to compensate for the pressing of flesh against flesh. A morph target or other mesh deformation applied either before or after the skinning step could provide this soft, fleshy deformation and create a more realistic result. We have done some work on reproducing the variety of mesh deformers provided in digital content-creation tools, and we look forward to applying them in the future.

【Mostly closing remarks; no notes needed.】

 
 

【There is no single technique here worth committing to memory; the main point is that NVIDIA's hardware is powerful enough to run a character with this much skinning work. Such a complex avatar has limited practical value, and the GPU skinning optimizations described should theoretically yield less than a 50% gain, with real-world results likely worse.】

 
 

4.6 References

 
 

Alias Systems. Maya 5.0 Devkit. <installation_directory>/devkit/animEngine/

 
 

Alias Systems. Maya 5.0 Documentation.

 
 

Eberly, David H. 2001. 3D Game Engine Design, pp. 356–358. Academic Press.

 
 

Gritz, Larry, Tony Apodaca, Matt Pharr, Dan Goldman, Hayden Landis, Guido Quaroni, and Rob Bredow. 2002. “RenderMan in Production.” Course 16, SIGGRAPH 2002.

 
 

Hagland, Torgeir. 2000. “A Fast and Simple Skinning Technique.” In Game Programming Gems, edited by Mark DeLoura. Charles River Media.

 
 

DirectCompute tutorial for Unity 7: Counter Buffers

To continue the series, this tutorial is about counter buffers. All the compute buffer types in Unity have a count property. For structured buffers this is a fixed value as they act like an array. For append buffers the count changes as the buffer is appended to or consumed from, as they act like a dynamic array.

(Every compute buffer type has a count property.)

 

DirectCompute also provides a way to manually increment and decrement the count value. This gives you greater freedom over how the buffer stores its elements and should allow for custom data containers to be implemented. I have seen this used to create an array of linked lists, all completely done on the GPU.

(The count can also be incremented/decremented manually, which allows custom containers on top of append/consume-style buffers.)

 

First start by creating a new script and paste in the following code. The code does not do anything interesting. It just creates the buffer, runs the shader and then prints out the buffer contents.

(Code: create the buffer, run the shader, and print the buffer contents.)
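
The original post shows the script as an image; a minimal sketch of what it might look like (class, buffer and kernel names are assumptions, not the post's actual code):

using UnityEngine;

public class CounterBufferExample : MonoBehaviour
{
    public ComputeShader shader;
    const int count = 16;

    void Start()
    {
        // Create a counter buffer and reset its internal count to 0.
        ComputeBuffer buffer = new ComputeBuffer(count, sizeof(int), ComputeBufferType.Counter);
        buffer.SetCounterValue(0);

        shader.SetBuffer(0, "buffer", buffer);
        shader.Dispatch(0, count, 1, 1);

        // Read back and print the contents.
        int[] data = new int[count];
        buffer.GetData(data);
        for (int i = 0; i < count; i++) Debug.Log(data[i]);

        buffer.Release();
    }
}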

 

 

On this line you will see the creation of the buffer.

(The buffer-creation line.)

 

 

Note the buffer's type is Counter. The buffer's count value is also set to zero. I recommend doing this when the buffer is created, as Unity will not always create the buffer with its count set to zero. I am not sure why; I think it may be a bug.

(Note the buffer type is Counter.)

 

Next create a new compute shader and paste in the following code.

(Then create a compute shader.)
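
A sketch of such a kernel, assuming the buffer and kernel names used in the script sketch above:

#pragma kernel CSMain

// Declared as a plain RWStructuredBuffer - there is no "counter buffer" type in HLSL.
RWStructuredBuffer<int> buffer;

[numthreads(1,1,1)]
void CSMain(uint3 id : SV_DispatchThreadID)
{
    // IncrementCounter returns the count value as it was *before* the increment.
    uint index = buffer.IncrementCounter();
    // Store the returned count so the script can print it out.
    buffer[index] = (int)index;
}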

 

 

First notice the way the buffer is declared.

(Note the buffer declaration here: it is not a counter buffer type!)

 

It’s just declared as a structured buffer. There is no counter type buffer.

 

The buffers count value is then incremented here. The increment function will also return what the count value was before it gets incremented.

(In the kernel the counter is incremented; IncrementCounter returns the count as it was before the increment.)

 

Then the count is stored in the buffer so we can print out the contents to check it worked.

(The count is then stored in the buffer so we can print it out.)

 

If you run the scene you should see the numbers 0 – 15 printed out.

(The result prints the values 0–15.)

 

So how do you decrement the counter? You guessed it. With the decrement function.

(Next, how to decrement the counter:)
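
In a kernel that looks something like this (a sketch):

// DecrementCounter returns the count value *after* the decrement.
uint index = buffer.DecrementCounter();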

 

The decrement function will also return what the count value was after it gets decremented.

 

Now let’s say you have run a shader that increments and adds elements to the buffer but you don’t know how many were added. How do you find out the count of the buffer? Well you may recall from the append buffer tutorial that you can use Unity’s CopyCount function to retrieve the count value. You can do the same with the counter buffer. Add this code to the end of the start function. It should print out that the buffers count is 16.

(As with append buffers, CopyCount retrieves the counter buffer's count, so you can read back how many elements were added.)
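
A sketch of retrieving the count with CopyCount (the argBuffer name is an assumption; older Unity versions use ComputeBufferType.DrawIndirect instead of IndirectArguments):

// A small int buffer to receive the count.
ComputeBuffer argBuffer = new ComputeBuffer(1, sizeof(int), ComputeBufferType.IndirectArguments);
ComputeBuffer.CopyCount(buffer, argBuffer, 0);

int[] args = new int[1];
argBuffer.GetData(args);
Debug.Log("count = " + args[0]);   // should print 16

argBuffer.Release();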

 

 

DirectCompute tutorial for Unity 6: Consume buffers

This tutorial will be covering how to use consume buffers in Direct Compute. This tutorial was originally going to be combined with the append buffer tutorial as consume and append buffers are kinda the same thing. I decided it was best to split it up because the tutorial would have been a bit too long. In the last tutorial I had to add a edit because it turns out that there are some issues with using append buffers in Unity. It looks like they do not work on some graphics cards. There has been a bug report submitted and hopefully this will be fixed some time in the future. To use consume buffers you need to use append buffers so the same issue applies to this tutorial. If the last one did not work on your card neither will this one.

(Like append buffers, consume buffers are not well supported on all hardware; keep that in mind if you hit strange problems below.)

 

I also want to point out that when using append or consume buffers, if you make a mistake in the code it can cause unpredictable results when run, even if you later fix the code and run it again. If this happens, especially if the error caused the GPU to crash, it is best to restart Unity to clear the GPU context.

(After a crash it is best to restart Unity to clear the GPU context. Moving on.)

 

To get started you will need to add some data to an append buffer, as you can only consume data from an append buffer. Create a new C# script and add this code.

(To get started, the C# code is as follows.)

 

 

Here we are simply creating an append buffer and then adding a position to it from the “appendBufferShader” for each thread that runs.

(Create an append buffer and fill it with a position per thread from the appendBufferShader.)
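
The relevant lines might look like this sketch (variable names such as appendBuffer, appendBufferShader and width are assumptions):

appendBuffer = new ComputeBuffer(width * width, sizeof(float) * 3, ComputeBufferType.Append);
appendBuffer.SetCounterValue(0);

appendBufferShader.SetBuffer(0, "appendBuffer", appendBuffer);
appendBufferShader.SetFloat("width", width);
appendBufferShader.Dispatch(0, width / 8, width / 8, 1);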

 

We also need a shader to render the results. The  “Custom/AppendExample/BufferShader” shader posted in the last tutorial can be used so I am not going to post the code again for that. You can find it in the append buffer tutorial or just download the project files (links at the end of this tutorial).

(We also need a shader to render the result; reuse the one from the previous tutorial.)

 

Now attach the script to the camera, bind the material and compute shader and run the scene. You should see a grid of red points.

(Running the scene shows a grid of red points.)

 

We have appended some points to our buffer and next we will consume some. Add this variable to the script.

(Now convert it to consume. First add this variable to the script:)

 

Now add these two lines under the dispatch call to the append shader.

(Then set the buffer and dispatch the consume shader.)
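
Something along these lines (a sketch):

consumeBufferShader.SetBuffer(0, "appendBuffer", appendBuffer);
consumeBufferShader.Dispatch(0, width / 8, width / 8, 1);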

 

 

This will run the compute shader that will consume the data from the append buffer. Create a new compute shader and then add this code to it.

(Then create a new compute shader with the code below.)
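
A sketch of the consume shader, assuming the same element type and buffer name as above:

#pragma kernel CSMain

// The same buffer the script created as Append, declared here as a consume buffer.
ConsumeStructuredBuffer<float3> appendBuffer;

[numthreads(8,8,1)]
void CSMain(uint3 id : SV_DispatchThreadID)
{
    // Consume removes one element from the buffer and returns its value.
    float3 pos = appendBuffer.Consume();
}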

 

 

Now bind this shader to the script and run the scene. You should see nothing displayed. In the console you should see the vertex count as 0. So what happened to the data?

(Bind this shader to the script and run: nothing is displayed and the console shows a vertex count of 0.)

 

Its this line here that is responsible.

 

This removes an element from the append buffer each time it is called. Since we ran the same number of threads as there are elements in the append buffer, in the end everything was removed. Also notice that the consume function will return the value that was removed.

(Reason: each Consume call removes an element from the buffer, so after the shader has run for every element the buffer is empty.)

 

This is fairly simple but there are a few key steps to it. Notice that the buffer needs to be declared as a consume buffer in the compute shader like so…

(Simple in principle, but there are a few key steps. First the buffer declaration.)

 

But notice that in the script the buffer we bound to the uniform was not of the type consume. It was a append buffer. You can see so when it was created.

(And how the buffer was created in the script.)

 

There is no type consume, there is only append. How the buffer is used depends on how you declare it in the compute shader. Declare it as “AppendStructuredBuffer”  to append data to it and declare it as a “ConsumeStructuredBuffer” to consume data from it.

(The declaration in the compute shader decides how the buffer is used.)

 

Consuming data from a buffer is not without is risks. In the last tutorial I mentioned that appending more elements than the buffers size will cause the GPU to crash. What would happen if you consumed more elements than the buffer has? You guessed it. The GPU will crash. Always try and verify that your code is working as expected by printing out the number of elements in the buffer during testing.

(As with append, be careful not to go past the buffer's bounds.)

 

Removing every element from the buffer is a good way to clear the append buffer (which also appears to be the only why to clear a buffer with out recreating it) but what happens if we only remove some of the elements?

 

Edit – Unity 5.4 has added a ‘SetCounterValue’ function to the buffer so you can now use that to clear a append or consume buffer.

(You don't have to remove elements one by one; SetCounterValue can clear the whole buffer.)

 

Change the dispatch call to the consume shader to this…

(The last step is the dispatch call for the consume shader.)
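
For example (a sketch, assuming a 32 by 32 grid with 8 by 8 threads per group):

// Run only a quarter of the threads, so only a quarter of the elements are consumed.
consumeBufferShader.Dispatch(0, width / 8 / 2, width / 8 / 2, 1);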

 

Here we are only running the shader for a quarter of the elements in the buffer. But the question is which elements will be removed? Run the scene. You will see the points displayed again but some will be missing. If you look at the console you will see that there are 768 elements in the buffer now. There were 1024, and a quarter (256) have been removed to leave 768. But there is a problem. The elements removed seem to be determined at random and it will be (mostly) different each time you run the scene.

(A question to consider: when only part of the buffer is consumed, which elements get removed?)

 

This fact reveals how append buffers work and why consume buffers have limited use. These buffers are LIFO structures. The elements are added and removed in the order the kernel is run by the GPU, but as each kernel is run on its own thread the GPU can never guarantee the order they will run. Every time you run the scene the order the elements are added and removed is different.

(This depends on how the hardware schedules threads, so you cannot know.)

 

This does limit the use of consume buffers but does not mean they are useless. LIFO structures are something that has never been available on the GPU before, and as long as the elements' exact order does not matter they will allow you to perform algorithms that were impossible to do on the GPU in the past. DirectCompute also adds the ability to have some control over how threads are run by using thread synchronization, which will be covered in a later tutorial.

(This does limit how consume buffers can be used; be aware of it.)

DirectCompute tutorial for Unity 5: Append buffers

In today’s tutorial I will be expanding on the topic of buffers covered in the last tutorial by introducing append buffers. These buffers offer greater flexibility over structured buffers by allowing you to dynamically increase their size during run time from a compute or Cg shader.

(Append buffers are flexible: their size can grow at run time.)

 

The great thing about these buffers is that they can still be used as structured buffers, which makes the actual rendering of their contents simpler. Start off by creating a new shader and pasting in this code. This is what we will be using to draw the points we add into the buffer. Notice the buffer is just declared as a structured buffer.

(They can still be used like structured buffers when rendering; the drawing shader is below.)

 

 

Now let's make a script to create our buffers. Create a new C# script, paste in this code and then drag it onto the main camera in a new scene.

(The C# code is below.)

 

 

Notice this line here…

 

 

This is the creation of an append buffer. It must be of the type “ComputeBufferType.Append”, unsurprisingly. Notice we still pass in a size for the buffer (the width * width parameter). Even though append buffers need to have their elements added from a shader they still need to have a predefined size. Think of this as a reserved area of memory that the elements can be added to. This also raises a subtle error that can arise, which I will get to later.

(Create the append buffer; note it still needs an initial size.)
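
The creation line is along these lines (a sketch):

// Append type, but still created with a fixed maximum size of width * width float3 elements.
appendBuffer = new ComputeBuffer(width * width, sizeof(float) * 3, ComputeBufferType.Append);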

 

The append buffer starts off empty and we need to add to it from a shader. Notice this line here…

 

Here we are running a compute shader to fill our buffer. Create a new compute shader and paste in this code.

(A compute shader is used to fill the buffer with data.)
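
A sketch of such a compute shader (the size uniform is an assumption for the grid spacing):

#pragma kernel CSMain

AppendStructuredBuffer<float3> appendBuffer;
float size;

[numthreads(8,8,1)]
void CSMain(uint3 id : SV_DispatchThreadID)
{
    // Only append a point when both dispatch ids are even numbers.
    if (id.x % 2 == 0 && id.y % 2 == 0)
    {
        float3 pos = float3(id.x, 0.0, id.y) * size;
        appendBuffer.Append(pos);
    }
}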

 

 

This will fill our buffer with a position for each thread that runs. Nothing fancy. Notice this line here…

 

 

The “Append(pos)” is the line that actually adds a position to the buffer. Here we are only adding a position if the x and y dispatch id are even numbers. This is why append buffers are so useful. You can add to them on any conditions you wish from the shader.

(Note: Append(pos) is what actually adds a position to the buffer, and here it is only called when both the x and y dispatch ids are even.)

 

The dynamic contents of a append buffer can cause a problem when rendering however. If you remember from last tutorial we rendered a structured buffer using this function…

(The buffer's dynamic contents cause a problem when rendering.)

 

Unity’s “DrawProcedural” function needs to know the number of elements that need to be drawn. If our append buffer's contents are dynamic how do we know how many elements there are?

(Graphics.DrawProcedural needs the element count up front, but the append buffer's size is not known.)

 

To do that we need to use a argument buffer. Take a look at this line from the script…

(So we use an argument buffer.)

 

Notice it has to be of the type “ComputeBufferType.DrawIndirect“, it has 4 elements and they are integers. This buffer will be used to tell Unity how to draw the append buffer using the function “DrawProceduralIndirect“.

(This buffer tells Unity how much of the append buffer to draw, via DrawProceduralIndirect.)

 

These 4 elements represent the number of vertices, the number of instances, the start vertex and the start instance. The number of instances is simply how many times to draw the buffer in a single shader pass. This would be useful for something like a forest where you have a few trees that need to be drawn many times. The start vertex and instance just allow you to adjust where to draw from.

(The four values are the vertex count, instance count, start vertex, and start instance. Instancing suits cases like a forest, where a few trees are drawn many times with only their positions changing.)

 

These values can be set from the script. Notice this line here…

(Setting the argument buffer from the script: fill the four values, then copy the real count in.)

 

 

Here we are just telling Unity to draw one instance of our buffer starting from the beginning. It's the first number that is the important one, however: the number of vertices, and you can see it's set to 0. This is because we need to get the exact number of elements that are in the append buffer. This is done with the following line…

 

This copies the number of elements in the append buffer into our argument buffer. At this stage it's important to make sure everything's working correctly, so we will also get the values in the argument buffer and print out the contents like so…

(ComputeBuffer.CopyCount copies the append buffer's element count into the argument buffer; the Debug.Log prints the four values so you can see what they mean.)
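
Putting those pieces together, a sketch (argBuffer and appendBuffer are assumed names):

// 4 ints: vertex count, instance count, start vertex, start instance.
argBuffer = new ComputeBuffer(4, sizeof(int), ComputeBufferType.DrawIndirect);
argBuffer.SetData(new int[] { 0, 1, 0, 0 });

// Copy the append buffer's element count into the first slot of the argument buffer.
ComputeBuffer.CopyCount(appendBuffer, argBuffer, 0);

// Read it back just to verify during testing.
int[] args = new int[4];
argBuffer.GetData(args);
Debug.Log("vertex count " + args[0] + ", instance count " + args[1] +
          ", start vertex " + args[2] + ", start instance " + args[3]);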

 

Now run the scene, making sure you have bound the shader and material to the script. You should see the vertex count as 256. This is because we ran 1024 threads with the compute shader and we added a position only for even x and y id's, which ends up being the 256 points you see on the screen.

(Run the scene: the vertex count prints as 256, out of the 1024 threads that ran.)

 

Now remember I said there was a subtle error that can arise and it has to do with the buffer's size? When the buffer was declared we set its size as 1024 and now we have entered 256 elements into it. What would happen if we added more than 1024 elements? Nothing good. Appending more elements than the buffer's size causes an error in the GPU's driver. This causes all sorts of issues. The best thing that could happen is that the GPU would crash. The worst is a persistent error in the calculation of the number of elements in the buffer that can only be fixed by restarting Unity, and means that the buffer will not be drawn correctly with no indication as to why. What's worse is that since this is a driver issue you may not experience the issue the same way on different computers, leading to inconsistent behavior, which is always hard to fix.

(Never append more elements than the buffer's allocated size, here 1024; exceeding it causes driver-level errors that are hard to diagnose.)

 

This is why I have printed out the number of elements in the buffer. You should always check to make sure the value is within the expected range.

(Always print the element count during testing to confirm it is in the expected range.)

 

Once you have copied the vertex count to the argument buffer you can then draw the buffer like so…

(With the argument buffer filled you can draw the result with the function below.)
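
A sketch of the draw call, placed in OnPostRender:

material.SetPass(0);
material.SetBuffer("buffer", appendBuffer);
Graphics.DrawProceduralIndirect(MeshTopology.Points, argBuffer, 0);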

 

You're not limited to filling a compute buffer from a compute shader. You can also fill it from a Cg shader. This is useful if you want to save some pixels from an image for some sort of further post effect. The catch is that since you are now using the normal graphics pipeline you need to play by its rules. You need to output a fragment color into a render texture even if you don't want to. As such, if you find yourself in that situation it means that whatever you're doing is probably better done from a compute shader. Other than that the process works much the same, but like everything there are a few things to be careful of.

(An alternative approach follows: filling the buffer from a Cg shader.)

 

Create a new shader and paste in this code…

 

 

Notice that the fragment shader is much the same as the compute shader used previously. Here we are just adding a position to the buffer for each pixel rendered if the uv (as the id) is an even number on both the x and y axes. Don't ever try and create a material from this shader in the editor! The reason is that Unity will try and render a preview image for the material and will promptly crash. You need to pass the shader to a script and create the material from there.

(Much like the compute shader version, but a position is appended to the buffer per rendered pixel.)

 

Create a new C# script and paste in the following code…

 

 

Again you will see that this is very much like the previous script using the compute shader. There are a few differences however.

(You will find this very similar to the previous script, with a few differences.)

 

 

Instead of the compute shader dispatch call we need to use Graphics.Blit, and we need to bind and unbind the buffer. We also need to provide a render texture as the destination for Graphics.Blit. This makes the process Pro only, unfortunately.

(Instead of a dispatch call we use Graphics.Blit with a render texture as the target, and the buffer has to be bound and unbound around it.)
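
A sketch of that part of the script, assuming the buffer is bound to UAV slot 1 (the index passed to SetRandomWriteTarget has to match the register used by the shader):

// Bind the buffer as a random-write target, blit into a render texture, then unbind.
Graphics.SetRandomWriteTarget(1, appendBuffer);
Graphics.Blit(null, renderTexture, material);
Graphics.ClearRandomWriteTargets();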

 

Attach this script to the main camera in a new scene and run the scene after binding the material and shader to the script. It should look just like the previous scene (a grid of points) but they will be blue.

(Running this new code gives the same grid of points as before, but blue.)

 

 

DirectCompute tutorial for Unity 4: Buffers

In DirectCompute there are two types of data structures you will be using, textures and buffers. The last tutorial covered textures and today I will be covering buffers. While textures are good for data that needs filtering or mipmapping, like color information, buffers are more suited to representing information for vertices, like positions or normals. Buffers can also easily send or retrieve data from the GPU, which is a feature that's rather lacking in Unity.

(Besides textures you can also use buffers. Textures suit data that needs filtering or mipmapping, such as color; buffers suit per-vertex data such as positions or normals.)

 

There are 5 types of buffers you can have, structured, append, consume, counter and raw. Today I will only be covering structured buffers because this is the most commonly used type. I will try and cover the others in separate tutorials as they are a little more advanced and I need to cover some of the basics first.

(There are five buffer types you can use: structured, append, consume, counter and raw; this post covers only structured buffers.)

 

So let’s start of by creating a C# script, name it BufferExample and paste in the following code.

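
The post's script is shown as an image; a minimal sketch of what BufferExample might contain (the buffer uniform name and the random point data are assumptions):

using UnityEngine;

public class BufferExample : MonoBehaviour
{
    public Material material;
    const int count = 1024;
    ComputeBuffer buffer;

    void Start()
    {
        // count elements, each a float3 position (stride = 3 floats).
        buffer = new ComputeBuffer(count, sizeof(float) * 3, ComputeBufferType.Default);

        Vector3[] positions = new Vector3[count];
        for (int i = 0; i < count; i++)
            positions[i] = Random.insideUnitSphere * 5.0f;

        buffer.SetData(positions);
        material.SetBuffer("buffer", buffer);
    }

    void OnPostRender()
    {
        material.SetPass(0);
        Graphics.DrawProcedural(MeshTopology.Points, count, 1);
    }

    void OnDestroy()
    {
        buffer.Release();
    }
}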

 

We will also need a material to draw our buffer with so create a normal Cg shader, paste in the follow code and create a material out of it.

(Create a material from the shader below.)

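
A sketch of such a material shader (shader model 5.0, reading the buffer with SV_VertexID; the positions are treated as world-space points):

Shader "Custom/BufferExample"
{
    SubShader
    {
        Pass
        {
            CGPROGRAM
            #pragma target 5.0
            #pragma vertex vert
            #pragma fragment frag
            #include "UnityCG.cginc"

            // The same buffer the script filled with positions.
            uniform StructuredBuffer<float3> buffer;

            struct v2f { float4 pos : SV_POSITION; };

            v2f vert(uint id : SV_VertexID)
            {
                v2f OUT;
                OUT.pos = mul(UNITY_MATRIX_VP, float4(buffer[id], 1.0));
                return OUT;
            }

            float4 frag(v2f IN) : SV_Target
            {
                return float4(1, 0, 0, 1);   // red points
            }
            ENDCG
        }
    }
}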

 

Now to run the scene attach the script to the main camera node and bind the material to the material attribute on the script. This script has to be attached to a camera because to render a buffer we need to use the “OnPostRender” function which will only be called if the script is on a camera (EDIT – turns out you can use “OnRenderObject” if you don’t want the script attached to a camera). Run the scene and you should see a number of random red points.

(Attach the C# script above to the main camera and assign the material you just created to the script's material field.)


 

(A walkthrough of the code follows.)

Now notice the creation of the buffer from this line.

 

The three parameters passed to the constructor are the count, stride and type. The count is simply the number of elements in the buffer. In this case 1024 points. The stride is the size in bytes of each element in the buffer. Since each element in the buffer represents a position in 3D space we need a 3 component vector of floats. A float is 4 bytes so we need a stride of 4 * 3. I like to use the sizeof operator as I think it makes the code more clear. The last parameter is the buffer type and it is optional. If left out it creates a buffer of default type which is a structured buffer.

(Three parameters: the element count, the size of each element in bytes, and the type, which defaults to a structured buffer.)

 

Now we need to pass the positions we have made to the buffer like so.

(Upload the data.)

 

This passes the data to the GPU. Just note that passing data to and from the GPU can be a slow process, and in general it is faster to send data than retrieve data.

 

Now we need to actually draw the data. Drawing buffers has to be done in the “OnPostRender” function which will only be called if the script is attached to the camera. The draw call is made using the “DrawProcedural” function like so…

(Draw the data in the OnPostRender function.)

 

There are a few key points here.

The first is that the material's pass must be set before the DrawProcedural call. Fail to do this and nothing will be drawn. You must also bind the buffer to the material, but you only have to do this once, not every frame like I am here. Now have a look at the “DrawProcedural” function. The first parameter is the topology type, and in this case I am just rendering points, but you can render the data as lines, line strips, quads or triangles. You must however order them to match the topology. For example if you render lines every two points in the buffer will make a line segment, and for triangles every three points will make a triangle. The next two parameters are the vertex count and the instance count. The vertex count is just the number of vertices you will be drawing, in this case the number of elements in the buffer. The instance count is how many times you want to draw the same data. Here we are just rendering the points once, but you could render them many times and have each instance in a different location.

(The material's pass must be set before calling DrawProcedural, or nothing is drawn.)

(DrawProcedural parameters: topology type, vertex count, instance count.)

 

(The original post shows a screenshot of the data rendered as lines.)


Now for the material. This is pretty straightforward. You just need to declare your buffer as a uniform like so…

(In the material, just declare the buffer as a StructuredBuffer uniform.)

 

Since buffers are only in DirectCompute you must also set the shader target to SM5 like so…

(The required shader target version.)

 

The vertex shader must also have the argument “uint id : SV_VertexID“. This allows you to access the correct element in the buffer like so…

(The SV_VertexID argument in the vertex shader makes it easy to index the right element.)

 

Buffers are a generic data structure and don’t have to be floats like in this example. We could use integers instead. Change the scripts start function to this…

(Buffers are generic and not limited to floats; the example below switches to ints.)

 

 

and the shaders uniform declaration to this…

You could even use a double but just be aware that double precision in shaders is still not widely supported on GPU’s although its common on newer cards.

(You can use doubles, but double precision is still not widely supported on GPUs; better avoided for now.)

 

You are not limited to using primitives like float or int. You can also use Unity’s Vectors like so…

(You can also use Unity's vector types!)

 

 

With the uniform as…

You can also create your own structs to use. Change your script's start function to this with the struct declaration above…

(Custom structs work too.)

 

 

and the shader to this…

 

This will draw each point like before but now they will also be random colors. Just be aware that you need to use a struct not a class.

(Note that in the script you must use a struct, not a class.)

 

When using buffers you will often find you need to copy one into another. This can be easily done with a compute shader like so…

(Copying one buffer into another.)
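
A sketch of such a copy shader:

#pragma kernel CSMain

StructuredBuffer<float3> buffer1;     // source, read only
RWStructuredBuffer<float3> buffer2;   // destination, written to

[numthreads(8,1,1)]
void CSMain(uint3 id : SV_DispatchThreadID)
{
    buffer2[id.x] = buffer1[id.x];
}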

 

 

You can see that buffer1 is copied into buffer2. Since buffers are always 1 dimensional it is best done from a 1 dimension thread group.

(Buffers are always one dimensional, so the copy is best done from a one-dimensional thread group.)

 

Just like textures are objects, so too are buffers, and they have some functions that can be called from them. You can use the Load function to access a buffer's element just like with the subscript operator.

(Like textures, buffers are objects with member functions, e.g. Load.)

 

Buffers also have a “GetDimension” function. This returns the number of elements in the buffer and their stride.

(And GetDimensions, which returns the element count and stride.)
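
For example (a sketch):

uint count, stride;
buffer1.GetDimensions(count, stride);   // number of elements, and the size of each in bytes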

 

 

 

 

 

DirectCompute tutorial for Unity 3: Textures

The focus of today's tutorial is on textures. These are arguably the most important feature when using DirectCompute. It is highly likely that every shader you write will use at least one texture. Unfortunately render textures in Unity are Pro only, so this tutorial covers Pro only topics. The good news is that this will be the only Pro only tutorial. Textures in DirectCompute are very simple to use but there are a few traps you can fall into. Let's start with something simple. Create a compute shader and paste in the following code.

(Simple to use, but note that render textures are a Unity Pro feature.)

 

(An example kernel is below.)

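
The kernel is shown as an image in the post; a sketch of what it might look like (the width/height uniforms match the script described below):

#pragma kernel CSMain

RWTexture2D<float4> tex;
float width, height;

[numthreads(8,8,1)]
void CSMain(uint2 id : SV_DispatchThreadID)
{
    // Write the normalized thread position (the uv) into the texture as a color.
    float2 uv = id / float2(width, height);
    tex[id] = float4(uv, 0.0, 1.0);
}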

 

And then a C# script called TextureExample and paste in the following code.

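
A minimal sketch of the TextureExample script (displaying the result through OnGUI is an assumption; the original may present it differently):

using UnityEngine;

public class TextureExample : MonoBehaviour
{
    public ComputeShader shader;
    const int size = 64;
    RenderTexture tex;

    void Start()
    {
        tex = new RenderTexture(size, size, 0);
        tex.enableRandomWrite = true;   // required to write from a compute shader
        tex.Create();                   // must be created before it is written to

        shader.SetTexture(0, "tex", tex);
        shader.SetFloat("width", size);
        shader.SetFloat("height", size);
        shader.Dispatch(0, size / 8, size / 8, 1);
    }

    void OnGUI()
    {
        // Just to see the result on screen.
        GUI.DrawTexture(new Rect(0, 0, size * 4, size * 4), tex);
    }

    void OnDestroy()
    {
        tex.Release();
    }
}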

 

Attach the script, bind the shader and run the scene. You should see a texture with the uv’s displayed as colors. This is what we have written into the texture with the compute shaders.


 

Now notice the kernel argument “uint2 id : SV_DispatchThreadID” from last tutorial. This is the threads position in the groups of threads and just like with buffers we need this to know what location to write the results into the texture, like this “tex[id] = result“. This time however we don’t need the “flattened” index like with buffers. We can just use the uint2 directly. This is because unlike buffers textures can be multidimensional. We have a 2D thread group and a 2D texture.

(With a 2D thread group and a 2D texture, a uint2 id is all you need.)

 

Now look at the declaration of the texture variable “RWTexture2D<float4> tex;“. The “RWTexture2D” part is important. This is obviously a texture but whats with the RW? This declares that the texture is of the type “unordered access view”. This just means that you can write to any location in the texture from the shader. You may be able to write to the texture but you can not read from it.  Remove the RW part and its just a normal texture but you cant write to it. Just remember that you can only read from a Texture2D and only write to RWTexture2D.

(The difference between RWTexture2D and Texture2D: with RW you can write to any location of the texture from this shader but not read from it; without RW it is an ordinary texture that can only be read.)

 

Now lets look at the script. Notice how the texture is created.

 

There are two important things here. The first is the “enableRandomWrite”. You must set this to true if you want to write into the texture. This basically says that the texture can have unordered access views. If you don't do this nothing will happen when you run the shader and Unity won't give you an error. It will just fail for no apparent reason. The second is the “Create” function call. You must call create on the texture before you write into it. Again, if you don't, nothing will happen and you won't get an error. It just won't work. If you are used to writing into a texture with Graphics.Blit then you may notice you don't have to call create. This is because Graphics.Blit checks if the texture is created and creates it if it's not. The dispatch function can't do this because it does not know what textures are being written into when it is called.

(enableRandomWrite must be true or writes from the compute shader silently do nothing, and Create() must be called before the texture is written to.)

 

Also note that the texture is released in the “OnDestroy” function. When you are finished with your render textures make sure you release them.

(Release the texture in OnDestroy when you are finished with it.)

 

Now lets look at the dispatch call.

 

Remember from the last tutorial that this is where we set how many groups to run. So why is the number of groups to run the texture size divided by 8? Look at the shader. You can see we have 8 threads per group running from the “[numthreads(8,8,1)]” line. Now we need a thread for each pixel in the texture to run. So we have a texture 64 pixels wide and if we divide the pixels by the number of threads per group we get the number of groups we will need. We end up with 8 groups of 8 threads, which is 64 threads in total in the x dimension, and it's the same for the y dimension. This gives us a total of 4096 threads running (64 * 64), which is the number of pixels in the texture.

(Thread-count arithmetic, as explained in the previous tutorial; not repeated here.)

 

Now have a look at this part of the script.

 

This is where we are setting the uniforms for the shader. We have to set the texture we are writing into and we also need the textures width and height so we can calculate the uv’s from the dispatch id. Its common to pass variables into a shader like this but in this case we don’t need the width and height. We can get it from within the shader. Change your shader to this…

(Setting the shader uniforms: the texture plus its width and height. The width/height are actually unnecessary, since the shader can query them from the texture, as below:)

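
A sketch of the modified kernel:

#pragma kernel CSMain

RWTexture2D<float4> tex;

[numthreads(8,8,1)]
void CSMain(uint2 id : SV_DispatchThreadID)
{
    // Ask the texture for its own size instead of passing it in as uniforms.
    uint w, h;
    tex.GetDimensions(w, h);
    tex[id] = float4(id / float2(w, h), 0.0, 1.0);
}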

 

and remove these two lines from your script…

 

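
In the script sketch above those would be the two SetFloat calls:

// No longer needed:
// shader.SetFloat("width", size);
// shader.SetFloat("height", size);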

 

Run the scene. It should look the same. Notice the “tex.GetDimensions(w, h);” line. Textures are objects. This means they have functions you can call from them.

(Same result; GetDimensions returns the texture's dimensions.)

 

In this case we are asking for the textures dimensions. Textures have a number of functions with a number of overloads you can call. I will go through the most common and how to use them but first we need to change the scene a bit. What we want to do now is copy the contents of our texture into another texture and then display that result.

(Examples of the various ways to read a texture follow: copy the contents of our texture into another texture and display that.)

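
A sketch of the copy shader (the dest name is an assumption):

#pragma kernel CSMain

Texture2D<float4> tex;        // read-only source
RWTexture2D<float4> dest;     // writable destination

[numthreads(8,8,1)]
void CSMain(uint2 id : SV_DispatchThreadID)
{
    // The simplest way to read: index the texture like an array.
    float4 t = tex[id];
    dest[id] = t;
}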

 

And change the script to this…

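
And on the script side, a sketch of the additions:

// A second texture and a second dispatch that copies tex into dest.
dest = new RenderTexture(size, size, 0);
dest.enableRandomWrite = true;
dest.Create();

shaderCopy.SetTexture(0, "tex", tex);
shaderCopy.SetTexture(0, "dest", dest);
shaderCopy.Dispatch(0, size / 8, size / 8, 1);
// Display dest instead of tex.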

 

Now bind the new shader you have created to the “shaderCopy” attribute and run the scene. The scene should look the same as before. What we are doing here is to fill the first texture with its uv’s as a color like before and then we are copying the contents to another texture. This is so I can demonstrate the many ways to sample from a texture in a shader. Notice this line “float4 t = tex[id];” in the copy shader. This is the simplest way to sample from a texture. Just like the dispatch id is the location you want to write to it is also the location you want to read from. You can see here that we can sample from a texture just like it was a array.

(The texture-copy shader implementation.)

 

There are other ways of doing this. For example…

 

Here we are accessing the mipmaps of the texture instead. Level 0 would be the first mipmap, the one with the same size as the texture. The next dimension in the mipmap array is the location to sample at using the dispatch id. Just note that we have not enabled mipmaps on the texture so the result of sampling levels other than 0 has no effect.

(Sampling from the mipmap chain; level 0 is the full-size level.)

 

We can do the same with the textures load function.

(The same thing, written with the Load function.)

 

In this case the uint3’s x and y value is the location to sample at and the z value (the 0) is the mipmap level. Both these examples work the same.

 

There is one feature of textures that you will be using a lot: the texture's ability to filter and wrap. To do this you need to use a sampler state. If you are used to using HLSL outside of Unity you may know that you need to create a sampler object to use. This works a bit differently in Unity. Basically you have two choices for filtering (Linear or Point) and two for wrap (Clamp or Repeat). All you need to do is declare a sampler state in the shader with the word Linear or Point in the name and the word Repeat or Clamp in the name. For example you could use the name myLinearClamp or aPointRepeat, etc. I prefer an underscore then the name. Change your shader to this.

(To use the texture's filter and wrap features you need a sampler state; an example follows.)
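
A sketch of the sampler-state version of the copy shader:

#pragma kernel CSMain

Texture2D<float4> tex;
RWTexture2D<float4> dest;
float width, height;

// Filter and wrap come from the name: Linear/Point plus Clamp/Repeat.
SamplerState _LinearClamp;
SamplerState _PointRepeat;

[numthreads(8,8,1)]
void CSMain(uint2 id : SV_DispatchThreadID)
{
    // SampleLevel takes a sampler state, normalized uvs and a mipmap level.
    float2 uv = id / float2(width, height);
    float4 t = tex.SampleLevel(_LinearClamp, uv, 0);
    dest[id] = t;
}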

 

 

If you run the scene it should still look the same. Notice this line “float4 t = tex.SampleLevel(_LinearClamp, uv, 0);“. Here we are using the texture's SampleLevel function. The function takes a sampler state, the uv's and a mipmap level. The uv's need to be normalized, that is, have a range of 0 to 1. Notice the SamplerState variables at the top. If you are using sampler states you are most likely wanting bilinear filtering. If that's the case use the _LinearClamp or _LinearRepeat sampler state.

(Note how SampleLevel is used; the result still looks the same.)

(Sampler states behave as in standard HLSL: the filter part controls interpolation and the wrap part controls what happens outside the 0–1 range, with Clamp extending the edge texels and Repeat tiling the texture. Any HLSL texture-sampling reference covers the details.)

 

If you are used to using HLSL in a fragment shader (as opposed to a compute shader here) you may know that you can use this function for filtering.

(The fragment-shader style Sample call is not usable here!)

 

Notice it's called Sample, not SampleLevel, and the mipmap parameter is gone. If you try to use this in a compute shader you will get an error because this function does not exist there. The reason why is surprisingly complicated and gives an insight into how the GPU works. Behind the scenes fragment shaders (or any shader) work in much the same way as a compute shader, as they share the same GPU architecture. They run in threads and the threads are arranged into thread groups. Now remember that threads in a group can share memory. Fragment shaders always run in a group of at least 2 by 2 threads. When you sample from a texture the fragment shader checks what its neighbors' uv's are. From this it can work out what the derivatives of the uv's are. The derivatives are just a rate of change, and in areas where there is a high rate of change the texture's higher mipmaps are used and in areas of low rate of change the lower mipmaps are used. This is how the GPU reduces aliasing problems and it also has the handy byproduct of reducing memory bandwidth (the higher mipmaps are smaller).

(Sample relies on uv derivatives from neighboring fragment threads to pick a mip level, which a compute shader does not have, so only SampleLevel is available here.)

 

All the examples I have given are in 2D but the same principles apply in 3D. You just need to create the texture a little differently.

(The examples above are 2D; here is a 3D example.)

 

This creates a 3D texture that is 64*64*64 pixels. Notice the “volumeDepth” is set to 64 and the “isVolume” is set to true. Remember to set the number of groups and threads to the correct values. The dispatch id also needs to be a uint3 or int3;

(Create a 3D texture and run a 3D kernel over it.)
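
A sketch of the 3D setup (isVolume is the older API referenced in the post; newer Unity versions set the texture's dimension to Tex3D instead):

tex = new RenderTexture(64, 64, 0);
tex.volumeDepth = 64;            // 64 slices deep
tex.isVolume = true;             // mark it as a 3D texture
tex.enableRandomWrite = true;
tex.Create();

shader.Dispatch(0, 64 / 8, 64 / 8, 64 / 8);

The kernel side then declares a RWTexture3D<float4>, uses [numthreads(8,8,8)] and takes a uint3 dispatch id.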

 

and…

 

That about covers it for textures. Next time I look at buffers. How to use the default buffer, the other buffer types, how to draw data from your buffers and set/get your buffer data.

 

(A final note unrelated to the translation above: the uniform declared in the shader is really the parent Texture type shared by RenderTexture, Texture2D and Texture3D, so any of those texture subclasses can be bound and read directly. This matters: you can feed an imported texture into a compute shader for computation.)

DirectCompute tutorial for Unity 2: Kernels and thread groups

Today I will be going over the core concepts for writing compute shaders in Unity.

 

At the heart of a compute shader is the kernel. This is the entry point into the shader and acts like the Main function in other programming languages. I will also cover the tiling of threads by the GPU. These tiles are also known as blocks or thread groups. DirectCompute officially refers to these tiles as thread groups.

【The basic unit of computation is the kernel.】

 

To create a compute shader in Unity simply go to the project panel and then click create->compute shader and then double click the shader to open it up in Monodevelop for editing. Paste in the following code into the newly created compute shader.

【Creating a kernel in Unity.】

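
The image shows the bare-minimum shader; a sketch:

#pragma kernel CSMain1

[numthreads(4,1,1)]
void CSMain1()
{
    // Does nothing yet - this is just the minimum needed to compile.
}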

 

This is the bare minimum of content for a compute shader and will of course do nothing but will serve as a good starting point. A compute shader has to be run from a script in Unity so we will need one of those as well. Go to the project panel and click Create->C# script. Name it KernelExample and paste in the following code.

【The C# code that runs the kernel.】

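
And a sketch of the matching script:

using UnityEngine;

public class KernelExample : MonoBehaviour
{
    public ComputeShader shader;

    void Start()
    {
        // Run kernel 0 with a single group of threads.
        shader.Dispatch(0, 1, 1, 1);
    }
}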

 

Now drag the script onto any game object and then attach the compute shader to the shader attribute. The shader will now run in the start function when the scene is run. Before you run the scene however you need to enable dx11 in Unity. Go to Edit->Project Settings->Player and then tick the “Use Direct3D 11” box. You can now run the scene. The shader will do nothing but there should also be no errors.

【How to hook it up in a Unity scene.】


 

In the script you will see the “Dispatch” function called. This is responsible for running the shader. Notice the first variable is a 0. This is the kernel id that you want to run. In the shader you will see the “#pragma kernel CSMain1“. This defines what function in the shader is the kernel, as you may have many functions (and even many kernels) in one shader. There must be a function with the name CSMain1 in the shader or the shader will not compile.

【Dispatch is what runs the compute shader; a compute shader can contain many kernels, selected by id or by name.】

 

Now notice the “[numthreads(4,1,1)]” line. This tells the GPU how many threads of the kernel to run per group. The 3 numbers relate to each dimension. A thread group can be up to 3 dimensions and in this example we are just running a 1 dimension group with a width of 4 threads. That means we are running a total of 4 threads and each thread will run a copy of the kernel. This is why GPUs are so fast. They can run thousands of threads at a time.

【numthreads defines the X*Y*Z layout of threads per group; the product of the three values is the total number of threads in one group.】

 

Now lets get the kernel to actually do something. Change the shader to this…

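
A sketch of the updated kernel (the buffer uniform name is an assumption):

#pragma kernel CSMain1

RWStructuredBuffer<int> buffer;

[numthreads(4,1,1)]
void CSMain1(int3 threadID : SV_GroupThreadID)
{
    // Write each thread's id within its group into the buffer.
    buffer[threadID.x] = threadID.x;
}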

 

and the scripts start function to this…

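
And a sketch of the updated start function:

void Start()
{
    ComputeBuffer buffer = new ComputeBuffer(4, sizeof(int));

    shader.SetBuffer(0, "buffer", buffer);
    shader.Dispatch(0, 1, 1, 1);

    int[] data = new int[4];
    buffer.GetData(data);
    for (int i = 0; i < 4; i++) Debug.Log(data[i]);   // prints 0, 1, 2, 3

    buffer.Release();
}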

 

Now run the scene and you should see the numbers 0, 1, 2 and 3 printed out.

 

Don’t worry too much about the buffer for now. I will cover them in detail in the future but just know that a buffer is a place to store data and it needs to have the release function called when you are finished with it.

【Don't worry about the buffer for now; it is covered later.】

 

Notice this argument added to the CSMain1 function “int3 threadID : SV_GroupThreadID“. This is a request to the GPU to pass into the kernel the thread id when it is run. We are then writing the thread id into the buffer and since we have told the GPU we are running 4 threads the id ranges from 0 to 3 as we see from the print out.

【SV_GroupThreadID is the GPU thread id within the group; this compute shader just writes that id out.】

 

Now those 4 threads make up whats called a thread group. In this case we are running 1 group of 4 threads but you can run multiple groups of threads. Lets run 2 groups instead of 1. Change the shaders kernel to this…

【More than one group of GPU threads can be used to run the kernel.】

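
A sketch of the kernel for two groups:

#pragma kernel CSMain1

RWStructuredBuffer<int> buffer;

[numthreads(4,1,1)]
void CSMain1(int3 threadID : SV_GroupThreadID, int3 groupID : SV_GroupID)
{
    // Global position = thread id within the group + group id * threads per group.
    buffer[threadID.x + groupID.x * 4] = threadID.x;
}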

 

and the scripts start function to this…

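
And the matching script changes, as a sketch:

ComputeBuffer buffer = new ComputeBuffer(8, sizeof(int));
shader.SetBuffer(0, "buffer", buffer);
shader.Dispatch(0, 2, 1, 1);   // 2 groups of 4 threads
// Read back the 8 values and print them as before.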

 

Now run the scene and you should have 0-3 printed out twice.

 

Now notice the change to the dispatch function. The last three variables (the 2,1,1) are the number of groups we want to run, and just like the number of threads, groups can go up to 3 dimensions; in this case we are running 1 dimension of 2 groups. We have also had to change the kernel, with the argument “int3 groupID : SV_GroupID” added. This is a request to the GPU to pass in the group id when the kernel is run. The reason we need this is because we are now writing out 8 values, 2 groups of 4 threads. We now need the thread's position in the buffer and the formula for this is the thread id plus the group id times the number of threads ( threadID.x + groupID.x*4 ).

【The approach above runs two 1D groups of the kernel.】

 

This is a bit awkward to write. Surely the GPU knows the threads position? Yes it does. Change the shaders kernel to this and rerun the scene.

【But written this way the kernel has to compute its global position itself from the group layout, which is awkward; the improvement is below.】

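
A sketch of the simplified kernel:

[numthreads(4,1,1)]
void CSMain1(int3 dispatchID : SV_DispatchThreadID)
{
    // SV_DispatchThreadID already combines the group id and the thread id.
    buffer[dispatchID.x] = dispatchID.x;
}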

 

The results should be the same, two sets of 0-3 printed. Notice that the group id argument has been replaced with “int3 dispatchID : SV_DispatchThreadID“. This is the same number our formula gave us except now the GPU is doing it for us. This is the threads position in the groups of threads.

【The result is the same, but now the GPU supplies the global thread position, so the kernel no longer cares how many groups were used.】

 

So far these have all been in 1 dimension. Let's step things up a bit and move to 2 dimensions, and instead of rewriting the kernel let's just add another one to the shader. It's not uncommon to have a kernel for each dimension in a shader performing the same algorithm. First add this code to the shader below the previous code so there are two kernels in the shader.

【Next step: use a 2D kernel instead of running the 1D kernel twice.】

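
A sketch of the second, 2D kernel (the buffer2 name is an assumption):

#pragma kernel CSMain2

RWStructuredBuffer<int> buffer2;

[numthreads(4,4,1)]
void CSMain2(int3 dispatchID : SV_DispatchThreadID)
{
    // Flatten the 2D dispatch id into the 1D buffer index (8 threads across in x).
    int id = dispatchID.x + dispatchID.y * 8;
    buffer2[id] = id;
}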

 

and the script to this…

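
And the script side, as a sketch:

int kernel = shader.FindKernel("CSMain2");

ComputeBuffer buffer = new ComputeBuffer(64, sizeof(int));
shader.SetBuffer(kernel, "buffer2", buffer);
shader.Dispatch(kernel, 2, 2, 1);   // 2x2 groups of 4x4 threads = 64 threads

int[] data = new int[64];
buffer.GetData(data);
// Print the 64 values, then buffer.Release();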

 

Run the scene and you will see a row printed from 0 to 7 and the next row 8 to 15 and so on to 63.


 

Why from 0 to 63? Well we now have 4 2D groups of threads and each group is 4 by 4 so has 16 threads. That gives us 64 threads in total.

 

Notice what value we are outputting from this line “int id = dispatchID.x + dispatchID.y * 8“. The dispatch id is the thread's position in the groups of threads for each dimension. We now have 2 dimensions so we need the thread's global position in the buffer, and this is just the dispatch x id plus the dispatch y id times the total number of threads in the first dimension (4 * 2). This is a concept you will have to be familiar with when working with compute shaders. The reason is that buffers are always 1 dimensional and when working in higher dimensions you need to calculate what index the result should be written into the buffer at.

【Thread breakdown: [numthreads(4,4,1)] means each group has 16 threads; Dispatch(kernel, 2, 2, 1) launches a 2×2 grid of groups, so there are 64 threads in total.】

 

The same theory applies when working with 3 dimensions, but as it gets fiddly, I will only demonstrate up to 2 dimensions. You just need to know that in 3 dimensions the buffer position is calculated as "int id = dispatchID.x + dispatchID.y * groupSizeX + dispatchID.z * groupSizeX * groupSizeY", where groupSizeX and groupSizeY are the number of groups times the number of threads for that dimension.

【The same idea extends to 3 dimensions.】

 

You should also have an understanding of how the semantics work. Take, for example, this kernel argument…

【You also need to understand how the semantics work.】

 

int3 dispatchID : SV_DispatchThreadID

 

SV_DispatchThreadID is the semantic; it tells the GPU what value it should pass in for this argument. The name of the argument does not matter; you can call it whatever you want. For example, the argument below works the same as the one above.

【The first part is the variable name, which you can change freely; the example below behaves exactly like the one above. The part after the colon is the semantic, which tells the GPU what kind of value to pass in.】
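The example referred to above is not shown in this copy; any argument name works as long as the semantic stays the same, for instance:

int3 id : SV_DispatchThreadID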

 

Also the variable type can be changed. For example…
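The example is again missing from this copy; it presumably changes only the type, something like:

int dispatchID : SV_DispatchThreadID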

See how the int3 has been changed to int. This is fine if you are only working in 1 dimension. You could also use an int2 for 2 dimensions, and you could use an unsigned int (uint) instead of an int if you prefer.

【int3 covers three dimensions; a plain int is fine when working in one dimension.】

 

Since we now have two kernels in the shader, we also need to tell the GPU which kernel we want to run when we make the Dispatch call. Each kernel is given an ID in the order it appears: our first kernel is ID 0 and the next is ID 1. When the number of kernels in a shader grows, this can become confusing and it is easy to set the wrong ID. We can solve this by asking the shader for a kernel's ID by name. The line "int kernel = shader.FindKernel("CSMain2");" gets the ID of the kernel "CSMain2". We then use this ID when setting the buffer and making the Dispatch call.

【When a compute shader contains several kernels, Dispatch needs the ID of the kernel to run; FindKernel looks that ID up by name.】
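To make the difference concrete, a short sketch contrasting the two ways of picking a kernel, assuming the kernel order of this shader:

// Kernels are numbered in the order they appear in the shader:
// CSMain1 is id 0, CSMain2 is id 1.
shader.Dispatch(1, 2, 2, 1);                 // easy to get wrong as the shader grows

// Safer: ask the shader for the id by name, then use that id everywhere:
int kernel = shader.FindKernel("CSMain2");
shader.Dispatch(kernel, 2, 2, 1);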

 

About now you may be thinking that this concept of groups of threads is a bit confusing. Why can't I just use one group of threads? You can, but know that there is a reason the GPU arranges threads into groups. For a start, a thread group is limited in the number of threads it can have (defined by the line "[numthreads(x,y,z)]" in the shader). This limit is currently 1024 but may change with new hardware. For example, you can have a maximum of "numthreads(1024,1,1)" for 1D, "numthreads(32,32,1)" for 2D, and so on. You can, however, have any number of groups of threads, and since you will often be processing data with millions of elements, the concept of thread groups is essential. Threads in a group can also share memory, which can bring dramatic performance gains for certain algorithms, but I will cover that in a future post.

【You could use just a single group of threads, but threads within a group can share memory, which can give dramatic performance gains; also note that a group cannot exceed 1024 threads in total.】

 

Well, I think that about covers kernels and thread groups. There is just one more thing I want to cover: how to pass uniforms into your shader. This works the same as in Cg shaders, but there is no uniform keyword. For the most part this is relatively simple, but there are a few gotchas, so I will briefly go over it.

【One last topic: how to pass uniform values into the shader.】

 

For example if you want to pass in a float you need this line in the shader…

【Example: scalar values】

and this line in your script…
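The exact lines are not shown in this copy; a minimal sketch with an assumed uniform name:

// in the compute shader (no "uniform" keyword):
float _Value;

// in the C# script:
shader.SetFloat("_Value", 1.5f);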

To set a vector you need this in the shader…

and this in the script…
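Again a sketch with an assumed name; the uniform here happens to be a float4:

// in the compute shader; float, float2, float3 or float4 all work:
float4 _Color;

// in the C# script; only a Vector4 can be passed in:
shader.SetVector("_Color", new Vector4(1, 0, 0, 1));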

You can only pass in a Vector4 from the script but your uniform can be a float, float2, float3 or float4. It will be filled with the appropriate values.

 

Now here's where it gets tricky: you can also pass in arrays of values. Note that this first example won't work; I will explain why. You need this line in your shader…

【Example: arrays】

and this in your script…
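A sketch of that first version, with an assumed name:

// in the compute shader:
float _Values[8];

// in the C# script; as described below, this combination fails:
shader.SetFloats("_Values", new float[] { 0, 1, 2, 3, 4, 5, 6, 7 });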

Now, this won't work. Whether this is by design or a bug in Unity, I don't know. You need to use vectors as uniforms for this to work. In your shader…

and your script…
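A sketch of the working version, matching the description that follows (name assumed):

// in the compute shader: an array of two float4's:
float4 _Values[2];

// in the C# script: still set from an array of 8 floats:
shader.SetFloats("_Values", new float[] { 0, 1, 2, 3, 4, 5, 6, 7 });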

So here we have an array of two float4's, and it is set from an array of 8 floats in the script. The same principles apply when setting matrices. In your shader…

and your script…
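For a single matrix, the same flattened-float approach presumably looks like this (name assumed):

// in the compute shader:
float4x4 _Matrix;

// in the C# script: 16 floats, one per matrix element (identity here):
shader.SetFloats("_Matrix", new float[] {
    1, 0, 0, 0,
    0, 1, 0, 0,
    0, 0, 1, 0,
    0, 0, 0, 1 });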

And of course you can have arrays of matrices. In your shader…

and your script…
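And for an array of matrices, a sketch under the same assumptions:

// in the compute shader:
float4x4 _Matrices[2];

// in the C# script: 2 matrices * 16 floats = 32 floats:
float[] values = new float[32];
// ... fill in the 32 values ...
shader.SetFloats("_Matrices", values);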

This same logic does not seem to apply to float2x2 or float3x3. Again, whether this is a bug or by design, I don't know.

DirectCompute tutorial for Unity 1: Introduction

Unity's take on GPGPU through compute shaders is genuinely impressive: the code is no longer hard to read, write, and maintain the way OpenCL or CUDA code is; overall it is already very pleasant to use.
Building on this, once graphics cards take their next leap over the coming years, using the GPU's parallel computing power in games will become very widespread.
Eventually, the kinds of for/while/foreach operations that used to run on the CPU could all be wrapped for the GPU, letting users tap into GPU parallel computing either transparently or with a simple opt-in.

Original source: https://scrawkblog.com/category/directcompute

The tutorial text begins below; if you would rather not read the full English, the bracketed summaries are enough:


At this time, Unity has just started to support Microsoft's DirectX 11, and with DirectX 11 comes the DirectCompute API. This API opens up a whole new way to use the GPU: writing compute shaders.

【Unity's GPGPU support is very good, but the lack of tutorials makes it painful for users, so the author decided to write a series.】

 

I hope to start off with an introduction to why DirectCompute was needed and how it differs from the traditional graphics pipeline, then cover how to write kernels via compute shaders, how the GPU's internal tiling of threads works, how to access textures, how to use the various buffer types, how to use thread synchronization and shared memory, and finally how to maximize performance.

 

The Graphics Pipeline

 

Originally, all developers needed to do was send the geometry they wanted to render to the GPU, and OpenGL would pass it through the pipeline until the final output was pixels displayed on the screen. This pipeline was fixed: although developers could enable and disable certain parts and adjust some settings, there was not a lot that could be changed. For a time this was all that was needed, but as developers started to push the GPU further and demand more advanced features, it became apparent that the pipeline had to become more flexible. The solution was to make certain parts of the pipeline programmable. Developers could now write their own programs to perform certain stages of the pipeline. These new programs became known as shaders (there is a technical distinction between programs and shaders, but that's another story).

【What shaders are for.】

 

This new programmable pipeline opened up whole new possibilities, and without it we would not have the quality of graphics you see in the previous generation of games. While the pipeline was originally developed only for creating graphics, the new flexibility meant that the GPU could now be used to process many types of algorithms. Researchers quickly began to modify algorithms to run in the multi-threaded environment of the GPU, and areas as diverse as physics, finance, mathematics, medicine, and many more began to use the GPU to process their data. The raw power of the GPU was too hard to resist, and GPGPU, or General-Purpose Graphics Processing Unit programming, was born.

【This gave rise to GPGPU.】

 

General-Purpose Graphics Processing Unit Programming

 

GPGPU quickly began to go mainstream and enter industrial use. There was still one problem, however: graphics APIs were still tied to the graphics pipeline. It didn't matter how much flexibility shaders gave you over a certain stage of the pipeline; they still had to conform to the restrictions of the pipeline. Vertex shaders still had to output vertices, and fragment shaders still had to output pixels. While great strides were made and creative workarounds were developed, it soon became apparent that things had to change if GPGPU was to develop further. An API was needed that would free developers from the shackles of the graphics pipeline and provide an environment where the raw power of the GPU could be harnessed in a non-graphics setting.

【The vertex/fragment shader structure of the pipeline limited how GPGPU could be used.】

 

Over the next few years, GPGPU APIs started to appear, and developers are currently spoilt for choice between CUDA, OpenCL, and DirectCompute. These new APIs present a whole new way to work with the GPU and are no longer bound to the traditional graphics pipeline.

【GPGPU APIs appeared and solved the problem above.】

 

While the motivation for GPGPU was to provide a way to work with the GPU in a non-graphics setting, the games industry was naturally just as eager to make use of the new capabilities. As games have become more realistic, they have started to simulate real-world physics. These computations are often complicated and demanding to process. The power and flexibility offered by GPGPU have allowed these computations to be performed in ever greater detail and scope. The original graphics APIs opened up a whole new world of 3D games that we now all take for granted. The new GPGPU APIs are doing the same thing, ushering in a whole new era of gaming. The current generation of games will be dominated by the use of these APIs, resulting in ever more realistic graphics and believable physics.

【Games began to adopt GPGPU.】

 

All you need to do now is learn how to use them. Stay tuned for the next part which will show you how to set up the kernels and how the tiling of threads works.