Advanced VR Rendering Performance

Advanced VR Rendering Performance

作者:

Alex Vlachos

ValveAlex@ValveSoftware.com


这是全文的第二部分,第一部分在GDC 15的时候讲的,下面是链接:

Video and slides from last year are free online: http://www.gdcvault.com/play/1021771/Advanced-VR

全篇目标就是:VR渲染如何在保证质量的前提下获得最好的性能。

全文分为四部分:


Multi-GPU for VR

这一点主要考虑的方向就是GPU硬件对于性能的提升。

首先回顾上一篇有讲到隐藏网格区域,指的就是对于最后渲染buffer的不会看到的部分位置的像素点不渲染。

计算机生成了可选文字: ○ 寸 1

bgt_6_1

bgt_6_2

bgt_6_3计算机生成了可选文字:
首先考虑使用单颗
GPU来完成所有工作。

这里单GPU渲染工作方式有很多种,这里以顺序渲染作为例子。

bgt_6_4计算机生成了可选文字:

bgt_6_5计算机生成了可选文字: GPIJO Sha ows eft E e VSync Submit L 11.11 ms Submit R syste VSync

上图展示的就是一次渲染的单GPU工作流,特点是两个眼睛共享shadow buffer

然后我们再来考虑同时使用多个GPU来渲染。

AMD, Nvidia提供的多GPU协同工作API大同小异,都具备以下的重要功能:

  • 广播draw call + mask 确定每个GPU的工作内容
  • 每个GPU拥有独立的shader constant buffers并可以独立设置
  • GPU之间传输(或异步传输:在目标GPU在工作时不会打断,特别有用)渲染目标主体

使用两个GPU的情况:

bgt_6_6

bgt_6_7

计算机生成了可选文字: Submit L Submit R eft Eye GPUO Sha ows syste Win o GPI-Jl Shadows VSync 11.11 ms VSync

  • 每个GPU渲染一只眼睛看到的东西
  • 每个GPU都需要渲染shadow buffer
  • GPU的渲染结果需要提交到主GPU上,最终提交给VR system
  • 实践证明这样的方式大概可以提高30-35%的系统性能

使用4GPU的情况

bgt_6_8计算机生成了可选文字: E ndĐ nd9

bgt_6_9计算机生成了可选文字: Submit L GPIJO GPIJI GPIJ2 GPIJ3 eft E e A Sha ows Shadows Right Eye A Sha ows Shadows Submit R Syste VSync Transfe 11.11 ms VSync

  • 每个GPU只需要渲染半只眼睛看到的画面
  • 但是每个GPU还是需要独立的都运算一遍shadow buffers
  • 对于 pixel shadercost来说每个GPU减到原来的1/4,但是对于vertex shader来说,每个GPU都需要做,cost还是1。(相比较于单GPU
  • 太多的GPU驱动的工作协调会带来更高的CPU cost

这里还要注意的是因为GPU渲染结果之间的传输是支持异步的,因此可以来考虑如何把多个次GPu的结果传输到主GPU上去,可以有多种组合方式,第三种方式的最终等待时间最少被采用。

bgt_6_10

计算机生成了可选文字: GPUO GPU2 GPIJ3 GPI-JO GPUI GPU2 GPU3 GPUO GPI-Jl GPU2 GPU3 Submit L Submit R GPI-Jl Is Left Eye B hadows Shadows Right Eye A Submit R Shadows Left Eye B hadows Right Eye A Submit L Sutrnit R Left Eye Shadows Shadows Right Eye A • Trn<er• Right Eye B Shadows

到此大家可能会发现,当单个GPU渲染改成2个的时候,其性能的提升还是非常明显的。但是再次增加GPU数目的时候,从时间效率的角度的性能提升已经不是很有优势了,这是因为当多GPU可以切分的Cost(主要是Pixel shadercost)分的越来越小,其占据GPU运算的主要瓶颈就在于多GPU不可切分的Costshadow buffer运算和vertex shader相关)。也就是每个GPU都在做同样的重复的事情。

下图就是在相同工作量的情况下,不同GPU数量的时间性能比较。

bgt_6_11

计算机生成了可选文字: GPI-JOI GPI-JO GPI-Jl GPI-JO GPI-Jl GPIJ2 GPIJ3 Submit L Submit L Submit R Submit L Submit R hadow Left Eye B Shadows Right Eye A Shadows Submit R VSync 11.11 ms VSync

但是反过来考虑突出多GPU的优点就是,可以获得更高的最终画质(原pixel shader cost非常高)。

下图就是在相同性能时间的情况下,不同数量GPU可以获得的画面质量比较。

bgt_6_12


Fixed Foveated Rendering & Radial Density Masking

这一点主要考虑VR眼镜的光学特性来提升渲染性能。

投影矩阵投影后的在渲染buffer上的像素密度分布是与我们所希望的相反的。

  • Projection matrix: 在画面中心的位置所能采样到的点的数量比画面周边少
  • VR optics:画面中心位置的采样点在VR里面是你最关注的,看得最清楚的

结果就是导致对画面周边的over rendering

Over Rendering 解释:

bgt_6_13

bgt_6_14计算机生成了可选文字:

优化:Fixed Foveated Rendering

按下列模版去渲染,放大画面中心位置,减少画面周边的所需渲染的pixel数量。

bgt_6_15

bgt_6_16

bgt_6_17

计算机生成了可选文字:

这种模式下推荐使用多GPU渲染

Using NVIDIA’s “Multi-Resolution Shading” we gain an additional ~5-10% GPU perf with less CPU overhead (See “GameWorksVR”, Nathan Reed, SIGGRAPH 2015)

接下来作者想到了新的玩法

Radial Density Masking

对于周边的区域采用2*2pixel块间隔渲染来减少渲染的pixel数量。

Skip rending a checker pattern of 2×2 pixel quads to match current GPU architectures

bgt_6_18

计算机生成了可选文字:

然后根据算法填充完没有渲染的pixel区域

bgt_6_19

计算机生成了可选文字: Average 2 neig ors (Average across dia 1/16 1/8 1/16 1/8 1/4 1/8 1/16 1/8 1/16 Optimized Bilinear Samples Weights near to far: 0.375, 0.375, 0.125, 0.125 Weights near to far: 0.5, 0.28125, 0.09375, 0.09375, 0.03125

左边是理论上的算法,右边是可以根据左边方法直接生成权值模版套用。

总结下步骤:

首先是渲染间隔的2*2pixel块,然后就是套用filter来填充其他pixel

这种方式在Aperture Robot Repair的例子里面减少了5-15%的性能消耗。对于低端的GPU来说这种方式特别有效。


Reprojection

如果你的引擎达不到目标的帧率,那么VR系统就应该通过reproject上一帧来设置这一帧的结果。

reproject包括

  • Rotation-only reprojection
  • Position & rotation reprojection

但特别需要注意的是:这里的这种reprojection的方式看作是帧率的最后的安全网,只有在你的GPU性能都低于你的应用程序最低性能要求的前提下才去使用。

Rotation-Only Reprojection

两张前后序的渲染结果对应位置求平均得到的图片会存在judder

bgt_6_20

计算机生成了可选文字:

judder的存在包括很多原因,比如包括相机移动,动画,对象移动等等。

这里judder存在的一个很大的原因就是对相机的模拟方式不够准确。

首先rotation projection应该是以眼睛为中心的,而不是以头部中心为中心的,不然会导致旋转与用户感知的旋转不一致。

bgt_6_21计算机生成了可选文字:

其次需要考虑的是两眼间的间距,两眼间距如果不是和戴头盔的人眼的间距不一致,也就是旋转半径不同,这样得到的旋转结果也和用户的感受不一致。

bgt_6_22计算机生成了可选文字:

但是综合考虑的话,rotation-onlyreprojection可以说已经足够好用,相比起掉帧来说。

Positional Reprojection

仍然是一个没有解决的问题。

  • 传统的渲染方式只会保留一个深度的内容,因此对于半透明的reprojection来说是一种挑战,特别是粒子系统的使用后的positional reprojection
  • MSAA depth buffer已经存了现有的颜色,再当深度信息到达存储的时候可能会导致渗色。
  • 用户的移动会导致看到的内容出现缺漏,补洞算法也是一个挑战。

Asynchronous Reprojection

作者提出的新的概念,理想的安全网

首先这种方式需要GPU可以精确的被抢占(抢占粒度的控制),当前的GPU理论上可以在draw call之间被抢占,但实际上是看GPU现有的功能。

异步的一大问题还是在于不能保证reproject在一次vsync之内完成,而如果完不成就没有任何意义。

作为应用程序如果想要使用异步时间扭曲,必须注重抢占粒度

“You can split up the screen into tiles and run the post processing on each tile in a separate draw call. That way, you provide the opportunity for asynctimewarp to come in and preempt in between those draws if it needs to.” –“VRDirect”,Nathan Reed, GDC 2015

Interleaved Reprojection

对于老旧的GPU来说是不支持异步reprojection的,因为没有GPU抢占功能,这时候我们就需要寻找替代方案。

如果你的系统不支持 always-on asynchronous reprojection 功能, OpenVR API 提供 every-other-frame rotation-only reprojection 的功能。这模式下应用程序可以获得18ms的时间来渲染一张frame。这种模式对于保证帧率来说是很好的交易:

“In our experience, ATW should run at a fixed fraction of the game frame rate. For example, at 90Hz refresh rate, we should either hit 90Hz or fall down to the half-rate of 45Hz with ATW. This will result in image doubling, but the relative positions of the double images on the retina will be stable. Rendering at an intermediate rate, such as 65Hz, will result in a constantly changing number and position of the images on the retina, which is a worse artifact.” –“Asynchronous Timewarp Examined”, Michael Antonov, Oculus blog, March, 2015


Adaptive Quality

保证帧率是非常困难的一件事情。VR相对于传统游戏来说的挑战在于:

  • 用户对相机的精细控制
  • 用户与游戏世界的新的交互模型

这里作者有提到他们为了让Robor Repair达到目标帧率的经历是整个项目中最难的最累的部分精力。为了让用户在任意角度观看和操作都达到90帧的帧率来微调内容和渲染是最痛苦的。

动态的质量变化就是根据当前GPU的能力动态的调整渲染质量来保证帧率。

  • Goal #1: Reduce the chances of dropping frames and reprojecting
  • Goal #2: Increase quality when there are idle GPU cycles

那么首先考虑VR层面哪些渲染设置是可以调整的:

  • Rendering resolution/ viewport
  • MSAA 层数 抗锯齿算法
  • Fixed Foveated Rendering (第二部分的内容)
  • Radial Density Masking(第二部分的内容)
  • Etc.

而有些渲染设置是不可以调整的:

  • 阴影
  • 视觉效果,比如镜面

作者他们使用的一个动态调整质量的例子:

bgt_6_23

计算机生成了可选文字: Defa U Quality Level +6 +5 +3 +2 1 2 3 _4 MSAA 8x 4x Resolution Scale 1.4 1.3 1.2 1.1 1.0 1.1 1.0 0.9 0.81 0.73 0.65 Radial Density Masking On Render Resolution 2116x2352 1965x2184 1814x2016 1663x1848 1512x1680 1663x1848 1512x1680 1360x1512 1224x1360 1102x1224 992xı 102

这里作者展示了一段视频来说明渲染质量之间的切换,上面的拉条标识的就是当前的渲染质量。

bgt_6_24

计算机生成了可选文字: AT-D179b

在自动调整渲染质量的过程中最关键的就是要去衡量GPU的工作负载。

VR系统里面GPU的工作量是变化的,来源于lens distortion, chromatic aberration, chaperone bounds, overlays, etc.

我们需要了解的是整个VR system的时间流,OpenVR提供了对于all GPU workstotal GPU timer的显示:

bgt_6_25计算机生成了可选文字: VSync Application Rendering Start Timer VR System Rendering VSync Time Remainin End Timer

GPU Timers 存在延时

  • GPU 查询到的是前一个frame的结果
  • 在处理队列中的一两个帧也是不能被修改的

下图展示的就是GPU工作流,一帧的内容从CPUGPU一起处理完是横跨几个Vsync的,因此你要修改的瞬间前已经进入CPU处理的还是会被渲染出来,也就是上面第二点会延迟一两个帧再轮到你就该的结果frame的显示。

关于第一点就是说你在当前帧没处理完提交前的查询,查询的是渲染buffer的内容,就是上一帧提交的结果。

bgt_6_26计算机生成了可选文字: Get ms 1 ms CPU GPU vsync submit D3D call$ 11.11 ms ende 1 VSync Game Simu aor Timer 11.11 ms VSync ame Simu atio submit D3D calls Rende VSync Submit D3D all 11.11 end

作者动态调整渲染级别的细节:三条规则

目标:维持GPU 70-90%的利用率

高于 90%:主动降低两个渲染级别

低于 70%:主动提高一个渲染级别

预测 85% + 线性增长:主动降低两个渲染级别,因为处理延时存在2frame,因此需要提前预测。

10%的空闲GPU流出来可以应对其他进程对于GPU的需求或者是一些系统或其他的突发请求,是非常有用的。

因此这里我们要求每一帧的渲染时间控制在10ms以内,而不是去年提出来的11.11ms

这里还有一个问题需要注意的就是当resolution scalar下降太过的时候会导致文本不宜阅读,因此对于GPU性能很差的时候,我们建议开启 Interleaved Reprojection Hint 来维持帧率。

因此在Aperture Robot Repair例子里面我们提供了两套Adaptive Quality的选择。

bgt_6_27计算机生成了可选文字: Option A +6: 8xMSAA, +5: 8xMSAA, +4: 8xMSAA, +3: 8xMSAA, +2: 8xMSAA, +1: 4xMSAA, 1.4x res 1.3x res 1.2x res 1.1x res 1.0x res 1.1x res 1.0x resolution (Default) Option B — Text-friendly +6: 8xMSAA, +5: 8xMSAA, +4: 8xMSAA, +3: 8xMSAA, +2: 8xMSAA, +1: 4xMSAA, O: 4xMSAA, -1: 4xMSAA, 0.9x res -2: 4xMSAA, 0.81x res -3: 4xMSAA, 0.73x res -4: 4xMSAA, 0.65x res, O: 4xMSAA, -1: 4xMSAA, -2: 4xMSAA, -3: 4xMSAA, 1.4x res 1.3x res 1.2x res 1.1x res 1.0x res 1.1x res 1.0x resolution (Default) 0.9x res 0.81x res 0.81x res, Interleaved Reprojection Hint Radial Density Masking

【这部分需要再看下视频】

还有一个需要注意的问题是GPU内存,这也是我们选择光圈大小的依据之一。

bgt_6_28计算机生成了可选文字: Scalar 2.0 2.0 776 мв Aperture 1.4 342 мв 684 мв 1.2 502 мв 1.0 348 мв 1.1 117 мв 234 мв 1.0 194 мв 128 мв MSAA 8х Resolution 3024х3360 3024х3360 2116х2352 1814х2016 1512х1680 1663х1848 1512х1680 1224х1360 GPU Метогу 1 Еуе = Color + Depth + Resolve GPU Метогу 2 Eyes = Color + Depth + Resolve 1,396 мв 0.81 698 мв 388 мв 251 мв 174 мв 97 мв 64 МВ

Aperture allocates both a 1.4 8xMSAA and a 1.1 4xMSAA render target per eye for a total of 342 MB + 117 MB = 459 MB per eye (918 MB 2 eyes)! So we use sequential rendering to share the render target and limit resolution to 1.4x for 4 GB GPUs.

bgt_6_29计算机生成了可选文字: Scalar MSAA 8х Aperture 2.0 2.0 776 мв 1.4 684 мв 1.2 502 мв 1.0 348 мв 1.1 117 мв 234 мв 1.0 194 мв 128 мв Resolution 3024х3360 3024х3360 2116х2352 1814х2016 1512х1680 1663х1848 1512х1680 1224х1360 GPU Метогу 1 Еуе = Color + Depth + Resolve 698 мв GPU Метогу 2 Eyes = Color + Depth + Resolve 1,396 мв 0.81 388 мв 342 мв 251 мв 174 мв 97 мв 64 МВ

For a 2.0 resolution scalar, we require 698 MB + 117 MB = 815 MB per eye.

Valve’s Unity Rendering Plugin

Valve unity中使用的是自定义的渲染插件,该插件即将免费开放给大家用且开源。

The plugin is a single-pass forward renderer (because we want 4xMSAA and 8xMSAA) supporting up to 18 dynamic shadowing lights and Adaptive Quality

CPU GPU 性能解耦

前提条件是你的渲染线程需要自治。

如果你的CPU还没有准备好新的一帧的渲染内容,那么渲染线程根据HMD pose信息和dynamic resolution的设置信息修改并重新提交上一帧的GPU工作任务给GPU

【这边讲动画judder去除的需要再看下视频】

Then you can plan to run your CPU at 1/2 or 1/3 GPU framerate to do more complex simulation or run on lower end CPUs


总结

  • Multi-GPU support should be in all VR engines (at least 2-GPUs)
  • Fixed Foveated Rendering and Radial Density Masking are solutions that help counteract the optics vs projection matrix battle
  • Adaptive Quality scales fidelity up and down while leaving 10% of the GPU available for other processes. Do not rely on reprojection to hit framerate on your min spec!
  • Valve VR Rendering Plugin for Unity will ship free soon
  • Think about how your engine can decouple CPU and GPU performance with resubmission on your render thread

Optimizing the Unreal Engine 4 Renderer for VR

https://developer.oculus.com/blog/introducing-the-oculus-unreal-renderer/

 

For Farlands, the Oculus team wrote an experimental, fast, single-pass forward renderer for Unreal Engine. It’s also used in Dreamdeck and the Oculus Store version of Showdown. We’re sharing the renderer’s source as a sample to help developers reach higher quality levels and frame rates in their own applications. As of today, you can get it as an Unreal developer from https://github.com/Oculus-VR/UnrealEngine/tree/4.11-ofr.

【Oculus团队写了一个试验性的,快速的,单pass forward renderer的unreal engine工具,在这里我们分享出来见github,这工具已经应用在了Dreamdecks等Oculus应用上了】

 

Rendering immersive VR worlds at a solid 90Hz is complex and technically challenging. Creating VR content is, in many ways, unlike making traditional monitor-only content—it brings us a stunning variety of new interactions and experiences, but forces developers to re-think old assumptions and come up with new tricks. The recent wave of VR titles showcase the opportunities and ingenuity of developers.

【渲染沉浸式的VR世界保证帧率是一件非常有挑战性的事情。渲染VR内容不像是传统的显示器渲染,交互的创新带来了很多改变。这对于渲染来说带来的就是去重新审视过去的一些技术的选择,想说的就是适合屏幕渲染的技术不一定还继续适合VR渲染,这里重新来考虑一些技术的比较。】

 

As we worked, we re-evaluated some of the traditional assumptions made for VR rendering, and developed technology to help us deliver high-fidelity content at 90Hz. Now, we’re sharing some results: an experimental forward renderer for Unreal Engine 4.11.

【我们的工作就是来重新考虑这些旧有技术对于VR的价值,下面就是分享一些实验结果。】

 

We’ve developed the Oculus Unreal Renderer with the specific constraints of VR rendering in mind. It lets us more easily create high-fidelity, high-performance experiences, and we’re eager to share it with all UE4 developers.

【我们开发了一个独立的VR内容渲染器,可以获得更高效的渲染结果,见github.】

 

Background

 

As the team began production on Farlands, we took a moment to reflect on what we learned with the demo experiences we showed at Oculus Connect, GDC, CES, and other events. We used Unreal Engine 4 exclusively to create this content, which provided us with an incredible editing environment and a wealth of advanced rendering features.

【我们团队是使用Unreal开发Farlands的,相关内容已经在各大展会分享过,不作具体介绍】

 

Unfortunately, the reality of rendering to Rift meant we’d only been able to use a subset of these features. We wanted to examine those we used most often, and see if we could design a stripped-down renderer that would deliver higher performance and greater visual fidelity, all while allowing the team to continue using UE4’s world-class editor and engine. While the Oculus Unreal Renderer is focused on the use cases of Oculus applications, it’s been retrofit into pre-existing projects (including Showdown and Oculus Dreamdeck) without needing major content work. In these cases, it delivered clearer visuals, and freed up enough GPU headroom to enable additional features or increase resolution 15-30%.

【Ue4很好用但是相对来说渲染性能对于VR程序来说还有可以针对性优化的空间来提升效率并获得更好的渲染结果】

bgt_5_1

Comparison at high resolution: The Oculus Unreal Renderer runs at 90fps while Unreal’s default deferred renderer is under 60fps.

【oculus采用 forward 渲染效率秒杀Unreal 默认的 defered渲染】

 

The Trouble With Deferred VR

 

【这边相关的基础知识可以见Base里面讲述forward/defered rendering的内容】

 

Unreal Engine is known for its advanced rendering feature set and fidelity. So, what was our rationale for changing it for VR? It mostly came down our experiences building VR content, and the differences rendering to a monitor vs Rift.

【UE本身包含大量功能,我们要做的就是选择合适的应用到VR渲染。】

 

When examining the demos we’d created for Rift, we found most shaders were fairly simple and relied mainly on detailed textures with few lookups and a small amount of arithmetic. When coupled with a deferred renderer, this meant our GBuffer passes were heavily texture-bound—we read from a large number of textures, wrote out to GBuffers, and didn’t do much in between.

【VR更高的分辨率要求如果采用defered rendering带来的是对GBuffer数据传输的超高要求】

 

We also used dynamic lighting and shadows sparingly and leaned more heavily on precomputed lighting. In practice, switching to a renderer helped us provide a more limited set of features in a single pass, yielded better GPU utilization, enabled optimization, removed bandwidth overhead, and made it easier for us to hit 90 Hz.

【我们尽量少的使用动态光照和阴影,取而代之的是使用预计算光照。在使用中使用我们提供的渲染器限制了single pass的一些功能,开启了必要的优化关闭了大量的无效功能,最终有助于提升帧率。】

 

We also wanted to compare hardware accelerated multi-sample anti-aliasing (MSAA) with Unreal’s temporal antialiasing (TAA). TAA works extremely well in monitor-only rendering and is a very good match for deferred rendering, but it causes noticeable artifacts in VR. In particular, it can cause judder and geometric aliasing during head motion. To be clear, this was made worse by some of our own shader and vertex animation tricks. But it’s mostly due to the way VR headsets function.

【我们还想要比较的是硬件加速的MSAA和unreal提供的TAA的效果。】

【TAA对于显示器终端的渲染效果非常好且可以很好的配合deferred rendering,但是在VR渲染中使用明显让人感觉到假像。在head motion的过程中会导致judder和geometric aliasing. 】

 

Compared to a monitor, each Rift pixel covers a larger part of the viewer’s field of view. A typical monitor has over 10 times more pixels per solid angle than a VR headset. Images provided to the Oculus SDK also pass through an additional layer of resampling to compensate for the effects of the headset’s optics. This extra filtering tends to slightly over-smooth the image.

【相比较显示器,头盔的每一个像素覆盖的真实范围视觉比较大。Oculus SDK通过一额外的层来resampling补偿来使得最终的效果更平滑】

 

All these factors together contribute to our desire to preserve as much image detail as possible when rendering. We found MSAA to produce sharper, more detailed images that we preferred.

【所有的这些都是为了使最终的渲染效果更加的细腻,而我们发现MSAA提供的效果更佳的shaper,可以保留更多的细节。】

bgt_5_2

Deferred compared with forward. Zoom in to compare.

 

A Better Fit With Forward

 

Current state-of-the-art rendering often leverages(杠杆) screen-space effects, such as screen-space ambient occlusion (SSAO) and screen-space reflections (SSR). Each of these are well known for their realistic and high-quality visual impact, but they make tradeoffs that aren’t ideal in VR. Operating purely in screen-space can introduce incorrect stereo disparities (differences in the images shown to each eye), which some find uncomfortable. Along with the cost of rendering these effects, this made us more comfortable forgoing support of those features in our use case.

【现在的渲染方式通过采用屏幕空间的一些方式来达到更好的效果,比如SSAO,SSR. 但是这些方法都无法直接在VR渲染上面采用。】

 

Our decision to implement a forward renderer took all these considerations into account. Critically, forward rendering lets us use MSAA for anti-aliasing, adds arithmetic(算数) to our texture-heavy shaders (and removes GBuffer writes), removes expensive full-screen passes that can interfere with(干扰) asynchronous timewarp, and—in general—gives us a moderate speedup over the more featureful deferred renderer. Switching to a forward renderer has also allowed the easy addition of monoscopic(单视场) background rendering, which can provide a substantial performance boost for titles with large, complex distant geometry. However, these advantages come with tradeoffs that aren’t right for everyone. Our aim is to share our learnings with VR developers as they continue fighting to make world-class content run at 90Hz.

【我们决定采用一种把上面这些因素考虑在内的forward renderer。 采用MSAA,texture-heavy shader,去掉了full-screen passes(会干扰异步timewarp),还有增加了forward renderer  支持的 monoscopic(单视场) background rendering(就是说原理相机的背景部分不用渲染两次,而是渲染一次同时提交给左右眼,Oculus的SDk里面有。)】

 

Our implementation is based on Ola Olsson’s 2012 HPG paper, Clustered Deferred and Forward Shading. Readers familiar with traditional forward rendering may be concerned about the CPU and GPU overhead of dynamic lights when using such a renderer. Luckily, modern approaches to forward lighting do not require additional draw calls: All geometry and lights are rendered in a single pass (with an optional z-prepass). This is made possible by using a compute shader to pre-calculate which lights influence 3D “clusters” of the scene (subdivisions of each eye’s viewing frustum, yielding a frustum-voxel grid). Using this data, each pixel can cheaply determine a list of lights that has high screen-space coherence, and perform a lighting loop that leverages the efficient branching capability of modern GPUs. This provides accurate culling and efficiently handles smaller numbers of dynamic lights, without the overhead of additional draw calls and render passes.

【这里的实现是 forward+ 的方法,具体内容见2012年的论文,相关基本的概念见我总结的三种渲染方式的比较。这边后面讲的就是forward+的基本原理:通过与处理来挑选对每个pixel有较大影响的光源,在后面处理的时候只考虑这几个光照,就是light-culling的意思。】

bgt_5_3

(Visualization of 3D light grid, illustrating the lighting coherence and culling)

 

Beyond the renderer, we’ve modified UE4 to allow for additional GPU and CPU optimizations. The renderer is provided as an unmaintained sample and not an officially-supported SDK, but we’re excited to give projects using Unreal Engine’s world-class engine and editor additional options for rendering their VR worlds.

【我们搞了个UE4的版本大家可以试试。】

 

You can grab it today from our Github repository as an Unreal Developer at https://github.com/Oculus-VR/UnrealEngine/tree/4.11-ofr. To see it in action, try out Farlands, Dreamdeck, and Showdown.

Object Space Ambient Occlusion for Molecular Dynamics (OSAO)

http://http.developer.nvidia.com/GPUGems2/gpugems2_chapter14.html

Michael Bunnell

NVIDIA Corporation


In this chapter we describe a new technique for computing diffuse light transfer and show how it can be used to compute global illumination for animated scenes. Our technique is efficient enough when implemented on a fast GPU to calculate ambient occlusion and indirect lighting data on the fly for each rendered frame. It does not have the limitations of precomputed radiance transfer(光辐射传输) (PRT) or precomputed ambient occlusion techniques, which are limited to rigid objects that do not move relative to one another (Sloan 2002). Figure 14-1 illustrates how ambient occlusion and indirect lighting enhance environment lighting.

【这里介绍一种高效的基于GPU运算的ambient occlusion技术。这里突破了一般预计算方式的只可应用于静态对象的局限。】

bgt_4_114_ambient_occlusion_01.jpg

Figure 14-1 Adding Realism with Ambient Occlusion and Indirect Lighting

Our technique works by treating polygon meshes as a set of surface elements that can emit, transmit, or reflect light and that can shadow each other. This method is so efficient because it works without calculating the visibility of one element to another. Instead, it uses a much simpler and faster technique based on approximate shadowing to account for occluding (blocking) geometry.

【我们的技术把多边形表面看作是一组表面的单元集合,他们之间可以emit, transmit, reflect shadow,通过这样的近似可以简单快速的获得起到阻塞效果的几何的形状。】


14.1 Surface Elements

The first step in our algorithm is to convert the polygonal data to surface elements to make it easy to calculate how much one part of a surface shadows or illuminates another.

【这里算法的第一步就是将多边形数据转化成surface elements。】

Figure 14-2 illustrates the basic concept. We define a surface element as an oriented disk with a position, normal, and area. An element has a front face and a back face. Light is emitted and reflected from the front-facing side. Light is transmitted and shadows are cast from the back. We create one element per vertex of the mesh. Assuming that the vertices are defined with a position and normal already, we just need to calculate the area of each element. We calculate the area at a vertex as the sum of one-third of the area of the triangles that share the vertex (or one-fourth of the area for quads). Heron’s formula for the area of a triangle with sides of length a, b, and c is:

bgt_4_2ch14_eqn001.jpg

where s is half the perimeter of the triangle: (a + b + c)/2.

【下图展示的就是这一步的概念示意,surface element定义成圆形表面包含位置/法向/area信息。 surface包含正反面,光线从正面emit/reflect,反面形成transmit/shadow

对于多边形的每一个顶点生成一个surface element. 顶点的位置法线直接赋予surface elementarea的计算由使用到这个顶点的三角形的面积总和的三分之一,计算公式如上。】

bgt_4_314_ambient_occlusion_02.jpg

Figure 14-2 Converting a Polygonal Mesh to Elements

We store element data (position, normal, and area) in texture maps because we will be using a fragment program (that is, a pixel shader) to do all the ambient occlusion calculations. Assuming that vertex positions and normals will change for each frame, we need to be able to change the values in the texture map quickly.

One option is to keep vertex data in a texture map from the start and to do all the animation and transformation from object space to eye (or world) space with fragment programs instead of vertex programs. We can use render-to-vertex-array to create the array of vertices to be sent down the regular pipeline, and then use a simple pass-through vertex shader.

Another, less efficient option is to do the animation and transformation on the CPU and load a texture with the vertex data each frame.

【我们需要把surface elementposition/normal/area的信息存储到texture用于pixel shader. 假设顶点的位置法线是每个frame都变化的,因此我们需要快速改变texture的值。

一种可行的方案是一直保持一开始的时候的顶点信息,之后动画的变化完全由eye space/pixel shader来替代object space/vertex shader的处理,然后render to vertex array生成顶点数组,再交由正常的流水线再处理,之后就是一个简单的vertex shader可以搞定了。

另外一种低效的解决方案是在CPU上面处理动画变化生成texture的方式。】


14.2 Ambient Occlusion

Ambient occlusion is a useful technique for adding shadowing to diffuse objects lit with environment lighting. Without shadows, diffuse objects lit from many directions look flat and unrealistic. Ambient occlusion provides soft shadows by darkening surfaces that are partially visible to the environment. It involves calculating the accessibility value, which is the percentage of the hemisphere above each surface point not occluded by geometry (Landis 2002). In addition to accessibility, it is also useful to calculate the direction of least occlusion, commonly known as the bent normal. The bent normal is used in place of the regular normal when shading the surface for more accurate environment lighting.

AO解释,在对象表面生成软阴影可以有效的提高真实感。】

We can calculate the accessibility(辅助) value at each element as 1 minus the amount by which all the other elements shadow the element. We refer to the element that is shadowed as the receiver and to the element that casts the shadow as the emitter. We use an approximation based on the solid angle of an oriented disk to calculate the amount by which an emitter element shadows a receiver element. Given that A is the area of the emitter, the amount of shadow can be approximated by:

bgt_4_4ch14_eqn002.jpg

Equation 14-1 Shadow Approximation

【计算辅助值:1减去所有其他element在此的阴影。Element 作为接收者 shadowed,作为发光者造成阴影。因为发光者和接收阴影着的角度都是已知的,,我们采用上面的公式来估算,配合下面的示意图。Aemitter的面积。】

As illustrated in Figure 14-3, qE is the angle between the emitter’s normal and the vector from the emitter to the receiver. qR is the corresponding angle for the receiver element. The max(1, 4 x cos qR ) term is added to the disk solid angle formula to ignore emitters that do not lie in the hemisphere above the receiver without causing rendering artifacts for elements that lie near the horizon.

【这一段在解释变量含义】

bgt_4_514_ambient_occlusion_03.jpg

Figure 14-3 The Relationship Between Receiver and Emitter Elements

Here is the fragment program function to approximate the element-to-element occlusion:

【下面是计算函数的实现】

bgt_4_6计算机生成了可选文字: El em entShadow (Elo at3 oat oat El oats e as sme that rs quar e d, o at3 oat (1 emittuArea has already divided Ey r qrt(emi t terkrea/rSquared dot


14.2.1 The Multipass Shadowing Algorithm

We calculate the accessibility values(辅助值) in two passes.

【这里计算包含两个pass

In the first pass, we approximate the accessibility for each element by summing the fraction(分数) of the hemisphere(半球) subtended(对着) by every other element and subtracting(减法) the result from 1.

【第一个pass是根据上面的公式来近似计算每一个element的分数】

After the first pass, some elements will generally be too dark because other elements that are in shadow are themselves casting shadows. So we use a second pass to do the same calculation, but this time we multiply each form factor by the emitter element’s accessibility from the last pass.

【经过第一步会导致有些elements太暗了,原因在于存在投影的过度叠加。因此第二个pass做同样的计算,但是这里我们乘上每一个emitter elements的上一步计算出来的辅助值。】

The effect is that elements that are in shadow will cast fewer shadows on other elements, as illustrated in Figure 14-4. After the second pass, we have removed any double shadowing.

【效果如下图所示,通过第二步我们解决的是double shadowing导致的太暗的问题】

However, surfaces that are triple shadowed or more will end up being too light. We can use more passes to get a better approximation, but we can approximate the same answer by using a weighted average of the combined results of the first and second passes. Figure 14-5 shows the results after each pass, as well as a ray-traced solution for comparison. The bent normal calculation is done during the second pass. We compute the bent normal by first multiplying the normalized vector between elements and the form factor. Then we subtract this result from the original element normal.

【其实通过上面的两步还是得不到很好的结果,比如第二步只去除的是双重叠加的效果,如果是三重叠加我们还需要更进一步的 pass来去除叠加效果,这是个无底洞。 因此我们采用对第二步的结果再设置权重值的方式来获得更好的近似效果,下下图就是结果展示。】

bgt_4_714_ambient_occlusion_04.jpg

Figure 14-4 Correcting for Occlusion by Overlapping Objects

bgt_4_814_ambient_occlusion_05.jpg

Figure 14-5 Comparing Models Rendered with Our Technique to Reference Images

We calculate the occlusion result by rendering a single quad (or two triangles) so that one pixel is rendered for each surface element. The shader calculates the amount of shadow received at each element and writes it as the alpha component of the color of the pixel. The results are rendered to a texture map so the second pass can be performed with another render. In this pass, the bent normal is calculated and written as the RGB value of the color with a new shadow value that is written in the alpha component.

【每一个pass,一个surface element当作一个pixel来处理,这样shader将每个element计算得到的阴影值作为这个pixelalpha值,结果渲染到texture map,这样就可以用于下一个passnormal值当作textureRGB分量参与计算。】


14.2.2 Improving Performance

Even though the element-to-element shadow calculation is very fast (a GeForce 6800 can do 150 million of these calculations per second), we need to improve our algorithm to work on more than a couple of thousand elements in real time. We can reduce the amount of work by using simplified geometry for distant surfaces. This approach works well for diffuse lighting environments because the shadows are so soft that those cast by details in distant geometry are not visible. Fortunately, because we do not use the polygons themselves in our technique, we can create surface elements to represent simplified geometry without needing to create alternate polygonal models. We simply group elements whose vertices are neighbors in the original mesh and represent them with a single, larger element. We can do the same thing with the larger elements, creating fewer and even larger elements, forming a hierarchy. Now instead of traversing every single element for each pixel we render, we traverse the hierarchy of elements. If the receiver element is far enough away from the emitter—say, four times the radius of the emitter—we use it for our calculation. Only if the receiver is close to an emitter do we need to traverse its children (if it has any). See Figure 14-6. By traversing a hierarchy in this way, we can improve the performance of our algorithm from O(n 2) to O(n log n) in practice. The chart in Figure 14-7 shows that the performance per vertex stays consistent as the number of vertices in the hierarchy increases.

【其实这样的element to element(pixel to pixel)的计算已经很快了。我们要增强我们的算法来尽可能多的支持顶点(element/pixel)数。这里的想法就是通过空间几何关系,相邻的一些定点可以组合当作一个element group(计算的时候当作一个element)来处理,然后起作用再细分,就是一般层次化的方法。】

bgt_4_914_ambient_occlu_06.jpg

Figure 14-6 Hierarchical Elements

bgt_4_1014_ambient_occlu_07.jpg

Figure 14-7 Ambient Occlusion Shader Performance for Meshes of Different Densities

【性能图示】

We calculate a parent element’s data using its direct descendants in the hierarchy. We calculate the position and normal of a parent element by averaging the positions and normals of its children. We calculate its area as the sum of its children’s areas. We can use a shader for these calculations by making one pass of the shader for each level in the hierarchy, propagating the values from the leaf nodes up. We can then use the same technique to average the results of an occlusion pass that are needed for a following pass or simply treat parent nodes the same as children and avoid the averaging step. It is worth noting that the area of most animated elements varies little, if at all, even for nonrigid objects; therefore, the area does not have to be recalculated for each frame.

【这里交代父节点(高层次)的数据来源】

The ambient occlusion fragment shader appears in Listing 14-1.

【下面是完整的shader

bgt_4_11计算机生成了可选文字: Ambi entOcc1usi o at4 o at4 o at4 i for m 1 as tRe tMap i for m posi ti onMap i for m i for m oat o at4 o at4 o at3 o at3 o at3 oat o at4 or-mal = rNorma_1 rmnurl el em entNorm rEKUNIT3) nom COL posi t 1 vector receiva t used to calculate El at3 bent N oat oat H th recenvu nonal

 bgt_4_12

计算机生成了可选文字: (em t terlndewx texREcr om XYZ not shed Yaversal , emi t terlnd xy) emi t terlnd xy) ge t t calc squued. r qrt(d2). value = texREcr s receiver close to puent < —4*emitterkrea) (p s hav e go eruchy em 1 t terkre E, rNorma_1, eNorma_L modulate normal by last remlt bentNorma1 value total value; (1 only need normal for last else retu-n Eloat4 normali ze(bentNorma1),

Example 14-1. Ambient Occlusion Shader


14.3 Indirect Lighting and Area Lights

We can add an extra level of realism to rendered images by adding indirect lighting caused by light reflecting off diffuse surfaces (Tabellion 2004). We can add a single bounce of indirect light using a slight variation of the ambient occlusion shader. We replace the solid angle function with a disk-to-disk radiance transfer function. We use one pass of the shader to transfer the reflected or emitted light and two passes to shadow the light.

【直接光照和间接光照的阴影结果我们通过一个shader将结果合到一起。】

For indirect lighting, first we need to calculate the amount of light to reflect off the front face of each surface element. If the reflected light comes from environment lighting, then we compute the ambient occlusion data first and use it to compute the environment light that reaches each vertex. If we are using direct lighting from point or directional lights, we compute the light at each element just as if we are shading the surface, including shadow mapping. We can also do both environment lighting and direct lighting and sum the two results. We then multiply the light values by the color of the surface element, so that red surfaces reflect red, yellow surfaces reflect yellow, and so on. Area lights are handled just like light-reflective diffuse surfaces except that they are initialized with a light value to emit.

【这里解释怎么合兵:首先我们要得到直接光照的结果和OSAO的结果,直接光照结果的计算来自于一般的光照计算方法方法shadow map。亮度就是两种光照结果只和,颜色就是光线颜色。面积光就当作是发光表面来处理。】

Here is the fragment program function to calculate element-to-element radiance transfer:

element-to-element radiance transfer处理的代码片段】

bgt_4_13

计算机生成了可选文字: oat El o at3 El oats e oat o at3 oat that ttuArea has vided by PI ate ( dot

bgt_4_14ch14_eqn003.jpg

Equation 14-2 Disk-to-Disk Form Factor Approximation

We calculate the amount of light transferred from one surface element to another using the geometric term of the disk-to-disk form factor given in Equation 14-2. We leave off the visibility factor, which takes into account blocking (occluding) geometry. Instead we use a shadowing technique like the one we used for calculating ambient occlusion—only this time we use the same form factor that we used to transfer the light. Also, we multiply the shadowing element’s form factor by the three-component light value instead of a single-component accessibility value.

【我们使用上面的公式来计算光线从一个element transfer 到另一个。也就是说我们这里用了OSAO那种思想来做光线的传播。】

We now run one pass of our radiance-transfer shader to calculate the maximum amount of reflected or emitted light that can reach any element. Then we run a shadow pass that subtracts from the total light at each element based on how much light reaches the shadowing elements. Just as with ambient occlusion, we can run another pass to improve the lighting by removing double shadowing. Figure 14-8 shows a scene lit with direct lighting plus one and two bounces of indirect lighting.

【我们首先用一个pass来跑radiance-transfer shader来计算element之间的光线的发出和反射来得到每一个element的光线总和,接着跑shadow pass:从到达element的光线总和的结果再减去这个pass计算的结果就是AO的结果,处理多重阴影的覆盖问题就是通过多个pass和参数解,见上面的讲解。下图展示结果】

bgt_4_1514_ambient_occlu_08.jpg

Figure 14-8 Combining Direct and Indirect Lighting


14.4 Conclusion

Global illumination techniques such as ambient occlusion and indirect lighting greatly enhance the quality of rendered diffuse surfaces. We have presented a new technique for calculating light transfer to and from diffuse surfaces using the GPU. This technique is suitable for implementing various global illumination effects in dynamic scenes with deformable geometry.

【废话不解释】


14.5 References

Landis, Hayden. 2002. “Production-Ready Global Illumination.” Course 16 notes, SIGGRAPH 2002.

Pharr, Matt, and Simon Green. 2004. “Ambient Occlusion.” In GPU Gems, edited by Randima Fernando, pp. 279–292. Addison-Wesley.

Sloan, Peter-Pike, Jan Kautz, and John Snyder. 2002. “Precomputed Radiance Transfer for Real-Time Rendering in Dynamic, Low-Frequency Lighting Environments.” ACM Transactions on Graphics (Proceedings of SIGGRAPH 2002) 21(3), pp. 527–536.

Tabellion, Eric, and Arnauld Lamorlette. 2004. “An Approximate Global Illumination System for Computer Generated Films.” ACM Transactions on Graphics (Proceedings of SIGGRAPH 2004) 23(3), pp. 469–476.

Screen Space Ambient Occlusion(SSAO)

  • BACKGROUND

Ambient occlusion is an approximation of the amount by which a point on a surface is occluded by the surrounding geometry, which affects the accessibility of that point by incoming light. (主要看是否靠近物体)

In effect, ambient occlusion techniques allow the simulation of proximity shadows – the soft shadows that you see in the corners of rooms and the narrow spaces between objects. (用于模拟软阴影)

Ambien occlusion is often subtle, but will dramatically improve the visual realism of a computer-generated scene:

bgt_3_1

 

The basic idea is to compute an occlusion factor(阻塞要素) for each point on a surface and incorporate(合并) this into the lighting model, usually by modulating the ambient term such that more occlusion = less light, less occlusion = more light. Computing the occlusion factor can be expensive; offline renderers typically do it by casting a large number of rays in a normal-oriented hemisphere to sample the occluding geometry around a point. In general this isn’t practical for realtime rendering.

To achieve interactive frame rates, computing the occlusion factor needs to be optimized as far as possible. One option is to pre-calculate it, but this limits how dynamic a scene can be (the lights can move around, but the geometry can’t).(速度是大问题)

  • CRYSIS METHOD

Way back in 2007, Crytek implemented a realtime solution for Crysis, which quickly became the yardstick for game graphics. The idea is simple: use per-fragment depth information as an approximation of the scene geometry and calculate the occlusion factor in screen space. This means that the whole process can be done on the GPU, is 100% dynamic and completely independent of scene complexity. Here we’ll take a quick look at how the Crysis method works, then look at some enhancements.

Rather than(与其) cast(投射) rays in a hemisphere, Crysis samples the depth buffer at points derived(来源) from samples in a sphere:[在深度buffer以当前点为中心的一个圆内取sample]

bgt_3_2

 

This works in the following way:

  • project each sample point into screen space to get the coordinates into the depth buffer(获得深度图及坐标)
  • sample the depth buffer(取深度图的sample)
  • if the sample position is behind the sampled depth (i.e. inside geometry), it contributes to the occlusion factor(sample平均值小于其本身深度值,则起作用)

Clearly the quality of the result is directly proportional to the number of samples, which needs to be minimized in order to achieve decent performance. Reducing the number of samples, however, produces ugly ‘banding’ artifacts in the result. This problem is remedied by randomly rotating the sample kernel at each pixel, trading banding for high frequency noise which can be removed by blurring the result.

bgt_3_3

The Crysis method produces occlusion factors with a particular ‘look’ – because the sample kernel is a sphere, flat walls end up looking grey because ~50% of the samples end up being inside the surrounding geometry. Concave corners darken as expected, but convex ones appear lighter since fewer samples fall inside geometry. Although these artifacts are visually acceptable, they produce a stylistic effect which strays somewhat from photorealism.

  • NORMAL-ORIENTED HEMISPHERE

Rather than sample a spherical kernel at each pixel, we can sample within a hemisphere, oriented along the surface normal at that pixel. This improves the look of the effect with the penalty of requiring per-fragment normal data. For a deferred renderer, however, this is probably already available, so the cost is minimal (especially when compared with the improved quality of the result).

(改进:去法线方向的半球内的sample)

bgt_3_4

  • Generating the Sample Kernel

The first step is to generate the sample kernel itself. The requirements are that

  • sample positions fall within the unit hemisphere
  • sample positions are more densely clustered towards the origin. This effectively attenuates the occlusion contribution according to distance from the kernel centre – samples closer to a point occlude it more than samples further away

Generating the hemisphere is easy:

This creates sample points on the surface of a hemisphere oriented along the z axis.(先建一个标准半球) The choice of orientation is arbitrary(随意) – it will only affect the way we reorient the kernel in the shader. The next step is to scale each of the sample positions to distribute them within the hemisphere. This is most simply done as:

which will produce an evenly distributed set of points. What we actually want is for the distance from the origin to falloff as we generate more points, according to a curve like this:(权重和距离相关)

bgt_3_5

  • Generating the Noise Texture

Next we need to generate a set of random values used to rotate the sample kernel, which will effectively increase the sample count and minimize the ‘banding’ artefacts mentioned previously.

Note that the z component is zero; since our kernel is oriented along the z-axis, we want the random rotation to occur around that axis.(竟然是random rotation!难道不能是顶点或者面法线更符合实际情况?)

These random values are stored in a texture and tiled over(铺满) the screen. The tiling of the texture causes the orientation of the kernel to be repeated and introduces regularity into the result. By keeping the texture size small we can make this regularity occur at a high frequency, which can then be removed with a blur step that preserves the low-frequency detail of the image. Using a 4×4 texture and blur kernel produces excellent results at minimal cost. This is the same approach as used in Crysis.

  • The SSAO Shader

With all the prep work done, we come to the meat of the implementation: the shader itself. There are actually two passes: calculating the occlusion factor, then blurring the result.

Calculating the occlusion factor requires first obtaining the fragment’s view space position and normal:

I reconstruct the view space position by combining the fragment’s linear depth with the interpolated vViewRay. See Matt Pettineo’s blog for a discussion of other methods for reconstructing position from depth. The important thing is that origin ends up being the fragment’s view space position.

Retrieving(检索) the fragment’s normal is a little more straightforward(直截了当); the scale/bias and normalization steps are necessary unless you’re using some high precision format to store the normals:

Next we need to construct a change-of-basis matrix to reorient our sample kernel along the origin’s normal. We can cunningly(巧妙地) incorporate(合并) the random rotation here, as well:

(这儿可以看作shader 如何使用random数范例)

The first line retrieves a random vector rvec from our noise texture. uNoiseScale is a vec2 which scales vTexcoord to tile the noise texture. So if our render target is 1024×768 and our noise texture is 4×4, uNoiseScale would be (1024 / 4, 768 / 4). (This can just be calculated once when initialising the noise texture and passed in as a uniform).

The next three lines use the Gram-Schmidt process to compute an orthogonal basis, incorporating our random rotation vector rvec.

The last line constructs the transformation matrix from our tangent, bitangent and normal vectors. The normal vector fills the z component of our matrix because that is the axis along which the base kernel is oriented.

Next we loop through the sample kernel (passed in as an array of vec3, uSampleKernel), sample the depth buffer and accumulate the occlusion factor:

Getting the view space sample position is simple; we multiply by our orientation matrix tbn, then scale the sample by uRadius (a nice artist-adjustable factor, passed in as a uniform) then add the fragment’s view space position origin.

We now need to project sample (which is in view space) back into screen space to get the texture coordinates with which we sample the depth buffer. This step follows the usual process – multiply by the current projection matrix (uProjectionMat), perform w-divide then scale and bias to get our texture coordinate: offset.xy.

Next we read sampleDepth out of the depth buffer (uTexLinearDepth). If this is in front of the sample position, the sample is ‘inside’ geometry and contributes to occlusion. If sampleDepth is behind the sample position, the sample doesn’t contribute to the occlusion factor. Introducing a rangeCheck helps to prevent erroneous occlusion between large depth discontinuities:

bgt_3_6

As you can see, rangeCheck works by zeroing any contribution from outside the sampling radius.

The final step is to normalize the occlusion factor and invert it, in order to produce a value that can be used to directly scale the light contribution.

  • The Blur Shader

The blur shader is very simple: all we want to do is average a 4×4 rectangle around each pixel to remove the 4×4 noise pattern:

The only thing to note in this shader is uTexelSize, which allows us to accurately sample texel centres based on the resolution of the AO render target.

bgt_3_7

  • CONCLUSION

The normal-oriented hemisphere method produces a more realistic-looking than the basic Crysis method, without much extra cost, especially when implemented as part of a deferred renderer where the extra per-fragment data is readily available. It’s pretty scalable, too – the main performance bottleneck is the size of the sample kernel, so you can either go for fewer samples or have a lower resolution AO target.

A demo implementation is available here.

Anti-aliasing

抗锯齿(英语:anti-aliasing,简称AA),也译为抗锯齿或边缘柔化、消除混叠、抗图像折叠有损等。它是一种消除显示器输出的画面中图物边缘出现凹凸锯齿的技术,那些凹凸的锯齿通常因为高分辨率的信号以低分辨率表示或无法准确运算出3D图形坐标定位时所导致的图形混叠(aliasing)而产生的,反锯齿技术能有效地解决这些问题。它通常被用在在数字信号处理、数字摄影、电脑绘图与数码音效及电子游戏等方面,柔化被混叠的数字信号。


超级采样抗锯齿(SSAA

超级采样抗锯齿(Super-Sampling Anti-aliasing,简称SSAA)此是早期抗锯齿方法,比较消耗资源,但简单直接,先把图像映射到缓存并把它放大,再用超级采样把放大后的图像像素进行采样,一般选取2个或4个邻近像素,把这些采样混合起来后,生成的最终像素,令每个像素拥有邻近像素的特征,像素与像素之间的过渡色彩,就变得近似,令图形的边缘色彩过渡趋于平滑。再把最终像素还原回原来大小的图像,并保存到帧缓存也就是显存中,替代原图像存储起来,最后输出到显示器,显示出一帧画面。这样就等于把一幅模糊的大图,通过细腻化后再缩小成清晰的小图。如果每帧都进行抗锯齿处理,游戏或视频中的所有画面都带有抗锯齿效果。而将图像映射到缓存并把它放大时,放大的倍数被用于分别抗锯齿的效果,如:图1AA后面的x2x4x8就是原图放大的倍数。 超级采样抗锯齿中使用的采样法一般有两种:

1.顺序栅格超级采样(Ordered Grid Super-Sampling,简称OGSS),采样时选取2个邻近像素。

2.旋转栅格超级采样(Rotated Grid Super-Sampling,简称RGSS),采样时选取4个邻近像素。


多重采样抗锯齿(MSAA

多重采样抗锯齿(MultiSampling Anti-Aliasing,简称MSAA)是一种特殊的超级采样抗锯齿(SSAA)。MSAA首先来自于OpenGL。具体是MSAA只对Z缓存(Z-Buffer)和模板缓存(Stencil Buffer)中的数据进行超级采样抗锯齿的处理。可以简单理解为只对多边形的边缘进行抗锯齿处理。这样的话,相比SSAA对画面中所有数据进行处理,MSAA对资源的消耗需求大大减弱,不过在画质上可能稍有不如SSAA


覆盖采样抗锯齿(CSAA

覆盖采样抗锯齿(CoverageSampling Anti-Aliasing,简称CSAA)是nVidiaG80及其衍生产品首次推向实用化的AA技术,也是目前nVidia GeForce 8/9/G200系列独享的AA技术。CSAA就是在MSAA基础上更进一步的节省显存使用量及带宽,简单说CSAA就是将边缘多边形里需要取样的子像素坐标覆盖掉,把原像素坐标强制安置在硬件和驱动程序预先算好的坐标中。这就好比取样标准统一的MSAA,能够最高效率的执行边缘取样,效能提升非常的显著。比方说16xCSAA取样性能下降幅度仅比4xMSAA略高一点,处理效果却几乎和8xMSAA一样。8xCSAA有着4xMSAA的处理效果,性能消耗却和2xMSAA相同。[1]

NVIDIA已经移除了CSAA,可能这种抗锯齿技术有点落伍了吧,论画质不如TXAA,论性能不如FXAA,而且只有NVIDIA支持,兼容性也是个问题。


可编程过滤抗锯齿(CFAA)

可编程过滤抗锯齿(Custom Filter Anti-Aliasing)技术起源于AMD-ATIR600家庭。简单地说CFAA就是扩大取样面积的MSAA,比方说之前的MSAA是严格选取物体边缘像素进行缩放的,而CFAA则可以通过驱动和谐灵活地选择对影响锯齿效果较大的像素进行缩放,以较少的性能牺牲换取平滑效果。显卡资源占用也比较小。


快速近似抗锯齿(FXAA)

快速近似抗锯齿(Fast Approximate Anti-Aliasing) 它是传统MSAA(多重采样抗锯齿)效果的一种高性能近似值。它是一种单程像素着色器,和MLAA一样运行于目标游戏渲染管线的后期处理阶段,但不像后者那样使用DirectCompute,而只是单纯的后期处理着色器,不依赖于任何GPU计算API。正因为如此,FXAA技术对显卡没有特殊要求,完全兼容NVIDIAAMD的不同显卡(MLAA仅支持A)DX9DX10DX11


时间性抗锯齿(TXAA/TAA

TXAA的原理就是通过HDR后处理管线从硬件层面上提供颜色矫正处理,后期处理的方式实际上原理和FXAA差不多:整合硬件AA以及类似于CG电影中所采用的复杂的高画质过滤器,来减少抗锯齿中出现的撕裂和抖动现象。

但是如果实现比FXAA更强画质以及更流畅的体验,则只能通过游戏的开发上实现TXAA了。所以TXAA是一种后发式的抗锯齿技术,并不像FXAA那样具有通用性,而是通过游戏来进行优化,这样的一种专用性使得TXAA的执行效率是最高的。

所以,TXAA是一种新的抗锯齿,是需要重新研发加入TXAA的代码来支持

TXAA 是一种全新的电影风格抗锯齿技术,旨在减少时间性锯齿 (运动中的蠕动和闪烁) 该技术集时间性过滤器、硬件抗锯齿以及定制的 CG 电影式抗锯齿解算法于一身。 要过滤屏幕上任意特定的像素,TXAA 需要使用像素内部和外部的采样以及之前帧中的采样,以便提供超级高画质的过滤。 TXAA 在标准 2xMSAA 4xMSAA 的基础上改进了时间性过滤。 例如,在栅栏或植物上以及在运动画面中,TXAA 已经开始接近、有时甚至超过了其它高端专业抗锯齿算法的画质。TXAA 由于采用更高画质的过滤,因而与传统 MSAA 较低画质的过滤相比,图像更加柔和。

bgt_2_1


多帧采样抗锯齿(MFAA

NVIDIA(英伟达)根据MSAA改进出的一种抗锯齿技术。目前只有使用麦克斯韦架构GPU的显卡才可以使用。在 Maxwell 上,英伟达推出了用于光栅化的可编程采样位置,它们被存储在随机存取存储器 (RAM) 中。如此一来便为更灵活、更创新的全新抗锯齿技术创造了机会,这类抗锯齿技术能够独特地解决现代游戏引擎所带来的难题,例如高画质抗锯齿对性能的更高要求。只要在NVIDIA控制面板里为程序开启MFAA并在游戏中选择MSAA就可以开启。画面表现明显强于同级别的MSAA,这种全新抗锯齿技术在提升边缘画质的同时能够将性能代价降至最低。通过在时间和空间两方面交替使用抗锯齿采样格式,4xMFAA 的性能代价仅相当于 2xMSAA,但是抗锯齿效果却与 4xMSAA 相当。[3]

支持MFAA的显卡(GPU):GTX TITAN ZGTX TITAN XGTX980TiGTX980GTX970GTX960GTX950[4]

Defered/Forward Rendering

http://www.cnblogs.com/polobymulberry/p/5126892.html

1. rendering path的技术基础

在介绍各种光照渲染方式之前,首先必须介绍一下现代的图形渲染管线。这是下面提到的几种Rendering Path的技术基础。

bgt_1_1

目前主流的游戏和图形渲染引擎,包括底层的API(如DirectXOpenGL)都开始支持现代的图形渲染管线。现代的渲染管线也称为可编程管线(Programmable Pipeline),简单点说就是将以前固定管线写死的部分(比如顶点的处理,像素颜色的处理等等)变成在GPU上可以进行用户自定义编程的部分,好处就是用户可以自由发挥的空间增大,缺点就是必须用户自己实现很多功能。

下面简单介绍下可编程管线的流程。以OpenGL绘制一个三角形举例。首先用户指定三个顶点传给Vertex Shader。然后用户可以选择是否进行Tessellation Shader(曲面细分可能会用到)和Geometry Shader(可以在GPU上增删几何信息)。紧接着进行光栅化,再将光栅化后的结果传给Fragment Shader进行pixel级别的处理。最后将处理的像素传给FrameBuffer并显示到屏幕上。

2. 几种常用的Rendering Path

Rendering Path其实指的就是渲染场景中光照的方式。由于场景中的光源可能很多,甚至是动态的光源。所以怎么在速度和效果上达到一个最好的结果确实很困难。以当今的显卡发展为契机,人们才衍生出了这么多的Rendering Path来处理各种光照。

2.1 Forward Rendering

bgt_1_2

Forward Rendering是绝大数引擎都含有的一种渲染方式。要使用Forward Rendering,一般在Vertex Shader或Fragment Shader阶段对每个顶点或每个像素进行光照计算,并且是对每个光源进行计算产生最终结果。下面是Forward Rendering的核心伪代码[1]。

比如在Unity3D 4.x引擎中,对于下图中的圆圈(表示一个Geometry),进行Forward Rendering处理。

bgt_1_3

将得到下面的处理结果

bgt_1_4

也就是说,对于ABCD四个光源我们在Fragment Shader中我们对每个pixel处理光照,对于DEFG光源我们在Vertex Shader中对每个vertex处理光照,而对于GH光源,我们采用球调和(SH)函数进行处理。

Forward Rendering优缺点

很明显,对于Forward Rendering,光源数量对计算复杂度影响巨大,所以比较适合户外这种光源较少的场景(一般只有太阳光)。

但是对于多光源,我们使用Forward Rendering的效率会极其低下。光源数目和复杂度是成线性增长的。

对此,我们需要进行必要的优化。比如

1.多在vertex shader中进行光照处理,因为有一个几何体有10000个顶点,那么对于n个光源,至少要在vertex shader中计算10000n次。而对于在fragment shader中进行处理,这种消耗会更多,因为对于一个普通的1024×768屏幕,将近有8百万的像素要处理。所以如果顶点数小于像素个数的话,尽量在vertex shader中进行光照。

2.如果要在fragment shader中处理光照,我们大可不必对每个光源进行计算时,把所有像素都对该光源进行处理一次。因为每个光源都有其自己的作用区域。比如点光源的作用区域是一个球体,而平行光的作用区域就是整个空间了。对于不在此光照作用区域的像素就不进行处理。但是这样做的话,CPU端的负担将加重,因为要计算作用区域。

3.对于某个几何体,光源对其作用的程度是不同,所以有些作用程度特别小的光源可以不进行考虑。典型的例子就是Unity中只考虑重要程度最大的4个光源。

2.2 Deferred Rendering

bgt_1_5

Deferred Rendering(延迟渲染)顾名思义,就是将光照处理这一步骤延迟一段时间再处理。具体做法就是将光照处理这一步放在已经三维物体生成二维图片之后进行处理。也就是说将物空间的光照处理放到了像空间进行处理。要做到这一步,需要一个重要的辅助工具——G-Buffer。G-Buffer主要是用来存储每个像素对应的Position,Normal,Diffuse Color和其他Material parameters。根据这些信息,我们就可以在像空间中对每个像素进行光照处理[3]。下面是Deferred Rendering的核心伪代码。

下面简单举个例子。

首先我们用存储各种信息的纹理图。比如下面这张Depth Buffer,主要是用来确定该像素距离视点的远近的。

bgt_1_6

. Depth Buffer

根据反射光的密度/强度分度图来计算反射效果。

bgt_1_7

.Specular Intensity/Power

下图表示法向数据,这个很关键。进行光照计算最重要的一组数据。

bgt_1_8

.Normal Buffer

下图使用了Diffuse Color Buffer。

bgt_1_9

.Diffuse Color Buffer

这是使用Deferred Rendering最终的结果。

bgt_1_10

.Deferred Lighting Results

Deferred Rendering的最大的优势就是将光源的数目和场景中物体的数目在复杂度层面上完全分开。也就是说场景中不管是一个三角形还是一百万个三角形,最后的复杂度不会随光源数目变化而产生巨大变化。从上面的伪代码可以看出deferred rendering的复杂度为 。

但是Deferred Rendering局限性也是显而易见。比如我在G-Buffer存储以下数据

bgt_1_11

这样的话,对于一个普通的1024×768的屏幕分辨率。总共得使用1024x768x128bit=20MB,对于目前的动则上GB的显卡内存,可能不算什么。但是使用G-Buffer耗费的显存还是很多的。一方面,对于低端显卡,这么大的显卡内存确实很耗费资源。另一方面,如果要渲染更酷的特效,使用的G-Buffer大小将增加,并且其增加的幅度也是很可观的。顺带说一句,存取G-Buffer耗费的带宽也是一个不可忽视的缺陷。

对于Deferred Rendering的优化也是一个很有挑战的问题。下面简单介绍几种降低Deferred Rendering存取带宽的方式。最简单也是最容易想到的就是将存取的G-Buffer数据结构最小化,这也就衍生出了light pre-pass方法。另一种方式是将多个光照组成一组,然后一起处理,这种方法衍生了Tile-based deferred Rendering。

2.2.1 Light Pre-Pass

Light Pre-Pass最早是由Wolfgang Engel在他的博客[2]中提到的。具体的做法是

(1)只在G-Buffer中存储Z值和Normal值。对比Deferred Render,少了Diffuse Color Specular Color以及对应位置的材质索引值。

(2)FS阶段利用上面的G-Buffer计算出所必须的light properties,比如Normal*LightDir,LightColor,Specularlight properties。将这些计算出的光照进行alpha-blend并存入LightBuffer(就是用来存储light propertiesbuffer)。

(3)最后将结果送到forward rendering渲染方式计算最后的光照效果。

相对于传统的Deferred Render,使用Light Pre-Pass可以对每个不同的几何体使用不同的shader进行渲染,所以每个物体的material properties将有更多变化。这里我们可以看出相对于传统的Deferred Render,它的第二步(见伪代码)是遍历每个光源,这样就增加了光源设置的灵活性,而Light Pre-Pass第三步使用的其实是forward rendering,所以可以对每个mesh设置其材质,这两者是相辅相成的,有利有弊。另一个Light Pre-Pass的优点是在使用MSAA上很有利。虽然并不是100%使用上了MSAA(除非使用DX10/11的特性),但是由于使用了Z值和Normal值,就可以很容易找到边缘,并进行采样。

下面这两张图,左边是使用传统Deferred Render绘制的,右边是使用Light Pre-Pass绘制的。这两张图在效果上不应该有太大区别。

bgt_1_12

2.2.2 Tile-Based Deferred Rendering

TBDR主要思想就是将屏幕分成一个个小块tile。然后根据这些Depth求得每个tilebounding box。对每个tilebounding boxlight进行求交,这样就得到了对该tile有作用的light的序列。最后根据得到的序列计算所在tile的光照效果。[4][5]

对比Deferred Render,之前是对每个光源求取其作用区域light volume,然后决定其作用的的pixel,也就是说每个光源要求取一次。而使用TBDR,只要遍历每个pixel,让其所属tile与光线求交,来计算作用其上的light,并利用G-Buffer进行Shading。一方面这样做减少了所需考虑的光源个数,另一方面与传统的Deferred Rendering相比,减少了存取的带宽。

2.3 Forward+


Forward+ == Forward + Light Culling[6]Forward+ 很类似Tiled-based Deferred Rendering。其具体做法就是先对输入的场景进行z-prepass,也就是说关闭写入color,只向z-buffer写入z值。注意此步骤是Forward+必须的,而其他渲染方式是可选的。接下来来的步骤和TBDR很类似,都是划分tiles,并计算bounding box。只不过TBDR是在G-Buffer中完成这一步骤的,而Forward+是根据Z-Buffer。最后一步其实使用的是forward方式,即在FS阶段对每个pixel根据其所在tilelight序列计算光照效果。而TBDR使用的是基于G-Bufferdeferred rendering

实际上,forward+deferred运行的更快。我们可以看出由于Forward+只要写深度缓存就可以,而Deferred Render除了深度缓存,还要写入法向缓存。而在Light Culling步骤,Forward+只需要计算出哪些light对该tile有影响即可。而Deferred Render还在这一部分把光照处理给做了。而这一部分,Forward+是放在Shading阶段做的。所以Shading阶段Forward+耗费更多时间。但是对目前硬件来说,Shading耗费的时间没有那么多。

bgt_1_13

Forward+的优势还有很多,其实大多就是传统Forward Rendering本身的优势,所以Forward+更像一个集各种Rendering Path优势于一体的Rendering Path

bgt_1_14

3. 总结

首先我们列出Rendering Equation,然后对比Forward RenderingDeferred RenderingForward+ Rendering[6]

3.1 Rendering Equation

其中点 处有一入射光,其光强为 ,入射角度为 。根据函数 来计算出射角为 处的出射光强度。最后在辅以出射光的相对于视点可见性 。注意此处的 为场景中总共有 个光源。

image

 bgt_1_15

3.2 Forward Renderng

由于Forward本身对多光源支持力度不高,所以此处对于每个点 的处理不再考虑所有的 个光源,仅仅考虑少量的或者说经过挑选的 个光源。可以看出这样的光照效果并不完美。另外,每个光线的 是计算不了的。

bgt_1_16image

3.3 Deferred Rendering

由于Deferred Rendering使用了light culling,所以不用遍历场景中的所有光源,只需遍历经过light culling后的 个光源即可。并且Deferred Rendering将计算BxDF的部分单独分出来了。

bgt_1_17image

3.4 Forward+ Rendering

可以看出Forward+Forward最大区别就是光源的挑选上有了很到改进。

bgt_1_18image

参考文献

[1] Shawn Hargreaves. (2004) “Deferred Shading”. [Online] Available:

http://hall.org.ua/halls/wizzard/books/articles-cg/DeferredShading.pdf (April 15,2015)

[2] Wolfgang Engel. (March 16, 2008) “Light Pre-Pass Renderer”. [Online] Available:

http://diaryofagraphicsprogrammer.blogspot.com/2008/03/light-pre-pass-renderer.html(April 14,2015)

[3] Klint J. Deferred Rendering in Leadwerks Engine[J]. Copyright Leadwerks Corporation, 2008.

[4] 龚敏敏.(April 22, 2012) “Forward框架的逆袭:解析Forward+渲染”. [Online] Available:

http://www.cnblogs.com/gongminmin/archive/2012/04/22/2464982.html(April 13,2015)

[5] Lauritzen A. Deferred rendering for current and future rendering pipelines[J]. SIGGRAPH Course: Beyond Programmable Shading, 2010: 1-34.

[6] Harada T, McKee J, Yang J C. Forward+: Bringing deferred lighting to the next level[J]. 2012.

DirectCompute tutorial for Unity 7: Counter Buffers

So to continue this tutorial is about counter buffers. All the compute buffer types in Unity have a count property. For structured buffers this is a fixed value as they act like an array. For append buffers the count changes as the buffer is appended/consumed from as they act like a dynamic array.

(所有类型buffer都有count property)

 

Direct Compute also provides a way to manually increment and decrement the count value. This gives you greater freedom over how the buffer stores its elements and should allow for custom data containers to be implemented. I have seen this used to create an array of linked list all completely done on the GPU.

(count property 支持append/consume buffer的实时大小衡量)

 

First start by creating a new script and paste in the following code. The code does not do anything interesting. It just creates the buffer, runs the shader and then prints out the buffer contents.

(代码:打印buffer contents)

 

 

On this line you will see the creation of the buffer.

(创建buffer的代码)

 

 

Note the buffers type is counter. The buffer count value is also set to zero. I recommend doing this when the buffer is created as Unity will not always create the buffer with its count set to zero. I am not sure why. I think it maybe a bug.

(注意buffer类型是counter)

 

Next create a new compute shader and paste in the following code.

(然后再建一个shader)

 

 

First notice the way the buffer is declared.

(除以buffer声明,这里不是counterBuffer类型!)

 

It’s just declared as a structured buffer. There is no counter type buffer.

 

The buffers count value is then incremented here. The increment function will also return what the count value was before it gets incremented.

(函数中则是增加counter,返回buffer size的做法:)

 

Then the count is stored in the buffer so we can print out the contents to check it worked.

(这样就可以打印出buffer size)

 

If you run the scene you should see the numbers 0 – 15 printed out.

(结果是打印0-15的值)

 

So how do you decrement the counter? You guessed it. With the decrement function.

(然后考虑怎么减少counter,如下)

 

The decrement function will also return what the count value was after it gets decremented.

 

Now let's say you have run a shader that increments and adds elements to the buffer but you don't know how many were added. How do you find out the count of the buffer? You may recall from the append buffer tutorial that you can use Unity's CopyCount function to retrieve the count value. You can do the same with the counter buffer. Add this code to the end of the Start function. It should print out that the buffer's count is 16.

(As with append buffers, CopyCount lets you read back the buffer's count after elements have been added.)
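A sketch of that read-back; the small argBuffer used as the copy target is a hypothetical helper, not part of the original code:

```csharp
// A one-element buffer to receive the counter value.
ComputeBuffer argBuffer = new ComputeBuffer(1, sizeof(int), ComputeBufferType.Raw);

// Copy the internal counter of 'buffer' into argBuffer at byte offset 0.
ComputeBuffer.CopyCount(buffer, argBuffer, 0);

int[] args = new int[1];
argBuffer.GetData(args);
Debug.Log("Buffer count: " + args[0]);   // should print 16

argBuffer.Release();
```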

 

 

DirectCompute tutorial for Unity 6: Consume buffers

This tutorial will be covering how to use consume buffers in Direct Compute. It was originally going to be combined with the append buffer tutorial, as consume and append buffers are essentially two sides of the same thing, but I decided it was best to split it up because the tutorial would have been a bit too long. In the last tutorial I had to add an edit because it turns out that there are some issues with using append buffers in Unity. It looks like they do not work on some graphics cards. There has been a bug report submitted and hopefully this will be fixed some time in the future. To use consume buffers you need to use append buffers, so the same issue applies to this tutorial. If the last one did not work on your card, neither will this one.

(Like append buffers, consume buffers are not well supported on all hardware; keep this in mind if you run into strange problems below.)

 

I also want to point out that when using append or consume buffers, a mistake in the code can cause unpredictable results even after you fix the code and run it again. If this happens, especially if the error caused the GPU to crash, it is best to restart Unity to clear the GPU context.

(If you get crashes, restart Unity to clear the GPU context and try again.)

 

To get started you will need to add some data to an append buffer, as you can only consume data from an append buffer. Create a new C# script and add this code.

(To get started, here is the C# code:)
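A minimal sketch of such a script, under these assumptions: a 32×32 grid of points, an 8×8 thread group in the compute shader, and placeholder field/uniform names (appendBufferShader, appendBuffer, size):

```csharp
using UnityEngine;

public class ConsumeExample : MonoBehaviour
{
    public ComputeShader appendBufferShader;
    public Material material;

    const int width = 32;
    ComputeBuffer buffer;

    void Start()
    {
        // Append buffers still need a maximum size up front.
        buffer = new ComputeBuffer(width * width, sizeof(float) * 3, ComputeBufferType.Append);
        buffer.SetCounterValue(0);

        // Append one position per thread from the compute shader.
        appendBufferShader.SetBuffer(0, "appendBuffer", buffer);
        appendBufferShader.SetFloat("size", width);
        appendBufferShader.Dispatch(0, width / 8, width / 8, 1);

        // Rendering uses the argument buffer / DrawProceduralIndirect setup
        // from the append buffer tutorial, so it is not repeated here.
    }

    void OnDestroy()
    {
        buffer.Release();
    }
}
```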

 

 

Here we are simply creating an append buffer and then adding a position to it from the "appendBufferShader" for each thread that runs.

(Create an append buffer and run appendBufferShader to append one position per thread.)

 

We also need a shader to render the results. The  “Custom/AppendExample/BufferShader” shader posted in the last tutorial can be used so I am not going to post the code again for that. You can find it in the append buffer tutorial or just download the project files (links at the end of this tutorial).

(We also need a shader to render the result; reuse the one from the append buffer tutorial.)

 

Now attach the script to the camera, bind the material and compute shader and run the scene. You should see a grid of red points.

(Running it, you should see a grid of red points.)

 

We have appended some points to our buffer and next we will consume some. Add this variable to the script.

(Now let's turn this into a consume example. First add this variable to the script:)

 

Now add these two lines under the dispatch call to the append shader.

(Then set the buffer and dispatch the consume shader:)

 

 

This will run the compute shader that will consume the data from the append buffer. Create a new compute shader and then add this code to it.

(Create a new compute shader with the following code:)
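A sketch of the consume kernel (names assumed); note that the same buffer is now declared as a ConsumeStructuredBuffer:

```hlsl
#pragma kernel CSMain

ConsumeStructuredBuffer<float3> consumeBuffer;

[numthreads(8,8,1)]
void CSMain(uint3 id : SV_DispatchThreadID)
{
    // Each call removes one element from the buffer and returns it.
    float3 pos = consumeBuffer.Consume();
}
```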

 

 

Now bind this shader to the script and run the scene. You should see nothing displayed. In the console you should see the vertex count as 0. So what happened to the data?

(Bind the shader to the C# script above and run it; nothing is displayed and the console shows a vertex count of 0.)

 

It's this line here that is responsible.

 

This removes an element from the append buffer each time it is called. Since we ran the same number of threads as there are elements in the append buffer, everything was removed in the end. Also notice that the consume function returns the value that was removed.

(Reason: each Consume call removes one element, so after this shader runs the buffer is empty.)

 

This is fairly simple but there are a few key steps to it. Notice that the buffer needs to be declared as a consume buffer in the compute shader like so…

(Simple in principle; the key steps follow. First, the buffer declaration:)

 

But notice that in the script the buffer we bound to the uniform was not of type consume; it was an append buffer. You can see this where it was created.

(Then the initialization in the script.)

 

There is no type consume, there is only append. How the buffer is used depends on how you declare it in the compute shader. Declare it as “AppendStructuredBuffer”  to append data to it and declare it as a “ConsumeStructuredBuffer” to consume data from it.

(How the buffer is declared in the compute shader determines how it is used.)

 

Consuming data from a buffer is not without its risks. In the last tutorial I mentioned that appending more elements than the buffer's size will cause the GPU to crash. What would happen if you consumed more elements than the buffer has? You guessed it: the GPU will crash. Always try to verify that your code is working as expected by printing out the number of elements in the buffer during testing.

(As with append buffers, be careful not to go out of bounds.)

 

Removing every element from the buffer is a good way to clear the append buffer (which also appears to be the only way to clear a buffer without recreating it), but what happens if we only remove some of the elements?

 

Edit – Unity 5.4 has added a 'SetCounterValue' function to the buffer, so you can now use that to clear an append or consume buffer.

(Besides removing elements one by one, SetCounterValue can reset the buffer in one go.)

 

Change the dispatch call to the consume shader to this…

(The last step is to dispatch the shader.)
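A sketch of the modified dispatch, keeping the 32×32 grid and 8×8 thread group assumed above so that only a quarter of the 1024 elements are consumed (consumeShader is the hypothetical field added earlier):

```csharp
consumeShader.SetBuffer(0, "consumeBuffer", buffer);

// A quarter as many thread groups as were used to append,
// so only (width * width) / 4 = 256 elements are consumed.
consumeShader.Dispatch(0, width / 8, (width / 8) / 4, 1);
```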

 

Here we are only running the shader for a quarter of the elements in the buffer. But which elements will be removed? Run the scene. You will see the points displayed again, but some will be missing. If you look at the console you will see that there are now 768 elements in the buffer. There were 1024, and a quarter (256) have been removed, leaving 768. But there is a problem: the elements removed seem to be determined at random, and they will be (mostly) different each time you run the scene.

(A question to consider: when only some elements are removed, which ones are they?)

 

This fact reveals how append buffers work and why consume buffers have limited use. These buffers are LIFO structures. The elements are added and removed in the order the kernel is run by the GPU, but as each kernel invocation runs on its own thread the GPU can never guarantee the order they will run in. Every time you run the scene, the order the elements are added and removed is different.

(This depends on the GPU's thread scheduling, so the order is not predictable.)

 

This does limit the use of consume buffers, but it does not mean they are useless. LIFO structures have never been available on the GPU before, and as long as the elements' exact order does not matter they let you perform algorithms that were impossible to do on the GPU in the past. Direct Compute also adds the ability to have some control over how threads are run by using thread synchronization, which will be covered in a later tutorial.

(This does limit how consume buffers can be used; keep it in mind when designing your algorithm.)

DirectCompute tutorial for Unity 5: Append buffers

In today’s tutorial I will be expanding on the topic of buffers covered in the last tutorial by introducing append buffers. These buffers offer greater flexibility over structured buffers by allowing you to dynamically increase their size during run time from a compute or Cg shader.

(Append buffers are flexible: their size can grow at run time.)

 

The great thing about these buffers is that they can still be used as structured buffers, which makes the actual rendering of their contents simpler. Start off by creating a new shader and pasting in this code. This is what we will be using to draw the points we add into the buffer. Notice the buffer is just declared as a structured buffer.

(They can still be read like ordinary structured buffers; here is the drawing shader:)
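A sketch of a point-drawing shader of this kind; the shader name matches the one referenced in the consume buffer tutorial, but the uniform name and the hard-coded red colour are assumptions:

```hlsl
Shader "Custom/AppendExample/BufferShader"
{
    SubShader
    {
        Pass
        {
            Cull Off
            CGPROGRAM
            #pragma target 5.0
            #pragma vertex vert
            #pragma fragment frag
            #include "UnityCG.cginc"

            // The append buffer is read here as a plain StructuredBuffer.
            uniform StructuredBuffer<float3> buffer;

            struct v2f { float4 pos : SV_POSITION; };

            v2f vert(uint id : SV_VertexID)
            {
                v2f OUT;
                OUT.pos = mul(UNITY_MATRIX_VP, float4(buffer[id], 1.0));
                return OUT;
            }

            float4 frag(v2f IN) : COLOR
            {
                return float4(1, 0, 0, 1);   // red points
            }
            ENDCG
        }
    }
}
```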

 

 

Now let's make a script to create our buffers. Create a new C# script, paste in this code and then drag it onto the main camera in a new scene.

(Here is the C# code:)

 

 

Notice this line here…

 

 

This is the creation of an append buffer. It must be of the type "ComputeBufferType.Append", unsurprisingly. Notice we still pass in a size for the buffer (the width * width parameter). Even though append buffers have their elements added from a shader, they still need a predefined size. Think of this as a reserved area of memory that elements can be added to. This also raises a subtle error that can arise, which I will get to later.

(When creating an append buffer, note that you still have to give it an initial size.)

 

The append buffer starts off empty and we need to add to it from a shader. Notice this line here…

 

Here we are running a compute shader to fill our buffer. Create a new compute shader and paste in this code.

(We use a compute shader to fill the buffer with data.)
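A sketch of the append kernel (names and the exact point layout are assumptions):

```hlsl
#pragma kernel CSMain

AppendStructuredBuffer<float3> appendBuffer;
float size;

[numthreads(8,8,1)]
void CSMain(uint3 id : SV_DispatchThreadID)
{
    // Only append a position when both dispatch ids are even.
    if (id.x % 2 == 0 && id.y % 2 == 0)
    {
        float3 pos = float3(id.x, 0, id.y) / size;
        appendBuffer.Append(pos);
    }
}
```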

 

 

This will fill our buffer with a position for each thread that runs. Nothing fancy. Notice this line here…

 

 

The "Append(pos)" is the line that actually adds a position to the buffer. Here we are only adding a position if the x and y dispatch ids are even numbers. This is why append buffers are so useful: you can add to them on any condition you wish from the shader.

(Note: Append adds one position to the buffer; here it is only called when the x and y ids are both even.)

 

The dynamic contents of an append buffer can cause a problem when rendering, however. If you remember, last tutorial we rendered a structured buffer using this function…

(The dynamic size of an append buffer causes a problem when rendering.)

 

Unity's "DrawProcedural" function needs to know the number of elements that need to be drawn. If our append buffer's contents are dynamic, how do we know how many elements there are?

(Graphics.DrawProcedural needs to know the buffer size in advance, but the append buffer's size is not fixed.)

 

To do that we need to use an argument buffer. Take a look at this line from the script…

(So we use an argument buffer.)

 

Notice it has to be of the type “ComputeBufferType.DrawIndirect“, it has 4 elements and they are integers. This buffer will be used to tell Unity how to draw the append buffer using the function “DrawProceduralIndirect“.

(This buffer tells Unity the final size of the append buffer; drawing switches to DrawProceduralIndirect.)

 

These 4 elements represent the number of vertices, the number of instances, the start vertex and the start instance. The number of instances is simply how many times to draw the buffer in a single shader pass. This would be useful for something like a forest where you have a few trees that need to be drawn many times. The start vertex and instance just allow you to adjust where to draw from.

(The four values are: vertex count, instance count, start vertex, and start instance. Instancing suits cases like a forest, where a few objects are drawn many times with only their positions changing.)

 

These values can be set from the script. Notice this line here…

(One way to set argBuffer: upload the four values with SetData, then copy the real count into it.)

 

 

Here we are just telling Unity to draw one instance of our buffer, starting from the beginning. It's the first number that is the important one, however: it is the number of vertices, and you can see it is set to 0. This is because we need to get the exact number of elements that are in the append buffer. This is done with the following line…

 

This copies the number of elements in the append buffer into our argument buffer. At this stage it's important to make sure everything is working correctly, so we will also get the values in the argument buffer and print out the contents like so…

(CopyCount fills in the real vertex count; read the argument buffer back and Debug.Log it to check the four values.)
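Putting those pieces together, a sketch of the argument-buffer handling (argBuffer and buffer are the hypothetical fields used in the sketches above):

```csharp
// 4 ints: vertex count, instance count, start vertex, start instance.
ComputeBuffer argBuffer = new ComputeBuffer(4, sizeof(int), ComputeBufferType.DrawIndirect);

// Draw one instance starting at the beginning; the vertex count (first value)
// is left at 0 because CopyCount will fill it in.
argBuffer.SetData(new int[] { 0, 1, 0, 0 });

// Copy the append buffer's element count into the first int of argBuffer.
ComputeBuffer.CopyCount(buffer, argBuffer, 0);

// Read the arguments back and print them to verify everything works.
int[] args = new int[4];
argBuffer.GetData(args);
Debug.Log("vertex count " + args[0] + ", instances " + args[1]
          + ", start vertex " + args[2] + ", start instance " + args[3]);

// Later, in OnPostRender:
//   material.SetPass(0);
//   material.SetBuffer("buffer", buffer);
//   Graphics.DrawProceduralIndirect(MeshTopology.Points, argBuffer);
```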

 

Now run the scene, making sure you have bound the shader and material to the script. You should see the vertex count as 256. This is because we ran 1024 threads with the compute shader and added a position only for even x and y ids, which ends up being the 256 points you see on the screen.

(Running it gives a vertex count of 256, out of the 1024 threads executed.)

 

Now remember I said there was a subtle error that can arise, and it has to do with the buffer's size? When the buffer was declared we set its size as 1024, and now we have added 256 elements to it. What would happen if we added more than 1024 elements? Nothing good. Appending more elements than the buffer's size causes an error in the GPU's driver. This causes all sorts of issues. The best thing that could happen is that the GPU would crash. The worst is a persistent error in the calculation of the number of elements in the buffer that can only be fixed by restarting Unity, and means that the buffer will not be drawn correctly with no indication as to why. What's worse is that since this is a driver issue you may not experience it the same way on different computers, leading to inconsistent behaviour, which is always hard to fix.

(Note: an append buffer has a maximum size, the size it was created with; appending more than the 1024 elements allocated here will break things.)

 

This is why I have printed out the number of elements in the buffer. You should always check to make sure the value is within the expected range.

(Always check that the element count stays within the expected range during testing.)

 

Once you have copied the vertex count to the argument buffer you can then draw the buffer like so…

(With the argument buffer in place, you can draw the result with the function below.)

 

You're not limited to filling a compute buffer from a compute shader; you can also fill it from a Cg shader. This is useful if you want to save some pixels from an image for some sort of further post effect. The catch is that since you are now using the normal graphics pipeline you need to play by its rules: you need to output a fragment colour into a render texture even if you don't want to. If you find yourself in that situation, whatever you are doing is probably better done from a compute shader. Other than that the process works much the same, but like everything there are a few things to be careful of.

(An alternative way of writing it, as follows:)

 

Create a new shader and paste in this code…
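A sketch of how such a shader could look; the UAV register index u1 is an assumption and has to match the index passed to Graphics.SetRandomWriteTarget on the C# side, and the uniform names are placeholders:

```hlsl
Shader "Custom/AppendExample/AppendFromFragment"
{
    SubShader
    {
        Pass
        {
            CGPROGRAM
            #pragma target 5.0
            #pragma vertex vert
            #pragma fragment frag
            #include "UnityCG.cginc"

            // Bound from C# with Graphics.SetRandomWriteTarget(1, buffer).
            uniform AppendStructuredBuffer<float3> appendBuffer : register(u1);
            float size;

            struct v2f
            {
                float4 pos : SV_POSITION;
                float2 uv  : TEXCOORD0;
            };

            v2f vert(appdata_base v)
            {
                v2f OUT;
                OUT.pos = mul(UNITY_MATRIX_MVP, v.vertex);
                OUT.uv = v.texcoord.xy;
                return OUT;
            }

            float4 frag(v2f IN) : COLOR
            {
                uint2 id = uint2(IN.uv * size);

                // Append a position only for even x and y ids.
                if (id.x % 2 == 0 && id.y % 2 == 0)
                    appendBuffer.Append(float3(id.x, 0, id.y) / size);

                // The graphics pipeline still requires a fragment colour output.
                return float4(0, 0, 0, 0);
            }
            ENDCG
        }
    }
}
```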

 

 

Notice that the fragment shader is much the same as the compute shader used previously. Here we are just adding a position to the buffer for each pixel rendered, if the uv (used as the id) is an even number on both the x and y axes. Don't ever try to create a material from this shader in the editor! The reason is that Unity will try to render a preview image for the material and will promptly crash. You need to pass the shader to a script and create the material from there.

(Very similar to before, but now a position is appended for each rendered pixel.)

 

Create a new C# script and paste in the following code…

 

 

Again you will see that this is very much like the previous script using the compute shader. There are a few differences however.

(You will find it very similar to the previous script, with a few differences.)

 

 

Instead of the compute shader dispatch call we need to use Graphics.Blit, and we need to bind and unbind the buffer. We also need to provide a render texture as the destination for Graphics.Blit. This makes the process Pro-only, unfortunately.

(Instead of Dispatch, the Graphics functions are used; Graphics.Blit runs the shader into a render texture.)
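A sketch of the differences on the C# side (the register index 1 matches the shader sketch above; width, buffer and material are the assumed fields from before):

```csharp
// Dummy destination; the fragment colours themselves are not needed.
RenderTexture dummy = RenderTexture.GetTemporary(width, width, 0);

// Bind the append buffer to UAV register u1 so the fragment shader can write to it.
Graphics.SetRandomWriteTarget(1, buffer);

material.SetFloat("size", width);

// Run the material's fragment shader over every pixel of the destination.
Graphics.Blit(null, dummy, material);

// Unbind the buffer again.
Graphics.ClearRandomWriteTargets();

RenderTexture.ReleaseTemporary(dummy);
```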

 

Attach this script to the main camera in a new scene and run the scene after binding the material and shader to the script. It should look just like the previous scene (a grid of points) but they will be blue.

(Running this gives the same result as before, but the points are blue.)

 

 

DirectCompute tutorial for Unity 4: Buffers

In DirectCompute there are two types of data structures you will be using: textures and buffers. The last tutorial covered textures and today I will be covering buffers. While textures are good for data that needs filtering or mipmapping, like colour information, buffers are more suited to representing per-vertex information like positions or normals. Buffers can also easily send or retrieve data from the GPU, which is a feature that is rather lacking in Unity.

(Besides textures you can use buffers. Textures suit data that needs filtering/mipmapping; buffers suit vertex data such as positions and normals.)

 

There are 5 types of buffers you can have, structured, append, consume, counter and raw. Today I will only be covering structured buffers because this is the most commonly used type. I will try and cover the others in separate tutorials as they are a little more advanced and I need to cover some of the basics first.

(There are five buffer types: structured, append, consume, counter and raw; this part covers only structured buffers.)

 

So let's start off by creating a C# script, name it BufferExample and paste in the following code.

t003 - 01
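A minimal sketch of such a script (1024 random points; the field and uniform names are assumptions rather than the listing shown in the image):

```csharp
using UnityEngine;

public class BufferExample : MonoBehaviour
{
    public Material material;   // a material created from the Cg shader below

    ComputeBuffer buffer;
    const int count = 1024;

    void Start()
    {
        // count elements, each a 3-component float position (stride = 3 floats).
        buffer = new ComputeBuffer(count, sizeof(float) * 3, ComputeBufferType.Default);

        Vector3[] positions = new Vector3[count];
        for (int i = 0; i < count; i++)
            positions[i] = Random.insideUnitSphere;

        // Upload the data to the GPU and bind the buffer to the material once.
        buffer.SetData(positions);
        material.SetBuffer("buffer", buffer);
    }

    void OnPostRender()
    {
        // The pass must be set before the draw call or nothing is drawn.
        material.SetPass(0);
        Graphics.DrawProcedural(MeshTopology.Points, count, 1);
    }

    void OnDestroy()
    {
        buffer.Release();
    }
}
```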

 

We will also need a material to draw our buffer with, so create a normal Cg shader, paste in the following code and create a material from it.

(Create a material from the shader below.)

t003 - 02

 

Now to run the scene attach the script to the main camera node and bind the material to the material attribute on the script. This script has to be attached to a camera because to render a buffer we need to use the “OnPostRender” function which will only be called if the script is on a camera (EDIT – turns out you can use “OnRenderObject” if you don’t want the script attached to a camera). Run the scene and you should see a number of random red points.

(Attach the C# script above to the main camera and assign the material you just created to the script's material field.)

t003 - 03

 

(A walkthrough of the code follows.)

Now notice the creation of the buffer from this line.

 

The three parameters passed to the constructor are the count, stride and type. The count is simply the number of elements in the buffer. In this case 1024 points. The stride is the size in bytes of each element in the buffer. Since each element in the buffer represents a position in 3D space we need a 3 component vector of floats. A float is 4 bytes so we need a stride of 4 * 3. I like to use the sizeof operator as I think it makes the code more clear. The last parameter is the buffer type and it is optional. If left out it creates a buffer of default type which is a structured buffer.

(Three parameters: the number of elements, the size of each element in bytes, and the type, which defaults to a structured buffer.)

 

Now we need to pass the positions we have made to the buffer like so.

(Upload the data.)

 

This passes the data to the GPU. Just note that passing data to and from the GPU can be a slow process, and in general it is faster to send data than to retrieve it.

 

Now we need to actually draw the data. Drawing buffers has to be done in the “OnPostRender” function which will only be called if the script is attached to the camera. The draw call is made using the “DrawProcedural” function like so…

(Draw the data in the OnPostRender function.)

 

There are a few key points here.

The first is that the material's pass must be set before the DrawProcedural call; fail to do this and nothing will be drawn. You must also bind the buffer to the material, but you only have to do this once, not every frame like I am doing here. Now have a look at the "DrawProcedural" function. The first parameter is the topology type; in this case I am just rendering points, but you can render the data as lines, line strips, quads or triangles. You must, however, order the elements to match the topology. For example, if you render lines every two points in the buffer will make a line segment, and for triangles every three points will make a triangle. The next two parameters are the vertex count and the instance count. The vertex count is just the number of vertices you will be drawing, in this case the number of elements in the buffer. The instance count is how many times you want to draw the same data. Here we are just rendering the points once, but you could render them many times and have each instance in a different location.

(The material pass must be set before calling DrawProcedural, otherwise nothing is drawn.)

(DrawProcedural parameters: topology type, vertex count, instance count.)

 

(Rendering the buffer as lines looks like this:)

t003 - 04

Now for the material. This is pretty straightforward. You just need to declare your buffer as a uniform, like so…

(In the material's shader, declare the StructuredBuffer as a uniform.)

 

Since buffers are a DirectCompute (Shader Model 5) feature, you must also set the shader target to SM5, like so…

(Shader model requirement.)

 

The vertex shader must also have the argument “uint id : SV_VertexID“. This allows you to access the correct element in the buffer like so…

(The SV_VertexID argument in the vertex shader lets you index the correct element.)

 

Buffers are a generic data structure and don't have to hold floats like in this example; we could use integers instead. Change the script's Start function to this…

(Buffers support many element types; here is the same example using int instead of float.)

 

 

and the shader's uniform declaration to this…

You could even use a double, but just be aware that double precision in shaders is still not widely supported on GPUs, although it is common on newer cards.

(You can use double, but GPU support for it is still patchy, so it is best avoided.)

 

You are not limited to using primitives like float or int. You can also use Unity’s Vectors like so…

(You can also use Unity's vector types.)

 

 

With the uniform as…

You can also create your own structs to use. Change your script's Start function to this, with the struct declaration above it…

(You can also use custom structs.)
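A sketch of what that could look like, assuming an element made of a position plus a colour (six floats) and the buffer/material fields from the earlier sketch; the shader side would need a matching HLSL struct with the same layout:

```csharp
struct Point
{
    public Vector3 position;
    public Vector3 color;
}

void Start()
{
    const int count = 1024;

    // The stride is the full size of the struct: 6 floats.
    buffer = new ComputeBuffer(count, sizeof(float) * 6, ComputeBufferType.Default);

    Point[] points = new Point[count];
    for (int i = 0; i < count; i++)
    {
        points[i].position = Random.insideUnitSphere;
        points[i].color = new Vector3(Random.value, Random.value, Random.value);
    }

    buffer.SetData(points);
    material.SetBuffer("buffer", buffer);
}
```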

 

 

and the shader to this…

 

This will draw each point like before, but now they will also have random colours. Just be aware that you need to use a struct, not a class.

(Note: the element type must be declared with the struct keyword, not as a class.)

 

When using buffers you will often find you need to copy one into another. This can be easily done with a compute shader like so…

(Copying one buffer into another.)
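A sketch of such a copy kernel (buffer and kernel names assumed):

```hlsl
#pragma kernel CSMain

StructuredBuffer<float3> buffer1;    // source
RWStructuredBuffer<float3> buffer2;  // destination

[numthreads(64,1,1)]
void CSMain(uint3 id : SV_DispatchThreadID)
{
    // Buffers are 1D, so a 1D thread group maps directly onto the elements.
    buffer2[id.x] = buffer1[id.x];
}
```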

 

 

You can see that buffer1 is copied into buffer2. Since buffers are always one-dimensional, it is best done from a one-dimensional thread group.

(Since buffers are 1D, use a 1D thread group for the copy.)

 

Just like textures, buffers are objects and have member functions that can be called on them. You can use the Load function to access a buffer's elements, just like with the subscript operator.

(Like textures, buffers have member functions, for example Load.)

 

Buffers also have a "GetDimensions" function. This returns the number of elements in the buffer and their stride.

(Like textures, buffers also have a GetDimensions function.)
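A short sketch of both member functions (names assumed):

```hlsl
StructuredBuffer<float3> buffer;

float3 Example(uint i)
{
    // Load is equivalent to the subscript operator.
    float3 a = buffer[i];
    float3 b = buffer.Load(i);

    // GetDimensions returns the element count and the stride in bytes.
    uint count, stride;
    buffer.GetDimensions(count, stride);

    return a + b;
}
```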