Software Occlusion Culling – intel

Software Occlusion Culling – intel



The technique divides scene objects into occluders and occludees and culls occludees based on a depth comparison with the occluders that are software rasterized to the depth buffer. 【做法】

The sample code uses frustum culling and is optimized with Streaming SIMD Extensions (SSE) instruction set and multi-threading to achieve up to 8X performance speedup compared to a non-culled display of the sample scene.【好处】




If we can lower the computational cost of occluded objects, we can improve the application’s performance with no impact to the scene quality. 【把挡住的物体不送进GPU渲染,则可以提高性能】

Using the GPU-generated depth buffer on the CPU is one technique for early detection of occluded objects. However, doing occlusion culling in this manner may cause latency and overhead issues and introduce visual artifacts since the GPU may be several frames behind the CPU. To avoid these problems, we propose software rasterization of the depth buffer on the CPU.【GPU生成的depth buffer是一种方法,但是会影响GPU性能】


In this approach, we use the CPU to rasterize the depth buffer. Then, we use axis-aligned bounding box tests to determine if an object is occluded. Occluded objects are removed from the rendering pipeline to reduce overhead. We do not cull visible and partially visible objects; we submit them to the GPU for rendering. The sample code associated with the Software Occlusion Culling sample implements this technique. Additionally, in this sample, the software rasterizer is vectorized using SSE and multi-threaded to improve performance.【这里使用CPU来做,然后使用axis-aligned bounding box来测试剔除,剔除完提交GPU渲染】


Occluders: Objects that are sufficiently large to hide or occlude other objects in the scene.【遮挡其他物体的物体】

Occludees: Objects that are hidden, or occluded, by other objects in the scene.【被其他物体遮挡的物体】


Software Occlusion Culling has two steps: depth buffer rasterization and depth test culling. The next sections provide details about these steps.


Depth Buffer Rasterization


The occluders in the scene are rasterized to the depth buffer on the CPU. Figure 1 shows a screenshot of the Software Occlusion Culling sample. The castle walls and the ground plane are considered as occluders in the scene. The castle walls along with the attached tiny wooden trim and decorations are used as occluders to avoid special pre-processing of the art assets. Ideally, it would have been better to hand pick just the large castle walls as occluders, but we wanted to make sure that the software occlusion culling algorithm worked with a realistic scene without the need for expensive content changes.【墙可以挡住大部分的东西可以不画】


Our rasterizer divides the frame buffer into tiles and bins occluder triangles based on the intersection with the tiles. The rasterizer bins triangles into multiple tiles when the triangles cross tile boundaries. When processing a tile, a thread steps through the binned triangles and uses bounding box traversal to rasterize the triangles to the depth buffer.【这边是在说自动根据三角形获得包围盒?】 The rasterizer checks if the pixel under consideration is inside the triangle, and if so, it uses barycentric coordinates to interpolate the depth value at the pixel location. The depth buffer rasterization process updates the depth buffer at the pixel position if the newly computed depth is closer to the observer as compared to the already existing depth at the same pixel location.


Once the depth buffer has been rasterized on the CPU, all the occludees in the scene have to be depth tested against the depth buffer to determine which occludees are visible and which can be culled.【等rastered完成,测试被遮挡体是否被遮挡】


Depth Test Culling


We use object space axis-aligned bounding boxes (AABBs) to depth test the occludees in the scene against the CPU-generated depth buffer. The depth test algorithm treats all objects in the scene including the occluders (castle walls and ground plane) as occludees. Using AABB makes the depth tests more conservative. If the AABB is occluded, then the object contained inside it is also occluded and can be culled. If the AABB is visible, then the assumption is that the object contained inside might also be visible. However, as the bounding box is conservative, this assumption may not always be true, and we may have some false positives.【使用物体的AABB来测试】


To determine if an object is clipped by the near clip plane, we could use the homogeneous coordinate w, of the vertices that make up the bounding box. The near clip plane is set to 1.0 in the sample. Objects that are in front of the camera have a w> 1.0. So if any of the vertices that make up the bounding box has a w < 1.0, then the object is being clipped by the near clip plane.【近面裁剪】


However, we store 1/w instead of the homogenous coordinate w to avoid divide by w at multiple locations in the rasterizer.


So the bounding box of an occludee is being clipped if 1/w > 1.0 or 1/w < 0.0 for any vertex of the bounding box in which case we trivially accept the occludee as visible. 【使用1/W替代w来表示更容易处理】





找到了paper:Masked Software Occlusion Culling


Parallel Graphics in Frostbite –
Current & Future


同样的,battlefield3 也是手工建遮挡体


Umbra occlusion culling 也是手动处理出模型

Occlusion Culling in Alan Wake

For solving visibility from the cameras point of view, Alan Wake used a hierarchical hardware based occlusion query algorithm [Bittner et al. 2004], [Guthe et al. 2006]. The advantages of using a hardware based culling algorithm outweighed the content pipeline issues associated with using a typical software occlusion system based on either precomputed potentially visible sets (PVS) or artist generated occlusion geometry and software rasterizer [Aila and Miettinen 2004]. Occlusion culling minimized the overdraw and the bottleneck shifted to shadow map generation.


GPU-Driven Rendering Pipelines



Killzone 软光栅做法

【这里的做法是使用手工调整大小的physics mesh

We didn’t want artists to hand-make occluders for the entire level, so we looked for an automatic solution. We experimented with using visual meshes, but the good polygon reduction proved to be difficult to get right. Physics mesh is much better choice. It’s available for each mesh in the game and it’s cleaned and sufficiently low polygon count. Unfortunately the physics meshes can be slightly larger than visual meshes causing objects to disappear. Worst offenders have to be fixed manually.

Here’s an example of occluders generated automatically from physics mesh. You can notice there’s some unnecessary detail, but in general the quality is pretty good

Even using physics mesh there was too much occluder geometry.

We needed to reject occluders that were unlikely to contribute much to the occlusion buffer. Our simple heuristics rejects all small objects or objects where the name suggests that they are not good occluders. We also reject meshes whose surface area suggests that they are thin or with too many holes. Artists can of course step in and override the heuristics or provide their custom occluders for difficult cases and optimization.

【physics mesh太多,我们需要筛选去掉一些:形状奇怪的,有洞的,小的都不要】



Solving Visibility and Streaming in the The Witcher 3: Wild Hunt with Umbra 3

首先看 LOD0 的才作为遮挡体





Umbra 分析

umbra3,也就是unity的做法,采用的是PVS,基于区域的预计算方法来做occlusion culling

Umbra pre-processes the scene offline, voxelizes it, then structures it into portals and cells to create a spatial database used at runtime. Queries can then be performed in real-time and involve the generation of a low-resolution hierarchical depth-buffer of the scene, performed by an highly-optimized software rasterizer.


It makes sense on a lot of levels and recently Intel also published another example of a purely CPU-based solution — similar to the Umbra technique, occluders are software-rasterized to create a hierarchical depth buffer.intel的实现可以看作是和umbra类似,采用了分层的深度缓冲器】

Umbra’s culling system precomputes a simplified visibility database representing a 3D world. A fast runtime library then traverses the database in order to solve visibility for a view.umbra的方法是需要与计算获得一些信息,运行时使用这些数据做遮挡剔除】

We call these databases tomes. They consist of automatically placed portals and cells. These are common concepts in the world of visibility: a cell is a volume in 3D space, within which visibility is similar. The world is divided into cells. Portals are links between cells: openings in space leading to another cell that also have a shape. The difficult problem is finding good portals and cells in fast and practical fashion.【预计算做的是快速的自动场景体素化】

Umbra’s precomputation step, Umbra optimizer, solves this problem. It starts with voxelization – discretization of 3D space. The algorithm begins by generating small “raw” cells. These are then improved by combining them together into bigger final cells, so that loss in occlusion is minimal.【细分体素后,再重新组合形成最终的cell

Each raw cell is a continuous set of empty voxels not bigger than some user defined size. Portals are generated on open surfaces between these cells. User controls database accuracy by supplying the size for these raw cells. This “raw cell size” together with voxel size are two most important user controlled parameters in Umbra. They are publicly given more understandable names: “smallest occluder” (raw cell size) and “smallest hole” (voxel size). Controlling these parameters allows scaling the data for any environment and requirements. Several versions of data are generated for different levels of detail – occlusion near camera is more important than farther away.cell>=体素大小的块,cell的最大值则是用户可以控制的,这点要注意设置好】

Umbra’s runtime walks these cells linked by portals to figure out the visible volume. Visible objects are collected along the path. Compared to other kind of portal culling systems, Umbra generates a lot of portals. Runtime thus presents quite an engineering challenge: it needs to be fast and optimized on each platform to handle high portal count. Common method for portal culling is narrowing the view frustum to fit encountered portals. Umbra doesn’t do this – Umbra rasterizes the portals instead. In the end, a low-resolution depth buffer is generated.【运行时遍历cell之间的关系链收集到可见物体,如果遇到太多的物体,一般做法是缩小可见区域来删减需要处理的物体的量,umbra的做法则是采用更粗的粒度来参与剔除】

Some visibility approaches are based on hand made occluder models – a representation of what’s solid in the world. You could see Umbra as a method for modelling empty space instead. This approach has the benefit that all inaccuracy can always be countered by growing the empty space (i.e. portals). Sources of inaccuracy include voxelization and low rasterization resolution. Conversely this allows us make the runtime faster by using lower database and runtime resolutions, while always retaining error-free output.【一般做法是手动建遮挡体模型,umbra则是对空区域建模,核心思路就是标记不被遮挡的cell,看有哪些看得见的,看不见的不就不用管了,这样voxel越粗只会影响精度,不会导致错误】

My previous blog post discussed global nature of visibility. Umbra is divided into a local offline operation and a global runtime one. Unlike PVS systems, this allows fast reuse of partial results from previous precomputations – and thus fast local updates to the database.【回顾第一篇的内容,umbra将操作分为本地操作和全局操作,这方法的好处是可以重用预计算结果】

There’s also benefits in having a separate set of static data on which the runtime algorithm operates. It makes the system independent from anything else: it’s parallelizable and easier to integrate. An application can begin to execute Umbra’s runtime as soon as camera’s position is known for a frame.【还有的好处就是数据独立,可差值】


There’s an early and simple solution to the visibility problem aptly called the painter’s algorithm ( 画家算法’s_algorithm ). Like an old renaissance master, the painter’s algorithm has you drawing everything in back-to-front order. While the result looks correct, it causes pentimenti – overdraw that wastes computing cycles. Computer scientists knew how to solve the visibility problem in the 70s like the artists did in the 1700s. The challenge is in solving it in a fast and practical fashion. The art of doing so is called occlusion culling, and we’ve thought about it quite a bit here at Umbra.【画家算法,会导致很多的画的部分最终看不到,overdraw

Modern rendering uses z-buffers – sort of the same thing as painter’s algorithm but for each point instead of each surface. While that does solve the problem, it does that very late in the rendering pipeline. By the time we get around rejecting anything, we’ve wasted a lot of time processing things hiding behind walls. A good visibility solution should know what the hidden objects are very early on, so that we can avoid even sending them for rendering. Avoiding spending time on hidden objects allows for more budget with visible ones.【现代渲染采用zbuffer来考虑深度信息,通过这个让后面的不画,但是如果可以很早的知道哪些对象不画,可以在渲染阶段避免很多的查询计算量,提高渲染效率】

Of course you could solve visibility by doing manual work. And indeed, many occlusion culling systems are based on hand tuning portals or occluders. Less manual work means better workflow for anyone creating content for the system. The bad end of automation spectrum would require tagging all visibility manually, a huge task that will quickly become impractical. An ideal solution would be in the other end – not require manual work and also not place restrictions on content.【基于手动的遮挡剔除管用,但是我们希望这一步可以自动化】

It’s worth mentioning a couple obvious things. A visibility solution should also produce correct results. No visible object should be reported hidden. To guarantee visually correct result, any inaccuracy should be on the conservative side: reporting hidden objects visible. It should also be fast: too slow occlusion culling would negate the benefits.【不管怎样简化都一定要保证最终的渲染结果是正确的,宁可有些应该不可见的没被裁掉】

One big practical issue with visibility is that it is a global problem that involves everything – a small change in visibility can affect the other side of a 3D world. This must somehow be reflected in the computation we perform to solve it. Reducing something into a local problem would allow to divide and conquer: easy parallelization and reuse of partial results. A generic visibility solution doesn’t really lend itself to this. This also means amount of work performed by the algorithm is at least relative to visible volume. We can compute some parts beforehand offline however and some parts at runtime. This is a choice we do get to make, and it allows us to find some extent of localization.【把global问题转化为local的问题好处在于可以并行运算】


再补充些信息:umbra 3的PPT

This is the high level overview of the workflow.

From the original geometry we want to create a data structure that can be used for efficient visibility queries. And the whole process should be both fast and automatic. (and allow at least some amount of dynamic changes to the data)【预计算获得后面用于可见性查询的数据结构,同时支持数据变化】

In runtime visibility query the basic operation is figuring out which objects are visible from specific view points.

These are the requirements we had, we need to create a data structure and algorithm within these bounds.【设计数据结构和算法需要如下的支持】

  • For any input 3D scene, memory usage and processing time of the visibility system must be bounded and independent of the complexity of the input.【够快】
  • The whole process should be automatic.【全自动化】
  • It should allow fast content iteration. 【可写】
  • And it should set no limits on the size of the game worlds.【支持足够大的地图】

So what is visibility?

It’s the set of all the lines in space which do not intersect occluding geometry.


Creating data strucure for accurate visibility presentation from polygon soup is theoretically possible, but in practice the data strucure would be so large and so slow to compute it’s unusable.【基于可见几何体的数据结构,对于大场景太过庞大,没有实用性】

Instead we need to use conservative visibility, where we have bounded time and space and within these bounds give the best results we can.

Visibility is ultimately reasoning about space. Interesting thing here to note is that once we have a topological data structure like this it can be used for other things as well, not just visibility.


This is the high level overview of the workflow.

From the original geometry we want to create a data structure that can be used for efficient visibility queries. And the whole process should be both fast and automatic. (and allow at least some amount of dynamic changes to the data)【我们需要根据可见对象预计算建立实用的可见性查询数据结构】

In runtime visibility query the basic operation is figuring out which objects are visible from specific view points.【运行时查询相关数据获得可见性】

Here you’re looking at top down view of an example scene.【场景顶视图】

The requirement was that there should be no manual markup or other requirements on the input geometry. So what we take as input is all the level geometry as it is.

So we really don’t have any other information besides the long list of polygons, which are possibly grouped into objects.

Doing geometric operations with polygons directly has all kinds of difficulties related to floating point accuracy. Also, almost all real life game levels contain some small modeling errors such as t-vertices and cracks between objects. We need to be able to work with this kind of input as well.


The approach we chose is to create a cell-and-portal graph by grouping the voxels together based on proximity and connectivity. Cells are created from groups of voxels. Portals are then created on the boundaries of these groups.We chose to create portals because in the past they have been proven to be an efficient way to represent visibility, and we solve the issues of manually placed portals by generating them automatically.


In constrast to manually placed portals, we might generate thousands of portals which allows us to have accurate visibility in outdoor spaces as well.

By controlling the number of output cells and portals we can choose the output resolution of the visibility data so that it meets the memory and performance requirements.


In addition to the cell and portal graph we need to be able to figure out in which cell the viewpoint is located when making a visibility query.【可见性查询的时候我们还需要知道viewport位于哪个cell】

We create a KD-tree from the voxels, which can be packed very efficiently with only a few bits per node. The leaf nodes of the tree contain indices to cells in the graph.【voxels建立KD-tree】

The final data stucture contains the cell-and-portal graph, the view tree, and a list of statis objects contained in each cell. 【最终的数据结构】

We call the output data structure a tome .. Tome of visibility .. It contains all the information about topological structure need for visibility determination in a 3D scene.






The basic operation is the visibility query, which uses the tome data to figure out what is visible from a specific view point inside the scene.【根据当前所在场景位置去找出可见性对象】

The goal is to create a depth buffer, or occlusion buffer, from the tome, which can be then used to very efficiently test the visibility of objects.

First step is to figure out in which cell the view point is located in. This is done by traversing the view tree based on the query origin. This is a simple tree traversal and it’s a fast operation.【第一步找出视点在哪个cell

There are several options on how we could use the portal data. We could do a traditional recursive portal traversal, clipping the view frustum as we go through each portal. Or we could do ray-tracing or cone-tracing in the graph.

The approach we choose is to rasterize the portals using a custom software rasterizer optimized for this purpose. With rasterizer we need to touch each portal only once, as opposed to recursive portal traversal which can suffer from exponential blowup if there’s a lot of portal intersections in the screenspace.

(We could also traverse other kind of queries for connectivity.)【通过软光栅来处理可见性,好处是每个portal只需要处理一次】


Also really useful property of the rasterizer is that it produces a depth buffer as output, which is almost optimal data structure for doing further visibility tests on the query output.

Also with rasterization we can choose the output resolution based on the platform and accuracy requirements.

Since we’re rasterizing portals instead of occluders, it’s trivial to implement conservative rasterization, which is a requirement for getting correct results in lower resolutions.

【光栅化输出深度信息,再用来可见性判断,portals instead of occluders 的好处是光栅化粒度可控】

After we have generated the depth buffer, testing objects for visibility is quite straightforward bounding box vs. depth buffer visibility test.

If we keep the depth buffer resolution small enough to fit into the CPU cache, this test is super-super fast, and also trivial to parallelize.

boundingboxdepth buffer比较做可见性判断】





SurRender Umbra: A Visibility Determination Framework for Dynamic Environments

Efficient Algorithms for occlusion culling and shadows (Jan2005)

dPVS: An Occlusion Culling System for Massive Dynamic Environments (2004)