Latency Mitigation Strategies: Reading Notes

Motivation

I recently got pulled into a VR project and kept hearing teammates toss around obscure terms, ATW being the one that came up most often. Tracing that term back to its source is how I found this article by John Carmack. I carved out some time during the project to read it, and also organized a sharing session with the team, which is how these reading notes came about.

Abstract

The threshold of human latency perception

Human sensory systems can detect very small relative delays in parts of the visual or, especially, audio fields, but when absolute delays are below approximately 20 milliseconds they are generally imperceptible.

  • The author opens by stating that absolute delays below roughly 20 ms are essentially imperceptible to humans.

Interactive 3D systems today typically have latencies that are several times that figure, but alternate configurations of the same hardware components can allow that target to be reached.

  • Interactive 3D applications, however, typically had latencies several times that 20 ms figure (as of 2013).

Introduction

Latency tolerance differs across devices and kinds of interaction

Users can adapt to control systems with a significant amount of latency and still perform challenging tasks or enjoy a game; many thousands of people enjoyed playing early network games, even with 400+ milliseconds of latency between pressing a key and seeing a response on screen.

  • In the early days, 400+ ms of latency was acceptable, because we were simply sitting in front of a screen waiting for a response.

Perceiving latency in the response to head motion is also one of the primary causes of simulator sickness.

  • In a head mounted display, this kind of latency in response to head motion is a primary cause of simulator sickness.

Using sensor data to extrapolate motion

Extrapolation of sensor data can be used to mitigate some system latency, but even with a sophisticated model of the motion of the human head, there will be artifacts as movements are initiated and changed.

  • Extrapolating from sensor data can mask some of the system latency, but even with a sophisticated model of head motion, the extrapolation introduces artifacts whenever a movement is initiated or changes.

It is always better to not have a problem than to mitigate it, so true latency reduction should be aggressively pursued, leaving extrapolation to smooth out sensor jitter issues and perform only a small amount of prediction.

  • Since prediction always carries some error, Carmack's advice is to pursue true latency reduction aggressively and leave extrapolation to smoothing sensor jitter plus only a small amount of prediction.

Data collection

There is nothing particularly surprising in this chapter; it mainly describes measuring end-to-end latency with a high-speed camera that captures both the initiating physical motion and the eventual display update.

An effective technique is to record high speed video that simultaneously captures the initiating physical motion and the eventual display update.

Sensor, Display

These chapters cover the latency contributed by the sensors and by the display. The most interesting point is:

A subtle latency point is that most displays present an image incrementally as it is scanned out from the computer, which has the effect that the bottom of the screen changes 16 milliseconds later than the top of the screen on a 60 fps display. This is rarely a problem on a static display, but on a head mounted display it can cause the world to appear to shear left and right, or “waggle” as the head is rotated, because the source image was generated for an instant in time, but different parts are presented at different times. This effect is usually masked by switching times on LCD HMDs, but it is obvious with fast OLED HMDs.

  • From another angle, this scanout behavior also ties into the VSYNC concept in Android.

Host processing

This section defines each stage of host processing precisely, abbreviated I, S, R, G, V:

Read user input -> run simulation -> issue rendering commands -> graphics drawing -> wait for vsync -> scanout
I = Input sampling and dependent calculation
S = simulation / game execution
R = rendering engine
G = GPU drawing time
V = video scanout time
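
To make the stages concrete, here is a minimal single-threaded frame loop sketch; all names (SampleInput, RunSimulation, and so on) are placeholders I made up for illustration, not APIs from the article.

// Hypothetical single-threaded frame loop showing where I, S, R, G and V fall.
struct InputState {};
struct WorldState {};

InputState SampleInput();                           // I: input sampling
WorldState RunSimulation(const InputState& in);     // S: simulation / game execution
void       IssueRenderCommands(const WorldState&);  // R: rendering engine work on the CPU
void       SwapBuffers();                           // blocks until vsync; G is the GPU drawing
                                                    // already queued, V is the scanout that follows

void RunFrameLoop(bool& running)
{
    while (running) {
        InputState in    = SampleInput();
        WorldState world = RunSimulation(in);
        IssueRenderCommands(world);
        SwapBuffers();
    }
}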

The article then walks through several configurations that show, step by step, where the latency from I to V comes from and how it can be reduced.

  • However, if running synchronized to the video refresh, the minimum latency will still be 16 ms even if the system is infinitely fast.

  • Case 1: everything runs synchronized to the video refresh, so even when the work finishes early, the result still waits for the next 16 ms refresh boundary.

    • Ample performance, vsync:
      ISRG------------|VVVVVVVVVVVVVVVV|
      .................. latency 16 – 32 milliseconds
  • Running without vsync on a very fast system will deliver better latency, but only over a fraction of the screen, and with visible tear lines.

  • Case 2: without vsync the latency drops, but tear lines become visible.

    • Ample performance, unsynchronized:
      ISRG
      VVVVV
      ..... latency 5 – 8 milliseconds at ~200 frames per second
  • Case 3, the common case: the CPU, GPU, and display each work on a different frame, as a parallel pipeline.

  • In most cases, performance is a constant point of concern, and a parallel pipelined architecture is adopted to allow multiple processors to work in parallel instead of sequentially.

    • CPU:ISSSSSRRRRRR----|
      GPU: |GGGGGGGGGGG----|
      VID: | |VVVVVVVVVVVVVVVV|
      .................................. latency 32 – 48 milliseconds
  • Case 4 goes back to throwing hardware at the problem: when the CPU work for simulation and rendering no longer fits inside a single frame, a second CPU core can be brought in and the work split across cores.

  • When the CPU load for the simulation and rendering no longer fit in a single frame, multiple CPU cores can be used in parallel to produce more frames.

    • CPU1:ISSSSSSSS-------|
      CPU2: |RRRRRRRRR-------|
      GPU : | |GGGGGGGGGG------|
      VID : | | |VVVVVVVVVVVVVVVV|
      .................................................... latency 48 – 64 milliseconds

Summing up this section, Carmack notes that even after all this effort, with the application apparently running perfectly smoothly, there can still be over 50 ms between the input event I and the scanout V, so there is a long way to go before reaching the 20 ms target.

Even if an application is running at a perfectly smooth 60 fps, it can still have host latencies of over 50 milliseconds, and an application targeting 30 fps could have twice that. Sensor and display latencies can add significant additional amounts on top of that, so the goal of 20 milliseconds motion-to-photons latency is challenging to achieve.

Latency Reduction Strategies

The first half of the article is background; the real substance begins with this section. Carmack proposes four strategies:

  • Prevent GPU buffering
  • Late frame scheduling
  • View bypass
  • Time warping

The final part adds the idea of continuously updated time warping.

Prevent GPU buffering

The idea

The underlying thought is that if the GPU can start its drawing work (G) earlier, the scanout (V) can begin earlier as well, so the author proposes preventing the GPU from buffering up commands. To me, the real point of this strategy is to manage GPU buffering deliberately: do not let the driver queue up whole frames, but do not throw away all of the overlap either.

The approach

SwapBuffers();
DrawTinyPrimitive();
InsertGPUFence();
BlockUntilFenceIsReached();
  • As I read it, the idea is to insert an explicit GPU fence right after the buffer swap (with a tiny draw call in between) and block on it, so the swap has really completed and the command queue is drained before the next frame's commands are issued.
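
The sequence above is pseudocode from the article; the following is one plausible mapping onto standard OpenGL sync objects (glFenceSync / glClientWaitSync), a sketch of mine rather than Carmack's actual implementation, assuming a current 3.2+ context with a trivial shader and vertex array already bound.

#include <GL/glew.h>   // assumes a function loader such as GLEW or glad

void BlockUntilSwapIsDrawn()
{
    // Stand-in for DrawTinyPrimitive(): any trivial command issued after the swap.
    glDrawArrays(GL_TRIANGLES, 0, 3);

    // Fence after the tiny draw, then block the CPU until the GPU has executed
    // everything up to it, so the previous frame is fully drained before the
    // next frame's commands are generated.
    GLsync fence = glFenceSync(GL_SYNC_GPU_COMMANDS_COMPLETE, 0);
    while (glClientWaitSync(fence, GL_SYNC_FLUSH_COMMANDS_BIT,
                            1000000 /* 1 ms, in nanoseconds */) == GL_TIMEOUT_EXPIRED) {
        // keep waiting; a real engine might do a little useful CPU work here
    }
    glDeleteSync(fence);
}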

The worst case is also obvious:

If the code issuing drawing commands does not proceed fast enough, the GPU may complete all the work and go idle with a “pipeline bubble”. Because the CPU time to issue a drawing command may have little relation to the GPU time required to draw it, these pipeline bubbles may cause the GPU to take noticeably longer to draw the frame than if it were completely buffered. Ordering the drawing so that larger and slower operations happen first will provide a cushion, as will pushing as much preparatory work as possible before the synchronization point.

  • While the CPU sits blocked on the fence it does no useful work, and once drawing begins, if the CPU cannot issue commands fast enough the GPU can in turn go idle in a "pipeline bubble", making the frame take longer than it would with full buffering.

The author first illustrates this with minimal buffering:

Run GPU with minimal buffering:
CPU1:ISSSSSSSS-------|
CPU2: |RRRRRRRRR-------|
GPU : |-GGGGGGGGGG-----|
VID : | |VVVVVVVVVVVVVVVV|
................................... latency 32 – 48 milliseconds

The author also notes that going further and running completely unbuffered would "destroy far more overlap". I did not fully follow this at first; my reading now is that the "overlap" is the concurrency between CPU and GPU work, so synchronizing on every command would eliminate nearly all of that parallelism rather than just trimming away one frame of buffering.

The author also points out that this design can be combined with multiple GPUs doing AFR (Alternate Frame Rendering), where consecutive frames are drawn on different GPUs. Extended to a VR headset, two GPUs can each draw one eye's image, which also helps latency.

Alternate Frame Rendering dual GPU:
CPU1:IOSSSSSSS-------|IOSSSSSSS-------|
CPU2: |RRRRRRRRR-------|RRRRRRRRR-------|
GPU1: | GGGGGGGGGGGGGGGGGGGGGGGG--------|
GPU2: | | GGGGGGGGGGGGGGGGGGGGGGG---------|
VID : | | |VVVVVVVVVVVVVVVV|
.................................................... latency 48 – 64 milliseconds

Limitations

As touched on above, the main cost is throughput: with little or no buffering the GPU is more likely to starve, so under heavy load more frames get dropped. It feels a bit like refusing to queue any work ahead of time and only handling whatever arrives right now, which drags down overall efficiency.

The downside to preventing GPU buffering is that throughput performance may drop, resulting in more dropped frames under heavily loaded conditions.

Late frame scheduling

The idea

The author observes that most of what ends up on screen does not depend directly on the user's input, therefore:

  • Sample the user's input late in the frame, just before it is needed, instead of at the very start; the response to the input then lands much closer to the moment it is displayed, so the perceived latency drops.
  • Work that does not depend on user input can be computed ahead of time.

Much of the work in the simulation task does not depend directly on the user input, or would be insensitive to a frame of latency in it. If the user processing is done last, and the input is sampled just before it is needed, rather than stored off at the beginning of the frame, the total latency can be reduced.

The approach

Late frame scheduling:
CPU1:SSSSSSSSS------I|
CPU2: |RRRRRRRRR-------|
GPU : |-GGGGGGGGGG-----|
VID : | |VVVVVVVVVVVVVVVV|
.................... latency 18 – 34 milliseconds

In code this amounts to little more than moving the read of the user input later in the frame; as the diagram shows, the latency from I to V drops noticeably. A rough before/after sketch follows.
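
This is a hypothetical sketch of the reordering; InputState and the helper functions are placeholders of mine, not names from the article.

struct InputState { float yaw, pitch; bool firePressed; };

InputState SampleInput();                   // read the newest controller / sensor state
void SimulateWorld();                       // the bulk of S, which does not need fresh input
void ApplyUserInput(const InputState& in);  // the input-dependent part of the frame
void RenderFrame();                         // R

// Conventional ordering: the sample taken here is already almost a frame old
// by the time its effect reaches the display.
void FrameEarlyInput()
{
    InputState in = SampleInput();          // I at the top of the frame
    SimulateWorld();
    ApplyUserInput(in);
    RenderFrame();
}

// Late frame scheduling: do all input-independent work first, then sample the
// input at the last possible moment before it is consumed.
void FrameLateInput()
{
    SimulateWorld();
    InputState in = SampleInput();          // I, as late as possible
    ApplyUserInput(in);
    RenderFrame();
}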

Limitations

The drawback Carmack notes is that late frame scheduling imposes a tight timing requirement that usually has to be met by busy waiting: the thread spins until exactly the right moment, because an OS scheduler cannot be relied on to wake it with the necessary precision, and that spinning wastes power (this was the part I initially found confusing).

The drawback to late frame scheduling is that it introduces a tight scheduling requirement that usually requires busy waiting to meet, wasting power.

View bypass

The idea

The previous strategy moves the input sample later; here the idea is to let the rendering code itself modify what the game submitted, based on an even newer sample of the user input.

An alternate way of accomplishing a similar, or slightly greater latency reduction is to allow the rendering code to modify the parameters delivered to it by the game code, based on a newer sampling of user input.

The author gives an example: compute the delta between the previous input sample and the current one, and use it to modify the view matrix that the game submitted; a sketch of this follows the quote below.

At the simplest level, the user input can be used to calculate a delta from the previous sampling to the current one, which can be used to modify the view matrix that the game submitted to the rendering code.
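
Here is a minimal sketch of that delta, assuming GLM for the math; the structure and names are mine, and it only handles an orientation change, not head translation.

#include <glm/glm.hpp>
#include <glm/gtc/quaternion.hpp>

struct HeadSample { glm::quat orientation; };   // hypothetical sensor sample

// The game built 'gameView' from the sample taken at the start of the frame.
// Just before issuing draw calls, the renderer re-samples the head and applies
// only the orientation change on top of the submitted view matrix.
glm::mat4 BypassViewMatrix(const glm::mat4& gameView,
                           const HeadSample& atSimulationTime,
                           const HeadSample& atRenderTime)
{
    // Delta rotation between the two samples, expressed as a view-space change.
    glm::quat delta = glm::inverse(atRenderTime.orientation) * atSimulationTime.orientation;

    // Pre-multiplying adjusts the view exactly as if the game had been handed
    // the newer sample in the first place (for rotation-only differences).
    return glm::mat4_cast(delta) * gameView;
}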

The author also notes that some content should not respond to user input at all, such as cinematic cut scenes, so at design time those layers need to be kept separate from the ones that do respond to input.

Delta processing in this way is minimally intrusive, but there will often be situations where the user input should not affect the rendering, such as cinematic cut scenes or when the player has died. It can be argued that a game designed from scratch for virtual reality should avoid those situations, because a non-responsive view in a HMD is disorienting and unpleasant, but conventional game design has many such cases.

The approach

Based on this idea the author proposes the model below. Note that I appears in both rows: the input is still sampled only once per frame, but the same sample is consumed by both the simulation task and the rendering task, so a small amount of input processing is duplicated.

The input is only sampled once per frame, but it is simultaneously used by both the simulation task and the rendering task. Some input processing work is now duplicated by the simulation task and the render task, but it is generally minimal.

View bypass:
CPU1:ISSSSSSSSS------|
CPU2: |IRRRRRRRRR------|
GPU : |--GGGGGGGGGG----|
VID : | |VVVVVVVVVVVVVVVV|
.................. latency 16 – 32 milliseconds

The author then extends this to tile-based GPUs, which took me a while to follow; the key requirement is this very interesting statement:

All calculations that depend on the view matrix must reference it independently from a buffer object, rather than from inline parameters or as a composite model-view-projection (MVP) matrix.

As I understand it, the reason is that on a tiled GPU the binned drawing commands are not actually executed until much later, so if every calculation reads the view matrix indirectly from a single buffer object (rather than having it baked into the command stream as inline parameters or a pre-combined MVP), that one buffer can be overwritten with an even newer head sample just before the GPU really renders, and all of the draws pick it up.
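
A sketch of what "reference it from a buffer object" could look like with an OpenGL uniform buffer; the names are mine, and note that plain desktop GL ordering rules guarantee that draws already issued see the old contents, so actually benefiting from the late update relies on tiler/driver behavior, which is what the article alludes to.

#include <GL/glew.h>
#include <glm/glm.hpp>
#include <glm/gtc/type_ptr.hpp>

GLuint gViewUbo = 0;   // every shader reads the view matrix from this buffer

void CreateViewBuffer()
{
    glGenBuffers(1, &gViewUbo);
    glBindBuffer(GL_UNIFORM_BUFFER, gViewUbo);
    glBufferData(GL_UNIFORM_BUFFER, sizeof(glm::mat4), nullptr, GL_DYNAMIC_DRAW);
    glBindBufferBase(GL_UNIFORM_BUFFER, /*binding point*/ 0, gViewUbo);
}

// Intended to run after all drawing commands for the frame have been issued,
// with a view matrix rebuilt from the very latest input sample.
void LateUpdateViewMatrix(const glm::mat4& latestView)
{
    glBindBuffer(GL_UNIFORM_BUFFER, gViewUbo);
    glBufferSubData(GL_UNIFORM_BUFFER, 0, sizeof(glm::mat4), glm::value_ptr(latestView));
}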

Tiler optimized view bypass:
CPU1:ISSSSSSSSS------|
CPU2: |IRRRRRRRRRR-----|I
GPU : | |-GGGGGGGGGG-----|
VID : | | |VVVVVVVVVVVVVVVV|
.................. latency 16 – 32 milliseconds

Limitations

Because the view parameters are re-sampled so late, the CPU-side work was done against the older view, so some rendering can come out wrong:

Any view frustum culling that was performed to avoid drawing some models may be invalid if the new view matrix has changed substantially enough from what was used during the rendering task

Time warping

The last strategy is time warping. The author first notes that if you knew exactly how long rendering would take, you could save some extra latency by late-scheduling the entire rendering task, but frame times vary far too much for that to be practical.

If you had perfect knowledge of how long the rendering of a frame would take, some additional amount of latency could be saved by late frame scheduling the entire rendering task, but this is not practical due to the wide variability in frame rendering times.

Late frame input sampled view bypass:
CPU1:ISSSSSSSSS------|
CPU2: |----IRRRRRRRRR--|
GPU : |------GGGGGGGGGG|
VID : | |VVVVVVVVVVVVVVVV|
.............. latency 12 – 28 millisec

The idea

The key point of time warping is that after the frame has been rendered with the best information available (possibly with bypassed view parameters), you fetch the newest user input, build updated view parameters from it, and compute a transformation that warps the already-rendered image to approximately where it would appear under those updated parameters, instead of displaying it as-is. A sketch of the orientation-only case follows the quote below.

After drawing a frame with the best information at your disposal, possibly with bypassed view parameters, instead of displaying it directly, fetch the latest user input, generate updated view parameters, and calculate a transformation that warps the rendered image into a position that approximates where it would be with the updated parameters
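
As a sketch of the orientation-only case, assuming GLM (the names are mine): the warp can be built as a projective matrix that maps clip-space positions of the rendered image to where they belong under the freshest head orientation; a warp pass would apply it, with a perspective divide, while resampling the rendered frame.

#include <glm/glm.hpp>
#include <glm/gtc/quaternion.hpp>

glm::mat4 TimeWarpMatrix(const glm::mat4& proj,
                         const glm::quat& headAtRenderTime,   // pose the frame was drawn with
                         const glm::quat& headAtScanoutTime)  // freshest sensor sample
{
    // Rotation that carries directions as seen with the old head pose to where
    // they appear with the new one (the view rotation is the inverse of the
    // head orientation, so the delta is inverse(new) * old).
    glm::mat4 deltaRotation = glm::mat4_cast(glm::inverse(headAtScanoutTime) * headAtRenderTime);

    // rendered clip space -> eye-space directions -> rotate -> clip space again
    return proj * deltaRotation * glm::inverse(proj);
}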

The approach

This is quite similar to late frame scheduling: just before the warp, a fresh I is taken; it can be the latest sensor reading, and the prediction can be targeted at the actual display time.

Late frame scheduled time warp:
CPU1:ISSSSSSSSS------|
CPU2: |RRRRRRRRRR----IR|
GPU : |-GGGGGGGGGG----G|
VID : | |VVVVVVVVVVVVVVVV|
.... latency 2 – 18 milliseconds

Limitations

Here the author notes that if only the view direction changes, the warp works very well; once the motion includes translation, rendering errors start to appear.

If the difference between the view parameters at the time of the scene rendering and the time of the final warp is only a change in direction, the warped image can be almost exactly correct within the limits of the image filtering. Effects that are calculated relative to the screen, like depth based fog (versus distance based fog) and billboard sprites will be slightly different, but not in a manner that is objectionable.

If the warp involves translation as well as direction changes, geometric silhouette edges begin to introduce artifacts where internal parallax would have revealed surfaces not visible in the original rendering. A scene with no silhouette edges, like the inside of a box, can be warped significant amounts and display only changes in texture density, but translation warping realistic scenes will result in smears or gaps along edges.

The example given is the inside of a box: a scene with no silhouette edges can be warped a long way and only the texture density changes, whereas translating the view in a realistic scene produces smears or gaps along object silhouettes, because the parallax should reveal surfaces that were never drawn in the original image.

Continuously updated time warping

Finally, the article closes with "Continuously updated time warping".

The idea

If time warping and the earlier strategies are clear, this step is not hard: the essence is still prediction plus incremental updates. Instead of predicting one warp for the bottom of the frame and warping the whole screen at once, the warp to the screen is done incrementally, with the warp matrix continuously refreshed as new input arrives.

If the necessary feedback and scheduling mechanisms are available, instead of predicting what the warp transformation should be at the bottom of the frame and warping the entire screen at once, the warp to screen can be done incrementally while continuously updating the warp matrix as new input arrives.

The approach

Note the bottom line, "latency 2 – 3 milliseconds for 500hz sensor updates": the warp is driven by the sensor, an extra component that updates far faster than the display refreshes. A rough sketch of the slice-by-slice idea follows the diagram.

Continuous time warp:
CPU1:ISSSSSSSSS------|
CPU2: |RRRRRRRRRRR-----|
GPU : |-GGGGGGGGGGGG---|
WARP: | W| W W W W W W W W|
VID : | |VVVVVVVVVVVVVVVV|
... latency 2 – 3 milliseconds for 500hz sensor updates
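
This sketch reuses the TimeWarpMatrix helper from the earlier sketch; all the other helpers here are hypothetical placeholders, not functions from any real SDK. The screen is processed in horizontal bands, each warped just ahead of scanout with a warp matrix rebuilt from the freshest sensor sample.

#include <glm/glm.hpp>
#include <glm/gtc/quaternion.hpp>

struct RenderedFrame {
    glm::mat4 proj;            // projection used when the frame was drawn
    glm::quat headAtRender;    // head orientation the frame was drawn with
};

glm::mat4 TimeWarpMatrix(const glm::mat4&, const glm::quat&, const glm::quat&); // see earlier sketch
void      WaitUntilScanoutApproaches(int slice);   // scheduling feedback: stay just ahead of the beam
glm::quat ReadLatestHeadOrientation();             // e.g. a 500 Hz IMU sample
void      WarpSliceToScreen(const RenderedFrame&, const glm::mat4& warp, int slice);

void ContinuouslyWarp(const RenderedFrame& frame, int numSlices /* e.g. 8 bands per 16 ms frame */)
{
    for (int slice = 0; slice < numSlices; ++slice) {
        WaitUntilScanoutApproaches(slice);                    // wake just before this band is scanned out
        glm::quat latest = ReadLatestHeadOrientation();       // at most ~2 ms old with a 500 Hz sensor
        glm::mat4 warp = TimeWarpMatrix(frame.proj, frame.headAtRender, latest);
        WarpSliceToScreen(frame, warp, slice);                // write only this band of the output
    }
}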

In the author's description, the ideal way to do this would be some form of "scanout shader" that runs just in time for the video display, and hardware along these lines has actually existed on some devices, such as the Nintendo DS.

The ideal interface for doing this would be some form of “scanout shader” that would be called “just in time” for the video display. Several video game systems like the Atari 2600, Jaguar, and Nintendo DS have had buffers ranging from half a scan line to several scan lines that were filled up in this manner.

The author then pivots: without such hardware, the other way to implement this is on the GPU itself. This is very close to how today's Qualcomm platforms do it, namely the ATW thread in the SXR SDK.

GPUs can perform the time warping operation much more efficiently than a conventional CPU can, but the GPU will be busy drawing the next frame during video scanout, and GPU drawing operations cannot currently be scheduled with high precision due to the difficulty of task switching the deep pipelines and extensive context state. However, modern GPUs are beginning to allow compute tasks to run in parallel with graphics operations, which may allow a fraction of a GPU to be dedicated to performing the warp operations as a shared parameter buffer is updated by the CPU.