Graphics Engine in Practice: UGUI Name Board Batching, Lightmap/Shadowmask Channel Merging, and GPU Query on Android

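Recently I have been optimizing performance for a mobile game built on Unity's Universal Render Pipeline (URP). The work described here consists of experiments and verifications that target the project's current rendering bottlenecks.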

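1. UGUI Name Board Batching Optimization

1.1. Background

The name boards above characters' heads in the current project are rendered with UGUI (renderMode set to WorldSpace). This produces a high draw call count and excessive CPU time, so rendering performance needs to improve while preserving the visual result as far as possible.

Before optimization, the parent-child hierarchy of the name boards looked roughly like this: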

NPCRoot
|____...
|____NPCUI-01(Canvas component / custom components handling Billboard logic, Click logic, etc.)
     |____...
     |____Image
     |____Text(MeshPro)
     |____...
|____NPCUI-02(same as NPCUI-01)
     |____...
     |____Image
     |____Text(MeshPro)
     |____...
|____...

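Separate Canvases are never batched together, and because Image and Text(MeshPro) use different shaders and materials, they also break batching within a Canvas. The optimization can therefore start from these two points.

1.2. Core Idea

The structure above is involved not only in name board rendering but also in clicking, Billboard behaviour, status icon management and other logic, so a full rewrite was ruled out. The optimized hierarchy is shown first and explained in detail afterwards: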

NPCRoot
|____...
|____NPCUICanvas(Canvas component)
     |____NPCUI-01(custom components handling Billboard logic, Click logic, etc.)
          |____...
          |____Image(UGUINameBoardComponent->NameBoard Type: Image)
          |____Text(UGUINameBoardComponent->NameBoard Type: Text)
          |____...
     |____NPCUI-02(same as NPCUI-01)
          |____...
          |____Image(UGUINameBoardComponent->NameBoard Type: Image)
          |____Text(UGUINameBoardComponent->NameBoard Type: Text)
          |____...
     |____...
|____...

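To address the Canvas problem, a single Canvas component is mounted under NPCRoot (NPCUICanvas in the hierarchy above) and the Canvas component on every NPCUI-xx is removed. Doing so breaks the occlusion order among the child UI elements of NPCUICanvas, however, because UI elements under one Canvas are drawn in sibling (hierarchy) order. The children therefore have to be re-sorted at runtime by the distance from the camera to the world-space position of each target character's name board mount point: the closer a name board is to the camera, the later it must come among the siblings. The C# sorting code is shown below: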

private struct NameBoardItem
{
    public int npcIndex;
    public float cameraDistance;
};
private readonly Comparison<NameBoardItem> _DistanceSortComparison = NameBoardDistanceComparer;
private static int NameBoardDistanceComparer(NameBoardItem item1, NameBoardItem item2)
{
    float f = item1.cameraDistance - item2.cameraDistance;
    if (f > 0.0f)
        return -1;
    else if (f < 0.0f)
        return 1;
    else
        return 0;
}
private void SortNPCUIElements()
{
    // _npcUIList: list of the custom scripts attached to each NPCUI-xx
    // _sceneCameraTransform: Transform of the camera that follows the character
    if (_npcUIList == null || _sceneCameraTransform == null)
        return;
    _sortedNPCUIList.Clear();
    for (int index = 0; index < _npcUIList.Count; ++index) 
    {
        NameBoardItem item = new NameBoardItem();
        item.npcIndex = index;
        item.cameraDistance = Vector3.Distance(
            _npcUIList[index].target.transform.position, 
            _sceneCameraTransform.position);
        _sortedNPCUIList.Add(item);
    }
    _sortedNPCUIList.Sort(_DistanceSortComparison);
    for (int i = 0; i < _sortedNPCUIList.Count; i++)
    {
        _npcUIList[_sortedNPCUIList[i].npcIndex].transform.SetSiblingIndex(i);
    }
}

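For the old problem of Image and Text(MeshPro) breaking each other's batches, the more convenient approach is to create two Canvases, one for Images and one for Text(MeshPro). That would mean abandoning the existing NPCUI-xx entity management structure, which contradicts the optimization principle. There is also a solution that keeps the current structure: write a custom script, UGUINameBoardComponent, modelled on (and merging) the implementations of the Image and Text(MeshPro) components.

The UGUINameBoardComponent class derives from Text and implements interfaces such as ISerializationCallbackReceiver and ICanvasRaycastFilter, overriding several key virtual functions (SetAllDirty, for example) as the project requires. Only the parts relevant to the rendering optimization are covered here.

Start with the mesh. The script has to generate and write the UI mesh vertices. So that the shader can later tell Image from Text when sampling, the Image type writes a fixed value outside the normal range (such as -1) into the UV0 channel and writes the original UV0 values into the UV1 channel, while the Text type keeps its original logic. The following C# code shows the case where Image Type is Simple: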

protected override void OnPopulateMesh(VertexHelper toFill)
{
    if (m_UGUINameBoardType == UGUINameBoardType.Image)
    {
        // Generate vertex data that marks this element as an Image
        switch (type)
        {
            case Image.Type.Simple:
                // Braces are required: without them the break below would bind to the else branch
                if (!useSpriteMesh)
                {
                    GenerateSimpleSprite(toFill, m_PreserveAspect);
                }
                else
                {
                    //GenerateSprite(toFill, m_PreserveAspect);
                }
                break;
            case Image.Type.Sliced:
                //GenerateSlicedSprite(toFill);
                break;
            case Image.Type.Tiled:
                //GenerateTiledSprite(toFill);
                break;
            case Image.Type.Filled:
                //GenerateFilledSprite(toFill, m_PreserveAspect);
                break;
        }
    }
    else if (m_UGUINameBoardType == UGUINameBoardType.Text)
    {
        // Call OnPopulateMesh of the Text base class
        base.OnPopulateMesh(toFill);
        // Vertex processing for ApplyGradient omitted...
    }
}
void GenerateSimpleSprite(VertexHelper vh, bool lPreserveAspect)
{
    Vector4 v = GetDrawingDimensions(lPreserveAspect);
    var uv = (activeSprite != null) ? 
        Sprites.DataUtility.GetOuterUV(activeSprite) : Vector4.zero;
    var color32 = color;
    vh.Clear();
    // Write -Vector2.one into the UV0 channel to mark the vertex as Image
    vh.AddVert(new Vector3(v.x, v.y)/*position*/, 
               color32/*color*/, 
               -Vector2.one/*uv0*/, 
               new Vector2(uv.x, uv.y)/*uv1*/, 
               new Vector3(0, 0, -1f)/*normal*/, 
               new Vector4(1f, 0, 0, -1f)/*tangent*/);
    vh.AddVert(new Vector3(v.x, v.w), color32, -Vector2.one, new Vector2(uv.x, uv.w), new Vector3(0, 0, -1f), new Vector4(1f, 0, 0, -1f));
    vh.AddVert(new Vector3(v.z, v.w), color32, -Vector2.one, new Vector2(uv.z, uv.w), new Vector3(0, 0, -1f), new Vector4(1f, 0, 0, -1f));
    vh.AddVert(new Vector3(v.z, v.y), color32, -Vector2.one, new Vector2(uv.z, uv.y), new Vector3(0, 0, -1f), new Vector4(1f, 0, 0, -1f));
    vh.AddTriangle(0, 1, 2);
    vh.AddTriangle(2, 3, 0);
}

Correspondingly, the shader uses uv0 to distinguish the Image and Text types and samples a different texture for each, as in the shader code below:

if ((IN.uv0.x + IN.uv0.y) <= -2)//Image Type 
{
    color = tex2D(_ImageTex, IN.uv1) * IN.color;
}
else//Text Type
{
    color = (tex2D(_MainTex, IN.uv0) + _TextureSampleAdd) * IN.color;
}

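The shader must declare both texture slots: the Image Sprite Texture (_ImageTex above) and the Font Texture (_MainTex above; because the script inherits from Text, the _MainTex slot is reserved for the Font Texture). The script must also bind the same texture resources everywhere, which means all name board Images have to use one atlas and all name board Texts have to use one font. Otherwise differing textures would break the batch.

1.3. Results

To quantify the result and examine the simple relationship between name board count and performance gain, two empty scenes were created containing 9 and 18 name boards respectively (each name board currently holds only one row of text and one row of images, with a font size of 18). Testing on a Xiaomi 12S device gave the following results: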

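Only the transparent rendering pass is measured, both before and after optimization.
Name boards	CPU avg before (ms)	GPU avg before (ms)	Draw calls before	CPU avg after (ms)	GPU avg after (ms)	Draw calls after	Notes
18	0.440	0.154	36	0.090	0.202	1	
9	0.250	0.073	18	0.080	0.102	(not listed)	UI screen pixel coverage also drops by about 50%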

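The table shows that when many name boards are visible to the camera at once, the CPU time spent organising UI render data drops markedly. Production scenes contain far more name boards or UI elements than these tests, so the pre-optimization draw call count may be 2-3 times that of the test scenes, which means an even heavier CPU load. GPU time does rise somewhat, but the GPU is not the bottleneck in this situation, so the overall impact is negligible. The approach is fairly general: projects with UGUI-based name boards that are CPU-bound can consider it.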
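To make the table easier to read, the relative changes can be worked out with a few lines of arithmetic. This is only a sketch in C++; the helper names ReductionPercent and SavedMsPerDrawCall are made up for illustration and are not part of the project.

```cpp
#include <cassert>
#include <cmath>

// Share of the baseline that was saved, in percent, rounded to one decimal.
// A negative result means the value grew instead of shrinking.
inline double ReductionPercent(double before, double after) {
    assert(before > 0.0);
    return std::round((before - after) / before * 1000.0) / 10.0;
}

// Average CPU time saved per removed draw call, in milliseconds.
inline double SavedMsPerDrawCall(double cpuBefore, double cpuAfter,
                                 int drawCallsBefore, int drawCallsAfter) {
    assert(drawCallsBefore > drawCallsAfter);
    return (cpuBefore - cpuAfter) / (drawCallsBefore - drawCallsAfter);
}
```

Fed with the 18-board row, ReductionPercent(0.440, 0.090) gives 79.5 (CPU), ReductionPercent(0.154, 0.202) gives -31.2 (GPU time grows), and SavedMsPerDrawCall(0.440, 0.090, 36, 1) comes out at roughly 0.01 ms per draw call removed. For the 9-board row the CPU reduction is 68.0.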

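1.4. Related Issues

Although rendering efficiency improves considerably, ordinary text looks worse than distance-field text when it is magnified or stretched. In the future the Text base class can be replaced with a TextMeshPro base class; with as few changes as possible, there are currently two general solutions. The first keeps the pixel size of the text constant regardless of its distance to the camera; whether this is acceptable depends on the project's requirements. If the text has to grow as the name board approaches the camera, the second solution is needed: taking the largest on-screen text size as the baseline, increase Font Size and reduce the Rect Transform's localScale until the text looks acceptable. This is similar to downsampling; its drawback is that it may enlarge the Font Texture and increase texture read bandwidth.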
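As a starting point for the second solution, one option is to scale localScale down by the same factor that Font Size is scaled up, so the text keeps its on-screen size while being rasterized at a higher resolution. The sketch below assumes that proportional relationship; the function name CompensatedLocalScale is hypothetical, and the final values would still be tuned by eye as described above.

```cpp
#include <cassert>

// Returns the localScale that keeps the text at the same on-screen size
// after Font Size is raised from baseFontSize to enlargedFontSize.
// Assumption: rendered text size is proportional to fontSize * localScale.
inline double CompensatedLocalScale(double baseScale, int baseFontSize,
                                    int enlargedFontSize) {
    assert(baseFontSize > 0 && enlargedFontSize >= baseFontSize);
    return baseScale * static_cast<double>(baseFontSize) / enlargedFontSize;
}
```

For example, doubling the font size from 18 to 36 halves the scale: CompensatedLocalScale(1.0, 18, 36) returns 0.5.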

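2. Lightmap and Shadowmask Channel Merging

2.1. Background

Comparing against the recommended texture bandwidth and cache hit values for 3D games listed on the official Android site showed that the project still has plenty of room for optimization. Tackling this means checking textures one by one for problems such as oversized textures or poorly chosen compression formats. For lightmaps, besides controlling some hard parameter limits, the idea was that when Lighting Mode is Shadowmask, merging the existing lightmap and shadowmask textures into one by channel could also improve rendering efficiency somewhat. After a survey found that some comparable games do the same, a first version was implemented to verify the idea.

2.2. Core Idea

The channel merge is implemented mainly on the editor side. The general idea is one merge function called from two places: a listener on the lightmap bake-completed event, which runs the merge whenever a bake finishes, and a menu item that triggers the merge manually for scenes whose lightmaps have already been baked.

The main flow of the merge function is as follows: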

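Get the current scene's pre-merge lightmap, denoted ULM, and shadowmask, denoted USM, through LightmapSettings.
Get the Texture Importer of ULM from its path, denoted ULT, and override the default format with RGBA32.
Blit the pixel data of ULM into a newly created RT with the same width and height as ULM, denoted RT1, with format RenderTextureFormat.ARGB32 and RenderTextureReadWrite set to sRGB.
Blit the pixel data of USM into a newly created RT with the same width and height as USM, denoted RT2, with format RenderTextureFormat.ARGB32 and RenderTextureReadWrite set to Linear.
Traverse RT1 pixel by pixel. If the current platform uses RGBM encoding[1], decode each pixel $c$ with the following formula[2]: $c_{\mathrm{hdr}} = c_{rgb} \cdot (c_{a} \cdot 5)$, then apply dLDR encoding (which only maps the range [0,2] to [0,1]): $c_{\mathrm{dLDR}} = c_{\mathrm{hdr}} \cdot 0.5$. If the current platform already uses dLDR encoding, no extra processing is needed.
Write the single-channel data of RT2 into the Alpha channel of RT1.
Replace the asset at ULM's original path with RT1, adjust the per-platform compression formats of ULT appropriately[3], and change the Texture Type property to Default.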
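The per-pixel part of this flow can be sketched on the CPU as follows. This is a simplified illustration in C++ rather than the editor tool itself: the Pixel struct and the functions DecodeRgbm, EncodeDldr and MergePixel are hypothetical names, channels are assumed to be normalized floats, and the sRGB/Linear handling of the render textures is left out.

```cpp
#include <algorithm>
#include <cassert>

// One RGBA pixel with channels normalized to [0, 1] (assumed layout).
struct Pixel { float r, g, b, a; };

// RGBM range in Gamma space, per footnote [2]: decoded values span [0, 5].
constexpr float kRgbmMaxGamma = 5.0f;

// RGBM decode: rgb * (a * 5).
inline Pixel DecodeRgbm(const Pixel& p) {
    const float m = p.a * kRgbmMaxGamma;
    return {p.r * m, p.g * m, p.b * m, 1.0f};
}

// dLDR encode: map [0, 2] to [0, 1]. Anything brighter than 2 is clamped.
inline Pixel EncodeDldr(const Pixel& hdr) {
    auto enc = [](float v) { return std::min(std::max(v * 0.5f, 0.0f), 1.0f); };
    return {enc(hdr.r), enc(hdr.g), enc(hdr.b), hdr.a};
}

// Re-encode the lightmap pixel if needed, then store the shadowmask in alpha.
inline Pixel MergePixel(const Pixel& lightmap, float shadowmask, bool lightmapIsRgbm) {
    assert(shadowmask >= 0.0f && shadowmask <= 1.0f);
    Pixel out = lightmapIsRgbm ? EncodeDldr(DecodeRgbm(lightmap)) : lightmap;
    out.a = shadowmask;
    return out;
}
```

Note the clamp in EncodeDldr: RGBM can hold values up to 5 while dLDR tops out at 2, so very bright lightmap texels lose range in the merged texture.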

Besides the editor tool, the shaders also need changes in two respects. First, the lightmap decoding in EntityLighting.hlsl is unified to use the Double LDR path:

// UNITY_COMBINE_LIGHT_SHADOWMASK is a custom macro that toggles the channel merge feature
#ifdef UNITY_COMBINE_LIGHT_SHADOWMASK
    #ifdef UNITY_COLORSPACE_GAMMA
        #define LIGHTMAP_HDR_MULTIPLIER real(2.0)
    #else
        #define LIGHTMAP_HDR_MULTIPLIER real(4.59) // 2.0 ^ 2.2
    #endif
    #define LIGHTMAP_HDR_EXPONENT real(0.0)
#else
    #ifdef UNITY_LIGHTMAP_RGBM_ENCODING
        #ifdef UNITY_COLORSPACE_GAMMA
            #define LIGHTMAP_HDR_MULTIPLIER LIGHTMAP_RGBM_MAX_GAMMA
            #define LIGHTMAP_HDR_EXPONENT   real(1.0)
        #else
            #define LIGHTMAP_HDR_MULTIPLIER LIGHTMAP_RGBM_MAX_LINEAR
            #define LIGHTMAP_HDR_EXPONENT   real(2.2)
        #endif
    #elif defined(UNITY_LIGHTMAP_DLDR_ENCODING)
        #ifdef UNITY_COLORSPACE_GAMMA
            #define LIGHTMAP_HDR_MULTIPLIER real(2.0)
        #else
            #define LIGHTMAP_HDR_MULTIPLIER real(4.59)
        #endif
        #define LIGHTMAP_HDR_EXPONENT real(0.0)
    #else // (UNITY_LIGHTMAP_FULL_HDR)
        #define LIGHTMAP_HDR_MULTIPLIER real(1.0)
        #define LIGHTMAP_HDR_EXPONENT real(1.0)
    #endif
#endif
...
real3 DecodeLightmap(real4 encodedIlluminance, real4 decodeInstructions)
{
#ifdef UNITY_COMBINE_LIGHT_SHADOWMASK
    return UnpackLightmapDoubleLDR(encodedIlluminance, decodeInstructions);
#else
#if defined(UNITY_LIGHTMAP_RGBM_ENCODING)
    return UnpackLightmapRGBM(encodedIlluminance, decodeInstructions);
#elif defined(UNITY_LIGHTMAP_DLDR_ENCODING)
    return UnpackLightmapDoubleLDR(encodedIlluminance, decodeInstructions);
#else // (UNITY_LIGHTMAP_FULL_HDR)
    return encodedIlluminance.rgb;
#endif
#endif
}

Second, in Shadows.hlsl the shadowmask sampling has to be replaced with a sample of the merged lightmap:

#if defined(UNITY_COMBINE_LIGHT_SHADOWMASK) && defined(LIGHTMAP_ON)
    #define SAMPLE_SHADOWMASK(uv) half4(SAMPLE_TEXTURE2D_LIGHTMAP(LIGHTMAP_NAME, LIGHTMAP_SAMPLER_NAME, uv).a, 0, 0, 0);
#else
    #if defined(SHADOWS_SHADOWMASK) && defined(LIGHTMAP_ON)
        #define SAMPLE_SHADOWMASK(uv) SAMPLE_TEXTURE2D_LIGHTMAP(SHADOWMASK_NAME, SHADOWMASK_SAMPLER_NAME, uv SHADOWMASK_SAMPLE_EXTRA_ARGS);
    #elif !defined (LIGHTMAP_ON)
        #define SAMPLE_SHADOWMASK(uv) unity_ProbesOcclusion;
    #else
        #define SAMPLE_SHADOWMASK(uv) half4(1, 1, 1, 1);
    #endif
#endif

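That essentially completes the work. Lightmaps baked under either the PC platform or a mobile platform are both supported.

2.3. Results

In a typical battle scene (about 20k triangles and roughly 450 draw calls from a fixed camera view), Snapdragon Profiler in realtime mode was used to monitor GPU metrics on a Xiaomi 12S Android device. The statistics over 60 seconds are listed below (some metrics are approximate):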

GPU metric	Before max	Before min	Before avg	After max	After min	After avg
Texture Memory Read BW (10^9 Bytes/Sec)	2.994	2.767	2.895	2.951	2.726	2.836
SP Memory Read (10^6 Bytes/Sec)	5.512	5.067	5.292	5.236	4.834	5.014
Read Total (10^9 Bytes/Sec)	3.399	3.138	3.285	3.350	3.089	3.210
Avg Bytes / Fragment	0.311	0.308	0.310	0.297	0.295	0.297
% Texture Pipes Busy	76.51	75.55	76.01	75.51	74.40	74.91
Textures / Fragment	3.720	3.700	3.710	3.580	3.560	3.570

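Texture read bandwidth drops slightly, by about 2%. The amount of memory data read per second by the Shader Processors drops by about 5.3%, and the average number of textures and bytes accessed per fragment drops by roughly 3.8-4.2% (all computed from the avg columns, relative to the pre-optimization value). In short, projects that use lightmaps and carry a heavy GPU load can consider adopting this.

For the same battle scene, the average GPU time of the opaque and transparent passes before and after optimization is:

Pass	GPU avg before (ms)	GPU avg after (ms)	Overall saving (both passes)
Opaque Pass	6.732	6.542	about 3%
Transparent Pass	0.614	0.604	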

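2.4. Related Issues

Like the vast majority of other 3D projects, the current project uses the linear color space.
After the editor tool merges the lightmap and shadowmask, the result is a four-channel RGBA texture. Keeping the original lightmap's (ASTC) compression format would degrade visual quality. On mobile, therefore, ASTC 5x5 is generally a reasonable choice for normal quality and ASTC 4x4 for high quality, whereas for a three-channel RGB texture ASTC 8x8 would be chosen for normal quality and ASTC 6x6 for high quality.
All of the above assumes that in the project's Player settings the Lightmap Encoding is set to Low Quality for the mobile platform, or to Normal Quality for desktop-only platforms. If another option is selected, additional compatibility handling is needed in the script, depending on the resulting encoding.
In the import settings of the merged lightmap, keep Texture Type as Default and keep sRGB checked; do not change the type to Lightmap.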
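When weighing the ASTC block sizes mentioned above, it helps to remember that every ASTC block occupies 128 bits whatever its footprint, so a smaller block means more bits per pixel. The sketch below computes the storage cost of a single mip level; the function names are illustrative, and mip chains and any container overhead are ignored.

```cpp
#include <cassert>
#include <cstdint>

// Every ASTC block is stored in 16 bytes (128 bits), regardless of footprint.
constexpr std::uint64_t kAstcBytesPerBlock = 16;

// Size in bytes of one mip level; partial blocks at the edges count as full blocks.
inline std::uint64_t AstcSizeBytes(std::uint32_t width, std::uint32_t height,
                                   std::uint32_t blockW, std::uint32_t blockH) {
    assert(blockW > 0 && blockH > 0);
    const std::uint64_t blocksX = (width + blockW - 1) / blockW;
    const std::uint64_t blocksY = (height + blockH - 1) / blockH;
    return blocksX * blocksY * kAstcBytesPerBlock;
}

// Average bits spent per pixel for a given block footprint.
inline double AstcBitsPerPixel(std::uint32_t blockW, std::uint32_t blockH) {
    return 128.0 / (static_cast<double>(blockW) * blockH);
}
```

For a 1024x1024 texture this gives 262144 bytes at 8x8 (2 bits per pixel), 672400 bytes at 5x5 and 1048576 bytes at 4x4 (8 bits per pixel). The merged texture therefore costs more than the original RGB lightmap on its own, and the saving comes from no longer sampling a separate shadowmask.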

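3. GPU Query on Android

3.1. Background

When evaluating GPU performance for the project, Xcode on iOS provides information such as the render time of each pass. On Android, tools like Snapdragon Profiler or RenderDoc can be used, but the former reports cost in Clocks/Second, which can be converted to time yet is not intuitive, and the timing reported by the latter is usually not accurate enough. A tool or plugin is therefore needed that shows the GPU time of each built-in (and custom) render pass accurately and intuitively.

A more complete tool would take more time and manpower, so for now an existing plugin was modified to meet the urgent need.

3.2. Native Plugin

For developing Unity native plugins and the related caveats, please refer to the Unity documentation.

This section covers the changes on the native C++ side and the Unity C# side. In the C++ source file RenderTimingPlugin.cpp, the core change is the addition of the BeginTimeQueryEvent, EndTimeQueryEvent and PrintTimeQueryEvent functions, which query the GPU time of individual passes or draw calls: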

static const int FRAME_COUNT = 2;
static const int QUERY_COUNT = 50;
static int _frameCount = 0;
static GLuint _query[QUERY_COUNT * FRAME_COUNT];
static std::vector<int> _eventIDList;
static void InitRenderTiming()
{
    glGenQueries(QUERY_COUNT * FRAME_COUNT, _query);
    GLint disjointOccurred;
    glGetIntegerv(GL_GPU_DISJOINT, &disjointOccurred);
    _eventIDList.reserve(QUERY_COUNT);
    _eventIDList.clear();
}
// eventID is explained further below
static void UNITY_INTERFACE_API BeginTimeQueryEvent(int eventID) {
    if (s_DeviceType == kUnityGfxRendererNull)
        return;
    // Ignore an event that has already been started in this frame
    for (int index = 0; index < (int)_eventIDList.size(); ++index)
    {
        if (_eventIDList[index] == eventID)
            return;
    }
    if (eventID >= 0 && eventID < QUERY_COUNT)
    {
        // Pick the query slot from the frame parity and the event ID
        int writeIndex = (_frameCount % FRAME_COUNT) * QUERY_COUNT + eventID;
        glBeginQuery(GL_TIME_ELAPSED, _query[writeIndex]);
        if (glGetError() == GL_NO_ERROR)
            _eventIDList.push_back(eventID);
    }
}
static void UNITY_INTERFACE_API EndTimeQueryEvent(int eventID) {
    if (s_DeviceType == kUnityGfxRendererNull)
        return;
    bool isMatched = false;
    for (int index = 0; index < (int)_eventIDList.size(); ++index) 
    {
        if (_eventIDList[index] == eventID) 
        {
            isMatched = true;
            break;
        }
    }
    if (isMatched)
        glEndQuery(GL_TIME_ELAPSED);
}
static void UNITY_INTERFACE_API PrintTimeQueryEvent(int eventID/*unused*/) {
    // Reading GL_GPU_DISJOINT also resets the flag; a disjoint event makes the timings unreliable
    GLint disjointOccurred = 0;
    glGetIntegerv(GL_GPU_DISJOINT, &disjointOccurred);
    // Skip the first frames (no previous results yet), disjoint frames, and the case of no registered callback
    if (_frameCount > 1 && !disjointOccurred && EventTime != nullptr)
    {
        for (int index = 0; index < (int)_eventIDList.size(); ++index)
        {
            GLuint available = 0;
            int currentEventID = _eventIDList[index];
            // Read the slot written during the previous frame to avoid stalling on the GPU
            int readIndex = ((_frameCount + 1) % FRAME_COUNT) * QUERY_COUNT + currentEventID;
            glGetQueryObjectuiv(_query[readIndex],
                                GL_QUERY_RESULT_AVAILABLE,
                                &available);
            if (available)
            {
                GLuint elapsed_time_ns = 0;
                glGetQueryObjectuiv(_query[readIndex],
                                    GL_QUERY_RESULT,
                                    &elapsed_time_ns);
                if (glGetError() == GL_NO_ERROR)
                {
                    // Nanoseconds to milliseconds
                    float elapsed_time_m_seconds = elapsed_time_ns / 1e6f;
                    EventTime(currentEventID, elapsed_time_m_seconds);
                }
            }
        }
    }
    _eventIDList.clear();
    _frameCount++;
}

Then register and export the static functions above, in the following form:

extern "C" UnityRenderingEvent UNITY_INTERFACE_EXPORT UNITY_INTERFACE_API TimeQueryBegin() {
  return BeginTimeQueryEvent;
}
extern "C" UnityRenderingEvent UNITY_INTERFACE_EXPORT UNITY_INTERFACE_API TimeQueryEnd() {
  return EndTimeQueryEvent;
}
extern "C" UnityRenderingEvent UNITY_INTERFACE_EXPORT UNITY_INTERFACE_API TimeQueryEndFrame() {
  return PrintTimeQueryEvent;
}

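TimeQueryBegin must be called at the very start of each pass (or draw call).
TimeQueryEnd must be paired with TimeQueryBegin and called after the pass (or draw call).
Both functions essentially insert GPU commands: one ends up in front of the draw commands and the other behind them.
Many such query objects can be inserted into the pipeline within one frame, so an event ID is needed to manage them. In other words, the event ID together with the current frame index determines which query object is used; see the BeginTimeQueryEvent function for details.
TimeQueryEndFrame is called at the end of a frame. To avoid stalling, it reads back the results of the query objects issued in the previous frame.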
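The slot arithmetic behind this double buffering can be isolated into two small functions. The sketch below mirrors the index expressions used in BeginTimeQueryEvent and PrintTimeQueryEvent; the names WriteSlot, ReadSlot, kFrameCount and kQueryCount are illustrative stand-ins for the plugin's own code and constants.

```cpp
#include <cassert>

constexpr int kFrameCount = 2;   // mirrors FRAME_COUNT: two sets of queries
constexpr int kQueryCount = 50;  // mirrors QUERY_COUNT: events per frame

// Slot that the frame currently being recorded writes into.
inline int WriteSlot(int frame, int eventID) {
    assert(eventID >= 0 && eventID < kQueryCount);
    return (frame % kFrameCount) * kQueryCount + eventID;
}

// Slot read back at the end of the current frame: the one written a frame earlier.
inline int ReadSlot(int frame, int eventID) {
    assert(eventID >= 0 && eventID < kQueryCount);
    return ((frame + 1) % kFrameCount) * kQueryCount + eventID;
}
```

For any frame n >= 1, ReadSlot(n, id) equals WriteSlot(n - 1, id) and never equals WriteSlot(n, id), so the CPU only asks for results the GPU has had a full frame to finish. The scheme does assume that the same event IDs are issued every frame.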

PrintTimeQueryEvent above calls EventTime(), a callback whose setter is exported to Unity in the same way:

typedef void (*PrintEventTimeCallback)(int, float);//int: eventID . float: time in ms
static PrintEventTimeCallback EventTime;
extern "C" {
    void PrintEventTime(PrintEventTimeCallback callback) {EventTime = callback;}
}

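The source is then compiled with the Android NDK into .so libraries for the different CPU architectures. Once they are imported into Unity, the three main functions above have to be triggered through the CommandBuffer IssuePluginEvent interface, which can be done with URP's Renderer Feature mechanism[4].

Specifically, GPUTimingPrintPass, derived from ScriptableRenderPass, executes the plugin functions, and GPUTimingPrintFeature, derived from ScriptableRendererFeature, decides where in the pipeline each GPUTimingPrintPass is inserted.

Taking the shadow pass as an example, the GPUTimingPrintFeature class can be written simply as: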

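using System;
using System.Runtime.InteropServices;
using UnityEngine;
using UnityEngine.Rendering.Universal;

public class GPUTimingPrintFeature : ScriptableRendererFeature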
{
    public enum QueryEventList
    {
        EVENT_CASCADE_SHADOW_PASS = 0,
        //TODO...
        EVENT_MAX_COUNT = 50,
    }
#if UNITY_ANDROID && !UNITY_EDITOR
    [DllImport("RenderTimingPlugin")]
    private static extern void PrintEventTime(IntPtr ftp);
    [DllImport("RenderTimingPlugin")]
    private static extern IntPtr TimeQueryBegin();
    [DllImport("RenderTimingPlugin")]
    private static extern IntPtr TimeQueryEnd();
    [DllImport("RenderTimingPlugin")]
    private static extern IntPtr TimeQueryEndFrame();
    [UnmanagedFunctionPointer(CallingConvention.Cdecl)]
    private delegate void PrintTimeDelegate(int eventID, float timeElapsed);
    [AOT.MonoPInvokeCallback(typeof(PrintTimeDelegate))]
    static void EventTimingCallBack(int eventID, float timeElapsed)
    {
        QueryEventList eventEnum = (QueryEventList)eventID;
        String gpuTiming = String.Format("GPU Time: {0:F3} ms | {1,-10}",
        timeElapsed,
        eventEnum.ToString());
        Debug.LogWarning("GPUQueryPlugin RESULT: " + gpuTiming);
    }
#endif
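#if UNITY_ANDROID && !UNITY_EDITOR
    // Held in a static field so the delegate cannot be garbage collected while native code keeps its pointer
    private static PrintTimeDelegate s_EventTimingCallback;
#endif
    public override void Create()
    {
#if UNITY_ANDROID && !UNITY_EDITOR
        s_EventTimingCallback = new PrintTimeDelegate(EventTimingCallBack);
        IntPtr intptr_delegate_event_timing = Marshal.GetFunctionPointerForDelegate(s_EventTimingCallback);
        PrintEventTime(intptr_delegate_event_timing);
#endif
    }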
    public override void AddRenderPasses(ScriptableRenderer renderer, ref RenderingData renderingData)
    {
#if UNITY_ANDROID && !UNITY_EDITOR
        //Shadow Pass
        renderer.EnqueuePass(new GPUTimingPrintPass(
            RenderPassEvent.BeforeRenderingShadows, 
            TimeQueryBegin(), 
            "ShadowBeginQuery", 
            QueryEventList.EVENT_CASCADE_SHADOW_PASS));
        renderer.EnqueuePass(new GPUTimingPrintPass(
            RenderPassEvent.AfterRenderingShadows, 
            TimeQueryEnd(), 
            "ShadowEndQuery", 
            QueryEventList.EVENT_CASCADE_SHADOW_PASS));
        //End of frame
        renderer.EnqueuePass(new GPUTimingPrintPass(
            RenderPassEvent.AfterRendering, 
            TimeQueryEndFrame(),
            "QueryGPUTiming", 
            0/*unused*/));
#endif
    }
}

The GPUTimingPrintPass class can be written simply as:

using System;
using UnityEngine.Rendering;
using UnityEngine.Rendering.Universal;

public class GPUTimingPrintPass : ScriptableRenderPass
{
    private IntPtr _pluginCallback;
    private string _eventName;
    private GPUTimingPrintFeature.QueryEventList _eventID;
    public GPUTimingPrintPass(RenderPassEvent renderPassEvent, IntPtr pluginCallBack, string eventName, GPUTimingPrintFeature.QueryEventList eventID)
    {
        this.renderPassEvent = renderPassEvent;
        this._pluginCallback = pluginCallBack;
        this._eventName = eventName;
        this._eventID = eventID;
    }
    public override void Execute(ScriptableRenderContext context, ref RenderingData renderingData)
    {
        CommandBuffer cmd = CommandBufferPool.Get();
        using (new ProfilingScope(cmd, new ProfilingSampler(this._eventName)))
        {
            cmd.name = this._eventName;
            cmd.IssuePluginEvent(this._pluginCallback, (int)this._eventID);
        }
        context.ExecuteCommandBuffer(cmd);
        CommandBufferPool.Release(cmd);
    }
}

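In addition, GPUTimingPrintFeature can be extended with static functions for monitoring custom passes, a GUI display of the timings, and statistics such as average, maximum and minimum time; these are not listed here.

With a device running the plugin connected, the commands inserted into the pipeline are clearly visible in the Frame Debugger.

So far the plugin has been tested without problems on the GPUs of Xiaomi 10 through 12S devices; GPUs of other models remain to be tested.

3.3. Capabilities and Limitations

The original plugin could only query the GPU time of the whole current frame. This version adds timing queries for the individual passes of the forward or deferred rendering path, and also supports timing custom RenderFeatures.
As mentioned in the background, the plugin helps optimization engineers quickly locate expensive passes and complements other tools for finer-grained analysis. For now it supports only the OpenGL ES 3 graphics API on Android, not Vulkan, and only Unity's Universal Render Pipeline.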

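[1] For lightmap encoding schemes, see "Lightmaps: Technical information" in the Unity Manual (unity3d.com).

[2] In Gamma space the pixel data ranges from 0 to 5.

[3] See the second item of 2.4. Related Issues.

[4] Calls to the Camera.AddCommandBuffer interface never took effect; implementing this in a MonoBehaviour is another option to try.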
