
PaddleOCR PP-OCRv4 Inference Explained and Deployment (Part 2)

Preface

Continuing from the previous article, PaddleOCR PP-OCRv4 Inference Explained and Deployment (Part 1): there we exported the three ONNX models used by PP-OCRv4. In this article we walk through the pre- and post-processing of those three models to pave the way for our later C++ implementation. If anything is off, corrections are welcome 😄

repo: https://github.com/PaddlePaddle/PaddleOCR

1. Debugging Setup

The code we will analyze and debug is mainly the tools/infer/predict_system.py script in the PaddleOCR project. Before starting, you need to set up the environment; this was covered in Part 1, so I won't repeat it here.

You also need to clone the PaddleOCR project and prepare the exported ONNX models as described in Part 1. You can also click here to download the source code and ONNX model files I prepared (note: the code was downloaded on 2024/7/20; if it has changed since, refer to the latest version).

Now let's debug the tools/infer/predict_system.py file:

(screenshot: debugging tools/infer/predict_system.py in VSCode)

Note: I use VSCode to debug the code; the launch.json file is as follows:

{
    // Use IntelliSense to learn about possible attributes.
    // Hover to view descriptions of existing attributes.
    // For more information, visit: https://go.microsoft.com/fwlink/?linkid=830387
    "version": "0.2.0",
    "configurations": [
        {
            "name": "Python 调试程序: 当前文件",
            "type": "python",
            "request": "launch",
            "program": "${workspaceFolder}/tools/infer/predict_system.py",
            "console": "integratedTerminal",
            "args": [
                "--use_gpu", "False",
                "--use_onnx", "True",
                "--use_angle_cls", "True",
                "--det_model_dir", "./models/det/det.sim.onnx",
                "--rec_model_dir", "./models/rec/rec.sim.onnx",
                "--cls_model_dir", "./models/cls/cls.sim.onnx",
                "--image_dir", "./deploy/lite/imgs/lite_demo.png",
            ]
        }
    ]
}
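
If you prefer the command line over the debugger, the same arguments can be passed directly (a one-line equivalent of the configuration above):

python tools/infer/predict_system.py --use_gpu False --use_onnx True --use_angle_cls True \
    --det_model_dir ./models/det/det.sim.onnx --rec_model_dir ./models/rec/rec.sim.onnx \
    --cls_model_dir ./models/cls/cls.sim.onnx --image_dir ./deploy/lite/imgs/lite_demo.png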

2. Detection Model

Let's now debug and analyze the pre- and post-processing of the detection model.

2.1 Preprocessing

First, the preprocessing; find the entry point of the detection model:

(screenshot: detection model entry point)

Then find the model prediction part:

(screenshot: the model predict call)

In the predict function, the input image goes through a series of preprocessing steps via the transform function:

(screenshot: the transform preprocessing pipeline)

It consists of the following operations:

  • DetResizeForTest
  • NormalizeImage
  • ToCHWImage
  • KeepKeys

KeepKeys does not modify the image, so we focus on the first three operations, one at a time.

(screenshot: the preprocessing operator configuration)

First, the DetResizeForTest part, which actually calls the resize_image_type1 function, whose content is as follows:

(screenshot: resize_image_type1 source)
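
Since the screenshot is not reproduced here, below is a minimal sketch of the resize_image_type1 logic as just described (a reconstruction under that description, not the verbatim PaddleOCR source):

def resize_image_type1(img, image_shape=(960, 960)):
    # directly resize to a fixed target size; the aspect ratio is not preserved
    resize_h, resize_w = image_shape
    ori_h, ori_w = img.shape[:2]
    ratio_h = resize_h / float(ori_h)
    ratio_w = resize_w / float(ori_w)
    img = cv2.resize(img, (int(resize_w), int(resize_h)))
    return img, [ratio_h, ratio_w]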

This is just a resize: the original image is resized directly to 960x960. Next, the NormalizeImage part:

(screenshot: NormalizeImage source)

Here the image is multiplied by scale, then the mean is subtracted and the result divided by the standard deviation. Where does the scale value 0.003921569 come from? It is simply 1/255.0, so this is the familiar normalization: divide by 255.0, subtract the mean, divide by the std.
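
A quick check that the magic constant is just 1/255:

print(f"{1.0 / 255.0:.9f}")   # 0.003921569 -> the scale value in the config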

OK, next the ToCHWImage part:

(screenshot: ToCHWImage source)

This part is simple: a transpose converts H,W,C into C,H,W.

Then, as always, the batch dimension is added, as shown below:

(screenshot: adding the batch dimension)

A quick summary of the whole preprocessing:

  • 1. resize
  • 2. /255.0, subtract mean, divide by std
  • 3. h,w,c -> c,h,w
  • 4. c,h,w -> b,c,h,w

So the preprocessing is straightforward to write; the code is as follows:

def preprocess(self, img, tar_w=960, tar_h=960):
    # 1. resize
    img  = cv2.resize(img, (int(tar_w), int(tar_h)))
    # 2. normalize
    img  = img.astype("float32") / 255.0
    mean = [0.485, 0.456, 0.406]
    std  = [0.229, 0.224, 0.225]
    mean = np.array(mean).reshape(1, 1, 3).astype("float32")
    std  = np.array(std).reshape(1, 1, 3).astype("float32")
    img  = (img - mean) / std
    # 3. to bchw
    img  = img.transpose((2, 0, 1))[None]
    return img

That completes the detection model's preprocessing; now for its postprocessing.

2.2 Postprocessing

Before walking through the postprocessing, let's print the output dimensions after model inference:

(screenshot: printed detection output shape)

You can see the model's output shape is 1x1x960x960: effectively a probability map the same size as the model input. The text detection model is really a segmentation model that predicts a probability for every pixel; the text boxes are only obtained through a series of somewhat tedious postprocessing steps, as mentioned in Part 1.
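
For intuition, the probability map can be binarized and dumped for inspection, mirroring the thresholding done in the postprocessing below (pred is the raw model output):

prob_map = pred[0, 0]                         # (960, 960), values in [0, 1]
mask = (prob_map > 0.3).astype(np.uint8)      # 0.3 is the mask threshold used below
cv2.imwrite("prob_mask.png", mask * 255)      # white = predicted text pixels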

Next, let's see what the postprocessing does:

(screenshot: detection postprocessing entry)

It mainly uses the boxes_from_bitmap function to obtain the boxes and scores from the mask; let's focus on how this function works:

def boxes_from_bitmap(self, pred, _bitmap, dest_width, dest_height):
    """
    _bitmap: single map with shape (1, H, W),
            whose values are binarized as {0, 1}
    """

    bitmap = _bitmap
    height, width = bitmap.shape

    outs = cv2.findContours(
        (bitmap * 255).astype(np.uint8), cv2.RETR_LIST, cv2.CHAIN_APPROX_SIMPLE
    )
    if len(outs) == 3:
        img, contours, _ = outs[0], outs[1], outs[2]
    elif len(outs) == 2:
        contours, _ = outs[0], outs[1]

    num_contours = min(len(contours), self.max_candidates)

    boxes = []
    scores = []
    for index in range(num_contours):
        contour = contours[index]
        points, sside = self.get_mini_boxes(contour)
        if sside < self.min_size:
            continue
        points = np.array(points)
        if self.score_mode == "fast":
            score = self.box_score_fast(pred, points.reshape(-1, 2))
        else:
            score = self.box_score_slow(pred, contour)
        if self.box_thresh > score:
            continue

        box = self.unclip(points, self.unclip_ratio)
        if len(box) > 1:
            continue
        box = np.array(box).reshape(-1, 1, 2)
        box, sside = self.get_mini_boxes(box)
        if sside < self.min_size + 2:
            continue
        box = np.array(box)

        box[:, 0] = np.clip(np.round(box[:, 0] / width * dest_width), 0, dest_width)
        box[:, 1] = np.clip(
            np.round(box[:, 1] / height * dest_height), 0, dest_height
        )
        boxes.append(box.astype("int32"))
        scores.append(score)
    return np.array(boxes, dtype="int32"), scores

The main job of this function is to extract the detected text boxes from the given binarized bitmap _bitmap and to scale those boxes to the given target width and height. A detailed breakdown follows: (from ChatGPT)

Input parameters

  • pred: the prediction, usually the text detection model's output, used to compute each box's score
  • _bitmap: a single-channel binarized image with shape (1,H,W) and values 0 or 1
  • dest_width: the target width after scaling
  • dest_height: the target height after scaling

Step-by-step breakdown

  • Extract contours: use OpenCV's findContours to extract contours from the binarized image. The binary image is multiplied by 255 because OpenCV expects pixel values of 0 or 255. The RETR_LIST flag retrieves all contours; CHAIN_APPROX_SIMPLE compresses horizontal, vertical, and diagonal segments, keeping only their endpoints.
  • Check the contour output: findContours may return 3 values (image, contours, hierarchy) or 2 values (contours, hierarchy); the contours are extracted according to the length of the return value.
  • Process each contour: cap the number of contours processed at self.max_candidates.
  • Generate boxes from contours:
    • Minimum box: call self.get_mini_boxes(contour) to get the minimum bounding box of each contour.
    • Size check: check the box against the minimum size requirement self.min_size.
    • Score: compute each box's score with the fast scoring function box_score_fast.
    • Threshold filter: keep only boxes whose score exceeds self.box_thresh.
    • Box expansion: call self.unclip(points, self.unclip_ratio) to expand the box.
    • Scale coordinates: scale the box coordinates from the bitmap size to the target size dest_width x dest_height, clipping them to the target image range.
    • Store results: append the box and its score to the result lists.

The saved binarized map looks like this:

(screenshot: the saved binarized map)

The get_mini_boxes function is as follows:

def get_mini_boxes(self, contour):
    bounding_box = cv2.minAreaRect(contour)
    points = sorted(list(cv2.boxPoints(bounding_box)), key=lambda x: x[0])

    index_1, index_2, index_3, index_4 = 0, 1, 2, 3
    if points[1][1] > points[0][1]:
        index_1 = 0
        index_4 = 1
    else:
        index_1 = 1
        index_4 = 0
    if points[3][1] > points[2][1]:
        index_2 = 2
        index_3 = 3
    else:
        index_2 = 3
        index_3 = 2

    box = [points[index_1], points[index_2], points[index_3], points[index_4]]
    return box, min(bounding_box[1])

The main job of get_mini_boxes is to generate a minimum-area rectangle from the given contour and sort its corner points in a specific order. Its main functions are:

  • compute the minimum-area rotated rectangle of the given contour
  • sort the rectangle's four corners in left-to-right, top-to-bottom order (top-left, top-right, bottom-right, bottom-left)
  • return the sorted corners and the length of the rectangle's shorter side
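
As a quick sanity check, here is how it behaves on an axis-aligned 10x4 rectangle (self is unused in the standalone version above, so for illustration we pass None):

# a 10x4 axis-aligned rectangle expressed as a contour
contour = np.array([[[0, 0]], [[10, 0]], [[10, 4]], [[0, 4]]], dtype=np.float32)
box, sside = get_mini_boxes(None, contour)   # self is unused; None is fine here
print(sside)   # 4.0 -> the shorter side of the rectangle
print(box)     # approximately [[0,0], [10,0], [10,4], [0,4]] -> tl, tr, br, bl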

The box_score_fast function is as follows:

def box_score_fast(self, bitmap, _box):
    """
    box_score_fast: use bbox mean score as the mean score
    """
    h, w = bitmap.shape[:2]
    box = _box.copy()
    xmin = np.clip(np.floor(box[:, 0].min()).astype("int32"), 0, w - 1)
    xmax = np.clip(np.ceil(box[:, 0].max()).astype("int32"), 0, w - 1)
    ymin = np.clip(np.floor(box[:, 1].min()).astype("int32"), 0, h - 1)
    ymax = np.clip(np.ceil(box[:, 1].max()).astype("int32"), 0, h - 1)

    mask = np.zeros((ymax - ymin + 1, xmax - xmin + 1), dtype=np.uint8)
    box[:, 0] = box[:, 0] - xmin
    box[:, 1] = box[:, 1] - ymin
    cv2.fillPoly(mask, box.reshape(1, -1, 2).astype("int32"), 1)
    return cv2.mean(bitmap[ymin : ymax + 1, xmin : xmax + 1], mask)[0]

The purpose of box_score_fast is to compute a score for the given box, based on the mean of the probability map inside it; in other words, it rates the box by the average score of its region. Its main functions are:

  • compute the box's score from the given map and the box
  • the score is the mean value inside the box, using a mask so only pixels inside the polygon are counted
  • the masked mean is computed with cv2.mean

The unclip function is as follows:

def unclip(self, box, unclip_ratio):
    poly = Polygon(box)
    distance = poly.area * unclip_ratio / poly.length
    offset = pyclipper.PyclipperOffset()
    offset.AddPath(box, pyclipper.JT_ROUND, pyclipper.ET_CLOSEDPOLYGON)
    expanded = offset.Execute(distance)
    return expanded

The unclip function expands the given box by growing its boundary outward. This is common in text detection to make sure the box fully covers the actual text region. Its main functions are:

  • Expand the polygon: grow the input box according to the expansion ratio unclip_ratio.
  • Compute the expansion distance: use the polygon's area and perimeter to compute the offset distance.
  • Perform the expansion: use pyclipper's PyclipperOffset object to do the actual polygon offsetting.
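
A worked example of the offset distance (assuming shapely's Polygon, as imported in the full code later):

# a 100x20 text box with the default unclip_ratio of 1.5:
# area = 2000, perimeter = 240 -> distance = 2000 * 1.5 / 240 = 12.5 px
box = np.array([[0, 0], [100, 0], [100, 20], [0, 20]])
poly = Polygon(box)
print(poly.area * 1.5 / poly.length)   # 12.5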

Finally, we obtain the corresponding text boxes from the binarized image:

(screenshot: the extracted text boxes and their shape)

Their shape is 46x4x2, where:

  • 46 is the number of text boxes
  • 4 is the four corner points of each polygon
  • 2 is x,y
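
An easy way to sanity-check these boxes is to draw them back onto the original image (a small illustration, not part of the original pipeline):

vis = image.copy()                                   # image: the original BGR image
for box in boxes:                                    # boxes: (N, 4, 2), int32
    cv2.polylines(vis, [box.reshape(-1, 1, 2)], True, (0, 0, 255), 2)
cv2.imwrite("det_boxes.jpg", vis)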

Next, the obtained text boxes are filtered:

(screenshot: the filter_tag_det_res call)

The filter_tag_det_res function filters and cleans the detected text boxes so only valid ones are kept. Each box goes through point ordering, boundary clipping, and size filtering. Its main functions are:

  • Ordering: order_points_clockwise puts each box's four points in clockwise order.
  • Clipping: clip_det_res clips box coordinates to the image boundary so they cannot go out of range.
  • Small-box filtering: compute each box's width and height and drop boxes whose size is 3 pixels or less, keeping only valid text boxes.

This is really the end of the detection model's postprocessing; the only extra step is that the boxes are sorted before they are used to crop the image:

(screenshot: the sorted_boxes call site)

We fold this sorting into the postprocessing as well. The sorting function sorted_boxes is as follows:

(screenshot: sorted_boxes source; see _sorted_boxes in the code below)

The sorted_boxes function sorts the detected text boxes from top to bottom and left to right. Its main functions are:

  • Initial sort: sort the boxes by the y coordinate (top to bottom), then the x coordinate (left to right), of each box's first point.
  • Second pass: when y coordinates are close, reorder by x so boxes on the same horizontal line end up in the correct order.
  • Return the sorted list of text boxes.
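
A small example of why the second pass matters (the numbers are made up for illustration):

# two boxes on the same text line, the right one sitting slightly higher
boxes = np.array([
    [[120, 12], [200, 12], [200, 30], [120, 30]],   # right box, first point y = 12
    [[10,  15], [90,  15], [90,  33], [10,  33]],   # left box,  first point y = 15
])
# the initial (y, x) sort puts the right box first because 12 < 15; the bubble
# pass then swaps them since |15 - 12| < 10 and 10 < 120, restoring the
# left-to-right order on the same line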

That completes the detection model's postprocessing. A brief summary of what it does:

  • boxes_from_bitmap: extract the detected text boxes from the binarized map _bitmap and scale them
  • filter_tag_det_res: filter the detected text boxes
  • sorted_boxes: sort the filtered text boxes

So the postprocessing is straightforward to write; the code is as follows:

def postprocess(self, pred, src_h, src_w):
    # pred->1x1x960x960
    pred = pred[0, 0, :, :]
    mask = pred > self.mask_thresh
    boxes, _ = self._boxes_from_bitmap(pred, mask, src_w, src_h)
    boxes = self._filter_boxes(boxes, src_h, src_w)
    boxes = self._sorted_boxes(boxes)
    return boxes

def _boxes_from_bitmap(self, pred, bitmap, dest_width, dest_height):
    """
    bitmap: single map with shape (H, W),
            whose values are binarized as {0, 1}
    """
    
    height, width = bitmap.shape
    
    # bitmap_image = (bitmap * 255).astype(np.uint8)
    # cv2.imwrite("bitmap_image.jpg", bitmap_image)

    outs = cv2.findContours((bitmap * 255).astype(np.uint8), cv2.RETR_LIST, cv2.CHAIN_APPROX_SIMPLE)
    if len(outs) == 3:
        _, contours, _ = outs[0], outs[1], outs[2]  # opencv3.x
    elif len(outs) == 2:
        contours, _ = outs[0], outs[1]  # opencv4.x
    
    num_contours = min(len(contours), self.max_candidates)
    
    # contour_image = cv2.cvtColor(bitmap_image, cv2.COLOR_GRAY2BGR)
    # for contour in contours:
    #     cv2.drawContours(contour_image, [contour], -1, (0, 0, 255), 2)
    # cv2.imwrite('contour_image.jpg', contour_image)

    boxes  = []
    scores = []
    for index in range(num_contours):
        contour = contours[index]
        points, sside = self._get_mini_boxes(contour)
        if sside < self.min_size:
            continue
        points = np.array(points)
        score  = self._box_score(pred, points.reshape(-1, 2))
        if score < self.box_thresh:
            continue
        
        box = self._unclip(points, self.unclip_ratio)
        if len(box) > 1:
            continue
        box = np.array(box).reshape(-1, 1, 2)
        box, sside = self._get_mini_boxes(box)
        if sside < self.min_size + 2:
            continue
        box = np.array(box)

        box[:, 0] = np.clip(np.round(box[:, 0] / width  * dest_width), 0, dest_width)
        box[:, 1] = np.clip(np.round(box[:, 1] / height * dest_height), 0, dest_height)
        boxes.append(box.astype("int32"))
        scores.append(score)
    return np.array(boxes, dtype="int32"), scores

def _get_mini_boxes(self, contour):
    # [[center_x, center_y], [width, height], angle]
    bounding_box = cv2.minAreaRect(contour)
    points = sorted(list(cv2.boxPoints(bounding_box)), key=lambda x: x[0])

    index_1, index_2, index_3, index_4 = 0, 1, 2, 3
    if points[1][1] > points[0][1]:
        index_1 = 0
        index_4 = 1
    else:
        index_1 = 1
        index_4 = 0
    if points[3][1] > points[2][1]:
        index_2 = 2
        index_3 = 3
    else:
        index_2 = 3
        index_3 = 2
    
    box = [points[index_1], points[index_2], points[index_3], points[index_4]]
    return box, min(bounding_box[1])

def _box_score(self, bitmap, _box):
    """
    box_score: use bbox mean score as the mean score
    """
    h, w = bitmap.shape[:2]
    box  = _box.copy()
    xmin = np.clip(np.floor(box[:, 0].min()).astype("int32"), 0, w - 1)
    xmax = np.clip(np.ceil(box[:, 0].max()).astype("int32"), 0, w - 1)
    ymin = np.clip(np.floor(box[:, 1].min()).astype("int32"), 0, h - 1)
    ymax = np.clip(np.ceil(box[:, 1].max()).astype("int32"), 0, h - 1)

    mask = np.zeros((ymax - ymin + 1, xmax - xmin + 1), dtype=np.uint8)
    box[:, 0] = box[:, 0] - xmin
    box[:, 1] = box[:, 1] - ymin
    cv2.fillPoly(mask, box.reshape(1, -1, 2).astype("int32"), 1)
    return cv2.mean(bitmap[ymin : ymax + 1, xmin : xmax + 1], mask)[0]

def _unclip(self, box, unclip_ratio):
    poly = Polygon(box)
    distance = poly.area * unclip_ratio / poly.length
    offset = pyclipper.PyclipperOffset()
    offset.AddPath(box, pyclipper.JT_ROUND, pyclipper.ET_CLOSEDPOLYGON)
    expanded = offset.Execute(distance)
    return expanded

def _filter_boxes(self, boxes, src_h, src_w):
    boxes_filter = []
    for box in boxes:
        box = self._order_points_clockwise(box)
        box = self._clip(box, src_h, src_w)
        rect_width  = int(np.linalg.norm(box[0] - box[1]))
        rect_height = int(np.linalg.norm(box[0] - box[3]))
        if rect_width <= 3 or rect_height <= 3:
            continue
        boxes_filter.append(box)
    return np.array(boxes_filter)
        
def _order_points_clockwise(self, pts):
    rect = np.zeros((4, 2), dtype="float32")
    s = pts.sum(axis=1)
    rect[0] = pts[np.argmin(s)]
    rect[2] = pts[np.argmax(s)]
    tmp  = np.delete(pts, (np.argmin(s), np.argmax(s)), axis=0)
    diff = np.diff(np.array(tmp), axis=1)
    rect[1] = tmp[np.argmin(diff)]
    rect[3] = tmp[np.argmax(diff)]
    return rect

def _clip(self, points, img_height, img_width):
    for idx in range(points.shape[0]):
        points[idx, 0] = int(min(max(points[idx, 0], 0), img_width - 1))
        points[idx, 1] = int(min(max(points[idx, 1], 0), img_height - 1))
    return points

def _sorted_boxes(self, boxes):
    """
    Sort text boxes in order from top to bottom, left to right
    """
    num_boxes = boxes.shape[0]
    boxes_sorted = sorted(boxes, key=lambda x: (x[0][1], x[0][0]))
    _boxes = list(boxes_sorted)

    for i in range(num_boxes - 1):
        for j in range(i, -1, -1):
            if abs(_boxes[j + 1][0][1] - _boxes[j][0][1]) < 10 and (_boxes[j + 1][0][0] < _boxes[j][0][0]):
                tmp = _boxes[j]
                _boxes[j] = _boxes[j + 1]
                _boxes[j + 1] = tmp
            else:
                break
    return _boxes

The core is just three functions, but with so many helpers the postprocessing looks tedious overall; work through it patiently and it falls into place.

That completes the detection model's postprocessing; now for the inference.

2.3 Inference

With the detection model's pre- and post-processing analyzed above, the full inference process follows naturally:

import cv2
import math
import copy
import random
import numpy as np
import onnxruntime as ort
from PIL import Image, ImageFont, ImageDraw
import pyclipper
from shapely.geometry import Polygon

class TextDetector(object):
    def __init__(self, model_path, mask_thresh=0.3, box_thresh=0.6, 
                 max_candidates=1000, min_size=3, unclip_ratio=1.5) -> None:
        self.predictor      = ort.InferenceSession(model_path, providers=["CPUExecutionProvider"])
        self.mask_thresh    = mask_thresh
        self.box_thresh     = box_thresh
        self.max_candidates = max_candidates
        self.min_size       = min_size
        self.unclip_ratio   = unclip_ratio

    def preprocess(self, img, tar_w=960, tar_h=960):
        # 1. resize
        img  = cv2.resize(img, (int(tar_w), int(tar_h)))
        # 2. normalize
        img  = img.astype("float32") / 255.0
        mean = [0.485, 0.456, 0.406]
        std  = [0.229, 0.224, 0.225]
        mean = np.array(mean).reshape(1, 1, 3).astype("float32")
        std  = np.array(std).reshape(1, 1, 3).astype("float32")
        img  = (img - mean) / std
        # 3. to bchw
        img  = img.transpose((2, 0, 1))[None]
        return img
    
    def forward(self, input):
        # input->1x3x960x960
        output = self.predictor.run(None, {"images": input})[0]
        return output
    
    def postprocess(self, pred, src_h, src_w):
        # pred->1x1x960x960
        pred = pred[0, 0, :, :]
        mask = pred > self.mask_thresh
        boxes, _ = self._boxes_from_bitmap(pred, mask, src_w, src_h)
        boxes = self._filter_boxes(boxes, src_h, src_w)
        boxes = self._sorted_boxes(boxes)
        return boxes

    def _boxes_from_bitmap(self, pred, bitmap, dest_width, dest_height):
        """
        bitmap: single map with shape (H, W),
                whose values are binarized as {0, 1}
        """
        
        height, width = bitmap.shape
        
        # bitmap_image = (bitmap * 255).astype(np.uint8)
        # cv2.imwrite("bitmap_image.jpg", bitmap_image)

        outs = cv2.findContours((bitmap * 255).astype(np.uint8), cv2.RETR_LIST, cv2.CHAIN_APPROX_SIMPLE)
        if len(outs) == 3:
            _, contours, _ = outs[0], outs[1], outs[2]  # opencv3.x
        elif len(outs) == 2:
            contours, _ = outs[0], outs[1]  # opencv4.x
        
        num_contours = min(len(contours), self.max_candidates)
        
        # contour_image = cv2.cvtColor(bitmap_image, cv2.COLOR_GRAY2BGR)
        # for contour in contours:
        #     cv2.drawContours(contour_image, [contour], -1, (0, 0, 255), 2)
        # cv2.imwrite('contour_image.jpg', contour_image)

        boxes  = []
        scores = []
        for index in range(num_contours):
            contour = contours[index]
            points, sside = self._get_mini_boxes(contour)
            if sside < self.min_size:
                continue
            points = np.array(points)
            score  = self._box_score(pred, points.reshape(-1, 2))
            if score < self.box_thresh:
                continue
            
            box = self._unclip(points, self.unclip_ratio)
            if len(box) > 1:
                continue
            box = np.array(box).reshape(-1, 1, 2)
            box, sside = self._get_mini_boxes(box)
            if sside < self.min_size + 2:
                continue
            box = np.array(box)

            box[:, 0] = np.clip(np.round(box[:, 0] / width  * dest_width), 0, dest_width)
            box[:, 1] = np.clip(np.round(box[:, 1] / height * dest_height), 0, dest_height)
            boxes.append(box.astype("int32"))
            scores.append(score)
        return np.array(boxes, dtype="int32"), scores

    def _get_mini_boxes(self, contour):
        # [[center_x, center_y], [width, height], angle]
        bounding_box = cv2.minAreaRect(contour)
        points = sorted(list(cv2.boxPoints(bounding_box)), key=lambda x: x[0])

        index_1, index_2, index_3, index_4 = 0, 1, 2, 3
        if points[1][1] > points[0][1]:
            index_1 = 0
            index_4 = 1
        else:
            index_1 = 1
            index_4 = 0
        if points[3][1] > points[2][1]:
            index_2 = 2
            index_3 = 3
        else:
            index_2 = 3
            index_3 = 2
        
        box = [points[index_1], points[index_2], points[index_3], points[index_4]]
        return box, min(bounding_box[1])

    def _box_score(self, bitmap, _box):
        """
        box_score: use bbox mean score as the mean score
        """
        h, w = bitmap.shape[:2]
        box  = _box.copy()
        xmin = np.clip(np.floor(box[:, 0].min()).astype("int32"), 0, w - 1)
        xmax = np.clip(np.ceil(box[:, 0].max()).astype("int32"), 0, w - 1)
        ymin = np.clip(np.floor(box[:, 1].min()).astype("int32"), 0, h - 1)
        ymax = np.clip(np.ceil(box[:, 1].max()).astype("int32"), 0, h - 1)

        mask = np.zeros((ymax - ymin + 1, xmax - xmin + 1), dtype=np.uint8)
        box[:, 0] = box[:, 0] - xmin
        box[:, 1] = box[:, 1] - ymin
        cv2.fillPoly(mask, box.reshape(1, -1, 2).astype("int32"), 1)
        return cv2.mean(bitmap[ymin : ymax + 1, xmin : xmax + 1], mask)[0]

    def _unclip(self, box, unclip_ratio):
        poly = Polygon(box)
        distance = poly.area * unclip_ratio / poly.length
        offset = pyclipper.PyclipperOffset()
        offset.AddPath(box, pyclipper.JT_ROUND, pyclipper.ET_CLOSEDPOLYGON)
        expanded = offset.Execute(distance)
        return expanded

    def _filter_boxes(self, boxes, src_h, src_w):
        boxes_filter = []
        for box in boxes:
            box = self._order_points_clockwise(box)
            box = self._clip(box, src_h, src_w)
            rect_width  = int(np.linalg.norm(box[0] - box[1]))
            rect_height = int(np.linalg.norm(box[0] - box[3]))
            if rect_width <= 3 or rect_height <= 3:
                continue
            boxes_filter.append(box)
        return np.array(boxes_filter)
            
    def _order_points_clockwise(self, pts):
        rect = np.zeros((4, 2), dtype="float32")
        s = pts.sum(axis=1)
        rect[0] = pts[np.argmin(s)]
        rect[2] = pts[np.argmax(s)]
        tmp  = np.delete(pts, (np.argmin(s), np.argmax(s)), axis=0)
        diff = np.diff(np.array(tmp), axis=1)
        rect[1] = tmp[np.argmin(diff)]
        rect[3] = tmp[np.argmax(diff)]
        return rect

    def _clip(self, points, img_height, img_width):
        for idx in range(points.shape[0]):
            points[idx, 0] = int(min(max(points[idx, 0], 0), img_width - 1))
            points[idx, 1] = int(min(max(points[idx, 1], 0), img_height - 1))
        return points

    def _sorted_boxes(self, boxes):
        """
        Sort text boxes in order from top to bottom, left to right
        """
        num_boxes = boxes.shape[0]
        boxes_sorted = sorted(boxes, key=lambda x: (x[0][1], x[0][0]))
        _boxes = list(boxes_sorted)

        for i in range(num_boxes - 1):
            for j in range(i, -1, -1):
                if abs(_boxes[j + 1][0][1] - _boxes[j][0][1]) < 10 and (_boxes[j + 1][0][0] < _boxes[j][0][0]):
                    tmp = _boxes[j]
                    _boxes[j] = _boxes[j + 1]
                    _boxes[j + 1] = tmp
                else:
                    break
        return _boxes

The detection model's inference consists of three parts: image preprocessing, model inference, and postprocessing of the predictions. Preprocessing is mainly resize and normalization; postprocessing is mainly text-box extraction, filtering, and sorting.

3. Direction Classifier Model

Next we debug and analyze the pre- and post-processing of the direction classifier model.

Note: this module is optional; if your detected text is never rotated you can remove it and feed the text-detection crops directly into the recognition model.

3.1 Preprocessing

The detection model gave us sorted text boxes; the first step here is to crop the image according to those boxes:

(screenshot: cropping images with the text boxes)

The get_rotate_crop_image function is as follows:

def get_rotate_crop_image(img, points):
    """
    img_height, img_width = img.shape[0:2]
    left = int(np.min(points[:, 0]))
    right = int(np.max(points[:, 0]))
    top = int(np.min(points[:, 1]))
    bottom = int(np.max(points[:, 1]))
    img_crop = img[top:bottom, left:right, :].copy()
    points[:, 0] = points[:, 0] - left
    points[:, 1] = points[:, 1] - top
    """
    assert len(points) == 4, "shape of points must be 4*2"
    img_crop_width = int(
        max(
            np.linalg.norm(points[0] - points[1]), np.linalg.norm(points[2] - points[3])
        )
    )
    img_crop_height = int(
        max(
            np.linalg.norm(points[0] - points[3]), np.linalg.norm(points[1] - points[2])
        )
    )
    pts_std = np.float32(
        [
            [0, 0],
            [img_crop_width, 0],
            [img_crop_width, img_crop_height],
            [0, img_crop_height],
        ]
    )
    M = cv2.getPerspectiveTransform(points, pts_std)
    dst_img = cv2.warpPerspective(
        img,
        M,
        (img_crop_width, img_crop_height),
        borderMode=cv2.BORDER_REPLICATE,
        flags=cv2.INTER_CUBIC,
    )
    dst_img_height, dst_img_width = dst_img.shape[0:2]
    if dst_img_height * 1.0 / dst_img_width >= 1.5:
        dst_img = np.rot90(dst_img)
    return dst_img

The main job of get_rotate_crop_image is to crop a rotated rectangular region out of the given image and return the cropped result. This is very useful in text detection and image processing, especially when text boxes are not always axis-aligned. Its main functions are:

  • Compute the crop size: derive the crop width and height from the input points.
  • Perspective transform: warp the specified quadrilateral region of the image into a standard rectangle, handling rotation and perspective.
  • Orientation adjustment: if the cropped image's height-to-width ratio is too large, rotate it so it reads as expected.
  • Return the cropped, adjusted image so the region displays correctly for downstream processing.

After obtaining the crops, find the entry point of the direction classifier model:

(screenshot: direction classifier entry point)

Then find the preprocessing part:

(screenshot: classifier preprocessing)

The resize_norm_img function there is as follows:

(screenshot: resize_norm_img source)

It mainly includes the following operations:

  • 1. resize
  • 2. /255.0, normalizing pixel values to [0,1]
  • 3. subtract the mean (0.5) and divide by the std (0.5), mapping pixel values to [-1,1]
  • 4. padding

The resize here differs slightly from our earlier resize and warpAffine operations: the height is first scaled to the target size, then the width is scaled by the same ratio; if the scaled width would exceed the target width it is clamped to the target width, otherwise the image is scaled normally and the remaining width is zero-padded, with no centering.

Two comparison images make this clear:

(comparison screenshots: a padded crop vs. a fully scaled crop)

The goal is to scale each text crop to 48x192, so both images are 48x192; in the first the width scales normally and the remainder is zero-padded, while in the second the width would exceed the target, so it is scaled straight to 192. For example, a 50x100 crop (ratio 2) is resized to 48x96 with columns 96 to 191 zero-padded, while a 100x400 crop (ratio 4) is resized exactly to 48x192.

So the preprocessing is straightforward to write; the code is as follows:

def _resize_norm_img(self, img, dest_width, dest_height):
    h, w, _ = img.shape
    ratio = w / float(h)
    if math.ceil(dest_height * ratio) > dest_width:
        resized_w = dest_width
    else:
        resized_w = int(math.ceil(dest_height * ratio))
    resized_image = cv2.resize(img, (resized_w, dest_height))
    resized_image = resized_image.astype("float32")
    resized_image = resized_image.transpose((2, 0, 1)) / 255.0
    resized_image -= 0.5
    resized_image /= 0.5
    padding_im = np.zeros((3, dest_height, dest_width), dtype=np.float32)
    padding_im[:, :, 0:resized_w] = resized_image
    return padding_im

That completes the classifier's preprocessing; now for its postprocessing.

3.2 Postprocessing

Before walking through the postprocessing, let's print the output dimensions after model inference:

(screenshot: printed classifier output shape)

The output dimension is 6x2: 6 is the batch dimension, and 2 holds the probabilities of 0° and 180°. As mentioned in Part 1, the direction classifier in PP-OCRv4 only supports 0° and 180°, hence the dimension of 2.

Next, let's see what the postprocessing does:

(screenshot: classifier postprocessing)

It decides whether to rotate each image based on the predicted label and score: if the label contains "180" and the score exceeds the threshold, the image is rotated 180 degrees (cv2.rotate(img, 1) means cv2.ROTATE_180, not a 90-degree rotation).
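
The literal 1 passed to cv2.rotate in the code below is simply the value of the cv2.ROTATE_180 flag; spelled out, the rotation looks like this:

# crop: any BGR crop, e.g. one element of img_list
assert cv2.ROTATE_180 == 1
rotated = cv2.rotate(crop, cv2.ROTATE_180)   # identical to cv2.rotate(crop, 1)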

So the postprocessing is straightforward to write; the code is as follows:

def postprocess(self, img_list, imgs_pre_batch, indices):
    cls_res = [["", 0.0]] * len(img_list)
    for batch in range(len(imgs_pre_batch)):
        # infer
        cls_pred = self.forward(imgs_pre_batch[batch])
        # cls_pred->bx2
        pred_idxs = cls_pred.argmax(axis=1)
        label_list = ["0", "180"]
        cls_result = [(label_list[idx], cls_pred[i, idx]) for i, idx in enumerate(pred_idxs)]
        for i in range(len(cls_result)):
            label, score = cls_result[i]
            cls_res[indices[batch * self.cls_batch_num + i]] = [label, score]
            if "180" in label and score > self.cls_thresh:
                img_list[indices[batch * self.cls_batch_num + i]] = cv2.rotate(
                    img_list[indices[batch * self.cls_batch_num + i]], 1)
    return img_list, cls_res

That completes the classifier's postprocessing; now for the inference.

3.3 Inference

With the classifier's pre- and post-processing analyzed above, the full inference process follows naturally:

class TextClassifier(object):
    def __init__(self, model_path, cls_thresh=0.9, cls_batch_num=6) -> None:
        self.predictor     = ort.InferenceSession(model_path, providers=["CPUExecutionProvider"])
        self.cls_thresh    = cls_thresh
        self.cls_batch_num = cls_batch_num
    
    def preprocess(self, img, boxes, tar_w=192, tar_h=48):
        img_crop_list = []
        for box in boxes:
            tmp_box  = copy.deepcopy(box)
            img_crop = self._get_rotate_crop_image(img, tmp_box)
            img_crop_list.append(img_crop)
        img_num = len(img_crop_list)
        ratio_list = [img.shape[1] / float(img.shape[0]) for img in img_crop_list]
        indices = np.argsort(np.array(ratio_list))

        imgs_pre_batch = []
        for beg_img_idx in range(0, img_num, self.cls_batch_num):
            end_img_idx = min(img_num, beg_img_idx + self.cls_batch_num)
            norm_img_batch = []
            for idx in range(beg_img_idx, end_img_idx):
                norm_img = self._resize_norm_img(img_crop_list[indices[idx]], tar_w, tar_h)
                norm_img = norm_img[None]
                norm_img_batch.append(norm_img)
            norm_img_batch = np.concatenate(norm_img_batch)
            imgs_pre_batch.append(norm_img_batch)

        return img_crop_list, imgs_pre_batch, indices

    def forward(self, inputs):
        # inputs->bx3x48x192
        output = self.predictor.run(None, {"images": inputs})[0]
        return output

    def postprocess(self, img_list, imgs_pre_batch, indices):
        cls_res = [["", 0.0]] * len(img_list)
        for batch in range(len(imgs_pre_batch)):
            # infer
            cls_pred = self.forward(imgs_pre_batch[batch])
            # cls_pred->bx2
            pred_idxs = cls_pred.argmax(axis=1)
            label_list = ["0", "180"]
            cls_result = [(label_list[idx], cls_pred[i, idx]) for i, idx in enumerate(pred_idxs)]
            for i in range(len(cls_result)):
                label, score = cls_result[i]
                cls_res[indices[batch * self.cls_batch_num + i]] = [label, score]
                if "180" in label and score > self.cls_thresh:
                    img_list[indices[batch * self.cls_batch_num + i]] = cv2.rotate(
                        img_list[indices[batch * self.cls_batch_num + i]], 1)
        return img_list, cls_res
    
    def _get_rotate_crop_image(self, img, points):
        img_crop_width  = int(max(np.linalg.norm(points[0] - points[1]), np.linalg.norm(points[2] - points[3])))
        img_crop_height = int(max(np.linalg.norm(points[0] - points[3]), np.linalg.norm(points[1] - points[2])))
        pts_std = np.float32(
            [
                [0, 0],
                [img_crop_width, 0],
                [img_crop_width, img_crop_height],
                [0, img_crop_height]
            ]
        )
        M = cv2.getPerspectiveTransform(points, pts_std)
        dst_img = cv2.warpPerspective(
            img,
            M,
            (img_crop_width, img_crop_height),
            borderMode=cv2.BORDER_REPLICATE,
            flags=cv2.INTER_CUBIC
        )
        dst_img_height, dst_img_width = dst_img.shape[:2]
        if (dst_img_height * 1.0 / dst_img_width) >= 1.5:
            dst_img = np.rot90(dst_img)
        return dst_img

    def _resize_norm_img(self, img, dest_width, dest_height):
        h, w, _ = img.shape
        ratio = w / float(h)
        if math.ceil(dest_height * ratio) > dest_width:
            resized_w = dest_width
        else:
            resized_w = int(math.ceil(dest_height * ratio))
        resized_image = cv2.resize(img, (resized_w, dest_height))
        resized_image = resized_image.astype("float32")
        resized_image = resized_image.transpose((2, 0, 1)) / 255.0
        resized_image -= 0.5
        resized_image /= 0.5
        padding_im = np.zeros((3, dest_height, dest_width), dtype=np.float32)
        padding_im[:, :, 0:resized_w] = resized_image
        return padding_im

The classifier's inference consists of three parts: image preprocessing, model inference, and postprocessing. Preprocessing is mainly resize and normalization; postprocessing mainly rotates images according to the predicted label.

4. Recognition Model

Finally, we debug and analyze the pre- and post-processing of the recognition model.

4.1 Preprocessing

First, the preprocessing; find the entry point of the recognition model:

(screenshot: recognition model entry point)

Then find the model preprocessing part:

(screenshot: recognition preprocessing)

You may notice that the recognition model's preprocessing is identical to the classifier's. It mainly includes the following operations:

  • 1. resize
  • 2. /255.0, normalizing pixel values to [0,1]
  • 3. subtract the mean (0.5) and divide by the std (0.5), mapping pixel values to [-1,1]
  • 4. padding

So the preprocessing is straightforward to write; the code is as follows:

def _resize_norm_img(self, img, dest_width, dest_height):
    h, w, _ = img.shape
    ratio = w / float(h)
    if math.ceil(dest_height * ratio) > dest_width:
        resized_w = dest_width
    else:
        resized_w = int(math.ceil(dest_height * ratio))
    resized_image = cv2.resize(img, (resized_w, dest_height))
    resized_image = resized_image.astype("float32")
    resized_image = resized_image.transpose((2, 0, 1)) / 255.0
    resized_image -= 0.5
    resized_image /= 0.5
    padding_im = np.zeros((3, dest_height, dest_width), dtype=np.float32)
    padding_im[:, :, 0:resized_w] = resized_image
    return padding_im

That completes the recognition model's preprocessing; now for its postprocessing.

4.2 Postprocessing

Before walking through the postprocessing, let's print the output dimensions after model inference:

(screenshot: printed recognition output shape)

The output dimension is 6x80x6625, which means:

  • 6: the batch size; the model processes 6 images at a time
  • 80: the maximum number of character positions per image; it is tied to the model's input width (currently 640; an input width of 320 would give 40)
  • 6625: the classification output at each sequence position, corresponding to the character set size plus a few extra tokens
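
A quick way to confirm these dimensions is to run the ONNX model on a dummy batch (paths and the input name "images" follow this article's setup):

import numpy as np
import onnxruntime as ort

sess = ort.InferenceSession("models/rec/rec.sim.onnx", providers=["CPUExecutionProvider"])
dummy = np.zeros((6, 3, 48, 640), dtype=np.float32)
print(sess.run(None, {"images": dummy})[0].shape)   # expected: (6, 80, 6625)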

(screenshot: building the character dictionary)

From the screenshot we can clearly see that the original character set has 6623 characters; a 'blank' character is prepended and a ' ' (space) character is appended, giving a total of 6625, which matches the recognition model's output dimension.

Next, let's see what the postprocessing does:

(screenshot: recognition postprocessing)

It mainly decodes the predictions with the decode function; let's focus on its implementation:

def decode(
    self,
    text_index,
    text_prob=None,
    is_remove_duplicate=False,
    return_word_box=False,
):
    """convert text-index into text-label."""
    result_list = []
    ignored_tokens = self.get_ignored_tokens()
    batch_size = len(text_index)
    for batch_idx in range(batch_size):
        selection = np.ones(len(text_index[batch_idx]), dtype=bool)
        if is_remove_duplicate:
            selection[1:] = text_index[batch_idx][1:] != text_index[batch_idx][:-1]
        for ignored_token in ignored_tokens:
            selection &= text_index[batch_idx] != ignored_token

        char_list = [
            self.character[text_id] for text_id in text_index[batch_idx][selection]
        ]
        if text_prob is not None:
            conf_list = text_prob[batch_idx][selection]
        else:
            conf_list = [1] * len(selection)
        if len(conf_list) == 0:
            conf_list = [0]

        text = "".join(char_list)

        if self.reverse:  # for arabic rec
            text = self.pred_reverse(text)

        if return_word_box:
            word_list, word_col_list, state_list = self.get_word_info(
                text, selection
            )
            result_list.append(
                (
                    text,
                    np.mean(conf_list).tolist(),
                    [
                        len(text_index[batch_idx]),
                        word_list,
                        word_col_list,
                        state_list,
                    ],
                )
            )
        else:
            result_list.append((text, np.mean(conf_list).tolist()))
    return result_list

The main purpose of this decode function is to convert the text recognition model's output (indices and probabilities) into readable text labels. Its main functions are:

  • Character indices: build the text string from the model's character indices, removing repeated characters and ignored tokens.
  • Confidence handling: collect each character's confidence score, falling back to a default value if no probabilities are given.
  • Return the decoded results: a list of (text, confidence) pairs.
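
The dedup-then-drop-blank mechanics in miniature (character ids 5 and 7 are made up for illustration):

idx = np.array([0, 5, 5, 0, 7, 0])      # 0 is the CTC blank token
sel = np.ones(len(idx), dtype=bool)
sel[1:] = idx[1:] != idx[:-1]           # drop repeated indices
sel &= idx != 0                         # drop the blank token
print(idx[sel])                         # [5 7] -> two decoded characters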

The returned results look like this:

(screenshot: decoded results)

So the postprocessing is straightforward to write; the code is as follows:

def postprocess(self, imgs_pre_batch, indices):
    rec_res = [["", 0.0]] * len(indices)
    for batch in range(len(imgs_pre_batch)):
        # infer
        rec_pred = self.forward(imgs_pre_batch[batch])
        # rec_pred->bx80x6625
        preds_idx  = rec_pred.argmax(axis=2)
        preds_prob = rec_pred.max(axis=2)
        text = self._decode(preds_idx, preds_prob)
        for i in range(len(text)):
            rec_res[indices[batch * self.rec_batch_num + i]] = text[i]
    return rec_res

def _decode(self, text_index, text_prob):
    "convert text-index into text-label"
    result_list = []
    batch_size = len(text_index)
    for batch_idx in range(batch_size):
        selection = np.ones(len(text_index[batch_idx]), dtype=bool)
        selection[1:] = text_index[batch_idx][1:] != text_index[batch_idx][:-1]
        selection &= text_index[batch_idx] != 0
        char_list = [self.character_str[text_id] for text_id in text_index[batch_idx][selection]]
        conf_list = text_prob[batch_idx][selection]
        if len(conf_list) == 0:
            conf_list = [0]
        text = "".join(char_list)
        result_list.append((text, np.mean(conf_list).tolist()))
    return result_list

That completes the recognition model's postprocessing; now for the inference.

4.3 Inference

With the recognition model's pre- and post-processing analyzed above, the full inference process follows naturally:

class TextRecognizer(object):
    def __init__(self, model_path, character_dict_path, rec_batch_num=6) -> None:
        self.predictor     = ort.InferenceSession(model_path, providers=["CPUExecutionProvider"])
        self.character_str = []
        self.rec_batch_num = rec_batch_num

        with open(character_dict_path, "rb") as fin:
            lines = fin.readlines()
            for line in lines:
                line = line.decode("utf-8").strip("\n").strip("\r\n")
                self.character_str.append(line)
        self.character_str.append(" ")
        self.character_str = ["blank"] + self.character_str
        self.dict = {}
        for i, char in enumerate(self.character_str):
            self.dict[char] = i
    
    def preprocess(self, img_list, tar_w=640, tar_h=48):
        # for img in img_list:
        #     plt.imshow(cv2.cvtColor(img, cv2.COLOR_BGR2RGB))
        #     plt.show()
        img_num = len(img_list)
        ratio_list = [img.shape[1] / float(img.shape[0]) for img in img_list]
        indices = np.argsort(np.array(ratio_list))
        
        imgs_pre_batch = []
        for beg_img_idx in range(0, img_num, self.rec_batch_num):
            end_img_idx = min(img_num, beg_img_idx + self.rec_batch_num)
            norm_img_batch = []
            for idx in range(beg_img_idx, end_img_idx):
                norm_img = self._resize_norm_img(img_list[indices[idx]], tar_w, tar_h)

                # processed_img = norm_img.transpose(1, 2, 0)
                # processed_img = (processed_img * 0.5 + 0.5) * 255
                # processed_img = processed_img.astype(np.uint8)
                # plt.imshow(cv2.cvtColor(processed_img, cv2.COLOR_BGR2RGB))
                # plt.show()

                norm_img = norm_img[None]
                norm_img_batch.append(norm_img)
            norm_img_batch = np.concatenate(norm_img_batch)
            imgs_pre_batch.append(norm_img_batch)
        
        return imgs_pre_batch, indices

    def forward(self, inputs):
        # inputs->bx3x48x640
        output = self.predictor.run(None, {"images": inputs})[0]
        return output        

    def postprocess(self, imgs_pre_batch, indices):
        rec_res = [["", 0.0]] * len(indices)
        for batch in range(len(imgs_pre_batch)):
            # infer
            rec_pred = self.forward(imgs_pre_batch[batch])
            # rec_pred->bx80x6625
            preds_idx  = rec_pred.argmax(axis=2)
            preds_prob = rec_pred.max(axis=2)
            text = self._decode(preds_idx, preds_prob)
            for i in range(len(text)):
                rec_res[indices[batch * self.rec_batch_num + i]] = text[i]
        return rec_res

    def _resize_norm_img(self, img, dest_width, dest_height):
        h, w, _ = img.shape
        ratio = w / float(h)
        if math.ceil(dest_height * ratio) > dest_width:
            resized_w = dest_width
        else:
            resized_w = int(math.ceil(dest_height * ratio))
        resized_image = cv2.resize(img, (resized_w, dest_height))
        resized_image = resized_image.astype("float32")
        resized_image = resized_image.transpose((2, 0, 1)) / 255.0
        resized_image -= 0.5
        resized_image /= 0.5
        padding_im = np.zeros((3, dest_height, dest_width), dtype=np.float32)
        padding_im[:, :, 0:resized_w] = resized_image
        return padding_im

    def _decode(self, text_index, text_prob):
        "convert text-index into text-label"
        result_list = []
        batch_size = len(text_index)
        for batch_idx in range(batch_size):
            selection = np.ones(len(text_index[batch_idx]), dtype=bool)
            selection[1:] = text_index[batch_idx][1:] != text_index[batch_idx][:-1]
            selection &= text_index[batch_idx] != 0
            char_list = [self.character_str[text_id] for text_id in text_index[batch_idx][selection]]
            conf_list = text_prob[batch_idx][selection]
            if len(conf_list) == 0:
                conf_list = [0]
            text = "".join(char_list)
            result_list.append((text, np.mean(conf_list).tolist()))
        return result_list

The recognition model's inference consists of three parts: image preprocessing, model inference, and postprocessing. Preprocessing is mainly resize and normalization; postprocessing converts the model output into text labels.

That completes the pre- and post-processing of all three PP-OCRv4 models; next, a quick look at the visualization.

5. Visualization

With the text boxes and recognized text in hand, we can visualize the results:

(screenshot: visualization entry)

This is mainly done by the draw_ocr_box_txt function; let's take a closer look at it:

(screenshot: draw_ocr_box_txt source; see the consolidated code in Section 6)

draw_ocr_box_txt draws the text boxes and their contents. It consists of the following steps:

  • Create an ImageDraw object for drawing polygons (text boxes) on img_left.
  • Check the recognition confidence score; if it is below drop_score, skip drawing that box.
  • Call polygon to draw each text box.
  • Call draw_box_txt_fine to render the text of each box onto img_right_text.
  • Draw the box borders with OpenCV's cv2.polylines.
  • Merge the rendered text into img_right with a bitwise operation.
  • Blend the original image with img_left (which contains the boxes) into a mixed image.
  • Create a new image img_show and paste the original image and the text image side by side.
  • Convert the final composite to a numpy array and return it.

The draw_box_txt_fine function it calls is as follows:

(screenshot: draw_box_txt_fine source; a slightly simplified draw_box_txt appears in Section 6)

draw_box_txt_fine renders text inside a given box region and perspective-warps it to the corresponding position in the original image. It consists of the following steps:

  • Compute the box size.
  • Create the text image:
    • create a new image img_text sized to the box's width and height
    • create a font with create_font and draw the text onto the image
  • Perspective transform:
    • define a rectangle (pts1) and the box's four corners (pts2)
    • compute the perspective matrix M with cv2.getPerspectiveTransform
    • warp img_text onto the box's position with cv2.warpPerspective, producing img_right_text
  • Return the warped image img_right_text.

6. PP-OCRv4 Inference

Having worked through the three models' inference and the visualization, here is the complete inference code:

import cv2
import math
import copy
import random
import numpy as np
import onnxruntime as ort
import matplotlib.pyplot as plt
from PIL import Image, ImageFont, ImageDraw
import pyclipper
from shapely.geometry import Polygon

class TextDetector(object):
    def __init__(self, model_path, mask_thresh=0.3, box_thresh=0.6, 
                 max_candidates=1000, min_size=3, unclip_ratio=1.5) -> None:
        self.predictor      = ort.InferenceSession(model_path, providers=["CPUExecutionProvider"])
        self.mask_thresh    = mask_thresh
        self.box_thresh     = box_thresh
        self.max_candidates = max_candidates
        self.min_size       = min_size
        self.unclip_ratio   = unclip_ratio

    def preprocess(self, img, tar_w=960, tar_h=960):
        # 1. resize
        img  = cv2.resize(img, (int(tar_w), int(tar_h)))
        # 2. normalize
        img  = img.astype("float32") / 255.0
        mean = [0.485, 0.456, 0.406]
        std  = [0.229, 0.224, 0.225]
        mean = np.array(mean).reshape(1, 1, 3).astype("float32")
        std  = np.array(std).reshape(1, 1, 3).astype("float32")
        img  = (img - mean) / std
        # 3. to bchw
        img  = img.transpose((2, 0, 1))[None]
        return img
    
    def forward(self, input):
        # input->1x3x960x960
        output = self.predictor.run(None, {"images": input})[0]
        return output
    
    def postprocess(self, pred, src_h, src_w):
        # pred->1x1x960x960
        pred = pred[0, 0, :, :]
        mask = pred > self.mask_thresh
        boxes, _ = self._boxes_from_bitmap(pred, mask, src_w, src_h)
        boxes = self._filter_boxes(boxes, src_h, src_w)
        boxes = self._sorted_boxes(boxes)
        return boxes

    def _boxes_from_bitmap(self, pred, bitmap, dest_width, dest_height):
        """
        bitmap: single map with shape (H, W),
                whose values are binarized as {0, 1}
        """
        
        height, width = bitmap.shape
        
        # bitmap_image = (bitmap * 255).astype(np.uint8)
        # cv2.imwrite("bitmap_image.jpg", bitmap_image)

        outs = cv2.findContours((bitmap * 255).astype(np.uint8), cv2.RETR_LIST, cv2.CHAIN_APPROX_SIMPLE)
        if len(outs) == 3:
            _, contours, _ = outs[0], outs[1], outs[2]  # opencv3.x
        elif len(outs) == 2:
            contours, _ = outs[0], outs[1]  # opencv4.x
        
        num_contours = min(len(contours), self.max_candidates)
        
        # contour_image = cv2.cvtColor(bitmap_image, cv2.COLOR_GRAY2BGR)
        # for contour in contours:
        #     cv2.drawContours(contour_image, [contour], -1, (0, 0, 255), 2)
        # cv2.imwrite('contour_image.jpg', contour_image)

        boxes  = []
        scores = []
        for index in range(num_contours):
            contour = contours[index]
            points, sside = self._get_mini_boxes(contour)
            if sside < self.min_size:
                continue
            points = np.array(points)
            score  = self._box_score(pred, points.reshape(-1, 2))
            if score < self.box_thresh:
                continue
            
            box = self._unclip(points, self.unclip_ratio)
            if len(box) > 1:
                continue
            box = np.array(box).reshape(-1, 1, 2)
            box, sside = self._get_mini_boxes(box)
            if sside < self.min_size + 2:
                continue
            box = np.array(box)

            box[:, 0] = np.clip(np.round(box[:, 0] / width  * dest_width), 0, dest_width)
            box[:, 1] = np.clip(np.round(box[:, 1] / height * dest_height), 0, dest_height)
            boxes.append(box.astype("int32"))
            scores.append(score)
        return np.array(boxes, dtype="int32"), scores

    def _get_mini_boxes(self, contour):
        # [[center_x, center_y], [width, height], angle]
        bounding_box = cv2.minAreaRect(contour)
        points = sorted(list(cv2.boxPoints(bounding_box)), key=lambda x: x[0])

        index_1, index_2, index_3, index_4 = 0, 1, 2, 3
        if points[1][1] > points[0][1]:
            index_1 = 0
            index_4 = 1
        else:
            index_1 = 1
            index_4 = 0
        if points[3][1] > points[2][1]:
            index_2 = 2
            index_3 = 3
        else:
            index_2 = 3
            index_3 = 2
        
        box = [points[index_1], points[index_2], points[index_3], points[index_4]]
        return box, min(bounding_box[1])

    def _box_score(self, bitmap, _box):
        """
        box_score: use bbox mean score as the mean score
        """
        h, w = bitmap.shape[:2]
        box  = _box.copy()
        xmin = np.clip(np.floor(box[:, 0].min()).astype("int32"), 0, w - 1)
        xmax = np.clip(np.ceil(box[:, 0].max()).astype("int32"), 0, w - 1)
        ymin = np.clip(np.floor(box[:, 1].min()).astype("int32"), 0, h - 1)
        ymax = np.clip(np.ceil(box[:, 1].max()).astype("int32"), 0, h - 1)

        mask = np.zeros((ymax - ymin + 1, xmax - xmin + 1), dtype=np.uint8)
        box[:, 0] = box[:, 0] - xmin
        box[:, 1] = box[:, 1] - ymin
        cv2.fillPoly(mask, box.reshape(1, -1, 2).astype("int32"), 1)
        return cv2.mean(bitmap[ymin : ymax + 1, xmin : xmax + 1], mask)[0]

    def _unclip(self, box, unclip_ratio):
        poly = Polygon(box)
        distance = poly.area * unclip_ratio / poly.length
        offset = pyclipper.PyclipperOffset()
        offset.AddPath(box, pyclipper.JT_ROUND, pyclipper.ET_CLOSEDPOLYGON)
        expanded = offset.Execute(distance)
        return expanded

    def _filter_boxes(self, boxes, src_h, src_w):
        boxes_filter = []
        for box in boxes:
            box = self._order_points_clockwise(box)
            box = self._clip(box, src_h, src_w)
            rect_width  = int(np.linalg.norm(box[0] - box[1]))
            rect_height = int(np.linalg.norm(box[0] - box[3]))
            if rect_width <= 3 or rect_height <= 3:
                continue
            boxes_filter.append(box)
        return np.array(boxes_filter)
            
    def _order_points_clockwise(self, pts):
        rect = np.zeros((4, 2), dtype="float32")
        s = pts.sum(axis=1)
        rect[0] = pts[np.argmin(s)]
        rect[2] = pts[np.argmax(s)]
        tmp  = np.delete(pts, (np.argmin(s), np.argmax(s)), axis=0)
        diff = np.diff(np.array(tmp), axis=1)
        rect[1] = tmp[np.argmin(diff)]
        rect[3] = tmp[np.argmax(diff)]
        return rect

    def _clip(self, points, img_height, img_width):
        for idx in range(points.shape[0]):
            points[idx, 0] = int(min(max(points[idx, 0], 0), img_width - 1))
            points[idx, 1] = int(min(max(points[idx, 1], 0), img_height - 1))
        return points

    def _sorted_boxes(self, boxes):
        """
        Sort text boxes in order from top to bottom, left to right
        """
        num_boxes = boxes.shape[0]
        boxes_sorted = sorted(boxes, key=lambda x: (x[0][1], x[0][0]))
        _boxes = list(boxes_sorted)

        for i in range(num_boxes - 1):
            for j in range(i, -1, -1):
                if abs(_boxes[j + 1][0][1] - _boxes[j][0][1]) < 10 and (_boxes[j + 1][0][0] < _boxes[j][0][0]):
                    tmp = _boxes[j]
                    _boxes[j] = _boxes[j + 1]
                    _boxes[j + 1] = tmp
                else:
                    break
        return _boxes

class TextClassifier(object):
    def __init__(self, model_path, cls_thresh=0.9, cls_batch_num=6) -> None:
        self.predictor     = ort.InferenceSession(model_path, providers=["CPUExecutionProvider"])
        self.cls_thresh    = cls_thresh
        self.cls_batch_num = cls_batch_num
    
    def preprocess(self, img, boxes, tar_w=192, tar_h=48):
        img_crop_list = []
        for box in boxes:
            tmp_box  = copy.deepcopy(box)
            img_crop = self._get_rotate_crop_image(img, tmp_box)
            img_crop_list.append(img_crop)
        img_num = len(img_crop_list)
        ratio_list = [img.shape[1] / float(img.shape[0]) for img in img_crop_list]
        indices = np.argsort(np.array(ratio_list))

        imgs_pre_batch = []
        for beg_img_idx in range(0, img_num, self.cls_batch_num):
            end_img_idx = min(img_num, beg_img_idx + self.cls_batch_num)
            norm_img_batch = []
            for idx in range(beg_img_idx, end_img_idx):
                norm_img = self._resize_norm_img(img_crop_list[indices[idx]], tar_w, tar_h)
                norm_img = norm_img[None]
                norm_img_batch.append(norm_img)
            norm_img_batch = np.concatenate(norm_img_batch)
            imgs_pre_batch.append(norm_img_batch)

        return img_crop_list, imgs_pre_batch, indices

    def forward(self, inputs):
        # inputs->bx3x48x192
        output = self.predictor.run(None, {"images": inputs})[0]
        return output

    def postprocess(self, img_list, imgs_pre_batch, indices):
        cls_res = [["", 0.0]] * len(img_list)
        for batch in range(len(imgs_pre_batch)):
            # infer
            cls_pred = self.forward(imgs_pre_batch[batch])
            # cls_pred->bx2
            pred_idxs = cls_pred.argmax(axis=1)
            label_list = ["0", "180"]
            cls_result = [(label_list[idx], cls_pred[i, idx]) for i, idx in enumerate(pred_idxs)]
            for i in range(len(cls_result)):
                label, score = cls_result[i]
                cls_res[indices[batch * self.cls_batch_num + i]] = [label, score]
                if "180" in label and score > self.cls_thresh:
                    img_list[indices[batch * self.cls_batch_num + i]] = cv2.rotate(
                        img_list[indices[batch * self.cls_batch_num + i]], 1)
        return img_list, cls_res
    
    def _get_rotate_crop_image(self, img, points):
        img_crop_width  = int(max(np.linalg.norm(points[0] - points[1]), np.linalg.norm(points[2] - points[3])))
        img_crop_height = int(max(np.linalg.norm(points[0] - points[3]), np.linalg.norm(points[1] - points[2])))
        pts_std = np.float32(
            [
                [0, 0],
                [img_crop_width, 0],
                [img_crop_width, img_crop_height],
                [0, img_crop_height]
            ]
        )
        M = cv2.getPerspectiveTransform(points, pts_std)
        dst_img = cv2.warpPerspective(
            img,
            M,
            (img_crop_width, img_crop_height),
            borderMode=cv2.BORDER_REPLICATE,
            flags=cv2.INTER_CUBIC
        )
        dst_img_height, dst_img_width = dst_img.shape[:2]
        if (dst_img_height * 1.0 / dst_img_width) >= 1.5:
            dst_img = np.rot90(dst_img)
        return dst_img

    def _resize_norm_img(self, img, dest_width, dest_height):
        h, w, _ = img.shape
        ratio = w / float(h)
        if math.ceil(dest_height * ratio) > dest_width:
            resized_w = dest_width
        else:
            resized_w = int(math.ceil(dest_height * ratio))
        resized_image = cv2.resize(img, (resized_w, dest_height))
        resized_image = resized_image.astype("float32")
        resized_image = resized_image.transpose((2, 0, 1)) / 255.0
        resized_image -= 0.5
        resized_image /= 0.5
        padding_im = np.zeros((3, dest_height, dest_width), dtype=np.float32)
        padding_im[:, :, 0:resized_w] = resized_image
        return padding_im

class TextRecognizer(object):
    def __init__(self, model_path, character_dict_path, rec_batch_num=6) -> None:
        self.predictor     = ort.InferenceSession(model_path, providers=["CPUExecutionProvider"])
        self.character_str = []
        self.rec_batch_num = rec_batch_num

        with open(character_dict_path, "rb") as fin:
            lines = fin.readlines()
            for line in lines:
                line = line.decode("utf-8").strip("\n").strip("\r\n")
                self.character_str.append(line)
        self.character_str.append(" ")
        self.character_str = ["blank"] + self.character_str
        self.dict = {}
        for i, char in enumerate(self.character_str):
            self.dict[char] = i
    
    def preprocess(self, img_list, tar_w=640, tar_h=48):
        # for img in img_list:
        #     plt.imshow(cv2.cvtColor(img, cv2.COLOR_BGR2RGB))
        #     plt.show()
        img_num = len(img_list)
        ratio_list = [img.shape[1] / float(img.shape[0]) for img in img_list]
        indices = np.argsort(np.array(ratio_list))
        
        imgs_pre_batch = []
        for beg_img_idx in range(0, img_num, self.rec_batch_num):
            end_img_idx = min(img_num, beg_img_idx + self.rec_batch_num)
            norm_img_batch = []
            for idx in range(beg_img_idx, end_img_idx):
                norm_img = self._resize_norm_img(img_list[indices[idx]], tar_w, tar_h)

                # processed_img = norm_img.transpose(1, 2, 0)
                # processed_img = (processed_img * 0.5 + 0.5) * 255
                # processed_img = processed_img.astype(np.uint8)
                # plt.imshow(cv2.cvtColor(processed_img, cv2.COLOR_BGR2RGB))
                # plt.show()

                norm_img = norm_img[None]
                norm_img_batch.append(norm_img)
            norm_img_batch = np.concatenate(norm_img_batch)
            imgs_pre_batch.append(norm_img_batch)
        
        return imgs_pre_batch, indices

    def forward(self, inputs):
        # inputs->bx3x48x640
        output = self.predictor.run(None, {"images": inputs})[0]
        return output        

    def postprocess(self, imgs_pre_batch, indices):
        rec_res = [["", 0.0]] * len(indices)
        for batch in range(len(imgs_pre_batch)):
            # infer
            rec_pred = self.forward(imgs_pre_batch[batch])
            # rec_pred->bx80x6625
            preds_idx  = rec_pred.argmax(axis=2)
            preds_prob = rec_pred.max(axis=2)
            text = self._decode(preds_idx, preds_prob)
            for i in range(len(text)):
                rec_res[indices[batch * self.rec_batch_num + i]] = text[i]
        return rec_res

    def _resize_norm_img(self, img, dest_width, dest_height):
        h, w, _ = img.shape
        ratio = w / float(h)
        if math.ceil(dest_height * ratio) > dest_width:
            resized_w = dest_width
        else:
            resized_w = int(math.ceil(dest_height * ratio))
        resized_image = cv2.resize(img, (resized_w, dest_height))
        resized_image = resized_image.astype("float32")
        resized_image = resized_image.transpose((2, 0, 1)) / 255.0
        resized_image -= 0.5
        resized_image /= 0.5
        padding_im = np.zeros((3, dest_height, dest_width), dtype=np.float32)
        padding_im[:, :, 0:resized_w] = resized_image
        return padding_im

    def _decode(self, text_index, text_prob):
        "convert text-index into text-label"
        result_list = []
        batch_size = len(text_index)
        for batch_idx in range(batch_size):
            selection = np.ones(len(text_index[batch_idx]), dtype=bool)
            selection[1:] = text_index[batch_idx][1:] != text_index[batch_idx][:-1]
            selection &= text_index[batch_idx] != 0
            char_list = [self.character_str[text_id] for text_id in text_index[batch_idx][selection]]
            conf_list = text_prob[batch_idx][selection]
            if len(conf_list) == 0:
                conf_list = [0]
            text = "".join(char_list)
            result_list.append((text, np.mean(conf_list).tolist()))
        return result_list

def create_font(txt, sz, font_path):
    font_size = int(sz[1] * 0.99)
    font = ImageFont.truetype(font_path, font_size, encoding="utf-8")
    length = font.getlength(txt)
    if(length > sz[0]):
        font_size = int(font_size * sz[0] / length)
        font = ImageFont.truetype(font_path, font_size, encoding="utf-8")
    return font

def draw_box_txt(img_size, box, txt, font_path=None):
    box_height = int(np.linalg.norm(box[0] - box[3]))
    box_width  = int(np.linalg.norm(box[0] - box[1]))

    if box_height > 2 * box_width and box_height > 30:
        img_text  = Image.new("RGB", (box_height, box_width), (255, 255, 255))
        draw_text = ImageDraw.Draw(img_text)
        font = create_font(txt, (box_height, box_width), font_path)
        draw_text.text([0, 0], txt, fill=(0, 0, 0), font=font)
        img_text = img_text.transpose(Image.ROTATE_270)
    else:
        img_text  = Image.new("RGB", (box_width, box_height), (255, 255, 255))
        draw_text = ImageDraw.Draw(img_text)
        font = create_font(txt, (box_width, box_height), font_path)
        draw_text.text([0, 0], txt, fill=(0, 0, 0), font=font)
    
    pts1 = np.float32([[0, 0], [box_width, 0], [box_width, box_height], [0, box_height]])
    pts2 = np.array(box, dtype=np.float32)
    M = cv2.getPerspectiveTransform(pts1, pts2)

    img_text = np.array(img_text, dtype=np.uint8)
    img_right_text = cv2.warpPerspective(
        img_text,
        M,
        img_size,
        flags=cv2.INTER_NEAREST,
        borderMode=cv2.BORDER_CONSTANT,
        borderValue=(255, 255, 255)
    )
    return img_right_text        

def draw_ocr_box_txt(image, boxes, txts, scores, font_path=None, drop_score=0.5):
    image = Image.fromarray(cv2.cvtColor(image, cv2.COLOR_BGR2RGB))
    h, w  = image.height, image.width
    img_left  = image.copy()
    img_right = np.ones((h, w, 3), dtype=np.uint8) * 255
    random.seed(0)

    draw_left = ImageDraw.Draw(img_left)
    if txts is None or len(txts) != len(boxes):
        txts = [None] * len(boxes)
    for idx, (box, txt) in enumerate(zip(boxes, txts)):
        if scores is not None and scores[idx] < drop_score:
            continue
        color = tuple(random.randint(0, 255) for _ in range(3))
        draw_left.polygon(box, fill=color)
        img_right_text = draw_box_txt((w, h), box, txt, font_path)
        pts = np.array(box, np.int32).reshape((-1, 1, 2))
        cv2.polylines(img_right_text, [pts], True, color, 1)
        img_right = cv2.bitwise_and(img_right, img_right_text)
    img_left = Image.blend(image, img_left, 0.5)
    img_show = Image.new("RGB", (w * 2, h), (255, 255, 255))
    img_show.paste(img_left, (0, 0, w, h))
    img_show.paste(Image.fromarray(img_right), (w, 0, w * 2, h))
    return np.array(img_show)

if __name__ == "__main__":

    image = cv2.imread("deploy/lite/imgs/lite_demo.png")
    src_h, src_w, _ = image.shape

    det_model_file_path = "models/det/det.sim.onnx"
    cls_model_file_path = "models/cls/cls.sim.onnx"
    rec_model_file_path = "models/rec/rec.sim.onnx"
    character_dict_path = "ppocr/utils/ppocr_keys_v1.txt"
    font_path = "doc/fonts/simfang.ttf"

    # 1. text detection
    text_detector = TextDetector(det_model_file_path)
    img_pre   = text_detector.preprocess(image)
    det_pred  = text_detector.forward(img_pre)
    det_boxes = text_detector.postprocess(det_pred, src_h, src_w)

    # 2. text classification
    if det_boxes is None:
        print("warning, no det_boxes found")
        exit()
    else:
        print(f"det_boxes num: {len(det_boxes)}")
    text_classifier = TextClassifier(cls_model_file_path)
    img_list, imgs_pre_batch, indices = text_classifier.preprocess(image, det_boxes)
    img_list, _ = text_classifier.postprocess(img_list, imgs_pre_batch, indices)

    # 3. text recognition
    text_recognizer = TextRecognizer(rec_model_file_path, character_dict_path)
    imgs_pre_batch, indices = text_recognizer.preprocess(img_list)
    rec_txts = text_recognizer.postprocess(imgs_pre_batch, indices)

    # 4. visualization
    txts   = [rec_txts[i][0] for i in range(len(rec_txts))]
    scores = [rec_txts[i][1] for i in range(len(rec_txts))]
    
    draw_img = draw_ocr_box_txt(image, det_boxes, txts, scores, font_path)
    cv2.imwrite("result.jpg", draw_img[:, :, ::-1])

Running this code saves result.jpg, the rendered inference result, in the current directory, as shown below:

(screenshot: result.jpg)

The result looks good and closely matches the PaddlePaddle inference output, which confirms that our reading of each model's pre- and post-processing is correct.

Finally, inference results on a few other images:

(screenshots: inference results on additional images)

Conclusion

In this article we worked through the pre- and post-processing of PP-OCRv4's three modules: detection, direction classification, and recognition. The code is fairly clear overall, even if some steps are a bit tedious; take them carefully and it all falls into place.

OK, that is all for the PP-OCRv4 pre/post-processing walkthrough. In the next article we will look at how to run inference with TensorRT. Stay tuned 😄
