
OpenVINO Series 19: Real-Time Video OCR with OpenVINO and PaddleOCR

This example shows how to run the PP-OCR models on a live local video stream with OpenVINO. There is no need to export the PaddlePaddle models to ONNX and then convert them to Intermediate Representation (IR) format with the OpenVINO Model Optimizer; OpenVINO can now read PaddlePaddle models directly. PaddleOCR is an ultra-lightweight OCR system trained with the PaddlePaddle deep learning framework, which aims to provide practical multilingual OCR tools.

The pretrained PaddleOCR models used in this example are the Chinese and English ultra-lightweight PP-OCR models (9.4M).

Environment:

  • Runtime environment: Windows 10, 9th-gen Intel Core i7 laptop
  • IDE: VSCode
  • OpenVINO version: 2022.1
  • Code link



0 Running the Code

After downloading the code from Gitee, go into 14-realtime-paddleOcr. You need to create a new virtual environment and install the dependencies (the installation and usage of OpenVINO are not repeated here; if needed, open the Gitee link, where the first few chapters and the README in the root directory cover it). Once inside the virtual environment, simply run python main.py in the terminal.

1 Downloading and Loading the Models

This example uses two models: a text detection model and a text recognition model.

As in the other examples, we first load the corresponding pretrained models and compile them. The concepts are not repeated here; the code is as follows:

1.1 Text Detection Model

'''
text detection model
'''
from pathlib import Path

from openvino.runtime import Core

det_model_url = "https://paddleocr.bj.bcebos.com/dygraph_v2.0/ch/ch_ppocr_mobile_v2.0_det_infer.tar"
det_model_file_path = Path("model/ch_ppocr_mobile_v2.0_det_infer/inference.pdmodel")
print("1 - Download and load model for text detection.")
run_model_download(det_model_url, det_model_file_path)
# initialize inference engine for text detection
core = Core()
det_model = core.read_model(model=det_model_file_path)
det_compiled_model = core.compile_model(model=det_model, device_name="CPU")
# get input and output nodes for text detection
det_input_layer = det_compiled_model.input(0)
det_output_layer = det_compiled_model.output(0)
print("- Text detection Model Input layer: ", det_input_layer)
print("- Text detection Model Output layer: ", det_output_layer)

Terminal output:

1 - Download and load model for text detection.
Model already exists
- Text detection Model Input layer:  <ConstOutput: names[x] shape{?,3,640,640} type: f32>
- Text detection Model Output layer:  <ConstOutput: names[save_infer_model/scale_0.tmp_1] shape{?,1,640,640} type: f32>

Note that this text detection model takes a 640x640 image as input, so we need some preprocessing beforehand. The model's output is also 640x640, so we still have to extract the text boxes from it before moving on to text recognition.

1.2 Text Recognition Model

'''
text recognition model
'''
print("2 - Download and load model for text recognition.")
rec_model_url = "https://paddleocr.bj.bcebos.com/dygraph_v2.0/ch/ch_ppocr_mobile_v2.0_rec_infer.tar"
rec_model_file_path = Path("model/ch_ppocr_mobile_v2.0_rec_infer/inference.pdmodel")
run_model_download(rec_model_url, rec_model_file_path)
# Text detection outputs a series of bounding boxes, which become the inputs to the
# text recognition model. The boxes come in different sizes, so the recognition
# model's input width must be changed from fixed to dynamic.
# Making a dimension dynamic means setting it to -1.
# The original recognition input shape is {?,3,32,100}; we want to change the fourth
# dimension from 100 (a fixed value) to ? (a dynamic value).
# read the model and corresponding weights from file
rec_model = core.read_model(model=rec_model_file_path)

# assign dynamic shapes to every input layer on the last dimension
for input_layer in rec_model.inputs:
    input_shape = input_layer.partial_shape
    input_shape[3] = -1
    rec_model.reshape({input_layer: input_shape})

rec_compiled_model = core.compile_model(model=rec_model, device_name="CPU")

# get input and output nodes
rec_input_layer = rec_compiled_model.input(0)
rec_output_layer = rec_compiled_model.output(0)

print("- Text recognition Model Input layer: ", rec_input_layer)
print("- Text recognition Model Output layer: ", rec_output_layer)

Terminal output:

2 - Download and load model for text recognition.
Model already exists
- Text recognition Model original input:  [<Output: names[x] shape{?,3,32,100} type: f32>]
- Text recognition Model original output:  [<Output: names[save_infer_model/scale_0.tmp_1] shape{?,25,6625} type: f32>]
- Text recognition Model Input layer:  <ConstOutput: names[x] shape{?,3,32,?} type: f32>
- Text recognition Model Output layer:  <ConstOutput: names[save_infer_model/scale_0.tmp_1] shape{?,?,6625} type: f32>

Note that the original model input is {?,3,32,100}. The first dimension is the batch size, the second is the number of channels, and the third and fourth are the height and width of the input image. These sizes are fixed. In our case, after text detection we get the bounding boxes that contain text, crop the corresponding patches out of the input image, and resize them. But these patches cannot all be 32x100: we can make every patch's height 32, but the widths will not all be 100. So we need to make the fourth dimension of the recognition model's input dynamic instead of fixed. After reading the model (core.read_model(model=rec_model_file_path)), we set the fourth input dimension to -1 (input_shape[3] = -1), and finally compile (core.compile_model(model=rec_model, device_name="CPU")).
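
As a quick sanity check (a small sketch, not part of the original script), you can print the input's partial shape before and after the reshape; for a single-input model, an equivalent shorthand passes a PartialShape directly, where -1 marks a dimension as dynamic:

from openvino.runtime import Core, PartialShape

core = Core()
rec_model = core.read_model(model="model/ch_ppocr_mobile_v2.0_rec_infer/inference.pdmodel")
print(rec_model.input(0).partial_shape)   # {?,3,32,100} before the reshape

# equivalent to the per-input loop above for a single-input model
rec_model.reshape(PartialShape([-1, 3, 32, -1]))
print(rec_model.input(0).partial_shape)   # {?,3,32,?} after the reshape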

1.3 Model Download Code

The model download helper (run_model_download) is as follows:

import os
import tarfile
import urllib.request

# Define the function to download text detection and recognition models from PaddleOCR resources
def run_model_download(model_url, model_file_path):
    """
    Download pre-trained models from PaddleOCR resources

    Parameters:
        model_url: url link to pre-trained models
        model_file_path: file path to store the downloaded model
    """
    model_name = model_url.split("/")[-1]

    if model_file_path.is_file():
        print("Model already exists")
    else:
        # Download the model from the server, and untar it.
        print("Downloading the pre-trained model... May take a while...")

        # create a directory
        os.makedirs("model", exist_ok=True)
        urllib.request.urlretrieve(model_url, f"model/{model_name}")
        print("Model Downloaded")

        # tarfile.extractall returns None, so just report success once it finishes
        with tarfile.open(f"model/{model_name}") as file:
            file.extractall("model")
        print(f"Model Extracted to {model_file_path}.")

2 Using the Models

Before going into the details, let's walk through the flow of feeding in one image and getting the OCR results back. Here is the code:

import cv2
import matplotlib.pyplot as plt
from PIL import Image

'''
Let's try to do some test based on images
'''
# Read the image
image = cv2.imread("data/label4.png")
print("- Now we test paddle OCR with image.")
print("1 - Input image size: {}".format(image.shape))
# N,C,H,W = batch size, number of channels, height, width
# if frame larger than full HD, reduce size to improve the performance
scale = 1280 / max(image.shape)
if scale < 1:
    image = cv2.resize(src=image, dsize=None, fx=scale, fy=scale,
                        interpolation=cv2.INTER_AREA)
# preprocess image for text detection
textDetectionImage = processing.image_preprocess_textDetection(image, 640)
print("2 - Text detection image preprocessing. Image is resized into {}".format(image.shape))

# perform the inference step
det_results = det_compiled_model([textDetectionImage])[det_output_layer]
print("3 - Text detection inference result scale {}".format(det_results.shape))

# Postprocessing for Paddle Detection
dt_boxes = processing.post_processing_detection(image, det_results)
print("4 - Transform text detection inference result into bounding boxes: {}".format(dt_boxes.shape))
# Preprocess detection results for recognition
# Sort text boxes in order from top to bottom, left to right
dt_boxes = processing.sorted_boxes(dt_boxes) 

# preprocess for text recognition
img_crop_list, img_num, indices = processing.prep_for_rec(dt_boxes, image)
print("5 - Text recognition image recognition. {} texts detected, and related images are cropped.".format(img_num))
print("- Text1 scale: {}; Text2 scale: {}".format(img_crop_list[0].shape, img_crop_list[1].shape))

# For storing recognition results, include two parts:
# txts are the recognized text results, scores are the recognition confidence level 
rec_res = [['', 0.0]] * img_num
txts = [] 
scores = []
batch_num = 1

for beg_img_no in range(0, img_num, batch_num):

    # Recognition starts from here
    norm_img_batch = processing.batch_text_box(
        img_crop_list, img_num, indices, beg_img_no, batch_num)

    print("- Transform the cropped image from different scale to shape ?,3,32,? , where first ? is batch size, and second ? is dynamic image height: {}".format(norm_img_batch.shape))

    # Run inference for text recognition 
    rec_results = rec_compiled_model([norm_img_batch])[rec_output_layer]
    print("6 - Text recognition inference result scale {}. The last dimension 6625 indicates to dictionary of ground truth texts.".format(rec_results.shape))

    # Postprocessing recognition results
    postprocess_op = processing.build_post_process(processing.postprocess_params)
    rec_result = postprocess_op(rec_results)
    print("- Text recognition result is: ", rec_result)

    for rno in range(len(rec_result)):
        rec_res[indices[beg_img_no + rno]] = rec_result[rno]   
    if rec_res:
        txts = [rec_res[i][0] for i in range(len(rec_res))] 
        scores = [rec_res[i][1] for i in range(len(rec_res))]
                        
image = Image.fromarray(cv2.cvtColor(image, cv2.COLOR_BGR2RGB))
boxes = dt_boxes

print("- Finally we draw the results on the image.")

# draw text recognition results beside the image
draw_img = processing.draw_ocr_box_txt(
    image,
    boxes,
    txts,
    scores,
    drop_score=0.5)

# Visualize PaddleOCR results.
# draw_ocr_box_txt returns an RGB image, which matplotlib displays directly;
# a BGR conversion is only needed for cv2.imshow.
plt.figure()
plt.axis("off")
plt.imshow(draw_img);

Terminal output:

- Now we test paddle OCR with image.
1 - Input image size: (256, 644, 3)
2 - Text detection image preprocessing. Image is resized into (1, 3, 640, 640)
3 - Text detection inference result scale (1, 1, 640, 640)
4 - Transform text detection inference result into bounding boxes: (2, 4, 2)
5 - Preprocessing for text recognition. 2 texts detected, and the related images are cropped.
- Text1 scale: (77, 464, 3); Text2 scale: (46, 347, 3)
- Transform the cropped images from their different sizes to shape ?,3,32,?, where the first ? is the batch size and the second ? is the dynamic image width: (1, 3, 32, 192)
6 - Text recognition inference result scale (1, 48, 6625). The last dimension, 6625, is the size of the character dictionary.
- Text recognition result is:  [('Apple Support', 0.9873969)]
- Transform the cropped images from their different sizes to shape ?,3,32,?, where the first ? is the batch size and the second ? is the dynamic image width: (1, 3, 32, 241)
6 - Text recognition inference result scale (1, 60, 6625). The last dimension, 6625, is the size of the character dictionary.
- Text recognition result is:  [('1-800-275-2273', 0.99747336)]
- Finally we draw the results on the image.

Here are some of the results:

(Result screenshots omitted.)

Note that this pipeline has no angle classification module, so detection on images containing rotated text is noticeably worse.

The terminal output above already sketches the steps the whole pipeline goes through. Next, we will explain them one by one.

3 The Text Detection Model

3.1 Preprocessing

Given an input image, we first need some preprocessing so that, afterwards, the image size matches the input size of the text detection model. The code is as follows:

'''
Preprocess for text detection
'''
import cv2
import numpy as np

def image_preprocess_textDetection(input_image, size):
    """
    Preprocess input image for text detection

    Parameters:
        input_image: input image 
        size: value for the image to be resized for text detection model
    """
    img = cv2.resize(input_image, (size, size))
    img = np.transpose(img, [2, 0, 1]) / 255
    img = np.expand_dims(img, 0)
    # NormalizeImage: {mean: [0.485, 0.456, 0.406], std: [0.229, 0.224, 0.225], is_scale: True}
    img_mean = np.array([0.485, 0.456, 0.406]).reshape((3, 1, 1))
    img_std = np.array([0.229, 0.224, 0.225]).reshape((3, 1, 1))
    img -= img_mean
    img /= img_std
    return img.astype(np.float32)

This preprocessing consists of an image resize, a dimension reorder from HWC to NCHW (plus a batch dimension), and image normalization.
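
As a quick check (a sketch; it assumes the data/label4.png test image from section 2), a 256x644 BGR frame comes out as a (1, 3, 640, 640) float32 tensor:

img = image_preprocess_textDetection(cv2.imread("data/label4.png"), 640)
print(img.shape, img.dtype)  # (1, 3, 640, 640) float32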

3.2 Text Detection Inference

Then we feed the preprocessed image into the text detection model:

# perform the inference step
det_results = det_compiled_model([textDetectionImage])[det_output_layer]
print("3 - Text detection inference result scale {}".format(det_results.shape))

The model output is a (1, 1, 640, 640) array rather than detection boxes. The reason is that this detector is segmentation-based (the DB, Differentiable Binarization, approach): each value in the output map is the probability that the corresponding pixel belongs to a text region, and the bounding boxes are recovered from this probability map during postprocessing.
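
To make that concrete, here is a simplified sketch of how rectangles can be pulled out of such a probability map; the repo's boxes_from_bitmap does this more carefully, with box unclipping and scoring:

import cv2
import numpy as np

# Simplified illustration only: binarize the probability map and take
# bounding rectangles of the connected regions.
prob_map = det_results[0, 0]                  # (640, 640) probability map
bitmap = (prob_map > 0.3).astype(np.uint8)    # same 0.3 threshold as post_processing_detection
contours, _ = cv2.findContours(bitmap, cv2.RETR_LIST, cv2.CHAIN_APPROX_SIMPLE)
for contour in contours:
    x, y, w, h = cv2.boundingRect(contour)
    print(f"candidate text region at ({x}, {y}), size {w}x{h}")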

3.3 Postprocessing

Postprocessing here means converting the output of the text detection model into text boxes.

# Postprocessing for Paddle Detection
dt_boxes = processing.post_processing_detection(image, det_results)
print("4 - Transform text detection inference result into bounding boxes: {}".format(dt_boxes.shape))
# Preprocess detection results for recognition
# Sort text boxes in order from top to bottom, left to right
dt_boxes = processing.sorted_boxes(dt_boxes) 

The code for post_processing_detection is as follows:

import numpy as np
import paddle

def post_processing_detection(frame, det_results):
    """
    Postprocess the results from text detection into bounding boxes

    Parameters:
        frame: input image 
        det_results: inference results from text detection model
    """   
    ori_im = frame.copy()
    data = {'image': frame}
    # data_resize is a dict which contains data_resize['image'] and 
    # data_resize['shape'] which contains [src_h, src_w, ratio_h, ratio_w]
    data_resize = DetResizeForTest(data)
    data_list = []
    keep_keys = ['image', 'shape']
    # Convert dict to list (data_list) which contains 2 elements
    for key in keep_keys:
        data_list.append(data_resize[key])
    img, shape_list = data_list
    # Expand 1 dimension
    shape_list = np.expand_dims(shape_list, axis=0) 
    pred = det_results[0]
    if isinstance(pred, paddle.Tensor):
        pred = pred.numpy()
    # binarize the probability map with a 0.3 threshold
    segmentation = pred > 0.3

    boxes_batch = []
    for batch_index in range(pred.shape[0]):
        src_h, src_w, ratio_h, ratio_w = shape_list[batch_index]
        mask = segmentation[batch_index]
        boxes, scores = boxes_from_bitmap(pred[batch_index], mask, src_w, src_h)
        boxes_batch.append({'points': boxes})
    post_result = boxes_batch
    dt_boxes = post_result[0]['points']
    dt_boxes = filter_tag_det_res(dt_boxes, ori_im.shape)    
    return dt_boxes
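
The sorted_boxes helper used earlier is not shown in this article; a sketch based on PaddleOCR's implementation looks like this:

def sorted_boxes(dt_boxes):
    """
    Sort detected text boxes top-to-bottom, then left-to-right.
    Boxes whose top-left corners are within 10 pixels vertically are
    treated as the same line and ordered by their x coordinate.
    """
    num_boxes = dt_boxes.shape[0]
    _boxes = list(sorted(dt_boxes, key=lambda box: (box[0][1], box[0][0])))
    for i in range(num_boxes - 1):
        if abs(_boxes[i + 1][0][1] - _boxes[i][0][1]) < 10 and \
                _boxes[i + 1][0][0] < _boxes[i][0][0]:
            _boxes[i], _boxes[i + 1] = _boxes[i + 1], _boxes[i]
    return _boxes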

4 The Text Recognition Model

4.1 Preprocessing

The first step is the same as for text detection: some preprocessing. We crop the bounding box regions out of the image and resize them to match the input size of the text recognition model.

# preprocess for text recognition
img_crop_list, img_num, indices = processing.prep_for_rec(dt_boxes, image)
print("5 - Text recognition image recognition. {} texts detected, and related images are cropped.".format(img_num))
print("- Text1 scale: {}; Text2 scale: {}".format(img_crop_list[0].shape, img_crop_list[1].shape))

In prep_for_rec, we use the detected bounding boxes to crop patches from the original image, and return how many text regions were detected along with the sorting indices:

import copy

import numpy as np

def prep_for_rec(dt_boxes, frame):
    """
    Preprocess the detected bounding boxes for text recognition

    Parameters:
        dt_boxes: detected bounding boxes from text detection
        frame: original input frame

    Returns:
        img_crop_list: list of images cropped out of the frame by the detected bounding boxes
        img_num: number of bounding boxes detected by text detection
        indices: indices that sort the crops by aspect ratio (ascending)
    """
    ori_im = frame.copy()
    img_crop_list = [] 
    for bno in range(len(dt_boxes)):
        tmp_box = copy.deepcopy(dt_boxes[bno])
        img_crop = get_rotate_crop_image(ori_im, tmp_box)
        img_crop_list.append(img_crop)
        
    img_num = len(img_crop_list)
    # Calculate the aspect ratio of all text bars
    width_list = []
    for img in img_crop_list:
        width_list.append(img.shape[1] / float(img.shape[0]))
    
    # Sorting can speed up the recognition process
    indices = np.argsort(np.array(width_list))
    return img_crop_list, img_num, indices

Then, for each batch of detected boxes and their cropped images, batch_text_box converts them to the input size of the text recognition model:

def batch_text_box(img_crop_list, img_num, indices, beg_img_no, batch_num):
    """
    Batch for text recognition

    Parameters:
        img_crop_list: processed detected bounding box images 
        img_num: number of bounding boxes from text detection
        indices: sorting for bounding boxes to speed up text recognition
        beg_img_no: the beginning number of bounding boxes for each batch of text recognition inference
        batch_num: number of images for each batch
    """
    norm_img_batch = []
    max_wh_ratio = 0
    end_img_no = min(img_num, beg_img_no + batch_num)
    for ino in range(beg_img_no, end_img_no):
        h, w = img_crop_list[indices[ino]].shape[0:2]
        wh_ratio = w * 1.0 / h
        max_wh_ratio = max(max_wh_ratio, wh_ratio)
    for ino in range(beg_img_no, end_img_no):
        norm_img = resize_norm_img(img_crop_list[indices[ino]], max_wh_ratio)
        norm_img = norm_img[np.newaxis, :]
        norm_img_batch.append(norm_img)

    norm_img_batch = np.concatenate(norm_img_batch)
    norm_img_batch = norm_img_batch.copy()
    return norm_img_batch
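
The resize_norm_img helper is also not shown above; a sketch based on PaddleOCR's implementation resizes each crop to height 32, normalizes it to [-1, 1], and right-pads it to the widest image in the batch:

import math

import cv2
import numpy as np

def resize_norm_img(img, max_wh_ratio):
    # target shape: 3 x 32 x (32 * max aspect ratio of the batch)
    imgC, imgH, imgW = 3, 32, int(32 * max_wh_ratio)
    h, w = img.shape[:2]
    ratio = w / float(h)
    # resize keeping the aspect ratio, capped at the target width
    resized_w = imgW if math.ceil(imgH * ratio) > imgW else int(math.ceil(imgH * ratio))
    resized_image = cv2.resize(img, (resized_w, imgH)).astype("float32")
    # HWC -> CHW, scale to [0, 1], then normalize to [-1, 1]
    resized_image = resized_image.transpose((2, 0, 1)) / 255
    resized_image -= 0.5
    resized_image /= 0.5
    # right-pad with zeros so every image in the batch has the same width
    padding_im = np.zeros((imgC, imgH, imgW), dtype=np.float32)
    padding_im[:, :, 0:resized_w] = resized_image
    return padding_im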

4.2 Text Recognition Inference

We feed the preprocessed crops into the text recognition model:

# Run inference for text recognition 
rec_results = rec_compiled_model([norm_img_batch])[rec_output_layer]
print("6 - Text recognition inference result scale {}. The last dimension 6625 indicates to dictionary of ground truth texts.".format(rec_results.shape))

The output shape is (1, ?, 6625). The third dimension is the size of the character dictionary the model was trained with. The second dimension is the number of CTC time steps: the recognizer reads the image as a horizontal sequence, so this length grows with the input width (roughly width / 4; the 192-pixel-wide crop above gives 48 steps, the 241-pixel-wide one gives 60), and each step holds a probability distribution over the 6625 dictionary entries.
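
As an illustration, here is a minimal sketch of greedy CTC decoding (simplified, not the repo's CTCLabelDecode; char_list is assumed to be ['blank'] plus the lines of ppocr_keys_v1.txt plus a space):

import numpy as np

# Minimal greedy CTC decode (illustration only).
preds_idx = rec_results.argmax(axis=2)    # best dictionary index per time step
preds_prob = rec_results.max(axis=2)      # probability of that index
for b in range(preds_idx.shape[0]):
    chars, confs = [], []
    prev_idx = -1
    for t, idx in enumerate(preds_idx[b]):
        # CTC rule: collapse consecutive repeats and drop the blank (index 0)
        if idx != prev_idx and idx != 0:
            chars.append(char_list[idx])   # char_list: assumed decoding table, see above
            confs.append(preds_prob[b][t])
        prev_idx = idx
    print("".join(chars), float(np.mean(confs)) if confs else 0.0)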

4.3 Postprocessing

Finally, we decode the output against the character dictionary to obtain the recognized text:

# Postprocessing recognition results
postprocess_op = processing.build_post_process(processing.postprocess_params)
rec_result = postprocess_op(rec_results)
print("- Text recognition result is: ", rec_result)
## Postprocessing for recognition
postprocess_params = {
    'name': 'CTCLabelDecode',
    "character_type": "ch",
    "character_dict_path": "./data/ppocr_keys_v1.txt",
    "use_space_char": True
}

def build_post_process(config):
    config = copy.deepcopy(config)
    module_name = config.pop('name')
    # look up the postprocessing class (here CTCLabelDecode) by name
    module_class = eval(module_name)(**config)
    return module_class
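
build_post_process looks the class up by name with eval(); a sketch of an alternative design uses an explicit registry instead, which avoids evaluating arbitrary strings from the config:

# A safer equivalent to the eval() lookup: an explicit class registry.
SUPPORTED_POSTPROCESS = {"CTCLabelDecode": CTCLabelDecode}

def build_post_process(config):
    config = copy.deepcopy(config)
    module_class = SUPPORTED_POSTPROCESS[config.pop("name")]
    return module_class(**config)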

5 The VideoPlayer Class

This part is, to me, the most instructive piece. In the VideoPlayer class, a dedicated thread reads frames from cv2.VideoCapture at the specified FPS, while the main thread fetches the latest frame through next() whenever it needs one. This is well-written code: if inference and frame grabbing shared a single thread, dropped frames could cause all kinds of problems, such as stuttering video, growing latency, or even a "crash" when the delay grows too long.

The source argument can be written in three ways, one per input type:

  • Video file: e.g., …/201-vision-monodepth/data/Coco Walking in Berkeley.mp4
  • USB webcam: e.g., 0 (depending on the port, it may be 0, 1, or another index)
  • RTSP stream: e.g., rtsp://192.168.1.2:8080/out.h264

The VideoPlayer class is as follows:

import threading
import time

import cv2


class VideoPlayer:
    """
    Custom video player to fulfill FPS requirements. You can set target FPS and output size,
    flip the video horizontally or skip first N frames.

    :param source: Video source. It could be either camera device or video file. For rtsp camera, format should be something like: rtsp://192.168.1.2:8080/out.h264
    :param size: Output frame size.
    :param flip: Flip source horizontally.
    :param fps: Target FPS.
    :param skip_first_frames: Skip first N frames.
    """

    def __init__(self, source, size=None, flip=False, fps=None, skip_first_frames=0):
        self.__cap = cv2.VideoCapture(source)
        if not self.__cap.isOpened():
            raise RuntimeError(
                f"Cannot open {'camera' if isinstance(source, int) else ''} {source}"
            )
        # skip first N frames
        self.__cap.set(cv2.CAP_PROP_POS_FRAMES, skip_first_frames)
        # fps of input file
        self.__input_fps = self.__cap.get(cv2.CAP_PROP_FPS)
        if self.__input_fps <= 0:
            self.__input_fps = 60
        # target fps given by user
        self.__output_fps = fps if fps is not None else self.__input_fps
        self.__flip = flip
        self.__size = None
        self.__interpolation = None
        if size is not None:
            self.__size = size
            # AREA better for shrinking, LINEAR better for enlarging
            self.__interpolation = (
                cv2.INTER_AREA
                if size[0] < self.__cap.get(cv2.CAP_PROP_FRAME_WIDTH)
                else cv2.INTER_LINEAR
            )
        # first frame
        _, self.__frame = self.__cap.read()
        self.__lock = threading.Lock()
        self.__thread = None
        self.__stop = False

    """
    Start playing.
    """

    def start(self):
        self.__stop = False
        self.__thread = threading.Thread(target=self.__run, daemon=True)
        self.__thread.start()

    """
    Stop playing and release resources.
    """

    def stop(self):
        self.__stop = True
        if self.__thread is not None:
            self.__thread.join()
        self.__cap.release()

    def __run(self):
        prev_time = 0
        while not self.__stop:
            t1 = time.time()
            ret, frame = self.__cap.read()
            if not ret:
                break

            # fulfill target fps
            if 1 / self.__output_fps < time.time() - prev_time:
                prev_time = time.time()
                # replace by current frame
                with self.__lock:
                    self.__frame = frame

            t2 = time.time()
            # time to wait [s] to fulfill input fps
            wait_time = 1 / self.__input_fps - (t2 - t1)
            # sleep until it is time to read the next frame
            time.sleep(max(0, wait_time))

        self.__frame = None

    """
    Get current frame.
    """

    def next(self):
        with self.__lock:
            if self.__frame is None:
                return None
            # need to copy frame, because can be cached and reused if fps is low
            frame = self.__frame.copy()
        if self.__size is not None:
            frame = cv2.resize(frame, self.__size, interpolation=self.__interpolation)
        if self.__flip:
            frame = cv2.flip(frame, 1)
        return frame
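
Putting it together, a minimal usage sketch (hypothetical; the actual main loop in main.py wires the OCR pipeline into this spot):

player = VideoPlayer(source=0, flip=True, fps=30)
player.start()
try:
    while True:
        frame = player.next()
        if frame is None:
            break
        # run text detection + recognition on `frame` here
        cv2.imshow("PaddleOCR demo", frame)
        if cv2.waitKey(1) & 0xFF == 27:  # press Esc to quit
            break
finally:
    player.stop()
    cv2.destroyAllWindows()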