Deploying a Fully Functional Unstructured Project on Windows

**Pitfalls encountered while deploying a fully functional Unstructured project locally on Windows**

I. Installing the unstructured Python package

Install from PyPI with pip:

To support all document types:

pip install "unstructured[all-docs]"

To support only the document types that need no extra dependencies, such as plain text files, HTML, XML, JSON, and emails:

pip install unstructured

To add support for specific additional document types:

pip install "unstructured[docx,pptx]"
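After running one of the commands above, it can help to confirm what actually got installed. The following is a minimal sketch using only the standard library; `installed_version` is a helper name made up for this example and is not part of unstructured. It assumes the PyPI distribution names `unstructured` and `unstructured-inference` (the latter provides the `unstructured_inference` package referenced later in this post).

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of `package`, or None if it is not installed."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


if __name__ == "__main__":
    # Distribution names as published on PyPI (assumed, see the lead-in).
    for name in ("unstructured", "unstructured-inference"):
        found = installed_version(name)
        print(f"{name}: {found if found else 'not installed'}")
```

If `unstructured-inference` shows as not installed, the `hi_res` strategy used below is unlikely to work until the matching extra is installed.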

Source: https://blog.csdn.net/lovechris00/article/details/137599877

Unstructured - Extracting Unstructured Data (python unstructured, CSDN blog)

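II. Common errors during use

Problem 1. Model not downloaded

Reference: Models - Unstructured

Once the package is installed you can already use the basic features of unstructured, but not its pretrained layout-detection models, so the results are weaker. If you try to use one of the models from the official documentation and the model file cannot be downloaded, the call fails, as in this example: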

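from unstructured.partition.pdf import partition_pdf

fname = "C:\\Users\\Lenovo\\Desktop\\2023量化\\附件2 信息学院本科生素质量化考评办法.pdf"
elements = partition_pdf(
    filename=fname,
    strategy="hi_res",            # layout-model-based partitioning
    hi_res_model_name="yolox",    # triggers the model download on first use
    infer_table_structure=True,   # needed for metadata.text_as_html on tables
)
for el in elements:
    # Elements are objects, not dicts: use attributes instead of el["type"].
    if el.category == "Table":
        print(el.metadata.text_as_html)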

Error message:

SSLError: (MaxRetryError("HTTPSConnectionPool(host='huggingface.co', port=443): Max retries exceeded with url: /unstructuredio/yolo_x_layout/resolve/main/yolox_l0.05.onnx (Caused by SSLError(SSLCertVerificationError(1, '[SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: unable to get local issuer certificate (_ssl.c:1129)')))"), '(Request ID: 757ef56e-88d9-4a7a-88ef-ff3fade2139c)')

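Solution: download the model from Hugging Face manually, then change the configuration.

1. Download

Download page: unstructuredio/yolo_x_layout at main (huggingface.co)

2. Change the configuration

Open this file inside your Python installation's package folder:

Lib\site-packages\unstructured_inference\models\yolox.py

Edit the code at line 37 (the exact line number may differ between versions):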

MODEL_TYPES = {
    "yolox": LazyDict(
        model_path=r"<path to your downloaded model file>",
        label_map=YOLOX_LABEL_MAP,
    ),
    "yolox_tiny": LazyDict(
        model_path=LazyEvaluateInfo(
            download_if_needed_and_get_local_path,
            "unstructuredio/yolo_x_layout",
            "yolox_tiny.onnx",
        ),
        label_map=YOLOX_LABEL_MAP,
    ),
    "yolox_quantized": LazyDict(
        model_path=LazyEvaluateInfo(
            download_if_needed_and_get_local_path,
            "unstructuredio/yolo_x_layout",
            "yolox_l0.05_quantized.onnx",
        ),
        label_map=YOLOX_LABEL_MAP,
    ),
}
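Before pasting a path into `model_path`, it may be worth checking that the file really exists and converting it to a form that is safe inside a Python string literal, since unescaped backslashes in Windows paths are a common source of errors. The following is a minimal sketch under that assumption; `checked_model_path` is a hypothetical helper written for this post, not part of unstructured or unstructured_inference.

```python
from pathlib import Path


def checked_model_path(raw_path: str) -> str:
    """Validate a local ONNX model file and return its path with forward slashes."""
    path = Path(raw_path).expanduser()
    if not path.is_file():
        raise FileNotFoundError(f"model file not found: {path}")
    if path.suffix.lower() != ".onnx":
        raise ValueError(f"expected an .onnx file, got: {path.name}")
    # Forward slashes are accepted by Python on Windows and avoid backslash escapes.
    return path.resolve().as_posix()
```

Call it with a raw string holding the location of the file you downloaded, for example `checked_model_path(r"<path to your downloaded model file>")`, and paste the printed result into yolox.py.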

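Problem 2. System dependencies not configured

  • Install the following system dependencies as needed:
    • libmagic-dev (file type detection)
    • poppler-utils (images and PDFs)
    • tesseract-ocr (images and PDFs; install tesseract-lang for additional language support)
    • libreoffice (Microsoft Office documents)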

1. poppler

Download the Windows build:

oschwartz10612/poppler-windows: Download Poppler binaries packaged for Windows with dependencies (github.com)

Then add the path of its bin directory to the PATH environment variable.

2. tesseract

Download and configuration:

See this article: Tesseract-OCR Download, Installation, and Usage (CSDN blog)
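To verify that the poppler and tesseract executables are actually reachable through PATH, a short check like the one below may help. It is only a sketch: `find_tools` is a helper name invented here, and the tool names `pdftoppm`, `pdfinfo` (shipped with poppler), and `tesseract` are the usual command names, assumed to match your installation.

```python
import shutil
from typing import Dict, Iterable, Optional


def find_tools(names: Iterable[str]) -> Dict[str, Optional[str]]:
    """Map each executable name to its resolved path, or None if not on PATH."""
    return {name: shutil.which(name) for name in names}


if __name__ == "__main__":
    # pdftoppm and pdfinfo come with poppler; tesseract is the OCR engine's CLI.
    for tool, location in find_tools(["pdftoppm", "pdfinfo", "tesseract"]).items():
        print(f"{tool}: {location or 'NOT FOUND on PATH'}")
```

Run it from a newly opened terminal, because changes to PATH only apply to processes started after the change.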

3. libmagic-dev

Download and configuration:

How To Install libmagic-dev on Ubuntu 22.04 | Installati.one
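The guide linked above targets Ubuntu, so on Windows it is worth checking whether libmagic is usable from Python at all. The sketch below assumes libmagic is accessed through the `magic` module of the python-magic bindings (an assumption about your setup, not something this post confirms); `module_loads` is a hypothetical helper that only reports whether a module imports cleanly.

```python
import importlib


def module_loads(module_name: str) -> bool:
    """Return True if `module_name` imports cleanly, False otherwise."""
    try:
        importlib.import_module(module_name)
    except (ImportError, OSError):
        # OSError covers native libraries that are present but fail to load.
        return False
    return True


if __name__ == "__main__":
    print("magic importable:", module_loads("magic"))
```

A result of False suggests that either the bindings or the underlying libmagic library are missing.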
