Pitfalls encountered when deploying a fully functional Unstructured project locally on Windows
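I. Installing the unstructured Python package
Install from PyPI with pip.
To support all document types:
pip install "unstructured[all-docs]"
To support only the document types that need no extra dependencies, such as plain text files, HTML, XML, JSON and emails:
pip install unstructured
To add support for specific additional document types:
pip install "unstructured[docx,pptx]"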
Source: https://blog.csdn.net/lovechris00/article/details/137599877
Unstructured - 提取非结构化数据 (Extracting Unstructured Data), CSDN blog
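After running one of the pip commands above, it can help to confirm which packages Python can actually see. The following is a minimal sketch, not part of the original post: the helper names `is_importable` and `report_modules` are hypothetical, and it assumes that the `docx` and `pptx` extras install the python-docx and python-pptx packages, whose import names are `docx` and `pptx`.

```python
from importlib.util import find_spec


def is_importable(module_name: str) -> bool:
    """Return True if Python can locate the module (without fully importing it)."""
    try:
        return find_spec(module_name) is not None
    except (ImportError, ValueError):
        # Raised when a parent package is missing or a module has no spec
        return False


def report_modules(module_names):
    """Map each module name to whether it can be located."""
    return {name: is_importable(name) for name in module_names}


if __name__ == "__main__":
    # Assumed import names: "docx" (python-docx) and "pptx" (python-pptx)
    for name, found in report_modules(["unstructured", "docx", "pptx"]).items():
        print(f"{name}: {'found' if found else 'MISSING'}")
```

A module reported as MISSING usually means the matching extra was not installed in the interpreter you are currently running.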
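II. Common errors during use
Problem 1. Model not downloaded
Once the package is installed you can already use the basic features of unstructured, but not its pretrained layout-detection models, so the results are weaker. If you try to use a model from the official documentation, as in the code below, an error is raised: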
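from unstructured.partition.pdf import partition_pdf

fname = "C:\\Users\\Lenovo\\Desktop\\2023量化\\附件2 信息学院本科生素质量化考评办法.pdf"
# hi_res runs a layout-detection model; infer_table_structure fills text_as_html for tables
elements = partition_pdf(
    filename=fname,
    strategy="hi_res",
    hi_res_model_name="yolox",
    infer_table_structure=True,
)
for el in elements:
    # Elements are objects, not dicts, so use attributes rather than el["type"]
    if el.category == "Table":
        print(el.metadata.text_as_html)
    # print(el.category)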
Error message:
SSLError: (MaxRetryError("HTTPSConnectionPool(host='huggingface.co', port=443): Max retries exceeded with url: /unstructuredio/yolo_x_layout/resolve/main/yolox_l0.05.onnx (Caused by SSLError(SSLCertVerificationError(1, '[SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: unable to get local issuer certificate (_ssl.c:1129)')))"), '(Request ID: 757ef56e-88d9-4a7a-88ef-ff3fade2139c)')
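Solution: download the model manually from Hugging Face and modify the configuration
1. Download
Download page: unstructuredio/yolo_x_layout at main (huggingface.co)
2. Modify the configuration
Open this file inside the folder where the Python package is installed:
Lib\site-packages\unstructured_inference\models\yolox.py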
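If you are not sure which site-packages folder your interpreter uses (for example with several virtual environments), the file can be located from Python itself. This is a small sketch rather than an official procedure: `find_module_file` is a hypothetical helper, and the module name `unstructured_inference.models.yolox` is inferred from the path above. Note that looking up a dotted name imports its parent packages.

```python
from importlib.util import find_spec
from typing import Optional


def find_module_file(module_name: str) -> Optional[str]:
    """Return the file path of a module, or None if it cannot be located."""
    try:
        spec = find_spec(module_name)
    except (ImportError, ValueError):
        # A missing parent package raises ModuleNotFoundError (an ImportError)
        return None
    return spec.origin if spec is not None else None


if __name__ == "__main__":
    # Module name inferred from Lib\site-packages\unstructured_inference\models\yolox.py
    print(find_module_file("unstructured_inference.models.yolox"))
```

A result of None suggests that unstructured_inference is not installed in the interpreter you are running.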
Modify the code at line 37 (the line number may differ between versions):
MODEL_TYPES = {
    "yolox": LazyDict(
        model_path="path to your downloaded model file",
        label_map=YOLOX_LABEL_MAP,
    ),
    "yolox_tiny": LazyDict(
        model_path=LazyEvaluateInfo(
            download_if_needed_and_get_local_path,
            "unstructuredio/yolo_x_layout",
            "yolox_tiny.onnx",
        ),
        label_map=YOLOX_LABEL_MAP,
    ),
    "yolox_quantized": LazyDict(
        model_path=LazyEvaluateInfo(
            download_if_needed_and_get_local_path,
            "unstructuredio/yolo_x_layout",
            "yolox_l0.05_quantized.onnx",
        ),
        label_map=YOLOX_LABEL_MAP,
    ),
}
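Before pasting a local path into `model_path`, it may be worth checking that the path really points at the downloaded ONNX file, since a mistyped Windows path only fails later when the model is loaded. The sketch below is an illustration under stated assumptions: `resolve_model_path` is a hypothetical helper, and the example location `C:\models\yolox_l0.05.onnx` is made up (only the file name comes from the error message above).

```python
from pathlib import Path


def resolve_model_path(path_str: str) -> str:
    """Return the absolute path of an existing .onnx file, or raise an error."""
    path = Path(path_str).expanduser()
    if not path.is_file():
        raise FileNotFoundError(f"Model file not found: {path}")
    if path.suffix.lower() != ".onnx":
        raise ValueError(f"Expected an .onnx file, got: {path.name}")
    return str(path.resolve())


if __name__ == "__main__":
    # Hypothetical location; a raw string keeps the backslashes literal
    try:
        print(resolve_model_path(r"C:\models\yolox_l0.05.onnx"))
    except (FileNotFoundError, ValueError) as exc:
        print(exc)
```

Using a raw string (or doubled backslashes) for the value avoids accidental escape sequences such as `\t` in Windows paths.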
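Problem 2. Dependencies not configured
Install the following system dependencies as needed:
- libmagic-dev (file type detection)
- poppler-utils (images and PDFs)
- tesseract-ocr (images and PDFs; install tesseract-lang for additional language support)
- libreoffice (Microsoft Office documents)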
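A quick way to see which of these tools are already reachable is to look for their executables on PATH. This is a rough sketch, not an official check: the helper `find_tools` is hypothetical, and the executable names are assumptions (`pdftoppm` ships with poppler, `tesseract` with Tesseract, `soffice` with LibreOffice). libmagic is a library rather than a command, so it is not covered here.

```python
import shutil

# Assumed executable name -> dependency it belongs to
REQUIRED_TOOLS = {
    "pdftoppm": "poppler",
    "tesseract": "tesseract-ocr",
    "soffice": "libreoffice",
}


def find_tools(executables):
    """Map each executable name to its full path, or None if not on PATH."""
    return {exe: shutil.which(exe) for exe in executables}


if __name__ == "__main__":
    for exe, location in find_tools(REQUIRED_TOOLS).items():
        status = location if location else "NOT FOUND on PATH"
        print(f"{REQUIRED_TOOLS[exe]} ({exe}): {status}")
```

Anything reported as not found either is not installed or has not been added to the PATH environment variable yet.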
1. poppler
Download the Windows build:
oschwartz10612/poppler-windows: Download Poppler binaries packaged for Windows with dependencies (github.com)
Then add the path of its bin folder to the PATH environment variable.
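If you cannot or do not want to change the system-wide PATH, one alternative is to extend PATH for the current Python process only, before calling unstructured. The sketch below assumes a hypothetical extraction folder (`C:\tools\poppler\Library\bin`); substitute the bin folder of your own download. The helper name `prepend_to_path` is also hypothetical.

```python
import os


def prepend_to_path(directory: str) -> None:
    """Put a directory at the front of PATH for this process, without duplicates."""
    parts = [p for p in os.environ.get("PATH", "").split(os.pathsep) if p]
    if directory not in parts:
        os.environ["PATH"] = os.pathsep.join([directory] + parts)


if __name__ == "__main__":
    # Hypothetical folder; use the bin folder of your own poppler download
    prepend_to_path(r"C:\tools\poppler\Library\bin")
    print(os.environ["PATH"].split(os.pathsep)[0])
```

This change lasts only for the running process and the programs it starts; the system setting is left untouched.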
2. tesseract
Download and configuration:
See this article: Tesseract-OCR 下载安装和使用 (CSDN blog)
3. libmagic-dev
Download and configuration:
How To Install libmagic-dev on Ubuntu 22.04 | Installati.one