Notes on Using OCR to Recognize Scanned Documents

2024-10-21 Monday · By cctags · Posted in Tools · 0 Comments

/*
 * These notes have been updated several times.
 */

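0. Problem

I needed to recognize and extract the content of a scanned document. These are my notes.

1. Converting the PDF to images

The source document is a PDF, so the first step is to extract the relevant pages and convert them to images.

1. Using qpdf

qpdf can extract selected pages from a PDF and save them as a new PDF.

Install the qpdf package: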

sudo apt install qpdf

Extract the pages. For example, to take pages 30 through 35 of a.pdf and save them as b.pdf:

qpdf a.pdf --pages a.pdf 30-35 -- b.pdf
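When several page ranges have to be cut out, the same call can be scripted. The sketch below only builds the argument list for the command shown above and hands it to subprocess.run. The helper names qpdf_extract_args and extract_pages are made up for this illustration and are not part of qpdf.

```python
import subprocess


def qpdf_extract_args(src, first, last, dst):
    """Build the argv for: qpdf SRC --pages SRC FIRST-LAST -- DST."""
    if first < 1 or last < first:
        raise ValueError("pages are 1-based and first must not exceed last")
    return ["qpdf", src, "--pages", src, f"{first}-{last}", "--", dst]


def extract_pages(src, first, last, dst):
    """Run qpdf; raises CalledProcessError on any non-zero exit status."""
    subprocess.run(qpdf_extract_args(src, first, last, dst), check=True)
```

Calling extract_pages("a.pdf", 30, 35, "b.pdf") is meant to be equivalent to the command line above.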

2. Converting the PDF to images

Install the poppler-utils package:

sudo apt install poppler-utils

Convert to images:

pdftoppm b.pdf output -png
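The later steps process the page images one by one, so it helps to have them in page order. As far as I can tell, pdftoppm names the files after the prefix plus the page number (output-1.png, output-2.png, and so on, zero-padded when the document has many pages). The sketch below sorts by the number in the file name, so it works with or without padding. The function names are hypothetical.

```python
import re
from pathlib import Path

# Matches the trailing page number in names such as "output-07.png".
_PAGE_RE = re.compile(r"-(\d+)\.png$")


def page_number(path):
    """Return the page number encoded in a name like 'output-07.png'."""
    match = _PAGE_RE.search(Path(path).name)
    if match is None:
        raise ValueError(f"no page number in {path!r}")
    return int(match.group(1))


def pages_in_order(directory=".", prefix="output"):
    """List the PNG files for the given prefix, sorted by page number."""
    return sorted(Path(directory).glob(f"{prefix}-*.png"), key=page_number)
```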

2.1. Using tesseract

Build and install tesseract:

The tesseract project documents the build process.

git clone --recurse-submodules https://github.com/tesseract-ocr/tesseract.git
cd tesseract
./autogen.sh
./configure --prefix=$HOME/local/
make
make install

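Download the language files:

Per the project's instructions, download the Chinese language files and save them under ${HOME}/local/share/tessdata/. The tessdata_best and tessdata_fast variants can be used as well.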
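A quick way to confirm the language files are in place before running tesseract is to look for one .traineddata file per language. This is a minimal sketch, assuming the files keep their standard names (chi_sim.traineddata, eng.traineddata); the function name is made up for illustration.

```python
from pathlib import Path


def missing_traineddata(tessdata_dir, langs="chi_sim+eng"):
    """Return the languages of a '-l' style string that lack a data file."""
    root = Path(tessdata_dir).expanduser()
    return [lang for lang in langs.split("+")
            if not (root / f"{lang}.traineddata").is_file()]
```

An empty list from missing_traineddata("~/local/share/tessdata") means every requested language file was found.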

Run it:

export TESSDATA_PREFIX=${HOME}/local/share/tessdata
./tesseract input.png - -l chi_sim+eng | tee -a output.txt

The recognized text ends up in output.txt.
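With one image per page, the command has to be repeated for every file. The following sketch loops over a list of images and appends each page's text to one file, mirroring the tee -a step. The function names and the injectable run parameter are assumptions made for this example; it also assumes tesseract is on PATH and TESSDATA_PREFIX is already exported.

```python
import subprocess


def tesseract_args(image, langs="chi_sim+eng", exe="tesseract"):
    """argv for: tesseract IMAGE - -l LANGS ('-' sends the text to stdout)."""
    return [exe, str(image), "-", "-l", langs]


def ocr_images(images, out_path, run=subprocess.run):
    """OCR every image in the given order and append the text to out_path."""
    with open(out_path, "a", encoding="utf-8") as out:
        for image in images:
            done = run(tesseract_args(image), capture_output=True,
                       encoding="utf-8", check=True)
            out.write(done.stdout)
```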

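2.2. Using chineseocr_lite

chineseocr_lite is an open-source, ultra-lightweight Chinese OCR project that also supports vertical text.

Project page: https://github.com/DayBreak-u/chineseocr_lite

Installation:

# First set up a Python 3.6 virtual environment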
git clone https://github.com/DayBreak-u/chineseocr_lite.git
cd chineseocr_lite
pip3 install -r requirements.txt

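Usage:

python backend/main.py

Then open it in a browser.

Adapting the program:

The earlier steps already saved the pages as images. The snippet below runs each image through the OcrHandle class implemented in model.py: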

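import glob

from PIL import Image
from model import OcrHandle

ocrhandle = OcrHandle()
short_size = 960  # assumed value: target length of the image's short side

for filename in sorted(glob.glob('*.png')):
    # Run OCR on each page image, in file-name order
    with Image.open(filename) as img:
        result = ocrhandle.text_predict(img.convert('RGB'), short_size)
    print(result)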

The result variable holds the complete recognition output; parse it as needed.
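As one possible way to parse it, the sketch below keeps only the text. It assumes each entry of result is a (box, text, score) triple and that the text may carry a leading running index such as "1、 "; both are assumptions about the library's output and should be checked against what print(result) actually shows. The function name is hypothetical.

```python
import re

# Optional running index at the start of each text field, e.g. "12、 ".
_INDEX_PREFIX = re.compile(r"^\d+、\s*")


def result_to_text(result):
    """Join the text fields of (box, text, score) entries, one per line."""
    lines = []
    for _box, text, _score in result:
        lines.append(_INDEX_PREFIX.sub("", text))
    return "\n".join(lines)
```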

2.3. Using EasyOCR

EasyOCR is an open-source project. Project page: https://github.com/jaidedai/easyocr

Installation:

pip3 install easyocr

Dependencies are installed automatically.

Usage:

import easyocr

reader = easyocr.Reader(['ch_sim','en'])
result = reader.readtext('input.jpg', detail=0)
print(result)
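With detail=0 the call returns only the recognized strings. Without it, readtext returns (bounding box, text, confidence) entries, which makes it possible to drop low-confidence fragments. A minimal sketch of that filtering step follows; the function name and the 0.5 threshold are arbitrary choices for illustration.

```python
def confident_text(entries, min_confidence=0.5):
    """Keep the text of (bbox, text, confidence) entries above a threshold."""
    return [text for _bbox, text, conf in entries if conf >= min_confidence]
```

It would be used as confident_text(reader.readtext('input.jpg')), with reader created as above.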

2.4. Using MinerU

MinerU is an open-source project. Project page: https://github.com/opendatalab/MinerU

Installation:

# First set up a Python 3.10 virtual environment
conda create -n mineru python=3.10
conda activate mineru

pip install -U magic-pdf[full] --extra-index-url https://wheels.myhloli.com

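One problem came up during installation: the usual pip install magic-pdf installs a very old version, only 0.6.1, which does not match the usage described in the documentation.

Download the model weights:

Follow the project's documentation to download them: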

pip install modelscope
wget https://gcore.jsdelivr.net/gh/opendatalab/MinerU@master/scripts/download_models.py -O download_models.py
python download_models.py

The downloaded models are stored under ~/.cache/huggingface/hub. In addition, a magic-pdf.json configuration file is generated in the user's home directory.

Usage:
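To see what the generated configuration points at, the file can be read as ordinary JSON. This is only a sketch: the key name "models-dir" is an assumption about the file's contents and may differ between versions, and both function names are made up here.

```python
import json
from pathlib import Path


def read_magic_pdf_config(path="~/magic-pdf.json"):
    """Load the JSON configuration file and return it as a dict."""
    with open(Path(path).expanduser(), encoding="utf-8") as f:
        return json.load(f)


def configured_value(config, key="models-dir"):
    """Look up one setting; returns None when the (assumed) key is absent."""
    return config.get(key)
```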

$ magic-pdf --help

Usage: magic-pdf [OPTIONS]

Options:
  -v, --version                display the version and exit
  -p, --path PATH              local filepath or directory. support PDF, PPT,
                               PPTX, DOC, DOCX, PNG, JPG files  [required]
  -o, --output-dir PATH        output local directory  [required]
  -m, --method [ocr|txt|auto]  the method for parsing pdf. ocr: using ocr
                               technique to extract information from pdf. txt:
                               suitable for the text-based pdf only and
                               outperform ocr. auto: automatically choose the
                               best method for parsing pdf from ocr and txt.
                               without method specified, auto will be used by
                               default.
  -l, --lang TEXT              Input the languages in the pdf (if known) to
                               improve OCR accuracy.  Optional. You should
                               input "Abbreviation" with language form url: ht
                               tps://paddlepaddle.github.io/PaddleOCR/latest/e
                               n/ppocr/blog/multi_languages.html#5-support-
                               languages-and-abbreviations
  -d, --debug BOOLEAN          Enables detailed debugging information during
                               the execution of the CLI commands.
  -s, --start INTEGER          The starting page for PDF parsing, beginning
                               from 0.
  -e, --end INTEGER            The ending page for PDF parsing, beginning from
                               0.
  --help                       Show this message and exit.

For example:

magic-pdf -p input -o output -m ocr -s 100 -e 102

When the run finishes, the results are in the output directory.
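Note that -s and -e count pages from 0, whereas qpdf page ranges start at 1. The sketch below converts a 1-based range into the matching magic-pdf arguments. It assumes the page given to -e is itself included in the run, which the help text does not state explicitly; the function name is hypothetical.

```python
def magic_pdf_args(path, out_dir, first_page, last_page, method="ocr"):
    """argv for magic-pdf covering the 1-based pages first_page..last_page."""
    if first_page < 1 or last_page < first_page:
        raise ValueError("pages are 1-based and first must not exceed last")
    return ["magic-pdf", "-p", path, "-o", out_dir, "-m", method,
            "-s", str(first_page - 1), "-e", str(last_page - 1)]
```

Under that assumption, the example command above would correspond to pages 101 through 103 of the document.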

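2.5. Using pdf-craft

pdf-craft is an open-source project that converts PDF files to formats such as Markdown and EPUB. Project page: https://github.com/oomol-lab/pdf-craft

Installation:

Following the project's Chinese documentation:

# First set up a Python 3.10.16 virtual environment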
conda create -n pdfcraft python=3.10.16
conda activate pdfcraft

pip3 install pdf-craft
pip3 install onnxruntime==1.21.0

Usage:

Write the following Python file and run it:

#!/usr/bin/env python3

from pdf_craft import PDFPageExtractor, MarkDownWriter

def main():
    extractor = PDFPageExtractor(
            device="cpu",
            model_dir_path="/data/.cache/pdf-craft",
            )

    markdown_path = 'output/out.md'

    with MarkDownWriter(markdown_path, "images", "utf-8") as md:
        for block in extractor.extract(pdf="a.pdf"):
            md.write(block)

if __name__ == '__main__':
    main()

Here model_dir_path is the folder into which the AI models are downloaded and installed. After download and installation, the directory looks like this:

├── ch_ppocr_server_v2.0
│   └── ppocr_keys_v1.txt
├── doclayout_yolo_ft.pt
└── ppocrv4
    ├── cls
    │   └── cls.onnx
    ├── det
    │   └── det.onnx
    └── rec
        └── rec.onnx

When the program finishes, the output is stored in the directory that contains markdown_path, with this structure:

├── images
│   ├── ***.png
│   └── ***.png
└── out.md

Here out.md is the Markdown file generated from the recognition.
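A simple sanity check on the result is to confirm that every image referenced from out.md actually exists next to it. This is a rough sketch, assuming the images are referenced with standard Markdown image syntax and paths relative to the Markdown file; the function name is made up for illustration.

```python
import re
from pathlib import Path

# Captures the target of a Markdown image link: ![alt](target ...)
_IMAGE_LINK = re.compile(r"!\[[^\]]*\]\(([^)\s]+)")


def missing_images(markdown_path):
    """Return image paths referenced in the Markdown file that do not exist."""
    md = Path(markdown_path)
    text = md.read_text(encoding="utf-8")
    return [ref for ref in _IMAGE_LINK.findall(text)
            if not (md.parent / ref).is_file()]
```

An empty list from missing_images("output/out.md") means all referenced images were found.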
