Partially solved | ocrmypdf leaves extra spaces in the text after OCR on Chinese PDFs - 526互联

1. Problem

ocrmypdf was installed using the Windows installation method; for details see:

https://media.readthedocs.org/pdf/ocrmypdf/latest/ocrmypdf.pdf

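After ocrmypdf runs OCR on a Chinese PDF, the recognized text contains extra spaces between characters. The links below point to the Tesseract parameter that is supposed to address this (shown after the links):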

https://github.com/tesseract-ocr/tesseract/issues/781

https://github.com/ocrmypdf/OCRmyPDF/issues/715

https://github.com/tesseract-ocr/tesseract/issues/991

preserve_interword_spaces=1

As asked in this comment, https://github.com/ocrmypdf/OCRmyPDF/issues/715#issuecomment-849422552, how can this parameter be set in ocrmypdf?

2. Solution process

My first attempt was the --tesseract-config option (section 9.2.5 of the PDF below):

https://media.readthedocs.org/pdf/ocrmypdf/latest/ocrmypdf.pdf

Command:

ocrmypdf  -l chi_sim+eng --tesseract-oem 1 --tesseract-pagesegmode 6 --tesseract-config C:\Users\Administrator\Desktop\my.cfg C:\Users\Administrator\Desktop\11.pdf 121.pdf

Here my.cfg is a local file with the following content:

preserve_interword_spaces 1
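The attempt above could be scripted roughly as follows. This is only a sketch: the helper names `render_tesseract_config` and `build_ocrmypdf_command` are made up for illustration, and the flags are copied from the command shown above rather than checked against any particular ocrmypdf version.

```python
def render_tesseract_config(settings):
    """Render settings as 'name value' lines, as in my.cfg above."""
    return "".join(f"{name} {value}\n" for name, value in settings.items())


def build_ocrmypdf_command(input_pdf, output_pdf, config_path):
    """Return the argument list for the ocrmypdf call shown above."""
    return [
        "ocrmypdf",
        "-l", "chi_sim+eng",
        "--tesseract-oem", "1",
        "--tesseract-pagesegmode", "6",
        "--tesseract-config", str(config_path),
        str(input_pdf),
        str(output_pdf),
    ]


# Hypothetical usage (requires ocrmypdf on PATH):
#   from pathlib import Path
#   import subprocess
#   Path("my.cfg").write_text(
#       render_tesseract_config({"preserve_interword_spaces": 1}),
#       encoding="utf-8",
#   )
#   subprocess.run(build_ocrmypdf_command("11.pdf", "121.pdf", "my.cfg"), check=True)
```

Passing the command as a list avoids quoting problems with Windows paths that contain spaces.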

In my tests this had no effect.

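The reason: in https://github.com/ocrmypdf/OCRmyPDF/issues/885#issuecomment-1033367021 a user explains that when the OEM is set to use the LSTM model (OEM 1 or 2, see the listing below), --tesseract-config does not take effect. My guess is that this is a precedence issue: by default, the config packed into the LSTM trained data is used first. So the question becomes how to change the setting inside chi_sim.traineddata.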

The listing below comes from: https://muthu.co/all-tesseract-ocr-options/

See also: https://zhuanlan.zhihu.com/p/64470012

OCR options:
  --tessdata-dir PATH   Specify the location of tessdata path.
  --user-words PATH     Specify the location of user words file.
  --user-patterns PATH  Specify the location of user patterns file.
  -l LANG[+LANG]        Specify language(s) used for OCR.
  -c VAR=VALUE          Set value for config variables.
                        Multiple -c arguments are allowed.
  --psm NUM             Specify page segmentation mode.
  --oem NUM             Specify OCR Engine mode.
NOTE: These options must occur before any configfile.
Page segmentation modes:
  0    Orientation and script detection (OSD) only.
  1    Automatic page segmentation with OSD.
  2    Automatic page segmentation, but no OSD, or OCR.
  3    Fully automatic page segmentation, but no OSD. (Default)
  4    Assume a single column of text of variable sizes.
  5    Assume a single uniform block of vertically aligned text.
  6    Assume a single uniform block of text.
  7    Treat the image as a single text line.
  8    Treat the image as a single word.
  9    Treat the image as a single word in a circle.
 10    Treat the image as a single character.
 11    Sparse text. Find as much text as possible in no particular order.
 12    Sparse text with OSD.
 13    Raw line. Treat the image as a single text line,
       bypassing hacks that are Tesseract-specific.
OCR Engine modes: (see https://github.com/tesseract-ocr/tesseract/wiki#linux)
  0    Legacy engine only.
  1    Neural nets LSTM engine only.
  2    Legacy + LSTM engines.
  3    Default, based on what is available.
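To see how the options in this listing combine, here is a small sketch that assembles a direct tesseract call with a `-c VAR=VALUE` override. The function name `build_tesseract_command` and the file names in the usage comment are hypothetical; the sketch only builds the argument list and makes no claim about whether the override changes the output for a given model.

```python
def build_tesseract_command(image, output_base, lang="chi_sim",
                            oem=1, psm=6, variables=None):
    """Build: tesseract IMAGE OUTPUTBASE -l LANG --oem N --psm N [-c VAR=VALUE ...]."""
    cmd = [
        "tesseract", str(image), str(output_base),
        "-l", lang,
        "--oem", str(oem),
        "--psm", str(psm),
    ]
    # Each config variable gets its own -c argument; the listing notes that
    # multiple -c arguments are allowed.
    for name, value in (variables or {}).items():
        cmd += ["-c", f"{name}={value}"]
    return cmd


# Hypothetical usage (requires tesseract on PATH and a page image page.png):
#   import subprocess
#   subprocess.run(
#       build_tesseract_command("page.png", "out",
#                               variables={"preserve_interword_spaces": 1}),
#       check=True,
#   )
```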

So the question now is how to change the setting inside chi_sim.traineddata.

Following this link: https://0o0.me/java/tesseract-update-traineddata.html, take the local file (C:\Program Files\Tesseract-OCR\tessdata\chi_sim.traineddata) and unpack the official trained data:

Press Win+R, type cmd, and change to the folder that contains the trained data:

cd "C:\Program Files\Tesseract-OCR\tessdata"

Then run:

combine_tessdata -u chi_sim.traineddata chi_sim

This produces the unpacked component files, all prefixed with chi_sim, including chi_sim.config.

Edit the settings in chi_sim.config, which are simply the recognition parameters Tesseract uses with this LSTM model, and add two lines:

# avoid extra spaces

preserve_interword_spaces 1

Then repack:

combine_tessdata chi_sim

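(The other files starting with chi_sim can be deleted afterwards, keeping only chi_sim.traineddata.) You might ask why not edit chi_sim.traineddata directly. My guess is that the file is a compiled (packed) container; when I tried editing it directly, ocrmypdf could not read it and failed with an error.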
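The unpack, edit, and repack steps above can be sketched as one script. This is a rough outline under some assumptions: `combine_tessdata` is on PATH, the script has write access to the tessdata folder, and the function names `add_setting` and `patch_traineddata` are made up for illustration. The backup copy is an extra precaution that is not part of the steps above.

```python
import shutil
import subprocess
from pathlib import Path

SETTING = "preserve_interword_spaces 1"


def add_setting(config_text, setting=SETTING):
    """Append the setting to the config text unless that exact line already exists."""
    if any(line.strip() == setting for line in config_text.splitlines()):
        return config_text
    if config_text and not config_text.endswith("\n"):
        config_text += "\n"
    return config_text + "# avoid extra spaces\n" + setting + "\n"


def patch_traineddata(tessdata_dir, lang="chi_sim"):
    """Unpack LANG.traineddata, patch LANG.config, and repack."""
    tessdata = Path(tessdata_dir)
    traineddata = tessdata / f"{lang}.traineddata"
    # Assumption: keep a backup so the original model can be restored.
    shutil.copy2(traineddata, tessdata / f"{lang}.traineddata.bak")
    # Same command as above: combine_tessdata -u chi_sim.traineddata chi_sim
    subprocess.run(["combine_tessdata", "-u", traineddata.name, lang],
                   cwd=tessdata, check=True)
    config = tessdata / f"{lang}.config"
    text = config.read_text(encoding="utf-8") if config.exists() else ""
    config.write_text(add_setting(text), encoding="utf-8")
    # Same command as above: combine_tessdata chi_sim
    subprocess.run(["combine_tessdata", lang], cwd=tessdata, check=True)


# Hypothetical usage:
#   patch_traineddata(r"C:\Program Files\Tesseract-OCR\tessdata")
```

`add_setting` leaves the text unchanged when the line is already present, so running the script twice does not duplicate the setting.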

Run the command again (--sidecar 121.txt writes the recognized text to a txt file):

ocrmypdf  --force-ocr -l chi_sim --sidecar 121.txt  C:\Users\Administrator\Desktop\11.pdf 121.pdf

Result: the output 121.txt has no extra spaces, but text copied from 121.pdf still does. (Without the change to chi_sim.traineddata, both contain extra spaces.)
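One way to check this result without reading the text by eye is to count whitespace runs that sit between two Chinese characters. The sketch below is an illustration only: `count_cjk_gaps` is a made-up name, and it only looks at the basic CJK Unified Ideographs range (U+4E00 to U+9FFF), so punctuation and rarer characters are ignored.

```python
import re

# A run of spaces or tabs with a Han character directly before and after it.
_CJK_GAP = re.compile(r"(?<=[\u4e00-\u9fff])[ \t]+(?=[\u4e00-\u9fff])")


def count_cjk_gaps(text):
    """Count whitespace runs that separate two adjacent Chinese characters."""
    return len(_CJK_GAP.findall(text))


# Hypothetical usage on the sidecar file (assuming it is UTF-8 encoded):
#   from pathlib import Path
#   print(count_cjk_gaps(Path("121.txt").read_text(encoding="utf-8")))
```

A count of 0 on the sidecar text, compared with a large count on text copied out of the PDF, would match the behavior described above.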

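My tests match what is reported in https://github.com/ocrmypdf/OCRmyPDF/issues/715 (although that report used an older ocrmypdf version): only the txt output is free of extra spaces, while text copied from the PDF still has them.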

The ocrmypdf author, @jbarlow83, has consistently said this is a PDF viewer problem, but in practice it is not a viewer problem.

In other words, this only partially solves (by way of a workaround) the problem of extra spaces in the PDF text layer.

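Without modifying chi_sim.traineddata, none of the many methods I have tested so far worked. The author has never offered a working solution either, and Japanese and Korean users (https://github.com/tesseract-ocr/tesseract/issues/1009) run into the same problem.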

3. Final approach

Modify chi_sim.traineddata and output a txt file to copy the text from. If anyone more experienced can give me further hints, thank you!

4. Acknowledgments

Thanks to the authors of the links above.
