Improving Tesseract recognition quality: 1. Image processing 2. Page segmentation methods 3. Dictionaries, word lists, and patterns

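I'm 端庄绿草, a blogger on 靠谱客. This post covers how to improve Tesseract's recognition quality through image processing, page segmentation methods, and dictionaries, word lists, and patterns. I'm sharing it here in the hope that it serves as a useful reference.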

1. Image processing

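Tesseract applies several image-processing operations internally (using the Leptonica library) before it runs OCR.
To inspect how Tesseract processed an image, set the tessedit_write_images variable to true.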
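
As a minimal sketch, the variable can be set for a single run with the -c option of the tesseract command line. The helper name debug_command and the file names are illustrative assumptions, not part of the original post.

```python
def debug_command(image_path, output_base):
    """Build a tesseract command that also writes out the processed image."""
    return [
        "tesseract", image_path, output_base,
        # -c sets one configuration variable for this run only.
        "-c", "tessedit_write_images=true",
    ]

# Usage (assumes the tesseract binary is on PATH and page.png exists):
#   import subprocess
#   subprocess.run(debug_command("page.png", "page_out"), check=True)
```

Comparing the written image with the input shows whether Tesseract's own preprocessing is the weak link.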

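Rescaling

Tesseract works best on images with a resolution of at least 300 DPI, so it is worth rescaling an image to about 300 DPI before OCR.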
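
The rescaling step might look like the following sketch, which uses Pillow. The function names and the fallback_dpi guess for files that record no resolution are assumptions made for illustration.

```python
def size_for_dpi(width, height, current_dpi, target_dpi=300):
    """Return the pixel size that yields target_dpi for the same physical page."""
    scale = target_dpi / current_dpi
    return round(width * scale), round(height * scale)

def rescale_to_300_dpi(src_path, dst_path, fallback_dpi=72):
    """Resample an image so that it has roughly 300 DPI."""
    from PIL import Image  # Pillow; imported here so size_for_dpi works without it

    img = Image.open(src_path)
    # Not every file records its DPI; fallback_dpi is only a guess for that case.
    dpi = float(img.info.get("dpi", (fallback_dpi, fallback_dpi))[0]) or fallback_dpi
    new_size = size_for_dpi(img.width, img.height, dpi)
    img.resize(new_size, Image.LANCZOS).save(dst_path, dpi=(300, 300))
```

For example, a 1000 x 500 pixel scan made at 150 DPI would be resampled to 2000 x 1000 pixels.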

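Binarization

Binarization converts the image to pure black and white. Tesseract does this internally, but the result is not always good, so it is best to binarize the image yourself before OCR, for example with Pillow.

Binarization also helps remove noise such as dark specks or color variations.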
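
The post does not name a particular binarization algorithm. As one common choice, the sketch below picks a global threshold with Otsu's method and applies it with Pillow; the function names are illustrative assumptions.

```python
def otsu_threshold(hist):
    """Return the gray level that maximizes between-class variance (Otsu)."""
    total = sum(hist)
    sum_all = sum(level * count for level, count in enumerate(hist))
    weight_bg = 0
    sum_bg = 0.0
    best_level, best_var = 0, -1.0
    for level, count in enumerate(hist):
        weight_bg += count
        sum_bg += level * count
        if weight_bg == 0:
            continue  # no background pixels yet
        weight_fg = total - weight_bg
        if weight_fg == 0:
            break  # every pixel is already background
        mean_bg = sum_bg / weight_bg
        mean_fg = (sum_all - sum_bg) / weight_fg
        var = weight_bg * weight_fg * (mean_bg - mean_fg) ** 2
        if var > best_var:
            best_level, best_var = level, var
    return best_level

def binarize(src_path, dst_path):
    """Convert an image to black and white using the Otsu threshold."""
    from PIL import Image  # Pillow; imported here so otsu_threshold works without it

    gray = Image.open(src_path).convert("L")
    level = otsu_threshold(gray.histogram())
    # Pixels brighter than the threshold become white, the rest black.
    gray.point(lambda p: 255 if p > level else 0).save(dst_path)
```

A single global threshold works best on evenly lit pages; unevenly lit scans may need a locally adaptive method instead.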

Rotation / deskewing

Rotate a skewed page so that it is upright and its text lines run horizontally.
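
The post does not say how to find the skew angle. One simple option is a projection-profile search, sketched below: try several candidate angles and keep the one that packs the dark pixels into the fewest rows. The function names, the 128 gray threshold, and the search range of -10 to 10 degrees are all assumptions.

```python
import math
from collections import Counter

def profile_score(dark_pixels, angle_deg):
    """Sum of squared row counts after shearing the rows by angle_deg."""
    slope = math.tan(math.radians(angle_deg))
    rows = Counter(round(y - x * slope) for x, y in dark_pixels)
    return sum(count * count for count in rows.values())

def estimate_skew(dark_pixels, candidates):
    """Return the candidate angle whose row profile is most concentrated."""
    return max(candidates, key=lambda angle: profile_score(dark_pixels, angle))

def deskew(src_path, dst_path, threshold=128):
    """Estimate the skew of a page and rotate it back."""
    from PIL import Image  # Pillow; imported here so the helpers work without it

    gray = Image.open(src_path).convert("L")
    px = gray.load()
    dark = [(x, y) for y in range(gray.height)
            for x in range(gray.width) if px[x, y] < threshold]
    angle = estimate_skew(dark, [a / 2 for a in range(-20, 21)])
    # A positive angle means lines slope down to the right; Pillow rotates
    # counter-clockwise for positive angles, which undoes that skew.
    gray.rotate(angle, expand=True, fillcolor=255).save(dst_path)
```

The pure-Python pixel scan is slow on large scans, so in practice the search is usually run on a downscaled copy.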

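Border removal

Scanned pages often have dark borders around them. These can be mistaken for extra characters, so crop them away before OCR.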
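
A minimal sketch of border removal with Pillow, assuming the border has a known, fixed width on every side. The function names and the margin parameter are hypothetical.

```python
def inner_box(width, height, margin):
    """Crop box (left, upper, right, lower) that trims margin pixels per side."""
    if margin < 0 or 2 * margin >= min(width, height):
        raise ValueError("margin must be >= 0 and leave a non-empty image")
    return (margin, margin, width - margin, height - margin)

def remove_border(src_path, dst_path, margin):
    """Crop a fixed-width border off every side of an image."""
    from PIL import Image  # Pillow; imported here so inner_box works without it

    img = Image.open(src_path)
    img.crop(inner_box(img.width, img.height, margin)).save(dst_path)
```

When the border width varies from scan to scan, the box has to be detected from the image content rather than fixed in advance.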

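2. Page segmentation methods

By default, Tesseract treats an image as a full page of a document. If you only need to recognize a specific region, choose a different segmentation mode with the --psm argument.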

  0    Orientation and script detection (OSD) only.
  1    Automatic page segmentation with OSD.
  2    Automatic page segmentation, but no OSD, or OCR.
  3    Fully automatic page segmentation, but no OSD. (Default)
  4    Assume a single column of text of variable sizes.
  5    Assume a single uniform block of vertically aligned text.
  6    Assume a single uniform block of text.
  7    Treat the image as a single text line.
  8    Treat the image as a single word.
  9    Treat the image as a single word in a circle.
 10    Treat the image as a single character.
 11    Sparse text. Find as much text as possible in no particular order.
 12    Sparse text with OSD.
 13    Raw line. Treat the image as a single text line,
			bypassing hacks that are Tesseract-specific.
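
As a sketch, a mode from the list above can be passed on the command line like this. The helper name ocr_command and the file names are illustrative assumptions.

```python
def ocr_command(image_path, output_base, psm):
    """Build a tesseract command that uses the given page segmentation mode."""
    if not 0 <= psm <= 13:
        raise ValueError("psm must be between 0 and 13")
    return ["tesseract", image_path, output_base, "--psm", str(psm)]

# Usage (assumes the tesseract binary is on PATH):
#   import subprocess
#   # Mode 7 treats the whole image as a single line of text.
#   subprocess.run(ocr_command("line.png", "line_out", 7), check=True)
```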

3. Dictionaries, word lists, and patterns

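By default, Tesseract is tuned to recognize ordinary sentences. To recognize
something else, such as receipts, price lists, or codes, take the following steps: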
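1. Choose a suitable segmentation method.
Reference:
https://github.com/tesseract-ocr/tesseract/wiki/ImproveQuality#page-segmentation-method

2. Disable the dictionaries if most of the text to recognize is not dictionary words, by setting load_system_dawg and load_freq_dawg to false.
Reference:
https://github.com/tesseract-ocr/tesseract/wiki/ControlParams
3. Add the words you expect to the word list, which improves Tesseract's accuracy on them, or add character patterns.
Reference:
https://github.com/tesseract-ocr/tesseract/blob/master/doc/tesseract.1.asc#config-files-and-augmenting-with-user-data
4. If you only want to recognize a subset of the characters, for example digits only, set the tessedit_char_whitelist parameter.
Reference:
https://github.com/tesseract-ocr/tesseract/wiki/ControlParams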
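
Steps 2 to 4 above can be sketched as extra command-line arguments. The helper name non_dictionary_args and its parameters are assumptions; the -c variables and the --user-words and --user-patterns options are taken from the references above.

```python
def non_dictionary_args(whitelist=None, user_words=None, user_patterns=None,
                        disable_dictionaries=True):
    """Build extra tesseract arguments for text that is not ordinary prose."""
    args = []
    if disable_dictionaries:
        # Step 2: do not load the system and frequent-word dictionaries.
        args += ["-c", "load_system_dawg=false", "-c", "load_freq_dawg=false"]
    if user_words:
        # Step 3: path to a text file with one expected word per line.
        args += ["--user-words", user_words]
    if user_patterns:
        # Step 3: path to a text file of character patterns.
        args += ["--user-patterns", user_patterns]
    if whitelist:
        # Step 4: restrict recognition to these characters.
        args += ["-c", f"tessedit_char_whitelist={whitelist}"]
    return args

# Usage (assumes the tesseract binary is on PATH); --psm 7 reads one line:
#   import subprocess
#   cmd = ["tesseract", "price.png", "price_out", "--psm", "7"]
#   subprocess.run(cmd + non_dictionary_args(whitelist="0123456789"), check=True)
```

Whether a given variable takes effect can depend on the Tesseract version and OCR engine in use, so it is worth checking the output on a known sample.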

References:
https://github.com/tesseract-ocr/tesseract/wiki/ImproveQuality
https://blog.csdn.net/hechaojie_com/article/details/81560153