Date: March 13, 2022 /  Author: Ralf Eichinger

OCR (Optical Character Recognition) with Tesseract

[Update 2023-07-09: updated to Tesseract 5 and added more usage examples]

For OCR on Linux, tesseract-ocr is the tool of choice. It is recommended to also install some conversion tools, such as ImageMagick, for image preparation.

Installation

On Ubuntu 22.10:

$ sudo apt install imagemagick    # optional: conversion tools for image preparation
$ sudo apt install tesseract-ocr
...
Setting up tesseract-ocr (5.1.0-2) ...
...
$ sudo apt install tesseract-ocr-script-frak
...
Setting up tesseract-ocr-script-frak (1:4.1.0-2) ...
...
$ tesseract --list-langs
List of available languages in "/usr/share/tesseract-ocr/5/tessdata/" (3):
Fraktur
eng
osd
$ tesseract --version
tesseract 5.1.0
 leptonica-1.82.0
  libgif 5.1.9 : libjpeg 8d (libjpeg-turbo 2.1.1) : libpng 1.6.38 : libtiff 4.4.0 : zlib 1.2.11 : libwebp 1.2.2 : libopenjp2 2.5.0
 Found AVX2
 Found AVX
 Found FMA
 Found SSE4.1
 Found OpenMP 201511
 Found libarchive 3.6.0 zlib/1.2.11 liblzma/5.2.5 bz2lib/1.0.8 liblz4/1.9.3 libzstd/1.4.8
 Found libcurl/7.85.0 OpenSSL/3.0.5 zlib/1.2.11 brotli/1.0.9 zstd/1.5.2 libidn2/2.3.3 libpsl/0.21.0 (+libidn2/2.3.2) libssh/0.9.6/openssl/zlib nghttp2/1.49.0 librtmp/2.3
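
Before starting a long OCR run it can help to check that the required language model is really installed. The following is a minimal sketch: the function name has_tess_lang is made up for this example, and it only assumes that `tesseract --list-langs` prints one language name per line, as in the output above.

```shell
# Hypothetical helper: succeed if tesseract lists the given language/script.
has_tess_lang() {
    # -x matches whole lines only, -F treats the name as a fixed string,
    # so the header line ("List of available languages ...") never matches.
    tesseract --list-langs 2>&1 | grep -qxF -- "$1"
}
```

Possible usage: `has_tess_lang Fraktur || sudo apt install tesseract-ocr-script-frak`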

Usage

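OCR of an image with Fraktur text

For a multi-page PDF, ImageMagick's convert writes one JPEG per page, numbered from 0 (Achleitner-Der_Finanzer-0.jpg, Achleitner-Der_Finanzer-1.jpg, ...). The second command runs OCR on one of these page images and writes the recognized text to Achleitner-Der_Finanzer-6.txt: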

$ convert Achleitner-Der_Finanzer.pdf Achleitner-Der_Finanzer.jpg
$ tesseract Achleitner-Der_Finanzer-6.jpg Achleitner-Der_Finanzer-6 -l Fraktur
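
OCR quality depends a lot on the input image. As a hedged sketch, the PDF can be rendered at a higher resolution and as lossless grayscale PNG instead of JPEG. The function name pdf_to_pages, the value of 300 dpi and the output naming pattern are assumptions for this example, not settings from the original workflow.

```shell
# Hypothetical helper: render each page of a PDF to a grayscale PNG.
pdf_to_pages() {
    pdf=$1
    base=${pdf%.pdf}                      # strip the .pdf suffix
    # -density sets the rendering resolution (assumed: 300 dpi),
    # %03d numbers the pages: base-000.png, base-001.png, ...
    convert -density 300 "$pdf" -colorspace Gray "${base}-%03d.png"
}
```

Possible usage: `pdf_to_pages Achleitner-Der_Finanzer.pdf`, followed by `tesseract Achleitner-Der_Finanzer-006.png Achleitner-Der_Finanzer-006 -l Fraktur`.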

OCR to ALTO format

$ tesseract 00000016.tif 00000016 -l Fraktur alto
$ ls -al
...
-rwxrwxrwx 1 ralf ralf   103559 Jul  9 20:38 00000016.xml
...
$ less 00000016.xml
<?xml version="1.0" encoding="UTF-8"?>
<alto xmlns="http://www.loc.gov/standards/alto/ns-v3#" xmlns:xlink="http://www.w3.org/1999/xlink" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.loc.gov/standards/alto/ns-v3# http://www.loc.gov/alto/v3/alto-3-0.xsd">
        <Description>
                <MeasurementUnit>pixel</MeasurementUnit>
                <sourceImageInformation>
                        <fileName>00000016.tif</fileName>
                </sourceImageInformation>
                <OCRProcessing ID="OCR_0">
                        <ocrProcessingStep>
                                <processingSoftware>
                                        <softwareName>tesseract 5.1.0</softwareName>
                                </processingSoftware>
                        </ocrProcessingStep>
                </OCRProcessing>
        </Description>
        <Layout>
                <Page WIDTH="2039" HEIGHT="3127" PHYSICAL_IMG_NR="0" ID="page_0">
                        <PrintSpace HPOS="0" VPOS="0" WIDTH="2039" HEIGHT="3127">
                                <ComposedBlock ID="cblock_0" HPOS="232" VPOS="291" WIDTH="1048" HEIGHT="34">
                                        <TextBlock ID="block_0" HPOS="232" VPOS="291" WIDTH="1048" HEIGHT="34">
                                                <TextLine ID="line_0" HPOS="232" VPOS="291" WIDTH="1048" HEIGHT="34">
                                                        <String ID="string_0" HPOS="232" VPOS="297" WIDTH="22" HEIGHT="28" WC="0.96" CONTENT="8"/><SP WIDTH="509" VPOS="297" HPOS="254"/>
                                                        <String ID="string_1" HPOS="763" VPOS="292" WIDTH="20" HEIGHT="26" WC="0.96" CONTENT="I."/><SP WIDTH="21" VPOS="292" HPOS="783"/>
                                                        <String ID="string_2" HPOS="804" VPOS="292" WIDTH="49" HEIGHT="26" WC="0.93" CONTENT="Die"/><SP WIDTH="20" VPOS="292" HPOS="853"/>
                                                        <String ID="string_3" HPOS="873" VPOS="292" WIDTH="242" HEIGHT="33" WC="0.73" CONTENT="weltgeſchihtlihen"/><SP WIDTH="21" VPOS="292" HPOS="1115"/>
                                                        <String ID="string_4" HPOS="1136" VPOS="291" WIDTH="144" HEIGHT="34" WC="0.95" CONTENT="Ereigniſſe."/>
                                                </TextLine>
                                        </TextBlock>
                                </ComposedBlock>
                                <ComposedBlock ID="cblock_1" HPOS="228" VPOS="378" WIDTH="1590" HEIGHT="2208">
                                        <TextBlock ID="block_1" HPOS="232" VPOS="378" WIDTH="1586" HEIGHT="308">
                                                <TextLine ID="line_1" HPOS="335" VPOS="378" WIDTH="1483" HEIGHT="44">
                                                        <String ID="string_5" HPOS="335" VPOS="381" WIDTH="50" HEIGHT="41" WC="0.96" CONTENT="Zu"/><SP WIDTH="19" VPOS="381" HPOS="385"/>
                                                        <String ID="string_6" HPOS="404" VPOS="381" WIDTH="59" HEIGHT="33" WC="0.93" CONTENT="den"/><SP WIDTH="17" VPOS="381" HPOS="463"/>
                                                        <String ID="string_7" HPOS="480" VPOS="379" WIDTH="232" HEIGHT="42" WC="0.92" CONTENT="Kulturſtaaten"/><SP WIDTH="18" VPOS="379" HPOS="712"/>
                                                        <String ID="string_8" HPOS="730" VPOS="380" WIDTH="84" HEIGHT="41" WC="0.96" CONTENT="jener"/><SP WIDTH="19" VPOS="380" HPOS="814"/>
                                                        <String ID="string_9" HPOS="833" VPOS="379" WIDTH="70" HEIGHT="41" WC="0.96" CONTENT="Zeit"/><SP WIDTH="20" VPOS="379" HPOS="903"/>
                                                        <String ID="string_10" HPOS="923" VPOS="379" WIDTH="71" HEIGHT="32" WC="0.96" CONTENT="weit"/><SP WIDTH="20" VPOS="379" HPOS="994"/>
                                                        <String ID="string_11" HPOS="1014" VPOS="383" WIDTH="54" HEIGHT="27" WC="0.96" CONTENT="vor"/><SP WIDTH="19" VPOS="383" HPOS="1068"/>
                                                        <String ID="string_12" HPOS="1087" VPOS="379" WIDTH="106" HEIGHT="40" WC="0.96" CONTENT="Chriſti"/><SP WIDTH="19" VPOS="379" HPOS="1193"/>
                                                        <String ID="string_13" HPOS="1212" VPOS="379" WIDTH="120" HEIGHT="33" WC="0.96" CONTENT="Geburt"/><SP WIDTH="19" VPOS="379" HPOS="1332"/>
                                                        <String ID="string_14" HPOS="1351" VPOS="379" WIDTH="96" HEIGHT="41" WC="0.96" CONTENT="zählte"/><SP WIDTH="21" VPOS="379" HPOS="1447"/>
                                                        <String ID="string_15" HPOS="1468" VPOS="378" WIDTH="71" HEIGHT="41" WC="0.10" CONTENT="auh"/><SP WIDTH="20" VPOS="378" HPOS="1539"/>
                                                        <String ID="string_16" HPOS="1559" VPOS="378" WIDTH="180" HEIGHT="41" WC="0.92" CONTENT="Phönizien,"/><SP WIDTH="20" VPOS="378" HPOS="1739"/>
                                                        <String ID="string_17" HPOS="1759" VPOS="378" WIDTH="59" HEIGHT="32" WC="0.96" CONTENT="das"/>
                                                </TextLine>
...
$
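
The WC attribute in the ALTO output is the word confidence (for example 0.10 for "auh" above, a misread "auch"). A rough sketch for listing doubtful words with standard text tools follows. The function name alto_low_conf and the default threshold of 0.5 are assumptions, and the pattern relies on the attribute order WC, CONTENT exactly as shown in the listing above; a real XML parser would be more robust.

```shell
# Hypothetical helper: print words whose confidence is below a threshold.
# Usage: alto_low_conf FILE [THRESHOLD]
alto_low_conf() {
    file=$1
    threshold=${2:-0.5}                   # assumed default
    # 1. isolate each <String .../> element
    # 2. pull out the WC and CONTENT attribute values
    # 3. keep entries below the threshold, print "word<TAB>confidence"
    grep -o '<String [^>]*>' "$file" |
    sed -n 's/.* WC="\([^"]*\)" CONTENT="\([^"]*\)".*/\1 \2/p' |
    awk -v t="$threshold" '$1 + 0 < t + 0 { print $2 "\t" $1 }'
}
```

Possible usage: `alto_low_conf 00000016.xml 0.8`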

OCR whole directory to ALTO, hOCR and TXT

for i in *.tif; do b=$(basename "$i" .tif); echo "$i"; tesseract "$i" "$b" -l Fraktur alto hocr txt; done

The OCR files are named like this (e.g. for file alx-0001.tif):

  • alto: alx-0001.xml
  • hOCR: alx-0001.hocr
  • txt: alx-0001.txt
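
For larger directories the loop above can be made re-runnable, so that an interrupted run continues where it stopped. This is only a sketch: the function name ocr_dir and the rule "skip an image if its .txt file already exists" are assumptions for this example.

```shell
# Hypothetical helper: OCR all *.tif files in a directory, skipping finished ones.
# Usage: ocr_dir [DIR] [LANG]
ocr_dir() {
    dir=${1:-.}
    lang=${2:-Fraktur}
    for img in "$dir"/*.tif; do
        [ -e "$img" ] || continue         # glob matched nothing
        base=${img%.tif}
        [ -e "$base.txt" ] && continue    # assumed marker for "already done"
        echo "$img"
        # writes base.xml (ALTO), base.hocr and base.txt
        tesseract "$img" "$base" -l "$lang" alto hocr txt ||
            echo "failed: $img" >&2
    done
}
```

Possible usage: `ocr_dir . Fraktur`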

Rename the ALTO XML files so that the filename contains "alto", e.g. alx-0001.alto.xml:

for i in alx-*.xml; do b=$(basename "$i" .xml); echo "$i"; mv "$i" "$b.alto.xml"; done
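
Running the rename loop a second time would turn alx-0001.alto.xml into alx-0001.alto.alto.xml. A small sketch of a variant that can be repeated safely is shown below; the function name rename_alto is an assumption for this example.

```shell
# Hypothetical helper: rename NAME.xml to NAME.alto.xml, but only once.
# Usage: rename_alto [DIR]
rename_alto() {
    dir=${1:-.}
    for f in "$dir"/*.xml; do
        [ -e "$f" ] || continue           # glob matched nothing
        case "$f" in
            *.alto.xml) continue ;;       # already renamed
        esac
        # -n: never overwrite an existing target file
        mv -n -- "$f" "${f%.xml}.alto.xml"
    done
}
```

Possible usage: `rename_alto .`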