OCR (Optical Character Recognition) with Tesseract
[Update on 20230709: update to tesseract5 and more usage examples]
For OCR the tool tesseract-ocr
is the best tool under linux.
It is recommened to also install some conversion tools for image preparation.
Installation
On Ubuntu 22.10:
($ sudo apt install imagemagick)
$ sudo apt install tesseract-ocr
...
tesseract-ocr (5.1.0-2) wird eingerichtet ...
...
$ sudo apt install tesseract-ocr-script-frak
...
tesseract-ocr-script-frak (1:4.1.0-2) wird eingerichtet ...
...
$ tesseract --list-langs
List of available languages in "/usr/share/tesseract-ocr/5/tessdata/" (3):
Fraktur
eng
osd
$ tesseract --version
tesseract 5.1.0
leptonica-1.82.0
libgif 5.1.9 : libjpeg 8d (libjpeg-turbo 2.1.1) : libpng 1.6.38 : libtiff 4.4.0 : zlib 1.2.11 : libwebp 1.2.2 : libopenjp2 2.5.0
Found AVX2
Found AVX
Found FMA
Found SSE4.1
Found OpenMP 201511
Found libarchive 3.6.0 zlib/1.2.11 liblzma/5.2.5 bz2lib/1.0.8 liblz4/1.9.3 libzstd/1.4.8
Found libcurl/7.85.0 OpenSSL/3.0.5 zlib/1.2.11 brotli/1.0.9 zstd/1.5.2 libidn2/2.3.3 libpsl/0.21.0 (+libidn2/2.3.2) libssh/0.9.6/openssl/zlib nghttp2/1.49.0 librtmp/2.3
Usage
Do OCR processing of an image with fraktur text
$ convert Achleitner-Der_Finanzer.pdf Achleitner-Der_Finanzer.jpg
$ tesseract Achleitner-Der_Finanzer-6.jpg Achleitner-Der_Finanzer-6 -l Fraktur
OCR to ALTO format
$ tesseract 00000016.tif 00000016 -l Fraktur alto
$ ls -al
...
-rwxrwxrwx 1 ralf ralf 103559 Jul 9 20:38 00000016.xml
...
$ less 00000016.xml
<?xml version="1.0" encoding="UTF-8"?>
<alto xmlns="http://www.loc.gov/standards/alto/ns-v3#" xmlns:xlink="http://www.w3.org/1999/xlink" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.loc.gov/standards/alto/ns-v3# http://www.loc.gov/alto/v3/alto-3-0.xsd">
<Description>
<MeasurementUnit>pixel</MeasurementUnit>
<sourceImageInformation>
<fileName>00000016.tif</fileName>
</sourceImageInformation>
<OCRProcessing ID="OCR_0">
<ocrProcessingStep>
<processingSoftware>
<softwareName>tesseract 5.1.0</softwareName>
</processingSoftware>
</ocrProcessingStep>
</OCRProcessing>
</Description>
<Layout>
<Page WIDTH="2039" HEIGHT="3127" PHYSICAL_IMG_NR="0" ID="page_0">
<PrintSpace HPOS="0" VPOS="0" WIDTH="2039" HEIGHT="3127">
<ComposedBlock ID="cblock_0" HPOS="232" VPOS="291" WIDTH="1048" HEIGHT="34">
<TextBlock ID="block_0" HPOS="232" VPOS="291" WIDTH="1048" HEIGHT="34">
<TextLine ID="line_0" HPOS="232" VPOS="291" WIDTH="1048" HEIGHT="34">
<String ID="string_0" HPOS="232" VPOS="297" WIDTH="22" HEIGHT="28" WC="0.96" CONTENT="8"/><SP WIDTH="509" VPOS="297" HPOS="254"/>
<String ID="string_1" HPOS="763" VPOS="292" WIDTH="20" HEIGHT="26" WC="0.96" CONTENT="I."/><SP WIDTH="21" VPOS="292" HPOS="783"/>
<String ID="string_2" HPOS="804" VPOS="292" WIDTH="49" HEIGHT="26" WC="0.93" CONTENT="Die"/><SP WIDTH="20" VPOS="292" HPOS="853"/>
<String ID="string_3" HPOS="873" VPOS="292" WIDTH="242" HEIGHT="33" WC="0.73" CONTENT="weltgeſchihtlihen"/><SP WIDTH="21" VPOS="292" HPOS="1115"/>
<String ID="string_4" HPOS="1136" VPOS="291" WIDTH="144" HEIGHT="34" WC="0.95" CONTENT="Ereigniſſe."/>
</TextLine>
</TextBlock>
</ComposedBlock>
<ComposedBlock ID="cblock_1" HPOS="228" VPOS="378" WIDTH="1590" HEIGHT="2208">
<TextBlock ID="block_1" HPOS="232" VPOS="378" WIDTH="1586" HEIGHT="308">
<TextLine ID="line_1" HPOS="335" VPOS="378" WIDTH="1483" HEIGHT="44">
<String ID="string_5" HPOS="335" VPOS="381" WIDTH="50" HEIGHT="41" WC="0.96" CONTENT="Zu"/><SP WIDTH="19" VPOS="381" HPOS="385"/>
<String ID="string_6" HPOS="404" VPOS="381" WIDTH="59" HEIGHT="33" WC="0.93" CONTENT="den"/><SP WIDTH="17" VPOS="381" HPOS="463"/>
<String ID="string_7" HPOS="480" VPOS="379" WIDTH="232" HEIGHT="42" WC="0.92" CONTENT="Kulturſtaaten"/><SP WIDTH="18" VPOS="379" HPOS="712"/>
<String ID="string_8" HPOS="730" VPOS="380" WIDTH="84" HEIGHT="41" WC="0.96" CONTENT="jener"/><SP WIDTH="19" VPOS="380" HPOS="814"/>
<String ID="string_9" HPOS="833" VPOS="379" WIDTH="70" HEIGHT="41" WC="0.96" CONTENT="Zeit"/><SP WIDTH="20" VPOS="379" HPOS="903"/>
<String ID="string_10" HPOS="923" VPOS="379" WIDTH="71" HEIGHT="32" WC="0.96" CONTENT="weit"/><SP WIDTH="20" VPOS="379" HPOS="994"/>
<String ID="string_11" HPOS="1014" VPOS="383" WIDTH="54" HEIGHT="27" WC="0.96" CONTENT="vor"/><SP WIDTH="19" VPOS="383" HPOS="1068"/>
<String ID="string_12" HPOS="1087" VPOS="379" WIDTH="106" HEIGHT="40" WC="0.96" CONTENT="Chriſti"/><SP WIDTH="19" VPOS="379" HPOS="1193"/>
<String ID="string_13" HPOS="1212" VPOS="379" WIDTH="120" HEIGHT="33" WC="0.96" CONTENT="Geburt"/><SP WIDTH="19" VPOS="379" HPOS="1332"/>
<String ID="string_14" HPOS="1351" VPOS="379" WIDTH="96" HEIGHT="41" WC="0.96" CONTENT="zählte"/><SP WIDTH="21" VPOS="379" HPOS="1447"/>
<String ID="string_15" HPOS="1468" VPOS="378" WIDTH="71" HEIGHT="41" WC="0.10" CONTENT="auh"/><SP WIDTH="20" VPOS="378" HPOS="1539"/>
<String ID="string_16" HPOS="1559" VPOS="378" WIDTH="180" HEIGHT="41" WC="0.92" CONTENT="Phönizien,"/><SP WIDTH="20" VPOS="378" HPOS="1739"/>
<String ID="string_17" HPOS="1759" VPOS="378" WIDTH="59" HEIGHT="32" WC="0.96" CONTENT="das"/>
</TextLine>
...
$
OCR whole directory to ALTO, hOCR and TXT
for i in *tif; do b=`basename "$i" .tif`; echo "$i"; tesseract "$i" "$b" -l Fraktur alto hocr txt; done
The OCR files are named like this (e.g. for file alx-0001.tif
):
- alto:
alx-0001.xml
- hOCR:
alx-0001.hocr
- txt:
alx-0001.txt
Rename ALTO-XML-files to contain “alto” in filename, e.g alx-0001.alto.xml
:
for i in alx-*xml; do b=`basename "$i" .xml`; echo "$i"; mv "$i" "$b.alto.xml"; done