| 123456789101112131415161718192021222324252627282930313233343536373839404142434445464748495051525354555657585960616263646566676869707172737475767778798081828384858687888990919293949596979899100101102103104105106107108109110111112113114115116117118119120121122123124125126127128129130131132133134135136137138139140141142143144145146147148149150151152153154155156157158159160161162163164165166167168169170171172173174175176177178179180181182183184185186187188189190191192193194195196197198199200201202203204205206207208209210211212213214215216217218219220221222223224225226227228229230231232233234235236237238239240241242243244245246247248249250251252253254255256257258259260261262263264265266267268269270271272273274275276277278279280281282283284285286287288289290291292293294295296297298299300301302303304305306307308309310311312313314315316317318319320321322323324325326327328329330331332333334335336337338339340341342343344345346347348349350351352353354355356357358359360361362363364365366367368369370371372373374375376377378379380381382383384385386387388389390391392393394395396397398399400401402403404405406407408409410411412413414415416417418419420421422423424425426427428429430431432433434435436437438439440441442443444445446447448449450451452453454455456457458459460461462463464465466467468469470471472473474475476477478479480481482483484485486487488489490491492493 |
- TESSERACT(1)
- ============
- :doctype: manpage
- NAME
- ----
- tesseract - command-line OCR engine
- SYNOPSIS
- --------
- *tesseract* 'FILE' 'OUTPUTBASE' ['OPTIONS']... ['CONFIGFILE']...
- DESCRIPTION
- -----------
- tesseract(1) is a commercial quality OCR engine originally developed at HP
- between 1985 and 1995. In 1995, this engine was among the top 3 evaluated by
- UNLV. It was open-sourced by HP and UNLV in 2005, and has been developed
- at Google until 2018.
- IN/OUT ARGUMENTS
- ----------------
- 'FILE'::
- The name of the input file.
- This can either be an image file or a text file. +
- Most image file formats (anything readable by Leptonica) are supported. +
- A text file lists the names of all input images (one image name per line).
- The results will be combined in a single file for each output file format
- (txt, pdf, hocr, xml). +
- If 'FILE' is `stdin` or `-` then the standard input is used.
- 'OUTPUTBASE'::
- The basename of the output file (to which the appropriate extension
- will be appended). By default the output will be a text file
- with `.txt` added to the basename unless there are one or more
- parameters set which explicitly specify the desired output. +
- If 'OUTPUTBASE' is `stdout` or `-` then the standard output is used.
- [[TESSDATADIR]]
- OPTIONS
- -------
- *-c* 'CONFIGVAR=VALUE'::
- Set value for parameter 'CONFIGVAR' to VALUE. Multiple *-c* arguments are allowed.
- *--dpi* 'N'::
- Specify the resolution 'N' in DPI for the input image(s).
- A typical value for 'N' is `300`. Without this option,
- the resolution is read from the metadata included in the image.
- If an image does not include that information, Tesseract tries to guess it.
- *-l* 'LANG'::
- *-l* 'SCRIPT'::
- The language or script to use.
- If none is specified, `eng` (English) is assumed.
- Multiple languages may be specified, separated by plus characters.
- Tesseract uses 3-character ISO 639-2 language codes
- (see <<LANGUAGES,*LANGUAGES AND SCRIPTS*>>).
- *--psm* 'N'::
- Set Tesseract to only run a subset of layout analysis and assume
- a certain form of image. The options for 'N' are:
- 0 = Orientation and script detection (OSD) only.
- 1 = Automatic page segmentation with OSD.
- 2 = Automatic page segmentation, but no OSD, or OCR. (not implemented)
- 3 = Fully automatic page segmentation, but no OSD. (Default)
- 4 = Assume a single column of text of variable sizes.
- 5 = Assume a single uniform block of vertically aligned text.
- 6 = Assume a single uniform block of text.
- 7 = Treat the image as a single text line.
- 8 = Treat the image as a single word.
- 9 = Treat the image as a single word in a circle.
- 10 = Treat the image as a single character.
- 11 = Sparse text. Find as much text as possible in no particular order.
- 12 = Sparse text with OSD.
- 13 = Raw line. Treat the image as a single text line,
- bypassing hacks that are Tesseract-specific.
- *--oem* 'N'::
- Specify OCR Engine mode. The options for 'N' are:
- 0 = Original Tesseract only.
- 1 = Neural nets LSTM only.
- 2 = Tesseract + LSTM.
- 3 = Default, based on what is available.
- *--tessdata-dir* 'PATH'::
- Specify the location of tessdata path.
- *--user-patterns* 'FILE'::
- Specify the location of user patterns file.
- *--user-words* 'FILE'::
- Specify the location of user words file.
- [[CONFIGFILE]]
- 'CONFIGFILE'::
- The name of a config to use. The name can be a file in `tessdata/configs`
- or `tessdata/tessconfigs`, or an absolute or relative file path.
- A config is a plain text file which contains a list of parameters and
- their values, one per line, with a space separating parameter from value. +
- Interesting config files include:
- * *alto* -- Output in ALTO format ('OUTPUTBASE'`.xml`).
- * *hocr* -- Output in hOCR format ('OUTPUTBASE'`.hocr`).
- * *page* -- Output in PAGE format ('OUTPUTBASE'`.page.xml`).
- The output can be customized with the flags:
- page_xml_polygon -- Create polygons instead of bounding boxes (default: true)
- page_xml_level -- Create the PAGE file on 0=linelevel or 1=wordlevel (default: 0)
- * *pdf* -- Output PDF ('OUTPUTBASE'`.pdf`).
- * *tsv* -- Output TSV ('OUTPUTBASE'`.tsv`).
- * *txt* -- Output plain text ('OUTPUTBASE'`.txt`).
- * *get.images* -- Write processed input images to file ('OUTPUTBASE'`.processedPAGENUMBER.tif`).
- * *logfile* -- Redirect debug messages to file (`tesseract.log`).
- * *lstm.train* -- Output files used by LSTM training ('OUTPUTBASE'`.lstmf`).
- * *makebox* -- Write box file ('OUTPUTBASE'`.box`).
- * *quiet* -- Redirect debug messages to '/dev/null'.
- It is possible to select several config files, for example
- `tesseract image.png demo alto hocr pdf txt` will create four output files
- `demo.alto`, `demo.hocr`, `demo.pdf` and `demo.txt` with the OCR results.
- *Nota bene:* The options *-l* 'LANG', *-l* 'SCRIPT' and *--psm* 'N'
- must occur before any 'CONFIGFILE'.
- SINGLE OPTIONS
- --------------
- *-h, --help*::
- Show help message.
- *--help-extra*::
- Show extra help for advanced users.
- *--help-psm*::
- Show page segmentation modes.
- *--help-oem*::
- Show OCR Engine modes.
- *-v, --version*::
- Returns the current version of the tesseract(1) executable.
- *--list-langs*::
- List available languages for tesseract engine.
- Can be used with *--tessdata-dir* 'PATH'.
- *--print-parameters*::
- Print tesseract parameters.
- [[LANGUAGES]]
- LANGUAGES AND SCRIPTS
- ---------------------
- To recognize some text with Tesseract, it is normally necessary to specify
- the language(s) or script(s) of the text (unless it is English text which is
- supported by default) using *-l* 'LANG' or *-l* 'SCRIPT'.
- Selecting a language automatically also selects the language specific
- character set and dictionary (word list).
- Selecting a script typically selects all characters of that script
- which can be from different languages. The dictionary which is included
- also contains a mix from different languages.
- In most cases, a script also supports English.
- So it is possible to recognize a language that has not been specifically
- trained for by using traineddata for the script it is written in.
- More than one language or script may be specified by using `+`.
- Example: `tesseract myimage.png myimage -l eng+deu+fra`.
- https://github.com/tesseract-ocr/tessdata_fast provides fast language and
- script models which are also part of Linux distributions.
- For Tesseract 4, `tessdata_fast` includes traineddata files for the
- following languages:
- *afr* (Afrikaans),
- *amh* (Amharic),
- *ara* (Arabic),
- *asm* (Assamese),
- *aze* (Azerbaijani),
- *aze_cyrl* (Azerbaijani - Cyrilic),
- *bel* (Belarusian),
- *ben* (Bengali),
- *bod* (Tibetan),
- *bos* (Bosnian),
- *bre* (Breton),
- *bul* (Bulgarian),
- *cat* (Catalan; Valencian),
- *ceb* (Cebuano),
- *ces* (Czech),
- *chi_sim* (Chinese simplified),
- *chi_tra* (Chinese traditional),
- *chr* (Cherokee),
- *cos* (Corsican),
- *cym* (Welsh),
- *dan* (Danish),
- *deu* (German),
- *deu_latf* (German Fraktur Latin),
- *div* (Dhivehi),
- *dzo* (Dzongkha),
- *ell* (Greek, Modern, 1453-),
- *eng* (English),
- *enm* (English, Middle, 1100-1500),
- *epo* (Esperanto),
- *equ* (Math / equation detection module),
- *est* (Estonian),
- *eus* (Basque),
- *fas* (Persian),
- *fao* (Faroese),
- *fil* (Filipino),
- *fin* (Finnish),
- *fra* (French),
- *frm* (French, Middle, ca.1400-1600),
- *fry* (West Frisian),
- *gla* (Scottish Gaelic),
- *gle* (Irish),
- *glg* (Galician),
- *grc* (Greek, Ancient, to 1453),
- *guj* (Gujarati),
- *hat* (Haitian; Haitian Creole),
- *heb* (Hebrew),
- *hin* (Hindi),
- *hrv* (Croatian),
- *hun* (Hungarian),
- *hye* (Armenian),
- *iku* (Inuktitut),
- *ind* (Indonesian),
- *isl* (Icelandic),
- *ita* (Italian),
- *ita_old* (Italian - Old),
- *jav* (Javanese),
- *jpn* (Japanese),
- *kan* (Kannada),
- *kat* (Georgian),
- *kat_old* (Georgian - Old),
- *kaz* (Kazakh),
- *khm* (Central Khmer),
- *kir* (Kirghiz; Kyrgyz),
- *kmr* (Kurdish Kurmanji),
- *kor* (Korean),
- *kor_vert* (Korean vertical),
- *lao* (Lao),
- *lat* (Latin),
- *lav* (Latvian),
- *lit* (Lithuanian),
- *ltz* (Luxembourgish),
- *mal* (Malayalam),
- *mar* (Marathi),
- *mkd* (Macedonian),
- *mlt* (Maltese),
- *mon* (Mongolian),
- *mri* (Maori),
- *msa* (Malay),
- *mya* (Burmese),
- *nep* (Nepali),
- *nld* (Dutch; Flemish),
- *nor* (Norwegian),
- *oci* (Occitan post 1500),
- *ori* (Oriya),
- *osd* (Orientation and script detection module),
- *pan* (Panjabi; Punjabi),
- *pol* (Polish),
- *por* (Portuguese),
- *pus* (Pushto; Pashto),
- *que* (Quechua),
- *ron* (Romanian; Moldavian; Moldovan),
- *rus* (Russian),
- *san* (Sanskrit),
- *sin* (Sinhala; Sinhalese),
- *slk* (Slovak),
- *slv* (Slovenian),
- *snd* (Sindhi),
- *spa* (Spanish; Castilian),
- *spa_old* (Spanish; Castilian - Old),
- *sqi* (Albanian),
- *srp* (Serbian),
- *srp_latn* (Serbian - Latin),
- *sun* (Sundanese),
- *swa* (Swahili),
- *swe* (Swedish),
- *syr* (Syriac),
- *tam* (Tamil),
- *tat* (Tatar),
- *tel* (Telugu),
- *tgk* (Tajik),
- *tha* (Thai),
- *tir* (Tigrinya),
- *ton* (Tonga),
- *tur* (Turkish),
- *uig* (Uighur; Uyghur),
- *ukr* (Ukrainian),
- *urd* (Urdu),
- *uzb* (Uzbek),
- *uzb_cyrl* (Uzbek - Cyrilic),
- *vie* (Vietnamese),
- *yid* (Yiddish),
- *yor* (Yoruba)
- To use a non-standard language pack named `foo.traineddata`, set the
- `TESSDATA_PREFIX` environment variable so the file can be found at
- `TESSDATA_PREFIX/tessdata/foo.traineddata` and give Tesseract the
- argument *-l* `foo`.
- For Tesseract 4, `tessdata_fast` includes traineddata files for the
- following scripts:
- *Arabic*,
- *Armenian*,
- *Bengali*,
- *Canadian_Aboriginal*,
- *Cherokee*,
- *Cyrillic*,
- *Devanagari*,
- *Ethiopic*,
- *Fraktur*,
- *Georgian*,
- *Greek*,
- *Gujarati*,
- *Gurmukhi*,
- *HanS* (Han simplified),
- *HanS_vert* (Han simplified, vertical),
- *HanT* (Han traditional),
- *HanT_vert* (Han traditional, vertical),
- *Hangul*,
- *Hangul_vert* (Hangul vertical),
- *Hebrew*,
- *Japanese*,
- *Japanese_vert* (Japanese vertical),
- *Kannada*,
- *Khmer*,
- *Lao*,
- *Latin*,
- *Malayalam*,
- *Myanmar*,
- *Oriya* (Odia),
- *Sinhala*,
- *Syriac*,
- *Tamil*,
- *Telugu*,
- *Thaana*,
- *Thai*,
- *Tibetan*,
- *Vietnamese*.
- The same languages and scripts are available from
- https://github.com/tesseract-ocr/tessdata_best.
- `tessdata_best` provides slow language and script models.
- These models are needed for training. They also can give better OCR results,
- but the recognition takes much more time.
- Both `tessdata_fast` and `tessdata_best` only support the LSTM OCR engine.
- There is a third repository, https://github.com/tesseract-ocr/tessdata,
- with models which support both the Tesseract 3 legacy OCR engine and the
- Tesseract 4 LSTM OCR engine.
- CONFIG FILES AND AUGMENTING WITH USER DATA
- ------------------------------------------
- Tesseract config files consist of lines with parameter-value pairs (space
- separated). The parameters are documented as flags in the source code like
- the following one in tesseractclass.h:
- `STRING_VAR_H(tessedit_char_blacklist, "",
- "Blacklist of chars not to recognize");`
- These parameters may enable or disable various features of the engine, and
- may cause it to load (or not load) various data. For instance, let's suppose
- you want to OCR in English, but suppress the normal dictionary and load an
- alternative word list and an alternative list of patterns -- these two files
- are the most commonly used extra data files.
- If your language pack is in '/path/to/eng.traineddata' and the hocr config
- is in '/path/to/configs/hocr' then create three new files:
- '/path/to/eng.user-words':
- [verse]
- the
- quick
- brown
- fox
- jumped
- '/path/to/eng.user-patterns':
- [verse]
- 1-\d\d\d-GOOG-411
- www.\n\\\*.com
- '/path/to/configs/bazaar':
- [verse]
- load_system_dawg F
- load_freq_dawg F
- user_words_suffix user-words
- user_patterns_suffix user-patterns
- Now, if you pass the word 'bazaar' as a <<CONFIGFILE,'CONFIGFILE'>> to
- Tesseract, Tesseract will not bother loading the system dictionary nor
- the dictionary of frequent words and will load and use the 'eng.user-words'
- and 'eng.user-patterns' files you provided. The former is a simple word list,
- one per line. The format of the latter is documented in 'dict/trie.h'
- on 'read_pattern_list()'.
- ENVIRONMENT VARIABLES
- ---------------------
- *`TESSDATA_PREFIX`*::
- If the `TESSDATA_PREFIX` is set to a path, then that path is used to
- find the `tessdata` directory with language and script recognition
- models and config files.
- Using <<TESSDATADIR,*--tessdata-dir* 'PATH'>> is the recommended alternative.
- *`OMP_THREAD_LIMIT`*::
- If the `tesseract` executable was built with multithreading support,
- it will normally use four CPU cores for the OCR process. While this
- can be faster for a single image, it gives bad performance if the host
- computer provides less than four CPU cores or if OCR is made for many images.
- Only a single CPU core is used with `OMP_THREAD_LIMIT=1`.
- HISTORY
- -------
- The engine was developed at Hewlett Packard Laboratories Bristol and at
- Hewlett Packard Co, Greeley Colorado between 1985 and 1994, with some more
- changes made in 1996 to port to Windows, and some $$C++$$izing in 1998. A
- lot of the code was written in C, and then some more was written in $$C++$$.
- The $$C++$$ code makes heavy use of a list system using macros. This predates
- STL, was portable before STL, and is more efficient than STL lists, but has
- the big negative that if you do get a segmentation violation, it is hard to
- debug.
- Version 2.00 brought Unicode (UTF-8) support, six languages, and the ability
- to train Tesseract.
- Tesseract was included in UNLV's Fourth Annual Test of OCR Accuracy.
- See <https://github.com/tesseract-ocr/docs/blob/main/AT-1995.pdf>.
- Since Tesseract 2.00,
- scripts are now included to allow anyone to reproduce some of these tests.
- See <https://tesseract-ocr.github.io/tessdoc/TestingTesseract.html> for more
- details.
- Tesseract 3.00 added a number of new languages, including Chinese, Japanese,
- and Korean. It also introduced a new, single-file based system of managing
- language data.
- Tesseract 3.02 added BiDirectional text support, the ability to recognize
- multiple languages in a single image, and improved layout analysis.
- Tesseract 4 adds a new neural net (LSTM) based OCR engine which is focused
- on line recognition, but also still supports the legacy Tesseract OCR engine of
- Tesseract 3 which works by recognizing character patterns. Compatibility with
- Tesseract 3 is enabled by `--oem 0`. This also needs traineddata files which
- support the legacy engine, for example those from the tessdata repository
- (https://github.com/tesseract-ocr/tessdata).
- For further details, see the release notes in the Tesseract documentation
- (<https://tesseract-ocr.github.io/tessdoc/ReleaseNotes.html>).
- RESOURCES
- ---------
- Main web site: <https://github.com/tesseract-ocr> +
- User forum: <https://groups.google.com/g/tesseract-ocr> +
- Documentation: <https://tesseract-ocr.github.io/> +
- Information on training: <https://tesseract-ocr.github.io/tessdoc/Training-Tesseract.html>
- SEE ALSO
- --------
- ambiguous_words(1), cntraining(1), combine_tessdata(1), dawg2wordlist(1),
- shape_training(1), mftraining(1), unicharambigs(5), unicharset(5),
- unicharset_extractor(1), wordlist2dawg(1)
- AUTHOR
- ------
- Tesseract development was led at Hewlett-Packard and Google by Ray Smith.
- The development team has included:
- Ahmad Abdulkader, Chris Newton, Dan Johnson, Dar-Shyang Lee, David Eger,
- Eric Wiseblatt, Faisal Shafait, Hiroshi Takenaka, Joe Liu, Joern Wanke,
- Mark Seaman, Mickey Namiki, Nicholas Beato, Oded Fuhrmann, Phil Cheatle,
- Pingping Xiu, Pong Eksombatchai (Chantat), Ranjith Unnikrishnan, Raquel
- Romano, Ray Smith, Rika Antonova, Robert Moss, Samuel Charron, Sheelagh
- Lloyd, Shobhit Saxena, and Thomas Kielbus.
- For a list of contributors see
- <https://github.com/tesseract-ocr/tesseract/blob/main/AUTHORS>.
- COPYING
- -------
- Licensed under the Apache License, Version 2.0
|