- UNICHARAMBIGS(5)
- ================
- NAME
- ----
- unicharambigs - Tesseract unicharset ambiguities
- DESCRIPTION
- -----------
- The unicharambigs file (a component of traineddata, see combine_tessdata(1))
- is used by Tesseract to represent possible ambiguities between characters
- or groups of characters.
- The file contains a number of lines, laid out as follows:
- ...........................
- [num] <TAB> [char(s)] <TAB> [num] <TAB> [char(s)] <TAB> [num]
- ...........................
- [horizontal]
- Field one:: the number of characters contained in field two
- Field two:: the character sequence to be replaced
- Field three:: the number of characters contained in field four
- Field four:: the character sequence used to replace field two
- Field five:: contains either 1 or 0. 1 denotes a mandatory
- replacement, 0 denotes an optional replacement.
- Characters appearing in fields two and four should appear in
- unicharset. The numbers in fields one and three refer to the
- number of unichars (not bytes).
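The field layout above can be checked mechanically. The following is a minimal Python sketch of a v1 line parser, not Tesseract's own reader. The names `Ambig` and `parse_v1_line` are hypothetical, and the sketch assumes that the unichars inside fields two and four are separated by single spaces, as the example entries in this page suggest.

```python
from typing import NamedTuple, Tuple


class Ambig(NamedTuple):
    """One ambiguity entry (hypothetical container, not a Tesseract type)."""
    wrong: Tuple[str, ...]   # field two, split into unichars
    right: Tuple[str, ...]   # field four, split into unichars
    mandatory: bool          # field five: True for 1, False for 0


def parse_v1_line(line: str) -> Ambig:
    """Parse one tab-separated v1 entry and validate its counts."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 5:
        raise ValueError(f"expected 5 tab-separated fields, got {len(fields)}")
    n_wrong, wrong, n_right, right, kind = fields

    # Assumption: unichars within a field are separated by single spaces.
    wrong_chars = tuple(wrong.split(" "))
    right_chars = tuple(right.split(" "))

    # Fields one and three count unichars, not bytes.
    if int(n_wrong) != len(wrong_chars):
        raise ValueError("field one does not match the length of field two")
    if int(n_right) != len(right_chars):
        raise ValueError("field three does not match the length of field four")
    if kind not in ("0", "1"):
        raise ValueError("field five must be 0 or 1")

    return Ambig(wrong_chars, right_chars, kind == "1")
```

A line whose counts disagree with its character sequences is rejected with a `ValueError` rather than silently accepted.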
- EXAMPLE
- -------
- ...............................
- v1
- 2 ' ' 1 " 1
- 1 m 2 r n 0
- 3 i i i 1 m 0
- ...............................
- The first line is a version identifier.
- In this example, all instances of the '2' character sequence '''' will
- *always* be replaced by the '1' character sequence '"'; a '1' character
- sequence 'm' *may* be replaced by the '2' character sequence 'rn', and
- the '3' character sequence 'iii' *may* be replaced by the '1' character
- sequence 'm'.
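To make the difference between mandatory and optional entries concrete, here is a toy Python sketch built on the three example entries. It is an illustration only and does not reproduce how Tesseract applies ambiguities internally. The names `AMBIGS`, `apply_mandatory` and `optional_alternatives` are hypothetical, and the left-to-right matching order is an assumption made for the sketch.

```python
# (wrong sequence, replacement, mandatory) for the three example entries.
AMBIGS = [
    (("'", "'"), ('"',), True),
    (("m",), ("r", "n"), False),
    (("i", "i", "i"), ("m",), False),
]


def apply_mandatory(unichars):
    """Rewrite every match of a mandatory entry, scanning left to right."""
    seq = list(unichars)
    out = []
    i = 0
    while i < len(seq):
        for wrong, right, mandatory in AMBIGS:
            if mandatory and tuple(seq[i:i + len(wrong)]) == wrong:
                out.extend(right)
                i += len(wrong)
                break
        else:
            # No mandatory entry matches here: keep the unichar unchanged.
            out.append(seq[i])
            i += 1
    return out


def optional_alternatives(unichars):
    """List the alternative readings produced by one optional substitution."""
    seq = list(unichars)
    alternatives = []
    for i in range(len(seq)):
        for wrong, right, mandatory in AMBIGS:
            if not mandatory and tuple(seq[i:i + len(wrong)]) == wrong:
                alternatives.append(seq[:i] + list(right) + seq[i + len(wrong):])
    return alternatives
```

In this sketch a mandatory entry changes the sequence itself, while an optional entry only adds a candidate reading next to the original one.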
- Versions 3.03 and later support a simpler format for the unicharambigs
- file:
- ...............................
- v2
- '' " 1
- m rn 0
- iii m 0
- ...............................
- In this format, the "error" and "correction" are plain UTF-8 strings
- separated by a space, followed by another space and the same type specifier
- as in v1 (0 for an optional and 1 for a mandatory substitution). The downside
- of this simpler format is that Tesseract has to encode the UTF-8 strings
- into the components of the unicharset. In complex scripts, this encoding
- may be ambiguous. In that case, the encoding is chosen so as to use the
- fewest UTF-8 characters for each component, i.e. the shortest unicharset
- components make up the encoding.
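A v2 entry can be split with far less bookkeeping than a v1 entry. The Python sketch below illustrates this under the assumption that each line holds exactly three space-separated fields. `parse_v2_line` is a hypothetical helper name, and the sketch deliberately does not attempt the unicharset encoding step described above.

```python
def parse_v2_line(line):
    """Split one v2 entry into (error, correction, mandatory)."""
    parts = line.rstrip("\n").split(" ")
    if len(parts) != 3:
        raise ValueError(f"expected 3 space-separated fields, got {len(parts)}")
    error, correction, kind = parts
    if kind not in ("0", "1"):
        raise ValueError("type specifier must be 0 or 1")
    # The strings stay as raw UTF-8 text; mapping them onto unicharset
    # components is left to Tesseract and is not modelled here.
    return error, correction, kind == "1"
```

Because the strings are not pre-split into unichars, the parser cannot validate them against the unicharset. That check only becomes possible once the encoding step has run.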
- HISTORY
- -------
- The unicharambigs file first appeared in Tesseract 3.00. Prior to that, a
- similar format called DangAmbigs ('dangerous ambiguities') was used. That
- format was almost identical, except that only mandatory replacements could
- be specified and field five was absent.
- BUGS
- ----
- This is a documentation "bug": it's not currently clear what should be done
- in the case of ligatures (such as 'fi') which may also appear as regular
- letters in the unicharset.
- SEE ALSO
- --------
- tesseract(1), unicharset(5)
- https://tesseract-ocr.github.io/tessdoc/Training-Tesseract-3.03%E2%80%933.05.html#the-unicharambigs-file
- AUTHOR
- ------
- The Tesseract OCR engine was written by Ray Smith and his research groups
- at Hewlett Packard (1985-1995) and Google (2006-2018).
|