UNICHARAMBIGS(5)
================

NAME
----
unicharambigs - Tesseract unicharset ambiguities

DESCRIPTION
-----------
The unicharambigs file (a component of traineddata, see combine_tessdata(1))
is used by Tesseract to represent possible ambiguities between characters
or groups of characters.

The file contains a number of lines, laid out as follows:

...........................
[num] <TAB> [char(s)] <TAB> [num] <TAB> [char(s)] <TAB> [num]
...........................

[horizontal]
Field one:: the number of characters contained in field two
Field two:: the character sequence to be replaced
Field three:: the number of characters contained in field four
Field four:: the character sequence used to replace field two
Field five:: either 1 or 0. 1 denotes a mandatory
replacement, 0 denotes an optional replacement.

Characters appearing in fields two and four should also appear in the
unicharset (see unicharset(5)). The numbers in fields one and three
count unichars, not bytes.
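
The five-field layout above can be read with a short parser. The following
Python sketch is illustrative only and is not Tesseract's own code: the
`AmbigRule` record and the `parse_v1_line` function are hypothetical names,
and the sketch assumes that any run of tabs or spaces separates tokens, so
that each unichar in fields two and four is its own token.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class AmbigRule:
    # Hypothetical record type for this sketch; not a Tesseract structure.
    source: List[str]  # field two: unichars to be replaced
    target: List[str]  # field four: replacement unichars
    mandatory: bool    # field five: True for 1, False for 0


def parse_v1_line(line: str) -> AmbigRule:
    """Parse one five-field line; raise ValueError if it is malformed."""
    # Assumption: tabs and spaces are both treated as token separators.
    tokens = line.split()
    pos = 0

    def take_group() -> List[str]:
        """Read a count (field one or three) and that many unichars."""
        nonlocal pos
        if pos >= len(tokens):
            raise ValueError("missing count field")
        count = int(tokens[pos])  # raises ValueError if not a number
        group = tokens[pos + 1 : pos + 1 + count]
        if count < 1 or len(group) != count:
            raise ValueError("count does not match the number of unichars")
        pos += 1 + count
        return group

    source = take_group()
    target = take_group()
    # Exactly one token must remain: the 0/1 type field.
    if pos != len(tokens) - 1 or tokens[pos] not in ("0", "1"):
        raise ValueError("expected a final type field of 0 or 1")
    return AmbigRule(source, target, tokens[pos] == "1")


if __name__ == "__main__":
    for text in ["2\t' '\t1\t\"\t1", "1\tm\t2\tr n\t0"]:
        print(parse_v1_line(text))
```

Because the counts are in unichars, a multi-character unichar (such as a
ligature stored as one unicharset entry) is a single token in this sketch.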

EXAMPLE
-------

...............................
v1
2 ' ' 1 " 1
1 m 2 r n 0
3 i i i 1 m 0
...............................

The first line is a version identifier.
In this example, every instance of the 2-character sequence of two
apostrophes ('') is *always* replaced by the 1-character sequence
consisting of a double quote ("); the 1-character sequence 'm' *may* be
replaced by the 2-character sequence 'rn'; and the 3-character sequence
'iii' *may* be replaced by the 1-character sequence 'm'.
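
The difference between mandatory and optional rules can be illustrated with
a small Python sketch. It is a simplified model under stated assumptions,
not a description of how Tesseract applies ambiguities during recognition:
the `Rule` tuple, `EXAMPLE_RULES`, `apply_mandatory` and
`optional_alternatives` are hypothetical names, mandatory rules are applied
in one left-to-right pass, and each optional alternative contains exactly
one substitution.

```python
from typing import Iterator, List, Sequence, Tuple

# (source unichars, target unichars, mandatory?), taken from the example.
Rule = Tuple[Tuple[str, ...], Tuple[str, ...], bool]

EXAMPLE_RULES: List[Rule] = [
    (("'", "'"), ('"',), True),
    (("m",), ("r", "n"), False),
    (("i", "i", "i"), ("m",), False),
]


def apply_mandatory(unichars: Sequence[str], rules: Sequence[Rule]) -> List[str]:
    """Replace every occurrence of a mandatory source sequence, left to right."""
    out: List[str] = []
    i = 0
    while i < len(unichars):
        for source, target, mandatory in rules:
            if mandatory and tuple(unichars[i : i + len(source)]) == source:
                out.extend(target)
                i += len(source)
                break
        else:
            # No mandatory rule matched here: keep the unichar unchanged.
            out.append(unichars[i])
            i += 1
    return out


def optional_alternatives(
    unichars: Sequence[str], rules: Sequence[Rule]
) -> Iterator[List[str]]:
    """Yield one alternative reading per optional match (one substitution each)."""
    for i in range(len(unichars)):
        for source, target, mandatory in rules:
            if not mandatory and tuple(unichars[i : i + len(source)]) == source:
                yield (
                    list(unichars[:i])
                    + list(target)
                    + list(unichars[i + len(source) :])
                )


if __name__ == "__main__":
    print("".join(apply_mandatory(list("''modern''"), EXAMPLE_RULES)))
    for alternative in optional_alternatives(list("miii"), EXAMPLE_RULES):
        print("".join(alternative))
```

In this model a mandatory rule rewrites the input itself, while an optional
rule only contributes extra candidate readings alongside the original.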

Tesseract 3.03 and later support a newer, simpler format for the
unicharambigs file:

...............................
v2
'' " 1
m rn 0
iii m 0
...............................

In this format, the "error" and "correction" are simple UTF-8 strings
separated by a space, followed, after another space, by the same type
specifier as in v1 (0 for optional and 1 for mandatory substitution). The
downside of this simpler format is that Tesseract has to encode the UTF-8
strings into the components of the unicharset. In complex scripts, this
encoding may be ambiguous. In that case, the encoding is chosen so as to use
the fewest UTF-8 characters for each component, i.e. the shortest unicharset
components make up the encoding.
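
The encoding step can be sketched in Python as follows. This is one possible
reading of "the shortest unicharset components", not Tesseract's encoder:
`encode_shortest_first` and `parse_v2_line` are hypothetical names, the
unicharset is modelled as a plain set of strings, component length is
measured in Unicode code points, and the search tries shorter components
first and backtracks when the rest of the string cannot be encoded.

```python
from typing import List, Optional, Set, Tuple


def encode_shortest_first(text: str, unicharset: Set[str]) -> Optional[List[str]]:
    """Split text into unicharset entries, trying shorter components first.

    Returns None when no complete segmentation exists.
    """
    if not text:
        return []
    max_len = max((len(u) for u in unicharset), default=0)
    for length in range(1, min(max_len, len(text)) + 1):
        head = text[:length]
        if head in unicharset:
            rest = encode_shortest_first(text[length:], unicharset)
            if rest is not None:
                return [head] + rest
    return None


def parse_v2_line(
    line: str, unicharset: Set[str]
) -> Tuple[List[str], List[str], bool]:
    """Parse 'error correction type' and encode both strings as unichars."""
    fields = line.split()
    if len(fields) != 3 or fields[2] not in ("0", "1"):
        raise ValueError("expected: <error> <correction> <0|1>")
    source = encode_shortest_first(fields[0], unicharset)
    target = encode_shortest_first(fields[1], unicharset)
    if source is None or target is None:
        raise ValueError("string cannot be encoded with this unicharset")
    return source, target, fields[2] == "1"


if __name__ == "__main__":
    # Toy unicharset for illustration only.
    toy = {"'", '"', "m", "r", "n", "i", "f", "fi"}
    for text in ["'' \" 1", "m rn 0", "iii m 0"]:
        print(parse_v2_line(text, toy))
    print(encode_shortest_first("fi", toy))
```

With the toy unicharset above, the string "fi" can be encoded either as the
single entry "fi" or as "f" followed by "i"; the shortest-first search in
this sketch picks the second.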

HISTORY
-------
The unicharambigs file first appeared in Tesseract 3.00. Prior to that, a
similar format called DangAmbigs ('dangerous ambiguities') was used. That
format was almost identical, except that only mandatory replacements could
be specified and field five was absent.

BUGS
----
This is a documentation "bug": it's not currently clear what should be done
in the case of ligatures (such as 'fi') which may also appear as regular
letters in the unicharset.

SEE ALSO
--------
tesseract(1), unicharset(5)

https://tesseract-ocr.github.io/tessdoc/Training-Tesseract-3.03%E2%80%933.05.html#the-unicharambigs-file

AUTHOR
------
The Tesseract OCR engine was written by Ray Smith and his research groups
at Hewlett Packard (1985-1995) and Google (2006-2018).