  1. UNICHARSET(5)
  2. =============
  3. :doctype: manpage
  4. NAME
  5. ----
  6. unicharset - character properties file used by tesseract(1)
  7. DESCRIPTION
  8. -----------
  9. Tesseract's unicharset file contains information on each symbol
  10. (unichar) the Tesseract OCR engine is trained to recognize.
  11. A unicharset file (e.g. 'eng.unicharset') is distributed as part of a
  12. Tesseract language pack (e.g. 'eng.traineddata'). For information on
  13. extracting the unicharset file, see combine_tessdata(1).
  14. The first line of a unicharset file contains the number of unichars in
  15. the file. After this line, each subsequent line provides information for
  16. a single unichar. The first such line contains a placeholder reserved for
  17. the space character. Each unichar is referred to within Tesseract by its
  18. Unichar ID, which is its zero-based position among the unichar lines (the
  19. 1-based line number within the file, minus 2). Therefore, space gets Unichar ID 0.
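
As a rough illustration of the ID assignment described above, the following Python sketch reads the count line and numbers the unichar lines from 0. It is not Tesseract's own loader: the helper name `unichar_ids` is hypothetical, it assumes the unichar string is the first space-separated field of each line, and the sample reuses three lines of the v3.02 example in this page with the count shortened to 3.

```python
def unichar_ids(text):
    """Map each unichar string to its Unichar ID (hypothetical helper)."""
    lines = text.splitlines()
    count = int(lines[0].split()[0])  # first line: number of unichars
    ids = {}
    # Unichar lines start right after the count line and are numbered from 0.
    for uid, line in enumerate(lines[1:1 + count]):
        character = line.split(" ")[0]  # assumed: first field is the unichar
        ids[character] = uid
    return ids


sample = (
    "3\n"
    "NULL 0 NULL 0\n"
    "N 5 59,68,216,255,87,236,0,27,104,227 Latin 11 0 1 N\n"
    "Y 5 59,68,216,255,91,205,0,47,91,223 Latin 33 0 2 Y\n"
)
print(unichar_ids(sample))  # {'NULL': 0, 'N': 1, 'Y': 2}
```

The placeholder line for the space character ends up with ID 0, and the IDs of "N" and "Y" match the values in their own 'mirror' fields.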
  20. Each unichar line in the unicharset file (v2+) may have four space-separated fields:
  21. 'character' 'properties' 'script' 'id'
  22. Starting with Tesseract v3.02, more information may be given for each unichar:
  23. 'character' 'properties' 'glyph_metrics' 'script' 'other_case' 'direction' 'mirror' 'normed_form'
  24. Entries:
  25. 'character':: The UTF-8 encoded string to be produced for this unichar.
  26. 'properties':: An integer bit mask of character properties, one per bit,
  27. written in hexadecimal. From least to most significant bit, these are:
  28. isalpha, islower, isupper, isdigit, ispunctuation.
  29. 'glyph_metrics':: Ten comma-separated integers representing various standards
  30. for where this glyph is to be found within a baseline-normalized coordinate
  31. system where 128 is normalized to x-height.
  32. * min_bottom, max_bottom: the range in which the bottom of the character
  33. may be found.
  34. * min_top, max_top: the range in which the top of the character may be found.
  35. * min_width, max_width: the range of the horizontal width of the character.
  36. * min_bearing, max_bearing: the range of how far from the usual start position
  37. the leftmost part of the character begins.
  38. * min_advance, max_advance: the range of how far from the left of the printer's
  39. cell we advance to begin the next character.
  40. 'script':: Name of the script (Latin, Common, Greek, Cyrillic, Han, null).
  41. 'other_case':: The Unichar ID of the other case version of this character
  42. (upper or lower).
  43. 'direction':: The Unicode BiDi direction of this character, as defined by
  44. ICU's enum UCharDirection. (0 = Left to Right, 1 = Right to Left,
  45. 2 = European Number...)
  46. 'mirror':: The Unichar ID of the BiDirectional mirror of this character.
  47. For example, the mirror of an open paren is a close paren, but Latin Capital C
  48. has no mirror, so its mirror remains Latin Capital C itself.
  49. 'normed_form':: The UTF-8 representation of a "normalized form" of this unichar
  50. for the purpose of blaming a module for errors given ground truth text.
  51. For instance, a left or right single quote may normalize to an ASCII quote.
  52. EXAMPLE (v2)
  53. ------------
  54. ..............
  55. ; 10 Common 46
  56. b 3 Latin 59
  57. W 5 Latin 40
  58. 7 8 Common 66
  59. = 0 Common 93
  60. ..............
  61. ";" is a punctuation character. Its properties are thus represented by the
  62. binary number 10000 (10 in hexadecimal).
  63. "b" is an alphabetic character and a lower case character. Its properties are
  64. thus represented by the binary number 00011 (3 in hexadecimal).
  65. "W" is an alphabetic character and an upper case character. Its properties are
  66. thus represented by the binary number 00101 (5 in hexadecimal).
  67. "7" is just a digit. Its properties are thus represented by the binary number
  68. 01000 (8 in hexadecimal).
  69. "=" is neither punctuation, nor a digit, nor an alphabetic character. Its properties
  70. are thus represented by the binary number 00000 (0 in hexadecimal).
  71. Japanese or Chinese alphabetic character properties are represented by the
  72. binary number 00001 (1 in hexadecimal): they are alphabetic, but neither
  73. upper nor lower case.
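
The bit assignments walked through above can be checked with a short Python sketch. It assumes only what this page states: the 'properties' field is hexadecimal, and bit 0 is isalpha through bit 4 for ispunctuation. The names `PROPERTY_BITS` and `decode_properties` are hypothetical and are not part of Tesseract.

```python
# Property names from least to most significant bit, as listed under Entries.
PROPERTY_BITS = ("isalpha", "islower", "isupper", "isdigit", "ispunctuation")


def decode_properties(field):
    """Return the property names set in a hexadecimal 'properties' field."""
    mask = int(field, 16)
    return [name for bit, name in enumerate(PROPERTY_BITS) if mask & (1 << bit)]


# The five unichars of the v2 example with their 'properties' fields.
for character, field in [(";", "10"), ("b", "3"), ("W", "5"), ("7", "8"), ("=", "0")]:
    print(character, decode_properties(field))
```

For ";" this yields only ispunctuation, for "b" isalpha and islower, and for "=" an empty list, matching the explanations above.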
  74. EXAMPLE (v3.02)
  75. ---------------
  76. ..................................................................
  77. 110
  78. NULL 0 NULL 0
  79. N 5 59,68,216,255,87,236,0,27,104,227 Latin 11 0 1 N
  80. Y 5 59,68,216,255,91,205,0,47,91,223 Latin 33 0 2 Y
  81. 1 8 59,69,203,255,45,128,0,66,74,173 Common 3 2 3 1
  82. 9 8 18,66,203,255,89,156,0,39,104,173 Common 4 2 4 9
  83. a 3 58,65,186,198,85,164,0,26,97,185 Latin 56 0 5 a
  84. . . .
  85. ..................................................................
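
To show how the eight fields of a v3.02 line fit together, here is a minimal Python sketch that splits one line of the example above into named fields. It is an illustration under stated assumptions, not the parser Tesseract uses: the `Unichar` type and `parse_unichar_line` function are hypothetical, the ten 'glyph_metrics' integers are assumed to appear in the order in which they are listed under Entries, and only the full eight-field form is handled (the four-field placeholder line is rejected).

```python
from typing import NamedTuple

# Assumed order of the ten glyph_metrics integers (the order listed under Entries).
METRIC_NAMES = (
    "min_bottom", "max_bottom", "min_top", "max_top", "min_width",
    "max_width", "min_bearing", "max_bearing", "min_advance", "max_advance",
)


class Unichar(NamedTuple):
    character: str
    properties: int      # bit mask, parsed from hexadecimal
    glyph_metrics: dict  # metric name -> integer
    script: str
    other_case: int      # Unichar ID
    direction: int       # value of ICU's UCharDirection
    mirror: int          # Unichar ID
    normed_form: str


def parse_unichar_line(line):
    """Parse one eight-field unichar line (hypothetical helper)."""
    fields = line.split()
    if len(fields) != 8:
        raise ValueError("expected 8 fields, got %d" % len(fields))
    character, props, metrics, script, other_case, direction, mirror, normed = fields
    values = [int(v) for v in metrics.split(",")]
    if len(values) != len(METRIC_NAMES):
        raise ValueError("expected 10 glyph metrics, got %d" % len(values))
    return Unichar(character, int(props, 16), dict(zip(METRIC_NAMES, values)),
                   script, int(other_case), int(direction), int(mirror), normed)


entry = parse_unichar_line("a 3 58,65,186,198,85,164,0,26,97,185 Latin 56 0 5 a")
print(entry.script, entry.other_case, entry.mirror, entry.glyph_metrics["min_top"])
```

For the "a" line, 'mirror' is 5, which is the line's own Unichar ID in the example, and 'other_case' is 56, the ID of a unichar that lies in the elided part of the listing.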
  86. CAVEATS
  87. -------
  88. Although the unicharset reader maintains the ability to read unicharsets
  89. of older formats and will assign default values to missing fields,
  90. the accuracy will be degraded.
  91. Further, most other data files are indexed by the unicharset file,
  92. so changing it without re-generating the others is likely to have dire
  93. consequences.
  94. HISTORY
  95. -------
  96. The unicharset format first appeared with Tesseract 2.00, which was the
  97. first version to support languages other than English. The unicharset file
  98. contained only the first two fields, and the "ispunctuation" property was
  99. absent (punctuation was regarded as "0", as "=" is in the v2 example above).
  100. SEE ALSO
  101. --------
  102. tesseract(1), combine_tessdata(1), unicharset_extractor(1)
  103. <https://tesseract-ocr.github.io/tessdoc/Training-Tesseract.html>
  104. AUTHOR
  105. ------
  106. The Tesseract OCR engine was written by Ray Smith and his research groups
  107. at Hewlett Packard (1985-1995) and Google (2006-2018).