WORDLIST2DAWG(1)
================
:doctype: manpage

NAME
----
wordlist2dawg - convert a wordlist to a DAWG for Tesseract

SYNOPSIS
--------
*wordlist2dawg* 'WORDLIST' 'DAWG' 'lang.unicharset'

*wordlist2dawg* -t 'WORDLIST' 'DAWG' 'lang.unicharset'

*wordlist2dawg* -r 1 'WORDLIST' 'DAWG' 'lang.unicharset'

*wordlist2dawg* -r 2 'WORDLIST' 'DAWG' 'lang.unicharset'

*wordlist2dawg* -l <short> <long> 'WORDLIST' 'DAWG' 'lang.unicharset'
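
When the tool is driven from a script, the invocation forms above can be assembled as an argument list. The following is a minimal sketch: the helper name `build_command`, its keyword arguments, and the `eng.*` file names are hypothetical, and only the flags themselves come from the synopsis. Because the synopsis lists each option on its own line, the sketch conservatively refuses to combine them; whether the tool itself accepts combinations is not stated here.

```python
import shlex


def build_command(wordlist, dawg, unicharset, *, verify=False, reverse=None, lengths=None):
    """Return an argv list that follows one of the synopsis forms.

    Hypothetical helper: only the flags (-t, -r 1, -r 2, -l <short> <long>)
    and the order of the three positional arguments come from the synopsis.
    """
    selected = sum([verify, reverse is not None, lengths is not None])
    if selected > 1:
        # Assumption of this sketch: one option per invocation.
        raise ValueError("choose at most one of verify, reverse, lengths")

    argv = ["wordlist2dawg"]
    if verify:
        argv.append("-t")
    elif reverse is not None:
        if reverse not in (1, 2):
            raise ValueError("reverse must be 1 or 2")
        argv += ["-r", str(reverse)]
    elif lengths is not None:
        short, long_ = lengths
        argv += ["-l", str(short), str(long_)]
    return argv + [wordlist, dawg, unicharset]


# Illustrative file names only.
cmd = build_command("eng.wordlist", "eng.word-dawg", "eng.unicharset", reverse=1)
print(shlex.join(cmd))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` on a system where wordlist2dawg is installed.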

DESCRIPTION
-----------
wordlist2dawg(1) converts a wordlist to a Directed Acyclic Word Graph
(DAWG) for use with Tesseract. A DAWG is a compressed, space- and
time-efficient representation of a word list.
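
To illustrate why a DAWG is more compact than a plain prefix tree, the sketch below builds a trie and then merges subtrees that accept the same set of suffixes. It is a conceptual illustration only, not Tesseract's implementation or file format; all function names are hypothetical.

```python
def build_trie(words):
    """Build a plain trie; each node records an end-of-word flag and its children."""
    root = {"end": False, "next": {}}
    for word in words:
        node = root
        for ch in word:
            node = node["next"].setdefault(ch, {"end": False, "next": {}})
        node["end"] = True
    return root


def count_trie_nodes(node):
    """Count the nodes of the trie, including the root."""
    return 1 + sum(count_trie_nodes(child) for child in node["next"].values())


def minimize(node, registry):
    """Assign one id per distinct subtree, working bottom-up.

    Two nodes with the same signature accept the same suffixes, so they
    can be shared; the number of ids is the node count of the DAWG.
    """
    children = tuple(
        sorted((ch, minimize(child, registry)) for ch, child in node["next"].items())
    )
    signature = (node["end"], children)
    return registry.setdefault(signature, len(registry))


words = ["tap", "taps", "top", "tops"]
trie = build_trie(words)
registry = {}
minimize(trie, registry)

print(count_trie_nodes(trie), len(registry))  # prints "8 5"
```

In this small example the shared endings "p" and "ps" are stored once rather than once per prefix, which is where the space saving comes from.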

OPTIONS
-------
-t
Verify that a given DAWG file is equivalent to a given wordlist.

-r 1
Reverse a word if it contains an RTL (right-to-left) character.

-r 2
Reverse all words.

-l <short> <long>
Produce a file with several DAWGs in it, one each for words
of length <short>, <short+1>, ..., <long>.
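
The effect of the -r and -l options on a wordlist can be approximated in a few lines. This is a sketch under stated assumptions, not the tool's code: it treats a character as RTL when its Unicode bidirectional class is `R` or `AL`, reverses by code point, measures length in code points, and drops words outside the requested range. wordlist2dawg itself works with the entries of the unicharset, so its behaviour may differ in each of these respects (for example, for words with combining marks). All function names are hypothetical.

```python
import unicodedata


def has_rtl_char(word):
    """Assumption: 'RTL character' means bidirectional class R or AL."""
    return any(unicodedata.bidirectional(ch) in ("R", "AL") for ch in word)


def apply_reverse_policy(word, policy=0):
    """policy 0: keep as-is; 1: reverse only words with an RTL character; 2: reverse all."""
    if policy == 2 or (policy == 1 and has_rtl_char(word)):
        return word[::-1]  # code-point reversal, a simplification
    return word


def group_by_length(words, short, long_):
    """Bucket words by length, one bucket per length from short to long_ inclusive.

    Assumption: words outside the range are simply left out.
    """
    groups = {n: [] for n in range(short, long_ + 1)}
    for word in words:
        if len(word) in groups:
            groups[len(word)].append(word)
    return groups


hebrew = "\u05e9\u05dc\u05d5\u05dd"  # a four-letter Hebrew word
print(apply_reverse_policy("word", 1) == "word")          # Latin word is kept under -r 1
print(apply_reverse_policy(hebrew, 1) == hebrew[::-1])    # RTL word is reversed under -r 1
print(apply_reverse_policy("word", 2))                    # every word is reversed under -r 2
print(group_by_length(["a", "to", "an", "cat", "house"], 2, 3))
```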

ARGUMENTS
---------
'WORDLIST'
A plain text file in UTF-8, one word per line.

'DAWG'
The output DAWG to write. With -t, the existing DAWG file to verify
against 'WORDLIST'.

'lang.unicharset'
The unicharset of the language. This is the unicharset
generated by mftraining(1).
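
A 'WORDLIST' file in the format described above (UTF-8, one word per line) can be produced with a short script. The sketch below is one way to do it; the helper name `write_wordlist` and the file name `demo.wordlist` are hypothetical. Skipping blank entries and duplicates and rejecting entries that contain whitespace are conservative cleanup choices made for this example, not documented requirements of wordlist2dawg.

```python
import os
import tempfile


def write_wordlist(words, path):
    """Write words to path as UTF-8, one word per line; return the number written."""
    seen = set()
    lines = []
    for word in words:
        word = word.strip()
        if not word or word in seen:
            continue  # cleanup choice of this sketch: drop blanks and repeats
        if any(ch.isspace() for ch in word):
            raise ValueError(f"not a single word: {word!r}")
        seen.add(word)
        lines.append(word)
    # newline="\n" keeps line endings identical on every platform.
    with open(path, "w", encoding="utf-8", newline="\n") as handle:
        for line in lines:
            handle.write(line + "\n")
    return len(lines)


with tempfile.TemporaryDirectory() as tmp:
    target = os.path.join(tmp, "demo.wordlist")
    written = write_wordlist(["cat", "dog", "", "cat", " na\u00efve "], target)
    with open(target, encoding="utf-8") as handle:
        text = handle.read()

print(written)  # prints 3
```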

SEE ALSO
--------
tesseract(1), combine_tessdata(1), dawg2wordlist(1)

<https://tesseract-ocr.github.io/tessdoc/Training-Tesseract.html>

COPYING
-------
Copyright \(C) 2006 Google, Inc.
Licensed under the Apache License, Version 2.0

AUTHOR
------
The Tesseract OCR engine was written by Ray Smith and his research groups
at Hewlett Packard (1985-1995) and Google (2006-2018).