public class TessBaseAPI
Java interface for the Tesseract OCR engine. Does not implement all available JNI methods, but does implement enough to be useful. Comments are adapted from original Tesseract source.
public static java.lang.String VAR_CHAR_WHITELIST
Whitelist of characters to recognize.
public static java.lang.String VAR_CHAR_BLACKLIST
Blacklist of characters to not recognize.
public static java.lang.String VAR_SAVE_BLOB_CHOICES
Save blob choices allowing us to get alternative results.
public static java.lang.String VAR_TRUE
String value used to assign a boolean variable to true.
public static java.lang.String VAR_FALSE
String value used to assign a boolean variable to false.
public static int OEM_TESSERACT_ONLY
Run Tesseract only - fastest
public static int OEM_LSTM_ONLY
Run LSTM only - better accuracy, but slower
public static int OEM_TESSERACT_LSTM_COMBINED
Run both and combine results - best accuracy
public static int OEM_DEFAULT
Default OCR engine mode.
public TessBaseAPI()
Constructs an instance of TessBaseAPI.
When the instance of TessBaseAPI is no longer needed, its end() method must be invoked to dispose of it.
public TessBaseAPI(com.googlecode.tesseract.android.TessBaseAPI.ProgressNotifier progressNotifier)
Constructs an instance of TessBaseAPI with a callback method for receiving progress updates during OCR.
When the instance of TessBaseAPI is no longer needed, its end() method must be invoked to dispose of it.
progressNotifier - Callback to receive progress notifications
public boolean init(java.lang.String datapath,
java.lang.String language)
Initializes the Tesseract engine with a specified language model. Returns true on success.
Instances are now mostly thread-safe and totally independent, but some global parameters remain. Basically it is safe to use multiple TessBaseAPIs in different threads in parallel, UNLESS you use SetVariable on some of the Params in classify and textord. If you do, then the effect will be to change it for all your instances.
The datapath must be the name of the parent directory of tessdata and must end in / . Any name after the last / will be stripped. The language is (usually) an ISO 639-3 string; null defaults to eng. It is entirely safe (and eventually will be efficient too) to call Init multiple times on the same instance to change language, or just to reset the classifier.
The language may be a string of the form [~]<lang>[+[~]<lang>]* indicating that multiple languages are to be loaded. E.g. hin+eng will load Hindi and English. Languages may specify internally that they want to be loaded with one or more other languages, so the ~ sign is available to override that. E.g. if hin were set to load eng by default, then hin+~eng would force loading only hin. The number of loaded languages is limited only by memory, with the caveat that loading additional languages will impact both speed and accuracy, as there is more work to do to decide on the applicable language, and there is more chance of hallucinating incorrect words.
WARNING: On changing languages, all Tesseract parameters are reset back to their default values. (Which may vary between languages.)
If you have a rare need to set a Variable that controls initialization for a second call to Init you should explicitly call End() and then use SetVariable before Init. This is only a very rare use case, since there are very few uses that require any parameters to be set before Init.
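The datapath and language conventions described above amount to plain string handling, which the following sketch illustrates. The class TessInitArgs and its methods are hypothetical names introduced here for illustration; they are not part of TessBaseAPI, and the sketch assumes the caller passes the results on to init.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of TessBaseAPI.
final class TessInitArgs {
    private TessInitArgs() {}

    // Returns the parent directory of tessdata with a trailing forward slash,
    // the form the init documentation asks for.
    static String normalizeDatapath(String parentDir) {
        return parentDir.endsWith("/") ? parentDir : parentDir + "/";
    }

    // Builds a language string such as "hin+eng" or "hin+~eng".
    // Codes in 'load' are joined with '+'; codes in 'suppress' are added
    // with a leading '~' to override a language's built-in default companions.
    static String languageSpec(List<String> load, List<String> suppress) {
        List<String> parts = new ArrayList<>(load);
        for (String code : suppress) {
            parts.add("~" + code);
        }
        return String.join("+", parts);
    }
}
```

With a TessBaseAPI instance named api (an assumed variable), the results could be passed as api.init(TessInitArgs.normalizeDatapath(dir), TessInitArgs.languageSpec(load, suppress)).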
datapath - the parent directory of tessdata ending in a forward slash
language - an ISO 639-3 string representing the language(s)
Returns true on success.
public boolean init(java.lang.String datapath,
java.lang.String language,
int ocrEngineMode)
Initializes the Tesseract engine with the specified language model(s). Returns true on success.
datapath - the parent directory of tessdata ending in a forward slash
language - an ISO 639-3 string representing the language(s)
ocrEngineMode - the OCR engine mode to be set
Returns true on success.
See also init(String, String).
public java.lang.String getInitLanguagesAsString()
Returns the languages string used in the last valid initialization. If the last initialization specified "deu+hin" then that will be returned. If hin loaded eng automatically as well, then that will not be included in this list.
public void clear()
Frees up recognition results and any stored image data, without actually freeing any recognition data that would be time-consuming to reload. Afterwards, you must call SetImage or SetRectangle before doing any Recognize or Get* operation.
public void end()
Closes down Tesseract and frees up all memory. End() is equivalent to destructing and reconstructing your TessBaseAPI.
Once End() has been used, none of the other API functions may be used other than Init and anything declared above it in the class definition.
public boolean setVariable(java.lang.String var,
java.lang.String value)
Set the value of an internal "parameter."
Supply the name of the parameter and the value as a string, just as you would in a config file.
Returns false if the name lookup failed.
Eg setVariable("tessedit_char_blacklist", "xyz"); to ignore x, y and z. Or setVariable("classify_bln_numeric_mode", "1"); to set numeric-only mode.
setVariable may be used before init, but settings will revert to defaults on end().
Note: Must be called after init(). Only works for non-init variables.
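Because setVariable reports a failed name lookup through its boolean return value, a caller applying several parameters may want to collect the names that were rejected. The sketch below shows one way to do that. VariableTarget is a hypothetical stand-in interface that mirrors the documented setVariable signature so the example compiles without the Tesseract library; VariableBatch is likewise an illustrative name.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in mirroring the documented setVariable(String, String)
// signature; it is not part of the library.
interface VariableTarget {
    boolean setVariable(String var, String value);
}

// Hypothetical helper, not part of TessBaseAPI.
final class VariableBatch {
    private VariableBatch() {}

    // Applies each name/value pair in iteration order and returns the names
    // for which setVariable returned false (name lookup failed).
    static List<String> apply(VariableTarget target, Map<String, String> settings) {
        List<String> rejected = new ArrayList<>();
        for (Map.Entry<String, String> entry : settings.entrySet()) {
            if (!target.setVariable(entry.getKey(), entry.getValue())) {
                rejected.add(entry.getKey());
            }
        }
        return rejected;
    }
}
```

Since the stand-in has a single method with the same shape as the documented one, a method reference such as api::setVariable (with api an assumed TessBaseAPI instance) could serve as the target.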
var - name of the variable
value - value to set
public int getPageSegMode()
Return the current page segmentation mode.
public void setPageSegMode(int mode)
Sets the page segmentation mode. Defaults to PageSegMode.PSM_SINGLE_BLOCK. This controls how much processing the OCR engine will perform before recognizing text.
The mode can also be modified by readConfigFile or setVariable("tessedit_pageseg_mode", mode as string).
mode - the TessBaseAPI.PageSegMode to set
public void setDebug(boolean enabled)
Sets debug mode. This controls how much information is displayed in the log during recognition.
enabled - true to enable debugging mode
public void setRectangle(android.graphics.Rect rect)
Restricts recognition to a sub-rectangle of the image. Call after SetImage. Each SetRectangle clears the recognition results so multiple rectangles can be recognized with the same image.
rect - the bounding rectangle
public void setRectangle(int left,
int top,
int width,
int height)
Restricts recognition to a sub-rectangle of the image. Call after SetImage. Each SetRectangle clears the recognition results so multiple rectangles can be recognized with the same image.
left - the left bound
top - the top bound
width - the width of the bounding box
height - the height of the bounding box
@WorkerThread public void setImage(java.io.File file)
Provides an image for Tesseract to recognize. Copies the image buffer. The source image may be destroyed immediately after SetImage is called. SetImage clears all recognition results, and sets the rectangle to the full image, so it may be followed immediately by a GetUTF8Text, and it will automatically perform recognition.
file - absolute path to the image file
@WorkerThread public void setImage(android.graphics.Bitmap bmp)
Provides an image for Tesseract to recognize. Copies the image buffer. The source image may be destroyed immediately after SetImage is called. SetImage clears all recognition results, and sets the rectangle to the full image, so it may be followed immediately by a GetUTF8Text, and it will automatically perform recognition.
bmp - bitmap representation of the image
@WorkerThread public void setImage(Pix image)
Provides a Leptonica pix format image for Tesseract to recognize. Clones the pix object. The source image may be destroyed immediately after SetImage is called, but its contents may not be modified.
image - Leptonica pix representation of the image
@WorkerThread public void setImage(byte[] imagedata,
int width,
int height,
int bpp,
int bpl)
Provides an image for Tesseract to recognize. Copies the image buffer. The source image may be destroyed immediately after SetImage is called. SetImage clears all recognition results, and sets the rectangle to the full image, so it may be followed immediately by a GetUTF8Text, and it will automatically perform recognition.
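For the raw-buffer overload, the relationship between the width, height, bpp (bytes per pixel), and bpl (bytes per line) arguments can be sketched as follows. This assumes a tightly packed buffer with no row padding, which the documentation does not state; RawImageLayout and its methods are hypothetical names, not part of TessBaseAPI.

```java
// Hypothetical helper, not part of TessBaseAPI.
final class RawImageLayout {
    private RawImageLayout() {}

    // Bytes per line for a tightly packed buffer (assumption: no row padding).
    // Throws ArithmeticException if the product overflows an int.
    static int packedBytesPerLine(int width, int bytesPerPixel) {
        return Math.multiplyExact(width, bytesPerPixel);
    }

    // Checks that a buffer of the given length holds 'height' lines of
    // 'bytesPerLine' bytes each.
    static boolean fits(int bufferLength, int height, int bytesPerLine) {
        return (long) height * bytesPerLine <= bufferLength;
    }
}
```

A caller might run such a check before handing the buffer to setImage, so that a mismatched size is caught in Java rather than in native code.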
imagedata - byte representation of the image
width - image width
height - image height
bpp - bytes per pixel
bpl - bytes per line
@WorkerThread public java.lang.String getUTF8Text()
The recognized text is returned as a String which is coded as UTF8. This is a blocking operation that will not work with stop(). Call getHOCRText(int) before calling this function to interrupt a recognition task with stop().
public int meanConfidence()
Returns the (average) confidence value between 0 and 100.
public int[] wordConfidences()
Returns all word confidences (between 0 and 100) in an array.
The number of confidences should correspond to the number of space-delimited words in GetUTF8Text().
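One use of this correspondence is to flag words whose confidence falls below a threshold. The sketch below assumes the confidences arrive as an int array and that the text is split on whitespace; since the counts only "should" correspond, it pairs entries up to the shorter of the two. LowConfidenceWords is a hypothetical name, not part of TessBaseAPI.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of TessBaseAPI.
final class LowConfidenceWords {
    private LowConfidenceWords() {}

    // Returns the words of 'text' whose confidence is below 'threshold'.
    // Words and confidences are paired by position; surplus entries on
    // either side are ignored.
    static List<String> below(String text, int[] confidences, int threshold) {
        String trimmed = text.trim();
        String[] words = trimmed.isEmpty() ? new String[0] : trimmed.split("\\s+");
        List<String> flagged = new ArrayList<>();
        int count = Math.min(words.length, confidences.length);
        for (int i = 0; i < count; i++) {
            if (confidences[i] < threshold) {
                flagged.add(words[i]);
            }
        }
        return flagged;
    }
}
```

For example, LowConfidenceWords.below("Hello wor1d again", new int[] {95, 42, 88}, 60) returns a list containing only "wor1d".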
public Pix getThresholdedImage()
Get a copy of the internal thresholded image from Tesseract.
Caller takes ownership of the Pix and must recycle() it. May be called any time after setImage.
public Pixa getRegions()
Returns the result of page layout analysis as a Pixa, in reading order.
Can be called before or after Recognize.
public Pixa getTextlines()
Returns the textlines as a Pixa. Textlines are extracted from the thresholded image.
Can be called before or after Recognize. Block IDs are not returned. Paragraph IDs are not returned.
public Pixa getStrips()
Get textlines and strips of image regions as a Pixa, in reading order.
Enables downstream handling of non-rectangular regions. Can be called before or after Recognize. Block IDs are not returned.
public Pixa getWords()
Get the words as a Pixa, in reading order.
Can be called before or after Recognize.
public Pixa getConnectedComponents()
Gets the individual connected (text) components (created after the page segmentation step, but before recognition) as a Pixa, in reading order.
Can be called before or after Recognize. Note: the caller is responsible for calling recycle() on the returned Pixa.
public ResultIterator getResultIterator()
Get a reading-order iterator to the results of LayoutAnalysis and/or Recognize. The returned iterator must be deleted after use.
@WorkerThread public java.lang.String getHOCRText(int page)
Make an HTML-formatted string with hOCR markup from the internal data structures. Interruptible by stop().
page - is 0-based but will appear in the output as 1-based.
public void setInputName(java.lang.String name)
Set the name of the input file. Needed for training and reading a UNLV zone file.
name - input file name
public void setOutputName(java.lang.String name)
Set the name of the bonus output files. Needed only for debugging.
name - output file name
public void readConfigFile(java.lang.String filename)
Read a "config" file containing a set of variable, value pairs.
Searches the standard places: tessdata/configs, tessdata/tessconfigs. Note: only non-init params will be set.
filename - the configuration filename, without the path
public java.lang.String getBoxText(int page)
The recognized text is returned as coded in the same format as a UTF8 box file used in training.
Constructs coordinates in the original image - not just the rectangle.
page - a 0-based page index that will appear in the box file.
public java.lang.String getVersion()
Returns the version identifier as a string.
public void stop()
Cancel recognition started by getHOCRText(int).
protected void onProgressValues(int percent,
int left,
int right,
int top,
int bottom,
int textLeft,
int textRight,
int textTop,
int textBottom)
Called from native code to update progress of ongoing recognition passes.
percent - Percent complete
left - Left bound of word bounding box
right - Right bound of word bounding box
top - Top bound of word bounding box
bottom - Bottom bound of word bounding box
textLeft - Left bound of text bounding box
textRight - Right bound of text bounding box
textTop - Top bound of text bounding box
textBottom - Bottom bound of text bounding box
public boolean beginDocument(TessPdfRenderer tessPdfRenderer, java.lang.String title)
Starts a new document. This clears the contents of the output data. Caller is responsible for escaping the provided title.
tessPdfRenderer - the renderer instance to use
title - a title to be used in the document metadata
Returns true on success, false on failure.
public boolean beginDocument(TessPdfRenderer tessPdfRenderer)
Starts a new document with no title.
tessPdfRenderer - the renderer instance to use
Returns true on success, false on failure.
See also beginDocument(TessPdfRenderer, String).
public boolean endDocument(TessPdfRenderer tessPdfRenderer)
Finishes the document and finalizes the output data. Invalid if beginDocument not yet called.
tessPdfRenderer - the renderer instance to use
Returns true on success, false on failure.
public boolean addPageToDocument(Pix imageToProcess, java.lang.String imageToWrite, TessPdfRenderer tessPdfRenderer)
Adds the given data to the opened document (if any).
imageToProcess - image to be used for OCR
imageToWrite - path to image to be written into resulting document
tessPdfRenderer - the renderer instance to use
Returns true on success, false on failure.
public kotlin.Array[] getOutputBuffer(TessPdfRenderer tessPdfRenderer)
tessPdfRenderer - the renderer instance to use