Navigation:  »No topics above this level«

OCR

Top 

The following are the rules available when the OCR option is selected in the recognition profile:

 

Optional

When you select this rule, the system will recognize the specific area or the full page and extract the captured value. See below an example of its application:

 

1.In the recognition profile, after setting an ID # and a name, the image of a customer receipt was imported.

 

2.In the Indexes panel, an index called "Company name" was configured.

 

3.In the index details, it was defined that the recognition would be done considering a specific area. After that, the OCR option and the Optional rule was selected.

 

4.In the preview panel, the index was delimited in the place where, commonly, the company name is found, in a customer receipt.

 

Thus, in the Image captured field, it is possible to see that it was only considered the specific area of the customer receipt. In the Data recognized field, the name of the company that issued the customer receipt is displayed:

 

 

Substring

By selecting this rule, the system performs the recognition in the specific area or the full page and extracts the captured value according to the delimitations and instances configured. See below an example of its application:

 

1.In the recognition profile, after setting an ID # and a name for it, the image of a contract was imported.

 

2.In the Indexes panel, an index called "Phone # - Company" was configured.

 

3.In the index details, it was set that the recognition would be made considering the full page.

 

4.After that, the OCR option and the Substring rule was selected, and the following fields were filled out:

Fields

Capture from

It was filled out with the words preceding the Federal EIN # of the contracting company.

To

It was filled out with the words that succeed the Federal EIN # of the contracting company.

Occurrence

The number 2 was entered, so that only in the second time that the "registered number" excerpt appears in the text, it is considered in the capture.

 

Thus, in the Image captured field, it is possible to view that the entire contract page that was considered. In the Data recognized field, the author's ID number is displayed, as specified in the contract:

 

 

Line information

By selecting this rule, the system performs the recognition of a particular row in the specific area or the full page and extracts the captured value according to the delimitations and instances configured. See below an example of its application:

 

1.In the recognition profile, after setting an ID # and a name, the image of a customer receipt was imported.

 

2.In the Indexes panel, was configured an index called "Form of payment".

 

3.In the index details, it was set that the recognition would be made considering the full page.

 

4.After that, the OCR option and Line information rule was selected, and the following fields were filled out:

Fields

Capture from

It was filled out with the first word of the line preceding the line in which the form of payment is commonly found in a customer receipt.

The line number

It was informed the number of the line that succeeds the value reported in the "Capture from", whose value must be extracted. In that case, number 1 was informed because you wish to capture the next line.

Occurrence

The number 1 was entered, so the recognition is done the first time that the word "Total" appears on the customer receipt.

 

Thus, in the Image captured field, it is possible to view that the entire customer receipt page that was considered. In the Data recognized field, the system presents the payment form and the total amount paid as specified in the customer receipt:

 

 

Offset

When selecting this rule, the system performs recognition within a sensitive area of the specific area or the full page, depending on the delimitations and instances configured. See below an example of its application:

 

1.In the recognition profile, after setting an ID # and a name for it, the image of a DANFE was imported.

 

2.In the Indexes panel, an index called "Total" was configured.

 

3.In the index details, was defined that the recognition would be done considering a specific area.

 

4.After that, the OCR option and the Offset rule was selected, and the following fields were filled out:

Fields

Capture from

It was filled with the words preceding the total amount of the customer receipt.

Search area

In the Position field, it was set that recognition should be made next to where the "Total" excerpt is located. After that, enter the width and the height of the search area, that is, of the area where the desired value is located.

Occurrence

The number 1 was informed, so that the recognition is performed at the first time that the word "Total" appears on the customer receipt.

 

Thus, in the Image captured field, it is possible to view that it was only considered the specific area of the customer receipt. In the Data recognized field, the total value of the receipt is displayed:

 

 

Mask

When selecting this rule, the system performs the recognition in the specific area or the full page and extracts the captured value according to the regular expression configured. See below an example of its application:

 

1.In the recognition profile, after setting an ID # and a name, the image of a customer receipt was imported.

 

2.In the Indexes panel, an index called "Total" was configured.

 

3.In the index details, it was set that the recognition would be made considering the full page.

 

4.After that, the OCR option and Mask rule was selected, and the following fields were filled out:

Fields

Regular expression

It was filled with the regular expression referring to the composition of a phone number. See below how to assemble regular expressions.

Occurrence

The number 1 was entered, so the recognition is done the first time the result of the regular expression finds similarity in the customer receipt.

 

Thus, in the Image captured field, it is possible to view that the entire customer receipt page that was considered. In the Data recognized field, the telephone of the company that issued the customer receipt is displayed:

 

 

Building a regular expression

A regular expression is a notation for describing a pattern of characters. It serves to validate data inputs or to search and extract information in texts. For example, to verify if an entered piece of data is a number from 0.00 to 9.99, it is possible to use the regular expression ^\d,\d\d$ because the \d symbol is a wildcard character that matches one digit. The ^ and $ special characters indicate, respectively, how the string must start and end; without them, the numbers 10,00 or 100,123 would be valid because they contain digits that match the regular expression.

A metacharacter is a character or a sequence of characters with special meaning in the regular expressions. Metacharacters can be categorized according to their use.

 

In regular expressions, the verb 'marry' is used as a translation for the match, in order to combine, fit, and pare.

 

Specifiers

Specify the set of characters to be married in a position.

Metacharacter

Description

.

Wildcard: Matches any character except the \n line break.

[...]

Set: Matches any character added in the set. For example:

[a-z] will accept strings with lowercase characters between 'a' and 'z', while [A-Z] accepts uppercase characters between 'A' and 'Z'.

[abcABC] will accept strings that contain only the 'a', 'b', 'c', 'A', 'B' and/or 'C' characters.

[123] will accept strings that contain only the '1', '2' and/or '3' characters;

[0-9] will accept strings with characters between '0' and '9'.

[^...]

Denied set: Matches any character that is not included in the set

\d

Digit: the same as [0-9].

\D

Non-digit: the same as [^0-9].

\s

Whitespace character: space, line break, tabs etc.; the same as [\t\n\r\f\v].

\S

Non-whitespace character: the same as [^ \t\n\r\f\v].

\w

Alphanumeric: the same as [a-zA-Z0-9_] (but may include Unicode characters)

\W

Non-alphanumeric: the complement of \w.

\

Escape: annuls the special meaning of a metacharacter; for example, \. represents only a point, and not the wildcard character.

 

Quantifiers

They define the allowed number of repetitions for the regular expression right before it.

Metacharacter

Description

{n}

Allow exactly n occurrences. For example:

[abc]{3}: Accepts strings containing 3 characters, such as 'a', 'b' or 'c', such as: aaa, abc, acb, bba, etc.

[0-9]{5}: Accepts 5-character strings between '0' and '9', such as: 11111, 12345, 15973, etc.

{n,m}

Allows at least n occurrences and at most m. For example:

[abc]{3,5}: Accepts strings containing between 3 and 5 characters, such as 'a', 'b' or 'c', such as: aaaaa, acbca, abc, acba, etc.

[0-9]{5,6}: Accepts strings containing 5 or 6 characters between '0' and '9', such as: 12345, 123456, 01030, 000000, etc.

{n,}

Allows at least n occurrences. For example:

[abc]{2,}: Accepts strings containing at least 2 characters, such as 'a', 'b' or 'c', such as: aa, abc, ccc, abcabc, etc.

[0-9]{2,}: Accepts strings containing at least 2 characters between '0' and '9', such as: 12, 123, 987654321, etc.

?

Allows 0 or 1 occurrence; the same as {0,1}.

+

Allows 1 or more occurrences; the same as {1,}.

*

Allows 0 or more occurrences.

 

Anchors

They establish reference positions for the matching of the remainder of the regular expression. Notice that these metacharacters do not match characters in the text, but rather with positions before, after, or between characters.

Metacharacter

Description

^

Matches the beginning of a string.

$

Matches the end of a string; does not capture the \n at the end of the text or line.

\A

Beginning of the text.

\Z

End of the text.

\b

Boundary position: Encounters a match at the beginning or end of a string;

\B

Non-boundary position.

 

Grouping

It defines groups or alternatives.

Metacharacter

Description

(...)

Defines a group, for the purpose of applying a quantifier, alternative or later extraction or reuse.

...|...

Alternative; matches the regular expression to the right or to the left.

\«n»

Retrieves the text matched in the nth group.

 

Examples: To provide a general idea, see some examples with a brief explanation:

\d{5}-\d{3}

The pattern of a zip code like 05432-001: 5 digits, a - (hyphen) and 3 more digits. The sequence \d is a metacharacter, a wildcard character that matches a digit (0 to 9). The sequence {5} is a quantifier: it indicates that the previous pattern must be repeated 5 times, so \d{5} is the same as \d\d\d\d\d.

[012]\d:[0-5]\d

Similar to the hour and minute format, such as 03:10 or 23:59. The sequence between brackets [012] defines a set. In that case, the set specifies that the first character must be 0, 1, or 2. Inside the [], the hyphen indicates a range of characters; that is, [0-5] is a short form for the set [012345]. The set that represents all the digits, [0-9], is the same as \d. Notice that this regular expression also accepts the text 29:00 which is not a valid time.

[A-Z]{3}-\d{4}

It is the standard for a license plate in Brazil: three letters from A and Z, followed by a - (hyphen), followed by four digits, such as CKD-4592.

 

Fixed value

By selecting this rule, the system allows the user to define a certain value to be used in the classification of the information generated from the recognition profile in question. See below an example of its application:

 

1.In the recognition profile, after setting an ID # and a name for it, the image of an electricity bill was imported.

 

2.In the Indexes panel, an index has been configured with the name "Company".

 

3.In the Rule field of the index details, the Fixed value option was selected. Then, in the Return field, the name of the "Electricity company" was entered.

 

With that, for example, it is possible to parameterize, in the capture configuration, that all documents generated from the recognition profile (with the fixed value configured) will automatically be recorded in the "Electricity bills" category.