OCR-Based Solution to Retrieve Data from Receipts

Client

The Client is a provider of personalized solutions in the field of banking and finance. The Client was looking for data extraction services to enhance its business apps with machine learning.

Challenge: information extraction from receipts using machine learning

The Client provided us with a mobile application designed to store digital receipts. The challenge was to enhance this app by applying machine learning.

Our data scientists decided to employ optical character recognition (OCR) technology to train an algorithm that extracts key data from raw images. They also used classical computer vision methods to improve image quality before applying OCR to the receipts.

Solution: OCR-based solution for processing semi-structured data from receipts

The first step was to preprocess digital images for data extraction. Our team used computer vision to read receipts.
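As an illustration of what such preprocessing can look like, the sketch below binarizes a grayscale image with Otsu's method, a classical technique for separating dark ink from light paper. The case study does not say which methods the team used, so this is an assumption; the function names are hypothetical, and the code is written in plain Python so it runs without extra libraries. OpenCV offers the same operation through cv2.threshold with the cv2.THRESH_OTSU flag.

```python
def otsu_threshold(pixels):
    """Return the gray level (0-255) that best separates dark ink from light paper."""
    hist = [0] * 256
    for value in pixels:
        hist[value] += 1
    total = len(pixels)
    sum_all = sum(level * count for level, count in enumerate(hist))

    best_level, best_variance = 0, -1.0
    weight_bg, sum_bg = 0, 0
    for level in range(256):
        weight_bg += hist[level]
        if weight_bg == 0:
            continue  # no pixels at or below this level yet
        weight_fg = total - weight_bg
        if weight_fg == 0:
            break  # every pixel is already in the dark class
        sum_bg += level * hist[level]
        mean_bg = sum_bg / weight_bg
        mean_fg = (sum_all - sum_bg) / weight_fg
        # Between-class variance (up to a constant factor); higher is better.
        variance = weight_bg * weight_fg * (mean_bg - mean_fg) ** 2
        if variance > best_variance:
            best_variance, best_level = variance, level
    return best_level


def binarize(image):
    """Map a grayscale image (rows of 0-255 ints) to 1 = ink, 0 = background."""
    level = otsu_threshold([p for row in image for p in row])
    return [[1 if p <= level else 0 for p in row] for row in image]


# Tiny made-up image: light paper (about 200) with a few dark ink pixels (about 40).
sample = [[200, 210, 40],
          [205, 35, 45]]
ink_mask = binarize(sample)
```

The resulting mask marks ink pixels with 1, which is a convenient input for the area selection and symbol segmentation steps described next.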

Semi-structured text in receipts may contain not only plain text but also figures, titles, tables, and non-text elements. Receipt text also varies in fonts, symbols, column layout, and so on. These peculiarities degrade character recognition. The way out was to focus on careful selection of areas, accurate extraction of data from each area, and synthesis of the results.
Our team developed a solution that splits a receipt into several areas, or boxes, extracts the data column by column, processes it, and moves it automatically to the required forms or the Client’s CRM.
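One common way to split a page into columns is a vertical projection profile: count the ink pixels at each horizontal position and cut wherever a sufficiently wide blank gap appears. The sketch below shows that idea on a binary mask (1 = ink). It is an assumption for illustration, not the team's documented algorithm; split_columns, crop, and the min_gap parameter are hypothetical names.

```python
def split_columns(binary, min_gap=2):
    """Return (start, end) x-ranges of text columns; end is exclusive.

    A column is a run of x positions that contain ink. Runs separated by
    fewer than min_gap blank positions are treated as the same column.
    """
    width = len(binary[0])
    # Ink count at each x position, summed over all rows.
    profile = [sum(row[x] for row in binary) for x in range(width)]

    columns = []
    start = None  # x where the current column began, or None between columns
    gap = 0       # number of consecutive blank positions seen so far
    for x, ink in enumerate(profile):
        if ink:
            if start is None:
                start = x
            gap = 0
        elif start is not None:
            gap += 1
            if gap >= min_gap:
                # Close the column just after its last inked position.
                columns.append((start, x - gap + 1))
                start, gap = None, 0
    if start is not None:
        columns.append((start, width - gap))
    return columns


def crop(binary, column):
    """Cut one column out of the mask so it can be sent to OCR separately."""
    start, end = column
    return [row[start:end] for row in binary]


# Made-up mask with two text columns separated by three blank positions.
mask = [[1, 1, 0, 1, 0, 0, 0, 1, 1],
        [1, 0, 0, 1, 0, 0, 0, 0, 1]]
columns = split_columns(mask)
```

Each cropped column could then be recognized on its own and the results combined, for example item names from one column and prices from another. The same profile computed over rows instead of columns would split a receipt into horizontal bands.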

Another challenge was to extract the account number digits and the routing number. Banks often use specialized fonts in which a single symbol consists of multiple parts. In addition, text on cheques may get rubbed off or fade, which can make recognition difficult in some cases.

Our team came up with a method that made it possible to compute a bounding box for each symbol automatically. This approach allowed treating each symbol as a separate image and extracting the whole number with high accuracy.
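The case study does not describe the method itself. A minimal sketch of one possible approach, under stated assumptions: find connected ink regions in a binary mask, then merge regions whose horizontal ranges overlap, so that a symbol printed as several separate parts ends up in a single box. The names component_boxes, merge_symbol_parts, and max_gap are hypothetical. With OpenCV, the first step could be done with cv2.connectedComponentsWithStats.

```python
from collections import deque


def component_boxes(binary):
    """Bounding boxes (x0, y0, x1, y1), inclusive, of 8-connected ink regions."""
    height, width = len(binary), len(binary[0])
    seen = [[False] * width for _ in range(height)]
    boxes = []
    for y in range(height):
        for x in range(width):
            if not binary[y][x] or seen[y][x]:
                continue
            # Breadth-first flood fill from an unvisited ink pixel.
            seen[y][x] = True
            queue = deque([(x, y)])
            x0, y0, x1, y1 = x, y, x, y
            while queue:
                cx, cy = queue.popleft()
                x0, y0 = min(x0, cx), min(y0, cy)
                x1, y1 = max(x1, cx), max(y1, cy)
                for ny in range(max(cy - 1, 0), min(cy + 2, height)):
                    for nx in range(max(cx - 1, 0), min(cx + 2, width)):
                        if binary[ny][nx] and not seen[ny][nx]:
                            seen[ny][nx] = True
                            queue.append((nx, ny))
            boxes.append((x0, y0, x1, y1))
    return boxes


def merge_symbol_parts(boxes, max_gap=0):
    """Merge boxes whose x-ranges overlap, or that start within max_gap
    pixels of the previous box's right edge. Returns boxes left to right."""
    merged = []
    for box in sorted(boxes):
        if merged and box[0] <= merged[-1][2] + max_gap:
            last = merged[-1]
            merged[-1] = (last[0], min(last[1], box[1]),
                          max(last[2], box[2]), max(last[3], box[3]))
        else:
            merged.append(box)
    return merged


# Made-up mask: two symbols, each printed as two vertically stacked parts.
mask = [[1, 0, 0, 1, 1],
        [0, 0, 0, 0, 0],
        [1, 0, 0, 1, 1]]
parts = component_boxes(mask)
symbols = merge_symbol_parts(parts)
```

Here the four detected parts collapse into two symbol boxes. Each box can be cropped and classified as its own small image, and reading the boxes left to right reassembles the full number.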

Result: automation of routine tasks and improved overall performance

We helped the Client automate its data extraction processes. The Client received an OCR-based solution capable of eliminating time-consuming and error-prone manual work, including the processing of financial transaction data from receipts.

This custom solution can be used to improve the efficiency of back-office workflows. By relieving employees of routine tasks, the Client can reassign more talent to business-critical work that needs human supervision.

Challenge	Information extraction from receipts using machine learning
Solution	OCR-based solution for processing semi-structured data from receipts
Technologies and tools	OpenCV, Python, C++, ABBYY Cloud OCR SDK

