Automatic document processing

(invoices, contracts, CVs, official documents, etc.)

Context: manual entry

Manual entry remains widespread in many fields, such as accounting (invoices, purchase orders, slips, etc.), administration (registrations, files, supporting documents, etc.), finance (reports and publications), legal (contracts, statutes, etc.), and human resources (CVs, etc.).

It is not uncommon to see approximately one full-time equivalent (FTE) allocated for invoice processing for every 100 employees.

Problem:

Document processing and manual entry are tedious, error-prone tasks for which it is increasingly difficult to hire.

They represent a significant cost for companies in every sector.

Generic document processing solutions face several major challenges:

  • Privacy issue (SaaS):

    Solutions based on the Software as a Service (SaaS) model can raise data privacy concerns because sensitive documents are processed outside the company's internal infrastructure.

  • Limitation to a few types of documents:

    Generic solutions are often specialized in a small number of document types (typically invoices), which limits their usefulness for businesses with more diverse needs.

  • Lack of consensus on what “good extraction” is:

    Without a universal definition of what a “good extraction” is, results can vary and remain open to interpretation. For example, the string 1,000 means “one” in France, where the comma is the decimal separator, and “one thousand” in the United States, where it separates thousands. This diversity of formats and conventions is a significant challenge for generic solutions, which must be flexible enough to accommodate these nuances while keeping results consistent.
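The 1,000 example can be made concrete with a short Python sketch. It is only an illustration: the `CONVENTIONS` table and the `parse_amount` function are hypothetical names, and the two conventions listed are simplified assumptions about French and US number formatting.

```python
from decimal import Decimal

# Hypothetical convention table: (thousands separator, decimal separator).
# Assumption: French style looks like "1 000,50", US style like "1,000.50".
CONVENTIONS = {
    "fr": (" ", ","),
    "us": (",", "."),
}


def parse_amount(raw: str, convention: str) -> Decimal:
    """Parse a printed number under an explicitly chosen convention."""
    thousands, decimal_sep = CONVENTIONS[convention]
    # Treat no-break spaces (U+00A0, U+202F) like ordinary spaces.
    cleaned = raw.strip().replace("\u00a0", " ").replace("\u202f", " ")
    cleaned = cleaned.replace(thousands, "")
    cleaned = cleaned.replace(decimal_sep, ".")
    return Decimal(cleaned)


# The same characters give two different values.
print(parse_amount("1,000", "fr") == Decimal("1"))     # True
print(parse_amount("1,000", "us") == Decimal("1000"))  # True
```

The point of the sketch is that the convention has to be chosen explicitly per document type; the string alone does not say which reading is correct.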

Solution: OCR and LLM (Artificial Intelligence)

Our approach:

Using OCR (Optical Character Recognition) and Large Language Models (especially those that take layout into account), we have developed a tool that reads PDF files and automatically fills in the columns of a spreadsheet (Excel or Google Sheets).

What we need for it to work:
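As a rough illustration of the "PDF in, spreadsheet row out" idea, here is a minimal Python sketch. It is not the actual tool: `fill_sheet` and `fake_extract` are hypothetical names, the extraction step is replaced by canned values, and CSV text stands in for an Excel or Google Sheets file.

```python
import csv
import io
from typing import Callable, Dict, Iterable, List


def fill_sheet(
    pdf_paths: Iterable[str],
    extract: Callable[[str], Dict[str, str]],
    field_columns: List[str],
) -> str:
    """Write one spreadsheet row per document and return the sheet as CSV text."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=["file"] + field_columns, lineterminator="\n"
    )
    writer.writeheader()
    for path in pdf_paths:
        fields = extract(path)
        row = {"file": path}
        for column in field_columns:
            # A field the extractor could not find is left empty.
            row[column] = fields.get(column, "")
        writer.writerow(row)
    return buffer.getvalue()


def fake_extract(path: str) -> Dict[str, str]:
    """Stand-in for the OCR + language model step (assumption: canned values)."""
    canned = {
        "invoice_001.pdf": {"invoice_number": "INV-001", "total": "120.00"},
        "invoice_002.pdf": {"invoice_number": "INV-002"},
    }
    return canned.get(path, {})


sheet = fill_sheet(
    ["invoice_001.pdf", "invoice_002.pdf"], fake_extract, ["invoice_number", "total"]
)
print(sheet)
```

Passing the extractor as a function keeps the spreadsheet logic independent of whichever OCR and model are plugged in.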

The adaptation of the algorithm to a particular type of document is done using a labeled dataset.

Starting from a few hundred example documents and their expected results (for example, an Excel spreadsheet filled with the right values), we build a “corrected” dataset and use it to specialize an algorithm for the new task.

The correction step is a method developed by Scopeo to significantly reduce the amount of data required.
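The pairing of documents with their expected spreadsheet values might be assembled as in the Python sketch below. The function name `build_examples`, the `file` column, and the field names are assumptions made for illustration. The correction step itself is not shown, since its details are not described here.

```python
import csv
import io
from typing import Dict, List, Tuple


def build_examples(
    document_names: List[str], expected_csv: str
) -> Tuple[List[Dict[str, object]], List[str]]:
    """Pair each document with its expected values; also list unlabeled documents."""
    rows = {row["file"]: row for row in csv.DictReader(io.StringIO(expected_csv))}
    examples: List[Dict[str, object]] = []
    unlabeled: List[str] = []
    for name in document_names:
        if name in rows:
            # Every column except the file name is treated as a label.
            labels = {key: value for key, value in rows[name].items() if key != "file"}
            examples.append({"document": name, "labels": labels})
        else:
            unlabeled.append(name)
    return examples, unlabeled


# Hypothetical spreadsheet export: one row of expected values per document.
expected_csv = (
    "file,invoice_number,total\n"
    "invoice_001.pdf,INV-001,120.00\n"
    "invoice_002.pdf,INV-002,75.50\n"
)
examples, unlabeled = build_examples(
    ["invoice_001.pdf", "invoice_002.pdf", "invoice_003.pdf"], expected_csv
)
print(len(examples), unlabeled)
```

Reporting the unlabeled documents separately makes gaps in the spreadsheet visible before any training starts.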

[Figure: the OCR and PREDICT stages of automatic document processing]

Original document

First, the image is obtained by scanning the document. It contains text but also other printed elements such as table lines, logos, stamps, signatures, etc.

OCR analysis

We apply state-of-the-art OCR technologies to read the text and identify its position on the document.
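To show what "text plus position" can look like in practice, here is a small Python sketch. The `OcrWord` structure, its pixel coordinates, and the line-grouping tolerance are assumptions for illustration; real OCR engines each have their own output format.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class OcrWord:
    """One recognized word and its bounding box (hypothetical pixel units)."""
    text: str
    x: float       # left edge
    y: float       # top edge
    width: float
    height: float


def group_into_lines(words: List[OcrWord], tolerance: float = 5.0) -> List[str]:
    """Rebuild reading order: words with nearly the same top edge share a line."""
    lines: List[List[OcrWord]] = []
    for word in sorted(words, key=lambda w: (w.y, w.x)):
        if lines and abs(lines[-1][0].y - word.y) <= tolerance:
            lines[-1].append(word)
        else:
            lines.append([word])
    return [
        " ".join(w.text for w in sorted(line, key=lambda w: w.x)) for line in lines
    ]


words = [
    OcrWord("Total", 50, 200, 60, 12),
    OcrWord("120.00", 400, 202, 70, 12),
    OcrWord("Invoice", 50, 40, 80, 12),
    OcrWord("INV-001", 200, 41, 90, 12),
]
print(group_into_lines(words))  # ['Invoice INV-001', 'Total 120.00']
```

Keeping the coordinates alongside the text is what lets a later step reason about layout, for example that an amount sits to the right of a label.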

LLM Prediction

We use a large, specialized, layout-aware language model to find the right answer among the OCR results.
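Finding the right answer among the OCR results can be framed as scoring every OCR word and keeping the best one. The Python sketch below shows only that framing: `pick_answer` is a hypothetical helper, and `toy_total_score` is a hand-written rule standing in for the language model, not a description of how the model works.

```python
from typing import Callable, List, NamedTuple


class Word(NamedTuple):
    text: str
    x: float  # left edge
    y: float  # top edge


def pick_answer(words: List[Word], score: Callable[[Word, List[Word]], float]) -> Word:
    """Return the OCR word with the highest score for the requested field."""
    return max(words, key=lambda w: score(w, words))


def toy_total_score(word: Word, words: List[Word]) -> float:
    """Toy stand-in for the model: favor a number to the right of 'Total'."""
    if not word.text.replace(".", "", 1).isdigit():
        return 0.0
    for other in words:
        same_row = abs(other.y - word.y) <= 5
        if other.text.lower() == "total" and same_row and other.x < word.x:
            return 1.0
    return 0.1


words = [
    Word("Invoice", 50, 40),
    Word("2024", 200, 41),
    Word("Total", 50, 200),
    Word("120.00", 400, 202),
]
print(pick_answer(words, toy_total_score).text)  # 120.00
```

Because the answer is always one of the OCR words, the output can be traced back to a location on the page.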

Results

  • 300 invoices, without correction: average accuracy 68%

  • 7,000 invoices, without correction: average accuracy 70%

  • 300 invoices, with correction: average accuracy 90%

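With correction, 300 invoices reach 90% average accuracy, compared with 68% for the same number of invoices and 70% for 7,000 invoices without correction. This difference underlines the effectiveness of the method, which makes it possible to develop automatic document processing applications with a modest dataset and in a very short time.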
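How an average accuracy could be computed is sketched below in Python. The exact metric behind the figures above is not specified here, so this is an assumption: a field counts as correct only on an exact string match, accuracy is computed per field, and the per-field values are then averaged. All names and sample values are hypothetical.

```python
from typing import Dict, List


def field_accuracy(
    predicted: List[Dict[str, str]], expected: List[Dict[str, str]]
) -> Dict[str, float]:
    """Share of documents where each field exactly matches the expected value."""
    if len(predicted) != len(expected):
        raise ValueError("predicted and expected must have the same length")
    fields = sorted({name for row in expected for name in row})
    scores: Dict[str, float] = {}
    for name in fields:
        pairs = [
            (p.get(name), e[name]) for p, e in zip(predicted, expected) if name in e
        ]
        scores[name] = sum(p == e for p, e in pairs) / len(pairs)
    return scores


def average_accuracy(scores: Dict[str, float]) -> float:
    """Unweighted mean over fields (assumes at least one field)."""
    return sum(scores.values()) / len(scores)


expected = [
    {"invoice_number": "INV-001", "total": "120.00"},
    {"invoice_number": "INV-002", "total": "75.50"},
]
predicted = [
    {"invoice_number": "INV-001", "total": "120.00"},
    {"invoice_number": "INV-002", "total": "75.00"},  # wrong amount
]
scores = field_accuracy(predicted, expected)
print(scores, average_accuracy(scores))
```

Exact matching is a strict choice; a project could instead normalize values first (dates, number formats) before comparing, which ties back to agreeing on what a “good extraction” is.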
