How can you build a model that reads out receipts and invoices?

Question

The objective is to build a model that is capable of identifying information on receipts and invoices that can look completely different.

I've had a discussion with my brother about the right approach. I have attached an example: the original first, and below it the same receipt with the important information marked in boxes:

The green boxes are the must-have information. The box in purple and green indicates that we need either one or the other. The orange information would be nice to have but is not strictly required. Some of the boxes depend on context and are linked to one another.

From a data set point of view, we have a sample size of 1,000 receipts and all of them have the necessary information extracted. We could increase the sample size further if that was required.

The approach that I would have chosen:
Treat every one of the receipt images like a game and let the model figure out by itself how to arrive at the right conclusion. This will most likely be very compute-intensive, but I feel it will be more robust when dealing with new image types.

The approach my brother has suggested:
Basically, use the boxes that I've provided and let the model learn from them. The model would then learn to identify the important areas on a receipt or invoice and go from there. He compared it to a model that identifies license plates.

Edit: Just to reiterate why I think OCR is only part of the solution but not the solution itself. Here is the Adobe Acrobat OCR result:

Perfect if you ask me. It just doesn't help me figure out what values to use and which to ignore. I don't want to do this manually. I want the model(s) to return for me:

Total amount
Sales Tax & Amount
Creditor (i.e. the company and ideally the tax identifier CHE-xxxxx MWST)
Date and ideally time
Payment method

Does this make more sense now? I just don't see how OCR gets me there. It will only be the method to extract the values.

You could run OCR to convert image to text and then parse the text result. — Nat, May 19 '18 at 14:12
But this would hardly be smart. The receipt variety is substantial. I'd expect the model to partially use OCR, but using it exclusively would not get to 80%+ accuracy. — Spurious, May 19 '18 at 14:28
You presumably came here for advice, and you got some. I encourage you to try it before rejecting it. If nothing else, it will help you edit your question to elaborate on your requirements and show in what ways OCR is not suitable, and demonstrate that you are trying to help yourself. — D.W., May 19 '18 at 18:13
@Spurious I think you and your brother's approaches are hardly smart :) Did you try Google? TAGGUN uses OCR to extract text and then machine learning to classify keywords https://www.taggun.io — Nat, May 20 '18 at 01:39
I've seen taggun, are you behind it by any chance? Could you elaborate why the approaches are not smart? OCR + keyword extraction is a very brute-forcey way of doing things. The exact same receipt in Spanish would deliver different results. This is the problem I currently have with taggun: it never recognizes the sales tax and it has difficulty with anything but the total amount. Here is the result from taggun: https://imgur.com/Ma8Cyt0 I feel that the approach is not efficient in the long run. — Spurious, May 20 '18 at 09:33
Sean Owen, I believe it belongs to the same question and that you can understand this as an answer if you know how to use the algorithm inside the link. I'm still trying to figure this out. Any ideas? Is this algorithm available? — António, Oct 11 '18 at 00:58

score 1 · Answer 1 · answered May 19 '18 at 18:12

You should start by trying OCR. I am not sure why you are rejecting it without trying it. Start with Tesseract (free but not very good) and then try a commercial OCR as well (Abbyy is well regarded, but you could also try Adobe Acrobat). Also, research document structure extraction.
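As a starting point for trying Tesseract, here is a minimal sketch using the `pytesseract` wrapper, assuming it, Pillow and the Tesseract binary are installed. The `Word` type and both function names are hypothetical helpers introduced for illustration. It keeps each recognized word together with its position on the page, which is what any later field-extraction step would need.

```python
from typing import NamedTuple


class Word(NamedTuple):
    """One recognized word with its bounding box (hypothetical helper type)."""
    text: str
    left: int
    top: int
    width: int
    height: int
    conf: float


def words_from_tesseract_dict(data: dict, min_conf: float = 0.0) -> list[Word]:
    """Turn the column-oriented dict from pytesseract.image_to_data into Word rows."""
    words = []
    for i, text in enumerate(data["text"]):
        conf = float(data["conf"][i])  # Tesseract reports -1 for layout-only rows
        if text.strip() and conf >= min_conf:
            words.append(Word(text.strip(), data["left"][i], data["top"][i],
                              data["width"][i], data["height"][i], conf))
    return words


def ocr_words(image_path: str) -> list[Word]:
    """Run Tesseract on an image file and return its words with positions."""
    # Imported lazily so the pure helper above works without OCR installed.
    import pytesseract
    from PIL import Image

    data = pytesseract.image_to_data(
        Image.open(image_path), output_type=pytesseract.Output.DICT
    )
    return words_from_tesseract_dict(data)
```

Calling `ocr_words("receipt.png")` (a placeholder file name) would return a flat list of words. Raising `min_conf` in the helper is one simple way to drop low-confidence tokens from poor scans.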

The idea that you are going to re-invent a machine learning model that does OCR better than existing OCR solutions seems not very realistic. Current OCR tools have years or decades of engineering put into them. There is no way you are going to be able to afford to put that much effort into your own custom tool.

Image quality on those example images is not wonderful, and you might have to accept that accuracy will be less than 100%.

D.W.

I appreciate the input and maybe I can further elaborate why I don't think OCR is the full solution. I didn't discount it and it will be used to extract the values. But maybe you can help me further understand how OCR will help me determine the right context of each of those boxes. Maybe I have the wrong idea about OCR, but it can only help you read out the values. It will not determine whether something is the total or a sub total. Am I wrong? – Spurious May 19 '18 at 23:28
Maybe to further expand. I have access to Adobe Acrobat and here is the result of the OCR: https://i.imgur.com/NUanUhs.png As far as I am concerned, it's a perfect result. But I am still left with figuring out what the sales tax is and which amount, what the total amount is, who the creditor is, what the payment method is and what the date is. – Spurious May 19 '18 at 23:33
@Spurious, some OCR tools can output for each letter *where* on the page it was found. Then if you apply document structure extraction, you might be able to use that to figure out each of the fields you have listed. It is likely that OCR will be an important building block. And it would have been helpful to have included that information in the question, so that we don't waste your time by telling you something you already know. – D.W. May 20 '18 at 03:02
No harm, no foul, I wrongly thought that my intent was obvious which is a stupid thing to think, apologies! Do you have any other pointers besides document structure extraction? – Spurious May 20 '18 at 09:39
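To illustrate the point above about using word positions to tell fields apart, here is a rough sketch that works on OCR output given as words with bounding boxes. Everything in it is an assumption made for illustration: the `Word` type, the sample coordinates, the amount pattern and the label names. It only shows a hand-written rule (find a label, take the nearest amount to its right on the same line), not a learned model, and label words would differ by language.

```python
import re
from typing import NamedTuple, Optional


class Word(NamedTuple):
    """One OCR word with its bounding box in pixels (hypothetical type)."""
    text: str
    left: int
    top: int
    width: int
    height: int


# Assumed amount format: digits, a decimal separator, two decimals.
AMOUNT = re.compile(r"^\d+[.,]\d{2}$")


def same_line(a: Word, b: Word) -> bool:
    """True if the vertical centers of a and b are within half a line height."""
    center_a = a.top + a.height / 2
    center_b = b.top + b.height / 2
    return abs(center_a - center_b) <= max(a.height, b.height) / 2


def amount_right_of(words: list[Word], label: str) -> Optional[str]:
    """Return the nearest amount-like token to the right of `label` on its line."""
    for anchor in words:
        if anchor.text.strip(":").lower() != label.lower():
            continue
        candidates = [
            w for w in words
            if w.left > anchor.left and same_line(anchor, w) and AMOUNT.match(w.text)
        ]
        if candidates:
            return min(candidates, key=lambda w: w.left).text
    return None


# Made-up word boxes standing in for real OCR output.
sample = [
    Word("Subtotal", 10, 280, 80, 20), Word("11.60", 200, 281, 50, 18),
    Word("Total", 10, 310, 50, 20), Word("12.50", 200, 311, 50, 18),
    Word("MWST", 10, 340, 50, 20), Word("7.7%", 100, 341, 40, 18),
    Word("0.90", 200, 341, 40, 18),
]

print(amount_right_of(sample, "Total"))  # 12.50
print(amount_right_of(sample, "MWST"))   # 0.90
```

Because the label match is exact, "Subtotal" is not mistaken for "Total". With the 1,000 labelled receipts mentioned in the question, the hand-written rule could in principle be replaced by a classifier over the same text and position features.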
