Training an invoice model
In this guide you will learn how to train your own invoice model with Cradl. We will use the following sample dataset, but you can also use your own dataset when following this guide.
The sample invoices are for demonstration purposes only. To train a good model for invoice parsing, we recommend that you use a real dataset with invoices extracted from, for example, a database or an ERP system.
To follow this guide, you will need a Cradl account. If you don't have one, you can sign up for a free account here.
1. Configure your model
Log in to Cradl and do the following:
- Go to Models > New model.
- Give your model a name and optionally a description.
- Add three fields, invoice_id, total_amount and due_date, to your model. Make sure that the field types are String, Amount and Date, respectively.
- Click Create Model when you are done.
2. Prepare training data
Now that you have a model, the next step is to prepare training data. Download the sample invoices and unzip them to a local folder. Notice that the files are structured in pairs such as invoice1.pdf and invoice1.json. The JSON file is called a ground truth, and each invoice in your dataset must have a corresponding ground truth file.
Here is an example of what an invoice and its corresponding ground truth file may look like:
- invoice-east-repair.json - Ground Truth
- invoice-east-repair.pdf - Invoice
Make sure that the labels in your ground truth files exactly match the names of the fields you defined in your model.
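Before uploading, it can help to check the folder locally. The sketch below is not part of Cradl; it is a small hypothetical helper (`check_dataset`) that looks for PDFs without a ground truth, ground truths without a PDF, and labels that do not match the field names from step 1. It assumes that each ground truth file is a flat JSON object mapping label to value, so compare it against the sample files and adjust if your files are shaped differently.

```python
import json
from pathlib import Path

# Field names defined in step 1 of this guide.
EXPECTED_FIELDS = {"invoice_id", "total_amount", "due_date"}


def check_dataset(folder, expected_fields=EXPECTED_FIELDS):
    """Return a list of human-readable problems found in the dataset folder."""
    folder = Path(folder)
    problems = []
    for pdf in sorted(folder.glob("*.pdf")):
        truth = pdf.with_suffix(".json")
        if not truth.exists():
            problems.append(f"{pdf.name}: missing {truth.name}")
            continue
        data = json.loads(truth.read_text(encoding="utf-8"))
        # ASSUMPTION: a ground truth is a flat JSON object of label -> value.
        if not isinstance(data, dict):
            problems.append(f"{truth.name}: expected a JSON object")
            continue
        unknown = set(data) - set(expected_fields)
        if unknown:
            problems.append(f"{truth.name}: unknown labels {sorted(unknown)}")
    # Also flag ground truth files that have no matching invoice.
    for truth in sorted(folder.glob("*.json")):
        pdf = truth.with_suffix(".pdf")
        if not pdf.exists():
            problems.append(f"{truth.name}: missing {pdf.name}")
    return problems
```

Calling `check_dataset("path/to/unzipped/folder")` returns an empty list when every file is paired and every label is one of the expected field names.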
In the example above, notice that the due date is written as an ISO-formatted date "2019-02-26" even though on the document it is written as "26/02/2019". Similarly, an Amount field would be written as "1500.36" in the ground truth even if it says e.g. "$1,500.36" on the document.
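If your source values are stored the way they appear on the document, they need to be normalized before they go into a ground truth. The following is a minimal sketch of that conversion; the function names are hypothetical, and it assumes day/month/year dates and amounts that use "," as the thousands separator and "." as the decimal separator.

```python
import re
from datetime import datetime
from decimal import Decimal


def to_iso_date(text, fmt="%d/%m/%Y"):
    """Convert a date as printed on the document to ISO format (YYYY-MM-DD)."""
    return datetime.strptime(text.strip(), fmt).date().isoformat()


def to_plain_amount(text):
    """Drop currency symbols and thousands separators from an amount.

    ASSUMPTION: "," separates thousands and "." separates decimals.
    """
    cleaned = re.sub(r"[^\d.\-]", "", text)
    # Decimal raises an error on malformed input instead of passing it through.
    return format(Decimal(cleaned), "f")


print(to_iso_date("26/02/2019"))     # 2019-02-26
print(to_plain_amount("$1,500.36"))  # 1500.36
```

Invoices that use a different date order or decimal convention need a different `fmt` or separator handling, so check a few values by hand before converting a whole dataset.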
Using historical data
Ground truths only consist of labels and values and do not require positional information like bounding boxes. Therefore they can often be generated automatically by extracting invoices from e.g. a database or an ERP system. This enables you to train on thousands or even millions of documents without having to go through a manual labelling process.
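As a rough illustration of generating ground truths automatically, the sketch below writes one JSON file per record next to the matching PDF name. Everything in it is an assumption for illustration: the `records` list stands in for rows exported from your own database or ERP system, the values are made up, and the ground truth is assumed to be a flat JSON object of label to value, which you should verify against the sample files.

```python
import json
from pathlib import Path

# Hypothetical rows, e.g. the result of a query against your own system.
# "file" is the name of the invoice PDF; the other keys are the model's fields.
records = [
    {
        "file": "invoice1.pdf",
        "invoice_id": "INV-001",
        "total_amount": "1500.36",
        "due_date": "2019-02-26",
    },
]


def write_ground_truths(records, folder):
    """Write invoice1.json for invoice1.pdf, and so on. Returns the paths written."""
    folder = Path(folder)
    written = []
    for record in records:
        # ASSUMPTION: a ground truth is a flat JSON object of label -> value.
        truth = {label: value for label, value in record.items() if label != "file"}
        target = (folder / record["file"]).with_suffix(".json")
        target.write_text(json.dumps(truth, indent=2), encoding="utf-8")
        written.append(target)
    return written
```

Running `write_ground_truths(records, "path/to/dataset")` in the folder that holds the PDFs produces the file pairs described in step 2.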
3. Start training
Now we are ready to train our model.
- Go to Models > Your model > Training jobs > New training.
- Choose Create new dataset. Give your dataset a descriptive name.
- Upload the dataset you prepared in the previous step. Remember to include both ground truths (JSON) and PDFs.
- Generate a data report. The data report enables you to inspect your training data to ensure that everything is correct.
- Leave the default settings and press Start model training to send a training request.
Congratulations, you have now started your first training! Once the training is complete, you can start using your model in production.
4. Make predictions
Once you have a trained model, you are ready to start making predictions.