
Creating a dataset

After defining our model, the next step is to train it. This requires that we provide the model with a sufficient number of example documents. Documents can be grouped together in datasets, so before we start uploading documents, let's create a dataset.

las datasets create --name "Receipts" --description "Initial training data"
{
  "datasetId": "<datasetId>",
  "description": "Initial training data",
  "name": "Receipts",
  "numberOfDocuments": 0,
  "createdTime": "2021-08-16T12:53:13.374930+0000",
  "updatedTime": null,
  "createdBy": "<appClientId>",
  "updatedBy": null,
  "retentionInDays": 1825,
  "storageLocation": "EU",
  "containsPersonallyIdentifiableInformation": true,
  "version": 0
}

Adding documents to a dataset

After the dataset is created, we can start uploading documents and assigning them to our dataset. Since we want to use the documents for training, we'll also provide ground truth values that define the correct output of the model for each document. Note that the labels in the ground truth must match the field names we defined in our model.

caution

It is important to have correct ground truth values for every document we use for training. Errors in the ground truth can degrade the training process as the model may learn to make the same mistakes.

las documents create receipt.pdf --dataset-id <datasetId> --ground-truth-fields total_amount=300.00 date=2020-02-28
{
  "documentId": "<documentId>",
  "contentType": "application/pdf",
  "retentionInDays": 1825,
  "createdTime": "2021-08-16T13:34:13.724393+0000",
  "updatedTime": null,
  "createdBy": "<appClientId>",
  "updatedBy": null,
  "datasetId": "<datasetId>",
  "groundTruth": [
    {
      "label": "total_amount",
      "value": "300.00"
    },
    {
      "label": "date",
      "value": "2020-02-28"
    }
  ]
}
info

You can also assign your documents to a new dataset later through the API.
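Because mislabeled ground truth degrades training, it can pay off to sanity-check labels before uploading. Below is a minimal sketch of such a check in Python; the set of field names is an assumption based on the example above, not anything returned by the API.

```python
# Sanity-check ground-truth labels against the field names defined in the model.
# EXPECTED_FIELDS is an assumption taken from the example in this guide.
EXPECTED_FIELDS = {"total_amount", "date"}

def invalid_labels(ground_truth):
    """Return the labels that do not match any field defined in the model."""
    return [item["label"] for item in ground_truth
            if item["label"] not in EXPECTED_FIELDS]

ground_truth = [
    {"label": "total_amount", "value": "300.00"},
    {"label": "dat", "value": "2020-02-28"},  # typo: should be "date"
]
print(invalid_labels(ground_truth))  # ['dat']
```

Running a check like this over all ground-truth files before uploading catches typos that would otherwise silently become training errors.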

Uploading many documents

Uploading documents one by one can be useful for testing purposes, but is not recommended for large-scale training datasets. For large-scale datasets we recommend using the datasets create-documents command from the CLI. This lets you upload your dataset in a fast and consistent way, without having to loop over all the documents yourself.

Alternative 1: Documents and ground truths with same file name prefix

In order to upload all the documents in a folder, the following naming convention must be used:

  1. Each document in the folder must have a corresponding ground truth file in JSON or YAML format.
  2. The ground truth is provided in the following format:
[
  {
    "label": "total_amount",
    "value": "100.00"
  },
  {
    "label": "due_date",
    "value": "2021-10-30"
  },
  {
    "label": "vendor_name",
    "value": "Company X"
  }
]
  3. The ground truth must have the same file name as the document. So if your document is named a.pdf, the ground truth must be named a.json, a.yaml or a.yml. Your folder will need to look something like this:
my/new/training/data
├── invoice_a.pdf
├── invoice_a.json
├── invoice_b.png
├── invoice_b.json
├── invoice_c.png
└── invoice_c.json

When you have structured your data according to these three points, you are ready to start uploading.

las datasets create-documents <datasetId> my/new/training/data
info

If some of your documents are missing ground truths, they will simply be skipped.
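Since documents without ground truths are skipped, it can be useful to list them before uploading. Here is a minimal sketch that follows the naming convention above (a.pdf → a.json/a.yaml/a.yml); the document extensions checked are an assumption for illustration.

```python
# List documents in a folder that lack a matching ground-truth file,
# following the stem-based naming convention described above.
from pathlib import Path

DOCUMENT_EXTENSIONS = {".pdf", ".png", ".jpg", ".jpeg", ".tiff"}  # assumption
GROUND_TRUTH_EXTENSIONS = (".json", ".yaml", ".yml")

def missing_ground_truths(folder):
    """Return names of documents without a ground-truth file of the same stem."""
    folder = Path(folder)
    missing = []
    for doc in sorted(folder.iterdir()):
        if doc.suffix.lower() in DOCUMENT_EXTENSIONS:
            has_gt = any((folder / (doc.stem + ext)).exists()
                         for ext in GROUND_TRUTH_EXTENSIONS)
            if not has_gt:
                missing.append(doc.name)
    return missing
```

Running this on my/new/training/data before create-documents tells you exactly which documents would be skipped.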

Alternative 2: Use a JSON file to specify ground truths for all documents

The other alternative is to provide a file that lists the paths and ground truths of all the documents we want to upload; let us call it upload-specification.json. Below is an example of how this file would look if we only want to upload two documents.

{
  "path/to/document1.pdf": {
    "ground_truth": [
      {
        "label": "total_amount",
        "value": "100.00"
      },
      {
        "label": "due_date",
        "value": "2021-10-30"
      },
      {
        "label": "vendor_name",
        "value": "Company X"
      }
    ]
  },
  "path/to/document2.png": {
    "ground_truth": [
      {
        "label": "total_amount",
        "value": "200.00"
      },
      {
        "label": "due_date",
        "value": "2021-11-30"
      },
      {
        "label": "vendor_name",
        "value": "Company Y"
      }
    ]
  }
}

In this file, documents and ground truths are represented as a dictionary. Each key is the path to a document you want to upload, and each value contains the corresponding ground truth. We are now ready to upload all the documents and ground truths with the create-documents command.

las datasets create-documents <datasetId> upload-specification.json

This command handles interruptions gracefully and can be safely restarted without having to worry about duplicate documents being uploaded.
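A specification file like the one above can also be generated programmatically. The sketch below builds it from document/ground-truth pairs that share a file-name stem (as in Alternative 1); the folder path and the document extensions are assumptions for illustration.

```python
# Build an upload-specification.json from documents whose ground truths
# live in JSON files with the same stem (e.g. invoice_a.pdf + invoice_a.json).
import json
from pathlib import Path

def build_specification(folder):
    """Map each document path to the ground truth parsed from its JSON twin."""
    spec = {}
    for gt_file in sorted(Path(folder).glob("*.json")):
        for ext in (".pdf", ".png"):  # assumed document extensions
            doc = gt_file.with_suffix(ext)
            if doc.exists():
                spec[str(doc)] = {"ground_truth": json.loads(gt_file.read_text())}
    return spec

spec = build_specification("my/new/training/data")
Path("upload-specification.json").write_text(json.dumps(spec, indent=2))
```

The resulting file can then be passed directly to las datasets create-documents as shown above.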

Alternative 3: Using the web app

From Cradl you can upload data in three simple steps.

  1. Go to Datasets > YourDataset > Upload Data and add the local files you want to upload.

Screenshot of upload choice

  2. Verify that the data you want to upload contains ground truth by looking at the section Files selected.

Screenshot of upload status

  3. Press Start upload in the upper right corner to upload the data.