Documents

A document in Cradl consists of an image or a PDF, along with some additional information. Documents serve two purposes: as data points when training models and as inputs when making predictions. A document that will be used for training needs a ground truth. To help you organize documents from different sources, we recommend that you group them in separate Datasets.

Creating a document

Tip: Allowed formats for documents are PDF, JPEG, PNG and TIFF.

las documents create path/to/my/document.pdf
{
  "documentId": "las:document:84ed1bb2d2634072bd3134274ed56ebe",
  "contentType": "application/pdf"
}

The returned documentId can be used together with a modelId to make a prediction on the document once a model has been trained. You can also set a ground truth for the document and add it to a Dataset to use it as training data for a model.
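The format restriction in the tip above can be checked locally before uploading. The following is a minimal sketch, not part of the Cradl tooling: the helper name guess_content_type and the suffix table are hypothetical, and only "application/pdf" is confirmed by the example output above. The image content types are standard MIME names and are assumed here.

```python
from pathlib import Path

# Assumed mapping from file suffix to content type. Only "application/pdf"
# appears in the example output above; the image entries are standard MIME
# names and are an assumption about what the service reports.
ALLOWED_CONTENT_TYPES = {
    ".pdf": "application/pdf",
    ".jpg": "image/jpeg",
    ".jpeg": "image/jpeg",
    ".png": "image/png",
    ".tif": "image/tiff",
    ".tiff": "image/tiff",
}


def guess_content_type(path):
    """Return the assumed content type for path, or None if the suffix is not allowed."""
    return ALLOWED_CONTENT_TYPES.get(Path(path).suffix.lower())


if __name__ == "__main__":
    for candidate in ["path/to/my/document.pdf", "scan.TIFF", "notes.docx"]:
        print(candidate, "->", guess_content_type(candidate))
```

A check like this only looks at the file name, so it catches obvious mistakes such as uploading a .docx file but does not verify the file contents.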

Assigning ground truth to a document

To use a document as training data, it must have an attached ground truth. Our models learn by example, so you must provide both the example input (the file) and its expected output (the ground truth). You can provide the ground truth when you create the document, or add it later by updating an existing document.

las documents create path/to/document.pdf --ground-truth-fields amount=100.00 due_date='2021-05-20'
las documents create path/to/document.pdf --ground-truth-path path/to/ground_truth.json
las documents update <documentId> --ground-truth-fields amount=100.00 due_date='2021-05-20'
las documents update <documentId> --ground-truth-path path/to/ground_truth.json
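When ground truth comes from another system, the first command above can be assembled programmatically. This sketch only builds the argument list from a dictionary of labels and values, using the subcommand and flag shown above. The function name build_create_command is hypothetical, and nothing is executed.

```python
import shlex


def build_create_command(document_path, fields):
    """Build the argument list for `las documents create` with inline ground truth."""
    args = ["las", "documents", "create", document_path]
    if fields:
        args.append("--ground-truth-fields")
        # Each field is passed as label=value. Supply values as strings so
        # that formatting such as "100.00" is kept exactly as written.
        args.extend(f"{label}={value}" for label, value in fields.items())
    return args


command = build_create_command(
    "path/to/document.pdf",
    {"amount": "100.00", "due_date": "2021-05-20"},
)

# shlex.join quotes arguments where needed, so the result can be pasted into a shell.
print(shlex.join(command))
```

If the las CLI is installed, the list can be passed directly to subprocess.run(command, check=True), which avoids shell quoting issues altogether.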

A ground truth file is a JSON array of objects, each with a label key and a value key. Every value must be a string, including values that look numeric, such as amounts.
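As an illustration of the file format described above, the sketch below converts a dictionary of labels and values into the array-of-objects form and writes it to disk. The helper names to_ground_truth and write_ground_truth are hypothetical, and the labels amount and due_date are taken from the CLI examples above.

```python
import json
import tempfile
from pathlib import Path


def to_ground_truth(fields):
    """Convert a {label: value} mapping into a list of label/value objects."""
    # str() guarantees that every value is a string, as the format requires.
    return [{"label": label, "value": str(value)} for label, value in fields.items()]


def write_ground_truth(fields, path):
    """Write the ground truth for one document to a JSON file."""
    Path(path).write_text(json.dumps(to_ground_truth(fields), indent=2), encoding="utf-8")


ground_truth = to_ground_truth({"amount": "100.00", "due_date": "2021-05-20"})

# Write to a temporary directory so the example leaves no files behind.
with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "ground_truth.json"
    write_ground_truth({"amount": "100.00", "due_date": "2021-05-20"}, target)
    print(target.read_text(encoding="utf-8"))
```

Passing values as strings in the first place is safer than relying on str(), because converting a float such as 100.0 would drop the trailing zero of "100.00".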

Caution: The label name is used as a key in several places. Make sure you are consistent in using the same label names across documents and models.

Documents with personal consents

In addition to grouping documents in datasets, you can assign documents a consentId to make it easy to delete a single user's data. If your application requires users to consent to data use, identify each consent with an ID that is unique to the user, and tag all of that user's data with the corresponding consentId when you upload it to Cradl.

A consentId must match the pattern "las:consent:[a-f0-9]{32}", that is, the prefix las:consent: followed by 32 lowercase hexadecimal characters.

las documents create path/to/document.pdf --consent-id <consentId>
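The consentId pattern above can be generated and validated with the standard library. This is a sketch under the assumption that your application creates the ID itself and stores it alongside the user record. The helper names new_consent_id and is_valid_consent_id are hypothetical.

```python
import re
import uuid

# The pattern stated above: a fixed prefix followed by 32 lowercase hex characters.
CONSENT_ID_PATTERN = re.compile(r"las:consent:[a-f0-9]{32}")


def new_consent_id():
    """Create a consentId from a random UUID, whose hex form is 32 lowercase hex characters."""
    return f"las:consent:{uuid.uuid4().hex}"


def is_valid_consent_id(value):
    """Return True only if the whole string matches the consentId pattern."""
    return CONSENT_ID_PATTERN.fullmatch(value) is not None


consent_id = new_consent_id()
print(consent_id, is_valid_consent_id(consent_id))
```

Using fullmatch rather than match rejects strings with trailing characters, and the character class rejects uppercase hexadecimal digits.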

Deleting documents

Documents may be deleted one-by-one:

las documents delete <documentId>

Or using a group identifier (consentId or datasetId):

las documents delete-all --dataset-id <datasetId>
{
  "documents": [...],
  "consentId": [...]
}

The delete-all command will delete all documents with the given group identifier.