Choose a ready-to-use dataset

Whether it is for training or for evaluation, having predefined objects to access datasets in your preferred framework can save you a significant amount of time.

Available Datasets

In the package reference, you will also find a few samples for each dataset.

Here are all datasets that are available through docTR:

Detection

These datasets contain the information needed to train or validate a text detection model.

Dataset          Train Samples    Test Samples    Information
FUNSD            149              50
SROIE            626              360
CORD             800              100
IIIT5K           2000             3000
SVT              100              249
SVHN             33402            13068           Character Localization
SynthText        772875           85875
IC03             246              249
IC13             229              233             external resources
IMGUR5K          7149             796             Handwritten / external resources
WILDRECEIPT      1268             472             external resources

from doctr.datasets import CORD
# Load straight boxes
train_set = CORD(train=True, download=True)
# Load rotated boxes
train_set = CORD(train=True, download=True, use_polygons=True)
img, target = train_set[0]
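
As a quick check of what a sample looks like, you can inspect it directly. This is a minimal sketch: treating the target as a dictionary with a "boxes" entry follows docTR's detection format, but verify the exact schema in the package reference.

# Assumption: the target exposes its coordinates under a "boxes" key,
# with one row per annotated box (a 4-point polygon when use_polygons=True).
print(img.shape)
print(target["boxes"].shape)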

Recognition

These datasets contain the information needed to train or validate a text recognition model.

Dataset          Train Samples    Test Samples    Information
FUNSD            21888            8707            english
SROIE            33608            19342           english / only uppercase labels
CORD             19370            2186            english
IIIT5K           2000             3000            english
SVT              257              647             english / only uppercase labels
SVHN             73257            26032           digits
SynthText        ~7100000         707470          english
IC03             1156             1107            english
IC13             849              1095            english / external resources
IMGUR5K          207901           22672           english / handwritten / external resources
MJSynth          7581382          1337891         english / external resources
IIITHWS          7141797          793533          english / handwritten / external resources
WILDRECEIPT      49377            19598           english / external resources

from doctr.datasets import CORD
# Crop boxes as they are (may be irregular)
train_set = CORD(train=True, download=True, recognition_task=True)
# Crop rotated boxes (always regular)
train_set = CORD(train=True, download=True, use_polygons=True, recognition_task=True)
img, target = train_set[0]
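
With recognition_task=True, each sample should pair a word crop with its transcription. This is a minimal sketch; treating the target as a plain string is an assumption to verify against the package reference.

# Assumption: in recognition mode the target is simply the text
# contained in the crop (a plain string).
print(img.shape)  # the word crop
print(target)     # its transcription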

Object Detection

These datasets contain the information needed to train or validate an object detection model.

Dataset          Train Samples    Test Samples    Information
DocArtefacts     2700             300             ["background", "qr_code", "bar_code", "logo", "photo"]

from doctr.datasets import DocArtefacts
train_set = DocArtefacts(train=True, download=True)
img, target = train_set[0]
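
The integer labels in the target presumably index into the class list shown in the table above. This is a hypothetical sketch: the "labels" key and the index-based encoding are assumptions to check against the package reference.

# Assumption: target["labels"] holds integer class indices that map to
# the class list from the table above.
classes = ["background", "qr_code", "bar_code", "logo", "photo"]
print([classes[i] for i in target["labels"]])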

Synthetic dataset generator

docTR also provides generator objects that can be used to build synthetic datasets. Both are integrated into the training scripts, so they can be used to train a classification or recognition model.

from doctr.datasets import CharacterGenerator
ds = CharacterGenerator(vocab='abdef', num_samples=100)
img, target = ds[0]

from doctr.datasets import WordGenerator
ds = WordGenerator(vocab='abdef', min_chars=1, max_chars=32, num_samples=100)
img, target = ds[0]
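
Instead of a hand-written character set, you can also reuse one of the predefined vocabularies shipped with docTR, as sketched below (assuming "english" is a valid key of VOCABS; check the package reference for the available vocabularies).

from doctr.datasets import VOCABS, WordGenerator
# Sketch: generate words from a predefined vocabulary rather than an
# explicit character string ("english" is assumed to be a valid key).
ds = WordGenerator(vocab=VOCABS["english"], min_chars=1, max_chars=32, num_samples=100)
img, target = ds[0]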

Use your own datasets

Since many documents include sensitive / personal information, we are not able to share all the data that has been used for this project. However, we provide some guidance on formatting your own dataset in the same way, so that you can use all docTR tools more easily. You can find further information about the format in the references.

from doctr.datasets import DetectionDataset, OCRDataset, RecognitionDataset
# Load a detection dataset
train_set = DetectionDataset(img_folder="/path/to/images", label_path="/path/to/labels.json")
# Load a recognition dataset
train_set = RecognitionDataset(img_folder="/path/to/images", labels_path="/path/to/labels.json")
# Load an OCR dataset, which contains annotations for both boxes and labels
train_set = OCRDataset(img_folder="/path/to/images", label_file="/path/to/labels.json")
img, target = train_set[0]
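
To make the expected format more concrete, here is an illustrative sketch of writing a detection labels.json. The keys shown (img_dimensions, img_hash, polygons) follow the format described in the references, but double-check them there before relying on this.

import json
# Illustrative sketch only: one entry per image, with polygons given as
# absolute pixel coordinates; verify the exact keys in the references.
labels = {
    "sample_img_01.png": {
        "img_dimensions": (900, 600),
        "img_hash": "theimagedumpmyhash",
        "polygons": [
            [[50, 50], [200, 50], [200, 80], [50, 80]],
        ],
    },
}
with open("/path/to/labels.json", "w") as f:
    json.dump(labels, f)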

Data Loading

Each dataset has its own way of loading a sample, but handling batch aggregation and the underlying iterator is a task deferred to another object in docTR.

from doctr.datasets import CORD, DataLoader
train_set = CORD(train=True, download=True)
train_loader = DataLoader(train_set, batch_size=32)
train_iter = iter(train_loader)
images, targets = next(train_iter)
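
Since the loader supports the iterator protocol, a full epoch can also be consumed with a plain for loop. This is a minimal sketch; the structure of targets depends on the dataset in use.

# Minimal sketch: iterate over the whole loader once (one epoch).
for images, targets in train_loader:
    # a forward pass / loss computation would go here
    pass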