Choose a ready-to-use dataset
Whether it is for training or for evaluation, having predefined objects to access datasets in your preferred framework can save you a significant amount of time.
Available Datasets
In the package reference you will also find some samples for each dataset.
Here are all the datasets available through docTR:
Detection
These datasets contain the information needed to train or validate a text detection model.
| Dataset | Train Samples | Test Samples | Information |
|---|---|---|---|
| FUNSD | 149 | 50 | |
| SROIE | 626 | 360 | |
| CORD | 800 | 100 | |
| IIIT5K | 2000 | 3000 | |
| SVT | 100 | 249 | |
| SVHN | 33402 | 13068 | Character Localization |
| SynthText | 772875 | 85875 | |
| IC03 | 246 | 249 | |
| IC13 | 229 | 233 | external resources |
| IMGUR5K | 7149 | 796 | Handwritten / external resources |
| WILDRECEIPT | 1268 | 472 | external resources |
```python
from doctr.datasets import CORD

# Load straight boxes
train_set = CORD(train=True, download=True, detection_task=True)
# Load rotated boxes
train_set = CORD(train=True, download=True, use_polygons=True, detection_task=True)
img, target = train_set[0]
```
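If you want to sanity-check a sample, you can inspect the target directly; a minimal sketch, assuming the detection target is a numpy array of relative box coordinates ((N, 4) for straight boxes, (N, 4, 2) when `use_polygons=True`):

```python
import numpy as np

# Assumption: the detection target is a numpy array of relative coordinates,
# shaped (N, 4) for straight boxes and (N, 4, 2) for polygons
print(type(img))                    # the page image
print(target.shape, target.dtype)  # one row per annotated box
```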
Recognition
These datasets contain the information needed to train or validate a text recognition model.
| Dataset | Train Samples | Test Samples | Information |
|---|---|---|---|
| FUNSD | 21888 | 8707 | english |
| SROIE | 33608 | 19342 | english / only uppercase labels |
| CORD | 19370 | 2186 | english |
| IIIT5K | 2000 | 3000 | english |
| SVT | 257 | 647 | english / only uppercase labels |
| SVHN | 73257 | 26032 | digits |
| SynthText | ~7100000 | 707470 | english |
| IC03 | 1156 | 1107 | english |
| IC13 | 849 | 1095 | english / external resources |
| IMGUR5K | 207901 | 22672 | english / handwritten / external resources |
| MJSynth | 7581382 | 1337891 | english / external resources |
| IIITHWS | 7141797 | 793533 | english / handwritten / external resources |
| WILDRECEIPT | 49377 | 19598 | english / external resources |
```python
from doctr.datasets import CORD

# Crop boxes as they are (crops can be irregular when the text is rotated)
train_set = CORD(train=True, download=True, recognition_task=True)
# Crop rotated boxes (always regular)
train_set = CORD(train=True, download=True, use_polygons=True, recognition_task=True)
img, target = train_set[0]
```
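Here each sample is a single word crop rather than a full page, and the target should be its transcription; a quick check, assuming the target is a plain string:

```python
# Assumption: recognition samples pair a word crop with its transcription
print(type(img))  # the cropped word image
print(target)     # e.g. "TOTAL" - assumed to be a plain string
```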
OCR
The same datasets as in the detection table above, but each target contains both the bounding boxes and their text labels.
```python
from doctr.datasets import CORD

# Load straight boxes
train_set = CORD(train=True, download=True)
# Load rotated boxes
train_set = CORD(train=True, download=True, use_polygons=True)
img, target = train_set[0]
```
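To see how the two pieces of information are combined, you can inspect a sample; a minimal sketch, assuming the OCR target is a dictionary holding box coordinates under 'boxes' and their transcriptions under 'labels':

```python
# Assumption: the OCR target is a dict pairing boxes with their text labels
print(target["boxes"].shape)  # relative box coordinates, one row per word
print(target["labels"][:5])   # the first few transcriptions
```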
Object Detection
These datasets contain the information needed to train or validate an object detection model.
| Dataset | Train Samples | Test Samples | Information |
|---|---|---|---|
| DocArtefacts | 2700 | 300 | ["background", "qr_code", "bar_code", "logo", "photo"] |
```python
from doctr.datasets import DocArtefacts

train_set = DocArtefacts(train=True, download=True)
img, target = train_set[0]
```
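Each target should pair artefact boxes with one of the five classes listed in the table; a hedged look at a sample, assuming the target is a dictionary with 'boxes' and 'labels' entries:

```python
# Assumption: the target holds one box per artefact plus an integer class label
# indexing into ["background", "qr_code", "bar_code", "logo", "photo"]
print(target["boxes"].shape)
print(target["labels"])
```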
Synthetic dataset generator
docTR also provides some generator objects that can be used to generate synthetic datasets. Both are integrated into the training scripts for classification and recognition models.
```python
from doctr.datasets import CharacterGenerator

# Generate 100 synthetic character images drawn from the given vocab
ds = CharacterGenerator(vocab='abdef', num_samples=100)
img, target = ds[0]
```
```python
from doctr.datasets import WordGenerator

# Generate 100 synthetic words of 1 to 32 characters from the given vocab
ds = WordGenerator(vocab='abdef', min_chars=1, max_chars=32, num_samples=100)
img, target = ds[0]
```
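As a quick sanity check of what the two generators yield, the sketch below assumes CharacterGenerator targets are the vocab index of the rendered character, while WordGenerator targets are the rendered word itself:

```python
from doctr.datasets import CharacterGenerator, WordGenerator

vocab = 'abdef'

char_img, char_idx = CharacterGenerator(vocab=vocab, num_samples=100)[0]
print(vocab[char_idx])  # assumption: the target indexes into the vocab

word_img, word = WordGenerator(vocab=vocab, min_chars=1, max_chars=32, num_samples=100)[0]
print(word)             # assumption: the target is the rendered string
```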
Use your own datasets
Since many documents include sensitive / personal information, we are not able to share all the data that has been used for this project. However, we provide some guidance on how to bring your own dataset into the expected format so that you can use all docTR tools more easily. You can find further information about the format in references.
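As an illustration only, here is a sketch of how a detection labels.json could be assembled in Python; the keys used below (img_dimensions, img_hash, polygons) follow the layout described in the references, so double-check them there before relying on this sketch:

```python
import json

# Hedged sketch of a detection labels.json: one entry per image, mapping the
# file name to its dimensions, a hash, and one 4-point polygon per word
labels = {
    "sample_img_01.png": {
        "img_dimensions": [900, 600],
        "img_hash": "theimagedumpmyhash",
        "polygons": [
            [[12, 8], [120, 8], [120, 32], [12, 32]],
        ],
    }
}
with open("/path/to/labels.json", "w") as f:
    json.dump(labels, f)
```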
```python
from doctr.datasets import DetectionDataset, OCRDataset, RecognitionDataset

# Load a detection dataset
train_set = DetectionDataset(img_folder="/path/to/images", label_path="/path/to/labels.json")
# Load a recognition dataset
train_set = RecognitionDataset(img_folder="/path/to/images", labels_path="/path/to/labels.json")
# Load an OCR dataset, which contains annotations for both boxes and labels
train_set = OCRDataset(img_folder="/path/to/images", label_file="/path/to/labels.json")
img, target = train_set[0]
```
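These objects behave like any other docTR dataset, so a validation split can be built the same way by pointing to a second folder and label file (the paths below are placeholders):

```python
# Hypothetical validation split: same class, different folder and labels
val_set = DetectionDataset(img_folder="/path/to/val_images", label_path="/path/to/val_labels.json")
print(len(train_set), len(val_set))
```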
Data Loading
Each dataset has its own way of loading a sample, but handling batch aggregation and the underlying iterator is deferred to another object in docTR: the DataLoader.
```python
from doctr.datasets import CORD, DataLoader

train_set = CORD(train=True, download=True)
train_loader = DataLoader(train_set, batch_size=32)
train_iter = iter(train_loader)
images, targets = next(train_iter)
```
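Each batch is a tuple of stacked images and the matching list of targets, so a training epoch reduces to iterating over the loader; a minimal sketch (the exact tensor layout depends on the backend you installed):

```python
# Minimal epoch sketch: one pass over the whole dataset, batch by batch
for images, targets in train_loader:
    # images: a batch of 32 page tensors (channel layout depends on the backend)
    # targets: the 32 matching targets, one per image
    pass  # forward pass / loss computation would go here
```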