doctr.datasets#
Whether it is for training or evaluation, having predefined objects to access datasets in your preferred framework can save you a significant amount of time.
Available Datasets#
Here are all the datasets available through docTR:
Public datasets#
- class doctr.datasets.FUNSD(train: bool = True, use_polygons: bool = False, **kwargs: Any)[source]#
FUNSD dataset from “FUNSD: A Dataset for Form Understanding in Noisy Scanned Documents”.
>>> from doctr.datasets import FUNSD
>>> train_set = FUNSD(train=True, download=True)
>>> img, target = train_set[0]
- Parameters
train – whether the subset should be the training one
use_polygons – whether polygons should be considered as rotated bounding boxes (instead of straight ones)
**kwargs – keyword arguments from VisionDataset.
- class doctr.datasets.SROIE(train: bool = True, use_polygons: bool = False, **kwargs: Any)[source]#
SROIE dataset from “ICDAR2019 Competition on Scanned Receipt OCR and Information Extraction”.
>>> from doctr.datasets import SROIE
>>> train_set = SROIE(train=True, download=True)
>>> img, target = train_set[0]
- Parameters
train – whether the subset should be the training one
use_polygons – whether polygons should be considered as rotated bounding boxes (instead of straight ones)
**kwargs – keyword arguments from VisionDataset.
- class doctr.datasets.CORD(train: bool = True, use_polygons: bool = False, **kwargs: Any)[source]#
CORD dataset from “CORD: A Consolidated Receipt Dataset for Post-OCR Parsing”.
>>> from doctr.datasets import CORD
>>> train_set = CORD(train=True, download=True)
>>> img, target = train_set[0]
- Parameters
train – whether the subset should be the training one
use_polygons – whether polygons should be considered as rotated bounding boxes (instead of straight ones)
**kwargs – keyword arguments from VisionDataset.
- class doctr.datasets.IIIT5K(train: bool = True, use_polygons: bool = False, **kwargs: Any)[source]#
IIIT-5K character-level localization dataset from “Scene Text Recognition using Higher Order Language Priors” (BMVC 2012).
>>> # NOTE: this dataset is for character-level localization
>>> from doctr.datasets import IIIT5K
>>> train_set = IIIT5K(train=True, download=True)
>>> img, target = train_set[0]
- Parameters
train – whether the subset should be the training one
use_polygons – whether polygons should be considered as rotated bounding boxes (instead of straight ones)
**kwargs – keyword arguments from VisionDataset.
- class doctr.datasets.SVT(train: bool = True, use_polygons: bool = False, **kwargs: Any)[source]#
SVT dataset from “The Street View Text Dataset - UCSD Computer Vision”.
>>> from doctr.datasets import SVT
>>> train_set = SVT(train=True, download=True)
>>> img, target = train_set[0]
- Parameters
train – whether the subset should be the training one
use_polygons – whether polygons should be considered as rotated bounding boxes (instead of straight ones)
**kwargs – keyword arguments from VisionDataset.
- class doctr.datasets.SVHN(train: bool = True, use_polygons: bool = False, **kwargs: Any)[source]#
SVHN dataset from “The Street View House Numbers (SVHN) Dataset”.
>>> from doctr.datasets import SVHN
>>> train_set = SVHN(train=True, download=True)
>>> img, target = train_set[0]
- Parameters
train – whether the subset should be the training one
use_polygons – whether polygons should be considered as rotated bounding boxes (instead of straight ones)
**kwargs – keyword arguments from VisionDataset.
- class doctr.datasets.SynthText(train: bool = True, use_polygons: bool = False, **kwargs: Any)[source]#
SynthText dataset from “Synthetic Data for Text Localisation in Natural Images” | repository | website.
>>> from doctr.datasets import SynthText
>>> train_set = SynthText(train=True, download=True)
>>> img, target = train_set[0]
- Parameters
train – whether the subset should be the training one
use_polygons – whether polygons should be considered as rotated bounding boxes (instead of straight ones)
**kwargs – keyword arguments from VisionDataset.
- class doctr.datasets.IC03(train: bool = True, use_polygons: bool = False, **kwargs: Any)[source]#
IC03 dataset from “ICDAR 2003 Robust Reading Competitions: Entries, Results and Future Directions”.
>>> from doctr.datasets import IC03
>>> train_set = IC03(train=True, download=True)
>>> img, target = train_set[0]
- Parameters
train – whether the subset should be the training one
use_polygons – whether polygons should be considered as rotated bounding boxes (instead of straight ones)
**kwargs – keyword arguments from VisionDataset.
- class doctr.datasets.IC13(img_folder: str, label_folder: str, use_polygons: bool = False, **kwargs: Any)[source]#
IC13 dataset from “ICDAR 2013 Robust Reading Competition”.
>>> # NOTE: You need to download both image and label parts from Focused Scene Text challenge Task2.1 2013-2015.
>>> from doctr.datasets import IC13
>>> train_set = IC13(img_folder="/path/to/Challenge2_Training_Task12_Images",
>>>                  label_folder="/path/to/Challenge2_Training_Task1_GT")
>>> img, target = train_set[0]
>>> test_set = IC13(img_folder="/path/to/Challenge2_Test_Task12_Images",
>>>                 label_folder="/path/to/Challenge2_Test_Task1_GT")
>>> img, target = test_set[0]
- Parameters
img_folder – folder with all the images of the dataset
label_folder – folder with all annotation files for the images
use_polygons – whether polygons should be considered as rotated bounding boxes (instead of straight ones)
**kwargs – keyword arguments from AbstractDataset.
- class doctr.datasets.IMGUR5K(img_folder: str, label_path: str, train: bool = True, use_polygons: bool = False, **kwargs: Any)[source]#
IMGUR5K dataset from “TextStyleBrush: Transfer of Text Aesthetics from a Single Example” | repository.
>>> # NOTE: You need to download/generate the dataset from the repository.
>>> from doctr.datasets import IMGUR5K
>>> train_set = IMGUR5K(train=True, img_folder="/path/to/IMGUR5K-Handwriting-Dataset/images",
>>>                     label_path="/path/to/IMGUR5K-Handwriting-Dataset/dataset_info/imgur5k_annotations.json")
>>> img, target = train_set[0]
>>> test_set = IMGUR5K(train=False, img_folder="/path/to/IMGUR5K-Handwriting-Dataset/images",
>>>                    label_path="/path/to/IMGUR5K-Handwriting-Dataset/dataset_info/imgur5k_annotations.json")
>>> img, target = test_set[0]
- Parameters
img_folder – folder with all the images of the dataset
label_path – path to the annotations file of the dataset
train – whether the subset should be the training one
use_polygons – whether polygons should be considered as rotated bounding boxes (instead of straight ones)
**kwargs – keyword arguments from AbstractDataset.
docTR synthetic datasets#
- class doctr.datasets.DocArtefacts(train: bool = True, use_polygons: bool = False, **kwargs: Any)[source]#
Object detection dataset for non-textual elements in documents. The dataset includes a variety of synthetic document pages with non-textual elements.
>>> from doctr.datasets import DocArtefacts
>>> train_set = DocArtefacts(train=True, download=True)
>>> img, target = train_set[0]
- Parameters
train – whether the subset should be the training one
use_polygons – whether polygons should be considered as rotated bounding boxes (instead of straight ones)
**kwargs – keyword arguments from VisionDataset.
- class doctr.datasets.CharacterGenerator(*args, **kwargs)[source]#
Implements a character image generation dataset
>>> from doctr.datasets import CharacterGenerator
>>> ds = CharacterGenerator(vocab='abdef', num_samples=100)
>>> img, target = ds[0]
- Parameters
vocab – vocabulary to take the character from
num_samples – number of samples that will be generated iterating over the dataset
cache_samples – whether generated images should be cached firsthand
font_family – font to use to generate the text images
img_transforms – composable transformations that will be applied to each image
sample_transforms – composable transformations that will be applied to both the image and the target
- class doctr.datasets.WordGenerator(vocab: str, min_chars: int, max_chars: int, num_samples: int, cache_samples: bool = False, font_family: Optional[Union[str, List[str]]] = None, img_transforms: Optional[Callable[[Any], Any]] = None, sample_transforms: Optional[Callable[[Any, Any], Tuple[Any, Any]]] = None)[source]#
Implements a word image generation dataset
>>> from doctr.datasets import WordGenerator
>>> ds = WordGenerator(vocab='abdef', min_chars=1, max_chars=32, num_samples=100)
>>> img, target = ds[0]
- Parameters
vocab – vocabulary to take the character from
min_chars – minimum number of characters in a word
max_chars – maximum number of characters in a word
num_samples – number of samples that will be generated iterating over the dataset
cache_samples – whether generated images should be cached firsthand
font_family – font to use to generate the text images
img_transforms – composable transformations that will be applied to each image
sample_transforms – composable transformations that will be applied to both the image and the target
docTR private datasets#
Since many documents include sensitive/personal information, we cannot share all the data that has been used for this project. However, we provide some guidance on formatting your own dataset the same way, so that you can still use all docTR tools with it.
- class doctr.datasets.DetectionDataset(img_folder: str, label_path: str, use_polygons: bool = False, **kwargs: Any)[source]#
Implements a text detection dataset
>>> from doctr.datasets import DetectionDataset
>>> train_set = DetectionDataset(img_folder="/path/to/images",
>>>                              label_path="/path/to/labels.json")
>>> img, target = train_set[0]
- Parameters
img_folder – folder with all the images of the dataset
label_path – path to the annotations of each image
use_polygons – whether polygons should be considered as rotated bounding boxes (instead of straight ones)
**kwargs – keyword arguments from AbstractDataset.
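To format your own detection data, the labels file maps each image to its annotations. The sketch below is an assumption modeled on docTR's reference training scripts, not a confirmed schema: key names such as img_dimensions and polygons should be verified against the repository before use.
>>> # Hypothetical labels.json layout for DetectionDataset (key names are assumptions)
>>> import json
>>> labels = {
>>>     "img_1.jpg": {
>>>         "img_dimensions": [900, 600],  # height, width
>>>         "polygons": [
>>>             [[10, 10], [120, 10], [120, 40], [10, 40]],  # one 4-point box per word
>>>         ],
>>>     },
>>> }
>>> with open("/path/to/labels.json", "w") as f:
>>>     json.dump(labels, f)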
- class doctr.datasets.RecognitionDataset(img_folder: str, labels_path: str, **kwargs: Any)[source]#
Dataset implementation for text recognition tasks
>>> from doctr.datasets import RecognitionDataset
>>> train_set = RecognitionDataset(img_folder="/path/to/images",
>>>                                labels_path="/path/to/labels.json")
>>> img, target = train_set[0]
- Parameters
img_folder – path to the images folder
labels_path – path to the json file containing all labels (character sequences)
**kwargs – keyword arguments from AbstractDataset.
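For recognition, a flat mapping from each cropped image filename to its transcription is the natural layout. Again a hedged sketch, to be checked against the docTR reference scripts:
>>> # Hypothetical labels.json layout for RecognitionDataset: filename -> transcription
>>> import json
>>> labels = {"img_1.jpg": "invoice", "img_2.jpg": "total", "img_3.jpg": "42.00"}
>>> with open("/path/to/labels.json", "w") as f:
>>>     json.dump(labels, f)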
- class doctr.datasets.OCRDataset(img_folder: str, label_file: str, use_polygons: bool = False, **kwargs: Any)[source]#
Implements an OCR dataset
- Parameters
img_folder – local path to image folder (all jpg at the root)
label_file – local path to the label file
use_polygons – whether polygons should be considered as rotated bounding boxes (instead of straight ones)
**kwargs – keyword arguments from AbstractDataset.
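This entry ships without a usage snippet; instantiation follows the same pattern as the datasets above (the paths below are placeholders):
>>> from doctr.datasets import OCRDataset
>>> train_set = OCRDataset(img_folder="/path/to/images",
>>>                        label_file="/path/to/labels.json")
>>> img, target = train_set[0]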
Data Loading#
Each dataset has its own way of loading a sample, but batch aggregation and the underlying iterator are handled by a dedicated object in docTR.
- class doctr.datasets.loader.DataLoader(dataset, shuffle: bool = True, batch_size: int = 1, drop_last: bool = False, num_workers: Optional[int] = None, collate_fn: Optional[Callable] = None)[source]#
Implements a dataset wrapper for fast data loading
>>> from doctr.datasets import CORD, DataLoader
>>> train_set = CORD(train=True, download=True)
>>> train_loader = DataLoader(train_set, batch_size=32)
>>> train_iter = iter(train_loader)
>>> images, targets = next(train_iter)
- Parameters
dataset – the dataset
shuffle – whether the samples should be shuffled before being passed to the iterator
batch_size – number of elements in each batch
drop_last – if True, drops the last batch if it isn’t full
num_workers – number of workers to use for data loading
collate_fn – function to merge samples into a batch
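Because the loader supports the iterator protocol, it can also be consumed with a plain for-loop. A minimal sketch: the epoch count is arbitrary, train_set is any dataset from above, and re-iterating each epoch is assumed to reshuffle when shuffle=True.
>>> train_loader = DataLoader(train_set, batch_size=32, shuffle=True, drop_last=True)
>>> for epoch in range(10):
>>>     for images, targets in train_loader:
>>>         pass  # forward / backward pass goes here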
Supported Vocabs#
Since textual content has to be encoded properly for models to interpret it efficiently, docTR supports multiple sets of vocabs.
| Name | Size | Characters |
|---|---|---|
| digits | 10 | 0123456789 |
| ascii_letters | 52 | abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ |
| punctuation | 32 | !"#$%&'()*+,-./:;<=>?@[\]^_`{\|}~ |
| currency | 5 | £€¥¢฿ |
| latin | 94 | 0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ!"#$%&'()*+,-./:;<=>?@[\]^_`{\|}~ |
| english | 100 | 0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ!"#$%&'()*+,-./:;<=>?@[\]^_`{\|}~°£€¥¢฿ |
| legacy_french | 123 | 0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ!"#$%&'()*+,-./:;<=>?@[\]^_`{\|}~°àâéèêëîïôùûçÀÂÉÈËÎÏÔÙÛÇ£€¥¢฿ |
| french | 126 | 0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ!"#$%&'()*+,-./:;<=>?@[\]^_`{\|}~°£€¥¢฿àâéèêëîïôùûüçÀÂÉÈÊËÎÏÔÙÛÜÇ |
| portuguese | 131 | 0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ!"#$%&'()*+,-./:;<=>?@[\]^_`{\|}~°£€¥¢฿áàâãéêëíïóôõúüçÁÀÂÃÉËÍÏÓÔÕÚÜÇ¡¿ |
| spanish | 116 | 0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ!"#$%&'()*+,-./:;<=>?@[\]^_`{\|}~°£€¥¢฿áéíóúüñÁÉÍÓÚÜÑ¡¿ |
| german | 108 | 0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ!"#$%&'()*+,-./:;<=>?@[\]^_`{\|}~°£€¥¢฿äöüßÄÖÜẞ |
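These sets are also exposed programmatically through the VOCABS dictionary in doctr.datasets, which maps each name above to its character string (the lookups below assume that public mapping):
>>> from doctr.datasets import VOCABS
>>> print(VOCABS["digits"])       # 0123456789
>>> print(len(VOCABS["french"]))  # 126, matching the table above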
- doctr.datasets.encode_sequences(sequences: List[str], vocab: str, target_size: Optional[int] = None, eos: int = -1, sos: Optional[int] = None, pad: Optional[int] = None, dynamic_seq_length: bool = False, **kwargs: Any) → ndarray[source]#
Encode character sequences using a given vocab as mapping
- Parameters
sequences – the list of character sequences of size N
vocab – the ordered vocab to use for encoding
target_size – maximum length of the encoded data
eos – encoding of End Of String
sos – optional encoding of Start Of String
pad – optional encoding for padding; when padding is enabled, each sequence is followed by one EOS token, then PAD tokens
dynamic_seq_length – if target_size is specified, uses it as an upper bound and enables dynamic sequence length
- Returns
the padded encoded data as a numpy array
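As an illustration, the toy call below is a hedged sketch: the vocab is tiny, the eos/pad indices (3 and 4) are arbitrary choices outside the vocab, and each character is encoded as its index in the vocab.
>>> from doctr.datasets import encode_sequences
>>> # "ab" -> [0, 1], then one EOS (3), then PAD (4) up to target_size
>>> encoded = encode_sequences(["ab", "c"], vocab="abc", target_size=5, eos=3, pad=4)
>>> print(encoded.shape)  # (2, 5)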