doctr.datasets#

Whether it is for training or for evaluation, having predefined objects to access datasets in your preferred framework can be a significant time saver.

Available Datasets#

Here are all the datasets available through docTR:

Public datasets#

class doctr.datasets.FUNSD(train: bool = True, use_polygons: bool = False, **kwargs: Any)[source]#

FUNSD dataset from “FUNSD: A Dataset for Form Understanding in Noisy Scanned Documents”.

https://github.com/mindee/doctr/releases/download/v0.5.0/funsd-grid.png
>>> from doctr.datasets import FUNSD
>>> train_set = FUNSD(train=True, download=True)
>>> img, target = train_set[0]
Parameters
  • train – whether the subset should be the training one

  • use_polygons – whether polygons should be considered as rotated bounding boxes (instead of straight ones)

  • **kwargs – keyword arguments from VisionDataset.

class doctr.datasets.SROIE(train: bool = True, use_polygons: bool = False, **kwargs: Any)[source]#

SROIE dataset from “ICDAR2019 Competition on Scanned Receipt OCR and Information Extraction”.

https://github.com/mindee/doctr/releases/download/v0.5.0/sroie-grid.png
>>> from doctr.datasets import SROIE
>>> train_set = SROIE(train=True, download=True)
>>> img, target = train_set[0]
Parameters
  • train – whether the subset should be the training one

  • use_polygons – whether polygons should be considered as rotated bounding boxes (instead of straight ones)

  • **kwargs – keyword arguments from VisionDataset.

class doctr.datasets.CORD(train: bool = True, use_polygons: bool = False, **kwargs: Any)[source]#

CORD dataset from “CORD: A Consolidated Receipt Dataset for Post-OCR Parsing”.

https://github.com/mindee/doctr/releases/download/v0.5.0/cord-grid.png
>>> from doctr.datasets import CORD
>>> train_set = CORD(train=True, download=True)
>>> img, target = train_set[0]
Parameters
  • train – whether the subset should be the training one

  • use_polygons – whether polygons should be considered as rotated bounding boxes (instead of straight ones)

  • **kwargs – keyword arguments from VisionDataset.

class doctr.datasets.IIIT5K(train: bool = True, use_polygons: bool = False, **kwargs: Any)[source]#

IIIT-5K character-level localization dataset from “BMVC 2012 Scene Text Recognition using Higher Order Language Priors”.

https://github.com/mindee/doctr/releases/download/v0.5.0/iiit5k-grid.png
>>> # NOTE: this dataset is for character-level localization
>>> from doctr.datasets import IIIT5K
>>> train_set = IIIT5K(train=True, download=True)
>>> img, target = train_set[0]
Parameters
  • train – whether the subset should be the training one

  • use_polygons – whether polygons should be considered as rotated bounding boxes (instead of straight ones)

  • **kwargs – keyword arguments from VisionDataset.

class doctr.datasets.SVT(train: bool = True, use_polygons: bool = False, **kwargs: Any)[source]#

SVT dataset from “The Street View Text Dataset - UCSD Computer Vision”.

https://github.com/mindee/doctr/releases/download/v0.5.0/svt-grid.png
>>> from doctr.datasets import SVT
>>> train_set = SVT(train=True, download=True)
>>> img, target = train_set[0]
Parameters
  • train – whether the subset should be the training one

  • use_polygons – whether polygons should be considered as rotated bounding boxes (instead of straight ones)

  • **kwargs – keyword arguments from VisionDataset.

class doctr.datasets.SVHN(train: bool = True, use_polygons: bool = False, **kwargs: Any)[source]#

SVHN dataset from “The Street View House Numbers (SVHN) Dataset”.

https://github.com/mindee/doctr/releases/download/v0.5.0/svhn-grid.png
>>> from doctr.datasets import SVHN
>>> train_set = SVHN(train=True, download=True)
>>> img, target = train_set[0]
Parameters
  • train – whether the subset should be the training one

  • use_polygons – whether polygons should be considered as rotated bounding boxes (instead of straight ones)

  • **kwargs – keyword arguments from VisionDataset.

class doctr.datasets.SynthText(train: bool = True, use_polygons: bool = False, **kwargs: Any)[source]#

SynthText dataset from “Synthetic Data for Text Localisation in Natural Images” | repository | website.

https://github.com/mindee/doctr/releases/download/v0.5.0/svt-grid.png
>>> from doctr.datasets import SynthText
>>> train_set = SynthText(train=True, download=True)
>>> img, target = train_set[0]
Parameters
  • train – whether the subset should be the training one

  • use_polygons – whether polygons should be considered as rotated bounding boxes (instead of straight ones)

  • **kwargs – keyword arguments from VisionDataset.

class doctr.datasets.IC03(train: bool = True, use_polygons: bool = False, **kwargs: Any)[source]#

IC03 dataset from “ICDAR 2003 Robust Reading Competitions: Entries, Results and Future Directions”.

https://github.com/mindee/doctr/releases/download/v0.5.0/ic03-grid.png
>>> from doctr.datasets import IC03
>>> train_set = IC03(train=True, download=True)
>>> img, target = train_set[0]
Parameters
  • train – whether the subset should be the training one

  • use_polygons – whether polygons should be considered as rotated bounding boxes (instead of straight ones)

  • **kwargs – keyword arguments from VisionDataset.

class doctr.datasets.IC13(img_folder: str, label_folder: str, use_polygons: bool = False, **kwargs: Any)[source]#

IC13 dataset from “ICDAR 2013 Robust Reading Competition”.

https://github.com/mindee/doctr/releases/download/v0.5.0/ic13-grid.png
>>> # NOTE: You need to download both the image and label parts from the Focused Scene Text challenge, Task 2.1 (2013-2015).
>>> from doctr.datasets import IC13
>>> train_set = IC13(img_folder="/path/to/Challenge2_Training_Task12_Images",
>>>                  label_folder="/path/to/Challenge2_Training_Task1_GT")
>>> img, target = train_set[0]
>>> test_set = IC13(img_folder="/path/to/Challenge2_Test_Task12_Images",
>>>                 label_folder="/path/to/Challenge2_Test_Task1_GT")
>>> img, target = test_set[0]
Parameters
  • img_folder – folder with all the images of the dataset

  • label_folder – folder with all annotation files for the images

  • use_polygons – whether polygons should be considered as rotated bounding boxes (instead of straight ones)

  • **kwargs – keyword arguments from AbstractDataset.

class doctr.datasets.IMGUR5K(img_folder: str, label_path: str, train: bool = True, use_polygons: bool = False, **kwargs: Any)[source]#

IMGUR5K dataset from “TextStyleBrush: Transfer of Text Aesthetics from a Single Example” | repository.

https://github.com/mindee/doctr/releases/download/v0.5.0/imgur5k-grid.png
>>> # NOTE: You need to download/generate the dataset from the repository.
>>> from doctr.datasets import IMGUR5K
>>> train_set = IMGUR5K(train=True, img_folder="/path/to/IMGUR5K-Handwriting-Dataset/images",
>>>                     label_path="/path/to/IMGUR5K-Handwriting-Dataset/dataset_info/imgur5k_annotations.json")
>>> img, target = train_set[0]
>>> test_set = IMGUR5K(train=False, img_folder="/path/to/IMGUR5K-Handwriting-Dataset/images",
>>>                    label_path="/path/to/IMGUR5K-Handwriting-Dataset/dataset_info/imgur5k_annotations.json")
>>> img, target = test_set[0]
Parameters
  • img_folder – folder with all the images of the dataset

  • label_path – path to the annotations file of the dataset

  • train – whether the subset should be the training one

  • use_polygons – whether polygons should be considered as rotated bounding boxes (instead of straight ones)

  • **kwargs – keyword arguments from AbstractDataset.

docTR synthetic datasets#

class doctr.datasets.DocArtefacts(train: bool = True, use_polygons: bool = False, **kwargs: Any)[source]#

Object detection dataset for non-textual elements in documents. The dataset includes a variety of synthetic document pages with non-textual elements.

https://github.com/mindee/doctr/releases/download/v0.5.0/artefacts-grid.png
>>> from doctr.datasets import DocArtefacts
>>> train_set = DocArtefacts(train=True, download=True)
>>> img, target = train_set[0]
Parameters
  • train – whether the subset should be the training one

  • use_polygons – whether polygons should be considered as rotated bounding boxes (instead of straight ones)

  • **kwargs – keyword arguments from VisionDataset.

class doctr.datasets.CharacterGenerator(*args, **kwargs)[source]#

Implements a character image generation dataset

>>> from doctr.datasets import CharacterGenerator
>>> ds = CharacterGenerator(vocab='abdef')
>>> img, target = ds[0]
Parameters
  • vocab – vocabulary to take the character from

  • num_samples – number of samples that will be generated iterating over the dataset

  • cache_samples – whether generated images should be cached firsthand

  • font_family – font to use to generate the text images

  • img_transforms – composable transformations that will be applied to each image

  • sample_transforms – composable transformations that will be applied to both the image and the target

class doctr.datasets.WordGenerator(vocab: str, min_chars: int, max_chars: int, num_samples: int, cache_samples: bool = False, font_family: Optional[Union[str, List[str]]] = None, img_transforms: Optional[Callable[[Any], Any]] = None, sample_transforms: Optional[Callable[[Any, Any], Tuple[Any, Any]]] = None)[source]#

Implements a word image generation dataset

>>> from doctr.datasets import WordGenerator
>>> ds = WordGenerator(vocab='abdef')
>>> img, target = ds[0]
Parameters
  • vocab – vocabulary to take the character from

  • min_chars – minimum number of characters in a word

  • max_chars – maximum number of characters in a word

  • num_samples – number of samples that will be generated iterating over the dataset

  • cache_samples – whether generated images should be cached firsthand

  • font_family – font to use to generate the text images

  • img_transforms – composable transformations that will be applied to each image

  • sample_transforms – composable transformations that will be applied to both the image and the target
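The word sampling step can be sketched in plain Python (a simplified illustration, not docTR's actual implementation, which additionally renders each sampled word into an image):

```python
import random

def sample_word(vocab: str, min_chars: int, max_chars: int) -> str:
    """Draw a random word whose characters come from `vocab` and whose
    length is sampled uniformly from [min_chars, max_chars]."""
    length = random.randint(min_chars, max_chars)
    return "".join(random.choices(vocab, k=length))

# Mirrors WordGenerator(vocab='abdef') sampling words of bounded length
words = [sample_word("abdef", min_chars=2, max_chars=5) for _ in range(100)]
```

Since characters are drawn independently, the same character may repeat within a word, which is the desired behaviour for synthetic recognition data.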

docTR private datasets#

Since many documents include sensitive / personal information, we are not able to share all the data used for this project. However, we provide guidance on formatting your own dataset into the same structure so that you can still use all docTR tools.
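For illustration, a recognition label file can be a single JSON object mapping each image filename to its transcription. The exact schema below is an assumption (check the docTR reference scripts for the authoritative format):

```python
import json

# Hypothetical labels.json content for a recognition dataset:
# one entry per image, mapping the filename to its transcription.
labels = {
    "img_001.jpg": "Total",
    "img_002.jpg": "March",
}

# Serialize and parse back, as the dataset loader would read it from disk
serialized = json.dumps(labels, ensure_ascii=False, indent=2)
restored = json.loads(serialized)
```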

class doctr.datasets.DetectionDataset(img_folder: str, label_path: str, use_polygons: bool = False, **kwargs: Any)[source]#

Implements a text detection dataset

>>> from doctr.datasets import DetectionDataset
>>> train_set = DetectionDataset(img_folder="/path/to/images",
>>>                              label_path="/path/to/labels.json")
>>> img, target = train_set[0]
Parameters
  • img_folder – folder with all the images of the dataset

  • label_path – path to the annotations of each image

  • use_polygons – whether polygons should be considered as rotated bounding boxes (instead of straight ones)

  • **kwargs – keyword arguments from AbstractDataset.

class doctr.datasets.RecognitionDataset(img_folder: str, labels_path: str, **kwargs: Any)[source]#

Dataset implementation for text recognition tasks

>>> from doctr.datasets import RecognitionDataset
>>> train_set = RecognitionDataset(img_folder="/path/to/images",
>>>                                labels_path="/path/to/labels.json")
>>> img, target = train_set[0]
Parameters
  • img_folder – path to the images folder

  • labels_path – path to the JSON file containing all labels (character sequences)

  • **kwargs – keyword arguments from AbstractDataset.

class doctr.datasets.OCRDataset(img_folder: str, label_file: str, use_polygons: bool = False, **kwargs: Any)[source]#

Implements an OCR dataset

Parameters
  • img_folder – local path to image folder (all jpg at the root)

  • label_file – local path to the label file

  • use_polygons – whether polygons should be considered as rotated bounding boxes (instead of straight ones)

  • **kwargs – keyword arguments from AbstractDataset.

Data Loading#

Each dataset has its own way of loading a sample, but batch aggregation and the underlying iterator are deferred to another object in docTR.

class doctr.datasets.loader.DataLoader(dataset, shuffle: bool = True, batch_size: int = 1, drop_last: bool = False, num_workers: Optional[int] = None, collate_fn: Optional[Callable] = None)[source]#

Implements a dataset wrapper for fast data loading

>>> from doctr.datasets import FUNSD, DataLoader
>>> train_set = FUNSD(train=True, download=True)
>>> train_loader = DataLoader(train_set, batch_size=32)
>>> train_iter = iter(train_loader)
>>> images, targets = next(train_iter)
Parameters
  • dataset – the dataset

  • shuffle – whether the samples should be shuffled before being passed to the iterator

  • batch_size – number of elements in each batch

  • drop_last – if True, drops the last batch if it isn’t full

  • num_workers – number of workers to use for data loading

  • collate_fn – function to merge samples into a batch
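Batch aggregation boils down to merging N (image, target) samples into a single (images, targets) pair. A minimal pure-Python sketch of such a collate function (docTR's own default additionally stacks images into one framework tensor):

```python
from typing import Any, List, Tuple

def collate_samples(samples: List[Tuple[Any, Any]]) -> Tuple[List[Any], List[Any]]:
    """Merge a list of (image, target) pairs into (images, targets)."""
    images, targets = zip(*samples)
    return list(images), list(targets)

# Two placeholder samples stand in for (image tensor, annotation dict) pairs
batch = [("img_a", {"label": 0}), ("img_b", {"label": 1})]
images, targets = collate_samples(batch)
```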

Supported Vocabs#

Since textual content has to be encoded properly for models to interpret it efficiently, docTR supports multiple vocab sets.

docTR Vocabs#

Name (size): characters

digits (10): 0123456789
ascii_letters (52): abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ
punctuation (32): !"#$%&'()*+,-./:;<=>?@[\]^_`{|}~
currency (5): £€¥¢฿
latin (94): 0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~
english (100): 0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~°£€¥¢฿
legacy_french (123): 0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~°àâéèêëîïôùûçÀÂÉÈËÎÏÔÙÛÇ£€¥¢฿
french (126): 0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~°£€¥¢฿àâéèêëîïôùûüçÀÂÉÈÊËÎÏÔÙÛÜÇ
portuguese (131): 0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~°£€¥¢฿áàâãéêëíïóôõúüçÁÀÂÃÉËÍÏÓÔÕÚÜÇ¡¿
spanish (116): 0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~°£€¥¢฿áéíóúüñÁÉÍÓÚÜÑ¡¿
german (108): 0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~°£€¥¢฿äöüßÄÖÜẞ
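The smaller vocabs compose into the larger ones: latin is digits + ascii_letters + punctuation, and english adds the degree sign plus the currency set. Assuming the base vocabs mirror Python's `string` constants (a plausible reading, consistent with the sizes listed above), the counts can be checked directly:

```python
import string

# Base vocabs, assumed to match Python's string constants
digits = string.digits                # 10 characters
letters = string.ascii_letters        # 52 characters
punctuation = string.punctuation      # 32 characters
currency = "£€¥¢฿"                    # 5 characters

# Composed vocabs
latin = digits + letters + punctuation
english = latin + "°" + currency
```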

doctr.datasets.encode_sequences(sequences: List[str], vocab: str, target_size: Optional[int] = None, eos: int = -1, sos: Optional[int] = None, pad: Optional[int] = None, dynamic_seq_length: bool = False, **kwargs: Any) → ndarray[source]#

Encode character sequences using a given vocab as mapping

Parameters
  • sequences – the list of character sequences of size N

  • vocab – the ordered vocab to use for encoding

  • target_size – maximum length of the encoded data

  • eos – encoding of End Of String

  • sos – optional encoding of Start Of String

  • pad – optional encoding for padding. In case of padding, all sequences are followed by 1 EOS then PAD

  • dynamic_seq_length – if target_size is specified, uses it as upper bound and enables dynamic sequence size

Returns

the padded encoded data as a NumPy array
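The core of this encoding can be sketched in pure Python. The hypothetical helper below only covers the basic case (index lookup, truncation, and right-padding with the EOS code), leaving out the sos/pad/dynamic_seq_length options, and returns plain lists where the real function returns a NumPy array:

```python
def encode_sketch(sequences, vocab, target_size, eos=-1):
    """Map each character to its index in `vocab`, truncating to
    `target_size` and right-padding shorter rows with the EOS code."""
    encoded = []
    for seq in sequences:
        row = [vocab.index(char) for char in seq[:target_size]]
        row += [eos] * (target_size - len(row))
        encoded.append(row)
    return encoded

out = encode_sketch(["ab", "cab"], vocab="abc", target_size=4)
# [[0, 1, -1, -1], [2, 0, 1, -1]]
```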