doctr.datasets#

class doctr.datasets.FUNSD(train: bool = True, use_polygons: bool = False, recognition_task: bool = False, **kwargs: Any)[source]#

FUNSD dataset from “FUNSD: A Dataset for Form Understanding in Noisy Scanned Documents”.

https://doctr-static.mindee.com/models?id=v0.5.0/funsd-grid.png&src=0
>>> from doctr.datasets import FUNSD
>>> train_set = FUNSD(train=True, download=True)
>>> img, target = train_set[0]
Parameters
  • train – whether the subset should be the training one

  • use_polygons – whether polygons should be considered as rotated bounding boxes (instead of straight ones)

  • recognition_task – whether the dataset should be used for the recognition task

  • **kwargs – keyword arguments from VisionDataset.
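The use_polygons flag, shared by most datasets on this page, switches between two box representations. As a rough, library-independent sketch of the difference: a straight box can be described by two opposite corners, while a rotated box needs all four corner points. The coordinate layout and helper names below are assumptions made for illustration; this page does not specify the exact target format.

```python
from typing import List, Tuple

Point = Tuple[float, float]
Box = Tuple[float, float, float, float]


def straight_box_to_polygon(box: Box) -> List[Point]:
    """Expand (xmin, ymin, xmax, ymax) into four corners, clockwise from the top-left."""
    xmin, ymin, xmax, ymax = box
    return [(xmin, ymin), (xmax, ymin), (xmax, ymax), (xmin, ymax)]


def polygon_to_straight_box(polygon: List[Point]) -> Box:
    """Return the smallest axis-aligned box that encloses the polygon."""
    xs = [x for x, _ in polygon]
    ys = [y for _, y in polygon]
    return (min(xs), min(ys), max(xs), max(ys))


# Hypothetical box in relative coordinates
corners = straight_box_to_polygon((0.1, 0.2, 0.5, 0.4))
enclosing = polygon_to_straight_box(corners)
```

Going from a polygon back to a straight box loses the rotation, which is why a separate flag is useful when rotated text matters.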

class doctr.datasets.SROIE(train: bool = True, use_polygons: bool = False, recognition_task: bool = False, **kwargs: Any)[source]#

SROIE dataset from “ICDAR2019 Competition on Scanned Receipt OCR and Information Extraction”.

https://doctr-static.mindee.com/models?id=v0.5.0/sroie-grid.png&src=0
>>> from doctr.datasets import SROIE
>>> train_set = SROIE(train=True, download=True)
>>> img, target = train_set[0]
Parameters
  • train – whether the subset should be the training one

  • use_polygons – whether polygons should be considered as rotated bounding boxes (instead of straight ones)

  • recognition_task – whether the dataset should be used for the recognition task

  • **kwargs – keyword arguments from VisionDataset.

class doctr.datasets.CORD(train: bool = True, use_polygons: bool = False, recognition_task: bool = False, **kwargs: Any)[source]#

CORD dataset from “CORD: A Consolidated Receipt Dataset for Post-OCR Parsing”.

https://doctr-static.mindee.com/models?id=v0.5.0/cord-grid.png&src=0
>>> from doctr.datasets import CORD
>>> train_set = CORD(train=True, download=True)
>>> img, target = train_set[0]
Parameters
  • train – whether the subset should be the training one

  • use_polygons – whether polygons should be considered as rotated bounding boxes (instead of straight ones)

  • recognition_task – whether the dataset should be used for the recognition task

  • **kwargs – keyword arguments from VisionDataset.

class doctr.datasets.IIIT5K(train: bool = True, use_polygons: bool = False, recognition_task: bool = False, **kwargs: Any)[source]#

IIIT-5K character-level localization dataset from “BMVC 2012 Scene Text Recognition using Higher Order Language Priors”.

https://doctr-static.mindee.com/models?id=v0.5.0/iiit5k-grid.png&src=0
>>> # NOTE: this dataset is for character-level localization
>>> from doctr.datasets import IIIT5K
>>> train_set = IIIT5K(train=True, download=True)
>>> img, target = train_set[0]
Parameters
  • train – whether the subset should be the training one

  • use_polygons – whether polygons should be considered as rotated bounding boxes (instead of straight ones)

  • recognition_task – whether the dataset should be used for the recognition task

  • **kwargs – keyword arguments from VisionDataset.

class doctr.datasets.SVT(train: bool = True, use_polygons: bool = False, recognition_task: bool = False, **kwargs: Any)[source]#

SVT dataset from “The Street View Text Dataset - UCSD Computer Vision”.

https://doctr-static.mindee.com/models?id=v0.5.0/svt-grid.png&src=0
>>> from doctr.datasets import SVT
>>> train_set = SVT(train=True, download=True)
>>> img, target = train_set[0]
Parameters
  • train – whether the subset should be the training one

  • use_polygons – whether polygons should be considered as rotated bounding boxes (instead of straight ones)

  • recognition_task – whether the dataset should be used for the recognition task

  • **kwargs – keyword arguments from VisionDataset.

class doctr.datasets.SVHN(train: bool = True, use_polygons: bool = False, recognition_task: bool = False, **kwargs: Any)[source]#

SVHN dataset from “The Street View House Numbers (SVHN) Dataset”.

https://doctr-static.mindee.com/models?id=v0.5.0/svhn-grid.png&src=0
>>> from doctr.datasets import SVHN
>>> train_set = SVHN(train=True, download=True)
>>> img, target = train_set[0]
Parameters
  • train – whether the subset should be the training one

  • use_polygons – whether polygons should be considered as rotated bounding boxes (instead of straight ones)

  • recognition_task – whether the dataset should be used for the recognition task

  • **kwargs – keyword arguments from VisionDataset.

class doctr.datasets.SynthText(train: bool = True, use_polygons: bool = False, recognition_task: bool = False, **kwargs: Any)[source]#

SynthText dataset from “Synthetic Data for Text Localisation in Natural Images” | “repository” | “website”.

https://doctr-static.mindee.com/models?id=v0.5.0/svt-grid.png&src=0
>>> from doctr.datasets import SynthText
>>> train_set = SynthText(train=True, download=True)
>>> img, target = train_set[0]
Parameters
  • train – whether the subset should be the training one

  • use_polygons – whether polygons should be considered as rotated bounding boxes (instead of straight ones)

  • recognition_task – whether the dataset should be used for the recognition task

  • **kwargs – keyword arguments from VisionDataset.

class doctr.datasets.IC03(train: bool = True, use_polygons: bool = False, recognition_task: bool = False, **kwargs: Any)[source]#

IC03 dataset from “ICDAR 2003 Robust Reading Competitions: Entries, Results and Future Directions”.

https://doctr-static.mindee.com/models?id=v0.5.0/ic03-grid.png&src=0
>>> from doctr.datasets import IC03
>>> train_set = IC03(train=True, download=True)
>>> img, target = train_set[0]
Parameters
  • train – whether the subset should be the training one

  • use_polygons – whether polygons should be considered as rotated bounding boxes (instead of straight ones)

  • recognition_task – whether the dataset should be used for the recognition task

  • **kwargs – keyword arguments from VisionDataset.

class doctr.datasets.IC13(img_folder: str, label_folder: str, use_polygons: bool = False, recognition_task: bool = False, **kwargs: Any)[source]#

IC13 dataset from “ICDAR 2013 Robust Reading Competition”.

https://doctr-static.mindee.com/models?id=v0.5.0/ic13-grid.png&src=0
>>> # NOTE: You need to download both image and label parts from Focused Scene Text challenge Task2.1 2013-2015.
>>> from doctr.datasets import IC13
>>> train_set = IC13(img_folder="/path/to/Challenge2_Training_Task12_Images",
...                  label_folder="/path/to/Challenge2_Training_Task1_GT")
>>> img, target = train_set[0]
>>> test_set = IC13(img_folder="/path/to/Challenge2_Test_Task12_Images",
...                 label_folder="/path/to/Challenge2_Test_Task1_GT")
>>> img, target = test_set[0]
Parameters
  • img_folder – folder with all the images of the dataset

  • label_folder – folder with all annotation files for the images

  • use_polygons – whether polygons should be considered as rotated bounding boxes (instead of straight ones)

  • recognition_task – whether the dataset should be used for the recognition task

  • **kwargs – keyword arguments from AbstractDataset.

class doctr.datasets.IMGUR5K(img_folder: str, label_path: str, train: bool = True, use_polygons: bool = False, recognition_task: bool = False, **kwargs: Any)[source]#

IMGUR5K dataset from “TextStyleBrush: Transfer of Text Aesthetics from a Single Example” | repository.

https://doctr-static.mindee.com/models?id=v0.5.0/imgur5k-grid.png&src=0
>>> # NOTE: You need to download/generate the dataset from the repository.
>>> from doctr.datasets import IMGUR5K
>>> train_set = IMGUR5K(train=True, img_folder="/path/to/IMGUR5K-Handwriting-Dataset/images",
...                     label_path="/path/to/IMGUR5K-Handwriting-Dataset/dataset_info/imgur5k_annotations.json")
>>> img, target = train_set[0]
>>> test_set = IMGUR5K(train=False, img_folder="/path/to/IMGUR5K-Handwriting-Dataset/images",
...                    label_path="/path/to/IMGUR5K-Handwriting-Dataset/dataset_info/imgur5k_annotations.json")
>>> img, target = test_set[0]
Parameters
  • img_folder – folder with all the images of the dataset

  • label_path – path to the annotations file of the dataset

  • train – whether the subset should be the training one

  • use_polygons – whether polygons should be considered as rotated bounding boxes (instead of straight ones)

  • recognition_task – whether the dataset should be used for the recognition task

  • **kwargs – keyword arguments from AbstractDataset.

class doctr.datasets.MJSynth(img_folder: str, label_path: str, train: bool = True, **kwargs: Any)[source]#

MJSynth dataset from “Synthetic Data and Artificial Neural Networks for Natural Scene Text Recognition”.

>>> # NOTE: This is a pure recognition dataset without bounding box labels.
>>> # NOTE: You need to download the dataset.
>>> from doctr.datasets import MJSynth
>>> train_set = MJSynth(img_folder="/path/to/mjsynth/mnt/ramdisk/max/90kDICT32px",
...                     label_path="/path/to/mjsynth/mnt/ramdisk/max/90kDICT32px/imlist.txt",
...                     train=True)
>>> img, target = train_set[0]
>>> test_set = MJSynth(img_folder="/path/to/mjsynth/mnt/ramdisk/max/90kDICT32px",
...                    label_path="/path/to/mjsynth/mnt/ramdisk/max/90kDICT32px/imlist.txt",
...                    train=False)
>>> img, target = test_set[0]
Parameters
  • img_folder – folder with all the images of the dataset

  • label_path – path to the file with the labels

  • train – whether the subset should be the training one

  • **kwargs – keyword arguments from AbstractDataset.

class doctr.datasets.DocArtefacts(train: bool = True, use_polygons: bool = False, **kwargs: Any)[source]#

Object detection dataset for non-textual elements in documents. The dataset includes a variety of synthetic document pages with non-textual elements.

https://doctr-static.mindee.com/models?id=v0.5.0/artefacts-grid.png&src=0
>>> from doctr.datasets import DocArtefacts
>>> train_set = DocArtefacts(train=True, download=True)
>>> img, target = train_set[0]
Parameters
  • train – whether the subset should be the training one

  • use_polygons – whether polygons should be considered as rotated bounding boxes (instead of straight ones)

  • **kwargs – keyword arguments from VisionDataset.

Synthetic dataset generator#

class doctr.datasets.CharacterGenerator(*args, **kwargs)[source]#

Implements a character image generation dataset

>>> from doctr.datasets import CharacterGenerator
>>> ds = CharacterGenerator(vocab='abdef', num_samples=100)
>>> img, target = ds[0]
Parameters
  • vocab – vocabulary to take the characters from

  • num_samples – number of samples that will be generated when iterating over the dataset

  • cache_samples – whether generated images should be cached beforehand

  • font_family – font to use to generate the text images

  • img_transforms – composable transformations that will be applied to each image

  • sample_transforms – composable transformations that will be applied to both the image and the target

class doctr.datasets.WordGenerator(vocab: str, min_chars: int, max_chars: int, num_samples: int, cache_samples: bool = False, font_family: Optional[Union[str, List[str]]] = None, img_transforms: Optional[Callable[[Any], Any]] = None, sample_transforms: Optional[Callable[[Any, Any], Tuple[Any, Any]]] = None)[source]#

Implements a word image generation dataset

>>> from doctr.datasets import WordGenerator
>>> ds = WordGenerator(vocab='abdef', min_chars=1, max_chars=32, num_samples=100)
>>> img, target = ds[0]
Parameters
  • vocab – vocabulary to take the characters from

  • min_chars – minimum number of characters in a word

  • max_chars – maximum number of characters in a word

  • num_samples – number of samples that will be generated when iterating over the dataset

  • cache_samples – whether generated images should be cached beforehand

  • font_family – font to use to generate the text images

  • img_transforms – composable transformations that will be applied to each image

  • sample_transforms – composable transformations that will be applied to both the image and the target
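To make the roles of vocab, min_chars, max_chars and num_samples concrete, here is a minimal sketch of the text-sampling step only. It assumes uniform sampling of both the word length and the characters, which this page does not confirm, and it does not render any image. The helper name sample_words and its seed argument are hypothetical.

```python
import random
from typing import List


def sample_words(vocab: str, min_chars: int, max_chars: int,
                 num_samples: int, seed: int = 0) -> List[str]:
    """Draw random words whose characters all come from `vocab`."""
    rng = random.Random(seed)  # local generator, so the global random state is untouched
    words = []
    for _ in range(num_samples):
        length = rng.randint(min_chars, max_chars)  # both bounds are inclusive
        words.append("".join(rng.choice(vocab) for _ in range(length)))
    return words


# Same arguments as the WordGenerator example above
words = sample_words("abdef", min_chars=1, max_chars=32, num_samples=100)
```

Each sampled string would then serve as the target of one generated image.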

Custom dataset loader#

class doctr.datasets.DetectionDataset(img_folder: str, label_path: str, use_polygons: bool = False, **kwargs: Any)[source]#

Implements a text detection dataset

>>> from doctr.datasets import DetectionDataset
>>> train_set = DetectionDataset(img_folder="/path/to/images",
...                              label_path="/path/to/labels.json")
>>> img, target = train_set[0]
Parameters
  • img_folder – folder with all the images of the dataset

  • label_path – path to the annotations of each image

  • use_polygons – whether polygons should be considered as rotated bounding boxes (instead of straight ones)

  • **kwargs – keyword arguments from AbstractDataset.

class doctr.datasets.RecognitionDataset(img_folder: str, labels_path: str, **kwargs: Any)[source]#

Dataset implementation for text recognition tasks

>>> from doctr.datasets import RecognitionDataset
>>> train_set = RecognitionDataset(img_folder="/path/to/images",
...                                labels_path="/path/to/labels.json")
>>> img, target = train_set[0]
Parameters
  • img_folder – path to the images folder

  • labels_path – path to the JSON file containing all labels (character sequences)

  • **kwargs – keyword arguments from AbstractDataset.
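The parameter list above only says that the labels file is a JSON file of character sequences. One plausible layout, assumed here rather than taken from this page, is a mapping from image file name to its transcription. The sketch below writes and reloads such a file with the standard library; the file names and words are made up.

```python
import json
import os
import tempfile

# Hypothetical image names and transcriptions (assumed layout: file name -> text)
labels = {"img_1.jpg": "hello", "img_2.jpg": "world"}

root = tempfile.mkdtemp()
labels_path = os.path.join(root, "labels.json")

# ensure_ascii=False keeps non-ASCII characters readable in the file
with open(labels_path, "w", encoding="utf-8") as f:
    json.dump(labels, f, ensure_ascii=False, indent=2)

# Reload to check that the file round-trips
with open(labels_path, encoding="utf-8") as f:
    loaded = json.load(f)
```

If the layout your docTR version expects differs, only the structure of the labels dictionary needs to change.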

class doctr.datasets.OCRDataset(img_folder: str, label_file: str, use_polygons: bool = False, **kwargs: Any)[source]#

Implements an OCR dataset

>>> from doctr.datasets import OCRDataset
>>> train_set = OCRDataset(img_folder="/path/to/images",
...                        label_file="/path/to/labels.json")
>>> img, target = train_set[0]
Parameters
  • img_folder – local path to image folder (all jpg at the root)

  • label_file – local path to the label file

  • use_polygons – whether polygons should be considered as rotated bounding boxes (instead of straight ones)

  • **kwargs – keyword arguments from AbstractDataset.

Dataloader#

class doctr.datasets.loader.DataLoader(dataset, shuffle: bool = True, batch_size: int = 1, drop_last: bool = False, num_workers: Optional[int] = None, collate_fn: Optional[Callable] = None)[source]#

Implements a dataset wrapper for fast data loading

>>> from doctr.datasets import CORD, DataLoader
>>> train_set = CORD(train=True, download=True)
>>> train_loader = DataLoader(train_set, batch_size=32)
>>> train_iter = iter(train_loader)
>>> images, targets = next(train_iter)
Parameters
  • dataset – the dataset

  • shuffle – whether the samples should be shuffled before passing them to the iterator

  • batch_size – number of elements in each batch

  • drop_last – if True, drops the last batch if it isn’t full

  • num_workers – number of workers to use for data loading

  • collate_fn – function to merge samples into a batch
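The interplay of shuffle, batch_size, drop_last and collate_fn can be illustrated with a small pure-Python generator. This is a sketch of the semantics described in the parameter list, not the library's implementation. The names iter_batches, collate_pairs and the seed argument are hypothetical, and num_workers is deliberately left out.

```python
import random
from typing import Any, Callable, Iterator, List, Optional, Sequence


def iter_batches(dataset: Sequence[Any], batch_size: int = 1, shuffle: bool = True,
                 drop_last: bool = False,
                 collate_fn: Optional[Callable[[List[Any]], Any]] = None,
                 seed: Optional[int] = None) -> Iterator[Any]:
    """Yield batches of samples, optionally shuffled and collated."""
    indices = list(range(len(dataset)))
    if shuffle:
        random.Random(seed).shuffle(indices)
    for start in range(0, len(indices), batch_size):
        chunk = indices[start:start + batch_size]
        if drop_last and len(chunk) < batch_size:
            break  # only the final chunk can be incomplete
        samples = [dataset[i] for i in chunk]
        yield collate_fn(samples) if collate_fn is not None else samples


def collate_pairs(samples):
    """Turn a list of (image, target) pairs into (images, targets)."""
    images, targets = zip(*samples)
    return list(images), list(targets)


# Toy dataset of (image, target) pairs standing in for a real one
toy = [(f"img{i}", i) for i in range(10)]
batches = list(iter_batches(toy, batch_size=4, shuffle=False,
                            drop_last=True, collate_fn=collate_pairs))
```

With 10 samples and batch_size=4, drop_last=True yields two full batches and discards the remaining two samples. With drop_last=False, a third, shorter batch is produced.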

Supported Vocabs#

Since textual content has to be encoded properly for models to interpret it efficiently, docTR supports multiple sets of vocabs.

docTR Vocabs#

Name                size  characters
digits              10    0123456789
hindi_digits        10    ٠١٢٣٤٥٦٧٨٩
ascii_letters       52    abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ
punctuation         32    !"#$%&'()*+,-./:;<=>?@[\]^_`{|}~
currency            5     £€¥¢฿
ancient_greek       48    αβγδεζηθικλμνξοπρστυφχψωΑΒΓΔΕΖΗΘΙΚΛΜΝΞΟΠΡΣΤΥΦΧΨΩ
arabic_letters      37    ءآأؤإئابةتثجحخدذرزسشصضطظعغـفقكلمنهوىي
persian_letters     5     پچڢڤگ
arabic_diacritics   2     ‘ًٌٍَُِّْ’
arabic_punctuation  5     ؟؛«»—
latin               94    0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~
english             100   0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~°£€¥¢฿
legacy_french       123   0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~°àâéèêëîïôùûçÀÂÉÈËÎÏÔÙÛÇ£€¥¢฿
french              126   0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~°£€¥¢฿àâéèêëîïôùûüçÀÂÉÈÊËÎÏÔÙÛÜÇ
portuguese          131   0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~°£€¥¢฿áàâãéêëíïóôõúüçÁÀÂÃÉËÍÏÓÔÕÚÜÇ¡¿
spanish             116   0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~°£€¥¢฿áéíóúüñÁÉÍÓÚÜÑ¡¿
german              108   0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~°£€¥¢฿äöüßÄÖÜẞ
arabic              101   ءآأؤإئابةتثجحخدذرزسشصضطظعغـفقكلمنهوىيپچڢڤگ؟؛«»—0123456789٠١٢٣٤٥٦٧٨٩’ًٌٍَُِّْ’!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~
czech               130   0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~°£€¥¢฿áčďéěíňóřšťúůýžÁČĎÉĚÍŇÓŘŠŤÚŮÝŽ
vietnamese          234   0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~°£€¥¢฿áàảạãăắằẳẵặâấầẩẫậéèẻẽẹêếềểễệóòỏõọôốồổộỗơớờởợỡúùủũụưứừửữựiíìỉĩịýỳỷỹỵÁÀẢẠÃĂẮẰẲẴẶÂẤẦẨẪẬÉÈẺẼẸÊẾỀỂỄỆÓÒỎÕỌÔỐỒỔỘỖƠỚỜỞỢỠÚÙỦŨỤƯỨỪỬỮỰIÍÌỈĨỊÝỲỶỸỴ
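Several vocabs in the table above are concatenations of smaller ones, which the listed sizes confirm: latin is digits, ascii_letters and punctuation (10 + 52 + 32 = 94), and english adds the degree sign and the currency symbols (94 + 1 + 5 = 100). The sketch below rebuilds them from Python's string module, on the assumption that the base rows match the standard-library constants, and derives a character-to-index mapping of the kind an encoder would need.

```python
import string

# Base vocabs, assumed to match the stdlib constants
digits = string.digits                # 10 characters
ascii_letters = string.ascii_letters  # 52 characters
punctuation = string.punctuation      # 32 characters
currency = "£€¥¢฿"                    # copied from the table

# Composite vocabs, following the character order shown in the table
latin = digits + ascii_letters + punctuation
english = latin + "°" + currency

# The position of a character in the ordered vocab is its encoding
char_to_index = {char: idx for idx, char in enumerate(english)}
```

Because the vocab is ordered, the same string must be used for encoding and decoding. Reordering it changes every index.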

doctr.datasets.encode_sequences(sequences: List[str], vocab: str, target_size: Optional[int] = None, eos: int = -1, sos: Optional[int] = None, pad: Optional[int] = None, dynamic_seq_length: bool = False, **kwargs: Any) → ndarray[source]#

Encode character sequences using a given vocab as mapping

Parameters
  • sequences – the list of character sequences of size N

  • vocab – the ordered vocab to use for encoding

  • target_size – maximum length of the encoded data

  • eos – encoding of End Of String

  • sos – optional encoding of Start Of String

  • pad – optional encoding for padding. In case of padding, all sequences are followed by 1 EOS then PAD

  • dynamic_seq_length – if target_size is specified, uses it as upper bound and enables dynamic sequence size

Returns

the padded encoded data as a NumPy array
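The encoding scheme described by these parameters can be sketched in pure Python. This is an illustration written from the parameter descriptions, not docTR's implementation. It returns nested lists instead of an array, ignores dynamic_seq_length, and its default target_size (longest sequence, plus one slot for EOS when padding is used) is an assumption. The name encode_sequences_sketch is hypothetical.

```python
from typing import List, Optional


def encode_sequences_sketch(sequences: List[str], vocab: str,
                            target_size: Optional[int] = None, eos: int = -1,
                            sos: Optional[int] = None,
                            pad: Optional[int] = None) -> List[List[int]]:
    """Map each character to its index in `vocab`, then append EOS / PAD symbols."""
    char_to_idx = {char: idx for idx, char in enumerate(vocab)}
    if target_size is None:
        # Assumed default: longest sequence, plus one slot for EOS when padding
        target_size = max(len(seq) for seq in sequences) + (1 if pad is not None else 0)
    encoded = []
    for seq in sequences:
        # Raises KeyError for characters missing from the vocab; truncates long inputs
        row = [char_to_idx[char] for char in seq][:target_size]
        if len(row) < target_size:
            row.append(eos)  # exactly one EOS right after the characters
        fill = pad if pad is not None else eos
        row.extend([fill] * (target_size - len(row)))
        if sos is not None:
            row = [sos] + row  # optional start symbol in front
        encoded.append(row)
    return encoded


vocab = "abdef"  # a=0, b=1, d=2, e=3, f=4
with_pad = encode_sequences_sketch(["bad", "fe"], vocab, eos=5, pad=6)
without_pad = encode_sequences_sketch(["bad", "fe"], vocab, eos=5)
```

Here with_pad is [[1, 0, 2, 5], [4, 3, 5, 6]] and without_pad is [[1, 0, 2], [4, 3, 5]]. The EOS and PAD values are chosen outside the vocab's index range (0 to 4) so they cannot collide with real characters.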