doctr.datasets

doctr.datasets

class doctr.datasets.FUNSD(train: bool = True, use_polygons: bool = False, recognition_task: bool = False, detection_task: bool = False, **kwargs: Any)[source]

FUNSD dataset from “FUNSD: A Dataset for Form Understanding in Noisy Scanned Documents”.

https://doctr-static.mindee.com/models?id=v0.5.0/funsd-grid.png&src=0
>>> from doctr.datasets import FUNSD
>>> train_set = FUNSD(train=True, download=True)
>>> img, target = train_set[0]

Args:

train: whether the subset should be the training one use_polygons: whether polygons should be considered as rotated bounding box (instead of straight ones) recognition_task: whether the dataset should be used for recognition task detection_task: whether the dataset should be used for detection task **kwargs: keyword arguments from VisionDataset.

class doctr.datasets.SROIE(train: bool = True, use_polygons: bool = False, recognition_task: bool = False, detection_task: bool = False, **kwargs: Any)[source]

SROIE dataset from “ICDAR2019 Competition on Scanned Receipt OCR and Information Extraction”.

https://doctr-static.mindee.com/models?id=v0.5.0/sroie-grid.png&src=0
>>> from doctr.datasets import SROIE
>>> train_set = SROIE(train=True, download=True)
>>> img, target = train_set[0]

Args:

train: whether the subset should be the training one use_polygons: whether polygons should be considered as rotated bounding box (instead of straight ones) recognition_task: whether the dataset should be used for recognition task detection_task: whether the dataset should be used for detection task **kwargs: keyword arguments from VisionDataset.

class doctr.datasets.CORD(train: bool = True, use_polygons: bool = False, recognition_task: bool = False, detection_task: bool = False, **kwargs: Any)[source]

CORD dataset from “CORD: A Consolidated Receipt Dataset forPost-OCR Parsing”.

https://doctr-static.mindee.com/models?id=v0.5.0/cord-grid.png&src=0
>>> from doctr.datasets import CORD
>>> train_set = CORD(train=True, download=True)
>>> img, target = train_set[0]

Args:

train: whether the subset should be the training one use_polygons: whether polygons should be considered as rotated bounding box (instead of straight ones) recognition_task: whether the dataset should be used for recognition task detection_task: whether the dataset should be used for detection task **kwargs: keyword arguments from VisionDataset.

class doctr.datasets.IIIT5K(train: bool = True, use_polygons: bool = False, recognition_task: bool = False, detection_task: bool = False, **kwargs: Any)[source]

IIIT-5K character-level localization dataset from “BMVC 2012 Scene Text Recognition using Higher Order Language Priors”.

https://doctr-static.mindee.com/models?id=v0.5.0/iiit5k-grid.png&src=0
>>> # NOTE: this dataset is for character-level localization
>>> from doctr.datasets import IIIT5K
>>> train_set = IIIT5K(train=True, download=True)
>>> img, target = train_set[0]

Args:

train: whether the subset should be the training one use_polygons: whether polygons should be considered as rotated bounding box (instead of straight ones) recognition_task: whether the dataset should be used for recognition task detection_task: whether the dataset should be used for detection task **kwargs: keyword arguments from VisionDataset.

class doctr.datasets.SVT(train: bool = True, use_polygons: bool = False, recognition_task: bool = False, detection_task: bool = False, **kwargs: Any)[source]

SVT dataset from “The Street View Text Dataset - UCSD Computer Vision”.

https://doctr-static.mindee.com/models?id=v0.5.0/svt-grid.png&src=0
>>> from doctr.datasets import SVT
>>> train_set = SVT(train=True, download=True)
>>> img, target = train_set[0]

Args:

train: whether the subset should be the training one use_polygons: whether polygons should be considered as rotated bounding box (instead of straight ones) recognition_task: whether the dataset should be used for recognition task detection_task: whether the dataset should be used for detection task **kwargs: keyword arguments from VisionDataset.

class doctr.datasets.SVHN(train: bool = True, use_polygons: bool = False, recognition_task: bool = False, detection_task: bool = False, **kwargs: Any)[source]

SVHN dataset from “The Street View House Numbers (SVHN) Dataset”.

https://doctr-static.mindee.com/models?id=v0.5.0/svhn-grid.png&src=0
>>> from doctr.datasets import SVHN
>>> train_set = SVHN(train=True, download=True)
>>> img, target = train_set[0]

Args:

train: whether the subset should be the training one use_polygons: whether polygons should be considered as rotated bounding box (instead of straight ones) recognition_task: whether the dataset should be used for recognition task detection_task: whether the dataset should be used for detection task **kwargs: keyword arguments from VisionDataset.

class doctr.datasets.SynthText(train: bool = True, use_polygons: bool = False, recognition_task: bool = False, detection_task: bool = False, **kwargs: Any)[source]

SynthText dataset from “Synthetic Data for Text Localisation in Natural Images” | “repository” | “website”.

https://doctr-static.mindee.com/models?id=v0.5.0/svt-grid.png&src=0
>>> from doctr.datasets import SynthText
>>> train_set = SynthText(train=True, download=True)
>>> img, target = train_set[0]

Args:

train: whether the subset should be the training one use_polygons: whether polygons should be considered as rotated bounding box (instead of straight ones) recognition_task: whether the dataset should be used for recognition task detection_task: whether the dataset should be used for detection task **kwargs: keyword arguments from VisionDataset.

class doctr.datasets.IC03(train: bool = True, use_polygons: bool = False, recognition_task: bool = False, detection_task: bool = False, **kwargs: Any)[source]

IC03 dataset from “ICDAR 2003 Robust Reading Competitions: Entries, Results and Future Directions”.

https://doctr-static.mindee.com/models?id=v0.5.0/ic03-grid.png&src=0
>>> from doctr.datasets import IC03
>>> train_set = IC03(train=True, download=True)
>>> img, target = train_set[0]

Args:

train: whether the subset should be the training one use_polygons: whether polygons should be considered as rotated bounding box (instead of straight ones) recognition_task: whether the dataset should be used for recognition task detection_task: whether the dataset should be used for detection task **kwargs: keyword arguments from VisionDataset.

class doctr.datasets.IC13(img_folder: str, label_folder: str, use_polygons: bool = False, recognition_task: bool = False, detection_task: bool = False, **kwargs: Any)[source]

IC13 dataset from “ICDAR 2013 Robust Reading Competition”.

https://doctr-static.mindee.com/models?id=v0.5.0/ic13-grid.png&src=0
>>> # NOTE: You need to download both image and label parts from Focused Scene Text challenge Task2.1 2013-2015.
>>> from doctr.datasets import IC13
>>> train_set = IC13(img_folder="/path/to/Challenge2_Training_Task12_Images",
>>>                  label_folder="/path/to/Challenge2_Training_Task1_GT")
>>> img, target = train_set[0]
>>> test_set = IC13(img_folder="/path/to/Challenge2_Test_Task12_Images",
>>>                 label_folder="/path/to/Challenge2_Test_Task1_GT")
>>> img, target = test_set[0]

Args:

img_folder: folder with all the images of the dataset label_folder: folder with all annotation files for the images use_polygons: whether polygons should be considered as rotated bounding box (instead of straight ones) recognition_task: whether the dataset should be used for recognition task detection_task: whether the dataset should be used for detection task **kwargs: keyword arguments from AbstractDataset.

class doctr.datasets.IMGUR5K(img_folder: str, label_path: str, train: bool = True, use_polygons: bool = False, recognition_task: bool = False, detection_task: bool = False, **kwargs: Any)[source]

IMGUR5K dataset from “TextStyleBrush: Transfer of Text Aesthetics from a Single Example” | repository.

https://doctr-static.mindee.com/models?id=v0.5.0/imgur5k-grid.png&src=0
>>> # NOTE: You need to download/generate the dataset from the repository.
>>> from doctr.datasets import IMGUR5K
>>> train_set = IMGUR5K(train=True, img_folder="/path/to/IMGUR5K-Handwriting-Dataset/images",
>>>                     label_path="/path/to/IMGUR5K-Handwriting-Dataset/dataset_info/imgur5k_annotations.json")
>>> img, target = train_set[0]
>>> test_set = IMGUR5K(train=False, img_folder="/path/to/IMGUR5K-Handwriting-Dataset/images",
>>>                    label_path="/path/to/IMGUR5K-Handwriting-Dataset/dataset_info/imgur5k_annotations.json")
>>> img, target = test_set[0]

Args:

img_folder: folder with all the images of the dataset label_path: path to the annotations file of the dataset train: whether the subset should be the training one use_polygons: whether polygons should be considered as rotated bounding box (instead of straight ones) recognition_task: whether the dataset should be used for recognition task detection_task: whether the dataset should be used for detection task **kwargs: keyword arguments from AbstractDataset.

class doctr.datasets.MJSynth(img_folder: str, label_path: str, train: bool = True, **kwargs: Any)[source]

MJSynth dataset from “Synthetic Data and Artificial Neural Networks for Natural Scene Text Recognition”.

>>> # NOTE: This is a pure recognition dataset without bounding box labels.
>>> # NOTE: You need to download the dataset.
>>> from doctr.datasets import MJSynth
>>> train_set = MJSynth(img_folder="/path/to/mjsynth/mnt/ramdisk/max/90kDICT32px",
>>>                     label_path="/path/to/mjsynth/mnt/ramdisk/max/90kDICT32px/imlist.txt",
>>>                     train=True)
>>> img, target = train_set[0]
>>> test_set = MJSynth(img_folder="/path/to/mjsynth/mnt/ramdisk/max/90kDICT32px",
>>>                    label_path="/path/to/mjsynth/mnt/ramdisk/max/90kDICT32px/imlist.txt")
>>>                    train=False)
>>> img, target = test_set[0]

Args:

img_folder: folder with all the images of the dataset label_path: path to the file with the labels train: whether the subset should be the training one **kwargs: keyword arguments from AbstractDataset.

class doctr.datasets.IIITHWS(img_folder: str, label_path: str, train: bool = True, **kwargs: Any)[source]

IIITHWS dataset from “Generating Synthetic Data for Text Recognition” | “repository” | “website”.

>>> # NOTE: This is a pure recognition dataset without bounding box labels.
>>> # NOTE: You need to download the dataset.
>>> from doctr.datasets import IIITHWS
>>> train_set = IIITHWS(img_folder="/path/to/iiit-hws/Images_90K_Normalized",
>>>                     label_path="/path/to/IIIT-HWS-90K.txt",
>>>                     train=True)
>>> img, target = train_set[0]
>>> test_set = IIITHWS(img_folder="/path/to/iiit-hws/Images_90K_Normalized",
>>>                    label_path="/path/to/IIIT-HWS-90K.txt")
>>>                    train=False)
>>> img, target = test_set[0]

Args:

img_folder: folder with all the images of the dataset label_path: path to the file with the labels train: whether the subset should be the training one **kwargs: keyword arguments from AbstractDataset.

class doctr.datasets.DocArtefacts(train: bool = True, use_polygons: bool = False, **kwargs: Any)[source]

Object detection dataset for non-textual elements in documents. The dataset includes a variety of synthetic document pages with non-textual elements.

https://doctr-static.mindee.com/models?id=v0.5.0/artefacts-grid.png&src=0
>>> from doctr.datasets import DocArtefacts
>>> train_set = DocArtefacts(train=True, download=True)
>>> img, target = train_set[0]

Args:

train: whether the subset should be the training one use_polygons: whether polygons should be considered as rotated bounding box (instead of straight ones) **kwargs: keyword arguments from VisionDataset.

class doctr.datasets.WILDRECEIPT(img_folder: str, label_path: str, train: bool = True, use_polygons: bool = False, recognition_task: bool = False, detection_task: bool = False, **kwargs: Any)[source]
WildReceipt dataset from `”Spatial Dual-Modality Graph Reasoning for Key Information Extraction”

<https://arxiv.org/abs/2103.14470v1>`_ |

repository.

https://doctr-static.mindee.com/models?id=v0.7.0/wildreceipt-dataset.jpg&src=0
>>> # NOTE: You need to download the dataset first.
>>> from doctr.datasets import WILDRECEIPT
>>> train_set = WILDRECEIPT(train=True, img_folder="/path/to/wildreceipt/",
>>>                     label_path="/path/to/wildreceipt/train.txt")
>>> img, target = train_set[0]
>>> test_set = WILDRECEIPT(train=False, img_folder="/path/to/wildreceipt/",
>>>                    label_path="/path/to/wildreceipt/test.txt")
>>> img, target = test_set[0]

Args:

img_folder: folder with all the images of the dataset label_path: path to the annotations file of the dataset train: whether the subset should be the training one use_polygons: whether polygons should be considered as rotated bounding box (instead of straight ones) recognition_task: whether the dataset should be used for recognition task detection_task: whether the dataset should be used for detection task **kwargs: keyword arguments from AbstractDataset.

Synthetic dataset generator

class doctr.datasets.CharacterGenerator(*args, **kwargs)[source]

Implements a character image generation dataset

>>> from doctr.datasets import CharacterGenerator
>>> ds = CharacterGenerator(vocab='abdef', num_samples=100)
>>> img, target = ds[0]

Args:

vocab: vocabulary to take the character from num_samples: number of samples that will be generated iterating over the dataset cache_samples: whether generated images should be cached firsthand font_family: font to use to generate the text images img_transforms: composable transformations that will be applied to each image sample_transforms: composable transformations that will be applied to both the image and the target

class doctr.datasets.WordGenerator(vocab: str, min_chars: int, max_chars: int, num_samples: int, cache_samples: bool = False, font_family: str | List[str] | None = None, img_transforms: Callable[[Any], Any] | None = None, sample_transforms: Callable[[Any, Any], Tuple[Any, Any]] | None = None)[source]

Implements a character image generation dataset

>>> from doctr.datasets import WordGenerator
>>> ds = WordGenerator(vocab='abdef', min_chars=1, max_chars=32, num_samples=100)
>>> img, target = ds[0]

Args:

vocab: vocabulary to take the character from min_chars: minimum number of characters in a word max_chars: maximum number of characters in a word num_samples: number of samples that will be generated iterating over the dataset cache_samples: whether generated images should be cached firsthand font_family: font to use to generate the text images img_transforms: composable transformations that will be applied to each image sample_transforms: composable transformations that will be applied to both the image and the target

Custom dataset loader

class doctr.datasets.DetectionDataset(img_folder: str, label_path: str, use_polygons: bool = False, **kwargs: Any)[source]

Implements a text detection dataset

>>> from doctr.datasets import DetectionDataset
>>> train_set = DetectionDataset(img_folder="/path/to/images",
>>>                              label_path="/path/to/labels.json")
>>> img, target = train_set[0]

Args:

img_folder: folder with all the images of the dataset label_path: path to the annotations of each image use_polygons: whether polygons should be considered as rotated bounding box (instead of straight ones) **kwargs: keyword arguments from AbstractDataset.

class doctr.datasets.RecognitionDataset(img_folder: str, labels_path: str, **kwargs: Any)[source]

Dataset implementation for text recognition tasks

>>> from doctr.datasets import RecognitionDataset
>>> train_set = RecognitionDataset(img_folder="/path/to/images",
>>>                                labels_path="/path/to/labels.json")
>>> img, target = train_set[0]

Args:

img_folder: path to the images folder labels_path: pathe to the json file containing all labels (character sequences) **kwargs: keyword arguments from AbstractDataset.

class doctr.datasets.OCRDataset(img_folder: str, label_file: str, use_polygons: bool = False, **kwargs: Any)[source]

Implements an OCR dataset

>>> from doctr.datasets import OCRDataset
>>> train_set = OCRDataset(img_folder="/path/to/images",
>>>                        label_file="/path/to/labels.json")
>>> img, target = train_set[0]

Args:

img_folder: local path to image folder (all jpg at the root) label_file: local path to the label file use_polygons: whether polygons should be considered as rotated bounding box (instead of straight ones) **kwargs: keyword arguments from AbstractDataset.

Dataloader

class doctr.datasets.loader.DataLoader(dataset, shuffle: bool = True, batch_size: int = 1, drop_last: bool = False, collate_fn: Callable | None = None)[source]

Implements a dataset wrapper for fast data loading

>>> from doctr.datasets import CORD, DataLoader
>>> train_set = CORD(train=True, download=True)
>>> train_loader = DataLoader(train_set, batch_size=32)
>>> train_iter = iter(train_loader)
>>> images, targets = next(train_iter)

Args:

dataset: the dataset shuffle: whether the samples should be shuffled before passing it to the iterator batch_size: number of elements in each batch drop_last: if True, drops the last batch if it isn’t full collate_fn: function to merge samples into a batch

Supported Vocabs

Since textual content has to be encoded properly for models to interpret them efficiently, docTR supports multiple sets of vocabs.

docTR Vocabs

Name

size

characters

digits

10

0123456789

hindi_digits

10

٠١٢٣٤٥٦٧٨٩

ascii_letters

52

abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ

punctuation

32

!”#$%&'()*+,-./:;<=>?@[\]^_`{|}~

currency

5

£€¥¢฿

ancient_greek

48

αβγδεζηθικλμνξοπρστυφχψωΑΒΓΔΕΖΗΘΙΚΛΜΝΞΟΠΡΣΤΥΦΧΨΩ

arabic_letters

37

ءآأؤإئابةتثجحخدذرزسشصضطظعغـفقكلمنهوىي

generic_cyrillic_letters

58

абвгдежзийклмнопрстуфхцчшщьюяАБВГДЕЖЗИЙКЛМНОПРСТУФХЦЧШЩЬЮЯ

persian_letters

5

پچڢڤگ

arabic_diacritics

2

‘ًٌٍَُِّْ’

arabic_punctuation

5

؟؛«»—

latin

94

0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ!”#$%&’()*+,-./:;<=>?@[]^_`{|}~

english

100

0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ!”#$%&’()*+,-./:;<=>?@[]^_`{|}~°£€¥¢฿

legacy_french

123

0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ!”#$%&'()*+,-./:;<=>?@[\]^_`{|}~°àâéèêëîïôùûçÀÂÉÈËÎÏÔÙÛÇ£€¥¢฿

french

126

0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ!”#$%&’()*+,-./:;<=>?@[]^_`{|}~°£€¥¢฿àâéèêëîïôùûüçÀÂÉÈÊËÎÏÔÙÛÜÇ

portuguese

131

0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ!”#$%&’()*+,-./:;<=>?@[]^_`{|}~°£€¥¢฿áàâãéêëíïóôõúüçÁÀÂÃÉËÍÏÓÔÕÚÜÇ¡¿

spanish

116

0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ!”#$%&’()*+,-./:;<=>?@[]^_`{|}~°£€¥¢฿áéíóúüñÁÉÍÓÚÜÑ¡¿

italian

120

0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ!”#$%&’()*+,-./:;<=>?@[]^_`{|}~°£€¥¢฿àèéìíîòóùúÀÈÉÌÍÎÒÓÙÚ

german

108

0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ!”#$%&’()*+,-./:;<=>?@[]^_`{|}~°£€¥¢฿äöüßÄÖÜẞ

arabic

101

ءآأؤإئابةتثجحخدذرزسشصضطظعغـفقكلمنهوىيپچڢڤگ؟؛«»—0123456789٠١٢٣٤٥٦٧٨٩’ًٌٍَُِّْ’!”#$%&'()*+,-./:;<=>?@[\]^_`{|}~

czech

130

0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ!”#$%&’()*+,-./:;<=>?@[]^_`{|}~°£€¥¢฿áčďéěíňóřšťúůýžÁČĎÉĚÍŇÓŘŠŤÚŮÝŽ

polish

118

0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ!”#$%&’()*+,-./:;<=>?@[]^_`{|}~°£€¥¢฿ąćęłńóśźżĄĆĘŁŃÓŚŹŻ

dutch

114

0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ!”#$%&’()*+,-./:;<=>?@[]^_`{|}~°£€¥¢฿áéíóúüñÁÉÍÓÚÜÑ

norwegian

106

0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ!”#$%&’()*+,-./:;<=>?@[]^_`{|}~°£€¥¢฿æøåÆØÅ

danish

106

0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ!”#$%&'()*+,-./:;<=>?@[\]^_`{|}~°£€¥¢฿æøåÆØÅ

finnish

104

0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ!”#$%&’()*+,-./:;<=>?@[]^_`{|}~°£€¥¢฿äöÄÖ

swedish

106

0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ!”#$%&’()*+,-./:;<=>?@[]^_`{|}~°£€¥¢฿åäöÅÄÖ

ukrainian

115

абвгдежзийклмнопрстуфхцчшщьюяАБВГДЕЖЗИЙКЛМНОПРСТУФХЦЧШЩЬЮЯ0123456789!”#$%&’()*+,-./:;<=>?@[]^_`{|}~°£€¥¢฿ґіїєҐІЇЄ₴

vietnamese

236

0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ!”#$%&’()*+,-./:;<=>?@[]^_`{|}~°£€¥¢฿áàảạãăắằẳẵặâấầẩẫậđéèẻẽẹêếềểễệóòỏõọôốồổộỗơớờởợỡúùủũụưứừửữựiíìỉĩịýỳỷỹỵÁÀẢẠÃĂẮẰẲẴẶÂẤẦẨẪẬĐÉÈẺẼẸÊẾỀỂỄỆÓÒỎÕỌÔỐỒỔỘỖƠỚỜỞỢỠÚÙỦŨỤƯỨỪỬỮỰIÍÌỈĨỊÝỲỶỸỴ

hebrew

123

0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ!”#$%&’()*+,-./:;<=>?@[]^_`{|}~°£€¥¢฿אבגדהוזחטיכלמנסעפצקרשת₪

hindi

71

अआइईउऊऋॠऌॡएऐओऔअंअःकखगघङचछजझञटठडढणतथदधनपफबभमयरलवशषसह०१२३४५६७८९।,?!:्ॐ॰॥॰

bangla

70

অআইঈউঊঋএঐওঔকখগঘঙচছজঝঞটঠডঢণতথদধনপফবভমযরলশষসহ়ঽািীুূৃেৈোৌ্ৎংঃঁ০১২৩৪৫৬৭৮৯

multilingual

195

english & french & german & italian & spanish & portuguese & czech & polish & dutch & norwegian & danish & finnish & swedish & §

doctr.datasets.encode_sequences(sequences: List[str], vocab: str, target_size: int | None = None, eos: int = -1, sos: int | None = None, pad: int | None = None, dynamic_seq_length: bool = False) ndarray[source]

Encode character sequences using a given vocab as mapping

Args:

sequences: the list of character sequences of size N vocab: the ordered vocab to use for encoding target_size: maximum length of the encoded data eos: encoding of End Of String sos: optional encoding of Start Of String pad: optional encoding for padding. In case of padding, all sequences are followed by 1 EOS then PAD dynamic_seq_length: if target_size is specified, uses it as upper bound and enables dynamic sequence size

Returns:

the padded encoded data as a tensor