doctr.datasets#

class doctr.datasets.FUNSD(train: bool = True, use_polygons: bool = False, recognition_task: bool = False, **kwargs: Any)[source]#

FUNSD dataset from “FUNSD: A Dataset for Form Understanding in Noisy Scanned Documents”.

https://doctr-static.mindee.com/models?id=v0.5.0/funsd-grid.png&src=0
>>> from doctr.datasets import FUNSD
>>> train_set = FUNSD(train=True, download=True)
>>> img, target = train_set[0]

Args:#

train: whether the subset should be the training one
use_polygons: whether polygons should be considered as rotated bounding box (instead of straight ones)
recognition_task: whether the dataset should be used for recognition task
**kwargs: keyword arguments from VisionDataset.
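The `use_polygons` argument, shared by most datasets on this page, selects between two ways of describing a text box. The sketch below only illustrates the relationship between the two: the coordinate layout (four `(x, y)` corners for a rotated box, `(xmin, ymin, xmax, ymax)` for a straight one) and the helper name `polygon_to_straight_box` are assumptions made for this example, not something this page specifies.

```python
from typing import List, Tuple

Point = Tuple[float, float]


def polygon_to_straight_box(polygon: List[Point]) -> Tuple[float, float, float, float]:
    """Return the axis-aligned box (xmin, ymin, xmax, ymax) enclosing a polygon."""
    xs = [x for x, _ in polygon]
    ys = [y for _, y in polygon]
    return min(xs), min(ys), max(xs), max(ys)


# Hypothetical rotated box, given as four (x, y) corners in relative coordinates.
rotated = [(0.2, 0.1), (0.6, 0.2), (0.5, 0.4), (0.1, 0.3)]
straight = polygon_to_straight_box(rotated)
print(straight)  # (0.1, 0.1, 0.6, 0.4)
```

A straight box discards the rotation, so it generally covers more area than the polygon it encloses.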

class doctr.datasets.SROIE(train: bool = True, use_polygons: bool = False, recognition_task: bool = False, **kwargs: Any)[source]#

SROIE dataset from “ICDAR2019 Competition on Scanned Receipt OCR and Information Extraction”.

https://doctr-static.mindee.com/models?id=v0.5.0/sroie-grid.png&src=0
>>> from doctr.datasets import SROIE
>>> train_set = SROIE(train=True, download=True)
>>> img, target = train_set[0]

Args:#

train: whether the subset should be the training one
use_polygons: whether polygons should be considered as rotated bounding box (instead of straight ones)
recognition_task: whether the dataset should be used for recognition task
**kwargs: keyword arguments from VisionDataset.

class doctr.datasets.CORD(train: bool = True, use_polygons: bool = False, recognition_task: bool = False, **kwargs: Any)[source]#

CORD dataset from “CORD: A Consolidated Receipt Dataset for Post-OCR Parsing”.

https://doctr-static.mindee.com/models?id=v0.5.0/cord-grid.png&src=0
>>> from doctr.datasets import CORD
>>> train_set = CORD(train=True, download=True)
>>> img, target = train_set[0]

Args:#

train: whether the subset should be the training one
use_polygons: whether polygons should be considered as rotated bounding box (instead of straight ones)
recognition_task: whether the dataset should be used for recognition task
**kwargs: keyword arguments from VisionDataset.

class doctr.datasets.IIIT5K(train: bool = True, use_polygons: bool = False, recognition_task: bool = False, **kwargs: Any)[source]#

IIIT-5K character-level localization dataset from “BMVC 2012 Scene Text Recognition using Higher Order Language Priors”.

https://doctr-static.mindee.com/models?id=v0.5.0/iiit5k-grid.png&src=0
>>> # NOTE: this dataset is for character-level localization
>>> from doctr.datasets import IIIT5K
>>> train_set = IIIT5K(train=True, download=True)
>>> img, target = train_set[0]

Args:#

train: whether the subset should be the training one
use_polygons: whether polygons should be considered as rotated bounding box (instead of straight ones)
recognition_task: whether the dataset should be used for recognition task
**kwargs: keyword arguments from VisionDataset.

class doctr.datasets.SVT(train: bool = True, use_polygons: bool = False, recognition_task: bool = False, **kwargs: Any)[source]#

SVT dataset from “The Street View Text Dataset - UCSD Computer Vision”.

https://doctr-static.mindee.com/models?id=v0.5.0/svt-grid.png&src=0
>>> from doctr.datasets import SVT
>>> train_set = SVT(train=True, download=True)
>>> img, target = train_set[0]

Args:#

train: whether the subset should be the training one
use_polygons: whether polygons should be considered as rotated bounding box (instead of straight ones)
recognition_task: whether the dataset should be used for recognition task
**kwargs: keyword arguments from VisionDataset.

class doctr.datasets.SVHN(train: bool = True, use_polygons: bool = False, recognition_task: bool = False, **kwargs: Any)[source]#

SVHN dataset from “The Street View House Numbers (SVHN) Dataset”.

https://doctr-static.mindee.com/models?id=v0.5.0/svhn-grid.png&src=0
>>> from doctr.datasets import SVHN
>>> train_set = SVHN(train=True, download=True)
>>> img, target = train_set[0]

Args:#

train: whether the subset should be the training one
use_polygons: whether polygons should be considered as rotated bounding box (instead of straight ones)
recognition_task: whether the dataset should be used for recognition task
**kwargs: keyword arguments from VisionDataset.

class doctr.datasets.SynthText(train: bool = True, use_polygons: bool = False, recognition_task: bool = False, **kwargs: Any)[source]#

SynthText dataset from “Synthetic Data for Text Localisation in Natural Images” | “repository” | “website”.

https://doctr-static.mindee.com/models?id=v0.5.0/svt-grid.png&src=0
>>> from doctr.datasets import SynthText
>>> train_set = SynthText(train=True, download=True)
>>> img, target = train_set[0]

Args:#

train: whether the subset should be the training one
use_polygons: whether polygons should be considered as rotated bounding box (instead of straight ones)
recognition_task: whether the dataset should be used for recognition task
**kwargs: keyword arguments from VisionDataset.

class doctr.datasets.IC03(train: bool = True, use_polygons: bool = False, recognition_task: bool = False, **kwargs: Any)[source]#

IC03 dataset from “ICDAR 2003 Robust Reading Competitions: Entries, Results and Future Directions”.

https://doctr-static.mindee.com/models?id=v0.5.0/ic03-grid.png&src=0
>>> from doctr.datasets import IC03
>>> train_set = IC03(train=True, download=True)
>>> img, target = train_set[0]

Args:#

train: whether the subset should be the training one
use_polygons: whether polygons should be considered as rotated bounding box (instead of straight ones)
recognition_task: whether the dataset should be used for recognition task
**kwargs: keyword arguments from VisionDataset.

class doctr.datasets.IC13(img_folder: str, label_folder: str, use_polygons: bool = False, recognition_task: bool = False, **kwargs: Any)[source]#

IC13 dataset from “ICDAR 2013 Robust Reading Competition”.

https://doctr-static.mindee.com/models?id=v0.5.0/ic13-grid.png&src=0
>>> # NOTE: You need to download both image and label parts from Focused Scene Text challenge Task2.1 2013-2015.
>>> from doctr.datasets import IC13
>>> train_set = IC13(img_folder="/path/to/Challenge2_Training_Task12_Images",
>>>                  label_folder="/path/to/Challenge2_Training_Task1_GT")
>>> img, target = train_set[0]
>>> test_set = IC13(img_folder="/path/to/Challenge2_Test_Task12_Images",
>>>                 label_folder="/path/to/Challenge2_Test_Task1_GT")
>>> img, target = test_set[0]

Args:#

img_folder: folder with all the images of the dataset
label_folder: folder with all annotation files for the images
use_polygons: whether polygons should be considered as rotated bounding box (instead of straight ones)
recognition_task: whether the dataset should be used for recognition task
**kwargs: keyword arguments from AbstractDataset.

class doctr.datasets.IMGUR5K(img_folder: str, label_path: str, train: bool = True, use_polygons: bool = False, recognition_task: bool = False, **kwargs: Any)[source]#

IMGUR5K dataset from “TextStyleBrush: Transfer of Text Aesthetics from a Single Example” | repository.

https://doctr-static.mindee.com/models?id=v0.5.0/imgur5k-grid.png&src=0
>>> # NOTE: You need to download/generate the dataset from the repository.
>>> from doctr.datasets import IMGUR5K
>>> train_set = IMGUR5K(train=True, img_folder="/path/to/IMGUR5K-Handwriting-Dataset/images",
>>>                     label_path="/path/to/IMGUR5K-Handwriting-Dataset/dataset_info/imgur5k_annotations.json")
>>> img, target = train_set[0]
>>> test_set = IMGUR5K(train=False, img_folder="/path/to/IMGUR5K-Handwriting-Dataset/images",
>>>                    label_path="/path/to/IMGUR5K-Handwriting-Dataset/dataset_info/imgur5k_annotations.json")
>>> img, target = test_set[0]

Args:#

img_folder: folder with all the images of the dataset
label_path: path to the annotations file of the dataset
train: whether the subset should be the training one
use_polygons: whether polygons should be considered as rotated bounding box (instead of straight ones)
recognition_task: whether the dataset should be used for recognition task
**kwargs: keyword arguments from AbstractDataset.

class doctr.datasets.MJSynth(img_folder: str, label_path: str, train: bool = True, **kwargs: Any)[source]#

MJSynth dataset from “Synthetic Data and Artificial Neural Networks for Natural Scene Text Recognition”.

>>> # NOTE: This is a pure recognition dataset without bounding box labels.
>>> # NOTE: You need to download the dataset.
>>> from doctr.datasets import MJSynth
>>> train_set = MJSynth(img_folder="/path/to/mjsynth/mnt/ramdisk/max/90kDICT32px",
>>>                     label_path="/path/to/mjsynth/mnt/ramdisk/max/90kDICT32px/imlist.txt",
>>>                     train=True)
>>> img, target = train_set[0]
>>> test_set = MJSynth(img_folder="/path/to/mjsynth/mnt/ramdisk/max/90kDICT32px",
>>>                    label_path="/path/to/mjsynth/mnt/ramdisk/max/90kDICT32px/imlist.txt",
>>>                    train=False)
>>> img, target = test_set[0]

Args:#

img_folder: folder with all the images of the dataset
label_path: path to the file with the labels
train: whether the subset should be the training one
**kwargs: keyword arguments from AbstractDataset.

class doctr.datasets.IIITHWS(img_folder: str, label_path: str, train: bool = True, **kwargs: Any)[source]#

IIITHWS dataset from “Generating Synthetic Data for Text Recognition” | “repository” | “website”.

>>> # NOTE: This is a pure recognition dataset without bounding box labels.
>>> # NOTE: You need to download the dataset.
>>> from doctr.datasets import IIITHWS
>>> train_set = IIITHWS(img_folder="/path/to/iiit-hws/Images_90K_Normalized",
>>>                     label_path="/path/to/IIIT-HWS-90K.txt",
>>>                     train=True)
>>> img, target = train_set[0]
>>> test_set = IIITHWS(img_folder="/path/to/iiit-hws/Images_90K_Normalized",
>>>                    label_path="/path/to/IIIT-HWS-90K.txt",
>>>                    train=False)
>>> img, target = test_set[0]

Args:#

img_folder: folder with all the images of the dataset
label_path: path to the file with the labels
train: whether the subset should be the training one
**kwargs: keyword arguments from AbstractDataset.

class doctr.datasets.DocArtefacts(train: bool = True, use_polygons: bool = False, **kwargs: Any)[source]#

Object detection dataset for non-textual elements in documents. The dataset includes a variety of synthetic document pages with non-textual elements.

https://doctr-static.mindee.com/models?id=v0.5.0/artefacts-grid.png&src=0
>>> from doctr.datasets import DocArtefacts
>>> train_set = DocArtefacts(train=True, download=True)
>>> img, target = train_set[0]

Args:#

train: whether the subset should be the training one
use_polygons: whether polygons should be considered as rotated bounding box (instead of straight ones)
**kwargs: keyword arguments from VisionDataset.

class doctr.datasets.WILDRECEIPT(img_folder: str, label_path: str, train: bool = True, use_polygons: bool = False, recognition_task: bool = False, **kwargs: Any)[source]#

WildReceipt dataset from “Spatial Dual-Modality Graph Reasoning for Key Information Extraction” (https://arxiv.org/abs/2103.14470v1) | repository.

https://doctr-static.mindee.com/models?id=v0.7.0/wildreceipt-dataset.jpg&src=0
>>> # NOTE: You need to download the dataset first.
>>> from doctr.datasets import WILDRECEIPT
>>> train_set = WILDRECEIPT(train=True, img_folder="/path/to/wildreceipt/",
>>>                     label_path="/path/to/wildreceipt/train.txt")
>>> img, target = train_set[0]
>>> test_set = WILDRECEIPT(train=False, img_folder="/path/to/wildreceipt/",
>>>                    label_path="/path/to/wildreceipt/test.txt")
>>> img, target = test_set[0]

Args:#

img_folder: folder with all the images of the dataset
label_path: path to the annotations file of the dataset
train: whether the subset should be the training one
use_polygons: whether polygons should be considered as rotated bounding box (instead of straight ones)
recognition_task: whether the dataset should be used for recognition task
**kwargs: keyword arguments from AbstractDataset.

Synthetic dataset generator#

class doctr.datasets.CharacterGenerator(*args, **kwargs)[source]#

Implements a character image generation dataset

>>> from doctr.datasets import CharacterGenerator
>>> ds = CharacterGenerator(vocab='abdef', num_samples=100)
>>> img, target = ds[0]

Args:#

vocab: vocabulary to take the characters from
num_samples: number of samples that will be generated when iterating over the dataset
cache_samples: whether generated images should be cached beforehand
font_family: font to use to generate the text images
img_transforms: composable transformations that will be applied to each image
sample_transforms: composable transformations that will be applied to both the image and the target

class doctr.datasets.WordGenerator(vocab: str, min_chars: int, max_chars: int, num_samples: int, cache_samples: bool = False, font_family: str | List[str] | None = None, img_transforms: Callable[[Any], Any] | None = None, sample_transforms: Callable[[Any, Any], Tuple[Any, Any]] | None = None)[source]#

Implements a word image generation dataset

>>> from doctr.datasets import WordGenerator
>>> ds = WordGenerator(vocab='abdef', min_chars=1, max_chars=32, num_samples=100)
>>> img, target = ds[0]

Args:#

vocab: vocabulary to take the characters from
min_chars: minimum number of characters in a word
max_chars: maximum number of characters in a word
num_samples: number of samples that will be generated when iterating over the dataset
cache_samples: whether generated images should be cached beforehand
font_family: font to use to generate the text images
img_transforms: composable transformations that will be applied to each image
sample_transforms: composable transformations that will be applied to both the image and the target
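To make the roles of `vocab`, `min_chars`, `max_chars` and `num_samples` concrete, here is a minimal sketch of the text-sampling side only. It does not render images and is not the library's implementation: the helper `sample_words`, its `seed` parameter, and the choice of uniformly sampled word lengths are all assumptions made for illustration.

```python
import random
from typing import List


def sample_words(vocab: str, min_chars: int, max_chars: int,
                 num_samples: int, seed: int = 0) -> List[str]:
    """Draw random words whose characters all come from `vocab`."""
    rng = random.Random(seed)  # seeded so the sketch is reproducible
    words = []
    for _ in range(num_samples):
        # randint is inclusive on both ends, so max_chars can be reached
        length = rng.randint(min_chars, max_chars)
        words.append("".join(rng.choice(vocab) for _ in range(length)))
    return words


words = sample_words(vocab="abdef", min_chars=1, max_chars=32, num_samples=100)
print(len(words), words[:3])
```

Each generated word would then serve as the target of the corresponding rendered image.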

Custom dataset loader#

class doctr.datasets.DetectionDataset(img_folder: str, label_path: str, use_polygons: bool = False, **kwargs: Any)[source]#

Implements a text detection dataset

>>> from doctr.datasets import DetectionDataset
>>> train_set = DetectionDataset(img_folder="/path/to/images",
>>>                              label_path="/path/to/labels.json")
>>> img, target = train_set[0]

Args:#

img_folder: folder with all the images of the dataset
label_path: path to the annotations of each image
use_polygons: whether polygons should be considered as rotated bounding box (instead of straight ones)
**kwargs: keyword arguments from AbstractDataset.

class doctr.datasets.RecognitionDataset(img_folder: str, labels_path: str, **kwargs: Any)[source]#

Dataset implementation for text recognition tasks

>>> from doctr.datasets import RecognitionDataset
>>> train_set = RecognitionDataset(img_folder="/path/to/images",
>>>                                labels_path="/path/to/labels.json")
>>> img, target = train_set[0]

Args:#

img_folder: path to the images folder
labels_path: path to the json file containing all labels (character sequences)
**kwargs: keyword arguments from AbstractDataset.
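This page does not spell out the layout of the labels file. The sketch below assumes a flat JSON object mapping each image file name to its character sequence, which should be checked against the library's own documentation before use. The file names, the transcriptions and the helper `load_labels` are hypothetical.

```python
import json
import tempfile
from pathlib import Path
from typing import Dict


def load_labels(labels_path: Path) -> Dict[str, str]:
    """Read a labels file and check that every label is a character sequence."""
    labels = json.loads(labels_path.read_text(encoding="utf-8"))
    for name, text in labels.items():
        if not isinstance(text, str):
            raise ValueError(f"label for {name!r} is not a string")
    return labels


# Assumed layout: {"<image file name>": "<transcription>"}
labels = {"img_1.jpg": "hello", "img_2.jpg": "world"}

with tempfile.TemporaryDirectory() as tmp_dir:
    labels_path = Path(tmp_dir) / "labels.json"
    labels_path.write_text(json.dumps(labels, ensure_ascii=False), encoding="utf-8")
    loaded = load_labels(labels_path)

print(loaded)
```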

class doctr.datasets.OCRDataset(img_folder: str, label_file: str, use_polygons: bool = False, **kwargs: Any)[source]#

Implements an OCR dataset

>>> from doctr.datasets import OCRDataset
>>> train_set = OCRDataset(img_folder="/path/to/images",
>>>                        label_file="/path/to/labels.json")
>>> img, target = train_set[0]

Args:#

img_folder: local path to image folder (all jpg at the root)
label_file: local path to the label file
use_polygons: whether polygons should be considered as rotated bounding box (instead of straight ones)
**kwargs: keyword arguments from AbstractDataset.

Dataloader#

class doctr.datasets.loader.DataLoader(dataset, shuffle: bool = True, batch_size: int = 1, drop_last: bool = False, num_workers: int | None = None, collate_fn: Callable | None = None)[source]#

Implements a dataset wrapper for fast data loading

>>> from doctr.datasets import CORD, DataLoader
>>> train_set = CORD(train=True, download=True)
>>> train_loader = DataLoader(train_set, batch_size=32)
>>> train_iter = iter(train_loader)
>>> images, targets = next(train_iter)

Args:#

dataset: the dataset
shuffle: whether the samples should be shuffled before passing it to the iterator
batch_size: number of elements in each batch
drop_last: if True, drops the last batch if it isn’t full
num_workers: number of workers to use for data loading
collate_fn: function to merge samples into a batch
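The interplay of `shuffle`, `batch_size`, `drop_last` and `collate_fn` can be pictured with a small pure-Python generator. This is a single-threaded sketch under stated assumptions, not the library's loader: `iterate_batches`, `split_columns` and the `seed` parameter are hypothetical names, and `num_workers` is left out because nothing here runs in parallel.

```python
import random
from typing import Any, Callable, Iterator, List, Optional, Sequence


def iterate_batches(
    dataset: Sequence[Any],
    shuffle: bool = True,
    batch_size: int = 1,
    drop_last: bool = False,
    collate_fn: Optional[Callable[[List[Any]], Any]] = None,
    seed: Optional[int] = None,
) -> Iterator[Any]:
    """Yield batches of samples, mimicking the arguments described above."""
    indices = list(range(len(dataset)))
    if shuffle:
        random.Random(seed).shuffle(indices)
    for start in range(0, len(indices), batch_size):
        chunk = indices[start:start + batch_size]
        if drop_last and len(chunk) < batch_size:
            break  # the trailing, incomplete batch is discarded
        samples = [dataset[i] for i in chunk]
        yield collate_fn(samples) if collate_fn is not None else samples


def split_columns(samples):
    """Turn a list of (image, target) pairs into (images, targets)."""
    images, targets = zip(*samples)
    return list(images), list(targets)


# Toy stand-in for a dataset: 10 (image, target) pairs.
toy_dataset = [(f"img_{i}", f"label_{i}") for i in range(10)]
batches = list(iterate_batches(toy_dataset, shuffle=False, batch_size=4,
                               drop_last=True, collate_fn=split_columns))
print(len(batches))  # 2: the last 2 samples do not fill a batch of 4
```

With `drop_last=False`, the same call yields a third batch holding the two remaining samples.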

Supported Vocabs#

Since textual content has to be encoded properly for models to interpret it efficiently, docTR supports multiple sets of vocabs.

docTR Vocabs#

Name                Size  Characters
digits              10    0123456789
hindi_digits        10    ٠١٢٣٤٥٦٧٨٩
ascii_letters       52    abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ
punctuation         32    !"#$%&'()*+,-./:;<=>?@[\]^_`{|}~
currency            5     £€¥¢฿
ancient_greek       48    αβγδεζηθικλμνξοπρστυφχψωΑΒΓΔΕΖΗΘΙΚΛΜΝΞΟΠΡΣΤΥΦΧΨΩ
arabic_letters      37    ءآأؤإئابةتثجحخدذرزسشصضطظعغـفقكلمنهوىي
persian_letters     5     پچڢڤگ
arabic_diacritics   2     ‘ًٌٍَُِّْ’
arabic_punctuation  5     ؟؛«»—
latin               94    0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~
english             100   0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~°£€¥¢฿
legacy_french       123   0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~°àâéèêëîïôùûçÀÂÉÈËÎÏÔÙÛÇ£€¥¢฿
french              126   0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~°£€¥¢฿àâéèêëîïôùûüçÀÂÉÈÊËÎÏÔÙÛÜÇ
portuguese          131   0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~°£€¥¢฿áàâãéêëíïóôõúüçÁÀÂÃÉËÍÏÓÔÕÚÜÇ¡¿
spanish             116   0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~°£€¥¢฿áéíóúüñÁÉÍÓÚÜÑ¡¿
italian             120   0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~°£€¥¢฿àèéìíîòóùúÀÈÉÌÍÎÒÓÙÚ
german              108   0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~°£€¥¢฿äöüßÄÖÜẞ
arabic              101   ءآأؤإئابةتثجحخدذرزسشصضطظعغـفقكلمنهوىيپچڢڤگ؟؛«»—0123456789٠١٢٣٤٥٦٧٨٩’ًٌٍَُِّْ’!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~
czech               130   0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~°£€¥¢฿áčďéěíňóřšťúůýžÁČĎÉĚÍŇÓŘŠŤÚŮÝŽ
polish              118   0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~°£€¥¢฿ąćęłńóśźżĄĆĘŁŃÓŚŹŻ
dutch               114   0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~°£€¥¢฿áéíóúüñÁÉÍÓÚÜÑ
norwegian           106   0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~°£€¥¢฿æøåÆØÅ
danish              106   0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~°£€¥¢฿æøåÆØÅ
finnish             104   0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~°£€¥¢฿äöÄÖ
swedish             106   0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~°£€¥¢฿åäöÅÄÖ
vietnamese          234   0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~°£€¥¢฿áàảạãăắằẳẵặâấầẩẫậéèẻẽẹêếềểễệóòỏõọôốồổộỗơớờởợỡúùủũụưứừửữựiíìỉĩịýỳỷỹỵÁÀẢẠÃĂẮẰẲẴẶÂẤẦẨẪẬÉÈẺẼẸÊẾỀỂỄỆÓÒỎÕỌÔỐỒỔỘỖƠỚỜỞỢỠÚÙỦŨỤƯỨỪỬỮỰIÍÌỈĨỊÝỲỶỸỴ
hebrew              123   0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~°£€¥¢฿אבגדהוזחטיכלמנסעפצקרשת₪
multilingual        195   english & french & german & italian & spanish & portuguese & czech & polish & dutch & norwegian & danish & finnish & swedish & §
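The sizes listed for the composite vocabs follow from concatenating the smaller sets. As a quick check, the sketch below rebuilds `latin` and `english` from Python's `string` module and the currency characters shown in the table. The variable names are illustrative, and building the vocabs this way (rather than importing them from the library) is an assumption of this example.

```python
import string

digits = string.digits                # 10 characters
ascii_letters = string.ascii_letters  # 52 characters
punctuation = string.punctuation      # 32 characters
currency = "£€¥¢฿"                    # 5 characters

# latin = digits + letters + punctuation
latin = digits + ascii_letters + punctuation
# english = latin + degree sign + currency symbols
english = latin + "°" + currency

print(len(latin), len(english))  # 94 100
```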

doctr.datasets.encode_sequences(sequences: List[str], vocab: str, target_size: int | None = None, eos: int = -1, sos: int | None = None, pad: int | None = None, dynamic_seq_length: bool = False) -> ndarray[source]#

Encode character sequences using a given vocab as mapping

Args:#

sequences: the list of character sequences of size N
vocab: the ordered vocab to use for encoding
target_size: maximum length of the encoded data
eos: encoding of End Of String
sos: optional encoding of Start Of String
pad: optional encoding for padding. In case of padding, all sequences are followed by 1 EOS then PAD
dynamic_seq_length: if target_size is specified, uses it as upper bound and enables dynamic sequence size

Returns:#

the padded encoded data as a tensor
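As an illustration of the encoding described above, here is a simplified pure-Python sketch. It is not the library's implementation: `encode_sequences_sketch` is a hypothetical name, it returns nested lists instead of an ndarray, it ignores `sos` and `dynamic_seq_length`, and two behaviours are assumptions: short sequences are filled with `eos` when no `pad` is given, and one extra slot is reserved for EOS when `pad` is set and `target_size` is not.

```python
from typing import List, Optional


def encode_sequences_sketch(
    sequences: List[str],
    vocab: str,
    target_size: Optional[int] = None,
    eos: int = -1,
    pad: Optional[int] = None,
) -> List[List[int]]:
    """Map each character to its index in `vocab`, then fill up to a fixed length."""
    char_to_index = {char: idx for idx, char in enumerate(vocab)}
    if target_size is None:
        target_size = max(len(seq) for seq in sequences)
        if pad is not None:
            target_size += 1  # assumed: leave room for one EOS before the padding
    encoded = []
    for seq in sequences:
        # A character missing from the vocab raises KeyError in this sketch.
        row = [char_to_index[char] for char in seq][:target_size]
        fill = eos
        if pad is not None:
            if len(row) < target_size:
                row = row + [eos]  # exactly one EOS, then PAD
            fill = pad
        encoded.append(row + [fill] * (target_size - len(row)))
    return encoded


vocab = "0123456789"
print(encode_sequences_sketch(["123", "7"], vocab, eos=10))
# [[1, 2, 3], [7, 10, 10]]
print(encode_sequences_sketch(["123", "7"], vocab, eos=10, pad=11))
# [[1, 2, 3, 10], [7, 10, 11, 11]]
```

Choosing `eos` and `pad` values outside the range of vocab indices, as done here, keeps them distinguishable from real characters.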