doctr.datasets
- class doctr.datasets.FUNSD(train: bool = True, use_polygons: bool = False, recognition_task: bool = False, **kwargs: Any)
FUNSD dataset from “FUNSD: A Dataset for Form Understanding in Noisy Scanned Documents”.
>>> from doctr.datasets import FUNSD
>>> train_set = FUNSD(train=True, download=True)
>>> img, target = train_set[0]
Args:
train: whether the subset should be the training one
use_polygons: whether polygons should be considered as rotated bounding boxes (instead of straight ones)
recognition_task: whether the dataset should be used for the recognition task
**kwargs: keyword arguments from VisionDataset.
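To make the use_polygons flag more concrete, the sketch below collapses a rotated four-point polygon into the straight box that encloses it. It is an illustration only: the helper name polygon_to_straight_box and the coordinate layout (four (x, y) corners in, (xmin, ymin, xmax, ymax) out) are assumptions made for this example, not part of the docTR API.

```python
from typing import List, Tuple

Point = Tuple[float, float]


def polygon_to_straight_box(polygon: List[Point]) -> Tuple[float, float, float, float]:
    """Return the axis-aligned box (xmin, ymin, xmax, ymax) enclosing a polygon.

    Hypothetical helper: it only illustrates the difference between a rotated
    polygon and a straight bounding box.
    """
    xs = [x for x, _ in polygon]
    ys = [y for _, y in polygon]
    return min(xs), min(ys), max(xs), max(ys)


# A slightly rotated word box, given as four relative (x, y) corners
rotated = [(0.2, 0.1), (0.6, 0.2), (0.5, 0.4), (0.1, 0.3)]
straight = polygon_to_straight_box(rotated)
print(straight)
```

The straight box always contains the polygon, so it is a looser fit whenever the text is rotated.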
- class doctr.datasets.SROIE(train: bool = True, use_polygons: bool = False, recognition_task: bool = False, **kwargs: Any)
SROIE dataset from “ICDAR2019 Competition on Scanned Receipt OCR and Information Extraction”.
>>> from doctr.datasets import SROIE
>>> train_set = SROIE(train=True, download=True)
>>> img, target = train_set[0]
Args:
train: whether the subset should be the training one
use_polygons: whether polygons should be considered as rotated bounding boxes (instead of straight ones)
recognition_task: whether the dataset should be used for the recognition task
**kwargs: keyword arguments from VisionDataset.
- class doctr.datasets.CORD(train: bool = True, use_polygons: bool = False, recognition_task: bool = False, **kwargs: Any)
CORD dataset from “CORD: A Consolidated Receipt Dataset for Post-OCR Parsing”.
>>> from doctr.datasets import CORD
>>> train_set = CORD(train=True, download=True)
>>> img, target = train_set[0]
Args:
train: whether the subset should be the training one
use_polygons: whether polygons should be considered as rotated bounding boxes (instead of straight ones)
recognition_task: whether the dataset should be used for the recognition task
**kwargs: keyword arguments from VisionDataset.
- class doctr.datasets.IIIT5K(train: bool = True, use_polygons: bool = False, recognition_task: bool = False, **kwargs: Any)
IIIT-5K character-level localization dataset from “BMVC 2012 Scene Text Recognition using Higher Order Language Priors”.
>>> # NOTE: this dataset is for character-level localization
>>> from doctr.datasets import IIIT5K
>>> train_set = IIIT5K(train=True, download=True)
>>> img, target = train_set[0]
Args:
train: whether the subset should be the training one
use_polygons: whether polygons should be considered as rotated bounding boxes (instead of straight ones)
recognition_task: whether the dataset should be used for the recognition task
**kwargs: keyword arguments from VisionDataset.
- class doctr.datasets.SVT(train: bool = True, use_polygons: bool = False, recognition_task: bool = False, **kwargs: Any)
SVT dataset from “The Street View Text Dataset - UCSD Computer Vision”.
>>> from doctr.datasets import SVT
>>> train_set = SVT(train=True, download=True)
>>> img, target = train_set[0]
Args:
train: whether the subset should be the training one
use_polygons: whether polygons should be considered as rotated bounding boxes (instead of straight ones)
recognition_task: whether the dataset should be used for the recognition task
**kwargs: keyword arguments from VisionDataset.
- class doctr.datasets.SVHN(train: bool = True, use_polygons: bool = False, recognition_task: bool = False, **kwargs: Any)
SVHN dataset from “The Street View House Numbers (SVHN) Dataset”.
>>> from doctr.datasets import SVHN
>>> train_set = SVHN(train=True, download=True)
>>> img, target = train_set[0]
Args:
train: whether the subset should be the training one
use_polygons: whether polygons should be considered as rotated bounding boxes (instead of straight ones)
recognition_task: whether the dataset should be used for the recognition task
**kwargs: keyword arguments from VisionDataset.
- class doctr.datasets.SynthText(train: bool = True, use_polygons: bool = False, recognition_task: bool = False, **kwargs: Any)
SynthText dataset from “Synthetic Data for Text Localisation in Natural Images” | “repository” | “website”.
>>> from doctr.datasets import SynthText
>>> train_set = SynthText(train=True, download=True)
>>> img, target = train_set[0]
Args:
train: whether the subset should be the training one
use_polygons: whether polygons should be considered as rotated bounding boxes (instead of straight ones)
recognition_task: whether the dataset should be used for the recognition task
**kwargs: keyword arguments from VisionDataset.
- class doctr.datasets.IC03(train: bool = True, use_polygons: bool = False, recognition_task: bool = False, **kwargs: Any)
IC03 dataset from “ICDAR 2003 Robust Reading Competitions: Entries, Results and Future Directions”.
>>> from doctr.datasets import IC03
>>> train_set = IC03(train=True, download=True)
>>> img, target = train_set[0]
Args:
train: whether the subset should be the training one
use_polygons: whether polygons should be considered as rotated bounding boxes (instead of straight ones)
recognition_task: whether the dataset should be used for the recognition task
**kwargs: keyword arguments from VisionDataset.
- class doctr.datasets.IC13(img_folder: str, label_folder: str, use_polygons: bool = False, recognition_task: bool = False, **kwargs: Any)
IC13 dataset from “ICDAR 2013 Robust Reading Competition”.
>>> # NOTE: You need to download both image and label parts from Focused Scene Text challenge Task2.1 2013-2015.
>>> from doctr.datasets import IC13
>>> train_set = IC13(
...     img_folder="/path/to/Challenge2_Training_Task12_Images",
...     label_folder="/path/to/Challenge2_Training_Task1_GT",
... )
>>> img, target = train_set[0]
>>> test_set = IC13(
...     img_folder="/path/to/Challenge2_Test_Task12_Images",
...     label_folder="/path/to/Challenge2_Test_Task1_GT",
... )
>>> img, target = test_set[0]
Args:
img_folder: folder with all the images of the dataset
label_folder: folder with all annotation files for the images
use_polygons: whether polygons should be considered as rotated bounding boxes (instead of straight ones)
recognition_task: whether the dataset should be used for the recognition task
**kwargs: keyword arguments from AbstractDataset.
- class doctr.datasets.IMGUR5K(img_folder: str, label_path: str, train: bool = True, use_polygons: bool = False, recognition_task: bool = False, **kwargs: Any)
IMGUR5K dataset from “TextStyleBrush: Transfer of Text Aesthetics from a Single Example” | repository.
>>> # NOTE: You need to download/generate the dataset from the repository.
>>> from doctr.datasets import IMGUR5K
>>> train_set = IMGUR5K(
...     train=True,
...     img_folder="/path/to/IMGUR5K-Handwriting-Dataset/images",
...     label_path="/path/to/IMGUR5K-Handwriting-Dataset/dataset_info/imgur5k_annotations.json",
... )
>>> img, target = train_set[0]
>>> test_set = IMGUR5K(
...     train=False,
...     img_folder="/path/to/IMGUR5K-Handwriting-Dataset/images",
...     label_path="/path/to/IMGUR5K-Handwriting-Dataset/dataset_info/imgur5k_annotations.json",
... )
>>> img, target = test_set[0]
Args:
img_folder: folder with all the images of the dataset
label_path: path to the annotations file of the dataset
train: whether the subset should be the training one
use_polygons: whether polygons should be considered as rotated bounding boxes (instead of straight ones)
recognition_task: whether the dataset should be used for the recognition task
**kwargs: keyword arguments from AbstractDataset.
- class doctr.datasets.MJSynth(img_folder: str, label_path: str, train: bool = True, **kwargs: Any)
MJSynth dataset from “Synthetic Data and Artificial Neural Networks for Natural Scene Text Recognition”.
>>> # NOTE: This is a pure recognition dataset without bounding box labels.
>>> # NOTE: You need to download the dataset.
>>> from doctr.datasets import MJSynth
>>> train_set = MJSynth(
...     img_folder="/path/to/mjsynth/mnt/ramdisk/max/90kDICT32px",
...     label_path="/path/to/mjsynth/mnt/ramdisk/max/90kDICT32px/imlist.txt",
...     train=True,
... )
>>> img, target = train_set[0]
>>> test_set = MJSynth(
...     img_folder="/path/to/mjsynth/mnt/ramdisk/max/90kDICT32px",
...     label_path="/path/to/mjsynth/mnt/ramdisk/max/90kDICT32px/imlist.txt",
...     train=False,
... )
>>> img, target = test_set[0]
Args:
img_folder: folder with all the images of the dataset
label_path: path to the file with the labels
train: whether the subset should be the training one
**kwargs: keyword arguments from AbstractDataset.
- class doctr.datasets.IIITHWS(img_folder: str, label_path: str, train: bool = True, **kwargs: Any)
IIITHWS dataset from “Generating Synthetic Data for Text Recognition” | “repository” | “website”.
>>> # NOTE: This is a pure recognition dataset without bounding box labels.
>>> # NOTE: You need to download the dataset.
>>> from doctr.datasets import IIITHWS
>>> train_set = IIITHWS(
...     img_folder="/path/to/iiit-hws/Images_90K_Normalized",
...     label_path="/path/to/IIIT-HWS-90K.txt",
...     train=True,
... )
>>> img, target = train_set[0]
>>> test_set = IIITHWS(
...     img_folder="/path/to/iiit-hws/Images_90K_Normalized",
...     label_path="/path/to/IIIT-HWS-90K.txt",
...     train=False,
... )
>>> img, target = test_set[0]
Args:
img_folder: folder with all the images of the dataset
label_path: path to the file with the labels
train: whether the subset should be the training one
**kwargs: keyword arguments from AbstractDataset.
- class doctr.datasets.DocArtefacts(train: bool = True, use_polygons: bool = False, **kwargs: Any)
Object detection dataset for non-textual elements in documents. The dataset includes a variety of synthetic document pages with non-textual elements.
>>> from doctr.datasets import DocArtefacts
>>> train_set = DocArtefacts(train=True, download=True)
>>> img, target = train_set[0]
Args:
train: whether the subset should be the training one
use_polygons: whether polygons should be considered as rotated bounding boxes (instead of straight ones)
**kwargs: keyword arguments from VisionDataset.
- class doctr.datasets.WILDRECEIPT(img_folder: str, label_path: str, train: bool = True, use_polygons: bool = False, recognition_task: bool = False, **kwargs: Any)
WildReceipt dataset from “Spatial Dual-Modality Graph Reasoning for Key Information Extraction”.
>>> # NOTE: You need to download the dataset first.
>>> from doctr.datasets import WILDRECEIPT
>>> train_set = WILDRECEIPT(
...     train=True,
...     img_folder="/path/to/wildreceipt/",
...     label_path="/path/to/wildreceipt/train.txt",
... )
>>> img, target = train_set[0]
>>> test_set = WILDRECEIPT(
...     train=False,
...     img_folder="/path/to/wildreceipt/",
...     label_path="/path/to/wildreceipt/test.txt",
... )
>>> img, target = test_set[0]
Args:
img_folder: folder with all the images of the dataset
label_path: path to the annotations file of the dataset
train: whether the subset should be the training one
use_polygons: whether polygons should be considered as rotated bounding boxes (instead of straight ones)
recognition_task: whether the dataset should be used for the recognition task
**kwargs: keyword arguments from AbstractDataset.
Synthetic dataset generator
- class doctr.datasets.CharacterGenerator(*args, **kwargs)
Implements a character image generation dataset
>>> from doctr.datasets import CharacterGenerator
>>> ds = CharacterGenerator(vocab='abdef', num_samples=100)
>>> img, target = ds[0]
Args:
vocab: vocabulary to draw the characters from
num_samples: number of samples that will be generated when iterating over the dataset
cache_samples: whether generated images should be cached beforehand
font_family: font to use to generate the text images
img_transforms: composable transformations that will be applied to each image
sample_transforms: composable transformations that will be applied to both the image and the target
- class doctr.datasets.WordGenerator(vocab: str, min_chars: int, max_chars: int, num_samples: int, cache_samples: bool = False, font_family: str | List[str] | None = None, img_transforms: Callable[[Any], Any] | None = None, sample_transforms: Callable[[Any, Any], Tuple[Any, Any]] | None = None)
Implements a word image generation dataset
>>> from doctr.datasets import WordGenerator
>>> ds = WordGenerator(vocab='abdef', min_chars=1, max_chars=32, num_samples=100)
>>> img, target = ds[0]
Args:
vocab: vocabulary to draw the characters from
min_chars: minimum number of characters in a word
max_chars: maximum number of characters in a word
num_samples: number of samples that will be generated when iterating over the dataset
cache_samples: whether generated images should be cached beforehand
font_family: font to use to generate the text images
img_transforms: composable transformations that will be applied to each image
sample_transforms: composable transformations that will be applied to both the image and the target
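The roles of vocab, min_chars and max_chars can be illustrated with a small text-only sketch. It only draws random strings and does not render images; the function name sample_word and the uniform sampling strategy are assumptions made for this example, and docTR's own generator may sample differently.

```python
import random


def sample_word(vocab: str, min_chars: int, max_chars: int, rng: random.Random) -> str:
    """Draw a random word whose length lies in [min_chars, max_chars].

    Hypothetical helper: characters are picked uniformly from `vocab`.
    """
    length = rng.randint(min_chars, max_chars)  # both bounds are inclusive
    return "".join(rng.choice(vocab) for _ in range(length))


rng = random.Random(0)  # fixed seed so the run is reproducible
words = [sample_word("abdef", min_chars=1, max_chars=32, rng=rng) for _ in range(100)]
print(words[:3])
```

Each generated string would play the role of the target that accompanies a rendered word image.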
Custom dataset loader
- class doctr.datasets.DetectionDataset(img_folder: str, label_path: str, use_polygons: bool = False, **kwargs: Any)
Implements a text detection dataset
>>> from doctr.datasets import DetectionDataset
>>> train_set = DetectionDataset(
...     img_folder="/path/to/images",
...     label_path="/path/to/labels.json",
... )
>>> img, target = train_set[0]
Args:
img_folder: folder with all the images of the dataset
label_path: path to the annotations of each image
use_polygons: whether polygons should be considered as rotated bounding boxes (instead of straight ones)
**kwargs: keyword arguments from AbstractDataset.
- class doctr.datasets.RecognitionDataset(img_folder: str, labels_path: str, **kwargs: Any)
Dataset implementation for text recognition tasks
>>> from doctr.datasets import RecognitionDataset
>>> train_set = RecognitionDataset(
...     img_folder="/path/to/images",
...     labels_path="/path/to/labels.json",
... )
>>> img, target = train_set[0]
Args:
img_folder: path to the images folder
labels_path: path to the JSON file containing all labels (character sequences)
**kwargs: keyword arguments from AbstractDataset.
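The reference above does not spell out the layout of the labels file. The sketch below writes one under the assumption that it is a flat JSON object mapping each image file name (relative to img_folder) to its character sequence. That layout and the file names are assumptions to check against your docTR version.

```python
import json
import tempfile
from pathlib import Path

# Assumed layout: {"<image file name>": "<transcription>"}
labels = {
    "img_0001.jpg": "hello",
    "img_0002.jpg": "world",
}

with tempfile.TemporaryDirectory() as tmp_dir:
    labels_path = Path(tmp_dir) / "labels.json"
    # ensure_ascii=False keeps non-ASCII transcriptions readable in the file
    labels_path.write_text(json.dumps(labels, ensure_ascii=False), encoding="utf-8")

    # Read the file back to confirm the round trip
    loaded = json.loads(labels_path.read_text(encoding="utf-8"))

print(loaded)
```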
- class doctr.datasets.OCRDataset(img_folder: str, label_file: str, use_polygons: bool = False, **kwargs: Any)
Implements an OCR dataset
>>> from doctr.datasets import OCRDataset
>>> train_set = OCRDataset(
...     img_folder="/path/to/images",
...     label_file="/path/to/labels.json",
... )
>>> img, target = train_set[0]
Args:
img_folder: local path to image folder (all jpg at the root)
label_file: local path to the label file
use_polygons: whether polygons should be considered as rotated bounding boxes (instead of straight ones)
**kwargs: keyword arguments from AbstractDataset.
Dataloader
- class doctr.datasets.loader.DataLoader(dataset, shuffle: bool = True, batch_size: int = 1, drop_last: bool = False, collate_fn: Callable | None = None)
Implements a dataset wrapper for fast data loading
>>> from doctr.datasets import CORD, DataLoader
>>> train_set = CORD(train=True, download=True)
>>> train_loader = DataLoader(train_set, batch_size=32)
>>> train_iter = iter(train_loader)
>>> images, targets = next(train_iter)
Args:
dataset: the dataset
shuffle: whether the samples should be shuffled before passing them to the iterator
batch_size: number of elements in each batch
drop_last: if True, drops the last batch if it isn’t full
collate_fn: function to merge samples into a batch
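How shuffle, batch_size, drop_last and collate_fn interact can be sketched in plain Python. This is not docTR's implementation: the function iterate_batches, its seed parameter and the toy dataset are assumptions made for illustration.

```python
import random
from typing import Any, Callable, Iterator, List, Optional, Sequence


def iterate_batches(
    dataset: Sequence[Any],
    shuffle: bool = True,
    batch_size: int = 1,
    drop_last: bool = False,
    collate_fn: Optional[Callable[[List[Any]], Any]] = None,
    seed: Optional[int] = None,  # hypothetical extra, only for reproducibility
) -> Iterator[Any]:
    """Yield batches of samples; a simplified stand-in for a data loader."""
    indices = list(range(len(dataset)))
    if shuffle:
        random.Random(seed).shuffle(indices)
    for start in range(0, len(indices), batch_size):
        batch_indices = indices[start:start + batch_size]
        # Only the final batch can be short; skip it when drop_last is set
        if drop_last and len(batch_indices) < batch_size:
            break
        samples = [dataset[i] for i in batch_indices]
        yield collate_fn(samples) if collate_fn is not None else samples


def unzip(samples: List[Any]) -> Any:
    """Toy collate function: turn [(img, target), ...] into (imgs, targets)."""
    images, targets = zip(*samples)
    return list(images), list(targets)


toy = [(f"img{i}", f"label{i}") for i in range(10)]
batches = list(iterate_batches(toy, shuffle=False, batch_size=4, drop_last=True, collate_fn=unzip))
print(len(batches))
```

With 10 samples and batch_size=4, drop_last=True discards the final batch of 2 samples, while drop_last=False would yield it as a shorter batch.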
Supported Vocabs
Since textual content has to be encoded properly for models to interpret it efficiently, docTR supports multiple vocabs (ordered character sets).
Name | size | characters |
---|---|---|
digits | 10 | 0123456789 |
hindi_digits | 10 | ٠١٢٣٤٥٦٧٨٩ |
ascii_letters | 52 | abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ |
punctuation | 32 | !"#$%&'()*+,-./:;<=>?@[\]^_`{|}~ |
currency | 5 | £€¥¢฿ |
ancient_greek | 48 | αβγδεζηθικλμνξοπρστυφχψωΑΒΓΔΕΖΗΘΙΚΛΜΝΞΟΠΡΣΤΥΦΧΨΩ |
arabic_letters | 37 | ءآأؤإئابةتثجحخدذرزسشصضطظعغـفقكلمنهوىي |
persian_letters | 5 | پچڢڤگ |
arabic_diacritics | 2 | ‘ًٌٍَُِّْ’ |
arabic_punctuation | 5 | ؟؛«»— |
latin | 94 | 0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~ |
english | 100 | 0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~°£€¥¢฿ |
legacy_french | 123 | 0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~°àâéèêëîïôùûçÀÂÉÈËÎÏÔÙÛÇ£€¥¢฿ |
french | 126 | 0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~°£€¥¢฿àâéèêëîïôùûüçÀÂÉÈÊËÎÏÔÙÛÜÇ |
portuguese | 131 | 0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~°£€¥¢฿áàâãéêëíïóôõúüçÁÀÂÃÉËÍÏÓÔÕÚÜÇ¡¿ |
spanish | 116 | 0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~°£€¥¢฿áéíóúüñÁÉÍÓÚÜÑ¡¿ |
italian | 120 | 0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~°£€¥¢฿àèéìíîòóùúÀÈÉÌÍÎÒÓÙÚ |
german | 108 | 0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~°£€¥¢฿äöüßÄÖÜẞ |
arabic | 101 | ءآأؤإئابةتثجحخدذرزسشصضطظعغـفقكلمنهوىيپچڢڤگ؟؛«»—0123456789٠١٢٣٤٥٦٧٨٩’ًٌٍَُِّْ’!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~ |
czech | 130 | 0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~°£€¥¢฿áčďéěíňóřšťúůýžÁČĎÉĚÍŇÓŘŠŤÚŮÝŽ |
polish | 118 | 0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~°£€¥¢฿ąćęłńóśźżĄĆĘŁŃÓŚŹŻ |
dutch | 114 | 0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~°£€¥¢฿áéíóúüñÁÉÍÓÚÜÑ |
norwegian | 106 | 0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~°£€¥¢฿æøåÆØÅ |
danish | 106 | 0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~°£€¥¢฿æøåÆØÅ |
finnish | 104 | 0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~°£€¥¢฿äöÄÖ |
swedish | 106 | 0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~°£€¥¢฿åäöÅÄÖ |
vietnamese | 234 | 0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~°£€¥¢฿áàảạãăắằẳẵặâấầẩẫậéèẻẽẹêếềểễệóòỏõọôốồổộỗơớờởợỡúùủũụưứừửữựiíìỉĩịýỳỷỹỵÁÀẢẠÃĂẮẰẲẴẶÂẤẦẨẪẬÉÈẺẼẸÊẾỀỂỄỆÓÒỎÕỌÔỐỒỔỘỖƠỚỜỞỢỠÚÙỦŨỤƯỨỪỬỮỰIÍÌỈĨỊÝỲỶỸỴ |
hebrew | 123 | 0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~°£€¥¢฿אבגדהוזחטיכלמנסעפצקרשת₪ |
hindi | 71 | अआइईउऊऋॠऌॡएऐओऔअंअःकखगघङचछजझञटठडढणतथदधनपफबभमयरलवशषसह०१२३४५६७८९।,?!:्ॐ॰॥॰ |
bangla | 70 | অআইঈউঊঋএঐওঔকখগঘঙচছজঝঞটঠডঢণতথদধনপফবভমযরলশষসহ়ঽািীুূৃেৈোৌ্ৎংঃঁ০১২৩৪৫৬৭৮৯ |
multilingual | 195 | english & french & german & italian & spanish & portuguese & czech & polish & dutch & norwegian & danish & finnish & swedish & § |
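The sizes in the table are consistent with the larger vocabs being concatenations of the smaller ones. The sketch below rebuilds latin and english from Python's string module to show that arithmetic. It assumes plain concatenation in the order shown in the table and does not read anything from docTR itself.

```python
import string

# Building blocks, matching the first rows of the table
digits = string.digits                 # 10 characters
ascii_letters = string.ascii_letters   # 52 characters, lowercase then uppercase
punctuation = string.punctuation       # 32 characters
currency = "£€¥¢฿"                     # 5 characters

# Assumed composition: plain concatenation in table order
latin = digits + ascii_letters + punctuation
english = latin + "°" + currency

print(len(latin), len(english))
```

The order of characters matters because each character's position in the vocab string is what an encoder would use as its index.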
- doctr.datasets.encode_sequences(sequences: List[str], vocab: str, target_size: int | None = None, eos: int = -1, sos: int | None = None, pad: int | None = None, dynamic_seq_length: bool = False) → ndarray
Encode character sequences using a given vocab as mapping
Args:
sequences: the list of character sequences of size N
vocab: the ordered vocab to use for encoding
target_size: maximum length of the encoded data
eos: encoding of End Of String
sos: optional encoding of Start Of String
pad: optional encoding for padding. In case of padding, all sequences are followed by 1 EOS then PAD
dynamic_seq_length: if target_size is specified, uses it as upper bound and enables dynamic sequence size
Returns:
the padded encoded data as an array
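The encoding described above can be sketched with plain lists. This is a simplified illustration rather than docTR's implementation: the function name encode_sequences_sketch is hypothetical, it returns nested lists instead of an array, and it leaves out sos and dynamic_seq_length.

```python
from typing import List, Optional


def encode_sequences_sketch(
    sequences: List[str],
    vocab: str,
    target_size: Optional[int] = None,
    eos: int = -1,
    pad: Optional[int] = None,
) -> List[List[int]]:
    """Map characters to vocab indices and pad every row to the same length."""
    char_to_idx = {char: idx for idx, char in enumerate(vocab)}
    if target_size is None:
        # Longest sequence, plus one slot for EOS when padding is enabled
        target_size = max(len(seq) for seq in sequences) + (1 if pad is not None else 0)
    fill = eos if pad is None else pad
    encoded = []
    for seq in sequences:
        ids = [char_to_idx[char] for char in seq]  # KeyError if char is not in vocab
        if pad is not None:
            ids.append(eos)  # with padding: one EOS, then PAD
        ids = ids[:target_size]  # truncate sequences that are too long
        encoded.append(ids + [fill] * (target_size - len(ids)))
    return encoded


no_pad = encode_sequences_sketch(["ab", "cab"], vocab="abc")
with_pad = encode_sequences_sketch(["ab", "cab"], vocab="abc", eos=3, pad=4)
print(no_pad)
print(with_pad)
```

In the second call, eos and pad are set to indices outside the vocab range (3 and 4 for a 3-character vocab) so they cannot collide with real characters.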