doctr.datasets#
- class doctr.datasets.FUNSD(train: bool = True, use_polygons: bool = False, recognition_task: bool = False, **kwargs: Any)[source]#
FUNSD dataset from “FUNSD: A Dataset for Form Understanding in Noisy Scanned Documents”.
>>> from doctr.datasets import FUNSD
>>> train_set = FUNSD(train=True, download=True)
>>> img, target = train_set[0]
- Parameters:
train – whether the subset should be the training one
use_polygons – whether polygons should be considered as rotated bounding boxes (instead of straight ones)
recognition_task – whether the dataset should be used for a recognition task
**kwargs – keyword arguments from VisionDataset.
- class doctr.datasets.SROIE(train: bool = True, use_polygons: bool = False, recognition_task: bool = False, **kwargs: Any)[source]#
SROIE dataset from “ICDAR2019 Competition on Scanned Receipt OCR and Information Extraction”.
>>> from doctr.datasets import SROIE
>>> train_set = SROIE(train=True, download=True)
>>> img, target = train_set[0]
- Parameters:
train – whether the subset should be the training one
use_polygons – whether polygons should be considered as rotated bounding boxes (instead of straight ones)
recognition_task – whether the dataset should be used for a recognition task
**kwargs – keyword arguments from VisionDataset.
- class doctr.datasets.CORD(train: bool = True, use_polygons: bool = False, recognition_task: bool = False, **kwargs: Any)[source]#
CORD dataset from “CORD: A Consolidated Receipt Dataset for Post-OCR Parsing”.
>>> from doctr.datasets import CORD
>>> train_set = CORD(train=True, download=True)
>>> img, target = train_set[0]
- Parameters:
train – whether the subset should be the training one
use_polygons – whether polygons should be considered as rotated bounding boxes (instead of straight ones)
recognition_task – whether the dataset should be used for a recognition task
**kwargs – keyword arguments from VisionDataset.
- class doctr.datasets.IIIT5K(train: bool = True, use_polygons: bool = False, recognition_task: bool = False, **kwargs: Any)[source]#
IIIT-5K character-level localization dataset from “BMVC 2012 Scene Text Recognition using Higher Order Language Priors”.
>>> # NOTE: this dataset is for character-level localization
>>> from doctr.datasets import IIIT5K
>>> train_set = IIIT5K(train=True, download=True)
>>> img, target = train_set[0]
- Parameters:
train – whether the subset should be the training one
use_polygons – whether polygons should be considered as rotated bounding boxes (instead of straight ones)
recognition_task – whether the dataset should be used for a recognition task
**kwargs – keyword arguments from VisionDataset.
- class doctr.datasets.SVT(train: bool = True, use_polygons: bool = False, recognition_task: bool = False, **kwargs: Any)[source]#
SVT dataset from “The Street View Text Dataset - UCSD Computer Vision”.
>>> from doctr.datasets import SVT
>>> train_set = SVT(train=True, download=True)
>>> img, target = train_set[0]
- Parameters:
train – whether the subset should be the training one
use_polygons – whether polygons should be considered as rotated bounding boxes (instead of straight ones)
recognition_task – whether the dataset should be used for a recognition task
**kwargs – keyword arguments from VisionDataset.
- class doctr.datasets.SVHN(train: bool = True, use_polygons: bool = False, recognition_task: bool = False, **kwargs: Any)[source]#
SVHN dataset from “The Street View House Numbers (SVHN) Dataset”.
>>> from doctr.datasets import SVHN
>>> train_set = SVHN(train=True, download=True)
>>> img, target = train_set[0]
- Parameters:
train – whether the subset should be the training one
use_polygons – whether polygons should be considered as rotated bounding boxes (instead of straight ones)
recognition_task – whether the dataset should be used for a recognition task
**kwargs – keyword arguments from VisionDataset.
- class doctr.datasets.SynthText(train: bool = True, use_polygons: bool = False, recognition_task: bool = False, **kwargs: Any)[source]#
SynthText dataset from “Synthetic Data for Text Localisation in Natural Images” | “repository” | “website”.
>>> from doctr.datasets import SynthText
>>> train_set = SynthText(train=True, download=True)
>>> img, target = train_set[0]
- Parameters:
train – whether the subset should be the training one
use_polygons – whether polygons should be considered as rotated bounding boxes (instead of straight ones)
recognition_task – whether the dataset should be used for a recognition task
**kwargs – keyword arguments from VisionDataset.
- class doctr.datasets.IC03(train: bool = True, use_polygons: bool = False, recognition_task: bool = False, **kwargs: Any)[source]#
IC03 dataset from “ICDAR 2003 Robust Reading Competitions: Entries, Results and Future Directions”.
>>> from doctr.datasets import IC03
>>> train_set = IC03(train=True, download=True)
>>> img, target = train_set[0]
- Parameters:
train – whether the subset should be the training one
use_polygons – whether polygons should be considered as rotated bounding boxes (instead of straight ones)
recognition_task – whether the dataset should be used for a recognition task
**kwargs – keyword arguments from VisionDataset.
- class doctr.datasets.IC13(img_folder: str, label_folder: str, use_polygons: bool = False, recognition_task: bool = False, **kwargs: Any)[source]#
IC13 dataset from “ICDAR 2013 Robust Reading Competition”.
>>> # NOTE: You need to download both image and label parts from Focused Scene Text challenge Task2.1 2013-2015.
>>> from doctr.datasets import IC13
>>> train_set = IC13(img_folder="/path/to/Challenge2_Training_Task12_Images",
>>>                  label_folder="/path/to/Challenge2_Training_Task1_GT")
>>> img, target = train_set[0]
>>> test_set = IC13(img_folder="/path/to/Challenge2_Test_Task12_Images",
>>>                 label_folder="/path/to/Challenge2_Test_Task1_GT")
>>> img, target = test_set[0]
- Parameters:
img_folder – folder with all the images of the dataset
label_folder – folder with all annotation files for the images
use_polygons – whether polygons should be considered as rotated bounding boxes (instead of straight ones)
recognition_task – whether the dataset should be used for a recognition task
**kwargs – keyword arguments from AbstractDataset.
- class doctr.datasets.IMGUR5K(img_folder: str, label_path: str, train: bool = True, use_polygons: bool = False, recognition_task: bool = False, **kwargs: Any)[source]#
IMGUR5K dataset from “TextStyleBrush: Transfer of Text Aesthetics from a Single Example” | repository.
>>> # NOTE: You need to download/generate the dataset from the repository.
>>> from doctr.datasets import IMGUR5K
>>> train_set = IMGUR5K(train=True, img_folder="/path/to/IMGUR5K-Handwriting-Dataset/images",
>>>                     label_path="/path/to/IMGUR5K-Handwriting-Dataset/dataset_info/imgur5k_annotations.json")
>>> img, target = train_set[0]
>>> test_set = IMGUR5K(train=False, img_folder="/path/to/IMGUR5K-Handwriting-Dataset/images",
>>>                    label_path="/path/to/IMGUR5K-Handwriting-Dataset/dataset_info/imgur5k_annotations.json")
>>> img, target = test_set[0]
- Parameters:
img_folder – folder with all the images of the dataset
label_path – path to the annotations file of the dataset
train – whether the subset should be the training one
use_polygons – whether polygons should be considered as rotated bounding boxes (instead of straight ones)
recognition_task – whether the dataset should be used for a recognition task
**kwargs – keyword arguments from AbstractDataset.
- class doctr.datasets.MJSynth(img_folder: str, label_path: str, train: bool = True, **kwargs: Any)[source]#
MJSynth dataset from “Synthetic Data and Artificial Neural Networks for Natural Scene Text Recognition”.
>>> # NOTE: This is a pure recognition dataset without bounding box labels.
>>> # NOTE: You need to download the dataset.
>>> from doctr.datasets import MJSynth
>>> train_set = MJSynth(img_folder="/path/to/mjsynth/mnt/ramdisk/max/90kDICT32px",
>>>                     label_path="/path/to/mjsynth/mnt/ramdisk/max/90kDICT32px/imlist.txt",
>>>                     train=True)
>>> img, target = train_set[0]
>>> test_set = MJSynth(img_folder="/path/to/mjsynth/mnt/ramdisk/max/90kDICT32px",
>>>                    label_path="/path/to/mjsynth/mnt/ramdisk/max/90kDICT32px/imlist.txt",
>>>                    train=False)
>>> img, target = test_set[0]
- Parameters:
img_folder – folder with all the images of the dataset
label_path – path to the file with the labels
train – whether the subset should be the training one
**kwargs – keyword arguments from AbstractDataset.
- class doctr.datasets.IIITHWS(img_folder: str, label_path: str, train: bool = True, **kwargs: Any)[source]#
IIITHWS dataset from “Generating Synthetic Data for Text Recognition” | “repository” | “website”.
>>> # NOTE: This is a pure recognition dataset without bounding box labels.
>>> # NOTE: You need to download the dataset.
>>> from doctr.datasets import IIITHWS
>>> train_set = IIITHWS(img_folder="/path/to/iiit-hws/Images_90K_Normalized",
>>>                     label_path="/path/to/IIIT-HWS-90K.txt",
>>>                     train=True)
>>> img, target = train_set[0]
>>> test_set = IIITHWS(img_folder="/path/to/iiit-hws/Images_90K_Normalized",
>>>                    label_path="/path/to/IIIT-HWS-90K.txt",
>>>                    train=False)
>>> img, target = test_set[0]
- Parameters:
img_folder – folder with all the images of the dataset
label_path – path to the file with the labels
train – whether the subset should be the training one
**kwargs – keyword arguments from AbstractDataset.
- class doctr.datasets.DocArtefacts(train: bool = True, use_polygons: bool = False, **kwargs: Any)[source]#
Object detection dataset for non-textual elements in documents. The dataset includes a variety of synthetic document pages with non-textual elements.
>>> from doctr.datasets import DocArtefacts
>>> train_set = DocArtefacts(train=True, download=True)
>>> img, target = train_set[0]
- Parameters:
train – whether the subset should be the training one
use_polygons – whether polygons should be considered as rotated bounding boxes (instead of straight ones)
**kwargs – keyword arguments from VisionDataset.
Synthetic dataset generator#
- class doctr.datasets.CharacterGenerator(*args, **kwargs)[source]#
Implements a character image generation dataset
>>> from doctr.datasets import CharacterGenerator
>>> ds = CharacterGenerator(vocab='abdef', num_samples=100)
>>> img, target = ds[0]
- Parameters:
vocab – vocabulary to take the characters from
num_samples – number of samples that will be generated iterating over the dataset
cache_samples – whether generated images should be cached firsthand
font_family – font to use to generate the text images
img_transforms – composable transformations that will be applied to each image
sample_transforms – composable transformations that will be applied to both the image and the target
- class doctr.datasets.WordGenerator(vocab: str, min_chars: int, max_chars: int, num_samples: int, cache_samples: bool = False, font_family: str | List[str] | None = None, img_transforms: Callable[[Any], Any] | None = None, sample_transforms: Callable[[Any, Any], Tuple[Any, Any]] | None = None)[source]#
Implements a word image generation dataset
>>> from doctr.datasets import WordGenerator
>>> ds = WordGenerator(vocab='abdef', min_chars=1, max_chars=32, num_samples=100)
>>> img, target = ds[0]
- Parameters:
vocab – vocabulary to take the characters from
min_chars – minimum number of characters in a word
max_chars – maximum number of characters in a word
num_samples – number of samples that will be generated iterating over the dataset
cache_samples – whether generated images should be cached firsthand
font_family – font to use to generate the text images
img_transforms – composable transformations that will be applied to each image
sample_transforms – composable transformations that will be applied to both the image and the target
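The word sampling implied by vocab, min_chars and max_chars can be sketched as follows. This is a hypothetical illustration of the sampling step only, not doctr's actual implementation, which additionally renders each sampled word to an image using the configured fonts:

```python
import random

def sample_word(vocab: str, min_chars: int, max_chars: int, rng=random) -> str:
    """Draw a random word: pick a length in [min_chars, max_chars],
    then draw each character uniformly from the vocab."""
    length = rng.randint(min_chars, max_chars)
    return "".join(rng.choice(vocab) for _ in range(length))

word = sample_word("abdef", min_chars=1, max_chars=32)
# every generated word stays within the requested bounds and vocab
assert 1 <= len(word) <= 32 and set(word) <= set("abdef")
```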
Custom dataset loader#
- class doctr.datasets.DetectionDataset(img_folder: str, label_path: str, use_polygons: bool = False, **kwargs: Any)[source]#
Implements a text detection dataset
>>> from doctr.datasets import DetectionDataset
>>> train_set = DetectionDataset(img_folder="/path/to/images",
>>>                              label_path="/path/to/labels.json")
>>> img, target = train_set[0]
- Parameters:
img_folder – folder with all the images of the dataset
label_path – path to the annotations of each image
use_polygons – whether polygons should be considered as rotated bounding boxes (instead of straight ones)
**kwargs – keyword arguments from AbstractDataset.
- class doctr.datasets.RecognitionDataset(img_folder: str, labels_path: str, **kwargs: Any)[source]#
Dataset implementation for text recognition tasks
>>> from doctr.datasets import RecognitionDataset
>>> train_set = RecognitionDataset(img_folder="/path/to/images",
>>>                                labels_path="/path/to/labels.json")
>>> img, target = train_set[0]
- Parameters:
img_folder – path to the images folder
labels_path – path to the JSON file containing all labels (character sequences)
**kwargs – keyword arguments from AbstractDataset.
- class doctr.datasets.OCRDataset(img_folder: str, label_file: str, use_polygons: bool = False, **kwargs: Any)[source]#
Implements an OCR dataset
>>> from doctr.datasets import OCRDataset
>>> train_set = OCRDataset(img_folder="/path/to/images",
>>>                        label_file="/path/to/labels.json")
>>> img, target = train_set[0]
- Parameters:
img_folder – local path to image folder (all jpg at the root)
label_file – local path to the label file
use_polygons – whether polygons should be considered as rotated bounding boxes (instead of straight ones)
**kwargs – keyword arguments from AbstractDataset.
Dataloader#
- class doctr.datasets.loader.DataLoader(dataset, shuffle: bool = True, batch_size: int = 1, drop_last: bool = False, num_workers: int | None = None, collate_fn: Callable | None = None)[source]#
Implements a dataset wrapper for fast data loading
>>> from doctr.datasets import CORD, DataLoader
>>> train_set = CORD(train=True, download=True)
>>> train_loader = DataLoader(train_set, batch_size=32)
>>> train_iter = iter(train_loader)
>>> images, targets = next(train_iter)
- Parameters:
dataset – the dataset
shuffle – whether the samples should be shuffled before being passed to the iterator
batch_size – number of elements in each batch
drop_last – if True, drops the last batch if it isn’t full
num_workers – number of workers to use for data loading
collate_fn – function to merge samples into a batch
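The batching semantics of shuffle, batch_size and drop_last can be sketched as follows. This is a simplified stand-in to illustrate the behaviour described above, not doctr's actual DataLoader, which also handles workers and sample collation:

```python
import random

def batches(dataset, batch_size=1, shuffle=False, drop_last=False, seed=None):
    """Yield lists of samples of size batch_size, optionally shuffled;
    with drop_last=True the final incomplete batch is skipped."""
    indices = list(range(len(dataset)))
    if shuffle:
        random.Random(seed).shuffle(indices)
    for start in range(0, len(indices), batch_size):
        chunk = indices[start:start + batch_size]
        if drop_last and len(chunk) < batch_size:
            break  # drop the last batch if it isn't full
        yield [dataset[i] for i in chunk]

data = list("abcde")
print(list(batches(data, batch_size=2)))
# -> [['a', 'b'], ['c', 'd'], ['e']]
print(list(batches(data, batch_size=2, drop_last=True)))
# -> [['a', 'b'], ['c', 'd']]  (the partial batch is dropped)
```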
Supported Vocabs#
Since textual content has to be encoded properly for models to interpret it efficiently, docTR supports multiple sets of vocabs.
| Name | Size | Characters |
|---|---|---|
| digits | 10 | 0123456789 |
| hindi_digits | 10 | ٠١٢٣٤٥٦٧٨٩ |
| ascii_letters | 52 | abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ |
| punctuation | 32 | !"#$%&'()*+,-./:;<=>?@[\]^_`{\|}~ |
| currency | 5 | £€¥¢฿ |
| ancient_greek | 48 | αβγδεζηθικλμνξοπρστυφχψωΑΒΓΔΕΖΗΘΙΚΛΜΝΞΟΠΡΣΤΥΦΧΨΩ |
| arabic_letters | 37 | ءآأؤإئابةتثجحخدذرزسشصضطظعغـفقكلمنهوىي |
| persian_letters | 5 | پچڢڤگ |
| arabic_diacritics | 2 | ‘ًٌٍَُِّْ’ |
| arabic_punctuation | 5 | ؟؛«»— |
| latin | 94 | 0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ!"#$%&'()*+,-./:;<=>?@[\]^_`{\|}~ |
| english | 100 | 0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ!"#$%&'()*+,-./:;<=>?@[\]^_`{\|}~°£€¥¢฿ |
| legacy_french | 123 | 0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ!"#$%&'()*+,-./:;<=>?@[\]^_`{\|}~°àâéèêëîïôùûçÀÂÉÈËÎÏÔÙÛÇ£€¥¢฿ |
| french | 126 | 0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ!"#$%&'()*+,-./:;<=>?@[\]^_`{\|}~°£€¥¢฿àâéèêëîïôùûüçÀÂÉÈÊËÎÏÔÙÛÜÇ |
| portuguese | 131 | 0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ!"#$%&'()*+,-./:;<=>?@[\]^_`{\|}~°£€¥¢฿áàâãéêëíïóôõúüçÁÀÂÃÉËÍÏÓÔÕÚÜÇ¡¿ |
| spanish | 116 | 0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ!"#$%&'()*+,-./:;<=>?@[\]^_`{\|}~°£€¥¢฿áéíóúüñÁÉÍÓÚÜÑ¡¿ |
| german | 108 | 0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ!"#$%&'()*+,-./:;<=>?@[\]^_`{\|}~°£€¥¢฿äöüßÄÖÜẞ |
| arabic | 101 | ءآأؤإئابةتثجحخدذرزسشصضطظعغـفقكلمنهوىيپچڢڤگ؟؛«»—0123456789٠١٢٣٤٥٦٧٨٩’ًٌٍَُِّْ’!"#$%&'()*+,-./:;<=>?@[\]^_`{\|}~ |
| czech | 130 | 0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ!"#$%&'()*+,-./:;<=>?@[\]^_`{\|}~°£€¥¢฿áčďéěíňóřšťúůýžÁČĎÉĚÍŇÓŘŠŤÚŮÝŽ |
| vietnamese | 234 | 0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ!"#$%&'()*+,-./:;<=>?@[\]^_`{\|}~°£€¥¢฿áàảạãăắằẳẵặâấầẩẫậéèẻẽẹêếềểễệóòỏõọôốồổộỗơớờởợỡúùủũụưứừửữựiíìỉĩịýỳỷỹỵÁÀẢẠÃĂẮẰẲẴẶÂẤẦẨẪẬÉÈẺẼẸÊẾỀỂỄỆÓÒỎÕỌÔỐỒỔỘỖƠỚỜỞỢỠÚÙỦŨỤƯỨỪỬỮỰIÍÌỈĨỊÝỲỶỸỴ |
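Several of these vocabs are compositions of the base sets. A small sketch, assuming the base sets match Python's string constants (the authoritative strings are exposed by the library, e.g. via doctr.datasets.VOCABS):

```python
import string

# Base sets, assumed to mirror the table above.
digits = string.digits              # 10 characters
letters = string.ascii_letters      # 52 characters
punctuation = string.punctuation    # 32 characters
currency = "£€¥¢฿"                  # 5 characters

# Composed vocabs: sizes line up with the table (94 and 100).
latin = digits + letters + punctuation
english = latin + "°" + currency

assert len(latin) == 94     # 10 + 52 + 32
assert len(english) == 100  # 94 + 1 + 5
```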
- doctr.datasets.encode_sequences(sequences: List[str], vocab: str, target_size: int | None = None, eos: int = -1, sos: int | None = None, pad: int | None = None, dynamic_seq_length: bool = False, **kwargs: Any) ndarray [source]#
Encode character sequences using a given vocab as mapping
- Parameters:
sequences – the list of character sequences of size N
vocab – the ordered vocab to use for encoding
target_size – maximum length of the encoded data
eos – encoding of End Of String
sos – optional encoding of Start Of String
pad – optional encoding for padding. In case of padding, all sequences are followed by 1 EOS then PAD
dynamic_seq_length – if target_size is specified, uses it as upper bound and enables dynamic sequence size
- Returns:
the padded encoded data as a tensor
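The encoding scheme described above (character indices, followed by one EOS, then PAD) can be sketched in pure Python. This is an illustrative simplification with hypothetical eos/pad values, not doctr's implementation, which returns a NumPy ndarray and also supports sos and dynamic_seq_length:

```python
def encode_sequences(sequences, vocab, target_size=None, eos=-1, pad=None):
    """Map each character to its index in vocab, append EOS,
    then pad every row out to target_size with the pad code."""
    if target_size is None:
        target_size = max(len(s) for s in sequences) + 1  # room for EOS
    encoded = []
    for seq in sequences:
        codes = [vocab.index(c) for c in seq][: target_size - 1]
        codes.append(eos)  # every sequence is followed by one EOS
        if pad is not None:
            codes += [pad] * (target_size - len(codes))
        encoded.append(codes)
    return encoded

rows = encode_sequences(["cab", "a"], vocab="abc", target_size=5, eos=3, pad=4)
print(rows)
# -> [[2, 0, 1, 3, 4], [0, 3, 4, 4, 4]]
```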