doctr.models

doctr.models.classification

doctr.models.classification.vgg16_bn_r(pretrained: bool = False, **kwargs: Any) → VGG

VGG-16 architecture as described in “Very Deep Convolutional Networks for Large-Scale Image Recognition”, modified by adding batch normalization, rectangular pooling and a simpler classification head.

>>> import tensorflow as tf
>>> from doctr.models import vgg16_bn_r
>>> model = vgg16_bn_r(pretrained=False)
>>> input_tensor = tf.random.uniform(shape=[1, 512, 512, 3], maxval=1, dtype=tf.float32)
>>> out = model(input_tensor)
Parameters:

pretrained (bool) – If True, returns a model pre-trained on ImageNet

Returns:

VGG feature extractor

doctr.models.classification.resnet18(pretrained: bool = False, **kwargs: Any) → ResNet

ResNet-18 architecture as described in “Deep Residual Learning for Image Recognition”.

>>> import tensorflow as tf
>>> from doctr.models import resnet18
>>> model = resnet18(pretrained=False)
>>> input_tensor = tf.random.uniform(shape=[1, 512, 512, 3], maxval=1, dtype=tf.float32)
>>> out = model(input_tensor)
Parameters:

pretrained (bool) – If True, returns a pretrained model

Returns:

A classification model

doctr.models.classification.resnet34(pretrained: bool = False, **kwargs: Any) → ResNet

ResNet-34 architecture as described in “Deep Residual Learning for Image Recognition”.

>>> import tensorflow as tf
>>> from doctr.models import resnet34
>>> model = resnet34(pretrained=False)
>>> input_tensor = tf.random.uniform(shape=[1, 512, 512, 3], maxval=1, dtype=tf.float32)
>>> out = model(input_tensor)
Parameters:

pretrained (bool) – If True, returns a pretrained model

Returns:

A classification model

doctr.models.classification.resnet50(pretrained: bool = False, **kwargs: Any) → ResNet

ResNet-50 architecture as described in “Deep Residual Learning for Image Recognition”.

>>> import tensorflow as tf
>>> from doctr.models import resnet50
>>> model = resnet50(pretrained=False)
>>> input_tensor = tf.random.uniform(shape=[1, 512, 512, 3], maxval=1, dtype=tf.float32)
>>> out = model(input_tensor)
Parameters:

pretrained (bool) – If True, returns a pretrained model

Returns:

A classification model

doctr.models.classification.resnet31(pretrained: bool = False, **kwargs: Any) → ResNet

ResNet31 architecture with rectangular pooling windows as described in “Show, Attend and Read: A Simple and Strong Baseline for Irregular Text Recognition”. Downsizing: (H, W) -> (H/8, W/4)

>>> import tensorflow as tf
>>> from doctr.models import resnet31
>>> model = resnet31(pretrained=False)
>>> input_tensor = tf.random.uniform(shape=[1, 512, 512, 3], maxval=1, dtype=tf.float32)
>>> out = model(input_tensor)
Parameters:

pretrained (bool) – If True, returns a pretrained model

Returns:

A classification model

doctr.models.classification.mobilenet_v3_small(pretrained: bool = False, **kwargs: Any) → MobileNetV3

MobileNetV3-Small architecture as described in “Searching for MobileNetV3”.

>>> import tensorflow as tf
>>> from doctr.models import mobilenet_v3_small
>>> model = mobilenet_v3_small(pretrained=False)
>>> input_tensor = tf.random.uniform(shape=[1, 512, 512, 3], maxval=1, dtype=tf.float32)
>>> out = model(input_tensor)
Parameters:

pretrained (bool) – If True, returns a pretrained model

Returns:

a keras.Model

doctr.models.classification.mobilenet_v3_large(pretrained: bool = False, **kwargs: Any) → MobileNetV3

MobileNetV3-Large architecture as described in “Searching for MobileNetV3”.

>>> import tensorflow as tf
>>> from doctr.models import mobilenet_v3_large
>>> model = mobilenet_v3_large(pretrained=False)
>>> input_tensor = tf.random.uniform(shape=[1, 512, 512, 3], maxval=1, dtype=tf.float32)
>>> out = model(input_tensor)
Parameters:

pretrained (bool) – If True, returns a pretrained model

Returns:

a keras.Model

doctr.models.classification.mobilenet_v3_small_r(pretrained: bool = False, **kwargs: Any) → MobileNetV3

MobileNetV3-Small architecture as described in “Searching for MobileNetV3”, with rectangular pooling.

>>> import tensorflow as tf
>>> from doctr.models import mobilenet_v3_small_r
>>> model = mobilenet_v3_small_r(pretrained=False)
>>> input_tensor = tf.random.uniform(shape=[1, 512, 512, 3], maxval=1, dtype=tf.float32)
>>> out = model(input_tensor)
Parameters:

pretrained (bool) – If True, returns a pretrained model

Returns:

a keras.Model

doctr.models.classification.mobilenet_v3_large_r(pretrained: bool = False, **kwargs: Any) → MobileNetV3

MobileNetV3-Large architecture as described in “Searching for MobileNetV3”, with rectangular pooling.

>>> import tensorflow as tf
>>> from doctr.models import mobilenet_v3_large_r
>>> model = mobilenet_v3_large_r(pretrained=False)
>>> input_tensor = tf.random.uniform(shape=[1, 512, 512, 3], maxval=1, dtype=tf.float32)
>>> out = model(input_tensor)
Parameters:

pretrained (bool) – If True, returns a pretrained model

Returns:

a keras.Model

doctr.models.classification.mobilenet_v3_small_orientation(pretrained: bool = False, **kwargs: Any) → MobileNetV3

MobileNetV3-Small architecture as described in “Searching for MobileNetV3”.

>>> import tensorflow as tf
>>> from doctr.models import mobilenet_v3_small_orientation
>>> model = mobilenet_v3_small_orientation(pretrained=False)
>>> input_tensor = tf.random.uniform(shape=[1, 512, 512, 3], maxval=1, dtype=tf.float32)
>>> out = model(input_tensor)
Parameters:

pretrained (bool) – If True, returns a pretrained model

Returns:

a keras.Model

doctr.models.classification.magc_resnet31(pretrained: bool = False, **kwargs: Any) → ResNet

ResNet31 architecture with Multi-Aspect Global Context Attention as described in “MASTER: Multi-Aspect Non-local Network for Scene Text Recognition”.

>>> import tensorflow as tf
>>> from doctr.models import magc_resnet31
>>> model = magc_resnet31(pretrained=False)
>>> input_tensor = tf.random.uniform(shape=[1, 224, 224, 3], maxval=1, dtype=tf.float32)
>>> out = model(input_tensor)
Parameters:

pretrained (bool) – If True, returns a pretrained model

Returns:

A feature extractor model

doctr.models.classification.vit_s(pretrained: bool = False, **kwargs: Any) → VisionTransformer

VisionTransformer-S architecture as described in “An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale”. Patches: (H, W) -> (H/8, W/8)

NOTE: unofficial config used in ViTSTR and ParSeq

>>> import tensorflow as tf
>>> from doctr.models import vit_s
>>> model = vit_s(pretrained=False)
>>> input_tensor = tf.random.uniform(shape=[1, 32, 32, 3], maxval=1, dtype=tf.float32)
>>> out = model(input_tensor)
Parameters:

pretrained (bool) – If True, returns a pretrained model

Returns:

A feature extractor model

doctr.models.classification.vit_b(pretrained: bool = False, **kwargs: Any) → VisionTransformer

VisionTransformer-B architecture as described in “An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale”. Patches: (H, W) -> (H/8, W/8)

>>> import tensorflow as tf
>>> from doctr.models import vit_b
>>> model = vit_b(pretrained=False)
>>> input_tensor = tf.random.uniform(shape=[1, 32, 32, 3], maxval=1, dtype=tf.float32)
>>> out = model(input_tensor)
Parameters:

pretrained (bool) – If True, returns a pretrained model

Returns:

A feature extractor model

doctr.models.classification.crop_orientation_predictor(arch: str = 'mobilenet_v3_small_orientation', pretrained: bool = False, **kwargs: Any) → CropOrientationPredictor

Orientation classification architecture.

>>> import numpy as np
>>> from doctr.models import crop_orientation_predictor
>>> model = crop_orientation_predictor(arch='mobilenet_v3_small_orientation', pretrained=True)
>>> input_crop = (255 * np.random.rand(600, 800, 3)).astype(np.uint8)
>>> out = model([input_crop])
Parameters:
  • arch – name of the architecture to use (e.g. ‘mobilenet_v3_small_orientation’)

  • pretrained – If True, returns a model pre-trained on our recognition crops dataset

Returns:

CropOrientationPredictor
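
The predictor consumes a list of crops and returns one orientation estimate per crop, in the same order; the exact label set depends on the model configuration. A minimal batched sketch:

>>> import numpy as np
>>> from doctr.models import crop_orientation_predictor
>>> predictor = crop_orientation_predictor(pretrained=True)
>>> # four synthetic crops; real inputs would be word crops produced by a detector
>>> crops = [(255 * np.random.rand(64, 256, 3)).astype(np.uint8) for _ in range(4)]
>>> orientations = predictor(crops)
>>> assert len(orientations) == len(crops)  # one prediction per input crop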

doctr.models.detection

doctr.models.detection.linknet_resnet18(pretrained: bool = False, **kwargs: Any) → LinkNet

LinkNet as described in “LinkNet: Exploiting Encoder Representations for Efficient Semantic Segmentation”, using a ResNet-18 backbone.

>>> import tensorflow as tf
>>> from doctr.models import linknet_resnet18
>>> model = linknet_resnet18(pretrained=True)
>>> input_tensor = tf.random.uniform(shape=[1, 1024, 1024, 3], maxval=1, dtype=tf.float32)
>>> out = model(input_tensor)
Parameters:

pretrained (bool) – If True, returns a model pre-trained on our text detection dataset

Returns:

text detection architecture

doctr.models.detection.linknet_resnet18_rotation(pretrained: bool = False, **kwargs: Any) → LinkNet

LinkNet as described in “LinkNet: Exploiting Encoder Representations for Efficient Semantic Segmentation”, using a ResNet-18 backbone.

>>> import tensorflow as tf
>>> from doctr.models import linknet_resnet18_rotation
>>> model = linknet_resnet18_rotation(pretrained=True)
>>> input_tensor = tf.random.uniform(shape=[1, 1024, 1024, 3], maxval=1, dtype=tf.float32)
>>> out = model(input_tensor)
Parameters:

pretrained (bool) – If True, returns a model pre-trained on our text detection dataset

Returns:

text detection architecture

doctr.models.detection.linknet_resnet34(pretrained: bool = False, **kwargs: Any) → LinkNet

LinkNet as described in “LinkNet: Exploiting Encoder Representations for Efficient Semantic Segmentation”, using a ResNet-34 backbone.

>>> import tensorflow as tf
>>> from doctr.models import linknet_resnet34
>>> model = linknet_resnet34(pretrained=True)
>>> input_tensor = tf.random.uniform(shape=[1, 1024, 1024, 3], maxval=1, dtype=tf.float32)
>>> out = model(input_tensor)
Parameters:

pretrained (bool) – If True, returns a model pre-trained on our text detection dataset

Returns:

text detection architecture

doctr.models.detection.linknet_resnet50(pretrained: bool = False, **kwargs: Any) → LinkNet

LinkNet as described in “LinkNet: Exploiting Encoder Representations for Efficient Semantic Segmentation”, using a ResNet-50 backbone.

>>> import tensorflow as tf
>>> from doctr.models import linknet_resnet50
>>> model = linknet_resnet50(pretrained=True)
>>> input_tensor = tf.random.uniform(shape=[1, 1024, 1024, 3], maxval=1, dtype=tf.float32)
>>> out = model(input_tensor)
Parameters:

pretrained (bool) – If True, returns a model pre-trained on our text detection dataset

Returns:

text detection architecture

doctr.models.detection.db_resnet50(pretrained: bool = False, **kwargs: Any) → DBNet

DBNet as described in “Real-time Scene Text Detection with Differentiable Binarization”, using a ResNet-50 backbone.

>>> import tensorflow as tf
>>> from doctr.models import db_resnet50
>>> model = db_resnet50(pretrained=True)
>>> input_tensor = tf.random.uniform(shape=[1, 1024, 1024, 3], maxval=1, dtype=tf.float32)
>>> out = model(input_tensor)
Parameters:

pretrained (bool) – If True, returns a model pre-trained on our text detection dataset

Returns:

text detection architecture

doctr.models.detection.db_mobilenet_v3_large(pretrained: bool = False, **kwargs: Any) → DBNet

DBNet as described in “Real-time Scene Text Detection with Differentiable Binarization”, using a MobileNetV3-Large backbone.

>>> import tensorflow as tf
>>> from doctr.models import db_mobilenet_v3_large
>>> model = db_mobilenet_v3_large(pretrained=True)
>>> input_tensor = tf.random.uniform(shape=[1, 1024, 1024, 3], maxval=1, dtype=tf.float32)
>>> out = model(input_tensor)
Parameters:

pretrained (bool) – If True, returns a model pre-trained on our text detection dataset

Returns:

text detection architecture

doctr.models.detection.detection_predictor(arch: Any = 'db_resnet50', pretrained: bool = False, assume_straight_pages: bool = True, **kwargs: Any) → DetectionPredictor

Text detection architecture.

>>> import numpy as np
>>> from doctr.models import detection_predictor
>>> model = detection_predictor(arch='db_resnet50', pretrained=True)
>>> input_page = (255 * np.random.rand(600, 800, 3)).astype(np.uint8)
>>> out = model([input_page])
Parameters:
  • arch – name of the architecture or model itself to use (e.g. ‘db_resnet50’)

  • pretrained – If True, returns a model pre-trained on our text detection dataset

  • assume_straight_pages – If True, fit straight boxes to the page

Returns:

Detection predictor
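
Each element of the output corresponds to one input page. A sketch of inspecting the predictions, assuming the default single-class setup where boxes are exposed under a ‘words’ key as relative (xmin, ymin, xmax, ymax, score) rows:

>>> import numpy as np
>>> from doctr.models import detection_predictor
>>> model = detection_predictor(arch='db_resnet50', pretrained=True)
>>> input_page = (255 * np.random.rand(600, 800, 3)).astype(np.uint8)
>>> out = model([input_page])
>>> boxes = out[0]["words"]  # assumed key and box format, see above
>>> print(boxes.shape)  # (num_detections, 5)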

doctr.models.recognition

doctr.models.recognition.crnn_vgg16_bn(pretrained: bool = False, **kwargs: Any) → CRNN

CRNN with a VGG-16 backbone as described in “An End-to-End Trainable Neural Network for Image-based Sequence Recognition and Its Application to Scene Text Recognition”.

>>> import tensorflow as tf
>>> from doctr.models import crnn_vgg16_bn
>>> model = crnn_vgg16_bn(pretrained=True)
>>> input_tensor = tf.random.uniform(shape=[1, 32, 128, 3], maxval=1, dtype=tf.float32)
>>> out = model(input_tensor)
Parameters:

pretrained (bool) – If True, returns a model pre-trained on our text recognition dataset

Returns:

text recognition architecture

doctr.models.recognition.crnn_mobilenet_v3_small(pretrained: bool = False, **kwargs: Any) → CRNN

CRNN with a MobileNet V3 Small backbone as described in “An End-to-End Trainable Neural Network for Image-based Sequence Recognition and Its Application to Scene Text Recognition”.

>>> import tensorflow as tf
>>> from doctr.models import crnn_mobilenet_v3_small
>>> model = crnn_mobilenet_v3_small(pretrained=True)
>>> input_tensor = tf.random.uniform(shape=[1, 32, 128, 3], maxval=1, dtype=tf.float32)
>>> out = model(input_tensor)
Parameters:

pretrained (bool) – If True, returns a model pre-trained on our text recognition dataset

Returns:

text recognition architecture

doctr.models.recognition.crnn_mobilenet_v3_large(pretrained: bool = False, **kwargs: Any) → CRNN

CRNN with a MobileNet V3 Large backbone as described in “An End-to-End Trainable Neural Network for Image-based Sequence Recognition and Its Application to Scene Text Recognition”.

>>> import tensorflow as tf
>>> from doctr.models import crnn_mobilenet_v3_large
>>> model = crnn_mobilenet_v3_large(pretrained=True)
>>> input_tensor = tf.random.uniform(shape=[1, 32, 128, 3], maxval=1, dtype=tf.float32)
>>> out = model(input_tensor)
Parameters:

pretrained (bool) – If True, returns a model pre-trained on our text recognition dataset

Returns:

text recognition architecture

doctr.models.recognition.sar_resnet31(pretrained: bool = False, **kwargs: Any) → SAR

SAR with a ResNet-31 feature extractor as described in “Show, Attend and Read: A Simple and Strong Baseline for Irregular Text Recognition”.

>>> import tensorflow as tf
>>> from doctr.models import sar_resnet31
>>> model = sar_resnet31(pretrained=False)
>>> input_tensor = tf.random.uniform(shape=[1, 64, 256, 3], maxval=1, dtype=tf.float32)
>>> out = model(input_tensor)
Parameters:

pretrained (bool) – If True, returns a model pre-trained on our text recognition dataset

Returns:

text recognition architecture

doctr.models.recognition.master(pretrained: bool = False, **kwargs: Any) → MASTER

MASTER as described in “MASTER: Multi-Aspect Non-local Network for Scene Text Recognition” (https://arxiv.org/pdf/1910.02562.pdf).

>>> import tensorflow as tf
>>> from doctr.models import master
>>> model = master(pretrained=False)
>>> input_tensor = tf.random.uniform(shape=[1, 32, 128, 3], maxval=1, dtype=tf.float32)
>>> out = model(input_tensor)
Parameters:

pretrained (bool) – If True, returns a model pre-trained on our text recognition dataset

Returns:

text recognition architecture

doctr.models.recognition.vitstr_small(pretrained: bool = False, **kwargs: Any) → ViTSTR

ViTSTR-Small as described in “Vision Transformer for Fast and Efficient Scene Text Recognition”.

>>> import tensorflow as tf
>>> from doctr.models import vitstr_small
>>> model = vitstr_small(pretrained=False)
>>> input_tensor = tf.random.uniform(shape=[1, 32, 128, 3], maxval=1, dtype=tf.float32)
>>> out = model(input_tensor)
Parameters:

pretrained (bool) – If True, returns a model pre-trained on our text recognition dataset

Returns:

text recognition architecture

doctr.models.recognition.vitstr_base(pretrained: bool = False, **kwargs: Any) → ViTSTR

ViTSTR-Base as described in “Vision Transformer for Fast and Efficient Scene Text Recognition”.

>>> import tensorflow as tf
>>> from doctr.models import vitstr_base
>>> model = vitstr_base(pretrained=False)
>>> input_tensor = tf.random.uniform(shape=[1, 32, 128, 3], maxval=1, dtype=tf.float32)
>>> out = model(input_tensor)
Parameters:

pretrained (bool) – If True, returns a model pre-trained on our text recognition dataset

Returns:

text recognition architecture

doctr.models.recognition.parseq(pretrained: bool = False, **kwargs: Any) → PARSeq

PARSeq architecture from “Scene Text Recognition with Permuted Autoregressive Sequence Models”.

>>> import tensorflow as tf
>>> from doctr.models import parseq
>>> model = parseq(pretrained=False)
>>> input_tensor = tf.random.uniform(shape=[1, 32, 128, 3], maxval=1, dtype=tf.float32)
>>> out = model(input_tensor)
Parameters:

pretrained (bool) – If True, returns a model pre-trained on our text recognition dataset

Returns:

text recognition architecture

doctr.models.recognition.recognition_predictor(arch: Any = 'crnn_vgg16_bn', pretrained: bool = False, **kwargs: Any) → RecognitionPredictor

Text recognition architecture.

>>> import numpy as np
>>> from doctr.models import recognition_predictor
>>> model = recognition_predictor(pretrained=True)
>>> input_page = (255 * np.random.rand(32, 128, 3)).astype(np.uint8)
>>> out = model([input_page])
Parameters:
  • arch – name of the architecture or model itself to use (e.g. ‘crnn_vgg16_bn’)

  • pretrained – If True, returns a model pre-trained on our text recognition dataset

Returns:

Recognition predictor
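
The predictor takes a list of word crops and returns one prediction per crop; a sketch of reading the results, assuming each prediction is a (value, confidence) pair:

>>> import numpy as np
>>> from doctr.models import recognition_predictor
>>> model = recognition_predictor(arch='crnn_vgg16_bn', pretrained=True)
>>> crops = [(255 * np.random.rand(32, 128, 3)).astype(np.uint8) for _ in range(2)]
>>> out = model(crops)
>>> for value, confidence in out:  # assumed (text, score) pairs
...     print(value, confidence)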

doctr.models.zoo

doctr.models.ocr_predictor(det_arch: Any = 'db_resnet50', reco_arch: Any = 'crnn_vgg16_bn', pretrained: bool = False, pretrained_backbone: bool = True, assume_straight_pages: bool = True, preserve_aspect_ratio: bool = True, symmetric_pad: bool = True, export_as_straight_boxes: bool = False, detect_orientation: bool = False, detect_language: bool = False, **kwargs: Any) → OCRPredictor

End-to-end OCR architecture using one model for localization, and another for text recognition.

>>> import numpy as np
>>> from doctr.models import ocr_predictor
>>> model = ocr_predictor('db_resnet50', 'crnn_vgg16_bn', pretrained=True)
>>> input_page = (255 * np.random.rand(600, 800, 3)).astype(np.uint8)
>>> out = model([input_page])
Parameters:
  • det_arch – name of the detection architecture or the model itself to use (e.g. ‘db_resnet50’, ‘db_mobilenet_v3_large’)

  • reco_arch – name of the recognition architecture or the model itself to use (e.g. ‘crnn_vgg16_bn’, ‘sar_resnet31’)

  • pretrained – If True, returns a model pre-trained on our OCR dataset

  • pretrained_backbone – If True, returns a model with a pretrained backbone

  • assume_straight_pages – if True, speeds up the inference by assuming you only pass straight pages without rotated textual elements.

  • preserve_aspect_ratio – If True, pad the input document image to preserve the aspect ratio before running the detection model on it.

  • symmetric_pad – if True, pad the image symmetrically instead of padding at the bottom-right.

  • export_as_straight_boxes – when assume_straight_pages is set to False, export final predictions (potentially rotated) as straight bounding boxes.

  • detect_orientation – if True, the estimated general page orientation will be added to the predictions for each page. Doing so will slightly deteriorate the overall latency.

  • detect_language – if True, the language prediction will be added to the predictions for each page. Doing so will slightly deteriorate the overall latency.

  • kwargs – keyword args of OCRPredictor

Returns:

OCR predictor
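
The returned object is a structured document (pages → blocks → lines → words) rather than raw arrays; a sketch of walking and serializing it, assuming the standard doctr Document interface:

>>> import numpy as np
>>> from doctr.models import ocr_predictor
>>> predictor = ocr_predictor(pretrained=True)
>>> input_page = (255 * np.random.rand(600, 800, 3)).astype(np.uint8)
>>> result = predictor([input_page])
>>> json_output = result.export()  # nested dict mirroring the page/block/line/word hierarchy
>>> for block in result.pages[0].blocks:
...     for line in block.lines:
...         print(" ".join(word.value for word in line.words))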

doctr.models.kie_predictor(det_arch: Any = 'db_resnet50', reco_arch: Any = 'crnn_vgg16_bn', pretrained: bool = False, pretrained_backbone: bool = True, assume_straight_pages: bool = True, preserve_aspect_ratio: bool = True, symmetric_pad: bool = True, export_as_straight_boxes: bool = False, detect_orientation: bool = False, detect_language: bool = False, **kwargs: Any) → KIEPredictor

End-to-end KIE architecture using one model for localization, and another for text recognition.

>>> import numpy as np
>>> from doctr.models import kie_predictor
>>> model = kie_predictor('db_resnet50', 'crnn_vgg16_bn', pretrained=True)
>>> input_page = (255 * np.random.rand(600, 800, 3)).astype(np.uint8)
>>> out = model([input_page])
Parameters:
  • det_arch – name of the detection architecture or the model itself to use (e.g. ‘db_resnet50’, ‘db_mobilenet_v3_large’)

  • reco_arch – name of the recognition architecture or the model itself to use (e.g. ‘crnn_vgg16_bn’, ‘sar_resnet31’)

  • pretrained – If True, returns a model pre-trained on our OCR dataset

  • pretrained_backbone – If True, returns a model with a pretrained backbone

  • assume_straight_pages – if True, speeds up the inference by assuming you only pass straight pages without rotated textual elements.

  • preserve_aspect_ratio – If True, pad the input document image to preserve the aspect ratio before running the detection model on it.

  • symmetric_pad – if True, pad the image symmetrically instead of padding at the bottom-right.

  • export_as_straight_boxes – when assume_straight_pages is set to False, export final predictions (potentially rotated) as straight bounding boxes.

  • detect_orientation – if True, the estimated general page orientation will be added to the predictions for each page. Doing so will slightly deteriorate the overall latency.

  • detect_language – if True, the language prediction will be added to the predictions for each page. Doing so will slightly deteriorate the overall latency.

  • kwargs – keyword args of KIEPredictor

Returns:

KIE predictor
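
Unlike ocr_predictor, the KIE variant groups detections by predicted class; a sketch of reading the output, assuming each page exposes a predictions mapping from class name to detected objects:

>>> import numpy as np
>>> from doctr.models import kie_predictor
>>> predictor = kie_predictor(pretrained=True)
>>> input_page = (255 * np.random.rand(600, 800, 3)).astype(np.uint8)
>>> result = predictor([input_page])
>>> predictions = result.pages[0].predictions  # assumed: {class_name: [prediction, ...]}
>>> for class_name, objects in predictions.items():
...     print(class_name, [obj.value for obj in objects])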

doctr.models.factory

doctr.models.factory.login_to_hub() → None

Log in to the Hugging Face Hub.
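
The helper takes no arguments; authentication itself is delegated to huggingface_hub. A minimal usage sketch:

>>> from doctr.models import login_to_hub
>>> login_to_hub()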

doctr.models.factory.from_hub(repo_id: str, **kwargs: Any)

Instantiate & load a pretrained model from HF hub.

>>> from doctr.models import from_hub
>>> model = from_hub("mindee/fasterrcnn_mobilenet_v3_large_fpn")
Parameters:
  • repo_id – HuggingFace model hub repo

  • kwargs – kwargs of hf_hub_download or snapshot_download

Returns:

Model loaded with the checkpoint
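
Since the predictor factories accept a model instance in place of an architecture name (see detection_predictor and ocr_predictor above), a checkpoint pulled from the hub can be wrapped directly; a sketch with a hypothetical detection repo:

>>> from doctr.models import from_hub, ocr_predictor
>>> det_model = from_hub("mindee/my-db-resnet50")  # hypothetical repo_id
>>> predictor = ocr_predictor(det_arch=det_model, reco_arch='crnn_vgg16_bn', pretrained=True)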

doctr.models.factory.push_to_hf_hub(model: Any, model_name: str, task: str, **kwargs) → None

Save a model and its configuration on the HF hub.

>>> from doctr.models import login_to_hub, push_to_hf_hub
>>> from doctr.models.recognition import crnn_mobilenet_v3_small
>>> login_to_hub()
>>> model = crnn_mobilenet_v3_small(pretrained=True)
>>> push_to_hf_hub(model, 'my-model', 'recognition', arch='crnn_mobilenet_v3_small')
Parameters:
  • model – TF or PyTorch model to be saved

  • model_name – name of the model which is also the repository name

  • task – task name

  • **kwargs – keyword arguments for push_to_hf_hub