doctr.models

doctr.models.classification

doctr.models.classification.vgg16_bn_r(pretrained: bool = False, **kwargs: Any) -> VGG

VGG-16 architecture as described in “Very Deep Convolutional Networks for Large-Scale Image Recognition”, modified by adding batch normalization, rectangular pooling and a simpler classification head.

>>> import torch
>>> from doctr.models import vgg16_bn_r
>>> model = vgg16_bn_r(pretrained=False)
>>> input_tensor = torch.rand((1, 3, 512, 512), dtype=torch.float32)
>>> out = model(input_tensor)
Parameters:
  • pretrained – If True, returns a model pre-trained on ImageNet

  • **kwargs – keyword arguments of the VGG architecture

Returns:

VGG feature extractor

doctr.models.classification.resnet18(pretrained: bool = False, **kwargs: Any) -> ResNet

ResNet-18 architecture as described in “Deep Residual Learning for Image Recognition”.

>>> import torch
>>> from doctr.models import resnet18
>>> model = resnet18(pretrained=False)
>>> input_tensor = torch.rand((1, 3, 512, 512), dtype=torch.float32)
>>> out = model(input_tensor)
Parameters:
  • pretrained – If True, returns a model pre-trained on our classification dataset

  • **kwargs – keyword arguments of the ResNet architecture

Returns:

A resnet18 model

doctr.models.classification.resnet34(pretrained: bool = False, **kwargs: Any) -> ResNet

ResNet-34 architecture as described in “Deep Residual Learning for Image Recognition”.

>>> import torch
>>> from doctr.models import resnet34
>>> model = resnet34(pretrained=False)
>>> input_tensor = torch.rand((1, 3, 512, 512), dtype=torch.float32)
>>> out = model(input_tensor)
Parameters:
  • pretrained – If True, returns a model pre-trained on our classification dataset

  • **kwargs – keyword arguments of the ResNet architecture

Returns:

A resnet34 model

doctr.models.classification.resnet50(pretrained: bool = False, **kwargs: Any) -> ResNet

ResNet-50 architecture as described in “Deep Residual Learning for Image Recognition”.

>>> import torch
>>> from doctr.models import resnet50
>>> model = resnet50(pretrained=False)
>>> input_tensor = torch.rand((1, 3, 512, 512), dtype=torch.float32)
>>> out = model(input_tensor)
Parameters:
  • pretrained – If True, returns a model pre-trained on our classification dataset

  • **kwargs – keyword arguments of the ResNet architecture

Returns:

A resnet50 model

doctr.models.classification.resnet31(pretrained: bool = False, **kwargs: Any) -> ResNet

ResNet-31 architecture with rectangular pooling windows as described in “Show, Attend and Read: A Simple and Strong Baseline for Irregular Text Recognition”. Downsizing: (H, W) -> (H/8, W/4)

>>> import torch
>>> from doctr.models import resnet31
>>> model = resnet31(pretrained=False)
>>> input_tensor = torch.rand((1, 3, 512, 512), dtype=torch.float32)
>>> out = model(input_tensor)
Parameters:
  • pretrained – If True, returns a model pre-trained on our classification dataset

  • **kwargs – keyword arguments of the ResNet architecture

Returns:

A resnet31 model
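
To make the downsizing ratio above concrete, here is a minimal sketch that computes the feature-map size implied by (H, W) -> (H/8, W/4). The helper name `resnet31_feature_size` is hypothetical and not part of docTR, and floor division is an assumption: the exact rounding depends on the padding of the underlying layers.

```python
def resnet31_feature_size(height: int, width: int) -> tuple[int, int]:
    """Spatial size implied by the documented ratio (H, W) -> (H/8, W/4).

    Hypothetical helper for illustration only, not part of docTR.
    Floor division is an assumption about how odd sizes are rounded.
    """
    return height // 8, width // 4


# The 512 x 512 input used in the example above
print(resnet31_feature_size(512, 512))  # (64, 128)
# A typical 32 x 128 text crop
print(resnet31_feature_size(32, 128))   # (4, 32)
```

The width is reduced less than the height, which keeps more horizontal resolution for text lines that are much wider than they are tall.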

doctr.models.classification.mobilenet_v3_small(pretrained: bool = False, **kwargs: Any) -> MobileNetV3

MobileNetV3-Small architecture as described in “Searching for MobileNetV3”.

>>> import torch
>>> from doctr.models import mobilenet_v3_small
>>> model = mobilenet_v3_small(pretrained=False)
>>> input_tensor = torch.rand((1, 3, 512, 512), dtype=torch.float32)
>>> out = model(input_tensor)
Parameters:
  • pretrained – If True, returns a model pre-trained on our classification dataset

  • **kwargs – keyword arguments of the MobileNetV3 architecture

Returns:

a torch.nn.Module

doctr.models.classification.mobilenet_v3_large(pretrained: bool = False, **kwargs: Any) -> MobileNetV3

MobileNetV3-Large architecture as described in “Searching for MobileNetV3”.

>>> import torch
>>> from doctr.models import mobilenet_v3_large
>>> model = mobilenet_v3_large(pretrained=False)
>>> input_tensor = torch.rand((1, 3, 512, 512), dtype=torch.float32)
>>> out = model(input_tensor)
Parameters:
  • pretrained – If True, returns a model pre-trained on our classification dataset

  • **kwargs – keyword arguments of the MobileNetV3 architecture

Returns:

a torch.nn.Module

doctr.models.classification.mobilenet_v3_small_r(pretrained: bool = False, **kwargs: Any) -> MobileNetV3

MobileNetV3-Small architecture as described in “Searching for MobileNetV3”, with rectangular pooling.

>>> import torch
>>> from doctr.models import mobilenet_v3_small_r
>>> model = mobilenet_v3_small_r(pretrained=False)
>>> input_tensor = torch.rand((1, 3, 512, 512), dtype=torch.float32)
>>> out = model(input_tensor)
Parameters:
  • pretrained – If True, returns a model pre-trained on our classification dataset

  • **kwargs – keyword arguments of the MobileNetV3 architecture

Returns:

a torch.nn.Module

doctr.models.classification.mobilenet_v3_large_r(pretrained: bool = False, **kwargs: Any) -> MobileNetV3

MobileNetV3-Large architecture as described in “Searching for MobileNetV3”, with rectangular pooling.

>>> import torch
>>> from doctr.models import mobilenet_v3_large_r
>>> model = mobilenet_v3_large_r(pretrained=False)
>>> input_tensor = torch.rand((1, 3, 512, 512), dtype=torch.float32)
>>> out = model(input_tensor)
Parameters:
  • pretrained – If True, returns a model pre-trained on our classification dataset

  • **kwargs – keyword arguments of the MobileNetV3 architecture

Returns:

a torch.nn.Module

doctr.models.classification.mobilenet_v3_small_crop_orientation(pretrained: bool = False, **kwargs: Any) -> MobileNetV3

MobileNetV3-Small architecture as described in “Searching for MobileNetV3”.

>>> import torch
>>> from doctr.models import mobilenet_v3_small_crop_orientation
>>> model = mobilenet_v3_small_crop_orientation(pretrained=False)
>>> input_tensor = torch.rand((1, 3, 512, 512), dtype=torch.float32)
>>> out = model(input_tensor)
Parameters:
  • pretrained – If True, returns a model pre-trained on our classification dataset

  • **kwargs – keyword arguments of the MobileNetV3 architecture

Returns:

a torch.nn.Module

doctr.models.classification.mobilenet_v3_small_page_orientation(pretrained: bool = False, **kwargs: Any) -> MobileNetV3

MobileNetV3-Small architecture as described in “Searching for MobileNetV3”.

>>> import torch
>>> from doctr.models import mobilenet_v3_small_page_orientation
>>> model = mobilenet_v3_small_page_orientation(pretrained=False)
>>> input_tensor = torch.rand((1, 3, 512, 512), dtype=torch.float32)
>>> out = model(input_tensor)
Parameters:
  • pretrained – If True, returns a model pre-trained on our classification dataset

  • **kwargs – keyword arguments of the MobileNetV3 architecture

Returns:

a torch.nn.Module

doctr.models.classification.magc_resnet31(pretrained: bool = False, **kwargs: Any) -> ResNet

ResNet-31 architecture with Multi-Aspect Global Context Attention as described in “MASTER: Multi-Aspect Non-local Network for Scene Text Recognition”.

>>> import torch
>>> from doctr.models import magc_resnet31
>>> model = magc_resnet31(pretrained=False)
>>> input_tensor = torch.rand((1, 3, 224, 224), dtype=torch.float32)
>>> out = model(input_tensor)
Parameters:
  • pretrained – If True, returns a model pre-trained on our classification dataset

  • **kwargs – keyword arguments of the ResNet architecture

Returns:

A feature extractor model

doctr.models.classification.vit_s(pretrained: bool = False, **kwargs: Any) -> VisionTransformer

VisionTransformer-S architecture as described in “An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale”. Patches: (H, W) -> (H/8, W/8)

NOTE: unofficial config used in ViTSTR and PARSeq

>>> import torch
>>> from doctr.models import vit_s
>>> model = vit_s(pretrained=False)
>>> input_tensor = torch.rand((1, 3, 32, 32), dtype=torch.float32)
>>> out = model(input_tensor)
Parameters:
  • pretrained – If True, returns a model pre-trained on our classification dataset

  • **kwargs – keyword arguments of the VisionTransformer architecture

Returns:

A feature extractor model

doctr.models.classification.vit_b(pretrained: bool = False, **kwargs: Any) -> VisionTransformer

VisionTransformer-B architecture as described in “An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale”. Patches: (H, W) -> (H/8, W/8)

>>> import torch
>>> from doctr.models import vit_b
>>> model = vit_b(pretrained=False)
>>> input_tensor = torch.rand((1, 3, 32, 32), dtype=torch.float32)
>>> out = model(input_tensor)
Parameters:
  • pretrained – If True, returns a model pre-trained on our classification dataset

  • **kwargs – keyword arguments of the VisionTransformer architecture

Returns:

A feature extractor model

doctr.models.classification.textnet_tiny(pretrained: bool = False, **kwargs: Any) -> TextNet

Implements the TextNet architecture from “FAST: Faster Arbitrarily-Shaped Text Detector with Minimalist Kernel Representation”. Based on the official PyTorch implementation (https://github.com/czczup/FAST).

>>> import torch
>>> from doctr.models import textnet_tiny
>>> model = textnet_tiny(pretrained=False)
>>> input_tensor = torch.rand((1, 3, 512, 512), dtype=torch.float32)
>>> out = model(input_tensor)
Parameters:
  • pretrained – If True, returns a model pre-trained on our classification dataset

  • **kwargs – keyword arguments of the TextNet architecture

Returns:

A TextNet tiny model

doctr.models.classification.textnet_small(pretrained: bool = False, **kwargs: Any) -> TextNet

Implements the TextNet architecture from “FAST: Faster Arbitrarily-Shaped Text Detector with Minimalist Kernel Representation”. Based on the official PyTorch implementation (https://github.com/czczup/FAST).

>>> import torch
>>> from doctr.models import textnet_small
>>> model = textnet_small(pretrained=False)
>>> input_tensor = torch.rand((1, 3, 512, 512), dtype=torch.float32)
>>> out = model(input_tensor)
Parameters:
  • pretrained – If True, returns a model pre-trained on our classification dataset

  • **kwargs – keyword arguments of the TextNet architecture

Returns:

A TextNet small model

doctr.models.classification.textnet_base(pretrained: bool = False, **kwargs: Any) -> TextNet

Implements the TextNet architecture from “FAST: Faster Arbitrarily-Shaped Text Detector with Minimalist Kernel Representation”. Based on the official PyTorch implementation (https://github.com/czczup/FAST).

>>> import torch
>>> from doctr.models import textnet_base
>>> model = textnet_base(pretrained=False)
>>> input_tensor = torch.rand((1, 3, 512, 512), dtype=torch.float32)
>>> out = model(input_tensor)
Parameters:
  • pretrained – If True, returns a model pre-trained on our classification dataset

  • **kwargs – keyword arguments of the TextNet architecture

Returns:

A TextNet base model

doctr.models.classification.vip_tiny(pretrained: bool = False, **kwargs: Any) -> VIPNet

VIP-Tiny encoder architecture. Corresponds to the SVIPTRv2-T variant in the paper (VIPTRv2 function in the official implementation: https://github.com/cxfyxl/VIPTR/blob/main/modules/VIPTRv2.py)

Parameters:
  • pretrained – If True, returns a model pre-trained on our classification dataset

  • **kwargs – optional arguments

Returns:

VIPNet model

doctr.models.classification.vip_base(pretrained: bool = False, **kwargs: Any) -> VIPNet

VIP-Base encoder architecture. Corresponds to the SVIPTRv2-B variant in the paper (VIPTRv2B function in the official implementation: https://github.com/cxfyxl/VIPTR/blob/main/modules/VIPTRv2.py)

Parameters:
  • pretrained – If True, returns a model pre-trained on our classification dataset

  • **kwargs – optional arguments

Returns:

VIPNet model

doctr.models.classification.vit_det_s(pretrained: bool = False, **kwargs: Any) -> VisionDetectionTransformer

VisionDetectionTransformer-S architecture as described in “Exploring Plain Vision Transformer Backbones for Object Detection”.

NOTE: Modified for LW-DETR

>>> import torch
>>> from doctr.models import vit_det_s
>>> model = vit_det_s(pretrained=False)
>>> input_tensor = torch.rand((1, 3, 32, 32), dtype=torch.float32)
>>> out = model(input_tensor)
Parameters:
  • pretrained – If True, returns a model pre-trained on our classification dataset

  • **kwargs – keyword arguments of the VisionDetectionTransformer architecture

Returns:

A feature extractor model

doctr.models.classification.vit_det_m(pretrained: bool = False, **kwargs: Any) -> VisionDetectionTransformer

VisionDetectionTransformer-B architecture as described in “Exploring Plain Vision Transformer Backbones for Object Detection”.

NOTE: Modified for LW-DETR

>>> import torch
>>> from doctr.models import vit_det_m
>>> model = vit_det_m(pretrained=False)
>>> input_tensor = torch.rand((1, 3, 32, 32), dtype=torch.float32)
>>> out = model(input_tensor)
Parameters:
  • pretrained – If True, returns a model pre-trained on our classification dataset

  • **kwargs – keyword arguments of the VisionDetectionTransformer architecture

Returns:

A feature extractor model

doctr.models.classification.starnet_s3(pretrained: bool = False, **kwargs: Any) -> StarNet

StarNet-S3 from “Rewrite the Stars”.

>>> import torch
>>> from doctr.models import starnet_s3
>>> model = starnet_s3(pretrained=False)
>>> out = model(torch.rand((1, 3, 32, 32), dtype=torch.float32))
Parameters:
  • pretrained – boolean, True if model is pretrained

  • **kwargs – keyword arguments of the StarNet architecture

Returns:

A StarNet-S3 model

doctr.models.classification.crop_orientation_predictor(arch: Any = 'mobilenet_v3_small_crop_orientation', pretrained: bool = False, batch_size: int = 128, **kwargs: Any) -> OrientationPredictor

Crop orientation classification architecture.

>>> import numpy as np
>>> from doctr.models import crop_orientation_predictor
>>> model = crop_orientation_predictor(arch='mobilenet_v3_small_crop_orientation', pretrained=True)
>>> input_crop = (255 * np.random.rand(256, 256, 3)).astype(np.uint8)
>>> out = model([input_crop])
Parameters:
  • arch – name of the architecture to use (e.g. ‘mobilenet_v3_small_crop_orientation’)

  • pretrained – If True, returns a model pre-trained on our recognition crops dataset

  • batch_size – number of samples the model processes in parallel

  • **kwargs – keyword arguments to be passed to the OrientationPredictor

Returns:

OrientationPredictor

doctr.models.classification.page_orientation_predictor(arch: Any = 'mobilenet_v3_small_page_orientation', pretrained: bool = False, batch_size: int = 4, **kwargs: Any) -> OrientationPredictor

Page orientation classification architecture.

>>> import numpy as np
>>> from doctr.models import page_orientation_predictor
>>> model = page_orientation_predictor(arch='mobilenet_v3_small_page_orientation', pretrained=True)
>>> input_page = (255 * np.random.rand(512, 512, 3)).astype(np.uint8)
>>> out = model([input_page])
Parameters:
  • arch – name of the architecture to use (e.g. ‘mobilenet_v3_small_page_orientation’)

  • pretrained – If True, returns a model pre-trained on our recognition crops dataset

  • batch_size – number of samples the model processes in parallel

  • **kwargs – keyword arguments to be passed to the OrientationPredictor

Returns:

OrientationPredictor
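
The batch_size parameter of both orientation predictors is described as the number of samples processed in parallel. The following sketch shows what that means for a list of inputs; `iter_batches` is a hypothetical helper written for illustration and is not docTR's internal batching code.

```python
from typing import Iterator, Sequence, TypeVar

T = TypeVar("T")


def iter_batches(samples: Sequence[T], batch_size: int) -> Iterator[Sequence[T]]:
    """Yield consecutive chunks of at most `batch_size` samples.

    Hypothetical helper: it only illustrates the meaning of `batch_size`.
    """
    if batch_size < 1:
        raise ValueError("batch_size must be a positive integer")
    for start in range(0, len(samples), batch_size):
        yield samples[start:start + batch_size]


# Ten pages with the page predictor's default batch_size of 4
pages = [f"page_{i}" for i in range(10)]
print([len(batch) for batch in iter_batches(pages, batch_size=4)])  # [4, 4, 2]
```

Small crops are cheap to process, which is consistent with the larger default for crops (128) than for full pages (4).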

doctr.models.detection

doctr.models.detection.linknet_resnet18(pretrained: bool = False, **kwargs: Any) -> LinkNet

LinkNet as described in “LinkNet: Exploiting Encoder Representations for Efficient Semantic Segmentation”.

>>> import torch
>>> from doctr.models import linknet_resnet18
>>> model = linknet_resnet18(pretrained=True).eval()
>>> input_tensor = torch.rand((1, 3, 1024, 1024), dtype=torch.float32)
>>> out = model(input_tensor)
Parameters:
  • pretrained – If True, returns a model pre-trained on our text detection dataset

  • **kwargs – keyword arguments of the LinkNet architecture

Returns:

text detection architecture

doctr.models.detection.linknet_resnet34(pretrained: bool = False, **kwargs: Any) -> LinkNet

LinkNet as described in “LinkNet: Exploiting Encoder Representations for Efficient Semantic Segmentation”.

>>> import torch
>>> from doctr.models import linknet_resnet34
>>> model = linknet_resnet34(pretrained=True).eval()
>>> input_tensor = torch.rand((1, 3, 1024, 1024), dtype=torch.float32)
>>> out = model(input_tensor)
Parameters:
  • pretrained – If True, returns a model pre-trained on our text detection dataset

  • **kwargs – keyword arguments of the LinkNet architecture

Returns:

text detection architecture

doctr.models.detection.linknet_resnet50(pretrained: bool = False, **kwargs: Any) -> LinkNet

LinkNet as described in “LinkNet: Exploiting Encoder Representations for Efficient Semantic Segmentation”.

>>> import torch
>>> from doctr.models import linknet_resnet50
>>> model = linknet_resnet50(pretrained=True).eval()
>>> input_tensor = torch.rand((1, 3, 1024, 1024), dtype=torch.float32)
>>> out = model(input_tensor)
Parameters:
  • pretrained – If True, returns a model pre-trained on our text detection dataset

  • **kwargs – keyword arguments of the LinkNet architecture

Returns:

text detection architecture

doctr.models.detection.db_resnet50(pretrained: bool = False, **kwargs: Any) -> DBNet

DBNet as described in “Real-time Scene Text Detection with Differentiable Binarization”, using a ResNet-50 backbone.

>>> import torch
>>> from doctr.models import db_resnet50
>>> model = db_resnet50(pretrained=True)
>>> input_tensor = torch.rand((1, 3, 1024, 1024), dtype=torch.float32)
>>> out = model(input_tensor)
Parameters:
  • pretrained – If True, returns a model pre-trained on our text detection dataset

  • **kwargs – keyword arguments of the DBNet architecture

Returns:

text detection architecture

doctr.models.detection.db_mobilenet_v3_large(pretrained: bool = False, **kwargs: Any) -> DBNet

DBNet as described in “Real-time Scene Text Detection with Differentiable Binarization”, using a MobileNet V3 Large backbone.

>>> import torch
>>> from doctr.models import db_mobilenet_v3_large
>>> model = db_mobilenet_v3_large(pretrained=True)
>>> input_tensor = torch.rand((1, 3, 1024, 1024), dtype=torch.float32)
>>> out = model(input_tensor)
Parameters:
  • pretrained – If True, returns a model pre-trained on our text detection dataset

  • **kwargs – keyword arguments of the DBNet architecture

Returns:

text detection architecture

doctr.models.detection.fast_tiny(pretrained: bool = False, **kwargs: Any) -> FAST

FAST as described in “FAST: Faster Arbitrarily-Shaped Text Detector with Minimalist Kernel Representation”, using a tiny TextNet backbone.

>>> import torch
>>> from doctr.models import fast_tiny
>>> model = fast_tiny(pretrained=True)
>>> input_tensor = torch.rand((1, 3, 1024, 1024), dtype=torch.float32)
>>> out = model(input_tensor)
Parameters:
  • pretrained – If True, returns a model pre-trained on our text detection dataset

  • **kwargs – keyword arguments of the FAST architecture

Returns:

text detection architecture

doctr.models.detection.fast_small(pretrained: bool = False, **kwargs: Any) -> FAST

FAST as described in “FAST: Faster Arbitrarily-Shaped Text Detector with Minimalist Kernel Representation”, using a small TextNet backbone.

>>> import torch
>>> from doctr.models import fast_small
>>> model = fast_small(pretrained=True)
>>> input_tensor = torch.rand((1, 3, 1024, 1024), dtype=torch.float32)
>>> out = model(input_tensor)
Parameters:
  • pretrained – If True, returns a model pre-trained on our text detection dataset

  • **kwargs – keyword arguments of the FAST architecture

Returns:

text detection architecture

doctr.models.detection.fast_base(pretrained: bool = False, **kwargs: Any) -> FAST

FAST as described in “FAST: Faster Arbitrarily-Shaped Text Detector with Minimalist Kernel Representation”, using a base TextNet backbone.

>>> import torch
>>> from doctr.models import fast_base
>>> model = fast_base(pretrained=True)
>>> input_tensor = torch.rand((1, 3, 1024, 1024), dtype=torch.float32)
>>> out = model(input_tensor)
Parameters:
  • pretrained – If True, returns a model pre-trained on our text detection dataset

  • **kwargs – keyword arguments of the FAST architecture

Returns:

text detection architecture

doctr.models.detection.detection_predictor(arch: Any = 'fast_base', pretrained: bool = False, assume_straight_pages: bool = True, preserve_aspect_ratio: bool = True, symmetric_pad: bool = True, batch_size: int = 2, **kwargs: Any) -> DetectionPredictor

Text detection architecture.

>>> import numpy as np
>>> from doctr.models import detection_predictor
>>> model = detection_predictor(arch='db_resnet50', pretrained=True)
>>> input_page = (255 * np.random.rand(600, 800, 3)).astype(np.uint8)
>>> out = model([input_page])
Parameters:
  • arch – name of the architecture or model itself to use (e.g. ‘db_resnet50’)

  • pretrained – If True, returns a model pre-trained on our text detection dataset

  • assume_straight_pages – If True, fit straight boxes to the page

  • preserve_aspect_ratio – If True, pad the input document image to preserve the aspect ratio before running the detection model on it

  • symmetric_pad – if True, pad the image symmetrically instead of padding at the bottom-right

  • batch_size – number of samples the model processes in parallel

  • **kwargs – optional keyword arguments passed to the architecture

Returns:

Detection predictor
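
The effect of preserve_aspect_ratio and symmetric_pad can be sketched with plain arithmetic. The helper below is hypothetical and is not docTR's preprocessing code; the 1024 x 1024 target size is an assumption borrowed from the input tensors in the examples above, and the rounding choices are illustrative.

```python
def resize_and_pad(height, width, target_height, target_width, symmetric_pad=True):
    """Return ((new_h, new_w), (top, bottom, left, right)).

    Hypothetical helper: scales the image to fit inside the target while
    keeping its aspect ratio, then reports the padding needed on each side.
    """
    # Integer comparison of aspect ratios avoids floating-point rounding.
    if width * target_height >= height * target_width:
        # Width is the limiting side.
        new_w = target_width
        new_h = max(1, height * target_width // width)
    else:
        # Height is the limiting side.
        new_h = target_height
        new_w = max(1, width * target_height // height)

    pad_h, pad_w = target_height - new_h, target_width - new_w
    if symmetric_pad:
        # Split the padding evenly around the image.
        top, left = pad_h // 2, pad_w // 2
    else:
        # Keep the image in the top-left corner, pad bottom and right only.
        top, left = 0, 0
    return (new_h, new_w), (top, pad_h - top, left, pad_w - left)


# The 600 x 800 page from the example above, with an assumed 1024 x 1024 target
print(resize_and_pad(600, 800, 1024, 1024, symmetric_pad=True))
# ((768, 1024), (128, 128, 0, 0))
print(resize_and_pad(600, 800, 1024, 1024, symmetric_pad=False))
# ((768, 1024), (0, 256, 0, 0))
```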

doctr.models.layout

doctr.models.layout.lw_detr_s(pretrained: bool = False, **kwargs: Any) -> LWDETR

LW-DETR as described in “LW-DETR: A Transformer Replacement to YOLO for Real-Time Detection”.

>>> import torch
>>> from doctr.models import lw_detr_s
>>> model = lw_detr_s(pretrained=True).eval()
>>> input_tensor = torch.rand((1, 3, 1024, 1024), dtype=torch.float32)
>>> out = model(input_tensor)
Parameters:
  • pretrained – If True, returns a model pre-trained on our layout prediction dataset

  • **kwargs – keyword arguments of the LWDETR architecture

Returns:

layout detection architecture

doctr.models.layout.lw_detr_m(pretrained: bool = False, **kwargs: Any) -> LWDETR

LW-DETR as described in “LW-DETR: A Transformer Replacement to YOLO for Real-Time Detection”.

>>> import torch
>>> from doctr.models import lw_detr_m
>>> model = lw_detr_m(pretrained=True).eval()
>>> input_tensor = torch.rand((1, 3, 1024, 1024), dtype=torch.float32)
>>> out = model(input_tensor)
Parameters:
  • pretrained – If True, returns a model pre-trained on our layout prediction dataset

  • **kwargs – keyword arguments of the LWDETR architecture

Returns:

layout detection architecture

doctr.models.layout.layout_predictor(arch: Any = 'lw_detr_s', pretrained: bool = False, assume_straight_pages: bool = True, preserve_aspect_ratio: bool = True, symmetric_pad: bool = True, batch_size: int = 2, **kwargs: Any) -> LayoutPredictor

Layout prediction architecture.

>>> import numpy as np
>>> from doctr.models import layout_predictor
>>> model = layout_predictor(arch='lw_detr_s', pretrained=True)
>>> input_page = (255 * np.random.rand(600, 800, 3)).astype(np.uint8)
>>> out = model([input_page])
Parameters:
  • arch – name of the architecture or model itself to use (e.g. ‘lw_detr_s’)

  • pretrained – If True, returns a model pre-trained on our layout prediction dataset

  • assume_straight_pages – If True, fit straight boxes to the page

  • preserve_aspect_ratio – If True, pad the input document image to preserve the aspect ratio before running the detection model on it

  • symmetric_pad – if True, pad the image symmetrically instead of padding at the bottom-right

  • batch_size – number of samples the model processes in parallel

  • **kwargs – optional keyword arguments passed to the architecture

Returns:

Layout predictor
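
As an illustration of what "straight boxes" means for assume_straight_pages, the sketch below fits an axis-aligned box around a rotated quadrilateral. The function name and the (x, y) relative-coordinate format are assumptions made for this example; docTR's own output structures may differ.

```python
def polygon_to_straight_box(polygon):
    """Return (xmin, ymin, xmax, ymax) enclosing a polygon of (x, y) points.

    Hypothetical helper, not part of docTR: the straight box is the smallest
    axis-aligned rectangle that contains every vertex of the polygon.
    """
    xs = [x for x, _ in polygon]
    ys = [y for _, y in polygon]
    return min(xs), min(ys), max(xs), max(ys)


# A slightly rotated quadrilateral in relative page coordinates (assumed format)
rotated = [(0.2, 0.1), (0.6, 0.2), (0.5, 0.5), (0.1, 0.4)]
print(polygon_to_straight_box(rotated))  # (0.1, 0.1, 0.6, 0.5)
```

A straight box needs only two corners, but it covers more background than the rotated polygon it encloses.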

doctr.models.table_structure

doctr.models.table_structure.tablecenternet(pretrained: bool = False, **kwargs: Any) -> TableCenterNet

TableCenterNet with a StarNet-S3 backbone, matching the official checkpoint.

>>> import torch
>>> from doctr.models import tablecenternet
>>> model = tablecenternet(pretrained=False)
>>> out = model(torch.rand((1, 3, 1024, 1024), dtype=torch.float32), return_preds=True)
Parameters:
  • pretrained – boolean, True if model is pretrained

  • **kwargs – keyword arguments of the TableCenterNet architecture

Returns:

A TableCenterNet model with a StarNet-S3 backbone

doctr.models.table_structure.table_predictor(arch: Any = 'tablecenternet', pretrained: bool = False, assume_straight_pages: bool = False, preserve_aspect_ratio: bool = True, symmetric_pad: bool = True, batch_size: int = 2, **kwargs: Any) -> TablePredictor

Table structure recognition architecture.

>>> import numpy as np
>>> from doctr.models import table_predictor
>>> model = table_predictor(arch='tablecenternet', pretrained=True)
>>> input_page = (255 * np.random.rand(600, 800, 3)).astype(np.uint8)
>>> out = model([input_page])
Parameters:
  • arch – name of the architecture or model itself to use (e.g. ‘tablecenternet’)

  • pretrained – If True, returns a model pre-trained on a table structure recognition dataset

  • assume_straight_pages – if True, fit straight boxes to the detected cells

  • preserve_aspect_ratio – if True, pad the input document image to preserve the aspect ratio before running the model on it

  • symmetric_pad – if True, pad the image symmetrically instead of padding at the bottom-right

  • batch_size – number of samples the model processes in parallel

  • **kwargs – optional keyword arguments passed to the architecture

Returns:

Table structure recognition predictor

doctr.models.recognition

doctr.models.recognition.crnn_vgg16_bn(pretrained: bool = False, **kwargs: Any) -> CRNN

CRNN with a VGG-16 backbone as described in “An End-to-End Trainable Neural Network for Image-based Sequence Recognition and Its Application to Scene Text Recognition”.

>>> import torch
>>> from doctr.models import crnn_vgg16_bn
>>> model = crnn_vgg16_bn(pretrained=True)
>>> input_tensor = torch.rand(1, 3, 32, 128)
>>> out = model(input_tensor)
Parameters:
  • pretrained – If True, returns a model pre-trained on our text recognition dataset

  • **kwargs – keyword arguments of the CRNN architecture

Returns:

text recognition architecture

doctr.models.recognition.crnn_mobilenet_v3_small(pretrained: bool = False, **kwargs: Any) -> CRNN

CRNN with a MobileNet V3 Small backbone as described in “An End-to-End Trainable Neural Network for Image-based Sequence Recognition and Its Application to Scene Text Recognition”.

>>> import torch
>>> from doctr.models import crnn_mobilenet_v3_small
>>> model = crnn_mobilenet_v3_small(pretrained=True)
>>> input_tensor = torch.rand(1, 3, 32, 128)
>>> out = model(input_tensor)
Parameters:
  • pretrained – If True, returns a model pre-trained on our text recognition dataset

  • **kwargs – keyword arguments of the CRNN architecture

Returns:

text recognition architecture

doctr.models.recognition.crnn_mobilenet_v3_large(pretrained: bool = False, **kwargs: Any) -> CRNN

CRNN with a MobileNet V3 Large backbone as described in “An End-to-End Trainable Neural Network for Image-based Sequence Recognition and Its Application to Scene Text Recognition”.

>>> import torch
>>> from doctr.models import crnn_mobilenet_v3_large
>>> model = crnn_mobilenet_v3_large(pretrained=True)
>>> input_tensor = torch.rand(1, 3, 32, 128)
>>> out = model(input_tensor)
Parameters:
  • pretrained – If True, returns a model pre-trained on our text recognition dataset

  • **kwargs – keyword arguments of the CRNN architecture

Returns:

text recognition architecture

doctr.models.recognition.sar_resnet31(pretrained: bool = False, **kwargs: Any) -> SAR

SAR with a ResNet-31 feature extractor as described in “Show, Attend and Read: A Simple and Strong Baseline for Irregular Text Recognition”.

>>> import torch
>>> from doctr.models import sar_resnet31
>>> model = sar_resnet31(pretrained=False)
>>> input_tensor = torch.rand((1, 3, 32, 128))
>>> out = model(input_tensor)
Parameters:
  • pretrained – If True, returns a model pre-trained on our text recognition dataset

  • **kwargs – keyword arguments of the SAR architecture

Returns:

text recognition architecture

doctr.models.recognition.master(pretrained: bool = False, **kwargs: Any) -> MASTER

MASTER as described in “MASTER: Multi-Aspect Non-local Network for Scene Text Recognition” (https://arxiv.org/pdf/1910.02562.pdf).

>>> import torch
>>> from doctr.models import master
>>> model = master(pretrained=False)
>>> input_tensor = torch.rand((1, 3, 32, 128))
>>> out = model(input_tensor)
Parameters:
  • pretrained – If True, returns a model pre-trained on our text recognition dataset

  • **kwargs – keyword arguments passed to the MASTER architecture

Returns:

text recognition architecture

doctr.models.recognition.vitstr_small(pretrained: bool = False, **kwargs: Any) -> ViTSTR

ViTSTR-Small as described in “Vision Transformer for Fast and Efficient Scene Text Recognition”.

>>> import torch
>>> from doctr.models import vitstr_small
>>> model = vitstr_small(pretrained=False)
>>> input_tensor = torch.rand((1, 3, 32, 128))
>>> out = model(input_tensor)
Parameters:
  • pretrained – If True, returns a model pre-trained on our text recognition dataset

  • **kwargs – keyword arguments of the ViTSTR architecture

Returns:

text recognition architecture

doctr.models.recognition.vitstr_base(pretrained: bool = False, **kwargs: Any) -> ViTSTR

ViTSTR-Base as described in “Vision Transformer for Fast and Efficient Scene Text Recognition”.

>>> import torch
>>> from doctr.models import vitstr_base
>>> model = vitstr_base(pretrained=False)
>>> input_tensor = torch.rand((1, 3, 32, 128))
>>> out = model(input_tensor)
Parameters:
  • pretrained – If True, returns a model pre-trained on our text recognition dataset

  • **kwargs – keyword arguments of the ViTSTR architecture

Returns:

text recognition architecture

doctr.models.recognition.parseq(pretrained: bool = False, **kwargs: Any) -> PARSeq

PARSeq architecture from “Scene Text Recognition with Permuted Autoregressive Sequence Models”.

>>> import torch
>>> from doctr.models import parseq
>>> model = parseq(pretrained=False)
>>> input_tensor = torch.rand((1, 3, 32, 128))
>>> out = model(input_tensor)
Parameters:
  • pretrained – If True, returns a model pre-trained on our text recognition dataset

  • **kwargs – keyword arguments of the PARSeq architecture

Returns:

text recognition architecture

doctr.models.recognition.viptr_tiny(pretrained: bool = False, **kwargs: Any) -> VIPTR

VIPTR-Tiny as described in “A Vision Permutable Extractor for Fast and Efficient Scene Text Recognition”.

>>> import torch
>>> from doctr.models import viptr_tiny
>>> model = viptr_tiny(pretrained=False)
>>> input_tensor = torch.rand((1, 3, 32, 128))
>>> out = model(input_tensor)
Parameters:
  • pretrained – If True, returns a model pre-trained on our text recognition dataset

  • **kwargs – keyword arguments of the VIPTR architecture

Returns:

a VIPTR model instance

Return type:

VIPTR

doctr.models.recognition.recognition_predictor(arch: Any = 'crnn_vgg16_bn', pretrained: bool = False, symmetric_pad: bool = False, batch_size: int = 128, **kwargs: Any) -> RecognitionPredictor

Text recognition architecture.

>>> import numpy as np
>>> from doctr.models import recognition_predictor
>>> model = recognition_predictor(pretrained=True)
>>> input_page = (255 * np.random.rand(32, 128, 3)).astype(np.uint8)
>>> out = model([input_page])
Parameters:
  • arch – name of the architecture or model itself to use (e.g. ‘crnn_vgg16_bn’)

  • pretrained – If True, returns a model pre-trained on our text recognition dataset

  • symmetric_pad – if True, pad the image symmetrically instead of padding at the bottom-right

  • batch_size – number of samples the model processes in parallel

  • **kwargs – optional parameters to be passed to the architecture

Returns:

Recognition predictor

doctr.models.zoo

doctr.models.ocr_predictor(det_arch: Any = 'fast_base', reco_arch: Any = 'crnn_vgg16_bn', pretrained: bool = False, pretrained_backbone: bool = True, assume_straight_pages: bool = True, preserve_aspect_ratio: bool = True, symmetric_pad: bool = True, export_as_straight_boxes: bool = False, detect_orientation: bool = False, straighten_pages: bool = False, detect_language: bool = False, detect_layout: bool = False, layout_arch: Any = 'lw_detr_s', detect_tables: bool = False, **kwargs: Any) OCRPredictor[source]

End-to-end OCR architecture using one model for localization, and another for text recognition.

>>> import numpy as np
>>> from doctr.models import ocr_predictor
>>> model = ocr_predictor('db_resnet50', 'crnn_vgg16_bn', pretrained=True)
>>> input_page = (255 * np.random.rand(600, 800, 3)).astype(np.uint8)
>>> out = model([input_page])
Parameters:
  • det_arch – name of the detection architecture or the model itself to use (e.g. ‘db_resnet50’, ‘db_mobilenet_v3_large’)

  • reco_arch – name of the recognition architecture or the model itself to use (e.g. ‘crnn_vgg16_bn’, ‘sar_resnet31’)

  • pretrained – If True, returns a model pre-trained on our OCR dataset

  • pretrained_backbone – If True, returns a model with a pretrained backbone

  • assume_straight_pages – if True, speeds up the inference by assuming you only pass straight pages without rotated textual elements.

  • preserve_aspect_ratio – If True, pad the input document image to preserve the aspect ratio before running the detection model on it.

  • symmetric_pad – if True, pad the image symmetrically instead of padding at the bottom-right.

  • export_as_straight_boxes – when assume_straight_pages is set to False, export final predictions (potentially rotated) as straight bounding boxes.

  • detect_orientation – if True, the estimated general page orientation will be added to the predictions for each page. Doing so will slightly increase the overall latency.

  • straighten_pages – if True, estimates the general page orientation from the median line orientation of the segmentation map, then rotates the page before passing it to the deep learning detection module again. Doing so will improve performance for documents with page-uniform rotations.

  • detect_language – if True, the language prediction will be added to the predictions for each page. Doing so will slightly increase the overall latency.

  • detect_layout – if True, a layout detection model is run on each page and the detected regions are attached to each page. Doing so will slightly increase the overall latency.

  • layout_arch – name of the layout architecture or the model itself to use.

  • detect_tables – if True, table regions found by the layout model are cropped and passed to a table structure model. Words falling inside a detected table are regrouped into a structured table (accessible via page.tables) and removed from the regular text output. This enables the layout model and slightly increases the overall latency.

  • kwargs – keyword args of OCRPredictor

Returns:

OCR predictor
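
The interaction between preserve_aspect_ratio and symmetric_pad can be illustrated with a small, library-independent sketch. The helper name pad_offsets and the 1024 x 1024 target size are hypothetical and not part of docTR; the actual resizing code may differ in its rounding details.

```python
def pad_offsets(height, width, target_h, target_w, symmetric_pad=False):
    """Return the resized size and the (top, bottom, left, right) padding.

    Hypothetical helper, for illustration only (not a docTR function).
    """
    # Scale the image so that it fits inside the target without distortion.
    scale = min(target_h / height, target_w / width)
    new_h, new_w = round(height * scale), round(width * scale)

    # Whatever space is left over has to be filled with padding.
    pad_h, pad_w = target_h - new_h, target_w - new_w

    if symmetric_pad:
        # Split the padding evenly around the image.
        top, left = pad_h // 2, pad_w // 2
    else:
        # Keep the image in the top-left corner: pad at the bottom-right only.
        top, left = 0, 0
    return (new_h, new_w), (top, pad_h - top, left, pad_w - left)


resized, bottom_right = pad_offsets(600, 800, 1024, 1024)
_, symmetric = pad_offsets(600, 800, 1024, 1024, symmetric_pad=True)
print(resized, bottom_right, symmetric)
```

For a 600 x 800 page, the resized image is 768 x 1024 in both cases; only the placement of the 256 padding rows changes.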

doctr.models.kie_predictor(det_arch: Any = 'fast_base', reco_arch: Any = 'crnn_vgg16_bn', pretrained: bool = False, pretrained_backbone: bool = True, assume_straight_pages: bool = True, preserve_aspect_ratio: bool = True, symmetric_pad: bool = True, export_as_straight_boxes: bool = False, detect_orientation: bool = False, straighten_pages: bool = False, detect_language: bool = False, detect_layout: bool = False, layout_arch: Any = 'lw_detr_s', **kwargs: Any) KIEPredictor[source]

End-to-end KIE architecture using one model for localization, and another for text recognition.

>>> import numpy as np
>>> from doctr.models import kie_predictor
>>> model = kie_predictor('db_resnet50', 'crnn_vgg16_bn', pretrained=True)
>>> input_page = (255 * np.random.rand(600, 800, 3)).astype(np.uint8)
>>> out = model([input_page])
Parameters:
  • det_arch – name of the detection architecture or the model itself to use (e.g. ‘db_resnet50’, ‘db_mobilenet_v3_large’)

  • reco_arch – name of the recognition architecture or the model itself to use (e.g. ‘crnn_vgg16_bn’, ‘sar_resnet31’)

  • pretrained – If True, returns a model pre-trained on our OCR dataset

  • pretrained_backbone – If True, returns a model with a pretrained backbone

  • assume_straight_pages – if True, speeds up the inference by assuming you only pass straight pages without rotated textual elements.

  • preserve_aspect_ratio – If True, pad the input document image to preserve the aspect ratio before running the detection model on it.

  • symmetric_pad – if True, pad the image symmetrically instead of padding at the bottom-right.

  • export_as_straight_boxes – when assume_straight_pages is set to False, export final predictions (potentially rotated) as straight bounding boxes.

  • detect_orientation – if True, the estimated general page orientation will be added to the predictions for each page. Doing so will slightly increase the overall latency.

  • straighten_pages – if True, estimates the general page orientation from the median line orientation of the segmentation map, then rotates the page before passing it to the deep learning detection module again. Doing so will improve performance for documents with page-uniform rotations.

  • detect_language – if True, the language prediction will be added to the predictions for each page. Doing so will slightly increase the overall latency.

  • detect_layout – if True, a layout detection model is run on each page and the detected regions are attached to each page. Doing so will slightly increase the overall latency.

  • layout_arch – name of the layout architecture or the model itself to use.

  • kwargs – keyword args of KIEPredictor

Returns:

KIE predictor
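
As a rough illustration of what export_as_straight_boxes implies, a rotated prediction given as four corner points can be reduced to the smallest axis-aligned box that encloses it. This is a sketch under assumptions: the helper to_straight_box is hypothetical, the coordinates are assumed to be relative (x, y) pairs, and docTR's internal conversion is not shown here.

```python
def to_straight_box(polygon):
    """Enclose a rotated box in an axis-aligned one.

    polygon: iterable of (x, y) corner points.
    Returns (xmin, ymin, xmax, ymax).
    Hypothetical helper, for illustration only (not a docTR function).
    """
    xs = [x for x, _ in polygon]
    ys = [y for _, y in polygon]
    return (min(xs), min(ys), max(xs), max(ys))


# A slightly rotated word box, in relative page coordinates (assumed convention).
rotated = [(0.30, 0.10), (0.60, 0.20), (0.55, 0.35), (0.25, 0.25)]
box = to_straight_box(rotated)
print(box)
```

The straight box is always at least as large as the rotated one, so some rotation information is lost in exchange for a simpler output format.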

doctr.models.factory

doctr.models.factory.login_to_hub() None[source]

Log in to the Hugging Face Hub.

doctr.models.factory.from_hub(repo_id: str, **kwargs: Any)[source]

Instantiate & load a pretrained model from HF hub.

>>> from doctr.models import from_hub
>>> model = from_hub("mindee/fasterrcnn_mobilenet_v3_large_fpn")
Parameters:
  • repo_id – HuggingFace model hub repo

  • kwargs – kwargs of hf_hub_download or snapshot_download

Returns:

Model loaded with the checkpoint

doctr.models.factory.push_to_hf_hub(model: Any, model_name: str, task: str, **kwargs) None[source]

Save model and its configuration on HF hub

>>> from doctr.models import login_to_hub, push_to_hf_hub
>>> from doctr.models.recognition import crnn_mobilenet_v3_small
>>> login_to_hub()
>>> model = crnn_mobilenet_v3_small(pretrained=True)
>>> push_to_hf_hub(model, 'my-model', 'recognition', arch='crnn_mobilenet_v3_small')
Parameters:
  • model – PyTorch model to be saved

  • model_name – name of the model which is also the repository name

  • task – task name

  • **kwargs – additional keyword arguments, such as arch (the architecture name, as in the example above)

doctr.models.utils

doctr.models.utils.export_model_to_onnx(model: Module, model_name: str, dummy_input: Tensor | tuple[Tensor, Tensor], input_names: list[str] | None = None, output_names: list[str] | None = None, dynamic_axes: dict[str, dict[int, str]] | None = None, **kwargs: Any) str[source]

Export model to ONNX format.

>>> import torch
>>> from doctr.models.classification import resnet18
>>> from doctr.models.utils import export_model_to_onnx
>>> model = resnet18(pretrained=True)
>>> export_model_to_onnx(model, "my_model", dummy_input=torch.randn(1, 3, 32, 32))
Parameters:
  • model – the PyTorch model to be exported

  • model_name – the name for the exported model

  • dummy_input – the dummy input to the model

  • input_names – optional names for the model inputs. Defaults to [“input”] (or [“input”, “masks”] when dummy_input is a tuple).

  • output_names – optional names for the model outputs. Defaults to [“logits”] (or [“logits”, “pred_boxes”] when dummy_input is a tuple). Pass the names of every output when the model returns more than one tensor (e.g. a multi-head model).

  • dynamic_axes – optional dynamic axes. Defaults to a dynamic batch dimension on every input and output.

  • kwargs – additional arguments to be passed to torch.onnx.export

Returns:

the path to the exported model
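
The documented defaults for input_names, output_names and dynamic_axes can be summarised with a short sketch. The function default_export_names is hypothetical, and the axis label "batch_size" is an assumption: the documentation only states that the batch dimension (axis 0) is dynamic on every input and output.

```python
def default_export_names(dummy_input_is_tuple=False):
    """Mirror the documented defaults of export_model_to_onnx.

    Hypothetical helper, for illustration only (not a docTR function).
    """
    if dummy_input_is_tuple:
        input_names = ["input", "masks"]
        output_names = ["logits", "pred_boxes"]
    else:
        input_names = ["input"]
        output_names = ["logits"]

    # Mark axis 0 as dynamic for every input and output.
    # The label "batch_size" is an assumption made for this sketch.
    dynamic_axes = {name: {0: "batch_size"} for name in input_names + output_names}
    return input_names, output_names, dynamic_axes


single = default_export_names()
paired = default_export_names(dummy_input_is_tuple=True)
print(single)
print(paired)
```

A model that returns more than one tensor needs an explicit output_names list with one entry per output, as noted in the parameter description.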

doctr.models.utils.add_whitelist(model: Module, vocabs: str | Iterable[str], *, strategy: str = 'mask', mapping: str | dict[str, str] | None = None, verbose: bool = False) WhitelistHandle[source]

Restrict a recognition model so it can only predict a subset of its vocabulary.

The whitelist is enforced at the model’s final projection layer, before the decoding argmax. Because every logit flows through this projection, the constraint also applies inside the autoregressive decoding loop of SAR, MASTER and PARSeq: a forbidden character can never be produced, and therefore can never be fed back into the decoder mid-word. The sequence terminator (CTC blank / attention <eos>) is always kept so that decoding still terminates. The whitelist works with every recognition architecture and with any predictor wrapping one (ocr_predictor, kie_predictor, recognition_predictor).

Two strategies are available:

  • “mask” (default): the logits of forbidden characters are set to -inf, so decoding falls back to the highest-scoring allowed character.

  • “nearest”: the score of each forbidden character is first reassigned to the closest allowed character (so e.g. ä folds onto a), then forbidden logits are masked. Forbidden characters without a mapping fall back to masking.

A whitelist can only restrict a model to characters it already knows: characters that are not part of the model’s own vocabulary are silently ignored.

>>> from doctr.datasets import VOCABS
>>> from doctr.models import ocr_predictor
>>> from doctr.models.utils import add_whitelist
>>> predictor = ocr_predictor(pretrained=True)
>>> handle = add_whitelist(predictor, [VOCABS["polish"], VOCABS["german"]])
>>> # ... run the predictor; only Polish/German characters can be predicted ...
>>> handle.remove()  # restore the original, unconstrained decoding
Parameters:
  • model – an ocr_predictor, kie_predictor, recognition_predictor, or a recognition model.

  • vocabs – a vocabulary string (e.g. VOCABS[“german”]) or an iterable of vocabulary strings (e.g. [VOCABS[“polish”], VOCABS[“german”]]) whose characters are allowed.

  • strategy – “mask” (default) to drop forbidden characters, or “nearest” to fold them onto the closest allowed character.

  • mapping – only used when strategy=“nearest”. None or “anyascii” builds the forbidden-to-allowed map by transliteration (the default); “weights” derives it from the projection weights (the model’s own confusions); a dict of {forbidden_char: allowed_char} overrides specific characters on top of the transliteration map.

  • verbose – if True, log how many characters were kept, forbidden and reassigned per model.

Returns:

a WhitelistHandle; call its remove() method to restore the original, unconstrained decoding.
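
The “mask” and “nearest” strategies described above can be illustrated on a single vector of logits. This is a minimal sketch, not docTR's implementation: the function apply_whitelist and the toy vocabulary are hypothetical, and folding a forbidden score onto its target by taking the maximum is an assumption, since the documentation does not say how the score is reassigned.

```python
NEG_INF = float("-inf")


def apply_whitelist(logits, vocab, allowed, strategy="mask", mapping=None):
    """Constrain one step of logits to the allowed characters.

    logits: one score per character of vocab, followed by one terminator score.
    Hypothetical helper, for illustration only (not a docTR function).
    """
    # Characters unknown to the model are ignored.
    allowed = set(allowed) & set(vocab)
    mapping = mapping or {}
    out = list(logits)
    for idx, char in enumerate(vocab):
        if char in allowed:
            continue
        target = mapping.get(char)
        if strategy == "nearest" and target in allowed:
            t = vocab.index(target)
            # Assumption: the allowed character inherits the better of the two scores.
            out[t] = max(out[t], logits[idx])
        # With or without a mapping, the forbidden character itself is masked.
        out[idx] = NEG_INF
    # The terminator (last position) is never touched, so decoding can still stop.
    return out


def argmax(values):
    return max(range(len(values)), key=values.__getitem__)


vocab = "aäb"                   # toy model vocabulary
logits = [1.0, 3.0, 2.0, 0.5]   # scores for a, ä, b, then the terminator

masked = apply_whitelist(logits, vocab, allowed="ab")
folded = apply_whitelist(logits, vocab, allowed="ab", strategy="nearest", mapping={"ä": "a"})
print(vocab[argmax(masked)], vocab[argmax(folded)])
```

With “mask”, the forbidden ä is dropped and decoding falls back to b, the best remaining score. With “nearest”, the score of ä is folded onto a, which then wins.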