doctr.models
doctr.models.classification
- doctr.models.classification.vgg16_bn_r(pretrained: bool = False, **kwargs: Any) -> VGG
VGG-16 architecture as described in “Very Deep Convolutional Networks for Large-Scale Image Recognition”, modified by adding batch normalization, rectangular pooling and a simpler classification head.
>>> import tensorflow as tf
>>> from doctr.models import vgg16_bn_r
>>> model = vgg16_bn_r(pretrained=False)
>>> input_tensor = tf.random.uniform(shape=[1, 512, 512, 3], maxval=1, dtype=tf.float32)
>>> out = model(input_tensor)
- Parameters:
pretrained (bool) – If True, returns a model pre-trained on ImageNet
- Returns:
VGG feature extractor
- doctr.models.classification.resnet18(pretrained: bool = False, **kwargs: Any) -> ResNet
ResNet-18 architecture as described in “Deep Residual Learning for Image Recognition”.
>>> import tensorflow as tf
>>> from doctr.models import resnet18
>>> model = resnet18(pretrained=False)
>>> input_tensor = tf.random.uniform(shape=[1, 512, 512, 3], maxval=1, dtype=tf.float32)
>>> out = model(input_tensor)
- Parameters:
pretrained – boolean, True if model is pretrained
- Returns:
A classification model
- doctr.models.classification.resnet34(pretrained: bool = False, **kwargs: Any) -> ResNet
ResNet-34 architecture as described in “Deep Residual Learning for Image Recognition”.
>>> import tensorflow as tf
>>> from doctr.models import resnet34
>>> model = resnet34(pretrained=False)
>>> input_tensor = tf.random.uniform(shape=[1, 512, 512, 3], maxval=1, dtype=tf.float32)
>>> out = model(input_tensor)
- Parameters:
pretrained – boolean, True if model is pretrained
- Returns:
A classification model
- doctr.models.classification.resnet50(pretrained: bool = False, **kwargs: Any) -> ResNet
ResNet-50 architecture as described in “Deep Residual Learning for Image Recognition”.
>>> import tensorflow as tf
>>> from doctr.models import resnet50
>>> model = resnet50(pretrained=False)
>>> input_tensor = tf.random.uniform(shape=[1, 512, 512, 3], maxval=1, dtype=tf.float32)
>>> out = model(input_tensor)
- Parameters:
pretrained – boolean, True if model is pretrained
- Returns:
A classification model
- doctr.models.classification.resnet31(pretrained: bool = False, **kwargs: Any) -> ResNet
ResNet-31 architecture with rectangular pooling windows as described in “Show, Attend and Read: A Simple and Strong Baseline for Irregular Text Recognition”. Downsizing: (H, W) -> (H/8, W/4)
>>> import tensorflow as tf
>>> from doctr.models import resnet31
>>> model = resnet31(pretrained=False)
>>> input_tensor = tf.random.uniform(shape=[1, 512, 512, 3], maxval=1, dtype=tf.float32)
>>> out = model(input_tensor)
- Parameters:
pretrained – boolean, True if model is pretrained
- Returns:
A classification model
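As a quick check of the downsizing rule stated for resnet31, the sketch below computes the spatial size implied by (H, W) -> (H/8, W/4). The helper name `downsized_shape` is hypothetical and not part of docTR, and flooring sizes that are not exact multiples is an assumption.

```python
from typing import Tuple


def downsized_shape(height: int, width: int) -> Tuple[int, int]:
    """Spatial size implied by the (H, W) -> (H/8, W/4) rule (hypothetical helper)."""
    # Assumption: sizes that are not exact multiples of 8 or 4 are floored.
    return height // 8, width // 4


print(downsized_shape(512, 512))  # (64, 128)
print(downsized_shape(32, 128))   # (4, 32)
```

The height shrinks twice as fast as the width, which is what "rectangular pooling" refers to: more horizontal resolution is kept for wide text crops.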
- doctr.models.classification.mobilenet_v3_small(pretrained: bool = False, **kwargs: Any) -> MobileNetV3
MobileNetV3-Small architecture as described in “Searching for MobileNetV3”.
>>> import tensorflow as tf
>>> from doctr.models import mobilenet_v3_small
>>> model = mobilenet_v3_small(pretrained=False)
>>> input_tensor = tf.random.uniform(shape=[1, 512, 512, 3], maxval=1, dtype=tf.float32)
>>> out = model(input_tensor)
- Parameters:
pretrained – boolean, True if model is pretrained
- Returns:
a keras.Model
- doctr.models.classification.mobilenet_v3_large(pretrained: bool = False, **kwargs: Any) -> MobileNetV3
MobileNetV3-Large architecture as described in “Searching for MobileNetV3”.
>>> import tensorflow as tf
>>> from doctr.models import mobilenet_v3_large
>>> model = mobilenet_v3_large(pretrained=False)
>>> input_tensor = tf.random.uniform(shape=[1, 512, 512, 3], maxval=1, dtype=tf.float32)
>>> out = model(input_tensor)
- Parameters:
pretrained – boolean, True if model is pretrained
- Returns:
a keras.Model
- doctr.models.classification.mobilenet_v3_small_r(pretrained: bool = False, **kwargs: Any) -> MobileNetV3
MobileNetV3-Small architecture as described in “Searching for MobileNetV3”, with rectangular pooling.
>>> import tensorflow as tf
>>> from doctr.models import mobilenet_v3_small_r
>>> model = mobilenet_v3_small_r(pretrained=False)
>>> input_tensor = tf.random.uniform(shape=[1, 512, 512, 3], maxval=1, dtype=tf.float32)
>>> out = model(input_tensor)
- Parameters:
pretrained – boolean, True if model is pretrained
- Returns:
a keras.Model
- doctr.models.classification.mobilenet_v3_large_r(pretrained: bool = False, **kwargs: Any) -> MobileNetV3
MobileNetV3-Large architecture as described in “Searching for MobileNetV3”, with rectangular pooling.
>>> import tensorflow as tf
>>> from doctr.models import mobilenet_v3_large_r
>>> model = mobilenet_v3_large_r(pretrained=False)
>>> input_tensor = tf.random.uniform(shape=[1, 512, 512, 3], maxval=1, dtype=tf.float32)
>>> out = model(input_tensor)
- Parameters:
pretrained – boolean, True if model is pretrained
- Returns:
a keras.Model
- doctr.models.classification.mobilenet_v3_small_orientation(pretrained: bool = False, **kwargs: Any) -> MobileNetV3
MobileNetV3-Small architecture as described in “Searching for MobileNetV3”.
>>> import tensorflow as tf
>>> from doctr.models import mobilenet_v3_small_orientation
>>> model = mobilenet_v3_small_orientation(pretrained=False)
>>> input_tensor = tf.random.uniform(shape=[1, 512, 512, 3], maxval=1, dtype=tf.float32)
>>> out = model(input_tensor)
- Parameters:
pretrained – boolean, True if model is pretrained
- Returns:
a keras.Model
- doctr.models.classification.magc_resnet31(pretrained: bool = False, **kwargs: Any) -> ResNet
ResNet-31 architecture with Multi-Aspect Global Context Attention as described in “MASTER: Multi-Aspect Non-local Network for Scene Text Recognition”.
>>> import tensorflow as tf
>>> from doctr.models import magc_resnet31
>>> model = magc_resnet31(pretrained=False)
>>> input_tensor = tf.random.uniform(shape=[1, 224, 224, 3], maxval=1, dtype=tf.float32)
>>> out = model(input_tensor)
- Parameters:
pretrained – boolean, True if model is pretrained
- Returns:
A feature extractor model
- doctr.models.classification.vit_s(pretrained: bool = False, **kwargs: Any) -> VisionTransformer
VisionTransformer-S architecture as described in “An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale”. Patches: (H, W) -> (H/8, W/8)
NOTE: unofficial config used in ViTSTR and ParSeq
>>> import tensorflow as tf
>>> from doctr.models import vit_s
>>> model = vit_s(pretrained=False)
>>> input_tensor = tf.random.uniform(shape=[1, 32, 32, 3], maxval=1, dtype=tf.float32)
>>> out = model(input_tensor)
- Parameters:
pretrained – boolean, True if model is pretrained
- Returns:
A feature extractor model
- doctr.models.classification.vit_b(pretrained: bool = False, **kwargs: Any) -> VisionTransformer
VisionTransformer-B architecture as described in “An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale”. Patches: (H, W) -> (H/8, W/8)
>>> import tensorflow as tf
>>> from doctr.models import vit_b
>>> model = vit_b(pretrained=False)
>>> input_tensor = tf.random.uniform(shape=[1, 32, 32, 3], maxval=1, dtype=tf.float32)
>>> out = model(input_tensor)
- Parameters:
pretrained – boolean, True if model is pretrained
- Returns:
A feature extractor model
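The "Patches: (H, W) -> (H/8, W/8)" note on the two ViT entries can be read as a patch grid. The sketch below works out that grid and the resulting number of patches for a given input size. `patch_grid` is a hypothetical helper, the patch size of 8 is taken from the note rather than from the model code, and any extra tokens a model may prepend (such as a class token) are not counted.

```python
from typing import Tuple


def patch_grid(height: int, width: int, patch: int = 8) -> Tuple[int, int, int]:
    """Return (rows, cols, number of patches) for a non-overlapping patch grid."""
    # Assumption: the input is split into patch x patch tiles, flooring any remainder.
    rows, cols = height // patch, width // patch
    return rows, cols, rows * cols


print(patch_grid(32, 32))   # (4, 4, 16)
print(patch_grid(32, 128))  # (4, 16, 64)
```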
- doctr.models.classification.crop_orientation_predictor(arch: str = 'mobilenet_v3_small_orientation', pretrained: bool = False, **kwargs: Any) -> CropOrientationPredictor
Orientation classification architecture.
>>> import numpy as np
>>> from doctr.models import crop_orientation_predictor
>>> model = crop_orientation_predictor(arch='mobilenet_v3_small_orientation', pretrained=True)
>>> input_crop = (255 * np.random.rand(600, 800, 3)).astype(np.uint8)
>>> out = model([input_crop])
- Parameters:
arch – name of the architecture to use (e.g. ‘mobilenet_v3_small_orientation’)
pretrained – If True, returns a model pre-trained on our recognition crops dataset
- Returns:
CropOrientationPredictor
doctr.models.detection
- doctr.models.detection.linknet_resnet18(pretrained: bool = False, **kwargs: Any) -> LinkNet
LinkNet as described in “LinkNet: Exploiting Encoder Representations for Efficient Semantic Segmentation”.
>>> import tensorflow as tf
>>> from doctr.models import linknet_resnet18
>>> model = linknet_resnet18(pretrained=True)
>>> input_tensor = tf.random.uniform(shape=[1, 1024, 1024, 3], maxval=1, dtype=tf.float32)
>>> out = model(input_tensor)
- Parameters:
pretrained (bool) – If True, returns a model pre-trained on our text detection dataset
- Returns:
text detection architecture
- doctr.models.detection.linknet_resnet18_rotation(pretrained: bool = False, **kwargs: Any) -> LinkNet
LinkNet as described in “LinkNet: Exploiting Encoder Representations for Efficient Semantic Segmentation”.
>>> import tensorflow as tf
>>> from doctr.models import linknet_resnet18_rotation
>>> model = linknet_resnet18_rotation(pretrained=True)
>>> input_tensor = tf.random.uniform(shape=[1, 1024, 1024, 3], maxval=1, dtype=tf.float32)
>>> out = model(input_tensor)
- Parameters:
pretrained (bool) – If True, returns a model pre-trained on our text detection dataset
- Returns:
text detection architecture
- doctr.models.detection.linknet_resnet34(pretrained: bool = False, **kwargs: Any) -> LinkNet
LinkNet as described in “LinkNet: Exploiting Encoder Representations for Efficient Semantic Segmentation”.
>>> import tensorflow as tf
>>> from doctr.models import linknet_resnet34
>>> model = linknet_resnet34(pretrained=True)
>>> input_tensor = tf.random.uniform(shape=[1, 1024, 1024, 3], maxval=1, dtype=tf.float32)
>>> out = model(input_tensor)
- Parameters:
pretrained (bool) – If True, returns a model pre-trained on our text detection dataset
- Returns:
text detection architecture
- doctr.models.detection.linknet_resnet50(pretrained: bool = False, **kwargs: Any) -> LinkNet
LinkNet as described in “LinkNet: Exploiting Encoder Representations for Efficient Semantic Segmentation”.
>>> import tensorflow as tf
>>> from doctr.models import linknet_resnet50
>>> model = linknet_resnet50(pretrained=True)
>>> input_tensor = tf.random.uniform(shape=[1, 1024, 1024, 3], maxval=1, dtype=tf.float32)
>>> out = model(input_tensor)
- Parameters:
pretrained (bool) – If True, returns a model pre-trained on our text detection dataset
- Returns:
text detection architecture
- doctr.models.detection.db_resnet50(pretrained: bool = False, **kwargs: Any) -> DBNet
DBNet as described in “Real-time Scene Text Detection with Differentiable Binarization”, using a ResNet-50 backbone.
>>> import tensorflow as tf
>>> from doctr.models import db_resnet50
>>> model = db_resnet50(pretrained=True)
>>> input_tensor = tf.random.uniform(shape=[1, 1024, 1024, 3], maxval=1, dtype=tf.float32)
>>> out = model(input_tensor)
- Parameters:
pretrained (bool) – If True, returns a model pre-trained on our text detection dataset
- Returns:
text detection architecture
- doctr.models.detection.db_mobilenet_v3_large(pretrained: bool = False, **kwargs: Any) -> DBNet
DBNet as described in “Real-time Scene Text Detection with Differentiable Binarization”, using a MobileNet V3 Large backbone.
>>> import tensorflow as tf
>>> from doctr.models import db_mobilenet_v3_large
>>> model = db_mobilenet_v3_large(pretrained=True)
>>> input_tensor = tf.random.uniform(shape=[1, 1024, 1024, 3], maxval=1, dtype=tf.float32)
>>> out = model(input_tensor)
- Parameters:
pretrained (bool) – If True, returns a model pre-trained on our text detection dataset
- Returns:
text detection architecture
- doctr.models.detection.detection_predictor(arch: Any = 'db_resnet50', pretrained: bool = False, assume_straight_pages: bool = True, **kwargs: Any) -> DetectionPredictor
Text detection architecture.
>>> import numpy as np
>>> from doctr.models import detection_predictor
>>> model = detection_predictor(arch='db_resnet50', pretrained=True)
>>> input_page = (255 * np.random.rand(600, 800, 3)).astype(np.uint8)
>>> out = model([input_page])
- Parameters:
arch – name of the architecture or model itself to use (e.g. ‘db_resnet50’)
pretrained – If True, returns a model pre-trained on our text detection dataset
assume_straight_pages – If True, fit straight boxes to the page
- Returns:
Detection predictor
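To illustrate what a "straight" box means compared with a rotated one, the sketch below fits the smallest axis-aligned box around a four-point polygon. This is a generic geometric illustration, not docTR's post-processing code; the `straight_box` helper and the use of relative coordinates in [0, 1] are assumptions made for the example.

```python
from typing import Sequence, Tuple

Point = Tuple[float, float]


def straight_box(polygon: Sequence[Point]) -> Tuple[float, float, float, float]:
    """Smallest axis-aligned (xmin, ymin, xmax, ymax) box enclosing a polygon."""
    # Separate the x and y coordinates, then take their extremes.
    xs, ys = zip(*polygon)
    return min(xs), min(ys), max(xs), max(ys)


# A slightly rotated quadrilateral, given as (x, y) corner points.
rotated = [(0.30, 0.10), (0.70, 0.20), (0.65, 0.40), (0.25, 0.30)]
print(straight_box(rotated))  # (0.25, 0.1, 0.7, 0.4)
```

A straight box needs only four numbers, while a rotated box needs all of its corners, which is one reason assuming straight pages can simplify downstream processing.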
doctr.models.recognition
- doctr.models.recognition.crnn_vgg16_bn(pretrained: bool = False, **kwargs: Any) -> CRNN
CRNN with a VGG-16 backbone as described in “An End-to-End Trainable Neural Network for Image-based Sequence Recognition and Its Application to Scene Text Recognition”.
>>> import tensorflow as tf
>>> from doctr.models import crnn_vgg16_bn
>>> model = crnn_vgg16_bn(pretrained=True)
>>> input_tensor = tf.random.uniform(shape=[1, 32, 128, 3], maxval=1, dtype=tf.float32)
>>> out = model(input_tensor)
- Parameters:
pretrained (bool) – If True, returns a model pre-trained on our text recognition dataset
- Returns:
text recognition architecture
- doctr.models.recognition.crnn_mobilenet_v3_small(pretrained: bool = False, **kwargs: Any) -> CRNN
CRNN with a MobileNet V3 Small backbone as described in “An End-to-End Trainable Neural Network for Image-based Sequence Recognition and Its Application to Scene Text Recognition”.
>>> import tensorflow as tf
>>> from doctr.models import crnn_mobilenet_v3_small
>>> model = crnn_mobilenet_v3_small(pretrained=True)
>>> input_tensor = tf.random.uniform(shape=[1, 32, 128, 3], maxval=1, dtype=tf.float32)
>>> out = model(input_tensor)
- Parameters:
pretrained (bool) – If True, returns a model pre-trained on our text recognition dataset
- Returns:
text recognition architecture
- doctr.models.recognition.crnn_mobilenet_v3_large(pretrained: bool = False, **kwargs: Any) -> CRNN
CRNN with a MobileNet V3 Large backbone as described in “An End-to-End Trainable Neural Network for Image-based Sequence Recognition and Its Application to Scene Text Recognition”.
>>> import tensorflow as tf
>>> from doctr.models import crnn_mobilenet_v3_large
>>> model = crnn_mobilenet_v3_large(pretrained=True)
>>> input_tensor = tf.random.uniform(shape=[1, 32, 128, 3], maxval=1, dtype=tf.float32)
>>> out = model(input_tensor)
- Parameters:
pretrained (bool) – If True, returns a model pre-trained on our text recognition dataset
- Returns:
text recognition architecture
- doctr.models.recognition.sar_resnet31(pretrained: bool = False, **kwargs: Any) -> SAR
SAR with a ResNet-31 feature extractor as described in “Show, Attend and Read: A Simple and Strong Baseline for Irregular Text Recognition”.
>>> import tensorflow as tf
>>> from doctr.models import sar_resnet31
>>> model = sar_resnet31(pretrained=False)
>>> input_tensor = tf.random.uniform(shape=[1, 64, 256, 3], maxval=1, dtype=tf.float32)
>>> out = model(input_tensor)
- Parameters:
pretrained (bool) – If True, returns a model pre-trained on our text recognition dataset
- Returns:
text recognition architecture
- doctr.models.recognition.master(pretrained: bool = False, **kwargs: Any) -> MASTER
MASTER as described in “MASTER: Multi-Aspect Non-local Network for Scene Text Recognition” (https://arxiv.org/pdf/1910.02562.pdf).
>>> import tensorflow as tf
>>> from doctr.models import master
>>> model = master(pretrained=False)
>>> input_tensor = tf.random.uniform(shape=[1, 32, 128, 3], maxval=1, dtype=tf.float32)
>>> out = model(input_tensor)
- Parameters:
pretrained (bool) – If True, returns a model pre-trained on our text recognition dataset
- Returns:
text recognition architecture
- doctr.models.recognition.vitstr_small(pretrained: bool = False, **kwargs: Any) -> ViTSTR
ViTSTR-Small as described in “Vision Transformer for Fast and Efficient Scene Text Recognition”.
>>> import tensorflow as tf
>>> from doctr.models import vitstr_small
>>> model = vitstr_small(pretrained=False)
>>> input_tensor = tf.random.uniform(shape=[1, 32, 128, 3], maxval=1, dtype=tf.float32)
>>> out = model(input_tensor)
- Parameters:
pretrained (bool) – If True, returns a model pre-trained on our text recognition dataset
- Returns:
text recognition architecture
- doctr.models.recognition.vitstr_base(pretrained: bool = False, **kwargs: Any) -> ViTSTR
ViTSTR-Base as described in “Vision Transformer for Fast and Efficient Scene Text Recognition”.
>>> import tensorflow as tf
>>> from doctr.models import vitstr_base
>>> model = vitstr_base(pretrained=False)
>>> input_tensor = tf.random.uniform(shape=[1, 32, 128, 3], maxval=1, dtype=tf.float32)
>>> out = model(input_tensor)
- Parameters:
pretrained (bool) – If True, returns a model pre-trained on our text recognition dataset
- Returns:
text recognition architecture
- doctr.models.recognition.parseq(pretrained: bool = False, **kwargs: Any) -> PARSeq
PARSeq architecture from “Scene Text Recognition with Permuted Autoregressive Sequence Models”.
>>> import tensorflow as tf
>>> from doctr.models import parseq
>>> model = parseq(pretrained=False)
>>> input_tensor = tf.random.uniform(shape=[1, 32, 128, 3], maxval=1, dtype=tf.float32)
>>> out = model(input_tensor)
- Parameters:
pretrained (bool) – If True, returns a model pre-trained on our text recognition dataset
- Returns:
text recognition architecture
- doctr.models.recognition.recognition_predictor(arch: Any = 'crnn_vgg16_bn', pretrained: bool = False, **kwargs: Any) -> RecognitionPredictor
Text recognition architecture.
>>> import numpy as np
>>> from doctr.models import recognition_predictor
>>> model = recognition_predictor(pretrained=True)
>>> input_page = (255 * np.random.rand(32, 128, 3)).astype(np.uint8)
>>> out = model([input_page])
- Parameters:
arch – name of the architecture or model itself to use (e.g. ‘crnn_vgg16_bn’)
pretrained – If True, returns a model pre-trained on our text recognition dataset
- Returns:
Recognition predictor
doctr.models.zoo
- doctr.models.ocr_predictor(det_arch: Any = 'db_resnet50', reco_arch: Any = 'crnn_vgg16_bn', pretrained: bool = False, pretrained_backbone: bool = True, assume_straight_pages: bool = True, preserve_aspect_ratio: bool = True, symmetric_pad: bool = True, export_as_straight_boxes: bool = False, detect_orientation: bool = False, detect_language: bool = False, **kwargs: Any) -> OCRPredictor
End-to-end OCR architecture using one model for localization, and another for text recognition.
>>> import numpy as np
>>> from doctr.models import ocr_predictor
>>> model = ocr_predictor('db_resnet50', 'crnn_vgg16_bn', pretrained=True)
>>> input_page = (255 * np.random.rand(600, 800, 3)).astype(np.uint8)
>>> out = model([input_page])
- Parameters:
det_arch – name of the detection architecture or the model itself to use (e.g. ‘db_resnet50’, ‘db_mobilenet_v3_large’)
reco_arch – name of the recognition architecture or the model itself to use (e.g. ‘crnn_vgg16_bn’, ‘sar_resnet31’)
pretrained – If True, returns a model pre-trained on our OCR dataset
pretrained_backbone – If True, returns a model with a pretrained backbone
assume_straight_pages – if True, speeds up the inference by assuming you only pass straight pages without rotated textual elements.
preserve_aspect_ratio – If True, pad the input document image to preserve the aspect ratio before running the detection model on it.
symmetric_pad – if True, pad the image symmetrically instead of padding at the bottom-right.
export_as_straight_boxes – when assume_straight_pages is set to False, export final predictions (potentially rotated) as straight bounding boxes.
detect_orientation – if True, the estimated general page orientation will be added to the predictions for each page. Doing so will slightly deteriorate the overall latency.
detect_language – if True, the language prediction will be added to the predictions for each page. Doing so will slightly deteriorate the overall latency.
kwargs – keyword args of OCRPredictor
- Returns:
OCR predictor
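The interaction between preserve_aspect_ratio and symmetric_pad can be sketched with plain arithmetic. The example below computes the resized page size and the padding on each side needed to reach a target size. It is an illustration under stated assumptions, not docTR's preprocessing code: `resize_and_pad` is a hypothetical helper, and the 1024 x 1024 target is borrowed from the detection examples rather than from any documented default.

```python
from typing import Tuple


def resize_and_pad(
    src: Tuple[int, int], dst: Tuple[int, int], symmetric_pad: bool = True
) -> Tuple[Tuple[int, int], Tuple[int, int, int, int]]:
    """Return the resized (h, w) and the (top, bottom, left, right) padding."""
    (h, w), (out_h, out_w) = src, dst
    # Scale by the tighter dimension so the whole page fits inside the target.
    scale = min(out_h / h, out_w / w)
    new_h = min(out_h, round(h * scale))
    new_w = min(out_w, round(w * scale))
    pad_h, pad_w = out_h - new_h, out_w - new_w
    if symmetric_pad:
        # Split the padding evenly; any odd pixel goes to the bottom/right.
        top, left = pad_h // 2, pad_w // 2
    else:
        # All padding goes to the bottom and right.
        top, left = 0, 0
    return (new_h, new_w), (top, pad_h - top, left, pad_w - left)


print(resize_and_pad((600, 800), (1024, 1024)))
# ((768, 1024), (128, 128, 0, 0))
print(resize_and_pad((600, 800), (1024, 1024), symmetric_pad=False))
# ((768, 1024), (0, 256, 0, 0))
```

In both cases the 600 x 800 page keeps its 3:4 aspect ratio; only the placement of the padded region changes.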
- doctr.models.kie_predictor(det_arch: Any = 'db_resnet50', reco_arch: Any = 'crnn_vgg16_bn', pretrained: bool = False, pretrained_backbone: bool = True, assume_straight_pages: bool = True, preserve_aspect_ratio: bool = True, symmetric_pad: bool = True, export_as_straight_boxes: bool = False, detect_orientation: bool = False, detect_language: bool = False, **kwargs: Any) -> KIEPredictor
End-to-end KIE architecture using one model for localization, and another for text recognition.
>>> import numpy as np
>>> from doctr.models import kie_predictor
>>> model = kie_predictor('db_resnet50', 'crnn_vgg16_bn', pretrained=True)
>>> input_page = (255 * np.random.rand(600, 800, 3)).astype(np.uint8)
>>> out = model([input_page])
- Parameters:
det_arch – name of the detection architecture or the model itself to use (e.g. ‘db_resnet50’, ‘db_mobilenet_v3_large’)
reco_arch – name of the recognition architecture or the model itself to use (e.g. ‘crnn_vgg16_bn’, ‘sar_resnet31’)
pretrained – If True, returns a model pre-trained on our OCR dataset
pretrained_backbone – If True, returns a model with a pretrained backbone
assume_straight_pages – if True, speeds up the inference by assuming you only pass straight pages without rotated textual elements.
preserve_aspect_ratio – If True, pad the input document image to preserve the aspect ratio before running the detection model on it.
symmetric_pad – if True, pad the image symmetrically instead of padding at the bottom-right.
export_as_straight_boxes – when assume_straight_pages is set to False, export final predictions (potentially rotated) as straight bounding boxes.
detect_orientation – if True, the estimated general page orientation will be added to the predictions for each page. Doing so will slightly deteriorate the overall latency.
detect_language – if True, the language prediction will be added to the predictions for each page. Doing so will slightly deteriorate the overall latency.
kwargs – keyword args of KIEPredictor
- Returns:
KIE predictor
doctr.models.factory
- doctr.models.factory.from_hub(repo_id: str, **kwargs: Any)
Instantiate and load a pretrained model from the HF hub.
>>> from doctr.models import from_hub
>>> model = from_hub("mindee/fasterrcnn_mobilenet_v3_large_fpn")
- Parameters:
repo_id – HuggingFace model hub repo
kwargs – kwargs of hf_hub_download or snapshot_download
- Returns:
Model loaded with the checkpoint
- doctr.models.factory.push_to_hf_hub(model: Any, model_name: str, task: str, **kwargs) -> None
Save a model and its configuration on the HF hub.
>>> from doctr.models import login_to_hub, push_to_hf_hub
>>> from doctr.models.recognition import crnn_mobilenet_v3_small
>>> login_to_hub()
>>> model = crnn_mobilenet_v3_small(pretrained=True)
>>> push_to_hf_hub(model, 'my-model', 'recognition', arch='crnn_mobilenet_v3_small')
- Parameters:
model – TF or PyTorch model to be saved
model_name – name of the model which is also the repository name
task – task name
**kwargs – keyword arguments for push_to_hf_hub