# doctr.utils

This module groups non-core features that complement the rest of the package.

## Visualization

Easy-to-use functions to make sense of your model’s predictions.

doctr.utils.visualization.visualize_page(page: Dict[str, Any], image: ndarray, words_only: bool = True, display_artefacts: bool = True, scale: float = 10, interactive: bool = True, add_labels: bool = True, **kwargs: Any) → Figure

Visualize a full page with predicted blocks, lines and words

>>> import numpy as np
>>> import matplotlib.pyplot as plt
>>> from doctr.utils.visualization import visualize_page
>>> from doctr.models import ocr_db_crnn
>>> model = ocr_db_crnn(pretrained=True)
>>> input_page = (255 * np.random.rand(600, 800, 3)).astype(np.uint8)
>>> out = model([[input_page]])
>>> visualize_page(out[0].pages[0].export(), input_page)
>>> plt.show()

Parameters
• page – the exported Page of a Document

• image – NumPy array of the page; it must have the same shape as page['dimensions']

• words_only – whether only words should be displayed

• display_artefacts – whether artefacts should be displayed

• scale – figure size of the largest window side

• interactive – whether the plot should be interactive

• add_labels – for a static plot, add text labels on top of each bounding box

doctr.utils.visualization.synthesize_page(page: Dict[str, Any], draw_proba: bool = False, font_family: Optional[str] = None) → ndarray

Draw the content of the page element (OCR response) on a blank page.

Parameters
• page – exported Page object to represent

• draw_proba – if True, draw words in colors to represent confidence. Blue: p=1, red: p=0

• font_size – size of the font (default: 13)

• font_family – family of the font

Returns

the synthesized page
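
The colour convention used by draw_proba (blue for p=1, red for p=0) can be pictured with a small sketch. It assumes a linear interpolation between the two colours; the helper name `confidence_to_rgb` is hypothetical and not part of doctr, whose actual colour mapping may differ.

```python
from typing import Tuple


def confidence_to_rgb(p: float) -> Tuple[int, int, int]:
    """Map a confidence in [0, 1] to an RGB colour (hypothetical helper).

    Assumption: linear interpolation, pure red at p=0 and pure blue at p=1.
    """
    p = min(max(p, 0.0), 1.0)  # clamp out-of-range confidences
    red = round(255 * (1.0 - p))
    blue = round(255 * p)
    return (red, 0, blue)


print(confidence_to_rgb(0.0))  # (255, 0, 0)
print(confidence_to_rgb(1.0))  # (0, 0, 255)
```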

class doctr.utils.metrics.TextMatch

Implements text match metric (word-level accuracy) for recognition task.

The raw aggregated metric is computed as follows:

$$\forall X, Y \in \mathcal{W}^N, \quad TextMatch(X, Y) = \frac{1}{N} \sum\limits_{i=1}^N f_{Y_i}(X_i)$$

with the indicator function $f_{a}$ defined as:

$$\forall a, x \in \mathcal{W}, \quad f_a(x) = \left\{ \begin{array}{ll} 1 & \text{if } x = a \\ 0 & \text{otherwise.} \end{array} \right.$$

where $\mathcal{W}$ is the set of all possible character sequences and $N$ is a strictly positive integer.

>>> from doctr.utils import TextMatch
>>> metric = TextMatch()
>>> metric.update(['Hello', 'world'], ['hello', 'world'])
>>> metric.summary()

update(gt: List[str], pred: List[str]) → None

Update the state of the metric with new predictions

Parameters
• gt – list of ground-truth character sequences

• pred – list of predicted character sequences

summary() → Dict[str, float]

Computes the aggregated metrics

Returns

a dictionary with the exact match score for the raw data, its lower-case counterpart, its unidecode counterpart and its lower-case unidecode counterpart
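
The raw formula above, and its lower-case counterpart, can be sketched in plain Python. This is a minimal illustration rather than the library's implementation: the function name `text_match` and the dictionary keys are assumptions, and the unidecode variants are left out because they need a third-party package.

```python
from typing import Dict, List


def text_match(gt: List[str], pred: List[str]) -> Dict[str, float]:
    """Word-level accuracy between ground truths and predictions (sketch)."""
    if len(gt) != len(pred) or len(gt) == 0:
        raise ValueError("gt and pred must be non-empty and of equal length")
    n = len(gt)
    # f_a(x) = 1 if x == a else 0, averaged over the N pairs
    raw = sum(g == p for g, p in zip(gt, pred)) / n
    # Same indicator, applied after lower-casing both strings
    caseless = sum(g.lower() == p.lower() for g, p in zip(gt, pred)) / n
    return {"raw": raw, "caseless": caseless}


print(text_match(["Hello", "world"], ["hello", "world"]))
# {'raw': 0.5, 'caseless': 1.0}
```

Only the second pair matches exactly, so the raw score is 1/2, while both pairs match once case is ignored.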

class doctr.utils.metrics.LocalizationConfusion(iou_thresh: float = 0.5, use_polygons: bool = False, mask_shape: Tuple[int, int] = (1024, 1024), use_broadcasting: bool = True)

Implements common confusion metrics and mean IoU for localization evaluation.

The aggregated metrics are computed as follows:

$$\begin{split}\forall Y \in \mathcal{B}^N, \forall X \in \mathcal{B}^M, \\ Recall(X, Y) = \frac{1}{N} \sum\limits_{i=1}^N g_{X}(Y_i) \\ Precision(X, Y) = \frac{1}{M} \sum\limits_{i=1}^M g_{Y}(X_i) \\ meanIoU(X, Y) = \frac{1}{M} \sum\limits_{i=1}^M \max\limits_{j \in [1, N]} IoU(X_i, Y_j)\end{split}$$

with the function $IoU(x, y)$ being the Intersection over Union between bounding boxes $x$ and $y$, and the function $g_{X}$ defined, for any set of boxes $X$, as:

$$\forall y \in \mathcal{B}, g_X(y) = \left\{ \begin{array}{ll} 1 & \text{if } y\text{ has been assigned to any }(X_i)_i\text{ with an }IoU \geq 0.5 \\ 0 & \text{otherwise.} \end{array} \right.$$

where $\mathcal{B}$ is the set of possible bounding boxes, and $N$ (number of ground truths) and $M$ (number of predictions) are strictly positive integers. The threshold 0.5 is the default value of iou_thresh.

>>> import numpy as np
>>> from doctr.utils import LocalizationConfusion
>>> metric = LocalizationConfusion(iou_thresh=0.5)
>>> metric.update(np.asarray([[0, 0, 100, 100]]), np.asarray([[0, 0, 70, 70], [110, 95, 200, 150]]))
>>> metric.summary()

Parameters
• iou_thresh – minimum IoU to consider a pair of prediction and ground truth as a match

• use_polygons – if True, predictions and targets are expected in rotated format

• mask_shape – if use_polygons is True, the spatial shape of the image used

• use_broadcasting – if use_polygons is True, use broadcasting to compute IoU, at the cost of higher memory usage

update(gts: ndarray, preds: ndarray) → None

Parameters
• gts – a set of relative bounding boxes of shape (N, 4), or (N, 5) if they are rotated

• preds – a set of relative bounding boxes of shape (M, 4), or (M, 5) if they are rotated

summary()

Computes the aggregated metrics

Returns

a tuple with the recall, precision and meanIoU scores
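
To make these three scores concrete, here is a minimal pure-Python sketch for straight (non-rotated) boxes in (xmin, ymin, xmax, ymax) format. It is an illustration under stated assumptions, not the library's code: the names `box_iou` and `localization_scores` are hypothetical, and a box counts as matched when its best IoU reaches the threshold, which may differ from the assignment strategy doctr actually uses.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (xmin, ymin, xmax, ymax)


def box_iou(a: Box, b: Box) -> float:
    """Intersection over Union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def localization_scores(
    gts: List[Box], preds: List[Box], iou_thresh: float = 0.5
) -> Tuple[float, float, float]:
    """Return (recall, precision, mean IoU); assumes N > 0 and M > 0."""
    if not gts or not preds:
        raise ValueError("gts and preds must be non-empty")
    # Best overlap found for each ground truth, and for each prediction
    best_for_gt = [max(box_iou(g, p) for p in preds) for g in gts]
    best_for_pred = [max(box_iou(p, g) for g in gts) for p in preds]
    recall = sum(v >= iou_thresh for v in best_for_gt) / len(gts)
    precision = sum(v >= iou_thresh for v in best_for_pred) / len(preds)
    mean_iou = sum(best_for_pred) / len(preds)
    return recall, precision, mean_iou


# Same boxes as in the example above
recall, precision, mean_iou = localization_scores(
    [(0, 0, 100, 100)], [(0, 0, 70, 70), (110, 95, 200, 150)]
)
print(recall, precision, round(mean_iou, 3))  # 0.0 0.0 0.245
```

With these boxes the first prediction overlaps the ground truth with an IoU of 4900 / 10000 = 0.49, just under the threshold, and the second does not overlap at all, so recall and precision are both 0 while the mean IoU is (0.49 + 0) / 2.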

class doctr.utils.metrics.OCRMetric(iou_thresh: float = 0.5, use_polygons: bool = False, mask_shape: Tuple[int, int] = (1024, 1024), use_broadcasting: bool = True)

Implements an end-to-end OCR metric.

The aggregated metrics are computed as follows:

$$\begin{split}\forall (B, L) \in \mathcal{B}^N \times \mathcal{L}^N, \forall (\hat{B}, \hat{L}) \in \mathcal{B}^M \times \mathcal{L}^M, \\ Recall(B, \hat{B}, L, \hat{L}) = \frac{1}{N} \sum\limits_{i=1}^N h_{\hat{B},\hat{L}}(B_i, L_i) \\ Precision(B, \hat{B}, L, \hat{L}) = \frac{1}{M} \sum\limits_{i=1}^M h_{B,L}(\hat{B}_i, \hat{L}_i) \\ meanIoU(B, \hat{B}) = \frac{1}{M} \sum\limits_{i=1}^M \max\limits_{j \in [1, N]} IoU(\hat{B}_i, B_j)\end{split}$$

with the function $IoU(x, y)$ being the Intersection over Union between bounding boxes $x$ and $y$, and the function $h_{B, L}$ defined, for any set of boxes $B$ with labels $L$, as:

$$\forall (b, l) \in \mathcal{B} \times \mathcal{L}, h_{B,L}(b, l) = \left\{ \begin{array}{ll} 1 & \text{if } b\text{ has been assigned to a given }B_j\text{ with an } \\ & IoU \geq 0.5 \text{ and that for this assignment, } l = L_j\\ 0 & \text{otherwise.} \end{array} \right.$$

where $\mathcal{B}$ is the set of possible bounding boxes, $\mathcal{L}$ is the set of possible character sequences, and $N$ (number of ground truths) and $M$ (number of predictions) are strictly positive integers. The threshold 0.5 is the default value of iou_thresh.

>>> import numpy as np
>>> from doctr.utils import OCRMetric
>>> metric = OCRMetric(iou_thresh=0.5)
>>> metric.update(np.asarray([[0, 0, 100, 100]]), np.asarray([[0, 0, 70, 70], [110, 95, 200, 150]]),
...               ['hello'], ['hello', 'world'])
>>> metric.summary()

Parameters
• iou_thresh – minimum IoU to consider a pair of prediction and ground truth as a match

• use_polygons – if True, predictions and targets are expected in rotated format

• mask_shape – if use_polygons is True, the spatial shape of the image used

• use_broadcasting – if use_polygons is True, use broadcasting to compute IoU, at the cost of higher memory usage

update(gt_boxes: ndarray, pred_boxes: ndarray, gt_labels: List[str], pred_labels: List[str]) → None

Parameters
• gt_boxes – a set of relative bounding boxes of shape (N, 4), or (N, 5) if they are rotated

• pred_boxes – a set of relative bounding boxes of shape (M, 4), or (M, 5) if they are rotated

• gt_labels – a list of N string labels

• pred_labels – a list of M string labels

summary() → Tuple[Dict[str, Optional[float]], Dict[str, Optional[float]], Optional[float]]

Computes the aggregated metrics

Returns

a tuple with the recall & precision for each string comparison and the mean IoU
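
The end-to-end criterion (a box must be matched and its label must agree) can be sketched as follows for straight boxes and exact string comparison only. This is an illustration under assumptions: `end_to_end_scores` and `box_iou` are hypothetical names, and the greedy one-to-one matching by decreasing IoU stands in for whatever assignment method the library uses.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (xmin, ymin, xmax, ymax)


def box_iou(a: Box, b: Box) -> float:
    """Intersection over Union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def end_to_end_scores(
    gt_boxes: List[Box],
    pred_boxes: List[Box],
    gt_labels: List[str],
    pred_labels: List[str],
    iou_thresh: float = 0.5,
) -> Tuple[float, float]:
    """Return (recall, precision) for exact label matches (sketch)."""
    if not gt_boxes or not pred_boxes:
        raise ValueError("gt_boxes and pred_boxes must be non-empty")
    # Every (IoU, gt index, pred index) candidate, best overlaps first
    pairs = sorted(
        ((box_iou(g, p), i, j)
         for i, g in enumerate(gt_boxes)
         for j, p in enumerate(pred_boxes)),
        reverse=True,
    )
    used_gt, used_pred = set(), set()
    matches = 0
    for iou, i, j in pairs:
        if iou < iou_thresh:
            break  # remaining pairs overlap even less
        if i in used_gt or j in used_pred:
            continue  # assumption: each box is assigned at most once
        used_gt.add(i)
        used_pred.add(j)
        if gt_labels[i] == pred_labels[j]:
            matches += 1
    return matches / len(gt_boxes), matches / len(pred_boxes)


print(end_to_end_scores(
    [(0, 0, 100, 100)], [(0, 0, 90, 100), (110, 95, 200, 150)],
    ["hello"], ["hello", "world"],
))  # (1.0, 0.5)
```

Here the first prediction overlaps the ground truth with an IoU of 0.9 and carries the same label, so the single ground truth is recovered (recall 1.0) while only one of the two predictions is correct (precision 0.5).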

class doctr.utils.metrics.DetectionMetric(iou_thresh: float = 0.5, use_polygons: bool = False, mask_shape: Tuple[int, int] = (1024, 1024), use_broadcasting: bool = True)

Implements an object detection metric.

The aggregated metrics are computed as follows:

$$\begin{split}\forall (B, C) \in \mathcal{B}^N \times \mathcal{C}^N, \forall (\hat{B}, \hat{C}) \in \mathcal{B}^M \times \mathcal{C}^M, \\ Recall(B, \hat{B}, C, \hat{C}) = \frac{1}{N} \sum\limits_{i=1}^N h_{\hat{B},\hat{C}}(B_i, C_i) \\ Precision(B, \hat{B}, C, \hat{C}) = \frac{1}{M} \sum\limits_{i=1}^M h_{B,C}(\hat{B}_i, \hat{C}_i) \\ meanIoU(B, \hat{B}) = \frac{1}{M} \sum\limits_{i=1}^M \max\limits_{j \in [1, N]} IoU(\hat{B}_i, B_j)\end{split}$$

with the function $IoU(x, y)$ being the Intersection over Union between bounding boxes $x$ and $y$, and the function $h_{B, C}$ defined, for any set of boxes $B$ with class indices $C$, as:

$$\forall (b, c) \in \mathcal{B} \times \mathcal{C}, h_{B,C}(b, c) = \left\{ \begin{array}{ll} 1 & \text{if } b\text{ has been assigned to a given }B_j\text{ with an } \\ & IoU \geq 0.5 \text{ and that for this assignment, } c = C_j\\ 0 & \text{otherwise.} \end{array} \right.$$

where $\mathcal{B}$ is the set of possible bounding boxes, $\mathcal{C}$ is the set of possible class indices, and $N$ (number of ground truths) and $M$ (number of predictions) are strictly positive integers. The threshold 0.5 is the default value of iou_thresh.

>>> import numpy as np
>>> from doctr.utils import DetectionMetric
>>> metric = DetectionMetric(iou_thresh=0.5)
>>> metric.update(np.asarray([[0, 0, 100, 100]]), np.asarray([[0, 0, 70, 70], [110, 95, 200, 150]]),
...               np.zeros(1, dtype=np.int64), np.array([0, 1], dtype=np.int64))
>>> metric.summary()

Parameters
• iou_thresh – minimum IoU to consider a pair of prediction and ground truth as a match

• use_polygons – if True, predictions and targets are expected in rotated format

• mask_shape – if use_polygons is True, the spatial shape of the image used

• use_broadcasting – if use_polygons is True, use broadcasting to compute IoU, at the cost of higher memory usage

update(gt_boxes: ndarray, pred_boxes: ndarray, gt_labels: ndarray, pred_labels: ndarray) → None

Parameters
• gt_boxes – a set of relative bounding boxes of shape (N, 4), or (N, 5) if they are rotated

• pred_boxes – a set of relative bounding boxes of shape (M, 4), or (M, 5) if they are rotated

• gt_labels – an array of class indices of shape (N,)

• pred_labels – an array of class indices of shape (M,)

summary()

Computes the aggregated metrics

Returns

a tuple with the recall & precision for each class prediction and the mean IoU