doctr.utils

This module gathers non-core features that complement the rest of the package.

Visualization

Easy-to-use functions to make sense of your model’s predictions.

doctr.utils.visualization.visualize_page(page: Dict[str, Any], image: ndarray, words_only: bool = True, display_artefacts: bool = True, scale: float = 10, interactive: bool = True, add_labels: bool = True, **kwargs: Any) → Figure[source]

Visualize a full page with predicted blocks, lines and words

>>> import numpy as np
>>> import matplotlib.pyplot as plt
>>> from doctr.utils.visualization import visualize_page
>>> from doctr.models import ocr_db_crnn
>>> model = ocr_db_crnn(pretrained=True)
>>> input_page = (255 * np.random.rand(600, 800, 3)).astype(np.uint8)
>>> out = model([[input_page]])
>>> visualize_page(out[0].pages[0].export(), input_page)
>>> plt.show()
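The figure is meant to be explored: with interactive=True, the rendered elements can typically be inspected on hover, while interactive=False combined with add_labels=True produces a static plot with the text drawn on top of each bounding box instead.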

Args:

page: the exported Page of a Document
image: np array of the page, which needs to have the same shape as page['dimensions']
words_only: whether only words should be displayed
display_artefacts: whether artefacts should be displayed
scale: figsize of the largest window side
interactive: whether the plot should be interactive
add_labels: for a static plot, adds text labels on top of the bounding boxes
**kwargs: keyword arguments for the polygon patch

Returns:

the matplotlib figure

doctr.utils.visualization.synthesize_page(page: Dict[str, Any], draw_proba: bool = False, font_family: str | None = None) → ndarray[source]

Draw the content of the page element (an OCR response) on a blank page.

Args:

page: exported Page object to represent
draw_proba: if True, draw words in colors that represent confidence (blue: p=1, red: p=0)
font_family: family of the font

Returns:

the synthesized page
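A minimal usage sketch, mirroring the visualize_page example above (the ocr_db_crnn model is only an illustrative choice):

>>> import numpy as np
>>> from doctr.models import ocr_db_crnn
>>> from doctr.utils.visualization import synthesize_page
>>> model = ocr_db_crnn(pretrained=True)
>>> input_page = (255 * np.random.rand(600, 800, 3)).astype(np.uint8)
>>> out = model([[input_page]])
>>> # render the OCR output on a blank page, coloring words by confidence
>>> synth = synthesize_page(out[0].pages[0].export(), draw_proba=True)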

Task evaluation

Implementations of task-specific metrics to easily assess your model's performance.

class doctr.utils.metrics.TextMatch[source]

Implements a text match metric (word-level accuracy) for the recognition task.

The raw aggregated metric is computed as follows:

\[\forall X, Y \in \mathcal{W}^N, TextMatch(X, Y) = \frac{1}{N} \sum\limits_{i=1}^N f_{Y_i}(X_i)\]

with the indicator function \(f_{a}\) defined as:

\[\begin{split}\forall a, x \in \mathcal{W}, f_a(x) = \left\{ \begin{array}{ll} 1 & \mbox{if } x = a \\ 0 & \mbox{otherwise.} \end{array} \right.\end{split}\]

where \(\mathcal{W}\) is the set of all possible character sequences and \(N\) is a strictly positive integer.

>>> from doctr.utils import TextMatch
>>> metric = TextMatch()
>>> metric.update(['Hello', 'world'], ['hello', 'world'])
>>> metric.summary()
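In this example, only 'world' matches its ground truth exactly, so the raw score should be 0.5, while the lower-case scores should reach 1.0 since 'Hello' and 'hello' differ only by case.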
update(gt: List[str], pred: List[str]) → None[source]

Update the state of the metric with new predictions

Args:

gt: list of ground-truth character sequences
pred: list of predicted character sequences

summary() → Dict[str, float][source]

Computes the aggregated metrics

Returns:

a dictionary with the exact match score for the raw data, its lower-case counterpart, its unidecode counterpart and its lower-case unidecode counterpart

class doctr.utils.metrics.LocalizationConfusion(iou_thresh: float = 0.5, use_polygons: bool = False, mask_shape: Tuple[int, int] = (1024, 1024), use_broadcasting: bool = True)[source]

Implements common confusion metrics and mean IoU for localization evaluation.

The aggregated metrics are computed as follows:

\[\begin{split}\forall Y \in \mathcal{B}^N, \forall X \in \mathcal{B}^M, \\ Recall(X, Y) = \frac{1}{N} \sum\limits_{i=1}^N g_{X}(Y_i) \\ Precision(X, Y) = \frac{1}{M} \sum\limits_{i=1}^M g_{Y}(X_i) \\ meanIoU(X, Y) = \frac{1}{M} \sum\limits_{i=1}^M \max\limits_{j \in [1, N]} IoU(X_i, Y_j)\end{split}\]

with the function \(IoU(x, y)\) being the Intersection over Union between bounding boxes \(x\) and \(y\), and, for any tuple of boxes \(Z\), the indicator function \(g_{Z}\) defined as:

\[\begin{split}\forall y \in \mathcal{B}, g_Z(y) = \left\{ \begin{array}{ll} 1 & \mbox{if } y\mbox{ has been assigned to any }(Z_i)_i\mbox{ with an }IoU \geq 0.5 \\ 0 & \mbox{otherwise.} \end{array} \right.\end{split}\]

where \(\mathcal{B}\) is the set of possible bounding boxes, \(N\) (number of ground truths) and \(M\) (number of predictions) are strictly positive integers.
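For intuition, here is a minimal NumPy sketch of the pairwise IoU computation underlying these formulas, for axis-aligned (xmin, ymin, xmax, ymax) boxes; it illustrates the definition and is not doctr's internal implementation:

>>> import numpy as np
>>> def box_iou(gts, preds):
...     # pairwise intersection rectangle, shape (N, M)
...     xmin = np.maximum(gts[:, None, 0], preds[None, :, 0])
...     ymin = np.maximum(gts[:, None, 1], preds[None, :, 1])
...     xmax = np.minimum(gts[:, None, 2], preds[None, :, 2])
...     ymax = np.minimum(gts[:, None, 3], preds[None, :, 3])
...     inter = (xmax - xmin).clip(0) * (ymax - ymin).clip(0)
...     gt_areas = (gts[:, 2] - gts[:, 0]) * (gts[:, 3] - gts[:, 1])
...     pred_areas = (preds[:, 2] - preds[:, 0]) * (preds[:, 3] - preds[:, 1])
...     # union = area(gt) + area(pred) - intersection
...     return inter / (gt_areas[:, None] + pred_areas[None, :] - inter)
>>> box_iou(np.asarray([[0, 0, 100, 100]]), np.asarray([[0, 0, 70, 70]]))
array([[0.49]])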

>>> import numpy as np
>>> from doctr.utils import LocalizationConfusion
>>> metric = LocalizationConfusion(iou_thresh=0.5)
>>> metric.update(np.asarray([[0, 0, 100, 100]]), np.asarray([[0, 0, 70, 70], [110, 95, 200, 150]]))
>>> metric.summary()
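With these values, the first prediction overlaps the ground truth with an IoU of 4900 / 10000 = 0.49 while the second does not overlap it at all, so no pair clears the 0.5 threshold: recall and precision should both be 0, and the mean IoU roughly 0.245.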

Args:

iou_thresh: minimum IoU to consider a pair of prediction and ground truth as a match
use_polygons: if set to True, predictions and targets are expected to be in rotated format
mask_shape: if use_polygons is True, the spatial shape of the input images
use_broadcasting: if use_polygons is True, use broadcasting for the IoU computation at the cost of extra memory

update(gts: ndarray, preds: ndarray) → None[source]

Updates the metric

Args:

gts: a set of relative bounding boxes, of shape (N, 4), or (N, 5) if they are rotated
preds: a set of relative bounding boxes, of shape (M, 4), or (M, 5) if they are rotated

summary() → Tuple[float | None, float | None, float | None][source]

Computes the aggregated metrics

Returns:

a tuple with the recall, precision and meanIoU scores
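Each score is Optional, which suggests a value of None whenever the score is undefined for the accumulated data (for instance before any call to update).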

class doctr.utils.metrics.OCRMetric(iou_thresh: float = 0.5, use_polygons: bool = False, mask_shape: Tuple[int, int] = (1024, 1024), use_broadcasting: bool = True)[source]

Implements an end-to-end OCR metric.

The aggregated metrics are computed as follows:

\[\begin{split}\forall (B, L) \in \mathcal{B}^N \times \mathcal{L}^N, \forall (\hat{B}, \hat{L}) \in \mathcal{B}^M \times \mathcal{L}^M, \\ Recall(B, \hat{B}, L, \hat{L}) = \frac{1}{N} \sum\limits_{i=1}^M h_{B,L}(\hat{B}_i, \hat{L}_i) \\ Precision(B, \hat{B}, L, \hat{L}) = \frac{1}{M} \sum\limits_{i=1}^M h_{B,L}(\hat{B}_i, \hat{L}_i) \\ meanIoU(B, \hat{B}) = \frac{1}{M} \sum\limits_{i=1}^M \max\limits_{j \in [1, N]} IoU(\hat{B}_i, B_j)\end{split}\]

with the function \(IoU(x, y)\) being the Intersection over Union between bounding boxes \(x\) and \(y\), and the function \(h_{B, L}\) defined as:

\[\begin{split}\forall (b, l) \in \mathcal{B} \times \mathcal{L}, h_{B,L}(b, l) = \left\{ \begin{array}{ll} 1 & \mbox{if } b\mbox{ has been assigned to a given }B_j\mbox{ with an } \\ & IoU \geq 0.5 \mbox{ and that for this assignment, } l = L_j\\ 0 & \mbox{otherwise.} \end{array} \right.\end{split}\]

where \(\mathcal{B}\) is the set of possible bounding boxes, \(\mathcal{L}\) is the set of possible character sequences, \(N\) (number of ground truths) and \(M\) (number of predictions) are strictly positive integers.

>>> import numpy as np
>>> from doctr.utils import OCRMetric
>>> metric = OCRMetric(iou_thresh=0.5)
>>> metric.update(np.asarray([[0, 0, 100, 100]]), np.asarray([[0, 0, 70, 70], [110, 95, 200, 150]]),
>>>               ['hello'], ['hello', 'world'])
>>> metric.summary()
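The geometry here is the same as in the localization example above: the best IoU (0.49) falls below the 0.5 threshold, so no boxes are matched, and recall and precision should be 0 for every string comparison with a mean IoU of roughly 0.245, even though the transcription 'hello' agrees.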

Args:

iou_thresh: minimum IoU to consider a pair of prediction and ground truth as a match
use_polygons: if set to True, predictions and targets are expected to be in rotated format
mask_shape: if use_polygons is True, the spatial shape of the input images
use_broadcasting: if use_polygons is True, use broadcasting for the IoU computation at the cost of extra memory

update(gt_boxes: ndarray, pred_boxes: ndarray, gt_labels: List[str], pred_labels: List[str]) → None[source]

Updates the metric

Args:

gt_boxes: a set of relative bounding boxes, of shape (N, 4), or (N, 5) if they are rotated
pred_boxes: a set of relative bounding boxes, of shape (M, 4), or (M, 5) if they are rotated
gt_labels: a list of N string labels
pred_labels: a list of M string labels

summary() → Tuple[Dict[str, float | None], Dict[str, float | None], float | None][source]

Computes the aggregated metrics

Returns:

a tuple with the recall & precision for each string comparison and the mean IoU
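Presumably each dictionary is keyed by string comparison mode (raw, lower-case, unidecode and lower-case unidecode), mirroring the variants reported by TextMatch.summary().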

class doctr.utils.metrics.DetectionMetric(iou_thresh: float = 0.5, use_polygons: bool = False, mask_shape: Tuple[int, int] = (1024, 1024), use_broadcasting: bool = True)[source]

Implements an object detection metric.

The aggregated metrics are computed as follows:

\[\begin{split}\forall (B, C) \in \mathcal{B}^N \times \mathcal{C}^N, \forall (\hat{B}, \hat{C}) \in \mathcal{B}^M \times \mathcal{C}^M, \\ Recall(B, \hat{B}, C, \hat{C}) = \frac{1}{N} \sum\limits_{i=1}^M h_{B,C}(\hat{B}_i, \hat{C}_i) \\ Precision(B, \hat{B}, C, \hat{C}) = \frac{1}{M} \sum\limits_{i=1}^M h_{B,C}(\hat{B}_i, \hat{C}_i) \\ meanIoU(B, \hat{B}) = \frac{1}{M} \sum\limits_{i=1}^M \max\limits_{j \in [1, N]} IoU(\hat{B}_i, B_j)\end{split}\]

with the function \(IoU(x, y)\) being the Intersection over Union between bounding boxes \(x\) and \(y\), and the function \(h_{B, C}\) defined as:

\[\begin{split}\forall (b, c) \in \mathcal{B} \times \mathcal{C}, h_{B,C}(b, c) = \left\{ \begin{array}{ll} 1 & \mbox{if } b\mbox{ has been assigned to a given }B_j\mbox{ with an } \\ & IoU \geq 0.5 \mbox{ and that for this assignment, } c = C_j\\ 0 & \mbox{otherwise.} \end{array} \right.\end{split}\]

where \(\mathcal{B}\) is the set of possible bounding boxes, \(\mathcal{C}\) is the set of possible class indices, \(N\) (number of ground truths) and \(M\) (number of predictions) are strictly positive integers.

>>> import numpy as np
>>> from doctr.utils import DetectionMetric
>>> metric = DetectionMetric(iou_thresh=0.5)
>>> metric.update(np.asarray([[0, 0, 100, 100]]), np.asarray([[0, 0, 70, 70], [110, 95, 200, 150]]),
>>>               np.zeros(1, dtype=np.int64), np.array([0, 1], dtype=np.int64))
>>> metric.summary()
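Here the single ground truth box carries class index 0 while the predictions carry classes 0 and 1: following the formulas above, a prediction only counts towards recall and precision when it both clears the IoU threshold and has the class index of its assigned ground truth.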

Args:

iou_thresh: minimum IoU to consider a pair of prediction and ground truth as a match
use_polygons: if set to True, predictions and targets are expected to be in rotated format
mask_shape: if use_polygons is True, the spatial shape of the input images
use_broadcasting: if use_polygons is True, use broadcasting for the IoU computation at the cost of extra memory

update(gt_boxes: ndarray, pred_boxes: ndarray, gt_labels: ndarray, pred_labels: ndarray) → None[source]

Updates the metric

Args:

gt_boxes: a set of relative bounding boxes, of shape (N, 4), or (N, 5) if they are rotated
pred_boxes: a set of relative bounding boxes, of shape (M, 4), or (M, 5) if they are rotated
gt_labels: an array of class indices, of shape (N,)
pred_labels: an array of class indices, of shape (M,)

summary() → Tuple[float | None, float | None, float | None][source]

Computes the aggregated metrics

Returns:

a tuple with the overall recall, precision and mean IoU scores