doctr.io

The io module enables users to easily access content from documents and export analysis results to structured formats.

Document structure

Structural organization of the documents.

Word

A Word is an uninterrupted sequence of characters.

class doctr.io.Word(value: str, confidence: float, geometry: Tuple[Tuple[float, float], Tuple[float, float]] | ndarray, objectness_score: float, crop_orientation: Dict[str, Any])

Implements a word element

Parameters:
  • value – the text string of the word

  • confidence – the confidence associated with the text prediction

  • geometry – bounding box of the word in format ((xmin, ymin), (xmax, ymax)) where coordinates are relative to the page’s size

  • objectness_score – the objectness score of the detection

  • crop_orientation – the general orientation of the crop in degrees and its confidence
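
Because the geometry is relative to the page size, it has to be scaled back to pixels before drawing or cropping. The following is a minimal sketch, assuming a straight (non-rotated) box and a page size given as (height, width); `to_pixel_box` is a hypothetical helper written for illustration and is not part of doctr.

```python
from typing import Tuple

RelBox = Tuple[Tuple[float, float], Tuple[float, float]]


def to_pixel_box(geometry: RelBox, dimensions: Tuple[int, int]) -> Tuple[int, int, int, int]:
    """Scale a relative ((xmin, ymin), (xmax, ymax)) box to integer pixels.

    Hypothetical helper; `dimensions` is assumed to be (height, width).
    """
    (xmin, ymin), (xmax, ymax) = geometry
    height, width = dimensions
    # x values scale with the page width, y values with the page height
    return (
        int(round(xmin * width)),
        int(round(ymin * height)),
        int(round(xmax * width)),
        int(round(ymax * height)),
    )


# A box covering the horizontal middle half of an 800 x 600 (H x W) page
box = to_pixel_box(((0.25, 0.5), (0.75, 0.625)), dimensions=(800, 600))
```

The result is ordered as (xmin, ymin, xmax, ymax) in pixels; rotated geometries passed as an ndarray would need separate handling.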

Line

A Line is a collection of Words aligned spatially and meant to be read together (on a two-column page, on the same horizontal, we will consider that there are two Lines).

class doctr.io.Line(words: List[Word], geometry: Tuple[Tuple[float, float], Tuple[float, float]] | ndarray | None = None, objectness_score: float | None = None)

Implements a line element as a collection of words

Parameters:
  • words – list of word elements

  • geometry – bounding box of the line in format ((xmin, ymin), (xmax, ymax)) where coordinates are relative to the page’s size. If not specified, it will be resolved by default to the smallest bounding box enclosing all words in it.
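
The default geometry rule (smallest box enclosing all children) can be illustrated as follows. This is a sketch of the rule as described, not doctr's own implementation; `enclosing_box` is a hypothetical name, and only straight boxes in ((xmin, ymin), (xmax, ymax)) format are handled.

```python
from typing import Iterable, Tuple

Box = Tuple[Tuple[float, float], Tuple[float, float]]


def enclosing_box(boxes: Iterable[Box]) -> Box:
    """Return the smallest straight box containing every input box."""
    boxes = list(boxes)
    if not boxes:
        raise ValueError("at least one box is required")
    # Top-left corner: smallest x and y; bottom-right corner: largest x and y
    xmin = min(b[0][0] for b in boxes)
    ymin = min(b[0][1] for b in boxes)
    xmax = max(b[1][0] for b in boxes)
    ymax = max(b[1][1] for b in boxes)
    return ((xmin, ymin), (xmax, ymax))


# Two neighbouring word boxes in relative coordinates
word_boxes = [((0.1, 0.2), (0.3, 0.25)), ((0.35, 0.19), (0.5, 0.26))]
line_box = enclosing_box(word_boxes)
```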

Artefact

An Artefact is a non-textual element (e.g. QR code, picture, chart, signature, logo, etc.).

class doctr.io.Artefact(artefact_type: str, confidence: float, geometry: Tuple[Tuple[float, float], Tuple[float, float]])

Implements a non-textual element

Parameters:
  • artefact_type – the type of artefact

  • confidence – the confidence of the type prediction

  • geometry – bounding box of the artefact in format ((xmin, ymin), (xmax, ymax)) where coordinates are relative to the page’s size.

Block

A Block is a collection of Lines (e.g. an address written on several lines) and Artefacts (e.g. a graph with its title underneath).

class doctr.io.Block(lines: List[Line] = [], artefacts: List[Artefact] = [], geometry: Tuple[Tuple[float, float], Tuple[float, float]] | ndarray | None = None, objectness_score: float | None = None)

Implements a block element as a collection of lines and artefacts

Parameters:
  • lines – list of line elements

  • artefacts – list of artefacts

  • geometry – bounding box of the block in format ((xmin, ymin), (xmax, ymax)) where coordinates are relative to the page’s size. If not specified, it will be resolved by default to the smallest bounding box enclosing all lines and artefacts in it.

Page

A Page is a collection of Blocks that were on the same physical page.

class doctr.io.Page(page: ndarray, blocks: List[Block], page_idx: int, dimensions: Tuple[int, int], orientation: Dict[str, Any] | None = None, language: Dict[str, Any] | None = None)

Implements a page element as a collection of blocks

Parameters:
  • page – image encoded as a numpy array in uint8

  • blocks – list of block elements

  • page_idx – the index of the page in the input raw document

  • dimensions – the page size in pixels in format (height, width)

  • orientation – a dictionary with the value of the rotation angle in degrees and confidence of the prediction

  • language – a dictionary with the language value and confidence of the prediction

show(interactive: bool = True, preserve_aspect_ratio: bool = False, **kwargs) -> None

Overlay the result on a given image

Parameters:
  • interactive – whether the display should be interactive

  • preserve_aspect_ratio – set to True if the predictor was also run with preserve_aspect_ratio=True

  • **kwargs – additional keyword arguments passed to the matplotlib.pyplot.show method

Document

A Document is a collection of Pages.

class doctr.io.Document(pages: List[Page])

Implements a document element as a collection of pages

Parameters:
  • pages – list of page elements

show(**kwargs) -> None

Overlay the result on a given image
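
The nesting described in this section (Document, Page, Block, Line, Word) can be walked with plain loops. The sketch below assumes each element exposes its children under the same names as its constructor parameters (`pages`, `blocks`, `lines`, `words`, `value`); `iter_word_values` is a hypothetical helper, and the stand-in objects exist only so the example runs without doctr installed.

```python
from types import SimpleNamespace
from typing import Iterator


def iter_word_values(document) -> Iterator[str]:
    """Yield the text of every word, following the document hierarchy."""
    for page in document.pages:
        for block in page.blocks:
            for line in block.lines:
                for word in line.words:
                    yield word.value


# Stand-in objects mimicking the assumed attribute names (not real doctr classes)
words = [SimpleNamespace(value="Hello"), SimpleNamespace(value="world")]
line = SimpleNamespace(words=words)
block = SimpleNamespace(lines=[line])
page = SimpleNamespace(blocks=[block])
doc = SimpleNamespace(pages=[page])

text = " ".join(iter_word_values(doc))
```

Artefacts are ignored here because they are non-textual elements.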

File reading

High-performance file reading and conversion to processable structured data.

doctr.io.read_pdf(file: str | Path | bytes, scale: float = 2, rgb_mode: bool = True, password: str | None = None, **kwargs: Any) -> List[ndarray]

Read a PDF file and convert it into a list of images in numpy format, one per page

>>> from doctr.io import read_pdf
>>> doc = read_pdf("path/to/your/doc.pdf")
Parameters:
  • file – the path to the PDF file or a binary stream

  • scale – rendering scale (1 corresponds to 72dpi)

  • rgb_mode – if True, the output will be RGB, otherwise BGR

  • password – a password to unlock the document, if encrypted

  • **kwargs – additional parameters to pypdfium2.PdfPage.render()

Returns:

the list of pages decoded as numpy ndarray of shape H x W x C
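
Since a scale of 1 corresponds to 72 dpi, the scale for a target resolution is simply the resolution divided by 72. A small sketch of that arithmetic follows; `scale_for_dpi` is a hypothetical helper, not a doctr function.

```python
def scale_for_dpi(dpi: float) -> float:
    """Return the `scale` value that renders at `dpi` (scale 1 = 72 dpi)."""
    if dpi <= 0:
        raise ValueError("dpi must be positive")
    return dpi / 72.0


default_dpi = 2 * 72            # the default scale=2 renders at 144 dpi
scale_300 = scale_for_dpi(300)  # roughly 4.17
```

With doctr installed, the value could be passed as `read_pdf("path/to/your/doc.pdf", scale=scale_for_dpi(300))`. Higher scales produce larger arrays, so memory use grows with the square of the scale.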

doctr.io.read_img_as_numpy(file: str | Path | bytes, output_size: Tuple[int, int] | None = None, rgb_output: bool = True) -> ndarray

Read an image file into numpy format

>>> from doctr.io import read_img_as_numpy
>>> page = read_img_as_numpy("path/to/your/doc.jpg")
Parameters:
  • file – the path to the image file or a binary stream

  • output_size – the expected output size of each page in format H x W

  • rgb_output – whether the output ndarray channel order should be RGB instead of BGR.

Returns:

the page decoded as numpy ndarray of shape H x W x 3

doctr.io.read_img_as_tensor(img_path: str | Path, dtype: DType = tf.float32) -> Tensor

Read an image file as a TensorFlow tensor

Parameters:
  • img_path – location of the image file

  • dtype – the desired data type of the output tensor. If it is float-related, values will be divided by 255.

Returns:

decoded image as a tensor

doctr.io.decode_img_as_tensor(img_content: bytes, dtype: DType = tf.float32) -> Tensor

Read a byte stream as a TensorFlow tensor

Parameters:
  • img_content – bytes of an encoded image

  • dtype – the desired data type of the output tensor. If it is float-related, values will be divided by 255.

Returns:

decoded image as a tensor

doctr.io.read_html(url: str, **kwargs: Any) -> bytes

Read a web page and convert it into a PDF file as a bytes stream

>>> from doctr.io import read_html
>>> doc = read_html("https://www.yoursite.com")
Parameters:
  • url – URL of the target web page

  • **kwargs – keyword arguments passed to weasyprint.HTML

Returns:

decoded PDF file as a bytes stream

class doctr.io.DocumentFile

Read a document from multiple file formats

classmethod from_pdf(file: str | Path | bytes, **kwargs) -> List[ndarray]

Read a PDF file

>>> from doctr.io import DocumentFile
>>> doc = DocumentFile.from_pdf("path/to/your/doc.pdf")
Parameters:
  • file – the path to the PDF file or a binary stream

  • **kwargs – additional parameters to pypdfium2.PdfPage.render()

Returns:

the list of pages decoded as numpy ndarray of shape H x W x 3

classmethod from_url(url: str, **kwargs) -> List[ndarray]

Interpret a web page as a PDF document

>>> from doctr.io import DocumentFile
>>> doc = DocumentFile.from_url("https://www.yoursite.com")
Parameters:
  • url – the URL of the target web page

  • **kwargs – additional parameters to pypdfium2.PdfPage.render()

Returns:

the list of pages decoded as numpy ndarray of shape H x W x 3

classmethod from_images(files: Sequence[str | Path | bytes] | str | Path | bytes, **kwargs) -> List[ndarray]

Read an image file (or a collection of image files) and convert it into an image in numpy format

>>> from doctr.io import DocumentFile
>>> pages = DocumentFile.from_images(["path/to/your/page1.png", "path/to/your/page2.png"])
Parameters:
  • files – the path to the image file or a binary stream, or a collection of those

  • **kwargs – additional parameters to doctr.io.image.read_img_as_numpy()

Returns:

the list of pages decoded as numpy ndarray of shape H x W x 3
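
When the input type is not known in advance, the choice between the three DocumentFile constructors can be made from the source string. The sketch below is one possible convention, assuming URLs start with http:// or https:// and PDFs carry a .pdf extension; `pick_loader` is a hypothetical helper and only returns the method name, so it runs without doctr installed.

```python
from pathlib import Path


def pick_loader(source: str) -> str:
    """Return the name of the DocumentFile classmethod suited to `source`."""
    if source.lower().startswith(("http://", "https://")):
        return "from_url"
    if Path(source).suffix.lower() == ".pdf":
        return "from_pdf"
    # Anything else is treated as an image path (an assumption of this sketch)
    return "from_images"


choices = [
    pick_loader(s)
    for s in (
        "https://www.yoursite.com",
        "path/to/your/doc.pdf",
        "path/to/your/page1.png",
    )
]
```

With doctr installed, the pages could then be loaded with `getattr(DocumentFile, pick_loader(src))(src)`. Inputs given as bytes carry no extension and would need a different check.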