doctr.io ======== .. currentmodule:: doctr.io The io module enables users to easily access content from documents and export analysis results to structured formats. .. _document_structure: Document structure ------------------ Structural organization of the documents. Word ^^^^ A Word is an uninterrupted sequence of characters. .. autoclass:: Word Line ^^^^ A Line is a collection of Words aligned spatially and meant to be read together (on a two-column page, on the same horizontal, we will consider that there are two Lines). .. autoclass:: Line Artefact ^^^^^^^^ An Artefact is a non-textual element (e.g. QR code, picture, chart, signature, logo, etc.). .. autoclass:: Artefact Block ^^^^^ A Block is a collection of Lines (e.g. an address written on several lines) and Artefacts (e.g. a graph with its title underneath). .. autoclass:: Block Page ^^^^ A Page is a collection of Blocks that were on the same physical page. .. autoclass:: Page .. automethod:: show Document ^^^^^^^^ A Document is a collection of Pages. .. autoclass:: Document .. automethod:: show File reading ------------ High-performance file reading and conversion to processable structured data. .. autofunction:: read_pdf .. autofunction:: read_img_as_numpy .. autofunction:: read_img_as_tensor .. autofunction:: decode_img_as_tensor .. autofunction:: read_html .. autoclass:: DocumentFile .. automethod:: from_pdf .. automethod:: from_url .. automethod:: from_images