V1 PDF

Multi-Receipts Extractor

extract_receipts(input_source, inference)

Extracts individual receipts from multi-receipts documents.

Parameters:
  • input_source (LocalInputSource) – Local Input Source to extract sub-receipts from.

  • inference (Inference) – Results of the inference.

Return type:

list[ExtractedImage]

Returns:

The individual extracted receipts, as a list of ExtractedMultiReceiptsImage objects.

PDF Extractor

class PDFExtractor(local_input)

V1-specific PDF extractor.

Parameters:

local_input (LocalInputSource)

cut_pages(page_indexes)

Create a new PDF from the given pages and save it into a buffer.

Parameters:

page_indexes (list) – List of page numbers to use for merging in the original PDF.

Return type:

BinaryIO

Returns:

The buffer containing the new PDF.
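Since the return type is BinaryIO, the result can be handled like any other binary stream. The sketch below shows one way to write such a buffer to disk. It uses an io.BytesIO object as a stand-in for the value returned by cut_pages, and save_buffer is a hypothetical helper, not part of the library. It also assumes the buffer is seekable.

```python
import io
from pathlib import Path
from typing import BinaryIO


def save_buffer(buffer: BinaryIO, destination: Path) -> int:
    """Write the full content of a binary buffer to a file.

    Hypothetical helper: assumes the buffer is seekable.
    Returns the number of bytes written.
    """
    buffer.seek(0)  # rewind in case the buffer was already read
    data = buffer.read()
    destination.write_bytes(data)
    return len(data)


# Stand-in for the buffer returned by cut_pages (not a real PDF).
example_buffer = io.BytesIO(b"%PDF-1.4 placeholder content")
```

Rewinding first means the helper still writes the whole content when another consumer has already read from the buffer.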

extract_documents(page_indexes)

Extracts complete PDFs from the document.

Parameters:

page_indexes (list[list[int]]) – List of sub-lists of pages to keep.

Return type:

list[ExtractedPDF]

Returns:

A list of extracted PDFs.
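The page_indexes argument is a list of sub-lists, where each sub-list describes one output PDF. As a minimal sketch of how such a structure can be built, the hypothetical helper below splits a document into consecutive groups of a fixed size. It assumes zero-based page indexes, which this page does not state, and chunk_pages is not part of the library.

```python
def chunk_pages(page_count: int, group_size: int) -> list[list[int]]:
    """Split page indexes into consecutive groups of at most group_size.

    Hypothetical helper: assumes zero-based page indexes.
    """
    if page_count < 0:
        raise ValueError("page_count must be non-negative")
    if group_size < 1:
        raise ValueError("group_size must be at least 1")
    return [
        list(range(start, min(start + group_size, page_count)))
        for start in range(0, page_count, group_size)
    ]


# A 5-page document split into groups of 2 pages.
groups = chunk_pages(5, 2)
print(groups)  # [[0, 1], [2, 3], [4]]
```

The resulting list has the list[list[int]] shape that extract_documents expects.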

extract_invoices(page_indexes, strict=False)

Extracts invoices from the document as complete PDFs, using either a list of pages or a list of page groups.

Parameters:
  • page_indexes (list[InvoiceSplitterV1InvoicePageGroup | list[int]]) – Page groups, or sub-lists of page indexes, to keep.

  • strict (bool, default: False) – Whether to trust confidence scores of 0.5 or above.

Return type:

list[ExtractedPDF]

Returns:

A list of extracted invoices.
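Because page_indexes accepts both page groups and plain lists, it can help to picture how a mixed input reduces to plain lists of indexes. The sketch below is one possible reading, not the library's implementation. PageGroup is a hypothetical stand-in for InvoiceSplitterV1InvoicePageGroup, and its page_indexes and confidence attributes are assumed names. The strict behavior shown, merging a group scored below 0.5 into the previous group, is also an assumption.

```python
from dataclasses import dataclass


@dataclass
class PageGroup:
    """Hypothetical stand-in for InvoiceSplitterV1InvoicePageGroup."""

    page_indexes: list[int]
    confidence: float


def to_page_lists(groups, strict=False, threshold=0.5):
    """Reduce page groups or plain lists to a list of page-index lists.

    Assumed policy: when strict is True, a group scored below the
    threshold is merged into the previous group, if there is one.
    """
    result = []
    for group in groups:
        if isinstance(group, PageGroup):
            pages = list(group.page_indexes)
            trusted = group.confidence >= threshold  # 0.5 is included
        else:
            pages = list(group)
            trusted = True  # plain lists carry no confidence score
        if strict and not trusted and result:
            result[-1].extend(pages)
        else:
            result.append(pages)
    return result


mixed = [PageGroup([0, 1], 0.9), PageGroup([2], 0.3), [3, 4]]
print(to_page_lists(mixed))               # [[0, 1], [2], [3, 4]]
print(to_page_lists(mixed, strict=True))  # [[0, 1, 2], [3, 4]]
```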

extract_sub_documents(page_indexes)

Extract the sub-documents from the main PDF, based on the given list of page indexes.

Parameters:

page_indexes (list[list[int]]) – List of lists of numbers, representing page indexes.

Return type:

list[ExtractedPDF]

Returns:

A list of created PDFs.

get_page_count()

Get the number of pages in the PDF file.

Return type:

int
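The page count is useful for checking page indexes before passing them to an extraction method. The sketch below shows such a check. check_page_groups is a hypothetical helper, not part of the library, and it assumes zero-based page indexes, which this page does not state. The page_count argument would come from get_page_count().

```python
def check_page_groups(page_groups: list[list[int]], page_count: int) -> None:
    """Raise ValueError if a group is empty or holds an out-of-range index.

    Hypothetical helper: assumes zero-based page indexes.
    """
    for group_number, group in enumerate(page_groups):
        if not group:
            raise ValueError(f"group {group_number} is empty")
        for index in group:
            if not 0 <= index < page_count:
                raise ValueError(
                    f"page index {index} in group {group_number} is outside "
                    f"the valid range 0..{page_count - 1}"
                )


# Passes silently for a 3-page document.
check_page_groups([[0, 1], [2]], page_count=3)
```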