Input

class Base64Input(base64_string, filename)

Base64-encoded text input.

Parameters:
  • base64_string (str) –

  • filename (str) –

file_mimetype: str
file_object: BinaryIO
filename: str
filepath: Optional[str]
input_type: InputType
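
As an illustration of what a base64 input carries, the sketch below decodes a base64 string into an in-memory binary file object using only the standard library. The helper name `decode_base64_input` is hypothetical and is not part of this module; it only mirrors the `base64_string` and `filename` parameters documented above.

```python
import base64
import io
from typing import BinaryIO, Tuple


def decode_base64_input(base64_string: str, filename: str) -> Tuple[str, BinaryIO]:
    """Hypothetical helper: decode a base64 string into a binary file object."""
    # validate=True rejects characters outside the base64 alphabet.
    raw_bytes = base64.b64decode(base64_string, validate=True)
    return filename, io.BytesIO(raw_bytes)


# Build a small base64 payload and decode it again.
encoded = base64.b64encode(b"%PDF-1.7 sample").decode("ascii")
name, file_object = decode_base64_input(encoded, "sample.pdf")
```

The filename has to be supplied separately because a base64 string carries no file name of its own.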
class BytesInput(raw_bytes, filename)

Raw bytes input.

Parameters:
  • raw_bytes (bytes) –

  • filename (str) –

file_mimetype: str
file_object: BinaryIO
filename: str
filepath: Optional[str]
input_type: InputType

class FileInput(file)

A binary file input.

Parameters:

file (BinaryIO) –

file_mimetype: str
file_object: BinaryIO
filename: str
filepath: Optional[str]
input_type: InputType

class InputType(value)

The input type, for internal use.

BASE64 = 'base64'
BYTES = 'bytes'
FILE = 'file'
PATH = 'path'
URL = 'url'

class LocalInputSource(input_type)

Base class for all input sources coming from the local machine.

Parameters:

input_type (InputType) –

close()

Close the file object.

Return type:

None

count_doc_pages()

Count the pages in the PDF.

Return type:

int

Returns:

the number of pages.

fix_pdf(maximum_offset=500)

Fix a potentially broken PDF file.

WARNING: this feature alters the data of the enqueued file by removing unnecessary headers.

Reads the bytes of a PDF file until a proper PDF tag is encountered, or until the maximum offset has been reached. If a tag denoting a PDF file is found, all bytes before it are deleted.

Parameters:

maximum_offset (int, default: 500) – maximum byte offset where superfluous headers will be removed. Cannot be less than 0.

Return type:

None
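
The header-stripping behavior described above can be sketched on a plain bytes object. This is a minimal illustration, not the library's implementation: the function name `strip_leading_bytes` is hypothetical, and it assumes that the "PDF tag" is the `%PDF-` marker and that the marker must start at or before `maximum_offset`.

```python
def strip_leading_bytes(data: bytes, maximum_offset: int = 500) -> bytes:
    """Hypothetical sketch: drop any bytes that precede the PDF marker."""
    if maximum_offset < 0:
        raise ValueError("maximum_offset cannot be less than 0")
    marker = b"%PDF-"  # assumption: this is the tag that denotes a PDF file
    # Only search the window in which the marker may start.
    position = data.find(marker, 0, maximum_offset + len(marker))
    if position == -1:
        # No marker within the allowed offset: leave the data untouched.
        return data
    return data[position:]


broken = b"junk header\r\n" + b"%PDF-1.4\nbody"
fixed = strip_leading_bytes(broken)
too_far = b"x" * 600 + b"%PDF-1.4\nbody"
unchanged = strip_leading_bytes(too_far)
```

Bounding the search keeps the operation from scanning an entire large file that is not a PDF at all.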

is_pdf()

Check if the file is a PDF.

Return type:

bool

Returns:

True if the file is a PDF.

is_pdf_empty()

Check if the PDF is empty.

Return type:

bool

Returns:

True if the PDF is empty.

merge_pdf_pages(page_numbers)

Create a new PDF from the given pages and assign it to file_object.

Parameters:

page_numbers (set) – Set of page numbers from the original PDF to use for merging.

Return type:

None

Returns:

None

process_pdf(behavior, on_min_pages, page_indexes)

Run any required processing on a PDF file.

Return type:

None

Parameters:
  • behavior (str) –

  • on_min_pages (int) –

  • page_indexes (Sequence) –

read_contents(close_file)

Read the contents of the input file.

Parameters:

close_file (bool) – whether to close the file after reading

Return type:

Tuple[str, bytes]

Returns:

a tuple with the filename and the binary data.
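
The return shape of read_contents can be illustrated with a standalone function over any binary file object. This is a sketch under assumptions, not the library's code: `read_contents_sketch` is a hypothetical name, and rewinding the file before reading is an assumption made here so that the full contents are returned.

```python
import io
from typing import BinaryIO, Tuple


def read_contents_sketch(
    file_object: BinaryIO, filename: str, close_file: bool
) -> Tuple[str, bytes]:
    """Hypothetical sketch: return the filename and the full binary contents."""
    file_object.seek(0)  # assumption: always read from the start of the file
    data = file_object.read()
    if close_file:
        file_object.close()
    return filename, data


buffer = io.BytesIO(b"example bytes")
name, data = read_contents_sketch(buffer, "example.txt", close_file=True)
```

Passing `close_file=False` leaves the file object open, which matters if further processing on the same object is planned.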

file_mimetype: str
file_object: BinaryIO
filename: str
filepath: Optional[str]
input_type: InputType

class PageOptions(page_indexes, operation='KEEP_ONLY', on_min_pages=0)

Options to pass to the parse method for cutting multipage documents.

Parameters:
  • page_indexes (Sequence[int]) –

  • operation (str) –

  • on_min_pages (int) –

on_min_pages: int

Apply the operation only if the document has at least this many pages.

Default: 0 (apply on all documents)

operation: str

Operation to apply on the document, given the page_indexes specified:

  • KEEP_ONLY - keep only the specified pages, and remove all others.

  • REMOVE - remove the specified pages, and keep all others.

page_indexes: Sequence[int]

Zero-based list of page indexes. A negative index can be used, indicating an offset from the end of the document.

[0, -1] represents the first and last pages of the document.
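
To make the interaction between page_indexes, operation and on_min_pages concrete, here is a minimal sketch that computes which zero-based pages would remain. The function `select_pages` is hypothetical and not part of this module. Two behaviors are assumptions of the sketch: out-of-range indexes are ignored, and the kept pages are returned in document order.

```python
from typing import List, Sequence


def select_pages(
    page_count: int,
    page_indexes: Sequence[int],
    operation: str = "KEEP_ONLY",
    on_min_pages: int = 0,
) -> List[int]:
    """Hypothetical sketch: return the zero-based pages left after the operation."""
    all_pages = list(range(page_count))
    # Documents with fewer pages than on_min_pages are left untouched.
    if page_count < on_min_pages:
        return all_pages
    resolved = set()
    for index in page_indexes:
        # A negative index is an offset from the end of the document.
        position = index + page_count if index < 0 else index
        if 0 <= position < page_count:  # assumption: ignore out-of-range indexes
            resolved.add(position)
    if operation == "KEEP_ONLY":
        return [page for page in all_pages if page in resolved]
    if operation == "REMOVE":
        return [page for page in all_pages if page not in resolved]
    raise ValueError(f"unknown operation: {operation}")


kept = select_pages(5, [0, -1])
removed = select_pages(5, [0, -1], operation="REMOVE")
untouched = select_pages(2, [0], on_min_pages=3)
```

For a five-page document, `[0, -1]` resolves to pages 0 and 4, so KEEP_ONLY leaves those two pages and REMOVE leaves pages 1 to 3.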

class PathInput(filepath)

A local path input.

Parameters:

filepath (Optional[str]) –

file_mimetype: str
file_object: BinaryIO
filename: str
filepath: Optional[str]
input_type: InputType

class UrlInputSource(url)

A local or remote URL input.

Parameters:

url (str) –

url: str

The Uniform Resource Locator.