PDF Helpers

class PDFCharData(char, left, right, top, bottom, font_name, font_size, font_weight, font_flags, font_stroke_color, font_fill_color, page_id)

Data class describing a single character read from a PDF: its bounding box, font attributes, and the page it was found on.

Parameters:
  • char (str)

  • left (int)

  • right (int)

  • top (int)

  • bottom (int)

  • font_name (str)

  • font_size (float)

  • font_weight (int)

  • font_flags (int)

  • font_stroke_color (tuple[int, int, int, int])

  • font_fill_color (tuple[int, int, int, int])

  • page_id (int)

bottom: int

Bottom bound of the character's bounding box.

char: str

The character.

font_fill_color: tuple[int, int, int, int]

RGBA representation of the font’s fill color.

font_flags: int

The font flags.

font_name: str

The font name.

font_size: float

The font size in pt.

font_stroke_color: tuple[int, int, int, int]

RGBA representation of the font’s stroke color.

font_weight: int

The font weight.

left: int

Left bound of the character's bounding box.

page_id: int

ID of the page the character was found on.

right: int

Right bound of the character's bounding box.

top: int

Top bound of the character's bounding box.
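As an illustration of how the fields fit together, here is a minimal sketch that builds one record and derives the size of its bounding box. CharRecord is a hypothetical stand-in that mirrors the documented fields, not the library's own class, and box_size is an illustrative helper. The coordinate origin and units are not stated here, so the helper takes absolute differences rather than assuming which of top and bottom is larger.

```python
from dataclasses import dataclass


@dataclass
class CharRecord:
    """Hypothetical stand-in mirroring the documented PDFCharData fields."""
    char: str
    left: int
    right: int
    top: int
    bottom: int
    font_name: str
    font_size: float
    font_weight: int
    font_flags: int
    font_stroke_color: tuple[int, int, int, int]
    font_fill_color: tuple[int, int, int, int]
    page_id: int


def box_size(record: CharRecord) -> tuple[int, int]:
    """Return (width, height) of the bounding box, independent of axis direction."""
    width = abs(record.right - record.left)
    height = abs(record.bottom - record.top)
    return width, height


# Example values are made up for illustration only.
sample = CharRecord(
    char="A", left=10, right=18, top=100, bottom=112,
    font_name="Helvetica", font_size=12.0, font_weight=400, font_flags=0,
    font_stroke_color=(0, 0, 0, 255), font_fill_color=(0, 0, 0, 255),
    page_id=0,
)
print(box_size(sample))
```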

compress_pdf(pdf_data, image_quality=85, force_source_text_compression=False, disable_source_text=True)

Compresses each page of a provided PDF buffer.

Parameters:
  • pdf_data (BinaryIO | bytes) – The input PDF, as bytes or a binary file-like object.

  • image_quality (int, default: 85) – Compression quality (70-100 for most JPG images).

  • force_source_text_compression (bool, default: False) – If true, attempts to re-write detected text.

  • disable_source_text (bool, default: True) – If true, doesn’t re-apply source text to the output PDF.

Return type:

bytes

Returns:

Compressed PDF as bytes.
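Since pdf_data may be either bytes or a binary stream, calling code that needs the raw bytes for other purposes (for example, comparing sizes before and after compression) may want to normalize the input first. The sketch below shows one way to do that. read_pdf_input is a hypothetical helper, not part of this module, and how compress_pdf handles the two input types internally is not documented here.

```python
import io
from typing import BinaryIO, Union


def read_pdf_input(pdf_data: Union[BinaryIO, bytes]) -> bytes:
    """Return the PDF content as bytes, given either bytes or a binary stream.

    Hypothetical helper for illustration; not part of the documented API.
    """
    if isinstance(pdf_data, (bytes, bytearray)):
        return bytes(pdf_data)
    # Otherwise assume a binary file-like object and read from its current position.
    return pdf_data.read()


# Both forms yield the same bytes.
raw = b"%PDF-1.7 example"
assert read_pdf_input(raw) == read_pdf_input(io.BytesIO(raw))
```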

extract_text_from_pdf(pdf_bytes)

Extracts the raw text from a given PDF’s bytes along with font data.

Parameters:

pdf_bytes (bytes) – Raw bytes representation of a PDF file.

Return type:

list[list[PDFCharData]]

Returns:

A nested list of PDFCharData entries, one entry per character read.
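The nested return value can be flattened and regrouped by page_id to recover plain text per page. The following is a rough sketch under stated assumptions: Char is a hypothetical stand-in exposing only the two fields used, text_by_page is an illustrative helper, and nothing is assumed about what the outer list groups by, since the result is flattened before regrouping.

```python
from collections import defaultdict
from dataclasses import dataclass
from itertools import chain


@dataclass
class Char:
    """Hypothetical stand-in exposing only the PDFCharData fields used below."""
    char: str
    page_id: int


def text_by_page(nested: list[list[Char]]) -> dict[int, str]:
    """Flatten the nested character lists and join characters per page_id."""
    pages: dict[int, list[str]] = defaultdict(list)
    for item in chain.from_iterable(nested):
        pages[item.page_id].append(item.char)
    return {page_id: "".join(chars) for page_id, chars in pages.items()}


# Made-up data shaped like the documented return type.
nested = [[Char("H", 0), Char("i", 0)], [Char("!", 1)]]
print(text_by_page(nested))
```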

lerp(start, end, t)

Performs linear interpolation between two numbers.

Parameters:
  • start (float) – The starting value.

  • end (float) – The ending value.

  • t (float) – The interpolation factor (0 to 1).

Return type:

float

Returns:

The interpolated value.
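Linear interpolation follows the standard formula start + (end - start) * t. The sketch below is a local re-implementation for illustration and is not necessarily how the library computes it. Whether the library clamps t to the range 0 to 1 is not stated, so this version does not clamp.

```python
def lerp(start: float, end: float, t: float) -> float:
    """Linearly interpolate between start and end by factor t (no clamping)."""
    return start + (end - start) * t


# t = 0 gives the start value, t = 1 the end value, t = 0.5 the midpoint.
print(lerp(0.0, 10.0, 0.0))     # 0.0
print(lerp(0.0, 10.0, 1.0))     # 10.0
print(lerp(70.0, 100.0, 0.5))   # 85.0
```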

pdf_has_source_text(pdf_bytes)

Checks if the provided PDF bytes contain source text.

Parameters:

pdf_bytes (bytes) – Raw bytes representation of a PDF file.

Return type:

bool

Returns:

True if source text is found, False otherwise.