PDF Helpers

class PDFCharData(char, left, right, top, bottom, font_name, font_size, font_weight, font_flags, font_stroke_color, font_fill_color, page_id)

Data class describing a single character read from a PDF: its bounding box, font attributes, and the page it was found on.

Parameters:

char (str)
left (int)
right (int)
top (int)
bottom (int)
font_name (str)
font_size (float)
font_weight (int)
font_flags (int)
font_stroke_color (tuple[int, int, int, int])
font_fill_color (tuple[int, int, int, int])
page_id (int)

bottom: int: Bottom bound.

char: str: The character.

font_fill_color: tuple[int, int, int, int]: RGBA representation of the font’s fill color.

font_flags: int: The font flags.

font_name: str: The font name.

font_size: float: The font size in pt.

font_stroke_color: tuple[int, int, int, int]: RGBA representation of the font’s stroke color.

font_weight: int: The font weight.

left: int: Left bound.

page_id: int: ID of the page the character was found on.

right: int: Right bound.

top: int: Top bound.

compress_pdf(pdf_data, image_quality=85, force_source_text_compression=False, disable_source_text=True)

Compresses each page of a provided PDF buffer.

Parameters:

pdf_data (BinaryIO | bytes) – The input PDF, as bytes or a binary file-like object.
image_quality (int, default: 85) – Image compression quality (70-100 is typical for most JPG images).
force_source_text_compression (bool, default: False) – If true, attempts to re-write detected text.
disable_source_text (bool, default: True) – If true, doesn’t re-apply source text to the output PDF.

Return type:

bytes

Returns:

Compressed PDF as bytes.
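
Because `pdf_data` may be either bytes or a binary stream and the result is always bytes, callers may find two small helpers useful around the call. The sketch below is an assumption-laden illustration: `read_pdf_input` and `size_reduction` are hypothetical helpers, not library functions, and `compress_pdf` itself is not called here because its import path is not shown in this section.

```python
import io
from typing import BinaryIO, Union


def read_pdf_input(pdf_data: Union[BinaryIO, bytes]) -> bytes:
    """Normalize a bytes object or binary stream to bytes (hypothetical helper)."""
    if isinstance(pdf_data, (bytes, bytearray)):
        return bytes(pdf_data)
    # Anything else is assumed to be a binary file-like object.
    return pdf_data.read()


def size_reduction(original: bytes, compressed: bytes) -> float:
    """Fraction of the original size saved, e.g. 0.75 means 75% smaller."""
    if not original:
        raise ValueError("original PDF is empty")
    return 1 - len(compressed) / len(original)
```

After obtaining `compressed = compress_pdf(data)`, `size_reduction(data, compressed)` gives a quick check of whether a given `image_quality` setting was worthwhile.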

extract_text_from_pdf(pdf_bytes)

Extracts the raw text from a given PDF’s bytes along with font data.

Parameters:

pdf_bytes (bytes) – Raw bytes representation of a PDF file.

Return type:

list[list[PDFCharData]]

Returns:

A nested list of PDFCharData entries, one per character read.
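
The return value is a list of lists, so callers usually need to walk two levels. The sketch below shows one way to do that; what the outer list groups by (pages, lines, or something else) is not stated in this section, so the code makes no assumption about it. `Char` is a hypothetical stand-in carrying only the two attributes used here, and `text_per_group` and `font_usage` are illustrative names.

```python
from collections import Counter, namedtuple
from itertools import chain

# Hypothetical stand-in with just the attributes this example reads.
Char = namedtuple("Char", ["char", "font_name"])


def text_per_group(groups):
    """Join the characters of each inner list into one string."""
    return ["".join(c.char for c in group) for group in groups]


def font_usage(groups):
    """Count characters per font name across all inner lists."""
    return Counter(c.font_name for c in chain.from_iterable(groups))
```

Passing the result of `extract_text_from_pdf(pdf_bytes)` as `groups` should work the same way, assuming each entry exposes `char` and `font_name` as documented for `PDFCharData`.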

lerp(start, end, t)

Performs linear interpolation between two numbers.

Parameters:

start (float) – The starting value.
end (float) – The ending value.
t (float) – The interpolation factor (0 to 1).

Return type:

float

Returns:

The interpolated value.
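
For reference, linear interpolation is conventionally computed as `start + (end - start) * t`. The sketch below uses that textbook formula; whether the library's `lerp` clamps `t` to the 0 to 1 range or handles values outside it differently is not stated here, so `lerp_sketch` is only an illustration, not the library's implementation.

```python
def lerp_sketch(start: float, end: float, t: float) -> float:
    # t = 0 returns start, t = 1 returns end, values in between blend linearly.
    return start + (end - start) * t
```

For example, `lerp_sketch(0.0, 10.0, 0.5)` evaluates to `5.0`.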

pdf_has_source_text(pdf_bytes)

Checks if the provided PDF bytes contain source text.

Parameters:

pdf_bytes (bytes) – Raw bytes representation of a PDF file.

Return type:

bool

Returns:

True if source text is found, False otherwise.
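
One plausible use of this check is as a guard before attempting text extraction. The sketch below illustrates that pattern with the two functions passed in as arguments, since their import path is not shown in this section; `extract_if_text` is a hypothetical wrapper, and the stubs in the usage note stand in for `pdf_has_source_text` and `extract_text_from_pdf`.

```python
from typing import Callable, Optional


def extract_if_text(
    pdf_bytes: bytes,
    has_source_text: Callable[[bytes], bool],
    extract: Callable[[bytes], list],
) -> Optional[list]:
    """Run extraction only when the PDF reports source text (hypothetical wrapper)."""
    if not has_source_text(pdf_bytes):
        # No source text, e.g. possibly a scanned document; skip extraction.
        return None
    return extract(pdf_bytes)
```

In real code the call would look like `extract_if_text(data, pdf_has_source_text, extract_text_from_pdf)`, assuming both functions are imported from this module.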