PDF Helpers
- class PDFCharData(char, left, right, top, bottom, font_name, font_size, font_weight, font_flags, font_stroke_color, font_fill_color, page_id)
Data class holding one character extracted from a PDF, together with its bounding box, font attributes, and page.
- Parameters:
char (str)
left (int)
right (int)
top (int)
bottom (int)
font_name (str)
font_size (float)
font_weight (int)
font_flags (int)
font_stroke_color (tuple[int, int, int, int])
font_fill_color (tuple[int, int, int, int])
page_id (int)
- bottom: int
Bottom bound.
- char: str
The character.
- font_fill_color: tuple[int, int, int, int]
RGBA representation of the font’s fill color.
- font_flags: int
The font flags.
- font_name: str
The font name.
- font_size: float
The font size in pt.
- font_stroke_color: tuple[int, int, int, int]
RGBA representation of the font’s stroke color.
- font_weight: int
The font weight.
- left: int
Left bound.
- page_id: int
ID of the page the character was found on.
- right: int
Right bound.
- top: int
Top bound.
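As a rough illustration of how these fields fit together, the sketch below defines a stand-in dataclass with the documented fields and a hypothetical helper, `box_size`, that derives the width and height of a character's bounding box. The stand-in is an assumption made so the snippet runs on its own; the library's real class may differ. The sample values are invented.

```python
from dataclasses import dataclass


# Hypothetical stand-in that mirrors the documented fields. The real
# PDFCharData comes from the library and may behave differently.
@dataclass
class PDFCharData:
    char: str
    left: int
    right: int
    top: int
    bottom: int
    font_name: str
    font_size: float
    font_weight: int
    font_flags: int
    font_stroke_color: tuple[int, int, int, int]
    font_fill_color: tuple[int, int, int, int]
    page_id: int


def box_size(c: PDFCharData) -> tuple[int, int]:
    """Return (width, height) of the character's bounding box.

    abs() is used because the documentation does not state whether the
    y axis grows upward or downward.
    """
    return abs(c.right - c.left), abs(c.bottom - c.top)


# Invented sample values, for illustration only.
sample = PDFCharData(
    char="A", left=10, right=18, top=100, bottom=112,
    font_name="Helvetica", font_size=12.0, font_weight=400, font_flags=0,
    font_stroke_color=(0, 0, 0, 255), font_fill_color=(0, 0, 0, 255),
    page_id=0,
)
print(box_size(sample))  # (8, 12)
```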
- compress_pdf(pdf_data, image_quality=85, force_source_text_compression=False, disable_source_text=True)
Compresses each page of a provided PDF buffer.
- Parameters:
pdf_data (BinaryIO|bytes) – The input PDF as bytes.
image_quality (int, default: 85) – Compression quality (70-100 for most JPG images).
force_source_text_compression (bool, default: False) – If true, attempts to re-write detected text.
disable_source_text (bool, default: True) – If true, doesn’t re-apply source text to the output PDF.
- Return type:
bytes
- Returns:
Compressed PDF as bytes.
- extract_text_from_pdf(pdf_bytes)
Extracts the raw text from a given PDF’s bytes along with font data.
- Parameters:
pdf_bytes (bytes) – Raw bytes representation of a PDF file.
- Return type:
list[list[PDFCharData]]
- Returns:
A nested list of PDFCharData objects, one per character read.
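The nested return value can be turned back into plain strings by joining the `char` field of each entry. The sketch below is a minimal example under stated assumptions: `CharStub` is a trimmed, hypothetical stand-in for `PDFCharData`, `chars_to_text` is an illustrative helper, and the sample data is hand-built rather than produced by `extract_text_from_pdf`. What each inner list represents (for instance, one page) is not stated in the documentation, so the helper simply treats each one as a group.

```python
from dataclasses import dataclass


# Trimmed, hypothetical stand-in for PDFCharData: only the fields this
# example needs. The real class carries bounds and font data as well.
@dataclass
class CharStub:
    char: str
    page_id: int


def chars_to_text(groups: list[list[CharStub]]) -> list[str]:
    """Join the characters of each inner list into one string."""
    return ["".join(c.char for c in group) for group in groups]


# Hand-built data shaped like the documented return type.
result = [
    [CharStub("H", 0), CharStub("i", 0)],
    [CharStub("!", 1)],
]
print(chars_to_text(result))  # ['Hi', '!']
```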
- lerp(start, end, t)
Performs linear interpolation between two numbers.
- Parameters:
start (float) – The starting value.
end (float) – The ending value.
t (float) – The interpolation factor (0 to 1).
- Return type:
float
- Returns:
The interpolated value.
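Linear interpolation is conventionally computed as `start + (end - start) * t`. The sketch below uses that standard formula; it is assumed, not confirmed, that the library's `lerp` is implemented the same way, and in particular whether it clamps `t` to the 0 to 1 range is not documented.

```python
def lerp(start: float, end: float, t: float) -> float:
    """Standard linear interpolation; t is not clamped here."""
    return start + (end - start) * t


print(lerp(0.0, 10.0, 0.5))  # 5.0
print(lerp(2.0, 4.0, 0.0))   # 2.0
print(lerp(2.0, 4.0, 1.0))   # 4.0
```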
- pdf_has_source_text(pdf_bytes)
Checks if the provided PDF bytes contain source text.
- Parameters:
pdf_bytes (bytes) – Raw bytes representation of a PDF file.
- Return type:
bool
- Returns:
True if source text is found, False otherwise.