Package com.mindee.extraction
Class PDFExtractor
java.lang.Object
  com.mindee.extraction.PDFExtractor
public class PDFExtractor extends Object
PDF extraction class.
-
-
Constructor Summary
Constructors
Constructor - Description
PDFExtractor(LocalInputSource source) - Init from a LocalInputSource.
PDFExtractor(String filePath) - Init from a path.
-
Method Summary
Modifier and Type - Method - Description
static BufferedImage - byteArrayToBufferedImage(byte[] byteArray) - Converts an array to a buffered image.
List<ExtractedPDF> - extractInvoices(List<InvoiceSplitterV1InvoicePageGroup> pageIndexes) - Extract invoices from the given page indexes (from an invoice-splitter prediction).
List<ExtractedPDF> - extractInvoices(List<InvoiceSplitterV1InvoicePageGroup> pageIndexes, boolean strict) - Extract invoices from the given page indexes (from an invoice-splitter prediction).
List<ExtractedPDF> - extractSubDocuments(List<List<Integer>> pageIndexes) - Given a list of page indexes, extracts the corresponding documents.
int - getPageCount()
-
-
-
Constructor Detail
-
PDFExtractor
public PDFExtractor(String filePath) throws IOException
Init from a path.
Parameters:
filePath - Path to the file.
Throws:
IOException - If the file can't be accessed.
-
PDFExtractor
public PDFExtractor(LocalInputSource source) throws IOException
Init from a LocalInputSource.
Parameters:
source - The local source.
Throws:
IOException - If the file can't be accessed.
-
-
Method Detail
-
getPageCount
public int getPageCount()
Returns:
The number of pages in the file.
-
byteArrayToBufferedImage
public static BufferedImage byteArrayToBufferedImage(byte[] byteArray) throws IOException
Converts a byte array to a buffered image.
Parameters:
byteArray - Raw byte array.
Returns:
A valid BufferedImage.
Throws:
IOException - If the file can't be accessed.
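As an illustration of the conversion described above, here is a minimal sketch that decodes a byte array into a BufferedImage using only the JDK's ImageIO. It is not necessarily how the library implements byteArrayToBufferedImage; the class name ByteArrayImageSketch and the method name toBufferedImage are hypothetical and exist only for this example.

```java
import java.awt.image.BufferedImage;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import javax.imageio.ImageIO;

// Hypothetical helper class, not part of com.mindee.extraction.
public class ByteArrayImageSketch {

  // Decodes raw image bytes (PNG, JPEG, ...) into a BufferedImage.
  static BufferedImage toBufferedImage(byte[] byteArray) throws IOException {
    try (ByteArrayInputStream in = new ByteArrayInputStream(byteArray)) {
      BufferedImage image = ImageIO.read(in);
      // ImageIO.read returns null when no registered reader can decode the data.
      if (image == null) {
        throw new IOException("Byte array is not a supported image format.");
      }
      return image;
    }
  }

  public static void main(String[] args) throws IOException {
    // Build a small PNG in memory so the example needs no file on disk.
    BufferedImage source = new BufferedImage(4, 3, BufferedImage.TYPE_INT_RGB);
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    ImageIO.write(source, "png", out);

    BufferedImage decoded = toBufferedImage(out.toByteArray());
    System.out.println(decoded.getWidth() + "x" + decoded.getHeight()); // prints "4x3"
  }
}
```

With the library itself, the equivalent call would be the static method documented above: PDFExtractor.byteArrayToBufferedImage(byteArray).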
-
extractSubDocuments
public List<ExtractedPDF> extractSubDocuments(List<List<Integer>> pageIndexes) throws IOException
Given a list of page indexes, extracts the corresponding documents.
Parameters:
pageIndexes - List of page indexes.
Returns:
A list of extracted files.
Throws:
IOException - If the file can't be accessed.
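The pageIndexes argument is a list of page groups, one inner list per document to extract. The sketch below shows one way such a structure might be built, splitting a file into consecutive groups of a fixed size. It assumes zero-based page indexes, which this reference does not state; the class PageGroupSketch and the method chunkPages are hypothetical and not part of the library.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper class, not part of com.mindee.extraction.
public class PageGroupSketch {

  // Splits pages 0..pageCount-1 (zero-based, an assumption) into
  // consecutive groups of at most groupSize pages each.
  static List<List<Integer>> chunkPages(int pageCount, int groupSize) {
    if (pageCount < 0 || groupSize < 1) {
      throw new IllegalArgumentException("pageCount must be >= 0 and groupSize >= 1");
    }
    List<List<Integer>> groups = new ArrayList<>();
    for (int start = 0; start < pageCount; start += groupSize) {
      List<Integer> group = new ArrayList<>();
      int end = Math.min(start + groupSize, pageCount);
      for (int page = start; page < end; page++) {
        group.add(page);
      }
      groups.add(group);
    }
    return groups;
  }

  public static void main(String[] args) {
    // A 5-page file split into 2-page documents; the last group holds the remainder.
    System.out.println(chunkPages(5, 2)); // prints "[[0, 1], [2, 3], [4]]"
  }
}
```

Given a PDFExtractor instance named extractor, the result could then be passed along as extractor.extractSubDocuments(PageGroupSketch.chunkPages(extractor.getPageCount(), 2)).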
-
extractInvoices
public List<ExtractedPDF> extractInvoices(List<InvoiceSplitterV1InvoicePageGroup> pageIndexes) throws IOException
Extracts invoices from the given page indexes (from an invoice-splitter prediction).
Parameters:
pageIndexes - List of page index groups from an invoice-splitter prediction.
Returns:
A list of extracted files.
Throws:
IOException - If the file can't be accessed.
-
extractInvoices
public List<ExtractedPDF> extractInvoices(List<InvoiceSplitterV1InvoicePageGroup> pageIndexes, boolean strict) throws IOException
Extracts invoices from the given page indexes (from an invoice-splitter prediction).
Parameters:
pageIndexes - List of page index groups from an invoice-splitter prediction.
strict - Whether the extraction should strictly follow the confidence scores or not.
Returns:
A list of extracted files.
Throws:
IOException - If the file can't be accessed.
-
-