Package com.mindee.extraction
Class PDFExtractor
java.lang.Object
    com.mindee.extraction.PDFExtractor
public class PDFExtractor extends Object
PDF extraction class.
Constructor Summary
Constructor and Description
PDFExtractor(LocalInputSource source)
Init from a LocalInputSource.
PDFExtractor(String filePath)
Init from a path.
-
Method Summary
Modifier and Type, Method and Description
static BufferedImage byteArrayToBufferedImage(byte[] byteArray)
Converts an array to a buffered image.
List<ExtractedPDF> extractInvoices(List<InvoiceSplitterV1Document.PageIndexes> pageIndexes)
Extract invoices from the given page indexes (from an invoice-splitter prediction).
List<ExtractedPDF> extractInvoices(List<InvoiceSplitterV1Document.PageIndexes> pageIndexes, boolean strict)
Extract invoices from the given page indexes (from an invoice-splitter prediction).
List<ExtractedPDF> extractSubDocuments(List<List<Integer>> pageIndexes)
Given a list of page indexes, extracts the corresponding documents.
int getPageCount()
Constructor Detail
-
PDFExtractor
public PDFExtractor(String filePath) throws IOException
Init from a path.
Parameters:
filePath - Path to the file.
Throws:
IOException - Throws if the file can't be accessed.
-
PDFExtractor
public PDFExtractor(LocalInputSource source) throws IOException
Init from a LocalInputSource.
Parameters:
source - The local source.
Throws:
IOException - Throws if the file can't be accessed.
Method Detail
-
getPageCount
public int getPageCount()
Returns:
The number of pages in the file.
-
byteArrayToBufferedImage
public static BufferedImage byteArrayToBufferedImage(byte[] byteArray) throws IOException
Converts an array to a buffered image.
Parameters:
byteArray - Raw byte array.
Returns:
A valid ImageIO buffer.
Throws:
IOException - Throws if the file can't be accessed.
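The conversion described above can be approximated with the JDK alone. The following is a minimal sketch, assuming the method decodes encoded image bytes (PNG, JPEG, and so on) through `javax.imageio.ImageIO`; the class `ByteArrayImageSketch` and its helpers `toBufferedImage` and `samplePngBytes` are hypothetical names used for illustration and are not part of the library.

```java
import java.awt.image.BufferedImage;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import javax.imageio.ImageIO;

public class ByteArrayImageSketch {

  // Hypothetical stand-in for PDFExtractor.byteArrayToBufferedImage:
  // decodes encoded image bytes with the JDK's ImageIO.
  static BufferedImage toBufferedImage(byte[] byteArray) throws IOException {
    try (ByteArrayInputStream stream = new ByteArrayInputStream(byteArray)) {
      // ImageIO.read returns null when no registered reader understands the bytes.
      return ImageIO.read(stream);
    }
  }

  // Encodes a small blank image as PNG so the example needs no file on disk.
  static byte[] samplePngBytes(int width, int height) throws IOException {
    BufferedImage image = new BufferedImage(width, height, BufferedImage.TYPE_INT_RGB);
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    ImageIO.write(image, "png", out);
    return out.toByteArray();
  }

  public static void main(String[] args) throws IOException {
    BufferedImage decoded = toBufferedImage(samplePngBytes(4, 3));
    System.out.println(decoded.getWidth() + "x" + decoded.getHeight()); // prints 4x3
  }
}
```

Whether the library's own method returns `null` or throws on undecodable input is not stated in this page, so callers may want to check the result before using it.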
-
extractSubDocuments
public List<ExtractedPDF> extractSubDocuments(List<List<Integer>> pageIndexes) throws IOException
Given a list of page indexes, extracts the corresponding documents.
Parameters:
pageIndexes - List of page indexes.
Returns:
A list of extracted files.
Throws:
IOException - Throws if the file can't be accessed.
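The `pageIndexes` argument is a list of lists, where each inner list names the pages of one sub-document. The sketch below shows one way such a structure could be built with the JDK alone. It assumes zero-based page indexes, which this page does not confirm, and `PageIndexGroups.chunk` is a hypothetical helper, not a library method.

```java
import java.util.ArrayList;
import java.util.List;

public class PageIndexGroups {

  // Hypothetical helper (not part of the library): splits pages 0..pageCount-1
  // into consecutive groups of at most groupSize pages each.
  // Assumption: page indexes are zero-based.
  static List<List<Integer>> chunk(int pageCount, int groupSize) {
    if (pageCount < 0 || groupSize <= 0) {
      throw new IllegalArgumentException("pageCount must be >= 0 and groupSize > 0");
    }
    List<List<Integer>> groups = new ArrayList<>();
    for (int start = 0; start < pageCount; start += groupSize) {
      List<Integer> group = new ArrayList<>();
      int end = Math.min(start + groupSize, pageCount);
      for (int page = start; page < end; page++) {
        group.add(page);
      }
      groups.add(group);
    }
    return groups;
  }

  public static void main(String[] args) {
    // A 5-page file split into groups of 2 pages.
    System.out.println(chunk(5, 2)); // prints [[0, 1], [2, 3], [4]]
  }
}
```

With the library on the classpath, the result would presumably be passed along the lines of `extractor.extractSubDocuments(PageIndexGroups.chunk(extractor.getPageCount(), 2))`, yielding one `ExtractedPDF` per group.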
-
extractInvoices
public List<ExtractedPDF> extractInvoices(List<InvoiceSplitterV1Document.PageIndexes> pageIndexes) throws IOException
Extract invoices from the given page indexes (from an invoice-splitter prediction).
Parameters:
pageIndexes - List of page indexes.
Returns:
A list of extracted files.
Throws:
IOException - Throws if the file can't be accessed.
-
extractInvoices
public List<ExtractedPDF> extractInvoices(List<InvoiceSplitterV1Document.PageIndexes> pageIndexes, boolean strict) throws IOException
Extract invoices from the given page indexes (from an invoice-splitter prediction).
Parameters:
pageIndexes - List of page indexes.
strict - Whether the extraction should strictly follow the confidence scores or not.
Returns:
A list of extracted files.
Throws:
IOException - Throws if the file can't be accessed.