Class PDFExtractor


  • public class PDFExtractor
    extends Object
    PDF extraction class.
    • Method Detail

      • getPageCount

        public int getPageCount()
        Returns:
        The number of pages in the file.
      • byteArrayToBufferedImage

        public static BufferedImage byteArrayToBufferedImage(byte[] byteArray)
                                                      throws IOException
        Converts a byte array to a buffered image.
        Parameters:
        byteArray - Raw byte array.
        Returns:
        The BufferedImage decoded from the byte array.
        Throws:
        IOException - Thrown if the byte array can't be read as an image.
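        A minimal sketch of this kind of conversion, assuming the bytes hold an encoded image (PNG, JPEG, and so on) and that decoding goes through the JDK's ImageIO. The class ImageDecodeSketch and its methods are hypothetical stand-ins for illustration, not the library's implementation.

```java
import java.awt.image.BufferedImage;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import javax.imageio.ImageIO;

// Hypothetical stand-in for illustration; not part of the documented API.
final class ImageDecodeSketch {

    // Decodes encoded image bytes into a BufferedImage using ImageIO.
    static BufferedImage decode(byte[] byteArray) throws IOException {
        try (ByteArrayInputStream in = new ByteArrayInputStream(byteArray)) {
            BufferedImage image = ImageIO.read(in);
            if (image == null) {
                // ImageIO.read returns null when no registered reader
                // recognizes the data, so surface that as an IOException.
                throw new IOException("Unsupported or corrupt image data");
            }
            return image;
        }
    }

    // Encodes an image as PNG bytes; used here only to produce sample input.
    static byte[] encodePng(BufferedImage image) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        ImageIO.write(image, "png", out);
        return out.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        BufferedImage source = new BufferedImage(4, 3, BufferedImage.TYPE_INT_RGB);
        BufferedImage decoded = decode(encodePng(source));
        System.out.println(decoded.getWidth() + "x" + decoded.getHeight());
    }
}
```

        The null check matters because ImageIO.read signals an unrecognized format by returning null rather than by throwing.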
      • extractSubDocuments

        public List<ExtractedPDF> extractSubDocuments(List<List<Integer>> pageIndexes)
                                               throws IOException
        Given a list of page indexes, extracts the corresponding documents.
        Parameters:
        pageIndexes - List of page-index lists, one inner list per sub-document to extract.
        Returns:
        A list of extracted files.
        Throws:
        IOException - Thrown if the file can't be accessed.
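        To illustrate the shape of the pageIndexes argument, the sketch below groups a stand-in list of page labels by a nested index list. It assumes zero-based indexes and one output group per inner list, in the order given. PageGroupingSketch and selectGroups are hypothetical names used only for this illustration; they do not call the library.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for illustration; not part of the documented API.
final class PageGroupingSketch {

    // Returns one list of pages per index group, in the order given.
    static <T> List<List<T>> selectGroups(List<T> pages, List<List<Integer>> pageIndexes) {
        List<List<T>> groups = new ArrayList<>();
        for (List<Integer> indexes : pageIndexes) {
            List<T> group = new ArrayList<>();
            for (int index : indexes) {
                // Assumption: indexes are zero-based. List.get throws
                // IndexOutOfBoundsException for an index outside the document.
                group.add(pages.get(index));
            }
            groups.add(group);
        }
        return groups;
    }

    public static void main(String[] args) {
        // Strings stand in for the pages of a five-page file.
        List<String> pages = Arrays.asList("p0", "p1", "p2", "p3", "p4");
        List<List<Integer>> pageIndexes = Arrays.asList(
            Arrays.asList(0, 1),
            Arrays.asList(2),
            Arrays.asList(3, 4));
        System.out.println(selectGroups(pages, pageIndexes));
        // prints [[p0, p1], [p2], [p3, p4]]
    }
}
```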
      • extractInvoices

        public List<ExtractedPDF> extractInvoices(List<InvoiceSplitterV1Document.PageIndexes> pageIndexes,
                                                  boolean strict)
                                           throws IOException
        Extracts invoices from the given page indexes (from an invoice-splitter prediction).
        Parameters:
        pageIndexes - List of page-index groups from an invoice-splitter prediction.
        strict - Whether the extraction should strictly follow the confidence scores.
        Returns:
        A list of extracted files.
        Throws:
        IOException - Thrown if the file can't be accessed.