Module: Mindee::Image::ImageExtractor

Defined in:
lib/mindee/image/image_extractor.rb

Overview

Wrapper module for image extraction.

Class Method Details

.attach_image_as_new_file(input_buffer, format: 'jpg') ⇒ Origami::PDF

Attaches an image as a new page in a PdfDocument object.

Parameters:

  • input_buffer (StringIO)

    Input buffer. Only supports JPEG.

  • format (String) (defaults to: 'jpg')

    Format the image is converted to before it is turned into a PDF.

Returns:

  • (Origami::PDF)

    A PdfDocument handle.



# File 'lib/mindee/image/image_extractor.rb', line 19

def self.attach_image_as_new_file(input_buffer, format: 'jpg')
  magick_image = MiniMagick::Image.read(input_buffer)
  # NOTE: some JPEG images are rendered as three different versions of themselves
  # in the output if the format isn't converted first.
  magick_image.format(format)
  original_density = magick_image.resolution
  # The resolution has to be reduced for the PDF, otherwise the resulting image
  # shrinks. The reason for this is unknown.
  scale_factor = original_density[0].to_f / 4.166666
  magick_image.format('pdf', 0, { density: scale_factor.to_s })
  Origami::PDF.read(StringIO.new(magick_image.to_blob))
end
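
The source does not explain the divisor used for the density. As a purely numerical observation, 4.166666 is close to 300 / 72, so a 300 DPI input ends up with a density of roughly 72. The sketch below reproduces only that arithmetic; `DENSITY_DIVISOR` and `scaled_density` are hypothetical names introduced for illustration and are not part of the library.

```ruby
# Hypothetical helper mirroring the arithmetic in attach_image_as_new_file.
# The divisor is copied from the source; its rationale is not documented there.
DENSITY_DIVISOR = 4.166666

def scaled_density(horizontal_resolution)
  horizontal_resolution.to_f / DENSITY_DIVISOR
end

# A 300 DPI image lands very close to 72, a 72 DPI image close to 17.28.
puts scaled_density(300).round(2)
puts scaled_density(72).round(2)
```

Only the first element of the resolution is used in the original method, so the vertical resolution does not influence the result.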

.create_extracted_image(buffer, file_name, page_id, element_id) ⇒ Object

Generates an ExtractedImage.

Parameters:

  • buffer (StringIO)

    Buffer containing the image.

  • file_name (String)

    Name for the file.

  • page_id (Object)

    ID of the page the file was extracted from.

  • element_id (Object)

    ID of the element of a given page.



# File 'lib/mindee/image/image_extractor.rb', line 95

def self.create_extracted_image(buffer, file_name, page_id, element_id)
  buffer.rewind
  ExtractedImage.new(
    Mindee::Input::Source::BytesInputSource.new(buffer.read.to_s, file_name),
    page_id,
    element_id
  )
end
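
The `buffer.rewind` call matters because a StringIO that has just been written to is positioned at the end of its data. A minimal sketch using only the standard library (the byte string is a placeholder, not real image data):

```ruby
require 'stringio'

buffer = StringIO.new
buffer.write('fake image bytes') # the position is now at the end of the data

at_end = buffer.read             # nothing is left to read from this position
buffer.rewind                    # go back to the start, as the method does
contents = buffer.read.to_s      # now the full content is returned

puts at_end.empty?
puts contents
```

Without the rewind, the input source would be built from an empty string.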

.extract_images_from_polygons(input_source, pdf_stream, page_id, polygons) ⇒ Array<Mindee::Image::ExtractedImage>

Extracts images from their positions on a file (as polygons).

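Parameters:

  • input_source (LocalInputSource)

    Local input source; used to determine the file name and extension.

  • pdf_stream (StringIO)

    Stream containing the page to crop.

  • page_id (Integer)

    ID of the page the images are extracted from.

  • polygons (Array)

    Polygons delimiting the images to extract.

Returns:

  • (Array<Mindee::Image::ExtractedImage>)

    Extracted images, one per polygon.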



# File 'lib/mindee/image/image_extractor.rb', line 52

def self.extract_images_from_polygons(input_source, pdf_stream, page_id, polygons)
  extracted_elements = []

  polygons.each_with_index do |polygon, element_id|
    polygon = ImageUtils.normalize_polygon(polygon)
    page_content = ImageUtils.read_page_content(pdf_stream)

    min_max_x = Geometry.get_min_max_x([
                                         polygon.top_left,
                                         polygon.bottom_right,
                                         polygon.top_right,
                                         polygon.bottom_left,
                                       ])
    min_max_y = Geometry.get_min_max_y([
                                         polygon.top_left,
                                         polygon.bottom_right,
                                         polygon.top_right,
                                         polygon.bottom_left,
                                       ])
    file_extension = ImageUtils.determine_file_extension(input_source)
    cropped_image = ImageUtils.crop_image(page_content, min_max_x, min_max_y)
    if file_extension == 'pdf'
      cropped_image.format('jpg')
    else
      cropped_image.format(file_extension.to_s)
    end

    buffer = StringIO.new
    ImageUtils.write_image_to_buffer(cropped_image, buffer)
    file_name = "#{input_source.filename}_page#{page_id}-#{element_id}.#{file_extension}"

    extracted_elements << create_extracted_image(buffer, file_name, page_id, element_id)
  end

  extracted_elements
end
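
The cropping step reduces each polygon to an axis-aligned bounding box and then builds a file name from the source name, page ID and element index. The sketch below illustrates both steps in plain Ruby. `Point`, `min_max` and `extracted_file_name` are hypothetical stand-ins: the actual return types of `Geometry.get_min_max_x` and `Geometry.get_min_max_y` are not shown on this page.

```ruby
# Hypothetical point type; the library's own polygon classes are not used here.
Point = Struct.new(:x, :y)

# Returns [min, max] of the given coordinate (:x or :y) over all points.
def min_max(points, axis)
  values = points.map { |point| point.public_send(axis) }
  [values.min, values.max]
end

# Same interpolation pattern as the file_name line in the method above.
def extracted_file_name(filename, page_id, element_id, extension)
  "#{filename}_page#{page_id}-#{element_id}.#{extension}"
end

corners = [
  Point.new(0.1, 0.2),   # top left
  Point.new(0.6, 0.25),  # top right
  Point.new(0.58, 0.7),  # bottom right
  Point.new(0.12, 0.68), # bottom left
]

min_max_x = min_max(corners, :x)
min_max_y = min_max(corners, :y)
file_name = extracted_file_name('invoice.pdf', 0, 2, 'pdf')

puts min_max_x.inspect
puts min_max_y.inspect
puts file_name
```

Note that, as the method is written, a crop taken from a PDF is converted to 'jpg' while the generated file name still ends with the source extension.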

.extract_multiple_images_from_source(input_source, page_id, polygons) ⇒ Array<Mindee::Image::ExtractedImage>

Extracts multiple images from a given local input source.

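Parameters:

  • input_source (LocalInputSource)

    Local input source.

  • page_id (Integer)

    ID of the page to extract from.

  • polygons (Array)

    Polygons delimiting the images to extract.

Returns:

  • (Array<Mindee::Image::ExtractedImage>)

    Extracted images, one per polygon.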



# File 'lib/mindee/image/image_extractor.rb', line 38

def self.extract_multiple_images_from_source(input_source, page_id, polygons)
  new_stream = load_input_source_pdf_page_as_stringio(input_source, page_id)
  new_stream.seek(0)

  extract_images_from_polygons(input_source, new_stream, page_id, polygons)
end

.load_input_source_pdf_page_as_stringio(input_file, page_id) ⇒ StringIO

Loads a single page from an image file or a PDF document.

Parameters:

  • input_file (LocalInputSource)

    Local input.

  • page_id (Integer)

    Page ID.

Returns:

  • (StringIO)

    A stream containing the requested page, or the whole image for non-PDF inputs.



# File 'lib/mindee/image/image_extractor.rb', line 109

def self.load_input_source_pdf_page_as_stringio(input_file, page_id)
  input_file.io_stream.rewind
  if input_file.pdf?
    Mindee::PDF::PDFProcessor.get_page(Origami::PDF.read(input_file.io_stream), page_id)
  else
    input_file.io_stream
  end
end
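
The branching above can be sketched without the library. In this illustration `FakeSource`, `page_stream` and the `loader` lambda are hypothetical stand-ins for the local input source, the method itself and `Mindee::PDF::PDFProcessor.get_page`; only the control flow is taken from the source.

```ruby
require 'stringio'

# Hypothetical stand-in modeling only the two members the method relies on.
FakeSource = Struct.new(:io_stream, :is_pdf) do
  def pdf?
    is_pdf
  end
end

# Rewinds the stream, then either delegates to a page loader (PDF input)
# or returns the stream unchanged (image input).
def page_stream(source, page_id, pdf_page_loader)
  source.io_stream.rewind
  if source.pdf?
    pdf_page_loader.call(source.io_stream, page_id)
  else
    source.io_stream
  end
end

# Placeholder loader: a real one would extract the requested page from the PDF.
loader = ->(io, page_id) { StringIO.new("page #{page_id} of #{io.read}") }

image_source = FakeSource.new(StringIO.new('img'), false)
image_source.io_stream.read # move the position to the end on purpose
image_result = page_stream(image_source, 0, loader).read

pdf_source = FakeSource.new(StringIO.new('pdf bytes'), true)
pdf_result = page_stream(pdf_source, 1, loader).read

puts image_result
puts pdf_result
```

For image inputs the page ID is ignored and the caller receives the source's own stream rather than a copy, so reads on the result also advance the position of the original stream.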