Module: Mindee::PDF::PDFTools

Defined in:
lib/mindee/pdf/pdf_tools.rb

Overview

Collection of miscellaneous PDF operations, as well as some monkey-patching for Origami.

Class Method Details

.add_content_to_page(page, xobject_name, width, height) ⇒ Object

Adds a content stream to the specified PDF page to display an image XObject.

Parameters:

  • page (Origami::Page)

    The PDF page to which content will be added.

  • xobject_name (String)

    The name identifying the XObject.

  • width (Integer)

    The width for the transformation matrix.

  • height (Integer)

    The height for the transformation matrix.



# File 'lib/mindee/pdf/pdf_tools.rb', line 138

def self.add_content_to_page(page, xobject_name, width, height)
  content = "q\n#{width} 0 0 #{height} 0 0 cm\n/#{xobject_name} Do\nQ\n"
  content_stream = Origami::Stream.new(content)
  page.Contents = content_stream
end
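
As a rough illustration of what this method writes, the sketch below builds the same operator sequence as a plain string, without Origami. The helper name `image_content_stream` and the XObject name `Im0` are hypothetical and exist only for this example. In the PDF content-stream syntax, `q` and `Q` save and restore the graphics state, `cm` sets the transformation matrix, and `Do` paints the named XObject.

```ruby
# Hypothetical helper (not part of Mindee): returns the operator sequence
# that add_content_to_page wraps in an Origami::Stream.
def image_content_stream(xobject_name, width, height)
  "q\n#{width} 0 0 #{height} 0 0 cm\n/#{xobject_name} Do\nQ\n"
end

# Scale the unit image square to 595 x 842 points and paint /Im0.
puts image_content_stream('Im0', 595, 842)
```

Because the method assigns `page.Contents` directly, whatever content stream the page referenced before is replaced rather than appended to.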

.create_xobject(image) ⇒ Origami::Graphics::ImageXObject

Creates an image XObject from the provided image.

Converts the given image to a binary stream using Mindee’s image utilities, then creates an Origami::Graphics::ImageXObject with a JPEG filter.

Parameters:

  • image (MiniMagick::Image)

    An image object with the necessary data and structure.

Returns:

  • (Origami::Graphics::ImageXObject)

    The created image XObject.



# File 'lib/mindee/pdf/pdf_tools.rb', line 89

def self.create_xobject(image)
  image_io = Mindee::Image::ImageUtils.image_to_stringio(image)
  Origami::Graphics::ImageXObject.from_image_file(image_io, 'jpg')
end

.determine_colorspace(image) ⇒ Symbol

Determines the colorspace for an image based on its metadata.

Parameters:

  • image (Hash)

    The image data hash.

Returns:

  • (Symbol)

    One of :DeviceCMYK, :DeviceGray or :DeviceRGB.



# File 'lib/mindee/pdf/pdf_tools.rb', line 123

def self.determine_colorspace(image)
  colorspace = image.data['colorspace']
  case colorspace
  when 'CMYK' then :DeviceCMYK
  when 'Gray', 'PseudoClass Gray' then :DeviceGray
  else :DeviceRGB
  end
end
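
The mapping can be tried in isolation with a stand-in object. This is a minimal sketch: `fake_image` and `colorspace_for` are hypothetical names, and the stand-in models only the `data` hash that the method reads.

```ruby
# Hypothetical stand-in for an image: exposes only a `data` hash.
def fake_image(colorspace)
  Struct.new(:data).new({ 'colorspace' => colorspace })
end

# Mirrors the case statement in determine_colorspace.
def colorspace_for(image)
  case image.data['colorspace']
  when 'CMYK' then :DeviceCMYK
  when 'Gray', 'PseudoClass Gray' then :DeviceGray
  else :DeviceRGB
  end
end

p colorspace_for(fake_image('CMYK'))
p colorspace_for(fake_image('PseudoClass Gray'))
p colorspace_for(fake_image('sRGB'))
```

Any value other than the three listed strings, including a missing `colorspace` key, falls through to `:DeviceRGB`.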

.determine_filter(image) ⇒ Symbol

Determines the appropriate filter for an image based on its properties.

Parameters:

  • image (Hash)

    The image data hash containing properties.

Returns:

  • (Symbol)

    One of :FlateDecode, :LZWDecode or :DCTDecode.



# File 'lib/mindee/pdf/pdf_tools.rb', line 110

def self.determine_filter(image)
  filter = image.data['properties']['filter']
  case filter
  when %r{Zip}i then :FlateDecode
  when %r{LZW}i then :LZWDecode
  else :DCTDecode
  end
end
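
A small sketch of the same lookup, assuming a stand-in object that carries only the nested `data['properties']['filter']` value (`fake_image_with_filter` and `filter_for` are hypothetical names). It shows that the match is case-insensitive and that an unrecognised or missing filter value falls back to `:DCTDecode`.

```ruby
# Hypothetical stand-in for an image: only the nested hash read by
# determine_filter is modelled.
def fake_image_with_filter(filter)
  Struct.new(:data).new({ 'properties' => { 'filter' => filter } })
end

# Mirrors the case statement in determine_filter.
def filter_for(image)
  case image.data['properties']['filter']
  when /Zip/i then :FlateDecode
  when /LZW/i then :LZWDecode
  else :DCTDecode
  end
end

p filter_for(fake_image_with_filter('zip'))
p filter_for(fake_image_with_filter('LZW'))
p filter_for(fake_image_with_filter(nil))
```

Note that the lookup indexes `data['properties']` unconditionally, so an image whose `data` hash has no `properties` entry would raise instead of falling back.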

.process_image_xobject(image_data, image_quality, width, height) ⇒ Origami::Graphics::ImageXObject

Processes an image into an image XObject for PDF embedding.

Parameters:

  • image_data (Hash)

    The raw image data.

  • image_quality (Integer)

    The quality setting for image compression.

  • width (Numeric)

    The desired width of the output image.

  • height (Numeric)

    The desired height of the output image.

Returns:

  • (Origami::Graphics::ImageXObject)

    The resulting image XObject.



# File 'lib/mindee/pdf/pdf_tools.rb', line 161

def self.process_image_xobject(image_data, image_quality, width, height)
  compressed_data = Image::ImageCompressor.compress_image(
    image_data,
    quality: image_quality,
    max_width: width,
    max_height: height
  )

  new_image = Origami::Graphics::ImageXObject.new
  new_image.data = compressed_data
  new_image.Width = width
  new_image.Height = height
  new_image.ColorSpace = :DeviceRGB
  new_image.BitsPerComponent = 8

  new_image
end

.set_page_dimensions(page, width, height) ⇒ Object

Sets the dimensions for the specified PDF page.

Parameters:

  • page (Origami::Page)

    The PDF page whose dimensions are being set.

  • width (Numeric)

    The target width of the page.

  • height (Numeric)

    The target height of the page.



# File 'lib/mindee/pdf/pdf_tools.rb', line 149

def self.set_page_dimensions(page, width, height)
  page[:MediaBox] = [0, 0, width, height]
  page[:CropBox] = [0, 0, width, height]
end
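
For illustration, the sketch below applies the same two assignments to a plain Hash standing in for an `Origami::Page` (an assumption made so the example runs without Origami; `apply_page_dimensions` is a hypothetical name). Both boxes are PDF rectangles of the form `[llx, lly, urx, ury]`, anchored at the origin.

```ruby
# Hypothetical stand-in: a plain Hash plays the role of the page dictionary.
def apply_page_dimensions(page, width, height)
  page[:MediaBox] = [0, 0, width, height]
  page[:CropBox] = [0, 0, width, height]
  page
end

# 595 x 842 points is roughly an A4 page.
p apply_page_dimensions({}, 595, 842)
```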

.set_xobject_properties(xobject, image) ⇒ Object

Sets properties on the provided image XObject based on image metadata.

Parameters:

  • xobject (Origami::Graphics::ImageXObject)

    The image XObject to update.

  • image (Hash)

    A hash containing image metadata (such as width, height, properties, etc.).



# File 'lib/mindee/pdf/pdf_tools.rb', line 98

def self.set_xobject_properties(xobject, image)
  xobject.dictionary[:BitsPerComponent] = 8
  xobject.dictionary[:Filter] = determine_filter(image)
  xobject.dictionary[:Width] = image[:width]
  xobject.dictionary[:Height] = image[:height]
  xobject.dictionary[:ColorSpace] = determine_colorspace(image)
end

.source_text?(pdf_data) ⇒ bool

Checks whether the file contains source text. Returns false if the file isn’t a valid PDF.

Parameters:

  • pdf_data (StringIO)

    A binary-encoded stream representing the PDF file.

Returns:

  • (bool)

    True if the PDF has source text, false otherwise.



# File 'lib/mindee/pdf/pdf_tools.rb', line 57

def self.source_text?(pdf_data)
  pdf_data.rewind
  pdf = Origami::PDF.read(pdf_data)

  pdf.each_page do |page|
    next unless page[:Contents]

    # A page may reference a single stream or an array of streams.
    contents = page[:Contents].solve
    contents = [contents] unless contents.is_a?(Origami::Array)

    contents.each do |stream_ref|
      stream = stream_ref.solve
      return true if stream_has_text?(stream)
    end
  end

  false
rescue Origami::InvalidPDFError
  false
end

.stream_has_text?(stream) ⇒ bool

Checks a PDF’s stream content for text operators. See opensource.adobe.com/dc-acrobat-sdk-docs/pdfstandards/PDF32000_2008.pdf, pages 243-251.

Parameters:

  • stream (StringIO)

    Stream object from a PDF’s page.

Returns:

  • (bool)

    True if a text operator is found in the stream.



# File 'lib/mindee/pdf/pdf_tools.rb', line 46

def self.stream_has_text?(stream)
  data = stream.data
  return false if data.nil? || data.empty?

  text_operators = ['Tc', 'Tw', 'Th', 'TL', 'Tf', 'Tk', 'Tr', 'Tm', 'T*', 'Tj', 'TJ', "'", '"']
  text_operators.any? { |op| data.include?(op) }
end
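
The check is a plain substring search over the stream data, which the following sketch reproduces with a stand-in stream object (`fake_stream` and `text_operator_in?` are hypothetical names; only the `data` accessor is modelled). It also shows a consequence of substring matching: a stream that merely contains one of the two-letter sequences, for example inside a resource name, is reported as having text.

```ruby
# Hypothetical stand-in for a stream: exposes only `data`.
def fake_stream(data)
  Struct.new(:data).new(data)
end

# Mirrors stream_has_text?: true if any listed sequence occurs anywhere.
def text_operator_in?(stream)
  data = stream.data
  return false if data.nil? || data.empty?

  operators = ['Tc', 'Tw', 'Th', 'TL', 'Tf', 'Tk', 'Tr', 'Tm', 'T*', 'Tj', 'TJ', "'", '"']
  operators.any? { |op| data.include?(op) }
end

image_only = fake_stream("q\n595 0 0 842 0 0 cm\n/Im0 Do\nQ\n")
with_text  = fake_stream("BT\n/F1 12 Tf\n(Hello) Tj\nET\n")
lookalike  = fake_stream("q\n/TfLogo Do\nQ\n") # no text, but the name contains "Tf"

p text_operator_in?(image_only)
p text_operator_in?(with_text)
p text_operator_in?(lookalike)
```

A stricter check would tokenize the stream and compare whole operators, at the cost of more parsing; the substring approach trades some false positives for speed and simplicity.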

Instance Method Details

#to_io_stream(params = {}) ⇒ StringIO

Converts the current PDF document into a binary-encoded StringIO stream.

Parameters:

  • params (Hash) (defaults to: {})

    Optional settings to override default processing flags.

    - :delinearize [bool] (default: true) Whether to convert a linearized PDF to its full form.
    - :recompile [bool] (default: true) Whether to recompile the PDF after processing.
    - :decrypt [bool] (default: false) Whether to attempt to decrypt the PDF.
    - Other keys such as :intent, :rebuild_xrefs, :noindent, and :obfuscate may be modified automatically.

Returns:

  • (StringIO)

    A binary-encoded stream representing the processed PDF.



# File 'lib/mindee/pdf/pdf_tools.rb', line 16

def to_io_stream(params = {})
  options = {
    delinearize: true,
    recompile: true,
    decrypt: false,
    noindent: nil,
  }
  options.update(params)

  if frozen? # incompatible flags with frozen doc (signed)
    options[:recompile] = nil
    options[:rebuild_xrefs] = nil
    options[:noindent] = nil
    options[:obfuscate] = false
  end
  load_all_objects unless @loaded

  intents_as_pdfa1 if options[:intent].to_s =~ %r{pdf[/-]?A1?}i
  delinearize! if options[:delinearize] && linearized?
  compile(options) if options[:recompile]

  io_stream = StringIO.new(output(options))
  io_stream.set_encoding Encoding::BINARY
  io_stream
end
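
Two pieces of this method can be tried without Origami: how caller-supplied params override the defaults, and how the output bytes end up in a binary-encoded StringIO. The sketch below is an approximation under that assumption; `merged_options` and `binary_stream` are hypothetical helper names, and the byte string is a placeholder rather than a real PDF.

```ruby
require 'stringio'

# Hypothetical helper: merges caller params over the same defaults
# that to_io_stream starts from.
def merged_options(params = {})
  options = { delinearize: true, recompile: true, decrypt: false, noindent: nil }
  options.update(params)
end

# Hypothetical helper: wraps raw bytes in a StringIO tagged as binary,
# as the last three lines of to_io_stream do.
def binary_stream(bytes)
  io = StringIO.new(bytes.dup) # dup so a frozen literal is never mutated
  io.set_encoding(Encoding::BINARY)
  io
end

p merged_options(recompile: false)
p binary_stream('%PDF-1.7 placeholder').external_encoding
```

Keys that are not in the defaults, such as `:intent`, are simply added by `update`, which is how they reach the later checks in the method.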