Module: Mindee::PDF::PDFTools

Defined in:
lib/mindee/pdf/pdf_tools.rb

Overview

Collection of miscellaneous PDF operations, as well as some monkey-patching for Origami.

Class Method Details

.add_content_to_page(page, xobject_name, width, height) ⇒ Object

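Sets the content stream of the specified PDF page to one that displays an image XObject. The stream is assigned to the page's Contents entry, so any content previously assigned there is replaced.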

Parameters:

  • page (Origami::Page)

    The PDF page to which content will be added.

  • xobject_name (String)

    The name identifying the XObject.

  • width (Integer)

    The width for the transformation matrix.

  • height (Integer)

    The height for the transformation matrix.

# File 'lib/mindee/pdf/pdf_tools.rb', line 138

def self.add_content_to_page(page, xobject_name, width, height)
  content = "q\n#{width} 0 0 #{height} 0 0 cm\n/#{xobject_name} Do\nQ\n"
  content_stream = Origami::Stream.new(content)
  page.Contents = content_stream
end
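
To make the generated operators concrete, the sketch below rebuilds the same content string without Origami. The helper name image_content_stream and the values Im0, 612 and 792 are illustrative assumptions, not part of the library.

```ruby
# Hypothetical stand-alone helper: produces the same operator sequence as
# add_content_to_page, one operator per line.
def image_content_stream(xobject_name, width, height)
  [
    'q',                              # save the graphics state
    "#{width} 0 0 #{height} 0 0 cm",  # scale the unit square to width x height
    "/#{xobject_name} Do",            # paint the named XObject
    'Q',                              # restore the graphics state
  ].join("\n") + "\n"
end

puts image_content_stream('Im0', 612, 792)
```

Because an image XObject is painted into a 1 x 1 unit square, the cm matrix is what stretches it to the requested width and height.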

.create_xobject(image) ⇒ Origami::Graphics::ImageXObject

Creates an image XObject from the provided image.

Converts the given image to a binary stream using Mindee’s image utilities, then creates an Origami::Graphics::ImageXObject with a JPEG filter.

Parameters:

  • image (MiniMagick::Image)

    An image object with the necessary data and structure.

Returns:

  • (Origami::Graphics::ImageXObject)

    The created image XObject.

# File 'lib/mindee/pdf/pdf_tools.rb', line 89

def self.create_xobject(image)
  image_io = Mindee::Image::ImageUtils.image_to_stringio(image)
  Origami::Graphics::ImageXObject.from_image_file(image_io, 'jpg')
end

.determine_colorspace(image) ⇒ Symbol

Determines the colorspace for an image based on its metadata.

Parameters:

  • image (Hash)

    The image whose metadata is inspected. The colorspace is read from image.data['colorspace'].

Returns:

  • (Symbol)

    One of :DeviceCMYK, :DeviceGray or :DeviceRGB. :DeviceRGB is the fallback for any other colorspace value.

# File 'lib/mindee/pdf/pdf_tools.rb', line 123

def self.determine_colorspace(image)
  colorspace = image.data['colorspace']
  case colorspace
  when 'CMYK' then :DeviceCMYK
  when 'Gray', 'PseudoClass Gray' then :DeviceGray
  else :DeviceRGB
  end
end
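
The mapping can be tried in isolation with a stand-in object. In this minimal sketch, ColorImage and colorspace_for are hypothetical names, and the only assumption about the image is that it responds to data with a hash-like value.

```ruby
# Hypothetical stand-in for the image argument: anything responding to #data.
ColorImage = Struct.new(:data)

# Same case expression as determine_colorspace, copied so the sketch runs alone.
def colorspace_for(image)
  case image.data['colorspace']
  when 'CMYK' then :DeviceCMYK
  when 'Gray', 'PseudoClass Gray' then :DeviceGray
  else :DeviceRGB
  end
end

p colorspace_for(ColorImage.new({ 'colorspace' => 'CMYK' })) # => :DeviceCMYK
p colorspace_for(ColorImage.new({ 'colorspace' => 'Gray' })) # => :DeviceGray
p colorspace_for(ColorImage.new({}))                         # => :DeviceRGB
```

A missing or unrecognized colorspace falls through to :DeviceRGB rather than raising.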

.determine_filter(image) ⇒ Symbol

Determines the appropriate filter for an image based on its properties.

Parameters:

  • image (Hash)

    The image whose properties are inspected. The filter is read from image.data['properties']['filter'].

Returns:

  • (Symbol)

    One of :FlateDecode, :LZWDecode or :DCTDecode. :DCTDecode is the fallback when the filter matches neither Zip nor LZW.

# File 'lib/mindee/pdf/pdf_tools.rb', line 110

def self.determine_filter(image)
  filter = image.data['properties']['filter']
  case filter
  when %r{Zip}i then :FlateDecode
  when %r{LZW}i then :LZWDecode
  else :DCTDecode
  end
end
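
The filter selection is a case-insensitive pattern match, which the following sketch exercises on its own. FilterImage and filter_for are hypothetical names, and the filter strings are made-up sample values rather than output captured from a real image.

```ruby
# Hypothetical stand-in for the image argument: anything responding to #data.
FilterImage = Struct.new(:data)

# Same matching logic as determine_filter, copied so the sketch runs alone.
def filter_for(image)
  case image.data['properties']['filter']
  when /Zip/i then :FlateDecode
  when /LZW/i then :LZWDecode
  else :DCTDecode
  end
end

p filter_for(FilterImage.new({ 'properties' => { 'filter' => 'zip' } }))  # => :FlateDecode
p filter_for(FilterImage.new({ 'properties' => { 'filter' => 'LZW' } }))  # => :LZWDecode
p filter_for(FilterImage.new({ 'properties' => {} }))                     # => :DCTDecode
```

Note that the 'properties' key itself must be present: the method indexes into it before matching, so only the inner 'filter' value may be absent.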

.process_image_xobject(image_data, image_quality, width, height) ⇒ Origami::Graphics::ImageXObject

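Processes an image into an image XObject for PDF embedding. The image is compressed to fit the given quality and maximum dimensions, then wrapped in an XObject whose Width and Height are set to the given values, with a :DeviceRGB colorspace and 8 bits per component.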

Parameters:

  • image_data (Hash)

    The raw image data.

  • image_quality (Integer)

    The quality setting for image compression.

  • width (Numeric)

    The desired width of the output image.

  • height (Numeric)

    The desired height of the output image.

Returns:

  • (Origami::Graphics::ImageXObject)

    The resulting image XObject.

# File 'lib/mindee/pdf/pdf_tools.rb', line 161

def self.process_image_xobject(image_data, image_quality, width, height)
  compressed_data = Image::ImageCompressor.compress_image(
    image_data,
    quality: image_quality,
    max_width: width,
    max_height: height
  )

  new_image = Origami::Graphics::ImageXObject.new
  new_image.data = compressed_data
  new_image.Width = width
  new_image.Height = height
  new_image.ColorSpace = :DeviceRGB
  new_image.BitsPerComponent = 8

  new_image
end

.set_page_dimensions(page, width, height) ⇒ Object

Sets the dimensions for the specified PDF page by assigning the same rectangle to both its MediaBox and its CropBox.

Parameters:

  • page (Origami::Page)

    The PDF page whose dimensions are being set.

  • width (Numeric)

    The target width of the page.

  • height (Numeric)

    The target height of the page.

# File 'lib/mindee/pdf/pdf_tools.rb', line 149

def self.set_page_dimensions(page, width, height)
  page[:MediaBox] = [0, 0, width, height]
  page[:CropBox] = [0, 0, width, height]
end
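
As a rough illustration of the resulting entries, the sketch below applies the same two assignments to a plain Hash standing in for an Origami::Page. The helper name apply_dimensions is hypothetical, and 595 x 842 is simply an approximate A4 size in PDF points (1/72 inch).

```ruby
# Hypothetical helper mirroring set_page_dimensions; `page` only needs to
# support []= here, so a plain Hash is enough for the demonstration.
def apply_dimensions(page, width, height)
  page[:MediaBox] = [0, 0, width, height] # full physical page
  page[:CropBox] = [0, 0, width, height]  # visible region, kept identical
end

page = {}
apply_dimensions(page, 595, 842)
p page[:MediaBox] # => [0, 0, 595, 842]
p page[:CropBox]  # => [0, 0, 595, 842]
```

Both boxes are anchored at the origin, so the page's lower-left corner is always (0, 0).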

.set_xobject_properties(xobject, image) ⇒ Object

Sets properties on the provided image XObject based on image metadata.

Parameters:

  • xobject (Origami::Graphics::ImageXObject)

    The image XObject to update.

  • image (Hash)

    A hash containing image metadata (such as width, height, properties, etc.).

# File 'lib/mindee/pdf/pdf_tools.rb', line 98

def self.set_xobject_properties(xobject, image)
  xobject.dictionary[:BitsPerComponent] = 8
  xobject.dictionary[:Filter] = determine_filter(image)
  xobject.dictionary[:Width] = image[:width]
  xobject.dictionary[:Height] = image[:height]
  xobject.dictionary[:ColorSpace] = determine_colorspace(image)
end

.source_text?(pdf_data) ⇒ bool

Checks whether the file contains source text. Returns false if the file isn’t a valid PDF.

Parameters:

  • pdf_data (StringIO)

    A binary-encoded stream representing the PDF file.

Returns:

  • (bool)

    True if the PDF has source text, false otherwise.

# File 'lib/mindee/pdf/pdf_tools.rb', line 57

def self.source_text?(pdf_data)
  begin
    pdf_data.rewind
    pdf = Origami::PDF.read(pdf_data)

    pdf.each_page do |page|
      next unless page[:Contents]

      contents = page[:Contents].solve
      contents = [contents] unless contents.is_a?(Origami::Array)

      contents.each do |stream_ref|
        stream = stream_ref.solve
        return true if stream_has_text?(stream)
      end
    end

    false
  end

  false
rescue Origami::InvalidPDFError
  false
end

.stream_has_text?(stream) ⇒ bool

Checks a PDF’s stream content for text operators. See opensource.adobe.com/dc-acrobat-sdk-docs/pdfstandards/PDF32000_2008.pdf, pages 243-251.

Parameters:

  • stream (StringIO)

    Stream object from a PDF’s page.

Returns:

  • (bool)

    True if a text operator is found in the stream.

# File 'lib/mindee/pdf/pdf_tools.rb', line 46

def self.stream_has_text?(stream)
  data = stream.data
  return false if data.nil? || data.empty?

  text_operators = ['Tc', 'Tw', 'Th', 'TL', 'Tf', 'Tk', 'Tr', 'Tm', 'T*', 'Tj', 'TJ', "'", '"']
  text_operators.any? { |op| data.include?(op) }
end
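
The check is a plain substring search over the stream data, which the following sketch reproduces with a stand-in object. TextStream, TEXT_OPERATORS and text_operator_present? are hypothetical names, and the sample streams are hand-written for illustration.

```ruby
# Hypothetical stand-in for the stream argument: anything responding to #data.
TextStream = Struct.new(:data)

# Operator list copied from stream_has_text? so the sketch runs alone.
TEXT_OPERATORS = ['Tc', 'Tw', 'Th', 'TL', 'Tf', 'Tk', 'Tr', 'Tm', 'T*', 'Tj', 'TJ', "'", '"'].freeze

def text_operator_present?(stream)
  data = stream.data
  return false if data.nil? || data.empty?

  TEXT_OPERATORS.any? { |op| data.include?(op) }
end

text_stream  = TextStream.new("BT\n/F1 12 Tf\n(Hello) Tj\nET\n")
image_stream = TextStream.new("q\n612 0 0 792 0 0 cm\n/Im0 Do\nQ\n")
quoted       = TextStream.new("% it's only a comment\nq Q\n")

p text_operator_present?(text_stream)  # => true
p text_operator_present?(image_stream) # => false
p text_operator_present?(quoted)       # => true
```

The last case suggests a limitation of substring matching: any stream whose bytes happen to contain one of these sequences is reported as text, even when nothing is drawn with a text operator.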

Instance Method Details

#to_io_stream(params = {}) ⇒ StringIO

Converts the current PDF document into a binary-encoded StringIO stream.

Parameters:

  • params (Hash) (defaults to: {})

    Optional settings to override default processing flags:

      - :delinearize [bool] (default: true) Whether to convert a linearized PDF to its full form.
      - :recompile [bool] (default: true) Whether to recompile the PDF after processing.
      - :decrypt [bool] (default: false) Whether to attempt to decrypt the PDF.
      - Other keys such as :intent, :rebuild_xrefs, :noindent, and :obfuscate may be modified automatically.

Returns:

  • (StringIO)

    A binary-encoded stream representing the processed PDF.

# File 'lib/mindee/pdf/pdf_tools.rb', line 16

def to_io_stream(params = {})
  options = {
    delinearize: true,
    recompile: true,
    decrypt: false,
    noindent: nil,
  }
  options.update(params)

  if frozen? # incompatible flags with frozen doc (signed)
    options[:recompile] = nil
    options[:rebuild_xrefs] = nil
    options[:noindent] = nil
    options[:obfuscate] = false
  end
  load_all_objects unless @loaded

  intents_as_pdfa1 if options[:intent].to_s =~ %r{pdf[/-]?A1?/i}
  delinearize! if options[:delinearize] && linearized?
  compile(options) if options[:recompile]

  io_stream = StringIO.new(output(options))
  io_stream.set_encoding Encoding::BINARY
  io_stream
end
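
The option handling and the final encoding step can be sketched without Origami. In this minimal example, resolve_options and binary_stream are hypothetical helpers, and the frozen: keyword stands in for the document's own frozen? check.

```ruby
require 'stringio'

# Hypothetical helper: merges caller params over the defaults, then applies
# the overrides that to_io_stream uses for a frozen (signed) document.
def resolve_options(params = {}, frozen: false)
  options = { delinearize: true, recompile: true, decrypt: false, noindent: nil }
  options.update(params)

  if frozen
    options[:recompile] = nil
    options[:rebuild_xrefs] = nil
    options[:noindent] = nil
    options[:obfuscate] = false
  end
  options
end

# Hypothetical helper: wraps bytes in a StringIO tagged as binary.
def binary_stream(bytes)
  io = StringIO.new(bytes)
  io.set_encoding(Encoding::BINARY)
  io
end

p resolve_options({ decrypt: true })[:decrypt]                   # => true
p resolve_options({ recompile: true }, frozen: true)[:recompile] # => nil
p binary_stream("%PDF-1.4\n".dup).external_encoding == Encoding::BINARY # => true
```

For a frozen document, the overrides are applied after the merge, so a caller-supplied :recompile is ignored in that case.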