Processing PDF files in Go: tools and implementation code

Time:2020-11-24

At work we often run into problems when processing PDF files. A thousand kinds of PDF files call for a thousand ways of handling them, and every time we end up fighting these files to the bitter end.

I am a gopher myself, so this article lists the PDF processing scenarios I have run into, from a gopher's perspective, such as:

PDF rendering
PDF verification
PDF watermarking
Getting the page count of a PDF
PDF merging
PDF splitting
Repairing damaged PDFs
PDF to PNG
Identifying fonts in a PDF
PDF decryption

Most of this article is a list of problems organized by scenario. You can jump to the parts you are interested in by heading.

I am no expert on many PDF questions. If you have any questions, please contact me.

1、 Rendering HTML pages to PDF

I have used the following two tools to render HTML pages to PDF:

  • wkhtmltopdf
  • chromedp

1. Use wkhtmltopdf to render PDF

wkhtmltopdf is a command-line tool that renders HTML pages to PDF using the Qt WebKit rendering engine.

It is easy to use:

##Print a static HTML page as PDF
$ wkhtmltopdf input.html output.pdf

##Print a web page as PDF
$ wkhtmltopdf https://www.google.com output.pdf

wkhtmltopdf has a rich set of options. For example:

It supports sending HTTP POST requests, which suits rendering custom-built web pages to PDF:


$ wkhtmltopdf --help
...
--post <name> <value>      Add an additional post field (repeatable)
...

It also supports running JavaScript to modify the HTML before the PDF is rendered:


$ wkhtmltopdf --run-script "javascript:(function(){document.getElementsByClassName('dom_class_name')[0].style.display = 'none'}())" page input.html output.pdf

More detailed options can be found on the official website.

If you use Go, there is a third-party package that wraps wkhtmltopdf: go-wkhtmltopdf.

2. Use chromedp to render PDF

chromedp is a Go package that offers a faster, simpler way to drive browsers supporting the Chrome DevTools Protocol, without external dependencies (such as Selenium or PhantomJS).

Usage:


package main

import (
  "context"
  "fmt"
  "io/ioutil"

  "github.com/chromedp/cdproto/page"
  "github.com/chromedp/chromedp"
)

func main(){
  err := ChromedpPrintPdf("https://www.google.com", "/path/to/file.pdf")
  if err != nil {
    fmt.Println(err)
    return
  }
}

func ChromedpPrintPdf(url string, to string) error {
  ctx, cancel := chromedp.NewContext(context.Background())
  defer cancel()

  var buf []byte
  err := chromedp.Run(ctx, chromedp.Tasks{
    chromedp.Navigate(url),
    chromedp.WaitReady("body"),
    chromedp.ActionFunc(func(ctx context.Context) error {
      var err error
      buf, _, err = page.PrintToPDF().
        Do(ctx)
      return err
    }),
  })
  if err != nil {
    return fmt.Errorf("chromedp Run failed,err:%+v", err)
  }

  if err := ioutil.WriteFile(to, buf, 0644); err != nil {
    return fmt.Errorf("write to file failed,err:%+v", err)
  }

  return nil
}

2、 Adding a watermark to a PDF

The tools I know of that support PDF watermarking include:

  • unidoc/unipdf
  • pdfcpu

1. unidoc/unipdf

unipdf, developed by UniDoc, is a PDF library written in Go. It offers both an API and a CLI, and supports the following functions:


$ unipdf -h
...
Available Commands:
 decrypt   Decrypt PDF files
 encrypt   Encrypt PDF files
 explode   Explodes the input file into separate single page PDF files
 extract   Extract PDF resources
 form    PDF form operations
 grayscale  Convert PDF to grayscale
 help    Help about any command
 info    Output PDF information
 merge    Merge PDF files
 optimize  Optimize PDF files
 passwd   Change PDF passwords
 rotate   Rotate PDF file pages
 search   Search text in PDF files
 split    Split PDF files
 version   Output version information and exit
 watermark  Add watermark to PDF files
...

Add a watermark in CLI mode:


$ unipdf watermark in.pdf watermark.png -o out.pdf

Watermark successfully applied to in.pdf
Output file saved to out.pdf

To add a watermark through the API, refer directly to the examples in the unipdf GitHub repository.

Note: UniDoc products require a paid license.

2. pdfcpu

pdfcpu is a PDF processing library written in Go that offers both an API and a CLI.

The following functions are supported:


$ pdfcpu help
...
The commands are:

  attachments list, add, remove, extract embedded file attachments
  changeopw  change owner password
  changeupw  change user password
  decrypt   remove password protection
  encrypt   set password protection
  extract   extract images, fonts, content, pages, metadata
  fonts    install, list supported fonts
  grid    rearrange pages or images for enhanced browsing experience
  import   import/convert images to PDF
  info    print file info
  merge    concatenate 2 or more PDFs
  nup     rearrange pages or images for reduced number of pages
  optimize  optimize PDF by getting rid of redundant page resources
  pages    insert, remove selected pages
  paper    print list of supported paper sizes
  permissions list, set user access permissions
  rotate   rotate pages
  split    split multi-page PDF into several PDFs according to split span
  stamp    add, remove, update text, image or PDF stamps for selected pages
  trim    create trimmed version of selected pages
  validate  validate PDF against PDF 32000-1:2008 (PDF 1.7)
  version   print version
  watermark  add, remove, update text, image or PDF watermarks for selected pages
...

Use the CLI tool to add a watermark as an image:


$ pdfcpu watermark add -mode image 'voucher_watermark.png' 's:1 abs, rot:0' in.pdf out.pdf

Call the API to add a watermark:


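package main

import (
  "log"

  "github.com/pdfcpu/pdfcpu/pkg/api"
  "github.com/pdfcpu/pdfcpu/pkg/pdfcpu"
)

func main() {
  // onTop=false draws the image behind the page content (a watermark).
  onTop := false
  wm, err := pdfcpu.ParseImageWatermarkDetails("watermark.png", "s:1 abs, rot:0", onTop)
  if err != nil {
    log.Fatalf("parse watermark failed,err:%+v", err)
  }
  // Passing nil for the page selection and the configuration uses the defaults.
  if err := api.AddWatermarksFile("in.pdf", "out.pdf", nil, wm, nil); err != nil {
    log.Fatalf("add watermark failed,err:%+v", err)
  }
}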

3、 PDF merge

  • cpdf
  • unipdf
  • pdfcpu

1. Use cpdf to merge PDF

cpdf is a feature-rich PDF command-line tool with published source code (check its license terms before commercial use). Its functions include:

  • Merge PDF files together, or split them apart
  • Encrypt and decrypt
  • Scale, crop and rotate pages
  • Read and set document info and metadata
  • Copy, add or remove bookmarks
  • Stamp logos, text, dates, page numbers
  • Add or remove attachments
  • Losslessly compress PDF files

Merge PDF:


$ cpdf -merge input1.pdf input2.pdf -o output.pdf

2. Merge PDF with unipdf


$ unipdf merge output.pdf input1.pdf input2.pdf

To merge PDFs through the API, refer to the unipdf GitHub examples.

3. Use pdfcpu to merge PDF


$ pdfcpu merge output.pdf input1.pdf input2.pdf

Note: pdfcpu only supports PDF files up to and including PDF 1.7

4、 PDF split

  • cpdf
  • unipdf
  • pdfcpu

1. Use CPDF to split PDF

##Split page by page, writing one PDF per page
$ cpdf -split in.pdf 1 even -chunk 1 -o ./out%%%.pdf

2. Use unipdf to split PDF

##Split the first page
$ unipdf split input.pdf out.pdf 1-1

To split a PDF through the API, refer to the unipdf GitHub examples.

3. Use pdfcpu to split PDF


$ pdfcpu split in.pdf .

5、 PDF to image

  • mupdf
  • xpdf

1. Use mupdf to convert PDF to images

MuPDF is a lightweight PDF, XPS, and E-book viewer.
MuPDF consists of a software library, command line tools, and viewers for various platforms.

After downloading mupdf, you can get some tools, such as:

mupdf              
pdfdraw
pdfinfo            
pdfclean           
pdfextract         
pdfshow            
xpsdraw

pdfdraw can be used to convert a PDF to images (in newer MuPDF releases the equivalent command is mutool draw):


$ pdfdraw -o out%d.png in.pdf

Note: the mupdf release I used did not support Mac OS; this may differ in other releases

2. Use Xpdf to convert PDF to images

Xpdf is a free PDF toolkit that includes text extraction, image conversion, HTML conversion, and more.

After downloading the software package, you can get a series of tools:

pdfdetach
pdffonts 
pdfimages
pdfinfo  
pdftohtml
pdftopng 
pdftoppm 
pdftops  
pdftotext

The names give a rough idea of what each tool does.

##Using pdftopng to convert PDF to png
$ pdftopng in.pdf out-prefix

6、 PDF decryption

A common scenario: reading a PDF file fails with an error saying the file is encrypted.

How can this be solved without a password?

  • Decryption using qpdf

Forced decryption with qpdf succeeds in some cases but not in others. It generally works when the file only has an owner password that restricts permissions and the user password is empty.

qpdf is a command-line PDF tool:


$ qpdf --decrypt in.pdf out.pdf

  • Decryption using pdfcpu


$ pdfcpu decrypt encrypted.pdf output.pdf

When you do have the password, you can use it to decrypt:

  • Decryption using unipdf


$ unipdf decrypt -p pass -o output.pdf input.pdf

7、 PDF recognition

We often run into scenarios such as identifying whether a file is a PDF, extracting the text in a PDF, or extracting the images in a PDF.
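For the first scenario, a cheap check is to look for the "%PDF-" marker near the start of the file. The sketch below searches the first 1024 bytes, a leniency that many readers apply and that I am assuming here rather than a strict rule. It only shows that the file claims to be a PDF, not that it is valid. The function names and the path input.pdf are placeholders.

```go
package main

import (
	"bytes"
	"fmt"
	"io"
	"os"
)

// looksLikePDF reports whether the "%PDF-" marker appears within the first
// 1024 bytes of head.
func looksLikePDF(head []byte) bool {
	if len(head) > 1024 {
		head = head[:1024]
	}
	return bytes.Contains(head, []byte("%PDF-"))
}

// isPDFFile reads the beginning of the file at path and checks the marker.
func isPDFFile(path string) (bool, error) {
	f, err := os.Open(path)
	if err != nil {
		return false, err
	}
	defer f.Close()

	head := make([]byte, 1024)
	n, err := io.ReadFull(f, head)
	// Files shorter than 1024 bytes are fine; only real read errors abort.
	if err != nil && err != io.EOF && err != io.ErrUnexpectedEOF {
		return false, err
	}
	return looksLikePDF(head[:n]), nil
}

func main() {
	ok, err := isPDFFile("input.pdf")
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		return
	}
	fmt.Println("looks like a PDF:", ok)
}
```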

1. Extract the text in a PDF

Here we use Xpdf to extract the text from the PDF, then apply string operations or regular expressions for the business-specific analysis.

Using Xpdf / pdftotext to extract text from a PDF


$ pdftotext input.pdf output.txt
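As an illustration of the regular-expression step, the sketch below pulls one field out of the extracted text. The "Invoice No:" field, the sample text, and the names invoiceNoRe and extractInvoiceNo are all made up for this example. Real documents need their own patterns.

```go
package main

import (
	"fmt"
	"regexp"
)

// invoiceNoRe matches a hypothetical line such as "Invoice No: INV-2020-001".
var invoiceNoRe = regexp.MustCompile(`(?m)^Invoice No:\s*(\S+)`)

// extractInvoiceNo returns the first invoice number found in text produced
// by pdftotext, and false if there is none.
func extractInvoiceNo(text string) (string, bool) {
	m := invoiceNoRe.FindStringSubmatch(text)
	if m == nil {
		return "", false
	}
	return m[1], true
}

func main() {
	// Sample text standing in for the contents of output.txt.
	text := "ACME Ltd.\nInvoice No: INV-2020-001\nTotal: 42.00\n"
	if no, ok := extractInvoiceNo(text); ok {
		fmt.Println(no) // prints INV-2020-001
	}
}
```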

Using unipdf to extract text from a PDF


$ unipdf extract text input.pdf

To extract PDF text through the API, refer to the unipdf GitHub examples.

Using coordinate information to parse PDF data

All the approaches above extract the text first and then parse it according to business rules.

Another approach is to extract PDF data by coordinate position, which is more flexible and general. It uses PDFlib / TET.

##Input a set of coordinates to parse the data in PDF according to the coordinates
$ tet --pageopt "includebox={{38 707.93 243.91 716.93}}" input.pdf
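When the coordinates come from a program, the option string can be built in Go and passed to the CLI. This is a small sketch. The box type and its field names are my own, and I am assuming the four numbers are the lower-left and upper-right corners in that order. The tet binary is assumed to be on PATH.

```go
package main

import (
	"fmt"
	"os"
	"os/exec"
)

// box is a rectangle in PDF coordinates, assumed to be given as the
// lower-left corner followed by the upper-right corner.
type box struct {
	LLX, LLY, URX, URY float64
}

// includeBoxOpt formats the page option used in the command above.
func includeBoxOpt(b box) string {
	return fmt.Sprintf("includebox={{%g %g %g %g}}", b.LLX, b.LLY, b.URX, b.URY)
}

func main() {
	opt := includeBoxOpt(box{38, 707.93, 243.91, 716.93})
	fmt.Println(opt) // prints includebox={{38 707.93 243.91 716.93}}

	// Same invocation as the command above, with the option built in code.
	out, err := exec.Command("tet", "--pageopt", opt, "input.pdf").CombinedOutput()
	if err != nil {
		fmt.Fprintf(os.Stderr, "tet failed: %v\n%s", err, out)
	}
}
```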

The coordinates can be obtained by having tet produce a TETML file, which contains the coordinate information:


$ tet --tetml input.pdf

Of course, the coordinate information of PDF data can also be obtained in other ways, for example with Node.js.

Note: PDFlib / TET is commercial software. However, according to the official documentation, TET's basic functions can be used without purchasing a license for PDF files of no more than 10 pages and smaller than 1 MB.

PDFlib / TET provides command-line tools and SDKs for multiple languages, such as C / C++ / Java / .NET / Perl / PHP / Python / Ruby / Swift, but it does not currently support Go, so gophers have only two options: the CLI or cgo.

8、 Repairing damaged PDF files

Some PDF files display normally when opened on a computer but fail when checked by code. For example, in Go, try parsing a (damaged) PDF with a third-party library:


package main

import (
  "fmt"

  "rsc.io/pdf"
)

func main() {
  filePath := "path/to/your/broken.pdf"
  // pdf.Open fails if the file structure cannot be parsed.
  _, err := pdf.Open(filePath)
  if err != nil {
    fmt.Println("open pdf failed,err:", err.Error())
    return
  }
}

After running, you will get the following result:

open pdf failed,err: malformed PDF: cross-reference table not found: {5 0 obj}<</Contents 6 0 R /Group <</CS /DeviceRGB /S /Transparency /Type /Group>> /MediaBox [0 0 595.27600098 841.89001465] /Parent 3 0 R /Type /Page>>

The file opens fine on the computer, but the program fails to read it!

At this point, if you open the PDF on the computer, save it as a new PDF file, and check the new file with the code again, you will find that it has been repaired!

Great, problem solved!

But wait: if I have 1000 PDF files, do I have to open and re-save them one by one? That is hardly tolerable, so a batch repair function would be nice.

After searching the Internet for a long time, I found three solutions:

  • Use the Acrobat SDK and call its save-as function to reproduce the effect of opening and re-saving
  • Repair the PDF using Ghostscript
  • Repair the PDF using mupdf

I have only verified that the third method works, using mupdf-0.9-linux-amd64.

After downloading the package, one of the executables we get is pdfclean:


$ pdfclean broken.pdf repaired.pdf

+ pdf/pdf_xref.c:160: pdf_read_trailer(): cannot recognize xref format: '%'
| pdf/pdf_xref.c:481: pdf_load_xref(): cannot read trailer
\ pdf/pdf_xref.c:537: pdf_open_xref_with_stream(): trying to repair

Judging from the output, mupdf tried to repair the file.

Opening the new PDF file with the earlier Go code now works normally.

All that remains is to write a bash script for batch repair. Goal achieved!

9、 Identifying the fonts in a PDF file

Sometimes, to keep the text fonts of multiple PDFs consistent, you need to find out which fonts a PDF uses. In this case, Xpdf / pdffonts can be used for font analysis:


$ pdffonts input.pdf
name                                 type              encoding         emb sub uni object ID
------------------------------------ ----------------- ---------------- --- --- --- ---------
NimbusSanL-Regu                      CID TrueType      Identity-H       yes no  yes     10  0
NimbusSanL-Bold                      CID TrueType      Identity-H       yes no  yes     20  0

Other libraries:

PDF-Writer
This is an open source C++ library that supports creating PDFs, merging PDFs, image watermarks, text operations, and so on.

For gophers, using this library requires wrapping it in a layer of cgo code.

rsc/pdf
This is a PDF library implemented in Go that can be used to read PDF information such as content, pages, and fonts. See its documentation for details.

With so many third-party libraries introduced, the choice can be bewildering, and each has its own strengths. Many functions are duplicated across libraries, and which problems you hit in practice depends on your actual situation.

I hope this summary is helpful to readers.

References:

wkhtmltopdf
xpdf
cpdf
qpdf
unidoc
pdflib/tet
pdfwriter
mupdf
pdfcpu
