Extracting PDF text content with Java

Time:2020-11-26

Summary

PDF documents are generally not meant to be edited directly, so when you need to reuse or modify their content, a practical approach is to extract the text first. This article shows how to extract the text content of a PDF document with Java code.

The third-party library used in this tutorial is Free Spire.PDF for Java (the free version). Depending on your requirements, it supports the following three kinds of extraction:

  • Extract all text content of a PDF
  • Extract the text content of a specified page
  • Extract the text content of a specified area of a page

Obtaining and importing the JAR package

Before running the code, you need to import the JAR file from Free Spire.PDF for Java into IDEA. There are two ways to do this. First, you can download the product package from the official website and add Spire.Pdf.jar to your IDEA project manually. Second, you can create a Maven project in IDEA, add the following configuration to the pom.xml file, and then click "Import Changes".

<repositories>
    <repository>
        <id>com.e-iceblue</id>
        <url>http://repo.e-iceblue.cn/repository/maven-public/</url>
    </repository>
</repositories>
<dependencies>
    <dependency>
        <groupId>e-iceblue</groupId>
        <artifactId>spire.pdf.free</artifactId>
        <version>3.9.0</version>
    </dependency>
</dependencies>

Sample code

Example 1: Extract all text content from a PDF document
import com.spire.pdf.PdfDocument;
import com.spire.pdf.PdfPageBase;
import java.io.FileWriter;
import java.io.IOException;

public class ExtractAllText {
    public static void main(String[] args) {
        // Create a PdfDocument instance
        PdfDocument doc = new PdfDocument();
        // Load the PDF document
        doc.loadFromFile("C:\\Users\\Test1\\Desktop\\Sample.pdf");
        // Collect the text of every page in a StringBuilder
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < doc.getPages().getCount(); i++) {
            PdfPageBase page = doc.getPages().get(i);
            sb.append(page.extractText(true));
        }
        // Write the collected text to a file; try-with-resources closes the writer.
        // The output directory must already exist.
        try (FileWriter writer = new FileWriter("output/ExtractAllText.txt")) {
            writer.write(sb.toString());
        } catch (IOException e) {
            e.printStackTrace();
        }
        doc.close();
    }
}

Extraction result: [screenshot]
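
Example 1 saves the extracted text with FileWriter, which depends on the platform's default character encoding and fails if the output folder does not exist. The following is a minimal sketch of one alternative that uses only the standard library. The class name ExtractedTextSaver and its save method are hypothetical helpers introduced here for illustration and are not part of Spire.PDF.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class ExtractedTextSaver {

    // Hypothetical helper: writes the text as UTF-8 and creates any missing
    // parent directories first.
    public static Path save(Path target, CharSequence text) throws IOException {
        Path parent = target.toAbsolutePath().getParent();
        if (parent != null) {
            Files.createDirectories(parent);
        }
        // Files.write creates the file, or truncates it if it already exists.
        return Files.write(target, text.toString().getBytes(StandardCharsets.UTF_8));
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for the StringBuilder filled from the PDF pages in Example 1.
        StringBuilder sb = new StringBuilder("text extracted from the PDF\n");
        Path dir = Files.createTempDirectory("extract-demo");
        Path written = save(dir.resolve("output").resolve("ExtractAllText.txt"), sb);
        System.out.println("Wrote " + Files.size(written) + " bytes to " + written);
    }
}
```

In Example 1, the try block could then be replaced by a single call that passes the target path and the StringBuilder to this helper.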

Example 2: Extract the text content of a specified PDF page
import com.spire.pdf.*;
import java.io.BufferedWriter;
import java.io.File;
import java.io.FileWriter;
import java.io.IOException;

public class ExtractTextFromParticularPage {
    public static void main(String[] args) throws IOException {
        // Load the PDF document
        PdfDocument pdf = new PdfDocument();
        pdf.loadFromFile("C:\\Users\\Test1\\Desktop\\Sample.pdf");
        // Create a .txt file to save the extracted text, replacing any previous output
        String result = "output/extractTextFromAParticularPage.txt";
        File file = new File(result);
        if (file.exists()) {
            file.delete();
        }
        file.createNewFile();
        // Get the text of the first page (page indexes start at 0)
        PdfPageBase page = pdf.getPages().get(0);
        String text = page.extractText(true);
        //String text = page.extractText(false);
        // try-with-resources flushes and closes the writer
        try (BufferedWriter bw = new BufferedWriter(new FileWriter(file, true))) {
            bw.write(text);
        }
        pdf.close();
    }
}

Extraction result: [screenshot]
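
Example 2 reads the first page with index 0, so a page number as a reader would state it (1, 2, 3, ...) has to be converted before it is passed to getPages().get(...). The sketch below shows one way to do the conversion with a range check. PageNumbers and toIndex are hypothetical names used only for this illustration, and the page count is a fixed stand-in value rather than a value read from a real document.

```java
public class PageNumbers {

    // Hypothetical helper: converts a 1-based page number into the 0-based
    // index used when fetching a page, rejecting out-of-range values.
    public static int toIndex(int pageNumber, int pageCount) {
        if (pageNumber < 1 || pageNumber > pageCount) {
            throw new IllegalArgumentException(
                    "Page " + pageNumber + " is out of range 1.." + pageCount);
        }
        return pageNumber - 1;
    }

    public static void main(String[] args) {
        // Stand-in value; with a loaded document this would come from
        // pdf.getPages().getCount().
        int pageCount = 5;
        System.out.println(toIndex(1, pageCount)); // prints 0
        System.out.println(toIndex(5, pageCount)); // prints 4
    }
}
```

Checking the range up front gives a clear error message instead of whatever the library reports for an invalid index.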

Example 3: Extract the text content of a specified area of a PDF page
import com.spire.pdf.*;
import java.awt.geom.Rectangle2D;
import java.io.BufferedWriter;
import java.io.File;
import java.io.FileWriter;
import java.io.IOException;

public class ExtractTextFromSpecificArea {
    public static void main(String[] args) throws IOException {
        // Load the PDF document
        PdfDocument pdf = new PdfDocument();
        pdf.loadFromFile("C:\\Users\\Test1\\Desktop\\Sample.pdf");
        // Create a .txt file to save the extracted text, replacing any previous output
        File file = new File("output/ExtractTextFromASpecifiedArea.txt");
        if (file.exists()) {
            file.delete();
        }
        file.createNewFile();
        // Get the first page
        PdfPageBase page = pdf.getPages().get(0);
        // Extract the text inside the rectangle (x, y, width, height) on the first page
        String text = page.extractText(new Rectangle2D.Float(80, 20, 500, 110));
        // try-with-resources flushes and closes the writer
        try (BufferedWriter bw = new BufferedWriter(new FileWriter(file, true))) {
            bw.write(text);
        }
        pdf.close();
    }
}

Extraction result: [screenshot]
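
The rectangle in Example 3 is built from four numbers: x, y, width and height. When these values come from user input, it can help to trim the region to the page before extracting. The following is a rough sketch using only java.awt.geom. RegionClamp and clampToPage are hypothetical names, the page sizes in main are assumed values (roughly A4 in points), and the sketch assumes the coordinate origin is the top-left corner of the page, which should be checked against the library documentation.

```java
import java.awt.geom.Rectangle2D;

public class RegionClamp {

    // Hypothetical helper: trims the requested region so that it lies inside
    // a page of the given size. Assumes the origin is the page's top-left corner.
    public static Rectangle2D clampToPage(Rectangle2D region, double pageWidth, double pageHeight) {
        Rectangle2D page = new Rectangle2D.Double(0, 0, pageWidth, pageHeight);
        if (!region.intersects(page)) {
            throw new IllegalArgumentException("Region lies entirely outside the page");
        }
        return region.createIntersection(page);
    }

    public static void main(String[] args) {
        // The region from Example 3: x = 80, y = 20, width = 500, height = 110.
        Rectangle2D region = new Rectangle2D.Float(80, 20, 500, 110);
        // Assumed page size of 595 x 842: the region fits and is unchanged.
        System.out.println(clampToPage(region, 595, 842));
        // Assumed narrower page of 400 x 842: the width is trimmed to 320.
        System.out.println(clampToPage(region, 400, 842));
    }
}
```

The clamped rectangle can then be passed to extractText in place of the original one.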
