How to extract content from a PDF using Java?

2019-10-16 22:25:42

In Java programming, how do you extract the content of a PDF file?

[Figure: project directory structure]

The Tika toolkit can be downloaded from http://tika.apache.org/download.html ; only two files are needed: tika-app-1.16.jar and tika-server-1.16.jar

The following program extracts the content of a PDF using Java:

import java.io.File;
import java.io.FileInputStream;

import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.pdf.PDFParser;
import org.apache.tika.sax.BodyContentHandler;

public class ExtractContentFromPDF {

    public static void main(String[] args) throws Exception {
        // Collects the body text (the no-argument constructor caps it at 100,000 characters)
        BodyContentHandler handler = new BodyContentHandler();
        // Filled in by the parser with the document's metadata
        Metadata metadata = new Metadata();
        ParseContext pcontext = new ParseContext();

        // Parse the document with the PDF parser; try-with-resources closes the stream
        try (FileInputStream inputstream = new FileInputStream(new File("pdfExample.pdf"))) {
            PDFParser pdfparser = new PDFParser();
            pdfparser.parse(inputstream, handler, metadata, pcontext);
        }

        // Print the extracted text of the document
        System.out.println("Contents of the PDF :" + handler.toString());

        // Print every metadata entry of the document
        System.out.println("Metadata of the PDF:");
        String[] metadataNames = metadata.names();

        for (String name : metadataNames) {
            System.out.println(name + " : " + metadata.get(name));
        }
    }
}
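
The string returned by handler.toString() usually keeps the PDF's hard line breaks and ends with a run of blank lines. As a minimal sketch of one way to tidy it before further use, the helper below trims each line and collapses consecutive blank lines into one. The class ExtractedText and its tidy method are hypothetical names introduced for illustration; they are not part of Tika, and the sample string in main only stands in for real parser output.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Tika: tidies text such as handler.toString().
final class ExtractedText {

    private ExtractedText() {}

    /**
     * Trims every line, collapses runs of blank lines into a single blank line,
     * and drops blank lines at the start and end of the text.
     */
    static String tidy(String raw) {
        List<String> out = new ArrayList<>();
        // Start as "blank" so that leading blank lines are skipped.
        boolean previousBlank = true;
        for (String line : raw.split("\\R", -1)) {
            String trimmed = line.trim();
            if (trimmed.isEmpty()) {
                if (!previousBlank) {
                    out.add("");
                }
                previousBlank = true;
            } else {
                out.add(trimmed);
                previousBlank = false;
            }
        }
        // Remove the single blank line that may remain at the end.
        if (!out.isEmpty() && out.get(out.size() - 1).isEmpty()) {
            out.remove(out.size() - 1);
        }
        return String.join("\n", out);
    }

    public static void main(String[] args) {
        // Stand-in for real parser output (an assumption for this sketch).
        String raw = "\nApache Tika is a library.  \n\n\n\nInternally, Tika uses parsers.\n\n\n";
        System.out.println(tidy(raw));
    }
}
```

In the program above, the call would be ExtractedText.tidy(handler.toString()).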

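The input file, pdfExample.pdf, is a one-page PDF containing the short description of Apache Tika that appears in the output below.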

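Running the example code above produces the following output. The SLF4J lines and the "not loaded" warnings at the top come from the library setup (both Tika jars are on the classpath, and some optional dependencies are missing); the program's own output starts at "Contents of the PDF".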

SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/F:/worksp/javaexamples/libs/tika_libs/tika-app-1.16.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/F:/worksp/javaexamples/libs/tika_libs/tika-server-1.16.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
Sep 27, 2017 4:29:50 AM org.apache.tika.config.InitializableProblemHandler$3 handleInitializableProblem
WARNING: JBIG2ImageReader not loaded. jbig2 files will be ignored
See http://pdfbox.apache.org/2.0/dependencies.html#jai-image-io
for optional dependencies.
TIFFImageWriter not loaded. tiff files will not be processed
See http://pdfbox.apache.org/2.0/dependencies.html#jai-image-io
for optional dependencies.
J2KImageReader not loaded. JPEG2000 files will not be processed.
See http://pdfbox.apache.org/2.0/dependencies.html#jai-image-io
for optional dependencies.

Sep 27, 2017 4:29:50 AM org.apache.tika.config.InitializableProblemHandler$3 handleInitializableProblem
WARNING: org.xerial's sqlite-jdbc is not loaded.
Please provide the jar on your classpath to parse sqlite files.
See tika-parsers/pom.xml for the correct version.
Contents of the PDF :
Apache Tika is a library that is used for document type detection and
content extraction from various file formats.

Internally, Tika uses various existing document parsers and
document type detection techniques to detect and extract data.

Using Tika, one can develop a universal type detector and content
extractor to extract both structured text as well as metadata from
different types of documents such as spreadsheets, text documents,
images, PDFs and even multimedia input formats to a certain extent.



Metadata of the PDF:
date : 2017-09-26T20:00:44Z
pdf:PDFVersion : 1.7
pdf:docinfo:title : 
xmp:CreatorTool : WPS Office
Company : 
Keywords : 
access_permission:modify_annotations : true
access_permission:can_print_degraded : true
subject : 
dc:creator : Administrator
dcterms:created : 2017-09-26T20:00:44Z
Last-Modified : 2017-09-26T20:00:44Z
dcterms:modified : 2017-09-26T20:00:44Z
dc:format : application/pdf; version=1.7
Last-Save-Date : 2017-09-26T20:00:44Z
pdf:docinfo:creator_tool : WPS Office
access_permission:fill_in_form : true
pdf:docinfo:keywords : 
pdf:docinfo:modified : 2017-09-26T20:00:44Z
meta:save-date : 2017-09-26T20:00:44Z
pdf:encrypted : false
modified : 2017-09-26T20:00:44Z
pdf:docinfo:custom:SourceModified : D:20170927041644+08'16'
cp:subject : 
pdf:docinfo:subject : 
Content-Type : application/pdf
pdf:docinfo:creator : Administrator
creator : Administrator
meta:author : Administrator
dc:subject : 
meta:creation-date : 2017-09-26T20:00:44Z
created : Tue Sep 26 16:00:44 BOT 2017
Comments : 
access_permission:extract_for_accessibility : true
access_permission:assemble_document : true
xmpTPg:NPages : 1
Creation-Date : 2017-09-26T20:00:44Z
access_permission:extract_content : true
pdf:docinfo:custom:Company : 
access_permission:can_print : true
SourceModified : D:20170927041644+08'16'
pdf:docinfo:custom:Comments : 
meta:keyword : 
Author : Administrator
producer : 
access_permission:can_modify : true
pdf:docinfo:producer : 
pdf:docinfo:created : 2017-09-26T20:00:44Z
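
The metadata listing above arrives in no particular order and mixes several key families (pdf:, dc:, dcterms:, meta:, access_permission:, plus unprefixed names such as Author), many of which repeat the same value. As a rough sketch of one way to make it easier to read, the helper below groups entries by the part of the key before the first colon and sorts them. The class MetadataGroups and its byPrefix method are hypothetical names introduced here, not part of Tika; the map in main is filled by hand with a few entries copied from the output, whereas in the real program it would be built from metadata.names() and metadata.get(name).

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper, not part of Tika: groups metadata entries by key prefix.
final class MetadataGroups {

    private MetadataGroups() {}

    /**
     * Returns prefix -> (name -> value), with prefixes and names sorted.
     * Names without a ':' are collected under the pseudo-prefix "(none)".
     */
    static Map<String, Map<String, String>> byPrefix(Map<String, String> metadata) {
        Map<String, Map<String, String>> groups = new TreeMap<>();
        for (Map.Entry<String, String> entry : metadata.entrySet()) {
            String name = entry.getKey();
            int colon = name.indexOf(':');
            String prefix = colon < 0 ? "(none)" : name.substring(0, colon);
            groups.computeIfAbsent(prefix, k -> new TreeMap<>()).put(name, entry.getValue());
        }
        return groups;
    }

    public static void main(String[] args) {
        // A few entries copied by hand from the listing above.
        Map<String, String> sample = new LinkedHashMap<>();
        sample.put("pdf:PDFVersion", "1.7");
        sample.put("pdf:docinfo:creator", "Administrator");
        sample.put("dc:creator", "Administrator");
        sample.put("dcterms:created", "2017-09-26T20:00:44Z");
        sample.put("Author", "Administrator");

        // Print one block per prefix, with the entries indented beneath it.
        for (Map.Entry<String, Map<String, String>> group : byPrefix(sample).entrySet()) {
            System.out.println("[" + group.getKey() + "]");
            for (Map.Entry<String, String> e : group.getValue().entrySet()) {
                System.out.println("  " + e.getKey() + " : " + e.getValue());
            }
        }
    }
}
```

Grouping on the first colon only is a deliberate simplification: pdf:docinfo:creator lands under pdf rather than under a nested pdf:docinfo group.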