TIKA提取文字文件


下面給出的程式是用來提取文字文件的內容和後設資料:

import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;

import org.apache.tika.exception.TikaException;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.sax.BodyContentHandler;
import org.apache.tika.parser.txt.TXTParser;

import org.xml.sax.SAXException;

public class TextParser {

   public static void main(final String[] args) throws IOException,SAXException, TikaException {

      //detecting the file type
      BodyContentHandler handler = new BodyContentHandler();
      Metadata metadata = new Metadata();
      FileInputStream inputstream = new FileInputStream(new File("example.txt"));
      ParseContext pcontext=new ParseContext();
      
      //Text document parser
      TXTParser  TexTParser = new TXTParser();
      TexTParser.parse(inputstream, handler, metadata,pcontext);
      System.out.println("Contents of the document:" + handler.toString());
      System.out.println("Metadata of the document:");
      String[] metadataNames = metadata.names();
      
      for(String name : metadataNames) {
         System.out.println(name + " : " + metadata.get(name));
      }
   }
}

儲存上述程式碼作為TextParser.java,並通過使用下面的命令從命令提示編譯:

javac TextParser.java
java TextParser

下面給出的是example.txt檔案的快照:

Simple Document

文字檔案具有以下屬性:

Document Properties

執行上述程式後,將得到下面的輸出。

輸出:

Contents of the document:

At Tw511.com, we strive hard to provide quality tutorials for self-learning purpose in the domains of Academics, Information Technology, Management and Computer Programming Languages.

The endeavour started by Hema su, who is the founder and the managing director of  Yiibai Pvt. Ltd. He came up with the website tw511.com in year 2014 with the help of handpicked freelancers, with an array of tutorials for computer programming languages.

Metadata of the document: Content-Encoding: windows-1252 Content-Type: text/plain; charset=windows-1252