Python Tika guide

IMPORTANT NOTE: Thanks to Chris Wilson's work, it seems that a single command, pip install git+git://github.com/aptivate/python-tika.git, now does the job. Much better, isn't it? See http://blog.aptivate.org/2012/02/01/content-indexing-in-django-using-apache-tika/ for more info. The procedure below is therefore deprecated; I keep it here just in case...

This document is a very short guide to building and using Tika (an all-purpose library for extracting content and metadata from documents) through a Python wrapper. The wrapper is built using JCC.

http://lucene.apache.org/tika/

http://lucene.apache.org/pylucene/jcc/index.html

So far, I have tested only the few features I am interested in.

Install

Install JCC: http://lucene.apache.org/pylucene/jcc/documentation/install.html

Install Tika: http://lucene.apache.org/tika/0.7/gettingstarted.html

Don't forget to run mvn install in the Tika directory.
You will need the jar files from tika-parsers/target, tika-core/target and tika-app/target.
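
The build command further down expects the jars under a jar/ directory. A minimal sketch of gathering them from the three target directories, assuming the usual Maven layout; the function name collect_jars and the destination path are my own choices, not part of Tika or JCC:

```python
import glob
import os
import shutil

# Module directories named in the text above.
MODULES = ("tika-parsers", "tika-core", "tika-app")


def collect_jars(tika_root, dest):
    """Copy every .jar found in <tika_root>/<module>/target into dest.

    Returns the copied file names, in module order.
    """
    if not os.path.isdir(dest):
        os.makedirs(dest)
    copied = []
    for module in MODULES:
        pattern = os.path.join(tika_root, module, "target", "*.jar")
        for jar in sorted(glob.glob(pattern)):
            shutil.copy(jar, dest)
            copied.append(os.path.basename(jar))
    return copied
```

For example, collect_jars("/path/to/tika", "jcc/jcc/jar") would fill the jar/ directory used below; both paths are placeholders.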

Build the Tika Python wrapper with JCC:

> cd jcc/jcc
> sudo python __main__.py --jar jar/tika-parsers-0.7.jar --jar jar/tika-core-0.7.jar \
    java.io.File java.io.FileInputStream java.io.StringBufferInputStream \
    --package org.xml.sax.ContentHandler --package org.xml.sax.SAXException \
    --include jar/tika-app-0.7.jar --python tika --reserved asm --build --install

I have been told that the package option should be "--package org.xml.sax". I don't know whether this is due to a version change, and I haven't tested it, but try it if the command above fails.
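
To make the two variants easy to compare, the argument list can be assembled in Python. This is only a sketch: the name jcc_args and its parameters are hypothetical, the jars are assumed to sit in jar/ as in the command above, and only options that already appear in this guide are used.

```python
def jcc_args(version, packages=("org.xml.sax",), install=True):
    """Build the argument list for JCC's __main__.py.

    `packages` defaults to the suggested "--package org.xml.sax";
    pass the two original class names to reproduce the first command.
    """
    args = [
        "--jar", "jar/tika-parsers-%s.jar" % version,
        "--jar", "jar/tika-core-%s.jar" % version,
        "java.io.File", "java.io.FileInputStream",
        "java.io.StringBufferInputStream",
    ]
    for package in packages:
        args += ["--package", package]
    args += [
        "--include", "jar/tika-app-%s.jar" % version,
        "--python", "tika",
        "--reserved", "asm",
    ]
    if install:
        args += ["--build", "--install"]
    return args
```

From the jcc/jcc directory, the list could then be passed to subprocess.check_call([sys.executable, "__main__.py"] + jcc_args("0.7")); the install step may still need root, as the sudo in the command above suggests.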

1 Feb 2012: thanks to a fellow Tika user for his input:

I concur with the need to change the package to "--package org.xml.sax".
Without this, I do not get "errors" during the compilation process,
but jcc silently ignores the all-important AutoDetectParser.parse() method,
and produces a wrapper with no such method in it, because it doesn't recognise the return type.
This causes the example code that you gave to fail because of the missing method.

I also needed to add an OSGI library for Tika 1.0, which I happened to find on my system, so my final command was:

python ../jcc/jcc/__main__.py \
       --include /usr/share/java/org.eclipse.osgi.jar \
       --jar tika-parsers-1.0.jar \
       --jar tika-core-1.0.jar \
       java.io.File java.io.FileInputStream \
       java.io.StringBufferInputStream \
       --package org.xml.sax \
       --include tika-app-1.0.jar \
       --python tika --version 1.0 --reserved asm
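
Because JCC drops the method silently, it may be worth checking the generated wrapper before relying on it. A small sketch of such a check; missing_methods is a hypothetical helper name, and I am assuming that wrapped Java methods show up as ordinary callable attributes on the generated Python classes:

```python
def missing_methods(cls, names):
    """Return the entries of `names` that `cls` does not expose as callables."""
    return [name for name in names if not callable(getattr(cls, name, None))]


# Intended use once the wrapper is built (requires the tika module):
#   import tika
#   tika.initVM()
#   print(missing_methods(tika.AutoDetectParser, ["parse"]))
# An empty list means parse() made it into the wrapper.
```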

Usage example

In a Python console:

# Setup module and virtual machine
import tika
tika.initVM()

# The all purpose parser from Tika (html, pdf, open documents, etc...)
parser = tika.AutoDetectParser()

# Create the input stream from a small fake HTML document
# (named "stream" to avoid shadowing Python's built-in input)
# Alternatively you can use: stream = tika.FileInputStream(tika.File("/path/to/example"))
stream = tika.StringBufferInputStream("<html><title>My title</title><body>My body</body></html>")

# Create the handlers for content, metadata and context
content = tika.BodyContentHandler()
metadata = tika.Metadata()
context = tika.ParseContext()

# Parse the data and display the result (console output shown as comments)
parser.parse(stream, content, metadata, context)
content.toString()
# u'My body'
metadata.toString()
# u'title=My title Content-Encoding=UTF-8 Content-Type=text/html '
metadata.get('title')
# u'My title'
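
The same steps can be wrapped in a function for files on disk, using the FileInputStream alternative mentioned in the example. This is a sketch, not a tested recipe: extract_file is a hypothetical name, the wrapper module is passed in so that initVM() stays under the caller's control, and I am assuming the wrapped FileInputStream exposes Java's close() method.

```python
def extract_file(tika, path):
    """Parse the file at `path` and return (body text, Metadata object).

    `tika` is the JCC-built module, with tika.initVM() already called.
    """
    parser = tika.AutoDetectParser()
    stream = tika.FileInputStream(tika.File(path))
    content = tika.BodyContentHandler()
    metadata = tika.Metadata()
    context = tika.ParseContext()
    try:
        parser.parse(stream, content, metadata, context)
    finally:
        # Release the underlying Java file handle even if parsing fails.
        stream.close()
    return content.toString(), metadata
```

After import tika and tika.initVM(), a call would look like text, metadata = extract_file(tika, "/path/to/example"), and metadata.get('title') then works as in the console session.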