Python Tika guide

IMPORTANT NOTE: Thanks to Chris Wilson's work, it seems that a single command now does the job: pip install git+git://github.com/aptivate/python-tika.git. Much better, isn't it? See http://blog.aptivate.org/2012/02/01/content-indexing-in-django-using-apache-tika/ for more information. The instructions below are therefore deprecated; I keep them here just in case.

This document is a very short guide to building and using Tika (an all-purpose library for extracting the content and metadata of documents) through a Python wrapper. The wrapper is built using JCC.

http://lucene.apache.org/tika/

http://lucene.apache.org/pylucene/jcc/index.html

So far I have only tested the few features I am interested in.

Install

Install JCC: http://lucene.apache.org/pylucene/jcc/documentation/install.html

Install Tika: http://lucene.apache.org/tika/0.7/gettingstarted.html

Don't forget to run mvn install in the tika directory.
You will need the jar files from tika-parsers/target, tika-core/target and tika-app/target.

Build Tika Python wrapper with jcc:

> cd jcc/jcc
> sudo python __main__.py --jar jar/tika-parsers-0.7.jar --jar jar/tika-core-0.7.jar \
    java.io.File java.io.FileInputStream java.io.StringBufferInputStream \
    --package org.xml.sax.ContentHandler --package org.xml.sax.SAXException \
    --include jar/tika-app-0.7.jar --python tika --reserved asm --build --install

I have been told that the package option should instead be "--package org.xml.sax". I don't know whether this is due to a version change, and I haven't tested it, but try it if the command above gives you errors.
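The same command line can also be assembled in Python, which makes it easier to swap jar versions or switch the package option. The sketch below is illustrative only: build_jcc_args is a hypothetical helper, not part of JCC, and it uses only the flags that appear in the command above. It prints the command rather than running it.

```python
def build_jcc_args(jars, classes, packages, includes, module, reserved=()):
    """Assemble the JCC argument list from the pieces used in the command above.

    Hypothetical helper for illustration; it does not invoke JCC itself.
    """
    args = []
    for jar in jars:
        args += ["--jar", jar]
    # Extra Java classes to wrap are passed as bare class names.
    args += list(classes)
    for package in packages:
        args += ["--package", package]
    for include in includes:
        args += ["--include", include]
    args += ["--python", module]
    for word in reserved:
        args += ["--reserved", word]
    args += ["--build", "--install"]
    return args


args = build_jcc_args(
    jars=["jar/tika-parsers-0.7.jar", "jar/tika-core-0.7.jar"],
    classes=["java.io.File", "java.io.FileInputStream",
             "java.io.StringBufferInputStream"],
    packages=["org.xml.sax"],  # the variant suggested above
    includes=["jar/tika-app-0.7.jar"],
    module="tika",
    reserved=["asm"],
)
print("python __main__.py " + " ".join(args))
```

The printed line can be pasted into a shell from the jcc/jcc directory, with sudo added if the install step needs it.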

1 Feb 2012: thanks to a fellow Tika user for his input:

I concur with the need to change the package to "--package org.xml.sax".
Without this, I do not get "errors" during the compilation process,
but jcc silently ignores the all-important AutoDetectParser.parse() method,
and produces a wrapper with no such method in it, because it doesn't recognise the return type.
This causes the example code that you gave to fail because of the missing method.

I also needed to add an OSGI library for Tika 1.0, which I happened to find on my system, so my final command was:

python ../jcc/jcc/__main__.py \
       --include /usr/share/java/org.eclipse.osgi.jar \
       --jar tika-parsers-1.0.jar \
       --jar tika-core-1.0.jar \
       java.io.File java.io.FileInputStream \
       java.io.StringBufferInputStream \
       --package org.xml.sax \
       --include tika-app-1.0.jar \
       --python tika --version 1.0 --reserved asm
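Because the failure described above is silent, it may be worth checking the generated wrapper right after the build. The following is a minimal sketch: missing_methods is a hypothetical helper, FakeParser is a stand-in so the snippet runs without the wrapper, and it assumes that JCC exposes wrapped Java methods as callable attributes of the Python class.

```python
def missing_methods(cls, names):
    """Return the entries of `names` that `cls` does not expose as callables."""
    return [name for name in names if not callable(getattr(cls, name, None))]


# With the generated wrapper installed, the check would look like this
# (commented out because it needs the built `tika` module):
#   import tika
#   tika.initVM()
#   print(missing_methods(tika.AutoDetectParser, ["parse"]))

# Stand-in class so the helper can be tried without the wrapper.
class FakeParser(object):
    def parse(self, *args):
        pass


print(missing_methods(FakeParser, ["parse"]))          # prints []
print(missing_methods(FakeParser, ["parse", "nope"]))  # prints ['nope']
```

An empty list for tika.AutoDetectParser would suggest that parse() made it into the wrapper; a non-empty one points back at the --package option.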

Usage example

In a Python console:

# Setup module and virtual machine
import tika
tika.initVM()

# The all purpose parser from Tika (html, pdf, open documents, etc...)
parser = tika.AutoDetectParser()

# Create an input stream from a small fake HTML snippet
# (named "stream" to avoid shadowing Python's built-in input)
# Alternatively you can use: stream = tika.FileInputStream(tika.File("/path/to/example"))
stream = tika.StringBufferInputStream("<html><title>My title</title><body>My body</body></html>")

# Create handler for content, metadata and context
content = tika.BodyContentHandler()
metadata = tika.Metadata()
context = tika.ParseContext()

# Parse the data and display result
parser.parse(stream, content, metadata, context)
content.toString()
> u'My body'
metadata.toString()
> u'title=My title Content-Encoding=UTF-8 Content-Type=text/html '
metadata.get('title')
> u'My title'
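The steps above can be folded into one function when several documents need to be processed. This is only a sketch under assumptions: extract is a hypothetical name, the tika argument is the JCC-built module after tika.initVM() has been called, and it assumes the wrapper exposes Metadata.names() and the stream's close() method from the Java API (neither was exercised in the session above).

```python
def extract(tika, stream):
    """Parse `stream` with Tika and return (body_text, metadata_dict).

    `tika` is the JCC-built module, already initialised with tika.initVM().
    `stream` is a Java input stream such as tika.FileInputStream(...).
    """
    parser = tika.AutoDetectParser()
    content = tika.BodyContentHandler()
    metadata = tika.Metadata()
    context = tika.ParseContext()
    try:
        parser.parse(stream, content, metadata, context)
    finally:
        # Assumption: close() is available on the wrapped stream.
        stream.close()
    # Assumption: names() is wrapped and returns an iterable of field names.
    fields = dict((name, metadata.get(name)) for name in metadata.names())
    return content.toString(), fields


# Usage, once the wrapper is installed:
#   import tika
#   tika.initVM()
#   text, fields = extract(tika, tika.FileInputStream(tika.File("/path/to/example")))
```

Passing the module in as an argument keeps the function importable on machines where the wrapper has not been built yet.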
