I'm 内向小蜜蜂, a blogger on 靠谱客. This article covers extracting file content with Tika; I'm sharing it here as a reference.

Add the Maven dependencies (extracting content from different file formats requires the corresponding parser dependencies)

<!-- tika -->
<dependency>
    <groupId>org.apache.tika</groupId>
    <artifactId>tika-core</artifactId>
    <version>1.26</version>
</dependency>
<dependency>
    <groupId>org.apache.tika</groupId>
    <artifactId>tika-parsers</artifactId>
    <version>1.26</version>
</dependency>
<dependency>
    <groupId>org.apache.pdfbox</groupId>
    <artifactId>jbig2-imageio</artifactId>
    <version>3.0.1</version>
</dependency>
<dependency>
    <groupId>com.github.jai-imageio</groupId>
    <artifactId>jai-imageio-jpeg2000</artifactId>
    <version>1.3.0</version>
</dependency>
<dependency>
    <groupId>org.xerial</groupId>
    <artifactId>sqlite-jdbc</artifactId>
    <version>3.20.1</version>
</dependency>
<dependency>
    <groupId>org.apache.poi</groupId>
    <artifactId>poi</artifactId>
    <version>4.0.0</version>
</dependency>
<dependency>
    <groupId>org.apache.poi</groupId>
    <artifactId>poi-ooxml</artifactId>
    <version>4.0.0</version>
</dependency>
<dependency>
    <groupId>org.apache.poi</groupId>
    <artifactId>poi-ooxml-schemas</artifactId>
    <version>4.0.0</version>
</dependency>

Code

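import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;

import org.apache.tika.Tika;
import org.apache.tika.exception.TikaException;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.Parser;
import org.apache.tika.sax.BodyContentHandler;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;

public class TikaController {

    public static void main(String[] args) {
        // Backslashes must be doubled inside a Java string literal.
        String filePath = "D:\\tools\\nginx-1.18.0\\html\\index.html";
        File file = new File(filePath);
        TikaController controller = new TikaController();
        try {
            // System.out.println("---" + new Tika().parseToString(file));
            // Recommended
            System.out.println("--1-" + controller.fileToTxt(file) + "---e1-");
            System.out.println("--2-" + controller.tikaTool(file) + "---e2-");
        } catch (Exception e) {
            e.printStackTrace();
        }
    }

    // Extracts text through the Tika facade; the file name is a hint for type detection.
    public String tikaTool(File f) throws IOException, TikaException {
        Tika tika = new Tika();
        Metadata metadata = new Metadata();
        metadata.set(Metadata.RESOURCE_NAME_KEY, f.getName());
        // try-with-resources closes the stream even if parsing fails.
        try (InputStream is = new FileInputStream(f)) {
            String str = tika.parseToString(is, metadata);
            // Uncomment to print the metadata collected during parsing:
            // for (String name : metadata.names()) {
            //     System.out.println(name + ":" + metadata.get(name));
            // }
            // return tika.parseToString(f);
            return str;
        }
    }

    // Extracts text with AutoDetectParser; returns null if the file cannot be read or parsed.
    public String fileToTxt(File f) {
        Parser parser = new AutoDetectParser();
        Metadata metadata = new Metadata();
        metadata.set(Metadata.RESOURCE_NAME_KEY, f.getName());
        try (InputStream is = new FileInputStream(f)) {
            // The default BodyContentHandler keeps at most 100,000 characters;
            // new BodyContentHandler(-1) removes that limit.
            ContentHandler handler = new BodyContentHandler();
            ParseContext context = new ParseContext();
            // Registering the parser lets embedded documents be parsed recursively.
            context.set(Parser.class, parser);
            parser.parse(is, handler, metadata, context);
            // Uncomment to print the metadata collected during parsing:
            // for (String name : metadata.names()) {
            //     System.out.println(name + ":" + metadata.get(name));
            // }
            return handler.toString();
        } catch (IOException | SAXException | TikaException e) {
            // FileNotFoundException is an IOException, so it is covered here too.
            e.printStackTrace();
        }
        return null;
    }
}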
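
A note on the Windows path passed to the extraction methods: in Java source a backslash inside a string literal starts an escape sequence, so a path such as D:\tools\... has to be written with doubled backslashes, with forward slashes, or built from segments. The sketch below uses only the JDK and shows the three forms; the class name PathLiteralDemo and its method names are made up for illustration and are not part of Tika or the original code.

```java
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper class, for illustration only.
class PathLiteralDemo {

    // Each "\\" in the source becomes one literal backslash in the string.
    static String escapedLiteral() {
        return "D:\\tools\\nginx-1.18.0\\html\\index.html";
    }

    // Forward slashes need no escaping; java.io.File accepts them on Windows.
    static String forwardSlashLiteral() {
        return "D:/tools/nginx-1.18.0/html/index.html";
    }

    // Builds the path from segments; the platform supplies the separator.
    static Path fromSegments() {
        return Paths.get("D:", "tools", "nginx-1.18.0", "html", "index.html");
    }

    // "\t" is an escape for TAB, so this literal does not contain the letter 't'.
    static String accidentalTab() {
        return "D:\tools";
    }

    public static void main(String[] args) {
        System.out.println(escapedLiteral());
        System.out.println(forwardSlashLiteral());
        System.out.println(fromSegments().getFileName());
        System.out.println(accidentalTab().indexOf('\t') >= 0);
    }
}
```

Any of the first three forms can be handed to new File(...) before calling fileToTxt or tikaTool. The fourth method is there only to show what goes wrong with a single backslash.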
