Java zip reading: reading zip files efficiently in Java

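Posted by 现代洋葱, a blogger on 靠谱客. This article, collected during recent development work, covers reading zip files efficiently in Java. It seemed useful, so it is shared here for reference.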

Overview

I am working on a project that processes a very large amount of data.

I have a lot (thousands) of zip files, each containing ONE simple txt file with thousands of lines (about 80k lines).

What I am currently doing is the following:

for (File zipFile : dir.listFiles()) {
    ZipFile zf = new ZipFile(zipFile);
    ZipEntry ze = (ZipEntry) zf.entries().nextElement();
    BufferedReader in = new BufferedReader(new InputStreamReader(zf.getInputStream(ze)));
    ...
}

This way I can read the file line by line, but it is definitely too slow.

Given the large number of files and lines that need to be read, I need to read them in a more efficient way.

I have looked for a different approach, but I haven't been able to find anything.

I think I should use the Java NIO APIs, which are intended for intensive I/O operations, but I don't know how to use them with zip files.

Any help would really be appreciated.

Thanks,

Marco

Solution

"I have a lot (thousands) of zip files. The zipped files are about 30 MB each, while the txt inside the zip file is about 60/70 MB. Reading and processing the files with this code takes many hours, around 15, but it depends."

Let's do some back-of-the-envelope calculations.

Let's say you have 5000 files. If it takes 15 hours to process them, this equates to ~10 seconds per file. The files are about 30 MB each, so the throughput is ~3 MB/s.

This is between one and two orders of magnitude slower than the rate at which ZipFile can decompress stuff.
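The estimate above can be reproduced with a few lines of Java. This is only a sketch of the arithmetic: the class and method names (ThroughputEstimate, secondsPerFile, megabytesPerSecond) are made up for illustration, and the figure of 5000 files is the same assumption used in the estimate, not a measured value.

```java
import java.util.Locale;

public class ThroughputEstimate {
    // Average wall-clock seconds per file for a batch that took totalHours.
    public static double secondsPerFile(double totalHours, int fileCount) {
        return totalHours * 3600.0 / fileCount;
    }

    // Compressed megabytes consumed per second at that pace.
    public static double megabytesPerSecond(double fileSizeMb, double secondsPerFile) {
        return fileSizeMb / secondsPerFile;
    }

    public static void main(String[] args) {
        // Figures from the estimate: 5000 files (assumed), 15 hours, 30 MB per zip.
        double secs = secondsPerFile(15, 5000);
        double rate = megabytesPerSecond(30, secs);
        System.out.println(String.format(Locale.ROOT,
                "%.1f s per file, %.2f MB/s", secs, rate));
    }
}
```

Plugging in the real file count and run time gives a number that can be compared against what the disk and the decompressor are able to deliver.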

Either there's a problem with the disks (are they local, or a network share?), or it is the actual processing that is taking most of the time.

The best way to find out for sure is by using a profiler.
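As a quick check alongside a profiler, one option is to time decompression on its own, with no per-line processing, and compare the result with the full run. The sketch below is an assumption about how such a check might look, not the asker's code: ZipReadTimer and drainFirstEntry are hypothetical names, and it takes the directory of zip files as its first command-line argument.

```java
import java.io.File;
import java.io.IOException;
import java.io.InputStream;
import java.util.Enumeration;
import java.util.Locale;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;

public class ZipReadTimer {
    // Decompresses the first entry of the zip, discards the bytes,
    // and returns the number of uncompressed bytes read.
    public static long drainFirstEntry(File zipFile) throws IOException {
        try (ZipFile zf = new ZipFile(zipFile)) {
            Enumeration<? extends ZipEntry> entries = zf.entries();
            if (!entries.hasMoreElements()) {
                return 0;
            }
            ZipEntry ze = entries.nextElement();
            byte[] buf = new byte[64 * 1024];
            long total = 0;
            try (InputStream in = zf.getInputStream(ze)) {
                int n;
                while ((n = in.read(buf)) != -1) {
                    total += n;
                }
            }
            return total;
        }
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: java ZipReadTimer <directory>");
            return;
        }
        File[] files = new File(args[0]).listFiles((d, name) -> name.endsWith(".zip"));
        if (files == null) {
            throw new IOException("Not a readable directory: " + args[0]);
        }
        long bytes = 0;
        long start = System.nanoTime();
        for (File f : files) {
            bytes += drainFirstEntry(f);
        }
        double secs = (System.nanoTime() - start) / 1e9;
        System.out.println(String.format(Locale.ROOT,
                "%d files, %.1f MB uncompressed in %.1f s",
                files.length, bytes / 1e6, secs));
    }
}
```

If this loop finishes far faster than the full job, the time is probably going into the processing step. If it is just as slow, the disks or the network share are the more likely suspects. Note that a second run may be served from the operating system's file cache, so the first run is the more telling one.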