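I'm 彩色狗, a blogger on 靠谱客. This article covers sorting and deduplicating a 70 GB file on Linux, using the sort command to sort and deduplicate (large) files (quickly). I'm sharing it here in the hope that it serves as a useful reference.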

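This article was last updated on February 28, 2016, and has not been updated for more than a year. If any of its content is out of date, please let me know. Thanks!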





Sorting large files faster with a shell script



Look carefully at the options of sort to speed up performance, and understand their impact on your machine and your problem. The key parameters on Ubuntu are:

Location of temporary files: -T directory_name

Amount of memory to use: -S SIZE. SIZE can be a percentage of physical memory or an absolute amount, for example "-S 80%" to use up to 80% of RAM or "-S 2G" to use 2 GB. More is generally better, but avoid oversubscribing memory to the point where the system starts swapping to disk.

The questioner asks, "Why no high memory usage?" The answer is historical: older Unix machines were small, so the default memory size is set small. Raise it as far as your workload allows to improve sort performance substantially. Set the temporary directory (-T) to a location on your fastest device that has enough free space to hold at least 1.25 times the size of the file being sorted.
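The -T, -S, and free-space advice above can be sketched roughly as follows. This is only an illustration: the paths are throwaway files created with mktemp, the 64M buffer is an arbitrary value chosen for the tiny sample, and the 1.25x check is a rule of thumb rather than something sort enforces itself.

```shell
# Throwaway sample data and a scratch directory (hypothetical paths).
workdir=$(mktemp -d)
input="$workdir/input.txt"
tmpdir="$workdir/tmp"
mkdir -p "$tmpdir"
printf '%s\n' banana apple cherry apple > "$input"

# Input size and free space on the temp filesystem, both in KiB.
need_kb=$(du -k "$input" | awk '{print $1}')
free_kb=$(df -Pk "$tmpdir" | awk 'NR==2 {print $4}')

# Require roughly 1.25x the input size (integer arithmetic: size * 5 / 4).
if [ "$free_kb" -lt $((need_kb * 5 / 4)) ]; then
    echo "not enough temporary space in $tmpdir" >&2
    exit 1
fi

# -T: where sort spills temporary files; -S: size of the in-memory buffer.
sort -T "$tmpdir" -S 64M -o "$workdir/sorted.txt" "$input"
cat "$workdir/sorted.txt"
```

For a real 70 GB input you would point tmpdir at your fastest disk and pick -S based on how much RAM you can spare.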


Buffer it in memory using -S. For example, to use (up to) 50% of your memory as a sorting buffer do:

sort -S 50% file

Note that modern GNU sort can sort in parallel. In my experience it uses multiple cores automatically (by default GNU sort uses the number of available processors, limited to 8). You can set the thread count explicitly with --parallel. To sort using 4 threads:

sort --parallel=4 file

So all in all, you should put everything into one file and execute something like:

sort -S 50% --parallel=4 file
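If you would rather not hard-code the thread count, one possible variation is to ask the system for it. The sketch below assumes GNU coreutils (for --parallel and nproc) and uses a small throwaway file in place of your real, concatenated input; the fallback of 4 threads is simply the value used above.

```shell
# Throwaway sample data; substitute your own single, concatenated file.
workdir=$(mktemp -d)
input="$workdir/all.txt"
printf '%s\n' pear fig apple fig > "$input"

# nproc is part of GNU coreutils; fall back to 4 threads if it is missing.
threads=$(nproc 2>/dev/null || echo 4)

# Up to 50% of RAM as the sort buffer, one sort thread per reported CPU.
sort -S 50% --parallel="$threads" -o "$workdir/sorted.txt" "$input"
cat "$workdir/sorted.txt"
```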


Using the sort command will probably be the fastest option.

But you’ll probably want to fix the locale to C.

sort -u doesn't report unique lines, but one of each set of lines that sort the same. In the C locale, two different lines never sort the same, but that is not the case in most UTF-8 based locales on GNU systems.

Also, using the C locale avoids the overhead of parsing UTF-8 and processing complex sort orders, which can improve performance dramatically.


LC_ALL=C sort -u file
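To see what byte-wise ordering in the C locale looks like in practice, here is a small sketch on made-up data (the file names are throwaway examples). Uppercase letters sort before lowercase ones because comparison is by byte value, and only exact duplicates are dropped.

```shell
# Throwaway sample with exact duplicates and a case variant.
workdir=$(mktemp -d)
printf '%s\n' b a B a b > "$workdir/in.txt"

# C locale: plain byte comparison, so "B" (0x42) sorts before "a" (0x61).
LC_ALL=C sort -u "$workdir/in.txt" > "$workdir/out.txt"
cat "$workdir/out.txt"
# prints:
# B
# a
# b
```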

You can also improve performance by using a faster drive (or a different drive from the one holding the input and/or output files) for the temporary files (using -T or the $TMPDIR environment variable), or by tuning the -S option supported by some sort implementations.
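As a rough sketch of the environment-variable route, TMPDIR can be set for a single sort invocation without touching the rest of the shell session. The scratch directory here is a stand-in created with mktemp; in practice it would be a directory on the faster or separate drive.

```shell
# Throwaway sample data and a stand-in scratch directory.
workdir=$(mktemp -d)
scratch="$workdir/scratch"
mkdir -p "$scratch"
printf '%s\n' 3 1 2 1 > "$workdir/in.txt"

# TMPDIR applies only to this one command; sort places any temporary files there.
TMPDIR="$scratch" LC_ALL=C sort -u -o "$workdir/out.txt" "$workdir/in.txt"
cat "$workdir/out.txt"
```

With input this small sort never needs to spill to disk, so the scratch directory stays empty; it only matters once the data exceeds the memory buffer.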

For some types of input, or on slow storage, using the --compress-program option of GNU sort (for instance with lzop) might improve performance in addition to reducing storage usage.
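A minimal sketch of the compression option, assuming GNU sort. Since lzop is not installed everywhere, this example falls back to gzip, which is my substitution and not part of the advice above; the sample file is a throwaway.

```shell
# Throwaway sample data.
workdir=$(mktemp -d)
printf '%s\n' delta alpha charlie alpha bravo > "$workdir/in.txt"

# Prefer lzop if present; otherwise use gzip (an assumption for portability).
if command -v lzop >/dev/null 2>&1; then
    compressor=lzop
else
    compressor=gzip
fi

# Temporary files, if sort needs any, are compressed with $compressor
# and read back with "$compressor -d".
LC_ALL=C sort --compress-program="$compressor" -u \
    -o "$workdir/out.txt" "$workdir/in.txt"
cat "$workdir/out.txt"
```

On a sample this small no temporary files are written, so the compressor is never actually run; whether it helps on a 70 GB file depends on how compressible the data is and how slow the temporary storage is.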



