Overview
This post was last updated on 28 February 2016, more than a year ago. If anything in it is out of date, please let me know. Thanks!
=Start=
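In my blog subscriptions I came across a post titled "Deduplicating large text files". The content was good, but the original post gave no link to the source of its code, so I searched by hand, dug up more along the way, and am recording it here.
Search keywords / reference links: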
sorting large files faster with a shell script
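Reference answers:
(Read the manual.) Use the -T option to set the temporary directory by hand; use the -S option to set how much memory sort may use; on a multi-core server you can also use the --parallel option to set the number of concurrent sorts and speed things up. In some special cases you can even set the environment variable "LC_ALL=C" to go faster (it avoids parsing UTF-8 text and applying complex collation rules).
Look carefully at the options of sort to speed up performance, and understand their impact on your machine and problem. Key parameters on Ubuntu are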
Location of temporary files -T directory_name
Amount of memory to use -S N% (N% of physical memory; the more the better, but avoid oversubscription that causes swapping to disk. You can use it like "-S 80%" to use 80% of RAM, or "-S 2G" for 2 GB of RAM.)
The questioner asks "Why no high memory usage?" The answer comes from history: older Unix machines were small, so the default memory size is set small. Adjust this as big as possible for your workload to vastly improve sort performance. Set the working directory to a place on your fastest device that has enough space to hold at least 1.25 * the size of the file being sorted.
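The two options above can be sketched together as follows. This is a minimal illustration assuming GNU sort; the temporary directory, the file names, and the 64M buffer size are placeholders made up for the example, not values from the original answer.

```shell
#!/bin/sh
# Sketch: sort with an explicit temp directory (-T) and memory buffer (-S).
# Assumes GNU sort. All paths here are hypothetical and created with mktemp.
workdir=$(mktemp -d)            # stands in for a directory on your fastest disk
input="$workdir/input.txt"
output="$workdir/sorted.txt"

# Tiny sample input; a real input would be far larger than the buffer.
printf '%s\n' pear apple fig apple > "$input"

# -T: where sort writes its spill files
# -S: sort buffer size (for example 64M, 2G, or 50%)
# -o: write the result to a file instead of stdout
sort -T "$workdir" -S 64M -o "$output" "$input"

sorted_result=$(cat "$output")
printf '%s\n' "$sorted_result"

rm -rf "$workdir"               # clean up the illustration directory
```

For a real workload, point -T at a directory with at least 1.25 times the input size free, as the answer suggests, and size -S to what the machine can spare without swapping.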
==
Buffer it in memory using -S. For example, to use (up to) 50% of your memory as a sorting buffer do:
sort -S 50% file
Note that modern Unix sort can sort in parallel. My experience is that it automatically uses as many cores as possible. You can set it directly using --parallel. To sort using 4 threads:
sort --parallel=4 file
So all in all, you should put everything into one file and execute something like:
sort -S 50% --parallel=4 file
==
Using the sort command will probably be the fastest option.
But you’ll probably want to fix the locale to C.
sort -u doesn’t report unique lines, but one of each set of lines that sort the same. In the C locale, 2 different lines necessarily don’t sort the same, but that’s not the case in most UTF-8 based locales on GNU systems.
Also, using the C locale avoids the overhead of having to parse UTF-8 and processing complex sort orders so would improve performance dramatically.
So:
LC_ALL=C sort -u file
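A small sketch of what the C locale means in practice: lines are compared byte by byte, so uppercase ASCII letters sort before lowercase ones and only byte-identical lines are merged by -u. The sample input is made up for illustration.

```shell
#!/bin/sh
# Sketch: byte-wise sort and dedup in the C locale.
# In C, 'B' (0x42) sorts before 'a' (0x61), and -u only merges
# lines that are byte-for-byte identical.
deduped=$(printf '%s\n' b a B a b | LC_ALL=C sort -u)
printf '%s\n' "$deduped"
# prints:
# B
# a
# b
```

Setting LC_ALL on the command line like this affects only that one sort invocation, not the rest of the shell session.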
You can also improve performance by using a faster drive (or a different drive from the one where the input and/or output files are) for the temporary files (using -T or the $TMPDIR environment variable), or by fiddling with the -S option supported by some sort implementations.
For some types of input or for slow storage, using the --compress-program option of GNU sort (for instance with lzop) might improve performance in addition to storage usage.
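Putting the suggestions from these answers together, one possible combined invocation might look like the sketch below. It assumes GNU sort; gzip is swapped in for lzop only because it is more commonly installed, and the file names, thread count, and buffer size are illustrative placeholders rather than recommendations from the answers.

```shell
#!/bin/sh
# Sketch: combine the C locale, -u, -T, -S, --parallel and --compress-program.
# Assumes GNU sort. Paths are hypothetical and created with mktemp.
tmpdir=$(mktemp -d)             # ideally on a fast drive separate from the input
input="$tmpdir/big.txt"
output="$tmpdir/unique.txt"

# Tiny sample input standing in for a very large file.
printf '%s\n' delta alpha charlie alpha bravo > "$input"

# The compression program is only invoked when sort actually spills
# temporary files to disk, which this tiny input will not trigger.
LC_ALL=C sort -u \
    -T "$tmpdir" \
    -S 50% \
    --parallel=4 \
    --compress-program=gzip \
    -o "$output" "$input"

unique_lines=$(cat "$output")
printf '%s\n' "$unique_lines"

rm -rf "$tmpdir"                # clean up the illustration directory
```

Whether compression helps depends on the data and the disk, so it is worth timing a run with and without --compress-program on a sample of the real input.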
=EOF=