Nutch Study Notes 2
-----------------
1. An introduction to the Nutch 1.3 commands
To see the help for the Nutch commands, run the following command, which prints the usage summary:
bin/nutch
Usage: nutch [-core] COMMAND
where COMMAND is one of:
  crawl             one-step crawler for intranets
  readdb            read / dump crawl db
  convdb            convert crawl db from pre-0.9 format
  mergedb           merge crawldb-s, with optional filtering
  readlinkdb        read / dump link db
  inject            inject new urls into the database
  generate          generate new segments to fetch from crawl db
  freegen           generate new segments to fetch from text files
  fetch             fetch a segment's pages
  parse             parse a segment's pages
  readseg           read / dump segment data
  mergesegs         merge several segments, with optional filtering and slicing
  updatedb          update crawl db from segments after fetching
  invertlinks       create a linkdb from parsed segments
  mergelinkdb       merge linkdb-s, with optional filtering
  solrindex         run the solr indexer on parsed segments and linkdb
  solrdedup         remove duplicates from solr
  solrclean         remove HTTP 301 and 404 documents from solr
  plugin            load a plugin and run one of its classes main()
 or
  CLASSNAME         run the class named CLASSNAME
Most commands print help when invoked w/o parameters.

Expert: -core option is for developers only. It avoids building the job jar,
        instead it simply includes classes compiled with ant compile-core.
        NOTE: this works only for jobs executed in 'local' mode
2. Notes on individual commands
2.1 bin/nutch crawl
Usage: Crawl <urlDir> -solr <solrURL> [-dir d] [-threads n] [-depth i] [-topN N]
This command performs a one-step crawl of the given URLs.
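As an illustration of the usage line above, here is a minimal sketch of a one-step crawl. The seed directory `urls`, the output directory `crawl`, the Solr URL, and the numeric values are all assumptions chosen for the example, and the `run` helper is a hypothetical wrapper that prints the command and only executes it when the `bin/nutch` script is actually present.

```shell
# Assumed: run from the directory that contains bin/nutch.
NUTCH="${NUTCH:-bin/nutch}"
URL_DIR="urls"                          # assumed: text files with one seed URL per line
SOLR_URL="http://localhost:8983/solr/"  # assumed local Solr instance

# Hypothetical helper: print the command, run it only if the script exists.
run() {
  echo "+ $*"
  if [ -x "$1" ]; then "$@"; fi
}

# Crawl the seeds 3 levels deep, at most 50 URLs per level, with 10 fetcher
# threads, writing crawl data under ./crawl and indexing into Solr.
run "$NUTCH" crawl "$URL_DIR" -solr "$SOLR_URL" -dir crawl -threads 10 -depth 3 -topN 50
```

Small values of `-depth` and `-topN` keep a first test crawl short; they can be raised once the setup works.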
2.2 bin/nutch readdb
Usage: CrawlDbReader <crawldb> (-stats | -dump <out_dir> | -topN <nnnn> <out_dir> [<min>] | -url <url>)
This command reads the crawldb database. It is mainly used to dump the corresponding URL records to files.
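The four modes in the usage line can be tried one at a time. The sketch below assumes the crawldb lives at `crawl/crawldb` and uses made-up output directory names; the `run` helper is a hypothetical wrapper that prints each command and only executes it when `bin/nutch` exists.

```shell
NUTCH="${NUTCH:-bin/nutch}"
CRAWLDB="crawl/crawldb"   # assumed crawldb location

# Hypothetical helper: print the command, run it only if the script exists.
run() {
  echo "+ $*"
  if [ -x "$1" ]; then "$@"; fi
}

# Print summary statistics for the crawldb.
run "$NUTCH" readdb "$CRAWLDB" -stats

# Dump every entry as text into a new output directory.
run "$NUTCH" readdb "$CRAWLDB" -dump crawldb_dump

# Dump the 10 highest-scoring URLs into another output directory.
run "$NUTCH" readdb "$CRAWLDB" -topN 10 crawldb_top10

# Show the stored record for a single URL.
run "$NUTCH" readdb "$CRAWLDB" -url http://nutch.apache.org/
```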
2.3 bin/nutch convdb
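This command converts crawldb data from an older Nutch format (pre-0.9, according to the help text) into the 1.3 format.
2.4 bin/nutch mergedb
Usage: CrawlDbMerger <output_crawldb> <crawldb1> [<crawldb2> <crawldb3> ...] [-normalize] [-filter]
This command merges multiple crawldb databases.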
2.5 bin/nutch readlinkdb
Usage: LinkDbReader <linkdb> (-dump <out_dir> | -url <url>)
This command reads the link data produced by invertlinks.
2.6 bin/nutch inject
Usage: Injector <crawldb> <url_dir>
This command injects the URLs found in url_dir into the crawldb database.
2.7 bin/nutch generate
Usage: Generator <crawldb> <segments_dir> [-force] [-topN N] [-numFetchers numFetchers] [-adddays numDays] [-noFilter] [-noNorm] [-maxNumSegments num]
This command generates the list of URLs that are ready to be fetched.
2.8 bin/nutch freegen
Usage: FreeGenerator <inputDir> <segmentsDir> [-filter] [-normalize]
This command extracts URLs from text files to generate a new segment to fetch.
2.9 bin/nutch fetch
Usage: Fetcher <segment> [-threads n] [-noParsing]
This command fetches the URLs produced by generate. It runs on the Hadoop framework and uses a FetcherOutputFormat to write its results into multiple directories.
2.10 bin/nutch parse
Usage: ParseSegment segment
This command parses the fetched content of a segment.
2.11 bin/nutch readseg
Usage: SegmentReader (-dump ... | -list ... | -get ...) [general options]
This command prints the contents of a segment.
2.12 bin/nutch invertlinks
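Usage: LinkDb <linkdb> (-dir <segmentsDir> | <seg1> <seg2> ...) [-force] [-noNormalize] [-noFilter]
This command builds the linkdb from the links found in the parsed segments, inverting them so that each URL is stored together with the links that point to it.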
2.13 bin/nutch solrindex
Usage: SolrIndexer <solr url> <crawldb> <linkdb> (<segment> ... | -dir <segments>)
This command builds an index of the fetched content. It requires a working Solr environment.
2.14 bin/nutch plugin
Usage: PluginRepository pluginId className [arg1 arg2 ...]
This command is mainly for testing a plugin: it runs the main method of the given class.
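The individual commands above can also be chained by hand into a single crawl round. The following is a minimal sketch under several assumptions: the directory layout (`crawl/crawldb`, `crawl/linkdb`, `crawl/segments`), the seed directory `urls`, the Solr URL, and the `-topN` and `-threads` values are made up for the example. updatedb is listed in the help output but has no section in these notes, so its argument order here (`<crawldb> <segment>`) is my understanding and should be checked against the help printed by `bin/nutch updatedb`. The `run` and `crawl_once` functions are hypothetical helpers; `run` prints each command and only executes it when `bin/nutch` exists.

```shell
NUTCH="${NUTCH:-bin/nutch}"
CRAWLDB="crawl/crawldb"                 # assumed layout
LINKDB="crawl/linkdb"
SEGMENTS="crawl/segments"
SOLR_URL="http://localhost:8983/solr/"  # assumed local Solr instance

# Hypothetical helper: print the command, run it only if the script exists.
run() {
  echo "+ $*"
  if [ -x "$1" ]; then "$@"; fi
}

# One inject/generate/fetch/parse/update/index round.
crawl_once() {
  run "$NUTCH" inject "$CRAWLDB" urls
  run "$NUTCH" generate "$CRAWLDB" "$SEGMENTS" -topN 50

  # Segment directories are named by timestamp, so the last one is the newest.
  # SEGMENT_NAME is only a placeholder used when no segment exists yet.
  SEGMENT=$(ls -d "$SEGMENTS"/2* 2>/dev/null | tail -n 1 || true)
  SEGMENT="${SEGMENT:-$SEGMENTS/SEGMENT_NAME}"

  run "$NUTCH" fetch "$SEGMENT" -threads 10
  # Skip this step if the fetcher is configured to parse while fetching.
  run "$NUTCH" parse "$SEGMENT"
  run "$NUTCH" updatedb "$CRAWLDB" "$SEGMENT"
  run "$NUTCH" invertlinks "$LINKDB" -dir "$SEGMENTS"
  run "$NUTCH" solrindex "$SOLR_URL" "$CRAWLDB" "$LINKDB" "$SEGMENT"
}

crawl_once
```

Calling `crawl_once` again would start another round from generate onward in practice; re-running inject with the same seeds each time is a simplification of this sketch.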