Overview
crawlstatus:
STATUS_UNFETCHED = 0x01; //Page was not fetched yet
STATUS_FETCHED = 0x02; //Page was successfully fetched
STATUS_GONE = 0x03; //Page no longer exists
STATUS_REDIR_TEMP = 0x04; //Page temporarily redirects to other page
STATUS_REDIR_PERM = 0x05; //Page permanently redirects to other page
STATUS_RETRY = 0x22; //Fetching unsuccessful, needs to be retried (transient errors)
STATUS_NOTMODIFIED = 0x26; //Fetching successful - page is not modified
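To make the status bytes above easier to read in logs, they can be mapped to short names. The following is only a minimal sketch: the class `CrawlStatusNames`, the method `nameOf`, and the name strings are hypothetical helpers written for this note, not part of Nutch. Only the constant values are taken from the list above.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper, not a Nutch class: maps status bytes to readable names.
final class CrawlStatusNames {
    static final byte STATUS_UNFETCHED = 0x01;
    static final byte STATUS_FETCHED = 0x02;
    static final byte STATUS_GONE = 0x03;
    static final byte STATUS_REDIR_TEMP = 0x04;
    static final byte STATUS_REDIR_PERM = 0x05;
    static final byte STATUS_RETRY = 0x22;
    static final byte STATUS_NOTMODIFIED = 0x26;

    private static final Map<Byte, String> NAMES = new HashMap<>();
    static {
        NAMES.put(STATUS_UNFETCHED, "unfetched");
        NAMES.put(STATUS_FETCHED, "fetched");
        NAMES.put(STATUS_GONE, "gone");
        NAMES.put(STATUS_REDIR_TEMP, "redir_temp");
        NAMES.put(STATUS_REDIR_PERM, "redir_perm");
        NAMES.put(STATUS_RETRY, "retry");
        NAMES.put(STATUS_NOTMODIFIED, "notmodified");
    }

    // Returns the name for a status byte, or "unknown" for any other value.
    static String nameOf(byte status) {
        return NAMES.getOrDefault(status, "unknown");
    }
}
```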
injectorjob
**_injmrk_ :'y'**
distance:0
generatorjob
Generate a batchId
If distance > maxDistance, return
If _gnmrk_ is already set, return
**If fetchTime is too recent (the page is not yet due), return**
If count >= limit, return
Compute the URL's score
**_gnmrk_ = batchId**
page.batchId = batchId
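The generator checks above can be sketched as a single selection method. This is an illustration under stated assumptions, not Nutch's actual GeneratorJob code: `GeneratorSketch`, its nested `Page` class, and `offer` are hypothetical names, "too recent" is read as "fetchTime is still in the future", and score computation is left out.

```java
// Hypothetical sketch of the generator's selection rules; not Nutch code.
final class GeneratorSketch {
    // Minimal stand-in for the fields of a page that the rules touch.
    static final class Page {
        int distance;
        String generateMark; // stands in for the _gnmrk_ marker
        long fetchTime;      // assumed: time at which the page is next due
        String batchId;
    }

    private final String batchId;
    private final int maxDistance;
    private final long limit;
    private long count = 0;

    GeneratorSketch(String batchId, int maxDistance, long limit) {
        this.batchId = batchId;
        this.maxDistance = maxDistance;
        this.limit = limit;
    }

    // Returns true if the page was selected for this batch and marked.
    boolean offer(Page page, long now) {
        if (page.distance > maxDistance) return false; // too far from the seeds
        if (page.generateMark != null) return false;   // already generated
        if (page.fetchTime > now) return false;        // assumption: not yet due
        if (count >= limit) return false;              // batch is full
        count++;
        page.generateMark = batchId;                   // _gnmrk_ = batchId
        page.batchId = batchId;
        return true;
    }
}
```

A page that passes all checks carries the batchId in both places, which is what the later jobs compare against.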
fetcherjob
If _gnmrk_ is not set, return
If _ftcmrk_ is already set, return
**If batchId.equals(_gnmrk_), fetch**
**_ftcmrk_ = _gnmrk_**
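The fetcher rules above amount to a small guard plus a marker copy. A minimal sketch follows, assuming a simplified page object; `FetcherSketch`, `Page`, `shouldFetch`, and `markFetched` are hypothetical names used only for illustration and do not reflect the real FetcherJob code.

```java
// Hypothetical sketch of the fetcher's marker checks; not Nutch code.
final class FetcherSketch {
    static final class Page {
        String generateMark; // stands in for _gnmrk_
        String fetchMark;    // stands in for _ftcmrk_
    }

    // A page is fetched only if it was generated in this batch
    // and has not been fetched yet.
    static boolean shouldFetch(Page page, String batchId) {
        if (page.generateMark == null) return false;
        if (page.fetchMark != null) return false;
        return batchId.equals(page.generateMark);
    }

    // After fetching, the fetch marker takes the generate marker's value.
    static void markFetched(Page page) {
        page.fetchMark = page.generateMark;
    }
}
```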
parsejob
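**If batchId.equals(_ftcmrk_), parse**
If batchId.equals("-reparse") or force is set, parse unconditionally
If _ftcmrk_ is not set, return
If __prsmrk__ is already set, return
If skipTruncated applies (the content is truncated), return
If status is not STATUS_FETCHED, return
parse: setSignature, setOutlinks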
__prsmrk__ = _ftcmrk_
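One possible reading of the parse rules above is sketched below. The notes do not spell out how "-reparse" and force interact with the other checks, so this sketch assumes that either one bypasses the marker checks but not the truncation and status checks. `ParserSketch`, `Page`, `shouldParse`, and `markParsed` are hypothetical names, not Nutch APIs.

```java
// Hypothetical sketch of the parser's decision rules; not Nutch code.
final class ParserSketch {
    static final byte STATUS_FETCHED = 0x02;

    static final class Page {
        String fetchMark;  // stands in for _ftcmrk_
        String parseMark;  // stands in for __prsmrk__
        boolean truncated; // assumed flag: content was cut off during fetch
        byte status;
    }

    static boolean shouldParse(Page page, String batchId,
                               boolean force, boolean skipTruncated) {
        // Assumption: "-reparse" or force bypasses the marker checks.
        boolean forced = "-reparse".equals(batchId) || force;
        if (!forced) {
            if (page.fetchMark == null) return false;       // never fetched
            if (page.parseMark != null) return false;       // already parsed
            if (!batchId.equals(page.fetchMark)) return false;
        }
        if (skipTruncated && page.truncated) return false;
        return page.status == STATUS_FETCHED;
    }

    // After a successful parse: __prsmrk__ = _ftcmrk_
    static void markParsed(Page page) {
        page.parseMark = page.fetchMark;
    }
}
```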
dbupdaterjob
**If batchId.equals(_gnmrk_), update**
* Insert outlinks into the database
* Update the score
* Check whether the Signature has changed
* Update fetchTime and modifiedTime
* Compute newDistance
* Delete _ftcmrk_ / _gnmrk_
* If __prsmrk__ exists: _updmrk_ = __prsmrk__ ; __prsmrk__ = NULL
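The marker handoff in the last two steps can be illustrated with a plain map. This is a sketch only: the `Map<String, String>` stands in for a page's marker storage, and `UpdateMarkersSketch` and `handOff` are hypothetical names. The marker keys are the ones used in these notes; the other update steps (outlinks, score, signature, times, distance) are not modelled.

```java
import java.util.Map;

// Hypothetical sketch of the marker handoff during a DB update; not Nutch code.
final class UpdateMarkersSketch {
    static void handOff(Map<String, String> markers) {
        // Fetch and generate markers are no longer needed after the update.
        markers.remove("_ftcmrk_");
        markers.remove("_gnmrk_");
        // If the page was parsed, move the parse marker to the update marker.
        String parseMark = markers.remove("__prsmrk__");
        if (parseMark != null) {
            markers.put("_updmrk_", parseMark);
        }
    }
}
```

After the handoff, a parsed page carries only `_updmrk_`, which is the marker the indexing step compares against the batchId.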
indexingjob
**If batchId.equals(_updmrk_), index**
if (pstatus == null || !ParseStatusUtils.isSuccess(pstatus)) return
indexing
_idxmrk_ = _updmrk_
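The indexing rules above follow the same pattern as the earlier jobs. In the sketch below, a `Map<String, String>` again stands in for the page's markers, and a nullable `Boolean` stands in for the parse status check (null meaning no parse status is available). `IndexingSketch` and `indexIfEligible` are hypothetical names, and the actual indexing call is left as a comment.

```java
import java.util.Map;

// Hypothetical sketch of the indexing job's marker checks; not Nutch code.
final class IndexingSketch {
    // Returns true if the page was eligible and has been marked as indexed.
    static boolean indexIfEligible(Map<String, String> markers,
                                   String batchId, Boolean parseSucceeded) {
        String updateMark = markers.get("_updmrk_");
        if (!batchId.equals(updateMark)) return false;  // not updated in this batch
        if (parseSucceeded == null || !parseSucceeded) return false;
        // The document would be sent to the index here.
        markers.put("_idxmrk_", updateMark);            // _idxmrk_ = _updmrk_
        return true;
    }
}
```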