Setting Up a Search Engine Cluster with Nutch 2.3, Hadoop 2.7, Solr 5.3, and HBase 0.98.14

Posted 2015-09-17, updated 2025-04-23, filed under Big Data, Hadoop

This took a month of painstaking trial and error. Feel free to try it out:

Nutch 2.3

Local mode

Copy the relevant HBase 0.98.14 jars into local/lib.
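The copy step above can be sketched as a small shell helper. This is only an illustration: the function name copy_jars, the hbase- file name prefix, and the HBASE_HOME and NUTCH_HOME variables in the usage comment are assumptions, not something the original setup prescribes.

```shell
# Copy every jar whose file name starts with <prefix> from one lib
# directory into another, and print how many files were copied.
# Usage: copy_jars <src_lib_dir> <dest_lib_dir> <prefix>
copy_jars() {
  local src="$1" dest="$2" prefix="$3"
  local jar copied=0
  mkdir -p "$dest"
  for jar in "$src"/"$prefix"*.jar; do
    [ -e "$jar" ] || continue   # the glob matched nothing
    cp -f "$jar" "$dest"/
    copied=$((copied + 1))
  done
  echo "$copied"
}

# Hypothetical usage (paths are assumptions):
#   copy_jars "$HBASE_HOME/lib" "$NUTCH_HOME/local/lib" hbase-
```

Printing the count makes it easy to notice a wrong source path, since a result of 0 means no jar matched.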

Show the help:

bin/nutch

Open the Solr Web UI:

http://master:8983/solr/

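Run a crawl. (Note: if you use the Solr 5.3 configuration described below, the Solr URL here is http://localhost:8983/solr/solr/. "solr" appears twice because the second one is the core name, whereas the Web UI URL contains it only once.)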

./bin/crawl 
Missing seedDir : crawl <seedDir> <crawlID> <solrURL> <numberOfRounds>

bin/crawl urls/seed.txt TestCrawl http://localhost:8983/solr/solr/ 2

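urls/seed.txt is the seedDir; the urls directory holds the seed URLs.
TestCrawl is the crawlId. Nutch creates an HBase table prefixed with the crawlId, for example TestCrawl_webpage.
http://localhost:8983/solr/solr/ is the Solr server URL.
2 is numberOfRounds, the number of crawl iterations.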
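To make the four arguments concrete, here is a minimal sketch that assembles the Solr core URL and the crawl command line, and only prints the command instead of running it. The helper names solr_core_url and crawl_command are made up for this example; the argument order follows the usage message shown above.

```shell
# Build the Solr core URL as <base>/<core>/ (a trailing slash on base is dropped).
solr_core_url() {
  local base="${1%/}" core="$2"
  echo "$base/$core/"
}

# Print the bin/crawl command line for the given arguments without running it.
# Usage: crawl_command <seedDir> <crawlID> <solrURL> <numberOfRounds>
crawl_command() {
  if [ "$#" -ne 4 ]; then
    echo "usage: crawl_command <seedDir> <crawlID> <solrURL> <numberOfRounds>" >&2
    return 1
  fi
  local seed="$1" crawl_id="$2" solr_url="$3" rounds="$4"
  case "$rounds" in
    ''|*[!0-9]*) echo "numberOfRounds must be a whole number" >&2; return 1 ;;
  esac
  echo "bin/crawl $seed $crawl_id $solr_url $rounds"
}

# Example:
#   crawl_command urls/seed.txt TestCrawl "$(solr_core_url http://localhost:8983/solr solr)" 2
```

Building the URL from the base and the core name keeps the doubled "solr" from looking like a typo.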

Check the results:

./bin/nutch readdb -crawlId TestCrawl -stats

When you run scan in the hbase shell to inspect the table and are unsure what a column means, check conf/gora-hbase-mapping.xml, which defines the column families and the columns.
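As a rough sketch of that inspection step, the helpers below derive the table name from a crawlId and print a scan statement that can be piped into the hbase shell. The function names are invented for this example, and the _webpage suffix assumes that conf/gora-hbase-mapping.xml keeps the table name webpage, which is the default as far as I know.

```shell
# Derive the HBase table name for a crawl ID.
# Assumption: the mapping file uses the table name "webpage".
webpage_table() {
  echo "${1}_webpage"
}

# Print an hbase shell statement that scans the first <limit> rows
# (default 1) of the table belonging to <crawlId>.
scan_command() {
  local crawl_id="$1" limit="${2:-1}"
  echo "scan '$(webpage_table "$crawl_id")', {LIMIT => $limit}"
}

# On a machine with HBase installed (not executed here):
#   scan_command TestCrawl 1 | hbase shell
```

Limiting the scan to a row or two keeps the output readable, because each fetched page carries many columns.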

Hadoop 2.7 (fully distributed)

If the native library fails to load, see the article on resolving "Unable to load native-hadoop library for your platform".

Solr 5.3

Download and unpack.

example/example-DIH contains a complete Solr home configuration; copy it into server/solr:

cp -rf /disk2/solr/solr-5.3/example/example-DIH/solr/* /disk2/solr/solr-5.3/server/solr/

To fix an error Nutch may hit at run time, "Error 404: Prob accessing /solr/solr/update. Reason: Not Found":

cd /disk2/solr/solr-5.3/server/solr
cp /disk2/solr/solr-5.3/example/exampledocs/monitor.xml .
curl http://127.0.0.1:8983/solr/solr/update --data-binary @monitor.xml -H 'Content-type:application/xml'

For the Nutch crawl to run, also edit /disk2/solr/solr-5.3/server/solr/solr/conf/schema.xml and add:

<field name="host" type="string" stored="false" indexed="true"/>
<field name="site" type="string" stored="false" indexed="true"/>
<field name="cache" type="string" stored="true" indexed="false"/>
<field name="digest" type="string" stored="true" indexed="false"/>
<field name="segment" type="string" stored="true" indexed="false"/>
<field name="boost" type="float" stored="true" indexed="false"/>
<field name="tstamp" type="date" stored="true" indexed="false"/>
<field name="stamp" type="date" stored="true" indexed="false"/>  
<field name="anchor" type="string" stored="true" indexed="true" multiValued="true"/>

HBase 1.0.1.1 (fully distributed); 0.98.14 also works.

Edit conf/hbase-site.xml.

Troubleshooting

ERROR solr.SolrIndexWriter - Missing SOLR URL. Should be set via -D solr.server.url

This is the Solr URL problem mentioned above: either the URL is wrong or Solr is not running.

IndexingJob: starting SolrIndexerJob: java.lang.RuntimeException: job failed:

Same cause as above.

java.lang.ClassNotFoundException: org.cloudera.htrace.Trace

Copy the htrace* jars from hbase/lib into hadoop/share/hadoop/mapreduce/.

Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/hadoop/hbase/HBaseConfiguration

Copy the hbase* jars from hbase/lib into hadoop/share/hadoop/mapreduce/.
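Both class-loading errors come down to jars missing from the MapReduce classpath directory, so a quick check before launching a job can save a failed run. The sketch below is an assumption-laden helper, not part of Hadoop or HBase: the name missing_jars and the HADOOP_HOME variable in the usage comment are made up for illustration.

```shell
# Print every prefix for which <dir> contains no jar whose file name
# starts with that prefix. Prints nothing when all prefixes are covered.
# Usage: missing_jars <dir> <prefix>...
missing_jars() {
  local dir="$1" prefix jar found
  shift
  for prefix in "$@"; do
    found=0
    for jar in "$dir"/"$prefix"*.jar; do
      if [ -e "$jar" ]; then
        found=1
        break
      fi
    done
    [ "$found" -eq 1 ] || echo "$prefix"
  done
}

# Hypothetical usage (the path is an assumption):
#   missing_jars "$HADOOP_HOME/share/hadoop/mapreduce" htrace hbase
```

An empty result means at least one jar exists for each prefix; it does not prove that the jar versions match the running HBase.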
