"It would be nice if there were an open-source search engine owned by the world."
http://nutch.sourceforge.net/blog/cutting.html
Spelling
http://www.nabble.com/Spelling-t569437.html#a1547313
I have it loaded on mozdex.com and it works fairly
well.
The only thing I noticed is that it seems to prefer longer
versions of a matching phrase over immediate common
mistakes.
For example, "diat pill" (which is a very common query)
comes up as "diatribe pill" instead of "diet pill" :)
But as my index grows, perhaps these will trickle out.
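To illustrate why "diet" is the suggestion one would expect for "diat", here is a minimal Python sketch that ranks candidates by Levenshtein edit distance. This is not Nutch's spell-check code; the function names and the three-word vocabulary are hypothetical, chosen only to show that a common one-letter mistake is much closer to "diet" than to the longer "diatribe".

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via the classic dynamic-programming recurrence."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete a character from a
                            curr[j - 1] + 1,      # insert a character into a
                            prev[j - 1] + cost))  # substitute (or match)
        prev = curr
    return prev[-1]


def suggest(word: str, vocabulary: list) -> str:
    # Pick the term with the smallest edit distance; break ties by shorter length.
    return min(vocabulary, key=lambda term: (edit_distance(word, term), len(term)))


vocabulary = ["diet", "diatribe", "pill"]  # hypothetical index terms
print(suggest("diat", vocabulary))  # prints "diet"
```

Under this ranking "diet" is one substitution away from "diat", while "diatribe" needs four insertions, so a suggester that returns "diatribe" is presumably weighting something else, such as a shared prefix.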
Use a PHP front-end with Nutch. Run Tomcat on the same box as PHP, then write a PHP search page that makes an HTTP call to "http://localhost/opensearch" for the actual search operation. Parse the resulting XML (RSS 2.0) with xml_parse_into_struct() and display the results. I'm using this setup on http://www.busytonight.com and it works great. --Matt
http://www.nabble.com/Replace-Tomcat-and-JSP-with-PHP-in-Nutch-How-Hard-is-It--t515663.html#a1398738
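The flow Matt describes (call the opensearch endpoint over HTTP, parse the RSS 2.0 response, display the hits) can be sketched as follows. This is a Python sketch of the same idea, not Matt's PHP page: the endpoint URL and the "query" parameter come from the notes here, while the function names and the sample feed are made up for illustration. Only the standard RSS 2.0 item fields (title, link, description) are read.

```python
import urllib.parse
import urllib.request
import xml.etree.ElementTree as ET


def build_search_url(query, base="http://localhost/opensearch"):
    # urlencode escapes spaces and '&' so the whole phrase stays in one parameter.
    return base + "?" + urllib.parse.urlencode({"query": query})


def parse_results(rss):
    """Extract title, link and description from each RSS 2.0 <item>."""
    root = ET.fromstring(rss)
    return [
        {
            "title": item.findtext("title", default=""),
            "link": item.findtext("link", default=""),
            "description": item.findtext("description", default=""),
        }
        for item in root.iter("item")
    ]


def search(query):
    # Needs a running Nutch web app at the assumed localhost address.
    with urllib.request.urlopen(build_search_url(query)) as response:
        return parse_results(response.read())


# Hand-written sample feed, so the parsing step can run without a server.
SAMPLE = """<?xml version="1.0"?>
<rss version="2.0">
  <channel>
    <title>Sample results</title>
    <item>
      <title>Example page</title>
      <link>http://example.com/</link>
      <description>An example hit.</description>
    </item>
  </channel>
</rss>"""

for hit in parse_results(SAMPLE):
    print(hit["title"], hit["link"])
```

Keeping the fetch and the parse in separate functions means the front-end page only depends on the RSS output, so the Tomcat side can be swapped or moved to another host by changing the base URL.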
Here's an example of a Nutch-based site that has both /search.jsp and /opensearch interfaces available. AFAIK, it accepts all of the same parameters that the stock Nutch setup accepts.
http://www.mozdex.com/search.jsp?query=miserable&failure
http://www.mozdex.com/opensearch?query=miserable&failure
Nutch 0.7 supports A9 OpenSearch RSS.
http://fisher.osu.edu/resources/search.html
http://jon.shoberg.net
Heritrix Crawler vs. Nutch Crawler
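Their main purposes differ. Heritrix is an "archival crawler": it aims to capture a complete, accurate, deep copy of a site's content, including images and other non-text content. It fetches and stores the relevant content, accepts whatever it finds, and does not modify page content. Re-crawling the same URL does not replace the earlier copy. The crawler is launched, monitored, and adjusted through a web user interface, which allows flexible definition of the URLs to fetch.
Differences between the two:
Nutch fetches and stores only indexable content. Heritrix takes everything and tries to preserve pages as they originally appeared.
Nutch may trim content or convert its format.
Nutch stores content in a database-optimized format for later indexing, and a refresh replaces the old content. Heritrix adds (appends) new content.
Nutch is run and controlled from the command line. Heritrix has a web-based control and management interface.
Nutch is not very customizable, although this has improved somewhat. Heritrix exposes more controllable parameters.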
Mozdex -- open source search engine based on Nutch
http://www.mozdex.com/
http://www.nutch.org/
Interesting reading:
http://www.linux.org/news/2004/04/09/0002.html
http://www.technewsworld.com/story/31653.html