
Shangxuetang Technology (尚学堂科技) - Zhang Zhiyu (张志宇) - Lucene: Building a Simple Web Search Program.doc

Uploaded by: weiwoduzun   Document ID: 5716610   Uploaded: 2019-03-14   Format: DOC   Pages: 150   Size: 1.13 MB

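Lucene: Building a Simple Web Search Program

Software used
- lucene 2.3.2
- tomcat 6.0.16
- je-analysis 1.4.0
- lukeall 0.7.1
- MySQL JDBC driver 3.1.13
- Tidy 04aug2000r7
- MyEclipse 6.0M1_E3.3

Project duration: 3-4 days

Goals
- Introduction to Lucene
  - The concepts of full-text retrieval and the inverted index
  - Building an index
  - Searching
  - Implementing Chinese word segmentation
- Introduction to Nutch
- Tie together earlier topics: HTML, CSS, JavaScript, Servlet, JSP, MySQL
- Introduce the MVC concept
- Demonstrate borrowing mature JavaScript frameworks for special page effects, for example rico
- Learn to use MyEclipse
- Get familiar with using the MySQL database

When to use Lucene
- A database holding a large amount of data with long text fields
- Unstructured documents

1. Install MyEclipse and create the project

- Create a web project named lucene
- How to configure the Tomcat server (benefit: automatic deployment): Window > Show View > Servers
- How to deploy the web app: use the Deploy button and add the project to Tomcat
- Web browser window (Show View > Web Browser): better not to use this built-in browser
- Import the jars: create a lib directory under the lucene project folder and copy these jars into it
  - lucene-core-2.2.0.jar
  - Tidy.jar
  - lucene-2.2.0\lucene-2.2.0\contrib\analyzers\lucene-analyzers-2.2.0.jar
  - je-analysis-1.4.0.jar
  - mysql-connector-java-3.1.13-bin.jar
- Show line numbers
- The Alt+/ auto-completion shortcut does not work; the "." completion does not work

2. Index a single file (English)

- Confirm that lucene-core-2.2.0.jar has been imported
- The difference between Field.Store.YES and Field.Store.NO: YES stores the original value in the index so it can be read back from a search result; NO indexes the value without storing it
- termVector was added in Lucene 1.4.3; it provides a vector mechanism for fuzzy queries and is rarely used
- DateTools.timeToString

IndexHTML.java

    import java.io.File;

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.index.IndexWriter;

    public class IndexHTML {
        // Directory that will hold the index
        static String index = "D:\\share\\05_Servlet_JSP\\tomcat\\apache-tomcat-5.5.17\\index";
        // File to be indexed
        static String root = "D:\\share\\lucene\\soft\\lucene-2.2.0\\lucene-2.2.0\\docs\\api\\index.html";

        public static void main(String[] args) throws Exception {
            // true: create a new index, replacing any existing one
            IndexWriter writer = new IndexWriter(index, new StandardAnalyzer(), true);
            Document doc = new Document();
            File f = new File(root);
            // path: stored, and indexed as one untokenized term
            doc.add(new Field("path", f.getPath(), Field.Store.YES, Field.Index.UN_TOKENIZED));
            // content: tokenized and indexed, but not stored (fixed sample text)
            doc.add(new Field("content", "我们是共产主义接班人", Field.Store.NO, Field.Index.TOKENIZED));
            writer.addDocument(doc);
            writer.optimize();
            writer.close();
        }
    }

3. How do you confirm that the index was built correctly?

    java -jar lukeall-0.7.1.jar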
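Luke lists the terms held in the index, which is the inverted index named in the goals. As a rough illustration of that idea only, the sketch below builds a tiny in-memory map from each term to the ids of the documents containing it. It is not how Lucene stores its index; the class name `InvertedIndexSketch` and its methods are made up for this example, and the tokenizer only splits ASCII letters and digits (Chinese text would need a real analyzer such as je-analysis).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical teaching class, not part of Lucene.
class InvertedIndexSketch {
    // term -> sorted ids of the documents that contain the term
    private final Map<String, TreeSet<Integer>> postings = new TreeMap<String, TreeSet<Integer>>();
    private int nextDocId = 0;

    // Adds a document and returns the id assigned to it.
    int add(String text) {
        int docId = nextDocId++;
        // Naive tokenizer: lowercase, split on anything that is not a-z or 0-9.
        for (String term : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (term.isEmpty()) {
                continue;
            }
            TreeSet<Integer> ids = postings.get(term);
            if (ids == null) {
                ids = new TreeSet<Integer>();
                postings.put(term, ids);
            }
            ids.add(docId);
        }
        return docId;
    }

    // Returns the ids of all documents containing the term, in ascending order.
    List<Integer> search(String term) {
        TreeSet<Integer> ids = postings.get(term.toLowerCase(Locale.ROOT));
        return ids == null ? new ArrayList<Integer>() : new ArrayList<Integer>(ids);
    }

    public static void main(String[] args) {
        InvertedIndexSketch idx = new InvertedIndexSketch();
        idx.add("Lucene builds an inverted index");
        idx.add("Tomcat serves the search page");
        idx.add("The index maps terms to documents");
        System.out.println(idx.search("index")); // prints [0, 2]
    }
}
```

A lookup touches only the entry for the queried term, so it does not scan every document; that is the property that makes full-text search over long text fields practical.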

4. Tomcat configuration

- WEB-INF\lib: lucene-core-2.2.0.jar, je-analysis-1.4.0.jar
- Make sure port 8080 is available
- reloadable: C:\tomcat\conf\context.xml

5. Index files recursively

    import java.io.File;
    import java.io.FileNotFoundException;
    import java.io.FileReader;
    import java.io.IOException;

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.index.CorruptIndexException;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.store.LockObtainFailedException;

    public class IndexHTML1 {
        static IndexWriter writer;

        public static void main(String[] args) throws Exception {
            // Directory tree to index
            String root = "D:\\share\\01_J2SE\\soft\\html_zh_CN\\html\\zh_CN\\api\\java\\lang";
            // Directory that will hold the index
            String index = "D:\\share\\tools\\apache-tomcat-6.0.14\\apache-tomcat-6.0.14\\index_cn";
            writer = new IndexWriter(index, new StandardAnalyzer(), true);
            File f = new File(root);
            indexDocs(f);
            writer.optimize();
            writer.close();
        }

        private static void indexDocs(File f) throws Exception {
            if (f.isDirectory()) {
                File[] subs = f.listFiles();
                for (int i = 0; i < [...]

The later listing parses each HTML file with Tidy and indexes its title, summary, and content:

            [...]
            // Reads a single character.
            InputStreamReader ips = new InputStreamReader(new FileInputStream(f), "gb2312");
            // Adapter pattern
            InputStream is = new ReaderToInputStream(ips);
            org.w3c.dom.Document root = tidy.parseDOM(is, null);
            // Get the root element
            Element rawDoc = root.getDocumentElement();
            // Get the title content
            String title = getTitle(rawDoc);
            // Get the body content
            String body = getBody(rawDoc);
            System.out.println(title);
            doc.add(new Field("title", title, Field.Store.YES, Field.Index.TOKENIZED));
            // The summary is the first 200 characters of the body
            String summary = body;
            if (body.length() >= 200)
                summary = body.substring(0, 200);
            doc.add(new Field("summary", summary, Field.Store.YES, Field.Index.TOKENIZED));
            doc.add(new Field("content", body, Field.Store.YES, Field.Index.TOKENIZED));
            writer.addDocument(doc);
        }

        // Adapter
        public static class ReaderToInputStream extends InputStream {
            Reader reader;

            public ReaderToInputStream(Reader reader) {
                super();
                this.reader = reader;
            }

            @Override
            public int read() throws IOException {
                try {
                    return reader.read();
                } catch (IOException e) {
                    throw e;
                }
            }
        }

        // Get the content of the title tag
        protected static String getTitle(Element rawDoc) {
            if (rawDoc == null) {
                return "";
            }
            String title = "";
            NodeList children = rawDoc.getElementsByTagName("title");
            if (children.getLength() > 0) {
                Element titleElement = (Element) children.item(0);
                Text text = (Text) titleElement.getFirstChild();
                if (text != null) {
                    title = text.getData();
                }
            }
            return title;
        }

        // Get the content of the body tag
        protected static String getBody(Element rawDoc) {
            if (rawDoc == null) {
                return "";
            }
            String body = "";
            NodeList children = rawDoc.getElementsByTagName("body");
            if (children.getLength() > 0) {
                body = getText(children.item(0));
            }
            return body;
        }

        // Called recursively, because tags contain nested tags
        protected static String getText(Node node) {
            NodeList children = node.getChildNodes();
            StringBuffer sb = new StringBuffer();
            for (int i = 0; i < [...]
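The listing breaks off inside getText, so the loop body is not available. The sketch below is one plausible way to write such a recursive text extraction, together with the 200-character summary rule from the listing. It is an assumption, not the author's code: the class `DomTextSketch` and the helper `bodyText` are invented for the example, and the JDK's own XML parser stands in for Tidy, so the input must be well-formed XHTML.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

// Hypothetical helper class; names are not taken from the original listing.
class DomTextSketch {
    // Depth-first walk: append text nodes, recurse into child elements.
    static String getText(Node node) {
        StringBuilder sb = new StringBuilder();
        NodeList children = node.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            Node child = children.item(i);
            switch (child.getNodeType()) {
                case Node.TEXT_NODE:
                case Node.CDATA_SECTION_NODE:
                    sb.append(child.getNodeValue());
                    break;
                case Node.ELEMENT_NODE:
                    sb.append(getText(child)); // tags nested inside tags
                    break;
                default:
                    break; // comments and other node types are skipped
            }
        }
        return sb.toString();
    }

    // Assumption: well-formed XHTML parsed by the JDK parser instead of Tidy.
    static String bodyText(String xhtml) {
        try {
            Document dom = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xhtml.getBytes(StandardCharsets.UTF_8)));
            NodeList bodies = dom.getElementsByTagName("body");
            return bodies.getLength() > 0 ? getText(bodies.item(0)) : "";
        } catch (Exception e) {
            throw new IllegalStateException("could not parse input", e);
        }
    }

    // Same rule as the listing: keep at most the first 200 characters.
    static String summary(String body) {
        return body.length() >= 200 ? body.substring(0, 200) : body;
    }

    public static void main(String[] args) {
        String page = "<html><head><title>T</title></head>"
                + "<body><p>Hello <b>Lucene</b> world</p><!-- note --></body></html>";
        System.out.println(bodyText(page)); // prints Hello Lucene world
    }
}
```

Real-world HTML is rarely well-formed, which is why the original listing runs the file through Tidy first; the recursion over the resulting DOM would look the same.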

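Search form: keyword, results per page.

11. A simple result-handling page

- Copy lucene-core-2.3.2.jar into the corresponding Tomcat lib directory
- Copy je-analysis-1.4.0.jar into the corresponding Tomcat lib directory
- Mind upper and lower case in the address bar
- If a file is renamed and only the letter case changes, it is not updated automatically
- Do not put JSP pages in the WEB-INF folder
- For POST requests, do not forget request.setCharacterEncoding("GBK"); it must be called before the first parameter is read
- Make sure the index already exists; otherwise the search fails with java.io.FileNotFoundException:

    no segments* file found in org.apache.lucene.store.FSDirectory@D:\share\tools\apache-tomcat-6.0.16\apache-tomcat-6.0.16\index_cn111: files:

results.jsp
- Result columns: document, summary

12. Garbled Chinese characters in the output

13. Validating submitted form data

regcheckdata.js
- The JS file could not be saved: a file-encoding problem

    function check_empty(text) {
        marks = false;
        var nLengs = text.length;
        for (var i = 0; i < [...]

    function checkdata() {
        // alert(searchForm.query.value)
        if (!check_empty(searchForm.query.value)) {
            alert("不能为空"); // "must not be empty"
            return false;
        }
        return true;
    }

Search form: keyword, results per page.

14. Paging

- Optimization: encapsulate the paging logic in a separate class
- "Requerying at first glance seems a waste, but Lucene's blazing speed more than compensates." - Lucene in Action

results.jsp
- Page title: My JSP 'results.jsp' starting page
- Uses the same checkdata() function and search form (keyword, results per page) as above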
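The suggestion to move the paging logic into its own class can be sketched as follows. This is only an illustration under assumptions: the class `PageSketch`, its fields, and its method names are hypothetical and do not come from the course material. Pages are numbered from 1, and an out-of-range page request is clamped to the nearest valid page.

```java
// Hypothetical paging helper; not taken from the original results.jsp.
class PageSketch {
    final int totalHits; // number of hits returned by the search
    final int pageSize;  // the "results per page" value from the form
    final int page;      // current page, 1-based, clamped to a valid value

    PageSketch(int totalHits, int pageSize, int requestedPage) {
        if (pageSize <= 0) {
            throw new IllegalArgumentException("pageSize must be positive");
        }
        this.totalHits = Math.max(0, totalHits);
        this.pageSize = pageSize;
        // Clamp the request into [1, pageCount]; an empty result still has page 1.
        this.page = Math.min(Math.max(1, requestedPage), Math.max(1, pageCount()));
    }

    // Number of pages needed to show all hits (rounded up).
    int pageCount() {
        return (totalHits + pageSize - 1) / pageSize;
    }

    // Index of the first hit on this page (inclusive).
    int from() {
        return Math.min((page - 1) * pageSize, totalHits);
    }

    // Index just past the last hit on this page (exclusive).
    int to() {
        return Math.min(from() + pageSize, totalHits);
    }

    boolean hasPrevious() {
        return page > 1;
    }

    boolean hasNext() {
        return page < pageCount();
    }

    public static void main(String[] args) {
        PageSketch p = new PageSketch(45, 10, 3);
        System.out.println(p.from() + ".." + p.to() + " of " + p.totalHits); // prints 20..30 of 45
    }
}
```

In line with the Lucene in Action quote, the JSP would rerun the query on every request and then, with the Lucene 2.x Hits object, display only hits.doc(i) for i from from() up to but not including to().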
