信息检索(Information.doc

Uploaded by: weiwoduzun  Document ID: 2414740  Upload date: 2018-09-15  Format: DOC  Pages: 2  Size: 19KB
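Abstract (translated from the Chinese original)

Information retrieval usually refers to text information retrieval. It covers the storage, organization, representation, querying, and access of information, and its core is the indexing and retrieval of text. Historically, information retrieval has passed through several stages: manual retrieval, computer-based retrieval, and today's networked, intelligent retrieval. The objects of retrieval have expanded from relatively closed, stable, and consistent content managed centrally by independent databases to Web content that is open, dynamic, rapidly updated, widely distributed, and loosely managed. The users of retrieval have likewise expanded from information professionals to the general public, including business people, managers, teachers and students, and professionals in many fields, who place higher and more varied demands on the efficiency and accuracy of retrieval. The recall and precision of existing information retrieval tools (search engines) are not high. To improve them, many techniques and algorithms have been proposed, with the aim of making retrieval tools more intelligent and more user-friendly.

This thesis studies how traditional information retrieval techniques are implemented and, drawing on existing web page classification techniques, carries out a fairly systematic study of intelligent information retrieval. On this basis it offers some thoughts and observations on Chinese word segmentation, web page indexing, web page feature extraction, and web page classification in classification-based intelligent information retrieval. The main work is as follows:

(1) Starting from the structural characteristics of web pages, the thesis analyzes which components of a web page contribute to the classification process. A simple yet efficient dictionary storage scheme is used, which greatly improves segmentation speed, while the segmentation results largely meet the requirements that web page classification places on Chinese word segmentation. Word-string statistics are used to raise the recognition rate of out-of-vocabulary words.
(2) Traditional feature extraction methods for Chinese and English text classification do not take into account the semantic relations between Chinese words (antonyms, near-synonyms, and synonyms). Besides considering these semantic relations, this thesis also extracts the web page title and lets it take part in feature word extraction, which makes the extraction more reasonable than traditional methods. Some improvements are also made to the CHI formula so that it better fits the feature representation of the Chinese Web.
(3) Existing web page classification methods are studied and, taking the characteristics of web pages into account, a feature weighting formula for web page classification is proposed on the basis of the traditional feature weighting formula.
(4) Web page indexing and search are discussed, and both techniques are implemented in code.
(5) On the basis of the above theory, a fairly complete classification-based retrieval system is built. It is implemented on the Windows operating system with the VC++ 6.0 development environment, and the experimental results are evaluated.

Keywords: information retrieval; Chinese word segmentation; feature extraction; inverted file; VSM model; KNN classification algorithm

Abstract (English version)

Information retrieval usually refers to text information retrieval, including the storage, organization, representation, search, and access of information. Its core is text indexing and retrieval technology. Historically, information retrieval has gone through manual retrieval, computer retrieval, and the current networked, intelligent retrieval. The objects of retrieval have expanded from relatively closed, stable, and consistent content managed centrally by independent databases to the open, dynamic, rapidly updated, widely distributed, and loosely managed content of the Web. The users have expanded from information professionals to include business people, managers, teachers and students, and professionals of various kinds, who place higher and more diverse demands on the efficiency and accuracy of information retrieval. The recall and precision of current retrieval systems (search engines) are low. To improve them, people have put forward various techniques and algorithms, with the aim of making information retrieval tools more user-friendly and more intelligent.

This paper combines existing web page classification technology with the study of traditional information retrieval technology and carries out comparatively systematic research on the intelligent search engine. On this basis it offers some thoughts and opinions on Chinese word segmentation, web page indexing, web page feature selection, and web page classification in classification-based intelligent information retrieval. The main work of the thesis is as follows:

(1) This paper first considers the structural features of web pages and analyzes the information components in web pages that contribute to classification. A comparatively advanced dictionary storage scheme is used, which greatly improves segmentation speed, and the accuracy of the segmentation results basically satisfies the needs of automatic classification of Chinese web pages. String-frequency statistics are adopted, which raises the probability of recognizing unrecorded (out-of-vocabulary) words.
(2) Traditional feature selection does not consider the semantic relations between Chinese words (antonyms, near-synonyms, synonyms). In this paper, besides considering semantic relations, I also extract the web page title and let it take part in feature word extraction, which makes the extraction more reasonable than the traditional method. Some improvements are also made to the CHI formula so that it better matches the feature representation of the Chinese Web.
(3) Existing web page classification methods are studied and, combined with the characteristics of web pages, a new web page feature weighting formula is proposed.
(4) Indexing and search of web pages.
(5) A comparatively complete classification retrieval system is constructed on the basis of the above theories, and the experimental results are evaluated.

Keywords: Information Retrieval; Chinese Word Segmentation; Feature Selection; Inverted File; Vector Space Model; KNN Categorization Algorithm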
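
Contribution (2) mentions the CHI formula, but the abstract does not state it or describe the improvements made to it. For orientation only, the sketch below computes the standard chi-square statistic commonly used for feature selection in text classification, not the thesis's modified version. The function names `chi_square` and `chi_scores`, the toy corpus, and the category labels are all assumptions introduced for illustration; documents are assumed to be already segmented into tokens.

```python
from collections import Counter


def chi_square(a, b, c, d):
    """Standard chi-square statistic for one term and one category.

    a: documents in the category that contain the term
    b: documents outside the category that contain the term
    c: documents in the category that do not contain the term
    d: documents outside the category that do not contain the term
    """
    n = a + b + c + d
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    if denom == 0:
        # Degenerate table (e.g. the term occurs in every document).
        return 0.0
    return n * (a * d - c * b) ** 2 / denom


def chi_scores(docs, labels, category):
    """Score every term against `category`.

    docs: list of token lists (already segmented, hypothetical input format)
    labels: list of category names, one per document
    """
    n_docs = len(docs)
    n_in_cat = sum(1 for y in labels if y == category)

    df_total = Counter()   # document frequency over the whole corpus
    df_in_cat = Counter()  # document frequency inside the category
    for tokens, y in zip(docs, labels):
        for term in set(tokens):  # count each term once per document
            df_total[term] += 1
            if y == category:
                df_in_cat[term] += 1

    scores = {}
    for term, df in df_total.items():
        a = df_in_cat[term]
        b = df - a
        c = n_in_cat - a
        d = n_docs - n_in_cat - b
        scores[term] = chi_square(a, b, c, d)
    return scores


if __name__ == "__main__":
    # Toy corpus, invented for this example.
    docs = [
        ["football", "match"],
        ["football", "goal"],
        ["stock", "market"],
        ["stock", "match"],
    ]
    labels = ["sports", "sports", "finance", "finance"]
    ranked = sorted(chi_scores(docs, labels, "sports").items(),
                    key=lambda kv: kv[1], reverse=True)
    for term, score in ranked:
        print(term, round(score, 3))
```

In this toy corpus "stock" scores as high for the "sports" category as "football" does, because the statistic measures dependence in either direction. That is one known limitation of the plain formula; whether it is among the issues the thesis's improvement addresses is not stated in the abstract.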
