5 - Search Engines: PPT Courseware (5-搜索引擎PPT课件.ppt)

Uploaded by: Facebook | Document ID: 3832506 | Uploaded: 2018-11-20 | Format: PPT | Pages: 29 | Size: 4.81 MB
Web Mining (5): Search Engine
杨光飞 (Yang Guangfei), Institute of Systems Engineering, Dalian University of Technology

Search Engine Overview
The search engine rankings for January 2012, according to comScore, were:
  Google grew to 66.2 percent (up from 65.9 percent in December 2011).
  Bing grew to 15.2 percent (up from 15.1 percent).
  Yahoo fell to 14.1 percent (down from 14.5 percent).
  Ask grew to 3 percent (up from 2.9 percent).
  AOL remained unchanged at 1.6 percent.

Search Engine Characteristics
Unedited: anyone can enter content. Quality issues; spam.
Varied information types: phone book, brochures, catalogs, dissertations, news reports, weather, all in one place!
Different kinds of users:
  Lexis-Nexis: paying, professional searchers
  Online catalogs: scholars searching scholarly literature
  Web: every type of person with every type of goal
Scale: hundreds of millions of searches per day; billions of docs.

Web Search Queries
Web search queries are short: 2.4 words on average (Aug 2000). This has increased; it was 1.7 in 1997.
User expectations: many say “The first item shown should be what I want to see!” This works if the user has the most popular/common notion in mind, not otherwise.

Standard Web Search Engine Architecture
[Figure, after Brin & Page 98. Labels: crawl the web; check for duplicates, store the documents; create an inverted index; inverted index; search engine servers; user query; DocIds; show results to user.]

Inverted Indexes
Inverted indexes are still used, even though the web is so huge.
Some systems partition the indexes across different machines; each machine handles different parts of the data.

Other systems duplicate the data across many machines; queries are distributed among the machines.
Most do a combination of these.

[Figure: grid of machines.] In this example, the data for the pages is partitioned across machines. Additionally, each partition is allocated multiple machines to handle the queries.
Each row can handle 120 queries per second.
Each column can handle 7M pages.
To handle more queries, add another row.

How Inverted Files Are Created
Periodically rebuilt, static otherwise.
Documents are parsed to extract tokens. These are saved with the document ID.
  Doc 1: Now is the time for all good men to come to the aid of their country
  Doc 2: It was a dark and stormy night in the country manor. The time was past midnight
After all documents have been parsed, the inverted file is sorted alphabetically.
Multiple term entries for a single document are merged. Within-document term frequency information is compiled.

Google's Indexing
The Indexer converts each doc into a collection of “hit lists” and puts these into “barrels”, sorted by docID. It also creates a database of “links”.
Hit
  Hit type: plain or fancy.
  Fancy hit: occurs in URL, title, anchor text, or meta tag.
  Optimized representation of hits (2 bytes each).
The Sorter sorts each barrel by wordID to create the inverted index. It also creates a lexicon file.
Lexicon: mostly cached in memory.
[Figure labels: Lexicon (in-memory); Postings (“inverted barrels”, on disk). Each “barrel” contains postings for a range of wordIDs.]

Google's Inverted Index
[Figure labels: Barrel i, Barrel i+1; barrels sorted by wordID; postings sorted by docID.]

Google
Sorted barrels = inverted index.
PageRank is computed from link structure and combined with IR rank.
IR rank depends on TF, type of “hit”, hit proximity, etc.
A billion documents.
A hundred million queries a day.
AND queries.

Link Analysis for Ranking Pages
Why does this work?
The official Toyota site will be linked to by lots of other official (or high-quality) sites.
The best Toyota fan-club site probably also has many links pointing to it.
Lower-quality sites do not have as many high-quality sites linking to them.

PageRank
Let $A_1, A_2, \ldots, A_n$ be the pages that point to page $A$. Let $C(P)$ be the number of links out of page $P$. The PageRank (PR) of page $A$ is defined as:

$$\mathrm{PR}(A) = (1-d) + d\left(\frac{\mathrm{PR}(A_1)}{C(A_1)} + \cdots + \frac{\mathrm{PR}(A_n)}{C(A_n)}\right)$$

PageRank is the principal eigenvector of the link matrix of the web. It can be computed as the fixpoint of the above equation.

PageRank: User Model
PageRanks form a probability distribution over web pages: the sum of all pages' ranks is one (this holds when the $(1-d)$ term is divided by the number of pages $N$; with the formula exactly as written above, the ranks sum to $N$).
User model: a “random surfer” selects a page and keeps clicking links (never “back”) until “bored”, then randomly selects another page and continues.
PageRank(A) is the probability that such a user visits A.
In the formula above, d is the probability that the surfer follows a link from the current page, so 1 - d is the probability of getting bored at a page.
Google computes the relevance of a page for a given search by first computing an IR relevance and then modifying that by taking into account PageRank for the top pages.

Google Trivia

Origin of the name
In 1938, the American mathematician Edward Kasner wanted to coin a word for the enormous number “10 to the power of 100”, so he asked his nine-year-old nephew Milton Sirotta for ideas. Little Milton thought hard for a few minutes and came up with the word “googol”, and uncle and nephew were both delighted.
A googol is $10^{100}$, written out as a 1 followed by 100 zeros.

On September 7, 1998, the 24-year-old Brin and the 25-year-old Page decided to start a company together; the only service it offered was a search engine.

Google's positioning, information processing and sharing, marked the start of a new era: information was becoming a resource like oil or steel, which also meant that its business model would not be the same as a software company's.

The Google search engine grew out of a research project that Larry Page and Sergey Brin carried out while studying at Stanford University. More precisely, they started the research in Page's modest dorm room and soon moved to a garage.

The two met in 1995 as computer science PhD students at Stanford. They developed a search engine based on precise analysis of the relationships between websites, which gave better results than the basic search techniques in use at the time. The project was called BackRub because the system examined backlinks to estimate a site's importance.

PageRank is part of Google's ranking algorithm (ranking formula). It is a method Google uses to mark a page's rank/importance, and the source describes it as the only standard Google uses to judge how good a website is. After combining all other factors, such as the Title tag and the Keywords tag, Google uses PageRank to adjust the results so that pages with a higher “rank/importance” move up in the search results, improving the relevance and quality of the results.

By solving an equation with more than 500 million variables and 2 billion terms, PageRank can assess a page's importance objectively. PageRank does not simply count direct links; it interprets a link from page A to page B as a vote cast by page A for page B, and assesses page B's importance from the votes it receives.
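The PageRank recurrence from the PageRank slide can be iterated to its fixpoint on a small graph. The following is a minimal sketch, not Google's implementation. It makes several assumptions that are not in the slides: the `(1 - d)` term is divided by the number of pages `N` so that the ranks sum to one, `d` defaults to 0.85, rank held by pages without out-links is spread evenly over all pages, and the names `pagerank` and `toy_web` and the four-page graph are purely illustrative.

```python
def pagerank(links, d=0.85, tol=1e-10, max_iter=200):
    """Iterate PR(A) = (1-d)/N + d * sum(PR(Ai)/C(Ai)) until it stops changing.

    links maps each page to the list of pages it links to.
    The (1-d)/N normalisation and d=0.85 are assumptions of this sketch.
    """
    pages = sorted(set(links) | {q for outs in links.values() for q in outs})
    n = len(pages)
    pr = {p: 1.0 / n for p in pages}  # start from a uniform distribution
    for _ in range(max_iter):
        # Assumption: rank of pages with no out-links is shared by all pages.
        dangling = sum(pr[p] for p in pages if not links.get(p))
        new = {p: (1.0 - d) / n + d * dangling / n for p in pages}
        for p in pages:
            outs = links.get(p, [])
            for q in outs:
                # Page p casts a vote for q, weighted by PR(p) / C(p).
                new[q] += d * pr[p] / len(outs)
        delta = sum(abs(new[p] - pr[p]) for p in pages)
        pr = new
        if delta < tol:
            break
    return pr


# Hypothetical four-page web: A links to B and C, B to C, C to A, D to C.
toy_web = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
ranks = pagerank(toy_web)
for page in sorted(ranks, key=ranks.get, reverse=True):
    print(page, round(ranks[page], 4))
```

In this toy graph, page C collects votes from three pages and ends up ranked highest, while page D, which nothing links to, keeps only the baseline `(1 - d) / N`. This mirrors the voting interpretation described above: a link counts for more when the page casting it is itself highly ranked and has few out-links.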

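99% of Google's worldwide revenue comes from its online advertising products, AdWords and AdSense. With Google AdSense, sites earn a revenue share by providing ad space to Google's advertisers. Because the sponsored-ad area in the right-hand column of Google's pages is small and cannot fully meet advertisers' needs, Google developed AdSense. Through this system, Google can place advertisers' ads on other small and medium-sized websites, and the sites that receive these ads earn a share of the ad revenue for displaying them or for attracting customers to download or register. Of course, Google has to make sure that the ads it places match the content of the site; it would not place a coffee ad on a software download site.

Google became the only company in Silicon Valley to use stock options to recruit a chef. For years, Google has given all employees three free meals a day, along with free medical care, dental care, haircuts, laundry, and drying services. In Google's offices, a dozen or so kinds of free chocolate candy and dozens of kinds of drinks are within easy reach, with pool tables, table football, and massage chairs scattered around. Employees can bring their dogs to work (cats are not allowed yet), and each employee gets 100 US dollars to decorate their own space (some bought models of the Star Wars robot R2D2 and Master Yoda, one hung a traffic light over their head, and another bought a British-style phone booth). Although Google's salaries are not top-tier for Silicon Valley, many of its employees still spend more than 12 hours a day in the office.

Google employees have 20% of their time free to work on their own projects. This is also a source of Google's innovative ideas: 50% of Google's products came out of this 20% free time. However, the author's research finds that Google's innovation actually lies not in “freedom” but in the “management” of innovation.

Thank You!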
