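Web Mining: Introduction

Liu Jun (刘均)
Institute of System Architecture and Networks, School of Electronic and Information Engineering (电信学院)
Room 436, West Building 1 (西一436)
http:/

Course objectives
- Lay a foundation for in-depth research in Web Mining, Data Mining, Text Mining, and related fields; be able to apply the theory and techniques learned to practical Web Mining problems.
- Master the basic concepts of Web Mining; understand its background, the current state of research, research directions, and main application areas.
- Master the basic concepts and the more mature algorithms of Data Mining and Text Mining.
- Master the basic concepts and the more mature algorithms of Web Content Mining, Web Structure Mining, and Web Usage Mining, and develop the ability to analyze and apply them.

Course content
[Figure: Web structure mining, Web content mining, and Web log mining, built on data mining and text mining]

Course content and schedule
- Introduction (2 class hours)
- Data Mining and Text Mining: theory and techniques (20 class hours)
- Web Structure Mining (4 class hours)
- Web Content Mining (4 class hours)
- Web Usage Mining (4 class hours)
- Web Mining application examples (2 class hours)

Textbook and references
- Zheng Qinghua, Liu Jun, Tian Feng, et al. Web知识挖掘:理论、方法与应用 (Web Knowledge Mining: Theory, Methods, and Applications). Science Press (科学出版社), 2010.
- Soumen Chakrabarti. Mining the Web: Analysis of Hypertext and Semi Structured Data. Morgan Kaufmann, 2002.
- Jiawei Han, Micheline Kamber. 数据挖掘:概念与技术 (Data Mining: Concepts and Techniques). Translated by Fan Ming and Meng Xiaofeng. China Machine Press (机械工业出版社), 2001.

Exam and assignments
- Assessment: the weighted sum of the assignment grades.
- Assignment 1: experiment. Submit the experiment report, programs, data, etc. (60%, … people per group)
- Assignment 2: translation of technical literature. (40%, individual)
- Submission deadline: within the first two weeks of next semester.

Courseware download
- ftp://202.117.15.158 (user: web, password: web)

Positioning of the discipline
- Science: a worldview; understanding the world; a complete and rigorous system of knowledge.
- Technology: a methodology; changing the world.
- Web Mining (Data Mining) is a technology discipline.

Note on sources
- Some of the content in these slides is taken from slides and other material by colleagues in China and abroad.

Topics of this lecture
- Definition of Web Mining
- Background of Web Mining
- Classification
- Web Structure Mining, Web Content Mining, and Web Usage Mining: state of research and applications

Definition of Web Mining
- Web mining: the use of data mining techniques to automatically discover and extract information from Web documents and services (Etzioni, 1996).
- Data mining (knowledge discovery from databases): extraction of interesting (non-trivial, implicit, previously unknown, and potentially useful) patterns or knowledge from huge amounts of data.

Some DM tasks
- Classification: mining patterns that can classify future data into known classes.
- Association rule mining: mining any rule of the form X → Y, where X and Y are sets of data items.
- Clustering: identifying a set of similarity groups in the data.
- Sequential pattern mining: a sequential rule A → B says that event A will be followed by event B with a certain confidence.
- Deviation detection: discovering the most significant changes in data.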
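The association rule X → Y and the sequential rule A → B listed under "Some DM tasks" are both judged by how often they hold in the data. The sketch below is illustrative only: the function names and the page-visit data are assumptions, not part of the course material. It uses the textbook definitions of support and confidence, and a simplified sequential confidence (the share of sessions containing A in which B occurs later).

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = frozenset(itemset)
    hits = sum(1 for t in transactions if itemset <= set(t))
    return hits / len(transactions)


def confidence(transactions, x, y):
    """Confidence of the association rule X -> Y: support(X and Y) / support(X)."""
    support_x = support(transactions, x)
    if support_x == 0:
        return 0.0
    return support(transactions, set(x) | set(y)) / support_x


def sequential_confidence(sessions, a, b):
    """Share of sessions containing event `a` in which `b` occurs after the first `a`."""
    with_a = 0
    followed = 0
    for session in sessions:
        if a in session:
            with_a += 1
            first = session.index(a)
            if b in session[first + 1:]:
                followed += 1
    return followed / with_a if with_a else 0.0


# Hypothetical data: the set of pages visited in each of four user visits.
visits = [
    {"home", "products"},
    {"home", "products", "cart"},
    {"home", "about"},
    {"products", "cart"},
]
print(support(visits, {"home", "products"}))       # share of visits with both pages
print(confidence(visits, {"home"}, {"products"}))  # how often "home" implies "products"
```

Real association-rule miners such as Apriori avoid enumerating every itemset; this sketch only evaluates a rule that is already given.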
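Other definitions of Web Mining
- Jaideep Srivastava, borrowing from the definition of data mining, defines Web mining as "extracting interesting, potentially useful patterns and hidden information from Web documents and Web activity".
- Wikipedia: Web mining is defined as "the use of data mining techniques to discover patterns from the Web".

Definition of Web Mining: five aspects
- Information and knowledge
- A data analysis technique
- Supporting technologies: Data Mining (DM), Text Mining (TM), Multimedia Mining (MM)
- Goal: obtaining useful information or knowledge (rules, patterns, constraints)
- Data sources: Web documents and services
  - patterns and data entities hidden in semi-structured data
  - hyperlink relationships
  - Web logs

Definition of Web Mining: our definition
- Using data mining, text mining, machine learning, and related techniques to discover interesting, latent, and useful rules, patterns, and domain knowledge from Web page data, log data, and hyperlink relationships.

Information and Knowledge
- Information is data that has been organized into a meaningful context.
- Negentropy (负熵) versus entropy (熵): Schrödinger, "What Is Life?" (1944).
  - Negentropy is a measure of how ordered and organized a material system is.
  - Information is negentropy: a measure of the degree of order of a system.
  - Information is used to eliminate uncertainty.
- Knowledge is defined as re-usable information in a specific context.

Data pyramid
[Figure: pyramid with data at the base, information in the middle, and knowledge at the top]
- Data is a particular format for describing facts, concepts, or instructions in a computer.
- Data that has been given semantics is called information.
- Knowledge is information with a wider scope of applicability.
- Wisdom is the ability to form decisions by integrating past knowledge with new information.

Data Analysis Evolution
[Figure]

Confluence of Multiple Disciplines
[Figure: Web Mining at the confluence of Database, Information Retrieval, Data Mining / Text Mining, Natural Language Processing, Machine Learning, and the Web or Internet]
- Web mining research integrates research from several research communities (Kosala and Blockeel, July 2000).

How Web mining differs from data mining, information retrieval, and information extraction
- Web mining vs. data mining
  - The objects being mined differ: structured data versus unstructured or semi-structured data.
- Web mining vs. information retrieval
  - Information retrieval returns, from a given document collection, the documents relevant to a retrieval need.
  - It covers document modeling, classification, indexing, result ranking, and visualization; Web mining techniques are generally used for classification, indexing, and result ranking.
  - The results of information retrieval are often themselves the objects of Web mining.
- Web mining vs. information extraction
  - Information extraction pulls specific categories of information, such as metadata, out of given documents.
  - Extraction methods can build extraction patterns automatically or semi-automatically.
  - Information extraction can produce compressed versions of documents, which improves mining efficiency.

Background of Web Mining

History of the Web
- 1965: Ted Nelson proposed hypertext (non-sequential text that can be written and published), the idea developed in his "Literary Machines".
- Late 1960s: Doug Engelbart at SRI developed the oNLine System (NLS), software for the about-to-be ARPANET that allowed hyperlinking between files on different computers.
[Figure: Doug Engelbart and the mouse, 1965]
- 1989-90: Berners-Lee proposed a "global hypertext system".
  - The first Web server: nxoc01.cern.ch
  - Three supporting technologies: HTML (Hyper Text Markup Language) for linking information to information, URL (Uniform Resource Locator) for locating information, and HTTP (Hyper Text Transfer Protocol) for distributed information sharing.
- 10/90: Berners-Lee (TBL) wrote the first browser program and named it "World Wide Web".
- 8/91: the software was released on the Internet.
- 9/93: the "Mosaic" browser became available for the PC; Web traffic measured 1% of the traffic on the NSFnet backbone.

Development of Web technology
- Client side: technologies integrated into the Web browser, including HTML, Java, CSS (Cascading Style Sheets), DHTML (Dynamic HTML), and browser plug-ins. The trend is from static to dynamic.
- Server side: the same gradual move from static to dynamic.
  - NCSA: CGI (Common Gateway Interface)
  - From executable programs to scripts: the PHP (Personal Home Page Tools) language
  - Microsoft: ASP
  - Servlet and JSP, which combine the centralized processing of CGI programs with the HTML embedding of PHP
- Toward semantics: the Semantic Web is the next-generation way of organizing and expressing information on the Web. Building on the linking and sharing of today's Web resources, it uses XML and the RDF framework to describe and manage the semantics of Web data, so that machines can understand the data and integrate it into different applications, giving better support for human-machine collaboration.

The Web
- The numbers of Web sites and Web pages are growing exponentially.
- The number of Web sites grew from 1 in 1990 to more than 10^8 in 2006, a doubling period of only 23 weeks.
- The number of Web pages doubles every 6 months on average. Since 2004 it has been on the order of 10^10, with more than 8 million new pages added per day.
- The Deep Web holds 400 to 550 times as many pages as the publicly indexable Web (PIW), about 99.7% of all Web pages.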
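Motivation for Web Mining
- Information explosion: about 2 exabytes (10^18 bytes) of new information per year, and the volume keeps doubling.
- Acquiring knowledge versus acquiring information:
  - "We are buried in information, but looking for knowledge" (John Naisbitt).
  - "The greatest problem of today is how to teach people to ignore the irrelevant, how to refuse to know things, before they are suffocated. For too many facts are as bad as none at all." (W.H. Auden)
- Responses to the information explosion: DM, KDD, KDT, TM, MM.

Motivation for Web Mining (continued)
Mining interesting patterns and knowledge leads to
- better information and knowledge acquisition
- business intelligence
- more efficient organization of Web resources

Web Mining: The User-Centric View (Client-Side)
- discovery of documents on a subject
- discovery of semantically related documents or document segments
- extraction of relevant knowledge about a subject

Web Mining: The Owner-Centric View (Server-Side)
- targeted ads for goods, services, and products
- measuring the effectiveness of site content and structure
- providing dynamic personalized services or content

Challenges facing Web mining
The complexity of the data itself and the limits on how it can be obtained mean that Web mining faces challenges that traditional data mining does not.
- Web data is highly complex.
  - Heterogeneity: the objects being mined are heterogeneous, and so are the ways the information is organized.
  - Semi-structured nature
  - Dynamics
  - Noise: information on a page that is irrelevant to the mining application, and low-quality page content.
- Web data retrieval is limited.
  - Abundance problem. Raised by Professor Jon Kleinberg of Cornell University: the total amount of Web information is huge, but the part that interests any particular user is comparatively tiny, that is, "99% of Web information is useless to 99% of Web users". The abundance problem leads to severe information overload.
  - Limited coverage problem.
  - Limitations of the search interface. Mainstream search engines generally take keywords, or logical combinations of keywords, as the query. Such an interface can hardly express the user's search intent precisely, mainly because natural-language vocabulary exhibits polysemy (one word, many meanings) and synonymy (one meaning, many words). Examples: "病毒" (virus) and "电脑" (computer).
  - No personalized retrieval. Existing search engines present every user with the same undifferentiated search interface and result display.

Web Mining classification
[Figure: Web Mining taxonomy]
- Web Structure Mining
- Web Content Mining
  - Web Page Content Mining
  - Search Result Mining
- Web Usage Mining
  - General Access Pattern Tracking
  - Customized Usage Tracking

Web Mining classification: Web data
- Web pages
- Intra-page structures
- Inter-page structures
- Internet structures
- Usage data
  - Click stream
- Supplemental data
  - Profiles
  - Registration information
  - Cookies

Web Mining classification: basis (comprehensive information theory)
[Figure: outline of "comprehensive information theory" (全信息理论), with its three levels: pragmatic, semantic, and syntactic]

Web Content Mining: Web Page Content Mining
1. Web page summarization
2. Can identify information within given Web pages
   - Uses heuristics to distinguish personal home pages from other Web pages
   - Looks for product prices within Web pages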
Web Content Mining: Search Result Mining (Text Mining & Knowledge Discovery from Text)
- Search engine result summarization
- Clustering search results: categorizes documents using phrases in titles and snippets

Web Content Mining
Discovery of useful information from Web contents
- Information Retrieval view (Structured + Semi-Structured)
  - Assist and improve information finding
  - Filter information to users based on user profiles
- Database view
  - Model data on the Web
  - Integrate them for more sophisticated queries

Web Content Mining: Web contents
- Unstructured text documents
- Semi-structured HTML documents (hyperlinks)
- Textual, image, audio, video, and metadata content
- Multimedia data mining

Web Content Mining: what has been done?
1. Developing intelligent tools for IR
   - Finding keywords and key phrases
   - Discovering grammatical rules and collocations
   - Hypertext classification/categorization
   - Extracting key phrases from text documents
   - Hierarchical clustering
2. Developing Web query systems
   - Many applications, such as WebLog (Lakshmanan et al., 1996)
3. Mining multimedia data
   - Fayyad et al. (1996): mining images from satellites
   - Smyth et al. (1996): mining images to identify small volcanoes on Venus

Server-based approaches
- Multilevel databases: the main idea behind these proposals is that the lowest level of the database contains primitive semi-structured information stored in various Web repositories, such as hypertext documents. At the higher level(s), metadata or generalizations are extracted from the lower levels and organized in structured collections such as relational or object-oriented databases.
- Web query systems

A Multiple Layered Meta-Web Architecture
[Figure: Layer 0, Layer 1 (generalized descriptions), ..., Layer n (more generalized descriptions)]

Multilevel databases: Meta-Web
- Layer 0: the Web itself
- Layer 1: the lowest layer of the ML-Web
  - An entry: a Web page summary, including class, time, URL, contents, keywords, popularity, rank, links, etc.
- Layer 2 and up: progressively more generalized descriptions

Benefits of Multi-Layer Meta-Web
- Multi-dimensional Web information summary analysis
- Intelligent query
- Web high-level query (WebSQL)

Applications of Web Content Mining
- Information retrieval (search) on the Web
- Automated generation of topic hierarchies
- Knowledge bases
- Document classification
- Our own work

Applications of Web Content Mining: E-commerce
- Generate user profiles: improve customization and provide users with pages and advertisements of interest.
- Targeted advertising: ads are a major source of revenue for Web portals (e.g., Yahoo, Lycos) and e-commerce sites. Internet advertising is probably the "hottest" Web mining application today.
- Fraud: maintain a signature for each user based on buying patterns on the Web (e.g., amount spent, categories of items). If the buying pattern changes significantly, signal fraud.
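The fraud bullet above describes a per-user signature built from buying patterns. A minimal sketch of that idea follows, under stated assumptions: the signature is just the mean and standard deviation of past amounts plus the set of categories seen, and "changes significantly" is read as a z-score above a threshold or a category never seen before. The function names, the threshold of 3.0, and the sample purchases are all hypothetical, and the slides do not prescribe any particular test.

```python
from statistics import mean, pstdev


def build_signature(purchases):
    """Summarize a user's history; `purchases` is a non-empty list of (amount, category)."""
    amounts = [amount for amount, _ in purchases]
    return {
        "mean": mean(amounts),
        "std": pstdev(amounts),
        "categories": {category for _, category in purchases},
    }


def fraud_signals(signature, amount, category, z_threshold=3.0):
    """Return the reasons (possibly none) why a new purchase departs from the signature."""
    reasons = []
    std = signature["std"]
    # Amount check: distance from the user's mean, measured in standard deviations.
    if std > 0 and abs(amount - signature["mean"]) / std > z_threshold:
        reasons.append("unusual amount")
    # Category check: the user has never bought from this category before.
    if category not in signature["categories"]:
        reasons.append("new category")
    return reasons


history = [(20.0, "books"), (25.0, "books"), (22.0, "music"), (18.0, "books")]
signature = build_signature(history)
print(fraud_signals(signature, 23.0, "books"))
print(fraud_signals(signature, 900.0, "jewelry"))
```

Returning the list of reasons rather than a single yes/no keeps the decision rule (flag on any signal, or only when several coincide) outside the sketch.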
Web Structure Mining

Web Structure Mining
- Using links: use the interconnections between Web pages to give weight to pages.
- Using generalization of the Web: counters (popularity).

Internet and Web structures
- The fundamental difference between Internet and Web structures:
  - Internet structure is controlled by wiring.
  - Web structure is controlled by hyperlinks.
- Web structure
  - Structure of the hyperlinks within the Web itself
  - Structure of a page

Web Structure Mining: finding authoritative Web pages
- Retrieve pages that are not only relevant but also of high quality, or authoritative on the topic.
- Hyperlinks can be used to infer the notion of authority.
  - Hyperlinks contain an enormous amount of latent human annotation.
  - A hyperlink pointing to another Web page can be considered the author's endorsement of that page.

Web Structure Mining: PageRank and HITS
- PageRank
  - Stanford project
  - Lawrence Page, Sergey Brin, Rajeev Motwani, Terry Winograd. The PageRank Citation Ranking: Bringing Order to the Web.
  - Google
- HITS (Hyperlink-Induced Topic Search, also written Hypertext-Induced Topic Search)
  - Jon M. Kleinberg. Authoritative Sources in a Hyperlinked Environment. JACM 46(5): 604-632 (1999).
  - Developed by Jon Kleinberg while visiting IBM Almaden.
  - IBM expanded HITS into Clever.

Web Structure Mining: macro-level properties
- Mining the macro-level properties of the Internet, such as the scale-free property, the small-world property, and the bow-tie theory, and using them to improve the efficiency and quality of mining.

Applications of Web Structure Mining
- Guiding page crawling (collecting "high-quality" pages)
- Supporting result ranking (the PageRank or HITS algorithm)
- Finding related pages (Query By Examples)
- Determining Web impact factors
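The idea that a hyperlink is an endorsement, and that pages can be weighted by their interconnections, can be made concrete with a small PageRank computation. This is a minimal power-iteration sketch, not the production algorithm. It assumes a damping factor of 0.85, spreads the rank of pages without out-links uniformly over all pages, and runs on a made-up four-page graph. All names in it are illustrative.

```python
def pagerank(links, damping=0.85, tol=1e-10, max_iter=200):
    """Power iteration over a link graph given as {page: [pages it links to]}."""
    nodes = set(links)
    for targets in links.values():
        nodes.update(targets)
    nodes = sorted(nodes)
    n = len(nodes)
    rank = {u: 1.0 / n for u in nodes}

    for _ in range(max_iter):
        # Rank held by pages without out-links is redistributed evenly (an assumption).
        dangling = sum(rank[u] for u in nodes if not links.get(u))
        new_rank = {u: (1.0 - damping) / n + damping * dangling / n for u in nodes}
        for u in nodes:
            targets = links.get(u, [])
            if targets:
                share = damping * rank[u] / len(targets)
                for v in targets:
                    new_rank[v] += share  # each out-link passes on an equal share
        delta = sum(abs(new_rank[u] - rank[u]) for u in nodes)
        rank = new_rank
        if delta < tol:
            break
    return rank


# Hypothetical graph: pages a, b, and d all link to c; c links back to a.
graph = {"a": ["c"], "b": ["c"], "c": ["a"], "d": ["c"]}
for page, score in sorted(pagerank(graph).items(), key=lambda kv: -kv[1]):
    print(page, round(score, 4))
```

Page c collects the most endorsements and ends up with the highest score, and a ranks second because it is endorsed by the highly ranked c. HITS differs in that it computes two scores per page (hub and authority) on a query-specific subgraph.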
Web Usage Mining

Web Usage Mining: General Access Pattern Tracking
- Web log mining
- Uses DM techniques to understand general access patterns and trends.
- Can shed light on better structure and grouping of resource providers.

Web Usage Mining: Customized Usage Tracking
- Adaptive sites
- Analyzes the access patterns of one user at a time.
- The Web site restructures itself automatically by learning from user access patterns.

Web Usage Mining
- Web usage mining, also known as Web log mining, is the process of discovering interesting patterns in Web access logs.
- Commonly used approaches (Borges and Levene, 1999):
  - Map the log data into relational tables before an adapted data mining technique is applied.
  - Use the log data directly by utilizing special pre-processing techniques.

Web Usage Mining (continued)
- Perform data mining on Web log records.
- Find association patterns, sequential patterns, and trends of Web access.
- Conduct studies to analyze system performance and to improve system design through Web caching and Web page prefetching.

W3C Extended Log File Format
Server logs:
123.456.78.9 - - [24/Oct/1999:19:13:44 -0400] "GET /Images/tagline.gif HTTP/1.0" 200 1449 http:/ "Mozilla/4.51 [en] (Win98;I)"

Problems with Web Logs
- Identifying users
  - Clients may have multiple streams.
  - Clients may access the Web from multiple hosts.
  - Proxy servers: many clients, one address.
  - Proxy servers: one client, many addresses.
- Data not in the log
  - POST data (i.e., CGI requests) is not recorded.
  - Cookie data is stored elsewhere.
  - Pages may be cached.
  - Use of forward and backward pointers.

Applications of Web Usage Mining: System Improvement
1) Site improvement
- Adjust the link structure and content of a site's pages according to how real users browse it, so as to serve users better.
- The extreme case: adaptive Web sites.

Applications of Web Usage Mining: System Improvement
2) Caching and network transmission
- Example: the access records of a proxy reveal users' access patterns, which makes it possible to predict the pages users will request and so improve Web caching performance.

Applications of Web Usage Mining: Personalization
- Based on the user preferences discovered, dynamically tailor the content shown to each user or offer browsing suggestions.
- The most direct form: a recommender.
- Effects:
  1) makes searching and browsing easier for users
  2) makes advertising more effective
  3) promotes online sales
  4) increases user loyalty
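Most of the usage-mining tasks above start from the same pre-processing step: turning raw server-log lines into per-visitor sessions. The sketch below shows one common way to do it, with several assumptions that do not come from the slides: the log follows the Apache-style combined layout with a quoted referrer and user agent, a visitor is approximated by the (host, user agent) pair, and a gap of more than 30 minutes starts a new session. The regular expression, function names, and sample lines are illustrative.

```python
import re
from datetime import datetime, timedelta

# Assumed layout: host ident user [time] "METHOD path protocol" status size "referrer" "agent"
LOG_RE = re.compile(
    r'(?P<host>\S+) \S+ \S+ '
    r'\[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]*" '
    r'(?P<status>\d{3}) (?P<size>\d+|-)'
    r'(?: "(?P<referrer>[^"]*)" "(?P<agent>[^"]*)")?'
)


def parse_line(line):
    """Parse one log line into a dict, or return None if it does not match."""
    m = LOG_RE.match(line)
    if m is None:
        return None
    size = m.group("size")
    return {
        "host": m.group("host"),
        # %b expects English month abbreviations (the default C locale).
        "time": datetime.strptime(m.group("time"), "%d/%b/%Y:%H:%M:%S %z"),
        "path": m.group("path"),
        "status": int(m.group("status")),
        "size": 0 if size == "-" else int(size),
        "agent": m.group("agent"),
    }


def sessionize(entries, timeout=timedelta(minutes=30)):
    """Group entries into sessions per (host, agent); a long pause starts a new session."""
    sessions = {}
    for entry in sorted(entries, key=lambda e: e["time"]):
        visitor = (entry["host"], entry["agent"])
        visitor_sessions = sessions.setdefault(visitor, [])
        if visitor_sessions and entry["time"] - visitor_sessions[-1][-1]["time"] <= timeout:
            visitor_sessions[-1].append(entry)
        else:
            visitor_sessions.append([entry])
    return sessions


sample = [
    '192.0.2.1 - - [24/Oct/1999:19:13:44 -0400] "GET /index.html HTTP/1.0" 200 1449 '
    '"http://example.com/" "Mozilla/4.51 [en] (Win98;I)"',
    '192.0.2.1 - - [24/Oct/1999:20:30:00 -0400] "GET /news.html HTTP/1.0" 200 512 '
    '"http://example.com/" "Mozilla/4.51 [en] (Win98;I)"',
]
parsed = [entry for entry in map(parse_line, sample) if entry is not None]
for visitor, visits in sessionize(parsed).items():
    print(visitor, [[e["path"] for e in visit] for visit in visits])
```

The (host, user agent) key is only a heuristic. As the "Problems with Web Logs" slide notes, proxies can merge many clients behind one address or spread one client over several, and cached pages never reach the log at all.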
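Standards, specifications, and languages related to Web mining: data mining standards
- CRISP-DM (Cross Industry Standard Process for Data Mining). SPSS, NCR, and DaimlerChrysler, three companies with extensive data mining experience, founded a consortium to establish a standard for data mining methods and processes.

Standards, specifications, and languages related to Web mining
- PMML (Predictive Model Markup Language). Data mining applications often need several kinds of data mining software and algorithms to work together, which requires that mined models can be inherited, reused, and integrated easily. For this purpose the DMG (The Data Mining Group) proposed the PMML language. The latest PMML version cited here is 3.2, which supports 12 kinds of data mining models:
  - AssociationModel (association rules)
  - ClusteringModel (clustering)
  - GeneralRegressionModel (general regression)
  - MiningModel (ensemble model)
  - NaiveBayesModel (naive Bayes)
  - NeuralNetwork (neural network)
  - RegressionModel (linear, polynomial, and logarithmic regression)
  - RuleSetModel (rule set)
  - SequenceModel (sequential patterns)
  - SupportVectorMachineModel (support vector machine)
  - TextModel (text model)
  - TreeModel (decision tree)

Standards, specifications, and languages related to Web mining
- JDM (Java Data Mining API). Aims to provide a standard API for accessing data mining tools. It supports building and using data mining models and creating, storing, accessing, and maintaining data and metadata, so that Java applications can integrate data mining technology easily.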