Search Engines and Search Engine Optimization (SEO) Experiment


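Experiment 3: Search Engines and SEO

I. Objectives

Study several common search engine algorithms, including web spider crawling strategies, Chinese word segmentation algorithms, web page main-text extraction algorithms, web page deduplication algorithms, PageRank, and MapReduce, and understand the basic principles behind them; apply the SEO techniques learned to optimize a web page.

II. Content

1. Study common web spider crawling strategies, such as the depth-first, breadth-first, page selection, revisit, and parallel strategies, and understand how they work;
2. Study at least two Chinese word segmentation algorithms and understand how they work;
3. Study at least two web page main-text extraction algorithms and understand how they work;
4. Study at least two web page deduplication algorithms and understand how they work;
5. Study Google's PageRank and MapReduce algorithms and understand how they work;
6. Apply the SEO techniques learned to the static home page of the website designed in Experiment 2. The following techniques must be used:
   (1) optimization of the page title (title);
   (2) selection and optimization of suitable keywords;
   (3) optimization of meta tags;
   (4) optimization of the site structure and URLs;
   (5) creation of a robots.txt file that forbids spiders from crawling the site's back-end pages;
   (6) optimization of internal links;
   (7) optimization of heading tags;
   (8) image optimization;
   (9) page slimming techniques.
7. Using any one programming language such as C++, C#, or Java, design and implement a simple web spider program. After the user enters keywords, sets the crawl depth, and gives an initial page URL, the program must search web pages and output the URL and title of each page that contains the keywords. [Note: item 7 is a supplementary experiment. It is not required of every student; those interested may implement the program on their own, and it does not count toward the lab report grade.]

III. Requirements

1. Study several common web spider crawling strategies and fill in the corresponding table; the table must be complete;
2. Study two Chinese word segmentation algorithms and fill in the corresponding table; the table must be complete;
3. Study two web page main-text extraction algorithms and fill in the corresponding table; the table must be complete;
4. Study two web page deduplication algorithms and fill in the corresponding table; the table must be complete;
5. Study the PageRank and MapReduce algorithms and fill in the corresponding table; the table must be complete;
6. Provide the interface and HTML code of the static home page after SEO has been applied, using as many of the SEO techniques learned as possible;
7. Copying large passages of existing text from the Internet is strictly forbidden. Explain the principles of each algorithm in your own words as far as possible, using diagrams where necessary;
8. Implement a simple web spider program in any programming language, and provide its complete source code and actual run results.

IV. Procedure

1. Using search engines and related literature, research and organize material on several common web spider crawling strategies, and fill in the corresponding table;
2. Using search engines and related literature, research and organize the basic principles of two Chinese word segmentation algorithms, and fill in the corresponding table;
3. Using search engines and related literature, research and organize the basic principles of two web page main-text extraction algorithms, and fill in the corresponding table;
4. Using search engines and related literature, research and organize the basic principles of two web page deduplication algorithms, and fill in the corresponding table;
5. Using search engines and related literature, research and organize the basic principles of the PageRank and MapReduce algorithms, and fill in the corresponding table;
6. Apply SEO to the static home page of the website designed in Experiment 2;
7. Using any programming language, design and implement a simple web spider program.

V. Report Requirements

1. Study several common web spider crawling strategies and fill in the following table (columns: strategy name, basic principle, reference):

Strategy: Depth-first strategy
Basic principle: Depth-first search was widely used in the early days of crawler development. Its goal is to reach the leaf nodes of the structure being searched (that is, HTML files that contain no hyperlinks). Once a hyperlink in an HTML file is selected, the linked HTML file is itself searched depth-first, so a single chain of links is searched completely before any of the remaining hyperlinks are searched. Depth-first search follows the hyperlinks in HTML files until it can go no deeper, then returns to an earlier HTML file and continues with the other hyperlinks in that file. When no hyperlinks remain to be selected, the search is finished.
Reference: Baidu Baike (百度百科), "Depth-first search": http:/

Strategy: Breadth-first strategy
Basic principle: Breadth-first search (BFS) is one of the simplest graph search algorithms and the prototype of many important graph algorithms. Dijkstra's single-source shortest path algorithm and Prim's minimum spanning tree algorithm both use ideas similar to breadth-first search. BFS is a blind search: its goal is to expand and examine all nodes of the graph systematically in order to find a result. In other words, it does not consider where the result is likely to be and searches the whole graph exhaustively until the result is found.
Reference: Baidu Baike (百度百科), "Breadth-first search": http:/

Strategy: Page selection strategy
Basic principle: For a search engine, searching every page on the Internet is almost impossible; even a world-famous search engine such as Google can search only about 30% of the pages on the Internet. There are two main reasons. The first is a bottleneck in crawling technology: a web crawler cannot traverse every page. The second is the limits of storage and processing technology. A web crawler therefore tries to collect important pages first, that is, it adopts a priority crawling strategy. The page selection strategy gives Web pages of higher importance and rank a higher crawl priority: the more important a page is, the earlier it should be fetched. In essence, it is a way for a crawler, under given conditions, to lock on quickly to the important information resources on the Internet that users widely care about. The prerequisite for this strategy is a correct evaluation of the importance of a Web page; the main indicators currently used include the PageRank value and the average link depth.
Reference: Li Zhiyi (李志义), 《网络爬虫的优化策略探略》 (on optimization strategies for web crawlers), Guangzhou, Guangdong 510631

Strategy: Revisit strategy
Basic principle:
(1) Set the revisit frequency according to the update frequency of the Web site. This matches real conditions and allows the crawler to be managed and used more effectively. For example, portal sites update and add information continually every day, so their pages are revisited on a daily or hourly cycle.
(2) Ignore the update frequency of the Web site and revisit pages already fetched at fixed intervals. The drawback is a high probability of fetching the same content again, which easily wastes resources.
(3) Provide personalized service based on the search engine developer's subjective evaluation of pages. Revisiting then requires the developer to evaluate subjectively how often the main sites update their pages, and personalized service can be provided as needed.
Reference: Li Zhiyi (李志义), 《网络爬虫的优化策略探略》, Guangzhou, Guangdong 510631

Strategy: Parallel strategy
Basic principle: The core of a parallel strategy is to assign each crawler's task sensibly while increasing the number of cooperating crawlers, so that different crawlers avoid fetching the same Web information as far as possible. Tasks are generally assigned in one of two ways. One divides tasks by the IP addresses of Web sites, so that a crawler only has to traverse the Web pages under a given group of addresses. The other assigns crawl tasks dynamically by the domain names of Web sites, so that each crawler collects the Web information within one or more domain segments.
Reference: Li Zhiyi (李志义), 《网络爬虫的优化策略探略》, Guangzhou, Guangdong 510631

2. Study two Chinese word segmentation algorithms and fill in the following table (columns: algorithm name, basic principle, reference):

Algorithm 1: Maximum matching algorithm
Basic principle: The maximum matching algorithm is a widely used mechanical segmentation method. It segments text using a word list and one basic rule for evaluating a segmentation, the "long words first" principle.
Reference: Zhang Yuru (张玉茹), 《中文分词算法之最大匹配算法的研究》 (a study of the maximum matching algorithm for Chinese word segmentation), Zhaoqing 526070

Algorithm 2: Dictionary-free segmentation algorithm
Basic principle: This is a segmentation algorithm based on the mutual information and t-test information between Chinese characters. A Chinese word can be understood as a stable combination of characters, so the more often several adjacent characters occur together in context, the more likely they are to form a word. On this basis, the concepts of mutual information and the t-test value (t-score) are introduced to express how tightly two Chinese characters are bound together. The method works as follows: for a string of Chinese characters, compute the mutual information and the t-test difference between characters, and group the characters with large mutual information and t-test difference into words. Its limitations are that it can only handle words of length 2, that character groups which co-occur frequently but are not words are often extracted, and that the computational cost for common words is high. On the other hand, it can recognize some new words and resolve some ambiguities. A mature segmentation system cannot rely on a single algorithm; it has to combine different algorithms, and in practice the segmentation scheme should be chosen to suit the specific situation.
Reference: Liu Hongzhi (刘红芝), Xuzhou Medical College Library, Xuzhou, Jiangsu 221004, 《中文分词技术的研究》 (research on Chinese word segmentation technology)

3. Study two web page main-text extraction algorithms and fill in the following table (columns: algorithm name, basic principle, reference):

Algorithm 1: Similarity-based main-text extraction for Chinese web pages
Basic principle: Main text appears in HTML source in one of two forms: with tag hints or without. For tagged text, the tags generally carry block information, table information, or font and color information, and block-based methods work well on it. Untagged main text, after processing, lies neither in a block nor in a table, so methods that first divide the page into blocks and then extract cannot reach the desired precision. In view of this, the cited paper proposes extracting the main text by similarity. The algorithm has two steps: first take the line of the page that contains the most Chinese characters, then use cosine similarity matching and tag similarity to extract the main text. Its most distinctive feature is that it avoids the block division step described above.
Reference: Xiong Ziqi (熊子奇), Zhang Hui (张晖), Lin Maosong (林茂松), School of Computer Science and Technology, Southwest University of Science and Technology, Mianyang, Sichuan 621010, 《基于相似度的中文网页正文提取算法》 (a similarity-based main-text extraction algorithm for Chinese web pages)

Algorithm 2: FFT-based main-text extraction (《基于 FFT 的网页正文提取算法研究与实现》)
Basic principle: Given the HTML source of a bottom-level web page, find the optimal main-text interval. For any string interval (b, e), [...]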
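The difference between the depth-first and breadth-first strategies described in the crawling-strategy table can be seen in how the crawl frontier is consumed. The following is a minimal sketch, not the report's own program: it walks a small in-memory link graph instead of fetching pages, and the class name `CrawlOrder`, the method `crawl`, and the page names are all hypothetical.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Toy crawl-order demo over an in-memory link graph (no network access). */
class CrawlOrder {

    /**
     * Returns the order in which pages are visited starting from {@code start}.
     * The frontier is a deque: taking pages from the front gives breadth-first
     * order, taking them from the back gives depth-first order.
     */
    static List<String> crawl(Map<String, List<String>> links, String start, boolean breadthFirst) {
        List<String> order = new ArrayList<>();
        Set<String> visited = new HashSet<>();
        Deque<String> frontier = new ArrayDeque<>();
        frontier.addLast(start);
        while (!frontier.isEmpty()) {
            String page = breadthFirst ? frontier.pollFirst() : frontier.pollLast();
            if (!visited.add(page)) {
                continue; // already visited through another link
            }
            order.add(page);
            List<String> out = links.getOrDefault(page, List.of());
            for (int i = 0; i < out.size(); i++) {
                // Depth-first pushes links in reverse so the first link is followed first.
                frontier.addLast(breadthFirst ? out.get(i) : out.get(out.size() - 1 - i));
            }
        }
        return order;
    }

    public static void main(String[] args) {
        // Hypothetical site: index links to a and b, a links to a1 and a2, b links to b1.
        Map<String, List<String>> links = Map.of(
                "index", List.of("a", "b"),
                "a", List.of("a1", "a2"),
                "b", List.of("b1"));
        System.out.println("BFS: " + crawl(links, "index", true));  // [index, a, b, a1, a2, b1]
        System.out.println("DFS: " + crawl(links, "index", false)); // [index, a, a1, a2, b, b1]
    }
}
```

Breadth-first visits every page one link away from the start before any page two links away, while depth-first follows the chain under `a` to its end before it returns to `b`.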

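The "long words first" rule of the maximum matching algorithm in the segmentation table can be sketched as forward maximum matching: at each position, take the longest dictionary word that starts there, and fall back to a single character when nothing matches. This is an illustrative sketch under assumptions, not the cited paper's implementation. The class name `MaxMatchSegmenter`, the tiny dictionary, and the unspaced English input (used here only so the example stays ASCII) are all hypothetical; the same loop applies to Chinese text with a Chinese word list.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

/** Forward maximum matching over a word list ("long words first"). */
class MaxMatchSegmenter {

    /**
     * Splits {@code text} greedily from left to right. At each position the
     * longest candidate (at most {@code maxWordLength} characters) is tried
     * first and shortened until it is in the dictionary or is one character.
     */
    static List<String> segment(String text, Set<String> dictionary, int maxWordLength) {
        List<String> words = new ArrayList<>();
        int pos = 0;
        while (pos < text.length()) {
            int end = Math.min(text.length(), pos + maxWordLength);
            while (end > pos + 1 && !dictionary.contains(text.substring(pos, end))) {
                end--; // shorten the candidate by one character
            }
            words.add(text.substring(pos, end));
            pos = end;
        }
        return words;
    }

    public static void main(String[] args) {
        Set<String> dictionary = Set.of("the", "theta", "table", "bled", "down", "own", "there");
        // Intended reading: the / table / down / there
        System.out.println(segment("thetabledownthere", dictionary, 5));
        // prints [theta, bled, own, there]
    }
}
```

The output shows the typical weakness of a purely mechanical method: because the longest match always wins, "theta" is taken first and the rest of the string is segmented wrongly, which is one reason the report notes that practical systems combine several algorithms.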
[...] 蓝天数码城 _ online store for professional gaming mice, headsets, and keyboards

7. Optional: provide the complete source code of the web spider program and a screenshot of its actual run results (the lab report must include both the source code and the screenshot).

    import javax.swing.*;
    import java.awt.*; // need this to access the color object

    /*
     * IntegerVerifier.java
     */

    /**
     * Input Verifier to verify integer text fields.
     * Checks for valid integer input, and to see if the number is between a
     * specified max and min value.
     * @author Mark Pendergast
     */
    public class IntegerVerifier extends javax.swing.InputVerifier {
        /** listener to get valid/invalid data reports */
        private VerifierListener listener = null;
        /** blank fields allowed, true for ok, false for error */
        private boolean blankOk = false;
        /** minimum valid value */
        int minValue = Integer.MIN_VALUE;
        /** maximum valid value */
        int maxValue = Integer.MAX_VALUE;

        /**
         * Creates a new instance of IntegerVerifier
         * @param alistener VerifierListener to receive invalid/valid data reports (null means no listener)
         * @param blankok if true, then the field can be left blank
         * @param min minimum valid value
         * @param max maximum valid value
         */
        public IntegerVerifier(VerifierListener alistener, boolean blankok, int min, int max) {
            listener = alistener;
            blankOk = blankok;
            minValue = min;
            maxValue = max;
        }

        /**
         * Verifies contents of the specified component
         * @param jComponent the component to check
         * @return true if the component is ok, else false
         */
        public boolean verify(javax.swing.JComponent jComponent) {
            JTextField thefield = (JTextField) jComponent;
            String input = thefield.getText();
            int number;
            input = input.trim(); // strip off leading and trailing spaces as these give Integer.parseInt problems
            if (input.length() <= 0 && blankOk) {
                if (listener != null)
                    listener.validData(jComponent);
                return true; // if empty, just return true
            } else if (input.length() <= 0) {
                return false; // blank input is not allowed
            }
            /* try to convert to an integer */
            try {
                number = Integer.parseInt(input);
            } catch (NumberFormatException e) {
                reportError(thefield, "You must enter a valid number");
                return false;
            }
            /* test if it's in the range */
            if (number < minValue || number > maxValue) {
                reportError(thefield, "You must enter a number between " + minValue + " and " + maxValue);
                return false;
            }
            /* report good data */
            thefield.setForeground(Color.black);
            thefield.setText("" + number); // reset what we converted into the component
            if (listener != null)
                listener.validData(jComponent);
            return true; // valid input found
        }

        /**
         * report error to the listener (if any)
         * @param thefield text field being checked
         * @param message error message to report
         */
        private void reportError(JTextField thefield, String message) {
            thefield.setForeground(Color.red); // paint the text red to flag invalid input
            if (listener != null)
                listener.invalidData(message, thefield);
        }
    }

    /*
     * Spider.java
     */
    import java.util.*;
    import java.io.*;
    import java.net.*;
    import javax.swing.*;
    import javax.swing.tree.*;
    import javax.swing.text.html.parser.*;
    import javax.swing.text.html.HTMLEditorKit.*;
    import javax.swing.text.html.*;
    import javax.swing.text.*;

    /**
     * Object used to search the web (or a subset of given domains) for a list of keywords
     * @author Mark Pendergast
     */
    public class Spider extends Thread {
        /** site visit limit (stops search at some point) */
        private int siteLimit = 100;
        /** search depth limit */
        private int depthLimit = 100;
        /** keyword list for search */
        private String[] keywordList;
        /** ip type list */
        private String[] ipDomainList;
        /** visited tree */
        private JTree searchTree = null;
        /** message JTextArea, place to post errors */
        private JTextArea messageArea;
        /** place to put search statistics */
        private JLabel statsLabel;
        /** keep track of web sites searched */
        private int sitesSearched = 0;
        /** keep track of web sites found with matching criteria */
        private int sitesFound = 0;
        /** starting site for the search */
        private String startSite;
        /** flag used to stop search */
        private boolean stopSearch = false;

        /**
         * Creates a new instance of Spider
         * @param atree JTree used to display the search space
         * @param amessagearea JTextArea used to display error/warning messages
         * @param astatlabel JLabel to display number of searched sites and hits
         * @param akeywordlist list of keywords to search for
         * @param aipdomainlist list of top level domains
         * @param asitelimit maximum number of web pages to look at
         * @param adepthlimit maximum number of levels down to search (controls recursion)
         * @param astartsite web site to use to start the search
         */
        public Spider(JTree atree, JTextArea amessagearea, JLabel astatlabel, String astartsite,
                      String[] akeywordlist, String[] aipdomainlist, int asitelimit, int adepthlimit) {
            searchTree = atree; // place to display search tree
            messageArea = amessagearea; // place to display error messages
            statsLabel = astatlabel; // place to put run statistics
            startSite = fixHref(astartsite);
            keywordList = new String[akeywordlist.length];
            for (int i = 0; i [...]

        [...] >= depthLimit)
                return true;
            else
                return false;
        }

        /**
         * add a node to the search tree
         * @param parentnode parent to add the new node under
         * @param newnode node to be added to the tree
         */
        private DefaultMutableTreeNode addNode(DefaultMutableTreeNode parentnode, UrlTreeNode newnode) {
            DefaultMutableTreeNode node = new DefaultMutableTreeNode(newnode);
            DefaultTreeModel treeModel = (DefaultTreeModel) searchTree.getModel(); // get our model
            int index = treeModel.getChildCount(parentnode); // how many children are there already?
            treeModel.insertNodeInto(node, parentnode, index); // add as last child
            TreePath tp = new TreePath(parentnode.getPath());
            searchTree.expandPath(tp); // make sure the user can see the node just added
            return node;
        }

        /**
         * determines if the given url is in one of the top level domains in the domain
         * search list
         * @param url url to be checked
         * @return true if it's ok, else false if url should be skipped
         */
        private boolean isDomainOk(URL url) {
            if (url.getProtocol().equals("file"))
                return true; // file protocol always ok
            String host = url.getHost();
            int lastdot = host.lastIndexOf(".");
            if (lastdot [...]
                    return true;
                if (ipDomainList[i].equalsIgnoreCase(domain))
                    return true;
            [...]
            return false;
        }

        /**
         * update statistics label
         */
        private void updateStats() {
            statsLabel.setText("Sites searched : " + sitesSearched + " Sites found : " + sitesFound);
        }

        /**
         * repairs a sloppy href, flips backwards /, adds missing /
         * @return repaired web page reference
         * @param href web site reference
         */
        public static String fixHref(String href) {
            String newhref = href.replace('\\', '/'); // fix sloppy web references
            int lastdot = newhref.lastIndexOf('.');
            int lastslash = newhref.lastIndexOf('/');
            if (lastslash > lastdot) {
                if (newhref.charAt(newhref.length() - 1) != '/')
                    newhref = newhref + "/"; // add on missing /
            }
            return newhref;
        }

        /**
         * recursive routine to search the web
         * @param parentnode parentnode in the search tree
         * @param urlstr web page address to search
         */
        public void searchWeb(DefaultMutableTreeNode parentnode, String urlstr) {
            if (urlHasBeenVisited(urlstr)) // have we been here?
                return; // yes, just return
            if (depthLimitExceeded(parentnode))
                return;
            if (sitesSearched > siteLimit)
                return;
            Thread.yield(); // allow the main program to run
            if (stopSearch)
                return;
            messageArea.append("Searching :" + urlstr + " \n");
            sitesSearched++;
            updateStats();
            //
            // now look in the file
            //
            try {
                URL url = new URL(urlstr); // create the url object from a string.
                String protocol = url.getProtocol(); // ask the url for its protocol
                if (!protocol.equalsIgnoreCase("http"))
                    return;
                String path = url.getPath(); // ask the url for its path
                int lastdot = path.lastIndexOf("."); // check for file extension
                if (lastdot > 0) {
                    String extension = path.substring(lastdot); // just the file extension
                    if (!extension.equalsIgnoreCase(".html"))
                        return; // skip everything but html files
                }
                if (!isDomainOk(url)) {
                    messageArea.append(" Skipping : " + urlstr + " not in domain list\n\n");
                    return;
                }
                UrlTreeNode newnode = new UrlTreeNode(url); // create the node
                InputStream in = url.openStream(); // ask the url object to create an input stream
                InputStreamReader isr = new InputStreamReader(in); // convert the stream to a reader.
                DefaultMutableTreeNode treenode = addNode(parentnode, newnode);
                SpiderParserCallback cb = new SpiderParserCallback(treenode); // create a callback object
                ParserDelegator pd = new ParserDelegator(); // create the delegator
                pd.parse(isr, cb, true); // parse the stream
                isr.close(); // close the stream
            } // end try
            catch (MalformedURLException ex) {
                messageArea.append(" Bad URL encountered : " + urlstr + "\n\n");
            } catch (IOException e) {
                messageArea.append(" IOException, could not access site : " + e.getMessage() + "\n\n");
            }
            Thread.yield();
            return;
        }

        /**
         * Stops the search.
         */
        public void stopSearch() {
            stopSearch = true;
        }

        /**
         * Inner class used to handle html parser callbacks
         */
        public class SpiderParserCallback extends HTMLEditorKit.ParserCallback {
            /** url node being parsed */
            private UrlTreeNode node;
            /** tree node */
            private DefaultMutableTreeNode treenode;
            /** contents of last text element */
            private String lastText = "";

            /**
             * Creates a new instance of SpiderParserCallback
             * @param atreenode search tree node that is being parsed
             */
            public SpiderParserCallback(DefaultMutableTreeNode atreenode) {
                treenode = atreenode;
                node = (UrlTreeNode) treenode.getUserObject();
            }

            /**
             * handle HTML tags that don't have a start and end tag
             * @param t HTML tag
             * @param a HTML attributes
             * @param pos Position within file
             */
            public void handleSimpleTag(HTML.Tag t, MutableAttributeSet a, int pos) {
                if (t.equals(HTML.Tag.IMG)) {
                    node.addImages(1);
                    return;
                }
                if (t.equals(HTML.Tag.BASE)) {
                    Object value = a.getAttribute(HTML.Attribute.HREF);
                    if (value != null)
                        node.setBase(fixHref(value.toString()));
                }
            }

            /**
             * take care of start tags
             * @param t HTML tag
             * @param a HTML attributes
             * @param pos Position within file
             */
            public void handleStartTag(HTML.Tag t, MutableAttributeSet a, int pos) {
                if (t.equals(HTML.Tag.TITLE)) {
                    lastText = "";
                    return;
                }
                if (t.equals(HTML.Tag.A)) {
                    Object value = a.getAttribute(HTML.Attribute.HREF);
                    if (value != null) {
                        node.addLinks(1);
                        String href = value.toString();
                        href = fixHref(href);
                        try {
                            URL referencedURL = new URL(node.getBase(), href);
                            searchWeb(treenode, referencedURL.getProtocol() + "://"
                                    + referencedURL.getHost() + referencedURL.getPath());
                        } catch (MalformedURLException e) {
                            messageArea.append(" Bad URL encountered : " + href + "\n\n");
                            return;
                        }
                    }
                }
            }

            /**
             * take care of end tags
             * @param t HTML tag
             * @param pos Position within file
             */
            public void handleEndTag(HTML.Tag t, int pos) {
                if (t.equals(HTML.Tag.TITLE) [...]
                    DefaultTreeModel tm = (DefaultTreeModel) searchTree.getModel();
                    tm.nodeChanged(treenode);
                }
            }

            /**
             * take care of text between tags, check against keyword list for matches, if
             * match found, set the node match status to true
             * @param data Text between tags
             * @param pos position of text within web page
             */
            public void handleText(char[] data, int pos) {
                lastText = new String(data);
                node.addChars(lastText.length());
                String text = lastText.toUpperCase();
                for (int i = 0; i < keywordList.length; i++) {
                    if (text.indexOf(keywordList[i]) >= 0) {
                        if (!node.isMatch()) {
                            sitesFound++;
                            updateStats();
                        }
                        node.setMatch(keywordList[i]);
                        return;
                    }
                }
            }
        }
    }

    /*
     * SpiderControl.java
     */
    import java.awt.*;
    import javax.swing.*;
    import javax.swing.tree.*;
    import java.util.*;
    import java.net.*;

    /**
     * User interface to conduct web searches with the Spider object
     * @author Mark Pendergast
     */
    public class SpiderControl extends javax.swing.JFrame implements VerifierListener {
        /** Creates new form SpiderControl */
        public SpiderControl() {
            initComponents();
            [...]
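The parsing step of the listing (a `ParserDelegator` driving an `HTMLEditorKit.ParserCallback`) can be tried on its own, without the Swing user interface or network access. The sketch below is an assumption-laden illustration rather than part of the original program: the class `LinkAndTitleExtractor`, its fields, and the sample HTML string are hypothetical, and it only collects the page title and the `href` values of anchor tags from an HTML string.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

/** Collects the title and the link targets of one HTML document. */
class LinkAndTitleExtractor extends HTMLEditorKit.ParserCallback {
    final List<String> links = new ArrayList<>();
    final StringBuilder title = new StringBuilder();
    private boolean inTitle = false;

    @Override
    public void handleStartTag(HTML.Tag t, MutableAttributeSet a, int pos) {
        if (t.equals(HTML.Tag.TITLE)) {
            inTitle = true;
        } else if (t.equals(HTML.Tag.A)) {
            Object href = a.getAttribute(HTML.Attribute.HREF);
            if (href != null) {
                links.add(href.toString()); // a crawler would queue this URL
            }
        }
    }

    @Override
    public void handleEndTag(HTML.Tag t, int pos) {
        if (t.equals(HTML.Tag.TITLE)) {
            inTitle = false;
        }
    }

    @Override
    public void handleText(char[] data, int pos) {
        if (inTitle) {
            title.append(data); // text between <title> and </title>
        }
    }

    /** Parses an HTML string and returns the filled-in callback. */
    static LinkAndTitleExtractor parse(String html) {
        LinkAndTitleExtractor callback = new LinkAndTitleExtractor();
        try {
            new ParserDelegator().parse(new StringReader(html), callback, true);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return callback;
    }

    public static void main(String[] args) {
        String html = "<html><head><title>Test page</title></head><body>"
                + "<a href=\"a.html\">A</a> <a href=\"b.html\">B</a></body></html>";
        LinkAndTitleExtractor result = parse(html);
        System.out.println("title = " + result.title);
        System.out.println("links = " + result.links);
    }
}
```

In a crawler built along the lines of the listing, each collected link would be resolved against the page's base URL and passed on to the next search step, and the title would be reported together with the URL when a keyword matches.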
