Juanzi Li (李涓子)
Knowledge Engineering Laboratory, Department of Computer Science, Tsinghua University

2 Outline
- Knowledge graph and technologies
- Big scholar knowledge base: AMiner II
- Knowledge graph building over enterprise data
- Conclusion

3 Vision of the future Web
- The Web (Web 1.0): connects information; Web of documents
- The Social Web (Web 2.0): connects people; Web of people
- The Semantic Web (Web 3.0): connects knowledge; Web of data
- The Ubiquitous Web: connects intelligence; Web of agents
- Axes: increasing connectivity; increasing knowledge and reasoning
- Agent Webs that know, learn, and reason as humans do

4 Bring structure to the meaningful content of Web pages
[Figure: agents working over annotated Web pages that are linked to an ontology]
The Semantic Web. Tim Berners-Lee, James Hendler, and Ora Lassila. Scientific American, 2001.

5 Philosophy of ontology
- Concept triangle (Ogden, Richards, 1923)
  [Figure: the form "Tank" stands for a referent and activates a concept; the concept relates to the referent]
- Ontology is the philosophical study of the nature of being, becoming, existence, or reality, as well as the basic categories of being and their relations. (Wikipedia)

6 Some knowledge graphs
[Figure: logos of existing knowledge graphs with their sizes. Names that survive in the text: Google KG, Google KB Core, WordNet, OpenIE (Reverb, OLLIE), NELL. Legend: Cs = concepts, Is = instances, Ps = properties, Ts = triples, Ls = languages.]
Sizes shown, in slide order (the mapping to individual graphs is not recoverable):
- 250 concepts, 4M instances, 6000 properties, 500 triples
- 350K Cs, 10M Is, 100 Ps, 120M Ts
- 15K Cs, 40M Is, 4000 Ps, 1B Ts
- 850K Cs, 8M Is, 70K Ps
- 15K Cs, 600M Is, 20B Ts
- 50M Ss, 50+ Ls, 262M Ts
- 7 Europe Ls, cross-lingual links

7 Our knowledge graph definition
- C (concepts): a group of objects with the same properties, e.g. cars, students, professors
- I (instances): an object that belongs to a concept, e.g. Peter is a student
- T (isA): subConceptOf, instanceOf
- P (properties): instance-attribute-value (AVP)
- T is taxonomy knowledge; P is factual knowledge

8 Knowledge graph technologies
- Manual KG building: WordNet, Cyc, HowNet
- Taxonomy knowledge learning
  - Learning from Wikipedia
  - Learning beyond Wikipedia
- Factual knowledge learning
  - Learning from Wikipedia
  - Learning beyond Wikipedia

9 Learning taxonomy knowledge from Wikipedia
- Category system in Wikipedia
  - The category system in Wikipedia as a conceptual network:
    PHILOSOPHY and BELIEF (deals-with?), PHILOSOPHY and HUMANITIES (isa), PHILOSOPHY and SCIENCE (isa)
- Advantages:
  - widely recognized concepts in human minds
  - large scale: millions of concepts and tens of millions of instances
  - large coverage
- Problems:
  - noisy categories created for different purposes
  - inconsistency: not formally well defined

10 Learning taxonomy knowledge from Wikipedia
- Using linguistic features of the isa relationship
  - syntactic parsing: head matching, modifier matching, singular/plural forms
  - lexico-syntactic patterns
- Using the structure of Wikipedia
  Deriving a Large Scale Taxonomy from Wikipedia. Ponzetto et al. AAAI 07.
- Using external high-quality isa resources: WordNet, HowNet, Cilin; YAGO (WWW 2007)
- isa relation validation using cross-lingual knowledge links: lore (AAAI 2014)

11 Learning taxonomy knowledge beyond Wikipedia
- Using Web sources: root concepts, search engine
  - Hearst patterns
  - Bootstrapping
  - Taxonomy induction (structural learning); domain-specific taxonomy building (EMNLP 2010, ACL 2014)
- Large-scale taxonomy building
  - Automatically generated from Web data
  - 1.6 billion web pages
  - Rich hierarchy of millions of concepts
  - Probabilistic knowledge base (SIGMOD 2012)
  - Probase: 2,653,872 concepts; 20,757,545 isa pairs
[Figure: example concepts (politicians, people, presidents) with scored instances. One list: George W. Bush 0.0117, Bill Clinton 0.0106, George H. W. Bush 0.0063, Hillary Clinton 0.0054. Another list: Bill Clinton 0.057, George H. W. Bush 0.021, George W. Bush 0.019.]

12 Factual knowledge learning
Approaches are grouped as supervised, semi-supervised, and unsupervised:
- From Wikipedia: semantic annotation; Semantifying Wikipedia (Kylin); cross-lingual IE (WikiCiKE)
- Beyond Wikipedia: distant supervision (Stanford); Coupled Semi-Supervised Learning (NELL); KnowItAll: TextRunner, WOE

13 Automatic semantic annotation
- Rule learning based approach: automatically learn annotation rules from the training data
- Classification based approach: identify the boundary of tags in instances using classification models
- Sequential labeling based approach: consider the dependencies between tags
- Constrained Hierarchical Conditional Random Fields
- And others

14 Learning factual knowledge beyond Wikipedia: Knowledge Vault

15 Learning factual knowledge beyond Wikipedia: Knowledge Vault
- Motivation
  - the new approach should automatically leverage already-cataloged knowledge to build prior models of fact correctness
- Framework
  - TXT: distant supervision
  - DOM: DOM tree structure features
  - TBL: table information
  - ANO: annotated tags in HTML pages
  - Priors: path ranking algorithm
  - Priors: neural network method

16 Learning factual knowledge beyond Wikipedia: Knowledge Vault

17 [Slide content not recoverable]

18 Outline
- Knowledge graph and technologies
- Big scholar knowledge base: AMiner II
- Knowledge graph building over enterprise data
- Conclusion

19
- Researcher profile extraction
- Expert finding
- Social network search
- Topic browser
- Conference analysis
- ArnetApp platform

20 Person Search
- Basic info
- Citation statistics
- Ego network
- Research interests

21 Expert Search
- Finding experts for "data mining"
- Demographics: gender, language, location, etc.
- Knowledge about "data mining"
- Similar authors

22 Conference Ranking

23 Reviewer Suggestion
- Interest matching
- COI avoidance
- Load balancing
- Forecast of review quality

24 Reviewer Suggestion

25 AMiner II (ArnetMiner)
- Academic Social Network Analysis and Mining system: AMiner (http://aminer.org)
- Online since 2006
- 38 million researcher profiles
- 76 million publications
- 241 million requests
- 12.35 terabytes of data
- 100K IP accesses from 170 countries per month
- 10% increase in visits per month
- Deep analysis, mining, and search

26 User Distribution
- 7.32 million IPs from 220 countries/regions
- Top 10 countries: 1. USA, 2. China, 3. Germany, 4. India, 5. UK, 6. Canada, 7. Japan, 8. Spain, 9. France, 10. Italy

27
Ruud Bolle
Office: 1S-D58
Letters: IBM T.J. Watson Research Center, P.O. Box 704, Yorktown Heights, NY 10598 USA
Packages: IBM T.J. Watson Research Center, 19 Skyline Drive, Hawthorne, NY 10532 USA
Email:

Ruud M. Bolle was born in Voorburg, The Netherlands. He received the Bachelors Degree in Analog Electronics in 1977 and the Masters Degree in Electrical Engineering in 1980, both from Delft University of Technology, Delft, The Netherlands. In 1983 he received the Masters Degree in Applied Mathematics and in 1984 the Ph.D. in Electrical Engineering from Brown University, Providence, Rhode Island. In 1984 he became a Research Staff Member at the IBM Thomas J. Watson Research Center in the Artificial Intelligence Department of the Computer Science Department. In 1988 he became manager of the newly formed Exploratory Computer Vision Group which is part of the Math Sciences Department.

Currently, his research interests are focused on video database indexing, video processing, visual human-computer interaction and biometrics applications.

Ruud M. Bolle is a Fellow of the IEEE and the AIPR. He is Area Editor of Computer Vision and Image Understanding and Associate Editor of Pattern Recognition. Ruud M. Bolle is a Member of the IBM Academy of Technology.

DBLP: Ruud Bolle
2006
50 Nalini K. Ratha, Jonathan Connell, Ruud M. Bolle, Sharat Chikkerur: Cancelable Biometrics: A Case Study in Fingerprints. ICPR (4) 2006: 370-373
49 Sharat Chikkerur, Sharath Pankanti, Alan Jea, Nalini K. Ratha, Ruud M. Bolle: Fingerprint Representation Using Localized Texture Features. ICPR (4) 2006: 521-524
48 Andrew Senior, Arun Hampapur, Ying-li Tian, Lisa Brown, Sharath Pankanti, Ruud M. Bolle: Appearance models for occlusion handling. Image Vision Comput. 24(11): 1233-1243 (2006)
2005
47 Ruud M. Bolle, Jonathan H. Connell, Sharath Pankanti, Nalini K. Ratha, Andrew W. Senior: The Relation between the ROC Curve and the CMC. AutoID 2005: 15-20
46 Sharat Chikkerur, Venu Govindaraju, Sharath Pankanti, Ruud M. Bolle, Nalini K. Ratha: Novel Approaches for Minutiae Verification in Fingerprint Images. WACV 2005: 111-116
...

Motivating Example
[Figure: the homepage and DBLP page are annotated with four regions (contact information, educational history, academic services, publications), and the extracted profile is shown next to them]

Extracted profile:
Name: Ruud Bolle
Homepage: http:/ ecvg/people/bolle.html
Photo
Position: Research Staff
Affiliation: IBM T.J. Watson Research Center
Address: IBM T.J. Watson Research Center, P.O. Box 704, Yorktown Heights, NY 10598 USA
Address: IBM T.J. Watson Research Center, 19 Skyline Drive, Hawthorne, NY 10532 USA
Email
Phduniv: Brown University
Phdmajor: Electrical Engineering
Phddate: 1984
Msuniv: Delft University of Technology
Msmajor: Electrical Engineering
Msdate: 1980
Msmajor: Applied Mathematics
Bsuniv: Delft University of Technology
Bsmajor: Analog Electronics
Bsdate: 1977
Research_Interest: video database indexing; video processing; visual human-computer interaction; biometrics applications

Publication 1#
  Title: Cancelable Biometrics: A Case Study in Fingerprints
  Venue: ICPR
  Date: 2006
  Start_page: 370
  End_page: 373
Publication 2#
  Title: Fingerprint Representation Using Localized Texture Features
  Venue: ICPR
  Date: 2006
  Start_page: 521
  End_page: 524
. . .
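
The profile fields in the motivating example can be sketched as a small record type. This is only an illustrative sketch: the class names `ResearcherProfile` and `Publication` and the helper `missing_fields` are assumptions made here, not AMiner's actual data model, and only a subset of the slide's fields is included.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Publication:
    """One publication record; field names follow the slide's schema."""
    title: str
    venue: str
    date: int
    start_page: int
    end_page: int


@dataclass
class ResearcherProfile:
    """Hypothetical container for an extracted profile (not AMiner's code)."""
    name: str
    position: Optional[str] = None
    affiliation: Optional[str] = None
    addresses: List[str] = field(default_factory=list)
    email: Optional[str] = None
    phduniv: Optional[str] = None
    phdmajor: Optional[str] = None
    phddate: Optional[int] = None
    research_interests: List[str] = field(default_factory=list)
    publications: List[Publication] = field(default_factory=list)

    def missing_fields(self) -> List[str]:
        """Return the scalar fields that the extractor left empty."""
        scalar = ["position", "affiliation", "email",
                  "phduniv", "phdmajor", "phddate"]
        return [name for name in scalar if getattr(self, name) is None]


# Values copied from the motivating example; the email is blank on the slide.
profile = ResearcherProfile(
    name="Ruud Bolle",
    position="Research Staff",
    affiliation="IBM T.J. Watson Research Center",
    phduniv="Brown University",
    phdmajor="Electrical Engineering",
    phddate=1984,
    research_interests=["video database indexing", "video processing"],
    publications=[
        Publication("Cancelable Biometrics: A Case Study in Fingerprints",
                    "ICPR", 2006, 370, 373),
    ],
)
print(profile.missing_fields())  # ['email']
```

Optional fields default to `None` so that a partially extracted profile is still a valid record, which matches the slide's example where the email slot is empty.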
[Figure: social network fragment around Ruud Bolle: coauthor edges through Publication #3 and Publication #5, with node attributes such as affiliation (UIUC) and position (Professor)]

28 Researcher Social Network Extraction
- Researcher (Person) attributes: Name, Homepage, Photo, Phone, Fax, Email, Address, Affiliation, Position, Research_Interest, Phduniv, Phdmajor, Phddate, Msuniv, Msmajor, Msdate, Bsuniv, Bsmajor, Bsdate
- Publication attributes: Title, Publication_venue, Start_page, End_page, Date
- Relations: Authored, Coauthor
- 70.60% of the researchers have at least one homepage or an introductory page
  - 85.6% from universities, 14.4% from companies
  - 71.9% are homepages, 28.1% are introductory pages
  - 60% are natural language text, 40% are in lists and tables

29 CRFs
[Figure: the tokens "He is a Professor at UIUC" as observations, each with candidate tags drawn from OTH, POS, AFF, ADR]
- Green nodes are hidden variables
- Purple nodes are observations

$$p(y \mid x) = \frac{1}{Z(x)} \exp\Big( \sum_{e \in E,\, j} \lambda_j\, t_j(e, y|_e, x) + \sum_{v \in V,\, k} \mu_k\, s_k(v, y|_v, x) \Big)$$

30 Processing Flow for Profiling
[Figure: tag lattice over the tokens of "Ruud M. Bolle is a Fellow of the IEEE", with candidate tags such as ALC, FUC, AMC, AUC, PRV, RPA, DEL, PSB]
- Input documents, e.g. "He obtained his BS in Computer Science in 1999." and "Ruud M. Bolle is a Fellow of the IEEE."
- Preprocessing: determine tokens (standard word, special word, image token, term, punctuation mark)
- Model learning (train): labeled data and feature definitions are used to learn a CRF model
- Tagging (test): a unified tagging model assigns tags and produces the tagging results

31 Profiling Results (5-fold cross validation)

Profiling Task   Unified   Unified_NT   SVM     Amilcare
Photo            89.11     88.64        88.86   31.62
Position         69.44     64.70        64.68   56.48
Affiliation      83.52     72.16        73.86   46.65
Phone            91.10     78.72        79.71   83.33
Fax              90.83     64.28        64.17   86.88
Email            80.35     75.47        79.37   78.70
Address          86.34     75.15        77.04   66.24
Bsuniv           67.38     57.56        59.54   47.17
Bsmajor          64.20     59.18        60.75   58.67
Bsdate           53.49     40.59        28.49   52.34
Msuniv           57.55     47.49        49.78   45.00
Msmajor          63.35     61.92        62.10   57.14
Msdate           48.96     41.27        30.07   56.00
Phduniv          63.73     53.11        57.01   59.42
Phdmajor         67.92     59.30        59.67   57.93
Phddate          57.75     42.49        41.44   61.19
Overall          83.37     72.09        73.57   62.30

32 Outline
- Knowledge graph and technologies
- Big scholar knowledge base: AMiner II
- Knowledge graph building over enterprise data
- Conclusion

33 Knowledge Graph over Enterprise Data
- Motivation
  - Current knowledge graph construction is mainly from two aspects: the Web and domains (examples shown: Science, Gene Ontology, LOD)
  - There is huge demand for knowledge graph building based on the internal data of enterprises

34 Knowledge Graph over Enterprise Data
- Building a knowledge graph based on mobile customer care documents
  - Document-parsing-based logical structure extraction
  - Heuristic table extraction
  - Hierarchical concept extraction
  - Iterative instance identification and property extraction
  - Evaluation: Throughout Performance

35 Knowledge Graph over Enterprise Data
- Evaluations
  - Document parsing evaluation
  - Table alignment evaluation via manual evaluation
  - Knowledge graph evaluation: coverage

36
- Domain data characteristics
- Problem to be solved
- Building pipeline
- Visualization and evaluation
  - On each process
  - Knowledge base evaluation
- Human interaction
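
As a closing illustration, the knowledge graph definition given earlier in the talk (concepts C, instances I, isA links T, properties P) can be sketched as a small data structure. This is a minimal sketch under stated assumptions: the class `KnowledgeGraph`, its method names, the "person" concept, and the "age" attribute are hypothetical and serve only to show how taxonomy knowledge (T) and factual knowledge (P) sit side by side.

```python
from dataclasses import dataclass, field
from typing import Set, Tuple


@dataclass
class KnowledgeGraph:
    """Hypothetical container following the talk's C / I / T / P definition."""
    concepts: Set[str] = field(default_factory=set)                  # C
    instances: Set[str] = field(default_factory=set)                 # I
    sub_concept_of: Set[Tuple[str, str]] = field(default_factory=set)  # T: (child, parent)
    instance_of: Set[Tuple[str, str]] = field(default_factory=set)     # T: (instance, concept)
    properties: Set[Tuple[str, str, str]] = field(default_factory=set)  # P: (instance, attribute, value)

    def add_sub_concept(self, child: str, parent: str) -> None:
        self.concepts.update({child, parent})
        self.sub_concept_of.add((child, parent))

    def add_instance(self, instance: str, concept: str) -> None:
        self.instances.add(instance)
        self.concepts.add(concept)
        self.instance_of.add((instance, concept))

    def add_property(self, instance: str, attribute: str, value: str) -> None:
        self.instances.add(instance)
        self.properties.add((instance, attribute, value))

    def ancestors(self, concept: str) -> Set[str]:
        """All concepts reachable from `concept` through subConceptOf links."""
        seen: Set[str] = set()
        stack = [concept]
        while stack:
            current = stack.pop()
            for child, parent in self.sub_concept_of:
                if child == current and parent not in seen:
                    seen.add(parent)
                    stack.append(parent)
        return seen

    def concepts_of(self, instance: str) -> Set[str]:
        """Direct concepts of an instance plus everything they inherit from."""
        direct = {c for i, c in self.instance_of if i == instance}
        result = set(direct)
        for concept in direct:
            result |= self.ancestors(concept)
        return result


kg = KnowledgeGraph()
kg.add_sub_concept("student", "person")   # "person" is an illustrative concept
kg.add_instance("Peter", "student")       # the slide's example: Peter is a student
kg.add_property("Peter", "age", "20")     # made-up attribute-value pair
print(sorted(kg.concepts_of("Peter")))    # ['person', 'student']
```

Keeping isA links and attribute-value triples in separate sets mirrors the talk's split between taxonomy knowledge learning and factual knowledge learning, and makes a coverage check over either part a simple set comparison.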