Visualizing Data
Ben Fry

Beijing Cambridge Farnham Köln Paris Sebastopol Taipei Tokyo

Visualizing Data
by Ben Fry

Copyright 2008 Ben Fry. All rights reserved.
Printed in the United States of America.

Published by O'Reilly Media, Inc., 1005 Gravenstein Highway North, Sebastopol, CA 95472.

O'Reilly books may be purchased for educational, business, or sales promotional use. Online editions are also available for most titles. For more information, contact our corporate/institutional sales department: (800) 998-9938.

Editor: Andy Oram
Production Editor: Loranah Dimant
Copyeditor: Genevieve d'Entremont
Proofreader: Loranah Dimant
Indexer: Ellen Troutman Zaig
Cover Designer: Karen Montgomery
Interior Designer: David Futato
Illustrator: Jessamyn Read

Printing History:
December 2007: First Edition.

Nutshell Handbook, the Nutshell Handbook logo, and the O'Reilly logo are registered trademarks of O'Reilly Media, Inc. Visualizing Data, the image of an owl, and related trade dress are trademarks of O'Reilly Media, Inc.

Many of the designations used by manufacturers and sellers to distinguish their products are claimed as trademarks. Where those designations appear in this book, and O'Reilly Media, Inc. was aware of a trademark claim, the designations have been printed in caps or initial caps.

While every precaution has been taken in the preparation of this book, the publisher and author assume no responsibility for errors or omissions, or for damages resulting from the use of the information contained herein.

This book uses RepKover, a durable and flexible
lay-flat binding.

ISBN-10: 0-596-51455-7
ISBN-13: 978-0-596-51455-6

Table of Contents

Preface

1. The Seven Stages of Visualizing Data
    Why Data Display Requires Planning
    An Example
    Iteration and Combination
    Principles
    Onward

2. Getting Started with Processing
    Sketching with Processing
    Exporting and Distributing Your Work
    Examples and Reference
    Functions
    Sketching and Scripting
    Ready?

3. Mapping
    Drawing a Map
    Locations on a Map
    Data on a Map
    Using Your Own Data
    Next Steps

4. Time Series
    Milk, Tea, and Coffee (Acquire and Parse)
    Cleaning the Table (Filter and Mine)
    A Simple Plot (Represent and Refine)
    Labeling the Current Data Set (Refine and Interact)
    Drawing Axis Labels (Refine)
    Choosing a Proper Representation (Represent and Refine)
    Using Rollovers to Highlight Points (Interact)
    Ways to Connect Points (Refine)
    Text Labels As Tabbed Panes (Interact)
    Interpolation Between Data Sets (Interact)
    End of the Series

5. Connections and Correlations
    Changing Data Sources
    Problem Statement
    Preprocessing
    Using the Preprocessed Data (Acquire, Parse, Filter, Mine)
    Displaying the Results (Represent)
    Returning to the Question (Refine)
    Sophisticated Sorting: Using Salary As a Tiebreaker (Mine)
    Moving to Multiple Days (Interact)
    Smoothing Out the Interaction (Refine)
    Deployment Considerations (Acquire, Parse, Filter)

6. Scatterplot Maps
    Preprocessing
    Loading the Data (Acquire and Parse)
    Drawing a Scatterplot of Zip Codes (Mine and Represent)
    Highlighting Points While Typing (Refine and Interact)
    Show the Currently Selected Point (Refine)
    Progressively Dimming and Brightening Points (Refine)
    Zooming In (Interact)
    Changing How Points Are Drawn When Zooming (Refine)
    Deployment Issues (Acquire and Refine)
    Next Steps

7. Trees, Hierarchies, and Recursion
    Using Recursion to Build a Directory Tree
    Using a Queue to Load Asynchronously (Interact)
    An Introduction to Treemaps
    Which Files Are Using the Most Space?
    Viewing Folder Contents (Interact)
    Improving the Treemap Display (Refine)
    Flying Through Files (Interact)
    Next Steps

8. Networks and Graphs
    Simple Graph Demo
    A More Complicated Graph
    Approaching Network Problems
    Advanced Graph Example
    Mining Additional Information

9. Acquiring Data
    Where to Find Data
    Tools for Acquiring Data from the Internet
    Locating Files for Use with Processing
    Loading Text Data
    Dealing with Files and Folders
    Listing Files in a Folder
    Asynchronous Image Downloads
    Using openStream() As a Bridge to Java
    Dealing with Byte Arrays
    Advanced Web Techniques
    Using a Database
    Dealing with a Large Number of Files

10. Parsing Data
    Levels of Effort
    Tools for Gathering Clues
    Text Is Best
    Text Markup Languages
    Regular Expressions (regexps)
    Grammars and BNF Notation
    Compressed Data
    Vectors and Geometry
    Binary Data Formats
    Advanced Detective Work

11. Integrating Processing with Java
    Programming Modes
    Additional Source Files (Tabs)
    The Preprocessor
    API Structure
    Embedding PApplet into Java Applications
    Using Java Code in a Processing Sketch
    Using Libraries
    Building with the Source for processing.core

Bibliography

Index

Preface

When I show visualization projects to an audience, one of the most common questions is, “How do you do this?” Other books about data visualization do exist, but the most prominent ones are often collections of academic papers; in any case, few explain how to actually build representations. Books from the field of design that offer advice for creating visualizations see the field only in terms of static displays, ignoring the possibility of dynamic, software-based visualizations. A number spend most of their time dissecting what's wrong with given representations, sometimes providing solutions, but
more often not.

In this book, I wanted to offer something for people who want to get started building their own visualizations, something to use as a jumping-off point for more complicated work. I don't cover everything, but I've tried to provide enough background so that you'll know where to go next.

I wrote this book because I wanted to have a way to make the ideas from Computational Information Design, my Ph.D. dissertation, more accessible to a wider audience. More specifically, I wanted to see these ideas actually applied, rather than limited to an academic document on a shelf. My dissertation covered the
process of getting from data to understanding; in other words, from considering a pile of information to presenting it usefully, in a way that can be easily understood and interacted with. This process is covered in Chapter 1, and used throughout the book as a framework for working through visualizations.

Most of the examples in this book are written from scratch. Rather than relying on toolkits or libraries that produce charts or graphs, you learn how to create them using a little math, some lines and rectangles, and bits of text. Many readers may have tried some toolkits and found them lacking, particularly because they want to customize the display of their information. A tool that has generic uses will produce only generic displays, which can be disappointing if the displays do not suit your data set. Data can take many interesting forms that require unique types of display and interaction; this book aims to open up your imagination in ways that collections of bar and pie charts cannot.

This book uses Processing (http://processing.org), a simple programming environment and API that I co-developed with Casey Reas of UCLA. Processing's programming environment makes it easy to sit down and “sketch” code to produce visual images quickly. Once you outgrow the environment, it's possible to use a regular Java IDE to write Processing code because the API is based on Java. Processing is free to download and open source. It has been in development since 2001, and we've had about 100,000 people try it out in the last 12 months. Today Processing is used by tens of thousands of people for all manner of work. When I began writing this book, I debated which language and API to use. It could have been based on Java, but I realized I would have found myself re-implementing the Processing API to make things simple. It could have been based on ActionScript and Flash, but Flash is expensive to buy and tends to break down when dealing with larger data sets. Other scripting languages such as Python and Ruby are useful, but their execution speeds don't keep up with Java. In the end, Processing was the right combination of cost, ease of use, and execution speed.

The Audience for This Book

In the spring of 2007, I co-taught an Information Visualization course at Carnegie Mellon. Our 30 students ranged from a freshman in the art school to a Ph.D. candidate in computer science. In between were graduate students from the
School of Design and various other undergrads. Their skill levels were enormously varied, but that was less important than their level of curiosity, and students who were curious and willing to put in some work managed to overcome the technical difficulties (for the art and design students) or the visual demands (for those with an engineering background).

This book is targeted at a similar range of backgrounds, if less academic. I'm trying to address people who want to ask questions, play with data, and gain an understanding of how to communicate information to others. For instance, the book is for web designers
who want to build more complex visualizations than their tools will allow. It's also for software engineers who want to become adept at writing software that represents data, which calls on them to try out new skills, even if they have some background in building UIs. None of this is rocket science, but it isn't always obvious how to get started.

Fundamentally, this book is for people who have a data set, a curiosity to explore it, and an idea of what they want to communicate about it. The set of people who visualize data is growing extremely quickly as we deal with more and more information. Even more important, the audience has moved far beyond those who are experts in visualization. By making these ideas accessible to a wide range of people, we should see some truly amazing things in the next decade.

Background Information

Because the audience for this book includes both programmers and non-programmers, the material varies in complexity. Beginners should be able to pick it up and get through the first few chapters, but they may find themselves lost as we get into more complicated programming topics. If you're looking for a gentler introduction to programming with Processing, other books are available (including one written by Casey Reas and me) that are more suited to learning the concepts from scratch, though they don't cover the specifics of visualizing data. Chapters 1 through 4 can be understood by someone without any programming background, but the later chapters quickly become more difficult.

You'll be most successful with this book if you have some familiarity with writing code, whether it's Java, C++, or ActionScript. This is not an advanced text by any means, but a little background in writing code will go a long way toward understanding the concepts.

Overview of the Book

Chapter 1, The Seven Stages of Visualizing Data, covers the process for developing a useful visualization, from acquiring data to interacting with it. This is the framework we'll use as we attack problems in later chapters.

Chapter 2, Getting Started with Processing, is a basic introduction to the Processing environment and syntax. It provides
a bit of background on the structure of the API and the philosophy behind the project's development.

Chapters 3 through 8 cover example projects that get progressively more complicated.

Chapter 3, Mapping, plots data points on a map, our first introduction to reading data from the disk and representing it on
the screen.

Chapter 4, Time Series, covers several methods of plotting charts that represent how data changes over time.

Chapter 5, Connections and Correlations, is the first chapter that really delves into how we acquire and parse a data set. The example in this chapter reads data from the MLB.com web site
and produces an image correlating player salaries and team performance over the course of a baseball season. It's an in-depth example illustrating how to scrape data from a web site that lacks an official API. These techniques can be applied to many other projects, even if you're not interested in baseball.

Chapter 6, Scatterplot Maps, answers the question, “How do zip codes relate to geography?” by developing a project that allows users to progressively refine a U.S. map as they type a zip code.

Chapter 7, Trees, Hierarchies, and Recursion, discusses trees and hierarchies. It covers recursion, an important topic when dealing with tree structures, and treemaps, a useful representation for certain kinds of tree data.

Chapter 8, Networks and Graphs, is about networks of information, also called graphs. The first half discusses ways to produce a representation of connections between many nodes in a network, and the second half shows an example of doing the same with web site traffic data to see how a site is used over time. The latter project also covers how to integrate Processing with Eclipse, a Java IDE.

The last three chapters contain reference material, including more background and techniques for acquiring
and parsing data.

Chapter 9, Acquiring Data, is a kind of cookbook that covers all sorts of practical techniques, from reading data from files, to spoofing a web browser, to storing data in databases.

Chapter 10, Parsing Data, is also written in cookbook style, with examples that illustrate the detective work
involved in parsing data. Examples include parsing HTML tables, XML, compressed data, and SVG shapes. It even includes a basic example of watching a network connection to understand how an undocumented data protocol works.

Chapter 11, Integrating Processing with Java, covers the specifics of how the Processing API integrates with Java. It's more of an appendix aimed at advanced Java programmers who want to use the API with their own projects.

Safari Books Online

When you see a Safari Books Online icon on the cover of your favorite technology book, that means the book is available online through the O'Reilly Network Safari Bookshelf.

Safari offers a solution that's better than e-books. It's a virtual library that lets you easily search thousands of top tech books, cut and paste code samples, download chapters, and find quick answers when you need the most accurate, current information. Try it for free at http:/.

Acknowledgments

I'd first like to thank O'Reilly Media for taking on this book. I was initially put in touch with Steve Weiss, who met with me to discuss the book in the spring of 2006. Steve later put me in touch with the Cambridge office, where Mike Hendrickson became a champion for the book and worked to make sure that the contract happened. Tim O'Reilly's enthusiasm along the way helped seal it.

I owe a great deal to my editor, Andy Oram, and assistant editor, Isabel Kunkle. Without Andy's hard work and helpful suggestions, or Isabel's focus on our schedule, I might still be working on the outline for
Chapter 4. Thanks also to those who reviewed the draft manuscript: Brian DeLacey, Aidan Delaney, and Harry Hochheiser.

This book is based on ideas first developed as part of my doctoral work at the MIT Media Laboratory. For that I owe my advisor of six years, John Maeda, and my committee members, David Altshul