ABSTRACT

Title of Document: USING HISTORICAL DATA FROM SOURCE CODE REVISION HISTORIES TO DETECT SOURCE CODE PROPERTIES

Chadd Creighton Williams, Doctor of Philosophy, 2006

Directed By: Professor Jeffrey K. Hollingsworth, Department of Computer Science

In this dissertation, we describe several techniques for using historical data mined from the source code revision histories of software projects to determine important properties of the source code. These properties are then used to improve the results of various bug-finding techniques as well as to provide documentation to the developer. We describe a method to mine source code revision histories, in this case CVS repositories, to extract relevant information to be fed into a static source code bug finder for use in improving the results generated by the bug finding tool. We apply this technique to the CVS repositories of two widely used
open source software projects, Apache httpd and Wine. We show how source code revision history can be used to reduce false positives from a static source code checker that identifies the misuse of values returned from a function call. We then discuss a method of mining source code revision histories to learn about project-specific idioms. Specifically, we show how source code revision history can be used to identify patterns of calling sequences that describe how functions in the software should be used in relation to each other. With this data, we are able to find bugs in the source code, document API usage, and identify refactoring events. In short, this dissertation shows that it is possible to automatically determine meaningful properties of the source code by studying the source code changes cataloged in the software revision history.

USING HISTORICAL DATA FROM SOURCE CODE REVISION HISTORIES TO DETECT SOURCE CODE PROPERTIES

By Chadd Creighton Williams

Dissertation submitted to the Faculty of the Graduate School of the University of Maryland, College Park, in partial fulfillment of the requirements for the degree of Doctor of Philosophy

2006

Advisory Committee:
Professor Jeffrey K. Hollingsworth, Chair
Professor Victor Basili
Assistant Professor Jeffrey Foster
Assistant Professor Michael Hicks
Professor Martin Loeb

Copyright by Chadd Creighton Williams 2006

Dedication

For my parents.

Acknowledgements

I first need to thank my advisor, Dr. Jeffrey
K. Hollingsworth. His immense patience and support made this work possible. Much of a graduate student's success is a reflection of the advisor, and I hope my work reflects well on Dr. Hollingsworth.

My advisory committee also deserves a round of thanks. They provided invaluable advice on how to improve my dissertation. Dr. Bill Pugh, a member of my proposal committee, also needs to be included in this group. His influence through my entire research career was second only to that of my advisor.

My friends here at Maryland and elsewhere deserve thanks for helping to keep me sane as I went through all the trials of graduate school. These people include my housemates through the years (all seven of them), my fellow Dyninst API workers, and anyone who visited my house so that I could cook chili or grill meat (nothing is more therapeutic). They are too numerous to mention here.

There are a number of fellow graduate students who deserve special mention. Mustafa Tikir, Brian Postow and Byran Buck all gave me wonderful advice, based on their experiences, on how to be a successful graduate student. Without their advice I'm sure I would have repeated all of their mistakes, rather than just a few of them. I hope that I have been able to offer similar advice to the newer, younger graduate students I have met over the last few years.

Finally, my family has been extremely supportive throughout my entire educational career. While they may have been mystified as to why it took so long for me to graduate (a confusion I shared at times), my parents and sister offered nothing but encouragement and much needed optimism. For this I am eternally grateful.

Table of Contents

Dedication
Acknowledgements
Table of Contents
List of Tables
List of Figures
Chapter 1: Introduction
Chapter 2: Related Work
  2.1 Static Analysis
  2.2 Software Revision History Mining
  2.3 Dynamic Analysis
Chapter 3: Infrastructure
  3.1 Interaction With CVS
    3.1.1 CVS Transactions
    3.1.2 CVS Data in a Database
    3.1.3 Moving a File or Directory in CVS
  3.2 Building Yesterday's Source Tree Today
    3.2.1 Running configure
    3.2.2 Preprocessing with GCC
    3.2.3 Versions of Software Support Tools
    3.2.4 Line Number Mapping
  3.3 Database for Results
  3.4 Computation Costs
Chapter 4: Function Return Value Checker
  4.1 Preliminary Mining of CVS
  4.2 Static Analysis Tool
    4.2.1 Function Return Value Checker
    4.2.2 Mining the Source Code Repository
    4.2.3 Ranking the Results
  4.3 Case Studies
    4.3.1 Evaluation of Results
    4.3.2 Apache Web Server Case Study
    4.3.3 Special Considerations
    4.3.4 Results for the Apache web server Case Study
    4.3.5 Wine Case Study
    4.3.6 Special Considerations
    4.3.7 Results for the Wine Case Study
  4.4 Effectiveness of using Mined Data
    4.4.1 Analysis of the Ranked Functions
    4.4.2 Where are the bugs in the rankings?
    4.4.3 Statistical Significance
    4.4.4 Precision
    4.4.5 Recall
    4.4.6 Cumulative False Positive Rate
  4.5 Threats to Validity
  4.6 Computation Costs
  4.7 Summary
Chapter 5: Function Usage Pattern Miner
  5.1 Function Usage Pattern Miner
    5.1.1 Control Flow Graphs
    5.1.2 Data Flow Information
  5.2 Mining the Repository
    5.2.1 Finding New Instances of Patterns
    5.2.2 Mining Rules from the Repository
  5.3 Case Studies
  5.4 Bug Finding
    5.4.1 Results
    5.4.2 Sources of False Positives
    5.4.3 Warning Browser
  5.5 Documenting API Usage
    5.5.1 Results
  5.6 Refactoring
    5.6.1 Results
  5.7 Threats to Validity
  5.8 Computation Costs
  5.9 Summary
Chapter 6: Conclusions
Appendix A
Appendix B
Bibliography

List of Tables

Table 1: Affect of Window Size on Number of CVS Transactions found
Table 2: Bugs Identified in the Apache Bug Database
Table 3: Bugs found the Apache Software Repository
Table 4: Warning Types
Table 5: Warnings and Likely Bugs for Apache
Table 6: Warnings Reported for Apache
Table 7: Warnings and Likely Bugs for Wine
Table 8: Warnings Reported for Wine
Table 9: Apache Chi-square Calculation
Table 10: Function Usage Pattern Statistics
Table 11: Threshold Values, Bug Finding
Table 12: Chi-square Calculation, Apache httpd Pattern Violations
Table 13: Function Usage Patterns, APR Locks, Apache
Table 14: SSL Library API, Recovered from Apache httpd source code
Table 15: Selected Function Usage Patterns from the wine/dlls/msi directory
Table 16: Function Usage Patterns, Chains, Wine Source Code
Table 17: Function Usage Patterns, Socket API, Wine
Table 18: Threshold Values

List of Figures

Figure 1: Example Return Value Check Bug
Figure 2: Division of Warnings, Apache
Figure 3: Division of Warnings, Wine
Figure 4: Precision in the Wine Case Study
Figure 5: Precision in the Wine Case Study, detail
Figure 6: Precision in the Apache Case Study
Figure 7: Precision in the Apache Case Study, detail
Figure 8: Recall in the Wine Case Study
Figure 9: Recall in Apache Case Study
Figure 10: False Positive Analysis in the Wine Case Study
Figure 11: False Positive Analysis in the Apache Case Study
Figure 12: Called After Pattern
Figure 13: Data Flow, Same Parameter
Figure 14: Data Flow, Produce/Consume
Figure 15: Data Flow, Update Same Variable
Figure 16: Change that Results in an Increase of ConfFirst
Figure 17: Student Project Bug, Function Usage Pattern partially in Dead Code
Figure 18: Apache httpd bug, Access data structure internals
Figure 19: Bug due to missing close call from the Wine Source Code
Figure 20: Bug due to missing InvalidateRect call from the Wine Source Code
Figure 22: Call Chain with Variable Middle Function Call
Figure 23: Function Usage Patterns for Chain Creation
Figure 24: Lock Sequence
Figure 25: Graph of PSDRV_WriteXXX Functions
Figure 26: Unix Socket API Usage

Chapter 1: Introduction

Source code revision repositories hold a wealth of information that is useful not only for managing and building source code, but also as a detailed log of how the source code has evolved during development. If a piece of the source code is refactored, evidence of this will be in the repository. The code describing how to use the software pre- and post-refactoring will exist in the repository. As bugs are fixed, the new code is stored in the repository alongside the buggy code. As new APIs are added to the source code, the proper way to use them is implicitly explained in the source code. The proper way to use an API or to invoke a function can be viewed as a rule or property describing how to use the source code that has been written. Understanding these rules is vital for the developers to produce correct code. Having
a documented set of properties also allows for the automatic analysis of the source code to determine where these properties may be violated. The challenge in dealing with these rules is that they evolve over time. Changes to the source code may change the rules by adding or removing functions, changing the implementation of a function, or rewriting a data structure. As the code evolves and new, system-specific rules develop detailing how to use internal functions, they are implicitly written into the source code, whether or not they are ever formally documented. The goal of this work is to determine if it is possible to identify these rules by analyzing the change history of the source code.

System-specific rules can be used in a number of ways. Using these rules to find bugs via static analysis has been very successful. However, the challenge is always to correctly document all the system-specific rules. This is usually left to an expert, a senior developer on the project, or a group of developers, each one familiar with a different subset of the code. This is an unsatisfactory way to derive these very important rules. Understanding the source code, how to use an API for
example, is another application of these rules. As software grows and changes, new rules evolve that the source code must obey. Unfortunately, few of these rules are formally documented, and new developers on the project must often discover them on their own or rely on instruction from veteran developers. However, these rules that guide the source code changes are implicitly defined in the source code. More importantly, they have been added to the software; hence they show up as a modification in the source code repository. Automatically extracting these properties and rules from source code modifications could allow projects to formally document these rules as time goes by. Alternatively, modifications to the source code could be checked for violations of these rules before a change is allowed to be added to the source code repository.

The thesis of this dissertation is that these properties can be extracted from software repositories and used to benefit the developers. To validate this thesis, we introduce techniques by which we examine the changes made to the source code and use this data to refine and enhance current bug finding techniques as well as to document the proper use of the source code. Each of the tools we implement uses the same technique to mine the source code for a particular property. The tools mine each version of each file to determine where a property is present in the file. We then identify which instances of the property in the later version of the file are new instances created by the source code change. The tools we produce track how and when new instances of these properties are created in the source code.

We first discuss the infrastructure we built to facilitate mining the source code repositories. This infrastructure deals with extracting the source code from the repository and storing it in a database, recovering valid source trees from the repository for each CVS transaction, and gathering the configuration information needed to parse the source files successfully. While this is a seemingly simple set of tasks, doing it automatically for source trees of open source projects that date back several years is quite a challenge. We must deal with changing operating system header files, buggy support tools, configure scripts written for specific (older) versions of the operating system, and changes to the structure
of the source tree. The goal of this infrastructure is to determine what the local source tree looked like when a developer made a commit to the repository. Being able to recreate this snapshot increases the chances that the source files will correctly run through our static analysis tools. Storing the results of mining each version of each file in the database allows many of our tools, or parts of our tools, to be built as database queries. This allows quick and easy access to the data.

We present our work in applying our techniques in two chapters of this dissertation. First, we study using the bug fix history gleaned from the source code repository to improve the results of a static bug finding tool. The bug fix history is used to rank the warnings found by the tool in an attempt to increase the number of likely bugs found at the top of the rankings. The particular bug fix we are looking for involves the return value check bug. The bug consists of instances where a function's return value is not checked against an error code before being used in the code. In C code, it is often the case that the return value of a function is either valid data or some type of error code. In the cases where the return value could be an error code, the return value should not be used without first determining that it is not the error code. Our mining tool examines the source code repository to determine where a bug of this type is fixed and which function produces the return value involved. The latest version of the code is then inspected for bugs of this type, and the warnings produced are sorted using the historical information mined from the repository. To evaluate this work, the precision of this sorted list is compared to the precision of a sorting that does not use historical information to rank the warnings. For both of the software projects we studied, we found that the sorting that uses historical information has better precision and that the difference is statistically significant.

The second property that we mine from the repositories is function usage patterns. T