1、frequently used commands are highlighted in yellow use “yourStataFile.dta“, clear load a dataset from the current directory import delimited“yourFile.csv“, /* */ rowrange(2:11) colrange(1:8) varnames(2) import a .csv filewebuse set “https:/ webuse “wb_indicators_long“ set web-based directory and loa
2、d data from the web import excel “yourSpreadsheet.xlsx“, /* */ sheet(“Sheet1“) cellrange(A2:H11) firstrow import an Excel spreadsheet Import Data sysuse auto, clear load system data (Auto data) for many examples, we use the auto dataset. display price4 display the 4th observation in price; only work
3、s on single values levelsof rep78 display the unique values for rep78 Explore Data duplicates report finds all duplicate values in each variable describe make price display variable type, format, and any value/variable labels ds, has(type string) lookfor “in.“ search for variable types, variable nam
4、e, or variable label isid mpg check if mpg uniquely identifies the data plot a histogram of the distribution of a variable count if price 5000 count number of rows (observations) Can be combined with logic VIEW DATA ORGANIZATION inspect mpg show histogram of data, number of missing or zero observati
5、ons summarize make price mpg print summary statistics (mean, stdev, min, max) for variables codebook make price overview of variable type, stats, number of missing/unique values SEE DATA DISTRIBUTION BROWSE OBSERVATIONS WITHIN THE DATA gsort price mpg gsort price mpg sort in order, first by price th
6、en miles per gallon (descending) (ascending) list make price if price 10000 needs to be done once packages contain extra commands that expand Statas toolkit underlined parts are shortcuts use “capture“ or “cap“ Ctrl D + highlight text in .do file, then ctrl + d executes it in the command line clear
7、delete data in memory Useful Shortcuts Ctrl 8 open the data editor + F2 describe data cls clear the console (where results are displayed) PgUp PgDn scroll through previous commands Tab autocompletes variable name after typing part AT COMMAND PROMPT Ctrl 9 open a new .do file + keyboard buttons Data
8、Processing Cheat Sheet with Stata 14.1 For more info see Statas reference manual () Tim Essam (tessamusaid.gov) Laura Hughes (lhughesusaid.gov) follow us StataRGIS and flaneuseks inspired by RStudios awesome Cheat Sheets ( updated January 2016 CC BY 4.0 geocenter.github.io/StataTraining Disclaimer:
9、we are not affiliated with Stata. But we like it.export delimited “myData.csv“, delimiter(“,“) replace export data as a comma-delimited file (.csv) export excel “myData.xls“, /*/ firstrow(variables) replace export data as an Excel file (.xls) with the variable names as the first row Save see also st
10、rupper, strproper display strmatch(“123.89“, “1?.?9“) return true (1) or false (0) if string matches pattern list make if regexm(make, “0-9“) list observations where make matches the regular expression (here, records that contain a number) FIND MATCHING STRINGS GET STRING PROPERTIES list if regexm(m
11、ake, “(Cad.|Chev.|Datsun)“) return all observations where make contains “Cad.“, “Chev.“ or “Datsun“ list if inlist(word(make, 1), “Cad.“, “Chev.“, “Datsun“) return all observations where the first word of the make variable contains the listed words compare the given list against the first word in ma
12、ke charlist make display the set of unique characters within a string * user-defined package replace make = subinstr(make, “Cad.“, “Cadillac“, 1) replace first occurrence of “Cad.“ with Cadillac in the make variable display length(“This string has 29 characters“) return the length of the string disp
13、lay substr(“Stata“, 3, 5) return the string located between characters 3-5 display strpos(“Stata“, “a“) return the position in Stata where a is first found display real(“100“) convert string to a numeric or missing value _merge code row only in ind2 row only in hh2 row in both 1 (master) 2 (using) 3
14、 (match) Combine Data ADDING (APPENDING) NEW DATA MERGING TWO DATASETS TOGETHER FUZZY MATCHING: COMBINING TWO DATASETS WITHOUT A COMMON ID merge 1:1 id using “ind_age.dta“ one-to-one merge of “ind_age.dta“ into the loaded dataset and create variable “_merge“ to track the origin webuse ind_age.dta, c
15、lear save ind_age.dta, replace webuse ind_ag.dta, clear merge m:1 hid using “hh2.dta“ many-to-one merge of “hh2.dta“ into the loaded dataset and create variable “_merge“ to track the origin webuse hh2.dta, clear save hh2.dta, replace webuse ind2.dta, clear append using “coffeeMaize2.dta“, gen(filenu
16、m) add observations from “coffeeMaize2.dta“ to current data and create variable “filenum“ to track the origin of each observation webuse coffeeMaize2.dta, clear save coffeeMaize2.dta, replace webuse coffeeMaize.dta, clear load demo data id blue pink + id blue pink id blue pink should contain the sam
17、e variables (columns) MANY-TO-ONE id blue pink id brown blue pink brown _merge 3 3 1 3 2 1 3 . . . . id + = ONE-TO-ONE id blue pink id brown blue pink brown id _merge 3 3 3 + = must contain a common variable (id) match records from different data sets using probabilistic matching reclink create dist
18、ance measure for similarity between two strings ssc install reclink ssc install jarowinkler jarowinkler Reshape Data webuse set https:/ webuse “coffeeMaize.dta“ load demo dataset xpose, clear varname transpose rows and columns of data, clearing the data and saving old column names as a new variable
19、called “_varname“ MELT DATA (WIDE LONG) reshape long coffee maize, i(country) j(year) convert a wide dataset to long reshape variables starting with coffee and maize unique id variable (key) create new variable which captures the info in the column names CAST DATA (LONG WIDE) reshape wide coffee mai
20、ze, i(country) j(year) convert a long dataset to wide create new variables named coffee2011, maize2012. what will be unique id variable (key) create new variables with the year added to the column name When datasets are tidy, they have a consistent, standard format that is easier to manipulate and a
21、nalyze. country coffee 2011 coffee 2012 maize 2011 maize 2012 Malawi Rwanda Uganda cast melt Rwanda Uganda Malawi Malawi Rwanda Uganda 2012 2011 2011 2012 2011 2012 year coffee maize country WIDE LONG (TIDY) TIDY DATASETS have each observation in its own row and each variable in its own column. new
22、variable Label Data label list list all labels within the dataset label define myLabel 0 “US“ 1 “Not US“ label values foreign myLabel define a label and apply it the values in foreign Value labels map string descriptions to numbers. They allow the underlying data to be numeric (making logical tests
23、simpler) while also connecting the values to human-understandable text. note: data note here place note in dataset Replace Parts of Data rename (rep78 foreign) (repairRecord carType) rename one or multiple variables CHANGE COLUMN NAMES recode price (0 / 5000 = 5000) change all prices less than 5000
24、to be $5,000 recode foreign (0 = 2 “US“)(1 = 1 “Not US“), gen(foreign2) change the values and value labels then store in a new variable, foreign2 CHANGE ROW VALUES useful for exporting data mvencode _all, mv(9999) replace missing values with the number 9999 for all variables mvdecode _all, mv(9999)
25、replace the number 9999 with missing value in all variables useful for cleaning survey datasets REPLACE MISSING VALUES replace price = 5000 if price 5000 replace all values of price that are less than $5,000 with 5000 Select Parts of Data (Subsetting) FILTER SPECIFIC ROWS drop in 1/4 drop if mpg 20
26、drop observations based on a condition (left) or rows 1-4 (right) keep in 1/30 opposite of drop; keep only rows 1-30 keep if inlist(make, “Honda Accord“, “Honda Civic“, “Subaru“) keep the specified values of make keep if inrange(price, 5000, 10000) keep values of price between $5,000 $10,000 (inclus
27、ive) sample 25 sample 25% of the observations in the dataset (use set seed # command for reproducible sampling) SELECT SPECIFIC COLUMNS drop make remove the make variable keep make price opposite of drop; keep only variables make and price Data Transformation Cheat Sheet with Stata 14.1 For more inf
28、o see Statas reference manual () Tim Essam (tessamusaid.gov) Laura Hughes (lhughesusaid.gov) follow us StataRGIS and flaneuseks inspired by RStudios awesome Cheat Sheets ( updated March 2016 CC BY 4.0 geocenter.github.io/StataTraining Disclaimer: we are not affiliated with Stata. But we like it.Data
29、 Visualization Cheat Sheet with Stata 14.1 For more info see Statas reference manual () Laura Hughes (lhughesusaid.gov) Tim Essam (tessamusaid.gov) follow us flaneuseks and StataRGIS inspired by RStudios awesome Cheat Sheets ( updated February 2016 CC BY 4.0 geocenter.github.io/StataTraining Disclai
30、mer: we are not affiliated with Stata. But we like it.graph y 1y 2 y nx in if, by(var) xline(xint) yline(yint) text(y x “annotation“) BASIC PLOT SYNTAX: plot size custom appearance save variables: y first plot-specific options facet annotations titles axes title(“title“) subtitle(“subtitle“) xtitle(
31、“x-axis title“) ytitle(“y axis title“) xscale(range(low high) log reverse off noline) yscale() scheme(s1mono) play(customTheme) xsize(5) ysize(4) saving(“myPlot.gph“, replace) CONTINUOUS DISCRETE (asis) (percent) (count) over(, ) cw missing nofill allcategories percentages stack bargap(#) intensity(
32、*#) yalternate xalternate graph hbar draws horizontal bar charts bar plot graph bar (count), over(foreign, gap(*0.5) intensity(*0.5) bin(#) width(#) density fraction frequency percent addlabels addlabopts() normal normopts() kdensity kdenopts() histogram histogram mpg, width(5) freq kdensity kdenopt
33、s(bwidth(5) main plot-specific options; see help for complete set bwidth kernel( normal normopts() smoothed histogram kdensity mpg, bwidth(3) (asis) (percent) (count) (stat: mean median sum min max .) over(, ) cw missing nofill allcategories percentages linegap(#) marker(#, ) linetype(dot | line | r
34、ectangle) dots() lines() rectangles() rwidth dot plot graph dot (mean) length headroom, over(foreign) m(1, ms(S) ssc install vioplot over(, ) nofill vertical horizontal obs kernel() bwidth(#) barwidth(#) dscale(#) ygap(#) ogap(#) density() bar() median() obsopts() violin plot vioplot price, over(for
35、eign) over(, ) missing allcategories intensity(*#) boxgap(#) medtype(line | line | marker) medline() medmarker() graph box draws vertical boxplots box plot graph hbox mpg, over(rep78, descending) by(foreign) missing graph hbar . bar plot graph bar (median) price, over(foreign) (asis) (percent) (coun
36、t) (stat: mean median sum min max .) over(, ) cw missing nofill allcategories percentages stack bargap(#) intensity(*#) yalternate xalternate graph hbar . grouped bar plot graph bar (percent), over(rep78) over(foreign) (asis) (percent) (count) over(, ) cw missing nofill allcategories percentages sta
37、ck bargap(#) intensity(*#) yalternate xalternate a b c sort cmissing(yes | no) vertical, horizontal base(#) line plot with area shading twoway area mpg price, sort(price) 17 2 10 23 20 jitter(#) jitterseed(#) sort cmissing(yes | no) connect() aweight() scatter plot with labelled values twoway scatte
38、r mpg weight, mlabel(mpg) jitter(#) jitterseed(#) sort connect() cmissing(yes | no) scatter plot with connected lines and symbols see also line twoway connected mpg price, sort(price) (sysuse nlswide1) twoway pcspike wage68 ttl_exp68 wage88 ttl_exp88 vertical, horizontal Parallel coordinates plot (s
39、ysuse nlswide1) twoway pccapsym wage68 ttl_exp68 wage88 ttl_exp88 vertical horizontal headlabel Slope/bump plot SUMMARY PLOTS twoway mband mpg weight | scatter mpg weight bands(#) plot median of the y values ssc install binscatter plot a single value (mean or median) for each x value medians nquanti
40、les(#) discrete controls() linetype(lfit | qfit | connect | none) aweight binscatter weight mpg, line(none) THREE VARIABLES mat() color() freq ssc install plotmatrix regress price mpg trunk weight length turn, nocons matrix regmat = e(V) plotmatrix, mat(regmat) color(green) heatmap TWO+ CONTINUOUS V
41、ARIABLES bwidth(#) mean noweight logit adjust calculate and plot lowess smoothing twoway lowess mpg weight | scatter mpg weight FITTING RESULTS level(#) stdp stdf nofit fitplot() ciplot() range(# #) n(#) atobs estopts() predopts() calculate and plot quadriatic fit to data with confidence intervals t
42、woway qfitci mpg weight, alwidth(none) | scatter mpg weight level(#) stdp stdf nofit fitplot() ciplot() range(# #) n(#) atobs estopts() predopts() calculate and plot linear fit to data with confidence intervals twoway lfitci mpg weight | scatter mpg weight REGRESSION RESULTS horizontal noci regress
43、mpg weight length turn margins, eyex(weight) at(weight = (1800(200)4800) marginsplot, noci Plot marginal effects of regression ssc install coefplot baselevels b() at() noci levels(#) keep() drop() rename() horizontal vertical generate() Plot regression coefficients regress price mpg headroom trunk l
44、ength turn coefplot, drop(_cons) xline(0) vertical, horizontal base(#) barwidth(#) bar plot twoway bar price rep78 vertical, horizontal base(#) dropped line plot twoway dropline mpg price in 1/5 twoway rarea length headroom price, sort vertical horizontal sort cmissing(yes | no) range plot (y 1 y 2
45、) with area shading vertical horizontal barwidth(#) mwidth msize() range plot (y 1 y 2 ) with bars twoway rbar length headroom price jitter(#) jitterseed(#) sort cmissing(yes | no) connect() aweight() scatter plot twoway scatter mpg weight, jitter(7) half jitter(#) jitterseed(#) diagonal aweights()
46、scatter plot of each combination of variables graph matrix mpg price weight, half y 3 y 2 y 1 dot plot twoway dot mpg rep78 vertical, horizontal base(#) ndots(#) dcolor() dfcolor() dlcolor() dsize() dsymbol() dlwidth() dotextend(yes | no) ONE VARIABLE sysuse auto, clear DISCRETE X, CONTINUOUS Y twow
47、ay contour mpg price weight, level(20) crule(intensity) ccuts(#s) levels(#) minmax crule(hue | chue| intensity) scolor() ecolor () ccolors() heatmap interp(thinplatespline | shepard | none) 3D contour plot vertical horizontal range plot (y 1 y 2 ) with capped lines twoway rcapsym length headroom price see also rcap Plot Placement SUPERIMPOSE graph twoway scatter mpg price in 27/74 | scatter mpg price /* */ if mpg 12000 in 27/74, mlabel(make) m(i) combine twoway plots using | scatter y3 y2 y1 x, marker(i o i) mlabel(var3 var2 var1) plot sev