Statistical Basics and Using Prism Software
仝鑫, 魏健, 2015-12

Contents
Basic statistics
Hypothesis testing (parametric and nonparametric tests)
t test, F test (analysis of variance), and their use in Prism
Linear regression and its use in Prism

The Gaussian Distribution
The Gaussian function describing this shape is defined as follows:
$f(x) = \frac{1}{\sigma\sqrt{2\pi}} \, e^{-\frac{(x-\mu)^2}{2\sigma^2}}$
where $\mu$ represents the population mean and $\sigma$ the standard deviation. Few biological distributions, if any, really follow the Gaussian distribution.

I. Basic Statistics

The Central Limit Theorem
If your samples are large enough, the distribution of means will follow a Gaussian distribution even if the population is not Gaussian. N = 10 or so is generally enough.

1. Descriptive Statistics (column statistics in Prism)

Measures of Location
A typical or central value that best describes the data (central tendency).
Mean
Median
Mode
Geometric mean

Measures of Dispersion
Describe the spread (variation) of the data around that central value.
Range
Variance
Standard deviation (SD)
Standard error of the mean (SEM), $SEM = SD/\sqrt{n}$
Coefficient of variation (CV)
Confidence interval (CI)

No single parameter can fully describe the distribution of data in the sample. Most statistics software will provide a comprehensive table describing the distribution.

Measures of Location: Mean
More commonly referred to as "the average". It is the sum of the data points divided by the number of data points.
Migration Assay: M = 76.78 microns ≈ 77 microns

Measures of Dispersion: Variance
Defined as the average of the squared distance of each value from the mean.
To calculate variance, first calculate the mean score, then measure the amount by which each score deviates from the mean. For a sample of $n$ values the sum of squared deviations is divided by $n - 1$ rather than $n$, so the formula for calculating variance is:
$s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}$

Measures of Dispersion: Standard Deviation
The most common and useful measure of dispersion. It tells you how tightly the samples are clustered around the mean. When the
samples are tightly bunched together, the Gaussian curve is narrow and the standard deviation is small. When the samples are spread apart, the Gaussian curve is flat and the standard deviation is large.
The formula to calculate standard deviation is:
$SD = \sqrt{\text{variance}}$

Standard Deviation (SD) and Standard Error of the Mean (SEM)
Standard deviation refers to the amount you expect an individual measurement to vary from the average. It measures how far the sample values are dispersed around the sample mean, reflects the variability between individuals, and indicates the precision of the data.
Standard error of the mean is how much you expect a value averaged from several measurements to vary from the true mean. It measures how far the sample mean is dispersed around the population mean, reflects the size of the sampling error, and indicates the precision of the result.

Should we show standard deviation or standard error?
Use standard deviation if the scatter is caused by biological variability and you want to show that variability. For example: you aliquot 10 plates, each with a different cell line, and measure
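integrin expression of each.
Use standard error if the variability is caused by experimental imprecision and you want to show the precision of the calculated mean. Then show the 95% confidence interval of the mean. For example: you aliquot 10 plates of the same cell line and measure integrin expression of each.

Precision of the Mean
In statistics, a confidence interval (CI) computed from a sample is an interval estimate of a parameter of the population the sample was drawn from. It expresses the range around the measured result within which the true value of the parameter lies with a given probability. That probability is called the confidence level.
For a 90% confidence interval, Z = 1.645
For a 95% confidence interval, Z = 1.96
For a 99% confidence interval, Z = 2.576
The formula for calculating the CI:
$CI = \bar{X} \pm (SEM \times Z)$
where $\bar{X}$ is the sample mean and Z is the critical value for the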
normal distribution. For the 95% CI, Z = 1.96.
For our data set: 95% CI = 77 ± (19 × 1.96) = 77 ± 32, so the 95% CI is 45 to 109.
This means that there is a 95% chance that the CI you calculated contains the population mean.

CI: A Practical Example
Between these two data sets, which mean do you think best reflects the population mean, and why?

Interpret CI of a mean

SD / SEM / 95% CI error bars
[Figure panels: SD, SEM, 95% CI]

II. The Null Hypothesis (Hypothesis Testing)
Appears in the form H0: μ1 = μ2, where
H0 = null hypothesis
μ1 = mean of population 1
μ2 = mean of population 2
An alternate form is H0: μ1 − μ2 = 0.
The null hypothesis is presumed true until statistical
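evidence in the form of a hypothesis test proves otherwise (it is either one or the other).

Test statistic
A statistic used in a hypothesis-testing problem is called a test statistic. As with parameter estimation, you need to consider whether the population is normally distributed, whether the sample is large or small, and whether the population variance is known or unknown.

Basic Concepts of Hypothesis Testing
The difference you observe between samples is not the same as the true difference between the populations. All you can do is calculate probabilities (the P value, between 0 and 1).
Before thinking about P values, you should:
1) Assess the science.
2) Review the assumptions of the analysis you chose.
P values (small P and big P: see pages 35 and 37)

Significance level (threshold significance level)
When a sample is used to infer whether H0 is correct, there is always a chance of error. The probability, or risk, of rejecting H0 when H0 is in fact true is denoted α. α is called the significance level of the hypothesis test, that is, the risk taken in the decision.

Example: acceptance and rejection regions at α = 0.05
Acceptance region: the range of variation permitted when the null hypothesis is true; if the statistic falls here, the null hypothesis should be accepted.
Rejection region: values that occur with only a small probability when the null hypothesis is true; if the statistic falls in this region, the null hypothesis should be rejected.

Two-sided and one-sided tests
Depending on the practical need, hypothesis tests are divided into:
Two-sided (two-tailed) test: a test concerned only with whether a difference exists, not with its direction.
One-sided (one-tailed) test: a test of a difference in one specific direction, either a left-tailed test or a right-tailed test.
[Figure: one-sided tests, showing the rejection region for (a) a right-tailed test and (b) a left-tailed test]

The two types of error in hypothesis testing
A hypothesis test draws its conclusion from the information in a sample, that is, it infers the whole from a part, so it cannot be absolutely accurate and errors are possible.
α error (Type I error): H0 is true but is rejected (rejecting a true hypothesis).
β error (Type II error): H0 is false but is accepted (accepting a false hypothesis).
Probabilities of the possible outcomes:

              Accept H0 (reject H1)       Reject H0 (accept H1)
H0 true       1 − α (correct decision)    α (Type I error)
H0 false      β (Type II error)           1 − β (correct decision)

(1) α and β are probabilities under two different premises. α is the probability of error when H0 is rejected, on the premise that H0 is true; β is the probability of error when H0 is accepted, on the premise that H0 is false. Therefore α + β does not equal 1.
(2) For a fixed n, α and β generally cannot both be reduced. For a fixed n, the smaller α is, the larger $Z_{\alpha/2}$ is, so the acceptance interval $(-Z_{\alpha/2}, Z_{\alpha/2})$ becomes wider, H0 is accepted more easily, and the probability of a Type II error increases; the converse also holds. That is, for a given sample size the probabilities of a Type I and a Type II error cannot be reduced together: when one decreases, the other increases.
(3) One way to reduce both α and β is to increase the sample size n.

Statistical Power

Hypothesis Testing

Nonparametric tests and parametric tests
ANOVA, t tests, and many other statistical tests assume that the data were sampled from populations that follow a Gaussian bell-shaped distribution. Many kinds of biological data follow a bell-shaped distribution that is approximately Gaussian.
Checking for a Gaussian distribution (normality test): normality tests can help you decide when to use nonparametric tests, but the decision should not be an automatic one. Also examine the frequency distribution or the cumulative frequency distribution.

III. t Test, F Test (Analysis of Variance), and Their Use in Prism

The t test and its use in Prism
A method that tests the mean of a normal population using a statistic that follows the t distribution. It is the most commonly used hypothesis test in the analysis of quantitative data.

Types of t test
1. t test comparing a sample mean with a known population mean (in Prism, use the column statistics analysis).
2. t test comparing means in a paired design. Purpose: infer whether two unknown population means μ1 and μ2 differ, using a paired design.
3. t test comparing the means of two independent samples (unpaired design). Purpose: infer whether two unknown population means μ1 and μ2 differ, using a group design.

1. Comparing a sample mean with a known population mean (one-sample t test)
Used to compare a sample mean with a known population mean μ0. The aim is to test whether the population mean represented by the sample differs from μ0.
The known population mean μ0 is usually a standard value, a theoretical value, or a fairly stable value obtained from a large number of observations.
The one-sample t test applies to small samples (for example n < 50) whose population standard deviation is unknown and which follow a normal distribution.

2. Comparing means in a paired design (paired t test)
In a paired design, treatments are assigned in one of three main ways:
Two homogeneous subjects each receive one of the two treatments, for example animals of the same litter, same sex, and similar weight paired together, or patients of the same sex, similar age, and the same condition paired together.
Two parts of the same subject or the same specimen are randomly assigned to the two treatments.
Self-contrast: results for the same subject are compared before and after a treatment (experiment or therapy), for example a physiological measure in hypertensive patients before and after treatment, or in athletes before and after exercise.

Case 1: 6L1 Tm1
t ratio; testing if pairs follow a Gaussian distribution

3. Comparing the means of two independent samples (unpaired t test)
Independent samples are also called unpaired or group samples. One set of data has no relation to the other, that is, the two samples are mutually independent.
Keeping the two sample sizes as equal as possible improves the precision of the test.
The t test for a difference between the means comes in two forms, one for equal population variances (homogeneity of variance) and one for unequal variances (Levene's test for equality of variance).
If the two population variances are unequal, use the t′ test, transform the variables, or use a rank-sum test.

Case 2: 6L1rs and 16L1rs Tm1

Nonparametric test
Choosing when to use a nonparametric test is not straightforward.
The Mann-Whitney test: unpaired data
Wilcoxon matched pairs test: paired data

Summary
Conditions for the t test:
1. Small sample (n < 50) of measurement data.
2. The sample comes from a normally distributed population.
3. The population standard deviation is unknown.
4. When two sample means are compared, the corresponding population variances must be equal (homogeneity of variance).

2019/11/22

Analysis of variance (ANOVA, F test) and its use in Prism
ANOVA partitions the variation among all observations into several parts. The sum of squared deviations from the mean, which describes the variation among observations, is split into sums of squares attributable to particular factors and a sum of squares due to random sampling error. The corresponding mean squares are then calculated and used to form the F statistic.
Types:
One-way ANOVA: there is only one factor, with at least 2 levels.
Multi-factor ANOVA (2 or more factors): at least 2 factors, each with at least 2 levels.

One-way ANOVA
When an experiment considers only one factor, it is called a single-factor experiment. It is the simplest design and suits data in which a single experimental factor is studied. The aim is to judge correctly the relative effect of each treatment of that factor (which levels perform better or worse).
Conditions:
1. The samples are mutually independent random samples.
2. The data in each sample follow a normal distribution.
3. The populations of the samples being compared have equal variances (homogeneity of variance).
In Prism: ordinary one-way ANOVA; repeated measures one-way ANOVA; nonparametric: Kruskal-Wallis test, Friedman's test.

Case 3
Effect of NaCl concentration on 6L1rs particles
Effect of NaCl concentration on 11L1rs particles
F ratio

Multi-factor ANOVA (two-way ANOVA)
General approach:
1. Examine the data type and choose the method: general linear model, multi-factor ANOVA.
2. Choose the outcome variable to analyze and the fixed-factor or random-factor variables.
3. Choose the ANOVA model: full factorial or custom.
4. Choose the descriptive statistics.
5. Choose the pairwise (multiple) comparison method.

Case 4: Particle results for different HPV types after different numbers of days at 25 °C

Linear regression and its use in Prism
The goal of linear regression is to adjust the values of slope and intercept to find the line that best predicts Y from X.

r², a measure of goodness-of-fit of linear regression
The value r² is a fraction between 0.0 and 1.0, and has no units. An r² value of 0.0 means that knowing X does not help you predict Y. When r² equals 1.0, all points lie exactly on a straight line with no scatter. Knowing X lets you predict Y perfectly.

How is r² calculated?
The left panel shows the best-fit linear regression line. In this example, the sum of squares of the vertical distances of the points from that line (SSreg) equals 0.86. The right half of the figure shows the null hypothesis: a horizontal line through the mean of all the Y values. Goodness-of-fit of this model (SStot), calculated the same way from the distances to the horizontal line, is 4.907. r² compares the two models: $r^2 = 1 - SS_{reg}/SS_{tot} = 1 - 0.86/4.907 \approx 0.82$.

Case 5: Turbidity experiment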
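
The r² calculation described above can be sketched in Python. This is a minimal illustration under stated assumptions, not Prism's implementation: the function names `r_squared_from_sums`, `linear_fit`, and `r_squared` and the four sample points are hypothetical, and SSreg follows the naming used above (the sum of squared vertical distances of the points from the fitted line).

```python
def r_squared_from_sums(ss_reg, ss_tot):
    """r2 = 1 - SSreg / SStot, with SSreg and SStot named as in the text."""
    return 1.0 - ss_reg / ss_tot


def linear_fit(xs, ys):
    """Ordinary least-squares slope and intercept for paired X, Y values."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def r_squared(xs, ys):
    """Fit a line, then compare it with a horizontal line through mean(Y)."""
    slope, intercept = linear_fit(xs, ys)
    mean_y = sum(ys) / len(ys)
    # Squared vertical distances from the fitted line (SSreg in the text).
    ss_reg = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))
    # Squared vertical distances from the horizontal line at mean(Y) (SStot).
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return r_squared_from_sums(ss_reg, ss_tot)


# Sums of squares quoted in the text.
print(round(r_squared_from_sums(0.86, 4.907), 2))  # 0.82

# Hypothetical data points, for illustration only.
print(round(r_squared([1, 2, 3, 4], [1, 3, 2, 4]), 2))  # 0.64
```

With the sums of squares quoted in the text, the first call gives an r² of about 0.82. The second call shows the same calculation starting from raw X and Y values.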