Journal of Empirical Finance 16 (2009) 457–465

Estimation of default probabilities using incomplete contracts data

J.M.C. Santos Silva a,c,*, J.M.R. Murteira b,c,*

a Department of Economics, University of Essex, UK
b Faculdade de Economia, Universidade de Coimbra, Portugal
c CEMAPRE, Portugal

* Corresponding authors. Santos Silva is to be contacted at Department of Economics, University of Essex, Wivenhoe Park, Colchester CO4 3SQ, UK. Murteira, Faculdade de Economia, Universidade de Coimbra, Av. Dias da Silva, 165, 3004-512 Coimbra, Portugal.
E-mail addresses: jmcss@essex.ac.uk (J.M.C. Santos Silva), jmurt@fe.uc.pt (J.M.R. Murteira).

Article info

Article history:
Received 4 October 2007
Received in revised form 11 August 2008
Accepted 11 November 2008
Available online 24 November 2008

JEL classification code: C21, C51, G21

Keywords: Beta-binomial distribution; Credit scoring; Population drift

Abstract

This paper develops a count data model for credit scoring which allows the estimation of default probabilities using incomplete contracts data. The main advantage of the proposed approach is that it permits a more efficient use of the data, including that for the most recent clients. Moreover, because the probability of default is specified as a function of the age of the contract, the model provides some information on the timing of the defaults. The model is based on the beta-binomial distribution, which is found to be particularly adequate for this purpose. A well-known dataset on personal loans is used to illustrate the application of the proposed model.

© 2008 Elsevier B.V. All rights reserved.

0927-5398/$ - see front matter © 2008 Elsevier B.V. All rights reserved. doi:10.1016/j.jempfin.2008.11.003

1. Introduction

Models for credit scoring are widely used in practice and raise a number of interesting and challenging research questions. Therefore, it is not surprising to find that they have been the subject of a considerable literature (see, among many others, Altman et al., 1981; Maddala, 1996; Hand and Henley, 1997; Hand and Jacka, 1998; Thomas, 2000; Thomas et al., 2002, and the references therein).

Consider a lending institution, hereinafter referred to as the bank, that wants to use information on the characteristics and repayment behavior of its clients to estimate the probability of a prospective borrower to default on a loan.¹ The nature of the statistical methods to use in the estimation of this model will depend on the type of loan being considered. Indeed, different types of loans generate data with different characteristics, and that is critical for the choice of the statistical methodology to use.

¹ In this paper only the probability of default is modelled. Although this is only a part of the optimization problem faced by the lending institution, it is of critical importance both for its profitability and for the households' welfare (see Carling et al., 2001).

In this paper we restrict our attention to a particular, yet important, type of loan. Specifically, we develop a model for the probability of default for the case in which the loan is to be repaid in a number of regular installments.² Scoring models for this type of loan are typically estimated using only data on contracts that have reached their maturity. However, models constructed in this way waste the information on clients who are currently repaying their loans. This is inappropriate, not only because it is an inefficient use of the data, but also because the resulting model may be affected by population drift problems, caused by changes in the distribution of the characteristics of the clients (Kelly et al., 1999). To mitigate these problems, in practice, current clients are often included in the sample and are classified according to their present status. However, this procedure will inevitably induce a degree of misclassification because some clients currently classified as non-defaulters may eventually default before the end of their contracts.

² Therefore, the model developed here may be inappropriate to describe defaults when the loans take the form of overdrafts, or are granted through credit cards. Thomas (2000) surveys different techniques that are useful in constructing scoring models for these other types of loans.

This paper shows that, using an appropriate count data model for the number of payments missed by the borrowers, it is possible to use the data on all contracts, including the more recent ones, to estimate the conditional probability of a client becoming a defaulter. This is particularly important in the case of long term contracts, e.g., mortgages, because in these cases the characteristics and repayment behavior of clients with completed contracts may have little to do with those of the
prospective borrowers. Moreover, for smaller and newer banks, the number of observations on clients with completed long term contracts may be very small.

The proposed model also has some additional advantages. First, by modelling the actual number of missed payments rather than just an indicator of default, the model explores more of the available information. Second, it allows the probability of default to depend on the age of the contract, thereby providing some information on the way the probability of default varies in time. The timing of defaults has been previously addressed in the literature (e.g., Roszbach, 2004; Duffie et al., 2007), but the particular characteristic of the model developed here is that it allows the researcher to estimate the probability of default for different time horizons, without actually using any duration data on the timing of defaults. Finally, the proposed modelling strategy is interesting because it can be adapted to a variety of circumstances and permits the test of a number of interesting hypotheses, like the possible change in repayment behavior after the client is considered a defaulter.

An issue that has to be addressed in the construction of credit scoring models is the potential sample selection problem caused by the fact that the bank only has information on clients to whom it has decided to grant a loan. This situation is problematic if the decision to accept or refuse the credit applications is made using information on the clients that is not available for the construction of the credit scoring model. In this case the sample is endogenously stratified and there is not much that can be done to solve the problem without relying on very strong assumptions (see Hand and Henley, 1993). On the contrary, if all the information used to decide about the credit applications is available for the construction of the credit scoring model, standard inference methods can be used because in this case the sample is exogenously stratified (see Pudney, 1989; Wooldridge, 1999). This more favorable situation is the one considered here.

The remainder of the paper is organized as follows. The next section introduces a count data model that allows the estimation of
default probabilities using data on incomplete contracts. Section 3 presents, purely for illustrative purposes, an application of the proposed model using a well-known dataset on personal loans, and Section 4 concludes the paper.

2. An appropriate count data model

Consider a loan that has to be repaid in $N$ regular installments and let $n$ denote the present age of the contract, measured by the number of installments that should have been paid since the contract began. Furthermore, let $Y$ be the number of payments so far missed by a client. Therefore, we have that $0 \le Y \le n \le N$.

The purpose of the scoring model is to estimate the probability that $Y$ will be larger than the maximum number of payments that a client is allowed to miss without being considered a defaulter, denoted by $l$. Obviously, this probability is zero for any $n \le l$.

For clients whose loans have reached their maturity date, i.e., $n = N$, it is possible to construct
a binary variable indicating whether or not the client defaulted. Credit scoring models are typically estimated by using appropriate statistical methods to model these indicators. The main drawback of this approach is that it wastes information, not only on the current clients whose final status is yet unknown, but also on the actual number of payments missed by the former clients.

Rather than just modelling the default indicator, the alternative approach we follow here is to model the count variable $Y$, taking into account that this variate is bounded between 0 and $n$. These bounds on $Y$ have implications for the type of count data model to use. Indeed, the distributions more often used in applied work, e.g., Poisson and negative binomial, are not suitable in this context because they assume that the counts have no upper bound. In order to account for the peculiar characteristics of $Y$, we use a beta-binomial model, which explicitly imposes that $Y \le n$. The beta-binomial regression was first used by Heckman and Willis (1977) in a very different context, but has seen little use in practice.

2.1. Beta-binomial regression

Suppose that, besides $Y$, $N$ and $n$, the bank observes a set $x$ of characteristics of the contract and of its clients.³ The objective
is then to estimate the probability that $Y$ will cross the threshold above which the client will be classified as a defaulter, given $N$, $n$ and $x$.

In order to take explicitly into account the upper bound on the value of $Y$, the model developed here has as a starting point the binomial distribution characterized by

$$P(Y = y \mid n) = \frac{n!}{y!\,(n-y)!}\, p^{y} (1-p)^{n-y}, \tag{1}$$

where $p$ is the probability that the individual will miss any of the $n$ payments and $y = 0, 1, \ldots, n$. Even if $p$ is parameterized as a function of $x$, the simple binomial model defined by Eq. (1) is unlikely to be adequate to describe the credit default data due to the presence of unobserved individual heterogeneity.

To account for extra-binomial variation, we assume that $p$ is distributed in the population as a beta random variable with parameters that depend on the value of $N$ and $x$. In particular, it is assumed that

$$f(p \mid N, x) = \frac{\Gamma\!\left(\frac{1}{\delta} + \frac{1}{\alpha\delta}\right)}{\Gamma\!\left(\frac{1}{\delta}\right)\Gamma\!\left(\frac{1}{\alpha\delta}\right)}\; p^{\frac{1}{\delta}-1}\,(1-p)^{\frac{1}{\alpha\delta}-1},$$

where $f(p \mid N, x)$ denotes the conditional density function of $p$, and $\alpha$ and $\delta$ are positive parameters that may depend on $N$ and $x$. Therefore, $Y$ follows a beta-binomial distribution defined by

$$P(Y = y \mid N, x, n) = \frac{n!}{y!\,(n-y)!}\; \frac{\Gamma\!\left(\frac{\alpha+1}{\alpha\delta}\right)\Gamma\!\left(\frac{1}{\alpha\delta} + n - y\right)\Gamma\!\left(\frac{1}{\delta} + y\right)}{\Gamma\!\left(\frac{1}{\delta}\right)\Gamma\!\left(\frac{1}{\alpha\delta}\right)\Gamma\!\left(\frac{\alpha+1}{\alpha\delta} + n\right)}, \tag{2}$$

with

$$E(Y \mid N, x, n) = \frac{n\alpha}{1+\alpha}, \qquad V(Y \mid N, x, n) = E(Y \mid N, x, n)\,\frac{1 + \alpha + \alpha\delta n}{(\alpha+1)(1 + \alpha + \alpha\delta)}.$$

To complete the model specification it is necessary to define how $\alpha$ and $\delta$ depend on $N$ and $x$. Naturally, the particular form of these functions is an empirical issue and no general rules can be established. However, given that they are positive parameters, and in line with standard practice in count data models, it is natural and convenient to specify $\alpha$ and $\delta$ as exponential functions of the covariates and $N$.⁴

In a credit scoring context, the probability of default is the main object of interest. Recalling that $l$ denotes the maximum number of repayments that a client may miss without being considered a defaulter, the probability of default implied by Eq. (2) can be written as

$$P(Y > l \mid N, x, n) = 1 - \sum_{j=0}^{l} P(Y = j \mid N, x, n). \tag{3}$$

Once the relevant parameters are estimated, the creditworthiness of prospective clients can be gauged by evaluating $P(Y > l \mid N, x, n)$ for the corresponding values of $x$ and for different values of $n$, for example $n = N$. Notice that in this sort of model the probability of becoming a defaulter will depend on the time horizon considered (see Fig. 1 below). This is important since, from the point of view of the bank, it is not indifferent when the client becomes a defaulter.

The beta-binomial regression accounts for extra-binomial variation by assuming that $p$ is distributed in the population as a beta random variable. An alternative way to account for extra-binomial variation in Eq. (1) is to model the distribution of the unobservables semi-parametrically, as it is done by Johansson and Palme (1996). There are several reasons why we prefer to use a fully parametric specification based on the beta-binomial distribution.

First of all, the beta distribution is well known for its flexibility (see, e.g., Johnson et al., 1995) and can easily accommodate situations in which the distribution of the unobservables depends on the conditioning variables. This is difficult to do, if at all possible, if the semiparametric approach is adopted. Indeed, in this case it is generally assumed that the unobservables are statistically independent of the covariates. Therefore, although the parametric approach is restrictive (in that it requires the specification of the distribution of the unobservables) it is nevertheless quite flexible and has the advantage of allowing the distribution of the unobservables to depend on the conditioning variables.

Additionally, an interesting feature of the beta-binomial distribution is that, besides this interpretation as a binomial distribution with individual heterogeneity, it can be viewed as giving the total number of successes in $n$ Bernoulli trials when both success and failure are contagious (see Johnson et al., 2005). Therefore, this model can accommodate a situation in which the probability that an individual will miss a certain payment depends on his previous repayment behavior.

Finally, a fully parametric model is generally much easier to estimate and to interpret than a semiparametric specification. In the present case, this motive is strengthened by the fact that this simplicity is certainly of great value for the practitioner in charge of the practical application of the model.

³ Depending on the nature of the problem, macroeconomic indicators may also be useful predictors of default (see, e.g., Bellotti and Crook, 2007).

⁴ See Wooldridge (1992) for more general specifications. It is also interesting to note that when $\alpha$ is specified as $\alpha = \exp(x'\beta - \ln(N))$, the limiting distribution of $Y$ when $N$ passes to infinity is negative binomial with mean $\lambda = \exp(x'\beta)$ and variance $\lambda + \delta\lambda^{2}$.
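To make Eqs. (2) and (3) concrete, the following is a minimal Python sketch rather than the authors' implementation. It assumes that the beta mixing distribution has shape parameters $1/\delta$ and $1/(\alpha\delta)$, which gives $E(Y \mid N, x, n) = n\alpha/(1+\alpha)$. The function names and the parameter values (`alpha = 0.05`, `delta = 0.8`, `l = 3`) are hypothetical and chosen only for illustration; in an application $\alpha$ and $\delta$ would be exponential functions of the covariates and $N$.

```python
from math import exp, lgamma


def beta_binomial_pmf(y, n, alpha, delta):
    """P(Y = y | n) for a beta-binomial count bounded by n.

    Assumed parameterization: p ~ Beta(a, b) with a = 1/delta and
    b = 1/(alpha*delta), so that E(p) = alpha / (1 + alpha).
    """
    if not 0 <= y <= n:
        return 0.0
    a = 1.0 / delta
    b = 1.0 / (alpha * delta)
    # log of the binomial coefficient n! / (y! (n - y)!)
    log_binom = lgamma(n + 1) - lgamma(y + 1) - lgamma(n - y + 1)
    # log of the ratio of gamma functions in the beta-binomial probability
    log_ratio = (
        lgamma(a + b) + lgamma(b + n - y) + lgamma(a + y)
        - lgamma(a) - lgamma(b) - lgamma(a + b + n)
    )
    return exp(log_binom + log_ratio)


def mean_missed(n, alpha):
    """E(Y | n) = n * alpha / (1 + alpha)."""
    return n * alpha / (1.0 + alpha)


def variance_missed(n, alpha, delta):
    """V(Y | n) = E(Y | n) (1 + alpha + alpha*delta*n) / ((1 + alpha)(1 + alpha + alpha*delta))."""
    scale = (1.0 + alpha + alpha * delta * n) / (
        (1.0 + alpha) * (1.0 + alpha + alpha * delta)
    )
    return mean_missed(n, alpha) * scale


def default_probability(l, n, alpha, delta):
    """P(Y > l | n) = 1 - sum_{j=0}^{l} P(Y = j | n); zero whenever n <= l."""
    if n <= l:
        return 0.0
    return 1.0 - sum(beta_binomial_pmf(j, n, alpha, delta) for j in range(l + 1))


# Hypothetical parameter values, for illustration only.
alpha, delta, l = 0.05, 0.8, 3
for n in (3, 12, 24, 36):
    print(n, default_probability(l, n, alpha, delta))
```

Working on the log scale with `lgamma` avoids overflow in the gamma functions for long contracts. Evaluating `default_probability` at several values of `n` illustrates how the implied probability of default depends on the time horizon considered.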
46、VYjN; x; n= EYjN; x; n1+ + n:2.2. A beta-binomial hurdle modelAsnotedbefore,thebeta-binomialregressionhasmanycharacteristicsthatmakeitappropriatetomodelthenumberofmissedpayments. However, in its basic formulation, the beta-binomial regression ignores the fact that the repayment behavior of a clientm
47、aychangeafterheisclassiedasadefaulter.Indeed,thebankmayputpressureondefaulterstorepaytheirdebts,forinstancebythreatening to take legal action. In more serious cases, defaulted loans may enter special debt-collection procedures and so, afterdefault, the number of missed payments will have a very diff
48、erent nature.TheyNl. Indeed,that has3. An3.1. Thethan 3Asand destined460 J.M.C. Santos Silva, J.M.R. Murteira / Journal of Empirical Finance 16 (2009) 457465An important feature of thedataset available for this study is thatit containsonly the sub-sample of 2446 observations usedbyDionne et al. (199
49、6), who deleted all the observations with incomplete records, as well as cases with outlying values of theexplanatoryorof thedependentvariable.Inparticular,theseauthorsdeleted217observationswith yN11.Thiscreatesatruncationproblem that will have to be taken into account in the modelling stage.7Besides containing information on y, the number of missed payments, and on n, the number of months fro