Speech Recognition - a practical guide
Lecture 2: Hidden Markov Models

First, a detour.
Nazi Germany, 1939: the Enigma machine.

The Enigma
- Enigma was one of many such machines.
- Encrypted text messages sent by radio.
- Substituted letters in an order-dependent way (a disk rotated with each letter, changing the pattern of substitutions).
- Had a large number of settings controlled by rotors.
- The recipient had to know the rotor settings.

Secret!
- Rotor settings were changed daily (the same everywhere).
- The Allies could use computers to try out each combination of rotor settings.
- For the correct combination, the decrypted messages would "make sense".
[Image: Colossus, the first programmable computer (Bletchley Park, England)]
- A computer was used to try out the various combinations fast.*
*Disclaimer: this is all a gross oversimplification of what really happened.

But wait!
- A computer cannot tell whether a message makes sense.
- Need to have a model of (German) text.
- This model will assign a higher score to sequences that look like "real text".

Models of text

Simplest model
- Assume letters are produced "independently" (i.i.d. = identically and independently distributed).
- Just use letter frequencies, e.g. "e" is more common.
- For each letter, P(letter) is its relative frequency.
- P(sentence) is the product over letters of P(letter).
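A minimal Python sketch of this letter-frequency model (illustrative only, not from the original slides; the training text and the floor value for unseen letters are made up):

from collections import Counter
import math

def letter_model(training_text):
    # Estimate P(letter) as its relative frequency in some training text.
    letters = [c for c in training_text.lower() if c.isalpha()]
    counts = Counter(letters)
    total = sum(counts.values())
    return {letter: n / total for letter, n in counts.items()}

def log_prob(sentence, p_letter, floor=1e-6):
    # log P(sentence) = sum over letters of log P(letter); unseen letters get a small floor.
    return sum(math.log(p_letter.get(c, floor)) for c in sentence.lower() if c.isalpha())

# A decryption that "looks like text" should score higher than one that does not.
p = letter_model("the enemy will attack at dawn " * 100)
print(log_prob("meet me at the bridge", p))
print(log_prob("qxzj kvq wzx jqq zkvxq", p))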
Drawbacks
- That model is sufficient to break simple codes (e.g. a rotation of the letters).
- The Enigma machine was not a simple substitution or rotation cipher.
- Many different rotor settings would produce plausible letter frequencies, e.g. swapping 2 letters of similar frequency.

Markov Chain
- A Markov Chain is a model of the probabilities of sequences of symbols.
- In a 1st-order Markov chain, p(symbol) depends only on the previous symbol.
- E.g. p(hat) = p(h|?) p(a|h) p(t|a).
- We write p(h|?) because we're not addressing start/end effects right now.
- Would work out a table of probabilities from previous telegrams.
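A sketch of this first-order Markov chain over letters, using a dummy "?" start symbol as on the slide; the training words are made up for illustration:

from collections import defaultdict

def train_markov_chain(words):
    # Estimate p(next | prev) from bigram counts, with "?" as a dummy start symbol.
    counts = defaultdict(lambda: defaultdict(int))
    for w in words:
        prev = "?"
        for c in w:
            counts[prev][c] += 1
            prev = c
    return {prev: {c: n / sum(nxt.values()) for c, n in nxt.items()}
            for prev, nxt in counts.items()}

def prob(word, p):
    # p(hat) = p(h|?) * p(a|h) * p(t|a); start/end effects ignored as in the slide.
    result, prev = 1.0, "?"
    for c in word:
        result *= p.get(prev, {}).get(c, 0.0)
        prev = c
    return result

p = train_markov_chain(["hat", "hats", "ham", "hit", "that", "chat"])
print(prob("hat", p))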
Hidden Markov Model (HMM)
- A model of the probability of a sequence.
- At each time instant, the model is in some "hidden" state.
- Matrix of "emission probabilities": #states by #symbols.
- Matrix of "transition probabilities": #states by #states.
- Note: invented by US govt. researchers, probably for some code-breaking stuff (but the exact use has not been made public).

HMMs
- Computing p(sequence | model) involves summing over exponentially many state sequences.
- Can be done fast using a dynamic programming recursion.
- Training the model parameters aims to maximize train-data likelihood.
- Not just a question of computing frequencies as for a Markov chain: you have to work out the hidden state sequences (well, a distribution over them).
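A sketch of that dynamic-programming recursion (the forward algorithm) for a toy discrete HMM; the initial, transition, and emission values below are illustrative, with 2 states and 3 symbols:

import numpy as np

init  = np.array([0.6, 0.4])            # p(state at t=0)
trans = np.array([[0.7, 0.3],           # #states by #states
                  [0.2, 0.8]])
emit  = np.array([[0.5, 0.4, 0.1],      # #states by #symbols
                  [0.1, 0.3, 0.6]])

def forward(symbols):
    # p(symbol sequence | model), summing over all state sequences by dynamic programming.
    alpha = init * emit[:, symbols[0]]  # alpha[i] = p(symbols so far, current state = i)
    for s in symbols[1:]:
        alpha = (alpha @ trans) * emit[:, s]
    return alpha.sum()

print(forward([0, 1, 2, 2]))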
HMMs: important algorithms
- Forward-backward algorithm: a recursion to compute state occupation probabilities. Used during model training; part of a class of algorithms (E-M) for iteratively maximizing likelihood.
- Viterbi algorithm: finds the most likely sequence of HMM states given a symbol sequence.
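A sketch of the Viterbi algorithm on the same kind of toy discrete HMM as in the forward sketch above (illustrative numbers again):

import numpy as np

init  = np.array([0.6, 0.4])
trans = np.array([[0.7, 0.3], [0.2, 0.8]])
emit  = np.array([[0.5, 0.4, 0.1], [0.1, 0.3, 0.6]])

def viterbi(symbols):
    # Most likely hidden state sequence for an observed symbol sequence.
    delta = np.log(init) + np.log(emit[:, symbols[0]])   # best log-prob ending in each state
    backptr = []
    for s in symbols[1:]:
        scores = delta[:, None] + np.log(trans)          # scores[i, j]: come from i, go to j
        backptr.append(scores.argmax(axis=0))
        delta = scores.max(axis=0) + np.log(emit[:, s])
    # Trace back the best path.
    state = int(delta.argmax())
    path = [state]
    for bp in reversed(backptr):
        state = int(bp[state])
        path.append(state)
    return list(reversed(path))

print(viterbi([0, 1, 2, 2]))   # prints [0, 0, 1, 1] for these numbers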
HMMs for speech
- Old HMM-based speech recognition used to work like this:
- Use Vector Quantization (VQ) to map each speech feature vector to one symbol (out of typically around 256).
- Each phone has a 3-state HMM with a left-to-right structure, as above.
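A sketch of the Vector Quantization step: each feature vector is mapped to the index of its nearest codebook entry. The codebook here is random, standing in for e.g. 256 trained centroids:

import numpy as np

rng = np.random.default_rng(0)
codebook = rng.normal(size=(256, 39))        # 256 codewords, 39-dim features

def quantize(features):
    # features: (num_frames, dim) -> one discrete symbol per frame.
    dists = ((features[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    return dists.argmin(axis=1)

frames = rng.normal(size=(100, 39))
print(quantize(frames)[:10])                 # symbol ids in [0, 256)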
HMMs for speech (2)
- The model for a sentence is a concatenation of the models for its phones.
- You don't need a phone time alignment to train.
- The forward-backward algorithm just finds the right alignment itself, after many iterations.
- This relies on there being enough training sentences (and nice enough data).

Continuous HMMs
- Vector Quantization was done by training a Gaussian Mixture Model (GMM) and using the top-scoring index.
- After some time people switched to a "soft" Vector Quantization: sum over the indexes, rather than take the max.
- Eventually the Gaussians were made specific to the HMM states.
- This is a "continuous" HMM, not a discrete one: the states emit a continuous feature vector, not a discrete symbol.
[Image: Gaussian mixture model]
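A sketch of the difference between the hard ("top-scoring index") and soft (sum over Gaussians) use of a GMM when scoring one feature vector; the GMM parameters are random placeholders:

import numpy as np

rng = np.random.default_rng(0)
num_gauss, dim = 8, 39
weights = np.full(num_gauss, 1.0 / num_gauss)
means = rng.normal(size=(num_gauss, dim))
variances = np.ones((num_gauss, dim))          # diagonal covariances

def per_gaussian_loglike(x):
    # Log-likelihood of x under each diagonal-covariance Gaussian, including its weight.
    ll = -0.5 * (np.log(2 * np.pi * variances) + (x - means) ** 2 / variances).sum(axis=1)
    return np.log(weights) + ll

x = rng.normal(size=dim)
ll = per_gaussian_loglike(x)
hard_symbol = int(ll.argmax())             # discrete HMM: one VQ symbol per frame
soft_loglike = np.logaddexp.reduce(ll)     # continuous HMM state: sum over the Gaussians
print(hard_symbol, soft_loglike)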
Things you should know
The following are very fundamental things that you should learn if you don't know them already:
- Markov Chain
- Hidden Markov Model
- Forward-backward algorithm
- Viterbi algorithm
- E-M for a mixture of Gaussians

Monophone model training

$ cd /kaldi-trunk/egs/rm/s3
$ # these commands are in run.sh
$ . path.sh  # set up your path - will be needed later.
$ scripts/subset_data_dir.sh data/train 1000 data/train.1k
$ steps/train_mono.sh data/train.1k data/lang exp/mono
$ local/decode.sh --mono steps/decode_deltas.sh exp/mono/decode

- Assumes you are where you left off last week (see last week's slides).
- Note: "monophone" is to distinguish from phonetic-context-dependent HMMs ("triphones").
- Use a subset of the data, since it is a waste of time to use all of it (there are so few parameters to train).
- Suggested exercise: try with different amounts of data and see how the WER changes. Does the WER change smoothly?
- Next we'll look at how the training script works. The output is in exp/mono.
- We will look at the log files to show you the commands the script actually runs.

Cepstral normalization

$ cat exp/mono/cmvn.log
compute-cmvn-stats --spk2utt=ark:data/train.1k/spk2utt scp:data/train.1k/feats.scp ark:exp/mono/cmvn.ark

- For each speaker, compute statistics to normalize the means and variances of the cepstral features.
- Just a count, and a (sum, sum-squared) pair for each dimension.
- The statistics are in binary format.
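Conceptually this amounts to the following sketch (not Kaldi's code): accumulate (count, sum, sum-of-squares) per speaker, then subtract the mean (and optionally divide by the standard deviation) per dimension:

import numpy as np

def cmvn_stats(feature_matrices):
    # Accumulate (count, sum, sum-of-squares) over all of one speaker's feature matrices.
    dim = feature_matrices[0].shape[1]
    count, total, total_sq = 0, np.zeros(dim), np.zeros(dim)
    for feats in feature_matrices:
        count += feats.shape[0]
        total += feats.sum(axis=0)
        total_sq += (feats ** 2).sum(axis=0)
    return count, total, total_sq

def apply_cmvn(feats, count, total, total_sq, norm_vars=False):
    # Subtract the speaker mean; optionally also divide by the standard deviation.
    mean = total / count
    out = feats - mean
    if norm_vars:
        var = total_sq / count - mean ** 2
        out = out / np.sqrt(var)
    return out

utts = [np.random.randn(300, 13) * 2 + 5, np.random.randn(250, 13) * 2 + 5]
stats = cmvn_stats(utts)
print(apply_cmvn(utts[0], *stats).mean(axis=0))   # roughly zero after normalization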
Viewing CMVN stats

$ copy-matrix ark:exp/mono/cmvn.ark ark,t:- | head
copy-matrix ark:exp/mono/cmvn.ark ark,t:-
adg0  202805.7 -45171.13 -25113.25 -32178.11 -64940.17 -59834.14 -53859.89 -25581.98 -38369.97 -35770.74 -34123.73 -40811.7 -17615.15 3120
      1.370877e+07 1926569 693237.5 1436362 2282034 1924576 1983340 791023.9 1075897 947114.2 995890.7 1152457 501246.2 0
ahh0  243559.7 -49834.79 -45551.43 -32294.71 -31345.15 -69359.74 -21432.35 -62158.46 -2514.336 -19708.29 -40365.65 37920.28 -36189.33 3720
      1.676417e+07 2100356 1245457 1481856 963323.5 2106689 816228.5 2149422 596726.6 830598.1 1141586 1077821 1064480 0

- The echoed command line (highlighted in yellow on the slide) went to stderr.
- All logging goes to stderr (including echoing the command line).
- This is a standard archive of matrices.
Model initialization

$ head exp/mono/init.log
gmm-init-mono --train-feats=ark:apply-cmvn --norm-vars=false --utt2spk=ark:data/train.1k/utt2spk ark:exp/mono/cmvn.ark scp:data/train.1k/feats.scp ark:- | add-deltas ark:- ark:- | subset-feats --n=10 ark:- ark:-| data/lang/topo 39 exp/mono/0.mdl exp/mono/tree

- (Ignore the part in gray, the --train-feats pipeline; it is used to get plausible means and variances.)
- Inputs: the topology file data/lang/topo, and the feature dimension (39).
- Outputs: the model "0.mdl" and the "tree".
- The tree is the phonetic-context decision tree; it doesn't have any splits in the monophone case.
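Conceptually, "plausible means and variances" means a flat start: every state begins from the same Gaussian, estimated from a small subset of the features. A sketch of that idea (not what gmm-init-mono literally does; the sizes are illustrative):

import numpy as np

def flat_start(feature_subset, num_states):
    # Initialize every state's single Gaussian from the global mean and variance
    # of a small subset of the training features; training will differentiate the states.
    feats = np.concatenate(feature_subset, axis=0)
    mean = feats.mean(axis=0)
    var = feats.var(axis=0)
    return {"means": np.tile(mean, (num_states, 1)),
            "vars": np.tile(var, (num_states, 1))}

subset = [np.random.randn(200, 39) for _ in range(10)]   # e.g. features of 10 utterances
model = flat_start(subset, num_states=3 * 48)            # e.g. 3 states for each of 48 phones
print(model["means"].shape)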
Monophone tree

[Figure: the whole set of trees represented as one tree. The root asks about the central phone ("Phone = ?"); under each phone (aa, ae, ah, ..., sil) there is a question "HMM state = ?" whose leaves are the numbered pdf-ids.]

$ draw-tree data/lang/phones.txt exp/mono/tree | dot -Tps -Gsize=8,10.5 | ps2pdf - /tree.pdf
29、32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 470 0 0 0.75 1 0.25 1 1 1 0.75 2 0.25 2 2 2 0.75 3 0.25 3 480 0 0 0.25 1 0.25 2 0.25 3 0.25 Specifies 3-state left-to-right HMM, and default transition probs (before training)Separate topology for silence (5 states, more transitions)Compiling training gra
Compiling training graphs

$ cat exp/mono/compile_graphs.log
compile-train-graphs exp/mono/tree exp/mono/0.mdl data/lang/L.fst ark:exp/mono/train.tra ark:|gzip -c > exp/mono/graphs.fsts.gz
LOG (compile-train-graphs:main():compile-train-graphs.cc:150) compile-train-graphs: succeeded for 1000 graphs, failed for 0

- Compiles FSTs, one for each training utterance.
- Each encodes the HMM structure for that training utterance.
- We precompile them because otherwise this step would dominate training time.

Viewing training graphs

$ fstcopy 'ark:gunzip -c exp/mono/graphs.fsts.gz|' ark,t:- | head
fstcopy ark:gunzip -c exp/mono/graphs.fsts.gz| ark,t:-
trn_adg04_sr249
0 1 266 949 0.693359
0 106 284 0 0.693359
0 107 285 0 0.693359
0 108 286 0 0.693359
1 7 268 0
1 1 265 0
2 109 288 0
2 110 289 0
2 111 290 0

- The archive format is: (utt-id graph utt-id graph ...).
- The graph (arc) format is: from-state to-state input-symbol output-symbol [cost].
- Costs include pronunciation probs, but for training graphs, not transition probs (those are added later).
Symbols in graphs
- In the graphs for training and testing, the output symbols are words (look them up in words.txt).
- In the traditional recipe, the input symbols would be p.d.f.s (so each mixture of Gaussians has a number).
- That causes difficulties for training the transition probabilities, finding phone alignments, etc.
- In our graphs, the input symbols are "transition-ids", which correspond roughly to arcs in the context-dependent HMMs. See the docs!
- Transition-ids can be mapped to "pdf-ids", which are fewer.