String Matching
Algorithm: Design & Analysis 19

In the last class...
  Optimal Binary Search Tree
  Separating Sequence of Words
  Dynamic Programming Algorithms

String Matching
  Simple String Matching
  KMP Flowchart Construction
  Jump at Fail
  KMP Scan

String Matching: Problem Description
  Search the text T, a string of characters of length n,
  for the pattern P, a string of characters of length m (usually m ≪ n).
  The result:
    If T contains P as a substring, return the index in T at which the substring starts.
    Otherwise: fail.

Straightforward Solution
  [Figure: P = p_1 … p_{k-1} p_k … p_m aligned under T = t_1 … t_i … t_{i+k-2} t_{i+k-1} … t_{i+m-1} … t_n. p_1 lies under t_i, the first matched character; the matched window t_i … t_{i+k-2} expands to the right; the next comparison is p_k against t_{i+k-1}.]
  Note: If p_k fails to match t_{i+k-1}, backtracking occurs: a new cycle of character matching starts from t_{i+1}.
  In the worst case, backtracking occurs nearly n times and each cycle makes nearly m-1 comparisons, so the cost is Θ(mn).

Disadvantages of Backtracking
  More comparisons are needed.
  Up to m-1 of the most recently matched characters have to be readily available for re-examination (consider texts that are too long to be loaded in their entirety).

An Intuitive Finite Automaton for Matching a Given Pattern
  [Figure: automaton for the pattern "AABC" over the alphabet {A, B, C}, with a start node, nodes 1 to 4 and a stop node * ("matched!"); each of the nodes 1 to 4 has one outgoing edge for each of A, B and C.]
  Advantage: each character in the text is checked only once.
  Difficulty: construction of the automaton; too many edges (for a large alphabet) have to be defined and stored.
  Why no backtracking? The current node memorizes the matched prefix.
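The straightforward solution and the automaton idea above can be sketched in Java as follows. This is a minimal illustration under stated assumptions, not code from the slides: the class and method names are hypothetical, indices are 0-based, and automaton states are numbered 0 to m by the number of matched characters (the figure numbers its nodes 1 to 4 and *). The transition table is built by brute force, which keeps the sketch short but is only sensible for small patterns and alphabets.

```java
final class SimpleMatchSketch {
    /** Straightforward scan: 0-based start of the first occurrence of p in t, or -1. */
    static int simpleScan(String p, String t) {
        int m = p.length(), n = t.length();
        for (int i = 0; i + m <= n; i++) {          // i: start of the current matching cycle
            int k = 0;
            while (k < m && t.charAt(i + k) == p.charAt(k)) {
                k++;                                // matched window expands to the right
            }
            if (k == m) return i;
            // mismatch: backtrack, the next cycle starts at i + 1
        }
        return -1;
    }

    /**
     * Transition table of the matching automaton, built by brute force.
     * State s means "the last s text characters equal the first s characters of p".
     * next[s][c] is the state reached from s on reading alphabet.charAt(c).
     */
    static int[][] buildAutomaton(String p, String alphabet) {
        int m = p.length();
        int[][] next = new int[m + 1][alphabet.length()];
        for (int s = 0; s <= m; s++) {
            for (int c = 0; c < alphabet.length(); c++) {
                String seen = p.substring(0, s) + alphabet.charAt(c);
                int len = Math.min(m, seen.length());
                // longest prefix of p that is a suffix of what has been seen
                while (!seen.endsWith(p.substring(0, len))) len--;
                next[s][c] = len;
            }
        }
        return next;
    }

    /** Scans t once, left to right, never re-reading a text character. */
    static int automatonScan(String p, String t, String alphabet) {
        int m = p.length();
        if (m == 0) return 0;
        int[][] next = buildAutomaton(p, alphabet);
        int s = 0;
        for (int j = 0; j < t.length(); j++) {
            int c = alphabet.indexOf(t.charAt(j));
            s = (c < 0) ? 0 : next[s][c];           // assumption: unknown characters reset the match
            if (s == m) return j - m + 1;
        }
        return -1;
    }
}
```

For example, both `simpleScan("AABC", "AAABAABCA")` and `automatonScan("AABC", "AAABAABCA", "ABC")` return 4. The table has (m+1)·|alphabet| entries, which is the storage difficulty the slide points out.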
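The Knuth-Morris-Pratt Flowchart
  [Figure: KMP flowchart for P = "ABABCB": a "Get next text char." node, character nodes 1 to 6 labeled A, B, A, B, C, B, and a stop node *; links are marked Success or Failure.]
  An example: T = "ACABAABABA", P = "ABABCB"

Matched Frame
  P:     ABABABCB
  T: ... ABABABx ...
  The matched frame is ABABAB; x is to be compared next.

  If x is not C:
  P:       ABABABCB
  T: ... ABABABx ...
  The matched frame moves right by 2 characters, which is equal to moving the pointer for P backward.

  P:       ABABABCB
  T: ... ABABABABCB ...
  Moving by 4 characters may result in an error: here the copy of P that starts 2 characters to the right would be skipped.

Sliding the Matched Frame
  The matched frame slides, and its breadth changes as well:
  [Figure: after the slide, p_1 … p_{r-1} lies under t_{j-r+1} … t_{j-1}, which was matched before by p_{k-r+1} … p_{k-1}; p_r lies under t_j.]
  When a mismatch occurs:
  [Figure: p_1 … p_{k-1} p_k under t_i … t_{j-1} t_j; p_1 … p_{k-1} is the matched frame and the mismatch is at p_k against t_j; the new matched frame should be as large as possible, and t_j takes part in the next comparison.]

Fail Links
  Out of each node k of the KMP flowchart is a fail link leading to node r, where r is the largest non-negative integer satisfying r < k such that p_1 … p_{r-1} matches p_{k-r+1} … p_{k-1} (stored in fail[k]).
  Note: r is independent of T.
  [Figure: two copies of P, the lower one shifted right by k-r so that its node r lies under node k of the upper copy; the pointer for P moves backward, the pointer for T only moves forward.]
  Which means: when the comparison fails at node k, the next comparison is t_j (the text character just compared with p_k) vs. p_r.

Computing the Fail Links
  Thinking recursively, let fail[k-1] = s:
  [Figure: p_1 … p_{s-1} p_s p_{s+1} … aligned under p_1 … p_{k-2} p_{k-1} p_k … p_m; p_1 … p_{s-1} is matched with p_{k-s} … p_{k-2}, and p_s and p_{k-1} are to be compared.]
  Case 1: p_s = p_{k-1}, so fail[k] = s+1.
  Case 2: p_s ≠ p_{k-1}. Then p_{fail[s]} is the next to be compared with p_{k-1}, thinking recursively.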
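To make the definition of fail[k] concrete, the Java sketch below evaluates it by brute force, trying r = k-1, k-2, … until p_1 … p_{r-1} equals p_{k-r+1} … p_{k-1}. The class and method names are illustrative assumptions, and this is deliberately the slow, definition-level computation rather than the recursive construction the slides develop. The returned array is 1-based like the slides (index 0 is unused).

```java
final class FailLinkDefinitionSketch {
    /**
     * fail[1..m] computed directly from the definition:
     * fail[k] is the largest r < k with p_1..p_{r-1} equal to p_{k-r+1}..p_{k-1}.
     */
    static int[] failByDefinition(String p) {
        int m = p.length();
        int[] fail = new int[m + 1];        // fail[0] unused; fail[1] stays 0
        for (int k = 2; k <= m; k++) {
            for (int r = k - 1; r >= 1; r--) {
                String prefix = p.substring(0, r - 1);          // p_1 .. p_{r-1}
                String beforeK = p.substring(k - r, k - 1);     // p_{k-r+1} .. p_{k-1}
                if (prefix.equals(beforeK)) {
                    fail[k] = r;            // first hit is the largest r
                    break;
                }
            }
        }
        return fail;
    }
}
```

For r = 1 both substrings are empty, so every k ≥ 2 gets a fail link of at least 1. For P = "ABABABCB" the sketch yields fail[1..8] = 0, 1, 1, 2, 3, 4, 5, 1.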
Recursion on Node fail[s]
  Thinking recursively, at the beginning s = fail[k-1].
  Case 2: p_s ≠ p_{k-1}
  [Figure: p_1 … p_{fail[s]-1} p_{fail[s]} aligned under p_1 … p_{s-1} p_s p_{s+1} …, which is aligned under p_1 … p_{k-2} p_{k-1} p_k … p_m.]
  p_s is replaced by p_{fail[s]}, that is, s takes the new value fail[s].
  Then proceed with the new s:
    if case 1 applies (p_s = p_{k-1}): fail[k] = s+1;
    if case 2 applies (p_s ≠ p_{k-1}): take another new s.

Computing Fail Links: an Example
  Constructing the KMP flowchart for P = "ABABABCB", assuming that fail[1] to fail[6] have been computed.
  [Figure: KMP flowchart for "ABABABCB" with a "Get next text char." node, character nodes labeled A, B, A, B, A, B, C, B and a stop node *.]
  fail[7]: fail[6] = 4 and p_6 = p_4, so fail[7] = fail[6] + 1 = 5 (case 1).
  fail[8]: fail[7] = 5, but p_7 ≠ p_5, so let s = fail[5] = 3; but p_7 ≠ p_3, so, going further back, let s = fail[3] = 1. Still p_7 ≠ p_1. Further, let s = fail[1] = 0, so fail[8] = 0 + 1 = 1 (case 2).

Constructing KMP Flowchart
  Input: P, a string of characters; m, the length of P.
  Output: fail, the array of failure links, filled.

    void kmpSetup(char[] P, int m, int[] fail)
    {
        int k, s;
        fail[1] = 0;
        for (k = 2; k <= m; k++) {
            s = fail[k-1];
            while (s >= 1) {
                if (P[s] == P[k-1])    // case 1
                    break;
                s = fail[s];           // case 2: follow the fail link
            }
            fail[k] = s + 1;
        }
    }

  The for loop executes m-1 times, and the while loop executes at most m times, since fail[s] is always less than s. So the complexity is roughly O(m^2); this is only a loose bound.

Number of Character Comparisons
  Successful comparison: at most once for each k, totaling at most m-1.
  Unsuccessful comparison: always followed by a decrease of s. Since
    s is initialized to 0,
    s increases by one each time,
    s is never negative,
  the number of decreases cannot be larger than the number of increases.
  In kmpSetup, the two lines "fail[k] = s+1" and "s = fail[k-1]" (in the next iteration) combine to increase s by 1, which is done m-2 times.
  Total: at most (m-1) + (m-2) = 2m-3 character comparisons.

KMP Scan: the Algorithm
  Input: P and T, the pattern and text; m, the length of P; fail, the array of failure links for P.
  Output: the index in T where a copy of P begins, or -1 if there is no match.

    int kmpScan(char[] P, char[] T, int m, int[] fail)
    {
        int match, j, k;               // j indexes T, and k indexes P
        match = -1; j = 1; k = 1;
        while (endText(T, j) == false) {
            if (k > m) { match = j - m; break; }
            if (k == 0) { j++; k = 1; }
            else if (T[j] == P[k]) { j++; k++; }    // one character matched
            else k = fail[k];                       // following the failure link
        }
        if (k > m) match = j - m;      // P matched at the very end of T
        return match;
    }
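A compact Java rendering of kmpSetup and kmpScan is sketched below so the two routines can be run together. It is a sketch under assumptions rather than the slides' exact code: the class name is hypothetical, the text is a String (so the end-of-text test becomes `j <= n` in place of endText), and the slides' 1-based p_k and t_j are read as `p.charAt(k - 1)` and `t.charAt(j - 1)`.

```java
final class KmpSketch {
    /** Fail links fail[1..m] for pattern p (index 0 unused). */
    static int[] kmpSetup(String p) {
        int m = p.length();
        int[] fail = new int[m + 1];
        if (m == 0) return fail;
        fail[1] = 0;
        for (int k = 2; k <= m; k++) {
            int s = fail[k - 1];
            // case 2: while p_s != p_{k-1}, follow the fail link of s
            while (s >= 1 && p.charAt(s - 1) != p.charAt(k - 2)) {
                s = fail[s];
            }
            fail[k] = s + 1;            // case 1 reached, or s fell to 0
        }
        return fail;
    }

    /** 1-based index in t where the first copy of p begins, or -1 if there is none. */
    static int kmpScan(String p, String t) {
        int m = p.length(), n = t.length();
        int[] fail = kmpSetup(p);
        int j = 1, k = 1;               // j indexes t, k indexes p
        while (j <= n && k <= m) {
            if (k == 0) {               // fell off the front of p: take the next text char
                j++;
                k = 1;
            } else if (t.charAt(j - 1) == p.charAt(k - 1)) {
                j++;                    // one character matched
                k++;
            } else {
                k = fail[k];            // follow the failure link; j does not move back
            }
        }
        return (k > m) ? j - m : -1;    // also covers a match ending at the last text char
    }
}
```

With the slides' example, `kmpScan("ABABCB", "ACABAABABA")` returns -1 because the text contains no copy of the pattern, while `kmpScan("ABABABCB", "xxABABABABCBxx")` returns 5. The pointer j never decreases, which is why no part of the text has to be kept for re-examination.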
  kmpScan: each time a new cycle of the loop begins, p_1 … p_{k-1} are matched.
  kmpScan: the loop body is executed at most 2n times. Why?

Skipping Characters in String Matching
  T: If you wish to understand others you must ...
  P: must
  [Figure: P = "must" placed under T at successive positions, jumping forward after each mismatch.]
  Checking the characters in P, in reverse order.
  The copy of P begins at t_38. Matching is achieved in 18 comparisons.

Distance of Jumping Forward
  With the knowledge of P, the distance of jumping forward for the pointer of T is determined by the character itself, independent of the location in T.
  [Figure: at the current j, t_j = A does not match; P slides right until its rightmost A lies under t_j, and the new j is the text position under p_m.]
  charJump[A] = m - k, where p_k is the rightmost A in P.

Computing the Jump: Algorithm
  Input: pattern string P; m, the length of P; alphabet size alpha = |Σ|.
  Output: array charJump, indexed 0, …, alpha-1, storing the jumping offsets for each character in the alphabet.

    void computeJumps(char[] P, int m, int alpha, int[] charJump)
    {
        char ch;
        int k;
        for (ch = 0; ch < alpha; ch++)
            charJump[ch] = m;          // for all characters not in P, jump by m
        for (k = 1; k <= m; k++)
            charJump[P[k]] = m - k;
    }

  The increasing order of k ensures that, for symbols occurring more than once in P, the jump is computed according to the rightmost occurrence.
  Cost: Θ(|Σ| + m).

Partially Matched Substring
  P: b a t s a n d c a t s
  T: ... d a t s ...
  The matched suffix is "ats". At the current j, t_j = d and charJump[d] = 4, so the new j is j + 4 and P moves only 1 character.
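The charJump heuristic on its own can be tried out with the Java sketch below. The scan routine is not given in the slides; it is an assumed, minimal right-to-left scan in which the jump is never allowed to be smaller than m-k+1, so P always moves forward. Further assumptions: the class, method and field names are hypothetical, the alphabet is taken to be character codes 0 to 255 (pattern characters must fall in that range), and the `comparisons` counter exists only so that the count quoted for the "must" example can be reproduced.

```java
final class CharJumpSketch {
    static final int ALPHA = 256;       // assumed alphabet size: character codes 0..255

    /** charJump[c] = m - k for the rightmost p_k equal to c, or m if c does not occur in p. */
    static int[] computeJumps(String p) {
        int m = p.length();
        int[] charJump = new int[ALPHA];
        java.util.Arrays.fill(charJump, m);
        for (int k = 1; k <= m; k++) {
            charJump[p.charAt(k - 1)] = m - k;      // a later k overwrites an earlier one
        }
        return charJump;
    }

    static long comparisons;            // character comparisons made by the last scan

    /** Right-to-left scan using charJump only; 1-based start of the first match, or -1. */
    static int simpleBMScan(String p, String t) {
        int m = p.length(), n = t.length();
        comparisons = 0;
        if (m == 0) return 1;
        int[] charJump = computeJumps(p);
        int j = m, k = m;               // j indexes t, k indexes p (both 1-based)
        while (j <= n) {
            comparisons++;
            char c = t.charAt(j - 1);
            if (c == p.charAt(k - 1)) {
                if (k == 1) return j;   // all of p matched; p_1 lies under t_j
                j--;
                k--;
            } else {
                int jump = (c < ALPHA) ? charJump[c] : m;
                j += Math.max(jump, m - k + 1);     // never slide P to the left
                k = m;
            }
        }
        return -1;
    }
}
```

On the slides' text, `simpleBMScan("must", "If you wish to understand others you must")` returns 38 with `comparisons` equal to 18, and `computeJumps("batsandcats")['d']` is 4, matching the figures quoted above.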
  Remembering the matched suffix, we can get a better jump:
  P: b a t s a n d c a t s
  T: ... d a t s ...
  The "ats" of "bats" is lined up with the matched text, so with the new j, P moves 7 characters.

Forward to Match the Suffix
  [Figure: P = p_1 … p_k p_{k+1} … p_m under T = t_1 … t_j t_{j+1} … t_n; p_{k+1} … p_m is the matched suffix and the mismatch is at p_k against t_j.]
  A substring the same as the matched suffix occurs in P:
  [Figure: P slides right by slide[k] so that p_{r+1} … p_{r+m-k} lies under the matched text; the pointer for T moves from the old j to the new j, a distance of matchJump[k].]

Partial Match for the Suffix
  [Figure: P = p_1 … p_k p_{k+1} … p_m under T = t_1 … t_j t_{j+1} … t_n; p_{k+1} … p_m is the matched suffix and the mismatch is at p_k against t_j.]
  No entire substring the same as the matched suffix occurs in P:
  [Figure: P slides right by slide[k] so that its prefix p_1 … p_q, which may be empty, lies under the end of the matched text; the pointer for T moves from the old j to the new j, a distance of matchJump[k].]

matchJump and slide
  slide[k]: the distance P slides forward after a mismatch at p_k, with m-k characters matched to the right.
  matchJump[k]: the distance that j, the pointer for T, jumps; that is, matchJump[k] = slide[k] + m - k.
  Let r (r < k) be the largest index such that p_{r+1} starts a substring matching the matched suffix of P, and p_r ≠ p_k; then slide[k] = k - r.
  If no such r is found, the longest prefix of P, of length q, that matches the end of the matched suffix of P is lined up; then slide[k] = m - q.

Computing matchJump: Example
  P = "wowwow" (m = 6). Direction of computing: from k = 6 down to k = 1.
  k = 6: the matched suffix is empty. slide[6] = 1, m-k = 0, matchJump[6] = 1.
  k = 5: 1 character is matched ("w"). slide[5] = 5 - 3 = 2, m-k = 1, matchJump[5] = 3.
  k = 4: 2 characters are matched ("ow"). The earlier "ow" is not lined up, because the character before it equals p_k. No r is found, but there is a prefix of length 1, so slide[4] = m - 1 = 5, m-k = 2, matchJump[4] = 7.
  k = 3: 3 characters are matched ("wow"). slide[3] = 3 - 0 = 3, m-k = 3, matchJump[3] = 6.
  k = 2: 4 characters are matched ("wwow"). No r is found, but there is a prefix of length 3, so slide[2] = m - 3 = 3, m-k = 4, matchJump[2] = 7.
  k = 1: 5 characters are matched ("owwow"). No r is found, but there is a prefix of length 3, so slide[1] = m - 3 = 3, m-k = 5, matchJump[1] = 8.
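The worked example can be checked mechanically. The Java sketch below computes slide[k] straight from the two-case definition by brute force: it tries slides d = 1, 2, … and accepts the first one under which every character of the matched suffix that still lies over P agrees, and the character brought under the mismatch position differs from p_k (or P has moved past it). It then uses matchJump together with the character jump in a right-to-left scan. All names are hypothetical, the scan loop is an assumed formulation, and the brute-force table is for illustration only and is not an efficient construction.

```java
final class MatchJumpSketch {
    /** True if sliding P right by d is consistent with the matched suffix p_{k+1}..p_m. */
    static boolean fits(String p, int k, int d) {
        int m = p.length();
        for (int i = k + 1; i <= m; i++) {
            // the character sliding under position i must repeat p_i, if any does
            if (i - d >= 1 && p.charAt(i - d - 1) != p.charAt(i - 1)) return false;
        }
        // the character sliding under position k must differ from p_k, if any does
        return k - d < 1 || p.charAt(k - d - 1) != p.charAt(k - 1);
    }

    /** matchJump[1..m] (index 0 unused): matchJump[k] = slide[k] + m - k. */
    static int[] matchJumpByDefinition(String p) {
        int m = p.length();
        int[] matchJump = new int[m + 1];
        for (int k = 1; k <= m; k++) {
            int d = 1;
            while (!fits(p, k, d)) d++;     // d = m always fits, so this terminates
            matchJump[k] = d + (m - k);     // d is slide[k]
        }
        return matchJump;
    }

    /** m - k for the rightmost p_k equal to c, or m if c does not occur in p. */
    static int charJump(String p, char c) {
        return p.length() - 1 - p.lastIndexOf(c);
    }

    /** Scan using both jumps; 1-based start of the first match, or -1. */
    static int bmScan(String p, String t) {
        int m = p.length(), n = t.length();
        if (m == 0) return 1;
        int[] matchJump = matchJumpByDefinition(p);
        int j = m, k = m;                   // j indexes t, k indexes p (both 1-based)
        while (j <= n) {
            char c = t.charAt(j - 1);
            if (c == p.charAt(k - 1)) {
                if (k == 1) return j;       // whole pattern matched
                j--;
                k--;
            } else {
                j += Math.max(charJump(p, c), matchJump[k]);
                k = m;
            }
        }
        return -1;
    }
}
```

For P = "wowwow" the table comes out as matchJump[1..6] = 8, 7, 6, 7, 3, 1, in agreement with the example, and for P = "batsandcats" a mismatch at k = 8 gives matchJump[8] = 7 + 3 = 10. Since matchJump[k] is always at least m-k+1, taking the maximum of the two jumps always moves P forward.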
The Boyer-Moore Algorithm

    void computeMatchJumps(char[] P, int m, int[] matchJump)
    {
        int k, r, s, low, shift;
        int[] sufx = new int[m+1];
        for (k = 1; k <= m; k++)
            matchJump[k] = m + 1;
        sufx[m] = m + 1;
        for (k = m-1; k >= 0; k--) {
            s = sufx[k+1];
            while (s <= m) {
                if (P[k+1] == P[s])
                    break;
                matchJump[s] = min(matchJump[s], s - (k+1));
                s = sufx[s];
            }
            sufx[k] = s - 1;
        }
        // computing slide[k]: computing the prefix length is necessary;
        // change the slide value to matchJump by addition;
    }

  sufx[k] = x means that a substring starting from p_{k+1} matches the suffix starting from p_{x+1}.

Home Assignment
  pp.508-
  11.4
  11.8
  11.9
  11.13
  11.18