KMP Algorithm for Pattern Seaching
Pattern: 模式
Given a text txt[0...N-1] and a pattern pat[0...M-1], write a
function search(char pat[], char txt[]) that prints all
occurrences of pat[] in txt[].You may assume that
N>M.
Occurrence: 出现
Assume: 假定
Examples:
Input: txt[] = "THIS IS A TEST TEXT", pat[] =
"TEST"
Output: Pattern found at index 10
Input: txt[] = "AABAACAADAABAABA"
pat[] =
"AABA"
Output: Pattern found at index 0, Pattern found at index 9, Pattern found at index 12
Pattern
searching is an important problem in computer science. When we do search for a string in a notepad/word file
or
browser or database, pattern-searching algorithms are used to show the search results.
We have discussed
the Naive pattern-searching algorithm in the previous post. The worst case complexity of the Naive
algorithm is
O(m(n-m+1)). The time complexity of the KMP algorithm is O(n+m) in the worst case.
Discuss: 讨论
Naive:
朴素,原始
Post: 帖子
KMP Pattern Searching:
The Naive pattern-searching algorithm doesn't work well in cases
where we see many matching characters followed by a
mismatching character.
Match:
比得上…;满足…。
Mismatch:错配。
Examples:
1) txt[] = "AAAAAAAAAB", pat[] = "AAAB"
2) txt[] =
"ABABABABABCABABABABC", pat[] = "ABABC"(not a worst case, but a bad case for Naive)
The KMP
matching algorithm uses degenerating property (pattern having the same sub-patterns appearing more than once in
the
pattern) of the pattern and improves the worst-case complexity to O(n+m).
Degenerating: 退化。
Generating:
引起。
Property: 特性。
The basic idea behind KMP's algorithm is: whenever we detect a mismatch (after some
matches), we already know some of
the characters in the text of the next window. We take advantage of this
information to avoid matching the characters
that we know will anyway match.
Whenever: 每当。
Detect:
识别。
Advantage: 有利的。
Matching Overview
txt = "AAAAABAABA"
pat = "AAAA"
We
compare first window of txt with pat
txt = "AAAAABAAABA"
pat = "AAAA" [Initial
position]
We find a match. This is same as Naive String Matching.
In the next step, we compare next window of
txt with pat.
txt = "AAAAABAAABA"
pat = "AAAA" [Pattern shifted one position]
// Shift:
移动。
This is where KMP does optimization over Naive. In this second window, we only compare fourth A of pattern
with fourth
character of current window of text to decide whether current window matches or not. Since we know first
three
characters will anyway match, we skipped matching first three characters.
Optimization: 优化。
Current:
当前的。
Need of Preprocessing?
An important question arises from the above explanation, how to know how many
characters to be skipped. To know this, we
pre-process pattern and prepare an integer array lps[] that tells us the
count of characters to be skipped.
Arise: 产生。
Explanation: 解释。
Preprocessing Overview:
• KMP algorithm
preprocesses pat[] and constructs an auxiliary lps[] of size m (same as the size of the pattern) which
is used to
skip characters while matching.
• Name lps indicates the longest proper prefix which is also a suffix. A proper
prefix is a prefix with a whole string
not allowed.
Auxiliary: 辅助的。