Rabin-Karp Algorithm

String Matching: A string (say txt) and a pattern( smaller string say pat) are given and ask for finding the position of pat in txt. Let the length of txt = n and length of pat is m.

Method 1: Naive Approach

    Traverse string txt from 0 to (n-m+1) and in every round check if part of txt match with pat character by character. The time complexity of this algorithm will be O(m*n).

Method 2: Rabin-Karp algorithm

    In this method, we use a hash function to hash pat and hash part of txt of size m and apply the rolling hash method and is the hash function of part of txt match with pat then check pat and txt character by character.

    Time Complexity: O(n-m+1) Average

    Time Complexity: O(n-m+1) Worst

        eg: txt: a a a a a b    and     pat: a a b

    Let the hashing begin with 1 and end with 26 as:

        hash(a): 1
        hash(b): 2
        -----------
        -----------
        hash(z): 26

        hash(a a a): 1+1+1=3
        hash(a a b): 1+1+2=4
        hash(pattern): 1+1+2=4

        Every we get 'a a a' this means we have 1+1+1=3 as hash(aaa) = 1+1+1=3.
        The hash(pattern) matched to txt as the last position so we check character by character by character. 
            Pattern: a a b
            Text:       a a b
                So we get each character will be the same and at the same position so pat match with txt.

    But in the above approach there will be some problem that we will be seen with this example:
        eg: txt: c c a c c a e d b a    and    pat: d b a


        hash(pattern): 4+2+1=7

        hash(c c a): 3+3+1=7    Hash function same but character not matching
        hash(c a c): 3+1+3=7    Hash function same but character not matching
        hash(a c c): 1+3+3=7    Hash function same but character not matching
        hash(c c a): 3+3+1=7    Hash function same but character not matching
        hash(c a a): 3+1+1=5    Hash function not matching
        hash(a a e): 1+1+5=7    Hash function same but character not matching
        hash(a e d): 1+5+4=9    Hash function not matching
        hash(e d b): 5+4+2=11    Hash function not matching
        hash(e d b): 5+4+2=11    Hash function not matching
        hash(d b a): 4+2+1=7    Hash and character matching so we can check char by char
            txt: d b a
            pat: d b a
                so characters also matched so pattern matched to the string. This called spurious hits.
        Here the problem is that in some part of the string, the hash function is equal to the hash function of the pattern but it does not contain the same characters. To remove this problem we can use this method:
        hash(a1 a2 a3): [hash(a1)*pow(k,m-1)] + [hash(a2)*pow(k,m-2)]+ [hash(a1)*pow(k,m-3)]
            where k can be any number greater than range, like in alphabet 26, if the character isn't defined use k=256, for here assume k=10

Rolling hash function: In rolling hashing function we use a window with specific size and find the hash of a part of the string. Now we move window one step ahead, to find the hash value of that we simply remove the first element hash of the previous window and rehash( move all the elements one step left ) and add the last element of a current hash.
        Like in the above example if we have hash(c c a) and we want hash(c a c) we remove the first hash(c) of the old window and last hash(c) of the current window.

        hash(c c a): [hash(c)*pow(k,3-1)] + [hash(c)*pow(k,3-2)] + [hash(a)*pow(k,3-3)]
                            :    3*100 + 3*10 + 1*1
                            :    331
        hash(c a c): [ (  hash(c c a)-( hash(c)*pow(k,3-1) )  )*pow(k,3-2) + (hash(c)*pow(k,3-3))]
                            :    (331 - (3*100))*10 + (3*1)
                            :    313

If the value of m is very large it can be possible that it crosses the integer range, so we use % so that it falls in the range of integer. We can use % pow(2,31) or other dependents on the programming language. But it will cause Spurious hit, so it increases time complexity.
    Average: O(n-m+1)
    Worst:     O(m*n)

Comments

Popular posts from this blog

Insertion/Deletion in 2-3 Tree