Inspirated

 
 

June 19, 2010

Using Boyer-Moore-Horspool algorithm on file streams in Python

Filed under: Blog — krkhan @ 4:53 am

Horspool’s algorithm is a simple and efficient string-searching algorithm which trades space for time and performs better as length of search string is increased. Another (perhaps overlooked) advantage of this algorithm is its ability to search through stream files without requiring random access. As I was working on Launchpad for my SoC project I required this particular stream-handling attribute as the file descriptors opened by urllib2 didn’t support seek()ing. Modifying the example code from Wiki page a little, I was able to read() only the required bytes sequentially:

horspool.py

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
#!/usr/bin/env python
 
import locale
import os
import sys
import urllib2
 
def boyermoore_horspool(fd, needle):
    nlen = len(needle)
    nlast = nlen - 1
 
    skip = []
    for k in range(256):
        skip.append(nlen)
    for k in range(nlast):
        skip[ord(needle[k])] = nlast - k
    skip = tuple(skip)
 
    pos = 0
    consumed = 0
    haystack = bytes()
    while True:
        more = nlen - (consumed - pos)
        morebytes = fd.read(more)
        haystack = haystack[more:] + morebytes
 
        if len(morebytes) < more:
            return -1
        consumed = consumed + more
 
        i = nlast
        while i >= 0 and haystack[i] == needle[i]:
            i = i - 1
        if i == -1:
            return pos
 
        pos = pos + skip[ord(haystack[nlast])]
 
    return -1
 
if __name__ == "__main__":
    if len(sys.argv) < 3:
        print "Usage: horspool.py <url> <search text>"
        sys.exit(-1)
 
    url = sys.argv[1]
    needle = sys.argv[2]
    needle = needle.decode('string_escape')
 
    fd = urllib2.urlopen(url)
    offset = boyermoore_horspool(fd, needle)
    print hex(offset), '::', offset
    fd.close()

Now comes the fun part:

  • The code can search through any URL without downloading it completely, stopping at the first match. For example, the following command will download only the first few bytes of the provided URL:
    $ ./horspool.py http://www.gutenberg.org/files/132/132.txt "The Art of War"

    0x1d :: 29

  • Unicode searches work perfectly as well. Although the matching takes place according to the character encoding of the terminal used. That’s to say, since I’m using a UTF-8 terminal the “bytes” searched were assumed to be UTF-8 encoded as well:
    $ ./horspool.py http://www.gutenberg.org/files/29011/29011-0.txt "Σημείωση: Ο Πίνακας περιεχομένων"

    0x44f :: 1103

  • Same goes for multi-line searches:
    $ ./horspool.py http://www.gutenberg.org/files/29011/29011-0.txt "διευκόλυνση\r\nτου αναγνώστη"

    0x4b5 :: 1205

Tags: , , , , , , , , ,