Using Boyer-Moore-Horspool algorithm on file streams in Python

Filed under: Blog — krkhan @ 4:53 am

Horspool’s algorithm is a simple and efficient string-searching algorithm that trades space for time and performs better as the search string gets longer. Another (perhaps overlooked) advantage is that it can search a stream sequentially, without requiring random access. While working on Launchpad for my SoC project I needed exactly this property, because the file-like objects returned by urllib2 don’t support seek(). By slightly modifying the example code from the Wiki page, I was able to read() only the required bytes sequentially:

horspool.py

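#!/usr/bin/env python
# Written for Python 2 (urllib2, print statement, 'string_escape' codec).

import sys
import urllib2

def boyermoore_horspool(fd, needle):
    nlen = len(needle)
    nlast = nlen - 1

    # Bad-character table: how far the window may shift, indexed by the
    # byte currently aligned with the needle's last position.
    skip = [nlen] * 256
    for k in range(nlast):
        skip[ord(needle[k])] = nlast - k
    skip = tuple(skip)

    pos = 0             # offset of the current window within the stream
    consumed = 0        # total number of bytes read from fd so far
    haystack = bytes()  # the current window, nlen bytes once filled
    while True:
        # Read just enough bytes to refill the window after the last shift.
        more = nlen - (consumed - pos)
        morebytes = fd.read(more)
        haystack = haystack[more:] + morebytes

        # The stream ended before the window could be filled: no match.
        if len(morebytes) < more:
            return -1
        consumed = consumed + more

        # Compare the window with the needle from right to left.
        i = nlast
        while i >= 0 and haystack[i] == needle[i]:
            i = i - 1
        if i == -1:
            return pos

        # Shift by the table entry for the window's last byte.
        pos = pos + skip[ord(haystack[nlast])]

if __name__ == "__main__":
    if len(sys.argv) < 3:
        print "Usage: horspool.py <url> <search text>"
        sys.exit(-1)

    url = sys.argv[1]
    needle = sys.argv[2]
    # Turn escapes such as \r\n in the argument into the bytes they denote.
    needle = needle.decode('string_escape')

    fd = urllib2.urlopen(url)
    offset = boyermoore_horspool(fd, needle)
    print hex(offset), '::', offset
    fd.close()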
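
For readers on Python 3, where urllib2 no longer exists, the same idea can be sketched roughly as follows. This is only an illustration under a few assumptions, not the script above: the names horspool_stream and read_exactly are made up for this sketch, the stream is assumed to be any binary file-like object with a read() method, and the window is compared with == rather than with an explicit right-to-left loop.

```python
import io


def read_exactly(stream, n):
    """Hypothetical helper: read n bytes, or fewer only at end of stream."""
    parts = []
    while n > 0:
        part = stream.read(n)
        if not part:
            break
        parts.append(part)
        n -= len(part)
    return b"".join(parts)


def horspool_stream(stream, needle):
    """Return the byte offset of the first match of needle, or -1."""
    nlen = len(needle)
    if nlen == 0:
        return 0
    nlast = nlen - 1

    # Bad-character table, indexed by the window's last byte.
    # Indexing bytes yields ints in Python 3, so no ord() is needed.
    skip = [nlen] * 256
    for k in range(nlast):
        skip[needle[k]] = nlast - k

    pos = 0       # offset of the current window within the stream
    window = b""  # the last nlen bytes seen
    need = nlen   # bytes required to refill the window
    while True:
        chunk = read_exactly(stream, need)
        if len(chunk) < need:
            return -1  # stream ended before the window was full
        window = window[need:] + chunk
        if window == needle:
            return pos
        need = skip[window[nlast]]
        pos += need


if __name__ == "__main__":
    demo = io.BytesIO(b"hello world")
    print(horspool_stream(demo, b"world"))  # 6
```

Because only read() is called, the same function should work on any stream that cannot seek, and reading stops at the end of the first matching window.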

Now comes the fun part:

The code can search through any URL without reading it completely, since it stops at the first match. For example, the following command reads only the first few dozen bytes of the response:
$ ./horspool.py https://www.gutenberg.org/files/132/132.txt "The Art of War"
0x1d :: 29
Unicode searches work as well, although the matching is done on bytes and therefore depends on the character encoding of the terminal used. That’s to say, since I’m using a UTF-8 terminal, the bytes of the search text are UTF-8 encoded, and they match only because the document is UTF-8 encoded too. The reported offset is a byte offset, not a character offset:
$ ./horspool.py https://www.gutenberg.org/files/29011/29011-0.txt "Σημείωση: Ο Πίνακας περιεχομένων"
0x44f :: 1103
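
To make the encoding point concrete, here is a small Python 3 sketch. It uses the built-in bytes.find instead of the Horspool search, and the sample strings and the choice of ISO-8859-7 as the "other" encoding are assumptions made purely for illustration.

```python
text = "Σημείωση"

# The same characters become different bytes in different encodings.
needle_utf8 = text.encode("utf-8")
needle_greek = text.encode("iso-8859-7")

# A tiny stand-in for a UTF-8 encoded document.
document_text = "αβγ " + text
document = document_text.encode("utf-8")

byte_offset = document.find(needle_utf8)  # offset in bytes
char_offset = document_text.find(text)    # offset in characters
missing = document.find(needle_greek)     # wrong encoding: no match

print(len(needle_utf8), len(needle_greek))  # 16 8
print(byte_offset, char_offset)             # 7 4
print(missing)                              # -1
```

The byte offset is larger than the character offset because each Greek letter occupies two bytes in UTF-8.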
Multi-line searches work too, since escape sequences such as \r\n in the search text are decoded before matching:
$ ./horspool.py https://www.gutenberg.org/files/29011/29011-0.txt "διευκόλυνση\r\nτου αναγνώστη"
0x4b5 :: 1205
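
The 'string_escape' codec used by the script exists only in Python 2. A rough Python 3 counterpart might look like the sketch below. The function name decode_needle is hypothetical, and the sketch assumes that the needle should end up UTF-8 encoded. Unlike 'string_escape', it treats \xNN as a code point rather than a raw byte, so the two are not exact equivalents.

```python
def decode_needle(arg):
    """Expand backslash escapes in arg and return UTF-8 bytes (assumed)."""
    # Non-Latin-1 characters are rewritten as \uXXXX escapes first, so
    # that 'unicode_escape' can restore them along with \r, \n, \t, etc.
    escaped = arg.encode("latin-1", "backslashreplace")
    return escaped.decode("unicode_escape").encode("utf-8")


if __name__ == "__main__":
    # The shell passes backslash-r backslash-n through literally.
    needle = decode_needle("διευκόλυνση\\r\\nτου αναγνώστη")
    print(needle == "διευκόλυνση\r\nτου αναγνώστη".encode("utf-8"))  # True
```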

Tags: Algorithms, Boyer-Moore-Horspool, Code, File Handling, Flag 42, GSoC, Python, Search, Unicode, UTF-8
