Using Boyer-Moore-Horspool algorithm on file streams in Python
Horspool’s algorithm is a simple and efficient string-searching algorithm which trades space for time and performs better as length of search string is increased. Another (perhaps overlooked) advantage of this algorithm is its ability to search through stream files without requiring random access. As I was working on Launchpad for my SoC project I required this particular stream-handling attribute as the file descriptors opened by urllib2
didn’t support seek()
ing. Modifying the example code from Wiki page a little, I was able to read()
only the required bytes sequentially:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 | #!/usr/bin/env python import locale import os import sys import urllib2 def boyermoore_horspool(fd, needle): nlen = len(needle) nlast = nlen - 1 skip = [] for k in range(256): skip.append(nlen) for k in range(nlast): skip[ord(needle[k])] = nlast - k skip = tuple(skip) pos = 0 consumed = 0 haystack = bytes() while True: more = nlen - (consumed - pos) morebytes = fd.read(more) haystack = haystack[more:] + morebytes if len(morebytes) < more: return -1 consumed = consumed + more i = nlast while i >= 0 and haystack[i] == needle[i]: i = i - 1 if i == -1: return pos pos = pos + skip[ord(haystack[nlast])] return -1 if __name__ == "__main__": if len(sys.argv) < 3: print "Usage: horspool.py <url> <search text>" sys.exit(-1) url = sys.argv[1] needle = sys.argv[2] needle = needle.decode('string_escape') fd = urllib2.urlopen(url) offset = boyermoore_horspool(fd, needle) print hex(offset), '::', offset fd.close() |
Now comes the fun part:
- The code can search through any URL without downloading it completely, stopping at the first match. For example, the following command will download only the first few bytes of the provided URL:
$ ./horspool.py http://www.gutenberg.org/files/132/132.txt "The Art of War"
0x1d :: 29
- Unicode searches work perfectly as well. Although the matching takes place according to the character encoding of the terminal used. That’s to say, since I’m using a UTF-8 terminal the “bytes” searched were assumed to be UTF-8 encoded as well:
$ ./horspool.py http://www.gutenberg.org/files/29011/29011-0.txt "Σημείωση: Ο Πίνακας περιεχομένων"
0x44f :: 1103
- Same goes for multi-line searches:
$ ./horspool.py http://www.gutenberg.org/files/29011/29011-0.txt "διευκόλυνση\r\nτου αναγνώστη"
0x4b5 :: 1205