UTF-8 | Inspirated

Using Boyer-Moore-Horspool algorithm on file streams in Python

Filed under: Blog — krkhan @ 4:53 am

Horspool’s algorithm is a simple and efficient string-searching algorithm which trades space for time and performs better as length of search string is increased. Another (perhaps overlooked) advantage of this algorithm is its ability to search through stream files without requiring random access. As I was working on Launchpad for my SoC project I required this particular stream-handling attribute as the file descriptors opened by urllib2 didn’t support seek()ing. Modifying the example code from Wiki page a little, I was able to read() only the required bytes sequentially:

horspool.py

#!/usr/bin/env python
 
import locale
import os
import sys
import urllib2
 
def boyermoore_horspool(fd, needle):
    nlen = len(needle)
    nlast = nlen - 1
 
    skip = []
    for k in range(256):
        skip.append(nlen)
    for k in range(nlast):
        skip[ord(needle[k])] = nlast - k
    skip = tuple(skip)
 
    pos = 0
    consumed = 0
    haystack = bytes()
    while True:
        more = nlen - (consumed - pos)
        morebytes = fd.read(more)
        haystack = haystack[more:] + morebytes
 
        if len(morebytes) < more:
            return -1
        consumed = consumed + more
 
        i = nlast
        while i >= 0 and haystack[i] == needle[i]:
            i = i - 1
        if i == -1:
            return pos
 
        pos = pos + skip[ord(haystack[nlast])]
 
    return -1
 
if __name__ == "__main__":
    if len(sys.argv) < 3:
        print "Usage: horspool.py <url> <search text>"
        sys.exit(-1)
 
    url = sys.argv[1]
    needle = sys.argv[2]
    needle = needle.decode('string_escape')
 
    fd = urllib2.urlopen(url)
    offset = boyermoore_horspool(fd, needle)
    print hex(offset), '::', offset
    fd.close()

Now comes the fun part:

The code can search through any URL without downloading it completely, stopping at the first match. For example, the following command will download only the first few bytes of the provided URL:
$ ./horspool.py http://www.gutenberg.org/files/132/132.txt "The Art of War"
$ ./horspool.py https://www.gutenberg.org/files/132/132.txt "The Art of War"
0x1d :: 29
Unicode searches work perfectly as well. Although the matching takes place according to the character encoding of the terminal used. That’s to say, since I’m using a UTF-8 terminal the “bytes” searched were assumed to be UTF-8 encoded as well:
$ ./horspool.py http://www.gutenberg.org/files/29011/29011-0.txt "Σημείωση: Ο Πίνακας περιεχομένων"
$ ./horspool.py https://www.gutenberg.org/files/29011/29011-0.txt "Σημείωση: Ο Πίνακας περιεχομένων"
0x44f :: 1103
Same goes for multi-line searches:
$ ./horspool.py http://www.gutenberg.org/files/29011/29011-0.txt "διευκόλυνση\r\nτου αναγνώστη"
$ ./horspool.py https://www.gutenberg.org/files/29011/29011-0.txt "διευκόλυνση\r\nτου αναγνώστη"
0x4b5 :: 1205

Tags: Algorithms, Boyer-Moore-Horspool, Code, File Handling, Flag 42, GSoC, Python, Search, Unicode, UTF-8

Comments (1)

wmartagu: Viewer/editor for WMA metadata info

Filed under: Blog — krkhan @ 11:05 am

As part of a Google Summer of Code application challenge, I was required to code an application for which the check-list was:

Accepts 1 WMA file name on the command line
Accepts -a for artist
Accepts -t for title
Removes the current tag from a WMA
Writes a new tag to the WMA file containing the new artist & title

And the bonus points:

The program does not obliterate the existing tag but merges the new information
The program is robust and does not destroy any of the test WMA files
The program does not emit any compile warnings during build
The program is UTF-8 clean

Consulting the ASF specification, my first attempt to code the application resulted in wmartag.c. However, since the last bonus point wasn’t present in the list at the time, the program didn’t support UTF-8 (and used hand-tweaked dirty functions for converting between ASCII and low-endian UTF-16). After I found out about the bonus goal, I reimplemented the program from scratch and renamed it to wmartagu. The new code relies on GLib for character code conversions which really does work like a charm:

[krkhan@orthanc wmartagu]$ make

make wmartagu
make[1]: Entering directory `/home/krkhan/Desktop/wmartagu’
cc -Wall -g `pkg-config –cflags –libs glib-2.0` wmartagu.c -o wmartagu
make[1]: Leaving directory `/home/krkhan/Desktop/wmartagu’

[krkhan@orthanc wmartagu]$ ./wmartagu O\ Fortuna.wma

Title: O Fortuna
Author: Carl Orff
Copyright:
Description:
Rating:

[krkhan@orthanc wmartagu]$ ./wmartagu -t ɹǝqɯǝʇdǝs O\ Fortuna.wma

Title: ɹǝqɯǝʇdǝs
Author: Carl Orff
Copyright:
Description:
Rating:
Tags updated successfully.

[krkhan@orthanc wmartagu]$ mplayer -v -v -v O\ Fortuna.wma

MPlayer SVN-r28461-4.3.2 (C) 2000-2009 MPlayer Team
CPU: Intel(R) Core(TM)2 CPU T5600 @ 1.83GHz (Family: 6, Model: 15, Stepping: 2)

Playing O Fortuna.wma.
ASF file format detected.
[asfheader] Audio stream found, -aid 1
Clip info:
name: ɹǝqɯǝʇdǝs
author: Carl Orff
===================================================
Opening audio decoder: [ffmpeg] FFmpeg/libavcodec audio decoders
AUDIO: 44100 Hz, 2 ch, s16le, 192.0 kbit/13.61% (ratio: 24002->176400)
Selected audio codec: [ffwmav2] afm: ffmpeg (DivX audio v2 (FFmpeg))
===================================================
AO: [pulse] 44100Hz 2ch s16le (2 bytes per sample)
Video: no video
Starting playback…
A: 3.3 (03.3) of 153.0 (02:33.0) 0.4%

MPlayer interrupted by signal 2 in module: play_audio
A: 3.3 (03.3) of 153.0 (02:33.0) 0.4%
Exiting… (Quit)

This does not go without saying that if you do try to tackle the UTF character encodings without using any libraries, may God have mercy on your soul, because Unicode won’t. Also, “wmartagu” simply means “WMA Artist/Title Tagger with UTF-8 support” and does not — as it would lead one to otherwise assume — represent any noble African philosophy such as Ubuntu.

Tags: ASF, Endianness, Low-Endian, Metadata, Multimedia, Open Source, Summer of Code, Tags, Technology, Unicode, UTF-16, UTF-16LE, UTF-8, WMA, wmartag, wmartagu

Comments Off