Inspirated

 
 

June 19, 2010

Using Boyer-Moore-Horspool algorithm on file streams in Python

Filed under: Blog — admin @ 4:53 am

Horspool’s algorithm is a simple and efficient string-searching algorithm which trades space for time and performs better as the length of the search string increases. Another (perhaps overlooked) advantage of this algorithm is its ability to search through streams without requiring random access. While working on Launchpad for my SoC project, I needed exactly this stream-handling property, as the file descriptors opened by urllib2 didn’t support seek()ing. Modifying the example code from the Wiki page a little, I was able to read() only the required bytes sequentially:

horspool.py

#!/usr/bin/env python

import sys
import urllib2

def boyermoore_horspool(fd, needle):
    nlen = len(needle)
    nlast = nlen - 1

    # Bad-character table: how far the window may shift for each byte value.
    skip = [nlen] * 256
    for k in range(nlast):
        skip[ord(needle[k])] = nlast - k
    skip = tuple(skip)

    pos = 0         # offset of the current window within the stream
    consumed = 0    # total number of bytes read so far
    haystack = bytes()
    while True:
        # Read only as many bytes as the window has advanced by.
        more = nlen - (consumed - pos)
        morebytes = fd.read(more)
        haystack = haystack[more:] + morebytes

        if len(morebytes) < more:
            return -1   # stream ended before a match was found
        consumed = consumed + more

        # Compare the window against the needle, right to left.
        i = nlast
        while i >= 0 and haystack[i] == needle[i]:
            i = i - 1
        if i == -1:
            return pos

        pos = pos + skip[ord(haystack[nlast])]

if __name__ == "__main__":
    if len(sys.argv) < 3:
        print "Usage: horspool.py <url> <search text>"
        sys.exit(-1)

    url = sys.argv[1]
    needle = sys.argv[2]
    # Turn escapes such as \r\n typed on the command line into real bytes.
    needle = needle.decode('string_escape')

    fd = urllib2.urlopen(url)
    offset = boyermoore_horspool(fd, needle)
    print hex(offset), '::', offset
    fd.close()

Now comes the fun part:

  • The code can search through any URL without downloading it completely, stopping at the first match. For example, the following command will download only the first few bytes of the provided URL:
    $ ./horspool.py http://www.gutenberg.org/files/132/132.txt "The Art of War"

    0x1d :: 29

  • Unicode searches work as well, although the matching takes place according to the character encoding of the terminal used. That is, since I’m using a UTF-8 terminal, the “bytes” searched for were assumed to be UTF-8 encoded as well:
    $ ./horspool.py http://www.gutenberg.org/files/29011/29011-0.txt "Σημείωση: Ο Πίνακας περιεχομένων"

    0x44f :: 1103

  • Same goes for multi-line searches:
    $ ./horspool.py http://www.gutenberg.org/files/29011/29011-0.txt "διευκόλυνση\r\nτου αναγνώστη"

    0x4b5 :: 1205
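The script above targets Python 2; urllib2 and the string_escape codec are not available in Python 3. As a rough sketch of the same idea for Python 3, the function below searches any binary file-like object that offers read(). The names horspool_stream and read_exactly are made up for this illustration, and the read loop assumes that a stream may return fewer bytes than requested before it is actually exhausted.

```python
import io


def read_exactly(stream, n):
    """Read up to n bytes, looping because read() may return short chunks."""
    parts = []
    while n > 0:
        part = stream.read(n)
        if not part:          # an empty read means end of stream
            break
        parts.append(part)
        n -= len(part)
    return b"".join(parts)


def horspool_stream(stream, needle):
    """Return the byte offset of the first match of needle, or -1."""
    nlen = len(needle)
    if nlen == 0:
        return 0
    nlast = nlen - 1

    # Bad-character table, indexed by byte value.
    skip = [nlen] * 256
    for k in range(nlast):
        skip[needle[k]] = nlast - k

    pos = 0          # offset of the current window in the stream
    window = b""     # the nlen bytes currently being compared
    while True:
        more = nlen - len(window)
        chunk = read_exactly(stream, more)
        if len(chunk) < more:
            return -1            # stream ended before a match
        window += chunk
        if window == needle:
            return pos
        shift = skip[window[-1]]  # shift decided by the window's last byte
        pos += shift
        window = window[shift:]


if __name__ == "__main__":
    demo = io.BytesIO(b"xxThe Art of Warzzzzzz")
    print(horspool_stream(demo, b"The Art of War"), demo.tell())
```

Because the window is refilled only by the amount it was shifted, the stream is never read past the end of the first match. A text needle has to be encoded first, for example "Σημείωση".encode("utf-8"), which makes the encoding assumption from the second bullet explicit.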


July 27, 2008

At least I wasn’t even close

Filed under: Blog — admin @ 4:35 pm

“I think I have a problem with that silver medal.
I think, if I was an Olympic athlete, I would rather come in last than win the silver.
If you think about it… if you win the gold, you feel good.
If you win the bronze, you think: “Well, at least I got something.”
But if you win that silver, it’s like:
“Congratulations! You… almost won.”
“Of all the losers, you came in first of that group.”
“You’re the number one… loser.”
“No one lost… ahead of you.”” — Jerry Seinfeld

I had two chances for progressing in Code Jam by qualifying in one of the two sub-rounds assigned to me. Both sub-rounds had three problems A, B & C ordered in increasing difficulty level. Here’s a quick summary of both:

  • For the first sub-round (1A), I did solve problem A and got 15 points. However, I didn’t solve it quickly enough. I left problem B untouched. Problem C’s small input contained only 29 cases and could very well have been solved using only a scientific calculator. However, that just didn’t seem the right way of progressing (30 points would have been enough to qualify, but I’d definitely have failed at the subsequent rounds).
  • For the second sub-round (1C), apart from connectivity issues, I couldn’t provide correct output for any of the problems. I did solve problem B, but as my solution was recursive, it took too much time to calculate the output. I’m not that good at refactoring recursive solutions into iterative ones, so my chances got blown right away.

Overall, Code Jam was pretty fun, and having fun was the sole aim of participating this year. Now I’ve got to start reading the Introduction to Algorithms book and get myself formally acquainted with algorithmic problem solving. Good luck to all the gurus who did progress (seeing some of them solve all 3 problems within half the time was amazing). They thoroughly deserve it, and I’ll keep monitoring later rounds as a spectator; just reading through their ingenious solutions is nothing short of a delightful experience.


July 18, 2008

Jam and Geometry

Filed under: Blog — admin @ 5:24 am

The scores for the Google Code Jam qualification round are out. It lasted 24 hours, and the participants were allowed to enter at any time and try to solve any of the three given problems. Each problem had one small and one large input set. Participants were able to check during the qualification whether their programs produced correct results on the small input sets but had to wait for the round to finish to know whether correct outputs were produced on the large ones.

Correct solutions for small and large input sets were worth 5 and 20 points respectively. To progress to Online Round 1, each participant needed to score at least 25 points. Participants were ranked based on the times of their correct submissions and their wrong submissions. And what I actually did not know was that the timer started ticking with the qualification kick-off. This means that if someone slept through the earlier hours (or watched the final scenes of One Flew Over the Cuckoo’s Nest again, like me), he’d be ranked lower even though he may have solved the problem within half an hour of viewing it.

Anyways, since points were what mattered the most and not the rankings, I actually started off with the problem set 2-3 hours after the qualification had started. Participants were provided with the following three problems:

  1. Saving the Universe

    The urban legend goes that if you go to the Google homepage and search for “Google”, the universe will implode. We have a secret to share… It is true! Please don’t try it, or tell anyone. All right, maybe not. We are just kidding.

    The same is not true for a universe far far away. In that universe, if you search on any search engine for that search engine’s name, the universe does implode!

    To combat this, people came up with an interesting solution. All queries are pooled together. They are passed to a central system that decides which query goes to which search engine. The central system sends a series of queries to one search engine, and can switch to another at any time. Queries must be processed in the order they’re received. The central system must never send a query to a search engine whose name matches the query. In order to reduce costs, the number of switches should be minimized.

    Your task is to tell us how many times the central system will have to switch between search engines, assuming that we program it optimally.

    I solved the problem using a vector of strings in STL. It took me around 35-40 minutes. My entry for the small input set was judged to be correct on my first attempt.

  2. Train Timetable

    A train line has two stations on it, A and B. Trains can take trips from A to B or from B to A multiple times during a day. When a train arrives at B from A (or arrives at A from B), it needs a certain amount of time before it is ready to take the return journey – this is the turnaround time. For example, if a train arrives at 12:00 and the turnaround time is 0 minutes, it can leave immediately, at 12:00.

    A train timetable specifies departure and arrival time of all trips between A and B. The train company needs to know how many trains have to start the day at A and B in order to make the timetable work: whenever a train is supposed to leave A or B, there must actually be one there ready to go. There are passing sections on the track, so trains don’t necessarily arrive in the same order that they leave. Trains may not travel on trips that do not appear on the schedule.

    This was actually easier than problem A, as I only had to use a simple multimap and a vector to hold the departure/arrival times in minutes and then loop through the day and manage the trains. I had 2 wrong attempts on the small input set though, caused by the fact that I initially started solving the problem with a map instead of a multimap, which imposed a limit of only one train departing from a station at a given instant.

  3. Fly Swatter

    What are your chances of hitting a fly with a tennis racquet?

    To start with, ignore the racquet’s handle. Assume the racquet is a perfect ring, of outer radius R and thickness t (so the inner radius of the ring is R−t).

    The ring is covered with horizontal and vertical strings. Each string is a cylinder of radius r. Each string is a chord of the ring (a straight line connecting two points of the circle). There is a gap of length g between neighbouring strings. The strings are symmetric with respect to the center of the racquet i.e. there is a pair of strings whose centers meet at the center of the ring.

    The fly is a sphere of radius f. Assume that the racquet is moving in a straight line perpendicular to the plane of the ring. Assume also that the fly’s center is inside the outer radius of the racquet and is equally likely to be anywhere within that radius. Any overlap between the fly and the racquet (the ring or a string) counts as a hit.

    This is where I got stuck, and stuck bad. This problem had more to do with Euclidean Geometry than with data structures, STL or structured programming, and I know this much about Euclidean Geometry: chapter 6 from my higher-secondary school Mathematics book was titled “Conic Sections”. Naturally, my first resort was to try and find some library which would address my particular problems (using free library code is allowed in Code Jam). More specifically, I wanted a library that would allow me to calculate the area of intersection between a circle and a rectangle (so that I’d gradually subtract out the racquet strings’ area in calculations). The result wasn’t much different from my Euclidean Geometry knowledge, as now I know this too: there’s a GPL library called GLAC which does geometry stuff. To summarize, I was unsuccessful in solving this problem. Maybe I’ll need to familiarize myself with GLAC before the next round to have a good shot at progressing.
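For what it’s worth, the Fly Swatter probability can at least be approximated without any geometry library. The sketch below is a Monte Carlo estimate, not the exact area computation that a judged solution would need, and it is not anything I submitted. The function names are made up for the illustration, and it assumes the layout described in the problem text: one string centred on each axis through the centre of the ring, with the pattern repeating every g + 2r.

```python
import math
import random


def fly_is_hit(x, y, f, R, t, r, g):
    """True if a fly of radius f centred at (x, y) touches ring or strings."""
    # The fly overlaps the ring if it reaches past the inner radius R - t.
    if math.hypot(x, y) + f > R - t:
        return True
    period = g + 2 * r            # one string plus one gap
    for coord in (abs(x), abs(y)):
        offset = coord % period   # distance past the previous string centre
        # The fly's centre is safe only inside [r + f, r + g - f].
        if offset < r + f or offset > r + g - f:
            return True
    return False


def estimate_hit_probability(f, R, t, r, g, samples=100000, seed=0):
    """Estimate P(hit) by sampling fly centres uniformly in the outer disc."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(samples):
        # Rejection sampling: draw from the bounding square until inside.
        while True:
            x = rng.uniform(-R, R)
            y = rng.uniform(-R, R)
            if x * x + y * y <= R * R:
                break
        if fly_is_hit(x, y, f, R, t, r, g):
            hits += 1
    return hits / samples


if __name__ == "__main__":
    print(estimate_hit_probability(f=0.25, R=1.0, t=0.1, r=0.01, g=0.9))
```

For f=0.25, R=1.0, t=0.1, r=0.01, g=0.9 the estimate should land near 0.91. Sampling converges far too slowly to reach the precision a contest judge expects, so this is only useful as a sanity check against an exact area-based solution.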

One of the advantages of using STL is that if your program is correct on small inputs, i.e., the logic is applied correctly, there’s little chance that things shall take the unfortunate route for the large ones as larger data structures are accommodated dynamically. Consequently, my solutions for A and B were later on judged as correct for the large input sets too. This gave me 50 points, and 1319th rank among the 7154 participants (I wish I had known that wasting time earlier on means a drop in my ranks, but all’s well that ends well).
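To make the logic behind problems A and B concrete, here is a compact Python sketch of one way to solve each. It is not my submitted C++ code: the function names and the data layout (lists of names, and (departure, arrival) pairs in minutes) are assumptions made for this illustration. Problem A is handled greedily by switching only when every engine name has appeared since the last switch. Problem B replays the trips in departure order and reuses a train whenever one is ready at that station.

```python
import heapq


def min_switches(engines, queries):
    """Saving the Universe: fewest switches, computed greedily."""
    names = set(engines)
    seen = set()          # engine names queried since the last switch
    switches = 0
    for query in queries:
        if query not in names:
            continue      # such a query can go to any engine
        seen.add(query)
        if len(seen) == len(names):
            # No single engine could have served this whole stretch,
            # so switch just before this query and start a new stretch.
            switches += 1
            seen = {query}
    return switches


def trains_needed(a_to_b, b_to_a, turnaround):
    """Train Timetable: (trains starting at A, trains starting at B)."""
    trips = [(dep, arr, 0) for dep, arr in a_to_b]
    trips += [(dep, arr, 1) for dep, arr in b_to_a]
    trips.sort()                      # process by departure time
    ready = ([], [])                  # per station: heap of ready times
    started = [0, 0]                  # trains that must start the day there
    for dep, arr, origin in trips:
        if ready[origin] and ready[origin][0] <= dep:
            heapq.heappop(ready[origin])       # reuse a waiting train
        else:
            started[origin] += 1               # nothing ready: new train
        heapq.heappush(ready[1 - origin], arr + turnaround)
    return tuple(started)


if __name__ == "__main__":
    print(min_switches(["A", "B"], ["A", "B", "A"]))
    print(trains_needed([(540, 600)], [(600, 660)], 0))
```

Keeping a heap of ready times per station allows several trains to depart at the same instant, which is exactly the case a plain map keyed by time cannot represent.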

The Online Round 1 takes place in another week or so. I started solving algorithmic problems only a fortnight ago so I think I’ll need some more practice to be able to compete properly. To be fair though, I didn’t have high hopes for even the qualification round, as I had entered just for fun and some experience so that I’d be able to contend properly next year — after I’ve had some proper and extensive practice with this kind of problem-solving.


July 11, 2008

Programming Challenges: Australian Voting

Filed under: Blog — admin @ 10:44 am

PC/UVa IDs: 110108/10142, Popularity: B, Success rate: low, Level: 1

I had 28 (that’s twenty-eight) “Time Limit Exceeded” tries on this problem. I was looking everywhere for a loop that would drag, or for any possible input that would cause a breakdown, and I couldn’t find any such thing. So much so that I considered labeling this post as Australian Nightmare. The issue, once I found it, had nothing to do with Australia and everything to do with the string and istringstream objects being too expensive to construct and destruct on each iteration of the loop on lines #138-153.

The book recommends further reading on voting systems and has linked to some mathematical theorem that proves that no voting system can ever be perfect. Considering the democratic results around the world in recent elections, yeah, what an absolute shocker.
(more…)


July 2, 2008

Programming Challenges: Check the Check

Filed under: Blog — admin @ 7:55 pm

PC/UVa IDs: 110107/10196, Popularity: B, Success rate: average, Level: 1

I had two approaches in mind for checking the check before solving this problem:

  • Return a flag value from the move-generation functions as soon as the opponent’s king is encountered in a reachable square.
  • Generate all reachable squares for a side first, and then check whether opponent’s king is positioned on one.

I opted for the latter because even though it was more performance-intensive, it made my move-generation functions more generic and appropriate for extensibility.
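The second approach can be illustrated with a deliberately cut-down Python sketch that handles only rooks and knights. It is not the solution from this post. The board format (eight strings, uppercase for white, lowercase for black, '.' for empty) and all function names are assumptions made for the example.

```python
KNIGHT_JUMPS = [(-2, -1), (-2, 1), (-1, -2), (-1, 2),
                (1, -2), (1, 2), (2, -1), (2, 1)]
ROOK_DIRS = [(-1, 0), (1, 0), (0, -1), (0, 1)]


def on_board(r, c):
    return 0 <= r < 8 and 0 <= c < 8


def reachable_squares(board, white):
    """All squares the given side's rooks and knights can move to or attack."""
    squares = set()
    for r, row in enumerate(board):
        for c, piece in enumerate(row):
            if piece == '.' or piece.isupper() != white:
                continue              # empty square or the other side's piece
            kind = piece.lower()
            if kind == 'n':
                for dr, dc in KNIGHT_JUMPS:
                    if on_board(r + dr, c + dc):
                        squares.add((r + dr, c + dc))
            elif kind == 'r':
                for dr, dc in ROOK_DIRS:
                    rr, cc = r + dr, c + dc
                    while on_board(rr, cc):
                        squares.add((rr, cc))
                        if board[rr][cc] != '.':
                            break     # the first piece met blocks the ray
                        rr += dr
                        cc += dc
    return squares


def in_check(board, white_king):
    """True if the chosen king stands on a square the opponent reaches."""
    king = 'K' if white_king else 'k'
    for r, row in enumerate(board):
        c = row.find(king)
        if c != -1:
            return (r, c) in reachable_squares(board, white=not white_king)
    return False                      # no such king on the board


if __name__ == "__main__":
    board = ["r...K..."] + ["........"] * 7
    print(in_check(board, white_king=True))
```

Generating the full set costs more than returning at the first king sighting, but the same reachable_squares function could later serve move listing or square-control questions, which is the extensibility argument made above.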
(more…)


July 1, 2008

Programming Challenges: Interpreter

Filed under: Blog — admin @ 8:38 am

PC/UVa IDs: 110106/10033, Popularity: B, Success rate: low, Level: 2

Nothing extraordinarily interesting here — typical straight-outta-the-book exercise.
(more…)


Programming Challenges: Graphical Editor

Filed under: Blog — admin @ 1:31 am

PC/UVa IDs: 110105/10267, Popularity: B, Success rate: low, Level: 1

A few notes:

  • In the problem input, pixels are specified as [column# row#], whereas two-dimensional vectors (or arrays) are referenced using [row# column#] format.
  • The program would be doomed to infinite recursion if the condition on line #51 is omitted.
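Both notes show up in even a tiny flood-fill sketch. The Python version below is an illustration rather than the code from this post: the function name is made up, it assumes the fill spreads to pixels that share a side, and its early return is just one common guard of this kind, which may or may not be the condition on line #51. It uses an explicit stack instead of recursion, but without the guard it would loop forever in the same way.

```python
def fill(image, x, y, colour):
    """Fill the region containing pixel (x, y) with colour.

    x is the column and y is the row, both 1-based as in the problem
    input; image is a list of rows, each a list of one-character strings.
    """
    row, col = y - 1, x - 1           # swap into [row][column] order
    old = image[row][col]
    if old == colour:
        return                        # guard: otherwise no pixel ever changes
    stack = [(row, col)]
    while stack:
        r, c = stack.pop()
        if not (0 <= r < len(image) and 0 <= c < len(image[0])):
            continue                  # off the canvas
        if image[r][c] != old:
            continue                  # different colour or already filled
        image[r][c] = colour
        stack.extend([(r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)])


if __name__ == "__main__":
    img = [list("OOO"), list("OXO"), list("OOO")]
    fill(img, 2, 1, "J")              # column 2, row 1
    print(["".join(line) for line in img])
```

The guard matters because filling a region with the colour it already has never changes a pixel, so every neighbour keeps matching the old colour and is visited again and again.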

(more…)


June 30, 2008

Programming Challenges: LCD Display

Filed under: Blog — admin @ 8:35 am

PC/UVa IDs: 110104/706, Popularity: A, Success rate: average, Level: 1

Two-dimensional vectors again, with an array of function pointers to construct the digits’ appearance.
(more…)
