Beautiful Soup | Inspirated

Find your most used words in Pidgin logs with Python

Filed under: Blog — krkhan @ 12:08 am

Here’s a quick little script I wrote to tabulate word frequencies in Pidgin logs. It’s simple to use. You point it at a contact’s log directory:

$ ./purple-stats.py /home/krkhan/.purple/logs/msn/krkhan\@inspirated.com/some.friend\@some.gmail.com

And it lists the words in descending order of usage:

         0: you        (38)
         1: it         (30)
         2: to         (24)
         3: the        (22)
         4: in         (22)
         5: lol        (22)
         6: of         (22)
         7: so         (18)
         8: is         (18)
         9: what       (16)
         ...

As usual, Python was used for the dirty work:

purple-stats.py

#!/usr/bin/env python
 
from operator import itemgetter
from string import punctuation
 
import locale
import os
import sys
 
from BeautifulSoup import BeautifulSoup
 
if __name__ == "__main__":
    if len(sys.argv) < 2:
        print "usage:", sys.argv[0], "<logs directory>"
        sys.exit(1)  # nothing to scan without a directory argument
    dir = sys.argv[1]
 
    contents = filter(lambda x: x[-5:] == '.html', os.listdir(dir))
    stats = {}
    for entry in contents:
        path = os.path.join(dir, entry)
        with open(path, 'r') as fd:
            data = fd.read()
        soup = BeautifulSoup(data,
            convertEntities=BeautifulSoup.ALL_ENTITIES)
        spans = soup.findAll('span')
        for span in spans:
            for word in span.text.split():
                word = word.strip(punctuation).lower()
                if len(word) < 2:
                    continue
                stats[word] = stats.get(word, 0) + 1
 
    sorted_stats = sorted(stats.iteritems(), key=itemgetter(1))
    sorted_stats.reverse()
    for num, (word, count) in enumerate(sorted_stats):
        line = "%10d: %-10s (%d)" % (num, word, count)
        line = line.encode(locale.getpreferredencoding())
        print line
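
The script above is written for Python 2 (note the `print` statement and `iteritems()`). For anyone on Python 3, the same idea can be sketched with only the standard library. This is a rough port under assumptions, not the original script: `SpanText` and `word_counts` are names invented here for illustration, and `html.parser` with `collections.Counter` stands in for Beautiful Soup. One behavioral difference is that text inside nested spans is counted once rather than once per enclosing span.

```python
from collections import Counter
from html.parser import HTMLParser
from string import punctuation


class SpanText(HTMLParser):
    """Collect the text found inside <span> elements (hypothetical helper)."""

    def __init__(self):
        # convert_charrefs=True decodes entities such as &amp; for us.
        super().__init__(convert_charrefs=True)
        self.depth = 0      # number of <span> tags currently open
        self.chunks = []    # text pieces seen while inside a span

    def handle_starttag(self, tag, attrs):
        if tag == "span":
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == "span" and self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth:
            self.chunks.append(data)


def word_counts(html):
    """Return a Counter of lower-cased words (2+ characters) inside spans."""
    parser = SpanText()
    parser.feed(html)
    parser.close()
    counts = Counter()
    for chunk in parser.chunks:
        for word in chunk.split():
            word = word.strip(punctuation).lower()
            if len(word) >= 2:
                counts[word] += 1
    return counts


if __name__ == "__main__":
    # Made-up sample markup, only loosely shaped like a chat log.
    sample = "<span>lol, you said it</span> <b>skip</b> <span>You &amp; it</span>"
    for num, (word, count) in enumerate(word_counts(sample).most_common()):
        print("%10d: %-10s (%d)" % (num, word, count))
```

`Counter.most_common()` returns the pairs already sorted by descending count, so the separate sort-and-reverse step is not needed.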

Tags: Beautiful Soup, Code, IM, Logs, Open Source, Pidgin, Python, Statistics, Technology


Bookmark Undertaker v0.3 — Picking up the threads

Filed under: Blog — krkhan @ 8:09 pm

Threads are love. Threads are speed. And more often than not, threads are a consistent PITA. However, I had an accidental epiphany just a few hours ago:

“~~When in doubt~~ When you need to communicate among threads, use synchronized Queues.”

There. This magic mantra will solve more issues in your life than you can ever imagine, and certainly more than I expected.
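
To make the mantra concrete, here is a minimal Python 3 sketch of threads that talk only through synchronized queues. It is an illustration under assumptions, not Bookmark Undertaker's actual code: `run_in_threads`, `work` and `num_threads` are hypothetical names. (In Python 2 the module is spelled `Queue`; in Python 3 it is `queue`.)

```python
import queue
import threading

_DONE = object()  # unique sentinel telling a worker to stop


def run_in_threads(items, work, num_threads=4):
    """Apply work(item) to every item using worker threads.

    Workers never share plain lists: tasks arrive through one Queue
    and (item, result) pairs leave through another.
    """
    tasks = queue.Queue()
    results = queue.Queue()

    def worker():
        while True:
            item = tasks.get()
            if item is _DONE:
                break
            results.put((item, work(item)))

    threads = [threading.Thread(target=worker) for _ in range(num_threads)]
    for thread in threads:
        thread.start()
    for item in items:
        tasks.put(item)
    for _ in threads:
        tasks.put(_DONE)      # one sentinel per worker
    for thread in threads:
        thread.join()

    # Every worker has exited, so draining the queue here is race-free.
    collected = []
    while not results.empty():
        collected.append(results.get())
    return collected


if __name__ == "__main__":
    for item, result in sorted(run_in_threads(range(5), lambda n: n * n)):
        print(item, result)
```

Results come back in whatever order the workers finish, which is why the usage example sorts them. The sketch also assumes `work` does not raise; an exception would end that worker early and its remaining results would be missing.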

Getting back to the topic at hand, adding threading support to the program has sped up the bookmark checking process by a factor of about 435895234. Coupled with fixes for some parsing bugs, Bookmark Undertaker v0.3 is finally capable of providing a quick, stable and consistent way of sanitizing your Firefox favorites:

[Screenshot: Bookmark Undertaker v0.3]

This time, I’ve also tried to provide Deb and RPM packages on the release page for easy installation by the Debian/Ubuntu/Fedora populace.

Ushering in the era of communist applications:

“If everyone gives one thread, the poor person will have a shirt.” — Russian Proverb

Tags: Beautiful Soup, Bookmark Undertaker, Bookmarks, Code, Deb, Debian, Fedora, Firefox, Mozilla, Open Source, PyGTK, Python, Red Hat, RPM, Technology, Threading, Ubuntu


Bookmark Undertaker — Check your Firefox favorites for dead links

Filed under: Blog — krkhan @ 9:24 pm

No matter how hard you try to keep your browser bookmarks clean, they inevitably pile up, and one day you realize that you have no idea which links still work and which don’t. For Windows users, a small utility named AM-Deadlink comes to the rescue by checking the links for errors. Somehow, the utility lacked an alternative in the open-source world. This is where Bookmark Undertaker comes into the picture:

[Image: Bookmark Undertaker]

For the utility, I chose PyGTK as the UI toolkit. For parsing the bookmarks.html files exported from Firefox, I used Beautiful Soup. The latter, I must say, made my life a lot easier by cleverly sanitizing the insanity contained in Firefox’s exported favorites, staying true to the project tagline:

You didn’t write that awful page. You’re just trying to get some data out of it. Right now, you don’t really care what HTML is supposed to look like.

Neither does this parser.

And indeed it does not.
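
To give a feel for what pulling links out of an exported bookmarks file involves, here is a minimal Python 3 sketch. It is not the application's code and it does not use Beautiful Soup: it leans on the standard library's `html.parser`, which is also event-based and does not mind unclosed tags. `BookmarkParser`, `parse_bookmarks` and the `SAMPLE` markup are assumptions made up for illustration; the sample is only a simplified approximation of what Firefox writes.

```python
from html.parser import HTMLParser


class BookmarkParser(HTMLParser):
    """Collect (title, url) pairs from <A HREF="..."> elements."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.bookmarks = []
        self._href = None   # URL of the link currently being read, if any
        self._title = []    # text pieces seen inside that link

    def handle_starttag(self, tag, attrs):
        # HTMLParser lower-cases tag and attribute names before calling us.
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._title = []

    def handle_data(self, data):
        if self._href is not None:
            self._title.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.bookmarks.append(("".join(self._title).strip(), self._href))
            self._href = None


def parse_bookmarks(html):
    """Return a list of (title, url) tuples found in the markup."""
    parser = BookmarkParser()
    parser.feed(html)
    parser.close()
    return parser.bookmarks


# Hypothetical, simplified sample: note the unclosed <DT> and <p> tags.
SAMPLE = """<DL><p>
    <DT><H3>Reading</H3>
    <DL><p>
        <DT><A HREF="http://example.com/" ADD_DATE="1230000000">Example &amp; Co</A>
        <DT><A HREF="http://example.org/faq">FAQ</A>
    </DL><p>
</DL><p>"""

if __name__ == "__main__":
    for title, url in parse_bookmarks(SAMPLE):
        print(title, url)
```

Folder names, icons and other attributes are ignored here; a real importer would need to track them as well.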

For the time being, the application imports the bookmarks properly and displays their attributes, including the favorite icons. It then checks the linked URLs for errors in a separate thread and marks them as working or non-working accordingly. Exporting the bookmarks is next on the TODO list, and I may internationalize the application in the future as well.
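
The working/non-working decision for a single link can be sketched as follows in Python 3. This is an assumption about how such a check might look, not the application's implementation: `link_status` and its `timeout` parameter are hypothetical, and sending a `HEAD` request is a choice made here to avoid downloading page bodies.

```python
import http.client
import urllib.error
import urllib.request


def link_status(url, timeout=10):
    """Return "working" or "non-working" for a single URL."""
    try:
        request = urllib.request.Request(url, method="HEAD")
        # urlopen raises HTTPError for 4xx/5xx responses, so reaching the
        # body of the with-block means the server answered successfully.
        with urllib.request.urlopen(request, timeout=timeout):
            return "working"
    except (urllib.error.URLError, http.client.HTTPException,
            ValueError, OSError):
        # URLError covers HTTP errors and unreachable hosts, ValueError
        # covers malformed URLs, OSError covers timeouts and resets.
        return "non-working"


if __name__ == "__main__":
    print(link_status("not a url"))
```

Because each call blocks until the server answers or the timeout expires, running it off the UI thread keeps the interface responsive. Some servers reject `HEAD` requests outright, so a more forgiving checker might retry with `GET` before declaring a link dead.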

Time to purge those pesky outdated 404’s.

Tags: Beautiful Soup, Bookmark Undertaker, Bookmarks, Code, Firefox, Mozilla, Open Source, PyGTK, Python, Technology
