Inspirated

 
 

June 27, 2010

Find your most used words in Pidgin logs with Python

Filed under: Blog — krkhan @ 12:08 am

Here’s a quick little script which I wrote to tabulate the word frequencies in Pidgin logs. Simple, you point it towards a contact’s log directory:

$ ./purple-stats.py /home/krkhan/.purple/logs/msn/krkhan\@inspirated.com/some.friend\@some.gmail.com

And it gives you the words in their descending order of usage:

         0: you        (38)
         1: it         (30)
         2: to         (24)
         3: the        (22)
         4: in         (22)
         5: lol        (22)
         6: of         (22)
         7: so         (18)
         8: is         (18)
         9: what       (16)
         ...

As usual, Python was used for the dirty work:

purple-stats.py

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
#!/usr/bin/env python
 
from operator import itemgetter
from string import punctuation
 
import locale
import os
import sys
 
from BeautifulSoup import BeautifulSoup
 
if __name__ == "__main__":
    if len(sys.argv) < 2:
        print "usage:", sys.argv[0], "<logs directory>"
    dir = sys.argv[1]
 
    contents = filter(lambda x: x[-5:] == '.html', os.listdir(dir))
    stats = {}
    for entry in contents:
        path = os.path.join(dir, entry)
        with open(path, 'r') as fd:
            data = fd.read()
        soup = BeautifulSoup(data,
            convertEntities=BeautifulSoup.ALL_ENTITIES)
        spans = soup.findAll('span')
        for span in spans:
            for word in span.text.split():
                word = word.strip(punctuation).lower()
                if len(word) < 2:
                    continue
                stats[word] = stats.get(word, 0) + 1
 
    sorted_stats = sorted(stats.iteritems(), key=itemgetter(1))
    sorted_stats.reverse()
    for num, (word, count) in enumerate(sorted_stats):
        line = "%10d: %-10s (%d)" % (num, word, count)
        line = line.encode(locale.getpreferredencoding())
        print line
Tags: , , , , , , , ,

No Comments »

No comments yet.

RSS feed for comments on this post. TrackBack URL

Leave a comment

One small verification for man, one giant PITA for bots: