Find your most used words in Pidgin logs with Python
Here’s a quick little script I wrote to tabulate word frequencies in Pidgin logs. To use it, point it at a contact’s log directory:
$ ./purple-stats.py /home/krkhan/.purple/logs/msn/krkhan\@inspirated.com/some.friend\@some.gmail.com
It then prints the words in descending order of frequency:
0: you (38)
1: it (30)
2: to (24)
3: the (22)
4: in (22)
5: lol (22)
6: of (22)
7: so (18)
8: is (18)
9: what (16)
...
As usual, Python was used for the dirty work:
#!/usr/bin/env python
# Python 2 script; needs BeautifulSoup 3 (the "BeautifulSoup" module).

from operator import itemgetter
from string import punctuation
import locale
import os
import sys

from BeautifulSoup import BeautifulSoup

if __name__ == "__main__":
    if len(sys.argv) < 2:
        print "usage:", sys.argv[0], "<logs directory>"
        sys.exit(1)  # nothing to do without a logs directory

    log_dir = sys.argv[1]
    # Only HTML logs (*.html) in the directory are considered.
    contents = [name for name in os.listdir(log_dir) if name.endswith('.html')]

    stats = {}

    for entry in contents:
        path = os.path.join(log_dir, entry)
        with open(path, 'r') as fd:
            data = fd.read()

        soup = BeautifulSoup(data, convertEntities=BeautifulSoup.ALL_ENTITIES)
        spans = soup.findAll('span')

        for span in spans:
            for word in span.text.split():
                # Normalise: drop surrounding punctuation, ignore case.
                word = word.strip(punctuation).lower()
                # Skip empty strings and single characters.
                if len(word) < 2:
                    continue
                stats[word] = stats.get(word, 0) + 1

    # Most frequent words first.
    sorted_stats = sorted(stats.iteritems(), key=itemgetter(1), reverse=True)

    for num, (word, count) in enumerate(sorted_stats):
        line = "%10d: %-10s (%d)" % (num, word, count)
        line = line.encode(locale.getpreferredencoding())
        print line
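
If you only have Python 3 at hand, the counting step can be approximated with the standard library alone. The following is a rough sketch, not a drop-in port of the script above: the names `SpanText` and `count_words` are made up for illustration, and it uses `html.parser` instead of BeautifulSoup, so text inside nested spans is counted once rather than once per enclosing span.

```python
from collections import Counter
from html.parser import HTMLParser
from string import punctuation


class SpanText(HTMLParser):
    """Hypothetical helper: collects text found inside <span> elements."""

    def __init__(self):
        # convert_charrefs=True turns entities such as &amp; into characters.
        super().__init__(convert_charrefs=True)
        self.depth = 0    # number of <span> tags currently open
        self.chunks = []  # text fragments seen while inside a span

    def handle_starttag(self, tag, attrs):
        if tag == "span":
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == "span" and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth > 0:
            self.chunks.append(data)


def count_words(html_text, counts=None):
    """Add the words of one HTML log to `counts` and return it."""
    counts = Counter() if counts is None else counts
    parser = SpanText()
    parser.feed(html_text)
    parser.close()
    for chunk in parser.chunks:
        for word in chunk.split():
            # Same normalisation as the script: strip punctuation, lowercase,
            # and ignore anything shorter than two characters.
            word = word.strip(punctuation).lower()
            if len(word) >= 2:
                counts[word] += 1
    return counts
```

To cover a whole log directory, call `count_words` once per `.html` file with the same `Counter`, then iterate over `counts.most_common()` to get the words in descending order of usage.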