During our work on botnet security we had to deal with mammoth traffic traces captured at a local ISP. While analyzing the traffic we needed to extract the traffic of specific hosts from large pcap files. The obvious solution would be to run tshark once for each host, filtering the traffic for that particular IP and writing it to a separate pcap file. However, with the number of hosts approaching thousands and the pcap traces approaching terabytes in size, tshark didn’t really fit the bill, since every host would require another full pass over the trace.
Initially I thought of writing a splitter in Python, but my colleague’s aversion to using Python on large network traces, coupled with the lack of maintenance of Python’s libpcap bindings, led me to use C and libpcap directly. The new C-based slicer is available at our GitHub repository. It needs glib to compile, though, as I needed a hash table implementation for maintaining the list of hosts to be sliced. The Makefile in the repository should take care of compiling with the appropriate flags.
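To illustrate the idea of keeping the host list in a hash table, here is a minimal sketch. It is not the code from the repository: to stay dependency-free it uses POSIX hsearch from search.h as a stand-in for glib's hash table, and the host list, the per-host counters, and the function names host_table_init and host_lookup are all hypothetical. In a real slicer the stored value would more likely be a per-host output handle than a counter.

```c
#include <assert.h>
#include <search.h>
#include <stddef.h>

/* Hypothetical host list; a real slicer would read these from a file. */
#define N_HOSTS 3
static char wanted_hosts[N_HOSTS][16] = { "10.0.0.1", "10.0.0.7", "192.168.1.10" };

/* Per-host state (assumption: a packet counter standing in for an output handle). */
static unsigned long packet_counts[N_HOSTS];

/* Build the process-wide hsearch table. Returns 0 on success, -1 on failure. */
int host_table_init(void)
{
    /* Oversize the table a little; its capacity is fixed at hcreate() time. */
    if (hcreate(N_HOSTS * 2) == 0)
        return -1;

    for (size_t i = 0; i < N_HOSTS; i++) {
        ENTRY e;
        e.key = wanted_hosts[i];      /* hsearch keeps the pointer, not a copy */
        e.data = &packet_counts[i];
        if (hsearch(e, ENTER) == NULL) {
            hdestroy();
            return -1;
        }
    }
    return 0;
}

/* Return the state for a host of interest, or NULL if the host is not listed. */
unsigned long *host_lookup(const char *ip)
{
    assert(ip != NULL);

    ENTRY query;
    query.key = (char *)ip;           /* the key is only read for a FIND */
    query.data = NULL;

    ENTRY *found = hsearch(query, FIND);
    return found ? (unsigned long *)found->data : NULL;
}
```

A caller would run host_table_init() once, call host_lookup() for each address seen in the trace, and call hdestroy() at exit. Note that hcreate() manages a single table per process and never copies keys, so the key strings have to outlive the table.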
As for performance, the slicing speed is limited only by libpcap’s own read/write throughput, as most of the remaining work is done in constant time. It took only 71 minutes (about 1.2 hours) to slice 1019 hosts out of a 180 GB pcap file on a 2.5 GHz CPU, which works out to roughly 42 MB/s. In simpler words, it’s lightning fast.
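As a rough illustration of what the constant-time per-packet work could look like, the sketch below pulls the IPv4 source and destination addresses out of a raw frame, after which a single hash lookup per address would decide whether the packet belongs to a host of interest. This is an assumption about the approach, not the repository's code: the helper name extract_ipv4_endpoints is hypothetical, and it assumes plain Ethernet framing with no VLAN tags and IPv4 only.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define ETH_HEADER_LEN   14       /* dst MAC (6) + src MAC (6) + EtherType (2) */
#define ETHERTYPE_IPV4   0x0800
#define IPV4_MIN_HDR_LEN 20

/* Read a 32-bit big-endian value without assuming alignment. */
static uint32_t read_be32(const unsigned char *p)
{
    return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
           ((uint32_t)p[2] << 8)  |  (uint32_t)p[3];
}

/*
 * Hypothetical helper: extract the IPv4 source and destination addresses
 * (as host-order integers) from a captured Ethernet frame.
 * Returns 0 on success, -1 if the frame is too short or is not IPv4.
 */
int extract_ipv4_endpoints(const unsigned char *frame, size_t caplen,
                           uint32_t *src, uint32_t *dst)
{
    assert(frame != NULL && src != NULL && dst != NULL);

    if (caplen < ETH_HEADER_LEN + IPV4_MIN_HDR_LEN)
        return -1;

    /* EtherType sits in bytes 12..13 of the Ethernet header. */
    unsigned ethertype = ((unsigned)frame[12] << 8) | frame[13];
    if (ethertype != ETHERTYPE_IPV4)
        return -1;

    const unsigned char *ip = frame + ETH_HEADER_LEN;
    if ((ip[0] >> 4) != 4)        /* version nibble must be 4 */
        return -1;

    /* Source and destination live at fixed offsets 12 and 16 of the IP header. */
    *src = read_be32(ip + 12);
    *dst = read_be32(ip + 16);
    return 0;
}
```

Because both addresses sit at fixed offsets, the cost per packet does not depend on the number of hosts being sliced, which is consistent with throughput being bounded by reading and writing the trace.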
Right now the slicer does its job well enough. If someone needs to package it, I’d prefer removing the glib dependency in favor of perhaps glibc’s own hash table implementation (search.h). In any case, I hope it proves helpful for other people playing with large