slicehosts: Extract host-based traffic out of pcap dumps
During the course of my work on botnet security, we have had to deal with mammoth traffic traces captured at a local ISP. While analyzing the traffic, we needed to extract the traffic of specific hosts from large pcap files. An obvious solution would be to run tshark once for each host, filtering the traffic for that particular IP and writing it to a separate pcap file. However, with the number of hosts approaching the thousands and the pcap traces approaching terabytes in size, tshark didn’t really fit the bill: every host would mean another full pass over the trace.
Initially I thought of writing a splitter in Python, but my colleague’s aversion to using Python on large network traces, coupled with the lack of maintenance of the libpcap bindings, led me to use C and libpcap directly. The new C-based slicer is available at our GitHub repository. It needs glib to compile, though, as I needed a hash table implementation for maintaining the list of hosts that need to be sliced. The Makefile in the repository should take care of compiling with the appropriate flags.
As for performance, the speed of slicing is limited only by libpcap’s own read/write throughput, as most of the remaining work is done in constant time per packet. It took only 71 minutes (about 1.2 hours) to slice 1019 hosts out of a 180 GB pcap file on a 2.5 GHz CPU, which works out to roughly 42 MB/s. In simpler words, it’s lightning fast.
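To make the constant-time claim concrete, here is a minimal sketch of the per-packet check. It is not the code from the repository: the names host_set, host_set_add, host_set_has and match_frame are made up for illustration, a small open-addressing set stands in for the glib hash table, and it assumes plain Ethernet II frames carrying IPv4 without VLAN tags.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define ETH_HDR_LEN 14                 /* Ethernet II header, no VLAN tag (assumption) */
#define SET_BITS    12
#define SET_SLOTS   (1u << SET_BITS)   /* 4096 slots, comfortably above ~1000 hosts */

/* Hypothetical open-addressing set of IPv4 addresses in host byte order.
 * A slot value of 0 means "empty", so 0.0.0.0 cannot be stored. */
typedef struct {
    uint32_t slot[SET_SLOTS];
    size_t   count;
} host_set;

uint32_t hash_ip(uint32_t ip)
{
    /* Multiplicative hash; keep the top SET_BITS bits. */
    return (ip * 2654435761u) >> (32 - SET_BITS);
}

void host_set_init(host_set *s)
{
    memset(s, 0, sizeof *s);
}

/* Returns 0 on success, -1 if ip is 0 or the set is more than half full. */
int host_set_add(host_set *s, uint32_t ip)
{
    if (ip == 0 || s->count >= SET_SLOTS / 2)
        return -1;
    uint32_t i = hash_ip(ip);
    while (s->slot[i] != 0 && s->slot[i] != ip)
        i = (i + 1) & (SET_SLOTS - 1);     /* linear probing */
    if (s->slot[i] == 0) {
        s->slot[i] = ip;
        s->count++;
    }
    return 0;
}

int host_set_has(const host_set *s, uint32_t ip)
{
    uint32_t i = hash_ip(ip);
    while (s->slot[i] != 0) {
        if (s->slot[i] == ip)
            return 1;
        i = (i + 1) & (SET_SLOTS - 1);
    }
    return 0;
}

uint32_t load_be32(const uint8_t *p)
{
    return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
           ((uint32_t)p[2] << 8)  |  (uint32_t)p[3];
}

/* Writes the monitored addresses found in the frame (source first, then
 * destination) into out[] and returns how many there are: 0, 1 or 2. */
int match_frame(const host_set *s, const uint8_t *frame, size_t caplen,
                uint32_t out[2])
{
    if (caplen < ETH_HDR_LEN + 20)          /* too short for an IPv4 header */
        return 0;
    if (frame[12] != 0x08 || frame[13] != 0x00)   /* EtherType is not IPv4 */
        return 0;
    const uint8_t *ip = frame + ETH_HDR_LEN;
    if ((ip[0] >> 4) != 4)                  /* IP version field */
        return 0;

    int n = 0;
    uint32_t src = load_be32(ip + 12);
    uint32_t dst = load_be32(ip + 16);
    if (host_set_has(s, src))
        out[n++] = src;
    if (dst != src && host_set_has(s, dst))
        out[n++] = dst;
    return n;
}
```

In a real slicer, match_frame would be called on every packet handed over by libpcap, and each returned address would select the output file that the packet gets appended to. With the set kept at most half full, each lookup touches only a handful of slots, so the per-packet cost stays flat no matter how many hosts are being sliced.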
Right now the tool does its job well enough. If someone needs to package it, I’d prefer removing the glib dependency in favor of perhaps glibc’s own hash table implementation (search.h). In any case, I hope it proves helpful for other people playing with large pcap files.
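For anyone curious what the search.h route could look like, here is a rough sketch built on hcreate and hsearch. The helper names (hosts_init, hosts_add, hosts_lookup) and the choice of dotted-quad strings as keys are my own assumptions for illustration, not anything taken from the slicehosts source.

```c
#include <search.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define IP_KEY_LEN 16   /* "255.255.255.255" plus the terminating NUL */

/* Format an IPv4 address (host byte order) as a dotted-quad key. */
void ip_key(uint32_t ip, char key[IP_KEY_LEN])
{
    snprintf(key, IP_KEY_LEN, "%u.%u.%u.%u",
             (unsigned)(ip >> 24) & 0xffu, (unsigned)(ip >> 16) & 0xffu,
             (unsigned)(ip >> 8) & 0xffu,  (unsigned)ip & 0xffu);
}

/* hcreate() sets up one process-wide table; oversize it to keep probing short. */
int hosts_init(size_t expected_hosts)
{
    return hcreate(expected_hosts * 2) ? 0 : -1;
}

/* Register a host with an opaque per-host pointer (for example, an output
 * file handle). Returns 0 on success or if the host is already present. */
int hosts_add(uint32_t ip, void *state)
{
    char tmp[IP_KEY_LEN];
    ip_key(ip, tmp);

    ENTRY probe = { tmp, NULL };
    if (hsearch(probe, FIND) != NULL)
        return 0;

    /* hsearch() keeps the key pointer, so the key needs heap storage.
     * hdestroy() will not free it; the caller owns these buffers. */
    char *key = malloc(IP_KEY_LEN);
    if (key == NULL)
        return -1;
    memcpy(key, tmp, IP_KEY_LEN);

    ENTRY e = { key, state };
    if (hsearch(e, ENTER) == NULL) {
        free(key);
        return -1;
    }
    return 0;
}

/* Returns 1 and stores the per-host pointer if the host is known, else 0. */
int hosts_lookup(uint32_t ip, void **state_out)
{
    char tmp[IP_KEY_LEN];
    ip_key(ip, tmp);

    ENTRY probe = { tmp, NULL };
    ENTRY *hit = hsearch(probe, FIND);
    if (hit == NULL)
        return 0;
    if (state_out != NULL)
        *state_out = hit->data;
    return 1;
}
```

The trade-offs are visible even in this small sketch: hsearch only takes string keys, there is just one global table per process, and entries cannot be removed. For a fixed list of hosts loaded once at startup that is probably acceptable, but it is less flexible than the glib hash table.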