Network Traffic Visualization
Mihalis Tsoukalos
In this article, I am going to present methods and ideas for visualizing your network traffic data. The tools I will use are the tcpdump utility, the tcpshow utility, the R statistics package, and the Perl programming language. I will also include two Perl scripts and a little interaction with the R package.
Why Visualize?
All systems and network administrators know how important it is to have a high-level view of their network traffic both for security and network monitoring purposes. As the flow of network data increases, it becomes more difficult to watch all your data, all the time. Therefore, you need a way to get a quick overview of your network data.
Some Theory Required
You probably already know that network traffic comes as a continuous stream of small packets. Each packet contains the necessary information so that it knows where it came from, where it is going, where the useful data is, etc. In this article, each individual packet is analyzed to get the required data.
This article also uses scatter plots. A scatter plot is a representation technique where bivariate data is illustrated as points in a plane with the values registered as coordinates. The scatter.smooth() R command, which is used in this article, plots a scatter plot of the data but also adds a smooth curve. The tcpdump utility -- used for capturing network data -- has many capabilities and command-line options, but you are not obligated to use them all. The simplest way to run tcpdump is as follows:
tcpdump -w output.dump
where it keeps saving its output in the output.dump file until you stop it by pressing Control-C. You can then use output.dump file the way you want. Type man tcpdump for more information about this tool.
Collecting Data
Before we can analyze any network data, we must first collect some network data. As we said before, network data is usually captured using the tcpdump utility. I think tcpdump is one of the most useful network troubleshooting tools (the other is Ethereal, which also uses tcpdump!). Please note that for security reasons, you must have permission to capture network data. To make this article more realistic, I used sample tcpdump data file from the Lincoln Laboratory at MIT (this data is part of the 1998 DARPA Intrusion Detection Evaluation Data Set), and I thank them for making it publicly available.
Tcpdump output is difficult to read and understand, so I used the tcpshow utility to preprocess tcpdump output file (./tcpshow < outside.tcpdump > tcpshow.data). Please note that the tcpshow output is a plain text file, whereas the tcpdump output is a binary file. After transforming the tcpdump data with tcpshow, I used a Perl script (Listing 1) to extract the information I wanted. Inside the Perl script, you can see a very small portion of the tcpshow.data file that will help you understand the general format of the tcpshow command output.
The output of the parse_tcpshow.pl script is as follows:
$ ./parse_tcpshow.pl tcpshow.data
* * * SOURCE_PORT_COUNT data
* * * DEST_PORT_COUNT data
* * * PACKET_TYPE_COUNT data
* * * PACKET_TIME
* * * TCP_PACKET_TIME
* * * SOURCE_PORT_TIME
* * * DEST_PORT_TIME
* * * SOURCE_DEST_PORT
* * * SOURCE_DEST_TIME
Beginning of Record: 6732
Records processed: 6732
Using R for Visualization
R is an advanced statistical package with many complex capabilities. This article uses a tiny portion of its capabilities that present no difficulties, so do not be afraid to try it.
SOURCE_PORT_TIME.data Visualization
This subsection shows the SOURCE_PORT_TIME.data file that you can get after executing parse_tcpshow.pl script. For R to read the data, you should give the following command:
> SOURCE_PORT_TIME <- read.table("/Users/mtsouk/data/
SOURCE_PORT_TIME.data", header=TRUE)
To plot them using a scatter plot, you must use the following command:
> scatter.smooth(SOURCE_PORT_TIME)
The output can be seen in Figure 1. You can see there that most of the requests are in high TCP source ports, which is explained by the way TCP works (a service serves its requests by forwarding them to other ports). Please note that the Time axis values are in minutes.
Note: For this plot, I manually changed the SOURCE_PORT_TIME.data file (using the vi editor), by replacing the http value with 80, the smtp value with 25, and the telnet value with 23.
DEST_PORT_TIME.data Visualization
We will follow the same procedure in R for the DEST_PORT_TIME.data file:
> DEST_PORT_TIME <- read.table("/Users/mtsouk/data/
DEST_PORT_TIME.data", header=TRUE)
> scatter.smooth(DEST_PORT_TIME)
This output can be seen in Figure 2. Figure 2 is very similar to Figure 1, which is explained by the fact that, in TCP, source and destination ports come in pairs.
SOURCE_DEST_PORT.data Visualization
For this section, I will apply the same procedure as before in order to visualize the SOURCE_DEST_PORT.data file:
> SOURCE_DEST <- read.table("/Users/mtsouk/data/
SOURCE_DEST_PORT.data", header=TRUE)
> scatter.smooth(SOURCE_DEST)
The output can be seen in Figure 3, which is quite symmetrical due to the pairing of source and destination ports. If you notice one day that this scatter plot differs significantly from its usual pattern, you should search for the reasons. A denial-of-service attack is one possible explanation.
Time, Source Port, and Destination Port Visualization
This time we will try something different using the TIME_SOURCE_DEST.data file. The commands that were given in R are the following:
> SOURCE_DEST_TIME <- read.table("/Users/mtsouk/data/
SOURCE_DEST_TIME.data", header=TRUE)
> pairs(SOURCE_DEST_TIME)
The pairs() command creates the output that can be seen in Figure 4.
Using Perl for Visualization
The main reason for using Perl, or some other scripting language that you already know, for a task like this is that you do not have to learn any complex mathematics software, such as R or Mathematica or MATLAB, in order to put logic inside your processing and analysis procedures. Perl has a very handy module, called GD, that makes it easy to create color drawings. GD is the Perl interface to Tomas Boutell's gd graphics library. For the purposes of this article, you will also need to install the GD::Graph Perl module that is used for creating the presented graphs.
Note: I could have integrated the Perl script found in Listing 2 into Listing 1. However, those two scripts are separated for reasons of clarity.
You should run the create_graphs.pl script (Listing 2) inside the directory that contains the data files. The script will create all its graphics files in PNG format. If for some reason the script does not create an image file, try changing the type of the relevant graph.
Visualizing DEST_PORT_COUNT.data
Figure 5 shows that most of the network traffic goes to the http port, which is not considered unusual. The other popular destination ports are port 23 (telnet) and port 25 (SMTP). If you observe something abnormal in your graph, you should try to find the reason.
Visualizing PACKET_TIME.data
Figure 6 shows the number of packets in respect to time. You can use it as a general overview of your network traffic.
Visualizing PACKET_TYPE_COUNT as a 3D Pie Chart
Figure 7 shows a 3D pie chart that pictures the types of network packets. You can see that most of your network traffic is in TCP format. You can change the parse_tcpshow.pl script to include other types of traffic or ignore certain types of traffic.
Visualizing SOURCE_PORT_COUNT.data
Figure 8 is related to Figure 5. It also shows that most of the network traffic comes from the http port; this is expected, as http-related TCP packets are nowadays the most popular type of TCP traffic. A sudden increase to SMTP traffic may be a result of a new virus or worm, so it's good to keep an eye on it.
Visualizing TCP_PACKET_TIME.data
Figure 9 shows the number of TCP packets per minute. This is a truly useful graph that can help every network and systems administrator. From such a graph, you can find the hours of peak usage or idle hours during which you can schedule network maintenance tasks.
Conclusions
This article provided some tips for network visualization, but the key to success is to experiment with your own network traffic and find the data and the graphs that are effective for you and your network. Visualizing your network data can be very beneficial, although it is a relatively difficult task. On the other hand, no matter what kind of network data you administer, the presented techniques can make your life easier.
References and Web Links
1998 DARPA Intrusion Detection Evaluation Data Set Overview -- http://www.ll.mit.edu/IST/ideval/data/1998/1998_data_index.html
Christiansen, Tom and Nathan Torkington. 2003. Perl Cookbook, 2nd Ed. O'Reilly.
GD Perl module -- http://search.cpan.org/dist/GD/
Mathematica home page -- http://www.wolfram.com/products/mathematica/index.html
MATLAB home page -- http://www.mathworks.com/products/matlab/
R Project home page -- http://www.r-project.org/
Scatter Plot -- http://en.wikipedia.org/wiki/Scatterplot
Tsoukalos, Mihalis. 2005. "Using the R System for Systems Administration." Sys Admin 14(1) 40-46.
Tcpdump file -- http://www.ll.mit.edu/IST/ideval/data/1998/ training/four_hours/tcpdump.gz
Venables, W.N. and Ripley, B.D. 2002. Modern Applied Statistics with S, 4th Ed., Springer Verlag.
Mihalis Tsoukalos lives in Greece with his wife, Eugenia, and works as a high school teacher. He holds a B.Sc. in Mathematics and a M.Sc. in IT from University College London. Before teaching, he worked as a Unix systems administrator and an Oracle DBA. Mihalis can be reached at: tsoukalos@sch.gr.
|