Article Listing 1 Listing 2 jun2007.tar

Using a MySQL Database to Store Network Data

Mihalis Tsoukalos

In this article, I will describe how to store network data in a MySQL database as well as how to take advantage of the SQL language to query stored data. I will also present some handy SQL queries for use with the stored network data. To begin, let's look at the usefulness of a database for storing network data. I've listed some advantages and disadvantages below. And, although I selected the MySQL DBMS, you can use PostgreSQL, Oracle, DB2, FileMaker, or your favorite DBMS.

Advantages:

  • You can query your data offline.
  • Databases are storing their data optimized, so they are quicker than pure disk I/O.
  • You can embed intelligence in your data by using a database.
  • You can distribute your data in many databases.
  • You can automatically use the reporting tools that support your database.
  • If you already know SQL, querying your network data is easy.
  • A remote user can easily access your data using the network.
  • You can use OLAP tools, which are very efficient for massive data.

Disadvantages:

  • May take too much space, especially if you have a busy network.
  • You need to know how to use a database.
  • You need to learn SQL if you do not already know it.
  • You need to know how to administer your database.

Why Use MySQL?

I chose MySQL for several reasons. First of all, MySQL is free. Second, it is easy to learn and use. Third, it is light, which means it does not demand huge computer resources. Other reasons include the plethora and quality of the available documentation and the fact that MySQL can run on almost every available platform. This means that the little MySQL knowledge that you will get your hands on in this article can be used in every platform for which MySQL is ported.

Collecting the Network Data

Before being able to use any network data, we must first collect network data. Network data is usually captured using the handy tcpdump utility. Tcpdump output is difficult to read, so I used the tcpshow utility to preprocess the original tcpdump file (./tcpshow < outside.tcpdump > tcpshow.data). Please note that the tcpshow output is plain text, whereas the tcpdump output is usually in binary format. After transforming the tcpdump data with tcpshow, I used the Perl script shown in Listing 1 to extract the information I wanted.

This section will also present another Perl script that is going to be used for both reading the network data from an external file and inserting the data into the MySQL DB.

Note: For the Perl script to run, you also need to install the DBI and DBD::mysql Perl modules. I prefer to install Perl modules using the CPAN Perl module. I will not describe the procedure here, but you will find more information about using the CPAN Perl module in the references section at the end of the article.

The description of the NetData MySQL table that holds the network data is the following:

mysql> desc NetData;

+------------+-------------+------+-----+---------+-------+
| Field      | Type        | Null | Key | Default | Extra |
+------------+-------------+------+-----+---------+-------+
| Date       | datetime    | NO   |     |         |       |
| dt         | float       | YES  |     | NULL    |       |
| sourcePort | varchar(12) | YES  |     | NULL    |       |
| destPort   | varchar(12) | YES  |     | NULL    |       |
+------------+-------------+------+-----+---------+-------+
4 rows in set (0.00 sec)

mysql>
The NetData table was created inside the SA database using the following SQL command:

CREATE TABLE NetData
     (Date DATETIME NOT NULL,
      dt FLOAT,
      sourcePort VARCHAR(12),
      destPort VARCHAR(12)
     );

You may add additional columns such as source IP, destination IP, exchange data size, etc, according to your specific needs.

Input Data File Format

The first 10 lines of the file that hold the TCP network data are the following:

900.11856939 0.308413 1024 80
900.11857575 0.000636 80   1024
900.1485114 0.300888  80   1024
900.14852096 0.000956 1024 80
900.14853166 0.001070 1024 80
900.14871127 0.017961 80   1024
900.14989811 0.118684 80   1024
900.15316301 0.326490 1024 80
900.15317819 0.001518 80   1024
900.15319053 0.001234 80   1024

The first column represents the time (in minutes, starting from midnight -- 900 is 15:00) that the packet was captured. The second column is the time difference between the current packet and the previous one. The third column is the source port, and the fourth column is the destination port. So, you can have whatever data you want. What I am presenting here is a relatively simple -- yet realistic -- example with TCP data.

The populateNetData.pl Perl Script

The Perl script reads the TCP network data from a file and inserts it into the database. The MySQL database connection is established using the following Perl code:

my $datasource = "DBI:mysql:database=sa;host=localhost";
my $user = "mtsouk";
my $password = "xxx";

Listing 2 displays the populateNetData.pl Perl script.

Useful SQL Queries

This section will present and explain several SQL queries related to the presented network data records.

Displaying the 20 Busiest Moments (Seconds)

The following SQL command will display the 20 busiest moments as well as the total number of connections:

select Date, count(*) TotalConnections
             from NetData
             group by Date
             order by TotalConnections desc
                         limit 20;

The count(*) output is named TotalConnections so that it can be used later in the "order by" part of the SQL command.

The output is:

+---------------------+------------------+
| Date                | TotalConnections |
+---------------------+------------------+
| 2007-01-27 15:06:07 |              346 |
| 2007-01-27 15:12:10 |              283 |
| 2007-01-27 15:14:22 |              265 |
| 2007-01-27 15:12:11 |              262 |
| 2007-01-27 15:14:21 |              255 |
| 2007-01-27 15:13:08 |              218 |
| 2007-01-27 15:00:09 |              201 |
| 2007-01-27 15:09:31 |              201 |
| 2007-01-27 15:15:01 |              172 |
| 2007-01-27 15:06:08 |              143 |
| 2007-01-27 15:11:24 |              139 |
| 2007-01-27 15:09:04 |              115 |
| 2007-01-27 15:15:02 |              107 |
| 2007-01-27 15:02:09 |              103 |
| 2007-01-27 15:01:07 |              100 |
| 2007-01-27 15:01:15 |               99 |
| 2007-01-27 15:07:03 |               98 |
| 2007-01-27 15:03:28 |               79 |
| 2007-01-27 15:13:34 |               79 |
| 2007-01-27 15:08:35 |               77 |
+---------------------+------------------+

Returning Random Rows

The MySQL RAND function can be used to return rows randomly. The following command returns 10 random rows:

select *
             from NetData
             order by rand()
                         limit 10;

A possible output is:

+---------------------+----------+------------+----------+
| Date                | dt       | sourcePort | destPort |
+---------------------+----------+------------+----------+
| 2007-01-27 15:14:22 | 0.000143 | 80         | 4196     |
| 2007-01-27 15:11:24 | 0.019023 | 80         | 2723     |
| 2007-01-27 15:11:24 | 0.008738 | 80         | 2686     |
| 2007-01-27 15:15:01 | 0.000183 | 80         | 4208     |
| 2007-01-27 15:00:13 | 0.000273 | 25         | 1052     |
| 2007-01-27 15:04:04 | 0.006982 | 25         | 1107     |
| 2007-01-27 15:03:28 | 0.001365 | 80         | 2005     |
| 2007-01-27 15:10:34 | 0.001232 | 80         | 2676     |
| 2007-01-27 15:02:09 | 0.014544 | 80         | 1814     |
| 2007-01-27 15:07:16 | 0.001109 | 2596       | 80       |
+---------------------+----------+------------+----------+

Returning the Number of Total Connections per Destination Port

The following command returns the total number of connections per destination port for the 15 most popular ports:

select destPort, count(*) TotalNumber
             from NetData
             group by destPort
             order by TotalNumber desc
                         limit 15;

+----------+-------------+
| destPort | TotalNumber |
+----------+-------------+
| 80       |        1923 |
| 25       |         223 |
| 23       |          42 |
| 1024     |          38 |
| 1306     |          36 |
| 4233     |          27 |
| 1174     |          25 |
| 2005     |          25 |
| 2599     |          25 |
| 3718     |          25 |
| 4205     |          25 |
| 2676     |          24 |
| kpop     |          22 |
| 2063     |          22 |
| 2307     |          22 |
+----------+-------------+

Finding Busy Minutes

This part will show you how to find the 5 most popular minutes (i.e., 15:12, 15:16, but not 15:12:13, which is in seconds). I will use the substr() command:

select substr(Date, length(Date)-7, 5), count(*) Total
             from NetData
             group by substr(Date, length(Date)-7, 5)
             order by Total desc
                         limit 5;

The key point here is to select and group by the substring that you want, for example, by using the substr(Date, length(Date)-7, 5) expression.

This example produced the following output:

+---------------------------------+-------+
| substr(Date, length(Date)-7, 5) | Total |
+---------------------------------+-------+
| 15:14                           |   720 |
| 15:06                           |   551 |
| 15:12                           |   545 |
| 15:09                           |   511 |
| 15:15                           |   435 |
+---------------------------------+-------+

Creating an Empty Table Similar to the NetData Table

The following command will create a new table, called NetData_identical, that looks like the original NetData table without the actual data:

create table NetData_identical
as
select * from NetData
     where 0 = 1;

NetData_identical looks as follows:

mysql> desc NetData_identical;
+------------+-------------+------+-----+---------------------+-------+
| Field      | Type        | Null | Key | Default             | Extra |
+------------+-------------+------+-----+---------------------+-------+
| Date       | datetime    | NO   |     | 0000-00-00 00:00:00 |       |
| dt         | float       | YES  |     | NULL                |       |
| sourcePort | varchar(12) | YES  |     | NULL                |       |
| destPort   | varchar(12) | YES  |     | NULL                |       |
+------------+-------------+------+-----+---------------------+-------+
4 rows in set (0.01 sec)

mysql>

Creating a Backup Copy of the NetData Table

The following output illustrates how to create a backup copy of the NetData table:

mysql> create table NetData_backup as select * from NetData;
Query OK, 4912 rows affected (0.02 sec)
Records: 4912  Duplicates: 0  Warnings: 0

mysql>

Backup copies, apart from being useful for security reasons, are also very handy because you can work with them without affecting the original data.

Finding the Maximum Value of the dt Column

The following SQL query finds the maximum value of the dt column:

select max(dt)
             from NetData;

Finding the Average Value of the dt Column

The following SQL query reports the average value of the values found in the dt column:

select avg(dt)
             from NetData;
Finding the Average Value of the dt Column in Respect to the sourcePort Value

The following query returns the averages for the dt values in respect to the sourcePort. The output is limited to 10 values and is sorted in descending order:

select sourcePort, avg(dt) as dt_average
             from NetData
             group by sourcePort
             order by dt_average desc
                         limit 10;

The output is as follows:

+------------+-------------------+
| sourcePort | dt_average        |
+------------+-------------------+
| 2680       |  0.17783659529523 |
| 1174       | 0.098484598236973 |
| 2615       | 0.095210820149408 |
| 1307       | 0.089900799299357 |
| 1620       | 0.077908113450071 |
| 1306       |  0.07518981667553 |
| 2727       | 0.073359332897881 |
| 2616       | 0.066323998617008 |
| 4205       | 0.061019102641148 |
| 1112       | 0.060378213710335 |
+------------+-------------------+

Finding the Number of Different Ports Used in the destPort Column

The following query uses the distinct() function to return each entry only once. It then counts the number of entries:

select count(distinct(destPort))
             from NetData;

Finding the Percentage that Specific Data Represents

Let us say that you want to find out what percentage of the values in the dt column are from http traffic (destPort = "80"). The following query answers just that:

select
     (
              sum(case when destPort = "80" then dt end) / sum(dt)
     ) * 100 as percentage
     from NetData;

The answer is the following:

+-----------------+
| percentage      |
+-----------------+
| 30.156820654134 |
+-----------------+

Finding the Percentage of Traffic for Each Destination Port

This example will find the percentage of traffic -- the number of collections for a port divided by the total number of all connections -- for each destination port. The simplest way is by using a MySQL stored procedure. To begin, type the following code:

DELIMITER //
CREATE PROCEDURE PERTRAFFIC ()

BEGIN
             DECLARE total INT DEFAULT 0;

             SELECT COUNT(*) INTO total FROM netdata_backup;

             select (count(*)/total) * 100 as Percentage, sourcePort
            from NetData
            group by sourcePort
            order by Percentage desc;

END
//
DELIMITER ;

Then, execute the procedure by using the following command:

mysql> call PERTRAFFIC ();

The output, limited to 10 lines, is as follows:

+------------+--------------+
| Percentage | sourcePort   |
+------------+--------------+
|    50.3360 | 80           |
|     4.1132 | 25           |
|     0.7941 | 23           |
|     0.6312 | 1024         |
|     0.4480 | 2615         |
|     0.4276 | 1026         |
|     0.4072 | 1027         |
|     0.4072 | 1049         |
|     0.4072 | 1052         |
|     0.4072 | 1106         |

Conclusion

I hope you enjoyed reading this article as much as I enjoyed writing it. What I have shown here is a small part of probable queries of network data stored in a database. SQL offers you a tremendous number of possible queries that you can execute. You must determine your own needs and create the queries you want. Please feel free to ask questions or send me your ideas for future articles.

References and Web Links

Bowman, Judith, Sandra Emerson, and Marcy Darnovsky. 2001. The Practical SQL Handbook: Using SQL Variants, 4th Ed. Addison Wesley.

Dyer, Russell. 2005. MySQL in a Nutshell. O'Reilly.

Kofler, Michael. 2005. The Definitive Guide to MySQL 5, 3rd Ed. Apress.

Molinaro, Anthony. 2006. SQL Cookbook. O'Reilly.

Russell, Chad and Jon Stephens. 2004. Beginning MySQL Database Design and Optimization: From Novice to Professional. Apress.

Tsoukalos, Mihalis. 2007. "Network Traffic Visualization". Sys Admin: In publication.

Zawodny, Jeremy and Derek J. Balling. 2004. High Performance MySQL. O'Reilly.

1998 DARPA Intrusion Detection Evaluation Data Set Overview -- http://www.ll.mit.edu/IST/ideval/data/1998/1998_data_index.html

Installing Perl Modules -- http://www.livejournal.com/doc/server/lj.install.perl_setup.modules.html

Life with CPAN -- http://sial.org/howto/perl/life-with-cpan/

MySQL Reference Manual -- http://dev.mysql.com/doc/

MySQL User Groups -- http://dev.mysql.com/user-groups/

MySQL Web Site -- http://www.mysql.com/

Perl CPAN -- http://www.perl.com/CPAN/

Sample tcpdump File -- http://www.ll.mit.edu/IST/ideval/data/1998/training/four_hours/tcpdump.gz

Mihalis Tsoukalos lives in Greece with his wife, Eugenia, and works as a high school teacher. He holds a B.Sc. in Mathematics and a M.Sc. in IT from University College London. Before teaching, he worked as a Unix systems administrator and an Oracle DBA. Mihalis can be reached at: tsoukalos@sch.gr.