5.5 Visualization

Visualization of complex data is important but difficult. This is especially true of streaming data. While many complex techniques for visualization have been developed, simple scatter plots can be used effectively, and should not be shunned.

**Figure 5.3:** Source port versus time for all the incoming SYN packets for an 8-hour period
$\includegraphics[width=78mm]{text/4-5/abb/source.eps}$

**Figure 5.4:** Source port versus time for a short time period, the last two hours from Fig. 5.3. As time progresses, the plot shifts from right to left, dropping the leftmost column and adding a new column on the right
$\includegraphics[width=9.7cm]{text/4-5/abb/source2.eps}$

Figure 5.3 shows a scatter plot of source port against time for an 8-hour period. These are all the SYN packets coming in to a class B network (an address space of $2^{16} = 65{,}536$ possible IP addresses). This graphic, while simple, provides quite a few interesting insights.

Note that there are a number of curves in the plot. These arise because each time a client initiates a session with a server, it chooses a new source port, and this corresponds to the previous source port used by the client incremented by one. Contiguous curves correspond to connections by a single source IP. Vertical gaps in the curves indicate that the IP visited other servers between visits to the network. It is also easy to see the start of the work day in this plot, indicated by the heavy overplotting on the right-hand side.
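The mechanism behind these curves can be illustrated with a minimal sketch. It assumes a client that simply increments a single port counter on every outgoing connection; the function name `observed_ports`, the event format, and the default port range are hypothetical choices made for illustration, not taken from the data discussed here.

```python
def observed_ports(events, first_port=1024, last_port=65535):
    """Source ports a sensor would see from one client.

    `events` is a time-ordered list of (timestamp, monitored) pairs, where
    `monitored` is True when the connection goes to the monitored network.
    The port range defaults are assumptions, not values from the text.
    """
    port = first_port
    points = []
    for timestamp, monitored in events:
        if monitored:
            points.append((timestamp, port))
        # The counter advances on every connection, observed or not,
        # which is what produces the vertical gaps in the curve.
        port = port + 1 if port < last_port else first_port
    return points


# Two visits to the network, two connections elsewhere, then one more visit.
events = [(0, True), (1, True), (2, False), (3, False), (4, True)]
points = observed_ports(events)
print(points)
```

Plotting such points as port against time gives a rising curve for each client, with a jump of two ports where the client connected to servers outside the monitored network.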

The source ports range from

. Different applications and operating systems select ports from different ranges, so one can learn quite a bit from investigating plots like this.

The plot of Fig. 5.3 is static. Figure 5.4 is meant to illustrate a dynamic plot, analogous to the waterfall plots used in signal processing. It displays a snapshot in time that is continuously updated: as new observations arrive they are plotted on the right, and the rest of the data shifts left, dropping the leftmost column. Plots like this are required for streaming data.
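The scrolling behavior described above can be sketched with a fixed-length buffer of time columns. This is only an illustration of the bookkeeping, not the software used to produce the figures; the class name `ScrollingWindow` and its methods are hypothetical.

```python
from collections import deque


class ScrollingWindow:
    """Keeps the most recent `n_columns` time bins of a scrolling scatter plot."""

    def __init__(self, n_columns):
        # A deque with maxlen discards the oldest (leftmost) column
        # automatically when a new column is appended on the right.
        self.columns = deque(maxlen=n_columns)

    def push(self, ports):
        """Add the source ports seen in the newest time bin as the rightmost column."""
        self.columns.append(sorted(set(ports)))

    def points(self):
        """Return (column_index, port) pairs; index 0 is the leftmost column."""
        return [(x, port) for x, column in enumerate(self.columns) for port in column]


window = ScrollingWindow(n_columns=3)
for batch in ([1024, 1025], [2000], [1026, 2001], [1027]):
    window.push(batch)
print(window.points())
```

After the fourth batch arrives, the first batch has been dropped and the remaining columns have each moved one position to the left. A display would redraw `window.points()` after every push.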

**Figure 5.5:** Plot of spoofed IP address against time for backscatter packets from a denial of service attack against a single server. The IP addresses have been converted to 16-bit numbers, since in this case they correspond to the final two octets of the IP address
$\includegraphics[width=9cm]{text/4-5/abb/example4.eps}$

Simple plots can also be used to investigate various types of attacks. Figure 5.5 plots spoofed IP address against time for a denial of service attack against a single server. Each point corresponds to a single unsolicited SYN/ACK packet received at the sensor from a single source. This plot provides evidence that there were actually two distinct attacks against this server. The left side of the plot shows a distinctive striped pattern, indicating that the spoofed IP addresses were selected in a systematic manner. On the right, the pattern appears to be gone, and we observe what looks like a random pattern, giving evidence that the spoofed addresses were selected at random (a common practice for distributed denial of service tools). Between about

and

there is evidence of overlap of the attacks, indicating that this server was under attack from at least two distinct programs simultaneously.
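The vertical axis of Fig. 5.5 uses the final two octets of each spoofed address as a single 16-bit number. A minimal sketch of that conversion, using Python's standard `ipaddress` module, is given below; the function name `low16` is hypothetical and the addresses are made-up examples.

```python
import ipaddress


def low16(ip):
    """Return the final two octets of a dotted-quad IPv4 address as one 16-bit integer."""
    # Converting the address to an integer and masking keeps only the low 16 bits,
    # that is, third_octet * 256 + fourth_octet.
    return int(ipaddress.IPv4Address(ip)) & 0xFFFF


print(low16("192.0.2.17"))     # 2 * 256 + 17
print(low16("10.20.255.255"))  # largest possible value
```

Because the monitored network is a class B network, the first two octets are the same for every address, so nothing is lost by plotting only the low 16 bits.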

**Figure 5.6:** Number of bytes transferred within a completed session plotted against the number of packets within the session. *Solid dots* correspond to email sessions, *circles* correspond to all other applications
$\includegraphics[width=9cm]{text/4-5/abb/sess1.eps}$

Another use of scatter plots for analysis of network data is depicted in Fig. 5.6. These data were collected on completed sessions. The number of bytes is plotted against the number of packets. Clearly there should be a (linear) relationship between these. The interesting observation is that there are several linear relationships. This is similar to the observations made about Fig. 5.2, in which it was noted that different applications use different packet lengths.

Figure 5.7 shows the number of bytes transferred within a session plotted against the start time of the session. There is a lot of horizontal banding in this plot, corresponding mostly to email traffic. It is unknown whether the distinctive repetitive patterns are a result of spam (many email messages all the same size) or whether there are other explanations for this. Since these data are constructed from packet headers only, we do not have access to the payload and cannot check this hypothesis for these data. Figure 5.8 shows a zoom of the data. The band just below 400 bytes corresponds to telnet sessions. These are most likely failed login attempts. This is the kind of thing that one would like to detect. The ability to drill down into the plots, zooming and selecting observations to examine the original data, is critical to intrusion detection.
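The zoom and drill-down steps can be sketched as a simple filter followed by a count of repeated session sizes, since a horizontal band is a byte count shared by many sessions. The record layout, the function names, and the sample sessions below are made up for illustration; they are not the data shown in the figures.

```python
from collections import Counter


def repeated_sizes(sessions, max_bytes=1000, min_count=3):
    """Byte counts below `max_bytes` that occur at least `min_count` times.

    `sessions` is a list of (start_time, n_bytes, application) tuples.
    """
    zoomed = [s for s in sessions if s[1] < max_bytes]  # the zoom step
    counts = Counter(n_bytes for _, n_bytes, _ in zoomed)
    return {size: n for size, n in counts.items() if n >= min_count}


def drill_down(sessions, size):
    """Return the original records that make up one band."""
    return [s for s in sessions if s[1] == size]


# Made-up sessions: (start time in hours, bytes, application).
sessions = [
    (0.5, 390, "telnet"), (1.0, 390, "telnet"), (2.0, 390, "telnet"),
    (2.5, 5200, "email"), (3.0, 812, "web"),
    (3.5, 5200, "email"), (4.0, 5200, "email"),
]
bands = repeated_sizes(sessions)
print(bands)
print(drill_down(sessions, 390))
```

In an interactive display the same two operations would be driven by the analyst selecting a region of the plot rather than by fixed thresholds.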

**Figure 5.7:** Number of bytes transferred for each session plotted against the starting time of the session for a single day
$\includegraphics[width=8.5cm]{text/4-5/abb/sess2.eps}$

**Figure 5.8:** The portion of the sessions in Fig. 5.7 that were smaller than 1000 bytes
$\includegraphics[width=8.5cm]{text/4-5/abb/sess3.eps}$

High-dimensional visualization techniques are clearly needed. Parallel coordinates are one solution to this. In Fig. 5.9 we see session statistics for four different applications plotted using parallel coordinates.

**Figure 5.9:** Parallel coordinates plots of session statistics for four different applications. From left to right, top to bottom they are: secure shell, email, web and secure web. The coordinates are the time of the initiating SYN packet, the total number of packets, the total number of bytes sent and the duration of the session. The axes are all scaled the same among the plots
$\includegraphics[width=55mm]{text/4-5/abb/pcoord22.eps}$ $\includegraphics[width=55mm]{text/4-5/abb/pcoord25.eps}$ $\includegraphics[width=55mm]{text/4-5/abb/pcoord80.eps}$ $\includegraphics[width=55mm]{text/4-5/abb/pcoord443.eps}$
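In a parallel coordinates plot each session becomes a polyline whose height on each axis is the scaled value of one coordinate. Using the same axis scaling in every panel, as in Fig. 5.9, requires computing the ranges over all applications together. A minimal sketch follows; the function names and the sample values are hypothetical.

```python
def common_axis_ranges(groups):
    """Per-coordinate (min, max) computed over every session in every group.

    `groups` maps an application name to a list of equal-length tuples,
    here (start time, packets, bytes, duration).
    """
    rows = [row for sessions in groups.values() for row in sessions]
    return [(min(column), max(column)) for column in zip(*rows)]


def to_polyline(row, ranges):
    """Heights in [0, 1] at which one session crosses each parallel axis."""
    return [
        0.5 if hi == lo else (value - lo) / (hi - lo)  # constant axis: center it
        for value, (lo, hi) in zip(row, ranges)
    ]


# Made-up sessions for two applications.
groups = {
    "ssh": [(0, 10, 1000, 5)],
    "web": [(10, 30, 5000, 25), (5, 20, 3000, 15)],
}
ranges = common_axis_ranges(groups)
print(ranges)
print(to_polyline(groups["web"][1], ranges))
```

Scaling each panel separately would stretch every application to fill its own axes and make the panels impossible to compare.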

One problem with plots like this is overplotting. Wegman addresses this through the use of color saturation (see [35], [38]). Without techniques such as this it is extremely difficult to display large amounts of data. Figure 5.9 illustrates this problem in two ways. First, consider the secure shell data in the upper left corner. It would be reasonable to conclude from this plot that secure shell sessions are of short duration compared with other sessions. This is an artifact of the data. For these data there are only

secure shell sessions, and they all happen to be of short duration. Thus, we really need to look at a lot of data to see the true distribution for this application. Next, look at the email plot in the upper right. Most of the plot is black, showing extensive overplotting. Beyond the observation that these email sessions have heavy tails in the size and duration of the sessions, little can be gleaned from this plot.
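The idea behind saturation-based displays can be illustrated in a simplified form: count how many observations fall in each cell of the display and draw the cell with a saturation proportional to that count. The sketch below works on a plain scatter plot grid and is a stand-in for the general idea only, not a reproduction of the method in [35], [38]; the function names, the cell widths, and the cap are assumptions.

```python
from collections import Counter


def bin_counts(points, x_width, y_width):
    """Number of points falling in each (column, row) cell of a regular grid."""
    return Counter((int(x // x_width), int(y // y_width)) for x, y in points)


def saturation(counts, cap):
    """Map each cell count to a saturation in (0, 1]; counts at or above `cap` saturate."""
    return {cell: min(n, cap) / cap for cell, n in counts.items()}


# Made-up points: three fall in the same cell, one falls in a neighboring cell.
points = [(0.2, 0.1), (0.4, 0.3), (0.9, 0.8), (1.5, 0.2)]
counts = bin_counts(points, x_width=1.0, y_width=1.0)
levels = saturation(counts, cap=4)
print(levels)
```

With a display of this kind, a region containing a few sessions and a region containing thousands no longer look identical, which is the information lost in the solid black areas of the email panel.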

A further point should be made about the web sessions. Some of the sessions that are relatively small in terms of number of packets and bytes transferred have relatively long durations. This is because web sessions are often not closed at the end of a transfer. They are only closed when the browser goes to another web server, or a time-out occurs. This is an interesting fact about the web application that is easy to see in these plots.