Visualization of complex data is important but difficult. This is especially true of streaming data. While many complex techniques for visualization have been developed, simple scatter plots can be used effectively, and should not be shunned.
Figure 5.5 shows a scatter plot of source port against time for a one-hour period. These are all the SYN packets coming in to a class B network (an address space of 2^16 = 65,536 possible IP addresses). This graphic, while simple, provides quite a few interesting insights.
Note that there are a number of curves in the plot. These arise because each time a client initiates a session with a server, it chooses a new source port, which is the previous source port used by the client incremented by one. Contiguous curves correspond to connections by a single source IP. Vertical gaps in the curves indicate that the IP visited other servers between visits to the network. It is also easy to see the start of the work day in this plot, indicated by the heavy overplotting on the right-hand side.
The source ports range from to . Different applications and operating systems select ports from different ranges, so one can learn quite a bit from investigating plots like this.
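The relationship between contiguous curves and vertical gaps can be sketched in a few lines. This is a minimal illustration, assuming each SYN record is a hypothetical `(time, src_ip, src_port)` tuple and that clients allocate ports sequentially as described above; the function name `port_gaps` and the sample records are invented for the example, and port wraparound is ignored.

```python
from collections import defaultdict

def port_gaps(syn_records):
    """Group (time, src_ip, src_port) records by source IP and return, for each
    IP, the differences between consecutive source ports seen at the sensor."""
    by_ip = defaultdict(list)
    for _time, ip, port in sorted(syn_records):  # sort by time first
        by_ip[ip].append(port)
    # A difference of 1 means the client's next session also came to this
    # network; a larger difference means ports were used elsewhere in between.
    return {ip: [b - a for a, b in zip(ports, ports[1:])]
            for ip, ports in by_ip.items()}

# Hypothetical records, for illustration only.
records = [
    (0.0, "10.0.0.1", 1025),
    (1.0, "10.0.0.1", 1026),
    (2.0, "10.0.0.2", 40000),
    (5.0, "10.0.0.1", 1030),  # three ports were consumed outside the network
    (6.0, "10.0.0.2", 40001),
]
gaps = port_gaps(records)
print(gaps)
```

A difference greater than one shows up as a vertical gap in that client's curve in the scatter plot.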
Although the plot in Fig. 5.5 is static, it is meant to illustrate a dynamic plot, analogous to the waterfall plots used in signal processing. It displays a snapshot in time that is continuously updated: as new observations are obtained they are plotted on the right, with the rest of the data shifting left and the leftmost column dropping off. Plots like this are required for streaming data.
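The scrolling behavior just described amounts to a fixed-width buffer of time columns. The following is a minimal sketch of that idea, not the implementation behind the figure; the class name `ScrollingWindow` and its methods are hypothetical, and each column is assumed to hold the values (for example, source ports) observed in one time step.

```python
from collections import deque

class ScrollingWindow:
    """Fixed-width buffer of time columns for a continuously updated plot."""

    def __init__(self, n_columns):
        # A deque with maxlen silently discards the oldest (leftmost) column
        # whenever a new one is appended on the right.
        self.columns = deque(maxlen=n_columns)

    def push(self, values):
        """Add the observations for the newest time step."""
        self.columns.append(list(values))

    def points(self):
        """Return (column_index, value) pairs ready for a scatter plot."""
        return [(x, y) for x, col in enumerate(self.columns) for y in col]

window = ScrollingWindow(n_columns=3)
for ports in ([1025], [1026, 40000], [], [1030]):
    window.push(ports)
pts = window.points()
print(pts)  # the first column, [1025], has scrolled off the left edge
```

Redrawing the scatter plot from `points()` after each `push` gives the scrolling display; the memory used depends only on the window width, not on the length of the stream.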
Simple plots can also be used to investigate various types of attacks. Figure 5.5 plots spoofed IP address against time for a denial of service attack against a single server. Each point corresponds to a single unsolicited SYN/ACK packet received at the sensor from a single source. This plot provides evidence that there were actually two distinct attacks against this server. The left side of the plot shows a distinctive striped pattern, indicating that the spoofed IP addresses were selected in a systematic manner. On the right, the striped pattern appears to be gone, and we observe what looks like a random pattern, giving evidence that the spoofed addresses were selected at random (a common practice for distributed denial of service tools). Between about and there is evidence of overlap of the attacks, indicating that this server was under attack from at least two distinct programs simultaneously.
Another use of scatter plots for analysis of network data is depicted in Fig. 5.5. These data were collected on completed sessions. The number of packets is plotted against the number of bytes. Clearly there should be a (linear) relationship between these. The interesting observation is that there are several linear relationships. This is similar to the observations made about Fig. 5.2, in which it was noted that different applications use different packet lengths.
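One rough way to look for the several linear relationships is to compute the bytes-per-packet ratio of each session and see where the ratios cluster. The sketch below assumes each line passes through the origin, which is a simplification; the function name `bytes_per_packet_bins`, the bin width, and the sample sessions are all hypothetical.

```python
from collections import Counter

def bytes_per_packet_bins(sessions, bin_width=50):
    """Histogram the slope (bytes / packets) of each (packets, bytes) session.

    Sessions lying on the same line through the origin share a slope, so
    each linear band in the scatter plot appears as a peak in this histogram.
    """
    counts = Counter()
    for packets, nbytes in sessions:
        if packets > 0:  # skip degenerate sessions
            slope = nbytes / packets
            counts[int(slope // bin_width) * bin_width] += 1
    return counts

# Hypothetical (packets, bytes) pairs, for illustration only.
sessions = [(10, 600), (20, 1240), (5, 7300), (8, 11600), (100, 6100)]
bins = bytes_per_packet_bins(sessions)
print(sorted(bins.items()))
```

Here three sessions fall in the bin starting at 50 bytes per packet and two in the bin starting at 1450, which would appear as two distinct lines in the scatter plot.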
Figure 5.5 shows the number of bytes transferred within a session plotted against the start time of the session. There is a lot of horizontal banding in this plot, corresponding mostly to email traffic. It is unknown whether the distinctive repetitive patterns are a result of spam (many email messages all the same size) or whether there are other explanations for this. Since these data are constructed from packet headers only, we do not have access to the payload and cannot check this hypothesis for these data. Figure 5.5 shows a zoom of the data. The band just below bytes corresponds to telnet sessions. These are most likely failed login attempts. This is the kind of thing that one would like to detect. The ability to drill down into the plots, zooming and selecting observations to examine the original data, is critical to intrusion detection.
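Drilling down can be as simple as selecting the session records that fall inside a rectangle of the plot and handing them back for inspection. The sketch below assumes each session is a dictionary with hypothetical keys `start`, `bytes`, and `dst_port`; the function name, the thresholds, and the records are invented for illustration.

```python
def select_region(sessions, t_min, t_max, bytes_min, bytes_max):
    """Return the original session records inside a rectangular plot region."""
    return [s for s in sessions
            if t_min <= s["start"] <= t_max
            and bytes_min <= s["bytes"] <= bytes_max]

# Hypothetical session records built from packet headers.
sessions = [
    {"start": 100.0, "bytes": 480, "dst_port": 23},   # telnet
    {"start": 160.0, "bytes": 485, "dst_port": 23},   # telnet
    {"start": 170.0, "bytes": 9000, "dst_port": 25},  # email (SMTP)
    {"start": 900.0, "bytes": 482, "dst_port": 23},   # outside the time window
]
band = select_region(sessions, t_min=0, t_max=600, bytes_min=450, bytes_max=500)
print(band)
```

Because the selection returns the underlying records rather than plot coordinates, an analyst can inspect fields such as the destination port to see which application produced a band.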
High dimensional visualization techniques are clearly needed. Parallel coordinates is one solution to this. In Fig. 5.9 we see session statistics for four different applications plotted using parallel coordinates.
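In a parallel coordinates display, each variable gets its own vertical axis and each observation becomes a polyline crossing every axis at the height of its value. The sketch below shows only the data transformation, assuming each session is a tuple of numeric statistics and that each axis is scaled to its observed minimum and maximum; the function name `to_polylines` and the sample rows are hypothetical.

```python
def to_polylines(rows):
    """Turn equal-length numeric rows into parallel-coordinates polylines.

    Each returned polyline is a list of (axis_index, height) pairs, with
    height rescaled to [0, 1] independently on every axis.
    """
    columns = list(zip(*rows))
    lows = [min(c) for c in columns]
    highs = [max(c) for c in columns]
    polylines = []
    for row in rows:
        line = []
        for axis, value in enumerate(row):
            span = highs[axis] - lows[axis]
            # A constant column has no spread; place it mid-axis.
            height = (value - lows[axis]) / span if span else 0.5
            line.append((axis, height))
        polylines.append(line)
    return polylines

# Hypothetical (packets, bytes, duration) session statistics.
rows = [(10, 600, 2.0), (20, 1200, 4.0), (30, 900, 3.0)]
lines = to_polylines(rows)
print(lines[1])
```

Drawing each polyline with any line-plotting routine then gives the display; in practice heavy-tailed variables such as bytes and duration may need a log scale before rescaling.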
One problem with parallel coordinates plots is overplotting. Wegman solves this via the use of color saturation (see [35], [38]). Without techniques such as this it is extremely difficult to display large amounts of data. Figure 5.9 illustrates this problem in two ways. First, consider the secure shell data in the upper left corner. It would be reasonable to conclude from this plot that secure shell sessions are of short duration, as compared with other sessions. This is an artifact of the data. For these data there are only secure shell sessions, and they all happen to be of short duration. Thus, we really need to look at a lot of data to see the true distribution for this application. Next, look at the email plot in the upper right. Most of the plot is black, showing extensive overplotting. Beyond the observation that these email sessions have heavy tails in the size and duration of the sessions, little can be gleaned from this plot.
A further point should be made about the web sessions. Some of the sessions that are relatively small in terms of number of packets and bytes transferred have relatively long durations. This is because web sessions are often not closed off at the end of a transfer. They are only closed when the browser goes to another web server, or a time-out occurs. This is an interesting fact about the web application which is easy to see in these plots.
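The general idea of using saturation to cope with overplotting can be illustrated by counting how many observations land in each cell of the display and mapping the count to an intensity. This is a simplified sketch in that spirit, not the method of [35] or [38]; the function names, the logarithmic scaling, and the sample points are assumptions made for the example.

```python
import math
from collections import Counter

def density_grid(points, x_bin, y_bin):
    """Count how many (x, y) points fall into each rectangular cell."""
    counts = Counter()
    for x, y in points:
        counts[(int(x // x_bin), int(y // y_bin))] += 1
    return counts

def saturation(count, max_count):
    """Map a cell count to an intensity in (0, 1] on a logarithmic scale,
    so that sparse cells stay visible next to very dense ones."""
    return math.log1p(count) / math.log1p(max_count)

# Hypothetical points, for illustration only.
points = [(0.1, 0.1), (0.2, 0.3), (0.4, 0.4), (1.5, 0.2)]
grid = density_grid(points, x_bin=1.0, y_bin=1.0)
peak = max(grid.values())
levels = {cell: saturation(n, peak) for cell, n in grid.items()}
print(levels)
```

Cells drawn with these intensities distinguish a region hit three times from one hit once, whereas plain black ink would render both identically.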