Visualization of complex data is important but difficult. This is especially true of streaming data. While many complex techniques for visualization have been developed, simple scatter plots can be used effectively, and should not be shunned.
Figure 5.5 shows a scatter plot of source port against
time for an
hour period of time. These are all the SYN
packets coming in to a class B network (an address space of
65,536 possible IP addresses). This graphic, while simple,
provides quite a few interesting insights.
Note that there are a number of curves in the plot. These arise because each time a client initiates a session with a server, it chooses a new source port, which is the previous source port used by the client incremented by one. Contiguous curves correspond to connections by a single source IP. Vertical gaps in the curves indicate that the IP visited other servers between visits to the network. It is also easy to see the start of the work day in this plot, indicated by the heavy overplotting on the right-hand side.
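The structure behind these curves can be sketched in a few lines of Python. This is a minimal illustration, not the procedure used to produce the figure: the record layout `(time, src_ip, src_port)`, the function name `port_steps`, and the sample values are all assumptions made for the example. It relies on the behavior described above, namely that a client increments its source port by one for each new session.

```python
from collections import defaultdict

def port_steps(records):
    """Group (time, src_ip, src_port) SYN records by source IP and return,
    for each IP, the port increments between its consecutive SYN packets."""
    by_ip = defaultdict(list)
    for t, ip, port in records:
        by_ip[ip].append((t, port))
    steps = {}
    for ip, obs in by_ip.items():
        obs.sort()  # order each client's packets by time
        ports = [p for _, p in obs]
        steps[ip] = [b - a for a, b in zip(ports, ports[1:])]
    return steps

# Hypothetical records, invented for illustration only.
records = [
    (0.0, "10.0.0.1", 1025),
    (1.0, "10.0.0.1", 1026),
    (2.0, "10.0.0.2", 40000),
    (5.0, "10.0.0.1", 1030),
    (6.0, "10.0.0.2", 40001),
]
steps = port_steps(records)
```

A step of 1 keeps a client on a contiguous curve. A larger step, such as the jump from 1026 to 1030 for the first client, appears as a vertical gap and suggests sessions opened to servers outside the monitored network. The sketch ignores port wrap-around.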
The source ports range from to
. Different
applications and operating systems select ports from different ranges,
so one can learn quite a bit from investigating plots like this.
Although the plot of Fig. 5.5 is static, it is meant to illustrate a dynamic plot, analogous to the waterfall plots used in signal processing. Such a plot displays a snapshot in time that is continuously updated: as new observations are obtained they are plotted on the right, with the rest of the data shifting left and the leftmost column dropping off. Plots like this are required for streaming data.
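The scrolling behavior of such a display can be sketched with a fixed-length buffer of time columns. The class name `ScrollingPlot`, its methods, and the sample values are hypothetical, and the rendering step is omitted. The sketch only shows how the oldest column is discarded as new data arrive.

```python
from collections import deque

class ScrollingPlot:
    """Keep only the most recent `width` time columns of a scatter plot."""

    def __init__(self, width):
        # A deque with maxlen drops its oldest (leftmost) entry automatically.
        self.columns = deque(maxlen=width)

    def push(self, values):
        """Add the observations for the newest time step on the right."""
        self.columns.append(list(values))

    def points(self):
        """Return (x, y) pairs, where x = 0 is the oldest retained column."""
        return [(x, y) for x, col in enumerate(self.columns) for y in col]

# Hypothetical source ports observed in four successive time steps.
window = ScrollingPlot(width=3)
for step_values in ([1025], [1026, 40000], [], [1030]):
    window.push(step_values)
pts = window.points()
```

After the fourth push the first column has scrolled off, so only the last three time steps remain in `pts`. Each call to `push` would normally be followed by a redraw.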
Simple plots can also be used to investigate various types of
attacks. Figure 5.5 plots spoofed IP address
against time for a denial of service attack
against a single server. Each point corresponds to a single
unsolicited SYN/ACK packet received at the sensor from a single
source. This plot provides evidence that there were actually two
distinct attacks against this server. The left side of the plot shows
a distinctive striped pattern, indicating that the spoofed IP
addresses have been selected in a systematic manner. On the right, the
pattern appears to be gone, and we observe what looks like a random
pattern, giving evidence that the spoofed addresses are selected at
random (a common practice for distributed denial of service
tools). Between about and
there is evidence of
overlap of the attacks, indicating that this server was under attack
from at least two distinct programs simultaneously.
Another use of scatter plots for analysis of network data is depicted in Fig. 5.5. These data were collected on completed sessions. The number of packets is plotted against the number of bytes. Clearly there should be a (linear) relationship between these. The interesting observation is that there are several linear relationships. This is similar to the observations made about Fig. 5.2, in which it was noted that different applications use different packet lengths.
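One way to look for several linear relationships numerically is to compute the slope, in bytes per packet, for each session and count how many sessions fall near each slope. The following is a rough sketch under assumptions: it treats each relationship as a line through the origin, and the function name `slope_histogram`, the bin width, and the sample sessions are invented for illustration.

```python
from collections import Counter

def slope_histogram(sessions, bin_width=50):
    """Count sessions by bytes-per-packet slope, binned to `bin_width`.

    `sessions` is an iterable of (packets, nbytes) pairs.
    """
    counts = Counter()
    for packets, nbytes in sessions:
        if packets == 0:
            continue  # no slope is defined for an empty session
        slope = nbytes / packets
        counts[int(slope // bin_width) * bin_width] += 1
    return counts

# Hypothetical (packets, bytes) pairs, invented for illustration only.
sessions = [(10, 600), (20, 1200), (5, 7300), (8, 11680), (4, 250)]
hist = slope_histogram(sessions)
```

Sessions lying on the same line in the scatter plot fall into the same bin, so well-separated peaks in `hist` correspond to distinct lines in the plot.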
Figure 5.5 shows the number of bytes transferred within
a session plotted against the start time of the session. There is
a lot of horizontal banding in this plot, corresponding mostly to
email traffic. It is unknown whether the distinctive repetitive
patterns are a result of spam (many email messages all the same size)
or whether there are other explanations for this. Since these data are
constructed from packet headers only, we do not have access to the
payload and cannot check this hypothesis for these
data. Figure 5.5 shows a zoom of the data. The band just
below
bytes corresponds to telnet sessions. These are
most likely failed login attempts. This is the kind of thing that one
would like to detect. The ability to drill down into the plots, zooming and
selecting observations to examine the original data, is critical to
intrusion detection.
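Drilling down amounts to mapping a rectangle selected on the plot back to the session records it contains. A minimal sketch follows. The function name `select_region`, the record fields `start`, `bytes`, and `dst_port`, and all sample values are assumptions made for the example.

```python
def select_region(sessions, start_range, byte_range):
    """Return the session records whose (start, bytes) point lies inside
    the zoom rectangle, so that the underlying data can be examined."""
    t_lo, t_hi = start_range
    b_lo, b_hi = byte_range
    return [
        s for s in sessions
        if t_lo <= s["start"] <= t_hi and b_lo <= s["bytes"] <= b_hi
    ]

# Hypothetical session records; field names and values are invented.
sessions = [
    {"start": 100.0, "bytes": 480, "dst_port": 23},
    {"start": 160.0, "bytes": 52000, "dst_port": 25},
    {"start": 220.0, "bytes": 495, "dst_port": 23},
    {"start": 900.0, "bytes": 490, "dst_port": 23},
]
selected = select_region(sessions, start_range=(0.0, 300.0), byte_range=(400, 500))
```

Here the selection returns the two small sessions to port 23 (telnet) that started within the chosen time range. An analyst could then inspect those records individually.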
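High-dimensional visualization techniques are clearly needed. Parallel coordinates, in which each variable is given its own parallel axis and each observation is drawn as a line connecting its values across the axes, is one solution to this. In Fig. 5.9 we see session statistics for four different applications plotted using parallel coordinates.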
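Preparing data for a parallel coordinates display amounts to rescaling each variable to a common height and connecting each observation's values across the axes. The sketch below uses min-max scaling, which is one common choice and not necessarily the one used for the figure. The function name `to_polylines` and the sample rows are hypothetical.

```python
def to_polylines(rows):
    """Scale each variable to [0, 1] and return one polyline per row.

    Each polyline is a list of (axis_index, height) vertices.
    """
    columns = list(zip(*rows))
    lows = [min(c) for c in columns]
    highs = [max(c) for c in columns]
    lines = []
    for row in rows:
        line = []
        for axis, (v, lo, hi) in enumerate(zip(row, lows, highs)):
            # A constant variable has no spread, so place it mid-axis.
            height = 0.5 if hi == lo else (v - lo) / (hi - lo)
            line.append((axis, height))
        lines.append(line)
    return lines

# Hypothetical sessions as (packets, bytes, duration); values are invented.
rows = [(10, 600, 2.0), (20, 1200, 4.0), (30, 1800, 3.0)]
lines = to_polylines(rows)
```

Each entry of `lines` can be drawn as a connected line across the axes. With heavy-tailed variables such as session size, a logarithmic transform before scaling may be preferable.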
One problem with plots like this is that of overplotting. Wegman
solves this via the use of color saturation
(see [35], [38]). Without techniques such as this
it is extremely difficult to display large amounts of
data. Figure 5.9 illustrates this problem in two
ways. First, consider the secure shell data in the upper left
corner. It would be reasonable to conclude from this plot that secure
shell sessions are of short duration, as compared with other
sessions. This is an artifact of the data. For these data there are
only secure shell sessions, and they all happen to be of short
duration. Thus, we really need to look at a lot of data to see the
true distribution for this application. Next, look at the email plot
in the upper right. Most of the plot is black, showing extensive
overplotting. Beyond the observation that these email sessions have heavy
tails in the size and duration of the sessions, little can be gleaned
from this plot.
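A simple way to see how saturation can relieve overplotting is to bin the points on a grid and let the color intensity of each cell grow with the number of points in it, up to a cap. The sketch below is a generic density-binning illustration and is not claimed to be the method of the cited references. The function names, the grid size, and the cap are assumptions.

```python
def density_grid(points, nx, ny, x_range, y_range):
    """Count the points falling in each cell of an nx-by-ny grid."""
    x_lo, x_hi = x_range
    y_lo, y_hi = y_range
    grid = [[0] * nx for _ in range(ny)]
    for x, y in points:
        if not (x_lo <= x <= x_hi and y_lo <= y <= y_hi):
            continue  # ignore points outside the plotting region
        # Points on the upper edge are assigned to the last cell.
        i = min(int((x - x_lo) / (x_hi - x_lo) * nx), nx - 1)
        j = min(int((y - y_lo) / (y_hi - y_lo) * ny), ny - 1)
        grid[j][i] += 1
    return grid

def saturation(grid, cap):
    """Map counts to intensities in [0, 1]; cells at or above `cap` are full."""
    return [[min(c, cap) / cap for c in row] for row in grid]

# Hypothetical points in the unit square, three of them coincident.
points = [(0.1, 0.1), (0.1, 0.1), (0.1, 0.1), (0.9, 0.9)]
grid = density_grid(points, nx=2, ny=2, x_range=(0.0, 1.0), y_range=(0.0, 1.0))
levels = saturation(grid, cap=2)
```

A cell holding a single point is drawn at partial intensity, while heavily overplotted cells reach full intensity, so dense and sparse regions remain distinguishable instead of both appearing solid black.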
A further point should be made about the web sessions. Some of the sessions which are relatively small in terms of number of packets and bytes transferred have relatively long durations. This happens because web sessions are often not closed at the end of a transfer. They are closed only when the browser goes to another web server or a time-out occurs. This is an interesting fact about the web application which is easy to see in these plots.