Use Case Overview
This article demonstrates how to build a classification model for identifying the occurrence and category of network intrusions. Rapid detection of network intrusions allows organizations to block the offending traffic, improving site reliability and data security.
Data Requirements
Our sample data are derived from the 1999 KDD intrusion detection contest. Each sample represents a connection (a sequence of TCP packets with start and end timestamps) and is labeled as either normal, or as an attack from one of four categories:
- DoS: "denial-of-service", e.g. SYN flood
- R2L: "remote-to-local" - unauthorized access from a remote machine, e.g. password guessing
- U2R: "user-to-root" - unauthorized access to local superuser (root) privileges, e.g. various "buffer overflow" attacks
- Probe: surveillance and other probing, e.g. port scanning
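As a rough illustration of the labeling scheme above, the sketch below maps a handful of raw attack names to the four categories. The `ATTACK_CATEGORY` table and the `categorize` helper are hypothetical names introduced here, and the table covers only a few attack names from the KDD 1999 data, not the full set.

```python
# Hypothetical lookup table: a few raw KDD 1999 attack names mapped to the
# four categories. The real dataset contains more attack names than this.
ATTACK_CATEGORY = {
    "neptune": "DoS",          # SYN flood
    "smurf": "DoS",
    "guess_passwd": "R2L",     # password guessing
    "buffer_overflow": "U2R",
    "portsweep": "Probe",      # port scanning
}


def categorize(raw_label: str) -> str:
    """Return 'normal', one of the four attack categories, or 'unknown'."""
    # Raw KDD labels often carry a trailing period, e.g. "smurf."
    label = raw_label.strip().rstrip(".").lower()
    if label == "normal":
        return "normal"
    return ATTACK_CATEGORY.get(label, "unknown")
```

Returning "unknown" for unmapped names, rather than raising an error, is a design choice for this sketch; a production mapping would likely want to surface unmapped labels explicitly.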
Data Transformation
First template workflow for joining the source datasets and cleansing the samples.
Our first flow joins the network events with metadata on attack types into a single dataset with one row per network event. In addition, we create a binary label using the Variable operator (whether or not an event is an attack, regardless of category), remove irrelevant and sparse columns, and replace nulls in numeric columns with the column average.
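The playbook performs these steps with visual workflow operators. For readers who prefer code, the same four steps (join, binary label, sparse-column removal, mean imputation) can be sketched in plain Python. The `transform` function, the `max_null_fraction` threshold, and the column names in the sample rows are all assumptions made for illustration, not part of the playbook.

```python
from statistics import mean


def transform(events, attack_types, max_null_fraction=0.5):
    """Join events to attack metadata, label, drop sparse columns, impute.

    events: list of dicts that all share the same keys (None marks a null).
    attack_types: dict mapping a raw label to its category.
    """
    if not events:
        return []
    # Join: attach the attack category, keeping one row per network event.
    rows = [dict(e, category=attack_types.get(e["label"], "unknown"))
            for e in events]
    # Binary label: 1 for any attack, 0 for normal traffic.
    for r in rows:
        r["is_attack"] = int(r["category"] != "normal")
    # Remove sparse columns (assumed rule: too many nulls).
    for col in list(rows[0].keys()):
        nulls = sum(r[col] is None for r in rows)
        if nulls / len(rows) > max_null_fraction:
            for r in rows:
                del r[col]
    # Replace remaining nulls in numeric columns with the column average.
    for col in rows[0]:
        present = [r[col] for r in rows if r[col] is not None]
        numeric = present and all(
            isinstance(v, (int, float)) and not isinstance(v, bool)
            for v in present
        )
        if numeric:
            avg = mean(present)
            for r in rows:
                if r[col] is None:
                    r[col] = avg
    return rows


# Made-up sample rows, for illustration only.
events = [
    {"label": "normal", "duration": 2, "src_bytes": 100, "note": None},
    {"label": "neptune", "duration": None, "src_bytes": 0, "note": None},
    {"label": "portsweep", "duration": 4, "src_bytes": None, "note": "x"},
]
attack_types = {"normal": "normal", "neptune": "DoS", "portsweep": "Probe"}
cleaned = transform(events, attack_types)
```

In this sample, the `note` column is dropped because two of its three values are null, and the missing `duration` and `src_bytes` values are filled with the averages of the remaining values.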
Data Exploration
Second template for exploratory data visualization.
The second workflow performs several forms of exploratory data visualization on the cleansed dataset produced by the previous flow. These visualizations expose several trends in the data. Looking at attack characteristics by protocol type, we find that nearly all ICMP connections are attacks. We also find that network events with a high number of connections to the same host within a two-second window, or with failed login attempts, are very likely to be either DoS or Probe attacks. We use these features in the next workflow to improve our model accuracy.
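The breakdown by protocol type amounts to a grouped attack rate. A minimal sketch follows, assuming each row carries a `protocol_type` field and a 0/1 `is_attack` label; the `attack_rate_by` helper is a hypothetical name, and the sample rows are invented to show the mechanics rather than to reproduce the playbook's findings.

```python
from collections import defaultdict


def attack_rate_by(rows, key):
    """Return the fraction of attack rows for each distinct value of `key`."""
    totals = defaultdict(int)
    attacks = defaultdict(int)
    for r in rows:
        totals[r[key]] += 1
        attacks[r[key]] += r["is_attack"]
    return {value: attacks[value] / totals[value] for value in totals}


# Invented rows, for illustration only.
sample = [
    {"protocol_type": "icmp", "is_attack": 1},
    {"protocol_type": "icmp", "is_attack": 1},
    {"protocol_type": "tcp", "is_attack": 0},
    {"protocol_type": "tcp", "is_attack": 1},
    {"protocol_type": "udp", "is_attack": 0},
]
rates = attack_rate_by(sample, "protocol_type")
```

The same helper works for any categorical column, so it could also be pointed at a bucketed connection count or a failed-login flag.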
Modeling
Third template for predictive modeling.
In the final workflow, we use logistic regression, SVM, naive Bayes, and random forest to predict whether a given network event is an attack and, if so, which attack category it belongs to. Random forest performs best, with a 99.7% accuracy rate, though it struggles to distinguish rare U2R attacks from normal connections. Naive Bayes conflates R2L attacks with U2R attacks, but correctly separates that combined group from normal events.
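A high overall accuracy can coexist with weak performance on a rare class, which is why per-class recall is worth checking alongside accuracy. The sketch below shows the arithmetic on invented labels; the `accuracy` and `recall_per_class` helpers and the toy data are assumptions for illustration and do not reproduce the playbook's results.

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def recall_per_class(y_true, y_pred):
    """For each true class, the fraction of its samples predicted correctly."""
    support = Counter(y_true)
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    return {cls: hits[cls] / support[cls] for cls in support}


# Invented toy labels: 95 normal, 3 DoS, 2 U2R; one U2R is missed.
y_true = ["normal"] * 95 + ["DoS"] * 3 + ["U2R"] * 2
y_pred = ["normal"] * 95 + ["DoS"] * 3 + ["normal", "U2R"]

overall = accuracy(y_true, y_pred)
recalls = recall_per_class(y_true, y_pred)
```

Here a single misclassified U2R sample barely moves overall accuracy but halves the recall for that class, because the class has only two samples.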
Key Technique - Subflow
The Subflow operator allows you to embed an entire workflow within another workflow. In this case, we embed the data transformation flow inside the modeling flow. Without the embedded flow, users of the modeling flow would need to run the transformation flow first, because the modeling flow depends on the dataset that the transformation flow produces. With the Subflow operator, users need only run the modeling flow.
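As a loose analogy only (this is not Spotfire Data Science code, and the Subflow operator is configured visually rather than written), embedding a subflow resembles one function calling another so that the caller has a single entry point. Both function names below are hypothetical stand-ins, and the "model" step is reduced to a trivial summary to keep the sketch short.

```python
def transformation_flow(raw_events):
    """Stand-in for the data transformation workflow."""
    return [dict(e, is_attack=int(e["label"] != "normal")) for e in raw_events]


def modeling_flow(raw_events):
    """Stand-in for the modeling workflow with the transformation embedded."""
    # The embedded step: callers never run transformation_flow themselves.
    prepared = transformation_flow(raw_events)
    attack_share = sum(r["is_attack"] for r in prepared) / len(prepared)
    return {"rows": len(prepared), "attack_share": attack_share}


# Invented input, for illustration only.
summary = modeling_flow([
    {"label": "normal"},
    {"label": "smurf"},
    {"label": "normal"},
    {"label": "portsweep"},
])
```

The caller passes raw events and receives the final output, with no need to know that a preparation step exists.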
Check It Out!
For access to this Playbook, including its workflows, sample data, a PowerPoint summary, and expert support from Spotfire® Data Science data scientists, contact your Spotfire® Data Science sales representative.