If you are new to analyzing large data sets, specifically those with many variables, you may not have heard of the Curse of Dimensionality. The phrase sounds like the title of a low-budget horror film, but it is a real predictive analytics problem.
Difficulties arise when trying to analyze data that has hundreds or thousands of predictor variables. While it may feel like you are in a horror film when dealing with large data sets, it doesn't have to be a film with a bad ending. A better understanding of the Curse is the first step to working with this type of data.
Richard Bellman is usually credited with first using the term Curse of Dimensionality in his work on dynamic programming in the late 1950s and early 1960s. If you break the term into its two components:
- Curse refers to the difficulties that arise when the number of predictors increases
- Dimensionality refers to the number of dimensions or predictors (variables) in a data set
The Curse of Dimensionality is the exponentially increasing difficulty of finding discernible patterns in the data, or a global optimum in the parameter space when fitting a model, as the number of predictors increases.
It's hard to visualize what this means, but here is a common example. Imagine a single line 100 yards long, about the length of an American football field. This is a single dimension. Now drop a coin somewhere along this line. Imagine walking on this line while looking for the coin. Yeah, that doesn't sound too hard.
Now imagine adding another dimension, another 100-yard line. Now you are searching the area of a square: 100 yards by 100 yards. Trying to find a coin dropped somewhere inside that square is a considerably more difficult task.
Now add another dimension so that you have a 100-yard cube. If a coin were to be dropped somewhere inside that cube, you may be in there for weeks before finding that coin! As you add predictors, the volume of the space becomes so large that it is increasingly difficult to discern any meaning from the data.
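To put rough numbers on the analogy, here is a minimal NumPy simulation of the football-field search. The 1-yard "close enough" radius and the million blind guesses are made-up illustrative choices, not part of the original analogy.

```python
import numpy as np

# Simulate the football-field search: how often does a blind guess land
# within one yard of the coin? (Radius and guess count are illustrative.)
rng = np.random.default_rng(0)

for n_dims in (1, 2, 3, 10):
    coin = rng.uniform(0, 100, size=n_dims)                   # where the coin fell
    guesses = rng.uniform(0, 100, size=(1_000_000, n_dims))   # blind guesses
    hits = np.linalg.norm(guesses - coin, axis=1) < 1.0       # within one yard?
    print(f"{n_dims:2d} dimension(s): hit rate {hits.mean():.6f}")
```

The hit rate collapses by orders of magnitude with each added dimension; by ten dimensions a blind search is effectively hopeless, which is the Curse in miniature.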
There are many techniques to deal with this Curse.
You can discover which variables are most important in predicting the desired outcome (i.e., feature selection) using techniques such as the following (a short sketch of a few of them appears after the list):
- Advanced Trees (C&RT)
- Advanced CHAID
- Boosted Trees
- Feature Selection (univariate correlation)
- Lasso Regression
- Random Forest (multivariate correlation)
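To make a few of these concrete, here is a rough scikit-learn sketch of a univariate filter, Lasso regression, and random forest importance. The synthetic data set and every parameter value (features kept, number of trees, and so on) are illustrative assumptions, not recommendations.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LassoCV

# Synthetic data: 1,000 rows, 200 predictors, only 10 of which are informative.
X, y = make_classification(n_samples=1000, n_features=200,
                           n_informative=10, random_state=0)

# Univariate filter: keep the 10 predictors with the strongest individual signal.
filter_idx = SelectKBest(f_classif, k=10).fit(X, y).get_support(indices=True)

# Lasso regression: predictors whose coefficients shrink to exactly zero are dropped.
lasso = LassoCV(cv=5, random_state=0).fit(X, y)
lasso_idx = np.flatnonzero(lasso.coef_)

# Random forest: rank predictors by impurity-based importance and keep the top 10.
forest = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)
forest_idx = np.argsort(forest.feature_importances_)[::-1][:10]

print("Filter keeps:   ", sorted(filter_idx))
print("Lasso keeps:    ", sorted(lasso_idx))
print("Forest top 10:  ", sorted(forest_idx))
```

In practice the three methods rarely agree perfectly, which is a useful sanity check: predictors that survive all of them are usually the ones worth keeping.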
Or you can reduce the dimensionality of the data with an algorithm like one of the following (a short sketch follows the list):
- Bayesian
- Chi-square (one of several filter methods, along with information gain and the correlation coefficient)
- Canonical Correlation
- Cluster Analysis
- Factor Analysis
- Multidimensional Scaling
- Optimal Binning
- Principal Components Analysis
- Principal Components & Classification Analysis
- Remove sparse variables / missing data
- Remove zero-variance variables
- Weight of Evidence (WoE): combine groups with similar observed WoE to create new coded predictors with continuous weight-of-evidence values
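As one example, Principal Components Analysis is straightforward to sketch with scikit-learn. The deliberately correlated synthetic data and the 95% variance threshold below are illustrative assumptions.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic, deliberately correlated data: 200 predictors that are really
# driven by only 10 hidden factors (all sizes here are made up).
rng = np.random.default_rng(0)
latent = rng.normal(size=(1000, 10))
loadings = rng.normal(size=(10, 200))
X = latent @ loadings + 0.1 * rng.normal(size=(1000, 200))

# Standardize so no single predictor dominates, then keep just enough
# principal components to explain 95% of the variance.
X_scaled = StandardScaler().fit_transform(X)
pca = PCA(n_components=0.95)
X_reduced = pca.fit_transform(X_scaled)

print(f"Original predictors: {X.shape[1]}")
print(f"Components kept:     {X_reduced.shape[1]}")
print(f"Variance explained:  {pca.explained_variance_ratio_.sum():.2%}")
```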
Or you can do both with Neural Networks.
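One common way a neural network can do both is an autoencoder: a network trained to reconstruct its own inputs through a narrow bottleneck layer, which then serves as the reduced representation. The Keras sketch below uses made-up layer sizes and synthetic data purely to show the mechanics.

```python
import numpy as np
from tensorflow import keras

# Synthetic correlated data: 200 predictors driven by 10 hidden factors
# (all sizes and layer widths below are illustrative assumptions).
rng = np.random.default_rng(0)
latent = rng.normal(size=(1000, 10)).astype("float32")
loadings = rng.normal(size=(10, 200)).astype("float32")
X = latent @ loadings + 0.1 * rng.normal(size=(1000, 200)).astype("float32")

# The autoencoder squeezes 200 inputs through a 10-unit bottleneck and
# tries to reconstruct them; the bottleneck activations are the reduced data.
inputs = keras.Input(shape=(200,))
hidden = keras.layers.Dense(64, activation="relu")(inputs)
bottleneck = keras.layers.Dense(10, activation="relu")(hidden)
hidden_out = keras.layers.Dense(64, activation="relu")(bottleneck)
outputs = keras.layers.Dense(200)(hidden_out)

autoencoder = keras.Model(inputs, outputs)
autoencoder.compile(optimizer="adam", loss="mse")
autoencoder.fit(X, X, epochs=20, batch_size=32, verbose=0)

# Keep only the encoder half to project data into 10 dimensions.
encoder = keras.Model(inputs, bottleneck)
X_reduced = encoder.predict(X, verbose=0)
print(X_reduced.shape)  # (1000, 10)
```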