SAX Encoding
SAX (Symbolic Aggregate approXimation) encoding is a method for simplifying time series by summarizing time intervals. Averaging, binning, and symbolically representing periods of time makes the data much smaller and easier to deal with while still capturing its important aspects. I came across the method when looking at sensor data from the manufacturing industry. Wanting to find anomalous patterns, I discovered that looking at the unique SAX representations led me to the sensors with higher failure rates. Applying the same idea to baseball seasons, I used unique SAX representations to find anomalous seasons.
Our Data
The data used for this analysis is pulled from the pybaseball library in Python. Here is a link to its GitHub repository: https://github.com/jldbc/pybaseball
For my analysis, I took every MLB season since 1880 and viewed each as a time series whose value at any given point is the cumulative number of wins minus the cumulative number of losses. In baseball lingo, each season is a graph of the number of games above or below 0.500 (an equal number of wins and losses). Below is the time series for the 2000 Anaheim Angels:
Representing every season like this, I can store all 2,892 teams in one pandas dataframe:
The NaN values here are due to the differing lengths of MLB seasons over time. In the 19th century, seasons consisted of around 82 games, while since the 1960s most seasons have been twice as long, with around 162 games. Thankfully, SAX encoding handles seasons of differing lengths without issue.
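The construction described above can be sketched in a few lines of pandas. This is a minimal illustration and not the code used for the analysis: the helper name `games_above_500` and the toy win/loss lists are hypothetical, and ties are assumed not to occur.

```python
import numpy as np
import pandas as pd

def games_above_500(results):
    """Cumulative wins minus losses after each game.

    `results` is a sequence of strings that start with "W" or "L"
    (hypothetical input format; ties are not handled).
    """
    steps = [1 if r.startswith("W") else -1 for r in results]
    return pd.Series(np.cumsum(steps), dtype="float64")

# Toy seasons of different lengths (made-up data, not real results).
seasons = {
    "short_season": ["W", "W", "L", "W"],
    "long_season": ["L", "L", "W", "L", "W", "W"],
}

# One row per season, one column per game number.
# Shorter seasons are padded with NaN on the right.
season_df = pd.DataFrame(
    {name: games_above_500(r) for name, r in seasons.items()}
).T
print(season_df)
```

Because pandas aligns the series on game number, the NaN padding for shorter seasons appears automatically.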
Now that there is a numerical representation of each season, I will normalize the data to keep every season on the same scale. After the transformation, each data point is expressed as the number of standard deviations above or below the mean, relative to all the other seasons in the dataframe. The resulting dataframe:
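One way to read this normalization step is as a z-score computed against the pooled mean and standard deviation of every non-missing value in the dataframe. The sketch below assumes that reading; the original may instead have normalized per game number or per season, and the toy values are made up.

```python
import numpy as np
import pandas as pd

# Toy dataframe: one row per season, NaN where a season is shorter.
season_df = pd.DataFrame(
    [[1, 2, 1, 2, np.nan, np.nan],
     [-1, -2, -1, -2, -1, 0]],
    index=["short_season", "long_season"],
    dtype="float64",
)

# Pooled statistics over all non-missing values (an assumption).
values = season_df.to_numpy()
mean = np.nanmean(values)
std = np.nanstd(values)

# Each entry becomes "standard deviations above or below the mean".
# NaN entries stay NaN.
normalized_df = (season_df - mean) / std
print(normalized_df.round(2))
```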
Piecewise Aggregate Approximation
The first step of SAX encoding is performing PAA (Piecewise Aggregate Approximation) on the time series. This method splits the time series into n subsections and then uses the average of each subsection as its new value. Think of PAA as a way to summarize sections of the data. Depending on the number of splits, the resulting dataframe holds a metric of how well a team has been doing in each subsection of the season. In my case, I chose to make n = 5, meaning that each column represents a fifth of the season.
Taking a look at column 0 above, the numbers represent the average performance of the different seasons during the first fifth of the season. As a result, we have metrics that can give a general trend of how the team performed across the whole season.
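A minimal sketch of PAA with n = 5 follows. The function name `paa` is hypothetical. Two assumptions are made: missing values are dropped first so that short and long seasons both yield five segments, and `np.array_split` is used to form nearly equal segments when the length is not divisible by n, which is a simplification of the fractional weighting some PAA definitions use.

```python
import numpy as np

def paa(values, n_segments=5):
    """Piecewise Aggregate Approximation: mean of each of n segments.

    NaN padding is removed first. Assumes the series has at least
    `n_segments` valid points.
    """
    arr = np.asarray(values, dtype=float)
    arr = arr[~np.isnan(arr)]
    chunks = np.array_split(arr, n_segments)
    return np.array([chunk.mean() for chunk in chunks])

# Toy season with 10 games and two NaN padding slots.
season = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, np.nan, np.nan]
print(paa(season))  # five values, one per fifth of the season
```

Each output value summarizes one fifth of the season, regardless of how many games the season had.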
SAX Methodology
After PAA, the beauty of the method comes from converting the results into a single symbolic representation. Now that we have a metric for each 'split' of the season, we can bin these measurements into categories. Here, it is valuable to set up the bins so that teams with different levels of performance are separated. In this case, I chose four bins, represented by an 'alphabet' of A, B, C, and D, which roughly indicate whether a team was horrible, bad, good, or great for a given slice of the season. Looking back at our PAA chart on the left, we can translate those values into the chart on the right and, after concatenating the slices, every season can be represented simply by a "SAX string."
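The binning step can be sketched as follows. The bin edges used in the analysis are not stated, so this example assumes the breakpoints commonly used for a four-letter SAX alphabet, which cut a standard normal distribution into four equally likely regions (about -0.6745, 0, and 0.6745). The function name `sax_string` is hypothetical.

```python
import numpy as np

# Assumed breakpoints: quartiles of a standard normal distribution.
BREAKPOINTS = np.array([-0.6745, 0.0, 0.6745])
ALPHABET = "ABCD"  # A = lowest bin, D = highest bin

def sax_string(paa_values, breakpoints=BREAKPOINTS, alphabet=ALPHABET):
    """Map each PAA value to a letter and join the letters."""
    bin_index = np.digitize(paa_values, breakpoints)
    return "".join(alphabet[i] for i in bin_index)

# Toy PAA output for one season (made-up values).
print(sax_string([-1.2, -0.3, 0.1, 0.5, 1.4]))  # ABCCD
```

Different breakpoints would change which seasons share a string, so the choice of bins directly affects which seasons end up looking unique.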
The value of the SAX strings becomes clear once we count how many seasons share each string. For example, the most frequent SAX string is 'CCCCC,' which represents 628 seasons. What is particularly interesting is looking at the seasons where the SAX string is unique. There are only a few of them, and they are easy to find. A unique encoding tells you that the trend of that particular season was unlike that of any other season in baseball history; according to SAX encoding, these seasons are numerically 'interesting.' As a baseball fan myself, I found the results a bit shocking, and they helped me discover truly unusual teams that I had never known about. Take a look at the visual below:
Highlighted above are some of the unique values from the SAX encoding method. Using my settings, there ended up being 20 total seasons that were unique. From these 20, we discover some remarkable seasons. The 2001 & 2002 Athletics seasons had a movie made about them; the 1972 Philadelphia Phillies are truly one of the most puzzling teams ever; and the 1914 Boston Braves should have a movie made about them! The list goes on...
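Counting the strings and isolating the unique ones takes only a few lines of pandas. The sketch below uses made-up team labels and SAX strings purely for illustration.

```python
import pandas as pd

# Toy mapping of season label -> SAX string (made-up data).
sax = pd.Series({
    "team_a": "CCCCC",
    "team_b": "CCCCC",
    "team_c": "BBCCD",
    "team_d": "CCCCC",
    "team_e": "DDCBA",
    "team_f": "BBCCD",
})

# How many seasons share each SAX string.
counts = sax.value_counts()
most_common = counts.idxmax()

# Seasons whose SAX string appears exactly once.
unique_seasons = sax[sax.map(counts) == 1]

print(most_common, counts[most_common])
print(unique_seasons)
```

In this toy data only `team_e` has a string that no other season shares, which is the same test that singles out the 20 seasons discussed above.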
Here is a graph of what the most common SAX string looks like compared to some of the unique seasons:
SAX String: CCCCC vs. Unique Seasons
Conclusion
To find the most interesting seasons in history, someone could look at every team's time series one by one and note which ones look different. Instead, SAX reduces the dimensionality of thousands of time series and quickly produces results that point to just 20 teams.
While not quite as well known as other techniques, SAX is a great tool for pattern recognition on time series data and was the perfect, simple choice for approaching this question. What's more, the same technique can be used for finding anomalous time series in manufacturing and possibly many other fields, including the medical, automotive, and telecommunications industries.
The SAX Encoding algorithm is available as a plug-in for the Spotfire Data Science - Team Studio platform.