  • Forecasting of Energy Consumption and Generation

    In this article, we describe a step-by-step application for predicting energy consumption and generation. It is a Spotfire-exclusive demo that uses different machine-learning techniques to understand energy consumption and generation patterns and then make future predictions. It includes detailed steps on data preprocessing, exploratory data analysis, time series modeling using three different mathematical models (ARIMA, Holt-Winters, and LSTM), and the evaluation of their performance.


    Forecasting energy consumption and generation is a critical aspect of energy management and planning. It involves predicting future energy demand based on historical data, environmental factors, and other relevant variables. Accurate forecasting is essential for ensuring that energy supply meets demand and for making informed decisions about energy production and distribution. With the increasing emphasis on sustainable energy sources and the need to reduce carbon emissions, energy forecasting has become even more important in recent years. This tool uses a number of different statistical and machine-learning techniques to understand patterns of energy consumption and generation. It's a Spotfire-only demo divided into five parts with eight built-in Python Data functions (Figure 1). We will go through each part step by step.

    You can view the Forecasting of Energy Consumption and Generation demo on the Spotfire Interactive Demo Gallery.


    Figure 1. Cover page. Shows the different steps in this demo.

    Open Source Libraries Used

    The necessary Python libraries are pandas, NumPy, math, lxml, scikit-learn, statsmodels, and TensorFlow. To install them, we use Spotfire's built-in Python package management tools from the Tools menu (Figure 2).


    Figure 2. The installation dialog box in Spotfire.

    Data Functions

    Spotfire Data Functions are the Spotfire way to add pre-built Python and R scripts to Spotfire analyses. They can perform pretty much any type of calculation and return the results to a Spotfire analysis.

    There are eight data functions used in this application, three of which handle the forecasting modeling. All are built using Python (Figure 3).

    1. DataPreparation: In this Data Function, we prepare the data before modeling. We deal with null values, change the data type and format of some columns, and filter the dataset.
    2. GetTimeSeries: This Data Function obtains the time series we want to analyze from the filter selection of the energy consumption sector and the source of electricity generation. It removes unnecessary columns, creates the first- and second-order differenced series, and produces ready-to-use data frames.
    3. TimeSerieDecomposition: In this Data Function, we filter the series according to the selected date range and perform their additive decomposition.
    4. Stationarity: In this code, we test the stationarity of the series. To do this, we obtain the rolling statistics (mean and standard deviation) of the series and perform the Dickey-Fuller test. The Dickey-Fuller test is the most commonly used method to determine whether a series is stationary, but there are alternatives such as the Kwiatkowski-Phillips-Schmidt-Shin (KPSS) test and the Phillips-Perron (PP) test.
    5. Autocorrelation: With this Data Function we generate the ACF (short for AutoCorrelation Function) and PACF (short for Partial AutoCorrelation Function) plots that can help us find appropriate values for the parameters p and q (associated with the non-seasonal component) and the parameters P and Q (associated with the seasonal component) of the ARIMA model.
    6. [Modeling] ARIMA: We search over multiple combinations of the parameters and choose the best SARIMAX model, the one with the lowest AIC (Akaike Information Criterion). We create predictions with the best model and generate the model's performance metrics and residuals.
    7. [Modeling] Holt-Winters: We build a triple Exponential Smoothing model (also called Holt-Winters Exponential Smoothing).
    8. [Modeling] LSTM: We build a Long Short-Term Memory network, a recurrent neural network trained using Backpropagation Through Time that overcomes the vanishing gradient problem.
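    As a rough sketch of what the DataPreparation step does (handle nulls, fix data types, filter rows), here is a minimal pandas example. The table contents, column names, and the "Not Available" placeholder are hypothetical, not the demo's actual code:

```python
import pandas as pd

# Hypothetical raw table mirroring the DataPreparation steps; the sample
# rows and the "Not Available" placeholder are illustrative assumptions.
raw = pd.DataFrame({
    "YYYYMM": ["200001", "200002", "200003", "200004"],
    "Value": ["5.1", "Not Available", "4.8", "5.3"],
    "Description": ["PEC Residential"] * 4,
})

# Deal with null values and fix the data type of the Value column:
# non-numeric text becomes NaN, and those rows are dropped.
raw["Value"] = pd.to_numeric(raw["Value"], errors="coerce")
clean = raw.dropna(subset=["Value"]).reset_index(drop=True)
```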



    Figure 3. Data functions in Spotfire.


    The data used is from the U.S. Energy Information Administration (EIA) for both energy consumption and net electricity generation, where energy consumption is broken down by sector and electricity generation is broken down by source. The consumption and generation data are each given on a monthly basis. The dataset's contents and some useful characteristics to note about the data are as follows (source):

    The energy table:

    • Contains monthly energy consumption by sector for the U.S.
    • Energy consumption is the use of energy as a source of heat or power or as an input in the manufacturing process
    • Primary energy is energy that is first accounted for in a statistical energy balance, before any transformation to secondary or tertiary forms of energy
    • Total energy consumption in sectors consists of primary energy consumption, electricity retail sales, and electrical system energy losses

    The electricity table:

    • Contains monthly net electricity generation for all sectors in the U.S.
    • Net electricity generation is the amount of gross electricity generation less station use (the electric energy consumed at the generating station(s) for station service or auxiliaries)
    • Btu stands for British Thermal Unit

    The columns in the tables are:

    1. YYYYMM: The month of energy use.
    2. Value: The amount of energy consumed/generated.
    3. Description: The description of which sector consumed/generated the electricity.
    4. Unit: The unit of energy used for the value.
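    Because YYYYMM stores the month as a year-month number, it has to be parsed into a real date before any time series work. A minimal pandas sketch with made-up rows; EIA monthly files are assumed here to use month "13" for annual-total rows, which should be dropped:

```python
import pandas as pd

# Hypothetical rows shaped like the tables described above.
df = pd.DataFrame({
    "YYYYMM": [202301, 202302, 202213],  # 202213 assumed to be an annual total
    "Value": [100.5, 98.2, 1200.0],
    "Unit": ["Trillion Btu"] * 3,
})

# errors="coerce" turns invalid months (like 13) into NaT,
# so the annual-total rows can be dropped, leaving clean monthly data.
df["Date"] = pd.to_datetime(df["YYYYMM"].astype(str), format="%Y%m",
                            errors="coerce")
monthly = df.dropna(subset=["Date"])
```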

    You can look at a snapshot of the data below (Figure 4 and Figure 5).


    Figure 4. First rows of the electricity generation dataset.


    Figure 5. First rows of the energy consumption dataset.

    The Description column gives a sector (energy consumption) or source (electricity generation) description. Let's look at all the available description values for each dataset to understand what data is available (Figure 6 and Figure 7).


    Figure 6. Unique descriptions available for the energy consumption dataset.

    We can see that we have a variety of energy consumption sectors (Figure 6), as well as a variety of electricity generation sources (Figure 7).



    Figure 7. Unique descriptions available for the electricity generation dataset.


    On this page, we pre-process the data to simplify the analysis moving forward. The Prepare Data button triggers a Python Data Function to create new cleaned datasets ready for exploration and further analytics steps. The tables on the dashboard show the output of this data function: the monthly electricity generation and the monthly energy consumption by sector.

    Notice that the existing descriptions are quite long, so we use some abbreviations:

    • PEC: Primary Energy consumption
    • TEC: Total Energy Consumption
    • ENG: Electricity Net Generation

    The next step is to select the energy consumption sector and the electricity generation source from the drop-down lists (the previous figures show the values available in each case). Here we have selected the PEC Electric Power Sector and ENG Nuclear Electric Power. Based on this selection, we get the time series we want to analyze by clicking the Get Time Series button in the Text Area.

    We can now start exploring the data to see if we can uncover hidden patterns. The scatterplot on the right allows us to identify the relationship between energy consumption and electricity generation. From this plot, we see that electric power sector consumption tracks nuclear power generation very closely. We create the variable Ratio, the energy consumed (for the selected sector) divided by the energy generated (for the selected source). The side-by-side box plots show how the distribution of this ratio evolves over the months of the year. In this particular case, we see higher ratios in the summer months, particularly July and August: the boxes shift upward in summer and back downward in winter.
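    The Ratio variable and the monthly box-plot grouping can be sketched in pandas as follows. The series are synthetic stand-ins; the demo computes this inside Spotfire:

```python
import pandas as pd

# Hypothetical aligned monthly series for the selected sector and source.
idx = pd.date_range("2020-01-01", periods=24, freq="MS")
consumed = pd.Series([float(100 + i) for i in range(24)], index=idx)
generated = pd.Series([float(50 + i) for i in range(24)], index=idx)

# Ratio of energy consumed to energy generated, as on the Exploration page.
ratio = consumed / generated

# Grouping by calendar month mirrors the side-by-side box plots.
monthly_median = ratio.groupby(ratio.index.month).median()
```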



    Figure 8. Exploration page.

    On the following page, we continue exploring the dataset. The line plot allows us to determine trends and cyclical patterns across time for both energy use and generation. In this example, we notice that both energy consumption and generation exhibit an upward trend over time, with a strong cyclical pattern.

    The heat maps allow us to look at consumption and generation levels month-by-month over time and check if the peak cyclical patterns we see are stable across many decades of data. The color indicates the level of the variable under study. Each colored rectangle, therefore, conveys three numbers: the year (horizontal axis), the month (vertical axis), and the value (energy consumed in the plot above, energy generated in the plot below). The colors are mapped to a gradient scale so that the largest values are always red and the smallest values are always blue.
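    The reshaping behind such a heat map is a simple pivot from a monthly series into a month-by-year grid. A small pandas sketch with synthetic values:

```python
import numpy as np
import pandas as pd

# Hypothetical monthly series reshaped into the year x month grid
# behind the heat maps (year on one axis, month on the other).
idx = pd.date_range("2000-01-01", periods=48, freq="MS")
s = pd.Series(np.arange(48, dtype=float), index=idx)

grid = (s.to_frame("Value")
          .assign(Year=idx.year, Month=idx.month)
          .pivot(index="Month", columns="Year", values="Value"))
```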

    There is a notable increase in both energy consumption and generation in the peak summer months that began in the 1990s. In the earlier years (before 1990), the difference between the peak summer months and the other months is not as marked, although it still exists. Additionally, we see the overall energy consumed and electricity generated increasing over time.

    Finally, we determine which sectors consume the most energy by looking at the box plot on the bottom right. Here we see that the PEC Electric Power Sector has the highest energy consumption across all sectors.


    Figure 9. Second exploration page in Spotfire.

    Time Series Analysis

    This page corresponds to the time series analysis (Figure 10), where we perform a decomposition of the selected time series (for a time period we can choose) and check whether it is stationary.

    The time series decomposition is an important step in understanding what type of model we should use. Is there an overall trend in our data that we should be aware of? Does the data show any seasonal trends? 

    Let's first understand the components of a time series: 

    • Trend: the direction of change over a period of time; in other words, the long-term increase or decrease in the series.

    • Seasonality: periodic behavior, such as spikes or drops, that recurs at regular intervals.

    • Residual (also called noise): irregular fluctuations that cannot be predicted from trend or seasonality; the random variation in the series.

    There are basically two methods to decompose a Time Series: additive and multiplicative. We use the additive model when the seasonal variation is relatively constant over time. The multiplicative model is useful when the seasonal variation increases over time. In this demo, an additive decomposition is more appropriate but you can change this from the drop-down menu in the text area. Then you can click on the Create Time Series Decomposition button and see the results in the plot at the top of the page.

    Another important concept is stationarity. What does it mean for data to be stationary, and why is this important? A time series is considered stationary if its statistical properties (mean, variance, etc.) are invariant with time. A stationary process is quite useful for forecasting: since it contains no trends or longer-term changes, knowing its value today is sufficient to predict its future values.


    ARIMA models can be applied only to stationary data. If the time series has a trend over time, it is non-stationary, and we need to apply differencing to transform it into a stationary one.

    There are two ways to check the stationarity of a time series. The first is by looking at the data: visualizing it should make a changing mean or variance easy to spot. For a more rigorous assessment, there is the Augmented Dickey-Fuller test. If the p-value is lower than the significance level of 0.05, we reject the null hypothesis (that the series is non-stationary) and conclude that the series is stationary. If the p-value is greater than 0.05, we fail to reject the null hypothesis and conclude that the series is non-stationary.

    Once we press the Stationarity button, we can check the results in the line plot and the table at the bottom. Previously, we created the first- and second-order differenced series for both the generation and consumption time series, so if the original time series is non-stationary, we can use a differenced series instead.


    Figure 10. Time Series Analysis page.


    This page is where the time series modeling and forecasting happen (Figure 11). We build three different models: Auto ARIMA, Holt-Winters, and LSTM. We also create predictions and compare the real and forecasted values of the time series to assess how well we did.

    ARIMA, short for "AutoRegressive Integrated Moving Average", is a forecasting algorithm based on the idea that the past values of the time series alone can be used to predict future values. To determine the tuning parameters of the ARIMA model, we can click the Get ACF & PACF button and look at the autocorrelation and partial autocorrelation graphs.

    • The ACF describes how well the present value of the series is related to its past values. The bars of the ACF plot represent the ACF values at increasing lags. The Moving Average order value (q) can be obtained from this plot, identifying the value at which the ACF first crosses the confidence region.

    • The PACF describes the direct relationship between an observation and a given lag. Instead of correlating the present value with all lags like the ACF, it measures the correlation with each lag after removing the effect of the intermediate lags. The Auto-regressive order value (p) can be obtained from this plot by identifying the lag at which the plot first crosses the confidence region.

    We search for multiple combinations of the SARIMAX parameters and choose the best model. In Machine Learning, this process is known as grid search (or hyperparameter optimization) for model selection. We find the best SARIMAX parameters by comparing the AIC (Akaike Information Criterion) value. The AIC measures how well a model fits the data while taking into account the overall complexity of the model. A model that fits the data very well while using lots of features will be assigned a larger AIC score than a model that uses fewer features to achieve the same goodness of fit. Therefore, we are interested in finding the model that yields the lowest AIC value.

    For each model (ARIMA, Holt-Winters, and LSTM), there is a suite of hyperparameters that can be altered on the left of the page. We can expand each accordion bar to show the different options exposed; these are sent to the Modeling Data Functions when we click the Build Models button.

    The summary attribute output by ARIMA returns a significant amount of information, but we'll focus our attention on the table of coefficients at the bottom. The coef column shows the weight (i.e., importance) of each feature and how each one impacts the time series. The P>|z| column informs us of the significance of each feature weight. If each weight has a p-value lower than or close to 0.05, it is reasonable to retain all of them in the model.

    Now that we have our models built, we want to use them to make forecasts. The line plot in the middle shows the real values vs. the predicted values of the time series. We are using the models to predict time periods for which we already have data, so we can measure how accurate the forecasts are. In this case, we are creating predictions for 25 months, but you can change this value by editing the Periods to forecast parameter in the Auto ARIMA accordion bar.


    Figure 11. Forecasting page.


    This is the final step of this tool (Figure 12), but a very important one: when fitting seasonal ARIMA models (and any other models, for that matter), it is essential to run model diagnostics to ensure that none of the assumptions made by the model have been violated.

    For an ideal model, the residuals are uncorrelated and normally distributed with zero mean. If the seasonal ARIMA model does not satisfy these properties, it is a good indication that it can be further improved.

    We will focus on the residuals of the training data. The residuals are the difference between the model's one-step-ahead predictions and the real values of the time series.

    The model diagnostics will suggest that the model residuals are normally distributed based on the following:

    • Standardized residuals over time (top left plot): The residual errors should fluctuate around a mean of zero and have a uniform variance. 

    • Histogram plus estimated density: In the bottom left plot, we should see that the grey KDE line follows closely with the N(0,1) line (where N(0,1) is the standard notation for a normal distribution with mean 0 and standard deviation of 1). This is a good indication that the residuals are normally distributed.

    • Normal Q-Q plot: All the dots should fall perfectly in line with the 45-degree reference line, indicating a normal distribution of the residuals. Any significant deviations would imply the distribution is skewed. When the distribution is skewed only at the extremes, the Q-Q plot will show a deviation from the straight line only in the tails of the plot. This indicates that there is more variability in the extreme values of the distribution than would be expected under a normal distribution. If the skewness is only present in the tails, it suggests that the majority of the distribution is relatively normal or symmetric. 

    • AutoCorrelation Function (bottom right plot): 95% of the correlations at lags greater than zero should not be significant (they should fall inside the confidence area), showing that the residuals have low correlation with lagged versions of themselves.

    These charts can lead us to conclude whether our ARIMA model produces a satisfactory fit that could help us understand our time series data and forecast future values. If we don't get a satisfactory fit, we can change some parameters of the ARIMA model to improve it. For example, our grid search only considered a restricted set of parameter combinations, so we may find better models if we widen the search.

    It is also useful to quantify the predictive performance of our forecasts. Different metrics can be applied to evaluate a forecasting model under different circumstances. Some of them are:

    • Root Mean Square Error (RMSE)
    • Mean Absolute Error (MAE)
    • Mean Absolute Percentage Error (MAPE)
    • Mean Error (ME)
    • Mean Percentage Error (MPE)

    We can see the results of these metrics in the table at the top left.
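    These metrics are simple to compute from the actual and forecasted values; a sketch with made-up numbers (not the demo's results):

```python
import numpy as np

# Hypothetical actual vs. forecast values for the hold-out period.
actual = np.array([100.0, 110.0, 120.0, 130.0])
forecast = np.array([102.0, 108.0, 123.0, 128.0])

errors = actual - forecast
rmse = np.sqrt(np.mean(errors ** 2))            # Root Mean Square Error
mae = np.mean(np.abs(errors))                   # Mean Absolute Error
mape = np.mean(np.abs(errors / actual)) * 100   # Mean Absolute Percentage Error
me = np.mean(errors)                            # Mean Error
mpe = np.mean(errors / actual) * 100            # Mean Percentage Error
```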


    Figure 12. Evaluation page.


    In this article, we described a step-by-step application for predicting energy consumption and generation: a Spotfire-exclusive demo that uses different machine-learning techniques to understand energy consumption and generation patterns and then make future predictions. We walked through data preprocessing, exploratory data analysis, time series modeling with three different mathematical models (ARIMA, Holt-Winters, and LSTM), and the evaluation of their performance.

    If you would like to explore more AI Apps and data science work from the Spotfire Data Science Team, please visit Spotfire Apps: Data Science in Operations.


