Making Sense of Diesel Prices

In this case study, we examine closely the highway retail price ($/Gallon) for “No.2 Ultra Low Sulfur (0-15 ppm) Diesel” in the EIA nine (9) PADD regions. We carry on principal component analysis in an attempt to find a minimal subset of the principal components that capture (or explains) the variation (spreads) in prices across different regions with a minimal loss of information.

The sample data start on February 5th, 2007, and end on May 6th, 2013 (327 observations). Each observation represents the weekly average prices in nine regions:

  1. East Coast
  2. New England
  3. Central Atlantic (PADD 1B)
  4. Lower Atlantic (PADD 1C)
  5. Midwest
  6. Gulf Coast
  7. Rocky Mountain
  8. West Coast
  9. California
No.2 ultra low sulfur highway weekly retail price average for the Nine (9) EIA PADD regions

The weekly prices across the different regions are highly correlated:

Correlation table for the spot weekly prices of No.2 Ultra low sulfur diesel across the Nine(9) EIA PADD regions
data plot for the spot weekly average of ultra low sulfur diesel in Nine(9) EIA PADD regions

In this study, we’ll investigate the drivers behind the price variation among the different regions, and attempt to imply the physical representation of those components using each region’s price loadings.


Now we are ready to conduct our principal component analysis. First, select an empty cell in your worksheet where you wish the output to be generated, then locate and click on the principal component (PCA) icon in the NumXL tab (or toolbar).

Principal component analysis icon in NumXL toolbar or ribbon in Excel

The principal component analysis wizard appears.

General tab of NumXL Principal component analysis dialog or wizard

Select the cells-range for the five input variable values.


  1. Leave out the last three observations, so our input data ends on April 15th, 2013. The remaining three points will be used for comparing the forecast values later on.
  2. Leave the “Variables Mask” field blank for now. We will revisit this field in later entries.

Next, select the “Options” tab.

Options tab of NumXL principal component analysis wizard or dialog

Initially, the tab is set to the following values:

  • “Standardize Inputs” is checked. Leave this option checked.
  • “Principal Component Output” is checked. Leave it checked.
  • The significance level (aka. ) is set to 5%.
  • Under the “principal component,” check the “Values” option, so the generated output tables include the principal component values for different dates.
  • “Input Variables” is unchecked. Leave it unchecked.

Now, click the “Missing Values” tab.

Missing values tab in NumXL principal component analysis dialog or wizard

In this tab, you can select an approach to handle missing values in the data set (X). By default, any missing value found in any observation would exclude the observation from the analysis.

This treatment is a good approach for our analysis, so let’s leave it unchanged.

Now, click “OK” to generate the output tables.

Output tables generated by NumXL principal component analysis wizard or dialog


1. Statistics

Principal components statistics table for the 9 spot prices of No.2 ultra low sulfur diesel in 9 EIA PADD regions generated by NumXL principal component wizard

In the table above, we show the variance of each principal component and the proportion of the input (standardized) data set’s total variance variable that it accounts for. Examining the table closer, the 1st two components capture 99.8% of the data set variation.

2. Loadings

In the loadings table, we outline the weights of the principal component in each input variable:

Loadings of each input variable in the different principal component

Examining the input variables (i.e. region price) loadings for the first component shows a uniform loading for all variables. This can be interpreted as the level-factor (a price that is locale-neutral).

For the second factor, the picture is a bit different:

  • For all PADD regions in the east, the loading is negative.
  • For PADD regions in the west, the loading is positive.
  • Gulf coast’s loading is slightly negative.
  • Midwest loading is slightly positive.

This factor’s loading can be viewed as proximity to (or availability of) refinery capacity or import ports (example: New York harbor). The second factor reflects the cost of transportation and tax of the fuel.

The diesel spot prices loadings in the first principal component

Note: The loadings of the input variables for the 1st component are very comparable, so, in effect, the second component (factor) is what drives the price differential between the different PADD regions.

3. Principal Components Values

Let’s examine the values of the first two principal components.

first and second principal components data plot

The two-time series exhibit some seasonality, although it is more apparent in the second factor. Furthermore, the first principal component exhibits a pattern close to crude oil prices.

Plot for the first principal component and WTI spot prices

NOTE: The lack of an exact match may be attributed to other costs incurred in the making of No. 2 Ultra Low Sulfur Diesel: labor, energy prices, raw material, etc. Furthermore, refineries build up an inventory of products (e.g. diesel) in anticipation of the seasonal demand peaks, so there may be a lag.

4. Adding WTI spot prices

As a last thought on the WTI spot price, we will include the WTI spot price in our input data set and re-examine the input loadings. Intuitively, adding an input variable for the raw material price (i.e. crude oil) with the finished product prices in the same data set will likely reveal another driver: cost of production.

principal component analysis output tables for ten (10) variables: Nine (9) EIA PADD regions diesel spot prices and WTI spot price

Now, we’d need three drivers to account for 99.8% of the price variation.

principal component summary statistics output table for the ten (10) variables: Nine (9) EIA PADD regions diesel spot prices and WTI spot price

IMPORTANT: The principal components of the new data set are not necessarily the same as the ones we computed earlier with only the diesel prices.

Let’s look at the input variable’s loadings in each principal component (i.e. driver):

principal component loadings output table for each of the ten (10) variables ( 9 EIA PADD regions diesel spot prices and WTI spot price) in the different principal components


  1. The loadings for the first component are similar to ones we calculated earlier with only the diesel prices. Note that the WTI loadings are slightly lower than their diesel counterparts. Again, we’ll designate this factor as the general price level (region neutral).
  2. The loadings for the second factor are very different now, and the loading for WTI is negative (-%94.5) while all the rest are positive. We can designate this factor as the cost of cracking diesel of crude.
  3. The loading for the 3rd factor is very similar to the loading of the second component in the earlier data sets. Again, the loading varies based on the location (east vs. west). The WTI, Gulf Coast, and Midwest are almost neutral.

Let’s now plot the factor’s time series.

time series plot for the 1st and third principal component of the new data set (10 variables: 9 spot prices of EIA PADD regions and WTI spot price

In general terms, the first and the third component are very similar to the first two components we calculated earlier (w/o WTI in the data set).

For the second component, we hypothesize this one as the proxy of the diesel cracking cost.

time series plot for the second principal component of the new data set (10 variables: 9 spot prices of EIA PADD regions and WTI spot price


In this study, we examined a non-trivial application for PCA. We attempted to explain the price variation among different PADD regions by uncovering the driving forces behind the prices.

What’s next?

We mentioned earlier that refineries build up an inventory of products (e.g. diesel) in anticipation of the seasonal demand peaks, so there may be a lag. How do we capture this variable? Futures: we can include the future prices of diesel and crude oil.


Intuitively, futures prices reflect market anticipation of (1) future demand (2) future storage cost, and possibly a premium for supply scarcity. This application is intended to give you a sample of how to apply PCA and time series, as well as how to use or interpret the variables’ loadings in deriving a practical proxy for them.

In sum, PCA is a mathematical procedure that NumXL can help you execute. Making sense of and interpreting the results is where your expertise and intuition are indispensable.



Please sign in to leave a comment.

Was this article helpful?
0 out of 0 found this helpful