# Which autocorrelation (ACF) estimation method should I use?

Before we can answer this question, let’s briefly review the definition of autocorrelation. In principle, the autocorrelation of a time series $\{x_{t}\}$ at lag k is the cross-correlation between the time series and its k-lagged version (i.e. $\{x_{t-k}\}$).

$$\rho_k = \frac{\gamma_k }{\gamma_o }$$

Where

• $\rho_k$ is the population autocorrelation for lag k.
• $\gamma_k$ is the population autocovariance for lag k.
• $\gamma_o$ is the population variance.

Using a finite-length sample of the time series, an estimate of the autocorrelation ($\hat{\rho_k}$) can be obtained as follows.

$$\hat{\gamma_k}= E\left [ (x_{t-k}-\mu) \times (x_{t}-\mu) \right ] = \frac{\sum_{t=k+1}^{N}(x_{t-k}-\mu)(x_{t}-\mu)}{N-k}$$

$$\hat{\gamma_o }=E\left [ (x_{t}-\mu)^{2} \right ]=\sigma ^{2}$$

$$\hat{\rho_k}=\frac{1}{(N-k)\sigma ^{2}}\times \sum_{t=k+1}^{N}(x_{t}-\mu)(x_{t-k}-\mu)$$

Where

• $\mu$ is the time series population mean
• $\sigma^{2}$ is the time series population variance.
• $0 <k < N$

Sounds simple? Let’s delve into practical considerations.

In practice, the true mean and true variance of the time series are almost never known, so they have to be estimated from the sample data. This leaves us with a few possibilities:

## Method 1: Sample Autocorrelation

$\mu$ and $\sigma^{2}$ are replaced with the sample average ($\bar{x}$) and the biased sample variance ($s^{2}$):

$$\bar{x}=\frac{1}{N}\times \sum_{t=1}^{N}x_{t}$$

$$s^{2}=\frac{1}{N}\sum_{t=1}^{N}(x_{t}-\bar{x})^{2}$$

$$\hat{\rho_k}=\frac{N}{N-k}\times \frac{\sum_{t=k+1}^{N}(x_{t}-\bar{x})(x_{t-k}-\bar{x})}{\sum_{t=1}^{N}(x_{t}-\bar{x})^{2}}$$

For $N\gg k$, the factor $\frac{N}{N-k}$ is close to one, and the formula above simplifies to:

$$\hat{\rho_k}\approx \frac{\sum_{t=k+1}^{N}(x_{t}-\bar{x})(x_{t-k}-\bar{x})}{\sum_{t=1}^{N}(x_{t}-\bar{x})^{2}}$$
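To make the two Method 1 formulas concrete, here is a minimal Python sketch using NumPy. The function name `sample_acf` and the `small_sample_correction` flag are hypothetical names chosen for illustration; this is not NumXL's code, only a direct transcription of the formulas above.

```python
import numpy as np

def sample_acf(x, k, small_sample_correction=False):
    """Method 1 sample autocorrelation at lag k (illustrative sketch)."""
    x = np.asarray(x, dtype=float)
    n = x.size
    if not 0 < k < n:
        raise ValueError("lag k must satisfy 0 < k < N")
    d = x - x.mean()               # deviations from the full-sample average
    num = np.sum(d[k:] * d[:-k])   # sum over t = k+1..N of (x_t - xbar)(x_{t-k} - xbar)
    den = np.sum(d * d)            # N times the biased sample variance
    rho = num / den                # simplified form, appropriate when N >> k
    if small_sample_correction:
        rho *= n / (n - k)         # restores the N / (N - k) factor
    return rho

x = [1.0, 2.0, 3.0, 4.0, 5.0]
print(sample_acf(x, 1))                                # simplified formula
print(sample_acf(x, 1, small_sample_correction=True))  # with the N/(N-k) factor
```

For this five-point series the two variants differ noticeably, since N is not much larger than k; for long series the difference becomes negligible.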

This method yields a biased estimator of the autocorrelation. To make things worse, the values calculated (as a function of k) don’t form a valid autocorrelation function, in the sense that we can’t define a theoretical process having exactly those values.

This method is implemented in the NumXL ACF function as the “sample autocorrelation method (default)”.

Why do we care about this method?

The “sample autocorrelation” method is found in many academic textbooks and is implemented in many popular software packages. NumXL includes this method for benchmarking and for completeness.

## Method 2: Periodogram-based (Spectral Density) Estimate

There is a strong relationship between the time series periodogram (spectral analysis) and its autocovariance function.
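This article does not spell out the formula behind the periodogram-based estimate, so the following is only a sketch of the textbook relationship (the Wiener-Khinchin theorem): the autocovariance sequence is the inverse Fourier transform of the periodogram. The function name `periodogram_acf` and the zero-padding length are assumptions made for illustration, and NumXL's actual procedure (for example, any tapering or smoothing of the periodogram) may differ.

```python
import numpy as np

def periodogram_acf(x, max_lag):
    """Autocorrelations for lags 0..max_lag via the raw periodogram (sketch)."""
    x = np.asarray(x, dtype=float)
    n = x.size
    if not 0 < max_lag < n:
        raise ValueError("max_lag must satisfy 0 < max_lag < N")
    d = x - x.mean()                     # remove the sample average
    nfft = 2 * n                         # zero-pad so lags do not wrap around
    spectrum = np.fft.rfft(d, n=nfft)    # one-sided discrete Fourier transform
    periodogram = (spectrum.real ** 2 + spectrum.imag ** 2) / n
    acov = np.fft.irfft(periodogram, n=nfft)[: max_lag + 1]  # autocovariances
    return acov / acov[0]                # normalize by the lag-0 value

print(periodogram_acf([1.0, 2.0, 3.0, 4.0, 5.0], 2))
```

Note that this raw, zero-padded version reproduces the simplified Method 1 values exactly; any accuracy difference between the two methods would have to come from refinements to the periodogram that are not described here.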

Although the periodogram-based method computes a biased estimate of the autocorrelation, the error is generally smaller than one from other methods (e.g. Method 1).

This method suffers from the same issues as Method 1: the estimates are biased, and the calculated values (as a function of k) don’t always form a valid autocorrelation function.

This method is implemented in the NumXL ACF function as the “periodogram-based estimate.”

## Method 3: Cross-correlation

We treat the original time series and its k-lagged version as two separate time series and calculate the Pearson cross-correlation value.

Consider a finite stationary time series $\{x_{t}\}$ of N observations:

$$x_{t}=\left \{ {x_{1}, x_{2}, x_{3}\cdots, x_{N}} \right \}$$

And its k-lagged version $\{x_{t-k}\}$:

$$x_{t-k}=\left \{x_{1-k}, x_{2-k},\cdots, x_{1},\cdots, x_{N-k} \right \}$$

Since the values of the time series before $t=1$ are not available, we drop the first k observations of the original series, along with the k unavailable leading values of its lagged version.

$$x_{t}^{*}=\left \{ x_{k+1}, x_{k+2}, x_{k+3},\cdots, x_{N} \right \}$$

$$x_{t-k}^{*}=\left \{x_{1}, x_{2},\cdots, x_{N-k} \right \}$$

Now we have two time series with ($N-k$) observations each. Their sample averages are calculated as follows:

$$\bar{x}=\frac{\sum_{t=k+1}^{N}x_{t}}{N-k}$$

$$\bar{x}_{k}=\frac{\sum_{t=1}^{N-k}x_{t}}{N-k}$$

And the unbiased sample estimates of the variances:

$$s^{2}=\frac{\sum_{t=k+1}^{N}(x_{t}-\bar{x})^{2}}{N-k-1}$$

$$s_{k}^{2}=\frac{\sum_{t=1}^{N-k}(x_{t}-\bar{x}_{k})^{2}}{N-k-1}$$

So, Pearson’s cross-correlation estimate for the two time series is:

$$\hat{\rho_k}=\frac{1}{N-k-1}\times\frac{\sum_{t=k+1}^{N}(x_{t}-\bar{x})(x_{t-k}-\bar{x}_{k})}{s\times s_{k}}$$

$$\hat{\rho_k}=\frac{\sum_{t=k+1}^{N}(x_{t}-\bar{x})(x_{t-k}-\bar{x}_{k})}{\sqrt{\sum_{t=k+1}^{N}(x_{t}-\bar{x})^{2}\times \sum_{t=k+1}^{N}(x_{t-k}-\bar{x}_{k})^{2}}}$$
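The cross-correlation method can be sketched in a few lines of Python with NumPy. The function name `pearson_acf` is a hypothetical name used for illustration and is not part of NumXL; the sketch simply slices the series into its two overlapping sub-series and applies Pearson's formula, in which the $N-k-1$ factors cancel.

```python
import numpy as np

def pearson_acf(x, k):
    """Method 3: Pearson correlation between x_t and x_{t-k} (sketch)."""
    x = np.asarray(x, dtype=float)
    n = x.size
    if not 0 < k < n - 1:
        raise ValueError("need at least two overlapping pairs: 0 < k < N - 1")
    lead = x[k:]     # x_{k+1}, ..., x_N
    lag = x[:-k]     # x_1, ..., x_{N-k}
    dl = lead - lead.mean()   # each sub-series is centered on its own average
    dg = lag - lag.mean()
    # A constant sub-series has zero variance and makes the ratio undefined (nan).
    return np.sum(dl * dg) / np.sqrt(np.sum(dl ** 2) * np.sum(dg ** 2))

x = [2.0, 4.0, 1.0, 5.0, 3.0, 6.0]
print(pearson_acf(x, 2))
print(np.corrcoef(x[2:], x[:-2])[0, 1])   # same value from NumPy's built-in
```

Unlike Method 1, each sub-series gets its own mean and variance, so the result for any single lag always lies between -1 and 1.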

The autocorrelation values computed using this method (as a function of k) form a valid autocorrelation function, in the sense that it is possible to define a theoretical process having exactly that autocorrelation. This is not the case with Method 1 and Method 2.

Which method to use?