DATA SCIENCE USING PYTHON

ONLINE TRAINING COURSE

Autoregressive Integrated Moving Average (ARIMA) Model

The value of a time series at a given point in time is often highly correlated with the values that precede and follow it. In simple terms, the observations are not independent. This can be checked with the Durbin-Watson statistic, as follows:

d = \(\frac{\sum_{t=2}^{n} (e_t - e_{t-1})^2}{\sum_{t=1}^{n} e_t^2}\)
where \(e_t\) = error term (residual) at time t and n = number of observations

  • The Durbin-Watson (d) test can be used to test for autocorrelation in time-series data.
  • A smaller d (near 0) indicates positive autocorrelation.
  • A larger d (near 4) indicates negative autocorrelation.
  • A d of approximately 2 indicates that no autocorrelation is present in the time series.
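
As a minimal sketch of the statistic above, the function below computes d from a list of residuals using plain Python. The function name `durbin_watson` and both residual lists are made up for illustration; they do not come from any dataset in this course.

```python
def durbin_watson(residuals):
    """Return the Durbin-Watson statistic d for a sequence of residuals."""
    e = list(residuals)
    if len(e) < 2:
        raise ValueError("need at least two residuals")
    # Numerator: sum of squared differences between consecutive residuals.
    numerator = sum((e[t] - e[t - 1]) ** 2 for t in range(1, len(e)))
    # Denominator: sum of squared residuals.
    denominator = sum(x ** 2 for x in e)
    return numerator / denominator


# Hypothetical residuals that flip sign every period (negative autocorrelation).
alternating = [1.0, -1.0, 1.0, -1.0, 1.0, -1.0]
# Hypothetical residuals that stay on one side for several periods (positive autocorrelation).
drifting = [0.5, 0.6, 0.4, -0.4, -0.6, -0.5]

d_alternating = durbin_watson(alternating)  # expected to be well above 2
d_drifting = durbin_watson(drifting)        # expected to be well below 2
print(d_alternating, d_drifting)
```

In practice the residuals would come from a fitted regression or trend model, and d would be compared against tabulated lower and upper critical values rather than judged by eye.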

If autocorrelation is present in the time series, we can model it using Autoregressive Integrated Moving Average (ARIMA) models.

First-Order Autoregressive Model (association between two consecutive values in the series)

\(Y_i = A_0 + A_1 Y_{i-1} + \varepsilon_i\)

Second-Order Autoregressive Model (association between values that are two periods apart)

\(Y_i = A_0 + A_1 Y_{i-1} + A_2 Y_{i-2} + \varepsilon_i\)

pth-Order Autoregressive Model (association between values that are p periods apart)

\(Y_i = A_0 + A_1 Y_{i-1} + A_2 Y_{i-2} + \dots + A_p Y_{i-p} + \varepsilon_i\)

\(Y_i\) = Observed value of the time series at time i
\(Y_{i-1}\) = Observed value of the time series at time i - 1
\(A_0\) = Fixed parameter (intercept) to be estimated using least-squares regression
\(A_1, A_2, \dots, A_p\) = Autoregressive parameters to be estimated using least-squares regression
\(\varepsilon_i\) = Random error at time i
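
The least-squares estimation described above can be sketched with NumPy by regressing each value on its p previous values. This is an illustrative sketch, not the only way to fit such a model: the helper name `fit_ar` is hypothetical, and the series is synthetic, generated without noise from assumed parameters A0 = 2, A1 = 0.5, A2 = 0.3 so that the fit can be checked against known values.

```python
import numpy as np


def fit_ar(series, p):
    """Estimate [A0, A1, ..., Ap] of a pth-order autoregressive model by least squares."""
    y = np.asarray(series, dtype=float)
    n = len(y)
    if n - p < p + 1:
        raise ValueError("too few observations for this order")
    # Each row is [1, Y(i-1), Y(i-2), ..., Y(i-p)] for i = p, ..., n-1.
    columns = [np.ones(n - p)] + [y[p - k : n - k] for k in range(1, p + 1)]
    X = np.column_stack(columns)
    target = y[p:]
    coeffs, *_ = np.linalg.lstsq(X, target, rcond=None)
    return coeffs


# Synthetic, noise-free series built from assumed parameters (illustration only).
series = [1.0, 2.0]
for _ in range(10):
    series.append(2.0 + 0.5 * series[-1] + 0.3 * series[-2])

coeffs = fit_ar(series, p=2)
print(coeffs)  # should be close to the assumed values 2.0, 0.5, 0.3
```

With real data the random error term is not zero, so the estimates will only approximate the underlying parameters.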


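Forecast: \(\hat{Y}_{n+j} = A_0 + A_1 \hat{Y}_{n+j-1} + A_2 \hat{Y}_{n+j-2} + \dots + A_p \hat{Y}_{n+j-p}\)

where n = number of observed values, j = number of periods ahead, and \(\hat{Y}_{n+j-k}\) is the observed value \(Y_{n+j-k}\) whenever j - k ≤ 0.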
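
The forecast equation is applied recursively: each new forecast is fed back in as a lagged value for the next period. The sketch below shows this in plain Python. The function name `forecast_ar`, the coefficients, and the two history values are all assumed for illustration.

```python
def forecast_ar(history, coeffs, steps):
    """Forecast `steps` periods ahead from coefficients [A0, A1, ..., Ap]."""
    a0, lag_coeffs = coeffs[0], list(coeffs[1:])
    p = len(lag_coeffs)
    if len(history) < p:
        raise ValueError("history must contain at least p values")
    values = list(history)
    forecasts = []
    for _ in range(steps):
        # Most recent value pairs with A1, the one before it with A2, and so on.
        recent = values[-1 : -p - 1 : -1]
        y_hat = a0 + sum(a * v for a, v in zip(lag_coeffs, recent))
        forecasts.append(y_hat)
        # Later forecasts reuse this forecast as if it were an observation.
        values.append(y_hat)
    return forecasts


# Assumed second-order coefficients and last two observed values (illustration only).
forecasts = forecast_ar(history=[4.0, 6.0], coeffs=[2.0, 0.5, 0.3], steps=2)
print(forecasts)
```

Here the first forecast uses only observed values (2 + 0.5 × 6 + 0.3 × 4), while the second already depends on the first forecast, so errors can accumulate as j grows.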


Notes:

It is important to select the appropriate order for an autoregressive model. You can use a t-test to check the significance of the highest-order autoregressive parameter. For the Eastman Kodak revenue series, the third-order autoregressive parameter is not significant; hence, a second-order model is fitted to the time series and used to forecast the values for the years 2000 and 2001.
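
One way to carry out the t-test is to divide the estimated highest-order parameter by its standard error and compare the result with a critical t value on n - 2p - 1 degrees of freedom. The sketch below assumes ordinary least-squares standard errors; the helper name `highest_order_t_stat` is hypothetical and the series values are invented for illustration (they are not the Eastman Kodak data).

```python
import numpy as np


def highest_order_t_stat(series, p):
    """Return (t statistic for Ap, degrees of freedom) for a pth-order AR fit."""
    y = np.asarray(series, dtype=float)
    n = len(y)
    df = n - 2 * p - 1  # (n - p) usable rows minus (p + 1) estimated parameters
    if df <= 0:
        raise ValueError("too few observations for this order")
    columns = [np.ones(n - p)] + [y[p - k : n - k] for k in range(1, p + 1)]
    X = np.column_stack(columns)
    target = y[p:]
    coeffs, *_ = np.linalg.lstsq(X, target, rcond=None)
    residuals = target - X @ coeffs
    # Residual variance and the usual OLS covariance matrix of the estimates.
    sigma2 = residuals @ residuals / df
    cov = sigma2 * np.linalg.inv(X.T @ X)
    std_err_p = np.sqrt(cov[p, p])
    return coeffs[p] / std_err_p, df


# Invented series for illustration only.
series = [12.1, 13.4, 13.0, 14.8, 15.1, 16.9, 16.2,
          18.0, 19.3, 18.7, 20.5, 21.9, 21.2, 23.0]

t_stat, df = highest_order_t_stat(series, p=2)
print(t_stat, df)
```

If the absolute t statistic is smaller than the critical value for the chosen significance level, the highest-order term can be dropped and the test repeated on the next lower order.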

You should be equally cautious about selecting a high-order model, because it requires estimating more parameters. This can cause problems, especially when the number of observations is small.