How to Perform ARIMA Time Series Analysis in Python – Step by Step

In this article, you will learn how to perform time series analysis using the ARIMA (AutoRegressive Integrated Moving Average) method. The dataset we’ll use for this tutorial is the london-daily-temperature dataset, which is freely available.

We will cover the following:

  1. Obtain and Prepare Your Dataset
  2. Perform Seasonal Decomposition
  3. Create and Fit the ARIMA Model
  4. Plot the Model Performance
  5. Examine Model Metrics
  6. Next Steps in Time Series

 

1. Obtain and Prepare Your Dataset

The dataset for this analysis is the london-daily-temperature dataset, which contains temperature records from 1979 to 2023. We read it into a pandas DataFrame and prepare it as follows:

import pandas as pd

#1. Load the dataset
data = pd.read_csv('london_daily_temperature.csv')

#2. Extract just the DATE and TX columns
data = data[['DATE', 'TX']].copy()

#3. Convert the DATE column to datetime
data['DATE'] = pd.to_datetime(data['DATE'], format='%Y%m%d')

#4. Rename the columns to meaningful names
data.rename(columns={'DATE': 'Date', 'TX': 'Temperature'}, inplace=True)

#5. Set the index of the dataframe
data.set_index('Date', inplace=True)

#6. The later snippets refer to the series as `train`. No train/test split is
#   shown in this article, so we assume here that it is the full prepared dataset.
train = data

Note that in #5, we set the index of the dataframe to the Date column instead of keeping the default integer index. In a time series, each observation is identified by its timestamp, which is expected to be unique, and a date index lets pandas, statsmodels and matplotlib align, slice and plot the data by date.
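Before moving on, it can be worth confirming that the new index really behaves like a clean time index. The sketch below is an illustration on a small made-up frame: the values and the helper name check_time_index are hypothetical and not taken from the dataset. The same checks could be run on the prepared dataframe.

```python
import pandas as pd

# Tiny made-up frame standing in for the real dataset (values are illustrative only)
sample = pd.DataFrame({
    'DATE': ['19790101', '19790102', '19790103', '19790104'],
    'TX': [23.0, 16.0, 13.0, -3.0],
})
sample['DATE'] = pd.to_datetime(sample['DATE'], format='%Y%m%d')
sample = sample.rename(columns={'DATE': 'Date', 'TX': 'Temperature'}).set_index('Date')


def check_time_index(df):
    """Hypothetical helper: summarise basic properties of a date-indexed frame."""
    return {
        'is_unique': bool(df.index.is_unique),                   # no duplicated dates
        'is_sorted': bool(df.index.is_monotonic_increasing),     # chronological order
        'missing_values': int(df['Temperature'].isna().sum()),   # gaps in the readings
    }


report = check_time_index(sample)
print(report)
```

If missing_values is not zero on the real data, the gaps would need to be filled or dropped before decomposition, because seasonal_decompose does not accept missing values.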

 

2. Perform Seasonal Decomposition

Seasonal decomposition lets us see the four components of the data:

  • Observed – the original data series you provided.
  • Trend – the long-term progression of the series. It provides a view of the long-term patterns.
  • Seasonality – the repeating short-term cycles. In this example, we use a period of 365, so the seasonal component captures the pattern that repeats every year.
  • Residuals – what remains after removing the trend and seasonality. They represent irregular, random fluctuations in the data.

# Decompose the time series data
import matplotlib.pyplot as plt
from statsmodels.tsa.seasonal import seasonal_decompose

decomposition = seasonal_decompose(train['Temperature'], model='additive', period=365)

fig, axes = plt.subplots(4, 1, figsize=(12, 12))  # Create a figure and 4 subplots

decomposition.observed.plot(ax=axes[0], title='Observed')
decomposition.seasonal.plot(ax=axes[1], title='Seasonal Component')
decomposition.trend.plot(ax=axes[2], title='Trend Component')
decomposition.resid.plot(ax=axes[3], title='Residual Component')

plt.tight_layout()
plt.show()

 

[Figure: Time-Series Decomposition]
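To make the additive model concrete: with model='additive', the components are defined so that observed = trend + seasonal + residual. The sketch below is a simplified, pure-Python illustration of that idea on a made-up series with a period of 4. It is not the statsmodels implementation, and the function name additive_decompose is hypothetical.

```python
def additive_decompose(values, period):
    """Simplified additive decomposition (illustrative; assumes an even period)."""
    n = len(values)
    half = period // 2

    # Trend: centred moving average spanning one full period, with half weight
    # on the two end points because the period is even
    trend = [None] * n
    for t in range(half, n - half):
        window = values[t - half:t + half + 1]
        total = 0.5 * window[0] + sum(window[1:-1]) + 0.5 * window[-1]
        trend[t] = total / period

    # Seasonal: average detrended value at each position in the cycle,
    # shifted so that the seasonal effects sum to zero over one period
    buckets = [[] for _ in range(period)]
    for t in range(n):
        if trend[t] is not None:
            buckets[t % period].append(values[t] - trend[t])
    means = [sum(b) / len(b) for b in buckets]
    centre = sum(means) / period
    seasonal = [means[t % period] - centre for t in range(n)]

    # Residual: whatever is left after removing trend and seasonality
    resid = [None if trend[t] is None else values[t] - trend[t] - seasonal[t]
             for t in range(n)]
    return trend, seasonal, resid


# Made-up series: a linear trend plus a repeating 4-step pattern
pattern = [2.0, -1.0, -3.0, 2.0]
series = [10.0 + 0.5 * t + pattern[t % 4] for t in range(24)]

trend, seasonal, resid = additive_decompose(series, period=4)
print([round(s, 2) for s in seasonal[:4]])
```

As in the decomposition plot, the trend and residual are undefined at the very start and end of the series, because the moving average needs a full window on both sides.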

 

3. Create and Fit the ARIMA Model

Now we create the ARIMA model. ARIMA stands for AutoRegressive (AR) Integrated (I) Moving Average (MA), and the name reflects its three components:

  • AutoRegressive (AR): the model relates the current value to its previous values.
  • Integrated (I): the number of times the data is differenced to achieve stationarity.
  • Moving Average (MA): the model relates the current value to past forecast errors.

The order=(1, 1, 1) argument used below sets one autoregressive lag, one round of differencing and one moving-average lag.

# Create and fit an ARIMA model
from statsmodels.tsa.arima.model import ARIMA

model = ARIMA(train['Temperature'], order=(1, 1, 1))
model_fit = model.fit()
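To see how the three parts of order=(1, 1, 1) fit together, the sketch below computes one-step-ahead predictions by hand. It differences the series once (I), predicts each difference from the previous difference (AR) and the previous prediction error (MA), then adds the predicted change back onto the last observed value. The coefficients phi and theta are made-up values for illustration, not estimates from the temperature data, and the starting error is assumed to be zero. statsmodels estimates the coefficients from the data and handles the start of the series differently.

```python
def arima_111_one_step(values, phi, theta):
    """One-step-ahead predictions for an ARIMA(1,1,1) with given (hypothetical) coefficients."""
    # I: first difference, diffs[i] = values[i + 1] - values[i]
    diffs = [values[t] - values[t - 1] for t in range(1, len(values))]

    predictions = []      # predictions for values[2], values[3], ...
    prev_error = 0.0      # assumed starting forecast error
    for i in range(1, len(diffs)):
        # AR term uses the previous difference, MA term the previous forecast error
        diff_hat = phi * diffs[i - 1] + theta * prev_error
        prev_error = diffs[i] - diff_hat
        # Undo the differencing: add the predicted change to the last observed level
        predictions.append(values[i] + diff_hat)
    return predictions


temps = [10.0, 12.0, 13.0, 12.0, 14.0]   # made-up readings
preds = arima_111_one_step(temps, phi=0.5, theta=0.5)
print(preds)  # -> [13.0, 13.5, 10.75]
```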

 

4. Plot the Model Performance

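Once the model is fitted to our dataset, we can access its in-sample predictions through the fittedvalues attribute of the fitted model. In the code snippet below, we create two plots:

  • the original temperature values
  • the fitted values from the model’s fittedvalues attribute

From the plot, we can see that the fitted values closely follow the observed values, which suggests a reasonable in-sample fit. Keep in mind that these are one-step-ahead, in-sample predictions: each one is made with the previous day’s actual temperature already known, so a close match is expected and does not by itself show how well the model forecasts unseen data.

# Create both the original and fitted plots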
train['Temperature'].plot(figsize=(14, 6), title='Daily Temperature in London')
model_fit.fittedvalues.plot(color='red') # fitted plot
plt.show()

[Figure: Time Series Fit]

 

# Plot the performance for a five-month slice of data
in_window = (train.index > '2010-01-01') & (train.index <= '2010-05-28')

train['Temperature'][in_window].plot(figsize=(12, 6), label='Original')
model_fit.fittedvalues[in_window].plot(label='Fitted')

plt.legend()
plt.show()

[Figure: Time Series Fit on Data Slice]
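Because the index is a DatetimeIndex, the same window can also be selected with label-based slicing through .loc, which some readers find easier to read than a boolean mask. This is a minimal sketch on a made-up daily series (the values are illustrative only). Note that .loc slices include both end points, so the start date is written as 2010-01-02 to match the strict > comparison in the snippet above.

```python
import pandas as pd

# Made-up daily series covering the first half of 2010
idx = pd.date_range('2010-01-01', '2010-06-30', freq='D')
demo = pd.Series(range(len(idx)), index=idx, name='Temperature')

# Boolean mask, as in the snippet above
masked = demo[(demo.index > '2010-01-01') & (demo.index <= '2010-05-28')]

# Equivalent label-based slice (both end points are included)
sliced = demo.loc['2010-01-02':'2010-05-28']

print(masked.equals(sliced))
```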

 

5. Examine Model Metrics

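We now examine the model metrics: MSE (Mean Squared Error), RMSE (Root Mean Squared Error), MAE (Mean Absolute Error) and the R-squared score. From the output, we can see an R2 of about 0.87. These are in-sample metrics, computed on the same data the model was fitted to.

However, the MSE is large. Part of the explanation is that MSE is expressed in squared units of the data, so it grows quickly with the size of the errors. The RMSE of about 23.8 is in the same units as the Temperature column and is easier to interpret: if the TX column is stored in tenths of a degree Celsius (an assumption worth checking against the dataset’s documentation), it corresponds to roughly 2.4 °C. Outliers in the original dataset would also inflate the MSE, since large errors are squared.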

# Access the model metrics
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
import numpy as np

mse = mean_squared_error(train['Temperature'], model_fit.fittedvalues)
rmse = np.sqrt(mse)
mae = mean_absolute_error(train['Temperature'], model_fit.fittedvalues)
r2 = r2_score(train['Temperature'], model_fit.fittedvalues)

print('MSE:', mse)
print('RMSE:', rmse)
print('MAE:', mae)
print('R2:', r2)

The output is given below:

MSE: 566.5997912577614
RMSE: 23.80335672248268
MAE: 18.618313872554147
R2: 0.868029565993606
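For reference, the four metrics can be written out directly from their definitions. The sketch below is a pure-Python illustration on made-up numbers, and the function name regression_metrics is hypothetical. It also makes visible that RMSE and MAE are in the same units as the data, while MSE is in squared units.

```python
import math


def regression_metrics(actual, predicted):
    """Compute MSE, RMSE, MAE and R2 from their definitions."""
    n = len(actual)
    errors = [a - p for a, p in zip(actual, predicted)]
    mse = sum(e * e for e in errors) / n
    mae = sum(abs(e) for e in errors) / n
    mean_actual = sum(actual) / n
    ss_res = sum(e * e for e in errors)                     # unexplained variation
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)    # total variation around the mean
    return {'MSE': mse, 'RMSE': math.sqrt(mse), 'MAE': mae, 'R2': 1 - ss_res / ss_tot}


actual = [10.0, 12.0, 14.0, 16.0]      # made-up observations
predicted = [11.0, 12.0, 13.0, 18.0]   # made-up predictions
metrics = regression_metrics(actual, predicted)
print(metrics)
```

On these made-up numbers, the single error of 2 contributes 4 of the 6 units of squared error, which illustrates why a few large misses can dominate the MSE.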

 

6. Next Steps in Time Series

Having covered the basics of time series analysis, we will continue with a deeper dive in subsequent articles. The following topics will be covered:

  • Stationarity and Differencing
  • Test for Stationarity – Augmented Dickey-Fuller test (ADF)
  • Autocorrelation and Partial Autocorrelation
  • Interpreting Autocorrelation Plots
  • Seasonal ARIMA
  • Prophet for Business Forecasting

kindsonthegenius

Kindson Munonye is currently completing his doctoral program in Software Engineering at the Budapest University of Technology and Economics.
