Contents
 What is Time Series?
 Components of a Time Series
 When NOT to use Time Series?
 What is Stationarity?
 What is the ARIMA model?
 Example: Forecast future
Acknowledgement: I took help from several popular online blogs and video tutorials like simplilearn, edureka, machinelearningmastery, otext, just to name a few. I rephrased some sentences and would like to give them full credits.
What is Time Series?
If we want to predict something in future i.e. stock prices, sales etc, we use time series analysis over past data. We can predict daily stock price, weekly interest rates, sales figure etc where the outcome varies over time. In such scenarios, we use time series forecasting.
A time series is a set of observation taken at specified times usually at equal intervals. It is used to predict future values based on the previously observed values.
By time series analysis we not only predict the future values but also able to understand past behavior, plan for the future and evaluate current accomplishment.
Componets of Time Series

Trend:
 A trend exists when there is a longterm increase or decrease in the data. It does not have to be linear. It could go from an increasing trend to a decreasing trend.
 There is a trend in the antidiabetic drug sales data shown below.
Reference: https://otexts.org/fpp2/fpp_files/figurehtml/a101.png 
Seasonality:
 A seasonal pattern occurs when a time series is affected by seasonal factors such as the time of the year or the day of the week.
 Seasonality is always of a fixed and known frequency.
 The monthly sales of antidiabetic drugs above show seasonality which is induced partly by the change in the cost of the drugs at the end of the calendar year.

Irregularity:
 It is also called residual.
 These are erratic in nature, unsystematic, basically happens for short durations and not repeating.
 Let's take an example. If there is a sudden natural disaster i.e flood in a town out of nowhere, a lot of people buy medicine and ointment and after sometime when everything settles down, sales of those ointments might go down. Nobody knows how many numbers of sales are going to happen that time and also cannot force the event not to happen. These type of random variations are called irregularity.

Cyclic:
 A cycle occurs when the data exhibit rises and falls that are not of a fixed frequency.
 These fluctuations are usually due to economic conditions, and are often related to the “business cycle”. The duration of these fluctuations is usually at least 2 years.
When NOT to use Time Series?
Time series is not applicable when
 values are constant such as y = c.
 values are in the form of a function, such as y = sin(x), y = log(x).
What is Stationarity?
 Time series has a particular behavior over time, there is a very high probability that it will follow the same in the future.
 A stationary time series is one whose properties do not depend on the time at which the series is observed.
 Time series with trends (varying mean over time), or with seasonality (variations of a specific time frame), are not stationary — the trend and seasonality will affect the value of the time series at different times.
 On the other hand, a white noise series is stationary — it does not matter when you observe it, it should look much the same at any point in time.
 Some cases can be confusing — a time series with cyclic behavior (but with no trend or seasonality) is stationary. This is because the cycles are not of a fixed length, so before we observe the series we cannot be sure where the peaks and troughs of the cycles will be.
In general, a stationary time series will have no predictable patterns in the longterm. Time plots will show the series to be roughly horizontal (although some cyclic behavior is possible), with constant variance.
Consider the nine series plotted in Figure 2. Which of these do you think are stationary?
Obvious seasonality rules out series (d), (h) and (i). Trends and changing levels rules out series (a), (c), (e), (f) and (i). Increasing variance also rules out (i). That leaves only (b) and (g) as stationary series.
At first glance, the strong cycles in series (g) might appear to make it nonstationary. But these cycles are aperiodic. In the longterm, the timing of these cycles is not predictable. Hence the series is stationary.
How to Remove Stationarity?
Following characteristics can be found in stationary time series. We have to get rid of those characteristics.
 Constant means according to the time
 Constant variance at different time intervals
 Autocovariance that does not depend on time i.e. correlation between values of any time interval t and their previous time intervals such as t1, t2 etc.
Tests to Check Stationarity

Rolling Statistics
 Plot the moving average or moving variance and see if it varies with time.
 It's more of a visual technique, not suitable for production.
 It can be used as POC (proof of concept) purpose.

ADCF (Augmented DickeyFuller) Test
 The test starts with a null hypothesis that states that the time series is nonstationary, followed by some statistical results based on the hypothesis.
 The test results comprise of a Test Statistic Value and some Critical Values.
 We will go the rules that check stationarity in a time series later on in our example.
 More about this algorithm can be found at https://en.wikipedia.org/wiki/Augmented_Dickey%E2%80%93Fuller_test
What is the ARIMA model?
 ARIMA stands for AutoRegressive Integrated Moving Average
 ARIMA is one of the best models to work with time series data.
 Combination of two models, AR and MA
 AR (AutoRegressive) model: Correlation between data in previous time steps and current time step. Parameterized with p, called autoregressive lags term.
 MA (Moving Average) model: Moving average to smoothen data and average the noises and irregularities. Parameterized with q, called moving average term.
 These two models are then integrated with a degree of differencing from previous p time intervals.
ACF and PACF
To find p and q, we have to determine PACF and ACF which are autocorrelation function and partial autocorrelation function respectively of a time series. ACF is used to find autocorrelation between timestamp t and tk where k = 1, 2, ..., t1. This allows the model to predict according to the trend. Again, PACF is also determined for the same purpose but over residuals to smoothen out noises.
If you are interested to know some details of ACF and PACF, please refer to this link: https://www.youtube.com/watch?v=ZjaBn93YPWo.
Example: Forecast Future
Build a model to forecast the demand or passenger traffic in airplanes. The data is classified in date/time and the passengers traveling per month.
Import necessary libraries
We will be using numpy for matrix manipulation, pandas for loading dataset from csv, matplotlib to visualize some data in different formats i.e. graphs, charts etc. Also, we will be using some other libraries to perform statistical analysis later on.
import numpy as np
import pandas as pd
import matplotlib.pylab as plt
%matplotlib inline
from matplotlib.pylab import rcParams
rcParams['figure.figsize'] = 10, 61
Load the dataset airpassengers.csv
First, we need to convert date strings as python date time format. Also, we want to select Month as the index of this data table.
dataset = pd.read_csv('data/airpassengers.csv')
# parse strings to datetime type
dataset['Month'] = pd.to_datetime(dataset['Month'], infer_datetime_format=True)
indexed_data = dataset.set_index(['Month'])
Let's print some data from the beginning of the table.
indexed_data.head(5)
Let's plot this data as a graph with matplotlib.
## plot graph
plt.xlabel("Date")
plt.ylabel("#Passengers")
plt.plot(indexed_data)
Check if the data is Stationary or not
We will be using both rolling statistics and ADCF test for this purpose.
Determining Rolling Statistics
Since the dataset is monthly, we can set the rolling window as yearly (12 months), halfyearly (6 months) or even quarterly (4 months). We will be using 12 months or 12 rows for this case to determine the means, standard deviations and check if they are constant or not.
# Determining rolling statistics
rolling_mean = indexed_data.rolling(window=12).mean()
rolling_std = indexed_data.rolling(window=12).std()
print rolling_mean, rolling_std
#Passengers
Month
19490101 NaN
19490201 NaN
19490301 NaN
19490401 NaN
19490501 NaN
19490601 NaN
19490701 NaN
19490801 NaN
19490901 NaN
19491001 NaN
19491101 NaN
19491201 126.666667
19500101 126.916667
19500201 127.583333
19500301 128.333333
19500401 128.833333
19500501 129.166667
19500601 130.333333
19500701 132.166667
19500801 134.000000
19500901 135.833333
19501001 137.000000
19501101 137.833333
19501201 139.666667
19510101 142.166667
19510201 144.166667
19510301 147.250000
19510401 149.583333
19510501 153.500000
19510601 155.916667
... ...
19580701 59.590013
19580801 65.557054
19580901 65.557054
19581001 65.106207
19581101 64.593074
19581201 64.530472
19590101 63.627229
19590201 61.759553
19590301 61.597422
19590401 60.284678
19590501 60.008270
19590601 63.009138
19590701 71.987951
19590801 80.049369
19590901 81.485451
19591001 79.680422
19591101 74.498729
19591201 69.830097
19600101 66.624399
19600201 61.866180
19600301 61.382741
19600401 60.171472
19600501 60.184565
19600601 65.021849
19600701 77.194510
19600801 83.630500
19600901 84.617276
19601001 82.541954
19601101 79.502382
19601201 77.737125
[144 rows x 1 columns]
We will see that the first 11 rows have NaN values. We took means and standard deviations over every 12 rows from the beginning. Since these rows don't fulfill the rolling window, we have got data starting from the 12th row.
Let's plot these rolling statistics.
# Plot rolling statistics
orig = plt.plot(indexed_data, color='blue', label='Original')
mean = plt.plot(rolling_mean, color='red', label='Rolling Mean')
std = plt.plot(rolling_std, color='black', label='Rolling Std')
plt.legend(loc='best')
plt.title('Rolling Mean & Standard Deviation')
plt.show(block=False)
We can see that the means and standard deviations are not constant and we can say that this time series is not stationary.
Perform DickyFuller Test
We will be using statsmodels library for this purpose. The adfuller test will give us the following data
 Test Statistic
 pvalue
 Number of lags used
 Number of observations used
If we don't want to dive into the details of DickyFuller algorithm and just want to know whether the time series is stationary or not from this test results, then we can see the summary of the algorithm below
 If
 pvalue > 0.05: The data is nonstationary.
 pvalue <= 0.05: The data is stationary.
 If
 test statistic value is significantly less than critical values (possibly less than 1% of critical values): The data is stationary.
 test statistic value is greater than critical values: The data is nonstationary.
# Perform DickyFuller test
from statsmodels.tsa.stattools import adfuller
print 'Result of Dicky=Fuller Test'
dftest = adfuller(indexed_data['#Passengers'], autolag='AIC')
dfoutput = pd.Series(dftest[0:4], index=['Test Statistic', 'pvalue', '#Lags Used', '#Observations Used'])
for key, value in dftest[4].items():
dfoutput['Critical Value (%s)'%key] = value
print(dfoutput)
Result of Dicky=Fuller Test
Test Statistic 0.815369
pvalue 0.991880
#Lags Used 13.000000
#Observations Used 130.000000
Critical Value (5%) 2.884042
Critical Value (1%) 3.481682
Critical Value (10%) 2.578770
dtype: float64
In summary, we can say that the data is nonstationary since from adfuller test we can see that pvalue is greater than 0.05 and statistic value is greater than critical values.
Formalize Stationarity Checking in a Function
We can now formalize the stationarity checking scripts using a function test_stationarity. We will perform both rolling statistics and ADFC test to check stationarity.
from statsmodels.tsa.stattools import adfuller
def test_stationarity(timeseries):
# Determinign rolling statistics
rolling_mean = timeseries.rolling(window=12).mean()
rolling_std = timeseries.rolling(window=12).std()
# Plot rolling statistics
orig = plt.plot(timeseries, color='blue', label='Original')
mean = plt.plot(rolling_mean, color='red', label='Rolling Mean')
std = plt.plot(rolling_std, color='black', label='Rolling Std')
plt.legend(loc='best')
plt.title('Rolling Mean & Standard Deviation')
plt.show(block=False)
print 'Result of Dicky=Fuller Test'
dftest = adfuller(timeseries['#Passengers'], autolag='AIC')
dfoutput = pd.Series(dftest[0:4], index=['Test Statistic', 'pvalue', '#Lags Used', '#Observations Used'])
for key, value in dftest[4].items():
dfoutput['Critical Value (%s)'%key] = value
print(dfoutput)
Make The Data Stationary
There are several ways to make the data stationary, but it totally depends on the data. For this case, we can take the log scale data and subtract moving average from that to make it somewhat stationary.
# Estimating trend
indexed_data_log_scale = np.log(indexed_data)
moving_average = indexed_data_log_scale.rolling(window=12).mean()
moving_std = indexed_data_log_scale.rolling(window=12).mean()
plt.plot(indexed_data_log_scale)
plt.plot(moving_average, color='red')
dataset_log_scale_minus_moving_average = indexed_data_log_scale  moving_average
dataset_log_scale_minus_moving_average.dropna(inplace=True)
dataset_log_scale_minus_moving_average.head(12)
Now call this function with our newly transformed data.
test_stationarity(dataset_log_scale_minus_moving_average)
Result of Dicky=Fuller Test
Test Statistic 3.162908
pvalue 0.022235
#Lags Used 13.000000
#Observations Used 119.000000
Critical Value (5%) 2.886151
Critical Value (1%) 3.486535
Critical Value (10%) 2.579896
This is better than original and looks little bit stationary and adfuller test also suggests that this transformed series is stationary. Let's do some other type of transformations to make it more stationary. We can start with subtracting exponentially weighted average from log scaled values and check again its stationarity.
dataset_log_scale_ewm = indexed_data_log_scale.ewm(halflife=12, min_periods=0, adjust=True).mean()
dataset_log_scale_minus_ewm = indexed_data_log_scale  dataset_log_scale_ewm
test_stationarity(dataset_log_scale_minus_ewm)
Result of Dicky=Fuller Test
Test Statistic 3.601262
pvalue 0.005737
#Lags Used 13.000000
#Observations Used 130.000000
Critical Value (5%) 2.884042
Critical Value (1%) 3.481682
Critical Value (10%) 2.578770
So both of the above transformations can make the data stationary. Now let's start working on finding p, d and q to feed in our ARIMA model.
First of all, let's get first order differences.
dataset_log_first_order_diff = indexed_data_log_scale  indexed_data_log_scale.shift()
dataset_log_first_order_diff.dropna(inplace=True)
test_stationarity(dataset_log_first_order_diff)
Result of Dicky=Fuller Test
Test Statistic 2.717131
pvalue 0.071121
#Lags Used 14.000000
#Observations Used 128.000000
Critical Value (5%) 2.884398
Critical Value (1%) 3.482501
Critical Value (10%) 2.578960
Decompose into Trend, Seasonality and Residual
Now, let's decompose values into the trend, seasonal and residual components. We can do so using statsmodels.tsa.seasonal package.
from statsmodels.tsa.seasonal import seasonal_decompose
decomposition = seasonal_decompose(indexed_data_log_scale)
trend = decomposition.trend
seasonal = decomposition.seasonal
residual = decomposition.resid
plt.subplot(411)
plt.plot(indexed_data_log_scale, label='Original')
plt.legend(loc='best')
plt.subplot(412)
plt.plot(trend, label='Trend')
plt.legend(loc='best')
plt.subplot(413)
plt.plot(seasonal, label='Seasonality')
plt.legend(loc='best')
plt.subplot(414)
plt.plot(residual, label='Residuals')
plt.legend(loc='best')
plt.tight_layout()
Now, we can test that if the residual is stationary or not using our test_stationarity functiuon.
decomposed_log_data = residual
decomposed_log_data.dropna(inplace=True)
test_stationarity(decomposed_log_data)
Result of Dicky=Fuller Test
Test Statistic 6.332387e+00
pvalue 2.885059e08
#Lags Used 9.000000e+00
#Observations Used 1.220000e+02
Critical Value (5%) 2.885538e+00
Critical Value (1%) 3.485122e+00
Critical Value (10%) 2.579569e+00
dtype: float64
We can easily say that the residual is not stationary. Since MA component deals with residual, we have to transform this data to make it stationary, so that it smoothen it out to predict what will happen next. But for now, don't worry about that since ACF and PACF curves will help us to find proper values of p and q, and feeding those into the ARIMA model will do the job for us.
Determine ACF and PACF
Now, we know the value of d. We have to calculate the values of p and q. We can find these values from ACF and PACF graph. Let's plot those graph.
# ACF and PACF plots:
from statsmodels.tsa.stattools import acf, pacf
lag_acf = acf(dataset_log_first_order_diff, nlags=20)
lag_pacf = pacf(dataset_log_first_order_diff, nlags=20, method='ols')
# Plot ACF
plt.subplot(121)
plt.plot(lag_acf)
plt.axhline(y=0, linestyle='', color='gray')
plt.axhline(y=1.96/np.sqrt(len(dataset_log_first_order_diff)), linestyle='', color='gray')
plt.axhline(y=1.96/np.sqrt(len(dataset_log_first_order_diff)), linestyle='', color='gray')
plt.title('ACF')
# Plot PACF
plt.subplot(122)
plt.plot(lag_pacf)
plt.axhline(y=0, linestyle='', color='gray')
plt.axhline(y=1.96/np.sqrt(len(dataset_log_first_order_diff)), linestyle='', color='gray')
plt.axhline(y=1.96/np.sqrt(len(dataset_log_first_order_diff)), linestyle='', color='gray')
plt.title('PACF')
plt.tight_layout()
In order to calculate p and q, we have to check what is the value that cuts off its origin for the first time. Looking at PACF curve, we can see that at the time around 2, the first time curve cuts the horizontal axis, hence the value of p is 2 and similarly from ACF curve, we find that the value of q is also around 2.
Feed p, d, and q into ARIMA Model
We have found the values of p, d, and q, which are 2, 1 and 2 respectively. Let's fit our ARIMA model with these values.
from statsmodels.tsa.arima_model import ARIMA
#ARIMA Model
model = ARIMA(indexed_data_log_scale, order=(2, 1, 2))
results_ARIMA = model.fit(disp=1)
plt.plot(dataset_log_first_order_diff)
plt.plot(results_ARIMA.fittedvalues, color='red')
plt.title('RSS: %.4f'% sum((results_ARIMA.fittedvaluesdataset_log_first_order_diff['#Passengers'])**2))
Text(0.5,1,'RSS: 1.0292')
Check If Fitted Values Depicts Original Curve
Since we are working with firstorder differences, we have to covert fitted values back to the real values. We went through the following sequence from given values to fitted values:
 Take logarithm of the values
 Take firstorder differences of the values found in step no 1
So we have to reverse the process to predict the next values:
 Get back the first item from the log scaled values and place it at index 0 after shifting the fitted values once
 Take cumulative sum to get back logarithm scaled values
 Take exponentials of the values found in step no 2
# Get back to the state before differening
predictions_ARIMA_log = pd.Series(indexed_data_log_scale.loc[indexed_data_log_scale.index[0]].append(results_ARIMA.fittedvalues).cumsum(), index=indexed_data_log_scale.index)
# Get exponetials
predictions_ARIMA = np.exp(predictions_ARIMA_log)
# Plot predicted data vs original data
plt.plot(indexed_data)
plt.plot(predictions_ARIMA)
If we don't want to be bothered about all these conversions, well, we have a piece of good news. ARIMA model also can plot the fitted values efficiently, even for future time intervals.
Predict and Plot Future Values Using The ARIMA Model
Now, our dataset has 144 rows, that means 12 years' values. If we want to plot the graph for next 1 year, we have to call plot_predict() function from ARIMA model with parameters 1 (start from the first row) and 204 (first 144 + new 5x12=60 rows).
results_ARIMA.plot_predict(1,204)
The forecast function from ARIMA model with parameter steps=60 will give us the new 60 values.
future = results_ARIMA.forecast(steps=60)
print np.exp(future[1])
[1.08746262 1.11348487 1.12264437 1.12415063 1.12415874 1.12461435
1.12481942 1.12500462 1.12776932 1.13582632 1.14879044 1.16321524
1.1753643 1.18334355 1.1874086 1.18900856 1.18956455 1.18988252
1.19041521 1.1917625 1.1946837 1.19956286 1.2059434 1.21268138
1.21858755 1.22299943 1.22591873 1.22776853 1.22906372 1.23024086
1.23166843 1.23368207 1.23651561 1.24016542 1.24433566 1.2485496
1.25236108 1.2555239 1.25803132 1.26004567 1.26179869 1.26352486
1.26543139 1.26767321 1.27031559 1.27330434 1.27647669 1.2796205
1.2825529 1.28517677 1.28749351 1.28957958 1.29155007 1.29352656
1.29561291 1.29787521 1.30032659 1.30292387 1.30558205 1.30820345]