Several tools for model identification do not require a fitted model. The following are common model-free tools.
The following model identification methods assume the time series is stationary.
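If $X_t$ is white noise, then for large $n$ each sample autocorrelation $\hat{\rho}_k$, $k \neq 0$, is approximately normal with mean 0 and standard deviation $1/\sqrt{n}$.
Therefore, we reject $H_0: \rho_k = 0$ at roughly a 5% Type I error rate if $ |\hat{\rho}_k| > 2 \left( 1/ \sqrt{n} \right) $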
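The white noise check above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function names `sample_autocorrelations` and `lags_outside_limits` are hypothetical, and the autocovariances use the common divisor $n$.

```python
import math
import random


def sample_autocorrelations(x, max_lag):
    """Return [rho_hat_1, ..., rho_hat_max_lag] for the series x."""
    n = len(x)
    mean = sum(x) / n
    dev = [v - mean for v in x]
    gamma0 = sum(d * d for d in dev) / n
    rhos = []
    for k in range(1, max_lag + 1):
        # Sample autocovariance at lag k, divided by n (not n - k).
        gamma_k = sum(dev[t] * dev[t + k] for t in range(n - k)) / n
        rhos.append(gamma_k / gamma0)
    return rhos


def lags_outside_limits(x, max_lag):
    """Lags whose sample autocorrelation exceeds the 2 / sqrt(n) limit."""
    limit = 2.0 / math.sqrt(len(x))
    rhos = sample_autocorrelations(x, max_lag)
    return [k for k, r in enumerate(rhos, start=1) if abs(r) > limit]


# Illustrative data only: simulated Gaussian white noise.
random.seed(0)
noise = [random.gauss(0.0, 1.0) for _ in range(200)]
print(lags_outside_limits(noise, 20))
```

Because each lag is tested at roughly the 5% level, about one lag in twenty is expected to fall outside the limits even for true white noise.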
Generally, the model orders $p$ and $q$ are selected by fitting candidate models with maximum likelihood and choosing the orders that minimize AIC, AICc, or BIC.
For the stationary case, AIC is calculated as
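$$ AIC = \ln \left( \hat{\sigma}_a^2 \right) + 2 \left( \frac{p + q + 1}{n} \right) $$

The first term rewards a small estimated white noise variance and the second penalizes the number of parameters. The idea is to explain as much of the variation as reasonably possible without overfitting.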
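As a small worked sketch of the formula above, the following Python computes AIC for a set of candidate orders and picks the minimum. The helper names `aic` and `best_order` are hypothetical, and the variances in the example are made-up numbers for illustration only, not results from any real fit.

```python
import math


def aic(sigma2_hat, p, q, n):
    """AIC for the stationary case: ln(sigma2_hat) + 2 (p + q + 1) / n."""
    return math.log(sigma2_hat) + 2.0 * (p + q + 1) / n


def best_order(candidates, n):
    """Pick the (p, q) with the smallest AIC.

    candidates maps (p, q) to the estimated white noise variance of that fit.
    """
    return min(candidates, key=lambda pq: aic(candidates[pq], pq[0], pq[1], n))


# Hypothetical variances for four candidate fits to n = 200 observations.
candidates = {(1, 0): 1.30, (2, 0): 1.02, (2, 1): 1.015, (3, 1): 1.014}
print(best_order(candidates, n=200))
```

In this illustration the larger models reduce the variance only slightly, so the penalty term outweighs the gain and the smaller model is chosen.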
Once a model order is selected, the model parameters can be estimated. Maximum likelihood (ML) estimates are typically used unless ML produces a model with root(s) inside the unit circle. Other methods can be used to estimate AR-only models; these are discussed below.
Note: There is an alternative model ID method called the Box-Jenkins method. However, selecting a model by AIC is preferred over the Box-Jenkins method.
There are three methods for estimating AR parameters given the model order.
In the case of ARIMA, the $\left( 1 - B \right)^d$ factors will dominate. The expected behavior is slowly damping sample autocorrelations and wandering in the realization.
1. Stationarize the Data
Take differences of the data until the data appear to be stationary. The number of differences taken to stationarize the data is the order $d$.
2. Model the Stationarized Data
Model the stationarized data with the stationary modeling process described above.
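The differencing in step 1 can be sketched as follows. This is a minimal illustration; the function name `difference` and its `lag` argument (which generalizes the ordinary difference $1 - B$ to $1 - B^{lag}$) are assumptions added here, not part of the original procedure.

```python
def difference(x, d=1, lag=1):
    """Apply the operator (1 - B^lag) to the series x, d times."""
    for _ in range(d):
        # Each pass shortens the series by `lag` observations.
        x = [x[t] - x[t - lag] for t in range(lag, len(x))]
    return x


# A quadratic trend becomes constant after two ordinary differences.
print(difference([0, 1, 4, 9, 16, 25], d=2))
```

After each pass, the plots of the differenced data should be re-examined; differencing more times than needed is not harmless, so stop once the data appear stationary.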
The Tiao-Tsay result states that if a high order AR(p) model is fit to a realization from a non-stationary process, the factors associated with the roots on the unit circle will show up in the factor table.
1. Fit a High Order AR Model
Fit a high order AR model to the realization. Get the factor table and find all the roots near the unit circle (absolute reciprocal root close to 1). These likely represent non-stationary factors associated with the process.
2. Remove the Effects of the Non-Stationary Factors
Remove the effects of the non-stationary factors from the realization by differencing. Note that these factors may be seasonal, ARIMA, or non-conforming (ARUMA).
3. Model the Stationarized Data
Model the stationarized data with the stationary modeling process described above.
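Step 1 of this approach can be sketched with NumPy. This is only an illustration under stated assumptions: the AR coefficients are estimated by ordinary least squares rather than by any particular method from this document, the function names are hypothetical, and the data are a simulated random walk.

```python
import numpy as np


def fit_ar_least_squares(x, p):
    """Least squares estimates of phi_1..phi_p for the mean-corrected series."""
    x = np.asarray(x, dtype=float)
    x = x - x.mean()
    n = len(x)
    # Column j holds X_{t-j-1}; the target is X_t for t = p, ..., n - 1.
    lagged = np.column_stack([x[p - j - 1 : n - j - 1] for j in range(p)])
    phi, *_ = np.linalg.lstsq(lagged, x[p:], rcond=None)
    return phi


def reciprocal_root_moduli(phi):
    """Moduli of the reciprocal roots of 1 - phi_1 z - ... - phi_p z^p."""
    phi = np.asarray(phi, dtype=float)
    # Reciprocal roots r = 1/z solve r^p - phi_1 r^(p-1) - ... - phi_p = 0.
    roots = np.roots(np.concatenate(([1.0], -phi)))
    return np.sort(np.abs(roots))[::-1]


# Simulated random walk, i.e. a process with a (1 - B) factor.
rng = np.random.default_rng(0)
walk = np.cumsum(rng.standard_normal(500))
moduli = reciprocal_root_moduli(fit_ar_least_squares(walk, 8))
print(moduli[:3])
```

A leading modulus close to 1 is the signature of a factor near the unit circle; the remaining, smaller moduli come from the overfit and carry little information.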
Use the general approach to assess whether a seasonal model will be useful. Fit a high order AR(p) model, then match up roots in the overfit factor table with the roots of the seasonal model. If many of the roots match, then a seasonal model may be useful.
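For the comparison, it helps to list the roots a seasonal factor $1 - B^s$ would contribute. The sketch below does this for a given $s$; the function name `seasonal_factor_roots` is hypothetical, and the frequency is expressed in cycles per observation.

```python
import cmath
import math


def seasonal_factor_roots(s):
    """Roots of 1 - z^s = 0 as (modulus, system frequency) pairs.

    Frequencies strictly between 0 and 0.5 correspond to complex conjugate
    pairs (second-order factors); 0 and 0.5 correspond to first-order factors.
    """
    table = []
    for m in range(s // 2 + 1):
        root = cmath.exp(1j * 2.0 * math.pi * m / s)
        table.append((abs(root), m / s))
    return table


# Monthly seasonality: s = 12.
for modulus, freq in seasonal_factor_roots(12):
    print(f"modulus {modulus:.3f}  frequency {freq:.4f}")
```

If the overfit factor table shows roots near the unit circle at most of these frequencies, that supports including the seasonal factor.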
The typical approach of fitting an OLS model to $X_t$ vs time may lead to bad conclusions. The correlated errors lead to an inflated Type I error rate.
The Cochrane-Orcutt procedure can be used to account for this. It fits an OLS model to the data, then fits an AR(1) model to the noise. The estimated $\hat{\phi}_1$ is used to transform the data, which adjusts the p-value of the slope.
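Letting

$$ Y_t = \left( 1 - \hat{\phi}_1 B \right) X_t $$

$$ c = \left( 1 - \hat{\phi}_1 B \right) \hat{a} = \left( 1 - \hat{\phi}_1 \right) \hat{a} $$

$$ t_{\phi_1} = t - \hat{\phi}_1 ( t - 1) $$

$$ g_t = \left( 1 - \hat{\phi}_1 B \right) Z_t $$

where $\hat{a}$ is the intercept from the OLS fit and $Z_t$ is its noise series. Then,

$$ Y_t = c + b t_{\phi_1} + g_t $$

We expect $g_t$ to be approximately uncorrelated noise, so the usual test of the slope $b$ is more trustworthy in this transformed regression.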
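A single pass of the procedure can be sketched in plain Python. This is a simplified illustration under stated assumptions: the AR(1) coefficient is estimated by the lag-1 sample autocorrelation of the OLS residuals, only one iteration is performed, the function names are hypothetical, and the data are simulated.

```python
import math
import random


def ols_line(t, y):
    """Return (intercept, slope, slope standard error) for y = c + b t + e."""
    n = len(t)
    t_bar = sum(t) / n
    y_bar = sum(y) / n
    sxx = sum((ti - t_bar) ** 2 for ti in t)
    sxy = sum((ti - t_bar) * (yi - y_bar) for ti, yi in zip(t, y))
    slope = sxy / sxx
    intercept = y_bar - slope * t_bar
    sse = sum((yi - intercept - slope * ti) ** 2 for ti, yi in zip(t, y))
    return intercept, slope, math.sqrt(sse / (n - 2) / sxx)


def cochrane_orcutt(x):
    """Return (phi_hat, slope, t statistic) from one Cochrane-Orcutt pass."""
    n = len(x)
    t = list(range(1, n + 1))
    a_hat, b_hat, _ = ols_line(t, x)
    # Residuals of the ordinary fit play the role of the noise series.
    z = [xi - a_hat - b_hat * ti for ti, xi in zip(t, x)]
    phi = sum(z[i] * z[i - 1] for i in range(1, n)) / sum(v * v for v in z)
    # Transform both the data and the time index by (1 - phi B).
    y = [x[i] - phi * x[i - 1] for i in range(1, n)]
    t_phi = [t[i] - phi * t[i - 1] for i in range(1, n)]
    _, b, se_b = ols_line(t_phi, y)
    return phi, b, b / se_b


# Simulated line plus AR(1) noise: X_t = 5 + 0.5 t + Z_t, phi_1 = 0.7.
random.seed(1)
z_prev, series = 0.0, []
for step in range(1, 201):
    z_prev = 0.7 * z_prev + random.gauss(0.0, 1.0)
    series.append(5.0 + 0.5 * step + z_prev)
phi_hat, slope, t_stat = cochrane_orcutt(series)
print(phi_hat, slope, t_stat)
```

The t statistic from the transformed regression would then be compared against a t distribution with the degrees of freedom of that regression (number of transformed observations minus 2).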
The central assessment of a model is whether it whitens the residuals. A model may be sufficient if the resulting residuals appear to be white noise. For an AR(p) model, there are $n-p$ conditional residuals. However, backcasting can be used to calculate all $n$ unconditional residuals.
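The count of $n - p$ conditional residuals follows directly from how they are computed: each residual needs the $p$ preceding observations. A minimal sketch, assuming the AR coefficients are already estimated and using the hypothetical helper name `ar_conditional_residuals`:

```python
def ar_conditional_residuals(x, phi):
    """Conditional residuals of an AR(p) fit with coefficients phi_1..phi_p.

    The residual at time t is the mean-corrected X_t minus its one-step
    prediction from the p preceding values, so the first p times are skipped.
    """
    p = len(phi)
    mean = sum(x) / len(x)
    dev = [v - mean for v in x]
    return [
        dev[t] - sum(phi[j] * dev[t - j - 1] for j in range(p))
        for t in range(p, len(x))
    ]


# Tiny illustration: five observations and an AR(1) coefficient of 1.0.
print(ar_conditional_residuals([1, 2, 3, 4, 5], [1.0]))
```

Backcasting, which this sketch does not attempt, supplies estimates of the values before the start of the series so that the first $p$ residuals can also be computed.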
Residual Check 1
Start by checking the residuals visually. Create the typical sample plots of the residuals (realization, autocorrelations, etc). These plots should be consistent with white noise.
Residual Check 2
The Ljung-Box test can be used to test whether the first $K$ residual autocorrelations are jointly zero, rather than testing each lag separately. The hypotheses are
$$ H_0: \,\,\, \rho_1 = \rho_2 = \cdots = \rho_K = 0 $$

$$ H_a: \,\,\, \text{at least one } \rho_k \text{ is not zero}, \,\,\, 1 \le k \le K $$

The test statistic is

$$ L = n \left( n + 2 \right) \sum_{k = 1}^{K} \frac{\hat{\rho}^2_k}{n-k} $$

which, under $H_0$, approximately follows a $\chi^2$ distribution with degrees of freedom equal to $K - p - q$.
Two values of $K$ should be used; 24 and 48 are suggested.
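The Ljung-Box statistic can be computed directly from the residuals. The sketch below uses the standard form of the statistic, with the factor $n(n+2)$, and returns it together with the degrees of freedom; the function name `ljung_box` is hypothetical, and the p-value is left to a $\chi^2$ table or a routine such as `scipy.stats.chi2.sf`.

```python
def ljung_box(residuals, K, p=0, q=0):
    """Return (L, degrees of freedom) for the Ljung-Box test.

    L = n (n + 2) * sum_{k=1}^{K} rho_hat_k^2 / (n - k), with K - p - q
    degrees of freedom for residuals from an ARMA(p, q) fit.
    """
    n = len(residuals)
    mean = sum(residuals) / n
    dev = [r - mean for r in residuals]
    gamma0 = sum(d * d for d in dev)
    total = 0.0
    for k in range(1, K + 1):
        rho_k = sum(dev[t] * dev[t + k] for t in range(n - k)) / gamma0
        total += rho_k ** 2 / (n - k)
    return n * (n + 2) * total, K - p - q


# Strongly autocorrelated toy series: the statistic is very large.
alternating = [(-1) ** t for t in range(100)]
print(ljung_box(alternating, K=1))
```

In practice the function would be called twice on the same residuals, once with `K=24` and once with `K=48`, passing the fitted orders `p` and `q` so the degrees of freedom are reduced accordingly.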