How to Remove Non-Stationarity From Time Series

In addition to normality assumptions, most machine-learning algorithms assume a static relationship between input features and outputs. A static relationship requires inputs and outputs whose statistical properties, such as the mean and variance, remain constant over time. In other words, algorithms work best when their inputs and outputs are stationary. This is often not the case in time series forecasting. Distributions that change over time can exhibit characteristics such as trend and seasonality, which in turn cause the series' mean and variance to fluctuate and make its behavior difficult to model. Making a distribution stationary is therefore a strict requirement in time series forecasting.

Importance of Stationarity

If a distribution is not stationary, it is difficult to model. Algorithms establish links between inputs and outputs by estimating the key parameters of the underlying distributions. When those parameters are time-dependent, an algorithm encounters different values at different points in time. If the time series is sampled finely enough (e.g., at minutes or seconds), a model may even end up with more parameters than it has observations to estimate them from. Such a shifting relationship between inputs and outputs undermines the decision function of any model: if the relationship changes over time, the model ends up relying on a relationship that is outdated or that no longer adds predictive power. As a result, you should devote part of your workflow to detecting non-stationarity and eliminating its effects.

Detecting Non-Stationarity Statistically

Several tests for stationarity fall under the category of unit root testing. The augmented Dickey-Fuller (ADF) test is probably the most common, and we have already seen how to use it to identify random walks in my previous kernel. Here is how to use it to determine whether a series is stationary. Simply put, the test's null and alternative hypotheses are:

- Null hypothesis: the series has a unit root (it is non-stationary).
- Alternative hypothesis: the series has no unit root (it is stationary).
The p-value decides the outcome of the test. If it is less than a critical threshold (0.05 or 0.01), we reject the null hypothesis and conclude that the series is stationary. Otherwise, we fail to reject the null hypothesis and conclude that the series is non-stationary. The entire test is readily available as the adfuller function in statsmodels. First, let's apply it to a distribution that we know is stationary and get acquainted with its output.

The p-value is nearly zero, which means we can comfortably reject the null hypothesis and treat the distribution as stationary. Next, let's import the TPS July Playground dataset from Kaggle and check whether the target, carbon monoxide, is stationary.

Surprisingly, carbon monoxide turns out to be stationary. This might be because the data was collected over a short period of time, reducing the influence of the time component. In fact, all of the variables in this dataset are stationary. Now, let's load some non-stationary stock data.

Plotting the series shows Amazon stock on a clear upward trend. Running the Dickey-Fuller test on it yields a p-value of exactly 1, indicating a thoroughly non-stationary series. A final test on Microsoft stock, before moving on to strategies for dealing with this sort of data, gives a p-value that is again close to 1; no further interpretation is necessary.

Transforming a Non-Stationary Series to Make It Stationary

Differencing is one approach for transforming even heavily non-stationary data. The procedure takes the differences between consecutive observations; pandas provides the diff function for this. Its output shows the results of first-, second-, and third-order differencing. For simple distributions, the first-order difference is sufficient to render them stationary. Let's verify this by applying adfuller to diff_1 (the first-order difference of Microsoft stock).
When we applied adfuller to the original Microsoft series, the p-value was close to one. After differencing, the p-value is essentially zero, so we can reject the null hypothesis and conclude that the series is now stationary.

Other distributions, however, are not as simple to work with. Coming back to Amazon stock: before differencing, we must account for its clearly nonlinear trend, or the series will remain non-stationary. To remove the nonlinearity, we apply the logarithm with np.log and then take the first-order difference. As you can see, the distribution that had a p-value of 1 before the transformation is now perfectly stationary.

Let's look at one more case: the monthly sales of antibiotics in Australia. This data exhibits both an increasing trend and strong seasonality. To eliminate the seasonality, we again apply a log transform, but this time combined with a yearly difference (12 months). We can then confirm the stationarity with adfuller: the p-value is very small, indicating that the transformation steps were effective.

In general, every distribution is unique, so achieving stationarity may require chaining several operations. Most of them involve logarithms, first- or second-order differences, or seasonal differences.
