How to Remove Non-Stationarity From Time Series

In addition to normality assumptions, most ML algorithms assume a static relationship between input features and outputs. A static relationship requires inputs and outputs with constant properties such as the mean, median, and variance. In other words, algorithms work best when the statistical properties of their inputs and outputs do not change over time. This is rarely true in time series forecasting. Distributions that fluctuate over time can exhibit characteristics such as seasonality and trend. These, in turn, cause the series' mean and variance to drift, making its behavior difficult to model. So, rendering a distribution stationary is a strict requirement in time series forecasting.

Importance of Stationarity

If a distribution is not stationary, it is difficult to model. Algorithms establish links between inputs and outputs by estimating the key parameters of the underlying distributions. When all of these parameters are time-dependent, algorithms will encounter different values at different points in time. If the time series is granular enough (e.g., minutes or seconds), a model may even end up with more parameters than actual data points. Such a shifting link between inputs and outputs fundamentally undermines the decision function of any model: if the relationship changes over time, models will rely on an outdated relationship or one that adds nothing to their predictive power. As a result, you must devote a certain amount of effort to detecting non-stationarity and eliminating its effects during your workflow.

Detecting Non-Stationarity Statistically

There are several tests that fall under the category of unit root testing. The augmented Dickey-Fuller test may be the most common, and we've already seen how to use it to identify random walks in my previous kernel. Here's how to use it to determine whether a series is stationary. Simply expressed, these are the test's null and alternative hypotheses:

- Null hypothesis: the series has a unit root (it is non-stationary).
- Alternative hypothesis: the series has no unit root (it is stationary).

The p-value decides the outcome of the test. If it is less than a critical threshold (0.05 or 0.01), we reject the null hypothesis and conclude that the series is stationary. Otherwise, we fail to reject the null hypothesis and conclude that the series is non-stationary. The entire test is readily available as the adfuller function within statsmodels. First, let's apply it to a distribution that we know is stationary and get acquainted with its output.
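Here's a minimal sketch of this first check. The sample used in the original demo isn't shown, so standard normal white noise stands in for the known-stationary series:

```python
import numpy as np
from statsmodels.tsa.stattools import adfuller

# White noise is stationary by construction: constant mean,
# constant variance, no trend or seasonality.
np.random.seed(42)
white_noise = np.random.normal(size=1000)

stat, p_value = adfuller(white_noise)[:2]
print(f"ADF statistic: {stat:.4f}")
print(f"p-value: {p_value:.6f}")
```

For white noise, the printed p-value comes out effectively zero.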
The p-value is nearly zero, so we can safely reject the null hypothesis and treat the distribution as stationary.

Next, let's import the TPS July Playground dataset from Kaggle and check whether the target, carbon monoxide, is stationary (a minimal sketch of this check appears at the end of this section). Surprisingly, carbon monoxide turns out to be stationary. This might be because the data was collected over a short period of time, reducing the impact of the time component. In fact, all of the variables in the dataset are stationary.

Now, let's load some non-stationary stock data. Amazon stock prices are on a definite upward trend, and running the Dickey-Fuller test on them yields a p-value of 1, indicating a completely non-stationary series. A final test on Microsoft stocks, before we move on to strategies for dealing with this sort of data, gives a p-value close to 1 as well. No interpretation is necessary.

Transforming a Non-Stationary Series to Make It Stationary

Differencing is one approach for transforming even heavily non-stationary data. The procedure entails taking the differences between consecutive observations, and Pandas offers the diff function for exactly this (see the differencing sketch at the end of this section, which computes the first-, second-, and third-order differences). For simple distributions, the first-order difference is sufficient to render them stationary. We can verify this by applying the adfuller function to diff_1 (the first-order difference of Microsoft stocks). When we applied adfuller to the original distribution of Microsoft stocks, the p-value was close to one; after differencing, the p-value is effectively 0, meaning we can reject the null hypothesis and conclude that the series is now stationary.

However, other distributions may not be as simple to work with. Coming back to Amazon stocks: before calculating the difference, we must account for the clear non-linear trend, or the series will remain non-stationary. To eliminate the non-linearity, we apply the logarithmic function np.log and then compute the first-order difference (see the log-difference sketch at the end of this section). As a result, the distribution that had a p-value of 1 before the transformation is now perfectly stationary.

Let's look at one more case: the monthly sales of antibiotics in Australia. This data exhibits both an increasing trend and significant seasonality. To eliminate the seasonality, we apply a log transform again, but this time combined with a yearly difference (12 months), and confirm the result with adfuller (see the seasonal-difference sketch at the end of this section). The resulting p-value is tiny, indicating that the transformation steps were effective.

In general, each distribution is unique, so achieving stationarity may require chaining several of these procedures. Most involve logarithms, first/second-order differences, or seasonal differences.
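The sketches below flesh out the steps walked through above. First, the carbon monoxide check. The file path and column names are assumptions based on the Kaggle TPS July 2021 competition layout; adjust them to wherever you downloaded the data:

```python
import pandas as pd
from statsmodels.tsa.stattools import adfuller

# Hypothetical local copy of the TPS July 2021 training data.
tps = pd.read_csv("train.csv", parse_dates=["date_time"])

# Test the carbon monoxide target for stationarity.
p_value = adfuller(tps["target_carbon_monoxide"])[1]
print(f"p-value for carbon monoxide: {p_value:.6f}")
```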
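Next, the differencing step. The original stock download isn't shown, so a simulated random walk with upward drift stands in for the trending Microsoft price series:

```python
import numpy as np
import pandas as pd
from statsmodels.tsa.stattools import adfuller

# Random walk with drift: a unit-root (non-stationary) series
# standing in for trending stock prices.
np.random.seed(0)
prices = pd.Series(np.cumsum(np.random.normal(loc=0.5, size=1000)) + 100)
print(f"p-value before differencing: {adfuller(prices)[1]:.4f}")

# First-, second-, and third-order differences.
diffs = pd.DataFrame({
    "diff_1": prices.diff(),
    "diff_2": prices.diff().diff(),
    "diff_3": prices.diff().diff().diff(),
})

# Differencing leaves NaNs at the head of each column; drop them
# before re-running the test.
p_value = adfuller(diffs["diff_1"].dropna())[1]
print(f"p-value after first-order differencing: {p_value:.6f}")
```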
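For a series with a non-linear trend like the Amazon example, the log transform comes first and the first-order difference second. A synthetic exponential-trend series stands in for the real prices here:

```python
import numpy as np
import pandas as pd
from statsmodels.tsa.stattools import adfuller

# Synthetic series with an exponential (non-linear) upward trend.
np.random.seed(1)
trend = np.linspace(0, 3, 1000)
amzn_like = pd.Series(np.exp(trend + np.random.normal(scale=0.05, size=1000)))

# np.log flattens the non-linear trend into a linear one; the
# first-order difference then removes what remains.
log_diff = np.log(amzn_like).diff().dropna()

print(f"p-value after log + first difference: {adfuller(log_diff)[1]:.6f}")
```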
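Finally, the seasonal case. A synthetic monthly series with an increasing trend and 12-month seasonality stands in for the Australian antibiotic sales data:

```python
import numpy as np
import pandas as pd
from statsmodels.tsa.stattools import adfuller

# Synthetic monthly series: exponential trend multiplied by a
# yearly (12-month) seasonal cycle, plus a little noise.
np.random.seed(2)
months = np.arange(240)
sales = pd.Series(
    np.exp(0.01 * months) * (1.5 + np.sin(2 * np.pi * months / 12))
    + np.random.normal(scale=0.05, size=240)
)

# Log transform stabilizes the growing variance; a 12-period
# difference removes the seasonality along with the trend.
seasonal_diff = np.log(sales).diff(12).dropna()

p_value = adfuller(seasonal_diff)[1]
print(f"p-value after log + seasonal difference: {p_value:.6f}")
```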