Inverting differenced values - wrong starting point #14

Open
Mkranj opened this issue May 19, 2024 · 1 comment

Comments

@Mkranj

Mkranj commented May 19, 2024

In Chapter 5, there's a paragraph about reverting differenced values to the original scale, with the following code:

df['pred_foot_traffic'] = pd.Series()
df['pred_foot_traffic'][948:] = df['foot_traffic'].iloc[948] + pred_df['pred_AR'].cumsum()

However, df['foot_traffic'].iloc[948] is not the last point of the training data but the first point to predict, isn't it? So shouldn't the code use df['foot_traffic'].iloc[947] instead?
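To make the question concrete, here is a minimal sketch on a made-up six-point series (the values and the `split` index are hypothetical, standing in for `df['foot_traffic']` and 948). It assumes the model predicts the differences perfectly, so any mismatch comes only from the choice of starting point:

```python
import numpy as np

# Hypothetical toy series; stands in for df['foot_traffic'].
series = np.array([10.0, 12.0, 15.0, 14.0, 18.0, 21.0])
split = 4  # first index to predict (plays the role of 948)

diffs = np.diff(series)         # diffs[i] = series[i + 1] - series[i]
test_diffs = diffs[split - 1:]  # differences that land on indices split, split + 1, ...

# Assume perfect predictions of the differences, then undo the differencing.
from_last_train = series[split - 1] + np.cumsum(test_diffs)  # anchor at index 947's analogue
from_first_test = series[split] + np.cumsum(test_diffs)      # anchor at index 948's analogue

print(from_last_train)  # [18. 21.]  equals series[split:]
print(from_first_test)  # [22. 25.]  shifted by series[split] - series[split - 1]
```

With the last training value as the anchor, the test values are recovered exactly; with the first test value as the anchor, every reconstructed point is off by the first test-period difference.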

@BorjaArroyo

Hi @Mkranj, I have found a similar issue in Chapter 5 as well. When the author applies the inverse transformation to the ARMA(2,2) predictions for the hourly bandwidth dataset, the book proposes the following code:

df['pred_bandwidth'] = pd.Series()
df['pred_bandwidth'].iloc[9832:] = df['hourly_bandwidth'].iloc[9832] + predictions['sarimax'].cumsum()

This gives an MAE of 14. However, the starting index used to undo the differencing is wrong: it should be the last point of the training set. The cumulative sum should therefore be added to the value at index 9831, as follows:

df['pred_bandwidth'] = pd.Series()
df['pred_bandwidth'].iloc[9832:] = df['hourly_bandwidth'].iloc[9831] + predictions['sarimax'].cumsum()
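The reason for the index shift can be written out in a few lines. The notation here is my own, not the book's: $y_t$ is the original series, $d_t$ its first difference, $\hat{d}_t$ a predicted difference, and $T$ the last training index (9831 in this example).

```latex
\begin{aligned}
d_t &= y_t - y_{t-1} \\
\hat{y}_{T+h} &= y_T + \sum_{i=1}^{h} \hat{d}_{T+i}, \qquad h = 1, 2, \dots \\
y_{T+1} + \sum_{i=1}^{h} \hat{d}_{T+i} &= \hat{y}_{T+h} + d_{T+1}
\end{aligned}
```

The last line shows that anchoring on $y_{T+1}$ (index 9832) shifts every forecast by the constant $d_{T+1}$, the first true difference of the test period.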

To verify this result, you can use the following code, which operates on the whole array:

df_final = train_diff.copy(True)
assert len(df_final) == len(train_diff)
# Put the first original observation back in front of the differences
df_final.loc[df.index[0]] = df.iloc[0]
df_final = df_final.sort_index()
assert len(df_final) == (len(train_diff) + 1)
# First value + training differences + predicted differences, summed cumulatively
df_final = pd.concat([df_final["hourly_bandwidth"], predictions["sarimax"]]).cumsum()
assert len(df_final) == len(df)

This returns the same result as the fixed code. The variables are:

  • train_diff is the result of differencing, with the NaN value dropped.
  • df is the whole dataset as loaded with pd.read_csv.
  • predictions is the DataFrame with the predictions of the three models, including sarimax, which holds the ARMA(2,2) predictions.
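For anyone who wants to try the check without the book's data, here is a self-contained sketch of the same idea. The numbers are made up, `pred_diff` is a hypothetical stand-in for `predictions['sarimax']` that assumes perfect predictions, and `split` plays the role of 9832:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-ins for the variables above (not the book's data).
df = pd.DataFrame({"hourly_bandwidth": [100.0, 103.0, 101.0, 106.0, 110.0, 108.0]})
n_test = 2
split = len(df) - n_test  # first index to predict (9832 in the thread)

diff = df["hourly_bandwidth"].diff().dropna()  # NaN at index 0 dropped
train_diff = diff.iloc[: split - 1]            # differences landing on indices 1..split-1
pred_diff = diff.iloc[split - 1 :]             # assumed-perfect predicted differences

# Fixed code: anchor the cumulative sum on the last training value.
fixed = df["hourly_bandwidth"].iloc[split - 1] + pred_diff.cumsum()

# Whole-array check: first observation, then every difference, summed cumulatively.
first = df["hourly_bandwidth"].iloc[:1]
rebuilt = pd.concat([first, train_diff, pred_diff]).cumsum()

assert len(rebuilt) == len(df)
assert np.allclose(rebuilt.iloc[split:], fixed)
assert np.allclose(rebuilt, df["hourly_bandwidth"])
```

Both routes agree on the test period, and with perfect predictions the whole-array version reproduces the original series exactly, which is what the starting point at the last training index guarantees.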
