False Positive Errors and Introduced Auto-correlation

In my previous report I spoiled my results by introducing an auto-correlation false positive. I've corrected my infrastructure to prevent that sort of thing from happening again, and thought it warranted some explanation.

In my analysis I check the effect of signals on future returns. I do that by separating "signal" returns from "non-signal" returns and comparing their distributions. This is straightforward with 1-period returns, and you won't run into trouble. If you want to cover multiple periods with a single return value, however, you can accidentally guarantee a positive test result.
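As a rough illustration of that split, here is a minimal sketch that summarizes the future returns on signal rows and on non-signal rows side by side. The function name compare_signal_returns, the column names, and the random toy data are all assumptions made for the example. The actual comparison used in the reports may be something else entirely.

```python
import numpy as np
import pandas as pd


def compare_signal_returns(df: pd.DataFrame, signal_col: str, return_col: str) -> pd.DataFrame:
    """Summarize future returns on signal rows versus non-signal rows.

    Hypothetical helper: the column names are supplied by the caller.
    """
    # Rows without a future return (e.g. the end of the data) cannot be compared.
    valid = df.dropna(subset=[return_col])
    mask = valid[signal_col].astype(bool)
    groups = {
        "signal": valid.loc[mask, return_col],
        "non_signal": valid.loc[~mask, return_col],
    }
    # describe() gives count, mean, std, and quartiles for each group.
    return pd.DataFrame({name: s.describe() for name, s in groups.items()})


# Toy data only: random signals and random returns, so no real effect is expected.
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "sell_signal": rng.random(500) > 0.9,
    "future_return": rng.normal(0.0, 0.01, 500),
})
summary = compare_signal_returns(toy, "sell_signal", "future_return")
print(summary)
```

Any distribution test could take the place of describe() here. The point is only that the two groups are separated by the signal column before anything is compared.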

Spreadsheet Illustration

The screen capture of a spreadsheet below shows buy signals, sell signals, and 10-trading-day returns. This is the sort of data structure my system typically works with. The sell signals are set by comparing, in this instance, the stochastic "D" output with the "overbought" level, which was 80. As a result, you can have many adjacent signals.

Figure: An Illustration of the Auto-correlation Problem. The spreadsheet shows multiple adjacent rows with active signals that all share similar returns.

The issue with these adjacent signals is that we are computing a rolling 10-trading-day return. Since there are ten days in that one value, each individual day has a small impact on it, so adjacent days will have similar return values. If you have, as in the above case, 13 sell signals associated with what is essentially the same return value, you're artificially magnifying your data in a way that may muddy your attempts to distinguish signal from noise.
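The overlap effect is easy to see on synthetic data. The sketch below builds a random-walk close series (an assumption, not the data from the report) and measures the lag-1 auto-correlation of 1-day returns and of rolling 10-day returns. Two adjacent 10-day windows share nine of their ten days, so their returns are strongly correlated even when the daily moves are independent.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

# Synthetic closes: a random walk in log space with independent daily moves.
daily_log_moves = rng.normal(0.0, 0.01, 2000)
close = pd.Series(100.0 * np.exp(np.cumsum(daily_log_moves)))

ret_1 = close.pct_change(1).dropna()    # 1-day returns: windows do not overlap
ret_10 = close.pct_change(10).dropna()  # rolling 10-day returns: adjacent windows share 9 days

# Lag-1 auto-correlation: how similar is each value to the one on the next row?
autocorr_1 = ret_1.autocorr(lag=1)
autocorr_10 = ret_10.autocorr(lag=1)

print(f"lag-1 autocorrelation, 1-day returns:  {autocorr_1:.3f}")
print(f"lag-1 autocorrelation, 10-day returns: {autocorr_10:.3f}")
```

The 1-day figure should sit near zero and the 10-day figure near one, which is why a run of adjacent signals contributes far fewer independent observations than its row count suggests.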

The More Obvious False Positive

A much more serious error I made in my analysis was not carefully thinking about what I was doing when switching from N=1 returns to N=10 returns. I left this little line as-is without thinking about it at all:

df['return'] = df.close.pct_change(test_return_shift)  # close-close return
df['future_return'] = df['return'].shift(-1)

I'm computing an N-day return, okay, that's fine. That'll be the return between the current day and the day test_return_shift periods earlier.

And then I'm shifting that return by one to compare what the future return will be if you traded on this signal. The return that includes the test_return_shift previous days. By one. Hrm. That's not good.

Given that an "overbought" condition for this indicator means something like "over the past 14 days, prices have gone up", that all but guarantees that a 10-day return (test_return_shift was 10 in my report) prior to a sell signal will be a big positive return. Shifting that return by one day will give you a very similar, auto-correlated positive return.

Naturally this will tend to show that the indicator is a good momentum indicator rather than a contrarian indicator as the words "overbought" and "oversold" imply, but I honestly don't look too closely at long or short hypotheses when evaluating these signals. I check to see if the signal gives you a better idea of what future returns are. That's it.
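One plausible way to line the future return up with the signal row is to shift the N-day return back by the same N periods instead of by one. The post does not show the corrected line, so treat this as an assumption. The helper name add_forward_return and the toy price series are hypothetical.

```python
import numpy as np
import pandas as pd


def add_forward_return(df: pd.DataFrame, n: int) -> pd.DataFrame:
    """Add a trailing n-period return and a forward return that starts at each row.

    Hypothetical helper; assumes a 'close' column.
    """
    out = df.copy()
    # Trailing return at row t: close[t] / close[t - n] - 1
    out["return"] = out["close"].pct_change(n)
    # Shifting by -n moves the value from row t + n up to row t, so
    # future_return at row t is close[t + n] / close[t] - 1.
    # A shift of -1 would instead cover close[t - n + 1] .. close[t + 1],
    # which is almost entirely the past.
    out["future_return"] = out["return"].shift(-n)
    return out


# Toy prices rising by 1.0 per row, so the expected values are easy to check by hand.
prices = pd.DataFrame({"close": np.arange(100.0, 130.0)})
n = 10
result = add_forward_return(prices, n)
print(result[["close", "return", "future_return"]].head(12))
```

With this alignment the last n rows have no future return, which is expected: there is not yet enough data after them to know how the trade turned out.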

Closing Remarks

These obvious mistakes have driven home the importance of dumping a spreadsheet when important changes have occurred to your data analysis process and manually verifying what you've done. You have a return column? Great. Calculate it again in the spreadsheet and compare. That's how I caught the N=1 shift of an N=10 return.
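The same recompute-and-compare check can also be scripted alongside the spreadsheet. This is a minimal sketch on synthetic prices: manual_forward_return is a hypothetical helper that recomputes the forward return with a plain loop, and the result is compared against both the one-period shift and an N-period shift (the latter being an assumed fix, not a line taken from the actual infrastructure).

```python
import numpy as np
import pandas as pd


def manual_forward_return(close: pd.Series, n: int) -> pd.Series:
    """Recompute the n-period forward return row by row, spreadsheet style."""
    values = close.to_numpy(dtype=float)
    out = np.full(len(values), np.nan)
    for t in range(len(values) - n):
        out[t] = values[t + n] / values[t] - 1.0
    return pd.Series(out, index=close.index)


rng = np.random.default_rng(1)
close = pd.Series(100.0 * np.exp(np.cumsum(rng.normal(0.0, 0.01, 60))))
n = 10

one_period_shift = close.pct_change(n).shift(-1)  # the alignment that caused the false positive
n_period_shift = close.pct_change(n).shift(-n)    # assumed corrected alignment
expected = manual_forward_return(close, n)

# equal_nan=True treats the unavoidable NaN rows at the edges as matching.
one_period_matches = bool(np.allclose(one_period_shift.to_numpy(), expected.to_numpy(), equal_nan=True))
n_period_matches = bool(np.allclose(n_period_shift.to_numpy(), expected.to_numpy(), equal_nan=True))

print("shift(-1) matches manual forward return:", one_period_matches)
print("shift(-n) matches manual forward return:", n_period_matches)
```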

I resolved this problem with something like a nuclear option. I think I went a little overboard, but I feel very confident now that a positive or negative result is very unlikely to be false.

First, I made my buy and sell signals "latching". Only the first signal counts; the rest get set to "false" until test_return_shift periods have elapsed.
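A latch like that could look roughly like the sketch below. The function name latch_signal is hypothetical, and the exact boundary (whether a new signal is allowed again after exactly n rows or n + 1) is an assumption, since the post does not spell it out.

```python
import numpy as np
import pandas as pd


def latch_signal(signal: pd.Series, n: int) -> pd.Series:
    """Keep the first signal and suppress any others for the next n - 1 rows.

    Assumption: a signal at row i allows the next one at row i + n or later.
    """
    raw = signal.to_numpy(dtype=bool)
    latched = np.zeros(len(raw), dtype=bool)
    next_allowed = 0  # earliest row index at which a signal may count again
    for i, fired in enumerate(raw):
        if fired and i >= next_allowed:
            latched[i] = True
            next_allowed = i + n
    return pd.Series(latched, index=signal.index)


# Toy example: a run of four adjacent signals, then a lone one much later.
raw_signal = pd.Series(False, index=range(20))
raw_signal.iloc[[2, 3, 4, 5, 14]] = True

latched_signal = latch_signal(raw_signal, n=10)
kept_rows = latched_signal[latched_signal].index.tolist()
print(kept_rows)
```

In the toy example the run at rows 2 through 5 collapses to row 2 alone, while the signal at row 14 survives because more than ten rows have passed.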

Second, I made buy and sell "block" columns that lock out rows that are guaranteed to show auto-correlation. Rows flagged by these columns are not used in the analysis at all. This is potentially a little overkill, but I want to avoid a false positive much more than I want to avoid false negatives.

Figure: A Spreadsheet Capture Illustrating the Block Column Concept. The buy_block and sell_block columns are set to true for N days around any signal. This blocking avoids false positives through autocorrelated returns.
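The block columns could be built along these lines. This is a sketch under assumptions: the helper name block_around is hypothetical, "around" is taken to mean n rows on either side of a signal, and the signal row itself is left unblocked so it can still be used.

```python
import numpy as np
import pandas as pd


def block_around(signal: pd.Series, n: int) -> pd.Series:
    """Flag rows within n rows of any signal, excluding the signal rows themselves."""
    sig = signal.to_numpy(dtype=bool)
    near = np.zeros(len(sig), dtype=bool)
    for i in np.flatnonzero(sig):
        start = max(0, i - n)
        stop = min(len(sig), i + n + 1)  # slice end is exclusive
        near[start:stop] = True
    # Neighbours are blocked; the signal row stays available for analysis.
    return pd.Series(near & ~sig, index=signal.index)


# Toy frame: one sell signal at row 15, blocking 3 rows on each side.
df = pd.DataFrame({"sell_signal": [False] * 30})
df.loc[15, "sell_signal"] = True
df["sell_block"] = block_around(df["sell_signal"], n=3)

blocked_rows = df.index[df["sell_block"]].tolist()
usable = df.loc[~df["sell_block"]]  # rows that remain eligible for comparison
print(blocked_rows)
print(len(usable))
```

Dropping the blocked rows before splitting into signal and non-signal groups means the non-signal group never contains returns that overlap a signal's window.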

Since the only returns compared are ones that are free and clear of the influence of any kind of averaging, I'll know that a positive result is a good one. I can certainly fool myself in other ways, but I'm glad to have this one nailed down.