Cross-validation strategies for time-series data
Move beyond train-test splits to protect against overfitting.
Last updated on November 12, 2021


When building a machine learning system, our ultimate goal is to predict events that haven't happened yet. Measuring how well our model predicts our training data is simple and intuitive (think back to R-squared in your introductory STAT class), but such comparisons don't tell us anything useful about how our model will perform in the future. The big risk is that our model will become overtuned to the data it was trained on, causing it to massively underperform when used in the real world.

Cross-validation (CV) is the best-known diagnostic to protect against this phenomenon (which we commonly refer to as "overfitting"). By repeatedly training our model on a sample of historical data and measuring its performance against other, out-of-sample observations, we can simulate how well our model will perform when predicting the future. Cross-validation is an absolute requirement in modern machine learning; you can find a chapter on cross-validation in nearly every machine learning textbook available today. In this post, we'll assume you know the basics of k-fold CV and focus on advanced cross-validation strategies for working with flexible algorithms on time-series data.

Expanding Window CV

To start, consider the expanding window strategy discussed in Rob Hyndman's Forecasting: Principles and Practice. Graphically, this strategy looks like this:

Illustration of expanding window cross-validation for time-series data

In practice, the number of folds and the length of training and test data used in each fold will vary depending on the data you're using (though it's common to use 5 or 10 folds for larger timeframes and 3 folds for shorter timeframes). In Python, you can implement this strategy using sklearn's TimeSeriesSplit:

    from sklearn.model_selection import TimeSeriesSplit

    # These parameters vary depending on your data
    N_FOLDS = 5
    GAP = 0

    # Sort chronologically so the positional indices below respect time order
    data = data.sort_values(by="datetime")
    time_grouper = data["datetime"]
    time_splitter = TimeSeriesSplit(n_splits=N_FOLDS, gap=GAP)

    # Get a list of (train_index, test_index) pairs, one per fold
    time_splits = list(time_splitter.split(time_grouper))

    errors = []
    # Run your models over each fold
    for fold, (train_index, test_index) in enumerate(time_splits):
        train, test = data.iloc[train_index, :], data.iloc[test_index, :]
        # your modeling and scoring code here
        errors.append(your_error_function(actuals, predictions))
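To make the expanding window concrete, here is a minimal sketch that prints the indices TimeSeriesSplit produces. The twelve-observation array and the choice of three folds are made up for illustration only:

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# Twelve time-ordered observations (hypothetical toy data)
observations = np.arange(12)

splitter = TimeSeriesSplit(n_splits=3, gap=0)
folds = [
    (train_index.tolist(), test_index.tolist())
    for train_index, test_index in splitter.split(observations)
]

# Every training window starts at the first observation and grows each fold
for fold, (train_index, test_index) in enumerate(folds):
    print(f"fold {fold}: train={train_index} test={test_index}")
```

Each test window sits immediately after its training window, so the model is never scored on data that precedes what it was trained on. Setting gap to a positive number drops that many observations from the end of each training window, which can help when nearby observations are strongly correlated.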

This strategy works well for evaluating classic time-series algorithms since they are relatively inflexible: ARIMA, exponential smoothing, and dynamic regression models don't require much hyperparameter tuning in order to work effectively, so your risk of overfitting during training is low. These algorithms work well out-of-the-box, but their simplicity handicaps their performance relative to other, newer algorithms.

Flat Cross-Validation

When we want world-class results, we typically rely on more flexible algorithms (e.g., LightGBM) that depend on carefully tuned hyperparameters to achieve outstanding results. In so-called "flat" cross-validation (a term borrowed from Wainer et al. 2018, for lack of a more descriptive alternative), we use a single cross-validation loop (e.g., expanding window CV, illustrated above) to both a) tune our hyperparameters and b) calculate our "out-of-sample" error (usually as a simple average of our test metrics across all folds).

The problem with using the same cross-validation folds for both hyperparameter tuning and model evaluation is that we have no way of knowing if we're overfitting our model during hyperparameter selection. Without an extra slice of out-of-sample data to use for model evaluation, we're adding bias to our out-of-sample results. The amount of incremental bias depends on:

  • The size of our data, where larger training and testing sets will be more robust to overfitting
  • The flexibility of our model, where more flexible models will have a higher potential to overfit

Flat cross-validation is undoubtedly the fastest and most convenient way to train LightGBM, but the added estimator bias is usually a no-go in academic circles. For anything other than quick and dirty estimates, we'd recommend one of the strategies listed below.
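As a rough sketch of the flat approach, the snippet below lets one expanding-window loop do both jobs. The synthetic data, the Ridge model (a lightweight stand-in for something like LightGBM), and the alpha search space are all assumptions chosen to keep the example small and self-contained:

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import RandomizedSearchCV, TimeSeriesSplit

# Hypothetical, time-ordered toy data: 200 observations, 3 features
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.5, size=200)

# A single expanding-window loop is used for tuning AND for scoring
search = RandomizedSearchCV(
    estimator=Ridge(),
    param_distributions={"alpha": list(np.logspace(-3, 3, 20))},
    n_iter=10,
    cv=TimeSeriesSplit(n_splits=5),
    scoring="neg_root_mean_squared_error",
    random_state=0,
)
search.fit(X, y)

# The "flat" error is the mean test-fold RMSE of the winning candidate
flat_rmse = -search.best_score_
print(search.best_params_, round(flat_rmse, 3))
```

Because flat_rmse is the score that was maximized during the search, it is the best of ten candidates on those same folds and will tend to look more favorable than the error on truly unseen data.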

Hold-Out Cross-Validation

In hold-out cross-validation, we'll split our validation into two sections to avoid bias from data leakage: one cross-validation pipeline to tune our hyperparameters, and one hold-out set to validate our model's generalizability:

Illustration of hold-out cross-validation for time-series data

In code, hold-out cross-validation might look like this:

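    from sklearn.model_selection import TimeSeriesSplit, RandomizedSearchCV

    # Split-off the most recent 25% of your data into a validation set
    data = data.sort_values(by="datetime")
    split_point = int(.75 * len(data))
    modeling, validation = data.iloc[:split_point], data.iloc[split_point:]

    # Parameter tune using sklearn's RandomizedSearch method (on `modeling` only)
    estimator_object = RandomizedSearchCV(...).fit(...)

    # Score the tuned model once on the untouched validation set
    OOS_predictions = estimator_object.predict(validation)
    # The first argument (the actual target values) is elided in the source
    validation_error = calculate_RMSE(..., predictions=OOS_predictions)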

By splitting off a validation sample and leaving it untouched until we're finished with hyperparameter selection, we greatly reduce (but do not eliminate) the risk of overfitting during hyperparameter selection. Hold-out CV is relatively quick and easy to implement, but it does require that you have enough modeling data to split some off into a validation set. Because we're only measuring our model performance against a single, contiguous validation set, there's still a chance that we'll inadvertently overfit to that one period. For a more rigorous approach, we turn to nested cross-validation.
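For readers who want something they can run end to end, here is one possible hold-out setup. The synthetic data, the Ridge model (standing in for a more flexible learner), and the alpha search space are assumptions for illustration and not part of the original pipeline:

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import RandomizedSearchCV, TimeSeriesSplit

# Hypothetical, time-ordered toy data: 200 observations, 3 features
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.5, size=200)

# Chronological 75/25 split: the last quarter is never seen during tuning
split_point = int(0.75 * len(X))
X_model, X_valid = X[:split_point], X[split_point:]
y_model, y_valid = y[:split_point], y[split_point:]

# Tune hyperparameters with expanding-window CV on the modeling data only
search = RandomizedSearchCV(
    estimator=Ridge(),
    param_distributions={"alpha": list(np.logspace(-3, 3, 20))},
    n_iter=10,
    cv=TimeSeriesSplit(n_splits=5),
    scoring="neg_root_mean_squared_error",
    random_state=0,
)
search.fit(X_model, y_model)

# Score the refit best model once on the untouched hold-out set
tuning_rmse = -search.best_score_
holdout_rmse = float(np.sqrt(mean_squared_error(y_valid, search.predict(X_valid))))
print(f"tuning RMSE={tuning_rmse:.3f}  hold-out RMSE={holdout_rmse:.3f}")
```

A hold-out RMSE that is much worse than the tuning RMSE is a hint that the hyperparameter search overfit the modeling folds.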

Nested Cross-Validation

In nested cross-validation, we split our validation process into two separate cross-validations: one to tune our hyperparameters, and another to validate our model's performance:

Illustration of nested cross-validation for time-series data

In the code below, our inner cross-validation loop is handled using sklearn's RandomizedSearchCV, while the outer cross-validation loop uses the expanding window algorithm mentioned above:

    from sklearn.model_selection import TimeSeriesSplit, RandomizedSearchCV

    # Split-off 25% of your data into a validation set
    # (`validation` stays untouched by both loops below)
    data = data.sort_values(by="datetime")
    split_point = int(.75 * len(data))
    modeling, validation = data.iloc[:split_point], data.iloc[split_point:]

    # These parameters vary depending on your data
    N_FOLDS = 5
    GAP = 0

    time_grouper = modeling["datetime"]
    time_splitter = TimeSeriesSplit(n_splits=N_FOLDS, gap=GAP)

    # Get a list of (train_index, test_index) pairs for the outer loop
    time_splits = list(time_splitter.split(time_grouper))

    errors = []
    # Run your models over each fold (note: writing pseudo-code here, but should be close)
    for fold, (train_index, test_index) in enumerate(time_splits):
        train, test = modeling.iloc[train_index, :], modeling.iloc[test_index, :]
        # Inner loop: tune hyperparameters using this fold's training data only
        estimator_object = RandomizedSearchCV(...).fit(...)
        # Outer loop: score the tuned model on this fold's unseen test data
        OOS_predictions = estimator_object.best_estimator_.predict(test)
        # The first argument (the actual target values) is elided in the source
        fold_error = calculate_RMSE(..., predictions=OOS_predictions)
        errors.append(fold_error)

Nested cross-validation enables us to ask deep, probing questions into our model's performance:

  • Are my predictions stable between inner and outer folds? (if not, we're overfitting)
  • Does each outer fold pick the same set of hyperparameters? (if they're unstable, we're likely overfitting)
  • Are my inner and outer fold error metrics similar? (if not, we're overfitting)
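The questions above can be checked by recording, for each outer fold, the chosen hyperparameters alongside the inner and outer errors. The sketch below is one way to do that; the synthetic data, the Ridge model (a stand-in for a more flexible learner), the fold counts, and the alpha search space are all illustrative assumptions:

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import RandomizedSearchCV, TimeSeriesSplit

# Hypothetical, time-ordered toy data: 240 observations, 3 features
rng = np.random.default_rng(2)
X = rng.normal(size=(240, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.5, size=240)

outer_cv = TimeSeriesSplit(n_splits=4)  # estimates generalization error
inner_cv = TimeSeriesSplit(n_splits=3)  # tunes hyperparameters

results = []
for fold, (train_index, test_index) in enumerate(outer_cv.split(X)):
    # Inner loop: the search only ever sees this fold's training window
    search = RandomizedSearchCV(
        estimator=Ridge(),
        param_distributions={"alpha": list(np.logspace(-3, 3, 20))},
        n_iter=10,
        cv=inner_cv,
        scoring="neg_root_mean_squared_error",
        random_state=0,
    )
    search.fit(X[train_index], y[train_index])

    # Outer loop: score the refit best model on the unseen test window
    predictions = search.predict(X[test_index])
    outer_rmse = float(np.sqrt(mean_squared_error(y[test_index], predictions)))
    results.append({
        "fold": fold,
        "alpha": float(search.best_params_["alpha"]),
        "inner_rmse": float(-search.best_score_),
        "outer_rmse": outer_rmse,
    })

for row in results:
    print(row)
print("nested CV RMSE:", np.mean([row["outer_rmse"] for row in results]))
```

Reading down the printed rows, large swings in alpha between folds or a consistent gap between inner_rmse and outer_rmse would point to the kinds of overfitting described in the list above.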

The added confidence that nested cross-validation provides comes with a cost, namely in terms of run-time, code complexity, and data requirements. The jury is still out on whether nested cross-validation can noticeably improve model generalization. Wainer et al. 2018 concluded that nested cross-validation probably isn't necessary after testing gradient boosting machines on 115 different classification datasets, but their conclusions were drawn from relatively small datasets and parameter grids. Krstajic et al. 2014, on the other hand, found a significant variance between their flat cross-validation results and their nested cross-validation results. Unfortunately, they also used small sample sizes and relatively simple models (mostly variants of ridge regression). Cawley and Talbot 2010 tell us that the impact of overfitting and data leakage depends on dataset size and model complexity, so it's unclear whether either of these studies is analogous to the types of machine learning problems we typically solve. Until a clear winner emerges, we tend to stick to either hold-out CV or nested CV depending on our data, run-time, and processing requirements.

For a simple nested cross-validation example that isn't time-series specific, check out sklearn's docs.


To summarize this post, we've discussed three advanced cross-validation strategies to train modern, flexible learning algorithms on time-series data. Nested cross-validation is the gold standard, but comes with run-time, code complexity, and data requirements that may not fit your goals. "Flat" cross-validation is quick and easy to implement, but may lead to significant overfitting and subsequent embarrassment. Hold-out cross-validation sits somewhere in between these two approaches and is our go-to cross-validation approach when under time constraints.

If this write-up was helpful or you have additional questions, we'd love to continue the conversation at