Modified Mon Mar 27 09:32:49 2017 by Eric.Woldridge.

Challenge Problem #7: Flu Spread

Predicting the spread of epidemics geographically and over time can help government agencies and organizations better prepare and allocate resources, and could ultimately help prevent cases of influenza.

Seasonal flu epidemics have been closely monitored, and many years of historical data have been collected by the medical community. Data collected and aggregated by the CDC have been valuable for researchers developing models to forecast the spread of flu epidemics.

In addition to data collected by the CDC, there are many other datasets being collected by different entities for various purposes – many of them unrelated to flu epidemics. When those datasets are combined with the CDC data, they offer the opportunity to significantly improve the ability to assess and forecast flu epidemics, both spatially and temporally.

Datasets include social network data and vaccination statistics. These data sources have different characteristics (e.g., percentages for CDC regional ILI rates and flu vaccination, and quantized flu activity levels for CDC state ILI rates) and different spatial and temporal resolutions. Combining them in a single forecasting model is challenging but, if successful, can provide better forecasting accuracy over a longer time horizon than current approaches based on limited sources of information.

Problem Overview

Phase 1 - Reconstruction

During Phase 1, the goal is to fuse multiple data sources to reconstruct Influenza-Like Illness (ILI) rates at a spatial resolution finer than that of the ILI data from the CDC. Performers can use specified datasets to estimate weekly ILI rates at the county level across the 48 contiguous states. The results will be compared to a set of “Evaluation Regions” consisting of state-level ILI rates from selected states (Massachusetts, North Carolina, Rhode Island and Texas) and district-level ILI rates from two states (Mississippi and Tennessee), where each district consists of multiple counties.

The datasets cover the 2013-2014 and 2014-2015 flu seasons. For model development and prediction, performers will have access to all data from both seasons with the exception of the Evaluation Region data for 2014-2015. Fitted models will be evaluated by comparing their estimated state and district ILI rates with the actual ILI rates of the selected states and districts.
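Because estimates are produced per county but evaluated per state or district, county values have to be rolled up to the Evaluation Regions at some point. The page does not say how that roll-up is done; the sketch below assumes a population-weighted mean, and the function name, county keys and numbers are all hypothetical.

```python
def aggregate_ili(county_rates, county_population, region_counties):
    """Population-weighted mean ILI rate for one region.

    Assumption: the weighting rule is illustrative only; the challenge
    description does not specify how county estimates are aggregated.
    """
    total_pop = sum(county_population[c] for c in region_counties)
    if total_pop == 0:
        raise ValueError("region has zero total population")
    weighted = sum(county_rates[c] * county_population[c] for c in region_counties)
    return weighted / total_pop


# Made-up numbers for three hypothetical counties (not real data).
rates = {"A": 2.0, "B": 4.0, "C": 1.0}      # estimated ILI rate, percent
pops = {"A": 1000, "B": 3000, "C": 500}     # county population
district = ["A", "B"]                       # a district made of two counties

district_rate = aggregate_ili(rates, pops, district)
```

The same function would serve for a state-level region by passing every county in that state.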

Phase 2 - Nowcasting

During Phase 2, the goal is to produce estimates of ILI rates that are more timely than those published by the CDC (while maintaining spatial resolution finer than that of the CDC). The ILI data from the CDC and the Evaluation Region states are released after a delay of 1 to 2 weeks. The goal of Phase 2 is therefore to predict ILI rates in week 𝑡 using all data from previous weeks 𝑡−1,𝑡−2,…. This includes the CDC ILI rates from week 𝑡−2 and Twitter data from week 𝑡−1. All data from 2013-2014 and 2014-2015 will be available for model development and training. Performers can also use the NREVSS dataset in this phase, which may provide additional predictive power. Evaluation will be performed on data collected during the current 2015-2016 flu season.
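The release delays determine which values may be used as inputs for a given target week. As a minimal sketch, assuming each series is a plain list indexed by week number, the helper below picks out the newest CDC and Twitter values that would be available when estimating week 𝑡. The function name, argument names and sample values are hypothetical.

```python
def nowcast_features(t, cdc_ili, tweets, cdc_delay=2, tweet_delay=1):
    """Return the most recent inputs available when estimating week t.

    Assumption: cdc_ili and tweets are lists indexed by week number, and
    the delays match the description (CDC from t-2, Twitter from t-1).
    """
    if t - cdc_delay < 0 or t - tweet_delay < 0:
        raise ValueError("not enough history for week %d" % t)
    return {
        "cdc_ili_lag": cdc_ili[t - cdc_delay],   # CDC ILI rate from week t-2
        "tweets_lag": tweets[t - tweet_delay],   # Twitter signal from week t-1
    }


# Made-up weekly series (not real data), weeks 0 through 4.
cdc_series = [1.0, 1.2, 1.5, 1.9, 2.4]
tweet_series = [10, 14, 21, 30, 44]

features = nowcast_features(4, cdc_series, tweet_series)
```

Keeping the delays as parameters makes it easy to check that a model never reads a value that would not yet have been published.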

Phase 3 - Forecasting

The task in this phase of CP7 is to predict seasonal rates of Influenza-Like Illness (ILI or 'flu') in 60 distinct sub-populations of the continental US, ranging in size from the entire country to individual counties.

In addition to historical ILI rate data for each population, three different kinds of covariate data, representing flu-related tweets, vaccination claims, and weather, are provided for use in solutions. In a simulated forecast experiment, all four kinds of variables will be made available to solutions, one week of data at a time, over the 32-week target season.
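The week-by-week release can be pictured as a walk-forward loop in which a solution only ever sees the records revealed so far. The sketch below is an assumption about the shape of such a harness, not the actual evaluation code: the record layout, the function names and the persistence "model" (repeat the last observed rate) are placeholders.

```python
def run_simulated_season(weekly_records, predict):
    """Reveal one week at a time and collect one prediction per step."""
    history = []
    forecasts = []
    for record in weekly_records:
        history.append(record)                    # the new week becomes visible
        forecasts.append(predict(list(history)))  # the solution sees only the past
    return forecasts


def persistence(history):
    """Placeholder model: repeat the most recently observed ILI rate."""
    return history[-1]["ili"]


# Made-up 32-week season (not real data); covariates omitted for brevity.
season = [{"week": w, "ili": float(w)} for w in range(1, 33)]

forecasts = run_simulated_season(season, persistence)
```

Passing a copy of the history to `predict` keeps a solution from altering the records seen in later weeks.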

Evaluation Timeline

  • Phase 1:
    • Introduced January 2016
    • Evaluated July 2016
  • Phase 2:
    • Introduced July 2016
    • Evaluated January 2017
  • Phase 3:
    • Introduced January 2017
    • Evaluated July 2017