Wikipedia May Predict The Next Global Health Crisis

November 18, 2014

Scientists may one day be able to predict epidemics simply by analyzing what people are searching for on the Internet.

Researchers at the Los Alamos National Laboratory in New Mexico found that a surge in Wikipedia traffic for certain disease-related articles predicted the large-scale spread of illnesses in several world locations, sometimes as long as 28 days before official spikes were recorded. The findings were published online Thursday in PLOS Computational Biology.

Unlike Google Flu Trends, which uses search data to reflect where outbreaks are happening, the Wikipedia analysis can predict an outbreak. Nicholas Generous, the study’s lead researcher and a digital epidemiologist for the lab, said the finding may one day become a tool to help public health officials decide to mount prevention or vaccination campaigns, or help hospitals prepare for a wave of patients.

“It kind of beats the older versions of disease surveillance, where the patient gets sick, goes to the doctor, the doctor writes up the report, sends it to the county epidemiologist, who aggregates it upward and then it gets reported by the state,” Generous said in a phone interview with The Huffington Post. “It can be kind of a lengthy process, but if you use this Internet data, you can get data almost hourly or daily.”

Predicting an outbreak is especially important for fast-spreading diseases like flu, which contributes to the deaths of 3,000 to 49,000 people in the U.S. each year.

“Outbreaks, like the flu for example, can move very quickly, and having that information in real time can help you make better decisions in the moment,” Generous added.

To make the predictive model, Generous identified specific disease articles on Wikipedia (like “Influenza” or “Swine Influenza”) and then took open-source data from the site about each article’s traffic over three years. He did this for 14 specific disease-location pairs, like cholera in Haiti, dengue in Brazil, or tuberculosis in Thailand.

Then he compared the articles’ page views to official records of infection. He found that the Wikipedia traffic predicted infection rates up to 28 days in advance in eight cases: flu in the U.S., Poland, Japan and Thailand; tuberculosis in China and Thailand; and dengue in Thailand and Brazil.
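The comparison described above can be illustrated with a small sketch. It is not the researchers' model, which this article does not describe in detail. It assumes two daily series of equal length, one of page views and one of official case counts, and simply checks which lead time (from 0 to 28 days) gives the strongest correlation between the two. The function names `pearson` and `best_lead` and the toy data are hypothetical, chosen only for illustration.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a flat series carries no signal to correlate
    return cov / sqrt(var_x * var_y)


def best_lead(page_views, cases, max_lead=28):
    """Return (lead_days, correlation) for the lead at which page views
    on day t correlate best with case counts on day t + lead_days."""
    best = (0, pearson(page_views, cases))
    for lead in range(1, max_lead + 1):
        if len(page_views) - lead < 3:
            break  # too few overlapping days to compare
        r = pearson(page_views[:-lead], cases[lead:])
        if r > best[1]:
            best = (lead, r)
    return best


# Toy data (assumption, not from the study): a bump in page views
# around day 30, and case counts that repeat the same bump 7 days later.
views = [100 + 50 * max(0, 10 - abs(t - 30)) for t in range(90)]
cases = [0] * 7 + views[:-7]

lead, r = best_lead(views, cases)
print(lead, round(r, 3))
```

In this toy series the case counts are an exact copy of the page views delayed by a week, so the strongest correlation appears at a seven-day lead. Real data would be far noisier, which is the problem the next paragraphs describe.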

In the remaining six cases, the signal was either too subtle to track (HIV, for example, is a relatively slow-moving disease) or the “noise” around a certain article drowned out the traffic from potentially infected people who were actually searching for their symptoms.

For example, approximately two to 10 people contract the plague in the U.S. every year, but interest in the plague for historical or other reasons far outweighs the traffic that these two to 10 people could generate on their own. Similarly, there is a lot of reader interest in Ebola in the U.S. because of media reports, but there have only been a few hundred people in the country who may have been exposed to the virus.

For Ebola, “what you’re going to be measuring is not really the number of people who may be infected, but the number of people who are looking up Ebola,” said Generous. “And obviously in West Africa, there’s not a lot of Internet connectivity, so people probably aren’t going to be searching for Ebola on Wikipedia either.”

“The opposite example is the flu,” explained Generous. “A lot of people have flu-like symptoms and search it online, but the general interest in influenza doesn’t drown out the people who are either infected and searching for their own symptoms, or secondary observations, like when you look for your family member or your child who is sick.”

Google has been aggregating flu search data for more than a decade, but doesn’t disclose raw data or algorithms to outside scientists, said Generous. Facebook also is a closed-data system. Wikipedia data, though open-source, had limitations. Wikipedia releases page view data by language, which means page views for articles written in English may be coming from Australia, South Africa, the U.S. or India.

That’s why Generous used language as a proxy for location. Ideally, the data would include geotags. For some countries, it worked — Thai is generally only spoken in Thailand, and Polish in Poland. In other cases, the disease itself was able to make the page view data relevant. While Portuguese is spoken in Portugal, Brazil, and a handful of African nations, only Brazil has dengue fever.

But the model also worked for the U.S. and the flu, even though flu is common and English is spoken all over the world. Generous called it a “really unexpected” finding.

“We did not expect to be able to track flu in the United States using English Wikipedia,” he said. “Our best theory is that Canada and the United States are in the same region, and we think that there are enough readers from that region that are overpowering the other readers” around the world.