Seth Stephens-Davidowitz made his name by using the enormous trove of data from Google search queries – that is, what users all over the world type in the search box – to measure things that researchers would typically measure solely through voluntary responses to surveys. And, as Stephens-Davidowitz says in the title of his first book, Everybody Lies, those surveys are not that reliable. It turns out, to pick one of the most notable results of his work (described in this book), that only 2-3% of men self-report as gay when asked in surveys, but the actual rate is probably twice that, based on the data he mined from online searches.
Stephens-Davidowitz ended up working for a year-plus at Google as a data scientist before leaving to become an editorial writer at the New York Times and an author, so the book is a bit more than just a collection of anecdotes like later entries in the Freakonomics series. Here, the author is more focused on the potential uses and risks of this enormous new quantity of data that, of course, is being collected on us every time we search on Google, click on Facebook, or look for something on a pornography site. (Yep, he got search data from Pornhub too.)
The core idea here is twofold: there are new data, and these new data allow us to ask questions we couldn’t answer before, or simply couldn’t answer well. People won’t discuss certain topics with researchers, or even answer surveys truthfully, but they will spill everything to Google. Witness the derisive term “Dr. Google” for people who search for their symptoms online, where they may end up with information from fraudsters or junk science sites like Natural News or Mercola, rather than seeing a doctor. What if, however, you looked at people who reveal through their searches that they have something like pancreatic cancer, and then looked at the symptoms those same people were Googling several weeks or months before their diagnosis? Such an approach could allow researchers to identify symptoms that positively correlate with hard-to-detect diseases, and to know the chances of false positives, or even find intermediate variables that alter the probability the patient has the disease. You could even build expert systems that really would work like Dr. Google – if I have these five symptoms, but not these three, should I see a real doctor?
Sex, like medical topics, is another subject people don’t like to discuss with strangers, and it happens to sell books too, so Stephens-Davidowitz spent quite a bit of time looking into what people search for when they’re searching about sex, whether it’s pornography, dating sites, or questions about sex and sexuality. The Pornhub data trove reveals quite a bit about sexual orientations, along with some searches I personally found a bit disturbing. Even more disturbing, however, is just how many Americans secretly harbor racist views, which Stephens-Davidowitz deduces from internet searches for certain racial slurs. He even uses the search data to show how many of these people are out there, and thus how polls underestimated Donald Trump’s appeal to the racist white masses. Few racists reveal themselves as such to surveys or researchers, and such people may even lie about their voting preferences or plans – saying they were undecided when they planned to vote for Trump, for instance. If Democrats had bothered to get and analyze these data, which are freely available, would they have changed their strategies in swing states?
Some of Stephens-Davidowitz’s queries here are less earth-shattering and seem more like ways to demonstrate the power of the tool. He looks at whether violent movies actually correlate with an increase in violent crime (spoiler: not really), and what first-date words or phrases might indicate a strong chance for a second date. But he also uses some of these queries to talk about new or revived study techniques, like A/B testing, or to show how such huge quantities of data can lead to spurious correlations, a problem known as “the curse of dimensionality.” Test enough variables against a single outcome and some of them will appear to predict it by chance alone, which is how you get studies claiming that a specific gene causes a specific disease or physical condition, only for other researchers to fail to replicate the result.
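If you want to see the spurious-correlation problem for yourself, here is a minimal sketch in Python. It is my own toy simulation, not anything from the book, and the function names and the numbers (20 people, 10 versus 1,000 variables, a cutoff of 0.5) are made up for illustration. Every variable is pure random noise, so any correlation with the outcome is meaningless by construction.

```python
import math
import random


def pearson(xs, ys):
    """Plain Pearson correlation coefficient for two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def chance_correlations(n_people, n_variables, seed=0):
    """Correlate many pure-noise variables with one pure-noise outcome.

    Hypothetical helper: returns the absolute correlation of each
    random variable with the random outcome. Nothing here is related
    to anything else, so every value is a fluke.
    """
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    outcome = [rng.gauss(0, 1) for _ in range(n_people)]
    results = []
    for _ in range(n_variables):
        variable = [rng.gauss(0, 1) for _ in range(n_people)]
        results.append(abs(pearson(variable, outcome)))
    return results


few = chance_correlations(n_people=20, n_variables=10)
many = chance_correlations(n_people=20, n_variables=1000)

print("strongest fluke among 10 variables:  ", round(max(few), 2))
print("strongest fluke among 1000 variables:", round(max(many), 2))
print("variables with |r| > 0.5 out of 1000:", sum(r > 0.5 for r in many))
```

With only 20 people in the sample, a correlation above 0.5 should turn up by luck in roughly one noise variable out of every few dozen, so a search across a thousand of them should find a handful that look impressive. The more variables you test, the stronger the best fluke gets, which is why a result like this needs replication on fresh data.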
Stephens-Davidowitz closes with some consideration of the inherent risks of having this much information about us available to corporations like Google, Facebook, and … um … Pornhub, as well as the risks of having it in the hands of the government, especially with the convenient excuse of “homeland security” always available to explain any sort of overreach. Take the example in the news this week that a neighbor of Adam Lanza, the Sandy Hook mass murderer, warned police that he was threatening to do just such a thing, only to be told that the police couldn’t do anything because his mother owned the guns legally. What if he’d searched for this online? For ways to kill a lot of people in a short period of time, or to build a bomb, or to invade a building? Should the FBI be knocking on the doors of anyone who searches for such things? Some people would say yes, if it might prevent Sandy Hook or Las Vegas or San Bernardino or Pulse in Orlando or Columbine or Virginia Tech or Luby’s or Binghamton or the Navy Yard. Others will consider this an unreasonable abridgement of our civil liberties. Big Data forces the conversation to move to new places because authorities can learn more about us than ever before – and we’re the ones giving them the information.
Next up: J.M. Coetzee’s Waiting for the Barbarians.