Scott Howe ’90 thinks he knows a lot about me, although we have never met. Better to say that he thinks the Internet knows a lot about me — and that I ought to know what it knows. Howe, the CEO of Acxiom, the Little Rock, Ark.-based marketing company, has created a Web portal, AboutTheData.com, where users can go, create an account with identifying information, and discover what is out there about them on the World Wide Web. I decided to take him up on it.

AboutTheData does retrieve a lot of personal information about me, culled from public records and customer databases that companies I do business with have sold to other marketers. It got the size of my house right but was way off on the date I bought it, told me that my car-insurance policy renews in April (true) but also that I don’t have children living at home (false). It said someone in my household buys pet products and enjoys sports, but such things are hardly unusual for people of my age, marital status, and income level. This didn’t seem like Big Brother. More like Mildly Curious Uncle.

Crossing the wrong line in cyberspace, however, will bring you under the gaze of a different uncle — Uncle Sam — and he doesn’t miss much. Thanks to a cache of classified documents leaked by computer technician Edward Snowden to a select group of journalists, including Barton Gellman ’82 (see page 46), we are learning a lot about government snooping on American citizens. The National Security Agency has acknowledged that it subpoenaed phone records from the nation’s three largest telecommunications companies, which it now stores in its massive database. Gellman recently disclosed that the NSA has been scooping up contact lists and address books from Google, Yahoo, and others.

What the government knows about me and what some marketing company knows are very different issues, though related. The databases on which both rely fall within the broad category called “Big Data,” and it is no exaggeration to say that Big Data has the power to transform the world, yielding insights into how highly complex systems work. Applied to my video downloads, Big Data enables Netflix to recommend movies I might enjoy. Applied to epidemiological records, it enables researchers to trace and possibly stop the spread of disease. Applied to phone records, it can help the NSA uncover a terrorist cell. Or learn if I am cheating on my spouse.

How secure is the information (bank records, prescription records, personal photos) that I place willingly on my computer’s hard drive or in the cloud? How much of my life is the government watching without my knowledge? What does privacy mean in an age when seemingly everything about me is known or knowable? Disclosure of the NSA’s surveillance programs has put these questions squarely in the news, and although the constitutionality of those programs has been called into question, the NSA may be a good place to begin.

The NSA says its programs are critical to national security. Furthermore, it says, the agency is not eavesdropping on private conversations but collecting information about those conversations — known as metadata — that can help detect suspicious patterns and expose terrorist networks. “You can’t have 100 percent security and also then have 100 percent privacy and zero inconvenience,” President Barack Obama said last June. “We’re going to have to make some choices as a society.”

Professor Edward Felten believes the NSA should be required to issue regular reports about its surveillance activities.
Peter Murphy

Professor Edward Felten points out that the only reason we are having a national debate about NSA surveillance is that documents were leaked by Snowden, a man whom the United States considers to be a traitor and a fugitive. How, Felten asks, can we make choices about programs the government would not even admit existed until a few months ago?

Felten, the Robert E. Kahn Professor of Computer Science and Public Affairs, is the director of Princeton’s Center for Information Technology Policy (CITP). CITP, which draws faculty and students from several departments, including computer science, economics, politics, sociology, and the Woodrow Wilson School, occupies the third floor of Sherrerd Hall, the glass jewel box on Shapiro Walk. In addition to teaching, Felten blogs (freedom-to-tinker.com), advises several technology companies, and served from 2011 to 2012 as chief technologist for the Federal Trade Commission, where he helped prepare a report on protecting consumer privacy. (More on that later.)

He contends that we have only just begun to consider how much of our privacy we ought to be willing to surrender in the name of security, and how much we already have surrendered without knowing it. Certainly, the Snowden documents have provided a steady stream of revelations. Chief among them was that the Foreign Intelligence Surveillance Court, established under the 1978 Foreign Intelligence Surveillance Act (FISA) to review applications for warrants related to national-security investigations, had ordered Verizon, AT&T, and Sprint to turn over to the NSA records of all calls within the United States and overseas on “an ongoing daily basis.” Those orders were issued under Section 215 of the Patriot Act, which allows U.S. intelligence agencies to collect information needed “to protect against international terrorism or clandestine intelligence activities.” In contrast to what a traditional search warrant requires, the government can obtain information under Section 215 without establishing probable cause, as long as it has a “reasonably articulable suspicion” that the information is relevant to a national-security investigation. It is very difficult to challenge such an assertion because anyone served with a Section 215 request is legally prohibited from revealing that the government has demanded the information.

The FISA court’s order allowed the NSA to gather “call detail records” or “telephony metadata.” This includes the originating and destination numbers of each call; the time, date, and duration; and other pieces of identifying information that are unique to each cellphone. How much of this data does the NSA have? We don’t know. That information is secret, but Felten has made some back-of-the-envelope calculations. Assuming there are approximately 3 billion phone calls made every day in the United States and that each call record takes 50 bytes to store, he estimates that the NSA is collecting about 140 gigabytes of data each day, or 50 terabytes a year. That translates into 25 billion Web pages of information every year, and it is growing daily. The NSA is building a huge data center outside Salt Lake City to hold all of it.
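The arithmetic behind that estimate can be checked in a few lines. The sketch below uses only the two figures quoted above (3 billion calls a day, 50 bytes per record); the function names are illustrative, and the choice to report both decimal and binary units is an assumption made here to show how the rounded figures line up.

```python
# Back-of-the-envelope check of the storage estimate quoted in the text.
CALLS_PER_DAY = 3_000_000_000   # approximate U.S. calls per day (figure from the text)
BYTES_PER_RECORD = 50           # assumed size of one call record (figure from the text)

GIB = 2**30   # one gigabyte in binary units
TIB = 2**40   # one terabyte in binary units


def daily_bytes(calls=CALLS_PER_DAY, record_bytes=BYTES_PER_RECORD):
    """Raw bytes of call records generated in one day."""
    return calls * record_bytes


def yearly_bytes(days=365):
    """Raw bytes of call records generated in one year."""
    return daily_bytes() * days


if __name__ == "__main__":
    per_day = daily_bytes()
    per_year = yearly_bytes()
    print(f"per day:  {per_day / 1e9:.0f} GB (decimal), {per_day / GIB:.0f} GiB (binary)")
    print(f"per year: {per_year / 1e12:.1f} TB (decimal), {per_year / TIB:.1f} TiB (binary)")
```

Measured in binary units, the daily total comes to roughly 140 gigabytes and the yearly total to roughly 50 terabytes, which matches the figures above; in decimal units the same data is 150 gigabytes a day. The comparison to 25 billion Web pages implies an average page of about 2 kilobytes, which is an inference from the numbers rather than something stated in the estimate.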

Phone records are not all. Under a program known as PRISM, the NSA also has collected email and instant-messaging contact lists from at least nine Internet service providers, including Facebook, Google, and Yahoo. There have been reports of other surveillance tools, including something called XKeyscore, which enables the agency to see “nearly everything a typical user does on the Internet,” according to a leaked NSA training manual.

Eric Schmidt ’76, Google’s executive chairman, said that his company did not know that the government was snooping on its servers and strongly criticized what he termed the NSA’s overreach. “There clearly are cases where evil people exist,” he told The Wall Street Journal in November, “but you don’t have to violate the privacy of every single citizen of America to find them.” (On the other hand, when asked by a CNBC interviewer in 2009 whether Internet users should feel comfortable sharing personal information with Google, he famously replied, “If you have something that you don’t want anyone to know, maybe you shouldn’t be doing it in the first place,” continuing to explain that Google retained information and that “it is possible that the information could be made available to the authorities.”)

The NSA’s defenders deny that it is becoming Big Brother. “Nobody’s listening to the content of people’s phone calls,” President Obama assured the public in June about the collection of phone metadata. Leave aside for a moment that the NSA itself has acknowledged instances in which it has misused the telephone records it has collected. The president’s claim rests on a distinction between data — what was said during the calls — and metadata, which might be thought of as data about the data, specifically all that descriptive information covered in the FISA court’s order.

In Felten’s view, this distinction no longer makes much of a difference. As he testified in October before the Senate Judiciary Committee, “It is no longer safe to assume that this ‘summary’ or ‘non-content’ information is less revealing or less sensitive than the content it describes.” Metadata, in other words, often can tell investigators more than the underlying data itself. That is why they want it.

Conversations, Felten explains, are unstructured data. They might be conducted in a foreign language. The speakers might mumble. There might be a lot of background noise. Even if the conversations can be understood, they can be hard to decipher. If a suspect says, “The package is being delivered,” does he mean a birthday present or a bomb? Transcribing and interpreting conversations takes a lot of work, which generally still needs to be done by humans.

Metadata, in contrast, is structured data, which makes it easier to work with, and the NSA has very sophisticated tools that it says can detect subtle patterns of behavior and networks of associations, even without knowing what is said. Those tools have led Peter Swire ’80, a professor at Georgia Tech, to describe this as a Golden Age for Surveillance. “For many investigators,” Swire wrote in a 2011 article for the Center for Democracy and Technology, “who is called is at least as important as what is said in the call.” In August, Obama named Swire to a five-member group assigned to review the nation’s intelligence policies.

The patterns revealed in metadata yield remarkable insights: when people sleep, how many friends they have, even clues about their religious affiliation. The metadata can help investigators construct a model of an organization, such as who is in it, who reports to whom, who is gaining influence, and who is losing it. Only by having all the raw data can analysts apply their algorithms to search for patterns and connections. As NSA director Gen. Keith Alexander put it: “You need the haystack to find the needle.”
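To make the idea concrete, here is a minimal sketch of how call records alone, with no audio, can be turned into a map of who talks to whom and when. The records, the field layout, and the function names are all invented for illustration; this is not a description of the NSA's actual tools, which are not public.

```python
from collections import defaultdict

# Hypothetical call-detail records: (caller, callee, start_hour, duration_seconds).
# These rows are invented for illustration only.
RECORDS = [
    ("A", "B", 9, 120),
    ("A", "C", 10, 45),
    ("B", "C", 11, 300),
    ("D", "A", 14, 60),
    ("E", "A", 22, 30),
    ("A", "B", 23, 15),
]


def build_contact_graph(records):
    """Map each number to the set of distinct numbers it exchanged calls with."""
    graph = defaultdict(set)
    for caller, callee, _hour, _duration in records:
        graph[caller].add(callee)
        graph[callee].add(caller)
    return graph


def most_connected(graph):
    """Return the number with the most distinct contacts (ties broken alphabetically)."""
    return min(graph, key=lambda number: (-len(graph[number]), number))


def quiet_hours(records, number):
    """Hours of the day (0-23) in which `number` placed no calls."""
    active = {hour for caller, _callee, hour, _d in records if caller == number}
    return [hour for hour in range(24) if hour not in active]


if __name__ == "__main__":
    graph = build_contact_graph(RECORDS)
    hub = most_connected(graph)
    print("most connected number:", hub, "with contacts", sorted(graph[hub]))
    print("hours with no outgoing calls from A:", quiet_hours(RECORDS, "A"))
```

Even this toy version shows why structured data is attractive to analysts: counting contacts hints at who sits at the center of a group, and gaps in calling activity hint at daily routines. It also shows the weakness. A number can look central simply because many people happen to call it.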

Scott Howe ’90’s company collects information about consumers’ financial means, leisure pursuits, and shopping habits — and lets consumers know what information it has.
Jacob Slaton/The New York Times/Redux

But turning every haystack over to the government presents troubling questions. These social graphs, as they are called, are sometimes inaccurate and can expose innocent people to suspicion. In a talk to last year’s freshman class, Felten gave an example by showing how easy it would be for investigators to place him at the center of a social network connected to Julian Assange, the founder of WikiLeaks, whom he has never met. There is also what might be called a bootstrapping problem. The NSA is permitted to share intelligence information it gets under a FISA warrant with the FBI or local prosecutors, enabling them to obtain information they could not have gotten if required to show probable cause, the standard for obtaining a traditional search warrant.

Nevertheless, some might say, if I have done nothing wrong, I have nothing to hide. But that, in Felten’s view, misses the point. Even if I have broken no laws, I almost certainly have engaged in behavior that I would prefer to keep private.

To illustrate, he posits the following scenario: Phone records reveal that a young woman receives a telephone call from her gynecologist’s office. Over the next hour, she makes three more calls: one to her mother, one to a man she dated several months earlier, and one to an abortion clinic. We do not need transcripts of those conversations to guess that the woman learned she is pregnant. Such inferences are made easier because many phone numbers, such as domestic-violence or suicide-prevention hotlines, are used for a single purpose.

Anthony Romero ’87, executive director of the American Civil Liberties Union, frames the issue in personal terms. “Every single one of us,” he says, “has had a private conversation that we would be chagrined, embarrassed, aghast if the details were exposed. Privacy is a fundamental part of a dignified life.” The ACLU has filed a lawsuit challenging the constitutionality of the NSA’s surveillance program; Felten has filed a declaration in support of that suit. In December, a judge in a different case found that the metadata-collection program probably is unconstitutional. An appeal was expected.

Data collection on such a massive scale threatens to change the relationship between citizens and government in fundamental ways. Beyond the erosion of personal dignity, Romero says, the knowledge that records of every call people make are being saved will prompt them to think twice before saying or doing something that might make them look bad or before they advocate an unpopular cause. And while you or I may not care if the government has our information, there are many other people — including public officials, judges, journalists, and whistleblowers — whom we should insulate from even the threat of governmental coercion.

Felten suggests that we imagine the politician we most distrust becoming president and ask if we would want a government run by that person to have such personal information about us. For that reason, Romero describes the NSA’s metadata collection as “a loaded gun on a table. It’s just a matter of time before someone picks it up and uses it.”

However, Michael O’Hanlon ’82 *91, a senior fellow in the Center for 21st Century Security and Intelligence at the Brookings Institution, suggests that everyone take a deep breath. “We’re all very quick to indulge our fantasies that Big Brother is watching us,” O’Hanlon says, but the NSA operates under restrictions in federal law and rigorously polices itself. More than a decade removed from 9/11, we may have grown complacent about dangers we face, and if the threats from government surveillance remain hypothetical, the benefits may be real. The NSA asserts that these programs already have thwarted dozens of possible terrorist attacks.

Many other people, from medical researchers to Internet marketers, also are in the business of collecting data haystacks these days, and like the NSA they have a lot to work with. Every day, nearly every hour, we willingly create a digital trail. The cellphone in your pocket and the E-ZPass transponder on your windshield track your movements. Your debit card records your purchases. Your browsing history records your interests — as well as your peccadilloes.

That tension is inherent in the digital age, says Ruby Lee, the Forrest G. Hamrick Professor in Engineering and director of the Princeton Architecture Lab for Multimedia and Security. The social and personal benefits of sharing must be weighed against the risks. Lee defines privacy as the right to determine who gets to see your personal data.

Certain types of records (financial, medical, educational) are legally protected, but access to most personal information on the Internet is negotiated on a website-by-website basis. Some sites require users to accept their privacy policy, which often is set forth in dense legalese that hardly anyone bothers to read. Most sites can do whatever they want with the information they collect, including selling it to data brokers such as Acxiom. Internet-data brokers have at least crude demographic and purchasing data on more than 75 percent of the U.S. population, writes Kaiser Fung ’95, a statistician and adjunct professor at New York University, in his book, Numbersense: How to Use Big Data to Your Advantage (McGraw-Hill 2013).

Merchants try to connect this data with what they know about me as an individual to sell me things. Netflix, for example, uses my rental information, as well as information about people like me, to recommend other movies I might enjoy. Scott Howe urges me to embrace this rather than fear it — in fact, to improve it. He encourages people who visit AboutTheData.com to update or correct inaccurate information about themselves. “Consumers want ads for brands they love,” he insists, and companies can provide them only if they have current data.

Target went so far as to develop a program designed to predict whether a customer was pregnant. In fact, the program’s developers boasted that they could even predict her due date based only on when she shopped and what she bought — not just diapers, but whether she bought certain vitamins or switched from scented to unscented soap — and could use that information to send her targeted ads. According to a story about the program in The New York Times Magazine, Target “knew” that a female customer was pregnant before her parents did.

For all the hype, Fung says that Target’s pregnancy-prediction program was accurate only about 30 percent of the time, which is still very good by industry standards. In any such system, he explains, there are bound to be a lot of false positives — people the model predicts are having a baby but aren’t — but companies don’t mind because the costs of getting it wrong are small. Here, though, is where the difference between corporate data mining and governmental data mining becomes most apparent. If Target misidentifies me as an expectant mother, I receive some useless coupons. If the NSA misidentifies me as a national-security threat, I find myself in a Kafka novel.

Professor Janet Currie *88 says access to medical data, with privacy safeguards, would improve public-health research.
Sameer A. Khan

Big Data also can lead to big breakthroughs in scientific research. Data from public-health departments, hospitals, or insurance companies can reveal risks from long-term exposure to certain chemicals, drug reactions in small groups of patients, or trends in birth weight or teen pregnancy. Medical records are protected by the Health Insurance Portability and Accountability Act (HIPAA), which places strict limits on how those records can be used and by whom. However, states and the federal government do compile detailed information on such things as births, deaths, the incidence of sexually transmitted diseases, and adverse events in hospitals.

Janet Currie *88, the Henry Putnam Professor of Economics and Public Affairs at the Woodrow Wilson School, frequently uses governmental health records in her research on issues such as the effects of pollution on infant health. Researchers, she explains, cannot always rely on state data summaries. If they want to learn, for example, whether infants in a particular area were exposed to pollution, it is necessary to know where their mothers lived when they were pregnant. Starting from birth records and trying to obtain each individual’s consent to use such address information would be impossible, and relying on those who could be located would skew the results.

To use this sort of data, a researcher must submit a protocol to the state’s Institutional Review Board as well as to the university’s or organization’s review board, describing the research and setting limits on how the data would be used. Princeton’s board is governed by federal regulations as well as its own guidelines, which include a requirement that, for research involving humans, researchers ensure “adequate provisions to protect the privacy of the subjects and confidentiality of data.”

Even so, only a few states allow academic researchers access to administrative health records (New Jersey is one), severely limiting the types of public-health research that can be done. Currie says much remains unknown about the health of premature babies in later life, for example, because it is impossible to link their birth records with later hospital and emergency-room records in most states. “Making it hard to collect health-care data really does have costs in terms of limiting what we can learn,” she says. “I think people are kind of schizophrenic about what they want. On one hand, they want us to be able to use medical data to address important public-health problems. On the other hand, they hate the idea that anyone has access to their data.” She believes that statistical methods that “anonymize” data offer a possible way forward.

With the privacy genie out of the bottle, we can only hope to control it. As a practical matter, Felten says, it is difficult to skirt government surveillance. However, in mid-December, the presidential-advisory group on which Peter Swire sits suggested dozens of changes to the NSA’s spying program, many of which Felten and others have advocated. It recommended that phone metadata be stored by the phone companies or an independent body rather than by the NSA, that the agency obtain a court order each time it wants to search the database for information about U.S. citizens, and that control of the NSA be transferred from the military to civilians. It also suggested that privacy advocates be appointed to ensure that civil-liberties concerns are raised in hearings before the FISA court. While President Obama was reported to be “open to many” of the panel’s recommendations, he had not made a decision about them at the time this issue of PAW went to press.

A group called Digital Due Process wants to update the Electronic Communications Privacy Act, which was enacted in 1986 — before email, cellphones, cloud computing, the Internet, or social networking. The group also wants the government to obtain a search warrant based on probable cause before it tracks cellphone locations or compels Internet service providers to turn over customer information.

Felten thinks the NSA should be required to issue regular reports about its surveillance activities and provide details on such things as how many searches it has conducted, how many records it has collected, and how long it is keeping them. “The history has been that broad surveillance capabilities coupled with lack of oversight leads to bad results,” he says.

As for protecting Internet users from private data harvesting, one promising solution is a mathematical technique called differential privacy, which is being investigated by David Blei, an associate professor of computer science, and Rebecca Pottenger ’12, now a Ph.D. student at the University of California, Berkeley. It might be imagined as a computational black box: Personal information goes in and statistical data comes out, but in such a way that identifying information about individuals is scrubbed off or rendered unreadable. “Differential privacy is the most promising method we have for trying to reconcile inferences about a population with the protection of information about individuals,” Felten says.
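One standard building block of differential privacy is the Laplace mechanism, which adds carefully scaled random noise to a statistic before it is released. The sketch below shows it for a simple counting query. It is a textbook illustration, not the specific approach Blei and Pottenger are studying, which is not described here; the ages, the function names, and the choice of epsilon are all assumptions made for the example.

```python
import random


def laplace_noise(scale, rng):
    """Draw one Laplace(0, scale) sample as the difference of two exponentials."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def private_count(values, predicate, epsilon, rng=None):
    """Release a noisy count of the values that satisfy `predicate`.

    Adding or removing one person's record changes a count by at most 1
    (sensitivity 1), so Laplace noise with scale 1/epsilon gives
    epsilon-differential privacy for this single query.
    """
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    rng = rng or random.Random()
    true_count = sum(1 for value in values if predicate(value))
    return true_count + laplace_noise(1.0 / epsilon, rng)


if __name__ == "__main__":
    # Hypothetical ages, invented for illustration only.
    ages = [34, 51, 29, 62, 45, 38]
    noisy = private_count(ages, lambda age: age >= 40, epsilon=0.5)
    print(f"noisy count of people aged 40 or older: {noisy:.2f}")
```

A smaller epsilon means more noise and stronger protection for any one individual; a larger epsilon means a more accurate answer. Averaged over many people the noise washes out, which is how the technique can support conclusions about a population while obscuring any single record. Each additional query spends more of the privacy budget, a bookkeeping step this sketch leaves out.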

In 2012, before he returned to Princeton, Felten advised the FTC on a report, “Protecting Consumer Privacy in an Era of Rapid Change,” which set forth a series of best practices that Internet sites could adopt to promote and protect privacy. Those recommendations fell into three broad categories: first, that companies consider privacy protections at each step in developing their products; second, that they give consumers the option to decide what information the companies will share and with whom; and third, that they be more transparent in disclosing what information they collect and allow consumers to view information about themselves, including information sold to data brokers.

“I prefer to think about it as something that ought to be at the option of you as a consumer,” Felten elaborates today. “If you choose to reveal information for convenience, you can do that. But when you have a situation where info about you is being collected without your knowledge and without your consent and being shared and used, I think that is often harmful and unfair.”

Rebecca MacKinnon, a former visiting fellow at CITP, thinks that the concerns about protecting privacy in different arenas are related. In her 2012 book Consent of the Networked: The Worldwide Struggle for Internet Freedom (Basic Books), she pays most attention to governmental attempts to restrict Web access, but also devotes a chapter to the dangers of corporate surveillance. The keys, she argues, are clarity and accountability over how information is collected and who has power over it. “If you don’t even know who has it,” she says, “it is very difficult to visit consequences on those who abuse it. And we will not have a free society.”

People often say it may be necessary to surrender a little privacy to gain more security or convenience, but absolute safety and the smoothest browsing experience are not the only public goods at issue. Rarely is the question ever reversed. Would we accept a slower Internet in order to protect our privacy? Would we be willing to risk another 9/11 attack?

“There is a tradeoff, there is a balance here, and it’s important to get the balance right,” Felten insists. “But we need to have that conversation instead of pretending that there is not an interest on the privacy side of that scale.” 

Mark F. Bernstein ’83 is PAW’s senior writer.