
Vandy Tombs – Using math to improve data privacy in machine learning

Vandy Tombs has always been intentionally protective of her personal data. When she had the opportunity to apply her math background to improving data privacy in machine learning models, she unexpectedly found herself extending the same protection to strangers’ information.

“It's always beautiful when you find math in places you don't expect,” Tombs, an applied mathematician in the Geospatial Science and Human Security Division at the Department of Energy’s Oak Ridge National Laboratory, said of finding this project. “I thought I could do something about privacy with models that don't have lots of data by using differential privacy.”

Differential privacy is a method for rigorously measuring privacy. It is usually achieved by introducing ‘noise’ so that no single piece of data reveals information about a person. This is important in cases where the information could harm the person if revealed, such as health data, account information for cellphones or banks, or studies involving personal answers.
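For readers who want the formal statement, the usual textbook definition (not spelled out in the article) says a randomized algorithm M is ε-differentially private if, for any two datasets D and D′ that differ in one person’s record, and for any set of possible outputs S,

```latex
\Pr[M(D) \in S] \;\le\; e^{\varepsilon}\, \Pr[M(D') \in S].
```

The smaller ε is, the less the output can change when one person’s data is added or removed, so the result reveals almost nothing about that individual beyond what it reveals about the population.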

In enormous data sets, it can be difficult, though not impossible, to trace a piece of data back to a person. In smaller data sets, it is much easier. Consider this example: 10 people are holding scarves. One holds a yellow scarf while the other nine hold green scarves, so the yellow scarf marks 10 percent of the group. In a group of 100 people where the other 99 hold green scarves, the one person with a yellow scarf represents just 1 percent. Mathematically, as the number of data points grows, it becomes easier to get lost in the crowd.

Tombs is working on research to guarantee individual privacy in machine learning models that are trained with small datasets. Traditionally, researchers insert noise into such models to achieve greater privacy. The goal is for this added noise not to impact results when all the information is pooled together but to provide enough of a mask to protect any individual person. When differential privacy is achieved, the model shouldn’t reveal anything about an individual beyond what is learned about the population as a whole.
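As a rough sketch of that conventional approach (this is not ORNL’s code; the function name and parameters below are illustrative assumptions), noise can be injected into each training step so that no single person’s record dominates the model update:

```python
# A minimal, illustrative sketch of noisy training in the spirit of
# differentially private gradient descent. Not ORNL's method; all names
# and parameter values are assumptions for demonstration only.
import numpy as np

rng = np.random.default_rng(0)

def private_gradient_step(weights, X, y, lr=0.1, clip_norm=1.0, noise_scale=1.0):
    """One noisy update for a small logistic-regression model."""
    per_example_grads = []
    for xi, yi in zip(X, y):
        pred = 1.0 / (1.0 + np.exp(-xi @ weights))                  # per-example prediction
        g = (pred - yi) * xi                                        # per-example gradient
        g = g * min(1.0, clip_norm / (np.linalg.norm(g) + 1e-12))   # cap one person's influence
        per_example_grads.append(g)
    avg = np.mean(per_example_grads, axis=0)
    # Gaussian noise masks any single person's contribution to the update.
    noisy = avg + rng.normal(0.0, noise_scale * clip_norm / len(X), size=avg.shape)
    return weights - lr * noisy

# Tiny synthetic "small dataset": 20 people, 3 features each.
X = rng.normal(size=(20, 3))
y = (X[:, 0] > 0).astype(float)
w = np.zeros(3)
for _ in range(100):
    w = private_gradient_step(w, X, y)
```

The tradeoff the next paragraph describes is visible in the sketch: a larger noise_scale strengthens the privacy guarantee but makes each update noisier, which hurts accuracy most when the dataset is small.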

But achieving differential privacy in machine learning models is difficult because the more noise that is added to the training data, the more likely the results are affected. This is especially true for models trained on less data. Why go through the trouble of layers of math that can impact the outcome? Why not just anonymize the data?

Anonymizing data was the standard way to protect data dating back to the 1990s. In the 2000s, however, hackers discovered how to take public information and link it to a person even when the data is anonymized, a technique called a linkage attack. In several high-profile cases, data was anonymized and released with good intentions to improve machine learning algorithms using crowdsourced knowledge. Instead, individual records were correlated to specific people, resulting in the misuse of personal information and lawsuits alleging illegal distribution of data.
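A hypothetical illustration of how such a linkage attack works (the records and field names below are invented): even with names removed, joining an ‘anonymized’ dataset to a public list on shared quasi-identifiers such as ZIP code and birthdate can single a person out.

```python
# Invented example data; no real people or records are represented.
anonymized_health = [
    {"zip": "37830", "birthdate": "1984-02-11", "diagnosis": "asthma"},
    {"zip": "37831", "birthdate": "1990-07-03", "diagnosis": "diabetes"},
]
public_directory = [
    {"name": "A. Smith", "zip": "37830", "birthdate": "1984-02-11"},
    {"name": "B. Jones", "zip": "37920", "birthdate": "1975-12-25"},
]

# Join the two lists on (zip, birthdate); a unique match re-identifies the person.
for record in anonymized_health:
    matches = [p for p in public_directory
               if (p["zip"], p["birthdate"]) == (record["zip"], record["birthdate"])]
    if len(matches) == 1:
        print(f'{matches[0]["name"]} is linked to diagnosis: {record["diagnosis"]}')
```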

Tombs and her colleagues are looking at privacy differently. “Ensuring strong differential privacy guarantees the data is safe from linkage attacks and other privacy attacks that occur after training,” Tombs said. “There are many ways to achieve differential privacy. Rather than using the conventional approach, we are exploring other methods that aren’t usually used with machine learning. We hope that our methods will have less impact on the model outcome for the same or better privacy guarantees.”


Data privacy and personal choice

Protecting data is an active choice for people who want to benefit from machine learning technology, such as improved Google services, while keeping their information safe from unintentional disclosure.

“You don't think one piece of information is valuable, but often data isn’t in isolation,” said Tombs. Many people overlook how separate pieces of data can come together to reveal something personal. Handing a piece of personal data to one company, then another, and another can involuntarily reveal a bigger picture of a person’s life.

Furthermore, people should weigh the risk to their personal information against the convenience of technology. Smart speakers are listening for the opportunity to help a user but are also listening to conversations not meant for that company. New artificial intelligence assistants use the search terms users submit to improve their algorithms, integrating each query into the training data. New apps often require users to create accounts and track how users navigate the app in order to improve the user experience. Each time a person types or talks into an app, that person may be handing a piece of personal information to the company, and how the company uses that data is out of the users’ hands.

While Tombs is protective of her information, she also recognizes the need for research and companies to access data. Through her own research using machine learning on sensitive data like health records, Tombs sees the benefit of doctors and scientists uncovering correlations that wouldn’t be seen without large, merged data sets.

Consumers should take the time to look at a company’s privacy policy, Tombs recommends. “You can probably trust companies that are more transparent with how they protect data to be responsible.”

Finally, Tombs advocates for people to understand how those close to them view the tradeoff between privacy and convenience. Failing to protect one’s own information can affect others living in the same house. “Your data will influence other people who are very closely connected to you. You should be aware of their tolerance levels and maybe consider how your preferences impact theirs.”

UT-Battelle manages ORNL for the Department of Energy’s Office of Science, the single largest supporter of basic research in the physical sciences in the United States. The Office of Science is working to address some of the most pressing challenges of our time. For more information, please visit energy.gov/science. — Liz Neunsinger