$1.2 million grant from the NSF to create a search engine for online privacy research

UNIVERSITY PARK, Pennsylvania – A team of researchers led by Penn State recently received a $1.2 million grant from the National Science Foundation (NSF) to create a search engine and other resources that can make the web safer for users by helping scientists sift through billions of online documents to more efficiently collect and classify privacy-related documentation.

The search engine, called PrivaSeer, will use a type of artificial intelligence (AI) called natural language processing, or NLP, to help researchers collect, review and analyze privacy documents, including privacy policies, terms of use, cookie policies, bills and privacy laws, regulatory guidelines and other web-related texts.

NLP combines linguistics, computer science and AI to program computers to better process and analyze large amounts of natural language data.

Ultimately, the search engine could help researchers better understand online privacy and online privacy trends, while helping users browse the web more safely and securely, according to Shomir Wilson, assistant professor of information sciences and technology at Penn State and an Institute for Computational and Data Sciences (ICDS) affiliate.

“Privacy policies are documents we encounter in our daily lives when visiting websites and, in theory, we are supposed to read them,” Wilson said. “But, in practice, few people do. It’s not convenient and it doesn’t fit the way people use the internet. Often people also lack the legal knowledge to understand these documents.”

Wilson, who is the project’s principal investigator (PI), said the search engine is necessary because, even though many documents about organizations’ privacy and data practices are available on the web, researchers face the daunting challenge of identifying and collecting them. According to the researchers, gathering this information currently requires scientists to search for it carefully by hand.

“There has been previous work on privacy policies, but one thing researchers have come across is that there is a lack of good data on these policies,” Wilson said.

The search engine can also offer information about how policies change and help users navigate the complex realm of online privacy, according to C. Lee Giles, David Reese Professor of Information Sciences and Technology at Penn State and a co-PI of the project.

“One of the reasons for having a privacy policy search engine is so you can get an idea of how different companies deal with their users’ privacy now and over time,” said Giles, who is also associated with ICDS. “It can also let users know how they want to react to these companies.”

The researchers said that PrivaSeer will also advance NLP techniques for large-scale interpretation of these privacy documents. This technology will help scientists analyze the state of privacy on an unprecedented scale.

According to Giles, creating the search engine poses several challenges for the team.

“One of the challenges of building a privacy policy search engine is crawling the web for those pages,” Giles said. “There is no list of URLs for this. Do we try a URL, for example, ‘https://company.com/privacy.html’, or something different? Once a page is returned, how do we know it’s a privacy page?”
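
The two questions Giles raises, which URLs to try and how to recognize a privacy page, can be illustrated with a minimal Python sketch. This is not PrivaSeer's actual method: the path list, the cue phrases, the threshold and both function names are hypothetical, and a production system would more likely rely on a trained NLP classifier than on keyword matching.

```python
from urllib.parse import urljoin

# Hypothetical list of common privacy-page paths (an assumption, not PrivaSeer's).
CANDIDATE_PATHS = ["/privacy", "/privacy.html", "/privacy-policy", "/legal/privacy"]

# Hypothetical cue phrases that often appear in privacy policies.
CUE_PHRASES = [
    "privacy policy",
    "personal information",
    "personal data",
    "cookies",
    "third parties",
    "we collect",
]


def candidate_urls(base_url):
    """Build the list of URLs a crawler might try for one site."""
    return [urljoin(base_url, path) for path in CANDIDATE_PATHS]


def looks_like_privacy_page(text, min_hits=3):
    """Crude heuristic: count how many distinct cue phrases occur in the text."""
    lowered = text.lower()
    hits = sum(1 for phrase in CUE_PHRASES if phrase in lowered)
    return hits >= min_hits
```

A crawler built along these lines would fetch each candidate URL and keep only the pages that pass the check. The keyword threshold is deliberately simple and would misclassify many real pages, which is part of why the problem calls for more capable NLP techniques.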

In addition to the search engine, the team also plans to develop corpora – large sets of text data – and application programming interfaces, or APIs.

Other PIs include Florian Schaub, assistant professor of information, electrical engineering and computer science at the University of Michigan, and Gabriela Zanfir-Fortuna, director of global privacy at the Future of Privacy Forum.