Securing AI and ML projects

Artificial intelligence and machine learning bring new vulnerabilities along with their benefits. Here are examples of how several companies have minimised the risk

Pro

(Image: Stockfresh)

11 August 2020

When enterprises adopt new technology, security is often on the back burner. It can seem more important to get new products or services to customers and internal users as quickly as possible and at the lowest cost. Good security can be slow and expensive.

Artificial intelligence (AI) and machine learning (ML) offer all the same opportunities for vulnerabilities and misconfigurations as earlier technological advances, but they also have unique risks. As enterprises embark on major AI-powered digital transformations, those risks may become greater. “It’s not a good area to rush in,” says Edward Raff, chief scientist at Booz Allen Hamilton.

AI and ML require more data, and more complex data, than other technologies. The algorithms developed by mathematicians and data scientists come out of research projects. “We’re only recently as a scientific community coming to understand that there are security issues with AI,” says Raff.

The volume and processing requirements mean that cloud platforms often handle the workloads, adding another level of complexity and vulnerability. It’s no surprise that cybersecurity is the most worrisome risk for AI adopters. According to a Deloitte survey released last month, 62% of adopters see cybersecurity risks as a major or extreme concern, but only 39% said they are prepared to address those risks.

Compounding the problem is that cybersecurity is one of the top functions for which AI is being used. The more experienced organisations are with AI, the more concerned they are about cybersecurity risks, says Jeff Loucks, executive director of Deloitte’s Centre for Technology, Media and Telecommunications.

In addition, enterprises, even the more experienced ones, are not following basic security practices, such as keeping a full inventory of all AI and ML projects or conducting audits and testing. “Companies aren’t doing a great job right now of implementing these,” says Loucks.

AI and ML data needs create risk

AI and ML systems require three sets of data:

Training data to build a predictive model
Testing data to assess how well the model works
Live transactional or operational data when the model is put to work

While live transactional or operational data is clearly a valuable corporate asset, it can be easy to overlook the pools of training and testing data that also contains sensitive information.

Many of the principles used to protect data in other systems can be applied to AI and ML projects, including anonymisation, tokenisation and encryption. The first step is to ask if the data is needed. It’s tempting, when preparing for AI and ML projects, to collect all the data possible and then see what can be done with it.

Focusing on business outcomes can help enterprises limit the data they collect to just what’s needed. “Data science teams can be very data hungry,” says John Abbatico, CTO at Othot, a company that analyses student data for educational institutions. “We make it clear in dealing with student data that highly sensitive PII [personally identifiable information] is not required and should never be included in the data that is provided to our team.”

Of course, mistakes do happen. For example, customers sometimes provide sensitive personal information such as Social Security numbers. This information doesn’t improve the performance of the models but does create additional risks. Abbatico says that his team has a procedure in place to identify PII, purge it from all systems, and notify the customers about the error. “We don’t consider it a security incident, but our practices make it seem like one.”

AI systems also want contextualised data, which can dramatically expand a company’s exposure risk. Say an insurance company wants a better handle on the driving habits of its customers, it can buy shopping, driving, location and other data sets that can easily be cross-correlated and matched to customer accounts. That new, exponentially richer data set is more attractive to hackers and more devastating to the company’s reputation if it is breached.

AI security by design

One company that has a lot of data to protect is Box, the online file sharing platform. Box uses AI to extract metadata and improve search, classification and other capabilities. “For example, we can extract terms, renewals and pricing information from contracts,” says Lakshmi Hanspal, CISO at Box. “Most of our customers are coming from an era where the classification of their content is either user-defined classification or has been completely ignored. They’re sitting on mountains of data that could be useful for digital transformation — if the content is classified, self-aware, without waiting for human action.”

Protecting data is a key pillar for Box, Hanspal says, and the same data protection standards are applied to AI systems, including training data. “At Box, we believe that it is trust we build, trust we sell, and trust we maintain. We truly believe that this needs to be bolted into the offerings we provide to our partners and customers, not bolted on.”

That means that all systems, including new AI-powered projects, are built around core data security principles, including encryption, logging, monitoring, authentication and access controls. “Digital trust is innate to our platform, and we operationalise it,” Hanspal says.

Box has a secure development process in place for both traditional code and the new AI and ML-powered systems. “We’re aligned with the ISO industry standards on developing secure products,” says Hanspal. “Security by design is built in, and there are checks and balances in place, including penetration testing and red teaming. This is a standard process, and AI and ML projects are no different.”

Mathematicians and data scientists typically do not worry about potential vulnerabilities when writing AI and ML algorithm code. When enterprises build AI systems, they draw on the available open-source algorithms, use commercial “black box” AI systems, or build their own from scratch.

With the open-source code, there is the possibility that attackers have slipped in malicious code or the code includes vulnerabilities or vulnerable dependencies. Proprietary commercial systems also use that open-source code, plus new code that enterprise customers usually aren’t able to look at.

Inversion attacks a major threat

AI and ML systems usually wind up being a combination of open-source libraries and newly written code created by people who aren’t security engineers. Plus, no standard best practices exist for writing secure AI algorithms. Given the shortage of security experts and the shortage of data scientists, people who are experts in both are even in shorter supply.

One of the biggest potential risks of AI and ML algorithms, and the long-term threat that concerns Booz Allen Hamilton’s Raff the most, is the possibility of leaking training data to attackers. “There are inversion attacks where you can get the AI model to give you information about itself and what it was trained on,” he says. “If it was trained on PII data, you can get the model to leak that information to you. The actual PII can be potentially exposed.”

This is an area of active research, Raff says, and a massive potential pain point. Some tools can protect training data from inversion attacks, but they’re too expensive. “We know how to stop that, but to do that increases the cost of training the models by 100 times,” he says. “That’s not me exaggerating. It’s literally 100 times more expensive and longer to train the model, so nobody does it.”

You can’t secure what you can’t explain

Another area of research is explainability. Today, many AI and ML systems — including the AI- and ML-powered tools offered by many major cybersecurity vendors — are “black box” systems. “Vendors are not building explainability in,” says Sounil Yu, CISO-in-residence at YL Ventures. “In security, being able to explain what happened is a foundational component. If I can’t explain why it happened, how can I fix it?”

For companies building their own AI or ML systems, when something goes wrong, they can go back to the training data or to the algorithms used and fix the problem. “If you’re building it from someone else, you have no idea what the training data was,” says Yu.

Need to secure more than just algorithms

An AI system isn’t just a natural language processing engine or just a classification algorithm or just a neural network. Even if those pieces are completely secure, the system still must interact with users and back-end platforms.

Does the system use strong authentication and the principles of least privilege? Are the connections to the back-end databases secure? What about the connections to third-party data sources? Is the user interface resilient against injection attacks?

Another people-related source of insecurity is unique to AI and ML projects: data scientists. “They don’t call them scientists for nothing,” says Othot’s Abbatico. “Good data scientists perform experiments with data that lead to insightful models. Experimentation, however, can lead to risky behaviour when it comes to data security.” They might be tempted to move data to insecure locations or delete sample data sets when done working with them. Othot invested in getting SOC II certification early on, and these controls help enforce strong data protection practices throughout the company, including when it comes to moving or deleting data.

“The truth is, the biggest risk in most AI models everywhere is not in the AI,” says Peter Herzog, product manager of Urvin AI, an AI agency, and co-founder of ISECOM, an international non-profit organisation on security research. The problem, he says, is in the people. “There’s no such thing as an AI model that is free of security problems because people decide how to train them, people decide what data to include, people decide what they want to predict and forecast, and people decide how much of that information to expose.”

Another security risk specific to AI and ML systems is data poisoning, where an attacker feeds information into a system to force it to make inaccurate predictions. For example, attackers may trick systems into thinking that malicious software is safe by feeding it examples of legitimate software that has indicators similar to malware.

It’s a high concern to most organisations, says Raff. “Right now, I’m not aware of any AI systems actually being attacked in real life,” he says. “It’s a real threat down the line, but right now the classic tools that attackers use to evade antivirus are still effective, so they don’t need to get fancier.”

Avoiding bias, model drift

When AI and ML systems are used for enterprise security – for user behaviour analytics, to monitor network traffic or to check for data exfiltration, for example – bias and model drift can create potential risks. A training data set that under-represents particular attacks or that is out of date quickly can leave an organisation vulnerable, especially as the AI is relied on more and more for defence. “You need to be constantly updating your model,” says Raff. “You need to make it a continuous thing.”

In some cases, the training can be automatic. Adapting a model to changing weather patterns or supply chain delivery schedules, for example, can help make it be more reliable over time. When the source of information involves malicious actors, then the training data sets need to be carefully managed to avoid poisoning and manipulation.

Enterprises are already dealing with algorithms creating ethical problems, such as when facial recognition or recruitment platforms discriminate against women or minorities. When bias creeps into algorithms, it can also create compliance problems, or, in the case of self-driving cars and medical applications, can kill people.

Just as algorithms can inject bias into predictions, they can also be used to control for bias. Orthot, for example, helps universities with such goals as optimising class sizes or achieving financial goals. Creating models without appropriate constraints can very easily create bias, says Othot’s Abbatico. “Accounting for bias requires diligence. Adding goals related to diversity helps the modelling understand objectives and can help counter bias that could easily be incorporated in admissions if diversity goals weren’t included as constraints.”

The future of AI is cloudy

AI and ML systems require lots of data, complex algorithms, and powerful processors that can scale up when needed. All the major cloud vendors are falling over themselves to offer data science platforms that have everything in one convenient place. That means that data scientists don’t need to wait for IT to provision servers for them. They can just go online, fill out a couple of forms, and they’re in business.

According to the Deloitte AI survey, 93% of enterprises are using some form of cloud-based AI. “It makes it easier to get started,” says Deloitte’s Loucks. These projects then turn into operational systems, and as they scale up, the configuration issues multiply. With the newest services, centralised, automated configuration and security management dashboards may not be available, and companies must either write their own or wait for a vendor to step up and fill the gap.

When the people using the systems are citizen data scientists or theoretical researchers without strong backgrounds in security, this can be a problem. In addition, vendors historically roll out new features first and security second. That can be a problem when systems are rapidly deployed and then even more rapidly scaled. We’ve already seen this happen with IoT devices, cloud storage and containers.

AI platform vendors are becoming more aware of this threat and have learned from the mistakes says Raff. “I’m seeing more active inclusion of plans to include security than we might otherwise expect given the historic ‘security comes last’ mindset,” he says. “The ML community is more concerned about it, and the lag time is probably going to be shorter.”

Irfan Saif, principal and AI co-leader at Deloitte, agrees, especially when it comes to the major cloud platforms that support large enterprise AI workloads. “I would say, yes, they are more mature than maybe prior technologies have been in terms of the evolution of cybersecurity capabilities.”