There is a growing tension in Europe between the economic interest in enabling developers of artificial intelligence (AI) systems to access data on which to train their models and the legal frameworks that constrain their ability to do so.
Meghan Higgins of Pinsent Masons said the issue has been brought into sharp focus by the recent actions of some leading technologists, on the one hand, and the activity of data protection authorities in Europe, on the other.
Earlier this month, LinkedIn suspended the training of its generative AI models using UK user data after the UK’s Information Commissioner’s Office (ICO) raised concerns about the practice.
That move came after X agreed to suspend its processing of personal data contained in the public posts of users of its platform in the EU and European Economic Area for the purposes of training its AI model, under a deal reached with the Data Protection Commission in Ireland. Privacy campaign group noyb filed complaints in relation to X’s practices in nine EU countries.
“Scraping data from public sources may often involve processing personal data,” Higgins said. “Data available in the public domain could still be considered as personal data and therefore afforded protections under the UK and EU’s General Data Protection Regulations (GDPR).”
The recent activity follows on from a consultation the ICO held earlier this year concerning the appropriate lawful basis to use for training large language models (LLMs). In the consultation, the ICO asked whether ‘legitimate interests’ could be a valid legal basis for this type of processing.
Under the UK GDPR, one of the legal bases that can be relied on to process personal data is the ‘legitimate interests’ ground. This requires that the processing of personal data is necessary for legitimate interests pursued by the controller or by a third party, provided the interests relied upon are not “overridden by the interests or fundamental rights and freedoms of the data subject which require protection of personal data”. Any organisation seeking to rely on legitimate interests must therefore carry out a balancing exercise.
Lucia Doran of Pinsent Masons said: “In its consultation, the ICO suggested that in order for developers to be able to demonstrate a ‘legitimate interest’ in personal data processing, they will, amongst other things, need to be clear about how the model will be used and what its purpose is. It suggested that relying on the broad aims of the AI product may not be sufficient to demonstrate a ‘legitimate interest’ in such processing.”
Even if a legitimate interest can be cited, developers must limit the data being processed to what is necessary to achieve their purpose. In this regard, the ICO also posed a question in its consultation about whether it would be possible to train AI models on smaller datasets.
“This is an emerging area for data protection,” Doran said. “But given the potential concerns of users if their personal data is used to train LLMs, controllers will have to put effective mitigations in place in order to satisfy any balancing test.”
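By way of illustration only, one mitigation of the kind Doran describes is to strip obvious personal identifiers from scraped text before it enters a training corpus. The Python sketch below is a hypothetical example of that data-minimisation step, not ICO guidance; the regex patterns and the `minimise` function are illustrative assumptions, and real pipelines would use far more robust detection of personal data.

```python
import re

# Illustrative-only patterns: a real redaction pipeline would use
# dedicated PII-detection tooling (e.g. named-entity recognition),
# since simple regexes miss names and many other identifiers.
EMAIL_RE = re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b")
PHONE_RE = re.compile(r"\+?\d[\d\s()-]{7,}\d")

def minimise(text: str) -> str:
    """Redact obvious personal identifiers from a training document."""
    text = EMAIL_RE.sub("[EMAIL]", text)
    text = PHONE_RE.sub("[PHONE]", text)
    return text

print(minimise("Contact Jane at jane.doe@example.com or +44 20 7946 0958."))
# -> "Contact Jane at [EMAIL] or [PHONE]."
```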
The approach of European policymakers and data protection authorities towards the use of data for training AI was the subject of an open letter signed by 49 technologists, academics and business executives – including Meta founder and chief executive Mark Zuckerberg – earlier this month.
The signatories to the letter cited concerns over “fragmented and unpredictable” regulatory decision making and “huge uncertainty” caused by “interventions” by EU data protection authorities. They warned that AI developers will choose to develop and deploy innovative new models elsewhere in the world – with the significant economic and social downsides that would mean for Europe – unless there is a “change of course”.
Higgins said: “The publication of the open letter comes at a time when European competitiveness is under intense scrutiny – Mario Draghi, the former European Central Bank president, recently published a detailed report into EU competitiveness, a report which is set to shape the work of the next European Commission under the leadership of Ursula von der Leyen. Addressing the challenges of commercialising technological innovation was one of the main themes of Draghi’s report.”
For businesses developing or using AI tools, however, there are myriad legal issues beyond data protection compliance that they need to navigate.
“There have already been claims raised against AI companies for alleged infringement of intellectual property rights – perhaps most notably, in UK terms, the case raised by Getty Images against Stability AI before the High Court in London – which revolve around using data to train and improve the operation of AI models,” Higgins said.
AI data scraping can also give rise to potential violations of websites’ terms of service, according to Higgins.
Higgins said: “Some websites deploy the robots.txt protocol to deter unwanted data scraping – this essentially tells web robots what they can or cannot do when visiting a website. However, recently, some online publishers have raised concerns that these instructions are being ignored by AI web crawlers seeking data to train their models on.”
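To illustrate how the protocol works, the minimal Python sketch below uses the standard library’s urllib.robotparser to show how a compliant crawler would consult robots.txt before fetching a page; the crawler name “ExampleAIBot” and the sample rules are hypothetical. The key point is that robots.txt is purely advisory, which is why publishers can only complain about, rather than technically prevent, crawlers that ignore it.

```python
from urllib.robotparser import RobotFileParser

# An illustrative robots.txt: it bars a hypothetical AI crawler,
# "ExampleAIBot", from the whole site, and keeps everyone out of /private/.
ROBOTS_TXT = """\
User-agent: ExampleAIBot
Disallow: /

User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# A compliant crawler checks before fetching; nothing in the protocol
# enforces the answer -- it relies on the crawler honouring it.
print(parser.can_fetch("ExampleAIBot", "https://example.com/articles/1"))  # False
print(parser.can_fetch("SomeOtherBot", "https://example.com/articles/1"))  # True
print(parser.can_fetch("SomeOtherBot", "https://example.com/private/x"))   # False
```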
“If a contract is effectively formed between the web-scraping service and the website operator through the website’s terms of use, the operator might in theory have a claim for breach of contract where those terms are violated,” Higgins said, adding that case law in Ireland, as well as in England and Wales, suggests it is possible to establish such contractual claims.
Anthropic, a US-based AI company, was recently accused of “egregious” data scraping practices and of failing to comply with websites’ instructions. Publishers of the affected websites complained that this consumes developer and computing resources unnecessarily, slows web traffic for all users, and has other cost and sustainability implications.
“Case law has also developed in the UK providing that extensive bombardment of websites that adversely impacts their services can be akin to a distributed denial of service attack and violate the Computer Misuse Act,” Higgins said.
“Where some businesses see legal and commercial risks, however, others are seeing an opportunity. Cloudflare has developed a tool that allows online publishers to monitor how often AI models are seeking access to their content and to block models from doing so, and has announced plans to enable website operators to charge AI models for access to their content in future.”
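Cloudflare’s tool is proprietary, but the underlying monitor-and-block idea can be sketched. The Python example below is a minimal, hypothetical WSGI application that logs and refuses requests whose user-agent string matches known AI crawler names; GPTBot, ClaudeBot and CCBot are publicly documented crawler names, while the middleware itself, the denylist and the port are assumptions for illustration, not Cloudflare’s implementation.

```python
from wsgiref.simple_server import make_server

# Illustrative denylist of user-agent substrings used by AI crawlers.
# A real deployment would maintain and update its own list, and note
# that user-agent strings are trivially spoofed, so production tools
# also rely on other signals to identify crawlers.
AI_CRAWLER_TOKENS = ("GPTBot", "ClaudeBot", "CCBot")

def app(environ, start_response):
    ua = environ.get("HTTP_USER_AGENT", "")
    if any(token.lower() in ua.lower() for token in AI_CRAWLER_TOKENS):
        # Log the hit (the "monitoring" half) and refuse it (the "blocking" half).
        print(f"blocked AI crawler: {ua}")
        start_response("403 Forbidden", [("Content-Type", "text/plain")])
        return [b"AI crawlers are not permitted.\n"]
    start_response("200 OK", [("Content-Type", "text/plain")])
    return [b"Hello, reader.\n"]

if __name__ == "__main__":
    make_server("", 8000, app).serve_forever()
```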