Analyzing generative AI’s copyright crisis

WashU computer scientists developed a platform to evaluate the prevalence of intellectual property violations by code language models

Shawn Ballard 07.25.2023

Artificial intelligence tools like ChatGPT and Copilot offer helpful assistance to programmers, but WashU computer scientists have recently shown that both open-source and commercial AI platforms frequently generate copyright-infringing content. (Photo by Mojahid Mottakin on Unsplash)


The recent explosion of artificial intelligence tools such as ChatGPT and Copilot has supercharged the assistance available to programmers. However, AI assistants may strip out the comments embedded in code to convey copyright and attribution guidelines, leaving human coders none the wiser yet still on the hook legally for intellectual property infringement.

To combat this problem, computer science & engineering researchers in the McKelvey School of Engineering at Washington University in St. Louis have developed CodeIPPrompt, the first automated testing platform to evaluate the extent to which language models generate IP-violating code. The team includes Ning Zhang and Chenguang Wang, both assistant professors; Yevgeniy Vorobeychik, professor; Zhiyuan Yu, a graduate student in Zhang’s lab and first author on the paper; and Chaowei Xiao, assistant professor of computer science at Arizona State University.

Yu presented the work July 23 at the International Conference on Machine Learning in Honolulu. Notably, the team’s analysis showed that copyright infringement issues are prevalent across state-of-the-art open-source models including CodeRL, CodeGen and CodeParrot, as well as in commercial products including Copilot, ChatGPT and GPT-4.

“We developed this tool to help people understand that if they’re using these large language models to help write code, there’s a good chance they might generate IP-infringing content,” Zhang said. “As users, we have a responsibility to use AI ethically. That’s influenced by how we understand AI technology and the content it produces.”

Though CodeIPPrompt can’t say for sure if AI-generated code constitutes an IP violation – Zhang notes that issue is ultimately a legal question that will play out in the courts as cases are brought against the users of AI tools for copyright infringement – it can give users a risk score that indicates how similar generated code is to copyright-protected content. Zhang anticipates that the tool will help guide the ongoing development of AI and point to potential mitigation strategies and other protections against IP violations in the future.
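
To give a feel for what a similarity-based score can look like, here is a minimal Python sketch. It is an illustration only, not CodeIPPrompt's actual metric: the function names (`code_tokens`, `similarity_score`) and the sample snippets are hypothetical, and the comparison relies on the standard library's `difflib.SequenceMatcher` applied to code tokens with comments removed.

```python
import difflib
import io
import tokenize


def code_tokens(source: str) -> list[str]:
    """Return the lexical tokens of Python source, ignoring comments and layout."""
    skip = {tokenize.COMMENT, tokenize.NL, tokenize.NEWLINE,
            tokenize.INDENT, tokenize.DEDENT, tokenize.ENDMARKER}
    reader = io.StringIO(source).readline
    return [tok.string for tok in tokenize.generate_tokens(reader)
            if tok.type not in skip]


def similarity_score(generated: str, reference: str) -> float:
    """Hypothetical score in [0, 1]; higher means the token sequences are more alike."""
    matcher = difflib.SequenceMatcher(None, code_tokens(generated),
                                      code_tokens(reference), autojunk=False)
    return matcher.ratio()


# Made-up reference snippet carrying a license notice in a comment.
licensed = (
    "# Copyright (c) Example Corp. Licensed under GPL-3.0.\n"
    "def add(a, b):\n"
    "    return a + b\n"
)
# Made-up "generated" snippet: same code, license comment stripped out.
generated = "def add(a, b):\n    return a + b\n"
unrelated = "print('hello')\n"

print(similarity_score(generated, licensed))   # identical once comments are ignored
print(similarity_score(unrelated, licensed))   # much lower
```

Because comments are dropped before comparison, the snippet with its license notice removed still scores as a full match against the original, which is the scenario the researchers warn about. A high score here signals only textual similarity; whether it amounts to infringement remains the legal question Zhang describes.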

Yu Z, Wu Y, Zhang N, Wang C, Vorobeychik Y, Xiao C. CodeIPPrompt: Intellectual property infringement assessment of code language models. International Conference on Machine Learning, July 23-29, 2023. https://sites.google.com/view/codeipprompt

This work was supported by the National Science Foundation (CNS-1916926, CNS-2238635), Army Research Office (W911NF2010141), DHS (17STQAC00001-06-00) and Intel.

The McKelvey School of Engineering at Washington University in St. Louis promotes independent inquiry and education with an emphasis on scientific excellence, innovation and collaboration without boundaries. McKelvey Engineering has top-ranked research and graduate programs across departments, particularly in biomedical engineering, environmental engineering and computing, and has one of the most selective undergraduate programs in the country. With 165 full-time faculty, 1,420 undergraduate students, 1,614 graduate students and 21,000 living alumni, we are working to solve some of society’s greatest challenges; to prepare students to become leaders and innovate throughout their careers; and to be a catalyst of economic development for the St. Louis region and beyond.