Stanford and UC Berkeley researchers claim ChatGPT's performance and accuracy have decreased over time

Researchers compared the performance of OpenAI's GPT-3.5 and GPT-4.
Sejal Sharma
ChatGPT widget on a phone

Robert Way/iStock 

It seems that the honeymoon phase for large language models (LLMs), introduced in the rush to make inroads in the generative AI space, is over.

According to a study by researchers at Stanford and UC Berkeley, the performance of OpenAI’s LLMs has decreased significantly over time.

The researchers wanted to determine if these LLMs were improving, as they can be updated based on data, user feedback, and design changes.

The team evaluated the behavior of the March 2023 and June 2023 versions of GPT-3.5 and GPT-4 on four tasks. The first was solving math problems, the second was answering sensitive/dangerous questions, the third was generating code, and the fourth was assessing the models on visual reasoning.

LLMs’ diverse capabilities

When OpenAI introduced GPT-4 earlier this year, its report claimed that the model is much more reliable and creative, and can handle more nuanced instructions, than GPT-3.5. More recently, GPT-4 was shown to successfully pass difficult exams in professional domains such as medicine and law.

However, the researchers found that the performance and behavior of GPT-3.5 and GPT-4 varied across their respective releases in March and June.

GPT-4, in its March 2023 version, could identify prime numbers with 97.6 percent accuracy, but the team found that its June 2023 version performed very poorly on the same questions, with only 2.4 percent accuracy. By contrast, GPT-3.5's June 2023 version was much better than its March 2023 version at the same task.
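
For readers who want a concrete sense of how such an evaluation might work, here is a minimal sketch in Python that scores a model's answers to "Is N prime?" questions against a ground-truth primality check. The ask_model helper, the prompt wording, and the number range are illustrative assumptions, not the study's exact setup.

```python
import random

def is_prime(n: int) -> bool:
    """Ground-truth primality check used to score the model's answers."""
    if n < 2:
        return False
    for d in range(2, int(n ** 0.5) + 1):
        if n % d == 0:
            return False
    return True

def ask_model(prompt: str) -> str:
    """Placeholder: call your LLM service here, ideally a pinned model version."""
    raise NotImplementedError

def prime_accuracy(num_questions: int = 100, seed: int = 0) -> float:
    """Return the fraction of 'Is N prime?' questions the model answers correctly."""
    rng = random.Random(seed)
    correct = 0
    for _ in range(num_questions):
        n = rng.randint(1_000, 20_000)
        reply = ask_model(f"Is {n} a prime number? Answer yes or no.")
        model_says_prime = reply.strip().lower().startswith("yes")
        if model_says_prime == is_prime(n):
            correct += 1
    return correct / num_questions
```

Running the same fixed question set against the March and June snapshots of a model, as the researchers did, is what makes accuracy changes like the one above directly comparable.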

The team also found that GPT-4 was less willing to answer sensitive questions in June than in March, and both GPT-4 and GPT-3.5 had more formatting mistakes in code generation in June than in March.

Increasing concerns regarding LLMs’ tendency to hallucinate

On a positive note, GPT-4's June update was more robust to jailbreaking attacks than GPT-3.5's. Jailbreaking is a form of manipulation in which a prompt is crafted to conceal a malicious question and bypass the model's safety boundaries, coaxing the LLM into generating responses that could, for example, aid in malware creation.

While the world has been mesmerized by ChatGPT, the study is a powerful reminder that developers need to continuously evaluate and assess the behavior of LLMs in production applications.

“We plan to update the findings presented here in an ongoing long-term study by regularly evaluating GPT-3.5, GPT-4 and other LLMs on diverse tasks over time. For users or companies who rely on LLM services as a component in their ongoing workflow, we recommend that they should implement similar monitoring analysis as we do here for their applications,” said the researchers in the study.
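
As a minimal illustration of that advice, the sketch below wraps a fixed evaluation suite in a small monitoring routine that logs a timestamped score and warns when accuracy drifts from an earlier baseline. The prime_accuracy helper, the baseline, and the threshold are assumptions carried over from the sketch above, not the researchers' actual tooling.

```python
import datetime

def monitor_model(baseline: float, threshold: float = 0.10) -> float:
    """Re-run the fixed evaluation suite and warn if accuracy drops from baseline."""
    score = prime_accuracy()  # hypothetical helper from the earlier sketch
    timestamp = datetime.datetime.now(datetime.timezone.utc).isoformat()
    print(f"{timestamp} accuracy={score:.3f} baseline={baseline:.3f}")
    if baseline - score > threshold:
        print("WARNING: model behavior appears to have drifted; re-validate before relying on it.")
    return score
```

Scheduled regularly, a routine like this gives teams an early signal that an upstream model update has changed behavior their application depends on.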

In contrast, a separate study was carried out by a team of researchers at Microsoft, which has invested billions of dollars in OpenAI. Interestingly, that study concluded that GPT-4 is a significant step toward artificial general intelligence (AGI), a claim many in the AI industry called dangerous.

Study abstract:

GPT-3.5 and GPT-4 are the two most widely used large language model (LLM) services. However, when and how these models are updated over time is opaque. Here, we evaluate the March 2023 and June 2023 versions of GPT-3.5 and GPT-4 on four diverse tasks: 1) solving math problems, 2) answering sensitive/dangerous questions, 3) generating code and 4) visual reasoning. We find that the performance and behavior of both GPT-3.5 and GPT-4 can vary greatly over time. For example, GPT-4 (March 2023) was very good at identifying prime numbers (accuracy 97.6%) but GPT-4 (June 2023) was very poor on these same questions (accuracy 2.4%). Interestingly, GPT-3.5 (June 2023) was much better than GPT-3.5 (March 2023) in this task. GPT-4 was less willing to answer sensitive questions in June than in March, and both GPT-4 and GPT-3.5 had more formatting mistakes in code generation in June than in March. Overall, our findings show that the behavior of the “same” LLM service can change substantially in a relatively short amount of time, highlighting the need for continuous monitoring of LLM quality.
