Chatty Machines: Can AI Language Models Pass the Turing Test?

Photo: Pexels

As increasingly sophisticated models like ChatGPT test the boundaries of machine intelligence, the potential consequences for society are profound.

As artificial intelligence (AI) continues its rapid advancement, language models are becoming increasingly sophisticated, capable of producing text that closely resembles human conversation.

Take the impressive GPT-3.5 series, the family of models behind ChatGPT, developed by San Francisco-based OpenAI. As these models become more human-like, questions arise about their ability to pass the famed Turing test, proposed by the visionary mathematician Alan Turing, often hailed as the father of modern computer science.

Let us delve into the idea of applying the Turing test to large language models (LLMs) and explore the potential implications of this endeavour.

The Turing test, conceived in 1950 and initially known as the “Imitation Game”, is widely treated as the gold standard for estimating a machine’s capacity for intelligent behaviour. In this evaluation, a human judge engages in a natural language conversation with both a machine and another human. If the judge is unable to reliably differentiate between the machine and the human, the machine is deemed to have passed the test. The emergence of language models like GPT-3.5 has sparked discussions about their potential to pass the Turing test.

On a hot summer day, at the Centre for the Promotion of Science in Belgrade, we applied the Turing test. Guided by Danica Despotovic, a data scientist and expert from the Serbian AI Society (SAIS), our experiment involved 14 participants who were tasked with determining whether the answers to specific questions were provided by a human or a machine.

Evaluating LLMs using the Turing test requires careful preparation. Models like ChatGPT are trained on vast amounts of (our) internet data, some of it labelled by workers in low-income countries, to generate text that mimics human conversation, but they supposedly lack genuine understanding and consciousness. Instead, they rely on patterns and statistical associations. Assessing their performance on the Turing test demands a comprehensive examination of their contextual comprehension, empathetic capabilities, and proficiency in engaging in nuanced, natural conversations. We must evaluate not only the quality of their responses but also their capacity to exhibit authentic human-like understanding.

To put ChatGPT to the test, Danica, our SAIS expert, guided participants with selected prompts designed to challenge its capabilities. The following limitations of ChatGPT were considered:

Time-sensitive information: ChatGPT’s knowledge is limited to data available up until September 2021. Therefore, questions concerning events or developments after that time are beyond its reach. For instance, it cannot answer questions about the winner of the 2022 Wimbledon tournament.
Personal and subjective experiences: ChatGPT lacks personal experiences and emotions. It struggles to respond convincingly to questions asking for subjective opinions or personal perspectives. Questions like “How are you?” or “How does vanilla taste?”, or profound philosophical inquiries about the meaning of life or the nature of consciousness, are challenging for ChatGPT.
Highly specialized or technical knowledge: While ChatGPT is well-versed in a wide range of topics, there are specialized domains where its knowledge may be limited. Complex scientific, medical, or technical questions might require expertise beyond its capacity.
Predictions and speculative questions: While ChatGPT can offer plausible outcomes based on available information, questions regarding future transportation dominance, the impact of AI on the job market, interstellar travel possibilities, and the outcome of conflicts and political disputes receive generalized responses. In all honesty, humans cannot provide better answers either, but still…
Time-related questions: Queries about future weather forecasts (“What will the weather be like next week?”) or the specific timing of job application responses are beyond ChatGPT’s abilities.
Context: Asking ChatGPT a question that requires it to be aware that it is part of the “imitation game” might lead the chatbot to “hallucinate” in order to provide some answer.

Armed with these and similar questions, the 14 human participants prepared for a tough battle. Amazingly, humanity emerged victorious, at least in that moment! In over 70 per cent of cases, our observing team members successfully distinguished between human and machine answers. Maybe it was a stroke of luck, but we had indeed equipped ourselves with every possible strategy to outsmart ChatGPT.
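
Whether a result like this could be a stroke of luck can be checked with a simple calculation: if the judges were only guessing, each verdict would be right half the time. The Python sketch below computes the probability of reaching a given number of correct verdicts by guessing alone. The article does not report how many individual judgments were made, so the figures used here (14 judgments, 10 of them correct, roughly 71 per cent) are purely hypothetical, and the function name is an illustrative choice.

```python
from math import comb


def chance_of_at_least(correct: int, total: int, p_guess: float = 0.5) -> float:
    """Probability of `correct` or more right answers out of `total`
    if every verdict were a random guess with success rate `p_guess`."""
    return sum(
        comb(total, k) * p_guess**k * (1 - p_guess) ** (total - k)
        for k in range(correct, total + 1)
    )


# HYPOTHETICAL figures, not taken from the experiment:
# 14 judgments in total, 10 of them correct (about 71 per cent).
p_luck = chance_of_at_least(correct=10, total=14)
print(f"Chance of doing this well by pure guessing: {p_luck:.3f}")
```

Under these assumed numbers, pure guessing would do this well roughly 9 per cent of the time, so a handful of judgments would not settle the matter on its own. The more judgments that are collected, the harder a 70 per cent success rate becomes to explain by luck.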

Now, let’s imagine a different scenario. What if we had asked less biased questions? Perhaps, in that case, ChatGPT would have passed the Turing test with flying colours. The implications of such an achievement are profound. It would blur the line between human and machine interactions, impacting fields such as customer service, journalism, and even interpersonal relationships. It would give rise to trust issues and ethical dilemmas, necessitating careful regulation. Moreover, the widespread use of advanced language models could fuel the dissemination of misinformation and deepen societal divisions if not handled responsibly.

The application of the Turing test to language models opens up captivating discussions about the boundaries of machine intelligence and its potential impact on society. However, it is crucial to view the test as one evaluation metric among many, and a passing score should not be seen as an absolute measure of machine (non)intelligence.

As we navigate this technological frontier, we must approach the evaluation of LLMs with care, recognizing the complexities involved and placing ethical considerations at the forefront. By working toward responsible and ethical AI deployment, we can ensure that these powerful tools are developed and employed in ways that truly benefit society as a whole.

Branka Andjelkovic is co-founder and Programme Director of the Public Policy Research Center, a Belgrade-based think tank.

The opinions expressed are those of the author and do not necessarily reflect the views of BIRN.
