Can large language models (LLMs) such as ChatGPT simulate human respondents sufficiently well? A closer look at the meaning and function of natural language in humans and machines can provide an answer to this question.
I was recently asked to give an “expert interview” for a master’s thesis, discussing “silicon sampling in marketing” from a psychological perspective. Silicon sampling refers to synthetic, i.e., artificially generated, respondents or survey data produced by an LLM.
The question is not whether machines think like humans or even develop consciousness—most experts agree that this is not the case and never will be, at least not with the technology used today. Rather, the question is whether purely mathematical procedures in transformer models such as LLMs can achieve sufficiently good results to replace human responses, e.g., in the context of market research or user testing.
LLMs seem ideal for simulating human response behavior: they use natural language, and their formulations often sound astonishingly plausible. Humans and machines “meet” at the interface of natural language. It is therefore worth looking at what function language has for humans, what function it has for an LLM, and how differently it is “processed” in each case.
Humans and language
Humans do not generate language output; they communicate. They use language as a means of exchange to form a shared view of the world with others (even when we are alone, we virtually ‘add’ others to our view). The language we negotiate for communication, with its words and grammar, is only one of many ways in which we communicate, because we also communicate through our bodies and through the entire social and material context. The framework of language alone would be far too limited.
In communication we refer, more or less effectively, to verbal and nonverbal aspects, to our physical presence in the world, to our relationship to the world and to other people, to our intentions and motivations, to our situational states and atmospheric impressions, and a great deal of this always remains unconscious.
When understanding language, we interpret what we hear or read (and many other clues from the context) according to our view of the world and our expectations. We can do this because we ourselves are physically anchored in the world and share many pre-linguistic and non-linguistic life experiences with our communication partners.
The machine and language
For the machine, the language interface is something completely different. It does not communicate. It predicts the most probable next word on the basis of highly complex text patterns and thus remains entirely at the level of the data contained in language as language. It refers associatively to other pieces of language in high-dimensional data spaces, but not to anything non-linguistic, such as physical or atmospheric experience.
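A deliberately tiny toy sketch can make this mechanical character concrete. The vocabulary and scores below are made up purely for illustration; a real LLM works over tens of thousands of tokens and billions of parameters, but the principle is the same: scores over a vocabulary are turned into probabilities, and the next word is drawn from them, with nothing outside the text ever entering the picture.

```python
import numpy as np

# Toy illustration, not a real LLM: given the text so far, the model
# assigns a score to every word in its vocabulary, turns those scores
# into probabilities (softmax), and draws the next word from them.
vocab = ["coffee", "tea", "rain", "music"]      # made-up mini vocabulary
logits = np.array([2.1, 1.7, 0.3, -0.5])        # made-up scores for "I'd like a cup of ..."

probs = np.exp(logits) / np.exp(logits).sum()   # softmax: scores -> probabilities
next_word = np.random.choice(vocab, p=probs)    # sample the continuation

print(dict(zip(vocab, probs.round(2))), "->", next_word)
```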
It is then we, in turn, who think we recognize meaning and significance in the machine’s output (and mistake it for ‘communication about something’), even though for the machine it refers to nothing beyond the bare words. Such projections go so far that people even use an LLM as a personal coach or therapist (in effect, they are then treating themselves as if in a mirror).
An LLM is therefore a powerful tool for analyzing and predicting text patterns. However, it remains blind to the meaning and significance that people want to express and refer to when communicating with others.
This human level is not currently contained in the huge text libraries fed into LLMs, because only a very small part of our ‘innermost thoughts’ ever finds its way into language or text (let alone onto the internet, where it could be used to train LLMs), and then usually in a highly processed form.
Nor can it be added through training, because many things cannot be captured in written form at all, such as atmospheres that are felt physically, or bodily knowledge like playing the piano or riding a bike. There is much that people cannot even put into words. Many things (and in psychology often the most important ones) are simply indescribable, and some are just helplessly vague.
Carbon versus mathematics
Nevertheless, the linguistic output often seems realistic and human, regularly reaching new highs on Turing-test-like measures. Could it therefore be entirely sufficient to recognize and predict patterns on a purely textual, statistical level?
The answer here, too, lies in how differently language is processed. The machine imitates what is most expected at the level of text. This creates a high degree of plausibility, precisely because the AI operates within the logic of statistics and probabilities. Thus even pure statistics can sometimes simulate possible human (language) behavior well, either by chance or when the task stays very close to the patterns in the training material.
However, when faced with novel tasks or situations (which is usually the case in research projects), an LLM only produces statistically probable and therefore plausible-sounding text continuations. In addition, synthetic responses are not reproducible: even with identical prompts, an LLM will produce different responses depending on the model version, system prompts, individual settings, or internal random processes. This variation cannot be explained psychologically; it has purely technical causes.
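A minimal sketch of this purely technical variability, again with a made-up vocabulary and made-up scores rather than a real model or API: as long as the sampling temperature is above zero, the very same “prompt” (here: the very same probability distribution) yields different answers on every run, while greedy decoding at temperature zero always repeats itself.

```python
import numpy as np

rng = np.random.default_rng()                  # no fixed seed, like a live model call

vocab = ["agree", "disagree", "it depends"]    # made-up answer options
logits = np.array([1.2, 1.0, 0.8])             # made-up scores for one survey question

def answer(temperature):
    if temperature == 0:
        return vocab[int(np.argmax(logits))]   # greedy decoding: always the same answer
    p = np.exp(logits / temperature)
    p /= p.sum()
    return rng.choice(vocab, p=p)              # sampling: varies from run to run

print([answer(temperature=1.0) for _ in range(5)])  # e.g. ['agree', 'it depends', 'agree', ...]
print([answer(temperature=0) for _ in range(5)])    # identical every time
```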
There is a wonderful term for this: “bullshit.” Bullshit is defined as something that sounds plausible, while it does not matter whether it is true or not. And that is the crux: from the output alone, it is impossible to decide whether something is a good simulation or simply nonsense.
Conclusion
As a psychologist, one should maintain a healthy skepticism whenever “silicon sampling” is treated as a serious alternative to interviewing people. If I want to find out how people think and feel, and why they come to a certain conclusion, it is not enough that the output merely sounds like something people would say.
I can certainly still use an LLM, e.g., for initial hypothesis generation when the topic has already been researched in a similar form and published, or to get an idea of what a specific target group might sound like. But then it is a supplement, and I remain aware that it is the confabulation of a machine based on analyses of published texts.
This can be extremely helpful. However, it does not simulate human respondents well or reliably enough, and, importantly for the wider debate, it will not do so even with ever better or specially trained models, because that only increases textual plausibility further without making the simulation of experience and behavior any more reliable.
(ms)
Additions and comments:
* The post image was created entirely with AI, in keeping with the topic (an exception here on the blog; I think it was that Gemini “Banana” model).
** The fact that the training data for LLMs always refers to the past became clear to me when I asked ChatGPT before the interview what “silicon sampling in marketing” meant. The answer: “Silicon sampling refers to mini or dummy product samples that look like real electronic devices but are often not functional. They serve as haptic, visual, or demonstrative samples before a product is actually developed or produced. The term comes from the fact that these samples are often made of silicone, plastic, or 3D printing—i.e., ‘silicone’ as a material, not silicon (semiconductor).” ChatGPT 5.1 must have already completed its training before the term found its way onto the internet. Still, it sounds plausible.