AI, why don’t you do the job for us? – An experiment

June 3, 2026 / Creativity, Design research / By INNCH

reading time: 12 minutes

We have systematically tested where AI truly helps in our research and design process—and where it does not. The result is both sobering and helpful: AI simulates knowledge and understanding remarkably well, but it has limited ability to translate that knowledge into effective design, evaluation, and practical decisions. For us, it is precisely this distinction that constitutes true AI expertise.

I really don’t feel like making another collage like that!

The collages for the blog posts here—and especially for the main pages of our website—are quite labor-intensive. It would be great if AI could take this work off our hands. For example, we recently needed a new collage on the topic of “Artistic Research.” As a first step, we had the AI (Claude Opus 4.7 and ChatGPT with GPT‑5.5 Thinking) analyze all 10 collages on our site. Then, it was tasked with creating a collage in the same style for the topic “Artistic Research.”

The analysis showed that both LLMs correctly identified the structure made up of oval segments, along with several other aspects. The translation into a design, however, was disappointing. The oval segments didn’t quite work out, and the collage was chaotic, with repetitions and unidentifiable symbols. In some cases, image elements were copied directly from the existing collages without modification. The colors weren’t quite right, and the theme of “artistic research” was illustrated in a rather simplistic way. Too bad… looks like we’ll have to keep making the collages ourselves.

Let’s really put it to the test and see what AI can do for us!

We then used this small test as an opportunity to conduct a more elaborate, systematic experiment. Our goal was to identify areas where we haven’t yet used AI and where it could genuinely make our work easier.

The study was based on 7 real-world projects (consumerLabs) that we had conducted in recent years in collaboration with our clients. These projects followed a similar process and were therefore easy to compare: together, they comprised 46 workshops lasting several hours each, involving more than 250 real participants and over 80 hours of exploration. To truly uncover the potential of AI, we wanted to compare its output not only with the final results of these projects but also with the results of each individual step in the iterative process.

What is a consumerLab?
In the consumerLab, offerings for which a concept is already roughly defined are further developed in collaboration with consumers into a comprehensive proposition that includes a marketing strategy, design, and copy. The labs used for this experiment each consisted of three cycles, each with two customer workshops: Based on hypotheses and, where available, existing insights, several routes were first developed and implemented as concrete mockups—for example, for online teasers, landing pages, pop-ups, newsletters, or social media ads—each with variations in visuals, claims, and text. These were then discussed with customers, analyzed, revised, and refined in team reviews until a validated prototype remained at the end.

We had already used AI for these projects, but so far only as a tool to help us bring our design ideas to life. Manual work was still usually required. We also regularly used AI to generate additional ideas for taglines or product names.

The central question of the experiment was: What tasks will AI soon be able to perform on our behalf? Or: In what areas of the process can it provide additional support that it does not currently offer?

To begin with, we identified the various steps in the process. We then had the LLMs execute these steps one after another, building on the previous steps. The data was prepared identically for each project. We also carefully developed the prompts so that they could be entered identically for each of the 7 projects. For each project, we also adapted the designs—which were developed in real-world labs and discussed with real consumers—for the AI.

Do it!

The plan was for the AI to have, at each step, roughly the same level of knowledge that we had at that point. This broadly covered four “competency” areas:

Psychological Hypotheses: How sound are the initial hypotheses regarding potential consumer insights that the AI has gathered from somewhere on the internet—and from which we then develop the first 3 or 4 different design paths? For example, in one route we assume that customers are primarily interested in getting a good deal, and in another route, that it’s about the greater convenience this new offering promises. To do this, we often search the internet ourselves for available studies on the topic or browse user forums to at least have a basis for developing the routes that isn’t completely off the mark. How effective is the AI at preliminary research?

Idea Generation: How relevant and original are the design concepts and taglines that the AI develops from its initial hypotheses, and how closely do they align with the eventual results from real-world labs? And how effective are they when the AI uses the hypotheses we actually had for the specific project? In other words: How well does the AI replace human idea generation?

Prediction Accuracy: How well does the AI predict—including an explanation—how a design and a claim performed with consumers in a real-world lab setting? For the first prediction, it had no information other than what it had “researched” itself. For the second prediction, we provided it with 15 design drafts and explained which 5 were rated positively and which 5 were rated negatively by consumers. So how well can the AI replace discussions with consumers about the designs?

Design: How good—in terms of suitability, originality, and adherence to good design criteria—is the design that the AI develops from the ideas? First, without any reference to the actual results of the lab, then with knowledge of the evaluation of the 15 designs (see above). At the very end, we provided the AI with the entire evaluation, including many designs, along with information on exactly what was rated as good or bad about each design and why. Based on this, it was tasked with developing new designs. The question here: How well does the AI replace the work of designers and copywriters?

We also repeatedly pushed the LLMs to perform better, for example by running a second round of design development—using a supplementary prompt that asked them to create “particularly creative or funny designs” with a “completely different look and feel than traditional advertising images.”

First, sort through and analyze the jumble of materials

As usual, the AI systems we used were very diligent. Had it made sense to let them handle the analysis, they would happily have scanned and analyzed the vast amounts of text and images on their own. But that wouldn’t have been helpful: we had to do the comparison ourselves. So each of us set about analyzing the data independently.

For each of the four questions (see above), we were also interested in how closely the AI actually followed our instructions and how consistently it operates. Does it always get the same things right and the same things wrong—or does it always do something different? We were also interested in how consistently it translates its own ideas into designs and how consistently it evaluates designs when a second evaluation round is conducted without any “reminder” of the first. These results were also important for deciding which process steps we might soon be able to entrust entirely to the AI and in which areas it can serve as a good complement.

Conclusion: Ultimately, what tasks can AI take off our hands?

Psychological Hypotheses: The initial hypotheses regarding insights worked quite well. After all, when it comes to hypotheses, it’s enough for them to be plausible. They don’t have to be correct; that’s what testing is for. This is a capability that aligns well with AI. In this case, we had the impression that it even generated hypotheses that were interesting and that we wouldn’t necessarily have come up with on our own.

Idea Generation: The AI derives its design ideas very consistently from the insights. Some of the suggestions are quite interesting, including those for taglines. However, the ideas are sometimes a bit vague: the AI doesn’t specify exactly what the visual scene should look like or what its look and feel should be, so you have to use a lot of your own imagination to fill in the gaps. And when a scene is described in detail, you sometimes realize just by picturing it that it would be hard to execute while still meeting the criteria of good design. Example (suggestion from Claude): “A family on the sofa, someone opens the app and immediately finds what everyone wants to see.” Try to fit that into a single image, not a film.

Prediction Accuracy: Image evaluation already runs into problems at the image recognition stage. It is true that real consumers sometimes fail to notice something crucial in an image or misinterpret it, and so misunderstand the design concept. However, the LLMs also failed to recognize aspects that the participants in our labs identified correctly without any difficulty. Images that tell a story, are somewhat unusual, or contain references to well-known memes are usually misinterpreted. If the AI places a suitcase standing in front of a sea-blue wall in the sea, fails to notice the cast on the leg of a man sitting in a beach chair, or badly misreads facial expressions, then the crucial visual cues cannot be incorporated into the evaluation at all.

When evaluating the 15 images, the AI is consistent on average in classifying them as “good” or “bad,” even across multiple runs. However, there are always outliers. For every rating, the LLMs have, as we have come to expect, a convincing-sounding explanation ready. This holds true even when an image that was praised as “good” in the first round is downgraded to “bad” in the second. Unfortunately, such reversals invalidate the evaluation as a whole. Overall, across all evaluation rounds, only about half of the images that were winners in the real-world labs were among the AI’s favorites. The evaluation performance is therefore not good enough to do without the judgment of real people.
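The two checks described here, how often a rating flips between runs and how many of the real lab winners the AI also favors, come down to simple counting. The following is only a minimal sketch of how they could be quantified. The function names, image IDs, and ratings are invented for illustration and are not data or results from the experiment.

```python
def flip_rate(run_a: dict[str, str], run_b: dict[str, str]) -> float:
    """Share of images whose rating changed between two evaluation runs."""
    shared = run_a.keys() & run_b.keys()
    if not shared:
        raise ValueError("the two runs have no images in common")
    flipped = sum(1 for image in shared if run_a[image] != run_b[image])
    return flipped / len(shared)


def winner_hit_rate(lab_winners: set[str], ai_favorites: set[str]) -> float:
    """Share of real lab winners that the AI also picked as favorites."""
    if not lab_winners:
        raise ValueError("no lab winners given")
    return len(lab_winners & ai_favorites) / len(lab_winners)


# Hypothetical example data (NOT taken from the experiment).
run_1 = {"img_01": "good", "img_02": "bad", "img_03": "good", "img_04": "bad"}
run_2 = {"img_01": "good", "img_02": "bad", "img_03": "bad", "img_04": "bad"}
lab_winners = {"img_01", "img_02"}
ai_favorites = {image for image, rating in run_1.items() if rating == "good"}

print(flip_rate(run_1, run_2))                     # 0.25
print(winner_hit_rate(lab_winners, ai_favorites))  # 0.5
```

In this invented example, one of four ratings flips between runs, and the AI recognizes one of the two lab winners. A hit rate of around one half is the kind of figure that makes the judgment of real people indispensable.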

Design: Yes, it’s impressive how AI can now be used to create an entire advertising campaign from just a few sentences—something that would likely take a designer several days to do—not to mention the time required for the photographer, models, props, and the travel expenses for the entire crew.

Until now, we’ve used AI to translate our own design ideas into designs in a very targeted and controlled way using various image-generation tools. However, a design is rarely usable right away. Often, you still have to generate individual elements and assemble them in Photoshop, adjust the mood, correct many details, or create your own sketches in advance as templates for the AI. In this case, however, the AI had, so to speak, free rein in the creative implementation of its own ideas.

At first glance, the results here are impressive too: they look like genuine professional advertisements. Some of the visuals remain useful even on closer inspection, or at least offer design ideas that can serve as templates or inspiration. However, this is rather rare. Most of the designs reveal a whole host of shortcomings when examined more closely:

👉 The images are often stereotypical: they are very similar in both subject matter and atmosphere. In some cases, the people depicted are even the same, and their facial expressions are highly limited.

👉 There is little variety or sophistication in the storytelling, for example in the way danger is hinted at or in the humor.

👉 They look “smoothly ironed” and scream from a distance, “I’m an advertisement!” Some of them are also overloaded and look like hidden-object puzzles.

👉 Often, these are vague images that, without key terms in the text, could represent just about anything. Even across different topics, you could swap out one or two images without anyone noticing, because they could just as easily represent insurance as they could a vacation.

👉 They’re a bit “flat”—the message is often conveyed in a way that’s too literal, or even translated word for word from the text, without any sensory or physical imagination.

👉 The design challenge of conveying a complex message in a single image is often solved simply by adding text, symbols, and diagrams rather than through visual interpretation. As a result, the designs often appear overly fragmented.

The request to be particularly “creative” in the design process doesn’t work at all. The results are creative neither in the sense of a particularly unusual visual concept nor in the sense of an extraordinary style or subtly clever humor. Instead, they are confusing and over-the-top. They stray arbitrarily from the theme and are deliberately offbeat to the point of being silly.

If you finally provide the LLM with all the results, including the visual ones, and ask it to develop its own idea for a different medium, it sometimes manages to come up with new ideas that are reasonably consistent with the results. However, these often stick very closely to the original templates and are sometimes merely slight variations of them. The AI also failed to find a solution for the prompt “Unlike the newsletter, the image should attract more attention, as it will only be viewed briefly. It can also be more unusual, even somewhat provocative, so that it fits the Instagram medium,” even though it subsequently assured us that its new design was particularly well suited to Instagram.

Let’s give it another shot: The LLM is being fed additional knowledge

Since we weren’t entirely satisfied with the results after the evaluation (to be honest, we had hoped for more), we decided to run another test. Perhaps we would still find an area where AI could take on more of our workload than it has so far. What if we equipped the AI with extensive relevant background knowledge and then repeated the tasks?

To this end, the tasks were set up as a “project,” meaning that, in addition to the project-related information and prompts in the chat, the LLM was provided with additional files that had been laboriously prepared to be “AI-readable”, serving as a kind of knowledge base that it was supposed to access at each step. To provide this context, we prepared a series of files, including overarching psychological insights from past projects, information on the desired target formats, and a toolkit of useful creativity techniques. Out of curiosity, we also uploaded all 630 pages of our book “Wie Design wirkt”.

The results were interesting: When it came to generating hypotheses and ideas, there were only very minor differences from the first round. Those had already been useful in the first round, especially the hypotheses. The generated ad designs, however, were also very similar to those from the first round: too stereotypical, too illustrative. This was precisely where we had expected more. On the other hand, the explanations were significantly more elaborate. The AI explained to us very competently why a generated ad image adhered to the design principles from “Wie Design wirkt,” which colors were deliberately chosen, and which principles of effectiveness were applied. It explained which creativity technique it had used and what joke underlies a particular motif. The only problem was: none of this was visible in the designs themselves.

For example, when it was explained that blue was chosen to convey “trust, clarity, and rationality,” complemented by green as a “glimmer of hope” and orange as a sign of “warm, active energy,” we searched in vain for the colors green and orange in the actual image that was generated. We were told why a design wasn’t overloaded and that it conveyed its message in 3 seconds (that was mentioned somewhere in the context files). The image itself, however, was just as overloaded as in the first round, and we still had to puzzle over the message.

What becomes clear here is fascinating: AI recognizes patterns in the uploaded texts (e.g., from our book) and can elegantly expand on them to craft sentences that sound competent. However, this is not practical knowledge: the AI can generate new sentences from the material, but it cannot apply what those sentences say.

Here we have probably reached the most fundamental limit of AI: the lack of (sensory-physical) “understanding.” The lesson for us is that even additional contextual knowledge doesn’t help, so we can save ourselves the effort of preparing it (and it really was a lot of work!). The AI can summarize our book very well in text form, but it doesn’t learn from it in the sense of being able to apply this knowledge to practical tasks.

Where AI can actually help us—and where it can’t

As this experiment shows, our approach probably won’t result in significant time savings. It is certainly helpful to use AI to generate initial hypotheses. The key here is to let the AI develop hypotheses only after you’ve come up with some yourself, so as not to be influenced. However, this still doesn’t replace skimming through user forums, where you get a good feel for what might be important to consumers regarding a specific offering, and often even a design idea: which promises should be highlighted, and which design flaws are best concealed.

AI can also contribute the occasional suggestion when it comes to developing ideas. We’ve been using AI to generate supplementary suggestions for taglines and naming for quite some time now. AI-generated ideas for images can also serve as inspiration, even if it’s just the odd suggestion here and there—it’s definitely a valuable addition. However, you can’t skip the discussion and evaluation of the designs by real customers. Even if the AI wasn’t completely off the mark on average, that’s not enough for a reliable recommendation. The transfer to, for example, a different medium or a different target audience can also serve as inspiration; but it’s not enough to leave it to the AI alone.

Even during targeted and meticulously controlled design development with AI, the AI sometimes acts like a stubborn mule. What is clear to a human designer who has been given a briefing is sometimes not understood at all—or misunderstood—by the AI. If the result happens to be immediately usable—which was rarely the case in this experiment—AI designs can also be included in the labs as supplementary material. It can also be helpful, for example, to prepare a style guide and make it available as a reference so that you can obtain images that are already in line with a client’s design language; however, even this isn’t always applied consistently.

However, it is not enough to simply give the AI free rein in the design. Even the combination of an LLM and a knowledge base (our great hope) does little to change this. For the most part, therefore, our AI-supported research and design process will continue just as we have handled it so far.

But the best outcome of our experiment is this: we now have a much clearer understanding of where AI can be usefully applied in the process—and where it cannot—so neither hype nor excessive skepticism will be able to throw us off course in the future. Isn’t “AI competence” precisely about knowing where AI cannot be usefully applied—and understanding why?

However, the results may differ for other tasks and processes, so our experiment can likely only be generalized to a limited extent.

Our conclusion

AI is particularly helpful in our process when it comes to generating plausible hypotheses, additional ideas, and inspiration. However, it cannot replace real customer interaction, reliable evaluation, or the creative interpretation of complex messages. Additional knowledge bases primarily improve the explanations the AI gives for its designs, not the practical outcomes.
