AI, why don’t you do the job for us? – An experiment

rea­ding time: 12 minu­tes

We have sys­te­ma­ti­cally tes­ted where AI truly helps in our rese­arch and design process—and where it does not. The result is both sobering and hel­pful: AI simu­la­tes know­ledge and under­stan­ding remar­kably well, but it has limi­ted ability to trans­late that know­ledge into effec­tive design, eva­lua­tion, and prac­ti­cal decis­i­ons. For us, it is pre­cis­ely this distinc­tion that con­sti­tu­tes true AI expertise.

I really don’t feel like making ano­ther col­lage like that!

The col­la­ges for the blog posts here—and espe­ci­ally for the main pages of our website—are quite labor-inten­sive. It would be great if AI could take this work off our hands. For exam­ple, we recently nee­ded a new col­lage on the topic of “Artis­tic Rese­arch.” As a first step, we had the AI (Claude Opus 4.7 and ChatGPT with GPT‑5.5 Thin­king) ana­lyze all 10 col­la­ges on our site. Then, it was tas­ked with crea­ting a col­lage in the same style for the topic “Artis­tic Research.”

The ana­ly­sis showed that both LLMs cor­rectly iden­ti­fied the struc­ture made up of oval seg­ments and seve­ral other aspects. Howe­ver, the trans­la­tion into a design was dis­ap­poin­ting. The oval seg­ments didn’t quite work out. The col­lage is chao­tic. There are repe­ti­ti­ons and inde­fi­nable sym­bols. In some cases, image ele­ments were copied directly from the tem­pla­tes wit­hout modi­fi­ca­tion. The colors aren’t quite right. The theme of “artis­tic rese­arch” was also illus­tra­ted in a rather sim­pli­stic way. Too bad… looks like we’ll have to keep making the col­la­ges ourselves.

Let’s really put it to the test and see what AI can do for us!

We then used this small test as an oppor­tu­nity to con­duct a more ela­bo­rate, sys­te­ma­tic expe­ri­ment. Our goal was to iden­tify areas where we haven’t yet used AI and where it could genui­nely make our work easier.

The study was based on 7 real-world projects (consumerLabs) that we had conducted in recent years in collaboration with our clients. These projects were similar in terms of process and therefore easily comparable: the 7 projects comprised a total of 46 workshops lasting several hours each, involving more than 250 real participants and over 80 hours of exploration. However, we wanted to compare the AI’s output not only with the final results of these projects but also with the results of each individual step in the iterative process, in order to truly uncover the potential of AI.

What is a consumerLab?

In the con­sum­er­Lab, offe­rings for which a con­cept is alre­ady roughly defi­ned are fur­ther deve­lo­ped in col­la­bo­ra­tion with con­su­mers into a com­pre­hen­sive pro­po­si­tion that includes a mar­ke­ting stra­tegy, design, and copy. The labs used for this expe­ri­ment each con­sis­ted of three cycles, each with two cus­to­mer work­shops: Based on hypo­the­ses and, where available, exis­ting insights, seve­ral rou­tes were first deve­lo­ped and imple­men­ted as con­crete mockups—for exam­ple, for online teasers, landing pages, pop-ups, news­let­ters, or social media ads—each with varia­ti­ons in visu­als, claims, and text. These were then dis­cus­sed with cus­to­mers, ana­ly­zed, revi­sed, and refi­ned in team reviews until a vali­da­ted pro­to­type remained at the end.

We had alre­ady used AI for these pro­jects, but so far only as a tool to help us bring our design ideas to life. Manual work was still usually requi­red. We also regu­larly used AI to gene­rate addi­tio­nal ideas for tag­li­nes or pro­duct names.

The cen­tral ques­tion of the expe­ri­ment was: What tasks will AI soon be able to per­form on our behalf? Or: In what areas of the pro­cess can it pro­vide addi­tio­nal sup­port that it does not curr­ently offer?

To begin with, we iden­ti­fied the various steps in the pro­cess. We then had the LLMs exe­cute these steps one after ano­ther, buil­ding on the pre­vious steps. The data was pre­pared iden­ti­cally for each pro­ject. We also carefully deve­lo­ped the prompts so that they could be ente­red iden­ti­cally for each of the 7 pro­jects. For each pro­ject, we also adapted the designs—which were deve­lo­ped in real-world labs and dis­cus­sed with real consumers—for the AI.

Do it!

The plan was for the AI to have, at each step, roughly the same level of know­ledge that we had at that point. This broadly covered four “com­pe­tency” areas:

Psy­cho­lo­gi­cal Hypo­the­ses: How sound are the initial hypo­the­ses regar­ding poten­tial con­su­mer insights that the AI has gathe­red from some­where on the internet—and from which we then deve­lop the first 3 or 4 dif­fe­rent design paths? For exam­ple, in one route we assume that cus­to­mers are pri­ma­rily inte­res­ted in get­ting a good deal, and in ano­ther route, that it’s about the grea­ter con­ve­ni­ence this new offe­ring pro­mi­ses. To do this, we often search the inter­net our­sel­ves for available stu­dies on the topic or browse user forums to at least have a basis for deve­lo­ping the rou­tes that isn’t com­ple­tely off the mark. How effec­tive is the AI at preli­mi­nary research?

Idea Gene­ra­tion: How rele­vant and ori­gi­nal are the design con­cepts and tag­li­nes that the AI deve­lops from its initial hypo­the­ses, and how clo­sely do they align with the even­tual results from real-world labs? And how effec­tive are they when the AI uses the hypo­the­ses we actually had for the spe­ci­fic pro­ject? In other words: How well does the AI replace human idea generation?

Pre­dic­tion Accu­racy: How well does the AI predict—including an explanation—how a design and a claim per­for­med with con­su­mers in a real-world lab set­ting? For the first pre­dic­tion, it had no infor­ma­tion other than what it had “rese­ar­ched” its­elf. For the second pre­dic­tion, we pro­vi­ded it with 15 design drafts and explai­ned which 5 were rated posi­tively and which 5 were rated nega­tively by con­su­mers. So how well can the AI replace dis­cus­sions with con­su­mers about the designs?

Design: How good—in terms of sui­ta­bi­lity, ori­gi­na­lity, and adhe­rence to good design criteria—is the design that the AI deve­lops from the ideas? First, wit­hout any refe­rence to the actual results of the lab, then with know­ledge of the eva­lua­tion of the 15 designs (see above). At the very end, we pro­vi­ded the AI with the entire eva­lua­tion, inclu­ding many designs, along with infor­ma­tion on exactly what was rated as good or bad about each design and why. Based on this, it was tas­ked with deve­lo­ping new designs. The ques­tion here: How well does the AI replace the work of desi­gners and copywriters?

We also repea­tedly pushed the LLMs to per­form bet­ter, for exam­ple by run­ning a second round of design development—using a sup­ple­men­tary prompt that asked them to create “par­ti­cu­larly crea­tive or funny designs” with a “com­ple­tely dif­fe­rent look and feel than tra­di­tio­nal adver­ti­sing images.”

First, sort through and ana­lyze the jum­ble of materials

As usual, the AI systems we used were hard at work. Had it made sense to let the AI handle the analysis as well, it would happily have scanned and analyzed the vast amounts of text and images on its own. But that wouldn’t have been helpful: we had to do the comparison ourselves. So we each set about analyzing the data independently.

For each of the four questions (see above), we were also interested in how closely the AI actually followed our instructions and how consistently it operated. Did it always get the same things right and the same things wrong, or did it do something different each time? We were also interested in how consistently it translated its own ideas into designs and how consistently it evaluated designs when a second evaluation round was conducted without any “reminder” of the first. These results were also important for deciding which process steps we might soon be able to entrust entirely to the AI and in which areas it can serve as a good complement.

Con­clu­sion: Ulti­m­ately, what tasks can AI take off our hands?

Psy­cho­lo­gi­cal Hypo­the­ses: The initial hypo­the­ses regar­ding insights worked quite well. After all, when it comes to hypo­the­ses, it’s enough for them to be plau­si­ble. They don’t have to be cor­rect; that’s what test­ing is for. This is a capa­bi­lity that ali­gns well with AI. In this case, we had the impres­sion that it even gene­ra­ted hypo­the­ses that were inte­res­t­ing and that we wouldn’t neces­s­a­rily have come up with on our own.

Idea Gene­ra­tion: The AI gene­ra­tes its design ideas very con­sis­t­ently based on the insights. Some of the sug­ges­ti­ons are quite inte­res­t­ing, inclu­ding for tag­li­nes. Howe­ver, the ideas are some­ti­mes a bit vague. It doesn’t spe­cify exactly what the visual scene should look like or what its look and feel should be. This means you have to use a lot of your own ima­gi­na­tion to fill in the gaps. If the scene is descri­bed in detail, you some­ti­mes rea­lize even in your ima­gi­na­tion that it would be dif­fi­cult to exe­cute if it is to still work accor­ding to the cri­te­ria of good design. Exam­ple (sug­ges­tion from Claude): “A family on the sofa, someone opens the app—and imme­dia­tely finds what ever­yone wants to see.” Try to fit that into a sin­gle image, not a film.

Prediction Accuracy: Image evaluation already runs into problems at the stage of image recognition. It is true that real consumers sometimes fail to notice something crucial in an image or misinterpret it, leading them to misunderstand the design concept. However, the LLMs also failed to recognize aspects that our participants in the labs identified correctly without any difficulty. Images that tell a story, are somewhat unusual, or contain references to well-known memes, for example, are usually misinterpreted. If the AI places a suitcase standing in front of a sea-blue wall in the sea, fails to notice the cast on the leg of a man sitting in a beach chair, or badly misreads facial expressions, then the crucial visual cues cannot be incorporated into the evaluation at all.

When evaluating the 15 images, the AI’s classifications into “good” and “bad” are consistent on average, even across multiple runs. However, there are always outliers. For every rating, the LLMs, true to form, have a convincing-sounding explanation ready. This holds true even when an image that was praised as “good” in the first round is downgraded to “bad” in the second. Unfortunately, such reversals call the entire evaluation into question. Overall, across all evaluation rounds, the AI picked only about half of the images that were winners in the real-world labs as favorites. The evaluation performance is therefore not good enough to do without the judgment of real people.
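
For readers who want to run a similar check on their own material, the two measures discussed above can be sketched in a few lines of Python. This is only an illustration, not the tooling we used: the function names `flip_rate` and `winner_hit_rate` are hypothetical, and the ratings in the example are made up purely to show the arithmetic. They are not data from our labs.

```python
from typing import Dict, Set


def flip_rate(run_a: Dict[str, str], run_b: Dict[str, str]) -> float:
    """Share of designs whose label differs between two evaluation runs."""
    shared = run_a.keys() & run_b.keys()
    if not shared:
        raise ValueError("the two runs have no designs in common")
    flips = sum(1 for design in shared if run_a[design] != run_b[design])
    return flips / len(shared)


def winner_hit_rate(ai_favorites: Set[str], lab_winners: Set[str]) -> float:
    """Share of real-world lab winners that the AI also picked as favorites."""
    if not lab_winners:
        raise ValueError("lab_winners must not be empty")
    return len(ai_favorites & lab_winners) / len(lab_winners)


# Made-up example ratings (NOT results from the experiment).
run_1 = {"d01": "good", "d02": "bad", "d03": "good", "d04": "bad"}
run_2 = {"d01": "good", "d02": "good", "d03": "good", "d04": "bad"}
lab_winners = {"d01", "d02"}

# Designs the AI rated "good" in the first run count as its favorites.
ai_favorites = {design for design, label in run_1.items() if label == "good"}

print(flip_rate(run_1, run_2))                     # 0.25: one of four labels flipped
print(winner_hit_rate(ai_favorites, lab_winners))  # 0.5: one of two winners found
```

A low flip rate on its own says little: as described above, a rating that is stable on average can still miss half of the designs that real participants preferred.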

Design: Yes, it’s impres­sive how AI can now be used to create an entire adver­ti­sing cam­paign from just a few sentences—something that would likely take a desi­gner seve­ral days to do—not to men­tion the time requi­red for the pho­to­grapher, models, props, and the tra­vel expen­ses for the entire crew.

Until now, we’ve used AI to trans­late our own design ideas into designs in a very tar­ge­ted and con­trol­led way using various image-gene­ra­tion tools. Howe­ver, a design is rarely usable right away. Often, you still have to gene­rate indi­vi­dual ele­ments and assem­ble them in Pho­to­shop, adjust the mood, cor­rect many details, or create your own sket­ches in advance as tem­pla­tes for the AI. In this case, howe­ver, the AI had, so to speak, free rein in the crea­tive imple­men­ta­tion of its own ideas.

At first glance, the results here are impres­sive too: they look like genuine pro­fes­sio­nal adver­ti­se­ments. Even upon clo­ser inspec­tion, some of the visu­als are still useful, or at least offer design ideas that can serve as tem­pla­tes or inspi­ra­tion. Howe­ver, this is rather rare. Upon clo­ser inspec­tion, most of the designs reveal a whole host of shortcomings:

👉 The images are often ste­reo­ty­pi­cal: they are very simi­lar in both sub­ject mat­ter and atmo­sphere. In some cases, the peo­ple depic­ted are even the same, and their facial expres­si­ons are highly limited.

👉 There is little variety or sophisti­ca­tion in the sto­rytel­ling, for exam­ple in the way dan­ger is hin­ted at or in the humor.

👉 They look “smoothly iro­ned” and scream from a distance, “I’m an adver­ti­se­ment!” Some of them are also over­loa­ded and look like hid­den-object puzzles.

👉 Often, these are vague images that, wit­hout key terms in the text, could repre­sent just about any­thing. Even across dif­fe­rent topics, you could swap out one or two images wit­hout anyone noti­cing, because they could just as easily repre­sent insu­rance as they could a vacation.

👉 They’re a bit “flat”—the mes­sage is often con­veyed in a way that’s too lite­ral, or even trans­la­ted word for word from the text, wit­hout any sen­sory or phy­si­cal imagination.

👉 The design chall­enge of con­vey­ing a com­plex mes­sage in a sin­gle image is often sol­ved sim­ply by adding text, sym­bols, and dia­grams rather than through visual inter­pre­ta­tion. As a result, the designs often appear overly fragmented.

The request to be particularly “creative” in the design process doesn’t work at all. The results are creative neither in the sense of a particularly unusual visual concept nor in the sense of an extraordinary style or subtly clever humor. Instead, they are confusing and over-the-top. They stray arbitrarily from the theme and are deliberately offbeat to the point of silliness.

If you finally provide the LLM with all the results, including visual ones, and ask it to develop its own idea for a different medium, it sometimes manages to come up with new ideas that are reasonably consistent with the results. However, these often stick very closely to the original templates and are sometimes merely slight variations of them. The AI also failed to find a solution for the prompt: “Unlike the newsletter, the image should attract more attention, as it will only be viewed briefly. It can also be more unusual, even somewhat provocative, so that it fits the Instagram medium.” Nevertheless, we were subsequently assured that this new design was particularly well suited to Instagram.

Let’s give it another shot: The LLM is fed additional knowledge

Since we weren’t entirely satisfied with the results after the evaluation (to be honest, we had hoped for more), we decided to run another test. Perhaps we would still find an area where AI can take on more of our workload than it has so far. What if we equipped the AI with extensive relevant background knowledge and then repeated the tasks?

To this end, the tasks were set up as a “pro­ject,” mea­ning that, in addi­tion to the pro­ject-rela­ted infor­ma­tion and prompts in the chat, the LLM was pro­vi­ded with addi­tio­nal, carefully pre­pared files ser­ving as a kind of know­ledge base that it was sup­po­sed to access at each step. To pro­vide this con­text, we pre­pared a series of files, inclu­ding over­ar­ching psy­cho­lo­gi­cal insights from past pro­jects, infor­ma­tion on the desi­red tar­get for­mats, and a tool­kit of useful crea­ti­vity tech­ni­ques. We also uploa­ded all 630 pages of our book “Wie Design wirkt”.

The results were interesting: When it came to generating hypotheses and ideas, there were only very minor differences from the first round. Both had already been useful in the first round, especially the hypotheses. The generated ad designs were likewise very similar to those from the first round: too stereotypical, too illustrative. This was precisely where we had expected more. The explanations, on the other hand, were significantly more elaborate. The AI explained to us very competently why a generated ad image adhered to the design principles from “Wie Design wirkt,” which colors were deliberately chosen, and which principles of effectiveness were applied. It explained which creativity technique it had used and what joke underlies a particular motif. The only problem was: none of this was visible in the designs themselves.

For exam­ple, when it was explai­ned that blue was cho­sen to con­vey “trust, cla­rity, and ratio­na­lity,” com­ple­men­ted by green as a “glim­mer of hope” and orange as a sign of “warm, active energy,” we sear­ched in vain for the colors green and orange in the actual image that was gene­ra­ted. We were told why a design wasn’t over­loa­ded and that it con­veyed its mes­sage in 3 seconds (that was men­tio­ned some­where in the con­text files). The image its­elf, howe­ver, was just as over­loa­ded as in the first round, and we still had to puz­zle over the message.

What becomes clear here is fascinating: AI recognizes patterns in the uploaded texts (e.g., from our book) and can elegantly expand on them to craft sentences that sound competent. However, this is neither practical nor applicable knowledge: AI can generate new sentences from this data, but it cannot put the knowledge to use.

We have pro­ba­bly rea­ched the most fun­da­men­tal limit of AI: the lack of (sen­sory-phy­si­cal) “under­stan­ding.” We learn from this that even addi­tio­nal con­tex­tual know­ledge doesn’t help, and we can save our­sel­ves the effort (and it was really a lot of work!): The AI can sum­ma­rize our book very well in text form, but it doesn’t learn from it in the sense that it can apply this know­ledge to prac­ti­cal tasks.

Where AI can actually help us—and where it can’t

As this expe­ri­ment shows, our approach pro­ba­bly won’t result in signi­fi­cant time savings. It is cer­tainly hel­pful to use AI to gene­rate initial hypo­the­ses. The key here is to let the AI deve­lop hypo­the­ses only after you’ve come up with some yours­elf, so as not to be influen­ced. Howe­ver, it still doesn’t replace skim­ming through user forums, where you can get a good sense or an “idea” and often even a design con­cept for what might be important to con­su­mers regar­ding a spe­ci­fic offe­ring, which pro­mi­ses should be high­ligh­ted, and which design flaws should be best concealed.

AI can also con­tri­bute the occa­sio­nal sug­ges­tion when it comes to deve­lo­ping ideas. We’ve been using AI to gene­rate sup­ple­men­tary sug­ges­ti­ons for tag­li­nes and naming for quite some time now. AI-gene­ra­ted ideas for images can also serve as inspi­ra­tion, even if it’s just the odd sug­ges­tion here and there—it’s defi­ni­tely a valuable addi­tion. Howe­ver, you can’t skip the dis­cus­sion and eva­lua­tion of the designs by real cus­to­mers. Even if the AI wasn’t com­ple­tely off the mark on average, that’s not enough for a relia­ble recom­men­da­tion. The trans­fer to, for exam­ple, a dif­fe­rent medium or a dif­fe­rent tar­get audi­ence can also serve as inspi­ra­tion; but it’s not enough to leave it to the AI alone.

Even during targeted and meticulously controlled design development with AI, the AI sometimes acts like a stubborn mule. What is clear to a human designer who has been given a briefing is sometimes misunderstood by the AI, or not understood at all. If the result happens to be immediately usable, which was rarely the case in this experiment, AI designs can also be included in the labs as supplementary material. However, it is not enough to simply let the AI design with a free hand, so to speak. Even the combination of LLM and a knowledge database (our great hope) does little to change this. For the most part, therefore, our AI-supported research and design process will continue just as we have handled it so far.

But the best out­come of our expe­ri­ment is this: we now have a much clea­rer under­stan­ding of where AI can be usefully applied in the process—and where it cannot—so neither hype nor exces­sive skep­ti­cism will be able to throw us off course in the future. Isn’t “AI com­pe­tence” pre­cis­ely about kno­wing where AI can­not be usefully applied—and under­stan­ding why?

Howe­ver, the results may dif­fer for other tasks and pro­ces­ses, so our expe­ri­ment can likely only be gene­ra­li­zed to a limi­ted extent.

Our con­clu­sion

AI is par­ti­cu­larly hel­pful in our pro­cess when it comes to gene­ra­ting plau­si­ble hypo­the­ses, addi­tio­nal ideas, and inspi­ra­tion. Howe­ver, it can­not replace real cus­to­mer inter­ac­tion, relia­ble eva­lua­tion, or the crea­tive inter­pre­ta­tion of com­plex mes­sa­ges. Addi­tio­nal know­ledge data­ba­ses pri­ma­rily improve the reaso­ning behind decis­i­ons, not the prac­ti­cal outcomes.

Posts from this blog on simi­lar topics:

3 years ago: Our 2023 experiment
