Human-Centered Design Through AI-Assisted Usability Tests: Reality or Fiction?

Online UX research tools have made unmoderated usability testing increasingly popular. Letting participants complete a usability test at their own pace, without a moderator, has clear benefits. First, without a strict schedule or a moderator to coordinate, more participants can be recruited faster and at a lower cost. Your team also gets to see how users interact with your solution in their own environment, on their own devices. And it becomes much easier to overcome distances and time zones, so data can be gathered from around the world.

Forgoing a moderator has its disadvantages, though. The moderator adds flexibility and a human touch to usability testing. A moderator is usually present in the same (virtual) space as the participant and has a good sense of what is going on. They can react in real time to what they see and hear the participant do or say. A moderator can gently remind participants to voice their thoughts, and for the participant, speaking out loud in front of a facilitator can feel more natural than talking to themselves. The moderator can also ask the participant to elaborate on something they did.

A traditional unmoderated study lacks this flexibility. Participants are given a set of instructions to follow in order to complete the tasks, and afterwards they may be asked to fill out a static questionnaire. The feedback the research and design team receives depends entirely on the information participants choose to provide. The way instructions and questions are phrased in unmoderated testing is therefore crucial. Yet even when everything is well planned, the lack of adaptive questioning means that much information will remain unsaid, especially with people who are not trained to provide user feedback. A moderator could ask for more information when a participant misunderstands a question or does not answer it completely.

The question is: could AI handle something like this and upgrade unmoderated tests? Considering their current capabilities, generative AI could be a powerful tool for addressing this dilemma. Large language models (LLMs) can hold conversations that are almost humanlike. If LLMs were integrated into usability tests to enhance data collection by conversing with participants interactively, they could significantly improve researchers' ability to obtain detailed personal feedback from large numbers of people. This is also a good example of human-centered AI, since it keeps humans in the loop.

Research on AI in UX still has many gaps. UXtweak Research conducted a case study to investigate whether AI can generate meaningful follow-up questions and obtain valuable answers from participants. Asking follow-up questions to draw more detailed information out of participants is one of the moderator's many responsibilities. It is a sub-problem well suited to our evaluation, since it involves the moderator's ability to react in real time to the context of the conversation and to encourage participants to share the information that matters.

Experiment Spotlight: Testing GPT-4 in Real-Time Feedback

Our study focused on the principles behind AI solutions for unmoderated usability testing rather than on any specific commercial AI tool. AI models and prompts undergo constant tuning, so findings that are too narrow could become irrelevant a few weeks after a new version is released.
Since AI models are also black boxes built on artificial neural networks, the way they produce specific outputs is not transparent. Our results show what to be cautious of when evaluating whether an AI solution will actually provide value or cause harm.

Our study used GPT-4, which was the most recent OpenAI model at the time. It is capable of handling complex prompts, and in our experience it handles some prompts better than GPT-4o.

The experiment was a usability test of a prototype e-commerce website, with tasks based on the typical user flow for purchasing a product. (Note: for more information on the prototype, tasks, and questions, please see our article in the International Journal of Human-Computer Interaction.)

In this setting, we compared three conditions:

1. A static questionnaire of three predefined questions (Q1, Q2, Q3), which served as the baseline without AI. Q1 was an open-ended question asking participants to describe their experience during the task. Q2 and Q3 were non-adaptive follow-ups to Q1, asking participants more directly about usability issues and about what they disliked.
2. Question Q1 used as a seed to generate up to three GPT-4 follow-up questions in place of Q2 and Q3.
3. All three predefined questions (Q1, Q2, Q3) used as seeds for their own GPT-4 follow-ups.

A dedicated prompt was used for the follow-up question generation (an illustrative sketch of what such a step can look like appears after the results below).

To assess the impact of the AI follow-up questions, we compared the results both quantitatively and qualitatively. We analyzed the informativeness of the responses based on their ability to clarify new usability issues. Informativeness dropped significantly from the seed questions to their AI follow-ups: the follow-ups rarely identified a new problem, although they did provide more detail.

Participants' emotional reactions offer a different perspective on the AI-generated questions. We analyzed the emotional valence of the answers based on their wording. At first the answers had a neutral tone; then the sentiment shifted toward the negative. For the predefined Q2 and Q3 this could be seen as natural: the seed question Q1 was open-ended, asking participants to describe what they did, while Q2 and Q3 focused on negatives such as usability issues and other disliked aspects. Curiously, the follow-up questions drew more negative feedback than the seed questions, and not for the same reasons. Participants often expressed frustration when interacting with the GPT-4-driven follow-up questions.

This is important, because frustration with the test itself can cause participants to stop taking the usability test seriously, hinder meaningful feedback, and introduce bias. What frustrated participants most was redundancy. Repetition, such as having to explain the same usability problem again, was common. Predefined follow-ups yielded 27–28% repeated answers (participants had likely already mentioned the aspects they disliked when answering the open-ended Q1), while AI-generated follow-ups yielded about 21%. That is not a significant improvement, considering the comparison was against questions that could not adapt to prevent repetition at all. And when AI follow-ups were added to each predefined question to elicit more elaborate answers, the repetition rate rose to 35%.
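To make the follow-up generation step concrete, here is a minimal sketch of how a seed question and a participant's answer could be turned into an adaptive follow-up question. It assumes the OpenAI Python SDK and uses a placeholder system prompt of our own; it is not the prompt or pipeline used in the study.

```python
# Illustrative sketch only -- not the prompt or pipeline used in the study.
# Assumes the OpenAI Python SDK (v1.x) with OPENAI_API_KEY set in the environment.
from openai import OpenAI

client = OpenAI()

def generate_follow_up(seed_question, participant_answer, history=None):
    """Generate one follow-up question from a seed question and the participant's answer."""
    messages = [
        {
            "role": "system",
            "content": (
                "You assist with an unmoderated usability test of an e-commerce "
                "prototype. Ask one short, neutral follow-up question that clarifies "
                "the participant's answer. Do not lead the participant, do not repeat "
                "earlier questions, and do not ask for design suggestions or "
                "hypothetical future behavior."
            ),
        },
        *(history or []),  # earlier questions and answers in this group, if any
        {
            "role": "user",
            "content": f"Question: {seed_question}\nParticipant's answer: {participant_answer}",
        },
    ]
    response = client.chat.completions.create(model="gpt-4", messages=messages)
    return response.choices[0].message.content.strip()

# Hypothetical usage:
# generate_follow_up(
#     "Describe your experience while completing the task.",
#     "It was fine overall, but the checkout felt confusing.",
# )
```

As the results above suggest, getting a call like this to work is the easy part; whether the generated question is relevant and non-repetitive is what actually needs evaluating.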
Participants also rated the questions as less reasonable in the AI variants. Answers to the AI-generated questions contained far more statements along the lines of "I already said that", or complaints that the obviously AI-generated questions had overlooked their previous responses. This shows that some of the follow-up questions were not distinct enough, or lacked the direction, to warrant being asked.

Insights from the Study: Pitfalls and Successes

To summarize, AI-generated follow-up questions in usability tests have both strengths and weaknesses.

Strengths: Generative AI (GPT-4) excels at getting participants to refine their responses through contextual follow-ups, which can improve the depth of qualitative insights.

Challenges: Limited ability to uncover new issues beyond the predefined questions, and participants can become frustrated by repetitive or generic follow-ups.

Although extracting more elaborate answers is valuable, that benefit is easily overshadowed when the questions lack relevance and quality. Focusing on the AI can also inhibit participants' natural behavior and the relevance of their feedback.

In the next section, we discuss what to look out for when choosing an AI tool to help with unmoderated testing, or when implementing AI prompts or models for a similar purpose.

Recommendations for Practitioners

Context is key to the usefulness and effectiveness of follow-up questions. Most of the problems we found with the AI follow-up questions in our study come down to a lack of context. The list below is compiled from the actual mistakes GPT-4 made when generating questions for our study. It works well as a checklist, whether you are using an existing AI tool that interacts with participants or implementing your own system for unmoderated research. Use it to assess whether the AI models and prompts at your disposal can ask reasonable, context-sensitive follow-up questions before you trust them to interact with real participants.

Here are the types of context that matter:

General Usability Testing

The AI should reflect the standard principles of usability testing in its questions. This may seem obvious, but it is worth stating, because we ran into issues in our study related to this very context. For example, questions should not be leading, and participants should not be asked to make design suggestions or to predict their future behavior in entirely hypothetical scenarios.

Usability Testing Context

The goals of usability tests vary depending on the design stage, the business goals, or the features being tested. Each follow-up question, and the time participants spend answering it, is a valuable resource that should not be wasted on going off-topic. In our study, for example, we evaluated a prototype website with placeholder images standing in for products. When the AI asked participants what they thought of the fake products on display, the answers were of no use to us.

User Task Context

Follow-up questions should reflect the nature of your tasks, whether they are exploratory or goal-driven. When participants have freedom in how they act, follow-up questions can help uncover their motivations. If, on the other hand, your AI tool asks participants to explain why they performed an action the task required of them (e.g., placing the item they were supposed to buy into the cart), the tool will look foolish, and so will you for using it.

Design Context

Detailed information about the design being tested can be essential for ensuring that follow-up questions are reasonable. Follow-up questions should ask the participant for input that the design alone cannot answer, and the topics chosen should reflect the interesting aspects of the design. In our study, for example, the AI asked participants to explain why they believed a piece of information that was displayed prominently in the user interface, a question that was irrelevant in context.

Interaction Context

Interaction context is the actual context of the participant's actions and their consequences. It can include audio recordings of the participant's spoken thoughts as well as the video recording of their usability test. Including interaction context allows follow-up questions to build on information the participant has already provided and to clarify their decisions further. For example, if a participant fails to complete a task but still believes they achieved their goal, a follow-up question could investigate the cause.

Previous Question Context

Participants can find logical connections between their experiences even across questions that seem entirely different, especially since they do not know which question you will ask next. A skilled moderator might decide to skip a question the participant already answered as part of an earlier one and instead focus on clarifying details. AI follow-ups should be able to do the same, so that the testing does not become a tedious slog.

Question Intent Context

Participants often answer questions in a way that does not match their original intent, particularly when the question is open-ended. A follow-up question rephrased from a different angle can retrieve the desired information. Without knowing the intent, the AI may miss that a participant's response is technically valid yet addresses only the letter of the question, not its spirit. Clarifying the intent can help with this.

Ask whether the tool allows you to provide all of this contextual information directly (a minimal sketch of how such context could be passed to a model appears at the end of this section). If the AI has no source of context, implicit or explicit, it can only make unreliable and biased guesses, which leads to irrelevant, repetitive, and frustrating questions. Even when the AI tool is given the context (or you write the AI prompt yourself), that does not mean the AI will act as you expect; it may fail to apply the context correctly or to consider its implications. As our study demonstrated, even when conversation history was provided within a question group, there was still considerable repetition.

To test the contextual responsiveness of a particular AI model, simply converse with it while relying on context. This should not be difficult, since most human conversation is heavily shaped by context (spelling everything out would take far too long otherwise). Focusing on the different types of context is important for determining what the AI model is actually capable of.

The seemingly endless number of ways in which different types of context can combine may pose the greatest challenge for AI follow-up questions. A human moderator, for example, may decide to break the rules and ask a less open-ended question to get the information they need for their research goals, while understanding the trade-offs involved. In our study, we observed that when the AI asked follow-ups to open-ended seed questions too generically, without a significant shift in perspective, the result was frustration, repetition, and irrelevance. How well an AI model resolves conflicts between different types of context could be a reliable metric for the quality of AI follow-up question generators.

The researcher should also remain in control, since the tougher decisions that depend on the researcher's vision and understanding must stay in their hands. A combination of static questions and AI-driven ones, with complementary strengths and weaknesses, could be the key to unlocking richer insights.

When broader social implications are considered, contextual sensitivity becomes even more important. The overhyping of AI and the industry's trend-chasing have provoked a backlash among some people. AI skeptics are concerned about a variety of issues, including AI's usefulness, ethics, privacy, and environmental impact. Some usability test participants may be hostile toward AI or unwilling to accept it. It is therefore essential that any AI presented to participants comes across as both reasonable and useful. The principles of ethical research are as relevant as ever: data must be collected and processed only with the participant's consent and without infringing on their privacy (e.g., sensitive data must not be used to train AI models without permission).
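To make the checklist above more tangible, here is a minimal sketch of how these types of context might be assembled into instructions for a follow-up-generating model. The structure, field names, and wording are hypothetical illustrations of the idea, not the setup used in our study.

```python
# Illustrative sketch only: one possible way to gather the context types
# discussed above into instructions for a follow-up-generating model.
# All field names and wording are hypothetical.

def build_followup_instructions(study_goal, task_description, design_notes,
                                interaction_summary, prior_exchanges,
                                question_intent):
    """Assemble a context-rich system prompt for generating one follow-up question."""
    prior = "\n".join(f"Q: {q}\nA: {a}" for q, a in prior_exchanges) or "None yet."
    return (
        "You generate a single follow-up question in an unmoderated usability test.\n"
        "Follow standard usability-testing practice: do not lead the participant and "
        "do not ask for design suggestions or hypothetical future behavior.\n"
        f"Study goal (usability testing context): {study_goal}\n"
        f"Task (user task context): {task_description}\n"
        f"Design notes (design context): {design_notes}\n"
        f"Observed behavior (interaction context): {interaction_summary}\n"
        f"Questions already asked and answered (previous question context):\n{prior}\n"
        f"Intent of the current question (question intent context): {question_intent}\n"
        "Do not repeat anything already covered above. If there is nothing new worth "
        "clarifying, reply with NO_FOLLOW_UP."
    )

# Hypothetical usage:
instructions = build_followup_instructions(
    study_goal="Evaluate the purchase flow of an e-commerce prototype with placeholder products.",
    task_description="Goal-driven: find a specific product and complete the purchase.",
    design_notes="Product images are placeholders; shipping options appear on the cart page.",
    interaction_summary="The participant hesitated on the cart page before selecting shipping.",
    prior_exchanges=[("Describe your experience during the task.",
                      "It went fine, but the shipping options confused me.")],
    question_intent="Uncover usability issues the participant has not yet described.",
)
```

Whether a model actually honors instructions like these still has to be verified, which is exactly what the checklist above is for.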
Conclusion: What Is Next for AI in UX?

Is AI a game changer that could break down the barrier between moderated and unmoderated usability research? Maybe one day. The potential is definitely there. When AI follow-up questions work as intended, the results are exciting: participants become more talkative and clarify important details. To any UX researcher familiar with the feeling of wishing they could have asked just one more question, an automated solution to this problem may sound like a fantasy come true.

We should be cautious, however. Adding AI blindly, without testing and oversight, can introduce a host of biases. The relevance of follow-up questions depends on a variety of contexts, and to keep the research grounded in solid conclusions and intentions, humans must remain in control. The opportunity lies in a synergy with usability researchers and designers, whose ability to conduct unmoderated testing AI could greatly enhance.

Humans + AI = Better Insights

The best approach is likely a balanced one. As UX designers and researchers, we should learn to use AI to uncover insights. This article can serve as a starting point: it provides a list of the potential weak points of AI-driven techniques that should be monitored and improved.
