Mar 10, 2024

Measuring the experience of AI


Finding emergent relevance during interaction

Image generated with DALL·E 2

If AI-driven content is going to dominate our digital experience, it has already been doing so for the past two decades. I have witnessed this firsthand, having worked on search engines and then researching and designing for users who are overwhelmed by content. Automatic delivery has enabled scaling, surrounding us with personalised product recommendations, tailored ads, customised feeds, and even more generated content. How can we manage the user experience in this context?

This is not a new challenge, but the scale is growing and, like other disciplines, we in the UX world are scratching our heads, going through a multifaceted mid-life crisis and asking: what does it mean to create value in an AI world?

What is special about UX for AI, after all?

It is a question that fascinates us. The UX community has been trying to come to grips with the notion of AI, how to design for it, how to position the role of researchers and designers in this space, how to stay relevant in a movement driven by a fascination with technology.

Our attempts to make sense of this question yield criteria such as control, transparency, safety, trust and ethics (see Jessa Anderson’s talk and Hal Wuertz’s article). Somehow these attempts reflect as much existential angst on our part as they do real issues that demand our attention. A Lacanian reading, in which desire is founded on the lack we long to satisfy, suggests that when we talk about trust, control and transparency, it’s the fear of their absence that alarms us. We fear the possibility that AI will be uncontrollable, untrustworthy, opaque, and unethical.

We’re in a strange relationship with the machine, and we’re struggling to come to terms with it.

I think we should still try to find a better foothold, so let’s dig deeper. Sure, these criteria are all worthy endeavours, and we need to put in place safeguards to deal with the threat (we’ve done it before with industrial and pharmaceutical regulation). But stopping at this level locks us into a reactive, protective state. Let’s find the organising principles of AI and take a more active, productive stance.

A new mode of interaction: negotiating relevance

Looking at the basics of how AI works gives us a good starting point. Machine learning (ML), the latest wave of AI, involves training high-dimensional statistical models to fit complex, real-world phenomena. It’s about modelling signals and patterns, which helps with classification, prediction and content generation. So here’s a simple typology of ML applications:

- Classification, which involves determining the category of a given input. For example, is this a picture of a cat or a fluffy loaf of bread?
- Pattern recognition and prediction, which identifies trends or forecasts future data points based on historical data. It’s like having a crystal ball, but for mundane things like weather forecasts and traffic jams.
- Content generation, which is an extension of prediction: new content is generated based on learned sequences and patterns. This could be text, images, music, speech, control signals (e.g. for robots), virtual environments and social interaction.
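To make this typology a bit more concrete, here is a minimal sketch in Python (illustrative only, using scikit-learn and made-up toy data, not anything from a real product) of how the first two categories map onto familiar model types; generation is left as a comment, since even a toy generator would not fit in a few lines.

```python
# Minimal, illustrative sketch of the first two ML application types.
# Assumes scikit-learn is installed; the data is toy data invented for this example.
from sklearn.linear_model import LogisticRegression, LinearRegression

# 1. Classification: determine the category of a given input (cat vs. fluffy loaf of bread).
X_images = [[0.9, 0.1], [0.8, 0.3], [0.2, 0.9], [0.1, 0.8]]  # toy feature vectors
y_labels = ["cat", "cat", "bread", "bread"]
classifier = LogisticRegression().fit(X_images, y_labels)
print(classifier.predict([[0.85, 0.2]]))  # -> ['cat']

# 2. Pattern recognition and prediction: extrapolate future data points from historical ones.
X_hours = [[7], [8], [9], [10]]
y_cars = [120, 340, 410, 280]  # toy traffic counts per hour
predictor = LinearRegression().fit(X_hours, y_cars)
print(predictor.predict([[11]]))  # a (crude) forecast for the next hour

# 3. Content generation extends prediction: a language model repeatedly predicts
#    the next token given the previous ones until a whole text is generated.
```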

The common theme across these use cases is their dynamic, content-rich nature. For users, this means that interacting with AI is likely to involve exposure to recommendations, analysis, trends, curated material, simulated conversations, simulated environments and automated decisions.

A key category to consider, then, is relevance. If using AI applications involves so much dynamically curated and generated content that fits my needs (or, technically, my signal), is that fit really good enough? Is it relevant?

Hal Wuertz’s recent article came close to this notion by highlighting that our relationship with AI differs from traditional applications in that the purpose and content are not fixed, but in flux. Both the AI and the user change as they interact. Adrian Zumbrunnen makes a similar point when he talks about the need for intelligent interaction to match the context of use. I’d add that, at any given moment, we should be asking: is this interaction or content relevant to the user? Does it match their expectations, hopes, problems and desires?

This may seem elementary, and it is: it’s a fundamental question of usefulness. But the catch is that in our current UX paradigm, we’re used to settling the usefulness question early in the product lifecycle, in early product discovery, market research, exploratory research, and product-market fit.

What we’re not used to, and here’s the kicker, is that relevance is (increasingly) something that an AI application would strive to achieve dynamically, during the course of its use. To deal with this, we need to approximate the real end experience in real contexts, not just for one particular interaction flow, but for a range of possible flows, and identify how our users negotiate relevance with AI during use (see Zhaochang He’s piece on UX design for conversational interactions). Lab settings, wireframes and static prototypes may become less useful in providing a reliable signal, and we need more than ever to test with the real thing (e.g. testing with real data, role-playing, Wizard of Oz).

In fact, it’s the transformation of relevance from a static pre-use design goal to an emergent outcome of interaction that explains our concern with trust, control, ethics, contestability and transparency (see how Ericsson positions ‘competence’, similar to relevance, as a cornerstone of trust). These are safeguards against anti-relevant (or harmful) outcomes, and ways of providing an environment in which users can negotiate their way to relevance.

How do we approach relevance in our UX practice?

There are many answers to this question, spanning UX frameworks, methods, job descriptions, strategy exercises and more.

I’ll just pick one. Having addressed the macro-level question of relevance, I’d like to visit the other end and make a practical, micro-level contribution. I’ll show how we’ve measured UX in an AI-driven content recommender, taking relevance into account.

Over the past year, we’ve been developing an application that curates content according to users’ interests. We had already done a round of discovery research and prototyping that highlighted the importance of emergent relevance and context in use. So we wanted to test whether it really worked for users, with real production data, with real user profiles and preferences. We decided to test for the following criteria, all of which had emerged from the early discovery research:

- Content relevance: How interesting do our users find the content?
- Trust: How controllable, understandable, and transparent does it feel?

I set out to design a tool to measure these aspects. The result was an attitudinal questionnaire that can be integrated into user testing, whether moderated or not. It’s an evaluative tool that can also be used in an exploratory or generative way when administered in a moderated, open-ended setting.

In the early 2000s, I had some experience in implementing and evaluating information retrieval and search engines (Halabi, Islim & Kurdy, 2010). The convention at the time was to rely on formal performance metrics (such as precision and recall) to tell whether a retrieval technique was good, measured against standard benchmark datasets. Over time, algorithms improved to the point where the formal quality of retrieval and recommendation became less of an issue. As a result, formal metrics became less useful in predicting actual user adoption, and the user experience played an increasingly important role.

Despite this growing recognition of the importance of user experience (see reading list), I was surprised by the lack of robust, validated tools for measuring UX on retrieval platforms. Indeed, recent surveys still report an over-reliance on formal metrics (e.g., Bauer, Zangerle & Said, 2023). This resonated with how we worked in my organisation, where ML engineers provided recommendations as a black-box service. It was a simple, classic case where we needed more rapid prototyping with real data, not only to test UI and interaction, but also to test how different algorithms contributed to the experience.

Fortunately, Pu, Chen & Hu (2011) addressed a similar problem in UX measurement. They developed a validated questionnaire based on psychometric modelling and factor analysis, and showed how certain experience factors, such as relevance, explanation and variety, contributed to satisfaction and intention to use.

This was a good starting point; these categories fit very well with the factors we wanted to measure. Using this framework, we were able to explain the intention to use we measured by linking it to different aspects of the UX of the AI-driven solution. Relevance here corresponds to recommendation accuracy in Pu et al.’s model (Figure 1).

Figure 1. Structural model fit, used with permission of the publisher (ACM), from Pu, Chen & Hu (2011)

With the help of my colleagues on the product team (all of whom have long experience in search and recommendation), I adapted Pu, Chen & Hu’s model. I consolidated similar constructs that show a strong correlation in the original article, while preserving the causal relationships to maintain explanatory power. Specifically:

- I merged Explanation and Transparency.
- I merged Use Intention and Purchase Intention into Use & Convergence.
- I merged Interface Adequacy and Ease of Use.
- I merged Interaction Adequacy and Control.

In addition, we integrated Recommendation Timeliness into the model, as we had found it to be an important factor in our earlier research. Fortunately, this notion was also validated in a later study (Chen et al., 2019). See the resulting causal model that I adapted (Figure 2).

Figure 2. Our simplified causal model for measuring UX in AI-driven applications
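If you want to work with the model programmatically, one convenient form is a simple edge list. The sketch below is only an approximation: the exact set of arrows is my reading of the merges described above combined with Pu, Chen & Hu’s original structure, not a literal transcription of Figure 2.

```python
# Approximate encoding of the adapted causal model as "outcome: upstream factors".
# The edge set is an assumption reconstructed from the construct merges described
# in the text and from Pu, Chen & Hu's original model; adjust it to your own model.
CAUSAL_MODEL = {
    "Use & Convergence": ["Usefulness", "Confidence & Trust"],
    "Confidence & Trust": ["Transparency", "Control", "Recommendation Relevance"],
    "Usefulness": [
        "Recommendation Relevance", "Recommendation Novelty", "Recommendation Diversity",
        "Recommendation Timeliness", "Information Sufficiency", "Ease of Use",
    ],
}
```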

Questionnaire items

For the final wording of the questionnaire, I took most of the wording from Pu, Chen & Hu (2011), with the exceptions noted below. All questions are measured on a 5-point Likert scale:

- Transparency: I understood why the items were recommended to me.
- Control: I found it easy to tell the system what I like/dislike.
- Recommendation Relevance: The items recommended to me matched my interests.
 — Follow-up: How many items from this list of recommendations would you investigate further?*
 Note: I’ve added this follow-up to get more granular details on relevance.
- Recommendation Novelty: The items recommended to me are novel.
- Recommendation Diversity: The items recommended to me are diverse.
- Recommendation Timeliness: The items recommended to me are timely.
- Information sufficiency: The information provided for the recommended items is sufficient for me to make a decision.
- Ease of use: How difficult or easy did you find it to use the system?*
 Note: I used the standard wording of the Single Ease Question (SEQ).
- Confidence & Trust: I am convinced and trust the items recommended to me.
- Usefulness: I feel supported in finding what I like with the help of the recommender.
- Use & Convergence: I would use this recommender often.
 Note: Depending on the end result you want to measure, you can reword this for convergence to action, recommendation, willingness to promote, etc.
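If you want to drop these items straight into a survey tool or an analysis notebook, one convenient encoding is shown below. The wording is copied from the list above; the anchor labels for the 5-point scale are my own assumption, since the exact anchors are not specified here.

```python
# The questionnaire items above, encoded for reuse. The anchor labels are assumed,
# not taken from the original instrument.
LIKERT_5 = ["Strongly disagree", "Disagree", "Neutral", "Agree", "Strongly agree"]

QUESTIONNAIRE = [
    ("Transparency", "I understood why the items were recommended to me."),
    ("Control", "I found it easy to tell the system what I like/dislike."),
    ("Recommendation Relevance", "The items recommended to me matched my interests."),
    ("Recommendation Novelty", "The items recommended to me are novel."),
    ("Recommendation Diversity", "The items recommended to me are diverse."),
    ("Recommendation Timeliness", "The items recommended to me are timely."),
    ("Information Sufficiency", "The information provided for the recommended items "
                                "is sufficient for me to make a decision."),
    ("Ease of Use", "How difficult or easy did you find it to use the system?"),  # SEQ wording
    ("Confidence & Trust", "I am convinced and trust the items recommended to me."),
    ("Usefulness", "I feel supported in finding what I like with the help of the recommender."),
    ("Use & Convergence", "I would use this recommender often."),
]
```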

I also wanted to compare users’ attitudes to our new platform with what they were already using, so I added the following benchmark section:

- What tools do you use to keep up with your areas of interest?
- Overall, how satisfied are you with your existing tools and practices for keeping up to date?
- In comparison, how satisfied are you with the recommender you’ve seen in this experiment for keeping up to date?
- Imagine you had a development budget of 5000 to support tools to help professionals keep up to date. Where would you invest it? Options:
 — The tools you currently use to keep up to date
 — Improve this recommender to replace the tool(s) you use
 — Other: elaborate

How to use this questionnaire?

The structural model provides an explanatory framework for interpreting the data (Figure 2). For example, if you observe low scores for Use & Convergence, you can trace this back to other poorly scored factors (e.g. was Usefulness poor because of low Relevance? Was Trust low due to low Transparency?). Use this reasoning to determine which factors need to be improved, and work on them with the ML engineers, down to the level of algorithm design.
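As a small illustration of that reasoning, here is a sketch of a diagnostic helper that walks the causal edges upstream from a weak outcome. The 3.0 threshold (the midpoint of the 5-point scale) and the example scores are arbitrary choices made for the example; in practice you would pass your own mean scores and the fuller CAUSAL_MODEL sketched after Figure 2.

```python
# Trace a weak outcome back to the upstream constructs that also scored poorly.
def trace_weak_factors(model: dict, scores: dict, outcome: str, threshold: float = 3.0) -> list:
    """Return upstream constructs of `outcome` whose mean Likert score falls below `threshold`."""
    weak = []
    for factor in model.get(outcome, []):
        if scores.get(factor, 5.0) < threshold:
            weak.append(factor)
            weak.extend(trace_weak_factors(model, scores, factor, threshold))  # keep following the chain
    return weak

# Example with made-up scores: Use & Convergence is low; find out why.
example_model = {
    "Use & Convergence": ["Usefulness", "Confidence & Trust"],
    "Confidence & Trust": ["Transparency", "Control"],
    "Usefulness": ["Recommendation Relevance", "Ease of Use"],
}
example_scores = {"Usefulness": 2.4, "Recommendation Relevance": 2.1,
                  "Ease of Use": 4.2, "Confidence & Trust": 3.8}
print(trace_weak_factors(example_model, example_scores, "Use & Convergence"))
# -> ['Usefulness', 'Recommendation Relevance']
```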

To improve the signal quality, we decided that it was best to integrate the questionnaire into the flow of a user test (and later into the actual pilot). The tested product can be at any level of fidelity, but it’s important that the content recommendations are real, as they are the core of the test.

Depending on the stage of your product, if you’re doing early testing or seeking qualitative feedback, you could do as we did and combine it with interviews or moderated testing, using it as a prompt to dig deeper qualitatively. In later stages, once you have a functional pilot, you can combine it with product analytics and run it as a non-moderated on-site survey to get an attitudinal signal that helps explain the behavioural data.

In our scenario, we planned a 3-phase approach to iterate and triangulate for greater confidence:

1. Initial test with a functional POC: moderated, qualitative, small-sample user testing, culminating with the questionnaire.
2. Pilot test with an alpha release: unmoderated, larger-sample test concluding with the questionnaire. This would give us a reliable signal of user attitudes.
3. Online, real-world evaluation with A/B testing: analytically measure actual behaviour, combined with the questionnaire, to correlate attitudinal responses with adoption, retention and churn (see the sketch below).
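The sketch below shows one way the attitudinal and behavioural signals could be joined in that third phase. It is illustrative only: the file names, column names and the choice of a 30-day retention flag are hypothetical, not details of our actual pipeline.

```python
# Illustrative join of questionnaire scores with a behavioural metric from the A/B test.
# Assumes pandas and scipy; "survey_responses.csv", "retention.csv" and the column
# names are hypothetical placeholders.
import pandas as pd
from scipy.stats import pointbiserialr

responses = pd.read_csv("survey_responses.csv")  # one row per participant, with construct scores
analytics = pd.read_csv("retention.csv")         # behavioural data exported from the pilot
df = responses.merge(analytics, on="user_id")

# Point-biserial correlation between an attitudinal score and a binary behavioural outcome.
r, p = pointbiserialr(df["retained_after_30d"], df["use_convergence"])
print(f"Use & Convergence vs. 30-day retention: r={r:.2f}, p={p:.3f}")
```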

What’s next?

I’m sharing the use case above as an example of how we can adapt our methods in a brave new AI world. Feel free to play with it, adapt it, develop your own methods and share what you come up with. If you have ideas or critique, please reach out — I’ll be happy to keep improving it with you, and to credit you in the process.

In the meantime, bear in mind a few caveats about what I’ve suggested in this piece.

First, there are well-known limitations when it comes to questionnaires and surveys. Surveys are attitudinal and depend on explicit statements made by participants, not their actions, which means that their link to actual behaviour is at best indirect. The questionnaire above is still useful, given the following measures:

- The factors and the explanatory model are based on several decades of research in the evaluation of recommender systems.
- We’ve come up with similar categories in our qualitative discovery research, so it fits well with our product context.
- We’ve always triangulated survey data with other signals: qualitative research in early product discovery, then analytics and A/B testing to generate real behavioural data once we have a working product.

Second, the constructs in the questionnaire are not exhaustive of all aspects relevant to UX in AI. We don’t yet have a strong notion of attractiveness, intrigue or serendipity. There’s nothing that captures social and collaborative experiences, or the role of peer validation in trust. It also doesn’t capture the role of significance or aesthetics in formulating an overall experience of relevance. We could further unpack the notion of relevance into other components such as context, embodiment, engagement, familiarity, personalisation, modes of being (e.g. exploration, goal-directed action, meditative use or entertainment). And the list goes on. You define the granularity and relevant factors according to your product needs.

Finally, I’ve admittedly framed all of this within the current paradigm where humans are users and machines are responsive servants. How long can we maintain this paradigm before it’s no longer useful? We don’t have to go too far before it collapses: the pace of development is fast, and by the time a product is released and tested, three more generations of LLMs will have been released. How can we speed up our UX research and design cycles? Could AI help us streamline these experiments? (Ironically, in this case, AI would be experimenting on humans).

Taking this further, what if AI changes the rules of the game, shifting the locus of power, with different modes of interaction that may even be hidden? Do we really have the tools and vocabulary for this? We already know how automated A/B testing of political campaigns with psychometric profiling was able to influence politics in the UK, effectively turning humans into components of computational politics; imagine what can be achieved with even higher levels of automation. The whole pipeline can now be automated, and we could be looking at fully automated digital environments, spun up, deployed and rapidly A/B tested by AI, in a wild game of survival of the fittest to extract value, whether financial, political or attentional. The question is not whether it’s happening, but to what extent. What does experience mean in this world?

Acknowledgements

I’d like to thank the wonderful product and engineering team for being true partners in research and design: Nikolaos Nanas, Evangelos Prachas, Christos Spiliopoulos, and Christoforos Varakliotis. Lewis Allison and Liz Immer gave me wise advice on experimental design.

Reading list

Good material on UX and AI:

- Learners event (2023): AI and UX Research
- Joël van Bodegraven (2018): Design principles for AI-driven UX
- Hal Wuertz (2023): Design for AI: What should people who design AI know?
- Zhaochang He (2018): UX Design for AI Products

Articles with validated instruments to measure UX in recommenders:

- Pu, Chen & Hu (2011): A user-centric evaluation framework for recommender systems.
- Chen et al. (2019): How Serendipity Improves User Satisfaction with Recommendations? A Large-Scale User Evaluation.
- Knijnenburg et al. (2012): Explaining the user experience of recommender systems.
- Knijnenburg, Willemsen & Kobsa (2011): A pragmatic procedure to support the user-centric evaluation of recommender systems.

Background material on measuring the performance and experience of recommenders:

- Pu, Chen & Hu (2012): Evaluating recommender systems from the user’s perspective: survey of the state of the art.
- Konstan & Riedl (2012): Recommender systems: from algorithms to user experience.
- Shani & Gunawardana (2011): Evaluating Recommendation Systems.
- Said et al. (2013): User-centric evaluation of a K-furthest neighbor collaborative filtering recommender algorithm.
- A great discussion on Quora: How do you measure and evaluate the quality of recommendation engines?
- Beel & Langer (2015): A Comparison of Offline Evaluations, Online Evaluations, and User Studies in the Context of Research-Paper Recommender Systems.
- Bauer, Zangerle & Said (2023): Exploring the Landscape of Recommender Systems Evaluation: Practices and Perspectives.

