SUPR-Q 2: SUPR-Q Harder

At SEEK we’ve been experimenting with the SUPR-Q (AKA the Standardized User Experience Percentile Rank Questionnaire). We ran a small trial through in-person usability research, then rolled it out to a full-scale study using an on-site Hotjar poll in February. That first study established a benchmark against which to measure changes over time. We’ve now completed our second full-scale study. To learn more about what the SUPR-Q actually is, and the first experiment, read the post from March here.

After our last study we hypothesised:

  1. Tool limitations may negatively impact scores 
  2. Drop off could be addressed by improving the survey design
  3. Even after addressing 1) and 2), the NPS would likely still be lower when measured as part of the SUPR-Q than as an individual NPS rating, and feedback would skew towards aesthetics given the questions asked.

For this SUPR-Q round we decided to continue using Hotjar, given some design improvements were possible and it did not require any additional developer effort. While still not perfect, this setup allowed us to explore the three hypotheses above.

 SUPR-Q implementation Feb 2018 vs. June 2018.

In the February round we had to present a horizontal scale vertically (as per the image above) and risk skewing responses negatively. In the June round we were able to display the scale from left to right.

Hypothesis 1 — Tool limitations negatively impacted scores

Since our last SUPR-Q study, Hotjar has removed one of the limitations: horizontal Likert scales can now be customised to 5 or 7 points.

Previously, we had no way to know how many people had selected a different rating to the one they intended due to the vertical representation of the scale. In this round, however, we no longer saw individuals selecting Strongly Disagree while writing contradictory positive free-text comments, e.g. Strongly Disagree to “I will likely return again” yet stating:

“I’m definitely going to come back to this site again!”

We also did not have respondents explicitly telling us: 

I accidentally hit Strongly Disagree when I meant Strongly Agree

Furthermore, between the February and June rounds we did not launch any customer-facing features, so we can compare the SUPR-Q scores and infer that positive changes are likely due to the implementation improvements. Every single score improved in this SUPR-Q study.

These two factors suggest that the new implementation has reduced errors in selecting an unintended response that had previously skewed the results negatively.

Hypothesis 2 — Drop off could be addressed by improving the survey design

Another improvement made in this round was to manually add (x of y) to the end of each question, so users knew what they were getting into, had a proxy progress indicator, and could see how long they had left.

We still had the limitation that we could not automatically advance once an answer was selected, and that we had to ask one question at a time rather than having a matrix rating as recommended.

In this round we had to run the poll for longer to get a similar number of responses. This may be because individuals could see they were going to be asked 9 questions and did not want to invest that much effort. However, once individuals opted to engage with the SUPR-Q questions, 38% went on to complete all questions, compared with 33% in the first study; the improved survey design reduced drop off. The response decay (cumulative drop-off between questions) was also lower in the second round.

 Response Decay 

This suggests that the new implementation has reduced drop off.
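To make the two drop-off measures concrete: completion rate is the share of starters who answer the final question, and response decay is the cumulative drop-off between the first question and each later one. The sketch below uses made-up per-question counts for a 9-question poll; the real Hotjar export figures are not shown in this post.

```python
# Hypothetical per-question response counts for a 9-question poll
# (illustrative only, not SEEK's actual data).
responses = [1000, 820, 760, 720, 690, 660, 640, 620, 600]

# Completion rate: share of starters who answered the final question.
completion_rate = responses[-1] / responses[0]

# Response decay: cumulative drop-off between Q1 and each question.
decay = [1 - n / responses[0] for n in responses]

print(f"completion: {completion_rate:.0%}")
for i, d in enumerate(decay, start=1):
    print(f"Q{i}: {d:.0%} dropped off")
```

Plotting `decay` for each round gives the response-decay comparison shown above: a flatter curve means fewer people abandoning between questions.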

Hypothesis 3 — NPS would likely still be lower when measured as part of the SUPR-Q than an individual NPS rating, and skew towards aesthetic feedback

While our NPS in this round was higher than in the February round, it was still lower than our previous highest NPS (measured as a single question). When reporting on this, as with last round, the business did ask questions about why it had dropped. We’ll continue to remind the business that changing the way you measure something impacts the results. Going forward we will only compare like-for-like (i.e. SUPR-Q-measured NPS) trends over time.

I still believe this is a more reliable, realistic NPS score, as the SUPR-Q questions may have given users pause to think about their likelihood to recommend by first making them consider trustworthiness, credibility, usability and appearance.
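For readers unfamiliar with the calculation: NPS is the percentage of promoters (ratings 9–10 on the 0–10 “likelihood to recommend” scale) minus the percentage of detractors (ratings 0–6). A minimal sketch, using illustrative ratings rather than our actual data:

```python
def nps(ratings):
    """Net Promoter Score: % promoters (9-10) minus % detractors (0-6)."""
    promoters = sum(1 for r in ratings if r >= 9)
    detractors = sum(1 for r in ratings if r <= 6)
    return round(100 * (promoters - detractors) / len(ratings))

# Illustrative cohorts (hypothetical numbers, not SEEK's data):
standalone = [10, 9, 9, 8, 8, 7, 10, 9, 6, 9]   # single-question NPS
supr_q     = [9, 8, 8, 7, 10, 6, 9, 7, 5, 8]    # asked within the SUPR-Q

print(nps(standalone), nps(supr_q))  # prints: 50 10
```

Because the score subtracts detractors from promoters, even a small shift of respondents from 9 down to 8 (a passive, counted in neither group) can move the headline number substantially, which is one reason a change in measurement context shows up so visibly.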

In this round, verbatim feedback from the SUPR-Q cohort was still skewed towards aesthetics, which we did not typically see in our NPS feedback. Again, this is likely due to explicitly making participants think about it, with 4 of the 8 questions being along these lines:

  • This website is easy to use.
  • It is easy to navigate within the website.
  • I found the website to be attractive.
  • The website has a clean and simple presentation.

What next? 

We will continue on our SUPR-Q journey. 

We will look into other tools to trial a single matrix / rating-style grid of questions. The hypothesis is that this would further improve the usability of the questionnaire and therefore reduce drop off. It may also improve the scores if the burden to answer is lower. Ideally we would do this before major changes are made to the site, so we can tell whether any score improvements come from the implementation rather than actual changes to the product.

After running this next experiment, it would be advisable to stick with the same implementation going forward (and to use it on our other products). If we keep varying the way we measure the SUPR-Q, we will be unable to compare and contrast different cohorts. The way you ask the SUPR-Q questions clearly influences the results.

Stay tuned for the next results — SUPR-Q 3 with a Vengeance....

Related Reading

Does the NPS tell us what users really mean?