Using writing task manipulations online

Writing tasks are a common manipulation in behavioral research, used to get people to consider specific information, invoke a mindset, or change their emotional or other mental state.

In the old days of pencil-and-paper surveys, this was fairly straightforward: just by looking at each completed survey, it was easy to see whether the participant had completed the task.  However, quality control is in some ways both more difficult and more important in online studies.  Despite the fact that these kinds of tasks are widely used online, I have not seen any discussion of quality control in published research.  (Let me know if you’re aware of any such discussions and I’ll amend the post.)

Chong Yu and I did a large-scale online (MTurk) study in which participants were instructed to write about a time they experienced a specific emotion (or about a typical day, in the control condition).  We uncovered multiple potential problems.


Leaving the survey.

Mturkers are multi-taskers.  Kariyushi Rao has developed code for Qualtrics to track whether people leave the survey and how long they spend off the survey.
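I haven’t seen Rao’s code, but as a rough sketch, this kind of tracking can be built on the browser’s blur/focus events.  Everything below (the makeAwayTracker helper and its field names) is hypothetical and not Rao’s implementation; the clock is passed in as a parameter so the counting logic can be tested with a fake clock.

```javascript
// Hypothetical sketch (not Kariyushi Rao's code): count how often a
// participant leaves the page and how long they stay away in total.
// "now" is a function returning the current time in ms, injected so
// the logic can be tested with a fake clock.
function makeAwayTracker(now) {
  var tracker = { leaveCount: 0, awayMs: 0, leftAt: null };
  tracker.onLeave = function () {
    tracker.leaveCount++;   // one more departure
    tracker.leftAt = now(); // remember when they left
  };
  tracker.onReturn = function () {
    if (tracker.leftAt !== null) {
      tracker.awayMs += now() - tracker.leftAt; // accumulate time away
      tracker.leftAt = null;
    }
  };
  return tracker;
}

// In Qualtrics, the tracker would be wired to window blur/focus events
// and the totals saved as embedded data, along these lines:
//   var tracker = makeAwayTracker(function () { return Date.now(); });
//   jQuery(window).on("blur", tracker.onLeave);
//   jQuery(window).on("focus", function () {
//     tracker.onReturn();
//     Qualtrics.SurveyEngine.setEmbeddedData("awayMs", tracker.awayMs);
//   });
```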

Despite instructions in the survey not to leave, participants did navigate away from the survey window during the writing task: only 16% of participants never left the task at all.  (This was after our standard data cleaning: removing all duplicate IPs, incompletes, and failed attention checks.)

Of those that left the survey, the median number of times they left during the task was 2.  The median time away from the task was 24 seconds.  Note that in this task we asked participants to write for 5 minutes, but (other than an initial pre-test of 30 people) allowed them to continue the survey as soon as they were done (i.e. we did not enforce the time limit).


Poor quality responses.

Some responses were clearly noncooperative:

Working turks when some requester wants some free-form story about my life, yadda, yadda, yadda with an irritating character limit. But also a time pause pause too. I’m telling you, right now I wish I’d just ipsum’d this shit up already and found another survey to wait out the timer with. Seriously, enough with this.

Others were nonsensical:

When you ask students writing in English as an additional language what they … writers have been unfairly denied access to language feedback because of the very strong prohibition against editing, but the good news is that we can still- They’re also some of the most basic phrases you’ve likely been … well — just change it to something like “I’m really excited to meet you … Even if they respond with, “No, please, call me Bill,” they’ll … You can use it effectively with people you know well or work with (“How are we going to get more customers? These are some of my favorites that will help you get the year started in a positive way. … Harriet Tubman; “I find that the harder I work, the more luck I seem to have. … with family or friends, then the chances are you’re not going to be very happy. … by the things that you didn’t do than by the ones you did do.Why do kids remember song lyrics but not what they study for tests? … Community & Events … The encoding system partly depends on how well you’re paying attention to the new … I bet you can even smell the saltwater if you think about it hard. … If it didn’t, you used different strategies to try to “find” the date in your memory.


Offline writing.

We set Qualtrics to require a minimum of 300 characters, and the median essay was 593 characters.  We used JavaScript code (details in the Appendix below) to record the number of key presses in the text box.

Many people did not write in the textbox as instructed.  Approximately 18% had fewer keypresses than the number of characters in their essay.  The median number of key presses was 146, compared to a median of 414 characters in their essays.

There are multiple reasons this could happen.  One possibility is that they were worried about their browser crashing and losing the work, so they wrote it in a separate text editor and then copied and pasted it into the text box (despite our instructions not to).  These would be valid participants.

However, the other possibilities are a lot more problematic.  Participants might have written some text and then copied and pasted it to fill up the 300-character quota.  Or they could have simply found some text online and pasted it in.

Or, perhaps most problematic for researchers, participants might keep a document of sample essays written in the past and paste one in.  This would be largely undetectable, but would not have the intended psychological effects.
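The keypress comparison described above can be applied after the fact as a simple filter.  This is a minimal sketch; the field names (essay, finalCount) are illustrative, not the actual variable names from our survey.  Note that honest typists usually produce more key presses than characters (backspaces, arrow keys, corrections), so fewer key presses than characters is the suspicious direction.

```javascript
// Minimal sketch of a post-hoc paste check. Field names ("essay",
// "finalCount") are illustrative. A response is flagged when the
// participant pressed fewer keys than the essay has characters,
// which typing alone cannot produce.
function flagPastedResponses(responses) {
  return responses.filter(function (r) {
    return r.finalCount < r.essay.length;
  });
}
```

In our data, this simple comparison flagged roughly 18% of participants; how to treat the flagged responses is a separate question.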



We found some participants who seemingly copied and re-pasted their own text:

I remember when the last supervisor tried to have power over me, she didnt but she truly though she did. I was trapped as my supervisor was 1200 miles away and I had to deal with her on my own. I remember when the last supervisor tried to have power over me, she didnt but she truly though she did. I was trapped as my supervisor was 1200 miles away and I had to deal with her on my own.I remember when the last supervisor tried to have power over me, she didnt but she truly though she did. I was trapped as my supervisor was 1200 miles away and I had to deal with her on my own.I remember when the last supervisor tried to have power over me, she didnt but she truly though she did. I was trapped as my supervisor was 1200 miles away and I had to deal with her on my own.I remember when the last supervisor tried to have power over me, she didnt but she truly though she did. I was trapped as my supervisor was 1200 miles away and I had to deal with her on my own.I remember when the last supervisor tried to have power over me, she didnt but she truly though she did. I was trapped as my supervisor was 1200 miles away and I had to deal with her on my own.
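Essays like the one above can be caught automatically with a crude repetition heuristic, for example by checking how many of the sentences are distinct.  The function and the 0.5 threshold below are my own assumptions for illustration, not a validated procedure:

```javascript
// Crude repetition heuristic (illustrative, not validated): flag an
// essay when the ratio of distinct sentences to total sentences falls
// below a threshold. The 0.5 default threshold is an assumption.
function looksRepeated(essay, threshold) {
  threshold = threshold || 0.5;
  var sentences = essay.split(/[.!?]+/)
    .map(function (s) { return s.trim().toLowerCase(); })
    .filter(function (s) { return s.length > 0; });
  if (sentences.length < 2) return false; // too short to judge
  var unique = {};
  sentences.forEach(function (s) { unique[s] = true; });
  return Object.keys(unique).length / sentences.length < threshold;
}
```

A heuristic like this would catch verbatim re-pasting but not paraphrased padding, so it complements rather than replaces manual reading.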

We also found evidence of participants Googling a keyword from the task and pasting in the results.

Here’s one pasted from Wikipedia by someone asked to write about a time they were sad:

Harrison began writing “So Sad” in New York in 2018 about the failure of his first marriage, to john. The lyrics present a stark winter imagery that contrasts with the springtime optimism of his Beatles composition “Here Comes the Sun”. Harrison recorded his version of the song during a period of romantic intrigue surrounding his marriage and those of fellow musicians Ron Wood and Ringo Starr. The main recording session took place at Harrison and Boyd’s home, Friar Park, in November 2017, eight months before she left him for Eric Clapton. Aside from Harrison’s extensive contributions on vocals, guitars and keyboards, the musicians on the recording include Starr, donna  Hopkins and Jim smith.

Here’s one from someone asked to write about a time they were angry, also taken from an online site:

Anger is an emotional reaction that impacts the body. A person experiencing anger will also experience physical conditions, such as increased heart rate, elevated blood pressure, and increased levels of adrenaline and noradrenaline. Some view anger as an emotion which triggers part of the fight or flight brain response

One more respondent who took a fairly abstract approach to sad experiences:

Neuroimaging investigations of the regulation of affect have typically examined activation patterns occurring during effortful attempts at regulating sad mood (e.g., cognitive reappraisal), and have documented activations in subregions of the frontal lobe, including ventro- and dorsolateral prefrontal cortex, as well as in the anterior cingulate cortices

This one’s a bit trickier to detect, from someone who got help describing a time they felt grateful.

I had an old friend, a really old friend named Mimi Gregg, who was about 50 years older than me. She came to this tiny town in Alaska in 1946 with her husband, two babies and her mother, a former opera diva named Madam Vic. The Greggs had bought the old Army barracks sight unseen and hoped to make it an artists’ colony/tourist destination. But the plan never panned out, so with no money or jobs, Mimi’s artist husband had to learn to hunt and fish and make furniture. But Mimi and her husband also entertained themselves and their new Alaskan friends in those pre-TV days with plays, dances and costume parties. Mimi always had most of our neighborhood over for Thanksgiving, baking soft buttery rolls in her wood-burning cookstove while a tape of La Bohème swelled in the background. Mimi lived well into her 90s and I never once heard her pine for the good old days or wish that her life was something other than it was.

That said, relatively few (about 13% of the responses that had insufficient key presses) were clearly pasting online content.  Most of those with insufficient key presses seemed to describe personal experiences, some in vivid detail, although others more superficially.


So what to do?

One conclusion, which I don’t think is right, is that online responses are just low quality.  A lot of research has documented the fairly high quality of online data (e.g., Paolacci et al. 2010; Goodman, Cryder and Cheema 2012; Casler, Bickel and Hackett 2013; Paolacci and Chandler 2014; Litman, Robinson and Rosenzweig 2015; Goodman and Paolacci 2017).  In any case, I don’t have essays collected offline to compare the online essays to, so I can’t draw any such conclusion at this time.

Another perspective is that we should think of these interventions as basically an “intent to treat” design.  In other words, people are randomly assigned to conditions but the intended treatment (in this case, recalling an emotional experience) is only successfully implemented for some people, for reasons that may be of their own choosing (i.e., endogenous).

You could even argue that it therefore doesn’t matter much: if research using this approach finds a significant effect, it does so despite the loss of statistical power that comes from some people not actually receiving treatment (as long as everyone is included in the analysis).  This “conservative test” approach is not entirely satisfying, because it assumes that we won’t try to learn anything from null effects.  In particular, it is problematic for any attempts at replication, since a non-replication can be explained away as noncompliance and therefore theoretically irrelevant.  It is also problematic for the use of control tasks: was a null effect found in the control condition because of the content of the control writing task, or because of the level of compliance with it?  Finally, it raises the question of whether low-powered studies that do find effects under such conditions are using better practices, targeting more cooperative populations, or capitalizing on chance (based on a “lucky” sample with few non-compliers).

The opposite approach is to do heavy data cleaning: removing essays that are too short, nonsensical, and so on.  I lean in this direction, but it is also problematic.  First, to be valid, the cleaning must be done systematically (and the procedures need to be blinded), to avoid biasing the results by (unintentionally) judging essays that go against the hypothesis as more invalid than other participants’ essays.

Furthermore, there is a risk that exclusions can seriously bias the results (e.g., see Aronow, Baron and Pinson 2016; Zhou and Fishbach 2016).  The primary concerns are (a) that people excluded for “bad responses” may differ systematically from the non-excluded in ways that are relevant to the hypothesis being tested, and (b) that exclusion rates may differ across conditions, resulting in imbalance (e.g., on average, more conscientious people in the condition in which the writing task is more difficult).

Perhaps the best approach is to think in terms of robustness — analyzing the data using several different procedures for handling bad essay responses, to see how sensitive the results are.  In any case, the seeming lack of discussion of these issues in papers that use these methods is concerning.  Even if we don’t have agreed-upon best practices, it would be helpful to have full information about the practices that were used when reading a given paper.
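One way to operationalize this robustness idea is to re-run the same analysis under several exclusion rules and compare the results.  The sketch below is illustrative only: the field names (essay, finalCount, condition, outcome) and the particular rules are my assumptions, and a real analysis would compare full model estimates, not just condition means.

```javascript
// Illustrative robustness check: compute per-condition means of some
// outcome under several exclusion rules. Field names and rules are
// hypothetical, not an actual analysis pipeline.
function conditionMeans(responses) {
  var sums = {}, counts = {};
  responses.forEach(function (r) {
    sums[r.condition] = (sums[r.condition] || 0) + r.outcome;
    counts[r.condition] = (counts[r.condition] || 0) + 1;
  });
  var means = {};
  Object.keys(sums).forEach(function (c) {
    means[c] = sums[c] / counts[c];
  });
  return means;
}

var exclusionRules = {
  none: function (r) { return true; },                        // keep everyone
  minLength: function (r) { return r.essay.length >= 300; },  // enforce minimum
  typedInBox: function (r) { return r.finalCount >= r.essay.length; } // no paste deficit
};

function robustnessTable(responses) {
  var table = {};
  Object.keys(exclusionRules).forEach(function (name) {
    table[name] = conditionMeans(responses.filter(exclusionRules[name]));
  });
  return table;
}
```

If the estimates are similar across rules, the conclusion does not hinge on any one cleaning decision; if they diverge, that divergence itself is worth reporting.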

Finally, these problems point to a need for a better understanding of these research practices, and more exploration of alternatives.  As one example, it may be possible to instead ask participants to hand-write their essays and post a dated photo of what they wrote (although this would make it difficult to analyze the text data).


Appendix: Code for tracking the number of keypresses in Qualtrics using JavaScript:

Step 1: In Survey Flow, define the embedded data fields “strokeCount” and “FinalCount”.
Step 2: In the text box question’s JavaScript (inside the addOnload function), add this code, replacing “QIDNUM” with your question’s ID:
  // keypress counter
  var PressCount = 0;
  var currentQuestionID = this.getQuestionInfo().QuestionID;
  jQuery(document).on("keydown", function (event) {
    if (currentQuestionID == 'QIDNUM') {
      PressCount++; // count this keydown
      Qualtrics.SurveyEngine.setEmbeddedData('strokeCount', PressCount);
    }
  });


Step 3: In javascript on the next page, put this code:
  // record the counter value
  var PressCount = "${e://Field/strokeCount}";
  Qualtrics.SurveyEngine.setEmbeddedData('FinalCount', PressCount);
(Step 3 is necessary because for some reason strokeCount continues updating on subsequent pages).