Using writing task manipulations online

Writing tasks are a commonly used manipulation in behavioral research: they can get people to consider specific information, invoke a mindset, or change participants' emotional or other mental state.

In the old days of pencil-and-paper surveys, this was fairly straightforward: just by looking at each completed survey, it was easy to see whether the participant had completed the task.  However, quality control is in some ways both more difficult and more important in online studies.  Although these kinds of tasks are widely used online, I have not seen any discussion of quality control in published research.  (Let me know if you're aware of any such discussions and I'll amend the post.)

Chong Yu and I ran a large-scale online (MTurk) study in which participants were instructed to write about a time they experienced a specific emotion (or about a typical day, in the control condition).  We uncovered multiple potential problems.


Leaving the survey.

MTurkers are multi-taskers.  Kariyushi Rao has developed code for Qualtrics that tracks whether people leave the survey and how long they spend away from it.

Despite instructions in the survey not to leave, participants did navigate away from the survey window during the writing task.  Only 16% of participants did not leave the task at all.  (This was after our standard data cleaning: removing all duplicate IPs, incompletes, and failed attention checks.)

Of those who left the survey, the median number of times they left during the task was 2, and the median time away from the task was 24 seconds.  Note that we asked participants to write for 5 minutes, but (other than in an initial pre-test of 30 people) allowed them to continue the survey as soon as they were done (i.e., we did not enforce the time limit).
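Rao's Qualtrics code isn't reproduced here, but the basic mechanism, the browser's visibilitychange event, can be sketched as follows. This is a minimal sketch of the idea, not her implementation: the event-log format and the summarizing function are my own invention.

```javascript
// Tally how many times a participant left the page and the total time away,
// from a log of visibilitychange events. Each event records whether the page
// became hidden and a timestamp in milliseconds (hypothetical format).
function summarizeVisibility(events) {
  var leaveCount = 0, timeAway = 0, leftAt = null;
  events.forEach(function (e) {
    if (e.hidden) {
      leaveCount++;                 // page became hidden: participant left
      leftAt = e.time;
    } else if (leftAt !== null) {
      timeAway += e.time - leftAt;  // page visible again: add time away
      leftAt = null;
    }
  });
  return { leaveCount: leaveCount, timeAway: timeAway };
}

// In the browser, the events would come from
// document.addEventListener("visibilitychange", ...), and the summary would
// be saved with Qualtrics.SurveyEngine.setEmbeddedData(...).
var log = [
  { hidden: true,  time: 1000 },   // left the survey
  { hidden: false, time: 13000 },  // came back 12 seconds later
  { hidden: true,  time: 20000 },
  { hidden: false, time: 32000 }
];
console.log(summarizeVisibility(log)); // { leaveCount: 2, timeAway: 24000 }
```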


Poor quality responses.

Some responses were clearly noncooperative:

Working turks when some requester wants some free-form story about my life, yadda, yadda, yadda with an irritating character limit. But also a time pause pause too. I’m telling you, right now I wish I’d just ipsum’d this shit up already and found another survey to wait out the timer with. Seriously, enough with this.

Others were nonsensical:

When you ask students writing in English as an additional language what they … writers have been unfairly denied access to language feedback because of the very strong prohibition against editing, but the good news is that we can still- They’re also some of the most basic phrases you’ve likely been … well — just change it to something like “I’m really excited to meet you … Even if they respond with, “No, please, call me Bill,” they’ll … You can use it effectively with people you know well or work with (“How are we going to get more customers? These are some of my favorites that will help you get the year started in a positive way. … Harriet Tubman; “I find that the harder I work, the more luck I seem to have. … with family or friends, then the chances are you’re not going to be very happy. … by the things that you didn’t do than by the ones you did do.Why do kids remember song lyrics but not what they study for tests? … Community & Events … The encoding system partly depends on how well you’re paying attention to the new … I bet you can even smell the saltwater if you think about it hard. … If it didn’t, you used different strategies to try to “find” the date in your memory.


Offline writing.

We set Qualtrics to require a minimum of 300 characters, and the median essay was 593 characters.  We used JavaScript code (details in the appendix below) to record the number of key presses in the text box.

Many people did not write in the text box as instructed.  Approximately 18% had fewer keypresses than the number of characters in their essay.  Among those participants, the median number of key presses was 146, compared to a median of 414 characters in their essays.
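As a rough illustration (a sketch of the idea, not the exact procedure we used, and with hypothetical field names), the comparison can be automated by flagging any record whose keypress count falls short of its character count:

```javascript
// Flag essays with fewer recorded keydown events than characters:
// a sign that text was pasted in rather than typed.
function flagLikelyPasters(records) {
  return records.map(function (r) {
    return {
      essay: r.essay,
      keyPresses: r.keyPresses,
      likelyPasted: r.keyPresses < r.essay.length
    };
  });
}

var sample = [
  { essay: "I remember a day when...", keyPresses: 30 },    // typed, with edits
  { essay: "A long pasted passage of text", keyPresses: 3 } // likely pasted
];
var flagged = flagLikelyPasters(sample);
console.log(flagged[0].likelyPasted, flagged[1].likelyPasted); // false true
```

Note that typed essays usually require more keypresses than characters (backspaces, arrow keys), so this threshold is conservative.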

There are multiple reasons this could happen.  One possibility is that they were worried about their browser crashing and losing the work, so they wrote it in a separate text editor and then copied and pasted it into the text box (despite our instructions not to).  These would be valid participants.

However, the other possibilities are a lot more problematic.  Participants might have written some text and then copied and pasted it repeatedly to fill up the 300-character quota.  Or they could simply have found some text online and pasted it in.

Or, perhaps most problematic for researchers, participants might keep a document of sample essays they have written in the past and paste one in.  This would be largely undetectable, but would not have the intended psychological effect.



We found some responses that seemingly copied and re-pasted the same text:

I remember when the last supervisor tried to have power over me, she didnt but she truly though she did. I was trapped as my supervisor was 1200 miles away and I had to deal with her on my own. I remember when the last supervisor tried to have power over me, she didnt but she truly though she did. I was trapped as my supervisor was 1200 miles away and I had to deal with her on my own.I remember when the last supervisor tried to have power over me, she didnt but she truly though she did. I was trapped as my supervisor was 1200 miles away and I had to deal with her on my own.I remember when the last supervisor tried to have power over me, she didnt but she truly though she did. I was trapped as my supervisor was 1200 miles away and I had to deal with her on my own.I remember when the last supervisor tried to have power over me, she didnt but she truly though she did. I was trapped as my supervisor was 1200 miles away and I had to deal with her on my own.I remember when the last supervisor tried to have power over me, she didnt but she truly though she did. I was trapped as my supervisor was 1200 miles away and I had to deal with her on my own.
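Copy-and-re-paste of this kind is easy to screen for mechanically. As a sketch (not our actual cleaning procedure), one can check whether a long prefix of the essay reappears later in the text:

```javascript
// Detect essays built by repeating a block of text: returns true if the
// first `chunkSize` characters occur again later in the essay.
function looksRepeated(essay, chunkSize) {
  chunkSize = chunkSize || 50;
  if (essay.length < chunkSize * 2) return false;  // too short to repeat
  var head = essay.slice(0, chunkSize);
  return essay.indexOf(head, chunkSize) !== -1;
}

var pasted = "I remember when the last supervisor tried to have power over me. " +
             "I remember when the last supervisor tried to have power over me. ";
var original = "A genuinely original essay about a sad day at work that goes " +
               "on at some length without ever repeating itself in this way.";
console.log(looksRepeated(pasted), looksRepeated(original)); // true false
```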

We also found evidence of participants Googling a key word from the task and pasting in what they found.

Here’s one pasted from Wikipedia by someone asked to write about a time they were sad:

Harrison began writing “So Sad” in New York in 2018 about the failure of his first marriage, to john. The lyrics present a stark winter imagery that contrasts with the springtime optimism of his Beatles composition “Here Comes the Sun”. Harrison recorded his version of the song during a period of romantic intrigue surrounding his marriage and those of fellow musicians Ron Wood and Ringo Starr. The main recording session took place at Harrison and Boyd’s home, Friar Park, in November 2017, eight months before she left him for Eric Clapton. Aside from Harrison’s extensive contributions on vocals, guitars and keyboards, the musicians on the recording include Starr, donna  Hopkins and Jim smith.

Here’s one from someone asked to write about a time they were angry, also taken from an online site:

Anger is an emotional reaction that impacts the body. A person experiencing anger will also experience physical conditions, such as increased heart rate, elevated blood pressure, and increased levels of adrenaline and noradrenaline. Some view anger as an emotion which triggers part of the fight or flight brain response

One more respondent who took a fairly abstract approach to sad experiences:

Neuroimaging investigations of the regulation of affect have typically examined activation patterns occurring during effortful attempts at regulating sad mood (e.g., cognitive reappraisal), and have documented activations in subregions of the frontal lobe, including ventro- and dorsolateral prefrontal cortex, as well as in the anterior cingulate cortices

This one's a bit trickier to detect, from someone who got help describing a time they felt grateful:

I had an old friend, a really old friend named Mimi Gregg, who was about 50 years older than me. She came to this tiny town in Alaska in 1946 with her husband, two babies and her mother, a former opera diva named Madam Vic. The Greggs had bought the old Army barracks sight unseen and hoped to make it an artists’ colony/tourist destination. But the plan never panned out, so with no money or jobs, Mimi’s artist husband had to learn to hunt and fish and make furniture. But Mimi and her husband also entertained themselves and their new Alaskan friends in those pre-TV days with plays, dances and costume parties. Mimi always had most of our neighborhood over for Thanksgiving, baking soft buttery rolls in her wood-burning cookstove while a tape of La Bohème swelled in the background. Mimi lived well into her 90s and I never once heard her pine for the good old days or wish that her life was something other than it was.

That said, relatively few (about 13% of the responses with insufficient key presses) were clearly pasting online content.  Most of those with insufficient key presses seemed to describe personal experiences, some in vivid detail, others more superficially.


So what to do?

One conclusion, which I don't think is right, is that online responses are just low quality.  A lot of research has documented fairly high quality of online data (e.g., Paolacci et al. 2010; Goodman, Cryder and Cheema 2012; Casler, Bickel and Hackett 2013; Paolacci and Chandler 2014; Litman, Robinson and Rosenzweig 2015; Goodman and Paolacci 2017).  In any case, I don't have essays collected offline to compare the online essays to, so I can't draw any such conclusion at this time.

Another perspective is that we should think of these interventions as basically an “intent to treat” design.  In other words, people are randomly assigned to conditions but the intended treatment (in this case, recalling an emotional experience) is only successfully implemented for some people, for reasons that may be of their own choosing (i.e., endogenous).

You could even argue that it therefore doesn't matter much — if research using this approach finds a significant effect, then it's despite the loss of statistical power that comes with some people not having actually received treatment (as long as everyone is included in the analysis).  This "conservative test" approach seems not entirely satisfying, because it assumes that we won't try to learn anything from null effects.  In particular, it's problematic for any attempts at replication (as non-replication can be explained away as noncompliance and theoretically irrelevant).  It's also problematic for the use of control tasks — was a null effect found in the control condition because of the difference in the content of the control writing task, or the compliance level with the control writing task?  It also raises the question of whether low-powered studies that do find effects under such conditions are using better practices, targeting more cooperative populations, or capitalizing on chance (based on a "lucky" sample with few non-compliers).

The opposite approach is to do heavy data cleaning: removing essays that are too short, nonsensical, and so on.  I lean in this direction, but it is also problematic.  First of all, to be valid, this must be done systematically (and the procedures need to be blinded), to avoid biasing the results by (unintentionally) judging essays from participants whose responses go against the hypothesis as more invalid than other participants' essays.

Furthermore, there is a risk that exclusions can cause serious biases in the results (e.g., see Aronow, Baron and Pinson 2016; Zhou and Fishbach 2016).  The primary concerns are (a) that the people who are excluded for "bad responses" may be systematically different from the non-excluded in ways that are relevant for the hypothesis being tested, and (b) that the exclusion rates may differ across conditions, resulting in imbalance (e.g., on average, more conscientious people in the condition in which the writing task is more difficult).

Perhaps the best approach is to think in terms of robustness — analyzing the data using several different procedures for handling bad essay responses, to see how sensitive the results are.  In any case, the seeming lack of discussion of these issues in papers that use these methods is concerning.  Even if we don’t have agreed-upon best practices, it would be helpful to have full information about the practices that were used when reading a given paper.
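A simple way to implement the robustness idea is to define each exclusion rule as a filter and report results under each. A minimal sketch follows; the rules and field names are illustrative, not the ones from our study:

```javascript
// Apply several alternative exclusion rules to the same records and report
// how many participants each rule retains; the substantive analysis would
// then be re-run on each retained subsample to check sensitivity.
var rules = {
  none:          function (r) { return true; },
  minLength:     function (r) { return r.essay.length >= 300; },
  keypressCheck: function (r) { return r.keyPresses >= r.essay.length; }
};

function retainedCounts(records) {
  var counts = {};
  Object.keys(rules).forEach(function (name) {
    counts[name] = records.filter(rules[name]).length;
  });
  return counts;
}

var records = [
  { essay: "x".repeat(350), keyPresses: 400 },  // long, typed
  { essay: "y".repeat(350), keyPresses: 100 },  // long, likely pasted
  { essay: "short", keyPresses: 10 }            // under the length minimum
];
console.log(retainedCounts(records)); // { none: 3, minLength: 2, keypressCheck: 2 }
```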

Finally, these problems point to a need for a better understanding of these research practices, and more exploration of alternatives.  As one example, it may be possible to instead ask participants to hand-write their essay and post a dated photo of what they wrote (although this would make it difficult to analyze the text data).


Appendix: Code for tracking the number of keypresses in Qualtrics using JavaScript:

Step 1: In Survey Flow, define embedded data fields "strokeCount" and "FinalCount".
Step 2: In the text box question, add this JavaScript code (replace "QIDNUM" with your question ID):
  // keypress counter
  var PressCount = 0;
  var currentQuestionID = this.getQuestionInfo().QuestionID;
  jQuery(document).on("keydown", function(event) {
    if (currentQuestionID == 'QIDNUM') {
      PressCount++;
      Qualtrics.SurveyEngine.setEmbeddedData('strokeCount', PressCount);
    }
  });


Step 3: In the JavaScript on the next page, put this code:
  // record counter value
  var PressCount = "${e://Field/strokeCount}";
  Qualtrics.SurveyEngine.setEmbeddedData('FinalCount', PressCount);
(Step 3 is necessary because, for some reason, strokeCount continues updating on subsequent pages.)







Political microtargeting

Cambridge Analytica's (CA) mining and retention of Facebook data during the election has become a hot topic of discussion.  The claim is that they paid people, via MTurk (and maybe other avenues), to install a Facebook personality-test app, which then mined not only their Facebook data but also the Facebook data of others.  There's some confusion about what they were doing and for what purposes, and some psychologists have questioned how effective it could really have been.

I have some first-hand experience with related methods, and while my experience is from the pre-Facebook Stone Ages (the mid-2000s), I have some thoughts.  I should emphasize that everything I'll say is speculation based on public news reports and extrapolation from my own experience in this area, and not first-hand knowledge of any kind.  I would be curious about the perspective of people who have more recent experience with the modern iterations of these methods.

What is psychological profiling?

One of the more sensational claims is that CA used some highly effective personality profiling or psychographic modeling: CA CEO Alexander Nix touted their "model to predict the personality of every single adult in the United States of America."  What was this model?  It looks like it was basically the standard workhorse of personality psychology, the Big-5.  Unlike lots of so-called "personality tests", the Big-5 is a real thing: it was developed via factor analysis, measuring many, many different self-report questions and then identifying those questions that best distinguish between people, and the underlying constructs that those questions seem to capture.  Not only does the basic structure replicate well in other studies, but there is a large literature testing which behaviors and attitudes are predicted by the Big 5, including political views and behaviors, although the relationships appear to be complex.

So, there was no breakthrough here in understanding personality.  Instead, the purported innovation was in using Facebook data to predict people’s Big 5 personality, rebranded as OCEAN, from their online behavior, radically changing the scale at which personality data could be collected.  This approach was pioneered by researchers at Cambridge in published academic research, and seemingly adapted, replicated and applied by Kogan and CA.

Does predicting personality matter?

The basic idea is an extension of an old database marketing tactic: purchase or compile a large-scale database with many variables, survey a much smaller sample of people to get their self-reported measures for key variables, and train a model to predict those key variables, to then apply that model to the large database.  If the model predicts well, then you have an approximation of what people would say without having to get their participation.  So, for example, you could survey 1000 people to ask if they support tax cuts or more funding for public schools, figure out which variables available in both the survey and in a million-person database predict these attitudes, and then score each of the million people in the database on their likelihood to support tax cuts or school funding, and use this to either customize communications (mailings and door-hangers) or for prioritizing turnout efforts.
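The tactic described above can be sketched in a few lines. This is a deliberately crude version (a cell-mean model on invented data and field names), just to make the train-on-survey, score-the-database logic concrete:

```javascript
// Sketch of the database-marketing tactic: estimate an attitude within
// demographic cells from a small survey, then score a large database.
// All field names and data are invented for illustration.
function cellKey(person) {
  return person.ageGroup + "|" + person.region;
}

// Train: average the surveyed attitude (0/1) within each demographic cell.
function trainCellModel(survey) {
  var sums = {}, counts = {};
  survey.forEach(function (p) {
    var k = cellKey(p);
    sums[k] = (sums[k] || 0) + p.supportsTaxCuts;
    counts[k] = (counts[k] || 0) + 1;
  });
  var model = {};
  Object.keys(sums).forEach(function (k) {
    model[k] = sums[k] / counts[k];
  });
  return model;
}

// Score: attach each database record's predicted support from its cell.
function scoreDatabase(model, database) {
  return database.map(function (p) {
    return { id: p.id, predictedSupport: model[cellKey(p)] };
  });
}

var survey = [
  { ageGroup: "18-34", region: "urban", supportsTaxCuts: 0 },
  { ageGroup: "18-34", region: "urban", supportsTaxCuts: 1 },
  { ageGroup: "65+",   region: "rural", supportsTaxCuts: 1 }
];
var database = [
  { id: 1, ageGroup: "18-34", region: "urban" },
  { id: 2, ageGroup: "65+",   region: "rural" }
];
console.log(scoreDatabase(trainCellModel(survey), database));
// [ { id: 1, predictedSupport: 0.5 }, { id: 2, predictedSupport: 1 } ]
```

A real model would use many more predictors and a proper statistical fit, but the structure (small labeled survey, big unlabeled database) is the same.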

The catch is that political attitudes are hard to predict.  The irony is that, in my personal opinion, these kinds of models (based largely on demographics and location) are not very accurate, but they still tend to be a lot more accurate than the subjective judgments of political consultants, even the supposed "experts" in a geographic area.  That said, the promise of the Facebook data is that it is much richer and could yield models that predict much better than old-fashioned demographics-based models.

The other odd thing about the storyline that CA developed some kind of personality super-test is that personality, even the self-reported personality that their model was supposed to predict, is not all that useful in practice.  The idea of the Big 5 is that it measures something very broad that is at least a little bit relevant to most human behavior, but that’s at the cost of not relating to behavior in any particular context that well.  Political scientists study personality only a little, primarily as a bridge to psychology and out of curiosity whether personality relates to political behavior at all. Their primary focus is instead on the attitudes and beliefs that most directly relate to political decisions.  By the same token, political pollsters (including those doing strategic work for campaigns) focus on attitudes and beliefs about candidates and issues, which are much more relevant to voting than is personality.

Predicting receptivity to political messages?

For all these reasons, I suspect that the personality profile aspect of this story is a red herring.  My guess is that personality was useful because people like taking personality tests, and because the various stakeholders CA was interacting with would find personality easy to understand, and perhaps be impressed by their ability to predict it.  But I’m guessing the main point was never to send neurotic messages to neurotic people and conscientious messages to conscientious people.

So, if the point was not some amazingly effective personality test, what was the point?  The NYT framed the issue as “the company that can perfect psychological targeting could offer far more potent tools: the ability to manipulate behavior by understanding how someone thinks and what he or she fears“.  Mere personality targeting would not achieve that. Kogan bragged about having “a sample of 50+ million individuals about whom we have the capacity to predict virtually any trait.”  I think this is the real strategy — to identify specific issues that motivate individual voters.

So, the goal is not to target generically neurotic voters with neurotic messages; it's to adapt a much more standard strategy in political campaigns to the social media environment: find the pro- or anti-gun-control people, the pro- or anti-immigration people, the pro- or anti-gay-marriage people, etc.  Then send them targeted messages, arguing that your candidate is on their side, and/or that the other candidate will bring about the end of civilization based on their evil position on the issue that the person you're messaging cares most about.

It seems that personality was incorporated into this strategy to some degree, but my guess is that the real key to the strategy was access to extensive Facebook behavioral data and using targeted communications to hit the right issue buttons for each person.  When David Carroll sued CA in the UK to get the data they had on him, what he got back was exactly this: their predictions of his positions on various issues, rather than personality predictions.  This is not new; it's an old strategy, but on steroids.  In the 1980s, campaigns would target their messages geographically, using different messages in different zip codes.  In the 2000s, messages could be targeted to the household, so that next-door neighbors might get different mail or different phone calls.  Now, the targeting is even more granular, by device or account, so that different people in the same household can be targeted with different messages.

Large scale A/B testing.

There is another key aspect of the strategy that has largely been overlooked.  According to Theresa Hong, a member of Trump’s digital team, “It wasn’t uncommon to have about 35 to 45 thousand iterations of these types of [Facebook] ads everyday.” In the context of the emphasis on personality, this sounds like micro-targeting, with the campaign creating ads customized to the unique personality of each voter.  But I doubt it — personality testing is about understanding big differences, splitting people up into large groups that are different from each other, rather than nuanced differences that would yield thousands of different micro-targeted ads. In fact, this volume of advertising can only be created algorithmically, by assembling ads from separate components, like a political Mad-Lib.
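The "political Mad-Lib" point is easy to see with a little arithmetic: crossing independent components multiplies quickly. A sketch, with invented components (three headlines by two colors by two media formats already yields 12 variants; a few more dimensions with more levels gets to tens of thousands):

```javascript
// Assemble ad variants by taking the cartesian product of independent
// components; the component names and values here are invented.
function crossComponents(components) {
  var variants = [{}];
  Object.keys(components).forEach(function (key) {
    var next = [];
    variants.forEach(function (variant) {
      components[key].forEach(function (value) {
        var copy = Object.assign({}, variant);  // extend each partial variant
        copy[key] = value;
        next.push(copy);
      });
    });
    variants = next;
  });
  return variants;
}

var ads = crossComponents({
  headline: ["Protect your rights", "Time for change", "Stand with us"],
  color:    ["red", "blue"],
  media:    ["photo", "video"]
});
console.log(ads.length); // 3 * 2 * 2 = 12
```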

So, this seems to be about micro-targeting via testing, rather than via understanding personality.  According to the article, "On the day of the third presidential debate between Trump and Clinton, Trump's team tested 175,000 different ad variations for his arguments, in order to find the right versions above all via Facebook. The messages differed for the most part only in microscopic details, in order to target the recipients in the optimal psychological way: different headings, colors, captions, with a photo or video."  The key here is that the targeting is not being done by understanding each voter's personality.  In fact, this kind of testing makes any such understanding completely unnecessary.  Instead, the strategy is to push out a high volume of different ads, see which ones perform best in terms of what is observable on Facebook (clicks, likes, shares, etc.), and then use those outcomes in conjunction with Facebook data on the people who responded to determine a targeting strategy (i.e., to predict which ad version will work best for whom).

Such testing is not especially sinister — it underlies much of the behavior that people experience on the web, where much of the content we interact with is constantly being tested. In fact, the Obama campaign pioneered testing in online fundraising and recruiting volunteers.  However, what's potentially unprecedented is using testing at this scale for modifying political messages and ad copy/targeting.  Traditionally, campaign messaging was something that high-level consultants within the campaign fought over the minutiae of, either for ideological reasons or to maintain their own position within the campaign.  While survey data would be leveraged in these debates, the degree to which control of the message and format was shifted to algorithmic testing in this case seems novel.  It's also unclear to what degree the Trump campaign took what they were learning about what worked on Facebook and applied it offline, such as at campaign rallies.

So, what’s the big deal?

I think a lot of the discussion about these issues misses the mark, overselling the impact of personality profiling.  That said, I think the skepticism about whether personality profiling would impact an election also misses the impact that CA’s methods might have had. In the past, targeting would result in the voter receiving a paper mailer or email that was clearly from the campaign, or a volunteer from the campaign might ring the voter’s doorbell.  Such approaches are not all that effective — people throw out junk mail and delete emails at a high rate.  (There is some evidence that door-to-door campaigning is effective, but costly and not easily scaled up).

However, what happens online is different, in ways that may make a big difference.  Social media content blurs the boundaries between advertising vs. reputable news vs. fringe (or even fake) news sources, and often comes with an implied or explicit endorsement from others in one’s social network.  These factors may transform traditional campaign approaches into something far more effective.

Lastly, there is an entire other set of issues around how these traditional approaches were deployed online. The current attention is driven by revelations suggesting that CA harvested Facebook data from people who were either told it was for academic research or who were unknowing social contacts of those people, and kept the data even after Facebook required them to delete it.  There are also major issues of accuracy and disclosure in online political advertising. If CA's methods were accompanied by bots and fake accounts (theirs or someone else's) reinforcing the messages, based on the same person-level targeting, people's skepticism about online information could have been very effectively overcome, by creating a targeted echo chamber.

I’m not an election lawyer, so I don’t know to what degree existing election laws prohibited the tactics used, were too vague to proscribe tactics that occurred online, or were actually intended to allow these tactics. Was the Clinton campaign flat-footed, and out-teched, in the same way that the Obama campaign had a tech advantage over the Romney campaign?  Or did CA bend or break the rules, getting sole access to powerful data and using it for unethical messaging tactics, while the Clinton campaign did not because they followed the rules?  I don’t know.  But it seems very unlikely that the big difference between the campaigns was that one campaign knew which Facebook users were neurotic, and the other did not.

One last point: incentives and skepticism.

Going back to the controversy in the 1960s about “hidden persuaders” manipulating consumers for profit, marketers (including political marketers) have been shaped by two competing incentives.  Regardless of the actual effectiveness of their marketing tactics, there is a strong incentive for practitioners (and potentially even for academics) to oversell effectiveness, for example, to convince potential clients that spending on marketing messages will pay off in desired changes in consumers’ (or voters’) behavior. However, consumers and voters usually don’t like the idea that their minds are being played with, exaggerated though it often is, and may push for sanctions or regulations that constrain or punish marketers, leading marketers to plead that they are not trying to persuade anyone, just to honestly inform them.  This push and pull can even lead to societal panic about marketing tactics that may turn out to be more ineffective than harmful.

We can see this dynamic in the CA case. CA sends out a press release quoting Nix, “We are thrilled that our revolutionary approach to data-driven communication has played such an integral part in President-elect Trump’s extraordinary win“. Alexander Taylor of CA then says “It’s not about tricking people into voting for a candidate who they wouldn’t otherwise support. It’s just about making marketing more efficient.”  Even without the legal and political scrutiny of the Trump campaign, CA has the incentive to portray themselves as both wizards who can manipulate minds and simple purveyors of objective information, and to both highlight and downplay the role of personality in what they do, to claim that they have massive data and that they would never use third-party data.  This brings to mind another scandal, involving whether Facebook could manipulate emotions or not — in that case, a similar mix of incentives to exaggerate and downplay made it difficult to pinpoint reality.

All of which is to say, healthy skepticism is needed all around.


Update: This interview sheds a bit more light on the methods and scope, although the whistleblower, Chris Wylie, seems to draw a distinction between persuasion and manipulation that rests on whether or not messages are customized via personality profiling.  That doesn’t make much sense to me.


Update #2:  Ch. 4 has just released a new report, with hidden camera video.  Suffice it to say, I now think this has even less to do with data, let alone personality prediction, than I thought before. It seems possible that the personality profiling that people find objectionable was really just an academic-seeming cover for seriously dirty campaigning.  In any case, I think the focus on personality is likely obscuring the much bigger problems we now face with the anonymity of political actions and deliberate deception and outright propaganda.

Update #3: Latest Ch. 4 video.  It was not about personality prediction — the goal was A/B testing and targeting untraceable ads and communications that seemed to be coming from independent advocacy groups or individuals.


A few thoughts on free speech and academic freedom.

Like everyone, I have my own political views.  I mainly try to keep them out of my professional life, whether in my teaching or my research. My favorite social studies teacher in high school took it as a point of pride that we could not guess what his political views were.  I don't take it that far, and I'm sure mine are quite easy to guess.  I'm also sure my political views, along with all my other views and life experiences, have influenced my choices of research topics.  But I think it's important not to fall into the trap of doing research in an attempt to confirm a desired viewpoint, whether political or otherwise. This has made me want to keep a distinction between my personal political views and my role as an academic, particularly since I do some research on political beliefs and motivations.

That became more difficult recently, as a controversy erupted in my own building over the invitation extended to Steve Bannon, formerly of Breitbart News and the Trump campaign and White House, to speak at the Booth School of Business.  I signed a petition expressing disapproval of the invitation, along with many other University of Chicago faculty.  Others actively defended the invitation on free speech grounds. From what I've been hearing, there's not so much disagreement within the school in people's assessments of Bannon, but a great deal of disagreement about what the invitation means, whether it's brave or a mistake, and what the university should or should not do about it. The debate has largely been portrayed as an issue of protecting people from hate speech, or preserving the right to free speech.

Personally, I see it a bit differently. I think part of the issue is a lack of clarity on what we mean by “free speech”, particularly in an academic context.  Academics, by and large, see ourselves as noble defenders of free speech and open inquiry.  At the same time, judging and even “policing” the ideas of others runs through the core of most of what we do.  We decide which students to admit based not only on their GPA and test scores, but on more general evidence of how well they seem to think and argue.  We grade students’ work, sometimes even failing students who we think are not able to generate accurate or credible knowledge.  We decide which faculty to hire, based on evaluating the quality of their research. We review each others’ papers for academic journals and often reject others’ work, preventing it from being published (at least in that outlet), because we think the “speech” is of low quality. In extreme cases, even after publication, papers get retracted for having major flaws.

None of this is seen as incompatible with free speech values — everyone has a chance to make their case, after all, and rejecting or refusing to promote low-quality work is not seen as suppressing the free speech of the authors.  No one has the right to be invited to give a research seminar or to publish in a particular journal or to get a good grade.  Gatekeeping is part of our mission, to separate the wheat from the chaff, in an attempt to advance an accurate understanding of the world.  In fact, many institutions take pride in being quite exclusionary, by, for example, only inviting to present on campus those academics who they see as conducting the absolutely best research.

Of course, like any human endeavor, this is sometimes done badly, and there is no guarantee that the best work is being promoted at the expense of the worst.   Academics often complain about random-seeming reviewers, or see reviewers as biased, favoring attention-getting work over more reliable findings, for example, or suppressing work critical of their favored paradigm, or even attacking research critical of their own work. This is generally seen as unfair, perhaps bad for science, but not a violation of free speech. After all, the researcher can always get their work out in other ways, and reach those who want to hear about it.  When a person has difficulty publishing their work, even if the work is being rejected for quite subjective reasons and the person doesn’t get tenure as a result, this is not seen as a violation of free speech or academic freedom.  Quality control of this type, imperfect though it may be, is seen as a necessary part of the profession.

Academics do get up in arms about protecting academic freedom. This can seem quite self-serving — as if we get to boss others around but don’t like it when others boss us around. But I think there is a coherent philosophy here, even if not always applied evenly, which hinges on a key distinction: we can police the quality of others’ work (i.e., speech) but not the content.  Free speech or academic freedom does not protect those who commit gross violations (e.g., making up data or plagiarizing), but it also doesn’t protect speech from being rejected (“suppressed”) if it is of low quality in much more minor (and subjective) ways: the appropriateness of the statistical tests used, the quality of the data, the logic of the argument, the generalizability of the findings, the novelty of the points being made, even the readability of the writing. That is all fine. What is seen as a violation of academic freedom is punishing or suppressing work because of the conclusions it reaches.

This brings us to the free speech and academic freedom dimensions of the Bannon case, and why I don’t support having him speak on campus, even though I consider myself an advocate for free speech.  To be clear, despite my opposition to the visit, there are multiple ways in which Bannon might be dis-invited that I would see as a violation of free speech, and which I would strongly oppose.  If the City of Chicago ordered the University not to host him, or a federal official threatened the University’s funding, or if an alumnus threatened to withhold donations if he came, or if people who opposed his visit threatened violence, and the university caved to such pressure, I would be upset and protest that decision as a violation of free speech and academic freedom.  If Professor Luigi Zingales, who organized the invitation, were pressured by the university or harassed by people opposed to the visit, that would be disgraceful.  If such inappropriate pressure resulted in rescinding the invitation, that would be a travesty.  In fact, I’m uncomfortable with the protest that was held in Prof. Zingales’ class, for example.  While by all accounts conducted respectfully, I think it’s a real mistake to respond to the Bannon invitation in a way that impinges on academic expression, even if just for a few minutes, on principle.

It’s a subtler point, but I personally would also be uncomfortable with rescinding the Bannon invitation based on the offensive language that he allegedly uses (and that his Breitbart writers promote).  I do think having him speak will be upsetting, even harmful, to many people.  When he promotes the term “globalist cuckold,”  I feel personally attacked, and I’m sure many others in the university community may feel far more dehumanized by his rhetoric than I do. But I also think that universities cannot base these kinds of decisions on how upset people will be.  After all, someone talking about American slavery, or investigating the Polish role in the Holocaust, or discussing Armenian genocide or female genital mutilation will make some people feel singled out, targeted and perhaps even extremely upset, but that is no basis for precluding such speech.

Where this leaves me is that academia has a role, whether we intend it to or not in a specific circumstance, as arbiters of the quality of the thoughts expressed in speech.  This changes what it means to invite Bannon to speak.  Regardless of the intention, inviting Bannon to speak sends a message that what he has to say has merit and substance, even if you disagree with him.  It says that his ideas, even if they are wrong, are of sufficient quality to be debated.

The distinction here is important. I hope that academics who wish to do so will study Bannon and his impact on American politics; they should interview him, quote him in their work, and so on, as they see fit.  Doing so does not imply any endorsement, or any vetting. Inviting him to campus, I believe, is fundamentally different.  We invite very few people to enjoy the privilege of speaking on campus, and these invitations are made based on an assessment of the intellectual value that the speaker provides. After all, there are millions of people with negative views of any topic you can think of. Some of them spin paranoid rants on street corners, dinner tables or online to anyone who will listen, making up their own facts to suit their foregone conclusions.  We don’t invite these people to speak on campus, whether or not we agree with them, because the quality of what they have to say is not valuable to academic inquiry (or to non-academic inquiry, for that matter). There are other people who can provide insight into the same issues, either because they have studied these issues, or because they have firsthand experiences that are relevant.  As long as there is reason to believe that such a person will present an informed perspective that, even if biased, is based in fact and reality, I think there is value in listening.

So, what about Bannon? To me, it boils down to a question of whether he is someone who will provide genuine insight into opposition to globalization and immigration, enabling his audience to have a more complete understanding of the issue.  On this, I think the record is fairly clear.  He was in charge of Breitbart News, a site known for propaganda and cherry-picking that Bannon described as the “platform for the alt-right,” which included a section devoted exclusively to reports of crimes committed by African Americans, for example. Under his direction, Breitbart promoted the views of “alt-right” white supremacists, while also publicly disavowing explicit neo-Nazis. The Anti-Defamation League concludes that “Bannon has embraced the alt-right, a loose collection of white nationalists and anti-semites”, but that they “are not aware of any anti-semitic statements from Bannon”.

What else does Bannon bring to the table that would help provide insights into why we are experiencing backlash against globalization and immigration?  He repeatedly cites a novel about dark-skinned refugees whose leader eats feces, who invade Western Europe and, with African Americans, take over in the US, as analogous to current events.  He has allegedly described his views on Europe as influenced by the anti-semitic French writer Charles Maurras, who was sentenced to life in prison for supporting Nazism. Closer to home, Bannon has falsely claimed that “two-thirds or three-quarters of the C.E.O.s in Silicon Valley are from South Asia or from Asia…”. In an email exchange discussing opposition to Republican leadership, he wrote “Let the grassroots turn on the hate because that’s the ONLY thing that will make them do their duty.” He characterized his attack articles on a journalist who was researching Fox News, including accusing the journalist of fraud, as “love taps…just business”. More generally, he has allegedly characterized himself as a “Leninist” who wants to “destroy the state” and professed his admiration for Nazi propagandist Leni Riefenstahl, expressing his desire to be the “Leni Riefenstahl of the GOP”.

In my personal opinion, what emerges is a clear picture, but not of someone who represents a considered negative view of globalization and immigration, and who can shed light on the grievances and disappointments causing a backlash against these trends.  Instead, what these reports suggest is someone who stokes resentment and hatred as a tactic, with seemingly no regard for the truth and with the de facto support of white nationalists, in pursuit of a self-professed destructive agenda. To put it bluntly, my main objection is not that he promotes demonizing political opponents as globalist cuckolds (although despicable…), it’s that there seems to be no substance to his views beyond these kinds of crass attacks. I fail to see any intellectual benefit to engaging in debate with him. Were he not the former head of the Trump campaign and White House chief strategist, I cannot imagine that he would ever be invited to speak on campus based on the intellectual content of his views. However, in my opinion, the fact that he achieved those positions does not make what he has to say any more valuable.

So, where does that leave us?  I assume Bannon will come and either give a talk or participate in a debate.  I think that’s unfortunate.  In my view, it’s a mistake to invite someone seemingly lacking in merit and integrity to speak, and having done so, I think it’s a mistake to maintain the invitation.  I think it would be wise to cancel his talk, not by caving to political pressure, but by reconsidering what the standards should be for a speaker to whom we give a platform. Otherwise, I worry that we will legitimize and aggrandize Bannon, and discredit the University of Chicago, undermining the University’s reputation for seriousness and intellectual integrity, at least a bit.

Maybe I’m a hypocrite, and shouldn’t get to call myself a free-speech advocate any more, on the grounds that I am trying to “silence” Bannon. But I don’t think so.  First, few people in the world have more opportunities to express their views than Bannon does — journalists clamor to interview him, he has had radio and online platforms for his views, and is evidently writing an autobiography, not to mention the keen interest the House of Representatives has expressed in hearing more from him. But more importantly, exercising our professional obligation to assess the merit of claims is not a violation of free speech, in my view.  It’s not surprising, or a violation of free speech, that we haven’t invited Diederik Stapel, Andrew Wakefield, Bernie Madoff, or Stephen Glass to give a talk. By the same token, I think we have a responsibility to assess whether having Bannon speak would advance our understanding of the world, or muddy the waters with disinformation.

There is one last aspect to consider, which is the academic freedom of faculty to invite speakers.  The University of Chicago gives faculty broad latitude to teach and research as they see fit, including inviting whomever they please to speak in their courses, for example. I appreciate that tremendously.  However, that is matched by the University of Chicago’s tradition of open feedback and criticism among colleagues.  So, while I believe that everyone, including Prof. Zingales, should have the right to invite the speakers they choose, I also feel that I have not only the right, but also, in this extreme case, an obligation to express why I think this invitation is a mistake.

That is about 2500 more words than I wanted to write about this topic. What I’m really excited to write about is research I’m doing on the accuracy of people’s statistical inferences when generalizing from single events to the outcomes from repeated events and when interpreting forecasts!  Edge-of-your-seat stuff, to be sure, but I’ll have to leave you with that cliff-hanger for now, and save it for next time.


Coda (Feb. 14): I found a lot I agreed with in this Weekly Standard editorial by Gabriel Rossman, a conservative professor of sociology at UCLA, regarding their invitation of Milo Yiannopoulos.


Cash rules everything?

[5th post in a series; start with the first post here]

This week, I’ve talked about the controversy over whether or not incentives reduce intrinsic motivation, leading to less task engagement than if the incentive had never been offered. Yesterday, I covered Indranil Goswami’s research with me, which tests this idea in repeated task choices, and finds that the post-incentive reduction is brief and small (relative to the during-incentive increase).

So, what does this mean for the original theories? Remember, paying people was supposed to feel controlling and reduce autonomy and therefore supplant people’s own intrinsic reasons for doing a task.


Maybe the theory is mostly right, just wrong about the degree and duration of the impact it has.  Maybe intrinsic motivation is reduced, but bounces back after giving people a little time to forget about the payment.  Or maybe giving people the opportunity to make a few choices on their own without any external influence resets their intrinsic motivation.

But maybe the theory is just wrong about the effects of payments.  Perhaps people are trying to manage the eternal tradeoff between effort and leisure over time.  If they want a mix of both effort and leisure, then when they have a good reason to invest more in effort for a while, they do.  But then, when the incentive is gone, it’s an opportunity to balance it back out by taking a break and engaging in more leisure.

For research purposes, it’s helpful that the Deci, Koestner and Ryan paper is quite clear about the theoretical process. This can be used to make predictions about how motivation should change depending on the context, according to their theory. Indranil designed several studies to test between these accounts.

In one study, he gave people different kinds of breaks after the incentive ended. He found that giving people a brief break eliminated the initial post-incentive decline, but only if the break did not involve making difficult choices.  So, it doesn’t seem like giving people the opportunity to make their own choices took away the sting of controlling incentives, making the negative effects brief.  Instead, it’s that giving people a little leisure made them willing to dive back into the math problems.

But how about a more direct test?  The intrinsic motivation theory predicts that the negative effects of incentives should be more pronounced, the more intrinsically motivating the task is.  If I really enjoy watching videos, and you destroy my love of video-watching for its own sake with an incentive, my behavior afterwards when there is no incentive should be really different.  On the other hand, if the post-incentive decline is because I was working hard during the incentive and need a break, then I should be less interested in changing tasks after being paid to do an easy and fun task.

To test this, Indranil took his experimental setup and varied which task was incentivized, paying some people for every video they watched in Round 2, and others for every math problem they solved.  After being paid to do math, when the payment ended, people wanted a break and initially watched more videos for the first few choices.  But after being paid to watch videos, when the payment ended, there was no difference in their choices, contrary to the intrinsic motivation theory.

Perhaps the most direct test came from simply varying the amount of the payments, in another study, either 1 cent, 5 cents or 50 cents per correct math problem in Round 2.  According to the intrinsic motivation theory, the larger the incentive, the more controlling it is, and the more damage is done to intrinsic motivation.  However, if people are balancing effort and leisure, they may feel less of a need to do so, the better they are paid. As the chart below shows, paying people a high amount (50 cents) led to not only no immediate post-incentive decline, but summing across all of Round 3, people did more of the math task without any additional payment.  Again, the opposite of what the intrinsic motivation account would predict.


So, what does it all mean?  We need more research to figure out when it is that incentives will have no post-incentive effects, a negative effect or a positive effect.  But at a minimum, our findings strongly suggest that simply offering a temporary incentive does not necessarily harm intrinsic motivation. Instead, it seems that when an incentive gets people to work harder than they would have otherwise, they just want to take a break afterwards.



The dynamic effects of incentives on task engagement.

[4th post in a series; start with the first post here]

This week, I’ve laid out a puzzle: motivational theories and research suggest that incentives reduce intrinsic motivation, so that task engagement is lower when a temporary incentive ends than if the incentive had never been offered in the first place.  This suggests that offering people temporary incentives for performance will backfire.  But tests of incentives in real-world settings have all found either no long-term effects or positive long-term effects.

Indranil Goswami tackled this puzzle in his dissertation, which has just been published in the January issue of JEP:General.  Prior studies generally only measured people’s behavior right after the incentive ended.  Indranil designed a new test to see how motivation to engage in a task compared before, during and after an incentive, and how it changed over time.

In his studies, people were given repeated choices between doing a 30-second math problem and watching a 30-second video.  There were three rounds:

Round 1: Participants made eight choices between math problems and videos.

Round 2: For half the people, an incentive was offered for the math task (5 cents for every time they chose the math problem and solved it correctly). Participants were told that the incentive would only apply in the current round. For the other half of the people, no incentive was mentioned. All participants made ten choices.

Round 3: The key test: all participants made another 12 choices between a math task and watching a video, with no incentives.
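The three rounds above can be sketched in code. A hedged note: the round lengths and the 5-cent payment come from the description above, but everything else here (the function names, the toy `choose` rule) is hypothetical scaffolding for illustration, not the actual experiment code or analysis.

```python
# Sketch of the three-round choice paradigm.  The round lengths and the
# 5-cent incentive match the design described in the post; the choose()
# logic below is a hypothetical placeholder, not real participant data.
ROUNDS = [
    {"name": "Round 1", "choices": 8,  "math_pay_cents": 0},
    {"name": "Round 2", "choices": 10, "math_pay_cents": 5},  # incentive round
    {"name": "Round 3", "choices": 12, "math_pay_cents": 0},  # the key test
]

def run_participant(incentivized, choose):
    """Record one participant's math-vs-video choices across the rounds.

    `choose(round_info, pay)` should return "math" or "video"; it stands
    in for an actual participant's decision process.
    """
    log = []
    for rnd in ROUNDS:
        pay = rnd["math_pay_cents"] if incentivized else 0
        for _ in range(rnd["choices"]):
            log.append((rnd["name"], choose(rnd, pay)))
    return log

# Toy example: a purely payment-driven participant who does math only when paid.
log = run_participant(True, lambda rnd, pay: "math" if pay else "video")
math_by_round = {name: sum(1 for n, c in log if n == name and c == "math")
                 for name in ("Round 1", "Round 2", "Round 3")}
```

The interesting empirical question, of course, is what real participants do in Round 3, where this toy rule simply predicts zero math choices.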

People who were given an incentive in Round 2 will probably do more of the math tasks while the incentive is available, compared to those who weren’t paid to do math. But what will happen in Round 3, when the incentive has ended and is no longer available?

According to the psychological theories we’ve been discussing,  people who were never paid still have their intrinsic motivation intact, and will keep doing the math, to the degree they find it interesting. But for people who were paid to do math during Round 2, the math task is now different — it either no longer provides autonomy, or they will have inferred that it’s an uninteresting task — and they won’t want to do it anymore in Round 3, now that they’re not being paid.  So, the results would be predicted to look like this:



Going back to the Deci, Koestner and Ryan paper, Indranil’s experiments correspond to a tangible expected reward that is performance contingent, which resulted in a negative effect on subsequent performance (d=-.28) in the 32 such studies they reviewed. If we were to ask Alfie Kohn, he would presumably endorse the Control (no payment) condition — sure, the payment gives us a short-term increase in the math task during Round 2, but at what cost to long-term motivation?

Economists would tend to disagree, however. Incentives should increase the number of math tasks when people are being paid, but why would they have a negative effect afterwards?  Once the incentive ends, people should go back to doing as many math tasks as they enjoy, as if the incentive had never happened. Or, if the incentive actually helped them improve at the task by getting them to practice more in Round 2, then maybe they would continue to do a bit more than they had been doing before, as suggested by Fryer’s studies, which I discussed in the last post.

Indranil conducted a series of experiments, varying multiple factors.  Across the experiments, he had nearly 1100 participants who were in the versions described above.  This chart summarizes what he found:



In Round 1, the two groups were the same.  Then, in Round 2, people who could earn money for solving the math problems did a lot more of them.  That makes sense.  The key question is what happens next, when the incentive ends.

In Trial 19, when the incentive ended, those people who had previously been paid to do math were suddenly a lot less likely to choose the math tasks.  They wanted to watch a video, not only more than they had before, but also more than the people who had never been paid an incentive in the first place. So it looks like intrinsic motivation was reduced — but only for a while.  The difference was still there, though weaker, in Trial 20, and was effectively eliminated by Trial 21.

So, after a minute and a half, a mere three choices later, the story had changed.  Whether the person had been paid an incentive or not didn’t matter for their willingness to do math rather than watch videos. After a few more choices, the pattern actually fully reverses, and the people who had been paid before are now doing more math problems, for free.

In a sense, both sets of findings from prior research were vindicated.  The immediate negative post-incentive effects on behavior that had been found in previous lab experiments were found here too.  On the other hand, the lack of an overall negative post-incentive effect observed in the field studies was replicated here as well.

What does it all mean? As far as policy implications, the results are quite inconsistent with the dire warnings about incentives.  Providing a temporary incentive can yield a boost in behavior while people are being paid, and only a small and brief decline afterwards. Maybe incentives work pretty well after all.




The mystery of the missing long-term harm from incentives.

[3rd post in a series; start with the first post here]

I’ve been talking about research on incentives, and how a temporary incentive might undermine intrinsic motivation. In the last post, we heard from educational policy advocate Alfie Kohn on his prediction that paying kids to do schoolwork would backfire, particularly when the incentive ended.

Perhaps the most comprehensive experiments to test these ideas in a real-world setting were done by Roland Fryer, as reported in his 2011 paper. He describes his research on education in the video below (fascinating throughout, but the student incentives discussion runs from around 17:00-24:00):

Overall, his results suggest little effect, positive or negative, of paying schoolchildren for their performance (e.g., for getting high test scores). However, he also conducted two large-scale randomized in-school trials that instead tested rewards for their efforts, for the underlying behaviors that could foster success in school. In Dallas, second-graders randomly assigned to the treatment condition were paid $2 for each book they read and passed a quiz on.  In Washington D.C., sixth through eighth graders in the treatment condition were paid for other educational inputs, including attendance, school behavior and handing in homework.

Paying kids to read books yielded a significant improvement in the Dallas students’ reading comprehension, a marginal improvement in their language scores and a positive but insignificant effect on vocabulary scores. In Washington D.C., the incentives yielded a marginal improvement in reading and a non-significant improvement in math.

The key question is what happened when the incentive program ended.  The psychological theories we’ve been discussing would predict that the kids would be worse off. Having lost their own motivation to engage in the behaviors and no longer having the external tangible incentive, motivation would be reduced, harming outcomes.  Instead, Fryer found that the positive effects were reduced by half and were no longer statistically significant after the reward ended.

To put it another way, the benefits do seem to fade when the incentive ends, but there is no evidence that the outcomes are worse than if the incentive had not been offered in the first place. This is not an isolated finding.  Three other studies with older students (high school: Jackson, 2010; college: Angrist et al., 2009, and Garbarino & Slonim, 2005) actually find (some) positive effects of education incentives that significantly persisted after the incentive had ended.

In his dissertation, Indranil Goswami reviewed 18 field studies across a variety of domains (including education, smoking cessation, weight loss, gym attendance, medical adherence and work productivity).  These studies all measured people’s total behavior in a period after the incentive had ended and found either no long-term effect or a modest positive effect. Not a single study found that people had worse outcomes when an incentive was offered and then ended, than if the incentive had never happened.

So where was the long-term harm that the predicted loss of intrinsic motivation from the incentive would have caused?

The evils of incentives.

[2nd post in a series; start with the first post here]

Yesterday, I talked about research on incentives, and how a temporary incentive might undermine intrinsic motivation. This view has had a major impact on policy regarding incentives, particularly in relation to children and education.

Alfie Kohn has published several books on the topic, including “Punished by Rewards: The Trouble with Gold Stars, Incentive Plans, A’s, Praise, and Other Bribes.” Here he is explaining to Oprah and TV viewers everywhere why rewarding kids is a very bad idea.  He presents the main idea in the first three minutes:


In the interview, he says:

“One of the findings in psychology that has been shown over and over again — the more you reward people for doing something, the more they tend to lose interest in whatever they had to do to get the reward.”

He then goes on to specifically talk about grades as problematic incentives.  In his book, he goes even further, saying that verbal praise is coercive and should be avoided because it contains an implied threat to withhold praise in the future.

But looking back at Deci, Koestner and Ryan’s meta-analysis, which I discussed yesterday, verbal rewards had no negative effect on children’s subsequent motivation, and even tangible rewards had no post-reward effect when the reward was unexpected. So, this seems to be over-selling the impact of temporary rewards that had been found in the literature.

The show does an experiment of their own that yields a somewhat implausibly strong effect. Kohn then characterizes this result first in terms of an inference-based theory of intrinsic motivation, before circling back to the control vs. autonomy account.

“If the kid figures they have to bribe me to do this, then it must be something I wouldn’t want to do, so the very act of offering a reward for a behavior signals to somebody that this is not something interesting.”

What he’s talking about here is the “overjustification hypothesis” of Lepper, Greene and Nisbett (1973). However, most of the experiments in which this has been tested were with young children (for example, pre-schoolers in the article above).  There’s something a bit odd to me about the idea that the teenagers on the show would not be able to judge how interesting the task was on their own, without making those kinds of more remote inferences.

A schoolteacher comes on and talks about a rewards program (“Earning by Learning”) her school uses to motivate reading that she thinks works. Kohn is skeptical and raises three issues — whether the kids are choosing easier books just to get the rewards, whether they comprehend what they’re reading, and the issue we’ve been discussing: whether motivation will persist when the program is over.

Nevertheless, what we’re left with is the admonition that in the long run, rewards not only don’t work, but will harm motivation. As Oprah says, “You have to change the way you think about parenting!” But notice how far we are now from the original studies. Next, we’ll look at some more direct tests of this idea.

What happens to motivation when incentives end?

If you’re temporarily paid to do something, would that change your motivation or interest in doing the same thing when you’re not paid to do it anymore? Indranil Goswami investigated this long-standing question for his dissertation with me, which I’ll get to soon.  But first, some background.

Psychologists and economists have long debated the effectiveness of incentives.  From the viewpoint of economics, it’s simple, almost definitional.  Economics is fundamentally about how incentives shape human behavior. Much of the empirical research in economics is about how incentives — overt, hidden, and even perverse — influence and explain people’s behavior. While this view can be summarized simply as “incentives work!”, identifying what the incentives are can be tricky and open to debate, and the definition of what constitutes an incentive has been broadening.  Andreoni, for example, introduced the idea of a “warm glow” (or good feeling) that a person may get from donating to others as an incentive that can explain altruistic behavior.

Psychologists tend to think in terms of internal mental processes and motivators, and have historically been skeptical of external incentives, seeing such incentives, particularly monetary incentives, as impure and interfering with people’s true or “intrinsic” motivation.

Which brings us to one of my favorite papers, a comprehensive review by Deci, Koestner and Ryan (1999) of how incentives affect intrinsic motivation. They look for experiments that tested how motivated people are to do a task without compensation, after they had been temporarily paid to do the task. In the paper, they painstakingly gather up research findings, including unpublished studies in order to deal with publication bias, categorize the differences in experimental methods and then summarize the average findings (using meta-analysis).  Their main results are summarized below (from p. 647 of the paper):


What they’re looking at here is the “free choice paradigm”, where people in a lab study are paid to do an activity (such as drawing), and then are put in a situation where they could do more of the activity or do some other activity, with no compensation for any of the options. Their decision whether or not to do more of the activity is compared with the same decision among people who were never paid for the activity in the first place (a control group).

Based on 101 such studies, it looks bad for incentives (d = -.24, at the top of the graph). People paid to do the activity then do less of it when the payment is no longer available than if they had never been paid in the first place.  From a classical economics perspective, this may appear weird — if you like drawing, then you should draw, whether you were previously paid to draw or not. To many psychologists, the reason seems clear: paying people changed how they viewed the activity, undercutting the intrinsic motivation that made it fun in the first place.
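For readers unfamiliar with the effect-size metric, d here is a standardized mean difference (Cohen's d): the gap between the two groups' means, divided by their pooled standard deviation. The numbers below are made up purely to illustrate the arithmetic; they are not taken from the meta-analysis.

```python
import math

def cohens_d(mean1, sd1, n1, mean2, sd2, n2):
    """Standardized mean difference, using a pooled standard deviation."""
    pooled_sd = math.sqrt(((n1 - 1) * sd1**2 + (n2 - 1) * sd2**2)
                          / (n1 + n2 - 2))
    return (mean1 - mean2) / pooled_sd

# Hypothetical free-choice data (illustrative only): minutes spent on the
# activity with no compensation, previously-paid group vs. never-paid control.
d = cohens_d(mean1=4.0, sd1=2.0, n1=50, mean2=4.5, sd2=2.0, n2=50)
# d comes out negative because the previously-paid group did less of the activity.
```

A negative d, as in the meta-analysis, means the previously-paid group engaged in the activity less than the control group, in pooled-standard-deviation units.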

As the figure illustrates, the experiments vary a lot, and so do the results. Verbal rewards (i.e., praise) have positive effects on subsequent motivation (the opposite of the headline finding), at least for college students. The negative effects are driven by tangible rewards (e.g., money) in situations where people are paid conditional on something: trying the activity, completing the activity, or achieving a particular performance in the activity.

What does this mean?  The proposed theory centers on feelings of autonomy: people do things in part to feel good about having done them themselves. When someone else comes in and provides a conditional reward, it eliminates the activity's ability to provide that autonomy benefit.  And here's the key: this is assumed to be a long-term change in how the activity is perceived and experienced.  As a result, there's a risk to using incentives.  As they warn in the paper:

“if people use tangible [i.e., monetary] rewards, it is necessary that they be extremely careful… about the intrinsic motivation and task persistence of the people they are rewarding” (p. 656)

I’ll look more closely at this implication this week, including our new data.

The Costs of Misestimating Time

Indranil Goswami and I have a new working paper up.  We were interested in the consequences of misestimating time, and looked at contract choices.  We have some participants serve as workers, doing a task, like solving jigsaw puzzles.  The workers are either paid a flat fee, or are paid for the time they spend on the task. We have other participants serve as managers, making choices between hiring a worker with a fixed fee contract or a per-time contract.

We find that the "managers" generally prefer the fixed fee contracts, even though the per-time contracts are actually more profitable. Prior research has also found a "flat-rate bias" in contexts such as gym memberships and service contracts. Years ago, before I knew about any of the research, I also stumbled across it in the results of a conjoint marketing research study I worked on for a telecommunications company. This puzzled all of us at the marketing research firm, accustomed to thinking of consumers as simply hunting for the best deal.

One explanation is that flat rate deals provide a kind of insurance — even if they are more expensive on average, they eliminate a costly worst-case scenario. A more recently proposed explanation is that people don’t like having to feel like each additional bit of consumption is costing them. When you’re texting your friends, you want to just enjoy texting your friends, not do a cost-benefit analysis of whether each text is actually worth the cost.

In our context, the culprit turns out to be different: misestimation of how long the workers will take. We find that the managers choose the flat fee mainly when their own time estimates suggest that the flat fee would be a better deal. It doesn't seem to be about insuring against the worst-case scenario of an expensive, slow worker: when given a choice between a certain amount and a gamble constructed to be equivalent to their contract choice, they are much less interested in the certain option.

The best evidence that it's about misestimating workers' time comes from time limits.  We give the workers either a short time-limit or a long time-limit to complete the task. The contracts are set up in such a way that the per-time contract is a better deal than the flat fee in both cases, but the advantage of the per-time contract is even stronger when the time limit is longer.

So, based on just the incentives, our "managers" should be less likely to choose the flat fee contract under the long time-limit. But instead, more of our managers choose the flat fee contract under the long time-limit. Why? Because under the long time-limit they also over-estimate how long the workers will take to a greater degree than under the short time-limit.  This turns out to be a very robust finding, observed with different kinds of tasks and among participants with management experience.
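The logic of the managers' mistake can be sketched with a toy calculation. The rates, fees, and times below are hypothetical placeholders, not the paper's actual parameters; the point is only that over-estimating the worker's time can make the flat fee look cheaper even when the per-time contract is actually the better deal.

```python
# Toy sketch of the contract choice (all numbers are hypothetical).
# A per-time contract costs rate * minutes worked; the flat fee is fixed.
FLAT_FEE = 3.00          # dollars, hypothetical
PER_MINUTE_RATE = 0.25   # dollars per minute, hypothetical

def per_time_cost(minutes):
    """Cost to the manager under the per-time contract."""
    return PER_MINUTE_RATE * minutes

true_minutes = 10        # suppose workers actually finish this quickly
estimated_minutes = 15   # but the manager over-estimates their time

# Judged by the true time, the per-time contract is the better (cheaper) deal...
print(per_time_cost(true_minutes) < FLAT_FEE)       # True
# ...but judged by the manager's inflated estimate, the flat fee looks cheaper.
print(per_time_cost(estimated_minutes) < FLAT_FEE)  # False
```

Under the long time-limit, the over-estimation is larger, so more managers' estimates cross the threshold where the flat fee (wrongly) looks like the better deal.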

I think this may hint at something broader about the eternal battle between "carrot" and "stick" philosophies. Managers often have strong views about whether they should be creating a hospitable environment in which workers can unleash their creativity and productivity, or creating a tightly controlled environment to prevent overspending and inefficiency, recognizing that the two are at least somewhat incompatible.  Those views may often rest more on personal philosophies than on being well-calibrated to the optimal strategy in a given setting.


Teaching resource using open data

I previously mentioned Christopher Madan’s article on open data in teaching.  He points to the Open Stats Lab at Trinity College, which I hadn’t heard about before. They create statistics exercises, suitable for an undergrad statistics or psychological research methods course, based on open data from papers published in Psychological Science. Each exercise has the original paper, a dataset, and a brief (1-2 page) description of the analysis to be done.  They have multiple examples for each topic, including significance testing, correlation, regression and ANOVA.

It's very nicely done, and the contexts of the analyses (i.e. the research questions from the original papers) are interesting, which I think helps make it engaging for students. There's flexibility in the exercises for the students to think about which variables to use, and to formulate their own interpretation. My one quibble is that the writeups seem aimed at the student; it might be useful to have a separate document with a few pointers for instructors (e.g., a "teaching note"). In particular, if I were using one of the correlation exercises (particularly the example on Ebola and voting), I would want to have a thorough discussion in class of what we can and cannot conclude about causation (including how the original paper tried to address causation) and how to think about alternative explanations for the observed correlation.

Psychological Science, the source of the data, has a voluntary data disclosure policy, offering "badges" to published articles that make their data public.  This is a relatively gentle "nudge" towards open data, but it seems to be having an effect.  Nearly a quarter of papers after the policy was instituted made their data public, compared to 3% before. If the Open Stats Lab's efforts (and others like it) are successful, they provide a nice additional incentive for authors to make their data public. After all, who wouldn't want their research becoming a "textbook example" used in statistics classes?