In October of 2020, President Trump nominated Amy Barrett to the Supreme Court of the United States (SCOTUS) to replace the late Justice Ruth Bader Ginsburg. Justice Barrett was confirmed by a Republican-controlled Senate on a near party-line vote, following days of acrimonious inter-party debate, with Senator Warren calling the confirmation process "illegitimate."
Let's put aside the many political and social angles to this story and consider whether there exists a better process of selecting justices than the current one. As a gross simplification, and thereby conveniently dumping political or legal hurdles, let's ask what type of individual should be nominated to the court conditional on an existing set of justices. Our end goal should be to improve the decisions made by the court and to select justices accordingly.
While this question is admittedly fanciful, the composition and function of SCOTUS strike familiar notes with behavioral scientists who study the "wisdom of crowds." SCOTUS is a group of individuals that collectively makes important decisions. These decisions are made via a simple majority rule algorithm, albeit with a degree of shared deliberation. This suggests the following question: Can general principles from the wisdom of the crowd literature aid us in selecting new justices, with the goal being a "wiser" SCOTUS?
Unfortunately, the answer may be...not really. That isn't to say we shouldn't value a wiser SCOTUS, but rather than illuminating a method for improving its decision making, this line of reasoning uncovers limitations of the work on crowd wisdom.
The "wisdom of the crowds" phenomenon occurs when a combination of individuals' judgments is more accurate than the judgments made by the individuals themselves (e.g., Surowiecki, 2004). Behavioral scientists have demonstrated that this effect is quite general, extending from successfully estimating the weight of an ox at a fair (Galton, 1907) to making predictions about myriad world events (Tetlock et al., 2014). The effect is robust across many different types of aggregation methods, including complex prediction markets and simple averages of individual judgments (e.g., Atanasov et al., 2017).
Within this literature, two major themes have emerged that help explain what makes a crowd wise. First, crowd wisdom tends to improve with increased diversity of member judgments (Davis-Stober, Budescu, Dana, & Broomell, 2014; Hong & Page, 2004). Prior work has demonstrated that experts who make similar judgments may draw upon similar sources of information (Broomell & Budescu, 2009). This suggests that more diverse groups have more perspectives, hence more information, at their disposal, thereby allowing the group to make robust decisions in highly variable environments. The other major theme is, not surprisingly, the importance of accuracy. More accurate group members often result in better group performance. Tetlock and Gardner (2016) found that groups composed of highly trained and motivated forecasters, termed superforecasters, outperformed less highly trained groups.
Let's start by examining SCOTUS under the diversity theme. To be clear, by "diversity," I refer to systematic differences in court rulings, not the background, ethnicity, or gender of the justices. To make things a bit more concrete, let's consider the court from the 1999/2000 term to the 2013/2014 term. The New York Times (July 3, 2014) published a table of the percentage of times that each justice ruled in agreement with every other justice during this time period. Justice Ginsburg ruled in agreement with Justice Kagan 93% of the time. Under the diversity theme, Justice Kagan doesn't appear to add much "information" above and beyond that of Justice Ginsburg. A cursory application of the diversity theme would suggest replacing Justice Ginsburg with someone who would agree less often with all other justices, resulting in a more diverse, wiser court.
This is where our reasoning breaks down and we run headlong into a major limitation of the crowd wisdom literature. Quantitative arguments demonstrating that diversity improves crowd wisdom depend upon a clear notion of accuracy, and we cannot establish the benefit of diversity without it. Put a bit more formally, in order to demonstrate mathematically that a greater diversity of judgments results in a more accurate group, we need to specify a loss function, i.e., a metric of inaccuracy, such as the average squared error between group predictions and the true value being estimated. The diversity and accuracy themes are inextricably linked in this way.
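To see how tightly the two themes are linked, here is a minimal Python sketch of a well-known identity for simple averages under squared-error loss: the crowd's squared error equals the average individual squared error minus the diversity (variance) of the judgments. The function name and the guesses are hypothetical, chosen only for illustration. Note that neither error term can be computed without a true value to compare against.

```python
def crowd_error_decomposition(judgments, truth):
    """Return (crowd error, average individual error, diversity) under squared-error loss."""
    n = len(judgments)
    crowd = sum(judgments) / n  # the crowd's judgment is the simple average
    crowd_error = (crowd - truth) ** 2
    avg_individual_error = sum((x - truth) ** 2 for x in judgments) / n
    diversity = sum((x - crowd) ** 2 for x in judgments) / n  # spread around the crowd average
    return crowd_error, avg_individual_error, diversity


# Hypothetical guesses of an ox's weight in pounds, with an assumed true weight.
guesses = [1100.0, 1250.0, 1180.0, 1300.0]
truth = 1200.0

crowd_error, avg_error, diversity = crowd_error_decomposition(guesses, truth)
print(f"crowd error:              {crowd_error:.2f}")
print(f"average individual error: {avg_error:.2f}")
print(f"diversity:                {diversity:.2f}")
print(f"avg error - diversity:    {avg_error - diversity:.2f}")
```

Holding average individual error fixed, more diversity means a smaller crowd error, but the whole calculation starts from `truth`.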
To better make this point, consider how the objectives of SCOTUS differ from those of a traditional courtroom jury. Classic crowd wisdom arguments, such as the Condorcet Jury Theorem (CJT), show that even relatively imperfect individuals, aggregated using majority rule, can still make good group decisions. The simplest version of the CJT states that if each individual in a group votes independently and is better than chance at predicting a binary outcome, then increasing the size of the group will likewise increase the probability of the group making the correct choice. The proof of the CJT requires clear definitions of correct and incorrect decisions. Applied to a jury, we speak of the guilt or innocence of a defendant, i.e., they did or did not commit the crime in question. This is a ground truth, which allows us to define a loss function, such as the probability of the group arriving at the incorrect answer.
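The simplest version of the CJT is easy to compute directly. The sketch below assumes independent voters who are each correct with the same probability `p`, and an odd group size so that ties cannot occur. The function name and the value `p = 0.6` are illustrative assumptions, not figures from any of the cited work.

```python
from math import comb


def majority_correct_probability(n, p):
    """Probability that a simple majority of n independent voters is correct.

    Each voter is assumed to be correct with the same probability p.
    """
    if n % 2 == 0:
        raise ValueError("use an odd group size so that ties cannot occur")
    needed = n // 2 + 1  # smallest number of correct votes that forms a majority
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(needed, n + 1))


# Illustrative assumption: each voter is right 60% of the time.
p = 0.6
group_sizes = (1, 3, 9, 25, 101)
probabilities = [majority_correct_probability(n, p) for n in group_sizes]

for n, prob in zip(group_sizes, probabilities):
    print(f"group of {n:>3}: P(majority correct) = {prob:.3f}")
```

The quantity being computed, the probability that the majority is correct, only makes sense when "correct" is defined for every case.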
SCOTUS decisions do not have a ground truth in this way. Their purpose is to determine whether a law is, or is not, in accordance with the U.S. Constitution. As with a jury verdict, this is a binary decision, but it differs in that there is no objective true value upon which to define a correct or incorrect judgment. There are multiple perspectives on interpreting the Constitution (e.g., originalism versus pragmatism), but there is no clear method, or metric, for determining a correct one. One could argue that rulings are correct or incorrect to the extent that they result in a net utility or dis-utility for the public good; but, at best, this could only be determined via a wide historical lens (e.g., Dred Scott v. Sandford in 1857), and perhaps never at all for most cases.
This leaves us in a difficult position. Without a clear definition of accuracy, standard diversity arguments are rendered moot. Should more moderates be appointed and confirmed to the court? This would likely result in rulings reflecting a wider range of perspectives, but without a definition of accuracy it is impossible to say if this is better or worse for the country.
Things become even more complicated with the possibility of court packing, i.e., extending the size of SCOTUS beyond nine justices. Common wisdom of the crowd principles suggest that a larger crowd leads to improved judgments. A larger court also has greater opportunity for diversity among its members. Yet, even if we had a definition of accuracy, recent work generalizing CJT arguments by Galesic et al. (2018) demonstrates that the accuracy of groups using majority rule can be non-monotonic in size depending upon the difficulty of the task. In other words, there is no guarantee that the crowd will become wiser as more justices are added; in fact, it may become less so.
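A toy calculation shows how accuracy under majority rule can rise and then fall with group size. This is a sketch under stated assumptions and not the model from Galesic et al. (2018): half of the cases are assumed "easy" (each member correct with probability 0.9) and half "hard" (each member correct with probability 0.45, slightly worse than chance). All names and parameter values here are hypothetical.

```python
from math import comb


def majority_accuracy(n, p):
    """Probability that a majority of n independent voters (n odd) is correct."""
    needed = n // 2 + 1
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(needed, n + 1))


def mixed_task_accuracy(n, share_easy, p_easy, p_hard):
    """Expected majority accuracy over a mix of easy and hard cases."""
    easy = majority_accuracy(n, p_easy)
    hard = majority_accuracy(n, p_hard)
    return share_easy * easy + (1 - share_easy) * hard


# Hypothetical parameters: half the cases easy, half hard.
sizes = (1, 3, 5, 7, 9, 15)
accuracies = {n: mixed_task_accuracy(n, 0.5, 0.9, 0.45) for n in sizes}

for n, acc in accuracies.items():
    print(f"group of {n:>2}: expected accuracy = {acc:.4f}")
```

With these made-up parameters, a five-member group does better than both a single judge and a nine-member group. Larger groups quickly max out on the easy cases while majority rule amplifies their below-chance performance on the hard ones.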
So what is there to do? The appointment of justices is likely to remain a highly political affair, framed around ideological perspectives. Diversity may play a clearer role when considering whether the court is representative. While not much easier to define than the "correctness" of a decision, we can consider whether the court is representative of the U.S. population. For example, including Justice Barrett, six of the nine justices are practicing Catholics, while only 20% of all U.S. adults are estimated to be.
The work on crowd wisdom most applicable to our larger question may be that of Mueller-Trede et al. (2018). They present a theoretical model of crowd wisdom applied to "matters of taste," which include concepts such as movie, dining, and musical preferences. Under their framework, no universal truths are assumed to hold across group members (e.g., that all individuals are Lady Gaga fans). Instead, each individual's personal taste becomes a ground truth upon which to define accuracy. Mueller-Trede et al. found that an individual's personal preferences could be accurately predicted by aggregating the tastes of other group members. The biggest gains in prediction came from aggregating members whose tastes were broadly similar to the individual's, yet maximally diverse in other respects. Applied to our question, a Democratic president would do well by broadly seeking opinions from a very diverse membership of their own party when considering a set of judicial nominees. This would result in a more accurate appraisal of potential justices from the president's perspective, which is certainly something, but far from a complete solution.
-Clintin P. Davis-Stober
This post began as a series of conversations with Dr. Stephen Broomell, who is an associate professor at Carnegie Mellon University.