A friend of mine practicing in a large firm recently asked me about comparing documents for similarity. While his interests are from a transactional standpoint, the same methods can be used to compare documents within or between cases. To continue with the amicus brief theme from the last post, I thought it would be interesting to look at the similarity between amicus briefs in a case where many such briefs were filed. For this purpose I chose the Texas redistricting case Evenwel v. Abbott (No. 14-940). There were 33 amicus briefs filed in the case after the Court noted probable jurisdiction.
The previous post on amicus briefs provides a sense of their potential effect on the Court before the Justices come to their actual decisions in cases. One of the main goals of amicus briefs is to provide the Justices with information about the policy implications of cases and their potential externalities for larger swaths of society. If this is the case, then the Justices and their clerks are only incentivized to read more amicus briefs if these briefs provide important additional information. This post presents ways that we can test the likeness between briefs (or documents generally) to see if they are presenting similar information.
For this analysis I compare two methods for measuring similarity and show how they can be used in tandem. The first is cosine similarity; the second is the actual overlapping language between documents, which I measure with a program called WCopyfind. Cosine similarity is a measure based on the frequency of terms in two documents that ranges from 0 to 1, with 1 equating to identical documents (Oldfather et al. (2012) provide a more detailed explanation of how cosine similarity is derived. One simplification I made in my explanation was describing similarity between documents rather than between word vectors, which are the technical unit of measurement). An important distinction between the two methods is that cosine similarity disregards word order entirely when comparing term frequencies, while WCopyfind only marks overlapping language with the same or similar word ordering in sentences.
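To make the cosine similarity idea concrete, here is a minimal Python sketch. It is not the RapidMiner process used for this post: it assumes raw word counts as the term weights and a simple regular-expression tokenizer, and the function names `term_counts` and `cosine_similarity` are my own for illustration. Other weighting schemes would give different scores.

```python
import math
import re
from collections import Counter


def term_counts(text):
    """Lower-case the text and count how often each word appears."""
    return Counter(re.findall(r"[a-z0-9']+", text.lower()))


def cosine_similarity(text_a, text_b):
    """Cosine of the angle between the two documents' word-count vectors."""
    a, b = term_counts(text_a), term_counts(text_b)
    # Only words that appear in both documents contribute to the dot product.
    dot = sum(a[term] * b[term] for term in a.keys() & b.keys())
    norm_a = math.sqrt(sum(count * count for count in a.values()))
    norm_b = math.sqrt(sum(count * count for count in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty document shares nothing with any other
    return dot / (norm_a * norm_b)
```

Because only counts are compared, two texts containing the same words in a different order score (up to floating-point rounding) 1, and two texts with no words in common score 0.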
I use RapidMiner, which is open-source software based in Java with a graphical interface, to derive cosine similarity. To measure overlapping language I use WCopyfind, which looks for shared language between texts (I set WCopyfind to mark overlapping phrases of six words or more that are at least 80% identical). To keep the text as authentic to the briefs as possible I used minimal pre-processing prior to comparing documents (e.g. I transformed all text to lower case so that words with capital letters did not appear as different from the same words with lower-case letters). I then compared every pair of the 33 amicus briefs to derive similarity measures for each combination.
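Comparing every pair of 33 briefs means 33 × 32 / 2 = 528 comparisons. The sketch below shows one way to run such a pairwise comparison in Python while counting shared six-word phrases. It is only a rough stand-in for WCopyfind, not its actual algorithm: it assumes exact matches after lower-casing, whereas WCopyfind can tolerate imperfect matches (the 80% setting mentioned above). The names `shingles`, `shared_phrases`, and `compare_all` are hypothetical.

```python
import re
from itertools import combinations


def tokens(text):
    """Lower-case the text and split it into words."""
    return re.findall(r"[a-z0-9']+", text.lower())


def shingles(text, n=6):
    """Return the set of all runs of n consecutive words in the text."""
    words = tokens(text)
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def shared_phrases(text_a, text_b, n=6):
    """Exact n-word phrases that appear in both texts (no fuzzy matching)."""
    return shingles(text_a, n) & shingles(text_b, n)


def compare_all(briefs, n=6):
    """Map each unordered pair of brief names to its count of shared phrases.

    `briefs` is assumed to be a dict of {brief name: full text}.
    """
    return {
        (name_a, name_b): len(shared_phrases(briefs[name_a], briefs[name_b], n))
        for name_a, name_b in combinations(sorted(briefs), 2)
    }
```

Sorting the resulting dictionary by value would surface the pairs with the most shared language, which could then be read side by side.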
Cosine similarity provides insight not only into whether the briefs supporting each party present the same or similar information, but also whether the briefs supporting opposing parties present similar information. The histogram below displays a distribution of the cosine similarities between each pair of briefs.
The distribution is relatively normal and ranges from .32 to .86 with a mean value of .57. Delving into the actual briefs, it is not surprising that the most similar briefs according to cosine similarity were not the same as the most similar briefs according to WCopyfind. My expectation, which was confirmed by examining the similar briefs, was that cosine similarity would pick up briefs that discussed similar themes and looked to similar precedent without necessarily sharing cited text, while WCopyfind would locate briefs that shared cited passages.
According to the cosine similarity measure, the most similar pair of briefs were filed by the United States and the NAACP Legal Defense Fund. Both focus heavily on the Supreme Court’s precedent in Reynolds v. Sims, 377 U.S. 533 (1964) and the distinction between legislative districts drawn based on “eligible voters” and the “total population.” Although the two briefs do not share many cited passages (the few they do share include “Equal representation for equal numbers of people is a principle designed to prevent debasement of voting power and diminution of access to elected representatives” and “equal representation for equal numbers of people, without regard to race, sex, economic status, or place of residence within a State”), they both examine many of the same cases including: Yick Wo v. Hopkins, 118 U.S. 356, League of United Latin Am. Citizens (LULAC) v. Perry, 548 U.S. 399, Thornburg v. Gingles, 478 U.S. 30, Gomillion v. Lightfoot, 364 U.S. 339, Burns v. Richardson, 384 U.S. 73, 92-93 (1966), and Whitcomb v. Chavis, 403 U.S. 124, among others.
The most similar pair of briefs supporting opposing parties, according to the cosine similarity measure, were those of the United States supporting the appellees and the Center for Constitutional Jurisprudence supporting the appellants. Both briefs also confront the difference between eligible voting population and total population with respect to the Court’s Equal Protection jurisprudence. They cite several of the same cases including: Vieth v. Jubelirer, 541 U.S. 267, Gray v. Sanders, 372 U.S. 368, Wesberry v. Sanders, 376 U.S. 1, Gaffney v. Cummings, 412 U.S. 735, and Connor v. Finch, 431 U.S. 407, among others. Both take detailed looks at the principles behind the Court’s jurisprudence as well as those reflected in the text of the Constitution and the Amendments (particularly the 14th).
Based on WCopyfind’s measure of language overlap, the most similar briefs are for Eagle Forum Education & Legal Defense Fund supporting the appellants and the Mountain States Legal Foundation supporting the appellants (11% of the Mountain States brief matches with 12% of Eagle Forum’s brief). As expected, there were lengthier shared, cited passages between these two briefs than in the briefs previously examined. Two examples are: “This principle requires that, “when members of an elected body are chosen from separate districts, each district must be established on a basis that will insure, as far as is practicable, that equal numbers of voters can vote for proportionally equal numbers of officials.” Hadley v. Junior Coll. Dist. of Metro. Kansas City, Mo., 397 U.S. 50, 56 (1970)” and “The conception of political equality from the Declaration of Independence, to Lincoln’s Gettysburg Address, to the Fifteenth, Seventeenth, and Nineteenth Amendments can mean only one thing – one person, one vote.” In total, there are fifteen shared phrases of at least 80% similarity and at least twelve words between these briefs.
Taking a step back from this particular analysis, such methods could potentially help Justices and their clerks wade through the mass of amicus briefs they regularly receive so they can get a sense of the shared and unique information in them. Applying WCopyfind and cosine similarity together to the amicus briefs filed in Evenwel, rather than using one or the other independently, allowed for a robust assessment that accounted for both shared citations and shared themes.
Addendum: If you are interested in the raw data I used for this or other posts you can contact me at email@example.com.