Testing for Local Continuity in Racial Animus

Choropleths, Regressions

This past spring I was tasked with writing a final paper for my Comparative Historical Economic Development course. In brainstorming, I started a casual fling with one idea that quickly escalated and led to long spring break dates at the library (collecting data, making maps). In a meeting with my professor back at Harvard, I realized that I had been consumed in the honeymoon phase of an idea. Taking off my rose-colored glasses, it became clear I had to end the tryst so I could focus on more promising paths to a paper…

In the end, I wrote about a different idea. Meanwhile, the original one lived in solitude in an inactive Dropbox folder, a digital dead end. While I never developed it into a paper, why not shape it into a blog post? After all, there were interesting datasets and choropleths to share. So, here I am, throwing it a bone! If only for the sake of intellectual closure…

Persecution Perpetuated → Animus Alive

I was immediately struck by an idea after reading the QJE article “Persecution Perpetuated: The Medieval Origins of Anti-Semitic Violence in Nazi Germany.” Co-authors Voigtländer and Voth use an incredible dataset of about 400 towns “where Jewish communities are documented for both the medieval period and interwar Germany.” They find local continuity in anti-Semitism over 600 years. Local continuity in this context means attacks on Jews were six times more likely in the 1920s in towns and cities that blamed and then murdered Jews for the Black Death (during 1348-50). Let me repeat myself: local continuity over SIX HUNDRED YEARS. (More than half a millennium.) The paper provides convincing empirical evidence that group-based persecution (anti-Semitism in this case) can meaningfully persist at the local level. History matters, the data scream.

After reading “Persecution Perpetuated,” I was curious if I could merge recently available data sources to test whether racial animus in the US would display similar local continuity. (Possible alliterative titles to pay homage to “Persecution Perpetuated” include: “Animus Alive” and “Malice Maintained.”) Specifically, what sprang to mind as a possible dependent variable was Seth Stephens-Davidowitz’s racial animus measure, which is based on Google search data. (I’ve always wanted to play around with Seth’s Google data, so I seized on a possible academic opportunity to do so.)

What does it mean to measure racial animus using Google data? Seth proxies for a geographic area’s racial animus by calculating the percent of its Google searches (2004-2007) that included the n-word or its plural. The Google data at its finest geographic level is available for US Designated Market Areas. (There are 210 such DMAs in the US.) Specifically, the “racially charged search rate” for DMA j = 100 * (search share in DMA j) / (maximum search share across all DMAs), where a DMA’s search share = [n-word Google searches / total Google searches] in that DMA. The DMA with the highest share therefore scores 100. The necessary underlying assumption is that racial animus makes one more likely to make an n-word Google search. (It does not have to be the case that “every individual using the term harbors racial animus, nor that every individual harboring racial animus will use this term on Google.”) (If you want to know more about this measure, read Seth’s NYTimes piece about his academic work as well.)
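To make the normalization concrete, here is a minimal Python sketch of the formula as described above. It is an illustration only: the function name, the three area labels, and the search counts are all made up, and this is not Seth's code or data.

```python
def racially_charged_search_rate(term_searches, total_searches):
    """Scale each area's search share so that the highest-share area scores 100."""
    # Share of searches containing the term, per area.
    shares = {dma: term_searches[dma] / total_searches[dma] for dma in term_searches}
    max_share = max(shares.values())
    # Express every share relative to the maximum share.
    return {dma: 100 * (share / max_share) for dma, share in shares.items()}


# Hypothetical counts for three made-up areas (NOT real Google data).
term = {"DMA A": 30, "DMA B": 10, "DMA C": 45}
total = {"DMA A": 100_000, "DMA B": 50_000, "DMA C": 100_000}

rates = racially_charged_search_rate(term, total)
for dma, rate in rates.items():
    print(dma, round(rate, 1))
```

Note that the measure is relative: a score of 50 means half the search share of the top-scoring area, not that 50% of searches include the term.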

Why is using Google search data attractive? Seth argues that survey data is unlikely to paint an accurate picture of racial animus because well, people lie. (Thus the title of his book, Everybody Lies, which I highly recommend if you’re interested in data-driven social science.) Meanwhile, “the conditions under which people [Google] search – online, likely alone, and not participating in an official survey – limit concern of social censoring.” In short, Google search data allows us to access a snapshot of attitudes or beliefs that might otherwise be inaccessible with traditional surveys.

While Google search is a new phenomenon in the grand scheme of history, I was curious if historical data related to racial animus could be a powerful predictor of these modern racially charged search patterns. That is, I sought out a relevant historical independent variable, which led me to Virginia Commonwealth University (VCU)’s project “Mapping the Second Ku Klux Klan, 1915-1940.” History Professor John Kneebone constructed a list of local KKK chapters (klaverns) using information from a large set of the group’s official publications. (More here.) The project makes visually explicit the widespread nature of the KKK; “Everywhere there was population, there was the Klan,” Kneebone explained. He then worked with digital librarians at VCU Libraries to map out the klaverns and make the raw location data publicly accessible.

With the VCU data in mind, I wondered: does the local historical KKK prevalence in 1915-1940 predict (through perpetuated racial animus) racially charged Google search rates in 2004-2007? There are some issues to note when using klavern location data:

  1. Ideally, I’d use data on the number of members by geographic areas. (Ignoring underlying population figures for a moment: Imagine there are 3 klaverns with 10 members each in DMA A. Meanwhile, there is 1 klavern with 100 members in DMA B. The Klan is more prevalent in terms of raw members in DMA B, but with only the location data in tow, DMA A would seem to have a more prevalent Klan presence.) Unfortunately, I am not aware of data on historical Klan membership by finer levels (DMA-level) of US geography.
  2. The location data are not necessarily complete. Such is the nature of collecting data from a different era. They are based on historical research and investigative work, but klavern locations are likely missing.

So, simply put, the question is: can I find empirical evidence of local continuity in American racial animus by merging historical and modern metrics? Will klavern prevalence per capita (1915-1940) meaningfully predict racially charged Google searches (2004-2007)?

Maps and regressions

First things first. Let’s map both klaverns/million and racially charged search rates by DMA. The variation in klaverns/million turned out to be so skewed that using the raw rate made the variation almost impossible to visually decipher. The use of log(klaverns/million) makes the geographic variation visually interpretable. (See below.)


While I interpreted the two metrics as proxies of racial animus at different times in history, they don’t obviously track through time and space visually. The possible story (that one meaningfully predicts the other) isn’t very convincing at this stage. But, it’s worth seeing the output from a simple regression. (I use the log of klaverns/million because the distribution of klaverns/million was heavily skewed, which made the relationship between the modern racially charged search rate and klaverns/million nonlinear. The log transformation also helps counter heteroskedasticity. I’m open to criticism/discussion on such log transformations.)


While the independent variable is statistically significant (star, star, star), the variation in log(klaverns/million) explains only 3-4% of the variation in racially charged Google search rates. In terms of magnitude, a 1% increase in klaverns/million is associated with a 0.02645 unit increase in the racially charged Google search rate. That is economically meaningless considering that the rates range from 25 to 155. This is also without controlling for any of the analogs of V&V’s covariates that would need to be included to make a convincing case for the robustness of any supposed relationship.
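For readers who want to see how a coefficient on a logged regressor translates into "per 1%" language, here is a small Python sketch. The regression output itself is not reproduced in this post, so the slope of 2.645 is an assumption: it is simply the value that would yield roughly 0.02645 per 1% under the usual slope/100 rule of thumb for a model of the form rate = a + b * ln(klaverns/million).

```python
import math


def change_in_y(slope, pct_increase_in_x):
    """Change in y implied by y = a + slope * ln(x) when x rises by the given percent."""
    return slope * math.log(1 + pct_increase_in_x / 100)


# ASSUMED slope (not taken from the regression table): 2.645 / 100 = 0.02645.
assumed_slope = 2.645

one_pct = change_in_y(assumed_slope, 1)      # exact effect of a 1% increase
doubling = change_in_y(assumed_slope, 100)   # effect of doubling klaverns/million

print(round(one_pct, 4), round(doubling, 2))
```

Under that assumed slope, even doubling klaverns/million would move the search rate by under 2 units on a scale that runs from 25 to 155, which is the sense in which the relationship is economically meaningless.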

In the end, the VCU location data aggregated up to totals for DMA areas are not a meaningful predictor of racial animus revealed by Google behavior. This could be for many reasons. It could be because klavern location totals are not accurate depictions of KKK prevalence (data reveals locations rather than member totals, data could be incomplete). Moreover, on a more philosophical level, does this project even test the persistence of racial animus?… or is it asking a question about how language squares with historical locations of institutions? I’ll leave that up to my academic counterparts in Political Theory. (CJ, you’re up.)

Why it wasn’t meant to be… a paper

Now, it is worth explaining why this idea was always predestined to stay in blog-land. For one, even an extremely powerful positive result wouldn’t “shift priors” (as economists say to one another): the result would seem obvious and so, who cares? Voigtländer and Voth’s paper was publishable due to the long-term scope (600 years!) and fine geographic level (towns, cities in Germany) of their results. The concept that attitudes could be perpetuated over time wasn’t the selling point; it was the fact that attitudes could be perpetuated for so long and in such a locally continuous way. If the geographies had been more coarse and their data hadn’t covered such a long period of time, the paper would not have landed in the QJE. Since I asked if 1915-1940 attitudes would predict 2004-2007 attitudes and I was limited to large designated market areas (due to the nature of the Google data), my findings were doomed to be unremarkable even if incredibly powerful (which… they were not).

Despite its inevitable end, I did learn a lot from this spring fling. On a qualitative level, I learned about the historical omnipresence of the KKK from an impressive digital humanities project. On a technical level, I learned how to map DMA areas in R (very tricky). And, on a philosophical level, I learned how to cleanly break up with a project idea. (Thanks for all the optimal stopping lessons, David.)

Data & code
  1. Seth’s racial animus data is here.
  2. VCU data on klaverns is here.
  3. My R notebook for this project is here.
  4. My Github repo for this project is here.

© Alexandra Albright and The Little Dataset That Could, 2018. Unauthorized use and/or duplication of this material without express and written permission from this blog’s author and/or owner is strictly prohibited. Excerpts, accompanying visuals, and links may be used, provided that full and clear credit is given to Alex Albright and The Little Dataset That Could with appropriate and specific direction to the original content.



Which U.S. State Performs Best in the New Yorker Caption Contest?


I wrote about this topic with Bob Mankoff for the New Yorker.

You can read the piece here

And you can see how the visuals were made here!

It builds off of my previous work on the New Yorker Caption contest. (New visuals, new data, and edits from real editors!) Many, many thanks to Bob for giving me access to troves of fascinating data as well as making great edits and alterations to this piece (including the addition of my new favorite phrase “nattering nabobs”).

Bonus cartoon of surprising relevance given Alaska’s success in terms of caption win rate:

Daily Cartoon for Tuesday, September 1st via The New Yorker. We figure Alaska’s wins and submissions to the contest will decline if it comes to this…


Code and raw data for replicating these choropleths are available at my NYer_Choropleths Github repo. An R Notebook on this is also available here. Also, thanks to Sarah Levine for using her QGIS knowledge to help me tame maps of the US.

© Alexandra Albright and The Little Dataset That Could, 2016. Unauthorized use and/or duplication of this material without express and written permission from this blog’s author and/or owner is strictly prohibited. Excerpts, accompanying visuals, and links may be used, provided that full and clear credit is given to Alex Albright and The Little Dataset That Could with appropriate and specific direction to the original content.

Geography of Humor: The Case of the New Yorker Caption Contest

Bar Charts, Choropleths, Scatter Plots

Update [9-23-15]: Also check out the newest work on this topic: Which U.S. State Performs Best in the New Yorker Caption Contest?


About 10 years ago The New Yorker began a weekly contest. It was not a contest of writing talents in colorful fiction nor of investigative prowess in journalism; instead, it was a contest of short and sweet humor. Write a caption for a cartoon, they said. It’ll be fun, they said. This will help our circulation, the marketing department said. Individuals like me, who back at age 12 in 2005 believed The New Yorker was the adult’s version of Calvin and Hobbes that they most enjoyed in doctors’ waiting rooms, embraced the new tradition with open arms.

Now, 10 years later, approximately 5,372 captions are submitted each week, and just a single winner is picked. Upon recently trying my own hand (and failing unsurprisingly given the sheer magnitude of competing captions) at the contest, I wondered, who are these winners? In particular, since The New Yorker always prints the name and place of residence of the caption contest winner, I wondered, what’s the geographical distribution of these winners? 

In order to answer this question, I used my prized subscriber access to the online Caption Contest archive. This archive features the winning caption for each week’s cartoon (along with two other finalist captions) and the name/place of residence of the caption creator. (The archives also feature all other submitted captions–which is super interesting from a machine learning perspective, but I don’t focus on that in this piece.) So, I snagged the geographic information on the past 10 years of winners and went with it.

The basics

For this analysis, I collected information on the first 466 caption contests–that is, all contests up to and including the following:

The New Yorker Caption Contest #466

Before getting into the meat of this discussion, it is worth noting the structure of the contest as well as the range of eligible participants. See this quick explanation from The New Yorker:

Each week, we provide a cartoon in need of a caption. You, the reader, submit your caption below, we choose three finalists, and you vote for your favorite… Any resident of the United States, Canada (except Quebec), Australia, the United Kingdom, or the Republic of Ireland age eighteen or older can enter or vote.

Thus, the contest consists of two rounds: one in which the magazine staff sift through thousands of submissions and pick just three, and one in which the public votes on the ultimate winner out of the three finalists. Furthermore, the contest is open to residents outside the United States–a fact that is easy to forget when considering how rarely individuals from other countries actually win. Out of 466 caption contest winners, only 12 are from outside the United States–2 from Australia, 2 from British Columbia (Canada), and 8 from Ontario (Canada). Though they are allowed to compete, no one from the United Kingdom or the Republic of Ireland has ever won. In short, 97.42% of caption contest winners are from the U.S.
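As a quick arithmetic check, the U.S. share can be recomputed directly from the winner counts listed above. This is a small Python sketch using only those counts; the variable names are mine.

```python
total_contests = 466

# Non-U.S. winners, as listed in the paragraph above.
non_us_winners = {"Australia": 2, "British Columbia": 2, "Ontario": 8}

non_us_total = sum(non_us_winners.values())
us_share_pct = 100 * (total_contests - non_us_total) / total_contests

print(non_us_total, round(us_share_pct, 2))
```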

Moving to the city level of geography, it is unsurprising that The New Yorker Caption Contest is dominated by, well, New Yorkers. New York City has 62 wins, meaning New Yorkers have won 13.3% of the contests. To fully understand how dominant this makes New York, consider the fact that the city with the next most caption contest wins is Los Angeles with a mere 18 wins (3.9% of contests). The graphic below depicting the top 8 caption contest cities further highlights New York’s exceptionalism:


Source: New Yorker Caption Contest Archive; Tool: ggplot2 package in R.

The geographic distribution: a state-level analysis

While both the country- and city-level results are dominated by the obvious contenders (the United States and New York City respectively), the state-level analysis is much more compelling.

In this vein, the first question to address is: which states win the most contests? To answer this, I present the following choropleth in which the states are divided into five categories of equal size (each category contains 10 states) based on the number of contests won. (This method uses quantiles to classify the data into ranges; there are other methods one could use as well.) Visualizing the data in this way allows us to quickly perceive areas of the country that are caption-winner-rich as well as caption-winner-sparse:


Source: New Yorker Caption Contest Archive; Tool: choroplethr package in R.
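The quantile classification behind a map like this can be sketched in a few lines. The Python below is an illustration only, not the choroplethr internals: the function name is mine, the ten states and their win counts are hypothetical, and ties are broken by input order.

```python
from collections import Counter


def quantile_classes(values_by_state, n_classes=5):
    """Assign each state a class 0..n_classes-1 so classes hold (nearly) equal numbers of states."""
    # Rank states from fewest to most wins; sorted() is stable, so ties keep input order.
    ranked = sorted(values_by_state, key=values_by_state.get)
    n = len(ranked)
    return {state: (rank * n_classes) // n for rank, state in enumerate(ranked)}


# HYPOTHETICAL win counts for ten made-up states (not the contest data).
hypothetical_wins = {
    "State A": 0, "State B": 1, "State C": 1, "State D": 2, "State E": 3,
    "State F": 5, "State G": 8, "State H": 14, "State I": 75, "State J": 85,
}

classes = quantile_classes(hypothetical_wins)
class_sizes = Counter(classes.values())
print(class_sizes)
```

Because classes are defined by rank rather than value, the top class can span a very wide range of win totals, which is exactly the drawback that shows up with skewed counts.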

This visualization illustrates that the most successful caption contest states are either east coast or west coast states, with the exception of Illinois (due to Chicago’s 16 wins). The most barren section of the country is unsurprisingly the center of the country. (In particular, Idaho, Kansas, North/South Dakota, West Virginia, and Wyoming have never boasted any caption contest winners.)

While using quantiles to classify the data into ranges is helpful, it gives us an overly broad last category–the darkest blue class contains states with win totals ranging from 14 to 85. If we want to zoom in and compare the states within this one category, we can pivot to a simple bar chart for precision’s sake. The following graph presents the number of contests won among the top ten states:


Source: New Yorker Caption Contest Archive; Tool: ggplot2 package in R.

New York and California are clearly the most dominant states with 85 and 75 wins respectively, which is to be expected considering how populous the two are. Taking each state’s population into account would yield a better metric of how well each state does in winning the contest. (It would also be interesting to take into account the number of The New Yorker subscribers by state, but I haven’t been able to get hold of that data yet, so I am putting a pin in that idea for now.)

Therefore, I normalize these counts by creating a new metric: number of caption contests won per one million state residents. In making this change, the map colors shift noticeably. See the following choropleth for the new results:


Source: New Yorker Caption Contest Archive; Tool: choroplethr package in R.
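The per-capita metric is just wins divided by population in millions. Here is a minimal Python sketch using the two win totals quoted earlier; the population figures are rough assumptions of mine (in millions), not the exact figures used for the map.

```python
def wins_per_million(wins, population):
    """Contest wins per one million residents."""
    return wins / population * 1_000_000


# Win totals come from the text; populations are ROUGH ASSUMED values.
states = {
    "New York": {"wins": 85, "population": 19_700_000},
    "California": {"wins": 75, "population": 38_800_000},
}

rates_per_million = {
    name: wins_per_million(s["wins"], s["population"]) for name, s in states.items()
}
for name, rate in rates_per_million.items():
    print(name, round(rate, 2))
```

Under these assumed populations California's rate is less than half of New York's, even though their raw win totals are close.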

Again, the last category is the one with the broadest range (2.425 to 7.991 wins per million residents). So, once more, it is worth moving away from cool, colorful choropleths and towards the classical bar chart. In comparing the below bar graph with the previous one, one can quickly see the difference made in normalizing by population:


Source: New Yorker Caption Contest Archive; Tool: ggplot2 package in R.

For one, the once dominant New York falls behind new arrivals Vermont and Rhode Island, while the similarly previously dominant California is nowhere to be seen! Other states that also lose their place among the top ten are: Illinois, New Jersey, and Pennsylvania. Meanwhile, the four new states in this updated top ten are: Alaska and New Hampshire as well as the previously mentioned Rhode Island and Vermont. Among these four new arrivals, Vermont stands clearly ahead of the pack with approximately 8 caption contest wins per million residents.

The high counts per million for states like Vermont and Rhode Island suggest a relationship that many were likely considering throughout this entire article–isn’t The New Yorker for liberals? Accordingly, isn’t there a relationship between wins per million and liberalness?

Those damn liberal, nonreligious states

Once we have normalized caption contest wins by population, we still have not completely normalized states by their likelihood of winning the contest. This is due to the fact that there is a distinct relationship between wins per million residents and evident political markers of The-New-Yorker-types. In particular, consider Gallup’s State of the States measures of “% liberal” and “% nonreligious.” First, I present the strong association between liberal percentages and wins per million:


Source: New Yorker Caption Contest Archive; Tool: ggplot2 package in R.

The above is a scatterplot in which each point is a state (see the small state abbreviation labels) and the blue line is a linear regression line (the shaded area is the 95% confidence region) fit to the data. The conclusion is unmistakable: states that are more liberal tend to win more contests per million residents. Specifically, the equation for the linear regression line is:

wins_per_million = -3.13 + 0.22(pct_liberal)

This means that a 1 percentage point increase in the liberal percentage is associated with an increase of 0.22 wins per million residents. The R^2 (in this case, the same as the squared correlation coefficient r^2 between wins_per_million and pct_liberal, since there is just one explanatory variable in the regression) is 0.364, meaning that 36.4% of the variation in the response variable is explained by this simple model. (The standard error on the coefficient attached to pct_liberal is only 0.04, meaning the coefficient is easily statistically significant at the 0.1% level.)
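To see what these numbers imply, here is a short Python sketch that plugs values into the fitted line and computes the t-statistic from the reported coefficient and standard error. The coefficients are the ones quoted above; the two example liberal percentages (20 and 30) are arbitrary inputs, not particular states.

```python
import math

# Values reported in the text for the fitted line.
INTERCEPT = -3.13
SLOPE = 0.22
SLOPE_SE = 0.04
R_SQUARED = 0.364


def predicted_wins_per_million(pct_liberal):
    """Fitted value from wins_per_million = -3.13 + 0.22 * pct_liberal."""
    return INTERCEPT + SLOPE * pct_liberal


# Arbitrary example inputs, 10 percentage points apart.
low = predicted_wins_per_million(20)
high = predicted_wins_per_million(30)

t_stat = SLOPE / SLOPE_SE            # coefficient divided by its standard error
abs_corr = math.sqrt(R_SQUARED)      # |r| in a one-regressor model

print(round(low, 2), round(high, 2), round(t_stat, 1), round(abs_corr, 2))
```

A t-statistic of 5.5 is well beyond the two-sided 0.1% critical value of about 3.5 for a sample of roughly 50 states, which is why the coefficient clears that threshold so comfortably.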

Also strong is the association between nonreligious percentages and wins per million, presented in the graph below:


Source: New Yorker Caption Contest Archive; Tool: ggplot2 package in R.

This plot is very similar to the previous one, most definitely because states with high liberal percentages are likely to have high nonreligious percentages as well. The linear regression line that is fit for this data is:

wins_per_million = -1.37 + 0.09(pct_nonreligious)

The relevant conceptual interpretation is that a 1 percentage point increase in the nonreligious percentage is associated with an increase of 0.09 wins per million residents. The R^2 for this model is 0.316, so 31.6% of the variation in the response variable is explained by the model. (Again, the coefficient of interest, this time the one attached to pct_nonreligious, is statistically significant at the 0.1% level.)

These two graphs are simple illustrations of the statistically significant relationships between wins per million and two political markers of The New Yorker readership. In order to better understand the relationship between these variables, one must return to the structure of the contest…

The mechanism behind the success of liberal, nonreligious states

The caption contest is broken chronologically into three phases: (1) individuals submit captions, (2) three captions are selected as finalists by magazine staff, and (3) the public votes on their favorite caption.

It seems most likely that the mechanism behind the success of liberal, nonreligious states lies in the first phase. In other words, liberal, nonreligious people are more likely to read The New Yorker and/or follow the caption contest. (Its humor is unlikely to resonate with the intensely religious socially conservative.) Therefore, the tendency towards wins for liberal, nonreligious states is mostly a question of who chooses to participate.

It could also be the case that at least a part of the mechanism behind these states’ successes lies in phases (2) or (3). If a piece of this mechanism were snuggled up in phase 2, that would mean The New Yorker staff is inclined, due to an innate sense of liberal humor, to pick captions from specific states. (Yet, since most submissions are probably already from liberals, this seems unlikely–though maybe the reverse happens as the magazine attempts to foster geographic diversity by selecting captions from a broader range of locations? I don’t think that’s part of the caption selection process, but it could be relevant to the aforementioned mechanism if it were.) If the mechanism were instead hidden within the third phase, this would mean voters tend to vote for captions created by people from more nonreligious and liberal states in the country. One interesting element to note is that voters can see the place of residence of a caption creator–though I highly doubt this influences people’s voting choices, it is possible that regional favoritism is a factor (e.g., New Yorkers like to see other New Yorkers win and, therefore, the large number of New Yorker voters pushes the New Yorker caption submissions to win).

In order to better investigate the mechanism behind the success of nonreligious, liberal states, one needs access to the geographic data of all submissions…or, at least the data on the number of subscribers per state. Though one can submit to the contest without a subscription, the latter measure could still be used as a credible proxy for the former since the number of people who submit to the contest in a state is likely proportional to the number of subscribers in the state.

A thank you note

Thanks to my family for giving me a subscription to The New Yorker this past holiday season in a desperate attempt to help me become less of a philistine. My sincerest apologies that I have focused more on the cartoons than all those chunks of words that mark space in between.


I’ll be sure to actually call you all up if I ever win–good news: if I enter every contest for the next ten years, I’ll have approximately a 10% chance of winning by chance alone.
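That back-of-the-envelope figure can be checked with a few lines of Python. The sketch assumes one entry per week for 520 weeks, a constant 5,372 submissions per contest (the figure quoted earlier), and that every submission is equally likely to win, which is a simplification.

```python
submissions_per_week = 5372   # approximate weekly submissions quoted earlier
weeks = 10 * 52               # one entry per week for ten years

# Probability of losing every single week, assuming each caption is equally likely to win.
p_never_win = (1 - 1 / submissions_per_week) ** weeks
p_at_least_one_win = 1 - p_never_win

print(round(100 * p_at_least_one_win, 1))
```

The result lands a little under 10%, close to the simple approximation 520 / 5372.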

Me & Bob Mankoff! (Cartoon Editor of The New Yorker and creator of the above cartoon)

Future work
  • Make maps interactive (using Mapbox/TileMill/QGIS and the like) and embed into page with the help of Sarah Michael Levine!
  • Look at captions per number of subscribers in a state (even though you can submit even if you’re not a subscriber–I assume submissions from a state would be proportional to the number of subscribers)
  • See if it’s possible to collect state data on all submitted captions in order to test hypotheses related to the mechanism behind the success of liberal, nonreligious states
  • Create predictive model with wins per million as the dependent variable
    • Independent variables could include proximity to New York or a dummy variable based on if the state is in northeast, income per capita, percent liberal, percent nonreligious, (use logs?) etc.
      • However, the issue with many of these is that there is likely to be multicollinearity since so many of these independent variables are highly correlated…Food for thought
        • In particular, it is not worthwhile to include both % liberal and % nonreligious in one regression (one loses statistical significance altogether and the other goes from the 0.1% level to the 5% level)

All data and R scripts needed to recreate all types of visualizations in this article (choropleths, bar charts, and scatterplots with linear regression lines) are available on my “NewYorker” Github repo.
