Where My Girls At? (In The Sciences)

Line Charts, Scatter Plots
Intro

In the current educational landscape, there is a constant stream of calls to improve female representation in the sciences. However, the call to action is often framed within the aforementioned nebulous realm of “the sciences”—an umbrella term that ignores the distinct environments across the scientific disciplines. To better understand the true state of women in “the sciences,” we must investigate representation at the discipline level in the context of both undergraduate and doctoral education. As it turns out, National Science Foundation (NSF) open data provides the ability to do just that!

The NSF’s Report on Women, Minorities, and Persons with Disabilities in Science and Engineering includes raw numbers on both undergraduate and doctoral degrees earned by women and men across all science disciplines. With these figures in hand, it’s simple to generate measures of female representation within each field of study—that is, percentages of female degree earners. This NSF report spans the decade 2002–2012 and provides an immense amount of raw material to investigate.[1]
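(For concreteness, here is a minimal sketch of that computation in R—the data frame nsf and its columns discipline, year, level, female, and male are hypothetical stand-ins for the NSF tables, not the names in my actual scripts:)

    library(dplyr)

    # Convert raw degree counts into a representation measure
    nsf <- nsf %>%
      mutate(pct_female = 100 * female / (female + male))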

The static picture: 2012

First, we will zero in on the most recent year of data, 2012, and explicitly compare female representation within and across disciplines.[2]

fig1

The NSF groups science disciplines with similar focus (for example, atmospheric and ocean sciences both focus on environmental science) into classified parent categories. In order to observe the variation not only within each parent category but also across the more granular disciplines themselves, the above graph plots percentage female representation by discipline, with each discipline colored with respect to its NSF-classified parent category.
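(A ggplot2 sketch of this kind of plot, reusing the hypothetical nsf columns from above plus a parent_category column:)

    library(ggplot2)

    ggplot(subset(nsf, year == 2012),
           aes(x = pct_female,
               y = reorder(discipline, pct_female),
               color = parent_category)) +
      geom_point(size = 3) +      # one dot per discipline
      facet_wrap(~ level) +       # undergraduate vs. PhD panels
      labs(x = "% female degree earners", y = NULL,
           color = "NSF parent category")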

The variation within each parent category can be quite pronounced. In the earth, atmospheric, and ocean sciences, female undergraduate representation ranges from 36% (atmospheric sciences) to 47% (ocean sciences) of total graduates. Among PhD graduates, female representation ranges from 39% (atmospheric sciences) to 48% (ocean sciences). Meanwhile, female representation in the physical sciences has an undergraduate range from 19% (physics) to 47% (chemistry) and a PhD range from 20% (physics) to 39% (chemistry). However, social sciences has the largest spread of all with undergraduate female representation ranging from 30% (economics) to 71% (anthropology) and PhD representation ranging from 33% (economics) to 64% (anthropology).

In line with conventional wisdom, computer sciences and physics are overwhelmingly male (undergraduate and PhD female representation lingers around 20% for both). Other disciplines in which female representation notably lags include: economics, mathematics and statistics, astronomy, and atmospheric sciences. Possible explanations behind the low representation in such disciplines have been debated at length.

Interactions between “innate abilities,” mathematical content, and female representation

Relatively recently, in January 2015, an article in Science “hypothesize[d] that, across the academic spectrum, women are underrepresented in fields whose practitioners believe that raw, innate talent is the main requirement for success, because women are stereotyped as not possessing such talent.” While this explanation was compelling to many, another group of researchers quickly responded by showing that once measures of mathematical content were added into the proposed models, the measures of innate beliefs (based on surveys of faculty members) shed all their statistical significance. Thus, the latter researchers provided evidence that female representation across disciplines is instead associated with the discipline’s mathematical content “and that faculty beliefs about innate ability were irrelevant.”

However, this conclusion does not imply that stereotypical beliefs are unimportant to female representation in scientific disciplines—in fact, the same researchers argue that beliefs of teachers and parents of younger children can play a large role in silently herding women out of math-heavy fields by “becom[ing] part of the self-fulfilling belief systems of the children themselves from a very early age.” Thus, the conclusion only objects to the alleged discovery of a robust causal relationship between one type of belief, university/college faculty beliefs about innate ability, and female representation.

Despite differences, both assessments demonstrate a correlation between measures of innate capabilities and female representation that is most likely driven by (1) women being less likely than men to study math-intensive disciplines and (2) those in math-intensive fields being more likely to describe their capacities as innate.[3]

The second point should hardly be surprising to anyone who has been exposed to mathematical genius tropes—think of all those handsome janitors who write up proofs on chalkboards, their talents innate rather than learned. It is also incredibly consistent with the assumptions that underlie “the cult of genius” described by Professor Jordan Ellenberg in How Not to Be Wrong: The Power of Mathematical Thinking (p. 412):

The genius cult tells students it’s not worth doing mathematics unless you’re the best at mathematics, because those special few are the only ones whose contributions matter. We don’t treat any other subject that way! I’ve never heard a student say, “I like Hamlet, but I don’t really belong in AP English—that kid who sits in the front row knows all the plays, and he started reading Shakespeare when he was nine!”

In short, subjects that are highly mathematical are seen as more driven by innate abilities than are others. In fact, describing someone as a hard worker in mathematical fields is often seen as an implicit insult—an implication I very much understand as someone who has been regularly (usually affectionately) teased as a “try-hard” by many male peers.

The dynamic picture: 2002–2012

Math-intensive subjects are predominantly male in the static picture for the year 2012, but how has the gender balance changed over recent years (in these and all science disciplines)? To answer this question, we turn to a dynamic view of female representation over a recent decade by looking at NSF data for the entirety of 2002–2012.

fig2

The above graph plots the percentages of female degree earners in each science discipline for both the undergraduate and doctoral levels for each year from 2002 to 2012. The trends are remarkably varied, with overall changes in undergraduate female representation ranging from a decrease of 33.9% (computer sciences) to an increase of 24.4% (atmospheric sciences). Overall changes in doctoral representation ranged from a decline of 8.8% (linguistics) to a rise of 67.6% (astronomy). The following visual more concisely summarizes the overall percentage changes for the decade.

fig3
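(These overall changes are simply relative changes between the decade’s endpoints; a hedged sketch, again using the hypothetical nsf frame from above:)

    library(dplyr)

    decade_change <- nsf %>%
      group_by(discipline, level) %>%
      summarise(overall_change = 100 *
        (pct_female[year == 2012] - pct_female[year == 2002]) /
         pct_female[year == 2002])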

As this graph illustrates, there were many gains in female representation at the doctoral level between 2002 and 2012. All but three disciplines experienced increased female representation—seems promising, yes? However, substantial losses at the undergraduate level should yield some concern. Only six of the eighteen science disciplines experienced undergraduate gains in female representation over the decade.

The illustrated increases in representation at the doctoral level are likely extensions of gains at the undergraduate level from the previous years—gains that are now being eroded given the presented undergraduate trends. The depicted losses at the undergraduate level could very well lead to similar losses at the doctoral level in the coming decade, which would hamper the widely shared goal to tenure more female professors.

The change for computer sciences is especially important since it provides a basis for the vast, well-documented media and academic focus on women in the field. (Planet Money brought the decline in percentage of female computer science majors to the attention of many in 2014.) The discipline experienced a loss in female representation at the undergraduate level that was more than twice the size of that in any other subject, including physics (-15.6%), earth sciences (-12.2%), and economics (-11.9%).

While the previous discussion of innate talent and stereotype threat focused on math-intensive fields, a category computer sciences fall into, I would argue that this recent decade has seen the effect of those forces on a growing realm of code-intensive fields. The use of computer programming and statistical software has become a standard qualification for many topics in physics, statistics, economics, biology, astronomy, and other fields. In fact, completing degrees in these disciplines now virtually requires coding in some way, shape, or form.

For instance, in my experience, one nontrivial hurdle that stands between students and more advanced classes in statistics or economics is the time necessary to understand how to use software such as R and Stata. Even seemingly simple tasks in these two programs require some basic level of comfort with structuring commands—an understanding that is not taught in these classes, but rather mentioned as a quick and seemingly obvious sidebar. Despite my extensive coursework in economics and mathematics, I am quick to admit that I only became comfortable with Stata via independent learning in a summer research context, and with R via pursuing projects for this blog many months after college graduation.

The implications of coding’s expanding role in many strains of scientific research should not be underestimated. If women are not coding, they are not just missing from computer science—they will increasingly be missing from other disciplines which coding has seeped into.

The big picture: present–future

In other words, I would argue academia is currently faced with the issue of improving female representation in code-intensive fields.[4] As is true with math-intensive fields, the stereotypical beliefs of teachers and parents of younger children “become part of the self-fulfilling belief systems of the children themselves from a very early age” and discourage women from even attempting to enter code-intensive fields. These beliefs, when combined with Ellenberg’s described “cult of genius” (a mechanism that has long surrounded mathematics and now also pervades the atmosphere in computer science), are especially dangerous.

Given the small percentage of women in these fields at the undergraduate level, there is limited potential growth in female representation along the academic pipeline—that is, at the doctoral and professorial levels. While coding has opened up new, incredible directions for research in many of the sciences, its evolving importance also can yield gender imbalances due to the same dynamics that underlie underrepresentation in math-intensive fields.

Footnotes

[1] Unfortunately, we cannot extend this year range back before 2002 since earlier numbers were solely presented for broader discipline categories, or parent science categories—economics and anthropology would be grouped under the broader term “social sciences,” while astronomy and chemistry would be included under the term “physical sciences.”

[2] The NSF differentiates between science and engineering as the latter is often described as an application of the former in academia. While engineering displays an enormous gender imbalance in favor of men, I limit my discussion here to disciplines that fall under the NSF’s science category.

[3] The latter viewpoint does have some scientific backing. The paper “Nonlinear Psychometric Thresholds for Physics and Mathematics” supports the notion that while greater work ethic can compensate for lesser ability in many subjects, those below some threshold of mathematical capacities are very unlikely to succeed in mathematics and physics coursework.

[4] On a positive note, atmospheric sciences, which often involves complex climate modeling techniques, has experienced large gains in female representation at the undergraduate level.

Speaking of coding…

Check out my relevant Github repository for all data and R scripts necessary for reproducing these visuals.

Thank you to:

Ally Seidel for all the edits over the past few months! & members of NYC squad for listening to my ideas and debating terminology with me.


© Alexandra Albright and The Little Dataset That Could, 2016. Unauthorized use and/or duplication of this material without express and written permission from this blog’s author and/or owner is strictly prohibited. Excerpts, accompanying visuals, and links may be used, provided that full and clear credit is given to Alex Albright and The Little Dataset That Could with appropriate and specific direction to the original content.

The Curious Case of The Illinois Trump Delegates

Scatter Plots
Intro

This past Wednesday, after watching Hillary Clinton slow-motion strut into the Broad City universe and realizing that this election has successfully seeped into even the most intimate of personal rituals, I planned to go to sleep without thinking any more about the current presidential race. However, somewhere in between Ilana’s final “yas queen” and hitting the pillow, I saw David Wasserman’s FiveThirtyEight article “Trump Voters’ Aversion To Foreign-Sounding Names Cost Him Delegates.”

Like many readers, I was immediately drawn to the piece’s fundamentally ironic implication that Trump could have lost delegates in Illinois due to the very racial resentment that he espouses and even encourages among his supporters. The possibility that this could be more deeply investigated was an energizing idea, one that had already inspired Evan Soltas to do just that as well as make public his rich-in-applications-and-possibilities dataset. With this dataset in hand, I tried my hand at complementing the ideas from the Wasserman and Soltas articles by building some visual evidence. (Suffice it to say I did not end up going to sleep for a while.)

To contribute to the meaningful work that the two articles have completed, I will first quickly outline their scope and conclusions, and then present the visuals I’ve built using Soltas’ publicly available data. Consider this a politically timely exercise in speedy R scripting!

Wasserman’s FiveThirtyEight piece & Soltas’ blog post

In the original article of interest, Wasserman discusses the noteworthy Illinois Republican primary. He explains that,

Illinois Republicans hold a convoluted “loophole” primary: The statewide primary winner earns 15 delegates, but the state’s other 54 delegates are elected directly on the ballot, with three at stake in each of the state’s 18 congressional districts. Each campaign files slates of relatively unknown supporters to run for delegate slots, and each would-be delegate’s presidential preference is listed beside his or her name.

Given that the delegates are “relatively unknown,” one would assume that delegates in the same district who list the same presidential preference would earn similar numbers of votes. However, surprisingly, Wasserman found that this did not seem to be the case for Trump delegates. In fact, there is a striking pattern in the Illinois districts with the 12 highest vote differentials: “[i]n all 12 cases, the highest vote-getting candidate had a common, Anglo-sounding name” while “a majority of the trailing candidates had first or last names most commonly associated with Asian, Hispanic or African-American heritages.” These findings, while admittedly informal, strongly suggest that Trump supporters are racially biased in their delegate voting behaviors.

Soltas jumps into this discussion by first creating a dataset on all 458 people who ran for Illinois Republican delegate spots. He merges data on the individuals’ names, districts, and candidate representation with a variable that could be described as a measure of perceived whiteness—the non-Hispanic white percentage of the individual’s last name, as determined from 2000 US Census data. The inclusion of this variable is what makes the dataset so exciting (!!!) since, as Soltas explains, it gives us an “objective measure to test the phenomenon Wasserman discovered.”

The article goes on to confirm the legitimacy of Wasserman’s hypothesis. In short, “Trump delegates won significantly more votes when they had ‘whiter’ last names relative to other delegates in their district,” and this type of effect does not exist for the other Republican candidates.

Visual evidence time

I now present a few visuals I generated using the aforementioned dataset to see Soltas’ conclusions for myself. First things first, it’s important to note that some grand underlying mechanism does not jump out at you when you simply look at the association between perceived whiteness and vote percentage for all of Trump’s Illinois delegates:

fig1

The above graph does not suggest any significant relationship between these two numbers attached to each individual delegate. This is because the plot shows delegates across all different districts, which will vote for Trump at different levels, but compares their absolute variable levels. What we actually care about is comparing voting percentages within the same district, but across different individuals who all represent the same presidential hopeful. In other words, we need to think about the delegates relative to their district-level context. To do this, I calculate vote percentages and whiteness measures relative to the district: the percentage point difference between a Trump delegate’s vote|whiteness percentage and the average Trump delegate vote|whiteness percentage in that district. (Suggestions welcome on different ways of doing this for visualization’s sake!)
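In dplyr terms, this amounts to demeaning both variables within districts; a minimal sketch, assuming a hypothetical trump_delegates data frame with columns district, vote_pct, and white_pct:

    library(dplyr)

    trump_rel <- trump_delegates %>%
      group_by(district) %>%
      mutate(rel_votes     = vote_pct  - mean(vote_pct),    # pp above/below district mean
             rel_whiteness = white_pct - mean(white_pct)) %>%
      ungroup()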

fig2

Now that we are measuring these variables (vote percentage and whiteness measure) relative to the district, there is a statistically significant association beyond even the 0.1% level. (The simple linear regression Y~X in this case yields a t-statistic of 5.4!) In the end, the interpretation of the simplistic linear regression is that a 10 percentage point increase in a Trump delegate’s perceived whiteness relative to the district yields a 0.12 percentage point increase in the delegate’s vote percentage relative to the district. (I’m curious if people think there is a better way to take district levels into account for these visuals–let me know if you have any thoughts that yield a simpler coefficient interpretation!)
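(The regression itself is a one-liner on those relative measures—again a sketch with hypothetical names rather than my exact script:)

    fit <- lm(rel_votes ~ rel_whiteness, data = trump_rel)
    summary(fit)  # t-statistic of ~5.4 on rel_whiteness; slope of ~0.012, i.e.,
                  # +10 pp relative whiteness ~ +0.12 pp relative vote share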

The last dimension of this discussion requires comparing Trump to the other Republican candidates. Given the media’s endless coverage of Trump, I would not have been surprised to learn that this effect existed for other campaigns but was simply never reported. But Wasserman and Soltas argue that this is not the case. Their claims are further bolstered by the following visual, which recreates the most recent Trump plot for all 9 candidates who had sufficient data (excluding Gilmore, Huckabee, and Santorum):

fig3

It should be immediately clear that Trump is the only candidate for whom there is a positive statistically significant association between the two relative measures. While Kasich has an upward sloping regression line, the corresponding 95% confidence interval demonstrates that the coefficient on relative perceived whiteness is not statistically significantly different from 0. Employing the whiteness measure in this context allows us to provide quantitative evidence for Wasserman’s original intuition that this effect is unique to Trump–thus, “lend[ing] credibility to the theory that racial resentment is commonplace among his supporters.”
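(The faceted figure follows the same recipe, one panel per candidate; a sketch assuming a hypothetical all_rel data frame holding the relative measures for all nine candidates:)

    library(ggplot2)

    ggplot(all_rel, aes(x = rel_whiteness, y = rel_votes)) +
      geom_point(alpha = 0.5) +
      geom_smooth(method = "lm", level = 0.95) +  # regression line + 95% CI ribbon
      facet_wrap(~ candidate)                     # one panel per candidate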

The role of perceptions of whiteness

Wasserman’s article has incited an outpouring of genuine interest over the past few days. The fascinating nature of the original inquiry combined with Soltas’ integration of a perceived whiteness measure into the Illinois delegate dataset provides a perfect setting in which to investigate the role racial resentment is playing in these particular voting patterns, and in the election on the whole.

Code

My illinois_delegates Github repo has the R script and csv file necessary to replicate all three visuals! (We know data, we have the best data.)



EDUANALYTICS 101: An Investigation into the Stanford Education Space using Edusalsa Data

Heat Maps, Histograms, Scatter Plots, Treemaps
Update [4-18-16]: Thanks to Stuart Rojstaczer for finding an error in my grade distribution histograms. Just fixed them and uploaded the fixed R script as well. Such is the beauty of internet feedback!
Note: This is the first in a series of posts that I am putting together in partnership with Edusalsa, an application based at Stanford that seeks to improve how college students explore and choose their courses. Our goal in these posts is to take advantage of the unique data collected from students’ use of the application in order to learn more about how to model and discuss the accumulation and organization of knowledge within the Stanford community as well as within the larger, global education space. (You can read the post here too.)
Course Syllabus

You are frozen in line. This always happens. You don’t know whether to pick the ENGLISH, PHYSICS with double CS combo that you always order or whether to take a risk and try something new. There are thousands of other options; at least a hundred should fit your strict requirements and picky tastes…Hey, maybe you’d like a side of FRENCH! But now you don’t even know what you should get on it: 258, 130, or 128. You are about to ask which of the three goes best with ENGLISH 90 when you wake up.

You realize you missed lunch… and you need to get out of the library.

Complex choices, those with a large number of options (whether in a deli or via online course registration), often force individuals to make choices haphazardly. In the case of academics, students find themselves unable to bulldoze their way through skimming all available class descriptions and, accordingly, pick their classes through word of mouth and by simply looking through their regular departments’ offerings. However, it is undoubtedly the case that there are ways to improve matching between students and potential quarterly course combinations.

In order to better understand how one could improve the current course choice mechanism, one must first better understand the Stanford education space as well as the myriad of objects (courses, departments, and grades) and actors (students and professors) that occupy it. The unique data collected from students’ use of Edusalsa provides an opportunity to do just this. In this post, organized in collaboration with the Edusalsa team, we will use this evolving trove of data to discuss three overarching questions: [1] How can we measure the interest surrounding, or the popularity of, a course/department? (In conjunction with that question, what should we make of enrollment’s place in measuring interest or popularity?) [2] What is the grade distribution at Stanford, on the whole as well as on the aggregate school-level? [3] How do students approach using new tools for course discovery?

[1] How can we measure the interest surrounding, or the popularity of, a course/department?

One of the first areas of interest that can be examined with the help of Edusalsa’s data is Stanford student interest across courses and departments. Simply put, we can use total views on Edusalsa, aggregated both by course and by department, as a proxy for interest in, or the popularity of, a course. [See technical footnote 1 (TF1)] In order to visualize the popularity of a collection of courses and departments, we use a treemap structure to illustrate the relative popularities of two sets of academic objects: (1) all courses that garnered at least 20 views, and (2) all departments that garnered at least 30 views: [TF2]

course_tree

dept_tree

The size of the rectangles within the treemap corresponds to the number of views, while the darkness of the color corresponds to the estimated enrollment by quarter for classes and entire departments. We notice that, at the course level, the distribution of colors across the rectangles appears unrelated to rectangle size. In other words, there does not seem to be a strong relationship between enrollment and views at the course level. On the other hand, from a cursory look at the second graph, the department treemap seems to illustrate that departments with larger aggregate enrollments (that is, the sum of all enrollments for all classes in a given department) have more views.
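(For the curious, the treemap package in R draws figures like these; a minimal sketch, assuming a hypothetical course_data frame with columns course, views, and enrollment:)

    library(treemap)

    treemap(course_data,
            index  = "course",       # one rectangle per course
            vSize  = "views",        # rectangle area = Edusalsa views
            vColor = "enrollment",   # rectangle color = estimated enrollment
            type   = "value")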

What should we make of enrollment’s place in measuring interest or popularity?

While these particular treemaps are useful for visually comparing the number of views across courses and departments, they do not clarify what, if any, is the nature of the relationship between enrollment and views for these two subsets of all courses and departments. [TF2] Due to the treemaps’ analytic shortcomings, we address the legitimacy of our previous intuitions about the relationship by simply regressing views on enrollment at both the course- and department-level. See below for the relevant plot at the course-level:

course_scatter

The coefficient on enrollment in the simple linear regression model, represented by the blue line in the above plot, while positive, is not statistically significant. We can also see this when considering the width of the light green area above (the 99% confidence interval) and the narrower gray area (the 95% confidence interval), as both areas comfortably include an alternative version of the blue regression line with a slope of 0. The enrollment variable’s lack of explanatory power is further bolstered by the fact that, in this simple regression framework, enrollment variation accounts for only 1.3% of the variation in views.
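(A plot in this spirit can be sketched by layering two confidence ribbons on a single lm fit—courses, views, and enrollment below are hypothetical names:)

    library(ggplot2)

    ggplot(courses, aes(x = enrollment, y = views)) +
      geom_point() +
      geom_smooth(method = "lm", level = 0.99, fill = "lightgreen") +  # 99% CI
      geom_smooth(method = "lm", level = 0.95, fill = "grey70")        # 95% CI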

We now turn to the department-level, which seemed more promising from our original glance at the association between colors and sizes in the relevant treemap:

dept_scatter

In this case, the coefficient on enrollment in this model is statistically significant at the 0.1% level and communicates that, on average, a 1,000 person increase in enrollment for a department is associated with an increase of 65 views on Edusalsa. The strength of the association between enrollment and views is further evidenced by the 95% and 99% confidence intervals. In fact, the explanatory power of the enrollment variable in this context is strong to the point that the model accounts for 53.9% of variation in Edusalsa views. [TF3]

Theory derived from the comparison of course-level and department-level relationships

The difference between the strength of enrollment’s relationship with views at the course and at the department level is clear and notable. I believe that this difference is attributable to the vast heterogeneity in interest across courses, meaning there is extreme variance in terms of how much interest a course garners within a given department. Meanwhile, the difference in interest levels that is so evident across courses disappears at the department-level, once all courses are aggregated. This observation potentially serves as evidence of a current course search model in which students rigidly search within specific departments based on their requirements and fields of study, but then break up their exploration more fluidly at the course-level based on what they’ve heard is good or which classes look the most interesting, etc. While students know what to expect from departments, courses can stand out via catchy names or unique concepts in their descriptions.

More possible metrics, and way more colors…

There are a few other metrics beyond views and enrollment that we might be interested in when trying to assess or proxy for interest surrounding a course or department. In order to compare some of these alternative metrics across various departments, we present the heat map below, which compares a set of six metrics, in relative terms, across the top 15 departments by enrollment size:

heat

While we have discussed enrollment before, I also include number of courses in the second column as an alternative measurement of the size of the department. Rather than defining size by number of people who take classes in the department, this defines size by the number of courses the department offers. The darker greens of CEE, Education, and Law illustrate that these are the departments parenting the most courses.

Another new metric in the above is the fifth column, a metric for the number of viewers, which refers to the number of unique individuals who visited a course page within a department. The inclusion of this measurement allows us to prevent certain individuals from exerting improperly large influence over our measures. For example, one person who visits Economics course pages thousands of times won’t be able to skew this metric though she could skew the views metric significantly. Note that the columns for number of views and number of viewers are very similar, which indicates that, beyond some individuals in EE, departments had individuals viewing courses at similar frequencies.

The last new concept we introduce in the heat map is the notion of normalizing by enrollment, seen in columns four and six, so as to define metrics that take into account the size of the Stanford population that is already involved with these departments. Normalizing views and viewers in this way makes a large impact. Most notably, CS is no longer the dominant department, and instead shares the stage with other departments like Psychology, MS&E, MEE, etc. This normalized measure could be interpreted to proxy for the interest outside of the core members of the department (eg-majors and planned majors), in which case Psychology is certainly looking interesting to those on the outside looking in.
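Because the six metrics live on wildly different scales, a heat map of this sort is typically drawn after rescaling each metric column to the [0, 1] range; a sketch, assuming a hypothetical dept_metrics frame with a department column plus one column per metric:

    library(dplyr)
    library(tidyr)
    library(ggplot2)

    dept_metrics %>%
      pivot_longer(-department, names_to = "metric", values_to = "value") %>%
      group_by(metric) %>%                             # rescale each metric...
      mutate(scaled = (value - min(value)) /
                      (max(value) - min(value))) %>%   # ...to the [0, 1] range
      ggplot(aes(x = metric, y = department, fill = scaled)) +
      geom_tile(color = "white") +
      scale_fill_gradient(low = "white", high = "darkgreen")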

[2] What is the grade distribution at Stanford, on the whole as well as on the aggregate school-level?

The second topic that we cover in this post pertains to that pesky letter attached to a course–that is, grades. Our obtained data included grade distributions by course. [TF4] We use this data to build the frequency distribution for all grades received at Stanford. The following histogram illustrates that the most commonly received grade during the quarter was an A while the median grade was an A- (red line) and the mean grade was a 3.57 (blue line):

stanford_dist
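(A sketch of how such a histogram can be drawn, assuming a hypothetical grades data frame with one row per reported grade on a numeric 0–4.3 scale:)

    library(ggplot2)

    ggplot(grades, aes(x = points)) +
      geom_histogram(binwidth = 1/3) +                                # ~one bar per letter grade
      geom_vline(aes(xintercept = median(points)), color = "red") +   # median (A-)
      geom_vline(aes(xintercept = mean(points)),   color = "blue")    # mean (3.57)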

While this visual is interesting in and of itself since it presents all Stanford course offerings solely by grade outcomes, it would also be meaningful to compare different subsets of the Stanford education space. In particular, we choose to use a similar technique to compare grading distributions across the three schools at Stanford–the School of Humanities & Sciences, the School of Engineering, and the School of Earth, Energy and Environmental Sciences–in order to see whether there is any notable difference across the groups:

school_dist

The histograms for the three schools present incredibly similar distributions—to the extent that at first I thought I had mistakenly plotted the same school’s distribution three times. All three have medians of A-, and the means span a narrow range of 0.08; the means are 3.52, 3.60, and 3.58 for the Humanities & Sciences, Engineering, and Earth Sciences schools, respectively. [TF5]

[3] How do students approach using new tools for course discovery?

Since we have discussed views and other metrics both across classes and departments, it is worth mentioning what the Edusalsa metrics look like over individual users. Specifically, we are curious how many times unique users view courses through Edusalsa. In examining this, we are inherently examining the level of “stickiness” of the site and the aggregated view of how users interact with new course tools. In this case, the stickiness level is low, as illustrated below by both (i) a quickly plunging number of unique individuals as the number of course views grows, and (ii) a linear decline of number of unique individuals as the number of course views grows when using a log-log plot. [TF6]

stick

The negative linear relationship between the log-transformed variables in the second panel (evidenced by the good fit of the above blue line) is indicative of a power-law, or monomial, relationship with a negative exponent between the number of course views and the number of unique individuals. [TF7] This simply indicates that, as is the case with most new applications, so-called stickiness is low. It will be interesting to see whether this changes given the newly added ability to create an account.
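In practice, both panels boil down to counting views per user and then users per view count; a sketch, assuming a hypothetical view_log frame with one row per course view and a user_id column:

    library(dplyr)
    library(ggplot2)

    stickiness <- view_log %>%
      count(user_id, name = "n_views") %>%   # views per unique individual
      count(n_views, name = "n_users")       # individuals per view count

    ggplot(stickiness, aes(x = n_views, y = n_users)) +
      geom_point() +
      scale_x_log10() +                      # drop the two log scales to get
      scale_y_log10() +                      # the raw (non-log-log) panel
      geom_smooth(method = "lm")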

School’s out (for summer)

Our key insights in this post lie in the depths of section [1], which discussed

evidence of a current course search model in which students rigidly search within specific departments based on their requirements and fields of study, but then break up their exploration more fluidly at the course-level

With evolving data collection, we will continue to use Edusalsa data in order to learn more about the current course search model as well as the specific Stanford education space. Future steps in this line of work will include analyzing the dynamics between departments and the courses that populate them using network analysis techniques. (There is a slew of possible options on this topic: mapping out connections between departments based on overlap in the text of course descriptions, number of cross-listings, etc.)

There is ample room for tools in the education space to help students search across conventional departments, rather than strictly within them, and understanding the channels through which individuals most naturally categorize or conceptualize courses constitutes a large chunk of the work ahead.

Technical footnotes
  1. Edusalsa views by course refers to the number of times an individual viewed the main page for a course on the site. Technically, this is when the data.url that we record includes the suffix “/course?c=DEPT&NUM” where DEPT is the department abbreviation followed by the number of the course within the department. Views aggregated by department is equivalent to the sum total of all views for courses that are under the umbrella of a given department.
  2. We only illustrate courses with at least 20 views and departments with at least 30 views in order that they will be adequately visible in the static treemap. Ideally, the views would be structured in an interactive hierarchical tree structure in which one starts at the school level (Humanities & Sciences, Engineering, Geosciences) and can venture down to the department level followed by the course level.
  3. Though it might seem as though Computer Science is an outlier in this dataset whose omission could fundamentally alter the power of the simple regression model, it turns out even after omitting CS the coefficient on enrollment remains significant at the 0.1% level while the R^2 remains high as well at approximately 0.446.
  4. The grade distribution data is self-reported by Stanford students over multiple quarters.
  5. While the distributions are very similar aggregated over the school level, I doubt they would be as similar at the smaller, more idiosyncratic department-level. This could be interesting to consider across similar departments, such as ME, EE, CEE, etc. It could also be interesting to try and code all classes at Stanford as “techie” or “fuzzy” a la the quintessential Stanford student split and see whether those two grade frequency distributions are also nearly identical.
  6. We found that the ID codes we use to identify individuals can change over people in the long run. We believe this happens rarely in our dataset; however, it is worth noting nonetheless. Due to this caveat, some calculations could be over- or underestimates of their true values. For instance, the low stickiness of Edusalsa views could be overstated, as some users who are coded as distinct people may in fact be the same person. Under the same logic, in the heat table, the number of viewers could be an overestimate.
  7. A straight-line fit in a log-log plot indicates a monomial relationship—a monomial being a polynomial with one term, i.e., y = ax^n. Monomials appear as straight lines in log-log plots such that n and a correspond to the slope and intercept, respectively.
Code and replication

All datasets and R scripts necessary to recreate these visuals are available at my edusalsa Github repo!



Geography of Humor: The Case of the New Yorker Caption Contest

Bar Charts, Choropleths, Scatter Plots

Update [9-23-15]: Also check out the newest work on this topic: Which U.S. State Performs Best in the New Yorker Caption Contest?

Intro.

About 10 years ago The New Yorker began a weekly contest. It was not a contest of writing talents in colorful fiction nor of investigative prowess in journalism, instead it was a contest of short and sweet humor. Write a caption for a cartoon, they said. It’ll be fun, they said. This will help our circulation, the marketing department said. Individuals like me, who back at age 12 in 2005 believed The New Yorker was the adult’s version of Calvin and Hobbes that they most enjoyed in doctors’ waiting rooms, embraced the new tradition with open arms.

Now, 10 years later, approximately 5,372 captions are submitted each week, and just a single winner is picked. Upon recently trying my own hand (and failing unsurprisingly given the sheer magnitude of competing captions) at the contest, I wondered, who are these winners? In particular, since The New Yorker always prints the name and place of residence of the caption contest winner, I wondered, what’s the geographical distribution of these winners? 

In order to answer this question, I used my prized subscriber access to the online Caption Contest archive. This archive features the winning caption for each week’s cartoon (along with two other finalist captions) and the name/place of residence of the caption creator. (The archives also feature all other submitted captions–which is super interesting from a machine learning perspective, but I don’t focus on that in this piece.) So, I snagged the geographic information on the past 10 years of winners and went with it.

The basics

For this analysis, I collected information on the first 466 caption contests–that is, all contests up to and including the following:

The New Yorker Caption Contest #466

Before getting into the meat of this discussion, it is worth noting the structure of the contest as well as the range of eligible participants. See this quick explanation from The New Yorker:

Each week, we provide a cartoon in need of a caption. You, the reader, submit your caption below, we choose three finalists, and you vote for your favorite… Any resident of the United States, Canada (except Quebec), Australia, the United Kingdom, or the Republic of Ireland age eighteen or older can enter or vote.

Thus, the contest consists of two rounds: one in which the magazine staff sift through thousands of submissions and pick just three, and one in which the public votes on the ultimate winner out of the three finalists. Furthermore, the contest is open to residents outside the United States—a fact that is easy to forget when considering how rarely individuals from other countries actually win. Out of 466 caption contest winners, only 12 are from outside the United States—2 from Australia, 2 from British Columbia (Canada), and 8 from Ontario (Canada). Though they are allowed to compete, no one from the United Kingdom or the Republic of Ireland has ever won. In short, 97.4% of caption contest winners are from the U.S.

Moving to the city-level of geography, it is unsurprising that The New Yorker Caption Contest is dominated by, well, New Yorkers. New York City has 62 wins, meaning New Yorkers have won 13.3% of the contests. To fully grasp how dominant this makes New York, consider that the city with the next-most caption contest wins is Los Angeles with a mere 18 wins (3.9% of contests). The graphic below depicting the top 8 caption contest cities further highlights New York’s exceptionalism:

cities

Source: New Yorker Caption Contest Archive; Tool: ggplot2 package in R.

The geographic distribution: a state-level analysis

While both the country- and city-level results are dominated by the obvious contenders (the United States and New York City respectively), the state-level analysis is much more compelling.

In this vein, the first question to address is: which states win the most contests? To answer this, I present the following choropleth in which the states are divided into five categories of equal size (each category contains 10 states) based on the number of contests won. (This method uses quantiles to classify the data into ranges; however, there are other methods one could use as well.) Visualizing the data in this way allows us to quickly perceive areas of the country that are caption-winner-rich as well as caption-winner-sparse:

totalwins

Source: New Yorker Caption Contest Archive; Tool: choroplethr package in R.
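(choroplethr makes these quantile maps nearly automatic; a minimal sketch, assuming a hypothetical df_wins frame whose region column holds lowercase state names and whose value column holds win counts:)

    library(choroplethr)

    state_choropleth(df_wins,
                     num_colors = 5,   # five quantile classes
                     legend     = "Caption contest wins")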

This visualization illustrates that the most successful caption contest states are either east coast or west coast states, with the exception of Illinois (due to Chicago’s 16 wins). The most barren section of the country is unsurprisingly the center of the country. (In particular, Idaho, Kansas, North/South Dakota, West Virginia, and Wyoming have never boasted any caption contest winners.)

While using quantiles to classify the data into ranges is helpful, it gives us an overly broad last category–the darkest blue class contains states with win totals ranging from 14 to 85. If we want to zoom in and compare the states within this one category, we can pivot to a simple bar chart for precision’s sake. The following graph presents the number of contests won among the top ten states:

top10

Source: New Yorker Caption Contest Archive; Tool: ggplot2 package in R.

New York and California are clearly the most dominant states with 85 and 75 wins respectively, which is to be expected considering how populous the two are. Taking into account the population size of a given state would most definitely yield a superior metric of how well each state does in winning the contest. (It would also be interesting to take into account the number of The New Yorker subscribers by state, but I haven’t been able to get a hold of that data yet, so I am putting a pin in that idea for now.)

Therefore, I normalize these counts by creating a new metric: the number of caption contests won per one million state residents. In making this change, the map colors shift noticeably. See the following choropleth for the new results:

permill

Source: New Yorker Caption Contest Archive; Tool: choroplethr package in R.
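(The normalization itself is one line once state populations—say, Census estimates—are merged into the hypothetical df_wins from above:)

    df_wins$value <- 1e6 * df_wins$wins / df_wins$population  # wins per million residents
    state_choropleth(df_wins, num_colors = 5,
                     legend = "Wins per million residents")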

Again, the last category is the one with the broadest range (2.425 to 7.991 wins per million residents). So, once more, it is worth moving away from cool, colorful choropleths and towards the classical bar chart. In comparing the below bar graph with the previous one, one can quickly see the difference made in normalizing by population:

top10cap

Source: New Yorker Caption Contest Archive; Tool: ggplot2 package in R.

For one, the once-dominant New York falls behind new arrivals Vermont and Rhode Island, while the similarly previously dominant California is nowhere to be seen! Other states that lose their place among the top ten are: Illinois, New Jersey, and Pennsylvania. Meanwhile, the four new states in this updated top ten are: Alaska and New Hampshire as well as the previously mentioned Rhode Island and Vermont. Among these four new arrivals, Vermont stands clearly ahead of the pack with approximately 8 caption contest wins per million residents.

The high counts per million for states like Vermont and Rhode Island suggest a relationship that many were likely considering throughout this entire article–isn’t The New Yorker for liberals? Accordingly, isn’t there a relationship between wins per million and liberalness?

Those damn liberal, nonreligious states

Once we have normalized caption contest wins by population, we still have not completely accounted for states’ likelihood of winning the contest. This is due to the fact that there is a distinct relationship between wins per million residents and evident political markers of The-New-Yorker-types. In particular, consider Gallup’s State of the States measures of “% liberal” and “% nonreligious.” First, I present the strong association between liberal percentages and wins per million:

libs

Source: New Yorker Caption Contest Archive; Tool: ggplot2 package in R.

The above is a scatterplot in which each point is a state (see the small state abbreviation labels) and the blue line is a linear regression line (the shaded area is the 95% confidence region) fit to the data. The conclusion is unmistakable: states that are more liberal tend to win more contests per million residents. Specifically, the equation for the linear regression line is:

wins_per_million = -3.13 + 0.22(pct_liberal)

This means that a 1 percentage point increase in the liberal percentage is associated with an increase of 0.22 wins per million. The R^2 (in this case, the same as the squared correlation coefficient r^2 between wins_per_million and pct_liberal, since there is just one explanatory variable in the regression) is 0.364, meaning that 36.4% of response variable variation is explained by this simple model. (The standard error on the coefficient attached to pct_liberal is only 0.04, meaning the coefficient is easily statistically significant at the 0.1% level.)
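(Both the line and these diagnostics fall out of a simple lm() fit, with geom_text() handling the state labels; a sketch with hypothetical column names:)

    fit <- lm(wins_per_million ~ pct_liberal, data = states)
    summary(fit)  # slope ~0.22 (SE ~0.04), R^2 ~0.364

    library(ggplot2)
    ggplot(states, aes(x = pct_liberal, y = wins_per_million)) +
      geom_text(aes(label = state_abb), size = 3) +  # label each point by state
      geom_smooth(method = "lm", level = 0.95)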

Also strong is the association between nonreligious percentages and wins per million, presented in the graph below:

nonreg

Source: New Yorker Caption Contest Archive; Tool: ggplot2 package in R.

This plot is very similar to the previous one, most definitely because states with high liberal percentages are likely to have high nonreligious percentages as well. The linear regression line that is fit for this data is:

wins_per_million = -1.37 + 0.09(pct_nonreligious)

The relevant conceptual interpretation is that a 1 percentage point increase in the nonreligious percentage is associated with an increase of 0.09 wins per million. The R^2 for this model is 0.316, so 31.6% of response variable variation is explained by the model. (Again, the coefficient of interest—this time the coefficient attached to pct_nonreligious—is statistically significant at the 0.1% level.)

These two graphs are simple illustrations of the statistically significant relationships between wins per million and two political markers of The New Yorker readership. In order to better understand the relationship between these variables, one must return to the structure of the contest…

The mechanism behind the success of liberal, nonreligious states

The caption contest is broken chronologically into three phases: (1) individuals submit captions, (2) three captions are selected as finalists by magazine staff, and (3) the public votes on their favorite caption.

It seems most likely that the mechanism behind the success of liberal, nonreligious states lies in the first phase. In other words, liberal, nonreligious people are more likely to read The New Yorker and/or follow the caption contest. (Its humor is unlikely to resonate with the intensely religious and socially conservative.) Therefore, the tendency towards wins for liberal, nonreligious states is mostly a question of who chooses to participate.

It could also be the case that at least a part of the mechanism behind these states’ successes lies in phases (2) or (3). If a piece of this mechanism was snuggled up in phase 2, that would mean The New Yorker staff is inclined due to an innate sense of liberal humor to pick captions from specific states. (Yet, since most submissions are probably already from liberals, this seems unlikely–though maybe the reverse happens as the magazine attempts to foster geographic diversity by selecting captions from a broader range of locations? I don’t think that’s part of the caption selection process, but it could be relevant to the aforementioned mechanism if it were.) If the mechanism were instead hidden within the third phase, this would mean voters tend to vote for captions created by people from more nonreligious and liberal states in the country. One interesting element to note is that voters can see the place of residence of a caption creator–though I highly doubt this influences peoples’ voting choices, it is possible that regional favoritism is a factor (e.g., New Yorkers like to see other New Yorkers win and, therefore, the large number of New Yorker voters pushes the New Yorker caption submissions to win).

In order to better investigate the mechanism behind the success of nonreligious, liberal states, one needs access to the geographic data of all submissions…or, at least the data on the number of subscribers per state. Though one can submit to the contest without a subscription, the latter measure could still be used as a credible proxy for the former since the number of people who submit to the contest in a state is likely proportional to the number of subscribers in the state.

A thank you note

Thanks to my family for giving me a subscription to The New Yorker this past holiday season in a desperate attempt to help me become less of a philistine. My sincerest apologies that I have focused more on the cartoons than all those chunks of words that mark space in between.

How-about-never-cartoon

I’ll be sure to actually call you all up if I ever win–good news: if I enter every contest for the next ten years I’ll have approximately a 10% chance of winning just by chance alone.
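(Back-of-the-envelope, assuming ~5,372 entries per weekly contest and 52 contests a year:)

    1 - (1 - 1/5372)^(52 * 10)  # ~0.092, i.e., roughly the 10% quoted above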

Me & Bob Mankoff! (Cartoon Editor of The New Yorker and creator of the above cartoon)

Future work
  • Make maps interactive (using Mapbox/TillMill/qgis and the like) and embed into page with the help of Sarah Michael Levine!
  • Look at captions per number of subscribers in a state (even though you can submit even if you’re not a subscriber–I assume submissions from a state would be proportional to the number of subscribers)
  • See if it’s possible to collect state data on all submitted captions in order to test hypotheses related to the mechanism behind the success of liberal, nonreligious states
  • Create predictive model with wins per million as the dependent variable
    • Independent variables could include proximity to New York or a dummy variable for whether the state is in the Northeast, income per capita, percent liberal, percent nonreligious, (use logs?) etc.
      • However, the issue with many of these is that there is likely to be multicollinearity since so many of these independent variables are highly correlated…Food for thought
        • In particular, it is not worthwhile to include both % liberal and % nonreligious in one regression (one loses statistical significance altogether and the other goes from the 0.1% level to the 5% level)
Code

All data and R scripts needed to recreate all types of visualizations in this article (choropleths, bar charts, and scatterplots with linear regression lines) are available on my “NewYorker” Github repo.

