Building Visualizations Using City Open Data: Philly School Comparisons

Intro

There is a collection of notes that accompanies me throughout my day, slipped into the deep pockets of my backpack. The collection consists of small notebooks and post-its featuring sentence fragments written in inky Sharpie or scratched down frantically using some pen that was (of course) dying at the time. Ideas, hypotheses, some jokes. Mostly half-baked and sometimes completely raw. Despite this surplus of scribbles, I often struggle when it comes to acting on the intention of the words that felt so quick and simple to jot down… In fact, I often feel myself acting within the confines of this all too perfect graphical representation of project development:

[Image: cartoon depicting project development]

via the wonderful young cartoonist Liana Finck

One topic of interest–comparisons of charter and district public schools–has been on my (self-imposed) plate for over a year now. The topic was inspired by a documentary webseries that a friend actually just recently completed. [Plugs: Sivahn Barsade will be screening her documentary webseries Charter Wars this weekend in Philadelphia! Check it out if you’re around.] Given that she is currently wrapping up this long-term project, I am doing the same for my related mini-project. In other words, some post-its are officially being upgraded to objects on the internet.

To quote the filmmakers, “Charter Wars is an interactive documentary that examines the ideologies and motivations driving the charter school debate in Philadelphia.” Ah, yes, charter schools… a handful of slides glided by me on the topic in my morning Labor Economics class just this past Wednesday. Check out the intertwined and state-of-the-art Dobbie-Fryer (2013) and Fryer (2014) if you’re interested in charter school best practices and their implementation in other school environments.[1] However, despite the mention of these papers, I am not going to use this space in order to critique or praise rigorous academic research on the subject. Instead, I will use this space as a playground for the creation of city open data visualizations. Since Sivahn focuses her Charter Wars project on Philadelphia, I decided to do the same, which turned out to be a great idea since OpenDataPhilly is a joy to navigate, especially in comparison to other city data portals. After collecting data of interest from their site (details on that process available here), I used ggplot2 in R (praise Hadley!) to create two visualizations comparing district and charter schools in the city.

Think of this post as a quasi-tutorial inspired by Charter Wars; I’ll present a completed visual and then share the heart of the code in the text with some brief explanation as to the core elements therein. (I will also include links to code on my Github repo, which presents the full R scripts and explains how to get the exact data from OpenDataPhilly that you would need to replicate visuals.)

Visualization #1: Mapping out the city and schools

First things first, I wanted to map the location of public schools in the city of Philadelphia. Open data provides workable latitudes and longitudes for all such schools, so this objective is entirely realizable. The tricky part in mapping the schools is that I also had to work with shapefiles that yield the city zip code edges and consequently build the overarching map on which points (representing the schools) can be plotted. I color schools based on four categories: Charter (Neighborhood), Charter (Citywide), District (Neighborhood), and District (Citywide);[2] and then break the plots up so that we can compare across the school levels: Elementary School, Middle School, High School, K-8 School (rather than plotting hundreds of points all on one big map). Here is my eventual result generated using R:

mappingschools

The reality is that most of the labor in creating these visuals is in figuring out both how to make functions work and how to get your data in the desired workable form. Once you’ve understood how the functions behave and you’ve reshaped your data structures, you can focus on your ggplot command, which is the cool piece of your script that you want to show off at the end of the day:

# Base layer: zip code polygons drawn from the shapefile-derived data frame np_dist
ggplot() +
  geom_map(data = spr1, aes(map_id = Zip.Code), map = np_dist, fill = "gray40", color = "gray60") +
  # Make sure the axes span the full extent of the map
  expand_limits(x = np_dist$long, y = np_dist$lat) +
  my_theme() +
  # One point layer per school type; the string inside aes() becomes the legend label
  geom_point(data = datadistn, aes(x = X, y = Y, col = "District (Neighborhood)"), size = 1.5, alpha = 1) +
  geom_point(data = datachartn, aes(x = X, y = Y, col = "Charter (Neighborhood)"), size = 1.5, alpha = 1) +
  geom_point(data = datadistc, aes(x = X, y = Y, col = "District (Citywide)"), size = 1.5, alpha = 1) +
  geom_point(data = datachartc, aes(x = X, y = Y, col = "Charter (Citywide)"), size = 1.5, alpha = 1) +
  # One panel per school level
  facet_wrap(~Rpt.Type.Long, ncol = 2) +
  ggtitle(expression(atop(bold("Mapping Philly Schools"), atop(italic("Data via OpenDataPhilly; Visual via Alex Albright (thelittledataset.com)"), "")))) +
  # Assign a fixed color to each school type and title the legend
  scale_colour_manual(name = "Type of School", values = c("Charter (Citywide)" = "#b10026", "District (Citywide)" = "#807dba", "Charter (Neighborhood)" = "red", "District (Neighborhood)" = "blue")) +
  labs(y = "", x = "")

This command creates the map I had previously presented. The basic process with all these sorts of ggplot commands is that you want to start your plot with ggplot() and then add layers with additional commands (after each +). The above code uses a number of functions and geometric objects that I identify and describe below:

  • ggplot()
    • Start the plot
  • geom_map()
    • Geometric object that maps out Philadelphia with the zip code lines
  • my_theme()
    • My customized function that defines style of my visuals (defines plot background, font styles, spacing, etc.)
  • geom_point()
    • Geometric object that adds the points onto the base layer of the map (I use it four times since I want to do this for each of the four school types using different colors)
  • facet_wrap()
    • Function that says we want four different maps in order to show one for each of the four school levels (Middle School, Elementary School, High School, K-8 School)
  • ggtitle()
    • Function that specifies the overarching plot title
  • scale_colour_manual()
    • Function that maps values of school types to specific aesthetic values (in our case, colors!)
  • labs()
    • Function to change axis labels and legend titles–I use it to get rid of default axes labels for the overarching graph

Definitely head to the full R script on Github to understand what the arguments (spr1, np_dist, etc.) are in the different pieces of this large aggregated command. [Recommended resources for those interested in using R for visualization purposes: a great cheat sheet on building up plots with ggplot, the incredible collection of FlowingData tutorials, and Prabhas Pokharel’s helpful post on this type of mapping in R]
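For anyone who wants to see the layering idea without downloading the real data, here is a minimal sketch on toy inputs. Everything in it is made up for illustration: toy_map stands in for the shapefile-derived zip code outlines, toy_regions and toy_schools stand in for the OpenDataPhilly tables, and the zip codes and coordinates are arbitrary. It also tries one alternative design, keeping the school type in a column so that a single geom_point() layer can do the work of four.

```r
library(ggplot2)

# Hypothetical stand-in for the zip code outlines: two unit squares.
# geom_map() needs columns named long/lat (or x/y) plus id (or region).
toy_map <- data.frame(
  id   = rep(c("19101", "19102"), each = 4),
  long = c(0, 1, 1, 0,  1, 2, 2, 1),
  lat  = c(0, 0, 1, 1,  0, 0, 1, 1)
)

# One row per region; tells geom_map() which outlines to draw.
toy_regions <- data.frame(Zip.Code = c("19101", "19102"))

# Made-up schools. The type column drives the color, the level column
# drives the faceting.
toy_schools <- data.frame(
  X     = c(0.2, 0.7, 1.3, 1.8),
  Y     = c(0.3, 0.8, 0.4, 0.6),
  type  = c("Charter (Citywide)", "District (Citywide)",
            "Charter (Neighborhood)", "District (Neighborhood)"),
  level = c("Elementary School", "High School",
            "Elementary School", "High School")
)

toy_plot <- ggplot() +
  # Base layer: the region outlines (repeated in every panel because
  # toy_regions has no level column)
  geom_map(data = toy_regions, aes(map_id = Zip.Code), map = toy_map,
           fill = "gray40", color = "gray60") +
  expand_limits(x = toy_map$long, y = toy_map$lat) +
  # A single point layer, colored by the type column
  geom_point(data = toy_schools, aes(x = X, y = Y, colour = type), size = 1.5) +
  facet_wrap(~level, ncol = 2) +
  scale_colour_manual(
    name = "Type of School",
    values = c("Charter (Citywide)" = "#b10026",
               "District (Citywide)" = "#807dba",
               "Charter (Neighborhood)" = "red",
               "District (Neighborhood)" = "blue")
  ) +
  labs(x = "", y = "")

print(toy_plot)
```

Swapping the toy data frames for real ones should leave the ggplot call itself nearly unchanged, which is the appeal of building the plot in layers.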

Visualization #2: Violin Plots

My second creation illustrates the distribution of school scores across the four aforementioned school types: Charter (Neighborhood), Charter (Citywide), District (Neighborhood), and District (Citywide). (Note that the colors match those used for the points in the previous maps.) To explore this topic, I create violin plots, which can be thought of as sideways density plots, which can in turn be thought of as smooth histograms.[3] Alternatively, according to Nathan Yau, you can think of them as the “lovechild between a density plot and a box-and-whisker plot.” Similar to how in the previous graph I broke the school plotting up into four categories based on level of schooling, I now break the plotting up based on score type: overall, achievement, progress, and climate.  See below for the final product:

scores

The core command that yields this graph is as follows:

# Refer to columns by name inside aes() so that faceting subsets the data correctly
ggplot(data_new, aes(x = factor(Governance0), y = Score)) +
  # adjust scales the kernel bandwidth; trim cuts the violins at the range of the data
  geom_violin(trim = TRUE, adjust = 0.2, aes(fill = Governance0)) +
  # A constant outline color is set outside aes()
  geom_boxplot(width = 0.1, aes(fill = Governance0), color = "orange") +
  my_theme() +
  scale_fill_manual(name = "School Type", values = pal2) +
  ylim(0, 100) +
  labs(x = "", y = "") +
  # One panel per score type
  facet_wrap(~Score_type, ncol = 2, scales = "free") +
  ggtitle(expression(atop(bold("Comparing Philly School Score Distributions"), atop(italic("Data via OpenDataPhilly (2014-2015); Visual via Alex Albright (thelittledataset.com)"), ""))))

Similar to before, I will briefly explain the functions and objects that we combine into this one long command:

  • ggplot()
    • Begin the plot with aesthetics for score and school type (Governance0)
  • geom_violin()
    • Geometric object that specifies that we are going to use a violin plot for the distributions (also decides on the bandwidth parameter)
  • geom_boxplot()
    • Geometric object that generates a basic boxplot over the violin plot (so we can get an alternative view of the underlying data points)
  • my_theme()
    • My customized function that defines the style of visuals
  • scale_fill_manual()
    • Function that fills in the color of the violins by school type
  • ylim()
    • Short-hand function to set y-axis to always show 0-100 values
  • labs()
    • Function to get rid of default axes labels
  • facet_wrap()
    • Function that separates plots out into one for each of the four score types: overall, achievement, progress, climate
  • ggtitle()
    • Specifies the overarching plot title

Again, definitely head to the full R script to understand the full context of this command and the structure of the underlying data. (Relevant resources for looking into violin plots in R can also be found here and here.) 
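Since the bandwidth choice matters so much for how a violin looks, here is a small sketch of what the adjust argument does. The scores are simulated and the toy_scores data frame and make_violin() helper are hypothetical names; only the Governance0 and Score column names are borrowed from the command above. The idea is that adjust multiplies the default kernel bandwidth, so values below 1 give a bumpier shape and values above 1 a smoother one.

```r
library(ggplot2)

set.seed(1)  # simulated scores, purely for illustration
toy_scores <- data.frame(
  Governance0 = rep(c("Charter", "District"), each = 50),
  Score       = c(rnorm(50, mean = 60, sd = 10), rnorm(50, mean = 50, sd = 15))
)

# Base R's density() exposes the same idea: the bandwidth actually used
# is adjust times the default bandwidth.
bw_default <- density(toy_scores$Score)$bw
bw_small   <- density(toy_scores$Score, adjust = 0.2)$bw

# Hypothetical helper: the same violin-plus-boxplot plot for a given adjust
make_violin <- function(adjust) {
  ggplot(toy_scores, aes(x = Governance0, y = Score)) +
    geom_violin(aes(fill = Governance0), trim = TRUE, adjust = adjust) +
    geom_boxplot(width = 0.1) +
    labs(x = "", y = "", title = paste("adjust =", adjust))
}

bumpy  <- make_violin(0.2)  # the value used in the command above
smooth <- make_violin(1)    # the default bandwidth
```

Printing bumpy and smooth side by side is a quick way to judge how much of a violin's shape comes from the data and how much from the smoothing.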

It took me many iterations of code to get to the current builds that you can see on Github, especially since I am not an expert with mapping–unlike my better half, Sarah Michael Levine. See the below comic for an accurate depiction of current-day-me (the stick figure with ponytail) looking at the code that July-2015-me originally wrote to produce some variant of these visuals (stick figure without ponytail):

code_quality

Via XKCD

Hopefully current-day-me was able to improve the style to the extent that it is now readable to the general public. (Do let me know if you see inefficiencies though and I’m happy to iterate further! Ping me with questions too if you so desire.) Moreover, in intensively editing code created by my past self over the past string of days, I also quickly recalled that the previous graphical representation of my project workflow needed to be updated to more accurately reflect reality:

manic2

adapted from Liana Finck with the help of snapchat artistic resources

On a more serious note, city open data is an incredible resource for individuals to practice using R (or other software). In rummaging around city variables and values, you can maintain a sense of connection to your community while floating around the confines of a simple two-dimensional command line.

Plugs section [important]
  1. Thanks to Sivahn for communicating with me about her Charter Wars documentary webseries project–good luck with the screening and all, Si!
  2. If you like city open data projects, or you’re a New Yorker, or both… check out Ben Wellington’s blog that focuses on NYC open data.
  3. If you’d like to replicate elements of this project, see my Github documentation.

Footnotes

[1] Yes, that’s right; I’m linking you to the full pdfs that I downloaded with my university access. Think of me as Robin Hood with the caveat that I dole out journal articles instead of $$$.

[2] Note from Si on four school categories: Wait, why are there four categories? While most people, and researchers, divide public schools into charter-run and district-run, this binary is lacking vital information. For some district and charter schools, students have to apply and be selected to attend. It wouldn’t be fair to compare a charter school to a district magnet school just like it wouldn’t be fair to compare a performing arts charter school to a neighborhood district school (this is not a knock against special admit schools, just their effect on data analysis). The additional categories don’t allow for a perfect apples-apples comparison, but at least you’ll know that you’re comparing an apple to an orange.

[3] The efficacy or legitimacy of this sort of visualization method is potentially contentious in the data visualization community, so I’m happy to hear critiques/suggestions–especially with respect to best practices for determining bandwidth parameters!


© Alexandra Albright and The Little Dataset That Could, 2016. Unauthorized use and/or duplication of this material without express and written permission from this blog’s author and/or owner is strictly prohibited. Excerpts, accompanying visuals, and links may be used, provided that full and clear credit is given to Alex Albright and The Little Dataset That Could with appropriate and specific direction to the original content.

Go East, young woman

We’ll always have Palo Alto[1]

It is 9:30pm PST on Friday evening and my seat belt is buckled. The lights are a dim purple as they always are on Virgin America flights. As if we are all headed off to a prom on the opposite side of the country together. My favorite safety video in the industry starts to play–an accumulation of visuals and beats that usually gives me a giddy feeling that only Beyoncé videos have the power to provoke–however, in this moment, I begin to tear up despite the image of a shimmying nun displayed in front of me. In my mind, overlaying the plane-inspired choreography is a projection of Rick Blaine reminding me in my moments of doubt that I belong on this plane[2]: “If that plane leaves the ground and you’re not [in it], you’ll regret it. Maybe not today. Maybe not tomorrow, but soon and for the rest of your life.” I whisper “here’s looking at you, kid” to the screen now saturated with dancing flight attendants and fade into a confused dreamscape: Silicon Valley in black and white–founders still wear hoodies, but they have tossed on hats from the ’40s.

A few days later, I am now living in Cambridge, MA. While my senses are overcome by a powerful ensemble of changes, some more discreet or intangible than others, there is one element of the set that is clear, striking, and quantifiable: the thickness and heat in the air that were missing from Palo Alto and San Francisco. After spending a few nights out walking (along rivers, across campuses, over and under bridges, etc.) in skirts and sandals without even the briefest longing for a polar fleece, I am intent on documenting the difference between Boston and San Francisco temperatures. Sure, I can’t quantify every dimension of change that I experience, but, hey, I can chart temperature differences.

Coding up weather plots

In order to investigate the two cities and their relevant weather trends, I adapted some beautiful code that was originally written by Bradley Boehmke in order to generate Tufte-inspired weather charts using R (specifically making use of the beloved ggplot2 package). The code is incredible in how simple it is to apply to any of the cities that have data from the University of Dayton’s Average Daily Temperature archive.[3] Below are the results I generated for SF and Boston, respectively[4]:

SF_plot

Boston_plot

While one could easily just plot the recent year’s temperature data (2015, as marked by the black time series, in this case), it is quickly evident that making use of historical temperature data helps to both smooth over the picture and put 2015 temperatures in context. The light beige for each day in the year shows the range from historical lows to historical highs in the time period of 1995-2014. Meanwhile, the grey range presents the 95% confidence interval around daily mean temperatures for that same time period. Lastly, the presence of blue and red dots illustrates the days in 2015 that were record lows or highs over the past two decades. While Boston had a similar number of red and blue dots for 2015, SF is overpowered by red. Almost 12% of SF days were record highs relative to the previous twenty years. Only one day was a record low.
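The pieces of the chart described above (historical range, confidence band, record days) can be sketched in base R as follows. This is only an illustration under stated assumptions: the temperatures are simulated rather than taken from the Dayton archive, the column names (day, year, temp) are made up, and the t-based interval is one reasonable way to build a 95% band, which may differ in detail from the original code.

```r
set.seed(42)  # simulated temperatures, not real archive data
temps <- expand.grid(day = 1:365, year = 1995:2015)
# A seasonal sine curve plus noise, in degrees Fahrenheit
temps$temp <- 55 + 20 * sin(2 * pi * (temps$day - 100) / 365) +
  rnorm(nrow(temps), sd = 6)

past    <- temps[temps$year < 2015, ]   # 1995-2014: the historical context
present <- temps[temps$year == 2015, ]  # the year drawn as the black line

# For each calendar day: historical low and high (the beige band) and a
# 95% confidence interval around the historical mean (the grey band)
hist_stats <- do.call(rbind, lapply(split(past$temp, past$day), function(t) {
  se   <- sd(t) / sqrt(length(t))
  crit <- qt(0.975, df = length(t) - 1)
  data.frame(low = min(t), high = max(t), avg = mean(t),
             ci_low = mean(t) - crit * se, ci_high = mean(t) + crit * se)
}))
hist_stats$day <- as.integer(rownames(hist_stats))

# Flag the days in 2015 that fall outside the historical range
present <- merge(present, hist_stats, by = "day")
present$record_high <- present$temp > present$high  # red dots
present$record_low  <- present$temp < present$low   # blue dots

# Share of days in 2015 that set a record high, in percent
pct_record_high <- 100 * mean(present$record_high)
```

On real data, pct_record_high is the kind of number behind a statement like "almost 12% of days were record highs."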

While this style of visualization is primarily intuitive for comparing a city’s weather to its own historical context, there are also a few quick points that strike me from simple comparisons across the two graphs. I focus on just three quick concepts that are borne out by the visuals:

  1. Boston’s seasons are unmistakable.[5] While the normal range (see darker swatches on the graph) of temperatures for SF varies between 50 (for winter months) and 60 degrees (for late summer and early fall months), the normal range for Boston is notably larger and ranges from the 30’s (winter and early spring months) to the 70’s (summer months). The difference in the curve of the two graphs makes this difference throughout the months painfully obvious. San Francisco’s climate is incredibly stable in comparison with east coast cities–a fact that is well known, but still impressive to see in visual form!
  2. There’s a reason SF can have Ultimate Frisbee Beach League in the winter. Consider the relative wonderfulness of SF in comparison to Boston during the months of January to March. In 2015, SF ranged from 10 to 55 degrees (on a particularly toasty February day) warmer than Boston for those months. In general, most differences on a day-to-day basis are around +20 to +40 degrees for SF.
  3. SF Summer is definitely ‘SF Winter’ if one defines its temperature relative to that of other climates. In 2015, the summer months in SF were around 10 degrees colder than were the summer months in Boston. While SF summer is warmer than actual SF winter in terms of absolute temperature comparisons, comparing the temperatures to other areas of the country quickly yields SF summer as the relatively chilliest range of the year.

Of course, it is worth noting that the picture from looking at simple temperature alone is not complete. More interesting than this glance at basic temperature would be an investigation into the “feels like” temperature, which usually takes into account factors such as wind speeds and humidity. Looking into these more complex measurements would very likely heighten the clear distinction in Boston seasons as well as potentially strengthen the case for calling SF summer ‘SF winter’, given the potential stronger presence of wind chill during the summer months.[6]

The coldest winter I ever spent…[7]

It is 6:00am EST Saturday morning in Boston, MA. Hot summer morning is sliced into by divine industrial air conditioning. Hypnotized by luggage seemingly floating on the baggage claim conveyor belt and slowly emerging from my black and white dreams, I wonder if Ilsa compared the weather in Lisbon to that in Casablanca when she got off her plane… after contacts render the lines and angles that compose my surroundings crisp again, I doubt it. Not only because Ilsa was probably still reeling from maddeningly intense eye contact with Rick, but also because Lisbon and Casablanca are not nearly as markedly different in temperature as are Boston and San Francisco.

Turns out that the coldest winter I will have ever spent will be winter in Boston. My apologies to summer in San Francisco.

Footnotes

[1] Sincere apologies to those of you in the Bay Area who have had to hear me make this joke a few too many times over the past few weeks.

[2] Though definitely not to serve as a muse to some man named Victor. Ah, yes, the difference 74 years can make in the purpose of a woman’s travels.

[3] Taking your own city’s data for a spin is a great way to practice getting comfortable with R visualization if you’re into that sort of thing.

[4] See my adapted R code for SF and Boston here. Again, the vast majority of credit goes to Bradley Boehmke for the original build.

[5] Speaking of seasons

[6] I’d be interested to see which US cities have the largest quantitative difference between “feels like” and actual temperature for each period (say, month) of the year…

[7] From a 2005 Chronicle article: “‘The coldest winter I ever spent was a summer in San Francisco,’ a saying that is almost a San Francisco cliche, turns out to be an invention of unknown origin, the coolest thing Mark Twain never said.”



Where My Girls At? (In The Sciences)

Intro

In the current educational landscape, there is a constant stream of calls to improve female representation in the sciences. However, the call to action is often framed within the aforementioned nebulous realm of “the sciences”—an umbrella term that ignores the distinct environments across the scientific disciplines. To better understand the true state of women in “the sciences,” we must investigate representation at the discipline level in the context of both undergraduate and doctoral education. As it turns out, National Science Foundation (NSF) open data provides the ability to do just that!

The NSF’s Report on Women, Minorities, and Persons with Disabilities in Science and Engineering includes raw numbers on both undergraduate and doctoral degrees earned by women and men across all science disciplines. With these figures in hand, it’s simple to generate measures of female representation within each field of study—that is, percentages of female degree earners. This NSF report spans the decade 2002–2012 and provides an immense amount of raw material to investigate.[1]
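As a minimal sketch of that calculation, assuming a table with one row per discipline and separate counts for women and men (the discipline names, column names, and counts here are hypothetical, not actual NSF figures):

```r
# Hypothetical degree counts, not actual NSF figures
degrees <- data.frame(
  discipline = c("Discipline A", "Discipline B"),
  women      = c(200, 450),
  men        = c(800, 550)
)

# Female representation: women as a percentage of all degree earners
degrees$pct_female <- 100 * degrees$women / (degrees$women + degrees$men)
```

With these made-up counts, the two disciplines come out at 20% and 45% female, respectively.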

The static picture: 2012

First, we will zero in on the most recent year of data, 2012, and explicitly compare female representation within and across disciplines.[2]

fig1

The NSF groups science disciplines with similar focus (for example, atmospheric and ocean sciences both focus on environmental science) into classified parent categories. In order to observe not only the variation within each parent category but also across the more granular disciplines themselves, the above graph plots percentage female representation by discipline, with each discipline colored with respect to its NSF classified parent category.

The variation within each parent category can be quite pronounced. In the earth, atmospheric, and ocean sciences, female undergraduate representation ranges from 36% (atmospheric sciences) to 47% (ocean sciences) of total graduates. Among PhD graduates, female representation ranges from 39% (atmospheric sciences) to 48% (ocean sciences). Meanwhile, female representation in the physical sciences has an undergraduate range from 19% (physics) to 47% (chemistry) and a PhD range from 20% (physics) to 39% (chemistry). However, social sciences has the largest spread of all with undergraduate female representation ranging from 30% (economics) to 71% (anthropology) and PhD representation ranging from 33% (economics) to 64% (anthropology).

In line with conventional wisdom, computer sciences and physics are overwhelmingly male (undergraduate and PhD female representation lingers around 20% for both). Other disciplines in which female representation notably lags include: economics, mathematics and statistics, astronomy, and atmospheric sciences. Possible explanations behind the low representation in such disciplines have been debated at length.

Interactions between “innate abilities,” mathematical content, and female representation

Relatively recently, in January 2015, an article in Science “hypothesize[d] that, across the academic spectrum, women are underrepresented in fields whose practitioners believe that raw, innate talent is the main requirement for success, because women are stereotyped as not possessing such talent.” While this explanation was compelling to many, another group of researchers quickly responded by showing that once measures of mathematical content were added into the proposed models, the measures of innate beliefs (based on surveys of faculty members) shed all their statistical significance. Thus, the latter researchers provided evidence that female representation across disciplines is instead associated with the discipline’s mathematical content “and that faculty beliefs about innate ability were irrelevant.”

However, this conclusion does not imply that stereotypical beliefs are unimportant to female representation in scientific disciplines—in fact, the same researchers argue that beliefs of teachers and parents of younger children can play a large role in silently herding women out of math-heavy fields by “becom[ing] part of the self-fulfilling belief systems of the children themselves from a very early age.” Thus, the conclusion only objects to the alleged discovery of a robust causal relationship between one type of belief, university/college faculty beliefs about innate ability, and female representation.

Despite differences, both assessments demonstrate a correlation between measures of innate capabilities and female representation that is most likely driven by (1) women being less likely than men to study math-intensive disciplines and (2) those in math-intensive fields being more likely to describe their capacities as innate.[3]

The second point should hardly be surprising to anyone who has been exposed to mathematical genius tropes—think of all those handsome janitors who write up proofs on chalkboards and whose talents are rarely learned. The second point is also incredibly consistent with the assumptions that underlie “the cult of genius” described by Professor Jordan Ellenberg in How Not to Be Wrong: The Power of Mathematical Thinking (p.412):

The genius cult tells students it’s not worth doing mathematics unless you’re the best at mathematics, because those special few are the only ones whose contributions matter. We don’t treat any other subject that way! I’ve never heard a student say, “I like Hamlet, but I don’t really belong in AP English—that kid who sits in the front row knows all the plays, and he started reading Shakespeare when he was nine!”

In short, subjects that are highly mathematical are seen as more driven by innate abilities than are others. In fact, describing someone as a hard worker in mathematical fields is often seen as an implicit insult—an implication I very much understand as someone who has been regularly (usually affectionately) teased as a “try-hard” by many male peers.

The dynamic picture: 2002–2012

Math-intensive subjects are predominately male in the static picture for the year 2012, but how has the gender balance changed over recent years (in these and all science disciplines)? To answer this question, we turn to a dynamic view of female representation over a recent decade by looking at NSF data for the entirety of 2002–2012.

fig2

The above graph plots the percentages of female degree earners in each science discipline for both the undergraduate and doctoral levels for each year from 2002 to 2012. The trends are remarkably varied with overall changes in undergraduate female representation ranging from a decrease of 33.9% (computer sciences) to an increase of 24.4% (atmospheric sciences). Overall changes in doctoral representation ranged from a decline of 8.8% (linguistics) to a rise of 67.6% (astronomy). The following visual more concisely summarizes the overall percentage changes for the decade.

fig3

As this graph illustrates, there were many gains in female representation at the doctoral level between 2002 and 2012. All but three disciplines experienced increased female representation—seems promising, yes? However, substantial losses at the undergraduate level should yield some concern. Only six of the eighteen science disciplines experienced undergraduate gains in female representation over the decade.
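A note on reading these numbers: the size of some changes (a rise of 67.6% could not be a gain of 67.6 percentage points) suggests they are relative changes in the female share rather than percentage-point differences. Under that assumption, the calculation for a single discipline would look like the following sketch, in which the two shares are hypothetical and not NSF figures:

```r
# Hypothetical female shares (in percent) for one discipline
pct_2002 <- 25
pct_2012 <- 20

# Relative change over the decade, in percent of the 2002 share
relative_change <- 100 * (pct_2012 - pct_2002) / pct_2002

# Percentage-point change, shown for contrast
point_change <- pct_2012 - pct_2002
```

Here a drop from 25% to 20% female is a relative decline of 20% but a decline of only 5 percentage points, so the two measures can tell quite different-sounding stories.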

The illustrated increases in representation at the doctoral level are likely extensions of gains at the undergraduate level from the previous years—gains that are now being eroded given the presented undergraduate trends. The depicted losses at the undergraduate level could very well lead to similar losses at the doctoral level in the coming decade, which would hamper the widely shared goal to tenure more female professors.

The change for computer sciences is especially important since it provides a basis for the vast, well-documented media and academic focus on women in the field. (Planet Money brought the decline in percentage of female computer science majors to the attention of many in 2014.) The discipline experienced a loss in female representation at the undergraduate level that was more than twice the size of that in any other subject, including physics (-15.6%), earth sciences (-12.2%), and economics (-11.9%).

While the previous discussion of innate talent and stereotype threat focused on math-intensive fields, a category computer sciences fall into, I would argue that this recent decade has seen the effect of those forces on a growing realm of code-intensive fields. The use of computer programming and statistical software has become a standard qualification for many topics in physics, statistics, economics, biology, astronomy, and other fields. In fact, completing degrees in these disciplines now virtually requires coding in some way, shape, or form.

For instance, in my experience, one nontrivial hurdle that stands between students and more advanced classes in statistics or economics is the time necessary to understand how to use software such as R and Stata. Even seemingly simple tasks in these two programs require some basic level of comfort with structuring commands—an understanding that is not taught in these classes, but rather mentioned as a quick and seemingly obvious sidebar. Despite my extensive coursework in economics and mathematics, I am quick to admit that I only became comfortable with Stata via independent learning in a summer research context, and R via pursuing projects for this blog many months after college graduation.

The implications of coding’s expanding role in many strains of scientific research should not be underestimated. If women are not coding, they are not just missing from computer science—they will increasingly be missing from other disciplines which coding has seeped into.

The big picture: present–future

In other words, I would argue academia is currently faced with the issue of improving female representation in code-intensive fields.[4] As is true with math-intensive fields, the stereotypical beliefs of teachers and parents of younger children “become part of the self-fulfilling belief systems of the children themselves from a very early age” that discourage women from even attempting to enter code-intensive fields. These beliefs when combined with Ellenberg’s described “cult of genius” (a mechanism that surrounded mathematics and now also applies to the atmosphere in computer science) are especially dangerous.

Given the small percentage of women in these fields at the undergraduate level, there is limited potential growth in female representation along the academic pipeline—that is, at the doctoral and professorial levels. While coding has opened up new, incredible directions for research in many of the sciences, its evolving importance also can yield gender imbalances due to the same dynamics that underlie underrepresentation in math-intensive fields.

Footnotes

[1] Unfortunately, we cannot extend this year range back before 2002 since earlier numbers were solely presented for broader discipline categories, or parent science categories—economics and anthropology would be grouped under the broader term “social sciences,” while astronomy and chemistry would be included under the term “physical sciences.”

[2] The NSF differentiates between science and engineering as the latter is often described as an application of the former in academia. While engineering displays an enormous gender imbalance in favor of men, I limit my discussion here to disciplines that fall under the NSF’s science category.

[3] The latter viewpoint does have some scientific backing. The paper “Nonlinear Psychometric Thresholds for Physics and Mathematics” supports the notion that while greater work ethic can compensate for lesser ability in many subjects, those below some threshold of mathematical capacities are very unlikely to succeed in mathematics and physics coursework.

[4] On a positive note, atmospheric sciences, which often involves complex climate modeling techniques, has experienced large gains in female representation at the undergraduate level.

Speaking of coding…

Check out my relevant GitHub repository for all data and R scripts necessary for reproducing these visuals.

Thank you to:

Ally Seidel for all the edits over the past few months! & members of NYC squad for listening to my ideas and debating terminology with me.


© Alexandra Albright and The Little Dataset That Could, 2016. Unauthorized use and/or duplication of this material without express and written permission from this blog’s author and/or owner is strictly prohibited. Excerpts, accompanying visuals, and links may be used, provided that full and clear credit is given to Alex Albright and The Little Dataset That Could with appropriate and specific direction to the original content.

The Curious Case of The Illinois Trump Delegates

Intro

This past Wednesday, after watching Hillary Clinton slow-motion strut into the Broad City universe and realizing that this election has successfully seeped into even the most intimate of personal rituals, I planned to go to sleep without thinking any more about the current presidential race. However, somewhere in between Ilana’s final “yas queen” and hitting the pillow, I saw David Wasserman’s FiveThirtyEight article “Trump Voters’ Aversion To Foreign-Sounding Names Cost Him Delegates.”

Like many readers I was immediately drawn to the piece’s fundamentally ironic implication that Trump could have lost delegates in Illinois due to the very racial resentment that he espouses and even encourages among his supporters. The possibility that this could be more deeply investigated was an energizing idea, which had already inspired Evan Soltas to do just that as well as make public his rich-in-applications-and-possibilities dataset. With this dataset in hand, I tried my hand at complementing the ideas from the Wasserman and Soltas articles by building some visual evidence. (Suffice it to say I did not end up going to sleep for a while.)

To contribute to the meaningful work that the two articles have completed, I will first quickly outline their scope and conclusions, and then present the visuals I’ve built using Soltas’ publicly available data. Consider this a politically timely exercise in speedy R scripting!

Wasserman’s FiveThirtyEight piece & Soltas’ blog post

In the original article of interest, Wasserman discusses the noteworthy Illinois Republican primary. He explains that,

Illinois Republicans hold a convoluted “loophole” primary: The statewide primary winner earns 15 delegates, but the state’s other 54 delegates are elected directly on the ballot, with three at stake in each of the state’s 18 congressional districts. Each campaign files slates of relatively unknown supporters to run for delegate slots, and each would-be delegate’s presidential preference is listed beside his or her name.

Given that the delegates are “relatively unknown,” one would assume that delegates in the same district who list the same presidential preference would earn similar numbers of votes. However, surprisingly, Wasserman found that this did not seem to be the case for Trump delegates. In fact, there is a striking pattern in the Illinois districts with the 12 highest vote differentials: “[i]n all 12 cases, the highest vote-getting candidate had a common, Anglo-sounding name” while “a majority of the trailing candidates had first or last names most commonly associated with Asian, Hispanic or African-American heritages.” These findings, while admittedly informal, strongly suggest that Trump supporters are racially biased in their delegate voting behaviors.

Soltas jumps into this discussion by first creating a dataset on all 458 people who ran for Illinois Republican delegate spots. He merges data on the individuals’ names, districts, and candidate representation with a variable that could be described as a measure of perceived whiteness–the non-Hispanic white percentage of the individual’s last name, as determined from 2000 US Census data. The inclusion of this variable is what makes the dataset so exciting (!!!) since, as Soltas explains, this gives us an “objective measure to test the phenomenon Wasserman discovered.”

The article goes on to confirm the legitimacy of Wasserman’s hypothesis. In short, “Trump delegates won significantly more votes when they had ‘whiter’ last names relative to other delegates in their district” and this type of effect does not exist for the other Republicans.

Visual evidence time

I now present a few visuals I generated using the aforementioned dataset to see Soltas’ conclusions for myself. First things first, it’s important to note that some grand underlying mechanism does not jump out at you when you simply look at the association between perceived whiteness and vote percentage for all of Trump’s Illinois delegates:

fig1

The above graph does not suggest any significant relationship between these two numbers attached to each individual delegate. This is because the plot shows delegates across all different districts, which will vote for Trump at different levels, but compares their absolute variable levels. What we actually care about is comparing voting percentages within the same district, but across different individuals who all represent the same presidential hopeful. In other words, we need to think about the delegates relative to their district-level context. To do this, I calculate vote percentages and whiteness measures relative to the district: the percentage point difference between a Trump delegate’s vote|whiteness percentage and the average Trump delegate vote|whiteness percentage in that district. (Suggestions welcome on different ways of doing this for visualization’s sake!)
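For the curious, here is a minimal sketch of the district-relative calculation described above. It is written in Python rather than the R used for the actual plots, and both the rows and the function name `relative_to_district` are made up for illustration; none of the numbers come from Soltas’ dataset.

```python
from collections import defaultdict

# Hypothetical rows for one candidate's delegates:
# (district, vote percentage, whiteness percentage of last name).
delegates = [
    (1, 30.0, 90.0),
    (1, 28.0, 60.0),
    (1, 32.0, 96.0),
    (2, 20.0, 80.0),
    (2, 22.0, 85.0),
    (2, 18.0, 75.0),
]


def relative_to_district(rows):
    """Return (district, vote_diff, whiteness_diff) for each delegate, where
    each diff is in percentage points away from the district average."""
    by_district = defaultdict(list)
    for district, vote, whiteness in rows:
        by_district[district].append((vote, whiteness))

    # Average vote and whiteness percentage within each district.
    means = {}
    for district, values in by_district.items():
        mean_vote = sum(v for v, _ in values) / len(values)
        mean_whiteness = sum(w for _, w in values) / len(values)
        means[district] = (mean_vote, mean_whiteness)

    return [
        (district, vote - means[district][0], whiteness - means[district][1])
        for district, vote, whiteness in rows
    ]


relative = relative_to_district(delegates)
```

By construction the differences sum to zero within each district, so what remains is only the within-district variation we care about.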

fig2

Now that we are measuring these variables (vote percentage and whiteness measure) relative to the district, there is a statistically significant association beyond even the 0.1% level. (The simple linear regression Y~X in this case yields a t-statistic of 5.4!) In the end, the interpretation of the simplistic linear regression is that a 10 percentage point increase in a Trump delegate’s perceived whiteness relative to the district is associated with a 0.12 percentage point increase in the delegate’s vote percentage relative to the district. (I’m curious if people think there is a better way to take district levels into account for these visuals–let me know if you have any thoughts that yield a simpler coefficient interpretation!)
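To make the slope and t-statistic above less of a black box, here is a rough Python sketch of a simple linear regression computed by hand. The data points and the name `simple_ols` are illustrative assumptions, not the delegate data, so the numbers it produces will not match the ones quoted in this post.

```python
import math


def simple_ols(x, y):
    """Fit y = intercept + slope * x by least squares.
    Returns (slope, intercept, t-statistic of the slope)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))

    slope = sxy / sxx
    intercept = mean_y - slope * mean_x

    # Residual variance uses n - 2 degrees of freedom (two fitted parameters).
    rss = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    se_slope = math.sqrt(rss / (n - 2) / sxx)
    return slope, intercept, slope / se_slope


# Hypothetical district-relative measures, in percentage points.
rel_whiteness = [-20.0, -10.0, 0.0, 10.0, 20.0]
rel_vote = [-0.3, -0.1, 0.05, 0.1, 0.25]

slope, intercept, t_stat = simple_ols(rel_whiteness, rel_vote)
effect_per_10_points = 10 * slope  # change in relative vote pct per 10 points of whiteness
```

Multiplying the slope by 10 is what turns the raw coefficient into the "10 percentage point increase" reading used in the text.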

The last dimension of this discussion requires comparing Trump to the other Republican candidates. Given the media’s endless coverage of Trump, I would not have been surprised to learn that this effect impacts other campaigns but just was never reported. But Wasserman and Soltas argue that this is not the case. Their claims are further bolstered by the following visual, which recreates the most recent Trump plot for all 9 candidates who had sufficient data (excludes Gilmore, Huckabee, and Santorum):

fig3

It should be immediately clear that Trump is the only candidate for whom there is a positive statistically significant association between the two relative measures. While Kasich has an upward sloping regression line, the corresponding 95% confidence interval demonstrates that the coefficient on relative perceived whiteness is not statistically significantly different from 0. Employing the whiteness measure in this context allows us to provide quantitative evidence for Wasserman’s original intuition that this effect is unique to Trump–thus, “lend[ing] credibility to the theory that racial resentment is commonplace among his supporters.”

The role of perceptions of whiteness

Wasserman’s article has incited an outpouring of genuine interest over the past few days. The fascinating nature of the original inquiry combined with Soltas’ integration of a perceived whiteness measure into the Illinois delegate dataset provides a perfect setting in which to investigate the role racial resentment is playing in these particular voting patterns, and in the election on the whole.

Code

My illinois_delegates GitHub repo has the R script and csv file necessary to replicate all three visuals! (We know data, we have the best data.)


EDUANALYTICS 101: An Investigation into the Stanford Education Space using Edusalsa Data

Update [4-18-16]: Thanks to Stuart Rojstaczer for finding an error in my grade distribution histograms. Just fixed them and uploaded the fixed R script as well. Such is the beauty of internet feedback!

Note: This is the first in a series of posts that I am putting together in partnership with Edusalsa, an application based at Stanford that seeks to improve how college students explore and choose their courses. Our goal in these posts is to take advantage of the unique data collected from students’ use of the application in order to learn more about how to model and discuss the accumulation and organization of knowledge within the Stanford community as well as within the larger, global education space. (You can read the post here too.)

Course Syllabus

You are frozen in line. This always happens. You don’t know whether to pick the ENGLISH, PHYSICS with double CS combo that you always order or whether to take a risk and try something new. There are thousands of other options; at least a hundred should fit your strict requirements and picky tastes…Hey, maybe you’d like a side of FRENCH! But now you don’t even know what you should get on it: 258, 130, or 128. You are about to ask which of the three goes best with ENGLISH 90 when you wake up.

You realize you missed lunch… and you need to get out of the library.

Complex choices, those with a large number of options (whether in a deli or via online course registration), often force individuals to make choices haphazardly. In the case of academics, students find themselves unable to bulldoze their way through skimming all available class descriptions, and, accordingly, pick their classes with the help of word of mouth and by simply looking through their regular departments’ offerings. However, it is undoubtedly the case that there are ways to improve matching between students and potential quarterly course combinations.

In order to better understand how one could improve the current course choice mechanism, one must first better understand the Stanford education space as well as the myriad of objects (courses, departments, and grades) and actors (students and professors) that occupy it. The unique data collected from students’ use of Edusalsa provides an opportunity to do just this. In this post, organized in collaboration with the Edusalsa team, we will use this evolving trove of data to discuss three overarching questions: [1] How can we measure the interest surrounding, or the popularity of, a course/department? (In conjunction with that question, what should we make of enrollment’s place in measuring interest or popularity?) [2] What is the grade distribution at Stanford, on the whole as well as on the aggregate school-level? [3] How do students approach using new tools for course discovery?

[1] How can we measure the interest surrounding, or the popularity of, a course/department?

One of the first areas of interest that can be examined with the help of Edusalsa’s data is Stanford student interest across courses and departments. Simply put, we can use total views on Edusalsa, aggregated both by course and by department, as a proxy for interest in, or the popularity of, a course. [See technical footnote 1 (TF1)] In order to visualize the popularity of a collection of courses and departments, we use a treemap structure to illustrate the relative popularities of two sets of academic objects: (1) all courses that garnered at least 20 views, and (2) all departments that garnered at least 30 views: [TF2]

course_tree

dept_tree

The size of the rectangles within the treemap corresponds to the number of views while the darkness of the color corresponds to the estimated enrollment by quarter for classes and entire departments. We notice that, at the course-level, the distribution of colors throughout the rectangles seems disorganized over the size dimension. In other words, there does not seem to be a strong relationship between enrollment and views at the course level. On the other hand, from a cursory look at the second graph, the department treemap seems to illustrate that departments with larger aggregate enrollments (that is, the sum of all enrollments for all classes in a given department) have more views.

What should we make of enrollment’s place in measuring interest or popularity?

While these particular treemaps are useful for visually comparing the number of views across courses and departments, they do not clarify what, if any, is the nature of the relationship between enrollment and views for these two subsets of all courses and departments. [TF2] Due to the treemaps’ analytic shortcomings, we address the legitimacy of our previous intuitions about the relationship by simply regressing views on enrollment at both the course- and department-level. See below for the relevant plot at the course-level:

course_scatter

The coefficient on enrollment in the simple linear regression model, represented by the blue line in the above plot, while positive, is not statistically significant. We can also see this is the case when considering the width of the light green area above (the 99% confidence interval) and the more narrow gray area (the 95% confidence interval), as both areas comfortably include an alternative version of the blue regression line for which the slope is 0. The enrollment variable’s lack of explanatory power is further bolstered by the fact that, in this simple regression model framework, enrollment variation only accounts for 1.3% of the variation in views.

We now turn to the department-level, which seemed more promising from our original glance at the association between colors and sizes in the relevant treemap:

dept_scatter

In this case, the coefficient on enrollment is statistically significant at the 0.1% level and communicates that, on average, a 1,000 person increase in enrollment for a department is associated with an increase of 65 views on Edusalsa. The strength of the association between enrollment and views is further evidenced by the 95% and 99% confidence intervals. In fact, the explanatory power of the enrollment variable in this context is strong to the point that the model accounts for 53.9% of variation in Edusalsa views. [TF3]
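Since "share of variation explained" carries a lot of weight in the two regressions above, here is a small Python sketch of how that quantity (R^2) and a "per 1,000 students" slope can be computed for a one-variable regression. The department figures and the name `slope_and_r_squared` are invented for illustration and are not the Edusalsa data.

```python
def slope_and_r_squared(x, y):
    """For the simple regression of y on x, return (slope, R^2).
    With a single regressor, R^2 equals the squared correlation of x and y."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    return sxy / sxx, sxy ** 2 / (sxx * syy)


# Hypothetical department-level data: total enrollment and total views.
enrollment = [500, 1000, 2000, 4000, 6000]
views = [40, 60, 150, 270, 330]

slope, r2 = slope_and_r_squared(enrollment, views)
views_per_1000_students = 1000 * slope  # rescale the slope to a 1,000 person change
```

An R^2 near 0 (as at the course level) means enrollment tells us almost nothing about views, while a value near 1 means the points hug the regression line.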

Theory derived from the comparison of course-level and department-level relationships

The difference between the strength of enrollment’s relationship with views at the course and at the department level is clear and notable. I believe that this difference is attributable to the vast heterogeneity in interest across courses, meaning there is extreme variance in terms of how much interest a course garners within a given department. Meanwhile, the difference in interest levels that is so evident across courses disappears at the department-level, once all courses are aggregated. This observation potentially serves as evidence of a current course search model in which students rigidly search within specific departments based on their requirements and fields of study, but then break up their exploration more fluidly at the course-level based on what they’ve heard is good or which classes look the most interesting etc. While the students know what to expect from departments, courses can stand out via catchy names or unique concepts in the description.

More possible metrics, and way more colors…

There are a few other metrics beyond views and enrollment that we might be interested in when trying to assess or proxy for interest surrounding a course or department. In order to compare some of these alternative metrics across various departments we present the below heat map, which serves to relatively compare a set of six metrics across the top 15 departments by enrollment size:

heat

While we have discussed enrollment before, I also include number of courses in the second column as an alternative measurement of the size of the department. Rather than defining size by number of people who take classes in the department, this defines size by the number of courses the department offers. The darker greens of CEE, Education, and Law illustrate that these are the departments parenting the most courses.

Another new metric in the above is the fifth column, a metric for number of viewers, which refers to the number of unique individuals who visited a course page within a department. The inclusion of this measurement allows us to avoid certain individuals exerting improperly large influence over our measures. For example, one person who visits Economics course pages thousands of times won’t be able to skew this metric though she could skew the views metric significantly. Note that the columns for number of views and number of viewers are very similar, which indicates that, beyond some individuals in EE, departments had individuals viewing courses at similar frequencies.

The last new concept we introduce in the heat map is the notion of normalizing by enrollment, seen in columns four and six, so as to define metrics that take into account the size of the Stanford population that is already involved with these departments. Normalizing views and viewers in this way makes a large impact. Most notably, CS is no longer the dominant department, and instead shares the stage with other departments like Psychology, MS&E, MEE, etc. This normalized measure could be interpreted to proxy for the interest outside of the core members of the department (e.g., majors and planned majors), in which case Psychology is certainly looking interesting to those on the outside looking in.
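As a rough illustration of what normalizing by enrollment does to a ranking, consider the Python sketch below. The department numbers and the helper `per_enrolled` are hypothetical; they are chosen only to show how a large department can lead on raw views yet fall behind once enrollment is taken into account.

```python
# Hypothetical department totals (not the actual Edusalsa figures).
departments = {
    "CS": {"enrollment": 6000, "views": 900, "viewers": 300},
    "PSYCH": {"enrollment": 1000, "views": 300, "viewers": 120},
    "MS&E": {"enrollment": 1500, "views": 330, "viewers": 110},
}


def per_enrolled(stats, metric, per=1000):
    """Scale a metric to a rate per `per` enrolled students."""
    return {
        dept: per * values[metric] / values["enrollment"]
        for dept, values in stats.items()
    }


views_per_1000 = per_enrolled(departments, "views")
viewers_per_1000 = per_enrolled(departments, "viewers")

# Rankings before and after normalizing.
top_raw = max(departments, key=lambda d: departments[d]["views"])
top_normalized = max(views_per_1000, key=views_per_1000.get)
```

In this toy example the raw leader and the normalized leader differ, which is the same kind of reshuffling visible between the raw and normalized columns of the heat map.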

[2] What is the grade distribution at Stanford, on the whole as well as on the aggregate school-level?

The second topic that we cover in this post pertains to that pesky letter attached to a course–that is, grades. Our obtained data included grade distributions by course. [TF4] We use this data to build the frequency distribution for all grades received at Stanford. The following histogram illustrates that the most commonly received grade during the quarter was an A while the median grade was an A- (red line) and the mean grade was a 3.57 (blue line):

stanford_dist
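For reference, the mode, median, and mean of a grade distribution can be computed directly from frequency counts. The Python sketch below assumes a grade point scale in which an A+ is worth 4.3 and uses invented counts; neither the counts nor the function names come from the actual dataset.

```python
# Assumed grade point scale (A+ = 4.3) and hypothetical counts of grades received.
GRADE_POINTS = {"A+": 4.3, "A": 4.0, "A-": 3.7, "B+": 3.3,
                "B": 3.0, "B-": 2.7, "C+": 2.3, "C": 2.0}
counts = {"A+": 5, "A": 30, "A-": 25, "B+": 20,
          "B": 12, "B-": 5, "C+": 2, "C": 1}


def modal_grade(freq):
    """Most commonly received grade."""
    return max(freq, key=freq.get)


def mean_grade(freq):
    """Average grade points, weighting each grade by how often it occurs."""
    total = sum(freq.values())
    return sum(GRADE_POINTS[g] * n for g, n in freq.items()) / total


def median_grade(freq):
    """Walk up from the lowest grade until half of all grades are covered.
    With an even total this returns the lower of the two middle grades."""
    total = sum(freq.values())
    running = 0
    for grade in sorted(freq, key=GRADE_POINTS.get):
        running += freq[grade]
        if 2 * running >= total:
            return grade


mode, median, mean = modal_grade(counts), median_grade(counts), mean_grade(counts)
```

Because the distribution is bunched at the top of the scale with a tail of lower grades, the mean sits below the mode, which is the same pattern the histogram shows.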

While this visual is interesting in and of itself since it presents all Stanford course offerings solely by grade outcomes, it would also be meaningful to compare different subsets of the Stanford education space. In particular, we choose to use a similar technique to compare grading distributions across the three schools at Stanford–the School of Humanities & Sciences, the School of Engineering, and the School of Earth, Energy and Environmental Sciences–in order to see whether there is any notable difference across the groups:

school_dist

The histograms for the three schools present incredibly similar distributions–to the extent that at first I thought I mistakenly plotted the same school’s distribution three times. All three have medians of A- and the means span a narrow range of 0.08; the means are 3.52, 3.60, and 3.58 for the Humanities & Sciences, Engineering, and Earth Sciences schools, respectively. [TF5]

[3] How do students approach using new tools for course discovery?

Since we have discussed views and other metrics both across classes and departments, it is worth mentioning what the Edusalsa metrics look like over individual users. Specifically, we are curious how many times unique users view courses through Edusalsa. In examining this, we are inherently examining the level of “stickiness” of the site and the aggregated view of how users interact with new course tools. In this case, the stickiness level is low, as illustrated below by both (i) a quickly plunging number of unique individuals as the number of course views grows, and (ii) a linear decline of number of unique individuals as the number of course views grows when using a log-log plot. [TF6]

stick

The negative linear relationship between the log transformed variables in the second panel (evidenced by the good fit of the above blue line) is indicative of a power-law (monomial) relationship, with a negative exponent, between number of course views and number of unique individuals. [TF7] This simply indicates that, as is the case with most new applications, so-called stickiness is low. It will be interesting to see whether this changes given the new addition of the ability to create an account.
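To see why a straight line in a log-log plot points to a relationship of the form y = a*x^n, here is a minimal Python sketch that generates data from a known power law and recovers its parameters with a linear fit on the logged values. The constants (a = 1000, n = -1.5) and the name `fit_power_law` are arbitrary choices for illustration, not estimates from Edusalsa data.

```python
import math


def fit_power_law(x, y):
    """Fit y = a * x**n by regressing log(y) on log(x).
    The slope of that line is n and the intercept is log(a)."""
    log_x = [math.log(v) for v in x]
    log_y = [math.log(v) for v in y]
    count = len(log_x)
    mean_x = sum(log_x) / count
    mean_y = sum(log_y) / count
    sxx = sum((v - mean_x) ** 2 for v in log_x)
    sxy = sum((u - mean_x) * (v - mean_y) for u, v in zip(log_x, log_y))
    n = sxy / sxx
    a = math.exp(mean_y - n * mean_x)
    return a, n


# Synthetic example: the number of users falls off as a power of course views.
course_views = [1, 2, 4, 8, 16]
unique_users = [1000 * v ** -1.5 for v in course_views]

a, n = fit_power_law(course_views, unique_users)
```

The recovered exponent is negative, which is what a downward-sloping line in the log-log panel corresponds to.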

School’s out (for summer)

Our key insights in this post lie in the depths of section [1], which discussed

evidence of a current course search model in which students rigidly search within specific departments based on their requirements and fields of study, but then break up their exploration more fluidly at the course-level

With evolving data collection, we will continue to use Edusalsa data in order to learn more about the current course search model as well as the specific Stanford education space. Future steps in this line of work will include analyzing the dynamics between departments and the courses that populate them using network analysis techniques. (There is a slew of possible options on this topic: mapping out connections between departments based on overlap in the text of course descriptions, number of cross-listings, etc.)

There is ample room for tools in the education space to help students search across conventional departments, rather than strictly within them, and understanding the channels through which individuals most naturally categorize or conceptualize courses constitutes a large chunk of the work ahead.

Technical footnotes
  1. Edusalsa views by course refers to the number of times an individual viewed the main page for a course on the site. Technically, this is when the data.url that we record includes the suffix “/course?c=DEPT&NUM” where DEPT is the department abbreviation followed by the number of the course within the department. Views aggregated by department is equivalent to the sum total of all views for courses that are under the umbrella of a given department.
  2. We only illustrate courses with at least 20 views and departments with at least 30 views in order that they will be adequately visible in the static treemap. Ideally, the views would be structured in an interactive hierarchical tree structure in which one starts at the school level (Humanities & Sciences, Engineering, Geosciences) and can venture down to the department level followed by the course level.
  3. Though it might seem as though Computer Science is an outlier in this dataset whose omission could fundamentally alter the power of the simple regression model, it turns out even after omitting CS the coefficient on enrollment remains significant at the 0.1% level while the R^2 remains high as well at approximately 0.446.
  4. The grade distribution data is self-reported by Stanford students over multiple quarters.
  5. While the distributions are very similar aggregated over the school level, I doubt they would be as similar at the smaller, more idiosyncratic department-level. This could be interesting to consider across similar departments, such as ME, EE, CEE, etc. It could also be interesting to try and code all classes at Stanford as “techie” or “fuzzy” a la the quintessential Stanford student split and see whether those two grade frequency distributions are also nearly identical.
  6. We found that ID codes we use to identify individuals can change over people in the long-run. We believe this happens rarely in our dataset; however, it is worth noting nonetheless. Due to this caveat, some calculations could be over- or underestimates of their true values. For instance, the low stickiness for Edusalsa views could be overestimated as some of the users who are coded as distinct people are the same. Under the same logic, in the heat table, the number of viewers could be an overestimate.
  7. The straight line fit in a log-log plot indicates a monomial relationship form. Monomials, i.e. polynomials with one term such as y=ax^n, appear as straight lines in log-log plots such that n and log(a) correspond to the slope and intercept, respectively.

Code and replication

All datasets and R scripts necessary to recreate these visuals are available at my edusalsa GitHub repo!



Geography of Humor: The Case of the New Yorker Caption Contest

Update [9-23-15]: Also check out the newest work on this topic: Which U.S. State Performs Best in the New Yorker Caption Contest?

Intro.

About 10 years ago The New Yorker began a weekly contest. It was not a contest of writing talents in colorful fiction nor of investigative prowess in journalism; instead, it was a contest of short and sweet humor. Write a caption for a cartoon, they said. It’ll be fun, they said. This will help our circulation, the marketing department said. Individuals like me, who back at age 12 in 2005 believed The New Yorker was the adult’s version of Calvin and Hobbes that they most enjoyed in doctors’ waiting rooms, embraced the new tradition with open arms.

Now, 10 years later, approximately 5,372 captions are submitted each week, and just a single winner is picked. Upon recently trying my own hand (and failing unsurprisingly given the sheer magnitude of competing captions) at the contest, I wondered, who are these winners? In particular, since The New Yorker always prints the name and place of residence of the caption contest winner, I wondered, what’s the geographical distribution of these winners? 

In order to answer this question, I used my prized subscriber access to the online Caption Contest archive. This archive features the winning caption for each week’s cartoon (along with two other finalist captions) and the name/place of residence of the caption creator. (The archives also feature all other submitted captions–which is super interesting from a machine learning perspective, but I don’t focus on that in this piece.) So, I snagged the geographic information on the past 10 years of winners and went with it.

The basics

For this analysis, I collected information on the first 466 caption contests–that is, all contests up to and including the following:

The New Yorker Caption Contest #466

Before getting into the meat of this discussion, it is worth noting the structure of the contest as well as the range of eligible participants. See this quick explanation from The New Yorker:

Each week, we provide a cartoon in need of a caption. You, the reader, submit your caption below, we choose three finalists, and you vote for your favorite… Any resident of the United States, Canada (except Quebec), Australia, the United Kingdom, or the Republic of Ireland age eighteen or older can enter or vote.

Thus, the contest consists of two rounds: one in which the magazine staff sift through thousands of submissions and pick just three, and one in which the public votes on the ultimate winner out of the three finalists. Furthermore, the contest is open to residents outside the United States–a fact that is easy to forget when considering how rarely individuals from other countries actually win. Out of 466 caption contest winners, only 12 are from outside the United States–2 from Australia, 2 from British Columbia (Canada), and 8 from Ontario (Canada). Though they are allowed to compete, no one from the United Kingdom or the Republic of Ireland has ever won. In short, 97.42% of caption contest winners (454 of 466) are from the U.S.

Moving to the city-level of geography, it is unsurprising that The New Yorker Caption Contest is dominated by, well, New Yorkers. New York City has 62 wins, meaning New Yorkers have won 13.3% of the contests. In order to fully understand how dominant this makes New York, consider the fact that the city with the next most caption contest wins is Los Angeles with a mere 18 wins (3.9% of contests). The graphic below depicting the top 8 caption contest cities further highlights New York’s exceptionalism:

cities

Source: New Yorker Caption Contest Archive; Tool: ggplot2 package in R.
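The city percentages quoted above are simple shares of the 466 contests. As a quick check, here is a small Python sketch of that arithmetic; the win counts are the ones stated in this post, while the helper name `win_share` is just an illustrative choice.

```python
TOTAL_CONTESTS = 466  # contests covered in this analysis

# Win counts as stated in the text.
city_wins = {"New York": 62, "Los Angeles": 18}


def win_share(wins, total=TOTAL_CONTESTS):
    """Percentage of all contests won."""
    return 100 * wins / total


shares = {city: round(win_share(wins), 1) for city, wins in city_wins.items()}

# How many times more often New York wins than the runner-up city.
ny_to_la_ratio = city_wins["New York"] / city_wins["Los Angeles"]
```

The ratio makes the gap concrete: New York has won more than three times as many contests as the second-place city.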

The geographic distribution: a state-level analysis

While both the country- and city-level results are dominated by the obvious contenders (the United States and New York City respectively), the state-level analysis is much more compelling.

In this vein, the first question to address is: which states win the most contests? To answer this, I present the following choropleth in which the states are divided into five categories of equal size (each category contains 10 states) based on the number of contests won. (This method uses quantiles to classify the data into ranges; however, there are other methods one could use as well.) Visualizing the data in this way allows us to quickly perceive areas of the country that are caption-winner-rich as well as caption-winner-sparse:

totalwins

Source: New Yorker Caption Contest Archive; Tool: choroplethr package in R.
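To illustrate the idea of quantile classification behind this map, here is a minimal Python sketch that sorts regions by wins and splits them into equally sized groups. It is a simplified stand-in, not the choroplethr package's own procedure: ties are broken by sort order, and the region labels and win counts are hypothetical.

```python
from collections import Counter


def quantile_bins(values, k=5):
    """Assign each key to one of k equally sized bins (0 = fewest wins).
    Assumes ranking by value; ties fall wherever the sort places them."""
    ordered = sorted(values, key=values.get)
    n = len(ordered)
    return {name: (rank * k) // n for rank, name in enumerate(ordered)}


# Hypothetical win counts for ten placeholder regions.
wins = {"a": 0, "b": 0, "c": 1, "d": 2, "e": 3,
        "f": 5, "g": 8, "h": 14, "i": 75, "j": 85}

bins = quantile_bins(wins)
bin_sizes = Counter(bins.values())

# Range of wins covered by the top bin, which can be very wide.
top_bin = [wins[name] for name, b in bins.items() if b == 4]
top_bin_range = (min(top_bin), max(top_bin))
```

Every bin holds the same number of regions, but nothing constrains how wide the range of values inside a bin is, which is why the darkest category on a quantile map can lump together very different totals.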

This visualization illustrates that the most successful caption contest states are either east coast or west coast states, with the exception of Illinois (due to Chicago’s 16 wins). The most barren section of the country is unsurprisingly the center of the country. (In particular, Idaho, Kansas, North/South Dakota, West Virginia, and Wyoming have never boasted any caption contest winners.)

While using quantiles to classify the data into ranges is helpful, it gives us an overly broad last category–the darkest blue class contains states with win totals ranging from 14 to 85. If we want to zoom in and compare the states within this one category, we can pivot to a simple bar chart for precision’s sake. The following graph presents the number of contests won among the top ten states:

top10

Source: New Yorker Caption Contest Archive; Tool: ggplot2 package in R.

New York and California are clearly the most dominant states with 85 and 75 wins respectively, which is to be expected considering how populous the two are. Taking into account the population size of a given state would most definitely yield a superior metric in terms of how well each state does in winning the contest. (It would also be interesting to take into account the number of The New Yorker subscribers by state, but I haven’t been able to get a hold of that data yet, so I am putting a pin in that idea for now.)

Therefore, I normalize these counts by creating a new metric: number of caption contests won per one million state residents. In making this change, the map colors shift noticeably. See the following choropleth for the new results:

permill

Source: New Yorker Caption Contest Archive; Tool: choroplethr package in R.
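The normalization itself is a one-line calculation. The sketch below is in Python rather than the R used for the figures; the win totals for New York and California come from the bar chart above, while the rounded population figures are assumptions for illustration and not the exact values behind the map.

```python
def wins_per_million(wins, population):
    """Contest wins per one million residents."""
    return wins / population * 1_000_000

# (wins, population): wins are from the article; populations are rough, rounded assumptions.
states = {
    "New York": (85, 19_700_000),
    "California": (75, 38_800_000),
}

rates = {name: wins_per_million(w, p) for name, (w, p) in states.items()}

# Print states from highest to lowest rate.
for name, rate in sorted(rates.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: {rate:.2f} wins per million")
```

Under these assumed populations, California's rate is less than half of New York's even though its raw win total is close, which is the kind of reshuffling the per-million map shows.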

Again, the last category is the one with the broadest range (2.425 to 7.991 wins per million residents). So, once more, it is worth moving away from cool, colorful choropleths and towards the classical bar chart. In comparing the below bar graph with the previous one, one can quickly see the difference made in normalizing by population:

top10cap

Source: New Yorker Caption Contest Archive; Tool: ggplot2 package in R.

For one, the once dominant New York falls behind new arrivals Vermont and Rhode Island while the similarly previously dominant California is nowhere to be seen! Other states that also lose their place among the top ten are: Illinois, New Jersey, and Pennsylvania. Meanwhile, the four new states in this updated top ten are: Alaska and New Hampshire as well as the previously mentioned Rhode Island and Vermont. Among these four new arrivals, Vermont stands clearly ahead of the pack with approximately 8 caption contest wins per million residents.

The high counts per million for states like Vermont and Rhode Island suggest a relationship that many were likely considering throughout this entire article–isn’t The New Yorker for liberals? Accordingly, isn’t there a relationship between wins per million and liberalness?

Those damn liberal, nonreligious states

Once we have normalized caption contest wins by population, we still have not completely normalized states by their likelihood of winning the contest. This is due to the fact that there is a distinct relationship between wins per million residents and evident political markers of The-New-Yorker-types. In particular, consider Gallup’s State of the States measures of “% liberal” and “% nonreligious.” First, I present the strong association between liberal percentages and wins per million:

libs

Source: New Yorker Caption Contest Archive; Tool: ggplot2 package in R.

The above is a scatterplot in which each point is a state (see the small state abbreviation labels) and the blue line is a linear regression line (the shaded area is the 95% confidence region) fit to the data. The conclusion is unmistakable; states that are more liberal tend to win more contests per million residents. Specifically, the equation for the linear regression line is:

wins_per_million = -3.13 + 0.22(pct_liberal)

This means that a 1 percentage point increase in the liberal percentage is associated with an increase of 0.22 wins per million. The R^2 (in this case, the same as the square of the basic correlation coefficient r between wins_per_million and pct_liberal, since there is just one explanatory variable in the regression) is 0.364, meaning that 36.4% of response variable variation is explained by this simple model. (The standard error on the coefficient attached to pct_liberal is only 0.04, meaning the coefficient is easily statistically significant at the 0.1% level.)
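To make these numbers concrete, here is a small Python sketch that plugs the reported estimates back into the fitted line. It does not re-fit anything: the constants are the figures quoted above, and the function and variable names are made up for illustration.

```python
import math

# Estimates reported in the text for the pct_liberal regression.
INTERCEPT = -3.13
SLOPE = 0.22
SLOPE_SE = 0.04
R_SQUARED = 0.364

def predicted_wins_per_million(pct_liberal):
    """Point prediction from the fitted line for a given liberal percentage."""
    return INTERCEPT + SLOPE * pct_liberal

# Moving from 20% to 30% liberal changes the prediction by 10 * slope.
gap = predicted_wins_per_million(30) - predicted_wins_per_million(20)

# t-statistic for the slope: the estimate divided by its standard error.
t_stat = SLOPE / SLOPE_SE

# With one explanatory variable, r is the square root of R^2
# (taken as positive here because the slope is positive).
r = math.sqrt(R_SQUARED)

print(f"gap = {gap:.2f}, t = {t_stat:.1f}, r = {r:.3f}")
```

The t-statistic of about 5.5 is what lies behind the significance claim: assuming 50 states and therefore 48 degrees of freedom, the two-sided cutoff at the 0.1% level is roughly 3.5.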

Also strong is the association between nonreligious percentages and wins per million, presented in the graph below:

nonreg

Source: New Yorker Caption Contest Archive; Tool: ggplot2 package in R.

This plot is very similar to the previous one, most likely because states with high liberal percentages tend to have high nonreligious percentages as well. The linear regression line that is fit for this data is:

wins_per_million = -1.37 + 0.09(pct_nonreligious)

The relevant conceptual interpretation is that a 1 percentage point increase in the nonreligious percentage is associated with an increase of 0.09 wins per million. The R^2 for this model is 0.316, so 31.6% of response variable variation is explained by the model. (Again, the coefficient of interest–this time the coefficient attached to pct_nonreligious–is statistically significant at the 0.1% level.)

These two graphs are simple illustrations of the statistically significant relationships between wins per million and two political markers of The New Yorker readership. In order to better understand the relationship between these variables, one must return to the structure of the contest…

The mechanism behind the success of liberal, nonreligious states

The caption contest is broken chronologically into three phases: (1) individuals submit captions, (2) three captions are selected as finalists by magazine staff, and (3) the public votes on their favorite caption.

It seems most likely that the mechanism behind the success of liberal, nonreligious states lies in the first phase. In other words, liberal, nonreligious people are more likely to read The New Yorker and/or follow the caption contest. (Its humor is unlikely to resonate with the intensely religious socially conservative.) Therefore, the tendency towards wins for liberal, nonreligious states is mostly a question of who chooses to participate.

It could also be the case that at least a part of the mechanism behind these states’ successes lies in phases (2) or (3). If a piece of this mechanism were snuggled up in phase (2), that would mean The New Yorker staff is inclined, due to an innate sense of liberal humor, to pick captions from specific states. (Yet, since most submissions are probably already from liberals, this seems unlikely–though maybe the reverse happens as the magazine attempts to foster geographic diversity by selecting captions from a broader range of locations? I don’t think that’s part of the caption selection process, but it could be relevant to the aforementioned mechanism if it were.) If the mechanism were instead hidden within the third phase, this would mean voters tend to vote for captions created by people from more nonreligious and liberal states in the country. One interesting element to note is that voters can see the place of residence of a caption creator–though I highly doubt this influences people’s voting choices, it is possible that regional favoritism is a factor (e.g., New Yorkers like to see other New Yorkers win and, therefore, the large number of New Yorker voters pushes the New Yorker caption submissions to win).

In order to better investigate the mechanism behind the success of nonreligious, liberal states, one needs access to the geographic data of all submissions…or, at least the data on the number of subscribers per state. Though one can submit to the contest without a subscription, the latter measure could still be used as a credible proxy for the former since the number of people who submit to the contest in a state is likely proportional to the number of subscribers in the state.

A thank you note

Thanks to my family for giving me a subscription to The New Yorker this past holiday season in a desperate attempt to help me become less of a philistine. My sincerest apologies that I have focused more on the cartoons than all those chunks of words that mark space in between.

How-about-never-cartoon

I’ll be sure to actually call you all up if I ever win–good news: if I enter every contest for the next ten years I’ll have approximately a 10% chance of winning just by chance alone.

Me & Bob Mankoff! (Cartoon Editor of The New Yorker and creator of the above cartoon)

Future work
  • Make maps interactive (using Mapbox/TileMill/QGIS and the like) and embed into page with the help of Sarah Michael Levine!
  • Look at captions per number of subscribers in a state (even though you can submit even if you’re not a subscriber–I assume submissions from a state would be proportional to the number of subscribers)
  • See if it’s possible to collect state data on all submitted captions in order to test hypotheses related to the mechanism behind the success of liberal, nonreligious states
  • Create predictive model with wins per million as the dependent variable
    • Independent variables could include proximity to New York or a dummy variable based on if the state is in northeast, income per capita, percent liberal, percent nonreligious, (use logs?) etc.
      • However, the issue with many of these is that there is likely to be multicollinearity since so many of these independent variables are highly correlated…Food for thought
        • In particular, it is not worthwhile to include both % liberal and % nonreligious in one regression (one loses statistical significance altogether and the other goes from the 0.1% level to the 5% level)
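One common way to quantify the multicollinearity concern raised in the list above is the variance inflation factor (VIF). With exactly two predictors it reduces to 1 / (1 - r^2), where r is the correlation between them. The Python sketch below uses made-up percentages for seven hypothetical states, not the Gallup data, and all names in it are illustrative.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)

def vif_two_predictors(xs, ys):
    """VIF for either predictor when the regression has exactly two predictors."""
    r = pearson_r(xs, ys)
    return 1.0 / (1.0 - r ** 2)

# Hypothetical values for seven made-up states (not real survey data).
pct_liberal = [15, 18, 20, 22, 25, 28, 30]
pct_nonreligious = [20, 26, 27, 33, 36, 43, 44]

r = pearson_r(pct_liberal, pct_nonreligious)
vif = vif_two_predictors(pct_liberal, pct_nonreligious)
print(f"r = {r:.3f}, VIF = {vif:.1f}")
```

A VIF far above 1 means the standard errors on both coefficients are inflated, which is consistent with the loss of significance described in the last bullet.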
Code

All data and R scripts needed to recreate all types of visualizations in this article (choropleths, bar charts, and scatterplots with linear regression lines) are available on my “NewYorker” GitHub repo.


© Alexandra Albright and The Little Dataset That Could, 2016. Unauthorized use and/or duplication of this material without express and written permission from this blog’s author and/or owner is strictly prohibited. Excerpts, accompanying visuals, and links may be used, provided that full and clear credit is given to Alex Albright and The Little Dataset That Could with appropriate and specific direction to the original content.

The Rise of the New Kind of Cabbie: A Comparison of Uber and Taxi Drivers

Intro

One day back in the early 2000’s, I commandeered one of my mom’s many spiral notebooks. I’d carry the notebook all around Manhattan, allowing it to accompany me everywhere from pizza parlors to playgrounds, while the notebook waited eagerly for my parents to hail a taxicab so it could fulfill its eventual purpose. Once in a cab, after clicking my seat belt into place (of course!), I’d pull out the notebook in order to develop one of my very first spreadsheets. Not the electronic kind, the paper kind. I made one column for the date of the cab ride, another for the driver’s medallion number (5J31, 3A37, 7P89, etc.) and one last one for the driver’s full name–both the name and number were always readily visible, pressed between two slabs of Plexiglas that intentionally separate the back from the front seat. Taxi drivers always seemed a little nervous when they noticed I was taking down their information–unsure of whether this 8-year-old was planning on calling in a complaint about them to the Taxi and Limousine Commission. I wasn’t planning on it.

Instead, I collected this information in order to discover if I would ever ride in the same cab twice…which I eventually did! On the day that I collected duplicate entries in the second and third columns, I felt an emotional connection to this notebook as it contained a time series of yellow cab rides that ran in parallel with my own development as a tiny human. (Or maybe I just felt emotional because only children can be desperate for friendship, even when it’s friendship with a notebook.) After pages and pages of observations, collected over the years using writing implements ranging from dull pencils to thick Sharpies, I never would have thought that one day yellow cabs would be eclipsed by something else…

Something else

However, today in 2015, according to Taxi and Limousine Commission data, there are officially more Uber cars in New York City than yellow cabs! This is incredible not just because of the speed of Uber’s growth but also since riding with Uber and other similar car services (Lyft, Sidecar) is a vastly different experience than riding in a yellow cab. Never in my pre-Uber life did I think of sitting shotgun. Nor did I consider starting a conversation with the driver. (I most definitely did not tell anyone my name or where I went to school.) Never did my taxi driver need to use an iPhone to get me to my destination. But, most evident to me is the distinction between the identities of the two sets of drivers. It has seemed obvious to me that, compared to traditional cab service drivers, Uber drivers are younger, whiter, more female, and more part-time. Though I have continuously noted these distinctions since growing accustomed to Uber this past summer, I did not think that there was data for illustrating these distinctions quantitatively. However, I recently came across the paper “An Analysis of the Labor Market for Uber’s Driver-Partners in the United States,” written by (Economists!) Jonathan Hall and Alan Krueger. The paper supplies tables that summarize characteristics of both Uber drivers and their conventional taxi driver/chauffeur counterparts. This allows for an exercise in visually depicting the differences between the two opposing sets of drivers, which in turn lets us define the characteristics of a new kind of cabbie.

The rise of the younger cabbie
age

Data source: Hall and Krueger (2015). Visualization made using ggplot2.

The above figure illustrates that Uber drivers are noticeably younger than their taxi counterparts. (From here on, when I discuss taxis I am also implicitly including chauffeurs. If you’d like to learn more about the source of the data and the collection methodology, refer directly to the paper.) For one, the age range including the highest percentage of Uber drivers is the 30-39 range (with 30.1% of drivers) while the range including the highest percentage of taxi drivers is the 50-64 range (with 36.6% of drivers). While about 19.1% of Uber drivers are under 30, only about 8.5% of taxi drivers are this young. Similarly, while only 24.5% of Uber drivers are over 50, 44.3% of taxi drivers are over this threshold. This difference in age is not very surprising given that Uber is a technological innovation and, therefore, participation is skewed to younger individuals.

The rise of the more highly educated cabbie
educ

Data source: Hall and Krueger (2015). Visualization made using ggplot2.

This figure illustrates that Uber drivers, on the whole, are more highly educated than their taxi counterparts. While only 12.2% of Uber drivers do not possess a level of education beyond high school completion, the majority of taxi drivers (52.5%) fall into this category. The percentage of taxi drivers with at least a college degree is a mere 18.8%, but the percentage of Uber drivers with at least a college degree is 47.7%, which is even higher than that percentage for all workers, 41.1%. Thus, Uber’s rise has created a new class of drivers in which the share of college graduates exceeds that of the overall workforce. (Though it is worth noting that the overall workforce boasts a higher percentage of individuals with postgraduate degrees than does Uber–16% to 10.8%.)

The rise of the whiter cabbie
race

Data source: Hall and Krueger (2015). Visualization made using ggplot2.

On the topic of race, conventional taxis boast higher percentages of all non-white racial groups except for the “Other Non-Hispanic” group, which is 3.9 percentage points higher among the Uber population. The most represented race among taxi drivers is black, while the most represented race among Uber drivers is white. 19.5% of Uber drivers are black while 31.6% of taxi drivers are black, and 40.3% of Uber drivers are white while 26.2% of taxi drivers are white. I would be curious to compare the racial breakdown of Uber’s drivers to that of Lyft and Sidecar’s drivers as I suspect the other two might not have populations that are as white (simply based on my own small and insufficient sample size).
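Since these comparisons are stated in percentage points, the gaps for the two largest groups can be computed directly from the shares quoted above. This is a small Python sketch using only those four figures; the variable names are made up for illustration.

```python
# Shares of drivers by race (in percent), as quoted in the text.
uber = {"White": 40.3, "Black": 19.5}
taxi = {"White": 26.2, "Black": 31.6}

# Percentage-point gap: Uber share minus taxi share for each group.
gaps = {group: round(uber[group] - taxi[group], 1) for group in uber}

for group, gap in gaps.items():
    direction = "higher" if gap > 0 else "lower"
    print(f"{group}: Uber share is {abs(gap)} percentage points {direction} than taxi share")
```

A percentage-point gap is a simple difference of shares. It is not the same as a relative change, which would divide the gap by the taxi share.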

The rise of the female cabbie
gender

Data source: Hall and Krueger (2015). Visualization made using ggplot2.

It has been previously documented how Uber has helped women begin to “break into” the taxi industry. While only 1% of NYC yellow cab drivers are women and 8% of taxi drivers (and chauffeurs) as a whole are women, an impressive 14% of Uber drivers are women–a percentage that is likely only possible in the driving industry due to the safety that Uber provides via the information on its riders.

The rise of the very-part-time cabbie
hours

Data source: Hall and Krueger (2015). Visualization made using ggplot2.

A whopping 51% of Uber drivers drive a mere 1-15 hours per week, though only 4% of taxi drivers do so. This distinction in driving times between the two sets of drivers makes it clear that Uber drivers are more likely to be supplementing other sources of income with Uber work, while taxi drivers are more likely to be working as a driver full-time (81% of taxi drivers drive more than 35 hours a week on average, but only 19% of Uber drivers do so). In short, it is very clear that Uber drivers treat driving as more of a part-time commitment than do traditional taxi drivers.

Uber by the cities

As a bonus, beyond profiling the demographic and behavioral differences between the two classes of drivers, I present some information about how Uber drivers differ city by city. While this type of comparison could also be extremely interesting for demographic data (gender, race, etc.), hours worked and earnings are the only available pieces of information profiled by city in Hall and Krueger (2015).

Uber by the cities: hours
cities

Data source: Hall and Krueger (2015). Data on uberX drivers for October 2014. Visualization made using ggplot2.

New York is the city with the fewest part-time uberX drivers. (Note: This data is only looking at hours worked for uberX drivers in October 2014.) Only 42% work 1-15 hours while the percentage for the other cities ranges from 53-59%. Similarly, 23% of NYC Uber drivers work 35+ hours while the percentage for other cities ranges from 12-16%. Though these breakdowns are different for each of the six cities, the figure illustrates that Uber driving is treated pretty uniformly as a part-time gig throughout the country.

Uber by the cities: earnings

Also in the report was a breakdown of median earnings per hour by city. An important caveat here is that these are gross pay numbers and, therefore, they do not take into account the costs of driving a taxi or an Uber. If you’d like to read a quick critique of the paper’s statement that “the net hourly earnings of Uber’s driver-partners exceed the hourly wage of employed taxi drivers and chauffeurs, on average,” read this. However, I will not join this discussion and instead focus only on gross pay numbers since costs are indeed unknown.

earnings

Data source: Hall and Krueger (2015). Uber earnings data from October 2014. Taxi earnings data from May 2013. Visualization made using ggplot2.

According to the report’s information, NYC Uber drivers take in the highest gross earnings per hour ($30.35), followed by SF drivers ($25.77). These are also the two cities in which traditional cabbies make the most. However, while NYC taxi drivers make only a few dollars more per hour than taxi drivers in other cities, NYC Uber drivers make more than 10 dollars per hour more than Boston, Chicago, DC, and LA Uber drivers.

Endnote

There is no doubt that the modern taxi experience is different from the one that I once cataloged in my stout, spiral notebook. Sure, Uber drivers are younger than their conventional cabbie counterparts. They are more often female and more often white. They are more likely to talk to you and tell you about their other jobs or interests. But, the nature of the taxi industry is changing far beyond the scope of the drivers. In particular, information that was once unknown (who took a cab ride with whom and when?) to those not in possession of a taxi notebook is now readily accessible to companies like Uber. Now, this string of recorded Uber rides is just one element in an all-encompassing set of (technologically recorded) sequential occurrences that can at least partially sketch out a skeleton of our lived experiences…No pen or paper necessary.

Bonus: a cartoon!
uberouterspace

The New Yorker Caption Contest for this week with my added caption. The photo was too oddly relevant to my current Uber v. Taxi project for me to not include it!

Future work (all of which requires access to more data)
  • Investigate whether certain age groups for Uber are dominated by a specific race, e.g. is the 18-39 group disproportionately white while the 40+ group is disproportionately non-white?
  • Request data on gender/race breakdowns for Uber and Taxis by city
    • Looking at the racial breakdowns for NYC would be particularly interesting since the NYC breakdown is likely very different from that of cabbies throughout the rest of the country (this data is not available in the Taxicab Fact Book)
  • Compare characteristics by ride-sharing service: Uber, Lyft, and Sidecar
  • Investigate distribution of types of cars driven by Uber, Lyft, and Sidecar (Toyota, Honda, etc.)
Code

All data and R scripts needed to recreate these visualizations are available on my “UbervTaxis” GitHub repo.

