The One With All The Quantifiable Friendships, Part 2

Bar Charts, Line Charts, Nightingale Graphs, Stacked Area Charts, Time Series

Since finishing my first year of my PhD, I have been spending some quality time with my computer. Sure, the two of us had been together all throughout the academic year, but we weren’t doing much together besides pdf-viewing and type-setting. Around spring break, when I discovered you can in fact freeze your computer by having too many exams/section notes/textbooks simultaneously open, I promised my MacBook that over the summer we would try some new things together. (And that I would take out her trash more.) After that promise and a new sticker addition, she put away the rainbow wheel.

Cut to a few weeks ago. I had a blast from the past in the form of a Twitter notification. Someone had written a post about using R to analyze the TV show Friends, which was motivated by a similar interest that drove me to write something about the show using my own dataset back in 2015. In the post, the author, Giora Simchoni, used R to scrape the scripts for all ten seasons of the show and made all that work publicly available (wheeeeee) for all to peruse. In fact, Giora even used some of the data I shared back in 2015 to look into character centrality. (He makes a convincing case using a variety of data sources that Rachel is the most central friend of the six main characters.) In reading about his project, I could practically hear my laptop humming to remind me of its freshly updated R software and my recent tinkering with R notebooks. (Get ready for new levels of reproducibility!) So, off my Mac and I went, equipped with a new workflow, to explore new data about a familiar TV universe.

Who’s Doing The Talking?

Given line by line data on all ten seasons, I, like Giora, first wanted to look at line totals for all characters. In aggregating all non-“friends” characters together, we get the following snapshot:


First off, why yes, I am using the official Friends font. Second, I am impressed by how close the totals are for all characters, though hardly surprised that Phoebe has the fewest lines. Rachel wouldn’t be surprised either…

Rachel: Ugh, it was just a matter of time before someone had to leave the group. I just always assumed Phoebe would be the one to go.

Phoebe: Ehh!!

Rachel: Honey, come on! You live far away! You’re not related. You lift right out.

With these aggregates in hand, I then was curious: how would line allocations look across time? So, for each episode, I calculate the percentage of lines that each character speaks, and present the results with the following three visuals (again, all non-friends go into the “other” category):
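For the curious, here is a minimal Python sketch of that per-episode calculation. The project itself was done in R, so this is an illustration rather than the notebook's code: the `(episode, speaker)` tuple layout, the `FRIENDS` set, and the function name `line_shares` are all assumptions made for the example.

```python
from collections import Counter, defaultdict

# Hypothetical input format: one (episode, speaker) pair per spoken line.
FRIENDS = {"Chandler", "Joey", "Monica", "Phoebe", "Rachel", "Ross"}

def line_shares(lines):
    """Return {episode: {character: percent of that episode's lines}}."""
    counts = defaultdict(Counter)
    for episode, speaker in lines:
        # Every non-friend speaker is pooled into a single "Other" bucket.
        key = speaker if speaker in FRIENDS else "Other"
        counts[episode][key] += 1
    shares = {}
    for episode, counter in counts.items():
        total = sum(counter.values())
        shares[episode] = {who: 100 * n / total for who, n in counter.items()}
    return shares

# Toy data (made up): four lines in one episode.
toy = [("S01E01", "Ross"), ("S01E01", "Rachel"),
       ("S01E01", "Gunther"), ("S01E01", "Ross")]
print(line_shares(toy))
```

Each episode's shares sum to 100, which is what lets the stacked and time-series views be compared across episodes of different lengths.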


Tell me that first graph doesn’t look like a callback to Rachel’s English Trifle. Anyway, regardless of a possible trifle-like appearance, all the visuals illustrate dynamics of an ensemble cast; while there is noise in the time series, the show consistently provides each character with a role to play. However, the last visual does highlight some standouts in the collection of episodes that uncharacteristically highlight or ignore certain characters. In other words, there are episodes in which one member of the cast receives an unusually high or low percentage of the lines in the episode. The three episodes that boast the highest percentages for a single member of the gang are: “The One with Christmas in Tulsa” (41.9% Chandler), “The One With Joey’s Interview” (40.3% Joey), “The One Where Chandler Crosses a Line” (36.3% Chandler). Similarly, the three with the lowest percentages for one of the six are: “The One With The Ring” (1.5% Monica), “The One With The Cuffs” (1.6% Ross), and “The One With The Sonogram At The End” (3.3% Joey). The sagging red lines of the last visual identify episodes that have a low percentage of lines spoken by a character outside of the friend group. In effect, those dips in the graph point to extremely six-person-centric episodes, such as “The One On The Last Night” (0.4% non-friends dialogue–a single line in this case), “The One Where Chandler Gets Caught” (1.1% non-friends dialogue), and “The One With The Vows” (1.2% non-friends dialogue).

The Men Vs. The Women

Given this title, here’s a quick necessary clip:

Now, how do the line allocations look when broken down by gender lines across the main six characters? Well, the split consistently bounces around 50-50 over the course of the 10 seasons. Again, as was the case across the six main characters, the balanced split of lines is pretty impressive.


Note that the second visual highlights that there are a few episodes that are irregularly man-heavy. The top three are: “The One Where Chandler Crosses A Line” (77.0% guys), “The One With Joey’s Interview” (75.1% guys), and “The One With Mac and C.H.E.E.S.E.” (70.2% guys). There are also exactly two episodes that feature a perfect 50-50 split for lines across gender: “The One Where Rachel Finds Out” and “The One With The Thanksgiving Flashbacks.”

Say My Name

How much do the main six characters address or mention one another? Giora addressed this question in his post, and I build off of his work by including nicknames in the calculations, and using a different genre of visualization. With respect to the nicknames–“Mon”, “Rach”, “Pheebs”, and “Joe”–“Pheebs” is undoubtedly the stickiest of the group. Characters say “Pheebs” 370 times, which has a comfortable cushion over the second-place nickname “Mon” (used 73 times). Characters also significantly differ in their usage of each other’s nicknames. For example, while Joey calls Phoebe “Pheebs” 38.3% of the time, Monica calls her by this nickname only 4.6% of the time. (If you’re curious about more numbers on the nicknames, check out the project notebook.)
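One way to count names and nicknames separately is with word-boundary regular expressions. The sketch below is in Python rather than the R used for the project, and the `NAME_FORMS` map, function names, and sample sentence are illustrative assumptions; only the four nicknames come from the text above.

```python
import re
from collections import Counter

# Hypothetical map from full name to nickname, using the four nicknames above.
NAME_FORMS = {
    "Monica": "Mon",
    "Rachel": "Rach",
    "Phoebe": "Pheebs",
    "Joey": "Joe",
}

def count_name_forms(text):
    """Count each full name and nickname as a whole word in `text`."""
    counts = Counter()
    for full, nick in NAME_FORMS.items():
        for form in (full, nick):
            # \b stops "Mon" from matching inside "Monica" (or "Joe" inside "Joey").
            counts[form] = len(re.findall(rf"\b{form}\b", text))
    return counts

def nickname_share(counts, full):
    """Percent of mentions of `full` that use the nickname instead."""
    nick = NAME_FORMS[full]
    total = counts[full] + counts[nick]
    return 100 * counts[nick] / total if total else 0.0

sample = "Hey Pheebs! Mon, have you seen Phoebe? Joey, it's Joe."
counts = count_name_forms(sample)
print(counts["Pheebs"], counts["Phoebe"], nickname_share(counts, "Phoebe"))
```

Running the same count per speaker (rather than over the whole script) is what would yield speaker-specific percentages like the Joey and Monica figures quoted above.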

Now, after adding in the nicknames, who says whose name? The following graphic addresses that point of curiosity:


The answer is clear: Rachel says Ross’s name the most! (789 times! OK, we get it, Rachel, you’re in love.) We can also see that Joey is the most self-referential with 242 usages of his own name–perhaps not a shock considering his profession in the entertainment biz. Overall, the above visual provides some data-driven evidence of the closeness between certain characters that is clearly evident in watching the show. Namely, the Joey-Chandler, Monica-Chandler, Ross-Rachel relationships that were evident in my original aggregation of shared plot lines are still at the forefront!


Comparing the above work to what I had originally put together in January 2015 is a real trip. My original graphics back in 2015 were made entirely in Excel and were as such completely unreproducible, as was the data collection process. The difference between the opaqueness of that process and the transparency of sharing notebook output is super exciting to me… and to my loyal MacBook. Yes, yes, I’ll give you another sticker soon.

Let’s see the code!

Here is the HTML-rendered R Notebook for this project. Here is the GitHub repo with the markdown file included.

*Screen fades to black* 
Executive Producer: Alex Albright

© Alexandra Albright and The Little Dataset That Could, 2017. Unauthorized use and/or duplication of this material without express and written permission from this blog’s author and/or owner is strictly prohibited. Excerpts, accompanying visuals, and links may be used, provided that full and clear credit is given to Alex Albright and The Little Dataset That Could with appropriate and specific direction to the original content.



A Bellman Equation About Nothing

Line Charts, Models

Cold Open [Introduction]

A few years ago I came across a short paper that I desperately wanted to understand. The magnificent title was “An Option Value Problem from Seinfeld” and the author, Professor Avinash Dixit (of Dixit-Stiglitz model fame), therein discussed methods of solving for “sponge-worthiness.” I don’t think I need to explain why I was immediately drawn to an academic article that focuses on Elaine Benes, but for those of you who didn’t learn about the realities of birth control from this episode of 1990s television, allow me to briefly explain the relevant Seinfeld-ism. The character Elaine Benes[1] loyally uses the Today sponge as her preferred form of contraception. However, one day it is taken off the market and, after trekking all over Manhattan, our heroine manages to find only one case of 60 sponges to purchase. The finite supply of sponges poses a daunting question to Elaine… namely, when should she choose to use a sponge? Ie, when is a given potential partner sponge-worthy?

JERRY: I thought you said it was imminent.

ELAINE: Yeah, it was, but then I just couldn’t decide if he was really sponge-worthy.

JERRY: “Sponge-worthy”?

ELAINE: Yeah, Jerry, I have to conserve these sponges.

JERRY: But you like this guy, isn’t that what the sponges are for?

ELAINE: Yes, yes – before they went off the market. But I mean, now I’ve got to re-evaluate my whole screening process. I can’t afford to waste any of ’em.

–“The Sponge” [Seinfeld Season 7 Episode 9]

As an undergraduate reading Professor Dixit’s introduction, I felt supremely excited that an academic article was going to delve into the decision-making processes of one of my favorite fictional characters. However, the last sentence in the introduction gave me pause: “Stochastic dynamic programming methods must be used.” Dynamic programming? Suffice it to say that I did not grasp the methodological context or mathematical machinery embedded in the short and sweet paper. After a few read-throughs, I filed wispy memories of the paper away in some cluttered corner of my mind… Maybe one day this will make more sense to me… 

Flash forward to August 2016. Professor David Laibson, the economics department chair, explains to us fresh-faced G1’s (first-year PhD’s) that he will be teaching us the first part of the macroeconomics sequence… Dynamic Programming. After a few days of talking about Bellman equations, I started to feel as if I had seen related work in some past life. Without all the eeriness of a Westworld-esque robot, I finally remembered the specifics of Professor Dixit’s paper and decided to revisit it with Professor Laibson’s lectures in mind. Accordingly, my goal here is to explain the simplified model set-up of the aforementioned paper and illustrate how basics from dynamic programming can be used in “solving for spongeworthiness.”

Act One [The Model]

Dynamic programming refers to taking a complex optimization problem and splitting it up into simpler recursive sub-problems. Consider Elaine’s decision as to when to use a sponge. We can model this as an optimal stopping problem–ie, when should Elaine use the sponge and thus give up the option value of holding it into the future? The answer lies in the solution to a mathematical object called the Bellman equation, which will represent Elaine’s expected present value of her utility recursively.

Using a simplified version of the framework from Dixit (2011), we can explain the intuition behind setting up and solving a Bellman equation. First, let’s lay out the modeling framework. For the sake of computational simplicity, assume Elaine managed to acquire only one sponge rather than the case of 60 (Dixit assumes she has a general m sponges in his set-up, so his computations are more complex than mine). With that one glorious sponge in her back pocket, Elaine goes about her life meeting potential partners, and yada yada yada. To make the yada yada’s explicit, we say Elaine lives infinitely and meets one new potential partner every day t who is of some quality Q_t. Elaine is not living a regular continuous-time life; instead, she gets one romantic option each time period. This sets up the problem in discrete-time since Elaine’s decisions are day-by-day rather than infinitesimally-small-moment-by-infinitesimally-small-moment. If we want to base this assumption somewhat in reality, we could think of Elaine as using Coffee Meets Bagel, a dating app that yields one match per day. Ie, one “bagel” each day.

Dixit interprets an individual’s quality as the utility Elaine receives from sleeping with said person. Now, in reality, Elaine would only be able to make some uncertain prediction of a person’s quality based on potentially noisy signals. The corresponding certainty equivalent [the true quality metric] would be realized after Elaine slept with the person. In other words, there would be a distinction between ex post and ex ante quality assessments—you could even think of a potential partner as an experience good in this sense. (Sorry to objectify you, Scott Patterson.) But, to simplify our discussion, we assume that true quality is observable to Elaine—she knows exactly how much utility she will gain if she chooses to sleep with the potential partner of the day. In defense of that assumption, she does vet potential partners pretty thoroughly.

Dixit also assumes quality is drawn from a uniform distribution over [0,1] and that Elaine discounts the future exponentially by a factor of δ in the interval (0,1). Discounting is a necessary tool for agent optimization problems since preferences are time dependent. Consider the following set-up for illustrative purposes: Say Elaine gains X utils from eating a box of Jujyfruits today; then, using our previously defined discount factor, she would gain δX from eating the box tomorrow, δ^2 X from eating it the day after tomorrow, and so on. In general, she gains δ^n X utils from consuming it n days into the future, thus the terminology “exponential discounting.” Given the domain for δ, we know unambiguously that X > δX > δ^2 X > … and so on. That is, if the box of candy doesn’t change between periods (it is always X) and it yields positive utility (which clearly it must, given questionable related life decisions), Elaine will prefer to consume it in the current time period. Ie, why wait if there is no gain from waiting? On the other hand, if Elaine wants to drink a bottle of wine today that yields Y utils, but the wine improves by a factor of w>1 each day, then whether she prefers to drink it today or tomorrow depends on which is greater: Y (the present utility gain of the current state of the wine) or δ(wY) (the discounted utility gain of the aged, improved wine). (Ie, if δw>1, she’ll wait for tomorrow.) If Elaine also considers up until n days into the future, she will be comparing Y, δ(wY), δ^2(w^2 Y), …, and δ^n(w^n Y).
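The candy and wine comparisons can be played out in a few lines of Python. This is only a sketch of the arithmetic described above; the function names and the particular numbers (Y = 10, δ = 0.9, w = 1.2 or 1.05, a five-day horizon) are made up for illustration.

```python
def discounted(utility, delta, days):
    """Present value of `utility` received `days` days from now."""
    return delta ** days * utility

def best_day_to_drink(y, w, delta, horizon):
    """Day in 0..horizon that maximizes delta^n * (w^n * y)."""
    values = [discounted(w ** n * y, delta, n) for n in range(horizon + 1)]
    return max(range(horizon + 1), key=values.__getitem__)

# Candy: the payoff never changes, so any delay only shrinks it.
print(discounted(10, 0.9, 0) > discounted(10, 0.9, 1))  # True

# Wine: wait when delta * w > 1, drink now when delta * w < 1.
print(best_day_to_drink(10, 1.2, 0.9, 5))   # delta * w = 1.08, so the last day
print(best_day_to_drink(10, 1.05, 0.9, 5))  # delta * w = 0.945, so day 0
```

Because the wine's discounted value is Y(δw)^n, the whole comparison collapses to whether δw sits above or below 1.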

In our set-up Elaine receives some quality offer each day that is neither static (as in the Jujyfruits example) nor deterministically growing (as in the wine example); rather, the quality is drawn from a defined distribution (the uniform distribution on the unit interval, mainly chosen to allow for straightforward computations). While quality is observable in the current period, the future draws are not observable, meaning that Elaine must compare her current draw with an expectation of future draws. In short, every day Elaine has the choice whether to use the sponge and gain Q_t through her utility function, or hold the sponge for a potentially superior partner in the future. In other words, Elaine’s current value function is expressed as a choice between the “flow payoff” Q_t and the discounted “continuation value function.” Since she is utility maximizing, she will always choose the higher of these two options. Again, since the continuation value function is uncertain, as future quality draws are from some distribution, we must use the expectation operator in that piece of the maximization problem. Elaine’s value function is thus:
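Written out in the notation defined above (flow payoff Q_t, discount factor δ, and an expectation over tomorrow's draw), the value function takes the standard optimal-stopping form. This is a reconstruction from the surrounding definitions rather than a copy of Dixit's display:

```latex
V(Q_t) = \max\left\{\, Q_t,\ \delta\, \mathbb{E}\!\left[ V(Q_{t+1}) \right] \right\}
```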


This is the Bellman equation of lore! It illustrates a recursive relationship between the value functions for different time periods, and formalizes Elaine’s decision as a simple optimal stopping problem.

Act Two [Solving for Sponge-worthiness]

To solve for sponge-worthiness, we need to find the value function that solves the Bellman equation, and derive the associated optimal policy rule. Our optimal policy rule is a function that maps each point in the state space (the space of possible quality draws) to the action space such that Elaine achieves payoff V(Q_t) for all feasible quality draws in [0,1]. The distribution of Q_{t+1} is stationary and independent of Q_t, as the draws are perpetually from U[0,1]. (Note to the confounded reader: don’t think of the space of quality draws as akin to some jar of marbles in conventional probability puzzles, those in which the draw of a red marble means there are fewer red to draw later, since our distribution does not shift between periods. For more on other possible distributions, see Act Four.) Due to the aforementioned stationarity and independence, the value of holding onto the sponge [δEV(Q_{t+1})] is constant for all days. By this logic, if a potential partner of quality Q’ is sponge-worthy, then Q’ ≥ δEV(Q_{t+1})! Note that for all Q”>Q’, Q”>δEV(Q_{t+1}), so any partner of quality Q” must also be considered sponge-worthy. Similarly, if a person of quality Q’ is not sponge-worthy, then δEV(Q_{t+1}) ≥ Q’ and for all Q”<Q’, Q”<δEV(Q_{t+1}), so any partner of quality Q” must also not be sponge-worthy. Thus, the functional form of the value function is:
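Putting the monotonicity argument into symbols, and writing Q* for the constant holding value δEV(Q_{t+1}), the value function is piecewise. The layout below is a reconstruction from the argument in the text:

```latex
V(Q_t) =
\begin{cases}
Q_t, & \text{if } Q_t > Q^{*} \quad \text{(use the sponge)} \\[4pt]
Q^{*} = \delta\, \mathbb{E}\!\left[ V(Q_{t+1}) \right], & \text{if } Q_t \le Q^{*} \quad \text{(hold the sponge)}
\end{cases}
```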


In other words, our solution will be a threshold rule where the optimal policy is to use the sponge if Q > Q* and hold onto the sponge otherwise. The free parameter we need to solve for is Q*, which we can conceptualize as the all-powerful quality level that separates the sponge-worthy from the not!

Act Three [What is Q*?]

When Q = Q*, Elaine should be indifferent between using the sponge and holding onto it. This means that the two arguments in the maximization should be equal–that is, the flow payoff [Q*] and the discounted continuation value function [δEV(Q_{t+1})]. We can thus set Q* = δEV(Q_{t+1}) and exploit the fact that we defined Q ~ U[0,1], to make the following calculations:
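A sketch of those calculations, reconstructed from the indifference condition and the U[0,1] assumption: the expectation splits at Q*, with draws below the threshold worth the holding value Q* and draws above it worth Q itself.

```latex
\begin{aligned}
Q^{*} &= \delta\, \mathbb{E}\!\left[ V(Q_{t+1}) \right]
       = \delta \left[ \int_{0}^{Q^{*}} Q^{*}\, dQ + \int_{Q^{*}}^{1} Q\, dQ \right] \\
      &= \delta \left[ (Q^{*})^{2} + \frac{1 - (Q^{*})^{2}}{2} \right]
       = \frac{\delta}{2} \left[ 1 + (Q^{*})^{2} \right] \\
\Rightarrow\quad 0 &= \delta\, (Q^{*})^{2} - 2 Q^{*} + \delta \\
\Rightarrow\quad Q^{*} &= \frac{1 \pm \sqrt{1 - \delta^{2}}}{\delta}
\end{aligned}
```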


The positive root yields a Q* >1, which would mean that Elaine never uses the sponge. This cannot be the optimal policy, so we eliminate this root. In effect, we end up with the following solution for Q*:
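Keeping only the root that lies inside the unit interval for δ in (0,1), the threshold implied by the uniform-draw set-up above works out to:

```latex
Q^{*} = \frac{1 - \sqrt{1 - \delta^{2}}}{\delta}
```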


Given this Q*, it is optimal to use the sponge if Q_t > Q*, and it is optimal to hold the sponge if Q* ≥ Q_t. Thus, as is required by the definition of optimal policy, for all values of Q_t:
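In symbols, and assuming the uniform-draw threshold with the smaller quadratic root, the policy delivers the following value at every quality level (a reconstruction consistent with the Bellman equation set up in Act One):

```latex
V(Q_t) = \max\left\{ Q_t,\ \delta\, \mathbb{E}\!\left[ V(Q_{t+1}) \right] \right\}
       = \max\left\{ Q_t,\ Q^{*} \right\},
\qquad Q^{*} = \frac{1 - \sqrt{1 - \delta^{2}}}{\delta}
```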


We can interpret the way the Q* threshold changes with the discount factor δ using basic economic intuition. As δ approaches 1 (Elaine approaches peak patience), Q* then approaches 1, meaning Elaine will accept no partner but the one of best possible quality. At the other extreme, as δ approaches 0 (Elaine approaches peak impatience), Q* then approaches 0, meaning Elaine will immediately use the sponge with the first potential partner she meets.
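These limits are easy to check numerically. The Python sketch below assumes the uniform-draw model from Act One, under which the threshold solves Q* = δ(1 + Q*²)/2, and compares the smaller quadratic root with a plain fixed-point iteration on that condition. The function names are my own, not anything from Dixit's paper.

```python
import math

def threshold_closed_form(delta):
    """Smaller root of delta*q^2 - 2*q + delta = 0 (the one inside [0, 1])."""
    return (1 - math.sqrt(1 - delta ** 2)) / delta

def threshold_by_iteration(delta, tol=1e-12, max_iter=10_000):
    """Iterate q <- delta * E[max(Q, q)] for Q ~ U[0, 1], starting from 0."""
    q = 0.0
    for _ in range(max_iter):
        # For uniform draws, E[max(Q, q)] = q*q + (1 - q*q)/2 = (1 + q*q)/2.
        q_next = delta * (1 + q * q) / 2
        if abs(q_next - q) < tol:
            return q_next
        q = q_next
    return q

for delta in (0.1, 0.5, 0.8, 0.99):
    print(delta, threshold_closed_form(delta), threshold_by_iteration(delta))
```

The two columns should agree, with the threshold climbing toward 1 as δ does and shrinking toward 0 with it.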

To make this visually explicit, let’s use a graph to illustrate Elaine’s value function for some set δ. Take δ=0.8, then Q*=0.5, a clear-cut solution for the sponge-worthiness threshold. Given these numbers, the relationship between the value function and quality level can be drawn out as such:


What better application is there for the pgfplots package in LaTeX?!

The first diagram illustrates the two pieces that make up Elaine’s value function, while the second then uses the black line to denote the value function, as the value function takes on the maximum value across the space of quality draws. Whether the value function conforms to the red or green line hinges on whether we are in the sponge-worthy range or not. As explained earlier, before the sponge-worthiness threshold, the option value of holding the sponge is the constant such that Q*=δEV(Q_{t+1}). After hitting the magical point of sponge-worthiness, the value function moves one-for-one with Q_t. Note that alternative choices for the discount factor would yield different Q*’s, which would shift the red line up or down depending on the value, which in turn impacts the leftmost piece of the value function in the second graph. These illustrations are very similar to diagrams we drew in Professor Laibson’s module, but with some more advanced technical graph labelings than what we were exposed to in class (ie, “no sponge for you” and “sponge-worthy”).

Act Four [Extensions]

In our set-up, the dependence of the value function is simple since there is one sponge and Elaine is infinitely lived. However, it could be that we solve for a value function with more complex time and resource dependence. This could yield a more realistic solution that takes into account Elaine’s age and mortality and the 60 sponges in the valuable case of contraception. We could even perform the sponge-worthiness calculations for Elaine’s monotonically increasing string of sponge quantity requests: 3, 10, 20, 25, 60! (These numbers based in the Seinfeld canon clearly should have been in the tabular calculations performed by Dixit.)

For computational purposes, we also assumed that quality is drawn independently each period (day) from a uniform distribution on the unit interval. (Recall that a uniform distribution over some interval means that each value in the interval has equal probability.) We could alternatively consider a normal distribution, which would likely do a better job of approximating the population quality in reality. Moreover, the quality of partners could be drawn from a distribution whose bounds deterministically grow over time, as there could be an underlying trend upward in the quality of people Elaine is meeting. Perhaps Coffee Meets Bagel gets better at matching Elaine with bagels, as it learns about her preferences.

Alternatively, we could try and legitimize a more specific choice of a distribution using proper Seinfeld canon. In particular, Season 7 Episode 11 (“The Wink,” which is just 2 episodes after “The Sponge”) makes explicit that Elaine believes about 25% of the population is good looking. If we assume Elaine gains utility only from sleeping with good looking people, we could defend using a distribution such that 75% of quality draws are exactly 0 and the remaining 25% of draws are from a normal distribution truncated to the interval from 0 to 1. (Note that Jerry, on the other hand, believes 95% of the population is undateable, so quality draws for Jerry would display an even more extreme distribution–95% of draws would be 0 and the remaining 5% could come from a normal distribution truncated to the same interval.)
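For distributions without a tidy closed form, the threshold can still be approximated by iterating the indifference condition Q* = δ·E[max(Q, Q*)] on a sample of draws. The Python sketch below does that; the helper names, the seed, and especially the choice of mean 0.5 and standard deviation 0.15 for the "good looking" quarter of draws are assumptions for illustration, not anything stated in the episode or in Dixit's paper.

```python
import random

def threshold_from_draws(draws, delta, tol=1e-10, max_iter=100_000):
    """Fixed point of q = delta * mean(max(draw, q)) over sampled qualities."""
    q = 0.0
    for _ in range(max_iter):
        q_next = delta * sum(max(d, q) for d in draws) / len(draws)
        if abs(q_next - q) < tol:
            return q_next
        q = q_next
    return q

def elaine_draws(n, seed=0):
    """75% of draws are 0; 25% come from a normal resampled until it lands in [0, 1]."""
    rng = random.Random(seed)
    draws = []
    for _ in range(n):
        if rng.random() < 0.75:
            draws.append(0.0)
            continue
        while True:
            x = rng.gauss(0.5, 0.15)  # assumed mean and spread
            if 0.0 <= x <= 1.0:
                draws.append(x)
                break
    return draws

# Sanity check against the uniform case: an evenly spaced grid stands in for U[0, 1].
uniform_grid = [(i + 0.5) / 1000 for i in range(1000)]
print(threshold_from_draws(uniform_grid, 0.8))        # close to 0.5
print(threshold_from_draws(elaine_draws(2000), 0.8))  # lower: most draws are duds
```

With three quarters of the population worth nothing, the expected value of waiting drops, so the simulated Elaine settles for a lower threshold than under uniform draws at the same δ.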

Regardless of the specific distribution or time/resource constraint choices, the key take-away here is the undeniably natural formulation of this episode’s plot line as an optimal stopping problem. Throughout the course of our six weeks with Professor Laibson, our class used dynamic programming to approach questions of growth, search, consumption, and asset pricing… while these applications are diverse and wide-ranging, don’t methods seem even more powerful when analyzing fictional romantic encounters!?


Speaking of power


As explained earlier, this write-up is primarily focused on the aforementioned Dixit (2011) paper, but also draws on materials from Harvard’s Economics 2010D sequence. In particular, “Economics 2010c: Lecture 1 Introduction to Dynamic Programming” by David Laibson (9/1/2016) & “ECON 2010c Section 1” by Argyris Tsiaras (9/2/2016).


Go East, young woman

Comparisons, Line Charts, Time Series
We’ll always have Palo Alto[1]

It is 9:30pm PST on Friday evening and my seat belt is buckled. The lights are a dim purple as they always are on Virgin America flights. As if we are all headed off to a prom on the opposite side of the country together. My favorite safety video in the industry starts to play–an accumulation of visuals and beats that usually gives me a giddy feeling that only Beyoncé videos have the power to provoke–however, in this moment, I begin to tear up despite the image of a shimmying nun displayed in front of me. In my mind, overlaying the plane-inspired choreography is a projection of Rick Blaine reminding me, in my moments of doubt, that I belong on this plane [2]: “If that plane leaves the ground and you’re not [in it], you’ll regret it. Maybe not today. Maybe not tomorrow, but soon and for the rest of your life.” I whisper “here’s looking at you, kid” to the screen now saturated with dancing flight attendants and fade into a confused dreamscape: Silicon Valley in black and white–founders still wear hoodies, but they have tossed on hats from the ’40s.

A few days later, I am now living in Cambridge, MA. While my senses are overcome by a powerful ensemble of changes, some more discreet or intangible than others, there is one element of the set that is clear, striking, and quantifiable: the thickness and heat in the air that were missing from Palo Alto and San Francisco. After spending a few nights out walking (along rivers, across campuses, over and under bridges, etc.) in skirts and sandals without even the briefest longing for a polar fleece, I am intent on documenting the difference between Boston and San Francisco temperatures. Sure, I can’t quantify every dimension of change that I experience, but, hey, I can chart temperature differences.

Coding up weather plots

In order to investigate the two cities and their relevant weather trends, I adapted some beautiful code that was originally written by Bradley Boehmke in order to generate Tufte-inspired weather charts using R (specifically making use of the beloved ggplot2 package). The code is incredible in how simple it is to apply to any of the cities that have data from the University of Dayton’s Average Daily Temperature archive.[3] Below are the results I generated for SF and Boston, respectively[4]:



While one could easily just plot the recent year’s temperature data (2015, as marked by the black time series, in this case), it is quickly evident that making use of historical temperature data helps to both smooth over the picture and put 2015 temperatures in context. The light beige for each day in the year shows the range from historical lows to historical highs in the time period of 1995-2014. Meanwhile, the grey range presents the 95% confidence interval around daily mean temperatures for that same time period. Lastly, the presence of blue and red dots illustrates the days in 2015 that were record lows or highs over the past two decades. While Boston had a similar number of red and blue dots for 2015, SF is overpowered by red. Almost 12% of SF days were record highs relative to the previous twenty years. Only one day was a record low.
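The pieces of the chart boil down to a handful of per-day summaries. The Python sketch below shows one way to compute them; it is not Bradley Boehmke's R code, the function names and toy temperatures are made up, and the interval uses a simple normal approximation (1.96 standard errors), which may differ from the multiplier used in the original charts.

```python
from statistics import mean, stdev

def summarize_day(history, current):
    """history: past years' temperatures for one calendar day; current: this year's."""
    low, high = min(history), max(history)
    avg = mean(history)
    # Assumed normal-approximation 95% interval around the historical mean.
    half_width = 1.96 * stdev(history) / len(history) ** 0.5
    return {
        "low": low,
        "high": high,
        "ci": (avg - half_width, avg + half_width),
        "record_high": current > high,
        "record_low": current < low,
    }

def record_high_share(days):
    """days: list of (history, current) pairs; returns percent of record-high days."""
    highs = sum(summarize_day(h, c)["record_high"] for h, c in days)
    return 100 * highs / len(days)

toy_days = [([60.0, 62.0, 64.0, 66.0], 70.0),  # hotter than every past year
            ([58.0, 61.0, 63.0, 65.0], 62.0)]  # an ordinary day
print(summarize_day(*toy_days[0]))
print(record_high_share(toy_days))
```

Applied to 365 days of real data, `record_high_share` is the kind of calculation behind a figure like the roughly 12% of SF days noted above.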

While this style of visualization is primarily intuitive for comparing a city’s weather to its own historical context, there are also a few quick points that strike me from simple comparisons across the two graphs. I focus on just three quick concepts that are borne out by the visuals:

  1. Boston’s seasons are unmistakable.[5] While the normal range (see darker swatches on the graph) of temperatures for SF varies between 50 (for winter months) and 60 degrees (for late summer and early fall months), the normal range for Boston is notably larger and ranges from the 30’s (winter and early spring months) to the 70’s (summer months). The difference in the curve of the two graphs makes this difference throughout the months painfully obvious. San Francisco’s climate is incredibly stable in comparison with east coast cities–a fact that is well known, but still impressive to see in visual form!
  2. There’s a reason SF can have Ultimate Frisbee Beach League in the winter. Consider the relative wonderfulness of SF in comparison to Boston during the months of January to March. In 2015, SF ranged from 10 to 55 degrees (on a particularly toasty February day) warmer than Boston for those months. In general, most differences on a day-to-day basis are around +20 to +40 degrees for SF.
  3. SF Summer is definitely ‘SF Winter’ if one defines its temperature relative to that of other climates. In 2015, the summer months in SF were around 10 degrees colder than were the summer months in Boston. While SF summer is warmer than actual SF winter in terms of absolute temperature comparisons, comparing the temperatures to other areas of the country quickly yields SF summer as the relatively chilliest range of the year.

Of course, it is worth noting that the picture from looking at simple temperature alone is not complete. More interesting than this glance at basic temperature would be an investigation into the “feels like” temperature, which usually takes into account factors such as wind speeds and humidity. Looking into these more complex measurements would very likely heighten the clear distinction in Boston seasons as well as potentially strengthen the case for calling SF summer ‘SF winter’, given the potentially stronger presence of wind chill during the summer months.[6]

The coldest winter I ever spent…[7]

It is 6:00am EST Saturday morning in Boston, MA. Hot summer morning is sliced into by divine industrial air conditioning. Hypnotized by luggage seemingly floating on the baggage claim conveyor belt and slowly emerging from my black and white dreams, I wonder if Ilsa compared the weather in Lisbon to that in Casablanca when she got off her plane… after contacts render the lines and angles that compose my surroundings crisp again, I doubt it. Not only because Ilsa was probably still reeling from maddeningly intense eye contact with Rick, but also because Lisbon and Casablanca are not nearly as markedly different in temperature as are Boston and San Francisco.

Turns out that the coldest winter I will have ever spent will be winter in Boston. My apologies to summer in San Francisco.


[1] Sincere apologies to those of you in the Bay Area who have had to hear me make this joke a few too many times over the past few weeks.

[2] Though definitely not to serve as a muse to some man named Victor. Ah, yes, the difference 74 years can make in the purpose of a woman’s travels.

[3] Taking your own city’s data for a spin is a great way to practice getting comfortable with R visualization if you’re into that sort of thing.

[4] See my adapted R code for SF and Boston here. Again, the vast majority of credit goes to Bradley Boehmke for the original build.

[5] Speaking of seasons

[6] I’d be interested to see which US cities have the largest quantitative difference between “feels like” and actual temperature for each period (say, month) of the year…

[7] From a 2005 Chronicle article: “‘The coldest winter I ever spent was a summer in San Francisco,’ a saying that is almost a San Francisco cliche, turns out to be an invention of unknown origin, the coolest thing Mark Twain never said.”

© Alexandra Albright and The Little Dataset That Could, 2016. Unauthorized use and/or duplication of this material without express and written permission from this blog’s author and/or owner is strictly prohibited. Excerpts, accompanying visuals, and links may be used, provided that full and clear credit is given to Alex Albright and The Little Dataset That Could with appropriate and specific direction to the original content.

Where My Girls At? (In The Sciences)

Line Charts, Scatter Plots

In the current educational landscape, there is a constant stream of calls to improve female representation in the sciences. However, the call to action is often framed within the aforementioned nebulous realm of “the sciences”—an umbrella term that ignores the distinct environments across the scientific disciplines. To better understand the true state of women in “the sciences,” we must investigate representation at the discipline level in the context of both undergraduate and doctoral education. As it turns out, National Science Foundation (NSF) open data provides the ability to do just that!

The NSF’s Report on Women, Minorities, and Persons with Disabilities in Science and Engineering includes raw numbers on both undergraduate and doctoral degrees earned by women and men across all science disciplines. With these figures in hand, it’s simple to generate measures of female representation within each field of study—that is, percentages of female degree earners. This NSF report spans the decade 2002–2012 and provides an immense amount of raw material to investigate.[1]
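As a minimal sketch of that measure (written in Python rather than the R used for this project, and with made-up counts rather than actual NSF figures), the share of female degree earners is simply the number of women divided by the total number of degree earners:

```python
def female_share(women, men):
    """Percentage of degree earners who are women, given raw counts."""
    total = women + men
    if total == 0:
        raise ValueError("no degree earners recorded")
    return 100.0 * women / total

# Hypothetical (women, men) counts for illustration only; not NSF data.
example_counts = {"field_a": (470, 530), "field_b": (190, 810)}

shares = {
    field: round(female_share(women, men), 1)
    for field, (women, men) in example_counts.items()
}
print(shares)
```

The field names and counts here are placeholders; the same function would be applied to each discipline, degree level, and year in the report.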

The static picture: 2012

First, we will zero in on the most recent year of data, 2012, and explicitly compare female representation within and across disciplines.[2]


The NSF groups science disciplines with similar focus (for example, atmospheric and ocean sciences both focus on environmental science) into classified parent categories. In order to observe not only the variation within each parent category but also across the more granular disciplines themselves, the above graph plots percentage female representation by discipline, with each discipline colored with respect to its NSF classified parent category.

The variation within each parent category can be quite pronounced. In the earth, atmospheric, and ocean sciences, female undergraduate representation ranges from 36% (atmospheric sciences) to 47% (ocean sciences) of total graduates. Among PhD graduates, female representation ranges from 39% (atmospheric sciences) to 48% (ocean sciences). Meanwhile, female representation in the physical sciences has an undergraduate range from 19% (physics) to 47% (chemistry) and a PhD range from 20% (physics) to 39% (chemistry). However, social sciences has the largest spread of all with undergraduate female representation ranging from 30% (economics) to 71% (anthropology) and PhD representation ranging from 33% (economics) to 64% (anthropology).

In line with conventional wisdom, computer sciences and physics are overwhelmingly male (undergraduate and PhD female representation lingers around 20% for both). Other disciplines in which female representation notably lags include: economics, mathematics and statistics, astronomy, and atmospheric sciences. Possible explanations behind the low representation in such disciplines have been debated at length.

Interactions between “innate abilities,” mathematical content, and female representation

Relatively recently, in January 2015, an article in Science “hypothesize[d] that, across the academic spectrum, women are underrepresented in fields whose practitioners believe that raw, innate talent is the main requirement for success, because women are stereotyped as not possessing such talent.” While this explanation was compelling to many, another group of researchers quickly responded by showing that once measures of mathematical content were added into the proposed models, the measures of innate beliefs (based on surveys of faculty members) shed all their statistical significance. Thus, the latter researchers provided evidence that female representation across disciplines is instead associated with the discipline’s mathematical content “and that faculty beliefs about innate ability were irrelevant.”

However, this conclusion does not imply that stereotypical beliefs are unimportant to female representation in scientific disciplines—in fact, the same researchers argue that beliefs of teachers and parents of younger children can play a large role in silently herding women out of math-heavy fields by “becom[ing] part of the self-fulfilling belief systems of the children themselves from a very early age.” Thus, the conclusion only objects to the alleged discovery of a robust causal relationship between one type of belief, university/college faculty beliefs about innate ability, and female representation.

Despite differences, both assessments demonstrate a correlation between measures of innate capabilities and female representation that is most likely driven by (1) women being less likely than men to study math-intensive disciplines and (2) those in math-intensive fields being more likely to describe their capacities as innate.[3]

The second point should hardly be surprising to anyone who has been exposed to mathematical genius tropes—think of all those handsome janitors, their talents rarely learned, who write up proofs on chalkboards. The second point is also incredibly consistent with the assumptions that underlie “the cult of genius” described by Professor Jordan Ellenberg in How Not to Be Wrong: The Power of Mathematical Thinking (p.412):

The genius cult tells students it’s not worth doing mathematics unless you’re the best at mathematics, because those special few are the only ones whose contributions matter. We don’t treat any other subject that way! I’ve never heard a student say, “I like Hamlet, but I don’t really belong in AP English—that kid who sits in the front row knows all the plays, and he started reading Shakespeare when he was nine!”

In short, subjects that are highly mathematical are seen as more driven by innate abilities than are others. In fact, describing someone as a hard worker in mathematical fields is often seen as an implicit insult—an implication I very much understand as someone who has been regularly (usually affectionately) teased as a “try-hard” by many male peers.

The dynamic picture: 2002–2012

Math-intensive subjects are predominately male in the static picture for the year 2012, but how has the gender balance changed over recent years (in these and all science disciplines)? To answer this question, we turn to a dynamic view of female representation over a recent decade by looking at NSF data for the entirety of 2002–2012.


The above graph plots the percentages of female degree earners in each science discipline for both the undergraduate and doctoral levels for each year from 2002 to 2012. The trends are remarkably varied with overall changes in undergraduate female representation ranging from a decrease of 33.9% (computer sciences) to an increase of 24.4% (atmospheric sciences). Overall changes in doctoral representation ranged from a decline of 8.8% (linguistics) to a rise of 67.6% (astronomy). The following visual more concisely summarizes the overall percentage changes for the decade.
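To make the arithmetic concrete, here is a small Python sketch of an overall percentage change. It assumes the figures above are relative changes in the female share between the first and last year (rather than percentage-point differences), and the shares used are hypothetical, not taken from the NSF data:

```python
def percent_change(start_share, end_share):
    """Relative change (in %) from the starting share to the ending share."""
    return 100.0 * (end_share - start_share) / start_share

def point_change(start_share, end_share):
    """Difference in percentage points between the two shares."""
    return end_share - start_share

# Hypothetical example: a discipline whose female share falls from 25% to 20%.
relative = percent_change(25.0, 20.0)   # a 20% relative decline
points = point_change(25.0, 20.0)       # a 5 percentage point decline
print(relative, points)
```

The distinction matters when the starting share is small: a modest change in percentage points can correspond to a large relative change.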


As this graph illustrates, there were many gains in female representation at the doctoral level between 2002 and 2012. All but three disciplines experienced increased female representation—seems promising, yes? However, substantial losses at the undergraduate level should yield some concern. Only six of the eighteen science disciplines experienced undergraduate gains in female representation over the decade.

The illustrated increases in representation at the doctoral level are likely extensions of gains at the undergraduate level from the previous years—gains that are now being eroded given the presented undergraduate trends. The depicted losses at the undergraduate level could very well lead to similar losses at the doctoral level in the coming decade, which would hamper the widely shared goal to tenure more female professors.

The change for computer sciences is especially important since it provides a basis for the vast, well-documented media and academic focus on women in the field. (Planet Money brought the decline in percentage of female computer science majors to the attention of many in 2014.) The discipline experienced a loss in female representation at the undergraduate level that was more than twice the size of that in any other subject, including physics (-15.6%), earth sciences (-12.2%), and economics (-11.9%).

While the previous discussion of innate talent and stereotype threat focused on math-intensive fields, a category computer sciences fall into, I would argue that this recent decade has seen the effect of those forces on a growing realm of code-intensive fields. The use of computer programming and statistical software has become a standard qualification for many topics in physics, statistics, economics, biology, astronomy, and other fields. In fact, completing degrees in these disciplines now virtually requires coding in some way, shape, or form.

For instance, in my experience, one nontrivial hurdle that stands between students and more advanced classes in statistics or economics is the time necessary to understand how to use software such as R and Stata. Even seemingly simple tasks in these two programs require some basic level of comfort with structuring commands—an understanding that is not taught in these classes, but rather mentioned as a quick and seemingly obvious sidebar. Despite my extensive coursework in economics and mathematics, I am quick to admit that I only became comfortable with Stata via independent learning in a summer research context, and R via pursuing projects for this blog many months after college graduation.

The implications of coding’s expanding role in many strains of scientific research should not be underestimated. If women are not coding, they are not just missing from computer science—they will increasingly be missing from other disciplines which coding has seeped into.

The big picture: present–future

In other words, I would argue academia is currently faced with the issue of improving female representation in code-intensive fields.[4] As is true with math-intensive fields, the stereotypical beliefs of teachers and parents of younger children “become part of the self-fulfilling belief systems of the children themselves from a very early age” that discourage women from even attempting to enter code-intensive fields. These beliefs when combined with Ellenberg’s described “cult of genius” (a mechanism that surrounded mathematics and now also applies to the atmosphere in computer science) are especially dangerous.

Given the small percentage of women in these fields at the undergraduate level, there is limited potential growth in female representation along the academic pipeline—that is, at the doctoral and professorial levels. While coding has opened up new, incredible directions for research in many of the sciences, its evolving importance also can yield gender imbalances due to the same dynamics that underlie underrepresentation in math-intensive fields.


[1] Unfortunately, we cannot extend this year range back before 2002 since earlier numbers were solely presented for broader discipline categories, or parent science categories—economics and anthropology would be grouped under the broader term “social sciences,” while astronomy and chemistry would be included under the term “physical sciences.”

[2] The NSF differentiates between science and engineering as the latter is often described as an application of the former in academia. While engineering displays an enormous gender imbalance in favor of men, I limit my discussion here to disciplines that fall under the NSF’s science category.

[3] The latter viewpoint does have some scientific backing. The paper “Nonlinear Psychometric Thresholds for Physics and Mathematics” supports the notion that while greater work ethic can compensate for lesser ability in many subjects, those below some threshold of mathematical capacities are very unlikely to succeed in mathematics and physics coursework.

[4] On a positive note, atmospheric sciences, which often involves complex climate modeling techniques, has experienced large gains in female representation at the undergraduate level.

Speaking of coding…

Check out my relevant Github repository for all data and R scripts necessary for reproducing these visuals.

Thank you to:

Ally Seidel for all the edits over the past few months! & members of NYC squad for listening to my ideas and debating terminology with me.

‘Serial’ and Changes of Heart

Line Charts, Stacked Bar Charts

During this past/last week of the podcast Serial, Sarah Koenig explains (unsurprisingly) that throughout her journalistic investigation of the murder of Hae Min Lee she has changed her opinion many times about Adnan’s guilt or innocence:

“Several times, I have landed on a decision, I’ve made up my mind and stayed there, with relief and then inevitably, I learn something I didn’t know before and I’m up ended. Sometimes the reversal takes a few weeks, sometimes it happens within hours. And what’s been astonishing to me is how the back and forth hasn’t let up, after all of this time. Even into this very week and I kid you not, into this very day that I’m writing this.”

Given the transparent method by which Koenig has shared large chunks as well as scraps of information pertaining to the case week by week, we, the listeners, have similarly been able to shift our opinions/beliefs/doubts about Adnan’s guilt as time has passed. Unlike in the case of a conventional television crime drama, there is no formulaic ending — no revealing of a killer who had been hiding in plain sight during the entire 40-something minutes of predictably paced intrigue. Uncertainty — not Adnan, Hae, or Jay — is the key player whose perpetual presence defines our experience with Serial. And given the overarching, dominant role that uncertainty has played in the “Serial phenomenon,” I wondered, after finishing the final episode, how opinions had been shifting over the course of the podcast… Was this uncertainty — the uncertainty that I had heard in coworkers’ debates, read about in thought pieces, and fought to accept in the cluttered, MailChimp-ad-filled corners of my mind — evident in the numbers somewhere?

The podcast is weekly, meaning there is time between each episode’s release to ponder, debate, maybe even cast a vote on your opinions…? In fact, yes, there is aggregate-level data with respect to public opinion on Adnan’s guilt (what is the percentage of people that think Adnan is guilty? innocent? what percentage is undecided?) thanks to the dedicated Serial coverage by /r/serialpodcast (note for the less media savvy, more mentally healthy among us: /r/serialpodcast is a subreddit, a page on Reddit, dedicated to discussion of the podcast). After the release of episode 6, users on the sub started creating weekly polls in order to keep track of listeners’ wavering opinions. Every Thursday, starting with October 30th (the date of release for episode 6), a poll was accessible on Reddit. People would vote on the poll until the next week when the poll would close just before the next episode became available. The poll opening and closing times ensured that no information from later episodes (episodes X+z | z>0 and z in Z) would influence listeners’ opinions for a given poll meant to illustrate opinions in the aftermath of episode X. Thus, percentages from the polls accurately reflect where listeners stand after a given episode’s recent reveals!

Going a step further, one could argue that since the voter base for these polls (/r/serialpodcast subscribers) consists of loyal repeat voters, the percentages associated with each subsequent episode less the corresponding percentages from the previous episode illustrate the differential effect of that very episode. This seems like a logical conclusion since listeners are adjusting their evolving opinions based on new information in the most recent podcast. Therefore, by looking into the changes in aggregate opinion between the airing of episodes 7-12 (we don’t know how episode 6 changed the public’s opinion since we don’t have data before that episode’s release), we can see the effect that episodes 7-12 had on the crowd’s collective opinion.

Less talk, more graphs

In order to visualize the impact of this range of episodes, I graphed public opinion on Adnan’s guilt following the release of episodes 6-12:

Public's Opinion on Adnan's Guilt Over Time

This graph depicts public opinion on Adnan’s guilt (in terms of percentages who believe he is guilty, innocent, and the percentage of those who are undecided) over the course of the release of Serial episodes 6-12.

There are many interesting things to note about this progression. First off, the percentage of individuals who believe Adnan is innocent ends on a high note after the finale, What We Know, of the podcast — 54% of voters believe Adnan is innocent. This percentage is exactly three times (!) that following the release of episode 8, The Deal With Jay. Furthermore, it is clear that after Episode 9, To Be Suspected, there are no more aggressive changes to public opinion. Instead, all three stances seem to move steadily — very steadily when compared to the changes brought about in consequence to the release of episodes 7-9.

Turning to episodes 7-9, are the movements in opinion due to said episodes logical given the episodes’ substance? I believe so. Listeners are also potentially mimicking, without realizing it, Koenig’s own state of mind in the episodes. Episode 7, The Opposite of the Prosecution, causes a dip in the guilty percentage (of 10 percentage points) and a jump in the not guilty percentage (of 19 percentage points) — a consequence that is predictable just given the name of the episode. However, episode 8 undoes all the hard work episode 7 did for Adnan’s case in the eyes of the public. The not guilty percentage drops down to post episode 6 levels at 18%, while the guilty percentage is above post episode 6 levels at 42%. The largest of all the weekly changes of heart comes with the release of episode 9, To Be Suspected, which highlights Adnan’s calm demeanor during his time in prison. The guilty camp goes from containing 42% of the voters to just 17% while the not guilty camp goes from 18% to 44%. In the graph, this change almost creates a perfectly symmetrical “X” with the guilty and not guilty lines. It is also in this episode that Adnan makes an emotional appeal to Koenig, saying that his parents would be happier if they thought he deserved to be in prison — therefore saying that if he were lying about his innocence he would be bringing pain to his parents — something that Adnan, the same person with the funny anecdote about T-mobile customer service behind bars, wouldn’t do.
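The episode-by-episode effects described above are just first differences of the poll percentages. A minimal Python sketch, using only the “not guilty” figures quoted in this post for episodes 6-9 (the episode 7 value of 37% is inferred from the 19-point jump rather than read from the poll itself):

```python
# "Not guilty" percentages after each episode, as quoted in the text.
# The value for episode 7 is inferred (18 + 19), not read from the poll.
not_guilty = {6: 18, 7: 37, 8: 18, 9: 44}

def episode_effects(series):
    """Change in percentage points attributed to each episode after the first."""
    episodes = sorted(series)
    return {
        current: series[current] - series[previous]
        for previous, current in zip(episodes, episodes[1:])
    }

effects = episode_effects(not_guilty)
print(effects)
```

Episode 6 has no entry in the result because there is no earlier poll to compare it against.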

Another interesting element to note in this analysis is that, over our available time period, the percentage of people who vote as undecided has declined or remained the same every week. This potentially illustrates that despite the fact that more and more scraps, facts, and individuals were added into the mix throughout the progression of the podcast, the aggregate group of voters did feel more certain in their convictions — to the point of no longer checking the “undecided” option. However, this result could also be a fragment of something I felt myself when answering the poll near the end of the podcast — I wanted to vote one way or the other because I felt increasingly useless to the polling exercise by voting undecided repeatedly. Perhaps with the end of the podcast nearing, individuals wanted to be able to make a decision and stick to it, regardless of the constant insecurity in their beliefs.

After looking into how each episode affected aggregate opinions, I wondered if this could differ between the subgroups that reddit users included in their polls — specifically, those with legal training and those without legal training.

Opinion of Those with Legal Training on Adnan's Guilt over Time

This graph depicts the opinion of those with legal training on Adnan’s guilt (in terms of percentages who believe he is guilty, innocent, and the percentage of those who are undecided) over the course of the release of Serial episodes 6-12.

Opinion of Those Without Legal Training on Adnan's Guilt Over Time

This graph depicts the opinion of those without legal training on Adnan’s guilt (in terms of percentages who believe he is guilty, innocent, and the percentage of those who are undecided) over the course of the release of Serial episodes 6-12.

It is immediately evident that the percentages in these two graphs are very similar once episode 8 has aired. However, there is a large and obvious difference in how the two groups respond to episode 7, The Opposite of the Prosecution. For those without legal training, it bumped up the numbers for not guilty by 21 percentage points and pushed down the numbers for guilty by 13 percentage points…but, for those with legal training, it bumped up the numbers for not guilty by just 5 percentage points and even pushed up the numbers for guilty by 6 percentage points.

An easier way to visualize and understand the differences between the two divergent responses to episode 7 is by ignoring the undecided percentages in order to create an “innocence index” of sorts. This quasi-index is equal to the percentage of voters who vote that Adnan is not guilty minus the percentage of voters who vote that Adnan is guilty. The index doesn’t have any meaning other than the differential between perceived innocence and perceived guilt according to the crowd of voters.
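As a quick sketch in Python, here is the index applied to the episode 7 changes quoted earlier for the two subgroups. Because the index is a simple difference, the change in the index equals the change in the not guilty percentage minus the change in the guilty percentage:

```python
def innocence_index(not_guilty, guilty):
    """Percentage voting not guilty minus percentage voting guilty."""
    return not_guilty - guilty

# Changes (in percentage points) following episode 7, as quoted in the text.
episode_7_changes = {
    "legal training": {"not_guilty": 5, "guilty": 6},
    "no legal training": {"not_guilty": 21, "guilty": -13},
}

# The index is linear, so applying it to the changes gives the change in the index.
index_change = {
    group: innocence_index(change["not_guilty"], change["guilty"])
    for group, change in episode_7_changes.items()
}
print(index_change)
```

The group labels are just dictionary keys chosen for this illustration.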


This graph depicts the constructed innocence indices between those with and without legal training over the course of the release of Serial episodes 6-12.

Since we don’t have information from before the release of episode 6, we can’t speak to the differential nature of episode 6 (it could have been that the two groups were divergent in a similar way before that episode and, therefore, episode 6 had little or no effect on aggregate opinion). However, in the case of episodes 7-12, it is very clear that the paths of the two subgroups are extremely similar except in the aftermath of episode 7. For those with legal training, the innocence index doesn’t move substantially (it actually goes down one point), while the index increases by 34 points for those without legal training.

Perhaps this drastically different response is because of the fact that episodes 6 and 7 deal with the case against and the case for the innocence of Adnan. It could be that those with legal training are aware of the potential brutal nature of the case that could be made against Adnan as well as the potentially very favorable nature of the direct opposite approach. Perhaps these individuals are not surprised by the way episode 7 threw off much of the doubt cast on him by episode 6 because they are familiar with the legal process, and understand how a single case can be framed in extremely different ways. Meanwhile, the opinions of those of us more in the dark when it comes to the dynamics of a prosecution/defense were more malleable. Regardless of the exact reasons for this divergence, the difference in the two groups’ innocence indices following episode 7 is immediately striking.

«Visualization update» [March 2015]

My original approach in visualizing this data used line charts, which I think are often the best option for depicting time-series data (due to their simplicity and corresponding comprehensibility). However, using line charts in this context generates lines out of what are truly discrete points–in other words, the plot assumes a linear trend in opinion changes between episodes, which does not reflect the true nature of the data. Because of this conceptual shortcoming that accompanies line charts, I decided to try out another form of visualization that could more accurately represent the discrete nature of the data points. Thanks to a great FlowingData post, I realized an interesting way to do this would be to use stacked bar charts since all the percentages for each opinion of guilty, not guilty, or undecided add up to 100%. (Originally I was attracted to the stacked area chart because it seems sexier–or, as sexy as a chart can be–however, this method also fails to accurately depict the discrete nature of the data points! So, stacked bar chart it is.) Here is the result (made with the ggplot2 package in R):


These charts depict public opinion on Adnan’s guilt (in terms of percentages who believe he is guilty, not guilty, and the percentage of those who are undecided) over the course of the release of Serial episodes 6-12. (Note that the percentages are rounded to the nearest percentage point. Therefore, some combinations might add up to 101 or 99 instead of 100 due to rounding in each category.)
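For anyone curious how rounding produces those off-by-one totals, here is a tiny Python sketch with hypothetical vote counts (not the actual Reddit poll numbers):

```python
def rounded_shares(counts):
    """Round each category's share of the total to the nearest whole percent."""
    total = sum(counts)
    return [round(100 * count / total) for count in counts]

# Hypothetical vote counts for (guilty, not guilty, undecided).
even_split = rounded_shares([1, 1, 1])        # each share is 33.3...%
lopsided = rounded_shares([166, 166, 668])    # shares are 16.6%, 16.6%, 66.8%

print(even_split, sum(even_split))
print(lopsided, sum(lopsided))
```

In the first case every category rounds down and the total comes to 99; in the second every category rounds up and the total comes to 101.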

In short

I have doubted myself with respect to my thoughts on Adnan’s case over the past many weeks. I’ve oscillated up and down with the severity of the rises and falls in the included figures. In brief, it is incredible to see that the week of Serial you just consumed can so profoundly alter the core of your beliefs about the case.

You don’t need Sarah Koenig to serenade you during the finale with tales of the tenuous nature of truth in order to have the point driven home that we are often unsure, uncertain, unclear about our convictions… Just look at the pretty pictures.

Or just listen to that girl pronounce Mail Chimp. Is it Kimp or Chimp? We may never know.

Endnote on data sources

Here are the poll percentage sources in case anyone is curious: Percentages for episodes 6-8, Percentages for episodes 9-11, and Percentages for episode 12 — I collected the data for episode 12 at 6:30pm EST Thursday 12/18/14. I looked at the updated information at 12:50am EST 12/19/14 and, of course, more people had voted, but the percentages for guilty/innocent/undecided were the same. So, I use these numbers without fear of dramatic change in the next few days. (In case anyone is curious, all data and scripts used for this project are available in my “Serial” Github repo.)