The United Nations of Words

Bar Charts

Newsletter e-mails are often artifacts of faded interests or ancient online shopping endeavors. They can be nostalgia-inducing — virtual time capsules set in motion by your past self at t-2, and egged on by your past self at t-1. Remember that free comedy show, that desk lamp purchase (the one that looks Pixar-esque), that political campaign… oof, actually let’s scratch that last one. But, left unchecked, newsletters breed like rabbits and mercilessly crowd inboxes. If you wish to escape the onslaught of red notification bubbles, these e-mails are a sworn enemy whose defeat is an ever-elusive ambition.

However, there is a newsletter whose appearance in my inbox I perpetually welcome with giddy curiosity. That is, Jeremy Singer-Vine’s “Data is Plural.” Every week features a new batch of datasets for your consideration. One dataset in particular caught my eye in the 2017.07.19 edition:

UN General Debate speeches. Each September, the United Nations gathers for its annual General Assembly. Among the activities: the General Debate, a series of speeches delivered by the UN’s nearly 200 member states. The statements provide “an invaluable and, largely untapped, source of information on governments’ policy preferences across a wide range of issues over time,” write a trio of researchers who, earlier this year, published the UN General Debate Corpus — a dataset containing the transcripts of 7,701 speeches from 1970 to 2016.

The Corpus explains that these statements are “akin to the annual legislative state-of-the-union addresses in domestic politics.” As such, they provide a valuable resource for understanding international governments’ “perspective[s] on the major issues in world politics.” Now, I have been interested in playing around with text mining in R for a while, so a rich dataset of international speeches seems like a natural application of basic term frequency and sentiment analysis methods. As I am interested in comparing countries to one another, I need to select a subset of the nearly 200 member states to study. Given their special status, I focus exclusively on the five permanent members of the UN Security Council: the US, the UK, France, China, and Russia. (Of course, you could include many, many more countries of interest for this sort of investigation, but given the format of my desired visuals, five countries is a good cut-off.) Following in the typed footsteps of great code tutorials, I perform two types of analyses–a term frequency analysis and a sentiment analysis–to discuss the thousands of words that were pieced together to form these countries’ speeches.

Term Frequency Analysis

Term frequency analysis has been used in contexts ranging from studying Seinfeld to studying the field of 2016 GOP candidates. A popular metric for such analyses is tf-idf, which is a score of relative term importance. Applied to my context, the metric reveals words that are frequently used by one country but infrequently used by the other four. In more general terms, “[t]he tf-idf value increases proportionally to the number of times a word appears in the document, but is often offset by the frequency of the word in the corpus, which helps to adjust for the fact that some words appear more frequently in general.” (Thanks, Wikipedia.) In short, tf-idf picks out important words for our countries of interest. The 20 words with the highest tf-idf scores are illustrated below:
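For concreteness, here is a minimal sketch of the tf-idf computation. (The actual analysis uses R’s tidytext package; this toy Python version, with two made-up mini “speeches,” just applies the standard definition: term frequency times the log of inverse document frequency.)

```python
import math
from collections import Counter

def tf_idf(docs):
    """For each (document, term): tf = count / doc length,
    idf = ln(total docs / docs containing the term)."""
    n_docs = len(docs)
    doc_counts = {name: Counter(words) for name, words in docs.items()}
    df = Counter()                       # document frequency per term
    for counts in doc_counts.values():
        df.update(counts.keys())
    scores = {}
    for name, counts in doc_counts.items():
        total = sum(counts.values())
        for term, c in counts.items():
            scores[(name, term)] = (c / total) * math.log(n_docs / df[term])
    return scores

# Toy "speeches" (invented): "hegemonism" is unique to China's text
speeches = {
    "China": "hegemonism peace development peace".split(),
    "USA":   "freedom peace democracy freedom".split(),
}
scores = tf_idf(speeches)
# "peace" appears in every document, so its tf-idf is 0 for both countries
```

Note how the shared word scores zero: a term used by every country carries no distinguishing information, which is exactly why tf-idf surfaces country-specific vocabulary.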


China is responsible for 13 of the 20 words. Perhaps this means that China boasts the most unique vocabulary of the Security Council. (Let me know if you disagree with that interpretation.) Now, if instead we want to see the top 5 words for each country–to learn something about their differing focuses–we obtain the results below:


As an American, I am not at all surprised by the picture of my country as one of democratic, god-loving, dream-having entrepreneurs who have a lot to say about Saddam Hussein. Other insights to draw from this picture are: China is troubled by Western superpower countries influencing (“imperialist”) or dominating (“hegemonism”) others, Russia’s old status as the USSR involved lots of name checks to leader Leonid Ilyich Brezhnev, and Britain and France like to talk in the third-person.

Sentiment Analysis

In the world of sentiment analysis, I am primarily curious about which countries give the most and least positive speeches. To figure this out, I calculate positivity scores for each country according to three sentiment dictionaries (nrc, bing, and AFINN), as summarized by the UC Business Analytics R Programming Guide:

The nrc lexicon categorizes words in a binary fashion (“yes”/“no”) into categories of positive, negative, anger, anticipation, disgust, fear, joy, sadness, surprise, and trust. The bing lexicon categorizes words in a binary fashion into positive and negative categories. The AFINN lexicon assigns words with a score that runs between -5 and 5, with negative scores indicating negative sentiment and positive scores indicating positive sentiment.

Therefore, for the nrc and bing lexicons, my generated positivity scores will reflect the number of positive words less the number of negative words. Meanwhile, the AFINN lexicon positivity score will reflect the sum total of all scores (as words have positive scores if they possess positive sentiment and negative scores if they possess negative sentiment). Comparing these three positivity scores across the five Security Council countries yields the following graphic:
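To make the two scoring rules concrete, here is an illustrative Python sketch. (The real analysis uses R’s tidytext sentiment joins; the mini-lexicons and the four-word “speech” below are made up purely for demonstration.)

```python
def positivity_nrc_bing(words, lexicon):
    """nrc/bing-style score: (# positive words) - (# negative words).
    `lexicon` maps word -> "positive" or "negative"."""
    pos = sum(1 for w in words if lexicon.get(w) == "positive")
    neg = sum(1 for w in words if lexicon.get(w) == "negative")
    return pos - neg

def positivity_afinn(words, afinn):
    """AFINN-style score: sum of per-word integer scores in [-5, 5];
    words not in the lexicon contribute 0."""
    return sum(afinn.get(w, 0) for w in words)

# Hypothetical mini-lexicons, not the real nrc/bing/AFINN dictionaries
bing_like  = {"peace": "positive", "hope": "positive", "war": "negative"}
afinn_like = {"peace": 2, "hope": 2, "war": -2}
speech = ["peace", "war", "hope", "peace"]
positivity_nrc_bing(speech, bing_like)   # 3 positive - 1 negative = 2
positivity_afinn(speech, afinn_like)     # 2 - 2 + 2 + 2 = 4
```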


The three methods yield different outcomes: AFINN and Bing conclude that China is the most positive country, followed by the US; meanwhile, the NRC identifies the US as the most positive country, with China in fourth place. And, despite all that disagreement, at least everyone can agree that the UK is the least positive! (How else do we explain “Peep Show”?)

Out of curiosity, I also calculate the NRC lexicon word counts for anger, anticipation, disgust, fear, joy, sadness, surprise, and trust. I then divide the sentiment counts by the total number of words attributed to each country so as to present the percentage of words carrying each emotion rather than the absolute levels. The results are displayed below in stacked and unstacked formats.
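The computation amounts to: count a word once for every NRC emotion it is tagged with, then divide by the country’s total word count. A small Python sketch, with hypothetical NRC tags standing in for the real lexicon:

```python
from collections import Counter

def emotion_shares(words, nrc_tags):
    """Fraction of a country's words tagged with each NRC emotion.
    `nrc_tags` maps word -> set of emotion labels (a word can carry several)."""
    counts = Counter()
    for w in words:
        for emotion in nrc_tags.get(w, ()):
            counts[emotion] += 1
    total = len(words)
    return {emotion: c / total for emotion, c in counts.items()}

# Hypothetical tags, not the real NRC lexicon
nrc_tags = {"war": {"fear", "anger"}, "hope": {"joy", "anticipation"}}
shares = emotion_shares(["war", "hope", "treaty", "war"], nrc_tags)
# fear: 2/4 of words; joy: 1/4 of words; "treaty" is untagged
```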



According to this analysis, the US is the most emotional country, with over 30% of its words associated with an NRC sentiment. China comes in second, followed by the UK, France, and Russia, in that order. However, all five are very close in terms of emotional word percentages, so this ordering does not seem to be particularly striking or meaningful. Moreover, the specific range of emotions looks very similar country by country as well. Perhaps this is due to countries following some well-known framework for a General Debate speech, or perhaps political speeches in general follow some tacit emotional script displaying this mix of emotions…

I wonder how such speeches compare to a novel or a newspaper article in terms of these lexicon scores. For instance, I’d imagine that we’d observe more evidence of emotion in these speeches than in newspaper articles, as those are meant to be objective and clear (though this is less true of new forms of evolving media… i.e., those that aim to further polarize the public… or, those that were aided by one of the Security Council countries to influence an election in another of the Security Council countries… yikes), while political speeches might pick out words specifically to elicit emotion. It would be fascinating to investigate how emotional words are wielded in political speeches or new forms of journalistic media, and how that has evolved over time. (Quick hypothesis: fear is more present in the words that make up American media coverage and political discourse nowadays than it was a year ago…) But, I will leave that work (for now) to people with more in their linguistics toolkit than a novice knowledge of super fun R packages.


As per my updated workflow, I now conduct projects exclusively using R notebooks! So, here is the R notebook responsible for the creation of the included visuals. And, here is the associated Github repo with everything required to replicate the analysis. Methods mimic those outlined by superhe’R’os Julia Silge and David Robinson in their “Text Mining with R” book.

© Alexandra Albright and The Little Dataset That Could, 2017. Unauthorized use and/or duplication of this material without express and written permission from this blog’s author and/or owner is strictly prohibited. Excerpts, accompanying visuals, and links may be used, provided that full and clear credit is given to Alex Albright and The Little Dataset That Could with appropriate and specific direction to the original content.

The One With All The Quantifiable Friendships, Part 2

Bar Charts, Line Charts, Nightingale Graphs, Stacked Area Charts, Time Series

Since finishing my first year of my PhD, I have been spending some quality time with my computer. Sure, the two of us had been together all throughout the academic year, but we weren’t doing much together besides pdf-viewing and type-setting. Around spring break, when I discovered you can in fact freeze your computer by having too many exams/section notes/textbooks simultaneously open, I promised my MacBook that over the summer we would try some new things together. (And that I would take out her trash more.) After that promise and a new sticker addition, she put away the rainbow wheel.

Cut to a few weeks ago. I had a blast from the past in the form of a Twitter notification. Someone had written a post about using R to analyze the TV show Friends, which was motivated by a similar interest that drove me to write something about the show using my own dataset back in 2015. In the post, the author, Giora Simchoni, used R to scrape the scripts for all ten seasons of the show and made all that work publicly available (wheeeeee) for all to peruse. In fact, Giora even used some of the data I shared back in 2015 to look into character centrality. (He makes a convincing case using a variety of data sources that Rachel is the most central friend of the six main characters.) In reading about his project, I could practically hear my laptop humming to remind me of its freshly updated R software and my recent tinkering with R notebooks. (Get ready for new levels of reproducibility!) So, off my Mac and I went, equipped with a new workflow, to explore new data about a familiar TV universe.

Who’s Doing The Talking?

Given line by line data on all ten seasons, I, like Giora, first wanted to look at line totals for all characters. In aggregating all non-“friends” characters together, we get the following snapshot:


First off, why yes, I am using the official Friends font. Second, I am impressed by how close the totals are for all characters, though hardly surprised that Phoebe has the fewest lines. Rachel wouldn’t be surprised either…

Rachel: Ugh, it was just a matter of time before someone had to leave the group. I just always assumed Phoebe would be the one to go.

Phoebe: Ehh!!

Rachel: Honey, come on! You live far away! You’re not related. You lift right out.

With these aggregates in hand, I then was curious: how would line allocations look across time? So, for each episode, I calculate the percentage of lines that each character speaks, and present the results with the following three visuals (again, all non-friends go into the “other” category):
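The per-episode calculation is straightforward: bucket every line by speaker, lumping non-friends into “Other,” then divide by the episode’s line total. An illustrative Python sketch (the original work is in R; the example lines are invented, not real script data):

```python
from collections import Counter

FRIENDS = {"Rachel", "Ross", "Monica", "Chandler", "Joey", "Phoebe"}

def line_shares(lines):
    """lines: iterable of (episode, speaker) pairs. Returns, per episode,
    each character's percentage of that episode's lines, with every
    non-friend lumped into "Other"."""
    per_episode = {}
    for episode, speaker in lines:
        who = speaker if speaker in FRIENDS else "Other"
        per_episode.setdefault(episode, Counter())[who] += 1
    return {ep: {who: 100 * n / sum(c.values()) for who, n in c.items()}
            for ep, c in per_episode.items()}

# Invented lines, just to show the shape of the computation
lines = [("1x01", "Rachel"), ("1x01", "Ross"),
         ("1x01", "Rachel"), ("1x01", "Gunther")]
shares = line_shares(lines)
# Rachel: 50.0%, Ross: 25.0%, Other: 25.0%
```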


Tell me that first graph doesn’t look like a callback to Rachel’s English Trifle. Anyway, regardless of a possible trifle-like appearance, all the visuals illustrate dynamics of an ensemble cast; while there is noise in the time series, the show consistently provides each character with a role to play. However, the last visual does highlight some standouts in the collection of episodes that uncharacteristically highlight or ignore certain characters. In other words, there are episodes in which one member of the cast receives an unusually high or low percentage of the lines in the episode. The three episodes that boast the highest percentages for a single member of the gang are: “The One with Christmas in Tulsa” (41.9% Chandler), “The One With Joey’s Interview” (40.3% Joey), “The One Where Chandler Crosses a Line” (36.3% Chandler). Similarly, the three with the lowest percentages for one of the six are: “The One With The Ring” (1.5% Monica), “The One With The Cuffs” (1.6% Ross), and “The One With The Sonogram At The End” (3.3% Joey). The sagging red lines of the last visual identify episodes that have a low percentage of lines spoken by a character outside of the friend group. In effect, those dips in the graph point to extremely six-person-centric episodes, such as “The One On The Last Night” (0.4% non-friends dialogue–a single line in this case), “The One Where Chandler Gets Caught” (1.1% non-friends dialogue), and “The One With The Vows” (1.2% non-friends dialogue).

The Men Vs. The Women

Given this title, here’s a quick necessary clip:

Now, how do the line allocations look when broken down by gender lines across the main six characters? Well, the split consistently bounces around 50-50 over the course of the 10 seasons. Again, as was the case across the six main characters, the balanced split of lines is pretty impressive.


Note that the second visual highlights that there are a few episodes that are irregularly man-heavy. The top three are: “The One Where Chandler Crosses A Line” (77.0% guys), “The One With Joey’s Interview” (75.1% guys), and “The One With Mac and C.H.E.E.S.E.” (70.2% guys). There are also exactly two episodes that feature a perfect 50-50 split for lines across gender: “The One Where Rachel Finds Out” and “The One With The Thanksgiving Flashbacks.”

Say My Name

How much do the main six characters address or mention one another? Giora addressed this question in his post, and I build off of his work by including nicknames in the calculations and using a different genre of visualization. With respect to the nicknames–“Mon”, “Rach”, “Pheebs”, and “Joe”–“Pheebs” is undoubtedly the stickiest of the group. Characters say “Pheebs” 370 times, which has a comfortable cushion over the second-place nickname “Mon” (used 73 times). Characters also significantly differ in their usage of each other’s nicknames. For example, while Joey calls Phoebe “Pheebs” 38.3% of the time, Monica calls her by this nickname only 4.6% of the time. (If you’re curious about more numbers on the nicknames, check out the project notebook.)
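The counting logic can be sketched as follows, assuming line-by-line (speaker, dialogue) data like Giora’s scrape. (This is an illustrative Python version of an analysis done in R; the example lines are invented.)

```python
import re
from collections import Counter

NICKNAMES = {"Mon": "Monica", "Rach": "Rachel", "Pheebs": "Phoebe", "Joe": "Joey"}
NAMES = {"Monica", "Rachel", "Phoebe", "Joey", "Ross", "Chandler"}

def mention_counts(lines):
    """lines: iterable of (speaker, dialogue) pairs. Counts how often each
    speaker says each friend's name, folding nicknames into full names."""
    counts = Counter()
    for speaker, text in lines:
        for token in re.findall(r"[A-Za-z]+", text):
            name = NICKNAMES.get(token, token)
            if name in NAMES:
                counts[(speaker, name)] += 1
    return counts

counts = mention_counts([("Joey", "Hey Pheebs!"),
                         ("Rachel", "Ross, wait. Ross!")])
# counts[("Joey", "Phoebe")] == 1; counts[("Rachel", "Ross")] == 2
```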

Now, after adding in the nicknames, who says whose name? The following graphic addresses that point of curiosity:


The answer is clear: Rachel says Ross’s name the most! (789 times! OK, we get it, Rachel, you’re in love.) We can also see that Joey is the most self-referential with 242 usages of his own name–perhaps not a shock considering his profession in the entertainment biz. Overall, the above visual provides some data-driven evidence of the closeness between certain characters that is clearly evident in watching the show. Namely, the Joey-Chandler, Monica-Chandler, Ross-Rachel relationships that were evident in my original aggregation of shared plot lines are still at the forefront!


Comparing the above work to what I had originally put together in January 2015 is a real trip. My original graphics back in 2015 were made entirely in Excel and were as such completely unreproducible, as was the data collection process. The difference between the opaqueness of that process and the transparency of sharing notebook output is super exciting to me… and to my loyal MacBook. Yes, yes, I’ll give you another sticker soon.

Let’s see the code!

Here is the html rendered R Notebook for this project. Here is the Github repo with the markdown file included.

*Screen fades to black* 
Executive Producer: Alex Albright

© Alexandra Albright and The Little Dataset That Could, 2017. Unauthorized use and/or duplication of this material without express and written permission from this blog’s author and/or owner is strictly prohibited. Excerpts, accompanying visuals, and links may be used, provided that full and clear credit is given to Alex Albright and The Little Dataset That Could with appropriate and specific direction to the original content.


Geography of Humor: The Case of the New Yorker Caption Contest

Bar Charts, Choropleths, Scatter Plots

Update [9-23-15]: Also check out the newest work on this topic: Which U.S. State Performs Best in the New Yorker Caption Contest?


About 10 years ago The New Yorker began a weekly contest. It was not a contest of writing talents in colorful fiction nor of investigative prowess in journalism, instead it was a contest of short and sweet humor. Write a caption for a cartoon, they said. It’ll be fun, they said. This will help our circulation, the marketing department said. Individuals like me, who back at age 12 in 2005 believed The New Yorker was the adult’s version of Calvin and Hobbes that they most enjoyed in doctors’ waiting rooms, embraced the new tradition with open arms.

Now, 10 years later, approximately 5,372 captions are submitted each week, and just a single winner is picked. Upon recently trying my own hand (and failing unsurprisingly given the sheer magnitude of competing captions) at the contest, I wondered, who are these winners? In particular, since The New Yorker always prints the name and place of residence of the caption contest winner, I wondered, what’s the geographical distribution of these winners? 

In order to answer this question, I used my prized subscriber access to the online Caption Contest archive. This archive features the winning caption for each week’s cartoon (along with two other finalist captions) and the name/place of residence of the caption creator. (The archives also feature all other submitted captions–which is super interesting from a machine learning perspective, but I don’t focus on that in this piece.) So, I snagged the geographic information on the past 10 years of winners and went with it.

The basics

For this analysis, I collected information on the first 466 caption contests–that is, all contests up to and including the following:

New Yorker Caption Contest #466

Before getting into the meat of this discussion, it is worth noting the structure of the contest as well as the range of eligible participants. See this quick explanation from The New Yorker:

Each week, we provide a cartoon in need of a caption. You, the reader, submit your caption below, we choose three finalists, and you vote for your favorite… Any resident of the United States, Canada (except Quebec), Australia, the United Kingdom, or the Republic of Ireland age eighteen or older can enter or vote.

Thus, the contest consists of two rounds: one in which the magazine staff sift through thousands of submissions and pick just three, and one in which the public votes on the ultimate winner out of those three finalists. Furthermore, the contest is open to residents outside the United States–a fact that is easy to forget when considering how rarely individuals from other countries actually win. Out of 466 caption contest winners, only 12 are from outside the United States–2 from Australia, 2 from British Columbia (Canada), and 8 from Ontario (Canada). Though they are allowed to compete, no one from the United Kingdom or the Republic of Ireland has ever won. In short, 97.4% of caption contest winners are from the U.S.

Moving to the city level of geography, it is unsurprising that The New Yorker Caption Contest is dominated by, well, New Yorkers. New York City has 62 wins, meaning New Yorkers have won 13.3% of the contests. To fully understand how dominant this makes New York, consider that the city with the next-most caption contest wins is Los Angeles, with a mere 18 (3.9% of contests). The graphic below depicting the top 8 caption contest cities further highlights New York’s exceptionalism:


Source: New Yorker Caption Contest Archive; Tool: ggplot2 package in R.

The geographic distribution: a state-level analysis

While both the country- and city-level results are dominated by the obvious contenders (the United States and New York City respectively), the state-level analysis is much more compelling.

In this vein, the first question to address is: which states win the most contests? To answer this, I present the following choropleth, in which the states are divided into five categories of equal size (each category contains 10 states) based on the number of contests won. (This method uses quantiles to classify the data into ranges; there are other classification methods one could use as well.) Visualizing the data in this way allows us to quickly perceive areas of the country that are caption-winner-rich as well as caption-winner-sparse:
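For the curious, equal-count (quantile) classification can be sketched in a few lines. (The choroplethr package handles this internally in R; this Python version and the win totals below are illustrative only.)

```python
def quantile_classes(values, n_classes=5):
    """Assign each value to one of n_classes equal-count bins,
    0 = lowest bin; ties are broken by original order."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    classes = [0] * len(values)
    for rank, i in enumerate(order):
        classes[i] = min(rank * n_classes // len(values), n_classes - 1)
    return classes

# Illustrative win totals for ten states (not the real data)
wins = [85, 75, 16, 0, 0, 3, 7, 14, 2, 1]
quantile_classes(wins)   # ten states split into five bins of two
```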


Source: New Yorker Caption Contest Archive; Tool: choroplethr package in R.

This visualization illustrates that the most successful caption contest states are either east coast or west coast states, with the exception of Illinois (due to Chicago’s 16 wins). The most barren section of the country is unsurprisingly the center of the country. (In particular, Idaho, Kansas, North/South Dakota, West Virginia, and Wyoming have never boasted any caption contest winners.)

While using quantiles to classify the data into ranges is helpful, it gives us an overly broad last category–the darkest blue class contains states with win totals ranging from 14 to 85. If we want to zoom in and compare the states within this one category, we can pivot to a simple bar chart for precision’s sake. The following graph presents the number of contests won among the top ten states:


Source: New Yorker Caption Contest Archive; Tool: ggplot2 package in R.

New York and California are clearly the most dominant states with 85 and 75 wins respectively, which is to be expected considering how populous the two are. Taking into account the population size of a given state would most definitely yield a superior metric of how well each state does in winning the contest. (It would also be interesting to take into account the number of The New Yorker subscribers by state, but I haven’t been able to get a hold of that data yet, so I am putting a pin in that idea for now.)

Therefore, I normalize these counts by creating a new metric: number of caption contests won per one million state residents. In making this change, the map colors shift noticeably. See the following choropleth for the new results:


Source: New Yorker Caption Contest Archive; Tool: choroplethr package in R.

Again, the last category is the one with the broadest range (2.425 to 7.991 wins per million residents). So, once more, it is worth moving away from cool, colorful choropleths and towards the classical bar chart. In comparing the below bar graph with the previous one, one can quickly see the difference made in normalizing by population:


Source: New Yorker Caption Contest Archive; Tool: ggplot2 package in R.

For one, the once-dominant New York falls behind new arrivals Vermont and Rhode Island, while the similarly dominant California is nowhere to be seen! Other states that lose their place among the top ten are: Illinois, New Jersey, and Pennsylvania. Meanwhile, the four new states in this updated top ten are: Alaska and New Hampshire as well as the previously mentioned Rhode Island and Vermont. Among these four new arrivals, Vermont stands clearly ahead of the pack with approximately 8 caption contest wins per million residents.

The high counts per million for states like Vermont and Rhode Island suggest a relationship that many were likely considering throughout this entire article–isn’t The New Yorker for liberals? Accordingly, isn’t there a relationship between wins per million and liberalness?

Those damn liberal, nonreligious states

Once we have normalized caption contest wins by population, we still have not completely normalized states by their likelihood of winning the contest. This is due to the fact that there is a distinct relationship between wins per million residents and evident political markers of The-New-Yorker-types. In particular, consider Gallup’s State of the States measures of “% liberal” and “% nonreligious.” First, I present the strong association between liberal percentages and wins per million:


Source: New Yorker Caption Contest Archive; Tool: ggplot2 package in R.

The above is a scatterplot in which each point is a state (see the small state abbreviation labels) and the blue line is a linear regression line (the shaded area is the 95% confidence region) fit to the data. The conclusion is unmistakable; states that are more liberal tend to win more contests per million residents. Specifically, the equation for the linear regression line is:

wins_per_million = -3.13 + 0.22(pct_liberal)

This means that a 1 percentage point increase in the liberal percentage is associated with an increase of 0.22 wins per million. The R^2 (in this case, the same as the squared correlation coefficient between wins_per_million and pct_liberal, since there is just one explanatory variable in the regression) is 0.364, meaning that 36.4% of response variable variation is explained by this simple model. (The standard error on the coefficient attached to pct_liberal is only 0.04, meaning the coefficient is easily statistically significant at the 0.1% level.)
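As a sanity check on how slope, intercept, and R^2 fit together, here is a small self-contained sketch of simple OLS. (The original fit was done in R; the (pct_liberal, wins_per_million) pairs below are invented for illustration and chosen to lie near a line with slope close to the 0.22 reported above.)

```python
def ols_simple(x, y):
    """Fit y = a + b*x by least squares; return (a, b, R^2). With a single
    regressor, R^2 equals the squared correlation between x and y."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    b = sxy / sxx                       # slope
    a = my - b * mx                     # intercept
    ss_res = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - my) ** 2 for yi in y)
    return a, b, 1 - ss_res / ss_tot

# Invented (pct_liberal, wins_per_million) pairs lying near a line
x = [15, 20, 25, 30]
y = [0.2, 1.3, 2.4, 3.4]
a, b, r2 = ols_simple(x, y)   # slope b comes out near 0.21 here
```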

Also strong is the association between nonreligious percentages and wins per million, presented in the graph below:


Source: New Yorker Caption Contest Archive; Tool: ggplot2 package in R.

This plot is very similar to the previous one, most definitely because states with high liberal percentages are likely to have high nonreligious percentages as well. The linear regression line that is fit for this data is:

wins_per_million = -1.37 + 0.09(pct_nonreligious)

The relevant conceptual interpretation is that a 1 percentage point increase in the nonreligious percentage is associated with an increase of 0.09 wins per million. The R^2 for this model is 0.316, so 31.6% of response variable variation is explained by the model. (Again, the coefficient of interest–this time the coefficient attached to pct_nonreligious–is statistically significant at the 0.1% level.)

These two graphs are simple illustrations of the statistically significant relationships between wins per million and two political markers of The New Yorker readership. In order to better understand the relationship between these variables, one must return to the structure of the contest…

The mechanism behind the success of liberal, nonreligious states

The caption contest is broken chronologically into three phases: (1) individuals submit captions, (2) three captions are selected as finalists by magazine staff, and (3) the public votes on their favorite caption.

It seems most likely that the mechanism behind the success of liberal, nonreligious states lies in the first phase. In other words, liberal, nonreligious people are more likely to read The New Yorker and/or follow the caption contest. (Its humor is unlikely to resonate with the intensely religious, socially conservative crowd.) Therefore, the tendency towards wins for liberal, nonreligious states is mostly a question of who chooses to participate.

It could also be the case that at least a part of the mechanism behind these states’ successes lies in phases (2) or (3). If a piece of this mechanism was snuggled up in phase 2, that would mean The New Yorker staff is inclined due to an innate sense of liberal humor to pick captions from specific states. (Yet, since most submissions are probably already from liberals, this seems unlikely–though maybe the reverse happens as the magazine attempts to foster geographic diversity by selecting captions from a broader range of locations? I don’t think that’s part of the caption selection process, but it could be relevant to the aforementioned mechanism if it were.) If the mechanism were instead hidden within the third phase, this would mean voters tend to vote for captions created by people from more nonreligious and liberal states in the country. One interesting element to note is that voters can see the place of residence of a caption creator–though I highly doubt this influences people’s voting choices, it is possible that regional favoritism is a factor (e.g., New Yorkers like to see other New Yorkers win and, therefore, the large number of New Yorker voters pushes the New Yorker caption submissions to win).

In order to better investigate the mechanism behind the success of nonreligious, liberal states, one needs access to the geographic data of all submissions…or, at least the data on the number of subscribers per state. Though one can submit to the contest without a subscription, the latter measure could still be used as a credible proxy for the former since the number of people who submit to the contest in a state is likely proportional to the number of subscribers in the state.

A thank you note

Thanks to my family for giving me a subscription to The New Yorker this past holiday season in a desperate attempt to help me become less of a philistine. My sincerest apologies that I have focused more on the cartoons than all those chunks of words that mark space in between.


I’ll be sure to actually call you all up if I ever win–good news: if I enter every contest for the next ten years I’ll have approximately a 10% chance of winning just by chance alone.
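For the skeptical, that back-of-the-envelope figure checks out in a few lines, assuming each of the roughly 5,372 weekly entries is equally likely to win and the weekly contests are independent:

```python
# ~5,372 captions are submitted per weekly contest; assume every entry
# has an equal chance of winning and each week is independent.
p_per_contest = 1 / 5372
n_contests = 52 * 10          # entering every week for ten years
p_at_least_one_win = 1 - (1 - p_per_contest) ** n_contests
# comes out just above 0.09, i.e. in the ballpark of the quoted 10%
```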

Me & Bob Mankoff (Cartoon Editor of The New Yorker and creator of the above cartoon)


Future work
  • Make maps interactive (using Mapbox/TileMill/QGIS and the like) and embed into page with the help of Sarah Michael Levine!
  • Look at captions per number of subscribers in a state (even though you can submit even if you’re not a subscriber–I assume submissions from a state would be proportional to the number of subscribers)
  • See if it’s possible to collect state data on all submitted captions in order to test hypotheses related to the mechanism behind the success of liberal, nonreligious states
  • Create predictive model with wins per million as the dependent variable
    • Independent variables could include proximity to New York or a dummy variable based on whether the state is in the Northeast, income per capita, percent liberal, percent nonreligious, (use logs?) etc.
      • However, the issue with many of these is that there is likely to be multicollinearity since so many of these independent variables are highly correlated…Food for thought
        • In particular, it is not worthwhile to include both % liberal and % nonreligious in one regression (one loses statistical significance altogether and the other goes from the 0.1% level to the 5% level)

All data and R scripts needed to recreate all types of visualizations in this article (choropleths, bar charts, and scatterplots with linear regression lines) are available on my “NewYorker” Github repo.

© Alexandra Albright and The Little Dataset That Could, 2016. Unauthorized use and/or duplication of this material without express and written permission from this blog’s author and/or owner is strictly prohibited. Excerpts, accompanying visuals, and links may be used, provided that full and clear credit is given to Alex Albright and The Little Dataset That Could with appropriate and specific direction to the original content.

The Rise of the New Kind of Cabbie: A Comparison of Uber and Taxi Drivers

Bar Charts, Stacked Bar Charts

One day back in the early 2000’s, I commandeered one of my mom’s many spiral notebooks. I’d carry the notebook all around Manhattan, allowing it to accompany me everywhere from pizza parlors to playgrounds, while the notebook waited eagerly for my parents to hail a taxicab so it could fulfill its eventual purpose. Once in a cab, after clicking my seat belt into place (of course!), I’d pull out the notebook in order to develop one of my very first spreadsheets. Not the electronic kind, the paper kind. I made one column for the date of the cab ride, another for the driver’s medallion number (5J31, 3A37, 7P89, etc.) and one last one for the driver’s full name–both the name and number were always readily visible, pressed between two slabs of Plexiglas that intentionally separate the back from the front seat. Taxi drivers always seemed a little nervous when they noticed I was taking down their information–unsure of whether this 8-year-old was planning on calling in a complaint about them to the Taxi and Limousine Commission. I wasn’t planning on it.

Instead, I collected this information in order to discover if I would ever ride in the same cab twice…which I eventually did! On the day that I collected duplicate entries in the second and third columns, I felt an emotional connection to this notebook as it contained a time series of yellow cab rides that ran in parallel with my own development as a tiny human. (Or maybe I just felt emotional because only children can be desperate for friendship, even when it’s friendship with a notebook.) After pages and pages of observations, collected over the years using writing implements ranging from dull pencils to thick Sharpies, I never would have thought that one day yellow cabs would be eclipsed by something else…

Something else

However, today in 2015, according to Taxi and Limousine Commission data, there are officially more Uber cars in New York City than yellow cabs! This is incredible not just because of the speed of Uber’s growth but also because riding with Uber and other similar car services (Lyft, Sidecar) is a vastly different experience than riding in a yellow cab. Never in my pre-Uber life did I think of sitting shotgun. Nor did I consider starting a conversation with the driver. (I most definitely did not tell anyone my name or where I went to school.) Never did my taxi driver need to use an iPhone to get me to my destination. But, most evident to me is the distinction between the identities of the two sets of drivers. Compared to traditional cab service drivers, Uber drivers are unmistakably younger, whiter, more female, and more part-time. Though I have continuously noted these distinctions since growing accustomed to Uber this past summer, I did not think that there was data for illustrating these distinctions quantitatively. However, I recently came across the paper “An Analysis of the Labor Market for Uber’s Driver-Partners in the United States,” written by (economists!) Jonathan Hall and Alan Krueger. The paper supplies tables that summarize characteristics of both Uber drivers and their conventional taxi driver/chauffeur counterparts. This allows for an exercise in visually depicting the differences between the two opposing sets of drivers–allowing us to then more accurately define the characteristics of a new kind of cabbie.

The rise of the younger cabbie


The above figure illustrates that Uber drivers are noticeably younger than their taxi counterparts. (From here on, when I discuss taxis I am also implicitly including chauffeurs. If you’d like to learn more about the source of the data and the collection methodology, refer directly to the paper.) For one, the age range including the highest percentage of Uber drivers is the 30-39 range (with 30.1% of drivers) while the range including the highest percentage of taxi drivers is the 50-64 range (with 36.6% of drivers). While about 19.1% of Uber drivers are under 30, only about 8.5% of taxi drivers are this young. Similarly, while only 24.5% of Uber drivers are over 50, 44.3% of taxi drivers are over this threshold. This difference in age is not very surprising given that Uber is a technological innovation and, therefore, participation is skewed to younger individuals.

The rise of the more highly educated cabbie


This figure illustrates that Uber drivers, on the whole, are more highly educated than their taxi counterparts. While only 12.2% of Uber drivers have no education beyond high school completion, the majority of taxi drivers (52.5%) fall into this category. The percentage of taxi drivers with at least a college degree is a mere 18.8%, but the percentage of Uber drivers with at least a college degree is 47.7%, which is even higher than the corresponding percentage for all workers, 41.1%. Thus, Uber’s rise has created a new class of drivers whose education level exceeds that of the overall workforce. (Though it is worth noting that the overall workforce boasts a higher percentage of individuals with postgraduate degrees than does Uber–16% to 10.8%.)

The rise of the whiter cabbie


On the topic of race, conventional taxis boast higher percentages of all non-white racial groups except for the “Other Non-Hispanic” group, which is 3.9 percentage points higher among the Uber population. The most represented race among taxi drivers is black, while the most represented race among Uber drivers is white. 19.5% of Uber drivers are black while 31.6% of taxi drivers are black, and 40.3% of Uber drivers are white while 26.2% of taxi drivers are white. I would be curious to compare the racial breakdown of Uber’s drivers to that of Lyft and Sidecar’s drivers as I suspect the other two might not have populations that are as white (simply based on my own small and insufficient sample size).

The rise of the female cabbie


It has been previously documented how Uber has helped women begin to “break into” the taxi industry. While only 1% of NYC yellow cab drivers are women and 8% of taxis (and chauffeurs) as a whole are women, an impressive 14% of Uber drivers are women–a percentage that is likely only possible in the driving industry due to the safety that Uber provides via the information on its riders.

The rise of the very-part-time cabbie


A whopping 51% of Uber drivers drive a mere 1-15 hours per week though only 4% of taxi drivers do so. This distinction in driving times between the two sets of drivers makes it clear that Uber drivers are more likely to be supplementing other sources of income with Uber work, while taxi drivers are more likely to be working as drivers full-time (81% of taxi drivers drive more than 35 hours a week on average, but only 19% of Uber drivers do so). In short, it is very clear that Uber drivers treat driving as more of a part-time commitment than do traditional taxi drivers.

Uber by the cities

As a bonus, beyond profiling the demographic and behavioral differences between the two classes of drivers, I present some information about how Uber drivers differ city by city. While this type of comparison could also be extremely interesting for demographic data (gender, race, etc.), hours worked and earnings are the only available pieces of information profiled by city in Hall and Krueger (2015).

Uber by the cities: hours


New York is the city with the smallest share of part-time uberX drivers. (Note: this data covers only hours worked by uberX drivers in October 2014.) Only 42% work 1-15 hours while the percentage for the other cities ranges from 53-59%. Similarly, 23% of NYC Uber drivers work 35+ hours while the percentage for other cities ranges from 12-16%. Though these breakdowns are different for each of the six cities, the figure illustrates that Uber driving is treated pretty uniformly as a part-time gig throughout the country.

Uber by the cities: earnings

Also in the report was a breakdown of median earnings per hour by city. An important caveat here is that these are gross pay numbers and, therefore, they do not take into account the costs of driving a taxi or an Uber. If you’d like to read a quick critique of the paper’s statement that “the net hourly earnings of Uber’s driver-partners exceed the hourly wage of employed taxi drivers and chauffeurs, on average,” read this. However, I will not join this discussion and instead focus only on gross pay numbers since costs are indeed unknown.


According to the report’s information, NYC Uber drivers take in the highest gross earnings per hour ($30.35), followed by SF drivers ($25.77). These are also the two cities in which traditional cabbies make the most. However, while NYC taxi drivers make only a few dollars more per hour than their counterparts in other cities, NYC Uber drivers make more than 10 dollars per hour more than Boston, Chicago, DC, and LA Uber drivers.


There is no doubt that the modern taxi experience is different from the one that I once cataloged in my stout, spiral notebook. Sure, Uber drivers are younger than their conventional cabbie counterparts. They are more often female and more often white. They are more likely to talk to you and tell you about their other jobs or interests. But, the nature of the taxi industry is changing far beyond the scope of the drivers. In particular, information that was once unknown (who took a cab ride with whom and when?) to those not in possession of a taxi notebook is now readily accessible to companies like Uber. Now, this string of recorded Uber rides is just one element in an all-encompassing set of (technologically recorded) sequential occurrences that can at least partially sketch out a skeleton of our lived experiences…No pen or paper necessary.

Bonus: a cartoon!

The New Yorker Caption Contest for this week with my added caption. The photo was too oddly relevant to my current Uber v. Taxi project for me to not include it!

 Future work (all of which requires access to more data)
  • Investigate whether certain age groups for Uber are dominated by a specific race, e.g. is the 18-39 group disproportionately white while the 40+ group is disproportionately non-white?
  • Request data on gender/race breakdowns for Uber and Taxis by city
    • Looking at the racial breakdowns for NYC would be particularly interesting since the NYC breakdown is likely very different from that of cabbies throughout the rest of the country (this data is not available in the Taxicab Fact Book)
  • Compare characteristics by ride-sharing service: Uber, Lyft, and Sidecar
  • Investigate distribution of types of cars driven by Uber, Lyft, and Sidecar (Toyota, Honda, etc.)

The R notebook for replicating all visuals is available here. See full github repo for the data as well. (Both updated 7-27-17)

© Alexandra Albright and The Little Dataset That Could, 2016. Unauthorized use and/or duplication of this material without express and written permission from this blog’s author and/or owner is strictly prohibited. Excerpts, accompanying visuals, and links may be used, provided that full and clear credit is given to Alex Albright and The Little Dataset That Could with appropriate and specific direction to the original content.

(TGI)Friday the 13th

Bar Charts, Heat Maps, Network Visualizations

I was born on November 13th, 1992. And, while that might be sufficient information to verify my identity to that pre-recorded human on the other end of my calls to the bank, there is an important detail always excluded in this formal expression of one’s birth date: the day of the week. And what makes this seemingly unremarkable fact about me exciting (or disturbing, depending on your take) is that pesky, frequently ignored detail. I was born on a Friday.

Yes, Friday the 13th, the universally proclaimed day of bad luck. Black cats, the devil, all that jazz. Despite the fact that it was the strong but random force of chance, not some meaningful destiny, that sealed my birth date, I still feel a personal tie to the dark and twisty combination. And, given this past February’s Friday the 13th immediately followed by another, today(!), in March, I decided to revisit an old question: how frequent is Friday the 13th anyway? And how often does this February-March combination happen? Are there other regular month combinations?


In response to the former: in the long run, Friday is actually the day of the week most likely to fall on the 13th! (Just by a tiny bit…the probability of the 13th being a Friday is 0.1433 while the probabilities of the 13th falling on other days are 0.1425 (Thursday & Saturday), 0.1427 (Monday & Tuesday), and 0.1431 (Wednesday & Sunday).) Over the past thirty years, the average number of Friday the 13th’s in a year was approximately 1.74 (which is a higher average than expected if one assumed there was exactly a 1/7th chance of a Friday the 13th every month of every year: (1/7)*12 ≈ 1.714). See below for a visualization of Friday the 13th (or F13 from now on) frequencies over the past three decades:


Plot made with ggplot2 package in R.
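For the curious, those long-run probabilities fall out of the Gregorian calendar’s 400-year cycle, over which weekday patterns repeat exactly. A quick Python sketch (the post’s own analysis and plots were done in R; this is just an illustrative check) tallies the weekday of every 13th across one full cycle:

```python
from datetime import date
from collections import Counter

# Tally the weekday of the 13th of every month across one full
# 400-year Gregorian cycle (weekday patterns repeat every 400 years).
counts = Counter(date(y, m, 13).strftime("%A")
                 for y in range(2000, 2400) for m in range(1, 13))

total = sum(counts.values())  # 4800 thirteenths in the cycle
for day, n in counts.most_common():
    print(f"{day:9s} {n:4d}  p = {n / total:.4f}")
# Friday comes out on top, with 688/4800 ≈ 0.1433
```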

It is obvious from this graph that each of the past thirty years contains at least one F13. This is because it is actually impossible to have a year without any F13’s. In fact, this can be proven mathematically quite quickly. (See proof below. Or skip ahead if that’s not your thing.)

Quick proof

Quoting directly from a StackExchange solution:

A month has a Friday 13th if and only if it begins on a Sunday.

On a regular (non-leap) year, if January begins on day k, 0 ≤ k ≤ 6 (with k = 0 being Sunday), then we have that:

  • January begins on day k;
  • February begins on day k+3 (mod 7) (since January has 31 days, and 31 ≡ 3 (mod 7));
  • March begins on day k+3 (mod 7);
  • April begins on day k+6 (mod 7);
  • May begins on day k+8 ≡ k+1 (mod 7) (since April has 30 days, and 30 ≡ 2 (mod 7));
  • June begins on day k+4 (mod 7);
  • July begins on day k+6 (mod 7);
  • August begins on day k+9 ≡ k+2 (mod 7);
  • September begins on day k+5 (mod 7);

With these, we already have day k, k+1, k+2, k+3, k+4, k+5, and k+6, so at least one of these months will begin on Sunday, guaranteeing at least one Friday 13th.

For Leap years, the analysis is similar, except that:

  • January begins on day k;
  • February begins on day k+3;
  • March begins on day k+4;
  • April begins on day k;
  • May begins on day k+2;
  • June begins on day k+5;
  • July begins on day k;
  • August begins on day k+3;
  • September begins on day k+6;
  • October begins on day k+1.

So at the latest, you will have a Friday 13th by October.
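As a sanity check on the proof, a few lines of Python (not part of the original analysis; just an illustrative sketch) confirm that every year in a two-century span has between one and three F13’s:

```python
from datetime import date

def friday_13ths(year):
    """Return the months of `year` whose 13th falls on a Friday."""
    return [m for m in range(1, 13) if date(year, m, 13).weekday() == 4]  # Mon=0 ... Fri=4

counts = [len(friday_13ths(y)) for y in range(1900, 2100)]
assert min(counts) >= 1 and max(counts) <= 3  # never zero, never more than three

print(friday_13ths(2015))  # → [2, 3, 11], the February-March-November trio
```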

Less proofs, more visuals

The distribution of F13’s over different months is very even when aggregated over the past thirty years; six months contained a F13 four times, while the other six months contained one five times. However, all months are not equally likely to share a F13 in the same year with all other months. For instance, in the past thirty years, March and November both had F13’s in five of the same years, while March and May never had a F13 in the same year. One can verify this claim that certain month combinations are more frequent than others by inspecting the following heat map visualization (click to enlarge):


A box is red if that month-year combination included a Friday the 13th and grey otherwise. Plot made with ggplot2 package in R.

Most evident in the above plot is the February-March-November combination–immediately noticeable since February and March are right next to one another. It turns out that our current year, 2015, is one of five in the past thirty with three F13’s, and also one of four years (1987, 1998, 2009, 2015) in the past thirty featuring the February-March-November combination.

While the February-March-November combination is the most frequent trio in the past thirty years (the only other one being January-April-July in 2012), there are duos that are just as frequent during this time period. In order to best see all the combinations of months that have had F13’s in the same year, I created a network visualization. This network features a bidirectional edge between two months (represented by red vertices) if they have both had a F13 in the same year. Alternatively, a vertex can also have a loop (an edge that connects it to itself) if it was the only month in a year to feature a F13.


Network depicting edges between month vertices if the two have included a Friday the 13th in the same year. A loop is shown if a given month has been the only month in any year to have a Friday the 13th. Network made using the igraph package in R.

(There are no weights for the edges in this network. Instead, an edge between two vertices, a and b, is determined by the binary response to: in a given year, have months a and b ever both included a Friday the 13th? If no, no edge exists between a and b. If yes, an edge exists between a and b. However, one could easily extend this work by weighting the edges to depict frequencies of various month combinations.)
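For anyone wanting to rebuild this edge set, here is a rough Python sketch (the post’s figure itself was made with igraph in R; the exact thirty-year window below is my assumption):

```python
from datetime import date
from itertools import combinations

def f13_months(year):
    """Months of `year` whose 13th falls on a Friday."""
    return [m for m in range(1, 13) if date(year, m, 13).weekday() == 4]

edges = set()   # unordered month pairs that shared an F13 in some year
loops = set()   # months that were a year's only F13
for year in range(1985, 2015):  # an assumed thirty-year window ending at the post's 2015
    months = f13_months(year)
    if len(months) == 1:
        loops.add(months[0])
    edges.update(combinations(months, 2))

print(sorted(edges))  # e.g. the February-March pair (2, 3) appears via 1987, 1998, 2009
print(sorted(loops))
```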

This network breaks down into four smaller graphs of sizes 1, 1, 5, and 7 (size in a graph is defined as the number of edges). From this network, one can see clearly that May is the only month in the past thirty years to have never shared a F13 with another month in the same year. Also, over this time period, September had a F13 if and only if December did as well. Lastly, January, February, and April are the months that have shared F13’s with the most other months.


Despite the fact that there is nothing mathematically extraordinary about occurrences of November Friday the 13th’s, I still hold tight to my personal connection to the date. While November Friday the 13th is not incredibly rare, it did turn out that 1992 was the only year in the past thirty to have a November Friday the 13th without both February and March Friday the 13th’s preceding it. (See the heat map!) And if that’s a fact I can use to infuse some sort of mathematical significance into my birthday then I am using it. After all, it is this very date that has given me an affinity for a number usually considered substandard as well as the birthright to scoff, personally offended, when an apartment building elevator disrespectfully skips from the 12th to the 14th floor.

Future work
  • Add weights to edges to depict frequencies of various month combinations
  • Use network centrality methods paired with edge weights to determine the months that are the most central. (At this moment, using simple degree centrality measures, the most central months would be January, February, and April–with June included if you count loops towards degree measures.)

All data and R scripts needed to recreate these visualizations are available on my “Fri13” Github repo.

© Alexandra Albright and The Little Dataset That Could, 2016. Unauthorized use and/or duplication of this material without express and written permission from this blog’s author and/or owner is strictly prohibited. Excerpts, accompanying visuals, and links may be used, provided that full and clear credit is given to Alex Albright and The Little Dataset That Could with appropriate and specific direction to the original content.

The One With All The Quantifiable Friendships

Bar Charts, Network Visualizations

This post does not refer to actual friendships–no, I am writing about Friends the television show and the corresponding fictional friendships born out of the 1990’s sitcom universe.


Given Netflix’s much anticipated addition of Friends to their online streaming empire, it is no surprise that public attention has refocused on the show. On a whim that was more nostalgic than anything else, I started re-watching episodes in the “sweet spot” of its 10-season run (seasons 3 & 4, in my opinion). I quickly remembered that one of the incredibly successful elements of the show was the variation the writers created in grouping different sets of characters together in plots for each episode. Also in re-watching the show, I remembered that certain pairs of characters were closer (friendship-wise) than others–and I began to wonder whether one could illustrate the closeness (or lack thereof) between certain characters using quantitative data from the 236 episodes of the show. 

The method I chose for doing exactly this was to calculate the frequency of characters’ shared plotlines, or character groupings, throughout the span of the entire show. Assuming the show collected a random sample of moments from the lives of the six fictional characters, the number of shared plotlines could serve as a measurement of closeness.  If, in her free time, X spends 5 hours a week with Y but 10 hours a week with Z, one would assume X and Z are closer to one another than are X and Y (…despite the fact that it could just be the case that both X and Z are unemployed while Y is a graduate student–I readily note the imperfections of such a measurement).

Let us consider the question of character groupings in basic mathematical terms. There are six friends, each an element of the overarching “group,” defined as the set F={1,2,3,4,5,6} where 1, 2, 3, 4, 5, 6 represent Chandler, Joey, Monica, Phoebe, Rachel, and Ross respectively (the listing method is alphabetical). Each episode features character groupings in the form of shared plots, which in turn correspond to subsets of F. For example, “The One With George Stephanopoulos” would be represented by the set TOWGS={{1,2,6},{3,4,5}} since the plotline with the guys at the hockey game, {1,2,6} (⊆F), is an element of TOWGS as is the plotline with the girls getting pizza/watching George drop his towel, {3,4,5} (⊆F). There are 64 possible subsets of set F, including both the empty set and F itself (64=2^6).

In thinking about quantitatively measuring the friendships via plotline counts, I wondered whether there already existed a numerical database of the 236 episodes that identified each plotline’s defining characters. In other words, I wondered if there was some database showing that “The One With The Unagi” features a Ross/Rachel/Phoebe combination. Since there was no such quantitative database, my unpaid/unofficial RA Adam Strawbridge and I collected information on the characters involved in the plotlines for all 236 episodes of Friends (via watching episodes on Netflix and reading the Friends Wiki). We defined and coded the dynamics in each of the 236 episodes of the series as follows: a dynamic with characters x_1, x_2, …, x_n corresponds to the set {x_1, x_2, …, x_n} such that x_1 < x_2 < … < x_n and n ≤ 6–but to avoid set notation, dynamic {x_1, x_2, …, x_n} is coded into the dataset as the numeric value x_1…x_n. (Let’s consider a quick example. Given a plotline involving all the men, I know the dynamic is made up of Chandler, Joey, and Ross–their corresponding numbers are 1, 2, and 6. So, their dynamic is defined as {1,2,6} and coded as the number 126.) After coding, I looked into the independence of the six main characters as well as measurements and visualizations of the 15 total two-person dynamics present in the show.
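The encoding step above is easy to implement. Here is a hypothetical Python version (the function and names are mine, for illustration; the original dataset was hand-coded):

```python
# Alphabetical numbering of the six friends, per the post's set F = {1,...,6}.
CHARACTERS = {"Chandler": 1, "Joey": 2, "Monica": 3,
              "Phoebe": 4, "Rachel": 5, "Ross": 6}

def encode_dynamic(names):
    """Code a plotline's character set {x_1 < x_2 < ... < x_n} as the integer x_1...x_n."""
    digits = sorted(CHARACTERS[name] for name in names)
    return int("".join(map(str, digits)))

# The all-guys plotline: Chandler (1), Joey (2), Ross (6) -> set {1,2,6}, coded 126.
print(encode_dynamic(["Ross", "Chandler", "Joey"]))  # → 126
```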

Character independence

Before getting into the measurements of two-person dynamics, I present a visualization of the total number of independent plotlines for each character. (Independent plotlines include those ranging from Chandler dealing with his butt-slapping boss to Phoebe dating her sister’s stalker.) One can consider this frequency count as a measurement of each character’s independence.


Unsurprisingly, the most independent character is Phoebe, a free spirit who doesn’t possess as many ties (familial, romantic, or roommate-related) to the group as do the other five. To quote Rachel in “The One With The Kips”:

Rachel: Ugh, it was just a matter of time before someone had to leave the group. I just always assumed Phoebe would be the one to go.

Phoebe: Ehh!!

Rachel: Honey, come on! You live far away! You’re not related. You lift right out.

Meanwhile, Chandler, Ross’s college roommate/Joey’s young adulthood roommate/Monica’s boyfriend-then-husband, is deeply entwined in the group and, accordingly, does not go it alone very much…In fact, we know so little about him outside the context of the group that no one is quite sure what he does for a living.

2-person dynamics

Now, we move to the crux of my original question–is the emotional closeness that exists between two characters illustrated by the frequency of episodic plotlines?

I first approach this question by calculating a basic frequency measure (Frequency Original), the frequency of a given two-person plotline for all the 15 duos over all episodes. The Frequency Adjusted measure differs from the former in that it also takes into account plotlines that are not exclusive to the two individuals of interest–in other words, plotlines that include other characters on top of the two characters of interest also add to the duo’s count. For instance, the Rachel/Ross/Phoebe Unagi dynamic would add one count to all of the three following dynamics: Rachel/Ross, Phoebe/Rachel, and Phoebe/Ross. Given this simple methodology, I then plot each duo’s two frequency measures as follows:


Regardless of the frequency measure used, the most frequent two-person dynamics (marked in green) are obviously Chandler/Monica, Chandler/Joey, and Rachel/Ross (as expected by any occasional viewer of Friends). Interestingly enough, Rachel and Ross share more exclusively 2-person plots than do Monica and Chandler (70 to 63) despite the fact that the latter duo shares more plots overall than the former (94 to 81). This is most likely because Rachel and Ross, an on-again-off-again couple, had a complicated romantic history that could have inhibited them from regularly interacting in larger group plots, while Monica and Chandler were friends consistently until dating and then marriage.

Following the top three, using the adjusted frequency measure, are the Joey/Rachel and Monica/Phoebe dynamics, trailed closely by the Phoebe/Rachel and Monica/Rachel dynamics. This graph shows that, yes, the quantitative information about episodic dynamics can illustrate the strength of certain fictional relationships featured in the show.
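The two counting rules can be made concrete with a short sketch. Below is a hypothetical Python rendering on toy episode data (the post’s actual tabulation and plots were done in R), where each episode is a list of plotlines and each plotline is a set of characters:

```python
from itertools import combinations
from collections import Counter

# Toy episode data, not the real 236-episode dataset.
episodes = [
    [{"Chandler", "Joey", "Ross"}, {"Monica", "Phoebe", "Rachel"}],  # e.g. TOWGS
    [{"Rachel", "Ross"}, {"Chandler", "Monica", "Joey"}],
]

original = Counter()   # Frequency Original: plotlines with exactly the two characters
adjusted = Counter()   # Frequency Adjusted: any plotline containing both characters
for episode in episodes:
    for plot in episode:
        for duo in combinations(sorted(plot), 2):
            adjusted[duo] += 1
            if len(plot) == 2:
                original[duo] += 1

print(original[("Rachel", "Ross")])    # → 1 (one exclusively 2-person plot)
print(adjusted[("Chandler", "Joey")])  # → 2 (counted in both group plots)
```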

Same 2-person dynamics, different visualization

In order to continue exploring the original question of interest, I try presenting the same dataset in a different way. While the previous graph makes evident which relationships are the most featured on the show, it does not clarify the relative importance of the other five characters to each of the six friends. That is what the following visualization (using the adjusted frequency measure and constructed using the TikZ package in LaTeX) is for:



This visualization features a figure for each of the six main characters (each character’s rectangular and oval-shaped labels use a particular color for ease of viewing). Below the character’s rectangular label are the other five characters in descending order of closeness (assuming that closeness is measured by number of shared plot dynamics using the adjusted count method). The dashed arrow between each set of names is the adjusted number of shared plot dynamics, ranging from a low of 12 to a high of 94.

Seen in both 2-person dynamic visualizations, the high counts for the Rachel/Ross and Chandler/Monica dynamics are very much on point with their emotional closeness throughout the show. Also on point are the quantitative strength of the Chandler/Joey relationship and the low counts of plotlines between Chandler/Rachel and Chandler/Phoebe, who can be found streaming on Netflix saying to Monica that “it’s just Chandler!”

However, there are elements of these visualizations that are surprising as well. For one, the Monica/Ross dynamic does not feature high numbers despite the fact that they are siblings. Upon further consideration, this could be because, since they are related, it would be awkward or unnatural to put them in many of the romance/dating-related plotlines together. Furthermore, it is unexpected that the Phoebe/Rachel and Monica/Phoebe dynamics have higher counts than the Rachel/Monica dynamic. Given the long tenure of Rachel and Monica’s roommate relationship, that outcome seemed unlikely. But, the reality of the situation is that since their apartment is the stomping ground of all six characters (while Joey and Chandler’s apartment is much more exclusive to their two-person relationship), being roommates in that apartment does not necessarily throw the two together an incredible amount.

Lastly, part of me had hoped that Joey’s closest relationship would be with Phoebe and vice versa–in order for the show to round out smoothly as one focused, at its heart, on three character pairings. However, the fact that this does not happen actually leaves me with a more refreshing sense of the show as not all friendships can be perfectly symmetric. The complexity of the LaTeX figures succeeds in illustrating that the writers knew Friends gained much more from mixing up the characters than from matching, aligning, and sticking them together.

«Visualization update»

Thanks to feedback from the people on /r/DataIsBeautiful I decided to try visualizing one single network that includes all the characters rather than illustrating six separate networks:

Most recent visualization of the ‘Friends’ network using the ‘network’ package in R.

Edges are weighted by number of shared plotlines (using the adjusted frequency measure). Furthermore, in order to better highlight the differences in densities of certain edges, I color the edges in order to represent different ranges of shared plotline numbers. 

Also, check out this network visualization that David Schoch created using my data and the visualization tool visone.

Potential future work
  • What are the most common combinations of dynamics that make up an episode? For instance, how often is there an episode where the girls share a plotline while the guys share a separate plotline? In other words, how often is the episode defined as the set {{1,2,6},{3,4,5}}?
  • Do higher ratings (via scraped IMDb data) accrue to episodes that feature certain dynamics? Viewers loved the Rachel and Ross plotlines…does this mean that ratings were lower without the two of them in a plotline together?
  • Which character/characters is/are “the core” of the show? What are different ways to potentially quantify this using the data already collected?
    • UPDATE [March 2015]: Using the principles of eigenvector centrality, my fellow redditor/Friends fan David Schoch determined that Chandler is the most central character! He additionally broke the results down by season, illustrating that Chandler is the core of the show for seasons 4, 5, and 6; Joey for seasons 2 and 9; Rachel for seasons 1, 3, 8, and 10; and Monica for seasons 7 and 10 (tied with Rachel). See his blog for more details on the math behind these calculations! (He also created network graphs, similar to mine above, for each of the ten seasons–this way you can see how the relationship strengths differed season to season.)
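For anyone curious about the math behind that update: eigenvector centrality can be sketched with plain power iteration. Below is a toy Python example (the weights are illustrative, not the show’s actual plotline counts, and David’s calculations were his own):

```python
def eigenvector_centrality(W, iters=200):
    """Power iteration on a symmetric, nonnegative weight matrix (list of rows);
    returns the leading eigenvector, whose entries rank nodes by centrality."""
    n = len(W)
    x = [1.0] * n
    for _ in range(iters):
        y = [sum(W[i][j] * x[j] for j in range(n)) for i in range(n)]
        norm = sum(v * v for v in y) ** 0.5
        x = [v / norm for v in y]  # renormalize to keep values bounded
    return x

# Toy 3-node network; node 0 carries the heaviest edges.
W = [[0, 5, 4],
     [5, 0, 1],
     [4, 1, 0]]
scores = eigenvector_centrality(W)
print(scores.index(max(scores)))  # → 0, the most central node
```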

Note on collected data

Data and scripts required to replicate the network visualizations are available in my “Friends” Github repo. Different Friends viewers might disagree about some of the character grouping coding decisions within particular episodes as some episodes are not as clear-cut for the purpose of this article as are others.

© Alexandra Albright and The Little Dataset That Could, 2016. Unauthorized use and/or duplication of this material without express and written permission from this blog’s author and/or owner is strictly prohibited. Excerpts, accompanying visuals, and links may be used, provided that full and clear credit is given to Alex Albright and The Little Dataset That Could with appropriate and specific direction to the original content.