The One With All The Quantifiable Friendships, Part 2

Bar Charts, Line Charts, Nightingale Graphs, Stacked Area Charts, Time Series

Since finishing my first year of my PhD, I have been spending some quality time with my computer. Sure, the two of us had been together all throughout the academic year, but we weren’t doing much together besides pdf-viewing and type-setting. Around spring break, when I discovered you can in fact freeze your computer by having too many exams/section notes/textbooks simultaneously open, I promised my MacBook that over the summer we would try some new things together. (And that I would take out her trash more.) After that promise and a new sticker addition, she put away the rainbow wheel.

Cut to a few weeks ago. I had a blast from the past in the form of a Twitter notification. Someone had written a post about using R to analyze the TV show Friends, which was motivated by the same sort of interest that drove me to write something about the show using my own dataset back in 2015. In the post, the author, Giora Simchoni, used R to scrape the scripts for all ten seasons of the show and made all that work publicly available (wheeeeee) for all to peruse. In fact, Giora even used some of the data I shared back in 2015 to look into character centrality. (He makes a convincing case using a variety of data sources that Rachel is the most central friend of the six main characters.) In reading about his project, I could practically hear my laptop humming to remind me of its freshly updated R software and my recent tinkering with R notebooks. (Get ready for new levels of reproducibility!) So, off my Mac and I went, equipped with a new workflow, to explore new data about a familiar TV universe.

Who’s Doing The Talking?

Given line by line data on all ten seasons, I, like Giora, first wanted to look at line totals for all characters. In aggregating all non-“friends” characters together, we get the following snapshot:


First off, why yes, I am using the official Friends font. Second, I am impressed by how close the totals are for all characters, though hardly surprised that Phoebe has the fewest lines. Rachel wouldn’t be surprised either…

Rachel: Ugh, it was just a matter of time before someone had to leave the group. I just always assumed Phoebe would be the one to go.

Phoebe: Ehh!!

Rachel: Honey, come on! You live far away! You’re not related. You lift right out.

With these aggregates in hand, I then was curious: how would line allocations look across time? So, for each episode, I calculate the percentage of lines that each character speaks, and present the results with the following three visuals (again, all non-friends go into the “other” category):


Tell me that first graph doesn’t look like a callback to Rachel’s English Trifle. Anyway, regardless of a possible trifle-like appearance, all the visuals illustrate the dynamics of an ensemble cast; while there is noise in the time series, the show consistently provides each character with a role to play. However, the last visual does highlight some standouts in the collection: episodes that uncharacteristically feature or ignore certain characters. In other words, there are episodes in which one member of the cast receives an unusually high or low percentage of the lines in the episode. The three episodes that boast the highest percentages for a single member of the gang are: “The One with Christmas in Tulsa” (41.9% Chandler), “The One With Joey’s Interview” (40.3% Joey), and “The One Where Chandler Crosses a Line” (36.3% Chandler). Similarly, the three with the lowest percentages for one of the six are: “The One With The Ring” (1.5% Monica), “The One With The Cuffs” (1.6% Ross), and “The One With The Sonogram At The End” (3.3% Joey). The sagging red lines of the last visual identify episodes that have a low percentage of lines spoken by a character outside of the friend group. In effect, those dips in the graph point to extremely six-person-centric episodes, such as “The One On The Last Night” (0.4% non-friends dialogue–a single line in this case), “The One Where Chandler Gets Caught” (1.1% non-friends dialogue), and “The One With The Vows” (1.2% non-friends dialogue).
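For readers who want to see the per-episode percentage calculation spelled out, here is a minimal sketch in Python. It is not the R notebook code from this project; the function name `line_shares`, the `(episode, speaker)` input shape, and the toy rows are all assumptions made for illustration.

```python
from collections import Counter, defaultdict

# The six main characters; every other speaker is pooled into "Other".
FRIENDS = {"Monica", "Rachel", "Phoebe", "Ross", "Chandler", "Joey"}


def line_shares(lines):
    """Return {episode: {label: percent of that episode's lines}}.

    `lines` is an iterable of (episode, speaker) pairs, one pair per
    spoken line. This input format is a hypothetical simplification.
    """
    counts = defaultdict(Counter)
    for episode, speaker in lines:
        label = speaker if speaker in FRIENDS else "Other"
        counts[episode][label] += 1

    shares = {}
    for episode, counter in counts.items():
        total = sum(counter.values())
        shares[episode] = {name: 100 * n / total for name, n in counter.items()}
    return shares


# Made-up toy data, not real script counts.
toy_lines = [
    ("ep1", "Chandler"), ("ep1", "Chandler"), ("ep1", "Joey"), ("ep1", "Gunther"),
    ("ep2", "Ross"), ("ep2", "Rachel"),
]
shares = line_shares(toy_lines)
print(shares["ep1"])
```

Sorting the resulting percentages for one character across episodes is then enough to surface the unusually high or low episodes described above.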

The Men Vs. The Women

Given this title, here’s a quick necessary clip:

Now, how do the line allocations look when broken down by gender lines across the main six characters? Well, the split consistently bounces around 50-50 over the course of the 10 seasons. Again, as was the case across the six main characters, the balanced split of lines is pretty impressive.


Note that the second visual highlights that there are a few episodes that are irregularly man-heavy. The top three are: “The One Where Chandler Crosses A Line” (77.0% guys), “The One With Joey’s Interview” (75.1% guys), and “The One With Mac and C.H.E.E.S.E.” (70.2% guys). There are also exactly two episodes that feature a perfect 50-50 split for lines across gender: “The One Where Rachel Finds Out” and “The One With The Thanksgiving Flashbacks.”
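The gender split is the same kind of calculation with a different grouping. A rough Python sketch follows, assuming (as the text suggests) that only lines spoken by the main six count toward the denominator; the function name `gender_split` and the toy episode are hypothetical.

```python
MEN = {"Ross", "Chandler", "Joey"}
WOMEN = {"Monica", "Rachel", "Phoebe"}


def gender_split(speakers):
    """Return (percent men, percent women) for one episode.

    `speakers` lists the speaker of each line in the episode. Lines by
    anyone outside the main six are ignored, which is an assumption
    about how the split was defined.
    """
    men = sum(1 for s in speakers if s in MEN)
    women = sum(1 for s in speakers if s in WOMEN)
    total = men + women
    if total == 0:
        return None  # no lines from the main six
    return (100 * men / total, 100 * women / total)


# Made-up toy episode: three lines from the guys, one from Rachel,
# and one from a non-friend that is left out of the split.
split = gender_split(["Ross", "Joey", "Chandler", "Rachel", "Gunther"])
print(split)
```

An episode with a perfect 50-50 split is one where the two counts are exactly equal, so with integer line counts it can be checked without any rounding.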

Say My Name

How much do the main six characters address or mention one another? Giora addressed this question in his post, and I build off of his work by including nicknames in the calculations and using a different genre of visualization. With respect to the nicknames–“Mon”, “Rach”, “Pheebs”, and “Joe”–“Pheebs” is undoubtedly the stickiest of the group. Characters say “Pheebs” 370 times, which has a comfortable cushion over the second-place nickname “Mon” (used 73 times). Characters also significantly differ in their usage of each other’s nicknames. For example, while Joey calls Phoebe “Pheebs” 38.3% of the time, Monica calls her by this nickname only 4.6% of the time. (If you’re curious about more numbers on the nicknames, check out the project notebook.)
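To make a number like "Joey calls Phoebe 'Pheebs' 38.3% of the time" concrete, here is one way the share could be computed, sketched in Python. I am assuming the percentage means nickname mentions divided by all mentions (full name plus nickname) by that speaker; the helper names and toy dialogue are invented, and the project notebook may define things differently.

```python
import re

# Nicknames listed in the post, keyed by the character's full first name.
NICKNAMES = {"Monica": "Mon", "Rachel": "Rach", "Phoebe": "Pheebs", "Joey": "Joe"}


def count_word(word, text):
    """Count whole-word occurrences, so "Joe" does not match inside "Joey"."""
    return len(re.findall(r"\b" + re.escape(word) + r"\b", text))


def nickname_share(lines, speaker, target):
    """Percent of `speaker`'s mentions of `target` that use the nickname.

    `lines` is an iterable of (speaker, text) pairs (a hypothetical format).
    """
    nick = NICKNAMES[target]
    full_count = nick_count = 0
    for who, text in lines:
        if who == speaker:
            full_count += count_word(target, text)
            nick_count += count_word(nick, text)
    total = full_count + nick_count
    return 100 * nick_count / total if total else None


# Made-up dialogue for illustration only.
toy_dialogue = [
    ("Joey", "Hey Pheebs!"),
    ("Joey", "Phoebe, Phoebe, come on."),
    ("Joey", "Pheebs?"),
    ("Monica", "Phoebe, where are you going?"),
]
print(nickname_share(toy_dialogue, "Joey", "Phoebe"))
```

The word-boundary match matters here: without it, every "Joey" would also count as a "Joe" and every "Monica" as a "Mon".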

Now, after adding in the nicknames, who says whose name? The following graphic addresses that point of curiosity:


The answer is clear: Rachel says Ross’s name the most! (789 times! OK, we get it, Rachel, you’re in love.) We can also see that Joey is the most self-referential with 242 usages of his own name–perhaps not a shock considering his profession in the entertainment biz. Overall, the above visual provides some data-driven evidence of the closeness between certain characters that is clearly evident in watching the show. Namely, the Joey-Chandler, Monica-Chandler, Ross-Rachel relationships that were evident in my original aggregation of shared plot lines are still at the forefront!
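The who-says-whose-name counts boil down to a speaker-by-name tally. A minimal Python sketch of that tally, with nicknames folded into each character's count, is below; the `(speaker, text)` input format, the names `ALIASES` and `mention_matrix`, and the toy lines are assumptions rather than anything taken from the project's code.

```python
import re
from collections import Counter

# Each character's full name plus the nickname used in the post, if any.
ALIASES = {
    "Monica": ["Monica", "Mon"],
    "Rachel": ["Rachel", "Rach"],
    "Phoebe": ["Phoebe", "Pheebs"],
    "Joey": ["Joey", "Joe"],
    "Ross": ["Ross"],
    "Chandler": ["Chandler"],
}

# One whole-word pattern per character, matching any of that character's forms.
PATTERNS = {
    name: re.compile(r"\b(?:" + "|".join(forms) + r")\b")
    for name, forms in ALIASES.items()
}


def mention_matrix(lines):
    """Return a Counter keyed by (speaker, mentioned name).

    Only lines spoken by the main six are tallied.
    """
    matrix = Counter()
    for speaker, text in lines:
        if speaker not in PATTERNS:
            continue
        for name, pattern in PATTERNS.items():
            matrix[(speaker, name)] += len(pattern.findall(text))
    return matrix


# Made-up lines for illustration only.
toy_script = [
    ("Rachel", "Ross! Ross, wait."),
    ("Joey", "Joey doesn't share food!"),
    ("Joey", "Hey Pheebs, hey Mon."),
    ("Gunther", "Rachel?"),
]
matrix = mention_matrix(toy_script)
print(matrix[("Rachel", "Ross")])
```

The diagonal of this tally, entries like `("Joey", "Joey")`, is where the self-references show up.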


Comparing the above work to what I had originally put together in January 2015 is a real trip. My original graphics back in 2015 were made entirely in Excel and were as such completely unreproducible, as was the data collection process. The difference between the opaqueness of that process and the transparency of sharing notebook output is super exciting to me… and to my loyal MacBook. Yes, yes, I’ll give you another sticker soon.

Let’s see the code!

Here is the HTML-rendered R Notebook for this project. Here is the GitHub repo with the markdown file included.

*Screen fades to black* 
Executive Producer: Alex Albright

© Alexandra Albright and The Little Dataset That Could, 2017. Unauthorized use and/or duplication of this material without express and written permission from this blog’s author and/or owner is strictly prohibited. Excerpts, accompanying visuals, and links may be used, provided that full and clear credit is given to Alex Albright and The Little Dataset That Could with appropriate and specific direction to the original content.



This Post is Brought to You by the National Science Foundation

Nightingale Graphs, Stacked Area Charts, Stacked Bar Charts, Treemaps

I have officially finished applying for my PhD. While the application process included many of the same elements that I had previously encountered as a fresh-faced* 17-year-old (think standardized testing without the #2 pencils and lots more button clicking), I am no longer applying as a (relatively) blank slate–a future liberal arts student who will float and skip between disciplines until being neatly slotted into a major. Instead, we PhD applicants have already zeroed in on a particular area of study–in my case, economics. Consequently, each PhD discipline is unlikely to exhibit the same carefully crafted demographics boasted in the pie charts that plaster undergraduate brochures across the country to provide tangible evidence for optimistic, bolded statements about diversity. In formulating responses to a slew of university-specific prompts about diversity in “the sciences,” I grew curiouser and curiouser about two particular questions: What do demographic compositions look like across various PhD disciplines in the sciences? & Have demographic snapshots changed meaningfully over time?

As I continued working to imbue a sense of [academic] self into pdfs composed of tightly structured Times New Roman 12 point font, I repeatedly found myself at the NSF open data portal, seeking to answer these aforementioned questions. However, I would then remind myself that, despite my organic urge to load rows and columns into R Studio, I should be the responsible adult (who I know I can be) and finish my applications before running out to recess. Now that the last of the fateful buttons have been clicked (and a sizable portion of my disposable income has been devoured by application fees and the testing industrial complex), I’m outside and ready to talk science!**

NSF data and sizes of “the sciences”

In this post, I am focusing on the demographics of science PhD degrees awarded as they pertain to citizenship and race/ethnicity, but not gender. In an ideal world, I would be able to discuss the compositions of PhD fields as broken into race/ethnicity-gender combinations; however, the table that includes these types of combinations for US citizens and permanent residents (Table 7-7) only provides the numbers for the broader categories rather than at the desired discipline level. For instance, social science numbers are provided for 2002-2012 without specific numbers for economics, anthropology, etc. This approach, therefore, would not allow for an investigation into the main topic of interest, which is the demographic differences between the distinct disciplines–there is too much variety within the larger umbrella categories to discuss the fields’ compositions in this way. Therefore, I limit this discussion to demographics with respect to citizenship and race/ethnicity and, accordingly, use Table 7-4 “Doctoral degrees awarded, by citizenship, field, and race or ethnicity: 2002–12” from the NSF Report on Women, Minorities, and Persons with Disabilities in Science and Engineering*** as my data source.

Before getting into the different PhD science fields and their demographics, it’s worth noting the relative sizes of these disciplines. The following treemap depicts the relative sizes of the sciences as defined by NSF data on doctoral degrees awarded in 2012:


The size of each squarified rectangle represents the number of degrees awarded within a given field while the color denotes the field’s parent category, as defined by the NSF. (Note that some studies are, in fact, their own parent categories. This is the case for Biological Sciences, Psychology, Computer Sciences, and Agricultural Sciences.) In the upcoming discussion of demographics, we will first discuss raw numbers of degrees earned and the relevant demographic components but will then pivot towards a discussion of percentages, at which point remembering the differences in size will be particularly helpful in piecing together the information into one cohesive idea of the demographics of “the sciences.”****

A decade of demographic snapshots: PhD’s in the sciences

The NSF data specifies two levels of information about the doctoral degrees awarded. The first level identifies the number of degree recipients who are US citizens or permanent residents as well as the number who are temporary residents. Though “[t]emporary [r]esident includes all ethnic and racial groups,” the former category is further broken down into the following subgroups: American Indian or Alaska Native, Asian or Pacific Islander, Black, Hispanic, Other or unknown, and White. In our first exploration of the data, we specify the raw number of degrees awarded to individuals in the specific ethnic and racial categories for US citizens and permanent residents as well as the number awarded to temporary residents. In particular, we start the investigation with the following series of stacked area charts (using flexible y-axes given the vastly different sizes of the disciplines):


In this context and for all following visualizations, the red denotes temporary residents while all other colors (the shades of blue-green and black) are ethnic and racial subsets of the US citizens and permanent residents. By illustrating the raw numbers, this chart allows us to compare the growth of certain PhD’s as well as see the distinct demographic breakdowns. While overall the number of science PhD’s increased by 39% from 2002 to 2012, Astronomy, Computer Science, Atmospheric sciences, and Mathematics and statistics PhD’s clearly outpaced other PhD growth rates with increases of 143%, 125%, 84%, and 80%, respectively. Meanwhile, the number of Psychology PhD’s actually decreased from 2002 to 2012 by 8%. While this was the only science PhD to experience a decline over the relevant 10-year period, a number of other disciplines grew at modest rates. For instance, the numbers of Anthropology, Sociology, and Agricultural Sciences PhD’s increased by 15%, 16%, and 18% between 2002 and 2012, which pale in comparison to the vast increases seen in Astronomy, Computer Science, Atmospheric sciences, and Mathematics and statistics.
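The growth figures above are simple start-to-end percent changes. As a quick illustration in Python (the degree counts below are invented round numbers chosen to land on familiar percentages; they are not the NSF figures, and the field labels are placeholders):

```python
def percent_change(start, end):
    """Percent change from a start-year count to an end-year count."""
    return 100 * (end - start) / start


# Hypothetical (2002 count, 2012 count) pairs, NOT the actual NSF data.
toy_counts = {
    "Field A": (100, 243),
    "Field B": (200, 184),
    "Field C": (100, 115),
}

growth = {field: percent_change(a, b) for field, (a, b) in toy_counts.items()}

# Rank fields from fastest-growing to slowest (or shrinking).
ranked = sorted(growth, key=growth.get, reverse=True)
print(ranked, growth)
```

A negative value, like the one for the hypothetical Field B, corresponds to a decline of the kind seen for Psychology.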

While it is tempting to use this chart to delve into the demographics of the different fields of study, the use of raw numbers renders a comprehensive comparison of the relative sizes of groups tricky. For this reason, we shift over to visualizations using percentages to best get into the meat of the discussion–this also eliminates the need for different y-axes. In presenting the percentage demographic breakdowns, I supply three different visualizations: a series of stacked area graphs, a series of nightingale graphs (essentially, polar stacked bar charts), and a series of straightforward line graphs, which despite being the least exciting/novel are unambiguous in their interpretation:




One of my main interests in these graphs is the prominence of temporary residents in various disciplines. In fact, it turns out that Economics is actually quite exceptional in terms of its percentage of temporary residents, which lingers around 60% for the decade at hand and is at 58% for 2012. (In 2012, out of the remaining 42% that are US citizens or permanent residents, 70% are White, 11% are Asian or Pacific Islander, 3% are Black, 3% are Hispanic, 0% are American Indian or Alaska Native, and 13% are Other or unknown.) Economics stands with Computer science, Mathematics and statistics, and Physics as one of the four subjects in the sciences for which temporary residents made up a higher percentage of the PhD population than white US citizens or permanent residents consistently from 2002 to 2012. Furthermore, Economics is also the science PhD with the lowest percentage of white US citizens and permanent residents–that is, a mere 30%. In this sense, the field stands out as wildly different in these graphs from its social science friends (or, more accurately, frenemies). On another note, it is also not hard to immediately notice that Psychology, which is not a social science in the NSF’s categorization, is so white that its nightingale graph looks like an eye with an immensely overly dilated pupil (though anthropology is not far behind on the dilated pupil front).
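Note that the percentages in this paragraph use two different denominators: all degree recipients (the 58% and the 30%) versus only US citizens and permanent residents (the 70%, 11%, and so on). Here is a small Python sketch of both calculations; the counts are invented round numbers rather than NSF data, and the function name `demographic_shares` is hypothetical.

```python
TEMPORARY = "Temporary resident"


def demographic_shares(counts):
    """Return (shares of all recipients, shares among citizens/permanent residents).

    `counts` maps each group label to a number of degrees, with temporary
    residents stored under the TEMPORARY key.
    """
    total = sum(counts.values())
    citizens_total = total - counts[TEMPORARY]
    of_all = {group: 100 * n / total for group, n in counts.items()}
    of_citizens = {
        group: 100 * n / citizens_total
        for group, n in counts.items()
        if group != TEMPORARY
    }
    return of_all, of_citizens


# Hypothetical counts for one field in one year, NOT the actual NSF figures.
toy_field = {
    TEMPORARY: 60,
    "White": 28,
    "Asian or Pacific Islander": 4,
    "Other or unknown": 8,
}
of_all, of_citizens = demographic_shares(toy_field)
print(of_all["White"], of_citizens["White"])
```

In this toy field, the same 28 degrees are 28% of all recipients but 70% of the citizen and permanent resident group, which is the same relationship as the roughly 30% and 70% figures quoted for Economics.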

Also readily noticeable is the thickness of the blue hues in the case of Area and ethnic studies–an observation that renders it undeniable that this subject is the science PhD with the highest percentage of non-white US citizens and permanent residents. Following this discipline would be the other social sciences Anthropology, Sociology, and Political science and public administration, as well as the separately categorized Psychology. However, it is worth noting that the ambiguity of the temporary residents’ racial and ethnic attributes leaves much of our understanding of the prominence of various groups unclear.

Another focal point of this investigation pertains to the time dimension of these visuals. When homing in on the temporal aspect of these demographic snapshots, there is a discouraging pattern–a lack of much obvious change. This is especially highlighted by the nightingale graphs since the polar coordinates allow the 2012 percentages to loop back next to the 2002 percentages and, thus, facilitate a simple start-to-end comparison. In most cases, the two points in time look incredibly similar. Of course, this does not necessarily mean there has been no meaningful change. For instance, there have been declines in the percentage of white US citizens and permanent residents in the subjects Area and ethnic studies, Psychology, Sociology, Anthropology, and Political science and public administration, which have then been offset by increases in other groups of individuals. However, the picture is incredibly stagnant for most of the disciplines, especially the hard sciences and the unusually quantitative social science of economics. In pairing the stagnant nature of these demographic snapshots with consistent calls for greater faculty diversity in the wake of campus protests, it is clear that there is a potential bottleneck since such lagging diversity in PhD disciplines can directly contribute to a lack of diversity at the faculty level.


When the public discusses the demographics and diversity of “the sciences,” 1.5 dozen disciplines are being improperly blended together into generalized statements. To better understand the relevant dynamics, individuals should zero in on the discipline-level rather than refer to larger umbrella categories. As it turns out according to our investigation, the demographic breakdowns of these distinct subjects are as fundamentally different as their academic methodologies–methodologies which can be illustrated by the following joke that I can only assume is based on a true story:

As a psychological experiment, an engineer, a chemist, and a theoretical economist are each locked in separate rooms and told they won’t be released until they paint their entire room. They are each given a can of blue paint which holds about half the paint necessary to paint the room and then left alone. A few hours later the psychologist checks up on the three subjects.

(1) The engineer’s walls are completely bare. The engineer explains that he had worked out that there wasn’t enough paint to cover all the walls so he saw no point in starting.

(2) The chemist’s room is painted in faded, streaky blue. “There wasn’t enough paint, so I diluted it,” she explains.

(3) In the economist’s room, the floor and the ceiling are completely blue, and there’s a full can of paint still sitting on the floor. The experimenter is shocked and asks how the economist managed to paint everything. The economist explains, “Oh, I just painted the rational points.”

And with an unwavering appreciation for that bit, I hope to be one of the ~20-30 (who knows?) % of white US citizens/permanent residents in the economics PhD cohort of 2021.

PS-Happy 2016 everyone!


* I had yet to take a driving test at a DMV. I did this successfully at age 21. But, I will not drive your car.

** The NSF divides subjects up into S&E (science and engineering) and non-S&E categories. In this context, I am only discussing the subjects that fall under the umbrella of science. It would be simple to extend the approach and concept to the provided numbers for engineering.

*** This table explains that the exact source for this information is: National Science Foundation, National Center for Science and Engineering Statistics, special tabulations of U.S. Department of Education, National Center for Education Statistics, Integrated Postsecondary Education Data System, Completions Survey, 2002–12.

**** In particular, the tiny size of the group of History of Science PhD’s allows for much more variability year-to-year in terms of demographics. Only 19-34 degrees were given out on an annual basis from 2002-2012. In this case, size of the program is responsible for the wildly evident changes in demographic composition.


Data and R scripts necessary to replicate visualizations are now up on my GitHub! See the NSF_Demographics repo. Let me know if you have any questions or issues with the R script in particular.

Further directions for work
  • Create gif of treemap using years 2002-2012 to replace the static version for just 2012
    • Or use a slider via some D3 magic
  • Follow-up by comparing the gender compositions
  • Look into the development and change history of the US Office of Management and Budget for racial and ethnic categories
    • Just curious as to the timeline of changes and how categorization changes affect our available data

© Alexandra Albright and The Little Dataset That Could, 2016. Unauthorized use and/or duplication of this material without express and written permission from this blog’s author and/or owner is strictly prohibited. Excerpts, accompanying visuals, and links may be used, provided that full and clear credit is given to Alex Albright and The Little Dataset That Could with appropriate and specific direction to the original content.