Testing for Local Continuity in Racial Animus

Choropleths, Regressions

This past spring I was tasked with writing a final paper for my Comparative Historical Economic Development course. In brainstorming, I started a casual fling with one idea that quickly escalated and led to long spring break dates at the library (collecting data, making maps). In a meeting with my professor back at Harvard, I realized that I had been consumed in the honeymoon phase of an idea. Taking off my rose-colored glasses, it became clear I had to end the tryst so I could focus on more promising paths to a paper…

In the end, I wrote about a different idea. Meanwhile, the original one lived in solitude in an inactive Dropbox folder, a digital dead end. While I never developed it into a paper, why not shape it into a blog post? After all, there were interesting datasets and choropleths to share. So, here I am, throwing it a bone! If only for the sake of intellectual closure…

Persecution Perpetuated → Animus Alive

I was immediately struck by an idea after reading the QJE article “Persecution Perpetuated: The Medieval Origins of Anti-Semitic Violence in Nazi Germany.” Co-authors Voigtländer and Voth use an incredible dataset of about 400 towns “where Jewish communities are documented for both the medieval period and interwar Germany.” They find local continuity in anti-Semitism over 600 years. Local continuity in this context means attacks on Jews were six times more likely in the 1920s in towns and cities that blamed and then murdered Jews for the Black Death (1348-50). Let me repeat myself: local continuity over SIX HUNDRED YEARS. (More than half a millennium.) The paper provides convincing empirical evidence that group-based persecution (anti-Semitism in this case) can meaningfully persist at the local level. History matters, the data screams.

After reading “Persecution Perpetuated,” I was curious if I could merge recently available data sources to test whether racial animus in the US would display similar local continuity. (Possible alliterative titles to pay homage to “Persecution Perpetuated” include: “Animus Alive” and “Malice Maintained.”) Specifically, what sprang to mind as a possible dependent variable was Seth Stephens-Davidowitz‘s racial animus measure, which is based on Google search data. (I’ve always wanted to play around with Seth’s Google data, so I seized on a possible academic opportunity to do so.)

What does it mean to measure racial animus using Google data? Seth proxies for a geographic area’s racial animus by calculating the percent of its Google searches (2004-2007) that included the n-word or its plural. At its finest geographic level, the Google data are available for US Designated Market Areas. (There are 210 such DMAs in the US.) Specifically, the “racially charged search rate” for DMA j is: 100 × [(n-word Google searches / total Google searches) in DMA j] / [(n-word Google searches / total Google searches) in the highest-rate DMA]. The necessary underlying assumption is that racial animus makes one more likely to make an n-word Google search. (It does not have to be the case that “every individual using the term harbors racial animus, nor that every individual harboring racial animus will use this term on Google.”) (If you want to know more about this measure, read Seth’s NYTimes piece about his academic work as well.)
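To make the formula concrete, here is a minimal sketch of that normalization in R, assuming a hypothetical data frame dma with one row per DMA and columns nword_searches and total_searches (not Seth’s actual variable names; his released data already contain the final rate):

library(dplyr)

dma <- dma %>%
  mutate(prop = nword_searches / total_searches,
         search_rate = 100 * prop / max(prop))  # the highest-rate DMA scores 100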

Why is using Google search data attractive? Seth argues that survey data is unlikely to paint an accurate picture of racial animus because, well, people lie. (Thus the title of his book, Everybody Lies, which I highly recommend if you’re interested in data-driven social science.) Meanwhile, “the conditions under which people [Google] search – online, likely alone, and not participating in an official survey – limit concern of social censoring.” In short, Google search data allows us to access a snapshot of attitudes or beliefs that might otherwise be inaccessible with traditional surveys.

While Google search is a new phenomenon in the grand scheme of history, I was curious if historical data related to racial animus could be a powerful predictor of these modern racially charged search patterns. I.e., I sought out a relevant historical independent variable, which led me to Virginia Commonwealth University (VCU)’s project “Mapping the Second Ku Klux Klan, 1915-1940.” History Professor John Kneebone constructed a list of local KKK chapters (klaverns) using information from a large set of the group’s official publications. (More here.) The project makes visually explicit the widespread nature of the KKK; “Everywhere there was population, there was the Klan,” Kneebone explained. He then worked with digital librarians at VCU Libraries to map out the klaverns and make the raw location data publicly accessible.

With the VCU data in mind, I wondered: does the local historical KKK prevalence in 1915-1940 predict (through perpetuated racial animus) racially charged Google search rates in 2004-2007? There are some issues to note when using klavern location data:

  1. Ideally, I’d use data on the number of members by geographic areas. (Ignoring underlying population figures for a moment: Imagine there are 3 klaverns with 10 members each in DMA A. Meanwhile, there is 1 klavern with 100 members in DMA B. The Klan is more prevalent in terms of raw members in DMA B, but with only the location data in tow, DMA A would seem to have a more prevalent Klan presence.) Unfortunately, I am not aware of data on historical Klan membership by finer levels (DMA-level) of US geography.
  2. The location data are not necessarily complete. Such is the nature of collecting data from a different era. They are based on historical research and investigative work, but klavern locations are likely missing.

So, simply put, the question is: can I find empirical evidence of local continuity in American racial animus by merging historical and modern metrics? Will klavern prevalence per capita (1915-1940) meaningfully predict racially charged Google searches (2004-2007)?

Maps and regressions

First things first. Let’s map both klaverns/million and racially charged search rates by DMA. The distribution of klaverns/million turned out to be so skewed that mapping the raw rate made the geographic variation almost impossible to decipher visually. Using log(klaverns/million) instead makes the variation visually interpretable. (See below.)

map_both.png

While I interpreted the two metrics as proxies of racial animus at different times in history, they don’t obviously track each other through time and space visually. The possible story (that one meaningfully predicts the other) isn’t very convincing at this stage. But, it’s worth seeing the output from a simple regression. (I regress the modern racially charged search rate on log(klaverns/million) rather than the raw rate: the skewed distribution of klaverns/million made the raw relationship nonlinear and heteroskedastic, and the log transformation counters both. I’m open to criticism/discussion on such log transformations.)
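For concreteness, here is a minimal sketch of the regression behind the output below, assuming a hypothetical merged DMA-level data frame dma with columns search_rate and klaverns_per_million:

fit <- lm(search_rate ~ log(klaverns_per_million), data = dma)
summary(fit)
# In a level-log model, a 1% increase in klaverns/million is associated with
# roughly a coefficient/100 unit change in the search rate.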

predict.png

While the independent variable is statistically significant (star, star, star), the variation in log(klaverns/million) explains only 3-4% of the variation in racially charged Google search rates. In terms of magnitude, a 1% increase in klaverns/million is associated with a 0.02645 unit increase in the racially charged Google search rate. That is economically meaningless considering that the rates range from 25-155. This is also without controlling for any analogs of Voigtländer and Voth’s covariates, which would need to be included to make a convincing case for the robustness of any supposed relationship.

In the end, the VCU location data aggregated up to DMA totals are not a meaningful predictor of racial animus revealed by Google behavior. This could be for many reasons: klavern location totals may not accurately depict KKK prevalence (the data reveal locations rather than member totals, and are likely incomplete). Moreover, on a more philosophical level, does this project even test the persistence of racial animus?… or is it asking a question about how language squares with historical locations of institutions? I’ll leave that up to my academic counterparts in Political Theory. (CJ, you’re up.)

Why it wasn’t meant to be… a paper

Now, it is worth explaining why this idea was always predestined to stay in blog-land. For one, even an extremely powerful positive result wouldn’t “shift priors” (as economists say to one another) — the result would seem obvious and so, who cares? Voigtländer and Voth’s paper was publishable due to the long-term scope (600 years!) and fine geographic level (towns and cities in Germany) of their results. The concept that attitudes could be perpetuated over time wasn’t the selling point; it was the fact that attitudes could be perpetuated for so long and in such a locally continuous way. If the geographies had been coarser and their data hadn’t covered such a long period of time, the paper would not have landed in the QJE. Since I asked if 1915-1940 attitudes would predict 2004-2007 attitudes and I was limited to large designated market areas (due to the nature of the Google data), my findings were doomed to be unremarkable even if incredibly powerful (which… they were not).

Despite its inevitable end, I did learn a lot from this spring fling. On a qualitative level, I learned about the historical omnipresence of the KKK from an impressive digital humanities project. On a technical level, I learned how to map DMA areas in R (very tricky). And, on a philosophical level, I learned how to cleanly break up with a project idea. (Thanks for all the optimal stopping lessons, David.)

Data & code
  1. Seth’s racial animus data is here.
  2. VCU data on klaverns is here.
  3. My R notebook for this project is here.
  4. My Github repo for this project is here.

© Alexandra Albright and The Little Dataset That Could, 2018. Unauthorized use and/or duplication of this material without express and written permission from this blog’s author and/or owner is strictly prohibited. Excerpts, accompanying visuals, and links may be used, provided that full and clear credit is given to Alex Albright and The Little Dataset That Could with appropriate and specific direction to the original content.

 


Text Me Back: A Year of LDR Communication

Line Charts, Scatter Plots
Motivation

I spent some time in February figuring out how to download my iMessage data. The idea was that I could then use that data to make Jesse an R Notebook Valentine. And it worked!

valentine

A slide from my February R-ladies talk. (Full slides here.)

That Valentine was an investigation into word usage (via tf-idf) and emotional tinges in messages (via sentiment dictionaries). I treated all the messages as two aggregated blocks (one from Jesse, one from me) and was not concerned with the time dimension (when messages were sent). However, I’ve regularly hypothesized to Jesse that I know exactly what a graph of our iMessages would look like over time.

Why’s that? Well, Jesse and I are in an LDR, which means we are accustomed to going weeks without seeing each other. When you live with your SO, you come home to the same place. There isn’t a ton of need to send messages about your day, as you’ll see them soon and can share updates then. When you live across the country, you share text updates frequently since you might not catch up over phone/video for a few days. Pretty intuitive, right? So, my hypothesis was this: message frequency is ridiculously inversely correlated with being in the same city. (I.e., together, message number low; apart, message number high.)

Visually testing my hypothesis

To test this I wanted to plot the number of daily messages between us over the recent year and then mark the time periods when we were in fact in the same place. Turns out, yes, you can perfectly identify when we are in the same city by plotting our daily iMessage frequency!

LDR_year

I am very proud of this visual since it fits my hypothesis perfectly and is a succinct visual story about our virtual communication. Beyond the aesthetics, it also depicts an important part of my life that often is invisible to people. So, I am proud to own both the graph and the reality it depicts.

Relevant to recent concerns about data privacy, Brianna McHorse pointed out that this visual is also an example of how effectively and easily inferences can be made using personal data. I think that’s an incredibly powerful point. Jesse and I obviously didn’t intend to map out our visit schedule with our messaging. But, the picture of our time together/apart is loud and clear nonetheless.

On a final, humbled note, I am very thankful to live in a time and place where I have the technology to keep so significantly in touch with someone on the other side of the country despite our physical distance.

Still, I can’t wait to be back in the green.

Technical notes

If you’d like to download your own iMessage data, it’s very simple! (Though it did take me a while to figure out and piece together a full protocol back in February — thanks to Gulya and Bad Memes et al. for help.) Necessary steps are explained at the start of this R notebook, which also includes all code necessary to make a similar plot. Major props go to Aaron Parecki for building the command machinery used in the data extraction process.
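For a flavor of what that looks like, here is a minimal sketch (not the notebook’s exact protocol), assuming the standard macOS location of the iMessage database and a macOS version that stores message.date as nanoseconds since 2001-01-01 (older versions use seconds, so adjust the divisor):

library(DBI)
library(RSQLite)
library(dplyr)
library(ggplot2)

# iMessages live in a local SQLite database on macOS
con <- dbConnect(SQLite(), path.expand("~/Library/Messages/chat.db"))
msgs <- dbGetQuery(con, "SELECT date, is_from_me FROM message")
dbDisconnect(con)

# Convert Apple-epoch timestamps to dates and count messages per day
daily <- msgs %>%
  mutate(day = as.Date(as.POSIXct(date / 1e9, origin = "2001-01-01", tz = "UTC"))) %>%
  count(day)

ggplot(daily, aes(day, n)) +
  geom_line() +
  labs(x = "", y = "iMessages per day")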

Note to self: It could also be informative to plot daily totals of words (or maybe characters?) sent over iMessage since many messages are emojis, a few words, and reactions to other messages. Ah, to be a millennial.



The United Nations of Words

Bar Charts

Newsletter e-mails are often artifacts of faded interests or ancient online shopping endeavors. They can be nostalgia-inducing — virtual time capsules set in motion by your past self at t-2, and egged on by your past self at t-1. Remember that free comedy show, that desk lamp purchase (the one that looks Pixar-esque), that political campaign… oof, actually let’s scratch that last one. But, left unchecked, newsletters breed like rabbits and mercilessly crowd inboxes. If you wish to escape the onslaught of red notification bubbles, these e-mails are a sworn enemy whose defeat is an ever-elusive ambition.

However, there is a newsletter whose appearance in my inbox I perpetually welcome with giddy curiosity. That is, Jeremy Singer-Vine’s “Data is Plural.” Every week features a new batch of datasets for your consideration. One dataset in particular caught my eye in the 2017.07.19 edition:

UN General Debate speeches. Each September, the United Nations gathers for its annual General Assembly. Among the activities: the General Debate, a series of speeches delivered by the UN’s nearly 200 member states. The statements provide “an invaluable and, largely untapped, source of information on governments’ policy preferences across a wide range of issues over time,” write a trio of researchers who, earlier this year, published the UN General Debate Corpus — a dataset containing the transcripts of 7,701 speeches from 1970 to 2016.

The Corpus explains that these statements are “akin to the annual legislative state-of-the-union addresses in domestic politics.” As such, they provide a valuable resource for understanding international governments’ “perspective[s] on the major issues in world politics.” Now, I have been interested in playing around with text mining in R for a while. So a rich dataset of international speeches seems like a natural application of basic term frequency and sentiment analysis methods. As I am interested in comparing countries to one another, I need to select a subset of the hundreds to study. Given their special status, I focus exclusively on the five permanent UN Security Council members: the US, Britain, France, China, and Russia. (Of course, you could include many, many more countries of interest for this sort of investigation, but given the format of my desired visuals, five countries is a good cut-off.) Following in the typed footsteps of great code tutorials, I perform two types of analyses–a term frequency analysis and a sentiment analysis–to discuss the thousands of words that were pieced together to form these countries’ speeches.

Term Frequency Analysis

Term frequency analysis has been used in contexts ranging from studying Seinfeld to studying the field of 2016 GOP candidates. A popular metric for such analyses is tf-idf, which is a score of relative term importance. Applied to my context, the metric reveals words that are frequently used by one country but infrequently used by the other four. In more general terms, “[t]he tf-idf value increases proportionally to the number of times a word appears in the document, but is often offset by the frequency of the word in the corpus, which helps to adjust for the fact that some words appear more frequently in general.” (Thanks, Wikipedia.) In short, tf-idf picks out important words for our countries of interest. The 20 words with the highest tf-idf scores are illustrated below:

tfidftotal
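For the curious, here is a minimal tidytext sketch of how such scores come about, assuming a hypothetical data frame speeches with columns country and text (one row per speech):

library(dplyr)
library(tidytext)

# One row per (country, word) with counts
country_words <- speeches %>%
  unnest_tokens(word, text) %>%
  count(country, word, sort = TRUE)

# Treat each country as a "document" and score relative term importance
country_words %>%
  bind_tf_idf(word, country, n) %>%
  arrange(desc(tf_idf)) %>%
  slice_head(n = 20)  # the 20 highest-scoring words overall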

China is responsible for 13 of the 20 words. Perhaps this means that China boasts the most distinctive vocabulary of the Security Council. (Let me know if you disagree with that interpretation.) Now, if instead we want to see the top 5 words for each country–to learn something about their differing focuses–we obtain the results below:

tfidf_country

As an American, I am not at all surprised by the picture of my country as one of democratic, god-loving, dream-having entrepreneurs who have a lot to say about Saddam Hussein. Other insights to draw from this picture: China is troubled by Western superpowers influencing (“imperialist”) or dominating (“hegemonism”) others, Russia’s former status as the USSR meant lots of name checks for leader Leonid Ilyich Brezhnev, and Britain and France like to talk in the third person.

Sentiment Analysis

In the world of sentiment analysis, I am primarily curious about which countries give the most and least positive speeches. To figure this out, I calculate positivity scores for each country according to the three sentiment dictionaries, as summarized by the UC Business Analytics R Programming Guide:

The nrc lexicon categorizes words in a binary fashion (“yes”/“no”) into categories of positive, negative, anger, anticipation, disgust, fear, joy, sadness, surprise, and trust. The bing lexicon categorizes words in a binary fashion into positive and negative categories. The AFINN lexicon assigns words with a score that runs between -5 and 5, with negative scores indicating negative sentiment and positive scores indicating positive sentiment.

Therefore, for the nrc and bing lexicons, my generated positivity scores will reflect the number of positive words less the number of negative words. Meanwhile, the AFINN lexicon positivity score will reflect the sum total of all scores (as words have positive scores if they possess positive sentiment and negative scores if they possess negative sentiment). Comparing these three positivity scores across the five Security Council countries yields the following graphic:

country_pos
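For reference, a sketch of the score construction, reusing the country_words counts from the tf-idf sketch above (the nrc case works like bing after filtering its categories down to positive and negative); note that recent tidytext releases name the AFINN score column value, while older ones call it score:

library(dplyr)
library(tidyr)
library(tidytext)

# bing: positive word count less negative word count
bing_positivity <- country_words %>%
  inner_join(get_sentiments("bing"), by = "word") %>%
  count(country, sentiment, wt = n) %>%
  pivot_wider(names_from = sentiment, values_from = n, values_fill = 0) %>%
  mutate(positivity = positive - negative)

# AFINN: sum of word scores weighted by word counts
afinn_positivity <- country_words %>%
  inner_join(get_sentiments("afinn"), by = "word") %>%
  group_by(country) %>%
  summarise(positivity = sum(value * n))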

The three methods yield different outcomes: AFINN and Bing conclude that China is the most positive country, followed by the US; meanwhile, the NRC identifies the US as the most positive country, with China in fourth place. And, despite all that disagreement, at least everyone can agree that the UK is the least positive! (How else do we explain “Peep Show”?)

Out of curiosity, I also calculate the NRC lexicon word counts for anger, anticipation, disgust, fear, joy, sadness, surprise, and trust. I then divide the sentiment counts by total numbers of words attributed to each country so as to present the percentage of words with some emotional range rather than the absolute levels for that range. The results are displayed below in stacked and unstacked formats.

feelings1

feelings2
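A sketch of that emotion-share calculation, again reusing country_words:

library(dplyr)
library(tidytext)

# Keep only the eight emotions (drop the positive/negative categories)
nrc_emotions <- get_sentiments("nrc") %>%
  filter(!sentiment %in% c("positive", "negative"))

country_words %>%
  inner_join(nrc_emotions, by = "word") %>%  # a word can carry several emotions
  count(country, sentiment, wt = n, name = "emo_n") %>%
  left_join(count(country_words, country, wt = n, name = "total"), by = "country") %>%
  mutate(pct = 100 * emo_n / total)  # % of a country's words tagged with each emotion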

According to this analysis, the US is the most emotional country, with over 30% of words associated with an NRC sentiment. China comes in second, followed by the UK, France, and Russia, in that order. However, all five are very close in terms of emotional word percentages, so this ordering does not seem to be particularly striking or meaningful. Moreover, the specific range of emotions looks very similar country by country as well. Perhaps this is due to countries following some well-known framework of a General Debate speech, or perhaps political speeches in general follow some tacit emotional script displaying this mix of emotions…

I wonder how such speeches compare to a novel or a newspaper article in terms of these lexicon scores. For instance, I’d imagine that we’d observe more evidence of emotion in these speeches than in newspaper articles, as those are meant to be objective and clear (though this is less true of new forms of evolving media… i.e., those that aim to further polarize the public… or, those that were aided by one of the Security Council countries to influence an election in another of the Security Council countries… yikes), while political speeches might pick out words specifically to elicit emotion. It would be fascinating to investigate how emotional words are wielded in political speeches or new forms of journalistic media, and how that has evolved over time. (Quick hypothesis: fear is more present in the words that make up American media coverage and political discourse nowadays than it was a year ago…) But, I will leave that work (for now) to people with more in their linguistics toolkit than a novice knowledge of super fun R packages.

Code

As per my updated workflow, I now conduct projects exclusively using R notebooks! So, here is the R notebook responsible for the creation of the included visuals. And, here is the associated Github repo with everything required to replicate the analysis. Methods mimic those outlined by superhe’R’os Julia Silge and David Robinson in their “Text Mining with R” book.



Senate Votes Visualized

Grid Maps

It has been exactly one week since the Senate voted to start debate on Obamacare. There were three Obamacare repeal proposals that followed in the wake of the original vote. Each one failed, but in a different way. News outlets such as the NYTimes did a great job reporting how each Senator voted for all the proposals. I then used that data to geographically illustrate Senators’ votes for each Obamacare-related vote. See below for a timeline of this past week’s events and accompanying R-generated visuals.

Tuesday, July 25th, 2017

The Senate votes to begin debate.

deb_final

This passes 51-50 with Pence casting the tie-breaking vote. The visual shows the number of (R) and (D) Senators in each state as well as how those Senators voted. We can easily identify Collins and Murkowski, the two Republicans who voted NO, by the purple halves of their states (Maine and Alaska, respectively). While Democrats vote as a bloc in this case and in the impending three proposal votes, it is the Republicans who switch between NO and YES over the course of the week of Obamacare votes. Look for the switches between red and purple.

Later that day…

The Senate votes on the Better Care Reconciliation Act.

rr_final

It fails 43-57 at the mercy of Democrats, Collins, Murkowski, and a more conservative bloc of Republicans.

Wednesday, July 26th, 2017

The Senate votes on the Obamacare Repeal and Reconciliation Act.

pr_final

It fails 45-55 at the mercy of Democrats, Collins, Murkowski, and a more moderate bloc of Republicans.

Friday, July 28th, 2017

The Senate votes on the Health Care Freedom Act.

sk_final

It fails 49-51 thanks to Democrats, Collins, Murkowski, and McCain. To hear the gasp behind the slice of purple in AZ, watch the video below.

Code

This was a great exercise in using a few R packages for the first time. Namely, geofacet and magick. The former is used for creating visuals for different geographical regions, and is how the visualization is structured to look like the U.S. The latter allows you to add images onto plots, and is how there’s a little zipper face emoji over DC (as DC has no Senators).
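To give a flavor of geofacet, here is a minimal sketch (with a hypothetical data frame votes holding columns state, party, and vote; not my exact plotting code):

library(ggplot2)
library(geofacet)

# One panel per state, arranged to mimic the US map
ggplot(votes, aes(x = party, fill = vote)) +
  geom_bar() +
  facet_geo(~ state, grid = "us_state_grid2") +
  theme_void()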

In terms of replication, my R notebook for generating included visuals is here. The github repo is here.



A rising tide lifts all podcasts

Scatter Plots

A personal history of podcast listening

One afternoon of my junior year, I listened to a chunk of a Radiolab episode about “Sleep” as I myself heavily sank into unconsciousness. It was like guided meditation… supported, in part, by the Alfred P. Sloan Foundation. Jad and Robert’s forays on Radiolab quickly became my new bedtime stories. They helped me transition from days with my nose deep in books and, more accurately, my laptop to dreams that veered away from the geographic markers of one tiny college town in a valley of the Berkshires.

The Radiolab archives were a soundtrack to my last years of college and to my transition from “student” at a college to “staff member” at a university. A few months into my new place in the world, I found myself discussing Sarah Koenig’s Serial with my colleagues in neighboring cubicles. I also wasn’t a stranger to the virtual water cooler of /r/serialpodcast. I became so entrenched in the podcast’s following that I ended up being inspired to start blogging in order to document reddit opinion trends on the topic.

Faced with regular Caltrain rides from the crickets and “beam” store of Palo Alto to the ridiculous-elevation-changes of SF, I started listening to Gilmore Guys. You know, the show about two guys who talk about the Gilmore Girls. I did not think this would take (I mean, there were hundreds of episodes–who would listen to all that?!) but I was very wrong. The two hosts accompanied me throughout two full years of solo moments. Their banter bounced next to me during mornings biking with a smile caked across my face and palm trees to my left and right as well as days marked by fierce impostor syndrome. Their bits floated next to me in the aftermath of medical visits that frightened me and suburban grocery shopping endeavors (which also sometimes frightened me). Their words, light and harmless, sat with me during evenings of drinking beer on that third-of-a-leather-couch I bought on craigslist and silent moments of self-reflection.

That might sound like pretty heavy lifting for a podcast. But, (silly as it might sound) it was my security blanket throughout a few years of shifting priorities and support networks–tectonic plates grumbling under the surface of my loosely structured young adult life.

When it came time to move to Cambridge from Palo Alto, I bought a Leesa mattress thanks to Scott Aukerman’s 4am mattress store advert bit from Comedy Bang Bang. (Sorry, Casper.) Throughout my first doctoral academic year, I regularly listened to Two Dope Queens as I showered and made dinner after frisbee practices. Nowadays, like a good little liberal, I listen to the mix of political yammering, gossip, and calls to arms that makes up Pod Save America.

Podcasts seem to be an increasingly important dimension of our alone time. A mosaic of podcast suggestions is consistently part of entertainment recommendations across friends… which leads me to my question of interest: How are podcasts growing? Are there more created nowadays, or does it just feel like that since we discuss them more? 

Methodological motivation

In following the growth of the R-Ladies organization and the exciting work of associated women, I recently spotted a blog post by R-lady Lucy McGowan. In this post, Lucy looks at the growth of so-called ‘Drunk’ Podcasts. She finds a large growth in that “genre” (if you will) while making great usage of a beer emoji. Moreover, she expresses that:

While it is certainly true that the number of podcasts in general has absolutely increased over this time period, I would be surprised if the increase is as dramatic as the increase in the number of “drunk” podcasts.

I was super skeptical about this statement. I figured the increase in many podcast realms would be just as dramatic as, if not more dramatic than, that in the ‘drunk’ podcast universe. So, I took this skepticism as an opportunity to build on Lucy’s code and emoji usage and look into release trends in other podcasting categories. Think of this all as one big excuse to try using emojis in my ggplot creations while talking about podcasts. (Thank you to the author of the emoGG package, a hero who also created Beyoncé color palettes for R.)

Plotting podcasts

I look into podcasting trends in the arenas of ‘sports’, ‘politics’, ‘comedy’ and ‘science.’ I figured these were general umbrella terms that many pods should fall under. However, you can easily adapt the code to look into different genres/search terms if you’re curious about other domains. (See my R notebook for reproducible work on this subject.) I, like Lucy, then choose emojis to use in my eventual scatterplot. Expressing a concept as complex as politics with a single emoji was challenging, but a fun exercise in using my millennial skillset.  (The ‘fist’ emoji was the best I could come up with for ‘politics’ though I couldn’t figure out how to modify the skin tone. I’m open to other suggestions on this front. You can browse through options with unicode here.)

In the end, I combine the plots for all four podcasting categories into one aggregated piece of evidence showing that many podcast genres have seen dramatic increases in 2017. The growth in the number of releases is staggering in all four arenas. (Thus, the title ‘A rising tide lifts all podcasts.’) So, this growth doesn’t seem to be unique to the ‘drunk’ podcast. In fact, these more general/conventional categories see much more substantial increases in releases.

pods
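For reference, here is a sketch of a single category’s plot using emoGG (a GitHub package: dill/emoGG), assuming a hypothetical data frame releases with columns month and n for monthly release counts:

library(ggplot2)
library(emoGG)  # devtools::install_github("dill/emoGG")

ggplot(releases, aes(month, n)) +
  geom_emoji(emoji = "270a")  # U+270A raised fist, my stand-in for 'politics'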

While the above deals with podcast releases, I would be very curious to see trends in podcast listening visualized. For instance, one could use the American Time Use Survey to break down people’s leisure consumption by type during the day. (It seems that the ATUS added “listening to podcast” in 2015.) I’d love to see some animated graphics on entertainment consumption over the hours reminiscent of Nathan Yau’s previous amazing work (“A Day in the Life of Americans”) with ATUS data.

Putting down the headphones

Regardless of the exact nature of the growth in podcasts over the past years, there is no doubt the medium has come to inhabit a unique space. Podcasts feel more steeped in solitude than other forms of entertainment like television or movies, which often are consumed in group settings. Podcasts have helped me re-learn how to be alone (but not without stories, ideas, and my imagination) and enjoy it. And, I am an only-child, so believe me… I used to be quite good at that.

The Little Dataset–despite this focus on podcasts–is brought to you by WordPress and not Squarespace. 🙂

Code

Check out this R Notebook for the code needed to reproduce the graphic. You can also see my relevant github repository.



The One With All The Quantifiable Friendships, Part 2

Bar Charts, Line Charts, Nightingale Graphs, Stacked Area Charts, Time Series

Since finishing my first year of my PhD, I have been spending some quality time with my computer. Sure, the two of us had been together all throughout the academic year, but we weren’t doing much together besides pdf-viewing and type-setting. Around spring break, when I discovered you can in fact freeze your computer by having too many exams/section notes/textbooks simultaneously open, I promised my MacBook that over the summer we would try some new things together. (And that I would take out her trash more.) After that promise and a new sticker addition, she put away the rainbow wheel.

Cut to a few weeks ago. I had a blast from the past in the form of a Twitter notification. Someone had written a post about using R to analyze the TV show Friends, which was motivated by a similar interest that drove me to write something about the show using my own dataset back in 2015. In the post, the author, Giora Simchoni, used R to scrape the scripts for all ten seasons of the show and made all that work publicly available (wheeeeee) for all to peruse. In fact, Giora even used some of the data I shared back in 2015 to look into character centrality. (He makes a convincing case using a variety of data sources that Rachel is the most central friend of the six main characters.) In reading about his project, I could practically hear my laptop humming to remind me of its freshly updated R software and my recent tinkering with R notebooks. (Get ready for new levels of reproducibility!) So, off my Mac and I went, equipped with a new workflow, to explore new data about a familiar TV universe.

Who’s Doing The Talking?

Given line-by-line data on all ten seasons, I, like Giora, first wanted to look at line totals for all characters. In aggregating all non-“friends” characters together, we get the following snapshot:

total

First off, why yes, I am using the official Friends font. Second, I am impressed by how close the totals are for all characters, though hardly surprised that Phoebe has the fewest lines. Rachel wouldn’t be surprised either…

Rachel: Ugh, it was just a matter of time before someone had to leave the group. I just always assumed Phoebe would be the one to go.

Phoebe: Ehh!!

Rachel: Honey, come on! You live far away! You’re not related. You lift right out.

With these aggregates in hand, I then was curious: how would line allocations look across time? So, for each episode, I calculate the percentage of lines that each character speaks, and present the results with the following three visuals (again, all non-friends go into the “other” category):

lines1
lines2
lines3

Tell me that first graph doesn’t look like a callback to Rachel’s English Trifle. Anyway, regardless of a possible trifle-like appearance, all the visuals illustrate dynamics of an ensemble cast; while there is noise in the time series, the show consistently provides each character with a role to play. However, the last visual does highlight some standouts in the collection of episodes that uncharacteristically highlight or ignore certain characters. In other words, there are episodes in which one member of the cast receives an unusually high or low percentage of the lines in the episode. The three episodes that boast the highest percentages for a single member of the gang are: “The One with Christmas in Tulsa” (41.9% Chandler), “The One With Joey’s Interview” (40.3% Joey), “The One Where Chandler Crosses a Line” (36.3% Chandler). Similarly, the three with the lowest percentages for one of the six are: “The One With The Ring” (1.5% Monica), “The One With The Cuffs” (1.6% Ross), and “The One With The Sonogram At The End” (3.3% Joey). The sagging red lines of the last visual identify episodes that have a low percentage of lines spoken by a character outside of the friend group. In effect, those dips in the graph point to extremely six-person-centric episodes, such as “The One On The Last Night” (0.4% non-friends dialogue–a single line in this case), “The One Where Chandler Gets Caught” (1.1% non-friends dialogue), and “The One With The Vows” (1.2% non-friends dialogue).
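For the record, here is a minimal sketch of that line-share calculation, assuming a hypothetical data frame lines with one row per spoken line and columns episode and speaker (the six friends plus “Other”):

library(dplyr)

line_shares <- lines %>%
  count(episode, speaker) %>%
  group_by(episode) %>%
  mutate(pct = 100 * n / sum(n)) %>%  # each speaker's share of the episode's lines
  ungroup()

line_shares %>%
  arrange(desc(pct)) %>%
  head(3)  # the most single-character-heavy episodes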

The Men Vs. The Women

Given this title, here’s a quick necessary clip:

Now, how do the line allocations look when broken down along gender lines across the main six characters? Well, the split consistently bounces around 50-50 over the course of the 10 seasons. Again, as was the case across the six main characters, the balanced split of lines is pretty impressive.

gender1
gender2

Note that the second visual highlights that there are a few episodes that are irregularly man-heavy. The top three are: “The One Where Chandler Crosses A Line” (77.0% guys), “The One With Joey’s Interview” (75.1% guys), and “The One With Mac and C.H.E.E.S.E.” (70.2% guys). There are also exactly two episodes that feature a perfect 50-50 split for lines across gender: “The One Where Rachel Finds Out” and “The One With The Thanksgiving Flashbacks.”

Say My Name

How much do the main six characters address or mention one another? Giora addressed this question in his post, and I build off of his work by including nicknames in the calculations, and using a different genre of visualization. With respect to the nicknames–“Mon”, “Rach”, “Pheebs”, and “Joe”–“Pheebs” is undoubtedly the stickiest of the group. Characters say “Pheebs” 370 times, a comfortable cushion over the second-place nickname “Mon” (used 73 times). Characters also significantly differ in their usage of each others’ nicknames. For example, while Joey calls Phoebe “Pheebs” 38.3% of the time, Monica calls her by this nickname only 4.6% of the time. (If you’re curious about more numbers on the nicknames, check out the project notebook.)
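A quick sketch of the nickname tally, assuming the hypothetical lines data frame also has a text column:

library(stringr)

nicknames <- c("Mon", "Rach", "Pheebs", "Joe")
sapply(nicknames, function(nm)
  sum(str_count(lines$text, paste0("\\b", nm, "\\b"))))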

Now, after adding in the nicknames, who says whose name? The following graphic addresses that point of curiosity:

mentions

The answer is clear: Rachel says Ross’s name the most! (789 times! OK, we get it, Rachel, you’re in love.) We can also see that Joey is the most self-referential with 242 usages of his own name–perhaps not a shock considering his profession in the entertainment biz. Overall, the above visual provides some data-driven evidence of the closeness between certain characters that is clearly evident in watching the show. Namely, the Joey-Chandler, Monica-Chandler, Ross-Rachel relationships that were evident in my original aggregation of shared plot lines are still at the forefront!

Meta-data

Comparing the above work to what I had originally put together in January 2015 is a real trip. My original graphics back in 2015 were made entirely in Excel and were as such completely unreproducible, as was the data collection process. The difference between the opaqueness of that process and the transparency of sharing notebook output is super exciting to me… and to my loyal MacBook. Yes, yes, I’ll give you another sticker soon.

Let’s see the code!

Here is the html rendered R Notebook for this project. Here is the Github repo with the markdown file included.

*Screen fades to black* 
Executive Producer: Alex Albright


 

Building Visualizations Using City Open Data: Philly School Comparisons

Maps, Violin Plots
Intro

There is a collection of notes that accompanies me throughout my day, slipped into the deep pockets of my backpack. The collection consists of small notebooks and post-its featuring sentence fragments written in inky Sharpie or scratched down frantically using some pen that was (of course) dying at the time. Ideas, hypotheses, some jokes. Mostly half baked and sometimes completely raw. Despite this surplus of scribbles, I often struggle when it comes to acting on the intentions behind the words that felt so quick and simple to jot down… In fact, I often feel myself acting within the confines of this all too perfect graphical representation of project development:

14063489_163173454089228_1445505577_n

via the wonderful young cartoonist Liana Finck

One topic of interest–comparisons of charter and district public schools–has been on my (self-imposed) plate for over a year now. The topic was inspired by a documentary webseries that a friend actually just recently completed. [Plugs: Sivahn Barsade will be screening her documentary webseries Charter Wars this weekend in Philadelphia! Check it out if you’re around.] Given that she is currently wrapping up this long-term project, I am doing the same for my related mini-project. In other words, some post-its are officially being upgraded to objects on the internet.

To quote the filmmakers, “Charter Wars is an interactive documentary that examines the ideologies and motivations driving the charter school debate in Philadelphia.” Ah, yes, charter schools… a handful of slides glided by me on the topic in my morning Labor Economics class just this past Wednesday. Check out the intertwined and state-of-the-art Dobbie-Fryer (2013) and Fryer (2014) if you’re interested in charter school best practices and their implementation in other school environments.[1] However, despite the mention of these papers, I am not going to use this space in order to critique or praise rigorous academic research on the subject. Instead, I will use this space as a playground for the creation of city open data visualizations. Since Sivahn focuses her Charter Wars project on Philadelphia, I decided to do the same, which turned out to be a great idea since OpenDataPhilly is a joy to navigate, especially in comparison to other city data portals. After collecting data of interest from their site (details on that process available here), I used ggplot2 in R (praise Hadley!) to create two visualizations comparing district and charter schools in the city.

Think of this post as a quasi-tutorial inspired by Charter Wars; I’ll present a completed visual and then share the heart of the code in the text with some brief explanation as to the core elements therein. (I will also include links to code on my Github repo, which presents the full R scripts and explains how to get the exact data from OpenDataPhilly that you would need to replicate visuals.)

Visualization #1: Mapping out the city and schools

First things first, I wanted to map the location of public schools in the city of Philadelphia. Open data provides workable latitude and longitudes for all such schools, so this objective is entirely realizable. The tricky part in mapping the schools is that I also had to work with shape files that yield the city zip code edges and consequently build the overarching map on which points (representing the schools) can be plotted. I color schools based on four categories: Charter (Neighborhood), Charter (Citywide), District (Neighborhood), and District (Citywide);[2] and then break the plots up so that we can compare across the school levels: Elementary School, Middle School, High School, K-8 School (rather than plotting hundreds of points all on one big map). Here is my eventual result generated using R:

mappingschools

The reality is that most of the labor in creating these visuals is in figuring out both how to make functions work and how to get your data in the desired workable form. Once you’ve understood how the functions behave and you’ve reshaped your data structures, you can focus on your ggplot command, which is the cool piece of your script that you want to show off at the end of the day:

ggplot() +
  geom_map(data = spr1, aes(map_id = Zip.Code),
           map = np_dist, fill = "gray40", color = "gray60") +
  expand_limits(x = np_dist$long, y = np_dist$lat) +
  my_theme() +
  geom_point(data = datadistn, aes(x = X, y = Y, col = "District (Neighborhood)"), size = 1.5, alpha = 1) +
  geom_point(data = datachartn, aes(x = X, y = Y, col = "Charter (Neighborhood)"), size = 1.5, alpha = 1) +
  geom_point(data = datadistc, aes(x = X, y = Y, col = "District (Citywide)"), size = 1.5, alpha = 1) +
  geom_point(data = datachartc, aes(x = X, y = Y, col = "Charter (Citywide)"), size = 1.5, alpha = 1) +
  facet_wrap(~Rpt.Type.Long, ncol = 2) +
  ggtitle(expression(atop(bold("Mapping Philly Schools"), atop(italic("Data via OpenDataPhilly; Visual via Alex Albright (thelittledataset.com)"), "")))) +
  scale_colour_manual(values = c("Charter (Citywide)" = "#b10026", "District (Citywide)" = "#807dba",
                                 "Charter (Neighborhood)" = "red", "District (Neighborhood)" = "blue"),
                      guide_legend(title = "Type of School")) +
  labs(y = "", x = "")

This command creates the map I had previously presented. The basic process with all these sorts of ggplot commands is that you want to start your plot with ggplot() and then add layers with additional commands (after each +). The above code uses a number of functions and geometric objects that I identify and describe below:

  • ggplot()
    • Start the plot
  • geom_map()
    • Geometric object that maps out Philadelphia with the zip code lines
  • my_theme()
    • My customized function that defines style of my visuals (defines plot background, font styles, spacing, etc.)
  • geom_point()
    • Geometric object that adds the points onto the base layer of the map (I use it four times since I want to do this for each of the four school types using different colors)
  • facet_wrap()
    • Function that says we want four different maps in order to show one for each of the four school levels (Middle School, Elementary School, High School, K-8 School)
  • ggtitle()
    • Function that specifies the overarching plot title
  • scale_colour_manual()
    • Function that maps values of school types to specific aesthetic values (in our case, colors!)
  • labs()
    • Function to change axis labels and legend titles–I use it to get rid of default axes labels for the overarching graph

Definitely head to the full R script on Github to understand what the arguments (spr1, np_dist, etc.) are in the different pieces of this large aggregated command. [Recommended resources for those interested in using R for visualization purposes: a great cheat sheet on building up plots with ggplot, the incredible collection of FlowingData tutorials, & Prabhas Pokharel’s helpful post on this type of mapping in R]

Visualization #2: Violin Plots

My second creation illustrates the distribution of school scores across the four aforementioned school types: Charter (Neighborhood), Charter (Citywide), District (Neighborhood), and District (Citywide). (Note that the colors match those used for the points in the previous maps.) To explore this topic, I create violin plots, which can be thought of as sideways density plots, which can in turn be thought of as smooth histograms.[3] Alternatively, according to Nathan Yau, you can think of them as the “lovechild between a density plot and a box-and-whisker plot.” Similar to how in the previous graph I broke the school plotting up into four categories based on level of schooling, I now break the plotting up based on score type: overall, achievement, progress, and climate.  See below for the final product:

scores

The core command that yields this graph is as follows:

ggplot(data_new, aes(factor(Governance0), Score)) +
  geom_violin(trim = T, adjust = .2, aes(fill = Governance0)) +
  geom_boxplot(width = 0.1, aes(fill = Governance0), color = "orange") +  # constant color set outside aes() so it is applied literally
  my_theme() +
  scale_fill_manual(values = pal2, guide_legend(title = "School Type")) +
  ylim(0, 100) +
  labs(x = "", y = "") +
  facet_wrap(~Score_type, ncol = 2, scales = "free") +
  ggtitle(expression(atop(bold("Comparing Philly School Score Distributions"), atop(italic("Data via OpenDataPhilly (2014-2015); Visual via Alex Albright (thelittledataset.com)"), ""))))

Similar to before, I will briefly explain the functions and objects that we combine into this one long command:

  • ggplot()
    • Begin the plot with aesthetics for score and school type (Governance0)
  • geom_violin()
    • Geometric object that specifies that we are going to use a violin plot for the distributions (also decides on the bandwidth parameter)
  • geom_boxplot()
    • Geometric object that generates a basic boxplot over the violin plot (so we can get an alternative view of the underlying data points)
  • my_theme()
    • My customized function that defines the style of visuals
  • scale_fill_manual()
    • Function that fills in the color of the violins by school type
  • ylim()
    • Short-hand function to set y-axis to always show 0-100 values
  • labs()
    • Function to get rid of default axes labels
  • facet_wrap()
    • Function that separates plots out into one for each of the four score types: overall, achievement, progress, climate
  • ggtitle()
    • Specifies the overarching plot title

Again, definitely head to the full R script to understand the full context of this command and the structure of the underlying data. (Relevant resources for looking into violin plots in R can also be found here and here.) 

It took me many iterations of code to get to the current builds that you can see on Github, especially since I am not an expert with mapping–unlike my better half, Sarah Michael Levine. See the below comic for an accurate depiction of current-day-me (the stick figure with ponytail) looking at the code that July-2015-me originally wrote to produce some variant of these visuals (stick figure without ponytail):

code_quality

Via XKCD

Hopefully current-day-me was able to improve the style to the extent that it is now readable to the general public. (Do let me know if you see inefficiencies though and I’m happy to iterate further! Ping me with questions too if you so desire.) Moreover, in intensively editing code created by my past self over the past string of days, I also quickly recalled that the previous graphical representation of my project workflow needed to be updated to more accurately reflect reality:

manic2

adapted from Liana Finck with the help of snapchat artistic resources

On a more serious note, city open data is an incredible resource for individuals to practice using R (or other software). In rummaging around city variables and values, you can maintain a sense of connection to your community while floating around the confines of a simple two-dimensional command line.

Plugs section [important]
  1. Thanks to Sivahn for communicating with me about her Charter Wars documentary webseries project–good luck with the screening and all, Si!
  2. If you like city open data projects, or you’re a New Yorker, or both… check out Ben Wellington’s blog that focuses on NYC open data.
  3. If you’d like to replicate elements of this project, see my Github documentation.
Footnotes

[1] Yes, that’s right; I’m linking you to the full pdfs that I downloaded with my university access. Think of me as Robin Hood with the caveat that I dole out journal articles instead of $$$.

[2] Note from Si on four school categories: While most people, and researchers, divide public schools into charter-run and district-run, this binary is lacking vital information. For some district and charter schools, students have to apply and be selected to attend. It wouldn’t be fair to compare a charter school to a district magnet school, just like it wouldn’t be fair to compare a performing arts charter school to a neighborhood district school (this is not a knock against special admit schools, just their effect on data analysis). The additional categories don’t allow for a perfect apples-to-apples comparison, but at least you’ll know that you’re comparing an apple to an orange.

[3] The efficacy or legitimacy of this sort of visualization method is potentially contentious in the data visualization community, so I’m happy to hear critiques/suggestions–especially with respect to best practices for determining bandwidth parameters!

