Building Visualizations Using City Open Data: Philly School Comparisons


There is a collection of notes that accompanies me throughout my day, slipped into the deep pockets of my backpack. The collection consists of small notebooks and post-its featuring sentence fragments written in inky Sharpie or scratched down frantically using some pen that was (of course) dying at the time. Ideas, hypotheses, some jokes. Mostly half-baked and sometimes completely raw. Despite this surplus of scribbles, I often struggle when it comes to acting on the intention of the words that felt so quick and simple to jot down… In fact, I often feel myself acting within the confines of this all too perfect graphical representation of project development:


via the wonderful young cartoonist Liana Finck

One topic of interest–comparisons of charter and district public schools–has been on my (self-imposed) plate for over a year now. The topic was inspired by a documentary webseries that a friend actually just recently completed. [Plugs: Sivahn Barsade will be screening her documentary webseries Charter Wars this weekend in Philadelphia! Check it out if you’re around.] Given that she is currently wrapping up this long-term project, I am doing the same for my related mini-project. In other words, some post-its are officially being upgraded to objects on the internet.

To quote the filmmakers, “Charter Wars is an interactive documentary that examines the ideologies and motivations driving the charter school debate in Philadelphia.” Ah, yes, charter schools… a handful of slides glided by me on the topic in my morning Labor Economics class just this past Wednesday. Check out the intertwined and state-of-the-art Dobbie-Fryer (2013) and Fryer (2014) if you’re interested in charter school best practices and their implementation in other school environments.[1] However, despite the mention of these papers, I am not going to use this space in order to critique or praise rigorous academic research on the subject. Instead, I will use this space as a playground for the creation of city open data visualizations. Since Sivahn focuses her Charter Wars project on Philadelphia, I decided to do the same, which turned out to be a great idea since OpenDataPhilly is a joy to navigate, especially in comparison to other city data portals. After collecting data of interest from their site (details on that process available here), I used ggplot2 in R (praise Hadley!) to create two visualizations comparing district and charter schools in the city.

Think of this post as a quasi-tutorial inspired by Charter Wars; I’ll present a completed visual and then share the heart of the code in the text with some brief explanation as to the core elements therein. (I will also include links to code on my Github repo, which presents the full R scripts and explains how to get the exact data from OpenDataPhilly that you would need to replicate visuals.)

Visualization #1: Mapping out the city and schools

First things first, I wanted to map the location of public schools in the city of Philadelphia. Open data provides workable latitudes and longitudes for all such schools, so this objective is entirely realizable. The tricky part in mapping the schools is that I also had to work with shape files that yield the city zip code edges and consequently build the overarching map on which points (representing the schools) can be plotted. I color schools based on four categories: Charter (Neighborhood), Charter (Citywide), District (Neighborhood), and District (Citywide).[2] I then break the plots up so that we can compare across the school levels: Elementary School, Middle School, High School, K-8 School (rather than plotting hundreds of points all on one big map). Here is my eventual result generated using R:


The reality is that most of the labor in creating these visuals is in figuring out both how to make functions work and how to get your data in the desired workable form. Once you’ve understood how the functions behave and you’ve reshaped your data structures, you can focus on your ggplot command, which is the cool piece of your script that you want to show off at the end of the day:

# Base layer: zip code outlines; then one point layer per school type
ggplot() +
  geom_map(data = spr1, aes(map_id = Zip.Code), map = np_dist, fill = "gray40", color = "gray60") +
  expand_limits(x = np_dist$long, y = np_dist$lat) +
  my_theme() +  # custom theme function, defined in the full script
  geom_point(data = datadistn, aes(x = X, y = Y, col = "District (Neighborhood)"), size = 1.5, alpha = 1) +
  geom_point(data = datachartn, aes(x = X, y = Y, col = "Charter (Neighborhood)"), size = 1.5, alpha = 1) +
  geom_point(data = datadistc, aes(x = X, y = Y, col = "District (Citywide)"), size = 1.5, alpha = 1) +
  geom_point(data = datachartc, aes(x = X, y = Y, col = "Charter (Citywide)"), size = 1.5, alpha = 1) +
  # One panel per school level
  facet_wrap(~Rpt.Type.Long, ncol = 2) +
  ggtitle(expression(atop(bold("Mapping Philly Schools"), atop(italic("Data via OpenDataPhilly; Visual via Alex Albright ("),"")))) +
  # The legend guide is passed by name so it is not mistaken for another argument
  scale_colour_manual(values = c("Charter (Citywide)" = "#b10026", "District (Citywide)" = "#807dba", "Charter (Neighborhood)" = "red", "District (Neighborhood)" = "blue"), guide = guide_legend(title = "Type of School")) +
  labs(y = "", x = "")

This command creates the map I had previously presented. The basic process with all these sorts of ggplot commands is that you want to start your plot with ggplot() and then add layers with additional commands (after each +). The above code uses a number of functions and geometric objects that I identify and describe below:

  • ggplot()
    • Start the plot
  • geom_map()
    • Geometric object that maps out Philadelphia with the zip code lines
  • my_theme()
    • My customized function that defines style of my visuals (defines plot background, font styles, spacing, etc.)
  • geom_point()
    • Geometric object that adds the points onto the base layer of the map (I use it four times since I want to do this for each of the four school types using different colors)
  • facet_wrap()
    • Function that says we want four different maps in order to show one for each of the four school levels (Middle School, Elementary School, High School, K-8 School)
  • ggtitle()
    • Function that specifies the overarching plot title
  • scale_colour_manual()
    • Function that maps values of school types to specific aesthetic values (in our case, colors!)
  • labs()
    • Function to change axis labels and legend titles–I use it to get rid of default axes labels for the overarching graph

Definitely head to the full R script on Github to understand what the arguments (spr1, np_dist, etc.) are in the different pieces of this large aggregated command. [Recommended resources for those interested in using R for visualization purposes: a great cheat sheet on building up plots with ggplot, the incredible collection of FlowingData tutorials, and Prabhas Pokharel’s helpful post on this type of mapping in R.]
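To see the layering pattern described above in isolation, here is a minimal sketch on made-up data. Everything in it is a stand-in: the square "outline" replaces the zip code shapes, the random points replace the schools, and the names outline, schools, type, level, and school_colors are hypothetical. It also swaps in a different technique from the command above: it uses geom_polygon() for the base layer and a single geom_point() call with colour mapped to a column, rather than geom_map() and one geom_point() call per school type.

```r
library(ggplot2)

# Hypothetical stand-in for the city outline: a single square polygon.
outline <- data.frame(
  long = c(0, 1, 1, 0),
  lat  = c(0, 0, 1, 1),
  id   = "zip_a"
)

# Hypothetical schools: random coordinates, a school type, and a school level.
set.seed(1)
schools <- data.frame(
  X = runif(40),
  Y = runif(40),
  type = sample(c("Charter (Neighborhood)", "Charter (Citywide)",
                  "District (Neighborhood)", "District (Citywide)"),
                40, replace = TRUE),
  level = sample(c("Elementary School", "Middle School",
                   "High School", "K-8 School"), 40, replace = TRUE)
)

# Same colors as in the command above, keyed by school type.
school_colors <- c("Charter (Citywide)" = "#b10026",
                   "District (Citywide)" = "#807dba",
                   "Charter (Neighborhood)" = "red",
                   "District (Neighborhood)" = "blue")

p <- ggplot() +
  # Base layer: the outline has no level column, so it repeats in every panel.
  geom_polygon(data = outline, aes(x = long, y = lat, group = id),
               fill = "gray40", color = "gray60") +
  # Point layer: colour is mapped to the type column in one call.
  geom_point(data = schools, aes(x = X, y = Y, colour = type), size = 1.5) +
  # One panel per school level.
  facet_wrap(~level, ncol = 2) +
  scale_colour_manual(values = school_colors, name = "Type of School") +
  labs(x = "", y = "")

# Run print(p) to draw the plot.
```

Mapping colour to a column keeps the command shorter when all the points live in one data frame; the four separate geom_point() calls make sense when the school types are already split into separate data frames, as they are in the full script.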

Visualization #2: Violin Plots

My second creation illustrates the distribution of school scores across the four aforementioned school types: Charter (Neighborhood), Charter (Citywide), District (Neighborhood), and District (Citywide). (Note that the colors match those used for the points in the previous maps.) To explore this topic, I create violin plots, which can be thought of as sideways density plots, which can in turn be thought of as smooth histograms.[3] Alternatively, according to Nathan Yau, you can think of them as the “lovechild between a density plot and a box-and-whisker plot.” Similar to how in the previous graph I broke the school plotting up into four categories based on level of schooling, I now break the plotting up based on score type: overall, achievement, progress, and climate.  See below for the final product:


The core command that yields this graph is as follows:

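# Refer to columns by name inside aes() rather than with data_new$...
ggplot(data_new, aes(factor(Governance0), Score)) +
  geom_violin(trim = TRUE, adjust = .2, aes(fill = Governance0)) +
  # A constant outline color goes outside aes(); inside aes() it would be treated as data
  geom_boxplot(width = 0.1, aes(fill = Governance0), color = "orange") +
  my_theme() +  # custom theme function, defined in the full script
  scale_fill_manual(values = pal2, guide = guide_legend(title = "School Type")) +
  ylim(0, 100) +
  labs(x = "", y = "") +
  facet_wrap(~Score_type, ncol = 2, scales = "free") +
  ggtitle(expression(atop(bold("Comparing Philly School Score Distributions"), atop(italic("Data via OpenDataPhilly (2014-2015); Visual via Alex Albright ("),""))))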

Similar to before, I will briefly explain the functions and objects that we combine into this one long command:

  • ggplot()
    • Begin the plot with aesthetics for score and school type (Governance0)
  • geom_violin()
    • Geometric object that specifies that we are going to use a violin plot for the distributions (also decides on the bandwidth parameter)
  • geom_boxplot()
    • Geometric object that generates a basic boxplot over the violin plot (so we can get an alternative view of the underlying data points)
  • my_theme()
    • My customized function that defines the style of visuals
  • scale_fill_manual()
    • Function that fills in the color of the violins by school type
  • ylim()
    • Short-hand function to set y-axis to always show 0-100 values
  • labs()
    • Function to get rid of default axes labels
  • facet_wrap()
    • Function that separates plots out into one for each of the four score types: overall, achievement, progress, climate
  • ggtitle()
    • Specifies the overarching plot title

Again, definitely head to the full R script to understand the full context of this command and the structure of the underlying data. (Relevant resources for looking into violin plots in R can also be found here and here.) 
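To get a feel for what the adjust argument in geom_violin() is doing, here is a minimal sketch using base R's density() on made-up scores (the scores vector is hypothetical, not OpenDataPhilly data). In both functions, adjust acts as a multiplier on the default bandwidth, so adjust = 0.2 uses a fifth of the default bandwidth and gives a bumpier curve.

```r
set.seed(42)

# Hypothetical school scores clipped to a 0-100 scale (illustration only).
scores <- pmin(pmax(rnorm(200, mean = 50, sd = 15), 0), 100)

# Kernel density estimates with the default bandwidth and a narrower one.
d_default <- density(scores)
d_narrow  <- density(scores, adjust = 0.2)

# The reported bandwidth of the narrow estimate is 0.2 times the default.
print(c(default = d_default$bw, narrow = d_narrow$bw))

# Overlay the two curves: solid is the default, dashed is adjust = 0.2.
plot(d_default, main = "Effect of adjust on a density estimate")
lines(d_narrow, lty = 2)
```

A violin is this same kind of curve turned on its side and mirrored, so a small adjust value shows more local bumps in the scores while a large one smooths them away.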

It took me many iterations of code to get to the current builds that you can see on Github, especially since I am not an expert with mapping–unlike my better half, Sarah Michael Levine. See the below comic for an accurate depiction of current-day-me (the stick figure with ponytail) looking at the code that July-2015-me originally wrote to produce some variant of these visuals (stick figure without ponytail):



Hopefully current-day-me was able to improve the style to the extent that it is now readable to the general public. (Do let me know if you see inefficiencies though and I’m happy to iterate further! Ping me with questions too if you so desire.) Moreover, in intensively editing code created by my past self over the past string of days, I also quickly recalled that the previous graphical representation of my project workflow needed to be updated to more accurately reflect reality:


adapted from Liana Finck with the help of Snapchat artistic resources

On a more serious note, city open data is an incredible resource for individuals to practice using R (or other software). In rummaging around city variables and values, you can maintain a sense of connection to your community while floating around the confines of a simple two-dimensional command line.

Plugs section [important]
  1. Thanks to Sivahn for communicating with me about her Charter Wars documentary webseries project–good luck with the screening and all, Si!
  2. If you like city open data projects, or you’re a New Yorker, or both… check out Ben Wellington’s blog that focuses on NYC open data.
  3. If you’d like to replicate elements of this project, see my Github documentation.

[1] Yes, that’s right; I’m linking you to the full pdfs that I downloaded with my university access. Think of me as Robin Hood with the caveat that I dole out journal articles instead of $$$.

[2] Note from Si on four school categories: Wait, why are there four categories? While most people, and researchers, divide public schools into charter-run and district-run, this binary is lacking vital information. For some district and charter schools, students have to apply and be selected to attend. It wouldn’t be fair to compare a charter school to a district magnet school just like it wouldn’t be fair to compare a performing arts charter school to a neighborhood district school (this is not a knock against special admit schools, just their effect on data analysis). The additional categories don’t allow for a perfect apples-apples comparison, but at least you’ll know that you’re comparing an apple to an orange.

[3] The efficacy or legitimacy of this sort of visualization method is potentially contentious in the data visualization community, so I’m happy to hear critiques/suggestions–especially with respect to best practices for determining bandwidth parameters!

© Alexandra Albright and The Little Dataset That Could, 2016. Unauthorized use and/or duplication of this material without express and written permission from this blog’s author and/or owner is strictly prohibited. Excerpts, accompanying visuals, and links may be used, provided that full and clear credit is given to Alex Albright and The Little Dataset That Could with appropriate and specific direction to the original content.

Where My Girls At? (In The Sciences)


In the current educational landscape, there is a constant stream of calls to improve female representation in the sciences. However, the call to action is often framed within the aforementioned nebulous realm of “the sciences”—an umbrella term that ignores the distinct environments across the scientific disciplines. To better understand the true state of women in “the sciences,” we must investigate representation at the discipline level in the context of both undergraduate and doctoral education. As it turns out, National Science Foundation (NSF) open data provides the ability to do just that!

The NSF’s Report on Women, Minorities, and Persons with Disabilities in Science and Engineering includes raw numbers on both undergraduate and doctoral degrees earned by women and men across all science disciplines. With these figures in hand, it’s simple to generate measures of female representation within each field of study (that is, percentages of female degree earners). This NSF report spans the decade 2002–2012 and provides an immense amount of raw material to investigate.[1]
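As a minimal sketch of that calculation, assuming the raw numbers arrive as one count of women and one count of men per discipline: the counts and discipline names below are made up for illustration and are not NSF figures, and the column names (women, men, pct_female) are hypothetical.

```r
# Hypothetical degree counts for illustration only; these are not NSF figures.
degrees <- data.frame(
  discipline = c("Discipline A", "Discipline B"),
  women = c(300, 120),
  men   = c(200, 480)
)

# Female representation: women as a percentage of all degree earners.
degrees$pct_female <- 100 * degrees$women / (degrees$women + degrees$men)

print(degrees)
```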

The static picture: 2012

First, we will zero in on the most recent year of data, 2012, and explicitly compare female representation within and across disciplines.[2]


The NSF groups science disciplines with similar focus (for example, atmospheric and ocean sciences both focus on environmental science) into classified parent categories. In order to observe not only the variation within each parent category but also across the more granular disciplines themselves, the above graph plots percentage female representation by discipline, with each discipline colored with respect to its NSF classified parent category.

The variation within each parent category can be quite pronounced. In the earth, atmospheric, and ocean sciences, female undergraduate representation ranges from 36% (atmospheric sciences) to 47% (ocean sciences) of total graduates. Among PhD graduates, female representation ranges from 39% (atmospheric sciences) to 48% (ocean sciences). Meanwhile, female representation in the physical sciences has an undergraduate range from 19% (physics) to 47% (chemistry) and a PhD range from 20% (physics) to 39% (chemistry). However, social sciences has the largest spread of all with undergraduate female representation ranging from 30% (economics) to 71% (anthropology) and PhD representation ranging from 33% (economics) to 64% (anthropology).
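The spreads described above can be checked with a few lines of R. This sketch uses only the undergraduate percentages quoted in the paragraph, so each parent category is represented by just its lowest and highest discipline; the data frame and column names (undergrad, parent, pct_female) are hypothetical.

```r
# Undergraduate female representation (%) as quoted in the text above.
undergrad <- data.frame(
  parent = c("Earth, atmospheric, and ocean sciences",
             "Earth, atmospheric, and ocean sciences",
             "Physical sciences", "Physical sciences",
             "Social sciences", "Social sciences"),
  discipline = c("Atmospheric sciences", "Ocean sciences",
                 "Physics", "Chemistry",
                 "Economics", "Anthropology"),
  pct_female = c(36, 47, 19, 47, 30, 71)
)

# Spread within each parent category: highest minus lowest percentage.
spread <- tapply(undergrad$pct_female, undergrad$parent,
                 function(p) max(p) - min(p))

# Largest spread first.
print(sort(spread, decreasing = TRUE))
```

The social sciences come out on top with a spread of 41 percentage points, compared with 28 for the physical sciences and 11 for the earth, atmospheric, and ocean sciences.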

In line with conventional wisdom, computer sciences and physics are overwhelmingly male (undergraduate and PhD female representation lingers around 20% for both). Other disciplines in which female representation notably lags include: economics, mathematics and statistics, astronomy, and atmospheric sciences. Possible explanations behind the low representation in such disciplines have been debated at length.

Interactions between “innate abilities,” mathematical content, and female representation

Relatively recently, in January 2015, an article in Science “hypothesize[d] that, across the academic spectrum, women are underrepresented in fields whose practitioners believe that raw, innate talent is the main requirement for success, because women are stereotyped as not possessing such talent.” While this explanation was compelling to many, another group of researchers quickly responded by showing that once measures of mathematical content were added into the proposed models, the measures of innate beliefs (based on surveys of faculty members) shed all their statistical significance. Thus, the latter researchers provided evidence that female representation across disciplines is instead associated with the discipline’s mathematical content “and that faculty beliefs about innate ability were irrelevant.”

However, this conclusion does not imply that stereotypical beliefs are unimportant to female representation in scientific disciplines—in fact, the same researchers argue that beliefs of teachers and parents of younger children can play a large role in silently herding women out of math-heavy fields by “becom[ing] part of the self-fulfilling belief systems of the children themselves from a very early age.” Thus, the conclusion only objects to the alleged discovery of a robust causal relationship between one type of belief, university/college faculty beliefs about innate ability, and female representation.

Despite differences, both assessments demonstrate a correlation between measures of innate capabilities and female representation that is most likely driven by (1) women being less likely than men to study math-intensive disciplines and (2) those in math-intensive fields being more likely to describe their capacities as innate.[3]

The second point should hardly be surprising to anyone who has been exposed to mathematical genius tropes: think of all those handsome janitors who write up proofs on chalkboards and whose talents are rarely learned. The second point is also incredibly consistent with the assumptions that underlie “the cult of genius” described by Professor Jordan Ellenberg in How Not to Be Wrong: The Power of Mathematical Thinking (p.412):

The genius cult tells students it’s not worth doing mathematics unless you’re the best at mathematics, because those special few are the only ones whose contributions matter. We don’t treat any other subject that way! I’ve never heard a student say, “I like Hamlet, but I don’t really belong in AP English—that kid who sits in the front row knows all the plays, and he started reading Shakespeare when he was nine!”

In short, subjects that are highly mathematical are seen as more driven by innate abilities than are others. In fact, describing someone as a hard worker in mathematical fields is often seen as an implicit insult—an implication I very much understand as someone who has been regularly (usually affectionately) teased as a “try-hard” by many male peers.

The dynamic picture: 2002–2012

Math-intensive subjects are predominately male in the static picture for the year 2012, but how has the gender balance changed over recent years (in these and all science disciplines)? To answer this question, we turn to a dynamic view of female representation over a recent decade by looking at NSF data for the entirety of 2002–2012.


The above graph plots the percentages of female degree earners in each science discipline for both the undergraduate and doctoral levels for each year from 2002 to 2012. The trends are remarkably varied with overall changes in undergraduate female representation ranging from a decrease of 33.9% (computer sciences) to an increase of 24.4% (atmospheric sciences). Overall changes in doctoral representation ranged from a decline of 8.8% (linguistics) to a rise of 67.6% (astronomy). The following visual more concisely summarizes the overall percentage changes for the decade.


As this graph illustrates, there were many gains in female representation at the doctoral level between 2002 and 2012. All but three disciplines experienced increased female representation—seems promising, yes? However, substantial losses at the undergraduate level should yield some concern. Only six of the eighteen science disciplines experienced undergraduate gains in female representation over the decade.
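One point worth making explicit about the overall changes quoted above: the sketch below assumes they are relative changes in the female share between 2002 and 2012, not differences in percentage points. That reading is an assumption, and the helper pct_change and the shares passed to it are hypothetical.

```r
# Relative change between a starting share and an ending share, in percent.
pct_change <- function(start, end) {
  100 * (end - start) / start
}

# Hypothetical shares: a drop from 25% to 20% female is a 5 percentage point
# fall, but a 20% relative decline.
print(pct_change(25, 20))

# The reverse move, from 20% to 25%, is a 25% relative increase.
print(pct_change(20, 25))
```

Under this reading, a discipline that starts from a small female share can post a large percentage change from a modest shift in percentage points, which is worth keeping in mind when comparing disciplines.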

The illustrated increases in representation at the doctoral level are likely extensions of gains at the undergraduate level from the previous years—gains that are now being eroded given the presented undergraduate trends. The depicted losses at the undergraduate level could very well lead to similar losses at the doctoral level in the coming decade, which would hamper the widely shared goal to tenure more female professors.

The change for computer sciences is especially important since it provides a basis for the vast, well-documented media and academic focus on women in the field. (Planet Money brought the decline in percentage of female computer science majors to the attention of many in 2014.) The discipline experienced a loss in female representation at the undergraduate level that was more than twice the size of that in any other subject, including physics (-15.6%), earth sciences (-12.2%), and economics (-11.9%).

While the previous discussion of innate talent and stereotype threat focused on math-intensive fields, a category computer sciences fall into, I would argue that this recent decade has seen the effect of those forces on a growing realm of code-intensive fields. The use of computer programming and statistical software has become a standard qualification for many topics in physics, statistics, economics, biology, astronomy, and other fields. In fact, completing degrees in these disciplines now virtually requires coding in some way, shape, or form.

For instance, in my experience, one nontrivial hurdle that stands between students and more advanced classes in statistics or economics is the time necessary to understand how to use software such as R and Stata. Even seemingly simple tasks in these two programs require some basic level of comfort with structuring commands, an understanding that is not taught in these classes, but rather mentioned as a quick and seemingly obvious sidebar. Despite my extensive coursework in economics and mathematics, I am quick to admit that I only became comfortable with Stata via independent learning in a summer research context, and R via pursuing projects for this blog many months after college graduation.

The implications of coding’s expanding role in many strains of scientific research should not be underestimated. If women are not coding, they are not just missing from computer science—they will increasingly be missing from other disciplines which coding has seeped into.

The big picture: present–future

In other words, I would argue academia is currently faced with the issue of improving female representation in code-intensive fields.[4] As is true with math-intensive fields, the stereotypical beliefs of teachers and parents of younger children “become part of the self-fulfilling belief systems of the children themselves from a very early age” that discourage women from even attempting to enter code-intensive fields. These beliefs when combined with Ellenberg’s described “cult of genius” (a mechanism that surrounded mathematics and now also applies to the atmosphere in computer science) are especially dangerous.

Given the small percentage of women in these fields at the undergraduate level, there is limited potential growth in female representation along the academic pipeline—that is, at the doctoral and professorial levels. While coding has opened up new, incredible directions for research in many of the sciences, its evolving importance also can yield gender imbalances due to the same dynamics that underlie underrepresentation in math-intensive fields.


[1] Unfortunately, we cannot extend this year range back before 2002 since earlier numbers were solely presented for broader discipline categories, or parent science categories—economics and anthropology would be grouped under the broader term “social sciences,” while astronomy and chemistry would be included under the term “physical sciences.”

[2] The NSF differentiates between science and engineering as the latter is often described as an application of the former in academia. While engineering displays an enormous gender imbalance in favor of men, I limit my discussion here to disciplines that fall under the NSF’s science category.

[3] The latter viewpoint does have some scientific backing. The paper “Nonlinear Psychometric Thresholds for Physics and Mathematics” supports the notion that while greater work ethic can compensate for lesser ability in many subjects, those below some threshold of mathematical capacities are very unlikely to succeed in mathematics and physics coursework.

[4] On a positive note, atmospheric sciences, which often involves complex climate modeling techniques, has experienced large gains in female representation at the undergraduate level.

Speaking of coding…

Check out my relevant Github repository for all data and R scripts necessary for reproducing these visuals.

Thank you to:

Ally Seidel for all the edits over the past few months! & members of NYC squad for listening to my ideas and debating terminology with me.
