Social Networks and the Endurance of Economic Injury

Killing two birds with one paper

My last fall semester exam was for Social Economics class. Reading through packets of model summaries, I set out to determine which model — besides the obviously lovable Coate and Loury (1993) — I would most like to understand, remember, and explain. That is, I picked a paper to write about here… which I could also use in my intellectual battle with those pesky blue books.

It was in the middle of the lecture packet on “Intergenerational Mobility” that I found Bowles and Sethi (2006). This paper illustrates how magically ridding the world of discrimination (i.e., saying “assume no discrimination” as an economist) doesn’t necessarily lead to perfect convergence of group economic outcomes. In other words, even with zero discrimination, group differences in economic success can still persist across generations. Why is that? Because social networks matter for economic outcomes, and networks are often segregated by group identity. In the authors’ words,

“Group differences in economic success may persist across generations in the absence of discrimination against the less affluent group because racial segregation of friendship networks, mentoring relationships, neighborhoods, workplaces and schools places the less affluent group at a disadvantage in acquiring the things — contacts, information, cognitive skills, behavioral attributes — that contribute to economic success.”

Social networks are undeniably important in determining individuals’ economic outcomes. As such, building social network structures into models of human capital accumulation improves realism and allows for intriguing intergenerational theoretical results.

Bowles and Sethi (2006) appeals to me because the model formalizes dynamics touched on in many conversations about the long-term impacts of discriminatory practices. In this post I will lay out the model mechanics, explain proofs of the key results, and showcase a graph the authors use to visually illustrate their theoretical results.

Model mechanics

The paper motivates the model with a few words on Brown v. Board and the black-white wage gap. “Many hoped that the demise of legally enforced segregation and discrimination against African Americans during the 1950s and 1960s coupled with the apparent reduction in racial prejudice among whites would provide an environment in which significant social and economic racial disparities would not persist.” Despite initial convergence from the 1950’s to the 1970’s, the gap has persisted. There are many reasons why this could be the case, continued practices of discrimination included. Bowles and Sethi use the following model to illustrate how such gaps could endure even absent discrimination.

In said model, a person is born into one of two groups — black or white — and lives for two periods. In the first period of life, she makes a decision about whether or not to acquire human capital and become ‘skilled.’ This is a simple binary choice. (She either becomes educated/trained or she does nothing.) In the second period, she is paid a wage based on her previous choice. If she didn’t acquire human capital (and thus isn’t skilled), she is paid a wage of 0. If she did (and thus is skilled), she is paid a wage of h. In effect, the marginal benefit of human capital acquisition is h for all agents.

For the sake of simplicity, the model first assumes all people have the same ability. As such, an individual’s cost of human capital acquisition is solely dependent on the level of human capital in that person’s social network. Define network capital, q, as the fraction of agents in the network who chose to acquire human capital and are skilled. The key assumption in the model is that given the cost function c(q), c'(q)<0. In words, the higher the fraction of skilled people in a person’s network, the less costly it is for the person to become skilled. I.e., acquiring training is less costly when your network can connect you with opportunities and provide you with relevant information.

As per usual, agents choose to become skilled if marginal benefit exceeds marginal cost. Assume c(0)>h>c(1); that is, the cost of becoming skilled when no one in your network is skilled (q=0) is higher than the benefit of becoming skilled (h), but the cost when everyone is skilled (q=1) is lower than the benefit (h). In effect, there exists a unique threshold level q’ such that c(q’)=h. The agent’s decision rule is then: for any q>q’, the agent chooses to become skilled, and for any q<q’, the agent does not. (I’ll ignore indifference throughout.)
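To make the threshold rule concrete, here is a minimal sketch. The linear cost function and the value of h are my own illustrative assumptions (the paper only requires that cost falls as q rises and that c(0)>h>c(1)); the names `cost`, `H`, `Q_THRESHOLD`, and `becomes_skilled` are hypothetical.

```python
# Assumed for illustration only: a linear cost c(q) = 1 - q and a wage h = 0.4.
# Any decreasing cost function with c(0) > h > c(1) would serve equally well.
H = 0.4


def cost(q):
    """Hypothetical cost of becoming skilled; strictly decreasing in q."""
    return 1.0 - q


# For this particular cost function, c(q) = h is solved by q = 1 - h.
Q_THRESHOLD = 1.0 - H


def becomes_skilled(q, h=H):
    """Decision rule: invest if and only if the cost is below the benefit."""
    return cost(q) < h
```

With these assumed numbers, an agent whose network is 70% skilled invests and one whose network is 50% skilled does not.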

While the decision rule is clear, how are social networks (and thus q’s) formed? We assume the population shares for B (black) and W (white) groups are x and 1-x, respectively. Moreover, agents born into the model in period t+1 have a large number of ties to those born in t. With probability p in [0,1] an associate is from the same group (B or W), but with probability (1-p) an associate is randomly picked from the general population of agents (could be either group). As such, the parameter p is the degree of “in-group bias” or segregation. Assume the parameter is the same for both groups. Therefore, the probability that a black agent’s connection is also black is p+(1-p)x, that a white agent’s connection is also white is p+(1-p)(1-x), that a black agent’s connection is white is (1-p)(1-x), and that a white agent’s connection is black is (1-p)x.
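The four tie probabilities are easy to sanity-check in code. The sketch below simply transcribes them; the function name and the dictionary keys are hypothetical labels of mine.

```python
def tie_probabilities(p, x):
    """Chance that an agent's associate belongs to each group.

    p: in-group bias (segregation), in [0, 1]
    x: population share of group B, in [0, 1]
    """
    return {
        "B->B": p + (1 - p) * x,        # black agent, black associate
        "B->W": (1 - p) * (1 - x),      # black agent, white associate
        "W->W": p + (1 - p) * (1 - x),  # white agent, white associate
        "W->B": (1 - p) * x,            # white agent, black associate
    }


probs = tie_probabilities(p=0.5, x=0.25)
```

For each group the two probabilities sum to one, and setting p=0 collapses them to the plain population shares x and 1-x.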

The network capital in t+1 depends on the mechanical formation of the agent’s group and human capital accumulation decisions made by black and white agents born in time t (represented by sB(t) and sW(t), respectively):

qB(t+1)=[p+(1-p)x] * sB(t) + [(1-p)(1-x)] * sW(t) 

qW(t+1)=[(1-p)x] * sB(t) + [p+(1-p)(1-x)] * sW(t) 

The above equations show that (for both groups) the fraction of connections in an agent’s network (born in t+1) who are skilled is: chance of black associate * fraction of black agents (born in t) who are skilled + chance of white associate * fraction of white agents (born in t) who are skilled. The network capital of people in the two groups is the same only if: p=0 (there is no segregation) or sW(t)=sB(t) (there is no initial group inequality in human capital).
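One way to see why those are the only two cases (a quick check of my own, not a step quoted from the paper) is to subtract the second equation from the first. The mixed terms cancel and only the in-group weight p survives:

```latex
\begin{aligned}
q_B(t+1) - q_W(t+1)
  &= \big[p + (1-p)x - (1-p)x\big]\, s_B(t)
   + \big[(1-p)(1-x) - p - (1-p)(1-x)\big]\, s_W(t) \\
  &= p\,\big[s_B(t) - s_W(t)\big].
\end{aligned}
```

So the gap in network capital is zero exactly when p=0 or sB(t)=sW(t), and otherwise it scales one-for-one with segregation.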

Given the two above equations, we get a “law of motion” for human capital decisions: If qG(t+1)>q’, sG(t+1)=1; If qG(t+1)<q’, sG(t+1)=0 (with G in {B,W}). In words, if network capital is above the necessary threshold level, all agents of that group become skilled. If network capital is below the necessary threshold level, all agents of that group stay unskilled. Note that in this simplified model all agents make the same decisions within racial groups.
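The law of motion can be iterated directly. Below is a minimal simulation sketch: the two network capital equations are transcribed from the model, while the threshold value 0.6, the parameter choices, and all function names are my own illustrative assumptions.

```python
def network_capital(p, x, s_b, s_w):
    """Network capital of each group in t+1 given skill shares in t."""
    q_b = (p + (1 - p) * x) * s_b + (1 - p) * (1 - x) * s_w
    q_w = (1 - p) * x * s_b + (p + (1 - p) * (1 - x)) * s_w
    return q_b, q_w


def step(p, x, s_b, s_w, q_threshold):
    """One period of the law of motion (indifference is ignored)."""
    q_b, q_w = network_capital(p, x, s_b, s_w)
    return (1 if q_b > q_threshold else 0, 1 if q_w > q_threshold else 0)


def simulate(p, x, q_threshold, state=(0, 1), periods=10):
    """Return the path of (sB, sW) starting from the given initial state."""
    path = [state]
    for _ in range(periods):
        state = step(p, x, state[0], state[1], q_threshold)
        path.append(state)
    return path


# Assumed threshold q' = 0.6 in all three runs.
high_segregation = simulate(p=0.9, x=0.2, q_threshold=0.6)
low_seg_small_b = simulate(p=0.1, x=0.2, q_threshold=0.6)
low_seg_large_b = simulate(p=0.1, x=0.7, q_threshold=0.6)
```

Under these assumed numbers, the first run stays at (0,1) forever, the second jumps to (1,1), and the third jumps to (0,0), each within a single period.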

From parameter values to group outcomes

How do we get real-world implications from this model? We know black people have been historically economically disadvantaged in the United States. But, how do we integrate this fact into the model’s framework? Well, we can set the initial state of the world to the extreme (sB, sW)=(0,1), meaning all black agents start off as unskilled and all white agents start off as skilled (perhaps due to separate but unequal hospitals/schools/etc). Based on that initial state, I can then see what the future states of the world will be under the previously derived law of motion.

  1. Let’s assume complete integration, p=0. Given (sB, sW)=(0,1), then qW(t+1)=qB(t+1)=1-x, and since cost is only dependent on network, cost is then c(1-x) for both groups. Thus, all black and white agents will make the same decisions and there will be no asymmetric stable steady state.
  2. Now, consider complete segregation, p=1. Given (sB, sW)=(0,1) again, then qB=0 and qW=1. So, cost is c(0) for black agents and c(1) for white agents. Recall c(0)>h>c(1), meaning that there is necessarily an asymmetric stable steady state. (No black agents will become skilled and all white agents will become skilled.)

Given the points above, the authors explain,

“Since there exists an asymmetric stable steady state under complete segregation but none under complete integration, one may conjecture that there is a threshold level of segregation such that persistent group inequality is feasible if and only if the actual segregation level exceeds this threshold.”

Let’s prove this conjecture. (The following is my summary of the appendix proofs for propositions 1 and 2.) First, we find the unique x” (black population share) threshold s.t. c(1-x”)=h.

  1. Consider x'<x”, then c(1-x’)<h because cost decreases in its argument. Given (sB, sW)=(0,1), qB=(1-p)(1-x’) and qW=p+(1-p)(1-x’). So, c(qW) is decreasing in p and less than h at p=0 (since c(1-x’)<h). Moreover, c(qB) is increasing in p and c(qB)=c((1-0)(1-x’))<h when p=0 but c(qB)=c(0)>h when p=1, thus there is a unique p'(x’) such that c(qB)=h. For all p>p'(x’), we have c(qW)<h<c(qB), meaning (sB,sW)=(0,1) is a steady state. But, for all p<p'(x’), we have c(qW)<c(qB)<h, which makes it optimal for both groups of workers to become skilled and so there is a transition to (1,1). Since that then lowers both costs, the condition c(qW)<c(qB)<h continues to hold which makes (1,1) the stable steady state instead of (0,1).
  2. Consider x’>x”, so c(1-x’)>h. By the same logic, c(qB) is increasing in p and greater than h when p=0 since c(qB)=c(1-x’)>h. So c(qB)>h for all p. Moreover, c(qW) is decreasing in p and c(qW)=c((1-0)(1-x’))>h when p=0 but c(qW)=c(1)<h when p=1, thus there is a unique p'(x’) such that c(qW)=h. For all p>p'(x’), we have c(qW)<h<c(qB), meaning (sB,sW)=(0,1) is a steady state. But, for all p<p'(x’), we have h<c(qW)<c(qB), which makes it optimal for both groups of workers to not become skilled and so transition to (0,0). Since that then increases both costs, the condition h<c(qW)<c(qB) continues to hold which makes (0,0) the stable steady state instead of (0,1).

In sum, given the fraction x, there is a threshold level of segregation p'(x) above which (sB,sW)=(0,1) is a steady state (persistent group inequality), but below which the model shifts to a symmetric steady state. Whether the eventual steady state means welfare improving equalization, (sB,sW)=(1,1), or welfare reducing equalization, (sB,sW)=(0,0), depends on the fraction x. If the originally skilled group is large enough (x<x”), all agents will become skilled; otherwise, all agents will become unskilled.
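Because the threshold is pinned down by setting the relevant group's network capital equal to q’ (where c(q’)=h), it has a closed form that does not depend on the shape of the cost function. The sketch below is my own algebra under the initial state (0,1), not a formula quoted from the paper: for x below the cut-off, solve (1-p)(1-x)=q’; for x above it, solve p+(1-p)(1-x)=q’. The function name and return labels are hypothetical.

```python
def segregation_threshold(x, q_threshold):
    """Segregation level above which (sB, sW) = (0, 1) persists.

    x: population share of the initially unskilled group, 0 < x < 1
    q_threshold: network capital q' at which cost equals benefit, 0 < q' < 1
    Returns (threshold, symmetric state reached below the threshold).
    """
    x_cutoff = 1.0 - q_threshold  # the share x'' solving c(1 - x'') = h
    if x < x_cutoff:
        # The unskilled group is the marginal one: (1 - p)(1 - x) = q'.
        return 1.0 - q_threshold / (1.0 - x), "(1,1)"
    if x > x_cutoff:
        # The skilled group is the marginal one: p + (1 - p)(1 - x) = q'.
        return 1.0 - (1.0 - q_threshold) / x, "(0,0)"
    return 0.0, "knife-edge"  # indifference case, ignored in the text


small_share = segregation_threshold(x=0.2, q_threshold=0.6)
large_share = segregation_threshold(x=0.7, q_threshold=0.6)
```

With the assumed q’=0.6, a population that is 20% black needs segregation above 0.25 for inequality to persist and otherwise converges to everyone skilled; a population that is 70% black has a threshold near 0.43 and otherwise converges to no one skilled.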

In words, the model shows that group inequality persists if segregation is high enough. If segregation is below the threshold for maintaining inequality, group inequality disappears, but whether that is through a loss of everyone’s skills or a gain of everyone’s skills depends on the population shares that define the model world. The authors use the following graph to depict these conclusions:


Note that Bowles and Sethi (2006) use different variables than me for the parameters of interest. Also, the authors normalize the benefits of human capital accumulation to 0.

This figure sharply summarizes the model’s results thus far. It succinctly and clearly shows how two parameters (population share and segregation) determine the eventual state of the world. I usually use graphs to visualize tangible data, but they are just as useful in visualizing concepts or theoretical results, as seen here. (The graph I built depicting when to share an idea à la Koszegi is another example of visualizing how model parameters relate to outcomes.)

Suspiciously slick?

There are a few issues with the model dynamics that you might have noticed reading the above summary. Namely, everyone is the same within racial groups and convergence occurs in a single period. This feels less interesting than a slower convergence differing by other individual characteristics.

Much of the aforementioned simplicity comes from the assumption that ability is the same for all agents (i.e., ability is homogeneous). However, the model can be tweaked to make ability heterogeneous, which is exactly what the authors do later in the paper. As such, the cost of human capital investment then varies with ability as well as network capital. So, cost now depends on something that is specific to the individual (ability) as well as common to the group (racial identity). (Note: the model assumes no group differences in cost function or ability distributions.)

Moreover, the cost function c(a, q) is then decreasing in both ability and network capital level. In words, it is easier to become skilled when exposed to more skilled people (due to networks), and easier to become skilled when endowed with higher natural ability. For any given network capital level q, there is some threshold ability level a'(q) such that those above the cut-off become skilled and those below do not. Similar to the reasoning in the homogeneous case, an agent needs c(a, q)<h to become skilled. In effect, the relevant threshold is defined as the a'(q) s.t. c(a'(q), q)=h. (Any ability above that, the person becomes skilled. Any ability below that, the person does not.)
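A toy version of the ability threshold shows the mechanism. The functional form below, c(a, q) = 1 - 0.5a - 0.5q with h = 0.4, is purely my own assumption for illustration; the paper only needs cost to fall in both arguments. All names are hypothetical.

```python
H_WAGE = 0.4  # assumed benefit of being skilled


def cost_with_ability(a, q):
    """Hypothetical cost; decreasing in ability a and network capital q."""
    return 1.0 - 0.5 * a - 0.5 * q


def ability_threshold(q, h=H_WAGE):
    """Ability a'(q) at which cost equals benefit for this assumed cost."""
    # Solve 1 - 0.5 * a - 0.5 * q = h for a.
    return 2.0 * (1.0 - h) - q


threshold_low_q = ability_threshold(0.3)   # network with few skilled ties
threshold_high_q = ability_threshold(0.8)  # network with many skilled ties
```

Under these assumptions, an agent in the low-capital network needs ability above 0.9 to invest, while an agent in the high-capital network needs only 0.4.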

An interesting insight on this topic is that: “individuals belonging to groups exposed to higher levels of human capital will themselves accumulate human capital at lower ability thresholds relative to individuals in groups with initially lower levels of human capital. This difference will be greater when segregation levels are high.” So, in this more complex build of the model, black people have to boast higher ability levels than their white counterparts to make human capital accumulation cost beneficial… all due to the historical disadvantages built into their social networks. And that all came out of a bunch of threshold rules and variables!

Recap: behold the power of models

Bowles and Sethi built this model with an eye towards examples of enduring economic injury. They saw an empirical fact (the persistent black-white wage gap) and then put structure on their intuitive answers to naturally occurring questions: If there was zero discrimination, would the wage gap still endure? How would that work? Through what channels?

At the end of the day, their model hinges on a few items: (1) the inverse relationship between network capital and cost of skill acquisition, and (2) network formation (as influenced by segregation and population share). The first is an assumption based in the reality of human success and failure — it doesn’t always “take a village” but that definitely helps. The second is a distilled sketch of a complex and idiosyncratic process —  network formation depends on two parameters (segregation and population share) and leads to useful, comprehensible insights. The previously showcased graph is especially important for highlighting the potential difficulty of policy decisions — what part of the graph are we in?

Blue book’d

Bowles and Sethi (2006) illustrates how social networks, population demographics, and decision-making interact to determine the endurance of economic injury. The model also illustrates how writing a blog can sometimes help in your academic life — as it turns out, I managed to describe and solve out pieces of this model in my Social Economics blue book exam. Two birds, one paper.

© Alexandra Albright and The Little Dataset That Could, 2018. Unauthorized use and/or duplication of this material without express and written permission from this blog’s author and/or owner is strictly prohibited. Excerpts, accompanying visuals, and links may be used, provided that full and clear credit is given to Alex Albright and The Little Dataset That Could with appropriate and specific direction to the original content.

On ego and sharing ideas

A behavioral Nobel means a behavioral blog post

In honor of Richard Thaler’s recent Nobel prize win, I give you a post on a behavioral economics topic! Welcome to round 2 of A G2 Talks Models. Today’s topic: ego utility and the decision to speak up in class à la Koszegi 2006, an approach to belief-based utility adapted from Psychology and Economics lecture.

An admission of ego

I have a confession to make. I want you to think I’m smart. There, I said it. It is important to my self-image that you (yes, you on the other side of this screen!), my academic peers, and even the man (boy?) who tipsily mansplained the Monty Hall problem to me perceive me as intelligent. That is, intelligence as signaled by the occasional insightful comment, deep question, or quality idea that I get up the nerve to share.

Classrooms are environments in which lots of signaling of such smarts takes place. Professors ask questions, both rhetorical and not. They let us marinate in pregnant pauses and make a call for ideas. There’s a beat in which the tiny neuron-bureaucrat who is tasked with managing and organizing my brain activity files through some nascent concepts and responses. Is this any good?, she asks. Her supervisor isn’t sure either. Is this relevant? Yeah, but, is it too obvious? The supervisor prods her saying, time is of the essence. Internal hesitation over whether or not to share an idea in class still plagues me even after multiple decades of participation in the exercise. However, the difference is that now, in 18th grade, I can explicitly model that very idea-sharing decision.

A twist on classical utility: Enter ego…

To model this decision, I enter into a belief-based utility world. I define my utility function as follows: u = r – e + g√p, with r being the classroom response to the idea, e being the effort cost of sharing the idea, p being the probability that I think it’s a quality idea, and g being a parameter for “ego utility.” In classical utility world, this g√p term would not exist; I would simply weigh the benefit of sharing the idea r and the cost of sharing the idea e. Moreover, in a departure from classical economics assumptions, this form of belief-based utility displays information aversion, thus the square root on the p.

Now, let’s run through the outcomes based on class participation or class non-participation. If I take the jump and share my idea, I always expend some amount of effort e>0. Meanwhile, the benefit I derive depends on the ex post observable quality of the idea, as measured by the classroom response. If the idea was high quality, I gain r=1. If instead it was lacking or, shall we say, basic, I gain r=b where 1>b>0. If I keep my thoughts to myself, then e=0 and r=0.

In effect, if I share my idea, I receive u = p(1-e+g√1) + (1-p)(b-e+g√0). The first term on the right hand side is my perceived probability that the idea is of high quality multiplied by the associated payoff, while the second term is my perceived probability that the idea is basic multiplied by that associated payoff. Rearranging terms, u = p(1+g) + (1-p)b – e. Meanwhile, if I stay silent, my payoff is simply u = g√p.

Using this simple framework, I will share my idea with the class if and only if p(1+g) + (1-p)b – e > g√p. Simplified, I share my idea if and only if g(p-√p) + p + (1-p)b – e > 0.
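The rearrangement above is easy to check numerically. This sketch writes out both payoffs as defined in the text and confirms that their difference matches the simplified expression; the function names are hypothetical.

```python
import math


def utility_if_share(p, g, b, e):
    """Expected utility from speaking: quality is revealed after sharing."""
    high = 1 - e + g * math.sqrt(1)  # idea turns out to be high quality
    low = b - e + g * math.sqrt(0)   # idea turns out to be basic
    return p * high + (1 - p) * low


def utility_if_silent(p, g):
    """Utility from staying quiet: only the ego term remains."""
    return g * math.sqrt(p)


def net_gain_from_sharing(p, g, b, e):
    """Simplified left-hand side: g(p - sqrt(p)) + p + (1 - p)b - e."""
    return g * (p - math.sqrt(p)) + p + (1 - p) * b - e


def shares_idea(p, g, b, e):
    """Speak up if and only if the net gain is strictly positive."""
    return net_gain_from_sharing(p, g, b, e) > 0
```

Setting g=0 recovers the classical case, where the decision depends only on whether p + (1-p)b exceeds e.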

Given this inequality, we can see that if g goes to infinity (i.e., my ego utility is huge), and p is not 0 or 1, the inequality will never hold, as p-√p will always be negative (since p is a fraction between 0 and 1); this means I will never raise my hand to share my ideas because I am so paralyzed by my massive ego utility. Meanwhile, as p approaches 0 or 1, the g(p-√p) term goes to 0, leaving the decision up to the inequality p + (1-p)b > e. Thus, I will speak if the expected value of the payoff to my comment exceeds the effort cost. (Recall that this is exactly what I do in the classical utility case in which I have no ego utility.)

While both of the above conclusions seem predictable, the model also makes a notably intriguing prediction. You might expect that the greater my perceived probability that the idea is high quality, the more likely I am to share the idea. Well, this is not true. In other words, there is non-monotonicity in p. Say I have a moderate level of ego utility g and my p grows from a low to a higher level. This positive change in p could cause me to put my hand down even though now I am more confident in the quality of my idea. Weird! Ego utility allows there to be a negative correlation between my confidence in my idea’s quality and my willingness to share said idea.

Intuitively, as I become more confident in an idea, not only is there a higher expected benefit to sharing the idea but there is also a higher possible loss of utility due to the ego utility term. The way these two opposing effects spar with one another can lead my hand to go up, down, and up again as my confidence in an idea increases.

Let’s illustrate this surprising concept graphically. We can parametrize the model and make visually explicit how the decision to raise my hand changes with p. Let’s set g=3, b=0.5, e=0.01. Given these values, I will speak my idea if and only if 3(p-√p) + p + (1-p)0.5 – 0.01 > 0. I.e., iff 3.5p – 3√p + 0.49 > 0. As such, I can graph the utility function with the full range of possible p values from 0 to 1 and accordingly color areas depending on whether or not they correspond to sharing an idea. (I share an idea if the utility function yields a value greater than 0; otherwise, I do not.)
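The points where the decision flips can also be found without a graph. Substituting s=√p turns the condition into a quadratic in s, (g+1-b)s² - gs + (b-e) > 0, so the cut-offs are the squared roots. This is my own sketch (the function name is hypothetical) and it assumes both roots land between 0 and 1, which holds for the parameter values above.

```python
import math


def sharing_cutoffs(g, b, e):
    """Values of p at which the net gain from sharing crosses zero.

    Returns (lower, upper); sharing is optimal for p below lower or above
    upper. Returns None when the gain never dips below zero.
    """
    a_coef = g + 1 - b  # coefficient on s^2, where s = sqrt(p)
    c_coef = b - e      # constant term
    disc = g * g - 4 * a_coef * c_coef
    if disc < 0:
        return None
    s_low = (g - math.sqrt(disc)) / (2 * a_coef)
    s_high = (g + math.sqrt(disc)) / (2 * a_coef)
    return s_low ** 2, s_high ** 2


lower, upper = sharing_cutoffs(g=3, b=0.5, e=0.01)
```

For g=3, b=0.5, e=0.01 the cut-offs come out at roughly p=0.05 and p=0.41: my hand is up for very low p, down in between, and up again once I am fairly confident.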


The above illustrates that I am willing to share an idea when my probability that it is high quality is very low, but that I am no longer willing once the probability is a more moderately low value. This is evidence of the non-monotonicity in p in this model; I might lower my hand in class to protect my ego.

Anecdotal evidence, dynamics, and blog posting

I find ego utility fascinating and very believable when reflecting on my own experiences. For one, I have noticed that I often become more silent as conversation topics sway from topics in which I am a novice to topics that I am moderately more knowledgeable about. I feel acutely aware of the aforementioned tensions in the model; yes, I am more confident in my ideas in this realm, but I now have more to lose if I choose to share them. This is also a hesitation I feel internally when I talk with professors and friends about ideas. If the idea is undeveloped, there is really no harm in sharing it (p is low at that point); but, if I have been working on it and have a higher p, now, there is a chance I might realize that my idea was not up to snuff. In this sense, I can sometimes feel myself keeping ideas or projects to myself, as then they can’t be externally revealed to be low quality. I can sit on the sidelines and nurture my pet projects without a care in the world, stroking the ego-related term in my utility function.

But, in a more complex model, perhaps one that better represents my reality, idea quality is improved with idea sharing and collaboration. The model at hand is a one-shot game. I have an idea and I decide whether or not to share it. (The end.) But, in my flesh-and-blood/Stata-and-R universe, ideas do not disappear after that first instance of sharing; they develop dynamically. If I imagine refitting the model to mimic my reality, it is clear that silence for ego appeasement is a strategy that does not pay off long term…

I like to think that this is one reason why I write these posts: to share and accordingly develop ideas. In fact, when I started sharing R code online almost three years ago I was such a novice that I had a very low p regarding my data visualization capacities. In this way, ego utility was not able to hold me back from openly sharing my scripts. I was a strong advocate for transparency (still am) and at that time didn’t mind at all if my code looked like “a house built by a child using nothing but a hatchet and a picture of a house.” However, if I were to imagine starting blogging now, I could see holding off, as I perceive my probability of being a decent coder as much larger than I did three years ago.

In the end, I am very happy that I chose to start sharing my work when I had a very small p. In fact, if you squint really hard, you can probably see me lounging on the utility function curve, fumbling to use ggplot2, somewhere in that first blue chunk of the graph.


This post adapts model mechanics from Koszegi 2006. A Psychology and Economics lecture explicitly inspired and informed this piece. Lastly, here is the R notebook used to create the graphic in this post.

© Alexandra Albright and The Little Dataset That Could, 2017. Unauthorized use and/or duplication of this material without express and written permission from this blog’s author and/or owner is strictly prohibited. Excerpts, accompanying visuals, and links may be used, provided that full and clear credit is given to Alex Albright and The Little Dataset That Could with appropriate and specific direction to the original content.

The United Nations of Words

Newsletter e-mails are often artifacts of faded interests or ancient online shopping endeavors. They can be nostalgia-inducing: virtual time capsules set in motion by your past self at t-2, and egged on by your past self at t-1. Remember that free comedy show, that desk lamp purchase (the one that looks Pixar-esque), that political campaign… oof, actually let’s scratch that last one. But, without careful tending, newsletters breed like rabbits and mercilessly crowd inboxes. If you wish to escape the onslaught of red notification bubbles, these e-mails are a sworn enemy whose defeat is an ever-elusive ambition.

However, there is a newsletter whose appearance in my inbox I perpetually welcome with giddy curiosity. That is, Jeremy Singer-Vine’s “Data is Plural.” Every week features a new batch of datasets for your consideration. One dataset in particular caught my eye in the 2017.07.19 edition:

UN General Debate speeches. Each September, the United Nations gathers for its annual General Assembly. Among the activities: the General Debate, a series of speeches delivered by the UN’s nearly 200 member states. The statements provide “an invaluable and, largely untapped, source of information on governments’ policy preferences across a wide range of issues over time,” write a trio of researchers who, earlier this year, published the UN General Debate Corpus — a dataset containing the transcripts of 7,701 speeches from 1970 to 2016.

The Corpus explains that these statements are “akin to the annual legislative state-of-the-union addresses in domestic politics.” As such, they provide a valuable resource for understanding international governments’ “perspective[s] on the major issues in world politics.” Now, I have been interested in playing around with text mining in R for a while. So a rich dataset of international speeches seems like a natural application of basic term frequency and sentiment analysis methods. As I am interested in comparing countries to one another, I need to select a subset of the hundreds to study. Given their special status, I focus exclusively on the five permanent members of the UN Security Council: US, Britain, France, China, and Russia. (Of course, you could include many, many more countries of interest for this sort of investigation, but given the format of my desired visuals, five countries is a good cut-off.) Following in the typed footsteps of great code tutorials, I perform two types of analyses–a term frequency analysis and a sentiment analysis–to discuss the thousands of words that were pieced together to form these countries’ speeches.

Term Frequency Analysis

Term frequency analysis has been used in contexts ranging from studying Seinfeld to studying the field of 2016 GOP candidates. A popular metric for such analyses is tf-idf, which is a score of relative term importance. Applied to my context, the metric reveals words that are frequently used by one country but infrequently used by the other four. In more general terms, “[t]he tf-idf value increases proportionally to the number of times a word appears in the document, but is often offset by the frequency of the word in the corpus, which helps to adjust for the fact that some words appear more frequently in general.” (Thanks, Wikipedia.) In short, tf-idf picks out important words for our countries of interest. The 20 words with the highest tf-idf scores are illustrated below:
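For readers who want to see the arithmetic, here is a small sketch of one common tf-idf definition: term frequency as a word's share of a document, and inverse document frequency as the natural log of (number of documents / number of documents containing the word). That exact variant is my assumption, the function name is hypothetical, and the three tiny "documents" are invented for illustration; they are not drawn from the UN corpus.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Compute tf-idf scores.

    docs: dict mapping a document name to its list of tokens.
    Returns a dict mapping each name to {term: score}.
    """
    n_docs = len(docs)
    doc_freq = Counter()
    for tokens in docs.values():
        doc_freq.update(set(tokens))  # count each term once per document
    scores = {}
    for name, tokens in docs.items():
        counts = Counter(tokens)
        total = len(tokens)
        scores[name] = {
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        }
    return scores


# Invented toy documents, purely for illustration.
toy_docs = {
    "A": "peace peace security hegemonism".split(),
    "B": "peace security democracy democracy".split(),
    "C": "peace security development".split(),
}
toy_scores = tf_idf(toy_docs)
```

Words that every document uses (here "peace" and "security") score exactly zero, while a word used heavily by a single document (here "democracy" in B) scores highest, which is the behavior that makes the metric useful for comparing countries.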


China is responsible for 13 of the 20 words. Perhaps this means that China boasts the most unique vocabulary of the Security Council. (Let me know if you disagree with that interpretation.) Now, if instead we want to see the top 5 words for each country–to learn something about their differing focuses–we obtain the results below:


As an American, I am not at all surprised by the picture of my country as one of democratic, god-loving, dream-having entrepreneurs who have a lot to say about Saddam Hussein. Other insights to draw from this picture are: China is troubled by Western superpower countries influencing (“imperialist”) or dominating (“hegemonism”) others, Russia’s old status as the USSR involved lots of name checks to leader Leonid Ilyich Brezhnev, and Britain and France like to talk in the third-person.

Sentiment Analysis

In the world of sentiment analysis, I am primarily curious about which countries give the most and least positive speeches. To figure this out, I calculate positivity scores for each country according to the three sentiment dictionaries, as summarized by the UC Business Analytics R Programming Guide:

The nrc lexicon categorizes words in a binary fashion (“yes”/“no”) into categories of positive, negative, anger, anticipation, disgust, fear, joy, sadness, surprise, and trust. The bing lexicon categorizes words in a binary fashion into positive and negative categories. The AFINN lexicon assigns words with a score that runs between -5 and 5, with negative scores indicating negative sentiment and positive scores indicating positive sentiment.

Therefore, for the nrc and bing lexicons, my generated positivity scores will reflect the number of positive words less the number of negative words. Meanwhile, the AFINN lexicon positivity score will reflect the sum total of all scores (as words have positive scores if they possess positive sentiment and negative scores if they possess negative sentiment). Comparing these three positivity scores across the five Security Council countries yields the following graphic:
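The two scoring rules can be sketched in a few lines. Everything below is illustrative: the word lists and numeric scores are invented stand-ins, not actual entries from the nrc, bing, or AFINN lexicons, and the function names are hypothetical.

```python
def binary_positivity(tokens, positive_words, negative_words):
    """Positive-word count minus negative-word count (nrc/bing style)."""
    n_pos = sum(1 for t in tokens if t in positive_words)
    n_neg = sum(1 for t in tokens if t in negative_words)
    return n_pos - n_neg


def scored_positivity(tokens, word_scores):
    """Sum of per-word scores (AFINN style); unknown words contribute 0."""
    return sum(word_scores.get(t, 0) for t in tokens)


# Invented stand-in lexicons and speech, for illustration only.
toy_positive = {"peace", "hope"}
toy_negative = {"war", "threat"}
toy_word_scores = {"peace": 2, "hope": 2, "war": -3, "threat": -2}
toy_speech = "we hope for peace not war".split()

binary_score = binary_positivity(toy_speech, toy_positive, toy_negative)
summed_score = scored_positivity(toy_speech, toy_word_scores)
```

The two rules can rank the same text differently, since the summed version weights strongly negative words more heavily than mildly positive ones. That is one plausible reason the three lexicons need not agree on a country ranking.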


The three methods yield different outcomes: AFINN and Bing conclude that China is the most positive country, followed by the US; meanwhile, the NRC identifies the US as the most positive country, with China in fourth place. And, despite all that disagreement, at least everyone can agree that the UK is the least positive! (How else do we explain “Peep Show”?)

Out of curiosity, I also calculate the NRC lexicon word counts for anger, anticipation, disgust, fear, joy, sadness, surprise, and trust. I then divide the sentiment counts by total numbers of words attributed to each country so as to present the percentage of words with some emotional range rather than the absolute levels for that range. The results are displayed below in stacked and unstacked formats.



According to this analysis, the US is the most emotional country with over 30% of words associated with an NRC sentiment. China comes in second, followed by the UK, France, and Russia, in that order. However, all five are very close in terms of emotional word percentages so this ordering does not seem to be particularly striking or meaningful. Moreover, the specific range of emotions looks very similar country by country as well. Perhaps this is due to countries following some well-known framework of a General Debate Speech, or perhaps political speeches in general follow some tacit emotional script displaying this mix of emotions…

I wonder how such speeches compare to a novel or a newspaper article in terms of these lexicon scores. For instance, I’d imagine that we’d observe more evidence of emotion in these speeches than in newspaper articles, as those are meant to be objective and clear (though this is less true of new forms of evolving media… i.e., those that aim to further polarize the public… or, those that were aided by one of the Security Council countries to influence an election in another of the Security Council countries… yikes), while political speeches might pick out words specifically to elicit emotion. It would be fascinating to investigate how emotional words are wielded in political speeches or new forms of journalistic media, and how that has evolved over time. (Quick hypothesis: fear is more present in the words that make up American media coverage and political discourse nowadays than it was a year ago…) But, I will leave that work (for now) to people with more in their linguistics toolkit than a novice knowledge of super fun R packages.


As per my updated workflow, I now conduct projects exclusively using R notebooks! So, here is the R notebook responsible for the creation of the included visuals. And, here is the associated Github repo with everything required to replicate the analysis. Methods mimic those outlined by superhe’R’os Julia Silge and David Robinson in their “Text Mining with R” book.

© Alexandra Albright and The Little Dataset That Could, 2017. Unauthorized use and/or duplication of this material without express and written permission from this blog’s author and/or owner is strictly prohibited. Excerpts, accompanying visuals, and links may be used, provided that full and clear credit is given to Alex Albright and The Little Dataset That Could with appropriate and specific direction to the original content.

You think, therefore I am


Hello world, I am now a G2 in economics PhD-land. This means I have moved up in the academic hierarchy; [1] I now reign over my very own cube, and [2] I am taking classes in my fields of interest. For me that means: Social Economics, Behavioral Economics, and Market Design. That also means I am coming across a lot of models, concepts, results that I want to tell you (whoever you are!) about. So, please humor me in this quasi-academic-paper-story-time… Today’s topic: Coate and Loury (1993)’s model of self-fulfilling negative stereotypes, a model presented in Social Economics.

Once upon a time…

There was a woman who liked math. She wanted to be a data scientist at a big tech company and finally don the company hoodie uniform she kept seeing on Caltrain. Though she had been a real ace at cryptography and tiling theory in college (this is the ultimate clue that this woman is not based on yours truly), she hadn’t been exposed to any coding during her studies. She was considering taking online courses to learn R or Python, or maybe one of those bootcamps… they also have hoodies, she thought.

She figured that learning to code and building a portfolio of work on Github would be a meaningful signal to potential employers as to her future quality as an employee. But, of course, she knew that there are real, significant costs to investing in developing these skills… Meanwhile, in a land far, far away–in an office ripe with ping pong tables–individuals at a tech company were engaged in decisions of their own: the very hiring decisions that our math-adoring woman was taking into account.

So, did this woman invest in coding skills and become a qualified candidate? Moreover, did she get hired? Well, this is going to take some equations to figure out, but, thankfully, this fictional woman as well as your non-fictitious female author dig that sort of thing.

Model Mechanics of “Self-Fulfilling Negative Stereotypes”

Let’s talk a little about this world that our story takes place in. Well, it’s 1993 and we are transported onto the pages of the American Economic Review. In the beginning Coate and Loury created the workers and the employers. And Coate and Loury said, “Let there be gender,” and there was gender. Each worker is also assigned a cost of investment, c. Given the knowledge of personal investment cost and one’s own gender, the worker makes the binary decision between whether or not to invest in coding skills and thus become qualified for some amorphous tech job. Based on the investment decision, nature endows each worker with an informative signal, s, which employers then can observe. Employers, armed with knowledge of an applicant’s gender and signal, make a yes-no hiring decision.

Of course, applicants want to be hired and employers want to hire qualified applicants. As such, payoffs are as follows: applicants receive w if they are hired and did not invest, w-c if they are hired and invested, and 0 if they are not hired (or -c if they invested but are not hired, since the investment cost is paid either way). On the tech company side, a firm receives $q if they hire a qualified worker, -$u if they hire an unqualified worker, and 0 if they choose not to hire.

Note importantly that employers do not observe whether or not an applicant is qualified. They just observe the signals distributed by nature. (The signals are informative and we have the monotone likelihood ratio property… meaning the better the signal the more likely the candidate is qualified and the lower the signal the more likely the candidate isn’t qualified.) Moreover, gender doesn’t enter the signal distribution at all. Nor does it influence the cost of investment that nature distributes. Nor the payoffs to the employer (as would be the case in the Beckerian model of taste-based discrimination). But… it will still manage to come into play!

How does gender come into play then, you ask? In equilibrium! See, in equilibrium, agents seek to maximize expected payoffs. And, expected payoffs depend on the tech company’s prior probability that the worker is qualified, p. Tech companies then use p and the observed signal to update their beliefs via Bayes’ Rule. So, the company now has some posterior probability, B(s,p), that is a function of p and s. The company’s expected payoff is thus B(s,p)($q) – (1-B(s,p))($u) since that is the product of the probability of the candidate’s being qualified and the gain from hiring a qualified candidate less the product of the probability of the candidate’s being unqualified and the penalty to hiring an unqualified candidate. The tech company will hire a candidate if that difference is greater than or equal to 0. In effect, the company decision is then characterized by a threshold rule such that they accept applicants with signal greater than or equal to s*(p), the signal at which the expected payoff equals 0. Now, note that this s* is a function of p. That’s because if p changes in the equation B(s,p)($q) – (1-B(s,p))($u)=0, there’s now a new s that makes it hold with equality. In effect, tech companies hold different genders to different standards in this model. Namely, it turns out that s*(p) is decreasing in p, which means intuitively that the more pessimistic employer beliefs are about a particular group, the tougher the standards that group faces.
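To make the threshold rule concrete, here is a small Python sketch under one loudly assumed specification: signals live on [0, 1], qualified workers draw from the density 2s, and unqualified workers draw from 2(1-s). Those densities are my own illustrative choice (they satisfy the monotone likelihood ratio property) and are not taken from Coate and Loury (1993). With them, the zero-expected-payoff condition has a closed-form solution for s*(p).

```python
def posterior(s, p):
    """Bayes' Rule posterior that a worker with signal s is qualified.

    Assumed densities (illustrative only): f_q(s) = 2s for qualified
    workers and f_u(s) = 2(1 - s) for unqualified workers, s in [0, 1].
    """
    qualified_weight = p * 2 * s
    unqualified_weight = (1 - p) * 2 * (1 - s)
    return qualified_weight / (qualified_weight + unqualified_weight)


def expected_payoff(s, p, q, u):
    """Employer's expected payoff from hiring: B*q - (1 - B)*u."""
    b = posterior(s, p)
    return b * q - (1 - b) * u


def hiring_standard(p, q, u):
    """Signal s*(p) at which the expected payoff from hiring is zero.

    Solving p*s*q = (1 - p)*(1 - s)*u for s under the assumed densities.
    """
    return (1 - p) * u / (p * q + (1 - p) * u)


# A more pessimistic prior (lower p) implies a higher standard.
for prior in (0.5, 0.25):
    print(prior, hiring_standard(prior, q=1.0, u=1.0))
```

In this toy specification, halving the prior from 0.5 to 0.25 (with q = u = 1) moves the standard from 0.5 up to 0.75, which is the "s*(p) is decreasing in p" result in miniature.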

So, let’s say, fictionally, that tech companies thought, hmmm I don’t know, “the distribution of preferences and abilities of men and women differ in part due to biological causes and that these differences may explain why we don’t see equal representation of women in tech and leadership” [Source: a certain memo]. Such a statement about differential abilities yields a lower p for women than for men. In this model, that means women will face higher standards for employment.

Now, what does that mean for our math-smitten woman who wanted to decide whether to learn to code or not? In this model, workers anticipate standards. Applicants know that if they invest, they receive an amount = (probability of being above standard as a qualified applicant)*w +(probability of falling below standard as a qualified applicant)*0 – c. If they don’t invest, they receive = (probability of being above standard as an unqualified applicant)*w +(probability of falling below standard as an unqualified applicant)*0. Workers invest only if the former is greater than or equal to the latter. If the model’s standard is higher for women than men, as the tech company’s prior probability that women are qualified is smaller than it is for men, then the threshold for investing for women will be higher than it is for men. 
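The worker's comparison can be sketched the same way. The Python snippet below again assumes (purely for illustration, not from the paper) that qualified workers' signals have density 2s and unqualified workers' signals have density 2(1-s) on [0, 1], so the chance of clearing a standard s is 1 - s² for a qualified worker and (1 - s)² for an unqualified one. The function names are hypothetical.

```python
def pass_prob_qualified(standard):
    """P(signal >= standard) for a qualified worker; assumed CDF is s**2."""
    return 1 - standard ** 2


def pass_prob_unqualified(standard):
    """P(signal >= standard) for an unqualified worker; assumed CDF is 1 - (1 - s)**2."""
    return (1 - standard) ** 2


def invests(cost, wage, standard):
    """Worker invests only if the payoff from investing is at least the payoff from not investing."""
    payoff_invest = pass_prob_qualified(standard) * wage - cost
    payoff_skip = pass_prob_unqualified(standard) * wage
    return payoff_invest >= payoff_skip


# Same worker (cost 0.4, wage 1), two different hiring standards.
print(invests(cost=0.4, wage=1.0, standard=0.5))   # lower standard
print(invests(cost=0.4, wage=1.0, standard=0.75))  # higher standard
```

With these assumed distributions, a worker with cost 0.4 invests when facing a standard of 0.5 but not when facing a standard of 0.75. Note that in this toy setup the return to investing is hump-shaped in the standard, so the "higher standard, less investment" pattern holds here only for standards above 0.5.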

So, if in this model-world, that tech company (with all the ping pong balls) is one of a ton of identical tech companies that believe, for some reason or another, that women are less likely to be qualified than men for jobs in the industry, women are then induced to meet a higher standard for hire. That higher standard, in effect, is internalized by women who then don’t invest as much. In the words of the original paper, “In this way, the employers’ initial negative beliefs are confirmed.”

The equilibrium, therefore, induces worker behavior that legitimizes the original beliefs of the tech companies. This is a case of statistical discrimination that is self-fulfilling. It is an equilibrium that is meant to be broken, but it is incredibly tricky to do so. Once workers have been induced to validate employer beliefs, then those beliefs are correct… and, how do you correct them?

I certainly don’t have the answer. But, on my end, I’ll keep studying models and attempting to shift some people’s priors…


Oh, and my fictional female math-enthusiast will be endowed with as many tech hoodies as she desires. In my imagination, she has escaped the world of this model and found a tech company with a more favorable prior. A girl can dream…


This post adapts Coate and Loury (1993) to the case of women in tech in order to demonstrate and summarize the model’s dynamics of self-fulfilling negative stereotypes. Discussion and lecture in Social Economics class informed this post. Note that these ideas need not be focused on gender and tech. They are applicable to many other realms, including negative racial group stereotypes and impacts on a range of outcomes, from mortgage decisions to even police brutality.



Senate Votes Visualized

Grid Maps

It has been exactly one week since the Senate voted to start debate on Obamacare. There were three Obamacare repeal proposals that followed in the wake of the original vote. Each one failed, but in a different way. News outlets such as the NYTimes did a great job reporting how each Senator voted for all the proposals. I then used that data to geographically illustrate Senators’ votes for each Obamacare-related vote. See below for a timeline of this past week’s events and accompanying R-generated visuals.

Tuesday, July 25th, 2017

The Senate votes to begin debate.


This passes 51-50 with Pence casting the tie-breaking vote. The visual shows the number of (R) and (D) Senators in each state as well as how those Senators voted. We can easily identify Collins and Murkowski, the two Republicans who voted NO, by the purple halves of their states (Maine and Alaska, respectively). While Democrats vote as a bloc in this case and in the impending three proposal votes, it is the Republicans who switch between NO and YES over the course of the week of Obamacare votes. Look for the switches between red and purple.

Later that day…

The Senate votes on the Better Care Reconciliation Act.


It fails 43-57 at the mercy of Democrats, Collins, Murkowski, and a more conservative bloc of Republicans.

Wednesday, July 26th, 2017

The Senate votes on the Obamacare Repeal and Reconciliation Act.


It fails 45-55 at the mercy of Democrats, Collins, Murkowski, and a more moderate bloc of Republicans.

Friday, July 28th, 2017

The Senate votes on the Health Care Freedom Act.


It fails 49-51 thanks to Democrats, Collins, Murkowski, and McCain. To hear the gasp behind the slice of purple in AZ, watch the video below.


This was a great exercise in using a few R packages for the first time, namely geofacet and magick. The former is used for creating visuals for different geographical regions, and is how the visualization is structured to look like the U.S. The latter allows you to add images onto plots, and is how there’s a little zipper face emoji over DC (as DC has no Senators).

In terms of replication, my R notebook for generating included visuals is here. The github repo is here.


A rising tide lifts all podcasts

Scatter Plots

A personal history of podcast listening

One afternoon of my junior year, I listened to a chunk of a Radiolab episode about “Sleep” as I myself heavily sank into unconsciousness. It was like guided meditation… supported, in part, by the Alfred P. Sloan Foundation. Jad and Robert’s forays on Radiolab quickly became my new bedtime stories. They helped me transition from days with my nose deep in books and, more accurately, my laptop to dreams that veered away from the geographic markers of one tiny college town in a valley of the Berkshires.

The Radiolab archives were a soundtrack to my last years of college and to my transition from “student” at a college to “staff member” at a university. A few months into my new place in the world, I found myself discussing Sarah Koenig’s Serial with my colleagues in neighboring cubicles. I also wasn’t a stranger to the virtual water cooler of /r/serialpodcast. I became so entrenched in the podcast’s following that I ended up being inspired to start blogging in order to document reddit opinion trends on the topic.

Faced with regular Caltrain rides from the crickets and “beam” store of Palo Alto to the ridiculous-elevation-changes of SF, I started listening to Gilmore Guys. You know, the show about two guys who talk about the Gilmore Girls. I did not think this would take (I mean, there were hundreds of episodes–who would listen to all that?!) but I was very wrong. The two hosts accompanied me throughout two full years of solo moments. Their banter bounced next to me during mornings biking with a smile caked across my face and palm trees to my left and right as well as days marked by fierce impostor syndrome. Their bits floated next to me in the aftermath of medical visits that frightened me and suburban grocery shopping endeavors (which also sometimes frightened me). Their words, light and harmless, sat with me during evenings of drinking beer on that third-of-a-leather-couch I bought on craigslist and silent moments of self-reflection.

That might sound like pretty heavy lifting for a podcast. But, (silly as it might sound) it was my security blanket throughout a few years of shifting priorities and support networks–tectonic plates grumbling under the surface of my loosely structured young adult life.

When it came time to move to Cambridge from Palo Alto, I bought a Leesa mattress thanks to Scott Aukerman’s 4am mattress store advert bit from Comedy Bang Bang. (Sorry, Casper.) Throughout my first doctoral academic year, I regularly listened to Two Dope Queens as I showered and made dinner after frisbee practices. Nowadays, like a good little liberal, I listen to the mix of political yammering, gossip, and calls to arms that makes up Pod Save America.

Podcasts seem to be an increasingly important dimension of our alone time. A mosaic of podcast suggestions is consistently part of entertainment recommendations across friends… which leads me to my question of interest: How are podcasts growing? Are there more created nowadays, or does it just feel like that since we discuss them more? 

Methodological motivation

In following the growth of the R-Ladies organization and the exciting work of associated women, I recently spotted a blog post by R-lady Lucy McGowan. In this post, Lucy looks at the growth of so-called ‘Drunk’ Podcasts. She finds a large growth in that “genre” (if you will) while making great usage of a beer emoji. Moreover, she expresses that:

While it is certainly true that the number of podcasts in general has absolutely increased over this time period, I would be surprised if the increase is as dramatic as the increase in the number of “drunk” podcasts.

I was super skeptical about this statement. I figured the increase in many podcast realms would be just as dramatic, if not more dramatic than that in the ‘drunk’ podcasts universe. So, I took this skepticism as an opportunity to build on Lucy’s code and emoji usage and look into release trends in other podcasting categories. Think of this all as one big excuse to try using emojis in my ggplot creations while talking about podcasts. (Thank you to the author of the emoGG package, a hero who also created Beyoncé color palettes for R.)

Plotting podcasts

I look into podcasting trends in the arenas of ‘sports’, ‘politics’, ‘comedy’ and ‘science.’ I figured these were general umbrella terms that many pods should fall under. However, you can easily adapt the code to look into different genres/search terms if you’re curious about other domains. (See my R notebook for reproducible work on this subject.) I, like Lucy, then choose emojis to use in my eventual scatterplot. Expressing a concept as complex as politics with a single emoji was challenging, but a fun exercise in using my millennial skillset.  (The ‘fist’ emoji was the best I could come up with for ‘politics’ though I couldn’t figure out how to modify the skin tone. I’m open to other suggestions on this front. You can browse through options with unicode here.)

In the end, I combine the plots for all four podcasting categories into one aggregated piece of evidence showing that many podcast genres have seen dramatic increases in 2017. The growth in number of releases is staggering in all four arenas. (Thus, the title ‘A rising tide lifts all podcasts.’) So, this growth doesn’t seem to be unique to the ‘drunk’ podcast. In fact, these more general/conventional categories see much more substantive increases in releases.


While the above deals with podcast releases, I would be very curious to see trends in podcast listening visualized. For instance, one could use the American Time Use Survey to break down people’s leisure consumption by type during the day. (It seems that the ATUS added “listening to podcast” in 2015.) I’d love to see some animated graphics on entertainment consumption over the hours reminiscent of Nathan Yau’s previous amazing work (“A Day in the Life of Americans”) with ATUS data.

Putting down the headphones

Regardless of the exact nature of the growth in podcasts over the past years, there is no doubt the medium has come to inhabit a unique space. Podcasts feel more steeped in solitude than other forms of entertainment like television or movies, which often are consumed in group settings. Podcasts have helped me re-learn how to be alone (but not without stories, ideas, and my imagination) and enjoy it. And, I am an only-child, so believe me… I used to be quite good at that.

The Little Dataset–despite this focus on podcasts–is brought to you by WordPress and not Squarespace. 🙂


Check out this R Notebook for the code needed to reproduce the graphic. You can also see my relevant github repository.


The One With All The Quantifiable Friendships, Part 2

Bar Charts, Line Charts, Nightingale Graphs, Stacked Area Charts, Time Series

Since finishing my first year of my PhD, I have been spending some quality time with my computer. Sure, the two of us had been together all throughout the academic year, but we weren’t doing much together besides pdf-viewing and type-setting. Around spring break, when I discovered you can in fact freeze your computer by having too many exams/section notes/textbooks simultaneously open, I promised my MacBook that over the summer we would try some new things together. (And that I would take out her trash more.) After that promise and a new sticker addition, she put away the rainbow wheel.

Cut to a few weeks ago. I had a blast from the past in the form of a Twitter notification. Someone had written a post about using R to analyze the TV show Friends, which was motivated by a similar interest that drove me to write something about the show using my own dataset back in 2015. In the post, the author, Giora Simchoni, used R to scrape the scripts for all ten seasons of the show and made all that work publicly available (wheeeeee) for all to peruse. In fact, Giora even used some of the data I shared back in 2015 to look into character centrality. (He makes a convincing case using a variety of data sources that Rachel is the most central friend of the six main characters.) In reading about his project, I could practically hear my laptop humming to remind me of its freshly updated R software and my recent tinkering with R notebooks. (Get ready for new levels of reproducibility!) So, off my Mac and I went, equipped with a new workflow, to explore new data about a familiar TV universe.

Who’s Doing The Talking?

Given line by line data on all ten seasons, I, like Giora, first wanted to look at line totals for all characters. In aggregating all non-“friends” characters together, we get the following snapshot:


First off, why yes, I am using the official Friends font. Second, I am impressed by how close the totals are for all characters though hardly surprised that Phoebe has the fewest lines. Rachel wouldn’t be surprised either…

Rachel: Ugh, it was just a matter of time before someone had to leave the group. I just always assumed Phoebe would be the one to go.

Phoebe: Ehh!!

Rachel: Honey, come on! You live far away! You’re not related. You lift right out.

With these aggregates in hand, I then was curious: how would line allocations look across time? So, for each episode, I calculate the percentage of lines that each character speaks, and present the results with the following three visuals (again, all non-friends go into the “other” category):
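The per-episode calculation is simple enough to sketch. The Python below is an illustration only (the project itself lives in an R notebook), and the `toy_lines` records, the episode labels, and the `line_shares` helper are all made up for the example rather than taken from the scraped scripts.

```python
from collections import Counter, defaultdict

FRIENDS = {"Chandler", "Joey", "Monica", "Phoebe", "Rachel", "Ross"}


def line_shares(lines):
    """Map each episode to {character: percentage of that episode's lines}.

    `lines` is an iterable of (episode, speaker) pairs; anyone outside the
    six friends is pooled into a single "Other" category.
    """
    per_episode = defaultdict(Counter)
    for episode, speaker in lines:
        label = speaker if speaker in FRIENDS else "Other"
        per_episode[episode][label] += 1
    return {
        episode: {
            who: 100 * n / sum(counts.values()) for who, n in counts.items()
        }
        for episode, counts in per_episode.items()
    }


# Made-up records, one per spoken line.
toy_lines = [
    ("S01E01", "Monica"),
    ("S01E01", "Joey"),
    ("S01E01", "Waitress"),
    ("S01E01", "Monica"),
    ("S01E02", "Ross"),
]
shares = line_shares(toy_lines)
print(shares["S01E01"])
```

Normalizing within each episode is what lets episodes of different lengths sit on the same 0-100% axis.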


Tell me that first graph doesn’t look like a callback to Rachel’s English Trifle. Anyway, regardless of a possible trifle-like appearance, all the visuals illustrate dynamics of an ensemble cast; while there is noise in the time series, the show consistently provides each character with a role to play. However, the last visual does highlight some standouts in the collection of episodes that uncharacteristically highlight or ignore certain characters. In other words, there are episodes in which one member of the cast receives an unusually high or low percentage of the lines in the episode. The three episodes that boast the highest percentages for a single member of the gang are: “The One with Christmas in Tulsa” (41.9% Chandler), “The One With Joey’s Interview” (40.3% Joey), “The One Where Chandler Crosses a Line” (36.3% Chandler). Similarly, the three with the lowest percentages for one of the six are: “The One With The Ring” (1.5% Monica), “The One With The Cuffs” (1.6% Ross), and “The One With The Sonogram At The End” (3.3% Joey). The sagging red lines of the last visual identify episodes that have a low percentage of lines spoken by a character outside of the friend group. In effect, those dips in the graph point to extremely six-person-centric episodes, such as “The One On The Last Night” (0.4% non-friends dialogue–a single line in this case), “The One Where Chandler Gets Caught” (1.1% non-friends dialogue), and “The One With The Vows” (1.2% non-friends dialogue).

The Men Vs. The Women

Given this title, here’s a quick necessary clip:

Now, how do the line allocations look when broken down by gender lines across the main six characters? Well, the split consistently bounces around 50-50 over the course of the 10 seasons. Again, as was the case across the six main characters, the balanced split of lines is pretty impressive.


Note that the second visual highlights that there are a few episodes that are irregularly man-heavy. The top three are: “The One Where Chandler Crosses A Line” (77.0% guys), “The One With Joey’s Interview” (75.1% guys), and “The One With Mac and C.H.E.E.S.E.” (70.2% guys). There are also exactly two episodes that feature a perfect 50-50 split for lines across gender: “The One Where Rachel Finds Out” and “The One With The Thanksgiving Flashbacks.”

Say My Name

How much do the main six characters address or mention one another? Giora addressed this question in his post, and I build off of his work by including nicknames in the calculations, and using a different genre of visualization. With respect to the nicknames–“Mon”, “Rach”, “Pheebs”, and “Joe”–“Pheebs” is undoubtedly the stickiest of the group. Characters say “Pheebs” 370 times, which has a comfortable cushion over the second-place nickname “Mon” (used 73 times). Characters also significantly differ in their usage of each other’s nicknames. For example, while Joey calls Phoebe “Pheebs” 38.3% of the time, Monica calls her by this nickname only 4.6% of the time. (If you’re curious about more numbers on the nicknames, check out the project notebook.)
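As a rough illustration of folding nicknames into the name counts, here is a Python sketch (again, not the R code from the project notebook). The `NAME_FORMS` mapping uses the four nicknames listed above; the sample lines and the `count_mentions` helper are hypothetical. Word boundaries in the regular expressions keep “Mon” from matching inside “Monica” and “Joe” from matching inside “Joey”.

```python
import re
from collections import Counter

# Full names plus the nicknames mentioned in the post.
NAME_FORMS = {
    "Monica": ["Monica", "Mon"],
    "Rachel": ["Rachel", "Rach"],
    "Phoebe": ["Phoebe", "Pheebs"],
    "Joey": ["Joey", "Joe"],
    "Chandler": ["Chandler"],
    "Ross": ["Ross"],
}

# One whole-word pattern per character, covering every form of the name.
PATTERNS = {
    name: re.compile(r"\b(?:" + "|".join(forms) + r")\b")
    for name, forms in NAME_FORMS.items()
}


def count_mentions(lines):
    """Count how often each speaker says each character's name or nickname.

    `lines` is an iterable of (speaker, text) pairs.
    """
    counts = Counter()
    for speaker, text in lines:
        for name, pattern in PATTERNS.items():
            counts[(speaker, name)] += len(pattern.findall(text))
    return counts


# Made-up dialogue for illustration.
toy_lines = [
    ("Joey", "Hey Pheebs, have you seen Ross?"),
    ("Rachel", "Ross! Ross, wait!"),
    ("Monica", "Phoebe, this is Joey's sandwich."),
]
mentions = count_mentions(toy_lines)
print(mentions[("Rachel", "Ross")])
```

A simple whole-word match like this will still pick up unrelated uses of a short form (say, “Mon” as an abbreviation), so real transcripts would likely need some manual checking.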

Now, after adding in the nicknames, who says whose name? The following graphic addresses that point of curiosity:


The answer is clear: Rachel says Ross’s name the most! (789 times! OK, we get it, Rachel, you’re in love.) We can also see that Joey is the most self-referential with 242 usages of his own name–perhaps not a shock considering his profession in the entertainment biz. Overall, the above visual provides some data-driven evidence of the closeness between certain characters that is clearly evident in watching the show. Namely, the Joey-Chandler, Monica-Chandler, Ross-Rachel relationships that were evident in my original aggregation of shared plot lines are still at the forefront!


Comparing the above work to what I had originally put together in January 2015 is a real trip. My original graphics back in 2015 were made entirely in Excel and were as such completely unreproducible, as was the data collection process. The difference between the opaqueness of that process and the transparency of sharing notebook output is super exciting to me… and to my loyal MacBook. Yes, yes, I’ll give you another sticker soon.

Let’s see the code!

Here is the html rendered R Notebook for this project. Here is the Github repo with the markdown file included.

*Screen fades to black* 
Executive Producer: Alex Albright
