Difference between revisions of "Social Networks"

From Math Images

Revision as of 10:50, 23 July 2013

Facebook friend Web
FB Web.jpg
Fields: Graph Theory and Algebra
Image Created By: unknown

Friend network of a particular Facebook account. The pink indicates a "mob" of tightly interconnected friends, such as high school or college friends.

Basic Description

The picture above may seem like an innocent model of someone's friend network on Facebook, but it reveals plenty about how modern society operates.

First, it indicates that people inherently have preferences for people who they choose to associate with. Taken in the context of economic transactions, these preferences can determine whether or not a firm, state, or country prospers economically. Think about it: there are infinite possibilities of associations, but people sometimes prefer not to associate.

Second, the pink area (which probably represents high school friends, college friends, or coworkers) indicates something called clustering, which is kind of like grouping within a group. Given the context above, grouping can be thought of as a similar preference among certain individuals. Being in groups has allowed people to protest for their rights, start a new company, take over countries, etc. The odds are more in your favor in a group if you want to achieve something.

These kinds of preferences have undoubtedly shaped our world. Therefore, it is beneficial to analyze such a powerful phenomenon. In other words, social networks, or the patterns of connections between people, are important enough to study through mathematics.

Social network analysis (SNA) can provide valuable information about a network by answering the following questions:

  • "Who is most central to a network?"
  • "Who has the greatest influence in a group?"
  • "How many different clusters can you find in a network?"
  • "Which connections are pivotal for the functioning of a group?"
  • "Who is Mr./Ms. Popular?"

Most importantly, this analysis allows us to make more rational decisions based on data (rather than intuition) and make predictions about the behavior of a group. For example, if we are able to determine that in the case of an epidemic, the virus is concentrated generally around a select few, there could be something done to prevent its spread. Conversely, if we wanted to make sure an important idea would spread quickly (a pitch for a school candidate, for example) we would first target the most influential and central people.

Centrality Measures

Of the topics explained above, perhaps the easiest and most fundamental measures of network structures are centrality measures, which are given in more detail below.

Keep in mind, however, that these measures quantify characteristics of people (or whatever objects that pertain to the network) rather than the network as a whole.

The answers to all of these questions revolve around the concept of centrality. The most frequently used measures of network structure, therefore, are centrality measures. These include Degree, Eigenvector, Closeness and Betweenness measures of centrality. More information about the measures is found below.

For illustration and simplicity, these explanations will be in the context of human interactions.

Degree Centrality

(Social Network of Shakespeare's Hamlet, illustrating how protagonists are central to the plot, as evidenced by their degree centrality)

Degree centrality measures how well connected a person is to a network. It does this by simply counting how many people a person is connected to. It seeks to measure:

  • The level of influence someone can establish in a community, organization, group, etc.
  • The opportunity to be influenced by someone in a community, organization, group, etc
  • How exposed someone is in a network, mostly known as the index of exposure

Example: The characters in the image to the right will probably seem familiar. It is a network analyzing the ties among the characters of one of William Shakespeare's famous plays, Hamlet. If you are not familiar with the play, just know that Hamlet (a prince) is the protagonist and is next in line to inherit the throne of his father, King Hamlet. However, Prince Hamlet's uncle, Claudius, inherits the throne. Hamlet later finds out that Claudius was responsible for killing his father. Because Hamlet is a prince and the protagonist of the play, the story mostly centers around Hamlet's quest to avenge the death of his father, which involves him talking to almost all the characters in the play. It seems like Prince Hamlet is rather central, right? In the simplest sense, because he has the most ties of any character in the play (as diagrammed by the image), he has the highest degree centrality, and thus it makes sense to conceive of him as the protagonist, or main character.

Eigenvector Centrality

Eigenvector centrality is a more sophisticated version of degree centrality that measures not only the number of people someone is connected to, but also the quality of the connections. For a connection to have quality, in this context, it means that that connection has lots of connections. It seems necessary because, in an intuitive sense, connections with influential people will make you more influential than just having non-influential connections. Eigenvector Centrality seeks to measure:

  • Who the most popular person is in the group
  • How well someone is connected to the well-connected, in other words, knowing the "right people"

Closeness Centrality

Closeness centrality, unsurprisingly, measures how close a person is to everyone else within a network. Close in this context refers to how many "friends of friends" one would need to go through to be related to another person. A not so close relation, for example, would be someone in California being related to some random person in North Dakota through 1 friend that knows another friend who knows another friend... (6 intermediate friends in total) who finally knows that random person in North Dakota. The 6 in the last example indicates the degrees of separation.

A person who is well connected therefore has, on average, fewer degrees of separation to reach everyone on the network.

Betweenness Centrality

Betweenness centrality measures how "in between" someone is in the network. "In between" refers to how often the flow of information in a network passes through a specific person. A person with a high measure of betweenness usually "knows what's going on" and can act as a liaison to separate parts of the network.

A More Mathematical Explanation

From now on, we will add to our understanding of social networks as graphs in the context of interacting people. The people will now be thought of as vertices and the connections between people will be thought of as edges. For more information on the basics of graphs, which will be imperative to the mathematical understanding of social networks, look at the page Graph Theory.

The following graph of a hypothetical social network will serve as the example for the following sections, and its centrality measures will be determined:

Graph math 1.png

If we are to do mathematical analyses there must be one way to store information about the network as a whole. Luckily there is such a way! It is through the Adjacency Matrix. The i,j entry of the adjacency matrix is defined as follows:

a_{ij}=\left\{ \begin{array}{rcl} 1 & \mbox{for} & i, j \mbox{ adjacent}\\ 0 & \mbox{for} & i, j \mbox{ non-adjacent} \end{array}\right.

As an example, we will determine the adjacency matrix, A, of the graph above:

A=\left( \begin{array}{cccccccc}
0 & 1 & 0 & 0 & 1 & 1 & 0 & 0 \\
1 & 0 & 1 & 1 & 0 & 0 & 0 & 0 \\
0 & 1 & 0 & 1 & 0 & 0 & 0 & 0 \\
0 & 1 & 1 & 0 & 0 & 0 & 0 & 0 \\
1 & 0 & 0 & 0 & 0 & 1 & 1 & 1 \\
1 & 0 & 0 & 0 & 1 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 1 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 1 & 0 & 0 & 0
\end{array}\right)

For a better visual and mathematical understanding of the adjacency matrix, you can look at the following chart:

New adjacency matrix.png
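As a minimal sketch, the same matrix can be rebuilt from a list of edges. The edge list below is read off the 1-entries of the matrix printed above; the names `EDGES` and `adjacency_matrix` are illustrative, and vertices are numbered 1 to 8 rather than named because the name-to-row mapping lives in the figure.

```python
# Edges read off the 1-entries of the adjacency matrix printed above
# (1-based vertex numbers; names are deliberately not assigned here).
EDGES = [(1, 2), (1, 5), (1, 6), (2, 3), (2, 4), (3, 4), (5, 6), (5, 7), (5, 8)]


def adjacency_matrix(n, edges):
    """Return the n x n adjacency matrix of an undirected graph."""
    A = [[0] * n for _ in range(n)]
    for i, j in edges:
        A[i - 1][j - 1] = 1  # a_ij = 1 when i and j are adjacent
        A[j - 1][i - 1] = 1  # undirected, so the matrix is symmetric
    return A


A = adjacency_matrix(8, EDGES)
for row in A:
    print(" ".join(str(entry) for entry in row))
```

Storing each edge in both directions is what makes the matrix symmetric, which is why a vertex's row and column contain the same entries.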

Now with the graph and adjacency matrix at our disposal, we are ready to find certain measures of centrality. Typically, computers are used to measure centrality, but our graph is small enough that we can show most of the procedures to find certain measures of centrality.

Degree Centrality

Definition: The degree k_{i} of a vertex i is

k_{i} = \sum_{j=1}^n(A)_{ij}

In other words, it means counting all of the edges attached to a given vertex. Having the adjacency matrix in mind, it is like counting all of the ones in the column or row for a specific vertex. We can use Chris's row in the adjacency matrix as an example:

Degree centrality matrix.png

As you can see, Chris's degree centrality is: 1+1+1+1=4

Let's calculate the degree centralities of the others:

Jason: 1+1+1=3

Austin: 1+1=2

Donald: 1

Bernie: 1

Mark: 1

David: 1

Elissa: 1

Conveniently, they were displayed in a way that ranked them at the same time. We can now conclude that Chris has the highest degree centrality (he knows the most people!) and Donald, Bernie, Mark, David and Elissa are tied for the lowest degree centrality, only knowing one person.
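The row-sum definition of degree can be sketched in a few lines. The matrix below is the adjacency matrix exactly as printed earlier on this page; the function name `degrees` is illustrative. Rows are left unnamed because the name-to-row mapping lives in the figure, so it may be worth comparing the computed row sums against the hand counts above.

```python
# Adjacency matrix as printed earlier on this page (rows are unnamed here).
A = [
    [0, 1, 0, 0, 1, 1, 0, 0],
    [1, 0, 1, 1, 0, 0, 0, 0],
    [0, 1, 0, 1, 0, 0, 0, 0],
    [0, 1, 1, 0, 0, 0, 0, 0],
    [1, 0, 0, 0, 0, 1, 1, 1],
    [1, 0, 0, 0, 1, 0, 0, 0],
    [0, 0, 0, 0, 1, 0, 0, 0],
    [0, 0, 0, 0, 1, 0, 0, 0],
]


def degrees(A):
    """Degree k_i of each vertex: the sum of row i of the adjacency matrix."""
    return [sum(row) for row in A]


for vertex, k in enumerate(degrees(A), start=1):
    print(f"vertex {vertex}: degree {k}")
```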

Eigenvector Centrality

(Note: This section requires a knowledge of Linear Algebra)

If you can remember one thing about eigenvector centrality, it is that it's a more sophisticated version of degree centrality (counting edges from a vertex) that takes into account the quality of the connections!

How can math store information regarding relative centralities (quality) and connections toward other vertices?

If you thought of matrices and vectors, you were right! We will utilize both in our focus on eigenvector centrality.

Let x_{i} be the eigenvector centrality score of a vertex i and let A=[a_{ij}] be the adjacency matrix of the graph containing i. The inclusion of the adjacency matrix is important because it tells us which vertices are connected to each other (we want to include everyone's data and we don't want to include ones that aren't connected to us!) and how many vertices are connected. Let x_{j} be the eigenvector centrality scores of the vertices adjacent to i.

The eigenvector centrality score of vertex i is proportional to the sum of the scores of all vertices that are connected to it:

x_{i}=\frac{1}{\lambda}\sum_{j=1}^Na_{ij}x_{j}

where \lambda is a constant and N is the number of vertices in the graph. What this basically says is that one person's eigenvector score depends on the eigenvector scores of its neighbors.

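Following the properties of matrix multiplication, the equation above can be written as:

\overrightarrow{x}=\frac{1}{\lambda}A\overrightarrow{x}

which rearranges into the eigenvector equation:

A\overrightarrow{x}=\lambda\overrightarrow{x}

where \overrightarrow{x} is an eigenvector of A and \lambda is the corresponding eigenvalue.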

This is all nice and dandy. The eigenvector centrality score of one vertex relies on the eigenvector scores of the others. But how would we find the eigenvector centrality score of a particular vertex if it depends on the others?

Well, the answer is that, strangely, we find them all at once! We just find the eigenvector associated with the highest eigenvalue, according to the Perron-Frobenius Theorem. The ith entry of that vector will correspond to the eigenvector centrality score of the ith vertex.

To get through the material faster, the process of finding eigenvectors will be omitted; look here for reference.

We will now find the eigenvector associated with the highest eigenvalue, and thus the vector of eigenvector centralities, of our hypothetical network:

Eigenvector Centrality.png

With \lambda\approx 2.56

Let's test out one of the formulas, x_{i}=\frac{1}{\lambda}\sum_{j=1}^Na_{ij}x_{j}, to Jason.


We can clearly see here that even though Chris had a higher degree centrality, Jason had the same eigenvector centrality. Eigenvector centrality was created for reasons like these: to determine someone's importance in a network by looking at more than the number of connections. In most settings, who you are connected to is more important than how many people you are connected to.
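Since the page omits the procedure for finding eigenvectors, here is a rough sketch using power iteration, a technique swapped in for illustration rather than the method used to produce the figure. It repeatedly multiplies a guess by A and rescales. The matrix is the one printed earlier; `power_iteration` and its `iterations` parameter are illustrative names.

```python
# Adjacency matrix as printed earlier on this page (rows are unnamed here).
A = [
    [0, 1, 0, 0, 1, 1, 0, 0],
    [1, 0, 1, 1, 0, 0, 0, 0],
    [0, 1, 0, 1, 0, 0, 0, 0],
    [0, 1, 1, 0, 0, 0, 0, 0],
    [1, 0, 0, 0, 0, 1, 1, 1],
    [1, 0, 0, 0, 1, 0, 0, 0],
    [0, 0, 0, 0, 1, 0, 0, 0],
    [0, 0, 0, 0, 1, 0, 0, 0],
]


def power_iteration(A, iterations=1000):
    """Approximate the largest eigenvalue of A and its eigenvector.

    Assumes a connected graph that contains an odd cycle (true for the
    matrix above), so that the iteration settles on a single vector.
    """
    n = len(A)
    x = [1.0] * n  # any positive starting guess works
    lam = 0.0
    for _ in range(iterations):
        y = [sum(A[i][j] * x[j] for j in range(n)) for i in range(n)]
        lam = max(y)              # since max(x) == 1, max(Ax) tends to lambda
        x = [v / lam for v in y]  # rescale so the largest entry is 1
    return lam, x


lam, x = power_iteration(A)
print(round(lam, 4))
print([round(v, 3) for v in x])
```

For the matrix as printed, the iteration settles near 2.56, and the first and fifth rows end up with equal scores even though their degrees differ (3 and 4).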

Closeness and Betweenness Centrality

Background Definitions:

  • A path is a sequence of vertices that are traversed by "walking through" edges from one to another across the network. More information can be found in Graph Theory.
  • A geodesic path is the shortest path between a pair of vertices.

Closeness Centrality of a vertex i is formally defined as the mean geodesic path length between i and each other vertex in the graph. Intuitively, if this number of mean geodesic path length is low, we say it is closely located to every other vertex. Be wary, however, that a low number in this case indicates a higher centrality score. This may be confusing for some. Therefore, closeness centrality is more easily understood as the reciprocal of the average geodesic path, as follows:

c_{i}=\frac{n-1}{\sum_{j\neq i}d_{ij}}

where d_{ij} is the length of the geodesic path from i to j and n is the number of vertices in the graph.

We will use our example graph once more to illustrate the shortest paths for certain individuals (the rest will be calculated and left to be verified by the reader):

Let's look at the shortest path of Donald to Elissa:

Donald centrality.png

We can see that we individually count the number of edges required to get to the other vertex (person). To get to Elissa, Donald's geodesic path has length 1+1+1+1=4. We hope that the shortest path seems relatively straightforward given the simplicity of the model. For most real and complex models, computers calculate these shortest paths with a specific algorithm, Dijkstra's Algorithm, mentioned below.

Now let's see an example where someone is "close" within a network, which is to have high closeness centrality. Let's look at some of Jason's shortest paths to other people:

Jason centrality.png

We see that to get to the two most distant people, Donald and Elissa, Jason only has to "walk" through two edges. In other words, Donald and Elissa are Jason's 2nd degree friends. Again, to have high closeness centrality, one must have the lowest mean degree of separation to everyone in the network. Things are looking good for Jason so far.

Let's now compute the closeness centralities for all:

Closeness measures 2.png

You can check for your own answers for the geodesic path/closeness centrality. It looks like Jason came in first with a closeness centrality score of 0.63 and Donald, Bernie, David and Elissa tied for last with a closeness centrality score of 0.39. Jason's high score makes sense because visually, he is centrally located. The tie between Donald, Bernie, David and Elissa makes sense because they are only a part of the system because they are directly connected to one person, and thus have to travel "farther" (meaning having to go through more "friends of friends") to reach someone in the network.
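The closeness scores can be sketched with a breadth-first search, which finds geodesic path lengths when every edge counts as one step. The sketch assumes closeness is the reciprocal of the mean geodesic path length to the other n-1 vertices and that the graph is connected. The matrix is the one printed earlier, and `bfs_distances` and `closeness` are illustrative names.

```python
from collections import deque

# Adjacency matrix as printed earlier on this page (rows are unnamed here).
A = [
    [0, 1, 0, 0, 1, 1, 0, 0],
    [1, 0, 1, 1, 0, 0, 0, 0],
    [0, 1, 0, 1, 0, 0, 0, 0],
    [0, 1, 1, 0, 0, 0, 0, 0],
    [1, 0, 0, 0, 0, 1, 1, 1],
    [1, 0, 0, 0, 1, 0, 0, 0],
    [0, 0, 0, 0, 1, 0, 0, 0],
    [0, 0, 0, 0, 1, 0, 0, 0],
]


def bfs_distances(A, source):
    """Geodesic path length from source to every vertex (connected graph)."""
    n = len(A)
    dist = [None] * n
    dist[source] = 0
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in range(n):
            if A[u][v] and dist[v] is None:
                dist[v] = dist[u] + 1  # one more edge than its parent
                queue.append(v)
    return dist


def closeness(A):
    """Reciprocal of the mean geodesic path length to the other vertices."""
    n = len(A)
    return [(n - 1) / sum(bfs_distances(A, i)) for i in range(n)]


scores = closeness(A)
print([round(c, 3) for c in scores])
```

The largest score comes out as 7/11 and the smallest as 7/18, which line up with the 0.63 and 0.39 quoted above.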

Betweenness Centrality of a vertex i is the fraction of geodesic paths within the graph that include i. In other words, it assembles every pair of vertices and computes the geodesic paths between them, then it determines the fraction of those paths that intersect i.

Mathematically, for a vertex k, b_{k}=\sum_{i\neq j}\frac{g_{ijk}}{g_{ij}}

where g_{ij} is the number of geodesic paths from i to j and g_{ijk} is the number of those paths that pass through k.

Again, we will use our example network to find betweenness. The following picture illustrates how betweenness is calculated (although not completely) for Chris and Austin in the context of paths between Jason and Elissa, Donald and Mark:

Betweenness centrality.png

As you can see, shortest paths consisting of only one traversed edge (they are first degree friends), won't have any "middle men", or intermediaries. Therefore, in our matrix for closeness centrality, paths containing 1 (or zero), can be ignored for the purposes of calculating betweenness centrality. What matters most are the entries in the closeness centrality matrix that are greater than one, because those entries suggest that someone had to know an intermediary (or several) to be associated with a friend.

In the next image, the closeness centrality matrix will be modified in a way that highlights the intermediaries for each possible geodesic path. For every path, the intermediary's initial (if any) will replace the numerical entry. If there are several intermediaries, then they will each have their own initial in the entry. The 0 and 1 entries of the former closeness centrality matrix will be blank in the latter betweenness centrality matrix (as explained above).

Betweenness centrality 2.png

With this image in mind, we can now calculate everyone's betweenness centrality! All we do is add their respective "interruptions", or times they were between the endpoints of a geodesic path, and divide by the total number of geodesic paths, 64 (8*8).

Betweenness centrality 3.png

This result makes intuitive sense. Jason has the highest betweenness centrality because he is literally "in the middle" of everyone. Donald, Bernie, Mark, David and Elissa all have low betweenness centrality because they are located in the periphery of the network and don't bring people together.
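A rough sketch of the same counting: for every ordered pair of distinct vertices, count the geodesic paths between them and the share that pass through each intermediary, then divide by 8*8 = 64 as this page does. The matrix is the one printed earlier; `bfs_counts` and `betweenness` are illustrative names, and the n*n normalization is this page's convention rather than a universal one.

```python
from collections import deque

# Adjacency matrix as printed earlier on this page (rows are unnamed here).
A = [
    [0, 1, 0, 0, 1, 1, 0, 0],
    [1, 0, 1, 1, 0, 0, 0, 0],
    [0, 1, 0, 1, 0, 0, 0, 0],
    [0, 1, 1, 0, 0, 0, 0, 0],
    [1, 0, 0, 0, 0, 1, 1, 1],
    [1, 0, 0, 0, 1, 0, 0, 0],
    [0, 0, 0, 0, 1, 0, 0, 0],
    [0, 0, 0, 0, 1, 0, 0, 0],
]


def bfs_counts(A, s):
    """Distances from s and the number of geodesic paths from s to each vertex."""
    n = len(A)
    dist, sigma = [None] * n, [0] * n
    dist[s], sigma[s] = 0, 1
    queue = deque([s])
    while queue:
        u = queue.popleft()
        for v in range(n):
            if not A[u][v]:
                continue
            if dist[v] is None:
                dist[v] = dist[u] + 1
                queue.append(v)
            if dist[v] == dist[u] + 1:
                sigma[v] += sigma[u]  # every geodesic to u extends to v
    return dist, sigma


def betweenness(A):
    """Sum of g_ijk / g_ij over ordered pairs i != j, divided by n * n."""
    n = len(A)
    info = [bfs_counts(A, s) for s in range(n)]
    b = [0.0] * n
    for s in range(n):
        dist_s, sigma_s = info[s]
        for t in range(n):
            if t == s:
                continue
            for k in range(n):
                if k == s or k == t:
                    continue  # endpoints are not intermediaries
                dist_k, sigma_k = info[k]
                if dist_s[k] + dist_k[t] == dist_s[t]:  # k lies on a geodesic
                    b[k] += sigma_s[k] * sigma_k[t] / sigma_s[t]
    return [value / (n * n) for value in b]


print(betweenness(A))
```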

Dijkstra's Algorithm

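Dijkstra's Algorithm finds the geodesic paths from one starting vertex to every other vertex by repeatedly finalizing the closest vertex not yet finalized and updating the distances to its neighbors. For a fuller description, see here.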
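A minimal sketch of Dijkstra's algorithm on the example network follows, treating each 1 in the adjacency matrix printed earlier as an edge of length 1. The function name `dijkstra` and the convention that a 0 entry means "no edge" are assumptions made for illustration.

```python
import heapq

# Adjacency matrix as printed earlier on this page; every edge has length 1.
A = [
    [0, 1, 0, 0, 1, 1, 0, 0],
    [1, 0, 1, 1, 0, 0, 0, 0],
    [0, 1, 0, 1, 0, 0, 0, 0],
    [0, 1, 1, 0, 0, 0, 0, 0],
    [1, 0, 0, 0, 0, 1, 1, 1],
    [1, 0, 0, 0, 1, 0, 0, 0],
    [0, 0, 0, 0, 1, 0, 0, 0],
    [0, 0, 0, 0, 1, 0, 0, 0],
]


def dijkstra(weights, source):
    """Shortest distance from source to every vertex (0 means no edge)."""
    n = len(weights)
    dist = [float("inf")] * n
    dist[source] = 0
    heap = [(0, source)]  # (distance found so far, vertex)
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist[u]:
            continue  # stale entry: u was already finalized with less
        for v in range(n):
            w = weights[u][v]
            if w > 0 and d + w < dist[v]:
                dist[v] = d + w  # found a shorter route to v through u
                heapq.heappush(heap, (dist[v], v))
    return dist


print(dijkstra(A, 2))  # distances from the third row's vertex
```

With unit edge lengths this gives the same answers as a breadth-first search; the priority queue only starts to matter once edges carry different weights.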


Bernoulli Graph and Configuration Models

Previously, we were establishing measures concerning the properties of nodes, or people. We will now focus our attention on network properties. Our previous analyses consisted of taking measurements and analyzing them; now we are taking a step further to model real-life situations knowing very little about the network.

Before we begin, here are some definitions:

  • Degree Sequence: the list of the degrees of the vertices V of a graph G=(V,E), written in decreasing order.
    • For example {3,2,2,1}
  • Degree Distribution (p_{k}): the fraction of nodes in G that have a specified degree k. If there are n vertices and n_{k} of them have degree k, then p_{k}=\frac{n_{k}}{n}

The most classical and simplest model of a network is the Bernoulli Graph. Given n nodes, where each pair of vertices is joined by an edge independently with probability p, the degree distribution p_{k} in a Bernoulli Graph follows a binomial distribution:

p_{k}=\binom{n-1}{k}p^{k}(1-p)^{n-1-k}

In other words, we can find out the proportion of vertices n that have a specified degree k.
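As a small sketch, the binomial degree distribution can be tabulated directly, assuming the standard form p_k = C(n-1, k) p^k (1-p)^(n-1-k), in which each vertex has n-1 possible neighbors. The function name and the example values n = 8 and p = 0.3 are illustrative only.

```python
from math import comb


def bernoulli_degree_distribution(n, p):
    """Probability p_k that a vertex has degree k, for k = 0 .. n-1."""
    return [comb(n - 1, k) * p**k * (1 - p) ** (n - 1 - k) for k in range(n)]


# Example values chosen only for illustration.
for k, p_k in enumerate(bernoulli_degree_distribution(8, 0.3)):
    print(f"degree {k}: {p_k:.4f}")
```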

According to empirical studies on social behavior, real-life networks hardly follow these distributions. A more sophisticated modeling technique is needed.

To allow for these non-binomial degree distributions, one can generalize the random graph, specifying a degree distribution p_{k} and then forming a graph that has that distribution but is otherwise random. This formation is called the configuration model.

The Configuration Model algorithm works as follows. Create vertices V = {1, 2, ..., n} and assign them stubs, or half edges, according to the sequence {d_1, d_2, ..., d_n}, as in the figure below. (These stubs or half edges are edges that are connected on one side while the other side remains free.) Then, to make the random graph, pick any two stubs uniformly at random and connect their free ends. These two stubs become one edge. Repeat this process until no free stubs are left.

Figure indicating the process for constructing configuration models
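The stub-matching procedure can be sketched as follows. Shuffling the list of stubs and pairing neighbors in the shuffled list is one way to pick stub pairs uniformly at random. The function name and `seed` parameter are illustrative, and this simple version may produce self-loops or repeated edges, which the description above does not rule out.

```python
import random


def configuration_model(degree_sequence, seed=None):
    """Pair up stubs uniformly at random; returns a list of edges.

    Vertices are numbered from 0. The result may contain self-loops and
    repeated edges, since nothing in the procedure forbids them.
    """
    if sum(degree_sequence) % 2:
        raise ValueError("the degrees must sum to an even number")
    rng = random.Random(seed)
    # One stub (half edge) per unit of degree, labelled by its vertex.
    stubs = [v for v, d in enumerate(degree_sequence) for _ in range(d)]
    rng.shuffle(stubs)
    # Consecutive stubs in the shuffled list are joined into one edge.
    return [(stubs[i], stubs[i + 1]) for i in range(0, len(stubs), 2)]


edges = configuration_model([3, 2, 2, 1], seed=1)
print(edges)
```

Whatever edges come out, each vertex ends up with exactly the degree it was assigned, because every stub is used exactly once.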

The reason this model works at depicting real-life social networks is its isomorphic properties. For more on these properties, explore the mathematical section on Graph Theory. Basically, two graphs are the same, or isomorphic, if their respective edges connect the same vertices. The following image demonstrates this concept.

Figure showing how two graphs are isomorphic

Basic Reproduction Number

Now we can apply these models to make more informed predictions!

Have you ever wondered if there was such a point, call it a tipping point, that can determine whether or not an idea (or virus) will spread or die out? If so, you're in luck! If not, you still are.

Definitions and concepts before we begin:

  • Mean Degree: <k>=\frac{\sum_{i}k_{i}}{n}, which is simply the sum of the degrees of all vertices divided by the number of vertices.
  • Given an independent probability r and degree k, r(k-1) represents one individual's ability to transmit information. The expression k-1 just means that you include all other vertices except the one that told the person the information in the first place.

Given a mean degree, <k>, and a person's individual probability, r, to be connected to someone else, it is possible to take the weighted average of all probabilities to get what is called the basic reproductive number:

R_{0}=r\frac{<k^2>-<k>}{<k>}

where <k^2> is the mean of the squared degrees. The <k^2> term in the formula makes sense because the spread of information is like a two-way street: the more people you are connected to, the more likely you are to spread an idea or hear of one. For example, a person with 5 connections is 5x5=25 times more useful at spreading an idea than someone with degree 1.

If R_{0} is greater than 1, the idea will spread exponentially (yikes). If R_{0} is less than 1, the idea will die.
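As a hedged sketch, the tipping point can be computed from a list of degrees, assuming the standard form R_0 = r(<k^2> - <k>)/<k>. The degree list is the row sums of the adjacency matrix printed earlier; the function name and the value r = 0.6 are illustrative only.

```python
def basic_reproduction_number(degrees, r):
    """R_0 = r * (<k^2> - <k>) / <k> for a list of vertex degrees."""
    n = len(degrees)
    mean_k = sum(degrees) / n                    # <k>
    mean_k2 = sum(k * k for k in degrees) / n    # <k^2>
    return r * (mean_k2 - mean_k) / mean_k


# Row sums of the adjacency matrix printed earlier on this page.
degrees = [3, 3, 2, 2, 4, 2, 1, 1]
print(basic_reproduction_number(degrees, r=0.6))
```

For these degrees <k> = 2.25 and <k^2> = 6, so r = 0.6 sits right at the threshold R_0 = 1; any larger r would tip the network toward spreading.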

Why It's Interesting

Social networks are definitely applicable to the real world. Why? Because they model real-world activity! Think of the possibilities for social growth once we have these analytic tools at our disposal. The following examples illustrate the power of Social Network Analysis (SNA).

Food Webs

Google PageRank

Why does almost every first entry of a Google search contain a link to a Wikipedia article?

Well, it is related to SNA, and more specifically, it's a method that uses a variant of eigenvector centrality!

The idea behind PageRank is that the PageRank of a certain page i depends on the PageRank of the pages that link to it. They all depend on each other, just as in the case of eigenvector centrality.

As a reminder: the ith node's eigenvector centrality score corresponds to the ith entry of the eigenvector of the adjacency matrix associated with the greatest positive eigenvalue.

The key differences between the PageRank calculation and the regular eigenvector centrality calculation are:

  • The entries make a stochastic matrix, meaning that the entries are normalized so that the sum of the entries of any column j equals 1.
  • The i,j entry is defined as \frac{1}{L(p_{j})}, where L(p_{j}) is the number of outbound links from j, if j links to i, and 0 otherwise.
  • There is a damping factor, d, empirically estimated to be 0.85, that is the probability of a random surfer to keep surfing.
  • The PageRank value of a page is the chance that a random surfer will land on that page by clicking on a link.
    • This focus relies on probabilities rather than measurements.

The formula is as follows:

PR(p_{i})=\frac{1-d}{N}+d\sum_{p_{j}\in M(p_{i})}\frac{PR(p_{j})}{L(p_{j})}

where p_{1},...,p_{N} are the considered pages, M(p_{i}) is the set of pages that link to p_{i}, L(p_{j}) is the number of outbound links on page p_{j}, and N is the total number of pages.
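A rough sketch of the calculation, assuming the usual damped form PR(p_i) = (1-d)/N + d * sum over pages p_j linking to p_i of PR(p_j)/L(p_j), applied repeatedly from a uniform start. The four-page link structure is hypothetical, every page in it has at least one outbound link, and the function name and `iterations` parameter are illustrative.

```python
def pagerank(links, d=0.85, iterations=100):
    """Iterate the PageRank formula; links[j] lists the pages j links to.

    Assumes pages are numbered 0 .. N-1 and every page has at least one
    outbound link (no special handling of dead ends).
    """
    n = len(links)
    pr = [1.0 / n] * n  # start with every page equally likely
    for _ in range(iterations):
        new = [(1 - d) / n] * n  # the random-jump share
        for j, outs in links.items():
            for i in outs:
                new[i] += d * pr[j] / len(outs)  # j passes PR(j)/L(j) to i
        pr = new
    return pr


# Hypothetical four-page web: page 2 is linked to by pages 0, 1 and 3.
links = {0: [1, 2], 1: [2], 2: [0], 3: [2]}
ranks = pagerank(links)
print([round(value, 3) for value in ranks])
```

Page 3 has no inbound links, so it keeps only the (1-d)/N share, while page 2 collects rank from three pages and comes out on top.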


As hinted at throughout the page, Social Network Analysis can help prevent the spread of an epidemic. This makes sense because an epidemic spreads through nothing more than the interaction of humans, which we can analyze! The basic idea is to target the people who are the most central, in either degree, eigenvector, closeness or betweenness centrality (or a combination thereof).

The picture to the right portrays epidemics in a global sense:

Picture illustrating the spread of an epidemic. Yellow/orange areas indicate high levels of contagion

Graphically, there is a short window of time in which the epidemic can be stopped or mitigated. In a response graph for this example, the incidence of contagion is graphed against time. If we make two graphs on the same plot, one with a sample of central people and the other with a random sample, the plot would look like the picture below (with influenza as the virus):

Response graph for influenza illustrating the window of time necessary for prevention

With this information about time, we can define the window of time necessary to mitigate epidemics as the difference between the random graph (not to be confused with our discussion of Bernoulli random graphs) and the central friend graph.

These findings make intuitive sense. Clearly, the people who are most central to the network will have a high index of exposure and would thus catch our example virus, the flu, quicker than a random person would. We can use that fact to our advantage to know what time there is left and to take direct action to the people who are most central.
