Tuesday, July 29, 2014

Proof: It's summer.

It's really hot in Tennessee. 
I made this graph using data from the National Oceanic and Atmospheric Association (click here!) to reassure myself that and end was in sight. The summer temperatures don't look quite as high as I expected, but these are averages throughout the day. I couldn't find very extensive data on daily highs and lows. 
Notice how much variation there is in January weather! January has a mean temperature of 33.09*F with a standard deviation of 13*. July 1st, on the other hand, has an average temperature of 71.52* and a standard deviation of 4.8*. Yes, quantile plots were normal.

Monday, July 28, 2014

Stay in school.

Here are some choropleths I made that show the percent of people in Tennessee with only a high school degree or at least bachelor's degree. I used census data from the American Fact Finder and ggplot's map_data() to get the coordinates for each county. Dekalb County didn't have any census data. 

Friday, June 27, 2014

Shiny is always better.

After dabbling in Objective-C for a few days, I realized that app programming is pretty different from the database managing I do during the day. Thus, I was really excited learn about a thing called Shiny. Shiny is an app-development package in R. Woah! It seemed like a great way to accustom myself to app development without having to simultaneously learn a new language.

I decided to actually learn a few things and take the Developing Data Products course offered by Coursera. Shiny apps are a major component of the course, including the course project. Here's a link to mine: https://nancyorgan.shinyapps.io/nancys_app/  The Shiny servers have been having some issues keeping up with the deluge of Coursera users, so it if doesn't work then try back later.

Of course, this was an exercise in form over content. I've included the code in the comments if you're curious how to make a slider bar that changes the size of an image. More information on how to host your apps can be found at http://shiny.rstudio.com/

Monday, April 14, 2014

Randomly Generated Friends

Here's a little network plot I made to illustrate a few of the concepts I talked about in the previous post.  The entire network is made of three randomly-generated smaller networks that I connected with a few more randomly generated links. My code is included as a comment... but if you don't look that far, at least know that the igraph package has a lot of really neat options for network plotting.

The network as a whole has a modularity of 0.567. The smaller clusters have 0 modularity because everyone knows everyone else; they're very small world and have average path lengths that are exceedingly close to 1.

Finally, the figure illustrates an assortativity of 0.29. Positive assortativity (which can fall between -1 and 1) indicates that nodes tend to be connected to nodes with similar numbers of connections. That seems reasonable here because most nodes are connected to others in their group, but some aren't.

Happy networking!

Saturday, March 8, 2014

The Vikings: Social Networking since 1200 AD

Within reason, dropping academic vocabulary bombs can be a downright satisfying experience. Thanks to a fascinating paper from the December edition of the Royal Statistical Society, I am pleased to provide you with a tidy little nugget to detonate: Network Theory.

Padraig Mac Carron, a PhD student from Coventry University in London, and professor/colleague Ralph Kenna, had the intriguing idea to use statistical analysis to quantitatively study the characteristics of the social networks built into the Icelandic Sagas. They surmised that realistic social structure would be a strong indication that the sagas were at least loosely based on real people or real groups of people. While there isn't any reliable way to validate their work, the results of their investigation point to the sagas being realistic if not partially factual.

A few definitions are helpful to understand what they did:

Nodes: connection points, or in the case of social networks, people

Links: connections between points, or relationships

Degree: the number of relationships a node has

Degree distribution: the probability distribution that describes how many degrees a node is likely to have

Mean path length: the mean distance to connect any two nodes in a network

Clustering coefficient: the fraction of friend-of-a-friend relationships that are realized for a node

Small world: used to describe a network that has a short mean path length and a large clustering coefficient. 

Assortative: nodes or people with similar numbers of friends are also connected to each other. Interestingly, assortativity is quantified by the correlation coefficient, r, that falls between -1 and +1. Usually we see r as a way to describe how tightly correlated points are on, say, the XY plane. 

Disassortative: some nodes have many more links than other nodes

Modularity: if you didn't know anything about the network a priori, how many internal communities would it seem to contain? If the network is only one network, the modularity would be 0. If the community could be split up into many smaller, distantly linked communities, the modularity would approach 1. 

What, then, did the authors do with all of these ideas... and Vikings? They did what any level-headed statistician would do: compare the sagas to each other and to something known. In this case, they used modern society as the frame of reference.  Relative to each other, the sagas had distinct features that reflected the genre of the tales. When clumped together into meta-networks (not a word the authors employed), the 5 largest sagas showed strong overlap. The stories were conceivably written in reference to each other. When all 18 sagas were grouped into one giant network, the authors found that the meta-network was structurally similar to modern society.  The mean path length was an achievable 5.5, and the network was small world and assortative. Even more, the huge community had modularity of 0.7 and 9 communities, meaning that the overlap in the sagas would make it impossible to split them back into 18 discrete bunches.

But Nancy, what does it mean?

It means that you and I both now have a general understanding of Network Theory. And, it means that Viking sagas are more realistic than you ever knew. I found a cool packages in R to make network plots, so that will be my next post!

Sunday, January 26, 2014

Statistics in a box!

From the plotly website. *not my work*
I came across an interesting site recently that I thought was worth mentioning. The site is called plotly, and can be found here. It's a free online analysis and visualization application that flaunts easy usability and beautiful graphics.  While I wasn't entirely convinced that it was fantastically intuitive, the demo plots are beautiful.  I think I'd like it better if I used the R option... but then might as well just stay in R.

My main point, however, is that I think plotly does a great job at demonstrating how stunning visualization can be. Just browse through the website-- it's mesmerizing. Without getting too sappy, I love the idea that a display's aesthetic can be as cool as the information itself. Actually, it seems natural that that be the case.

I do, however, have one particular qualm with the site. It seems almost risky to make analysis and visualization so accessible and aesthetic-driven. Particularly if the interface is designed for non-statisticians, what are the risks of processing data according to what "looks right" as opposed to what is definitely (probably) right? Food for thought if anything...


Sunday, December 15, 2013


The irony of intensive problem solving is that the best answers seem the most obvious in retrospect. I'll use my week as a case study, employing a much smaller version of my task as an illustrative example. Essentially, I was given a matrix of categorical variables and a corresponding matrix of "populations" that are in each category per row. The task was place each population in a column designated for the given category. I was working in a 220 x 51, so my example is scaled down to 8 x 2.

The data started out looking like this:
2 23  0 NA  1 44  3 17  
4 62  3 91  0 NA  0 NA  

I needed it to look like this:
44      23       17    0
0       0        91    62

Take a look at the bolded numbers above. In the original matrix, the "1" category had a population of 44. The final result placed the 44 in the first entry of the row. See?

Of course, the first step was to properly sort the categories such that the population stuck with the right number. Thus, I pasted the category and population columns together, sorted, and then re-split the columns to restore the original dimensions. Next, I poured the data back into two matrices: one category matrix, and one population matrix. They looked like this:

Category matrix:
1   2   3   0 
3   4   0   0  

Population matrix:
44   23   17  0 
91   62   0   0 

Super. The problem, however, remained that the populations needed to be in the corresponding category column. Instead, all the values were squished to the left side of the matrix. The solution? Use each category's number as an index for where each population should go. I decided to use an intermediate step with 1's and 0's as placeholders so that I could monitor the process more easily without confusing any data. 

Intermediate step:
1  1  1  0
0  0  1  1

achieved by this R code:
poof = matrix(rep(0,prod(dim(population_matrix) 
#make an empty matrix to be filled with category-indexed population values

for (i in 1:dim(poof)[1]){
  poof[i,categories_matrix[i,]] = 1

#Put a 1 in the column indicated by the category number

I then made both matrices into vectors for simplicity's sake.
popvec = as.vector(t(population_matrix))
indvec= as.vector(t(poof))

Finally! The code below reads, "For each nth 1 in the categories vector, replace it with the nth nonzero population in the population vector."
indvec[which(indvec == 1)] = (popvec[popvec != 0]

Rebuild the matrix from the vector, and voilĂ . Donezo.