## New R offerings, R blog

January 29, 2009

I’ve been a huge fan of the open-source statistics package R for many years: large user base, lots of high-quality modeling functions, flexible graphics, etc. The learning curve is a little steep (especially for things like graphics and avoiding computationally costly loops in code) and there are some things R still doesn’t handle particularly gracefully (e.g., computations on very large data sets, making use of common multicore/SMP capability).

Anyway, I saw on some other blogs that a new company REvolution Computing is offering a version of R that has some tweaks worth looking at, including optimized math libraries which may speed up certain operations and allow for some degree of multithreading. There is also a non-free “Enterprise” edition that can finally take advantage of 64-bit Windows and makes using the “MP” in SMP systems a bit easier; of course, whether those changes make a difference in your analyses will depend on the kind of code you run. Still, I run into out of memory errors often enough that I think it would be a nice boost for my work.

Aside from the product, they also have a pretty decent blog, which is worth checking out.

## New book on team effectiveness in organizations, etc.

December 6, 2008

I just received my copy of Team Effectiveness in Complex Organizations, which includes a chapter on social network analysis I helped co-author (<- shameless plug). It’s a general overview of issues and concepts in social network analysis (SNA), with some examples of applications to the study of teams. It’s written for people who don’t know a lot about networks, so I think it’s not a bad place to start if you want a gentle introduction to some fundamental concepts.

There are a couple of things I like about the chapter, in no particular order: (1) it emphasizes the use of networks as a way of operationalizing and quantifying social context across multiple levels, (2) I think it does a decent job of reminding people there’s a lot more to social structure and network analysis than just centralization and brokerage, and (3) it gives some decent examples of more recent types of network analysis techniques, such as exponential random graph models (ERGMs, sometimes called p* models), that haven’t yet percolated into widespread use in psychology and management (although with the publication of articles introducing ERGM concepts, like Contractor, Wasserman, & Faust’s 2006 article in the Academy of Management Review, this is likely to change).

In fact, I think this last point is one of the most important – there are clearly a lot of theories in psychology and management about social, cognitive, and physical interdependency that involve multiple levels of analysis (individuals, dyads, triads, cliques, teams, organizations, etc.), but I’ve always been a little disappointed that most of the SNA articles I see published in journals like JAP and AMJ tend to focus on only a handful of concepts like centrality, and to a lesser extent, brokerage. It’s not surprising – these are simple, powerful concepts, after all – but as articles like Contractor et al. (2006) make clear, there’s a lot more out there.

Probably the biggest impediment to a more widespread interest in these models is accessibility, in terms of (1) ease of use and (2) interpretability. The first issue is being addressed by a number of researchers with statistics software such as the statnet package for R, the StOCNET suite, and PNet; these are making it much easier for researchers to test complicated structural models (StOCNET even provides for estimation of certain network dynamics in longitudinal networks – very cool stuff). Of course, “easier” is a relative term, and even though I am comfortable with all of these packages, I wouldn’t call them “easy” to use, even for an R and SAS wonk like myself. It definitely takes a lot of time to become comfortable with them, and for people who are interested in a given topic (say, emergent leadership relations – one of my personal areas of interest), it can be hard to justify taking the time to learn the models and the software, especially if you aren’t sure you will be using the technique frequently in the future.

The second issue is much more difficult. While it’s possible to take a network and have the software spit out a set of results, actually interpreting what the results mean can still be quite hard – if you have a significant level of reciprocity, great, but what does that 1.34 mean, exactly? And what does it mean when there is also a positive alternating k-star parameter? There is information out there to help you figure it out, but it’s not particularly easy to find unless you know the ERGM literature very well, and in the case of interpreting the more recently developed ERGM configurations (like alternating k-stars), there’s simply not that much.
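For the reciprocity example, one common reading treats an ERGM coefficient as a change in the conditional log-odds of a tie, holding the rest of the network fixed. The sketch below assumes that 1.34 is the coefficient on a mutual-tie (reciprocity) term, and it uses a hypothetical edges coefficient of -2.0 purely for illustration. Any alternating k-star contribution is treated as held fixed and left out, which is exactly the part that makes real interpretation harder.

```python
import math


def logistic(x):
    """Convert a log-odds value into a probability."""
    return 1.0 / (1.0 + math.exp(-x))


# Hypothetical baseline (edges) coefficient, chosen only for illustration.
EDGES_COEF = -2.0
# The reciprocity coefficient from the example in the text (assumed to be
# the coefficient on a mutual-tie term).
MUTUAL_COEF = 1.34

# Multiplicative change in the conditional odds of a tie i -> j when that
# tie would reciprocate an existing tie j -> i.
odds_ratio = math.exp(MUTUAL_COEF)

# Conditional tie probabilities, everything else in the network held fixed.
p_unreciprocated = logistic(EDGES_COEF)
p_reciprocating = logistic(EDGES_COEF + MUTUAL_COEF)

print(f"odds ratio for a reciprocating tie: {odds_ratio:.2f}")
print(f"P(tie | would not reciprocate): {p_unreciprocated:.3f}")
print(f"P(tie | would reciprocate):     {p_reciprocating:.3f}")
```

Under these assumptions, a tie that would reciprocate an existing tie has roughly 3.8 times the odds of one that would not, all else equal.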

The people who work to develop these types of models (my co-author and former advisor, Laura Koehly, is one such person) do an incredibly good job of helping to create statistical models that can be used to test a wide variety of important research questions. But speaking as a person whose research involves actually using them, I think that the average researcher interested in testing these models would be more than a little lost when faced with describing to a reviewer what a particular set of numbers actually means, and there’s not enough help out there for people without personal pipelines to the random graph gurus who develop these things. (I’ve been very lucky in that regard.) With the development of random graph models for cognitive social structures, multiple networks, longitudinal network data, and two-mode networks, things are going to get even more complicated before too long… but if you are a researcher thinking about using these sorts of models, don’t let that scare you! There’s a lot of power to be had from these sorts of models, and they are reasonably accessible nowadays.

## Valuing social networks

June 23, 2008

Although a good portion of my research revolves around social networks, I don’t usually pay much attention to the “social network” websites (Facebook, MySpace, etc.) – it’s not really something that overlaps with my interests, outside of the name. However, a recent post by Michael Arrington on the financial value of social network sites (which I found linked over at Andrew Gelman’s blog) got my attention.

Basically, Arrington suggests that the value of these services should be related to the size of the network and where the users live: networks with more users in higher-value markets (operationalized in terms of per-capita ad spend in the various markets) should be worth more than networks with users in lower-value markets.

I don’t know a lot about the business of advertising (as I’m sure this post shows), and this seems like a reasonable analysis, but I have to wonder how much these sorts of valuations would change if the actual structural properties of the network were taken into account. What’s the point of an ad? When an advertiser buys them, are they just trying to reach individuals? Or are they also trying to tap into cliques and subcommunities and reach other targets through word of mouth? If the latter is important, then there would probably be significant marginal value attached to the level/type of connectedness in the network: a sparse network wouldn’t be worth much more than a normal website with X number of users, but a densely connected network with high levels of inter-user interactions and tie formation might be worth more… depending on the nature of the advertising, I suppose.

Of course, for all I know, people do this sort of stuff already, or it might be stupid for a variety of reasons (off the top of my head, selling statistics about actual network connections – even anonymous ones – might raise privacy issues for some people). Still, I find it a little odd that for all the talk about the ‘power of networks’ and the constant news these social networking sites get, I never hear a discussion about tools or applications that really make use of the structural properties of the network.

## Social influence modeling using network autoregression

May 17, 2008

Well, I’ve been back from the conference for several weeks, but only just got time to come back to post. The conference was nice, though I wasn’t able to see all the presentations and papers I would have liked – par for the course, I guess. The conference organizers do a great job, all things considered, but somehow the topics I want to listen to get scheduled at about the same time, meaning I only get about half the utility out of these conferences that I’d like. I think I may have to shell out for the audio recordings they make of them.

Anyhow…

To follow up an earlier post, I thought people might like to see a bit more about how one can use network data (social networks, proximity data, etc.) to model social influence in groups. The basic question these models address is pretty straightforward: Given a set of actors and data on their ideas, attitudes, and opinions, do people that share certain types of ties tend to have more similar attributes/preferences/attitudes, etc.? As it turns out, there are actually several different ways we can go about addressing this basic question.

For example, we can do it using simulation-based methods (I’ve seen this more often in epidemiology and marketing research), where we take a network with certain structural characteristics, and plot how long it takes for ideas, behaviors, etc. to spread to certain percentages of nodes in the network, conditional on certain assumptions about the parameters of influence/infection and node “recovery” (see this previous post on SI/SIR/SIS-type influence models for a rough overview of some of these parameters.) Like most simulation-based methods, this is nice because it allows us to see the implications of different sorts of assumptions about the way influence happens. Thus, we can vary our assumptions, change the model, and see if we observe network dynamics that occur in real-world networks: for social influence models, this means we want our models to be able to predict things like convergence (and non-convergence) of opinions over time. Simulation-based techniques like this can be quite powerful, but make strong assumptions about the way that influence must occur, besides being hard to test for most typical hard-working social scientists. (Although that is slowly changing.)
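As a rough illustration of the simulation-based approach, here is a minimal sketch of an SI-type spread process. Everything in it is an assumption made for the example: the ring-shaped network, the per-step transmission probability, the single seed node, and the helper names (`simulate_si`, `ring_network`, `steps_to_reach`). There is no recovery step, so this is the simplest "SI" case rather than SIR or SIS.

```python
import random


def ring_network(n):
    """Hypothetical test network: each node is tied to its two ring neighbors."""
    return {i: [(i - 1) % n, (i + 1) % n] for i in range(n)}


def simulate_si(adjacency, seed_node, infect_prob, max_steps, rng):
    """Discrete-time SI spread.

    At each step, every adopter independently passes the idea to each
    non-adopting neighbor with probability infect_prob. Returns the number
    of adopters after each step (index 0 is the starting state).
    """
    adopted = {seed_node}
    history = [len(adopted)]
    for _ in range(max_steps):
        newly_adopted = set()
        for node in adopted:
            for neighbor in adjacency[node]:
                if neighbor not in adopted and rng.random() < infect_prob:
                    newly_adopted.add(neighbor)
        adopted |= newly_adopted
        history.append(len(adopted))
        if len(adopted) == len(adjacency):
            break  # everyone has adopted; nothing left to simulate
    return history


def steps_to_reach(history, n_nodes, fraction):
    """First step at which at least `fraction` of nodes have adopted, or None."""
    target = fraction * n_nodes
    for step, count in enumerate(history):
        if count >= target:
            return step
    return None


rng = random.Random(42)  # fixed seed so the run is repeatable
n = 30
network = ring_network(n)
history = simulate_si(network, seed_node=0, infect_prob=0.3, max_steps=500, rng=rng)
print("adopters per step:", history)
print("steps to reach half the network:", steps_to_reach(history, n, 0.5))
```

Swapping in a different network or a different `infect_prob` and comparing the resulting `history` curves is the basic move: vary the assumptions and see how the spread changes.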

Another way of modeling social influence is through statistical models that take into account the interdependence between units, and estimate the extent to which that interdependence results in more similar (or dissimilar) attitudes or opinions. Again, there are several ways of doing this, but one of the more widely used (at least, to my knowledge) models is the network effects autoregression model. I’ve sometimes seen this model and related models under slightly different names. For example, geostatistics researchers have names like “spatial regression model”, “spatial error model”, etc., but it comes down to the same thing – predicting values at a point based on the values of that point’s neighbors. Geostatistics people define neighbors using geographical contiguity (for example, neighboring counties), but there’s nothing stopping us from using alternative definitions of neighbors – for example, based on social networks.

The basic network effects model is given by the equation

$y = \rho Wy + \epsilon$,

where y is a vector of individual outcomes (for example, organizational attitudes, turnover, etc.), W is the n x n weight matrix which defines the way in which individual scores are believed to be interdependent, $\rho$ is the autoregressive parameter which defines the extent of social influence, given the matrix W, and $\epsilon$ is a vector of random errors. Similarly, we can define the mixed regressive-autoregressive model

$y = \beta X + \rho Wy + \epsilon$,

where $\beta$ and X are typical regression parameters and predictors, respectively. (Note that in this model, if there is no social influence effect, i.e., if $\rho = 0$, then the model reduces to a standard regression.) A third type of autoregression involves modeling social influence effects through interdependence in the error term $\tau$ associated with the regression of the dependent variable on the exogenous predictors X; this is sometimes referred to as the network disturbances model, because it provides for an estimate of the interdependence in individual deviations from their predicted score, based on other predictors:

$y = \beta X + \tau, \tau = \rho W\tau + \epsilon$.

We can also mix these models – for example, by creating a mixed regressive-autoregressive model where interdependencies exist both in the levels of the y-scores themselves (as in the network effects model) AND in the deviations from the expected y-score, thus:

$y = \beta X + \rho_{1} W_{1}y + \tau, \tau = \rho_{2} W_{2}\tau + \epsilon$, where $W_{1}$ and $W_{2}$ are different weight matrices.
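One way to see what these models imply is to solve each for y. The following is only a sketch, assuming that the relevant $I - \rho W$ matrix is invertible (I is the n x n identity matrix) and that the mixed regressive-autoregressive model carries an ordinary error term $\epsilon$. For the mixed regressive-autoregressive model,

$(I - \rho W)y = \beta X + \epsilon$, so $y = (I - \rho W)^{-1}(\beta X + \epsilon)$.

For the network disturbances model,

$(I - \rho W)\tau = \epsilon$, so $y = \beta X + (I - \rho W)^{-1}\epsilon$.

Written this way, the difference is easier to see: in the first case both the predictors and the errors are propagated through the network, while in the second only the errors are.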

The exact interpretation of the model depends on the way in which you define W; given a non-directional binary social network, for example, you may define the weight matrix to be the adjacency matrix, where the entries $w_{ij} = 1$ if i and j share a tie, otherwise $w_{ij} = 0$. In a network effects model using this definition, $\rho$ is the extent to which an individual y score is related to the sum of the y scores of that individual’s neighbors in the social network. The most common W matrix I am familiar with is a row-standardized matrix, where the standard adjacency matrix/sociomatrix (i.e., the matrix that defines the connections in your social network, usually binary, but possibly valued) is altered so that each row sums to 1. Using this definition of W, we are essentially estimating the relationship between an individual’s y score and the mean of their friends’ scores; this represents the assumption that influence upon an actor is evenly divided among an individual’s partners in the network. Obviously, the specific definition of W used makes a big difference in the way that the autocorrelation parameter $\rho$ is interpreted.
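To make the two definitions of W concrete, here is a small sketch using a made-up four-person network and made-up y scores. The helper names (`row_standardize`, `mat_vec`) are hypothetical, and giving isolates a row of zeros is an assumption of this example rather than a general rule.

```python
def row_standardize(adjacency):
    """Divide each row by its sum so that every non-empty row sums to 1.

    Rows with no ties (isolates) are left as all zeros; that is an
    assumption of this sketch, not the only possible convention.
    """
    weights = []
    for row in adjacency:
        total = sum(row)
        weights.append([value / total if total else 0.0 for value in row])
    return weights


def mat_vec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, vector)) for row in matrix]


# Made-up non-directional binary network: ties 0-1, 0-2, and 2-3.
A = [
    [0, 1, 1, 0],
    [1, 0, 0, 0],
    [1, 0, 0, 1],
    [0, 0, 1, 0],
]
y = [2.0, 4.0, 6.0, 8.0]  # made-up individual scores

# With W = A, the term Wy is the SUM of each person's neighbors' scores.
sum_of_neighbors = mat_vec(A, y)

# With a row-standardized W, Wy is the MEAN of each person's neighbors' scores.
W = row_standardize(A)
mean_of_neighbors = mat_vec(W, y)

print("sum of neighbors' scores: ", sum_of_neighbors)   # [10.0, 2.0, 10.0, 6.0]
print("mean of neighbors' scores:", mean_of_neighbors)  # [5.0, 2.0, 5.0, 6.0]
```

The same $\rho$ multiplies very different quantities in the two cases, which is why the choice of W changes its interpretation.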

The nice thing about these models is that they are flexible (for example, allowing for many different ways of defining actor interdependencies, like direct ties or shared social positions) and pretty easily estimated using maximum likelihood procedures in programs like R, or using MCMC estimation in software like WinBUGS/OpenBUGS. However, some of the statistical properties of these estimators are still a little uncertain; one of the research projects I’m currently finishing up explores what these properties are, and how they are likely to impact research into organizational networks. (Time permitting, I’ll post a draft of the paper – dissertation comes first, though.)

As with any regression-based/correlational method, the limitations are pretty obvious, of course – if you don’t have longitudinal data, you can’t really say for certain that you are dealing with a true “social influence” effect, or whether the apparent interdependencies in responses aren’t being driven by some third shared variable. Still, it’s a useful model for exploring possible influence effects in a network.

Finally, it’s worth pointing out (again) that this is only one way of estimating potential social influence effects in social networks; a variety of other procedures are available that provide slightly different ways of conceptualizing and estimating social influence effects, using both cross-sectional and longitudinal data. For example, if you wished to study social influence effects between husbands’ and wives’ purchasing decisions, and had groups of independent husband-wife dyads, you could use something like HLM pretty easily. If you had longitudinal data on both networks and actor attributes/attitudes, you could use something like Tom Snijders’s actor-oriented random graph models to separate out social influence and social selection effects across time. There are also the simulation-based techniques described at the top of the post, and even agent-based modeling for people who want to get really fancy in terms of modeling network dynamics.

## Working, drafts, and factors of bloggish non-production

March 24, 2008

Well, the SIOP conference is coming up in just a few weeks. I’m busy preparing, hence the lack of updates. This year, in addition to the usual poster sessions, I’m going to be on a panel to discuss my chapter on network analysis for an upcoming book on team research. I’m looking forward to it – this will be my first time on a panel discussion, and I’m eager to answer people’s questions.

One of the things I’ve noticed is that the “new resurgence” in network research has really taken hold in management, but it hasn’t become quite as widely used in industrial/organizational psychology research. I’m hoping that the book chapter will be able to convince a few more people why methods for social network analysis are worth learning – and for those I/O people who have run across network analysis before, why there’s more to life than just measuring individual centrality. We’ll see how it goes.

Oh, and I finally decided to bite the bullet and learn $\LaTeX$, so I can finally make all the beautiful formulas I want, like the basic normal distribution function: $P(x) = \frac{1}{\sigma \sqrt{2\pi}} e^{-\left( x - \mu \right)^2 / \left( 2\sigma^2 \right)}$.

## Video resource roundup

March 2, 2008

Okay, things being as busy as they are, there’s not much time for a long post now. Still, I recently ran across mentions of two video-based sites I thought some people might find interesting. VideoLectures is just what it sounds like – a site where people have made videos of various lectures available. As of right now, there’s not a lot of psychology material available for viewing (not a whole lot for the social sciences in general, unfortunately). Still, there are some lectures that are probably of wider interest there; for example, lectures on Markov Chain Monte Carlo methods by Christian Robert (videos linked here). There are also a decent number of lectures related to statistical modeling and graphical models which may interest a number of people. The biggest section is for computer science, but if you search, there are lectures on a pretty good number of things that psychologists and other social scientists might find useful, like the use of the stats program R for data mining.

Another place mentioned to me by several people is TED. It’s a collection of talks by people from all sorts of areas on all sorts of topics, only some of which pertain directly to science. It’s mainly “big picture” kind of talks, but there are certainly some big names, and it’s nice to be able to watch that kind of thing on demand….

As long as I’m covering “broad” not-really-science sites, a third site I watch all the time is C-SPAN’s BookTV. They archive most of their old programs, and you can find some great interviews and talks on all sorts of different topics, especially history, politics, and public policy. Not a lot on actual science, but lots of examples relevant to (and applications of) social science research. Definitely worth browsing if you find history and politics relevant to your work, or even just interesting.

## Random resource round-up

February 20, 2008

From past students in psychology and management (and some of my non-academic friends), I often got the question about where to go to find relevant articles. There are, of course, all the major repositories (Web of Science, PsycINFO, etc.), but I find that more and more I have been using alternative resources for identifying interesting articles, ideas, methods, etc. One of the big ones is Google Scholar (in fact, there have been some neat uses of GS recently, for example, Publish or Perish), which has been a great resource, and probably my single most-used research tool over the last year. Together with Google Books, I think it’s nearly doubled my monthly research-related book budget, something I can ill afford as a grad student. I’m thinking of sending Google my bill.

CiteSeer is another big source for citations and draft papers that’s been invaluable. It focuses more on topics like computer science and IS, but I’ve been surprised by the amount of information I’ve gotten that’s been useful to my own research in social network analysis – particularly when it comes to questions of statistical modeling and computation, but also for helping me find theories and points of view on certain topics that go outside the particular theoretical “sandboxes” in which I was raised. There is also the newer site BizSeer for academic business articles, but I have to admit I haven’t really tried it yet. For those types of articles, Google Scholar and library resources seem to work quite well. Still, the main reason I love CiteSeer is for the fast and easy websurfing from link to link; I go for one article, and half an hour later, I’m reading articles from entirely separate domains by authors I’ve never heard of on theories I didn’t know about. If BizSeer is as good at that particular aspect of searching as CiteSeer, I imagine I’ll be using it much more.

There are also two other smaller research portals I’ve run across, which have helped me to find some interesting new research, often in the form of harder-to-find (for me, anyway) research reports: Scientific Commons and the Social Science Research Network. I had heard of Scientific Commons before, but only recently started using it, and I was surprised by the number of articles I got back. SSRN seems harder to navigate, but has helped me browse for research reports that sometimes get lost in my usual Google searches, or which haven’t been indexed at CiteSeer.

Frankly, I’ve been pleasantly surprised by the amount of useful material I’ve run across in the kind of research reports that aren’t covered by the usual academic resources, and it’s just getting easier to find those kinds of articles as time goes on. I’m still not entirely certain how well-received the use of such citations would be in my area, however. I see it all the time in statistics-related articles, but hardly ever in the psychology and management articles I read, despite the large number of relevant research reports out there in those domains. I don’t know if that’s because finding the articles is hard, or because they’re somehow deemed to be “poor” citations for academic articles.