If you want to survey international Internet-based news, worldwide social media, New York Times text archives, and search engine trends, you’re going to need either superpowers or a supercomputer. Kalev Leetaru doesn’t have superpowers, but he does have Linux-powered supercomputers to do the data crunching for him.
With research support from the National Science Foundation and the help of TeraGrid resources on the Nautilus SGI UV supercomputer, Leetaru has been working on some super-powered research.
Leetaru analyzed 30 years of international news and social media and discovered that “global news tone” forecast recent revolutions. His results were published in a detailed yet surprisingly readable academic article, “Culturomics 2.0: Forecasting large-scale human behavior using global news media tone in time and space.” The research also accurately narrowed Osama Bin Laden’s hiding area to a 200-kilometer radius in Northern Pakistan. And he confirmed what a lot of us have long suspected about American news coverage: it really is U.S.-centric and becoming more negative.
Leetaru is the Assistant Director for Text and Digital Media Analytics at the Institute for Computing in the Humanities, Arts, and Social Science at the University of Illinois. He says that the initial work was 2.4 petabytes in size, which was too large to fit onto any current machine, so he had to look at it in smaller pieces. “Having 4TB of hardware cache-coherent shared memory is what makes that really feasible, as you have a massive intensity of random accesses all over that memory, which represent worst-case memory access patterns with no chance for local caching or predictive data movement across the machine, meaning you are really pushing the memory systems to their limit,” he says.
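To make the access pattern Leetaru describes a little more concrete, here is a minimal Python sketch. It is not his code. It builds a small random graph (the node count, degree, and every function name here are hypothetical, chosen only for illustration) and compares how far apart consecutive memory positions are when you scan an array in order versus when you hop from node to node along relationships.

```python
import random


def mean_jump(indices):
    """Average distance between consecutively accessed positions."""
    jumps = [abs(b - a) for a, b in zip(indices, indices[1:])]
    return sum(jumps) / len(jumps)


def build_random_graph(n, degree, rng):
    """Hypothetical toy graph: each node links to `degree` random nodes."""
    return [[rng.randrange(n) for _ in range(degree)] for _ in range(n)]


def random_walk(neighbors, start, steps, rng):
    """Follow relationships from node to node, recording the visit order."""
    order = [start]
    node = start
    for _ in range(steps):
        node = rng.choice(neighbors[node])
        order.append(node)
    return order


rng = random.Random(0)  # fixed seed so the run is repeatable
N = 100_000             # toy size; the real network had billions of nodes
graph = build_random_graph(N, 4, rng)

sequential = mean_jump(list(range(N)))                # in-order scan
walk = mean_jump(random_walk(graph, 0, N - 1, rng))   # relationship hops

print(f"sequential scan: mean jump = {sequential:.1f} positions")
print(f"graph walk:      mean jump = {walk:.1f} positions")
```

An in-order scan always moves one position at a time, which is the case caches and prefetchers handle well. The walk jumps across a large share of the array on nearly every step, which is the kind of pattern that leaves little for local caching to exploit.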
Leetaru says that only large hardware-based, cache-coherent shared memory systems like the SGI supercomputer are capable of allowing users to reach those memory levels while supporting those kinds of access patterns. The Nautilus, which is an SGI Altix UV 1000 system at the National Institute for Computational Sciences (NICS) Remote Data Analysis and Visualization Center (RDAV), was developed with the help of a US$10 million grant from the National Science Foundation back in 2009.
Since its 2010 launch, the young supercomputer, which has 1,024 processor cores and four terabytes of shared memory, has also crunched data to predict tornado formations. In July 2011, a partnership of 17 institutions announced a $121 million NSF-funded project called the Extreme Science and Engineering Discovery Environment (XSEDE), which will replace and expand the decade-old TeraGrid project. Computing resources involved in the new project include the Nautilus supercomputer.
The Nautilus runs SUSE Linux Enterprise Server 11, which Leetaru says he also runs on some of his personal machines, alongside Red Hat and CentOS on other machines and Fedora Core on his local servers. Leetaru says what is especially powerful about Linux is that you have one operating system for your desktop, your small local test and development servers, and massive supercomputing systems. “Since those SGIs run the same Linux that my desktop and small local server does, I can take the exact same codes that work on my desktop and scale them up to the supercomputing level without a single modification.” Leetaru adds, “I can take existing Linux applications and run them on the supercomputer and give them 1TB of memory instead of 1GB of memory, and they can handle massively larger amounts of data, but without a single modification needed.”
Leetaru says that the biggest issue he’s tackling right now is that his work requires massive shared memory. “The study actually examined an archive of 100 million news articles, and the initial work that led to the study was the creation of a network of over 10 billion people, places, things, and activities connected by over 100 trillion relationships,” he says.
Add all that together, and you get 2.4 petabytes of data. Leetaru looks for interesting patterns in the data and then reproduces those patterns using simpler, more traditional techniques that are easier for others to reproduce. “Thus, the actual methodology used in the final published study is quite simplistic on purpose, to make it easier for others to adopt those same approaches in their work, but the work is based in the world of petascale computing,” he says.
Because no systems have 2.4 petabytes of shared memory, Leetaru looks at only small portions of the data at a time. He compares it to shining a flashlight across a dark room, which doesn’t let him capture the broader, macro-level patterns in the data. “The SGI’s large hardware cache-coherent shared memory made it possible to work at very high scales,” he says, “but ultimately to really explore this data at full-res, you would need a machine with petabytes — or exabytes — of shared memory.”
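The flashlight comparison can be put into rough numbers. The short Python sketch below uses only the figures quoted in this article (2.4 petabytes of data, four terabytes of shared memory, 100 trillion relationships). It assumes decimal units, and the per-relationship figure assumes, purely for illustration, that the whole 2.4 petabytes is spread evenly across the relationships, which the article does not state.

```python
# Figures quoted in the article; decimal units are an assumption.
TB = 10**12
PB = 10**15

total_bytes = 2400 * TB          # 2.4 petabytes of data
shared_memory = 4 * TB           # Nautilus shared memory
relationships = 100 * 10**12     # 100 trillion relationships

# How many memory-sized "flashlight beams" it takes to cover the data once.
windows = total_bytes // shared_memory

# Share of the data visible at any one moment.
visible_fraction = shared_memory / total_bytes

# Illustrative only: assumes all bytes are spread evenly over relationships.
bytes_per_relationship = total_bytes // relationships

print(f"memory-sized windows needed: {windows}")             # 600
print(f"visible at once: {visible_fraction:.2%}")            # 0.17%
print(f"bytes per relationship: {bytes_per_relationship}")   # 24
```

Under those assumptions, the machine can hold roughly one six-hundredth of the data at a time, which is why macro-level patterns spanning the whole network are hard to see in a single pass.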
Leetaru’s article includes graphs illustrating the tone of coverage mentioning Egypt, for example, between January 1979 and January 2011. His animated GIFs, on the other hand, help readers visualize the data in two different ways: they highlight the tone of coverage, and they also illustrate the American-centric view of the New York Times coverage compared to the Summary of World Broadcasts (SWB) global news monitoring service. One animation shows global geocoded tone of all New York Times content from 1945 to 2011, and the other shows global geocoded tone of all Summary of World Broadcasts content from the same period.
Leetaru’s article explains, “The two maps are highly divergent, with the Times mentioning 19,785 distinct locations on Earth in 2005 to SWB’s 29,592… Most strikingly, however, the Times map shows a world revolving around the United States, with nearly every foreign location it covers being mentioned alongside a U.S. city, usually Washington D.C. Foreign coverage is very uneven, with Africa, Southern Asia, and Latin America being especially poorly represented, while Europe is well represented. SWB shows a far more balanced view of the world, with far better coverage of most geographic locales and no single set of countries dominating the connections.”
Leetaru says that what he covers in his article is just scratching the surface of where he’s heading with this work. He adds, “Current work is focusing on doing this forecasting in real-time and moving from the country level down to the city level you see in the animated maps, generating forecasts for every city, organization, etc., in the world, and integrating other data sources to really increasingly move towards peering deeper and deeper into the global media consciousness.”