Per Øyvind Karlsen has announced the availability of the first alpha release of Mandriva Linux 2012: “As many of you might already be aware of, our first Mandriva Linux 2012 alpha has been ready for release for almost a week now, yet it only made its way to….
Scientists celebrated a breakthrough in their understanding of the human genome this month – the results of a large collaborative project driven by big data and built with Linux.
On Sept. 5, Nature and two other scientific journals simultaneously published 30 papers with the results of the ENCODE (Encyclopedia of DNA Elements) Project. The 5-year project involved nearly 450 scientists from 30 institutions around the globe and produced a wealth of data on how and when genes are regulated.
Their discoveries will serve as the basis for further biological research and advances in medical care.
The project’s success also makes it a model for big data collaborations and scientific analysis, said Mark Gerstein, a bioinformatics and computer science professor at Yale University and a lead researcher on the ENCODE studies. The papers provide documentation of what’s possible with modern data technology and computational methods – not to mention the collaborative process. (For more in-depth analysis, see ENCODE Consortium coordinator Ewan Birney’s Nature article, Lessons for Big Data Projects.)
“The open source movement was a big inspiration for the genomics community,” Gerstein said. “The genomics world grew up with Linux.”
Built on Linux
Though ENCODE computing and data storage are scattered among the various institutions involved in the project, the Data Coordination Center at the University of California, Santa Cruz is the main repository for the results collected in the ENCODE project studies.
The center keeps roughly 50 Terabytes of nicely packaged and compressed data available for public download online, as well as 200 Terabytes of uncompressed raw data, said Jim Kent, a bioinformatics researcher at UCSC who runs the ENCODE project’s data center.
They use a computer cluster running CentOS and IBM’s GPFS (General Parallel File System) – an enterprise storage management system originally developed for large multimedia files that also works well for genomics files, he said. Bonus: It’s free for academic use.
“It’s proven very robust,” Kent said.
In addition to the storage systems, the lab has a compute cluster with 1,000 CPU cores and 256 machines. Key to their computing efficiency is the job scheduler, Parasol, developed in-house especially for running the same DNA sequencing program hundreds of thousands of times on the same data. It’s available for free in portable C code, Kent said.
“It has a lot of steps to be robust when nodes fail and it’s been quite useful,” he said.
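The pattern Kent describes, running one program over a very large batch of inputs and recovering when individual jobs fail, can be illustrated with a small sketch. This is not Parasol and does not use its interface; `run_batch`, `max_attempts` and `workers` are hypothetical names chosen for the example, and local threads stand in for cluster nodes.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def run_batch(job, inputs, max_attempts=3, workers=4):
    """Run job(item) for every item, retrying failed items.

    Returns (results, unfinished): a dict of item -> result, and the
    list of items that still failed after max_attempts rounds.
    Items must be hashable because they are used as dict keys.
    """
    results = {}
    pending = list(inputs)
    for _ in range(max_attempts):
        if not pending:
            break
        failed = []
        with ThreadPoolExecutor(max_workers=workers) as pool:
            futures = {pool.submit(job, item): item for item in pending}
            for future in as_completed(futures):
                item = futures[future]
                try:
                    results[item] = future.result()
                except Exception:
                    # Treat any exception as a failed job and queue a retry.
                    failed.append(item)
        pending = failed
    return results, pending
```

A real cluster scheduler also has to track which machines are healthy and reschedule work away from bad nodes; the sketch only shows the retry bookkeeping.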
Biology of ENCODE
ENCODE was made possible, in large part, by rapid advances in big data processing and DNA sequencing technology over the past five years. It also builds on work completed in 2001 by the Human Genome Project to sequence all 3 billion chemical base pairs in human DNA.
With the DNA sequence in hand, ENCODE set out to map the functions of all those bases. Or, put in programming terms, ENCODE was interested in the logic, not the straight-line code of the DNA, Kent said.
A decade ago it was thought that the main function of DNA was to code for proteins – the chemical dictators of cellular activity. But only about 1 to 1.5 percent of DNA is made up of genes that actually code for proteins, he said. The rest of the genome is either devoted to regulating those genes or it’s so-called junk DNA, evolutionary relics that don’t have a current function.
Researchers in the ENCODE project sorted through the DNA in short segments, called “reads,” of about 35 to 75 base pairs to find function and map its place in the genome. Sequencing machines produce millions of these reads at a time, creating a massive pile of data to comb through.
Read mapping was just the first of three steps in data analysis, but it required the most intensive compute power. The results were compiled into UCSC’s central repository.
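To make the idea of read mapping concrete, here is a toy sketch that finds where a short read occurs in a reference sequence. It is an illustration only, not the ENCODE pipeline: it handles exact matches on one strand, whereas production aligners use compressed indexes and tolerate mismatches. The function names and the seed length `k` are assumptions made for the example.

```python
def build_index(genome, k):
    """Map every k-length substring of the genome to its start positions."""
    index = {}
    for pos in range(len(genome) - k + 1):
        index.setdefault(genome[pos:pos + k], []).append(pos)
    return index


def map_read(read, genome, index, k):
    """Return the start positions where the read matches the genome exactly."""
    if len(read) < k:
        return []
    # Use the first k bases as a seed, then verify the full read at each hit.
    candidates = index.get(read[:k], [])
    return [p for p in candidates if genome[p:p + len(read)] == read]


genome = "ACGTACGTTAGC"
index = build_index(genome, 4)
print(map_read("ACGTT", genome, index, 4))  # prints [4]
```

Building the index once and reusing it for every read is what makes it feasible to place millions of reads; scanning the whole genome for each read would be far slower.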
“Before ENCODE we’d identified less than 1 percent of the regulatory regions,” Kent said, “and with ENCODE we’re close to 75 percent.”
The ENCODE database now serves as an annotated map of the genome. It’s a framework for future discoveries built on Linux and assembled through a large collaborative effort inspired by Linux and the open source community.
Editor’s Note: Mark Gerstein is a professed Linux fan. Check out his 2010 analysis comparing the evolution of the Linux call graph to that of the genome – what he calls the “operating system of a living organism.”
The delayed alpha build of Fedora 18 has been released: “The Fedora 18 ‘Spherical Cow’ alpha release is plumping up! This release offers a preview of some of the best free and open-source technology currently under development. Features: NetworkManager hotspots improve the ability to use a computer’s WiFi….
Businesses are subscribing to software, storage and computing power delivered over the Internet at a jaw-dropping pace. Over the next five years, global spending on cloud-computing services will increase at a pace five times greater than the growth of the information technology (IT) industry as a whole. To survive in this new landscape, technology makers will have to completely redefine their products, business models and cultures.
Instead of selling directly to the corporations that actually use computing services, hardware, software and infrastructure vendors will all need to pivot to serve the new cloud services market. That’s the lesson from the latest forecasts by market researcher IDC.
The difference is stark. IDC estimates companies will spend $100 billion on IT cloud services by 2016. That compares to $40 billion companies are expected to spend this year and represents a five-year, compound annual growth rate of more than 26%.
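The growth rate can be sanity-checked from the two spending figures with the standard compound annual growth rate formula. The sketch below assumes four year-over-year steps between this year and 2016 and uses the rounded $40 billion and $100 billion values from the text, so it lands slightly under the quoted rate; the function name `cagr` is chosen for the example.

```python
def cagr(start_value, end_value, periods):
    """Compound annual growth rate over the given number of yearly steps."""
    return (end_value / start_value) ** (1 / periods) - 1


# Rounded figures from the text, in billions of dollars.
rate = cagr(40, 100, 4)
print(f"{rate:.1%}")  # prints 25.7%
```

The small gap from the reported figure of more than 26% is consistent with IDC working from unrounded spending estimates.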
One of the stumbling blocks in migrating to the Linux desktop is the mistaken view that you can’t take it with you. Your data must remain captive to the Microsoft operating system. Not true at all. A related misconception that stalls many Windows users from adopting the Linux OS is the belief that when you buy a new computer or install Linux on an existing computer, you must give up one operating system for the other. Again, not true at all.
After 18 months of development, the latest version of the Xen hypervisor has been released, adding improved documentation and a new default tool collection. The performance limits of the software have also been increased across the board.
Personal computers score 80 out of 100 in a new ACSI study, even as Dell and HP see their shipments shrink. The trick, apparently, is to count tablets as PCs.
Presentation material from this year’s Linux Plumbers Conference (LPC) includes background information on Linux support for ARM cores, ACPI 5.0 and UEFI. There are also videos and PDF files on network and virtualisation technologies.
For those interested in Wayland, Qt, and 3D, there’s an interesting new Wayland compositor out in the wild. This compositor renders a 3D maze using Qt and brings in some Wolfenstein 3D elements while allowing Wayland surfaces to be rendered on the walls…