We are used to huge datasets pouring out of high-throughput genome centres, but with the advent of ultra-high-throughput sequencing, genotyping and other functional genomics techniques in every laboratory, we are facing a daunting new era of petabyte-scale data. For example, the 1000 Genomes Project will probably produce about 1 TB of finished data; to process that data, the project required about 100 TB of scratch disk. Working at this level, real technical limitations start to hamper progress. One has to consider storage: not just having enough of it, but making sure it is available to your compute over the network, and that you have sufficient I/O to do anything in real time. Software language and implementation become critical factors when dealing with terabytes of data. With such high-intensity computing, obtaining enough power and cooling becomes a real issue. How do you let anyone else access the data? Is the data backed up, and even if it is, how many years would it take to restore from tape?
So how will we overcome all these technical hurdles? Each can be solved with technical knowledge, but you do not want to have to worry about working within these constraints.
When working with large datasets, these constraints continually distract from getting real research done. Whilst one can choose to solve each of these individual problems, their cumulative impact on the scientific workflow can be considerable. It would be wiser to optimize for productivity.
In software development, similar constraints are addressed with abstraction layers: database access is mediated through object-relational mapping tools, and visualization is aided by powerful graphical packages, preventing individual research groups from having to reinvent the wheel. Examples include Rails, Eclipse, Processing, Hibernate and Catalyst.
Cloud computing offers a similar level of abstraction for many of the constraints encountered when dealing with extremely large datasets. You may have encountered similar ideas when using hosted services such as Google Mail or ManyEyes (http://manyeyes.alphaworks.ibm.com). These tools provide an example of what we would ideally like in a perfect world of bioinformatics: we do not have to worry about how the data is stored or about keeping the software up to date; it is all taken care of for us. First steps have been taken along these lines by companies such as Amazon, Google and Microsoft. Amazon has started to provide bioinformatics datasets, such as Ensembl and GenBank, among its publicly hosted datasets (http://aws.amazon.com/publicdatasets/).
A recent requirement to assemble a full human genome from 454 short-read data provided a good real-life example of these approaches. Aligning 140 million individual reads with SSAHA exceeded the available compute capacity in our own data centre, so the build was performed on Amazon's Elastic Compute Cloud (EC2). In an afternoon, a scalable, ad hoc cluster with queue management and replicated data storage was constructed with nothing more than a few web service calls and a valid credit card. No service contracts, no consultation with the vendor; just 100 nodes performing SSAHA alignments.
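To give a flavour of how little code such an ad hoc cluster requires, the sketch below partitions an alignment job across worker nodes and indicates, in a commented-out boto3 call, where the instances themselves would be requested. The helper function, the AMI identifier and the instance type are illustrative assumptions, not the actual build scripts used.

```python
# Sketch of partitioning an alignment job across an ad hoc cloud cluster.
# chunk_reads is an illustrative helper; the boto3 call below is commented
# out and its parameters (AMI, instance type) are placeholders.

def chunk_reads(total_reads: int, nodes: int) -> list[int]:
    """Split total_reads as evenly as possible across nodes."""
    base, extra = divmod(total_reads, nodes)
    return [base + (1 if i < extra else 0) for i in range(nodes)]

# 140 million reads across 100 worker nodes, as in the build described above.
chunks = chunk_reads(140_000_000, 100)
assert sum(chunks) == 140_000_000
print(chunks[0])  # reads assigned to the first node: 1400000

# Requesting the workers themselves is a single web-service call, e.g. with boto3:
# import boto3
# ec2 = boto3.resource("ec2")
# ec2.create_instances(
#     ImageId="ami-XXXXXXXX",    # placeholder AMI with SSAHA installed
#     InstanceType="c1.xlarge",  # placeholder instance type
#     MinCount=100, MaxCount=100)
```

Each node would then pull its share of reads from replicated storage and run SSAHA independently, with a queue manager tracking completed chunks.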
There are implications for large-scale data centres, which are currently engineered to provide peak capacity that often goes unused in idle periods. The elastic, pay-as-you-go nature of cloud services such as AWS means lower infrastructure overheads, as only in-use compute and storage is billable.
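The pay-as-you-go point can be made concrete with some back-of-the-envelope arithmetic. In the sketch below, the hourly rate, node count and utilization figure are invented purely for illustration; they are not actual AWS prices.

```python
# Hypothetical comparison of peak-provisioned vs pay-as-you-go compute cost.
# All figures (rate, node count, utilization) are illustrative, not real pricing.

HOURLY_RATE = 0.10      # hypothetical cost per node-hour
PEAK_NODES = 100        # capacity sized for the busiest period
HOURS_PER_MONTH = 720

def peak_provisioned_cost() -> float:
    # A local data centre effectively pays for peak capacity around the clock.
    return PEAK_NODES * HOURS_PER_MONTH * HOURLY_RATE

def pay_as_you_go_cost(node_hours_used: float) -> float:
    # In the cloud, only the node-hours actually consumed are billable.
    return node_hours_used * HOURLY_RATE

# Suppose the cluster only runs flat-out 20% of the time:
used = PEAK_NODES * HOURS_PER_MONTH * 0.2
print(peak_provisioned_cost())
print(pay_as_you_go_cost(used))
```

Under these assumed figures the elastic model bills a fifth of the peak-provisioned cost; the gap grows the burstier the workload.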
Cloud computing has green credentials too, so long as the off-site compute is located where renewable energy sources are used preferentially. Additionally, whilst unused compute in a local data centre may still require cooling and power, in the cloud it can be reused by others.
The transfer of large datasets can also be simplified with cloud approaches: rather than shipping the data for others to analyse, the compute can remain close to the data. Allowing others to access your compute infrastructure may be preferable to distributing large datasets.
Conflict of Interest: none declared.