Over at The Technium, Kevin Kelly recently posted about Zillionics.
Zillionics is a new realm, and our new home. The scale of so many moving parts require new tools, new mathematics, new mind shifts.
The post concerns the changing scale of modern information, and the need to rapidly increase the rate of capture and analysis. As Kevin notes, nowhere is this more true than in biology:
Zillionics is a realm much more at home in biology—where there have been zillions of genes and organisms for a long time—than in our recent manufactured world. Living systems know how to handle zillionics.
Whilst natural systems have long relied upon the interplay of massive networks at the cellular and sub-cellular level, the world of informatics has only really started to think in terms of peta and exa in the past 12 months.
To say we're playing catch up is an understatement.
A quick example, taken from recent experiences at Sanger. With the advent of new sequencing techniques from the Illumina crew (and others), more bases are being sequenced than ever, increasing from 3.5Gb per week to around 200Gb per week. The entire contents of GenBank, the repository of available sequence collected over the past 15 years, are starting to look small by comparison.
Talk about your paradigm shifts: clearly, for us as informaticians, the era of zillionics has arrived.
Big science, big challenges (big potential)
The potential for scientific insight from these vast amounts of data is clear, but the challenge in collecting, curating and adding value to information on this scale is formidable. As with many aspects of life sciences, the aspects of a new approach which hold the most potential also create the biggest challenges.
- Data flow
Whilst the vast scope of large scale data streams (such as those spooling from new sequencing technologies) holds value, pretty much any physical manipulation of such data is tough. Collecting, moving, mirroring, backing up and warehousing are tricky; providing reliable, reusable access to these repositories is where the scale of the problem really becomes apparent.
- Decentralisation
In two ways: decentralisation of data production, and decentralisation of data consumption. As the technology required to generate the data (be it genomic sequence, ambulatory monitoring or geotagging) becomes increasingly commoditised, so the importance of providing robust approaches to collecting and pooling the information increases. The logical corollary to this is that the number of parties looking to access this pooled data also increases.
The approach we take towards developing scientific software can have a great deal of impact in this new era. Somewhat paradoxically, this increase in scale and complexity calls for software that is significantly more flexible and lightweight. Less, in this respect, is definitely more.
- Less code
Keeping application design simple and code structure uncluttered leads to software that is more amenable to change. Fields experiencing this explosive growth really have to exist in the moment: solving the most important problems today is more important than planning for tomorrow, when the game may well have switched up again.
- Less coupling
Keeping software components and applications loosely coupled helps to keep everything agile. Swapping out an element that doesn't fit the bill whilst the information flow keeps ticking is essential. "We can't change A because of B" doesn't fly in the new world order.
- Less benchmarking
Finding bottlenecks in poorly performing code is important, but jumping into a problem performance-first is rarely a good move. This is doubly true when dealing with data that grows by orders of magnitude, or faster. Chances are that any performance metric is going to be meaningless within 6 months as the data builds, and the domain remodels: building software and infrastructure that is flexible to these growing needs is key.
- Lower barriers to entry
Providing access to the data housed by scientific software becomes critical - lightweight, simple interfaces help both humans and machines manage, analyze and curate vast datasets.
- Less downtime
Services that allow others to access and build upon existing data need to be reliable and available. Simple software is easier to maintain, migrate and scale.
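As a rough illustration of the "less code" and "less coupling" points above, here is a minimal Python sketch of a processing pipeline built from small, interchangeable stages. Everything in it is an assumption made for illustration: the names `run_pipeline`, `drop_short` and `to_upper`, and the toy sequence reads, are hypothetical and not taken from any real system at Sanger or elsewhere.

```python
from typing import Callable, Iterable, Iterator, List

# A "stage" is any callable that takes a stream of records and yields records.
# Stages know nothing about each other, only about this shared shape.
Stage = Callable[[Iterable[str]], Iterator[str]]


def drop_short(reads: Iterable[str], min_length: int = 4) -> Iterator[str]:
    """Hypothetical stage: discard reads shorter than min_length."""
    for read in reads:
        if len(read) >= min_length:
            yield read


def to_upper(reads: Iterable[str]) -> Iterator[str]:
    """Hypothetical stage: normalise reads to upper case."""
    for read in reads:
        yield read.upper()


def run_pipeline(source: Iterable[str], stages: List[Stage]) -> Iterable[str]:
    """Chain the stages lazily, so records flow through one at a time."""
    stream = source
    for stage in stages:
        stream = stage(stream)
    return stream


if __name__ == "__main__":
    reads = ["acgt", "ac", "ttgaca"]  # toy data, not real sequence
    for read in run_pipeline(reads, [drop_short, to_upper]):
        print(read)
```

Because each stage depends only on the shared stream shape, swapping one out means editing the list passed to `run_pipeline`, and nothing else has to change. The generators also process one record at a time, which matters when the source is too large to hold in memory.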
Expanding the network
Two of Kevin's other essays on this are also well worth a read:
