The second Barcamp to take place in Cambridge is fast approaching! All are welcome to attend, contribute and discuss topics at the interface of science and technology at Barcamb 2.
The timings are in place, the rooms are booked, the refreshments have been ordered and a few early talk suggestions have been posted to the wiki. Those planning to attend are welcome to add their own before the day.
As with last year, we'll kick things off with breakfast at the Genome Campus, before introducing ourselves with a classic Barcamp 3-word introduction. The schedule for the day will be put up on our trusty whiteboard: and away we go. One of the most exciting things about a Barcamp is the unexpected meetings and associations that spring up over the course of the day: fresh ideas, fresh perspectives, fresh approaches.
Very much looking forward to it.
More details on the wiki. Sign up for free at Upcoming. Follow barcamb on Twitter.
With tight deadlines, developing software around scientific domains can mean that these tools are overlooked, with the developer's eyes focused on achieving a particular task. This post isn't about these tools per se, but how and when to employ them for maximum impact.
Correct
- Test coverage: the proportion of exercised code is calculated
- Test driven design: tests are written before the modules they exercise
- Automated testing: automated build servers
- Functional and integration tests: controller, templates and integrated system tests
- User acceptance testing: working with users to evaluate usefulness
- Code randomization and limit tests: searching for uncaught edge cases
Right
It is easy to put off testing on hearing tall tales of 100% test coverage (required in some companies, unbelievably) or dramatic, comprehensive test facilities, for fear that writing tests is an all or nothing venture, and opting for nothing. In practice, writing tests should focus on quality in the right place:
- The right scope
Test only what needs testing: avoid testing methods already in place elsewhere.
- The right scale
Building far reaching helpers or complex mocking is typically avoidable for most modules.
- The right time
The earlier the better!
Review
Building a review period into a development timeline is a great way to evaluate current practices to highlight what's working, and what isn't. This is the place to think about ramping up a test framework to more advanced approaches as necessary, rather than never getting going because the scale of doing things 'correctly' is too vast.
"Agile" vs agile
A common misconception is that 'agile' development is a millstone: loaded down with jargon and heavyweight academic practices that fail to translate into the real world. However, by focussing less on doctrine and more on pragmatic use and review, modern tools and techniques can add tremendous value whilst reducing the risk and costs of scientific software projects.
If you missed my recent talk at RailsConf in Portland, I'm giving it again this evening at the London Ruby User Group.
Genomes on Rails
Monday the 9th of June, from 6:30pm to 8:00pm
London Ruby User Group
The Old Sessions House on Clerkenwell Green
I'll be talking about how we're using Ruby and Rails on the next generation sequencing platform at the Wellcome Trust Sanger Institute. To get the skinny, take a look at the quick interview I did with the RailsEnvy guys in Portland.
Matt Wood at Railsconf 2008 from Gregg Pollack on Vimeo.
Hopefully see you there.
Calling all scientists! I'm organising a Birds of a Feather session at this year's RailsConf.
All those using Rails in the scientific domain are invited to join a discussion of how Rails, Ruby and other Web 2.0 technologies are helping fuel scientific innovation. From analysis and information management, to communication and collaboration: what else can we do to move things forward in bioinformatics, genomics, astrophysics, etc?
Not sure on the where or the when yet: keep your eyes out for a poster on the BoF board on Level 1 of the conference. I'll also be posting updates from the conference here, and on Twitter.
Thanks to Siddhartha for the impetus. See you in Portland.
After a busy week, I'm preparing to head off to Portland, Oregon for this year's RailsConf. A few colleagues from the software teams at the Wellcome Trust Sanger Institute and I will be flying out on Wednesday, ready to get started on Thursday morning.
With the full line up published, here are some of the sessions that have piqued my interest.
- Faster, Better, ORM with DataMapper by Yehuda Katz
- "Design Patterns" in Ruby by Neal Ford
- Fast, Sexy, and Svelte: Our Kind of Rails Testing by Dan Manges and Zak Tamsen
- Small Things, Loosely Joined and Written Fast by Justin Gehtland
- Skynet - A Ruby Map/Reduce Framework by Adam Pisoni
- CRUD doesn't have an 'S' in it: Managing complex searching in Rails by Stephen Midgley
I'll be there speaking about how we're using Ruby and Rails as part of Sanger's new sequencing platform, too. If you're attending, feel free to drop by and say hello.
With an excellent line up, along with some great keynote speakers, this year's conference promises to be engaging and enlightening. Looking forward to it.
Yesterday I gave a talk to the Informatics group at the Wellcome Trust Sanger Institute that focused on using Cloud computing within a scientific domain.
The talk is less geared towards specific distributed implementations, but highlights how Cloud computing starts to become an essential tool when doing research with very large datasets. It was designed to start a conversation, which I hope it will.
Trying something a little bit new, I've also recorded a screencast of the talk, complete with a running commentary from yours truly. This was recorded offline, but gets most of the points across I think.
I'll be giving a talk at this week's Scientific Computing forum in Cambridge, all about Scrum and agile development in science.
The talk is part of a wider discussion on productivity tools, and will introduce the central tenets of agile development, along with how we use the Scrum methodology at the Wellcome Trust Sanger Institute.
Thursday 15th May, 2008, at the Centre for Mathematical Sciences on Wilberforce Road in Cambridge.
As software developers, a single question is obvious: why?
Excel is not fast, and it's not particularly easy to learn or use. It's labour intensive to get data in and out, major incompatibilities exist between versions and sharing up to date information is a pain. At first glance this seems to fly in the face of many of the central tenets of scientific software: openness, compatibility, ease of use, and real-time performance. The key is flexibility, which gives Excel the (almost unique) ability to keep pace with the rate of change in scientific fields.
- Instant feedback
Updates to data and their layout are realised immediately.
- Flexible modeling
Fields of a modeled domain can be manipulated in a few clicks. Entirely new attributes and "models" are quick and easy to add.
- No penalty to change
Making these changes is not only possible, it is positively encouraged, thanks to undo, versioning and autosaving.
- Extensible
Functions and macros allow additional value to be added to data quickly.
For those working on 'real' software, it might be easy to bite our thumbs at spreadsheets and their creators; instead we should be proclaiming our love for them, and keep their patterns for rapid change at the front of our minds.
Very few domains innovate as quickly as the fast moving fields within science. Whilst this innovation ebbs and flows from one discipline to the next, when it arrives, the software underpinning these fields (and the development process used to create them) needs to be as flexible as possible to keep up.
Taking an iterative approach to development can help keep software on the crest of this wave. However, development can often become stuttered as the problem domain is explored. Let's consider an example, a laboratory information management app, designed to capture and store data from a wet lab process. The development team set about building the functional specification:
- Database design: 2 months
- Develop framework for user interface display: 2 months
- Data capture module: 2 months
- Instrument automation module: 3 months
- Release!
Boom.
The hard work is undone. Despite their best intentions, the team realise that the schema and UI framework aren't a good fit for QC data. The system has been growing harder to work with for a while, and so they take this opportunity to work from a clean slate. They go for the nuclear option: The Rewrite.
Realigning vs rewriting
In cases where the code base is insufficiently advanced or when the code smells particularly bad, a rewrite may appear to be a good option. However, the cost is significant. Moreover, when the next Boom happens, that rewrite will be harder to swallow.
Instead of a formal specification, and scheduling long periods of development time which could be lost in the event of a Boom, highly innovative and dynamic fields require continual realignment and adjustment, not full scale rewrites. Instead, let's consider how the team of our laboratory application put together the next version:
- Meet with collaborators: 1 hour
- Prioritise and estimate work: 1 hour
- Starting with the highest priority, implement features and fixes: 10 days
- Release!
- Review progress. Rinse. Repeat.
The key to reaping the benefits of iterative development is to build regular reviews and opportunities for reflection right into the development cycle. The more often you pause to consider what's working for you and what's not, the more value you will add to the hot scientific fields.
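To put rough numbers on the difference between the two plans above, here is a small Ruby sketch. The figure of 21 working days per month and the method names are assumptions made for illustration, not numbers from a real project.

```ruby
# Rough comparison of the two plans described above.
# WORKING_DAYS_PER_MONTH is an assumption made for illustration only.
WORKING_DAYS_PER_MONTH = 21

# Up-front plan: every phase must finish before the first release.
upfront_phases_in_months = {
  'Database design'              => 2,
  'UI display framework'         => 2,
  'Data capture module'          => 2,
  'Instrument automation module' => 3
}

# Iterative plan: two one-hour meetings, then a ten-day block of work.
ITERATION_DAYS = 10

# Total working days before anything reaches the users.
def days_to_first_release(phases_in_months)
  phases_in_months.values.sum * WORKING_DAYS_PER_MONTH
end

# Whole iterations that fit in a given number of working days.
def releases_within(total_days, iteration_days)
  total_days / iteration_days # integer division
end

upfront_days = days_to_first_release(upfront_phases_in_months)
puts "Up-front plan: first release after #{upfront_days} working days"
puts "Iterative plan: first release after about #{ITERATION_DAYS} working days"
puts "Iterative releases in the same period: #{releases_within(upfront_days, ITERATION_DAYS)}"
```

Under these assumptions, each of those releases is a chance to spot a Boom early, rather than discovering it nine months in.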
A thousand dollars is largely seen as the goal of modern genomic sequencing. That's a full human genome sequenced for $1000, a task that cost $500 million only 5 years ago, as Duncan notes.
Using next gen sequencing approaches (such as Illumina, 454 and ABI's SOLiD), a genome currently costs around $100,000 to sequence. That said, the cost in producing, storing and analysing that sequence is much greater.
Given the orders-of-magnitude rate of change in this area every few years, we need to start getting ready for the $1000 genome now, with simple software development approaches that can keep up.
Over at The Technium, Kevin Kelly recently posted about Zillionics.
Zillionics is a new realm, and our new home. The scale of so many moving parts require new tools, new mathematics, new mind shifts.
The post relates to the changing scale of modern information, and the requirement of rapidly increasing the rate of capture and analysis. As Kevin notes, nowhere is this more true than in Biology:
Zillionics is a realm much more at home in biology—where there have been zillions of genes and organisms for a long time—than in our recent manufactured world. Living systems know how to handle zillionics.
Whilst natural systems have long relied upon the interplay of massive networks at the cellular and sub-cellular level, the world of informatics has only really started to think in terms of peta and exa in the past 12 months.
To say we're playing catch up is an understatement.
A quick example, taken from recent experiences at Sanger. With the advent of new sequencing techniques from the Illumina crew (and others), more bases are being sequenced than ever, increasing from 3.5Gb per week to around 200Gb per week. The entire contents of GenBank, the repository of available sequence collected over the past 15 years, is starting to look small by comparison.
Talk about your paradigm shifts: clearly, as informaticians, the era of zillionics has arrived.
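For a sense of scale, the jump quoted above can be worked through in a couple of lines of Ruby. This is only a back-of-the-envelope sketch: it reads Gb as gigabases, and the assumption that the new rate is sustained for a full 52 weeks is mine, not a measured figure.

```ruby
# Figures quoted in the post, in Gb per week.
OLD_GB_PER_WEEK = 3.5
NEW_GB_PER_WEEK = 200.0

# How many times larger the new rate is than the old one.
def fold_increase(before, after)
  after / before
end

# Output over a year, assuming (hypothetically) a constant weekly rate.
def yearly_output(per_week, weeks: 52)
  per_week * weeks
end

puts "Fold increase: #{fold_increase(OLD_GB_PER_WEEK, NEW_GB_PER_WEEK).round(1)}x"
puts "Sustained for a year: #{yearly_output(NEW_GB_PER_WEEK).round} Gb"
```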
Big science, big challenges (big potential)
The potential for scientific insight from these vast amounts of data is clear, but the challenge in collecting, curating and adding value to information on this scale is formidable. As with many aspects of life sciences, the aspects of a new approach which hold the most potential also create the biggest challenges.
- Data flow
Whilst the vast scope of large scale data streams (such as those spooling from new sequencing technologies) holds value, pretty much any physical manipulation of such data is tough. Collecting, moving, mirroring, backing up and warehousing are tricky; providing reliable, reusable access to these repositories is where the scale of the problem really becomes apparent.
- Decentralisation
In two ways: decentralisation of data production, and decentralisation of data consumption. As the technology required to generate the data (be it genomic sequence, ambulatory monitoring or geotagging) becomes increasingly commoditised, so the importance of providing robust approaches to collecting and pooling the information increases. The logical corollary to this is that the number of parties looking to access this pooled data also increases.
The approach we take towards developing scientific software can have a great deal of impact in this new era. Somewhat paradoxically, this increase in scale and complexity calls for software that is significantly more flexible and lightweight. Less, in this respect, is definitely more.
- Less code
Keeping application design simple and code structure uncluttered leads to software that is more amenable to change. Fields experiencing this explosive growth really have to exist in the moment: solving the most important problems today is more important than planning for tomorrow, when the game may well have switched up again.
- Less coupling
Keeping software components and applications loosely coupled helps to keep everything agile. Swapping out an element that doesn't fit the bill whilst the information flow keeps ticking is essential. "We can't change A because of B" doesn't fly in the new world order.
- Less benchmarking
Finding bottlenecks in poorly performing code is important, but jumping in to a problem performance first is rarely a good move. This is doubly true when dealing with data that grows in orders of magnitude, or faster. Chances are that any performance metric is going to be meaningless within 6 months as the data builds, and the domain remodels: building software and infrastructure that is flexible to these growing needs is key.
- Lower barriers to entry
Providing access to the data housed by scientific software becomes critical - lightweight, simple interfaces help both humans and machines manage, analyze and curate vast datasets.
- Less downtime
Services that allow others to access and build upon existing data need to be reliable and available. Simple software is easier to maintain, migrate and scale.
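The "less coupling" point in the list above can be sketched in a few lines of Ruby. All of the class names here (`MemoryStore`, `FileStore`, `ReadPipeline`) are hypothetical and exist only for this illustration; the idea is simply that the pipeline depends on a two-method interface, so the store behind it can be swapped without touching the pipeline.

```ruby
require 'tmpdir'

# Two interchangeable stores. Both respond to #save and #fetch,
# so the pipeline below never needs to know which one it was given.
class MemoryStore
  def initialize
    @records = {}
  end

  def save(id, sequence)
    @records[id] = sequence
  end

  def fetch(id)
    @records[id]
  end
end

class FileStore
  def initialize(directory)
    @directory = directory
  end

  def save(id, sequence)
    File.write(path_for(id), sequence)
  end

  def fetch(id)
    File.exist?(path_for(id)) ? File.read(path_for(id)) : nil
  end

  private

  def path_for(id)
    File.join(@directory, "#{id}.txt")
  end
end

# The pipeline depends only on the small #save / #fetch interface.
class ReadPipeline
  def initialize(store)
    @store = store
  end

  def record(id, sequence)
    @store.save(id, sequence.upcase)
  end

  def lookup(id)
    @store.fetch(id)
  end
end

# Start with an in-memory store...
pipeline = ReadPipeline.new(MemoryStore.new)
pipeline.record('read_1', 'acgt')
puts pipeline.lookup('read_1')

# ...then swap in a file-backed store without changing ReadPipeline.
Dir.mktmpdir do |dir|
  pipeline = ReadPipeline.new(FileStore.new(dir))
  pipeline.record('read_1', 'acgt')
  puts pipeline.lookup('read_1')
end
```

Neither store knows about the other, which is what keeps "we can't change A because of B" from creeping in.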
Expanding the network
Two of Kevin's other essays on this are also well worth a read:
mpb19270i:~ mw4$ history 1000 | awk '{a[$2]++}END{for(i in a)
{print a[i] " " i}}' | sort -rn | head
174 cap
83 cd
41 ls
40 svn
18 exit
15 ec2-describe-images
14 ./script/server
13 mate
9 top
9 ssh
Capistrano reassuringly high on my list. Mark started the fire, Textism prompted me: I tag Roger and Deepak to spread the word.
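For anyone who would rather do the tally in Ruby than awk, here is a rough equivalent. It assumes each history line looks like a number followed by the command, as in the output of `history`; the method names and the sample lines are made up for illustration, and ties are broken alphabetically, which may differ from the `sort -rn` ordering above.

```ruby
# Count how often each command appears in shell history output.
# Each input line is assumed to look like "  512  cap deploy":
# a history number, then the command and its arguments.
def command_counts(history_lines)
  counts = Hash.new(0)
  history_lines.each do |line|
    command = line.split[1] # second whitespace-separated field, like awk's $2
    counts[command] += 1 if command
  end
  counts
end

# Most frequent commands first; ties are broken alphabetically.
def top_commands(history_lines, limit = 10)
  command_counts(history_lines)
    .sort_by { |command, count| [-count, command] }
    .first(limit)
end

# Hypothetical sample input.
sample = [
  '  1  cap deploy',
  '  2  cd projects',
  '  3  cap deploy:migrations',
  '  4  ls -la'
]

top_commands(sample).each { |command, count| puts "#{count} #{command}" }
```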
Last year, about 30 scientists and tech folks got together on the Wellcome Trust Genome Campus at the first Barcamp in Cambridge. The whole day was a great success, so I think we should do it again.
One and all are invited to a day of talks, demos and discussions at Barcamb 2, on Friday 1st August, 2008 at the Wellcome Trust Genome Campus.
The Barcamp concept is simple: there is no predetermined schedule or invited speakers, only a collection of awesome people ready to participate and contribute. Last year, we had a wide range of talks and demos from some excellent speakers. I think everyone enjoyed the day last year: we had some great feedback, including a word or two from Tim O'Reilly based on Ian's excellent coverage. There are some photos up on Flickr, too.
For more details, check out the wiki, follow barcamb on Twitter, or sign up at Upcoming.org. See you in August!
The full schedule for this year's RailsConf has just been posted online.
Some highlights from this year's line up that I'll certainly be attending:
- Design Patterns in Ruby, Neal Ford (and indeed, all other talks from Thoughtworks people)
- UI Design on Rails, Ryan Singer
- Advanced RESTful Rails, Ben Scofield
- Small Things, Loosely Joined and Written Fast, Justin Gehtland
- Advanced Active Record Techniques, Chad Pytel (clash city with ActiveRecord Associations and the Proxy Pattern)
And, of course, pencil in 11.45 on Sunday 1st June for my talk, Genomes on Rails. In fact, forget the pencil, you can go right ahead and commit yourself in ink. : )
Simple checklists can be a great way to organise development tasks and prepare for automation.
But what makes up a good checklist?
- 1. Keep it simple: the more complex the list, the more effort it will require to follow and the less likely it is to stick.
- 2. No more than 5 items
- 3. Keep it flexible: bump less important items as necessary
- 4. Merge related items into automated, single list items
- 5. Share it: chances are that others can benefit too
The actual implementation of a solution in code is only one step toward solving a problem, and is part of a larger development pipeline.
- Design
Putting together a map of a particular problem, and planning out a possible solution. This can happen quickly (fixing an obvious edge case bug, for example), or require more time (schema design and wide architectural refactorings both springing to mind).
- Implementation
Putting fingers to keyboards to implement the plan.
- Quality control
Checking that the solution solves the original problem in an expected, reliable way.
- Sign off
An end point - the problem is solved and ready to be used in the trenches.
Classically, a developer will spend the majority of their time implementing, but exercising the other sides of the square should be encouraged. Simple checklists can provide a good guide to make sure the key pieces are put in place, making the next stage (and all forthcoming problems) easier.
A quick example. A ticket has been opened indicating that a column sort is failing. We've got a pretty good idea of where the problem lies and add a simple test to capture the errant behaviour. Sure enough, the test fails. We fix up the problem so that the new test passes. We're now ready to sign off our work - here is our checklist:
- 1. Run the full test suite
- 2. Commit the changes back to Subversion
- 3. Tag the code with the build or release number
- 4. Release the update
- 5. Post an update to the appropriate mailing list or blog
Compiling checklists also provides an excellent starting point when looking to automate development processes since the particularly repetitive and important bits have already been identified, prioritised and ordered.
After that, a simple rake task should take care of things, time after time.
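As a sketch of where that automation might start, the sign off checklist can be written down as ordered steps that stop at the first failure. This is plain Ruby rather than a real Rakefile, the `Checklist` class is hypothetical, and the step bodies are placeholders: the actual commands (test runner, Subversion, deployment) depend on the project and are assumptions here.

```ruby
# A checklist is an ordered list of named steps. Each step is a block
# that returns true on success; the run stops at the first failure.
class Checklist
  Step = Struct.new(:name, :action)

  def initialize
    @steps = []
  end

  # Add a step; returns self so calls can be chained.
  def step(name, &action)
    @steps << Step.new(name, action)
    self
  end

  # Returns the names of the steps that completed successfully.
  def run
    completed = []
    @steps.each do |item|
      break unless item.action.call
      completed << item.name
    end
    completed
  end

  def size
    @steps.size
  end
end

# Placeholder actions: a real version would shell out to the test suite,
# Subversion and the deployment tool instead of simply returning true.
sign_off = Checklist.new
sign_off.step('Run the full test suite') { true }
sign_off.step('Commit the changes back to Subversion') { true }
sign_off.step('Tag the code with the build or release number') { true }
sign_off.step('Release the update') { true }
sign_off.step('Post an update to the mailing list or blog') { true }

completed = sign_off.run
puts "#{completed.size} of #{sign_off.size} steps completed"
completed.each { |name| puts "  done: #{name}" }
```

Calling `sign_off.run` from inside a rake task would turn the whole list into a single command.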
A quick story, in the Tom-storms-in-to-the-room style.
Chris, Sarah and Oliver are part of a team working on a medical informatics application. Chris and Sarah are putting together an implementation of a simple diagnostic 'surgical sieve' - it's looking good, but query performance is proving to be a bottleneck. In the daily stand up meeting, Chris and Sarah might highlight this as a problem that's holding up progress; Oliver suggests a fix, which the three might discuss after the meeting, reporting back the next day.
Agile development removes road blocks
For example: Chris starts out by adding a unit test that exercises the necessary classes and calls the poorly performing code. During this time, Sarah branches the code in the source code repository: this setup allows rapid exploration whilst ensuring existing functionality is conserved. The two then pair together to remove the bottleneck, migrating the underlying domain model and existing logic as necessary. Once they're happy that all the tests still pass, they can merge their changes.
Pulling together the right approach and the right tools can remove a road block more quickly.
Scrum's iterative approach to development consists of four main sections:
- Project planning
Collecting stories from project owners, prioritisation and planning for the upcoming sprint. - Development sprint
The actual development work, directed by the prioritised backlog created in the project planning stage - The Demo
A demonstration of the completed stories provided by the development sprint - Feedback
A retrospective on the good, bad and straight up ugly parts of the most recent iteration.
Disparate projects require a different approach
Within the scientific software realm, however, things operate on a subtly different level. You are less likely to find yourself in a larger team focussed toward a single project (although, certainly, such projects exist), and much more likely to find a team working on a wide range of scientific problems.
This is not to say that the team wouldn't have a collection of libraries shared between projects, but that the projects themselves are geared towards very different users in very different environments.
Some examples, currently in play at Sanger:
- Automatic trace release, as required by our funding source, and used by project leaders
- Illumina sequencing tracking, used by faculty members to organise and track sequencing requests
- Laboratory information management, used by lab staff to direct experiments at the bench
- Repository for short read traces, used by all to query and recover short read traces
The Good
The solutions to problems encountered in one project can often seed an innovative change in direction elsewhere.
The Bad
Even short meetings use up valuable time. With larger teams (6+), this can become considerable, especially if the focus of the meeting is lost. A steady, guiding hand is essential to make sure that the meeting stays on track and adds value.
Editors have a role to play in development meetings as much as they do in other areas.
The Ugly
As recently posted, I've been concerned with promoting Flow recently. A regular, daily meeting is pretty much guaranteed to break the flow of work, no matter when it occurs during the day, but this interruption can be damaging to both the work, and the meeting.
Beforehand, developers are required to break their process and collect their thoughts for the meeting, swapping back to their previous task when the meeting concludes. Because the meeting is deliberately short, developers can find themselves in limbo: focused on neither their work nor the meeting, often forgoing the opportunity to collaborate in favour of getting back to their broken task.
Improving value
The often rapid reduction of 'real' roadblocks to progress, and the lack of overlap between projects on scientific software teams, leave the daily stand up meeting a little redundant. It is reduced to a simple status update, often adding little value towards progress.
An alternative approach could be a weekly 'State of the Union' discussion in which each team member takes a few minutes (no more than three) to present last week's work and outline the challenges for the next.
- A slightly longer, less frequent meeting removes interruptions and promotes Flow. First thing on a Monday morning is probably a good time to get started.
- Less swapping between tasks, since the meeting is less frequent
- A simple agenda (which can be the same each week, as happens at Apple) promotes focus
- Provides an opportunity to highlight new libraries, patterns or plugins which can be shared across the team.
- Limiting each team member's time is equivalent to limiting what questions are answered at the daily scrum meeting: it promotes brevity and prevents show stealing.
- Can form the basis of a regular team retrospective on the most recent sprint.
I've just heard that I'll be speaking at this year's RailsConf in Portland! I've been given 45 minutes to wax lyrical on how Ruby and Rails are helping us stay agile and keep up with the furious rate of change of the next generation sequencing platform at the Sanger Institute, during my session, Genomes on Rails.
Many thanks to the organisers of RailsConf, it's a real honour to share the same stage with the likes of Dan Benjamin, Ezra Zygmuntowicz, Chris Wanstrath and Ryan Singer.
It's also skillful, technical, and a lot of people don't get it.
Dave Thomas touched on this in his 2007 RailsConf keynote. With this in mind, it's not a stretch to think that great development, like great art, requires a certain mental state. This has been referred to as 'flow', and I think it breaks down into three steps:
- The Load
A mental 'load' of the state of a project and the problem to be solved; this can also include opening up a development environment, running tests, pulling down the latest updates.
- The Flow
The actual act of doing work: adding functionality, running down bugs, refactoring and testing where necessary.
- The Commit
Often accompanied by a version control commit, this is a commitment that the task is complete. Job done.
Most people who have worked with software have probably experienced these steps, and keeping them in mind when working within a software development team is key.
At best, breaking the Flow can bounce you right back up to the Load; at worst, you end up in a perpetual state of almost flow, neither achieving nor failing at any real rate.
What can be done to keep things fluid?
- Remove interruption, not communication
Communication within a team is essential, especially when working on a large project. Interruptions though, can break flow, and so should be removed where possible.
We're talking about non-essential tasks being thrust into a developer's psyche: the telephone calls, the 'hey-look-at-this' moments, fire fighting technical problems, unnecessary requests for updates, difficult deployments. The list goes on. Project leaders also need to be kept in check: in dev teams, MBWA (management by walking around) can be especially damaging.
- Solitude isn't the answer
Some solve this problem by separating developers into cubicles or offices. In my opinion, this does little to remove interruption, and only serves to create artificial boundaries at times when collaboration would be beneficial. A peaceful, distraction free workspace is essential, but solitude is not the answer.
A case in point: you see those guys with laptops on the train or at a noisy airport: you can spot the ones in Flow; they are practically glowing. The same is true of XP-style pair programming: two developers can tune in to Flow together, so long as they are not disturbed with non-dev tasks.
A case study: at Sanger, we hold our daily stand up meeting in the morning, at 9.30. We gather around a desk and talk over progress and problems for 10 minutes each morning, before going back to work. Although useful, we noticed that these meetings would slip later into the day, sometimes not happening at all. This could be put down to a lack of discipline on our part, but with Flow in mind, it became clear that even a short meeting was interrupting Flow. Rather than settling in to the task at hand, developers would be required to break from their current task and 'Load' for the meeting, 'Commit' after it, and then swap back to the task at hand.
This isn't an indictment of the Scrum approach: one of the benefits is that it provides space for reflection at the end of a sprint. This time should be used to highlight what's working and what's not working within the framework, and has given us room to start thinking about alternatives.
We're going to be trying a few different things in the coming months to try and combat these problems, and feed our experiences back on this blog.
The end point
In general, a productive developer is a happy developer, and creating an environment where flow can be easily achieved and sustained can go a long way to help that.
I'm packing up my laptop, cowboy hat and steak knives to head out to Austin for this year's SXSW Interactive festival tomorrow, which should be a blast.
SXSW is really a very unique experience: it's pretty much the only conference in which interactive designers, hardcore tech folk, and the ideas guys gather together to swap stories from the trenches. A unique mix of cutting edge tech talks, retrospectives and Next Big Things, I'm always surprised that it doesn't attract more scientists or informaticians; after all, interaction, scalability and innovation are our currency. Perhaps it's the lack of academic rigor (replaced with a very Web++ 'hey, this worked for me, want to try it'), or an attendance list (replaced with an online directory and an after party sponsored by Facebook), but I always return buzzing with new ideas and approaches, which is worth the entry price alone.
So - if anyone else is attending and fancies a taco (or a SXSW-Scrum-style-daily-stand-up?), drop me a line.
I'm part of the Production Software team here at the Wellcome Trust Sanger Institute, developing software to support the various genomic research and sequencing efforts in the Institute. A relatively small development team works on a wide range of projects, from sequencing tracking to medical re-sequencing. As with all scientific projects the requirements, policies and protocols of these projects change frequently - and the software development often struggles to keep up.
One of the biggest problems scientific software faces is a failure to deliver on expectations. Those steering the scientific aspects of a project often have a very clear picture of what they need the software to do in order to complete their research on time and on budget. They have a clear picture of the process and the data. However, this picture is often poorly interpreted by software developers, resulting in delivering software that fits poorly with the scientific process, and is inflexible when change inevitably arrives.
Essentially - we’re aiming to have our software in use more quickly, increasing the speed at which we receive feedback.
There are solutions to these problems. As we mentioned, some are technical (using tests, continuous integration, version control and the like), but as important are attempts to increase the interaction between developers and front-line scientists.
One approach, the one we’re going to try to start using here is called Scrum. Named after the daily meeting at its heart, Scrum is a method of managing rapidly changing software projects. It addresses a lot of the problems outlined above, and appears to be an excellent fit for the development of scientific software.
1. What is Scrum?
Scrum is a framework for managing software projects. It makes use of small teams to deliver software that addresses a collection of requirements. The software is delivered incrementally: there is no monolithic, formal functional specification, instead a set of ‘stories’ is drawn up by the project leader, with each one outlining a particular aspect that the software should address.
Each story is given an importance, which creates a ‘backlog’ of ranked items that need implementing to support the current scientific process.
The development team will tackle (no pun intended) these items one by one, in order of importance, during a development ‘sprint’ over a predetermined time period: usually two to four weeks.
Once the sprint backlog has been finalised, the team are encouraged to work uninterrupted on the stories for the period of the sprint (bug fixes and unplanned items are still permitted).
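The ranked backlog and sprint selection described above can be sketched in a few lines of Ruby. The story titles, the estimate field and the capacity figure are assumptions added for this illustration; Scrum itself doesn't prescribe this exact data structure.

```ruby
# A story has a title, an importance (higher means more important) and
# an estimate in days. The estimate field is an assumption for this sketch.
Story = Struct.new(:title, :importance, :estimate_days)

# Rank the backlog so the most important stories come first.
def ranked(backlog)
  backlog.sort_by { |story| -story.importance }
end

# Walk the ranked backlog from the top, taking stories until the next
# one no longer fits in the sprint.
def plan_sprint(backlog, capacity_days)
  remaining = capacity_days
  ranked(backlog).take_while do |story|
    fits = story.estimate_days <= remaining
    remaining -= story.estimate_days if fits
    fits
  end
end

# Hypothetical backlog for a sequencing tracking project.
backlog = [
  Story.new('Export sample sheet as CSV', 20, 3),
  Story.new('Track sequencing requests',  90, 5),
  Story.new('Record QC results per run',  70, 4)
]

plan_sprint(backlog, 10).each do |story|
  puts "#{story.importance}: #{story.title} (#{story.estimate_days} days)"
end
```

Stopping at the first story that doesn't fit, rather than skipping ahead to smaller ones, keeps the sprint strictly in order of importance; a team could reasonably choose otherwise.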
The team update their progress each day, at a regular, short, ‘stand up’ meeting. To keep a tight focus on these meetings, and prevent them from overrunning (meetings are toxic), each team member is only permitted to answer three questions:
- What did you do yesterday?
- What are you going to do today?
- Are there any road blocks to stop you doing that?
Additional discussion is encouraged, but takes place offline. Non-team members can attend this meeting if they wish, but are not permitted to contribute. Again - this aims to keep the meeting short and to the point.
This meeting occurs regularly for the duration of the sprint. Savvy teams can also monitor their progress more visually - but we’ll come back to that in a future posting.
At the end of the sprint session, the software is demonstrated to both the scientific and development teams, before a final stage in the process occurs: a retrospective. This is a meeting in which the development sprint is discussed: what worked, what didn’t, what was good, what was bad.
This is an essential part of the Scrum approach, as it provides an open forum in which the practical nature of the approach can be pruned (or in some cases, felled).
Once the retrospective is over, planning for the next sprint can start.
And the beat goes on.
2. How does it benefit scientific research?
The iterative approach is also more open to change, an essential property of scientific software. Because the sprints are usually relatively short, rapid feedback ensures that as the software develops, it stays in line with the current approaches used at the bench. With the ranked backlog, the most important features also get handled first, so something useful is delivered earlier than it would be with a traditionally specified product.
3. How does it benefit software development?
For many of the same reasons as it benefits the scientific team: the ranked list keeps the project on track, and makes sure that development time is focused, not wasted on unnecessary features.
Progress can be easily tracked and fed back to the project leaders, and the retrospective meetings provide a good sounding board for changes. Scrum also sits nicely alongside some of the Extreme Programming approaches.
4. How are you going to do this?
One advantage of Scrum is that it is not a ‘do or die’ approach - but more of a framework. It’s perfectly acceptable to cherry pick which aspects to use, or to drop ones which don’t work. We intend to stick close to the major principles of Scrum to start with. One of the advantages of regular retrospectives, of course, is that they provide space to discuss what does work, and what does not.
The first hurdle in using Scrum is to convey its advantages to other folks in the software development and scientific teams - we gave a simple talk to some of the project leaders we collaborate with: the slides are available here.
UPDATE: I've also collected some thoughts on how to go about building a product backlog.
I often come across interesting stuff on the web which is worth highlighting, and so today I'm adding a linked list to the site: Loosely Coupled.
There is a link up there in the top right of each page; those following along via RSS can grab the separate feed here: http://greenisgood.tumblr.com/rss
37 Signals recently blogged about the importance of having an 'editor' in the software development process:
It’s not about designing or writing or coding, it’s about trimming those weeds back before they ruin the lawn.
I think this is spot on, especially for scientific software. It's all too easy when developing an app to find an elegant solution to a problem, but to keep drilling down the 'what if' list in an attempt to cover every edge case. The result is often a dilution of that elegant solution: with more code, more potential bugs and a greater maintenance overhead.
Having a voice (even if it's your own), that is constantly requiring you to cut features, refactor and refine approaches is one of the best things that can happen to an app. The legendary Mythical Man Month talks about being willing to "throw the first one away" when it comes to early prototypes, but the same is true at all points in the development cycle. Cutting down on the features frees you up to perfect the core competency of your app.
It's not hard to imagine that this is exactly what happened with the iPhone, which has about as perfect an implementation as I've ever seen. Sure - there are some features which would be nice to have, but I am happy to sacrifice them in favour of the phone doing what it does extremely well. A small team, with a focused editor, has achieved a level of pixel perfection that we can all strive towards. From a recent article in Technology Review:
"That's why it's perfect," says [Robert] Brunner, "and the reason this is getting done is because Steve Jobs is saying, 'Do it.'"
Let an editor guide you.
Peer review is one of the cornerstones of the scientific method. Conferences, talks, journal clubs and publishing all provide a forum for research to be overseen and reviewed for quality, relevance and in some cases, sanity.
Peer review to ensure the longevity of informatics software is far less common, especially with code that is "only" used in-house. However,
it is no less relevant in a field which is rapidly moving and has high standards of quality as a prerequisite. A good approach to address
this is the code review, a process in which a fellow developer checks out a project and provides a second opinion. Quality, relevance and (in even more cases) sanity
are double checked.
The aim is relatively simple: to ensure the code is readable (in more than just formatting) and to highlight possible bugs and security concerns.
When to review
As important as code reviews are, they need not be large scale, formal occasions. Some advocate reviewing the development approach early in the design phase, whilst others prefer to address it later in the development cycle, before the code is integrated into a project. Rightly or wrongly, relatively discrete projects are common in scientific software, and so the
timing of a code review may not be obvious. The important step is to bake the review in to the development cycle (if you have one), and to ensure it happens early, but before deadlines are looming. This sweet spot is hard to define, but easy to find in practice: it's the point at which the software is starting to target its specified aim, but before it enters user testing or integration.
If your API or web resource is taking shape, and you've started to address the initial key aspects of functionality: that's a good time to take a step back and review what you've done so far.
How to review: introducing TODO
A good place to start a review is with a fresh checkout from the project's source code repository: get everything initially, set up the necessary configurations and run the test suite (more on that in a minute). Assuming that everything passes, take a look at the first test file and use that to jump in to the code.
Take a look at the code, and follow any method or object definitions back to the source. Attempt to follow the flow of the application, without delving in to any 3rd party frameworks or modules. Should you find an area of the code which could do with some attention, flag it with a simple comment. For example:
    # TODO: find_all has been deprecated, use find(:all) instead
    genes = Gene.find_all

Tagging your comments with TODO means that your review comments can be parsed out as necessary. Many IDEs (Eclipse and IDEA, for example) will identify these tags for you; for everyone else, there is always grep.
When the review is complete, check your comments back in, and let the developer know that there are some fresh things to look in to.
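For anyone without IDE support, parsing out the review comments might look something like the following. This is only a minimal sketch: the `find_todos` helper and the sample source are made up for illustration, and it assumes review comments use Ruby-style `#` comments.

```ruby
# Collect TODO review comments from a string of source code.
# Returns an array of [line_number, comment_text] pairs.
# (find_todos is a hypothetical helper, not part of any library.)
def find_todos(source)
  todos = []
  source.each_line.with_index(1) do |line, number|
    # Match a '#' comment that starts with the TODO tag
    match = line.match(/#\s*TODO:?\s*(.*)/)
    todos << [number, match[1].strip] if match
  end
  todos
end

# Sample input, borrowed from the review comment shown earlier
example = <<~SOURCE
  # TODO: find_all has been deprecated, use find(:all) instead
  genes = Gene.find_all
  puts genes.size
SOURCE

find_todos(example).each do |number, text|
  puts "line #{number}: #{text}"
end
```

Keeping the line number alongside each comment makes it easy for the reviewee to jump straight to the flagged code.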
A simple aim: decreasing complexity
Mike Swanson makes a good point: one of the central roles of the code review is to reduce code complexity with the aim of increasing maintainability. This leads to a reduction in hot spots for possible bugs (now and in the future), and reduces the overall burden of maintenance. In the same way that early prototypes help define a domain, code reviews provide a second pass in which to simplify and refine an approach.
How tests help code reviews
The presence of a test suite is important for the code review process to work smoothly. For the reviewer, tests provide a good explanation of how components interact and define a good roadmap for the review, especially for those new to a code base. As the reviewee addresses any specific points raised by the review, re-running the tests is a good way to ensure that the project's functionality is still intact.
Responsibility to the reviewee
Chances are that the original developer of the code under review has poured a reasonable amount of effort in to their project, and it's important to keep your review comments concise and constructive. If there are problems, provide a good explanation of why and make a reasonable attempt to supply an alternative. If you can reference that alternative to some additional reading, so much the better. Unless there have been severe architectural design missteps, avoid sweeping criticism and instead provide guidance on what to refactor and where. Pay particular attention to complex code, or code that doesn't reach any prescribed style guidelines.
An important rule is not to change the code yourself: simply add guidance, tagging it with TODO.
All in all: don't be a nuisance - aim to add value.
The talk is largely inspired by Carole Goble's BOSC keynote, The Seven Deadly Sins of Bioinformatics, and touches on the differences in publishing and reusing data found in YouTube and commonly used Bioinformatics tools.
We're planning on writing this up as a paper, but in the meantime, you can download the slides here. By the by, Slideshare is also an excellent example of the sort of data, code and service reuse that we could do with more of in scientific software.
Following on from my recent talk, I got to thinking about the key areas those involved in developing applications for the web should concentrate on to build truly outstanding apps.
Front and center
Clearly, the primary goal of any application (on the web, on your desktop, on your iPhone) is to solve a problem on behalf of its users. In this article, I'd like to take that as a given, and focus on the flip side: in an area of science in which collaboration is virtually guaranteed, what can developers do to make truly outstanding apps for their fellow developers?
Three things come to mind: maintenance, maintenance and maintenance. Picking up someone else's code is always a bit of a nightmare. Doubly so if that code is a 6000 line Perl script with no comments. Quadruply so if that code has fallen over, taking down the web site you've just had published. We've all been there, folks: ouch.
Reliable code that's easy to read, easy to update and easy to test makes for happy developers. Happy because they know where to start, happy because they know what to change, and happy that their changes will continue to work well. With that in mind, I present the Web Standards: three levels of achievement aimed at bringing maintainability and reliability to web app development, and keeping a smile on your team's face. Future generations will thank you too.
Bronze:
Silver:
- 80% test coverage
- 80% documentation coverage
- Integration testing (via something akin to WATIR or Selenium)
- Automated browser testing: two platforms
- Automated deployment
- AJAX load and update indicators
- Announcements for updates: via a blog, forum or mailing list
Gold:
- Over 90% test coverage
- Over 90% documentation coverage
- Automated browser testing: three platforms
- Semantic markup: microformats, RSS, RDF
- Accessible, with unobtrusive javascript
- Screen reader friendly
- Under continuous integration
None of these are technically difficult: a few minutes of diligence today can help save hours, days, months of anguish
tomorrow.
Most of the web apps I'm currently working on score a solid Bronze (this one being a notable exception: the templates don't validate): but I hope this will give myself and others something to push for.
Thanks to Richard for starting the medal meme.
I spoke about developing for the web, introducing an A to Z of web app techniques, technology and jargon. It starts at ajax, meanders through microformats and finishes up at the Zen Garden.
My thanks to Slideshare for picking up the slides and featuring them on their home page as part of their first birthday party. Zing!
Some good stuff in there: namespaced controllers, HTTP basic authentication and an extension of the respond_to block which now separates out the mime type from the renderer. This could mean some migration troubles for some (templates are now named index.html.erb, for example), but the additional flexibility of customising responses by type seems worth it. DHH shows a demo that added a custom response type for the iPhone, for example.
Moreover, David advocated a much more mature approach to Rails evangelism: after three years, the doors to Rails are now well and truly open. Many of the original stumbling blocks have been removed (script/server couldn't be easier, for example), which means that we should now be able to ease off on the schtick, and allow the framework to attract users on its own merits. Existing Rails developers also get to sit back and start enjoying the framework, instead of non-stop evangelism.
And speaking of peripherals...
Dave spoke about the art of software engineering (and the engineering of software art), and made a plea for developers to start building beautiful apps. Hear hear.
Personally, I've often felt that development work was not unlike sculpture: it requires a focus on tiny details whilst maintaining an appreciation of the thing as a whole. Dave Thomas compared it to various other artistic pursuits: poetry, writing, painting, focusing on what developers can learn. Modular design, rapid prototyping, giving the client what they need: there is nothing you can't throw at the metaphor.
One final point really stood out: that developers should sign our work as an artist would a painting: I'm definitely going to be promoting this on my return to the lab. Not only does this fit nicely in to the humane approach advocated by Kathy Sierra and Aza Raskin, but it also encourages developers to take ownership, nay, pride in their work. Fantastic.
- Spotted! David Black chatting to people in the very long registration queue.
- Spotted! Chad Fowler chatting to David Black and getting ready for his charity workshop on testing with Ruby and Rails
Let's consider a common situation in bioinformatics: you're handed a text file from a collaborator: annotation.txt. As usual, the file isn't in a common format, and you want to grab some data from it for inclusion in an upcoming paper. The solution is straightforward enough: write a script that parses the file and spits out the required data to the terminal, or another file.
Piece of cake. You put together a Perl script and get the data out by lunch time. Later that month, you need to get at some additional data in the text file, so you rework your script, export the data, and go about submitting your paper.
Some time passes.
The reviews of your paper eventually come back: Reviewer 1 loved it, but Reviewer 2 would like clarification on some results that rely on annotation.txt. You grab the script to parse the file one more time and it fails.
Why, where, who, what?
Three things are now certain: it was working; it's not now; you're going to have to fix it before replying to the editor.
Ouch.
So you go back through your script: to unfamiliar regular expressions that made sense at the time, echoing variables to remind you of what represents what, reworking things that look wrong, fixing bugs, reordering the code...
STOP. Step away from the keyboard. This is lunacy.
You're experiencing a common problem: I call it The Fall Back. Instead of writing a response to the editor with new supporting data, you've fallen backwards in time, direction, concentration and motivation, in to resurrecting a trivial script.
In the best case, the fix will be obvious, at worst, you won't find the problem and a rewritten script will produce different results. This has caught me out (along with many talented colleagues) in the past. The solution helps you to step forward, and to keep moving forward without retreading old ground. With only a few simple changes, we can avoid The Fall altogether, write less code, and return our focus to innovation and problem solving.
Stepping forwards
Ideally we'd like to make sure that we never fall backwards. It's disheartening, for a start. Moreover, it breaks up the natural flow of our ideas, and puts us under even greater time pressures: with the speed at which scientific research moves, this can be catastrophic.
When developing software to support research, we need to make sure that that software is as flexible and maintainable as possible, rather than the rate determining step.
Where to start
A simple set of guidelines can help us step forward.
- Pillar 1: Manage your source code
Version control helps keep code up to date, in sync and available. It acts rather like a librarian: checking in new and updated source code, and checking out entire projects on demand.
When code is placed under version control, it is possible to monitor and track changes, allowing you to roll back to previous versions if necessary, and identify key changes in the code base. If you're planning on sharing your code with others, all potential users can grab a copy of your code from the repository, each of which can be kept up to date with your central changes. The code base can also be modified by multiple people at the same time.
- Pillar 2: Test, test, test
Without doubt the best way to avoid the Fall Back is through unit testing. In short, in addition to the script that does the heavy lifting, we also write another program, which tests that the heavy lifting is done correctly. This may sound like extra work, but in practice it soon becomes second nature.
- Pillar 3: Automate repetitive tasks
Repetitive tasks are simply a pain when they have to be performed manually. Common tasks like making sure a file is in the correct directory, or bundling up an application for deployment can be simply streamlined and performed with a single command. This ensures that they are done right each and every time, without the worry that a vital step could be forgotten.
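To make Pillar 2 a little more concrete, here is a rough sketch of a parser paired with a test. Everything in it is an assumption made for illustration: we don't know what annotation.txt really looks like, so the tab-separated layout, the `parse_annotation` name and the sample records are all hypothetical. A real project would more likely lean on a test framework than on hand-rolled checks.

```ruby
# Hypothetical format for annotation.txt: one record per line,
# "gene_id<TAB>description". Blank lines and '#' comments are skipped.
def parse_annotation(text)
  records = {}
  text.each_line do |line|
    line = line.chomp
    next if line.strip.empty? || line.start_with?("#")

    gene_id, description = line.split("\t", 2)
    # Fail loudly on unexpected input rather than silently skipping it
    raise ArgumentError, "malformed line: #{line.inspect}" if description.nil?

    records[gene_id] = description
  end
  records
end

# The test: a known input with a known answer. If a later change to the
# parser alters its behaviour, this raises immediately.
def test_parse_annotation
  sample = "# header comment\nBRCA2\tDNA repair\n\nTP53\ttumour suppressor\n"
  expected = { "BRCA2" => "DNA repair", "TP53" => "tumour suppressor" }
  actual = parse_annotation(sample)
  raise "expected #{expected.inspect}, got #{actual.inspect}" unless actual == expected
end

test_parse_annotation
puts "parse_annotation: all checks passed"
```

When Reviewer 2 comes calling months later, re-running the test is a quick way to find out whether the script still does what it did at submission time.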
Green is Good: keeping The Melv smiling since 2007.
- Marcel Molina (noradio) spotted eating breakfast in the Maritim Hotel, Friedrichstraße, possibly with Nicholas Seckar (ulysses).
I'm particularly interested to see what the Joyent CEO has to say about scaling Rails. At the Sanger Institute, we're currently scaling up our sequencing pipeline to support new sequencing instruments from Illumina and 454. These things are real data behemoths - ensuring 24/7 high throughput at the multi-petabyte level is a major challenge.
More on the Core Team →
This month's focus is the semantic web, with talks from Peter Corbett (who gave an excellent demo of his recent work at BarCamb), and what promises to be a lively discussion on RDF hosted by fellow Genome Campus denizen Renato Golin. I'm also on the roster, giving an introduction to microformats in biology.
It feels to me that the SciComp@Cam group is starting to pick up some momentum, highlighting the healthy state of computational science in and around Cambridge. There are 235 registered members to date: if you work in the field in the area and are not yet signed up, feel free to join us.
If you, like me, find the potent cocktail of semantic markup, wine and nibbles nigh on irresistible, I very much look forward to meeting you there.
This month's event kicks off at 5.15pm at the Center for Mathematical Sciences on Wilberforce Road.
Bugs thrive on the same human brain deficiencies that earn magicians their living. We are shown something that is apparently impossible -- but the reality is that we just don't have all the information.
Totally true: and like magic tricks, when you know what the problem is, bugs seem incredibly obvious.
An aside: Much like Green is Good, Steven also recently swapped out Moveable Type for a custom blogging engine more to his suiting.
Ian did a great job of summarizing the talks, which I found to be a fascinating list, because of the focus on science and hardware hacking, two areas that (as you know if you read Make), are very much on our radar as an alpha geek birthing ground for the next generation of disruptive technologies.
Great to see such positive feedback. I thoroughly enjoyed the day, and now that this blog is back up and running, I should really post something about it myself.
If anyone else is going, give me a shout and maybe we can grab some bratwurst before the event.
Scrum is an iterative approach to collaboration and software development, but you’ve got to start somewhere. In this post, we outline how to create a list of new features or improvements - a product backlog in Scrum terminology.
The Scrum framework all hangs on the product backlog - a list of features drawn up by the person (or people) who will eventually put the developed software in to use. This article guides you through how to create a product backlog in Google Documents.
There are numerous ways of storing a product backlog (from pen and paper to dedicated full blown applications), but let’s start simple. An online spreadsheet is available to all with a browser, and can be viewed and edited by multiple people at once. We also get version control for free. So let’s get started.
Backlogs explained
In The Scrum Book, a product backlog is defined as:
an evolving, prioritised view of business and technical functionality that needs to be developed into a system
A product backlog pulls together an ordered list of requirements which are ‘queued’ to get rolled into the software in the future. The list can be reordered and edited as necessary, which ensures that the rapidly changing nature of research can be easily mapped to the backlog. Before each development cycle, the most important of these items are selected and implemented within a relatively short time frame.
Once the product backlog has been built, a project leader can meet with the development team to decide which items will be included in the next development cycle (often referred to as a sprint).
What makes up a product backlog?
In short - stories.
Stories are short vignettes describing how the project leader would like the software to operate. They are non-technical, and written from the point of view of the project. For example, if we are thinking of a patient management system, the following stories would be likely:
All patient records should be unavailable until the user’s hospital staff number and ward code have been verified.
Once verified, the staff member should only be able to edit their own patient’s records.
All staff should be able to look up drug names in an online pharmacopedia.
In addition to these stories, the development team may add some additional ‘tech stories’, which are usually more along the lines of setting up servers, or stress testing systems.
Once the stories have been written, they should be ranked in order of importance by scoring each story with an ‘importance’ value. These values don’t need to be sequential, or even within the same order of magnitude. Indeed, it may help if they are not when planning what should be included in the next development sprint. Ideally, all the important stories should have unique values.
Looking again at our patient management example:
All patient records should be unavailable until the user’s hospital staff number and ward code have been verified. Importance: 1000
Once verified, the staff member should only be able to edit their own patient’s records. Importance: 500
All staff should be able to look up drug names in an online pharmacopedia. Importance: 10
With a few more stories, we’ll have enough information to start building a prioritised short list for the sprint. Ideally, product backlogs should contain as many items as necessary - don’t stop at the number you think the developers will be able to handle within the time frame of the sprint.
What happens next
Once the project leader is happy with the backlog, they and the development team meet to create a sprint backlog of items to be worked on (in order of priority) by the developers.
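As a rough illustration of that selection step, the ranking can be sketched in a few lines of Ruby. The `Story` struct, the `plan_sprint` helper and the idea of a fixed `capacity` are all assumptions made for this example: in practice the sprint backlog is negotiated in the meeting, not computed.

```ruby
# A story as it appears in the backlog: a name and an importance score.
# (Story and plan_sprint are hypothetical names used for illustration.)
Story = Struct.new(:name, :importance)

# Pick the most important stories for the next sprint.
# `capacity` is a simplifying assumption: the number of stories the
# team agrees it can take on.
def plan_sprint(backlog, capacity)
  backlog.sort_by { |story| -story.importance }.first(capacity)
end

# The patient management stories, in no particular order
backlog = [
  Story.new("Drug name look up", 10),
  Story.new("Staff verification", 1000),
  Story.new("Read only access", 500),
]

plan_sprint(backlog, 2).each do |story|
  puts "#{story.importance}\t#{story.name}"
end
```

Because the importance values are unique, the ordering is unambiguous, which is one reason to avoid giving two important stories the same score.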
During a sprint
Although the product backlog can be updated at any point, the items marked for development during the next sprint can not be altered once the sprint has started. This is to ensure that the developers can work unimpeded by changing scope, altered priorities and moving targets. Newly important items on the product backlog can be scheduled for the next sprint, but cannot be inserted into one that has already started.
The Spreadsheet
As we mentioned earlier, an online spreadsheet is a good place to start putting a backlog together. We’ll use the Google Documents spreadsheet. You’ll need a Google account to sign in and get started.
Once logged in, create a new Spreadsheet by clicking on New → Spreadsheet. An empty spreadsheet will open in a new window.
Go ahead and create three columns, adjusting the columns as necessary.
| Name | Description | Importance |
Save the spreadsheet by going to File → Save, and then start adding some stories!
| Name | Description | Importance |
| Staff verification | All patient records should be unavailable until the user’s hospital staff number and ward code have been verified. | 1000 |
| Read only access | Once verified, the staff member should only be able to edit their own patient’s records. | 500 |
| Drug name look up | All staff should be able to look up drug names in an online pharmacopedia | 10 |
Continue until you have the makings of a fine backlog. Should others want to contribute, you can share the spreadsheet by clicking on the Share tab.
Finally
The product backlog is the starting point of the Scrum process, but can be kept up to date with changing needs as they occur. The next step is to meet with the development team and plan the sprint, which is the topic of another post.
To make Scrum work, it requires buy-in from the various project leaders a software development team might work with.
I recently gave a small lunch time talk here at the Wellcome Trust Sanger Institute to project leaders who work with the production software team to explain the benefits of the Scrum approach, and what it meant for them going forward. My slides are here.
The talk went well - with all the project leaders present agreeing to give it a try. One of Scrum’s major advantages when seeking support (over more general ‘agile’ approaches, even) is that every aspect can be time boxed. If Scrum fails for some reason, the relatively short development sprint means that the actual cost of the failure is low.
Tom Peters may disagree with me, but when developing mission critical software, reduced risk is a good thing.
Update: fixed the broken slideshare viewer.
Let’s talk about science and software.
Modern scientific research is innovative, fast moving and fun. It favours the quick exploration of ideas, intuitive collaborations and practical, reusable techniques. We believe that software should promote these features too, and this blog aims to introduce, discuss and report on techniques to help make scientific software better.
The authors of this blog are part of the sequencing informatics teams at the Wellcome Trust Sanger Institute. We aim to deliver high quality software tools to support the various genomic projects taking place at the Institute and with our collaborators worldwide. Through posts on this blog, we hope to introduce modern software development concepts which we think are essential to modern scientific software design and development.
The speed of these innovations, collaborations and explorations is increasing nearly as quickly as information is flowing from such projects. We think that scalable, reliable software can lead the way in promoting these light speed changes.
We hope you can join us.

