Agile Methodologies for Big Data Projects


We're big fans of Agile methodologies at D2D CRC. They're a great fit for R&D projects where the goals are aspirational and fluid. We really benefit from Agile's ability to respond to change as we improve our understanding of the problem and refine our requirements.

Using Agile for Big Data projects can present some unique challenges, particularly the overhead of deploying and maintaining distributed systems and the difficulty of changing direction after significant technology investment. This blog post describes some of these challenges and the lessons learnt from using Agile on Big Data projects at D2D CRC.

Agile at D2D CRC

While Agile is used consistently at D2D CRC, the exact methodology varies, with empowered teams selecting the methodology that best suits their project. Larger teams generally use Scrum, while smaller teams tend towards Kanban and Lean. All projects, however, have a Big Data and/or advanced analytic element, bringing technical complexity to every project. To add to this complexity, most projects receive contributions from distributed research streams based in many Australian universities.

At a high level, a typical Agile process looks like this:

  1. Capture a backlog of tasks as Epics and User Stories.
  2. Do a small burst of work, completing a subset of tasks in the backlog.
  3. Demonstrate the new features to users.
  4. Refine based on what we've learnt.
  5. Repeat steps 2-4.

While we're doing this we work together closely as a team, and work closely with our end-users. The refinements can be adjustments to the backlog based on user feedback, or to our development process to improve our ability to deliver features.

We use automation to improve our release tempo, keeping the code in a releasable state by using Jenkins for continuous integration, enabling reliable and automated releases with Ansible, and monitoring our production infrastructure for failures using Consul. These tools give us the confidence to deploy software releases often, at least every Sprint, if not more frequently.

The Agile Sweet Spot

While Agile is popular, it is not a good fit for every project. We like to think of there being an Agile "sweet spot" that's a product of technical complexity and certainty of requirements. Projects with low technical complexity and clear requirements don't need the exploratory ability Agile provides; a simpler waterfall approach might be sufficient. Conversely, for projects with high technical complexity and uncertain requirements it will be difficult to capture enough detail to start working on a product backlog - further refinement and technical understanding is required.
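One way to picture the sweet spot is as a simple classification over the two dimensions described above. The sketch below is only an illustration: the 0-to-1 scores, the `low` and `high` thresholds, and the names `Approach` and `suggest_approach` are all assumptions made for this example, not a formal model we use.

```python
from enum import Enum


class Approach(Enum):
    WATERFALL = "a simpler plan-driven approach may be sufficient"
    AGILE = "inside the Agile sweet spot"
    EXPLORE_FIRST = "refine and de-risk before building a backlog"


def suggest_approach(technical_complexity: float,
                     requirements_certainty: float,
                     low: float = 0.3,
                     high: float = 0.7) -> Approach:
    """Classify a project from two subjective scores in the range [0, 1].

    The thresholds are illustrative assumptions, not calibrated values.
    """
    for score in (technical_complexity, requirements_certainty):
        if not 0.0 <= score <= 1.0:
            raise ValueError("scores must be between 0 and 1")

    # Low complexity and clear requirements: little need for exploration.
    if technical_complexity <= low and requirements_certainty >= high:
        return Approach.WATERFALL

    # High complexity and unclear requirements: too early to write a backlog.
    if technical_complexity >= high and requirements_certainty <= low:
        return Approach.EXPLORE_FIRST

    # Everything in between is treated as the sweet spot.
    return Approach.AGILE
```

A project scored as `suggest_approach(0.9, 0.1)` falls outside the sweet spot on the exploratory side, which is where up-front refinement and technical investigation pay off most.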

Projects at D2D CRC often start on the boundary of the sweet spot. We generally have high-level user requirements, but don't always know if they're achievable. We almost always have high technical complexity due to the need for Big Data scale and advanced analytics.

We cope with this through early activities aimed at bringing the project back into the sweet spot. Literature reviews can help us understand the problem domain and select likely approaches to user problems. Lightweight prototyping can help us understand technology options and de-risk core components of the solution. These activities are sometimes performed in a pre-project "Sprint 0" phase, but more commonly they're captured as tasks in the project backlog and prioritised early, as they provide the best opportunity to refine further work.

Just-in-Time Scalability

By far the biggest lesson learnt for me has been to scale only as required, and to scale up before scaling out (use more powerful machines before distributed systems). There is an operational overhead to deploying and maintaining distributed systems that is best avoided until it is really needed. Effort spent maintaining the capability is effort you're not spending on user-facing features - you can't demonstrate a Hadoop cluster to an end user; it's an enabler, not a feature. In one extreme example the refinements we made to our product backlog based on user feedback led us to reduce scale and focus on advanced analytics. If we had added the cluster to our roadmap early it would have resulted in wasted effort.

Just-in-Time scalability needs to be balanced against the need to de-risk technical complexity. When you are confident that scale-out is needed it may be necessary to start prototyping Big Data tools to understand the level of effort required, providing better estimates for future tasks.
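The "scale up before scaling out" rule can be sketched as a small decision function. This is a deliberately simplified illustration: it assumes memory is the only constraint, and the function name, parameters and the 70% headroom figure are hypothetical choices for the example rather than anything we measure in practice.

```python
def next_scaling_step(working_set_gb: float,
                      current_ram_gb: float,
                      largest_machine_ram_gb: float,
                      headroom: float = 0.7) -> str:
    """Suggest the cheapest scaling step that fits the working set.

    Assumes RAM is the limiting resource and that only `headroom`
    (a fraction between 0 and 1) of a machine's memory is usable.
    """
    # The data still fits on the machine we already have: do nothing.
    if working_set_gb <= current_ram_gb * headroom:
        return "stay"

    # A bigger single machine would cope: scale up, no cluster needed yet.
    if working_set_gb <= largest_machine_ram_gb * headroom:
        return "scale up"

    # Only now is the overhead of a distributed system justified.
    return "scale out"
```

For example, `next_scaling_step(100, 64, 512)` suggests scaling up, because 100 GB no longer fits comfortably in 64 GB but fits easily on a single 512 GB machine.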

Changing Direction

Investment in Big Data technology can be significant, and changing products can be difficult. This problem is not unique to Big Data, e.g. changing relational databases can also be challenging, but the scale of the problem is certainly larger. Changes may be justified if the wrong product was selected in the first place, but more commonly they result from a new product maturing and offering improved features or performance. The Big Data landscape moves quickly, with new technologies emerging often.

As with any technology change it is important to view previous investments as a sunk cost. The cost of changing needs to be weighed against the business value gained from improved performance, or from the ability to deliver new features faster. Adding backlog tasks to perform lightweight prototyping before making any changes is wise. Tools like Docker and Vagrant can make prototyping with local or virtualised clusters easy.
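The sunk-cost reasoning above can be made concrete with a rough comparison of future costs and benefits. The sketch below is an assumption-laden illustration: the `TechnologyOption` fields, the idea of estimating value per Sprint, and the unit of measure (any consistent unit, such as person-days) are all hypothetical. The key point is that `sunk_cost` is recorded but never used in the decision.

```python
from dataclasses import dataclass


@dataclass
class TechnologyOption:
    name: str
    sunk_cost: float         # effort already spent; deliberately ignored below
    switching_cost: float    # future effort to adopt (0 for the incumbent)
    value_per_sprint: float  # estimated business value delivered per Sprint


def net_future_value(option: TechnologyOption, sprints: int) -> float:
    """Value expected over the planning horizon, minus future costs only."""
    return option.value_per_sprint * sprints - option.switching_cost


def should_switch(current: TechnologyOption,
                  candidate: TechnologyOption,
                  sprints: int) -> bool:
    """Switch only if the candidate is better looking forward."""
    return net_future_value(candidate, sprints) > net_future_value(current, sprints)


# Hypothetical figures for illustration only.
incumbent = TechnologyOption("incumbent", sunk_cost=100, switching_cost=0, value_per_sprint=5)
newcomer = TechnologyOption("newcomer", sunk_cost=0, switching_cost=20, value_per_sprint=8)

switch_long_horizon = should_switch(incumbent, newcomer, sprints=10)
switch_short_horizon = should_switch(incumbent, newcomer, sprints=5)
```

With these made-up numbers the switch is worthwhile over ten Sprints but not over five, and changing the incumbent's `sunk_cost` makes no difference to either answer. Lightweight prototyping is what makes the `switching_cost` and `value_per_sprint` estimates credible.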


Agile has been a useful tool for projects at D2D CRC, allowing us to efficiently build software in a research environment with high technical complexity. Managing the complexity introduced by Big Data tools has been an ongoing challenge, with various approaches being used and important lessons learnt along the way. We're always refining our Agile approach to Big Data problems and will continue to develop new techniques as we work in this space.