BY DAVID BLOCKOW
Big Data applications have some unique characteristics that make solution design more challenging than it is for traditional software systems. At a high level I talk about the "3 V's" of Big Data - Volume, Velocity and Variety - but more important is how these V's manifest as design considerations and architecture patterns. A good design process should acknowledge the difference between Big Data and traditional systems and provide opportunities to address the unique Big Data challenges.
D2D CRC's design process has evolved over many years of designing and developing Big Data solutions. It is very much a "living process", adjusting as Big Data products and technology mature, and with the lessons learnt from building and maintaining real-world systems. This blog post provides a snapshot of the thought process we use when approaching a Big Data problem.
Start with User Needs
It should go without saying, but all architectures should start by considering the user's needs. What questions does the user need answered? What capability are they trying to enable?
As engineers it's easy to think of Big Data as "Hadoop" or "Spark" - we often focus on the technology and lose sight of the fact that these tools are only enablers, not solutions in their own right. The capability they provide is certainly impressive, allowing us to work with data at unprecedented scale, but we need to remember that their primary purpose is to extract value from the data.
How user needs are captured isn't particularly important - I like use cases, but user stories or requirements can work just as well. The level of detail can be important though, especially early in the design process. We start by capturing high-level descriptions only, focusing on the "what", not the "how". Low-level details are refined in an agile fashion as the solution evolves.
What Data is Required?
After capturing user needs we can start thinking about the data needed to support them. How do we access the data? Is it available within the organisation, or do we need to collect it from an external source? Are there multiple data sources? What is the data volume? What are the formats? What is the data quality?
The answers to these questions start to drive architecture decisions. Data volumes and formats can help decide the most appropriate style of storage. Data velocity and the level of processing required at ingest can suggest design patterns such as buffering or back pressure.
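As a rough illustration of buffering with back pressure, here is a minimal Python sketch using only the standard library. The function name `run_pipeline`, the `buffer_size` parameter and the upper-casing "processing" step are all hypothetical stand-ins, not part of any particular product: the point is simply that a bounded buffer makes a fast producer wait for a slower consumer rather than letting unprocessed data pile up.

```python
import queue
import threading


def run_pipeline(records, buffer_size=4):
    """Push records through a bounded buffer and return the processed results."""
    # A bounded queue: put() blocks when the buffer is full, which is
    # what applies back pressure to the producer.
    buffer = queue.Queue(maxsize=buffer_size)
    sentinel = object()  # marks the end of the input
    results = []

    def producer():
        for record in records:
            buffer.put(record)  # waits here while the buffer is full
        buffer.put(sentinel)

    def consumer():
        while True:
            record = buffer.get()
            if record is sentinel:
                break
            # Hypothetical stand-in for real ingest-time processing.
            results.append(record.upper())

    threads = [threading.Thread(target=producer), threading.Thread(target=consumer)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return results
```

With a small `buffer_size` the producer spends most of its time waiting on the consumer, while a larger buffer absorbs short bursts at the cost of more memory. Real ingest layers make the same trade-off, usually with a durable queue rather than an in-memory one.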
While we can start thinking about storage technologies at this stage, we would not select products before considering how the data will be read by users and analytics, leading us to...
What Analytics are Required?
One of the most challenging aspects of designing a Big Data solution is selecting an appropriate representation of the data to support analytics. This requires a good understanding of the types of analytics that will be performed and the way those analytics will access the data. Will the analytics read individual records? Will they scan large ranges of records? Will they work with subsets of the data, or all the data at once? Will they combine data from different sources?
Similarly, we need an understanding of the required performance. How quickly are results needed? Seconds? Days? Will the analytics run as the data is collected, or at a later stage? Are analytics run programmatically, or in response to interactive user queries?
The answers to these questions help determine the processing style - stream, batch, or interactive (or a combination of all three). They also help refine the storage solution, e.g. analytics that scan large ranges of records usually benefit from using a columnar store, and help to select appropriate design patterns, e.g. denormalising and storing multiple representations of the data to support different types of analysis. The right data representation for bulk analytics can be the difference between receiving answers in minutes or days.
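To make the point about data representation concrete, here is a small Python sketch that holds the same records in a row-oriented and a column-oriented layout. The sample records and the helper `to_columns` are made up for illustration, and the sketch assumes every record has the same fields. It only shows the shape of the two layouts, not the compression and I/O savings a real columnar store provides.

```python
# Hypothetical sample records in a row-oriented layout:
# each record keeps all of its fields together.
rows = [
    {"id": 1, "region": "SA", "amount": 10.0},
    {"id": 2, "region": "NSW", "amount": 25.5},
    {"id": 3, "region": "SA", "amount": 4.5},
]


def to_columns(records):
    """Pivot a list of uniform records into one list per field."""
    columns = {}
    for record in records:
        for field, value in record.items():
            columns.setdefault(field, []).append(value)
    return columns


columns = to_columns(rows)

# A scan over one field touches every whole record in the row layout...
total_from_rows = sum(record["amount"] for record in rows)

# ...but only a single contiguous list in the column layout.
total_from_columns = sum(columns["amount"])
```

Both totals are the same; what differs is how much data has to be read to compute them. Reading individual records favours the row layout, and scanning one field across a large range favours the column layout, which is why the access pattern needs to be understood before choosing storage.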
Addressing user needs usually involves making analytic output available in some form, e.g. visualising static results on a dashboard, providing interfaces for users to perform ad-hoc queries, or exposing machine APIs for integrating with other business systems. Understanding how the analytic output is accessed can help refine the storage and processing solutions and can help select design patterns, such as aggregation to support interactive visualisations of large data sets, or pre-calculating and caching to improve performance.
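The aggregation and caching patterns mentioned above can be sketched in a few lines of Python. Everything here is illustrative: `EVENTS` is made-up sample data, and `aggregate_by_hour` and `count_events_of_kind` are hypothetical helpers. The first pre-calculates a small summary that a dashboard could read directly; the second caches the result of an ad-hoc query so that repeated requests do not rescan the raw data.

```python
from collections import Counter
from functools import lru_cache

# Hypothetical raw events as (ISO-style timestamp, kind) pairs.
EVENTS = (
    ("2016-03-01T10:15", "login"),
    ("2016-03-01T10:40", "search"),
    ("2016-03-01T11:05", "login"),
    ("2016-03-01T11:20", "search"),
    ("2016-03-01T11:55", "search"),
)


def aggregate_by_hour(events):
    """Pre-calculate event counts per hour for a dashboard to read."""
    counts = Counter()
    for timestamp, _kind in events:
        hour = timestamp[:13]  # keep "YYYY-MM-DDTHH"
        counts[hour] += 1
    return dict(counts)


# Computed once, ahead of time, instead of on every dashboard refresh.
HOURLY_COUNTS = aggregate_by_hour(EVENTS)


@lru_cache(maxsize=None)
def count_events_of_kind(kind):
    """Ad-hoc query over the raw events; repeated calls are served from the cache."""
    return sum(1 for _timestamp, event_kind in EVENTS if event_kind == kind)
```

The trade-off in both cases is freshness: a pre-calculated aggregate or cached answer is only as current as the last time it was rebuilt, so this style suits output that can tolerate being slightly stale.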
Do We Need a Big Data Solution?
At many points throughout the design process we consider whether a Big Data solution is really necessary. The operational overhead of deploying and maintaining a distributed system can be significant, and is not something we would recommend taking on unnecessarily. You can achieve a lot these days using single-node, large-memory machines with multi-terabyte SSDs - for data sets smaller than roughly 3 TB this is often sufficient. In some cases it may even be worth reducing scope to make a single-node solution achievable, or at least validating that the business value gained from introducing Big Data tools is worth the additional overhead.
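A back-of-envelope check along these lines can be written down in a few lines of Python. This is only a sketch: the 3 TB threshold is the rule of thumb from the paragraph above, while `overhead_factor` (an allowance for indexes, intermediate results and growth) and the example record counts and sizes are assumptions chosen purely for illustration.

```python
TB = 10**12  # one decimal terabyte, in bytes


def estimated_size_bytes(record_count, avg_record_bytes, overhead_factor=1.5):
    """Rough data set size; overhead_factor is an assumed allowance, not a measured one."""
    return int(record_count * avg_record_bytes * overhead_factor)


def single_node_candidate(size_bytes, threshold_bytes=3 * TB):
    """True when the estimate falls under the rule-of-thumb threshold."""
    return size_bytes < threshold_bytes


# Hypothetical workloads: 2 billion and 10 billion records of about 500 bytes each.
small_estimate = estimated_size_bytes(2_000_000_000, 500)
large_estimate = estimated_size_bytes(10_000_000_000, 500)
```

Under these assumptions the first workload comes to about 1.5 TB and is worth trying on a single node, while the second comes to about 7.5 TB and is not. The estimate is only a prompt for the conversation about scope and business value, not a substitute for it.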
A process that acknowledges the unique Big Data challenges can help when designing effective solution architectures. This blog post presents a simplified view of our thought process when tackling Big Data problems - in reality the process is much more iterative, refining answers to all the questions as the solution matures. At D2D CRC we usually follow solution design with prototyping for high-risk components of the architecture, typically the storage and processing layers, validating and potentially further refining the design.