BY MAHFUZUL AMIN
The Innovation Exchange Agriculture team at the Data to Decisions CRC (D2D CRC) has recently completed a weather data transformation and ingestion project for the Sheep CRC’s Wellbeing Data Platform (WDP). Weather data is a key part of WDP and is used for predicting the future wellbeing of sheep. This blog post describes how we transformed and ingested our weather data using Apache NiFi.
The transformation process
The weather data used for this project was obtained from the Bureau of Meteorology (BOM). The raw weather data that we receive from BOM needed to be transformed to be ingested into the WDP.
The process consisted of:
- downloading a number of weather observation and forecast files from BOM's File Transfer Protocol (FTP) site in Network Common Data Form (netCDF) format;
- storing the files for future use;
- reading the files to retrieve the data;
- transforming the data; and
- loading the data into the database.
The whole process – including the business logic – was implemented in NiFi using customised NiFi processors.
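At a high level, the transform-and-load steps above can be sketched in plain Python. Note this is only an illustrative sketch: the record fields, unit conversion and SQLite target below are assumptions for the example, not the project's actual schema (the real flow reads netCDF files with custom NiFi processors).

```python
import sqlite3

# Hypothetical raw records as they might be decoded from a weather file;
# field names and values are made up for illustration.
raw_records = [
    {"station": "070351", "date": "2017-06-01", "tmin_k": 278.15, "tmax_k": 291.15},
    {"station": "070351", "date": "2017-06-02", "tmin_k": 276.65, "tmax_k": 289.15},
]

def transform(record):
    """Example transformation: convert temperatures from kelvin to Celsius."""
    return (record["station"], record["date"],
            round(record["tmin_k"] - 273.15, 2),
            round(record["tmax_k"] - 273.15, 2))

# Load the transformed rows into a database table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE observations (station TEXT, date TEXT, tmin REAL, tmax REAL)")
conn.executemany("INSERT INTO observations VALUES (?, ?, ?, ?)",
                 [transform(r) for r in raw_records])
rows = conn.execute(
    "SELECT station, date, tmin, tmax FROM observations ORDER BY date").fetchall()
print(rows)
```

In the actual system, each of these steps is a NiFi processor wired into a flow rather than sequential code.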
What is Apache NiFi?
Apache NiFi is an enterprise integration and dataflow automation tool. NiFi can be thought of in terms of data logistics; similarly to how parcel delivery services move and track packages, NiFi helps to move and track data. It provides real-time control that makes it easy to manage the movement of data between any source and any destination.
NiFi is built on flow-based programming concepts and provides a web-based user interface for managing data flows in real time.
NiFi also allows a user to send, receive, route, transform and sort data, as needed, in an automated and configurable way. Though similar tools exist, NiFi stands out because of its user-friendly, drag-and-drop graphical user interface and the ease with which it can be customised on the fly for specific needs. It is also highly scalable and can run on something as simple as a single laptop or clustered across many high-performance servers.
NiFi Core Concepts
- FlowFile: A FlowFile is an object that moves through the Flow. It has a map of key/value pair attribute strings and file content of zero or more bytes.
- Processor: The Processors perform the work and execute the business logic. Each Processor does some combination of data routing, transformation and mediation between systems. Processors have access to the attributes of a given FlowFile and its content stream. A Processor can operate on zero or more FlowFiles in a given unit of work and either commit that work or roll it back.
- Connection: Connections provide the actual linkage between processors. These act as queues and allow various processes to interact at differing rates. These queues can be prioritised dynamically and can have upper bounds on load, which enable back pressure.
- Flow Controller: The Flow Controller maintains the knowledge of how processes connect and manages the threads and allocations thereof which all processes use. The Flow Controller acts as the broker facilitating the exchange of FlowFiles between processors.
- Process Group: A Process Group is a specific set of processes and their connections, which can receive data via input ports and send data out via output ports. In this manner, process groups allow for the creation of entirely new components simply by composition of other components.
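These concepts can be made concrete with a toy Python model. Everything here is a simplified analogy, not NiFi's actual API: a FlowFile carries attributes plus byte content, a Connection behaves like a bounded queue (the upper bound is what enables back pressure), and a processor transforms a FlowFile and records provenance in its attributes.

```python
from collections import deque
from dataclasses import dataclass, field

@dataclass
class FlowFile:
    # A map of key/value attribute strings plus content of zero or more bytes,
    # mirroring the FlowFile definition above.
    attributes: dict = field(default_factory=dict)
    content: bytes = b""

# A Connection acts as a queue between processors; maxlen stands in for the
# upper bound on load that enables back pressure.
connection = deque(maxlen=10000)

def uppercase_processor(flowfile):
    """A toy 'processor': transforms the content and tags the FlowFile."""
    flowfile.content = flowfile.content.upper()
    flowfile.attributes["processed.by"] = "uppercase_processor"
    return flowfile

connection.append(FlowFile(attributes={"filename": "obs.txt"},
                           content=b"rainfall 4.2 mm"))
result = uppercase_processor(connection.popleft())
print(result.content)  # b'RAINFALL 4.2 MM'
```

In real NiFi, the Flow Controller schedules processors onto threads and brokers the FlowFile exchange; here the "flow" is just a function call.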
A detailed overview can be found on the Apache NiFi website.
Example NiFi flow: “Compress File”
The following simple data flow receives a file via the GetFile processor from an input directory, passes it to the CompressContent processor for compression, and finally passes it to the PutFile processor to write it to disk. If compression or writing fails, the FlowFile is routed back to the processor to be retried.
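The same three-step flow can be imitated in a few lines of Python (this is a stand-in for the GetFile → CompressContent → PutFile chain, not NiFi itself; directory and file names are arbitrary):

```python
import gzip
import os
import tempfile

indir = tempfile.mkdtemp()
outdir = tempfile.mkdtemp()

# Create a sample input file for the demonstration.
src = os.path.join(indir, "readings.csv")
with open(src, "wb") as f:
    f.write(b"date,tmax\n2017-06-01,18.0\n" * 100)

# GetFile: pick the file up from the input directory.
with open(src, "rb") as f:
    content = f.read()

# CompressContent: gzip the content.
compressed = gzip.compress(content)

# PutFile: write the compressed result to the output directory.
dst = os.path.join(outdir, "readings.csv.gz")
with open(dst, "wb") as f:
    f.write(compressed)

print(len(content), "->", len(compressed), "bytes")
```

What NiFi adds over a script like this is the queuing, retry-on-failure routing, provenance tracking and live monitoring around each step.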
High level overview of D2D CRC weather data processing
A set of custom FTP processors (developed in-house) retrieve Australian weather data from the BOM FTP site three times a day (observation files are in the Australian Water Availability Project [AWAP] format and forecast files are in the Australian Digital Forecast Database [ADFD] and Access [Access-G and Access-R] formats). The observation and forecast files are processed in separate data flows.
Observation flow:
- A custom netCDF processor filters and processes AWAP files to extract daily minimum and maximum temperatures, daily total rainfall (precipitation), daily total solar exposure, and vapour pressure at 9 am and 3 pm, and inserts the data into separate tables in our database.
- Another custom processor triggers the Calculations and Transformations. These are custom SQL scripts that are run by a custom processor based on NiFi’s ExecuteSQL processor.
- A third set of processors executes the Quality Control operations.
- The final processor joins the observation data and inserts it into the main ClimateData table.
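The final join-and-insert step can be sketched with sqlite3. The table and column names below are illustrative assumptions; the real flow runs custom SQL scripts against the project's own schema via an ExecuteSQL-based processor.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Hypothetical per-variable tables, as the AWAP processor might populate them.
cur.execute("CREATE TABLE tmax (date TEXT, value REAL)")
cur.execute("CREATE TABLE rain (date TEXT, value REAL)")
cur.execute("CREATE TABLE ClimateData (date TEXT, tmax REAL, rain REAL)")

cur.executemany("INSERT INTO tmax VALUES (?, ?)",
                [("2017-06-01", 18.0), ("2017-06-02", 16.5)])
cur.executemany("INSERT INTO rain VALUES (?, ?)",
                [("2017-06-01", 4.2), ("2017-06-02", 0.0)])

# Join the per-variable observation tables and insert the combined rows
# into the main table.
cur.execute("""
    INSERT INTO ClimateData (date, tmax, rain)
    SELECT t.date, t.value, r.value
    FROM tmax t JOIN rain r ON t.date = r.date
""")
rows = cur.execute("SELECT * FROM ClimateData ORDER BY date").fetchall()
print(rows)
```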
Forecast flow: This is divided into two separate flows (ADFD and Access) to receive all the required data. The ADFD flow ingests:
- Max temp
- Min temp
- Wind speed/direction
- Relative humidity
After the ingestion, this flow uses the same steps as the observation flow (i.e. calculation, transformation, QC and insertion into the main ClimateData table).
Wind speed and solar radiation forecasts are provided by the BOM in Access-G and Access-R files. Access-R is a three-day forecast at hourly resolution and Access-G is a 10-day forecast at three-hourly resolution. Just like the Observation flow and the ADFD flow, once the files are ingested, they are passed through processors for calculation, transformation, QC and insertion into the main ClimateData table.
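One way to combine the two forecast horizons is to prefer the higher-resolution Access-R value wherever it exists and fall back to Access-G for the remaining lead times. The merge rule and flat value series below are assumptions for illustration only; the real data is gridded netCDF.

```python
# Lead times are hours ahead; values are a made-up placeholder series.
access_r = {h: 10.0 + 0.1 * h for h in range(0, 72, 1)}    # 3 days, hourly
access_g = {h: 10.0 + 0.1 * h for h in range(0, 240, 3)}   # 10 days, 3-hourly

# Start from the long-range Access-G forecast, then let the finer
# Access-R values override the first three days.
merged = dict(access_g)
merged.update(access_r)

print(len(access_r), len(access_g), len(merged))
```

The merged series covers the full 10-day horizon while keeping hourly detail where it is available.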
When all the flows are completed we can export a Climate File (CSV) that contains the weather data from yesterday's observation up to the seventh day's forecast. The BOM releases several updates over the course of each day, so we have programmed our NiFi flows to run whenever new data is available.
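The export step amounts to serialising the combined rows as CSV. The field names and example rows here are illustrative, not the actual Climate File layout:

```python
import csv
import io

# Hypothetical combined rows: yesterday's observation followed by the
# forecast days (in the real file, forecast rows run out to day seven).
rows = [
    {"date": "2017-06-01", "kind": "observation", "tmax": 18.0},
    {"date": "2017-06-02", "kind": "forecast", "tmax": 17.2},
]

buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=["date", "kind", "tmax"])
writer.writeheader()
writer.writerows(rows)

climate_csv = buf.getvalue()
print(climate_csv)
```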
Using NiFi to implement weather data processing works efficiently, produces accurate outputs and is stable and scalable. The D2D CRC Innovation Exchange Agriculture Team has built a robust system that incorporates weather files, networking, databases, NiFi and NiFi customisation. Today, it continues to run effectively as a part of the Sheep CRC’s Wellbeing Data Platform.