Tuesday 5 June 2018

Apache Nifi - VIII

In the last post, we saw how we can fetch data from MongoDB using Apache Nifi. We extend the discussion further by introducing another processor called RunMongoAggregation. GetMongo processor offers us a small window to query the MongoDB database but RunMongoAggregation processor ups the game by offering exciting possibilities. In the world of MongoDB, Aggregation Framework is a data processing tool for performing analysis in real-time and generating pre-aggregated reports. The fact that we have a separate processor, RunMongoAggregation, dedicated to run Aggregation against MongoDB only goes to show the significance of this processing framework even in Apache Nifi. More details about this framework are here. Perhaps we will explore the capabilities of Aggregation Framework when we deal with MongoDB context in some later articles. But, for now, let us focus on how we can leverage the Aggregation Framework in MongoDB using Nifi to process data

The environments and sample data details are in the previous post. We begin by showing a very simple aggregation pipeline in a flow consisting of just two steps as shown below:





















The properties of RunMongoAggregation are shown below:


As you can see, we have entered the pipeline in the Query field. Note that we have coded the same conditions that we used in the earlier post in form of Aggregation pipeline. So, we should get the same results as in the previous post. The properties of PutFile remain the same but are shown below for the sake of completeness:



















Let us run this flow:

























Let us examine the data provenance on RunMongoAggregation processor:












We can see that the three records. Further confirmation is seen by checking the file at the target directory:












This is in line with the results in the last post

In the next example, we attempt a self-join using $lookup stage and fetch the manager first name and manager last name. The flow remains the same but we merely switch the value in the Query field as shown below:


The properties for PutFile remain the same as shown earlier. Let us run this flow:
























Let us now examine the results in the file at target directory:







Note that are are 5 records as per the aggregation pipeline. Also, note that the manager first name and manager last name appears along with the employee record. In particular not that Steven King has no manager and that all the other four employees have Steven as their manager