Monday 4 June 2018

Apache NiFi - VII

In this post, we will see how Apache NiFi integrates with MongoDB. MongoDB is a document database that supports querying and indexing along with scalability and flexibility. More details about MongoDB are available here. We will use the latest versions of NiFi and MongoDB for all the work in this post: Apache NiFi 1.6 and MongoDB 3.6.5. MongoDB Community Edition is installed, and the employees data has already been loaded into the database using the mongoimport command. This utility is reminiscent of SQL*Loader in the Oracle database. The employees data has been used repeatedly in earlier posts. Since this is only sample data, any other data set may also be used.
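For reference, a mongoimport invocation along the following lines could be used to load the data; the file name employees.json is an assumption, while the users database and employees collection match what is used in the rest of this post:

# load the employees data into the users database (file name assumed)
mongoimport --db users --collection employees --file employees.json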

Using Compass, we can view the employees collection under the users database; it is shown below for reference:

It can be seen from Compass that there are 107 documents in the employees collection. We will use NiFi to interact with MongoDB and, as a first activity, fetch data from MongoDB. Note that, like mongoimport, there is a corresponding mongoexport utility that can be used to export data out of MongoDB, as shown below. But we are interested in fetching data from MongoDB using NiFi, as the fetched data can later be processed using the rich features of NiFi.
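For completeness, a mongoexport invocation along these lines would dump the collection back out to a file (the output file name is only an example):

# export the employees collection to a JSON file (output name is an example)
mongoexport --db users --collection employees --out employees_dump.json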

The first flow that we will see is shown below:

The flow consists of just two processors: GetMongo and PutFile. The properties of GetMongo are shown below:

We set the properties for Mongo URI, Database, Collection and Results per FlowFile. Notice that the Mongo URI (Uniform Resource Identifier) points to the running instance of MongoDB that is listening for client connections on TCP port 27017. The Results per FlowFile property (shown below) offers the flexibility to limit the data extracted per FlowFile from MongoDB.
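As a sketch, the property values used here are along the following lines; the URI assumes MongoDB is running on the local machine, and the Results per FlowFile value of 100 is only an illustrative choice:

Mongo URI            : mongodb://localhost:27017
Database             : users
Collection           : employees
Results per FlowFile : 100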

On the PutFile processor, we just set the destination directory as shown below:
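For instance, the Directory property of PutFile can be set along these lines (the path is only an example):

Directory : /tmp/mongo-output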

Running this simple flow, we can see the Data Provenance on the GetMongo processor as shown below:

Clicking on View above displays the data that has been extracted from MongoDB, as shown below:

We can also see the file under the destination directory:

In the second attempt, the flow used is the same, but we add bells and whistles to the GetMongo processor, as shown below:
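As a sketch, the additional property values are along the following lines; the four projected field names are assumptions based on the standard employees data set, and the details are explained after the sketch:

Query      : { "DEPARTMENT_ID": 90 }
Projection : { "_id": 0, "FIRST_NAME": 1, "LAST_NAME": 1, "SALARY": 1, "DEPARTMENT_ID": 1 }
Sort       : { "SALARY": -1 }
Limit      : 3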

As sketched above, the Query property filters the records to only those having DEPARTMENT_ID equal to 90. This query is also passed to the FlowFile in an attribute called query_detail. Only four columns are projected, with the _id field explicitly skipped. The records are sorted by salary in descending order and only the top 3 are picked. The properties of PutFile remain the same. The query_detail attribute can be seen in the Attributes tab under Data Provenance as shown below:

On the Content tab, we can view the result that is shown below:

We can also see the file in the destination folder as shown below:

Note that there are only three records, as we had configured. We can finally confirm this by running the same query that we set in NiFi in Compass, and the results compare well between the two, as shown below:
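In mongo shell terms, the query run in Compass is roughly equivalent to the following (again, the projected field names are assumptions):

// top 3 earners in department 90, projecting a few fields and dropping _id
db.employees.find(
    { DEPARTMENT_ID: 90 },
    { _id: 0, FIRST_NAME: 1, LAST_NAME: 1, SALARY: 1, DEPARTMENT_ID: 1 }
).sort({ SALARY: -1 }).limit(3)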

With this last step, we conclude this post.