Monday, 19 February 2018

Apache Nifi - III

In the third segment of the series on Apache Nifi, we will run the simple flow that we developed in the previous post. The simple flow consists of two processors and a connector as shown below:








Navigate to the D:\source and D:\sink folders and make sure both are empty, as shown below:







We will run the PickFile processor by right clicking on it and clicking on Start, as shown below:

















Now, the red square will appear as a green triangle to indicate that the processor is running as shown below:







We will drop a file called employees.csv (from our earlier posts; any text file will do) into D:\source as shown below:









Note that this file is picked up by the simple flow and is already in the queue between the two processors, as shown below:







Now, note that D:\source no longer has the file as shown below:







We can inspect the simple flow while the file is in transit by right clicking on the PickFile processor and clicking on View data provenance as shown below:











On the Nifi Data Provenance window, click on i as shown below:

















On the Provenance Event window that comes up, click on the CONTENT tab. You can either download the file or view it, as shown below:



















Now, in the same way, we can inspect the file in the queue by right clicking on the connector and then clicking on List Queue as shown below:











Then, on the success window, note that the file name and size are mentioned. Click on i as in the previous case to see the file content:









On the FlowFile, you can see the file name and size and also the download option:



















Click on DOWNLOAD to save the file and drop it into D:\InTransit folder for reference:















Then, click on the OK button and on the X at the top right of the success window to return to the flow, and start the DropFile processor:








Then, navigate to D:\sink folder to see the file:










Thus, the file was picked from D:\source folder and dropped into D:\sink folder. We can stop the processors by right clicking anywhere on the canvas and clicking on Stop as shown below:













This concludes running the simple flow in Apache Nifi.
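
As an optional check outside Nifi, the copy saved to D:\InTransit can be compared with the file that arrived in D:\sink. The following is a minimal Java sketch, assuming both files fit comfortably in memory. The names TransitCheck and sameContent are made up for illustration, and the demo uses temporary files with made-up sample content so that it runs on any machine.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;

// Hypothetical helper, not part of Nifi: compares two files byte for byte.
class TransitCheck {

    // Returns true when both paths are regular files holding exactly the same bytes.
    static boolean sameContent(Path first, Path second) {
        try {
            if (!Files.isRegularFile(first) || !Files.isRegularFile(second)) {
                return false;
            }
            return Arrays.equals(Files.readAllBytes(first), Files.readAllBytes(second));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Self-contained demo: two identical files and one different file.
    static boolean[] demo() {
        try {
            Path dir = Files.createTempDirectory("transit-check");
            byte[] sample = "id,name\n1,Ann\n".getBytes(StandardCharsets.UTF_8);
            byte[] other = "id,name\n2,Bob\n".getBytes(StandardCharsets.UTF_8);
            Path a = Files.write(dir.resolve("a.csv"), sample);
            Path b = Files.write(dir.resolve("b.csv"), sample);
            Path c = Files.write(dir.resolve("c.csv"), other);
            return new boolean[] { sameContent(a, b), sameContent(a, c) };
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        boolean[] result = demo();
        System.out.println("identical copies match: " + result[0]);
        System.out.println("different files match:  " + result[1]);
    }
}
```

To compare the real files, pass the path of the downloaded copy under D:\InTransit and the path of the delivered file under D:\sink to sameContent.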

Saturday, 17 February 2018

Apache Nifi - II

In the second segment of the series on Apache Nifi, we will build a simple flow. We will use the Apache Nifi instance that was installed in the earlier post. This simple flow will pick a file from a folder and drop the file into another folder. Translating that to Nifi parlance, the simple flow will have just three parts: a GetFile processor, a PutFile processor and a connector to connect the two processors. Following the steps detailed in the earlier post, let us start Nifi and navigate to http://localhost:8080/nifi/. The Nifi GUI is shown below:













As a first step, let us create three folders called source, sink and inTransit under a parent directory. The folder structure will look as shown below:







We created these three folders to demonstrate the simple flow. Files from the source folder will be picked up by Nifi and dropped into the sink folder. Files in transit will be dropped into the inTransit folder. On the Nifi GUI, drag a processor from the toolbar onto the canvas below it as shown below:














Once we drag the processor, the Add Processor dialog box comes up as shown below:




















In the Filter box, enter getfile as shown below:




















This will bring up only one search result called GetFile. Select it and click on Add. The GetFile processor is added to the designer interface:


















In the same manner, add a PutFile processor. Now the interface will look as shown below:














Right click on the GetFile processor and select Configure as shown below:

















On the Configure Processor window, set name to PickFile and click on Scheduling tab:


















On the Scheduling tab, enter 10 sec as the Run Schedule. Details about Run Schedule can be seen by hovering the mouse pointer over the question mark next to the Run Schedule field name. Click on the Properties tab:


















On the Properties tab, set Input Directory to D:\\source\\. This is the directory from which the file is picked up. Note that the Keep Source File property is set to false, so the file is removed from the source folder once it has been picked up:


















Click on Apply to apply these changes to this processor. In the same manner, right click on PutFile and click on Configure. On the Configure Processor window, set name to DropFile and check the boxes next to failure and success under Automatically terminate relationships as shown below:


















Click on Properties to set the Directory to D:\\sink\\. This is the directory where the file will be dropped. Click on Apply. Finally, hover the mouse over the yellow triangle on either of the processors to see the message that there needs to be a connection:










To solve this issue, hover the mouse over the PickFile processor, click on the arrow in the green circle, drag it onto the DropFile processor and release the mouse, as shown below:








This will bring up the Create Connection window that has the details about the connection. Click on Add:



















This will complete the connection between these two processors:







Note that the yellow triangle is replaced with a red square, indicating that this processor is ready for the Start operation. This completes the process of building a simple flow in Apache Nifi. We will take up running this simple flow in the next post.
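
For readers who think in code, the behaviour configured in this post can be approximated in plain Java: every file found in the input directory is moved (not copied, since Keep Source File is false) to the output directory. This is only a rough sketch of the idea and not how Nifi is implemented. The names FlowSketch and moveAll are invented for illustration, the demo uses temporary folders in place of the source and sink folders, and overwriting an existing target file is a simplification chosen for the sketch.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

// Rough stand-in for the PickFile -> DropFile flow; this is not Nifi code.
class FlowSketch {

    // Moves every regular file from source to sink and returns how many were moved.
    static int moveAll(Path source, Path sink) {
        int moved = 0;
        try {
            Files.createDirectories(sink);
            try (DirectoryStream<Path> files = Files.newDirectoryStream(source)) {
                for (Path file : files) {
                    if (Files.isRegularFile(file)) {
                        // Simplification: an existing file of the same name is overwritten.
                        Files.move(file, sink.resolve(file.getFileName()),
                                StandardCopyOption.REPLACE_EXISTING);
                        moved++;
                    }
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return moved;
    }

    // Self-contained demo on temporary folders with one made-up sample file.
    static String demo() {
        try {
            Path source = Files.createTempDirectory("source");
            Path sink = Files.createTempDirectory("sink");
            Files.write(source.resolve("employees.csv"),
                    "id,name\n1,Ann\n".getBytes(StandardCharsets.UTF_8));
            int moved = moveAll(source, sink);
            return "moved=" + moved
                    + " inSource=" + Files.exists(source.resolve("employees.csv"))
                    + " inSink=" + Files.exists(sink.resolve("employees.csv"));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

To mimic the 10 sec Run Schedule, moveAll could be called repeatedly from a ScheduledExecutorService with a fixed delay.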

Thursday, 15 February 2018

Apache Nifi - I

In this post, we will look at another tool, Apache Nifi, that is touted as supporting powerful and scalable directed graphs of data routing, transformation, and system mediation logic. More details on Apache Nifi can be found here. One of the outstanding features of this tool is the web interface that can be used for design, control, feedback, and monitoring purposes. All ETL experts out there coming from a "small data" background and familiar with tools like Informatica PowerCenter, Oracle Data Integrator Enterprise Edition, Microsoft SSIS, IBM InfoSphere DataStage, etc. will enjoy working on Apache Nifi. Apache NiFi is based on technology previously called "Niagara Files", hence the name Nifi, which is short for "Niagara Files".

The first post on this topic is dedicated to installation. This is in line with our strategy - we try to understand tools by starting from scratch and not overwhelm the readers with complex configurations on the first run. We will then try to add more concepts and features of the product in later posts. We will install Apache Nifi on Windows and we will use this environment for all the work in this post. One of the prerequisites for installation of Nifi is Java. Make sure the JAVA_HOME and PATH environment variables are set correctly to point to the Java installation. You can make sure Java is installed by running the command below:

java -version
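
The same information can also be read from inside a Java program, which is handy when checking which installation is actually picked up. A small sketch follows. The class name JavaCheck is arbitrary, and the check for a bin folder containing the java launcher is only a plausibility test chosen for this sketch, not something Nifi itself prescribes.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper: reports the running Java version and the JAVA_HOME setting.
class JavaCheck {

    // True when the folder contains bin/java or bin/java.exe.
    static boolean looksLikeJavaHome(String home) {
        if (home == null || home.isEmpty()) {
            return false;
        }
        Path bin = Paths.get(home, "bin");
        return Files.isRegularFile(bin.resolve("java"))
                || Files.isRegularFile(bin.resolve("java.exe"));
    }

    // Builds a short multi-line report from system properties and the environment.
    static String report() {
        String javaHome = System.getenv("JAVA_HOME");
        String newline = System.lineSeparator();
        return "java.version = " + System.getProperty("java.version") + newline
                + "java.home    = " + System.getProperty("java.home") + newline
                + "JAVA_HOME    = " + (javaHome == null ? "(not set)" : javaHome) + newline
                + "JAVA_HOME looks valid: " + looksLikeJavaHome(javaHome);
    }

    public static void main(String[] args) {
        System.out.println(report());
    }
}
```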









The binaries for Apache Nifi can be downloaded here. We will work with the latest release, 1.5.0. On this page, click here to download the binaries. After downloading nifi-1.5.0-bin.zip, drop it into a folder called nifi as shown below:









Right click on the file and click on Extract all ...

















Change the installation directory to D:\ as shown below and click on Extract:



















The files are extracted as shown below:




















Navigate as shown below to see the installed folders:










On the command line, enter the commands below to run Nifi:

D:\>
D:\>cd D:\nifi-1.5.0\bin\
D:\nifi-1.5.0\bin>run-nifi











A series of output statements is seen, ending with a line like "INFO [main] org.apache.nifi.bootstrap.Command Launched Apache NiFi with Process ID 9048" as shown below:


















This means that Nifi is up and running. One can verify this by running the status command as follows:

D:\>
D:\>cd D:\nifi-1.5.0\bin\
D:\nifi-1.5.0\bin>status-nifi









A series of output statements is seen, ending with a line like "[main] INFO org.apache.nifi.bootstrap.Command - Apache NiFi is currently running, listening to Bootstrap on port 49643, PID=9048" as shown below:



















Lastly, navigate to http://localhost:8080/ to open the Nifi Designer window as shown below:


















As a final step, click on nifi to see the application below:














Alternately, you can navigate to http://localhost:8080/nifi/ directly. 
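
Whether the UI is up can also be checked programmatically. The sketch below sends a plain HTTP GET to a URL such as the one above and treats any 2xx or 3xx response as success. It assumes the default unsecured setup on port 8080 used in this post, and the class name NifiPing and the 2 second timeout are arbitrary choices.

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URI;

// Hypothetical helper: checks whether an HTTP endpoint answers at all.
class NifiPing {

    // Returns true for a 2xx or 3xx response; false on errors, timeouts or bad URLs.
    static boolean isReachable(String url, int timeoutMillis) {
        HttpURLConnection connection = null;
        try {
            connection = (HttpURLConnection) URI.create(url).toURL().openConnection();
            connection.setRequestMethod("GET");
            connection.setConnectTimeout(timeoutMillis);
            connection.setReadTimeout(timeoutMillis);
            int status = connection.getResponseCode();
            return status >= 200 && status < 400;
        } catch (IOException | RuntimeException e) {
            // Connection refused, timeout, malformed URL and similar all count as "not reachable".
            return false;
        } finally {
            if (connection != null) {
                connection.disconnect();
            }
        }
    }

    public static void main(String[] args) {
        String url = "http://localhost:8080/nifi/";
        System.out.println(url + " reachable: " + isReachable(url, 2000));
    }
}
```

Nifi can take a while to start, so a false result right after launching it does not necessarily mean that something is wrong.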

To bring down the application, press CTRL+C in the window where the application was started.

This completes the installation steps for Apache Nifi on Windows.

Sunday, 4 February 2018

Apache Kafka - IV

In the fourth part of the series on Apache Kafka, we will write a producer in Java. We will use the same environment that we set up in an earlier post.

In addition, we are using IntelliJ IDEA, Community Edition (Version: 2017.3.4, Build: 173.4548.28, Released: January 30, 2018) as IDE for the Java program. A screenshot is shown below:














The producer will read a file and publish its contents to a topic. A consumer will subscribe to this topic and print the messages to the console.

We start Zookeeper and Kafka as shown below. Then, we create a topic called topic1:

Start Zookeeper:

Open first terminal:

F:\>
F:\>cd kafka_2.12-1.0.0

F:\kafka_2.12-1.0.0>bin\windows\zookeeper-server-start.bat config\zookeeper.properties

Start Kafka:

Next, open another terminal and start Kafka as shown below:

F:\>
F:\>cd kafka_2.12-1.0.0

F:\kafka_2.12-1.0.0>bin\windows\kafka-server-start.bat config\server.properties


Create Topic:

In a third terminal, let us now proceed to create a topic called topic1 with a replication factor of 1 and a single partition:

F:\>
F:\>cd kafka_2.12-1.0.0

F:\kafka_2.12-1.0.0>bin\windows\kafka-topics.bat --create --zookeeper localhost:2181 --replication-factor 1 --partitions 1 --topic topic1

Write the code below in IntelliJ IDEA and run it:

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.Properties;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class customProducer {
    private static final String topic = "topic1";

    public static void main(String[] args) {
        // file whose lines are published; adjust the path to your own Kafka home
        String fileName = "C:\\kafka_2.12-1.0.0\\NOTICE";

        Properties properties = new Properties();

        // kafka bootstrap server properties
        properties.setProperty("bootstrap.servers", "127.0.0.1:9092");
        properties.setProperty("key.serializer", StringSerializer.class.getName());
        properties.setProperty("value.serializer", StringSerializer.class.getName());

        // producer properties
        properties.setProperty("acks", "1");
        properties.setProperty("retries", "3");
        properties.setProperty("linger.ms", "1");

        // try-with-resources closes the reader and the producer;
        // closing the producer also sends any records that are still buffered
        try (BufferedReader br = new BufferedReader(new FileReader(fileName));
             KafkaProducer<String, String> producer = new KafkaProducer<>(properties)) {

            // publish every line of the file as one message (no key)
            String line;
            while ((line = br.readLine()) != null) {
                ProducerRecord<String, String> producerRecord =
                        new ProducerRecord<>(topic, line);
                producer.send(producerRecord);
            }

        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}


The code above reads the NOTICE file under the Kafka home and publishes the file contents to the topic called topic1. In a way, it simulates the file source connector (FileStreamSource) of Kafka Connect, in the sense that it reads a file and publishes its lines to a topic.
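
The configuration part of the program can be kept apart from the sending logic. As a small sketch, the helper below builds the same settings using only java.util.Properties, with the serializer given by its fully qualified class name so that the snippet compiles without the Kafka client on the classpath. ProducerSettings and forBroker are illustrative names, not Kafka API.

```java
import java.util.Properties;

// Hypothetical helper that collects the producer settings used in this post.
class ProducerSettings {

    private static final String STRING_SERIALIZER =
            "org.apache.kafka.common.serialization.StringSerializer";

    // Builds the producer configuration for the given broker address (host:port).
    static Properties forBroker(String bootstrapServers) {
        Properties properties = new Properties();

        // where to find the Kafka broker and how to serialize keys and values
        properties.setProperty("bootstrap.servers", bootstrapServers);
        properties.setProperty("key.serializer", STRING_SERIALIZER);
        properties.setProperty("value.serializer", STRING_SERIALIZER);

        // delivery settings: leader acknowledgement, three retries, 1 ms batching delay
        properties.setProperty("acks", "1");
        properties.setProperty("retries", "3");
        properties.setProperty("linger.ms", "1");
        return properties;
    }

    public static void main(String[] args) {
        // print the settings, one key=value pair per line
        Properties properties = forBroker("127.0.0.1:9092");
        for (String name : properties.stringPropertyNames()) {
            System.out.println(name + "=" + properties.getProperty(name));
        }
    }
}
```

With the Kafka client library available, the returned Properties object can be passed directly to the KafkaProducer constructor.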

Next, we start a consumer to receive the messages published to topic1:

Start Consumer:

In another terminal, start a console consumer as follows:

bin\windows\kafka-console-consumer.bat --bootstrap-server localhost:9092 --topic topic1 --from-beginning

Output is shown below:




 








The contents of the NOTICE file are shown below for reference:


















The messages are identical to the contents of the file above.