In the next few posts, starting with this one, we will take a look at Apache Kafka. Apache Kafka is a distributed streaming platform written in Scala and Java. We have already seen Sqoop and Apache Flume as ingestion tools for migrating data into Big Data systems. Apache Kafka serves the twin purposes of:
1) migrating data between systems/applications in real time through streaming data pipelines, and
2) processing streams of data in real time through applications
Apache Kafka uses the publish-subscribe messaging pattern. In this model, publishers are applications that send messages and subscribers are applications that receive them. Publishers publish messages without explicitly specifying recipients or having any knowledge of the intended recipients; instead, they publish messages to topics. Subscribers interested in a particular topic can subscribe to it to receive all the messages published to that topic. Note that all subscribers of the same topic receive the same messages. This achieves a loosely coupled architecture: publishers and subscribers have no knowledge of each other and can operate independently.
Apache Kafka terminology consists of:
1) Producers: publish data to any number of topics of their choice. The producer is responsible for choosing which partition within the topic each record is assigned to. This assignment can be done in a round-robin fashion to balance load, or according to a custom partitioning function
2) Consumers: are consumer instances grouped under a consumer group name. Each record published to a topic is delivered to exactly one consumer instance within each subscribing consumer group
3) Topics: are categories or feed names to which producers publish records. Topics in Kafka are always multi-subscriber: a topic can have zero, one, or many consumers that subscribe to the data written to it. Each topic is split into one or more partitions
4) Partitions: are ordered, immutable sequences of records that are continually appended to as producers publish records. Each record within a partition is assigned a sequential id number called the offset, which uniquely identifies it within that partition. The Kafka cluster retains all published records, whether or not they have been consumed, for a configurable retention period
5) Records: consist of an optional key, a value, and a timestamp. If a record has a key, the default Java partitioner uses a hash of the key to choose the partition; if it has no key, a round-robin strategy is used (see the sketch after this list)
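To make the terminology concrete, here is a minimal producer sketch in Java. The broker address localhost:9092, the topic my-topic, and the key user-42 are illustrative assumptions, not part of the installation that follows:

import java.util.Properties;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.Producer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class SimpleProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        // Assumed: a broker running locally on the default port
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");

        Producer<String, String> producer = new KafkaProducer<>(props);
        // The key ("user-42") is hashed by the default partitioner, so all
        // records with this key go to the same partition of my-topic
        producer.send(new ProducerRecord<>("my-topic", "user-42", "hello kafka"));
        producer.close();
    }
}

Because the record carries a key, the default partitioner hashes it to pick a partition, so all records with the key user-42 land in the same partition and retain their relative order.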
The four core Kafka APIs (see the figure above) are:
1) Producer API: allows an application to publish a stream of records to one or more Kafka topics
2) Consumer API: allows an application to subscribe to one or more topics and process the stream of records produced to them (see the sketch after this list)
3) Streams API: allows an application to act as a stream processor, consuming an input stream from one or more topics and producing an output stream to one or more output topics, effectively transforming the input streams to output streams
4) Connector API: allows building and running reusable producers or consumers that connect Kafka topics to existing applications or data systems
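As an illustration of the Consumer API, below is a minimal consumer sketch in Java that subscribes to the same hypothetical my-topic and prints each record with its offset. The group id demo-group is an assumed name; running multiple instances with the same group id splits the topic's partitions among them:

import java.util.Collections;
import java.util.Properties;

import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class SimpleConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        // Consumer instances sharing this group.id divide the topic's
        // partitions among themselves
        props.put("group.id", "demo-group");
        props.put("key.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");

        KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);
        consumer.subscribe(Collections.singletonList("my-topic"));
        while (true) {
            // poll(long timeout) is the 1.0.0-era API; newer clients take a Duration
            ConsumerRecords<String, String> records = consumer.poll(100);
            for (ConsumerRecord<String, String> record : records) {
                System.out.printf("offset=%d key=%s value=%s%n",
                        record.offset(), record.key(), record.value());
            }
        }
    }
}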
Apache Kafka has the following prerequisites:
1) Java Runtime Environment (a quick check is shown after this list)
2) Memory - Sufficient memory for all Kafka-related configurations
3) Disk Space - Sufficient disk space for all Kafka-related configurations
4) Directory Permissions - Read/Write permissions on the directories used by Kafka
5) ZooKeeper (bundled with the Kafka binaries)
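Since the Java Runtime Environment is the first prerequisite, a quick check from the command line confirms it is installed and on the PATH:

F:\>java -version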
In the first part of the series, we will install Kafka on Windows.
Download the latest Kafka binaries from the link below. As of this writing, the latest stable release is 1.0.0. The link to the binary is:
https://www.apache.org/dyn/closer.cgi?path=/kafka/1.0.0/kafka_2.12-1.0.0.tgz
Click on sha512 to see this:
kafka_2.12-1.0.0.tgz: 1B647B7F 392148AA 2B9D4755 0A1502E5 0BE4B005 C70C82DA
E03065B8 9484E664 00528BE4 0CA2D54F 35EB2C0E 70F35C88
A04777C2 B625DAA5 D5546CAA B4ED6818
Move the downloaded binary to a drive, the F drive in this case. Now, let us use the utility mentioned in this post to verify the integrity of the downloaded binary:
F:\>CertUtil -hashfile kafka_2.12-1.0.0.tgz SHA512
SHA512 hash of kafka_2.12-1.0.0.tgz:
1b647b7f392148aa2b9d47550a1502e50be4b005c70c82dae03065b89484e66400528be40ca2d54f35eb2c0e70f35c88a04777c2b625daa5d5546caab4ed6818
CertUtil: -hashfile command completed successfully.
The values match. Move the file to a folder created under the F drive called kafka_2.12-1.0.0. We will use 7-Zip to extract this file. Open 7-Zip File Manager and navigate to the file:
Click on Extract and enter F:\kafka_2.12-1.0.0\ in Extract to. Then, click OK:
Select the newly created file, kafka_2.12-1.0.0.tar. Click on Extract again and enter F:\ in Extract to. Then, click OK:
The final extracted files are shown below:
On the command line, navigate to the Kafka Windows scripts folder as follows:
F:\>
F:\>cd kafka_2.12-1.0.0\bin\windows
F:\kafka_2.12-1.0.0\bin\windows>
To see the ZooKeeper scripts, enter the command below:
F:\kafka_2.12-1.0.0\bin\windows>dir zookeeper*
Output is shown below:
Directory of F:\kafka_2.12-1.0.0\bin\windows
10/27/2017 09:26 PM 1,192 zookeeper-server-start.bat
10/27/2017 09:26 PM 905 zookeeper-server-stop.bat
10/27/2017 09:26 PM 977 zookeeper-shell.bat
3 File(s) 3,074 bytes
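Starting ZooKeeper is covered in the next post; as a preview, the bundled server is typically started by pointing zookeeper-server-start.bat at the default properties file shipped with Kafka:

F:\kafka_2.12-1.0.0\bin\windows>zookeeper-server-start.bat ..\..\config\zookeeper.properties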
To see the Kafka scripts, enter the command below:
F:\kafka_2.12-1.0.0\bin\windows>dir kafka*
Output is shown below:
Directory of F:\kafka_2.12-1.0.0\bin\windows
10/27/2017 09:26 PM 873 kafka-acls.bat
10/27/2017 09:26 PM 885 kafka-broker-api-versions.bat
10/27/2017 09:26 PM 876 kafka-configs.bat
10/27/2017 09:26 PM 925 kafka-console-consumer.bat
10/27/2017 09:26 PM 925 kafka-console-producer.bat
10/27/2017 09:26 PM 883 kafka-consumer-groups.bat
10/27/2017 09:26 PM 884 kafka-consumer-offset-checker.bat
10/27/2017 09:26 PM 938 kafka-consumer-perf-test.bat
10/27/2017 09:26 PM 874 kafka-mirror-maker.bat
10/27/2017 09:26 PM 900 kafka-preferred-replica-election.bat
10/27/2017 09:26 PM 940 kafka-producer-perf-test.bat
10/27/2017 09:26 PM 888 kafka-reassign-partitions.bat
10/27/2017 09:26 PM 880 kafka-replay-log-producer.bat
10/27/2017 09:26 PM 886 kafka-replica-verification.bat
10/27/2017 09:26 PM 5,276 kafka-run-class.bat
10/27/2017 09:26 PM 1,377 kafka-server-start.bat
10/27/2017 09:26 PM 997 kafka-server-stop.bat
10/27/2017 09:26 PM 882 kafka-simple-consumer-shell.bat
10/27/2017 09:26 PM 875 kafka-topics.bat
19 File(s) 21,964 bytes
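We will run these scripts in the next post. As a preview, once ZooKeeper is up, the broker is typically started with the default server properties file, and kafka-topics.bat can create a topic (the topic name test here is just an example):

F:\kafka_2.12-1.0.0\bin\windows>kafka-server-start.bat ..\..\config\server.properties
F:\kafka_2.12-1.0.0\bin\windows>kafka-topics.bat --create --zookeeper localhost:2181 --replication-factor 1 --partitions 1 --topic test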
This concludes the first post on Apache Kafka ...