Kafka Connect is a tool for reliably and scalably streaming data between Apache Kafka and other systems. It is an open-source component and framework for connecting Kafka with external systems, with connectors that move large data sets into and out of Kafka. Kafka Connect focuses only on copying streaming data, so its scope is deliberately narrow. It can run as a standalone process for development and testing, or as a distributed, scalable service for an entire organization.
Kafka Connect provides existing connector implementations for moving common kinds of data:
- Source Connector: A source connector can ingest entire databases and stream table updates to Kafka topics. It can also collect metrics from all of an application's servers into topics, making the data available for stream processing with low latency.
- Sink Connector: A sink connector delivers data from Kafka topics into secondary indexes such as Elasticsearch, or into batch systems such as Hadoop for offline analysis.
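As a concrete illustration, the FileStreamSource connector that ships with Apache Kafka can be configured with a small properties file. This is a sketch for standalone mode; the connector name, file path, and topic below are placeholder values:

```properties
# Sketch of a standalone source connector configuration.
# The FileStreamSource connector ships with Apache Kafka;
# name, file, and topic are placeholder values.
name=local-file-source
connector.class=org.apache.kafka.connect.file.FileStreamSourceConnector
tasks.max=1
file=/tmp/input.txt
topic=connect-test
```

A sink connector is configured the same way, with `topics` naming the Kafka topics to read from instead of a `topic` to write to.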
Features of Kafka Connect
Kafka Connect offers the following features:
- Common framework: Kafka Connect works as a common framework for connectors, allowing other systems to be integrated with Kafka. This makes connector development, deployment, and management simple.
- Standalone or distributed modes: Kafka Connect can scale up to a centrally managed service for an entire organization, or scale down for development, testing, and small production deployments.
- REST interface: Connectors are submitted to and managed in a Kafka Connect cluster through a REST API.
- Automatic offset management: Kafka Connect can manage the offset commit process automatically, requiring only a little information from the connectors.
- Distributed and scalable: Kafka Connect is distributed and scalable by default, so a Kafka Connect cluster can be scaled up simply by adding more workers.
- Streaming or batch integration: Kafka Connect provides a solution for bridging streaming and batch systems.
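To make the REST interface concrete, a connector is registered by sending a JSON body that names the connector and carries its configuration. The following Python sketch only builds and prints such a body; the connector name, file path, and topic are placeholder assumptions:

```python
import json

# Sketch of the JSON body used to register a connector through the
# Kafka Connect REST API (sent as POST /connectors to a worker,
# e.g. http://localhost:8083). All values below are placeholders.
payload = {
    "name": "local-file-source",
    "config": {
        "connector.class": "org.apache.kafka.connect.file.FileStreamSourceConnector",
        "tasks.max": "1",
        "file": "/tmp/input.txt",
        "topic": "connect-test",
    },
}

body = json.dumps(payload, indent=2)
print(body)
```

With a worker running, this body would be posted to `/connectors`; the same API also exposes endpoints such as `GET /connectors` to list the connectors currently running on the cluster.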
Kafka Connect Terminologies
Some important terms help in understanding Kafka Connect:
- Connectors: Connectors coordinate and manage the copying of data between Kafka and other systems. A connector instance is created to manage the data streaming. All the classes a connector uses are defined in its connector plugin.
- Tasks: Tasks do the actual work of copying data to or from Kafka. Each connector instance coordinates a set of tasks that copy the data, and a connector can break a single job into many tasks. This provides built-in support for parallel, scalable data copying with very little configuration. Because tasks can be started, stopped, or restarted at any time to provide a scalable and resilient pipeline, their state is preserved in special topics ('config.storage.topic' and 'status.storage.topic') and managed by the associated connectors.
- Workers: Connectors and tasks are logical units of work; workers are the running processes that actually execute them.
There are two types of workers:
- Standalone Workers: In standalone mode, a single process executes all connectors and tasks. This is the simplest mode and requires the least configuration, but it offers limited functionality and scalability, and no fault tolerance beyond monitoring.
- Distributed Workers: Unlike standalone mode, distributed mode provides scalability and automatic fault tolerance. Multiple worker processes configured with the same group id execute the connectors and tasks, automatically scheduling the work across all active workers. If a new worker is added, or a worker fails or shuts down, the remaining workers redistribute the work.
- Converters: Converters are the code that translates data between Kafka Connect and the system sending or receiving it. Tasks use converters to change data between the raw bytes stored in Kafka and Kafka Connect's internal data format.
- Transforms: A transform alters a message to make it simpler or lighter. It is a simple function that takes a single record as input, modifies it, and outputs the record. Kafka Connect ships with a number of transforms that perform simple and useful modifications, such as Cast, Drop, ExtractTopic, and many more.
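To make the converter and transform roles concrete, here is a toy Python sketch (not the real API, which consists of Java interfaces such as `Converter` and `Transformation`): a JSON "converter" turns raw bytes into a record, and a small "transform" modifies that one record, in the spirit of the built-in InsertField transform. The message content and field names are illustrative assumptions.

```python
import json

# Toy stand-in for a converter: Kafka stores raw bytes, and a
# converter translates them into Kafka Connect's internal format.
def json_converter(raw: bytes) -> dict:
    return json.loads(raw.decode("utf-8"))

# Toy stand-in for a single message transform: take one record,
# modify it, and return it (here, adding a static field, roughly
# what the built-in InsertField transform does).
def insert_field(record: dict, field: str, value: str) -> dict:
    out = dict(record)
    out[field] = value
    return out

raw_message = b'{"user": "alice", "action": "login"}'
record = json_converter(raw_message)
record = insert_field(record, "source", "auth-service")
print(record)  # {'user': 'alice', 'action': 'login', 'source': 'auth-service'}
```

In a real pipeline the worker applies the converter and any configured chain of transforms to every record as it moves between Kafka and the external system.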
Advantages of Kafka Connect
- Data-centric pipeline: Kafka Connect uses meaningful data abstractions to push data to or pull data from Kafka.
- Flexible and scalable: Kafka Connect can run with streaming and batch-oriented systems on a single node, or scale out to an organization-wide service.
- Reusability and extensibility: Existing connectors can be extended and customized to match the user's needs.