Reliable Event Data Transport at Scale
A lot is said these days about parallel computing, large-scale storage, and real-time analytics. Relatively little is said about transporting event data from producers to consumers reliably and at scale. The problem is not unique to any one organization dealing with big data, and it is definitely non-trivial. Hadoop is the de facto platform for storing and crunching data; for event data transport, however, several efforts are still evolving.
Recently I have been looking at some of the available technology choices for transporting log event data across distributed subsystems. These systems are distributed within a data center and also spread over a WAN. The goal was a reliable, efficient, scalable transport mechanism that supports batch as well as near-real-time consumers. The options I primarily looked at are Flume, Scribe, and Kafka. I did not look at enterprise messaging systems like ActiveMQ and RabbitMQ, as they do not cater to the scale and throughput requirements we have.
Some folks asked me how these compare, so I thought I would publish the matrix I created some time back, comparing them on the attributes I consider crucial.
I have excluded Flume from the comparison below. Although Flume is a promising and extensible system, it is currently in a bit of flux and is being rewritten under the Flume NG code name - FLUME-728. Flume NG addresses concerns about reliable data transfer: it has been simplified and adds durable channels and transaction semantics. Currently an alpha of Flume NG is out with a minimal feature set.
| Attribute | Kafka | Scribe |
| --- | --- | --- |
| Community | Active - Apache Incubator. In production at LinkedIn | No community; no features or bug fixes in the last year. In production at Facebook, but on a different codebase, rewritten in Java as Calligraphus (not available outside Facebook) |
| Paradigm | Pull. Consumers pull from the Kafka cluster | Push. Not suitable for a central service pushing to disparate consumers. Data can be pushed to a reliable store like HDFS, and consumers pick it up from HDFS |
| Compression support | Yes - gzip. Transparent to producers and consumers | No |
| Dependency | ZooKeeper | HDFS (assuming HDFS is used as the store) |
| Cross-DC replication | Yes, via mirroring | HDFS distcp |
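Transparent compression, as in the table above, means the producer can gzip a whole batch of messages and the consumer unpacks it without either application layer caring. A minimal Python sketch of the idea (this is a conceptual illustration, not Kafka's actual wire format; the framing with newlines is invented for the example):

```python
import gzip

def compress_batch(messages):
    """Producer side: pack a batch of messages into one gzip blob.
    Newline framing is a simplification for this sketch."""
    payload = b"\n".join(m.encode("utf-8") for m in messages)
    return gzip.compress(payload)

def decompress_batch(blob):
    """Consumer side: unpack transparently; callers just see messages again."""
    return gzip.decompress(blob).decode("utf-8").split("\n")

batch = ["click user=42", "click user=7", "pageview user=42"]
blob = compress_batch(batch)
assert decompress_batch(blob) == batch
```

Batching before compression matters: gzip over many small messages amortizes far better than compressing each message individually.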
Producer Side Integration
| Attribute | Kafka | Scribe |
| --- | --- | --- |
| Integration | Producer applications embed the Kafka client library. Messages are streamed directly to Kafka brokers | Producer applications embed the Scribe client library. Messages are streamed to a local agent |
| Language | JVM languages, Python | All Thrift-supported languages |
| Serialization | Pluggable | Pluggable; works on byte streams |
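To make the embedding model concrete, here is a toy Python sketch of a producer that embeds a "client library" with a pluggable serializer. The broker is a plain in-memory list standing in for the real network endpoint, and all names here are hypothetical, not the actual Kafka or Scribe API:

```python
import json

class Producer:
    """Toy producer: the application embeds this 'client library' and
    streams serialized messages to a broker (or local agent, in Scribe's
    model). The broker is an in-memory list standing in for the network."""
    def __init__(self, broker, serializer=None):
        self.broker = broker
        # Pluggable serialization, as both systems allow; JSON by default here.
        self.serializer = serializer or (lambda m: json.dumps(m).encode("utf-8"))

    def send(self, topic, message):
        self.broker.append((topic, self.serializer(message)))

broker = []  # pretend network endpoint
p = Producer(broker)
p.send("clicks", {"user": 42, "page": "/home"})
assert broker[0][0] == "clicks"
```

Swapping the `serializer` argument is what "pluggable" means in practice: the transport only ever sees bytes.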
Consumer Side Integration
| Attribute | Kafka | Scribe |
| --- | --- | --- |
| Semantics | Pull, based on message offsets | Data is pushed to a reliable store like HDFS; all consumers consume from HDFS |
| Stream view | Yes | No; file-based view, assuming HDFS as the store. A stream-view library could be built on top of HDFS, similar to Ptail |
| Language | JVM languages | Any supported HDFS client |
| Multi-subscriber | Yes - consumer groups, with automatic load balancing across the consumers in a group | No |
| Data completeness measure | Can be implemented with an out-of-band topic | Hard to implement |
| In-order data arrival | Ordered per partition | No |
| Availability | Partial, temporary loss of data residing on an unavailable broker. High availability of partitions via replication is not there yet - KAFKA-50 | Limited by HDFS availability |
| Map-Reduce friendly | Yes; data can be pulled into HDFS via an MR job | Yes; data is already on HDFS |
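The pull-by-offset and consumer-group semantics in the table can be sketched in a few lines. This is a conceptual toy, not the Kafka protocol: the broker is an in-memory log, and the round-robin assignment merely illustrates how a group spreads partitions across its members:

```python
class Broker:
    """Toy log: an append-only list per partition; consumers pull by offset."""
    def __init__(self, partitions=2):
        self.partitions = [[] for _ in range(partitions)]

    def append(self, partition, msg):
        self.partitions[partition].append(msg)

    def fetch(self, partition, offset, max_msgs=100):
        """Pull semantics: the consumer says where to resume, so it can
        replay or catch up at its own pace."""
        return self.partitions[partition][offset:offset + max_msgs]

def assign(num_partitions, consumers):
    """Consumer-group style load balancing: each partition is owned by
    exactly one consumer in the group, spread round-robin."""
    return {p: consumers[p % len(consumers)] for p in range(num_partitions)}

b = Broker()
b.append(0, "m0"); b.append(0, "m1"); b.append(1, "m2")
assert b.fetch(0, offset=1) == ["m1"]                 # resume from offset 1
assert assign(2, ["c1", "c2"]) == {0: "c1", 1: "c2"}  # one owner per partition
```

Note how ordering falls out of this model for free, but only per partition, matching the "ordered per partition" row above.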
Reliability and Operations
| Attribute | Kafka | Scribe |
| --- | --- | --- |
| Producer spooling to disk | Not yet; under development - KAFKA-156 | Agent spooling exists; producer spooling does not. Messages are dropped if the local agent is down |
| Agent/broker fail-over | Messages are rerouted to an available broker in the cluster | A secondary can be configured via a buffer store |
| Producer acks | Not yet - KAFKA-49 | TRY_LATER is returned by the next hop |
| Data replication | Intra-cluster replication not yet - KAFKA-50 | Not needed in Scribe; replication happens in HDFS |
| Components | ZooKeeper cluster, brokers | Scribe agent on each producer machine, Scribe collectors, HDFS |
| Metrics | Exposed over JMX | Exposed over Thrift |
| Configuration | New topics and consumers can be created on the fly | New topics and consumers can be created on the fly |
| Adding more brokers/collectors | Can be added dynamically | No; Scribe collectors behind a load balancer? |
| Graceful decommission | Not yet - KAFKA-155 | Yes |
| Online upgrades | Rolling upgrades will be possible after KAFKA-155 | Rolling upgrades for Scribe collectors. HDFS upgrades will stall consumers getting data, though Scribe collectors will spool during that time |
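Spooling, which the table flags as missing or partial on the producer side in both systems, is essentially "buffer to local disk on failure, replay on recovery". A toy sketch of that pattern (the class and file layout are invented for illustration, not taken from either codebase):

```python
import os
import tempfile

class SpoolingSender:
    """Toy sketch of spool-on-failure: if the downstream send fails,
    messages are appended to a local spool file and replayed later."""
    def __init__(self, send_fn, spool_path):
        self.send_fn = send_fn
        self.spool_path = spool_path

    def send(self, msg):
        try:
            self.send_fn(msg)
        except IOError:
            with open(self.spool_path, "a") as f:  # durable local buffer
                f.write(msg + "\n")

    def replay(self):
        """Call once the downstream is back; drains the spool in order."""
        if not os.path.exists(self.spool_path):
            return
        with open(self.spool_path) as f:
            for line in f:
                self.send_fn(line.rstrip("\n"))
        os.remove(self.spool_path)

delivered = []
down = True
def flaky_send(msg):
    if down:
        raise IOError("broker unavailable")
    delivered.append(msg)

spool = os.path.join(tempfile.mkdtemp(), "spool.log")
s = SpoolingSender(flaky_send, spool)
s.send("event-1")   # downstream is down: message is spooled to disk
down = False
s.replay()          # downstream is back: spool is drained in order
assert delivered == ["event-1"]
```

A real implementation also has to bound spool size and handle partial writes, which is exactly why this feature is non-trivial enough to be tracked as open work (KAFKA-156).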
In summary, none of these can be used right away for end-to-end durable data transport carrying critical, revenue-bearing data. Kafka has great multi-subscriber support with very low message latencies, but some of its reliability and operability features are still a work in progress. Scribe is OK for data transport, with some operability limitations; pushing messages from a central Scribe service to disparate consumers is not ideal, and the recommended way is to land messages in a reliable, scalable store like HDFS and have consumers pull from there. Facebook has rewritten Scribe in Java as Calligraphus, which they use internally in production. It would be interesting to see whether Facebook open-sources it along with the Ptail consumer library.
Flume's architecture is very similar to Scribe's. Flume NG aims to be simpler and to solve end-to-end data durability, but it is in a nascent state as of now.
There is a lot of activity on Kafka and Flume NG. I hope the momentum continues, and that these projects see widespread adoption and mature into handling critical data in the near term.