sharad_ag

  • Archive
  • RSS

April Hadoop Meetup in Bangalore

We had tremendous interest this time in the Hadoop Meetup in Bangalore. About 180 people wanted to attend the meetup but due to limitation on the space we had to restrict to 150. These many people can’t fit any of the rooms so we did it in the open office floor at InMobi.

Zainab Bawa helped us with the video recording. We will sharing the videos and slides soon on the meetup group.

Arun talking about Big Data and Hadoop 2.0.

  • 1 year ago
  • Comments
  • Permalink
Share

Short URL

TwitterFacebookPinterestGoogle+

First Hadoop Bangalore Meetup

Last week InMobi hosted the first Hadoop Meetup in Bangalore. About 90 people attended, with diverse background and different level of Hadoop experience. There was huge response for the meetup and unfortunately we had to turn away few folks due to limited space. Next time hoping to have a bigger facility to accommodate more.

Govind Kanshi has written a nice detailed coverage on the event. Thanks everybody who attended and for the engaging discussions.

  • 1 year ago
  • Comments
  • Permalink
Share

Short URL

TwitterFacebookPinterestGoogle+

Reliable Event Data Transport at Scale

A lot is spoken about parallel computing, large scale storage and realtime analytics these days. Relatively lot less is said about transporting event data from producer to consumer reliably and at scale. The problem is not unique to any organization dealing with BigData and definitely non-trivial. Hadoop is the defacto platform for storing and crunching data. However for event data transport, lot of efforts are evolving currently.

Recently I have been looking at some of the available technology choices for transporting the log event data across different distributed sub systems. These systems are distributed within a data-center and also spread over WAN. The goals were to have a reliable, efficient, scalable transport mechanism which supports batch as well as near real-time data consumers. The options which I primarily looked at are FLUME, SCRIBE and KAFKA. I did not look at enterprise messaging systems like ActiveMQ and RabbitMQ as they don’t cater to the scale and thru-put requirements we have.

Some folks were asking me how do you compare these. So I thought of publishing the matrix which I created sometime back comparing these on different attributes which I think are crucial.

I have excluded Flume from the comparison below. Flume although being very promising and extensible system, is currently in a bit of flux and being rewritten under Flume NG code name - FLUME-728. Flume NG addresses concerns about the reliable data transfer. It has been simplified and has durable channels and transactions semantics. Currently the alpha of FLUME NG is out with minimal feature set.

Attribute Kafka Scribe
License Apache Apache
Community Active - Apache Incubator Production at Linkedin No cummunity, no feature/bug fixes in last one year. Production at facebook but a different codebase. Rewritten in java as Calligraphus (not available outside facebook)
Documentation Growing Very thin
Paradigm Pull. Consumers pull from Kafka cluster Push. Not suitable for central service to push to disparate consumers. Data can be pushed to a reliable store like HDFS. Consumers pick from HDFS.
Language scala c++
Compression support Yes - gzip. Transparent to producers and consumers. No
Dependency Zookeeper HDFS (assuming HDFS is used as store)
Cross DC replication Yes. Via Mirroring HDFS distcp
Producer Side Integration
Integration Producer application embed Kafka client library. Messages are streamed directly to Kafka brokers. Producer application embed scribe client library. Messages are streamed to local agent.
Language JVM languages, python, All thrift supported
Encoder
Pluggable Pluggable, works on byte stream
Consumer Side Integration
Semantics Pull based on message offset Data is pushed to reliable store like HDFS. All consumers consume from HDFS
Stream view Yes No. File based view assuming HDFS as store. Stream view library can be built on top of HDFS something similar to Ptail.
Language Jvm languages Supported HDFS clients
Multi-subscriber Yes - consumer groups. Automatic load balancing between all consumers in group No
Data completeness measure Can be implemented by having out of band topic Hard to implement
In-order data arrival Ordered per Partition No
Availability Partial and temporary loss of data residing on unavailable broker. High availability of partitions achieved by replication, not yet there - KAFKA-50 Limited by HDFS availability
Map-Reduce friendly Yes. Data can be pulled to HDFS via MR job Yes. Data already on HDFS
Reliability/Message Durability
Producer spooling to disk Not yet. Under development - KAFKA-156 Agent spooling is there. Producer spooling not there. Messages dropped if local agent is down
Agent/Broker fail-over Messages rerouted to available broker in the cluster Secondary can be configured via buffer store
Producer Acks Not yet. KAFKA-49 TRY_LATER is returned by next hop
Data Replication Intra cluster not yet. KAFKA-50. Not needed in scribe. Replication in HDFS
Operability
Setup
ZK cluster, Brokers Scribe agent on each producer machine, scribe collectors, HDFS
Metrics Exposed over JMX Over thrift
Configuartion New topics and consumers can be created on the fly New topics and consumers can be created on the fly
Adding more collectors/brokers Can be added dynamically No. Scribe collectors behind a Load Balancer ?
Graceful decommission Not yet - KAFKA-155 Yes
Online Upgrades Rolling upgrades can be done after KAFKA-155 Rolling upgrades for scribe collectors. HDFS upgrades will stall consumer getting data. Though scribe collectors will spool during that time

In summary, none of these can be used right away for end to end durable data transport which can carry critical and revenue bearing data. Kafka has a great multi-subscriber support with very low message latencies. Some of the reliability and operability features are work in progress. Scribe is ok for data transport with some operability limitations. Also pushing messages from a central scribe service to disparate consumers is not ideal. The recommended way is to get the messages to a reliable and scalable store like HDFS and consumers pull from there. Facebook has rewritten Scribe in java as Calligraphus which is used in their production internally. It would be interesting to see if Facebook put it out in open source along with the PTAIL consumer library.
Flume architecture is very similar to Scribe. Flume NG aims to be simpler and solve the end to end data durability; but in nascent state as of now.
There is lot of activity on Kafka and Flume NG. Hoping the momentum continues; these projects get widespread adoption and matures in handling critical data in near term.
  • 1 year ago
  • 4
  • Comments
  • Permalink
Share

Short URL

TwitterFacebookPinterestGoogle+

Finally..

After procrastination for too long, finally I have created this web presence. I have been keeping too busy communicating to machines, thinking in programming abstractions. Here I would try to organize some of the random left overs in my head intended for human consumption. Here you can expect posts about Big Data technologies, Hadoop, open source, cloud computing and yes just about anything.

  • 1 year ago
  • Comments
  • Permalink
Share

Short URL

TwitterFacebookPinterestGoogle+

About

Avatar Hadoop Committer. Working on Hadoop Map-Reduce for 4 years. Recently being part of Hadoop Nextgen/YARN/MRv2.

Author of the YARN core including event driven Asynchronous Actors library, Lightweight State Machine library and Component life-cycle framework. Author of Map-Reduce Application Master.

LinkedIn Profile

Twitter

loading tweets…

  • RSS
  • Random
  • Archive
  • Mobile
Effector Theme by Pixel Union