Apache Flume: Distributed Log Collection for Hadoop by Steve Hoffman

Posted on February 25, 2017 at 4:52 pm

By Steve Hoffman

Stream data to Hadoop using Apache Flume


  • Integrate Flume with your data sources
  • Transcode your data en-route in Flume
  • Route and separate your data using regular expression matching
  • Configure failover paths and load-balancing to remove single points of failure
  • Utilize Gzip compression for files written to HDFS

In Detail

Apache Flume is a distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of log data. Its main goal is to deliver data from applications to Apache Hadoop's HDFS. It has a simple and flexible architecture based on streaming data flows. It is robust and fault tolerant, with many failover and recovery mechanisms.

Apache Flume: Distributed Log Collection for Hadoop covers the problems of HDFS and streaming data/logs, and how Flume can resolve them. The book explains the generalized architecture of Flume, which includes moving data to and from databases and NoSQL-ish data stores, as well as optimizing performance. It also includes real-world scenarios of Flume implementation.

Apache Flume: Distributed Log Collection for Hadoop starts with an architectural overview of Flume and then discusses each component in detail. It guides you through the complete installation process and the compilation of Flume.

It gives you a heads-up on how to use channels and channel selectors. For each architectural component (Sources, Channels, Sinks, Channel Processors, Sink Groups, and so on), the various implementations are covered in detail along with their configuration options. You can use it to customize Flume to your specific needs. There are also pointers on writing custom implementations that will help you learn and implement them.

By the end, you should be able to construct a series of Flume agents to transport your streaming data and logs from your systems into Hadoop in near real time.

What you will learn from this book

    • Understand the Flume architecture
    • Download and install open source Flume from Apache
    • Discover when to use a memory or file-backed channel
    • Understand and configure the Hadoop File System (HDFS) sink
    • Learn how to use sink groups to create redundant data flows
    • Configure and use various sources for data
    • Inspect data records and route to different or multiple destinations based on payload content
    • Transform data en-route to Hadoop
    • Monitor your information flows


    A starter guide that covers Apache Flume in detail.

    Who this book is written for

    Apache Flume: Distributed Log Collection for Hadoop is intended for people who are responsible for moving datasets into Hadoop in a timely and reliable manner, such as software engineers, database administrators, and data warehouse administrators.

    Extra resources for Apache Flume: Distributed Log Collection for Hadoop

    Sample text

    You'll want to size this higher if your ingestion is heavy enough that you can't tolerate normal planned or unplanned outages. For instance, there are many configuration changes you can make in Hadoop that require a cluster restart. If you have Flume writing important data into Hadoop, the file channel should be sized to tolerate the time it takes to restart Hadoop (and maybe add a comfort buffer for the unexpected). If your cluster or other systems are unreliable, you can set this higher to handle even larger amounts of downtime.
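As a rough illustration of the sizing advice above, the following Python sketch estimates how many events a channel would need to hold to ride out a Hadoop restart. The function name, the 500 events per second rate, the 15-minute outage, and the 1.5 comfort factor are all hypothetical values chosen for the example; the agent name `agent` and channel name `c1` in the printed line are placeholders too. The result is meant to feed the file channel's `capacity` setting, which counts events.

```python
import math


def file_channel_capacity(events_per_second: float,
                          outage_seconds: float,
                          comfort_factor: float = 1.5) -> int:
    """Estimate the number of events a channel must buffer during an outage.

    comfort_factor is the "comfort buffer for the unexpected": 1.0 means no
    extra headroom, 1.5 means 50% more than the planned outage needs.
    """
    if events_per_second < 0 or outage_seconds < 0:
        raise ValueError("rate and outage duration must be non-negative")
    if comfort_factor < 1:
        raise ValueError("comfort_factor must be at least 1")
    return math.ceil(events_per_second * outage_seconds * comfort_factor)


if __name__ == "__main__":
    # Hypothetical workload: 500 events/s, 15-minute cluster restart.
    capacity = file_channel_capacity(500, 15 * 60)
    # "agent" and "c1" are placeholder names for the agent and channel.
    print(f"agent.channels.c1.capacity = {capacity}")
```

The estimate only covers event count. Disk space for the channel's data directories has to be checked separately, since it depends on the average event size.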

    The behavior of this grouping is dictated by something called the sink processor, which determines how events are routed. There is a default sink processor that contains a single sink that is used whenever you have a sink that isn't part of any sink group. Our Hello World example in Chapter 2, Flume Quick Start, used the default sink processor. No special configuration is necessary for single sinks. In order for Flume to know about the sink groups, there is a new top-level agent property called sinkgroups.
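To make the `sinkgroups` property more concrete, here is a small Python sketch that renders the properties lines declaring one sink group. The excerpt only names the top-level `sinkgroups` property; the `failover` processor type and the per-sink `processor.priority` keys are taken from the Flume user guide rather than from this passage, and the agent, group, and sink names (`agent`, `sg1`, `k1`, `k2`) are hypothetical.

```python
from typing import List


def sink_group_properties(agent: str,
                          group: str,
                          sinks: List[str],
                          processor_type: str = "failover") -> List[str]:
    """Render the properties lines that declare a single sink group.

    For the failover processor, sinks earlier in the list receive a larger
    priority value, since a higher priority is tried first.
    """
    if len(sinks) < 2:
        raise ValueError("a sink group is only useful with two or more sinks")
    prefix = f"{agent}.sinkgroups"
    lines = [
        f"{prefix} = {group}",
        f"{prefix}.{group}.sinks = {' '.join(sinks)}",
        f"{prefix}.{group}.processor.type = {processor_type}",
    ]
    if processor_type == "failover":
        for rank, sink in enumerate(sinks):
            priority = (len(sinks) - rank) * 10
            lines.append(
                f"{prefix}.{group}.processor.priority.{sink} = {priority}"
            )
    return lines


if __name__ == "__main__":
    # Placeholder names: agent "agent", group "sg1", sinks "k1" and "k2".
    print("\n".join(sink_group_properties("agent", "sg1", ["k1", "k2"])))
```

A sink that is not listed in any group needs none of these lines, because it falls back to the default sink processor described above.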

    MaxPenality is reached.

    Summary

    In this chapter we covered in depth the HDFS sink, the Flume output that writes streaming data into the HDFS. We covered how Flume can separate data into different HDFS paths based on time or contents of Flume headers. Several file-rolling techniques were also discussed, including the following:

      • Time rotation
      • Event count rotation
      • Size rotation
      • Rotation on idle only

    Compression was discussed as a means to reduce storage requirements in HDFS and should be used when possible.
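The four rotation techniques named in the summary can be sketched as a single decision function. This is an illustrative Python model, not Flume's implementation: the `RollPolicy` class and `roll_reasons` function are hypothetical, and the fields are only loosely modeled on the HDFS sink's `hdfs.rollInterval`, `hdfs.rollCount`, `hdfs.rollSize`, and `hdfs.idleTimeout` settings, where a value of 0 disables the corresponding trigger.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class RollPolicy:
    roll_interval: float = 30.0  # seconds a file may stay open; 0 disables
    roll_count: int = 10         # events per file; 0 disables
    roll_size: int = 1024        # bytes per file; 0 disables
    idle_timeout: float = 0.0    # seconds without a write; 0 disables


def roll_reasons(policy: RollPolicy,
                 open_seconds: float,
                 events_written: int,
                 bytes_written: int,
                 idle_seconds: float) -> List[str]:
    """Return the names of every rotation trigger that has fired."""
    reasons = []
    if policy.roll_interval > 0 and open_seconds >= policy.roll_interval:
        reasons.append("time")
    if policy.roll_count > 0 and events_written >= policy.roll_count:
        reasons.append("event count")
    if policy.roll_size > 0 and bytes_written >= policy.roll_size:
        reasons.append("size")
    if policy.idle_timeout > 0 and idle_seconds >= policy.idle_timeout:
        reasons.append("idle")
    return reasons


if __name__ == "__main__":
    # "Rotation on idle only": disable the other three triggers.
    idle_only = RollPolicy(roll_interval=0, roll_count=0, roll_size=0,
                           idle_timeout=60)
    print(roll_reasons(idle_only, open_seconds=3600, events_written=50000,
                       bytes_written=10_000_000, idle_seconds=75))
```

Modeling the triggers as independent checks shows why "rotation on idle only" means zeroing out the other three: any trigger left enabled would still close the file on its own schedule.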
