Pangool Streaming ? #6

epalace · 2012-04-11T13:20:55Z

Is Pangool able to work with Hadoop Streaming?

ivanprado · 2012-04-11T14:28:57Z

The current parameters of Hadoop Streaming are:

Options:
-input <path> DFS input file(s) for the Map step
-output <path> DFS output directory for the Reduce step
-mapper <cmd|JavaClassName> The streaming command to run
-combiner <cmd|JavaClassName> The streaming command to run
-reducer <cmd|JavaClassName> The streaming command to run
-file <file> File/dir to be shipped in the Job jar file
-inputformat TextInputFormat(default)|SequenceFileAsTextInputFormat|JavaClassName Optional.
-outputformat TextOutputFormat(default)|JavaClassName Optional.
-partitioner JavaClassName Optional.
-numReduceTasks <num> Optional.
-inputreader <spec> Optional.
-cmdenv <n>=<v> Optional. Pass env.var to streaming commands
-mapdebug <path> Optional. To run this script when a map task fails
-reducedebug <path> Optional. To run this script when a reduce task fails
-io <identifier> Optional.

The reduce script receives all the data ungrouped, so the script itself is responsible for detecting key changes and building the groups manually.
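To make that concrete, here is a minimal sketch of what such a reduce script has to do by hand. It assumes tab-separated `key<TAB>value` lines that arrive sorted by key, and integer values that are summed; the function names and the sum are just illustrative choices, not anything from Pangool or Hadoop.

```python
from itertools import groupby


def parse(line):
    """Split one streaming record into (key, value) at the first tab."""
    key, _, value = line.rstrip("\n").partition("\t")
    return key, value


def reduce_stream(lines):
    """Yield (key, total) for each run of consecutive equal keys.

    Relies on the lines arriving sorted by key; the grouping is done
    here, in the script, because nothing upstream does it.
    """
    records = (parse(line) for line in lines)
    for key, group in groupby(records, key=lambda kv: kv[0]):
        yield key, sum(int(value) for _, value in group)


if __name__ == "__main__":
    # In an actual reduce script, pass sys.stdin instead of this sample.
    sample = ["apple\t1\n", "apple\t2\n", "pear\t5\n"]
    for key, total in reduce_stream(sample):
        print(f"{key}\t{total}")
```

`groupby` only merges adjacent equal keys, which is exactly the "detect a change in key" logic described above.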

It seems we could make the streaming job configurable, allowing group-by and sort-by options to be defined. The reduce and combiner scripts would then be called once per group. That could be inefficient, since the startup and shutdown times of the scripts can be significant. On the other hand, it may be useful.
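A rough sketch of the "once per group" idea, to show where the cost would come from: one process is spawned for every key group. This is only an illustration under assumptions. `run_per_group` and the summing command are hypothetical, and the records are assumed to be `(key, value)` pairs already sorted by key.

```python
import subprocess
import sys
from itertools import groupby


def run_per_group(records, command):
    """Run `command` once per key group, feeding the group's values on stdin.

    `records` must be (key, value) pairs already sorted by key.
    Returns a list of (key, stdout) pairs.
    """
    results = []
    for key, group in groupby(records, key=lambda kv: kv[0]):
        payload = "".join(f"{value}\n" for _, value in group)
        # One process start and stop per group: this is the overhead.
        done = subprocess.run(
            command, input=payload, capture_output=True, text=True, check=True
        )
        results.append((key, done.stdout.strip()))
    return results


if __name__ == "__main__":
    # Hypothetical per-group script: sums the numbers it reads from stdin.
    summer = [sys.executable, "-c",
              "import sys; print(sum(int(x) for x in sys.stdin))"]
    records = [("a", "1"), ("a", "2"), ("b", "5")]
    print(run_per_group(records, summer))
```

With many small groups, the process launches would dominate the run time, which is the inefficiency mentioned above.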

We could also allow an intermediate schema to be provided, so that text is translated to Tuples after the mapper. That allows:

Smaller serialization size: primitive types (int, double, etc.) are serialized as bytes, not strings
Improved sorting: sorting by numbers does not need padding
Sorting by fields in a different order than they have in the input record, without rewriting the record in the mapper
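The three points can be illustrated with plain Python. This is not Pangool's actual wire format; the fixed-width `struct` encodings, the helper names, and the sample records are all assumptions made for the sake of the example.

```python
import struct
from operator import itemgetter


def text_size(value):
    """Bytes needed when the value travels as its decimal string."""
    return len(str(value).encode("ascii"))


def binary_size(value):
    """Bytes needed as a fixed-width 32-bit int or 64-bit double."""
    fmt = ">d" if isinstance(value, float) else ">i"
    return len(struct.pack(fmt, value))


def sort_as_text(numbers):
    """Sort numbers the way a text-only pipeline compares them."""
    return sorted(str(n) for n in numbers)


def sort_padded(numbers, width):
    """The usual workaround: zero-pad so text order matches numeric order."""
    return sorted(str(n).zfill(width) for n in numbers)


def sort_by_fields(records, *positions):
    """Sort typed records by any field order without rewriting them."""
    return sorted(records, key=itemgetter(*positions))


if __name__ == "__main__":
    print(text_size(1234567890), binary_size(1234567890))
    print(text_size(3.141592653589793), binary_size(3.141592653589793))
    print(sort_as_text([9, 10, 200]), sort_padded([9, 10, 200], 3))
    records = [("b", 10, 2.5), ("a", 200, 7.0), ("c", 9, 1.0)]
    print(sort_by_fields(records, 1, 0))
```

Note that a fixed-width binary int is not smaller for every value (a single-digit number is shorter as text), so the size gain depends on the data and on the encoding the schema layer actually uses.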

pereferrera · 2013-10-01T08:46:05Z

Sorry, I don't get what "Hadoop Streaming" has to do with Pangool.

In my mind, one uses Hadoop Streaming for reasons orthogonal to those for using Pangool or Java MapRed.

Unless you can elaborate more on why this is useful, I don't see it. There are already very good APIs on top of Hadoop Streaming, such as the Python MapRed APIs.

ivanprado · 2013-10-01T10:30:54Z

It could make sense at some point to build some kind of "Hadoop Streaming"
on top of Pangool, making use of its power for managing schemas. It
would be more efficient than the default Hadoop Streaming in the sense that
the intermediate serialization would be much more compact. Also, results could
optionally be written to TupleFiles easily.

Anyway, I don't see that as a big priority, so I would close the ticket.
