obtain result of map-reduce job as stream - java

I am writing a map-reduce job in Java and I would like to know whether it is possible to obtain the output of the job as a stream (maybe an output stream) rather than a physical output file. My objective is to use the stream in another application.

You can write a custom OutputFormat and use it to write to any stream you want, not necessarily a file. See this tutorial on how to write a custom OutputFormat.
Or else you can make use of the Hadoop Streaming API. Have a look here for that.
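Below is a minimal sketch of the first idea, assuming the new org.apache.hadoop.mapreduce API: an OutputFormat whose RecordWriter pushes each record to a socket instead of an HDFS file. The class name and the configuration keys (socket.output.host, socket.output.port) are hypothetical placeholders, not part of Hadoop.

import java.io.IOException;
import java.io.PrintWriter;
import java.net.Socket;

import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.OutputCommitter;
import org.apache.hadoop.mapreduce.OutputFormat;
import org.apache.hadoop.mapreduce.RecordWriter;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.output.NullOutputFormat;

public class SocketOutputFormat<K, V> extends OutputFormat<K, V> {

    @Override
    public RecordWriter<K, V> getRecordWriter(TaskAttemptContext context) throws IOException {
        // Hypothetical configuration keys; set them on the job before submission.
        String host = context.getConfiguration().get("socket.output.host", "localhost");
        int port = context.getConfiguration().getInt("socket.output.port", 9999);
        final Socket socket = new Socket(host, port);
        final PrintWriter writer = new PrintWriter(socket.getOutputStream(), true);

        return new RecordWriter<K, V>() {
            @Override
            public void write(K key, V value) {
                // Each task streams its records as they are produced.
                writer.println(key + "\t" + value);
            }

            @Override
            public void close(TaskAttemptContext ctx) throws IOException {
                writer.close();
                socket.close();
            }
        };
    }

    @Override
    public void checkOutputSpecs(JobContext context) {
        // Nothing to validate: there is no output path.
    }

    @Override
    public OutputCommitter getOutputCommitter(TaskAttemptContext context)
            throws IOException, InterruptedException {
        // Reuse a committer that performs no file moves.
        return new NullOutputFormat<K, V>().getOutputCommitter(context);
    }
}

Set the job's output format with job.setOutputFormatClass(SocketOutputFormat.class) and have the receiving application listen on the configured port. Keep in mind that several tasks may connect concurrently, so the receiver has to cope with multiple streams.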

I don't think you can do this with Apache Hadoop. It is designed to work as a distributed system, and AFAIK providing a way to emit an output stream would defeat the purpose: how would the system decide which reducer's stream to emit? You can write to a flat file, a DB, Amazon S3, etc., but you probably won't get a stream.

Related

Confused about inputstreams and reading from files

I tried to understand the logic behind input streams and reading from files, but I fail to understand how you can read from a file using an input stream.
My understanding is that when using input devices like a keyboard, you send input data through the input stream to the system. If you are reading from an input stream, aren't you reading the input data that is being sent to the system at that time?
If we create an input stream with the following code:
FileInputStream test = new FileInputStream("loremipsum.txt");
And if we try to read from the newly created input stream with test.read(), how is there any data flowing through the input stream? No input data has been entered from an input device at that point; it was all entered well beforehand. Is there something I'm missing? It almost seems to me that input streams are used in two different ways: Java uses input streams to read data from a source, while input devices use them to send data to a source.
Java streams are a general concept / interface - a stream of data that you need to open, then read the data from (or write data to for output streams), then close. The basic stream only supports sequential reading / writing, no random access. Also, the data may or may not be readily available when you attempt to read from the stream, so the read may or may not block.
This abstraction allows us to use the same approach regardless of where we read the data from - it might be the keyboard, a file, a network connection, the output from another program or even some kind of generator that produces an endless sequence of data. Simply put, reading the input from a file behaves the same as if someone in the background opened the file and typed its content on the keyboard really fast.
There are ways in Java to read a file in other ways (e.g. random access instead of sequential), but if you need to read the file from start to end, streams are a useful abstraction.
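To make that symmetry concrete, here is a small self-contained sketch in plain Java (reusing the loremipsum.txt file from the question) where the exact same reading code serves both a file and the keyboard; only the source of the bytes differs.

import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;

public class StreamDemo {

    // The same sequential-read loop works for any InputStream,
    // whether it is backed by a file, the keyboard, or a network socket.
    static int countBytes(InputStream in) throws IOException {
        int count = 0;
        while (in.read() != -1) {   // read() blocks until a byte is available, or returns -1 at end of stream
            count++;
        }
        return count;
    }

    public static void main(String[] args) throws IOException {
        // Reading from a file: the bytes already exist on disk, so read() returns them immediately.
        try (InputStream file = new FileInputStream("loremipsum.txt")) {
            System.out.println("File bytes: " + countBytes(file));
        }

        // Reading from the keyboard: read() blocks until you type something and
        // signal end-of-input (Ctrl-D on Unix, Ctrl-Z on Windows).
        System.out.println("Stdin bytes: " + countBytes(System.in));
    }
}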

Is there any way to make Cloud Dataflow output like a stream?

I use Google Cloud Dataflow to process bounded data and output to BigQuery, and I want it to process and write results as it goes (like a stream, not a batch). Is there any way I can do this?
Currently, Dataflow waits until the workers have processed all the data, then writes to BigQuery. I tried adding a FixedWindow and using the log timestamp as the window_timestamp, but it doesn't work.
I want to know:
Is windowing the right way to handle this problem?
Does BigQueryIO really write in batch, or does it just not show on my dashboard (writing in a background stream)?
Is there any way to do what I need?
My source code is here: http://pastie.org/10907947
Thank you very much!
You need to set the streaming property to true in your PipelineOptions.
See "streaming execution" for more information.
In addition, you'll need to use sources/sinks that can generate/consume unbounded data. BigQuery can already write in both modes, but currently TextIO only reads bounded data. However, it's definitely possible to write a custom unbounded source that scans a directory for new files.
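As a rough sketch of the first step, assuming the Dataflow Java SDK 1.x (com.google.cloud.dataflow) that the question appears to use, enabling streaming execution looks roughly like this:

import com.google.cloud.dataflow.sdk.Pipeline;
import com.google.cloud.dataflow.sdk.options.DataflowPipelineOptions;
import com.google.cloud.dataflow.sdk.options.PipelineOptionsFactory;
import com.google.cloud.dataflow.sdk.options.StreamingOptions;

public class StreamingPipeline {
    public static void main(String[] args) {
        DataflowPipelineOptions options = PipelineOptionsFactory
                .fromArgs(args)
                .withValidation()
                .as(DataflowPipelineOptions.class);
        // Switch the runner from batch to streaming execution.
        options.as(StreamingOptions.class).setStreaming(true);

        Pipeline p = Pipeline.create(options);
        // ... read from an unbounded source, apply windowing, write to BigQuery ...
        p.run();
    }
}

With streaming enabled, BigQueryIO writes via streaming inserts rather than batch load jobs, which is the behaviour the question is after.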

How do I log from a mapper? (hadoop with commoncrawl)

I'm using the commoncrawl example code from their "Mapreduce for the Masses" tutorial. I'm trying to make modifications to the mapper and I'd like to be able to log strings to some output. I'm considering setting up some noSQL db and just pushing my output to it, but it doesn't feel like a good solution. What's the standard way to do this kind of logging from Java?
While there is no special solution for the logs aside from the usual logger (at least none I am aware of), I can suggest a few options.
a) If the logs are for debugging purposes - indeed, write usual debug logs. In case of failed tasks you can find them via the UI and analyze them.
b) If these logs are some kind of output you want to get alongside other output from your job - assign them some special key and write them to the context. Then in the reducer you will need some special logic to put them into the output.
c) You can create a directory on HDFS and have the mapper write there. It is not the classic way for MR because it is a side effect - in some cases it can be fine. Especially taking into account that each mapper will create its own file, you can use the command hadoop fs -getmerge ... to get all the logs as one file.
d) If you want to be able to monitor the progress of your job, the number of errors, etc. - you can use counters (see the sketch below).
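For option d), here is a minimal sketch using Hadoop counters from inside a mapper; the group and counter names (MyApp, RecordsProcessed, Errors) are arbitrary examples.

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class CountingMapper extends Mapper<LongWritable, Text, Text, LongWritable> {

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        try {
            // ... normal record processing and context.write(...) ...
            context.getCounter("MyApp", "RecordsProcessed").increment(1);
        } catch (RuntimeException e) {
            // Counters are aggregated across all tasks and shown in the job UI
            // and in the client output when the job finishes.
            context.getCounter("MyApp", "Errors").increment(1);
        }
    }
}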

Connecting arbitrary streams to standard input

I have a program that needs to process data from standard input. I can call it on the command line like so: java program < input. I want to write a proper unit test for its main(). Is it possible to reassociate a method's System.in with some other stream?
In the test, I can read the sample data, and then somehow run the original program with its stdin connected to some stream that I define (from the sample data) and verify that it returns what I expect. I considered using these classes:
PipedInputStream and
PipedOutputStream. But they would require me to modify the original program to read from a PipedInputStream every time I test it. Or I could isolate the stream reading into a function (e.g. parseStream(InputStream)) and pass in a PipedInputStream that is already connected to the sample data.
I could also write a shell script to pipe whatever I want into its stdin, but the method in question will be part of a series of processing steps, so it shouldn't itself write to stdout and actually returns ArrayList<SomeCompositeType>, where SomeCompositeType contains the data that was read in a structured way (e.g. some ints, arrays, Maps, etc.).
So is it possible to call some method that reads from System.in with a different stream?
See my comment.
What you appear to want is provided by System.setIn :-)
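For example, a test can swap in a ByteArrayInputStream and restore the real stdin afterwards. A minimal sketch, assuming JUnit 4 and a hypothetical MyProgram class whose main() reads from System.in:

import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

import org.junit.After;
import org.junit.Test;

public class MyProgramTest {

    private final InputStream originalIn = System.in;

    @Test
    public void mainReadsFromRedirectedStdin() throws Exception {
        String sample = "line one\nline two\n";
        System.setIn(new ByteArrayInputStream(sample.getBytes(StandardCharsets.UTF_8)));

        MyProgram.main(new String[0]);   // hypothetical program under test

        // ... assert on the program's observable result here ...
    }

    @After
    public void restoreStdin() {
        // Always restore the real stdin so later tests are not affected.
        System.setIn(originalIn);
    }
}

If the parsing is extracted into parseStream(InputStream), the test can skip System.setIn entirely and just pass the ByteArrayInputStream straight in, which is usually cleaner.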

Could I duplicate or intercept an output stream in Java?

I want to intercept the standard output stream and copy its content to another stream, but I also want to keep the standard output stream behaving like the original. Can I achieve that in Java?
You can use something like the TeeOutputStream example explained here: Writing Your Own Java I/O Stream Classes.
Basically you create a TeeOutputStream, give it your stream and the current System.out,
then use System.setOut with the new stream.
Anything written to System.out will be written to the original System.out as well as to your stream, so you can do whatever you want with it.
Edit:
Oracle took that page down. It is also possible to use TeeOutputStream from Apache Commons IO to do the same thing without writing any code yourself.
Take a look at the org.apache.commons.io.output package. I think that TeeOutputStream is what you're looking for.
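A small sketch of the Commons IO approach; the log file name is just an example.

import java.io.FileOutputStream;
import java.io.PrintStream;

import org.apache.commons.io.output.TeeOutputStream;

public class TeeStdout {
    public static void main(String[] args) throws Exception {
        PrintStream originalOut = System.out;
        FileOutputStream copy = new FileOutputStream("stdout-copy.log");

        // Everything written to the tee goes to both the console and the file.
        TeeOutputStream tee = new TeeOutputStream(originalOut, copy);
        System.setOut(new PrintStream(tee, true));

        System.out.println("This line appears on the console and in stdout-copy.log");

        System.out.flush();
        copy.close();
        System.setOut(originalOut);   // restore the original stream when done
    }
}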
