I have a few Hive tables; some of them are in Avro format and some are plain text files. The schemas differ slightly, but all of them contain certain attributes that I need.
I am planning to write a MapReduce job to process the data. The thing is, I am trying to avoid tons of separate jobs and to simplify the process as much as possible. Fingers crossed that I only need to write one job.
Is there any example showing how to read inputs of different formats in one mapper?
Say, for example, I have one HDFS path that I know is in Avro, and another HDFS path where the data is in a plain text file.
// Pseudo code
mapper(Paths) {
    for (Path in Paths) {
        if (Path.containsAvro()) {
            ... read as Avro
        } else {
            ... read as text file
        }
        ...
    }
}
Use two different mappers, one for each format, in the same job. Each mapper can read its own input format, but they must all emit the same output key/value types. Use something like this to configure it:
MultipleInputs.addInputPath(job, new Path(path_to_data_with_format_1), SomeInputFormat.class, ReadFormatOneMapper.class);
MultipleInputs.addInputPath(job, new Path(path_to_data_with_format_2), SomeOtherInputFormat.class, ReadFormatTwoMapper.class);
Of course, SomeInputFormat and SomeOtherInputFormat aren't real input format classes. In this example the two mapper classes would output key/value pairs with the same key/value types, and the reducer, if you have one, would receive the data from both mappers.
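For a more concrete sketch, suppose the Avro side is read with AvroKeyInputFormat over a GenericRecord schema and the text side with the stock TextInputFormat. The two mappers (each in its own file) might then look roughly like this; the "id"/"attr" field names and the tab-delimited text layout are assumptions about your data:

import java.io.IOException;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.mapred.AvroKey;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Reads Avro records and emits (Text, Text) pairs.
public class ReadAvroMapper extends Mapper<AvroKey<GenericRecord>, NullWritable, Text, Text> {
    @Override
    protected void map(AvroKey<GenericRecord> key, NullWritable value, Context context)
            throws IOException, InterruptedException {
        GenericRecord record = key.datum();
        context.write(new Text(record.get("id").toString()),
                      new Text(record.get("attr").toString()));
    }
}

// Reads plain text lines and emits the same (Text, Text) pair types.
public class ReadTextMapper extends Mapper<LongWritable, Text, Text, Text> {
    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        String[] fields = line.toString().split("\t");
        context.write(new Text(fields[0]), new Text(fields[1]));
    }
}

// In the driver, mirroring the addInputPath calls above:
//   MultipleInputs.addInputPath(job, new Path(avroPath), AvroKeyInputFormat.class, ReadAvroMapper.class);
//   MultipleInputs.addInputPath(job, new Path(textPath), TextInputFormat.class, ReadTextMapper.class);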
Related
I have a very long string in a text file. It is basically the below string repeated around 1000 times (as one long string, not 1000 strings). The string has variables which change with each repetition (those in bold). I'd like to extract the variables in an automated way and return the output as either a CSV or a formatted txt file (Random Bank, Random Rate, Random Product). I can do this successfully using https://regex101.com, but it involves a lot of manual copy and paste. I'd like to write a bash script to automate extracting the information, but have had no luck attempting various grep commands. How can this be done? (I'd also consider doing it in Java.)
[{"AccountName":"Random Product","AccountType":"Variable","AccountTypeId":1,"AER":Random Rate,"CanManageByMobileApp":false,"CanManageByPost":true,"CanManageByTelephone":true,"CanManageInBranch":false,"CanManageOnline":true,"CanOpenByMobileApp":false,"CanOpenByPost":false,"CanOpenByTelephone":false,"CanOpenInBranch":false,"CanOpenOnline":true,"Company":"Random Bank","Id":"S9701Monthly","InterestPaidFrequency":"Monthly"
This is JSON-formatted data, which you can't parse with regular expression engines. Get a JSON parser. If this file is larger than, say, 1 GB, find one that lets you 'stream' (the term for dealing with the data as it is parsed, versus the more usual route of turning the entire input into an object; if the file is huge, that object would be huge and you might run out of memory, hence the streaming aspect).
Here is one tutorial for Jackson-streaming.
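As a rough sketch of the streaming approach with Jackson (the field names follow the sample JSON above, the file name accounts.json is an assumption, and the input must be complete, valid JSON):

import java.io.File;
import com.fasterxml.jackson.core.JsonFactory;
import com.fasterxml.jackson.core.JsonParser;
import com.fasterxml.jackson.core.JsonToken;

public class ExtractFields {
    public static void main(String[] args) throws Exception {
        JsonFactory factory = new JsonFactory();
        try (JsonParser parser = factory.createParser(new File("accounts.json"))) {
            String bank = null, rate = null, product = null;
            while (parser.nextToken() != null) {
                if (parser.getCurrentToken() == JsonToken.FIELD_NAME) {
                    String field = parser.getCurrentName();
                    parser.nextToken(); // advance to the field's value
                    if ("Company".equals(field)) bank = parser.getText();
                    else if ("AER".equals(field)) rate = parser.getText();
                    else if ("AccountName".equals(field)) product = parser.getText();
                } else if (parser.getCurrentToken() == JsonToken.END_OBJECT && bank != null) {
                    System.out.println(bank + "," + rate + "," + product); // one CSV line per record
                    bank = rate = product = null;
                }
            }
        }
    }
}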
I am trying to implement a MapReduce program to do word counts from 2 files, and then compare the word counts from these files to see what the most common words are...
I noticed that after doing the word count for file 1, the results go into the directory "/data/output1/", which contains 3 files:
- "_SUCCESS"
- "_logs"
- "part-r-00000"
The "part-r-00000" is the file that contains the results from file1 wordcount. How do I make my program read that particular file if the file name is generated in real-time without me knowing beforehand the filename?
Also, for the (key, value) pairs, I have added an identifier to the "value", so as to be able to identify which file and count that word belongs to.
public void map(Text key, Text value, Context context) throws IOException, InterruptedException {
    // tag the count with "_f2" so the reducer can tell it came from file 2
    Text newValue = new Text();
    newValue.set(value.toString() + "_f2");
    context.write(key, newValue);
}
At a later stage, how do I "remove" the identifier so that I can get back just the "value"?
Just point your next MR job to /data/output1/. It will read all three files as input, but _SUCCESS and _logs are both empty, so they'll have no effect on your program. They're just written so that you can tell that the MR job writing to the directory finished successfully.
If you want to implement word count from 2 different files, you can use the MultipleInputs class, which lets you run a MapReduce program on both files simultaneously. Refer to this link for an example of how to implement it: http://www.hadooptpoint.com/hadoop-multiple-input-files-example-in-mapreduce/. There you define a separate mapper for each input file, so you can add a different identifier in each mapper; when their output reaches the reducer, it can tell which mapper each value came from and process it accordingly. You can remove the identifiers the same way you added them: for example, if you prefix mapper 1's output with one marker and mapper 2's output with another, the reducer can use the marker to identify the source and then simply strip it off.
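For instance, if the values reaching the reducer are counts tagged with suffixes such as "_f1" and "_f2" (matching the map method shown in the question), a reducer along these lines could strip the tags and separate the two counts again. The suffix names and the tab-separated output are assumptions:

import java.io.IOException;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class CompareCountsReducer extends Reducer<Text, Text, Text, Text> {
    @Override
    protected void reduce(Text word, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        long countFile1 = 0;
        long countFile2 = 0;
        for (Text value : values) {
            String v = value.toString();
            if (v.endsWith("_f1")) {
                countFile1 += Long.parseLong(v.substring(0, v.length() - 3)); // drop "_f1"
            } else if (v.endsWith("_f2")) {
                countFile2 += Long.parseLong(v.substring(0, v.length() - 3)); // drop "_f2"
            }
        }
        context.write(word, new Text(countFile1 + "\t" + countFile2));
    }
}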
Aside from that, regarding your other query about reading the file: the output file name always follows a pattern. If you are using Hadoop 1.x the results are stored in files named part-00000 and onward, and with Hadoop 2.x the results are stored in files named part-r-00000; if another output needs to be written to the same output path, it goes into part-r-00001 and onwards. The other two files that are generated have no significance for the developer; they act more as markers for Hadoop itself.
Hope this solves your query. Please comment if the answer is not clear.
I'm working with a sensor that takes data and gives it to me whenever I call on it. I want to collect 1 minute of data in an ArrayList and then save it to a file or something so that I can analyze it at a different time.
So I have something like this:
ArrayList<DataObject> data = new ArrayList<DataObject>();

public void onUpdate(DataObject d) { // being called by my sensor every second
    data.add(d);
}
I want to save the ArrayList data to a file to my computer so that I can later feed it into a different program and use it there.
How do I go about doing this?
If you want to save these as CSV files, they'll be easily exportable and importable, even to Excel (which may be of value for doing further work or passing on results).
Check out OpenCSV and in particular this entry in the FAQ relating to writing the data out.
e.g.
CSVWriter writer = new CSVWriter(new FileWriter("yourfile.csv"), ',');
// feed in your array (or convert your data to an array)
String[] entries = "first#second#third".split("#");
writer.writeNext(entries);
writer.close();
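Applied to the ArrayList from the question, a minimal sketch might look like this. The getTimestamp()/getValue() accessors on DataObject are hypothetical placeholders for whatever your sensor actually reports, and depending on your OpenCSV version the import is com.opencsv.CSVWriter or au.com.bytecode.opencsv.CSVWriter:

import java.io.FileWriter;
import java.io.IOException;
import java.util.List;
import com.opencsv.CSVWriter;

public class SensorCsvExporter {
    public static void saveToCsv(List<DataObject> data, String fileName) throws IOException {
        try (CSVWriter writer = new CSVWriter(new FileWriter(fileName))) {
            writer.writeNext(new String[] {"timestamp", "value"}); // header row
            for (DataObject d : data) {
                // getTimestamp()/getValue() are hypothetical accessors on DataObject
                writer.writeNext(new String[] {
                        String.valueOf(d.getTimestamp()),
                        String.valueOf(d.getValue())
                });
            }
        }
    }
}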
I think you should just output the values to a file with a delimiter between them, then read the file back into an array in the new program. To get the array into a file, loop through the array and append each value to the file until you reach the end.
If the other program is also based on Java, you could leverage the Java Serializable interface. Here is a tutorial: Java - Serialization.
It would be best to use ObjectOutputStream for this purpose, since the output of the sensor is an integer or a double. Using the writeObject method, your task can be done.
See the link for a detailed reading:
http://docs.oracle.com/javase/7/docs/api/java/io/ObjectOutputStream.html
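A minimal sketch of that approach, assuming DataObject implements java.io.Serializable (the file name is arbitrary):

import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.util.ArrayList;

public class SensorDataStore {
    // write the whole list in one call; ArrayList itself is Serializable
    public static void save(ArrayList<DataObject> data, String fileName) throws Exception {
        try (ObjectOutputStream out = new ObjectOutputStream(new FileOutputStream(fileName))) {
            out.writeObject(data);
        }
    }

    // read it back in the other program
    @SuppressWarnings("unchecked")
    public static ArrayList<DataObject> load(String fileName) throws Exception {
        try (ObjectInputStream in = new ObjectInputStream(new FileInputStream(fileName))) {
            return (ArrayList<DataObject>) in.readObject();
        }
    }
}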
I have a Reducer using AvroKeyOutput as the output format. By default, MapReduce will write all my keys to a single output file. I would like to write to a separate output file for each key value. Avro provides the AvroMultipleOutputs class, but examples are slim. The one provided by Apache, AvroMultipleOutputs, shows how to pre-configure the various outputs when defining the job. The example shows:
JOB:
AvroMultipleOutputs.addNamedOutput(job, "avro1", AvroOutputFormat.class, schema);
AvroMultipleOutputs.addNamedOutput(job, "avro2", AvroOutputFormat.class, null);
REDUCER:
amos = new AvroMultipleOutputs(conf);
amos.getCollector("avro1", reporter).collect(datum);
amos.getCollector("avro2", "A", reporter).collect(datum);
amos.getCollector("avro3", "B", reporter).collect(datum);
But I don't know how many files I will need or what their names are, since it is based on the key values that come out of my reducer. How could I modify this to accommodate dynamic file naming?
A strategy you can use in this situation:
- use a Map-only job (zero reduce tasks)
- have a single named multiple output configured
- during map(), use your key value as the base output path in AvroMultipleOutputs.write(String namedOutput, Object key, Object value, String baseOutputPath)
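A sketch of what the map-only side of that might look like. The named output "data", the record field "myKey", and the use of AvroKeyOutputFormat are assumptions; the driver would also call job.setNumReduceTasks(0) and AvroMultipleOutputs.addNamedOutput(job, "data", AvroKeyOutputFormat.class, schema):

import java.io.IOException;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.mapred.AvroKey;
import org.apache.avro.mapreduce.AvroMultipleOutputs;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.mapreduce.Mapper;

public class SplitByKeyMapper
        extends Mapper<AvroKey<GenericRecord>, NullWritable, AvroKey<GenericRecord>, NullWritable> {

    private AvroMultipleOutputs amos;

    @Override
    protected void setup(Context context) {
        amos = new AvroMultipleOutputs(context);
    }

    @Override
    protected void map(AvroKey<GenericRecord> record, NullWritable ignore, Context context)
            throws IOException, InterruptedException {
        String keyValue = record.datum().get("myKey").toString();
        // output lands under <output dir>/<keyValue>/part-m-xxxxx
        amos.write("data", record, NullWritable.get(), keyValue + "/part");
    }

    @Override
    protected void cleanup(Context context) throws IOException, InterruptedException {
        amos.close(); // flush all the per-key writers
    }
}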
Okay, so I have been reading a lot about Hadoop and MapReduce, and maybe it's because I'm not as familiar with iterators as most, but I have a question I can't seem to find a direct answer to. Basically, as I understand it, the map function is executed in parallel by many machines and/or cores. Thus, whatever you are working on must not depend on prior code having executed for the program to make any kind of speed gains. This works perfectly for me, but what I'm doing requires me to test information in small batches. Basically I need to send batches of lines of a .csv as arrays of 32, 64, 128 or however many lines each. Like lines 0–127 go to core1's execution of the map function, lines 128–255 go to core2's, etc. Also I need to have the contents of each batch available as a whole inside the function, as if I had passed it an array. I read a little about how the new Java API allows for something called push and pull, and that this allows things to be sent in batches, but I couldn't find any example code. I dunno, I'm going to continue researching, and I'll post anything I find, but if anyone knows, could they please post in this thread? I would really appreciate any help I might receive.
Edit:
If you could simply ensure that the chunks of the .csv are sent in sequence, you could perform it this way. I guess this also assumes that there are globals in MapReduce.
//** concept not code **//
GLOBAL_COUNTER = 0;
GLOBAL_ARRAY = NEW ARRAY();

map()
{
    GLOBAL_ARRAY[GLOBAL_COUNTER] = ITERATOR_VALUE;
    GLOBAL_COUNTER++;
    if (GLOBAL_COUNTER == 128)
    {
        // EXECUTE TEST WITH AN ARRAY OF 128 VALUES FOR COMPARISON, THEN RESET
        GLOBAL_COUNTER = 0;
    }
}
If you're trying to get a chunk of lines from your CSV file into the mapper, you might consider writing your own InputFormat/RecordReader and potentially your own WritableComparable object. With the custom InputFormat/RecordReader you'll be able to specify how objects are created and passed to the mapper based on the input you receive.
If the mapper is doing what you want, but you need these chunks of lines sent to the reducer, make the output key for the mapper the same for each line you want in the same reduce function.
The default TextInputFormat will give input to your mapper like this (the keys/offsets in this example are just random numbers):
0 Hello World
123 My name is Sam
456 Foo bar bar foo
Each of those lines will be read into your mapper as a key,value pair. Just modify the key to be the same for each line you need and write it to the output:
0 Hello World
0 My name is Sam
1 Foo bar bar foo
The first time the reduce function is called, it will receive a key/value pair with the key being "0" and the value being an Iterable object containing "Hello World" and "My name is Sam". You'll be able to access both of these values in the same reduce method call by using the Iterable object.
Here is some pseudo code:
int count = 0

map(key, value) {
    int newKey = count / 2
    context.write(newKey, value)
    count++
}

reduce(key, values) {
    for value in values
        // Do something to each line
}
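A concrete (new API) version of that pseudo code might look roughly like this, with each class in its own file. The batch size of 128 is an assumption, and note the counter is per map task, so batches from different input splits that end up with the same key will be merged at the reducer:

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class BatchingMapper extends Mapper<LongWritable, Text, IntWritable, Text> {
    private static final int BATCH_SIZE = 128;
    private int count = 0;

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        // every BATCH_SIZE consecutive lines share the same output key
        context.write(new IntWritable(count / BATCH_SIZE), line);
        count++;
    }
}

// (imports as above, plus org.apache.hadoop.mapreduce.Reducer)
public class BatchingReducer extends Reducer<IntWritable, Text, IntWritable, Text> {
    @Override
    protected void reduce(IntWritable batchKey, Iterable<Text> lines, Context context)
            throws IOException, InterruptedException {
        for (Text line : lines) {
            // run the per-batch test/comparison here
            context.write(batchKey, line);
        }
    }
}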
Hope that helps. :)
If the end goal is to force certain sets of records to go to certain machines for processing, you want to look into writing your own Partitioner. Otherwise, Hadoop will split the data automatically for you depending on the number of reducers.
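A custom Partitioner is only a few lines. Here is a hedged sketch that assumes the IntWritable batch keys from the answers above; register it with job.setPartitionerClass(BatchPartitioner.class):

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

public class BatchPartitioner extends Partitioner<IntWritable, Text> {
    @Override
    public int getPartition(IntWritable key, Text value, int numPartitions) {
        // mask the sign bit so a negative key can't produce a negative partition index
        return (key.get() & Integer.MAX_VALUE) % numPartitions;
    }
}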
I suggest reading the tutorial on the Hadoop site to get a better understanding of M/R.
If you simply want to send N lines of input to a single mapper, you can use the NLineInputFormat class. You could then do the line parsing (splitting on commas, etc.) in the mapper.
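For example, in the driver (assuming a Job object named job and a batch size of 128; each map() call still receives one line at a time, but each map task gets exactly 128 lines of the file):

import org.apache.hadoop.mapreduce.lib.input.NLineInputFormat;

// each map task receives 128 lines of the input file
job.setInputFormatClass(NLineInputFormat.class);
NLineInputFormat.setNumLinesPerSplit(job, 128);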
If you want to have access to the lines before and after the line the mapper is currently processing, you may have to write your own input format. Subclassing FileInputFormat is usually a good place to start. You could create an InputFormat that reads N lines, concatenates them, and sends them as one block to a mapper, which then splits the input into N lines again and begins processing.
As far as globals in Hadoop go, you can specify some custom parameters when you create the job configuration, but as far as I know, you cannot change them in a worker and expect the change to propagate throughout the cluster. To set a job parameter that will be visible to workers, do the following where you are creating the job:
job.getConfiguration().set(Constants.SOME_PARAM, "my value");
Then to read the parameters value in the mapper or reducer,
public void map(Text key, Text value, Context context) {
Configuration conf = context.getConfiguration();
String someParam = conf.get(Constants.SOME_PARAM);
// use someParam in processing input
}
Hadoop has support for basic types such as int, long, String, and boolean to be used as parameters.
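For example, with the typed setters and getters on Configuration (the parameter names here are just placeholders):

// in the driver
job.getConfiguration().setInt("my.batch.size", 128);
job.getConfiguration().setBoolean("my.debug.enabled", false);

// in a mapper or reducer (the second argument is the default value)
int batchSize = context.getConfiguration().getInt("my.batch.size", 64);
boolean debug = context.getConfiguration().getBoolean("my.debug.enabled", false);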