How can we access an array in a mapper? - java

I'm totally new to MapReduce programming, and in my first MR code I have this question. In my mapper, I need access to a 2D array that has been created and filled in the main class, before the mapper runs. How can I access it? Should I export it to a txt file and try to read it in the mapper? If so, how would I feed it into the mapper? I have no idea how to make it available. My code is in Java.

You can do this in a couple of ways.
After you have created the 2D array, you can load it into HDFS and then use the DistributedCache in the Java M/R API to access this data in your mapper/reducer code. Take a look at this: http://developer.yahoo.com/hadoop/tutorial/module5.html
If your data is not too large, and you have an object representing it that is serializable and quite small, you can pass it along via the job Configuration. Serialize it and include a base64-encoded version of it in the Configuration. Then you can access this data in the mapper/reducer: http://hadoop.apache.org/docs/stable/api/org/apache/hadoop/conf/Configuration.html#set(java.lang.String, java.lang.String)
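For the second option, here is a minimal sketch of what the driver and mapper side could look like. The property name "my.matrix" and the mapper's key/value types are placeholders, not anything prescribed by the API:

import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.util.Base64;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class MatrixJobHelper {

    // Driver side: serialize the 2D array and put a base64 string into the Configuration.
    public static void putMatrixInConf(Configuration conf, int[][] matrix) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(matrix); // int[][] is serializable
        }
        conf.set("my.matrix", Base64.getEncoder().encodeToString(bytes.toByteArray()));
    }

    // Mapper side: decode the array once in setup() and reuse it in every map() call.
    public static class MatrixMapper extends Mapper<LongWritable, Text, Text, Text> {
        private int[][] matrix;

        @Override
        protected void setup(Context context) throws IOException {
            byte[] raw = Base64.getDecoder().decode(context.getConfiguration().get("my.matrix"));
            try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(raw))) {
                matrix = (int[][]) in.readObject();
            } catch (ClassNotFoundException e) {
                throw new IOException(e);
            }
        }

        @Override
        protected void map(LongWritable key, Text value, Context context) {
            // matrix[i][j] is available here for every input record
        }
    }
}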

Related

How can I specify dynamically the ItemProcessor for a JSON Job?

I have different JSON files and need to read, process, and write the JSON objects contained in a JSON array.
The output format (more specifically: the output class) is the same for all files. Let's call it OutputClass. Hence the item processor is something like ItemProcessor<X, OutputClass>, where X is the class of the specific JSON file.
The differences between the files are:
The JSON array / the information is at a different position in every JSON file
The structure of the JSON objects in the JSON array is different (the objects in file a have a different syntax than the ones in file b)
I already came across @StepScope and was able to dynamically generate a reader (depending on job parameters) which starts reading at a different position in the JSON structure.
But I have no idea how to dynamically choose an ItemProcessor depending on the job parameters, because I have many different JSON files and want to reduce the amount of code to write for each file.
Since you were able to create a dynamic item reader based on job parameters by using a step-scoped bean (which is the way I would do it too), you can use the same approach to create a dynamic item processor as well.
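For example, here is a minimal sketch of such a step-scoped processor bean. The job parameter name "fileType" and the FileAProcessor/FileBProcessor classes are made up for illustration; OutputClass is the common output class from the question:

import org.springframework.batch.core.configuration.annotation.StepScope;
import org.springframework.batch.item.ItemProcessor;
import org.springframework.beans.factory.annotation.Value;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;

@Configuration
public class ProcessorConfiguration {

    // A single step-scoped factory bean; Spring injects the job parameter when the step starts.
    @Bean
    @StepScope
    public ItemProcessor<?, OutputClass> jsonItemProcessor(
            @Value("#{jobParameters['fileType']}") String fileType) {
        switch (fileType) {
            case "a": return new FileAProcessor(); // implements ItemProcessor<FileAItem, OutputClass>
            case "b": return new FileBProcessor(); // implements ItemProcessor<FileBItem, OutputClass>
            default:  throw new IllegalArgumentException("Unknown fileType: " + fileType);
        }
    }
}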

Data reconciliation in Hadoop in the most primitive way

I need to do data reconciliation in Hadoop based on key comparisons. That means I will have old data in one folder and the newer data will be put into different folders. At the end of the batch I was planning simply on moving the newer data to reside with the old one. The data would be json files from which I have to extract the keys.
I'm taking my first steps with Hadoop, so I just want to do it with a MapReduce program only, i.e. without tools such as Spark, Pig, Hive, etc. I was thinking of simply going through all the old data at the beginning of the program, before the Job object is created, and putting all the IDs into a Java HashMap that would be accessible from the mapper task. If there's a key missing in the newer data, the mapper would output that. The reducer would concern itself with categories of the missing IDs, but that's another story. After the job has finished, I would move the newer data into the old data's folder.
The only thing that I find a bit clunky is this loading phase into a Java HashMap. It is probably not the most elegant solution, so I was wondering whether the MapReduce model has some dedicated data structures/functionality for that kind of purpose (populating a global hash map with all the data from HDFS before the first map task is run)?
I don't think the HashMap solution is a good idea. You can use multiple inputs for your job.
Depending on which input file a record comes from, the mapper can tell whether the data is new and write it with a suitable value. Then the reducer will check whether a key is contained only in the "new" input and, if so, write that data out.
So, as the result of the job, you will get only the new data.
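A rough sketch of that approach using MultipleInputs; the folder paths and the extractId helper are placeholders for your own setup:

import java.io.IOException;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.MultipleInputs;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

public class ReconcileJob {

    // Placeholder for your JSON key extraction.
    static String extractId(String jsonLine) {
        return jsonLine; // replace with real parsing
    }

    // Tags every record from the "old" folder.
    public static class OldDataMapper extends Mapper<LongWritable, Text, Text, Text> {
        @Override
        protected void map(LongWritable key, Text value, Context ctx)
                throws IOException, InterruptedException {
            ctx.write(new Text(extractId(value.toString())), new Text("old"));
        }
    }

    // Tags every record from the "new" folder.
    public static class NewDataMapper extends Mapper<LongWritable, Text, Text, Text> {
        @Override
        protected void map(LongWritable key, Text value, Context ctx)
                throws IOException, InterruptedException {
            ctx.write(new Text(extractId(value.toString())), new Text("new"));
        }
    }

    // Emits only the keys that were seen in the new data but never in the old data.
    public static class ReconcileReducer extends Reducer<Text, Text, Text, NullWritable> {
        @Override
        protected void reduce(Text id, Iterable<Text> tags, Context ctx)
                throws IOException, InterruptedException {
            boolean seenOld = false;
            boolean seenNew = false;
            for (Text tag : tags) {
                if ("old".equals(tag.toString())) seenOld = true;
                if ("new".equals(tag.toString())) seenNew = true;
            }
            if (seenNew && !seenOld) {
                ctx.write(id, NullWritable.get());
            }
        }
    }

    // In the driver, wire each folder to its own mapper.
    public static void wireInputs(Job job) {
        MultipleInputs.addInputPath(job, new Path("/data/old"), TextInputFormat.class, OldDataMapper.class);
        MultipleInputs.addInputPath(job, new Path("/data/new"), TextInputFormat.class, NewDataMapper.class);
    }
}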

Spring Batch CSV to Mongodb - 100s of columns

I am very new to Java and have been tasked to use Spring Batch to read in some text files. So far, Spring Batch resources online have helped me get to a point where I am reading, processing, and writing some simple test .csv files into Mongo.
The problem I have now is that the actual file I would like to read from has over 600 columns. Meaning that, with the current way I am reading my file into Java, I would need 600+ fields in my @Document Mongo model.
I have been thinking of a couple of ways to get around this:
first, I was thinking maybe I could read in each line as a string, then in my processor deal with splitting everything up and formatting the data, and then return a list for my MongoTemplate; but returning a List is not viable from the overridden process method.
So my question to you guys is:
What is the best way to handle reading in files with hundreds of columns in Spring Batch? Or what would be the best resource to start reading to help point me in the right direction?
Thanks!
I had the same problem. I used
http://opencsv.sourceforge.net/apidocs/com/opencsv/CSVReader.html
for reading CSVs.
I suggest you use a Map instead of 600 Java fields.
Besides, 600×600 Java strings is not a big deal for Java, and not for Mongo either.
To work with Mongo, use http://jongo.org/
If you really need batch processing of the data, your flow for batches should be:
Loop here: divide into batches (say 300 per loop)
Read 300×300 Java objects (or a Map) from the file into memory.
Sanitize or process them if needed.
Store them in MongoDB.
Return at EOF.
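A small sketch of the read-each-row-into-a-Map step with OpenCSV; the file path is a placeholder, and the batching and Mongo write are left to your step:

import com.opencsv.CSVReader;
import java.io.FileReader;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class WideCsvReader {

    // Reads the header row once, then turns every data row into a column-name -> value Map,
    // so the 600+ columns never have to become 600+ Java fields.
    public static List<Map<String, String>> readAsMaps(String path) throws Exception {
        try (CSVReader reader = new CSVReader(new FileReader(path))) {
            String[] header = reader.readNext();
            List<Map<String, String>> rows = new ArrayList<>();
            String[] line;
            while ((line = reader.readNext()) != null) {
                Map<String, String> row = new LinkedHashMap<>();
                for (int i = 0; i < header.length && i < line.length; i++) {
                    row.put(header[i], line[i]);
                }
                rows.add(row); // hand these to your Mongo writer, in batches if the file is large
            }
            return rows;
        }
    }
}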
I ended up just reading in each line as a String object. Then, in the processor, I loop over the String with a delimiter, creating my Mongo repository objects and storing them. So I am basically doing all of the writing inside the processor method, which I would say is definitely not best practice, but it gives me the desired end result.

Passing data to OPL model from Java

I have an OPL .mod model and I run it from Java code. The model needs some external data.
Currently the model loads the data from a .dat file with the
IloOplFactory.createOplRunConfiguration(String modelName, String[] dataFiles)
method.
I want to load the data directly from Java code.
I found
IloOplFactory.createOplRunConfiguration(OplModelDefinition, OplDataElements)
but I can't understand how to use it (how to define elements for OplDataElements).
Could someone provide an example of defining elements and using this method?
(Or a better way to pass data from Java to an OPL model?)
Thanks in advance.
I have done this to pass in control and configuration data to a model, typically parameter values and flags. Once you create an instance of IloOplDataElements, you can just add it as a data source for your model, e.g.
IloOplDataElements configData = new IloOplDataElements(env);
configData.addElement(configData.makeElement("modelIteration", 1));
configData.addElement(configData.makeElement("debug", 2));
// etc
myModel.addDataSource(configData);
I haven't tried doing this with array data, but I guess it should be similar.

Writing data into files with Java

I am writing a server in java that allows clients to play a game similar to 20 questions. The game itself is basically a binary tree with nodes that are questions about an object and leaves that are guesses at the object's identity. When the game guesses wrong it needs to be able to get the right answer from the player and add it to the tree. This data is then saved to a random access file.
The question is: how do you go about representing a tree within a file so that the data can be re-accessed as a tree at a later time?
If you know where I can find information on keeping data structures like trees organized as such when writing/reading to files then please link it. Thanks a lot.
Thanks for the quick answers everyone. This is a school project so it has some odd requirements like using random access files and telnet.
This data is then saved to a random access file.
That's the hard way to solve your problem (the "random access" bit, I mean).
The problem you are really trying to solve is how to persist a "complicated" data structure. In fact, there are a number of ways that this can be done. Here are some of them ...
Use Java persistence (i.e. built-in object serialization). This is simple to implement: make sure that your data structure is serializable, and then it's just a few lines of code to serialize and a few more to deserialize. The downsides are:
Serialized objects can be fragile in the face of code changes.
Serialization is not incremental. You write/read the whole graph each time.
If you have multiple separate serialized graphs, you need some scheme to name and manage them.
Use XML. This is more work to implement than Java persistence, but it has the advantage of being less fragile. And if something does go wrong, there's a chance you can fix it with XSLT or a text editor. (There are XML "binding" libraries that eliminate a lot of the glue coding.)
Use an SQL database. This addresses all of the downsides of Java persistence, but involves more coding ... and using a different computational model to access the persistent data (query versus graph navigation).
Use a database and an Object Relational Mapping technology, e.g. a JPA or JDO implementation (Hibernate is a popular choice). These bridge between the database and in-memory views of the data in a more or less transparent fashion, and avoid a lot of the glue code that you would need to write in the SQL database and XML cases.
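To make the first option concrete, here is a small sketch using plain Java serialization. The Node class is made up for illustration; your tree class just needs to implement Serializable:

import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

public class TreeStore {

    // A node is either a question with yes/no children or a leaf holding a guess.
    public static class Node implements Serializable {
        private static final long serialVersionUID = 1L;
        String text;   // the question, or the guessed object for a leaf
        Node yes;      // null on leaves
        Node no;       // null on leaves
    }

    // Writes the whole tree (the entire object graph reachable from root) to the file.
    public static void save(Node root, File file) throws IOException {
        try (ObjectOutputStream out = new ObjectOutputStream(new FileOutputStream(file))) {
            out.writeObject(root);
        }
    }

    // Reads it back; the cast restores it as a tree again.
    public static Node load(File file) throws IOException, ClassNotFoundException {
        try (ObjectInputStream in = new ObjectInputStream(new FileInputStream(file))) {
            return (Node) in.readObject();
        }
    }
}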
I think you're looking for serialization. Try this:
http://java.sun.com/developer/technicalArticles/Programming/serialization/
As mentioned, serialization is what you are looking for. It allows you to write an object to a file, and read it back later with minimal effort. The file will automatically be read back in as your object type. This makes things much easier than trying to store the object yourself using XML.
Java serialization has some pitfalls (like when you update your class). I would serialize in a text format. JSON is my first choice here, but XML and YAML would work as well.
This way you would have a file that doesn't rely on the binary version of your class.
There are several Java libraries: http://www.json.org
Some examples:
http://code.google.com/p/json-simple/wiki/DecodingExamples
http://code.google.com/p/json-simple/wiki/EncodingExamples
And to save to and read from the file you can use Commons IO:
import org.apache.commons.io.FileUtils;
import java.io.File;
...
File dataFile = new File("yourfile.json");
// read the JSON text back in
String data = FileUtils.readFileToString(dataFile);
// write the JSON text out again
FileUtils.writeStringToFile(dataFile, data);
