Advantages of using NullWritable in Hadoop - java

What are the advantages of using NullWritable for null keys/values over using null Text (i.e. new Text(null))? I see the following in the "Hadoop: The Definitive Guide" book.
NullWritable is a special type of Writable, as it has a zero-length serialization. No bytes
are written to, or read from, the stream. It is used as a placeholder; for example, in
MapReduce, a key or a value can be declared as a NullWritable when you don’t need
to use that position—it effectively stores a constant empty value. NullWritable can also
be useful as a key in SequenceFile when you want to store a list of values, as opposed
to key-value pairs. It is an immutable singleton: the instance can be retrieved by calling
NullWritable.get()
I do not clearly understand how the output is written out using NullWritable. Will there be a single constant value at the beginning of the output file indicating that the keys or values of this file are null, so that the MapReduce framework can skip reading the null keys/values (whichever is null)? Also, how are null Texts actually serialized?
Thanks,
Venkat

The key/value types must be given at runtime, so anything writing or reading NullWritables will know ahead of time that it will be dealing with that type; there is no marker or anything in the file. And technically the NullWritables are "read", it's just that "reading" a NullWritable is actually a no-op. You can see for yourself that there's nothing at all written or read:
NullWritable nw = NullWritable.get();
ByteArrayOutputStream out = new ByteArrayOutputStream();
nw.write(new DataOutputStream(out));
System.out.println(Arrays.toString(out.toByteArray())); // prints "[]"
ByteArrayInputStream in = new ByteArrayInputStream(new byte[0]);
nw.readFields(new DataInputStream(in)); // works just fine
And as for your question about new Text(null), again, you can try it out:
Text text = new Text((String)null);
ByteArrayOutputStream out = new ByteArrayOutputStream();
text.write(new DataOutputStream(out)); // throws NullPointerException
System.out.println(Arrays.toString(out.toByteArray()));
Text will not work at all with a null String.
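For context, here is a minimal sketch of the SequenceFile usage the book quote mentions, using NullWritable as the key type so the file effectively stores a plain list of values (the path, the values, and the use of a local Configuration are illustrative):
// Sketch: storing a list of values (no keys) in a SequenceFile by using
// NullWritable as the key type. The NullWritable keys contribute zero bytes.
Configuration conf = new Configuration();
Path path = new Path("values.seq");
SequenceFile.Writer writer = SequenceFile.createWriter(conf,
        SequenceFile.Writer.file(path),
        SequenceFile.Writer.keyClass(NullWritable.class),
        SequenceFile.Writer.valueClass(Text.class));
try {
    writer.append(NullWritable.get(), new Text("first value"));
    writer.append(NullWritable.get(), new Text("second value"));
} finally {
    writer.close();
}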

I changed the run method, and it worked:
@Override
public int run(String[] strings) throws Exception {
    Configuration config = HBaseConfiguration.create();
    //set job name
    Job job = new Job(config, "Import from file ");
    job.setJarByClass(LogRun.class);
    //set map class
    job.setMapperClass(LogMapper.class);
    //set output format and output table name
    //job.setOutputFormatClass(TableOutputFormat.class);
    //job.getConfiguration().set(TableOutputFormat.OUTPUT_TABLE, "crm_data");
    //job.setOutputKeyClass(ImmutableBytesWritable.class);
    //job.setOutputValueClass(Put.class);
    TableMapReduceUtil.initTableReducerJob("crm_data", null, job);
    job.setNumReduceTasks(0);
    TableMapReduceUtil.addDependencyJars(job);
    FileInputFormat.addInputPath(job, new Path(strings[0]));
    int ret = job.waitForCompletion(true) ? 0 : 1;
    return ret;
}

You can always wrap your String in your own Writable class and add a boolean flag indicating whether it holds a blank string or not:
@Override
public void readFields(DataInput in) throws IOException {
    ...
    boolean hasWord = in.readBoolean();
    if (hasWord) {
        word = in.readUTF();
    }
    ...
}
and
@Override
public void write(DataOutput out) throws IOException {
    ...
    boolean hasWord = StringUtils.isNotBlank(word);
    out.writeBoolean(hasWord);
    if (hasWord) {
        out.writeUTF(word);
    }
    ...
}
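A fuller sketch of such a wrapper, assuming this pattern (the class name NullableStringWritable and the get() accessor are made up for illustration):
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.Writable;

// Sketch: a Text-like Writable that tolerates null/blank strings by
// writing a presence flag before the payload.
public class NullableStringWritable implements Writable {
    private String word;

    public NullableStringWritable() { }

    public NullableStringWritable(String word) {
        this.word = word;
    }

    @Override
    public void write(DataOutput out) throws IOException {
        boolean hasWord = word != null && !word.trim().isEmpty();
        out.writeBoolean(hasWord);
        if (hasWord) {
            out.writeUTF(word);
        }
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        boolean hasWord = in.readBoolean();
        word = hasWord ? in.readUTF() : null;
    }

    public String get() {
        return word;
    }
}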

Related

Why List<String> is always empty when using MapReduce and HDFS?

So I have a program that uses a Mapper, Combiner and Reducer to get some fields of the IMDB repository, and this program works fine when I run it on my machine.
When I put this code to run inside Docker using Hadoop HDFS, it doesn't get some of the values I need. To be precise, the Combiner, which puts some values into a List that is a public class variable, doesn't seem to work, because when I try to use that List in the Reducer it is always empty. When I was running on my machine (without Docker and Hadoop HDFS) it would put the values into the List, but when running on Docker it is always empty. I have also printed the size of the List in main and it returns 0. Any suggestions?
public class FromParquetToParquetFile {
    public static List<String> top10 = new ArrayList<>();
    ....
}
The Combiner looks like:
public static class FromParquetQueriesCombiner extends Reducer<Text, Text, Text, Text> {
    @Override
    protected void reduce(Text key, Iterable<Text> values, Context context) throws IOException, InterruptedException {
        long total = 0;
        long maior = -1;
        String tconst = "";
        String title = "";
        for (Text value : values) {
            total++; // number of films
            String[] fields = value.toString().split("\t");
            top10.add(key.toString() + "\t" + fields[2] + "\t" + fields[3] + "\t" + fields[0] + "\t" + fields[1]);
            int x = Integer.parseInt(fields[3]);
            if (x >= maior) {
                tconst = fields[0];
                title = fields[1];
                maior = x;
            }
        }
        StringBuilder result = new StringBuilder();
        result.append(total);
        result.append("\t");
        result.append(tconst);
        result.append("\t");
        result.append(title);
        result.append("\t");
        context.write(key, new Text(result.toString()));
    }
}
And the Reducer looks like this (it has a setup method to order the List):
public static class FromParquetQueriesReducer extends Reducer<Text, Text, Void, GenericRecord> {
    private Schema schema;

    @Override
    protected void setup(Context context) throws IOException, InterruptedException {
        Collections.sort(top10, new Comparator<String>() {
            @Override
            public int compare(String o1, String o2) {
                String[] aux = o1.split("\t");
                String[] aux2 = o2.split("\t");
                ...
                return -result;
            }
        });
        schema = getSchema("hdfs:///schema.alinea2");
    }

    @Override
    protected void reduce(Text key, Iterable<Text> values, Context context) throws IOException, InterruptedException {
        ...
        for (String s : top10)
            ...
    }
}
As explained here, "public" variables (in the Java sense) don't quite "translate" to a parallel computing model that is meant to run on a distributed system (and this is why, while you didn't have any issue running your application locally, things "broke" when you ran it on HDFS).
Mapper and Reducer instances are isolated and more or less "independent" of whatever is put "around" the functions that describe them. That means they don't really have access to variables declared either in the parent class (i.e. FromParquetToParquetFile here) or in the driver/main function of the program. From that we can understand that (in case you want to preserve the current way your job works) we need some type of risky workaround (or a straight-up hack) to make a list publicly accessible and "static" within the constraints we are working with.
The solution for this is to set user-named values on the job's Configuration object. This means you have to use the Configuration object you probably created in your driver to set top10 as this type of variable. Since your List may have relatively "small" Strings for each element (i.e. just a few sentences), all you have to do is use some sort of delimiter to store all of the elements in just one String (since String is the datatype used for these Configuration values), like element1#element2#element3#... (but be very careful with this, as you must always be sure there's enough memory for that String to exist in the first place, which is why this really is just a workaround). A short packing sketch follows the snippet below.
public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    conf.set("top10", " "); // initialize `top10` as an empty String
    // the description of the job(s), etc, ...
}
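For instance, packing a list into that single '#'-delimited String form could look like this (a sketch; listOfElements is hypothetical and conf is the Configuration from the driver above):
// Sketch: join the elements with the '#' delimiter described above and
// store the result as a single Configuration value.
String packed = String.join("#", listOfElements);
conf.set("top10", packed);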
In order to read and write to top10, you first need to declare it in the setup function that you need to have in both your combiner and your reducer, like this (the snippet below shows how it would look for the reducer, of course):
public static class FromParquetQueriesReducer extends Reducer<Text, Text, Void, GenericRecord> {
    private Schema schema;
    private String top10;

    @Override
    protected void setup(Context context) throws IOException, InterruptedException {
        top10 = context.getConfiguration().get("top10");
        // everything else inside the setup function...
    }

    // ...
}
With those two adjustments, you can use top10 inside your reduce function just fine, after using split to pull the elements out of the top10 String like this:
String[] data = top10.split("#"); // split the elements from the String
List<String> top10List = new ArrayList<>(); // create ArrayList
Collections.addAll(top10List, data); // put all the elements to the list
With all that being said, I must say that this type of functionality goes beyond what vanilla Hadoop MapReduce is built for. If this is anything more than a CS class assignment, you should reevaluate the use of Hadoop's MapReduce engine here and look at "extensions" like Apache Hive or Apache Spark, which are far more flexible and SQL-like and can better match aspects of your application.

Hadoop - Problem with Text to float conversion

I have a CSV file containing key-value pairs; it can have multiple records for the same key. I am writing a MapReduce program to aggregate this data - for each key, it is supposed to give the frequency of the key and the sum of its values.
My mapper reads the CSV file and emits both key and value as Text, even though they are numeric (I'm doing it this way because I ran into problems using FloatWritable for the value).
In the reducer, when I try to convert the Text value to a float, I run into a NumberFormatException, and the value shown in the error is not even in my input.
Here's my code:
public static class AggReducer extends Reducer<Text, Text, Text, Text> {
    private Text result = new Text();

    public void reduce(Text key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        int numTrips = 0;
        int totalFare = 0;
        for (Text val : values) {
            totalFare += Float.parseFloat(val.toString());
            numTrips++;
        }
        String resultStr = String.format("%1s,%2s", numTrips, totalFare);
        result.set(resultStr);
        context.write(key, result);
    }
}
Note: I made the reducer pass the mapper's output through without any changes, and that gave the expected output.
running into NumberFormatException and the value shown in the error is not even in my input
Well, that's quite impossible. The value has to be somewhere in your input or in the generated mapper output. A try/catch works just as well in a reducer as anywhere else, though.
FWIW, use DoubleWritable
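A sketch of that suggestion, keeping the asker's Text values but parsing defensively so the offending value gets logged instead of failing the job (switching the value type to DoubleWritable, as suggested, would remove the parsing altogether):
public static class AggReducer extends Reducer<Text, Text, Text, Text> {
    private final Text result = new Text();

    @Override
    public void reduce(Text key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        int numTrips = 0;
        double totalFare = 0;
        for (Text val : values) {
            try {
                totalFare += Double.parseDouble(val.toString());
                numTrips++;
            } catch (NumberFormatException e) {
                // log the unparseable value instead of crashing the reducer
                System.err.println("Bad fare for key " + key + ": " + val);
            }
        }
        result.set(numTrips + "," + totalFare);
        context.write(key, result);
    }
}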

Why does SequenceFile writer's append operation overwrites all values with the last value?

First, Consider this CustomWriter class:
public final class CustomWriter {
    private final SequenceFile.Writer writer;

    CustomWriter(Configuration configuration, Path outputPath) throws IOException {
        FileSystem fileSystem = FileSystem.get(configuration);
        if (fileSystem.exists(outputPath)) {
            fileSystem.delete(outputPath, true);
        }
        writer = SequenceFile.createWriter(configuration,
                SequenceFile.Writer.file(outputPath),
                SequenceFile.Writer.keyClass(LongWritable.class),
                SequenceFile.Writer.valueClass(ItemWritable.class),
                SequenceFile.Writer.compression(SequenceFile.CompressionType.BLOCK, new DefaultCodec()),
                SequenceFile.Writer.blockSize(1024 * 1024),
                SequenceFile.Writer.bufferSize(fileSystem.getConf().getInt("io.file.buffer.size", 4 * 1024)),
                SequenceFile.Writer.replication(fileSystem.getDefaultReplication(outputPath)),
                SequenceFile.Writer.metadata(new SequenceFile.Metadata()));
    }

    public void close() throws IOException {
        writer.close();
    }

    public void write(Item item) throws IOException {
        writer.append(new LongWritable(item.getId()), new ItemWritable(item));
    }
}
What I am trying to do is consume an asynchronous stream of Item objects. The consumer has a reference to a CustomWriter instance. It then calls the CustomWriter#write method for every item it receives. When the stream ends, the CustomWriter#close method is called to close the writer.
As you can see, I've only created a single writer and it starts appending to a brand-new file, so there is no question of that being the cause.
I should also note that I am currently running this in a unit-test environment using MiniDFSCluster as per the instructions here. If I run this in a non unit-test environment (i.e. without MiniDFSCluster), it seems to work just fine.
When I try to read the file back all I see is the last written Item object N times (where N is the total number of items that were received in the stream). Here is an example:
sparkContext.hadoopFile(path, SequenceFileInputFormat.class, LongWritable.class, ItemWritable.class)
        .collect()
        .forEach(new Consumer<Tuple2<LongWritable, ItemWritable>>() {
            @Override
            public void accept(Tuple2<LongWritable, ItemWritable> tuple) {
                LongWritable id = tuple._1();
                ItemWritable item = tuple._2();
                System.out.print(id.get() + " -> " + item.get());
            }
        });
This will print something like this:
...
1234 -> Item[...]
1234 -> Item[...]
1234 -> Item[...]
...
Am I doing something wrong or, is this a side effect of using MiniDFSCluster?
Writable objects (such as LongWritable, ItemWritable) are reused while processing data. When a record is read, the Writable usually just replaces its contents, so you keep receiving the same Writable object. You should copy them into new objects if you want to collect them into an array.
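A sketch of that fix in the asker's Spark reading code: map the reused Writables to fresh values before collect(). ItemWritable and Item are the asker's own classes, and a copy constructor for Item is assumed here:
sparkContext.hadoopFile(path, SequenceFileInputFormat.class, LongWritable.class, ItemWritable.class)
        // copy the reused Writables into plain, independent objects
        .map(tuple -> new Tuple2<>(tuple._1().get(), new Item(tuple._2().get())))
        .collect()
        .forEach(tuple -> System.out.println(tuple._1() + " -> " + tuple._2()));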

Java Properties - int becomes null

Whilst I've seen similar looking questions asked before, the accepted answers have seemingly provided an answer to a different question (IMO).
I have just joined a company and before I make any changes/fixes, I want to ensure that all the tests pass. I've fixed all but one, which I've discovered is due to some (to me) unexpected behavior in Java.
If I insert a key/value pair into a Properties object where the value is an int, I expected autoboxing to come into play and getProperty to return a string. However, that's not what's occurring (JDK 1.6) - I get null. I have written a test class below:
import java.util.*;

public class hacking {
    public static void main(String[] args) {
        Properties p = new Properties();
        p.put("key 1", 1);
        p.put("key 2", "1");
        String s;
        s = p.getProperty("key 1");
        System.err.println("First key: " + s);
        s = p.getProperty("key 2");
        System.err.println("Second key: " + s);
    }
}
And the output of this is:
C:\Development\hacking>java hacking
First key: null
Second key: 1
Looking in the Properties source code, I see this:
public String getProperty(String key) {
    Object oval = super.get(key);
    String sval = (oval instanceof String) ? (String)oval : null;
    return ((sval == null) && (defaults != null)) ? defaults.getProperty(key) : sval;
}
The offending line is the second line - if it's not a String, it uses null.
I can't see any reason why this behavior would be desired/expected. The code was written by almost certainly someone more capable than I am, so I assume there is a good reason for it. Could anyone explain? If I've done something dumb, save time and just tell me that! :-)
Many thanks
This is from the docs:
"Because Properties inherits from Hashtable, the put and putAll methods can be applied to a Properties object. Their use is strongly discouraged as they allow the caller to insert entries whose keys or values are not Strings. The setProperty method should be used instead. If the store or save method is called on a "compromised" Properties object that contains a non-String key or value, the call will fail. Similarly, the call to the propertyNames or list method will fail if it is called on a "compromised" Properties object that contains a non-String key."
I modified your code to use the setProperty method as per the docs, and it brings up a compilation error:
package com.stackoverflow.framework;

import java.util.*;

public class hacking {
    public static void main(String[] args) {
        Properties p = new Properties();
        p.setProperty("key 1", 1);
        p.setProperty("key 2", "1");
        String s;
        s = p.getProperty("key 1");
        System.err.println("First key: " + s);
        s = p.getProperty("key 2");
        System.err.println("Second key: " + s);
    }
}
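If the intent is to keep the numeric value, a sketch of the usual fix is to convert it to a String before storing it:
Properties p = new Properties();
p.setProperty("key 1", String.valueOf(1)); // store the int as a String
String s = p.getProperty("key 1");         // now returns "1"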

How to write a Hashtable<String, String> to a text file in Java?

I have a Hashtable, where htmlcontent is the HTML string of urlstring.
I want to write the Hashtable into a .txt file.
Can anyone suggest a solution?
How about one row for each entry, and two strings separated by a comma? Sort of like:
"key1","value1"
"key2","value2"
...
"keyn","valuen"
Keep the quotes, and you can write out keys that refer to null entries too, like
"key", null
To actually produce the table, you might want to use code similar to:
public void write(OutputStreamWriter out, Hashtable<String, String> table)
        throws IOException {
    String eol = System.getProperty("line.separator");
    for (String key : table.keySet()) {
        out.write("\"");
        out.write(key);
        out.write("\",\"");
        out.write(String.valueOf(table.get(key)));
        out.write("\"");
        out.write(eol);
    }
    out.flush();
}
For the I/O part, you can use a new PrintWriter(new File(filename)). Just call the println methods like you would System.out, and don't forget to close() it afterward. Make sure you handle any IOException gracefully.
If you have a specific format, you'd have to explain it, but otherwise a simple for-each loop on the Hashtable.entrySet() is all you need to iterate through the entries of the Hashtable.
By the way, if you don't need the synchronized feature, a HashMap<String,String> would probably be better than a Hashtable.
Related questions
Java io ugly try-finally block
Java hashmap vs hashtable
Iterate Over Map
Here's a simple example of putting things together, but omitting a robust IOException handling for clarity, and using a simple format:
import java.io.*;
import java.util.*;

public class HashMapText {
    public static void main(String[] args) throws IOException {
        //PrintWriter out = new PrintWriter(System.out);
        PrintWriter out = new PrintWriter(new File("map.txt"));

        Map<String,String> map = new HashMap<String,String>();
        map.put("1111", "One");
        map.put("2222", "Two");
        map.put(null, null);

        for (Map.Entry<String,String> entry : map.entrySet()) {
            out.println(entry.getKey() + "\t=>\t" + entry.getValue());
        }
        out.close();
    }
}
Running this on my machine generates a map.txt containing three lines:
null => null
2222 => Two
1111 => One
As a bonus, you can use the first declaration and initialization of out, and print the same to standard output instead of a text file.
See also
Difference between java.io.PrintWriter and java.io.BufferedWriter?
java.io.PrintWriter API
Methods in this class never throw I/O exceptions, although some of its constructors may. The client may inquire as to whether any errors have occurred by invoking checkError().
For a text representation, I would recommend picking a few characters that are very unlikely to occur in your strings, then writing a CSV-format file with those characters as separators, quotes, terminators, and escapes. Essentially, each row (as designated by the terminator, since otherwise there might be line-ending characters in either string) would have, as the first CSV "field", the key of an entry in the hashtable and, as the second field, its value.
A simpler approach along the same lines would be to designate one arbitrary character, say the backslash \, as the escape character. You'll have to double up backslashes when they occur in either string, and express the tab (\t) and line-end (\n) in escaped form; then you can use a real (not escape-sequence) tab character as the field separator between the two fields (key and value), and a real (not escape-sequence) line-end at the end of each row.
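A sketch of that escaping step (escape the backslash first, then tab and newline; a matching unescape would reverse the replacements in the opposite order):
// Escape a field before writing it, using '\' as the escape character.
static String escape(String s) {
    return s.replace("\\", "\\\\")   // escape backslashes first
            .replace("\t", "\\t")    // then tabs (the field separator)
            .replace("\n", "\\n");   // then line-ends (the row terminator)
}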
You can try
public static void save(String filename, Map<String, String> hashtable) throws IOException {
    Properties prop = new Properties();
    prop.putAll(hashtable);
    FileOutputStream fos = new FileOutputStream(filename);
    try {
        prop.store(fos, null); // second argument is an optional comment header
    } finally {
        fos.close();
    }
}
This stores the hashtable (or any Map) as a properties file. You can use the Properties class to load the data back in again.
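Loading it back could look something like this (a sketch; the filename parameter mirrors the save method above):
public static Map<String, String> load(String filename) throws IOException {
    Properties prop = new Properties();
    FileInputStream fis = new FileInputStream(filename);
    try {
        prop.load(fis); // read the stored key/value pairs
    } finally {
        fis.close();
    }
    Map<String, String> map = new HashMap<String, String>();
    for (String name : prop.stringPropertyNames()) {
        map.put(name, prop.getProperty(name));
    }
    return map;
}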
import java.io.*;
import java.util.*;

class FileWrite {
    public static void main(String args[]) {
        Hashtable<String, String> table = null; // get the table
        BufferedWriter writer = null;
        try {
            // Create file
            writer = new BufferedWriter(new FileWriter("out.txt"));
            writer.write(table.toString());
        } catch (Exception e) {
            e.printStackTrace();
        } finally {
            try {
                if (writer != null) {
                    writer.close();
                }
            } catch (IOException e) {
                e.printStackTrace();
            }
        }
    }
}
Since you don't have any requirements for the file format, I would not create a custom one. Just use something standard. I would recommend JSON for that!
Alternatives include XML and CSV, but I think JSON is the best option here. CSV doesn't handle complex types (like having a list in one of the keys of your map), and XML can be quite complex to encode/decode.
Using json-simple as example:
String serialized = JSONValue.toJSONString(yourMap);
and then just save the string to your file (which isn't specific to your domain either; here using Apache Commons IO):
FileUtils.writeStringToFile(new File(yourFilePath), serialized);
To read the file:
Map map = (JSONObject) JSONValue.parse(FileUtils.readFileToString(new File(yourFilePath)));
You can use other json library as well but I think this one fits your need.

Categories

Resources