I am using Hadoop MapReduce to count things in our HBase database, and I am using the initTableMapperJob method to set up my HBase MapReduce job.
The code snippet below works, but I need to pass in values so that I only count rows where part of the key is greater than X. The API docs don't show any other method I can use to pass in custom values.
TableMapReduceUtil.initTableMapperJob(tableName, scan, MyMapper.class, Text.class, IntWritable.class, job);
MyMapper extends TableMapper, but I didn't find anything on that class either.
Ideally, I'd like to write a function like this:
@Override
public void map(ImmutableBytesWritable rowkey, Result columns, Context context) throws IOException, InterruptedException {
    // extract the ci as the key from the rowkey
    String pattern = "\\d*_(\\S*)_(\\d{10})";
    String ciname = null;
    Pattern r = Pattern.compile(pattern);
    String strRowKey = Bytes.toString(rowkey.get());
    Matcher m = r.matcher(strRowKey);
    long ts = 0;
    // get parts of the rowkey (guard against non-matching keys)
    if (m.find()) {
        ts = Long.parseLong(m.group(2));
        ciname = m.group(1);
        // check the time here to see if we count this row or not
        if ((ts > starttime) && (ts <= endtime)) {
            context.write(new Text(ciname), ONE);
        }
    }
}
I need a way to pass the starttime and endtime values into the mapper for the comparison that decides whether we count the row or not.
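One common way to do this (a sketch only; the "mycount.*" property names below are made up, not from the question) is to put the two values into the job Configuration before calling initTableMapperJob, then read them back in the mapper's setup() method:
// Driver side: stash the bounds in the Configuration (hypothetical property names).
Configuration conf = job.getConfiguration();
conf.setLong("mycount.starttime", starttime);
conf.setLong("mycount.endtime", endtime);
TableMapReduceUtil.initTableMapperJob(tableName, scan, MyMapper.class, Text.class, IntWritable.class, job);

// Mapper side: read them once per task in setup().
private long starttime;
private long endtime;

@Override
protected void setup(Context context) {
    starttime = context.getConfiguration().getLong("mycount.starttime", 0L);
    endtime = context.getConfiguration().getLong("mycount.endtime", Long.MAX_VALUE);
}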
The for loop at the end of this is very slow when there are 50k rows. Is there a quicker way to get a list of Strings from the rows of a javax.servlet.jsp.jstl.sql.Result? Use a different collection? Convert the Strings differently?
The query is absolutely fine. I am not including here the custom objects that are used to run it. I can't change them. I think you only need to know that it returns a Result.
(I didn't write any of this code.)
private int m_ObjectGroupId;
private int m_FetchSize;
private ArrayList<String> m_InternalIds;
...method following member initialisation...
String internalIdString = "INTERNALID";
String selectSql = "SELECT " + internalIdString + " FROM MODELOBJECT WHERE OBJECTGROUPID = ?";
ArrayList<Object> valuesToSet = new ArrayList<Object>();
valuesToSet.add(m_ObjectGroupId);
BaseForPreparedStatement selectStatement = new BaseForPreparedStatement(selectSql.toString(), valuesToSet);
SqlQueryResult queryResult = DBUtils.executeQueryRS(p_Context, selectStatement, getConnection(), m_FetchSize);
Result result = queryResult.getResult();
m_InternalIds = new ArrayList<>(result.getRowCount());
for (int i = 0; i < result.getRowCount(); i++) {
    m_InternalIds.add((String) result.getRows()[i].get(internalIdString));
}
UPDATE:
The query only takes 1s whereas the loop takes 30s.
result.getRows().getClass() is a java.util.SortedMap[].
Depending on the implementation of javax.servlet.jsp.jstl.sql.Result#getRows() (for example the Tomcat taglibs at https://github.com/apache/tomcat-taglibs-standard/blob/main/impl/src/main/java/org/apache/taglibs/standard/tag/common/sql/ResultImpl.java#L134), it may be that getRows() does unnecessary work each time you call it.
You could rewrite your extraction loop as
for (SortedMap m: result.getRows()) {
m_InternalIds.add((String) m.get(internalIdString));
}
which calls getRows() only once.
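If you prefer to keep an indexed loop, an equivalent sketch (same idea, same field names as in the question) caches the array returned by getRows() in a local variable so it is only computed once:
SortedMap[] rows = result.getRows();  // evaluate getRows() a single time
m_InternalIds = new ArrayList<>(rows.length);
for (int i = 0; i < rows.length; i++) {
    m_InternalIds.add((String) rows[i].get(internalIdString));
}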
My mapper class will output key-value pairs like:
abc 1
abc 2
abc 1
I want to merge the values and count how often each value occurs per key in the reducer class using a HashMap, so that the output looks like:
abc 1:2 2:1
But my output result is:
abc 1:2:1 2:1:1
It looks like additional Strings are being concatenated onto the output, but I don't know why.
Here is my code:
Text combiner = new Text();
StringBuilder strBuilder = new StringBuilder();

@Override
public void reduce(Text key, Iterable<Text> values,
                   Context context
                   ) throws IOException, InterruptedException {
    HashMap<Text, Integer> result = new HashMap<Text, Integer>();
    for (Text val : values) {
        if (result.containsKey(val)) {
            int newVal = result.get(val) + 1;
            result.put(val, newVal);
        } else {
            result.put(val, 1);
        }
    }
    for (Map.Entry<Text, Integer> entry : result.entrySet()) {
        strBuilder.append(entry.getKey().toString());
        strBuilder.append(":");
        strBuilder.append(entry.getValue());
        strBuilder.append("\t");
    }
    combiner.set(strBuilder.toString());
    context.write(key, combiner);
}
I tested this code and it looks OK. The most likely reason you're getting output like this is that you're also running this reducer as your combiner, which would explain why you're getting three values: the combiner does the first concatenation, and the reducer then does a second.
You need to make sure a combiner isn't being configured in your job setup.
I would also suggest you change your code to make sure you store new copies of the Text values in your HashMap; remember that Hadoop reuses these objects. So you should really be doing something like:
result.put(new Text(val), newVal);
or change your HashMap to store Strings, which is safe since they're immutable.
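For reference, a minimal sketch of that String-keyed variant (reusing the loop from the question) might look like:
HashMap<String, Integer> result = new HashMap<String, Integer>();
for (Text val : values) {
    String v = val.toString();  // copies the bytes out of the reused Text object
    Integer count = result.get(v);
    result.put(v, count == null ? 1 : count + 1);
}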
I am using Hadoop MapReduce to process an XML file, and I am storing the resulting JSON data directly into MongoDB. How can I ensure that only non-duplicate records are stored in the database when executing the BulkWriteOperation?
The duplicate criteria are based on product image and product name. I do not want to use a Morphia layer where we could assign indexes to the class members.
Here is my reducer class:
public class XMLReducer extends Reducer<Text, MapWritable, Text, NullWritable> {
    private static final Logger LOGGER = Logger.getLogger(XMLReducer.class);

    protected void reduce(Text key, Iterable<MapWritable> values, Context ctx) throws IOException, InterruptedException {
        LOGGER.info("reduce()------Start for key>" + key);
        Map<String, String> insertProductInfo = new HashMap<String, String>();
        try {
            MongoClient mongoClient = new MongoClient("localhost", 27017);
            DB db = mongoClient.getDB("test");
            BulkWriteOperation operation = db.getCollection("product").initializeOrderedBulkOperation();
            for (MapWritable entry : values) {
                for (Entry<Writable, Writable> extractProductInfo : entry.entrySet()) {
                    insertProductInfo.put(extractProductInfo.getKey().toString(), extractProductInfo.getValue().toString());
                }
                if (!insertProductInfo.isEmpty()) {
                    BasicDBObject basicDBObject = new BasicDBObject(insertProductInfo);
                    operation.insert(basicDBObject);
                }
            }
            // How can I check for duplicates before executing the bulk operation?
            operation.execute();
            LOGGER.info("reduce------end for key" + key);
        } catch (Exception e) {
            LOGGER.error("General Exception in XMLReducer", e);
        }
    }
}
EDIT: After the suggested answer I have added:
BasicDBObject query = new BasicDBObject("product_image", basicDBObject.get("product_image"))
.append("product_name", basicDBObject.get("product_name"));
operation.find(query).upsert().updateOne(new BasicDBObject("$setOnInsert", basicDBObject));
operation.insert(basicDBObject);
I am getting an error like: com.mongodb.MongoInternalException: no mapping found for index 0
Any help will be useful. Thanks.
I suppose it all depends on what you really want to do with the "duplicates" here as to how you handle it.
For one, you can always use .initializeUnorderedBulkOperation(), which won't "error" on a duplicate key from your index (which you need in order to stop duplicates) but will report any such errors in the BulkWriteResult object returned from .execute():
BulkWriteResult result = operation.execute();
On the other hand, you can just use "upserts" instead and use operators such as $setOnInsert to only make changes where no duplicate existed:
BasicDBObject basicdbobject = new BasicDBObject(insertProductInfo);
BasicDBObject query = new BasicDBObject("key", basicdbobject.get("key"));
operation.find(query).upsert().updateOne(new BasicDBObject("$setOnInsert", basicdbobject));
So you basically look up the value of the field that holds the "key" with a query to determine a duplicate, and then only actually change any data where that "key" was not found, in which case a new document is "inserted".
In either case the default behaviour here will be to "insert" the first unique "key" value and then ignore all other occurrences. If you want to do other things, like "overwrite" or "increment" values where the same key is found, then the .update() "upsert" approach is the one you want, but you will use other update operators for those actions.
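Applied to the two fields named in the question's EDIT (product_image and product_name), a sketch of the upsert approach might look like the following; note that a separate operation.insert() should not also be queued for the same document, since the upsert alone creates it when it is missing:
// Sketch based on the answer above, using the field names from the question.
BasicDBObject doc = new BasicDBObject(insertProductInfo);
BasicDBObject query = new BasicDBObject("product_image", doc.get("product_image"))
        .append("product_name", doc.get("product_name"));
// Insert only when no document with this image/name pair exists yet.
operation.find(query).upsert().updateOne(new BasicDBObject("$setOnInsert", doc));
// Do not also call operation.insert(doc) for the same document.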
In our latest upgrade of our CDH cluster we have come across many methods and classes which have been deprecated.
One such case is the method raw(), which I was using to get the epochTimestamp out of our HBase table records as shown below:
String epochTimestamp = String.valueOf(values.raw()[0].getTimestamp());
My PM has asked me to get rid of all such deprecated functions and replace them with the latest equivalents.
From https://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/Result.html I found that listCells() is the equivalent of raw(), but can anyone help me with how to obtain the epochTimestamp from an HBase record using listCells()?
A replacement for raw() could be to iterate over result.listCells(), which gives you each Cell; to get a cell's value, use CellUtil.cloneValue(cell).
See the documentation:
https://hbase.apache.org/apidocs/org/apache/hadoop/hbase/CellUtil.html
Example:
for (Cell cell : result.listCells()) {
    String row = new String(CellUtil.cloneRow(cell));
    String family = new String(CellUtil.cloneFamily(cell));
    String column = new String(CellUtil.cloneQualifier(cell));
    String value = new String(CellUtil.cloneValue(cell));
    long timestamp = cell.getTimestamp();
    System.out.printf("%-20s column=%s:%s, timestamp=%s, value=%s\n",
            row, family, column, timestamp, value);
}
As per documentation:
public List<Cell> listCells()
Create a sorted list of the Cell's in this result.
Since HBase 0.20.5 this is equivalent to raw().
Thus your code should look like:
String epochTimestamp = String.valueOf(values.listCells().get(0).getTimestamp());
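Since listCells() can return null when the Result has no cells, a slightly more defensive sketch would be:
List<Cell> cells = values.listCells();
String epochTimestamp = (cells != null && !cells.isEmpty())
        ? String.valueOf(cells.get(0).getTimestamp())
        : null;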
To get all columns and values, you can use CellUtil like below:
List<Cell> cells = values.listCells();
Map<String, String> result = new HashMap<>();
for (Cell c : cells) {
    result.put(Bytes.toString(CellUtil.cloneQualifier(c)), Bytes.toString(CellUtil.cloneValue(c)));
}
I am currently working on a MapReduce job in which I only use the mapper, without a reducer. I do not need to write the key out, because I only need the values, which are stored in an array, and I want to write them out as my final output file. How can I achieve this in Hadoop? Instead of writing both the key and the value to the output, I am only interested in writing out the values. The values are in an array. Thanks.
public void pfor(TestFor pfor, LongWritable key, Text value, Context context, int times) throws IOException, InterruptedException {
    int n = 0;
    while (n < times) {
        pfor.pforMap(key, value, context);
        n++;
    }
    for (int i = 0; i < uv.length; i++) {
        LOG.info(uv[i].get() + " Final output");
    }
    IntArrayWritable edge = new IntArrayWritable();
    edge.set(uv);
    context.write(new IntWritable(java.lang.Math.abs(randGen.nextInt())), edge);
    uv = null;
}
Use NullWritable as value and emit your "edge" as key.
https://hadoop.apache.org/docs/stable/api/org/apache/hadoop/io/NullWritable.html
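A minimal sketch of that idea, reusing the uv array and IntArrayWritable from the question (the driver lines are the usual map-only setup, not something from the original post):
// In the mapper: emit the array as the key and NullWritable as the value.
IntArrayWritable edge = new IntArrayWritable();
edge.set(uv);
context.write(edge, NullWritable.get());

// In the driver: a map-only job with matching output types.
job.setNumReduceTasks(0);
job.setOutputKeyClass(IntArrayWritable.class);
job.setOutputValueClass(NullWritable.class);
With TextOutputFormat, a NullWritable value means only the key is written and no trailing separator is added, so each output line is just the array's toString() (assuming IntArrayWritable overrides it to something readable).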