I am trying to create a MapReduce job in Java over a table in an HBase database. Using the examples from here and other material from the internet, I managed to write a simple row counter. However, my attempt to write a job that actually does something with the data from a column was unsuccessful, since the bytes I receive are always null.
Part of the job's Driver is this:
/* Set main, map and reduce classes */
job.setJarByClass(Driver.class);
job.setMapperClass(Map.class);
job.setReducerClass(Reduce.class);
Scan scan = new Scan();
scan.setCaching(500);
scan.setCacheBlocks(false);
/* Get data only from the last 24h */
Timestamp timestamp = new Timestamp(System.currentTimeMillis());
try {
long now = timestamp.getTime();
scan.setTimeRange(now - 24 * 60 * 60 * 1000, now);
} catch (IOException e) {
e.printStackTrace();
}
/* Initialize the initTableMapperJob */
TableMapReduceUtil.initTableMapperJob(
"dnsr",
scan,
Map.class,
Text.class,
Text.class,
job);
/* Set output parameters */
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(Text.class);
job.setOutputFormatClass(TextOutputFormat.class);
As you can see, the table is called dnsr. My mapper looks like this:
@Override
public void map(ImmutableBytesWritable row, Result value, Context context)
throws InterruptedException, IOException {
byte[] columnValue = value.getValue("d".getBytes(), "fqdn".getBytes());
if (columnValue == null)
return;
byte[] firstSeen = value.getValue("d".getBytes(), "fs".getBytes());
// if (firstSeen == null)
// return;
String fqdn = new String(columnValue).toLowerCase();
String fs = (firstSeen == null) ? "empty" : new String(firstSeen);
context.write(new Text(fqdn), new Text(fs));
}
Some notes:
the column family from the dnsr table is just d. There are multiple columns, some of them being called fqdn and fs (firstSeen);
even if the fqdn values appear correctly, the fs are always the "empty" string (I added this check after I had some errors that were saying that you can't convert null to a new string);
if I change the fs column name to something else, for example ls (lastSeen), it works;
the reducer doesn't do anything, just outputs everything it receives.
I created a simple table scanner in JavaScript that queries the exact same table and columns, and I can clearly see the values are there. Running queries manually from the command line, I can also see that the fs values are not null; they are bytes that can later be converted into a string (representing a date).
What could be causing me to always get null?
Thanks!
Update:
If I get all the columns in a specific column family, I don't receive fs. However, a simple scanner implemented in JavaScript returns fs as a column from the dnsr table.
@Override
public void map(ImmutableBytesWritable row, Result value, Context context)
throws InterruptedException, IOException {
byte[] columnValue = value.getValue(columnFamily, fqdnColumnName);
if (columnValue == null)
return;
String fqdn = new String(columnValue).toLowerCase();
/* Getting all the columns */
String[] cns = getColumnsInColumnFamily(value, "d");
StringBuilder sb = new StringBuilder();
for (String s : cns) {
sb.append(s).append(";");
}
context.write(new Text(fqdn), new Text(sb.toString()));
}
I used an answer from here to get all the column names.
In the end, I managed to find the 'problem'. HBase is a column-oriented datastore: data is stored and retrieved by column, so only the relevant data has to be read when only part of a row is required. Every column family has one or more column qualifiers (columns), and each column can hold multiple cells. The interesting part is that every cell has its own timestamp.
Why was this the problem? When you do a time-ranged scan, only the cells whose timestamps fall within that range are returned, so you may end up with a row that has "missing" cells. In my case, each row held a DNS record plus fields such as firstSeen and lastSeen. lastSeen is updated every time I see that domain, whereas firstSeen remains unchanged after the first occurrence. As soon as I changed the time-ranged MapReduce job to a plain MapReduce job (over all-time data), everything was fine (but the job took longer to finish).
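To make the effect concrete, here is a minimal plain-Java sketch (no HBase dependency) that models a row as a list of cells, each with its own timestamp, and applies the same half-open [min, max) range that Scan.setTimeRange uses. The Cell class, the qualifiersInRange helper and the sample timestamps are hypothetical names and values chosen for illustration only.

```java
import java.util.ArrayList;
import java.util.List;

class TimeRangeDemo {
    /** Hypothetical stand-in for one HBase cell: a qualifier plus its own write timestamp. */
    static final class Cell {
        final String qualifier;
        final long timestamp;

        Cell(String qualifier, long timestamp) {
            this.qualifier = qualifier;
            this.timestamp = timestamp;
        }
    }

    /** Returns the qualifiers whose cell timestamp falls in [min, max). */
    static List<String> qualifiersInRange(List<Cell> row, long min, long max) {
        List<String> visible = new ArrayList<>();
        for (Cell c : row) {
            if (c.timestamp >= min && c.timestamp < max) {
                visible.add(c.qualifier);
            }
        }
        return visible;
    }

    /** Builds a sample row: fs written 30 days ago, fqdn and ls written one hour ago. */
    static List<Cell> sampleRow(long now) {
        long hour = 60L * 60 * 1000;
        List<Cell> row = new ArrayList<>();
        row.add(new Cell("fqdn", now - hour));
        row.add(new Cell("fs", now - 30 * 24 * hour));
        row.add(new Cell("ls", now - hour));
        return row;
    }

    public static void main(String[] args) {
        long now = System.currentTimeMillis();
        long day = 24L * 60 * 60 * 1000;
        // Last 24 hours only: the old fs cell is filtered out.
        System.out.println(qualifiersInRange(sampleRow(now), now - day, now));
        // No time restriction: every cell of the row is visible.
        System.out.println(qualifiersInRange(sampleRow(now), 0, Long.MAX_VALUE));
    }
}
```

With the 24-hour window the sketch prints [fqdn, ls]; without it, [fqdn, fs, ls]. This matches the behaviour described above, where fs only disappears when the scan is time-ranged.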
Cheers!
We created a program to make the database easier to use from other programs, so the code I'm showing is used in multiple other programs.
One of those other programs gets about 10,000 records from one of our clients and has to check whether they are already in our database. If not, we insert them into the database (they can also change, in which case they have to be updated).
To make this easy, we load all the entries from our whole table (120,000 at the moment), create an object for every entry we get, and put all of them into a HashMap.
The loading of the whole table this way takes around 5 minutes. Also we sometimes have to restart the program because we run into a GC overhead error because we work on limited hardware. Do you have an idea of how we can improve the performance?
Here is the code to load all entries (we have a global limit of 10.000 entries per query so we use a loop):
public Map<String, IMasterDataSet> getAllInformationObjects(ISession session) throws MasterDataException {
IQueryExpression qe;
IQueryParameter qp;
// our main SDP class
Constructor<?> constructorForSDPbaseClass = getStandardConstructor();
SimpleDateFormat itaTimestampFormat = new SimpleDateFormat("yyyyMMddHHmmssSSS");
// search in standard time range (modification date!)
Calendar cal = Calendar.getInstance();
cal.set(2010, Calendar.JANUARY, 1);
Date startDate = cal.getTime();
Date endDate = new Date();
Long startDateL = Long.parseLong(itaTimestampFormat.format(startDate));
Long endDateL = Long.parseLong(itaTimestampFormat.format(endDate));
IDescriptor modDesc = IBVRIDescriptor.ModificationDate.getDescriptor(session);
// count once before to determine initial capacities for hash map/set
IBVRIArchiveClass SDP_ARCHIVECLASS = getMasterDataPropertyBag().getSDP_ARCHIVECLASS();
qe = SDP_ARCHIVECLASS.getQueryExpression(session);
qp = session.getDocumentServer().getClassFactory()
.getQueryParameterInstance(session, new String[] {SDP_ARCHIVECLASS.getDatabaseName(session)}, null, null);
qp.setExpression(qe);
qp.setHitLimitThreshold(0);
qp.setHitLimit(0);
int nrOfHitsTotal = session.getDocumentServer().queryCount(session, qp, "*");
int initialCapacity = (int) (nrOfHitsTotal / 0.75 + 1);
// MD sets; and objects already done (here: document ID)
HashSet<String> objDone = new HashSet<>(initialCapacity);
HashMap<String, IMasterDataSet> objRes = new HashMap<>(initialCapacity);
qp.close();
// do queries until hit count is smaller than 10.000
// use modification date
boolean keepGoing = true;
while(keepGoing) {
// construct query expression
// - basic part: Modification date & class type
// a. doc. class type
qe = SDP_ARCHIVECLASS.getQueryExpression(session);
// b. ID
qe = SearchUtil.appendQueryExpressionWithANDoperator(session, qe,
new PlainExpression(modDesc.getQueryLiteral() + " BETWEEN " + startDateL + " AND " + endDateL));
// 2. Query Parameter: set database; set expression
qp = session.getDocumentServer().getClassFactory()
.getQueryParameterInstance(session, new String[] {SDP_ARCHIVECLASS.getDatabaseName(session)}, null, null);
qp.setExpression(qe);
// order by modification date; hitlimit = 0 -> no hitlimit, but the usual 10.000 max
qp.setOrderByExpression(session.getDocumentServer().getClassFactory().getOrderByExpressionInstance(modDesc, true));
qp.setHitLimitThreshold(0);
qp.setHitLimit(0);
// Do not sort by modification date;
qp.setHints("+NoDefaultOrderBy");
keepGoing = false;
IInformationObject[] hits = null;
IDocumentHitList hitList = null;
hitList = session.getDocumentServer().query(qp, session);
IDocument doc;
if (hitList.getTotalHitCount() > 0) {
hits = hitList.getInformationObjects();
for (IInformationObject hit : hits) {
String objID = hit.getID();
if(!objDone.contains(objID)) {
// do something with this object and the class
// here: construct a new SDP sub class object and give it back via interface
doc = (IDocument) hit;
IMasterDataSet mdSet;
try {
mdSet = (IMasterDataSet) constructorForSDPbaseClass.newInstance(session, doc);
} catch (Exception e) {
// cause for this
String cause = (e.getCause() != null) ? e.getCause().toString() : MasterDataException.ERRMSG_PART_UNKNOWN;
throw new MasterDataException(MasterDataException.ERRMSG_NOINSTANCE_POSSIBLE, this.getClass().getSimpleName(), e.toString(), cause);
}
objRes.put(mdSet.getID(), mdSet);
objDone.add(objID);
}
}
doc = (IDocument) hits[hits.length - 1];
Date lastModDate = ((IDateValue) doc.getDescriptor(modDesc).getValues()[0]).getValue();
startDateL = Long.parseLong(itaTimestampFormat.format(lastModDate));
keepGoing = (hits.length >= 10000 || hitList.isResultSetTruncated());
}
qp.close();
}
return objRes;
}
Loading 120,000 rows (and more) each time will not scale very well, and your solution may stop working as the number of records grows. Instead, let the database server handle the problem.
Your table needs to have a primary key or unique key based on the columns of the records. Iterate through the 10,000 records, performing a JDBC SQL update for each one that sets all field values, with a where clause that exactly matches the primary/unique key.
update BLAH set COL1 = ?, COL2 = ? where PKCOL = ?; // ... AND PKCOL2 =? ...
This modifies an existing row or does nothing at all, and JDBC executeUpdate() will return 0 or 1, indicating the number of rows changed. If the number of rows changed is zero, you have detected a new record that does not exist yet, so perform an insert for that new record only.
insert into BLAH (COL1, COL2, ... PKCOL) values (?,?, ..., ?);
You can decide whether to run 10,000 updates followed by however many inserts are needed, or to do an update plus an optional insert per record. Remember that JDBC batch statements and turning auto-commit off may help speed things up.
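The update-then-insert flow can be sketched without a database. In this hedged example the RowStore interface and the InMemoryStore class are hypothetical stand-ins: with JDBC, update would wrap PreparedStatement.executeUpdate() for the UPDATE statement above and insert would run the INSERT statement.

```java
import java.util.HashMap;
import java.util.Map;

class UpsertDemo {
    /** Hypothetical abstraction over the two SQL statements shown above. */
    interface RowStore {
        /** Mirrors executeUpdate() on the UPDATE: returns the number of rows changed (0 or 1). */
        int update(String key, String value);

        /** Mirrors the INSERT for a key that does not exist yet. */
        void insert(String key, String value);
    }

    /** In-memory stand-in for the table, keyed by primary key. */
    static final class InMemoryStore implements RowStore {
        final Map<String, String> rows = new HashMap<>();

        @Override
        public int update(String key, String value) {
            if (!rows.containsKey(key)) {
                return 0; // no row matched the where clause
            }
            rows.put(key, value);
            return 1;
        }

        @Override
        public void insert(String key, String value) {
            rows.put(key, value);
        }
    }

    /** Tries an update for every incoming record and inserts only when no row matched. Returns the insert count. */
    static int sync(RowStore store, Map<String, String> incoming) {
        int inserted = 0;
        for (Map.Entry<String, String> e : incoming.entrySet()) {
            if (store.update(e.getKey(), e.getValue()) == 0) {
                store.insert(e.getKey(), e.getValue());
                inserted++;
            }
        }
        return inserted;
    }
}
```

Only the incoming records are touched, so the 120,000 existing rows never have to be loaded into memory.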
So I have a collection of emails and what I want to do is use them to output unique triplets (sender email, receiver email, timestamp) like so:
user1@stackoverflow.com user2@stackoverflow.com 09/12/2009 16:45
user1@stackoverflow.com user9@stackoverflow.com 09/12/2009 18:45
user3@stackoverflow.com user4@stackoverflow.com 07/05/2008 12:29
In the above example, user 1 sent a single email to multiple recipients (user 2 and user 9). To store the recipients, I created a data structure EdgeWritable (which implements WritableComparable) that holds the sender and recipient email addresses as well as a timestamp.
My mapper looks like this:
private final EdgeWritable edge = new EdgeWritable(); // Data structure for triplets.
private final NullWritable noval = NullWritable.get();
...
@Override
public void map(Text key, BytesWritable value, Context context)
throws IOException, InterruptedException {
byte[] bytes = value.getBytes();
Scanner scanner = new Scanner(new ByteArrayInputStream(bytes), "UTF-8");
String from = null; // Sender's Email address
ArrayList<String> recipients = new ArrayList<String>(); // List of recipients' Email addresses
long millis = -1; // Date
// Parse information from file
while(scanner.hasNext()) {
String line = scanner.nextLine();
if (line.startsWith("From:")) {
from = procFrom(stripCommand(line, "From:")); // Get sender e-mail address.
} else if (line.startsWith("To:")) {
procRecipients(stripCommand(line, "To:"), recipients); // Populate recipients into a list.
} else if (line.startsWith("Date:")) {
millis = procDate(stripCommand(line, "Date:")); // Get timestamp.
}
if (line.equals("")) { // Empty line indicates the end of the header
break;
}
}
scanner.close();
// Emit EdgeWritable as intermediate key containing Sender, Recipient and Timestamp.
if (from != null && recipients.size() > 0 && millis != -1) {
//EdgeWritable has 2 Text values (ew[0] and ew[1]) and a Timestamp. ew[0] is the sender, ew[1] is a recipient.
edge.set(0, from); // Set ew[0]
for(int i = 0; i < recipients.size(); i++) {
edge.set(1, recipients.get(i)); // Set edge from sender to each recipient i.
edge.setTS(millis); // Set date.
context.write(edge, noval); // Emit the edge as an intermediate key with a null value.
}
}
}
...
My reducer simply formats the date and outputs the edges:
public void reduce(EdgeWritable key, Iterable<NullWritable> values, Context context) throws IOException, InterruptedException {
String date = MailReader.sdf.format(key.getTS());
out.set(key.get(0) + " " + key.get(1) + " " + date); // key is the same edge emitted by the Mapper (an EdgeWritable).
context.write(noval, out); // same noval from Mapper (a NullWritable).
}
Using EdgeWritable as the intermediate key and NullWritable as the value (in mapper) is a requirement, I'm not permitted to use other methods. This is my first Hadoop / MapReduce program and I just wanted to know that I'm going in the right direction. I have looked at plenty of MapReduce examples online and have never seen key/value pairs being emitted in a for-loop the way I have done it. I feel like I'm missing some sort of trick here, but using a for-loop in this way is the only approach I can think of.
Is this 'bad'? I hope this is clear but please let me know if any further clarification is needed.
The map method is called once per record, so your array list holds only one record's data on each call. Declare the array list at class level so that you can store values across all records. Then move the emit logic you have written inside map into the cleanup method. Try this and let me know if it works.
I have rows with the following keys in the HBase table "mytable":
user_1
user_2
user_3
...
user_9999999
I want to use the HBase shell to delete rows from:
user_500 to user_900
I know there is no direct way to delete a range of rows, but is there a way I could use the "BulkDeleteProcessor" to do this?
I see here:
https://github.com/apache/hbase/blob/master/hbase-examples/src/test/java/org/apache/hadoop/hbase/coprocessor/example/TestBulkDeleteProtocol.java
I want to just paste in imports and then paste this into the shell, but have no idea how to go about this. Does anyone know how I can use this endpoint from the jruby hbase shell?
Table ht = TEST_UTIL.getConnection().getTable("my_table");
long noOfDeletedRows = 0L;
Batch.Call<BulkDeleteService, BulkDeleteResponse> callable =
new Batch.Call<BulkDeleteService, BulkDeleteResponse>() {
ServerRpcController controller = new ServerRpcController();
BlockingRpcCallback<BulkDeleteResponse> rpcCallback =
new BlockingRpcCallback<BulkDeleteResponse>();
public BulkDeleteResponse call(BulkDeleteService service) throws IOException {
Builder builder = BulkDeleteRequest.newBuilder();
builder.setScan(ProtobufUtil.toScan(scan));
builder.setDeleteType(deleteType);
builder.setRowBatchSize(rowBatchSize);
if (timeStamp != null) {
builder.setTimestamp(timeStamp);
}
service.delete(controller, builder.build(), rpcCallback);
return rpcCallback.get();
}
};
Map<byte[], BulkDeleteResponse> result = ht.coprocessorService(BulkDeleteService.class, scan
.getStartRow(), scan.getStopRow(), callable);
for (BulkDeleteResponse response : result.values()) {
noOfDeletedRows += response.getRowsDeleted();
}
ht.close();
If there is no way to do this through JRuby, Java or another way to quickly delete multiple rows is fine.
Do you really want to do it in the shell? There are various better ways. One is to use the native Java API:
Construct an array list of deletes.
Pass this array list to the Table.delete method.
Method 1: If you already know the range of keys.
public void massDelete(byte[] tableName) throws IOException {
HTable table=(HTable)hbasePool.getTable(tableName);
String tablePrefix = "user_";
int startRange = 500;
int endRange = 999;
List<Delete> listOfBatchDelete = new ArrayList<Delete>();
for(int i=startRange;i<=endRange;i++){
String key = tablePrefix+i;
Delete d=new Delete(Bytes.toBytes(key));
listOfBatchDelete.add(d);
}
try {
table.delete(listOfBatchDelete);
} finally {
if (hbasePool != null && table != null) {
hbasePool.putTable(table);
}
}
}
Method 2: If you want to do a batch delete on the basis of a scan result.
public void bulkDelete(final HTable table) throws IOException {
Scan s=new Scan();
List<Delete> listOfBatchDelete = new ArrayList<Delete>();
// add your filter to the scan here, e.g. s.setFilter(yourFilter);
ResultScanner scanner=table.getScanner(s);
for (Result rr : scanner) {
Delete d=new Delete(rr.getRow());
listOfBatchDelete.add(d);
}
try {
table.delete(listOfBatchDelete);
} catch (Exception e) {
LOGGER.log(e);
}
}
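One thing to watch with the scan-based approach: HBase orders row keys as raw bytes, not as numbers, so a scan from user_500 to user_900 would also cover keys such as user_5000 or user_62. The sketch below mimics that ordering with a plain TreeSet of strings (for ASCII keys, Java's string ordering matches byte ordering). The keysInRange helper and the sample keys are made up for illustration and are not HBase API.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.TreeSet;

class RowKeyOrderDemo {
    /** Returns the keys in [startRow, stopRow), mimicking a scan's start-inclusive, stop-exclusive range. */
    static List<String> keysInRange(TreeSet<String> sortedKeys, String startRow, String stopRow) {
        return new ArrayList<>(sortedKeys.subSet(startRow, true, stopRow, false));
    }

    public static void main(String[] args) {
        TreeSet<String> keys = new TreeSet<>(Arrays.asList(
                "user_62", "user_499", "user_500", "user_750", "user_900", "user_1000", "user_5000"));
        // Prints [user_500, user_5000, user_62, user_750]
        System.out.println(keysInRange(keys, "user_500", "user_900"));
    }
}
```

Generating the keys explicitly, as in Method 1, avoids this surprise. If the keys were zero-padded to a fixed width, byte order and numeric order would agree.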
Now, coming to the use of a coprocessor, just one piece of advice: don't use a coprocessor unless you are an expert in HBase.
Coprocessors have many built-in issues; if you need, I can provide a detailed description.
Secondly, when you delete anything from HBase, it is never removed immediately. A tombstone marker is attached to the record, and the data is physically removed later, during a major compaction, so there is no need to use a coprocessor, which is highly resource-intensive.
Modified code to support batch operation.
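int batchSize = 50;
int batchCounter = 0;
for (int i = startRange; i <= endRange; i++) {
    String key = tablePrefix + i;
    Delete d = new Delete(Bytes.toBytes(key));
    listOfBatchDelete.add(d);
    batchCounter++;
    // Send a full batch to HBase and start collecting the next one.
    if (batchCounter == batchSize) {
        table.delete(listOfBatchDelete);
        listOfBatchDelete.clear();
        batchCounter = 0;
    }
}
// Send whatever is left over from the last, partial batch.
if (!listOfBatchDelete.isEmpty()) {
    table.delete(listOfBatchDelete);
}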
Creating HBase conf and getting table instance.
Configuration hConf = HBaseConfiguration.create(conf);
hConf.set("hbase.zookeeper.quorum", "Zookeeper IP");
hConf.set("hbase.zookeeper.property.clientPort", ZookeeperPort);
HTable hTable = new HTable(hConf, tableName);
If you already know the row keys of the records that you want to delete from the HBase table, you can use the following approach:
1. First, create a list of Delete objects from these row keys
for (int rowKey = 1; rowKey <= 10; rowKey++) {
deleteList.add(new Delete(Bytes.toBytes(rowKey + "")));
}
2. Then get the Table object from the HBase Connection
Table table = connection.getTable(TableName.valueOf(tableName));
3. Once you have the Table object, call delete(), passing it the list
table.delete(deleteList);
The complete code looks like this:
Configuration config = HBaseConfiguration.create();
config.addResource(new Path("/etc/hbase/conf/hbase-site.xml"));
config.addResource(new Path("/etc/hadoop/conf/core-site.xml"));
String tableName = "users";
Connection connection = ConnectionFactory.createConnection(config);
Table table = connection.getTable(TableName.valueOf(tableName));
List<Delete> deleteList = new ArrayList<Delete>();
for (int rowKey = 500; rowKey <= 900; rowKey++) {
deleteList.add(new Delete(Bytes.toBytes("user_" + rowKey)));
}
table.delete(deleteList);
I can't seem to update an existing record in my table using a strongly-typed dataset. I can add a new record, but if I make changes to an existing record it doesn't work.
Here is my code:
private void AddEmplMaster()
{
dsEmplMast dsEmpMst = new dsEmplMast();
SqlConnection cn = new SqlConnection();
cn.ConnectionString = System.Configuration.ConfigurationSettings.AppSettings["cn.ConnectionString"];
SqlDataAdapter da1 = new SqlDataAdapter("SELECT * FROM UPR00100", cn);
SqlCommandBuilder cb1 = new SqlCommandBuilder(da1);
da1.Fill(dsEmpMst.UPR00100);
DataTable dtMst = UpdateEmpMst(dsEmpMst);
da1.Update(dsEmpMst.UPR00100);
}
This procedure is called from above to assign the changed fields to a record:
private DataTable UpdateEmpMst(dsEmplMast dsEmpMst)
{
DataTable dtMst = new DataTable();
try
{
dsEmplMast.UPR00100Row empRow = dsEmpMst.UPR00100.NewUPR00100Row();
empRow.EMPLOYID = txtEmplId.Text.Trim();
empRow.LASTNAME = txtLastName.Text.Trim();
empRow.FRSTNAME = txtFirstName.Text.Trim();
empRow.MIDLNAME = txtMidName.Text.Trim();
empRow.ADRSCODE = "PRIMARY";
empRow.SOCSCNUM = txtSSN.Text.Trim();
empRow.DEPRTMNT = ddlDept.SelectedValue.Trim();
empRow.JOBTITLE = txtJobTitle.Text.Trim();
empRow.STRTDATE = DateTime.Today;
empRow.EMPLOYMENTTYPE = "1";
dsEmpMst.UPR00100.Rows.Add(empRow);
}
catch { }
return dtMst;
}
Thank you
UPDATE:
OK, I figured it out. In my UpdateEmpMst() procedure, I had to check whether the record exists and, if so, retrieve it first. If not, I create a new record to add. Here is what I added:
try
{
dsEmplMast.UPR00100Row empRow;
empRow = dsEmpMst.UPR00100.FindByEMPLOYID(txtEmplId.Text.Trim());
if (empRow == null)
{
empRow = dsEmpMst.UPR00100.NewUPR00100Row();
dsEmpMst.UPR00100.Rows.Add(empRow);
}
Then I assign my data to empRow (either the existing row or the new one I created), and it updates fine.
In order to edit an existing record in a dataset, you need to access a particular column of data in a particular row. The data in both typed and untyped datasets can be accessed via the following:
With the indices of the tables, rows, and columns collections.
By passing the table and column names as strings to their respective collections.
Although typed datasets can use the same syntax as untyped datasets, there are additional advantages to using typed datasets. For more information, see the "To update existing records using typed datasets" section below.
To update existing records in either typed or untyped datasets
Assign a value to a specific column within a DataRow object.
The table and column names of untyped datasets are not available at design time and must be accessed through their respective indices.
I'm having problems setting a row timestamp using the Java API.
When I pass a timestamp value to the Put constructor (or to put.add()), nothing happens, and after reading the rows from the table I get system-provided timestamps.
public static boolean addRecord(String tableName, String rowKey,
String family, String qualifier, Object value)
{
try {
HTable table = new HTable(conf, tableName);
Put put = new Put(Bytes.toBytes(rowKey), 12345678l);
put.add(Bytes.toBytes(family), Bytes.toBytes(qualifier), Bytes.toBytes(value.toString()));
table.put(put);
return true;
} catch (Exception e) {
e.printStackTrace();
return false;
}
}
HBase 0.92.1 running in standalone mode.
Thanks in advance for any help!
Most likely, the table already holds entries for this row with a timestamp > 12345678l. To check whether this is the case, try it with a very large value for the timestamp, say Long.MAX_VALUE.
If that is indeed the case, you can simply delete those previously written versions. Then this entry will show up.
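To illustrate why an explicit old timestamp can seem to "do nothing": HBase keeps multiple versions of a cell ordered by timestamp, and a default read returns the version with the highest timestamp. Below is a minimal plain-Java model of that behaviour. The VersionedCell class is hypothetical (not HBase API) and deliberately ignores tombstones and max-versions settings.

```java
import java.util.TreeMap;

class VersionedCell {
    // Versions keyed by timestamp; TreeMap keeps them sorted in ascending order.
    private final TreeMap<Long, String> versions = new TreeMap<>();

    /** Stores a value under an explicit timestamp, like a put with a user-supplied timestamp. */
    void put(long timestamp, String value) {
        versions.put(timestamp, value);
    }

    /** Returns the timestamp of the version a default read would see (the newest), or -1 if empty. */
    long latestTimestamp() {
        return versions.isEmpty() ? -1 : versions.lastKey();
    }

    /** Returns the value of the newest version, or null if the cell is empty. */
    String latestValue() {
        return versions.isEmpty() ? null : versions.lastEntry().getValue();
    }

    /** Removes one specific version. */
    void deleteVersion(long timestamp) {
        versions.remove(timestamp);
    }
}
```

In this model, a value put with timestamp 12345678 stays hidden behind any version written earlier with the system time, and becomes visible only once that newer-stamped version is removed.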