I am using Cassandra 3.3.0 to store some data in a ColumnFamily created like this:
CREATE COLUMNFAMILY inventory(objectid varint, timestamp varint,
numerator varint, data text, PRIMARY KEY ((objectid), timestamp,
numerator)) with clustering order by (timestamp desc, numerator asc);
After each new entry, I want to inform a third-party application about the new entry and its properties, but I cannot work out how to access the changed data inside the augment function.
public Collection<Mutation> augment(Partition update) {
    String key = null;
    String timestamp = null;
    String numerator = null;
    String data = null;
    try {
        UnfilteredRowIterator it = update.unfilteredIterator();
        CFMetaData cfMetaData = Schema.instance.getCFMetaData("keyspace", "inventory");
        while (it.hasNext()) {
            Unfiltered un = it.next();
            Clustering clt = (Clustering) un.clustering();
            Iterator<Cell> cls = update.getRow(clt).cells().iterator();
            while (cls.hasNext()) {
                Cell cell = cls.next();
                data = new String(cell.value().array());
            }
            un.toString(cfMetaData); // RETURNS timestamp, numerator and data as one string...
        }
    } catch (Exception e) {
        e.printStackTrace();
    }
    return null;
}
update.toString() returns all the information in one String, including the values I want, but I do not want to parse it. Does anybody know how to access the values of "objectid", "timestamp" and "numerator" directly?
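One possible direct route, sketched against the Cassandra 3.x internal API (CFMetaData, AbstractType, ColumnDefinition); this is untested and these internals shift between minor releases, so treat it only as a starting point:
CFMetaData metadata = update.metadata();
// partition key ("objectid"): a varint composes to a BigInteger
BigInteger objectid = (BigInteger) metadata.getKeyValidator()
        .compose(update.partitionKey().getKey());

UnfilteredRowIterator it = update.unfilteredIterator();
while (it.hasNext()) {
    Unfiltered un = it.next();
    if (un instanceof Row) {
        Row row = (Row) un;
        Clustering clt = row.clustering();
        // clustering columns ("timestamp", "numerator") by position
        List<ColumnDefinition> clusteringCols = metadata.clusteringColumns();
        BigInteger timestamp = (BigInteger) clusteringCols.get(0).type.compose(clt.get(0));
        BigInteger numerator = (BigInteger) clusteringCols.get(1).type.compose(clt.get(1));
        for (Cell cell : row.cells()) {
            // regular column ("data"): decode via the column's own type
            Object data = cell.column().type.compose(cell.value());
        }
    }
}
Here getKeyValidator() decodes the single-component partition key directly; a composite partition key would need CompositeType handling instead.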
Related
We created a program to make the use of the database easier in other programs, so the code I'm showing is used in multiple other programs.
One of those programs receives about 10,000 records from one of our clients and has to check whether these are already in our database. If not, we insert them (they can also change and then have to be updated).
To make this easy, we load all the entries from the whole table (currently 120,000), create an object for every entry we get, and put all of them into a HashMap.
Loading the whole table this way takes around 5 minutes. We also sometimes have to restart the program because we run into a "GC overhead limit exceeded" error, since we work on limited hardware. Do you have an idea how we can improve the performance?
Here is the code to load all entries (we have a global limit of 10,000 entries per query, so we use a loop):
public Map<String, IMasterDataSet> getAllInformationObjects(ISession session) throws MasterDataException {
    IQueryExpression qe;
    IQueryParameter qp;
    // our main SDP class
    Constructor<?> constructorForSDPbaseClass = getStandardConstructor();
    SimpleDateFormat itaTimestampFormat = new SimpleDateFormat("yyyyMMddHHmmssSSS");
    // search in standard time range (modification date!)
    Calendar cal = Calendar.getInstance();
    cal.set(2010, Calendar.JANUARY, 1);
    Date startDate = cal.getTime();
    Date endDate = new Date();
    Long startDateL = Long.parseLong(itaTimestampFormat.format(startDate));
    Long endDateL = Long.parseLong(itaTimestampFormat.format(endDate));
    IDescriptor modDesc = IBVRIDescriptor.ModificationDate.getDescriptor(session);
    // count once before to determine initial capacities for hash map/set
    IBVRIArchiveClass SDP_ARCHIVECLASS = getMasterDataPropertyBag().getSDP_ARCHIVECLASS();
    qe = SDP_ARCHIVECLASS.getQueryExpression(session);
    qp = session.getDocumentServer().getClassFactory()
            .getQueryParameterInstance(session, new String[] {SDP_ARCHIVECLASS.getDatabaseName(session)}, null, null);
    qp.setExpression(qe);
    qp.setHitLimitThreshold(0);
    qp.setHitLimit(0);
    int nrOfHitsTotal = session.getDocumentServer().queryCount(session, qp, "*");
    int initialCapacity = (int) (nrOfHitsTotal / 0.75 + 1);
    // MD sets; and objects already done (here: document ID)
    HashSet<String> objDone = new HashSet<>(initialCapacity);
    HashMap<String, IMasterDataSet> objRes = new HashMap<>(initialCapacity);
    qp.close();
    // do queries until hit count is smaller than 10,000
    // use modification date
    boolean keepGoing = true;
    while (keepGoing) {
        // construct query expression
        // - basic part: modification date & class type
        // a. doc. class type
        qe = SDP_ARCHIVECLASS.getQueryExpression(session);
        // b. ID
        qe = SearchUtil.appendQueryExpressionWithANDoperator(session, qe,
                new PlainExpression(modDesc.getQueryLiteral() + " BETWEEN " + startDateL + " AND " + endDateL));
        // 2. query parameter: set database; set expression
        qp = session.getDocumentServer().getClassFactory()
                .getQueryParameterInstance(session, new String[] {SDP_ARCHIVECLASS.getDatabaseName(session)}, null, null);
        qp.setExpression(qe);
        // order by modification date; hitlimit = 0 -> no hit limit, but the usual 10,000 max
        qp.setOrderByExpression(session.getDocumentServer().getClassFactory().getOrderByExpressionInstance(modDesc, true));
        qp.setHitLimitThreshold(0);
        qp.setHitLimit(0);
        // suppress the default order-by; we sort explicitly above
        qp.setHints("+NoDefaultOrderBy");
        keepGoing = false;
        IInformationObject[] hits = null;
        IDocumentHitList hitList = null;
        hitList = session.getDocumentServer().query(qp, session);
        IDocument doc;
        if (hitList.getTotalHitCount() > 0) {
            hits = hitList.getInformationObjects();
            for (IInformationObject hit : hits) {
                String objID = hit.getID();
                if (!objDone.contains(objID)) {
                    // do something with this object and the class
                    // here: construct a new SDP subclass object and give it back via the interface
                    doc = (IDocument) hit;
                    IMasterDataSet mdSet;
                    try {
                        mdSet = (IMasterDataSet) constructorForSDPbaseClass.newInstance(session, doc);
                    } catch (Exception e) {
                        // cause for this
                        String cause = (e.getCause() != null) ? e.getCause().toString() : MasterDataException.ERRMSG_PART_UNKNOWN;
                        throw new MasterDataException(MasterDataException.ERRMSG_NOINSTANCE_POSSIBLE, this.getClass().getSimpleName(), e.toString(), cause);
                    }
                    objRes.put(mdSet.getID(), mdSet);
                    objDone.add(objID);
                }
            }
            doc = (IDocument) hits[hits.length - 1];
            Date lastModDate = ((IDateValue) doc.getDescriptor(modDesc).getValues()[0]).getValue();
            startDateL = Long.parseLong(itaTimestampFormat.format(lastModDate));
            keepGoing = (hits.length >= 10000 || hitList.isResultSetTruncated());
        }
        qp.close();
    }
    return objRes;
}
Loading 120,000 rows (and more) each time will not scale well, and your solution may stop working as the record count grows. Instead, let the database server handle the problem.
Your table needs a primary key or unique key based on the columns of the records. Iterate through the 10,000 records, performing a JDBC SQL update to modify all field values, with a where clause that exactly matches the primary/unique key.
update BLAH set COL1 = ?, COL2 = ? where PKCOL = ?; -- ... AND PKCOL2 = ? ...
This modifies an existing row or does nothing at all, and JDBC executeUpdate() returns 0 or 1, indicating the number of rows changed. If the number of rows changed was zero, you have detected a new record that does not exist, so perform an insert for that new record only.
insert into BLAH (COL1, COL2, ... PKCOL) values (?,?, ..., ?);
You can decide whether to run 10,000 updates followed by however many inserts are needed, or to do update + optional insert per record. Remember that JDBC batch statements and turning auto-commit off may help speed things up.
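A rough sketch of that approach in plain JDBC; the table/column names (BLAH, COL1, COL2, PKCOL) come from the snippets above, and the Record class with its accessors is a placeholder for however you model the client rows:
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.util.List;

public void upsertAll(Connection con, List<Record> records) throws SQLException {
    con.setAutoCommit(false);
    try (PreparedStatement upd = con.prepareStatement(
             "update BLAH set COL1 = ?, COL2 = ? where PKCOL = ?");
         PreparedStatement ins = con.prepareStatement(
             "insert into BLAH (COL1, COL2, PKCOL) values (?, ?, ?)")) {
        for (Record r : records) {
            upd.setString(1, r.getCol1());
            upd.setString(2, r.getCol2());
            upd.setString(3, r.getKey());
            if (upd.executeUpdate() == 0) {   // 0 rows changed -> record is new
                ins.setString(1, r.getCol1());
                ins.setString(2, r.getCol2());
                ins.setString(3, r.getKey());
                ins.addBatch();               // batch the inserts
            }
        }
        ins.executeBatch();
        con.commit();
    } catch (SQLException e) {
        con.rollback();
        throw e;
    }
}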
I currently have stored procedures for Oracle SQL, version 18c, for both inserting and fetching multiple rows of data from one parent table and one child table, called from my Java Spring Boot application. Everything works, but it is extremely slow for only a few rows of data.
When inserting only 70 records between the two tables, it takes up to 267 seconds into empty tables. Fetching that same data back out takes about 40 seconds.
Any help would be greatly appreciated; let me know if you need any additional information from me.
Below is a cut-down and renamed version of my stored procedures for the parent and child tables; the actual parent table has 32 columns and the child has 11.
PROCEDURE processParentData(
    i_field_one varchar2,
    v_parent_id OUT number) IS
    v_new PARENT%ROWTYPE;
BEGIN
    v_new.id := ROW_SEQUENCE.nextval;
    v_new.insert_time := systimestamp;
    v_new.field_one := i_field_one;
    insert into PARENT values v_new;
    v_parent_id := v_new.id;
END;

PROCEDURE readParentData(
    i_field_one IN varchar2,
    v_parent OUT SYS_REFCURSOR) AS
BEGIN
    OPEN v_parent FOR select h.* from PARENT h
        where h.field_one = i_field_one;
END;

PROCEDURE processChild(
    i_field_one varchar2,
    i_parent_id number) IS
    v_new CHILD%ROWTYPE;
BEGIN
    v_new.id := ROW_SEQUENCE.nextval;
    v_new.insert_time := systimestamp;
    v_new.field_one := i_field_one;
    v_new.parent_id := i_parent_id;
    insert into CHILD values v_new;
END;

PROCEDURE readChild(
    i_parent_id IN number,
    v_child OUT SYS_REFCURSOR) AS
BEGIN
    OPEN v_child FOR select h.* from CHILD h
        where h.parent_id = i_parent_id;
END;
For my Java code I am using Spring JDBC. After I get the parent data, I fetch the child data by looping through the parents and calling readChild with each parent ID.
var simpleJdbcCall = new SimpleJdbcCall(jdbcTemplate)
        .withCatalogName("PARENT_PACKAGE")
        .withProcedureName("processParentData");
SqlParameterSource sqlParameterSource = new MapSqlParameterSource()
        .addValue("i_field_one", locationId)
        .addValue("v_parent_id", null);
Map<String, Object> out = simpleJdbcCall.execute(sqlParameterSource);
var stopId = (BigDecimal) out.get("v_parent_id");
return stopId.longValue();

var simpleJdbcCall = new SimpleJdbcCall(jdbcTemplate)
        .withCatalogName("PARENT_PACKAGE")
        .withProcedureName("readParentData")
        .returningResultSet("v_parent", BeanPropertyRowMapper.newInstance(Parent.class));
SqlParameterSource sqlParameterSource = new MapSqlParameterSource()
        .addValue("i_field_one", location.getId());
Map<String, Object> out = simpleJdbcCall.execute(sqlParameterSource);
return (List<Parent>) out.get("v_parent");
UPDATE 1: I have tested, using the same data and tables, that if I use plain JDBC or JPA/Hibernate to insert and fetch against the tables directly, avoiding stored procedures, the whole process of inserting and fetching takes only a few seconds.
The issue is that the company I work at has set a policy that all applications going forward may not have direct read/write access to the database; everything must be done through stored procedures, they say for security reasons. That means I need to work out how to do what we have been doing for years with direct read/write access, now using only Oracle stored procedures.
UPDATE 2: Adding my current Java code for fetching the child data.
for (Parent parent : parents) {
    parent.setChilds(childRepository.readChildByParentId(parent.getId()));
}

public List<Child> readChildByParentId(long parentId) {
    var simpleJdbcCall = new SimpleJdbcCall(jdbcTemplate)
            .withCatalogName("CHILD_PACKAGE")
            .withProcedureName("readChild")
            .returningResultSet("v_child", BeanPropertyRowMapper.newInstance(Child.class));
    SqlParameterSource sqlParameterSource = new MapSqlParameterSource()
            .addValue("i_parent_id", parentId);
    Map<String, Object> out = simpleJdbcCall.execute(sqlParameterSource);
    return (List<Child>) out.get("v_child");
}
The problem is that the insert you are performing through the stored procedure is not optimized, because you call the database once for every row you insert.
I strongly recommend transforming the data to XML (you could also use CSV, for example) and passing it to the procedure, then looping over it there and performing the inserts you need.
Here is an example made using Oracle:
CREATE OR REPLACE PROCEDURE MY_SCHEMA.my_procedure(xmlData clob) IS
BEGIN
    FOR CONTACT IN (SELECT *
        FROM XMLTABLE(
            '/CONTACTS/CONTACT' PASSING
            XMLTYPE(xmlData)
            COLUMNS param_id FOR ORDINALITY
                , id NUMBER PATH 'ID'
                , name VARCHAR2(100) PATH 'NAME'
                , surname VARCHAR2(100) PATH 'SURNAME'
        ))
    LOOP
        INSERT INTO PARENT_TABLE VALUES (CONTACT.id, CONTACT.name, CONTACT.surname);
    END LOOP;
END;
For the XML, you can use a String to pass the data to the procedure:
<CONTACTS>
    <CONTACT>
        <ID>1</ID>
        <NAME>Jonh</NAME>
        <SURNAME>Smith</SURNAME>
    </CONTACT>
</CONTACTS>
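For completeness, a sketch of invoking that procedure from Spring JDBC; the schema and procedure names match the example above, and the inline XML string is just the sample data:
// pass the whole batch as one CLOB parameter -> a single database call
var call = new SimpleJdbcCall(jdbcTemplate)
        .withSchemaName("MY_SCHEMA")
        .withProcedureName("my_procedure");
String xml = "<CONTACTS><CONTACT><ID>1</ID><NAME>Jonh</NAME>"
        + "<SURNAME>Smith</SURNAME></CONTACT></CONTACTS>";
call.execute(new MapSqlParameterSource().addValue("xmlData", xml));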
For my Java code I am using Spring JDBC. After I get the parent data, I fetch the child data by looping through the parents and calling readChild with each parent ID.
Instead of fetching the child data in a loop, you can modify your procedure to accept a list of parent IDs and return all the data in one call.
It would be helpful if you shared your Spring Boot for-loop code as well.
Update
Instead of fetching the details of a single parent at a time, you should update your code like this. You also have to update your procedure accordingly.
List<Long> parentIds = new ArrayList<>();
for (Parent parent : parents) {
    parentIds.add(parent.getId());
}
You could use Java streams, but that is secondary.
Now you have to modify your procedure and your method to accept multiple parent IDs.
List<Child> children = childRepository.readChildByParentId(parentIds);
public List<Child> readChildByParentId(List<Long> parentIds) {
    var simpleJdbcCall = new SimpleJdbcCall(jdbcTemplate)
            .withCatalogName("CHILD_PACKAGE")
            .withProcedureName("readChild")
            .returningResultSet("v_child", BeanPropertyRowMapper.newInstance(Child.class));
    // NOTE: readChild must be changed on the Oracle side to accept a collection of IDs
    SqlParameterSource sqlParameterSource = new MapSqlParameterSource()
            .addValue("i_parent_ids", parentIds);
    Map<String, Object> out = simpleJdbcCall.execute(sqlParameterSource);
    return (List<Child>) out.get("v_child");
}
After you have all the children, you can assign them to their parents in Java code.
P.S.
Could you also check whether, when you fetch the parents, the children are already populated by the database?
Your performance problems are probably related to the number of operations performed against the database: you are iterating over your collections in Java and interacting with the database in every iteration. You need to minimize the number of round trips.
One possible solution is to use the standard STRUCT and ARRAY Oracle types. Consider, for instance, the following example:
public static void insertData() throws SQLException {
    DriverManagerDataSource dataSource = ...
    JdbcTemplate jdbcTemplate = new JdbcTemplate(dataSource);
    jdbcTemplate.setResultsMapCaseInsensitive(true);
    SimpleJdbcCall insertDataCall = new SimpleJdbcCall(jdbcTemplate)
            .withCatalogName("parent_child_pkg")
            .withProcedureName("insert_data")
            .withoutProcedureColumnMetaDataAccess()
            .useInParameterNames("p_parents")
            .declareParameters(
                    new SqlParameter("p_parents", OracleTypes.ARRAY, "PARENT_ARRAY")
            );
    OracleConnection connection = null;
    try {
        connection = insertDataCall
                .getJdbcTemplate()
                .getDataSource()
                .getConnection()
                .unwrap(OracleConnection.class);
        List<Parent> parents = new ArrayList<>(100);
        Parent parent = null;
        List<Child> chilren = null;
        Child child = null;
        for (int i = 0; i < 100; i++) {
            parent = new Parent();
            parents.add(parent);
            parent.setId((long) i);
            parent.setName("parent-" + i);
            chilren = new ArrayList<>(1000);
            parent.setChildren(chilren);
            for (int j = 0; j < 1000; j++) {
                child = new Child();
                chilren.add(child);
                child.setId((long) j);
                child.setName("child-" + j);
            }
        }
        System.out.println("Inserting data...");
        StopWatch stopWatch = new StopWatch();
        stopWatch.start("insert-data");
        StructDescriptor parentTypeStructDescriptor = StructDescriptor.createDescriptor("PARENT_TYPE", connection);
        ArrayDescriptor parentArrayDescriptor = ArrayDescriptor.createDescriptor("PARENT_ARRAY", connection);
        StructDescriptor childTypeStructDescriptor = StructDescriptor.createDescriptor("CHILD_TYPE", connection);
        ArrayDescriptor childArrayDescriptor = ArrayDescriptor.createDescriptor("CHILD_ARRAY", connection);
        Object[] parentArray = new Object[parents.size()];
        int pi = 0;
        for (Parent p : parents) {
            List<Child> children = p.getChildren();
            Object[] childArray = new Object[children.size()];
            int ci = 0;
            for (Child c : children) {
                Object[] childrenObj = new Object[2];
                childrenObj[0] = c.getId();
                childrenObj[1] = c.getName();
                STRUCT childStruct = new STRUCT(childTypeStructDescriptor, connection, childrenObj);
                childArray[ci++] = childStruct;
            }
            ARRAY childrenARRAY = new ARRAY(childArrayDescriptor, connection, childArray);
            Object[] parentObj = new Object[3];
            parentObj[0] = p.getId();
            parentObj[1] = p.getName();
            parentObj[2] = childrenARRAY;
            STRUCT parentStruct = new STRUCT(parentTypeStructDescriptor, connection, parentObj);
            parentArray[pi++] = parentStruct;
        }
        ARRAY parentARRAY = new ARRAY(parentArrayDescriptor, connection, parentArray);
        Map in = Collections.singletonMap("p_parents", parentARRAY);
        insertDataCall.execute(in);
        connection.commit();
        stopWatch.stop();
        System.out.println(stopWatch.prettyPrint());
    } catch (Throwable t) {
        t.printStackTrace();
        connection.rollback();
    } finally {
        if (connection != null) {
            try {
                connection.close();
            } catch (Throwable nested) {
                nested.printStackTrace();
            }
        }
    }
}
Where:
CREATE OR REPLACE TYPE child_type AS OBJECT (
id NUMBER,
name VARCHAR2(512)
);
CREATE OR REPLACE TYPE child_array
AS TABLE OF child_type;
CREATE OR REPLACE TYPE parent_type AS OBJECT (
id NUMBER,
name VARCHAR2(512),
children child_array
);
CREATE OR REPLACE TYPE parent_array
AS TABLE OF parent_type;
CREATE SEQUENCE PARENT_SEQ INCREMENT BY 1 MINVALUE 1;
CREATE SEQUENCE CHILD_SEQ INCREMENT BY 1 MINVALUE 1;
CREATE TABLE parent_table (
id NUMBER,
name VARCHAR2(512)
);
CREATE TABLE child_table (
id NUMBER,
name VARCHAR2(512),
parent_id NUMBER
);
CREATE OR REPLACE PACKAGE parent_child_pkg AS
PROCEDURE insert_data(p_parents PARENT_ARRAY);
END;
CREATE OR REPLACE PACKAGE BODY parent_child_pkg AS
    PROCEDURE insert_data(p_parents PARENT_ARRAY) IS
        l_parent_id NUMBER;
        l_child_id NUMBER;
    BEGIN
        FOR i IN 1..p_parents.COUNT LOOP
            SELECT parent_seq.nextval INTO l_parent_id FROM dual;
            INSERT INTO parent_table(id, name)
            VALUES(l_parent_id, p_parents(i).name);
            FOR j IN 1..p_parents(i).children.COUNT LOOP
                SELECT child_seq.nextval INTO l_child_id FROM dual;
                INSERT INTO child_table(id, name, parent_id)
                VALUES(l_child_id, p_parents(i).children(j).name, l_parent_id);
            END LOOP;
        END LOOP;
    END;
END;
And Parent and Child are simple POJOs:
import java.util.ArrayList;
import java.util.List;

public class Parent {
    private Long id;
    private String name;
    private List<Child> children = new ArrayList<>();

    public Long getId() {
        return id;
    }
    public void setId(Long id) {
        this.id = id;
    }
    public String getName() {
        return name;
    }
    public void setName(String name) {
        this.name = name;
    }
    public List<Child> getChildren() {
        return children;
    }
    public void setChildren(List<Child> children) {
        this.children = children;
    }
}

public class Child {
    private Long id;
    private String name;

    public Long getId() {
        return id;
    }
    public void setId(Long id) {
        this.id = id;
    }
    public String getName() {
        return name;
    }
    public void setName(String name) {
        this.name = name;
    }
}
Please forgive the code legibility and incorrect error handling; I will improve the answer later, including some information about reading the data back as well.
The times you mention are indeed horrible. A big performance boost will come from working set-based, which means reducing the row-by-row database calls.
Row by row is synonymous with slow, especially when network round trips are involved.
One call to get the parent.
One call to get the set of children and process them. The JDBC fetch size is a nice tunable here; give it a chance to work for you.
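For example, with Spring's JdbcTemplate (assuming that is what sits under your repository), the fetch size is a one-liner; the Oracle JDBC driver's default is only 10 rows per network round trip:
// fetch result rows in chunks of 500 per round trip instead of the default 10
jdbcTemplate.setFetchSize(500);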
You do not need to use dynamic SQL in OPEN v_parent FOR, and it is not clear how the query behind the v_parent cursor performs.
Try checking the execution plan of this query:
select h.* from PARENT h where h.field_one = ?;
Usually, returning a result set via SYS_REFCURSOR only improves performance when you return more than, let's say, 10K records.
The SimpleJdbcCall object can be reused in your scenario, as only the parameters change. SimpleJdbcCall compiles the JDBC statement on the first invocation; it fetches some metadata and interacts with the database to do so. Having separate objects means fetching the same metadata again each time, which is not needed.
So I suggest initialising all four SimpleJdbcCall objects at the very beginning and then working with them:
var insertParentJdbcCall = new SimpleJdbcCall(jdbcTemplate)
        .withCatalogName("PARENT_PACKAGE")
        .withProcedureName("processParentData");
var readParentJdbcCall = new SimpleJdbcCall(jdbcTemplate)
        .withCatalogName("PARENT_PACKAGE")
        .withProcedureName("readParentData")
        .returningResultSet("v_parent", BeanPropertyRowMapper.newInstance(Parent.class));
var insertChildJdbcCall = new SimpleJdbcCall(jdbcTemplate)
        .withCatalogName("CHILD_PACKAGE")
        .withProcedureName("processChild");
var readChildJdbcCall = new SimpleJdbcCall(jdbcTemplate)
        .withCatalogName("CHILD_PACKAGE")
        .withProcedureName("readChild")
        .returningResultSet("v_child", BeanPropertyRowMapper.newInstance(Child.class));
I have the following rows, with these keys, in the HBase table "mytable":
user_1
user_2
user_3
...
user_9999999
I want to use the HBase shell to delete rows from:
user_500 to user_900
I know there is no built-in way to delete a range, but is there a way I could use the "BulkDeleteProcessor" to do this?
I see here:
https://github.com/apache/hbase/blob/master/hbase-examples/src/test/java/org/apache/hadoop/hbase/coprocessor/example/TestBulkDeleteProtocol.java
I want to just paste in the imports and then paste this into the shell, but have no idea how to go about it. Does anyone know how I can use this endpoint from the JRuby HBase shell?
Table ht = TEST_UTIL.getConnection().getTable(TableName.valueOf("my_table"));
long noOfDeletedRows = 0L;
Batch.Call<BulkDeleteService, BulkDeleteResponse> callable =
        new Batch.Call<BulkDeleteService, BulkDeleteResponse>() {
    ServerRpcController controller = new ServerRpcController();
    BlockingRpcCallback<BulkDeleteResponse> rpcCallback =
            new BlockingRpcCallback<BulkDeleteResponse>();

    public BulkDeleteResponse call(BulkDeleteService service) throws IOException {
        Builder builder = BulkDeleteRequest.newBuilder();
        builder.setScan(ProtobufUtil.toScan(scan));
        builder.setDeleteType(deleteType);
        builder.setRowBatchSize(rowBatchSize);
        if (timeStamp != null) {
            builder.setTimestamp(timeStamp);
        }
        service.delete(controller, builder.build(), rpcCallback);
        return rpcCallback.get();
    }
};
Map<byte[], BulkDeleteResponse> result = ht.coprocessorService(BulkDeleteService.class,
        scan.getStartRow(), scan.getStopRow(), callable);
for (BulkDeleteResponse response : result.values()) {
    noOfDeletedRows += response.getRowsDeleted();
}
ht.close();
If there is no way to do this through JRuby, a Java or other way to quickly delete multiple rows is fine.
Do you really want to do it in the shell? There are various better ways. One is using the native Java API:
Construct an ArrayList of Delete objects.
Pass this list to the Table.delete method.
Method 1: if you already know the range of keys.
public void massDelete(byte[] tableName) throws IOException {
    HTable table = (HTable) hbasePool.getTable(tableName);
    String tablePrefix = "user_";
    int startRange = 500;
    int endRange = 999;
    List<Delete> listOfBatchDelete = new ArrayList<Delete>();
    for (int i = startRange; i <= endRange; i++) {
        String key = tablePrefix + i;
        Delete d = new Delete(Bytes.toBytes(key));
        listOfBatchDelete.add(d);
    }
    try {
        table.delete(listOfBatchDelete);
    } finally {
        if (hbasePool != null && table != null) {
            hbasePool.putTable(table);
        }
    }
}
Method 2: If you want to do a batch delete on the basis of a scan result.
public void bulkDelete(final HTable table) throws IOException {
    Scan s = new Scan();
    List<Delete> listOfBatchDelete = new ArrayList<Delete>();
    // set your filter on the scan (Scan has setFilter; yourFilter is a placeholder)
    s.setFilter(yourFilter);
    ResultScanner scanner = table.getScanner(s);
    for (Result rr : scanner) {
        Delete d = new Delete(rr.getRow());
        listOfBatchDelete.add(d);
    }
    try {
        table.delete(listOfBatchDelete);
    } catch (Exception e) {
        LOGGER.log(e);
    }
}
Now, coming down to using a coprocessor: one piece of advice, DON'T USE a coprocessor unless you are an expert in HBase.
Coprocessors have many built-in issues; if you need, I can provide a detailed description.
Secondly, when you delete anything from HBase it is never deleted immediately; a tombstone marker is attached to the record, and it is physically removed later during a major compaction. So there is no need to use a coprocessor, which is highly resource-intensive.
Modified code to support batch operation.
int batchSize = 50;
List<Delete> listOfBatchDelete = new ArrayList<Delete>(batchSize);
for (int i = startRange; i <= endRange; i++) {
    String key = tablePrefix + i;
    listOfBatchDelete.add(new Delete(Bytes.toBytes(key)));
    if (listOfBatchDelete.size() == batchSize) {
        table.delete(listOfBatchDelete);
        listOfBatchDelete.clear();
    }
}
if (!listOfBatchDelete.isEmpty()) {
    table.delete(listOfBatchDelete); // flush the last partial batch
}
Creating HBase conf and getting table instance.
Configuration hConf = HBaseConfiguration.create(conf);
hConf.set("hbase.zookeeper.quorum", "Zookeeper IP");
hConf.set("hbase.zookeeper.property.clientPort", ZookeeperPort);
HTable hTable = new HTable(hConf, tableName);
If you already know the row keys of the records you want to delete from the HBase table, you can use the following approach.
1. First, create a List of Delete objects with these row keys:
for (int rowKey = 1; rowKey <= 10; rowKey++) {
    deleteList.add(new Delete(Bytes.toBytes(rowKey + "")));
}
2. Then get the Table object by using an HBase Connection:
Table table = connection.getTable(TableName.valueOf(tableName));
3. Once you have the Table object, call delete() by passing the list:
table.delete(deleteList);
The complete code will look like this:
Configuration config = HBaseConfiguration.create();
config.addResource(new Path("/etc/hbase/conf/hbase-site.xml"));
config.addResource(new Path("/etc/hadoop/conf/core-site.xml"));
String tableName = "users";
Connection connection = ConnectionFactory.createConnection(config);
Table table = connection.getTable(TableName.valueOf(tableName));
List<Delete> deleteList = new ArrayList<Delete>();
for (int rowKey = 500; rowKey <= 900; rowKey++) {
deleteList.add(new Delete(Bytes.toBytes("user_" + rowKey)));
}
table.delete(deleteList);
I have an HBase table (accessed from Java) and I want to query the table by a list of keys. I did the following, but it's not working.
mFilterFeatureIt = mFeatureSet.iterator();
FilterList filterList = new FilterList(FilterList.Operator.MUST_PASS_ONE);
while (mFilterFeatureIt.hasNext()) {
    long myfeatureId = mFilterFeatureIt.next();
    System.out.println("FeatureId:" + myfeatureId + " , ");
    RowFilter filter = new RowFilter(CompareOp.EQUAL, new BinaryComparator(Bytes.toBytes(myfeatureId)));
    filterList.addFilter(filter);
}
outputMap = HbaseUtils.getHbaseData("mytable", filterList);
System.out.println("Size of outputMap map:" + outputMap.size());
public static Map<String, Map<String, String>> getHbaseData(String table, FilterList filter) {
    Map<String, Map<String, String>> data = new HashMap<String, Map<String, String>>();
    HTable htable = null;
    try {
        htable = new HTable(HTableConfiguration.getHTableConfiguration(), table);
        Scan scan = new Scan();
        scan.setFilter(filter);
        ResultScanner resultScanner = htable.getScanner(scan);
        Iterator<Result> results = resultScanner.iterator();
        while (results.hasNext()) {
            Result result = results.next();
            String rowId = Bytes.toString(result.getRow());
            List<KeyValue> columns = result.list();
            if (null != columns) {
                HashMap<String, String> colData = new HashMap<String, String>();
                for (KeyValue column : columns) {
                    colData.put(Bytes.toString(column.getFamily()) + ":" + Bytes.toString(column.getQualifier()), Bytes.toString(column.getValue()));
                }
                data.put(rowId, colData);
            }
        }
    } catch (IOException e) {
        e.printStackTrace();
    } finally {
        if (htable != null) {
            try {
                htable.close();
            } catch (IOException e) {
                e.printStackTrace();
            }
        }
    }
    return data;
}
FeatureId:80515900 ,
FeatureId:80515901 ,
FeatureId:80515902 ,
Size of outputMap map: 0
I can see that the value of the feature ID is what I want, but I always get the above output even when the key is present in the HBase table. Can anyone tell me what I am doing wrong?
EDIT:
I posted the code for my HBase util method above too, so that you can point out any bugs there.
I am trying to do the SQL equivalent of select * from mytable where featureId in (80515900, 80515901, 80515902). My idea to achieve this in HBase was to create a filter list with one filter for each featureId. Is that correct?
Here is the content of my table
scan 'mytable', {COLUMNS => ['sample:tag_count'] }
80515900 column=sample:tag_count, timestamp=1339304052748, value=4
80515901 column=sample:tag_count, timestamp=1339304052748, value=0
80515902 column=sample:tag_count, timestamp=1339304052748, value=3
80515903 column=sample:tag_count, timestamp=1339304052748, value=1
80515904 column=sample:tag_count, timestamp=1339304052748, value=2
It's not returning any data because, when the data was inserted into HBase, the row key was written as a String (you can see this in your scan result), while the value you pass to the RowFilter is a long. Use this filter instead:
RowFilter filter = new RowFilter(CompareOp.EQUAL,
        new BinaryComparator(Bytes.toBytes(String.valueOf(myfeatureId))));
The while loop always generates a new filter and adds it to the filter list, so the list ends up containing all the keys. Such a filter can never match a single row. Create only one filter in the while loop, pointing to one known myfeatureId:
while (mFilterFeatureIt.hasNext()) {
    long myfeatureId = mFilterFeatureIt.next();
    System.out.println("FeatureId:" + myfeatureId + " , ");
    if (myfeatureId == 80515902L) {
        RowFilter filter = new RowFilter(CompareOp.EQUAL, new BinaryComparator(Bytes.toBytes(myfeatureId)));
        filterList.addFilter(filter);
    }
}
EDIT
The query, not HBase, is responsible for how many rows come back.
HBase filters
Filters push row selection criteria out to the HBase servers. Rows can be filtered remotely and in parallel. Using them helps you avoid sending rows to the client that are not needed.
To match on part of the key and get everything from 80515900 to 80515909, try this: remove these lines from the loop
RowFilter filter = new RowFilter(CompareOp.EQUAL, new BinaryComparator(Bytes.toBytes(myfeatureId)));
filterList.addFilter(filter);
and add the following above the line outputMap = HbaseUtils.getHbaseData("mytable", filterList);
...
RowFilter filter = new RowFilter(CompareOp.EQUAL, new SubstringComparator("8051590"));
filterList.addFilter(filter);
outputMap = HbaseUtils.getHbaseData("mytable", filterList);
I'm having problems setting the row timestamp using the Java API.
When I try to pass a timestamp value to the Put constructor (or to put.add()), nothing happens, and after reading rows from the table I get system-provided timestamps.
public static boolean addRecord(String tableName, String rowKey,
        String family, String qualifier, Object value) {
    try {
        HTable table = new HTable(conf, tableName);
        Put put = new Put(Bytes.toBytes(rowKey), 12345678L);
        put.add(Bytes.toBytes(family), Bytes.toBytes(qualifier), Bytes.toBytes(value.toString()));
        table.put(put);
        return true;
    } catch (Exception e) {
        e.printStackTrace();
        return false;
    }
}
HBase 0.92.1 running in standalone mode.
Thanks in advance for any help!
Most likely, you already have rows in the table with a timestamp > 12345678. To confirm whether this is the case, try it with a very large timestamp value, say Long.MAX_VALUE.
If that is indeed the case, you can simply delete the older versions; then this entry will show up.
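For example, a quick check with the addRecord code above (Long.MAX_VALUE is newer than any existing cell, so this put must become visible on read):
// if this put shows up on read but the 12345678L one did not,
// existing cells with larger timestamps were masking the earlier write
Put put = new Put(Bytes.toBytes(rowKey), Long.MAX_VALUE);
put.add(Bytes.toBytes(family), Bytes.toBytes(qualifier), Bytes.toBytes("test"));
table.put(put);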