Accumulo delete a row item when row ids are the same

I am trying to figure out a way to delete a specific entry from a table when it has the same row id as a couple of other entries in Accumulo. This is how I have my table set up:
m0 : property : name -> erp
m0 : property : age -> 23
m0 : purchase : food -> 5.00
m0 : purchase : gas -> 24.00
m0 : purchase : beer -> 15.00
Say I want to delete gas from the table. I know I could use connection.tableOperations().deleteRows(table, start, stop) but if I pass in the row id of m0 - 1 and m0 to the function it is going to delete all of these entries. Can I do a delete where colFam = something and colQual = something? I didn't see anything in the API to support this but frankenstein code is cool also :)

Yes, it is possible. I was still thinking of rows and columns in a SQL mindset. What I actually wanted was to delete a column rather than a row, and for that you just write another mutation. For example:
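Text rowId = new Text("m0");
Text colFam = new Text("purchase");
Text colQual = new Text("gas");
// A delete is written as a mutation on the row, targeting a single column
Mutation mut = new Mutation(rowId);
mut.putDelete(colFam, colQual);
// Batch writer with default BatchWriterConfig settings
writer = connection.createBatchWriter(tableName, new BatchWriterConfig());
try {
    writer.addMutation(mut);
} catch (MutationsRejectedException e) {
    ...
}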
Works perfect :)

Related

Managing java list object and iterating them

I have a list which is a java object like below.
public class TaxIdentifier {
public String id;
public String gender;
public String childId;
public String grade;
public String isProcessed;
////...///
getters and setters
///....///
}
Records in DB looks like below,
id gender childId grader isProcessed
11 M 111 3 Y
12 M 121 4 Y
11 M 131 2 Y
13 M 141 5 Y
14 M 151 1 Y
15 M 161 6 Y
List<TaxIdentifier> taxIdentifierList = new ArrayList<TaxIdentifier>();
for (TaxIdentifier taxIdentifier : taxIdentifierList) {
}
While the for loop is running and I get id = 11, I have to check whether there are other records with id = 11, process them together, and do a DB operation. Then I take the next record, in this case 12, check whether there are other records with id = 12, and so on.
One option is to take the id and query the DB for all records with id = 11, and so on.
But this is too much back and forth with the DB.
What is the best way to do this in Java? Please advise.

If you need to process all the records in the corresponding database table anyway, you should retrieve all of them in one database round trip.
After that, you can collect all your TaxIdentifier records in a dictionary data structure and process them in whatever way you want.
A brief example may look like this:
Map<String, List<TaxIdentifier>> result = repository.findAll().stream().collect(Collectors.groupingBy(TaxIdentifier::getId));
Here all the TaxIdentifier records are grouped by TaxIdentifier's id, so all the records with id equal to "11" can be retrieved and processed this way:
List<TaxIdentifier> taxIdentifiersWithId11 = result.get("11");
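To make the grouping idea concrete, here is a minimal self-contained sketch. The in-memory list stands in for whatever the repository call returns, and the trimmed-down TaxIdentifier class and the groupById helper are assumptions made only for this illustration.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

class GroupByIdSketch {

    // Trimmed-down stand-in for the TaxIdentifier class from the question.
    static class TaxIdentifier {
        final String id;
        final String childId;

        TaxIdentifier(String id, String childId) {
            this.id = id;
            this.childId = childId;
        }

        String getId() { return id; }
        String getChildId() { return childId; }
    }

    // Groups records by id; LinkedHashMap keeps the ids in first-seen order.
    static Map<String, List<TaxIdentifier>> groupById(List<TaxIdentifier> records) {
        return records.stream().collect(
                Collectors.groupingBy(TaxIdentifier::getId, LinkedHashMap::new, Collectors.toList()));
    }

    public static void main(String[] args) {
        // Hypothetical data; in the real application this comes from the single DB query.
        List<TaxIdentifier> records = new ArrayList<>();
        records.add(new TaxIdentifier("11", "111"));
        records.add(new TaxIdentifier("12", "121"));
        records.add(new TaxIdentifier("11", "131"));

        // One pass over the map handles each id together with all of its records.
        groupById(records).forEach((id, group) ->
                System.out.println(id + " -> " + group.size() + " record(s)"));
    }
}
```

Each map entry holds every record for one id, so the per-id DB operation can be done once per entry instead of once per record.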

I would leverage the power of your database. When you query, you should order your query by id in ascending order. Not sure which database you are using but I would do:
SELECT * FROM MY_DATABASE WHERE IS_PROCESSED = 'N' ORDER BY ID ASC;
That takes the load of sorting off your application and puts it on your database. Your query then returns the unprocessed records with the lowest ids on top, and you just work through them sequentially.
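Once the rows arrive sorted by id, records with the same id sit next to each other, so they can be processed as consecutive runs. A rough sketch of that walk, assuming each row is a String[] whose first element is the id (the row layout and the splitIntoRuns helper are made up for this example):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class SortedRunsSketch {

    // Splits a list that is already sorted by id into runs of equal ids.
    // Assumption: row[0] holds the id.
    static List<List<String[]>> splitIntoRuns(List<String[]> sortedRows) {
        List<List<String[]>> runs = new ArrayList<>();
        List<String[]> current = new ArrayList<>();
        for (String[] row : sortedRows) {
            // A change of id closes the current run.
            if (!current.isEmpty() && !current.get(0)[0].equals(row[0])) {
                runs.add(current);
                current = new ArrayList<>();
            }
            current.add(row);
        }
        if (!current.isEmpty()) {
            runs.add(current);
        }
        return runs;
    }

    public static void main(String[] args) {
        // Hypothetical result set, already ordered by id.
        List<String[]> rows = Arrays.asList(
                new String[] {"11", "111"},
                new String[] {"11", "131"},
                new String[] {"12", "121"});
        for (List<String[]> run : splitIntoRuns(rows)) {
            System.out.println(run.get(0)[0] + " -> " + run.size() + " record(s)");
        }
    }
}
```

This only works if the input really is sorted by id; on unsorted input the same id can show up in more than one run.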

retrieve histogram from mssql table using java

I want to implement a Java application that can connect to any SQL Server instance and load any table from it. For each table I want to create a histogram based on some arbitrary columns.
For example, if I have this table
name profit
------------
name1 12
name2 14
name3 18
name4 13
I can create a histogram with bin size 4 based on the min and max values of the profit column and count the number of records in each bin.
The result is:
profit count
---------------
12-16 3
16-20 1
My solution for this problem is to retrieve all the data for the required columns, then construct the bins and group the records using the Java stream Collectors.groupingBy.
I'm not sure whether my solution is optimal, so I would like help finding a better algorithm, especially when I have a large number of records (for example, by using features of SQL Server or other frameworks).
Can I use a better algorithm for this issue?
Edit 1:
Assume my SQL result is in List data
private String mySimpleHash(Object[] row, int index) {
StringBuilder hash = new StringBuilder();
for (int i = 0; i < row.length; i++)
if (i != index)
hash.append(row[i]).append(":");
return hash.toString();
}
//index is index of column for histogram
List<Object[]> histogramData = new ArrayList<>();
final Map<String, List<Object[]>> map = data.stream().collect(
Collectors.groupingBy(row -> mySimpleHash(row, index)));
for (final Map.Entry<String, List<Object[]>> entry : map.entrySet()) {
Object[] newRow = newData.get(rowNum);
double result = entry.getValue().stream()
.mapToDouble(row ->
Double.valueOf(row[index].toString())).count();
newRow[index] = result;
histogramData.add(newRow);
}

As you have considered, performing the aggregation after getting all the data out of SQL Server is going to be very expensive as the number of rows in your tables increases. You can simply do the aggregation within SQL. Depending on how you express your histogram bins, this is either trivial or requires some work. In your case, the requirement that the lowest bin start at the min value requires a little bit of setup, as opposed to binning starting from 0. See the sample below. The inner query maps values to a bin number; the outer query aggregates and computes the bin boundaries.
CREATE TABLE Test (
Name varchar(max) NOT NULL,
Profit int NOT NULL
)
INSERT Test(Name, Profit)
VALUES
('name1', 12),
('name2', 14),
('name3', 18),
('name4', 13)
DECLARE @minValue int = (SELECT MIN(Profit) FROM Test)
DECLARE @binSize int = 4
SELECT
(@minValue + @binSize * Bin) AS BinLow,
(@minValue + @binSize * Bin) + @binSize - 1 AS BinHigh,
COUNT(*) AS Count
FROM (
SELECT
((Profit - @minValue) / @binSize) AS Bin
FROM
Test
) AS t
GROUP BY Bin
| BinLow | BinHigh | Count |
|--------|---------|-------|
| 12 | 15 | 3 |
| 16 | 19 | 1 |
http://sqlfiddle.com/#!18/d093c/9
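If it helps to check the query, the same bin arithmetic can be sketched in plain Java: bin = (value - min) / binSize with integer division, and the lower bound is min + bin * binSize. This is only an illustration of the formula, not a replacement for doing the aggregation in SQL; the class and method names are made up for the example.

```java
import java.util.Map;
import java.util.TreeMap;

class HistogramSketch {

    // Returns a map of bin lower bound -> count, with bins starting at the minimum value.
    static Map<Integer, Long> histogram(int[] values, int binSize) {
        Map<Integer, Long> counts = new TreeMap<>();
        if (values.length == 0) {
            return counts;
        }
        int min = values[0];
        for (int v : values) {
            min = Math.min(min, v);
        }
        for (int v : values) {
            int bin = (v - min) / binSize;     // integer division, same as the inner query
            int binLow = min + bin * binSize;  // same as BinLow in the outer query
            counts.merge(binLow, 1L, Long::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        int[] profits = {12, 14, 18, 13};  // the sample rows from the question
        int binSize = 4;
        histogram(profits, binSize).forEach((binLow, count) ->
                System.out.println(binLow + "-" + (binLow + binSize - 1) + ": " + count));
    }
}
```

For the four sample rows this gives the same two bins as the result table above: 12-15 with 3 rows and 16-19 with 1 row.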

How filter Scan of HBase by part of row key?

I have an HBase table with row keys that consist of a text ID and a timestamp, like this:
...
string_id1.1470913344067
string_id1.1470913345067
string_id2.1470913344067
string_id2.1470913345067
...
How can I filter an HBase Scan (in Scala or Java) to get results with a given string ID and a timestamp greater than some value?
Thanks

The fuzzy row approach is efficient for this kind of requirement, especially when the data set is huge.
As explained by this article:
FuzzyRowFilter takes as parameters a row key and mask info.
In the example above, if we want to find the last logged-in users and the row key format is userId_actionId_timestamp (where userId has a fixed length of, say, 4 characters), the fuzzy row key we are looking for is ????_login_. This translates into the following parameters for FuzzyRowFilter:
FuzzyRowFilter rowFilter = new FuzzyRowFilter(
Arrays.asList(
new Pair<byte[], byte[]>(
Bytes.toBytesBinary("\\x00\\x00\\x00\\x00_login_"),
new byte[] {1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0})));
I would suggest going through HBase: The Definitive Guide --> Client API: Advanced Features

Let's say you somehow ended up having your lines in a monadic traversable structure like List or RDD. Now you want only the strings with id = "string_id2" and timestamp > 1470913345000.
So what is the problem here? Just filter your traversable monadic structure on these two criteria.
val filtered = listOrRddOfLines
.map(l => {
val idStr :: timestampStr :: Nil = l.split('.').toList
(idStr, timestampStr.toLong)
})
.filter({
case (idStr, timestamp) => idStr.equals("string_id2") && (timestamp > "1470913345000".toLong)
})

I resolved my problem by using two filters:
- PrefixFilter (I pass the first part of the row key to this filter. In my case that is the string ID, for example "string_id1.")
- RowFilter (I pass two parameters: first CompareOp.GREATER_OR_EQUAL, second the full row key with the required timestamp, for example "string_id1.1470913345000")
As a result I get all cells whose row key has the required string ID in the first part and a timestamp greater than or equal to the one I put in the filter in the second part. It is exactly what I want.
Code snippet:
val s = new Scan()
s.addFamily(family.getBytes)
val filterList = new FilterList()
filterList.addFilter(new PrefixFilter(Bytes.toBytes(prefixOfRowKey)))
filterList.addFilter(new RowFilter(CompareOp.GREATER_OR_EQUAL, new BinaryComparator(valueForBinaryFilter.getBytes())))
s.setFilter(filterList)
val scanner = table.getScanner(s)
Thanks to everyone who helped to find a solution.
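For anyone wondering why the two conditions together give the right rows, here is a small plain-Java simulation of the logic (no HBase involved; the class and method names are made up). It relies on row keys being compared lexicographically, so it assumes all timestamps have the same number of digits.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class RowKeyFilterSketch {

    // Keeps keys that start with the prefix AND compare >= the lower bound,
    // mirroring a prefix condition combined with a GREATER_OR_EQUAL row comparison.
    static List<String> filter(List<String> rowKeys, String prefix, String lowerBound) {
        List<String> kept = new ArrayList<>();
        for (String key : rowKeys) {
            // For ASCII keys, String.compareTo orders the same way as a byte-wise comparison.
            if (key.startsWith(prefix) && key.compareTo(lowerBound) >= 0) {
                kept.add(key);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<String> rowKeys = Arrays.asList(
                "string_id1.1470913344067",
                "string_id1.1470913345067",
                "string_id2.1470913344067",
                "string_id2.1470913345067");
        System.out.println(filter(rowKeys, "string_id1.", "string_id1.1470913345000"));
    }
}
```

The greater-or-equal comparison alone would also let through every string_id2 key, because those sort after all string_id1 keys. The prefix condition is what cuts them off.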

How would one model something updateable, say like a “status” of a thing, in Cassandra CQL3 and be able to query on this status?

This is a bit of a contrived example to illustrate my question, but let's say I have a Car entity which contains Lightbulb entities. A car has several lightbulbs, each of which could be "on", "off" or "broken".
Each type of lightbulb has a unique id. (left headlight = 100, right headlight = 101... that sort of thing)
The status of a lightbulb needs to be constantly updated.
What I'd like to do is query for a specific car for a set of lightbulbs with a specific status.
something like:
"give me all the lightbulbs with status "on" for car "chevy" model "nova" vin "xyz-123"".
create table lightbulbstatus (
bulbid uuid,
carmake text,
carmodel text,
carvin uuid,
lastupdate timestamp,
status int,
/* row key */ /* col keys */
PRIMARY KEY( (carmake, carmodel, carvin), ? ? ? ?)
);
I believe the row key should have the car coordinate in it, but beyond that, I'm a bit lost. I assume each time there is a status change to a bulb, we add a column. But I'm not sure what the keys should be in the column to make the query work.
I think in RDBMS-land, you could do a subselect or nested query to find bulbs with the status = on.
select * from lightbulbstatus where status = 1 and lastupdate > (select lastupdate from lightbulbstatus where status != 1);
No idea how you would do this in CQL3. Obviously sub-selects are not allowed.

Since you do not have to maintain status history, I would suggest to have a single row for each bulb by the following primary key:
PRIMARY KEY( (carmake, carmodel, carvin), bulbid)
In order to query lightbulbs by status you need to create a secondary index:
CREATE INDEX lightbulb_by_status ON lightbulbstatus (status);
SELECT * FROM lightbulbstatus
WHERE status = 1
AND carmake = 'chevy'
AND carmodel = 'nova'
AND carvin = cfe638e9-5cd9-43c2-b5f4-4cc9a0e6b0ff;
Although cardinality of the status is low, your query includes the partition key and is highly efficient.
If the number of rows to be filtered is very small (like the number of lightbulbs in a car), you may consider filtering lightbulbs by status in the application (and skipping the secondary index).
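Filtering in the application could look roughly like this. It is a sketch that assumes the rows for one car have already been read by partition key and reduced to a map of bulb id to status; that map and the helper name are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

class BulbStatusSketch {

    // Returns the ids of all bulbs whose status equals the wanted value.
    static List<String> bulbsWithStatus(Map<String, Integer> statusByBulb, int wanted) {
        return statusByBulb.entrySet().stream()
                .filter(e -> e.getValue() == wanted)
                .map(Map.Entry::getKey)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // Hypothetical rows for a single car (one partition).
        Map<String, Integer> statusByBulb = new LinkedHashMap<>();
        statusByBulb.put("left-headlight", 1);
        statusByBulb.put("right-headlight", 0);
        statusByBulb.put("brake-light", 1);

        System.out.println(bulbsWithStatus(statusByBulb, 1));
    }
}
```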
If you need to handle the case where an obsolete lightbulb status update might override a more recent one (as your RDBMS query suggests), consider using lightweight transactions:
UPDATE lightbulbstatus set status = 0, lastupdate = '2014-11-08 23:50:30+0019'
WHERE carmake = 'chevy'
AND carmodel = 'nova'
AND carvin = cfe638e9-5cd9-43c2-b5f4-4cc9a0e6b0ff
AND bulbid = 9124f318-8253-4d94-b865-3be07899c8ff
IF status = 1 AND lastupdate < '2014-11-08 23:50:30+0019';
Hope it helps.

How can I fetch first n rows from a TopLink query?

For optimization purpose, I want to fetch first N results in a subquery (I'm getting first N ID values) and in the main query fetch full rows for the ID values in the subquery and order them. What I have now is
// This just adds params to the subquery
Expression managedEmp = generateExpression(p_upravljackaFilter);
ReportQuery subQuery = new ReportQuery(CustomDocument.class,
managedEmp);
subQuery.addAttribute("m_id");
Expression exp = new ExpressionBuilder().get("m_id").in(subQuery);
ReadAllQuery testQuery = new ReadAllQuery(CustomDocument.class,
exp);
testQuery.addAscendingOrdering("m_creationDate");
List documentList = (List)getTopLinkTemplate().executeQuery(testQuery, true);
What I'm trying so far is using a user defined function, like this:
ExpressionOperator fetchFirst = new ExpressionOperator();
fetchFirst.setSelector(1);
Vector s = new Vector();
s.addElement("FETCH FIRST 5 ROWS ONLY");
fetchFirst.printsAs(s);
fetchFirst.bePostfix();
fetchFirst.setNodeClass(FunctionExpression.class);
ExpressionOperator.initializeOperators();
ExpressionOperator.addOperator(fetchFirst);
expression = expression.and(builder.get("m_datumKreiranja").getFunction(fetchFirst));
This is literally where I stopped so this won't work but it can show you which way I'm heading. Is something like this even possible? I'm using Java 1.4 and toplink 10g.

Really simple, just insert this as the second line:
managedEmp = managedEmp.postfixSQL("FETCH FIRST 5 ROWS ONLY");
My mistake was that I tried it like this, without assigning the result:
managedEmp.postfixSQL("FETCH FIRST 5 ROWS ONLY");
because I didn't read what postfixSQL does.
