I was wondering whether I can delete specific columns of certain rows by timestamp without scanning the whole table.
My code looks like this:
public static final void deleteBatch(long date, String column, String... ids) throws Exception {
    Connection con = null; // connection instance
    HTable table = null; // htable instance
    List<Delete> deletes = new ArrayList<Delete>(ids.length);
    for (int i = 0; i < ids.length; i++) {
        String id = ids[i];
        Delete delete = new Delete(id.getBytes());
        delete.addColumn(/* CF */, Bytes.toBytes(column));
I also tried:
        delete.addColumn(/* CF */, Bytes.toBytes(column), date);
This works, but it deletes all columns prior to the given date.
I want something like this:
        Delete delete = new Delete(id.getBytes());
        delete.setTimestamp(date-1, date);
I don't want to delete everything before or after a specific date; I want to delete only the exact time range I give.
Also, the max versions setting of my HTableDescriptor is Integer.MAX_VALUE, so that all changes are kept.
As mentioned in the Delete API documentation:
Specifying timestamps, deleteFamily and deleteColumns will delete all
versions with a timestamp less than or equal to that passed
So it deletes all columns whose timestamps are less than or equal to the given date.
How can I achieve that?
Any answer is appreciated.
After struggling for weeks, I found a solution to this problem.
Apache HBase has a feature called coprocessors, which host and manage the core execution of data-level operations (get, delete, put, ...) and can be overridden (extended) for custom computations such as data aggregation and bulk processing against the data, outside the client's scope.
There are some basic implementations for common problems such as bulk delete.
I am working on a task in which I need to process data in chunks. I have a properties file in which I define the chunk size, say 500, and the data I get from the database is, say, 1000 records. I want to process the 1000 records in chunks of 500 each using multithreading.
This is the first time I am implementing this, so please let me know if I can achieve the same thing with another technique. The main purpose is that I am generating an Excel file in which I populate the data according to the chunk size. So the first thread would process 500 records and the second thread the next 500.
Partial code (the rest parses the XML and writes to Excel using POI):
public List<NYProgramTO> getNYPPAData() throws Exception {
    List<NYProgramTO> to = dao.getLatestNYData();
    DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
    Document document = null;
    // Returns the next chunk of chunkSize records
    List<NYProgramTO> myList = getNextChunk(to);
    ExecutorService executor = Executors.newFixedThreadPool(myList.size());
    myList.forEach((NYProgramTO nyTo) ->
        executor.execute(new NYExecutorThread(nyTo, migrationConfig, appContext, dao)));
    executor.shutdown(); // without this, awaitTermination always waits the full timeout
    executor.awaitTermination(300, TimeUnit.SECONDS);
The dao.getLatestNYData() method returns all the records from the database, and this is how I populate the list named to.
I have the following method, which should give me the next chunk: if 500 records have already been processed, this method should return the next 500 records to process (I hope this makes sense).
private static List<NYProgramTO> getNextChunk(List<NYProgramTO> list) {
    currentIndex = 0; // This is static int class variable
    List<NYProgramTO> nyList = new ArrayList<>();
    if (list.size() == 0) {
        return list;
    }
    int totalCount = list.size();
    for (int i = currentIndex; i < (currentIndex + chunkSize); i++) {
        if (i == totalCount) break;
        nyList.add(list.get(i));
    }
    return nyList;
}
In my first method I create the threads, but I am not sure how many threads I need to create. Currently I am passing the size of the list that I receive from the getNextChunk() method.
The NYExecutorThread class simply implements Runnable, and I don't have any logic in it yet. Currently I just pass parameters to the constructor so that it can get the configuration, and create the threads.
It is a little confusing. If anyone has implemented this kind of logic, please let me know how I should go ahead with this.
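One possible way to approach the chunking described above is to split the list once into sub-lists of chunkSize and submit each sub-list as a single task, so the number of threads is independent of the number of records. The sketch below is only an illustration under assumptions: the class name ChunkedProcessing, the methods partition and processInChunks, and the placeholder "work" (returning the chunk size) are hypothetical and not part of the original code, and plain strings stand in for NYProgramTO.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

class ChunkedProcessing {

    /** Splits the list into consecutive sub-list views of at most chunkSize elements. */
    static <T> List<List<T>> partition(List<T> items, int chunkSize) {
        if (chunkSize <= 0) {
            throw new IllegalArgumentException("chunkSize must be positive");
        }
        List<List<T>> chunks = new ArrayList<>();
        for (int from = 0; from < items.size(); from += chunkSize) {
            int to = Math.min(from + chunkSize, items.size());
            chunks.add(items.subList(from, to));
        }
        return chunks;
    }

    /** Runs one task per chunk on a fixed pool and returns the per-chunk results in order. */
    static List<Integer> processInChunks(List<String> records, int chunkSize, int threads) {
        ExecutorService executor = Executors.newFixedThreadPool(threads);
        try {
            List<Future<Integer>> futures = new ArrayList<>();
            for (List<String> chunk : partition(records, chunkSize)) {
                // Placeholder work: a real task would write this chunk's rows.
                Callable<Integer> task = () -> chunk.size();
                futures.add(executor.submit(task));
            }
            List<Integer> results = new ArrayList<>();
            for (Future<Integer> future : futures) {
                results.add(future.get()); // blocks until that chunk is done
            }
            return results;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for chunks", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("a chunk failed", e.getCause());
        } finally {
            executor.shutdown(); // stop accepting tasks and let the workers exit
        }
    }

    public static void main(String[] args) {
        List<String> records = new ArrayList<>();
        for (int i = 0; i < 1000; i++) {
            records.add("record-" + i);
        }
        System.out.println(processInChunks(records, 500, 2)); // prints [500, 500]
    }
}
```

With this layout the pool size can be a small fixed number (for example the number of CPU cores) rather than the chunk size. Note that if every task writes to the same workbook, the writes themselves would probably still need to be synchronized or done on one thread, since that part is not covered by this sketch.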
I have a CSV file full of data downloaded from Fitbit. The data inside the CSV file follows a basic format:
<Type of Data>
Here is a small example of the layout of the file:
Date,Calories Burned,Steps,Distance,Floors,Minutes Sedentary,Minutes Lightly Active,Minutes Fairly Active,Minutes Very Active,Activity Calories
...other logs
Date,Minutes Asleep,Minutes Awake,Number of Awakenings,Time in Bed
Now, I am using CSVParser from the Apache Commons CSV library to go through this data. My goal is to turn this into Java objects that can turn the relevant data into JSON (I need the JSON to upload to a different website). CSVParser has an iterator that I can use to iterate through the CSVRecords in the file. So, essentially, I have a "list" of all of the data.
Because the file contains different types of data (Sleep logs, Activity logs, etc), I need to get a subsection/sub-list of the file, and pass it into a class to analyse it.
I need to iterate over the list and look for the keyword that identifies a new section of the file (e.g. Activities, Foods, Sleep, etc). Once I have identified what the next part of the file is, I need to select all of the following rows up until the next category.
Now, for the actual question: I don't know how to use an iterator to get the equivalent of List.subList(). Here is what I have been trying:
while (iterator.hasNext()) {
    CSVRecord current = iterator.next();
    if (current.get(0).equals("Activities")) {
        iterator.next(); //Columns
        while (iterator.hasNext() && iterator.next().get(0).isData()) { //isData isn't real, but I can't figure out what I need to do.
            //How do I sublist it here?
        }
    }
}
So, I need to determine if the next CSVRecord begins with a quote/has data, and loop until I find the next category, and finally pass a subsection of the file (using the iterator) to another function to do something with the correct log.
I considered converting it first to a List with a while loop, and then sub-listing, but that seemed wasteful. Correct me if I am wrong.
Also, I can't assume that each section will have the same number of rows following it. They might be similar, but there are also the food logs, which follow a completely different pattern. Here are two different days. Foods follows the normal pattern, but the Food Logs do not.
Date,Calories In
Food Log 20160717
Daily Totals
"","Fat","0 g"
"","Fiber","0 g"
"","Carbs","0 g"
"","Sodium","0 mg"
"","Protein","0 g"
"","Water","0 fl oz"
Food Log 20160718
"","Raspberry Yogurt","190"
"","Almond Sweet & Salty Granola Bar","140"
"","Goldfish Baked Snack Crackers, Cheddar","140"
"","Bagels, Whole Wheat","190"
"","Braided Twists Honey Wheat Pretzels","343"
"","Apples, raw, gala, with skin - 1 medium","98"
"Daily Totals"
"","Fat","21 g"
"","Fiber","13 g"
"","Carbs","202 g"
"","Sodium","1,538 mg"
"","Protein","28 g"
"","Water","24 fl oz"
The easiest way to do what you want is simply to remember the previous category's data and, when you hit a new category, process that data and reset for the next category. This should work:
String categoryName = null;
List<List<String>> categoryData = new ArrayList<>();
while (iterator.hasNext()) {
    CSVRecord current = iterator.next();
    if (current.size() == 1) { //start of next category
        processCategory(categoryName, categoryData);
        categoryName = current.get(0);
        categoryData = new ArrayList<>(); //reset for the next category
        iterator.next(); //skip header
    } else { //category data
        List<String> rowData = new ArrayList<>(current.size());
        CollectionUtils.addAll(rowData, current.iterator()); //uses Apache Commons Collections, but you can use whatever
        categoryData.add(rowData);
    }
}
processCategory(categoryName, categoryData); //last category of file
And then:
void processCategory(String categoryName, List<List<String>> categoryData) {
    if (categoryName != null) { //null before the first category of the file, skip
        //do stuff
    }
}
The above assumes that a List<List<String>> is the data structure that you want to deal with, but you can tweak as you see fit. I might even recommend simply passing List<Iterable<String>> to the process method (CSVRecord implements Iterable<String>) and handling the row data there.
This can definitely be cleaned up further, but it should get you started.
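As a self-contained illustration of the same idea, the sketch below groups rows into sections without any CSV library, using a plain Iterator<List<String>> in place of the CSVRecord iterator. It keeps the assumption made above that a row with exactly one field starts a new category and that the row right after it is a header. The class name SectionSplitter, the method split, and the sample rows are made up for the example and are not taken from the Fitbit file.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class SectionSplitter {

    /**
     * Groups rows by section. A row with exactly one field starts a new section,
     * and the row right after it is treated as the column header and skipped.
     * Rows seen before the first section name are ignored.
     */
    static Map<String, List<List<String>>> split(Iterator<List<String>> rows) {
        Map<String, List<List<String>>> sections = new LinkedHashMap<>();
        List<List<String>> current = null;
        while (rows.hasNext()) {
            List<String> row = rows.next();
            if (row.size() == 1) {
                current = new ArrayList<>();
                sections.put(row.get(0), current);
                if (rows.hasNext()) {
                    rows.next(); // skip the header row
                }
            } else if (current != null) {
                current.add(row);
            }
        }
        return sections;
    }

    public static void main(String[] args) {
        // Invented sample rows, only to exercise the method.
        List<List<String>> rows = Arrays.asList(
            Arrays.asList("Activities"),
            Arrays.asList("Date", "Steps"),
            Arrays.asList("2016-07-17", "1000"),
            Arrays.asList("2016-07-18", "2000"),
            Arrays.asList("Sleep"),
            Arrays.asList("Date", "Minutes Asleep"),
            Arrays.asList("2016-07-17", "400"));
        Map<String, List<List<String>>> sections = split(rows.iterator());
        System.out.println(sections.keySet());                  // prints [Activities, Sleep]
        System.out.println(sections.get("Activities").size());  // prints 2
    }
}
```

Each section can then be handed to its own handler by name. Two limits of this sketch: a section name that appears twice would overwrite the earlier entry, and the food logs shown in the question would need their own rule, since lines such as "Daily Totals" also have a single field.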
I am using JHDF5 to log a collection of values to an HDF5 file. I am currently using two ArrayLists to do this, one with the values and one with the names of the values.
ArrayList<String> valueList = new ArrayList<String>();
ArrayList<String> nameList = new ArrayList<String>();
IHDF5Writer writer = HDF5Factory.configure("My_Log").keepDataSetsIfTheyExist().writer();
HDF5CompoundType<List<?>> type = writer.compound().getInferredType("", nameList, valueList);
writer.compound().write("log1", type, valueList);
This logs the values correctly to the file My_Log, in the dataset "log1". However, this example always overwrites the previous log of the values in the dataset "log1". I want to be able to log to the same dataset every time, adding the latest log to the next line/index of the dataset. For example, if I were to change the value of "Name2" to "Value3" and log the values, and then change "Name1" to "Value4" and "Name2" to "Value5" and log the values, the dataset should look like this:
I thought the keepDataSetsIfTheyExist() option would prevent the dataset from being overwritten, but apparently it doesn't work that way.
Something similar to what I want can be achieved in some cases with writer.compound().writeArrayBlock(), specifying the index at which the array block should be written. However, this solution doesn't seem to be compatible with my current code, where I have to use lists for handling my data.
Is there some option to achieve this that I have overlooked, or can't this be done with JHDF5?
I don't think that will work. It is not quite clear to me, but I believe the getInferredType() you are using creates a data set with 2 name -> value entries, so it is effectively creating an object inside the HDF5 file. The best solution I could come up with was to read the previous values and add them to the valueList before writing:
ArrayList<String> valueList = new ArrayList<>();
try (IHDF5Reader reader = HDF5Factory.configure("My_Log.h5").reader()) {
    String[] previous = reader.string().readArray("log1");
    for (int i = 0; i < previous.length; i++) {
        valueList.add(i, previous[i]);
    }
} catch (HDF5FileNotFoundException ex) {
    // Nothing to do here.
}
// append the new log values to valueList here
MDArray<String> values = new MDArray<>(String.class, new long[]{valueList.size()});
for (int i = 0; i < valueList.size(); i++) {
    values.set(valueList.get(i), i);
}
try (IHDF5Writer writer = HDF5Factory.configure("My_Log.h5").writer()) {
    writer.string().writeMDArray("log1", values);
}
If you call this code a second time with "Value3" and "Value4" instead, you will get 4 values. This sort of solution might become unpleasant if you start to have hierarchies of datasets however.
To solve your issue, you need to define the dataset log1 as extendible so that it can store an unknown number of log entries (that are generated over time) and write these using a point or hyperslab selection (otherwise, the dataset will be overwritten).
If you are not bound to a specific technology for handling HDF5 files, you may wish to take a look at HDFql, which is a high-level language for managing HDF5 files easily. A possible solution for your use case using HDFql (in Java) is:
public class Example
{
    public class Log
    {
        String name1;
        String name2;
    }
    public boolean doSomething(Log log)
    {
        log.name1 = "Value1";
        log.name2 = "Value2";
        return true;
    }
    public static void main(String args[])
    {
        // declare variables
        Log log = new Log();
        int variableNumber;
        // create an HDF5 file named 'My_Log.h5' and use (i.e. open) it
        HDFql.execute("CREATE AND USE FILE My_Log.h5");
        // create an extendible HDF5 dataset named 'log1' of data type compound
        // register variable 'log' for subsequent usage (by HDFql)
        variableNumber = HDFql.variableRegister(log);
        // call function 'doSomething' that does something and populates variable 'log' with an entry
        // alter (i.e. extend) dataset 'log1' to +1 (i.e. add a new row)
        HDFql.execute("ALTER DIMENSION log1 TO +1");
        // insert (i.e. write) data stored in variable 'log' into dataset 'log1' using a point selection
        HDFql.execute("INSERT INTO log1(-1) VALUES FROM MEMORY " + variableNumber);
    }
}
I have an application that uses hibernate. At one part I am trying to retrieve documents. Each document has an account number. The model looks something like this:
private Long _id;
private String _acct;
private String _message;
private String _document;
private String _doctype;
private Date _review_date;
I then retrieve the documents with a document service. A portion of the code is here:
public List<Doc_table> getDocuments(int hours_, int dummyFlag_, List<String> accts) {
    List<Doc_table> documents = new ArrayList<Doc_table>();
    Session session = null;
    Criteria criteria = null;
    try {
        // Let's create a previous Date by subtracting the number of
        // subtractHours_ passed.
        session = HibernateUtil.getSession();
        if (accts == null) {
            Calendar cutoffTime = Calendar.getInstance();
            cutoffTime.add(Calendar.HOUR_OF_DAY, hours_);
            criteria = session.createCriteria(Doc_table.class).add(
                    Restrictions.gt("dbcreate_date", cutoffTime.getTime()))
                    .add(Restrictions.eq("dummyflag", dummyFlag_));
        } else {
            criteria = session.createCriteria(Doc_table.class).add(Restrictions.in("acct", accts));
        }
        documents = criteria.list();
        for (int x = 0; x < documents.size(); x++) {
            Doc_table document = documents.get(x);
            ......... more stuff here
This works great if I'm retrieving a small number of documents. But when the document size is large I get a heap space error, probably because the documents take up a lot of space and when you retrieve several thousand of them, bad things happen.
All I really want to do is retrieve each document that fits my criteria, grab the account number and return a list of account numbers (a far smaller object than a list of objects). If this were jdbc, I would know exactly what to do.
But in this case I'm stumped. I guess I'm looking for a way to bring back just the account numbers of the Doc_table objects.
Or alternatively, some way where I can retrieve documents one at a time from the database using hibernate that fit my criteria (instead of bringing back the whole List of objects which uses too much memory).
There are several ways to deal with the problem:
Load the documents in batches of a smaller size.
(The way you noticed) Do not query for the whole document, only for the account numbers:
List accts = session.createQuery("SELECT d._acct FROM Doc d WHERE ...").list();
List<String> accts = session.createCriteria(Doc.class).
If there is a special field in your Document class that contains the large document byte data, you could map that field as a lazily loaded field.
Create a second entity class (read only) that contains only the fields that you need and map it to the same table.
Instead of fetching all documents, i.e. all records, at once, try to limit the number of rows being fetched. You could also adopt a strategy in which you store documents temporarily as flat files and fetch them later or delete them after use. Though it is a somewhat longer process, it is an efficient way of handling and delivering documents from a database.
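The batching idea can be sketched independently of Hibernate. In the example below, the database query is replaced by a hypothetical fetchPage(offset, limit) function over an in-memory list, and only the account numbers are kept from each page. The names BatchedAccountLoader, Doc, collectAccounts and fetchPage are assumptions made for illustration. In a real Hibernate version, the page would presumably come from a query limited with setFirstResult and setMaxResults, and the session would likely need to be cleared between pages so that loaded entities can be garbage collected.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.BiFunction;

class BatchedAccountLoader {

    /** Hypothetical stand-in for the mapped entity; only the fields used here. */
    static class Doc {
        final String acct;
        final String document;

        Doc(String acct, String document) {
            this.acct = acct;
            this.document = document;
        }
    }

    /**
     * Pulls pages of at most batchSize rows via fetchPage(offset, limit) and keeps
     * only the account numbers, so this method references one page at a time.
     */
    static List<String> collectAccounts(
            BiFunction<Integer, Integer, List<Doc>> fetchPage, int batchSize) {
        List<String> accts = new ArrayList<>();
        int offset = 0;
        while (true) {
            List<Doc> page = fetchPage.apply(offset, batchSize);
            for (Doc doc : page) {
                accts.add(doc.acct);
            }
            if (page.size() < batchSize) {
                break; // a short or empty page means there are no more rows
            }
            offset += page.size();
        }
        return accts;
    }

    public static void main(String[] args) {
        // In-memory "table" standing in for the database.
        List<Doc> table = new ArrayList<>();
        for (int i = 0; i < 5; i++) {
            table.add(new Doc("acct-" + i, "large document body"));
        }
        BiFunction<Integer, Integer, List<Doc>> fetchPage = (offset, limit) ->
            table.subList(Math.min(offset, table.size()),
                          Math.min(offset + limit, table.size()));
        System.out.println(collectAccounts(fetchPage, 2));
        // prints [acct-0, acct-1, acct-2, acct-3, acct-4]
    }
}
```

If only the account numbers are needed, selecting just that column in the query is likely to be cheaper than paging through whole documents. Paging is mainly useful when each document really has to be inspected.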