Stream data to velocity template on the fly

Stream data to velocity template on the fly - java

Could someone let me know, if it is possible to open somehow Velocity Excel template and stream to it partially a bigger amount of data on the fly?
Let say I would like to read in a loop from external resource ArrayLists with data. One list with 10 000 items per each iteration. In simple iteration I would like to push the list to Velocity Excel template and forget about it by jumping to the next iteration. At the end of data processing I would do a final merge the Velocity context and template with all data.
So far, I've seen ways of generating Excel reports by Velocity engine in a few simple steps:
create Velocity template
create Velocity context
put data to context
merge context and template
but I need to repeate 3th step several times.

Here is my tested solution that works well in my case. I'm able to generate to Excel sheet a huge amout of data loaded directly from database.
Connection conn = getDs().getConnection();
// such a configuration of prepared statement is mandatory for large amount of data
PreparedStatement ps = conn.prepareStatement(MY_QUERY, ResultSet.TYPE_FORWARD_ONLY, ResultSet.CONCUR_READ_ONLY, ResultSet.CLOSE_CURSORS_AT_COMMIT);
// load huge amount of data per 50000 items
ps.setFetchSize(50000);
ResultSet rs = ps.executeQuery();
// calculate count of returned data earlier
final CountWrapper countWrapper = new CountWrapper(countCalculatedEarlier);
CustomResultSetIterator<MyObject> rowsIter = new CustomResultSetIterator<MyObject>(rs, new org.apache.commons.dbutils.BasicRowProcessor(), MyObject.class) {
private Long iterCount = 0L;
#Override
public boolean hasNext() {
// you can't call isLast method on defined as forward only cursor of result set, hence there is a homegrown calculation whether last item is reached or not
return iterCount < countWrapper.getCount().longValue();
};
#Override
public MyObject next() {
iterCount++;
return super.next();
};
};
VelocityContext context = new VelocityContext();
// place an interator here instead of collection object
context.put("rows", rowsIter);
Template t = ve.getTemplate(template);
File file = new File(parthToFile);
FileWriter fw = new FileWriter(file);
// generate on the fly in my case 1 000 000 rows in Excel, watch out such an Excel may have 1GB size
t.merge(context, fw);
// The CustomResultSetIterator is a:
public class CustomResultSetIterator<E> implements Iterator<E> {
//...implement all interface's methods
}

The only option I see with default Velocity would be to implement something like a "streaming Collection" which never holds the complete data in memory, but rather provides the data in the iterator one by one.
You would then put this Collection into the Velocity context and use it in the Velocity Template to iterate over the items. Internally the Collection would retrieve the items in the hasNext()/next() call one after the other from your external source.

Related

HBase - delete columns of rows with range of timestamp without scanning

I was wonder if I could delete some columns of some rows with timestamp without scanning the whole database
my code is like below:
public static final void deleteBatch(long date, String column, String...ids) throws Exception{
Connection con = null; // connection instance
HTable table = null; // htable instance
List<Delete> deletes = new ArrayList<Delete>(ids.length);
for(int i = 0; i < ids.length; i++){
String id = ids[i];
Delete delete = new Delete(id.getBytes());
delete.addColumn(/* CF */, Bytes.toString(column));
/*
also tried:
delete.addColumn(/* CF */, Bytes.toString(column), date);
*/
delete.setTimestamp(date);
deletes.add(delete);
}
table.delete(deletes);
table.close();
}
this works, but deletes all column prior to given date,
I want something like this:
Delete delete = new Delete(id.getBytes());
delete.setTimestamp(date-1, date);
I don't want to delete prior or after a specific date, I want to delete exact time range I give.
Also my MaxVersion of HTableDescriptor is set to Integer.MAX_VALUE to keep all changes.
as mentioned in the Delete API Documentation:
Specifying timestamps, deleteFamily and deleteColumns will delete all
versions with a timestamp less than or equal to that passed
it delets all columns which their timestamps are equal or less than given date.
how can I achieve that?
any answer appreciated

After struggling for weeks I found a solution for this problem.
the apache HBase has a feature called coprocessor which hosts and manages the core execution of data level operations (get, delete, put ...) and can be overrided(developed) for custom computions like data aggregation and bulk processing against the data outside the client scope.
there are some basic implemention for common problems like bulk delete and etc..

How to process data in chunks in java using Multi Threading?

I am working on a task in which I need to process data in chunks. I have a properties file in which I define the chunk size, suppose 500 and the data that I am getting form the data base is suppose 1000 records. I want to process 1000 records in chunks 500 each using Multi Threading.
This is the first time I am implementing this so please let me know if I can achieve the same using another technique. The main purpose behind this is that I am generating an excel file in which I populate the data keeping in mind the chunk size. So probably first thread processes 500 records and second thread next 500.
Partial Code (Rest parses the xml and writes in Excel using POI)
public List<NYProgramTO> getNYPPAData() throws Exception {
this.getConfiguration();
List<NYProgramTO> to = dao.getLatestNYData();
DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
Document document = null;
// Returns chunkSize
List<NYProgramTO> myList = getNextChunk(to);
ExecutorService executor = Executors.newFixedThreadPool(myList.size());
myList.stream()
.forEach((NYProgramTO nyTo) ->
{
executor.execute(new NYExecutorThread(nyTo, migrationConfig , appContext, dao));
});
executor.shutdown();
executor.awaitTermination(300, TimeUnit.SECONDS);
System.gc();
dao.getLatestNYData(); method returns me the total number of records from the database and this is how I populate the list to.
I have the following method which gives me the next set of chunk, so suppose if 500 records had processed this method should give next 500 records to process (Hope this makes sense).
private static List<NYProgramTO> getNextChunk(List<NYProgramTO> list) {
currentIndex = 0; // This is static int class variable
List<NYProgramTO> nyList = new ArrayList<>();
if(list.size() == 0) {
return list;
}
int totalCount = list.size();
for(int i = currentIndex; i < (currentIndex + chunkSize); i++) {
if(i == totalCount) break;
nyList.add(list.get(i));
}
return nyList;
}
In my first method I create threads now here I am not sure to how many thread do I need to create. Currently I am passing the size of the list that I receive from getNextChunk(); method.
NYExecutorThread this class simply implements Runnable and I don't have any logic in it yet. Currently I simply pass parameters on the constructor to be able to get the configurations and create threads.
It is a little confusing and I want if anyone has implemented such a logic, please let me know how can I go ahead with this?
Thanks

How do I take a "slice" of a List that only has an iterator?

I have a CSV file full of data downloaded from Fitbit. The data inside the CSV file follows a basic format:
<Type of Data>
<Columns-comma-separated>
<Data-related-to-columns>
Here is a small example of the layout of the file:
Activities
Date,Calories Burned,Steps,Distance,Floors,Minutes Sedentary,Minutes Lightly Active,Minutes Fairly Active,Minutes Very Active,Activity Calories
"2016-07-17","3,442","9,456","4.41","12","612","226","18","44","1,581"
"2016-07-18","2,199","7,136","3.33","10","370","93","12","46","1,092"
...other logs
Sleep
Date,Minutes Asleep,Minutes Awake,Number of Awakenings,Time in Bed
"2016-07-17","418","28","17","452"
"2016-07-18","389","26","10","419"
Now, I am using CSVParser from Apache Common's library to go through this data. My goal is to turn this into Java Objects that can turn relevant data into Json (I need the Json to upload into a different website). CSVParser has an iterator that I can use to iterate through the CSVRecords in the file. So, essentially, I have a "list" of all of the data.
Because the file contains different types of data (Sleep logs, Activity logs, etc), I need to get a subsection/sub-list of the file, and pass it into a class to analyse it.
I need to iterate over the list and look for the keyword that identifies a new section of the file (e.g. Activities, Foods, Sleep, etc). Once I have identified what the next part of the file is, I need to select all of the following rows up until the next category.
Now, for the question in this Question: I don't know how to use an iterator to get the equivalent of List.sublist(). Here is what I have been trying:
while (iterator.hasNext())
{
CSVRecord current = iterator.next();
if (current.get(0).equals("Activities"))
{
iterator.next(); //Columns
while (iterator.hasNext() && iterator.next().get(0).isData()) //isData isn't real, but I can't figure out what I need to do.
{
//How do I sublist it here?
}
}
}
So, I need to determine if the next CSVRecord begins with a quote/has data, and loop until I find the next category, and finally pass a subsection of the file (using the iterator) to another function to do something with the correct log.
Edit
I considered converting it first to a List with a while loop, and then sub-listing, but that seemed wasteful. Correct me if I am wrong.
Also, I can't assume that each section will have the same amount of rows following it. They might have similar, but there is also the food logs, which follow a completely different pattern. Here are two different days. Foods follows the normal pattern, but the Food Logs do not.
Foods
Date,Calories In
"2016-07-17","0"
"2016-07-18","1,101"
Food Log 20160717
Daily Totals
"","Calories","0"
"","Fat","0 g"
"","Fiber","0 g"
"","Carbs","0 g"
"","Sodium","0 mg"
"","Protein","0 g"
"","Water","0 fl oz"
Food Log 20160718
Meal,Food,Calories
"Lunch"
"","Raspberry Yogurt","190"
"","Almond Sweet & Salty Granola Bar","140"
"","Goldfish Baked Snack Crackers, Cheddar","140"
"","Bagels, Whole Wheat","190"
"","Braided Twists Honey Wheat Pretzels","343"
"","Apples, raw, gala, with skin - 1 medium","98"
"Daily Totals"
"","Calories","1,101"
"","Fat","21 g"
"","Fiber","13 g"
"","Carbs","202 g"
"","Sodium","1,538 mg"
"","Protein","28 g"
"","Water","24 fl oz"

The easiest way to do what you want is to simply remember that previous category data, and when you hit a new category, process that previous category data and reset for the next category. This should work:
String categoryName = null;
List<List<String>> categoryData = new ArrayList<>();
while (iterator.hasNext()) {
CSVRecord current = iterator.next();
if (current.size() == 1) { //start of next category
processCategory(categoryName, categoryData);
categoryName = current.get(0);
categoryData.clear();
iterator.next(); //skip header
} else { //category data
List<String> rowData = new ArrayList<>(current.size());
CollectionUtils.addAll(rowData, current.iterator()); //uses Apache Commons Collections, but you can use whatever
categoryData.add(rowData);
}
}
processCategory(categoryName, categoryData); //last category of file
And then:
void processCategory(String categoryName, List<List<String>> categoryData) {
if (categoryName != null) { //first category of the file, skip
//do stuff
}
}
The above assumes that a List<List<String>> is the data structure that you want to deal with, but you can tweak as you see fit. I might even recommend simply passing List<Iterable<String>> to the process method (CSVRecord implements Iterable<String>) and handling the row data there.
This can definitely be cleaned up further, but it should get you started.

JHDF5 - How to avoid dataset being overwritten

I am using JHDF5 to log a collection of values to a hdf5 file. I am currently using two ArrayLists to do this, one with the values and one with the names of the values.
ArrayList<String> valueList = new ArrayList<String>();
ArrayList<String> nameList = new ArrayList<String>();
valueList.add("Value1");
valueList.add("Value2");
nameList.add("Name1");
nameList.add("Name2");
IHDF5Writer writer = HDF5Factory.configure("My_Log").keepDataSetsIfTheyExist().writer();
HDF5CompoundType<List<?>> type = writer.compound().getInferredType("", nameList, valueList);
writer.compound().write("log1", type, valueList);
writer.close();
This will log the values in the correct way to the file My_Log and in the dataset "log1". However, this example always overwrites the previous log of the values in the dataset "log1". I want to be able to log to the same dataset everytime, adding the latest log to the next line/index of the dataset. For example, if I were to change the value of "Name2" to "Value3" and log the values, and then change "Name1" to "Value4" and "Name2" to "Value5" and log the values, the dataset should look like this:
I thought the keepDataSetsIfTheyExist() option to would prevent the dataset to be overwritten, but apparently it doesn't work that way.
Something similar to what I want can be achieved in some cases with writer.compound().writeArrayBlock(), and specify by what index the array block shall be written. However, this solution doesn't seem to be compatible with my current code, where I have to use lists for handling my data.
Is there some option to achieve this that I have overlooked, or can't this be done with JHDF5?

I don't think that will work. It is not quite clear to me, but I believe the getInferredType() you are using is creating a data set with 2 name -> value entries. So it is effectively creating an object inside the hdf5. The best solution I could come up with was to read the previous values add them to the valueList before outputting:
ArrayList<String> valueList = new ArrayList<>();
valueList.add("Value1");
valueList.add("Value2");
try (IHDF5Reader reader = HDF5Factory.configure("My_Log.h5").reader()) {
String[] previous = reader.string().readArray("log1");
for (int i = 0; i < previous.length; i++) {
valueList.add(i, previous[i]);
}
} catch (HDF5FileNotFoundException ex) {
// Nothing to do here.
}
MDArray<String> values = new MDArray<>(String.class, new long[]{valueList.size()});
for (int i = 0; i < valueList.size(); i++) {
values.set(valueList.get(i), i);
}
try (IHDF5Writer writer = HDF5Factory.configure("My_Log.h5").writer()) {
writer.string().writeMDArray("log1", values);
}
If you call this code a second time with "Value3" and "Value4" instead, you will get 4 values. This sort of solution might become unpleasant if you start to have hierarchies of datasets however.

To solve your issue, you need to define the dataset log1 as extendible so that it can store an unknown number of log entries (that are generated over time) and write these using a point or hyperslab selection (otherwise, the dataset will be overwritten).
If you are not bound to a specific technology to handle HDF5 files, you may wish to give a look at HDFql which is an high-level language to manage HDF5 files easily. A possible solution for your use-case using HDFql (in Java) is:
public class Example
{
public Class Log
{
String name1;
String name2;
}
public boolean doSomething(Log log)
{
log.name1 = "Value1";
log.name2 = "Value2";
return true;
}
public static void main(String args[])
{
// declare variables
Log log = new Log();
int variableNumber;
// create an HDF5 file named 'My_Log.h5' and use (i.e. open) it
HDFql.execute("CREATE AND USE FILE My_Log.h5");
// create an extendible HDF5 dataset named 'log1' of data type compound
HDFql.execute("CREATE DATASET log1 AS COMPOUND(name1 AS VARCHAR, name2 AS VARCHAR)(0 TO UNLIMITED)");
// register variable 'log' for subsequent usage (by HDFql)
variableNumber = HDFql.variableRegister(log);
// call function 'doSomething' that does something and populates variable 'log' with an entry
while(doSomething(log))
{
// alter (i.e. extend) dataset 'log1' to +1 (i.e. add a new row)
HDFql.execute("ALTER DIMENSION log1 TO +1");
// insert (i.e. write) data stored in variable 'log' into dataset 'log1' using a point selection
HDFql.execute("INSERT INTO log1(-1) VALUES FROM MEMORY " + variableNumber);
}
}
}

Hibernate memory management

I have an application that uses hibernate. At one part I am trying to retrieve documents. Each document has an account number. The model looks something like this:
private Long _id;
private String _acct;
private String _message;
private String _document;
private String _doctype;
private Date _review_date;
I then retrieve the documents with a document service. A portion of the code is here:
public List<Doc_table> getDocuments(int hours_, int dummyFlag_,List<String> accts) {
List<Doc_table> documents = new ArrayList<Doc_table>();
Session session = null;
Criteria criteria = null;
try {
// Lets create a previous Date by subtracting the number of
// subtractHours_ passed.
session = HibernateUtil.getSession();
session.beginTransaction();
if (accts == null) {
Calendar cutoffTime = Calendar.getInstance();
cutoffTime.add(Calendar.HOUR_OF_DAY, hours_);
criteria = session.createCriteria(Doc_table.class).add(
Restrictions.gt("dbcreate_date", cutoffTime.getTime()))
.add(Restrictions.eq("dummyflag", dummyFlag_));
} else
{ criteria = session.createCriteria(Doc_table.class).add(Restrictions.in("acct", accts));
}
documents = criteria.list();
for (int x = 0; x < documents.size(); x++) {
Doc_table document = documents.get(x);
......... more stuff here
}
This works great if I'm retrieving a small number of documents. But when the document size is large I get a heap space error, probably because the documents take up a lot of space and when you retrieve several thousand of them, bad things happen.
All I really want to do is retrieve each document that fits my criteria, grab the account number and return a list of account numbers (a far smaller object than a list of objects). If this were jdbc, I would know exactly what to do.
But in this case I'm stumped. I guess I'm looking for a way where I can bring just get the account numbers of the Doc_table object back.
Or alternatively, some way where I can retrieve documents one at a time from the database using hibernate that fit my criteria (instead of bringing back the whole List of objects which uses too much memory).

There are several ways to deal with the problem:
loading the docs in batches of an smaller size
(The way you noticed) not to query for the Document, but only for the account numbers:
List accts = session.createQuery("SELECT d._acct FROM Doc d WHERE ...");
or
List<String> accts = session.createCriteria(Doc.class).
setProjection(Projections.property("_acct")).
list();
When there is a special field in you Document class that contains the huge amount Document byte data, then you could map this special field as a Lazy loaded field.
Create a second entity class (read only) that contains only the fields that you need and map it to the same table

Instead of fetching all documents i.e, all records at once, try to limit the rows being fetched. Also, deploy a strategy where in you can store documents temporarily as flat files and fetch them later or delete after usage. Though its a bit long process,its efficient way of handling and delivering documents from database.

Develop Reference

Java is a programming language and computing platform first released by Sun Microsystems in 1995.

Stream data to velocity template on the fly - java

Related

HBase - delete columns of rows with range of timestamp without scanning

How to process data in chunks in java using Multi Threading?

How do I take a "slice" of a List that only has an iterator?

JHDF5 - How to avoid dataset being overwritten

Hibernate memory management

Categories

Resources