How to do multiple add operations in Apache Jena TDB - Java

I have to serialize some specific properties (about ten film properties) for a set of 1,500 entities from DBpedia. For each entity I run a SPARQL query to retrieve them, and then, for each ResultSet, I store the data in the TDB dataset using the default Apache Jena TDB API. I create a single statement for each property and add it using this code:
public void addSolution(QuerySolution currSolution, String subjectURI) {
    if (isWriteMode) {
        Resource currResource = datasetModel.createResource(subjectURI);
        Property prop = datasetModel.createProperty(currSolution.getResource("?prop").toString());
        Statement stat = datasetModel.createStatement(currResource, prop, currSolution.get("?value").toString());
        datasetModel.add(stat);
    }
}
How can I execute multiple add operations on a single dataset? What strategy should I use?
EDIT:
I'm able to execute all the code without errors, but no files were created by the TDBFactory. Why does this happen?
I think that I need Joshua Taylor's help

It sounds like the query is running over the remote DBpedia endpoint. Assuming that's correct, you can do a couple of things.
Firstly wrap the update in a transaction:
dataset.begin(ReadWrite.WRITE);
try {
    for (QuerySolution currSolution : results) {
        addSolution(...);
    }
    dataset.commit();
} finally {
    dataset.end();
}
Secondly, you might be able to save yourself work by using CONSTRUCT to get a model back, rather than having to loop through the results. I'm not clear what's going on with subjectURI, however; it might be as simple as:
CONSTRUCT { <subjectURI> ?prop ?value }
WHERE
... existing query body ...
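For example, a rough sketch of running such a CONSTRUCT against the remote DBpedia endpoint and adding the returned model to TDB in one transaction (the query text and endpoint URL are illustrative placeholders, not taken from the question):

String constructQuery =
    "CONSTRUCT { <" + subjectURI + "> ?prop ?value } " +
    "WHERE { <" + subjectURI + "> ?prop ?value }";

Query query = QueryFactory.create(constructQuery);
QueryExecution qe = QueryExecutionFactory.sparqlService("http://dbpedia.org/sparql", query);
try {
    Model result = qe.execConstruct();          // one model instead of a ResultSet loop
    dataset.begin(ReadWrite.WRITE);
    try {
        dataset.getDefaultModel().add(result);  // single bulk add inside the transaction
        dataset.commit();
    } finally {
        dataset.end();
    }
} finally {
    qe.close();
}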

I've solved my problem and want to record the cause here for anyone who runs into the same issue.
For each transaction, you need to re-obtain the dataset model rather than reusing the same one across all transactions.
In other words, for each transaction that you start, obtain the dataset model just after the call to begin().
I hope that will be helpful.
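For reference, a minimal sketch of that pattern, assuming a TDB-backed dataset (the directory name is illustrative):

Dataset dataset = TDBFactory.createDataset("tdb");

dataset.begin(ReadWrite.WRITE);
try {
    // Obtain the model inside the transaction, not before begin().
    Model model = dataset.getDefaultModel();
    // ... add statements to model here ...
    dataset.commit();
} finally {
    dataset.end();
}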

Related

Spark - Collect partitions using foreachpartition

We are using Spark for file processing. The files are pretty big, around 30 GB each with about 40-50 million lines, and they are formatted. We load them into a data frame. The initial requirement was to identify records matching certain criteria and load them into MySQL. We were able to do that.
The requirement changed recently: records not meeting the criteria are now to be stored in an alternate DB. This is causing issues because the size of the collection is too big. We are trying to collect each partition independently and merge them into a list, as suggested here:
https://umbertogriffo.gitbooks.io/apache-spark-best-practices-and-tuning/content/dont_collect_large_rdds.html
We are not familiar with Scala, so we are having trouble converting this to Java. How can we iterate over the partitions one by one and collect the results?
Thanks
Use df.foreachPartition to process each partition independently without returning anything to the driver. You can save the matching results into the DB at the executor level. If you want to collect the results on the driver, use mapPartitions, which is not recommended for your case.
Please refer to the link below:
Spark - Java - foreachPartition
dataset.foreachPartition(new ForeachPartitionFunction<Row>() {
    @Override
    public void call(Iterator<Row> r) throws Exception {
        while (r.hasNext()) {
            Row row = r.next();
            System.out.println(row.getString(1));
        }
        // do your business logic and load into MySQL here
    }
});
For mapPartitions:
// You can use Row directly, but for clarity I am defining this.
public class ResultEntry implements Serializable {
    // define your df properties ...
}

Dataset<ResultEntry> mappedData = data.mapPartitions(new MapPartitionsFunction<Row, ResultEntry>() {
    @Override
    public Iterator<ResultEntry> call(Iterator<Row> it) {
        List<ResultEntry> filteredResult = new ArrayList<ResultEntry>();
        while (it.hasNext()) {
            Row row = it.next();
            if (someCondition) {
                filteredResult.add(convertToResultEntry(row));
            }
        }
        return filteredResult.iterator();
    }
}, Encoders.javaSerialization(ResultEntry.class));
Hope this helps.
Ravi

Getting 100 percent CPU when trying to download a CSV in Spring

I am getting a CPU performance issue on the server when trying to download a CSV in my project: CPU goes to 100%, although SQL returns the response within 1 minute. We are writing around 600K records to the CSV. For one user it works fine, but for concurrent users we are getting this issue.
Environment
Spring 4.2.5
Tomcat 7/8 (RAM 2GB Allocated)
MySQL 5.0.5
Java 1.7
Here is the Spring controller code:
@RequestMapping(value="csvData")
public void getCSVData(HttpServletRequest request,
        HttpServletResponse response,
        @RequestParam(value="param1", required=false) String param1,
        @RequestParam(value="param2", required=false) String param2,
        @RequestParam(value="param3", required=false) String param3) throws IOException {
    List<Log> logs = service.getCSVData(param1, param2, param3);
    response.setHeader("Content-type", "application/csv");
    response.setHeader("Content-disposition", "inline; filename=logData.csv");
    PrintWriter out = response.getWriter();
    out.println("Field1,Field2,Field3,.......,Field16");
    for (Log row : logs) {
        out.println(row.getField1() + "," + row.getField2() + "," + row.getField3() + "......" + row.getField16());
    }
    out.flush();
    out.close();
}
Persistence code (I am using Spring JdbcTemplate):
@Override
public List<Log> getCSVLog(String param1, String param2, String param3) {
    String sql = SqlConstants.CSV_ACTIVITY.toString();
    List<Log> csvLog = jdbcTemplate.query(sql, new Object[]{param1, param2, param3},
        new RowMapper<Log>() {
            @Override
            public Log mapRow(ResultSet rs, int rowNum) throws SQLException {
                Log log = new Log();
                log.setField1(rs.getInt("field1"));
                log.setField2(rs.getString("field2"));
                log.setField3(rs.getString("field3"));
                .
                .
                .
                log.setField16(rs.getString("field16"));
                return log;
            }
        });
    return csvLog;
}
I think you need to be specific about what you mean by "100% CPU usage": whether it's the Java process or the MySQL server. As you have 600K records, trying to load everything into memory could easily end up in an OutOfMemoryError. Given that this works for one user, you've got enough heap space to process this number of records for just one user, and the symptoms surface when multiple users try to use the same service.
The first issue I can see in your posted code is that you load everything into one big list, and the size of that list varies based on the content of the Log class. Using a list like this also means that you need enough memory to process the JDBC result set and generate a new list of Log instances. This can be a major problem with a growing number of users. Such short-lived objects cause frequent GC, and once GC cannot keep up with the amount of garbage being created, it will obviously fail. To solve this major issue, my suggestion is to use a scrollable, read-only ResultSet. For example, below is a code fragment for creating one; take a look at the documentation for how to use it.
Statement st = conn.createStatement(ResultSet.TYPE_SCROLL_SENSITIVE, ResultSet.CONCUR_READ_ONLY);
The above option is suitable if you're using pure JDBC or Spring's JdbcTemplate. If Hibernate is already used in your project, you can still achieve the same with the code fragment below. Again, please check the documentation for more information if you have a different JPA provider.
StatelessSession session = sessionFactory.openStatelessSession();
Query query = session.createSQLQuery(queryStr).setCacheable(false).setFetchSize(Integer.MIN_VALUE).setReadOnly(true);
query.setParameter(query_param_key, query_paramter_value);
ScrollableResults resultSet = query.scroll(ScrollMode.FORWARD_ONLY);
This way you're not loading all the records into the Java process in one go; instead they're loaded on demand, and you have a small memory footprint at any given time. Note that the JDBC connection will stay open until you're done with processing the entire record set. This also means that your DB connection pool can be exhausted if many users are going to download CSV files from this endpoint. You need to take measures to overcome this problem (e.g. use an API manager to rate-limit calls to this endpoint, read from a read replica, or whatever option is viable).
My other suggestion is to stream the data, which you have partially done already, so that records fetched from the DB are processed and sent to the client before the next set of records is processed. I would also suggest using a CSV library such as Super CSV, as these libraries are designed to handle a good load of data.
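For illustration, a hedged sketch of the streaming idea using JdbcTemplate's RowCallbackHandler instead of a pre-built list (a CSV library such as Super CSV could replace the manual println; the SQL, parameters, and column names are placeholders):

// Note: MySQL Connector/J buffers the whole result set unless the statement's
// fetch size is Integer.MIN_VALUE; depending on your Spring version this may need
// to be set on the PreparedStatement rather than via jdbcTemplate.setFetchSize().
jdbcTemplate.setFetchSize(Integer.MIN_VALUE);

response.setHeader("Content-type", "application/csv");
response.setHeader("Content-disposition", "inline; filename=logData.csv");
final PrintWriter out = response.getWriter();
out.println("Field1,Field2,...,Field16");

// Each row is written and discarded immediately, so memory use stays flat.
jdbcTemplate.query(sql, new Object[]{param1, param2, param3}, new RowCallbackHandler() {
    @Override
    public void processRow(ResultSet rs) throws SQLException {
        out.println(rs.getInt("field1") + "," + rs.getString("field2") /* + ... */);
    }
});
out.flush();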
Please note that this answer may not exactly answer your question, as you haven't provided all the necessary parts of your source (such as how you retrieve data from the DB), but it should point you in the right direction.
Your problem is loading all data from the database onto the application server at once. Try running the query with LIMIT and OFFSET parameters (with a mandatory ORDER BY), pushing the loaded records to the client, and then loading the next part of the data with a different offset. This helps decrease the memory footprint and doesn't require keeping a database connection open the whole time. Of course, the database will be loaded a bit more, but the overall situation may be better. Try different limit values, for example 5K-50K, and monitor CPU usage on both the app server and the database.
If you can afford to keep many open connections to the database, @Bunti's answer is very good.
http://dev.mysql.com/doc/refman/5.7/en/select.html
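A rough sketch of that paging loop with JdbcTemplate (the table name, columns, page size, RowMapper, and CSV-writing helper are hypothetical):

final int pageSize = 10000;
int offset = 0;
List<Log> page;
do {
    // ORDER BY a unique column is mandatory, otherwise pages may overlap or skip rows.
    page = jdbcTemplate.query(
            "SELECT field1, field2, field16 FROM log_table ORDER BY id LIMIT ? OFFSET ?",
            new Object[]{pageSize, offset},
            logRowMapper);                 // hypothetical RowMapper<Log>
    writeCsvLines(page, out);              // hypothetical helper: one CSV line per Log
    offset += pageSize;
} while (page.size() == pageSize);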

Jena TDB hangs/freezes on named model access

I have a problem with Apache Jena TDB. Basically I create a new Dataset, load data from an RDF/XML file into a named model with the name "http://example.com/model/filename" where filename is the name of the XML/RDF file. After loading the data, all statements from the named model are inserted into the default model. The named model is kept in the dataset for backup reasons.
When I now try to query the named models in the Dataset, TDB freezes and the application seems to run in an infinite loop, so it is not terminated nor does it throw an exception.
What is causing that freeze and how can I prevent it?
Example code:
Dataset ds = TDBFactory.createDataset("tdb");
Model mod = ds.getDefaultModel();
File f = new File("example.rdf");
FileInputStream fis = new FileInputStream(f);
ds.begin(ReadWrite.WRITE);
// Get a new named model to load the data into
Model nm = ds.getNamedModel("http://example.com/model/example.rdf");
nm.read(fis, null);
// Do some queries on the Model using the utility methods of Model, no SPARQL used
// Add all statements from the named model to the default model
mod.add(nm);
ds.commit();
ds.end();
// So far everything works as expected, but the following line causes the freeze
Iterator<String> it = ds.listNames();
Any method call that accesses the existing named models causes the same freeze, for example getNamedModel("http://example.com/model/example.rdf"). Adding new named models by calling getNamedModel("http://example.com/model/example123.rdf") works fine, so only access to existing models is broken.
Used environment: Linux 64bit, Oracle Java 1.7.0_09, Jena 2.7.4 (incl. TDB 0.9.4)
Thanks in advance for any help!
Edit: Fixed mistake in code fragment
Edit2: Solution (my last comment on AndyS' answer)
OK, I went through the whole program and added all the missing transactions. Now it is working as expected. I suspect Jena threw an exception during the shutdown sequence of my program, but that exception was not reported properly, and the "freeze" was caused by other threads not terminating correctly. Thanks for pointing out the faulty transaction usage.
Could you turn this into a test case and send it to the jena users mailing list please?
You should get the default model inside the transaction - you got it outside.
Also, if you have used a dataset transactionally, you can't use it untransactionally as you do at ds.listNames. It shouldn't freeze - you should get some kind of warning.
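A sketch of the transactional usage being described, with the load in a write transaction and the listNames call in a separate read transaction:

Dataset ds = TDBFactory.createDataset("tdb");

ds.begin(ReadWrite.WRITE);
try {
    // Both models are obtained inside the transaction.
    Model mod = ds.getDefaultModel();
    Model nm = ds.getNamedModel("http://example.com/model/example.rdf");
    nm.read("file:example.rdf");   // RDF/XML is the default serialization
    mod.add(nm);
    ds.commit();
} finally {
    ds.end();
}

// Read access also goes through a transaction once the dataset is used transactionally.
ds.begin(ReadWrite.READ);
try {
    Iterator<String> it = ds.listNames();
    while (it.hasNext()) {
        System.out.println(it.next());
    }
} finally {
    ds.end();
}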

How to copy a schema in MySQL using Java

In my application I need to copy a schema, with its tables and stored procedures, from a base schema to a new schema.
I am looking for a way to implement this.
I looked into executing mysqldump from the command line, but it is not a good solution because I have a client-side application and this would require installing the server tools on the client side.
The other option is my own implementation using SHOW queries.
The problem here is that I would need to implement it all from scratch, and the most problematic part is that I would need to arrange the order of the tables according to their foreign keys (because if there is a foreign key in a table, the table it points to needs to be created first).
I also thought of creating a stored procedure to do this, but stored procedures in MySQL can't access the disk.
Perhaps someone has an idea of how this can be implemented in another way?
You can try using Apache DdlUtils. There is a way to export the DDLs from a database to an XML file and re-import them back.
The API usage page has examples of how to export a schema to an XML file, read it back from the XML file, and apply it to a new database. I have reproduced those functions below, along with a small snippet showing how to use them to accomplish what you are asking for. You can use this as a starting point and optimize it further.
DataSource sourceDb;
DataSource targetDb;

writeDatabaseToXML(readDatabase(sourceDb), "database-dump.xml");
changeDatabase(targetDb, readDatabaseFromXML("database-dump.xml"));

public Database readDatabase(DataSource dataSource)
{
    Platform platform = PlatformFactory.createNewPlatformInstance(dataSource);
    return platform.readModelFromDatabase("model");
}

public void writeDatabaseToXML(Database db, String fileName)
{
    new DatabaseIO().write(db, fileName);
}

public Database readDatabaseFromXML(String fileName)
{
    return new DatabaseIO().read(fileName);
}

public void changeDatabase(DataSource dataSource, Database targetModel)
{
    Platform platform = PlatformFactory.createNewPlatformInstance(dataSource);
    platform.createTables(targetModel, true, false);
}
You can use information_schema to fetch the foreign key information and build a dependency tree. Here is an example.
But I think you are trying to solve something that has been solved many times before. I'm not familiar with Java, but there are ORM tools (for Python at least) that can inspect your current database and create a complementing model in Java (or Python). Then you can deploy that model into another database.
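If you go the information_schema route, here is a hedged sketch of the foreign key lookup with plain JDBC (the connection setup and schema name are placeholders):

// Lists, for each table in the source schema, the tables it references via foreign keys.
// A topological sort over these edges gives a safe creation order.
String fkSql =
    "SELECT TABLE_NAME, REFERENCED_TABLE_NAME " +
    "FROM information_schema.KEY_COLUMN_USAGE " +
    "WHERE TABLE_SCHEMA = ? AND REFERENCED_TABLE_NAME IS NOT NULL";

PreparedStatement ps = connection.prepareStatement(fkSql);
ps.setString(1, "source_schema");
ResultSet rs = ps.executeQuery();
while (rs.next()) {
    String table = rs.getString("TABLE_NAME");
    String referenced = rs.getString("REFERENCED_TABLE_NAME");
    // record the edge table -> referenced in your dependency graph
}
rs.close();
ps.close();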

Inserting or updating multiple records in a database in a multi-threaded way in Java

I am updating multiple records in a database. Whenever the UI sends the list of records to be updated, I just have to update those records in the database. I am using JdbcTemplate for that.
Earlier case
Earlier, whenever I got records from the UI, I just did:
jdbcTemplate.batchUpdate(query, List<Object[]> params)
Whenever there was an exception, I used to roll back the whole transaction.
(Updated: Is batchUpdate multi-threaded, or faster than individual updates in some way?)
Later case
But later the requirement changed regarding exception handling: whenever there is some exception, I should know which records failed to update, and I have to send those records back to the UI along with the reason why they failed.
So I had to do something similar to this:
for (Record record : recordList) {
    try {
        jdbcTemplate.update(sql, param);   // param is the Object[] built from record
    } catch (Exception ex) {
        record.setReason("Exception : " + ex.getMessage());
        continue;
    }
}
So am I doing this the right way, by using the loop?
If yes, can someone suggest how to make it multi-threaded?
Or is there anything wrong with this approach?
To be honest, I was hesitant to use a try-catch block inside the loop :(.
Please correct me; I really need to learn a better way, because I myself feel there must be one. Thanks.
Turn each update operation into a Callable, collect them into a Collection<Callable<...>>,
and send them to a java.util.concurrent.ThreadPoolExecutor. The pool is multi-threaded.
Make the Callable:
class UpdateTask implements Callable<Exception> {
    // constructor with jdbcTemplate, sql, param goes here.
    @Override
    public Exception call() throws Exception {
        try {
            jdbcTemplate.update(sql, param);
        } catch (Exception ex) {
            return ex;
        }
        return null;
    }
}
Invoke the tasks:
<T> List<Future<T>> java.util.concurrent.ExecutorService.invokeAll(Collection<? extends Callable<T>> tasks) throws InterruptedException
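A sketch of wiring this up with an ExecutorService and collecting the per-record outcome (the pool size, the UpdateTask constructor, and the toParams helper are assumptions for illustration):

public void updateAll(List<Record> recordList) throws InterruptedException, ExecutionException {
    ExecutorService pool = Executors.newFixedThreadPool(10);
    List<UpdateTask> tasks = new ArrayList<UpdateTask>();
    for (Record record : recordList) {
        tasks.add(new UpdateTask(jdbcTemplate, sql, toParams(record)));   // assumed constructor and helper
    }

    // invokeAll blocks until every task has finished; result order matches task order.
    List<Future<Exception>> results = pool.invokeAll(tasks);
    for (int i = 0; i < results.size(); i++) {
        Exception ex = results.get(i).get();   // null means the update succeeded
        if (ex != null) {
            recordList.get(i).setReason("Exception : " + ex.getMessage());
        }
    }
    pool.shutdown();
}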
Your case looks like you should use validation in Java, filter out only the valid data, and send it to the database for updating.
In the BO layer:
-> filter out the valid records
-> invalid records should be sent back with some validation text
In the DAO layer:
-> batch update your record list (see the sketch after the notes below)
This will give you the best performance.
Never use a database insert exception as a validation mechanism:
Exceptions are costly, as the stack trace has to be created.
Connecting to the database is another costly process, and it takes time to get a connection.
A Java if-else will run much faster for the same validation.
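A rough sketch of that validate-then-batch approach (the isValid check, the Record fields, and the SQL are placeholders):

List<Object[]> validParams = new ArrayList<Object[]>();
for (Record record : recordList) {
    if (isValid(record)) {                      // hypothetical Java-side validation
        validParams.add(new Object[]{record.getField1(), record.getField2(), record.getId()});
    } else {
        record.setReason("Validation failed");  // sent back to the UI
    }
}

// One batched round trip for everything that passed validation.
jdbcTemplate.batchUpdate(updateSql, validParams);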
