Design a data-source-independent application using batch data - Java

We have a legacy application that reads data from MongoDB for each user (the query result ranges from small to large depending on the user request). Our app creates a file for each user and drops it on an FTP server or S3. We read the data as a MongoDB cursor and write each batch to the file as soon as the batch arrives, so file-writing performance is decent. This application works great, but it is tightly bound to MongoDB and its cursor.
Now we have to redesign the application to support different data sources, e.g. MongoDB, Postgres, Kinesis, S3, etc. We have considered the following ideas so far:
Build data APIs for each source and expose a paginated REST response. This is feasible, but pagination is likely to be slower for large query results than the current cursor-based streaming.
Build a data abstraction layer by feeding batch data into Kafka and reading the batch data stream in our file generator. But most of the time users ask for sorted data, so we would need to read the messages in sequence; we would lose the throughput benefit and do a lot of extra work combining these messages before writing the file.
We are looking for a solution to replace the current MongoDB cursor and make our file generator independent of the data source.

So it sounds like you essentially want an API that preserves the efficiency of streaming as much as possible, as you do now by writing the file while reading the user data.
In that case, you might want to define a push-parser API for your ReadSources, which will stream data to your WriteTargets, which in turn write the data to anything you have an implementation for. Sorting is handled on the ReadSource side, since some sources can be read in an ordered manner (such as databases). For sources that can't, you might add an intermediate step to sort the data (such as writing to a temporary table) before streaming it to the WriteTarget.
A basic implementation might look vaguely like this:
public class UserDataRecord {
    private String data1;
    private String data2;

    public String getRecordAsString() {
        return data1 + "," + data2;
    }
}
public interface WriteTarget<Record> {
    /** Write a record to the target */
    void writeRecord(Record record);

    /** Finish writing to the target and save everything */
    void commit();

    /** Undo whatever was written */
    void rollback();
}
public abstract class ReadSource<Record> {
    protected final WriteTarget<Record> writeTarget;

    public ReadSource(WriteTarget<Record> writeTarget) {
        this.writeTarget = writeTarget;
    }

    public abstract void read();
}
import java.sql.Connection;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;

public class RelationalDatabaseReadSource extends ReadSource<UserDataRecord> {
    private final Connection dbConnection;

    public RelationalDatabaseReadSource(WriteTarget<UserDataRecord> writeTarget, Connection dbConnection) {
        super(writeTarget);
        this.dbConnection = dbConnection;
    }

    @Override
    public void read() {
        // read user data from DB and encapsulate it in a record
        try (Statement statement = dbConnection.createStatement();
             ResultSet resultSet = statement.executeQuery("SELECT * FROM TABLE ORDER BY COLUMNS")) {
            while (resultSet.next()) {
                UserDataRecord record = new UserDataRecord();
                // populate the record from the result set here, then
                // stream it to the write target
                writeTarget.writeRecord(record);
            }
        } catch (SQLException e) {
            e.printStackTrace();
        }
    }
}
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.PrintWriter;
import java.nio.charset.StandardCharsets;

public class FileWriteTarget implements WriteTarget<UserDataRecord> {
    private final File fileToWrite;
    private final PrintWriter writer;

    public FileWriteTarget(File fileToWrite) throws IOException {
        this.fileToWrite = fileToWrite;
        // write through an explicit UTF-8 writer instead of calling getBytes() on the
        // record string, which would print the byte array's toString() rather than the text
        this.writer = new PrintWriter(
                new OutputStreamWriter(new FileOutputStream(fileToWrite), StandardCharsets.UTF_8));
    }

    @Override
    public void writeRecord(UserDataRecord record) {
        writer.println(record.getRecordAsString());
    }

    @Override
    public void commit() {
        // write trailing records
        writer.close();
    }

    @Override
    public void rollback() {
        try {
            writer.close();
        } catch (Exception ignored) {
        }
        fileToWrite.delete();
    }
}
This is just the general idea and needs serious improvement.
Anyone please feel free to update this API.
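To see the flow end to end without any external systems, here is a self-contained sketch that wires an in-memory source to an in-memory target. InMemoryReadSource and ListWriteTarget are illustrative stand-ins, not part of any library, and the interfaces are minimal copies of the API above so the file compiles on its own:

```java
import java.util.ArrayList;
import java.util.List;

public class StreamingSketch {

    // Minimal copies of the API above so this file compiles on its own.
    interface WriteTarget<R> {
        void writeRecord(R record);
        void commit();
        void rollback();
    }

    abstract static class ReadSource<R> {
        protected final WriteTarget<R> writeTarget;
        ReadSource(WriteTarget<R> writeTarget) { this.writeTarget = writeTarget; }
        abstract void read();
    }

    // Illustrative source: streams records from a list the way a cursor would.
    static class InMemoryReadSource extends ReadSource<String> {
        private final List<String> rows;
        InMemoryReadSource(WriteTarget<String> target, List<String> rows) {
            super(target);
            this.rows = rows;
        }
        @Override
        void read() {
            try {
                for (String row : rows) {
                    writeTarget.writeRecord(row); // push each record as soon as it is read
                }
                writeTarget.commit();
            } catch (RuntimeException e) {
                writeTarget.rollback();
                throw e;
            }
        }
    }

    // Illustrative target: collects records in memory instead of a file.
    static class ListWriteTarget implements WriteTarget<String> {
        final List<String> written = new ArrayList<>();
        @Override public void writeRecord(String record) { written.add(record); }
        @Override public void commit() { /* nothing to flush here */ }
        @Override public void rollback() { written.clear(); }
    }

    public static List<String> run(List<String> rows) {
        ListWriteTarget target = new ListWriteTarget();
        new InMemoryReadSource(target, rows).read();
        return target.written;
    }

    public static void main(String[] args) {
        System.out.println(run(List.of("a,1", "b,2"))); // prints [a,1, b,2]
    }
}
```

The important property is that the source pushes records one at a time and the target never sees the whole result set, so memory stays flat regardless of query size.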


What is the most efficient way to persist thousands of entities?

I have fairly large CSV files which I need to parse and then persist into PostgreSQL. For example, one file contains 2_070_000 records which I was able to parse and persist in ~8 minutes (single thread). Is it possible to persist them using multiple threads?
public void importCsv(MultipartFile csvFile, Class<T> targetClass) {
    final var headerMapping = getHeaderMapping(targetClass);
    File tempFile = null;
    try {
        final var randomUuid = UUID.randomUUID().toString();
        tempFile = File.createTempFile("data-" + randomUuid, "csv");
        csvFile.transferTo(tempFile);
        final var csvFileName = csvFile.getOriginalFilename();
        final var csvReader = new BufferedReader(new FileReader(tempFile, StandardCharsets.UTF_8));

        Stopwatch stopWatch = Stopwatch.createStarted();
        log.info("Starting to import {}", csvFileName);

        final var csvRecords = CSVFormat.DEFAULT
                .withDelimiter(';')
                .withHeader(headerMapping.keySet().toArray(String[]::new))
                .withSkipHeaderRecord(true)
                .parse(csvReader);
        final var models = StreamSupport.stream(csvRecords.spliterator(), true)
                .map(record -> parseRecord(record, headerMapping, targetClass))
                .collect(Collectors.toUnmodifiableList());
        // How to save such a large list?
        log.info("Finished import of {} in {}", csvFileName, stopWatch);
    } catch (IOException ex) {
        ex.printStackTrace();
    } finally {
        tempFile.delete();
    }
}
models contains a lot of records. The parsing into records is done with a parallel stream, so it's quite fast. I'm afraid to call SimpleJpaRepository.saveAll, because I'm not sure what it will do under the hood.
The question is: What is the most efficient way to persist such a large list of entities?
P.S.: Any other improvements are greatly appreciated.
You have to use batch inserts.
Create an interface for a custom repository SomeRepositoryCustom
public interface SomeRepositoryCustom {
void batchSave(List<Record> records);
}
Create an implementation of SomeRepositoryCustom
@Repository
class SomesRepositoryCustomImpl implements SomeRepositoryCustom {
    private JdbcTemplate template;

    @Autowired
    public SomesRepositoryCustomImpl(JdbcTemplate template) {
        this.template = template;
    }

    @Override
    public void batchSave(List<Record> records) {
        final String sql = "INSERT INTO RECORDS(column_a, column_b) VALUES (?, ?)";
        template.execute(sql, (PreparedStatementCallback<Void>) ps -> {
            for (Record record : records) {
                ps.setString(1, record.getA());
                ps.setString(2, record.getB());
                ps.addBatch();
            }
            ps.executeBatch();
            return null;
        });
    }
}
Extend your JpaRepository with SomeRepositoryCustom:
@Repository
public interface SomeRepository extends JpaRepository<Record, Long>, SomeRepositoryCustom { // Record's ID type assumed to be Long
}
Then, to save:
someRepository.batchSave(records);
Notes
Keep in mind that even if you use batch inserts, the database driver may not actually apply them. For example, for MySQL it is necessary to add the parameter rewriteBatchedStatements=true to the database URL.
So it is better to enable driver-level SQL logging (not Hibernate's) to verify everything. It can also be useful to step through the driver code.
You will need to decide whether to split the records into chunks inside the loop
for (Record record : records) {
}
A driver may do this for you, in which case you won't need to, but it is better to verify that too.
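If you do end up splitting manually, the chunking itself is plain Java. A small self-contained sketch (the chunk size of 1000 is an arbitrary assumption; tune it against your driver and table):

```java
import java.util.ArrayList;
import java.util.List;

public class BatchChunker {

    // Split a list into consecutive chunks of at most chunkSize elements.
    public static <T> List<List<T>> chunks(List<T> items, int chunkSize) {
        List<List<T>> result = new ArrayList<>();
        for (int i = 0; i < items.size(); i += chunkSize) {
            result.add(items.subList(i, Math.min(i + chunkSize, items.size())));
        }
        return result;
    }

    public static void main(String[] args) {
        List<Integer> records = new ArrayList<>();
        for (int i = 0; i < 2500; i++) records.add(i);
        // Each chunk would become one JDBC batch: ps.addBatch() per record,
        // then ps.executeBatch() once per chunk.
        for (List<Integer> chunk : chunks(records, 1000)) {
            System.out.println("batch of " + chunk.size());
        }
    }
}
```

Executing one batch per chunk bounds both the driver's buffer size and the amount of work redone if a chunk fails.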
P. S. Don't use var everywhere.

How to tune HttpClient performance when crawling a large number of small files?

I just want to crawl some Hacker News stories, and here is my code:
import org.apache.http.client.fluent.Request;

import java.io.FileNotFoundException;
import java.io.IOException;
import java.io.PrintWriter;
import java.util.logging.Logger;
import java.util.stream.IntStream;

public class HackCrawler {
    private static String getUrlResponse(String url) throws IOException {
        return Request.Get(url).execute().returnContent().asString();
    }

    private static String crawlItem(int id) {
        try {
            String json = getUrlResponse(String.format("https://hacker-news.firebaseio.com/v0/item/%d.json", id));
            if (json.contains("\"type\":\"story\"")) {
                return json;
            }
        } catch (IOException e) {
            System.out.println("crawl " + id + " failed");
        }
        return "";
    }

    public static void main(String[] args) throws FileNotFoundException {
        Logger logger = Logger.getLogger("main");
        PrintWriter printWriter = new PrintWriter("hack.json");
        for (int i = 0; i < 10000; i++) {
            logger.info("batch " + i);
            IntStream.range(12530671 - (i + 1) * 100, 12530671 - i * 100)
                    .parallel()
                    .mapToObj(HackCrawler::crawlItem)
                    .filter(x -> !x.equals(""))
                    .forEach(printWriter::println);
        }
    }
}
Right now it takes about 3 seconds to crawl 100 items (1 batch).
I found that using multiple threads via parallel gives a speed-up (about 5 times), but I have no idea how to optimise it further.
Could anyone give some suggestions on that?
To achieve what Fayaz means, I would use the Jetty HttpClient's asynchronous features (https://webtide.com/the-new-jetty-9-http-client/).
httpClient.newRequest("http://domain.com/path")
        .send(new Response.CompleteListener()
        {
            @Override
            public void onComplete(Result result)
            {
                // Your logic here
            }
        });
This client internally uses Java NIO to listen for incoming responses with a single thread per connection. It then dispatches content to worker threads which are not involved in any blocking I/O operation.
You can try to play with the maximum number of connections per destination (a destination is basically a host):
http://download.eclipse.org/jetty/9.3.11.v20160721/apidocs/org/eclipse/jetty/client/HttpClient.html#setMaxConnectionsPerDestination-int-
Since you are heavily loading a single server, this should be quite high.
The following steps should get you started.
Use a single thread to get responses from the site, as this is basically an I/O operation.
Put these responses into a queue (read about the various implementations of BlockingQueue).
Now you can have multiple threads pick up these responses and process them as you wish.
Basically, you will have a single producer thread that gets the responses from the site and multiple consumers that process these responses.
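A self-contained sketch of that producer/consumer layout, with the HTTP call replaced by a stub fetch method so the wiring is visible. The queue capacity, thread counts, and the poison-pill shutdown are illustrative choices, not requirements:

```java
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;

public class ProducerConsumerSketch {

    private static final String POISON = "__DONE__";

    // Stand-in for the real blocking HTTP call (getUrlResponse in the crawler).
    static String fetch(int id) {
        return "{\"id\":" + id + ",\"type\":\"story\"}";
    }

    public static List<String> crawl(int count, int consumers) {
        BlockingQueue<String> queue = new LinkedBlockingQueue<>(100);
        List<String> processed = new CopyOnWriteArrayList<>();

        // Single producer thread doing the I/O.
        Thread producer = new Thread(() -> {
            try {
                for (int i = 0; i < count; i++) {
                    queue.put(fetch(i));
                }
                for (int i = 0; i < consumers; i++) {
                    queue.put(POISON); // one poison pill per consumer to signal shutdown
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        producer.start();

        // Multiple consumers processing responses off the queue.
        ExecutorService pool = Executors.newFixedThreadPool(consumers);
        for (int i = 0; i < consumers; i++) {
            pool.submit(() -> {
                try {
                    for (String item = queue.take(); !POISON.equals(item); item = queue.take()) {
                        processed.add(item); // "processing" is just collecting here
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
        }

        try {
            producer.join();
            pool.shutdown();
            pool.awaitTermination(10, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return processed;
    }

    public static void main(String[] args) {
        System.out.println(crawl(50, 4).size()); // prints 50
    }
}
```

The bounded queue also provides backpressure: if the consumers fall behind, the producer blocks instead of piling responses up in memory.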

issue at handling multithreading in java (using ThreadLocal) [duplicate]

This question already has an answer here:
Downloading file/files in Java. Multithreading, this works?
(1 answer)
Closed 8 years ago.
I'm experiencing some issues working with multithreading in Java. I am a student and Java beginner, just developing this for entertainment purposes (sorry for the bad grammar, English is not my first language).
I'm writing a tiny downloader for personal use that accepts a maximum of 5 simultaneous downloads. Each download is handled by a different thread (not plain threads, but SwingWorker, which I also use to avoid UI freezing).
public static String PATH; // File path
public static String NAME; // File name

// Download method
public void download() {
    // download code...
}
This method works fine (it downloads a file and then saves it to the hard drive as expected). But the issue comes when I want to run two or more downloads simultaneously. Say I am downloading two files at the same time, file A and file B. When I start the download of A, the strings PATH and NAME get their values for file A; all OK. Then I start downloading B, and the previously stored values for A are replaced with the values for B. So when the download of A completes, the file gets the name that B should have when its download completes.
Summing up: I need different instances of the same variable that hold different, independent values.
I started to research about the topic and led to ThreadLocal variables. This type of variable is supposed to change in each running thread, just what I need.
I tried to implement this in my code.
public static String PATH; // File path
public static String NAME; // File name

public ThreadLocal<String> TL_PATH = new ThreadLocal<String>();
public ThreadLocal<String> TL_NAME = new ThreadLocal<String>();

public void download() {
    // Try to set ThreadLocal to the PATH and NAME variables.
    TL_NAME.set(NAME);
    TL_PATH.set(PATH);
    // download code...
}
Once I did this, everything stayed the same. What's wrong with my code? (No exceptions are thrown in any case, just the behaviour I explained before.)
You are doing it wrong.
You should implement Runnable and have it take the URL and the path to download to. Then use an ExecutorService with 5 threads in the pool to manage the individual instances of the class that implements Runnable.
Here is an example:
import javax.annotation.Nonnull;

import java.io.File;
import java.net.MalformedURLException;
import java.net.URL;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class Q24591731
{
    private static final ExecutorService EXECUTOR_SERVICE;

    static
    {
        EXECUTOR_SERVICE = Executors.newFixedThreadPool(5);
    }

    public static void main(final String[] args)
    {
        final Download d1;
        try { d1 = new Download(new URL("http://www.someurl.to.download.com"), new File("dest/file/name")); }
        catch (MalformedURLException e) { throw new RuntimeException(e); }
        // create as many downloads as you need how ever you want.
        EXECUTOR_SERVICE.submit(d1);
        // when all are submitted
        EXECUTOR_SERVICE.shutdown();
        try
        {
            EXECUTOR_SERVICE.awaitTermination(1, TimeUnit.MINUTES);
        }
        catch (InterruptedException e)
        {
            System.exit(1);
        }
    }

    public static class Download implements Runnable
    {
        private final URL url;
        private final File dest;

        public Download(@Nonnull final URL url, @Nonnull final File dest)
        {
            this.url = url;
            this.dest = dest;
        }

        @Override
        public void run()
        {
            // download the file and write it to disk
        }
    }
}
Exercise for the reader:
ExecutorCompletionService is a more appropriate solution, especially for batch processing like this, but it is a little more involved and works best with Callable instead of Runnable. I have examples in other answers here and don't feel like repeating that solution again.
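For completeness, here is a bare-bones, self-contained sketch of that pattern, with trivial Callable tasks standing in for the downloads (the id * 2 work is just a placeholder):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletionService;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorCompletionService;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class CompletionServiceSketch {

    public static List<Integer> runBatch(List<Integer> ids) {
        ExecutorService pool = Executors.newFixedThreadPool(5);
        CompletionService<Integer> ecs = new ExecutorCompletionService<>(pool);

        // One Callable per "download"; the id * 2 work is a placeholder.
        for (Integer id : ids) {
            ecs.submit(() -> id * 2);
        }

        List<Integer> results = new ArrayList<>();
        try {
            // take() yields futures in completion order, not submission order,
            // so fast tasks are consumed as soon as they finish.
            for (int i = 0; i < ids.size(); i++) {
                results.add(ecs.take().get());
            }
        } catch (InterruptedException | ExecutionException e) {
            throw new RuntimeException(e);
        } finally {
            pool.shutdown();
        }
        return results;
    }

    public static void main(String[] args) {
        System.out.println(runBatch(List.of(1, 2, 3)));
    }
}
```

The advantage over plain Runnable submission is that each task can return a result (or throw), and results are handed back as they complete rather than in submission order.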

ExtendedDataTable in RichFaces 4: DataModel handling

I have another question, somewhat related to the one I posted in January. I have a list, which is a rich:extendedDataTable component, that gets updated on the fly as the user types search criteria in a separate text box (i.e. the user types the first 4 characters and, as he keeps typing, the results list changes). This worked fine with RichFaces 3, but after upgrading to RichFaces 4 I got all sorts of compilation problems. The following classes are no longer accessible, and there seems to be no suitable replacement for them:
org.richfaces.model.DataProvider
org.richfaces.model.ExtendedTableDataModel
org.richfaces.model.selection.Selection
org.richfaces.model.selection.SimpleSelection
Here is what it was before:
This is the input text that should trigger the search logic:
<h:inputText id="firmname" value="#{ExtendedTableBean.searchValue}">
    <a4j:support ajaxSingle="true" eventsQueue="firmListUpdate"
        reRender="resultsTable"
        actionListener="#{ExtendedTableBean.searchForResults}" event="onkeyup" />
</h:inputText>
Action listener is what should update the list. Here is the extendedDataTable, right below the inputText:
<rich:extendedDataTable tableState="#{ExtendedTableBean.tableState}" var="item"
    id="resultsTable" value="#{ExtendedTableBean.dataModel}">
    ... <%-- I'm listing columns here --%>
</rich:extendedDataTable>
And here's the back-end code, where I use my data model handling:
/*
* To change this template, choose Tools | Templates
* and open the template in the editor.
*/
package com.beans;
import java.io.FileInputStream;
import java.util.ArrayList;
import java.util.Collections;
import java.util.ConcurrentModificationException;
import java.util.List;
import java.util.Properties;
import java.util.concurrent.CopyOnWriteArrayList;
import javax.faces.context.FacesContext;
import javax.faces.event.ActionEvent;
import org.richfaces.model.DataProvider;
import org.richfaces.model.ExtendedTableDataModel;
public class ExtendedTableBean {

    private String sortMode = "single";
    private ExtendedTableDataModel<ResultObject> dataModel;
    // ResultObject is a simple POJO, and getResultsPerValue is a method that
    // reads the data from the properties file, assigns it to the POJO, and
    // adds the POJO to the list
    private Object tableState;
    private List<ResultObject> results = new CopyOnWriteArrayList<ResultObject>();
    private List<ResultObject> selectedResults =
            new CopyOnWriteArrayList<ResultObject>();
    private String searchValue;

    /**
     * This is the action listener that the user triggers, by typing the search value
     */
    public void searchForResults(ActionEvent e) {
        synchronized (results) {
            results.clear();
        }
        // I don't think it's necessary to clear the results list all the time, but here
        // I also make sure that we start searching if the value is at least 4
        // characters long
        if (this.searchValue.length() > 3) {
            results.clear();
            updateTableList();
        } else {
            results.clear();
        }
        dataModel = null; // to force the dataModel to be updated.
    }

    public List<ResultObject> getResultsPerValue(String searchValue) {
        List<ResultObject> resultsList = new CopyOnWriteArrayList<ResultObject>();
        // Logic for reading data from the properties file, populating ResultObject
        // and adding it to the list
        return resultsList;
    }
    /**
     * This method updates the result list, based on a search value
     */
    public void updateTableList() {
        try {
            List<ResultObject> searchedResults = getResultsPerValue(searchValue);
            // Once the results have been retrieved from the properties, empty the
            // current list and replace it with what was found.
            synchronized (results) {
                results.clear();
                results.addAll(searchedResults);
            }
        } catch (Throwable xcpt) {
            // Exception handling
        }
    }
    /**
     * This is a recursive method, that's used to constantly keep updating the
     * table list.
     */
    public synchronized ExtendedTableDataModel<ResultObject> getDataModel() {
        try {
            if (dataModel == null) {
                dataModel = new ExtendedTableDataModel<ResultObject>(
                        new DataProvider<ResultObject>() {
                            public ResultObject getItemByKey(Object key) {
                                try {
                                    for (ResultObject c : results) {
                                        if (key.equals(getKey(c))) {
                                            return c;
                                        }
                                    }
                                } catch (Exception ex) {
                                    // Exception handling
                                }
                                return null;
                            }

                            public List<ResultObject> getItemsByRange(
                                    int firstRow, int endRow) {
                                return Collections.unmodifiableList(results.subList(firstRow, endRow));
                            }

                            public Object getKey(ResultObject item) {
                                return item.getResultName();
                            }

                            public int getRowCount() {
                                return results.size();
                            }
                        });
            }
        } catch (Exception ex) {
            // Exception handling
        }
        return dataModel;
    }

    // Getters and setters
}
Now that the classes ExtendedTableDataModel and DataProvider are no longer available, what should I be using instead? The RichFaces forum claims there's really nothing, and developers are pretty much on their own (meaning they have to do their own implementation). Does anyone have any other idea or suggestion?
Thanks again for all your help and again, sorry for a lengthy question.
You could convert your data model to extend the abstract org.ajax4jsf.model.ExtendedDataModel instead, which is actually a more robust and performant data model for use with <rich:extendedDataTable/>. A rough translation of your existing model to the new one is below (I've decided to use your existing ExtendedTableDataModel<ResultObject> as the underlying data source instead of the results list, to demonstrate the translation):
public class MyDataModel<ResultObject> extends ExtendedDataModel<ResultObject> {

    String currentKey; // current row in the model
    Map<String, ResultObject> cachedResults = new HashMap<String, ResultObject>(); // a local cache of search/pagination results
    List<String> cachedRowKeys; // a local cache of key values for cached items
    int rowCount;
    ExtendedTableDataModel<ResultObject> dataModel; // the underlying data source. can be anything

    public void setRowKey(Object key) {
        this.currentKey = (String) key; // the row key is the result name, a String
    }

    public void walk(FacesContext context, DataVisitor visitor, Range range, Object argument) throws IOException {
        int firstRow = ((SequenceRange) range).getFirstRow();
        int numberOfRows = ((SequenceRange) range).getRows();
        cachedRowKeys = new ArrayList<String>();
        for (ResultObject result : dataModel.getItemsByRange(firstRow, numberOfRows)) {
            cachedRowKeys.add(result.getResultName());
            cachedResults.put(result.getResultName(), result); // populate cache. This is strongly advised as you'll see later.
            visitor.process(context, result.getResultName(), argument);
        }
    }

    public Object getRowData() {
        if (currentKey == null) {
            return null;
        } else {
            ResultObject selectedRowObject = cachedResults.get(currentKey); // return result from internal cache without making the trip to the database or other underlying datasource
            if (selectedRowObject == null) { // if the desired row is not within the range of the cache
                selectedRowObject = dataModel.getItemByKey(currentKey);
                cachedResults.put(currentKey, selectedRowObject);
            }
            return selectedRowObject;
        }
    }

    public int getRowCount() {
        if (rowCount == 0) {
            rowCount = dataModel.getRowCount(); // cache row count
        }
        return rowCount;
    }
}
Those are the 3 most important methods in that class. There are a bunch of other methods, basically carried over from legacy versions, that you don't need to worry about. If you're saving JSF state to the client, you might be interested in org.ajax4jsf.model.SerializableDataModel for serialization purposes. See an example of that here. It's an old blog post, but the logic is still applicable.
Unrelated to this: your current implementation of getItemByKey will perform poorly in a production-grade app. Having to iterate through every element to return a result? Try a better search algorithm.
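For example, an index keyed on the result name turns the lookup into a constant-time map access instead of a linear scan. A self-contained sketch, with plain strings standing in for ResultObject (in the bean, keyOf would be ResultObject::getResultName):

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

public class KeyedLookup {

    // Build the index once after each search, then every lookup is O(1) instead of O(n).
    public static <K, V> Map<K, V> index(List<V> items, Function<V, K> keyOf) {
        Map<K, V> byKey = new HashMap<>();
        for (V item : items) {
            byKey.put(keyOf.apply(item), item);
        }
        return byKey;
    }

    public static void main(String[] args) {
        // Strings stand in for ResultObject; the key would be getResultName().
        List<String> results = List.of("Acme Corp", "Globex", "Initech");
        Map<String, String> byName = index(results, name -> name);
        System.out.println(byName.get("Globex")); // prints Globex
    }
}
```

Rebuild the index whenever the results list changes; lookups in getItemByKey then become a single map get.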

Executing JDBC MySQL query with this custom method

I've been doing my homework, and I decided to rewrite my vote4cash class, which manages the MySQL side of my vote4cash reward system, into a new class called MysqlManager. The MysqlManager class needs to allow the Commands class to connect to MySQL (done) and to execute queries (I need help with this part). I've made much more progress with the new class, but I'm stuck on one of the last and most important parts: letting the Commands class execute a query.
In my MysqlManager class I have put the code that connects to MySQL under
public synchronized static void createConnection() {
Now I just need to add the code that allows the Commands class to execute a query as well. I've researched and tried to do this for a while now, but I've had absolutely no luck.
The entire MysqlManager class:
package server.util;
/*
import java.sql.Connection;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;
*/
import java.sql.*;
import java.net.*;
import server.model.players.Client;//Will be needed eventually so that I can reward players who have voted.
/**
 * MySQL and Vote4Cash Manager
 * @author Cloudnine
 *
 */
public class MysqlManager {

    /** MySQL Connection */
    public static Connection conn = null;
    public static Statement statement = null;
    public static ResultSet results = null;
    public static Statement stmt = null;
    public static ResultSet auth = null;
    public static ResultSet given = null;

    /** MySQL Database Info */
    public static String DB = "vote4gold";
    public static String URL = "localhost";
    public static String USER = "root";
    public static String PASS = "";
    public static String driver = "com.mysql.jdbc.Driver"; // Driver for JDBC (Java and MySQL connector)

    /** Connects to MySQL Database */
    public synchronized static void createConnection() {
        try {
            Class.forName(driver);
            conn = DriverManager.getConnection(URL + DB, USER, PASS);
            conn.setAutoCommit(false);
            stmt = conn.createStatement();
            Misc.println("Connected to MySQL Database");
        } catch (Exception e) {
            //e.printStackTrace();
        }
    }

    public synchronized static void destroyConnection() {
        try {
            statement.close();
            conn.close();
        } catch (Exception e) {
            //e.printStackTrace();
        }
    }

    public synchronized static ResultSet query(String s) throws SQLException {
        try {
            if (s.toLowerCase().startsWith("select")) {
                ResultSet rs = statement.executeQuery(s);
                return rs;
            } else {
                statement.executeUpdate(s);
            }
            return null;
        } catch (Exception e) {
            destroyConnection();
            createConnection();
            //e.printStackTrace();
        }
        return null;
    }
}
The snippet of my command:
if (playerCommand.equals("claimreward")) {
    try {
        PreparedStatement ps = DriverManager.getConnection().createStatement("SELECT * FROM votes WHERE ip = hello AND given = '1' LIMIT 1");
        //ps.setString(1, c.playerName);
        ResultSet results = ps.executeQuery();
        if (results.next()) {
            c.sendMessage("You have already been given your voting reward.");
        } else {
            ps.close();
            ps = DriverManager.getConnection().createStatement("SELECT * FROM votes WHERE ip = hello AND given = '0' LIMIT 1");
            //ps.setString(1, playerCommand.substring(5));
            results = ps.executeQuery();
            if (results.next()) {
                ps.close();
                ps = DriverManager.getConnection().createStatement("UPDATE votes SET given = '1' WHERE ip = hello");
                //ps.setString(1, playerCommand.substring(5));
                ps.executeUpdate();
                c.getItems().addItem(995, 5000000);
                c.sendMessage("Thank you for voting! You've recieved 5m gold!");
            } else {
                c.sendMessage("You haven't voted yet. Vote for 5m gold!");
            }
        }
        ps.close();
    } catch (Exception e) {
        e.printStackTrace();
    }
}
return;
How the command works:
When a player types ::commandname (in this case, claimreward), the commands function is executed. This isn't the entire Commands class, just the part that I feel needs to be posted for my question to be detailed enough for a helpful answer.
Note: I have all my imports.
Note: Mysql connects successfully.
Note: I need to make the above command code snippet able to execute mysql queries.
Note: I would prefer the query to be executed straight from the command rather than from the MysqlManager, but I will do whatever is needed to resolve this problem.
I feel that I've described my problem in enough detail, but if you need additional information or clarification on anything, tell me and I'll try to be more specific.
Thank you for taking the time to examine my problem. Thanks in advance if you are able to help.
-Alex
Your approach is misguided on so many different levels that I hardly know where to start.
1) Don't ever use static class variables unless you know what you're doing there (and I'm certain you don't).
2) I assume there is a reason you create your own JDBC connection (e.g. it's part of your homework); if not, you shouldn't do that. I see you use DriverManager and PreparedStatement in one part; you should continue to use them.
3) Your approach starts with a relatively good code base (your command part) and then drops to a very low-level, crude approach to database connections (your MysqlManager). Unless it is really necessary and you know what you are doing, you should stay on the same level of abstraction and aim for the most abstract one that fits your needs. (In this case, write MysqlManager the way you wrote Command.)
4) In your previous question (which you just assumed everybody here has read, which is not the case) you got the suggestion to redesign your ideas; you should do that. Really, take a class in coding principles, learn about anti-patterns, and then start from scratch.
So in conclusion: write at least the MysqlManager again; it is fatally broken beyond repair. I'm sorry. Write me an email if you have further questions and I will take my time to see how I can help you. (an@steamnet.de)
