Building high volume batch data processing tool in Java [closed]

I am trying to build an ETL tool using Java. ETL tools perform batch read, write, and update operations on high volumes of data (both relational and other kinds). I am finding it difficult to choose the right framework/tool for this task.
A simplified, typical use case (sketched in code after the list):
Establish a connection with a database (source)
Read 1 million records joining two tables
Establish a connection with another database (target)
Update/write those 1 million records in the target database
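To make the scale concrete, here is a minimal sketch of how this pipeline looks in bare JDBC (option 1 below). The connection URLs, credentials, SQL, and batch size are hypothetical placeholders, and how aggressively setFetchSize streams a large result set is driver-dependent.

```java
import java.math.BigDecimal;
import java.sql.*;

// Bare-JDBC sketch of the read-join-write pipeline above.
// URLs, credentials, table names, and batch size are placeholders.
public class JdbcCopy {
    public static void main(String[] args) throws SQLException {
        try (Connection src = DriverManager.getConnection("jdbc:mysql://source/db", "user", "pass");
             Connection dst = DriverManager.getConnection("jdbc:mysql://target/db", "user", "pass")) {
            dst.setAutoCommit(false);
            try (Statement read = src.createStatement(ResultSet.TYPE_FORWARD_ONLY, ResultSet.CONCUR_READ_ONLY);
                 PreparedStatement write = dst.prepareStatement(
                         "INSERT INTO orders_copy (id, total, customer_name) VALUES (?, ?, ?)")) {
                read.setFetchSize(1000); // hint to stream rows instead of loading 1M into memory (driver-dependent)
                ResultSet rs = read.executeQuery(
                        "SELECT o.id, o.total, c.name FROM orders o JOIN customers c ON o.customer_id = c.id");
                int inBatch = 0;
                while (rs.next()) {
                    write.setLong(1, rs.getLong(1));
                    write.setBigDecimal(2, rs.getBigDecimal(2));
                    write.setString(3, rs.getString(3));
                    write.addBatch();
                    if (++inBatch % 1000 == 0) { // flush and commit every 1,000 rows
                        write.executeBatch();
                        dst.commit();
                    }
                }
                write.executeBatch(); // flush the final partial batch
                dst.commit();
            }
        }
    }
}
```

The point of the batching and periodic commits is to keep memory usage flat and transactions bounded while a million rows flow through.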
My Choices:
Use plain JDBC: build a higher-level API on top of JDBC to handle connecting to, reading from, and writing to databases.
Use a framework like Spring or Hibernate. I have never used these frameworks. I think Hibernate is for ORM purposes, but mine is not an ORM kind of requirement. Spring may have some batch-processing facilities, but I wonder whether the effort to learn them is actually less than doing it myself, as in option 1.
Any other option/framework?
Which of the above is best suited for me?
Considerations
I need an option that gives me a high level of performance. I don't mind complexity or losing flexibility in exchange for more performance.
I don't already know any of the frameworks like Spring etc. I only know core Java.
Of late, I have done a lot of googling, but I would appreciate some first-hand opinions.

Based on your usage scenario I would recommend Spring Batch. It is very easy to learn and implement. At a high level it contains the following three important components:
ItemReader: this component reads batch data from the source. There are ready-to-use implementations like JdbcCursorItemReader, HibernateCursorItemReader, etc.
ItemProcessor: this component holds the Java code that does any processing you need. If no processing is needed, it can be skipped.
ItemWriter: this component writes the data to the target in batches. For this component, too, there are ready-to-use implementations similar to those for ItemReader.
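To make those three components concrete, here is a minimal sketch of a chunk-oriented Spring Batch job in the Spring Batch 4 builder style. The SQL, the OrderRecord POJO, and the wiring of the two DataSource beans are hypothetical; no ItemProcessor is configured because the records pass through unchanged.

```java
import javax.sql.DataSource;
import org.springframework.batch.core.Job;
import org.springframework.batch.core.Step;
import org.springframework.batch.core.configuration.annotation.EnableBatchProcessing;
import org.springframework.batch.core.configuration.annotation.JobBuilderFactory;
import org.springframework.batch.core.configuration.annotation.StepBuilderFactory;
import org.springframework.batch.item.database.JdbcBatchItemWriter;
import org.springframework.batch.item.database.JdbcCursorItemReader;
import org.springframework.batch.item.database.builder.JdbcBatchItemWriterBuilder;
import org.springframework.batch.item.database.builder.JdbcCursorItemReaderBuilder;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.jdbc.core.BeanPropertyRowMapper;

@Configuration
@EnableBatchProcessing
public class CopyJobConfig {

    // Assumes two distinct DataSource beans (source and target) are defined elsewhere.
    @Bean
    public JdbcCursorItemReader<OrderRecord> reader(DataSource sourceDataSource) {
        return new JdbcCursorItemReaderBuilder<OrderRecord>()
                .name("orderReader")
                .dataSource(sourceDataSource)
                .sql("SELECT o.id, o.total, c.name FROM orders o JOIN customers c ON o.customer_id = c.id")
                .rowMapper(new BeanPropertyRowMapper<>(OrderRecord.class))
                .build();
    }

    @Bean
    public JdbcBatchItemWriter<OrderRecord> writer(DataSource targetDataSource) {
        return new JdbcBatchItemWriterBuilder<OrderRecord>()
                .dataSource(targetDataSource)
                .sql("INSERT INTO orders_copy (id, total, customer_name) VALUES (:id, :total, :name)")
                .beanMapped()
                .build();
    }

    @Bean
    public Step copyStep(StepBuilderFactory steps,
                         JdbcCursorItemReader<OrderRecord> reader,
                         JdbcBatchItemWriter<OrderRecord> writer) {
        return steps.get("copyStep")
                .<OrderRecord, OrderRecord>chunk(1000) // read/write/commit 1,000 records at a time
                .reader(reader)
                .writer(writer)
                .build();
    }

    @Bean
    public Job copyJob(JobBuilderFactory jobs, Step copyStep) {
        return jobs.get("copyJob").start(copyStep).build();
    }

    // Hypothetical record type; the getters/setters are what
    // BeanPropertyRowMapper and beanMapped() bind against.
    public static class OrderRecord {
        private long id;
        private java.math.BigDecimal total;
        private String name;
        public long getId() { return id; }
        public void setId(long id) { this.id = id; }
        public java.math.BigDecimal getTotal() { return total; }
        public void setTotal(java.math.BigDecimal total) { this.total = total; }
        public String getName() { return name; }
        public void setName(String name) { this.name = name; }
    }
}
```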

Thanks for all the updates related to Spring Batch. However, after some research I have decided to use Easy Batch. From https://github.com/j-easy/easy-batch:
Easy Batch is a framework that aims to simplify batch processing with Java. Its main goal is to take care of the boilerplate code for tedious tasks such as reading, filtering, parsing and validating input data and to let you concentrate on your batch processing business logic.
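For flavor, here is a minimal sketch of an Easy Batch job based on the project's documented API. Treat the exact class and package names as assumptions: they changed between Easy Batch 5 (org.easybatch.*) and 6 (org.jeasy.batch.*), and this follows the v6 layout. It reads from an in-memory list and writes to stdout; a real ETL job would plug in JDBC or flat-file readers and writers instead.

```java
import java.util.Arrays;
import java.util.List;

import org.jeasy.batch.core.job.Job;
import org.jeasy.batch.core.job.JobBuilder;
import org.jeasy.batch.core.job.JobExecutor;
import org.jeasy.batch.core.job.JobReport;
import org.jeasy.batch.core.reader.IterableRecordReader;
import org.jeasy.batch.core.writer.StandardOutputRecordWriter;

public class EasyBatchHello {
    public static void main(String[] args) {
        List<String> data = Arrays.asList("record 1", "record 2", "record 3");

        Job job = new JobBuilder<String, String>()
                .named("hello-easy-batch")
                .reader(new IterableRecordReader<>(data))   // source: an in-memory iterable
                .writer(new StandardOutputRecordWriter<>()) // sink: stdout
                .build();

        JobExecutor executor = new JobExecutor();
        JobReport report = executor.execute(job); // run the job and collect metrics
        executor.shutdown();
        System.out.println(report);
    }
}
```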

Try Data Pipeline, a lightweight ETL engine for Java. It's simple to use.

Related

How to handle big data with a Java framework? [closed]

I'm fairly new to data science, and just starting to develop a system that requires me to analyze large data sets (e.g. 5-6 million records in each DB).
The bigger picture: I have multiple DBs containing various kinds of data which need to be integrated. After integrating the data, I also need to perform some data analysis. And lastly, I need to visualize the data for many clients.
Overall, I want to know the current technologies/trends for handling big data (ideally with a Java framework).
The answer is: it depends on your non-functional requirements. Your use cases will be critical in deciding which technology to use.
Let me share one of my experiences, to clarify what I mean:
In 2012 I needed to deal with ~2 million non-structured records per month, and run entropy (information theory) and similarity algorithms for ~600 requests per minute.
Our scenario consisted of:
Non-structured records, but already in JSON format.
Entropy and similarity algorithms computed over the entire content of the DB versus the records to be matched (take a look at the [Shannon entropy formula][1] and you will understand the complexity I'm talking about; a minimal sketch follows this list).
More than 100 different web applications as clients of this solution.
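To give a feel for the computation mentioned in the second bullet, here is a minimal, hypothetical sketch (not the poster's actual code) of Shannon entropy, H(X) = -Σ p(x) log2 p(x), over a record's character distribution:

```java
import java.util.HashMap;
import java.util.Map;

public class Entropy {
    // Shannon entropy: H(X) = -sum over x of p(x) * log2(p(x)),
    // computed here over the character frequencies of a string.
    static double shannonEntropy(String record) {
        Map<Character, Integer> counts = new HashMap<>();
        for (char c : record.toCharArray()) {
            counts.merge(c, 1, Integer::sum);
        }
        double entropy = 0.0;
        int n = record.length();
        for (int count : counts.values()) {
            double p = (double) count / n;
            entropy -= p * (Math.log(p) / Math.log(2)); // log base 2
        }
        return entropy;
    }

    public static void main(String[] args) {
        System.out.println(shannonEntropy("aaaa")); // 0.0: one symbol, no uncertainty
        System.out.println(shannonEntropy("abab")); // 1.0: two equally likely symbols
        System.out.println(shannonEntropy("abcd")); // 2.0: four equally likely symbols
    }
}
```

Running something like this against the entire DB content for every matching request is what makes the workload heavy.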
Given those requirements (and many others), and after performing PoCs with [Cassandra][2], [Hadoop][3], [Voldemort][4], [Neo4j][5], along with tests of stress, resiliency, scalability, and robustness, we arrived at the best solution for that moment (2012):
Java EE 7 (with the new Garbage-First (G1) collector activated)
JBoss AS 7 ([WildFly][6]) + [Infinispan][7] for handling MapReduce race conditions, other cluster-control needs, and distributed caching.
Servlet 3.0 (for its non-blocking I/O)
[Nginx][8] (still beta at the time, but unlike httpd, it already handled multiple connections in a non-blocking fashion)
[MongoDB][9] (since our raw content was already in JSON document style)
[Apache Mahout][10] for all the algorithm implementations, including the MapReduce strategy
among other things.
So, it all depends on your requirements. There's no silver bullet; each situation demands an architectural analysis.
I remember NASA at that time was processing ~1 TB per hour on AWS with Hadoop for the [Mars project with Curiosity][11].
In your case, I would recommend paying attention to your requirements; maybe a Java framework is not what you need (or not the only thing you need):
If you are just going to implement algorithms for data analysis (for statisticians and data miners, for example), the [R programming language][12] is probably the best choice.
If you need really fast I/O (aircraft systems, for example): any natively compiled language like [Go][13], [C++][14], etc.
But if you are actually going to create web applications that will just be clients of, or feeders into, the big data solution, I'd recommend something lighter and more scalable like [Node.js][15], or a just-in-time compiled technology based on the JVM ([Scala][16], [Jython][17], Java) in [dockerized][18] [microservices][19]...
Good luck! (Sorry, Stack Overflow didn't allow me to add the reference links yet, but everything I've mentioned here can easily be googled.)

Object-oriented design and database design process [closed]

I'm a little confused about the process of developing a database-based application.
I'm using Java and a relational database.
What's the correct way to iterate through the process of developing an object-oriented, database-backed application like "inventory management control"?
Develop the database schema first and then do the OOD, or vice versa?
Since I assume you are about to use a traditional RDBMS, from my own experience it is best to design the database schema first: think of all the tables you need to store your information, and think of the relations between them (foreign keys).
The next step should be writing the application itself. I assume you're about to use Java, and can benefit from OOP design.
In such a case I strongly recommend using an ORM technology, like Hibernate, to bridge the gap between your OOP app design and your RDBMS design, though it's not mandatory, since you can also use a simple JDBC approach.
From my experience, developing this way is much less time-consuming than designing your high-level OOP application first and then trying to fit a DB schema to it, because messing with the DB is usually more expensive than messing with high-level OOP abstractions.
There are a number of different approaches possible, and they each have their merits and downsides.
If you follow the ORM approach and use a tool like Hibernate, you can hide much of the database implementation. You would proceed with your OOD and the database schema would drop out of that. ORMs like Hibernate even do the schema generation for you (this is very helpful in testing, as you can create an in-memory database on the fly for your tests).
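As a hypothetical illustration of that schema-generation point: with Hibernate's hibernate.hbm2ddl.auto property set to create or update, a table is derived from an annotated entity like the one below, so the schema really does drop out of the object design. The entity itself is invented for this example.

```java
import javax.persistence.Entity;
import javax.persistence.GeneratedValue;
import javax.persistence.Id;

// Hypothetical inventory entity; Hibernate generates a matching table
// (id, name, quantityOnHand columns) from this mapping.
@Entity
public class InventoryItem {
    @Id
    @GeneratedValue
    private Long id;

    private String name;
    private int quantityOnHand;

    // getters and setters omitted for brevity
}
```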
The advantage of this approach is that you can focus on the OOD and work in 'thin slices', where the database schema is generated as you progress. This fits in well with the agile approach.
The downside with the ORM approach is that it may not result in an optimised database schema. For example, the performance of your database schema may not be as good as if you had focused more on the schema design.
If you decide to focus on the database design you can spend time optimising it for performance and other non-functional requirements (like scalability and auditing). The downside with this approach is that it may restrict the way you do OOD in your code and it may be more difficult to work in the iterative fashion preferred by agile.

Hibernate vs JDBI [closed]

I am building a web service using the Dropwizard framework (version 0.7.0). It involves executing some read-only queries to the database, manipulating the result set and then returning that data set. I am using MySQL as a database engine. Since I am new to this framework, I want to know which option I should choose: Hibernate or JDBI.
I've used both of these. I've used Hibernate with GORM in Grails as well as in a traditional Spring app and I've used JDBI in Dropwizard.
I have really enjoyed the simplicity of JDBI and here are a couple of reasons why I prefer it over Hibernate.
I know exactly what SQL is going to be executed to acquire the data I'm requesting. With Hibernate, you sometimes have to do a lot of messing around with HQL and configuring your objects to get what you intended returned. You ultimately resort to SQL, but then have the difficulty of properly mapping your results back to your domain objects, or you give up and let Hibernate fetch them one by one.
I don't need to worry about lazy/eager fetching and how that is going to affect my query time on large data sets.
Mappings aren't complicated because you manage them on your own and you don't have to rely on getting the right combinations of annotations and optimizations.
For your case in particular, it sounds like you'd want something lightweight because you don't have a lot of use cases, and that would definitely be JDBI over Hibernate in my opinion.
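To illustrate the simplicity, here is a hypothetical read-only DAO in the JDBI v2 SQL Object style (the JDBI generation that ships with Dropwizard 0.7 via the dropwizard-jdbi module). The table, columns, and the User bean are invented for the example; User is assumed to be a plain bean with matching getters and setters.

```java
import org.skife.jdbi.v2.sqlobject.Bind;
import org.skife.jdbi.v2.sqlobject.SqlQuery;
import org.skife.jdbi.v2.sqlobject.customizers.MapResultAsBean;

public interface UserDao {
    // The SQL is right here in the annotation: no HQL, no mapping surprises.
    @SqlQuery("SELECT id, name FROM users WHERE id = :id")
    @MapResultAsBean
    User findById(@Bind("id") long id);
}

// Wiring in Dropwizard 0.7, roughly:
//   DBI jdbi = new DBIFactory().build(environment, config.getDataSourceFactory(), "mysql");
//   UserDao dao = jdbi.onDemand(UserDao.class);
```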
Really, both of these solutions are just "lock-in".
If you want to go with a persisted-model type of interface, write your code against JPA (if you are sure it will only ever be backed by a relational database) or JDO (if you might want to back it with both relational and other types of databases, like those from the NoSQL movement). With either of these standards, when problems occur you can switch persistence providers without rewriting the bulk of your code.
If you want to go with a procedural persistence model (dealing with SQL queries directly and such), then go with JDBI or perhaps even JDBC. JDBI provides a very nice abstraction over JDBC; however, there are cases where you want the lower-level access (for performance reasons, of the kind where you are tuning the queries and the database in concert). Again, JDBC is a standard, so you can swap out one database for another with some ease; however, the SQL itself won't be as easy to swap out.
To mitigate the SQL swap-out problem, I recommend using sets of property files to hold the queries, and then a resource-loader mechanism to bind the SQL for the right database to the code. It isn't 100% foolproof, but it does get you a bit further.
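A rough sketch of that property-file idea; the file naming scheme and query keys are hypothetical:

```java
import java.io.IOException;
import java.io.InputStream;
import java.util.Properties;

// One queries-<db>.properties file per database dialect on the classpath,
// looked up by key at runtime.
public final class SqlCatalog {
    private final Properties queries = new Properties();

    public SqlCatalog(String dbName) throws IOException {
        String resource = "queries-" + dbName + ".properties"; // e.g. queries-mysql.properties
        try (InputStream in = SqlCatalog.class.getClassLoader().getResourceAsStream(resource)) {
            if (in == null) {
                throw new IOException("No SQL catalog found: " + resource);
            }
            queries.load(in);
        }
    }

    public String sql(String key) {
        String s = queries.getProperty(key);
        if (s == null) {
            throw new IllegalArgumentException("Unknown query key: " + key);
        }
        return s;
    }
}

// Usage: new SqlCatalog("mysql").sql("users.findById")
```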
Now, if you ask me what I'd use, I highly recommend JDO.
If you have very little database work, use JDBI; otherwise go for Hibernate, as it is very strong and provides many additional features for your persistence logic.

Is Camel or another enterprise integration framework more suitable for this use case? [closed]

My application needs to work as middleware (MW) that receives orders (in the form of XML) from various customers; each order contains a supplier id. Customers can send the XML to one of these components:
1) JMS queue
2) File system
3) HTTP
4) Web service request (REST/SOAP)
This MW will first validate the incoming request and send an acknowledgement, over their preferred channels, to the customers who placed the order. The channel and customer endpoint info is in the incoming XML.
Once it has the order, it needs to send order requests to the different suppliers, as XML, over their preferred channels.
I have the suppliers and their preferred-channel info in my DB.
So it's an enterprise integration use case.
I was planning to do it using core Java technologies. Here is the approach I had in mind:
Have four listener/entry endpoints, one for each type of incoming request (JMS queue, file system, HTTP, web service request (REST/SOAP)).
These listeners will put the XML string on a JMS queue. This works as a receptionist and makes the process asynchronous.
A JMS consumer will then listen on the queue (the consumer can be on the same system as the producer or a different one, depending on the load on the producer machine). This consumer will parse the XML string into Java objects, perform the validation, and send the acknowledgement to customers (the acknowledgement needs to be sent based on customer preference; I will use an acknowledgement processor factory that sends it on the preferred channel). Once validation is done, it converts this POJO into another POJO format so XStream/JAXB can marshal it back to XML and send it to the suppliers on their preferred channel (supplier preference is stored in the DB), e.g. via SOAP, JMS, file, etc.
Then I came across this Camel link, http://java.dzone.com/articles/open-source-integration-apache, and it looks like it provides the perfect solution; this appears to be a classic enterprise integration use case.
Experts, please advise: is Camel the right solution for this, or would some other enterprise integration framework like Spring Integration or an ESB be more beneficial in this case? If somebody can point me to a resource where an ESB solves this kind of use case, that would be really helpful.
I could not explore every solution because of time constraints, so I'm looking for expert suggestions so that I can concentrate on one.
Something like Camel is completely appropriate for this task.
Things like Camel provide toolsets and components that make stitching together workflows like the one you describe easier, with the caveat that you must learn the overall tool (i.e. Camel, in this case) first.
For a skilled, experienced developer and simple use cases, you can see why one might take the approach you're taking: provisioning the workflow with the tools at hand, including, perhaps, custom code, rather than taking the time to learn a new tool.
Recall that while tools can be a great benefit (features, testing, quality, documentation), they also bring a burden (support, resources, complexity). A key aspect of bringing tool sets into your environment is that while you may not have written the code, you are still ultimately responsible for its behavior in your environment.
So, all that said, you need to ascertain whether the time investment of incorporating a tool like Camel is worth the benefit to your current project. Odds are that if you intend to keep doing integrations in the future, investing in such a tool is a good idea, as it will make those integrations easier.
But be conscious that something like Camel, which is quite flexible, also brings inherent complexity along with it. For simple stuff like what you're talking about, though, I think it's a solid fit.
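For a rough idea of how the fan-in described in the question might look in Camel's Java DSL, here is a sketch. The endpoint URIs, the XSD name, the header, and SupplierChannelRouter are all hypothetical; from/to, wireTap, recipientList, and method() are standard Camel DSL calls.

```java
import org.apache.camel.builder.RouteBuilder;

public class OrderRoutes extends RouteBuilder {
    @Override
    public void configure() {
        // Fan the entry channels in onto one internal queue (the "receptionist").
        from("file:/orders/incoming").to("activemq:orders.incoming");
        from("jetty:http://0.0.0.0:8080/orders").to("activemq:orders.incoming");
        // JMS and SOAP/REST entry points would be wired the same way.

        from("activemq:orders.incoming")
            .to("validator:order.xsd") // validate the incoming XML against a schema
            .wireTap("direct:ack")     // acknowledge asynchronously, off the main flow
            // A bean decides the outgoing endpoint(s) from the supplier info in the DB:
            .recipientList(method(SupplierChannelRouter.class, "route"));

        from("direct:ack")
            .log("acknowledging order ${header.orderId} to the customer");
    }
}

// Hypothetical router bean: returns an endpoint URI such as
// "jms:supplier.orders" or "file:/suppliers/outbox" per message.
class SupplierChannelRouter {
    public String route(String body) {
        return "jms:supplier.orders";
    }
}
```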

How to use JPA in an existing project? [closed]

My team is working on a medium-sized application (OLTP style). We are interested in switching to JPA instead of only using JDBC queries, mostly for performance and practical reasons. I'm not looking for a tutorial that shows me how to create a persistence.xml or an Entity class in Eclipse. What I would like to know is the steps to convert all the database queries into JPA form. I know that the whole application must use JPA.
Many programmers have worked on this project over the years, so not everyone has the same SQL knowledge or the same programming skills. There must be 1000+ custom queries in this application, joining multiple tables (something plain JPA does not handle very well) or selecting only a few fields from a table... This is getting a bit out of control, and I think JPA would provide a nice toolbox to make sure everyone is going in the same direction.
What should I look for to make sure I'm not starting a conversion process that will never end? Some sort of guideline.
(Again, I'm not looking for programming examples, nor an Eclipse tutorial.)
Thanks!
The first step is to convert your database schema into a database model using JPA. You need to be clear about which tables, sequences, and other database objects your existing application uses, and start modeling the whole schema with JPA; you should consider using JPA annotations.
The step above will determine your entities, embeddables, and mapped superclasses, their properties, and the relationships between them. This step is crucial, as your logic will depend on the correctness of this model.
Then start tracking down all the queries involved in your project. Since you said you have 1000+ queries, consider two scenarios: convert all of them into JPQL, or use a mix of native queries and named queries. I prefer converting everything to JPQL unless a query is very database-dependent. One step you must take is finding all of them; there may be existing tools that convert SQL to JPQL, but I believe it's a better idea to do it yourself.
Once you have the queries and the database model, start creating your new DAOs using JPA and the EntityManager. I recommend extracting an interface from your existing DAOs and moving to a JPA implementation behind the same interface; this avoids breaking your own code. Don't forget unit and integration tests for your new DAOs.
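A minimal sketch of that interface-extraction step; Customer, the JPQL, and the DAO names are hypothetical:

```java
import java.util.List;
import javax.persistence.EntityManager;
import javax.persistence.TypedQuery;

// The interface extracted from the existing JDBC DAO; callers depend
// only on this, so the implementation can be swapped underneath them.
interface CustomerDao {
    List<Customer> findByCity(String city);
}

// New JPA implementation added alongside the old JDBC one.
// Customer is assumed to be a mapped @Entity with a city field.
class JpaCustomerDao implements CustomerDao {
    private final EntityManager em;

    JpaCustomerDao(EntityManager em) {
        this.em = em;
    }

    @Override
    public List<Customer> findByCity(String city) {
        TypedQuery<Customer> q = em.createQuery(
                "SELECT c FROM Customer c WHERE c.city = :city", Customer.class);
        q.setParameter("city", city);
        return q.getResultList();
    }
}
```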
Also, with the above approach you can migrate the application module by module or DAO by DAO; you don't have to move the full application at once. This gives you a process in which you see progress each time you finish a new DAO or module.
Not sure what you mean about programming examples; I think those are the required steps, but each project is different, so consider this a set of guidelines.
