How to handle big data with a Java framework? [closed]

I'm fairly new to data science, and am just starting to develop a system that requires me to analyze large amounts of data (e.g. 5-6 million records in each DB).
The bigger picture: I have multiple DBs containing various kinds of data which need to be integrated. After integrating the data, I also need to perform some data analysis. And lastly, I need to visualize the data for many clients.
Overall, I want to know what the current technology/trend is for handling big data (i.e. with a Java framework).

The answer is: it depends on your non-functional requirements. Your use cases will be critical in deciding which technology to use.
Let me share one of my experiences, in order to clarify what I mean:
In 2012 I needed to deal with ~2 million non-structured records per month, and run entropy (information theory) and similarity algorithms for ~600 requests per minute.
Our scenario consisted of:
Non-structured records, but already in JSON format.
Entropy and similarity algorithms that ran over the entire content of the DB versus the records to be matched (take a look at the [Shannon entropy formula][1] and you will understand the complexity I'm talking about; a small sketch follows this list).
More than 100 different web applications as clients of this solution.
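To illustrate the kind of computation involved, here is a minimal sketch (not the original production code) of Shannon entropy, H = -sum(p_i * log2(p_i)), over a record's term frequencies; the sample terms are purely illustrative:

```java
import java.util.Map;

// Minimal sketch: Shannon entropy of a term-frequency map.
public class ShannonEntropy {
    static double entropy(Map<String, Integer> termCounts) {
        double total = termCounts.values().stream().mapToInt(Integer::intValue).sum();
        double h = 0.0;
        for (int count : termCounts.values()) {
            double p = count / total;
            h -= p * (Math.log(p) / Math.log(2)); // log base 2
        }
        return h;
    }

    public static void main(String[] args) {
        // Entropy of a record whose content reduces to these (made-up) term frequencies
        System.out.println(entropy(Map.of("sensor", 3, "timeout", 1, "ok", 4)));
    }
}
```

In the scenario above, a computation like this had to be run against the whole DB for every record to be matched, which is where the complexity comes from.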
Given those requirements (and many others), and after performing PoCs with [Cassandra][2], [Hadoop][3], [Voldemort][4], and [Neo4j][5], plus stress, resiliency, scalability, and robustness tests, we arrived at the best solution for that moment (2012):
Java EE 7 (with the new Garbage-First (G1) collector activated)
JBoss AS 7 ([WildFly][6]) + [Infinispan][7] for MapReduce race conditions, other cluster-control concerns, and distributed cache needs.
Servlet 3.0 (because of its non-blocking I/O)
[Nginx][8] (at that time still in beta, but unlike httpd2, it already handled multiple connections in a non-blocking fashion)
[MongoDB][9] (since our raw content was already in JSON document style)
[Apache Mahout][10] for all algorithm implementations, including the MapReduce strategy
among other things.
So, it all depends on your requirements. There's no silver bullet; each situation demands an architectural analysis.
I remember NASA at that time was processing ~1 TB per hour on AWS with Hadoop, for the [Mars project with Curiosity][11].
In your case, I would recommend paying attention to your requirements; maybe a Java framework is not what you need (or not all that you need):
If you are just going to implement algorithms for data analysis (as statisticians and data miners do, for example), the [R programming language][12] is probably the best choice.
If you need really fast I/O (aircraft software, for example): any natively compiled language like [Go][13], [C++][14], etc.
But if you're actually going to create web applications that will just be clients of, or feeders for, the big data solution, I'd recommend something lighter and more scalable like [Node.js][15], or a JIT-compiled JVM technology ([Scala][16], [Jython][17], Java) in [dockerized][18] [microservices][19]...
Good luck! (Sorry, Stack Overflow didn't allow me to add the reference links yet, but everything I've talked about here can easily be googled.)

Related

Building high volume batch data processing tool in Java [closed]

I am trying to build an ETL tool using Java. ETL tools perform batch read, write, and update operations on high volumes of data (both relational and other kinds). I am finding it difficult to choose the right framework/tool to accomplish this task.
A simplified, typical use case:
Establish a connection with a database (source)
Read 1 million records joining two tables
Establish a connection with another database (target)
Update/write those 1 million records in the target database
My Choices:
Use plain JDBC. Build a higher-level API on top of JDBC to accomplish the tasks of connecting to, reading from, and writing to databases.
Use a framework like Spring or Hibernate. I have never used these frameworks. I think Hibernate is for ORM purposes, but mine is not an ORM kind of requirement. Spring may have some batch processing support, but I wonder whether the effort to learn it is actually less than doing it myself as in option 1.
Any other option/framework?
Which one among the above is best suited for me?
Considerations
I need to choose an option that gives me a high level of performance. I don't mind complexity or losing flexibility in favor of more performance.
I don't already know any of the frameworks like Spring etc. I only know core Java.
Lately I have done a lot of googling, but I would appreciate some "first hand" opinions.
Based on your usage scenario I would recommend Spring Batch. It is very easy to learn and implement. At a high level it contains the following three important components (a minimal configuration sketch follows the list):
ItemReader: This component reads batch data from the source. There are ready-to-use implementations such as JdbcCursorItemReader, HibernateCursorItemReader, etc.
ItemProcessor: This component holds the Java code that performs any processing, if needed. If no processing is needed, it can be skipped.
ItemWriter: This component writes the data to the target in batches. It also has ready-to-use implementations, similar to ItemReader.
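For illustration, here is a minimal configuration sketch of such a chunk-oriented job (Spring Batch 4.x style). The SQL, table names, chunk size, and the assumption that source/target DataSource beans exist elsewhere are all illustrative; this is not code from the answer:

```java
import java.util.Map;
import javax.sql.DataSource;

import org.springframework.batch.core.Job;
import org.springframework.batch.core.Step;
import org.springframework.batch.core.configuration.annotation.EnableBatchProcessing;
import org.springframework.batch.core.configuration.annotation.JobBuilderFactory;
import org.springframework.batch.core.configuration.annotation.StepBuilderFactory;
import org.springframework.batch.item.database.JdbcBatchItemWriter;
import org.springframework.batch.item.database.JdbcCursorItemReader;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.jdbc.core.ColumnMapRowMapper;
import org.springframework.jdbc.core.namedparam.MapSqlParameterSource;

@Configuration
@EnableBatchProcessing
public class EtlJobConfig {

    // Reads the joined rows from the source database as generic column maps.
    @Bean
    public JdbcCursorItemReader<Map<String, Object>> reader(DataSource sourceDataSource) {
        JdbcCursorItemReader<Map<String, Object>> reader = new JdbcCursorItemReader<>();
        reader.setName("sourceReader");
        reader.setDataSource(sourceDataSource);
        // Hypothetical join of the two source tables
        reader.setSql("SELECT a.id, a.name, b.amount FROM table_a a JOIN table_b b ON b.a_id = a.id");
        reader.setRowMapper(new ColumnMapRowMapper());
        return reader;
    }

    // Writes each chunk to the target database in one batched statement.
    @Bean
    public JdbcBatchItemWriter<Map<String, Object>> writer(DataSource targetDataSource) {
        JdbcBatchItemWriter<Map<String, Object>> writer = new JdbcBatchItemWriter<>();
        writer.setDataSource(targetDataSource);
        writer.setSql("INSERT INTO target_table (id, name, amount) VALUES (:id, :name, :amount)");
        writer.setItemSqlParameterSourceProvider(item -> new MapSqlParameterSource(item));
        return writer;
    }

    @Bean
    public Step etlStep(StepBuilderFactory steps,
                        JdbcCursorItemReader<Map<String, Object>> reader,
                        JdbcBatchItemWriter<Map<String, Object>> writer) {
        return steps.get("etlStep")
                .<Map<String, Object>, Map<String, Object>>chunk(1000) // commit every 1000 records
                .reader(reader)
                .writer(writer)
                .build();
    }

    @Bean
    public Job etlJob(JobBuilderFactory jobs, Step etlStep) {
        return jobs.get("etlJob").start(etlStep).build();
    }
}
```

The chunk size controls how many records are read and written per transaction, which is the main knob for throughput in a read-join-write scenario like this one; no ItemProcessor is declared because no per-record transformation is needed.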
Thanks for all the updates related to Spring Batch. However, after some research I have decided to use Easy Batch. From https://github.com/j-easy/easy-batch:
Easy Batch is a framework that aims to simplify batch processing with
Java. Its main goal is to take care of the boilerplate code for
tedious tasks such as reading, filtering, parsing and validating input
data and to let you concentrate on your batch processing business
logic.
Try Data Pipeline, a lightweight ETL engine for Java. It's easy and simple to use.

Single Large Webapp or multiple small webapps? [closed]

I have a website consisting of about 20 Java web applications (Servlet/JSP-based webapps) of varying sizes, each handling a different area of the site.
The combined size of all 20 WARs is 350 MB; however, by combining them I anticipate being able to ultimately reduce that and realise shared caching benefits.
Is it best to keep them separate, or to merge them into a single über-webapp WAR file? (And why?)
I'm particularly interested in knowing any technical drawbacks of merging them.
I "vote" to combine them.
Pros
Code sharing: If you combine them, you can share code between them (because there will be only one code base).
This does not apply just to your own code; it also applies to all the external libraries you use, which I think will be the bigger gain.
Less memory: The combined app will also require less memory (possibly significantly less) because the external libraries used by multiple apps only have to be loaded once.
Maintainability: If you change something in your code base or database, you only have to change it in one place and redeploy one app.
Easier synchronization: If the separate apps do something critical in the database, for example, it's harder to synchronize them than when everything is in one app.
Easier collaboration between different parts/modules of the code: if they are combined, you can simply call methods of other modules. If they are in different web apps, you have to do it in a roundabout way, such as HTTP calls, RMI, etc.
Cons
It will be bigger (obviously). If you worry about it being too big, just exclude the libs from the deployment WAR and place them under Tomcat's shared lib directory.
The separate apps might use different versions of the same lib. But it's better to sort those out early, when it can be done more easily and with less work.
Another drawback can be longer deployment time. Again, "outsourcing" the libs can help make it faster.
There is no real drawback in terms of size, memory, or performance in using a single WAR, as systems are getting faster each day; and as you said, whether you run the modules as different apps or as one, the total combined resources consumed will be about the same in terms of processing power. It is maintenance and administration concerns that decide between a single app and multiple apps. If you have multiple modules that change frequently and independently of one another, it's better to have multiple webapps, talking via RMI or WS calls for intercommunication (if required). If all of them operate as one unit, where everything changes at once, you may go with a single app. Having multiple apps makes it easier to install and update each one independently as functionality changes at the module level.
deploying multiple applications to Tomcat
http://www.coderanch.com/t/471496/Tomcat/Deploying-multiple-applications-WAR
Hope it helps

Android only game in OpenGL: performance in C++ (NDK) vs Java (Dalvik) [closed]

I know that similar questions have been asked before, but...
We want to develop (or at least hope to) an indie game, but still a game with high-quality graphics, with hundreds if not thousands of moving objects on the screen, so we expect a very high polygon count and requirements for hit testing and perhaps some AI.
I know the basic problem with Java is garbage collection. But it's not an issue: we plan to allocate all of the required memory before the game starts, and for transient objects we will use pooling (so in the game loop the new keyword will never be written). And we plan to use every possible technique mentioned here (Google I/O 2009 - Writing Real-Time Games for Android).
The main reason we insist on Java is deployment, and we only want to develop for Android (for now at least).
So can the same performance be achieved in a game with Java (even if that means ugly/non-idiomatic code) as if we did it with C++? If not, what are the specifics? Or, if it's possible but very impractical, what are the reasons?
(For example, I read something about Java buffers and OpenGL not being the best pairing, but I don't remember the specifics; maybe an expert can comment.)
You're going to be paying a fixed additional cost per call to use OpenGL from Java source code. Android provides Java-language wrappers around the calls. For example, if you call glDrawArrays, you're calling a native method declared in GLES20.java, which is defined in android_opengl_GLES20.cpp. You can see from the code that it's just forwarding the call with minimal overhead.
Nosing around in the file, you can see other calls that perform additional checks and hence are slightly more expensive.
The bottom line as far as unavoidable costs goes is that the price of making lots of GLES calls is higher with Java source than native source. (While looking at the performance of Android Breakout with systrace I noticed that there was a lot of CPU overhead in the driver because I was doing a lot of redundant state updates. The cost of doing so from native code would have been lower, but the cost of doing zero work is less than the cost of doing less work.)
The deeper question has to do with whether you need to write your code so differently (e.g. to avoid allocations) that you simply can't get the same level of performance. You do have to work with direct ByteBuffer objects rather than simple arrays, which may require a bit more management on your side. But aside from the current speed differences between compute-intensive native and Java code on Android, I'm not aware of anything that fundamentally prevents good performance from strictly-Java implementations.
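For illustration, here is a minimal sketch of the direct ByteBuffer pattern mentioned above: vertex data is placed into a direct, native-order buffer once (outside the render loop, to avoid GC pressure) and handed to glDrawArrays. The attribute handle and the triangle data are illustrative assumptions:

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.FloatBuffer;
import android.opengl.GLES20;

public class TriangleBatch {
    private static final float[] VERTICES = {
            0.0f,  0.5f,
           -0.5f, -0.5f,
            0.5f, -0.5f,
    };

    // Allocate the direct buffer up front, not per frame, to avoid garbage collection pressure.
    private final FloatBuffer vertexBuffer = ByteBuffer
            .allocateDirect(VERTICES.length * 4)   // 4 bytes per float
            .order(ByteOrder.nativeOrder())
            .asFloatBuffer();

    public TriangleBatch() {
        vertexBuffer.put(VERTICES).position(0);
    }

    // Called from the render loop; positionHandle comes from glGetAttribLocation (hypothetical shader setup).
    public void draw(int positionHandle) {
        GLES20.glEnableVertexAttribArray(positionHandle);
        GLES20.glVertexAttribPointer(positionHandle, 2, GLES20.GL_FLOAT, false, 0, vertexBuffer);
        GLES20.glDrawArrays(GLES20.GL_TRIANGLES, 0, 3);
        GLES20.glDisableVertexAttribArray(positionHandle);
    }
}
```

The extra management the answer refers to is exactly this: keeping the data in direct buffers the driver can read, rather than in plain Java arrays.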

Java based library for sensor data collection [closed]

I'm looking for an embeddable Java library that is suitable for collecting real-time streams of sensor data in a general-purpose way. I plan to use it to develop a "hub" application for reporting on multiple disparate sensor streams, running on a JVM-based server (I will also be using Clojure for this).
Key things it needs to have:
Interfaces for various common sensor types / APIs. I'm happy to build what I need myself, but it would be nice if some standard stuff comes out of the box.
Suitable for "soft real time" usage, i.e. fairly low latency and low overhead.
Ability to monitor and manage streams at runtime, gather statistics etc.
Open source under a reasonably permissive license so that I can integrate it with other code (Apache, EPL, BSD, LGPL all fine)
A reasonably active community / developer ecosystem
Is there something out there that fits this profile that you can recommend?
1. Round-robin database (wikipedia)
RRDtool (acronym for round-robin database tool) aims to handle
time-series data like network bandwidth, temperatures, CPU load, etc.
The data are stored in a round-robin database (circular buffer), thus
the system storage footprint remains constant over time.
This approach/DB format is widely used, stable, and simple enough. Out of the box it allows you to generate nice plots.
There is a Java implementation, RRD4J:
RRD4J is a high performance data logging and graphing system for time
series data, implementing RRDTool's functionality in Java. It follows
much of the same logic and uses the same data sources, archive types
and definitions as RRDTool does. Open Source under Apache 2.0 License.
Update
I forgot to mention there is a Clojure RRD API (examples).
2. For some experiments with real-time data I would suggest considering Perst
It is small, fast, and reliable enough, but distributed under GPLv3. Perst provides several indexing algorithms:
B-Tree
T-Tree (optimized for in-memory database)
R-Tree (spatial index)
Patricia Trie (prefix search)
KD-Tree (multidimensional index)
Time series (large number of fixed size objects with timestamp)
The last one suits your needs very well.
3. Neo4J with Relationship indexes
A good example where this approach pays dividends is in time series
data, where we have readings represented as a relationship per
occurrence.
4. Oracle Berkeley DB Java Edition
Oracle Berkeley DB Java Edition is an open source, embeddable,
transactional storage engine written entirely in Java. It takes full
advantage of the Java environment to simplify development and
deployment. The architecture of Oracle Berkeley DB Java Edition
supports very high performance and concurrency for both read-intensive
and write-intensive workloads.
Suggestion
Give RRD4J a try (a minimal usage sketch follows this list):
It is simple enough
It provides quite nice plots
It has a Clojure API
It supports several back-ends, including Oracle Berkeley DB Java Edition
It can store/visualize detailed data sets
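For illustration, here is a minimal sketch of logging one sensor reading with RRD4J's core API (RrdDef / RrdDb / Sample). The file path, datasource name, step, and archive settings are assumptions, and newer RRD4J versions prefer builder-style construction, so treat this as a rough outline rather than version-exact code:

```java
import org.rrd4j.ConsolFun;
import org.rrd4j.DsType;
import org.rrd4j.core.RrdDb;
import org.rrd4j.core.RrdDef;
import org.rrd4j.core.Sample;
import org.rrd4j.core.Util;

public class SensorRrdExample {
    public static void main(String[] args) throws Exception {
        // Define a round-robin database: 60 s step, one GAUGE datasource,
        // and an archive keeping 24 h of 1-minute averages (1440 rows).
        RrdDef def = new RrdDef("/tmp/temperature.rrd", 60);
        def.addDatasource("temp", DsType.GAUGE, 120, Double.NaN, Double.NaN);
        def.addArchive(ConsolFun.AVERAGE, 0.5, 1, 1440);

        RrdDb db = new RrdDb(def);
        try {
            Sample sample = db.createSample();
            sample.setTime(Util.getTime());  // current time in seconds
            sample.setValue("temp", 21.7);   // one sensor reading
            sample.update();
        } finally {
            db.close();
        }
    }
}
```

Because the archives are fixed-size circular buffers, the storage footprint stays constant no matter how long the stream runs, which is the property the quoted description highlights.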
For collecting real-time streams of sensor data, the following might be of help:
Have you checked the LeJOS APIs? See http://lejos.sourceforge.net/nxt/nxj/api/index.html
It is also worth checking Oracle Java ME Embedded and the target markets it addresses: http://www.unitask.com/oracledaily/2012/10/04/at-the-java-demogrounds-oracle-java-me-embedded-enables-the-internet-of-things/
It can be downloaded from http://www.oracle.com/technetwork/java/embedded/downloads/javame/index.html
For storing the time series data, nothing beats Cassandra (http://cassandra.apache.org/); to see why, refer to http://www.datastax.com/why-cassandra
For accessing Cassandra from Java, refer to https://github.com/jmctee/Cassandra-Client-Tutorial
It is quite helpful, and for applying the time series concept in Cassandra, refer to
http://www.datastax.com/wp-content/uploads/2012/08/C2012-ColumnsandEnoughTime-JohnAkred.pdf

What are performant, scalable ways to resolve links from http://bit.ly [closed]

Given a series of URLs from a stream where millions could be bit.ly, Google, or TinyURL shortened links, what is the most scalable way to resolve them to get their final URL?
A multi-threaded crawler doing HEAD requests on each short link while caching ones you've already resolved? Are there services that already provide this?
Also factor in not getting blocked by the URL shortening services.
Assume the scale is 20 million shortened URLs per day.
Google provides an API. So does bit.ly (and bit.ly asks to be notified of heavy use, and specifies what they mean by light usage). I am not aware of an appropriate API for TinyURL (for decoding), but there may be one.
Then you have to fetch on the order of 230 URLs per second (20 million / 86,400 seconds) to keep up with your desired rates. I would measure typical latencies for each service and create one master actor and as many worker actors as needed so the actors can block on lookup. (I'd use Akka for this, not default Scala actors, and make sure each worker actor gets its own thread!)
You should also cache the answers locally; it's much faster to look up a known answer than to ask these services for one. (The master actor should take care of that; a small sketch of the caching idea follows below.)
After that, if you still can't keep up because of, for example, throttling by the sites, you had better either talk to the sites or do things that are rather questionable (rent a bunch of inexpensive servers at different sites and farm the requests out to them).
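A minimal sketch of the local caching idea (not the Akka actor setup the answer describes): concurrent requests for the same short URL share a single in-flight lookup via a map of futures. resolveRemotely() is a hypothetical placeholder for the actual call to bit.ly/Google/TinyURL:

```java
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;

public class ResolvedUrlCache {
    private final Map<String, CompletableFuture<String>> cache = new ConcurrentHashMap<>();

    public CompletableFuture<String> resolve(String shortUrl) {
        // computeIfAbsent guarantees only one lookup is started per short URL;
        // later callers get the same (possibly already completed) future.
        return cache.computeIfAbsent(shortUrl,
                url -> CompletableFuture.supplyAsync(() -> resolveRemotely(url)));
    }

    // Hypothetical remote lookup; in reality this would call the shortener's API
    // or issue an HTTP request and read the redirect target.
    private String resolveRemotely(String shortUrl) {
        return "https://example.com/expanded"; // placeholder
    }
}
```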
Using the HEAD method is an interesting idea, but I am afraid it can fail because I am not sure the services you mentioned support HEAD at all. If, for example, a service is implemented as a Java servlet, it might implement doGet() only; in this case doHead() is unsupported.
I'd suggest you try using GET but not reading the whole response; read only the HTTP status line (and the Location header that carries the redirect target).
Since you have very serious performance requirements, you cannot make these requests synchronously, i.e. you cannot use HttpURLConnection. You should use the NIO package directly. In that case you will be able to send requests to millions of destinations using only one thread and get the responses very quickly.
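For illustration, here is a minimal sketch of resolving one short link by issuing a request without following the redirect and reading only the status and Location header. It uses java.net.http.HttpClient (Java 11+) as a simpler stand-in for the raw NIO approach the answer mentions, and the example URL is illustrative:

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.Optional;

public class ShortLinkResolver {
    private static final HttpClient CLIENT = HttpClient.newBuilder()
            .followRedirects(HttpClient.Redirect.NEVER) // we want the Location header, not the target page
            .build();

    static Optional<String> resolve(String shortUrl) throws Exception {
        HttpRequest request = HttpRequest.newBuilder(URI.create(shortUrl))
                .method("HEAD", HttpRequest.BodyPublishers.noBody())
                .build();
        HttpResponse<Void> response =
                CLIENT.send(request, HttpResponse.BodyHandlers.discarding());
        // 301/302/307 responses carry the expanded URL in the Location header.
        return response.headers().firstValue("Location");
    }

    public static void main(String[] args) throws Exception {
        System.out.println(resolve("https://bit.ly/example")); // hypothetical short link
    }
}
```

If a service rejects HEAD (as the answer warns), switching the request method to GET while keeping the discarding body handler gives the same header-only behavior.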
