The Apache Projects and Big Data World - java

I'm a seasoned LAMP developer with decent experience in PHP, nginx, HAProxy, Redis, MongoDB, and AWS services. Whenever a large-data requirement comes to the table, I go with AWS, but recently I started reading about big data, expecting to play with the technology on my own instead of using a hosted service for large-scale data handling, stream processing, etc.
However, it's not the same journey as learning LAMP, and because of the nature of the use cases it's hard to find good resources for a newbie, especially for someone who hasn't been in the Java ecosystem. (To my understanding, Java software pretty much covers the popular big data stacks.) The list of software below pops up pretty much everywhere big data is discussed, but it's very hard to grasp what each one does, and the descriptions on each project's home page are pretty vague.
For instance, Cassandra: on the surface it's a good database for storing time series data, but when you read more about analytics, other stacks come up: Hadoop, Pig, Zookeeper, etc.
Cassandra
Flink
Flume
Hadoop
Hbase
Hive
Kafka
Spark
Zookeeper
So, in a nutshell: what do these pieces of software do? In the context of big data, some of these projects overlap in what they offer, so why do they co-exist? What's the advantage of each? When should you use what?

As for Hadoop, you have to understand that "Hadoop" can mean two things, depending on the context, a bit like the term "Linux", if you're familiar with that.
Only the core: the real "Hadoop" is just a file system for decentralized storage of very large files (HDFS), plus a framework for running queries against those files via Map/Reduce.
The whole ecosystem: this includes the core plus all the other tools that have been built on top of Hadoop for data analytics. Flume, HBase, Hive, Kafka, Spark, and Zookeeper belong to this category; Flink might as well, but I am not sure.
Cassandra might also belong to the second category, because "Hadoop integration was added way back in version 0.6 of Cassandra".
To understand the whole ecosystem better, you have to understand how this is all structured:
From bottom to top:
Bottom layer: here you have the distributed file system and the Map/Reduce framework. HDFS is the name of the file system; you will see this term a lot. On top of HDFS you can use HBase, a column-oriented database ¹. (A minimal Map/Reduce sketch follows below.)
Middle layer, execution engines: in the middle we have several different engines that can query the Hadoop file system for information. Actually, some people put Map/Reduce on this second layer, because the Hadoop environment now also includes Tez and Spark. Tez speeds up queries by modelling Map/Reduce execution as a graph, I think, and Spark is an in-memory engine.
Top layer, user abstractions: on top of the execution engines you have the user APIs/abstractions. These include Apache Hive (SQL-like queries) and Pig (in my eyes a mixture of SQL and a programming language). But there are also more specialized abstractions like MLlib, a machine-learning library for a Hadoop system that uses Spark as the middle layer.
Somewhere off to the side, we also have tools for coordinating and feeding this whole ecosystem: managing servers, moving data in and out, and so on. This is where Kafka and Zookeeper belong: Zookeeper handles distributed coordination and configuration, while Kafka is a distributed publish/subscribe message log commonly used to stream data into (and out of) the system.
¹ I currently do not understand the relationship between HBase and ORC/Parquet files.
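To make the Map/Reduce part less abstract, here is a minimal sketch of the classic word-count job over files in HDFS, using the standard org.apache.hadoop.mapreduce API: the mapper emits (word, 1) pairs and the reducer sums them. (The job wiring is shown in a later answer on this page.)

    import java.io.IOException;
    import java.util.StringTokenizer;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    public class WordCount {

        // Map phase: runs in parallel across HDFS blocks, one call per input line.
        public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
            private static final IntWritable ONE = new IntWritable(1);
            private final Text word = new Text();

            @Override
            public void map(Object key, Text value, Context context)
                    throws IOException, InterruptedException {
                StringTokenizer itr = new StringTokenizer(value.toString());
                while (itr.hasMoreTokens()) {
                    word.set(itr.nextToken());
                    context.write(word, ONE); // emit (word, 1)
                }
            }
        }

        // Reduce phase: receives all counts for one word, grouped by the framework.
        public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
            private final IntWritable result = new IntWritable();

            @Override
            public void reduce(Text key, Iterable<IntWritable> values, Context context)
                    throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable val : values) {
                    sum += val.get();
                }
                result.set(sum);
                context.write(key, result); // emit (word, total)
            }
        }
    }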

Related

Java Web platform for data analysis

I am working on a data analytics project and I built my website with Laravel (PHP).
However, I am now required to:
analyze a massive amount of data from the database
keep a lot of objects in memory
have a system running 24/7 analyzing and processing data
I don't believe PHP is best suited for this task and was thinking of using Java instead (as an API that processes the data and returns the results to my website for viewing). It will have to run on a server.
These are some of the types of data analysis I need to do:
Retrieve 10,000+ records from MySQL and hold them in memory. Analyze the data for patterns. Build models from the data. Analyze graphs.
I have never used any Java services/frameworks and was wondering what is best suited for my task. What I came across was:
Spring
Jersey
You could try combining Apache Storm with the Spring Framework to solve your problem. I am currently working on a project similar to yours.
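Whichever framework you pick, the core of the job (pull rows out of MySQL, hold them in memory, scan them for patterns) is plain Java. Here is a minimal sketch using the standard JDBC API with the MySQL Connector/J driver on the classpath; the database, table, and column names are made up for illustration:

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.PreparedStatement;
    import java.sql.ResultSet;
    import java.util.ArrayList;
    import java.util.List;

    public class PatternScan {
        // Simple in-memory record; the fields are hypothetical.
        record Reading(long id, double value) {}

        public static void main(String[] args) throws Exception {
            String url = "jdbc:mysql://localhost:3306/analytics"; // assumed schema name
            List<Reading> readings = new ArrayList<>();

            // Pull the 10,000+ rows once and keep them in memory.
            try (Connection conn = DriverManager.getConnection(url, "user", "password");
                 PreparedStatement stmt = conn.prepareStatement(
                         "SELECT id, value FROM readings ORDER BY id"); // hypothetical table
                 ResultSet rs = stmt.executeQuery()) {
                while (rs.next()) {
                    readings.add(new Reading(rs.getLong("id"), rs.getDouble("value")));
                }
            }

            // Stand-in "analysis": summary statistics instead of real model building.
            double mean = readings.stream().mapToDouble(Reading::value).average().orElse(0);
            double max = readings.stream().mapToDouble(Reading::value).max().orElse(0);
            System.out.printf("n=%d mean=%.3f max=%.3f%n", readings.size(), mean, max);
        }
    }

Wrapping something like this in a Spring service, or exposing it through a Jersey resource, then gives you the API your website can call.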
I had similar troubles before and, not to advertise, later used FineReport on other people's recommendation. It is a Java-based web reporting tool that can do a lot of dataset analysis and supports MySQL and other databases; you could give it a try.

What are the pros/cons/substitutes for using MySQL Connector/MXJ for an application

I recently made an interesting application using the Play Framework and MySQL Connector/MXJ to build a completely portable web server with a database, independent of any currently installed software (including Java).
I'm still new to MXJ and to the desktop application realm (as opposed to straight-up web apps), so I'm wondering if there are other, better methods for storing and accessing large amounts of data than embedded MySQL. I would assume so, since it seems not many people use MXJ. It essentially just packs mysqld.exe, in its various forms, for multiple operating systems and platforms. It runs in its own thread and stores its data in whatever directory you provide.
For an application that frequently analyzes and searches through data in large chunks (100 MB to 5 GB), what other (fast) options are there, or am I justified in my web-app laziness of bringing MySQL along?
Independent of any currently installed software (including Java).
If you are looking for an embedded database for a desktop application, you can go with SQLite. There are pros and cons to using either MySQL or SQLite; a minimal embedding sketch follows after the comparison below.
SQLite:
Easier to setup
Great for temporary (testing) databases
Great for rapid development
Great for embedding in an application
Doesn't have user management
Doesn't have many performance features
Doesn't scale well.
MySQL:
Far more difficult/complex to set up
Better options for performance tuning
Fit for a production database
Can scale well if tuned properly
Can manage users, permissions, etc.
You can find more info on when to use SQLite here
UPDATE: I came across HSQLDB and here are its test results. HamsterDb is another option.
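As promised above, here is a minimal sketch of embedding SQLite in a Java application via the widely used sqlite-jdbc driver (the org.xerial sqlite-jdbc artifact must be on the classpath; the file and table names are arbitrary):

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class EmbeddedDbDemo {
        public static void main(String[] args) throws Exception {
            // The whole database lives in one local file; no server process needed.
            try (Connection conn = DriverManager.getConnection("jdbc:sqlite:app-data.db");
                 Statement st = conn.createStatement()) {
                st.executeUpdate("CREATE TABLE IF NOT EXISTS notes (id INTEGER PRIMARY KEY, body TEXT)");
                st.executeUpdate("INSERT INTO notes (body) VALUES ('hello embedded world')");
                try (ResultSet rs = st.executeQuery("SELECT id, body FROM notes")) {
                    while (rs.next()) {
                        System.out.println(rs.getInt("id") + ": " + rs.getString("body"));
                    }
                }
            }
        }
    }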
Do you really need a database if your app is single-user and desktop-based? Maybe it is faster to simply write large files to the local filesystem and then load them back, rather than going through the network tier. If your app is very complex you could use an embedded DB just for storing your domain and configuration, but if it's not, maybe you can avoid a DB + SQL + O/R mapping and so on.
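If you do go the plain-files route, the java.nio.file API is enough; a tiny sketch (the file name and line format are arbitrary):

    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.nio.file.Paths;
    import java.util.List;

    public class FileStoreDemo {
        public static void main(String[] args) throws IOException {
            Path data = Paths.get("domain-data.txt"); // arbitrary local file

            // Persist the domain as plain lines: no server, no SQL, no O/R mapping.
            Files.write(data, List.of("alice,42", "bob,17"));

            // Load everything back and scan it in memory.
            for (String line : Files.readAllLines(data)) {
                String[] parts = line.split(",");
                System.out.println(parts[0] + " -> " + parts[1]);
            }
        }
    }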

Cloud for Flex, Java, mongoDb?

I am about to develop my master's project using Flex as the front end, with BlazeDS, Java web services, and MongoDB in the backend. I am looking to deploy and manage it on a cloud. (The application analyzes financial data from various sources; I will need to query multiple endpoints for news articles and the DB for processing.)
This is my experiment in using a cloud rather than deploying locally for demo and presentation purposes.
I looked at Heroku (http://www.heroku.com/), but I am not sure if it supports Flash.
Please suggest a cloud application platform that supports Flex, BlazeDS, Java web services, and MongoDB.
Amazon Web Services is a good place to start. You can have an instance ready within 15-30 minutes of signing up. If you are just experimenting, try to get the Amazon Linux AMI (Amazon Machine Image) up and running. Scour the net for HOWTOs on setting up Tomcat; for your requirements full Java EE might be too much, but you know your needs better.
A word of advice, though: get your application working on a local machine first. Then drop the programmer hat and put on the deployment hat 100%, because configuring the deployment environment is a beast: Tomcat configuration, BlazeDS, Mongo's failover servers, load balancers, and all kinds of non-programming tasks. You will want to keep your development stack close to home so you can diagnose problems quickly.
The cloud is great mainly when you want to 1) avoid using your home PC and bandwidth as a server, 2) have global mirror points so that latency for users in one part of the world is not worse than in another, or 3) distribute the computing load of one application across many instances of the same application.
Clouds are relatively cheap to deploy to, but if your application hoards GBs of bandwidth and storage, be prepared to fork over $1000s+ in costs. You can save money by going with an OS that has no licensing costs.

matlab and enterprise applications

I have a long background in enterprise engineering, but as circumstance has it, I have found my role changing. I have been tasked with leading a quantitative finance group performing time series evaluation of proprietary data.
Our application stack (on the engineering side, which I have no influence over but need to interface with) is Java (or Scala) with Hibernate 3.x (annotations and XML) running on Tomcat. Tons of experienced software guys...
I need data from them for two functions:
research (I imagine pulling straight from the DB)
as parameters to any algorithms we develop (described below)
My team is mostly folks with math and computational finance degrees, a couple with limited Java experience (I have considerable .NET experience as well).
We are tasked to:
develop (multiple) algorithms that generate discrete trading signals (events) out of our underlying data
apply those algorithms to events coming from our web applications in real time
raise any trading signals (events) back to the application stack as they occur
a. display events visually in the application
b. send events to clients over the internet (somehow)
The best case is that whatever tool (MATLAB) we use for algorithm research and development is also used in the production environment, completely integrated with our production systems (as a listener to events, and then again as a source of events feeding back in).
The worst case is that any algorithm we develop needs to be reimplemented in Java/Scala for integration.
My questions are:
1. Is MATLAB's Java integration sufficient for this? They are not using an application server (like JBoss), so I guess each Tomcat machine is logically and physically its own JVM instance, and I don't see any JVM constraints (such as MATLAB owning its own instance) as a major obstacle.
2. Has anybody interfaced MATLAB to a database over Hibernate?
3. Is .NET a better choice for interfacing with MATLAB? If so, which features does it offer that the Java integration does not?
4. What capabilities does MATLAB have to "compile" your work into modules and plug it into standard unit testing and automated build processes (e.g. Hudson)?
Thanks
Thanks
1. MATLAB's Java integration is sufficient for your aims. There is no issue in using Java classes from the MATLAB JVM, and they can interact with JBoss as well.
2. Yes, through JBoss.
3. I have never touched .NET, but you won't get the seamless support you see for Java. Using Java, you may use MATLAB as a Java scripting engine, similar to projects like Groovy, or use instances of proxy classes via API calls.
4. Use MATLAB Builder JA to generate Java classes from your MATLAB code. The compiled code can be tested with any black-box testing tool.
Regarding #4: For testing inside the MATLAB environment, I recommend Steve Eddins' test framework: http://www.mathworks.com/matlabcentral/fileexchange/22846
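To make the Builder JA route concrete, the calling convention on the Java side looks roughly like the sketch below. This is a hedged illustration, not verified against a specific Builder JA version: the package signals, the class SignalEngine, and the MATLAB function detect are hypothetical names standing in for whatever you compile.

    import com.mathworks.toolbox.javabuilder.MWException;
    import signals.SignalEngine; // hypothetical class generated by MATLAB Builder JA

    public class SignalBridge {
        public static void main(String[] args) throws MWException {
            SignalEngine engine = null;
            try {
                // Constructing a generated class starts the MATLAB Compiler Runtime (MCR).
                engine = new SignalEngine();
                double[] prices = {101.2, 101.9, 100.7, 102.3};
                // Generated methods take nargout (number of requested outputs) first,
                // then the MATLAB function's inputs; plain Java arrays are converted.
                Object[] result = engine.detect(1, prices);
                System.out.println(result[0]); // e.g. an MWArray of signal flags
            } finally {
                if (engine != null) engine.dispose(); // release MCR resources
            }
        }
    }

Because it is plain Java after compilation, a class like this can be driven from JUnit and wired into a Hudson build like any other jar.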

Which Map-Reduce library and/or platform to use with java

I have been reading and hearing about cloud computing and map-reduce techniques lately. I am thinking of playing around with some algorithms to get practical experience in the field and see what is possible right now.
Here is what I want to do:
I would like to use a public cloud platform (e.g. Google App Engine, Google MapReduce, Amazon EC2, Amazon Elastic MapReduce) that comes with built-in map-reduce functionality, or, if it lacks built-in support, use an additional map-reduce Java library (e.g. Hadoop, Hive), and implement/deploy some algorithms.
Does anyone have experience in this field and can point out a good place to start, or name some combinations that have worked well in practice?
Thanks in advance!
Amazon EC2 has some pre-bundled Hadoop AMIs. See Running Hadoop on Amazon EC2 for a tutorial.
In particular, the Cloudera distribution comes to mind; it ships with Pig and Hive as well.
Apache Hadoop is the major open-source Java distributed computing framework, and it includes a MapReduce subproject based on the original Google MapReduce.
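To connect this with the word-count mapper/reducer sketched earlier on this page, here is a minimal Hadoop job driver; input and output paths come from the command line (on EC2 these would typically be HDFS or S3 paths):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCountDriver {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            Job job = Job.getInstance(conf, "word count");
            job.setJarByClass(WordCountDriver.class);
            job.setMapperClass(WordCount.TokenizerMapper.class);
            job.setCombinerClass(WordCount.IntSumReducer.class); // local pre-aggregation
            job.setReducerClass(WordCount.IntSumReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1])); // must not exist yet
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }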
