I am working on a data analytics project and I built my website with Laravel (PHP).
However, I am now required to:
Analyze a massive amount of data from the database
Keep a lot of objects in memory
Have a system running 24/7 that analyzes and processes data
I don't believe that PHP is best suited for this task and was thinking of using Java instead (as an API that processes the data and returns the results to my website for viewing). It will have to run on a server.
These are some types of data analysis that I need to do:
Retrieve 10,000+ records from MySQL and hold them in memory. Analyze the data for patterns. Build models from the data. Analyze graphs.
I have never used any Java services/frameworks and was wondering what is best suited to my task. What I came across was:
Spring
Jersey
You could try combining Apache Storm with the Spring Framework to solve your problem. I am currently working on a similar project.
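If you go the Spring route, a minimal sketch of such an API could look like the following. This is only an illustration: it assumes a hypothetical "measurements" table with a numeric "value" column in MySQL, plus a configured DataSource (spring-boot-starter-jdbc and the usual spring.datasource.* properties).

import java.util.List;
import java.util.Map;
import org.springframework.boot.SpringApplication;
import org.springframework.boot.autoconfigure.SpringBootApplication;
import org.springframework.jdbc.core.JdbcTemplate;
import org.springframework.web.bind.annotation.GetMapping;
import org.springframework.web.bind.annotation.RestController;

@SpringBootApplication
@RestController
public class AnalysisApi {

    private final JdbcTemplate jdbc;

    public AnalysisApi(JdbcTemplate jdbc) {
        this.jdbc = jdbc;
    }

    // The PHP site calls this endpoint and renders the returned JSON.
    @GetMapping("/api/summary")
    public Map<String, Object> summary() {
        // Pull the records into memory; 10,000+ rows is no problem for the JVM.
        List<Double> values = jdbc.queryForList(
                "SELECT value FROM measurements", Double.class);

        double mean = values.stream().mapToDouble(Double::doubleValue).average().orElse(0);
        double max = values.stream().mapToDouble(Double::doubleValue).max().orElse(0);

        return Map.of("count", values.size(), "mean", mean, "max", max);
    }

    public static void main(String[] args) {
        SpringApplication.run(AnalysisApi.class, args);
    }
}

The JVM process runs 24/7 on the server and keeps its models in memory; your Laravel site only has to issue HTTP calls to it.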
I had similar trouble before and, on other people's recommendation (not advertising), ended up using FineReport. It is a Java-based web reporting tool that can do a lot of dataset analysis and supports MySQL and other databases; you could give it a try.
I'm a seasoned LAMP developer and have decent experience with PHP, Nginx, HAProxy, Redis, MongoDB, and AWS services. Whenever a large data requirement comes to the table I go with AWS services, and I recently started reading about big data, expecting to play with the technology on my own instead of using a hosted service for large data handling, stream processing, etc.
However, it's not the same journey as learning LAMP, and because of the nature of the use cases it's hard to find good resources for a newbie, especially for someone who hasn't been in the Java ecosystem. (To my understanding, Java software pretty much covers the popular big data stacks.) The list of software below pops up pretty much everywhere big data is discussed, but it's very hard to grasp the concept of each one, and the descriptions on each project's home page are pretty vague.
For instance "Cassandra": on the surface it's a good database for storing time series data, but when reading more about analytics, other stacks come up: Hadoop, Pig, Zookeeper, etc.
Cassandra
Flink
Flume
Hadoop
Hbase
Hive
Kafka
Spark
Zookeeper
So, in a nutshell, what do these pieces of software do? In the context of big data, some of these projects share the same aspects, so why do they co-exist? What's the advantage? When should you use what?
As for Hadoop, you have to understand that Hadoop can mean two things depending on the context, a bit like the term "Linux", if you're familiar with that.
Only the core: the real "Hadoop" is just a file system for distributed storage of very large files plus a framework for querying those files via Map/Reduce.
The whole ecosystem: this includes the core and all the other tools that have been put on top of Hadoop for data analytics. Flume, HBase, Hive, Kafka, Spark and Zookeeper are terms belonging to this category; Flink might be as well, I am not sure.
Cassandra might also belong to the second category, because "Hadoop integration was added way back in version 0.6 of Cassandra".
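To make the "core" meaning concrete, here is the classic word-count job written against the Hadoop MapReduce Java API. The input and output arguments are just placeholders for directories on HDFS.

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Mapper: split each line into words and emit (word, 1).
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer it = new StringTokenizer(value.toString());
            while (it.hasMoreTokens()) {
                word.set(it.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reducer: sum the counts for each word.
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // HDFS input directory
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // HDFS output directory
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

Everything else in the list is layered on top of (or next to) this storage-plus-Map/Reduce core.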
To understand the whole ecosystem better, you have to understand how this is all structured:
From bottom to top:
Bottom layer: here you have the distributed file system and the Map/Reduce request framework. HDFS is the name of the file system; you will see this term a lot. On top of HDFS you can use HBase, which is a column-oriented database ¹.
Middle layer, execution engines: in the middle we have several different engines that can query the Hadoop file system for information. Actually, some people put Map/Reduce on this second layer, because the Hadoop environment now also includes Tez and Spark. Tez speeds up queries by modeling the map/reduce execution as a graph of stages, I think, and Spark is an in-memory engine.
Top layer, user abstractions: on top of the execution engines you have the user APIs/abstractions. This includes Apache Hive (SQL-like queries) and Pig (in my eyes, a mixture of SQL and a programming language). There are also more specialized abstractions like MLlib, a library for machine learning on top of a Hadoop system that uses Spark as the middle layer.
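To make the layering a bit more concrete, here is a small sketch in Java that uses Spark as the execution engine with its SQL abstraction on top. The HDFS path and column names are made up for illustration.

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class SparkSqlExample {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("events-per-user")
                .master("local[*]") // on a cluster this is normally set by spark-submit
                .getOrCreate();

        // Read a Parquet file stored on HDFS into a DataFrame.
        Dataset<Row> events = spark.read().parquet("hdfs:///data/events");
        events.createOrReplaceTempView("events");

        // SQL-like query, executed in memory by Spark.
        Dataset<Row> perUser = spark.sql(
                "SELECT user_id, COUNT(*) AS n FROM events GROUP BY user_id");
        perUser.show();

        spark.stop();
    }
}

The same query could also be expressed in Hive or Pig; the abstraction changes, the data underneath stays on HDFS.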
Somewhere off to the side, we also have supporting tools for this whole ecosystem: coordinating the servers, moving data between systems, and so on. This is where Zookeeper (distributed coordination and configuration) and Kafka (a distributed message broker) belong.
¹ I currently do not understand the relationship between HBase and ORC files or Parquet.
I am developing a webapp and am looking into how I can automate testing of the website, such as seeing how it copes with multiple concurrent users / heavy traffic. Could anyone point me in the direction of any software or techniques I could use to help me do this?
I am also looking into how to automate testing at the front end. For example, I have unit tested all of my business logic at the back end, but I am unsure what I should do in order to automate testing of everything else.
For heavy traffic testing, I've been using JMeter. For front end testing, I'm using Selenium.
Besides Apache JMeter, which generates artificial load and allows you to test performance, there are two main technologies for accurately measuring performance during operation:
Tagging Systems (like Google Analytics)
Access Log File Analysis
With tagging, you create an account with Google Analytics and add some JavaScript code to the relevant places in your pages; this allows your visitors' browsers to connect to GA and be captured there.
The access log file holds information about every request and session. That is a lot of data, so it has to be Extracted, Transformed and Loaded (ETL) into a database. The evaluation can then be performed in near real time. You can create a dashboard application that does the ETL and displays the status of your application in near real time.
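As a rough sketch of the "extract" step, assuming an Apache/Nginx combined-format access log at a made-up path, you could count requests per path like this before loading the results into a database for the dashboard:

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.stream.Stream;

public class AccessLogEtl {
    // e.g. 203.0.113.7 - - [10/Oct/2023:13:55:36 +0000] "GET /index.html HTTP/1.1" 200 2326 ...
    private static final Pattern LINE = Pattern.compile(
            "^(\\S+) \\S+ \\S+ \\[([^\\]]+)\\] \"(\\S+) (\\S+) [^\"]*\" (\\d{3}) (\\S+).*");

    public static void main(String[] args) throws IOException {
        Map<String, Long> hitsPerPath = new HashMap<>();
        try (Stream<String> lines = Files.lines(Paths.get("/var/log/nginx/access.log"))) {
            lines.forEach(line -> {
                Matcher m = LINE.matcher(line);
                if (m.matches()) {
                    hitsPerPath.merge(m.group(4), 1L, Long::sum); // group 4 = request path
                }
            });
        }
        // "Load" step stubbed out: print instead of writing to a database.
        hitsPerPath.forEach((path, hits) -> System.out.println(path + " -> " + hits));
    }
}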
I had the same need some years ago while developing a large-scale webapp.
I've been using Apache JMeter for automated testing, and YourKit Java Profiler for profiling JVM heap usage, and actually found a lot of memory leaks!
cheers
Selenium to test the flow and expected results (see the sketch after this list)
YourKit to profile CPU and memory usage => excellent for tracking concurrency issues and memory leaks
Spring Insight to visually understand your application's performance / load, plus:
See the SQL executed for any page request => with drill down to the corresponding source code
Find pages which are executing slowly and drill into the cause
Verify your application's transactions are working as designed
Spring Insight is deployable as a standalone WAR (Tomcat / tc Server / etc.)
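For the Selenium part, a minimal WebDriver test might look like this. The URL, element ids and expected title are placeholders for your own app, and it assumes the chromedriver binary is available on the machine running the test.

import org.junit.jupiter.api.AfterEach;
import org.junit.jupiter.api.BeforeEach;
import org.junit.jupiter.api.Test;
import org.openqa.selenium.By;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.chrome.ChromeDriver;

import static org.junit.jupiter.api.Assertions.assertEquals;

class LoginFlowTest {

    private WebDriver driver;

    @BeforeEach
    void setUp() {
        driver = new ChromeDriver(); // requires chromedriver on the PATH
    }

    @Test
    void loginShowsDashboard() {
        driver.get("http://localhost:8080/login");
        driver.findElement(By.id("username")).sendKeys("demo");
        driver.findElement(By.id("password")).sendKeys("secret");
        driver.findElement(By.id("submit")).click();
        assertEquals("Dashboard", driver.getTitle()); // expected result of the flow
    }

    @AfterEach
    void tearDown() {
        driver.quit();
    }
}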
I have written a relatively simple Java App Engine application which I would like to be able to port to another cloud provider.
I am using the JDO datastore API so I think my data handling should be portable to other backends as listed here: http://www.datanucleus.org/products/accessplatform/index.html
I would ideally like to deploy my application onto EC2 with minimal code changes. What is my best approach?
Note: I am aware of the http://code.google.com/p/appscale/ project but I want to avoid using this as it doesn't look like they are updating very often.
AppScale remains your best option to avoid rewriting any code. They do keep up to date with official App Engine - for instance, they just released preliminary support for Go. Even if they weren't so assiduous at keeping up to date, though, this would only be relevant if some feature you required wasn't yet supported - and it sounds like your needs are fairly basic.
JDO should be trivial; there might be some Google-specific configuration here and there, but generally it should be easy. The storage model Google promotes is not bad for an RDBMS either, but you might need to fine-tune your model depending on the backend you end up with.
If you're not using the low-level Google APIs, you should be pretty much there.
I managed to get my application working on EC2 using the following components.
Tomcat 7
DataNucleus
HBase
I had to manually create a table in HBase for each of my data classes, but was able to configure DataNucleus to auto-create the columns.
I also had to change my primary key value generation strategy from identity to increment, as per this table of supported features:
http://www.datanucleus.org/products/accessplatform_3_0/datastore_features.html
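For illustration, the change boils down to swapping the value strategy on the primary key of each persistence-capable class. The class and field names here are made up:

import javax.jdo.annotations.IdGeneratorStrategy;
import javax.jdo.annotations.PersistenceCapable;
import javax.jdo.annotations.PrimaryKey;
import javax.jdo.annotations.Persistent;

@PersistenceCapable
public class Customer {

    @PrimaryKey
    // Was: @Persistent(valueStrategy = IdGeneratorStrategy.IDENTITY)
    @Persistent(valueStrategy = IdGeneratorStrategy.INCREMENT)
    private Long id;

    @Persistent
    private String name;
}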
I am developing a Java application using Google App Engine that depends on a largish dataset to be present. Without getting into specifics of my application, I'll just state that working with a small subset of the data is simply not practical. Unfortunately, at the time of this writing, the Google App Engine for Java development server stores the entire datastore in memory. According to Ikai Lan:
The development server datastore stub is an in memory Map that is persisted
to disk.
I simply cannot import my entire dataset into the development datastore without running into memory problems. Once the application is pushed into Google's cloud and uses BigTable, there is no issue. But deployment to the cloud takes a long time, making development cycles kind of painful. So developing this way is not practical.
I've noticed the Google App Engine for Python development server has an option to use SQLite as the backend datastore which I presume would solve my problem.
dev_appserver.py --use_sqlite
But the Java development server includes no such option (at least not documented). What is the best way to get a large dataset working with the Google App Engine Java development server?
There's no magic solution - the only datastore stub for the Java API, currently, is an in-memory one. Short of implementing your own disk-based stub, your only options are to find a way to work with a subset of data for testing, or do your development on appspot.
I've been using the Mapper API to import data from the blobstore, as described by Ikai Lan in this blog entry - http://ikaisays.com/2010/08/11/using-the-app-engine-mapper-for-bulk-data-import/.
I've found it to be much faster and more stable than using the remote API bulk loader, especially when loading medium-sized datasets (100k entities) into the local datastore.
I've been reading a little about Google's AppEngine that provides application hosting. I've been trying it out as I think it looks quite interesting but I'm a bit concerned about the database part.
Say I'm developing my Java app locally. I don't want to deploy to Google every time I make a change to the code, so I set up a nice little servlet container on my development machine to test things easily. With App Engine you store things using their datastore API, which basically lets you model your data using Java objects - which is nice.
However, it seems like this data is embedded in the application code itself (inside the .war that is deployed to Google). Can I simply use their datastore API locally? How will it be stored on my local machine? Is this all handled by them, so that I just have to worry about using the datastore API, and when I deploy to Google the data will simply be stored in a different way than it is on my local machine?
I'm just a little confused because I'm used to having the data part layered out of my application code.
I hope I'm clear enough. Thanks.
The development datastore and the production datastore are two different, separate things:
The development datastore is typically a file-based datastore named local_db.bin that is only used to store your data in your testing environment; the data is not replicated to the production environment when you deploy your application.
This kind of datastore is meant to be used with a fairly small number of entities, and its performance has nothing to do with the powerful production datastore beast based on Bigtable.
All you need to do is use the Datastore API, which creates a level of abstraction between your code and the underlying datastore; in testing, your data will be stored in the local datastore file, while in production the data will be saved to the Google App Engine datastore, with all the features and limitations that this implies.
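As a small illustration of that abstraction, here is a snippet using the low-level datastore API; the "Greeting" kind and its properties are just examples. The same code works unchanged in both environments:

import com.google.appengine.api.datastore.DatastoreService;
import com.google.appengine.api.datastore.DatastoreServiceFactory;
import com.google.appengine.api.datastore.Entity;
import java.util.Date;

public class GreetingDao {

    public void saveGreeting(String content) {
        DatastoreService datastore = DatastoreServiceFactory.getDatastoreService();
        Entity greeting = new Entity("Greeting");     // kind name
        greeting.setProperty("content", content);
        greeting.setProperty("date", new Date());
        datastore.put(greeting);                      // local_db.bin on the dev server, Bigtable in production
    }
}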