Apache Kafka Consumer - Analytics Platform - Visualize data - java

I am new to Apache Kafka and also to data analytics.
I am able to consume messages from an Apache Kafka consumer in Java.
Now I want to take this real-time data and display it on a dashboard.
I want to visualize all of this data using an open source tool.
One tool I found is Druid, but the documentation provided is not enough to learn it and proceed with it.
I have also read that Druid is very difficult to install and deploy in production.
Are there any other tools available to do this? Any help is appreciated.
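
For context, here is a minimal sketch of the kind of consumer loop I have working (the broker address, group id and topic name are just placeholders):

    import java.time.Duration;
    import java.util.Collections;
    import java.util.Properties;
    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.apache.kafka.clients.consumer.ConsumerRecords;
    import org.apache.kafka.clients.consumer.KafkaConsumer;

    public class DashboardConsumer {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092");   // placeholder broker
            props.put("group.id", "dashboard-group");           // placeholder group id
            props.put("key.deserializer",
                    "org.apache.kafka.common.serialization.StringDeserializer");
            props.put("value.deserializer",
                    "org.apache.kafka.common.serialization.StringDeserializer");

            try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
                consumer.subscribe(Collections.singletonList("metrics"));   // placeholder topic
                while (true) {
                    ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                    for (ConsumerRecord<String, String> record : records) {
                        // this is the point where each event would need to reach a dashboard
                        System.out.printf("key=%s value=%s%n", record.key(), record.value());
                    }
                }
            }
        }
    }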

You can use Apache Zeppelin https://zeppelin.apache.org/ to visualise your Kafka topics. It has a notebook-style web interface and it supports Java. You can write your code in a notebook and visualise its output.

I recently started using Metatron Discovery. https://metatron.app/
It is free, open source software for data visualization.
It supports Kafka, so you can visualize your real-time data with a wide variety of charts.
If you are interested in open source, this will be helpful.
https://github.com/metatron-app/metatron-discovery/

You could also use the Elastic stack. If you get your data with Kafka and then store it in Elasticsearch, you can quickly build a dashboard with Kibana. When I had to install and deploy it, I found it very easy to use.
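
If you go that route, here is a rough sketch of how each consumed record could be pushed into Elasticsearch from Java so Kibana can chart it. It assumes the Elasticsearch 7.x high-level REST client, JSON message values, and placeholder host/index names:

    import java.io.IOException;
    import org.apache.http.HttpHost;
    import org.elasticsearch.action.index.IndexRequest;
    import org.elasticsearch.client.RequestOptions;
    import org.elasticsearch.client.RestClient;
    import org.elasticsearch.client.RestHighLevelClient;
    import org.elasticsearch.common.xcontent.XContentType;

    public class EventIndexer {
        private final RestHighLevelClient es = new RestHighLevelClient(
                RestClient.builder(new HttpHost("localhost", 9200, "http")));   // placeholder host

        // called for every record consumed from Kafka; assumes the message value is JSON
        public void index(String jsonValue) throws IOException {
            IndexRequest request = new IndexRequest("kafka-events")             // placeholder index
                    .source(jsonValue, XContentType.JSON);
            es.index(request, RequestOptions.DEFAULT);
        }
    }

Kibana can then be pointed at the kafka-events index (or whatever you name it) to build the dashboard.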

Related

Kafka Connect internal architecture

I am trying to understand the internal nuances of Kafka Connect, such as how the design has been implemented and which patterns are used.
Specifically, I want to understand how to develop a similar app that can take an input configuration and start acting according to that configuration, so that when we have to implement a new feature we can just write connectors and others need not spend time reinventing the wheel.
Apache Kafka is open source, and you can find the Connect source code here.
But if you just want to create a source connector, the internals will not really help you do that.
Confluent has a blog post on developing custom source connectors.
Alternatively, rather than "reinventing the wheel" yourself, you could look at projects like Apache Camel to see if they already support the source system you're using.
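
To give a feel for what the Connect API asks of you, here is a bare skeleton of a custom source connector. The class names, topic and returned record are placeholders, not a working connector:

    import java.util.Collections;
    import java.util.List;
    import java.util.Map;
    import org.apache.kafka.common.config.ConfigDef;
    import org.apache.kafka.connect.connector.Task;
    import org.apache.kafka.connect.data.Schema;
    import org.apache.kafka.connect.source.SourceConnector;
    import org.apache.kafka.connect.source.SourceRecord;
    import org.apache.kafka.connect.source.SourceTask;

    // The connector class validates and splits the configuration;
    // the task class does the actual polling of the external system.
    public class MySourceConnector extends SourceConnector {
        private Map<String, String> config;

        @Override public void start(Map<String, String> props) { this.config = props; }
        @Override public Class<? extends Task> taskClass() { return MySourceTask.class; }
        @Override public List<Map<String, String>> taskConfigs(int maxTasks) {
            return Collections.singletonList(config);   // one task, same config
        }
        @Override public void stop() { }
        @Override public ConfigDef config() { return new ConfigDef(); }
        @Override public String version() { return "0.1"; }
    }

    class MySourceTask extends SourceTask {
        @Override public void start(Map<String, String> props) { /* open connection to the source */ }
        @Override public List<SourceRecord> poll() throws InterruptedException {
            // read from the external system and wrap each value in a SourceRecord
            SourceRecord record = new SourceRecord(
                    Collections.singletonMap("partition", "0"),   // source partition
                    Collections.singletonMap("offset", 0L),       // source offset
                    "my-topic", Schema.STRING_SCHEMA, "hello");   // placeholder topic and value
            return Collections.singletonList(record);
        }
        @Override public void stop() { }
        @Override public String version() { return "0.1"; }
    }

The Connect runtime takes care of the rest: reading your configuration, scheduling tasks, committing offsets and writing the records to Kafka.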

Java Web platform for data analysis

I am working on a data analytics project and I built my website with Laravel (PHP).
However, I am now required to:
analyze a massive amount of data from the database
keep a lot of in-memory objects
have a system running 24/7 analyzing and processing data
I don't believe that PHP is best suited for this task and was thinking of using Java instead (using it as an API that will process the data and return the results to my website for viewing). It will have to run on a server.
These are some types of data analysis that I need to do :
Retrieve 10,000-plus records from MySQL and hold them. Analyze the data for patterns. Build models from the data. Analyze graphs.
I have never used any Java services/frameworks online and was wondering what is best suited for my task. What I came across was:
Spring
Jersey
You could try combining Apache Storm with the Spring Framework to solve your problem. I am currently working on a similar project to yours.
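
If you go the Spring route, here is a minimal sketch of a Spring Boot endpoint that the Laravel site could call over HTTP. The path and returned values are placeholders standing in for your real analysis results:

    import java.util.Arrays;
    import java.util.List;
    import org.springframework.boot.SpringApplication;
    import org.springframework.boot.autoconfigure.SpringBootApplication;
    import org.springframework.web.bind.annotation.GetMapping;
    import org.springframework.web.bind.annotation.RestController;

    @SpringBootApplication
    @RestController
    public class AnalyticsApiApplication {

        public static void main(String[] args) {
            SpringApplication.run(AnalyticsApiApplication.class, args);
        }

        // placeholder endpoint: the long-running Java service does the heavy analysis,
        // and the PHP/Laravel site just calls this over HTTP and renders the JSON result
        @GetMapping("/api/patterns")
        public List<String> patterns() {
            return Arrays.asList("pattern-a", "pattern-b");   // stand-in for real results
        }
    }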
I had similar trouble before and, not to advertise, later used FineReport, which other people had recommended. It is Java-based web reporting software that can do a lot of dataset analysis and supports MySQL and other databases, so you could give it a try ~

Spring Boot production monitoring

Spring Boot Actuator exposes a lot of metrics and information about the deployed container. However, production operations guys probably don't want to stare at raw JSON objects in their browser :)
What would be good "standard" tools for monitoring this in production? This would include graphs, triggers for alerts, etc.
The spring-boot-admin project is also a great monitoring tool that your production support guys may be interested in. It doesn't process and graph the metrics like graphite+grafana do, but it is a great, simple tool to set up and use to see the state of all of your running spring-boot applications.
https://github.com/codecentric/spring-boot-admin
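
For reference, a minimal sketch of a Spring Boot Admin server. It assumes the spring-boot-admin-starter-server dependency is on the classpath, and the exact package of @EnableAdminServer depends on the Spring Boot Admin version:

    import de.codecentric.boot.admin.server.config.EnableAdminServer;
    import org.springframework.boot.SpringApplication;
    import org.springframework.boot.autoconfigure.SpringBootApplication;

    // client applications then register themselves (or are discovered) and show up in the UI
    @SpringBootApplication
    @EnableAdminServer
    public class AdminServerApplication {
        public static void main(String[] args) {
            SpringApplication.run(AdminServerApplication.class, args);
        }
    }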
You're right! Looking at JSON objects all day is not that pretty. One setup that our team finds handy is the following:
jmxtrans to export the data to Graphite.
Grafana to show the data in a nice way, after pulling the metrics from Graphite. Documentation for that is on Grafana's website.
Nagios for triggering alerts, pulling the same data from Graphite; there's a nice module here for that.

The Apache Projects and Big Data World

I'm a seasoned LAMP developer and have decent experience with PHP, nginx, HAProxy, Redis, MongoDB, and AWS services. Whenever a large-data requirement comes to the table I go with AWS web services, and I recently started reading about big data, expecting to play with the technology on my own instead of using a hosted service for large data handling, stream processing, etc.
However, it's not the same journey as learning LAMP, and because of the nature of the use cases it's hard to find good resources for a newbie, especially for someone who hasn't been in the Java ecosystem. (To my understanding, Java software pretty much covers the popular big data stacks.) The list of software below pops up pretty much everywhere big data is discussed, but it's very hard to grasp the concept of each, and the descriptions available on each project's home page are pretty vague.
For instance, "Cassandra": on the surface it's a good database for storing time-series data, but when reading more about analytics, other stacks come up: Hadoop, Pig, Zookeeper, etc.
Cassandra
Flink
Flume
Hadoop
Hbase
Hive
Kafka
Spark
Zookeeper
So, in a nutshell, what do these tools do? In the context of big data, some of these projects share the same aspects, so why do they co-exist? What's the advantage? When should you use what?
As for Hadoop, you have to understand that Hadoop can mean two things, depending on the context. A bit like the term "Linux", if you're familiar with that.
only the core: the real "Hadoop" is only a file system for decentralized storage of very large files, plus a request framework for those files via Map/Reduce.
the whole ecosystem: this includes the core and all the other tools that have been put on top of Hadoop for data analytics. Flume, Hbase, Hive, Kafka, Spark and Zookeeper are terms belonging to this category. Flink might also be; I am not sure.
Cassandra might also belong to the second category, because "Hadoop integration was added way back in version 0.6 of Cassandra".
To understand the whole ecosystem better, you have to understand how this is all structured:
From bottom to top:
bottom layer: Here you have your distributed file system and the Map/Reduce request framework. HDFS is the name of the file system; you will see this term a lot. On top of HDFS you can use HBase, which is a column-oriented database ¹.
middle layer, execution engines: In the middle there are several different engines that can query the Hadoop file system for information. Actually, some people put Map/Reduce on a second layer, because the Hadoop environment now also includes Tez and Spark. Tez speeds up queries by using graphs for map/reduce execution, I think. And Spark is an in-memory engine (a small Spark example follows this answer).
top layer, user abstractions: On top of the execution engines you have the user APIs/abstractions. This includes Apache Hive (SQL-like queries) and Pig (in my eyes a mixture of SQL and a programming language). There are also more specialized abstractions such as MLlib, a library for machine learning on top of a Hadoop system that uses Spark as the middle layer.
Somewhere off to the side, there are also management tools for this whole ecosystem: managing servers, managing the task execution order (job scheduling) and so on. Zookeeper belongs here; Kafka sits a bit apart, as a distributed message log that is typically used to stream data into (and out of) the ecosystem.
¹ I currently do not understand the relationship between HBase vs. ORC files or Parquet.
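
To make the layering a bit more concrete, here is a small, hedged Spark example in Java: Spark (the execution engine) reads a file from HDFS (the storage layer), while the Java API hides the distributed execution. The path and app name are placeholders.

    import java.util.Arrays;
    import java.util.Map;
    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.api.java.JavaSparkContext;

    public class WordCount {
        public static void main(String[] args) {
            SparkConf conf = new SparkConf().setAppName("word-count");
            try (JavaSparkContext sc = new JavaSparkContext(conf)) {
                // the engine reads from the storage layer (HDFS)...
                JavaRDD<String> lines = sc.textFile("hdfs:///data/input.txt");   // placeholder path
                // ...while the user-facing API hides the distributed execution details
                Map<String, Long> counts = lines
                        .flatMap(line -> Arrays.asList(line.split("\\s+")).iterator())
                        .countByValue();
                counts.forEach((word, n) -> System.out.println(word + ": " + n));
            }
        }
    }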

How to get Java to access Cassandra 1.0.10

Can anyone recommend a good way to create scripts in Java that will work with an older Cassandra 1.0.10 database? I'm having trouble finding information online. Is Thrift a type of driver?
Thanks!
Apache Thrift is a way to connect to the Cassandra RPC server 1).
In the Cassandra source tree there is a file /interface/cassandra.thrift, which is an interface description file (IDL) that can be fed to the Apache Thrift compiler in order to generate Java code. By means of this Java code you will be able to access Cassandra. The whole process is described in more detail in the Cassandra wiki.
However, it is recommended to use a higher-level client library instead, because the raw Cassandra API is quite complex. You'll find the existing libraries, such as Hector, much more handy for your task.
1) Some more details about Thrift can be found in this answer.
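
For reference, a rough sketch of a single-column read with Hector. The cluster name, host, keyspace, column family, key and column name are placeholders, and the exact API may differ between Hector versions:

    import me.prettyprint.cassandra.serializers.StringSerializer;
    import me.prettyprint.hector.api.Cluster;
    import me.prettyprint.hector.api.Keyspace;
    import me.prettyprint.hector.api.beans.HColumn;
    import me.prettyprint.hector.api.factory.HFactory;
    import me.prettyprint.hector.api.query.ColumnQuery;
    import me.prettyprint.hector.api.query.QueryResult;

    public class HectorRead {
        public static void main(String[] args) {
            Cluster cluster = HFactory.getOrCreateCluster("test-cluster", "localhost:9160");
            Keyspace keyspace = HFactory.createKeyspace("MyKeyspace", cluster);

            // read one column ("first") of one row ("jsmith") from the "Users" column family
            ColumnQuery<String, String, String> query = HFactory.createColumnQuery(
                    keyspace, StringSerializer.get(), StringSerializer.get(), StringSerializer.get());
            query.setColumnFamily("Users").setKey("jsmith").setName("first");

            QueryResult<HColumn<String, String>> result = query.execute();
            System.out.println(result.get() == null ? "not found" : result.get().getValue());
        }
    }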
