I'm looking for an embeddable Java library that is suitable for collecting real-time streams of sensor data in a general-purpose way. I plan to use it to develop a "hub" application for reporting on multiple disparate sensor streams, running on a JVM-based server (I will also be using Clojure for this).
Key things it needs to have:
Interfaces for various common sensor types / APIs. I'm happy to build what I need myself, but it would be nice if some standard stuff comes out of the box.
Suitable for "soft real time" usage, i.e. fairly low latency and low overhead.
Ability to monitor and manage streams at runtime, gather statistics etc.
Open source under a reasonably permissive license so that I can integrate it with other code (Apache, EPL, BSD, LGPL all fine)
A reasonably active community / developer ecosystem
Is there something out there that fits this profile that you can recommend?
1. Round-robin database (Wikipedia)
RRDtool (acronym for round-robin database tool) aims to handle
time-series data like network bandwidth, temperatures, CPU load, etc.
The data are stored in a round-robin database (circular buffer), thus
the system storage footprint remains constant over time.
This approach/DB format is widely used, stable, and simple enough. Out of the box it allows you to generate nice plots.
There is a Java implementation, RRD4J:
RRD4J is a high performance data logging and graphing system for time
series data, implementing RRDTool's functionality in Java. It follows
much of the same logic and uses the same data sources, archive types
and definitions as RRDTool does. Open Source under Apache 2.0 License.
Update
I forgot to mention that there is a Clojure RRD API (examples).
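To give an idea of how little code the round-robin approach needs, here is a minimal sketch of logging one sensor reading with RRD4J's core classes (RrdDef, RrdDb, Sample); the file name, 300-second step, and data-source name are illustrative choices, not anything prescribed by the library:

```java
import org.rrd4j.ConsolFun;
import org.rrd4j.DsType;
import org.rrd4j.core.RrdDb;
import org.rrd4j.core.RrdDef;
import org.rrd4j.core.Sample;

public class SensorLog {
    public static void main(String[] args) throws Exception {
        // Define the RRD: one GAUGE data source sampled every 300 s,
        // averaged into 600 archived slots -- so the file size stays fixed.
        RrdDef def = new RrdDef("temp.rrd", 300);
        def.addDatasource("temp", DsType.GAUGE, 600, Double.NaN, Double.NaN);
        def.addArchive(ConsolFun.AVERAGE, 0.5, 1, 600);

        RrdDb db = new RrdDb(def);
        Sample sample = db.createSample();
        sample.setTime(System.currentTimeMillis() / 1000); // RRD time is in seconds
        sample.setValue("temp", 21.5);
        sample.update();
        db.close();
    }
}
```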
2. For some experiments with real-time data, I would suggest considering Perst.
It is small, fast and reliable enough, but distributed under GPLv3. Perst provides several indexing algorithms:
B-Tree
T-Tree (optimized for in-memory database)
R-Tree (spatial index)
Patricia Trie (prefix search)
KD-Tree (multidimensional index)
Time series (large number of fixed size objects with timestamp)
The last one suits your needs very well.
3. Neo4j with relationship indexes
A good example where this approach pays dividends is in time series
data, where we have readings represented as a relationship per
occurrence.
4. Oracle Berkeley DB Java Edition
Oracle Berkeley DB Java Edition is an open source, embeddable,
transactional storage engine written entirely in Java. It takes full
advantage of the Java environment to simplify development and
deployment. The architecture of Oracle Berkeley DB Java Edition
supports very high performance and concurrency for both read-intensive
and write-intensive workloads.
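As a feel for the Berkeley DB JE programming model, here is a hedged sketch of storing one timestamped reading through its Environment/Database API; the directory, database name, and zero-padded key layout are my own illustrative choices:

```java
import com.sleepycat.je.Database;
import com.sleepycat.je.DatabaseConfig;
import com.sleepycat.je.DatabaseEntry;
import com.sleepycat.je.Environment;
import com.sleepycat.je.EnvironmentConfig;
import java.io.File;

public class BdbDemo {
    public static void main(String[] args) {
        // Open (or create) a transactional environment in a directory.
        EnvironmentConfig envCfg = new EnvironmentConfig();
        envCfg.setAllowCreate(true);
        envCfg.setTransactional(true);
        Environment env = new Environment(new File("/tmp/bdb-env"), envCfg);

        DatabaseConfig dbCfg = new DatabaseConfig();
        dbCfg.setAllowCreate(true);
        dbCfg.setTransactional(true);
        Database db = env.openDatabase(null, "readings", dbCfg);

        // Keys sort lexicographically, so a zero-padded timestamp in the key
        // keeps readings in time order for range scans.
        DatabaseEntry key = new DatabaseEntry("sensor1/000000123".getBytes());
        DatabaseEntry val = new DatabaseEntry("21.5".getBytes());
        db.put(null, key, val);

        db.close();
        env.close();
    }
}
```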
Suggestion
Give RRD4J a try:
It is simple enough
It produces quite nice plots
It has a Clojure API
It supports several back-ends, including Oracle Berkeley DB Java Edition
It can store/visualize detailed data sets
For collecting real-time streams of sensor data, the following might be of help.
Have you checked the leJOS APIs? See http://lejos.sourceforge.net/nxt/nxj/api/index.html
It is also worth checking Oracle Java ME Embedded and the target markets it is addressing: http://www.unitask.com/oracledaily/2012/10/04/at-the-java-demogrounds-oracle-java-me-embedded-enables-the-internet-of-things/
It can be downloaded from http://www.oracle.com/technetwork/java/embedded/downloads/javame/index.html
For storing time-series data, nothing beats Cassandra (http://cassandra.apache.org/); to see why, refer to http://www.datastax.com/why-cassandra
For accessing Cassandra from Java, see https://github.com/jmctee/Cassandra-Client-Tutorial -- it is quite helpful. For applying the time-series concept in Cassandra, refer to
http://www.datastax.com/wp-content/uploads/2012/08/C2012-ColumnsandEnoughTime-JohnAkred.pdf
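To make the usual Cassandra time-series pattern concrete, here is a hedged sketch using the DataStax Java driver: partition by (sensor, day) so each partition stays bounded, and cluster by timestamp so readings come back in time order. The contact point, keyspace, and table names are illustrative, and the keyspace is assumed to already exist:

```java
import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.ResultSet;
import com.datastax.driver.core.Row;
import com.datastax.driver.core.Session;

public class SensorStore {
    public static void main(String[] args) {
        Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
        Session session = cluster.connect("sensors"); // keyspace assumed to exist

        // Compound partition key (sensor_id, day) bounds partition size;
        // clustering on event_time stores readings in time order on disk.
        session.execute("CREATE TABLE IF NOT EXISTS readings ("
                + " sensor_id text, day text, event_time timestamp, value double,"
                + " PRIMARY KEY ((sensor_id, day), event_time))");

        session.execute("INSERT INTO readings (sensor_id, day, event_time, value)"
                + " VALUES ('t1', '2012-08-01', dateof(now()), 21.5)");

        // One sensor-day is a single partition, so this read is cheap.
        ResultSet rs = session.execute(
                "SELECT * FROM readings WHERE sensor_id = 't1' AND day = '2012-08-01'");
        for (Row row : rs) {
            System.out.println(row.getDouble("value"));
        }
        cluster.close();
    }
}
```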
I really like data.frames in R because you can store different types of data in one data structure, you have a lot of different methods to modify the data (add a column, combine data.frames, ...), and it is really easy to extract a subset from the data.
Is there any Java library available which has the same functionality? I'm mostly interested in storing different types of data in a matrix-like fashion and being able to extract a subset of the data.
Using a two-dimensional array in Java can provide a similar structure, but it is much more difficult to add a column and afterwards extract the top k records.
Tablesaw (https://github.com/jtablesaw/tablesaw) is a Java dataframe library begun in 2015 and under active development (2018). It's designed to be as scalable as possible without sacrificing ease of use. Features include filtering by rows and columns, descriptive stats, map/reduce functions, cross-tabs, plots, and machine learning. Apache license.
In one query test it returned 500+ records from a half-billion-record table in 2 ms.
Contributions, feature requests, and feedback are welcome.
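As a hedged sketch of the dataframe operations the question asks for, based on Tablesaw's documented Table API (the CSV file and column names below are illustrative):

```java
import tech.tablesaw.api.Table;

public class TablesawDemo {
    public static void main(String[] args) throws Exception {
        // Load a CSV; column types are inferred automatically.
        Table t = Table.read().csv("measurements.csv");

        // Row subset by predicate, then "top k" by sorting and slicing.
        Table hot = t.where(t.numberColumn("temp").isGreaterThan(30));
        Table topTen = hot.sortDescendingOn("temp").first(10);

        System.out.println(topTen.print());
    }
}
```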
I have just open-sourced a first draft version of Paleo, a Java 8 library which offers data frames based on typed columns (including support for primitive values). Columns can be created programmatically (through a simple builder API) or imported from a text file.
Please refer to the README for further details.
The project is still wet from birth; I am very interested in feedback / PRs, thanks in advance!
I also found myself in need of a data frame structure while working in Java recently. Fortunately, after writing a very basic implementation I was able to get approval to release it as open source. You can find my implementation here: Joinery -- Data frames for Java. Contributions and feature requests are welcome.
I'm not very proficient with R, but you should have a look at Guava, specifically its Tables. They do not provide the exact functionality you want, but you could either extend them or use their specification to help write your own collection.
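For reference, Guava's Table is a two-key map, so a row key can play the role of a record and a column key the role of a variable; a small illustration (the column names are my own):

```java
import com.google.common.collect.HashBasedTable;
import com.google.common.collect.Table;

public class GuavaTableDemo {
    public static void main(String[] args) {
        // A Table maps (rowKey, columnKey) -> value; Object values allow
        // mixed types per column, loosely like a data.frame.
        Table<Integer, String, Object> df = HashBasedTable.create();
        df.put(0, "name", "alice");
        df.put(0, "score", 9.5);
        df.put(1, "name", "bob");
        df.put(1, "score", 7.0);

        System.out.println(df.row(0));         // one record: {name=alice, score=9.5}
        System.out.println(df.column("name")); // one column: {0=alice, 1=bob}
    }
}
```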
Morpheus (http://www.zavtech.com/morpheus/docs/) provides a DataFrame analogue to that of R. It is a high-performance column-store data structure that enables data to be sorted, sliced, grouped, and aggregated in either the row or column dimension. It also supports parallel processing for many of these operations, using the Fork/Join framework internally.
You can easily read and write data to CSV files, databases, and a proprietary JSON format. Adapters to load data from Quandl, Google Finance, and others are also available.
It has built-in support for various styles of linear regression, principal component analysis, linear algebra, and other kinds of analytics. The feature set is still growing, but it is already a very capable framework.
In R we have the data.frame, in Python we have pandas; in Java:
There is the Schema from deeplearning4j.
There is also a version for analyzing the ubiquitous iris data if you want to just get started, here.
There are also other custom objects (from Weka, from TensorFlow) that are more or less the same.
I'm fairly new to data science, and am now starting to develop a system that requires me to analyze large data (e.g. 5-6 million records in each DB).
In the bigger picture: I have multiple DBs containing various kinds of data which need to be integrated. After integrating the data, I also need to perform some data analysis. And lastly, I need to visualize the data for many clients.
Overall, I want to know the current technology/trend for handling big data (i.e. with a Java framework).
The answer is: it depends on your non-functional requirements. Your use cases will be critical in deciding which technology to use.
Let me share one of my experiences, in order to clarify what I mean:
In 2012 I needed to deal with ~2 million non-structured records per month, and to run entropy (information theory) and similarity algorithms for ~600 requests per minute.
Our scenario was composed of:
Non-structured records, but already in JSON format.
Entropy and similarity algorithms that ran against the whole content of the DB for each record to be matched (take a look at the Shannon entropy formula, and you will understand the complexity I'm talking about).
More than 100 different web applications as clients of this solution.
Given those requirements (and many others), and after performing PoCs with Cassandra, Hadoop, Voldemort, and Neo4j, plus stress, resiliency, scalability, and robustness tests, we arrived at the best solution for that moment (2012):
Java EE 7 (with the new Garbage-First (G1) collector activated)
JBoss AS 7 (WildFly) + Infinispan for MapReduce race conditions, other cluster controls, and distributed-cache needs
Servlet 3.0 (because of its non-blocking I/O)
Nginx (at that time it was beta, but unlike httpd, it already handled multiple connections in a non-blocking fashion)
MongoDB (since our raw content was already in JSON document style)
Apache Mahout for all algorithm implementations, including the MapReduce strategy
among other things.
So, it all depends on your requirements. There is no silver bullet; each situation demands an architectural analysis.
I remember NASA at that time was processing ~1 TB per hour in AWS with Hadoop, due to the Mars project with Curiosity.
In your case, I would recommend paying attention to your requirements; maybe a Java framework is not what you need (or not just what you need):
If you are just going to implement algorithms for data analysis, for statisticians and data miners (for example), the R programming language is probably going to be the best choice.
If you need really fast I/O (aircraft stuff, for example): any natively compiled language like Go, C++, etc.
But if you are actually going to create web applications that will just be clients of, or feed, the big data solution, I'd recommend something lighter and more scalable, like Node.js, or even a just-in-time-compiled technology based on the JVM (Scala, Jython, Java) in dockerized microservices...
Good luck! (Sorry, Stack Overflow didn't allow me to add the reference links yet, but everything I have talked about here can easily be googled.)
Most of the time I build simple applications that share data through an MS Access DB on a network drive.
However, MS Access is damn slow, does not support common SQL features, is a pain to automate, and makes it easy for users to "explore" my application's data directly.
I would like to migrate to a different DB with a similar profile. It should:
need no installation, because installations are not permitted by the IT department
be file-based (same reason as above)
be safely placeable on a network drive to give multi-user support
be open source
(preferably) have a pure Java driver
Do you know of anything out there that matches at least some of my criteria?
I have reviewed SQLite, Derby, and HSQLDB. They all seem to support all requirements except shareability through a network drive.
But this is the feature I need most.
I would appreciate any answers.
I'm pretty sure the requirement "needing no installation" is a show-stopper for you. The alternative is to learn to make better use of Access. With careful design, it will surprise you. In the olden days, when lawyers in the US were expected to design, build, and maintain their own litigation support databases, I used to demonstrate design and queries against a 2,000,000 row Access database. Response times were less than 250 msec. I know that sounds awfully slow nowadays, but back then--early 1990s--that was ripping fast.
Not kidding about the lawyers. The canonical reference for litigation support databases at the time was Wilmer, Cutler, and Pickering Manual on Litigation Support Databases. In my experience, most lawyers believe their expertise at the bar transfers to all other fields. Including database design.
Try HSQLDB.
From their homepage:
It offers a small, fast multithreaded and transactional database engine which offers in-memory and disk-based tables and supports embedded and server modes.
It's also used by JBoss AS as its internal database.
I have reviewed SQLite, Derby, and HSQLDB. They all seem to support all requirements except shareability through a network drive.
You might need to put a small server on the network, though :)
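A minimal sketch of that setup, assuming HSQLDB 2.x: start the bundled server process once on a machine that can see the drive, then connect from each client over plain JDBC. The host, database name, and credentials below are illustrative:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class HsqlClient {
    public static void main(String[] args) throws Exception {
        // Server started elsewhere with something like:
        //   java -cp hsqldb.jar org.hsqldb.server.Server \
        //        --database.0 file:/shared/mydb --dbname.0 mydb
        // Clients then share the database through the server, not the file.
        try (Connection c = DriverManager.getConnection(
                     "jdbc:hsqldb:hsql://fileserver/mydb", "SA", "");
             Statement st = c.createStatement()) {
            st.execute("CREATE TABLE IF NOT EXISTS t (id INT PRIMARY KEY, name VARCHAR(50))");
            st.execute("INSERT INTO t VALUES (1, 'x')");
        }
    }
}
```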
I have worked with Firebird Embedded with great results; the only problem is that no multi-user access is possible.
It provides you with full modern RDBMS features.
No need for installation: just copy a dll.
The database is a single file.
Here is some info: http://www.firebirdsql.org/manual/ufb-cs-embedded.html
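In case it helps, a hedged sketch of opening such a single-file database through Jaybird's embedded JDBC sub-protocol; the file path and the default SYSDBA credentials below are illustrative:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class FirebirdEmbeddedDemo {
    public static void main(String[] args) throws Exception {
        // The embedded sub-protocol opens the database file in-process:
        // nothing to install, but only one process may use it at a time.
        try (Connection c = DriverManager.getConnection(
                     "jdbc:firebirdsql:embedded:/data/app.fdb", "SYSDBA", "masterkey");
             Statement st = c.createStatement();
             ResultSet rs = st.executeQuery("SELECT 1 FROM RDB$DATABASE")) {
            rs.next();
            System.out.println("connected: " + rs.getInt(1));
        }
    }
}
```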
I want to ask about a good triplestore to use for large datasets, it should:
Scale well (millions of triples)
Have a Java interface
You should consider using the OpenLink Virtuoso store. It is available under an open-source license and scales to billions of triples. You can use it via the Sesame and Jena APIs.
See here for an overview of large-scale triple stores. Virtuoso is definitely easier to set up than BigData. Besides that, I have used the Sesame NativeStore, which doesn't scale too well.
4store is also a good choice, although I haven't used it. One benefit of Virtuoso over 4store is that you can easily mix standard relational models with RDF, since Virtuoso is under the hood a relational database.
4store: Scalable RDF storage
Quoting the 4store web site:
4store's main strengths are its performance, scalability and stability. It does not provide many features over and above RDF storage and SPARQL queries, but if you are looking for a scalable, secure, fast and efficient RDF store, then 4store should be on your shortlist.
Personally, I have tested 4store with very large databases (up to 2 billion triples) with very good results. 4store is written in C, runs on Linux/Unix 64-bit platforms, and the current version 1.1.1 has partially implemented SPARQL 1.1.
4store can be deployed on a cluster of commodity servers, which may boost the performance of your queries, and assertion throughput can reach 100 KTriples/second. But even if you use it on a single server you will get quite decent performance.
Here at the University of Southampton it is our choice for very big datasets in research projects and also for our Webmaster team; see Data Stores for Southampton and ECS Open Data.
Here you also have a list of all the libraries that you can use to query and administer 4store: Client Libraries. 4store's IRC channel also has an active community of users who will help if you run into any issues.
If you are a Linux/Unix user, 4store is definitely a good choice.
I would also recommend 4store, but in the spirit of full disclosure, I was the lead architect :)
If you want to take advantage of the standardisation of RDF stores, then you should look to use a Java library that implements SPARQL, rather than one that exposes a Java API natively.
Otherwise you could end up stuck with whichever store you choose first, due to the effort of moving between them, which is typical SQL migration hell.
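To illustrate the point, here is a sketch of querying a store purely through SPARQL with Apache Jena's ARQ module (assuming Jena 3.x package names); only the endpoint URL, which is illustrative below, is store-specific, so swapping Virtuoso for 4store or GraphDB means changing one string:

```java
import org.apache.jena.query.Query;
import org.apache.jena.query.QueryExecution;
import org.apache.jena.query.QueryExecutionFactory;
import org.apache.jena.query.QueryFactory;
import org.apache.jena.query.QuerySolution;
import org.apache.jena.query.ResultSet;

public class SparqlClient {
    public static void main(String[] args) {
        String endpoint = "http://localhost:8890/sparql"; // illustrative endpoint URL
        Query q = QueryFactory.create("SELECT ?s WHERE { ?s ?p ?o } LIMIT 10");

        // ARQ speaks the standard SPARQL protocol over HTTP, so no
        // store-specific client library is needed here.
        try (QueryExecution qe = QueryExecutionFactory.sparqlService(endpoint, q)) {
            ResultSet rs = qe.execSelect();
            while (rs.hasNext()) {
                QuerySolution row = rs.next();
                System.out.println(row.get("s"));
            }
        }
    }
}
```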
I am personally quite happy with GraphDB, which runs quite well on medium hardware (a 256 GB RAM server) with 15 billion triples. It is accessible via both the Sesame and Jena interfaces (although the Jena one is beta-ish).
If you can afford it, an Oracle 12c instance is not bad, and might fit in with an existing Oracle infrastructure (back-ups etc.).
Virtuoso 7.1 scales very well and can deal with humongous data volumes at a reasonable cost. Unfortunately, its SPARQL standards compliance is spotty.
@Steve: I don't know how to comment, so I guess I am going to answer two questions at once.
A JDBC driver for SPARQL is below:
http://code.google.com/p/jdbc4sparql/
It supports the SPARQL protocol and SPARUL (over the SPARQL protocol as an update, not over the SPARUL protocol).
@myahya:
4store is highly recommended, so it is worth appraising as a candidate.
Virtuoso also has native JDBC drivers and supports large datasets (up to 12 billion triples):
www.openlinksw.com/wiki/main/Main/
Also, Oracle has something, but be prepared to pay big bucks:
http://www.oracle.com/technetwork/database/options/semantic-tech/index.html
In addition to 4Store, Virtuoso, and Owlim, Bigdata is also worth looking at.