Java object analogue to R data.frame [closed]

Closed. This question does not meet Stack Overflow guidelines. It is not currently accepting answers.
We don’t allow questions seeking recommendations for books, tools, software libraries, and more. You can edit the question so it can be answered with facts and citations.
Closed 4 years ago.
I really like data.frames in R because you can store different types of data in one data structure, there are many methods for modifying the data (adding a column, combining data.frames, ...), and it is really easy to extract a subset of the data.
Is there any Java library available that has the same functionality? I'm mostly interested in storing different types of data in a matrix-like fashion and being able to extract a subset of the data.
Using a two-dimensional array in Java can provide a similar structure, but it is much more difficult to add a column and afterwards extract the top k records.

Tablesaw (https://github.com/jtablesaw/tablesaw) is a Java dataframe, begun in 2015 and under active development (as of 2018). It's designed to be as scalable as possible without sacrificing ease of use. Features include filtering by rows and columns, descriptive stats, map/reduce functions, cross-tabs, plots, and machine learning. Apache license.
In one query test it returned 500+ records from a half-billion-record table in 2 ms.
Contributions, feature requests, and feedback are welcome.

I have just open-sourced a first draft version of Paleo, a Java 8 library which offers data frames based on typed columns (including support for primitive values). Columns can be created programmatically (through a simple builder API), or imported from a text file.
Please refer to the README for further details.
The project is still wet from birth – I am very interested in feedback / PRs, tia!

I also found myself in need of a data frame structure while working in Java recently. Fortunately, after writing a very basic implementation I was able to get approval to release it as open source. You can find my implementation here: Joinery -- Data frames for Java. Contributions and feature requests are welcome.

I am not very proficient with R, but you should have a look at Guava, specifically Tables. They do not provide the exact functionality you want, but you could either extend them, or their specification could help you in writing your own Collection.
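To illustrate the "write your own Collection" route, here is a minimal sketch in plain Java. It is not Guava's API and not any of the libraries mentioned in the other answers; the class name MiniFrame and its methods are hypothetical. It only shows the two operations the question asks about: adding a column and taking the first k rows (no sorting is done).

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical minimal table: each column is a named list, and all columns share one row count.
final class MiniFrame {
    private final Map<String, List<Object>> columns = new LinkedHashMap<String, List<Object>>();
    private int rowCount = 0;

    // Adds a column; every column must have the same number of rows as the first one.
    MiniFrame addColumn(String name, List<?> values) {
        if (columns.containsKey(name)) {
            throw new IllegalArgumentException("duplicate column: " + name);
        }
        if (!columns.isEmpty() && values.size() != rowCount) {
            throw new IllegalArgumentException("expected " + rowCount + " rows, got " + values.size());
        }
        columns.put(name, new ArrayList<Object>(values));
        rowCount = values.size();
        return this;
    }

    // Returns a new frame holding only the first k rows of every column.
    MiniFrame head(int k) {
        int n = Math.min(Math.max(k, 0), rowCount);
        MiniFrame result = new MiniFrame();
        for (Map.Entry<String, List<Object>> e : columns.entrySet()) {
            result.addColumn(e.getKey(), e.getValue().subList(0, n));
        }
        return result;
    }

    Object get(String column, int row) {
        return columns.get(column).get(row);
    }

    int rowCount() {
        return rowCount;
    }

    List<String> columnNames() {
        return new ArrayList<String>(columns.keySet());
    }
}
```

Columns hold Object values here, so mixed types work but nothing is type-checked; the typed-column libraries in the other answers avoid that trade-off.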

Morpheus (http://www.zavtech.com/morpheus/docs/) provides a DataFrame analogue to that of R. It is a high-performance column store data structure that enables data to be sorted, sliced, grouped, and aggregated in either the row or column dimension. It also supports parallel processing for many of these operations using the Fork & Join framework internally.
You can easily read & write data to CSV files, databases and also a proprietary JSON format. Adapters to load data from Quandl, Google Finance and others are also available.
It has built-in support for various styles of linear regression, Principal Component Analysis, linear algebra and other types of analytics. The feature set is still growing, but it is already a very capable framework.

In R we have the data.frame, in Python we have pandas, and in Java:
There is the Schema from deeplearning4j.
There is also a version for the data analysis of the ubiquitous iris data, if you just want to get started, here
There are also other custom objects (from Weka, from TensorFlow) that are more or less the same.

Related

Full-Text-Search of database [closed]

Closed 3 years ago.
I am looking for a performant and readable way to implement full-text search. I have a number of requirements for the search; see the list below.
Requirements
Performance
My database is growing very fast. Loading all the data into the heap and doing some .stream() magic is not an option. The search should be performed by the DBMS.
Readability
I need a simple solution. A complex query like the one in "How to implement simple full text search in JPA (Spring Data JPA)?" (see option #2) is not a solution either. I would need several JOINs, and the resulting query would be too complex.
The overhead of an "index field" is also not acceptable (too much joined data).
Concurrency
The application needs to be scalable (with n instances), so a solution with Lucene is not a good fit (here is an example).
No mixing of technologies
I don't want to spread the logic across different systems. This means the whole search logic should be defined in Java. A combination of Java logic with views or SQL functions should be avoided.
Options discovered so far
QueryDsl
This is my old solution, but it is very complex and caused a lot of problems with the automatically generated classes.
Lucene
I like this, but there is one big problem: the index. Keeping the index up to date on all instances is a bit overkill.
Very long @Query
The resulting query gets too complex to handle.
Java.stream()...
// kinda
getAllUsers().stream()
    .filter(user -> user.getName().contains(searchTerm)
        || user.getSex().contains(searchTerm)
        || user.getAge().toString().equals(searchTerm)
        || ...)
I have too much data to do that, so this solution will not scale well either.
Specification Interface
My preferred solution. But maybe there are other (and better) solutions?
SearchField or similar
Too many JOINs. Too much data.
?
Question
What are your experiences with full-text search in a Spring Boot application? Do you know a solution that meets my requirements?
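To make the "Specification" option concrete without pulling in a framework, here is a rough sketch of the idea in plain Java: small, composable search conditions that are defined in Java but rendered as a parameterized WHERE clause, so the DBMS does the matching. This is not Spring Data's Specification API; the class SearchSpec and its methods are hypothetical stand-ins, and the column names are assumptions.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a specification: a SQL fragment plus its bind parameters.
final class SearchSpec {
    final String sql;
    final List<Object> params;

    private SearchSpec(String sql, List<Object> params) {
        this.sql = sql;
        this.params = params;
    }

    // Builds "column LIKE ?" with the term wrapped in wildcards.
    // The column name must come from code, never from user input.
    static SearchSpec contains(String column, String term) {
        List<Object> p = new ArrayList<Object>();
        p.add("%" + term + "%");
        return new SearchSpec(column + " LIKE ?", p);
    }

    // Combines two specifications with OR, keeping parameter order aligned with the placeholders.
    SearchSpec or(SearchSpec other) {
        List<Object> p = new ArrayList<Object>(params);
        p.addAll(other.params);
        return new SearchSpec("(" + sql + " OR " + other.sql + ")", p);
    }
}
```

For example, SearchSpec.contains("name", "ann").or(SearchSpec.contains("sex", "ann")) yields the fragment (name LIKE ? OR sex LIKE ?) with two "%ann%" parameters, which could then be bound to a PreparedStatement. The sketch does not escape % or _ inside the search term, and a leading-wildcard LIKE usually cannot use an ordinary index, so it illustrates readability rather than performance.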

If you have got as far as Lucene, then the next step is Solr. I haven't used the options you mentioned above, but I have worked with Solr and can safely say that it is worth a try, for speed and ease of use.
Of the four constraints you list, I feel the first three are taken care of by Solr.
Performance: Solr is a proven candidate in this area.
Readability: I assume you mean readability of code. Although this depends on how the code and design are done, the Solr part is quite friendly to code, understand and maintain because of the lack of JOINs and other RDBMS concepts.
Concurrency: From the official documentation at lucene.apache.org/solr:
Both Lucene and Solr were designed to scale to support large implementations with minimal custom coding.
and that Solr can do the following in this regard:
distributing an index across multiple servers
replicating an index on multiple servers
merging indexes
no mixing of technologies: With the option of using Solr, you have at least two technologies: Java and Solr. I am not sure if you wanted to keep your solution to pure Java/JEE. If that is the case, then this may not satisfy that need.
However, this requirement:
The search should be performed by the DBMS.
is surely not taken care of.
Also, I can't think of a way other than a custom design to address this:
Keep the index up2date on all instances is a bit overkill.
A warning: It may take some time to get a good grasp on Solr if you are new to it.

You may consider Apache Solr for searching.

Extracting data from Wikipedia [closed]

Closed 8 years ago.
I am creating a Spring application and I have the need to integrate with Wikipedia. In particular, I would like to extract data on a given (large) set of Cities, e.g. country, website and coordinates.
I am trying to understand which libraries or frameworks can be useful, but the big issue I am dealing with is that there is no reference structure for the pages I would like to extract information from. The main problem is not that some information is missing, which would be totally acceptable, but rather the city representation changes from city to city. E.g. the DBPedia ontology (e.g. http://dbpedia.org/ontology/City) does not reflect what I can extract via SPARQL query from dbpedia.org/sparql. This way, I don't know how to extract the data I need systematically (i.e. for my entire set).
Is there any technology that can support my task of extracting data on a predefined set of cities?
One solution could be to put in place some Natural Language Processing to search the entire Wikipedia page for the required info, but that requires a lot of effort if I have to write it on my own.
Another solution could be leveraging a source of structured data that pre-processed Wikipedia for me and gave some structure to the contained information, but I could not find one.
A third solution could be trying to make different queries to Wikipedia, but I cannot figure out a way to extract the information I need via those Wikipedia APIs.

Data from Wikipedia is being transferred to Wikidata. Using their API you could get what you want. If you want a shortcut you could use the Wikidata query tool: http://wdq.wmflabs.org/api_documentation.html

I'm not a Java guy, but I did something like this in .NET.
You need some kind of web scraping framework.
In .NET there is HtmlAgilityPack: you fetch the page and then walk through its elements with, for example, XPath. Of course, you need to know where on the page the information is, such as the tags around the heading, the text and so on.
For Java, the frameworks I just found were:
Tag Soup
HtmlUnit
Web-Harvest
jARVEST
jsoup
Jericho HTML Parser

Java commercial-friendly R-tree implementation? [closed]

Closed 5 years ago.
I need a commercial-friendly (Apache Licence, LGPL, Mozilla Public License etc) R-tree implementation in Java, in order to substitute the geonames Web Service for timezones, as suggested in the question "Determine timezone from latitude/longitude without using web services like Geonames.org". I have found some around, but I was wondering if someone has evaluated or used them in practice.

https://github.com/rweeks/util/blob/master/src/com/newbrightidea/util/RTree.java -
LGPL implementation of R-Tree by Russ Weeks. It looks very simple and clear and not dependent on external libraries.
http://www.mischiefblog.com/?p=171
http://www.mischiefbox.com/blog/uploads/rtree.jar
LGPL implementation of R-Tree by Chris Jones. Another simple and clear solution.
http://www.khelekore.org/prtree/
CPL 1.0 implementation of Priority R-Tree by Robert Olofsson
http://jsi.sourceforge.net/
LGPL - project aims to maintain a high performance Java version of the RTree spatial indexing algorithm.

First of all, let me point out that if you look up the nearest city to the given coordinates, it might not be in the same time zone! What you really need, in my opinion, is information about its administrative affiliation: at minimum the country, but in some cases even more than that, e.g. the state. That information can be retrieved with the Google Maps API and then correlated with more detailed TZ information.
Second of all - there is a free alternative to GeoNames - EarthTools. There are some limitations to the service itself (number of requests, etc.), but still it's good, tested and working just fine for me.
Third of all - if you care about importing the data into a DB, most current DB implementations provide geospatial indexes that you can use. If you need that information embedded in your application, you can use H2Database (an embedded Java DB) with the H2Spatial addition, although I've tried it and cannot fully recommend it. Neo4j has a great spatial index implementation.
Additionally You can use Solr for GeoSpatial searches. It's nice, it's quick and it's easy to implement. I'm actually in the middle of the process of migrating my DB searches to Solr...
Last, but not least, below you'll find some of the ones I've tested a while back:
JSI - LGPL
GeoTools - LGPL, an overkill, will give You far more than what you need... but it's great!
There are possibly a few more, but these are the ones I've tested so far.

A simple RTree Java class created by me:
https://github.com/hadmir/rtree/blob/master/RTree.java
All objects are stored inside two int[] arrays, so it is really easy to persist (to a file). Also, the fact that adding new rects doesn't create any objects means that you can insert millions of rectangles into the RTree and the JVM will not go up in flames. This is useful for geo projects, where object counts are usually enormous.
Only 2D rectangles are stored (so for a complex object you need to find its bounding rectangle). A query returns all rects (IDs of rects) intersecting or overlapping the "query rectangle".
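The storage idea and the query contract can be sketched as follows. This is not the linked RTree class and not an R-tree at all: it is a brute-force linear scan over a single flat int[] (the linked class is described as using two arrays), written only to show object-free inserts and the intersection test. The class name FlatRectStore and its layout are hypothetical.

```java
import java.util.Arrays;

// Hypothetical flat store: rectangle i occupies coords[4*i .. 4*i+3] as minX, minY, maxX, maxY.
final class FlatRectStore {
    private int[] coords = new int[64];
    private int size = 0;

    // Appends a rectangle and returns its ID; only the backing array grows,
    // no per-rectangle object is created.
    int add(int minX, int minY, int maxX, int maxY) {
        if (size * 4 == coords.length) {
            coords = Arrays.copyOf(coords, coords.length * 2);
        }
        int base = size * 4;
        coords[base] = minX;
        coords[base + 1] = minY;
        coords[base + 2] = maxX;
        coords[base + 3] = maxY;
        return size++;
    }

    // Returns the IDs of all rectangles that intersect the query rectangle
    // (rectangles that merely touch an edge count as intersecting).
    int[] query(int minX, int minY, int maxX, int maxY) {
        int[] hits = new int[size];
        int n = 0;
        for (int i = 0; i < size; i++) {
            int b = i * 4;
            boolean overlaps = coords[b] <= maxX && coords[b + 2] >= minX
                    && coords[b + 1] <= maxY && coords[b + 3] >= minY;
            if (overlaps) {
                hits[n++] = i;
            }
        }
        return Arrays.copyOf(hits, n);
    }
}
```

With rectangles (0,0,10,10), (20,20,30,30) and (5,5,25,25) added in that order, query(8, 8, 12, 12) returns the IDs 0 and 2. A real R-tree replaces the linear scan with a hierarchy of bounding rectangles so that most entries are never visited.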

Java (ME) library for fixed-length record files [closed]

Closed 3 years ago.
I am looking for a library that can run on Java ME (Foundation Profile 1.1, CDC) and allows me to basically do something along the lines of
FILE OF type;
in Pascal.
Background: I need to have a largish (approx. 100MB) set of around 500,000 records available for quick lookups by a known index value. Do I really have to write this myself? Databases like Derby are way too big and bring lots of features (stored procedures, anyone?) I do not need.
Ideally I would just like to define a class with a few fields based on primitive types and Strings as a value holder object and persist these in a file I could, should the need arise, recover manually. That's why I am not too keen on serialization. In the past I have had to deal with several cases of corrupted binary data files that could not be recovered at all.
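For reference, the Pascal-style "FILE OF type" idea can be approximated with java.io.RandomAccessFile: every record has the same byte length, so record n starts at offset n times the record size. The sketch below is only an illustration under assumptions: the class name FixedRecordFile and the record layout (a 4-byte int followed by a 16-byte padded ASCII name) are made up, and whether RandomAccessFile behaves the same on a given CDC VM should be checked against that profile.

```java
import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.util.Arrays;

// Hypothetical record layout: 4-byte big-endian int id, then a 16-byte space-padded ASCII name.
final class FixedRecordFile {
    static final int NAME_BYTES = 16;
    static final int RECORD_BYTES = 4 + NAME_BYTES;

    private final RandomAccessFile file;

    FixedRecordFile(File path) throws IOException {
        this.file = new RandomAccessFile(path, "rw");
    }

    // Writes one record at the given index; names longer than NAME_BYTES are truncated.
    void write(long index, int id, String name) throws IOException {
        byte[] padded = new byte[NAME_BYTES];
        Arrays.fill(padded, (byte) ' ');
        byte[] raw = name.getBytes("US-ASCII");
        System.arraycopy(raw, 0, padded, 0, Math.min(raw.length, NAME_BYTES));
        file.seek(index * RECORD_BYTES);
        file.writeInt(id);
        file.write(padded);
    }

    // Reads the id of the record at the given index.
    int readId(long index) throws IOException {
        file.seek(index * RECORD_BYTES);
        return file.readInt();
    }

    // Reads the name of the record at the given index, without the padding.
    String readName(long index) throws IOException {
        byte[] buf = new byte[NAME_BYTES];
        file.seek(index * RECORD_BYTES + 4);
        file.readFully(buf);
        return new String(buf, "US-ASCII").trim();
    }

    void close() throws IOException {
        file.close();
    }
}
```

Because each record sits at a fixed offset, a damaged record does not shift the ones after it, which helps with manual recovery; storing the id as padded text instead of a binary int would make the file readable in a plain text editor as well.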

Your biggest problem here is establishing a correspondence between field names and columns in the file, as you really shouldn't assume that the class layout matches the field ordering in the source file.
If the file were to contain a header row then it's a simple matter of using reflection/introspection and shouldn't take more than a day to implement yourself.
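A rough sketch of that reflection approach, assuming the header names match the field names exactly and only int and String fields are needed. The classes City and HeaderMapper are hypothetical, and the code uses Java SE syntax (generics); on a 1.4-level CDC VM the generics would have to be replaced by raw types and casts.

```java
import java.lang.reflect.Constructor;
import java.lang.reflect.Field;

// Hypothetical value holder; field names are assumed to match the header names in the file.
final class City {
    String name;
    int population;
}

final class HeaderMapper {
    // Creates a new instance of the given type and fills it by matching header[i]
    // to a declared field of the same name, converting cells[i] to the field's type.
    static <T> T map(Class<T> type, String[] header, String[] cells) throws Exception {
        Constructor<T> ctor = type.getDeclaredConstructor();
        ctor.setAccessible(true);
        T target = ctor.newInstance();
        for (int i = 0; i < header.length; i++) {
            Field field = type.getDeclaredField(header[i].trim());
            field.setAccessible(true);
            String cell = cells[i].trim();
            if (field.getType() == int.class) {
                field.setInt(target, Integer.parseInt(cell));
            } else if (field.getType() == String.class) {
                field.set(target, cell);
            } else {
                throw new IllegalArgumentException("unsupported field type: " + field.getType());
            }
        }
        return target;
    }
}
```

Since the mapping goes through the header, the column order in the file is free to differ from the field order in the class; a header name without a matching field fails with NoSuchFieldException.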
Alternatively, you'll have to use an annotation of some sort to specify, for each field, where it appears in the file.
Have you instead considered alternative text serialization methods, such as CSV, JSON or XML using XStream? These avoid the risks of binary corruption and would get you up and running faster, but might also impose a higher memory footprint which could be an issue as you're targeting a mobile device.

After looking around for quite some time, I have finally come to xBaseJ from SourceForge. It relies on java.nio, which is normally not included in the JavaME CDC profile, but we had a contractor port the relevant parts to the mobile J9 VM. Armed with this, we are now building our application on top of DBase III compatible files. Apart from being pretty reasonably fast, even on the mobile platform, this gives us access to a plethora of tools that can handle this format, without having to teach non-tech folks about a JDBC based DB admin tool they do not feel comfortable with.
There has just been a recent release of a whole eBook, called "The Minimum You Need To Know About xBaseJ", which is available for free from the project's website, too.
