Java text extraction and data structure design - java

I have a huge set of data of tables in Open Office 3.0 document format.
Table 1:
(x range)|(x1,y1) |(x2,y2)|(x3,y3)|(x4,y4)
(-20,90) |(-20,0) |(-5,1) |(5,1) |(10,0)
...
Likewise I have n such tables. All of these tables are fuzzy set membership functions; in simple terms, they are the computational models according to which I have to process the input data. There are many such tables, with differing row counts and either 3 or 4 columns. These data are not going to change once loaded.
Example:
When I get a value of x within the range -20 to 90, I apply the first rule (given above). Suppose the value is -1; then I have to find the corresponding value between 0 and 1.
My first question is how to extract all the data from the tables in the document format so that I can use it in my Java program. I know a bit of Python, and I know Python can be useful in such cases, but then how would I use it from my Java program?
Secondly, what would be the best data structure to use in such a scenario?
Note: I'm not using any database, so I would prefer to keep the tables either in XML or some other format that I can load easily into the program. I am also thinking of building a suitable data structure and then serializing it, so that I can load it whenever required instead of parsing a file and recreating the data structure each time. Please post your comments.

In order to parse an OpenOffice Document in Java (to extract data), you can use a dedicated API such as ODFDOM.
I think that solution is overly complicated for what you need. An easier approach would be to extract the OpenOffice tables manually and put them in a format that is friendlier to parse from Java:
CSV
A database (MySQL, etc.)
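Once the tables are exported to CSV, the data structure can be as simple as a list of (x, y) breakpoints per table, with linear interpolation between the two points that bracket the input. The sketch below is only one possible layout (the class name and the one-function-per-CSV-line format are my own assumptions, not something from the question); since the data never changes after loading, the resulting list of functions can also be serialized once and reloaded later.

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

// Sketch: one piecewise-linear membership function, built from (x, y) breakpoints
// that are assumed to be sorted by x.
public class MembershipFunction {
    private final List<double[]> points = new ArrayList<>(); // each entry is {x, y}

    public void addPoint(double x, double y) {
        points.add(new double[] { x, y });
    }

    // Linear interpolation between the two breakpoints that bracket x.
    public double valueAt(double x) {
        for (int i = 0; i < points.size() - 1; i++) {
            double[] p1 = points.get(i);
            double[] p2 = points.get(i + 1);
            if (x >= p1[0] && x <= p2[0]) {
                double t = (x - p1[0]) / (p2[0] - p1[0]);
                return p1[1] + t * (p2[1] - p1[1]);
            }
        }
        throw new IllegalArgumentException("x outside the function's range: " + x);
    }

    // Assumed CSV layout: one function per line, as x1,y1,x2,y2,...
    public static MembershipFunction fromCsvLine(String line) {
        MembershipFunction f = new MembershipFunction();
        String[] parts = line.split(",");
        for (int i = 0; i + 1 < parts.length; i += 2) {
            f.addPoint(Double.parseDouble(parts[i].trim()),
                       Double.parseDouble(parts[i + 1].trim()));
        }
        return f;
    }

    public static List<MembershipFunction> loadAll(Path csvFile) throws IOException {
        List<MembershipFunction> functions = new ArrayList<>();
        for (String line : Files.readAllLines(csvFile)) {
            if (!line.trim().isEmpty()) {
                functions.add(fromCsvLine(line));
            }
        }
        return functions;
    }
}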

Related

Deeplearning4j: How would I prepare this data for a RNN that uses LSTM?

In my code, I download data from a source to a CSV file, apply a transformation process to it, and then write the result to a final CSV file. At this point, one row of my data looks like this:
45.414001,10358500,45.698002,44.728001,0.0
The first column is the data I want to predict, and the final column (the one with the 0s) is just a placeholder for now; it will eventually hold a double value. Using deeplearning4j, I then load this data from the CSV file into a record reader. Here is what that looks like:
// skip the header line(s), then point the reader at the transformed CSV
RecordReader recordReader = new CSVRecordReader(numSkipLines);
recordReader.initialize(new FileSplit(inputPath));
So my question is: what should I do next? I want to use this data with an RNN LSTM model that predicts the first column one step into the future.
Usually, depending on how much data (how many rows) you have, you separate it into training and testing data. The testing set can usually be a bit smaller than the training set, since testing is only there to check whether your model predicts effectively.
The training data should itself be split into a smaller training set and a validation set. Use the validation set to see after how many epochs/rounds of training you begin to overfit or underfit; you want to train up to the point where the model just begins to overfit.
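As a minimal illustration of that split (the file names and the 70/15/15 ratios are assumptions on my part, not something from the question), the transformed CSV can be cut chronologically into three files, each of which then gets its own CSVRecordReader as above:

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

// Sketch only: chronological 70/15/15 split of the transformed CSV into
// separate train/validation/test files.
public class DataSplitter {
    public static void split(Path input, Path trainOut, Path valOut, Path testOut) throws IOException {
        List<String> rows = Files.readAllLines(input);
        // Keep chronological order for a time series: shuffling would leak
        // future values into the training set.
        int trainEnd = (int) (rows.size() * 0.70);
        int valEnd = (int) (rows.size() * 0.85);
        Files.write(trainOut, rows.subList(0, trainEnd));
        Files.write(valOut, rows.subList(trainEnd, valEnd));
        Files.write(testOut, rows.subList(valEnd, rows.size()));
    }
}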

Store multiple values in a file - best format?

I want to store multiple values (String, Int and Date) in a file via Java in Android Studio.
I don't have much experience in this area, so I tried to google a bit, but I didn't find the solution I've been looking for. So maybe you can recommend something?
What I've tried so far:
Android offers SharedPreferences, which lets you save a primitive value for a key. But I have multiple values per key, so that won't work for me.
Another option is saving the data as a file on external storage. So far, so good. But I want to keep the file size to a minimum and load the file as fast as possible. That's where I get stuck: if I simply save all values as plain text, I would need to parse the .txt file by hand to load the data, which will take time for many entries.
Is there a possibility to save multiple entries with multiple values for a particular key in an efficient way?
There is no need to reinvent the wheel. Most probably the best option for your case is a database; look into SQLite or Realm.
You don’t divulge enough details about your data structure or volume, so it is difficult to give a specific solution.
Generally speaking, you have these three choices.
Serialize a collection
I have multiple values for a key
You could use a Map with a List or Set as its value. This has been discussed countless times on Stack Overflow.
Then use Serialization to write and read to storage.
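As a rough sketch of that idea (the Entry fields and class names below are assumptions about your String/int/Date values, not a prescribed design), the whole map can be written and read back with plain Java serialization:

import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.util.Date;
import java.util.List;
import java.util.Map;

// Sketch: multiple values per key as a Map<String, List<Entry>>, persisted with
// plain Java serialization. The concrete Map/List implementations must themselves
// be serializable (HashMap and ArrayList are).
public class EntryStore {
    public static class Entry implements Serializable {
        private static final long serialVersionUID = 1L;
        String text;
        int number;
        Date date;

        Entry(String text, int number, Date date) {
            this.text = text;
            this.number = number;
            this.date = date;
        }
    }

    public static void save(Map<String, List<Entry>> data, File file) throws IOException {
        try (ObjectOutputStream out = new ObjectOutputStream(new FileOutputStream(file))) {
            out.writeObject(data);
        }
    }

    @SuppressWarnings("unchecked")
    public static Map<String, List<Entry>> load(File file) throws IOException, ClassNotFoundException {
        try (ObjectInputStream in = new ObjectInputStream(new FileInputStream(file))) {
            return (Map<String, List<Entry>>) in.readObject();
        }
    }
}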
Text file
Write a text file.
Use Tab-delimited or CSV format if appropriate. I suggest using the Apache Commons CSV library for that.
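A minimal sketch with Commons CSV (the column names here are made up for illustration):

import java.io.FileWriter;
import java.io.IOException;
import org.apache.commons.csv.CSVFormat;
import org.apache.commons.csv.CSVPrinter;

// Sketch: write one CSV row per entry; the header/column names are assumptions.
public class CsvWriteExample {
    public static void write(String path) throws IOException {
        try (CSVPrinter printer = new CSVPrinter(new FileWriter(path),
                CSVFormat.DEFAULT.withHeader("key", "text", "number", "date"))) {
            printer.printRecord("user1", "hello", 42, "2023-05-01");
        }
    }
}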
Database
If you have much data, or concurrency issues with multiple threads, use a database such as the H2 Database Engine.

Compare actual data (xls) with expected data (xlsx) in Java

I have a scenario where I need to run assertions, in Java, on the actual data provided in an XLS file against the expected data provided in an XLSX file, based on an identifier column. Can anyone provide any advice or suggestions on this, please?
Actual Data:
Field(Name) | Field(Identifier) | Entity(Name) | ParentEntity(Name)
Lead time | Article.DeliveryTime | Item | None
Expected Data:
Field(Name) | Field(Identifier) | Entity(Name) | ParentEntity(Name)
Lead time | Article.DeliveryTime | Item | ParentQualifier
The number of rows and columns might change based on the data provided, but the Field(Identifier) column would be present in both files.
I suggest converting the Excel files to a more structured format such as *.csv. You can do that in Excel by simply saving as *.csv, and there are many CSV parser libraries.
If that is not possible for some reason (you are not the owner of the data, management, ...), you could use Apache POI to parse the *.xls / *.xlsx files and then do the testing; how to do that is shown here (Link1) or here (Link2), and a rough sketch follows after the list below. You can then simply run the assertions with JUnit.
There are two potential problems though:
Changing columns: you need to parse the Excel files without assuming a fixed set of columns, and then only compare columns that have the same name.
Data doesn't fit in memory: search for the IDs and only load the matching rows.
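The sketch mentioned above (using Apache POI; the identifier column index, the single-sheet layout, and the file names are assumptions about your files) indexes each workbook's rows by Field(Identifier) and then compares them:

import java.io.File;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import org.apache.poi.ss.usermodel.DataFormatter;
import org.apache.poi.ss.usermodel.Row;
import org.apache.poi.ss.usermodel.Workbook;
import org.apache.poi.ss.usermodel.WorkbookFactory;

// Sketch: index each workbook's rows by the Field(Identifier) column and compare.
public class ExcelComparator {
    static final int IDENTIFIER_COLUMN = 1; // assumed position of Field(Identifier)

    // WorkbookFactory handles both .xls and .xlsx transparently.
    static Map<String, List<String>> indexByIdentifier(String path) throws Exception {
        Map<String, List<String>> byId = new HashMap<>();
        DataFormatter fmt = new DataFormatter();
        try (Workbook wb = WorkbookFactory.create(new File(path))) {
            for (Row row : wb.getSheetAt(0)) {
                if (row.getRowNum() == 0) continue; // skip the header row
                List<String> cells = new ArrayList<>();
                for (int c = 0; c < row.getLastCellNum(); c++) {
                    cells.add(fmt.formatCellValue(row.getCell(c))); // "" for missing cells
                }
                byId.put(cells.get(IDENTIFIER_COLUMN), cells);
            }
        }
        return byId;
    }

    public static void main(String[] args) throws Exception {
        Map<String, List<String>> actual = indexByIdentifier("actual.xls");
        Map<String, List<String>> expected = indexByIdentifier("expected.xlsx");
        expected.forEach((id, expectedRow) ->
                System.out.println(id + " matches: " + expectedRow.equals(actual.get(id))));
    }
}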

how to implement RowLoader in gemfirexd?

How do I write RowLoader Java code to inject data from a sample.csv file into a GemFireXD database?
The GemFireXD distribution includes a JDBCRowLoader source example; look in the examples directory. In your case you will have to determine which field of your CSV you want to treat as the primary key, parse the CSV, and return rows as needed.
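Just to illustrate the "parse the CSV and return rows as needed" part (this is not the GemFireXD RowLoader interface itself; take the actual callback signature from the JDBCRowLoader example in the distribution, and note that the file path and key column below are assumptions):

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;

// Sketch of the CSV-lookup logic a row loader could delegate to.
public class CsvLookup {
    private final String csvPath;
    private final int keyColumn;

    public CsvLookup(String csvPath, int keyColumn) {
        this.csvPath = csvPath;
        this.keyColumn = keyColumn;
    }

    // Returns the first row whose key column equals the requested primary key, or null.
    public String[] findRow(String primaryKey) throws IOException {
        try (BufferedReader reader = new BufferedReader(new FileReader(csvPath))) {
            String line;
            while ((line = reader.readLine()) != null) {
                String[] fields = line.split(",");
                if (fields.length > keyColumn && fields[keyColumn].equals(primaryKey)) {
                    return fields;
                }
            }
        }
        return null;
    }
}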
You can check IMPORT_DATA_EX and IMPORT_TABLE_EX procedures to load data into GemFireXD.
Since you mentioned CSV format, IMPORT_DATA_EX might be the recommended way to do it, since you can also tweak the number of threads and constraints while loading the data. It's definitely one of the fastest ways to do it, but please note that the CSV file must be available from the node on which you're issuing the command.
You might also want to consider starting a peer member with host-data=false.
Reference: http://gemfirexd.docs.pivotal.io/latest/userguide/index.html#reference/system_procedures/derby/rrefimportdataproc_ex.html

Document Management System - Database Design

I'm writing my own Document Management System (DMS) in Java (the ones available don't satisfy my needs).
The documents shall be described by the Qualified Dublin Core metadata standard. The easiest way to do this, in my opinion, is to pack the key-value pairs into an RDF model with an XML representation.
To store the metadata for all documents I have two ideas (the document files themselves will be stored in the filesystem):
Store all metadata of all documents in a single XML file
Make an XML file for each document and store it either in the filesystem or in an RDBMS (like the H2 database engine for Java); a key-value database won't solve this, because the keys for one document are not unique.
Since (many) documents are linked to each other, the first approach may be better for analysing the data, but the second approach may be much faster.
Which solution would you recommend? Or are there any better solutions?
Stefan
I don't know how your analysis works, but if you need the complete graph in memory to do it, then use variant 1 (store all metadata of all documents in a single XML file), because you will get no gain (only extra work) from variant 2 in that scenario.
Added: if the extra work for variant 2 is not too much, then I recommend variant 2, because it can be more scalable (see the sketch after this list):
You could update or add document metadata by writing only a small XML file instead of one huge file.
It depends on which XML parser you use, but in some cases it is faster to parse several smaller XML files than one huge one (though this depends strongly on the amount of data).
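As a minimal sketch of variant 2 (element names and file layout are my own assumptions and are not proper RDF/Dublin Core), each document gets its own small XML file, and repeated keys are simply written as repeated elements, so non-unique keys are not a problem:

import java.io.File;
import java.util.List;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

// Sketch: write one small XML metadata file per document, allowing repeated keys.
public class MetadataWriter {
    public static void write(Map<String, List<String>> metadata, File target) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
        Element root = doc.createElement("metadata");
        doc.appendChild(root);
        for (Map.Entry<String, List<String>> entry : metadata.entrySet()) {
            for (String value : entry.getValue()) {
                Element field = doc.createElement("field");
                field.setAttribute("key", entry.getKey());
                field.setTextContent(value);
                root.appendChild(field);
            }
        }
        Transformer transformer = TransformerFactory.newInstance().newTransformer();
        transformer.transform(new DOMSource(doc), new StreamResult(target));
    }
}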
Have you considered using MongoDB and GridFS? http://www.mongodb.org/display/DOCS/GridFS+Specification
You can store your documents directly in MongoDB as binary data and even store the associated metadata for each file in any format you want. It can store documents even if they have the same name, and it will generate its own unique IDs.
BTW, even though it does not strictly belong to your question: have a look at a JCR (Java Content Repository) implementation such as Apache Jackrabbit. You could use it to store your documents, and maybe your metadata too.
I'd look into a NoSQL document solution like CouchDB to see if it could help you.
I don't like the file system solution; there's no abstraction whatsoever to help you there.
If you are always accessing all documents, neither approach will be slower than the other, but I would recommend the second approach. When it comes to analyzing the data, you'll need to read all documents anyway, so there is no difference whether they are in separate files or in one file...
