I am using JHDF5 to log a collection of values to an HDF5 file. I am currently using two ArrayLists to do this: one with the values and one with the names of the values.
ArrayList<String> valueList = new ArrayList<String>();
ArrayList<String> nameList = new ArrayList<String>();
valueList.add("Value1");
valueList.add("Value2");
nameList.add("Name1");
nameList.add("Name2");
IHDF5Writer writer = HDF5Factory.configure("My_Log").keepDataSetsIfTheyExist().writer();
HDF5CompoundType<List<?>> type = writer.compound().getInferredType("", nameList, valueList);
writer.compound().write("log1", type, valueList);
writer.close();
This logs the values correctly to the file My_Log, in the dataset "log1". However, this example always overwrites the previous log of the values in the dataset "log1". I want to be able to log to the same dataset every time, appending the latest log to the next line/index of the dataset. For example, if I were to change the value of "Name2" to "Value3" and log the values, and then change "Name1" to "Value4" and "Name2" to "Value5" and log the values, the dataset should look like this:
Name1   Name2
Value1  Value2
Value1  Value3
Value4  Value5
I thought the keepDataSetsIfTheyExist() option would prevent the dataset from being overwritten, but apparently it doesn't work that way.
Something similar to what I want can be achieved in some cases with writer.compound().writeArrayBlock(), which lets you specify at which index the array block shall be written. However, this solution doesn't seem to be compatible with my current code, where I have to use lists to handle my data.
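For reference, the block-wise variant would look roughly like this. I'm writing the JHDF5 signatures from memory, so treat the exact method names and parameters as unverified assumptions:
HDF5CompoundType<List<?>> type = writer.compound().getInferredType("", nameList, valueList);
// Create the dataset once as a chunked (extendable) compound array.
if (!writer.exists("log1")) {
    writer.compound().createArray("log1", type, 0L, 1); // size 0, block size 1 (assumed signature)
}
// Append the current values as the next block; with block size 1,
// the block number equals the number of rows already written.
long nextBlock = writer.object().getDataSetInformation("log1").getDimensions()[0];
writer.compound().writeArrayBlock("log1", type, new List<?>[] { valueList }, nextBlock);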
Is there some option to achieve this that I have overlooked, or can't this be done with JHDF5?
I don't think that will work. It is not quite clear to me, but I believe the getInferredType() you are using is creating a dataset with 2 name -> value entries. So it is effectively creating an object inside the HDF5 file. The best solution I could come up with was to read the previous values and add them to the valueList before writing it back out:
ArrayList<String> valueList = new ArrayList<>();
valueList.add("Value1");
valueList.add("Value2");

try (IHDF5Reader reader = HDF5Factory.configure("My_Log.h5").reader()) {
    String[] previous = reader.string().readArray("log1");
    for (int i = 0; i < previous.length; i++) {
        valueList.add(i, previous[i]);
    }
} catch (HDF5FileNotFoundException ex) {
    // Nothing to do here.
}

MDArray<String> values = new MDArray<>(String.class, new long[]{valueList.size()});
for (int i = 0; i < valueList.size(); i++) {
    values.set(valueList.get(i), i);
}

try (IHDF5Writer writer = HDF5Factory.configure("My_Log.h5").writer()) {
    writer.string().writeMDArray("log1", values);
}
If you call this code a second time with "Value3" and "Value4" instead, you will get 4 values. This sort of solution might become unpleasant if you start to have hierarchies of datasets, however.
To solve your issue, you need to define the dataset log1 as extendible so that it can store an unknown number of log entries (that are generated over time) and write these using a point or hyperslab selection (otherwise, the dataset will be overwritten).
If you are not bound to a specific technology to handle HDF5 files, you may wish to take a look at HDFql, which is a high-level language to manage HDF5 files easily. A possible solution for your use case using HDFql (in Java) is:
public class Example
{
    public static class Log
    {
        String name1;
        String name2;
    }

    public static boolean doSomething(Log log)
    {
        log.name1 = "Value1";
        log.name2 = "Value2";
        return true;
    }

    public static void main(String args[])
    {
        // declare variables
        Log log = new Log();
        int variableNumber;

        // create an HDF5 file named 'My_Log.h5' and use (i.e. open) it
        HDFql.execute("CREATE AND USE FILE My_Log.h5");

        // create an extendible HDF5 dataset named 'log1' of data type compound
        HDFql.execute("CREATE DATASET log1 AS COMPOUND(name1 AS VARCHAR, name2 AS VARCHAR)(0 TO UNLIMITED)");

        // register variable 'log' for subsequent usage (by HDFql)
        variableNumber = HDFql.variableRegister(log);

        // call function 'doSomething' that does something and populates variable 'log' with an entry
        while(doSomething(log))
        {
            // alter (i.e. extend) dataset 'log1' to +1 (i.e. add a new row)
            HDFql.execute("ALTER DIMENSION log1 TO +1");

            // insert (i.e. write) data stored in variable 'log' into dataset 'log1' using a point selection
            HDFql.execute("INSERT INTO log1(-1) VALUES FROM MEMORY " + variableNumber);
        }
    }
}
Related
So I have an Object coming in that can be any of 100 different specific objects, with different elements inside it: other objects, lists, sequences, primitives, etc.
I want to strip out the values in a depth-first fashion to make a string of simple values with a delimiter between them. I have mapped the fields and stored them elsewhere using recursion/reflection, which only happens the first time a new Object type comes in.
An example of how I'm storing the data in the database for a few simple example objects:
Object A layout table: Timestamp = 12345 Fields = Length|Width|Depth
Object B layout table: Timestamp = 12345 Fields = Height|Weight|Name
Object A layout table: Timestamp = 12350 Fields = Length|Width|Depth|Label
Object A sample: Timestamp = 12348 Values = 5|7|2
Object A sample: Timestamp = 12349 Values = 4|3|1
Object B sample: Timestamp = 12346 Values = 75|185|Steve Irwin
Object A sample: Timestamp = 12352 Values = 7|2|8|HelloWorld
Below is my current solution. I'm seeking improvements or alternatives to the design to accomplish the goal stated above.
Currently I get the object in and translate it to JSON using gson.toJson(). From that, I cycle through the JSON to get values using the code below. The issue is that this code is very CPU-intensive on the low-end CPU I am developing for, because many samples come in per second. The overall purpose of the application is a data recorder that records real-time samples into a SQLite database. I have also attempted to store the unmodified JSON in a SQLite BLOB column, but this is terribly inefficient with regard to DB size. Is there a better/more efficient method for getting values out of an object?
I don't have an issue storing the field mapping, since it only needs to be done once, but the value stripping needs to be done for every sample. I know you can do it via reflection as well, but that is also processing-heavy. Does anyone have a better method?
public static List<String> stripValuesFromJson(JsonElement json)
{
    // List that accumulates the values; this is the return object.
    List<String> dataList = new ArrayList<String>();

    // Iterate through the JsonElement and start parsing out values
    for (Entry<String, JsonElement> entry : ((JsonObject) json).entrySet())
    {
        // Call the recursive processor that parses out items based on their
        // individual type: primitive, array, object, etc.
        dataList.addAll(dataParser(entry.getValue()));
    }
    return dataList;
}
/**
 * The actual data processor that parses out individual values and deals with
 * every possible type of data that can come in.
 *
 * @param json - The JSON element being recursed through
 * @return - the list of values
 */
public static List<String> dataParser(JsonElement json)
{
    List<String> dataList = new ArrayList<String>();

    // Deal with primitives
    if (json instanceof JsonPrimitive)
    {
        // Deal with items that come up as true/false.
        if (json.getAsString().equals("false"))
        {
            dataList.add("0");
        } else if (json.getAsString().equals("true"))
        {
            dataList.add("1");
        } else
        {
            dataList.add(json.getAsString());
        }
    } else if (json instanceof JsonObject)
    {
        // Send through recursion to get the primitives or objects out of this object
        dataList.addAll(stripValuesFromJson(json));
    } else if (json instanceof JsonArray)
    {
        // Send through recursion for each element in this array/sequence
        for (JsonElement a : (JsonArray) json)
        {
            dataList.addAll(dataParser(a));
        }
    } else if (json instanceof JsonNull)
    {
        dataList.add(null);
    } else
    {
        errorLog.error("Unknown JSON type: " + json.getClass());
    }
    return dataList;
}
One thing you could try is writing your own JSON parser which simply emits values. I have more experience with JavaCC, so I'd take one of the existing JSON grammars and modify it so that it only outputs values. This should not be too complicated.
Take for example the booleanValue production from the mentioned grammar:
Boolean booleanValue(): {
    Boolean b;
}{
    (
        (
            <TRUE>
            { b = Boolean.TRUE; }
        ) | (
            <FALSE>
            { b = Boolean.FALSE; }
        )
    )
    { return b; }
}
Basically you will need to replace returning the boolean value with appending "1" or "0" to the target list.
ANTLR is another option.
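If you'd rather stay with Gson, its streaming JsonReader is another way to emit values without building the JsonElement tree at all, which should cut both CPU time and garbage. A sketch (the class name ValueStripper is mine, and the boolean-to-"1"/"0" mapping mirrors your dataParser; treat this as a starting point, not a drop-in):
import com.google.gson.stream.JsonReader;
import com.google.gson.stream.JsonToken;

import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

public class ValueStripper {

    public static List<String> stripValues(String json) throws IOException {
        List<String> values = new ArrayList<>();
        try (JsonReader reader = new JsonReader(new StringReader(json))) {
            while (true) {
                JsonToken token = reader.peek();
                switch (token) {
                    case BEGIN_OBJECT: reader.beginObject(); break;
                    case END_OBJECT:   reader.endObject();   break;
                    case BEGIN_ARRAY:  reader.beginArray();  break;
                    case END_ARRAY:    reader.endArray();    break;
                    case NAME:         reader.nextName();    break; // skip field names
                    case STRING:
                    case NUMBER:       values.add(reader.nextString()); break; // nextString() also returns numbers as text
                    case BOOLEAN:      values.add(reader.nextBoolean() ? "1" : "0"); break;
                    case NULL:         reader.nextNull(); values.add(null); break;
                    case END_DOCUMENT: return values;
                }
            }
        }
    }
}
Since you already have the object, you would still pay for gson.toJson(); the next step beyond this would be walking the object graph directly with your cached field mapping, but the streaming reader alone removes the intermediate tree.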
I have this piece of data (this is just one part of one line of the whole file):
000000055555444444******4444 YY
I implemented this CSV config file to be able to read each part of the data and parse it:
128-12,140-22,YY
The first pair (128-12) represents the position in the line at which to start reading and the number of characters to read; this first pair is for the account number.
The second pair is for the card number.
And the third parameter is for the registry name.
Anyway, what I do is String.split(","), and then assign array[0] as the account number and so on.
But I want to change that CSV config file to a Properties file, and I'm not sure how to implement that solution. If I use a Properties file, I'd have to add a bunch of if/then checks in order to properly map my values. Here's what I'm thinking of doing:
Properties cfg = new Properties();
cfg.put("fieldName", "accountNumber");
cfg.put("startPosition", "128");
cfg.put("length", "12");
But then I'd have to say if ("fieldName".equals("accountNumber")) to assign accountNumber. Is there a way to implement this so that I can avoid all those decisions? Right now, with my solution, I don't have to use ifs; I only say accountNumber = array[0] and that's it. But I don't think that's a good solution, and I think that using a Properties file would be more elegant or efficient.
EDIT:
This probably needs some more clarification. The data I'm showing is part of a parsing program that I'm currently writing for a client; the data holds information for many, many of their customers, and I have to parse a huge mess of data that I receive from them into something more readable in order to convert it to a PDF file. So far the program is in production, but I'm trying to refactor it a little bit. All the customer information is saved into different Registry classes, each class having its own set of fields with unique information. Let's say that this is what RegistryYY would look like:
class RegistryYY extends Registry {
    String name;
    String lastName;
    PhysicalAddress address;

    public RegistryYY(String dataFromFile) {
    }
}
I want to implement the Properties solution because, that way, I could make the rules for parsing the file (or interpreting the data correctly) be owned by each Registry class. I mean, a Registry should know what data it needs from the data received from the file, right? I think that if I do it that way, I could make each Registry an Observer: they would decide whether the current line read from the file belongs to them by checking the registry name stored in the current line, and then they'd return an initialized Registry to the calling object, which only cares about receiving and storing a Registry.
EDIT 2:
I created this function to return the value stored at a given position in each line:
public static String getField(String fieldParams, String rawData){
    // split the field definition, e.g. "128-12"
    String[] fields = fieldParams.split("-");
    int fieldStart = Integer.parseInt(fields[0]); // initial position of the field
    int fieldLen = Integer.parseInt(fields[1]);   // length of the field

    // get the field value
    String fieldValue = FieldParser.getStringValue(rawData, fieldStart, fieldLen);
    return fieldValue;
}
This works with the CSV file; I'd like to change the implementation to work with a Properties file instead.
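Something like this is what I'm aiming for: a sketch using java.util.Properties, where the key names (e.g. "accountNumber.startPosition") and the FieldConfig class are just what I have in mind, nothing final:
import java.io.FileReader;
import java.io.IOException;
import java.util.Properties;

public class FieldConfig {

    private final Properties props = new Properties();

    public FieldConfig(String fileName) throws IOException {
        try (FileReader in = new FileReader(fileName)) {
            props.load(in);
        }
    }

    // No if/then chain: both numbers are looked up by the field's name.
    public String getField(String fieldName, String rawData) {
        int start = Integer.parseInt(props.getProperty(fieldName + ".startPosition"));
        int length = Integer.parseInt(props.getProperty(fieldName + ".length"));
        return rawData.substring(start, start + length);
    }
}
Then accountNumber = config.getField("accountNumber", line); and no decisions are needed.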
Is there any reason why you need to have the record layout exposed to the outside world? Does it need to be configurable?
I think your proposed approach of using a Properties file is better than your current approach of using the CSV file, since it is more descriptive and meaningful. I would also add a "type" attribute to your property definitions to drive your conversion, i.e. Numeric/String/Date/Boolean.
I wouldn't use an "if" statement to process your property file. You can load all the properties into an array at the beginning, then iterate over the array for each line of your data file and process that section accordingly, something like the pseudocode below:
for each line of data-file {
    SomeClass myClass = myClassBuilder(data-file-line)
}

SomeClass myClassBuilder(String data-file-line) {
    Map<column, value> result = new HashMap<>
    for each attribute of property-file-list {
        switch attribute_type {
            Integer:
                result.put(fieldname, makeInteger(data-file-line, property_attribute))
            Date:
                result.put(fieldname, makeDate(data-file-line, property_attribute))
            Boolean:
                result.put(fieldname, makeBoolean(data-file-line, property_attribute))
            String:
                result.put(fieldname, makeString(data-file-line, property_attribute))
            ------- etc
        }
    }
    return new SomeClass(result)
}
If your record layout doesn't need to be configurable, then you could do all the conversion inside your Java application and not even use a Properties file.
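For example, a sketch of that hard-coded variant using an enum (the first two positions are the ones from your CSV config; the registry-name position is illustrative):
enum StatementField {
    ACCOUNT_NUMBER(128, 12),
    CARD_NUMBER(140, 22),
    REGISTRY_NAME(162, 2); // illustrative position

    private final int start;
    private final int length;

    StatementField(int start, int length) {
        this.start = start;
        this.length = length;
    }

    String extractFrom(String line) {
        return line.substring(start, start + length);
    }
}
Then StatementField.ACCOUNT_NUMBER.extractFrom(line) gives the account number, and adding a field is a one-line change.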
If you could get your data in XML format then you could use the JAXB framework and simply have your data definition in an XML file.
First of all, thanks to the guys who helped me: @robbie70, @RC. and @VinceEmigh.
I used YAML to parse a file called "test.yml" with the following information in it:
statement:
    - fieldName: accountNumber
      startPosition: 128
      length: 12
    - fieldName: cardNumber
      startPosition: 140
      length: 22
    - fieldName: registryName
      startPosition: 162
      length: 2
This is what I made:
// Start of main
String fileValue = "0222000000002222F 00000000000111110001000000099999444444******4444 YY";

YamlReader reader = new YamlReader(new FileReader("test.yml"));
Object object = reader.read();
System.out.println(object);

Map map = (Map) object;
List list = (List) map.get("statement");
for (int i = 0; i < list.size(); i++) {
    Map map2 = (Map) list.get(i);
    System.out.println("Value: " + foo(map2, fileValue));
}
// End of main
public static String foo(Map map, String source) {
    int startPos = Integer.parseInt((String) map.get("startPosition"));
    int length = Integer.parseInt((String) map.get("length"));
    return getField(startPos, length, source);
}

public static String getField(int start, int length, String source) {
    return source.substring(start, start + length);
}
It correctly displays the output:
Value: 000000099999
Value: 444444******4444
Value: YY
I know that maybe the config file has some lists and other unnecessary values and whatnot, and that maybe the program needs a little improvement, but I think I can take it from here and implement what I had in mind.
EDIT:
I made this other version using Apache Commons Configuration. This is what I have in the configuration properties file:
#properties defining the statement file
#properties for account number
statement.accountNumber.startPosition = 128
statement.accountNumber.length = 12
statement.account.rules = ${statement.accountNumber.startPosition} ${statement.accountNumber.length}
#properties for card number
statement.cardNumber.startPosition = 140
statement.cardNumber.length = 22
statement.card.rules = ${statement.cardNumber.startPosition} ${statement.cardNumber.length}
#properties for registry name
statement.registryName.startPosition = 162
statement.registryName.length = 2
statement.registry.rules = ${statement.registryName.startPosition} ${statement.registryName.length}
And this is how I read it:
// Inside Main
String valorLeido = "0713000000007451D 00000000000111110001000000099999444444******4444 YY";

Parameters params = new Parameters();
FileBasedConfigurationBuilder<FileBasedConfiguration> builder =
        new FileBasedConfigurationBuilder<FileBasedConfiguration>(PropertiesConfiguration.class)
                .configure(params.properties()
                        .setFileName("config.properties"));
try {
    Configuration config = builder.getConfiguration();
    Iterator<String> keys = config.getKeys();
    String account = getValue(getRules(config, "statement.account.rules"), valorLeido);
    String cardNumber = getValue(getRules(config, "statement.card.rules"), valorLeido);
    String registryName = getValue(getRules(config, "statement.registry.rules"), valorLeido);
} catch (org.apache.commons.configuration2.ex.ConfigurationException e) {
    e.printStackTrace();
}
// End of Main
public static String getRules(Configuration config, String rules) {
    return config.getString(rules);
}

public static String getValue(String rules, String source) {
    String[] tokens = rules.split(" ");
    int startPos = Integer.parseInt(tokens[0]);
    int length = Integer.parseInt(tokens[1]);
    return getField(startPos, length, source);
}
I'm not entirely sure; I think that with the YAML file it looks simpler, but I really like the control I get with Apache Commons Configuration, since I can pass the Configuration object around to each registry, and the registry knows what "rules" it wants to get. Let's say that the Registry class only cares about "statement.registry.rules" and that's it. With the YAML option I'm not entirely sure how to do that yet; maybe I'll need to experiment with both options a little more, but I like where this is going.
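For the Commons Configuration route, a sketch of that per-registry idea (it assumes the Registry base class shown earlier, and the field is illustrative):
import org.apache.commons.configuration2.Configuration;

class RegistryYY extends Registry {

    private final String registryName;

    RegistryYY(Configuration config, String rawLine) {
        // This registry only knows about its own rules key.
        String[] rules = config.getString("statement.registry.rules").split(" ");
        int start = Integer.parseInt(rules[0]);
        int length = Integer.parseInt(rules[1]);
        this.registryName = rawLine.substring(start, start + length);
    }
}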
PS:
That weird value I used in fileValue is what I'm dealing with; now add nearly 1,000 characters to the length of the line and you'll understand why I want a config file for parsing it (don't ask why... clients be crazy).
I have a CSV file full of data downloaded from Fitbit. The data inside the CSV file follows a basic format:
<Type of Data>
<Columns-comma-separated>
<Data-related-to-columns>
Here is a small example of the layout of the file:
Activities
Date,Calories Burned,Steps,Distance,Floors,Minutes Sedentary,Minutes Lightly Active,Minutes Fairly Active,Minutes Very Active,Activity Calories
"2016-07-17","3,442","9,456","4.41","12","612","226","18","44","1,581"
"2016-07-18","2,199","7,136","3.33","10","370","93","12","46","1,092"
...other logs
Sleep
Date,Minutes Asleep,Minutes Awake,Number of Awakenings,Time in Bed
"2016-07-17","418","28","17","452"
"2016-07-18","389","26","10","419"
Now, I am using CSVParser from the Apache Commons library to go through this data. My goal is to turn this into Java objects that can turn relevant data into JSON (I need the JSON to upload into a different website). CSVParser has an iterator that I can use to iterate through the CSVRecords in the file. So, essentially, I have a "list" of all of the data.
Because the file contains different types of data (Sleep logs, Activity logs, etc), I need to get a subsection/sub-list of the file, and pass it into a class to analyse it.
I need to iterate over the list and look for the keyword that identifies a new section of the file (e.g. Activities, Foods, Sleep, etc). Once I have identified what the next part of the file is, I need to select all of the following rows up until the next category.
Now, for the actual question: I don't know how to use an iterator to get the equivalent of List.subList(). Here is what I have been trying:
while (iterator.hasNext())
{
    CSVRecord current = iterator.next();
    if (current.get(0).equals("Activities"))
    {
        iterator.next(); //Columns
        while (iterator.hasNext() && iterator.next().get(0).isData()) //isData isn't real, but I can't figure out what I need to do.
        {
            //How do I sublist it here?
        }
    }
}
So, I need to determine whether the next CSVRecord begins with a quote/has data, loop until I find the next category, and finally pass a subsection of the file (using the iterator) to another function to do something with the correct log.
Edit
I considered converting it first to a List with a while loop, and then sub-listing, but that seemed wasteful. Correct me if I am wrong.
Also, I can't assume that each section will have the same number of rows following it. They might be similar, but there are also the food logs, which follow a completely different pattern. Here are two different days; Foods follows the normal pattern, but the Food Logs do not.
Foods
Date,Calories In
"2016-07-17","0"
"2016-07-18","1,101"
Food Log 20160717
Daily Totals
"","Calories","0"
"","Fat","0 g"
"","Fiber","0 g"
"","Carbs","0 g"
"","Sodium","0 mg"
"","Protein","0 g"
"","Water","0 fl oz"
Food Log 20160718
Meal,Food,Calories
"Lunch"
"","Raspberry Yogurt","190"
"","Almond Sweet & Salty Granola Bar","140"
"","Goldfish Baked Snack Crackers, Cheddar","140"
"","Bagels, Whole Wheat","190"
"","Braided Twists Honey Wheat Pretzels","343"
"","Apples, raw, gala, with skin - 1 medium","98"
"Daily Totals"
"","Calories","1,101"
"","Fat","21 g"
"","Fiber","13 g"
"","Carbs","202 g"
"","Sodium","1,538 mg"
"","Protein","28 g"
"","Water","24 fl oz"
The easiest way to do what you want is to simply remember the previous category's data, and when you hit a new category, process that data and reset for the next category. This should work:
String categoryName = null;
List<List<String>> categoryData = new ArrayList<>();

while (iterator.hasNext()) {
    CSVRecord current = iterator.next();
    if (current.size() == 1) { //start of next category
        processCategory(categoryName, categoryData);
        categoryName = current.get(0);
        categoryData.clear();
        iterator.next(); //skip header
    } else { //category data
        List<String> rowData = new ArrayList<>(current.size());
        CollectionUtils.addAll(rowData, current.iterator()); //uses Apache Commons Collections, but you can use whatever
        categoryData.add(rowData);
    }
}
processCategory(categoryName, categoryData); //last category of file
And then:
void processCategory(String categoryName, List<List<String>> categoryData) {
    if (categoryName != null) { //null on the very first call, before any category has been read
        //do stuff
    }
}
The above assumes that a List<List<String>> is the data structure that you want to deal with, but you can tweak as you see fit. I might even recommend simply passing List<Iterable<String>> to the process method (CSVRecord implements Iterable<String>) and handling the row data there.
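That variant might look like this (a sketch; the body is only the shape of the processing):
void processCategory(String categoryName, List<Iterable<String>> categoryData) {
    if (categoryName == null) {
        return; //first call, before any category has been read
    }
    for (Iterable<String> row : categoryData) {
        for (String cell : row) {
            //do stuff with each cell
        }
    }
}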
This can definitely be cleaned up further, but it should get you started.
I'm trying to build a text classifier using Weka, but the probabilities from distributionForInstance are 1.0 for one class and 0.0 for all the others, so classifyInstance always returns the same class as the prediction. Something in the training doesn't work correctly.
ARFF training
@relation test1
@attribute tweetmsg String
@attribute classValues {politica,sport,musicatvcinema,infogeneriche,fattidelgiorno,statopersonale,checkin,conversazione}
@DATA
"Renzi Berlusconi Salvini Bersani",politica
"Allegri insulta la terna arbitrale",sport
"Bravo Garcia",sport
Training methods
public void trainClassifier(final String INPUT_FILENAME) throws Exception
{
    getTrainingDataset(INPUT_FILENAME);

    //trainingInstances consists of the feature vector of every input
    for (Instance currentInstance : inputDataset)
    {
        Instance currentFeatureVector = extractFeature(currentInstance);
        currentFeatureVector.setDataset(trainingInstances);
        trainingInstances.add(currentFeatureVector);
    }

    classifier = new NaiveBayes();
    try {
        //classifier training code
        classifier.buildClassifier(trainingInstances);
        //store the trained classifier in a file for future use
        weka.core.SerializationHelper.write("NaiveBayes.model", classifier);
    } catch (Exception ex) {
        System.out.println("Exception in training the classifier." + ex);
    }
}
private Instance extractFeature(Instance inputInstance) throws Exception
{
    String tweet = inputInstance.stringValue(0);
    StringTokenizer defaultTokenizer = new StringTokenizer(tweet);
    List<String> tokens = new ArrayList<String>();
    while (defaultTokenizer.hasMoreTokens())
    {
        String t = defaultTokenizer.nextToken();
        tokens.add(t);
    }

    Iterator<String> a = tokens.iterator();
    while (a.hasNext())
    {
        String token = (String) a.next();
        String word = token.replaceAll("#", "");
        if (featureWords.contains(word))
        {
            double cont = featureMap.get(featureWords.indexOf(word)) + 1;
            featureMap.put(featureWords.indexOf(word), cont);
        }
        else
        {
            featureWords.add(word);
            featureMap.put(featureWords.indexOf(word), 1.0);
        }
    }

    attributeList.clear();
    for (String featureWord : featureWords)
    {
        attributeList.add(new Attribute(featureWord));
    }
    attributeList.add(new Attribute("Class", classValues));

    int indices[] = new int[featureMap.size() + 1];
    double values[] = new double[featureMap.size() + 1];
    int i = 0;
    for (Map.Entry<Integer, Double> entry : featureMap.entrySet())
    {
        indices[i] = entry.getKey();
        values[i] = entry.getValue();
        i++;
    }
    indices[i] = featureWords.size();
    values[i] = (double) classValues.indexOf(inputInstance.stringValue(1));

    trainingInstances = createInstances("TRAINING_INSTANCES");
    return new SparseInstance(1.0, values, indices, 1000000);
}
private void getTrainingDataset(final String INPUT_FILENAME)
{
    try {
        ArffLoader trainingLoader = new ArffLoader();
        trainingLoader.setSource(new File(INPUT_FILENAME));
        inputDataset = trainingLoader.getDataSet();
    } catch (IOException ex)
    {
        System.out.println("Exception in getTrainingDataset Method");
    }
    System.out.println("dataset " + inputDataset.numAttributes());
}
private Instances createInstances(final String INSTANCES_NAME)
{
    //create an Instances object with initial capacity of zero
    Instances instances = new Instances(INSTANCES_NAME, attributeList, 0);
    //set the class index to the last attribute
    instances.setClassIndex(instances.numAttributes() - 1);
    return instances;
}

public static void main(String[] args) throws Exception
{
    Classificatore wekaTutorial = new Classificatore();
    wekaTutorial.trainClassifier("training_set_prova_tent.arff");
    wekaTutorial.testClassifier("testing.arff");
}
public Classificatore()
{
    attributeList = new ArrayList<Attribute>();
    initialize();
}

private void initialize()
{
    featureWords = new ArrayList<String>();
    featureMap = new TreeMap<>();
    classValues = new ArrayList<String>();
    classValues.add("politica");
    classValues.add("sport");
    classValues.add("musicatvcinema");
    classValues.add("infogeneriche");
    classValues.add("fattidelgiorno");
    classValues.add("statopersonale");
    classValues.add("checkin");
    classValues.add("conversazione");
}
Testing methods
public void testClassifier(final String INPUT_FILENAME) throws Exception
{
    getTrainingDataset(INPUT_FILENAME);

    //testingInstances consists of the feature vector of every input
    Instances testingInstances = createInstances("TESTING_INSTANCES");
    for (Instance currentInstance : inputDataset)
    {
        //extractFeature returns the feature vector for the current input
        Instance currentFeatureVector = extractFeature(currentInstance);
        //make currentFeatureVector ready to be added to testingInstances
        currentFeatureVector.setDataset(testingInstances);
        testingInstances.add(currentFeatureVector);
    }

    try {
        //classifier deserialization
        classifier = (Classifier) weka.core.SerializationHelper.read("NaiveBayes.model");

        //classifier testing code
        for (Instance testInstance : testingInstances)
        {
            double score = classifier.classifyInstance(testInstance);
            double[] vv = classifier.distributionForInstance(testInstance);
            for (int k = 0; k < vv.length; k++) {
                //these are the class probabilities; as a result I get 1.0 for one class and 0.0 for all the others
                System.out.println("distribution " + vv[k]);
            }
            System.out.println(testingInstances.attribute("Class").value((int) score));
        }
    } catch (Exception ex) {
        System.out.println("Exception in testing the classifier." + ex);
    }
}
I want to create a text classifier for short messages; this code is based on this tutorial: http://preciselyconcise.com/apis_and_installations/training_a_weka_classifier_in_java.php. The problem is that the classifier predicts the wrong class for almost every message in testing.arff because the class probabilities are not correct. The training_set_prova_tent.arff has the same number of messages per class.
The example I'm following uses a featureWords.dat file and associates 1.0 with a word if it is present in a message. Instead, I want to create my own dictionary with the words present in training_set_prova_tent plus the words present in the testing set, and associate with every word its number of occurrences.
P.S.
I know that this is exactly what I can do with the filter StringToWordVector, but I haven't found any example that explains how to use this filter with two files: one for the training set and one for the test set. So it seemed easier to adapt the code I found.
Thank you very much
It seems like you changed the code from the website you referenced in some crucial points, but not in a good way. I'll try to draft what you're trying to do and what mistakes I've found.
What you (probably) wanted to do in extractFeature is
Split each tweet into words (tokenize)
Count the number of occurrences of these words
Create a feature vector representing these word counts plus the class
What you've overlooked in that method is
You never reset your featureMap. The line
Map<Integer,Double> featureMap = new TreeMap<>();
originally was at the beginning of extractFeature, but you moved it to initialize. That means you always add up the word counts but never reset them: for each new tweet, your word count also includes the word counts of all previous tweets. I'm sure that is not what you wanted (see the fix sketch after this list).
You don't initialize featureWords with the words you want as features. Yes, you create an empty list, but you fill it iteratively with each tweet. The original code initialized it once in the initialize method, and it never changed after that. There are two problems with that:
With each new tweet, new features (words) get added, so your feature vector grows with each tweet. That wouldn't be such a big problem by itself (SparseInstance), but it means that
your class attribute is always in a different place. These two lines work for the original code, because featureWords.size() is basically a constant there, but in your code the class label will be at index 5, then 8, then 12, and so on, whereas it must be the same for every instance:
indices[i] = featureWords.size();
values[i] = (double) classValues.indexOf(inputInstance.stringValue(1));
This also manifests itself in the fact that you build a new attributeList with each new tweet, instead of only once in initialize, which is bad for the reasons already explained.
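For the first point alone (the never-reset featureMap), a minimal fix sketch would be to clear the map at the top of extractFeature, assuming the rest of the method stays as posted:
private Instance extractFeature(Instance inputInstance) throws Exception
{
    // Reset the per-tweet word counts so counts from earlier tweets
    // do not leak into this instance.
    featureMap.clear();
    // ... rest of the method as posted
}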
There may be more issues, but as it is, your code is rather unfixable. What you want is much closer to the tutorial source code you modified than to your version.
Also, you should look into StringToWordVector because it seems like this is exactly what you want to do:
Converts String attributes into a set of attributes representing word occurrence (depending on the tokenizer) information from the text contained in the strings. The set of words (attributes) is determined by the first batch filtered (typically training data).
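To illustrate the two-file usage the question says is hard to find: a sketch (the class name TrainAndTest is mine; the file names are from the question; the rest is the standard Weka filtering idiom) that initializes the filter on the training set, so the test set is vectorized against the same dictionary:
import weka.classifiers.bayes.NaiveBayes;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.StringToWordVector;

public class TrainAndTest {
    public static void main(String[] args) throws Exception {
        Instances train = DataSource.read("training_set_prova_tent.arff");
        Instances test = DataSource.read("testing.arff");
        train.setClassIndex(train.numAttributes() - 1);
        test.setClassIndex(test.numAttributes() - 1);

        StringToWordVector filter = new StringToWordVector();
        filter.setOutputWordCounts(true); // occurrence counts instead of 0/1
        filter.setInputFormat(train);     // the dictionary comes from the training data

        Instances trainVec = Filter.useFilter(train, filter);
        Instances testVec = Filter.useFilter(test, filter); // same dictionary, same attribute positions

        NaiveBayes nb = new NaiveBayes();
        nb.buildClassifier(trainVec);
        for (int i = 0; i < testVec.numInstances(); i++) {
            double[] dist = nb.distributionForInstance(testVec.instance(i));
            System.out.println(java.util.Arrays.toString(dist));
        }
    }
}
Because setInputFormat() is called with the training data, the filter's dictionary is fixed there; the test instances are mapped onto that dictionary, which keeps the class index and attribute positions consistent across both sets.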
I have a HashSet that I created, and this is what it contains. It will contain more later on; this is pasted from standard out when I did a toString on it, just to show the contents.
foo.toString(): Abstractfoo [id=2, serial=1d21d, value=1.25, date=2012-09-02 12:00:00.0]
INFO [STDOUT] price.toString(): Abstractfoo [id=1, serial=1d24d, value=1.30, date=2012-09-19 12:00:00.0]
I also have a List, and I need to compare the two. One of the elements in the List is:
Bar.toString(): Bar [id=1d21d, name=Dell, description=Laptop, ownerId=null]
Here is what I am trying to do...
Bar contains all of the elements I want foo to have. There will only be one unique serial. I would like my program to check whether an element in the HashSet matches the id of an element in the List, i.e. serial == id.
Here is what I've been trying to do
Removed code and added clearer code below
I've verified the data is getting entered into the HashSet and List correctly by viewing it through the debugger.
foo is being pulled from a database through Hibernate, and bar is coming from a different source. If there is a matching element in bar, I need to add it to a list, and I'm passing it back to my UI where I'll enter some additional data and then commit it to the database.
Let me know if this makes sense and if I can provide anymore information.
Thanks
EDIT: Here is the class
#RequestMapping(value = "/system", method = RequestMethod.GET)
public #ResponseBody
List<AbstractSystem> SystemList() {
// Retrieve system list from database
HashSet<AbstractSystem> systemData = new HashSet<AbstractSystem>(
systemService.getSystemData());
// Retrieve system info from cloud API
List<SystemName> systemName= null;
try {
systemName = cloudClass.getImages();
} catch (Exception e) {
LOG.warn("Unable to get status", e);
}
// Tried this but, iter2 only has two items and iter has many more.
// In production it will be the other way around, but I need to not
// Have to worry about that
Iterator<SystemName> iter = systemName.iterator();
Iterator<AbstractSystem> iter2 = systemData .iterator();
while(iter.hasNext()){
Image temp = iter.next();
while(iter2.hasNext()){
AbstractPricing temp2 = iter2.next();
System.out.println("temp2.getSerial(): " + temp2.getSerial());
System.out.println("temp.getId(): " + temp.getId());
if(temp2.getSerial().equals(temp.getId())){
System.out.println("This will be slow...");
}
}
}
return systemData;
}
If N is the number of items in systemName and M is the number of items in systemData, then you've effectively built an O(N*M) method.
If you instead put your systemData into a HashMap keyed by the AbstractSystem.getSerial() values, then you just loop through the systemName collection and look up each systemName.getId(). This becomes more like O(N+M).
(You might want to avoid variables like iter, iter2, temp2, etc., since those make the code harder to read.)
EDIT - here's what I mean:
// Retrieve system list from database, keyed by serial
HashMap<Integer, AbstractSystem> systemDataMap = new HashMap<Integer, AbstractSystem>(
        systemService.getSystemDataMap());

// Retrieve system info from cloud API
List<SystemName> systemNames = cloudClass.getImages();

for (SystemName systemName : systemNames) {
    if (systemDataMap.containsKey(systemName.getId())) {
        System.out.println("This will be slow...");
    }
}
I used Integer because I can't tell from your code what the types of AbstractSystem.getSerial() or SystemName.getId() are. This assumes that you store the system data as a Map elsewhere. If not, you could construct the map yourself here, as sketched below.
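A sketch of building that map from the HashSet in the question (using String as the key type, since the toString() output above shows serials like "1d21d"):
// Build the serial -> system lookup once, then each contains check is O(1).
Map<String, AbstractSystem> systemDataMap = new HashMap<>();
for (AbstractSystem system : systemData) {
    systemDataMap.put(system.getSerial(), system);
}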