When creating a DMatrix in Java with the xgboost4j package, at first I succeeded in creating the matrix using a file path:
DMatrix trainMat = new DMatrix("...\\xgb_training_input.csv");
But when I try to train the model:
Booster booster = XGBoost.train(trainMat, params, round, watches, null, null);
I get the following error:
...regression_obj.cc:108: label must be in [0,1] for logistic regression
Now, my data is solid; I've checked it with an XGBoost model built in Python.
I'm guessing the problem is with the data format somehow.
Currently the format is as follows:
x1,x2,x3,x4,x5,y
where x1-x5 are real-valued numbers and y is either 0 or 1. The file extension is .csv.
Maybe the separator shouldn't be ','?
DMatrix expects a .libsvm file, which can easily be created with Python.
The libsvm format looks like this:
target 0:column1 1:column2 2:column3 ... and so on
So the target (label) is the first column, while every other (predictor) column is attached to an increasing index with ":" in between.
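If you want to stay in Java, a minimal converter sketch could look like the following (the file names are assumed from the question, the label is taken from the last CSV column, and no header row is expected):
import java.io.BufferedReader;
import java.io.IOException;
import java.io.PrintWriter;
import java.nio.file.Files;
import java.nio.file.Paths;

public class CsvToLibsvm {
    public static void main(String[] args) throws IOException {
        try (BufferedReader in = Files.newBufferedReader(Paths.get("xgb_training_input.csv"));
             PrintWriter out = new PrintWriter("xgb_training_input.libsvm")) {
            String line;
            while ((line = in.readLine()) != null) {
                String[] cols = line.split(",");
                // label (y) is the last column; features get 0-based indices
                StringBuilder sb = new StringBuilder(cols[cols.length - 1]);
                for (int i = 0; i < cols.length - 1; i++) {
                    sb.append(' ').append(i).append(':').append(cols[i]);
                }
                out.println(sb);
            }
        }
    }
}
The resulting file can then be passed to new DMatrix("xgb_training_input.libsvm").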
I am writing Java code to access a document open in LibreOffice.
I now need to write some code which iterates over the entire document, hopefully in the same order as it is shown in the editor.
I can use this code to iterate over all the normal text:
XComponent writerComponent=xComponentLoader.loadComponentFromURL(loadUrl, "_blank", 0, loadProps);
XTextDocument mxDoc=UnoRuntime.queryInterface(XTextDocument.class, writerComponent);
XText mxDocText=mxDoc.getText();
XEnumerationAccess xParaAccess = (XEnumerationAccess) UnoRuntime.queryInterface(XEnumerationAccess.class, mxDocText);
XEnumeration xParaEnum = xParaAccess.createEnumeration();
while (xParaEnum.hasMoreElements()) {
    Object element = xParaEnum.nextElement(); // advance to the next paragraph (or table)
    XEnumerationAccess inlineAccess = (XEnumerationAccess) UnoRuntime.queryInterface(XEnumerationAccess.class, element);
    XEnumeration inline = inlineAccess.createEnumeration();
    // And I can then iterate over this inline element and get all the text and formatting.
}
But the problem is that this does not include any chart objects.
I can then use
XDrawPagesSupplier drawSupplier=UnoRuntime.queryInterface(XDrawPagesSupplier.class, writerComponent);
XDrawPages pages=drawSupplier.getDrawPages();
XDrawPage drawPage = UnoRuntime.queryInterface(XDrawPage.class, pages.getByIndex(0)); // a Writer document has a single draw page
for (int j = 0; j != drawPage.getCount(); j++) {
    Object sub = drawPage.getByIndex(j);
    XShape subShape = UnoRuntime.queryInterface(XShape.class, sub);
    // Now I got my subShape, but how do I know its position relative to the text?
}
And this gives me all charts (and other figures, I guess), but the problem is: how do I find out where these charts are positioned in relation to the text in the model? And how do I get a cursor which represents each chart?
Update:
I am now looking for an anchor for my XShape, but XShape doesn't have a getAnchor() method.
But if I use
XPropertySet prop=UnoRuntime.queryInterface(XPropertySet.class,shape);
I get the prop class.
And I call prop.getPropertyValue("AnchorType"), which gives me an anchor type of TextContentAnchorType.AS_CHARACTER,
but I just can't get the anchor itself. There is no anchor or text range property.
By the way: I tried looking into installing "MRI" for LibreOffice, but the only version I could find lists LibreOffice 3.3 as the supported version, and it would not install on version 7.1.
Update 2:
I managed to make it work. It turns out that my XShape also implements XTextContent (Thank you MRI), so all I had to do was:
XTextContent subContent=UnoRuntime.queryInterface(XTextContent.class,subShape);
XTextRange anchor=subContent.getAnchor();
XTextCursor cursor = anchor.getText().createTextCursorByRange(anchor.getStart());
cursor.goRight((short)50,true);
System.out.println("String=" + cursor.getString());
This gives me a cursor which points to the paragraph, which I can then move forward/backward to find out where the shape is. So this println call will print the 50 characters following the XShape.
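Putting the two snippets together, here is a minimal sketch (assuming the writerComponent from the first snippet and the usual UNO imports; checked UNO exceptions are omitted) that walks every shape on the draw page and prints the 50 characters that follow its anchor:
XDrawPagesSupplier drawSupplier = UnoRuntime.queryInterface(XDrawPagesSupplier.class, writerComponent);
XDrawPages pages = drawSupplier.getDrawPages();
XDrawPage drawPage = UnoRuntime.queryInterface(XDrawPage.class, pages.getByIndex(0)); // single draw page in Writer
for (int j = 0; j != drawPage.getCount(); j++) {
    XShape shape = UnoRuntime.queryInterface(XShape.class, drawPage.getByIndex(j));
    XTextContent content = UnoRuntime.queryInterface(XTextContent.class, shape);
    if (content == null) {
        continue; // not every shape exposes XTextContent
    }
    XTextRange anchor = content.getAnchor();
    XTextCursor cursor = anchor.getText().createTextCursorByRange(anchor.getStart());
    cursor.goRight((short) 50, true); // select the 50 characters after the anchor
    System.out.println("Shape " + j + " is followed by: " + cursor.getString());
}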
How do I find out where these charts are positioned in relation to the text in the model? And how do I get a cursor which represents each chart?
Abridged comments
Anchors pin objects to a specific location. Does the shape have a method getAnchor() or property AnchorType? I would use an introspection tool such as MRI to determine this. Download MRI 1.3.4 from https://github.com/hanya/MRI/releases.
As for a cursor, maybe it is similar to tables:
oText = oTable.getAnchor().getText()
oCurs = oText.createTextCursor()
Code solution given by OP
XTextContent subContent=UnoRuntime.queryInterface(XTextContent.class,subShape);
XTextRange anchor=subContent.getAnchor();
XTextCursor cursor = anchor.getText().createTextCursorByRange(anchor.getStart());
cursor.goRight((short)50,true);
System.out.println("String=" + cursor.getString());
I am trying to create a NetCDF file using Java (the Unidata netCDF-Java library). One of the requirements is to include the _FillValue attribute in all the variables. I have one variable of type CHAR, and I cannot do it.
The Attribute constructor only accepts Strings or numbers (or arrays of them), not chars. I have tried both anyway, but the final NetCDF file does not show the attribute.
Other languages let you do it (we have seen this working in MATLAB), but I don't know how to do it using Java.
I see in the documentation that the _FillValue should be of the same type as the variable itself, but Attribute values do not accept chars, only Strings or Numbers.
For example, when I try
Nc4Chunking chunker = Nc4ChunkingStrategy.factory(Nc4Chunking.Strategy.standard, 6, true);
NetcdfFileWriter dataFile = NetcdfFileWriter.createNew(NetcdfFileWriter.Version.netcdf4_classic, fileName, chunker);
....
Variable varid_scdr = dataFile.addVariable(null, "SCDR", DataType.CHAR, dimsTMS15);
varid_scdr.addAttribute(new Attribute("_FillValue", " "));
....
dataFile.write(varid_scdr, scodData);
dataFile.close();
The resulting NetCDF file has no _FillValue; it is not written to the file.
But if I change the attribute name and do this
varid_scdr.addAttribute(new Attribute("FillValue", " "));
the attribute is present in the output file.
I have no problems with other data types or other attribute names. I am pretty sure that the problem is about the attribute _FillValue for the variable of type CHAR. I don't know how to write it, and I need the _FillValue attribute to be explicitly present in the variable's attribute list.
********* 5th July 2019 ***********
I realized that the problem is only related to netcdf4 and netcdf4_classic files, so perhaps it is about chunking or something like that. If I try creating netcdf3 files, it works.
Any help with this issue? What am I missing?
I think this is due to a bug that has been addressed in the latest version of netcdf-java (v5.0.0). v5.0.0 has been released and is available for download; my hope is that the announcement will go out today.
If you want to be explicit about writing a CHAR-valued attribute, one way to do it would be:
String fillValue = " ";
Array charArrayFillValue = ArrayChar.makeFromString(fillValue, 1);
Attribute charAttrFillValue = new Attribute("_FillValue", charArrayFillValue);
varid_scdr.addAttribute(charAttrFillValue);
Another way would be:
String fillValue = " ";
Array charArrayFillValue = ArrayChar.makeFromString(fillValue, 1);
Attribute charAttrFillValue = new Attribute("_FillValue", DataType.CHAR);
charAttrFillValue.setValues(charArrayFillValue);
varid_scdr.addAttribute(charAttrFillValue);
Both of those are a bit verbose, though. I just checked using version 5, and your one-liner works:
varid_scdr.addAttribute(new Attribute("_FillValue", " "));
However, if you try to pass in a value for _FillValue that isn't a string of length 1, the netCDF-C library will throw an error. So this:
varid_scdr.addAttribute(new Attribute("_FillValue", "ab"));
will result in:
-36 (NetCDF: Invalid argument) on attribute ':_FillValue = "ab"' on var varid_scdr
netCDF-Java will make sure the string you pass in gets converted to CHARs, but it won't truncate the resulting set of CHARs to fit into the single character limit on the _FillValue attribute.
I have a ranking task, where my training data looks like this:
session_id item_id item_features target
---------------------------------------------
session1 item1 ... 1
session1 item2 ... 0
...
sessionN item1 ... 0
sessionN itemX ... 10
sessionN itemY ... 0
...
I am using xgboost in R with the objective "rank:pairwise" for training the model. xgboost expects grouped data (same session_id) to be bunched together in the training and test sets. The lines belonging to the same session_id have to be specified using the function setinfo() (e.g. setinfo(model, 'group', group_info)).
When I evaluate the model in R, applying new data works perfectly. However, I have used the package pmml to convert the model into a pmml file in order to use it in Java.
In Java the pmml file gets parsed and evaluated via the org.jpmml pmml-evaluator dependency (v. 1.3.15). Feeding the same data as in R to the org.jpmml.evaluator.Evaluator yields different results, though. The results are mostly negative values, which is not a valid result in my setup: all predicted targets should be positive.
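For context, the Java side is essentially doing something like the following. This is only a rough sketch of the usual jpmml-evaluator 1.3.x flow; the feature names and file path are hypothetical:
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.LinkedHashMap;
import java.util.Map;
import org.dmg.pmml.FieldName;
import org.dmg.pmml.PMML;
import org.jpmml.evaluator.Evaluator;
import org.jpmml.evaluator.FieldValue;
import org.jpmml.evaluator.InputField;
import org.jpmml.evaluator.ModelEvaluatorFactory;
import org.jpmml.model.PMMLUtil;

public class RankingModelScorer {
    public static void main(String[] args) throws Exception {
        PMML pmml;
        try (InputStream is = Files.newInputStream(Paths.get("example_model.pmml"))) {
            pmml = PMMLUtil.unmarshal(is);
        }
        Evaluator evaluator = ModelEvaluatorFactory.newInstance().newModelEvaluator(pmml);

        // Raw values for one item of a single session; names must match the PMML data dictionary.
        Map<FieldName, Object> raw = new LinkedHashMap<>();
        raw.put(FieldName.create("item_feature_1"), 0.5);
        raw.put(FieldName.create("item_feature_2"), 1.0);

        Map<FieldName, FieldValue> arguments = new LinkedHashMap<>();
        for (InputField inputField : evaluator.getInputFields()) {
            FieldName name = inputField.getName();
            arguments.put(name, inputField.prepare(raw.get(name)));
        }
        System.out.println(evaluator.evaluate(arguments));
    }
}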
I have come up with two possible explanations:
There might be a bug in the pmml conversion in my scenario
I have no idea where I can apply the equivalent of setinfo() in Java. Since I am only applying the model to a single session at a time, I was under the impression that I did not need to specify it. But maybe I was wrong.
Please contact me for a fully working example including training and test data; I will send it via mail. But for starters, here is the R code for training the model:
library(xgboost)
example_matrix_train <- xgb.DMatrix(X, label = y)
setinfo(example_matrix_train, 'group', example_train_groupInfo)
example.model <- xgboost(data = example_matrix_train, objective = "rank:pairwise", max.depth = 8, eta = 0.2, nthread = 8, nround = 10, verbose=0)
library(pmml)
library(pmmlTransformations)
xgb.dump(example.model, "example.model.dumped.trees")
logfile <- file(paste0("pmml_example_model",Sys.Date(),".txt"), open="a")
sink(logfile)
pmml(example.model, inputFeatureNames = colnames(example_train), outputLabelName = "prediction1", xgbDumpFile = "example.model.dumped.trees")
sink()
Any help is welcome
I have come up with two possible explanations: There might be a bug in the pmml conversion
This is the true explanation - the pmml package is producing incorrect PMML for XGBoost models. The technical reason is that it uses the XGBoost text dump file as input, but the information contained therein is incomplete (e.g. rounded threshold values).
If you're looking to export XGBoost models into PMML, then you should be using the r2pmml package, which uses XGBoost binary files as input.
In truth, the 'pmml' package currently does not support the 'rank:pairwise' objective function you need. The upcoming release of the 'pmml' package (version 1.5.3) includes a check for unsupported objective functions.
I have generated a model file from a model trained in MATLAB, and I would like to load this into Android from a mobile device.
The model file looks like this, shown for the first three SVs and the parameters (which should be correct):
svm_type 0
kernel_type 2
gamma 3.3636
coef0 0
nr_class 2
total_sv 1106
rho -0.7401
label 0 1
nr_sv 754 352
SV
0 1:8.02710 2:8.90538 3:9.56450 4:10.15383
0 1:7.87334 2:8.71629 3:9.41049 4:9.45693
0 1:8.52795 2:9.19652 3:10.17247 4:10.30913 ...
However, when I load this using svm.svm_load_model(), the resulting model is null:
FileReader fIn = new FileReader("mymodel.txt");
BufferedReader bufferedReader = new BufferedReader(fIn);
svm_model model = svm.svm_load_model(bufferedReader);
I can't seem to find the problem. Anyone got an answer?
Thanks
EDIT: I figured out what the error is. The model file output from MATLAB is apparently not fully compatible with the Android load_model function, in that the values for the keys svm_type and kernel_type have to be specified as strings instead of numerals (c_svc instead of 0, rbf instead of 2).
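For illustration, with that fix the header of the model file shown above would need to read (same values, with the two keys written as strings):
svm_type c_svc
kernel_type rbf
gamma 3.3636
coef0 0
nr_class 2
total_sv 1106
rho -0.7401
label 0 1
nr_sv 754 352
SV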
You can do this because libsvm is written in C/C++, so you can use a wrapper (interfaced via JNI or similar) to access the C-based libsvm library on Android.
For example, you can use this wrapper: https://github.com/yctung/AndroidLibSvm
After you load the project, edit the "AndroidLibSvm/app/src/main/jni/jnilibsvm.cpp" file.
In this file you can load the model file by
model=svm_load_model(modelFile);
You can also access other libsvm functions as you want
I downloaded and used the OpenIE 4.1 jar file (downloadable from http://knowitall.github.io/openie/) to process some free-text documents, which produced triplet-like outputs along with the text and a confidence score, for instance:
The rail launchers are conceptually similar to the underslung SM-1
0.93 (The rail launchers; are; conceptually similar to the underslung SM-1)
I wrote a Java parser to extract OpenIE triplets whose confidence score is >= 0.85, and I need to know how to convert them to the N-Triples (NT) format.
I am not sure if I need to be familiar with the ontology that I'm trying to map to.
After discussion with my colleagues, this is what I should do to create N-Triples (NT); a Jena sketch follows the list below. Detailed Java code can be found in another question: Use RDF API (Jena, OpenRDF or Protege) to convert OpenIE outputs
Create a blank node identifier for each distinct :subject in the file (call it node_s)
Create a blank node identifier for each distinct :object in the file (call it node_o)
Define a URI for each distinct predicate
Create these triples:
1. node_s rdf:type <http://mypage.org/vocab#Corpus>
2. node_s dc:title “The rail launchers”
3. node_s dc:source “Sample File”
4. node_s rdf:predicate <http://mypage.org/vocab#are>
5. node_o rdf:type <http://mypage.org/vocab#Corpus>
6. node_o dc:title “conceptually similar to the underslung SM-1”
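Here is a minimal sketch of building those six triples with Apache Jena and writing them out as N-Triples (the vocab URI is the placeholder from above, and Jena's DC_11 vocabulary stands in for the dc: prefix):
import org.apache.jena.rdf.model.Model;
import org.apache.jena.rdf.model.ModelFactory;
import org.apache.jena.rdf.model.Resource;
import org.apache.jena.vocabulary.DC_11;
import org.apache.jena.vocabulary.RDF;

public class OpenIeToNTriples {
    public static void main(String[] args) {
        String vocab = "http://mypage.org/vocab#";
        Model model = ModelFactory.createDefaultModel();

        Resource corpus = model.createResource(vocab + "Corpus");
        Resource nodeS = model.createResource(); // blank node for the subject phrase
        Resource nodeO = model.createResource(); // blank node for the object phrase

        model.add(nodeS, RDF.type, corpus);
        model.add(nodeS, DC_11.title, "The rail launchers");
        model.add(nodeS, DC_11.source, "Sample File");
        model.add(nodeS, RDF.predicate, model.createResource(vocab + "are"));
        model.add(nodeO, RDF.type, corpus);
        model.add(nodeO, DC_11.title, "conceptually similar to the underslung SM-1");

        model.write(System.out, "N-TRIPLES"); // serialize as N-Triples
    }
}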