Convert OpenIE triples to N-Triples (NT) - Java

I downloaded the OpenIE 4.1 jar file (available from http://knowitall.github.io/openie/) and used it to process some free-text documents. It produces triple-like output along with the source sentence and a confidence score, for instance:
The rail launchers are conceptually similar to the underslung SM-1
0.93 (The rail launchers; are; conceptually similar to the underslung SM-1)
I wrote a Java parser that extracts the OpenIE triples whose confidence score is >= 0.85, and I
need to know how to convert them to the N-Triples (NT) format.
I am not sure whether I need to be familiar with the ontology that I'm trying to map to.

After discussing this with my colleagues, this is what I should do to create N-Triples (NT). Detailed Java code can be found in another question: Use RDF API (Jena, OpenRDF or Protege) to convert OpenIE outputs
Create a blank node identifier for each distinct :subject in the file (call it node_s)
Create a blank node identifier for each distinct :object in the file (call it node_o)
Define a URI for each distinct predicate
Create these triples:
1. node_s rdf:type <http://mypage.org/vocab#Corpus>
2. node_s dc:title "The rail launchers"
3. node_s dc:source "Sample File"
4. node_s rdf:predicate <http://mypage.org/vocab#are>
5. node_o rdf:type <http://mypage.org/vocab#Corpus>
6. node_o dc:title "conceptually similar to the underslung SM-1"
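The steps above can be sketched in plain Java without an RDF library. This is only a minimal sketch under some assumptions: the class name OpenIeToNTriples is made up for illustration, the dc: prefix is assumed to mean the Dublin Core elements namespace, relation names are turned into URIs by replacing non-alphanumeric characters with underscores, and every input line is assumed to look like "<confidence> (<subject>; <relation>; <object>)". Lines with extra arguments or context prefixes would need more handling.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class OpenIeToNTriples {
    // VOCAB comes from the steps above; RDF is the standard namespace;
    // DC is an assumption about what the "dc:" prefix stands for.
    static final String VOCAB = "http://mypage.org/vocab#";
    static final String RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#";
    static final String DC = "http://purl.org/dc/elements/1.1/";

    // Assumed line shape: <confidence> (<subject>; <relation>; <object>)
    private static final Pattern LINE =
            Pattern.compile("^(\\d*\\.?\\d+)\\s+\\((.*?);\\s*(.*?);\\s*(.*)\\)$");

    // One blank node per distinct subject text and per distinct object text.
    private final Map<String, String> subjectNodes = new LinkedHashMap<>();
    private final Map<String, String> objectNodes = new LinkedHashMap<>();

    private static String nodeFor(Map<String, String> nodes, String prefix, String text) {
        String id = nodes.get(text);
        if (id == null) {
            id = "_:" + prefix + nodes.size();
            nodes.put(text, id);
        }
        return id;
    }

    // Quotes a string as an N-Triples literal, escaping the characters that must be escaped.
    static String literal(String s) {
        return "\"" + s.replace("\\", "\\\\").replace("\"", "\\\"")
                .replace("\n", "\\n").replace("\r", "\\r") + "\"";
    }

    // Illustrative choice: build the predicate URI from the relation text.
    static String predicateUri(String relation) {
        return "<" + VOCAB + relation.trim().replaceAll("[^A-Za-z0-9]+", "_") + ">";
    }

    // Returns the six triples from the steps above, or an empty list if the
    // line does not match or its confidence is below the threshold.
    List<String> convert(String line, double minConfidence, String source) {
        List<String> out = new ArrayList<>();
        Matcher m = LINE.matcher(line.trim());
        if (!m.matches() || Double.parseDouble(m.group(1)) < minConfidence) {
            return out;
        }
        String subject = m.group(2).trim();
        String object = m.group(4).trim();
        String s = nodeFor(subjectNodes, "s", subject);
        String o = nodeFor(objectNodes, "o", object);
        String corpus = "<" + VOCAB + "Corpus>";
        out.add(s + " <" + RDF + "type> " + corpus + " .");
        out.add(s + " <" + DC + "title> " + literal(subject) + " .");
        out.add(s + " <" + DC + "source> " + literal(source) + " .");
        out.add(s + " <" + RDF + "predicate> " + predicateUri(m.group(3)) + " .");
        out.add(o + " <" + RDF + "type> " + corpus + " .");
        out.add(o + " <" + DC + "title> " + literal(object) + " .");
        return out;
    }

    public static void main(String[] args) {
        OpenIeToNTriples converter = new OpenIeToNTriples();
        String line = "0.93 (The rail launchers; are; conceptually similar to the underslung SM-1)";
        converter.convert(line, 0.85, "Sample File").forEach(System.out::println);
    }
}
```

Each returned string is one N-Triples line, so writing them to a .nt file one per line should give something an RDF library can load for further checking.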

Related

Export ReqIF from DOORS via DWA

My goal is to export a DOORS project to ReqIF using Java. How can I achieve that? I know it is possible to do it manually in the DOORS client, so I assume that there's a way to automate it as well.
Currently, the closest I've come is using DWA, OSLC and Lyo to export a single requirement into an XML string. However, that approach has massive problems: some fields and some information get lost.
What I want is a sort of pipeline where I can ensure that if I import a ReqIF File to DOORS, I can then export it and get the same thing back out again.
However, currently, what I observe happening is this:
ReqIF (Input) DOORS OSLC (Output)
------------------------------------------------------------------------------------------------
ReqIF.ForeignID ForeignID doorsAttribute:absoluteNumber
Absolute Number dcterms:identifier
AdditionalInfo AdditionalInfo m_property:attrDef-1003
Created By doorsAttribute:createdBy [One Value]
ReqIF.ForeignCreatedBy ForeignCreatedBy doorsAttribute:createdBy [A different value]
Created On doorsAttribute:createdOn [One Value]
ReqIF.ForeignCreatedOn ForeignCreatedOn doorsAttribute:createdOn [A different value]
dcterms:created
ReqIF.ForeignCreatedThru Created Thru doorsAttribute:createdThru
FollowupReference FollowupReference rm_property:attrDef-1006
FollowupReference2 FollowupReference2 rm_property:attrDef-1007
FunctionalApportionment FunctionalAppointment [LOST]
ImplementerName ImplementerName [LOST]
IsReq IsReq rm_property:attrDef-1015
ReqIF.ForeignModifiedBy ForeignModifiedBy doorsAttribute:modifiedBy [One Value]
Last Modified By doorsAttribute:modifiedBy [A different value]
ReqIF.ForeignModifiedOn ForeignModifiedOn doorsAttribute:modifiedOn
Last Modified On doorsAttribute:modifiedOn
rm_property:attrDef-8
dcterms:modified
ReqIF.ChapterName Object Heading doorsAttribute:objectHeading
Object Number
ReqIF.Name Object Short Text doorsAttribute:objectShortText
dcterms:description
ReqIF.Text Object Text rm:primaryText
Paragraph Style Paragraph Style rm_property:attrDef-1016
Picture [LOST]
PictureName [LOST]
PictureNum [LOST]
Rational Rational rm_property:attrDef-1017
Req-Id Req-Id rm_property:attrDef-1018
ReqSafetyLevel ReqSafetyLevel [LOST]
SourceReference SourceReference rm_property:attrDef-1020
SourceReference2 SourceReference2 rm_property:attrDef-1021
StageFrom StageFrom [LOST]
StageTo StageTo [LOST]
VerificationMethod VerificationMethod [LOST]
VerificationState VerificationState rm_property:attrDef-1025
VerifierComment VerifierComment rm_property:attrDef-1026
oslc:instanceShape
acp:accessControl
dcterms:contributor
oslc:serviceProvider
Obviously, this approach has some problems, most notably that certain values from the original ReqIF input never reach the OSLC output. There is also the issue that some fields in the output XML have the same name, and are thus indistinguishable, but contain different values. Here are some samples of the output:
<doorsAttribute:modifiedBy rdf:parseType="Literal"><div>ForeignModifiedBy: bcsthfr</div></doorsAttribute:modifiedBy>
<doorsAttribute:modifiedBy rdf:parseType="Literal">kira_resari</doorsAttribute:modifiedBy>
<doorsAttribute:modifiedOn rdf:datatype="http://www.w3.org/2001/XMLSchema#dateTime">2015-01-14</doorsAttribute:modifiedOn>
<doorsAttribute:modifiedOn rdf:datatype="http://www.w3.org/2001/XMLSchema#dateTime">2020-08-07</doorsAttribute:modifiedOn>
That's why I'm now looking for a different approach that would simply return the ReqIF to me as it was imported, either as a string or as a file. Is that possible, and if so, how can I do it?

How to prepare training data for OpenNLP to tokenize tokens that contain more than one word?

In some languages (for example, Vietnamese), a single vocabulary item can consist of multiple words, so tokens that contain more than one word cannot be found by splitting on whitespace alone.
I have the following input:
Người dân địa phương đã nhiều lần báo Điện lực Bến Tre nhưng chưa được giải quyết .
Expected output:
["Người dân", "địa phương", "đã", "nhiều", "lần", "báo", "Điện lực", "Bến Tre", "nhưng", "chưa", "được", "giải quyết"]
In my training data, an underscore (_) connects the words that need to stay together in one token:
Người_dân địa_phương đã nhiều lần báo Điện_lực Bến_Tre nhưng chưa được giải_quyết .
Here is the command line I use to train:
opennlp TokenizerTrainer -model "model/vi-token.bin" -alphaNumOpt 1 -lang "vi" -data "data/merge_vlsp_removehtml" -encoding "UTF-8" -params param/wordseg.param
with param
Iterations=1000
However, the output does not join multiple words into one token; it is still split on whitespace.
This is the command I run to get the output:
opennlp TokenizerME model/vi-token.bin < sample/sample_text > sample/sample_text.out
What should I change in the training data or the config params to train the tokenizer to produce tokens with multiple words?
Rather than using underscores in the training data, use tags. OpenNLP uses tags as the reference for training. Follow the instructions given for NER when training your tokenizer.
OpenNLP provides the TokenizerTrainer tool to train a model from data. The OpenNLP format contains one sentence per line. Tokens are separated either by whitespace or by a special <SPLIT> tag.
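To illustrate the line format described above, here is a small sketch that builds one training line from a raw sentence and its tokens. It puts a <SPLIT> tag between two tokens that touch in the original text and keeps a space where the text already had whitespace. The class and method names are made up for illustration, and the sketch assumes the tokens appear in order and cover all non-whitespace text. It only shows the format; it is not part of OpenNLP.

```java
import java.util.Arrays;
import java.util.List;

final class SplitTagLine {
    // Builds one tokenizer training line: whitespace where the sentence has
    // whitespace between tokens, <SPLIT> where two tokens are adjacent.
    static String annotate(String sentence, List<String> tokens) {
        StringBuilder out = new StringBuilder();
        int pos = 0; // position in the sentence just after the previous token
        for (String token : tokens) {
            int start = sentence.indexOf(token, pos);
            if (start < 0) {
                throw new IllegalArgumentException("token not found in order: " + token);
            }
            if (out.length() > 0) {
                out.append(start == pos ? "<SPLIT>" : " ");
            }
            out.append(token);
            pos = start + token.length();
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String sentence = "It works, mostly.";
        List<String> tokens = Arrays.asList("It", "works", ",", "mostly", ".");
        System.out.println(annotate(sentence, tokens));
    }
}
```

One such line per sentence, saved as UTF-8, would be the kind of file passed to the trainer with -data.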
You can follow this blog for a head start with OpenNLP for various purposes. The post shows how to create a training file and build a new model.
You can easily create your own training data set using the modelbuilder addon, and follow the rules mentioned here to create a good NER model.
You can find some help on using the modelbuilder addon here.
Basically, you put all the text in one file and the NER entities in another. The addon searches for each entity and replaces it with the required tag, thereby producing the tagged data. The tool should be fairly easy to use.
Also, follow markg's answer to get an understanding of how to create new models on your own. This will help you build models that are customized for your applications.
Hope this helps!

PMML model created from xgboost in R gives different results than the original model in R

I have a ranking task, where my training data looks like this:
session_id item_id item_features target
---------------------------------------------
session1 item1 ... 1
session1 item2 ... 0
...
sessionN item1 ... 0
sessionN itemX ... 10
sessionN itemY ... 0
...
I am using xgboost in R with the objective "rank:pairwise" for training the model. xgboost expects grouped data (same session_id) to be bunched together in the training and test sets. The lines belonging to the same session_id have to be specified using the function setinfo() (e.g. setinfo(model, 'group', group_info)).
When I evaluate the model in R, applying new data works perfectly. However, I have used the package pmml to convert the model into a pmml file in order to use it in Java.
In Java the PMML file is parsed and evaluated via the org.jpmml pmml-evaluator dependency (v. 1.3.15). Feeding the same data as in R to the org.jpmml.evaluator.Evaluator yields different results, though. The results are mostly negative values, which are not valid in my setup: all predicted targets should be positive.
I have come up with two possible explanations:
There might be a bug in the pmml conversion in my scenario
I have no idea where I can apply the equivalent of setinfo() in Java. Since I am only applying the model to a single session at a time, I was under the impression that I did not need to specify it. But maybe I was wrong.
Please contact me for a fully working example including training and test data; I will send it by mail. For starters, here is the R code for training the model:
library(xgboost)
example_matrix_train <- xgb.DMatrix(X, label = y)
setinfo(example_matrix_train, 'group', example_train_groupInfo)
example.model <- xgboost(data = example_matrix_train, objective = "rank:pairwise", max.depth = 8, eta = 0.2, nthread = 8, nround = 10, verbose=0)
library(pmml)
library(pmmlTransformations)
xgb.dump(example.model, "example.model.dumped.trees")
logfile <- file(paste0("pmml_example_model",Sys.Date(),".txt"), open="a")
sink(logfile)
pmml(example.model, inputFeatureNames = colnames(example_train), outputLabelName = "prediction1", xgbDumpFile = "example.model.dumped.trees")
sink()
Any help is welcome
I have come up with two possible explanations: There might be a bug in the pmml conversion
This is the true explanation: the pmml package produces incorrect PMML for XGBoost models. The technical reason is that it uses the XGBoost text dump file as input, and the information contained in that file is incomplete (e.g. rounded threshold values).
If you're looking to export XGBoost models to PMML, then you should use the r2pmml package, which uses XGBoost binary files as input.
In truth, the 'pmml' package currently does not support the 'rank:pairwise' objective function you need. The upcoming release of the 'pmml' package (version 1.5.3) includes a check for unsupported objective functions.

Java - xgboost DMatrix input

When creating a DMatrix in Java with the xgboost4j package, I first succeed in creating the matrix from a file path:
DMatrix trainMat = new DMatrix("...\\xgb_training_input.csv");
But when I try to train the model:
Booster booster = XGBoost.train(trainMat, params, round, watches, null, null);
I get the following error:
...regression_obj.cc:108: label must be in [0,1] for logistic regression
My data is sound; I have checked it with an XGBoost model built in Python.
I'm guessing the problem is with the data format somehow.
Currently the format is as follows:
x1,x2,x3,x4,x5,y
where x1-x5 are real numbers and y is either 0 or 1. The file extension is .csv.
Maybe the separator shouldn't be ','?
DMatrix expects a file in LibSVM format, which can easily be created with Python.
A LibSVM line looks like this:
target 0:column1 1:column2 2:column3 ... and so on
So the target is the first column, and every other column (predictor) is written as an increasing index, then ":", then the value.
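Since the question is about Java, the same conversion can also be sketched there. This is a minimal sketch assuming the CSV layout from the question (features first, label in the last column, comma-separated, no quoted fields) and the 0-based feature indexes shown above. The class name and file arguments are made up for illustration. Rows whose last cell is not a number, such as a header line, are skipped.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

final class CsvToLibSvm {
    // Converts "x1,x2,...,xn,y" into "y 0:x1 1:x2 ... (n-1):xn".
    static String toLibSvm(String csvRow) {
        String[] cells = csvRow.trim().split(",");
        StringBuilder sb = new StringBuilder(cells[cells.length - 1].trim());
        for (int i = 0; i < cells.length - 1; i++) {
            sb.append(' ').append(i).append(':').append(cells[i].trim());
        }
        return sb.toString();
    }

    // A row counts as data if its last cell (the label) parses as a number.
    static boolean isData(String csvRow) {
        String[] cells = csvRow.trim().split(",");
        try {
            Double.parseDouble(cells[cells.length - 1].trim());
            return true;
        } catch (NumberFormatException e) {
            return false;
        }
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 2) {
            System.err.println("usage: CsvToLibSvm <input.csv> <output.libsvm>");
            return;
        }
        List<String> out = new ArrayList<>();
        for (String row : Files.readAllLines(Paths.get(args[0]), StandardCharsets.UTF_8)) {
            if (isData(row)) {
                out.add(toLibSvm(row));
            }
        }
        Files.write(Paths.get(args[1]), out, StandardCharsets.UTF_8);
    }
}
```

The path of the written file can then be passed to the DMatrix constructor in place of the CSV path.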

jcrfsuite training file format

From what I understand of the POS tagging example that comes with jcrfsuite, the training file is tab-separated and the first token is the label. But I do not understand the BigCluster| part. Can somebody help me with how to specify tokens in the training file?
Example below:
O BigCluster|00 BigCluster|0000 BigCluster|000000 BigCluster|00000000 BigCluster|0000000000 BigCluster|000000000000 BigCluster|00000000000000 BigCluster|0000000000000000 NextBigCluster|0100 NextBigCluster|01000101 NextBigCluster|010001011111 POSTagDict|D POSTagDict|N POSTagDict|^ POSTagDict|$ POSTagDict|G NextPOSTag|V 1gramSuff|i 1gramPref|i prevword| prevcurr||i nextword|predict nextword|predict currnext|i|predict Word|I Lower|i Xxdshape|X charclass|1, first-shortcap prevnext||predict t=0
Test file format:
! BigCluster|01 BigCluster|0110 BigCluster|011011 BigCluster|01101100 BigCluster|0110110011 BigCluster|011011001100 BigCluster|01101100110000 BigCluster|0110110011000000 NextBigCluster|1000 NextBigCluster|10001000 NextBigCluster|100010000000 POSTagDict|V NextPOSTag|, metaph_POSDict|N 1gramSuff|n 2gramSuff|nn 3gramSuff|mnn 4gramSuff|mmnn 5gramSuff|mmmnn 6gramSuff|ammmnn 7gramSuff|aammmnn 8gramSuff|aaammmnn 9gramSuff|daaammmnn 1gramPref|d 2gramPref|da 3gramPref|daa 4gramPref|daaa 5gramPref|daaam 6gramPref|daaamm 7gramPref|daaammm 8gramPref|daaammmn 9gramPref|daaammmnn prevword| prevcurr||daaammmnn nextword|. nextword|. currnext|daaammmnn|. Word|Daaammmnn Lower|daaammmnn Xxdshape|Xxxxxxxxx charclass|1,2,2,2,2,2,2,2,2, first-initcap prevnext||. t=0
What is specified after the label is a list of feature names and feature values.
It is a sparse representation rather than a tabular one.
BigCluster is just one of the features, and it is relevant to that specific example only. You should create your own features if you are training from scratch.
I have noticed that CRFsuite does not care about the naming convention or the feature design of labels and attributes, because it treats them as strings.
CRFsuite learns weights of associations (feature weights) between attributes and labels without knowing the meaning of the labels and attributes. In other words, you can design and use arbitrary features just by writing label and attribute names in the data sets. Find the best possible attributes for your example, run some experiments with different sets of attributes and features, and you will be good to go.
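As a rough illustration of writing your own attributes, the sketch below builds one tab-separated line per token in the same "name|value" style as the example in the question. The class name and the particular attributes chosen (Word, Lower, prevword, nextword, 1gramSuff) are only illustrative picks modelled on that example, not a recommended feature set.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

final class CrfFeatureLines {
    // Builds "label<TAB>attr1<TAB>attr2..." for the token at position i.
    // Attribute names are arbitrary strings; CRFsuite does not interpret them.
    static String featureLine(String label, String[] tokens, int i) {
        String word = tokens[i];
        String lower = word.toLowerCase(Locale.ROOT);
        String prev = i > 0 ? tokens[i - 1].toLowerCase(Locale.ROOT) : "";
        String next = i + 1 < tokens.length ? tokens[i + 1].toLowerCase(Locale.ROOT) : "";

        List<String> fields = new ArrayList<>();
        fields.add(label);
        fields.add("Word|" + word);
        fields.add("Lower|" + lower);
        fields.add("prevword|" + prev);
        fields.add("nextword|" + next);
        if (!lower.isEmpty()) {
            fields.add("1gramSuff|" + lower.substring(lower.length() - 1));
        }
        return String.join("\t", fields);
    }

    public static void main(String[] args) {
        String[] tokens = {"I", "predict", "rain"};
        String[] labels = {"O", "V", "N"}; // made-up labels for the demo
        for (int i = 0; i < tokens.length; i++) {
            System.out.println(featureLine(labels[i], tokens, i));
        }
    }
}
```

For a test file the same lines can be produced, with whatever label is known or a placeholder in the first column.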
