Parse a single POJO from multiple YAML documents representing different classes - java

I want to use a single YAML file which contains several different objects, for different applications. I need to fetch one object to get an instance of MyClass1, ignoring the other documents for MyClass2, MyClass3, etc. Some sort of selective de-serializing: now this class, then that one... The structure of MyClass2 and MyClass3 is totally unknown to the application working with MyClass1. The file is always valid YAML, of course.
The YAML may have whatever structure we need to implement such a multi-class container. The preferred parsing tool is snakeyaml.
Is it sensible? How can I ignore all but one object?
UPD: replaced all "document" with "object". I think we have to speak about a single YAML document containing several objects of different structure. Moreover, the parser knows exactly one structure and wants to ignore the rest.
UPD2: I think it is impossible with snakeyaml. We have to read all objects anyway, and select the needed one later. But maybe I'm wrong.
UPD3: sample config file
---
-
  exportConfiguration781:
    attachmentFieldName: "name"
    baseSftpInboxPath: /home/user/somedir/
    somebool: false
    days: 9999
    expected:
      - ABC w/o quotes
      - "Cat ABC"
      - "Some string"
    dateFormat: yyyy-MMdd-HHmm
    user: someuser
-
  anotherConfiguration:
    k1: v1
    k2:
      - v21
      - v22

This is definitely possible with SnakeYAML, albeit not trivial. Here's a general rundown of what you need to do:
First, let's have a look at what loading with SnakeYAML does. Here's the important part of the YAML class:
private Object loadFromReader(StreamReader sreader, Class<?> type) {
    Composer composer = new Composer(new ParserImpl(sreader), resolver, loadingConfig);
    constructor.setComposer(composer);
    return constructor.getSingleData(type);
}
The composer parses YAML input into Nodes. To do that, it doesn't need any knowledge about the structure of your classes, since every node is either a ScalarNode, a SequenceNode or a MappingNode and they just represent the YAML structure.
The constructor takes a root node generated by the composer and generates native POJOs from it. So what you want to do is to throw away parts of the node graph before they reach the constructor.
The easiest way to do that is probably to derive from Composer and override two methods like this:
public class MyComposer extends Composer {
    private final int objIndex;

    public MyComposer(Parser parser, Resolver resolver, int objIndex) {
        super(parser, resolver);
        this.objIndex = objIndex;
    }

    public MyComposer(Parser parser, Resolver resolver, LoaderOptions loadingConfig, int objIndex) {
        super(parser, resolver, loadingConfig);
        this.objIndex = objIndex;
    }

    @Override
    public Node getNode() {
        return strip(super.getNode());
    }

    private Node strip(Node input) {
        return ((SequenceNode) input).getValue().get(objIndex);
    }
}
The strip implementation is just an example. In this case, I assumed your YAML looks like this (object content is arbitrary):
- {first: obj}
- {second: obj}
- {third: obj}
And you simply select the object you actually want to deserialize by its index in the sequence. But you can also have something more complex like a searching algorithm.
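For instance, a key-based variant of strip could search the top-level sequence for the mapping that contains a given key. This is only a sketch, assuming the list-of-single-key-mappings layout from the sample config; the node classes come from org.yaml.snakeyaml.nodes:

// A sketch: select the sequence element whose mapping contains the given key.
private Node strip(Node input, String key) {
    for (Node item : ((SequenceNode) input).getValue()) {
        for (NodeTuple tuple : ((MappingNode) item).getValue()) {
            Node keyNode = tuple.getKeyNode();
            if (keyNode instanceof ScalarNode
                    && key.equals(((ScalarNode) keyNode).getValue())) {
                return item; // or tuple.getValueNode(), depending on the target class
            }
        }
    }
    throw new IllegalArgumentException("no object found for key: " + key);
}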
Now that you have your own composer, you can do
// sreader is a StreamReader over your YAML input
Constructor constructor = new Constructor();
// assuming we want to get the object at index 1 (i.e. the second object)
Composer composer = new MyComposer(new ParserImpl(sreader), new Resolver(), 1);
constructor.setComposer(composer);
MyObject result = (MyObject) constructor.getSingleData(MyObject.class);

The answer of @flyx was very helpful for me, opening the way to work around the library's limitations (in our case, snakeyaml's) by overriding some methods. Thanks a lot! It's quite possible a final solution lies in that direction, but not for now. Besides, the simple solution below is robust and should be considered even if we had found a complete library-intruding solution.
I've decided to solve the task by double distilling, sorry, processing the configuration file. Imagine the latter consisting of several parts, where every part is marked by a unique token-delimiter. For the sake of keeping the YAML-likeness, it may be:
---
#this is a unique key for configuration A
<some YAML document>
---
#this is another key for configuration B
<some YAML document>
The first pass is pre-processing. For the given String fileString and String key (and DELIMITER = "\n---\n", for example), we select the substring with the key-defined configuration:
int begIndex;
do {
    begIndex = fileString.indexOf(DELIMITER);
    if (begIndex == -1) {
        break;
    }
    if (fileString.startsWith(DELIMITER + key, begIndex)) {
        fileString = fileString.substring(begIndex + DELIMITER.length() + key.length());
        break;
    }
    // spoil the alien delimiter and repeat the search
    fileString = fileString.replaceFirst(DELIMITER, " ");
} while (true);
int endIndex = fileString.indexOf(DELIMITER);
if (endIndex != -1) {
    fileString = fileString.substring(0, endIndex);
}
Now we feed fileString to plain YAML parsing:
ExportConfiguration configuration = new Yaml(new Constructor(ExportConfiguration.class))
        .loadAs(fileString, ExportConfiguration.class);
This time we have a single document that must correspond to the ExportConfiguration class.
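For reference, a sketch of what ExportConfiguration might look like for the sample config above; the field types are guesses from the sample values:

import java.util.List;

// A sketch only: field types guessed from the sample config above.
public class ExportConfiguration {
    private String attachmentFieldName;
    private String baseSftpInboxPath;
    private boolean somebool;
    private int days;
    private List<String> expected;
    private String dateFormat;
    private String user;
    // SnakeYAML needs a public no-arg constructor plus getters and setters (omitted).
}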
Note 1: The structure and even the content of the rest of the configuration file play absolutely no role. This was the main idea: to keep independent configurations in a single file.
Note 2: The remaining configurations may be JSON or XML or whatever. We have a method-preprocessor that returns a String configuration, and the next processor parses it properly.

Related

XMLUnit - compare xml and ignore few tags based on a condition

I have a couple of XML files which need to be compared with a different set of similar XML files, and while comparing I need to ignore tags based on a condition. For example:
personal.xml - ignore fullname
address.xml - ignore zipcode
contact.xml - ignore homephone
Here is the code:
Diff documentDiff = DiffBuilder
    .compare(actualxmlfile)
    .withTest(expectedxmlfile)
    .withNodeFilter(node -> !node.getNodeName().equals("FullName"))
    .ignoreWhitespace()
    .build();
How can I add conditions at .withNodeFilter(node -> !node.getNodeName().equals("FullName")), or is there a smarter way to do this?
You can join multiple conditions together using "and" (&&):
private static void doDemo1(File actual, File expected) {
    Diff docDiff = DiffBuilder
        .compare(actual)
        .withTest(expected)
        .withNodeFilter(
            node -> !node.getNodeName().equals("FullName")
                 && !node.getNodeName().equals("ZipCode")
                 && !node.getNodeName().equals("HomePhone")
        )
        .ignoreWhitespace()
        .build();
    System.out.println(docDiff.toString());
}
If you want to keep your builder tidy, you can move the node filter to a separate method:
private static void doDemo2(File actual, File expected) {
    Diff docDiff = DiffBuilder
        .compare(actual)
        .withTest(expected)
        .withNodeFilter(node -> testNode(node))
        .ignoreWhitespace()
        .build();
    System.out.println(docDiff.toString());
}

private static boolean testNode(Node node) {
    return !node.getNodeName().equals("FullName")
        && !node.getNodeName().equals("ZipCode")
        && !node.getNodeName().equals("HomePhone");
}
The risk with this is that you may have element names which appear in more than one type of file, where that node needs to be filtered from one type of file but not the others.
In this case, you would also need to take into account the type of file you are handling. For example, you can use the file names (if they follow a suitable naming convention) or use the root elements (assuming they are different) - such as <Personal>, <Address>, <Contact> - or whatever they are in your case.
However, if you need to distinguish between XML file types for this reason, you may be better off using that information to create separate DiffBuilder objects with different filters. That may result in clearer code, for instance along these lines:
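A sketch, assuming the files can be told apart by name; the file names and the helper method are assumptions, not part of the XMLUnit API:

import java.io.File;
import java.util.function.Predicate;
import org.w3c.dom.Node;

// A sketch: choose the node filter by file name (naming convention assumed).
private static Predicate<Node> filterFor(File file) {
    switch (file.getName()) {
        case "personal.xml": return node -> !node.getNodeName().equals("FullName");
        case "address.xml":  return node -> !node.getNodeName().equals("ZipCode");
        case "contact.xml":  return node -> !node.getNodeName().equals("HomePhone");
        default:             return node -> true;
    }
}
// Usage in the builder chain: .withNodeFilter(filterFor(actual))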
I had provided a separate method in the link below for !node.getNodeName().equals("FullName") (which you are using in your code). Using that separate method, you can just pass the array of node names you want to ignore and see the results. And in case you wish to add any other conditions based on your requirements, you can play around in that method.
https://stackoverflow.com/a/68099435/13451711
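Roughly, the idea from that answer as a sketch; the helper name ignoring is hypothetical:

import java.util.Arrays;
import java.util.function.Predicate;
import org.w3c.dom.Node;

// A sketch: build a single filter from an array of node names to ignore.
private static Predicate<Node> ignoring(String... names) {
    return node -> Arrays.stream(names)
            .noneMatch(name -> name.equals(node.getNodeName()));
}
// Usage: .withNodeFilter(ignoring("FullName", "ZipCode", "HomePhone"))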

Stanford Core NLP: Entity type non deterministic

I built a Java parser using Stanford CoreNLP. I am finding an issue in getting consistent results with the CoreNLP object: I get different entity types for the same input text. It seems like a bug in CoreNLP to me. I wonder whether any StanfordNLP users have encountered this issue and found a workaround. This is my service class, which I am instantiating and reusing.
class StanfordNLPService {
    //private static final Logger logger = LogConfiguration.getInstance().getLogger(StanfordNLPServer.class.getName());

    private StanfordCoreNLP nerPipeline;

    /*
    Initialize the nlp instances for ner and sentiments.
    */
    public void init() {
        Properties nerAnnotators = new Properties();
        nerAnnotators.put("annotators", "tokenize,ssplit,pos,lemma,ner");
        nerPipeline = new StanfordCoreNLP(nerAnnotators);
    }

    /**
     * @param text Text from which entities are to be extracted.
     */
    public void printEntities(String text) {
        // boolean tracking = PerformanceMonitor.start("StanfordNLPServer.getEntities");
        try {
            // Properties nerAnnotators = new Properties();
            // nerAnnotators.put("annotators", "tokenize,ssplit,pos,lemma,ner");
            // nerPipeline = new StanfordCoreNLP(nerAnnotators);
            Annotation document = nerPipeline.process(text);
            // a CoreMap is essentially a Map that uses class objects as keys and has values with custom types
            List<CoreMap> sentences = document.get(CoreAnnotations.SentencesAnnotation.class);
            for (CoreMap sentence : sentences) {
                for (CoreLabel token : sentence.get(CoreAnnotations.TokensAnnotation.class)) {
                    // Get the entity type and offset information needed.
                    String currEntityType = token.get(CoreAnnotations.NamedEntityTagAnnotation.class); // NER type
                    int currStart = token.get(CoreAnnotations.CharacterOffsetBeginAnnotation.class);   // token offset_start
                    int currEnd = token.get(CoreAnnotations.CharacterOffsetEndAnnotation.class);       // token offset_end
                    String currPos = token.get(CoreAnnotations.PartOfSpeechAnnotation.class);          // POS type
                    System.out.println("(Type:value:offset)\t" + currEntityType + ":\t" + text.substring(currStart, currEnd) + "\t" + currStart);
                }
            }
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}
Discrepancy result: the entity type changed from MISC to O relative to the initial run.
Iteration 1:
(Type:value:offset) MISC: Appropriate 100
(Type:value:offset) MISC: Time 112
Iteration 2:
(Type:value:offset) O: Appropriate 100
(Type:value:offset) O: Time 112
Here is the answer from the NER FAQ:
http://nlp.stanford.edu/software/crf-faq.shtml
Is the NER deterministic? Why do the results change for the same data?
Yes, the underlying CRF is deterministic. If you apply the NER to the same sentence more than once, though, it is possible to get different answers the second time. The reason for this is the NER remembers whether it has seen a word in lowercase form before.
The exact way this is used as a feature is in the word shape feature, which treats words such as "Brown" differently if it has or has not seen "brown" as a lowercase word before. If it has, the word shape will be "Initial upper, have seen all lowercase", and if it has not, the word shape will be "Initial upper, have not seen all lowercase".
This feature can be turned off in recent versions with the flag -useKnownLCWords false
I've looked over the code some, and here is a possible way to resolve this:
What you could do to solve this is load each of the 3 serialized CRFs with useKnownLCWords set to false, and serialize them again. Then supply the new serialized CRFs to your StanfordCoreNLP.
Here is a command for loading a serialized CRF with useKnownLCWords set to false, and then dumping it again:
java -mx600m -cp "*:." edu.stanford.nlp.ie.crf.CRFClassifier -loadClassifier classifiers/english.all.3class.distsim.crf.ser.gz -useKnownLCWords false -serializeTo classifiers/new.english.all.3class.distsim.crf.ser.gz
Use whatever names you want, obviously! This command assumes you are in stanford-corenlp-full-2015-04-20/ and have a directory classifiers with the serialized CRFs. Change as appropriate for your setup.
This command should load the serialized CRF, override with the useKnownLCWords set to false, and then re-dump the CRF to new.english.all.3class.distsim.crf.ser.gz
Then in your original code:
nerAnnotators.put("ner.model","comma-separated-list-of-paths-to-new-serialized-crfs");
Please let me know if this works or if it's not working, and I can look more deeply into this!
After doing some research, I found the issue is in the ClassifierCombiner.classify() method. One of the baseClassifiers, edu/stanford/nlp/models/ner/english.conll.4class.distsim.crf.ser.gz, loaded by default, returns a different type on some occasions. I am trying to load only the first model to resolve this issue.
The problem is in the following area of the code:
CRFClassifier.classifyMaxEnt()
int[] bestSequence = tagInference.bestSequence(model); // line 1249
ExactBestSequenceFinder.bestSequence() is returning a different sequence for the above model for the same input when called multiple times.
Not sure if this needs code fix or some configuration changes to the model. Any additional insight is appreciated.

How to extract one boolean field from XML?

I have a model in XML format, as shown below, and I need to parse the XML and check whether it has the internal-flag flag set to true or not. In my other models, it is possible that the internal-flag flag is set to false. And sometimes this field won't be there at all, so by default it will be false in my code.
<?xml version="1.0"?>
<ClientMetadata
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:schemaLocation="http://www.google.com client.xsd"
    xmlns="http://www.google.com">
    <client id="200" version="13">
        <name>hello world</name>
        <description>hello hello</description>
        <organization>TESTER</organization>
        <author>david</author>
        <internal-flag>true</internal-flag>
        <clock>
            <clock>
                <for>
                    <init>val(tmp1) = 1</init>
                    <clock>
                        <eval><![CDATA[result("," + $convert(val(tmp1)))]]></eval>
                    </clock>
                </for>
                <for>
                    <incr>val(tmp1) -= 1</incr>
                    <clock>
                        <eval><![CDATA[result("," + $convert(val(tmp1)))]]></eval>
                    </clock>
                </for>
            </clock>
        </clock>
    </client>
</ClientMetadata>
I have a POJO in which I store the above model:
public class ModelMetadata {
    private int modelId;
    private String modelValue; // this string will have my above XML data as a string

    // setters and getters here
}
Now what is the best way to determine whether my model has internal-flag set to true or not?
// this list will have all my models stored
List<ModelMetadata> metadata = getModelMetadata();
for (ModelMetadata model : metadata) {
    // my model will be stored in the below variable in XML format
    String modelValue = model.getModelValue();
    // now parse the modelValue variable and extract the `internal-flag` field property
}
Do I need to use XML parsing for this, or is there a better way to do it?
Update:
I have started using StAX, and this is what I have tried so far, but I am not sure how to extract that field:
InputStream is = new ByteArrayInputStream(modelValue.getBytes());
XMLStreamReader r = XMLInputFactory.newInstance().createXMLStreamReader(is);
while (r.hasNext()) {
    // now what should I do here?
}
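For reference, one way the loop could be finished; a sketch only, which advances the reader and reads the text of the internal-flag element from the XML above (XMLStreamConstants is from javax.xml.stream):

// A sketch: advance the reader and pick up the internal-flag element's text.
boolean internalFlag = false;
while (r.hasNext()) {
    int event = r.next();
    if (event == XMLStreamConstants.START_ELEMENT
            && "internal-flag".equals(r.getLocalName())) {
        internalFlag = Boolean.parseBoolean(r.getElementText());
        break;
    }
}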
There is an easy solution using XMLBeam (disclosure: I'm affiliated with that project), just a few lines:
public class ReadBoolean {
    public interface ClientMetaData {
        @XBRead("//xbdefaultns:internal-flag")
        boolean hasFlag();
    }

    public static void main(String[] args) throws IOException {
        ClientMetaData clientMetaData = new XBProjector().io().url("res://xmlWithBoolean.xml").read(ClientMetaData.class);
        System.out.println("Has flag:" + clientMetaData.hasFlag());
    }
}
This program prints out
Has flag:true
for your XML.
You could also do some simple string parsing, but this will only work for small cases with proper XML, and only if there is a single <internal-flag> element.
This is a simple solution to your problem without using any XML parsing utilities. Other solutions may be more robust or powerful.
1. Find the index of the string literal <internal-flag>. If it doesn't exist, return false.
2. Go forward "<internal-flag>".length() (15) characters and read up to the next </internal-flag>; the text in between should be the string true or false.
3. Take that string and use Boolean.parseBoolean(String) to get a boolean value.
If you want me to help you out with the code just drop a comment!
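A minimal sketch of those steps, assuming well-formed XML with a single <internal-flag> element (the method name is hypothetical):

// A sketch of the string-parsing approach described above.
static boolean extractInternalFlag(String xml) {
    final String open = "<internal-flag>";
    int start = xml.indexOf(open);
    if (start == -1) {
        return false; // field absent: default to false
    }
    int from = start + open.length();
    int end = xml.indexOf("</internal-flag>", from);
    return Boolean.parseBoolean(xml.substring(from, end));
}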
If you are willing to consider adding Groovy to your mix (e.g. see the book Making Java Groovy), then using a Groovy XmlParser and associated classes will make this simple.
If you need to stick to Java, let me put in a shameless plug for my Xen library, which mimics a lot of the "Groovy way". The answer to your question would be:
Xen doc = new XenParser().parseText(YOUR_XML_STRING);
String internalFlag = doc.getText(".client.internal-flag");
boolean isSet = "true".equals(internalFlag);
If the XML comes from a File, Stream, or URI, that can be handled too.
Caveat emptor: even though it is free, this is a fairly new library, written solely by a random person (me), and not thoroughly tested on all the crazy XML out there. If anybody knows of a similar, more "mainstream" library, I'd be very interested in hearing about it.

Retrieve Data Properties of An Individual using OWL API in eclipse

I want to retrieve all data properties set for an individual of any class using the OWL API. The code I have used is:
OWLNamedIndividual inputNoun = df.getOWLNamedIndividual(IRI.create(prefix + "Cow"));
for (OWLDataProperty prop : inputNoun.getDataPropertiesInSignature()) {
    System.out.println("the properties for Cow are " + prop); // line 1
}
This code compiles successfully, but line 1 prints nothing at all. What would be the correct syntax? I have googled thoroughly and couldn't find anything useful.
OWLNamedIndividual::getDataPropertiesInSignature() does not return the properties for which the individual has a filler, it returns the properties that appear in the object itself. For an individual this is usually empty. The method is on the OWLObject interface, which covers things like class and property expressions and ontologies, for which it has a more useful output.
If you want the data properties with an actual filler for an individual, use OWLOntology::getDataPropertyAssertionAxioms(OWLIndividual), like this:
OWLNamedIndividual input = ...
Set<OWLDataPropertyAssertionAxiom> properties = ontology.getDataPropertyAssertionAxioms(input);
for (OWLDataPropertyAssertionAxiom ax : properties) {
    System.out.println(ax.getProperty());
}
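For context, a minimal sketch of how the surrounding objects might be wired up (the ontology file name is hypothetical; prefix is from the question), which also shows reading the asserted value with getObject():

// A sketch: load the ontology and print property/value pairs for an individual.
OWLOntologyManager manager = OWLManager.createOWLOntologyManager();
OWLOntology ontology = manager.loadOntologyFromOntologyDocument(new File("animals.owl"));
OWLDataFactory df = manager.getOWLDataFactory();
OWLNamedIndividual input = df.getOWLNamedIndividual(IRI.create(prefix + "Cow"));
for (OWLDataPropertyAssertionAxiom ax : ontology.getDataPropertyAssertionAxioms(input)) {
    System.out.println(ax.getProperty() + " = " + ax.getObject());
}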

Convert a list of URLs to a Tree

Just to be sure I'm not reinventing the wheel, I want to see if there is some known algorithm, class, or something that can help me solve my problem. I have a huge list of URLs from an application. I'd like to feed those URLs into a tree to create a sitemap-like data structure.
It seems that something like this may have been done before. However, everything I see from my searches does it from XML to a tree. Ideally I'd like to have an answer in Java, but I'm sure I could translate it to Java myself if necessary. If I need to do it myself, I'd probably take each URL and break it into indexes:
[root]       [0]     [1]    [2] - file
www.site.com/dir1   /dir2  /file.html
www.site.com/dirabc /dir2  /file.html
So, I'd parse each URL into offsets [0], [1], [2], etc., and those offsets would be the depths in the tree at which to add the segments. That was at least my initial plan. I'm open to any and all suggestions!
You could define your UrlTree as nested HashMaps
public class UrlTree {
    private final Map<String, UrlTree> branches = new HashMap<String, UrlTree>();

    public void add(String[] tokens, int i) {
        if (i >= tokens.length) {
            return;
        }
        final String token = tokens[i];
        UrlTree branch = branches.get(token);
        if (branch == null) {
            branch = new UrlTree();
            branches.put(token, branch);
        }
        branch.add(tokens, i + 1);
    }

    ...
}
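Usage might look like this (a sketch; splitting on "/" makes the host the first token under the root):

// A sketch: build the tree by splitting each URL into its segments.
UrlTree tree = new UrlTree();
String[] urls = {
    "www.site.com/dir1/dir2/file.html",
    "www.site.com/dirabc/dir2/file.html"
};
for (String url : urls) {
    tree.add(url.split("/"), 0);
}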
You'll need to implement TreeModel in a way that reflects the hierarchy of your observed directory structure. FileTreeModel is an example, and ac.Name is a simple class that parses paths for a vintage file system. See also How to Use Trees. An instance of NetBeans Outline, illustrated here, would make a nice alternative view.
