I wanted to do some processing on values in a YAML file. Someone suggested that I use SnakeYAML's low-level API for this purpose. I've written some code using it, but I'm pretty much stuck for the following reasons.
Here's the code I've written:
import java.io.FileReader;

import org.yaml.snakeyaml.Yaml;
import org.yaml.snakeyaml.nodes.Node;

public static void main(String[] args) throws Exception {
    Yaml yaml = new Yaml();
    FileReader contentFromFile = new FileReader("/Users/prakash.tiwari/Desktop/yamlInput.yml");
    for (Node node : yaml.composeAll(contentFromFile)) {
        System.out.println(node);
    }
}
Here is my yamlInput.yml:
Prakash:
  Lucky:
    Number: 11
Really? : NotAtAll
Here's what was printed on the console (it was on a single line; I formatted it and added a comment to make it readable):
<org.yaml.snakeyaml.nodes.MappingNode
(
tag=tag:yaml.org,2002:map,
values=
{
key=<org.yaml.snakeyaml.nodes.ScalarNode (tag=tag:yaml.org,2002:str, value=Prakash)>;
value=355629945 // Why is this a garbage value?
}
{
key=<org.yaml.snakeyaml.nodes.ScalarNode (tag=tag:yaml.org,2002:str, value=Really?)>;
value=
<NodeTuple
keyNode=<org.yaml.snakeyaml.nodes.ScalarNode (tag=tag:yaml.org,2002:str, value=Really?)>;
valueNode=<org.yaml.snakeyaml.nodes.ScalarNode (tag=tag:yaml.org,2002:str, value=NotAtAll)>
>
}
)>
At this point I could extract the value nodes (which are ScalarNodes) by searching for valueNode=<org.yaml.snakeyaml.nodes.ScalarNode and then processing the value in each node.
But the issue is that I don't know why it puts a garbage value there while composing map nodes. So here are my questions:
How do I correctly compose YAML files so that map nodes appear correctly instead of as garbage values?
After I'm done processing and have successfully replaced the values with the processed ones, how do I write these back to a YAML file?
If you think this is a rubbish method to start with, please suggest a better one.
The reason you get the "garbage value" is this section in MappingNode's toString method:
if (node.getValueNode() instanceof CollectionNode) {
    // to avoid overflow in case of recursive structures
    buf.append(System.identityHashCode(node.getValueNode()));
} else {
    buf.append(node.toString());
}
Since the composed graph may contain cycles (because anchors & aliases have been resolved at this stage of parsing), toString will not recurse into Collection nodes (i.e. Mappings and Sequences).
This means that your node tree has indeed been composed correctly and you simply should not use toString to inspect it. That answers your first question.
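If you want to inspect the graph, walk it directly instead of relying on toString. Here's a minimal sketch (the printNode helper is my own invention; all node classes live in org.yaml.snakeyaml.nodes, and the sketch ignores possible cycles for brevity):

static void printNode(Node node, String indent) {
    if (node instanceof MappingNode) {
        // a mapping is a list of key/value NodeTuples
        for (NodeTuple tuple : ((MappingNode) node).getValue()) {
            System.out.println(indent + "key:");
            printNode(tuple.getKeyNode(), indent + "  ");
            System.out.println(indent + "value:");
            printNode(tuple.getValueNode(), indent + "  ");
        }
    } else if (node instanceof SequenceNode) {
        for (Node item : ((SequenceNode) node).getValue()) {
            printNode(item, indent + "  ");
        }
    } else {
        System.out.println(indent + ((ScalarNode) node).getValue());
    }
}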
To write that back to a YAML file, use something like
Emitter emitter = new Emitter(/* e.g. a FileWriter */ writer, new DumperOptions());
for (Event event : yaml.serialize(/* the root node */ node)) {
emitter.emit(event);
}
To answer question 3: in the previous question, you mentioned that you want to change (encrypt) certain values and leave the others untouched. If that is the case, I suggest you use yaml.parse instead of yaml.compose, because you lose less information when working on the event stream than when working on the composed graph (meaning that the output will be more similar to the input).
You can then go through the generated events, identify the events you want to alter and replace them with the altered events, and then use an Emitter like I showed in the code above.
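To make that concrete, here's a minimal sketch of the whole round trip. The shouldProcess and process methods are hypothetical placeholders for your own logic, and getScalarStyle() assumes a recent SnakeYAML 1.x. Note that this visits key scalars as well as value scalars, so your check needs to distinguish them:

import java.io.FileReader;
import java.io.FileWriter;
import java.io.Reader;
import java.io.Writer;

import org.yaml.snakeyaml.DumperOptions;
import org.yaml.snakeyaml.Yaml;
import org.yaml.snakeyaml.emitter.Emitter;
import org.yaml.snakeyaml.events.Event;
import org.yaml.snakeyaml.events.ScalarEvent;

public class ProcessValues {
    public static void main(String[] args) throws Exception {
        Yaml yaml = new Yaml();
        try (Reader reader = new FileReader("yamlInput.yml");
             Writer writer = new FileWriter("yamlOutput.yml")) {
            Emitter emitter = new Emitter(writer, new DumperOptions());
            for (Event event : yaml.parse(reader)) {
                if (event instanceof ScalarEvent) {
                    ScalarEvent scalar = (ScalarEvent) event;
                    if (shouldProcess(scalar.getValue())) {
                        // rebuild the scalar event with the processed value
                        event = new ScalarEvent(scalar.getAnchor(), scalar.getTag(),
                                scalar.getImplicit(), process(scalar.getValue()),
                                scalar.getStartMark(), scalar.getEndMark(),
                                scalar.getScalarStyle());
                    }
                }
                emitter.emit(event);
            }
        }
    }

    // hypothetical placeholders for your own logic
    private static boolean shouldProcess(String value) { return false; }
    private static String process(String value) { return value; }
}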
I showed with some Python code here how to identify events in an event stream from a list of strings (a kind of "YAML path"); however, that code inserts events at the given path instead of altering them. The Python and Java APIs are somewhat similar, so if you can read Python, that code may help you.
Related
Vavr's Either seems to solve one of my problems, where some method does a lot of checks and returns either a CalculationError or a CalculationResult.
Either<CalculationError, CalculationResult> calculate(CalculationData calculationData) {
// either returns Either.left(new CalculationError()) or Either.right(new CalculationResult())
}
I have a wrapper which stores both errors and results
class Calculation {
List<CalculationResult> calculationResults;
List<CalculationError> calculationErrors;
}
Is there any neat solution to transform a stream from Collection<CalculationData> data into a Calculation?
This can be easily done using a custom collector (using Vavr's actual Either API):
Collector<Either<CalculationError, CalculationResult>, ?, Calculation> collector = Collector.of(
        Calculation::new,
        (calc, either) -> {
            if (either.isLeft()) {
                calc.calculationErrors.add(either.getLeft());
            } else {
                calc.calculationResults.add(either.get());
            }
        },
        (calc1, calc2) -> {
            calc1.calculationErrors.addAll(calc2.calculationErrors);
            calc1.calculationResults.addAll(calc2.calculationResults);
            return calc1;
        }
);
Calculation calc = data.stream()
.map(this::calculate)
.collect(collector);
Note that Calculation should initialize its two lists (in the declaration or a new constructor).
Well, you're using Vavr, so 'neat' is right out. That tends to happen when you use tools that are hostile to the idiomatic form of the language. But then again, 'neat' is a nebulous term with no clearly defined meaning, so, I guess, whatever you think is 'neat' is therefore 'neat'. Neat, huh?
Either has the sequence and sequenceRight methods, but both of them work the way Either is supposed to work: they are left-biased, in the sense that any Left present is treated as an erroneous condition, and that means all the Right values are discarded if even one of your Eithers is a Left. Thus, you cannot use either of the sequence methods to let Either itself bake you a list of the Right values. Even sequenceRight won't do this for you (it stops on the first Left in the list and returns that instead). The filter methods similarly don't work like that.
Either very much isn't an 'either' in the sense of what that word means if you open a dictionary: it does not mean a mix of 2 types. It's solely a non-Java-like take on exception management: Right contains the 'answer', Left contains the 'error' (you're using it correctly). As a consequence, there's nothing in the Either API to help with this task, which in effect involves 'please filter out the errors and then do something'. ("Silently ignore errors" is rarely the right move. It is what is needed here, but it makes sense that the Either API isn't going to hand you a footgun, even if you need it here.)
Thus, we just write it in plain-jane Java:
var calculation = new Calculation();
for (var e : mix) {
    if (e.isLeft()) calculation.calculationErrors.add(e.getLeft());
    if (e.isRight()) calculation.calculationResults.add(e.get());
}
(This presumes your Calculation constructor at least initializes those lists to empty mutables).
NB: Rob Spoor's answer also assumes this and is much, much longer. Sometimes the functional way is the silly, slow, unwieldy, hard to read, way.
NB2: Either.sequence(mix).orElseRun(s -> calculation.calculationErrors = s.asJava()); is a rather 'neat' way (perhaps - it's in the eye of the beholder) of setting up the errors field of your Calculation class. There's no joy for such a 'neat' trick to fill the 'results' part of it all, however. That's what the bulk of my answer is trying to explain: there is no nice API for that in Either, and it's probably by design, as that involves intentionally ignoring the errors in the list of Eithers.
Since you are using Vavr, you may consider using Traversable instead of Collection. This will give you the method partition, which can be used to classify your list of Eithers into two groups, like so:
Traversable<Either<CalculationError, CalculationResult>> calculations = ...;
var partitionedCalcs = calculations.partition(Either::isRight);
var results = partitionedCalcs._1.map(Either::getRight);
var errors = partitionedCalcs._2.map(Either::getLeft);
Calculation calcs = new Calculation(results, errors);
If you don't want to change your existing use of Collection to use a Traversable, then you can easily convert between them by using, for example, List.ofAll(Iterator) and Value.toJavaCollection(Function).
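For instance, a minimal sketch of that conversion (the variable names are my own):

// a plain java.util collection of Eithers...
java.util.List<Either<CalculationError, CalculationResult>> data = new java.util.ArrayList<>();
// ...becomes a Vavr Traversable (ofAll accepts any Iterable)
Traversable<Either<CalculationError, CalculationResult>> calculations = io.vavr.collection.List.ofAll(data);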
Is there a way to record specific line/character positions from generated FreeMarker templates? The purpose would be to highlight specific sections of the generated output file without having to parse the generated output file.
For example, let's say I have this template:
function foo()
{
ordinary_crap();
ordinary_crap();
do_something_special();<#mark foospecial>
ordinary_crap();
}
function bar()
{
ordinary_crap();
do_something_really_special();<#mark barspecial>
ordinary_crap();
ordinary_crap();
}
function baz()
{
foo();<#mark foo_call_1>
ordinary_crap();
bar();<#mark bar_call_1>
}
I want the <#mark> directive not to yield any generated output, but to associate the mark names foospecial, barspecial, foo_call_1 and bar_call_1 with the line and position-within-a-line of where the <#mark> directives land in the generated output. In the example above I showed independent single points, but it would also be useful to have begin/end pairs to mark specific ranges.
The alternatives I can see are
parsing the output independently -- not always possible, for example what if there are several identical instances of something in the output, and I want to highlight a specific one of those?
adding "mark hints" and removing them via my own postprocessing step. For example
<mark name="years">Fourscore and seven</mark> years ago
something really brilliant happened to a really nice guy named
<mark name="niceguyname">Fred</mark>.
Then I could postprocess this and remove the <mark> tags (assuming they don't conflict with the rest of the content), recording positions as I go.
But both of these seem kind of hacky.
From your TemplateDirectiveModel implementation (I assume that's how you implement mark, not with #macro), call env.getCurrentDirectiveCallPlace(). The returned DirectiveCallPlace has getBeginColumn() and getBeginLine() methods.
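For illustration, here's a minimal sketch of such a directive (the class name and the name parameter are my own choices; as a user-defined directive it would be called as <@mark name="foospecial"/> rather than <#mark foospecial>):

import java.io.IOException;
import java.util.Map;

import freemarker.core.DirectiveCallPlace;
import freemarker.core.Environment;
import freemarker.template.TemplateDirectiveBody;
import freemarker.template.TemplateDirectiveModel;
import freemarker.template.TemplateException;
import freemarker.template.TemplateModel;

public class MarkDirective implements TemplateDirectiveModel {
    @SuppressWarnings("rawtypes")
    public void execute(Environment env, Map params, TemplateModel[] loopVars,
            TemplateDirectiveBody body) throws TemplateException, IOException {
        DirectiveCallPlace place = env.getCurrentDirectiveCallPlace();
        TemplateModel name = (TemplateModel) params.get("name");
        // record the position of the directive call; writes nothing to the output
        System.out.printf("mark %s at line %d, column %d%n",
                name, place.getBeginLine(), place.getBeginColumn());
    }
}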
Assume we have a method which calls another:
public void readXMLFile() {
    // for each line read from the file:
    Node node = parse(line);
}

private Node parse(String line) {
}
Now, is it good practice to use a more comprehensive function name like readXMLFileAndParse?
Pros:
It provides more comprehensive information to the caller about what the function is supposed to be doing.
Otherwise the client may wonder whether it only reads, and where the "parse" utility is.
In other words, I see a clear advantage in a function name that covers all the activities nested within it. Is this the right thing to do, i.e. is it considered good practice?
It's a guideline that every method should have only one job (single responsibility).
However, this causes problems for naming when a method returns the result of a combination of sub-methods.
Therefore you should name it to describe its primary function: parsing a file. Reading the file is part of that, but it's not vital to the end user, since it's implied.
Then again, you have to think about what this exactly entails: nobody parses a file just to parse it. Do you retrieve data? Do you write data?
You should describe your actions on that file, but not as literally as 'readFile' or 'parseFile'.
RetrieveCustomers, if you're reading customers, would be a lot more descriptive.
public List<Customer> RetrieveCustomers() {
    // loop over lines
    // call parser
}

private Customer ParseCustomer() { }
If you'd share what exactly it is you're trying to parse, that would help a lot.
I think it depends on the complexity of your class. Since the method is private, no one, in theory, should care. Name it descriptively enough that you can read your own code 6 months from now, and stop there.
Public methods, on the other hand, should be well-named and well-documented. Extra descriptiveness there can't hurt.
What I am trying to achieve is (using Saxon-B 9.1):
1) Run an XSLT transformation with an object of the Example class below as a parameter
2) The object's properties are populated, via a reflexive extension function, with selected nodes
3) Run a second XSLT transformation (on a different XML input) and pass the above object with populated values as a parameter
4) Insert XML nodes from the object into the output document
My class is below:
public class Example {
    private NodeList test;

    public Example() {}

    public void settest(NodeList t) {
        this.test = t;
    }

    public NodeList gettest() {
        return test;
    }
}
The first transformation seems to populate my object fine (using the settest() method from within the XSLT) - I can see the correct nodes added to the NodeList.
However, I get the error below when running the second transformation and calling the gettest() method from within the XSLT:
NodeInfo returned by extension function was created with an incompatible Configuration
I was wondering whether I should use not NodeList but some different, equivalent type that Saxon would recognise? I tried it with NodeSet but got the same error message.
Any help on this would be appreciated.
You haven't shown enough information to see exactly what you are doing wrong, but I can try and explain the error message. Saxon achieves its fast performance in part by allocating integer codes to all the names used in your XML documents and stylesheets, and using integer comparisons to compare names. The place where the mapping of integers to names is held is the NamePool, and the NamePool is owned by the Saxon Configuration object; so all documents, stylesheets, etc participating in a transformation must be created under the same Configuration (it's a bit like the DOM rule that all Nodes must be created under the Document they are attached to). The message means that you've got at least two different Configuration objects around. A Configuration gets created either explicitly by your application, or implicitly when you create a TransformerFactory, an XPathFactory, or other such objects.
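For example, a minimal sketch of that rule in practice (the file names are placeholders; the point is that both transformers come from the same factory and therefore share one Configuration):

import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

public class SharedConfiguration {
    public static void main(String[] args) throws Exception {
        // one factory => one Saxon Configuration => one NamePool
        TransformerFactory factory = new net.sf.saxon.TransformerFactoryImpl();

        Transformer first = factory.newTransformer(new StreamSource("first.xsl"));
        Transformer second = factory.newTransformer(new StreamSource("second.xsl"));

        // nodes captured during the first run can now safely be handed to the second
        first.transform(new StreamSource("input1.xml"), new StreamResult("out1.xml"));
        second.transform(new StreamSource("input2.xml"), new StreamResult("out2.xml"));
    }
}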
I wonder whether this mixing of XSLT and Java code is really a good idea? Often when I see it, the Java code is being used because people haven't mastered how to achieve the desired effect in XSLT. There are many good reasons for NOT using the DOM with Saxon: it's very slow, it requires more lines of code, it's not thread-safe, it's harder to debug, ...
I have a large collection of data in an Excel file (and CSV files). The data needs to be placed into a database (MySQL). However, before it goes into the database it needs to be processed; for example, if column 1 is less than column 3, add 4 to column 2. There are quite a few rules that must be followed before the information is persisted.
What would be a good design to follow to accomplish this task? (using java)
Additional notes
The process needs to be automated, in the sense that I don't have to manually go in and alter the data. We're talking about thousands of lines of data with 15 columns of information per line.
Currently, I have a sort of chain-of-responsibility design set up: one class (Java) for each rule. When one rule is done, it calls the following rule.
More Info
Typically there are about 5000 rows per data sheet. Speed isn't a huge concern because this large input doesn't happen often.
I've considered Drools, however I wasn't sure the task was complicated enough for Drools.
Example rules:
All currency (data in specific columns) must not contain currency symbols.
Category names must be uniform (e.g. book case = bookcase)
Entry dates can not be future dates
Text input can only contain [A-Z 0-9 \s]
etc.
Additionally, if any column of information is invalid it needs to be reported when processing is complete (or maybe stop processing).
My current solution works. However, I think there is room for improvement, so I'm looking for ideas as to how it can be improved and/or how other people have handled similar situations.
If I didn't care about doing this in 1 step (as Oli mentions), I'd probably use a pipes-and-filters design. Since your rules are relatively simple, I'd probably do a couple of delegate-based classes. For instance (C# code, but Java should be pretty similar... perhaps someone could translate?):
interface IFilter {
    IEnumerable<string> Filter(IEnumerable<string> file);
}
class PredicateFilter : IFilter {
    private readonly Predicate<string> predicate;

    public PredicateFilter(Predicate<string> predicate) { this.predicate = predicate; }

    public IEnumerable<string> Filter(IEnumerable<string> file) {
        foreach (string s in file) {
            if (predicate(s)) {
                yield return s;
            }
        }
    }
}
class ActionFilter : IFilter {
    private readonly Action<string> action;

    public ActionFilter(Action<string> action) { this.action = action; }

    public IEnumerable<string> Filter(IEnumerable<string> file) {
        foreach (string s in file) {
            action(s);
            yield return s;
        }
    }
}
class ReplaceFilter : IFilter {
    private readonly Func<string, string> replace;

    public ReplaceFilter(Func<string, string> replace) { this.replace = replace; }

    public IEnumerable<string> Filter(IEnumerable<string> file) {
        foreach (string s in file) {
            yield return replace(s);
        }
    }
}
From there, you could either use the delegate filters directly, or subclass them for the specifics. Then register them with a Pipeline that passes the input through each filter.
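Taking up the translation offer, here's a rough Java sketch of the same pipes-and-filters idea (the Pipeline class and helper names are my own):

import java.util.List;
import java.util.function.Consumer;
import java.util.function.Predicate;
import java.util.function.UnaryOperator;
import java.util.stream.Stream;

public class Pipeline {
    private final List<UnaryOperator<Stream<String>>> filters;

    public Pipeline(List<UnaryOperator<Stream<String>>> filters) {
        this.filters = filters;
    }

    // pass the lines through each filter in turn
    public Stream<String> run(Stream<String> lines) {
        for (UnaryOperator<Stream<String>> filter : filters) {
            lines = filter.apply(lines);
        }
        return lines;
    }

    // helpers mirroring PredicateFilter, ActionFilter and ReplaceFilter
    public static UnaryOperator<Stream<String>> keep(Predicate<String> p) {
        return s -> s.filter(p);
    }

    public static UnaryOperator<Stream<String>> inspect(Consumer<String> a) {
        return s -> s.peek(a);
    }

    public static UnaryOperator<Stream<String>> replace(UnaryOperator<String> f) {
        return s -> s.map(f);
    }
}

Using it with a couple of the example rules from the question might look like:

Pipeline pipeline = new Pipeline(List.of(
        Pipeline.replace(line -> line.replace("$", "")),   // strip currency symbols
        Pipeline.keep(line -> line.matches("[A-Z0-9 ]*"))  // allow only [A-Z 0-9 \s]
));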
I think your method is OK, especially if you use the same interface on every processor.
You could also look at something called Drools, currently JBoss Rules. I used it some time ago for a rule-heavy part of my app, and what I liked about it is that the business logic can be expressed in, for instance, a spreadsheet or a DSL, which then gets compiled to Java (at run time, and I think there's also a compile-time option). It makes rules a bit more succinct and thus readable. It's also very easy to learn (2 days or so).
Here's a link to the open-source JBoss Rules. At jboss.com you can undoubtedly purchase an officially maintained version if that's more to your company's taste.
Just create a function to enforce each rule, and call every applicable function for each value. I don't see how this requires any exotic architecture.
A class for each rule? Really? Perhaps I'm not understanding the quantity or complexity of these rules, but I would do (semi-pseudo-code):
public class ALine {
    private int col1;
    private int col2;
    private int coln;

    // ...

    public ALine(String line) {
        // read row into private variables
        // ...
        this.process();
        this.insert();
    }

    public void process() {
        // do all your rules here, working with the local variables
    }

    public void insert() {
        // write to DB
    }
}

for (String line : csvLines) {
    new ALine(line);
}
Your methodology of using classes for each rule does sound a bit heavyweight, but it has the advantage of being easy to modify and expand should new rules come along.
As for loading the data, bulk loading is the way to go. I have read some information which suggests it may be as much as 3 orders of magnitude faster than loading using insert statements. You can find some information on it here.
Bulk load the data into a temp table, then use SQL to apply your rules.
Use the temp table as a basis for the insert into the real table.
Drop the temp table.
You can see that all the different answers are coming from their own experience and perspective.
Since we don't know much about the complexity and number of rows in your system, we tend to give advice based on what we have done before.
If you want to narrow it down to 1 or 2 solutions for your implementation, try giving more details.
Good luck
It may not be what you want to hear, and it isn't the "fun way" by any means, but there is a much easier way to do this.
So long as your data is evaluated line by line, you can set up another worksheet in your Excel file and use spreadsheet-style functions to do the necessary transforms, referencing the data from the raw data sheet. For more complex functions you can use the VBA embedded in Excel to write custom operations.
I've used this approach many times and it works really well; it's just not very sexy.