XmlStreamReader behaves randomly at the start - java

I would expect the XmlStreamReader to start at the start of the document (obviously) and then jump to the root of the XML document when I call next() on it. However, scaringly, I see it jump to the first tag with text inside and always omitting the root and often(???) the second tag.
the document looks like this:
<?xml version="1.0" encoding="UTF-8"?>
<objektliste xmlns="http://www.pixelboxx.de/ns/erco/translations/1.0">
<uebersetzungen key="122671" attribute="7505">
<thumbnail>abrakadabra.jpg</thumbnail>
<text sprache="1031">We like the abla abla abla</text>
<text sprache="2057">We like the spoonBlaBlaBla[en]</text>
<text sprache="1036">Wicher</text>
</uebersetzungen>
<uebersetzungen key="122679" attribute="7505">
<thumbnail>122679.jpg</thumbnail>
<text sprache="1031">Kiefer</text>
<text sprache="1036">franek</text>
</uebersetzungen>
</objektliste>
Am I going insane, is my eclipse going insane or I don't see something obvious?
The program seems to always omit "objektliste" and usually jump to "thumbnail" first, even though in previous debug sessions it seemed to behave even more random.
help!!!
btw, the code is extremely simple:
XMLStreamReader streamReader = factory.createXMLStreamReader( is);
while( streamReader.hasNext())
{
//event type 7 here, everything seems to be ok.
streamReader.next();
//bang! armaggeddon - skips the root, jumps to thumbnail.

The call to streamReader.next() is event based .
The next() method causes the reader to read the next parse event. The next() method returns an integer which identifies the type of event just read.
The event type can be determined using getEventType().
I think you may be experiencing issues with the end element events showing up and you were not expecting it.
Using the following code, I see the proper order being processed as expected:
XMLInputFactory factory = XMLInputFactory.newInstance();
XMLStreamReader streamReader = factory.createXMLStreamReader(is);
while( streamReader.hasNext()) {
int eventType = streamReader.next();
switch (eventType) {
case XMLStreamReader.START_ELEMENT:
String elementName = streamReader.getLocalName();
System.out.println("Element: " + elementName);
break;
case XMLStreamReader.END_ELEMENT:
break;
}
}
Which generates the expected output:
Element: objektliste
Element: uebersetzungen
Element: thumbnail
Element: text
Element: text
Element: text
Element: uebersetzungen
Element: thumbnail
Element: text
Element: text

Related

Java Stax how to get only value of specific child nodes

I use Stax for get nodeName and nodeValue of my xml file (size 90 MB) :
<?xml version="1.0" encoding="UTF-8"?>
<name1>
<type>
<coord>67</coord>
<umc>57657</umc>
</type>
<lang>
<eng>989</eng>
<spa>123</spa>
</lang>
</name1>
<name2>
<type>
<coord>534</coord>
<umc>654654</umc>
</type>
<lang>
<eng>354</eng>
<spa>2424</spa>
</lang>
</name2>
<name3>
<type>
<coord>23432</coord>
<umc>14324</umc>
</type>
<lang>
<eng>141</eng>
<spa>142</spa>
</lang>
</name3>
I can get localName but not child nodes... if I want to get the value for all child nodes different of 'spa' how can I process to get that ?
Java:
XMLStreamReader dataXML = factory.createXMLStreamReader(new FileReader(path));
while (dataXML.hasNext())
{
int type = dataXML.next();
switch(type)
{
case XMLStreamReader.START_ELEMENT:
System.out.println(dataXML.getLocalName());
break;
case XMLStreamReader.CHARACTERS:
System.out.println(dataXML.getText());
break;
}
}
To keep track of element being parsed it's needed to introduce variable holding the current tag name as well as the variable with the tag name(s) of interest:
String localname = null;
String tagName = "spa";
while (dataXML.hasNext()) {
int type = dataXML.next();
switch (type) {
case XMLStreamReader.SPACE:
continue;
case XMLStreamReader.START_ELEMENT:
localname = dataXML.getLocalName();
System.out.println(dataXML.getLocalName());
break;
case XMLStreamReader.CHARACTERS:
if (!tagName.equals(localname)) {
System.out.println(dataXML.getText());
}
break;
}
}
In case there are several tags you want to handle, variable tagName could be replaced with a list:
List<String> tagNames = new ArrayList<>();
tagNames.add("spa");
And the check would be following:
if (!tagNames.contains(localname)) {
System.out.println(dataXML.getText());
}
You use StAX parsing. It means You pull events from a parser. StAX parsing doesn't have any information about detail structure of Your document.
Please check Differences between DOM, SAX or StAX and Java StAX parser
If You want to get children of Your XML element, You need to track it by Yourself.
If You really want children being accessed in a convenient way - use DOM parsing strategy. But as You've mentioned, Your document is ~90MB what may be really heavy to load it fully.

Replace string with jsoup only in text portions

I have found several topics with similar questions and valuable answers, but I am still struggling with this:
I want to parse some html with Jsoup so I can replace, for example,
"changeme"
with
<changed>changeme</changed>
, but only if it appears on a text portion of the html, no if it is part of a tag. So, starting with this html:
<body>
<p>test changeme app</p>
</BODY>
</HTML>
I would want to get to this:
<body>
<p>test <changed>changeme</changed> app</p>
</BODY>
</HTML>
I have tried several approaches, this one is which brings me closer to the desired result:
Document doc = null;
try {
doc = Jsoup.parse(new File("tmp1450348256397.txt"), "UTF-8");
} catch (Exception ex) {
}
Elements els = doc.body().getAllElements();
for (Element e : els) {
if (e.text().contains("changeme")) {
e.html(e.html().replaceAll("changeme","<changed>changeme</changed>"));
}
}
html = doc.toString();
System.out.println(html);
But with this approach I find two problems:
<body>
<p><a href="http://<changed>changeme</changed> .html">test
<changed>
changeme
</changed>
app</a></p>
</BODY>
</HTML>
Line breaks are inserted before and after the new element I am introducing. This is not a real problem as I coul get rid of them if I use #changed# to do the replacing and after the doc.toString() I replace them again to the desired value (with < >).
The real problem: The URL in the href has been modified, and I don't want it to happen.
Ideas? Thx.
Here is my solution:
String html=""
+"<p><a href=\"http://changeme.html\">"
+ "test changeme "
+ "<div class=\"changeme\">"
+ "inner text changeme"
+ "</div>"
+ " app</a>"
+"</p>";
Document doc = Jsoup.parse(html);
Elements els = doc.body().getAllElements();
for (Element e : els) {
List<TextNode> tnList = e.textNodes();
for (TextNode tn : tnList){
String orig = tn.text();
tn.text(orig.replaceAll("changeme","<changed>changeme</changed>"));
}
}
html = doc.toString();
System.out.println(html);
TextNodes are always leaf nodes, i.e. they do not contain more HTML elements. In your original approach you replace the HTML of an element with new HTML with replaced changme strings. You only check for the changeme to be part of the TextNodes contents, but you replace every occurrence in the HTML string of the element, including all occurrences outside TextNodes.
My solution basically works like yours, but I use the JSoup method textNodes(). This way I don't need to typecast.
P.S.
Of course, my solution as well as yours will contain <changed>changeme</changed> instead of <changed>changeme</changed> in the end. This may or may not be what you want. If you do not want this, then your result is not any more valid HTML, since changed is no valid HTML tag. Jsoup will not help you in this case. However, you can of course replace in the resulting string all <changed>changeme</changed> again - outside JSoup.
I think your issue is that you're replacing the elements html rather than just its text, change:
e.html(e.html().replaceAll("changeme","<changed>changeme</changed>"));
to
e.text(e.text().replaceAll("changeme","<changed>changeme</changed>"));
the line breaks issue can probably be solved by doing doc.outputSettings().prettyPrint(false); before doing html = doc.toString();
Finally I tried this solution (at the end of the question), using TextNodes:
How I can replace "text" in the each tag using Jsoup
This is the resulting code:
Elements els = doc.body().getAllElements();
for (Element e : els) {
for (Node child : e.childNodes()){
if (child instanceof TextNode && !((TextNode) child).isBlank()) {
((TextNode)child).text(((TextNode)child).text().replaceAll("changeme","<changed>changeme</changed>"));
}
}
}
Now the output is the expected, and it even does not introduce extra break lines. In this case prettyPrint must be set to True.
The only problem is that I don't really understand the difference of using TextNode vs Element.text(). If someone wants to provide some info it will be much appreciated.
Thanks.

Java appendChild (CSV to XML conversion) doesn't work for ONE node

So I need to convert a CSV file to seperate XML files (one XML file per line in the CSV file). And this all works fine, except for the fact that it refuses to add one value. It adds the other without a problem, but for some mysterious reason refuses to create a tag for one of my nodes.
Document newDoc = documentBuilder.newDocument();
Element rootElement = newDoc.createElement("XMLoutput");
newDoc.appendChild(rootElement);
String header = headers.get(col);
String value = null;
String value2 = null;
if (col < rowValues.length) {
if(header.equals("delay")) {
value = rowValues[col];
Thread.sleep(Long.parseLong(value));
}
Element shipidElement = newDoc.createElement("shipID");
shipidElement.appendChild(newDoc.createTextNode(FilenameUtils.getBaseName(csvFileName)));
rootElement.appendChild(shipidElement);
if(header.equals("centraleID")) {
value = rowValues[col];
System.out.println(value); //to check if the if condition works, it does
Element centralElement = newDoc.createElement(header);
Text child = newDoc.createTextNode(value);
centralElement.appendChild(child);
rootElement.appendChild(centralElement);
}
else if(header.equals("afstandTotKade")) {
value2 = rowValues[col];
Element curElement = newDoc.createElement(header);
curElement.appendChild(newDoc.createTextNode(value2));
rootElement.appendChild(curElement);
}
String timeStamp = new SimpleDateFormat("HH:mm:ss").format(new Date());
Element timeElement = newDoc.createElement("Timestamp");
timeElement.appendChild(newDoc.createTextNode(timeStamp));
rootElement.appendChild(timeElement);
}
So in the above code, the if loop checking for CentraleID actually works, because it prints out the values, however the XML file does not add a tag, not even if I just insert a string instead of the header value. It does, however, insert the "afstandTotKade" node and timestamp node. I am dumbfounded.
PS: this is only part of the code of course, but the problem is so minute, that it seemed superfluous to add all of it.
PPS: the code was originally the same as the others, I've just been playing around.
This is the resulting XML File btw, so the other if does add the nodes, and I know the spelling is correct in checking the header because it does print out the values (the sout code) when I run it:
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<XMLoutput>
<shipID>1546312</shipID>
<afstandTotKade>1000</afstandTotKade>
<Timestamp>22:06:33</Timestamp>
</XMLoutput>

Parse CDATA from XML to enable editing Java

I am writing a simulator which communicates with a client's piece of software over a local socket. The communication language is XML. I have written some code which works - parsing the incoming XML string into Document via the DocumentBuilder interface.
I have been encountering a problem with CDATA (Having never seen it before). Basically, I need to access fields within the CDATA tag and change them. I load up a 'template' XML document (to reply to the messages with) and use values received in the first message inside the response. Some of the fields that need to be changed are in this CDATA tag (clear what I mean below).
public static String getOutputMessage(String input) throws Exception{
DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
Document inputDoc, outputDoc;
Element messageElement = (Element)inputDoc.getElementsByTagName("TRANS").item(0);
messageType = messageElement.getAttribute("name");
if (messageType.equals("processTransaction")){
outputDoc = db.parse(path+"processTransaction\\posPrintReceipt.xml");
outputDoc = changeContent(outputDoc, "PAN_NUMBER", transaction.getPan_number());
outputDoc = changeContent(outputDoc, "TOKEN", transaction.getToken());
outputDoc = changeContent(outputDoc, "TOTAL_AMOUNT", transaction.getTotal_amount());
outputDoc = changeContent(outputDoc, "TRANSACTION_TIME", transaction.getTransaction_time());
outputDoc = changeContent(outputDoc, "TRANSACTION_DATE", transaction.getTransaction_date());
}
}
private static Document changeContent(Document doc,String tag,String value) {
System.out.println("Changing: ["+tag+" : "+value+"]");
NodeList nodes=doc.getElementsByTagName(tag);
Node node = nodes.item(0);
Node parent=node.getParentNode();
node.setTextContent(value);
System.out.println(doc.getElementsByTagName(tag).item(0) + " " + node.getTextContent());
parent.replaceChild(node, doc.getElementsByTagName(tag).item(0));
return doc;
}
The functions above work on normal Elements but below is an example XML message I have to read and change some values such as
<RLSOLVE_MSG version="5.0">
<MESSAGE>
<SOURCE_ID>DP01</SOURCE_ID>
<TRANS_NUM>000001</TRANS_NUM>
</MESSAGE>
<POI_MSG type="interaction">
<INTERACTION name="posPrintReceipt">
<RECEIPT type="merchant" format="xml">
<![CDATA[<RECEIPT>
<AUTH_CODE>06130</AUTH_CODE>
<CARD_SCHEME>VISA</CARD_SCHEME>
<CURRENCY_CODE>GBP</CURRENCY_CODE>
<CUSTOMER_PRESENCE>internet</CUSTOMER_PRESENCE>
<FINAL_AMOUNT>1.00</FINAL_AMOUNT>
<MERCHANT_NUMBER>8888888</MERCHANT_NUMBER>
<PAN_NUMBER>454420******0382</PAN_NUMBER>
<PAN_EXPIRY>12/15</PAN_EXPIRY>
<TERMINAL_ID>04176421</TERMINAL_ID>
<TOKEN>454420bbbbbkqrm0382</TOKEN>
<TOTAL_AMOUNT>1.00</TOTAL_AMOUNT>
<TRANSACTION_DATA_SOURCE>keyed</TRANSACTION_DATA_SOURCE>
<TRANSACTION_DATE>14/02/2014</TRANSACTION_DATE>
<TRANSACTION_NUMBER>000001</TRANSACTION_NUMBER>
<TRANSACTION_RESPONSE>06130</TRANSACTION_RESPONSE>
<TRANSACTION_TIME>17:13:17</TRANSACTION_TIME>
<TRANSACTION_TYPE>purchase</TRANSACTION_TYPE>
<VERIFICATION_METHOD>unknown</VERIFICATION_METHOD>
<DUPLICATE>false</DUPLICATE>
</RECEIPT>]]>
</RECEIPT>
</INTERACTION>
</POI_MSG>
CDATA is an encoding mechanism to include arbitrary data within an XML file. Everything within CDATA is parsed as a single string when loading the XML into a Document instance. If you need to access the contents of the CDATA as a DOM document, you will need to instantiate a second Document object from the string contents, make your changes, then serialize that back to a string and put the string back into a CDATA in the original document.
I dont think CDATA section will be parsed as other regular elements in the XML. CDATA section is purely to escape any syntax checks. My suggestion would be use a element to represent the data in CDATA section. If you still want to use CDATA section, I guess you'll need parse the section as a string and then load the data into a Document.

Jsoup Selector "not"

This code here:
Document doc = Jsoup.connect("http://wikitravel.org/en/San_Francisco").get();
System.out.println(doc.select("h2:contains(Get around) ~ *:not(h2:contains(See) ~ *)"));
outputs http://pastebin.com/gkcCfr1F. Is there a selector to make the "not" selector inclusive? Right now it is removing everything after the "see", when I'd like to remove the last h2 tag with the id="see" along with the everything else because I'm attempting to parse individual sections of a wiki.
The final output that I would like to obtain is: http://pastebin.com/ntpVrgui
I would do something like this :
Get the content div :
StringBuilder sb = new StringBuilder();
boolean start = false;
Document doc = Jsoup.connect("http://wikitravel.org/en/San_Francisco").get();
Elements content = doc.select("#content");
for (Element element : content) {
/*Pseudo code
if element is h3 and it contains span with id Navigating and if start is
false append it to stringbuilder, set start to true, else append everything in between until you reach h2 with span id See
*/
}

Categories

Resources