Can't parse XML (from web) using JSoup

I am trying to work with small XML files fetched from the web and read a few attributes from them. How would I approach this in JSoup? I know it is an HTML parser rather than an XML parser, but it supports XML too, and I don't have to build handlers, builder factories and the like, as I would with DOM or SAX.
An example of the XML is at LINK (I could not paste it here because every line broke out of the code block; if someone can fix that I would be grateful).
Here is my code:
String xml = "http://www.omdbapi.com/?t=Private%20Ryan&y=&plot=short&r=xml";
Document doc = Jsoup.parse(xml, "", Parser.xmlParser());
// I want the first <genre> element. There is only one, but the code does not
// work without .first(), and even with it nothing is found.
Element genreFromXml = doc.select("genre").first();
String genre = genreFromXml.text();
System.out.println(genre);
It results in NPE at:
String genre = genreFromXml.text();

There are two issues in your code:
1. You pass the URL as a String where XML content is expected, so the URL text itself gets parsed. Use the method parse(InputStream in, String charsetName, String baseUri, Parser parser) instead and feed it the response as an input stream.
2. There is no genre element in your XML; genre is an attribute of the movie element.
Here is how your code should look:
String url = "http://www.omdbapi.com/?t=Private%20Ryan&y=&plot=short&r=xml";
// Parse the doc using an XML parser
Document doc = Jsoup.parse(new URL(url).openStream(), "UTF-8", "", Parser.xmlParser());
// Select the first element "movie"
Element movieFromXml = doc.select("movie").first();
// Get its attribute "genre"
String genre = movieFromXml.attr("genre");
// Print the result
System.out.println(genre);
Output:
Drama, War
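The element-versus-attribute distinction behind the NPE can be reproduced without a network call. The following is a minimal sketch that uses the JDK's built-in DOM parser rather than JSoup; the inline XML is a made-up sample assumed to have the same shape as the response discussed above (a movie element carrying a genre attribute), and the class and method names are hypothetical.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

final class GenreAttributeDemo {

    // Made-up sample; assumed to resemble the service's XML response.
    static final String SAMPLE =
        "<root response=\"True\">"
        + "<movie title=\"Saving Private Ryan\" genre=\"Drama, War\"/>"
        + "</root>";

    // Parses an XML string into a DOM document; checked exceptions are wrapped.
    static Document parse(String xml) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            return builder.parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse XML", e);
        }
    }

    // Counts elements named "genre". For the sample this is zero,
    // which is why looking up a genre element yields nothing.
    static int countGenreElements(String xml) {
        return parse(xml).getElementsByTagName("genre").getLength();
    }

    // Returns the genre attribute of the first movie element,
    // or null when the document has no movie element at all.
    static String genreOfFirstMovie(String xml) {
        NodeList movies = parse(xml).getElementsByTagName("movie");
        if (movies.getLength() == 0) {
            return null;
        }
        return ((Element) movies.item(0)).getAttribute("genre");
    }

    public static void main(String[] args) {
        System.out.println("genre elements: " + countGenreElements(SAMPLE));
        System.out.println("genre attribute: " + genreOfFirstMovie(SAMPLE));
    }
}
```

Checking for a missing element before dereferencing it, as genreOfFirstMovie does, avoids the NullPointerException when the expected element is absent.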

Related

> and < get converted to &gt; and &lt; while adding an XML-like string in element.setTextContent()

I have a string that looks like XML, for example:
String sample = "<GrpHdr><MsgId>MQSECJYJHRBPDTZTYNNEYXOZUPAUDEKVDFV</MsgId><CreDtTm>2023-02-02T21:48:58.075+05:30</CreDtTm></GrpHdr>";
I am trying to create an XML document with an element that contains the above content, like this:
<ns1:TstCode>T</ns1:TstCode>
<ns1:FType>SCF</ns1:FType>
<ns1:FileRef>220811084023</ns1:FileRef>
<ns1:RoutingInd>ALL</ns1:RoutingInd>
<ns1:FileBusDt>2022-08-11</ns1:FileBusDt>
<ns1:FIToFI xmlns="urn:iso:std:iso:20022:tech:xsd">
<GrpHdr>
<MsgId>MQSECJYJHRBPDTZTYNNEYXOZUPAUDEKVDFV</MsgId>
<CreDtTm>2023-02-02T21:48:58.075+05:30</CreDtTm>
</GrpHdr>
</ns1:FIToFI>
When I create the document for the above XML using this code:
private static DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
DocumentBuilder db = dbf.newDocumentBuilder();
DOMImplementation domImpl = db.getDOMImplementation();
private Document buildExampleDocumentWithNamespaces(DOMImplementation domImpl, String output) {
Document document = domImpl.createDocument("urn:Scf:xsd:$BlkCredTrf", "ns1:BlkCredTrf", null);
document.getDocumentElement().appendChild(document.createElement("ns1:TstCode")).setTextContent("T");
document.getDocumentElement().appendChild(document.createElement("ns1:FType")).setTextContent("SCF");
document.getDocumentElement().appendChild(document.createElement("ns1:FileRef")).setTextContent("220811084023");
document.getDocumentElement().appendChild(document.createElement("ns1:RoutingInd")).setTextContent("ALL");
document.getDocumentElement().appendChild(document.createElement("ns1:FileBusDt")).setTextContent("2022-08-11");
document.getDocumentElement().appendChild(document.createElementNS("urn:iso:std:iso:tech:xsd","ns1:FIToFI"));
return document;
}
Everything works up to this point.
The problem appears when I try to add <GrpHdr><MsgId>MQSECJYJHRBPDTZTYNNEYXOZUPAUDEKVDFV</MsgId><CreDtTm>2023-02-02T21:48:58.075+05:30</CreDtTm></GrpHdr> as the text content of the final FIToFI element, using this code:
document.getDocumentElement().appendChild(document.createElementNS("urn:iso:std:iso:tech:xsd","ns1:FIToFI")).setTextContent(sample);
The XML then gets created like this:
<ns1:TstCode>T</ns1:TstCode>
<ns1:FType>SCF</ns1:FType>
<ns1:FileRef>220811084023</ns1:FileRef>
<ns1:RoutingInd>ALL</ns1:RoutingInd>
<ns1:FileBusDt>2022-08-11</ns1:FileBusDt>
<ns1:FIToFI xmlns="urn:iso:std:iso:20022:tech:xsd">
&lt;GrpHdr&gt;
&lt;MsgId&gt;MQSECJYJHRBPDTZTYNNEYXOZUPAUDEKVDFV&lt;/MsgId&gt;
&lt;CreDtTm&gt;2023-02-02T21:48:58.075+05:30&lt;/CreDtTm&gt;
&lt;/GrpHdr&gt;
</ns1:FIToFI>
Please help me to create this XML without the escape characters.

The escaping is intended. When you set the text content of an element, any special character such as < has to be escaped as &lt;, otherwise the text would be interpreted as XML markup such as elements or comments. That is why content set with setTextContent() ends up escaped in the output.
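The escaping can be observed in isolation. Below is a minimal sketch, assuming the JDK's default DOM and Transformer implementations; the class and method names are hypothetical and namespaces are left out to keep it short. It sets an XML-looking string as text content and serializes the result.

```java
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

final class TextContentEscapingDemo {

    // Creates <FIToFI> with the given string as its text content
    // and returns the serialized document as a string.
    static String serializeWithTextContent(String text) {
        try {
            Document document = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .newDocument();
            Element root = document.createElement("FIToFI");
            // The string becomes a single text node, not child elements.
            root.setTextContent(text);
            document.appendChild(root);

            Transformer transformer = TransformerFactory.newInstance().newTransformer();
            transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            transformer.transform(new DOMSource(document), new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException("Could not build or serialize the document", e);
        }
    }

    public static void main(String[] args) {
        // The markup characters in the text are written as entities.
        System.out.println(serializeWithTextContent("<GrpHdr><MsgId>ABC123</MsgId></GrpHdr>"));
    }
}
```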
When you want to add an element instead, you use methods like appendChild() with an Element argument. Build your elements as usual with the createElement() method and add them together like this:
Element element = document.createElementNS("urn:iso:std:iso:tech:xsd","ns1:FIToFI");
Element grpHdr = document.createElement("GrpHdr");
Element msgId = document.createElement("MsgId");
msgId.setTextContent("MQSECJYJHRBPDTZTYNNEYXOZUPAUDEKVDFV");
grpHdr.appendChild(msgId);
Element creDtTm = document.createElement("CreDtTm");
creDtTm.setTextContent("2023-02-02T21:48:58.075+05:30");
grpHdr.appendChild(creDtTm);
element.appendChild(grpHdr);
document.getDocumentElement().appendChild(element);
This will add the XML element inside the other XML element.
When you have the inner XML as a string, parse the XML string with DocumentBuilder.parse() (see How to create a XML object from String in Java?) and import the Element with the Document.importNode() method (see org.w3c.dom.DOMException: WRONG_DOCUMENT_ERR: A node is used in a different document than the one that created it). The code can look like this:
String innerXml = "<GrpHdr><MsgId[...]eDtTm></GrpHdr>";
Document innerDocument = db.parse(new InputSource(new StringReader(innerXml)));
Element innerRootElement = innerDocument.getDocumentElement();
Node importedNode = document.importNode(innerRootElement, true);
element.appendChild(importedNode);
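Put together, the parse-and-import approach can be sketched end to end as follows. This is an illustration under assumptions, not the asker's actual setup: namespaces are omitted for brevity, the element names BlkCredTrf and FIToFI are reused from the question, and the class and method names are hypothetical.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

final class ImportFragmentDemo {

    // Builds <BlkCredTrf><FIToFI>...</FIToFI></BlkCredTrf>, where the content
    // of FIToFI comes from parsing innerXml and importing its root element.
    static String buildWithImportedFragment(String innerXml) {
        try {
            DocumentBuilder db = DocumentBuilderFactory.newInstance().newDocumentBuilder();

            // Outer document, built programmatically.
            Document document = db.newDocument();
            Element root = document.createElement("BlkCredTrf");
            document.appendChild(root);
            Element holder = document.createElement("FIToFI");
            root.appendChild(holder);

            // Inner document, parsed from the string.
            Document innerDocument = db.parse(new InputSource(new StringReader(innerXml)));

            // Nodes belong to the document that created them, so the inner root
            // is copied into the outer document (deep = true) before appending.
            Node imported = document.importNode(innerDocument.getDocumentElement(), true);
            holder.appendChild(imported);

            Transformer transformer = TransformerFactory.newInstance().newTransformer();
            transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            transformer.transform(new DOMSource(document), new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException("Could not build the document", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(buildWithImportedFragment(
                "<GrpHdr><MsgId>ABC123</MsgId><CreDtTm>2023-02-02T21:48:58.075+05:30</CreDtTm></GrpHdr>"));
    }
}
```

Appending innerDocument.getDocumentElement() directly, without importNode(), is what triggers the WRONG_DOCUMENT_ERR mentioned above.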

Android - how to parse html by jsoup and fill into the arraylist?

I want to read the data from this HTML page:
http://jadvalbaz.blog.ir/post/%D8%B1%D8%A7%D9%87%D9%86%D9%85%D8%A7%DB%8C-%D8%AD%D9%84-%D8%AC%D8%AF%D9%88%D9%84-%D8%AD%D8%B1%D9%81-%D8%B0
If you look at the page source, it contains:
ذات اریه (پنومونی- سینه پهلو)ذر (مورچه ریز)ذرع (مقیاس طول)ذره ای بنیادی از رده هیبرونها که بار الکتریکی ندارد (لاندا)ذره منفی اتم (الکترون)ذریه (نسل)ذل (خواری)ذم (نکوهش)ذهاب (رفتن)ذی (صاحب)
My words are separated by <br> tags. I want to read each word into an ArrayList; in other words, how do I skip the <br> tags and read only the words?
Here is my code:
Document document = Jsoup.connect(url).get();
for (Element span : document.select("?").select("?")) {
title = span.toString();
name.add(title);
}
How do I read them, and what should I put in place of the question marks?
Any suggestions?

Edit the CSS of your template to define a class for your words, then use the Element.select(String selector) and Elements.select(String selector) methods:
Document doc = Jsoup.parse(input, "UTF-8", "http://example.com/");
Element masthead = doc.select("p.words").first(); // p with class=words
See the following link for more information about extracting data with these methods:
Use selector-syntax to find elements
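Once the element that holds the words has been selected, its inner HTML (JSoup's Element.html() returns it as a String) can be split on the <br> tags. The following is a minimal sketch using only the JDK; the sample input is an ASCII placeholder rather than the real page content, the right selector for that blog is not known here, and the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

final class BrSplitDemo {

    // Splits an HTML fragment on <br>, <br/> or <BR /> and returns the
    // trimmed, non-empty pieces in document order.
    static List<String> splitOnBr(String innerHtml) {
        List<String> words = new ArrayList<>();
        for (String part : innerHtml.split("(?i)<br\\s*/?>")) {
            String word = part.trim();
            if (!word.isEmpty()) {
                words.add(word);
            }
        }
        return words;
    }

    public static void main(String[] args) {
        // Placeholder input standing in for the selected element's inner HTML.
        String innerHtml = "alpha (one)<br>beta (two)<BR />gamma (three)<br>";
        for (String word : splitOnBr(innerHtml)) {
            System.out.println(word);
        }
    }
}
```

A regular expression is adequate here only because the fragment is assumed to contain plain text and <br> tags; nested markup would call for walking the parsed nodes instead.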

Elements returns empty string

I am trying to scrape prices of a website with jSoup, but I only get an empty string.
I've tested my code with jSoup Online and I expect <meta itemprop="price" content="6,99"> to be printed when I use the following code:
Document doc = Jsoup.connect(URL).get();
Elements meta = doc.select("meta[itemprop=price]");
System.out.println("meta: " + meta.text());
price = meta.attr("content");
However, I just get an empty string and no error. What am I doing wrong here?
For the ones interested I am trying to scrape the price of this page

Try this:
Document doc = Jsoup.connect(URL).get();
Element meta = doc.select("meta[itemprop=price]").first();
System.out.println("meta: " + meta.text());
String price = meta.attr("content");

The webserver you are trying to access needs another user agent string to respond with the info you want. Try this:
Document doc = Jsoup.connect(URL).userAgent("Mozilla/5.0").get();

extracting the value of a tag from xml in which xml message is coming as string

I have the below method ...
public void sendmessage( final String messageText)
{
}
The parameter messageText contains an XML message. From this message I need to extract the value of one XML tag and store it in an integer variable.
The tag in question is shown below:
<transferGroupId>206320940</transferGroupId>
I want to extract the value of this tag and store it in a variable. Please advise how to achieve this.
The complete XML message is shown below:
<?xml version="1.0" encoding="UTF-8"?>
<emml message="emml-transfer-lifecycle">
<messageHeader>
<businessDate>2016-01-09</businessDate>
<eventDateTime timeContextReference="London">2016-01-09T16:55:00.485
</eventDateTime>
<system id="ACSDE">
<systemId>ADS ABLO</systemId>
<systemClass>ADS</systemClass>
<systemRole>Reference</systemRole>
</system>
<timeContext id="ndon">
<location>ABLO</location>
</timeContext>
</messageHeader>
<transferEventHeader>
<transferGroupStatus>Settled</transferGroupStatus>
<transferGroupIdentifier>
<transferGroupId>206320940</transferGroupId>
<systemReference>Ghtr</systemReference>
<transferGroupClassificationScheme>Primary Identifier
</transferGroupClassificationScheme>
</transferGroupIdentifier>
</transferEventHeader>
</emml>
I have tried the following approach:
String tagname = "transferGroupId";
String t = getTagValue( messageText, tagname);
which in turn calls this method:
public static String getTagValue(String messageText, String tagname){
return messageText.split("<"+tagname+">")[1].split("</"+tagname+">")[0];
}
But this does not work in the end. Please advise how I can overcome this.
I have also tried JSoup, as was suggested, as shown below, but it fails with an error saying that the Parser class does not have any method named xmlParser:
Document doc = Jsoup.parse(messageText, "", Parser.xmlParser());
for (Element e : doc.select("transferGroupId")) {
System.out.println(e.text());
}

JSoup sounds like what you need. (It has xml parsing support)
In JSoup:
Document doc = Jsoup.parse(messageText, "", Parser.xmlParser());
for (Element e : doc.select("transferGroupId")) {
System.out.println(e.text());
}
This will print out the text of the transferGroupId, which is 206320940 in this case. You can do other things with this such as sending a message using your own methods and resources.
Hope this helps!
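Since the question reports that Parser.xmlParser() is not available in the JSoup version in use, and the value is wanted as an integer, an alternative is the DOM parser that ships with the JDK. The following is a minimal sketch, not a drop-in replacement for sendmessage(); the class and method names are hypothetical and the sample message is a shortened version of the one in the question.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

final class TransferGroupIdDemo {

    // Returns the text of the first element named tagName as an int.
    // Throws IllegalArgumentException if the XML is malformed, the element
    // is missing, or its text is not a valid integer.
    static int extractInt(String xml, String tagName) {
        Document doc;
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            doc = builder.parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse XML", e);
        }
        NodeList nodes = doc.getElementsByTagName(tagName);
        if (nodes.getLength() == 0) {
            throw new IllegalArgumentException("No <" + tagName + "> element found");
        }
        // trim() guards against line breaks around the value.
        return Integer.parseInt(nodes.item(0).getTextContent().trim());
    }

    public static void main(String[] args) {
        String messageText =
            "<emml message=\"emml-transfer-lifecycle\">"
            + "<transferEventHeader><transferGroupIdentifier>"
            + "<transferGroupId>206320940</transferGroupId>"
            + "</transferGroupIdentifier></transferEventHeader>"
            + "</emml>";
        int transferGroupId = extractInt(messageText, "transferGroupId");
        System.out.println(transferGroupId);
    }
}
```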

Parse CDATA from XML to enable editing Java

I am writing a simulator which communicates with a client's piece of software over a local socket. The communication language is XML. I have written some code which works - parsing the incoming XML string into Document via the DocumentBuilder interface.
I have run into a problem with CDATA, which I had never seen before. I need to access fields inside the CDATA section and change them. I load a 'template' XML document (used to reply to the messages) and copy values received in the first message into the response. Some of the fields that need to change are inside this CDATA section (see the example below).
public static String getOutputMessage(String input) throws Exception{
DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
Document inputDoc, outputDoc;
Element messageElement = (Element)inputDoc.getElementsByTagName("TRANS").item(0);
messageType = messageElement.getAttribute("name");
if (messageType.equals("processTransaction")){
outputDoc = db.parse(path+"processTransaction\\posPrintReceipt.xml");
outputDoc = changeContent(outputDoc, "PAN_NUMBER", transaction.getPan_number());
outputDoc = changeContent(outputDoc, "TOKEN", transaction.getToken());
outputDoc = changeContent(outputDoc, "TOTAL_AMOUNT", transaction.getTotal_amount());
outputDoc = changeContent(outputDoc, "TRANSACTION_TIME", transaction.getTransaction_time());
outputDoc = changeContent(outputDoc, "TRANSACTION_DATE", transaction.getTransaction_date());
}
}
private static Document changeContent(Document doc,String tag,String value) {
System.out.println("Changing: ["+tag+" : "+value+"]");
NodeList nodes=doc.getElementsByTagName(tag);
Node node = nodes.item(0);
Node parent=node.getParentNode();
node.setTextContent(value);
System.out.println(doc.getElementsByTagName(tag).item(0) + " " + node.getTextContent());
parent.replaceChild(node, doc.getElementsByTagName(tag).item(0));
return doc;
}
The functions above work on normal elements, but below is an example of the XML message I have to read, in which the values I need to change, such as PAN_NUMBER, TOKEN and TOTAL_AMOUNT, sit inside a CDATA section:
<RLSOLVE_MSG version="5.0">
<MESSAGE>
<SOURCE_ID>DP01</SOURCE_ID>
<TRANS_NUM>000001</TRANS_NUM>
</MESSAGE>
<POI_MSG type="interaction">
<INTERACTION name="posPrintReceipt">
<RECEIPT type="merchant" format="xml">
<![CDATA[<RECEIPT>
<AUTH_CODE>06130</AUTH_CODE>
<CARD_SCHEME>VISA</CARD_SCHEME>
<CURRENCY_CODE>GBP</CURRENCY_CODE>
<CUSTOMER_PRESENCE>internet</CUSTOMER_PRESENCE>
<FINAL_AMOUNT>1.00</FINAL_AMOUNT>
<MERCHANT_NUMBER>8888888</MERCHANT_NUMBER>
<PAN_NUMBER>454420******0382</PAN_NUMBER>
<PAN_EXPIRY>12/15</PAN_EXPIRY>
<TERMINAL_ID>04176421</TERMINAL_ID>
<TOKEN>454420bbbbbkqrm0382</TOKEN>
<TOTAL_AMOUNT>1.00</TOTAL_AMOUNT>
<TRANSACTION_DATA_SOURCE>keyed</TRANSACTION_DATA_SOURCE>
<TRANSACTION_DATE>14/02/2014</TRANSACTION_DATE>
<TRANSACTION_NUMBER>000001</TRANSACTION_NUMBER>
<TRANSACTION_RESPONSE>06130</TRANSACTION_RESPONSE>
<TRANSACTION_TIME>17:13:17</TRANSACTION_TIME>
<TRANSACTION_TYPE>purchase</TRANSACTION_TYPE>
<VERIFICATION_METHOD>unknown</VERIFICATION_METHOD>
<DUPLICATE>false</DUPLICATE>
</RECEIPT>]]>
</RECEIPT>
</INTERACTION>
</POI_MSG>
</RLSOLVE_MSG>

CDATA is an encoding mechanism to include arbitrary data within an XML file. Everything within CDATA is parsed as a single string when loading the XML into a Document instance. If you need to access the contents of the CDATA as a DOM document, you will need to instantiate a second Document object from the string contents, make your changes, then serialize that back to a string and put the string back into a CDATA in the original document.

I don't think a CDATA section is parsed like the regular elements in the XML; its purpose is purely to exempt its content from markup parsing. My suggestion would be to use regular elements to represent the data that is currently in the CDATA section. If you still want to use CDATA, I guess you will need to read the section as a string and then load that string into a separate Document.
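The round trip described in the answers above (read the CDATA text, parse it as a second document, edit it, serialize it, write it back as CDATA) can be sketched as follows. This is a minimal illustration assuming the JDK's default DOM and Transformer implementations; the class and method names are hypothetical and the sample message is a cut-down version of the one in the question.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

final class CdataEditDemo {

    // Serializes a document without the XML declaration.
    private static String serialize(Document doc) throws Exception {
        Transformer transformer = TransformerFactory.newInstance().newTransformer();
        transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
        StringWriter out = new StringWriter();
        transformer.transform(new DOMSource(doc), new StreamResult(out));
        return out.toString();
    }

    // Changes the text of innerTag inside the CDATA held by the first
    // holderTag element and returns the whole message as a string.
    static String replaceInCdata(String outerXml, String holderTag, String innerTag, String newValue) {
        try {
            DocumentBuilder db = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document outer = db.parse(new InputSource(new StringReader(outerXml)));
            Element holder = (Element) outer.getElementsByTagName(holderTag).item(0);

            // To the outer parser the CDATA content is just text.
            String innerXml = holder.getTextContent().trim();

            // Parse that text as a second document and edit it there.
            Document inner = db.parse(new InputSource(new StringReader(innerXml)));
            inner.getElementsByTagName(innerTag).item(0).setTextContent(newValue);

            // Replace the holder's children with a new CDATA section.
            while (holder.hasChildNodes()) {
                holder.removeChild(holder.getFirstChild());
            }
            holder.appendChild(outer.createCDATASection(serialize(inner)));
            return serialize(outer);
        } catch (Exception e) {
            throw new IllegalStateException("Could not edit the CDATA content", e);
        }
    }

    public static void main(String[] args) {
        String message =
            "<RLSOLVE_MSG version=\"5.0\"><RECEIPT type=\"merchant\" format=\"xml\">"
            + "<![CDATA[<RECEIPT><TOTAL_AMOUNT>1.00</TOTAL_AMOUNT></RECEIPT>]]>"
            + "</RECEIPT></RLSOLVE_MSG>";
        System.out.println(replaceInCdata(message, "RECEIPT", "TOTAL_AMOUNT", "2.50"));
    }
}
```

Looking up RECEIPT in the outer document finds only the outer element, because the inner RECEIPT exists solely as text inside the CDATA section until it is parsed separately.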
