Parsing XML with apostrophe

Taking the BBC News RSS feed for example, one of their news items is as follows:
<item><title>Pupils 'bullied on sports field'</title><description>bla bla..
I have some Java code that parses this. However, when a title contains an apostrophe (as above), the parsing stops, so I end up with the title Pupils ' and then parsing continues with the description (which is fine). How do I get it to parse the full title? The following is a segment of code from inside my for loop where I parse the info:
NodeList title = element.getElementsByTagName("title");
Element line = (Element) title.item(0);
tmp.setTitle(getCharacterDataFromElement(line).toString());
The exact same code is used to parse the other elements, such as description and pubDate, which are all fine.
This is the getCharacterDataFromElement method:
public static String getCharacterDataFromElement(Element e) {
    Node child = ((Node) e).getFirstChild();
    if (child instanceof CharacterData) {
        CharacterData cd = (CharacterData) child;
        return cd.getData();
    }
    return "";
}
What am I doing wrong? I use the DocumentBuilder, DocumentBuilderFactory and org.w3c.dom to work with the RSS Feed.

Your getCharacterDataFromElement only looks at the first child. Check whether there are further child nodes too, and concatenate all of their text.
HTH - DF

As davidfrancis suggested, you should iterate over all children in getCharacterDataFromElement().
Alternatively, if you can use DOM Level 3, you can call Node.getTextContent() instead, which returns the concatenated text of the node and its descendants.
NodeList title = element.getElementsByTagName("title");
Element line = (Element)title.item(0);
tmp.setTitle(line.getTextContent());
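Both suggestions can be sketched roughly as follows. The class and helper names are made up for illustration, and the title is split into two text nodes by hand to imitate the result described in the question, rather than being produced by a real feed parser.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.Text;

class TitleTextDemo {

    // Joins the data of every text or CDATA child, not just the first one.
    static String allCharacterData(Element e) {
        StringBuilder sb = new StringBuilder();
        for (Node child = e.getFirstChild(); child != null; child = child.getNextSibling()) {
            if (child instanceof Text) {
                sb.append(((Text) child).getData());
            }
        }
        return sb.toString();
    }

    // Hypothetical helper: builds a <title> whose text is spread over two text nodes.
    static Element splitTitle() {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
            Element title = doc.createElement("title");
            title.appendChild(doc.createTextNode("Pupils '"));
            title.appendChild(doc.createTextNode("bullied on sports field'"));
            return title;
        } catch (ParserConfigurationException ex) {
            throw new IllegalStateException(ex);
        }
    }

    public static void main(String[] args) {
        Element title = splitTitle();
        System.out.println(title.getFirstChild().getNodeValue()); // first text node only
        System.out.println(allCharacterData(title));              // all text children joined
        System.out.println(title.getTextContent());               // DOM Level 3 shortcut
    }
}
```

Checking for Text rather than CharacterData also picks up CDATA sections while skipping comments, which is usually what is wanted for a title.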

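AFAIK, the apostrophe has a predefined entity in XML (&apos;), but it only has to be escaped inside an attribute value that is itself delimited by apostrophes. In element content a literal ' is legal.
This means the BBC News RSS feed is well-formed in this respect, and there is nothing to report to the feed provider.
The truncated title is more likely caused by the parsing code reading only the first text node of the element.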

Related

Java 9, INVALID_CHARACTER_ERR when trying to add URL as element in XML

I'm working with XML for the first time, trying to generate XML to send over to a client, and I'm having a hell of a time doing it. Whenever I try to pass a URL, I get an INVALID_CHARACTER_ERR, and nothing I've tried so far works.
I tried using replacements like &#123; and so on for the curly braces, and tried escaping everything that wasn't a letter, resulting in the abomination under my code. It seems to throw the error if the name contains any character that isn't a letter. Another thing I noticed is that the document's InputEncoding is null, but that seems to be because I'm creating it in code. Does that mean it actually doesn't have an encoding type? I haven't been able to find an easy way to set it either.
DocumentBuilder dBuilder = dbFactory.newDocumentBuilder();
Document orders = dBuilder.newDocument();
Element order = orders.createElement("{https://secure.targeturl.com/foo/bar}tagpayload");
Element tOrder = orders.createElement("tagorder");
order.appendChild(tOrder);
Element header = orders.createElement("orderheader");
tOrder.appendChild(header);
Element billto = orders.createElement("billto");
header.appendChild(billto);
"&#123;https&#58;&#47;&#47;secure&#46;targeturl&#46;com/foo&#47;bar&#125;tagpayload"

This is not the correct way to create a namespaced element:
Element order = orders.createElement("{https://secure.targeturl.com/foo/bar}tagpayload");
Instead, use the createElementNS method:
Element order = orders.createElementNS("https://secure.targeturl.com/foo/bar", "tagpayload");
You are seeing an exception because { is not a legal character in an XML element name. createElement has no awareness of namespaces or the “{uri}name” namespace notation.
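A small sketch of the difference, assuming only the JDK's built-in DOM implementation. The class and method names are illustrative; the namespace URI and element name are the ones from the question.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.DOMException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

class NamespaceElementDemo {
    static final String NS = "https://secure.targeturl.com/foo/bar";

    static Document newDocument() {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            return factory.newDocumentBuilder().newDocument();
        } catch (ParserConfigurationException e) {
            throw new IllegalStateException(e);
        }
    }

    // Creates the root element in the namespace and attaches it to the document.
    static Element createOrder(Document doc) {
        Element order = doc.createElementNS(NS, "tagpayload");
        doc.appendChild(order);
        return order;
    }

    // Returns true if the "{uri}name" form is rejected with INVALID_CHARACTER_ERR.
    static boolean braceNameRejected(Document doc) {
        try {
            doc.createElement("{" + NS + "}tagpayload");
            return false;
        } catch (DOMException e) {
            return e.code == DOMException.INVALID_CHARACTER_ERR;
        }
    }

    public static void main(String[] args) {
        Document doc = newDocument();
        Element order = createOrder(doc);
        System.out.println(order.getNamespaceURI() + " / " + order.getLocalName());
        System.out.println("brace form rejected: " + braceNameRejected(doc));
    }
}
```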

Jsoup - Convert html texts into a list of Strings

Using Jsoup, I want to add the text in each HTML tag to a List<String>, in order.
This is fairly easy using BeautifulSoup4 in Python, but I'm having a hard time in Java.
BeautifulSoup Code:
from bs4 import BeautifulSoup
from bs4.element import Comment
import urllib.request
def tag_visible(element):
    if element.parent.name in ['style', 'script', 'head', 'title', 'meta', '[document]']:
        return False
    if isinstance(element, Comment):
        return False
    return True
def text_from_html(body):
    soup = BeautifulSoup(body, 'html.parser')
    texts = soup.findAll(text=True)
    visible_texts = filter(tag_visible, texts)
    text_list = []
    for t in visible_texts:
        text_list.append(t.strip())
    return list(filter(None, text_list))
html = urllib.request.urlopen('https://someURL.com/something').read()
print(text_from_html(html))
This code will print ["text1", "text2", "text3",...]
My initial attempt was to follow the Jsoup documentation for text conversion.
Jsoup Code Attempt-1:
Document doc = Jsoup.connect("https://someURL.com/something")
        .userAgent("Bot")
        .get();
Elements divElements = doc.select("*");
List<String> texts = divElements.eachText();
System.out.println(texts);
What ends up happening is that the text is duplicated: ["text1 text2 text3","text2 text3", "text3",...]
My assumption is that Jsoup goes through each Element and returns all the text within that Element, including the text in each child node. Then it moves on to the child node and returns the remaining text, and so on.
I have seen many people specify tags or attributes via cssQuery to get around this problem, but my project needs this to work for any scrapeable website.
Any suggestion is appreciated.

Your assumption is right, but BeautifulSoup would probably do the same. Only the text=True in findAll(text=True) limits the result to pure text nodes. For the equivalent in Jsoup, use the following selector:
Elements divElements = doc.select(":matchText");

How to extract xml tag value without using the tag name in java?

I am using Java. I have an XML file which looks like this:
<?xml version="1.0"?>
<personaldetails>
  <phno>1553294232</phno>
  <email>
    <official>xya#gmail.com</official>
    <personal>bk#yahoo.com</personal>
  </email>
</personaldetails>
Now I need to check each tag value against specific conditions to determine its type, and put the values in separate files.
For example, for the file above I write conditions like: 10 digits is a phone number,
and something in the format xxx#yy.com is an email.
So I need to extract the value of each tag; if it matches a condition, it goes in the first text file, and if not, in the second text file.
In that case, the first text file will contain:
1553294232
xya#gmail.com
bk#yahoo.com
and the rest of the values go in the second file.
I just don't know how to extract the tag values without using the tag name (that is, without using getElementsByTagName).
I mean, this code should extract the email bk#yahoo.com even if I use a <mailing> tag instead of <personal>. It should not depend on the tag name.
I hope this isn't confusing. I am new to using XML in Java, so pardon me if my question is silly.
Please help.

This seems like a typical use case for XPath.
XPath allows you to query XML in a very flexible way.
This tutorial could help:
http://www.javabeat.net/2009/03/how-to-query-xml-using-xpath/
If you're using JavaScript, which could be the case since you mention getElementsByTagName(), you could just use jQuery selectors. They give you consistent behavior across browsers, and the jQuery library is useful for a lot of other things if you are not using it already: http://api.jquery.com/category/selectors/
Here, for example, is information on this:
http://www.switchonthecode.com/tutorials/xml-parsing-with-jquery
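For the Java case, the XPath idea might look roughly like this with the JDK's javax.xml.xpath API. The expression //*[not(*)] selects every element that has no element children, so no tag name is needed. The class name, method name, and SAMPLE constant are made up for illustration; the XML is the sample from the question.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class XPathValuesDemo {
    static final String SAMPLE =
        "<?xml version=\"1.0\"?>"
        + "<personaldetails>"
        + "<phno>1553294232</phno>"
        + "<email>"
        + "<official>xya#gmail.com</official>"
        + "<personal>bk#yahoo.com</personal>"
        + "</email>"
        + "</personaldetails>";

    // Returns the trimmed text of every leaf element, in document order.
    static List<String> leafValues(String xml) {
        try {
            NodeList leaves = (NodeList) XPathFactory.newInstance().newXPath().evaluate(
                "//*[not(*)]", new InputSource(new StringReader(xml)), XPathConstants.NODESET);
            List<String> values = new ArrayList<>();
            for (int i = 0; i < leaves.getLength(); i++) {
                values.add(leaves.item(i).getTextContent().trim());
            }
            return values;
        } catch (XPathExpressionException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(leafValues(SAMPLE));
    }
}
```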

Since you don't know your element names, I would suggest creating a DOM tree and iterating through it. Each time you reach a text value, match it against your ruleset (I would suggest using regex for this) and then write it to the appropriate file.
This sample structure should help you get started, but you will need to modify it based on your requirements:
public void parseXML(){
    try{
        DocumentBuilder documentBuilder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        Document doc;
        doc = documentBuilder.parse(new File("test.xml"));
        getData(null, doc.getDocumentElement());
    }catch(Exception exe){
        exe.printStackTrace();
    }
}
private void getData(Node parentNode, Node node){
    switch(node.getNodeType()){
        case Node.ELEMENT_NODE:{
            if(node.hasChildNodes()){
                NodeList list = node.getChildNodes();
                int size = list.getLength();
                for(int index = 0; index < size; index++){
                    getData(node, list.item(index));
                }
            }
            break;
        }
        case Node.TEXT_NODE:{
            String data = node.getNodeValue();
            if(data.trim().length() > 0){
                /*
                 * Here you need to check the data against your ruleset and perform your operation
                 */
                System.out.println(parentNode.getNodeName()+" :: "+node.getNodeValue());
            }
            break;
        }
    }
}
You might want to look at the Chain of Responsibility design pattern to design your ruleset.
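The ruleset check left open in the comment above could be sketched with java.util.regex along these lines. The two patterns are assumptions derived from the conditions stated in the question (10 digits, and xxx#yy.com with the '#' used in the sample data), and the two lists stand in for the two output files.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

class ValueRules {
    // Assumed rules from the question: exactly 10 digits is a phone number,
    // something#something.com is an email (the sample data uses '#').
    static final Pattern PHONE = Pattern.compile("\\d{10}");
    static final Pattern EMAIL = Pattern.compile("[^\\s#]+#[^\\s#]+\\.com");

    static boolean matchesAnyRule(String value) {
        return PHONE.matcher(value).matches() || EMAIL.matcher(value).matches();
    }

    // Puts each value into 'matched' or 'rest'; the lists stand in for the two files.
    static void sort(List<String> values, List<String> matched, List<String> rest) {
        for (String value : values) {
            String trimmed = value.trim();
            if (matchesAnyRule(trimmed)) {
                matched.add(trimmed);
            } else {
                rest.add(trimmed);
            }
        }
    }

    public static void main(String[] args) {
        List<String> matched = new ArrayList<>();
        List<String> rest = new ArrayList<>();
        List<String> input = new ArrayList<>();
        input.add("1553294232");
        input.add("xya#gmail.com");
        input.add("some other value");
        sort(input, matched, rest);
        System.out.println("first file:  " + matched);
        System.out.println("second file: " + rest);
    }
}
```

Calling sort from the TEXT_NODE branch (or collecting the values first and sorting them afterwards) keeps the traversal independent of the rules, which makes it easier to add further patterns later.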

java: unescaped quotes terminate xml text node value

I'm writing an Android app in Java. The app emulates flashcards, with questions on one side and answers on the other.
I am presently slurping a well-formed (as I believe) .xml document (which is produced by a Qt-based program which has no problem reading the output back in) using the following (fairly standard) code:
DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
try
{
    DocumentBuilder builder = factory.newDocumentBuilder();
    Document dom = builder.parse(new File(diskLocation));
    Element pack = dom.getDocumentElement();
    NodeList flashCards = pack.getElementsByTagName("flashcard");
    for (int i=0; i < flashCards.getLength(); i++)
    {
        FlashCard flashCard = new FlashCard();
        Node cardNode = flashCards.item(i);
        NodeList cardProperties = cardNode.getChildNodes();
        for (int j=0;j<cardProperties.getLength();j++)
        {
            Node cardProperty = cardProperties.item(j);
            String propertyName = cardProperty.getNodeName();
            if (propertyName.equalsIgnoreCase("Question"))
            {
                flashCard.setQuestion(cardProperty.getFirstChild().getNodeValue());
            }
            else if (propertyName.equalsIgnoreCase("Answer"))
            {
                flashCard.setAnswer(cardProperty.getFirstChild().getNodeValue());
            }
            else if
            ...etc.
Here is a flashcard for learning xml:
<flashcard>
  <Question>What is the entity reference for ' " '?</Question>
  <Answer>&quot;</Answer>
  <Info></Info>
  <Hint></Hint>
  <KnownLevel>1</KnownLevel>
  <LastCorrect>1</LastCorrect>
  <CurrentStreak>4</CurrentStreak>
  <LevelUp>4</LevelUp>
  <AnswerTime>0</AnswerTime>
</flashcard>
As I understand the standard, '<' and '&' need to be escaped ('>' probably should be), but quotes and apostrophes don't (unless they're in attributes). Yet when the question and answer for this card are parsed, they come out as What is the entity reference for ' and & respectively.
The input seems to follow the standard. Is the Java XML DOM implementation really not standards-compliant, or am I missing something?
I find it very difficult to believe I'm the only one to have (had) this problem, yet I've searched both Google and Stack Overflow and found surprisingly little of direct relevance.
Thank you for any help!
Rob
Edit: I've just realised the file has a !DOCTYPE but doesn't start with an <?xml declaration.
I wonder if this makes any difference.

From the standard:
In the content of elements, character data is any string of characters which does not contain the start-delimiter of any markup
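The only start-delimiters of markup are < and &, so those are the two characters that must be escaped in the content of elements. ' and " may appear there literally and only need escaping inside an attribute value delimited by the same quote character. The input in the question is therefore fine in this respect. The truncated values are more likely the result of reading only getFirstChild(), since a DOM implementation may split the text of one element across several text nodes.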
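As a quick check, the following sketch parses a shortened version of the flashcard with the JDK's own parser and reads the values with getTextContent() instead of getFirstChild().getNodeValue(). The class and method names are illustrative, and behaviour on Android's parser is not verified here.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

class QuoteContentDemo {
    static final String CARD =
        "<flashcard>"
        + "<Question>What is the entity reference for ' \" '?</Question>"
        + "<Answer>&quot;</Answer>"
        + "</flashcard>";

    // Returns the full text of the first element with the given tag name.
    static String textOf(String xml, String tag) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
            return doc.getElementsByTagName(tag).item(0).getTextContent();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(textOf(CARD, "Question"));
        System.out.println(textOf(CARD, "Answer"));
    }
}
```

Because getTextContent() concatenates all descendant text, it returns the whole value even when a parser delivers it as several adjacent text nodes.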

Parsing XML Textlist

I'm trying to parse an XML file. I'm able to parse a normal text node, but how do I parse a textlist? Sadly, I only get the firstChild of the textlist. If I try to call
elem.nextSibling();
it is always null, which can't be right, because I know there are two other values left.
Can someone provide an example, maybe?
Thanks!
XML example
<viewentry position="1" unid="7125D090682C3C3EC1257671002F66F4" noteid="962" siblings="65">
  <entrydata columnnumber="0" name="Categories">
    <textlist>
      <text>Lore1</text>
      <text>Lore2</text>
    </textlist>
  </entrydata>
  <entrydata columnnumber="1" name="CuttedSubjects">
    <text>
      LoreImpsum....
    </text>
  </entrydata>
  <entrydata columnnumber="2" name="$35">
    <datetime>20091117T094224,57+01</datetime>
  </entrydata>
</viewentry>

I assume you're using a DOM parser.
The first child of the <textlist> node is not the first <text> node but rather the raw text that contains the whitespace and line break between the <textlist> start tag and the first <text> element. The output of the following snippet (using org.w3c.dom.* and javax.xml.parsers.*)
Node grandpa = document.getElementsByTagName("textlist").item(0);
Node daddy = grandpa.getFirstChild();
while (daddy != null) {
    System.out.println(">>> " + daddy.getNodeName());
    Node child = daddy.getFirstChild();
    if (child != null)
        System.out.println(">>>>>>>> " + child.getTextContent());
    daddy = daddy.getNextSibling();
}
shows that <textlist> has five children: the two <text> elements and the three raw text pieces before, between and after them.
>>> #text
>>> text
>>>>>>>> Lore1
>>> #text
>>> text
>>>>>>>> Lore2
>>> #text
When parsing XML this way, it's easy to overlook how complicated the structure of the DOM tree can be. You can quickly end up iterating over a NodeList in the wrong generation, and then you get nulls where you would expect siblings. This is one of the reasons why people came up with all kinds of XML-to-Java tools, from homegrown XMLHelper classes to XPath expressions to Digester to JAXB, so that you need to go down to the DOM level only when you absolutely have to.
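If you do stay at the DOM level, one way to get just the list values is to skip the whitespace text nodes and keep only the element children. This is a sketch under that assumption; the class name, method name, and SAMPLE constant are illustrative, and the XML is a cut-down version of the one in the question.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

class TextListDemo {
    static final String SAMPLE =
        "<entrydata columnnumber=\"0\" name=\"Categories\">\n"
        + "  <textlist>\n"
        + "    <text>Lore1</text>\n"
        + "    <text>Lore2</text>\n"
        + "  </textlist>\n"
        + "</entrydata>";

    // Collects the content of each <text> element directly under the first <textlist>.
    static List<String> textListValues(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
            Node textlist = doc.getElementsByTagName("textlist").item(0);
            List<String> values = new ArrayList<>();
            for (Node child = textlist.getFirstChild(); child != null; child = child.getNextSibling()) {
                // Whitespace between the tags shows up as TEXT_NODE children; ignore those.
                if (child.getNodeType() == Node.ELEMENT_NODE && "text".equals(child.getNodeName())) {
                    values.add(child.getTextContent().trim());
                }
            }
            return values;
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(textListValues(SAMPLE));
    }
}
```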
