This question already has answers here:
Closed 11 years ago.
Possible Duplicate:
Best way to parse an XML String in Java?
I have a String value which actually contains XML data. I need to parse that XML and get individual values out of it. How can I do this?
There are lots of different XML APIs in Java - some built into the framework, some not.
One of the simpler ones to use is JDOM:
import java.io.StringReader;
import org.jdom2.Document;        // org.jdom in JDOM 1.x
import org.jdom2.input.SAXBuilder;

SAXBuilder builder = new SAXBuilder();
Document doc = builder.build(new StringReader(text));

// Now examine the document, perhaps by XPath or by navigating
// programmatically, like this:
String fooContents = doc.getRootElement()
                        .getChild("foo")
                        .getText();
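If you'd rather stay on the JDK's built-in APIs, the same lookup can be sketched with DOM plus javax.xml.xpath. The <root><foo> structure and the class/method names below are assumed examples, mirroring the JDOM snippet:

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

public class XPathDemo {
    // Parses the XML string and returns the text of /root/foo.
    static String extractFoo(String text) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(text)));
        // XPath.evaluate with a String expression returns the node's text
        return XPathFactory.newInstance().newXPath()
                .evaluate("/root/foo", doc);
    }

    public static void main(String[] args) throws Exception {
        System.out.println(extractFoo("<root><foo>bar</foo></root>")); // bar
    }
}
```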
Another option is XStream, for example. It is very simple, especially if you want to map the XML directly onto Java objects.
This question already has an answer here:
Parse special characters in xml stax file
(1 answer)
Closed last month.
I have an XML file which I need to parse using XMLInputFactory (javax.xml.stream).
The XML is of this type:
<SACL>
  <Criteria>Dinner</Criteria>
  <Value> Rice &amp; Beverage </Value>
</SACL>
I am parsing this with an XML event reader in Java and my code is:
if (xmlEvent.asStartElement().getName().getLocalPart().equals("Value")) {
    xmlEvent = xmlEventReader.nextEvent();
    value = xmlEvent.asCharacters().getData().trim(); // issue is inside this if block only
}
// the reader is created like this (using javax.xml.stream.XMLEventReader):
xmlEventReader = XMLInputFactory.newInstance()
        .createXMLEventReader(new FileInputStream(file.getPath()));
But it only extracts "Rice" - the "& Beverage" part is missing.
Expected output: Rice & Beverage
Can someone explain what the issue with "&amp;" is and how it can be fixed?
I've worked on a project that did XML parsing recently, so I know almost exactly what's happening here: the parser reports &amp; as a separate event (XMLStreamConstants.ENTITY_REFERENCE).
Try setting the property XMLInputFactory.IS_REPLACING_ENTITY_REFERENCES to true in your XML parser's options. If the parser is properly implemented, the entity is replaced and made part of the text.
Keep in mind that the parser is still allowed to split the text into multiple character events, especially for large pieces of text. Setting the property XMLInputFactory.IS_COALESCING to true should prevent that.
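A minimal, self-contained sketch of that fix. The XML sample is adapted from the question; the class and method names are my own:

```java
import java.io.StringReader;
import javax.xml.stream.XMLEventReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.events.XMLEvent;

public class EntityDemo {
    // Returns the text content of the first <Value> element.
    static String readValue(String xml) throws Exception {
        XMLInputFactory factory = XMLInputFactory.newInstance();
        // Replace entity references such as &amp; with their characters...
        factory.setProperty(XMLInputFactory.IS_REPLACING_ENTITY_REFERENCES, true);
        // ...and merge adjacent character events into a single event.
        factory.setProperty(XMLInputFactory.IS_COALESCING, true);
        XMLEventReader reader = factory.createXMLEventReader(new StringReader(xml));
        while (reader.hasNext()) {
            XMLEvent event = reader.nextEvent();
            if (event.isStartElement()
                    && event.asStartElement().getName().getLocalPart().equals("Value")) {
                return reader.nextEvent().asCharacters().getData().trim();
            }
        }
        return null;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<SACL><Criteria>Dinner</Criteria>"
                + "<Value> Rice &amp; Beverage </Value></SACL>";
        System.out.println(readValue(xml)); // Rice & Beverage
    }
}
```

Note the isStartElement() check before calling asStartElement() - calling asStartElement() on a non-start-element event, as the original snippet does, will also fail.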
This question already has an answer here:
java tika how to convert html to plain text retaining specific element
(1 answer)
Closed 1 year ago.
I'm new to Tika and struggling to understand it.
What I want to achieve is extracting the href attributes of the links on an HTML page (which can be any webpage).
As a first attempt, I just tried to extract the links as such (or even just the first one) using XPath. But I can never get it right and the handler is always empty.
(In this example, I've removed the xhtml: namespace bits because otherwise I got a SAX error.)
The code example is below. Thanks so much for any help :)
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.html.HtmlParser;
import org.apache.tika.sax.ToXMLContentHandler;
import org.apache.tika.sax.XHTMLContentHandler;
import org.apache.tika.sax.xpath.MatchingContentHandler;
import org.apache.tika.sax.xpath.XPathParser;
import org.xml.sax.ContentHandler;

XPathParser xhtmlParser = new XPathParser("xhtml", XHTMLContentHandler.XHTML);
org.apache.tika.sax.xpath.Matcher anchorLinkContentMatcher = xhtmlParser.parse("//body//a");
ContentHandler handler = new MatchingContentHandler(
        new ToXMLContentHandler(), anchorLinkContentMatcher);
HtmlParser parser = new HtmlParser();
Metadata metadata = new Metadata();
ParseContext pcontext = new ParseContext();
try {
    parser.parse(urlContentStream, handler, metadata, pcontext);
    System.out.println(handler);
} catch (Exception e) {
    ....
}
I found an answer - at least enough to get something working (not a final version yet, but I do get something from the handler).
The answer is at java tika how to convert html to plain text retaining specific element
This question already has answers here:
How to parse a String containing XML in Java and retrieve the value of the root node?
(6 answers)
Closed 9 years ago.
Hello, I am getting back a string from a web service.
I need to parse this string and get the text in the error message.
My string looks like this:
<response>
  <returnCode>-2</returnCode>
  <error>
    <errorCode>100</errorCode>
    <errorMessage>ERROR HERE!!!</errorMessage>
  </error>
</response>
Is it better to just parse the string directly, or to convert it into an XML document first and then parse that?
I'd use Java's built-in XML document libraries. It's a bit of a mess, but it works.
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

String xml = "<response>\n" +
        "<returnCode>-2</returnCode>\n" +
        "<error>\n" +
        "<errorCode>100</errorCode>\n" +
        "<errorMessage>ERROR HERE!!!</errorMessage>\n" +
        "</error>\n" +
        "</response>";

Document doc = DocumentBuilderFactory.newInstance()
        .newDocumentBuilder()
        .parse(new InputSource(new StringReader(xml)));

NodeList errNodes = doc.getElementsByTagName("error");
if (errNodes.getLength() > 0) {
    Element err = (Element) errNodes.item(0);
    System.out.println(err.getElementsByTagName("errorMessage")
            .item(0)
            .getTextContent());
} else {
    // success
}
I would probably use an XML parser to read it into a DOM document, then get the text. This has the advantage of being robust and coping with unusual situations, such as a line where something has been commented out:
<!-- commented out <errorMessage>ERROR HERE!!!</errorMessage> -->
If you try to parse it yourself, you might fall foul of things like this. It also has the advantage that if the requirements expand, it's really easy to change your code.
http://docs.oracle.com/cd/B28359_01/appdev.111/b28394/adx_j_parser.htm
It's an XML document. Use an XML parser.
You could tease it apart using string operations. But you have to worry about entity decoding, character encodings, CDATA sections etc. An XML parser will do all of this for you.
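To illustrate that point, here is a short sketch (the sample XML and names are invented for the demo): the JDK's DOM parser transparently decodes entities and CDATA sections and skips comments, so getTextContent() returns clean text.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

public class DomTextDemo {
    // Returns the text of the first <errorMessage> element in the document.
    static String messageText(String xml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        return doc.getElementsByTagName("errorMessage").item(0).getTextContent();
    }

    public static void main(String[] args) throws Exception {
        // The commented-out element is ignored; the entity and the CDATA
        // section are both decoded into plain characters.
        String xml = "<response>"
                + "<!-- commented out <errorMessage>old</errorMessage> -->"
                + "<errorMessage>A &amp; B<![CDATA[ & C]]></errorMessage>"
                + "</response>";
        System.out.println(messageText(xml)); // A & B & C
    }
}
```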
Check out JDOM for a simpler XML parsing approach than using raw DOM/SAX implementations.
I am working on an Android application that parses one or more XML feeds based on user preferences. Is it possible to parse (using SAX Parser) more than one XML feed at once by providing the parser with an array of URLs of my XML feeds?
If not, what would be an alternative way of listing the parsed items from different XML feeds in one list? An intuitive approach is to use java.io.SequenceInputStream to merge the two input streams. However, this throws a NullPointerException:
try {
    URL urlOne = new URL("http://example.com/feedone.xml");
    URL urlTwo = new URL("http://example.com/feedtwo.xml");
    InputStream streamOne = urlOne.openStream();
    InputStream streamTwo = urlTwo.openStream();
    InputStream streamBoth = new SequenceInputStream(streamOne, streamTwo);
    InputSource sourceBoth = new InputSource(streamBoth);
    // Parsing
    stream = xmlHandler.getStream();
} catch (Exception error) {
    error.printStackTrace();
}
List<Item> content = stream.getList();
return content;
The tactic of appending the streams before parsing is not likely to work well, as the appended XML will not be valid XML. As each XML input has its own root element, the appended XML will have multiple roots, which is not permitted in XML. Additionally it's likely to have multiple XML headers like
<?xml version="1.0" encoding="UTF-8"?>
which is also invalid.
While it's possible to preprocess the input to work around these issues, you're likely better off parsing them separately and dealing with getting the results combined later.
It's possible to make a SAX parser add the parsed elements to an existing list of elements. If you post code in your question showing how you're parsing a single file, we might be able to help figure out how to adjust it to your need for multiple inputs.
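A sketch of that approach, with invented placeholder names and a made-up feed shape (here each feed contributes its <title> text): the handler appends into one shared list, and each source is parsed separately, so every document stays well-formed on its own.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

public class MultiFeedDemo {

    // SAX handler that collects <title> text into a list supplied by the caller.
    static class FeedHandler extends DefaultHandler {
        private final List<String> items;
        private final StringBuilder text = new StringBuilder();
        private boolean inTitle;

        FeedHandler(List<String> items) { this.items = items; }

        @Override
        public void startElement(String uri, String local, String qName, Attributes a) {
            if (qName.equals("title")) { inTitle = true; text.setLength(0); }
        }

        @Override
        public void characters(char[] ch, int start, int len) {
            if (inTitle) text.append(ch, start, len);
        }

        @Override
        public void endElement(String uri, String local, String qName) {
            if (qName.equals("title")) { inTitle = false; items.add(text.toString()); }
        }
    }

    // Parses each feed separately, appending results to one combined list.
    static List<String> parseAll(List<String> feeds) throws Exception {
        List<String> combined = new ArrayList<>();
        SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
        for (String feed : feeds) {
            parser.parse(new InputSource(new StringReader(feed)), new FeedHandler(combined));
        }
        return combined;
    }

    public static void main(String[] args) throws Exception {
        String feedOne = "<rss><item><title>First</title></item></rss>";
        String feedTwo = "<rss><item><title>Second</title></item></rss>";
        System.out.println(parseAll(List.of(feedOne, feedTwo))); // [First, Second]
    }
}
```

In your case you would open each URL's stream and pass it to parser.parse() in the same loop, reusing the shared list.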
This question already has answers here:
Remove HTML tags from a String
(35 answers)
Closed 1 year ago.
The community reviewed whether to reopen this question 8 months ago and left it closed:
Not suitable for this site
Can you recommend an open source Java library (preferably ASL/BSD/LGPL licensed) that converts HTML to plain text - one that cleans all the tags, converts entities (&amp;, &nbsp;, etc.) and handles <br> and tables properly?
More Info
I have the HTML as a string; there's no need to fetch it from the web. Also, what I'm looking for is a method like this:
String convertHtmlToPlainText(String html)
Try Jericho.
The TextExtractor class sounds like it will do what you want - scroll down the homepage a bit and there's a link to it.
HtmlUnit, it even shows the page after processing JavaScript / Ajax.
The bliki engine can do this, in two steps. See info.bliki.wiki / Home:
How to convert HTML to Mediawiki text -- Mediawiki text is already a rather plain text format, but you can convert it further
How to convert Mediawiki text to plain text -- your goal.
It will be some 7-8 lines of code, like this:
// html to wiki
import info.bliki.html.HTML2WikiConverter;
import info.bliki.html.wikipedia.ToWikipedia;
// wiki to plain text
import info.bliki.wiki.filter.PlainTextConverter;
import info.bliki.wiki.model.WikiModel;
...
String sbodyhtml = readFile(infilepath); // get content as string (readFile is your own helper)
HTML2WikiConverter conv = new HTML2WikiConverter();
conv.setInputHTML(sbodyhtml);
String resultwiki = conv.toWiki(new ToWikipedia());

WikiModel wikiModel = new WikiModel("${image}", "${title}");
String plainStr = wikiModel.render(new PlainTextConverter(false), resultwiki);
System.out.println(plainStr);
Jsoup can do this simpler:
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
...
Document doc = Jsoup.parse(sbodyhtml);
String plainStr = doc.body().text();
but in the result you lose all paragraph formatting -- there will be no newlines at all.
I use TagSoup, it is available for several languages and does a really good job with HTML found "in the wild". It produces either a cleaned up version of the HTML or XML, that you can then process with some DOM/SAX parser.
I've used Apache Commons Lang to go the other way. But it looks like it can do what you need via StringEscapeUtils.