Parsing using HTMLParser - java

Parser parser = new Parser();
parser.setInputHTML("d:/index.html");
parser.setEncoding("UTF-8");
NodeList nl = parser.parse(null);
/*
SimpleNodeIterator sNI=list.elements();
while(sNI.hasMoreNodes()){
System.out.println(sNI.nextNode().getText());}
*/
NodeList trs = nl.extractAllNodesThatMatch(new TagNameFilter("tr"),true);
for(int i=0;i<trs.size();i++) {
    NodeList nodes = trs.elementAt(i).getChildren();
    NodeList tds = nodes.extractAllNodesThatMatch(new TagNameFilter("td"),true);
    System.out.println(tds.toString());
}
I am not getting any output; Eclipse only shows javaw.exe as terminated.

Pass the path to the resource into the constructor.
Parser parser = new Parser("index.html");
Parse and print all the divs on this page:
Parser parser = new Parser("http://stackoverflow.com/questions/7293729/parsing-using-htmlparser/");
parser.setEncoding("UTF-8");
NodeList nl = parser.parse(null);
NodeList div = nl.extractAllNodesThatMatch(new TagNameFilter("div"),true);
System.out.println(div.toString());
parser.setInputHTML(String inputHTML) doesn't do what you think it does. It treats its argument as the HTML markup to parse, not as a path. To point the parser at an HTML resource (a file or URL), pass the location to the constructor instead.
Example:
Parser parser = new Parser();
parser.setInputHTML("<div>Foo</div><div>Bar</div>");

Related

Storing xml data in Java object using jaxb

<?xml version="1.0" encoding="UTF-8"?>
<filepaths>
<application_information_ticker>
<desc>Ticker1</desc>
<folder_path>../atlas/info/</folder_path>
</application_information_ticker>
<document_management_system>
<desc></desc>
<folder_path>../atlas/dms/</folder_path>
</document_management_system>
</filepaths>
I have an XML file like this. I need to convert it into a Java object using JAXB, but because of the nested tags I couldn't get it to work. Please suggest a solution.
DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
DocumentBuilder builder = factory.newDocumentBuilder();
InputSource is = new InputSource( new StringReader( xmlString) );
Document doc = builder.parse( is );
// A separate variable name is needed here: "factory" is already declared above.
XPathFactory xpathFactory = XPathFactory.newInstance();
XPath xpath = xpathFactory.newXPath();
xpath.setNamespaceContext(new PersonalNamespaceContext());
XPathExpression expr = xpath.compile("//src_small/text()");
Object result = expr.evaluate(doc, XPathConstants.NODESET);
NodeList nodes = (NodeList) result;
List<String> list = new ArrayList<String>();
for (int i = 0; i < nodes.getLength(); i++) {
    list.add (nodes.item(i).getNodeValue());
    System.out.println(nodes.item(i).getNodeValue());
}

Efficiently unmarshaling a part of a large xml file with JAXB and XMLStreamReader

I want to unmarshal part of a large XML file. A solution for this already exists, but I want to improve it for my own implementation.
Please have a look at the following code:
public static void main(String[] args) throws Exception {
    XMLInputFactory xif = XMLInputFactory.newFactory();
    StreamSource xml = new StreamSource("input.xml");
    XMLStreamReader xsr = xif.createXMLStreamReader(xml);
    xsr.nextTag();
    while(!xsr.getLocalName().equals("VersionList")&&xsr.getElementText().equals("1.81")) {
        xsr.nextTag();
    }
}
I want to unmarshal input.xml (given below) for the node with versionNumber="1.81".
With the current code, the XMLStreamReader first checks the node with versionNumber="1.80", then checks all of that node's sub-nodes, and only then moves on to the node with versionNumber="1.81", where the exit condition of the while loop is satisfied.
Since I only want to check the versionNumber nodes, iterating over their sub-nodes is unnecessary, and for a large XML file iterating over all sub-nodes of version 1.80 will take a long time. I want to check only the top-level nodes (versionNumber), and if the first one (versionNumber="1.80") does not match, the XMLStreamReader should jump directly to the next one (versionNumber="1.81"). That does not seem achievable with xsr.nextTag(). Is there any way to iterate over the desired top-level nodes only?
input.xml:
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<fileVersionListWrapper FileName="src.h">
<VersionList versionNumber="1.80">
<Reviewed>
<commentId>v1.80(c5)</commentId>
<author>Robin</author>
<lines>47</lines>
<lines>48</lines>
<lines>49</lines>
</Reviewed>
<Reviewed>
<commentId>v1.80(c6)</commentId>
<author>Sujan</author>
<lines>82</lines>
<lines>83</lines>
<lines>84</lines>
<lines>85</lines>
</Reviewed>
</VersionList>
<VersionList versionNumber="1.81">
<Reviewed>
<commentId>v1.81(c4)</commentId>
<author>Robin</author>
<lines>47</lines>
<lines>48</lines>
<lines>49</lines>
</Reviewed>
<Reviewed>
<commentId>v1.81(c5)</commentId>
<author>Sujan</author>
<lines>82</lines>
<lines>83</lines>
<lines>84</lines>
<lines>85</lines>
</Reviewed>
</VersionList>
</fileVersionListWrapper>
You can get the node from the XML using XPath.
XPath, the XML Path Language, is a query language for selecting nodes from an XML document. In addition, XPath may be used to compute values (e.g., strings, numbers, or Boolean values) from the content of an XML document.
Your XPath expression will be
/fileVersionListWrapper/VersionList[@versionNumber='1.81']
meaning you only want the VersionList elements whose versionNumber attribute is 1.81.
Java code
I have assumed that you have the XML as a string, so you will need something like the following:
DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
DocumentBuilder builder = factory.newDocumentBuilder();
InputSource inputSource = new InputSource(new StringReader(xml));
Document document = builder.parse(inputSource);
XPathFactory xPathfactory = XPathFactory.newInstance();
XPath xpath = xPathfactory.newXPath();
XPathExpression expr = xpath.compile("/fileVersionListWrapper/VersionList[@versionNumber='1.81']");
NodeList nl = (NodeList) expr.evaluate(document, XPathConstants.NODESET);
Now simply loop through each node:
for (int i = 0; i < nl.getLength(); i++)
{
System.out.println(nl.item(i).getNodeName());
}
To get the nodes back to XML, you will have to create a new Document and append the nodes to it.
Document newXmlDocument = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
Element root = newXmlDocument.createElement("fileVersionListWrapper");
for (int i = 0; i < nl.getLength(); i++)
{
Node node = nl.item(i);
Node copyNode = newXmlDocument.importNode(node, true);
root.appendChild(copyNode);
}
newXmlDocument.appendChild(root);
Once you have the new document, run a serializer over it to get the XML.
// Serialize the filtered document (newXmlDocument), not the original one.
DOMImplementationLS domImplementationLS = (DOMImplementationLS) newXmlDocument.getImplementation();
LSSerializer lsSerializer = domImplementationLS.createLSSerializer();
String xmlString = lsSerializer.writeToString(newXmlDocument);
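Put together, the select, copy, and serialize steps above could look like the following sketch. The class name `VersionFilter` and the method `extractVersion` are made up for illustration, and the version is concatenated straight into the XPath expression, which is only acceptable for trusted input.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.w3c.dom.ls.DOMImplementationLS;
import org.w3c.dom.ls.LSSerializer;
import org.xml.sax.InputSource;

// Hypothetical helper, not part of the original answer.
class VersionFilter {

    /** Returns a new XML string containing only the VersionList elements of one version. */
    static String extractVersion(String xml, String version) throws Exception {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        Document source = builder.parse(new InputSource(new StringReader(xml)));

        // Select the matching VersionList elements (trusted input assumed).
        String expression =
                "/fileVersionListWrapper/VersionList[@versionNumber='" + version + "']";
        NodeList matches = (NodeList) XPathFactory.newInstance().newXPath()
                .evaluate(expression, source, XPathConstants.NODESET);

        // Copy the matches under a fresh root element in a new document.
        Document target = builder.newDocument();
        Element root = target.createElement("fileVersionListWrapper");
        target.appendChild(root);
        for (int i = 0; i < matches.getLength(); i++) {
            root.appendChild(target.importNode(matches.item(i), true));
        }

        // Serialize the new document, not the source document.
        DOMImplementationLS ls = (DOMImplementationLS) target.getImplementation();
        LSSerializer serializer = ls.createLSSerializer();
        return serializer.writeToString(target);
    }

    public static void main(String[] args) throws Exception {
        String xml = "<fileVersionListWrapper FileName=\"src.h\">"
                + "<VersionList versionNumber=\"1.80\"><Reviewed><author>Robin</author></Reviewed></VersionList>"
                + "<VersionList versionNumber=\"1.81\"><Reviewed><author>Sujan</author></Reviewed></VersionList>"
                + "</fileVersionListWrapper>";
        System.out.println(extractVersion(xml, "1.81"));
    }
}
```

Note that this still loads the whole file into memory as a DOM, so it trades the streaming behaviour asked about in the question for simpler code.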
Now that you have the XML as a String, I assume you already have a JAXB class that looks similar to this:
@XmlRootElement(name = "fileVersionListWrapper")
public class FileVersionListWrapper
{
    private ArrayList<VersionList> versionListArrayList = new ArrayList<VersionList>();

    public ArrayList<VersionList> getVersionListArrayList()
    {
        return versionListArrayList;
    }

    @XmlElement(name = "VersionList")
    public void setVersionListArrayList(ArrayList<VersionList> versionListArrayList)
    {
        this.versionListArrayList = versionListArrayList;
    }
}
Then simply use the JAXB unmarshaller to create the objects for you:
JAXBContext jaxbContext = JAXBContext.newInstance(FileVersionListWrapper.class);
Unmarshaller jaxbUnmarshaller = jaxbContext.createUnmarshaller();
StringReader reader = new StringReader(xmlString);
FileVersionListWrapper fileVersionListWrapper = (FileVersionListWrapper) jaxbUnmarshaller.unmarshal(reader);
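If building a DOM of the whole file is too expensive, a streaming alternative is to stay with the XMLStreamReader from the question and skip each non-matching VersionList by counting nesting depth, without calling getElementText() on anything. The sketch below is one way to do that; the class `VersionSeeker` and its methods are hypothetical names, and it assumes the root element contains only child elements and whitespace. The parser still has to read past the skipped bytes, since a stream cannot jump, but no text is extracted and no objects are built for them.

```java
import java.io.StringReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical helper illustrating subtree skipping with StAX.
class VersionSeeker {

    /**
     * Advances the reader to the start tag of the VersionList whose
     * versionNumber attribute equals the given version.
     * Returns false if the root's end tag is reached without a match.
     */
    static boolean seekVersion(XMLStreamReader xsr, String version) throws XMLStreamException {
        xsr.nextTag(); // move to the root element
        while (xsr.nextTag() == XMLStreamConstants.START_ELEMENT) {
            if ("VersionList".equals(xsr.getLocalName())
                    && version.equals(xsr.getAttributeValue(null, "versionNumber"))) {
                return true;
            }
            skipElement(xsr); // not the one we want: jump to its end tag
        }
        return false;
    }

    /** Consumes events until the end tag that matches the current start tag. */
    static void skipElement(XMLStreamReader xsr) throws XMLStreamException {
        int depth = 1;
        while (depth > 0) {
            int event = xsr.next();
            if (event == XMLStreamConstants.START_ELEMENT) {
                depth++;
            } else if (event == XMLStreamConstants.END_ELEMENT) {
                depth--;
            }
        }
    }

    public static void main(String[] args) throws Exception {
        String xml = "<fileVersionListWrapper FileName=\"src.h\">\n"
                + "  <VersionList versionNumber=\"1.80\"><Reviewed><lines>47</lines></Reviewed></VersionList>\n"
                + "  <VersionList versionNumber=\"1.81\"><Reviewed><lines>82</lines></Reviewed></VersionList>\n"
                + "</fileVersionListWrapper>";
        XMLStreamReader xsr = XMLInputFactory.newFactory()
                .createXMLStreamReader(new StringReader(xml));
        if (seekVersion(xsr, "1.81")) {
            System.out.println("Positioned at version "
                    + xsr.getAttributeValue(null, "versionNumber"));
        }
    }
}
```

Once the reader is positioned on the matching start tag, it can be handed to JAXB with Unmarshaller.unmarshal(XMLStreamReader, Class), assuming a VersionList class is mapped for it.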

how to parse inner tag of xml in android

This is my XML. Please tell me how to parse the whole video tag with a DOM parser. It has an inner location tag that is giving me trouble.
<video>
<video_id>39</video_id>
<title>test video</title>
<description>asdasd asd as</description>
<address/>
<location>
<longitude>-86.785012400000030</longitude>
<latitude>33.353920000000000</latitude>
</location>
<phone>2055555555</phone>
<website>http://www.google.com</website>
<time>154</time>
<youtube_url>http://www.youtube.com/watch?v=sdfgd</youtube_url>
<youtube_video_id>sdfgd</youtube_video_id>
<category_name>User Content</category_name>
<category_id>48</category_id>
</video>
This tutorial might help you
http://www.androidpeople.com/android-xml-parsing-tutorial-using-saxparser
Here's an example that reads your XML into a Document object and then uses XPath to evaluate the text content of the longitude node. Make sure video.xml is on the classpath of the app (put it in the same directory as the Java class).
I only add the XPath as a point of interest, to show you how to query the returned Document.
public class ReadXML {
private static final DocumentBuilderFactory DOCUMENT_BUILDER_FACTORY = DocumentBuilderFactory.newInstance();
private static final XPathFactory XPATH_FACTORY = XPathFactory.newInstance();
public static void main(String[] args) throws Exception {
new ReadXML().parseXml();
}
private void parseXml() throws Exception {
DocumentBuilder db = DOCUMENT_BUILDER_FACTORY.newDocumentBuilder();
InputStream stream = this.getClass().getResourceAsStream("video.xml");
final Document document = db.parse(stream);
XPath xPathEvaluator = XPATH_FACTORY.newXPath();
XPathExpression expr = xPathEvaluator.compile("video/location/longitude");
NodeList nodes = (NodeList) expr.evaluate(document, XPathConstants.NODESET);
for (int i = 0; i < nodes.getLength(); i++) {
Node item = nodes.item(i);
System.out.println(item.getTextContent());
}
}
}
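Since the question asks about reading the whole video element, including the nested location, here is a plain DOM sketch that walks the child elements recursively and flattens nested tags into dotted keys such as location.longitude. The class `VideoDomReader` and the dotted-key convention are illustrative choices, not something from the original posts.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;

// Hypothetical helper: flattens <video> into a map of field name to text.
class VideoDomReader {

    static Map<String, String> readVideo(String xml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        Map<String, String> fields = new LinkedHashMap<>();
        collect(doc.getDocumentElement(), "", fields);
        return fields;
    }

    /** Visits each child element; recurses into elements that contain other elements. */
    private static void collect(Element parent, String prefix, Map<String, String> out) {
        for (Node n = parent.getFirstChild(); n != null; n = n.getNextSibling()) {
            if (n.getNodeType() != Node.ELEMENT_NODE) {
                continue; // skip whitespace text nodes between tags
            }
            Element child = (Element) n;
            String key = prefix + child.getTagName();
            if (hasChildElements(child)) {
                collect(child, key + ".", out); // e.g. <location> -> "location."
            } else {
                out.put(key, child.getTextContent().trim());
            }
        }
    }

    private static boolean hasChildElements(Element e) {
        for (Node n = e.getFirstChild(); n != null; n = n.getNextSibling()) {
            if (n.getNodeType() == Node.ELEMENT_NODE) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<video><video_id>39</video_id><address/>"
                + "<location><longitude>-86.785012400000030</longitude>"
                + "<latitude>33.353920000000000</latitude></location></video>";
        readVideo(xml).forEach((k, v) -> System.out.println(k + " = " + v));
    }
}
```

The key point is the node-type check: the whitespace between tags shows up as text nodes in the DOM, which is a common reason nested elements appear to "disturb" a naive child loop.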

XML Searching and Parsing

I have an XML file that I am trying to search using Java. I just need to find an element by its Tag name and then find that Tag's value. So for example:
I have this XML file:
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="https://company.com/test/xslt/processing_report.xslt"?>
<Certificate xmlns="urn:us:net:exchangenetwork:Company">
<Value1>Veggie</Value1>
<Value2>Fruits</Value2>
<type1>Apple</type1>
<FindME>Red</FindME>
<Value3>Bread</Value3>
</Certificate>
I want to find the value inside of the FindME Tag. I can't use XPath because different files can have different structures, but they always have a FindME tag. Lastly I am looking for the simplest piece of code, I do not care much about performance. Thank you
Here is the code:
XPathFactory f = XPathFactory.newInstance();
XPathExpression expr = f.newXPath().compile(
"//*[local-name() = 'FindME']/text()");
DocumentBuilderFactory domFactory = DocumentBuilderFactory
.newInstance();
domFactory.setNamespaceAware(true);
DocumentBuilder builder = domFactory.newDocumentBuilder();
Document doc = builder.parse("src/test.xml"); //your XML file
Object result = expr.evaluate(doc, XPathConstants.NODESET);
NodeList nodes = (NodeList) result;
System.out.println(nodes.getLength());
for (int i = 0; i < nodes.getLength(); i++) {
System.out.println(nodes.item(i).getNodeValue());
}
Explained:
//* - matches any element node, wherever it is in the document
local-name() = 'FindME' - keeps the elements whose local name (the name without any namespace prefix or URI) is 'FindME'
text() - selects the text node children, i.e. the element's value
I think you need to read up on XPath because it can very easily solve this problem. So can using getElementsByTagName in the DOM API.
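For the DOM route, a minimal sketch could look like this. Because the sample file declares a default namespace, it uses a namespace-aware parser together with getElementsByTagNameNS and the "*" wildcard for the namespace; the class `FindMeLookup` and the method `findFirst` are names invented for the example.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper: finds the first element with a given local name.
class FindMeLookup {

    /** Returns the text of the first element named localName, or null if there is none. */
    static String findFirst(String xml, String localName) throws Exception {
        DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
        dbf.setNamespaceAware(true); // required for the NS variant below
        Document doc = dbf.newDocumentBuilder().parse(new InputSource(new StringReader(xml)));

        // "*" matches elements in any namespace, so the structure and the
        // namespace URI of the file do not need to be known in advance.
        NodeList hits = doc.getElementsByTagNameNS("*", localName);
        return hits.getLength() == 0 ? null : hits.item(0).getTextContent();
    }

    public static void main(String[] args) throws Exception {
        String xml = "<Certificate xmlns=\"urn:us:net:exchangenetwork:Company\">"
                + "<Value1>Veggie</Value1><FindME>Red</FindME></Certificate>";
        System.out.println(findFirst(xml, "FindME"));
    }
}
```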
You can still use XPath. All you need is the expression //FindME. It finds the FindME elements anywhere in the XML, irrespective of their parent or their path from the root.
If you are using namespaces, make sure the parser is namespace-aware:
String findMeVal = null;
InputStream is = //...
XmlPullParser parser = //...
parser.setFeature(XmlPullParser.FEATURE_PROCESS_NAMESPACES, true);
parser.setInput(is, null);
int event;
while (XmlPullParser.END_DOCUMENT != (event = parser.next())) {
if (event == XmlPullParser.START_TAG) {
if ("FindME".equals(parser.getName())) {
findMeVal = parser.nextText();
break;
}
}
}

xpaths not working in java

I am trying to access a URL, get the HTML from it, and use XPath to extract certain values. I am getting the HTML just fine, and JTidy seems to be cleaning it appropriately. However, when I try to get the desired values using XPath, I get an empty NodeList back. I know my XPath expression is correct; I have tested it in other ways. What's wrong with this code? Thanks for the help.
String url_string = base_url + countries[c];
URL url = new URL(url_string);
Tidy tidy = new Tidy();
tidy.setShowWarnings(false);
tidy.setXHTML(true);
tidy.setMakeClean(true);
Document doc = tidy.parseDOM(url.openStream(), null);
//tidy.pprint(doc, System.out);
String xpath_string = "id('catlisting')//a";
XPath xpath = XPathFactory.newInstance().newXPath();
XPathExpression expr = xpath.compile(xpath_string);
NodeList nodes = (NodeList)expr.evaluate(doc, XPathConstants.NODESET);
System.out.println("size="+nodes.getLength());
for (int r=0; r<nodes.getLength(); r++) {
System.out.println(nodes.item(r).getNodeValue());
}
Try "//div[@id='catlisting']//a"
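A likely reason the original expression comes back empty, stated as an assumption because the page markup and JTidy's DOM behaviour are not shown here: the XPath id() function only matches attributes that a DTD or schema has declared to be of type ID. Without that declaration, an attribute that merely happens to be called id is not an ID as far as XPath is concerned, while the predicate [@id='...'] compares the attribute value directly. The sketch below demonstrates the difference with the JDK's own parser rather than JTidy; `IdVersusAttribute` is a made-up class name.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical demo comparing id('...') with an [@id='...'] predicate.
class IdVersusAttribute {

    static Document parse(String xml) throws Exception {
        return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
    }

    /** Number of nodes the expression selects in the document. */
    static int count(Document doc, String expression) throws Exception {
        NodeList nodes = (NodeList) XPathFactory.newInstance().newXPath()
                .evaluate(expression, doc, XPathConstants.NODESET);
        return nodes.getLength();
    }

    public static void main(String[] args) throws Exception {
        // No DTD here, so the id attribute has no ID type information.
        Document doc = parse("<html><body><div id=\"catlisting\">"
                + "<a href=\"/a\">A</a><p><a href=\"/b\">B</a></p></div>"
                + "<a href=\"/c\">C</a></body></html>");
        System.out.println("id():  " + count(doc, "id('catlisting')//a"));
        System.out.println("@id:   " + count(doc, "//div[@id='catlisting']//a"));
    }
}
```

Separately, getNodeValue() returns null for element nodes, so even with matching a elements the loop in the question would print null. Selecting //div[@id='catlisting']//a/text() or //div[@id='catlisting']//a/@href yields nodes that do carry a value.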
