Parsing XML with embedded data - java

I'm trying to parse an XML document on Android. My problem is that the XML is in a strange format: all of the data I'm trying to parse is located inside one element.
Here is an example:
<a name="3"></a>
<div class="series_alpha">
<h2 class="series_alpha">3</h2>
<ul class="series_alpha"><li>3 Banme no Kareshi<span class="mangacompleted">[Completed]</span></li>
<li>3 Gatsu no Lion</li>
<li>337 Byooshi</li>
<li>360 Degrees Material</li>
<li>37 Degrees Kiss<span class="mangacompleted">[Completed]</span></li>
<li>3x3 Eyes</li>
</ul>
<div class="clear"></div>
</div>
The XML is a piece of the source from this webpage. The data I'm trying to retrieve is in the <li> tags, specifically the link reference and the manga name, but I don't know how to separate the link from the title.

After looking it up, I found that the information inside the tags is known as an "attribute" (I'm a noob, I know), and after some Google searches I found that the attributes.getValue("name of attribute here") method is what I was looking for.
source: XML Parsing to get Attribute Value
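
A minimal sketch of that approach using the JDK's SAX parser. It assumes each <li> wraps its title in an <a href="..."> element, which the excerpt above does not show, and that the input is well-formed XML with a single root element. The class name MangaListParser and the "href|title" output format are made up for illustration.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper: collects "href|title" for every <a> element that has an href attribute.
final class MangaListParser {

    static List<String> parse(String xml) {
        final List<String> entries = new ArrayList<>();
        DefaultHandler handler = new DefaultHandler() {
            private String href;                                  // link currently being read, or null
            private final StringBuilder title = new StringBuilder();

            @Override
            public void startElement(String uri, String localName, String qName, Attributes attributes) {
                // getValue returns null when the attribute is absent, e.g. for <a name="3">
                if ("a".equals(qName) && attributes.getValue("href") != null) {
                    href = attributes.getValue("href");
                    title.setLength(0);
                }
            }

            @Override
            public void characters(char[] ch, int start, int length) {
                if (href != null) {
                    title.append(ch, start, length);              // text between <a ...> and </a>
                }
            }

            @Override
            public void endElement(String uri, String localName, String qName) {
                if ("a".equals(qName) && href != null) {
                    entries.add(href + "|" + title.toString().trim());
                    href = null;
                }
            }
        };
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            throw new IllegalStateException("Input is not well-formed XML", e);
        }
        return entries;
    }
}
```

The link comes from the attribute and the title from the character data inside the element, so the two never need to be split apart by hand. Real-world HTML is often not well-formed XML, in which case an HTML parser is the safer choice.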

Related

Thymeleaf title attribute with HTML tags

I'm using the summernote plugin (https://summernote.org/) to style text, which I then save to the database. It produces a string such as <b style='color:#CCC'>Test</b>.
For normal text I use the th:utext attribute, but nothing equivalent seems to be available for th:title. How can I do this in Thymeleaf? Thanks in advance.
In the first scenario I want to show the string as text, so I used <span th:utext="${text}"></span>, and this works as expected.
In the second scenario I want to show it as the title of another tag, like
<a th:title="${text}">Some other text </a>. This gives a title that contains the tags as a literal string; the styles are not applied to the title. How can I get the title rendered with the text style provided by the string?
In both cases ${text} is <b style='color:#CCC'>Test</b>. How can I get unescaped text in the title attribute?
If you are getting the <b style='color:#CCC'>Test</b> string from the model (as in th:utext="${text}"), try it like this:
From the server: model.addAttribute("text", "<b style='color:#CCC'>Test</b>");
HTML #1: <span th:utext="${text}"></span>
HTML #2: <a th:title="${text}">Some other text </a>
I tried this on my server and it worked.

pretty print java with no xml conversion

I am aware that this question has been asked multiple times on this site; however, none of the previous answers have worked for me.
I have an XML string like <A><B/><C></C></A>
When I use the pretty print converters I get:
<A>
<B/>
<C/>
</A>
I want to prevent this and keep the XML as it was, like:
<A>
<B/>
<C></C>
</A>
I want an indent of 2. Kindly help.
As mentioned in the comments, for an empty XML element both
<B></B>
and
<B/>
are equivalent.
I think you should try to negotiate with your tester to save yourself the trouble, as this sort of "fix" is unnecessary. I would say your XML code is working as expected.
Try pointing your tester to the W3C XML specification.
See:
https://www.w3.org/TR/REC-xml/#NT-EmptyElemTag
Quote from the link above:
Examples of empty elements:
<IMG align="left"
src="http://www.w3.org/Icons/WWW/w3c_home" />
<br></br>
<br/>
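
The equivalence can be checked with the JDK's own XML APIs. The sketch below parses both spellings into DOM trees and compares them, then pretty-prints with an indent of 2. The class name EmptyElementDemo is made up, and the indent-amount output property is specific to the JDK's built-in (Xalan-based) transformer, so treat that part as an assumption about the runtime in use.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

final class EmptyElementDemo {

    // Parses an XML string into a DOM tree.
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalArgumentException("Not well-formed XML", e);
        }
    }

    // True when both strings parse to structurally identical element trees.
    static boolean sameTree(String a, String b) {
        return parse(a).getDocumentElement().isEqualNode(parse(b).getDocumentElement());
    }

    // Re-serializes the XML with the given indentation.
    static String prettyPrint(String xml, int indent) {
        try {
            Transformer t = TransformerFactory.newInstance().newTransformer();
            t.setOutputProperty(OutputKeys.INDENT, "yes");
            t.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            // Implementation-specific property (assumption: JDK built-in transformer).
            t.setOutputProperty("{http://xml.apache.org/xslt}indent-amount", String.valueOf(indent));
            StringWriter out = new StringWriter();
            t.transform(new DOMSource(parse(xml)), new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException("Could not serialize XML", e);
        }
    }
}
```

Here sameTree("<A><B/><C></C></A>", "<A><B/><C/></A>") returns true. Once parsed, the tree no longer records which spelling the source used, which is why a serializer working from the tree cannot reproduce <C></C> on its own.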

Filling out a HTML-form with complex name (dot-notation in input-tag) with Java and Jaunt API

Hey folks,
I am building a Java tool that automatically fills out some form input elements in an HTML page using Java and the Jaunt API.
The HTML code looks like this:
<fieldset class = "fieldsetlong">
<legend>searchprofile</legend>
<label for="reference">reference:</label>
<input maxlength="50" name="reference" id="reference" type="text" />
</fieldset>
<fieldset class = "fieldsetlong">
<legend>searchcriteria</legend>
<label for="surname">surname:</label>
<input name="searchprofile.surname" id="surname" type="text" />
</fieldset>
The Java code for filling in the "normal" input field reference (which works) looks like:
form.set("reference", "123Test");
Unfortunately, I am not able to fill out fields that use dot notation in the name, such as searchprofile.surname.
Here's a sample of what I've tried (without success):
form.set("surname", "TestPerson");
form.set("searchprofile.surname", "TestPerson");
form.set("name=\"searchprofile.surname\"", pers.getSurname());
form.set("id=\"surname\"", pers.getSurname());
For each of these commands I get a NotFoundException, and I don't know whether this is possible with Jaunt.
I would appreciate any kind of help in this regard.
Thanks in advance
Edit - is there a way to reach the dot-notated input field searchprofile.surname with Jsoup?
HTML allows dots in the name attribute, but does Jaunt accept a name like abc.name?
Not sure about Jaunt, I've never used it. However, Jsoup seems to be a pretty decent library for this. I have been using Jsoup for a fairly long time, and it has worked well for scraping web pages, filling in and submitting forms, and of course HTML parsing!
I've posted a step-by-step guide to filling in form input fields and submitting them to a server in the following answer: How to login with Jsoup
Basically it works much like your code; a very brief example would be:
Connection.Response response = Jsoup.connect(url)
        .data("Name", "Value")
        .method(Method.POST)
        .execute();
Today at work, the Jaunt solution with
form.set("searchprofile.surname", "TestPerson");
worked like a charm.
I don't know what the problem was earlier, but I am glad that it works now.
HTML allows dots, hyphens, etc. in names. I had misinterpreted the dot notation as some kind of nested form or hierarchy, but it is just a valid value for the name attribute in HTML.
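
That the dot carries no special meaning can be seen in how a form body is encoded for submission. This is a small sketch using only the JDK, independent of Jaunt and Jsoup; the class name FormBody and its methods are made up for illustration.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.util.LinkedHashMap;
import java.util.Map;

final class FormBody {

    // Builds an application/x-www-form-urlencoded body from name/value pairs.
    static String encode(Map<String, String> fields) {
        StringBuilder body = new StringBuilder();
        try {
            for (Map.Entry<String, String> field : fields.entrySet()) {
                if (body.length() > 0) {
                    body.append('&');
                }
                body.append(URLEncoder.encode(field.getKey(), "UTF-8"))
                    .append('=')
                    .append(URLEncoder.encode(field.getValue(), "UTF-8"));
            }
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException("UTF-8 should always be available", e);
        }
        return body.toString();
    }

    // The two fields from the form in the question.
    static String example() {
        Map<String, String> fields = new LinkedHashMap<>();
        fields.put("reference", "123Test");
        fields.put("searchprofile.surname", "TestPerson");
        return encode(fields);
    }
}
```

FormBody.example() yields reference=123Test&searchprofile.surname=TestPerson. The dotted name is sent as one flat key, exactly like the plain one.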

Java / Android HTML custom tag parser

I'm trying to figure out a way to parse an HTML file with custom tags of the form:
[custom tag="id"]
Here's an example of a file I'm working with:
<p>This is an <em>amazing</em> example. </p>
<p>Such amazement, <span>many wow.</span> </p>
<p>Oh look, a wild [custom tag="amaze"] appears.</p>
We need maor embeds <a href="http://youtu.be/F5nLu232KRo"> bro
What I would like (in an ideal world) is to get back a list of elements:
List foundElements = [text, custom tag, text, link, text]
Where the elements in the above list contain:
Text:
<p>This is an <em>amazing</em> example. </p>
<p>Such amazement, <span>many wow.</span> </p>
<p>Oh look, a wild [custom tag="amaze"] appears.</p>
We need maor embeds
Custom tag:
[custom tag="amaze"]
Link:
<a href="http://youtu.be/F5nLu232KRo">
Text:
appears.</p>We need maor embeds
What I've tried:
Jsoup
Jsoup is great and works perfectly for HTML. The issue is that I can't define custom tags with an opening "[" and a closing "]". Correct me if I'm wrong.
Jericho
Again, like Jsoup, Jericho works great, except for defining custom tags. You're required to use "<".
Java Regex
This is the option I really don't want to go for. It's not reliable, and there's a lot of brittle string manipulation, especially when you're matching against a lot of regexes.
Last but not least, I'm looking for a performance-oriented solution, as this runs on an Android client.
All suggestions welcome!
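
One possible direction, sketched under assumptions: a single left-to-right scan with indexOf that splits the input into text runs and custom tags, with no regex involved. It assumes the tags always have the exact form [custom tag="id"]; the class CustomTagScanner and its Token type are made up for illustration, and links are not handled here (the text runs could still be handed to an HTML parser such as Jsoup).

```java
import java.util.ArrayList;
import java.util.List;

final class CustomTagScanner {

    enum Kind { TEXT, CUSTOM_TAG }

    static final class Token {
        final Kind kind;
        final String value;   // raw text for TEXT, the id for CUSTOM_TAG

        Token(Kind kind, String value) {
            this.kind = kind;
            this.value = value;
        }
    }

    private static final String OPEN = "[custom tag=\"";
    private static final String CLOSE = "\"]";

    // Splits the input into text runs and custom tags in one pass.
    static List<Token> scan(String input) {
        List<Token> tokens = new ArrayList<>();
        int pos = 0;
        while (pos < input.length()) {
            int open = input.indexOf(OPEN, pos);
            int close = open < 0 ? -1 : input.indexOf(CLOSE, open + OPEN.length());
            if (close < 0) {
                // No further complete tag: everything that is left is plain text.
                tokens.add(new Token(Kind.TEXT, input.substring(pos)));
                break;
            }
            if (open > pos) {
                tokens.add(new Token(Kind.TEXT, input.substring(pos, open)));
            }
            tokens.add(new Token(Kind.CUSTOM_TAG, input.substring(open + OPEN.length(), close)));
            pos = close + CLOSE.length();
        }
        return tokens;
    }
}
```

For the line with the wild [custom tag="amaze"], this produces a TEXT token, a CUSTOM_TAG token with the value amaze, and a trailing TEXT token. The scan visits each character a bounded number of times and allocates only the tokens, which may suit an Android client.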

Java : HTML Parsing

I have the HTML content given below. The parts I am looking for are "img src" and "!important". Does Java provide any HTML parsing techniques?
<fieldset>
<table cellpadding='0'border='0'cellspacing='0'style="clear :both">
<tr valign='top' ><td width='35' >
<a href='http://mypage.rediff.com/android/32868898'class='space' onmousedown="return
enc(this,'http://track.rediff.com/clickurl=___http%3A%2F%2Fmypage.rediff.com%2Fandroid%2F3 868898___&service=mypage_feeds&clientip=202.137.232.117&pos=0&feed_id=12942949154d255f839677925642&prc_id=32868898&rowid=2064549114')" >
<div style='width:25px;height:25px;overflow:hidden;'>
<img src='http://socialimg04.rediff.com/image.php?uid=32868898&type=thumb' width='25' vspace='0' /></div></a></td> <td><span>
<a href='http://mypage.rediff.com/android/32868898' class="space" onmousedown="return enc(this,'http://track.rediff.com/click?url=___http%3A%2F%2Fmypage.rediff.com%2Fandroid%2F32868898___&service=mypage_feeds&clientip=202.137.232.117&pos=0&feed_id=12942949154d255f839677925642&prc_id=32868898&rowid=2064549114')" >Android </a> </span><span style='color:#000000
!important;'>android se updates...</span><div class='divtext'></div></td></tr><tr><td height='5' ></td></tr></table></fieldset><br/>
Document doc = Jsoup.parse(new File("d:\\1.html"), "UTF-8"); // parse the file once and reuse the document
String value = doc.select("img").attr("src");
System.out.println(value); // http://socialimg04.rediff.com/image.php?uid=32868898&type=thumb
System.out.println(doc.select("span[style$=important;]").first().text()); // android se updates...
JSoup
What-are-the-pros-and-cons-of-the-leading-java-html-parsers
Try NekoHTML. This is the HTML parsing library used by various higher-level testing frameworks such as HtmlUnit.
NekoHTML is a simple HTML scanner and tag balancer that enables application programmers to parse HTML documents and access the information using standard XML interfaces. The parser can scan HTML files and "fix up" many common mistakes that human (and computer) authors make in writing HTML documents. NekoHTML adds missing parent elements; automatically closes elements with optional end tags; and can handle mismatched inline element tags.
I used jsoup. This library has a nice selector syntax (http://jsoup.org/cookbook/extracting-data/selector-syntax), and for your problem you can use code like this:
File input = new File("input.html");
Document doc = Jsoup.parse(input, "UTF-8", "http://example.com/");
Elements pngs = doc.select("img[src$=.png]");
I like using Jericho: http://jericho.htmlparser.net/docs/index.html
It copes well with badly formed HTML, links leading to unavailable locations, etc.
There are a lot of examples on their page; you just get all IMG tags and analyze their attributes to extract those that meet your needs.
