HTML Parser fetch link text - java

I'm using HTML Parser to fetch links from a web page. I need to store the URL, link text and the URL to the parent page containing the link. I have managed to get the link URL as well as the parent URL.
I still ned to get the link text.
link text
Unfortunately I'm having a hard time figuring it out, any help would be greatly appreciated.
public static List<LinkContainer> findUrls(String resource) {
String[] tagNames = {"A", "AREA"};
List<LinkContainer> urls = new ArrayList<LinkContainer>();
Tag tag;
String url;
String sourceUrl;
try {
for (String tagName : tagNames) {
Parser parser = new Parser(resource);
NodeList nodes = parser.parse(new TagNameFilter(tagName));
NodeIterator i = nodes.elements();
while (i.hasMoreNodes()) {
tag = (Tag) i.nextNode();
url = tag.getAttribute("href");
sourceUrl = tag.getPage().getUrl();
if (RegexUtil.verifyUrl(url)) {
urls.add(new LinkContainer(url, null, sourceUrl));
}
}
}
} catch (ParserException pe) {
pe.printStackTrace();
}
return urls;
}

Have you tried ((LinkTag) tag).getLinkText() ? Personally I prefer n html parser which produces XML according to a well used standard, e.g., xerces or similar. This is what you get from using e.g., http://nekohtml.sourceforge.net/.

You would need to check the children of each A Tag. If you assume that your A tags only have a single child (the text itself), you can use the getFirstChild() method. This should be an instance of TextNode, and you can call getText() on this to get the link text.

Related

Java XML Read with WSIL file

at the moment I am trying to program a program which is able to render a link of an xml-file. I use Jsoup, my current code is the following
public static String XmlReader() {
InputStream is = RestService.getInstance().getWsilFile();
try {
Document doc = Jsoup.parse(fis, null, "", Parser.xmlParser());
} catch (Exception e) {
e.printStackTrace();
return null;
}
}
}
I would like to read the following part from a XML file:
<wsil:service>
<wsil:abstract>Read the full documentation on: https://host/sap/bc/mdrs/cdo?type=psm_isi_r&objname=II_QUERY_PROJECT_IN&saml2=disabled</wsil:abstract>
<wsil:name>Query Projects</wsil:name>
<wsil:description location="host/sap/bc/srt/wsdl/srvc_00163E5E1FED1EE897C188AB4A5723EF/wsdl11/allinone/ws_policy/document?sap-vhost=host&saml2=disabled" referencedNamespace="http://schemas.xmlsoap.org/wsdl/"/>
</wsil:service>
I want to return the following URL as String
host/sap/bc/srt/wsdl/srvc_00163E5E1FED1EE897C188AB4A5723EF/wsdl11/allinone/ws_policy/document?sap-vhost=host&saml2=disabled
How can I do that ?
Thank you
If there is only one tag wsil:description then you can use this code:
doc.outputSettings().escapeMode(EscapeMode.xhtml);
String val = doc.select("wsil|description").attr("location");
Escape mode should be changed, since you are not working on regular html, but xml.
If you have more than one tag with given name you can search for distinct neighbour element, and find required tag with respect to it:
String val = doc.select("wsil|name:contains(Query Projects)").first().parent().select("wsil|description").attr("location");

Parse XML result from Web service request Java

I have this XML result from a web service request. The tags that are inside the box are the ones that I need from the xml result.
Here's what I have so far:
private Node getMessageNode(QueryResponseQueryResult paramQueryResponseQueryResult, String[] paramArrayOfString)
{
MessageElement[] arrayOfMessageElement = paramQueryResponseQueryResult.get_any();
Document localDocument = null;
String res;
try
{
localDocument = arrayOfMessageElement[0].getAsDocument(); //result from the webservice
}
catch (Exception localException) {}
if (localDocument == null) {
return null;
}
Object localObject = localDocument.getDocumentElement();
localObject = Nodes.findChildByTags((Node)localObject, paramArrayOfString);
return localDocument; //This returns the XML above
}
How do I parse the result to return only those tags on the box and still return it as XML type?
Thanks in advance.
You can use Xpath of XQuery to perform this task.
You should get the document, and then you can get the child node of table using
getElementByTagName("table") or run XPath on it:
See here a good xpath tutorial.

Jsoup: null result in absUrl (abs:)

I tried to make a image links downloader with jsoup. I have made a downloader HTML code part, and when I have done a parse part, I recognized, that sometimes links to images appeared without main part. So I found absUrl solution, but by some reasons it did not work (it gave me null). So I tried use uri.resolve(), but it gave me unchanged result. So now I do not know how to solve it. I attached part of my code, that responsible for parsing ant writing url to string:
public static String finalcode(String textin) throws Exception {
String text = source(textin);
Document doc = Jsoup.parse(text);
Elements images = doc.getElementsByTag("img");
String Simages = images.toString();
int Limages = countLines(Simages);
StringBuilder src = new StringBuilder();
while (Limages > 0) {
Limages--;
Element image = images.get(Limages);
String href = image.attr("src");
src.append(href);
src.append("\n");
}
String result = src.toString();
return result;
}
It looks like you are parsing HTML from String, not from URL. Because of that jsoup can't know from which URL this HTML codes comes from, so it can't create absolute path.
To set this URL for Document you should parse it using Jsoup.parse(String html, String baseUri) version, like
String url = "http://server/pages/document.htlm";
String text = "<img src = '../images/image_name1.jpg'/><img src = '../images/image_name2.jpg'/>'";
Document doc = Jsoup.parse(text, url);
Elements images = doc.getElementsByTag("img");
for (Element image : images){
System.out.println(image.attr("src")+" -> "+image.attr("abs:src"));
}
Output:
../images/image_name1.jpg -> http://server/images/image_name1.jpg
../images/image_name2.jpg -> http://server/images/image_name2.jpg
Other option would be letting Jsoup parse page directly by supplying URL instead of String with HTML
Document doc = Jsoup.connect("http://example.com").get();
This way Document will know from which URL it came, so it will be able to create absolute paths.

JSoup core web text extraction

I am new to JSoup, Sorry if my question is too trivial.
I am trying to extract article text from http://www.nytimes.com/ but on printing the parse document
I am not able to see any articles in the parsed output
public class App
{
public static void main( String[] args )
{
String url = "http://www.nytimes.com/";
Document document;
try {
document = Jsoup.connect(url).get();
System.out.println(document.html()); // Articles not getting printed
//System.out.println(document.toString()); // Same here
String title = document.title();
System.out.println("title : " + title); // Title is fine
} catch (IOException e) {
// TODO Auto-generated catch block
e.printStackTrace();
}
}
ok I have tried to parse "http://en.wikipedia.org/wiki/Big_data" to retrieve the wiki data, same issue here as well I am not getting the wiki data in the out put.
Any help or hint will be much appreciated.
Thanks.
Here's how to get all <p class="summary> text:
final String url = "http://www.nytimes.com/";
Document doc = Jsoup.connect(url).get();
for( Element element : doc.select("p.summary") )
{
if( element.hasText() ) // Skip those tags without text
{
System.out.println(element.text());
}
}
If you need all <p> tags, without any filtering, you can use doc.select("p") instead. But in most cases it's better to select only those you need (see here for Jsoup Selector documentation).

htmlcleaner parse with tags

I try to extract some part of page. I use parser HtmlCleaner, and it remove all tags. Are there some settings to save all html tags? Or maybe is better way to extract this part of code, using something else?
My code:
static final String XPATH_STATS = "//div[#class='text']/p/";
// config cleaner properties
HtmlCleaner htmlCleaner = new HtmlCleaner();
CleanerProperties props = htmlCleaner.getProperties();
props.setAllowHtmlInsideAttributes(false);
props.setAllowMultiWordAttributes(true);
props.setRecognizeUnicodeChars(true);
props.setOmitComments(true);
props.setTransSpecialEntitiesToNCR(true);
// create URL object
URL url = new URL(BLOG_URL);
// get HTML page root node
TagNode root = htmlCleaner.clean(url);
Object[] statsNode = root.evaluateXPath(XPATH_STATS);
for (Object tag : statsNode) {
stats = stats + tag.toString().trim();
}
return stats;
thanks for nikhil.thakkar!
I do this by JSON.
The code may help someone:
URL url2 = new URL(BLOG_URL);
Document doc2 = Jsoup.parse(url2, 3000);
Element masthead = doc2.select("div.main_text").first();
String linkOuterH = masthead.outerHtml();
You can use jSoup parser.
More info here: http://jsoup.org/

Categories

Resources