I've been trying to download the source of the Google News RSS feed. It downloads correctly, except that the links come out mangled.
static String urlNotizie = "https://news.google.it/news/feeds?pz=1&cf=all&ned=it&hl=it&output=rss";
Document docHtml = Jsoup.connect(urlNotizie).get();
String html = docHtml.toString();
System.out.println(html);
Output:
<html>
<head></head>
<body>
<rss version="2.0">
<channel>
<generator>
NFE/1.0
</generator>
<title>Prima pagina - Google News</title>
<link />http://news.google.it/news?pz=1&ned=it&hl=it
<language>
it
</language>
<webmaster>
news-feedback@google.com
</webmaster>
<copyright>
©2013 Google
</copyright> [...]
Using a URLConnection I'm able to output the correct source of the page, but when I parse it I hit the same issue as above: Jsoup spits out a list of empty <link /> elements. (Again, only the links are affected; parsing everything else works fine.) URLConnection example:
URL u = new URL(urlNotizie);
URLConnection yc = u.openConnection();
StringBuilder builder = new StringBuilder();
BufferedReader reader = new BufferedReader(new InputStreamReader(
yc.getInputStream()));
String line;
while ((line = reader.readLine()) != null) {
builder.append(line);
builder.append("\n");
}
String html = builder.toString();
System.out.println("HTML " + html);
Document doc = Jsoup.parse(html);
Elements listaTitoli = doc.select("title");
Elements listaCategorie = doc.select("category");
Elements listaDescrizioni = doc.select("description");
Elements listaUrl = doc.select("link");
System.out.println(listaUrl);
Jsoup is designed as an HTML parser, not as an XML (or RSS) parser.
The HTML <link> element is specified as not having any body, so it would be invalid for a <link> element to contain text the way the ones in your XML do.
You can parse XML using Jsoup, but you need to explicitly tell it to switch to XML parsing mode.
Replace
Document docHtml = Jsoup.connect(urlNotizie).get();
by
Document docXml = Jsoup.connect(urlNotizie).parser(Parser.xmlParser()).get();
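For example, a minimal self-contained sketch (the class name and the item loop are illustrative, not part of the answer) that fetches the feed in XML mode and prints each item's title and link:
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.parser.Parser;

public class RssDemo {
    public static void main(String[] args) throws Exception {
        String urlNotizie = "https://news.google.it/news/feeds?pz=1&cf=all&ned=it&hl=it&output=rss";
        // In XML mode Jsoup keeps the body of <link> elements instead of
        // treating them as void HTML <link> tags
        Document docXml = Jsoup.connect(urlNotizie).parser(Parser.xmlParser()).get();
        for (Element item : docXml.select("item")) {
            System.out.println(item.select("title").text() + " -> " + item.select("link").text());
        }
    }
}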
Related
I want to write a small piece of code that extracts the "Kategorie" out of an href with Jsoup.
<a href="/wiki/Kategorie:Herrscher_des_Mittelalters" title="Kategorie:Herrscher des Mittelalters">Herrscher des Mittelalters</a>
In this case I am searching for Herrscher des Mittelalters.
My code reads the first line of a .txt file with a BufferedReader.
BufferedReader r = new BufferedReader(new InputStreamReader(new FileInputStream(new File(FilePath)), Charset.forName("UTF-8")));
Document doc = Jsoup.parse(r.readLine());
Element elem = doc;
I know there are methods to get the href links, but I don't know how to search for specific elements within them.
Any suggestions?
Additional information: My .txt file contains full Wikipedia HTML pages.
This should get you all the titles from the links. You can split the titles further as needed:
Document d = Jsoup.parse("<a href=\"/wiki/Kategorie:Herrscher_des_Mittelalters\" title=\"Kategorie:Herrscher des Mittelalters\">Herrscher des Mittelalters</a>");
Elements links = d.select("a");
Set<String> categories = new HashSet<>();
for (Element link : links) {
String title = link.attr("title");
if (title.length() > 0) {
categories.add(title);
}
}
System.out.println(categories);
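If the titles follow Wikipedia's "Kategorie:Name" pattern (an assumption based on the question, not stated in the answer), splitting them further could look like:
for (String title : categories) {
    // strip the "Kategorie:" prefix to keep only the category name
    if (title.startsWith("Kategorie:")) {
        System.out.println(title.substring("Kategorie:".length()));
    }
}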
Alternatively, you can use the getElementsContainingText() method (org.jsoup.nodes.Document) to search for elements containing a given text.
Elements elements = doc.getElementsContainingText("Herrscher des Mittelalters");
for (Element element : elements) {
System.out.println(element.text());
}
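Note that getElementsContainingText() also matches every ancestor of the matching element, because an ancestor's combined text contains the search string too. If you only want the elements that directly hold the text, getElementsContainingOwnText() may be closer to what you want; a short sketch:
Elements own = doc.getElementsContainingOwnText("Herrscher des Mittelalters");
for (Element element : own) {
    // only elements whose own text (not their children's) contains the string
    System.out.println(element.tagName() + ": " + element.text());
}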
I am reading a text file that contains HTML code from Google search results. Then I parse it and I try to extract the links with this code:
FileReader in = new FileReader("A.txt");
BufferedReader p = new BufferedReader(in);
String html;
while ((html = p.readLine()) != null)
{
Document doc = Jsoup.parse(html);
Elements Link = doc.select("a[href");
for(Element element :Link)
{
if(element != null)
{
System.out.println(element);
}
}
}
But I get many non-link strings. How can I print only the links and nothing else?
Please try again with a complete selector, not only "a[href":
Elements links = doc.select("a[href]"); // a with href
See the Selector documentation for the full supported syntax, especially the examples.
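As a short sketch in context (the base URI is an assumption; you need to supply one for abs:href to resolve relative links when parsing from a file):
Document doc = Jsoup.parse(html, "https://www.google.com/"); // assumed base URI
for (Element element : doc.select("a[href]")) {
    System.out.println(element.attr("abs:href")); // href resolved to an absolute URL
}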
I need to parse XML from a URL in Java with a SAX parser. I couldn't find an example of this on the internet; all the examples I found read the XML from a local file. Is there an example of parsing XML with nested tags from a URL in Java?
Refer to this example Java snippet:
String webServiceURL="web service url or document url here";
URL geoLocationDetailXMLURL = new URL(webServiceURL);
URLConnection geoLocationDetailXMLURLConnection = geoLocationDetailXMLURL.openConnection();
geoLocationDetailXMLURLConnection.setConnectTimeout(120000);
geoLocationDetailXMLURLConnection.setReadTimeout(120000);
BufferedReader geoLocationDetails = new BufferedReader(new InputStreamReader(geoLocationDetailXMLURLConnection.getInputStream(), "UTF-8"));
InputSource inputSource = new InputSource(geoLocationDetails);
// saxParser and handler are assumed to already exist, e.g.
// SAXParser saxParser = SAXParserFactory.newInstance().newSAXParser();
// DefaultHandler handler = new YourHandler();
saxParser.parse(inputSource, handler);
This should help
SAX parser and a file from the network
The important line being
xr.parse(new InputSource(sourceUrl.openStream()));
where sourceUrl is a java.net.URL (openStream() is a method of URL, so you build the URL from your string first).
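Putting both answers together, a self-contained sketch (the feed URL is a placeholder, and the handler just prints what it sees) that streams XML from a URL through a SAX handler, nested tags included:
import java.io.InputStream;
import java.net.URL;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

public class SaxFromUrl {
    public static void main(String[] args) throws Exception {
        SAXParser saxParser = SAXParserFactory.newInstance().newSAXParser();
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName, String qName, Attributes attributes) {
                // called once per opening tag, however deeply nested
                System.out.println("start: " + qName);
            }
            @Override
            public void characters(char[] ch, int start, int length) {
                String text = new String(ch, start, length).trim();
                if (!text.isEmpty()) {
                    System.out.println("text: " + text);
                }
            }
        };
        URL sourceUrl = new URL("http://example.com/feed.xml"); // placeholder URL
        try (InputStream in = sourceUrl.openStream()) {
            saxParser.parse(new InputSource(in), handler);
        }
    }
}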
I'm using HtmlCleaner 2.6.1 and XPath to parse an HTML page in an Android application.
Here are the HTML pages:
http://www.kino-govno.com/comments/42571-postery-kapitan-fillips-i-poslednij-rubezh
http://www.kino-govno.com/comments/42592-fantasticheskie-idei-i-mesta-ih-obitanija
The first link returns a document and everything is fine. For the second link, this line:
document = domSerializer.createDOM(tagNode);
returns nothing.
In a plain Java project without Android, the same code works fine.
Here is the code:
String queries = "//div[starts-with(@class, 'news_text op')]/p";
URL url = new URL(link2);
TagNode tagNode = new HtmlCleaner().clean(url);
CleanerProperties cleanerProperties = new CleanerProperties();
DomSerializer domSerializer = new DomSerializer(cleanerProperties);
document = domSerializer.createDOM(tagNode);
xPath = XPathFactory.newInstance().newXPath();
pageNode = (NodeList)xPath.evaluate(queries,document, XPathConstants.NODESET);
String val = pageNode.item(0).getFirstChild().getNodeValue();
That's because HtmlCleaner wraps the paragraphs of the second HTML page in another <div/>, so they are no longer direct children. Use the descendant-or-self axis // instead of the child axis /:
//div[starts-with(@class, 'news_text op')]//p
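To illustrate the difference, a small sketch reusing the xPath and document built above (the printed counts are illustrative):
// child axis: only <p> elements that are direct children of the div
NodeList direct = (NodeList) xPath.evaluate("//div[starts-with(@class, 'news_text op')]/p", document, XPathConstants.NODESET);
// descendant-or-self axis: <p> elements at any depth inside the div
NodeList nested = (NodeList) xPath.evaluate("//div[starts-with(@class, 'news_text op')]//p", document, XPathConstants.NODESET);
System.out.println(direct.getLength() + " direct children, " + nested.getLength() + " descendants");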
I have an XHTML file which, on http://validator.w3.org/, gives me the result: This document was successfully checked as HTML 4.01 Transitional!
I am parsing it with the following code:
ITextRenderer renderer = new ITextRenderer(); // Flying Saucer's iText-based renderer
OutputStream os = new FileOutputStream(new File("example.pdf"));
BufferedReader reader1 = new BufferedReader(new FileReader("x:\\workspace\\Test.html"));
StringBuilder sb = new StringBuilder();
String line = null;
while ((line = reader1.readLine()) != null) {
sb.append(line);
}
reader1.close();
String str = sb.toString();
renderer.setDocumentFromString(str);
renderer.layout();
renderer.createPDF(os);
os.close();
and I'm getting the error described in the title. Do you know how to fix this issue?
You forgot a closing bracket (>) in your HTML page.
Therefore it is not an XHTML page but simply an HTML4 page, and the validator you named only checked it as HTML4, not as XHTML.
HTML4 lets you do things that are forbidden in XML (and XHTML); e.g., in HTML the following would be legal:
<br> (XML and XHTML would require the self-closed form <br/>)
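If you cannot fix the source document by hand, one option (not from the original answer, just a common workaround) is to normalize the markup to well-formed XHTML first, for example by round-tripping it through Jsoup with XML output syntax before handing it to Flying Saucer:
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

// Jsoup repairs unclosed tags; with XML output syntax it also self-closes
// void elements, e.g. <br> becomes <br />
Document jsoupDoc = Jsoup.parse(str);
jsoupDoc.outputSettings().syntax(Document.OutputSettings.Syntax.xml);
renderer.setDocumentFromString(jsoupDoc.html());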