I am reading a text file that contains HTML code from Google search results. Then I parse it and I try to extract the links with this code:
FileReader in = new FileReader("A.txt");
BufferedReader p = new BufferedReader(in);
while (p.readLine() != null) {
    String html = p.readLine();
    Document doc = Jsoup.parse(html);
    Elements Link = doc.select("a[href");
    for (Element element : Link) {
        if (element != null) {
            System.out.println(element);
        }
    }
}
But I get many non-link strings. How can I print only the links and nothing else?
Try again with a complete selector, not just "a[href":
Elements links = doc.select("a[href]"); // a with href
See the Selector documentation for the full supported syntax, especially the examples on the right side.
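For reference, a minimal end-to-end sketch (the file name comes from the question; reading the whole file up front is an assumption, since the original loop calls readLine() twice per iteration and silently skips every other line):
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class LinkExtractor {
    public static void main(String[] args) throws IOException {
        // Read the whole file once instead of line by line.
        String html = new String(Files.readAllBytes(Paths.get("A.txt")),
                StandardCharsets.UTF_8);
        Document doc = Jsoup.parse(html);
        // "a[href]" (note the closing bracket) matches only anchors
        // that actually carry an href attribute.
        for (Element link : doc.select("a[href]")) {
            System.out.println(link.attr("href"));
        }
    }
}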
I want to write a small piece of code that will extract the "Kategorie" out of an href with jsoup.
<a href="/wiki/Kategorie:Herrscher_des_Mittelalters" title="Kategorie:Herrscher des Mittelalters">Herrscher des Mittelalters</a>
In this case I am searching for Herrscher des Mittelalters.
My code reads the first line of a .txt file with the BufferedReader.
BufferedReader r = new BufferedReader(new InputStreamReader(new FileInputStream(new File(FilePath)), Charset.forName("UTF-8")));
Document doc = Jsoup.parse(r.readLine());
Element elem = doc;
I know there are methods to get the href link, but I don't know how to search for elements by the content of the href itself.
Any suggestions?
Additional information: My .txt file contains full Wikipedia HTML pages.
This should get you all titles from links. You can split the titles further as you need:
Document d = Jsoup.parse("<a href=\"/wiki/Kategorie:Herrscher_des_Mittelalters\" title=\"Kategorie:Herrscher des Mittelalters\">Herrscher des Mittelalters</a>");
Elements links = d.select("a");
Set<String> categories = new HashSet<>();
for (Element link : links) {
    String title = link.attr("title");
    if (title.length() > 0) {
        categories.add(title);
    }
}
System.out.println(categories);
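To pull just the category name out of such a title, you can strip the prefix (a small follow-up sketch; the "Kategorie:" prefix is an assumption based on the Wikipedia markup above):
for (String title : categories) {
    // Titles like "Kategorie:Herrscher des Mittelalters" carry the
    // category name after the "Kategorie:" prefix.
    if (title.startsWith("Kategorie:")) {
        System.out.println(title.substring("Kategorie:".length()));
    }
}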
You can use the getElementsContainingText() method (org.jsoup.nodes.Document) to search for elements containing a given text.
Elements elements = doc.getElementsContainingText("Herrscher des Mittelalters");
for (int i = 0; i < elements.size(); i++) {
    Element element = elements.get(i);
    System.out.println(element.text());
}
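Alternatively, you can match on the href value itself with an attribute selector (a sketch; a[href*=Kategorie] matches any anchor whose href contains the substring "Kategorie"):
// Select anchors whose href contains "Kategorie"; print href and link text.
Elements categoryLinks = doc.select("a[href*=Kategorie]");
for (Element link : categoryLinks) {
    System.out.println(link.attr("href") + " -> " + link.text());
}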
I am working on a Java RCP application. Whenever the user updates details in the UI, we are supposed to update the same details in an HTML report as well. Is there a way we can update/add HTML elements using Java? Using Jsoup I am able to get the required element by ID, but I am not able to insert/update a new element in it.
Document htmlFile = null;
try {
htmlFile = Jsoup.parse(new File("C:\\ItemDetails1.html"), "UTF-8");
} catch (IOException e) {
e.printStackTrace();
}
Element div = htmlFile.getElementById("row2_comment");
System.out.println("text: " + div.html());
div.html("<li><b>Comments</b></li><ul><li>Testing for comment</li></ul>");
Any thoughts?
Try:
Element div = htmlFile.getElementById("row2_comment");
div.appendElement("p").attr("class", "beautiful").text("Some New Text");
This adds a new paragraph with a style class and some text content.
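Note that Jsoup only changes the in-memory Document; to update the report on disk you still need to write it back out (a minimal sketch, assuming the same file path as above):
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

// ... after modifying the Document ...
Files.write(Paths.get("C:\\ItemDetails1.html"),
        htmlFile.outerHtml().getBytes(StandardCharsets.UTF_8));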
I seem to be having an error where text is written to a file twice, the first time with incorrect formatting and the second time with correct formatting. The method below takes in this URL after it has been converted properly. It is supposed to print a newline between the text conversions of all the children of divs that are children of the div "ffaq", where all the body text resides. Any help would be appreciated. I'm fairly new to using jsoup, so an explanation would be nice as well.
/**
 * Method to deal with HTML 5 GameFAQs entries.
 * @param url The location of the HTML 5 entry to read.
 */
public static void htmlDocReader(URL url) {
try {
Document doc = Jsoup.parse(url.openStream(), "UTF-8", url.toString());
//parse pagination label
String[] num = doc.select("div.span12").
select("ul.paginate").
select("li").
first().
text().
split("\\s+");
//get the max page number
final int max_pagenum = Integer.parseInt(num[num.length - 1]);
//create a new file based on the url path
File file = urlFile(url);
PrintWriter outFile = new PrintWriter(file, "UTF-8");
//Add every page to the text file
for(int i = 0; i < max_pagenum; i++) {
//if not the first page then change the url
if(i != 0) {
String new_url = url.toString() + "?page=" + i;
doc = Jsoup.parse(new URL(new_url).openStream(), "UTF-8",
new_url.toString());
}
Elements walkthroughs = doc.select("div.ffaq");
for(Element elem : walkthroughs.select("div")) {
for(Element inner : elem.children()) {
outFile.println(inner.text());
}
}
}
outFile.close();
} catch(Exception e) {
e.printStackTrace();
System.exit(1);
}
}
When you call text() on an element, you get the text of its entire subtree, children included.
Consider the example below:
<div>
text of div
<span>text of span</span>
</div>
If you call text() on the div element, you get:
text of div text of span
If you call text() on the span, you get:
text of span
What you need in order to avoid duplicates is ownText(). This gets only the direct text of the element, not the text of its children.
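A quick way to see the difference (a minimal sketch using the example markup above):
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

Document doc = Jsoup.parse("<div>text of div<span>text of span</span></div>");
System.out.println(doc.select("div").first().text());    // text of div text of span
System.out.println(doc.select("div").first().ownText()); // text of div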
Long story short, change this:
for(Element elem : walkthroughs.select("div")) {
for(Element inner : elem.children()) {
outFile.println(inner.text());
}
}
to this:
for(Element elem : walkthroughs.select("div")) {
for(Element inner : elem.children()) {
String line = inner.ownText().trim();
if(!line.equals("")) //Skip empty lines
outFile.println(line);
}
}
I have managed to extract the information in the "tables" on the right side of a Wikipedia article. However, I also want to get paragraphs from the main text of the articles.
The code I'm using at the moment only works about 60% of the time (NullPointerExceptions or no text at all). In the example below I'm only interested in the first two paragraphs, but that is irrelevant to my question.
In the picture below I show which parts I want the text from. I want to be able to iterate through all <p>...</p> parts in the <div id="mw-content-text" ... class="mw-content-ltr"> block.
StringBuilder sb = new StringBuilder();
String url = baseUrl + location;
Document doc = Jsoup.connect(url).get();
Elements paragraphs = doc.select(".mw-content-ltr p");
Element firstParagraph = paragraphs.first();
Element elementTwo = firstParagraph.nextElementSibling();
if (elementTwo == null) {
for (int i = 0; i < 2; i++) {
sb.append(paragraphs.get(i).text());
}
} else {
sb.append(elementTwo.text());
}
return sb.toString();
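For what it's worth, a more defensive variant (a sketch; the div#mw-content-text p selector and the emptiness checks are assumptions based on the description above):
Document doc = Jsoup.connect(baseUrl + location).get();
StringBuilder sb = new StringBuilder();
// Walk the paragraphs inside the main content block and keep the
// first two non-empty ones, so missing or empty elements are skipped.
int kept = 0;
for (Element p : doc.select("div#mw-content-text p")) {
    String text = p.text().trim();
    if (!text.isEmpty()) {
        sb.append(text);
        if (++kept == 2) {
            break;
        }
    }
}
return sb.toString();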
I've been trying to download the source code of the Google News RSS feed. It downloads correctly, except for the links, which come out mangled.
static String urlNotizie = "https://news.google.it/news/feeds?pz=1&cf=all&ned=it&hl=it&output=rss";
Document docHtml = Jsoup.connect(urlNotizie).get();
String html = docHtml.toString();
System.out.println(html);
Output:
<html>
<head></head>
<body>
<rss version="2.0">
<channel>
<generator>
NFE/1.0
</generator>
<title>Prima pagina - Google News</title>
<link />http://news.google.it/news?pz=1&ned=it&hl=it
<language>
it
</language>
<webmaster>
news-feedback@google.com
</webmaster>
<copyright>
©2013 Google
</copyright> [...]
Using a URLConnection I'm able to output the correct source of the page, but when parsing I get the same issue as above, where it spits out a list of empty <link /> tags. (Again, only with links; parsing other elements works fine.) URLConnection example:
URL u = new URL(urlNotizie);
URLConnection yc = u.openConnection();
StringBuilder builder = new StringBuilder();
BufferedReader reader = new BufferedReader(new InputStreamReader(
yc.getInputStream()));
String line;
while ((line = reader.readLine()) != null) {
builder.append(line);
builder.append("\n");
}
String html = builder.toString();
System.out.println("HTML " + html);
Document doc = Jsoup.parse(html);
Elements listaTitoli = doc.select("title");
Elements listaCategorie = doc.select("category");
Elements listaDescrizioni = doc.select("description");
Elements listaUrl = doc.select("link");
System.out.println(listaUrl);
Jsoup is designed as an HTML parser, not as an XML (or RSS) parser.
The HTML <link> element is specified as not having any body, so it would be invalid for a <link> element to have a body as in your XML.
You can parse XML using Jsoup, but you need to explicitly tell it to switch to XML parsing mode.
Replace
Document docHtml = Jsoup.connect(urlNotizie).get();
by
Document docXml = Jsoup.connect(urlNotizie).parser(Parser.xmlParser()).get();
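With the XML parser in place, <link> elements keep their text content and can be read directly (a short sketch; Parser comes from org.jsoup.parser, and the item/title/link element names follow the RSS 2.0 structure shown above):
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.parser.Parser;

Document docXml = Jsoup.connect(urlNotizie).parser(Parser.xmlParser()).get();
for (Element item : docXml.select("item")) {
    // In XML mode <link> is an ordinary element, so text() returns the URL.
    System.out.println(item.select("title").text());
    System.out.println(item.select("link").text());
}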