Extracting subcontents of a link - Java

How can I extract the "read more" part of this news? When I use jsoup it only gives the contents before the "read more" part. I want to extract the entire contents of the news.
Scanner sc = new Scanner(System.in);
String code = sc.nextLine();
Document doc = Jsoup.connect("http://ieee-link.org/category/events/" + code + "/").get();
Elements els = doc.select("div.entry");
System.out.println(els.text());

The "read more" part seems to contain a link. You can extract the target of that link and fetch that URL as well with Jsoup:
Elements els = doc.select("div.entry");
// inside each el we can find something like:
// <a class="more-link" href="http://ieee-link.org/renesas/">Read More »</a>
for (Element el : els) {
    Element anchor = el.selectFirst("a.more-link");
    if (anchor != null) {
        Document moreDoc = Jsoup.connect(anchor.attr("href")).get();
        System.out.println(moreDoc);
    } else {
        System.out.println(el);
    }
}
Note that this code is written off the top of my head; some method names might be wrong, and the spelling is also questionable.

Related

Jsoup parser not working as expected for particular URL only

I am using Jsoup to download the page content and then parse it.
public static void main(String[] args) throws IOException {
    Document document = Jsoup.connect("http://www.toysrus.ch/product/index.jsp?productId=89689681").get();
    final Elements elements = document.select("dt:contains(" + "EAN/ISBN:" + ")");
    System.out.println(elements.size());
}
The problem: if you view the page source, there is a <dt> tag which contains the text EAN/ISBN:, but if you run the above code it prints 0, while it should print 1. I have already checked the HTML using document.html(); the tags are there, but the tag I want has been replaced by escaped characters like &lt;dt&gt; instead of <dt>. The same code works for other product URLs from the same site.
I have already worked with Jsoup and developed many parsers, but I cannot see why the very simple code above is not working. It's strange! Is it a Jsoup bug? Can anybody help me?
When using connect() or parse(), jsoup will by default expect valid HTML and normalize the input automatically if needed. You may try the XML parser instead.
public static void main(String[] args) throws IOException {
    String url = "http://www.toysrus.ch/product/index.jsp?productId=89689681";
    Document document = Jsoup.parse(new URL(url).openStream(), "UTF-8", "", Parser.xmlParser());
    //final Elements elements = document.select("dt:contains(" + "EAN/ISBN:" + ")");
    // the same as above but more readable:
    final Elements elements = document.getElementsMatchingOwnText("EAN/ISBN");
    System.out.println(elements.size());
}
You need to put single quotes around the 'EAN/ISBN:' value; otherwise it will be interpreted as a variable.
Also, there is no need to break the string up and concatenate the pieces; just put the whole thing in one string.
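For illustration, here is roughly how both suggestions read, continuing from the document above; whether the quotes are required may depend on the jsoup version, so treat this as a sketch:
// One string literal instead of concatenated pieces:
final Elements elements = document.select("dt:contains(EAN/ISBN:)");
// Or, with the value quoted as suggested above:
// final Elements elements = document.select("dt:contains('EAN/ISBN:')");
System.out.println(elements.size());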

Jsoup: null result in absUrl (abs:)

I tried to make an image-link downloader with jsoup. I wrote the part that downloads the HTML code, but when I got to the parsing part I noticed that some image links appear without their base part. So I found the absUrl solution, but for some reason it did not work (it gave me null). I then tried uri.resolve(), but it returned the input unchanged. Now I do not know how to solve it. I have attached the part of my code responsible for parsing and writing the URLs to a string:
public static String finalcode(String textin) throws Exception {
    String text = source(textin);
    Document doc = Jsoup.parse(text);
    Elements images = doc.getElementsByTag("img");
    StringBuilder src = new StringBuilder();
    for (Element image : images) {
        String href = image.attr("src");
        src.append(href);
        src.append("\n");
    }
    return src.toString();
}
It looks like you are parsing the HTML from a String, not from a URL. Because of that, jsoup can't know which URL this HTML code comes from, so it can't create absolute paths.
To set this URL for the Document you should parse it using the Jsoup.parse(String html, String baseUri) version, like:
String url = "http://server/pages/document.html";
String text = "<img src='../images/image_name1.jpg'/><img src='../images/image_name2.jpg'/>";
Document doc = Jsoup.parse(text, url);
Elements images = doc.getElementsByTag("img");
for (Element image : images) {
    System.out.println(image.attr("src") + " -> " + image.attr("abs:src"));
}
Output:
../images/image_name1.jpg -> http://server/images/image_name1.jpg
../images/image_name2.jpg -> http://server/images/image_name2.jpg
The other option would be to let Jsoup parse the page directly by supplying the URL instead of a String with the HTML:
Document doc = Jsoup.connect("http://example.com").get();
This way the Document will know which URL it came from, so it will be able to create absolute paths.
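For example, a minimal sketch (the URL is a placeholder) of reading absolute image URLs from a connected Document:
Document doc = Jsoup.connect("http://example.com").get();
for (Element image : doc.getElementsByTag("img")) {
    // absUrl("src") is equivalent to attr("abs:src")
    System.out.println(image.absUrl("src"));
}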

How to get link from ArrayList filled by Jsoup

I am trying to parse a website. After all the links are collected into an ArrayList, I want to parse them again, but I am having trouble initializing them.
This is my ArrayList:
public ArrayList<String> linkList = new ArrayList<String>();
How I collect links in "doInBackground":
try {
    Document doc = Jsoup.connect("http://forurl.com/archive/").get();
    Element links = doc.select("a[href]");
    for (Element link : links) {
        linkList.add(link.attr("abs:href"));
    }
}
In "onPostExecute" showing what I get:
lk.setText("Collected: " +linkList.size()); // showing how much is collected
lj.setText("First link: " +linkList.get(0)); // showing first link
Then I try to parse the child links:
public class imgTread extends AsyncTask<Void, Void, Void> {
    Bitmap bitmap;
    String[] url = {"http://forurl.com/link1/",
                    "http://forurl.com/link2/"}; // this way works well

    @Override
    protected Void doInBackground(Void... params) {
        try {
            for (int i = 0; i < url.length; i++) {
                Document doc1 = Jsoup.connect(url[0]).get(); // connect to link 1, for example
                Elements img = doc1.select("#strip");
                String imgSrc = img.attr("src");
                InputStream input = new java.net.URL(imgSrc).openStream();
                bitmap = BitmapFactory.decodeStream(input);
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
        return null;
    }
}
I am trying to make a String[] from the ArrayList, but it doesn't work.
String[] url = linkList.toArray(new String[linkList.size()]);
The output this way will be something like [Ljava.lang.String;@45ds364
The idea is: 1) collect all the links from the URL; 2) connect to them one by one and get the information I need.
The first point works, and so does the second, but how do I tie them together?
Thanks for any advice.
Working code:
Document doc = Jsoup.connect(url).get();            // connect to the site
Elements links = doc.select("a[href]");             // get all links
String link_addr = links.get(3).attr("abs:href");   // choose the link at index 3
Document link_doc = Jsoup.connect(link_addr).get(); // connect to it
Elements img = link_doc.select("#strip");           // select the element with id "strip"
String imgSrc = img.attr("src");                    // get the image URL
InputStream input = new java.net.URL(imgSrc).openStream();
bitmap = BitmapFactory.decodeStream(input);
I hope this helps someone.
You are doing many unnecessary steps. You have a perfectly fine collection of Element objects in your Elements links object. Why do you have to add them to an ArrayList?
If I have understood your question correctly, your thought process should be something like this.
Get all the links from a URL
Establish a new connection to each link
Download all the images on that page where the element id = "strip".
Get all the links:
Document doc = Jsoup.connect(url).get();
Elements links = doc.select("a[href]");
doInBackground(links);
Call the doInBackground method with the links as a parameter:
public static void doInBackground(Elements links) throws IOException {
    try {
        for (Element element : links) {
            //Connect to each link
            Document doc = Jsoup.connect(element.attr("abs:href")).get();
            //Select all the elements with ID 'strip'
            Elements img = doc.select("#strip");
            //Get the source of the image
            String imgSrc = img.attr("abs:src");
            //Open an InputStream
            InputStream input = new java.net.URL(imgSrc).openStream();
            //Get the image
            bitmap = BitmapFactory.decodeStream(input);
            ...
            //Perhaps save the image somewhere?
            //Close the InputStream
            input.close();
        }
    } catch (IllegalArgumentException e) {
        System.out.println(e.getMessage());
    } catch (MalformedURLException ex) {
        System.out.println(ex.getMessage());
    }
}
Of course, you will have to properly use AsyncTask and call the methods from preferred places, but this is the overall idea of how you can use Jsoup to do the job you want it to.
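For instance, a rough sketch of how the pieces might sit inside an AsyncTask; the class name ImageTask is hypothetical and error handling is trimmed:
// Hypothetical skeleton, not the asker's exact classes
public class ImageTask extends AsyncTask<String, Void, Elements> {
    @Override
    protected Elements doInBackground(String... urls) {
        try {
            // Fetch the archive page and collect its links off the UI thread
            Document doc = Jsoup.connect(urls[0]).get();
            return doc.select("a[href]");
        } catch (IOException e) {
            return new Elements(); // empty result on failure
        }
    }

    @Override
    protected void onPostExecute(Elements links) {
        // Runs on the UI thread; safe to update views here,
        // e.g. lk.setText("Collected: " + links.size());
    }
}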
If you want to create an array instead of a list you can do:
try {
    Document doc = Jsoup.connect("http://forurl.com/archive/").get();
    Elements links = doc.select("a[href]");
    String[] array = new String[links.size()];
    for (int i = 0; i < links.size(); i++) {
        array[i] = links.get(i).attr("abs:href");
    }
} catch (IOException e) {
    e.printStackTrace();
}
First of all, what the hell is that?
Why is links, which holds the collection of Elements returned by doc.select(), declared as a single Element, while link is a single member of that collection? Isn't that confusing?
Second, keep to the Java naming convention and name variables with a lowercase first letter, so change LinkList to linkList. Even the syntax highlighter went crazy thanks to you.
Third:
"I am trying to make a String[] from the ArrayList, but it doesn't work."
Where are you trying to do that, and how does it not work? I don't see it anywhere in the code.
Fourth:
To create an array out of a List you have to do something like this:
String[] links = linkList.toArray(new String[linkList.size()]);
Fifth:
Change the topic to something more appropriate, as the present one is very misleading (you have no trouble with Jsoup here).
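As an aside, the [Ljava.lang.String;@45ds364 output in the question is just the array's default toString(), not an error; a quick sketch to see the actual contents:
String[] links = linkList.toArray(new String[linkList.size()]);
// java.util.Arrays.toString prints the elements instead of type@hash
System.out.println(java.util.Arrays.toString(links));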

How to get all links (<a href>) in a URL

I am given a URL and I need to find all the links in it and just show them, that's all.
I wrote this in Java:
PrintWriter writer = new PrintWriter("Web.txt");
URL oracle = new URL("http://edition.cnn.com/");
BufferedReader in = new BufferedReader(
        new InputStreamReader(oracle.openStream()));
String inputLine;
while ((inputLine = in.readLine()) != null) {
    writer.println(inputLine);
    System.out.println(inputLine);
}
in.close();
writer.close();
Now my question is: how can I find only the links in this huge file?
I thought about matching <a href" ... ... ..> but that's not always right.
Thanks
Jsoup is the way to go! It's a Java API with which you can parse HTML documents (either local or external ones) and navigate their DOM structure using a jQuery-like syntax.
Your code to get all the links should look something like this:
Document doc = Jsoup.connect("http://edition.cnn.com").get(); // Parse this URL's HTML
Elements elements = doc.select("a"); // Search for all <a> elements
Then, to list every link and save it to your file:
for (Element element : elements) {
    writer.println(element.attr("href")); // Get the "href" attribute from the element
}
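Put together, a self-contained sketch of the two snippets above might look like this (same URL and file name as the question):
import java.io.PrintWriter;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class LinkLister {
    public static void main(String[] args) throws Exception {
        Document doc = Jsoup.connect("http://edition.cnn.com").get();
        try (PrintWriter writer = new PrintWriter("Web.txt")) {
            for (Element element : doc.select("a")) {
                writer.println(element.attr("href")); // use "abs:href" for absolute URLs
            }
        }
    }
}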

HTML Parser fetch link text

I'm using HTML Parser to fetch links from a web page. I need to store the URL, the link text, and the URL of the parent page containing the link. I have managed to get the link URL as well as the parent URL.
I still need to get the link text:
<a href="...">link text</a>
Unfortunately I'm having a hard time figuring it out; any help would be greatly appreciated.
public static List<LinkContainer> findUrls(String resource) {
    String[] tagNames = {"A", "AREA"};
    List<LinkContainer> urls = new ArrayList<LinkContainer>();
    Tag tag;
    String url;
    String sourceUrl;
    try {
        for (String tagName : tagNames) {
            Parser parser = new Parser(resource);
            NodeList nodes = parser.parse(new TagNameFilter(tagName));
            NodeIterator i = nodes.elements();
            while (i.hasMoreNodes()) {
                tag = (Tag) i.nextNode();
                url = tag.getAttribute("href");
                sourceUrl = tag.getPage().getUrl();
                if (RegexUtil.verifyUrl(url)) {
                    urls.add(new LinkContainer(url, null, sourceUrl));
                }
            }
        }
    } catch (ParserException pe) {
        pe.printStackTrace();
    }
    return urls;
}
Have you tried ((LinkTag) tag).getLinkText()? Personally, I prefer an HTML parser which produces XML according to a well-used standard, e.g., Xerces or similar. This is what you get from using, e.g., http://nekohtml.sourceforge.net/.
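Assuming the nodes really are LinkTag instances, that first suggestion would slot into the question's loop roughly like this (the LinkContainer argument order follows the null placeholder in the question):
tag = (Tag) i.nextNode();
url = tag.getAttribute("href");
sourceUrl = tag.getPage().getUrl();
if (RegexUtil.verifyUrl(url) && tag instanceof LinkTag) {
    // getLinkText() returns the text between <a> and </a>
    String linkText = ((LinkTag) tag).getLinkText();
    urls.add(new LinkContainer(url, linkText, sourceUrl));
}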
You would need to check the children of each A tag. If you assume that your A tags only have a single child (the text itself), you can use the getFirstChild() method. This should be an instance of TextNode, and you can call getText() on it to get the link text.
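A sketch of that second approach, with the same caveat that it assumes the anchor has exactly one text child:
Node child = tag.getFirstChild();
if (child instanceof TextNode) {
    // The single child is the link text itself
    String linkText = ((TextNode) child).getText();
    urls.add(new LinkContainer(url, linkText, sourceUrl));
}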
