get favicon from html (JSOUP) - java

How can I get the icon path from an HTML string using Jsoup?
I found two different ways to add a favicon to a webpage (searching in Google).
The first one I can get using doc.select("html head meta"), but I can't select the link tag.

Get the file name from the head element:
Connection con2 = Jsoup.connect(url);
Document doc = con2.get();
Element e = doc.head().select("link[href~=.*\\.ico]").first();
String iconUrl = e.attr("href"); // renamed: the original re-declared "url", which clashes with the variable passed to connect()
http://jsoup.org/cookbook/extracting-data/attributes-text-html
http://jsoup.org/cookbook/extracting-data/selector-syntax

As Uwe Plonus pointed out in the comments, you can always get the favicon from <website>/favicon.ico, e.g. Google's favicon at https://www.google.com/favicon.ico.

It's pretty late to submit an answer, but the correct way to check is via the "rel" attribute:
public boolean checkFevicon() {
    Elements e = doc.head().select("link[rel=shortcut icon]");
    return !e.isEmpty();
}
jQuery equivalent
$("link[rel='shortcut icon']")

Related

Jsoup get text in paragraph and sub tags

I have an HTML page on which I wish to run "specialised" TTS, for example:
<h3>Title <u>Page<u> by Ada Lovelace</h3>
I want to read "Title" and "Page" in different ways. When I use:
Element body = doc.body();
Elements elements = body.select("*");
for (Element element : elements) {
    if (!element.ownText().equals("") && element.hasText()) {
        Log.d("Epub", element.tagName() + " " + element.ownText());
    }
}
I get the Log output as:
h3 Title by Ada Lovelace
u Page
I want to get the data as:
h3 Title
u Page
h3 by Ada Lovelace
I do not have any access to the HTML files.
Any help is appreciated, thanks in advance!
[EDIT]
So I figured out a way to do it, but instead of Jsoup I used an XML pull parser:
private ArrayList<String> stackOfTags = new ArrayList<String>();
private int indexOfTags = -1;

private void prepareTextToSpeech__onHold() {
    String opening_tag = "";
    try {
        XmlPullParser parser = prepareText__onHold();
        int eventType = parser.getEventType();
        while (eventType != XmlPullParser.END_DOCUMENT) {
            switch (eventType) {
                case XmlPullParser.START_TAG:
                    // Remember the enclosing tag for any text that follows.
                    opening_tag = parser.getName();
                    stackOfTags.add(parser.getName());
                    indexOfTags++;
                    break;
                case XmlPullParser.TEXT:
                    String temp = parser.getText();
                    // Only keep text that contains letters, and skip script contents.
                    if (temp.matches(".*[a-zA-Z]+.*") && !opening_tag.equals("script")) {
                        contentMap.addItemInMap(opening_tag, parser.getText());
                        Log.d("Epub", stackOfTags.get(indexOfTags) + " " + parser.getText());
                    }
                    break;
                case XmlPullParser.END_TAG:
                    stackOfTags.remove(indexOfTags);
                    indexOfTags--;
                    break;
            }
            eventType = parser.next();
        }
    } catch (Exception e) {
        Log.d("Epub", e.getMessage());
    }
}
This however only works on well-formed HTML. In the event that it is not, can someone help?
I think the original HTML is <h3>Title <u>Page</u> by Ada Lovelace</h3>
If this is the case, your HTML seems well formed. Jsoup allows you to read the contents of each TextNode, so you can very well read out "Title", "Page" and "by Ada Lovelace" as different strings.
I don't have a running Java environment with me right now, so I can't provide working code, but here are pointers to sources that tell you how it is done:
How to extract separate text nodes with Jsoup?
Jsoup - extracting text
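For illustration, here is a minimal sketch of that approach (assuming a reasonably recent Jsoup, where NodeTraversor.traverse is a static method). It walks the text nodes in document order and prints each one with its parent tag, which yields exactly the h3/u/h3 split asked for above:

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.nodes.Node;
import org.jsoup.nodes.TextNode;
import org.jsoup.select.NodeTraversor;
import org.jsoup.select.NodeVisitor;

public class TextNodeDemo {
    public static void main(String[] args) {
        Document doc = Jsoup.parse("<h3>Title <u>Page</u> by Ada Lovelace</h3>");
        NodeTraversor.traverse(new NodeVisitor() {
            @Override
            public void head(Node node, int depth) {
                if (node instanceof TextNode) {
                    String text = ((TextNode) node).text().trim();
                    Node parent = node.parent();
                    if (!text.isEmpty() && parent instanceof Element) {
                        // Prints: "h3 Title", "u Page", "h3 by Ada Lovelace"
                        System.out.println(((Element) parent).tagName() + " " + text);
                    }
                }
            }
            @Override
            public void tail(Node node, int depth) { }
        }, doc.body());
    }
}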

Use JSoup to get all textual links

I'm using JSoup to grab content from web pages.
I want to get all the links on a page that contain some text (it doesn't matter what the text is); it just needs to be non-empty, not just an image, etc.
Example of links I want:
<a href="...">Link to Some Page</a>
since it contains the text "Link to Some Page".
Links I don't want:
<a href="..."><img src="someimage.jpg"/></a>
My code looks like this. How can I modify it to only get the first type of link?
Document document = // I get my document object
Elements linksOnPage = document.select("a[href]");
for (Element page : linksOnPage) {
    String link = page.attr("abs:href");
    // I do stuff with the link
}
You could do something like this.
It does its job, though it's probably not the fanciest solution out there.
Note: the text() function gets you clean text, so if there are any HTML code fragments inside it, it won't return them.
Document doc = // get the doc
Elements linksOnPage = doc.select("a");
for (Element pageElem : linksOnPage) {
    if (pageElem.text().trim().isEmpty())
        continue;
    String link = pageElem.attr("abs:href");
    // do smth with it
}
I am using this and it's working fine:
Document document = // I get my document object
Elements linksOnPage = document.select("a:matches(([^\\s]+))");
for (Element page : linksOnPage) {
    String link = page.attr("abs:href");
    // I do stuff with the link
}

Jsoup Scraping HTML dynamic content

I'm new to Jsoup and I have been trying to write a small program that gets the names of the items in a Steam inventory using Jsoup.
public Element getItem(String user) throws IOException {
    Document doc = Jsoup.connect("http://steamcommunity.com/id/" + user + "/inventory").get();
    Element element = doc.getElementsByClass("hover_item_name").first();
    return element;
}
This method returns:
<h1 class="hover_item_name" id="iteminfo0_item_name"></h1>
and I want the information between the "h1" tags, which is generated when you click on a specific item in the window.
Thank you in advance.
You can use the .select(String cssQuery) method:
doc.select("h1") gives you all h1 Elements.
If you need the actual text in these tags, use .text() on each Element.
If you need an attribute like class or id, use .attr(String attributeKey) on an Element, e.g.:
doc.getElementsByClass("hover_item_name").first().attr("id")
gives you "iteminfo0_item_name".
But if you need to perform clicks on a website, you can't do that with Jsoup, since Jsoup is an HTML parser and not a browser replacement; it can't handle dynamic content.
What you could do instead is first scrape the relevant data from your h1 tags and then send a separate .post() request that replicates the site's AJAX call.
If you'd rather use a real webdriver, have a look at Selenium.
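As an illustration of that idea, here is a hypothetical sketch: the endpoint URL is an assumption (copy the real AJAX request URL from your browser's network tab while the inventory page loads), but the Jsoup calls are standard.

import org.jsoup.Jsoup;

public class AjaxFetchSketch {
    // endpointUrl is hypothetical: find the real request in your browser's
    // developer tools and pass it in here.
    public static String fetchJson(String endpointUrl) throws java.io.IOException {
        return Jsoup.connect(endpointUrl)
                .ignoreContentType(true) // the response is JSON, not HTML
                .execute()
                .body();                 // raw response body as a String
    }
}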
Use .text() and return a String, i.e.:
public String getItem(String user) throws IOException {
    Document doc = Jsoup.connect("http://steamcommunity.com/id/" + user + "/inventory").get();
    Element element = doc.getElementsByClass("hover_item_name").first();
    String text = element.text(); // throws NPE if no such element; see the caveat about dynamic content above
    return text;
}

How to Get Crawl content in Crawljax

I have crawled a dynamic webpage using Crawljax. I am able to get the current crawl id, status and DOM, but I can't get the website content. Can anyone help me?
CrawljaxConfigurationBuilder builder =
        CrawljaxConfiguration.builderFor("http://demo.crawljax.com/");
builder.addPlugin(new OnNewStatePlugin() {
    @Override
    public String toString() {
        return "Our example plugin";
    }

    @Override
    public void onNewState(CrawlerContext cc, StateVertex sv) {
        LOG.info("Found a new dom! Here it is:\n{}", cc.getBrowser().getStrippedDom());
        String name = cc.getCurrentState().getName();
        String url = cc.getBrowser().getCurrentUrl();
        System.out.println(cc.getCurrentState().getDom());
        System.out.println("New State: " + name + "; url: " + url);
    }
});
CrawljaxRunner crawljax = new CrawljaxRunner(builder.build());
crawljax.call();
How do I get the dynamic/JavaScript webpage content?
I am able to get the website source code with
cc.getBrowser().getStrippedDom(); or cc.getCurrentState().getDocument();
but these return the source code (CSS/JavaScript files).
Not possible, because Crawljax is a testing tool. It only checks that text is available and assigns temporary data to fields.
To get the website content, use the following function:
cc.getCurrentState().getDom()
Despite its name, this function does not return a DOM node; it returns the page's HTML as text, so getDom is something of a misnomer. It is, however, the right function to use if you want the page content. To get a DOM node instead, use:
cc.getCurrentState().getDocument()
which returns the Document DOM node.
You can retrieve the page content with:
cc.getCurrentState().getDocument().getTextContent()
(EDIT: This won't work -- getTextContent always returns null when called on Documents.)
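Since getDom() returns the HTML as a String, a hedged workaround (my own suggestion, assuming you want the visible text) is to hand it to Jsoup, which this page is about anyway:

// Parse the HTML string that Crawljax captured and extract the visible text.
String html = cc.getCurrentState().getDom();
String text = Jsoup.parse(html).text();
System.out.println(text);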

How to Crawl Over a Single Website using Jsoup?

I am starting from a website's homepage. I parse the entire web page, collect all the links on that homepage, and put them in a queue. Then I remove each link from the queue and do the same thing until I get the text that I want. However, if I get a link like youtube.com/something, then I end up following all the links on YouTube. I want to restrict this.
I want to crawl within the same domain only. How do I do that?
private void crawler() throws IOException {
    while (!q.isEmpty()) {
        String link = q.remove();
        System.out.println("------" + link);
        Document doc = Jsoup.connect(link).ignoreContentType(true).timeout(0).get();
        if (doc.text().contains("publicly intoxicated behavior or persistence")) {
            System.out.println("************ On this page ******************");
            System.out.println(doc.text());
            return;
        }
        Elements links = doc.select("a[href]");
        for (Element link1 : links) {
            String absUrl = link1.attr("abs:href");
            if (absUrl == null || absUrl.length() == 0) {
                continue;
            }
            // System.out.println(absUrl);
            q.add(absUrl);
        }
    }
}
This article shows how to write a web crawler. The following line forces all crawled links to be on the mit.edu domain:
if (link.attr("href").contains("mit.edu"))
There might be a problem with that line, though, since relative URLs won't contain the domain. Adding abs: is safer:
if (link.attr("abs:href").contains("mit.edu"))
