I want to use Jsoup to extract the first link from Google search results. For example, if I search for "apple" on Google, the first link I see is www.apple.com/. How do I return the first link? I am currently able to extract all links using Jsoup:
new Thread(new Runnable() {
@Override
public void run() {
final StringBuilder stringBuilder = new StringBuilder();
try {
Document doc = Jsoup.connect(sharedURL).get();
String title = doc.title();
Elements links = doc.select("a[href]");
stringBuilder.append(title).append("\n");
for (Element link : links) {
stringBuilder.append("\n").append(" ").append(link.text()).append(" ").append(link.attr("href")).append("\n");
}
} catch (IOException e) {
stringBuilder.append("Error : ").append(e.getMessage()).append("\n");
}
runOnUiThread(new Runnable() {
@Override
public void run() {
// set text
textView.setText(stringBuilder.toString());
}
});
}
}).start();
Do you mean:
Element firstLink = doc.select("a[href]").first();
It works for me. If you meant something else, let us know. I checked the search results, and they are tough to decipher, as there are so many types of results that come back: maps, news, ads, etc.
I tidied up the code a little with the use of Java lambdas:
public static void main(String[] args) {
new Thread(() -> {
final StringBuilder stringBuilder = new StringBuilder();
try {
String sharedUrl = "https://www.google.com/search?q=apple";
Document doc = Jsoup.connect(sharedUrl).get();
String title = doc.title();
Elements links = doc.select("a[href]");
Element firstLink = links.first(); // <<<<< NEW ADDITION
stringBuilder.append(title).append("\n");
for (Element link : links) {
stringBuilder.append("\n")
.append(" ")
.append(link.text())
.append(" ")
.append(link.attr("href"))
.append("\n");
}
} catch (IOException e) {
stringBuilder.append("Error : ").append(e.getMessage()).append("\n");
}
// replaced some of this for running/testing locally
SwingUtilities.invokeLater(() -> System.out.println(stringBuilder.toString()));
}).start();
}
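As for isolating the first organic result: Google's markup changes frequently and differs by user agent, so any selector is a guess. A hedged sketch, assuming the plain HTML results page where organic hits are wrapped in "/url?q=..." redirect links:

import java.io.IOException;
import java.net.URLDecoder;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class FirstResult {
    public static void main(String[] args) throws IOException {
        Document doc = Jsoup.connect("https://www.google.com/search?q=apple")
                .userAgent("Mozilla/5.0")
                .get();
        // assumption: organic results use the "/url?q=<target>&..." redirect pattern;
        // this breaks whenever Google changes its markup, so treat it as a starting point
        Element first = doc.select("a[href^=/url?q=]").first();
        if (first != null) {
            String href = first.attr("href"); // e.g. /url?q=https://www.apple.com/&sa=U
            int amp = href.indexOf('&');
            String target = href.substring("/url?q=".length(), amp > 0 ? amp : href.length());
            System.out.println(URLDecoder.decode(target, "UTF-8"));
        }
    }
}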
I want to get the titles from this website: http://feeds.foxnews.com/foxnews/latest
like this example:
<title><![CDATA[SUCCESSFUL INTERCEPT Pentagon confirms it shot down ICBM-type target]]></title>
and it will show text like this:
"SUCCESSFUL INTERCEPT Pentagon confirms it shot down ICBM-type target
US conducts successful missile intercept test, Pentagon says"
Here's my code. I have used the Jaunt library.
I don't know why it shows only the text "foxnew.com".
import com.jaunt.JauntException;
import com.jaunt.UserAgent;
public class p8_1
{
public static void main(String[] args)
{
try
{
UserAgent userAgent = new UserAgent();
userAgent.visit("http://feeds.foxnews.com/foxnews/latest");
String title = userAgent.doc.findFirst
("<title><![CDATA[SUCCESSFUL INTERCEPT Pentagon confirms it shot down ICBM-type target]]></title>").getText();
System.out.println("\n " + title);
} catch (JauntException e)
{
System.err.println(e);
}
}
}
Search for element types, not values.
Try the following to get the title text of each item in the feed:
public static void main(String[] args) {
try {
UserAgent userAgent = new UserAgent();
userAgent.visit("http://feeds.foxnews.com/foxnews/latest");
Elements items = userAgent.doc.findEach("<item>");
Elements titles = items.findEach("<title>");
for (Element title : titles) {
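// the CDATA section is exposed as a comment node in Jaunt, hence getComment(0)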
String titleText = title.getComment(0).getText();
System.out.println(titleText);
}
} catch (JauntException e) {
System.err.println(e);
}
}
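For comparison, the same titles can be pulled without Jaunt. A sketch using Jsoup's XML parser (assuming Jsoup 1.11 or newer, where text() unwraps CDATA sections):

import java.io.IOException;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.parser.Parser;

public class FeedTitles {
    public static void main(String[] args) throws IOException {
        // parse the RSS feed as XML rather than HTML so the tag structure is kept verbatim
        Document doc = Jsoup.connect("http://feeds.foxnews.com/foxnews/latest")
                .parser(Parser.xmlParser())
                .get();
        for (Element title : doc.select("item > title")) {
            // text() returns the CDATA content of each <title> element
            System.out.println(title.text());
        }
    }
}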
The context is as follows:
I've got objects that represent Tweets (from Twitter). Each object has an id, a date and the id of the original tweet (if there was one).
I receive a file of tweets (where each tweet is in the format 05/04/2014 12:00:00, tweetID, originalID and is on its own line) and I want to save them as an XML file where each field has its own tag.
I want to then be able to read the file and return a list of Tweet objects corresponding to the Tweets from the XML file.
After writing the XML parser that does this I want to test that it works correctly. I've got no idea how to test this.
The XML Parser:
public class TweetToXMLConverter implements TweetImporterExporter {
//there is a single file used for the tweets database
static final String xmlPath = "src/main/resources/tweetsDataBase.xml";
//some "defines", as we like to call them ;)
static final String DB_HEADER = "tweetDataBase";
static final String TWEET_HEADER = "tweet";
static final String TWEET_ID_FIELD = "id";
static final String TWEET_ORIGIN_ID_FIELD = "originalTweet"; // XML names must not contain spaces
static final String TWEET_DATE_FIELD = "tweetDate";
static File xmlFile;
static boolean initialized = false;
@Override
public void createDB() {
try {
Element tweetDB = new Element(DB_HEADER);
Document doc = new Document(tweetDB);
doc.setRootElement(tweetDB);
XMLOutputter xmlOutput = new XMLOutputter();
// pretty-print the output
xmlOutput.setFormat(Format.getPrettyFormat());
xmlOutput.output(doc, new FileWriter(xmlPath));
xmlFile = new File(xmlPath);
initialized = true;
} catch (IOException io) {
System.out.println(io.getMessage());
}
}
@Override
public void addTweet(Tweet tweet) {
if (!initialized) {
//TODO throw an exception? should not come to pass!
return;
}
SAXBuilder builder = new SAXBuilder();
try {
Document document = builder.build(xmlFile);
Element newTweet = new Element(TWEET_HEADER);
newTweet.setAttribute(new Attribute(TWEET_ID_FIELD, tweet.getTweetID()));
newTweet.setAttribute(new Attribute(TWEET_DATE_FIELD, tweet.getDate().toString()));
if (tweet.isRetweet())
newTweet.addContent(new Element(TWEET_ORIGIN_ID_FIELD).setText(tweet.getOriginalTweet()));
document.getRootElement().addContent(newTweet);
// persist the change: without writing the document back, the added tweet is lost
XMLOutputter xmlOutput = new XMLOutputter();
xmlOutput.setFormat(Format.getPrettyFormat());
xmlOutput.output(document, new FileWriter(xmlPath));
} catch (IOException io) {
System.out.println(io.getMessage());
} catch (JDOMException jdomex) {
System.out.println(jdomex.getMessage());
}
}
//break glass in case of emergency
@Override
public void addListOfTweets(List<Tweet> list) {
for (Tweet t : list) {
addTweet(t);
}
}
@Override
public List<Tweet> getListOfTweets() {
if (!initialized) {
//TODO throw an exception? should not come to pass!
return null;
}
try {
SAXBuilder builder = new SAXBuilder();
Document document = builder.build(xmlFile);
List<Tweet> $ = new ArrayList<Tweet>();
for (Object o : document.getRootElement().getChildren(TWEET_HEADER)) {
Element rawTweet = (Element) o;
String id = rawTweet.getAttributeValue(TWEET_ID_FIELD);
String original = rawTweet.getChildText(TWEET_ORIGIN_ID_FIELD);
Date date = new Date(rawTweet.getAttributeValue(TWEET_DATE_FIELD));
$.add(new Tweet(id, original, date));
}
return $;
} catch (JDOMException e) {
// TODO Auto-generated catch block
e.printStackTrace();
} catch (IOException e) {
// TODO Auto-generated catch block
e.printStackTrace();
}
return null;
}
}
Some usage:
private TweetImporterExporter converter;
List<Tweet> tweetList = converter.getListOfTweets();
for (String tweetString : lines)
converter.addTweet(new Tweet(tweetString));
How can I make sure that the XML file I read (that contains tweets) corresponds to the file I receive (in the form stated above)?
How can I make sure the tweets I add to the file correspond to the ones I tried to add?
Assuming that you have the following model:
public class Tweet {
private Long id;
private Date date;
private Long originalTweetid;
//getters and setters
}
The process would be the following:
create an instance of TweetToXMLConverter
create a list of Tweet instances that you expect to receive after parsing the file
feed the converter the list you generated
compare the list produced by parsing the file with the list you created at the start of the test
public class MainTest {
private TweetToXMLConverter converter;
private List<Tweet> tweets = new ArrayList<>(); // must be initialized, or setup() throws a NullPointerException
@Before
public void setup() {
Tweet tweet = new Tweet(1, "05/04/2014 12:00:00", 2);
Tweet tweet2 = new Tweet(2, "06/04/2014 12:00:00", 1);
Tweet tweet3 = new Tweet(3, "07/04/2014 12:00:00", 2);
tweets.add(tweet);
tweets.add(tweet2);
tweets.add(tweet3);
converter = new TweetToXMLConverter();
converter.addListOfTweets(tweets);
}
@Test
public void testParse() {
List<Tweet> parsedTweets = converter.getListOfTweets();
Assert.assertEquals(parsedTweets.size(), tweets.size());
for (int i=0; i<parsedTweets.size(); i++) {
//assuming that both lists are sorted
Assert.assertEquals(parsedTweets.get(i), tweets.get(i));
}
}
}
I am using JUnit for the actual testing.
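One caveat: Assert.assertEquals on two Tweet objects only passes if Tweet overrides equals (JUnit falls back to reference equality otherwise). A sketch of the methods you would add to the assumed Tweet model above:

import java.util.Objects;

// inside the Tweet class
@Override
public boolean equals(Object o) {
    if (this == o) return true;
    if (!(o instanceof Tweet)) return false;
    Tweet other = (Tweet) o;
    return Objects.equals(id, other.id)
            && Objects.equals(date, other.date)
            && Objects.equals(originalTweetid, other.originalTweetid);
}

@Override
public int hashCode() {
    return Objects.hash(id, date, originalTweetid);
}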
I am trying to get URLs and HTML elements from a website. I am able to get the URLs and HTML, but when one URL contains multiple elements (like multiple input elements or multiple textarea elements) I am only getting the last element. The code is below:
GetURLsAndElements.java
public static void main(String[] args) throws FileNotFoundException,
IOException, ParseException {
Properties properties = new Properties();
properties.load(new FileInputStream("src//io//servicely//ci//plugin//SeleniumResources.properties"));
Map<String, String> urls = gettingUrls(properties.getProperty("MAIN_URL"));
GettingHTMLElements.getHTMLElements(urls);
// System.out.println(urls.size());
// System.out.println(urls);
}
public static Map<String, String> gettingUrls(String mainURL) {
Document doc = null;
Map<String, String> urlsList = new HashMap<String, String>();
try {
System.out.println("Main URL " + mainURL);
// need http protocol
doc = Jsoup.connect(mainURL).get();
GettingHTMLElements.getInputElements(doc, mainURL);
// get page title
// String title = doc.title();
// System.out.println("title : " + title);
// get all links
Elements links = doc.select("a[href]");
for (Element link : links) {
// urlsList.clear();
// get the value from href attribute and adding to list
if (link.attr("href").contains("http")) {
urlsList.put(link.attr("href"), link.text());
} else {
urlsList.put(mainURL + link.attr("href"), link.text());
}
// System.out.println(urlsList);
}
} catch (IOException e) {
e.printStackTrace();
}
// System.out.println("Total urls are "+urlsList.size());
// System.out.println(urlsList);
return urlsList;
}
GettingHTMLElements.java
static Map<String, HtmlElements> urlList = new HashMap<String, HtmlElements>();
public static void getHTMLElements(Map<String, String> urls)
throws IOException {
getElements(urls);
}
public static void getElements(Map<String, String> urls) throws IOException {
for (Map.Entry<String, String> entry1 : urls.entrySet()) {
try {
System.out.println(entry1.getKey());
Document doc = Jsoup.connect(entry1.getKey()).get();
getInputElements(doc, entry1.getKey());
}
catch (Exception e) {
e.printStackTrace();
}
}
Map<String,HtmlElements> list = urlList;
for(Map.Entry<String,HtmlElements> entry1:list.entrySet())
{
HtmlElements ele = entry1.getValue();
System.out.println("url is "+entry1.getKey());
System.out.println("input name "+ele.getInput_name());
}
}
public static HtmlElements getInputElements(Document doc, String entry1) {
HtmlElements htmlElements = new HtmlElements();
Elements inputElements2 = doc.getElementsByTag("input");
Elements textAreaElements2 = doc.getElementsByTag("textarea");
Elements formElements3 = doc.getElementsByTag("form");
for (Element inputElement : inputElements2) {
String key = inputElement.attr("name");
htmlElements.setInput_name(key);
String key1 = inputElement.attr("type");
htmlElements.setInput_type(key1);
String key2 = inputElement.attr("class");
htmlElements.setInput_class(key2);
}
for (Element inputElement : textAreaElements2) {
String key = inputElement.attr("id");
htmlElements.setTextarea_id(key);
String key1 = inputElement.attr("name");
htmlElements.setTextarea_name(key1);
}
for (Element inputElement : formElements3) {
String key = inputElement.attr("method");
htmlElements.setForm_method(key);
String key1 = inputElement.attr("action");
htmlElements.setForm_action(key1);
}
return urlList.put(entry1, htmlElements);
}
I take the elements I want as a bean. For every URL I am getting the URLs and HTML elements, but when a URL contains multiple elements I get only the last element.
You use a class HtmlElements which is not part of Jsoup as far as I know. I don't know its inner workings, but I assume it is some sort of list of HTML nodes.
However, you seem to use this class like this:
HtmlElements htmlElements = new HtmlElements();
htmlElements.setInput_name(key);
This indicates that only ONE HTML element is stored in the htmlElements variable, which would explain why you get only the last element stored: you simply overwrite the same instance each time.
It is hard to say for sure without knowing the HtmlElements class. Maybe something like this works, assuming that HtmlElement represents a single element and HtmlElements has an add method:
HtmlElements htmlElements = new HtmlElements();
...
for (Element inputElement : inputElements2) {
HtmlElement e = new HtmlElement();
htmlElements.add(e);
String key = inputElement.attr("name");
e.setInput_name(key);
}
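If changing the container is an option, a sketch of an alternative that avoids the overwriting entirely: collect one bean per tag in a plain java.util.List (HtmlElement and its setters are assumed to match the names used above):

// hypothetical alternative: one bean per <input> tag, collected in a list
List<HtmlElement> inputs = new ArrayList<>();
for (Element inputElement : inputElements2) {
    HtmlElement e = new HtmlElement();
    e.setInput_name(inputElement.attr("name"));
    e.setInput_type(inputElement.attr("type"));
    e.setInput_class(inputElement.attr("class"));
    inputs.add(e);
}
// the map would then hold every element per URL:
// Map<String, List<HtmlElement>> urlList; urlList.put(entry1, inputs);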
I want to extract the content of a Facebook page, mainly the links on the page. I tried extracting them using Jsoup, but it does not show the relevant link, the one which shows the likes for the topic, e.g. https://www.facebook.com/search/109301862430120/likers. Maybe it is loaded by jQuery, AJAX, or some other JavaScript. So how can I extract or access that link in Java, or call a JavaScript function with HtmlUnit?
public static void main(String args[]) {
    Testing t = new Testing();
    t.traceLink();
}

public static void traceLink() {
    // File input = new File("/tmp/input.html");
    Document doc = null;
    try {
        doc = Jsoup.connect("https://www.facebook.com/pages/Ice-cream/109301862430120?rf=102173023157556").get();
        Elements links = doc.select("a[href]");
        for (int i = 0; i < links.size(); i++) {
            // print each link individually, not the whole collection
            System.out.println(links.get(i).toString());
        }
        Element firstLink = doc.select("a[href]").first();
        System.out.println(firstLink);
    } catch (IOException e) {
        e.printStackTrace();
    }
}
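Jsoup never runs JavaScript, so links injected by scripts are simply not in the document it sees. HtmlUnit, which the question mentions, does execute JavaScript; a minimal sketch, assuming HtmlUnit 2.x on the classpath (note that Facebook may still require a login or block automated clients):

import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.HtmlAnchor;
import com.gargoylesoftware.htmlunit.html.HtmlPage;

public class JsLinks {
    public static void main(String[] args) throws Exception {
        try (WebClient client = new WebClient()) {
            client.getOptions().setJavaScriptEnabled(true);
            client.getOptions().setThrowExceptionOnScriptError(false);
            HtmlPage page = client.getPage("https://www.facebook.com/pages/Ice-cream/109301862430120?rf=102173023157556");
            client.waitForBackgroundJavaScript(5000); // give AJAX-loaded content a chance to finish
            for (HtmlAnchor a : page.getAnchors()) {
                System.out.println(a.getHrefAttribute());
            }
        }
    }
}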
Hello, I am in the process of making an Android app that pulls some data from a wiki. At first I was planning to parse the HTML, but someone pointed out to me that XML would be much easier to work with. Now I am stuck trying to find a way to parse the XML correctly. Right now I am trying to parse it from this web address:
http://zelda.wikia.com/api.php?action=query&list=categorymembers&cmtitle=Category:Games&cmlimit=500&format=xml
I am trying to get the title of each of the games into a string array, and I am having some trouble. I don't have an example of the code I was trying out; it used XmlPullParser. My app crashes every time I try to do anything with it. Would it be better to save the XML locally and parse it from there, or would I be okay going from the web address? And how would I go about parsing this correctly into a string array? Please help me, and thank you for taking the time to read this.
If you need to see code or anything, I can get it later tonight; I am just not near my PC at this time. Thank you.
Whenever you find yourself writing parser code for simple formats like the one in your example, you're almost always doing something wrong and not using a suitable framework.
For instance, there's a set of simple helpers for parsing XML in the android.sax package included in the SDK, and it just happens that the example you posted can be parsed easily like this:
public class WikiParser {
public static class Cm {
public String mPageId;
public String mNs;
public String mTitle;
}
private static class CmListener implements StartElementListener {
final List<Cm> mCms;
CmListener(List<Cm> cms) {
mCms = cms;
}
@Override
public void start(Attributes attributes) {
Cm cm = new Cm();
cm.mPageId = attributes.getValue("", "pageid");
cm.mNs = attributes.getValue("", "ns");
cm.mTitle = attributes.getValue("", "title");
mCms.add(cm);
}
}
public void parseInto(URL url, List<Cm> cms) throws IOException, SAXException {
HttpURLConnection con = (HttpURLConnection) url.openConnection();
try {
parseInto(new BufferedInputStream(con.getInputStream()), cms);
} finally {
con.disconnect();
}
}
public void parseInto(InputStream docStream, List<Cm> cms) throws IOException, SAXException {
RootElement api = new RootElement("api");
Element query = api.requireChild("query");
Element categoryMembers = query.requireChild("categorymembers");
Element cm = categoryMembers.requireChild("cm");
cm.setStartElementListener(new CmListener(cms));
Xml.parse(docStream, Encoding.UTF_8, api.getContentHandler());
}
}
Basically, you call it like this:
WikiParser p = new WikiParser();
ArrayList<WikiParser.Cm> res = new ArrayList<WikiParser.Cm>();
try {
p.parseInto(new URL("http://zelda.wikia.com/api.php?action=query&list=categorymembers&cmtitle=Category:Games&cmlimit=500&format=xml"), res);
} catch (MalformedURLException e) {
} catch (IOException e) {
} catch (SAXException e) {}
Edit: This is how you'd create a List<String> instead:
public class WikiParser {
private static class CmListener implements StartElementListener {
final List<String> mTitles;
CmListener(List<String> titles) {
mTitles = titles;
}
@Override
public void start(Attributes attributes) {
String title = attributes.getValue("", "title");
if (!TextUtils.isEmpty(title)) {
mTitles.add(title);
}
}
}
public void parseInto(URL url, List<String> titles) throws IOException, SAXException {
HttpURLConnection con = (HttpURLConnection) url.openConnection();
try {
parseInto(new BufferedInputStream(con.getInputStream()), titles);
} finally {
con.disconnect();
}
}
public void parseInto(InputStream docStream, List<String> titles) throws IOException, SAXException {
RootElement api = new RootElement("api");
Element query = api.requireChild("query");
Element categoryMembers = query.requireChild("categorymembers");
Element cm = categoryMembers.requireChild("cm");
cm.setStartElementListener(new CmListener(titles));
Xml.parse(docStream, Encoding.UTF_8, api.getContentHandler());
}
}
and then:
WikiParser p = new WikiParser();
ArrayList<String> titles = new ArrayList<String>();
try {
p.parseInto(new URL("http://zelda.wikia.com/api.php?action=query&list=categorymembers&cmtitle=Category:Games&cmlimit=500&format=xml"), titles);
} catch (MalformedURLException e) {
} catch (IOException e) {
} catch (SAXException e) {}
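(The empty catch blocks are only there to keep the example compact; in real code you would at least log each of those exceptions.)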