How can I scrape data from a website using the Jaunt library?

I want to get the titles from this feed: http://feeds.foxnews.com/foxnews/latest
For example, from an element like this:
<title><![CDATA[SUCCESSFUL INTERCEPT Pentagon confirms it shot down ICBM-type target]]></title>
it should print text like this:
"SUCCESSFUL INTERCEPT Pentagon confirms it shot down ICBM-type target
US conducts successful missile intercept test, Pentagon says"
Here is my code, using the Jaunt library.
I don't know why it only prints the text "foxnew.com".
import com.jaunt.JauntException;
import com.jaunt.UserAgent;

public class p8_1
{
    public static void main(String[] args)
    {
        try
        {
            UserAgent userAgent = new UserAgent();
            userAgent.visit("http://feeds.foxnews.com/foxnews/latest");
            String title = userAgent.doc.findFirst
                ("<title><![CDATA[SUCCESSFUL INTERCEPT Pentagon confirms it shot down ICBM-type target]]></title>").getText();
            System.out.println("\n " + title);
        } catch (JauntException e)
        {
            System.err.println(e);
        }
    }
}

Search for element types, not values.
Try the following to get the title text of each item in the feed:
public static void main(String[] args) {
    try {
        UserAgent userAgent = new UserAgent();
        userAgent.visit("http://feeds.foxnews.com/foxnews/latest");
        Elements items = userAgent.doc.findEach("<item>");
        Elements titles = items.findEach("<title>");
        for (Element title : titles) {
            String titleText = title.getComment(0).getText();
            System.out.println(titleText);
        }
    } catch (JauntException e) {
        System.err.println(e);
    }
}
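If you are not tied to Jaunt, the same feed can also be read with Jsoup's XML parser (Jsoup appears in the other questions on this page). This is only a minimal sketch, assuming the feed keeps its <item>/<title> structure; the class name FeedTitles is made up for the example:

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.parser.Parser;

public class FeedTitles {
    public static void main(String[] args) throws Exception {
        // Fetch the RSS feed and parse it as XML so <item> and <title> are kept as-is
        Document doc = Jsoup.connect("http://feeds.foxnews.com/foxnews/latest")
                .parser(Parser.xmlParser())
                .get();
        // Select the <title> of every <item> and print its text (the CDATA content)
        for (Element title : doc.select("item > title")) {
            System.out.println(title.text());
        }
    }
}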

Related

How to get the first link using JSOUP?

I want to use Jsoup to extract the first link on the google search results. For example, I search for "apple" on google. The first link I see is www.apple.com/. How do I return the first link? I am currently able to extract all links using Jsoup:
new Thread(new Runnable() {
    @Override
    public void run() {
        final StringBuilder stringBuilder = new StringBuilder();
        try {
            Document doc = Jsoup.connect(sharedURL).get();
            String title = doc.title();
            Elements links = doc.select("a[href]");
            stringBuilder.append(title).append("\n");
            for (Element link : links) {
                stringBuilder.append("\n").append(" ").append(link.text()).append(" ").append(link.attr("href")).append("\n");
            }
        } catch (IOException e) {
            stringBuilder.append("Error : ").append(e.getMessage()).append("\n");
        }
        runOnUiThread(new Runnable() {
            @Override
            public void run() {
                // set text
                textView.setText(stringBuilder.toString());
            }
        });
    }
}).start();
Do you mean:
Element firstLink = doc.select("a[href]").first();
It works for me. If you meant something else, let us know. I checked the search results with the following, and it's a tough one to decipher, as there are so many types of results that come back: maps, news, ads, etc.
I tidied up the code a little with the use of Java lambdas:
public static void main(String[] args) {
    new Thread(() -> {
        final StringBuilder stringBuilder = new StringBuilder();
        try {
            String sharedUrl = "https://www.google.com/search?q=apple";
            Document doc = Jsoup.connect(sharedUrl).get();
            String title = doc.title();
            Elements links = doc.select("a[href]");
            Element firstLink = links.first(); // <<<<< NEW ADDITION
            stringBuilder.append(title).append("\n");
            for (Element link : links) {
                stringBuilder.append("\n")
                        .append(" ")
                        .append(link.text())
                        .append(" ")
                        .append(link.attr("href"))
                        .append("\n");
            }
        } catch (IOException e) {
            stringBuilder.append("Error : ").append(e.getMessage()).append("\n");
        }
        // replaced some of this for running/testing locally
        SwingUtilities.invokeLater(() -> System.out.println(stringBuilder.toString()));
    }).start();
}
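As a follow-up on getting just the first result's target: once you have firstLink you can read its href attribute directly; absUrl("href") resolves relative links against the page URL. A small sketch using the variable names from the snippet above (note that on a Google results page the href is often a Google redirect URL rather than the final site):

Element firstLink = links.first();
if (firstLink != null) {
    // attr("href") returns the raw attribute; absUrl("href") resolves it to an absolute URL
    System.out.println(firstLink.text() + " -> " + firstLink.absUrl("href"));
}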

Combine one Json with another Json (JAVA)

So I already succeeded in getting one JSON to work. Now I have another class from which I want to get only one attribute. I have a MovieDataBase class (which works with JSON and gets all the information), and now I want to add a trailer, which comes from the YouTube API. Basically I need it added into the same JSON, to make it easier for me later to get it into HTML. The only problem is I can't get it to work: I get a JSON syntax error when using this method.
EDIT CODE 1.1:
Youtube attribute:
public class FilmAttribut {

    private String title = "";
    private String release = "";
    private int vote = 0;
    private String overview = "";
    private String poster = "";
    private String trailer = "";

    // getters + setters stripped
}
Youtube class:
public class Youtube {

    FilmAttribut movie = new FilmAttribut();

    public void search(String trailer, FilmAttribut in) {
        HttpResponse<JsonNode> response;
        try {
            response = Unirest.get("https://www.googleapis.com/youtube/v3/search?key=[app key here]&part=snippet")
                    .queryString("q", trailer + " trailer")
                    .asJson();

            JsonNode json = response.getBody();
            JSONObject envelope = json.getObject();
            JSONArray items = envelope.getJSONArray("items");

            in.setTrailer("https://youtu.be/" + items.getJSONObject(0).getJSONObject("id").getString("videoId")); // Gives me error here
        } catch (JSONException e) {
            e.printStackTrace();
        } catch (UnirestException e) {
            e.printStackTrace();
        }
    }
}
and the main method:
public class WebService {

    public static void main(String[] args) {
        setPort(1337);

        Gson gson = new Gson();
        Youtube yt = new Youtube();
        MovieDataBase mdb = new MovieDataBase();

        get("/search/:movie", (req, res) -> {
            String titel = req.params(":movie");
            FilmAttribut film = mdb.searchMovie(titel);
            yt.search(titel, film);
            String json = gson.toJson(film);
            return json;
        });
    }
}
So I think the problem is that you can't have gson.toJson(film) + gson.toJson(trailer), because that creates the JSON twice: once for the film (i.e. the movie), and then a new JSON is created for the trailer, which causes the syntax error.
So my real question is: is it possible to have another class, like my Youtube class, send its information to the attribute class where I have all my attributes, and then run it in the main method so that I get everything in one JSON?
If I understood what you are asking correctly, yes you can, but I would do something like this instead:
public void search(String trailer, FilmAttribut in) {
    // fetch the trailer from youtube (NB: you should use getters/setters, not public fields)
    in.setTrailer("https://youtu.be/" + items.getJSONObject(0).getJSONObject("id").getString("videoId"));
}
and:
FilmAttribut film = mdb.searchMovie(titel);
yt.search(titel, film); // pass "film" to "fill" its trailer
return gson.toJson(film);
OR
public String search(String trailer) {
    // fetch the trailer from youtube
    return "https://youtu.be/" + items.getJSONObject(0).getJSONObject("id").getString("videoId");
}
and:
FilmAttribut film = mdb.searchMovie(titel);
film.setTrailer(yt.search(titel));
return gson.toJson(film);
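For completeness, here is a sketch of the second option with the Unirest call from the original search method folded in, so the YouTube lookup and the trailer URL live in one place. It keeps the [app key here] placeholder and the assumption that the first search hit is the trailer:

public String search(String title) {
    try {
        HttpResponse<JsonNode> response = Unirest.get("https://www.googleapis.com/youtube/v3/search?key=[app key here]&part=snippet")
                .queryString("q", title + " trailer")
                .asJson();

        JSONArray items = response.getBody().getObject().getJSONArray("items");
        // take the first hit and build the short YouTube URL from its videoId
        return "https://youtu.be/" + items.getJSONObject(0).getJSONObject("id").getString("videoId");
    } catch (JSONException | UnirestException e) {
        e.printStackTrace();
        return null; // caller can skip setTrailer(...) when the lookup fails
    }
}

The route handler then stays as above: film.setTrailer(yt.search(titel)); followed by return gson.toJson(film);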

WebCrawler with recursion

So I am working on a webcrawler that is supposed to download all images, files, and webpages, and then recursively do the same for all webpages found. However, I seem to have a logic error.
public class WebCrawler {

    private static String url;
    private static int maxCrawlDepth;
    private static String filePath;

    /* Recursive function that crawls all web pages found on a given web page.
     * This function also saves elements from the DownloadRepository to disk.
     */
    public static void crawling(WebPage webpage, int currentCrawlDepth, int maxCrawlDepth) {
        webpage.crawl(currentCrawlDepth);

        HashMap<String, WebPage> pages = webpage.getCrawledWebPages();

        if (currentCrawlDepth < maxCrawlDepth) {
            for (WebPage wp : pages.values()) {
                crawling(wp, currentCrawlDepth + 1, maxCrawlDepth);
            }
        }
    }

    public static void main(String[] args) {
        if (args.length != 3) {
            System.out.println("Must pass three parameters");
            System.exit(0);
        }

        url = "";
        maxCrawlDepth = 0;
        filePath = "";

        url = args[0];

        try {
            URL testUrl = new URL(url);
            URLConnection urlConnection = testUrl.openConnection();
            urlConnection.connect();
        } catch (MalformedURLException e) {
            System.out.println("Not a valid URL");
            System.exit(0);
        } catch (IOException e) {
            System.out.println("Could not open URL");
            System.exit(0);
        }

        try {
            maxCrawlDepth = Integer.parseInt(args[1]);
        } catch (NumberFormatException e) {
            System.out.println("Argument is not an int");
            System.exit(0);
        }

        filePath = args[2];
        File path = new File(filePath);
        if (!path.exists()) {
            System.out.println("File Path is invalid");
            System.exit(0);
        }

        WebPage webpage = new WebPage(url);
        crawling(webpage, 0, maxCrawlDepth);

        System.out.println("Web crawl is complete");
    }
}
The crawl function parses the contents of a web page, storing any found images, files, or links into a HashMap, for example:
public class WebPage implements WebElement {

    private static Elements images;
    private static Elements links;

    private HashMap<String, WebImage> webImages = new HashMap<String, WebImage>();
    private HashMap<String, WebPage> webPages = new HashMap<String, WebPage>();
    private HashMap<String, WebFile> files = new HashMap<String, WebFile>();

    private String url;

    public WebPage(String url) {
        this.url = url;
    }

    /* The crawl method parses the html on a given web page
     * and adds the elements of the web page to the Download
     * Repository.
     */
    public void crawl(int currentCrawlDepth) {
        System.out.print("Crawling " + url + " at crawl depth ");
        System.out.println(currentCrawlDepth + "\n");

        Document doc = null;
        try {
            HttpConnection httpConnection = (HttpConnection) Jsoup.connect(url);
            httpConnection.ignoreContentType(true);
            doc = httpConnection.get();
        } catch (MalformedURLException e) {
            System.out.println(e.getLocalizedMessage());
        } catch (IOException e) {
            System.out.println(e.getLocalizedMessage());
        } catch (IllegalArgumentException e) {
            System.out.println(url + " is not a valid URL");
        }

        DownloadRepository downloadRepository = DownloadRepository.getInstance();

        if (doc != null) {
            images = doc.select("img");
            links = doc.select("a[href]");

            for (Element image : images) {
                String imageUrl = image.absUrl("src");
                if (!webImages.containsValue(image)) {
                    WebImage webImage = new WebImage(imageUrl);
                    webImages.put(imageUrl, webImage);
                    downloadRepository.addElement(imageUrl, webImage);
                    System.out.println("Added image at " + imageUrl);
                }
            }

            HttpConnection mimeConnection = null;
            Response mimeResponse = null;

            for (Element link : links) {
                String linkUrl = link.absUrl("href");
                linkUrl = linkUrl.trim();

                if (!linkUrl.contains("#")) {
                    try {
                        mimeConnection = (HttpConnection) Jsoup.connect(linkUrl);
                        mimeConnection.ignoreContentType(true);
                        mimeConnection.ignoreHttpErrors(true);
                        mimeResponse = (Response) mimeConnection.execute();
                    } catch (Exception e) {
                        System.out.println(e.getLocalizedMessage());
                    }

                    String contentType = null;
                    if (mimeResponse != null) {
                        contentType = mimeResponse.contentType();
                    }

                    if (contentType == null) {
                        continue;
                    }

                    if (contentType.toString().equals("text/html")) {
                        if (!webPages.containsKey(linkUrl)) {
                            WebPage webPage = new WebPage(linkUrl);
                            webPages.put(linkUrl, webPage);
                            downloadRepository.addElement(linkUrl, webPage);
                            System.out.println("Added webPage at " + linkUrl);
                        }
                    } else {
                        if (!files.containsValue(link)) {
                            WebFile webFile = new WebFile(linkUrl);
                            files.put(linkUrl, webFile);
                            downloadRepository.addElement(linkUrl, webFile);
                            System.out.println("Added file at " + linkUrl);
                        }
                    }
                }
            }
        }

        System.out.print("\nFinished crawling " + url + " at crawl depth ");
        System.out.println(currentCrawlDepth + "\n");
    }

    public HashMap<String, WebImage> getImages() {
        return webImages;
    }

    public HashMap<String, WebPage> getCrawledWebPages() {
        return webPages;
    }

    public HashMap<String, WebFile> getFiles() {
        return files;
    }

    public String getUrl() {
        return url;
    }

    @Override
    public void saveToDisk(String filePath) {
        System.out.println(filePath);
    }
}
The point of using a hashmap is to ensure that I do not parse the same website more than once. The error seems to be with my recursion. What is the issue?
Here is also some sample output for starting the crawl at http://www.google.com
Crawling https://www.google.com/ at crawl depth 0
Added webPage at http://www.google.com/intl/en/options/
Added webPage at https://www.google.com/intl/en/ads/
Added webPage at https://www.google.com/services/
Added webPage at https://www.google.com/intl/en/about.html
Added webPage at https://www.google.com/intl/en/policies/
Finished crawling https://www.google.com/ at crawl depth 0
Crawling https://www.google.com/services/ at crawl depth 1
Added webPage at http://www.google.com/intl/en/enterprise/apps/business/?utm_medium=et&utm_campaign=en&utm_source=us-en-et-nelson_bizsol
Added webPage at https://www.google.com/services/sitemap.html
Added webPage at https://www.google.com/intl/en/about/
Added webPage at https://www.google.com/intl/en/policies/
Finished crawling https://www.google.com/services/ at crawl depth 1
**Crawling https://www.google.com/intl/en/policies/ at crawl depth 2**
Added webPage at https://www.google.com/intl/en/policies/
Added webPage at https://www.google.com/intl/en/policies/terms/
Added webPage at https://www.google.com/intl/en/policies/privacy/
Added webPage at https://www.google.com/intl/en/policies/terms/
Added webPage at https://www.google.com/intl/en/policies/faq/
Added webPage at https://www.google.com/intl/en/policies/technologies/
Added webPage at https://www.google.com/intl/en/about/
Added webPage at https://www.google.com/intl/en/policies/
Finished crawling https://www.google.com/intl/en/policies/ at crawl depth 2
**Crawling https://www.google.com/intl/en/policies/ at crawl depth 3**
Notice that it parses http://www.google.com/intl/en/policies/ twice
You are creating a new map for each web page. This ensures that if the same link occurs on a page twice it will only be crawled once, but it does not deal with the case where the same link appears on two different pages.
https://www.google.com/intl/en/policies/ appears on both https://www.google.com/ and https://www.google.com/services/.
To avoid this, use a single map throughout your crawl and pass it as a parameter into the recursion.
public class WebCrawler {

    private static Map<String, WebPage> visited = new HashMap<String, WebPage>();

    public static void crawling(Map<String, WebPage> visited, WebPage webpage, int currentCrawlDepth, int maxCrawlDepth) {
        // crawl only pages that are not already in the visited map
    }
}
As you are also holding a map of the images etc., you may prefer to create a new object, perhaps called Visited, and make it keep track:
public class Visited {

    private HashMap<String, WebPage> webPages = new HashMap<String, WebPage>();

    public boolean visit(String url, WebPage page) {
        if (webPages.containsKey(url)) {
            return false;
        }
        webPages.put(url, page);
        return true;
    }

    private HashMap<String, WebImage> webImages = new HashMap<String, WebImage>();

    public boolean visit(String url, WebImage image) {
        if (webImages.containsKey(url)) {
            return false;
        }
        webImages.put(url, image);
        return true;
    }
}
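To tie this back to the recursion, here is a sketch of how the crawl loop could consult the shared map before descending; the method names (getUrl, crawl, getCrawledWebPages) are the ones from the question's code, and this is an outline rather than a drop-in replacement:

public static void crawling(Map<String, WebPage> visited, WebPage webpage, int currentCrawlDepth, int maxCrawlDepth) {
    // skip pages that have already been crawled anywhere in the tree
    if (visited.containsKey(webpage.getUrl())) {
        return;
    }
    visited.put(webpage.getUrl(), webpage);

    webpage.crawl(currentCrawlDepth);

    if (currentCrawlDepth < maxCrawlDepth) {
        for (WebPage wp : webpage.getCrawledWebPages().values()) {
            crawling(visited, wp, currentCrawlDepth + 1, maxCrawlDepth);
        }
    }
}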

Extracting link from a facebook page

I want to extract the content of a Facebook page, mainly the links on the page. I tried extracting with Jsoup, but it does not show the relevant link, the one that shows the likers of the topic, e.g. https://www.facebook.com/search/109301862430120/likers. Maybe it is generated by jQuery, Ajax, or other JavaScript. So how can I extract or access that link in Java, or call a JavaScript function with HtmlUnit?
public static void main(String args[])
{
    Testing t = new Testing();
    t.traceLink();
}

public static void traceLink()
{
    // File input = new File("/tmp/input.html");
    Document doc = null;
    try
    {
        doc = Jsoup.connect("https://www.facebook.com/pages/Ice-cream/109301862430120?rf=102173023157556").get();
        Elements link = doc.select("a[href]");
        String stringLink = null;
        for (int i = 0; i < link.size(); i++)
        {
            stringLink = link.toString();
            System.out.println(stringLink);
        }
        System.out.println(link);
    }
    catch (IOException e)
    {
        //e.printStackTrace();
    }
}
Element links = doc.select("a[href]").first();
System.out.println(links);
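Since the question mentions HtmlUnit: links that are added by JavaScript will never show up in Jsoup's static HTML, but a headless browser can run the scripts first and then expose the anchors. Below is a minimal sketch with HtmlUnit, using the page URL from the question; the wait time and options are assumptions that may need adjusting, and Facebook may still require a login before it shows the likers link:

import java.util.List;
import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.HtmlAnchor;
import com.gargoylesoftware.htmlunit.html.HtmlPage;

public class FacebookLinks {
    public static void main(String[] args) throws Exception {
        try (WebClient webClient = new WebClient()) {
            // let the page's JavaScript run instead of reading only the static HTML
            webClient.getOptions().setJavaScriptEnabled(true);
            webClient.getOptions().setThrowExceptionOnScriptError(false);

            HtmlPage page = webClient.getPage("https://www.facebook.com/pages/Ice-cream/109301862430120?rf=102173023157556");
            // give background scripts a few seconds to add dynamic content
            webClient.waitForBackgroundJavaScript(5000);

            List<HtmlAnchor> anchors = page.getAnchors();
            for (HtmlAnchor anchor : anchors) {
                System.out.println(anchor.getHrefAttribute());
            }
        }
    }
}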

Parsing XML from a website to a String array in Android please help me

Hello, I am in the process of making an Android app that pulls some data from a wiki. At first I was planning on finding a way to parse the HTML, but someone pointed out that the XML would be much easier to work with. Now I am stuck trying to find a way to parse the XML correctly. I am trying to parse from this web address:
http://zelda.wikia.com/api.php?action=query&list=categorymembers&cmtitle=Category:Games&cmlimit=500&format=xml
I am trying to get the titles of each of the games into a string array and I am having some trouble. I don't have an example of the code I was trying out; it used XmlPullParser. My app crashes every time I try to do anything with it. Would it be better to save the XML locally and parse from there, or would I be okay going from the web address? And how would I go about parsing this correctly into a string array? Please help me, and thank you for taking the time to read this.
If you need to see code or anything, I can get it later tonight; I am just not near my PC at this time. Thank you.
Whenever you find yourself writing parser code for simple formats like the one in your example you're almost always doing something wrong and not using a suitable framework.
For instance - there's a set of simple helpers for parsing XML in the android.sax package included in the SDK and it just happens that the example you posted could be easily parsed like this:
public class WikiParser {

    public static class Cm {
        public String mPageId;
        public String mNs;
        public String mTitle;
    }

    private static class CmListener implements StartElementListener {
        final List<Cm> mCms;

        CmListener(List<Cm> cms) {
            mCms = cms;
        }

        @Override
        public void start(Attributes attributes) {
            Cm cm = new Cm();
            cm.mPageId = attributes.getValue("", "pageid");
            cm.mNs = attributes.getValue("", "ns");
            cm.mTitle = attributes.getValue("", "title");
            mCms.add(cm);
        }
    }

    public void parseInto(URL url, List<Cm> cms) throws IOException, SAXException {
        HttpURLConnection con = (HttpURLConnection) url.openConnection();
        try {
            parseInto(new BufferedInputStream(con.getInputStream()), cms);
        } finally {
            con.disconnect();
        }
    }

    public void parseInto(InputStream docStream, List<Cm> cms) throws IOException, SAXException {
        RootElement api = new RootElement("api");
        Element query = api.requireChild("query");
        Element categoryMembers = query.requireChild("categorymembers");
        Element cm = categoryMembers.requireChild("cm");
        cm.setStartElementListener(new CmListener(cms));
        Xml.parse(docStream, Encoding.UTF_8, api.getContentHandler());
    }
}
Basically, called like this:
WikiParser p = new WikiParser();
ArrayList<WikiParser.Cm> res = new ArrayList<WikiParser.Cm>();
try {
    p.parseInto(new URL("http://zelda.wikia.com/api.php?action=query&list=categorymembers&cmtitle=Category:Games&cmlimit=500&format=xml"), res);
} catch (MalformedURLException e) {
} catch (IOException e) {
} catch (SAXException e) {}
Edit: This is how you'd create a List<String> instead:
public class WikiParser {

    private static class CmListener implements StartElementListener {
        final List<String> mTitles;

        CmListener(List<String> titles) {
            mTitles = titles;
        }

        @Override
        public void start(Attributes attributes) {
            String title = attributes.getValue("", "title");
            if (!TextUtils.isEmpty(title)) {
                mTitles.add(title);
            }
        }
    }

    public void parseInto(URL url, List<String> titles) throws IOException, SAXException {
        HttpURLConnection con = (HttpURLConnection) url.openConnection();
        try {
            parseInto(new BufferedInputStream(con.getInputStream()), titles);
        } finally {
            con.disconnect();
        }
    }

    public void parseInto(InputStream docStream, List<String> titles) throws IOException, SAXException {
        RootElement api = new RootElement("api");
        Element query = api.requireChild("query");
        Element categoryMembers = query.requireChild("categorymembers");
        Element cm = categoryMembers.requireChild("cm");
        cm.setStartElementListener(new CmListener(titles));
        Xml.parse(docStream, Encoding.UTF_8, api.getContentHandler());
    }
}
and then:
WikiParser p = new WikiParser();
ArrayList<String> titles = new ArrayList<String>();
try {
    p.parseInto(new URL("http://zelda.wikia.com/api.php?action=query&list=categorymembers&cmtitle=Category:Games&cmlimit=500&format=xml"), titles);
} catch (MalformedURLException e) {
} catch (IOException e) {
} catch (SAXException e) {}
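Two notes on the crash mentioned in the question: on Android, network calls on the main thread throw NetworkOnMainThreadException, so the parse should run on a background thread, and the finished list converts to the requested string array with toArray. A small sketch, reusing the titles list filled above and keeping threading minimal (in a real app you would hand the result back to the UI thread before touching any views):

new Thread(new Runnable() {
    @Override
    public void run() {
        WikiParser p = new WikiParser();
        ArrayList<String> titles = new ArrayList<String>();
        try {
            p.parseInto(new URL("http://zelda.wikia.com/api.php?action=query&list=categorymembers&cmtitle=Category:Games&cmlimit=500&format=xml"), titles);
        } catch (IOException | SAXException e) {
            e.printStackTrace();
        }
        // the string array the question asks for
        String[] titleArray = titles.toArray(new String[titles.size()]);
    }
}).start();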
