Recursively retrieving links from web page in Java

I'm working on a simplified website downloader (programming assignment) and I have to recursively go through the links in the given URL and download the individual pages to my local directory.
I already have a function to retrieve all the hyperlinks (href attributes) from a single page, Set<String> retrieveLinksOnPage(URL url), which returns a set of hyperlinks. I have been told to download pages up to level 4 (level 0 being the home page). Therefore I basically want to retrieve all the links in the site, but I'm having difficulty coming up with the recursion algorithm. In the end, I intend to call my function like this:
retrieveAllLinksFromSite("http://www.example.com/ldsjf.html",0)
Set<String> Links = new HashSet<String>();

Set<String> retrieveAllLinksFromSite(URL url, int Level, Set<String> Links)
{
    if (Level == 4)
        return;
    else {
        //retrieveLinksOnPage(url, 0);
        //I'm pretty lost actually!
    }
}
Thanks!

Here is the pseudocode:
Set<String> retrieveAllLinksFromSite(int Level, Set<String> Links) {
    if (Level < 5) {
        Set<String> local_links = new HashSet<String>();
        for (String link : Links) {
            // download `link` here
            Set<String> new_links = null; // parse the downloaded HTML of `link` into a set of hrefs
            local_links.addAll(retrieveAllLinksFromSite(Level + 1, new_links));
        }
        return local_links;
    } else {
        return Links;
    }
}
You will need to implement the things in the comments yourself. To run the function from a single starting link, create an initial set of links that contains only that link. It also works if you have multiple initial links, however.
Set<String> initial_link_set = new HashSet<String>();
initial_link_set.add("http://abc.com/");
Set<String> final_link_set = retrieveAllLinksFromSite(1, initial_link_set);
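A minimal sketch of the two commented steps, assuming the Jsoup library (which other questions on this page also use) handles the download and parsing; the helper name downloadAndParse is illustrative, not part of the answer above:

import java.io.IOException;
import java.util.HashSet;
import java.util.Set;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

// Hypothetical helper: downloads `link` and returns the hrefs found on the page.
static Set<String> downloadAndParse(String link) {
    Set<String> new_links = new HashSet<String>();
    try {
        Document doc = Jsoup.connect(link).get();  // download the page
        for (Element a : doc.select("a[href]")) {  // every anchor with an href
            new_links.add(a.absUrl("href"));       // resolve relative URLs to absolute ones
        }
    } catch (IOException e) {
        // skip pages that cannot be fetched
    }
    return new_links;
}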

You can use a HashMap instead of a Vector to store the links and their levels (since you need to recursively get all links down to level 4).
Also, it would be something like this (just giving an overall hint):
HashMap<String, Integer> Links = new HashMap<String, Integer>();

void retrieveAllLinksFromSite(URL url, int Level)
{
    if (Level == 4)
        return;
    else {
        // retrieve the links on the current page, and for each retrieved link:
        //   download the link
        //   Links.put(the retrieved url, Level);                    // store the link with its level
        //   retrieveAllLinksFromSite(the retrieved url, Level + 1); // recurse into further levels
    }
}
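Fleshing out that hint, one possible concrete version, again assuming Jsoup for fetching and parsing; the containsKey check is an addition the hint above does not mention, but without it pages that link back to each other would be revisited over and over:

import java.util.HashMap;
import java.util.Map;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class SiteCrawler {
    // link -> level at which it was first seen
    private final Map<String, Integer> links = new HashMap<String, Integer>();

    public void retrieveAllLinksFromSite(String url, int level) {
        if (level == 4 || links.containsKey(url)) {
            return; // depth limit reached, or page already visited
        }
        links.put(url, level);
        try {
            Document doc = Jsoup.connect(url).get(); // download the page
            for (Element a : doc.select("a[href]")) {
                retrieveAllLinksFromSite(a.absUrl("href"), level + 1);
            }
        } catch (Exception e) {
            // ignore pages that cannot be fetched or parsed
        }
    }
}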

Related

Getting links from the table and all the tabs of a website using Jsoup

I'm new to web scraping, so the question may not be framed perfectly. I am trying to extract all the drug-name links from a given page alphabetically, and as a result extract all a-z drug links, then iterate over these links to extract information from within each of them, like generic name, brand, etc. I have a very basic code below that doesn't work. Some help in approaching this problem would be much appreciated.
public class WebScraper {
    public static void main(String[] args) throws Exception {
        String keyword = "a"; // will iterate through all the alphabets eventually
        String url = "http://www.medindia.net/drug-price/brand-index.asp?alpha=" + keyword;
        Document doc = Jsoup.connect(url).get();
        Element table = doc.select("table").first();
        Elements links = table.select("a[href]"); // a with href
        for (Element link : links) {
            System.out.println(link.attr("href"));
        }
    }
}
After looking at the website and what you are expecting to get, it looks like you are grabbing the wrong table element. You don't want the first table, you want the second.
To grab a specific table, you can use this:
Element table = doc.select("table").get(1);
This will get the table at index 1, i.e. the second table in the document.
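Dropped into the question's loop, a sketch (assuming the drug links do sit in that second table):

Document doc = Jsoup.connect(url).get();
Element table = doc.select("table").get(1);   // second table on the page
for (Element link : table.select("a[href]")) {
    System.out.println(link.absUrl("href"));  // absUrl resolves relative hrefs
}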

Delete all files in 'folder' or with prefix in Google Cloud Bucket from Java

I know the idea of 'folders' is sort of nonexistent or different in Google Cloud Storage, but I need a way to delete all objects in a 'folder' or with a given prefix from Java.
The GcsService has a delete function, but as far as I can tell it only takes one GcsFilename object and does not honor wildcards (i.e., "folderName/**" did not work).
Any tips?
The API only supports deleting a single object at a time. You can request many deletions only by making many HTTP requests or by batching many delete requests. There is no API call to delete multiple objects using wildcards or the like. In order to delete all of the objects with a certain prefix, you'd need to list the objects, then make a delete call for each object that matches the pattern.
The command-line utility, gsutil, does exactly that when you ask it to delete the path "gs://bucket/dir/**". It fetches a list of objects matching that pattern, then makes a delete call for each of them.
If you need a quick solution, you could always have your Java program exec gsutil.
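If you take the exec route, a minimal sketch, assuming gsutil is installed and authenticated on the host (the bucket and prefix are placeholders):

import java.io.IOException;

// Hypothetical helper: shells out to gsutil to delete everything under a prefix.
static void deletePrefixWithGsutil(String bucket, String prefix)
        throws IOException, InterruptedException {
    Process p = new ProcessBuilder(
            "gsutil", "-m", "rm", "gs://" + bucket + "/" + prefix + "/**")
            .inheritIO() // forward gsutil's output to this process's console
            .start();
    int exitCode = p.waitFor();
    if (exitCode != 0) {
        throw new IOException("gsutil exited with code " + exitCode);
    }
}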
Here is the code that corresponds to the above answer in case anyone else wants to use it:
public void deleteFolder(String bucket, String folderName) throws CouldNotDeleteFile {
    try {
        ListResult list = gcsService.list(bucket,
                new ListOptions.Builder().setPrefix(folderName).setRecursive(true).build());
        while (list.hasNext()) {
            ListItem item = list.next();
            gcsService.delete(new GcsFilename(bucket, item.getName()));
        }
    } catch (IOException e) {
        // Error handling
    }
}
Extremely late to the party, but here's an answer for anyone arriving from a current Google search. We can delete multiple blobs efficiently by leveraging com.google.cloud.storage.StorageBatch.
Like so:
public static void rmdir(Storage storage, String bucket, String dir) {
    StorageBatch batch = storage.batch();
    Page<Blob> blobs = storage.list(bucket, Storage.BlobListOption.currentDirectory(),
            Storage.BlobListOption.prefix(dir));
    for (Blob blob : blobs.iterateAll()) {
        batch.delete(blob.getBlobId());
    }
    batch.submit();
}
This should run much faster than deleting one by one when your bucket/folder contains a non-trivial number of items.
Edit: since this is getting a little attention, I'll demo error handling:
public static boolean rmdir(Storage storage, String bucket, String dir) {
    List<StorageBatchResult<Boolean>> results = new ArrayList<>();
    StorageBatch batch = storage.batch();
    try {
        Page<Blob> blobs = storage.list(bucket, Storage.BlobListOption.currentDirectory(),
                Storage.BlobListOption.prefix(dir));
        for (Blob blob : blobs.iterateAll()) {
            results.add(batch.delete(blob.getBlobId()));
        }
    } finally {
        batch.submit(); // the per-blob results are only populated after submission
    }
    // Returning from inside a finally block would swallow exceptions, so return here instead.
    return results.stream().allMatch(r -> r != null && r.get());
}
This method will delete every blob under the given folder of the given bucket, returning true if every deletion succeeded and false otherwise. For a better understanding and error-proofing, look into what batch.delete() returns.
To ensure ALL items are deleted, you could call this like:
boolean success = false;
while (!success) {
    success = rmdir(storage, bucket, dir);
}
I realise this is an old question, but I just stumbled upon the same issue and found a different way to resolve it.
The Storage class in the Google Cloud Java Client for Storage includes a method to list the blobs in a bucket, which can also accept an option to set a prefix to filter results to blobs whose names begin with the prefix.
For example, deleting all the files with a given prefix from a bucket can be achieved like this:
Storage storage = StorageOptions.getDefaultInstance().getService();
Iterable<Blob> blobs = storage.list("bucket_name",
        Storage.BlobListOption.prefix("prefix")).iterateAll();
for (Blob blob : blobs) {
    blob.delete(Blob.BlobSourceOption.generationMatch());
}

How to check if the page has content?

I would like to avoid activating a page if its content is empty. I do this with a servlet, as follows:
@SlingServlet(paths = "/bin/servlet", methods = "GET", resourceTypes = "sling/servlet/default")
public class ValidatorServlet extends SlingAllMethodsServlet {

    @Override
    protected void doGet(SlingHttpServletRequest request, SlingHttpServletResponse response) {
        String page = "pathToPage";
        PageManager pageManager = request.adaptTo(PageManager.class);
        Page currentPage = pageManager.getPage(page);
        boolean result = pageHasContent(currentPage);
    }
}
Now, how do I check whether currentPage has content?
Please note that the following answer was posted in 2013 when CQ/AEM was a lot different to the current version. The following may not work consistently if used. Refer to Tadija Malic's answer below for more on this.
The hasContent() method of the Page class can be used to check whether the page has content or not. It returns true if the page has jcr:content node, else returns false.
boolean result = currentPage != null ? currentPage.hasContent() : false;
In case you would like to check for pages that have not been authored, one possible way is to check if there are any additional nodes that are present under jcr:content.
Node contentNode = currentPage.getContentResource().adaptTo(Node.class);
boolean result = contentNode.hasNodes();
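Combining the two checks with the null guards they need (getContentResource() can return null for a page without jcr:content), a sketch of the pageHasContent method the question calls:

import javax.jcr.Node;
import javax.jcr.RepositoryException;
import org.apache.sling.api.resource.Resource;
import com.day.cq.wcm.api.Page;

// Returns true only if the page has a jcr:content node with at least one child node.
static boolean pageHasContent(Page currentPage) throws RepositoryException {
    if (currentPage == null || !currentPage.hasContent()) {
        return false;
    }
    Resource contentResource = currentPage.getContentResource();
    if (contentResource == null) {
        return false;
    }
    Node contentNode = contentResource.adaptTo(Node.class);
    return contentNode != null && contentNode.hasNodes();
}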
I would create an OSGi service that takes a Page and walks its content tree according to the rules that you set to find out whether the page has meaningful content.
Whether a page has actual content or not is application-specific, so creating your own service will give you full control on that decision.
One way is to create a new page using the same template, then iterate through the node list and calculate the hash of the components (or of their content, depending on what exactly you want to compare). Once you have the hash of an empty page template, you can compare any other page's hash with it.
Note: this solution needs to be adapted to your own use case. Maybe it is enough for you to check which components are on the page and their order, and maybe you want to compare their configurations as well.
private boolean areHashesEqual(final Resource copiedPageRes, final Resource currentPageRes) {
    final Resource currentRes = currentPageRes.getChild(com.day.cq.commons.jcr.JcrConstants.JCR_CONTENT);
    return currentRes != null && ModelUtils.getPageHash(copiedPageRes).equals(ModelUtils.getPageHash(currentRes));
}
Model Utils:
public static String getPageHash(final Resource res) {
    long pageHash = 0;
    final Queue<Resource> components = new ArrayDeque<>();
    components.add(res);
    while (!components.isEmpty()) {
        final Resource currentRes = components.poll();
        for (final Resource child : currentRes.getChildren()) {
            components.add(child);
        }
        pageHash = ModelUtils.getHash(pageHash, currentRes.getResourceType());
    }
    return String.valueOf(pageHash);
}

/**
 * Returns the sum of the hash codes of all parameters.
 * @param args the objects to hash
 * @return the combined hash
 */
public static long getHash(final Object... args) {
    long result = 0;
    for (final Object arg : args) {
        if (arg != null) {
            result += arg.hashCode();
        }
    }
    return result;
}
Note: using a Queue walks the components in breadth-first order, though since the hashes are summed, the order does not actually change the final value; if order should matter, combine the hashes in an order-sensitive way.
This was my approach, but I had a very specific use case. In general, consider whether you really want to calculate the hash of every component on every page you want to publish, since this will slow down the publishing process.
You can also compare hashes in every iteration and stop the calculation at the first difference.
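For illustration, one way the comparison might be wired up, assuming a PageManager is at hand (as in the question's servlet); the temporary path and page names are placeholders, and the probe page should be removed again afterwards:

// Hypothetical wiring: create a blank page from the same template and compare hashes.
Page emptyPage = pageManager.create("/content/tmp", "empty-probe",
        currentPage.getTemplate().getPath(), "empty-probe"); // throws WCMException
boolean hasRealContent = !areHashesEqual(
        emptyPage.getContentResource(),       // jcr:content of the blank template page
        currentPage.adaptTo(Resource.class)); // areHashesEqual resolves jcr:content itself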

Using JSoup to scrape Google Results

I'm trying to use JSoup to scrape the search results from Google. Currently this is my code.
public class GoogleOptimization {
    public static void main(String[] args) {
        Document doc;
        try {
            doc = Jsoup.connect("https://www.google.com/search?as_q=&as_epq=%22Yorkshire+Capital%22+&as_oq=fraud+OR+allegations+OR+scam&as_eq=&as_nlo=&as_nhi=&lr=lang_en&cr=countryCA&as_qdr=all&as_sitesearch=&as_occt=any&safe=images&tbs=&as_filetype=&as_rights=")
                    .userAgent("Mozilla").ignoreHttpErrors(true).timeout(0).get();
            Elements links = doc.select("what should i put here?");
            for (Element link : links) {
                System.out.println("\n" + link.text());
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}
I'm just trying to get the title of each search result and the snippet below the title. I just don't know which elements to look for in order to scrape these. If anyone has a better method to scrape Google using Java, I would love to know.
Thanks.
Here you go.
public class ScanWebSO {
    public static void main(String[] args) {
        Document doc;
        try {
            doc = Jsoup.connect("https://www.google.com/search?as_q=&as_epq=%22Yorkshire+Capital%22+&as_oq=fraud+OR+allegations+OR+scam&as_eq=&as_nlo=&as_nhi=&lr=lang_en&cr=countryCA&as_qdr=all&as_sitesearch=&as_occt=any&safe=images&tbs=&as_filetype=&as_rights=")
                    .userAgent("Mozilla").ignoreHttpErrors(true).timeout(0).get();
            Elements links = doc.select("li[class=g]");
            for (Element link : links) {
                Elements titles = link.select("h3[class=r]");
                String title = titles.text();
                Elements bodies = link.select("span[class=st]");
                String body = bodies.text();
                System.out.println("Title: " + title);
                System.out.println("Body: " + body + "\n");
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}
Also, to do this yourself I would suggest using Chrome. You just right-click on whatever you want to scrape and choose Inspect Element. It will take you to the exact spot in the HTML where that element is located. In this case you first want to find the root of all the result listings. When you find it, you want to specify the element, and preferably a unique attribute to search it by. In this case the root element is
<ol eid="" id="rso">
Below that you will see a bunch of listings that start with
<li class="g">
This is what you want to put into your initial elements array, then for each element you will want to find the spot where the title and body are. In this case, I found the title to be under the
<h3 class="r" style="white-space: normal;">
element. So you will search for that element in each listing. The same goes for the body: I found the body to be under the <span class="st"> element, so I searched for that and used the .text() method, which returned all the text under that element. The key is to ALWAYS try to find an element with a distinctive attribute (using a class name is ideal). If you don't, and only search for something like "div", it will search the entire page for ANY element containing div and return all of them, so you will get WAY more results than you want. I hope this explains it well. Let me know if you have any more questions.

Hierarchical Vaadin Tree from XML (MSDL)

I'm trying to build a Vaadin tree from an XML file (MSDL), and I'm stuck at adding child items to my tree. So far I can read from my XML file and display the tags/info I want, but I can't make a hierarchical structure out of it. E.g.:
I have an XML file with some information about planets, their moons, and the galaxies they are in:
Milky Way
  Sunsystem
    Earth
      "Moon"
    Mars
      Phobos
      Deimos
    Saturn
      Titan
      Tethys
Pinwheel Galaxy
  somesystem
    weirdPlanet1
      moon1
      moon2
    weirdPlanet2
      moon1
      moon2
Now I want to have the same structure in my Vaadin tree. I have tried lots of things, but the result was always the same: either some null values were added to the tree, or I could see only the galaxies but couldn't expand them, or I could see a tree with all the info but no structure at all, with all planets/moons just listed flat. :/
I'm pretty sure this doesn't have anything to do with the Tree itself. Instead of adding the data directly to the Tree, you can try this:
1. Parse the XML data into a HierarchicalContainer (a sketch of this step is at the end of this answer)
2. Iterate through the HierarchicalContainer with the sample code below and verify that it's identical to your XML file structure
3. Bind the data container to the tree by calling Tree.setContainerDataSource(Container)
Sample code to iterate through a HierarchicalContainer:
void iterateContainer() {
    for (Object rootItemId : myContainer.rootItemIds()) {
        Item rootItem = myContainer.getItem(rootItemId);
        System.out.println(rootItem.getItemProperty(myLabelProperty).getValue());
        iterateChildren(rootItemId, 1);
    }
}

void iterateChildren(Object parentItemId, int indent) {
    for (Object childItemId : myContainer.getChildren(parentItemId)) {
        Item childItem = myContainer.getItem(childItemId);
        for (int i = 0; i < indent; i++) {
            System.out.print("  ");
        }
        System.out.println(childItem.getItemProperty(myLabelProperty).getValue());
        if (myContainer.hasChildren(childItemId)) {
            iterateChildren(childItemId, indent + 1);
        }
    }
}
This is just some untested quick-and-dirty code, but it should help you iterate through the container.
Edit: Just noticed that my answer could have been (partially) a stupid solution, since Tree already utilizes a HierarchicalContainer. You can initialize myContainer with HierarchicalContainer myContainer = (HierarchicalContainer) myTree.getContainerDataSource(); and use the code above.
Edit 2: And if the structure isn't identical, see where it goes wrong and let the debugger do the rest. :)
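As for step 1, a minimal sketch of parsing an XML file into a HierarchicalContainer with the JDK's DOM parser; the "name" container property and the name attribute are assumptions, since the actual MSDL structure isn't shown:

import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import com.vaadin.data.util.HierarchicalContainer;

HierarchicalContainer parseXml(java.io.File file) throws Exception {
    HierarchicalContainer container = new HierarchicalContainer();
    container.addContainerProperty("name", String.class, "");
    Document dom = DocumentBuilderFactory.newInstance().newDocumentBuilder().parse(file);
    addChildren(container, dom.getDocumentElement(), null);
    return container;
}

// Recursively add each XML element as an item whose parent is the enclosing element.
void addChildren(HierarchicalContainer container, Element element, Object parentId) {
    NodeList children = element.getChildNodes();
    for (int i = 0; i < children.getLength(); i++) {
        Node node = children.item(i);
        if (node.getNodeType() != Node.ELEMENT_NODE) {
            continue;
        }
        Element child = (Element) node;
        Object itemId = container.addItem(); // auto-generated item id
        container.getItem(itemId).getItemProperty("name")
                .setValue(child.getAttribute("name")); // assumed name attribute
        if (parentId != null) {
            container.setParent(itemId, parentId); // items without a parent stay roots
        }
        addChildren(container, child, itemId);
    }
}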
