How to get each URL of a website using Selenium? - java

I am currently automating a website whose URL is constantly changing (an SSO-like site), and parameters are passed in the query string. I want to capture each of the URLs the website goes through on its way to a specific page. How can I achieve that using Selenium WebDriver?
I tried driver.getCurrentUrl() at regular intervals, but it is not reliable.
Is there any other workaround for this?
Many thanks!
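For reference, a minimal polling sketch of the getCurrentUrl() idea (the class and method names here are made up; it assumes an already running WebDriver instance). Very fast redirects can still slip between two polls, which is exactly the unreliability described above; capturing every hop reliably would need a proxy such as BrowserMob or the browser's DevTools interface.

import java.util.LinkedHashSet;
import java.util.Set;
import org.openqa.selenium.WebDriver;

public class UrlRecorder {
    // Polls the current URL for the given duration and keeps every distinct value in insertion order.
    public static Set<String> recordUrls(WebDriver driver, long durationMillis) throws InterruptedException {
        Set<String> seen = new LinkedHashSet<>();
        long end = System.currentTimeMillis() + durationMillis;
        while (System.currentTimeMillis() < end) {
            seen.add(driver.getCurrentUrl()); // duplicates are ignored by the Set
            Thread.sleep(50);                 // poll interval; shorter catches more hops
        }
        return seen;
    }
}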

Try to run the following:
driver.get("http://www.telegraph.co.uk/");
List<WebElement> links = driver.findElements(By.tagName("a"));
List<String> externalUrls = new ArrayList();
List<String> internalUrls = new ArrayList();
System.out.println(links.size());
for (int i = 1; i <= links.size(); i = i + 1) {
String url = links.get(i).getAttribute("href");
System.out.println("Name:"+links.get(i).getText());
System.out.println("url"+url);
System.out.println("----");
if (url.startsWith("http://www.telegraph.co.uk/")) {
if(!internalUrls.contains(url))
internalUrls.add(links.get(i).getAttribute("href"));
} else {
if(!externalUrls.contains(url))
externalUrls.add(links.get(i).getAttribute("href"));
}
}
If you want to gather all the links for your website, then I would do something like:
public class GetAllLinksFromThePage {

    static List<String> externalUrls = new ArrayList<>();
    static List<String> internalUrls = new ArrayList<>();

    public static void main(String[] args) {
        MyChromeDriver myChromeDriver = new MyChromeDriver();
        WebDriver driver = myChromeDriver.initChromeDriver();
        checkForLinks(driver, "http://www.telegraph.co.uk/");
        System.out.println("finish");
    }

    public static void checkForLinks(WebDriver driver, String page) {
        driver.get(page);
        System.out.println("PAGE->" + page);
        // Collect the hrefs first: recursing while iterating the WebElements would
        // navigate away from the page and make them stale.
        List<String> urlsOnPage = new ArrayList<>();
        for (WebElement we : driver.findElements(By.tagName("a"))) {
            String url = we.getAttribute("href");
            if (url != null) {
                urlsOnPage.add(url);
            }
        }
        for (String url : urlsOnPage) {
            if (url.startsWith("http://www.telegraph.co.uk/")) { // my main page
                if (!internalUrls.contains(url)) {
                    internalUrls.add(url);
                    System.out.println(url + " has been added to internalUrls");
                    checkForLinks(driver, url);
                }
            } else if (!externalUrls.contains(url)) {
                externalUrls.add(url);
                System.out.println(url + " has been added to externalUrls");
            }
        }
    }
}
Hope that helped!

Related

I am trying to automate signup through the Google application using Selenium and Java. It signs up, but I am getting an error in the console

I am a beginner in Selenium automation testing. I am trying to automate signup through the Google application. It signs up and then redirects to the dashboard, but I get an error in the console.
driver.findElement(By.cssSelector("a.btn.btn-circle.btn-sm.btn-google.mr-2")).click();
WebDriverWait wait = new WebDriverWait(driver,50);
wait.until(ExpectedConditions.numberOfWindowsToBe(2));
Set<String> s1 = driver.getWindowHandles();
Iterator<String> i1 = s1.iterator();
while(i1.hasNext())
{
String next_tab = i1.next();
if (!parentWindow.equalsIgnoreCase(next_tab))
{
driver.switchTo().window(next_tab);
WebDriverWait wait2 = new WebDriverWait(driver, 20);
wait2.until(ExpectedConditions.elementToBeClickable(By.xpath("//input[#id='identifierId']"))).sendKeys("netcse02#gmail.com");
driver.findElement(By.id("identifierNext")).click();
new WebDriverWait(driver, 10).until(ExpectedConditions.elementToBeClickable(By.xpath("//input[#name='password']"))).sendKeys("*******");
driver.findElement(By.id("passwordNext")).click();
wait.until(ExpectedConditions.titleContains("Unify | Dashboard"));
System.out.println(driver.getTitle());
Thread.sleep(5000);
}
}
The error when registering with the Google application:
no such window: window was already closed
These are two generic methods that I use in my automation framework.
public void waitForNewWindowAndSwitchToIt(WebDriver driver) {
    String cHandle = driver.getWindowHandle();
    String newWindowHandle = null;
    // Poll for up to 20 seconds for the new window and throw an exception if it never appears
    for (int i = 0; i < 20; i++) {
        Set<String> allWindowHandles = driver.getWindowHandles(); // re-read the handles on every iteration
        if (allWindowHandles.size() > 1) {
            for (String handle : allWindowHandles) {
                if (!handle.equals(cHandle)) {
                    newWindowHandle = handle;
                }
            }
            driver.switchTo().window(newWindowHandle);
            break;
        }
        try {
            Thread.sleep(1000);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }
    if (newWindowHandle == null) { // comparing String handles with == checked references; check that a new window was actually found
        throw new RuntimeException("Time out - No window found");
    }
}
public boolean closeAllOtherWindows(WebDriver driver, String openWindowHandle) {
    Set<String> allWindowHandles = driver.getWindowHandles();
    for (String currentWindowHandle : allWindowHandles) {
        if (!currentWindowHandle.equals(openWindowHandle)) {
            driver.switchTo().window(currentWindowHandle);
            driver.close();
        }
    }
    driver.switchTo().window(openWindowHandle);
    return driver.getWindowHandles().size() == 1;
}
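With these helpers, the window-handle loop in the question could shrink to something like the following sketch (parentWindow is assumed to hold the original handle, as in the question, and the locators are unchanged):

driver.findElement(By.cssSelector("a.btn.btn-circle.btn-sm.btn-google.mr-2")).click();
waitForNewWindowAndSwitchToIt(driver); // switches to the Google sign-in popup
new WebDriverWait(driver, 20)
        .until(ExpectedConditions.elementToBeClickable(By.id("identifierId")))
        .sendKeys("netcse02@gmail.com");
driver.findElement(By.id("identifierNext")).click();
new WebDriverWait(driver, 10)
        .until(ExpectedConditions.elementToBeClickable(By.name("password")))
        .sendKeys("*******");
driver.findElement(By.id("passwordNext")).click();
new WebDriverWait(driver, 50).until(ExpectedConditions.titleContains("Unify | Dashboard"));
closeAllOtherWindows(driver, parentWindow); // back to the original window once sign-in completes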

Crawling amazon.com

I'm crawling Amazon products and in principle it's going fine.
I have three classes from this nice tutorial:
http://www.netinstructions.com/how-to-make-a-simple-web-crawler-in-java/
I added the files to the following code (class Spider):
import java.io.FileNotFoundException;
import java.util.*;
public class Spider {
    public static final int MAX_PAGES_TO_SEARCH = 10000;
    private Set<String> pagesVisited = new HashSet<String>();
    private List<String> pagesToVisit = new LinkedList<String>();

    public void search(String url) {
        while (this.pagesVisited.size() < MAX_PAGES_TO_SEARCH) {
            String currentUrl;
            SpiderLeg leg = new SpiderLeg();
            if (this.pagesToVisit.isEmpty()) {
                //System.out.println("abc");
                currentUrl = url;
                this.pagesVisited.add(url);
            } else {
                //System.out.println("def");
                currentUrl = this.nextUrl();
            }
            try {
                Thread.sleep(10000);
                leg.crawl(currentUrl); // Lots of stuff happening here. Look at the crawl method.
            } catch (FileNotFoundException e) {
                System.out.println("Oops, FileNotFoundException caught");
            } catch (InterruptedException e) {
                e.printStackTrace();
            }
            this.pagesToVisit.addAll(leg.getLinks());
            //System.out.println("Test");
        }
        System.out.println("\n**Done** Visited " + this.pagesVisited.size() + " web page(s)");
        SpiderLeg leg = new SpiderLeg();
        leg.calcAdjMatrix();
        for (int i = 0; i < leg.adjMatrix.length; i++) {
            System.out.println(Arrays.toString(leg.adjMatrix[i]));
        }
    }

    private String nextUrl() {
        String nextUrl;
        do {
            if (this.pagesToVisit.isEmpty()) {
                return "https://www.amazon.de/Proband-Thriller-Guido-Kniesel/dp/1535287004/ref=sr_1_1?s=books&ie=UTF8&qid=1478247246&sr=1-1&keywords=%5B%5D";
            }
            nextUrl = this.pagesToVisit.remove(0);
        } while (this.pagesVisited.contains(nextUrl));
        this.pagesVisited.add(nextUrl);
        return nextUrl;
    }
}
class SpiderLeg:
import org.jsoup.Connection;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
import java.io.*;
import java.util.*;
public class SpiderLeg {
    // We'll use a fake USER_AGENT so the web server thinks the robot is a normal web browser.
    private static final String USER_AGENT = "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.71 Safari/537.36";
    private static List<String> links = new LinkedList<String>();
    private static String graphLink;
    private Document htmlDocument;
    private static double counter = 0;
    static Map<String, Set<String>> adjMap = new HashMap<String, Set<String>>();
    static int[][] adjMatrix;
    static List<String> mapping;
    public boolean crawl(String url) throws FileNotFoundException {
        if (url.isEmpty()) {
            return false;
        }
        try {
            Connection connection = Jsoup.connect(url).ignoreContentType(true).userAgent(USER_AGENT);
            Document htmlDocument = connection.get();
            this.htmlDocument = htmlDocument;
            if (connection.response().statusCode() == 200) {
                // 200 is the HTTP OK status code, indicating that everything is great.
                counter++;
                double progress = (counter / Spider.MAX_PAGES_TO_SEARCH) * 100;
                System.out.println("\n**Visiting** Received web page at " + url);
                System.out.println("\n**Progress** " + progress + "%");
            }
            if (!connection.response().contentType().contains("text/html")) {
                System.out.println("**Failure** Retrieved something other than HTML");
                return false;
            }
            //Elements linksOnPage = htmlDocument.select("a[href*=/gp/product/]");
            Elements linksOnPage = htmlDocument.select("a[href*=/dp/]");
            Elements salesRank = htmlDocument.select("span.zg_hrsr_rank");
            Elements category = htmlDocument.select("span.zg_hrsr_ladder a");
            String categoryString = category.html().replace("\n", " ");
            String salesRankString = salesRank.html().replace("\n", " ");
            //System.out.println(categoryString);
            System.out.println("Found (" + linksOnPage.size() + ") links");
            int beginIndex = url.indexOf(".de/");
            int endIndex = url.indexOf("/dp");
            String title = url.substring(beginIndex + 4, endIndex);
            if (!adjMap.containsKey(title)) {
                if (categoryString.contains("Horror")) {
                    adjMap.put(title, new HashSet<String>());
                    StringBuilder sb = new StringBuilder();
                    sb.append(title).append(',');
                    sb.append(salesRankString).append(',');
                    sb.append(categoryString).append(',');
                    for (Element link : linksOnPage) {
                        String graphLink = link.attr("abs:href");
                        if (!graphLink.contains("one-click")
                                && !graphLink.contains("Kindle")
                                && !graphLink.contains("unsticky")) {
                            links.add(graphLink);
                            //adjMap.get(url).add(graphLink);
                            adjMap.get(title).add(cutTitle(graphLink));
                            sb.append(graphLink).append(',');
                        }
                    }
                    sb.append('\n');
                    // Open the writer only when there is a row to write, and always close it to avoid leaking file handles.
                    PrintWriter pw = new PrintWriter(new FileWriter("Horror.csv", true));
                    pw.write(sb.toString());
                    pw.close();
                }
            }
            System.out.println("done!");
            return true;
        } catch (IOException ioe) {
            // We were not successful in our HTTP request
            System.out.println("Error in our HTTP request " + ioe);
            return false;
        }
    }
    public static void calcAdjMatrix() {
        Set<String> allMyURLs = new HashSet<String>(adjMap.keySet());
        for (String s : adjMap.keySet()) {
            allMyURLs.addAll(adjMap.get(s));
            System.out.println(s + "\t" + adjMap.get(s));
        }
        int dim = allMyURLs.size();
        adjMatrix = new int[dim][dim];
        List<String> nodes_list = new ArrayList<>();
        for (String s : allMyURLs) {
            nodes_list.add(s);
        }
        for (String s : nodes_list) {
            Set<String> outEdges = adjMap.get(s);
            int i = nodes_list.indexOf(s);
            if (outEdges != null) {
                for (String s1 : outEdges) {
                    int j = nodes_list.indexOf(s1);
                    adjMatrix[i][j] = 1;
                }
            }
        }
    }

    public String cutTitle(String url) throws FileNotFoundException {
        int beginIndex = url.indexOf(".de/");
        int endIndex = url.indexOf("/dp");
        String title;
        if (url.contains(".de") && url.contains("/dp")) {
            title = url.substring(beginIndex + 4, endIndex);
        } else {
            title = "wrong url";
        }
        return title;
    }

    public boolean searchForWord(String searchWord) {
        if (this.htmlDocument == null) {
            System.out.println("ERROR! Call crawl() before performing analysis on the document");
            return false;
        }
        System.out.println("Searching for the word " + searchWord + "...");
        String bodyText = this.htmlDocument.body().text();
        return bodyText.toLowerCase().contains(searchWord.toLowerCase());
    }

    public List<String> getLinks() {
        return this.links;
    }
}
class SpiderTest:
public class SpiderTest {
    public static void main(String[] args) {
        Spider spider = new Spider();
        spider.search("https://www.amazon.de/Wille-geschehe-Psychothriller-Guido-Kniesel/dp/1537455389/ref=pd_sim_14_1?_encoding=UTF8&psc=1&refRID=CQPDDGY4BJ4D8THNNSZ6");
    }
}
Now the problem is that after about 100 URLs, I think, Amazon starts banning me from the server. The program doesn't find URLs anymore.
Does anyone have an idea how I can fix that?
Well, don't be rude and crawl them then.
Check their robots.txt (wiki) to see what they allow you to do. Don't be surprised if they ban you if you go places they don't want you to go.
The problem is very common when you try to crawl big websites that don't want to be crawled. They basically block you for a period of time to prevent their data being crawled or stolen.
With that being said, you have two options: either make each request from a different IP/server, which will make your requests look legitimate and avoid the ban, or go for the easiest way, which is to use a service that does that for you.
I've done both; the first one is complex, takes time and needs maintenance (you have to build a network of servers), while the second option is usually not free but very fast to implement and guarantees that all your requests will always return data and you won't be banned.
There are some services on the internet that do that. I've used proxycrawl (which also has a free tier) in the past, and it works very well. They have an API that you can call, and you can keep using your same code, just changing the URL you call.
This would be an example for amazon:
GET https://api.proxycrawl.com?token=yourtoken&url=https://amazon.com
And you would always get a response. Even if you crawl 1,000 pages a second, you will never be banned, because you are calling the proxy instead, which does all the magic for you.
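In the Jsoup-based crawler above, that could mean wrapping the target URL before connecting, roughly like this sketch (the token placeholder comes from the example above; the exact query format and encoding rules are assumptions, so check the service's documentation):

import java.io.IOException;
import java.net.URLEncoder;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class ProxiedFetch {
    // Fetches a page through the proxy API instead of hitting Amazon directly.
    public static Document fetch(String token, String targetUrl) throws IOException {
        String proxied = "https://api.proxycrawl.com/?token=" + token
                + "&url=" + URLEncoder.encode(targetUrl, "UTF-8");
        return Jsoup.connect(proxied).get();
    }
}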
I hope it helps :)
You can try using proxy servers to prevent being blocked. There are services providing working proxies. I have had good experience with https://gimmeproxy.com, which specifically has proxies supporting Amazon.
To get a proxy working with Amazon, you just need to make the following request:
https://gimmeproxy.com/api/getProxy?api_key=your_api_key&websites=amazon
You will get a JSON response with all the proxy data, which you can then use as needed:
{
  "supportsHttps": true,
  "protocol": "socks5",
  "ip": "116.182.122.182",
  "port": "1915",
  "get": true,
  "post": true,
  "cookies": true,
  "referer": true,
  "user-agent": true,
  "anonymityLevel": 1,
  "websites": {
    "example": true,
    "google": false,
    "amazon": true
  },
  "country": "BR",
  "tsChecked": 1517952910,
  "curl": "socks5://116.182.122.182:1915",
  "ipPort": "116.182.122.182:1915",
  "type": "socks5",
  "speed": 37.78,
  "otherProtocols": {}
}
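A sketch of wiring those fields into the Jsoup call the crawler already makes (it assumes the SOCKS5 entry shown above; for a plain HTTP proxy, Jsoup's proxy(host, port) shortcut would do the same job):

import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Proxy;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class ProxyFetch {
    // Routes a single request through the proxy described by the "ip" and "port" fields of the JSON response.
    public static Document fetchVia(String ip, int port, String targetUrl) throws IOException {
        Proxy proxy = new Proxy(Proxy.Type.SOCKS, new InetSocketAddress(ip, port));
        return Jsoup.connect(targetUrl)
                .proxy(proxy) // org.jsoup.Connection#proxy(java.net.Proxy)
                .get();
    }
}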

WebCrawler with recursion

So I am working on a web crawler that is supposed to download all images, files, and web pages, and then recursively do the same for all the web pages it finds. However, I seem to have a logic error.
public class WebCrawler {
    private static String url;
    private static int maxCrawlDepth;
    private static String filePath;

    /* Recursive function that crawls all web pages found on a given web page.
     * This function also saves elements from the DownloadRepository to disk.
     */
    public static void crawling(WebPage webpage, int currentCrawlDepth, int maxCrawlDepth) {
        webpage.crawl(currentCrawlDepth);
        HashMap<String, WebPage> pages = webpage.getCrawledWebPages();
        if (currentCrawlDepth < maxCrawlDepth) {
            for (WebPage wp : pages.values()) {
                crawling(wp, currentCrawlDepth + 1, maxCrawlDepth);
            }
        }
    }

    public static void main(String[] args) {
        if (args.length != 3) {
            System.out.println("Must pass three parameters");
            System.exit(0);
        }
        url = "";
        maxCrawlDepth = 0;
        filePath = "";
        url = args[0];
        try {
            URL testUrl = new URL(url);
            URLConnection urlConnection = testUrl.openConnection();
            urlConnection.connect();
        } catch (MalformedURLException e) {
            System.out.println("Not a valid URL");
            System.exit(0);
        } catch (IOException e) {
            System.out.println("Could not open URL");
            System.exit(0);
        }
        try {
            maxCrawlDepth = Integer.parseInt(args[1]);
        } catch (NumberFormatException e) {
            System.out.println("Argument is not an int");
            System.exit(0);
        }
        filePath = args[2];
        File path = new File(filePath);
        if (!path.exists()) {
            System.out.println("File Path is invalid");
            System.exit(0);
        }
        WebPage webpage = new WebPage(url);
        crawling(webpage, 0, maxCrawlDepth);
        System.out.println("Web crawl is complete");
    }
}
The crawl method parses the contents of a web page, storing any found images, files, or links into a HashMap, for example:
public class WebPage implements WebElement {

    private static Elements images;
    private static Elements links;
    private HashMap<String, WebImage> webImages = new HashMap<String, WebImage>();
    private HashMap<String, WebPage> webPages = new HashMap<String, WebPage>();
    private HashMap<String, WebFile> files = new HashMap<String, WebFile>();
    private String url;

    public WebPage(String url) {
        this.url = url;
    }

    /* The crawl method parses the html on a given web page
     * and adds the elements of the web page to the Download
     * Repository.
     */
    public void crawl(int currentCrawlDepth) {
        System.out.print("Crawling " + url + " at crawl depth ");
        System.out.println(currentCrawlDepth + "\n");
        Document doc = null;
        try {
            HttpConnection httpConnection = (HttpConnection) Jsoup.connect(url);
            httpConnection.ignoreContentType(true);
            doc = httpConnection.get();
        } catch (MalformedURLException e) {
            System.out.println(e.getLocalizedMessage());
        } catch (IOException e) {
            System.out.println(e.getLocalizedMessage());
        } catch (IllegalArgumentException e) {
            System.out.println(url + " is not a valid URL");
        }
        DownloadRepository downloadRepository = DownloadRepository.getInstance();
        if (doc != null) {
            images = doc.select("img");
            links = doc.select("a[href]");
            for (Element image : images) {
                String imageUrl = image.absUrl("src");
                if (!webImages.containsKey(imageUrl)) { // key on the URL; containsValue(image) compared an Element against WebImage values and was always false
                    WebImage webImage = new WebImage(imageUrl);
                    webImages.put(imageUrl, webImage);
                    downloadRepository.addElement(imageUrl, webImage);
                    System.out.println("Added image at " + imageUrl);
                }
            }
            HttpConnection mimeConnection = null;
            Response mimeResponse = null;
            for (Element link : links) {
                String linkUrl = link.absUrl("href");
                linkUrl = linkUrl.trim();
                if (!linkUrl.contains("#")) {
                    try {
                        mimeConnection = (HttpConnection) Jsoup.connect(linkUrl);
                        mimeConnection.ignoreContentType(true);
                        mimeConnection.ignoreHttpErrors(true);
                        mimeResponse = (Response) mimeConnection.execute();
                    } catch (Exception e) {
                        System.out.println(e.getLocalizedMessage());
                    }
                    String contentType = null;
                    if (mimeResponse != null) {
                        contentType = mimeResponse.contentType();
                    }
                    if (contentType == null) {
                        continue;
                    }
                    if (contentType.equals("text/html")) {
                        if (!webPages.containsKey(linkUrl)) {
                            WebPage webPage = new WebPage(linkUrl);
                            webPages.put(linkUrl, webPage);
                            downloadRepository.addElement(linkUrl, webPage);
                            System.out.println("Added webPage at " + linkUrl);
                        }
                    } else {
                        if (!files.containsKey(linkUrl)) { // same fix as above: key on the URL
                            WebFile webFile = new WebFile(linkUrl);
                            files.put(linkUrl, webFile);
                            downloadRepository.addElement(linkUrl, webFile);
                            System.out.println("Added file at " + linkUrl);
                        }
                    }
                }
            }
        }
        System.out.print("\nFinished crawling " + url + " at crawl depth ");
        System.out.println(currentCrawlDepth + "\n");
    }

    public HashMap<String, WebImage> getImages() {
        return webImages;
    }

    public HashMap<String, WebPage> getCrawledWebPages() {
        return webPages;
    }

    public HashMap<String, WebFile> getFiles() {
        return files;
    }

    public String getUrl() {
        return url;
    }

    @Override
    public void saveToDisk(String filePath) {
        System.out.println(filePath);
    }
}
The point of using a hashmap is to ensure that I do not parse the same website more than once. The error seems to be with my recursion. What is the issue?
Here is also some sample output for starting the crawl at http://www.google.com
Crawling https://www.google.com/ at crawl depth 0
Added webPage at http://www.google.com/intl/en/options/
Added webPage at https://www.google.com/intl/en/ads/
Added webPage at https://www.google.com/services/
Added webPage at https://www.google.com/intl/en/about.html
Added webPage at https://www.google.com/intl/en/policies/
Finished crawling https://www.google.com/ at crawl depth 0
Crawling https://www.google.com/services/ at crawl depth 1
Added webPage at http://www.google.com/intl/en/enterprise/apps/business/?utm_medium=et&utm_campaign=en&utm_source=us-en-et-nelson_bizsol
Added webPage at https://www.google.com/services/sitemap.html
Added webPage at https://www.google.com/intl/en/about/
Added webPage at https://www.google.com/intl/en/policies/
Finished crawling https://www.google.com/services/ at crawl depth 1
**Crawling https://www.google.com/intl/en/policies/ at crawl depth 2**
Added webPage at https://www.google.com/intl/en/policies/
Added webPage at https://www.google.com/intl/en/policies/terms/
Added webPage at https://www.google.com/intl/en/policies/privacy/
Added webPage at https://www.google.com/intl/en/policies/terms/
Added webPage at https://www.google.com/intl/en/policies/faq/
Added webPage at https://www.google.com/intl/en/policies/technologies/
Added webPage at https://www.google.com/intl/en/about/
Added webPage at https://www.google.com/intl/en/policies/
Finished crawling https://www.google.com/intl/en/policies/ at crawl depth 2
**Crawling https://www.google.com/intl/en/policies/ at crawl depth 3**
Notice that it parses http://www.google.com/intl/en/policies/ twice
You are creating a new map for each web-page. This will ensure that if the same link occurs on the page twice it will only be crawled once but it will not deal with the case where the same link appears on two different pages.
https://www.google.com/intl/en/policies/ appears on both https://www.google.com/ and https://www.google.com/services/.
To avoid this use a single map throughout your crawl and pass it as a parameter into the recursion.
public class WebCrawler {
    private HashMap<String, WebPage> visited = new HashMap<String, WebPage>();

    public static void crawling(Map<String, WebPage> visited, WebPage webpage, int currentCrawlDepth, int maxCrawlDepth) {
    }
}
As you are also holding maps of the images etc., you may prefer to create a new object, perhaps called Visited, and have it keep track:
public class Visited {

    private HashMap<String, WebPage> webPages = new HashMap<String, WebPage>();

    public boolean visit(String url, WebPage page) {
        if (webPages.containsKey(url)) { // key on the URL, not the page object
            return false;
        }
        webPages.put(url, page);
        return true;
    }

    private HashMap<String, WebImage> webImages = new HashMap<String, WebImage>();

    public boolean visit(String url, WebImage image) {
        if (webImages.containsKey(url)) {
            return false;
        }
        webImages.put(url, image);
        return true;
    }
}
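A sketch of how the recursive method might consult that shared object (names follow the code in the question; this is one way to wire it up, not the only one):

public static void crawling(Visited visited, WebPage webpage, int currentCrawlDepth, int maxCrawlDepth) {
    webpage.crawl(currentCrawlDepth);
    if (currentCrawlDepth < maxCrawlDepth) {
        for (WebPage wp : webpage.getCrawledWebPages().values()) {
            // visit() returns false for a URL that has already been seen, so the page is skipped
            if (visited.visit(wp.getUrl(), wp)) {
                crawling(visited, wp, currentCrawlDepth + 1, maxCrawlDepth);
            }
        }
    }
}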

How to get HtmlElements from a website

I am trying to get URLs and HTML elements from a website. I am able to get the URLs and HTML, but when one URL contains multiple elements (like multiple input elements or multiple textarea elements), I am only getting the last element. The code is below.
GetURLsAndElemens.java
public static void main(String[] args) throws FileNotFoundException,
        IOException, ParseException {
    Properties properties = new Properties();
    properties.load(new FileInputStream(
            "src//io//servicely//ci//plugin//SeleniumResources.properties"));
    Map<String, String> urls = gettingUrls(properties.getProperty("MAIN_URL"));
    GettingHTMLElements.getHTMLElements(urls);
    // System.out.println(urls.size());
    // System.out.println(urls);
}

public static Map<String, String> gettingUrls(String mainURL) {
    Document doc = null;
    Map<String, String> urlsList = new HashMap<String, String>();
    try {
        System.out.println("Main URL " + mainURL);
        // need http protocol
        doc = Jsoup.connect(mainURL).get();
        GettingHTMLElements.getInputElements(doc, mainURL);
        // get page title
        // String title = doc.title();
        // System.out.println("title : " + title);
        // get all links
        Elements links = doc.select("a[href]");
        for (Element link : links) {
            // urlsList.clear();
            // get the value from the href attribute and add it to the list
            if (link.attr("href").contains("http")) {
                urlsList.put(link.attr("href"), link.text());
            } else {
                urlsList.put(mainURL + link.attr("href"), link.text());
            }
            // System.out.println(urlsList);
        }
    } catch (IOException e) {
        e.printStackTrace();
    }
    // System.out.println("Total urls are " + urlsList.size());
    // System.out.println(urlsList);
    return urlsList;
}
GettingHtmlElements.java
static Map<String, HtmlElements> urlList = new HashMap<String, HtmlElements>();

public static void getHTMLElements(Map<String, String> urls) throws IOException {
    getElements(urls);
}

public static void getElements(Map<String, String> urls) throws IOException {
    for (Map.Entry<String, String> entry1 : urls.entrySet()) {
        try {
            System.out.println(entry1.getKey());
            Document doc = Jsoup.connect(entry1.getKey()).get();
            getInputElements(doc, entry1.getKey());
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
    Map<String, HtmlElements> list = urlList;
    for (Map.Entry<String, HtmlElements> entry1 : list.entrySet()) {
        HtmlElements ele = entry1.getValue();
        System.out.println("url is " + entry1.getKey());
        System.out.println("input name " + ele.getInput_name());
    }
}

public static HtmlElements getInputElements(Document doc, String entry1) {
    HtmlElements htmlElements = new HtmlElements();
    Elements inputElements2 = doc.getElementsByTag("input");
    Elements textAreaElements2 = doc.getElementsByTag("textarea");
    Elements formElements3 = doc.getElementsByTag("form");
    for (Element inputElement : inputElements2) {
        String key = inputElement.attr("name");
        htmlElements.setInput_name(key);
        String key1 = inputElement.attr("type");
        htmlElements.setInput_type(key1);
        String key2 = inputElement.attr("class");
        htmlElements.setInput_class(key2);
    }
    for (Element inputElement : textAreaElements2) {
        String key = inputElement.attr("id");
        htmlElements.setTextarea_id(key);
        String key1 = inputElement.attr("name");
        htmlElements.setTextarea_name(key1);
    }
    for (Element inputElement : formElements3) {
        String key = inputElement.attr("method");
        htmlElements.setForm_method(key);
        String key1 = inputElement.attr("action");
        htmlElements.setForm_action(key1);
    }
    return urlList.put(entry1, htmlElements);
}
I want to take the elements as a bean. For every URL I am getting the URLs and HTML elements, but when a URL contains multiple elements I only get the last element.
You use a class HtmlElements which is not part of JSoup as far as I know. I don't know its inner workings, but I assume it is some sort of list of html nodes or something.
However, you seem to use this class like this:
HtmlElements htmlElements = new HtmlElements();
htmlElements.setInput_name(key);
This indicates that only ONE html element is stored in the htmlElements variable. This would explain why you get only the last element stored - you simply overwrite the one instance all the time.
It is hard to say exactly, since I don't know the HtmlElements class. Maybe something like this works, assuming that HtmlElement represents a single element and HtmlElements has an add method:
HtmlElements htmlElements = new HtmlElements();
...
for (Element inputElement : inputElements2) {
    HtmlElement e = new HtmlElement();
    htmlElements.add(e);
    String key = inputElement.attr("name");
    e.setInput_name(key);
}
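Alternatively, if HtmlElements is your own bean holding one set of attributes, a sketch that keeps one bean per input element by storing a list per URL (the map and list names here are assumptions; the setters are the ones from your code):

static Map<String, List<HtmlElements>> urlList = new HashMap<String, List<HtmlElements>>();

public static void getInputElements(Document doc, String url) {
    List<HtmlElements> elements = new ArrayList<HtmlElements>();
    for (Element inputElement : doc.getElementsByTag("input")) {
        HtmlElements bean = new HtmlElements(); // one bean per element, so nothing gets overwritten
        bean.setInput_name(inputElement.attr("name"));
        bean.setInput_type(inputElement.attr("type"));
        bean.setInput_class(inputElement.attr("class"));
        elements.add(bean);
    }
    urlList.put(url, elements);
}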

Extracting link from a facebook page

I want to extract the content of a Facebook page, mainly the links on the page. I tried extracting with Jsoup, but it does not show the relevant link, the one showing the likes for the topic, e.g. https://www.facebook.com/search/109301862430120/likers. It may be produced by jQuery, Ajax or other JavaScript code. So how can I extract/access that link in Java, or call a JavaScript function with HtmlUnit?
public static void main(String args[]) {
    Testing t = new Testing();
    t.traceLink();
}

public static void traceLink() {
    // File input = new File("/tmp/input.html");
    Document doc = null;
    try {
        doc = Jsoup.connect("https://www.facebook.com/pages/Ice-cream/109301862430120?rf=102173023157556").get();
        Elements link = doc.select("a[href]");
        for (int i = 0; i < link.size(); i++) {
            String stringLink = link.get(i).toString(); // print each anchor individually instead of the whole collection every time
            System.out.println(stringLink);
        }
        System.out.println(link);
        Element links = doc.select("a[href]").first();
        System.out.println(links);
    } catch (IOException e) {
        // e.printStackTrace();
    }
}
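Since the likers link is produced by JavaScript, Jsoup only sees the static HTML. The question mentions HtmlUnit; a sketch with JavaScript enabled that lists every anchor href (the classes and methods are HtmlUnit's; whether Facebook renders that link without logging in is a separate question):

import com.gargoylesoftware.htmlunit.BrowserVersion;
import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.HtmlAnchor;
import com.gargoylesoftware.htmlunit.html.HtmlPage;

public class FacebookLinks {
    public static void main(String[] args) throws Exception {
        try (WebClient webClient = new WebClient(BrowserVersion.CHROME)) {
            webClient.getOptions().setJavaScriptEnabled(true);        // let the page's scripts run
            webClient.getOptions().setThrowExceptionOnScriptError(false);
            HtmlPage page = webClient.getPage(
                    "https://www.facebook.com/pages/Ice-cream/109301862430120?rf=102173023157556");
            for (HtmlAnchor anchor : page.getAnchors()) {
                System.out.println(anchor.getHrefAttribute());
            }
        }
    }
}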
