I want to extract the content of a Facebook page, mainly the links on it. I tried extracting with jsoup, but it does not show the relevant link: the link that shows the likes for the topic, e.g. https://www.facebook.com/search/109301862430120/likers. That part of the page is probably generated by jQuery/Ajax/JavaScript. So how can I extract or access that link in Java, or call a JavaScript function, with HtmlUnit?
import java.io.IOException;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class Testing {

    public static void main(String[] args) {
        traceLink();
    }

    public static void traceLink() {
        try {
            Document doc = Jsoup.connect(
                    "https://www.facebook.com/pages/Ice-cream/109301862430120?rf=102173023157556").get();
            // print every link on the page
            Elements links = doc.select("a[href]");
            for (Element link : links) {
                System.out.println(link.attr("href"));
            }
            // and the first one on its own
            Element firstLink = links.first();
            System.out.println(firstLink);
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}
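Since that likers link is injected by JavaScript, jsoup will never see it: jsoup only parses the static HTML the server returns. A minimal HtmlUnit sketch along the lines below lets the page's scripts run before reading the anchors. Note the assumptions: it reuses the page URL from above and presumes the content is reachable without logging in, which for Facebook is usually not the case, so treat it as a starting point rather than a working scraper.

import java.util.List;

import com.gargoylesoftware.htmlunit.BrowserVersion;
import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.HtmlAnchor;
import com.gargoylesoftware.htmlunit.html.HtmlPage;

public class HtmlUnitLinks {

    public static void main(String[] args) throws Exception {
        try (WebClient webClient = new WebClient(BrowserVersion.CHROME)) {
            webClient.getOptions().setJavaScriptEnabled(true);            // run the page's scripts
            webClient.getOptions().setThrowExceptionOnScriptError(false); // don't fail on noisy JS
            HtmlPage page = webClient.getPage(
                    "https://www.facebook.com/pages/Ice-cream/109301862430120?rf=102173023157556");
            // give Ajax/JavaScript some time to finish before reading the DOM
            webClient.waitForBackgroundJavaScript(10000);
            List<HtmlAnchor> anchors = page.getAnchors();
            for (HtmlAnchor anchor : anchors) {
                System.out.println(anchor.getHrefAttribute());
            }
        }
    }
}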
Related
I want to use Jsoup to extract the first link on the google search results. For example, I search for "apple" on google. The first link I see is www.apple.com/. How do I return the first link? I am currently able to extract all links using Jsoup:
new Thread(new Runnable() {
@Override
public void run() {
final StringBuilder stringBuilder = new StringBuilder();
try {
Document doc = Jsoup.connect(sharedURL).get();
String title = doc.title();
Elements links = doc.select("a[href]");
stringBuilder.append(title).append("\n");
for (Element link : links) {
stringBuilder.append("\n").append(" ").append(link.text()).append(" ").append(link.attr("href")).append("\n");
}
} catch (IOException e) {
stringBuilder.append("Error : ").append(e.getMessage()).append("\n");
}
runOnUiThread(new Runnable() {
@Override
public void run() {
// set text
textView.setText(stringBuilder.toString());
}
});
}
}).start();
Do you mean:
Element firstLink = doc.select("a[href]").first();
It works for me. If you meant something else, let us know. I checked the search results with the following, and it's a tough one to decipher, as there are so many types of results that come back: maps, news, ads, etc.
I tidied up the code a little with the use of Java lambdas:
public static void main(String[] args) {
new Thread(() -> {
final StringBuilder stringBuilder = new StringBuilder();
try {
String sharedUrl = "https://www.google.com/search?q=apple";
Document doc = Jsoup.connect(sharedUrl).get();
String title = doc.title();
Elements links = doc.select("a[href]");
Element firstLink = links.first(); // <<<<< NEW ADDITION
stringBuilder.append(title).append("\n");
for (Element link : links) {
stringBuilder.append("\n")
.append(" ")
.append(link.text())
.append(" ")
.append(link.attr("href"))
.append("\n");
}
} catch (IOException e) {
stringBuilder.append("Error : ").append(e.getMessage()).append("\n");
}
// replaced some of this for running/testing locally
SwingUtilities.invokeLater(() -> System.out.println(stringBuilder.toString()));
}).start();
}
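Note that firstLink above is usually a Google navigation link rather than the first organic result. A possible refinement, assuming Google's frequently changing and not guaranteed markup in which organic result links start with "/url?q=", would be a fragment like this inside the same try block:

// Assumption: organic result links on the plain HTML search page start with "/url?q=".
// Google changes this markup regularly, so treat this as a sketch only.
Element firstResult = null;
for (Element link : links) {
    if (link.attr("href").startsWith("/url?q=")) {
        firstResult = link;
        break;
    }
}
if (firstResult != null) {
    String target = firstResult.attr("href").substring("/url?q=".length());
    int amp = target.indexOf('&'); // drop Google's tracking parameters
    if (amp > 0) {
        target = target.substring(0, amp);
    }
    stringBuilder.append("\nFirst result: ").append(target).append("\n");
}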
I have recently started using JxBrowser to build a visualisation of edit distances (Levenshtein). I am using JxBrowser to integrate Java with HTML, CSS and JS.
My application starts with the MainFrame class by loading up my start screen, specifically hello.html.
public MainFrame() {
final Browser browser = new Browser();
BrowserView view = new BrowserView(browser);
JFrame frame = new JFrame("JxBrowser - EditDistance");
frame.setDefaultCloseOperation(WindowConstants.EXIT_ON_CLOSE);
frame.add(view, BorderLayout.CENTER);
frame.setSize(500, 400);
frame.setLocationRelativeTo(null);
frame.setVisible(true);
InputStream urlStream = getClass().getResourceAsStream("../web/hello.html");
String html = null;
try (BufferedReader urlReader = new BufferedReader(new InputStreamReader(urlStream))) {
StringBuilder builder = new StringBuilder();
String row;
while ((row = urlReader.readLine()) != null) {
builder.append(row);
}
html = builder.toString();
} catch (IOException e) {
throw new RuntimeException(e);
}
browser.loadHTML(html);
DOMDocument document = browser.getDocument();
final DOMElement documentElement = document.getDocumentElement();
if (documentElement != null) {
try{
DOMElement element = documentElement.findElement(By.id("button"));
element.addEventListener(DOMEventType.OnClick, new DOMEventListener() {
public void handleEvent(DOMEvent event) {
new UserInput();
}
}, false);
}catch(NullPointerException e){
System.out.println("NULLL on Entry");
}
}
}
I then call UserInput() and no NullPointerException is thrown. I then load my UserInputForm class using the same methodology as above, this time using UserInputForm.html as the view.
InputStream urlStream = getClass().getResourceAsStream("../web/UserInputForm.html");
String html = null;
try (BufferedReader urlReader = new BufferedReader(new InputStreamReader(urlStream))) {
StringBuilder builder = new StringBuilder();
String row;
while ((row = urlReader.readLine()) != null) {
builder.append(row);
}
html = builder.toString();
} catch (IOException e) {
throw new RuntimeException(e);
}
browser.loadHTML(html);
final DOMDocument document = browser.getDocument();
final DOMElement documentElement = document.getDocumentElement();
if (documentElement != null) {
DOMElement submitElement = documentElement.findElement(By.id("enterButton"));
if (submitElement != null) {
submitElement.addEventListener(DOMEventType.OnClick, new DOMEventListener() {
public void handleEvent(DOMEvent event) {
DOMElement source = document.findElement(By.id("sourceString"));
DOMElement target = document.findElement(By.id("targetString"));
}}, false);
}
else{
System.out.println("NULL on Sub Form");
}
}
}
The problem occurs mainly when the UserInputForm loads: findElement returns NULL for submitElement. Sometimes I also get NULL as the application starts. I feel like I am missing a fundamental step when loading these forms. Does anyone have any insight into making sure that document elements don't come back NULL? Is this an issue with my HTML loading technique?
The Browser.loadHTML() method is executed asynchronously as a request to load a specific HTML. Therefore, there is no guarantee that the web page is loaded completely when this method returns.
Before accessing the DOM document on the loaded web page, it is necessary to wait until the web page is loaded completely. If the web page is not loaded completely, the DOM document or some DOM elements may appear broken or missing.
The following sample code demonstrates how to load an HTML and wait until it's loaded completely:
// Blocks current thread execution and waits until the web page is loaded completely
Browser.invokeAndWaitFinishLoadingMainFrame(browser, new Callback<Browser>() {
@Override
public void invoke(Browser value) {
value.loadHTML("<html><body>Your HTML goes here</body></html>");
}
});
Note: use this approach for loading the web pages only.
The following article describes how to load a web page and wait until it is loaded completely: https://jxbrowser.support.teamdev.com/support/solutions/articles/9000013107-loading-waiting
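Applied to the MainFrame constructor above, that means wrapping the loadHTML(html) call so the DOM is queried only after loading has finished. A sketch using the same JxBrowser 6 API as the snippet above, with the rest of the constructor unchanged:

final String pageHtml = html; // effectively final copy for the anonymous callback
// Block until the main frame has finished loading, so getDocument() returns the complete DOM
Browser.invokeAndWaitFinishLoadingMainFrame(browser, new Callback<Browser>() {
    @Override
    public void invoke(Browser value) {
        value.loadHTML(pageHtml);
    }
});

DOMDocument document = browser.getDocument();
DOMElement documentElement = document.getDocumentElement();
if (documentElement != null) {
    DOMElement element = documentElement.findElement(By.id("button"));
    if (element != null) {
        element.addEventListener(DOMEventType.OnClick, new DOMEventListener() {
            public void handleEvent(DOMEvent event) {
                new UserInput();
            }
        }, false);
    }
}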
I am currently automating a website in which the URL is constantly changing (an SSO-like website), and parameters are passed in the query string. I want to capture each of the URLs the website goes through to reach the specific page. How can I achieve that using Selenium WebDriver?
I tried driver.getCurrentUrl() at regular intervals, but it is not reliable.
Is there any other workaround for this?
Many thanks!
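One possible workaround, sketched here under the assumption that ChromeDriver is in use: enable the DevTools performance log and read every requested URL, including intermediate redirects, from the Network.requestWillBeSent entries instead of polling getCurrentUrl(). Newer ChromeDriver versions expect the capability name "goog:loggingPrefs" rather than "loggingPrefs", so treat this as a sketch, not a drop-in solution.

import java.util.logging.Level;

import org.openqa.selenium.WebDriver;
import org.openqa.selenium.chrome.ChromeDriver;
import org.openqa.selenium.chrome.ChromeOptions;
import org.openqa.selenium.logging.LogEntry;
import org.openqa.selenium.logging.LogType;
import org.openqa.selenium.logging.LoggingPreferences;
import org.openqa.selenium.remote.CapabilityType;

public class CaptureNavigationUrls {

    public static void main(String[] args) {
        LoggingPreferences logPrefs = new LoggingPreferences();
        logPrefs.enable(LogType.PERFORMANCE, Level.ALL);

        ChromeOptions options = new ChromeOptions();
        options.setCapability(CapabilityType.LOGGING_PREFS, logPrefs); // "goog:loggingPrefs" on newer drivers

        WebDriver driver = new ChromeDriver(options);
        try {
            driver.get("https://example.com/sso-protected-page"); // placeholder URL
            // Every request the browser made, including redirects, shows up as a
            // Network.requestWillBeSent entry in the performance log.
            for (LogEntry entry : driver.manage().logs().get(LogType.PERFORMANCE)) {
                String message = entry.getMessage(); // raw DevTools JSON
                if (message.contains("\"Network.requestWillBeSent\"")) {
                    System.out.println(message);
                }
            }
        } finally {
            driver.quit();
        }
    }
}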
Try to run the following:
driver.get("http://www.telegraph.co.uk/");
List<WebElement> links = driver.findElements(By.tagName("a"));
List<String> externalUrls = new ArrayList();
List<String> internalUrls = new ArrayList();
System.out.println(links.size());
for (int i = 1; i <= links.size(); i = i + 1) {
String url = links.get(i).getAttribute("href");
System.out.println("Name:"+links.get(i).getText());
System.out.println("url"+url);
System.out.println("----");
if (url.startsWith("http://www.telegraph.co.uk/")) {
if(!internalUrls.contains(url))
internalUrls.add(links.get(i).getAttribute("href"));
} else {
if(!externalUrls.contains(url))
externalUrls.add(links.get(i).getAttribute("href"));
}
}
If you want to gather all the links for your website, then I would do something like:
public class GetAllLinksFromThePage {
    static List<String> externalUrls = new ArrayList<>();
    static List<String> internalUrls = new ArrayList<>();

    public static void main(String[] args) {
        MyChromeDriver myChromeDriver = new MyChromeDriver();
        WebDriver driver = myChromeDriver.initChromeDriver();
        checkForLinks(driver, "http://www.telegraph.co.uk/");
        System.out.println("finish");
    }

    public static void checkForLinks(WebDriver driver, String page) {
        driver.get(page);
        System.out.println("PAGE->" + page);
        // Collect the hrefs first; recursing (and navigating away) inside the loop
        // would leave the WebElements stale.
        List<String> urlsOnPage = new ArrayList<>();
        for (WebElement we : driver.findElements(By.tagName("a"))) {
            String url = we.getAttribute("href");
            if (url != null && !url.isEmpty()) {
                urlsOnPage.add(url);
            }
        }
        for (String url : urlsOnPage) {
            if (url.startsWith("http://www.telegraph.co.uk/")) { // my main page
                if (!internalUrls.contains(url)) {
                    internalUrls.add(url);
                    System.out.println(url + " has been added to internalUrls");
                    checkForLinks(driver, url);
                }
            } else if (!externalUrls.contains(url)) {
                externalUrls.add(url);
                System.out.println(url + " has been added to externalUrls");
            }
        }
    }
}
Hope that helped!
I want to get the title from this website: http://feeds.foxnews.com/foxnews/latest
like this example:
<title><![CDATA[SUCCESSFUL INTERCEPT Pentagon confirms it shot down ICBM-type target]]></title>
and it will show text like this:
"SUCCESSFUL INTERCEPT Pentagon confirms it shot down ICBM-type target
US conducts successful missile intercept test, Pentagon says"
Here's my code. I have used the Jaunt library.
I don't know why it shows only the text "foxnew.com".
import com.jaunt.JauntException;
import com.jaunt.UserAgent;
public class p8_1
{
public static void main(String[] args)
{
try
{
UserAgent userAgent = new UserAgent();
userAgent.visit("http://feeds.foxnews.com/foxnews/latest");
String title = userAgent.doc.findFirst(
        "<title><![CDATA[SUCCESSFUL INTERCEPT Pentagon confirms it shot down ICBM-type target]]></title>").getText();
System.out.println("\n " + title);
} catch (JauntException e)
{
System.err.println(e);
}
}
}
Search for element types, not values.
Try the following to get the title text of each item in the feed:
public static void main(String[] args) {
try {
UserAgent userAgent = new UserAgent();
userAgent.visit("http://feeds.foxnews.com/foxnews/latest");
Elements items = userAgent.doc.findEach("<item>");
Elements titles = items.findEach("<title>");
for (Element title : titles) {
String titleText = title.getComment(0).getText();
System.out.println(titleText);
}
} catch (JauntException e) {
System.err.println(e);
}
}
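If Jaunt is not a hard requirement, the same feed can also be read with jsoup's XML parser (jsoup is already used elsewhere on this page). A sketch, assuming the feed stays plain RSS with CDATA-wrapped titles:

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.parser.Parser;

public class FeedTitles {

    public static void main(String[] args) throws Exception {
        // Parse the feed as XML so <item> and <title> keep their structure
        Document feed = Jsoup.connect("http://feeds.foxnews.com/foxnews/latest")
                .parser(Parser.xmlParser())
                .get();
        for (Element title : feed.select("item > title")) {
            // text() returns the CDATA content of the title element
            System.out.println(title.text());
        }
    }
}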
I am trying to get URLs and HTML elements from a website. I am able to get the URLs and HTML, but when one URL contains multiple elements (like multiple input elements or multiple textarea elements) I only get the last element. The code is below.
GetURLsAndElemens.java
public static void main(String[] args) throws FileNotFoundException, IOException, ParseException {
    Properties properties = new Properties();
    properties.load(new FileInputStream(
            "src//io//servicely//ci//plugin//SeleniumResources.properties"));
    Map<String, String> urls = gettingUrls(properties.getProperty("MAIN_URL"));
    GettingHTMLElements.getHTMLElements(urls);
    // System.out.println(urls.size());
    // System.out.println(urls);
}
public static Map<String, String> gettingUrls(String mainURL) {
Document doc = null;
Map<String, String> urlsList = new HashMap<String, String>();
try {
System.out.println("Main URL " + mainURL);
// need http protocol
doc = Jsoup.connect(mainURL).get();
GettingHTMLElements.getInputElements(doc, mainURL);
// get page title
// String title = doc.title();
// System.out.println("title : " + title);
// get all links
Elements links = doc.select("a[href]");
for (Element link : links) {
// urlsList.clear();
// get the value from href attribute and adding to list
if (link.attr("href").contains("http")) {
urlsList.put(link.attr("href"), link.text());
} else {
urlsList.put(mainURL + link.attr("href"), link.text());
}
// System.out.println(urlsList);
}
} catch (IOException e) {
e.printStackTrace();
}
// System.out.println("Total urls are "+urlsList.size());
// System.out.println(urlsList);
return urlsList;
}
GettingHtmlElements.java
static Map<String, HtmlElements> urlList = new HashMap<String, HtmlElements>();
public static void getHTMLElements(Map<String, String> urls)
throws IOException {
getElements(urls);
}
public static void getElements(Map<String, String> urls) throws IOException {
for (Map.Entry<String, String> entry1 : urls.entrySet()) {
try {
System.out.println(entry1.getKey());
Document doc = Jsoup.connect(entry1.getKey()).get();
getInputElements(doc, entry1.getKey());
}
catch (Exception e) {
e.printStackTrace();
}
}
Map<String,HtmlElements> list = urlList;
for(Map.Entry<String,HtmlElements> entry1:list.entrySet())
{
HtmlElements ele = entry1.getValue();
System.out.println("url is "+entry1.getKey());
System.out.println("input name "+ele.getInput_name());
}
}
public static HtmlElements getInputElements(Document doc, String entry1) {
HtmlElements htmlElements = new HtmlElements();
Elements inputElements2 = doc.getElementsByTag("input");
Elements textAreaElements2 = doc.getElementsByTag("textarea");
Elements formElements3 = doc.getElementsByTag("form");
for (Element inputElement : inputElements2) {
String key = inputElement.attr("name");
htmlElements.setInput_name(key);
String key1 = inputElement.attr("type");
htmlElements.setInput_type(key1);
String key2 = inputElement.attr("class");
htmlElements.setInput_class(key2);
}
for (Element inputElement : textAreaElements2) {
String key = inputElement.attr("id");
htmlElements.setTextarea_id(key);
String key1 = inputElement.attr("name");
htmlElements.setTextarea_name(key1);
}
for (Element inputElement : formElements3) {
String key = inputElement.attr("method");
htmlElements.setForm_method(key);
String key1 = inputElement.attr("action");
htmlElements.setForm_action(key1);
}
return urlList.put(entry1, htmlElements);
}
I want to capture each element as a bean. For every URL I am getting the URLs and HTML elements, but when a URL contains multiple elements I only get the last element.
You use a class HtmlElements which is not part of jsoup as far as I know. I don't know its inner workings, but I assume it is some sort of list of HTML nodes or something similar.
However, you seem to use this class like this:
HtmlElements htmlElements = new HtmlElements();
htmlElements.setInput_name(key);
This indicates that only ONE html element is stored in the htmlElements variable. This would explain why you get only the last element stored - you simply overwrite the one instance all the time.
It is not really clear, since I don't know the HtmlElements class. Maybe something like this works, assuming that HtmlElement represents a single element and that HtmlElements has an add method:
HtmlElements htmlElements = new HtmlElements();
...
for (Element inputElement : inputElements2) {
HtmlElement e = new HtmlElement();
htmlElements.add(e);
String key = inputElement.attr("name");
e.setInput_name(key);
}
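A minimal sketch of the same idea in plain jsoup terms, using a hypothetical InputField bean (not part of the original code), so that every <input> on a page is kept instead of only the last one; the per-URL map then becomes a Map<String, List<InputField>> rather than a Map<String, HtmlElements>:

import java.util.ArrayList;
import java.util.List;

import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

// Hypothetical bean: one instance per <input> element
class InputField {
    String name;
    String type;
    String cssClass;
}

class InputCollector {

    // Collect one InputField per <input> element instead of overwriting
    // the fields of a single shared instance.
    static List<InputField> collectInputs(Document doc) {
        List<InputField> inputs = new ArrayList<>();
        for (Element inputElement : doc.getElementsByTag("input")) {
            InputField field = new InputField();
            field.name = inputElement.attr("name");
            field.type = inputElement.attr("type");
            field.cssClass = inputElement.attr("class");
            inputs.add(field);
        }
        return inputs;
    }
}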