Jsoup web scraping from a support portal - Java

I'm new to jsoup, and I'm trying to scrape this portal:
https://supportforums.cisco.com/t5/lan-switching-and-routing/bd-p/6016-discussions-lan-switching-routing
From the topic list on this portal I want to pull only the solved problems, i.e. the topics that are marked with the special "solved" icon.
I created a connection to the page like this, and checked the page title to be sure I'm in the right place:
Document document = Jsoup.connect("https://supportforums.cisco.com/t5/lan-switching-and-routing/bd-p/6016-discussions-lan-switching-routing").get();
String title = document.title();
System.out.println("Title: " + title);
After that I started looking at the HTML side, and I believe each topic must be a list element inside div.messageList.MessageList.lia-component-forums-widget-message-list.lia-forum-message-list.lia-component-message-list, but I'm not sure about it.
Then I figured out that each topic contains a unique id, and that's where I'm stuck.
Could you please help me retrieve all of these topic elements, and filter the solved topics among them? To begin with, I just want to print the titles of these topics to the console in Java.
And sorry if I asked a silly question.

Solved topics are represented by rows with the class lia-list-row-thread-solved. The main thread list is in the element with the id grid.
Document doc = Jsoup.connect(
        "https://supportforums.cisco.com/t5/lan-switching-and-routing/bd-p/6016-discussions-lan-switching-routing")
        .get();
// Only the rows flagged as solved inside the main thread list (#grid).
for (Element e : doc.select("#grid tr.lia-list-row-thread-solved")) {
    System.out.println(e.text());
}
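If you only want the topic titles rather than the whole row text, you can drill into the subject link inside each solved row. A minimal sketch, assuming the title sits in the first anchor of the row's subject cell (the td.lia-list-row-subject selector is an assumption about this Lithium-based forum's markup, so verify it in your browser's developer tools):

for (Element row : doc.select("#grid tr.lia-list-row-thread-solved")) {
    // Assumed markup: the first link in the subject cell holds the topic title.
    Element titleLink = row.select("td.lia-list-row-subject a").first();
    if (titleLink != null) {
        System.out.println(titleLink.text() + " -> " + titleLink.absUrl("href"));
    }
}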

Related

Document.select("a[href]") not getting all the href

I am using jsoup to fetch documents from a website.
Below is my code:
String webPageUrl = "https://mwcc.ms.gov/#/electronicDataInterchange";
Document doc = Jsoup.connect(webPageUrl).get();
Elements links = doc.getElementsByAttribute("a[href]");
The line below is not working. It is supposed to return elements, but doesn't:
doc.getElementsByAttribute("a[href]")
Can someone please point out the mistake in my code?
That page appears to be an Angular application, which means it loads some (probably most or all) of its content via JavaScript.
The fact that the URL contains the fragment separator # is already a strong indicator of that, because when you make an HTTP request, everything after that separator is cut off (i.e. not sent to the server), so the actual request will simply be for https://mwcc.ms.gov/.
As far as I know, jsoup does not support running JavaScript, so you might need to look into a more involved scraping tool (possibly one running a full browser engine).
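(As an aside, getElementsByAttribute expects a bare attribute name such as "href"; the CSS-style query a[href] belongs in doc.select("a[href]"). Fixing that alone won't help here, though, because the anchors only exist after the JavaScript has run.)
One common approach is to let a real browser render the page and then hand the resulting HTML to jsoup for parsing. A minimal sketch using Selenium WebDriver (assumes the selenium-java dependency and a chromedriver binary on the PATH; the fixed sleep is only for illustration, a WebDriverWait would be more robust):

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.chrome.ChromeDriver;

public class RenderedPageScrape {
    public static void main(String[] args) throws InterruptedException {
        WebDriver driver = new ChromeDriver(); // launches a real Chrome instance
        try {
            driver.get("https://mwcc.ms.gov/#/electronicDataInterchange");
            Thread.sleep(3000); // crude wait for Angular to render
            // Parse the rendered HTML with jsoup, keeping the base URI for absUrl().
            Document doc = Jsoup.parse(driver.getPageSource(), driver.getCurrentUrl());
            for (Element link : doc.select("a[href]")) {
                System.out.println(link.absUrl("href"));
            }
        } finally {
            driver.quit();
        }
    }
}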

How to fetch Facebook comment replies using the Graph API?

I am using the Facebook Graph API to get posts and comments, and it's all working fine. Now I want to use the API to get the replies to a particular comment. How can I do that? I'm using the following code to get the comments:
cmmntObj = facebookClient.fetchObject(postID + "/comments", JsonObject.class,
        Parameter.with("limit", limitOfRecords),
        Parameter.with(Since_Until[k], date_SinceLast[k].toString()),
        Parameter.with("Date_Format", "U"));
The code above works well and fetches the comments. I would appreciate it if somebody could help me get the replies as well.
I parsed the comments JSON and built another query around it, but it doesn't work.
This is the query to fetch the replies:
String getCmmntID = cmmntObj.getJsonArray("data").getJsonObject(0).getString("id");
cmmntReplies = facebookClient.fetchObject(
        postID + "/comments?filter=stream&fields=parent.fields(" + getCmmntID + ")",
        JsonObject.class, Parameter.with("limit", limitOfRecords),
        Parameter.with(Since_Until[k], date_SinceLast[k].toString()),
        Parameter.with("Date_Format", "U"));
How do I get the replies to these?
It is possible to request the comments edge for your comments, and then the comments edge of those comments, ad infinitum. It becomes a matter of deciding how many levels of comments you want to code into your application. I'm not sure whether there is a finite number of comment levels Facebook allows; you'll have to experiment with that. However, here is what you would want to add:
Parameter.with("fields", "message,comments{comments,message}")
That will get you three levels of comments (the main comments and two levels of replies).
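Plugged into the restfb-style call from the question, that would look roughly like this (a sketch; postID, limitOfRecords, and facebookClient come from the surrounding code):

JsonObject postWithReplies = facebookClient.fetchObject(postID, JsonObject.class,
        // Nested fields request: the post's message, its comments, and their replies.
        Parameter.with("fields", "message,comments{comments,message}"),
        Parameter.with("limit", limitOfRecords));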

URL working in Google Chrome inaccessible by Java w/Jsoup?

I'm having quite a confusing problem. I have literally only been doing networking for a day, so please forgive me if I am making a dumb error. My issue is that I cannot programmatically access a URL which I can access by copy-pasting it into Chrome.
I am using a library called jsoup (http://jsoup.org/apidocs/) which parses text out of raw HTML from a website. My goal is to take a base URL, attach a string to it, and get a webpage from it. I am using this code (edit: for those who asked for more code, I know this is still sparse, but this is the only code preceding the error):
String url = "https://www.google.com/search?q=definition+of+";
url += search; //search is the passed in string
Document doc = Jsoup.connect(url).get(); //url is the String in question
to get the webpage. My ultimate goal is to use this method to get the text of the box at the top of a Google search when you search for the definition of a word, i.e. the box at the top here: https://www.google.com/search?q=definition+of+apple
However, I run into an issue when I attempt to use the above link as my URL: I get an org.jsoup.HttpStatusException, so I think it is a networking problem. What causes this URL to work when typed into Chrome, but not in Java? (I would also not be averse to different ways of getting the information in that box, since my current method feels a bit roundabout.)
The full error message (edited in):
Exception in thread "main" org.jsoup.HttpStatusException: HTTP error fetching URL. Status=403, URL=https://www.google.com/search?q=definition+of+apple
    at org.jsoup.helper.HttpConnection$Response.execute(HttpConnection.java:435)
    at org.jsoup.helper.HttpConnection$Response.execute(HttpConnection.java:410)
    at org.jsoup.helper.HttpConnection.execute(HttpConnection.java:164)
    at org.jsoup.helper.HttpConnection.get(HttpConnection.java:153)
    at test.Test.parseDef(Test.java:68)
    at test.Test.main(Test.java:112)
To whomever answers, thank you for spending your time to help a networking newbie!
Most likely, Google is accurately identifying your program as a "robot" and acting accordingly. Google encourages robots to use the Google Custom Search API and discourages them from using the human-oriented search interface.
In fact, all web spiders are supposed to check robots.txt, right? Here is Google's: http://www.google.com/robots.txt. Note that /search is disallowed.
Please see this question for further information; it's basically the Python version of your question: Why does Google Search return HTTP Error 403?
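For completeness, the Custom Search JSON API is queried over plain HTTPS, so no scraping is involved. A minimal sketch using Java 11's HttpClient (YOUR_API_KEY and YOUR_CX are placeholders for the API key and search-engine ID you create in the Google Cloud console):

import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class CustomSearchExample {
    public static void main(String[] args) throws IOException, InterruptedException {
        String url = "https://www.googleapis.com/customsearch/v1"
                + "?key=YOUR_API_KEY" // placeholder: API key from the Cloud console
                + "&cx=YOUR_CX"       // placeholder: your custom search engine ID
                + "&q=definition+of+apple";
        HttpClient client = HttpClient.newHttpClient();
        HttpRequest request = HttpRequest.newBuilder(URI.create(url)).build();
        HttpResponse<String> response =
                client.send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.body()); // JSON containing the search results
    }
}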
If you use jsoup, you have to replace spaces with %20, not with +.
Try this URL:
https://www.google.com/search?q=definition%20of%20apple
String url = "https://www.google.com/search?q=definition%20of%20";
url += search; //search is the passed in string
Document doc = Jsoup.connect(url).get();
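More generally, the search term should be percent-encoded rather than patched by hand, since it may contain characters other than spaces. A small sketch using java.net.URLEncoder (the Charset overload needs Java 10+; URLEncoder emits + for spaces, so we swap those for the %20 form recommended above):

import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

String encoded = URLEncoder.encode(search, StandardCharsets.UTF_8)
        .replace("+", "%20"); // prefer %20 over + for the space character
String url = "https://www.google.com/search?q=definition%20of%20" + encoded;
Document doc = Jsoup.connect(url).get();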
You can also send the request with browser-like request properties, such as a user agent, which sometimes helps when a server rejects the default Java client:
public static void main(String[] args) throws IOException {
    String link = "https://www.google.com/search?q=definition%20of%20apple";
    Document doc = Jsoup.connect(link)
            .userAgent("Mozilla")    // present a browser-like user agent
            .cookie("auth", "token") // placeholder cookie from the jsoup docs example
            .timeout(3000)           // milliseconds
            .get();                  // get(), since the search page expects GET
    System.out.println(doc.title());
}

Is it possible for a column to become a hyperlink using the Google Charts API?

I am using Visualr (http://googlevisualr.herokuapp.com/) with Rails and having a good amount of success creating dynamic charts. However, I am wondering if it's possible to allow the user to click on a column in a 'column chart' and be linked to a page? I am happy to know the JavaScript version if you aren't familiar with Visualr.
Thanks!
It is now available!
There has recently been an update on this issue, so I want to update this SO Q&A.
Resources:
Google Visualr Github Pull Request #39
Google Visualr Github Issue #36
Code example
xxx_controller.rb
@table = GoogleVisualr::Interactive::ColumnChart.new(g, options_g)
@table.add_listener("select", "function(e) {
  EventHandler(e, chart, data_table)
}")
And then in a JS file e.g. app/assets/javascripts/application.js:
function EventHandler(e, chart, data) {
  var selection = chart.getSelection();
  if (selection.length > 0) {
    var row = selection[0].row;
    var department = data.getValue(row, 0);
    alert(department + " | " + row);
    // To make the column act as a link, you could navigate here instead,
    // e.g. window.location.href = "/departments/" + department; (hypothetical route)
  }
}
Google Charts (whether you access them directly or via a wrapper gem like Visualr) are simple images, so the straight answer is "No", at least not without doing some work of your own. In order to achieve this you would need to place your own transparent clickable links (or divs or whatever) over the image, in the right place, to correspond to the columns that google generate in the image.
I'd imagine this would be tricky and error-prone; it might actually be easier to just generate the columns yourself in HTML and CSS, using the data you would previously have sent to Google to set the height (in %) of the columns. Then each column would be a separate HTML element and could link to whatever you want.
So, more control = more work. As usual :)

Gathering data from inconsistent HTML pages - Jsoup

I'm trying to get a lot of data from multiple pages, but it's not always consistent. Here is an example of the HTML I am working with:
Example HTML
I need to get something like Team | Team | Result, all in different variables or lists.
I just need some help on where to start, because the main table I'm working with isn't the same on every page.
Here's my Java so far:
try {
    Document team_page = Jsoup.connect("http://www.soccerstats.com/team.asp?league=" + league + "&teamid=" + teamNumber).get();
    Element home_team = team_page.select("[class=homeTitle]").first();
    String teamName = home_team.text();
    System.out.println(teamName + "'s Latest Results: ");
    Elements main_page = team_page.select("[class=stat]");
    System.out.println(main_page);
} catch (IOException e) {
    System.out.println("unable to parse content");
}
I am getting the league and teamid from different methods of my program.
Thanks!
Yes, this is one of the problems with webpage scraping.
You have to figure out one or more heuristics that will extract the information you need across all of the pages you need to access. There's no magic bullet, just hard work. (And you'll have to do it all over again if the site changes its page layout.)
A better idea is to request the information as XML or JSON using the site or sites' RESTful APIs ... assuming they exist and are available to you.
(And if you continue with the web-scraping approach, check the site's Terms of Service to make sure that your activity is acceptable.)
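To illustrate the heuristic approach with the soccerstats page from the question, here is a minimal defensive sketch: every selector result is null-checked, and a fallback selector is tried before giving up (the h2.homeTitle fallback is an assumption for illustration, not something verified against the site):

import java.io.IOException;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class TeamScraper {
    public static void scrapeTeam(String league, int teamNumber) {
        try {
            Document page = Jsoup.connect("http://www.soccerstats.com/team.asp?league=" + league + "&teamid=" + teamNumber).get();
            Element title = page.select(".homeTitle").first();
            if (title == null) {
                title = page.select("h2.homeTitle").first(); // assumed fallback variant
            }
            if (title == null) {
                System.out.println("page layout not recognized, skipping");
                return;
            }
            System.out.println(title.text() + "'s Latest Results:");
            for (Element stat : page.select(".stat")) {
                System.out.println(stat.text());
            }
        } catch (IOException e) {
            System.out.println("unable to fetch content: " + e.getMessage());
        }
    }
}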
