I'm new in using jSoup and now I'm trying to make a web scraping form this portal.
https://supportforums.cisco.com/t5/lan-switching-and-routing/bd-p/6016-discussions-lan-switching-routing
On this portal, I want to receive information from this list, which will show solved problems, I mean the topics which have the special image of solving like this.
Solved task must look in such way
I created a connection to this page in such way and checked the title of this page to be sure that I'm in the right place.
document = Jsoup.connect("https://supportforums.cisco.com/t5/lan-switching-and-routing/bd-p/6016-discussions-lan-switching-routing").get();
String title = document.title();
print("Title: " + title);
After that I began to look into HTML side and i understood that this topics must be element in list inside div class messageList.MessageList.lia-component-forums-widget-message-list.lia-forum-message-list.lia-component-message-list but I'm not sure about it.
Then I figured out that each topic contain unique id and I'm stuck on it.
Could you please help me how to receive all these elements, topics? And how to filter solved topics among all of them? In the beginning, I just want to output the titles of these topics using Console in Java.
And sorry if I asked a silly question.
The topics that are solved are represented by row with class lia-list-row-thread-solved. The main thread list is in element with id grid.
Document doc = Jsoup.connect(
"https://supportforums.cisco.com/t5/lan-switching-and-routing/bd-p/6016-discussions-lan-switching-routing")
.get();
for (Element e : doc.select("#grid tr.lia-list-row-thread-solved")) {
String text = e.text();
System.out.println(text);
}
I am very new to coding in Java/Android Studio. I have everything setup that I have been able to figure out thus far. I have a button, and I need to put code inside of the button click event that will fetch information from a website, convert it to a string and display it. I figured I would have to use the html source code in order to do this, so I have installed Jsoup html parser. All of the help with Jsoup I have found only leads me up to getting the HTML into a "Document". And I am not sure if that is the best way to accomplish what I need. Can anyone tell me what code to use to fetch the html code from the website, and then do a search through the html looking for a specific match, and convert that match to a string. Or can anyone tell me if there is a better way to do this. I only need to grab one piece of information and display it.
Here is the piece of html code that contains the value I want:
writeBidRow('Wheat',-60,false,false,false,0.5,'01/15/2015','02/26/2015','All',' ',' ',60,'even','c=2246&l=3519&d=G15',quotes['KEH15'], 0-0);
I need to grab and display whatever value represents the quotes['KEH15'], in that html code.
Thank you in advance for your help.
Keith
Grabbing raw HTML is an extremely tedious way to access information from the web, bad practice, and difficult to maintain in the case that wherever you are fetching the info from changes their HTML.
I don't know your specific situation and what the data is that you are fetching, but if there is another way for you to fetch that data via an API, use that instead.
Since you say you are pretty new to Android and Java, let me explain something I wish had been explained to me very early on (although I am mostly self taught).
The way people access information across the Internet is traditionally through HTML and JavaScript (which is interpreted by your browser like Chrome or Firefox to look pretty), which are transferred over the internet using the protocol called HTTP. This is a great way for humans to communicate with computers that are far away, and the average person probably doesn't realize that there is more to the internet than this--your browser and the websites you can go to.
Although there are multiple methods, for the purpose of what I think you're looking for, applications communicate over the internet a slightly different way:
When an android application asks a server for some information, rather than returning HTML and JavaScript which is intended for human consumption, the server will (traditionally) return what's called JSON (or sometimes XML, which is very similar). JSON is a very simple way to get information about an object, and put it into a form that is readable easily by both humans (developers) and computers, and can be transmitted over the internet easily. For example, let's say you ask a server for some kind of "Video" object for an app that plays video, it may give you something like this:
{
"name": "Gangnam Style",
"metadata": {
"url": "https://www.youtube.com/watch?v=9bZkp7q19f0",
"views": 2000000000,
"ageRestricted": false,
"likes": 43434
"dislikes":124
},
"comments": [
{
"username": "John",
"comment": "10/10 would watch again"
},
{
"username": "Jane",
"number": "12/10 with rice"
}
]
}
That is very readable by us humans, but also by computers! We know the name is "Gangnam Style", the link of the video, etc.
A super helpful way to interact with JSON in Java and Android is Google's GSON library, which lets you cast a Java object as JSON or parse a JSON object to a Java object.
To get this information in the first place, you have to make a network call to an API, Application Programming Interface. Just a fancy term for communication between a server and a client. One very cool, free, and easy to understand API that I will use for this example is the OMDB API, which just spits back information about movies from IMDB. So how do you talk to the API? Well luckily they've got some nice documentation, which says that to get information on a movie we need to use some parameters in the url, like perhaps
http://www.omdbapi.com/?t=Interstellar
They want a title with the parameter "t". We could put a year, or return type, but this should be good to understand the basics. If you go to that URL in your browser, it spits back lots of information about Interstellar in JSON form. That stuff we were talking about! So how would you get this information from your Android application?
Well, you could use Android's built in HttpUrlConnection classes and research for a few hours on why your calls aren't working. But doesn't essentially every app now use networking? Why reinvent the wheel when virtually every valuable app out there has probably done this work before? Perhaps we can find some code online to do this work for us.
Or even better, a library! In particular, an open source library developed by Square, retrofit. There are multiple libraries like it (go ahead and research that out, it's best to find the best fit for your project), but the idea is they do all the hard work for you like low level network programming. Following their guides, you can reduce a lot of code work into just a few lines. So for our OMDB API example, we can set up our network calls like this:
//OMDB API
public ApiClient{
//an instance of this client object
private static OmdbApiInterface sOmdbApiInterface;
//if the omdbApiInterface object has been instantiated, return it, but if not, build it then return it.
public static OmdbApiInterface getOmdbApiClient() {
if (sOmdbApiInterface == null) {
RestAdapter restAdapter = new RestAdapter.Builder()
.setEndpoint("http://www.omdbapi.com")
.build();
sOmdbApiInterface = restAdapter.create(OmdbApiInterface.class);
}
return sOmdbApiInterface;
}
public interface OmdbApiInterface {
#GET("/")
void getInfo(#Query("t") String title, Callback<JsonObject> callback);
}
}
After you have researched and understand what's going on up there using their documentation, we can now use this class that we have set up anywhere in your application to call the API:
//you could get a user input string and pass it in as movieName
ApiClient.getOmdbApiClient().getInfo(movieName, new Callback<List<MovieInfo>>() {
//the nice thing here is that RetroFit deals with the JSON for you, so you can just get information right here from the JSON object
#Override
public void success(JsonObject movies, Response response) {
Log.i("TAG","Movie name is " + movies.getString("Title");
}
#Override
public void failure(RetrofitError error) {
Log.e("TAG", error.getMessage());
}
});
Now you've made an API call to get info from across the web! Congratulations! Now do what you want with the data. In this case we used Omdb but you can use anything that has this method of communication. For your purposes, I don't know exactly what data you are trying to get, but if it's possible, try to find a public API or something where you can get it using a method similar to this.
Let me know if you've got any questions.
Cheers!
As #caleb-allen said, if an API is available to you, it's better to use that.
However, I'm assuming that the web page is all you have to work with.
There are many libraries that can be used on Android to get the content of a URL.
Choices range from using the bare-bones HTTPUrlConnection to slightly higher-level HTTPClient to using robust libraries like Retrofit. I personally recommend Retrofit. Whatever you do, make sure that your HTTP access is asynchronous, and not done on the UI thread. Retrofit will handle this for you by default.
For parsing the results, I've had good results in the past using the open-source HTMLCleaner library - see http://htmlcleaner.sourceforge.net
Similar to JSoup, it takes a possibly-badly-formed HTML document and creates a valid XML document from it.
Once you have a valid XML document, you can use HTMLCleaner's implementation of the XML DOM to parse the document to find what you need.
Here, for example, is a method that I use to parse the names of 'projects' from a <table> element on a web page where projects are links within the table:
private List<Project> parseProjects(String html) throws Exception {
List<Project> parsedProjects = new ArrayList<Project>();
HtmlCleaner pageParser = new HtmlCleaner();
TagNode node = pageParser.clean(html);
String xpath = "//table[#class='listtable']".toString();
Object[] tables = node.evaluateXPath(xpath);
TagNode tableNode;
if(tables.length > 1) {
tableNode = (TagNode) tables[0];
} else {
throw new Exception("projects table not found in html");
}
TagNode[] projectLinks = tableNode.getElementsByName("a", true);
for(int i = 0; i < projectLinks.length; i++) {
TagNode link = projectLinks[i];
String projectName = link.getText().toString();
String href = link.getAttributeByName("href");
String projectIdString = href.split("=")[1];
int projectId = Integer.parseInt(projectIdString);
Project project = new Project(projectId, projectName);
parsedProjects.add(project);
}
return parsedProjects;
}
If you have permission to edit the webpage to add hyper link to specified line of that page you can use this way
First add code for head of line that you want to go there in your page
head your text if wanna
Then in your apk app on control click code enter
This.mwebview.loadurl("https:#######.com.html#target")
in left side of # enter your address of webpage and then #target in this example that your id is target.
Excuse me if my english lang. isn't good
I am currently working on getting the source code of a specific web page in a file using Java.
The web page is: http://www.studenti.ict.uniba.it/esse3/ListaAppelliOfferta.do
I wrote some code to do that:
try{
URL url= new URL("http://www.studenti.ict.uniba.it/esse3/ListaAppelliOfferta.do");
URLConnection urlConn = url.openConnection();
BufferedReader dis= new BufferedReader(new InputStreamReader((url.openStream())));
String s="";
while (( s=dis.readLine())!= null) {
System.out.println(s);
}
dis.close();
}catch (MalformedURLException mue) {}
catch (IOException ioe) {}
}
This works fine.
The problem is I want to "simulate" a user selecting "[1020] Dipartimento di Informatica" in Facoltà and "[1102] Informatica e Tecnologie per la produzione del Software" in Corso di Studio and then the user clicking on "Avvia Ricerca" which starts a search and shows a table with the results.
The goal is obtaining the source code of the web page containing also the information in the table I need.
I noticed that if I manually do those selections and then click "Avvia Ricerca" to start the search, the web page is loaded again showing the data in the table I need, but the URL does not change.
So even if the page is now showing the data I need, when using my code I can only get the source code of the page as it is BEFORE doing the selections and doing the search.
I've done similar things with HTMLUnit (http://htmlunit.sourceforge.net) before, works quite well for simulating anything in regards to websites, and for scraping.
I would suggest to open the page in a web debugger (Ctrl-Shift-I) and see what URLs are fetched when you make your selections, and then program those fetches in your Java app.
The downside of this approach is if the page implementation changes your code will break.
Another alternative is to run the page Javascript in a browser sandbox. That is also error-prone and can even be unsafe.
Normaly, you just could send this information by GET/POST (for example with url?department=xy), but in your case its quite complicated, as the site uses JSF and generates an ID (and the information, which department is chosen, is written there, for example "http://www.studenti.ict.uniba.it/esse3/ListaAppelliOfferta.do;jsessionid=365EB9843B2872E73067693A6095BA35").
Depending on what you want to do you could use Selenium (http://docs.seleniumhq.org/). This simulates the Browser, and you could get your elemeents (for example department by name: fac_id), and set the value (for example with selectByValue after you created a select-element, documented here: http://selenium.googlecode.com/git/docs/api/java/org/openqa/selenium/support/ui/Select.html).
If you need to do it without using Selenium (for example cause you need to do it only on command line and without using the browser itself) you can try do deactivate cookies, then the parameters should be sent in GET- or POST-Parameters, and you could inspect this e.g. with Firebug. But thats the harder soluation, Selenium would be much easier to use.
I am using Visualr http://googlevisualr.herokuapp.com/ with Rails and having a good amount of success creating dynamic charts. However, I am wondering if it's possible to allow the user to click on the column in a 'column chart' and be linked to a page? I am happy to know the java version if you aren't familiar with visualr.
Thanks!
It now is available!
There has recently been an update on this issue. Therefore I want to update this SO Q&A.
Resources:
Google Visualr Github Pull Request #39
Google Visualr Github Issue #36
Code example
xxx_controller.rb
#table = GoogleVisualr::Interactive::ColumnChart.new(g, options_g)
#table.add_listener("select", "function(e) {
EventHandler(e, chart, data_table)
}")
And then in a JS file e.g. app/assets/javascripts/application.js:
function EventHandler(e, chart, data) {
var selection = chart.getSelection();
if (selection.length > 0) {
var row = selection[0].row;
var department = data.getValue(row, 0);
alert(department + " | " + row)
}
}
Google Charts (whether you access them directly or via a wrapper gem like Visualr) are simple images, so the straight answer is "No", at least not without doing some work of your own. In order to achieve this you would need to place your own transparent clickable links (or divs or whatever) over the image, in the right place, to correspond to the columns that google generate in the image.
I'd imagine this would be tricky and error prone - it might actually be easier for you to just generate the columns yourself in html and css, using the data you would previously have sent to google to set the height (in %) of the columns. Then, each column would be a seperate html element and could link to whatever you want.
So, more control = more work. As usual :)
I am creatin an app in Java that checks if a webpage has been updated.
However some webpages dont have a "last Modified" header.
I even tried checking for a change in content length but this method is not reliable as sometimes the content length changes without any modification in the webpage giving a false alarm.
I really need some help here as i am not able to think of a single foolproof method.
Any ideas???
If you connect the whole time to the webpage like this code it can help:
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
public class main {
String updatecheck = "";
public static void main(String args[]) throws Exception {
//Constantly trying to load page
while (true) {
try {
System.out.println("Loading page...");
// connecting to a website with Jsoup
Document doc = Jsoup.connect("URL").userAgent("CHROME").get();
// Selecting a part of this website with Jsoup
String pick = doc.select("div.selection").get(0);
// printing out when selected part is updated.
if (updatecheck != pick){
updatecheck = pick;
System.out.println("Page is changed.");
}
} catch (Exception e) {
e.printStackTrace();
System.out.println("Exception occured.... going to retry... \n");
}
}
}
}
How to get notified after a webpage changes instead of refreshing?
Probably the most reliable option would be to store a hash of the page contet.
If you are saying that content-length changes then probably the webpages your are trying to check are dynamically generated and or not whatsoever a static in nature. If that is the case then even if you check the 'last-Modified' header it won't reflect the changes in content in most cases anyway.
I guess the only solution would be a page specific solution working only for a specific page, one page you could parse and look for content changes in some parts of this page, another page you could check by last modified header and some other pages you would have to check using the content length, in my opinion there is no way to do it in a unified mode for all pages on the internet. Another option would be to talk with people developing the pages you are checking for some markers which will help you determine if the page changed or not but that of course depends on your specific use case and what you are doing with it.