I am a beginner at Java programming. I am using Jsoup to parse this website.
I want to download all images from the website in the least possible time. Let me clarify what I mean by that. The method I am using right now is this:
I filter out all img elements in the HTML by accessing the DOM created by Jsoup:
List<String> imgLinkList = new ArrayList<>();
try {
    String pageLink = baseLink + character;
    Document pageDOM = Jsoup.connect(pageLink).timeout(0).get(); // timeout(0) = no timeout limit (infinity)
    // get the list of image link elements by the text common to all image links
    Elements imgTags = pageDOM.getElementsByAttributeValueContaining("src", "/CharacterImages");
    // add links to the list
    for (Element imgTag : imgTags) {
        imgLinkList.add(imgTag.absUrl("src")); // get absolute URLs
    }
} catch (IOException e) {
    System.err.println("IO error.");
}
I then iterate over imgLinkList and download the individual images.
But this takes a lot of time. I have to wait around 42 seconds (for the 車 character; I measured with System.currentTimeMillis()) before all images are downloaded. I have tried using wget, but it takes about the same amount of time. I think the cause is that a new HTTP request is fired for every link in imgLinkList to download an image.
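The download loop itself is nothing special. Per link it does roughly this (a minimal sketch; the file naming and output location are simplified):
import java.io.IOException;
import java.io.InputStream;
import java.net.URL;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.nio.file.StandardCopyOption;
import java.util.List;

static void downloadAll(List<String> imgLinkList) {
    for (String imgLink : imgLinkList) {
        // assumption: the last path segment is a usable file name
        String fileName = imgLink.substring(imgLink.lastIndexOf('/') + 1);
        try (InputStream in = new URL(imgLink).openStream()) { // one HTTP request per image
            Files.copy(in, Paths.get(fileName), StandardCopyOption.REPLACE_EXISTING);
        } catch (IOException e) {
            System.err.println("Failed to download " + imgLink + ": " + e.getMessage());
        }
    }
}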
So I tried downloading the entire webpage locally, then filtering out the images and displaying them in my program. I used wget to download the webpage with images:
wget -p "http://chineseetymology.org/CharacterEtymology.aspx?characterInput=%E8%BB%8A&submitButton1=Etymology"
But the same problem occurs here: it takes around the same time. Yet in Firefox, the same page takes 10 seconds or less to load with my bandwidth, and saving the webpage takes less than 5 seconds.
How can I imitate what Firefox does here? Is there a way to collapse the multiple HTTP requests into a single request? Why does Firefox take so little time to download the webpage including all the images?
Sorry if I am not making a lot of sense. This is my second post here. If I need to provide any more information, I would be more than happy to do so.
I am very new to coding in Java/Android Studio. I have everything set up that I have been able to figure out thus far: I have a button, and I need to put code inside the button's click event that will fetch information from a website, convert it to a string, and display it.
I figured I would have to use the HTML source code to do this, so I have installed the Jsoup HTML parser. All of the help with Jsoup I have found only leads me up to getting the HTML into a Document, and I am not sure that is the best way to accomplish what I need.
Can anyone tell me what code to use to fetch the HTML from the website, search through it for a specific match, and convert that match to a string? Or can anyone tell me if there is a better way to do this? I only need to grab one piece of information and display it.
Here is the piece of HTML code that contains the value I want:
writeBidRow('Wheat',-60,false,false,false,0.5,'01/15/2015','02/26/2015','All',' ',' ',60,'even','c=2246&l=3519&d=G15',quotes['KEH15'], 0-0);
I need to grab and display whatever value represents quotes['KEH15'] in that HTML code.
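Here is the sort of thing I am imagining, based on Jsoup examples I have found (a rough sketch; the URL is a placeholder, and if quotes['KEH15'] is filled in by JavaScript at runtime, the raw HTML may only contain the literal text):
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public static String grabBidRow() throws java.io.IOException {
    // placeholder URL: wherever the page with the writeBidRow(...) script lives
    Document doc = Jsoup.connect("http://example.com/bids.aspx").get();
    // look for the writeBidRow('Wheat', ...) call inside the page's <script> blocks
    Pattern p = Pattern.compile("writeBidRow\\('Wheat'.*?\\);");
    for (Element script : doc.select("script")) {
        Matcher m = p.matcher(script.data()); // data() gives the raw script text
        if (m.find()) {
            return m.group(); // the whole writeBidRow(...) call as a string
        }
    }
    return null;
}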
Thank you in advance for your help.
Keith
Grabbing raw HTML is an extremely tedious way to access information from the web, it is bad practice, and it is difficult to maintain if the site you are fetching from changes its HTML.
I don't know your specific situation or what data you are fetching, but if there is a way to fetch that data via an API, use that instead.
Since you say you are pretty new to Android and Java, let me explain something I wish had been explained to me very early on (although I am mostly self-taught).
The way people traditionally access information across the Internet is through HTML and JavaScript (which your browser, like Chrome or Firefox, interprets to look pretty), transferred over the internet using the protocol called HTTP. This is a great way for humans to communicate with computers that are far away, and the average person probably doesn't realize that there is more to the internet than this: a browser and the websites you can go to.
Although there are multiple methods, for the purposes of what I think you're looking for, applications communicate over the internet in a slightly different way:
When an Android application asks a server for some information, rather than returning HTML and JavaScript intended for human consumption, the server will (traditionally) return what's called JSON (or sometimes XML, which is very similar). JSON is a very simple way to take information about an object and put it into a form that is easily readable by both humans (developers) and computers, and easily transmitted over the internet. For example, say you ask a server for some kind of "Video" object for an app that plays videos; it may give you something like this:
{
    "name": "Gangnam Style",
    "metadata": {
        "url": "https://www.youtube.com/watch?v=9bZkp7q19f0",
        "views": 2000000000,
        "ageRestricted": false,
        "likes": 43434,
        "dislikes": 124
    },
    "comments": [
        {
            "username": "John",
            "comment": "10/10 would watch again"
        },
        {
            "username": "Jane",
            "comment": "12/10 with rice"
        }
    ]
}
That is very readable by humans, but also by computers! We know the name is "Gangnam Style", the link to the video, etc.
A super helpful way to interact with JSON in Java and Android is Google's GSON library, which lets you serialize a Java object to JSON or parse JSON into a Java object.
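For example, a minimal sketch (the Video and Metadata classes are assumptions whose field names match the JSON above):
import com.google.gson.Gson;

// classes whose field names match the JSON keys
class Metadata {
    String url;
    long views;
    boolean ageRestricted;
    int likes;
    int dislikes;
}

class Video {
    String name;
    Metadata metadata;
}

// parse the JSON string into a Java object, and back again
Gson gson = new Gson();
Video video = gson.fromJson(jsonString, Video.class);
System.out.println(video.name);          // "Gangnam Style"
String backToJson = gson.toJson(video);  // Java object -> JSON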
To get this information in the first place, you have to make a network call to an API (Application Programming Interface), which is just a fancy term for communication between a server and a client. One very cool, free, and easy-to-understand API that I will use for this example is the OMDB API, which just spits back information about movies from IMDB. So how do you talk to the API? Well, luckily they've got some nice documentation, which says that to get information on a movie, we need to use some parameters in the URL, like perhaps
http://www.omdbapi.com/?t=Interstellar
They want a title with the parameter "t". We could add a year or a return type, but this should be enough to understand the basics. If you go to that URL in your browser, it spits back lots of information about Interstellar in JSON form: that stuff we were talking about! So how would you get this information from your Android application?
Well, you could use Android's built-in HttpURLConnection classes and spend a few hours researching why your calls aren't working. But doesn't essentially every app use networking now? Why reinvent the wheel when virtually every serious app out there has already done this work? Perhaps we can find some code online to do it for us.
Or even better, a library! In particular, Retrofit, an open-source library developed by Square. There are multiple libraries like it (go ahead and research them; it's best to find the best fit for your project), but the idea is that they do all the hard work for you, like the low-level network programming. Following their guides, you can reduce a lot of code into just a few lines. So for our OMDB API example, we can set up our network calls like this:
import com.google.gson.JsonObject;
import retrofit.Callback;
import retrofit.RestAdapter;
import retrofit.http.GET;
import retrofit.http.Query;

//OMDB API client (Retrofit 1.x)
public class ApiClient {

    //a single shared instance of the API interface
    private static OmdbApiInterface sOmdbApiInterface;

    //if the interface has already been instantiated, return it; if not, build it, then return it
    public static OmdbApiInterface getOmdbApiClient() {
        if (sOmdbApiInterface == null) {
            RestAdapter restAdapter = new RestAdapter.Builder()
                    .setEndpoint("http://www.omdbapi.com")
                    .build();
            sOmdbApiInterface = restAdapter.create(OmdbApiInterface.class);
        }
        return sOmdbApiInterface;
    }

    public interface OmdbApiInterface {
        @GET("/")
        void getInfo(@Query("t") String title, Callback<JsonObject> callback);
    }
}
After you have researched and understood what's going on up there using their documentation, you can use this class anywhere in your application to call the API:
//you could get a user input string and pass it in as movieName
ApiClient.getOmdbApiClient().getInfo(movieName, new Callback<JsonObject>() {

    //the nice thing here is that Retrofit deals with the JSON for you,
    //so you can read information straight off the JsonObject
    @Override
    public void success(JsonObject movies, Response response) {
        Log.i("TAG", "Movie name is " + movies.get("Title").getAsString());
    }

    @Override
    public void failure(RetrofitError error) {
        Log.e("TAG", error.getMessage());
    }
});
Now you've made an API call to get info from across the web! Congratulations! Now do what you want with the data. In this case we used OMDB, but you can use anything that offers this method of communication. For your purposes, I don't know exactly what data you are trying to get, but if possible, try to find a public API that provides it and query it using a method similar to this.
Let me know if you've got any questions.
Cheers!
As @caleb-allen said, if an API is available to you, it's better to use that.
However, I'm assuming that the web page is all you have to work with.
There are many libraries that can be used on Android to get the content of a URL.
Choices range from the bare-bones HttpURLConnection, to the slightly higher-level HttpClient, to robust libraries like Retrofit. I personally recommend Retrofit. Whatever you do, make sure your HTTP access is asynchronous and not done on the UI thread. Retrofit handles this for you by default.
For parsing the results, I've had good results in the past using the open-source HtmlCleaner library - see http://htmlcleaner.sourceforge.net
Similar to Jsoup, it takes a possibly badly-formed HTML document and creates a valid XML document from it.
Once you have a valid XML document, you can use HtmlCleaner's implementation of the XML DOM to parse the document and find what you need.
Here, for example, is a method that I use to parse the names of 'projects' from a <table> element on a web page, where projects are links within the table:
private List<Project> parseProjects(String html) throws Exception {
    List<Project> parsedProjects = new ArrayList<Project>();

    // clean the (possibly badly-formed) HTML into an XML DOM
    HtmlCleaner pageParser = new HtmlCleaner();
    TagNode node = pageParser.clean(html);

    // find the projects table by its class attribute
    String xpath = "//table[@class='listtable']";
    Object[] tables = node.evaluateXPath(xpath);
    TagNode tableNode;
    if (tables.length > 0) {
        tableNode = (TagNode) tables[0];
    } else {
        throw new Exception("projects table not found in html");
    }

    // each project is a link within the table
    TagNode[] projectLinks = tableNode.getElementsByName("a", true);
    for (int i = 0; i < projectLinks.length; i++) {
        TagNode link = projectLinks[i];
        String projectName = link.getText().toString();
        String href = link.getAttributeByName("href");
        String projectIdString = href.split("=")[1];
        int projectId = Integer.parseInt(projectIdString);
        Project project = new Project(projectId, projectName);
        parsedProjects.add(project);
    }
    return parsedProjects;
}
If you have permission to edit the webpage, you can add an anchor to the specific part of the page you want to jump to.
First, add an id to the element at the start of the line you want to go to in your page, for example:
<h2 id="target">your text</h2>
Then, in your app, load the URL with that fragment in the click handler:
this.mWebView.loadUrl("https://#######.com.html#target");
To the left of the # goes the address of your webpage, then #target; in this example the element's id is target.
Excuse me if my English isn't good.
I'm trying to download the HTML of www.pandora.com/profile/stations/olin_d_kirkland with Java, to match what I get when I select 'View page source' from the context menu of the webpage in Chrome.
Now, I know how to download webpage HTML source code with Java; I have done it with downloads.nl and tested it on other sites. However, Pandora is being a mystery. My ultimate goal is to parse the 'Stations' from a Pandora account.
Specifically, I would like to grab the station names from a site such as www.pandora.com/profile/stations/olin_d_kirkland
I have attempted using the Selenium library and the built-in URL getter in Java, but I only get ~4700 lines of source when I should be getting ~5300. Not to mention that there is no personalized data in what comes back, which is what I'm looking for.
I figured that I wasn't grabbing the JavaScript or letting it execute first, but even though I waited for the page to load in my code, I always got the same result.
If at all possible, I'd like a method called grabPageSource() that returns the page source as a String when called.
import java.io.IOException;
import java.util.ArrayList;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class PandoraStationFinder {

    public static void main(String[] args) throws IOException, InterruptedException {
        String s = grabPageSource();
        String[] lines = s.split("\r\n");
        String t;
        ArrayList<Station> stations = new ArrayList<Station>();

        for (int i = 0; i < lines.length; i++) {
            t = lines[i].trim();
            Pattern p = Pattern.compile("[\\w\\s]+");
            Matcher m = p.matcher(t);

            if (m.matches()) {
                Station someStation = new Station(t);
                stations.add(someStation);
                // System.out.println("I found a match on line " + i + ".");
                // System.out.println(t);
            }
        }
    }

    public static String grabPageSource() throws IOException {
        String fullTxt = "";
        // Get HTML from www.pandora.com/profile/stations/olin_d_kirkland
        return fullTxt;
    }
}
It is irrelevant how it's done, but I'd like, in the final product, to grab a comprehensive list of ALL songs that have been liked by a user on Pandora.
The Pandora pages are constructed heavily using Ajax, so many scrapers struggle with them. In the case you've shown above, looking at the list of stations, the page actually makes a secondary request to:
http://www.pandora.com/content/stations?startIndex=0&webname=olin_d_kirkland
If you run your request, but point it to that URL rather than the main site, I think you will have a lot more luck with your scraping.
Similarly, to access the "likes", you want this URL:
http://www.pandora.com/content/tracklikes?likeStartIndex=0&thumbStartIndex=0&webname=olin_d_kirkland
This will pull back the liked tracks in groups of 5, but you can page through the results by increasing the 'thumbStartIndex' parameter.
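For example, a minimal sketch of paging through the likes with Jsoup (assumptions: the response is an HTML fragment, and an empty response marks the end of the results):
import java.io.IOException;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

// page through the liked tracks, 5 at a time, by bumping thumbStartIndex
static void fetchAllLikes(String webname) throws IOException {
    int thumbStartIndex = 0;
    while (true) {
        String url = "http://www.pandora.com/content/tracklikes"
                + "?likeStartIndex=0&thumbStartIndex=" + thumbStartIndex
                + "&webname=" + webname;
        Document page = Jsoup.connect(url).ignoreContentType(true).get();
        if (page.text().trim().isEmpty()) {
            break; // assumption: an empty response means no more results
        }
        System.out.println(page.html()); // parse the track names out of the fragment here
        thumbStartIndex += 5;
    }
}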
Not an answer exactly, but hopefully this will get you moving in the correct direction:
Whenever I get into this sort of thing, I always fall back on an HTTP monitoring tool. I use Firefox, and I really like the Live HTTP Headers extension. Check out which headers are going back and forth, then tailor your HTTP requests accordingly. As an absolute lowest-level test, grab the headers from a successful request, then send them to port 80 using telnet and see what comes back.
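For example, here is a rough raw-socket version of that telnet test (a sketch; the headers are minimal and you will likely need to tailor them based on what the monitoring shows):
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.PrintWriter;
import java.net.Socket;

// hand-write the HTTP request over a raw socket, telnet-style, and dump what comes back
try (Socket socket = new Socket("www.pandora.com", 80);
     PrintWriter out = new PrintWriter(socket.getOutputStream());
     BufferedReader in = new BufferedReader(new InputStreamReader(socket.getInputStream()))) {
    out.print("GET /content/stations?startIndex=0&webname=olin_d_kirkland HTTP/1.1\r\n");
    out.print("Host: www.pandora.com\r\n");
    out.print("Connection: close\r\n\r\n");
    out.flush();
    String line;
    while ((line = in.readLine()) != null) {
        System.out.println(line); // status line, headers, then body
    }
} catch (IOException e) {
    e.printStackTrace();
}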
I was quite happy with my JSF app, which read the contents of received MQ messages and supplied them to the UI like this:
<rich:panel>
    <snip>
    <rich:panelMenuItem label="mylabel" action="#{MyBacking.updateCurrent}">
        <f:param name="current" value="mylog.log" />
    </rich:panelMenuItem>
    </snip>
</rich:panel>
<rich:panel>
    <a4j:outputPanel ajaxRendered="true">
        <rich:insert content="#{MyBacking.log}" highlight="groovy" />
    </a4j:outputPanel>
</rich:panel>
and in MyBacking.java
private String logFile = null;
...
public String updateCurrent() {
    FacesContext context = FacesContext.getCurrentInstance();
    setCurrent((String) context.getExternalContext().getRequestParameterMap().get("current"));
    setLog(getCurrent());
    return null;
}

public void setLog(String log) {
    sendMsg(log);
    msgBody = receiveMsg(moreargs);
    logFile = msgBody;
}

public String getLog() {
    return logFile;
}
until the contents of one of the messages were too big and Tomcat fell over. Obviously, I thought, I need to change the way it works so that I return some form of stream, so that no single object grows so big that the container dies, and the content returned by successive messages is streamed to the UI as it comes in.
Am I right in thinking that I can replace the work I'm doing now on a String object with a BufferedOutputStream object, i.e. no change to the JSF code and something like this changing at the back end:
private BufferedOutputStream logFile = null;

public void setLog(String log) {
    sendMsg(args);
    logFile = (BufferedOutputStream) receiveMsg(moreargs);
}

public String getLog() {
    return logFile;
}
If Tomcat fell over on that, the message must have been over 128 MB, or maybe double that (the minimum default memory size of certain Tomcat versions). I don't think users would want to visit a webpage that big. It may feel fast when acting as both server and client at localhost, but it will be up to 100 times slower when served over the internet.
Introduce paging/filtering. Query and show just 100 entries at a time or so. Add a filter that returns specific results, such as logs of a certain time range or of a certain user, etc.
Google also doesn't show all of the zillions of available results at once in a single webpage; their servers would certainly "fall over" as well :)
Update as per the comment: has the bean been put in the session scope or similar? That way it will indeed accumulate in memory quickly. Streaming is only possible if you have an InputStream on the one side and an OutputStream on the other side. There is no way to convert a String to a stream by a casting attempt like yours so that it is no longer stored in Java memory. The source has to stay on the other side, and it has to be retrieved byte by byte over the line. The only feasible approach would be to use an <iframe> whose src points to some HttpServlet which streams the data directly from the source to the response.
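A minimal sketch of such a servlet (the servlet name and the openLogStream() helper are hypothetical; plug in however you obtain the log as an InputStream):
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import javax.servlet.ServletException;
import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;

public class LogStreamServlet extends HttpServlet {

    @Override
    protected void doGet(HttpServletRequest request, HttpServletResponse response)
            throws ServletException, IOException {
        String name = request.getParameter("name"); // e.g. mylog.log
        response.setContentType("text/plain");

        try (InputStream input = openLogStream(name);
             OutputStream output = response.getOutputStream()) {
            byte[] buffer = new byte[8192];
            int length;
            // copy buffer by buffer so the whole log is never held in memory at once
            while ((length = input.read(buffer)) > 0) {
                output.write(buffer, 0, length);
            }
        }
    }

    private InputStream openLogStream(String name) throws IOException {
        // hypothetical: fetch the message body from MQ as a stream instead of a String
        throw new UnsupportedOperationException("depends on your MQ setup");
    }
}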
Your best bet is likely to store the whole thing in a database, or (if it doesn't contain user-specific data) in the application scope, and share it among all sessions/requests.
Here's the JavaScript (on an aspx page):
function WriteDocument(clientRef, system, branch, category, pdfXML)
{
    AppletReturnValue = document.DocApplet.WriteDocument(clientRef, apmBROOMS, branch, category, pdfXML);
    if (AppletReturnValue.length > 0) {
        document.getElementById('pdfData').value = "";
        CallServer(AppletReturnValue, '');
    }
    PostBackAndDisplayPDF();
}
pdfXML comes from pdfData, a hidden field on the page containing the XML that holds the base64-encoded PDF data passed to the Java applet. All the other values being passed have sensible, in-range values.
The XML is like this:
<Documents>
    <FileName>AFileName</FileName>
    <PDF>JVBERiDAzOTY1NzMwIDAwMDAwIG4NCjAwMDM5NjU4NDcgMDAwMDAgbg0KMDAwMzk2NTk2</PDF>
</Documents>
The contents of the PDF element are much larger than displayed here.
The signature of the Java method is:
public String WriteDocument(String clientPolicyReference,
                            int systemType,
                            int branch,
                            String category,
                            String PDFData) throws Exception
It seems that when the PDF data gets large, the applet call fails and an 'Unknown Error' is thrown in the JS.
The PDF document whose data produces this error is about 4 MB in size.
Many thanks in advance for any help.
Thanks for responding chaps but I've sorted the problem.
How? I took JRE 1.6 update 12 off and put update 7 (which is the version we recommend to those who use our website) on my machine.
Why update 12 stopped working I don't know. Why update 7 is stable I don't know. [sigh]
It's things like this that make me glad I work mostly with a 'long time between releases' framework like .NET.