I am creating an app in Java that checks whether a webpage has been updated.
However, some webpages don't have a "Last-Modified" header.
I even tried checking for a change in content length, but that method is not reliable: sometimes the content length changes without any modification to the webpage, giving a false alarm.
I really need some help here, as I am not able to think of a single foolproof method.
Any ideas?
If you keep polling the webpage, code along these lines can help:
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class Main {

    static String updatecheck = "";

    public static void main(String[] args) throws Exception {
        // Keep polling the page.
        while (true) {
            try {
                System.out.println("Loading page...");
                // Connect to the website with Jsoup.
                Document doc = Jsoup.connect("URL").userAgent("CHROME").get();
                // Select the part of the page to watch and take its text.
                String pick = doc.select("div.selection").get(0).text();
                // Print a message when the selected part has changed.
                if (!updatecheck.equals(pick)) {
                    updatecheck = pick;
                    System.out.println("Page is changed.");
                }
            } catch (Exception e) {
                e.printStackTrace();
                System.out.println("Exception occurred... going to retry...\n");
            }
            // Wait a bit between requests so the server is not hammered.
            Thread.sleep(10_000);
        }
    }
}
How do I get notified when a webpage changes, instead of constantly refreshing?
Probably the most reliable option would be to store a hash of the page content.
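A minimal sketch of that idea, assuming you fetch the page with java.net.URL and hash the body with SHA-256 (the class name, URL, and polling interval below are just placeholders):

import java.io.InputStream;
import java.net.URL;
import java.security.MessageDigest;

public class PageHashChecker {
    public static void main(String[] args) throws Exception {
        String lastHash = "";
        while (true) {
            String currentHash = hashOf("http://example.com/page.html");
            if (!currentHash.equals(lastHash)) {
                System.out.println("Page changed.");
                lastHash = currentHash;
            }
            Thread.sleep(60_000); // poll once a minute
        }
    }

    // Download the page body and return a hex SHA-256 digest of it.
    private static String hashOf(String address) throws Exception {
        MessageDigest digest = MessageDigest.getInstance("SHA-256");
        try (InputStream in = new URL(address).openStream()) {
            byte[] buffer = new byte[8192];
            int read;
            while ((read = in.read(buffer)) != -1) {
                digest.update(buffer, 0, read);
            }
        }
        StringBuilder hex = new StringBuilder();
        for (byte b : digest.digest()) {
            hex.append(String.format("%02x", b));
        }
        return hex.toString();
    }
}

Note that this only tells you the bytes changed; for pages with dynamic fragments (ads, timestamps) you would hash only the part of the document you care about.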
If the content length changes, the webpages you are trying to check are probably dynamically generated and not static at all. In that case, even checking the Last-Modified header won't reflect content changes in most cases anyway.
I guess the only solution is a page-specific one: for one page you could parse the document and look for content changes in certain parts of it, another page you could check by its Last-Modified header, and other pages you would have to check using the content length. In my opinion there is no way to do this in a unified way for all pages on the internet. Another option would be to talk with the people developing the pages you are checking and agree on some markers that will help you determine whether the page changed, but that of course depends on your specific use case and what you are doing with it.
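For the pages that do expose the headers, a lightweight check could look something like this (a sketch using HttpURLConnection with a HEAD request; the URL is a placeholder):

import java.net.HttpURLConnection;
import java.net.URL;

public class HeaderChecker {
    public static void main(String[] args) throws Exception {
        HttpURLConnection conn =
                (HttpURLConnection) new URL("http://example.com/page.html").openConnection();
        conn.setRequestMethod("HEAD"); // headers only, no body download
        long lastModified = conn.getLastModified();        // 0 if the header is missing
        long contentLength = conn.getContentLengthLong();  // -1 if unknown
        System.out.println("Last-Modified: " + lastModified);
        System.out.println("Content-Length: " + contentLength);
        // Compare these values against the ones stored from the previous check;
        // fall back to hashing or parsing the body when both are missing.
    }
}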
I get this issue from a Checkmarx security scan:
Method exec at line 69 of
web\src\main\java\abc\web\actions\HomeAction.java gets user input for
the CNF_KEY_COSN element. This element’s value then flows through the
code without being properly sanitized or validated and is eventually
displayed to the user in method logException at line 905 of
web\src\main\java\gov\abc\external\info\ServiceHelper.java. This may
enable a Cross-Site-Scripting attack.
Line 69 of HomeAction.java:
String cosn = (String) request.getParameter(CNF_KEY_CON);
Line 905 in ServiceHelper.java just logs the error:
private static void logException(InfoServiceException exception, String message) {
    String newMessage = message + ": " + exception.getMessageForLogging();
    try {
        log.error(newMessage, exception);
    } catch (Exception e) {
        // fallback to console
        System.out.println("error logging exception ->");
        e.printStackTrace(System.out);
        System.out.println("exception ->");
        System.out.print(newMessage);
        if (exception != null) exception.printStackTrace(System.out);
    }
}
I changed another block of code in HomeAction.java to:
if (cosn != null && cosn.matches("[0-9a-zA-Z_]+")) {
    ...
}
But that didn't help. How do I validate/sanitize/encode line 69? Any help is much appreciated.
Thanks
You can sanitise strings against XSS attacks using Jsoup; it has a clean() method for this. You would do something like this to sanitise the input:
String sanitizedInput = Jsoup.clean(originalInput, "", Whitelist.none(), new OutputSettings().prettyPrint(false));
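For context, a self-contained version of that call might look like the sketch below (the input string is just an example; note that newer jsoup releases rename Whitelist to Safelist, so adjust the import to your version):

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document.OutputSettings;
import org.jsoup.safety.Whitelist;

public class SanitizeExample {
    public static void main(String[] args) {
        String originalInput = "<script>alert('xss')</script>Hello";
        // Whitelist.none() strips all HTML tags; prettyPrint(false) keeps the output on one line.
        String sanitizedInput = Jsoup.clean(originalInput, "",
                Whitelist.none(), new OutputSettings().prettyPrint(false));
        System.out.println(sanitizedInput); // prints the remaining text content, tags removed
    }
}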
Checkmarx defines a set of sanitizers that you can check in the system.
Based on your source code snippets, I assume that:
i) you are appending 'cosn' to 'message',
ii) the application is web-based (given the request.getParameter call), and
iii) the message is being displayed on the console or logged to a file.
You could consider using Google Guava or Apache Commons Text to HTML-escape the input.
import com.google.common.html.HtmlEscapers;

public void testGuavaHtmlEscapers() {
    String badInput = "<script> alert me! <script>";
    String escapedLocation = HtmlEscapers.htmlEscaper().escape(badInput);
    System.out.println("<h1> Location: " + escapedLocation + "<h1>");
}

import static org.apache.commons.text.StringEscapeUtils.escapeHtml4;

public void testHtmlEscapers() {
    String badInput = "<script> alert me! <script>";
    System.out.println(escapeHtml4(badInput));
}
I would also consider whether there is sensitive information that should be masked, e.g. using String.replaceAll.
public void testReplace() {
    String email = "some-email@domain.com";
    String masked = email.replaceAll("(?<=.).(?=[^@]*?.@)", "*");
    System.out.println(masked);
}
The three methods above will work similarly.
This is likely a false positive (technically, "not exploitable" in Checkmarx) with regard to XSS, depending on how you process and display logs. If logs are ever displayed in a browser as HTML, it might be vulnerable to blind XSS from this application's point of view, but it would be a vulnerability in whatever component displays logs as HTML, and not in the code above.
Contrary to other answers, you should not encode the message here. Whatever technology you use for logging will of course have to encode it properly for its own use (like for example if it's stored as JSON, data will have to be JSON-encoded), but that has nothing to do with XSS, or with this problem at all.
This is just raw data, and you can store raw data as is. If you encode it here, you will have a hard time displaying it in any other way. For example if you apply html encoding, you can only display it in html (or you have to decode, which will negate any effect). It doesn't make sense. XSS would arise if you displayed these logs in a browser - in which case whatever displays it would have to encode it properly, but that's not the case here.
Note though that it can still be a log injection vulnerability. Make sure that whatever way you store logs, that log store *does* apply necessary encoding. If it's a text file, you probably want to remove newlines so that fake lines cannot be added to the log. If it's JSON, you will want to encode to JSON, and so on. But that's a feature of your log facility, and not the code above.
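If the logs do end up in a plain text file, a simple precaution (a sketch, not tied to any particular logging framework; the helper name is hypothetical) is to strip line breaks from user-controlled data before it reaches the logger:

// Replace CR/LF so user input cannot forge extra lines in a text log.
private static String stripNewlines(String input) {
    return input == null ? null : input.replaceAll("[\r\n]+", " ");
}

// Possible usage inside logException, before building the message:
// String newMessage = stripNewlines(message) + ": " + exception.getMessageForLogging();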
I want to learn how to:
Step 1: open a URL – for example Gmail.
Step 2: insert the user name and password and press sign-in.
How can I insert the user name and password and press the sign-in button?
Do I have to use Selenium?
This code only handles opening the browser (step 1):
import java.io.IOException;

public class Website {
    public void openWebsite() // throws IOException
    {
        try {
            @SuppressWarnings("unused")
            Process p = Runtime.getRuntime().exec("cmd /c start http://accounts.google.com/ServiceLogin ");
        } catch (IOException e1) {
            System.out.println(e1);
        }
    }
}
First you need to open the URL. Right now you are actually not opening the URL. You are asking the Windows operating system "What would you do with http://accounts.google.com/ServiceLogin?"
Because it is Windows, it will make a guess, which follows roughly this line of logic:
it sort of looks like a URL, so I'll fire up explorer and ask explorer to do something with it.
Which means that your code is now a few programs removed from the data, and none of the intermediate programs will pass the need for input back to your program (because they're not built to do so).
What you need to do is avoid asking other programs to open the URL; it's just too problematic. First, they might get it wrong; second, they'll never know how to ask you for the input. To open a URL directly:
import java.io.InputStream;
import java.net.URL;

// ... somewhere in the code ...
URL url = new URL("http://accounts.google.com/ServiceLogin");
InputStream in = url.openStream();
Do some googling on various java.net.URL tutorials, and you will soon find the right combination of techniques needed to handle your particular credential challenge. Here's one resource, but it seems you need to do a bit of homework before what they say will make sense to you. If you stumble, at least you'll have a better, more specific question to ask the next time around (and don't forget to post your source code).
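As a starting point, a minimal sketch that reads the page body into a String (charset and error handling kept deliberately simple):

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;
import java.nio.charset.StandardCharsets;

public class UrlReader {
    public static void main(String[] args) throws Exception {
        URL url = new URL("http://accounts.google.com/ServiceLogin");
        StringBuilder page = new StringBuilder();
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(url.openStream(), StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                page.append(line).append('\n');
            }
        }
        System.out.println(page);
        // Submitting the login form is a separate step: you would need to POST the
        // form fields yourself, or use a browser-automation tool such as Selenium.
    }
}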
I'm trying to download the HTML of www.pandora.com/profile/stations/olin_d_kirkland with Java, so that it matches what I get when I select 'View page source' from the context menu of the webpage in Chrome.
Now, I know how to download webpage HTML source code with Java. I have done it with downloads.nl and tested it on other sites. However, Pandora is being a mystery. My ultimate goal is to parse the 'Stations' from a Pandora account.
Specifically, I would like to grab the Station names from a site such as www.pandora.com/profile/stations/olin_d_kirkland
I have attempted using the Selenium library and the built-in URL getter in Java, but I only get ~4700 lines of code when I should be getting 5300. Not to mention that there is no personalized data in the code, which is what I'm looking for.
I figured the problem was that I wasn't grabbing the JavaScript or letting the JavaScript execute first, but even though I waited for it to load in my code, I always got the same result.
If at all possible, I should have a method called 'grabPageSource()' that returns a String. It should return the source code when called upon.
import java.io.IOException;
import java.util.ArrayList;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class PandoraStationFinder {

    public static void main(String[] args) throws IOException, InterruptedException {
        String s = grabPageSource();
        String[] lines = s.split("\r?\n");
        ArrayList<Station> stations = new ArrayList<Station>();
        for (int i = 0; i < lines.length; i++) {
            String t = lines[i].trim();
            Pattern p = Pattern.compile("[\\w\\s]+");
            Matcher m = p.matcher(t);
            if (m.matches()) {
                Station someStation = new Station(t);
                stations.add(someStation);
                // System.out.println("I found a match on line " + i + ".");
                // System.out.println(t);
            }
        }
    }

    public static String grabPageSource() throws IOException {
        String fullTxt = "";
        // Get HTML from www.pandora.com/profile/stations/olin_d_kirkland
        return fullTxt;
    }
}
It is irrelevant how it's done, but I'd like, in the final product, to grab a comprehensive list of ALL songs that have been liked by a user on Pandora.
The Pandora pages are heavily constructed using ajax, so many scrapers struggle. In the case you've shown above, looking at the list of stations, the page actually puts through a secondary request to:
http://www.pandora.com/content/stations?startIndex=0&webname=olin_d_kirkland
If you run your request, but point it to that URL rather than the main site, I think you will have a lot more luck with your scraping.
Similarly, to access the "likes", you want this URL:
http://www.pandora.com/content/tracklikes?likeStartIndex=0&thumbStartIndex=0&webname=olin_d_kirkland
This will pull back the liked tracks in groups of 5, but you can page through the results by increasing the 'thumbStartIndex' parameter.
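A rough sketch of paging through that endpoint with plain java.net.URL (the parameter names are taken from the URLs above; the page size, upper bound, and stop condition are assumptions):

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;
import java.nio.charset.StandardCharsets;

public class PandoraLikesFetcher {
    public static void main(String[] args) throws Exception {
        // The endpoint appears to return likes in small batches; keep requesting
        // until an empty response comes back (an assumption about the stop condition).
        for (int start = 0; start < 100; start += 5) {
            URL url = new URL("http://www.pandora.com/content/tracklikes"
                    + "?likeStartIndex=0&thumbStartIndex=" + start
                    + "&webname=olin_d_kirkland");
            StringBuilder chunk = new StringBuilder();
            try (BufferedReader reader = new BufferedReader(
                    new InputStreamReader(url.openStream(), StandardCharsets.UTF_8))) {
                String line;
                while ((line = reader.readLine()) != null) {
                    chunk.append(line).append('\n');
                }
            }
            if (chunk.toString().trim().isEmpty()) {
                break; // no more results
            }
            System.out.println(chunk);
        }
    }
}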
Not an answer exactly, but hopefully this will get you moving in the correct direction:
Whenever I get into this sort of thing, I always fall back on an HTTP monitoring tool. I use Firefox, and I really like the Live HTTP Headers extension. Check out what the headers are that are going back and forth, then tailor your HTTP requests accordingly. As an absolute lowest-level test, grab the headers from a successful request, then send them to port 80 using telnet and see what comes back.
I'm trying to get a lot of data from multiple pages, but it's not always consistent. Here is an example of the HTML I am working with:
Example HTML
I need to get something like: Team | Team | Result all into different variables or lists.
I just need some help on where to start, because the main table I'm working with isn't the same on every page.
Here's my Java so far:
try {
    Document team_page = Jsoup.connect("http://www.soccerstats.com/team.asp?league=" + league + "&teamid=" + teamNumber).get();
    Element home_team = team_page.select("[class=homeTitle]").first();
    String teamName = home_team.text();
    System.out.println(teamName + "'s Latest Results: ");
    Elements main_page = team_page.select("[class=stat]");
    System.out.println(main_page);
} catch (IOException e) {
    System.out.println("unable to parse content");
}
I am getting the league and teamid from different methods of my program.
Thanks!
Yes. This is one of the problems with webpage scraping.
You have to figure out one or more heuristics that will extract the information that you need across all of the pages that you need to access. There's no magic bullet. Just hard work. (And you'll have to do it all over again if the site changes its page layout.)
A better idea is to request the information as XML or JSON using the site or sites' RESTful APIs ... assuming they exist and are available to you.
(And if you continue with the web-scraping approach, check the site's Terms of Service to make sure that your activity is acceptable.)
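If such an API does exist, calling it is usually straightforward; here is a minimal sketch (the endpoint, parameters, and response format below are hypothetical, for illustration only):

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;

public class ApiExample {
    public static void main(String[] args) throws Exception {
        // Hypothetical endpoint; substitute the real API URL and parameters.
        URL url = new URL("http://www.example.com/api/team?league=1&teamid=42");
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setRequestProperty("Accept", "application/json");
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(conn.getInputStream(), StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                System.out.println(line); // parse with a JSON library of your choice
            }
        }
    }
}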
I'm updating a Selenium program I wrote a while back, and part of it has stopped working. I want to go through a whole series of links on a page, click on each, and make sure that some expected text is present. But sometimes a log-in page (https://library.med.nyu.edu/sso/ezproxy_form.php) appears before the desired page, in which case I need to log in before checking the page. The problem is that no matter what string I check for to detect the log-in page, Selenium concludes it's not present and skips logging in, which obviously causes everything else to fail. See below. I'm no longer sure that was actually the problem: it seems instead that the test rushes through the "if we need to sign in" code without actually signing in, then obviously fails the main part of the test because it's not on the right page.
Here's the code--does anyone see my mistake?
for (int i = 0; i < Resources.size(); i++) {
    try {
        selenium.open("/");
        selenium.click("link=" + Resources.get(i).link);
        selenium.waitForPageToLoad("100000");
        if (selenium.isTextPresent("Please sign in to access NYUHSL e-resources")) {
            selenium.type("sso_user", kid);
            selenium.type("sso_pass", password);
            selenium.click("name=SignIn");
            selenium.waitForPageToLoad("100000");
        }
        if (!selenium.isTextPresent(Resources.get(i).text)) {
            outfile.println(Resources.get(i).name + " failed");
        }
    } catch (Exception e) {
        outfile.println(Resources.get(i).name + " could not be found--link removed?");
    }
}
Does the login page have a page title? If so, try validating the page title using the selenium.getTitle() method to check whether you have landed on the login page. If not, proceed with clicking the link without logging in.
I think page-title validation can help resolve this issue.
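In Selenium RC terms, that check could look something like this (the title fragment is a guess; adjust it to whatever the login page actually uses):

// If we were redirected to the SSO login page, sign in first.
if (selenium.getTitle().contains("Login")) {   // assumed title fragment
    selenium.type("sso_user", kid);
    selenium.type("sso_pass", password);
    selenium.click("name=SignIn");
    selenium.waitForPageToLoad("100000");
}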
Try putting:
selenium.setSpeed("1000");
Right after the selenium.open. This will inject a one-second delay (1000 ms) between Selenium commands. You should make it standard practice to add this, especially if you're not running headless browsers.
Also, since you know the URL you expect to be on when the login page appears, you might consider using the Selenium command getLocation, which returns the absolute URL of the current page. That might be more effective than looking for elements that can change at any time within the page.
So to use getLocation in your code above:
if (selenium.getLocation().equals("your reference url")) {
    // do your login stuff here
}
Again this is just a sample to illustrate what I'm saying. Hope it helps you out.