I have the most basic Java code to do an HTTP request, and it works fine. I request data and a ton of HTML comes back. I want to retrieve all the URLs from that page and list them. For a simple first test I made it look like this:
int b = line.indexOf("http://",lastE);
int e = line.indexOf("\"", b);
This works, but as you can imagine it's horrible and only works in 80% of cases. The only alternative I could come up with myself sounded slow and clumsy. So my question is pretty much: how do I go from
String html
to
List<Url>
?
Pattern p = Pattern.compile("http://[^\"\\s<>]+"); // match until the next quote, whitespace or angle bracket
Matcher m = p.matcher(yourFetchedHtmlString);
while (m.find()) {
    String nextUrl = m.group(); // do whatever you want with it
}
You may also have to tweak the regexp, as I have just written it without testing. This should be a very fast way to fetch URLs.
I would try a library like HTML Parser to parse the HTML string and extract all the URLs from its anchor tags.
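For illustration, here is a minimal sketch of that approach using jsoup (a different HTML parsing library than the one named above, chosen only as an example; it assumes jsoup is on the classpath):

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

import java.util.ArrayList;
import java.util.List;

public class LinkExtractor {
    // Parse the fetched HTML and collect the href of every <a> tag
    static List<String> extractUrls(String html, String baseUrl) {
        Document doc = Jsoup.parse(html, baseUrl);
        List<String> urls = new ArrayList<>();
        for (Element link : doc.select("a[href]")) {
            urls.add(link.attr("abs:href")); // resolve relative links against baseUrl
        }
        return urls;
    }
}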
Your thinking is good, you're just missing some parts.
You should add some known extensions for URLs,
like .html .aspx .php .htm .cgi .js .pl .asp
And if you want images too, then add .gif .jpg .png
I think your approach is fine; it just needs more extension checking (see the sketch below).
If you can post the full method code, I will be happy to help you make it better.
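A rough sketch of that kind of extension check (the extension list here is only an example; adjust it to what you actually need):

import java.util.Arrays;
import java.util.List;
import java.util.Locale;

public class ExtensionFilter {
    // Extensions we treat as page or image links; extend the list as needed
    private static final List<String> EXTENSIONS = Arrays.asList(
            ".html", ".htm", ".aspx", ".asp", ".php", ".cgi", ".js", ".pl",
            ".gif", ".jpg", ".png");

    static boolean hasKnownExtension(String url) {
        // Strip any query string or fragment before checking the suffix
        String path = url.split("[?#]", 2)[0].toLowerCase(Locale.ROOT);
        return EXTENSIONS.stream().anyMatch(path::endsWith);
    }
}

Keep in mind that plenty of real-world URLs have no extension at all, so this is a filter, not a complete solution.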
I am trying to download http://1.bp.blogspot.com/_jNFMBZMTBvA/SjhABgh4-_I/AAAAAAAAA0U/Yvsaq_CreCs/s1600-h/hellboy+003.jpg using the code below, but the downloaded file is not a valid image. Can someone help me with this? I tried to encode the URL as well, but it did not help.
var url = URL(originalUrl)
FileUtils.copyURLToFile(url, srcPath.toFile())
This question deals with the same problem and even the same site. I'm not sure it's a duplicate since it's about C#, but the same approaches (extracting the URL of the actual image from the HTML page that is named *.jpg) should help in the same way. The HTML URL has s1600-h, while the image URL has just s1600, which seems to be a recurring pattern. Depending on how general your code has to be, you could use a simplified approach or just change the URL.
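As a rough illustration of the "just change the URL" route (this assumes the s1600-h to s1600 pattern holds for the URLs you care about; the output file name is made up):

import java.io.InputStream;
import java.net.URL;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardCopyOption;

public class BlogspotImageDownload {
    public static void main(String[] args) throws Exception {
        String pageUrl = "http://1.bp.blogspot.com/_jNFMBZMTBvA/SjhABgh4-_I/AAAAAAAAA0U/Yvsaq_CreCs/s1600-h/hellboy+003.jpg";
        // The s1600-h URL points at an HTML viewer page; swapping it for s1600
        // points at the raw image (an observed pattern, not a documented guarantee)
        String imageUrl = pageUrl.replace("/s1600-h/", "/s1600/");
        Path target = Paths.get("hellboy003.jpg");
        try (InputStream in = new URL(imageUrl).openStream()) {
            Files.copy(in, target, StandardCopyOption.REPLACE_EXISTING);
        }
    }
}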
I want a user to be able to submit a url, and then display that url to other users as a link.
If I naively redisplay what the user submitted, I leave myself open to URLs like
http://somesite.com' ><script>[any javacscript in here]</script>
that when I redisplay it to other users will do something nasty, or at least something that makes me look unprofessional for not preventing it.
Is there a library, preferably in Java, that will clean a URL so that it accepts all valid URLs but weeds out any exploits/tomfoolery?
Thanks!
URLs containing ' are perfectly valid. If you are outputting them to an HTML document without escaping, then the problem lies in your lack of HTML-escaping, not in the input checking. You need to ensure that you are calling an HTML encoding method every time you output any variable text (including URLs) into an HTML document.
Java does not have a built-in HTML encoder (poor show!) but most web libraries do (take your pick, or write it yourself with a few string replaces). If you use JSTL tags, you get escapeXml to do it for free by default:
<a href="<c:out value="${url}"/>">ok</a>
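If you are not using JSTL, a minimal hand-rolled escaper along the lines of the "few string replaces" mentioned above might look like this (a sketch, not a complete encoder):

public final class HtmlEscape {
    // Escape the characters that matter when writing text into HTML content
    // or into a double-quoted attribute value
    public static String escapeHtml(String s) {
        return s.replace("&", "&amp;")
                .replace("<", "&lt;")
                .replace(">", "&gt;")
                .replace("\"", "&quot;")
                .replace("'", "&#39;");
    }
}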
Whilst your main problem is HTML-escaping, it is still potentially beneficial to validate that an input URL is valid to catch mistakes - you can do that by parsing it with new URL(...) and seeing if you get a MalformedURLException.
You should also check that the URL begins with a known-good protocol such as http:// or https://. This will prevent anyone using dangerous URL protocols like javascript: which can lead to cross-site-scripting as easily as HTML-injection can.
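A sketch of both checks together (the scheme whitelist here is just an example):

import java.net.MalformedURLException;
import java.net.URL;

public class UrlCheck {
    // Accept only input that parses as a URL and uses a known-good protocol
    static boolean isAcceptable(String input) {
        try {
            String protocol = new URL(input).getProtocol();
            return "http".equalsIgnoreCase(protocol) || "https".equalsIgnoreCase(protocol);
        } catch (MalformedURLException e) {
            return false; // not a valid URL at all
        }
    }
}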
I think what you are looking for is output encoding. Have a look at OWASP ESAPI, which is a tried and tested way to perform encoding in Java.
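For example, a minimal sketch using ESAPI's encoder (this assumes ESAPI is on the classpath and configured; the method comes from ESAPI's Encoder interface):

import org.owasp.esapi.ESAPI;

public class EncodeForOutput {
    // Encode the user-supplied URL before writing it into an HTML attribute
    static String forHrefAttribute(String userUrl) {
        return ESAPI.encoder().encodeForHTMLAttribute(userUrl);
    }
}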
Also, just a suggestion: if you want to check whether a user is submitting a malicious URL, you can check it against Google's malware database. You can use the Safe Browsing API for that.
You can use Apache Commons Validator's UrlValidator:
String[] schemes = {"http", "https"}; // protocols you want to allow
UrlValidator urlValidator = new UrlValidator(schemes);
if (urlValidator.isValid("http://somesite.com")) {
    // valid
}
I am using the Selenium 2 Java API to interact with web pages. My question is: how can I detect the content type of link destinations?
Basically, this is the background: before clicking a link, I want to be sure that the response is an HTML file. If not, I need to handle it in another way. So, let's say there is a download link for a PDF file. The application should directly read the contents of that URL instead of opening it in the browser.
The goal is to have an application which automatically knows whether the current location is HTML, PDF, XML or whatever, so it can use appropriate parsers to extract useful information from the documents.
Update
Added bounty: I will award it to the best solution that allows me to get the content type of a given URL.
As Jochen suggests, the way to get the Content-Type without also downloading the content is an HTTP HEAD request, and the Selenium WebDrivers do not seem to offer functionality like that. You'll have to find another library to help you fetch the content type of a URL.
A Java library that can do this is Apache HttpComponents, especially HttpClient.
(The following code is untested)
HttpClient httpclient = new DefaultHttpClient();
HttpHead httphead = new HttpHead("http://foo/bar"); // HEAD fetches only the headers, not the body
HttpResponse response = httpclient.execute(httphead);
Header contenttypeheader = response.getFirstHeader("Content-Type"); // getFirstHeader returns a Header
System.out.println(contenttypeheader.getValue()); // e.g. "text/html; charset=UTF-8"
The project publishes JavaDoc for HttpClient; the documentation for the HttpClient interface contains a nice example.
You can figure out the content type while processing the data coming in.
Not sure why you need to figure this out first.
If you do, use the HEAD method and look at the Content-Type header.
You can retrieve all the URLs from the DOM, and then parse the last few characters of each URL (using a Java regex) to determine the link type.
You can parse the characters following the last dot. For example, in the URL http://yoursite.com/whatever/test.pdf, extract the pdf and apply your test logic accordingly.
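A sketch of that idea (note that many URLs have no extension at all, so treat this as a heuristic):

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class UrlExtension {
    // Capture the characters after the last dot, ignoring any query string or fragment
    private static final Pattern EXT = Pattern.compile("\\.([A-Za-z0-9]+)(?:[?#].*)?$");

    static String extensionOf(String url) {
        Matcher m = EXT.matcher(url);
        return m.find() ? m.group(1).toLowerCase() : "";
    }
}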
Am I oversimplifying your problem?
I have a crawler that downloads pages and tries to parse the HTML. One of the issues I've been facing is how to properly determine whether what I have downloaded is actually HTML, i.e. what its mimetype is.
Right now I'm using
is = new ByteArrayInputStream( htmlResult.getBytes( "UTF-8" ) );
mimeType = URLConnection.guessContentTypeFromStream(is);
but it misses sites like this: http://www.artdaily.org/index.asp?int_sec%3D11%26int_new%3D39415 because of the extra space between the doctype declaration and the <html> tag in the source.
Does anyone know a good way to determine if a string is HTML or not? Searching for <html> or some other tag wouldn't necessarily work, because of text being embedded in binary files I may come across.
Thanks
Do you have control over the HTTP connection that your crawler uses? Then how about checking the HTTP response header "Content-Type"? That's one way to determine the content type. I just did a quick test of the artdaily site to see if the content type header was sent, and there is one with the value text/html.
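A minimal sketch of reading that header with the standard library (assuming the crawler can open its own connection to the page):

import java.net.HttpURLConnection;
import java.net.URL;

public class ContentTypeCheck {
    static boolean isHtml(String url) throws Exception {
        HttpURLConnection conn = (HttpURLConnection) new URL(url).openConnection();
        conn.setRequestMethod("HEAD"); // headers only, no body
        String contentType = conn.getContentType(); // e.g. "text/html; charset=UTF-8"
        conn.disconnect();
        return contentType != null && contentType.startsWith("text/html");
    }
}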
I'm currently writing some MATLAB code to interact with my company's internal reports database. So far I can access the HTML abstract page using code which looks like this:
import com.mathworks.mde.desk.*;
wb=com.mathworks.mde.webbrowser.WebBrowser.createBrowser;
wb.setCurrentLocation(ReportURL(8:end));
pause(1);
s={};
while isempty(s)
    s=char(wb.getHtmlText);
    pause(.1);
end
desk=MLDesktop.getInstance;
desk.removeClient(wb);
I can extract various bits of information from the HTML text which ends up in the variable s; however, the PDF of the report is accessed via what I believe is a JavaScript command (onClick="gotoFulltext('','[Report Number]')").
Any ideas as to how I execute this JavaScript command and get the contents of the PDF file into a MATLAB variable?
(MATLAB sits on top of Java, so I believe a Java solution would work...)
I think you should take a look at the JavaScript that is being called and see what the final request to the webserver looks like.
You can do this quite easily in Firefox using the FireBug plugin.
https://addons.mozilla.org/en-US/firefox/addon/1843
Once you have found the real server request then you can just request this URL or post to this URL instead of trying to run the JavaScript.
Once you have gotten the correct URL (a la the answer from pjp), your next problem is to "get the contents of the PDF file into a MATLAB variable". Whether or not this is possible may depend on what you mean by "contents"...
If you want to get the raw data in the PDF file, I don't think there is a way currently to do this in MATLAB. The URLREAD function was the first thing I thought of to read content from a URL into a string, but it has this note in the documentation:
s = urlread('url') reads the content at a URL into the string s. If the server returns binary data, s will be unreadable.
Indeed, if you try to read a PDF as in the following example, s contains some text intermingled with mostly garbage:
s = urlread('http://samplepdf.com/sample.pdf');
If you want to get the text from the PDF file, you have some options. First, you can use URLWRITE to save the contents of the URL to a file:
urlwrite('http://samplepdf.com/sample.pdf','temp.pdf');
Then you should be able to use one of two submissions on The MathWorks File Exchange to extract the text from the PDF:
Extract text from a PDF document by Dimitri Shvorob
PDF Reader by Tom Gaudette
If you simply want to view the PDF, you can just open it in Adobe Acrobat with the OPEN function:
open('temp.pdf');
% Create a MATLAB web browser and execute the JavaScript from the link in it
wb=com.mathworks.mde.webbrowser.WebBrowser.createBrowser;
wb.executeScript('javascript:alert(''Some code from a link'')');
% Remove the browser from the desktop when done
desk=com.mathworks.mde.desk.MLDesktop.getInstance;
desk.removeClient(wb);