Hey all, I'm trying to write a web scraper that downloads all the songs from this website:
https://billwurtz.com/songs.html
Some of his older URLs contain spaces ("%20"), such as https://billwurtz.com/can%20i.mp3, which he has changed to '-' in his newer links.
Anyway, when I try to download these tracks I get a 400 error (Bad Request), which makes me think that Java is sending the request with a literal space and not %20.
URL website = new URL("https://www.billwurtz.com/" + song);
Path out = Path.of("songs/" + song);
System.out.println(website.toString());
try (InputStream in = website.openStream()) {
Files.copy(in, out, StandardCopyOption.REPLACE_EXISTING);
} catch (Exception e) {
// missed.add(song + ":" + e.getMessage());
throw new Exception(e.getMessage());
}
returns
https://www.billwurtz.com/can%20i.mp3
Exception in thread "main" java.lang.Exception: Server returned HTTP response code: 400 for URL: https://billwurtz.com/can i.mp3
I tried looking around, but only came across issues where the programmer was trying to create a URL that contains literal spaces; here it is clear that the URL is using %20.
Thank you all for your time :)
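Not a confirmed diagnosis, but one way to rule out the client side is to build the address from the raw file name and let java.net.URI do the escaping. A minimal sketch, assuming the names in the list are unencoded (for example "can i.mp3") and that the files sit directly under https://billwurtz.com/; the class and method names are made up for illustration:

```java
import java.net.URI;
import java.net.URISyntaxException;

final class SongUrlSketch {
    // Builds the download URI from a raw (unencoded) file name such as "can i.mp3".
    // The multi-argument URI constructor percent-encodes characters that are illegal in a path.
    static URI songUri(String rawFileName) {
        try {
            return new URI("https", "billwurtz.com", "/" + rawFileName, null);
        } catch (URISyntaxException e) {
            throw new IllegalArgumentException(e);
        }
    }

    public static void main(String[] args) {
        // Prints the escaped form that should go on the wire.
        System.out.println(songUri("can i.mp3").toASCIIString());
    }
}
```

Note that the code in the question requests www.billwurtz.com while the error message mentions billwurtz.com, so a redirect may be involved; that is a guess from the output shown, not something verified here. If the names are already percent-encoded, this constructor would escape the % again, so decode them first.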
I need to get host from this url
android-app://com.google.android.googlequicksearchbox?Pub_id={siteID}
java.net.URL and java.net.URI can't handle it.
The problem is the { and } characters, which are not valid in a URI. It looks like a placeholder that wasn't resolved when the URI was created.
You can use String.replaceAll() to get rid of these two characters:
String value = "android-app://com.google.android.googlequicksearchbox?Pub_id={siteID}";
URI uri = URI.create(value.replaceAll("[{}]", ""));
System.out.println(uri.getHost()); // com.google.android.googlequicksearchbox
You see, eventually I need the path, scheme and query as well.
I've just found a very fast library for parsing such URLs: https://github.com/anthonynsimon/jurl
It's also very flexible.
You can try the following code
String url = "android-app://com.google.android.googlequicksearchbox?Pub_id={siteID}";
url = url.replace("{", "").replace("}","");
URI u;
try {
u = new URI(url);
System.out.println(u.getHost());
} catch (URISyntaxException e) {
e.printStackTrace();
}
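If the placeholder has to survive (the asker also needs path, scheme and query), the braces can be percent-encoded rather than deleted. A minimal sketch using only java.net.URI; the class and method names are hypothetical:

```java
import java.net.URI;

final class BraceEncodingSketch {
    // Percent-encodes the two characters java.net.URI rejects here, instead of removing them.
    static URI parseLenient(String raw) {
        return URI.create(raw.replace("{", "%7B").replace("}", "%7D"));
    }

    public static void main(String[] args) {
        URI uri = parseLenient("android-app://com.google.android.googlequicksearchbox?Pub_id={siteID}");
        System.out.println(uri.getScheme());   // android-app
        System.out.println(uri.getHost());     // com.google.android.googlequicksearchbox
        System.out.println(uri.getPath());     // empty string, this URL has no path
        System.out.println(uri.getQuery());    // decoded form: Pub_id={siteID}
        System.out.println(uri.getRawQuery()); // encoded form: Pub_id=%7BsiteID%7D
    }
}
```

This only covers the two brace characters from the question; other illegal characters would need the same treatment.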
How to normalise a URL in Java to remove the fragment. I.e. from https://www.website.com#something to https://www.website.com
This is possible with the URL.Normalize code, although in this specific use case I've only got a full absolute URL which needs to remain intact.
I'd like to be able to modify this code slightly to remove the fragment from the URL;
//The website below is just an example. In reality, this URL is unknown and could be anything. Both with and without a fragment depending on the use case
URL absUrl = new URL("https://www.website.com#something");
My thinking so far is that this is only possible by breaking the URL down into protocol + domain + path and then joining it all back together. That does appear to work, but there must be a more elegant way of doing this.
Fragment removal is fairly simple using the conversion methods toURI and toURL. So to convert a URL to a URI:
URL url = /*what have you*/ …
URI u = url.toURI();
To remove any fragment from the URI:
if( u.getFragment() != null ) { // Remake with same parts, less the fragment:
u = new URI( u.getScheme(), u.getSchemeSpecificPart(), /*fragment*/null ); }
In reconstructing a URI from its parts like that, it’s important to use the decoded getters (as shown), not the corresponding raw ones. For authority on this usage, see e.g. the Identity section of the API.
To convert the result back to a URL:
url = u.toURL();
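The steps above can be wrapped into one helper. This is only a sketch: it works on java.net.URI directly (convert with url.toURI() before and toURL() after), and the class and method names are invented for the example:

```java
import java.net.URI;
import java.net.URISyntaxException;

final class FragmentStripper {
    // Returns the same URI without its fragment; URIs that have none are returned unchanged.
    static URI withoutFragment(URI u) {
        if (u.getFragment() == null) {
            return u;
        }
        try {
            // Same parts, minus the fragment, using the decoded getter as described above.
            return new URI(u.getScheme(), u.getSchemeSpecificPart(), null);
        } catch (URISyntaxException e) {
            throw new IllegalArgumentException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(withoutFragment(URI.create("https://www.website.com#something")));
        System.out.println(withoutFragment(URI.create("https://www.website.com/a/b?x=1#top")));
    }
}
```

One caveat worth testing for your own inputs: getSchemeSpecificPart() decodes escapes, and the constructor only re-escapes illegal characters, so an escaped delimiter such as %26 in a query would come back as a plain &.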
Fragments do not exist as a separate entity in Java URLs. But you can convert a URL into a URI and back to remove a fragment. I did it like this:
URL url;
...
if (url.toString().contains("#")) {
URI uri = null;
try {
uri = new URI(url.getProtocol(), url.getHost(), url.getPath(), null);
String file = "";
if (uri.getPath() != null) {
file += uri.getPath();
}
// The URI above was built without a query or port, so take both from the original URL.
if (url.getQuery() != null) {
file += "?" + url.getQuery();
}
url = new URL(uri.getScheme(), uri.getHost(), url.getPort(), file);
} catch (URISyntaxException e) {
...
} catch (MalformedURLException e) {
...
}
}
So I was attempting to use this String in a URL :-
http://site-test.com/Meetings/IC/DownloadDocument?meetingId=c21c905c-8359-4bd6-b864-844709e05754&itemId=a4b724d1-282e-4b36-9d16-d619a807ba67&file=\\s604132shvw140\Test-Documents\c21c905c-8359-4bd6-b864-844709e05754_attachments\7e89c3cb-ce53-4a04-a9ee-1a584e157987\myDoc.pdf
In this code: -
String fileToDownloadLocation = //The above string
URL fileToDownload = new URL(fileToDownloadLocation);
HttpGet httpget = new HttpGet(fileToDownload.toURI());
But at this point I get the error: -
java.net.URISyntaxException: Illegal character in query at index 169:Blahblahblah
I realised with a bit of googling this was due to the characters in the URL (guessing the &), so I then added in some code so it now looks like so: -
String fileToDownloadLocation = //The above string
fileToDownloadLocation = URLEncoder.encode(fileToDownloadLocation, "UTF-8");
URL fileToDownload = new URL(fileToDownloadLocation);
HttpGet httpget = new HttpGet(fileToDownload.toURI());
However, when I try and run this I get an error when I try and create the URL, the error then reads: -
java.net.MalformedURLException: no protocol: http%3A%2F%2Fsite-test.testsite.com%2FMeetings%2FIC%2FDownloadDocument%3FmeetingId%3Dc21c905c-8359-4bd6-b864-844709e05754%26itemId%3Da4b724d1-282e-4b36-9d16-d619a807ba67%26file%3D%5C%5Cs604132shvw140%5CTest-Documents%5Cc21c905c-8359-4bd6-b864-844709e05754_attachments%5C7e89c3cb-ce53-4a04-a9ee-1a584e157987%myDoc.pdf
It looks like I can't do the encoding until after I've created the URL, otherwise it replaces slashes and other characters it shouldn't, but I can't see how to create the URL from the string and then format it so it's suitable for use. I'm not particularly familiar with all this and was hoping someone could point out what I'm missing to turn string A into a suitably formatted URL with the correct characters replaced.
Any suggestions greatly appreciated!
You need to encode your parameters' values before concatenating them to the URL.
Backslash \ is a special character which has to be escaped as %5C.
Escaping example:
String paramValue = "param\\with\\backslash";
String yourURLStr = "http://host.com?param=" + java.net.URLEncoder.encode(paramValue, "UTF-8");
java.net.URL url = new java.net.URL(yourURLStr);
The result is http://host.com?param=param%5Cwith%5Cbackslash, which is a properly formatted URL string.
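Applied to the URL from the question, the same idea might look like the sketch below. It assumes Java 10 or later (for the URLEncoder.encode overload that takes a Charset); the class and method names are made up, and the file path in main is shortened for readability:

```java
import java.net.URI;
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

final class DownloadUrlSketch {
    // Encodes each parameter value on its own, then concatenates; the base URL is left untouched.
    static String buildDownloadUrl(String meetingId, String itemId, String file) {
        return "http://site-test.com/Meetings/IC/DownloadDocument"
                + "?meetingId=" + enc(meetingId)
                + "&itemId=" + enc(itemId)
                + "&file=" + enc(file);
    }

    private static String enc(String value) {
        return URLEncoder.encode(value, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String url = buildDownloadUrl(
                "c21c905c-8359-4bd6-b864-844709e05754",
                "a4b724d1-282e-4b36-9d16-d619a807ba67",
                "\\\\s604132shvw140\\Test-Documents\\myDoc.pdf");
        System.out.println(url);
        // URI.create would throw if an illegal character were still present.
        System.out.println(URI.create(url).getHost());
    }
}
```

The resulting string can then be passed to new URL(...) or new HttpGet(...) as in the question.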
I had the same problem. I read the URL from a properties file:
String configFile = System.getenv("system.Environment");
if (configFile == null || "".equalsIgnoreCase(configFile.trim())) {
configFile = "dev.properties";
}
// Load properties
Properties properties = new Properties();
properties.load(getClass().getResourceAsStream("/" + configFile));
//read url from file
apiUrl = properties.getProperty("url").trim();
URL url = new URL(apiUrl);
//throw exception here
URLConnection conn = url.openConnection();
dev.properties
url = "https://myDevServer.com/dev/api/gate"
It should be
dev.properties
url = https://myDevServer.com/dev/api/gate
without the quotation marks. With that change my problem was solved.
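To see why the quotes matter: java.util.Properties keeps them as part of the value. The sketch below loads the same line from a string and strips one pair of surrounding quotes defensively; the class and helper names are hypothetical and not part of the original code:

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;

final class QuotedPropertySketch {
    // Reads the "url" key from properties text (a stand-in for dev.properties).
    static String readUrl(String propertiesText) {
        Properties properties = new Properties();
        try {
            properties.load(new StringReader(propertiesText));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return properties.getProperty("url");
    }

    // Removes one pair of surrounding double quotes, if present.
    static String unquote(String value) {
        String v = value.trim();
        if (v.length() >= 2 && v.startsWith("\"") && v.endsWith("\"")) {
            return v.substring(1, v.length() - 1);
        }
        return v;
    }

    public static void main(String[] args) {
        String raw = readUrl("url = \"https://myDevServer.com/dev/api/gate\"\n");
        System.out.println(raw);          // still wrapped in quotation marks
        System.out.println(unquote(raw)); // usable as a URL string
    }
}
```

Fixing the properties file is the cleaner solution; the unquote step is only a safety net.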
According to the Oracle documentation, MalformedURLException is:
Thrown to indicate that a malformed URL has occurred. Either no legal protocol could be found in a specification string or the string could not be parsed.
So the exception means the specification string itself could not be parsed as a URL.
You want to use URI templates. Look carefully at the README of this project: URLEncoder.encode() does NOT work for URIs.
Let us take your original URL:
http://site-test.test.com/Meetings/IC/DownloadDocument?meetingId=c21c905c-8359-4bd6-b864-844709e05754&itemId=a4b724d1-282e-4b36-9d16-d619a807ba67&file=\s604132shvw140\Test-Documents\c21c905c-8359-4bd6-b864-844709e05754_attachments\7e89c3cb-ce53-4a04-a9ee-1a584e157987\myDoc.pdf
and convert it to a URI template with three variables (on multiple lines for clarity):
http://site-test.test.com/Meetings/IC/DownloadDocument
?meetingId={meetingID}&itemId={itemID}&file={file}
Now let us build a variable map with these three variables using the library mentioned in the link:
final VariableMap vars = VariableMap.newBuilder()
.addScalarValue("meetingID", "c21c905c-8359-4bd6-b864-844709e05754")
.addScalarValue("itemID", "a4b724d1-282e-4b36-9d16-d619a807ba67")
.addScalarValue("file", "\\\\s604132shvw140\\Test-Documents"
+ "\\c21c905c-8359-4bd6-b864-844709e05754_attachments"
+ "\\7e89c3cb-ce53-4a04-a9ee-1a584e157987\\myDoc.pdf")
.build();
final URITemplate template
= new URITemplate("http://site-test.test.com/Meetings/IC/DownloadDocument"
+ "?meetingId={meetingID}&itemId={itemID}&file={file}");
// Generate URL as a String
final String theURL = template.expand(vars);
This is GUARANTEED to return a fully functional URL!
Thanks to Erhun's answer I finally realised that my JSON mapper was returning the quotation marks around my data too! I needed to use "asText()" instead of "toString()"
It's not an uncommon issue - one's brain doesn't see anything wrong with the correct data, surrounded by quotes!
discoveryJson.path("some_endpoint").toString();
"https://what.the.com/heck"
discoveryJson.path("some_endpoint").asText();
https://what.the.com/heck
This code worked for me
public static void main(String[] args) {
try {
java.net.URL url = new java.net.URL("http://path");
System.out.println("Instantiated new URL: " + url);
}
catch (MalformedURLException e) {
e.printStackTrace();
}
}
Instantiated new URL: http://path
A very simple fix:
String encodedURL = UriUtils.encodePath(request.getUrl(), "UTF-8");
It works; no extra functionality is needed.
I wanted to know if there are any standard APIs in Java to validate a given URL.
I want to check both that the URL string is well-formed, i.e. the given protocol is valid, and that a connection can be established.
I tried using HttpURLConnection, providing the URL and connecting to it. The first part of my requirement seems to be fulfilled but when I try to perform HttpURLConnection.connect(), 'java.net.ConnectException: Connection refused' exception is thrown.
Can this be because of proxy settings? I tried setting the System properties for proxy but no success.
Let me know what I am doing wrong.
For the benefit of the community, since this thread is top on Google when searching for
"url validator java"
Catching exceptions is expensive, and should be avoided when possible. If you just want to verify your String is a valid URL, you can use the UrlValidator class from the Apache Commons Validator project.
For example:
String[] schemes = {"http","https"}; // DEFAULT schemes = "http", "https", "ftp"
UrlValidator urlValidator = new UrlValidator(schemes);
if (urlValidator.isValid("ftp://foo.bar.com/")) {
System.out.println("URL is valid");
} else {
System.out.println("URL is invalid");
}
The java.net.URL class is in fact not at all a good way of validating URLs. MalformedURLException is not thrown on all malformed URLs during construction. Catching IOException on java.net.URL#openConnection().connect() does not validate the URL either; it only tells whether or not the connection can be established.
Consider this piece of code:
try {
new URL("http://.com");
new URL("http://com.");
new URL("http:// ");
new URL("ftp://::::@example.com");
} catch (MalformedURLException malformedURLException) {
malformedURLException.printStackTrace();
}
..which does not throw any exceptions.
I recommend using a validation API implemented with a context-free grammar, or, for very simplified validation, just regular expressions. However, I still need someone to suggest a superior or standard API for this; I only recently started searching for one myself.
Note
It has been suggested that URL#toURI() in combination with handling of the exception java.net.URISyntaxException can facilitate validation of URLs. However, this method only catches one of the very simple cases above.
The conclusion is that there is no standard java URL parser to validate URLs.
You need to create both a URL object and a URLConnection object. The following code will test both the format of the URL and whether a connection can be established:
try {
URL url = new URL("http://www.yoursite.com/");
URLConnection conn = url.openConnection();
conn.connect();
} catch (MalformedURLException e) {
// the URL is not in a valid form
} catch (IOException e) {
// the connection couldn't be established
}
Using only standard API, pass the string to a URL object then convert it to a URI object. This will accurately determine the validity of the URL according to the RFC2396 standard.
Example:
public boolean isValidURL(String url) {
try {
new URL(url).toURI();
} catch (MalformedURLException | URISyntaxException e) {
return false;
}
return true;
}
Use android.webkit.URLUtil on Android:
URLUtil.isValidUrl(URL_STRING);
Note: It is just checking the initial scheme of URL, not that the entire URL is valid.
There is a way to perform URL validation in strict accordance to standards in Java without resorting to third-party libraries:
boolean isValidURL(String url) {
try {
new URI(url).parseServerAuthority();
return true;
} catch (URISyntaxException e) {
return false;
}
}
The constructor of URI checks that url is a valid URI, and the call to parseServerAuthority ensures that it is a URL (absolute or relative) and not a URN.
It's important to point out that the URL object handles both validation and connection. As a result, only protocols for which a handler is provided in sun.net.www.protocol (file, ftp, gopher, http, https, jar, mailto, netdoc) are valid.
For instance, try to make a new URL with the ldap protocol:
new URL("ldap://myhost:389")
You will get a java.net.MalformedURLException: unknown protocol: ldap.
You need to implement your own handler and register it through URL.setURLStreamHandlerFactory(). Quite overkill if you just want to validate the URL syntax, a regexp seems to be a simpler solution.
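If only the syntax matters, java.net.URI is another option besides a regexp, because it parses the string without looking up a protocol handler. This is a swapped-in alternative, not what the answer above proposes, and requiring both a scheme and a host is an assumption about what "valid" should mean here; the class and method names are invented:

```java
import java.net.URI;
import java.net.URISyntaxException;

final class SyntaxOnlyCheck {
    // True when the string parses as a URI and has both a scheme and a host.
    // No protocol handler is consulted, so schemes such as ldap are accepted.
    static boolean hasValidSyntax(String candidate) {
        try {
            URI uri = new URI(candidate);
            return uri.getScheme() != null && uri.getHost() != null;
        } catch (URISyntaxException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(hasValidSyntax("ldap://myhost:389")); // true
        System.out.println(hasValidSyntax("http:// "));          // false, a space is not allowed
        System.out.println(hasValidSyntax("myhost"));            // false, no scheme
    }
}
```

This says nothing about whether a connection can be established; it is a syntax check only.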
Are you sure you're using the correct proxy as system properties?
Also if you are using 1.5 or 1.6 you could pass a java.net.Proxy instance to the openConnection() method. This is more elegant imo:
//Proxy instance, proxy ip = 10.0.0.1 with port 8080
Proxy proxy = new Proxy(Proxy.Type.HTTP, new InetSocketAddress("10.0.0.1", 8080));
conn = new URL(urlString).openConnection(proxy);
I think the best response is the one from user @b1nary.atr0phy. However, I recommend combining the method from b1nary.atr0phy's response with a regex to cover all the possible cases.
public static final URL validateURL(String url, Logger logger) {
URL u = null;
try {
Pattern regex = Pattern.compile("(?i)^(?:(?:https?|ftp)://)(?:\\S+(?::\\S*)?@)?(?:(?!(?:10|127)(?:\\.\\d{1,3}){3})(?!(?:169\\.254|192\\.168)(?:\\.\\d{1,3}){2})(?!172\\.(?:1[6-9]|2\\d|3[0-1])(?:\\.\\d{1,3}){2})(?:[1-9]\\d?|1\\d\\d|2[01]\\d|22[0-3])(?:\\.(?:1?\\d{1,2}|2[0-4]\\d|25[0-5])){2}(?:\\.(?:[1-9]\\d?|1\\d\\d|2[0-4]\\d|25[0-4]))|(?:(?:[a-z\\u00a1-\\uffff0-9]-*)*[a-z\\u00a1-\\uffff0-9]+)(?:\\.(?:[a-z\\u00a1-\\uffff0-9]-*)*[a-z\\u00a1-\\uffff0-9]+)*(?:\\.(?:[a-z\\u00a1-\\uffff]{2,}))\\.?)(?::\\d{2,5})?(?:[/?#]\\S*)?$");
Matcher matcher = regex.matcher(url);
if(!matcher.find()) {
throw new URISyntaxException(url, "The URL is not well-formed.");
}
u = new URL(url);
u.toURI();
} catch (MalformedURLException e) {
logger.error("The URL is not well-formed.");
} catch (URISyntaxException e) {
logger.error("The URL is not well-formed.");
}
return u;
}
This is what I use to validate CDN urls (must start with https, but that's easy to customise). This will also not allow using IP addresses.
public static final boolean validateURL(String url) {
// Literal "https://" prefix, optional "www.", host characters, a dot, a 2-6 letter TLD, then the rest.
var regex = Pattern.compile("^https://(www\\.)?[-a-zA-Z0-9@:%._+~#=]{2,256}\\.[a-z]{2,6}\\b([-a-zA-Z0-9@:%_+.~#?&/=]*)");
var matcher = regex.matcher(url);
return matcher.find();
}
Thanks. Opening the URL connection by passing the Proxy as suggested by NickDK works fine.
//Proxy instance, proxy ip = 10.0.0.1 with port 8080
Proxy proxy = new Proxy(Proxy.Type.HTTP, new InetSocketAddress("10.0.0.1", 8080));
conn = new URL(urlString).openConnection(proxy);
System properties, however, do not work, as I had mentioned earlier.
Thanks again.
Regards,
Keya