I'm building a web crawler. Having read this, I understand that DNS resolution is slow, so we should separate out the DNS resolver.
So say that you have:
String urlString = "http://google.com";
You can then convert that into an IP by doing:
URL url = new URL(urlString);
InetAddress ip = InetAddress.getByName(url.getHost());
But then how do you download the actual website itself?
With the URL, we could just do something like this:
String htmlDocumentString = new Scanner(url.openStream(), "UTF-8").useDelimiter("\\A").next();
But if we want to use the resolved IP, do we have to manually reconstruct the URL with the IP in it? There is no url.setHost() method; it just seems kind of messy.
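One way to avoid rebuilding the URL string by hand: the four-argument URL constructor can assemble a URL around the resolved address, with the original hostname sent in the Host header. The sketch below is hedged: the JDK treats Host as a restricted header (it is silently dropped unless sun.net.http.allowRestrictedHeaders is set), and this approach does not work for HTTPS, where certificate and SNI checks need the real hostname.

import java.net.HttpURLConnection;
import java.net.InetAddress;
import java.net.URL;

public class ResolvedFetch {
    public static void main(String[] args) throws Exception {
        // Host is a restricted header; without this property the JDK drops it
        System.setProperty("sun.net.http.allowRestrictedHeaders", "true");

        URL original = new URL("http://google.com/");
        InetAddress ip = InetAddress.getByName(original.getHost());

        // Rebuild the URL around the resolved address instead of mutating it
        URL byIp = new URL(original.getProtocol(), ip.getHostAddress(),
                           original.getPort(), original.getFile());

        HttpURLConnection conn = (HttpURLConnection) byIp.openConnection();
        conn.setRequestProperty("Host", original.getHost()); // keeps virtual hosting working
        System.out.println(conn.getResponseCode());
    }
}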
Reading from a URL is simple:
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;

public class URLReader {
    public static void main(String[] args) throws Exception {
        URL oracle = new URL("http://www.oracle.com/");
        BufferedReader in = new BufferedReader(
                new InputStreamReader(oracle.openStream()));
        String inputLine;
        while ((inputLine = in.readLine()) != null)
            System.out.println(inputLine);
        in.close();
    }
}
Taken from: http://docs.oracle.com/javase/tutorial/networking/urls/readingURL.html
Try this instead:
URL oracle = new URL("http://www.oracle.com/");
URLConnection urlc = oracle.openConnection();
urlc.setDoInput(true);
urlc.setRequestProperty("Accept", "text/html");
InputStream inputStream = urlc.getInputStream();
String myString = IOUtils.toString(inputStream, "UTF-8");
The snippet above uses IOUtils from Apache Commons:
http://commons.apache.org/io/api-1.4/org/apache/commons/io/IOUtils.html#toString(java.io.InputStream,%20java.lang.String)
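If you'd rather not pull in the Commons IO dependency, here is a roughly equivalent JDK-only sketch (Java 8+; UTF-8 assumed, as in the Scanner example above), reusing the urlc connection:

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.stream.Collectors;

try (BufferedReader reader = new BufferedReader(
        new InputStreamReader(urlc.getInputStream(), StandardCharsets.UTF_8))) {
    // Collect every line of the response into one String
    // (note this normalizes line endings to \n)
    String myString = reader.lines().collect(Collectors.joining("\n"));
    System.out.println(myString);
}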
I have been using Java code to retrieve URL content. The code does not work for https://www.amazon.es/, while a similar Python script does manage to retrieve the same Amazon page's content.
The Java code:
URL url = new URL(urlToScan);
BufferedReader reader = new BufferedReader(new InputStreamReader(url.openStream(), StandardCharsets.UTF_8));
StringBuilder builder = new StringBuilder();
for (String temp = reader.readLine(); temp != null; temp = reader.readLine())
builder.append(temp);
webpage = builder.toString();
The Python code:
from urllib.request import urlopen
url = "https://www.amazon.es/"
page = urlopen(url)
html_bytes = page.read()
html = html_bytes.decode("utf-8")
print(html)
I searched Amazon's HTML myself looking for the charset used (in case it was a charset issue), and they are using charset="utf-8". As the HTML is 22,000+ lines long, I thought it could be some parsing error for long Strings. I also tried with a ByteArrayOutputStream and then instantiating via the String(byte[], charset) constructor.
Java output:
?
Why is java.net.URL not retrieving the URL content properly?
Maybe it's because of the User-Agent header. To set a User-Agent, use URLConnection:
URL url = new URL("https://www.amazon.es/");
URLConnection connection = url.openConnection();
connection.setRequestProperty("User-Agent", "Mozilla/5.0 (Windows NT 10.0; Win64; x64)");
BufferedInputStream bufferedInputStream = new BufferedInputStream(connection.getInputStream());
BufferedReader bufferedReader = new BufferedReader(new InputStreamReader(bufferedInputStream, StandardCharsets.UTF_8));
StringBuilder buffer = new StringBuilder();
String inputLine;
while ((inputLine = bufferedReader.readLine()) != null) {
buffer.append(inputLine).append("\n");
}
bufferedReader.close();
System.out.println(buffer.toString());
Python's urllib, by contrast, sends its own default User-Agent (Python-urllib/x.y), which the server apparently accepts.
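A quick way to confirm this theory is to look at the status code you get back without the header. A sketch (the exact code Amazon returns may vary; 503 is typical when it rejects a client):

import java.net.HttpURLConnection;
import java.net.URL;

public class CheckStatus {
    public static void main(String[] args) throws Exception {
        HttpURLConnection conn = (HttpURLConnection)
                new URL("https://www.amazon.es/").openConnection();
        // No User-Agent set: Java sends "Java/<version>", which many sites block
        System.out.println(conn.getResponseCode()); // likely not 200
    }
}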
I'm trying to send a Telegram message from an Android app. I want the message to contain a hyperlink, so I used the parse_mode=html parameter, but I have a problem with the anchor tag. It seems that Java is treating my URL as a local path.
This is the code:
String location = "http://www.google.com";
urlString = String.format("https://api.telegram.org/bot<bot_token>/sendMessage?chat_id=<chat_id>&parse_mode=html&text=<a href=%s>Location</a>", location);
URL url = new URL(urlString);
URLConnection conn = url.openConnection();
StringBuilder sb = new StringBuilder();
InputStream is = new BufferedInputStream(conn.getInputStream());
BufferedReader br = new BufferedReader(new InputStreamReader(is));
String inputLine = "";
while ((inputLine = br.readLine()) != null) {
sb.append(inputLine);
}
And this is the error:
java.io.FileNotFoundException:
https://api.telegram.org/bot<bot_token>/sendMessage?chat_id=<chat_id>&parse_mode=html&text=<a href=http://google.com>Location</a>
How should I write this message so the href link will be treated as an external URL?
The error java.io.FileNotFoundException doesn't mean the URL is being treated as a local path.
It corresponds to HTTP 404 Not Found, and it is the server's response to your HTTP request.
First, you need to supply a real <bot_token> and <chat_id>. Second, you should URL-encode the text parameter value before building the URL object. Encoding the entire URL string would also escape the scheme and path separators and produce an invalid URL, so encode only the value:
String text = String.format("<a href=\"%s\">Location</a>", location);
String urlString = "https://api.telegram.org/bot<bot_token>/sendMessage?chat_id=<chat_id>&parse_mode=html&text="
        + URLEncoder.encode(text, "UTF-8");
URL url = new URL(urlString);
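If you want to see what the server actually said instead of just the exception, read the error stream. A sketch, assuming an HttpURLConnection (Telegram puts a JSON error description in the body):

import java.io.BufferedReader;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;

HttpURLConnection conn = (HttpURLConnection) url.openConnection();
int status = conn.getResponseCode();
// For 4xx/5xx responses the body is on the error stream, not the input stream
InputStream body = status >= 400 ? conn.getErrorStream() : conn.getInputStream();
try (BufferedReader r = new BufferedReader(new InputStreamReader(body, "UTF-8"))) {
    r.lines().forEach(System.out::println); // e.g. {"ok":false,"error_code":404,...}
}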
I'm trying to read the HTML from a particular URL and store it in a String for parsing. I referred to a previous post to help me out. When I print out what was read, all I get are special characters.
Here is my Java code (with try/catches left out) that reads from a URL and prints:
String path = "https://html1-f.scribdassets.com/913q5pjrsw60h9i4/pages/106-6b1bd15200.jsonp";
URL url = new URL(path);
InputStream in = url.openStream();
BufferedReader bw = new BufferedReader(new InputStreamReader(in, "UTF-8"));
String line;
while ((line = bw.readLine()) != null) {
System.out.println(line);
}
Program output:
�ĘY106-6b1bd15200.jsonpmP�r� �Ƨ�!�%m�vD"��Ra*��w�%����ݳ�sβ��MK�d�9+%�m��l^��މ����:���� ���8B�Vce�.A*��x$FCo���a�b�<����Xy��m�c�>t����� �Z������Gx�o� �J���oKe�0�5�kGYpb�*l����+|�U���-�N3��jBp�R�z5Cۥjh��o�;�~)����~��)~ɮhy��<c,=;tHW���'�c�=~�w���
Expected output:
window.page106_callback(["<div class=\"newpage\" id=\"page106\" style=\"width: 902px; height:1273px\">\n<div class=image_layer style=\"z-index: 1\">\n<div class=ie_fix>\n<img class=\"absimg\" style=\"left:18px;top:27px;width:860px;height:1077px;clip:rect(1px 859px 1076px 1px)\" orig=\"http://html.scribd.com/913q5pjrsw60h9i4/images/106-6b1bd15200.jpg\"/>\n</div>\n</div>\n</div>\n\n"]);
At first, I thought it was an issue with permissions or something that somehow encrypted the stream, but my friend wrote a small Python script to do the same thing and it worked, thereby ruling this out. This is what he wrote:
import requests
link = 'https://html1-f.scribdassets.com/913q5pjrsw60h9i4/pages/106-6b1bd15200.jsonp'
f = requests.get(link)
text = f.text
print(text)
So the question is, why is the Java version unable to correctly read and print from this particular URL? Note that I tried testing some other URLs from various websites and those worked fine. Maybe I should learn Python.
The response is gzip-encoded. You can do:
InputStream in = new GZIPInputStream(con.getInputStream());
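More generally, you can check the Content-Encoding header rather than assuming gzip. A small sketch:

import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.util.zip.GZIPInputStream;

HttpURLConnection con = (HttpURLConnection)
        new URL("https://html1-f.scribdassets.com/913q5pjrsw60h9i4/pages/106-6b1bd15200.jsonp")
        .openConnection();
InputStream raw = con.getInputStream();
// Only wrap in GZIPInputStream when the server actually gzipped the body
InputStream in = "gzip".equalsIgnoreCase(con.getContentEncoding())
        ? new GZIPInputStream(raw) : raw;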
@Maurice Perry is right. I tried with the code below:
String url = "https://html1-f.scribdassets.com/913q5pjrsw60h9i4/pages/106-6b1bd15200.jsonp";
URL obj = new URL(url);
HttpURLConnection con = (HttpURLConnection) obj.openConnection();
BufferedReader in = new BufferedReader(
        new InputStreamReader(new GZIPInputStream(con.getInputStream()), StandardCharsets.UTF_8));
String inputLine;
StringBuilder response = new StringBuilder();
while ((inputLine = in.readLine()) != null) {
response.append(inputLine);
}
in.close();
System.out.println(response.toString());
I am trying to read in a website and save it to a String. I'm using the code below, which works perfectly fine in Eclipse. But when I try to run the program from the Windows command line with "java MyProgram", the program starts, just hangs, and never reads in the URL. Anyone know why this would be happening?
URL link = new URL("http://www.yahoo.com");
BufferedReader in = new BufferedReader(new InputStreamReader(link.openStream()));
//InputStream in = link.openStream();
String inputLine = "";
int count = 0;
while ((inputLine = in.readLine()) != null)
{
site = site + "\n" + inputLine;
}
in.close();
...
It could be because you are behind a proxy and Eclipse is automatically adding the settings to configure this.
If you are behind a proxy, try setting the java.net.useSystemProxies property when running from the command prompt. You can also configure proxy settings manually with a few networking properties (http.proxyHost, http.proxyPort).
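For example, a sketch of both options (the proxy host and port below are placeholders, not real values):

// Option 1: let Java pick up the operating system's proxy settings.
// Must be set before the first network call.
System.setProperty("java.net.useSystemProxies", "true");

// Option 2: point at an explicit HTTP proxy (placeholder host/port)
System.setProperty("http.proxyHost", "proxy.example.com");
System.setProperty("http.proxyPort", "8080");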
I encountered the same problem and found a solution.
Here is my working code:
// Create a URL for the desired page
URL url = new URL("your url");

// Open a connection and set timeouts so a dead network fails fast
HttpURLConnection connection = (HttpURLConnection) url.openConnection();
connection.setConnectTimeout(5000); // 5-second connect timeout
connection.setReadTimeout(5000);    // 5-second read (socket) timeout

// Read from this connection's own stream; calling url.openStream() here
// would open a second connection without the timeouts configured above,
// which is how readLine() can end up hanging forever
InputStreamReader isr = new InputStreamReader(connection.getInputStream(), "UTF-8");
BufferedReader in = new BufferedReader(isr);
String str;
while ((str = in.readLine()) != null) {
    listItems.add(str); // listItems is a collection defined elsewhere
}

// Close everything
in.close();
connection.disconnect();
If that's all your code is doing, there's no reason it shouldn't work from the command line. I suspect you've cut out what's broken. For example:
public static void main(String[] args) throws Exception {
String site = "";
URL link = new URL("http://www.yahoo.com");
BufferedReader in = new BufferedReader(new InputStreamReader(link.openStream()));
//InputStream in = link.openStream();
String inputLine = "";
int count = 0;
while ((inputLine = in.readLine()) != null) {
site = site + "\n" + inputLine;
}
in.close();
System.out.println(site);
}
works fine. Another possibility would be if you're running it in Eclipse and from the command line on two different computers, and the latter can't reach http://www.yahoo.com.
I want to download the HTML source code of a site to parse some info. How do I accomplish this in Java?
Just attach a BufferedReader (or anything that reads strings) to the InputStream returned by the URL's openStream():
public static void main(String[] args)
throws IOException
{
URL url = new URL("http://stackoverflow.com/");
BufferedReader reader = new BufferedReader(new InputStreamReader(url.openStream()));
String s = null;
while ((s = reader.readLine()) != null)
System.out.println(s);
}
You can use the Java classes directly:
URL url = new URL("http://www.example.com");
URLConnection conn = url.openConnection();
InputStream in = conn.getInputStream();
...
but it's generally recommended to use Apache HttpClient instead, as HttpClient handles a lot of things that you'd otherwise have to do yourself with the native Java classes.
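For comparison, a minimal sketch with Apache HttpClient 4.x (the httpclient artifact); it handles things like redirects, connection reuse, and response entities for you:

import java.nio.charset.StandardCharsets;
import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.util.EntityUtils;

public class HttpClientExample {
    public static void main(String[] args) throws Exception {
        try (CloseableHttpClient client = HttpClients.createDefault();
             CloseableHttpResponse response =
                     client.execute(new HttpGet("http://www.example.com"))) {
            // EntityUtils drains and converts the response body in one call
            String html = EntityUtils.toString(response.getEntity(), StandardCharsets.UTF_8);
            System.out.println(html);
        }
    }
}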