Scraping a site - Java

I am trying to write an alert system that periodically scrapes a complaints board site to look for any complaints about my product. I am using Jsoup for this. Below is the code fragment that gives me an error.
doc = Jsoup.connect(finalUrl).timeout(10 * 1000).get();
This gives me the following error:
java.net.SocketException: Unexpected end of file from server
When I copy-paste the same finalUrl string into the browser, it works. I then tried a plain URLConnection:
BufferedReader br = null;
try {
    URL a = new URL(finalUrl);
    URLConnection conn = a.openConnection();
    // open the stream and put it into BufferedReader
    br = new BufferedReader(new InputStreamReader(
            conn.getInputStream()));
    doc = Jsoup.parse(br.toString());
} catch (IOException e) {
    e.printStackTrace();
}
But as it turned out, the connection itself fails (br stays null). Now the question is: why does the same string, when pasted into a browser, open the site without any error?
Full stacktrace is as below:
java.net.SocketException: Unexpected end of file from server
at sun.net.www.http.HttpClient.parseHTTPHeader(HttpClient.java:774)
at sun.net.www.http.HttpClient.parseHTTP(HttpClient.java:633)
at sun.net.www.http.HttpClient.parseHTTPHeader(HttpClient.java:771)
at sun.net.www.http.HttpClient.parseHTTP(HttpClient.java:633)
at sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1195)
at ComplaintsBoardScraper.main(ComplaintsBoardScraper.java:46)

That one was tricky! :-)
The server blocks all requests which don't have a proper user agent. And that’s why you succeeded with your browser but failed with Java.
Fortunately, changing the user agent is easy in Jsoup:
final String url = "http://www.complaintsboard.com/?search=justanswer.com&complaints=Complaints";
final String userAgent = "Mozilla/5.0 (X11; U; Linux i586; en-US; rv:1.7.3) Gecko/20040924 Epiphany/1.4.4 (Ubuntu)";

Document doc = Jsoup.connect(url) // you get a 'Connection' object here
        .userAgent(userAgent)     // set the user agent
        .timeout(10 * 1000)       // set timeout
        .get();                   // execute GET request
I've taken the first user agent string I found; I guess any valid one will work too.
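If you are not using Jsoup, the same fix applies to a plain HttpURLConnection. The sketch below (my addition, not from the answer above) sets the User-Agent header before opening the stream; without it, some servers drop the connection, which surfaces as exactly the "Unexpected end of file from server" exception quoted above:

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;

public class UserAgentFetch {
    public static String fetch(String url) throws Exception {
        HttpURLConnection conn = (HttpURLConnection) new URL(url).openConnection();
        // Without this header some servers reset the connection,
        // producing "Unexpected end of file from server".
        conn.setRequestProperty("User-Agent", "Mozilla/5.0");
        conn.setConnectTimeout(10_000);
        conn.setReadTimeout(10_000);
        StringBuilder sb = new StringBuilder();
        try (BufferedReader br = new BufferedReader(
                new InputStreamReader(conn.getInputStream(), "UTF-8"))) {
            String line;
            while ((line = br.readLine()) != null) {
                sb.append(line).append('\n');
            }
        }
        return sb.toString();
    }
}
```

Note that reading the whole stream this way also avoids the `br.toString()` mistake in the question's second snippet, which returns the reader object's identity string rather than the page content.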

Related

Java losing part of input stream buffer

I'm currently working on a project to parse weather data from an XML input stream. However, I'm experiencing a strange bug that only shows up on Linux.
When running the server software that receives the input stream on Windows, everything works like a charm. However, when running the server on Linux, the following bug presents itself: after receiving a couple of messages correctly, the third or so message is corrupted because half of the input buffer is "lost".
This issue only occurs in the following situations:
Platform   Origin of XML stream   Error occurs
Windows    localhost              No
Windows    remote                 No
Linux      localhost              No
Linux      remote                 Yes
I'm using the following code to receive the XML stream.
private Document getXML() throws JDOMException
{
    SAXBuilder builder = new SAXBuilder();
    try {
        // Get the input stream.
        BufferedReader in = new BufferedReader(new InputStreamReader(sock.getInputStream()));
        String xmlstream = "";
        String line;
        while (!(line = in.readLine()).contains("</MEASUREMENT>")) {
            xmlstream += line;
        }
        xmlstream += "</MEASUREMENT></WEATHERDATA>";
        System.out.println("XML DATA:" + xmlstream);
        Document xmlDocument = builder.build(new StringReader(xmlstream));
        return xmlDocument;
    } catch (IOException | NullPointerException e) {
        // Socket closes.
        System.out.println("Client disconnected!");
        return null;
    }
}
Here is an example of the received data:
XML DATA:<?xml version="1.0"?><WEATHERDATA> <MEASUREMENT> <STN>726030</STN> <DATE>2018-01-27</DATE> <TIME>13:42:01</TIME> <TEMP>-2.8</TEMP> <DEWP>-2.7</DEWP> <STP>1014.1</STP> <SLP>1019.3</SLP> <VISIB>8.2</VISIB> <WDSP>29.0</WDSP> <PRCP>0.02</PRCP> <SNDP>0.0</SNDP><FRSHTT>110000</FRSHTT> <CLDC>77.1</CLDC> <WNDDIR>191</WNDDIR></MEASUREMENT></WEATHERDATA>
XML DATA:<?xml version="1.0"?><WEATHERDATA> <MEASUREMENT> <STN>726030</STN> <DATE>2018-01-27</DATE> <TIME>13:42:02</TIME> <TEMP>-0.7</TEMP> <DEWP>-3.5</DEWP> <STP>1014.2</STP> <SLP>1019.2</SLP> <VISIB>8.3</VISIB> <WDSP>28.9</WDSP> <PRCP>0.02</PRCP> <SNDP>0.0</SNDP><FRSHTT>110000</FRSHTT> <CLDC>77.2</CLDC> <WNDDIR>191</WNDDIR></MEASUREMENT></WEATHERDATA>
XML DATA:/DATE> <TIME>13:42:02</TIME> <TEMP>-9.4</TEMP> <DEWP>-13.5</DEWP> <STP>1005.1</STP> <SLP>1013.2</SLP> <VISIB>22.8</VISIB> <WDSP>12.8</WDSP> <PRCP>0.25</PRCP> <SNDP>8.3</SNDP> <FRSHTT>111000</FRSHTT> <CLDC>50.0</CLDC> <WNDDIR>311</WNDDIR></MEASUREMENT></WEATHERDATA>
As you can see, the first two messages are received in their entirety; however, the third message starts somewhere in the middle of the stream.
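No answer is quoted for this question, but one likely culprit (an assumption on my part, consistent with the symptoms) is that getXML() wraps sock.getInputStream() in a fresh BufferedReader on every call. A BufferedReader reads ahead: readLine() may pull bytes belonging to the next message into its internal buffer, and those bytes are lost when the reader is discarded. On localhost the data tends to arrive one line per read, so nothing extra gets buffered; over a remote link TCP coalesces segments and the bug appears. A minimal sketch of the fix is to create the reader once per connection and reuse it:

```java
import java.io.BufferedReader;
import java.io.IOException;

// Sketch: keep ONE reader for the socket's lifetime instead of wrapping
// sock.getInputStream() in a new BufferedReader on every call. Bytes
// sitting in a discarded reader's read-ahead buffer are lost, which
// matches "half of the input buffer" vanishing.
class WeatherReceiver {
    private final BufferedReader in; // created once per connection

    WeatherReceiver(BufferedReader in) {
        this.in = in;
    }

    // Reads lines until one complete document has been received.
    String readMeasurement() throws IOException {
        StringBuilder xml = new StringBuilder();
        String line;
        while ((line = in.readLine()) != null) {
            xml.append(line);
            if (line.contains("</WEATHERDATA>")) {
                break; // one complete document received
            }
        }
        return xml.length() == 0 ? null : xml.toString();
    }
}
```

The document-delimiter handling here is simplified relative to the question's code; the essential change is that the BufferedReader outlives a single call.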

Reading and printing HTML from website hangs up

I've been working on some Java code in which a string is converted into a URL and then used to download and output the corresponding page. Unfortunately, when I run the program, it just hangs. Does anyone have any suggestions?
Note: I've used import java.io.* and import java.net.*
public static boolean htmlOutput(String testURL) throws Exception {
    URL myPage2 = new URL(testURL); // converting String to URL
    System.out.println(myPage2);
    BufferedReader webInput2 = new BufferedReader(
            new InputStreamReader(myPage2.openStream()));
    String individualLine = null;
    String completeInput = null;
    while ((individualLine = webInput2.readLine()) != null) {
        System.out.println(individualLine);
        completeInput = completeInput + individualLine;
    } // end while
    webInput2.close();
    return true;
} // end htmlOutput()
[Though this answer helped the OP, it is wrong. HttpURLConnection does follow redirects, so this could not be the OP's problem. I will remove it as soon as the OP removes the accepted mark.]
My guess is that you don't get anything back in the response stream because the page you are trying to connect sends you a redirect response (i.e. 302).
Try to verify that by reading the response code and iterating over the response headers. There should be a header named Location with a new URL that you need to follow:
HttpURLConnection connection = (HttpURLConnection) url.openConnection();
int code = connection.getResponseCode();
Map<String, List<String>> map = connection.getHeaderFields();
// iterate over the map and find the new url
If you are having trouble getting the above snippet to work, take a look at a working example.
You could do yourself a favor and use a third-party HTTP client like Apache HttpClient, which can handle redirects for you; otherwise you have to do this manually.
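As the bracketed note says, HttpURLConnection follows same-protocol redirects by default; one case it does not handle is a redirect that changes protocol (e.g. http to https). A manual redirect loop, sketched here as an illustration rather than a drop-in fix, covers that case:

```java
import java.net.HttpURLConnection;
import java.net.URL;

public class RedirectFollower {
    // Follow redirects manually, including protocol changes (http -> https),
    // which HttpURLConnection will not follow on its own.
    public static HttpURLConnection open(String url, int maxHops) throws Exception {
        for (int hop = 0; hop < maxHops; hop++) {
            HttpURLConnection conn = (HttpURLConnection) new URL(url).openConnection();
            conn.setInstanceFollowRedirects(false); // we handle Location ourselves
            int code = conn.getResponseCode();
            boolean redirect = code == 301 || code == 302 || code == 303
                    || code == 307 || code == 308;
            if (!redirect) {
                return conn;
            }
            String location = conn.getHeaderField("Location");
            if (location == null) {
                return conn;
            }
            // Resolve a possibly relative Location against the current URL.
            url = new URL(new URL(url), location).toString();
        }
        throw new IllegalStateException("Too many redirects");
    }
}
```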

Java - How to load the full source of an HTML website

I am trying to load the FULL source code of an HTML website into a String in Java. I have tried several approaches; however, I only ever get almost all of the source code. To make it worse: one of the main parts I do not get is the part I need the most!
URL url = new URL("http://www.website.com");
URLConnection spoof = url.openConnection();
// Spoof the connection so we look like a web browser
spoof.setRequestProperty("User-Agent", "Mozilla/4.0 (compatible; MSIE 5.5; Windows NT 5.0; H010818)");
BufferedReader in = new BufferedReader(new InputStreamReader(spoof.getInputStream()));
String strLine = "";
String finalHTML = "";
// Loop through every line in the source
while ((strLine = in.readLine()) != null) {
    finalHTML += strLine;
}
It might be because the content you are looking for is actually loaded dynamically, through AJAX/JavaScript.
For example, a website might contain an empty DIV tag that is only populated after the page loads, through an AJAX call to another location.
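When that is the case, one option is to find the request the JavaScript makes (in the browser's developer tools, Network tab) and fetch that endpoint directly. The sketch below is illustrative only; the `/data.json` path and headers are assumptions you would replace with whatever the page actually requests:

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;
import java.net.URLConnection;

public class AjaxFetch {
    public static String fetch(String endpoint) throws Exception {
        // The endpoint (e.g. "http://www.website.com/data.json") is
        // hypothetical; find the real XHR URL in the browser's developer
        // tools while the page loads.
        URLConnection conn = new URL(endpoint).openConnection();
        conn.setRequestProperty("User-Agent", "Mozilla/5.0");
        // Some AJAX endpoints only answer requests marked as XHR:
        conn.setRequestProperty("X-Requested-With", "XMLHttpRequest");
        StringBuilder body = new StringBuilder();
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(conn.getInputStream(), "UTF-8"))) {
            String line;
            while ((line = in.readLine()) != null) {
                body.append(line).append('\n');
            }
        }
        return body.toString();
    }
}
```

If the content is assembled by heavier client-side scripting, fetching a single endpoint may not be enough, and a browser-driving tool would be needed instead.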

App engine Url request utf-8 characters becoming '??' or '???'

I have an error where I am loading data from a web-service into the datastore. The problem is that the XML returned from the web-service has UTF-8 characters and app engine is not interpreting them correctly. It renders them as ??.
I'm fairly sure I've tracked this down to the URL Fetch request. The basic flow is: Task queue -> fetch the web-service data -> put data into datastore so it definitely has nothing to do with request or response encoding of the main site.
I put log messages before and after Apache Digester to see if that was the cause, but determined it was not. This is what I saw in logs:
string from the XML: "Doppelg��nger"
After digester processed: "Doppelg??nger"
Here is my url fetching code:
public static String getUrl(String pageUrl) {
    StringBuilder data = new StringBuilder();
    log.info("Requesting: " + pageUrl);
    for (int i = 0; i < 5; i++) {
        try {
            URL url = new URL(pageUrl);
            URLConnection connection = url.openConnection();
            connection.connect();
            BufferedReader reader = new BufferedReader(new InputStreamReader(connection.getInputStream()));
            String line;
            while ((line = reader.readLine()) != null) {
                data.append(line);
            }
            reader.close();
            break;
        } catch (Exception e) {
            log.warn("Failed to load page: " + pageUrl, e);
        }
    }
    String resp = data.toString();
    if (resp.isEmpty()) {
        return null;
    }
    return resp;
}
Is there a way I can force this to interpret the input as UTF-8? I tested the page I am loading and the W3C validator recognized it as valid UTF-8.
The issue is only on app engine servers, it works fine in the development server.
Thanks
Try:
BufferedReader reader = new BufferedReader(new InputStreamReader(connection.getInputStream(), "UTF-8"));
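Hardcoding "UTF-8" works when the server actually sends UTF-8. A slightly more general sketch (my addition, not part of the answers above) reads the charset out of the response's Content-Type header and falls back to UTF-8:

```java
public class CharsetHelper {
    // Parse the charset out of a Content-Type header value such as
    // "text/xml; charset=UTF-8", defaulting to UTF-8 when absent.
    public static String charsetOf(String contentType) {
        if (contentType != null) {
            for (String part : contentType.split(";")) {
                part = part.trim();
                if (part.toLowerCase().startsWith("charset=")) {
                    return part.substring("charset=".length()).replace("\"", "");
                }
            }
        }
        return "UTF-8";
    }
}
```

It would be used as `new InputStreamReader(connection.getInputStream(), CharsetHelper.charsetOf(connection.getContentType()))`.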
I ran into the same issue about 3 months back, Mike. It does look like, and I would assume, your problem is the same.
Let me recollect and put it down here. Feel free to add if I miss something.
My setup was Tomcat and Struts, and the way I resolved it was through the correct configs in Tomcat.
Basically it has to support the UTF-8 characters there itself: useBodyEncodingForURI in the connector; this is for GET params.
Plus you can use a filter for POST params.
A good resource where you can find all of this under one roof was linked here.
I had a problem in production thereafter, where I had an Apache web server redirecting requests to Tomcat :). Similarly, UTF-8 has to be enabled there too. The moral of the story: resolve the problems as they come :)

WRONG_DOCUMENT_ERR Error after login to sugarCRM from java Axis 1.4

I want to import data from a Java web application into SugarCRM. I created the client stub using Axis, and the connection seems to work, since I can get server information. But after login, I get an error while fetching the session ID:
Error is: "faultString: org.w3c.dom.DOMException: WRONG_DOCUMENT_ERR: A node is used in a different document than the one that created it."
Here is my code:
private static final String ENDPOINT_URL = " http://localhost/sugarcrm/service/v3/soap.php";

java.net.URL url = null;
try {
    url = new URL(ENDPOINT_URL);
} catch (MalformedURLException e1) {
    System.out.println("URL endpoint creation failed. Message: " + e1.getMessage());
    e1.printStackTrace();
}
System.out.println("URL endpoint created successfully!");

Sugarsoap service = new SugarsoapLocator();
SugarsoapPortType port = service.getsugarsoapPort(url);
Get_server_info_result result = port.get_server_info();
System.out.println(result.getGmt_time());
System.out.println(result.getVersion());
// I am getting the right answers up to here

User_auth userAuth = new User_auth();
userAuth.setUser_name(USER_NAME);
MessageDigest md = MessageDigest.getInstance("MD5");
String password = convertToHex(md.digest(USER_PASSWORD.getBytes()));
userAuth.setPassword(password);

Name_value nameValueListLogin[] = null;
Entry_value loginResponse = null;
loginResponse = port.login(userAuth, "sugarcrm", nameValueListLogin);
String sessionID = loginResponse.getId(); // <--- Get error on this one
The nameValueListLogin could be from a different document context (coming from a different source). See if this link helps.
You may need to get more debugging/logging information so we can see what nameValueListLogin consists of and where it is coming from.
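For background on what the exception means at the DOM level (a sketch of my own, not specific to the Axis stub): WRONG_DOCUMENT_ERR is thrown when a Node created by one Document is inserted into another, and the standard fix is to copy it into the target document with importNode first:

```java
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;

import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;

public class ImportNodeDemo {
    public static String demo() throws Exception {
        DocumentBuilder db = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        Document a = db.newDocument();
        Document b = db.newDocument();
        Element root = b.createElement("root");
        b.appendChild(root);

        Element foreign = a.createElement("name_value"); // owned by document a
        // root.appendChild(foreign) would typically throw WRONG_DOCUMENT_ERR
        // here; importNode makes a copy owned by document b first.
        Node imported = b.importNode(foreign, true);
        root.appendChild(imported);
        return root.getFirstChild().getNodeName();
    }
}
```

Whether the Axis-generated stub exposes the underlying DOM nodes directly will determine if this technique is applicable in the OP's case.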
