Java losing part of input stream buffer

I'm currently working on a project that parses weather data from an XML input stream, and I'm running into a strange bug that only shows up on Linux.
When the server software that receives the input stream runs on Windows, everything works like a charm. When it runs on Linux, however, the following bug presents itself: after a couple of messages are received correctly, the third or so message is corrupted because half of the input buffer is "lost".
This issue only occurs in the following situations:
Platform   Origin of XML stream   Error occurs
Windows    localhost              No
Windows    remote                 No
Linux      localhost              No
Linux      remote                 Yes
I'm using the following code to receive the XML stream:
private Document getXML() throws JDOMException {
    SAXBuilder builder = new SAXBuilder();
    try {
        // Get the input stream.
        BufferedReader in = new BufferedReader(new InputStreamReader(sock.getInputStream()));
        String xmlstream = "";
        String line;
        while (!(line = in.readLine()).contains("</MEASUREMENT>")) {
            xmlstream += line;
        }
        xmlstream += "</MEASUREMENT></WEATHERDATA>";
        System.out.println("XML DATA:" + xmlstream);
        Document xmlDocument = builder.build(new StringReader(xmlstream));
        return xmlDocument;
    } catch (IOException | NullPointerException e) {
        // Socket closes.
        System.out.println("Client disconnected!");
        return null;
    }
}
Here is an example of the received data:
XML DATA:<?xml version="1.0"?><WEATHERDATA> <MEASUREMENT> <STN>726030</STN> <DATE>2018-01-27</DATE> <TIME>13:42:01</TIME> <TEMP>-2.8</TEMP> <DEWP>-2.7</DEWP> <STP>1014.1</STP> <SLP>1019.3</SLP> <VISIB>8.2</VISIB> <WDSP>29.0</WDSP> <PRCP>0.02</PRCP> <SNDP>0.0</SNDP><FRSHTT>110000</FRSHTT> <CLDC>77.1</CLDC> <WNDDIR>191</WNDDIR></MEASUREMENT></WEATHERDATA>
XML DATA:<?xml version="1.0"?><WEATHERDATA> <MEASUREMENT> <STN>726030</STN> <DATE>2018-01-27</DATE> <TIME>13:42:02</TIME> <TEMP>-0.7</TEMP> <DEWP>-3.5</DEWP> <STP>1014.2</STP> <SLP>1019.2</SLP> <VISIB>8.3</VISIB> <WDSP>28.9</WDSP> <PRCP>0.02</PRCP> <SNDP>0.0</SNDP><FRSHTT>110000</FRSHTT> <CLDC>77.2</CLDC> <WNDDIR>191</WNDDIR></MEASUREMENT></WEATHERDATA>
XML DATA:/DATE> <TIME>13:42:02</TIME> <TEMP>-9.4</TEMP> <DEWP>-13.5</DEWP> <STP>1005.1</STP> <SLP>1013.2</SLP> <VISIB>22.8</VISIB> <WDSP>12.8</WDSP> <PRCP>0.25</PRCP> <SNDP>8.3</SNDP> <FRSHTT>111000</FRSHTT> <CLDC>50.0</CLDC> <WNDDIR>311</WNDDIR></MEASUREMENT></WEATHERDATA>
As you can see, the first two messages are received in their entirety, but the third message starts somewhere in the middle of the stream.
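One plausible cause, for what it's worth: if getXML() wraps sock.getInputStream() in a fresh BufferedReader on every call, each discarded reader also discards whatever it had read ahead past the last </MEASUREMENT> line. That loss would only surface when TCP delivers data in large irregular chunks (remote connections) rather than line-by-line (localhost). A minimal sketch that keeps one reader per connection (the field name in and where it is initialized are assumptions):

// Same imports as the original method; the reader is now created once per
// connection (e.g. right after accepting the socket) instead of per message:
// in = new BufferedReader(new InputStreamReader(sock.getInputStream()));
private BufferedReader in;

private Document getXML() throws JDOMException {
    SAXBuilder builder = new SAXBuilder();
    try {
        StringBuilder xml = new StringBuilder();
        String line;
        // Collect lines until the closing measurement tag arrives.
        while ((line = in.readLine()) != null && !line.contains("</MEASUREMENT>")) {
            xml.append(line);
        }
        if (line == null) { // end of stream
            System.out.println("Client disconnected!");
            return null;
        }
        xml.append("</MEASUREMENT></WEATHERDATA>");
        return builder.build(new StringReader(xml.toString()));
    } catch (IOException e) {
        System.out.println("Client disconnected!");
        return null;
    }
}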

Related

Using Alchemy Entity Extraction to retrieve JSON output

I am running the EntityTest.java file from the Alchemy API Java SDK, which can be found here. The program works just fine, but there seems to be no way to change the output format to JSON.
I have tried executing this code:
// Create an AlchemyAPI object.
AlchemyAPI alchemyObj = AlchemyAPI.GetInstanceFromFile("api_key.txt");
// Force the output type to be JSON
AlchemyAPI_NamedEntityParams params = new AlchemyAPI_NamedEntityParams();
params.setOutputMode("json");
// Extract a ranked list of named entities for a web URL.
Document doc = alchemyObj.URLGetRankedNamedEntities("http://www.techcrunch.com/", params);
System.out.println(getStringFromDocument(doc));
But the code throws a RuntimeException and prints the following on the console:
Exception in thread "main" java.lang.RuntimeException: Invalid setting json for parameter outputMode
at com.alchemyapi.api.AlchemyAPI_Params.setOutputMode(AlchemyAPI_Params.java:42)
at com.alchemyapi.test.EntityTest.main(EntityTest.java:29)
Also, here is the setOutputMode method from the AlchemyAPI_Params.java file:
public void setOutputMode(String outputMode) {
    if (!outputMode.equals(AlchemyAPI_Params.OUTPUT_XML) && !outputMode.equals(OUTPUT_RDF)) {
        throw new RuntimeException("Invalid setting " + outputMode + " for parameter outputMode");
    }
    this.outputMode = outputMode;
}
As is evident from the code, it seems that the only two acceptable output formats are XML and RDF. Is that so? Is there no way to get the output in JSON?
Can anybody please help me out with this?
You will need to add a new constant, OUTPUT_JSON, in AlchemyAPI_Params and modify the setOutputMode method to accept it.
After that, in AlchemyAPI, you will need to modify the doRequest method with a new OUTPUT_JSON case.
You can use http://www.oracle.com/technetwork/articles/java/json-1973242.html to create the new content.
Hope it helps.
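A minimal sketch of those two changes in AlchemyAPI_Params (OUTPUT_JSON is an assumed new constant, not something the SDK ships with):

// New constant alongside OUTPUT_XML and OUTPUT_RDF:
public static final String OUTPUT_JSON = "json";

public void setOutputMode(String outputMode) {
    if (!outputMode.equals(AlchemyAPI_Params.OUTPUT_XML)
            && !outputMode.equals(OUTPUT_RDF)
            && !outputMode.equals(OUTPUT_JSON)) { // accept the new mode
        throw new RuntimeException("Invalid setting " + outputMode + " for parameter outputMode");
    }
    this.outputMode = outputMode;
}

doRequest in AlchemyAPI would then need a matching OUTPUT_JSON case, as described above.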
I solved the problem by resorting to a completely different approach. Instead of using the available Java SDK, I made an HTTP connection to the endpoint of the URLGetRankedNamedEntities API and retrieved the response.
Here is a code sample that demonstrates how to do this:
URL urlObj = new URL("http://access.alchemyapi.com/calls/url/URLGetRankedNamedEntities?apikey=" + API_KEY_HERE + "&url=http://www.smashingmagazine.com/2015/04/08/web-scraping-with-nodejs/&outputMode=json");
System.out.println(urlObj.toString() + "\n");

URLConnection connection = urlObj.openConnection();
connection.connect();

BufferedReader reader = new BufferedReader(new InputStreamReader(connection.getInputStream()));
String line;
StringBuilder builder = new StringBuilder();
while ((line = reader.readLine()) != null) {
    builder.append(line).append("\n");
}
System.out.println(builder);
Similar endpoints are available for the other APIs as well, and they can be found here.

Specific characters not rendering properly in Java

I have an issue when displaying strings received from a server in a JTable: some specific characters appear as little white squares instead of "é", "à", etc. I have tried a lot of things, but none of them fixed the problem. I'm working with Eclipse under Windows; the server was developed using Visual Studio 2010.
The server sends an XML file using tinyXML2, and the client uses JDOM to read it. The font used is "Dialog". The server takes the strings from an Oracle database.
I assume this is an encoding problem, but I haven't been able to fix it yet.
Does anyone have an idea?
Thanks,
Arnaud
EDIT: As requested, this is how I use JDOM:
public static Player fromXML(Element e) {
    Player result = new Player();
    String e_text = null;
    try {
        e_text = e.getChildText(XMLTags.XML_Player_playerId);
        if (e_text != null) result.setID(Integer.parseInt(e_text));

        e_text = e.getChildText(XMLTags.XML_Player_lastName);
        if (e_text != null) result.setName(e_text);

        e_text = e.getChildText(XMLTags.XML_Player_point_scored);
        if (e_text != null) result.addSpecial(STAT_SCORED, Double.parseDouble(e_text));

        e_text = e.getChildText(XMLTags.XML_Player_point_scored_last);
        if (e_text != null) result.addSpecial(STAT_SCORED_LAST, Double.parseDouble(e_text));
    } catch (Exception ex) {
        ex.printStackTrace();
    }
    return result;
}
public static Document load(String filename) {
    File XMLFile = new File(CLIENT_TO_SERVER, filename);
    SAXBuilder sxb = new SAXBuilder();
    Document document = new Document();
    try {
        document = sxb.build(new File(XMLFile.getPath()));
    } catch (Exception e) {
        e.printStackTrace();
    }
    return document;
}
Read the file using the correct encoding, something like:
document = sxb.build(new BufferedReader(new InputStreamReader(new FileInputStream(XMLFile.getPath()), "UTF8")));
Note: first determine which character encoding is used in that file, and specify that charset instead of UTF8 above.
In case the encoding is not known, or the files are generated by various systems with different encodings, you can use Mozilla's encoding detector library: see https://code.google.com/p/juniversalchardet/
You will also need to handle UnsupportedEncodingException.
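If you go the juniversalchardet route, here is a minimal sketch that slots into the load method above (the fall-back to UTF8 when nothing is detected is an assumption; the rest follows the library's documented API):

import org.mozilla.universalchardet.UniversalDetector;

// Feed the raw bytes to the detector, then reopen the file with the
// charset it reports.
UniversalDetector detector = new UniversalDetector(null);
byte[] buf = new byte[4096];
try (FileInputStream fis = new FileInputStream(XMLFile.getPath())) {
    int nread;
    while ((nread = fis.read(buf)) > 0 && !detector.isDone()) {
        detector.handleData(buf, 0, nread);
    }
}
detector.dataEnd();
String encoding = detector.getDetectedCharset();
if (encoding == null) {
    encoding = "UTF8"; // assumption: fall back when detection fails
}
document = sxb.build(new BufferedReader(
        new InputStreamReader(new FileInputStream(XMLFile.getPath()), encoding)));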

Scraping a site

I am trying to write an alert system that scrapes the Complaints Board site periodically to look for any complaints about my product. I am using Jsoup for this. Below is the code fragment that gives me an error:
doc = Jsoup.connect(finalUrl).timeout(10 * 1000).get();
This gives me the error:
java.net.SocketException: Unexpected end of file from server
When I copy and paste the same finalUrl string into the browser, it works. I then tried a simple URL connection:
BufferedReader br = null;
try {
    URL a = new URL(finalUrl);
    URLConnection conn = a.openConnection();
    // open the stream and put it into BufferedReader
    br = new BufferedReader(new InputStreamReader(conn.getInputStream()));
    doc = Jsoup.parse(br.toString());
} catch (IOException e) {
    e.printStackTrace();
}
But as it turned out, the connection itself returns null (br is null). Now the question is: why does the same string, when pasted into a browser, open the site without any error?
The full stack trace is below:
java.net.SocketException: Unexpected end of file from server
at sun.net.www.http.HttpClient.parseHTTPHeader(HttpClient.java:774)
at sun.net.www.http.HttpClient.parseHTTP(HttpClient.java:633)
at sun.net.www.http.HttpClient.parseHTTPHeader(HttpClient.java:771)
at sun.net.www.http.HttpClient.parseHTTP(HttpClient.java:633)
at sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1195)
at ComplaintsBoardScraper.main(ComplaintsBoardScraper.java:46)
That one was tricky! :-)
The server blocks all requests that don't have a proper user agent, which is why you succeeded with your browser but failed from Java.
Fortunately, changing the user agent is not a big deal in Jsoup:
final String url = "http://www.complaintsboard.com/?search=justanswer.com&complaints=Complaints";
final String userAgent = "Mozilla/5.0 (X11; U; Linux i586; en-US; rv:1.7.3) Gecko/20040924 Epiphany/1.4.4 (Ubuntu)";

Document doc = Jsoup.connect(url)         // you get a 'Connection' object here
                    .userAgent(userAgent) // set the user agent
                    .timeout(10 * 1000)   // set the timeout
                    .get();               // execute the GET request
I've taken the first user agent I found … I guess you can use any valid one instead.

App Engine URL request: UTF-8 characters becoming '??' or '???'

I have an error when loading data from a web service into the datastore: the XML returned from the web service contains UTF-8 characters, and App Engine is not interpreting them correctly. It renders them as ??.
I'm fairly sure I've tracked this down to the URL Fetch request. The basic flow is: task queue -> fetch the web-service data -> put the data into the datastore, so it definitely has nothing to do with the request or response encoding of the main site.
I put log messages before and after Apache Digester to see if that was the cause, but determined it was not. This is what I saw in the logs:
string from the XML: "Doppelg��nger"
After digester processed: "Doppelg??nger"
Here is my URL-fetching code:
public static String getUrl(String pageUrl) {
    StringBuilder data = new StringBuilder();
    log.info("Requesting: " + pageUrl);
    for (int i = 0; i < 5; i++) {
        try {
            URL url = new URL(pageUrl);
            URLConnection connection = url.openConnection();
            connection.connect();
            BufferedReader reader = new BufferedReader(new InputStreamReader(connection.getInputStream()));
            String line;
            while ((line = reader.readLine()) != null) {
                data.append(line);
            }
            reader.close();
            break;
        } catch (Exception e) {
            log.warn("Failed to load page: " + pageUrl, e);
        }
    }
    String resp = data.toString();
    if (resp.isEmpty()) {
        return null;
    }
    return resp;
}
Is there a way I can force this to recognize the input as UTF-8? I tested the page I am loading, and the W3C validator recognized it as valid UTF-8.
The issue only occurs on the App Engine servers; it works fine on the development server.
Thanks
Try:
BufferedReader reader = new BufferedReader(new InputStreamReader(connection.getInputStream(), "UTF-8"));
I ran into the same issue three months back, Mike. It does look like, and I would assume, your problem is the same.
Let me recollect and put it down here. Feel free to add anything I miss.
My setup was Tomcat and Struts.
The way I resolved it was through the correct configuration in Tomcat.
Basically, Tomcat itself has to support the UTF-8 characters: set useBodyEncodingForURI on the connector; this covers GET params.
Plus, you can use a filter for POST params.
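A minimal sketch of such a filter with the standard javax.servlet API (the class name and the hard-coded UTF-8 are illustrative choices, not from the original answer):

import java.io.IOException;
import javax.servlet.*;

// Forces UTF-8 on the request before any parameter parsing happens,
// so POST bodies are decoded correctly.
public class Utf8EncodingFilter implements Filter {
    public void init(FilterConfig config) {}

    public void doFilter(ServletRequest request, ServletResponse response,
                         FilterChain chain) throws IOException, ServletException {
        request.setCharacterEncoding("UTF-8");
        response.setCharacterEncoding("UTF-8");
        chain.doFilter(request, response);
    }

    public void destroy() {}
}

Map it in web.xml so it runs before anything reads the request parameters.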
A good resource where you can find all of this under one roof: Click here!
I had a problem in production thereafter, where an Apache web server redirected requests to Tomcat :). Similarly, you have to enable UTF-8 there too. The moral of the story: resolve the problems as they come :)

Tagsoup fails to parse html document from a StringReader ( java )

I have this function:
private Node getDOM(String str) throws SearchEngineException {
    DOMResult result = new DOMResult();
    try {
        XMLReader reader = new Parser();
        reader.setFeature(Parser.namespacesFeature, false);
        reader.setFeature(Parser.namespacePrefixesFeature, false);
        Transformer transformer = TransformerFactory.newInstance().newTransformer();
        transformer.transform(new SAXSource(reader, new InputSource(new StringReader(str))), result);
    } catch (Exception ex) {
        throw new SearchEngineException("NukatSearchEngine.getDom: " + ex.getMessage());
    }
    return result.getNode();
}
It takes a String that contains the HTML document sent by the HTTP server after a POST request, but it fails to parse it properly: I only get about four nodes from the entire document. The string itself looks fine; if I print it out and copy-paste it into a text document, I see the page I expected.
When I use an overloaded version of the above method:
private Node getDOM(URL url) throws SearchEngineException {
    DOMResult result = new DOMResult();
    try {
        XMLReader reader = new Parser();
        reader.setFeature(Parser.namespacesFeature, false);
        reader.setFeature(Parser.namespacePrefixesFeature, false);
        Transformer transformer = TransformerFactory.newInstance().newTransformer();
        transformer.transform(new SAXSource(reader, new InputSource(url.openStream())), result);
    } catch (Exception ex) {
        throw new SearchEngineException("NukatSearchEngine.getDom: " + ex.getMessage());
    }
    return result.getNode();
}
then everything works just fine: I get a proper DOM tree. But I need to somehow retrieve the POST answer from the server.
Storing the string in a file and reading it back does not work; I still get the same results.
What could be the problem?
Could it be a problem with the XML encoding?
This seems like an encoding problem. In the example of yours that doesn't work, you're passing the content in as a string (wrapped in a StringReader), and you get problems with TagSoup parsing the HTML. In the example that works, you're passing the byte stream into the InputSource constructor. The difference is that when you pass in the byte stream, the SAX implementation can figure out the encoding from the stream itself.
If you want to test this, you could try these steps:
1. Stream the HTML you're parsing through a java.io.InputStreamReader and call getEncoding on it to see what encoding it detects.
2. In your first example, call setEncoding on the InputSource, passing in the encoding that the InputStreamReader reported.
3. See if the first example, changed to explicitly set the encoding, parses the HTML correctly.
There's a discussion of this toward the end of an article on using the SAX InputSource.
To get a POST response, you first need to send a POST request; new InputSource(url.openStream()) probably opens a connection and reads the response from a GET request. Check out Sending a POST Request Using a URL.
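A minimal sketch of that approach (the URL and form body are placeholders): send the POST yourself, then hand the response stream to TagSoup the same way the working overload does, so SAX can sort out the encoding.

import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;

HttpURLConnection conn =
        (HttpURLConnection) new URL("http://example.com/search").openConnection();
conn.setRequestMethod("POST");
conn.setDoOutput(true); // we are writing a request body
try (OutputStream out = conn.getOutputStream()) {
    out.write("query=foo".getBytes("UTF-8")); // form-encoded POST body
}

// Same TagSoup setup as in the overloaded method, fed from the POST response.
XMLReader reader = new Parser();
reader.setFeature(Parser.namespacesFeature, false);
reader.setFeature(Parser.namespacePrefixesFeature, false);
DOMResult result = new DOMResult();
TransformerFactory.newInstance().newTransformer()
        .transform(new SAXSource(reader, new InputSource(conn.getInputStream())), result);
Node dom = result.getNode();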
Other possibilities that might be interesting to check out for doing POST requests and getting the response:
Jersey Web Client
HtmlUnit
