Strange Whitespace Error when Accessing RSS Feed - java

I'm not sure if anyone else has encountered or asked about this before, but my application makes use of two Yahoo! RSS feeds: Top News and Weather Forecast. I'm new to the idea of using these in the first place, but from what I've read, I simply need to make an HTTP GET request to a specific URL to retrieve an XML file, which I can then parse for the information I want. The parser works just fine, since I tested it with a sample XML file from each feed; however, a strange error occurs when I use an AJAX GET call to the URLs:
The XML page cannot be displayed
Cannot view XML input using XSL style sheet. Please correct the error and then click the Refresh button, or try again later.
Whitespace is not allowed at this location.
Error processing resource 'http://localhost:8080/BBS/fservlet?p=n'. Line 28, P...
for (i = 0; i < s.length; i++){
-------------------^
Note that I have this application "BBS" currently deployed on my local system with Tomcat. I looked into whitespace errors like this, and most seem to point to some line within the XML file itself that's having a problem. In most cases, it had something to do with escaping the "&" symbol, but it appears as though IE is telling me that the error is within a for-loop. I'm no XML expert, but I've never seen a for-loop within an XML file. Even so, I've gone to the URL directly in my browser and viewed the XML file (it's the one I used to test my parsing) and found no such line. In addition, no such loop exists anywhere in my code. In other words, I'm not sure if this is an error on my end or some configuration setting. Here's the code I'm working with, however:
jQuery Code
// Located in my JSP file
var baseContext = "<%=request.getContextPath()%>";
$(document).ready(function() {
    ParseWeather();
    ParseNews();
});

// Located in a separate JS file
function ParseWeather() {
    $.get(baseContext + "/servlet?p=w", function(data) {
        // XML Parser
    });
    // Data Manipulation
}

function ParseNews() {
    $.get(baseContext + "/servlet?p=n", function(data) {
        // XML Parser
    });
    // Data Manipulation
}
Java Code
import java.io.*;
import javax.servlet.*;
import javax.servlet.http.*;
import java.net.URL;

public class FeedServlet extends HttpServlet {
    protected void doGet(final HttpServletRequest request, final HttpServletResponse response)
            throws ServletException, IOException {
        try {
            response.setContentType("text/xml");
            final URL url;
            String line = "";
            if (request.getParameter("p").equals("w")) {
                // Configuration setting that returns: "http://xml.weather.yahoo.com/forecastrss?p=USOR0186"
                url = new URL(AppConfiguration.getInstance().getForcastUrl());
            } else {
                // Configuration setting that returns: "http://news.yahoo.com/rss/"
                url = new URL(AppConfiguration.getInstance().getNewsUrl());
            }
            final BufferedReader reader = new BufferedReader(new InputStreamReader(url.openStream()));
            final PrintWriter writer = response.getWriter();
            while ((line = reader.readLine()) != null) {
                writer.println(line);
                writer.flush();
            }
            writer.close();
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}
My company has an AppConfiguration class that allows certain variables, like the URLs, to be changed through a configuration page. At any rate, those two calls simply return the URLs...
Yahoo! Forecast RSS Feed:
http://xml.weather.yahoo.com/forecastrss?p=USOR0186
Yahoo! News: Top Stories Feed:
http://news.yahoo.com/rss/
Anyway, any help would be much appreciated.

for (i = 0; i < s.length; i++){
The error is at the less-than symbol, which means that the XML parser is reading your source code! Use wget to fetch the resource and check that actual XML is returned, not source code.
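A quick way to do the same check from Java (a minimal sketch; the URL is the one from your error message, so adjust host and context as needed) is to fetch the servlet output and print the first few lines. An RSS feed should begin with an XML declaration, not JavaScript:

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;

public class FeedCheck {
    public static void main(String[] args) throws Exception {
        // URL taken from the error message; adjust as needed.
        URL url = new URL("http://localhost:8080/BBS/fservlet?p=n");
        try (BufferedReader in = new BufferedReader(new InputStreamReader(url.openStream()))) {
            String line;
            int count = 0;
            // Print the first five lines: expect <?xml ...> and <rss ...>,
            // not a for-loop from your own JS source.
            while ((line = in.readLine()) != null && count++ < 5) {
                System.out.println(line);
            }
        }
    }
}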

Related

Parsing curl response with Java

Before writing something like "why don't you use a Java HTTP client such as Apache, etc.", you should know that the reason is SSL. I wish I could, they are very convenient, but I can't.
None of the available HTTP clients support the GOST cipher suite, and I get a handshake exception every time. The ones which do support the suite don't support SNI (they are also proprietary), so I'm returned the wrong cert and get a handshake exception over and over again.
The only solution was to configure OpenSSL (with the GOST engine) and curl, and finally execute the curl command from Java.
Having said that, I wrote a simple snippet for executing a command and getting input stream response:
public static InputStream executeCurlCommand(String finalCurlCommand) throws IOException
{
    return Runtime.getRuntime().exec(finalCurlCommand).getInputStream();
}
Additionally, I can convert the returned IS to a string like that:
public static String convertResponseToString(InputStream isToConvertToString) throws IOException
{
    StringWriter writer = new StringWriter();
    IOUtils.copy(isToConvertToString, writer, "UTF-8");
    return writer.toString();
}
However, I can't see a pattern I could use to reliably extract a good response or a desired response header:
Here's what I mean
After executing a command (with the -i flag), there might be lots and lots of information in the output: the status line, headers, redirects, and the response body itself.
At first I thought that I could just split it on '\n', but the thing is that a required response header, or the response itself, may not satisfy that criterion (prettified JSON or a long redirect URL breaks the rule).
Also, the static line GOST engine already loaded is a bit annoying (but I hope that I'll be able to get rid of it, and that no unrelated info like that will show up).
I do believe that there's a pattern which I can use.
For now I can only do that:
public static String getLocationRedirectHeaderValue(String curlResponse)
{
    String locationHeaderValue = curlResponse.substring(curlResponse.indexOf("Location: "));
    locationHeaderValue = locationHeaderValue.substring(0, locationHeaderValue.indexOf("\n")).replace("Location: ", "");
    return locationHeaderValue;
}
Which is not nice, obviously.
Thanks in advance.
Instead of reading the whole result as a single string you might want to consider reading it line by line using a scanner.
Then keep a few status variables around. The main task would be to separate header from body. In the body you might have a payload you want to treat differently (e.g. use GSON to make a JSON object).
The nice thing: Header and Body are separated by an empty line. So your code would be along these lines:
boolean inHeader = true;
StringBuilder b = new StringBuilder();
String lastLine = "";
// Technically you would need a Multimap
Map<String, String> headers = new HashMap<>();
Scanner scanner = new Scanner(yourInputStream);
while (scanner.hasNextLine()) {
    String line = scanner.nextLine();
    if (line.length() == 0) {
        inHeader = false;
    } else {
        if (inHeader) {
            // if the line starts with a space it is a
            // continuation of the previous header
            treatHeader(line, lastLine);
            lastLine = line; // remember for continuation lines
        } else {
            b.append(line);
            b.append(System.lineSeparator());
        }
    }
}
String body = b.toString();
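treatHeader is left as a stub above; a minimal sketch of it (hypothetical, assuming headers is promoted to a field so the helper can see it) could split each line on the first colon and fold continuation lines into the previous header:

// Hypothetical helper, only to complete the sketch above.
// Assumes a field: Map<String, String> headers = new HashMap<>();
static void treatHeader(String line, String lastLine) {
    if (Character.isWhitespace(line.charAt(0))) {
        // Continuation line: append to the previous header's value.
        String name = lastLine.substring(0, lastLine.indexOf(':')).trim();
        headers.merge(name, line.trim(), (a, c) -> a + " " + c);
        return;
    }
    int colon = line.indexOf(':');
    if (colon < 0) {
        return; // status line such as "HTTP/1.1 200 OK"
    }
    headers.put(line.substring(0, colon).trim(), line.substring(colon + 1).trim());
}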

HtmlUnit error - com.gargoylesoftware.htmlunit.FailingHttpStatusCodeException: 429 Too Many Requests

I am scraping data from a publication website (ResearchGate) using HtmlUnit - Java.
For scraping the data, I am reading URLs from a text file. I have almost 4000 URLs in the file (all pages have a similar pattern, but different data). But when I try to run my logic for all those 4000 URLs, I get the error:
com.gargoylesoftware.htmlunit.FailingHttpStatusCodeException: 429 Too Many Requests for https://www.researchgate.net/application.RequestQuotaExceeded.html?tk=i1iSnVitFTozE0uu1nlOqH6CgwJA0vikMY_2VFnCBM3JDz4SZrupIy5I4yAT5KBOFAX-LySwTEIR4dak8u0FRHod9caWkRiNZS6RDGKXCY2Gn7kh80q72oaXjk8RWsXqqfcrNa3ULlnSHgQ
at com.gargoylesoftware.htmlunit.WebClient.throwFailingHttpStatusCodeExceptionIfNecessary(WebClient.java:537)
at com.gargoylesoftware.htmlunit.WebClient.getPage(WebClient.java:362)
at com.gargoylesoftware.htmlunit.WebClient.getPage(WebClient.java:434)
at com.pollak.library.Authenticator.autoLogin(Authenticator.java:70)
at com.pollak.library.FetchfromPublicationPage.main(FetchfromPublicationPage.java:34)
My code is :
package com.pollak.library;

import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.File;
import java.io.FileWriter;
import java.util.List;

import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.HtmlAnchor;
import com.gargoylesoftware.htmlunit.html.HtmlElement;
import com.gargoylesoftware.htmlunit.html.HtmlPage;

public class FetchfromPublicationPage {

    public static void main(String[] args) throws Exception {
        String path = "Path to the text file which contains 4000 URLs";
        File file = new File(path);
        BufferedReader br = new BufferedReader(new java.io.FileReader(file));
        String baseUrl = "https://www.researchgate.net";
        String login = <login_ID>;
        String password = <password>;
        File facurl = new File("Path to the file in which I want to save scraped information");
        FileWriter fw = new FileWriter(facurl);
        BufferedWriter bw = new BufferedWriter(fw);
        int neha = 1;
        try {
            WebClient client = Authenticator.autoLogin(baseUrl + "/login", login, password);
            String facultyprofileurl;
            while ((facultyprofileurl = br.readLine()) != null) {
                String info = "", ath = "";
                String[] arr = facultyprofileurl.split(",");
                HtmlPage page = client.getPage(arr[2]);
                if (page.asText().contains("You need to sign in for access to this page")) {
                    throw new Exception(String.format("Error during login on %s , check your credentials", baseUrl));
                }
                List<HtmlElement> items = (List<HtmlElement>) page.getByXPath(
                        "//div[@class='nova-e-text nova-e-text--size-m nova-e-text--family-sans-serif nova-e-text--spacing-xxs nova-e-text--color-grey-700']");
                List<HtmlElement> items2 = (List<HtmlElement>) page.getByXPath(
                        "//div[@class='nova-e-text nova-e-text--size-l nova-e-text--family-sans-serif nova-e-text--spacing-none nova-e-text--color-inherit nova-v-person-list-item__title nova-v-person-list-item__title--clamp-1']");
                if (items.isEmpty()) {
                    System.out.println("No items found !");
                } else {
                    for (HtmlElement htmlItem : items) {
                        HtmlElement articleinfo = (HtmlElement) htmlItem.getFirstByXPath(".//ul");
                        info += articleinfo.getTextContent() + "**";
                    }
                }
                if (items2.isEmpty()) {
                    System.out.println("No items found !");
                } else {
                    for (HtmlElement htmlItem : items2) {
                        HtmlAnchor authors = (HtmlAnchor) htmlItem.getFirstByXPath(".//a");
                        ath += authors.getTextContent() + "**";
                    }
                }
                bw.write(neha + "," + info + "," + ath);
                bw.newLine();
                neha = neha + 1;
            }
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}
Can anyone please guide me on how to solve this error?
I fear there is no simple solution for you. You have to dig yourself and figure out what is going on.
Maybe some hints.
At first you have to get familiar with HTTP and the general way it works. Try to understand that, and read about the error code you got.
The next step is to use a web proxy (e.g. Charles) to see what is going on on the wire. Maybe the server sends some additional information (headers) that contains a hint about the rules used on the server side to detect this situation.
Next, start with a simple program and try to find the number of requests that triggers your problem.
All in all, we can't do the analysis work for you. You have to learn about the way HTTP works, you have to understand what HTTP servers are doing, and finally you might find a way. But keep in mind that the people on the server side seem to block robots like yours (for various good reasons). Maybe you will find a solution, but maybe that solution will work only for some time.
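Since 429 means the server is actively rate-limiting you, one common mitigation (a sketch only, not a guaranteed fix; the site may still block you, and scraping may be against its terms of service) is to throttle requests and back off whenever that status code shows up:

import com.gargoylesoftware.htmlunit.FailingHttpStatusCodeException;
import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.HtmlPage;

// Hypothetical helper: fixed delay between requests plus a simple
// exponential backoff whenever the server answers 429.
static HtmlPage getPageThrottled(WebClient client, String url) throws Exception {
    long backoffMillis = 5_000;
    for (int attempt = 0; attempt < 5; attempt++) {
        try {
            Thread.sleep(2_000); // pause before every request
            return client.getPage(url);
        } catch (FailingHttpStatusCodeException e) {
            if (e.getStatusCode() != 429) {
                throw e;
            }
            Thread.sleep(backoffMillis); // wait longer, then retry
            backoffMillis *= 2;
        }
    }
    throw new Exception("Still rate-limited after retries: " + url);
}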

Unable to parse JSON from url

Write a piece of code that will query a URL that returns JSON and can parse the JSON string to pull out pieces of information. The information that should be parsed and returned is the pageid and the list of “See Also” links. Those links should be formatted to be actual links that can be used by a person to find the appropriate article.
Use the Wikipedia API for the query. A sample query is:
URL
Other queries can be generated changing the “titles” portion of the query string. The code to parse the JSON and pull the “See Also” links should be generic enough to work on any Wikipedia article.
I tried writing the below code:
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.URL;
import org.json.JSONException;
import org.json.JSONObject;
public class JsonRead {
    private static String readUrl(String urlString) throws Exception {
        BufferedReader reader = null;
        try {
            URL url = new URL(urlString);
            reader = new BufferedReader(new InputStreamReader(url.openStream()));
            StringBuffer buffer = new StringBuffer();
            int read;
            char[] chars = new char[1024];
            while ((read = reader.read(chars)) != -1)
                buffer.append(chars, 0, read);
            return buffer.toString();
        } finally {
            if (reader != null)
                reader.close();
        }
    }

    public static void main(String[] args) throws IOException, JSONException {
        JSONObject json;
        try {
            json = new JSONObject(readUrl("https://en.wikipedia.org/w/api.php?format=json&action=query&titles=SMALL&prop=revisions&rvprop=content"));
            System.out.println(json.toString());
            System.out.println(json.get("pageid"));
        } catch (Exception e) {
            // TODO Auto-generated catch block
            e.printStackTrace();
        }
    }
}
I have used the JSON jar from the link below in Eclipse:
Json jar
When I run the above code I am getting the error below:
org.json.JSONException: JSONObject["pageid"] not found.
at org.json.JSONObject.get(JSONObject.java:471)
at JsonRead.main(JsonRead.java:35)
How can I extract the details of the pageid and also the "See Also" links from the URL?
I have never worked with JSON before, so kindly let me know how to proceed.
The json:
{
    "batchcomplete": "",
    "query": {
        "pages": {
            "1808130": {
                "pageid": 1808130,
                "ns": 0,
                "title": "SMALL",
                "revisions": [
                    {
                        "contentformat": "text/x-wiki",
                        "contentmodel": "wikitext",
                        "*": "{{About|the ALGOL-like programming language|the scripting language formerly named Small|Pawn (scripting language)}}\n\n'''SMALL''', Small Machine Algol Like Language, is a [[computer programming|programming]] [[programming language|language]] developed by Dr. [[Nevil Brownlee]] of [[Auckland University]].\n\n==History==\nThe aim of the language was to enable people to write [[ALGOL]]-like code that ran on a small machine. It also included the '''string''' type for easier text manipulation.\n\nSMALL was used extensively from about 1980 to 1985 at [[Auckland University]] as a programming teaching aid, and for some internal projects. Originally written to run on a [[Burroughs Corporation]] B6700 [[Main frame]] in [[Fortran]] IV, subsequently rewritten in SMALL and ported to a DEC [[PDP-10]] Architecture (on the [[Operating System]] [[TOPS-10]]) and IBM S360 Architecture (on the Operating System VM/[[Conversational Monitor System|CMS]]).\n\nAbout 1985, SMALL had some [[Object-oriented programming|object-oriented]] features added to handle structures (that were missing from the early language), and to formalise file manipulation operations.\n\n==See also==\n*[[ALGOL]]\n*[[Lua (programming language)]]\n*[[Squirrel (programming language)]]\n\n==References==\n*[http://www.caida.org/home/seniorstaff/nevil.xml Nevil Brownlee]\n\n[[Category:Algol programming language family]]\n[[Category:Systems programming languages]]\n[[Category:Procedural programming languages]]\n[[Category:Object-oriented programming languages]]\n[[Category:Programming languages created in the 1980s]]"
                    }
                ]
            }
        }
    }
}
If you read your exception carefully, you will find the solution on your own.
Exception in thread "main" org.json.JSONException: A JSONObject text must begin with '{' at 1 [character 2 line 1]
at org.json.JSONTokener.syntaxError(JSONTokener.java:433)
Your exception says A JSONObject text must begin with '{', which means the JSON you received from the API is probably not correct.
So, I suggest you debug your code and find out what you actually received in your String variable jsonText.
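For example, dumping the raw response before handing it to JSONObject makes it obvious whether the payload is JSON at all (a small sketch reusing the question's readUrl; jsonText is a hypothetical variable name):

String jsonText = readUrl("https://en.wikipedia.org/w/api.php?format=json&action=query&titles=SMALL&prop=revisions&rvprop=content");
// Valid JSON from this API must start with '{'; anything else (an HTML
// error page, for instance) explains the JSONTokener failure.
System.out.println(jsonText.substring(0, Math.min(200, jsonText.length())));
JSONObject json = new JSONObject(jsonText);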
You get the exception org.json.JSONException: JSONObject["pageid"] not found. when calling json.get("pageid") because pageid is not a direct sub-element of your root. You have to go all the way down through the object graph:
int pid = json.getJSONObject("query")
        .getJSONObject("pages")
        .getJSONObject("1808130")
        .getInt("pageid");
If you have an array in there you will even have to iterate the array elements (or pick the one you want).
Edit Here's the code to get the field containing the 'see also' values
String s = json.getJSONObject("query")
        .getJSONObject("pages")
        .getJSONObject("1808130")
        .getJSONArray("revisions")
        .getJSONObject(0)
        .getString("*");
The resulting string contains no valid JSON. You will have to parse it manually.
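The "See also" entries sit in that wikitext between ==See also== and the next == heading, as [[...]] links. A rough sketch (regex-based; it assumes the section exists and the links are plain [[Target]] or [[Target|label]] forms) that turns them into links a person can actually follow:

import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Rough sketch: pull the [[...]] targets out of the "See also" section
// of the wikitext and format them as Wikipedia URLs.
static List<String> seeAlsoLinks(String wikitext) {
    List<String> links = new ArrayList<>();
    int start = wikitext.indexOf("==See also==");
    if (start < 0) {
        return links; // article has no "See also" section
    }
    int end = wikitext.indexOf("\n==", start + 1); // next section heading
    String section = (end < 0) ? wikitext.substring(start) : wikitext.substring(start, end);
    Matcher m = Pattern.compile("\\[\\[([^\\]|]+)").matcher(section);
    while (m.find()) {
        String title = m.group(1).trim().replace(' ', '_');
        links.add("https://en.wikipedia.org/wiki/" + title);
    }
    return links;
}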

how can I detect charset of a web page

I just want to get the web page source in Java, and I want to get that content with the correct encoding type. I am able to get the content of a web page already, but for some web pages the content comes back with garbled characters. So I need to detect the charset of the web page.
According to my little research, I found that there is a jChardet library to do this. But I couldn't import it into my project. Can someone please help me?
By the way the code below is the code to read the web page content
StringBuilder builder = new StringBuilder();
InputStream is = fURL.openStream();
BufferedReader buffer = null;
buffer = new BufferedReader(new InputStreamReader(is, encodingType));
int byteRead;
while ((byteRead = buffer.read()) != -1) {
    builder.append((char) byteRead);
}
buffer.close();
return builder;
Read the Content-Type header of the HTTP response; it's the best way to get the charset. Only apply guessing when you have no alternative - and here you do have one.
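A minimal sketch of that approach, using URLConnection.getContentType() and falling back to a default (the fallback choice is an assumption; pick what fits your data) when no charset parameter is present:

import java.net.URL;
import java.net.URLConnection;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Sketch: pull the charset parameter out of the Content-Type header,
// e.g. "text/html; charset=UTF-8". Falls back to UTF-8 if absent.
static Charset charsetFromHeader(URL url) throws Exception {
    URLConnection connection = url.openConnection();
    String contentType = connection.getContentType();
    if (contentType != null) {
        for (String param : contentType.split(";")) {
            param = param.trim();
            if (param.toLowerCase().startsWith("charset=")) {
                return Charset.forName(param.substring("charset=".length()).trim());
            }
        }
    }
    return StandardCharsets.UTF_8; // assumed default
}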
You can also use jchardet: http://jchardet.sourceforge.net/
private static String detectCharset(byte[] body) {
    nsDetector det = new nsDetector(nsPSMDetector.ALL);
    det.Init(new nsICharsetDetectionObserver() {
        public void Notify(String charset) {
            // called once the detector has settled on a charset
        }
    });
    boolean isAscii = det.isAscii(body, body.length);
    // Feed the bytes to the detector only if they are not plain ASCII.
    if (!isAscii) {
        det.DoIt(body, body.length, false);
    }
    det.DataEnd();
    // Note: the first entry is only the most probable guess.
    return det.getProbableCharsets()[0];
}
Minimally, you would need to read and parse the HTTP headers to see whether they declare the encoding and, in the absence of such a declaration (rather common), parse the document itself to find a meta tag that declares the encoding. For XHTML documents, you would need to check the XML declaration and default to utf-8. This would still leave a considerable number of pages with an undeclared encoding, so some heuristics would be needed. You might check the section on encodings in the HTML5 draft, which contains some heuristic overrides too (e.g., treating iso-8859-1 as windows-1252).
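As an illustration of the meta-tag step only (a crude regex sketch; a real implementation should follow the HTML5 prescan rules, and a proper HTML parser is more robust):

import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Crude sketch: looks for <meta charset="..."> as well as the older
// <meta http-equiv="Content-Type" content="text/html; charset=...">.
static String charsetFromMeta(String htmlPrefix) {
    Matcher m = Pattern
            .compile("<meta[^>]*charset\\s*=\\s*[\"']?([\\w-]+)", Pattern.CASE_INSENSITIVE)
            .matcher(htmlPrefix);
    return m.find() ? m.group(1) : null; // null: no declaration found
}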

App engine Url request utf-8 characters becoming '??' or '???'

I have an error where I am loading data from a web service into the datastore. The problem is that the XML returned from the web service has UTF-8 characters, and App Engine is not interpreting them correctly; it renders them as ??.
I'm fairly sure I've tracked this down to the URL Fetch request. The basic flow is: Task queue -> fetch the web-service data -> put data into datastore so it definitely has nothing to do with request or response encoding of the main site.
I put log messages before and after Apache Digester to see if that was the cause, but determined it was not. This is what I saw in logs:
string from the XML: "Doppelg��nger"
After digester processed: "Doppelg??nger"
Here is my url fetching code:
public static String getUrl(String pageUrl) {
    StringBuilder data = new StringBuilder();
    log.info("Requesting: " + pageUrl);
    for (int i = 0; i < 5; i++) {
        try {
            URL url = new URL(pageUrl);
            URLConnection connection = url.openConnection();
            connection.connect();
            BufferedReader reader = new BufferedReader(new InputStreamReader(connection.getInputStream()));
            String line;
            while ((line = reader.readLine()) != null) {
                data.append(line);
            }
            reader.close();
            break;
        } catch (Exception e) {
            log.warn("Failed to load page: " + pageUrl, e);
        }
    }
    String resp = data.toString();
    if (resp.isEmpty()) {
        return null;
    }
    return resp;
}
Is there a way I can force this to recognize the input as UTF-8? I tested the page I am loading and the W3C validator recognized it as valid UTF-8.
The issue is only on app engine servers, it works fine in the development server.
Thanks
try
BufferedReader reader = new BufferedReader(new InputStreamReader(connection.getInputStream(), "UTF-8"));
I ran into the same issue 3 months back, Mike. It does look like it, and I would assume your problem is the same.
Let me recollect and put it down here. Feel free to add anything I miss.
My setup was Tomcat and Struts.
And the way I resolved it was through the correct configs in Tomcat.
Basically, Tomcat itself has to support the UTF-8 characters: set useBodyEncodingForURI on the connector; this is for GET params.
Plus, you can use a filter for POST params.
A good resource where you can find all of this under one roof is Click here!
I had a problem in production thereafter where I had an Apache web server redirecting requests to Tomcat :). Similarly, you have to enable UTF-8 there too. The moral of the story: resolve the problems as they come :)
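The filter mentioned for POST params can be a few lines of Java (a sketch of the usual pattern; register it in web.xml so it runs before any parameter is read):

import java.io.IOException;
import javax.servlet.*;

// Sketch of the usual encoding filter: force UTF-8 on the request body
// before any parameter is parsed, so POST params decode correctly.
public class Utf8EncodingFilter implements Filter {
    public void init(FilterConfig config) {}

    public void doFilter(ServletRequest req, ServletResponse resp, FilterChain chain)
            throws IOException, ServletException {
        req.setCharacterEncoding("UTF-8");
        chain.doFilter(req, resp);
    }

    public void destroy() {}
}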
