XML Parsing Error: junk after document element - REST - java

I am working on a RESTful web service that will return a list of RSS feeds someone has added to a feed list I previously implemented.
If I return a TEXT_PLAIN reply, it displays just fine in the browser, but when I attempt to return an APPLICATION_XML reply, I get the following error:
XML Parsing Error: junk after document element
Location: http://localhost:8080/Assignment1/api/feedlist
Line Number 1, Column 135:SMH Top Headlineshttp://feeds.smh.com.au/rssheadlines/top.xmlUTS Library Newshttp://www.lib.uts.edu.au/news/feed/all
Here is the code - I cannot figure out why it is not returning a well-formed XML page (I have also tried formatting the XML reply with new lines and spaces/indents - of course this did not work):
package au.com.rest;

import java.io.FileNotFoundException;
import java.io.IOException;
import javax.ws.rs.*;
import javax.ws.rs.core.*;
import au.edu.uts.it.wsd.*;

@Path("/feedlist")
public class RESTFeedService {
    String feedFile = "/tmp/feeds.txt";
    String textReply = "";
    String xmlReply = "<?xml version=\"1.0\"?><feeds>";
    FeedList feedList = new FeedListImpl();

    @GET
    @Produces(MediaType.APPLICATION_XML)
    public String showXmlFeeds() throws FileNotFoundException, IOException
    {
        feedList.load(feedFile);
        for (Feed f : feedList.list()) {
            xmlReply += "<feed><name>" + f.getName() + "</name>";
            xmlReply += "<uri>" + f.getURI() + "</uri></feed></feeds>";
        }
        return xmlReply;
    }
}

EDIT: I've spotted the immediate problem now. You're closing the feeds element on every input element:
for (Feed f : feedList.list()) {
    xmlReply += "<feed><name>" + f.getName() + "</name>";
    xmlReply += "<uri>" + f.getURI() + "</uri></feed></feeds>";
}
The minimal change would be:
for (Feed f : feedList.list()) {
    xmlReply += "<feed><name>" + f.getName() + "</name>";
    xmlReply += "<uri>" + f.getURI() + "</uri></feed>";
}
xmlReply += "</feeds>";
... but you should still apply the rest of the advice below.
First step - you need to diagnose the problem further. Look at the source in the browser to see exactly what it's complaining about. Can you see the problem in the XML yourself? What does it look like?
Without knowing the REST framework you're using, this looks like it could be a problem with a single instance servicing multiple requests. For some reason you've got an instance variable which you're mutating in your method. Why would you want to do that? If a new instance of your class is created for each request, it shouldn't be a problem - but I don't know if that's the case.
As a first change, try moving this line:
String xmlReply = "<?xml version=\"1.0\"?><feeds>";
into the method as a local variable.
After that though:
Keep all your fields private
Avoid using string concatenation in a loop like this
More importantly, don't build up XML by hand - use an XML API to do it. (The built-in Java APIs aren't nice, but there are plenty of alternatives.)
Consider which of these fields (if any) is really state of the object rather than something which should be a local variable. What state does your object logically have at all?
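Putting that advice together, here is a minimal self-contained sketch of the corrected handler logic. The Feed class and buildXmlReply method below are stand-in stubs invented for the example (the question's real Feed/FeedList classes live in au.edu.uts.it.wsd); the point is that the reply is built in a local StringBuilder and </feeds> is appended exactly once, after the loop:

```java
import java.util.Arrays;
import java.util.List;

public class FeedXmlSketch {
    // Stub standing in for the question's Feed class, so the sketch compiles.
    static class Feed {
        private final String name, uri;
        Feed(String name, String uri) { this.name = name; this.uri = uri; }
        String getName() { return name; }
        String getURI() { return uri; }
    }

    // Builds the whole reply locally: no shared mutable state between requests.
    static String buildXmlReply(List<Feed> feeds) {
        StringBuilder xml = new StringBuilder("<?xml version=\"1.0\"?><feeds>");
        for (Feed f : feeds) {
            xml.append("<feed><name>").append(f.getName()).append("</name>");
            xml.append("<uri>").append(f.getURI()).append("</uri></feed>");
        }
        return xml.append("</feeds>").toString(); // root closed exactly once
    }

    public static void main(String[] args) {
        List<Feed> feeds = Arrays.asList(
                new Feed("SMH Top Headlines", "http://feeds.smh.com.au/rssheadlines/top.xml"));
        System.out.println(buildXmlReply(feeds));
    }
}
```

Note this still concatenates XML by hand; as said above, a real implementation should use an XML API (or let JAX-RS marshal an annotated object for you).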

Related

Parsing Information from URL Using Jsoup

I need help with my Java project using Jsoup (if you think there is a more efficient way to achieve the purpose, please let me know). The purpose of my program is to parse certain useful information from different URLs and put it in a text file. I am not an expert in HTML or JavaScript, so it has been difficult for me to code in Java exactly what I want to parse.
On the website that you see in the code below as one of the examples, the information that interests me is everything in the table under “Routing” (Route, Location, Vessel/Voyage, Container Arrival Date, Container Departure Date; = Origin, Seattle SSA Terminal T18, 26 Jun 15 A, 26 Jun 15 A… and so on).
So far, with Jsoup we are only able to parse the title of the website; we have been unsuccessful in getting any of the body.
Here is the code that I used, which I got from an online source:
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class Jsouptest71115 {
    public static void main(String[] args) throws Exception {
        String url = "http://google.com/gentrack/trackingMain.do "
                + "?trackInput01=999061985";
        Document document = Jsoup.connect(url).get();
        String title = document.title();
        System.out.println("title : " + title);
        String body = document.select("body").text();
        System.out.println("Body: " + body);
    }
}
Working code:
import org.jsoup.Connection;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
import java.io.IOException;
import java.util.ArrayList;

public class Sample {
    public static void main(String[] args) {
        String url = "http://homeport8.apl.com/gentrack/blRoutingPopup.do";
        try {
            Connection.Response response = Jsoup.connect(url)
                    .data("blNbr", "999061985") // tracking number
                    .method(Connection.Method.POST)
                    .execute();
            Element tableElement = response.parse().getElementsByTag("table")
                    .get(2).getElementsByTag("table")
                    .get(2);
            Elements trElements = tableElement.getElementsByTag("tr");
            ArrayList<ArrayList<String>> tableArrayList = new ArrayList<>();
            for (Element trElement : trElements) {
                ArrayList<String> columnList = new ArrayList<>();
                for (int i = 0; i < 5; i++) {
                    columnList.add(i, trElement.children().get(i).text());
                }
                tableArrayList.add(columnList);
            }
            System.out.println("Origin/Location: "
                    + tableArrayList.get(1).get(1)); // row and column number
            System.out.println("Discharge Port/Container Arrival Date: "
                    + tableArrayList.get(5).get(3));
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}
Output:
Origin/Location: SEATTLE SSA TERMINAL (T18), WA  
Discharge Port/Container Arrival Date: 23 Jul 15  E
You need to utilize document.select(...) - the select method's input is a CSS selector. To learn more about CSS selectors just google it, or read this. Using CSS selectors you can easily identify parts of a web page's body.
In your particular case you will have a different problem, though: the table you are after is inside an IFrame. If you look at the HTML of the web page you are visiting, its (the iframe's) URL is "http://homeport8.apl.com/gentrack/blRoutingFrame.do", and if you visit this URL directly to access its content, you will get an exception, which is perhaps some restriction from the server. To get the content properly you need to visit two URLs via Jsoup: 1. http://homeport8.apl.com/gentrack/trackingMain.do?trackInput01=999061985 and 2. http://homeport8.apl.com/gentrack/blRoutingFrame.do?trackInput01=999061985
For the first URL you'll get nothing useful, but for the second URL you'll get the tables of your interest. Then try using document.select("table"), which will give you a list of tables; iterate over this list and find the table of your interest. Once you have the table, use Element.select("tr") to get the table rows, and then for each "tr" use Element.select("td") to get the table cell data.
The webpage you are visiting didn't use CSS class and id selectors, which would have made reading it with jsoup a lot easier, so I am afraid iterating over document.select("table") is your best and easiest option.
Good Luck.
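As a sketch of that table walk, here it is run against an inline HTML snippet standing in for the fetched page, so no network access is needed (the class name and sample rows are made up for the example):

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import java.util.ArrayList;
import java.util.List;

public class TableWalkSketch {
    // Collects the text of every cell, row by row, from each table:
    // the select("table") / select("tr") / select("td") walk described above.
    static List<List<String>> walkTables(String html) {
        List<List<String>> rows = new ArrayList<>();
        Document doc = Jsoup.parse(html);
        for (Element table : doc.select("table")) {
            for (Element tr : table.select("tr")) {
                List<String> cells = new ArrayList<>();
                for (Element td : tr.select("td")) {
                    cells.add(td.text());
                }
                rows.add(cells);
            }
        }
        return rows;
    }

    public static void main(String[] args) {
        String html = "<table><tr><td>Origin</td><td>Seattle SSA Terminal (T18)</td></tr>"
                + "<tr><td>Arrival</td><td>26 Jun 15 A</td></tr></table>";
        System.out.println(walkTables(html));
    }
}
```

Against the real page you would obtain the Document from Jsoup.connect(...) as in the working code above, then apply the same walk.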

IO Exception using java-google-translate-text-to-speech Api

I am having issues using java-google-translate-text-to-speech, trying to translate text from one language to another. This is my code:
import com.gtranslate.Language;
import com.gtranslate.Translator;
import java.io.IOException;

public class Main {
    public static void main(String[] args) {
        Translator translate = Translator.getInstance();
        String text = translate.translate("Hello", Language.ENGLISH, Language.PORTUGUESE);
        System.out.println(text);
    }
}
It's giving me an error:
java.io.IOException: Server returned HTTP response code: 503 for URL: http://ipv4.google.com/sorry/IndexRedirect?continue=http://translate.google.com.br/translate_a/t%3Fclient%3Dt%26text%3DHello%26hl%3Den%26sl%3Den%26tl%3Den%26multires%3D1%26prev%3Dbtn%26ssel%3D0%26tsel%3D0%26sc%3D1&q=CGMSBHqsFhAY_L3FqQUiGQDxp4NLxnAO-gsMAyd56ktUpufqNjEC280
at sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1459)
at com.gtranslate.utils.WebUtils.source(WebUtils.java:24)
at com.gtranslate.parsing.ParseTextTranslate.parse(ParseTextTranslate.java:19)
Just found a simple 2-step solution. Please see Comment #4 at the following URL (requires only a minor modification of the sources):
https://code.google.com/p/java-google-translate-text-to-speech/issues/detail?id=8
STEP 1 in Comment #4 is straightforward. Let me cite it from the above webpage:
In class com.gtranslate.URLCONSTANT change
public static final String GOOGLE_TRANSLATE_TEXT = "http://translate.google.com.br/translate_a/t?";
to
public static final String GOOGLE_TRANSLATE_TEXT1 = "http://translate.google.com.br/translate_a/single?";
...however in STEP 2 it is much simpler just to add a &dt=t URL parameter-value pair at the end of the generated URL in the com.gtranslate.parsing.ParseTextTranslate.appendURL() method.
...the original STEP 2 in Comment #4 above was the following, I cite (FYR):
STEP 2) In the class com.gtranslate.parsing.ParseTextTranslate, the appendURL function needs to be changed as shown:
@Override
public void appendURL() {
    Text input = textTranslate.getInput();
    Text output = textTranslate.getOutput();
    url = new StringBuilder(URLCONSTANTS.GOOGLE_TRANSLATE_TEXT);
    /*
    url.append("client=t&text=" + input.getText().replace(" ", "%20"));
    url.append("&hl=" + input.getLanguage());
    url.append("&sl=" + input.getLanguage());
    url.append("&tl=" + output.getLanguage());
    url.append("&multires=1&prev=btn&ssel=0&tsel=0&sc=1");
    */
    url = new StringBuilder(URLCONSTANTS.GOOGLE_TRANSLATE_TEXT);
    url.append("client=t&sl=auto&tl=" + output.getLanguage()
            + "&hl=" + input.getLanguage()
            + "&dt=bd&dt=ex&dt=ld&dt=md&dt=qc&dt=rw&dt=rm&dt=ss&dt=t&dt=at&ie=UTF-8&oe=UTF-8&otf=1&rom=1&ssel=0&tsel=3&kc=1&tk=620730|996163"
            + "&q=" + input.getText().replace(" ", "%20"));
}
...end of citation. So, for example, just replace this line in the appendURL() method:
url.append("&multires=1&prev=btn&ssel=0&tsel=0&sc=1");
...to this:
url.append("&multires=1&prev=btn&ssel=0&tsel=0&sc=1&dt=t");
Additionally here are some values for the dt URL param, which practically specifies what to return in the reply:
t - translation of source text
at - alternate translations
rm - transcription / transliteration of source and translated texts
bd - dictionary, in case source text is one word (you get translations with articles, reverse translations, etc.)
md - definitions of source text, if it's one word
ss - synonyms of source text, if it's one word
ex - examples
...
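To illustrate how those dt values end up in the request, here is a small self-contained URL-builder sketch in the spirit of the rewritten appendURL() above. The constant and parameter names mirror the cited snippet, but keep in mind this is Google's undocumented internal endpoint, which may change or be blocked at any time:

```java
public class TranslateUrlSketch {
    static final String GOOGLE_TRANSLATE_TEXT =
            "http://translate.google.com.br/translate_a/single?";

    // Builds the request URL; each dt pair asks the endpoint for one kind
    // of data (dt=t is the plain translation of the source text).
    static String buildUrl(String sourceLang, String targetLang, String text) {
        StringBuilder url = new StringBuilder(GOOGLE_TRANSLATE_TEXT);
        url.append("client=t&sl=").append(sourceLang);
        url.append("&tl=").append(targetLang);
        url.append("&dt=t&ie=UTF-8&oe=UTF-8");
        url.append("&q=").append(text.replace(" ", "%20"));
        return url.toString();
    }

    public static void main(String[] args) {
        System.out.println(buildUrl("en", "pt", "Hello world"));
    }
}
```

Appending more dt pairs (dt=at, dt=rm, ...) requests the extra result sections listed above.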
P.S.: A similar HTTP 503 error happens with Google TTS (due to the background API change). You can find the solution to that problem here: Text to Speech 503 and Captcha Now
HTTP response code 503 (Service Unavailable) means the server is currently unavailable, possibly because it is overloaded or down for maintenance.
The server might currently be unable to handle the request due to temporary overloading or maintenance.
Note: some servers may simply refuse the connection, which can also result in a 503 response.

java.lang.NullPointerException trying to get specific values from hashmap

I've spent several frustrating days on this now and would appreciate some help. I have a Java agent in Lotus Domino 8.5.3 which is activated by a CGI POST from my LotusScript validation agent, which checks that the customer has filled in the billing and delivery address form. This is the code that parses the incoming data into a HashMap where field names are mapped to their respective values.
HashMap hmParam = new HashMap(); // Our HashMap for request_content data
// Grab transaction parameters from form that called agent (CGI: request_content)
if (contentDecoded != null) {
    String[] arrParam = contentDecoded.split("&");
    for (int i = 0; i < arrParam.length; i++) {
        int n = arrParam[i].indexOf("=");
        String paramName = arrParam[i].substring(0, n);
        String paramValue = arrParam[i].substring(n + 1, arrParam[i].length());
        hmParam.put(paramName, paramValue); // Old HashMap
        if (paramName.equalsIgnoreCase("transaction_id")) {
            transactionID = paramValue;
            description = "Order " + transactionID + " from Fareham Wine Cellar";
            //System.out.println("OrderID = " + transactionID);
        }
        if (paramName.equalsIgnoreCase("amount")) {
            orderTotal = paramValue;
        }
        if (paramName.equalsIgnoreCase("deliveryCharge")) {
            shipping = paramValue;
        }
    }
}
The block of code above dates back over a year to my original integration of shopping cart to Barclays EPDQ payment gateway. In that agent I recover the specific values and build a form that is then submitted to EPDQ CPI later on in the agent like this;
out.print("<input type=\"hidden\" name=\"shipping\" value=\"");
out.println(hmParam.get("shipping") + "\">");
I want to do exactly the same thing here, except when I try, the agent crashes with a null pointer exception. I can successfully iterate through the HashMap with the snippet below, so I know the data is present, but I can't understand why I can't use hmParam.get(key) to get each field value in the order I want them for the HTML form. The original agent in another application is still in use, so what is going on? The data, too, is essentially unchanged: String field names mapped to String values.
Iterator it = cgiData.entrySet().iterator();
while (it.hasNext()) {
    Map.Entry pairs = (Map.Entry) it.next();
    out.println("<br />" + pairs.getKey() + " = " + pairs.getValue());
    //System.out.println(pairs.getKey() + " = " + pairs.getValue());
}
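For comparison, here is a self-contained sketch of the same parse-then-look-up pattern with generics and a null-safe lookup. A get() on a missing key returns null, and it is dereferencing that null (e.g. calling toString() on it) that throws the NullPointerException; the class and field names below are made up for the example:

```java
import java.util.HashMap;
import java.util.Map;

public class CgiParamSketch {
    // Parses "name=value&name=value" pairs into a typed map, mirroring
    // the loop in the question.
    static Map<String, String> parseParams(String contentDecoded) {
        Map<String, String> params = new HashMap<>();
        for (String pair : contentDecoded.split("&")) {
            int n = pair.indexOf('=');
            if (n < 0) continue; // skip malformed pairs
            params.put(pair.substring(0, n), pair.substring(n + 1));
        }
        return params;
    }

    public static void main(String[] args) {
        Map<String, String> p = parseParams("transaction_id=42&amount=19.99");
        // getOrDefault never returns null for a missing key, so the
        // subsequent string handling cannot throw a NullPointerException.
        System.out.println("shipping: " + p.getOrDefault("shipping", ""));
        System.out.println("amount: " + p.get("amount"));
    }
}
```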
I did two things that may have had an impact, in the process of trying to debug what was going on I needed these further imports;
import java.util.Iterator;
import java.util.Map;
Although I'm not iterating over the HashMap, I've left them in just in case, which gives me the option of dumping the HashMap out to my system audit trail when the application is in production. In variations of the snippet below, after it started working, I was able to get at any of the data I needed, even if the value was null, and toString() also seemed to be optional, as it made no difference to the output.
String cgiValue = "";
cgiValue = hmParam.get("ship_to_lastname").toString();
out.println("<br />Lastname: " + cgiValue);
out.println("<br />Company name: " + hmParam.get("bill_to_company"));
out.println("<br />First name: " + hmParam.get("ship_to_firstname"));
The second thing I did, while trying to get code to work was I enabled the option "Compile Java code with debugging information" for the agent, this may have done something to the way the project was built within the Domino Developer client.
I think I have to put this down to some sort of internal error created when Domino Designer compiled the code. I had a major crash last night while working on this, which necessitated a cold boot of my laptop. You may also find, when using Domino Designer 8.5.x, that strange things can happen if you don't completely close down all the tasks from time to time with KillNotes.

How to parse data in Talend with Java (coming from a previously produced .txt file)?

I have a process in Talend which gets the search result of a page, saves the html and writes it into files, as seen here:
Initially I had a two-step process, parsing out the data from the HTML files in Java. It works and writes the results to a MySQL database. Here is the code that does exactly that. (I'm a beginner, sorry for the lack of elegance.)
package org.jsoup.examples;

import java.io.*;
import org.jsoup.*;
import org.jsoup.nodes.*;
import org.jsoup.select.Elements;
import java.io.IOException;

public class parse2 {
    static parse2 parseIt2 = new parse2();
    String companyName = "Platzhalter";
    String jobTitle = "Platzhalter";
    String location = "Platzhalter";
    String timeAdded = "Platzhalter";

    public static void main(String[] args) throws IOException {
        parseIt2.getData();
    }

    public void getData() throws IOException {
        Document document = Jsoup.parse(new File("C:/Talend/workspace/WEBCRAWLER/output/keywords_SOA.txt"), "utf-8");
        Elements elements = document.select(".joblisting");
        for (Element element : elements) {
            // Parse data into Elements
            Elements jobTitleElement = element.select(".job_title span");
            Elements companyNameElement = element.select(".company_name span[itemprop=name]");
            Elements locationElement = element.select(".locality span[itemprop=addressLocality]");
            Elements dateElement = element.select(".job_date_added [datetime]");
            // Strip data from unnecessary tags
            String companyName = companyNameElement.text();
            String jobTitle = jobTitleElement.text();
            String location = locationElement.text();
            String timeAdded = dateElement.attr("datetime");
            System.out.println("Firma:\t" + companyName + "\t" + jobTitle + "\t in:\t" + location + " \t Erstellt am \t" + timeAdded);
        }
    }
}
Now I want to do the process end-to-end in Talend, and I was assured this works.
I tried this (which looks quite shady to me):
Basically I put all the imports in "advanced settings" and the code in the "basic settings" section. The importLibrary is supposed to load the jsoup parsing library, as well as the MySQL connector (I might do the connection with Talend tools, though).
Obviously this isn't working. I tried to strip the base code of classes and such, and it was even worse. Can you help me get the generated .txt files parsed with Java here?
EDIT: Here is the Link to the talend Job http://www.share-online.biz/dl/8M5MD99NR1
EDIT2: I changed the code to the one I tried in JavaFlex. But it didn't work (the import part in the "start" section of the code, the rest in "body/main", and nothing in "end").
This is a problem related to Talend. In your code, use fully qualified method names, including their packages. For your document parsing, for example, you can use:
Document document = org.jsoup.Jsoup.parse(new File("C:/Talend/workspace/WEBCRAWLER/output/keywords_SOA.txt"), "utf-8");

org.xml.sax.SAXParseException: Reference is not allowed in prolog

I am trying to escape html characters of a string and use this string to build a DOM XML using parseXml method shown below. Next, I am trying to insert this DOM document into database. But, when I do that I am getting the following error:
org.xml.sax.SAXParseException: Reference is not allowed in prolog.
I have three questions:
1) I am not sure how to escape double quotes. I tried replaceAll("\"", "&quot;") and am not sure if this is right.
2) Suppose I want a string starting and ending with double quotes (eg: "sony"), how do I code it? I tried something like:
String sony = "\"sony\""
Is this right? Will the above string contain "sony" along with double quotes or is there another way of doing it?
3) I am not sure what the "org.xml.sax.SAXParseException: Reference is not allowed in prolog." error means. Can someone help me fix this?
Thanks,
Sony
Steps in my code:
Utils.java
public static String escapeHtmlEntities(String s) {
    return s.replaceAll("&", "&amp;").replaceAll("<", "&lt;").replaceAll(">", "&gt;").replaceAll("\"", "&quot;").
            replaceAll(":", "&#58;").replaceAll("/", "&#47;");
}
public static Document parseXml(String xml) throws Exception {
    DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
    factory.setNamespaceAware(true);
    DocumentBuilder builder = factory.newDocumentBuilder();
    Document doc = builder.parse(new InputSource(new StringReader(xml)));
    doc.setXmlStandalone(false);
    return doc;
}
TreeController.java
protected void notifyNewEntryCreated(String entryType) throws Exception {
    for (Listener l : treeControlListeners)
        l.newEntryCreated();
    final DomNodeTreeModel domModel = (DomNodeTreeModel) getModel();
    Element parent_item = getSelectedEntry();
    String xml = Utils.escapeHtmlEntities("<entry xmlns=" + "\"http://www.w3.org/2005/atom\"" + "xmlns:libx=" +
            "\"http://libx.org/xml/libx2\">" + "<title>" + "New" + entryType + "</title>" +
            "<updated>2010-71-22T11:08:43z</updated>" + "<author> <name>LibX Team</name>" +
            "<uri>http://libx.org</uri>" + "<email>libx.org@gmail.com</email></author>" +
            "<libx:" + entryType + "></libx:" + entryType + ">" + "</entry>");
    xmlModel.insertNewEntry(xml, getSelectedId());
}
XMLDataModel.java
public void insertNewEntry(String xml, String parent_id) throws Exception {
    insertNewEntry(Utils.parseXml(xml).getDocumentElement(), parent_id);
}

public void insertNewEntry(Element elem, String parent_id) throws Exception {
    // inserting an entry with no libx: tag will create a storage leak
    if (elem.getElementsByTagName("libx:package").getLength() +
            elem.getElementsByTagName("libx:libapp").getLength() +
            elem.getElementsByTagName("libx:module").getLength() < 1) {
        // TODO: throw exception here instead of return
        return;
    }
    XQPreparedExpression xqp = Q.get("insert_new_entry.xq");
    xqp.bindNode(new QName("entry"), elem.getOwnerDocument(), null);
    xqp.bindString(new QName("parent_id"), parent_id, null);
    xqp.executeQuery();
    xqp.close();
    updateRoots();
}
insert_new_entry.xq
declare namespace libx='http://libx.org/xml/libx2';
declare namespace atom='http://www.w3.org/2005/atom';
declare variable $entry as xs:anyAtomicType external;
declare variable $parent_id as xs:string external;
declare variable $feed as xs:anyAtomicType := doc('libx2_feed')/atom:feed;
declare variable $metadata as xs:anyAtomicType := doc('libx2_meta')/metadata;
let $curid := $metadata/curid
return replace value of node $curid with data($curid) + 1,
let $newid := data($metadata/curid) + 1
return insert node
{$newid}{
$entry//
}
into $feed,
let $newid := data($metadata/curid) + 1
return if ($parent_id = 'root') then ()
else
insert node http://libx.org/xml/libx2' /> into
$feed/atom:entry[atom:id=$parent_id]//(libx:module|libx:libapp|libx:package)
To escape a double quote, use the &quot; entity, which is predefined in XML.
So, your example string, say an attribute value, will look like
<person name="&quot;sony&quot;"/>
There is also &apos; for apostrophe/single quote.
I see you have lots of replaceAll calls, but the replacements seem to be the same? There are some other characters that cannot be used literally, but should be escaped:
& --> &amp;
> --> &gt;
< --> &lt;
" --> &quot;
' --> &apos;
(EDIT: ok, I see this is just formatting - the entities are being turned into their actual values when being presented by SO.)
The SAX exception is the parser grumbling because of the invalid XML.
As well as escaping the text, you will need to ensure it adheres to the well-formedness rules of XML. There's quite a bit to get right, so it's often simpler to use a 3rd party library to write out the XML. For example, the XMLWriter in dom4j.
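To illustrate, here is a minimal sketch that builds an entry-like document with the JDK's own DOM API instead of string concatenation, so the serializer handles escaping of &, <, and > in text for you. The element names follow the question's snippet, but the class and method names are made up for the example:

```java
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import java.io.StringWriter;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

public class EntryDomSketch {
    // Builds an <entry> document and serializes it; special characters in
    // text content are escaped by the Transformer, not by hand.
    static String buildEntryXml(String entryType, String title) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder().newDocument();
        Element entry = doc.createElementNS("http://www.w3.org/2005/atom", "entry");
        doc.appendChild(entry);
        Element titleEl = doc.createElement("title");
        titleEl.setTextContent(title); // & and < are escaped for us
        entry.appendChild(titleEl);
        entry.appendChild(doc.createElement(entryType));

        StringWriter out = new StringWriter();
        Transformer t = TransformerFactory.newInstance().newTransformer();
        t.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
        t.transform(new DOMSource(doc), new StreamResult(out));
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(buildEntryXml("module", "New & improved"));
    }
}
```

The same approach (or dom4j's XMLWriter, as mentioned above) removes the need for a hand-rolled escapeHtmlEntities method entirely.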
You can check out the Tidy specification. It's a spec released by the W3C, and almost all recent languages have their own implementation.
Rather than replacing, or caring only about, <, >, and &, just configure the JTidy (for Java) options and parse. This abstracts away all the complication of XML escaping.
I have used Python, Java, and MarkLogic based Tidy implementations; all served my purposes.
