Note that what I want is not to get a specified parameter in a servlet, but to get the parameter from a String like this:
res_data=%3C%3Fxml+version%3D%221.0%22+encoding%3D%22utf8%22%3F%3E%3Cdirect_trade_create_res%3E%3Crequest_token%3E201502051324ee4d4baf14d30e3510808c08ee1d%3C%2Frequest_token%3E%3C%2Fdirect_trade_create_res%3E&service=alipay.wap.trade.create.direct&sec_id=MD5&partner=2088611853232587&req_id=20121212344553&v=2.0
It's a URL-encoded UTF-8 string. When I decode it with Python, I get the real data it represents:
res_data=<?xml version="1.0" encoding="utf-8"?><direct_trade_create_res><request_token>201502051324ee4d4baf14d30e3510808c08ee1d</request_token></direct_trade_create_res>&service=alipay.wap.trade.create.direct&sec_id=MD5&partner=2088611853232587&req_id=20121212344553&v=2.0
I want to get the res_data parameter; more specifically, I just want the request_token in the XML of res_data.
I know I can use a regex to do this, but is there a more suitable way, using a library such as an Apache URL library or something else, to get the res_data parameter more elegantly? Maybe by borrowing some components from the servlet mechanism?
Since you say you don't want to hack it with a regex you might use a proper XML parser, although for such a small example it is probably overkill.
If you can assume that you can simply split your string on &'s, i.e., there aren't any &'s in there that do not signal the boundary of two attribute-value pairs, you can first decode the string, then extract the attribute-value pairs from it and finally use a DOM parser + XPath to get to the request token:
// split up URL parameters into attribute value pairs
String[] pairs = s.split("&");
// expect the first attribute/value pair to contain the data
// and decode the URL escape sequences
String resData = URLDecoder.decode(pairs[0], "utf-8");
int equalIndex = resData.indexOf("=");
if (equalIndex >= 0) {
// the value is right of the '=' sign
String xmlString = resData.substring(equalIndex + 1);
// prepare XML parser
DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
DocumentBuilder parser = dbf.newDocumentBuilder();
InputSource is = new InputSource(new StringReader(xmlString));
Document doc = parser.parse(is);
// prepare XPath expression to extract request token
XPath xpath = XPathFactory.newInstance().newXPath();
XPathExpression xp = xpath.compile("//request_token/text()");
String requestToken = xp.evaluate(doc);
}
You can use java.net.URLDecoder. Assuming the parameter is in a string called param (and you have already split it away from the other parameters that were connected to it by &):
String[] splitString = param.split("=");
String realData = null;
try {
realData = java.net.URLDecoder.decode( splitString[1], "UTF-8" );
} catch ( UnsupportedEncodingException e ) {
// Nothing to do, it should not happen as you supplied a standard one
}
Once you do that, you can parse it with the XML parser of your choice and extract whatever you want. Don't try to parse XML with a regex, though.
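To tie the two answers above together, here is a minimal sketch that does not rely on res_data being the first parameter. It splits the query string, URL-decodes every key and value into a map, and then runs the XPath query on the res_data value. The class and method names (QueryStringToken, parseQuery, requestToken) are made up for illustration, and the code assumes the string is a plain application/x-www-form-urlencoded query encoded as UTF-8.

```java
import java.io.StringReader;
import java.net.URLDecoder;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

public class QueryStringToken {

    // Splits "a=1&b=2" into decoded key/value pairs, keeping the original order.
    static Map<String, String> parseQuery(String query) {
        Map<String, String> params = new LinkedHashMap<>();
        try {
            for (String pair : query.split("&")) {
                int eq = pair.indexOf('=');
                String key = eq >= 0 ? pair.substring(0, eq) : pair;
                String value = eq >= 0 ? pair.substring(eq + 1) : "";
                // Decode after splitting, so an encoded & or = inside a value is safe.
                params.put(URLDecoder.decode(key, "UTF-8"), URLDecoder.decode(value, "UTF-8"));
            }
        } catch (java.io.UnsupportedEncodingException e) {
            throw new IllegalStateException("UTF-8 is always supported", e);
        }
        return params;
    }

    // Looks up res_data, parses it as XML and returns the text of request_token.
    static String requestToken(String query) {
        try {
            String xml = parseQuery(query).get("res_data");
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            return XPathFactory.newInstance().newXPath()
                    .evaluate("//request_token/text()", doc);
        } catch (Exception e) {
            throw new IllegalStateException("Could not extract request_token", e);
        }
    }
}
```

Calling requestToken with the encoded string from the question should give back the token text. Splitting before decoding matters here, because the decoded XML may itself contain = or & characters.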
I'm using SAX with the Xalan implementation (v. 2.7.2). I have a string in HTML format
"<p>Test k&quot;nnen</p>"
and I have to pass it as the content of an XML tag.
The result is:
"&lt;p&gt;Test k&amp;quot;nnen&lt;/p&gt;"
Xalan encodes the ampersand even though it is part of an already escaped entity.
Does anyone know a way to make Xalan understand escaped entities and not escape their ampersand?
One possible solution is to add startCDATA() to the transformerHandler, but that's not something I can use in my code.
public class TestSax{
public static void main(String[] args) throws TransformerConfigurationException, SAXException {
TestSax t = new TestSax();
System.out.println(t.createSAXXML());
}
public String createSAXXML() throws SAXException, TransformerConfigurationException {
Writer writer = new StringWriter( );
StreamResult streamResult = new StreamResult(writer);
SAXTransformerFactory transformerFactory =
(SAXTransformerFactory) SAXTransformerFactory.newInstance( );
String data = null;
TransformerHandler transformerHandler =
transformerFactory.newTransformerHandler( );
transformerHandler.setResult(streamResult);
transformerHandler.startDocument( );
transformerHandler.startElement(null,"decimal","decimal", null);
data = "<p>Test k&quot;nnen</p>";
transformerHandler.characters(data.toCharArray(),0,data.length( ));
transformerHandler.endElement(null,"decimal","decimal");
transformerHandler.endDocument( );
return writer.toString( );
}}
If your input is XML, then you need to parse it. Then <p> and </p> will be recognized as tags, and &quot; will be recognized as an entity reference.
On the other hand, if you want to treat it as a string and pass it through XML machinery, then "<" and "&" are going to be preserved as ordinary characters, which means they will be escaped as &lt; and &amp; respectively.
If you want "<" treated as an ordinary character but "&" treated with its XML meaning, then you need software with some kind of split personality, and you're not going to get that off-the-shelf.
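For the first option (treat the input as XML and parse it), a rough sketch could look like the following. It parses the fragment into a DOM, imports it under a decimal element, and serializes the result, so the markup is kept as markup instead of being escaped. The class name FragmentAsXml and the wrap method are invented for this example, and it assumes the fragment is well-formed XML with a single root element.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

public class FragmentAsXml {

    // Parses the fragment and nests it inside a <decimal> element.
    static String wrap(String fragment) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // Parsing resolves entity references such as &quot; into characters.
            Document parsed = builder.parse(new InputSource(new StringReader(fragment)));

            Document out = builder.newDocument();
            Element root = out.createElement("decimal");
            out.appendChild(root);
            // importNode copies the parsed element tree into the output document.
            root.appendChild(out.importNode(parsed.getDocumentElement(), true));

            Transformer transformer = TransformerFactory.newInstance().newTransformer();
            transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter writer = new StringWriter();
            transformer.transform(new DOMSource(out), new StreamResult(writer));
            return writer.toString();
        } catch (Exception e) {
            throw new IllegalStateException("Fragment is not well-formed XML", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(wrap("<p>Test k&quot;nnen</p>"));
    }
}
```

With this approach the p element ends up as a real child element of decimal, which may or may not be what the consumer of the document expects.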
I am writing a piece of code in Java to send data to and receive data from a web page. But when I receive the XML data, it is still between tags, like this:
<?xml version='1.0'?>
<document>
<title> TEST </title>
</document>
How can I get the data without the tags in Java?
This is what I tried. The function writes the data, then should read the response and print it with System.out.println.
public static String User_Select(String username, String password) {
String mysql_type = "1"; // 1 = Select
try {
String urlParameters = "mysql_type=" + mysql_type + "&username=" + username + "&password=" + password;
URL url = new URL("http://localhost:8080/HTTP_Connection/index.php");
URLConnection conn = url.openConnection();
conn.setDoOutput(true);
OutputStreamWriter writer = new OutputStreamWriter(conn.getOutputStream());
writer.write(urlParameters);
writer.flush();
String line;
BufferedReader reader = new BufferedReader(new InputStreamReader(conn.getInputStream()));
while ((line = reader.readLine()) != null) {
System.out.println(line);
//System.out.println("Het werkt!!");
}
writer.close();
reader.close();
return line;
} catch (IOException iox) {
iox.printStackTrace();
return null;
}
}
Thanks in advance
I would suggest simply using RegEx to read the XML, and get the tag content that you are after.
That simplifies what you need to do, and limits the inclusion of additional (unnecessary) libraries.
There are also lots of Stack Overflow questions on this topic, for example "Regex for xml parsing" and "In RegEx, I want to find everything between two XML tags", to mention just two of them.
Use a DOM parser in Java.
Check the Java docs for further details.
Use an XML Parser to Parse your XML. Here is a link to Oracle's Tutorial
Oracle Java XML Parser Tutorial
Simply pass the InputStream from URLConnection
Document doc = DocumentBuilderFactory.
newInstance().
newDocumentBuilder().
parse(conn.getInputStream());
From there you could use XPath to query the contents of the document, or simply walk the document model.
Take a look at Java API for XML Processing (JAXP) for more details
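As a small illustration of the XPath route, the sketch below reads the document from an InputStream (for example conn.getInputStream()) and returns the trimmed text of the title element. The class name TitleFromXml is hypothetical, and the XPath expression assumes the document/title layout shown in the question.

```java
import java.io.InputStream;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;

public class TitleFromXml {

    // Parses the stream as XML and returns the text inside /document/title.
    static String title(InputStream in) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(in);
            XPath xpath = XPathFactory.newInstance().newXPath();
            // normalize-space() strips the leading and trailing blanks around the text.
            return xpath.evaluate("normalize-space(/document/title)", doc);
        } catch (Exception e) {
            throw new IllegalStateException("Could not read title from XML", e);
        }
    }
}
```

In the question's method this would replace the readLine loop, with the result of title(conn.getInputStream()) returned to the caller.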
You have to use an XML parser. In your case a good choice is JSoup, which fetches data from the web and parses both XML and HTML. It loads the data, parses it, and gives you what you want. Here is an example of how it works:
1. XML From an URL
String xml = Jsoup.connect("http://localhost:8080/HTTP_Connection/index.php")
.get().toString();
Document doc = Jsoup.parse(xml, "", Parser.xmlParser());
String myTitle = doc.select("title").first().text(); // myTitle now contains TEST
Edit:
To send GET or POST parameters with your request, use this code:
String xml = Jsoup.connect("http://localhost:8080/HTTP_Connection/index.php")
.data("param1Name", "param1Value")
.data("param2Name","param2Value").get().toString();
You can use get() to invoke the HTTP GET method or post() to invoke the HTTP POST method.
2. XML From String
You can use JSoup to parse XML data in a String :
String xmlData="<?xml version='1.0'?><document> <title> TEST </title> </document>" ;
Document doc = Jsoup.parse(xmlData, "", Parser.xmlParser());
String myTitle = doc.select("title").first().text(); // myTitle now contains TEST
I'm building a simulator to post JSON data to a service I'm running.
The JSON should look like this:
{"sensor":
{"id":"SENSOR1","name":"SENSOR","type":"Temperature","value":100.12,"lastDateValue":"\/Date(1382459367723)\/"}
}
I tried this with the "Advanced REST Client" in Chrome and this works fine. The date gets parsed properly by the ServiceStack web service.
So, the point is to write a sensor simulator that posts data like this to the web service.
I created this in Java, so I could run it on my raspberry pi.
This is the code:
public static void main(String[] args) {
String url = "http://localhost:63003/api/sensors";
String sensorname = "Simulated sensor";
int currentTemp = 10;
String dateString = "\\" + "/Date(" + System.currentTimeMillis() + ")\\" + "/";
System.out.println(dateString);
System.out.println("I'm going to post some data to: " + url);
//Creating the JSON Object
JSONObject data = new JSONObject();
data.put("id", sensorname);
data.put("name", sensorname);
data.put("type", "Temperature");
data.put("value", currentTemp);
data.put("lastDateValue", dateString);
JSONObject sensor = new JSONObject().put("sensor", data);
//Print out the data to be sent
StringWriter out = new StringWriter();
sensor.write(out);
String jsonText = out.toString();
System.out.print(jsonText);
//Sending the object
HttpClient c = new DefaultHttpClient();
HttpPost p = new HttpPost(url);
p.setEntity(new StringEntity(sensor.toString(), ContentType.create("application/json")));
try {
HttpResponse r = c.execute(p);
} catch (Exception e) {
e.printStackTrace();
}
}
The output of this program is as follows:
\/Date(1382459367723)\/
I'm going to post some data to: http://localhost:63003/api/sensors
{"sensor":{"lastDateValue":"\\/Date(1382459367723)\\/","id":"Simulated sensor","name":"Simulated sensor","value":10,"type":"Temperature"}}
The issue here is that the JSONObject string still contains these escape characters. But when I print the string at the beginning, it does not contain the escape characters. Is there any way to get rid of these? My service can't parse them.
This is a sample of what I send with the rest client in chrome:
{"sensor":{"id":"I too, am a sensor!","name":"Willy","type":"Temperature","value":100.12,"lastDateValue":"\/Date(1382459367723)\/"}}
JSONObject is correctly encoding the string. This page describes how string literals are to be escaped in JavaScript (and, by extension, JSON). The following note is important to understanding what happens in your example:
For characters not listed in Table 2.1, a preceding backslash is ignored, but this usage is deprecated and should be avoided.
Your example ("\/Date(1382459367723)\/") uses a preceding backslash before a /. Because / is not in table 2.1, the \ should simply be ignored. If your service doesn't ignore the \, then it either has a bug, or is not a JSON parser (perhaps it uses a data format which is similar to, but not quite, JSON).
Since you need to generate non-conforming JSON, you won't be able to use standard tools to do so. Your two options are to write your own not-quite-JSON encoder, or to avoid characters which must be escaped, such as \ and ".
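If you go the "write your own encoder" route, a minimal sketch might build the payload by hand so that the \/ sequences are written exactly as the service expects. Everything here (the MsDateJson class, msDateToken, sensorJson) is made up for illustration, and it assumes id, name and type never contain characters that need JSON escaping, such as quotes or backslashes.

```java
public class MsDateJson {

    // Returns the quoted JSON token "\/Date(<millis>)\/" with literal backslashes.
    static String msDateToken(long epochMillis) {
        return "\"\\/Date(" + epochMillis + ")\\/\"";
    }

    // Builds the sensor payload by plain concatenation (no escaping of the inputs).
    static String sensorJson(String id, String name, String type,
                             double value, long epochMillis) {
        return "{\"sensor\":{"
                + "\"id\":\"" + id + "\","
                + "\"name\":\"" + name + "\","
                + "\"type\":\"" + type + "\","
                + "\"value\":" + value + ","
                + "\"lastDateValue\":" + msDateToken(epochMillis)
                + "}}";
    }

    public static void main(String[] args) {
        System.out.println(sensorJson("SENSOR1", "SENSOR", "Temperature",
                100.12, System.currentTimeMillis()));
    }
}
```

The resulting string can be passed to StringEntity in place of sensor.toString(). For anything beyond fixed, trusted field values, proper escaping of the other strings would be needed.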
@pburka is correct. If you want to send it in \/Date(1382459367723)\/ format, try escaping the backslash twice, as below:
String dateString = "\\\\" + "/Date(" + System.currentTimeMillis() + ")\\\\" + "/";
In the first pass, dateString becomes \\/Date(1382459367723)\\/, and then JSONObject adds extra backslashes internally in its buffer, i.e. \\\/Date(1382459367723)\\\/, so that the backslashes before / are ignored according to JSON parsing rules and you get the desired result, i.e. \/Date(1382459367723)\/
I want to parse an XML document whose tag name contains an &, for example: <xml><OC&C>12.4</OC&C></xml>. I tried to escape & as &amp;, but that didn't fix the issue for the tag name (it only fixes it for values). Currently my code throws an exception; see the complete function below.
public static void main(String[] args) throws Exception
{
String xmlString = "<xml><OC&C>12.4</OC&C></xml>";
xmlString = xmlString.replaceAll("&", "&amp;");
String path = "xml";
InputSource inputSource = new InputSource(new StringReader(xmlString));
try
{
Document xmlDocument = DocumentBuilderFactory.newInstance().newDocumentBuilder().parse(inputSource);
XPath xPath = XPathFactory.newInstance().newXPath();
XPathExpression xPathExpression = xPath.compile(path);
System.out.println("Compiled Successfully.");
}
catch (SAXException e)
{
System.out.println("Error while retrieving node Path:" + path + " from " + xmlString + ". Returning null");
}
}
Hmmm... I don't think that it is a legal XML name. I'd think about using a regex to replace OC&C to something legal first, and then parse it.
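A rough sketch of that regex pre-processing step is below. It rewrites & to _ only inside tag names and leaves text content alone. The class name TagNameSanitizer and the choice of _ as the replacement character are assumptions for this example, and it assumes the input has no CDATA sections or stray < characters in text.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TagNameSanitizer {

    // Matches "<" or "</" followed by the tag name; skips <?...?> and <!...> constructs.
    private static final Pattern TAG_NAME = Pattern.compile("<(/?)([^<>\\s/!?]+)");

    // Replaces every '&' inside a tag name with '_' so the name becomes legal.
    static String sanitize(String raw) {
        Matcher m = TAG_NAME.matcher(raw);
        StringBuffer sb = new StringBuffer();
        while (m.find()) {
            String fixed = "<" + m.group(1) + m.group(2).replace('&', '_');
            // quoteReplacement keeps '$' and '\' in names from being misread.
            m.appendReplacement(sb, Matcher.quoteReplacement(fixed));
        }
        m.appendTail(sb);
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(sanitize("<xml><OC&C>12.4</OC&C></xml>"));
    }
}
```

The sanitized string can then be handed to the XML parser in place of the original. An & in text content or attribute values still has to be escaped separately.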
It's not "an XML". It's a non-XML. XML doesn't allow ampersands in names. Therefore, you can't parse it successfully using an XML parser.
Names beginning with xml are reserved by the XML specification, so xml should not be used as an element name either. For the tag containing &, you could try wrapping the content like this:
<name><![CDATA[<OC&C>12.4</OC&C>]]></name>
I am trying to use boilerpipe java library, to extract news articles from a set of websites.
It works great for English text, but for text with special characters, for example words with accent marks (história), these special characters are not extracted correctly. I think it is an encoding problem.
In the boilerpipe faq, it says "If you extract non-English text you might need to change some parameters" and then refers to a paper. I found no solution in this paper.
My question is: are there any parameters in boilerpipe where I can specify the encoding? Is there any way to work around this and get the text correctly?
How I'm using the library:
(first attempt, based on the URL):
URL url = new URL(link);
String article = ArticleExtractor.INSTANCE.getText(url);
(second attempt, on the HTML source code):
String article = ArticleExtractor.INSTANCE.getText(html_page_as_string);
You don't have to modify inner Boilerpipe classes.
Just pass an InputSource object to the ArticleExtractor.INSTANCE.getText() method and force the encoding on that object. For example:
URL url = new URL("http://some-page-with-utf8-encodeing.tld");
InputSource is = new InputSource();
is.setEncoding("UTF-8");
is.setByteStream(url.openStream());
String text = ArticleExtractor.INSTANCE.getText(is);
Regards!
Well, from what I see, when you use it like that, the library will choose the encoding automatically. From the HTMLFetcher source:
public static HTMLDocument fetch(final URL url) throws IOException {
final URLConnection conn = url.openConnection();
final String ct = conn.getContentType();
Charset cs = Charset.forName("Cp1252");
if (ct != null) {
Matcher m = PAT_CHARSET.matcher(ct);
if(m.find()) {
final String charset = m.group(1);
try {
cs = Charset.forName(charset);
} catch (UnsupportedCharsetException e) {
// keep default
}
}
}
Try debugging their code a bit, starting with ArticleExtractor.getText(URL), and see if you can override the encoding
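To see what the fetch logic above is doing in isolation, here is a small stand-alone sketch of the same idea: pull the charset out of a Content-Type header and fall back to a default when it is missing or unknown. This is not boilerpipe's code. The class name, the regular expression, and the charsetFrom method are all assumptions made for the example.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ContentTypeCharset {

    // Finds "charset=..." in a header such as "text/html; charset=UTF-8".
    private static final Pattern CHARSET =
            Pattern.compile("charset\\s*=\\s*\"?([^\\s;\"]+)", Pattern.CASE_INSENSITIVE);

    // Returns the declared charset, or the fallback if none is declared or it is unknown.
    static Charset charsetFrom(String contentType, Charset fallback) {
        if (contentType == null) {
            return fallback;
        }
        Matcher m = CHARSET.matcher(contentType);
        if (!m.find()) {
            return fallback;
        }
        try {
            return Charset.forName(m.group(1));
        } catch (IllegalArgumentException e) {
            // Covers both illegal and unsupported charset names.
            return fallback;
        }
    }

    public static void main(String[] args) {
        System.out.println(charsetFrom("text/html; charset=ISO-8859-1", StandardCharsets.UTF_8));
    }
}
```

If the server sends no charset at all, the fallback decides how accented characters are decoded, which is the likely reason for the broken text when the default is Cp1252 and the page is actually UTF-8.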
Ok, got a solution.
As Andrei said, I had to change the class HTMLFetcher, which is in the package de.l3s.boilerpipe.sax.
What I did was convert all the fetched text to UTF-8.
At the end of the fetch function, I had to add two lines and change the last one:
final byte[] data = bos.toByteArray(); //stays the same
byte[] utf8 = new String(data, cs.name()).getBytes("UTF-8"); // new line (conversion)
cs = Charset.forName("UTF-8"); // set the charset to UTF-8
return new HTMLDocument(utf8, cs); // edited line
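The conversion step on its own can be sketched as below: decode the bytes with the charset they were fetched in, then re-encode them as UTF-8. The Transcode class and toUtf8 method are hypothetical names, and the sketch assumes the source charset was detected correctly, since decoding with the wrong one cannot be repaired afterwards.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class Transcode {

    // Decodes data using the source charset and re-encodes the text as UTF-8.
    static byte[] toUtf8(byte[] data, Charset source) {
        return new String(data, source).getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] latin1 = "hist\u00f3ria".getBytes(StandardCharsets.ISO_8859_1);
        byte[] utf8 = toUtf8(latin1, StandardCharsets.ISO_8859_1);
        System.out.println(new String(utf8, StandardCharsets.UTF_8));
    }
}
```

Passing a Charset object instead of a charset name avoids the checked UnsupportedEncodingException.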
Boilerpipe's ArticleExtractor uses some algorithms that have been specifically tailored to English, such as measuring the number of words in average phrases. In any language that is more or less verbose than English (i.e., every other language), these algorithms will be less accurate.
Additionally, the library uses some English phrases to try and find the end of the article (comments, post a comment, have your say, etc) which will clearly not work in other languages.
This is not to say that the library will outright fail - just be aware that some modification is likely needed for good results in non-English languages.
Java:
import java.net.URL;
import org.xml.sax.InputSource;
import de.l3s.boilerpipe.extractors.ArticleExtractor;
public class Boilerpipe {
public static void main(String[] args) {
try{
URL url = new URL("http://www.azeri.ru/az/traditions/kuraj_pehlevanov/");
InputSource is = new InputSource();
is.setEncoding("UTF-8");
is.setByteStream(url.openStream());
String text = ArticleExtractor.INSTANCE.getText(is);
System.out.println(text);
}catch(Exception e){
e.printStackTrace();
}
}
}
Eclipse:
Run > Run Configurations > Common Tab. Set Encoding to Other(UTF-8), then click Run.
I had the same problem; cnr's solution works great. Just change the UTF-8 encoding to ISO-8859-1. Thanks.
URL url = new URL("http://some-page-with-utf8-encodeing.tld");
InputSource is = new InputSource();
is.setEncoding("ISO-8859-1");
is.setByteStream(url.openStream());
String text = ArticleExtractor.INSTANCE.getText(is);