The title is hard to understand, I know. I am importing keywords from a CSV file in a format like this:
"Business Intelligence";
"Big Data";
with double quotes. Afterwards I do an HTTP GET request with each of these keywords, like this:
"http://www.stepstone.de/5/ergebnisliste.html?ke="+ context.keywordname +"&li=1000000"
My output file name is built like this:
"C:/Talend/workspace/WEBCRAWLER/output/keywords_" + context.keywordname +".txt"
Obviously you can't write double quotes in the file name. What can I do as a workaround?
I already tried using &quot; in the GET request, but unfortunately it didn't work.
Thank you!
Use HTML encode on the files:
"Business Intelligence";"Big Data";
would become
&quot;Business Intelligence&quot;;&quot;Big Data&quot;;
I used the following site: http://www.opinionatedgeek.com/DotNet/Tools/HTMLEncode/encode.aspx
Unfortunately there's no easy way to do that in Talend; however, you could try using:
java.net.URLEncoder
http://docs.oracle.com/javase/7/docs/api/java/net/URLEncoder.html
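As a rough sketch of how java.net.URLEncoder could be combined with stripping the quotes, assuming the keyword arrives with its surrounding double quotes exactly as in the CSV: the class and method names below (KeywordSanitizer, toFileNamePart, toQueryValue) are made up for illustration and are not part of Talend.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

final class KeywordSanitizer {

    // Remove double quotes, then replace the characters Windows forbids in file names.
    static String toFileNamePart(String keyword) {
        return keyword.replace("\"", "").replaceAll("[\\\\/:*?<>|]", "_").trim();
    }

    // Remove double quotes, then encode the value for use in a URL query string.
    static String toQueryValue(String keyword) {
        try {
            return URLEncoder.encode(keyword.replace("\"", ""), "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is always available on a standard JVM.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String keyword = "\"Business Intelligence\"";
        System.out.println("keywords_" + toFileNamePart(keyword) + ".txt");
        System.out.println("ke=" + toQueryValue(keyword));
    }
}
```

In a Talend job the same expressions could presumably be written inline on context.keywordname instead of going through a helper class.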
If you want to create the file name from the keyword, you can remove the keyword's double quotes with the replace function. Please check the code below; I think it will work for you.
"C:/Talend/workspace/WEBCRAWLER/output/keywords_" + context.keywordname.replace("\"", "") +".txt"
How would I go about cutting this giant string? I need the data of the first ID, preferably with the string output being:
"id":1,"name":"site1","has_doneit":true,"destination":"4613"
Is this possible? Or if not any other ways of grabbing the name:"site1" and the has_doneit:true would be perfectly fine.
{"needs_complete":true,"has_done":true,"sites":[
{"id":1,"name":"site1","has_doneit":true,"destination":"4613"},{"id":2,"name":"site2","has_doneit":true,"destination":"4613"},{"id":3,"name":"site3","has_doneit":true,"destination":"4339"},{"id":4,"name":"site4","has_doneit":true,"destination":"4340"},{"id":5,"name":"site5","has_doneit":true,"destination":"4341"},
{"id":6,"name":"site6","has_doneit":true,"destination":"4622"},{"id":7,"name":"site7","has_doneit":true,"destination":"4623"},{"id":8,"name":"site8","has_doneit":true,"destination":"4828"},
{"id":9,"name":"site9","has_doneit":true,"destination":"4829"},{"id":10,"name":"site10","has_doneit":true,"destination":"4861"}]}
That seems like a JSON string so use a Json Parser...
PHP: JSON_decode
$parsedStr = json_decode($yourString, true);
you can access sites array in $parsedStr['sites']
So, to access the id of the first site:
echo $parsedStr['sites'][0]['id'];
Java
Check this answer on SO: Decoding JSON String in Java
In my Android application I get a JSON response string from a PHP URL. Some hotel names in the response contain an apostrophe, but I get &#39; instead of the apostrophe. How can I parse hotel names with special characters on Android? I can see the apostrophe in the browser but not in the Android logcat.
I have tried jresponse = URLEncoder.encode(jresponse,"UTF-8"); but I still could not get the apostrophe in the hotel name.
This is one of the hotel names in the response.
I see the following in browser.
{"id":12747,
"name":"Oscar's",
....
}
But in the logcat:
id 12747
name Oscar&#39;s
Use the decoder instead of encoder. URLDecoder.decode(jresponse,"UTF-8")
Use ISO-8859-2 when you create the UrlEncodedFormEntity that you send off. You can set this as a parameter in the constructor.
Without a specified charset, you are probably sending the data in UTF-8/UTF-16 (most common) which the server is interpreting in a different way.
EDIT: It looks like ISO-8859-2 doesn't support ñ. You may have to change something server-side. http://en.wikipedia.org/wiki/ISO/IEC_8859-2
You can try the Html class, e.g.:
jresponse = Html.fromHtml(jresponse).toString();
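Outside Android, where the Html class is not available, a similar effect can be sketched in plain Java. This assumes that what arrives in the response is a numeric character reference such as &#39; (named entities like &amp; are not handled), and the class name NumericEntityDecoder is made up for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class NumericEntityDecoder {

    // Matches hexadecimal (&#x27;) and decimal (&#39;) character references.
    private static final Pattern NUMERIC_REF =
            Pattern.compile("&#(?:[xX]([0-9a-fA-F]{1,6})|([0-9]{1,7}));");

    static String decode(String input) {
        Matcher m = NUMERIC_REF.matcher(input);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            int codePoint = m.group(1) != null
                    ? Integer.parseInt(m.group(1), 16)
                    : Integer.parseInt(m.group(2), 10);
            // Leave references to invalid code points untouched.
            String replacement = Character.isValidCodePoint(codePoint)
                    ? new String(Character.toChars(codePoint))
                    : m.group();
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(decode("Oscar&#39;s"));
    }
}
```

Decoding should happen on the individual values after the JSON has been parsed, not on the raw response, so that a decoded quote cannot break the JSON syntax.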
I need to fix an issue with an XSS vulnerability. The code segment is below.
StringBuffer xml = new StringBuffer();
xml.append("<?xml version=\"1.0\"?>");
xml.append("<parent>");
xml.append("<child>");
for(int cntr=0; cntr < dataList.size(); cntr++){
AAAAA obj = (AAAAA) dataList.get(cntr);
if(obj.getStatus().equals(Constants.ACTIVE)){
xml.append("<accountNumber>");
xml.append(obj.getAccountNumber());
xml.append("</accountNumber>");
xml.append("<partnerName>");
xml.append(obj.getPartnerName());
xml.append("</partnerName>");
xml.append("<accountType>");
xml.append(obj.getAccountType());
xml.append("</accountType>");
xml.append("<priority>");
xml.append(obj.getPriority());
xml.append("</priority>");
}
}
xml.append("</child>");
xml.append("</parent>");
response.getWriter().write(xml.toString());
response.setContentType("text/xml");
response.setHeader("Cache-Control", "no-cache");
The issue is at the line response.getWriter().write(xml.toString()); which is reported as vulnerable to XSS attacks. I have done my homework and also installed ESAPI 2.0, but I do not know how to implement the solution.
Please suggest a solution.
You should always escape any text and attribute nodes you insert into an XML document, so I would expect to see
xml.append("<accountType>");
xml.append(escape(obj.getAccountType()));
xml.append("</accountType>");
where escape() looks after characters that need special treatment, eg. "<", "&", "]]>", and surrogate pairs.
Better still, don't construct XML by string concatenation. Use a serialization library that allows you to write
out.startElement("accountType");
out.text(obj.getAccountType());
out.endElement();
(I use a Saxon serializer with the StAX XMLStreamWriter interface when I need to do this, but there are plenty of alternatives available.)
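A minimal sketch of that approach, using the StAX XMLStreamWriter that ships with the JDK rather than Saxon. The class and method names (AccountXmlWriter, accountTypeXml) are made up for illustration, and only one element from the question is shown.

```java
import java.io.StringWriter;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamWriter;

final class AccountXmlWriter {

    // Writes <accountType>...</accountType>, letting the writer escape the text.
    static String accountTypeXml(String accountType) {
        try {
            StringWriter buffer = new StringWriter();
            XMLStreamWriter out = XMLOutputFactory.newInstance().createXMLStreamWriter(buffer);
            out.writeStartElement("accountType");
            out.writeCharacters(accountType); // markup characters are escaped here
            out.writeEndElement();
            out.close();
            return buffer.toString();
        } catch (XMLStreamException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(accountTypeXml("<script>alert(1)</script>"));
    }
}
```

Because the text goes through writeCharacters, a value containing markup ends up in the document as character data instead of as elements.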
As far as I understand:
AAAAA obj = (AAAAA) dataList.get(cntr);
Here you get some data from an external source.
You have to validate this data. Otherwise anyone can put arbitrary data there, which can cause damage on the client side (cookies being stolen, for example).
ANSWER: the code using ESAPI is below.
xml.append(ESAPI.encoder().encodeForXML(desc));
It escapes the data in the variable 'desc'. With this in place, the content of 'desc' is read as data rather than executable code, so it is not executed in the browser when the back-end Java code sends its response.
I have very basic Java code that does an HTTP request, and it works fine. I request data and a ton of HTML comes back. I want to retrieve all the URLs from that page and list them. For a simple first test I made it look like this:
int b = line.indexOf("http://",lastE);
int e = line.indexOf("\"", b);
This works, but as you can imagine it's horrible and only works in 80% of the cases. The only alternative I could come up with myself sounded slow and stupid. So my question is pretty much: how do I go from
String html
to
List<Url>
?
// Match "http://" followed by everything up to the next double quote or whitespace.
Pattern p = Pattern.compile("http://[^\"\\s]+");
Matcher m = p.matcher(yourFetchedHtmlString);
while (m.find()) {
    String nextUrl = m.group(); // Do whatever you want with it
}
You may also have to tweak the regexp, as I have just written it without testing. This should be a very fast way to fetch URLs.
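A self-contained version of the same idea that goes from a String to a List, as a sketch only. It returns List<String> rather than the List<Url> from the question, the class name UrlExtractor is made up, and the pattern is a rough heuristic that also accepts https and stops at quotes, whitespace, and angle brackets.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class UrlExtractor {

    // http or https, then everything up to whitespace, a quote, or an angle bracket.
    private static final Pattern URL = Pattern.compile("https?://[^\\s\"'<>]+");

    static List<String> extractUrls(String html) {
        List<String> urls = new ArrayList<String>();
        Matcher m = URL.matcher(html);
        while (m.find()) {
            urls.add(m.group());
        }
        return urls;
    }

    public static void main(String[] args) {
        String html = "<a href=\"http://a.example/x\">link</a> <img src='https://b.example/y.png'>";
        for (String url : extractUrls(html)) {
            System.out.println(url);
        }
    }
}
```

This only finds absolute URLs that appear literally in the markup; relative links such as href="/page" would need an HTML parser and a base URL.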
I would try a library like HTML Parser to parse the html string and extract all url tags from it.
Your thinking is good; you are just missing some parts.
You should add some known extensions for URLs,
like .html .aspx .php .htm .cgi .js .pl .asp
And if you want images too, then add .gif .jpg .png
I think your approach is fine; it just needs more extension checking.
If you can post the full method code, I will be happy to help you make it better.
I have an XML file which contains many special symbols like ® (HTML number &#174;) etc.
and HTML names like &atilde; (HTML number &#227;) etc.
I am trying to replace these HTML symbols and HTML names with the corresponding HTML numbers using Java. For this, I first read the XML file into a string and then used the replaceAll method:
File fn = new File("myxmlfile.xml");
String content = FileUtils.readFileToString(fn);
content = content.replaceAll("®", "&\#174");
FileUtils.writeStringToFile(fn, content);
But this is not working.
Can anyone please tell me how to do it?
Thanks!
The signature for the replaceAll method is:
public String replaceAll(String regex, String replacement)
You have to be careful that your first parameter is a valid regular expression. The Java Pattern class describes the constructs used in a Java regular expression.
Based on what I see in the Pattern class description, I don't see what's wrong with:
content = content.replaceAll("®", "&\#174");
You could try:
content = content.replaceAll("\\p(®)", "&\#174");
and see if that works better.
I don't think that \# is a valid escape sequence.
BTW, what's wrong with "®" ?
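For comparison, a sketch that avoids regular expressions and the \# escape entirely by walking the string and replacing every non-ASCII character with its numeric reference. The class name NumericEntityEncoder is made up for illustration, and named entities such as &atilde; that are already in the text are not translated.

```java
final class NumericEntityEncoder {

    // Replace every character outside 7-bit ASCII with a decimal reference like &#174;
    static String encodeNonAscii(String input) {
        StringBuilder out = new StringBuilder(input.length());
        for (int i = 0; i < input.length(); ) {
            int codePoint = input.codePointAt(i);
            if (codePoint > 127) {
                out.append("&#").append(codePoint).append(';');
            } else {
                out.appendCodePoint(codePoint);
            }
            // Advance by one or two chars so surrogate pairs are handled as a unit.
            i += Character.charCount(codePoint);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // \u00AE is the registered sign, \u00E3 is a with tilde.
        System.out.println(encodeNonAscii("Brand\u00AE S\u00E3o"));
    }
}
```

For a single known symbol, String.replace (which takes a literal, not a regex) is enough, for example content.replace("\u00AE", "&#174;").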
If you want HTML numbers, try escaping for XML first.
Use StringEscapeUtils from Apache Commons Lang.
Java may have trouble dealing with the raw characters, so I prefer to escape for Java first, and after that for XML or HTML.
// Method names as in Commons Lang 2.x; Lang 3 renames escapeHtml to escapeHtml4.
String escapedStr = StringEscapeUtils.escapeJava(yourString);
escapedStr = StringEscapeUtils.escapeXml(yourString);
escapedStr = StringEscapeUtils.escapeHtml(yourString);