Convert HTML symbols and HTML names to HTML number using Java - java

I have an XML file that contains many special symbols like ® (HTML number &#174;)
and HTML names like &atilde; (HTML number &#227;).
I am trying to replace these HTML symbols and HTML names with the corresponding HTML numbers using Java. For this, I first read the XML file into a string and then used the replaceAll method:
File fn = new File("myxmlfile.xml");
String content = FileUtils.readFileToString(fn);
content = content.replaceAll("®", "&\#174");
FileUtils.writeStringToFile(fn, content);
But this is not working.
Can anyone please tell me how to do it?
Thanks!

The signature for the replaceAll method is:
public String replaceAll(String regex, String replacement)
You have to be careful that your first parameter is a valid regular expression. The Java Pattern class describes the constructs used in a Java regular expression.
Based on what I see in the Pattern class description, I don't see what's wrong with:
content = content.replaceAll("®", "&\#174");
If the regular expression were the problem, you could quote the symbol so that it is matched literally:
content = content.replaceAll(Pattern.quote("®"), "&#174;");
and see if that works better.

I don't think that \# is a valid escape sequence in a Java string literal; the # character does not need escaping there.
BTW, what's wrong with "&#174;"?
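To make the idea concrete, here is a minimal sketch of the conversion the question asks for, using only the JDK. It assumes entities are terminated with a semicolon, and the NamedEntities table is a hypothetical, deliberately tiny stand-in for the full HTML entity list. Any character outside ASCII is written as a decimal numeric reference.

```java
import java.util.Map;

final class NumericEntities {
    // Hypothetical, deliberately tiny lookup table; a real one would cover the full HTML entity list.
    private static final Map<String, Integer> NAMED =
            Map.of("atilde", 227, "reg", 174, "copy", 169);

    // Rewrites known named entities and raw non-ASCII characters as &#NNN; references.
    static String toNumeric(String input) {
        StringBuilder out = new StringBuilder(input.length());
        int i = 0;
        while (i < input.length()) {
            int cp = input.codePointAt(i);
            if (cp == '&') {
                int semi = input.indexOf(';', i);
                if (semi > i + 1) {
                    Integer code = NAMED.get(input.substring(i + 1, semi));
                    if (code != null) {
                        out.append("&#").append(code).append(';');
                        i = semi + 1;
                        continue;
                    }
                }
                out.append('&'); // unknown or numeric entity: leave untouched
            } else if (cp > 127) {
                out.append("&#").append(cp).append(';');
            } else {
                out.append((char) cp);
            }
            i += Character.charCount(cp);
        }
        return out.toString();
    }
}
```

Since this walks the string directly, no regular expression (and therefore no regex escaping) is involved.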

If you want HTML numbers, try escaping for XML first.
Use StringEscapeUtils from Apache Commons Lang.
Java may have trouble dealing with it, so I prefer to escape for Java first, and after that for XML or HTML.
String escapedStr = StringEscapeUtils.escapeJava(yourString);
// Then escape the result for XML...
String xmlEscaped = StringEscapeUtils.escapeXml(escapedStr);
// ...or for HTML (method names as in Commons Lang 2.x).
String htmlEscaped = StringEscapeUtils.escapeHtml(escapedStr);

Related

Jsoup Parsing double quotes as &quot and single quotes as double quotes

I am trying to parse an HTML document. In the document, there is the
span-data-personalization = '{"one":["two"]}' which converts to
span-data-personalization = "{&quotone&quot:[&quottwo&quot]}" while parsing. The double quotes are converted to &quot and the single quotes to double quotes. I have also tried doc.outputSettings().prettyPrint(false); with no success. I also made the changes suggested in "jsoup - stop jsoup from making quotes into &", but it still did not work. I have also tried updating the Jsoup version. Nothing seems to work. Does anybody have any suggestions?
Thank you.
The Jsoup Parser class has a built-in unescapeEntities method. From the Jsoup documentation:
public static String unescapeEntities(String string, boolean inAttribute)
Utility method to unescape HTML entities from a string.
Parameters:
string - HTML escaped string
inAttribute - if the string is to be escaped in strict mode (as attributes are)
Returns:
an unescaped string

Escaping special characters in Java

The application I support is going through security review and there are some questions regarding escaping special characters. I have not been supporting this application for a long time and I'm not very knowledgeable about escaping special characters. The question I was asked is "Why are you JavaScript encoding the value and then HTML encoding it? Is that value written out in a context that requires the value to be encoded for both contexts?"
What is the difference between JavaScript encoding used and HTML encoding used? Why would I need both in my code?
Any information regarding this will be greatly appreciated!
public class HTMLEncodedResultSet extends ResultSetWrapper {
    public HTMLEncodedResultSet(ResultSet resultSet) {
        super(resultSet);
    }
    public String getString(int columnIndex) throws SQLException {
        return StringEscapeUtils.escapeHtml(StringEscapeUtils.escapeJavaScript(super.getString(columnIndex)));
    }
    public String getString(String columnName) throws SQLException {
        return StringEscapeUtils.escapeHtml(StringEscapeUtils.escapeJavaScript(super.getString(columnName)));
    }
}
From the official documentation:
escapeHtml
Escapes the characters in a String using HTML entities.
For example:
"bread" & "butter"
becomes:
&quot;bread&quot; &amp; &quot;butter&quot;
escapeJavaScript
Escapes the characters in a String using JavaScript String rules.
Escapes any values it finds into their JavaScript String form. Deals correctly with quotes and control-chars (tab, backslash, cr, ff, etc.)
So a tab becomes the characters '\' and 't'.
Example:
input string: He didn't say, "Stop!"
output string: He didn\'t say, \"Stop!\"
So, given that the reserved characters of JavaScript and HTML are not the same, if your input can contain both HTML and JS code, it may be necessary to invoke both methods.
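To illustrate how the two encodings differ, here is a rough sketch with hand-written escapers. These are simplified stand-ins written for this example, not the Commons Lang implementations: escapeHtml here handles only four markup characters, and escapeJs only backslash, quotes and a few control characters.

```java
final class EscapeContexts {
    // Simplified stand-in for HTML escaping: only &, <, > and double quote.
    static String escapeHtml(String s) {
        StringBuilder out = new StringBuilder(s.length());
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            switch (c) {
                case '&': out.append("&amp;"); break;
                case '<': out.append("&lt;"); break;
                case '>': out.append("&gt;"); break;
                case '"': out.append("&quot;"); break;
                default:  out.append(c);
            }
        }
        return out.toString();
    }

    // Simplified stand-in for JavaScript string escaping.
    static String escapeJs(String s) {
        StringBuilder out = new StringBuilder(s.length());
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            switch (c) {
                case '\\': out.append("\\\\"); break;
                case '\'': out.append("\\'"); break;
                case '"':  out.append("\\\""); break;
                case '\t': out.append("\\t"); break;
                case '\n': out.append("\\n"); break;
                case '\r': out.append("\\r"); break;
                default:   out.append(c);
            }
        }
        return out.toString();
    }
}
```

Note that the two functions touch different characters: the JavaScript escaper leaves & and < alone, and the HTML escaper leaves backslashes and single quotes alone. Applying both only makes sense when the value really ends up in a JavaScript string that is itself embedded in HTML.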
It looks like your application has JavaScript snippets stored in the database. These snippets might create or contain HTML parts (e.g. for generating dynamic HTML based on interaction). When loading these snippets from the DB as a string in Java, both JavaScript and HTML encoding are required.
Here is an example of a value that could be stored in the DB:
var obj = $('#fire');
var fps = 200;
var letters = obj.html().split('');
obj.empty();
$.each(letters, function(el) {
    obj.append($('<span>' + this + '</span>'));
});
var animateLetters = obj.find('span');
setInterval(function() {
    animateLetters.each(function() {
        $(this).css('fontSize', 80 + (Math.floor(Math.random() * 50)));
    });
}, fps);
Referring to the documentation:
escapeHtml: Escapes the characters in a String using HTML entities.
For example:
"bread" & "butter"
becomes:
&quot;bread&quot; &amp; &quot;butter&quot;
and
escapeJavaScript: Escapes any values it finds into their JavaScript String form. Deals correctly with quotes and control-chars (tab, backslash, cr, ff, etc.)
So a tab becomes the characters '\' and 't'.
The only difference between Java strings and JavaScript strings is that in JavaScript, a single quote must be escaped.
Example:
input string: He didn't say, "Stop!"
output string: He didn\'t say, \"Stop!\"

How to automatically unescape the escape characters in a string

I am receiving data from a service with escape sequence characters in it. I have managed to eliminate them with this code:
results = results.replace("\\\"", "\"");
if (results.startsWith("\"")) {
    results = results.substring(1, results.length());
}
if (results.endsWith("\"")) {
    results = results.substring(0, results.length() - 1);
}
It works fine, but for some strings it throws an exception while creating the JSON object. How do I automatically unescape the escape characters in the result? I have searched for answers, but many of them say to use a third-party library. What is the best way to achieve this?
I think Apache Commons works pretty well. It has a StringEscapeUtils class with a bunch of static methods for escaping and unescaping strings, so I think you should check it out.
Good luck!
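If a third-party library is not an option, a hand-written single pass can cover the common cases. The sketch below is only an illustration, not the Commons Lang implementation: it strips one pair of surrounding quotes and then resolves backslash sequences. It assumes that any \u is followed by four hex digits.

```java
final class Unescaper {
    // Strips one pair of surrounding quotes, then resolves backslash escapes in a single pass.
    static String unescape(String s) {
        if (s.length() >= 2 && s.startsWith("\"") && s.endsWith("\"")) {
            s = s.substring(1, s.length() - 1);
        }
        StringBuilder out = new StringBuilder(s.length());
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c != '\\' || i + 1 == s.length()) {
                out.append(c); // ordinary character, or a lone trailing backslash
                continue;
            }
            char next = s.charAt(++i);
            switch (next) {
                case 'n': out.append('\n'); break;
                case 't': out.append('\t'); break;
                case 'r': out.append('\r'); break;
                case 'b': out.append('\b'); break;
                case 'f': out.append('\f'); break;
                case 'u':
                    // Assumption: four hex digits follow; otherwise keep the text as it is.
                    if (i + 5 <= s.length()) {
                        out.append((char) Integer.parseInt(s.substring(i + 1, i + 5), 16));
                        i += 4;
                    } else {
                        out.append("\\u");
                    }
                    break;
                default:
                    out.append(next); // covers \" \\ and \/
            }
        }
        return out.toString();
    }
}
```

Handling every escape in one pass avoids the ordering problems of chained replace calls, where an earlier replacement can create text that a later one then changes again.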
Place this code below the code that parses the array:
// Remove all <p>, </p> and <br /> tags by replacing them with ""
content = content.replace("<br />", "");
content = content.replace("<p>", "");
content = content.replace("</p>", "");
Here, content is my string variable; replace it with your own variable as necessary.

Regex Email addresses out of xml

My question: What's a good way to parse the information below?
I have a Java program that gets its input from XML. I have a feature which sends an error email if there was any problem in the processing. Because parsing the XML could itself be the problem, I want a fallback that can regex the email addresses out of the XML (if parsing failed, I could not get the error email addresses out of the XML normally).
Requirements:
I want to be able to parse the to, cc, and bcc attributes separately
There are other elements which have to, cc, and bcc attributes
Whitespace does not matter, so my example may show the attributes on a newline, but that's not always the case.
The order of the attributes does not matter.
Here's an example of the xml:
<error_options
to="your_email@your_server.com"
cc="cc_error@your_server.com"
bcc="bcc_error@your_server.com"
reply_to="someone_else@their_server.com"
from="bo_error@some_server.org"
subject="Error running System at ##TIMESTAMP##"
force_send="false"
max_email_size="10485760"
oversized_email_action="zip;split_all"
>
I tried this error_options.{0,100}?to="(.*?)", but that matched me down to reply_to. That made me think there are probably some cases I might miss, which is why I'm posting this as a question.
This piece will put all attributes from your String s = "<error_options..." into a map:
// Needs java.util.HashMap, java.util.Map, java.util.regex.Matcher and java.util.regex.Pattern.
Pattern p = Pattern.compile("\\s+?(.+?)=\"(.+?)\\s*?\"", Pattern.DOTALL);
Map<String, String> a = new HashMap<>();
Matcher m = p.matcher(s);
while (m.find()) {
    String key = m.group(1).trim();
    String val = m.group(2).trim();
    a.put(key, val);
}
...then you can extract the values that you're interested in from that map.
This question is similar to "RegEx match open tags except XHTML self-contained tags". Never parse XML or HTML with regular expressions. There are many XML parser implementations in Java that do this task properly. Read the document and parse the attributes one by one.
Don't worry if the user's XML is not well-formed; the parsers can handle a lot of sloppiness.
/<error_options(?=\s)[^>]*?(?<=\n)\s*to="([^"]*)"/s;
/<error_options(?=\s)[^>]*?(?<=\n)\s*cc="([^"]*)"/s;
/<error_options(?=\s)[^>]*?(?<=\n)\s*bcc="([^"]*)"/s;
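For a Java version of the same idea, here is a minimal sketch that follows the stated requirements: attribute order and whitespace do not matter, and reply_to must not be mistaken for to. The class and method names are made up for this example, and it assumes double-quoted attribute values that contain no > character.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class ErrorOptionsEmails {
    // Captures everything inside the opening <error_options ...> tag.
    private static final Pattern TAG = Pattern.compile("<error_options\\s([^>]*)>");
    // Matches an attribute named exactly to, cc or bcc; the lookbehind rejects reply_to.
    private static final Pattern ATTR =
            Pattern.compile("(?<![\\w-])(to|cc|bcc)\\s*=\\s*\"([^\"]*)\"");

    static Map<String, String> extract(String xml) {
        Map<String, String> result = new LinkedHashMap<>();
        Matcher tag = TAG.matcher(xml);
        if (tag.find()) {
            Matcher attr = ATTR.matcher(tag.group(1));
            while (attr.find()) {
                result.put(attr.group(1), attr.group(2));
            }
        }
        return result;
    }
}
```

Restricting the second pattern to the text captured from the error_options tag keeps to, cc and bcc attributes of other elements out of the result.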

How to extract a substring from a string in java

What I am doing is validating URLs from my code. I have a file with URLs in it and I want to see if they exist or not. If they exist, the web page contains XML code in which there will be an email address I want to extract.
I go round a while loop and in each iteration, if the URL exists, the XML is added to a string. This one big string contains the XML code. What I want to do is extract the email address from this string. I can't use the methods in the String API as they require you to specify the starting index, which I don't know as it varies each time.
What I was hoping to do was search the string for a substring starting with (e.g. "<email id>") and ending with (e.g. "</email id>") and add the text between these markers to a separate string.
Does anyone know if this is possible, or if there is an easier or different way of doing what I want to do?
Thanks.
If you know the structure of the XML document well, I recommend using XPath.
For example, with emails contained in <email>a@b.com</email>, the XPath expression would be something like /root/email (depending on your XML structure).
Executing this XPath query on your XML file returns all <email> elements (Node objects) as a list of nodes. And once you have an XML element, you have its content (Node#getNodeValue).
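As a rough sketch of that approach with the JDK's built-in javax.xml.xpath API: the //email expression and the EmailXPath class are assumptions made for this example, so adjust the path to your real document structure. The text is read with getTextContent because getNodeValue returns null for element nodes.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

final class EmailXPath {
    // Returns the text of every <email> element found anywhere in the document.
    static List<String> emails(String xml) {
        XPath xpath = XPathFactory.newInstance().newXPath();
        NodeList nodes;
        try {
            nodes = (NodeList) xpath.evaluate(
                    "//email", new InputSource(new StringReader(xml)), XPathConstants.NODESET);
        } catch (XPathExpressionException e) {
            throw new IllegalArgumentException("Could not evaluate XPath on the given XML", e);
        }
        List<String> result = new ArrayList<>();
        for (int i = 0; i < nodes.getLength(); i++) {
            result.add(nodes.item(i).getTextContent().trim());
        }
        return result;
    }
}
```

This requires the input to be a well-formed document with a single root element.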
To answer your subject question: .indexOf, or, regular expressions.
But after a brief review of your question, you should really be processing the XML document properly.
A regular expression that will find and return strings between two " characters:
import java.util.regex.Pattern;
import java.util.regex.Matcher;

private static final Pattern pattern = Pattern.compile("\"(.*?)\"");

private void doStuffWithStringsBetweenQuotes(String source) {
    Matcher matcher = pattern.matcher(source);
    while (matcher.find()) {
        // group(1) is the text between the two quotes
        String match = matcher.group(1);
    }
}
Have you tried using a regex? A sample document would be very useful for this kind of question.
Check out the org.xml.sax API. It is very easy to use and allows you to parse through XML and do whatever you want with the contents whenever you come across anything of interest. So you could easily add some logic to look for <email> start elements and then save the contents (characters), which will contain your email address.
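A minimal SAX sketch of that logic might look like the following. The element name email and the EmailSax class are assumptions made for this example.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

final class EmailSax {
    // Collects the character content of every <email> element while streaming through the XML.
    static List<String> emails(String xml) {
        List<String> found = new ArrayList<>();
        DefaultHandler handler = new DefaultHandler() {
            private StringBuilder current; // non-null only while inside an <email> element

            @Override
            public void startElement(String uri, String localName, String qName, Attributes attrs) {
                if ("email".equals(qName)) {
                    current = new StringBuilder();
                }
            }

            @Override
            public void characters(char[] ch, int start, int length) {
                if (current != null) {
                    current.append(ch, start, length); // may be called several times per element
                }
            }

            @Override
            public void endElement(String uri, String localName, String qName) {
                if ("email".equals(qName) && current != null) {
                    found.add(current.toString().trim());
                    current = null;
                }
            }
        };
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse XML", e);
        }
        return found;
    }
}
```

Because SAX streams through the input, this never holds the whole document tree in memory, which can matter for one big concatenated string.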
If I understand your question correctly, you are extracting pieces of XML from multiple web pages and concatenating them into a big 'xml' string,
something that looks like
"<somedata>blah</somedata>
<email>a.b@c.com</email>
<somedata>blah</somedata>
<somedata>blah</somedata>
<email>a.c@c.com</email>
<somedata>blah</somedata>
<somedata>blah</somedata>
<email>a.d@c.com</email>
<somedata>blah</somedata>
<somedata>blah</somedata>
"
I'd advise making that a valid XML document by including a root element.
"
<?xml version="1.0" encoding="ISO-8859-1"?>
<newRoot>
<somedata>blah</somedata>
<email>a.b@c.com</email>
<somedata>blah</somedata>
<somedata>blah</somedata>
<email>a.c@c.com</email>
<somedata>blah</somedata>
<somedata>blah</somedata>
<email>a.d@c.com</email>
<somedata>blah</somedata>
<somedata>blah</somedata>
</newRoot>"
Then you could load that into an XML Document object and use XPath expressions to extract the email nodes and their values.
If you don't want to do that, you could use the indexOf(String str, int fromIndex) method to find the positions of <email> and </email> (or whatever they are called), and then take substrings based on those. That's not a particularly clean or easy-to-read way of doing it, though.
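For completeness, the indexOf variant could be sketched like this. The helper name and the idea of passing the opening and closing markers as parameters are assumptions made for this example.

```java
import java.util.ArrayList;
import java.util.List;

final class EmailIndexOf {
    // Returns every substring found between an opening and a closing marker, in order.
    static List<String> between(String source, String open, String close) {
        List<String> result = new ArrayList<>();
        int from = 0;
        while (true) {
            int start = source.indexOf(open, from);
            if (start < 0) {
                break; // no further opening marker
            }
            start += open.length();
            int end = source.indexOf(close, start);
            if (end < 0) {
                break; // opening marker without a matching close
            }
            result.add(source.substring(start, end));
            from = end + close.length();
        }
        return result;
    }
}
```

Calling between(xml, "<email>", "</email>") on the concatenated string would then collect all addresses without knowing any index up front, since each search starts where the previous match ended.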
