Efficient way to unescape HTML escape characters WITHOUT external library

To convert HTML escape characters to a readable String, I currently have this method:
public static String unescapeHTML(String text) {
    return text
            .replace("&trade;", "™")
            .replace("&euro;", "€")
            .replace("&#32;", " ")
            .replace("&nbsp;", " ")
            .replace("&#33;", "!")
            .replace("&#34;", "\"")
            .replace("&quot;", "\"")
            .replace("&#35;", "#")
            .replace("&#36;", "$")
            .replace("&#37;", "%")
            .replace("&#38;", "&")
            //and the rest of HTML escape characters
            .replace("&amp;", "&");
}
My goal is to avoid any external library such as Apache (class StringUtils), etc.
Because the list is quite long (more than 300 characters), what would be the fastest way to replace them?

Use Pattern and Matcher. Each replacement changes the buffer length while the match offsets still refer to the original text, so the offsets have to be adjusted. If you want to avoid calculating that adjustment at run time, you can instead keep the length difference for each entity in some data structure, like { -4, -4, 0, -4 }. Since buffer.length() just returns an instance variable, I used the buffer length here.
private final static Pattern MY_PATTERN = Pattern.compile("\\&(.*?)\\;");
private final static HashMap<String, String> patterns = new HashMap<>();
static {
    patterns.put("&amp;", "&");
    patterns.put("&#33;", "!");
    patterns.put("&#32;", "thick");
    patterns.put("&#36;", "$");
}
public static StringBuffer escapeString(String text) {
    StringBuffer buffer = new StringBuffer(text);
    Matcher m = MY_PATTERN.matcher(text);
    int modifiedLength = 0; // how much shorter the buffer has become so far
    while (m.find()) {
        String replacement = patterns.get(m.group());
        if (replacement == null) {
            continue; // unknown entity: leave it unchanged instead of passing null to replace()
        }
        int tmpLength = buffer.length();
        // Match offsets refer to the original text, so shift them by the accumulated length difference
        buffer.replace(m.start() - modifiedLength, m.end() - modifiedLength, replacement);
        modifiedLength = modifiedLength + tmpLength - buffer.length();
    }
    return buffer;
}

I have decided to do it this way:
private static final Map<Integer, Character> iMap = new HashMap<>();
static { // Numeric code, e.g. 33 for &#33; or 34 for &#34;
    iMap.put(32, ' ');
    iMap.put(33, '!');
    iMap.put(34, '\"');
    iMap.put(35, '#');
    iMap.put(36, '$');
    iMap.put(37, '%');
    iMap.put(38, '&');
    //...
}
private static final Map<String, Character> sMap = new HashMap<>();
static { // Entity name
    sMap.put("&larr;", '←');
    sMap.put("&uarr;", '↑');
    sMap.put("&rarr;", '→');
    sMap.put("&darr;", '↓');
    sMap.put("&harr;", '↔');
    sMap.put("&spades;", '♠');
    sMap.put("&clubs;", '♣');
    sMap.put("&hearts;", '♥');
    //...
}
public static String unescapeHTML(String str) {
    StringBuilder sb = new StringBuilder(),
            tmp = new StringBuilder();
    StringReader sr = new StringReader(str);
    boolean esc = false;
    try {
        int i;
        while ((i = sr.read()) != -1) {
            char c = (char) i;
            if (c == '&') {
                sb.append(tmp); // flush an earlier unterminated "&..." run unchanged
                tmp.setLength(0);
                tmp.append(c);
                esc = true;
            } else if (esc) {
                tmp.append(c);
                if (c == ';') {
                    esc = false;
                    Character decoded = null;
                    if (tmp.charAt(1) == '#') {
                        try {
                            // length(), not capacity(): capacity() is the allocated size, not the content size
                            decoded = iMap.get(Integer.parseInt(tmp.substring(2, tmp.length() - 1)));
                        } catch (NumberFormatException ex) {
                            // Not a decimal code; left unchanged below
                        }
                    } else {
                        decoded = sMap.get(tmp.toString());
                    }
                    if (decoded != null) {
                        sb.append(decoded);
                    } else {
                        sb.append(tmp); // unknown entity: leave unchanged instead of appending "null"
                    }
                    tmp.setLength(0);
                }
            } else {
                sb.append(c);
            }
        }
        sb.append(tmp); // keep a trailing unterminated "&..." as-is
        sr.close();
    } catch (IOException ex) {
        Logger.getLogger(UnescapeHTML.class.getName()).log(Level.SEVERE, null, ex);
    }
    return sb.toString();
}
Works perfectly and the code is simple. Still testing. It would be nice to hear your comments.
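
As a point of comparison, here is a minimal sketch of a single-pass variant that decodes numeric references arithmetically instead of listing each code in a map, so only named entities need a table. The class name HtmlUnescapeSketch and the tiny NAMED table are assumptions made for illustration; a real table would need the full entity list.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class HtmlUnescapeSketch {
    // Matches &name; as well as &#123; and &#x1F; style references.
    private static final Pattern ENTITY = Pattern.compile("&(#[xX]?)?([0-9A-Za-z]+);");

    // Hypothetical, deliberately tiny table of named entities.
    private static final Map<String, String> NAMED = new HashMap<String, String>();
    static {
        NAMED.put("amp", "&");
        NAMED.put("lt", "<");
        NAMED.put("gt", ">");
        NAMED.put("quot", "\"");
        NAMED.put("nbsp", "\u00A0");
        NAMED.put("euro", "\u20AC");
    }

    static String unescape(String text) {
        Matcher m = ENTITY.matcher(text);
        StringBuffer out = new StringBuffer(text.length());
        while (m.find()) {
            String decoded = decode(m.group(1), m.group(2));
            // Unknown references are copied through unchanged.
            String replacement = decoded != null ? decoded : m.group();
            // quoteReplacement keeps '$' and '\' in the replacement literal.
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }

    private static String decode(String prefix, String body) {
        if (prefix == null) {
            return NAMED.get(body); // named entity, null if not in the table
        }
        try {
            int radix = prefix.length() == 2 ? 16 : 10; // "#x" means hex, "#" means decimal
            int codePoint = Integer.parseInt(body, radix);
            if (!Character.isValidCodePoint(codePoint)) {
                return null;
            }
            return new String(Character.toChars(codePoint));
        } catch (NumberFormatException ex) {
            return null;
        }
    }

    public static void main(String[] args) {
        // prints: 5 < 6 && 10$ ! &bogus;
        System.out.println(unescape("5 &lt; 6 &amp;&amp; 10&#36; &#x21; &bogus;"));
    }
}
```

Because the matcher scans the original text only once, an input such as &amp;lt; becomes &lt; and is not unescaped a second time, which chained replace() calls can get wrong depending on their order.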

Related

Is there any way to convert first word in the string to camelcase format?

I want to write Java code that converts the strings on the left to the ones on the right:
1234_hello -- 1234_Hello
hello Data -- Hello Data
hELLO data -- Hello data
1234hEllo -- 1234Hello
heLLO1234hEllo -- Hello1234hEllo
$hello -- $Hello
Could you please help with the solution?
Thank you!

Here is a solution:
public static void main(String[] args) {
try {
System.out.println(convertString("1234_hello"));
System.out.println(convertString("hello Data"));
System.out.println(convertString("hELLO data"));
System.out.println(convertString("1234hEllo"));
System.out.println(convertString("heLLO1234hEllo"));
System.out.println(convertString("$hello"));
System.out.println(convertString("$1234hEllo_TTHjjZ"));
}
catch (Exception e) {
e.printStackTrace();
}
}
private static String convertString(String string) {
String result = string;
final String regex1 = "^([^a-zA-Z]+)([a-zA-Z])([a-zA-Z]*)([^a-zA-Z].*)$";
final String regex2 = "^([a-zA-Z])([a-zA-Z]*)([^a-zA-Z].*)$";
final String regex3 = "^([^a-zA-Z]+)([a-zA-Z])([a-zA-Z]*)$";
final Pattern pattern1 = Pattern.compile(regex1, Pattern.MULTILINE);
final Pattern pattern2 = Pattern.compile(regex2, Pattern.MULTILINE);
final Pattern pattern3 = Pattern.compile(regex3, Pattern.MULTILINE);
Matcher matcher1 = pattern1.matcher(string);
Matcher matcher2 = pattern2.matcher(string);
Matcher matcher3 = pattern3.matcher(string);
if (matcher1.find()) {
result = matcher1.group(1) + matcher1.group(2).toUpperCase() + matcher1.group(3).toLowerCase() + matcher1.group(4);
}
else if (matcher2.find()) {
result = matcher2.group(1).toUpperCase() + matcher2.group(2).toLowerCase() + matcher2.group(3);
}
else if (matcher3.find()) {
result = matcher3.group(1) + matcher3.group(2).toUpperCase() + matcher3.group(3).toLowerCase();
}
return result;
}
The result is as expected:
1234_Hello
Hello Data
Hello data
1234Hello
Hello1234hEllo
$Hello
$1234Hello_TTHjjZ

I have a solution for you, but it is not efficient, and note that it capitalizes every run of letters rather than only the first word:
public static String toCamelCase(String input) {
StringBuilder output = new StringBuilder();
for(int i = 0; i < input.length(); i++) {
if(i == 0) {
output.append(Character.toUpperCase(input.charAt(i)));
continue;
}
if(Character.isLetter(input.charAt(i))) {
if(Character.isLetter(input.charAt(i-1))) {
output.append(Character.toLowerCase(input.charAt(i)));
} else {
output.append(Character.toUpperCase(input.charAt(i)));
}
} else {
output.append(input.charAt(i));
}
}
return output.toString();
}
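
For comparison, here is a minimal regex-free sketch that touches only the first run of letters, which is what the expected outputs in the question show. The class and method names are hypothetical, and it assumes that "word" means a run of characters for which Character.isLetter returns true (so it also accepts non-ASCII letters, unlike [a-zA-Z]).

```java
import java.util.Locale;

class FirstWordCapitalizer {
    static String capitalizeFirstWord(String input) {
        int start = 0;
        // Skip leading non-letters such as digits, '_' or '$'.
        while (start < input.length() && !Character.isLetter(input.charAt(start))) {
            start++;
        }
        if (start == input.length()) {
            return input; // no letters at all
        }
        int end = start;
        // Find the end of the first run of letters.
        while (end < input.length() && Character.isLetter(input.charAt(end))) {
            end++;
        }
        return input.substring(0, start)
                + Character.toUpperCase(input.charAt(start))
                + input.substring(start + 1, end).toLowerCase(Locale.ROOT)
                + input.substring(end);
    }

    public static void main(String[] args) {
        System.out.println(capitalizeFirstWord("1234_hello"));     // 1234_Hello
        System.out.println(capitalizeFirstWord("hELLO data"));     // Hello data
        System.out.println(capitalizeFirstWord("heLLO1234hEllo")); // Hello1234hEllo
    }
}
```

Unlike the three-regex solution, this also handles an input that consists of a single word, such as "hello".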

Java 6 converting utf8 to iso88591 charset and ignoring unmappable characters

I have written the following function which gets rid of characters in a string that can't be represented in iso88591:
public static String convert(String str) {
if (str.length()==0) return str;
str = str.replace("–","-");
str = str.replace("“","\"");
str = str.replace("”","\"");
return new String(str.getBytes(),iso88591charset);
}
My problem is that this doesn't behave the way I need.
When it comes across a character that has no representation, that character is converted to multiple bytes. I want the character to simply be omitted from the result.
I would also like to avoid needing all those replace calls.
I have been researching CharsetEncoder. It has methods like:
CharsetEncoder encoder = iso88591charset.newEncoder();
encoder.onMalformedInput(CodingErrorAction.IGNORE);
encoder.onUnmappableCharacter(CodingErrorAction.IGNORE);
which seem to be what I want, but I have not even managed to write a function that mimics what I already have using a CharsetEncoder, let alone set those options.
Also I am restricted to Java 6 :(
Update:
I came up with a nasty solution for this, but there must be a better way to do it:
public static String convert(String str) {
if (str.length()==0) return str;
str = str.replace("–","-");
str = str.replace("“","\"");
str = str.replace("”","\"");
String str2 = "";
for (int c=0;c<str.length();c++) {
String cur = (new Character(str.charAt(c))).toString();
if (cur.equals(new String(cur.getBytes(),iso88591charset))) str2 += cur;
}
return new String(str2.getBytes(),iso88591charset);
}

One possible way could be:
// U+2126 - omega sign
// U+2013 - en dash
// U+201c - left double quotation mark
// U+201d - right double quotation mark
String str = "\u2126\u2013\u201c\u201d";
System.out.println("original = " + str);
str = str.replace("–", "-");
str = str.replace("“", "\"");
str = str.replace("”", "\"");
System.out.println("replaced = " + str);
StringBuilder sb = new StringBuilder();
for (char c : str.toCharArray()) {
if (c <= '\u00ff') {
sb.append(c);
}
}
System.out.println("stripped = " + sb);
Output:
original = Ω–“”
replaced = Ω-""
stripped = -""
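
The CharsetEncoder route mentioned in the question can be sketched as follows. This is a minimal example under assumptions: the class and method names are hypothetical, and it uses Charset.forName so that it stays within the Java 6 API. With both error actions set to IGNORE, encode() skips characters that ISO-8859-1 cannot represent instead of throwing.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.CodingErrorAction;

class Latin1Filter {
    private static final Charset ISO_8859_1 = Charset.forName("ISO-8859-1");

    /** Returns str without the characters that ISO-8859-1 cannot represent. */
    static String dropUnmappable(String str) {
        // Encoders are not thread-safe, so create one per call.
        CharsetEncoder encoder = ISO_8859_1.newEncoder()
                .onMalformedInput(CodingErrorAction.IGNORE)
                .onUnmappableCharacter(CodingErrorAction.IGNORE);
        try {
            ByteBuffer bytes = encoder.encode(CharBuffer.wrap(str));
            // Decode with the same charset to get back a String of the surviving characters.
            return ISO_8859_1.decode(bytes).toString();
        } catch (CharacterCodingException ex) {
            // Not expected with IGNORE; rethrown so a failure is not silently hidden.
            throw new IllegalStateException(ex);
        }
    }

    public static void main(String[] args) {
        // Omega and en dash are dropped, the Latin-1 letters are kept.
        System.out.println(dropUnmappable("\u2126\u2013caf\u00e9"));
    }
}
```

Replacements such as en dash to "-" would still have to run before this step, since the encoder only drops characters and never substitutes look-alikes.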

Creating hashmap from json data

I am working on a very simple application for a website, just a basic desktop application.
So I've figured out how to grab all of the JSON Data I need, and if possible, I am trying to avoid the use of external libraries to parse the JSON.
Here is what I am doing right now:
package me.thegreengamerhd.TTVPortable;
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.URL;
import java.net.URLConnection;
import me.thegreengamerhd.TTVPortable.Utils.Messenger;
public class Channel
{
URL url;
String data;
String[] dataArray;
String name;
boolean online;
int viewers;
int followers;
public Channel(String name)
{
this.name = name;
}
public void update() throws IOException
{
// grab all of the JSON data from selected channel, if channel exists
try
{
url = new URL("https://api.twitch.tv/kraken/channels/" + name);
URLConnection connection = url.openConnection();
BufferedReader in = new BufferedReader(new InputStreamReader(connection.getInputStream()));
data = new String(in.readLine());
in.close();
// clean up data a little, into an array
dataArray = data.split(",");
}
// channel does not exist, throw exception and close client
catch (Exception e)
{
Messenger.sendErrorMessage("The channel you have specified is invalid or corrupted.", true);
e.printStackTrace();
return;
}
StringBuilder sb = new StringBuilder();
for (int i = 0; i < dataArray.length; i++)
{
sb.append(dataArray[i] + "\n");
}
System.out.println(sb.toString());
}
}
Here is what is printed when I enter an example channel (the data is grabbed correctly):
{"updated_at":"2013-05-24T11:00:26Z"
"created_at":"2011-06-28T07:50:25Z"
"status":"HD [XBOX] Call of Duty Black Ops 2 OPEN LOBBY"
"url":"http://www.twitch.tv/zetaspartan21"
"_id":23170407
"game":"Call of Duty: Black Ops II"
"logo":"http://static-cdn.jtvnw.net/jtv_user_pictures/zetaspartan21-profile_image-121d2cb317e8a91c-300x300.jpeg"
"banner":"http://static-cdn.jtvnw.net/jtv_user_pictures/zetaspartan21-channel_header_image-7c894f59f77ae0c1-640x125.png"
"_links":{"subscriptions":"https://api.twitch.tv/kraken/channels/zetaspartan21/subscriptions"
"editors":"https://api.twitch.tv/kraken/channels/zetaspartan21/editors"
"commercial":"https://api.twitch.tv/kraken/channels/zetaspartan21/commercial"
"teams":"https://api.twitch.tv/kraken/channels/zetaspartan21/teams"
"features":"https://api.twitch.tv/kraken/channels/zetaspartan21/features"
"videos":"https://api.twitch.tv/kraken/channels/zetaspartan21/videos"
"self":"https://api.twitch.tv/kraken/channels/zetaspartan21"
"follows":"https://api.twitch.tv/kraken/channels/zetaspartan21/follows"
"chat":"https://api.twitch.tv/kraken/chat/zetaspartan21"
"stream_key":"https://api.twitch.tv/kraken/channels/zetaspartan21/stream_key"}
"name":"zetaspartan21"
"delay":0
"display_name":"ZetaSpartan21"
"video_banner":"http://static-cdn.jtvnw.net/jtv_user_pictures/zetaspartan21-channel_offline_image-b20322d22543539a-640x360.jpeg"
"background":"http://static-cdn.jtvnw.net/jtv_user_pictures/zetaspartan21-channel_background_image-587bde3d4f90b293.jpeg"
"mature":true}
Initializing User Interface - JOIN
All of this is correct. Now what I want to do is grab, for example, the 'mature' tag and its value. Once I have it, using it should be as simple as:
// pseudo code
if(mature /*this is a boolean */ == true){ // do stuff}
In other words, I need to strip the quotes and split on the colon between the values to retrieve a key and a value.

It's doable with the following code:
public static Map<String, Object> parseJSON(String data) throws ParseException {
    if (data == null)
        return null;
    final Map<String, Object> ret = new HashMap<String, Object>();
    data = data.trim();
    if (!data.startsWith("{") || !data.endsWith("}"))
        throw new ParseException("Missing '{' or '}'.", 0);
    data = data.substring(1, data.length() - 1);
    // Expects one "key":value pair per line, as in the output printed in the question
    final String[] lines = data.split("[\r\n]");
    for (int i = 0; i < lines.length; i++) {
        String line = lines[i];
        if (line.isEmpty())
            continue;
        line = line.trim();
        if (line.indexOf(":") < 0)
            throw new ParseException("Missing ':'.", 0);
        String key = line.substring(0, line.indexOf(":"));
        String value = line.substring(line.indexOf(":") + 1);
        if (key.startsWith("\"") && key.endsWith("\"") && key.length() > 2)
            key = key.substring(1, key.length() - 1);
        if (value.startsWith("{"))
            // A nested object spans several lines: join them until the closing brace.
            // The bound is lines.length (number of lines), not line.length() (characters in this line).
            while (i + 1 < lines.length && !value.endsWith("}"))
                value = value + "\n" + lines[++i].trim();
        if (value.startsWith("\"") && value.endsWith("\"") && value.length() > 2)
            value = value.substring(1, value.length() - 1);
        Object mapValue = value;
        if (value.startsWith("{") && value.endsWith("}"))
            mapValue = parseJSON(value);
        else if (value.equalsIgnoreCase("true") || value.equalsIgnoreCase("false"))
            mapValue = Boolean.valueOf(value);
        else {
            try {
                mapValue = Integer.parseInt(value);
            } catch (NumberFormatException nfe) {
                try {
                    mapValue = Long.parseLong(value);
                } catch (NumberFormatException nfe2) {
                    // Not a number: keep the value as a String
                }
            }
        }
        ret.put(key, mapValue);
    }
    return ret;
}
You can call it like this:
try {
Map<String, Object> ret = parseJSON(sb.toString());
if(((Boolean)ret.get("mature")) == true){
System.out.println("mature is true !");
}
} catch (ParseException e) {
}
That said, you really shouldn't do this. Use an existing JSON parser instead, because this code will break on any complex or invalid JSON data (for example, a ":" inside a key). If you want to build a true JSON parser by hand, it will take a lot more code and debugging!
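
To give a rough idea of what a more robust hand-written approach involves, here is a minimal recursive-descent sketch. It is an illustration under stated assumptions, not a complete parser: the class name TinyJson is hypothetical, arrays are deliberately unsupported (they raise an error), and number handling is more lenient than the JSON grammar.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class TinyJson {
    private final String s;
    private int pos;

    private TinyJson(String s) { this.s = s; }

    /** Parses a JSON object of strings, numbers, booleans, null and nested objects (no arrays). */
    static Map<String, Object> parseObject(String json) {
        TinyJson p = new TinyJson(json);
        p.skipWs();
        Map<String, Object> result = p.object();
        p.skipWs();
        if (p.pos != json.length()) throw p.error("trailing characters");
        return result;
    }

    private Map<String, Object> object() {
        Map<String, Object> map = new LinkedHashMap<String, Object>();
        expect('{');
        skipWs();
        if (peek() == '}') { pos++; return map; }
        while (true) {
            skipWs();
            String key = string();
            skipWs();
            expect(':');
            skipWs();
            map.put(key, value());
            skipWs();
            if (peek() == ',') { pos++; continue; }
            expect('}');
            return map;
        }
    }

    private Object value() {
        char c = peek();
        if (c == '{') return object();
        if (c == '"') return string();
        int start = pos;
        // Scalar token: read up to the next delimiter.
        while (pos < s.length() && ",}".indexOf(s.charAt(pos)) < 0
                && !Character.isWhitespace(s.charAt(pos))) pos++;
        String token = s.substring(start, pos);
        if (token.equals("true")) return Boolean.TRUE;
        if (token.equals("false")) return Boolean.FALSE;
        if (token.equals("null")) return null;
        try {
            return Long.valueOf(token);
        } catch (NumberFormatException notLong) {
            try {
                return Double.valueOf(token); // lenient: accepts more than strict JSON numbers
            } catch (NumberFormatException notDouble) {
                throw error("unexpected token '" + token + "'");
            }
        }
    }

    private String string() {
        expect('"');
        StringBuilder sb = new StringBuilder();
        while (true) {
            if (pos >= s.length()) throw error("unterminated string");
            char c = s.charAt(pos++);
            if (c == '"') return sb.toString();
            if (c != '\\') { sb.append(c); continue; }
            if (pos >= s.length()) throw error("unterminated escape");
            char e = s.charAt(pos++);
            switch (e) {
                case 'n': sb.append('\n'); break;
                case 't': sb.append('\t'); break;
                case 'r': sb.append('\r'); break;
                case 'b': sb.append('\b'); break;
                case 'f': sb.append('\f'); break;
                case 'u':
                    if (pos + 4 > s.length()) throw error("bad \\u escape");
                    sb.append((char) Integer.parseInt(s.substring(pos, pos + 4), 16));
                    pos += 4;
                    break;
                default: sb.append(e); // covers \" \\ and \/
            }
        }
    }

    private char peek() {
        if (pos >= s.length()) throw error("unexpected end of input");
        return s.charAt(pos);
    }

    private void expect(char c) {
        if (peek() != c) throw error("expected '" + c + "'");
        pos++;
    }

    private void skipWs() {
        while (pos < s.length() && Character.isWhitespace(s.charAt(pos))) pos++;
    }

    private IllegalArgumentException error(String msg) {
        return new IllegalArgumentException(msg + " at position " + pos);
    }
}
```

With this sketch, Boolean.TRUE.equals(TinyJson.parseObject(data).get("mature")) would work directly on the raw response string, without splitting on commas first, because colons and commas inside quoted strings are consumed by string() rather than treated as separators.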

This is a parser for a simple JSON string:
public static HashMap<String, String> parseEasyJson(String json) {
final String regex = "([^{}: ]*?):(\\{.*?\\}|\".*?\"|[^:{}\" ]*)";
json = json.replaceAll("\n", "");
Matcher m = Pattern.compile(regex).matcher(json);
HashMap<String, String> map = new HashMap<>();
while (m.find())
map.put(m.group(1), m.group(2));
return map;
}

Java's String .replaceFirst is not working and I don't know if its a bug or I did it wrong

I have a method that replaces text. When I use string.replace() it works, but when I switch to replaceFirst() as shown below, it no longer works. What am I doing wrong or missing here?
private void acceptAccButtonActionPerformed(java.awt.event.ActionEvent evt) {
int selectedAcTableItem = validAcTable.getSelectedRow();
int selectedSugTableItem = suggestedAcTable.getSelectedRow();
if (selectedAcTableItem > 0) {
String acNameDefthmlText = htmlText;
String parensName = "";
String acName = validAcTable.getValueAt(selectedAcTableItem, 0).toString();
String acDef = validAcTable.getValueAt(selectedAcTableItem, 1).toString();
String acSent = validAcTable.getValueAt(selectedAcTableItem, 2).toString();
StringBuilder acBuilder = new StringBuilder(acDef);
acBuilder.append(" (").append(acName).append(")");
if (!acDef.equals("")) {
parensName = " (" + acName + ")";
if (htmlText.contains(acName) && !htmlText.contains(acBuilder)){
String acReplace = acBuilder.toString();
String acOrigDefName = acDefRow + parensName;
if (htmlText.contains(acOrigDefName) && parensName.contains(acOrigName)){
acNameDefthmlText = htmlText.replaceFirst(acOrigDefName, acReplace);
} else if (htmlText.contains(acName)) {
acNameDefthmlText = htmlText.replaceFirst(acName, acReplace);
}
htmlText = acNameDefthmlText;
}
validAcTable.setValueAt(true, selectedAcTableItem, 2);
Acronym acronym = createNewAcronym(acName, acSent, acDef, true);
try {
AcronymDefinitionController.sharedInstance().writeAcronymToExcelSheet(acName, acDef);
} catch (IOException ex) {
Exceptions.printStackTrace(ex);
} catch (InvalidFormatException ex) {
Exceptions.printStackTrace(ex);
}
if (validAcTable.getRowCount() - 1 >= validAcTable.getSelectedRow() + 1) {
validAcTable.changeSelection(selectedAcTableItem + 1, 0, true, true);
}
validAcTable.repaint();
}
}

Look at the signatures of the methods in question:
replace(char oldChar, char newChar);
replace(CharSequence target, CharSequence replacement);
replaceFirst(String regex, String replacement);
As you can see, in replaceFirst your matching argument is treated as a regex (regular expression), which makes a difference whenever the argument contains special characters.
For example, consider the following:
System.out.println("abcdab".replace("ab", "ef")); // replaces all occurrences
System.out.println("abcdab".replaceFirst("ab", "ef")); // replaces only the first occurrence
System.out.println("\\abcdab".replace("\\ab", "ef")); // replaces the literal \ab
System.out.println("\\abcdab".replaceFirst("\\ab", "ef"));
// ^ doesn't replace anything: `\` is a special character in a regex (\a means the bell character), so the pattern never matches the literal backslash
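
If the goal is to replace only the first occurrence of a literal string, one option is to quote both arguments. This is a minimal sketch with a hypothetical helper name; Pattern.quote and Matcher.quoteReplacement are standard java.util.regex methods. It matters for the code in the question because the search text contains parentheses, which a regex treats as a capturing group.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LiteralReplaceFirst {
    /** Replaces the first literal occurrence of target, with no regex interpretation. */
    static String replaceFirstLiteral(String text, String target, String replacement) {
        return text.replaceFirst(Pattern.quote(target), Matcher.quoteReplacement(replacement));
    }

    public static void main(String[] args) {
        String text = "NASA (NASA) and NASA";
        // As a regex, "(NASA)" is a group around NASA, so the first bare NASA is replaced.
        System.out.println(text.replaceFirst("(NASA)", "X"));       // X (NASA) and NASA
        // Quoted, the parentheses are matched literally.
        System.out.println(replaceFirstLiteral(text, "(NASA)", "X")); // NASA X and NASA
    }
}
```

quoteReplacement is needed as well because `$` and `\` have a special meaning in the replacement string of replaceFirst.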

Regex Issue With Multiple Groups

I'm trying to create a regex pattern to match the lines in the following format:
field[bii] = float4:.4f_degree // Galactic Latitude
field[class] = int2 (index) // Browse Object Classification
field[dec] = float8:.4f_degree (key) // Declination
field[name] = char20 (index) // Object Designation
field[dircos1] = float8 // 1st Directional Cosine
I came up with this pattern, which seemed to work, then suddenly seemed NOT to work:
field\[(.*)\] = (float|int|char)([0-9]|[1-9][0-9]).*(:(\.([0-9])))
Here is the code I'm trying to use (edit: provided full method instead of excerpt):
private static Map<String, String> createColumnMap(String filename) {
// create a linked hashmap mapping field names to their column types. Use LHM because I'm picky and
// would prefer to preserve the order
Map<String, String> columnMap = new LinkedHashMap<String, String>();
// define the regex patterns
Pattern columnNamePattern = Pattern.compile(columnNameRegexPattern);
try {
Scanner scanner = new Scanner(new FileInputStream(filename));
while (scanner.hasNextLine()) {
String line = scanner.nextLine();
if (line.indexOf("field[") != -1) {
// get the field name
Matcher fieldNameMatcher = columnNamePattern.matcher(line);
String fieldName = null;
if (fieldNameMatcher.find()) {
fieldName = fieldNameMatcher.group(1);
}
String columnName = null;
String columnType = null;
String columnPrecision = null;
String columnScale = null;
//Pattern columnTypePattern = Pattern.compile(".*(float|int|char)([0-9]|[1-9][0-9])");
Pattern columnTypePattern = Pattern.compile("field\\[(.*)\\] = (float|int|char).*([0-9]|[1-9][0-9]).*(:(\\.([0-9])))");
Matcher columnTypeMatcher = columnTypePattern.matcher(line);
System.out.println(columnTypeMatcher.lookingAt());
if (columnTypeMatcher.lookingAt()) {
System.out.println(fieldName + ": " + columnTypeMatcher.groupCount());
int count = columnTypeMatcher.groupCount();
if (count > 1) {
columnName = columnTypeMatcher.group(1);
columnType = columnTypeMatcher.group(2);
}
if (count > 2) {
columnScale = columnTypeMatcher.group(3);
}
if (count >= 6) {
columnPrecision = columnTypeMatcher.group(6);
}
}
int precision = Integer.parseInt(columnPrecision);
int scale = Integer.parseInt(columnScale);
if (columnType.equals("int")) {
if (precision <= 4) {
columnMap.put(fieldName, "INTEGER");
} else {
columnMap.put(fieldName, "BIGINT");
}
} else if (columnType.equals("float")) {
if (columnPrecision==null) {
columnMap.put(fieldName,"DECIMAL(8,4)");
} else {
columnMap.put(fieldName,"DECIMAL(" + columnPrecision + "," + columnScale + ")");
}
} else {
columnMap.put(fieldName,"VARCHAR("+columnPrecision+")");
}
}
if (line.indexOf("<DATA>") != -1) {
scanner.close();
break;
}
}
scanner.close();
} catch (FileNotFoundException e) {
}
return columnMap;
}
When I get the groupCount from the Matcher object, it says there are 6 groups. However, they aren't matching the text, so I could definitely use some help... can anyone assist?

It's not entirely clear to me what you're after but I came up with the following pattern and it accepts all of your input examples:
field\\[(.*)\\] = (float|int|char)([1-9][0-9]?)?(:\\.([0-9]))?
using this code:
String columnName = null;
String columnType = null;
String columnPrecision = null;
String columnScale = null;
// Pattern columnTypePattern =
// Pattern.compile(".*(float|int|char)([0-9]|[1-9][0-9])");
// field\[(.*)\] = (float|int|char)([0-9]|[1-9][0-9]).*(:(\.([0-9])))
Pattern columnTypePattern = Pattern
.compile("field\\[(.*)\\] = (float|int|char)([1-9][0-9]?)?(:\\.([0-9]))?");
Matcher columnTypeMatcher = columnTypePattern.matcher(line);
boolean match = columnTypeMatcher.lookingAt();
System.out.println("Match: " + match);
if (match) {
int count = columnTypeMatcher.groupCount();
if (count > 1) {
columnName = columnTypeMatcher.group(1);
columnType = columnTypeMatcher.group(2);
}
if (count > 2) {
columnScale = columnTypeMatcher.group(3);
}
if (count > 4) {
columnPrecision = columnTypeMatcher.group(5);
}
System.out.println("Name=" + columnName + "; Type=" + columnType + "; Scale=" + columnScale + "; Precision=" + columnPrecision);
}
I think the problem with your regex was that the scale and precision groups needed to be optional.

field\[(.*)\] = (float|int|char)([0-9]|[1-9][0-9]).*(:(\.([0-9])))
The .* is overly broad, ([0-9]|[1-9][0-9]) contains a lot of redundancy, and I think the group that starts with :, together with the .* preceding it, should be optional.
After removing the ambiguity, I get
field\[([^\]]*)\] = (float|int|char)(0|[1-9][0-9]*)(?:[^:]*(:(\.([0-9]+))))?
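
One detail worth spelling out: Matcher.groupCount() returns the number of capturing groups in the pattern, not how many of them took part in the match, so it will not tell you whether an optional part was present. An optional group that did not participate returns null from group(n). The following minimal sketch shows this against the sample lines; the class name and the simplified pattern are assumptions for illustration, not either answerer's exact regex.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class FieldLineParser {
    // Group 1: field name, 2: base type, 3: width digits, 4: digits after ":." (optional).
    private static final Pattern FIELD = Pattern.compile(
            "field\\[([^\\]]*)\\] = (float|int|char)([0-9]+)(?::\\.([0-9]+))?");

    /** Returns {name, type, width, precision}, or null if the line does not match. */
    static String[] parse(String line) {
        Matcher m = FIELD.matcher(line);
        if (!m.lookingAt()) {
            return null;
        }
        // m.group(4) is null when the optional ":.N" part is absent.
        return new String[] { m.group(1), m.group(2), m.group(3), m.group(4) };
    }

    public static void main(String[] args) {
        String[] a = parse("field[bii] = float4:.4f_degree // Galactic Latitude");
        System.out.println(a[0] + " " + a[1] + " " + a[2] + " " + a[3]); // bii float 4 4
        String[] b = parse("field[class] = int2 (index) // Browse Object Classification");
        System.out.println(b[0] + " " + b[1] + " " + b[2] + " " + b[3]); // class int 2 null
    }
}
```

Checking each group for null before calling Integer.parseInt avoids the NumberFormatException that the code in the question would hit on lines without a ":.N" part.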
