Java 6 converting utf8 to iso88591 charset and ignoring unmappable characters


I have written the following function, which removes characters from a string that can't be represented in ISO-8859-1:
public static String convert(String str) {
if (str.length()==0) return str;
str = str.replace("–","-");
str = str.replace("“","\"");
str = str.replace("”","\"");
return new String(str.getBytes(),iso88591charset);
}
My problem is this doesn't have the behavior I require.
When it comes across a character that has no representation it is converted to multiple bytes. I want that character to be simply omitted from the result.
I would also like to somehow not have to have all those replace commands.
I have been researching CharsetEncoder. It has methods like:
CharsetEncoder encoder = iso88591charset.newEncoder();
encoder.onMalformedInput(CodingErrorAction.IGNORE);
encoder.onUnmappableCharacter(CodingErrorAction.IGNORE);
which seem to be what I want, but I have failed to even write a function that mimics what I already have using a CharsetEncoder, let alone set those options.
Also I am restricted to Java 6 :(
Update:
I came up with a nasty solution for this, but there must be a better way to do it:
public static String convert(String str) {
if (str.length()==0) return str;
str = str.replace("–","-");
str = str.replace("“","\"");
str = str.replace("”","\"");
String str2 = "";
for (int c=0;c<str.length();c++) {
String cur = (new Character(str.charAt(c))).toString();
if (cur.equals(new String(cur.getBytes(),iso88591charset))) str2 += cur;
}
return new String(str2.getBytes(),iso88591charset);
}

One possible way could be:
// U+2126 - omega sign
// U+2013 - en dash
// U+201c - left double quotation mark
// U+201d - right double quotation mark
String str = "\u2126\u2013\u201c\u201d";
System.out.println("original = " + str);
str = str.replace("–", "-");
str = str.replace("“", "\"");
str = str.replace("”", "\"");
System.out.println("replaced = " + str);
StringBuilder sb = new StringBuilder();
for (char c : str.toCharArray()) {
if (c <= '\u00ff') {
sb.append(c);
}
}
System.out.println("stripped = " + sb);
Output:
original = Ω–“”
replaced = Ω-""
stripped = -""
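The CharsetEncoder route mentioned in the question can also be made to work on Java 6. The following is a minimal sketch, not a definitive implementation: the class name Latin1Filter and the method name convert are made up for illustration. It encodes the string with an ISO-8859-1 encoder configured to ignore unmappable characters, then decodes the resulting bytes back into a String, so anything outside ISO-8859-1 is simply dropped.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.CodingErrorAction;

public class Latin1Filter {
    private static final Charset ISO_8859_1 = Charset.forName("ISO-8859-1");

    // Returns str with every character that ISO-8859-1 cannot represent removed.
    public static String convert(String str) {
        // Encoders are not thread-safe, so create a fresh one per call.
        CharsetEncoder encoder = ISO_8859_1.newEncoder()
                .onMalformedInput(CodingErrorAction.IGNORE)
                .onUnmappableCharacter(CodingErrorAction.IGNORE);
        try {
            ByteBuffer bytes = encoder.encode(CharBuffer.wrap(str));
            // Decode with the same charset to get the filtered text back.
            return ISO_8859_1.decode(bytes).toString();
        } catch (CharacterCodingException e) {
            // Not expected when both error actions are IGNORE.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(convert("\u2126-\u00e9\u201c"));
    }
}
```

The replace calls for the dash and the quotation marks would still be needed beforehand if those characters should be substituted rather than dropped.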

Related

How to split string based on length and space using Java

I have a string like the following in Java. Its length is not known in advance; as an example, it could look like this:
String str = "I love programming. I'm currently working with Java and C++.";
I need to take the first 15 characters, then the next 30, 45 and 70 characters. If a split would fall in the middle of a word, the string should instead be split at the nearest preceding space. For the example above, the output is as follows:
String strSpli1 = "I love "; //Since 'programming' is splitting it was sent to next split
String strSpli2 = "programming. I'm currently ";//Since 'working' is splitting it was sent to next split
String strSpli3 = "working with Java and C++.";
Please help me to achieve this.
Updated answer for anybody having this kind of requirement.
String str = "I love programming. I'm currently working with Java and C++.";
String strSpli1 = "";
String strSpli2 = "";
String strSpli3 = "";
try {
strSpli1 = str.substring(15);
int pos = str.lastIndexOf(" ", 16);
if (pos == -1) {
pos = 15;
}
strSpli1 = str.substring(0, pos);
str = str.substring(pos);
try {
strSpli2 = str.substring(45);
int pos1 = str.lastIndexOf(" ", 46);
if (pos1 == -1) {
pos1 = 45;
}
strSpli2 = str.substring(0, pos1);
str = str.substring(pos1);
try {
strSpli3 = str.substring(70);
int pos2 = str.lastIndexOf(" ", 71);
if (pos2 == -1) {
pos2 = 70;
}
strSpli3 = str.substring(0, pos2);
str = str.substring(pos2);
} catch (Exception ex) {
strSpli3 = str;
}
} catch (Exception ex) {
strSpli2 = str;
}
} catch (Exception ex) {
strSpli1 = str;
}
Thank you

Use the two-parameter version of lastIndexOf() to search backwards for a space, starting from a given position. Example for the first 15 characters:
int pos = str.lastIndexOf(" ", 16);
if (pos == -1) {
pos = 15;
}
String found = str.substring(0, pos);
str = str.substring(pos+1);
This is missing checks, such as ensuring the string has at least 15 characters, or that pos+1 is a valid index for the given length.
I also suggest having a look at java.text.BreakIterator.
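The lastIndexOf() approach, together with the missing length checks, could be generalised roughly as follows. This is only a sketch under assumptions: the class SpaceSplitter and its split method are hypothetical names, each limit is treated as the maximum chunk length, and the space at the break point is dropped rather than kept.

```java
import java.util.ArrayList;
import java.util.List;

public class SpaceSplitter {
    // Cuts text into chunks of at most limits[i] characters, breaking at the
    // last space that fits; whatever is left over becomes the final chunk.
    public static List<String> split(String text, int... limits) {
        List<String> parts = new ArrayList<String>();
        String rest = text;
        for (int limit : limits) {
            if (rest.length() <= limit) {
                break; // the remainder already fits
            }
            int pos = rest.lastIndexOf(' ', limit);
            if (pos <= 0) {
                // No usable space: fall back to a hard cut at the limit.
                parts.add(rest.substring(0, limit));
                rest = rest.substring(limit);
            } else {
                parts.add(rest.substring(0, pos));
                rest = rest.substring(pos + 1); // skip the space itself
            }
        }
        if (rest.length() > 0) {
            parts.add(rest);
        }
        return parts;
    }

    public static void main(String[] args) {
        String str = "I love programming. I'm currently working with Java and C++.";
        for (String part : split(str, 15, 30, 45)) {
            System.out.println("[" + part + "]");
        }
    }
}
```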

Why use so many try/catch blocks? Just try this:
public class MyClass {
public static void main(String args[]) {
String str = "I love programming. I'm currently working with Java and C++.";
// The split points are hard-coded for this particular example string.
String strSpli1 = str.substring(0, 7);
String strSpli2 = str.substring(7, 33);
String strSpli3 = str.substring(34, str.length());
System.out.println(strSpli1+"\n");
System.out.println(strSpli2+"\n");
System.out.println(strSpli3+"\n");
}
}
Use substring(start index, end index).

How to remove whitespace in String imported from Excel

I need to remove all whitespace characters from a string and I am not able to do so.
Does anyone have an idea how to do it?
Here is my string, retrieved from an Excel file via the jxl API:
"Destination à gauche"
And here are its bytes :
6810111511610511097116105111110-96-32321039711799104101
Here is the code I use to remove whitespace:
public static void checkEntetes(Workbook book) {
String sheetName = "mysheet";
System.out.print(sheetName + " : ");
for(int i = 0; i < getColumnMax(book.getSheet(sheetName)); i++) {
String elementTrouve = book.getSheet(sheetName).getCell(i, 0).getContents();
String fileEntete = new String(elementTrouve.getBytes()).replaceAll("\\s+","");
System.out.println("\t" + elementTrouve + ", " + bytesArrayToString(elementTrouve.getBytes()));
System.out.println("\t" + fileEntete + ", " + bytesArrayToString(fileEntete.getBytes()));
}
System.out.println();
}
And this outputs :
"Destination à gauche", 6810111511610511097116105111110-96-32321039711799104101
"Destination àgauche", 6810111511610511097116105111110-96-321039711799104101
I even tried to make it myself and it still leaves a space before the 'à' char.
public static String removeWhiteChars(String s) {
String retour = "";
for(int i = 0; i < s.length(); i++) {
char c = s.charAt(i);
if(c != (char) ' ') {
retour += c;
}
}
return retour;
}

Regular expressions to the rescue:
str = str.replaceAll("\\s+", "");
This will remove any sequence of whitespace characters. For example:
String input = "Destination à gauche";
String output = input.replaceAll("\\s+","");
System.out.println("output is \""+output+"\"");
This outputs Destinationàgauche.
If your starting point is indeed the raw bytes (byte[]), you will first need to turn them into a String:
byte[] inputData = //get from somewhere
String stringBefore = new String(inputData, Charset.forName("UTF-8")); //you need to know the encoding
String withoutSpaces = stringBefore.replaceAll("\\s+","");
byte[] outputData = withoutSpaces.getBytes(Charset.forName("UTF-8"));
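One detail in the byte dump from the question is worth noting: the byte that survives before 'à' is -96, i.e. 0xA0, which is the non-breaking space in ISO-8859-1. By default, \s in Java regular expressions does not match U+00A0, which would explain why the first gap remains. A minimal sketch that also strips non-breaking spaces (the class name WhitespaceStripper is hypothetical, and it assumes the cell really contains U+00A0):

```java
public class WhitespaceStripper {
    // Removes ordinary whitespace as well as the non-breaking space U+00A0,
    // which \s alone does not match by default.
    public static String strip(String s) {
        return s.replaceAll("[\\s\\u00A0]+", "");
    }

    public static void main(String[] args) {
        // "Destination", non-breaking space, "à", regular space, "gauche"
        String input = "Destination\u00A0\u00E0 gauche";
        System.out.println(strip(input));
    }
}
```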

If you would like to use a formula, the TRIM function will do exactly what you're looking for:
+----+------------+---------------------+
| | A | B |
+----+------------+---------------------+
| 1 | =TRIM(B1) | value to trim here |
+----+------------+---------------------+
So to do the whole column.
1) Insert a column
2) Insert TRIM function pointed at cell you are trying to correct.
3) Copy formula down the page
4) Copy inserted column
5) Paste as "Values"
Reference: Question number 9578397 on stackoverflow.com

Efficient way to unescape HTML escape characters WITHOUT external library

To convert HTML escape sequences to a readable String, I currently have this method:
public static String unescapeHTML(String text) {
return text
.replace("&trade;", "™")
.replace("&euro;", "€")
.replace("&#32;", " ")
.replace("&nbsp;", " ")
.replace("&#33;", "!")
.replace("&#34;", "\"")
.replace("&quot;", "\"")
.replace("&#35;", "#")
.replace("&#36;", "$")
.replace("&#37;", "%")
.replace("&#38;", "&")
//and the rest of HTML escape characters
.replace("&amp;", "&");
}
My goal is not to use any external library like Apache (class StringUtils), etc.
Because the list is quite long - more than 300 chars - it would be nice to know what would be the fastest way to replace them?

Use Pattern and Matcher. Each replacement changes the buffer length, so the match offsets have to be adjusted by the accumulated difference. If you want to avoid computing that adjustment at run time, you can instead store the length difference of each replacement in a data structure, e.g. { -4, -4, 0, -4 }. Since StringBuffer.length() just returns an instance variable, I simply used the buffer length here.
private final static Pattern MY_PATTERN = Pattern.compile("\\&(.*?)\\;");
private final static HashMap<String, String> patterns = new HashMap<>();
static{
patterns.put("&amp;", "&");
patterns.put("&#33;", "!");
patterns.put("&#32;", "thick");
patterns.put("&#36;", "$");
}
public static StringBuffer escapeString(String text){
StringBuffer buffer = new StringBuffer(text);
Matcher m = MY_PATTERN.matcher(text);
int modifiedLength = 0;
while (m.find()) {
int tmpLength = buffer.length();
// To consider the modified buffer length due to replace. hold difference between old and previous
buffer.replace(m.start()-modifiedLength, m.end()-modifiedLength, patterns.get(m.group()));
modifiedLength = modifiedLength + tmpLength-buffer.length();
}
return buffer;
}

I have decided to do it this way:
private static final Map<Integer, Character> iMap = new HashMap<>();
static {//Numeric character code (&#NN;)
iMap.put(32, ' ');
iMap.put(33, '!');
iMap.put(34, '\"');
iMap.put(35, '#');
iMap.put(36, '$');
iMap.put(37, '%');
iMap.put(38, '&');
//...
}
private static final Map<String, Character> sMap = new HashMap<>();
static {//Entity Name
sMap.put("&larr;", '←');
sMap.put("&uarr;", '↑');
sMap.put("&rarr;", '→');
sMap.put("&darr;", '↓');
sMap.put("&harr;", '↔');
sMap.put("&spades;", '♠');
sMap.put("&clubs;", '♣');
sMap.put("&hearts;", '♥');
//...
}
public static String unescapeHTML(String str) {
StringBuilder sb = new StringBuilder(),
tmp = new StringBuilder();
StringReader sr = new StringReader(str);
boolean esc = false;
try {
int i;
while ((i = sr.read()) != -1) {
char c = (char) i;
if (c == '&') {
tmp.append(c);
esc = true;
} else if (esc) {
tmp.append(c);
if (c == ';') {
esc = false;
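if (tmp.charAt(1) == '#') {
try {
// Use length(), not capacity(): capacity() is the allocated buffer size
Character ch = iMap.get(Integer.parseInt(tmp.substring(2, tmp.length() - 1)));
sb.append(ch != null ? ch.toString() : tmp.toString());//Unknown code: leave unchanged
} catch (NumberFormatException ex) {
sb.append(tmp.toString());//Ignore and leave unchanged
}
} else {
Character ch = sMap.get(tmp.toString());
sb.append(ch != null ? ch.toString() : tmp.toString());//Unknown name: leave unchanged
}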
tmp.setLength(0);
}
} else {
sb.append(c);
}
}
sr.close();
} catch (IOException ex) {
Logger.getLogger(UnescapeHTML.class.getName()).log(Level.SEVERE, null, ex);
}
return sb.toString();
}
Works perfectly and the code is simple. Still testing. It would be nice to hear your comments.
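For comparison, here is a single-pass variant built on Matcher.appendReplacement(), which avoids both the chained replace() calls and the manual offset bookkeeping. It is a rough sketch with assumptions: the class name EntityUnescaper is made up, only a handful of named entities are listed, and numeric references are decoded directly from their code point instead of being looked up in a table. Unknown entities are left unchanged.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class EntityUnescaper {
    // Group 1: digits of a numeric reference; group 2: an entity name.
    private static final Pattern ENTITY = Pattern.compile("&(?:#(\\d+)|(\\w+));");
    private static final Map<String, String> NAMED = new HashMap<String, String>();
    static {
        NAMED.put("amp", "&");
        NAMED.put("quot", "\"");
        NAMED.put("lt", "<");
        NAMED.put("gt", ">");
        NAMED.put("euro", "\u20ac");
        // ... the remaining named entities would go here
    }

    public static String unescape(String text) {
        Matcher m = ENTITY.matcher(text);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String replacement = m.group(); // default: keep the entity as-is
            if (m.group(1) != null) {
                try {
                    int cp = Integer.parseInt(m.group(1));
                    if (Character.isValidCodePoint(cp)) {
                        replacement = new String(Character.toChars(cp));
                    }
                } catch (NumberFormatException ex) {
                    // Number too large: leave the text unchanged.
                }
            } else if (NAMED.containsKey(m.group(2))) {
                replacement = NAMED.get(m.group(2));
            }
            // quoteReplacement keeps '$' and '\' in the result literal.
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(unescape("a &amp; b &#33; &euro; &bogus;"));
    }
}
```

Because the input is scanned only once, the output of one replacement is never unescaped a second time, which can happen with chained replace() calls.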

Words inside square brackets - RegExp

String linkPattern = "\\[[A-Za-z_0-9]+\\]";
String text = "[build]/directory/[something]/[build]/";
RegExp reg = RegExp.compile(linkPattern,"g");
MatchResult matchResult = reg.exec(text);
for (int i = 0; i < matchResult.getGroupCount(); i++) {
System.out.println("group" + i + "=" + matchResult.getGroup(i));
}
I am trying to get all blocks that are enclosed in square brackets from a path string,
but I only get group0="[build]". What I want is:
1:"[build]" 2:"[something]" 3:"[build]"
EDIT:
Just to be clear, the words inside the brackets are randomly generated text:
public static String genText()
{
final int LENGTH = (int)(Math.random()*12)+4;
StringBuffer sb = new StringBuffer();
for (int x = 0; x < LENGTH; x++)
{
sb.append((char)((int)(Math.random() * 26) + 97));
}
String str = sb.toString();
str = str.substring(0,1).toUpperCase() + str.substring(1);
return str;
}
EDIT 2:
JDK works fine, GWT RegExp gives this problem
SOLVED:
Answer from Didier L
String linkPattern = "\\[[A-Za-z_0-9]+\\]";
String result = "";
String text = "[build]/directory/[something]/[build]/";
RegExp reg = RegExp.compile(linkPattern,"g");
MatchResult matchResult = null;
while((matchResult=reg.exec(text)) != null){
if(matchResult.getGroupCount()==1)
System.out.println( matchResult.getGroup(0));
}

I don't know which regex library you are using but using the one from the JDK it would go along the lines of
String linkPattern = "\\[[A-Za-z_0-9]+\\]";
String text = "[build]/directory/[something]/[build]/";
Pattern pat = Pattern.compile(linkPattern);
Matcher mat = pat.matcher(text);
while (mat.find()) {
System.out.println(mat.group());
}
Output:
[build]
[something]
[build]

Try:
String linkPattern = "(\\[[A-Za-z_0-9]+\\])*";
EDIT:
Second try:
String linkPattern = "\\[(\\w+)\\]+";
Third try, see http://rubular.com/r/eyAQ3Vg68N
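If only the words without the brackets are needed, a capturing group can be combined with the find() loop from the JDK answer. A small sketch using java.util.regex (not GWT's RegExp); the class name BracketWords and the method name find are hypothetical:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class BracketWords {
    // Group 1 captures the word between the square brackets.
    private static final Pattern LINK = Pattern.compile("\\[(\\w+)\\]");

    public static List<String> find(String text) {
        List<String> words = new ArrayList<String>();
        Matcher m = LINK.matcher(text);
        while (m.find()) { // each call advances to the next match
            words.add(m.group(1));
        }
        return words;
    }

    public static void main(String[] args) {
        System.out.println(find("[build]/directory/[something]/[build]/"));
    }
}
```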

Java special characters RegEx

I want to achieve the following using a regular expression in Java:
String[] paramsToReplace = {"email", "address", "phone"};
//input URL string
String ip = "http://www.google.com?name=bob&email=okATtk.com&address=NYC&phone=007";
//output URL string
String op = "http://www.google.com?name=bob&email=&address=&phone=";
The URL can contain special characters such as %.

Try this expression: (email=)[^&]+ (replace email with your array elements) and replace with the group: input.replaceAll("("+ paramsToReplace[i] + "=)[^&]+", "$1");
String input = "http://www.google.com?name=bob&email=okATtk.com&address=NYC&phone=007";
String output = input;
for( String param : paramsToReplace ) {
output = output.replaceAll("("+ param + "=)[^&]+", "$1");
}

For the example above, you can use split:
String[] temp = ip.split("\\?name=");
op = temp[0] + "?name=" + temp[1].split("&")[0] + "&email=&address=&phone=";

Something like this?
private final static String REPLACE_REGEX = "=[^&]*&";
ip=ip+"&";
for(String param : paramsToReplace) {
ip = ip.replaceAll(param+REPLACE_REGEX, Matcher.quoteReplacement(param+"=&"));
}
ip = ip.substring(0, ip.length()-1); // drop the "&" appended above
P.S. This is only a concept; I didn't compile this code.

You don't need regular expressions to achieve that:
String op = ip;
for (String param : paramsToReplace) {
int start = op.indexOf("?" + param);
if (start < 0)
start = op.indexOf("&" + param);
if (start < 0)
continue;
int end = op.indexOf("&", start + 1);
if (end < 0)
end = op.length();
op = op.substring(0, start + param.length() + 2) + op.substring(end);
}
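A slightly more defensive regex variant is sketched below. The class name QueryParamBlanker and the method name blank are made up for illustration. It quotes the parameter name with Pattern.quote() in case it contains regex metacharacters, and uses a lookbehind so that a name only matches directly after '?' or '&'. This prevents "phone" from also matching inside a parameter such as "telephone".

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class QueryParamBlanker {
    // Empties the value of each listed query parameter, keeping "name=".
    public static String blank(String url, String... params) {
        String result = url;
        for (String param : params) {
            // (?<=[?&]) requires the name to start right after '?' or '&'.
            String regex = "(?<=[?&])" + Pattern.quote(param) + "=[^&#]*";
            result = result.replaceAll(regex, Matcher.quoteReplacement(param + "="));
        }
        return result;
    }

    public static void main(String[] args) {
        String ip = "http://www.google.com?name=bob&email=okATtk.com&address=NYC&phone=007";
        System.out.println(blank(ip, "email", "address", "phone"));
    }
}
```

Values containing characters such as % are handled, because the value pattern stops only at '&' or '#'.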
