What is the better approach to trim unprintable characters from a string - java

I am reading data from XML. When I checked the Eclipse console, I found that the data arrives with some square boxes attached. For example, if there is 123 in the Excel sheet, I get 123 followed by some square boxes. I used trim() to get rid of them but had no success, because trim() only removes whitespace. I found that those characters have negative byte values such as -17 and -20. I don't want to trim only whitespace; I want to trim those square boxes as well.
So I used the following method to trim those characters, and it worked.
What is a more appropriate way of trimming a string?
Trimming a string
String trimData(String accessNum){
StringBuffer sb = new StringBuffer();
try{
if((accessNum != null) && (accessNum.length()>0)){
// Log.i("Settings", accessNum+"Access Number length....."+accessNum.length());
accessNum = accessNum.trim();
byte[] b = accessNum.getBytes();
for(int i=0; i<b.length; i++){
System.out.println(i+"....."+b[i]);
if(b[i]>0){
sb.append((char)(b[i]));
}
}
// Log.i("Settigs", accessNum+"Trimming....");
}}catch(Exception ex){
}
return sb.toString();
}

Edited
Use Normalizer (available since Java 6):
public static final Pattern DIACRITICS_AND_FRIENDS
= Pattern.compile("[\\p{InCombiningDiacriticalMarks}\\p{IsLm}\\p{IsSk}]+");
private static String stripDiacritics(String str) {
str = Normalizer.normalize(str, Normalizer.Form.NFD);
str = DIACRITICS_AND_FRIENDS.matcher(str).replaceAll("");
return str;
}
The questions referenced below contain complete solutions.
If you only want to keep printable ASCII characters (note that this also removes every non-ASCII character), use
rawString.replaceAll("[^\\x20-\\x7e]", "")
Ref : replace special characters in string in java and How to remove high-ASCII characters from string like ®, ©, ™ in Java

Try this:
str = (str == null) ? null :
str.replaceAll("[^\\p{Print}\\p{Space}]", "").trim();
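As a rough sketch of how the regex-based suggestions above differ, the following compares an ASCII-only filter with one based on the Unicode general category \p{C} (control, format, unassigned and similar characters). The class and method names are made up for illustration, and which variant fits depends on whether your data may legitimately contain non-ASCII text.

```java
final class UnprintableStripper {

    // Keeps printable ASCII and ASCII whitespace only.
    // Accented letters and any other non-ASCII characters are removed as well.
    static String stripToPrintableAscii(String s) {
        return s == null ? null : s.replaceAll("[^\\p{Print}\\p{Space}]", "").trim();
    }

    // Removes characters in the Unicode "Other" category (controls, format
    // characters such as a byte order mark, ...) but keeps non-ASCII letters.
    // Caveat: this category also covers tab, carriage return and line feed.
    static String stripUnicodeControls(String s) {
        return s == null ? null : s.replaceAll("\\p{C}", "").trim();
    }

    public static void main(String[] args) {
        System.out.println(stripToPrintableAscii("\uFEFF123\u0001 "));
        System.out.println(stripUnicodeControls("\uFEFFcaf\u00E9\u0000"));
    }
}
```

The first method is the stricter one; the second is an assumption about what "unprintable" means and may need adjusting if line breaks inside the value must survive.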

Related

Replace a million different regex of a string

I'm doing a million different regex replacements on a string, so I decided to save all the regexes and their replacements in a text file. I tried reading the file line by line and applying each replacement, but it is not working.
replace_regex_file.txt
aaa zzz
^cc eee
ww$ sss
...
...
...
...
a million data
Coding
String user_input = "assume 100,000 words"; // input from user
String regex_file = "replace_regex_file.txt";
String result="";
String line;
try (BufferedReader reader = new BufferedReader(new FileReader(regex_file))) {
while ((line = reader.readLine()) != null) { // while line not equal null
String[] parts = line.split("\\s+", 2); //split process
if (parts.length >=2) {
String regex = parts[0]; // String regex stored in first array
String replace = parts[1]; // String replacement stored in second array
result = user_input.replaceAll(regex, replace); // replace processing
}
}
}
System.out.println(result); // show the result
But it does not replace anything. How can I fix this?
Your current code only ever keeps the result of the last matching regex, because every iteration runs the replacement on the original user_input instead of on the output of the previous replacement:
result = user_input.replaceAll(regex, replace);
Instead, try:
String result = user_input;
before the loop, and inside the loop use
result = result.replaceAll(regex, replace);
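A minimal sketch of the corrected loop, assuming the rules have already been read into a list of lines in the "regex replacement" format from the question. The class name RuleReplacer and the method applyRules are hypothetical names used only for this illustration.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

final class RuleReplacer {

    // Applies every rule in order; each rule works on the output of the previous one.
    static String applyRules(String input, List<String> ruleLines) {
        String result = input;
        for (String line : ruleLines) {
            String[] parts = line.split("\\s+", 2); // regex, then replacement
            if (parts.length < 2) {
                continue; // skip malformed lines
            }
            result = Pattern.compile(parts[0]).matcher(result).replaceAll(parts[1]);
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> rules = Arrays.asList("aaa zzz", "^cc eee", "ww$ sss");
        System.out.println(applyRules("cc aaa ww", rules)); // eee zzz sss
    }
}
```

With a very large rule file it may be worth compiling the patterns once and reusing them across inputs, since compilation is repeated here for every call.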

Tokenize Arabic text files java

I am trying to tokenize some text files into words, and I wrote this code. It works perfectly for English, but when I try it on Arabic it does not work.
I added UTF-8 to read the Arabic files. Did I miss something?
public void parseFiles(String filePath) throws FileNotFoundException, IOException {
File[] allfiles = new File(filePath).listFiles();
BufferedReader in = null;
for (File f : allfiles) {
if (f.getName().endsWith(".txt")) {
fileNameList.add(f.getName());
Reader fstream = new InputStreamReader(new FileInputStream(f),"UTF-8");
// BufferedReader br = new BufferedReader(fstream);
in = new BufferedReader(fstream);
StringBuilder sb = new StringBuilder();
String s=null;
String word = null;
while ((s = in.readLine()) != null) {
Scanner input = new Scanner(s);
while(input.hasNext()) {
word = input.next();
if(stopword.isStopword(word)==true)
{
word= word.replace(word, "");
}
//String stemmed=stem.stem (word);
sb.append(word+"\t");
}
//System.out.print(sb); ///here the arabic text is outputed without stopwords
}
String[] tokenizedTerms = sb.toString().replaceAll("[\\W&&[^\\s]]", "").split("\\W+"); //to get individual terms
for (String term : tokenizedTerms) {
if (!allTerms.contains(term)) { //avoid duplicate entry
allTerms.add(term);
System.out.print(term+"\t"); //here the problem.
}
}
termsDocsArray.add(tokenizedTerms);
}
}
}
Any ideas to help me proceed would be appreciated.
Thanks
The problem lies with your regex, which works well for English but not for Arabic, because by definition
[\\W&&[^\\s]]
means
// matches any single non-word character except whitespace.
\W A non-word character, i.e. anything other than [a-zA-Z_0-9]. (All Arabic characters satisfy this condition.)
\s A whitespace character, short for [ \t\n\x0b\r\f]
So, by this logic, every Arabic character is matched by this regex. So, when you call
sb.toString().replaceAll("[\\W&&[^\\s]]", "")
it means: replace every non-word character that is not whitespace with "". For Arabic text that is every character, so all Arabic characters are replaced by "" and no output appears. You will have to tweak this regex to work for Arabic text, or just split the string on whitespace like
sb.toString().split("\\s+")
which will give you an array of the Arabic words separated by whitespace.
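One possible way to tweak the regex, sketched under the assumption that a "word" is a run of Unicode letters, combining marks and decimal digits: the Unicode categories \p{L}, \p{M} and \p{Nd} are used instead of \W, so Arabic letters are kept and only punctuation is removed. The class name UnicodeTokenizer is hypothetical, and the Arabic sample text is written with Unicode escapes so the source file's encoding does not matter.

```java
import java.util.regex.Pattern;

final class UnicodeTokenizer {

    // Anything that is not a letter, combining mark, decimal digit or whitespace.
    private static final Pattern NON_WORD =
            Pattern.compile("[^\\p{L}\\p{M}\\p{Nd}\\s]+");

    static String[] tokenize(String text) {
        String cleaned = NON_WORD.matcher(text).replaceAll("").trim();
        return cleaned.isEmpty() ? new String[0] : cleaned.split("\\s+");
    }

    public static void main(String[] args) {
        // Two Arabic words separated by an Arabic comma and a space, ending in "!"
        String sample = "\u0645\u0631\u062D\u0628\u0627\u060C "
                + "\u0628\u0627\u0644\u0639\u0627\u0644\u0645!";
        for (String token : tokenize(sample)) {
            System.out.println(token);
        }
    }
}
```

This only addresses the character-class problem; it still treats whitespace as the sole word boundary.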
In addition to worrying about character encoding as in bgth's response, tokenizing Arabic has the added complication that words are not necessarily separated by whitespace:
http://www1.cs.columbia.edu/~rambow/papers/habash-rambow-2005a.pdf
If you're not familiar with Arabic, you'll need to read up on some of the methods for tokenization:
http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.120.9748

Find and replace all NewLine or BreakLine characters with \n in a String - Platform independent

I am looking for a proper and robust way to find all newline or line-break characters in a String and replace them with \n, independent of the OS platform.
This is what I tried, but it didn't really work well.
public static String replaceNewLineChar(String str) {
try {
if (!str.isEmpty()) {
return str.replaceAll("\n\r", "\\n")
.replaceAll("\n", "\\n")
.replaceAll(System.lineSeparator(), "\\n");
}
return str;
} catch (Exception e) {
// Log this exception
return str;
}
}
Example:
Input String:
This is a String
and all newline chars
should be replaced in this example.
Expected Output String:
This is a String\nand all newline chars\nshould be replaced in this example.
However, it returned the same input String back, as if it had inserted \n and interpreted it as a newline again.
Please note, if you wonder why someone would want \n in there: this is a special requirement by the user, who places the String in XML afterwards.
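If you want a literal \n then the following should work (a backslash in the replacement string has to be escaped itself, hence the four backslashes in the Java literal):
String repl = str.replaceAll("(\\r|\\n|\\r\\n)+", "\\\\n");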
This seems to work well:
String s = "This is a String\nand all newline chars\nshould be replaced in this example.";
System.out.println(s);
System.out.println(s.replaceAll("[\\n\\r]+", "\\\\n"));
By the way, you don't need to catch the exception.
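Both regexes above collapse a run of consecutive line breaks into a single \n. If each line break should produce its own \n (so that empty lines survive), a variation along these lines might be used. It is only a sketch: the class name NewlineEscaper is hypothetical, and Matcher.quoteReplacement is used so the backslash does not have to be escaped by hand.

```java
import java.util.regex.Matcher;

final class NewlineEscaper {

    // Replaces each line break (CRLF, lone CR or lone LF) with the two
    // characters backslash + n. CRLF is listed first so it counts as one break.
    static String escapeNewlines(String s) {
        return s.replaceAll("\\r\\n|\\r|\\n", Matcher.quoteReplacement("\\n"));
    }

    public static void main(String[] args) {
        String s = "This is a String\r\nand all newline chars\nshould be replaced.";
        System.out.println(escapeNewlines(s));
    }
}
```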
Oh sure, you could do it with one line of regex, but what fun is that?
public static String fixToNewline(String orig){
char[] chars = orig.toCharArray();
StringBuilder sb = new StringBuilder(100);
for(char c : chars){
switch(c){
case '\r':
case '\f':
break;
case '\n':
sb.append("\\n");
break;
default:
sb.append(c);
}
}
return sb.toString();
}
public static void main(String[] args){
String s = "This is \r\n a String with \n Different Newlines \f and other things.";
System.out.println(s);
System.out.println();
System.out.println("Now calling fixToNewline....");
System.out.println(fixToNewline(s));
}
The result
This is
a String with
Different Newlines and other things.
Now calling fixToNewline....
This is \n a String with \n Different Newlines and other things.

Split text file into Strings on empty line

I want to read a local txt file and get the text in this file. After that I want to split the whole text into Strings as in the example below.
Example:
Let's say the file contains:
abcdef
ghijkl

aededd
ededed

ededfe
efefeef
efefeff
......
......
I want to split this text into Strings:
s1 = abcdef+"\n"+ghijkl;
s2 = aededd+"\n"+ededed;
s3 = ededfe+"\n"+efefeef+"\n"+efefeff;
........................
I mean I want to split the text on empty lines.
I do know how to read a file; I want help splitting the text into strings.
You can split a string into an array with
String.split();
If you want to split on empty lines (two consecutive newlines), it will be
String.split("\\n\\n");
UPDATE:
If I understand what you are saying, John,
then your code will essentially be
BufferedReader in
= new BufferedReader(new FileReader("foo.txt"));
List<String> allStrings = new ArrayList<String>();
String str ="";
while(true)
{
String tmp = in.readLine();
if(tmp.isEmpty())
{
if(!str.isEmpty())
{
allStrings.add(str);
}
str= "";
}
else if(tmp==null)
{
break;
}
else
{
if(str.isEmpty())
{
str = tmp;
}
else
{
str += "\\n" + tmp;
}
}
}
Might be what you are trying to parse.
Where allStrings is a list of all of your strings.
The below code would work even if there are more than 2 empty lines between useful data.
import java.util.regex.*;
// read your file and store it in a string named str_file_data
Pattern p = Pattern.compile("\\n[\\n]+"); /*if your text file has \r\n as the newline character then use Pattern p = Pattern.compile("\\r\\n[\\r\\n]+");*/
String[] result = p.split(str_file_data);
(I did not test the code so there could be typos.)
I would suggest a more general regexp:
text.split("(?m)^\\s*$");
In this case it would work correctly with any end-of-line convention, and it would also treat empty lines and whitespace-only lines the same way.
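Another way to stay independent of the end-of-line convention, shown here only as a sketch, is to normalise all line breaks to \n first and then split on a newline, optional whitespace, and another newline. The class name BlankLineSplitter is hypothetical; the sample data is the example from the question.

```java
final class BlankLineSplitter {

    // Returns the blocks of text that are separated by one or more
    // empty (or whitespace-only) lines.
    static String[] splitOnBlankLines(String text) {
        // Normalise CRLF and lone CR to LF, and drop leading/trailing whitespace.
        String normalized = text.replaceAll("\\r\\n|\\r", "\n").trim();
        return normalized.split("\\n\\s*\\n");
    }

    public static void main(String[] args) {
        String data = "abcdef\r\nghijkl\r\n\r\naededd\nededed\n\n\n"
                + "ededfe\nefefeef\nefefeff\n";
        for (String block : splitOnBlankLines(data)) {
            System.out.println("[" + block + "]");
        }
    }
}
```

Each returned block keeps its internal \n separators, which matches the s1, s2, s3 layout asked for in the question.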
It may depend on which line endings the file uses, so I would likely do the following:
String.split("(?>\\r\\n|\\n|\\r){2}");
Some text files (for example on Windows) encode newlines as "\r\n" while others may use simply "\n". Two newlines in a row mean you have an empty line. The atomic group (?>...) keeps a single "\r\n" from being counted as two newlines.
Godwin was on the right track, but I think we can make this work a bit better. Using '[ ]' in a regex is an "or", so in his example, if you had a \r\n, that would just be a new line, not an empty line. The regular expression would split it on both the \r and the \n, and I believe in the example we were looking for an empty line, which would require either a \n\r\n\r, a \r\n\r\n, a \n\r\r\n, a \r\n\n\r, a \n\n, or a \r\r.
So first we want to look for either \n\r or \r\n twice, with any combination of the two being possible.
String.split("((\\n\\r)|(\\r\\n)){2}");
Next we need to look for \r without a \n after it:
String.split("\\r{2}");
Lastly, let's do the same for \n:
String.split("\\n{2}");
And all together that should be
String.split("((\\n\\r)|(\\r\\n)){2}|(\\r){2}|(\\n){2}");
Note, this works only for the very specific case of newlines and carriage returns. In Ruby you can do the following, which would cover more cases. I don't know if there is an equivalent in Java.
.match($^$)
@Kevin's code works fine, but as he mentioned that the code was not tested, here are the 3 changes required:
1. The check for (tmp==null) should come first, otherwise there will be a NullPointerException.
2. The code never adds the last set of lines to the ArrayList. To make sure the last one gets added, include this code after the while loop: if(!str.isEmpty()) { allStrings.add(str); }
3. The line str += "\\n" + tmp; should be changed to use \n instead of \\n. The entire corrected code follows:
BufferedReader in
= new BufferedReader(new FileReader("foo.txt"));
List<String> allStrings = new ArrayList<String>();
String str ="";
while(true)
{
String tmp = in.readLine();
if(tmp==null)
{
break;
}else if(tmp.isEmpty())
{
if(!str.isEmpty())
{
allStrings.add(str);
}
str= "";
}else
{
if(str.isEmpty())
{
str = tmp;
}
else
{
str += "\n" + tmp;
}
}
}
if(!str.isEmpty())
{
allStrings.add(str);
}

Decode a string in Java

How do I properly decode the following string in Java?
http%3A//www.google.ru/search%3Fhl%3Dru%26q%3Dla+mer+powder%26btnG%3D%u0420%A0%u0421%u045F%u0420%A0%u0421%u2022%u0420%A0%u0421%u2018%u0420%u040E%u0420%u0453%u0420%A0%u0421%u201D+%u0420%A0%u0420%u2020+Google%26lr%3D%26rlz%3D1I7SKPT_ru
When I use URLDecoder.decode() I get the following error:
java.lang.IllegalArgumentException: URLDecoder: Illegal hex characters in escape (%) pattern - For input string: "u0"
Thanks,
Dave
According to Wikipedia, "there exist a non-standard encoding for Unicode characters: %uxxxx, where xxxx is a Unicode value".
Continuing: "This behavior is not specified by any RFC and has been rejected by the W3C".
Your URL contains such tokens, and the Java URLDecoder implementation doesn't support those.
%uXXXX encoding is non-standard, and was actually rejected by W3C, so it's natural, that URLDecoder does not understand it.
You can write a small function that fixes it by replacing each occurrence of %uXXYY with %XX%YY in your encoded string. Then you can proceed and decode the fixed string normally.
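A different take on such a helper function, given only as a sketch: instead of rewriting the %uXXXX tokens, it converts each one directly to its character and passes the text between the tokens to URLDecoder. It assumes the ordinary %XX escapes are UTF-8, which the question does not confirm; the class name PercentUDecoder is hypothetical.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class PercentUDecoder {

    // A non-standard escape: percent sign, the letter u, four hex digits.
    private static final Pattern PERCENT_U = Pattern.compile("%u([0-9A-Fa-f]{4})");

    static String decode(String input) {
        StringBuilder out = new StringBuilder();
        Matcher m = PERCENT_U.matcher(input);
        int last = 0;
        while (m.find()) {
            // Standard escapes between two non-standard tokens are decoded normally.
            out.append(decodeStandard(input.substring(last, m.start())));
            // The four hex digits are taken as one UTF-16 code unit.
            out.append((char) Integer.parseInt(m.group(1), 16));
            last = m.end();
        }
        out.append(decodeStandard(input.substring(last)));
        return out.toString();
    }

    private static String decodeStandard(String part) {
        try {
            return URLDecoder.decode(part, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e); // UTF-8 is always available
        }
    }

    public static void main(String[] args) {
        System.out.println(decode("http%3A//www.example.com/%3Fq%3Dla+mer"));
    }
}
```

A truncated escape at the end of the input (for example a lone "%u04") is not handled here and would still make URLDecoder throw an IllegalArgumentException.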
We started with Vartec's solution but found additional issues. This solution works for UTF-16, but it can be changed to return UTF-8. The replaceAll calls are left in for clarity, and you can read more at http://www.cogniteam.com/wiki/index.php?title=DecodeEncodeJavaScript
static public String unescape(String escaped) throws UnsupportedEncodingException
{
// This code is needed so that the UTF-16 won't be malformed
String str = escaped.replaceAll("%0", "%u000");
str = str.replaceAll("%1", "%u001");
str = str.replaceAll("%2", "%u002");
str = str.replaceAll("%3", "%u003");
str = str.replaceAll("%4", "%u004");
str = str.replaceAll("%5", "%u005");
str = str.replaceAll("%6", "%u006");
str = str.replaceAll("%7", "%u007");
str = str.replaceAll("%8", "%u008");
str = str.replaceAll("%9", "%u009");
str = str.replaceAll("%A", "%u00A");
str = str.replaceAll("%B", "%u00B");
str = str.replaceAll("%C", "%u00C");
str = str.replaceAll("%D", "%u00D");
str = str.replaceAll("%E", "%u00E");
str = str.replaceAll("%F", "%u00F");
// Here we split the 4 byte to 2 byte, so that decode won't fail
String [] arr = str.split("%u");
Vector<String> vec = new Vector<String>();
if(!arr[0].isEmpty())
{
vec.add(arr[0]);
}
for (int i = 1 ; i < arr.length ; i++) {
if(!arr[i].isEmpty())
{
vec.add("%"+arr[i].substring(0, 2));
vec.add("%"+arr[i].substring(2));
}
}
str = "";
for (String string : vec) {
str += string;
}
// Here we return the decoded string
return URLDecoder.decode(str,"UTF-16");
}
After having had a good look at the solution presented by @ariy, I created a Java-based solution that is also resilient against encoded characters that have been chopped into two parts (i.e. half of the encoded character is missing). This happens in my use case, where I need to decode long URLs that are sometimes chopped at a length of 2000 characters. See What is the maximum length of a URL in different browsers?
public class Utils {
private static Pattern validStandard = Pattern.compile("%([0-9A-Fa-f]{2})");
private static Pattern choppedStandard = Pattern.compile("%[0-9A-Fa-f]{0,1}$");
private static Pattern validNonStandard = Pattern.compile("%u([0-9A-Fa-f][0-9A-Fa-f])([0-9A-Fa-f][0-9A-Fa-f])");
private static Pattern choppedNonStandard = Pattern.compile("%u[0-9A-Fa-f]{0,3}$");
public static String resilientUrlDecode(String input) {
String cookedInput = input;
if (cookedInput.indexOf('%') > -1) {
// Transform all existing UTF-8 standard into UTF-16 standard.
cookedInput = validStandard.matcher(cookedInput).replaceAll("%00%$1");
// Discard chopped encoded char at the end of the line (there is no way to know what it was)
cookedInput = choppedStandard.matcher(cookedInput).replaceAll("");
// Handle non standard (rejected by W3C) encoding that is used anyway by some
// See: https://stackoverflow.com/a/5408655/114196
if (cookedInput.contains("%u")) {
// Transform all existing non standard into UTF-16 standard.
cookedInput = validNonStandard.matcher(cookedInput).replaceAll("%$1%$2");
// Discard chopped encoded char at the end of the line
cookedInput = choppedNonStandard.matcher(cookedInput).replaceAll("");
}
}
try {
return URLDecoder.decode(cookedInput,"UTF-16");
} catch (UnsupportedEncodingException e) {
// Will never happen because the encoding is hardcoded
return null;
}
}
}
