Java: convert hex NCR text to Unicode characters

I'm making a feed reader app for local languages. A news site provides an RSS feed whose text arrives as hexadecimal numeric character references (NCRs), for example
&#x0D39;&#x0D32;&#x0D4B;
for the first word of
ഹലോ സ്റ്റാക്ക്ഓവർ ഫ്ലോ
The decoded text is what I want to display in my app.
How can I convert this input to the required form?

Try this.
    String input = "&#x0D39;&#x0D32;&#x0D4B; &#x0D38;&#x0D4D;&#x0D31;"
        + "&#x0D4D;&#x0D31;&#x0D3E;&#x0D15;&#x0D4D;&#x0D15;&#x0D4D;&#x0D13;"
        + "&#x0D35;&#x0D7C; &#x0D2B;&#x0D4D;&#x0D32;&#x0D4B;";
    Pattern HEX = Pattern.compile("(?i)&#x([0-9a-f]+);|&#(\\d+);");
    Matcher m = HEX.matcher(input);
    StringBuffer sb = new StringBuffer();
    while (m.find())
        // quoteReplacement keeps a decoded '$' or '\' from being read as a group reference
        m.appendReplacement(sb, Matcher.quoteReplacement(
            String.valueOf((char) (m.group(1) != null ?
                Integer.parseInt(m.group(1), 16) :
                Integer.parseInt(m.group(2))))));
    m.appendTail(sb);
    String output = sb.toString();
    System.out.println(output);
    // -> ഹലോ സ്റ്റാക്ക്ഓവർ ഫ്ലോ
This code also handles decimal NCRs.
However, it cannot handle code points x10000 to x10FFFF, because the cast to char keeps only 16 bits.
Or you can use Jsoup like this.
    Document doc = Jsoup.parse(input);
    String output = doc.text();
    System.out.println(output);
    // -> ഹലോ സ്റ്റാക്ക്ഓവർ ഫ്ലോ
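The regex approach can be extended to the range x10000 to x10FFFF by building the replacement from the full code point instead of casting to char. The following is only a sketch of that idea; the class name NcrDecoder and the method decode are made up for illustration, and references that do not denote a valid code point are simply left untouched.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class NcrDecoder {
    // Hex (&#x...;) or decimal (&#...;) numeric character references.
    private static final Pattern NCR =
            Pattern.compile("&#[xX]([0-9a-fA-F]+);|&#([0-9]+);");

    static String decode(String input) {
        Matcher m = NCR.matcher(input);
        StringBuffer sb = new StringBuffer();
        while (m.find()) {
            String replacement;
            try {
                int codePoint = m.group(1) != null
                        ? Integer.parseInt(m.group(1), 16)
                        : Integer.parseInt(m.group(2));
                // toChars yields a surrogate pair for code points above U+FFFF.
                replacement = Character.isValidCodePoint(codePoint)
                        ? new String(Character.toChars(codePoint))
                        : m.group();
            } catch (NumberFormatException e) {
                replacement = m.group(); // number does not fit an int: keep the text as-is
            }
            // quoteReplacement: a decoded '$' or '\' must not act as a group reference.
            m.appendReplacement(sb, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(sb);
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(decode("&#x0D39;&#x0D32;&#x0D4B; &#x1F600; &#65;"));
    }
}
```

For plain feed text this should behave like the shorter snippet; the difference only shows up for supplementary characters and for replacement text containing $ or \.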

Related

java.util.regex.Matcher is unable to find anything inside a String obtained from Apache IOUtils

I have a code snippet to convert an input stream into a String. I then use java.util.regex.Matcher to find something inside the string.
The following works for me:
    StringBuilder sb = new StringBuilder();
    InputStream ins; // the InputStream data
    BufferedReader br = new BufferedReader(new InputStreamReader(ins));
    br.lines().forEach(sb::append);
    br.close();
    String data = sb.toString();
    Pattern pattern = Pattern.compile(".*My_PATTERN:(.*)");
    Matcher matcher = pattern.matcher(data);
    if (matcher.find()) {
        String searchedStr = matcher.group(1); // I find a match here
    }
But if I replace BufferedReader with Apache IOUtils, I do not find any matches in the same string.
    InputStream ins; // the InputStream data
    String data = IOUtils.toString(ins, StandardCharsets.UTF_8);
    Pattern pattern = Pattern.compile(".*My_PATTERN:(.*)");
    Matcher matcher = pattern.matcher(data);
    if (matcher.find()) {
        String searchedStr = matcher.group(1); // I cannot find a match here
    }
I have tried with other "StandardCharsets" apart from UTF-8 but none have worked.
I am unable to understand what is different here that would cause IOUtils to not match. Can someone kindly help me out here?

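The first snippet removes line breaks (BufferedReader.lines() strips the line terminators); the second keeps them.
By default . does not match a line terminator, so you should enable DOTALL matching. MULTILINE only changes how ^ and $ behave, so it is harmless here but not required.
With inline flags at the start of the pattern (s=dotall, m=multiline):
    Pattern pattern = Pattern.compile("(?sm).*My_PATTERN:(.*)");
With a character class that matches any character, line breaks included:
    Pattern pattern = Pattern.compile("[\\s\\S]*My_PATTERN:([\\s\\S]*)");
With flag constants:
    Pattern pattern = Pattern.compile(".*My_PATTERN:(.*)", Pattern.MULTILINE | Pattern.DOTALL);
All three let the group's value include line breaks.
Or remove the line breaks first, e.g.: data = data.replaceAll("\\r?\\n", "");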
See:
https://docs.oracle.com/javase/7/docs/api/java/util/regex/Pattern.html#compile(java.lang.String,%20int)
https://docs.oracle.com/javase/tutorial/essential/regex/pattern.html
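To see what the flag changes, here is a small sketch that runs the same pattern with and without DOTALL. The class name DotAllDemo, the helper capture and the sample text are invented for the demonstration and do not come from the question.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class DotAllDemo {
    // Returns group(1) of the first match, or null when nothing matches.
    static String capture(String regex, int flags, String data) {
        Matcher m = Pattern.compile(regex, flags).matcher(data);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String data = "header\nMy_PATTERN:first\nsecond";
        String regex = ".*My_PATTERN:(.*)";

        // Without DOTALL the group stops at the end of the line.
        System.out.println(capture(regex, 0, data));
        // With DOTALL the group runs across line breaks to the end of the input.
        System.out.println(capture(regex, Pattern.DOTALL, data));
        // Stripping line breaks first mimics what BufferedReader.lines() did.
        System.out.println(capture(regex, 0, data.replaceAll("\\r?\\n", "")));
    }
}
```

Which of the three results is wanted depends on whether the value after My_PATTERN: may span several lines.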

How to remove duplicate words (words are going not in a row) in file using regex?

I want to remove all the words which are duplicate from a file using regex.
For example:
The university of Hawaii university began using began radio.
Output:
The university of Hawaii began using radio.
I wrote this regex:
String regex = "\\b(\\p{IsAlphabetic}+)(\\s+\\1\\b)+";
Which is removing only words which are going in a row after word.
For example:
The university university of Hawaii Hawaii began using radio.
Output: The university of Hawaii began using radio.
My code with regex:
    File dir = new File("C:/Users/Arnoldas/workspace/uplo/");
    String source = dir.getCanonicalPath() + File.separator + "Output.txt";
    String dest = dir.getCanonicalPath() + File.separator + "Final.txt";
    File fin = new File(source);
    FileInputStream fis = new FileInputStream(fin);
    BufferedReader in = new BufferedReader(new InputStreamReader(fis, "UTF-8"));
    //FileWriter fstream = new FileWriter(dest, true);
    OutputStreamWriter fstream = new OutputStreamWriter(new FileOutputStream(dest, true), "UTF-8");
    BufferedWriter out = new BufferedWriter(fstream);
    String regex = "\\b(\\p{IsAlphabetic}+)(\\s+\\1\\b)+";
    //String regex = "(?i)\\b([a-z]+)\\b(?:\\s+\\1\\b)+";
    Pattern p = Pattern.compile(regex, Pattern.CASE_INSENSITIVE);
    String aLine;
    while ((aLine = in.readLine()) != null) {
        Matcher m = p.matcher(aLine);
        while (m.find()) {
            aLine = aLine.replaceAll(m.group(), m.group(1));
        }
        //Process each line and add output to *.txt file
        out.write(aLine);
        out.newLine();
        out.flush();
    }

You could use Streams instead:
    String s = "The university university of Hawaii Hawaii began using radio.";
    System.out.println(Arrays.stream(s.split(" ")).distinct().collect(Collectors.joining(" ")));
In this example the String is split on spaces and then turned into a stream. Duplicates are removed with distinct(), and at the end everything is joined back together with spaces in between.
But this approach has a problem with the dot at the end: "radio" and "radio." count as different words.

Try this regular expression:
\b(\w+)\s+\1\b
Here \b is a word boundary and \1 references the captured match of the first group.
Source : Regular Expression For Consecutive Duplicate Words
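As a quick check of what this expression covers, the sketch below applies it with replaceAll. The class name AdjacentDuplicates and the method collapse are made up for the example. It collapses words repeated directly after each other and leaves repetitions that are separated by other words alone, which is the case the question asks about.

```java
final class AdjacentDuplicates {
    // Replaces "word word" with "word"; only directly adjacent repeats are matched.
    static String collapse(String s) {
        return s.replaceAll("\\b(\\w+)\\s+\\1\\b", "$1");
    }

    public static void main(String[] args) {
        // Adjacent repeats are collapsed.
        System.out.println(collapse("The university university of Hawaii Hawaii began using radio."));
        // Repeats with other words in between stay as they are.
        System.out.println(collapse("The university of Hawaii university began using began radio."));
    }
}
```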

You were on the right track, but if there can be text between the repetitions,
it must be done in a loop (for "began ... began ... began").
    String s = "The university of Hawaii university began using began radio.";
    for (;;) {
        String t = s.replaceAll("(?i)\\b(\\p{IsAlphabetic}+)\\b(.*?)\\s*\\b\\1\\b",
                "$1$2");
        if (t.equals(s)) {
            break;
        }
        s = t;
    }
For case-insensitive replacement, use (?i).
This is very inefficient, as the regex must backtrack.
It is simpler to throw all the words into a Set.
    // Note: Set.of(...) (Java 9) cannot be used here; it throws
    // IllegalArgumentException when the array contains duplicate elements.
    Set<String> corpus = new TreeSet<>(); // sorted; use LinkedHashSet to keep the original order
    Collections.addAll(corpus, s.split("\\P{IsAlphabetic}+"));
    corpus.remove("");
After comment
Correction of the original code:
New-style I/O using Files and Path, still no streams though
Try-with-resources for automatically closing in and out
Regex only to find a word with optional leading whitespace; a set is used to check for duplicates.
    Path dir = Paths.get("C:/Users/Arnoldas/workspace/uplo");
    Path source = dir.resolve("Output.txt");
    Path dest = dir.resolve("Final.txt");
    String regex = "(\\s*)\\b(\\p{IsAlphabetic}+)\\b";
    Pattern p = Pattern.compile(regex, Pattern.CASE_INSENSITIVE);
    try (BufferedReader in = Files.newBufferedReader(source);
            BufferedWriter out = Files.newBufferedWriter(dest)) {
        String line;
        while ((line = in.readLine()) != null) {
            Set<String> words = new HashSet<>();
            Matcher m = p.matcher(line);
            StringBuffer sb = new StringBuffer();
            while (m.find()) {
                // add() returns false when the word was already seen on this line
                boolean added = words.add(m.group(2).toLowerCase());
                m.appendReplacement(sb, added ? m.group() : "");
            }
            m.appendTail(sb);
            out.write(sb.toString());
            out.newLine();
        }
    }
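The same set-based idea can be tried without any files by reducing it to a method that works on one line of text. This is a sketch only; the class name WordDeduper and the method removeRepeatedWords are hypothetical, and "word" here means a run of alphabetic characters compared case-insensitively.

```java
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class WordDeduper {
    // Group 1: optional leading whitespace; group 2: the word itself.
    private static final Pattern WORD =
            Pattern.compile("(\\s*)\\b(\\p{IsAlphabetic}+)\\b");

    static String removeRepeatedWords(String line) {
        Set<String> seen = new HashSet<>();
        Matcher m = WORD.matcher(line);
        StringBuffer sb = new StringBuffer();
        while (m.find()) {
            // add() returns false for a word that already occurred in this line.
            boolean firstTime = seen.add(m.group(2).toLowerCase(Locale.ROOT));
            // Keep the first occurrence; drop later ones with their leading whitespace.
            m.appendReplacement(sb, firstTime ? Matcher.quoteReplacement(m.group()) : "");
        }
        m.appendTail(sb); // copies trailing text such as the final dot
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(removeRepeatedWords(
                "The university of Hawaii university began using began radio."));
    }
}
```

Because punctuation is never part of a match, it is copied through unchanged, so "radio" and "radio." are no longer treated as different words.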

Java Pattern/ Matcher

This is a sample text: \1f\1e\1d\020028. I cannot modify the input text, I am reading long string of texts from a file.
I want to extract the following: \1f, \1e, \1d, \02
For this, I have written the following regular expression pattern: "\\[a-fA-F0-9]"
I am using the Pattern and Matcher classes, but my matcher is not able to find the pattern using the mentioned regular expression. I have tested this regex with the text on some online regex websites and, surprisingly, it works there.
Where am I going wrong?
Original code:
    public static void main(String[] args) {
        String inputText = "\1f\1e\1d\02002868BF03030000000000000000S023\1f\1e\1d\03\0d";
        inputText = inputText.replace("\\", "\\\\");
        String regex = "\\\\[a-fA-F0-9]{2}";
        Pattern p = Pattern.compile(regex);
        Matcher m = p.matcher(inputText);
        while (m.find()) {
            System.out.println(m.group());
        }
    }
Output: Nothing is printed

(answer changed after OP added more details)
Your string
    String inputText = "\1f\1e\1d\02002868BF03030000000000000000S023\1f\1e\1d\03\0d";
doesn't actually contain any \ literals, because according to the Java Language Specification, section 3.10.6 "Escape Sequences for Character and String Literals", \xxx is interpreted as the character whose Unicode code point has the octal (base/radix 8) value xxx.
Example: \123 = 1*8^2 + 2*8^1 + 3*8^0 = 1*64 + 2*8 + 3*1 = 64+16+3 = 83, which represents the character S.
If the string you presented in your question is written exactly like that in your text file, then in Java source you should write it as
    String inputText = "\\1f\\1e\\1d\\02002868BF03030000000000000000S023\\1f\\1e\\1d\\03\\0d";
(with each \ escaped, so that it now represents a literal backslash).
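The difference between the two literals can be checked directly. In this sketch the class name OctalEscapeDemo and its members are invented for illustration; the regex is the one from the question.

```java
import java.util.regex.Pattern;

final class OctalEscapeDemo {
    // "\1f" in source code: the octal escape \1 (U+0001) followed by the letter f.
    static final String UNESCAPED = "\1f";
    // "\\1f" in source code: a literal backslash, then '1', then 'f'.
    static final String ESCAPED = "\\1f";

    private static final Pattern BACKSLASH_HEX = Pattern.compile("\\\\[a-fA-F0-9]{2}");

    static boolean containsBackslashHex(String s) {
        return BACKSLASH_HEX.matcher(s).find();
    }

    public static void main(String[] args) {
        System.out.println(UNESCAPED.length());               // two characters
        System.out.println((int) UNESCAPED.charAt(0));        // code point 1, not a backslash
        System.out.println(ESCAPED.length());                 // three characters
        System.out.println("\123".equals("S"));               // octal 123 is decimal 83, 'S'
        System.out.println(containsBackslashHex(UNESCAPED));  // no backslash to match
        System.out.println(containsBackslashHex(ESCAPED));    // matches \1f
    }
}
```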
(older version of my answer)
It is hard to tell what exactly you did wrong without seeing your code. You should be able to find at least \1, \1, \1, \0 since your regex can match one \ and one hexadecimal character placed after it.
Anyway this is how you can find results you mentioned in question:
    String text = "\\1f\\1e\\1d\\020028";
    // {2}: we want to find two hexadecimal characters after \
    Pattern p = Pattern.compile("\\\\[a-fA-F0-9]{2}");
    Matcher m = p.matcher(text);
    while (m.find())
        System.out.println(m.group());
Output:
\1f
\1e
\1d
\02

You need to read the file properly and replace '\' characters with '\\'. Assume that there is a file called test_file.txt in your project with this content:
\1f\1e\1d\02002868BF03030000000000000000S023\1f\1e\1d\03\0d
Here is the code to read the file and extract the values:
    public static void main(String[] args) throws IOException, URISyntaxException {
        Test t = new Test();
        t.test();
    }

    public void test() throws IOException {
        BufferedReader br =
            new BufferedReader(
                new InputStreamReader(
                    getClass().getResourceAsStream("/test_file.txt"), "UTF-8"));
        String inputText;
        while ((inputText = br.readLine()) != null) {
            inputText = inputText.replace("\\", "\\\\");
            Pattern pattern = Pattern.compile("\\\\[a-fA-F0-9]{2}");
            Matcher match = pattern.matcher(inputText);
            while (match.find()) {
                System.out.println(match.group());
            }
        }
    }

Try adding a . at the end, like:
\\[a-fA-F0-9].

If you don't want to modify the input string, you could try something like:
    static public void main(String[] argv) {
        String s = "\1f\1e\1d\020028";
        Pattern regex = Pattern.compile("[\\x00-\\x1f][0-9A-Fa-f]");
        Matcher match = regex.matcher(s);
        while (match.find()) {
            char[] c = match.group().toCharArray();
            System.out.println(String.format("\\%d%s", c[0] + 0, c[1]));
        }
    }
Yes, it's not perfect, but you get the idea.

What is the better approach to trim unprintable characters from a string

I am reading data from XML. When I checked in the Eclipse console, I found that the data comes with some square boxes attached. For example, if there is 123 in the Excel sheet, I get 123 with some square boxes. I used trim() to avoid this but did not succeed, because trim() only trims whitespace. I found that those characters have byte values such as -17 and -20. I don't want to trim only whitespace; I want to trim those square boxes as well.
So I have used the following method to trim those characters, and it works.
What is a more appropriate way of trimming a string?
Trimming a string
    String trimData(String accessNum) {
        StringBuffer sb = new StringBuffer();
        try {
            if ((accessNum != null) && (accessNum.length() > 0)) {
                // Log.i("Settings", accessNum+"Access Number length....."+accessNum.length());
                accessNum = accessNum.trim();
                byte[] b = accessNum.getBytes();
                for (int i = 0; i < b.length; i++) {
                    System.out.println(i + "....." + b[i]);
                    if (b[i] > 0) {
                        sb.append((char) (b[i]));
                    }
                }
                // Log.i("Settigs", accessNum+"Trimming....");
            }
        } catch (Exception ex) {
        }
        return sb.toString();
    }

Edited
Use Normalizer (since Java 6):
    public static final Pattern DIACRITICS_AND_FRIENDS
        = Pattern.compile("[\\p{InCombiningDiacriticalMarks}\\p{IsLm}\\p{IsSk}]+");

    private static String stripDiacritics(String str) {
        str = Normalizer.normalize(str, Normalizer.Form.NFD);
        str = DIACRITICS_AND_FRIENDS.matcher(str).replaceAll("");
        return str;
    }
If you only want to remove all non-printable characters from a string (keeping printable ASCII only), use
    rawString.replaceAll("[^\\x20-\\x7e]", "")
Ref: "replace special characters in string in java" and "How to remove high-ASCII characters from string like ®, ©, ™ in Java"

Try this:
str = (str == null) ? null :
str.replaceAll("[^\\p{Print}\\p{Space}]", "").trim();
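The ASCII-range pattern from the answers above also removes legitimate non-ASCII letters. As a sketch of the trade-off, the example below puts it next to a gentler variant that strips only Unicode control (Cc) and format (Cf) characters; that second variant is an alternative suggested here, not something taken from the answers, and the names PrintableFilter, keepPrintableAscii and stripControlAndFormat are hypothetical.

```java
final class PrintableFilter {
    // Keeps printable ASCII only; accented letters and other scripts are removed too.
    static String keepPrintableAscii(String s) {
        return s == null ? null : s.replaceAll("[^\\x20-\\x7e]", "");
    }

    // Removes control characters (Cc, which includes tab and newline) and
    // format characters (Cf, which includes the byte order mark U+FEFF),
    // then trims surrounding whitespace. Other characters are kept.
    static String stripControlAndFormat(String s) {
        return s == null ? null : s.replaceAll("[\\p{Cc}\\p{Cf}]", "").trim();
    }

    public static void main(String[] args) {
        String raw = "\uFEFF123\u0000";
        System.out.println(keepPrintableAscii(raw));
        System.out.println(stripControlAndFormat(raw));
        System.out.println(keepPrintableAscii("caf\u00e9"));    // the accented letter is lost
        System.out.println(stripControlAndFormat("caf\u00e9")); // the accented letter is kept
    }
}
```

Which variant fits depends on whether the data is expected to contain anything beyond ASCII.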

url encode matched groups

I've got a regex that matches a given pattern and replaces it with an anchor tag that includes a captured group. That part is working nicely.
String substituted = content.asString().replaceAll("\\[{2}((?:.)*?)\\]{2}",
"$1");
What I can't figure out is how to url encode the captured group before using it in the href attribute.
Example inputs
[[a]]
[[a b]]
[[a&b]]
Desired outputs (links with this text, whose href holds the URL-encoded form of the text)
a
a b
a&b
Is there any way to do this? I haven't found anything that looks useful yet, though once I ask I usually find an answer.

Replace all special chars with what you want first,
then match that inside the double [ and replace it in the <a href=..> tag.
That, or extract the url part inside the [ and pass it through a URL encoder before placing it in the <a href=..> tag.
Java offers java.net.URLEncoder out of the box. So I think getting the URL from the pattern, passing it through the encoder, and then placing it in the <a href=..> tag is your best choice.

Sure 'nough, found my answer.
Started with the code from Matcher.appendReplacement
Pure java:
    Pattern p = Pattern.compile("\\[{2}((?:.)*?)\\]{2}");
    Matcher m = p.matcher(content.asString());
    StringBuffer sb = new StringBuffer();
    while (m.find()) {
        String one = m.group(1);
        try {
            // href carries the URL-encoded group; the link text is the raw group
            m.appendReplacement(sb, Matcher.quoteReplacement(
                    "<a href=\"" + URLEncoder.encode(one, "UTF-8") + "\">" + one + "</a>"));
        } catch (UnsupportedEncodingException e) {
            // TODO Auto-generated catch block
            e.printStackTrace();
        }
    }
    m.appendTail(sb);
GWT:
    RegExp p = RegExp.compile("\\[{2}((?:.)*?)\\]{2}", "g");
    MatchResult m;
    StringBuffer sb = new StringBuffer();
    int beginIndex = 0;
    while ((m = p.exec(content.asString())) != null) {
        String one = m.getGroup(1);
        int endIndex = m.getIndex();
        sb.append(content.asString().substring(beginIndex, endIndex));
        // URL here is com.google.gwt.http.client.URL
        sb.append("<a href=\"" + URL.encodeQueryString(one) + "\">" + one + "</a>");
        beginIndex = p.getLastIndex();
    }
    sb.append(content.asString().substring(beginIndex));
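For the pure Java variant, a self-contained sketch might look like the following. The class name WikiLinker, the method link and the exact shape of the anchor tag are assumptions made for illustration, since the original markup is not shown; HTML-escaping of the link text is deliberately left out.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class WikiLinker {
    // [[text]] with the text captured lazily, so several links per line work.
    private static final Pattern LINK = Pattern.compile("\\[\\[(.*?)\\]\\]");

    static String link(String content) {
        Matcher m = LINK.matcher(content);
        StringBuffer sb = new StringBuffer();
        while (m.find()) {
            String text = m.group(1);
            String anchor = "<a href=\"" + encode(text) + "\">" + text + "</a>";
            // quoteReplacement: '$' and '\' in the text must not act as group references.
            m.appendReplacement(sb, Matcher.quoteReplacement(anchor));
        }
        m.appendTail(sb);
        return sb.toString();
    }

    private static String encode(String s) {
        try {
            // Form-style encoding: a space becomes '+', '&' becomes %26.
            return URLEncoder.encode(s, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException("UTF-8 is always supported", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(link("see [[a b]] and [[a&b]]"));
    }
}
```

The encoding has to happen inside the loop because a replacement string passed to replaceAll can only refer to the group as $1; it cannot run code on it.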
