Why won't my buffered `String` match with the `RegEx` pattern?

Why won't my buffered `String` match with the `RegEx` pattern? - java

I've created a bluetooth data listener that stores incoming data into a String then checks whether it matches a regular expression pattern then resets the String. This is done because the data does not arrive whole in one instance so that I can manipulate with it when I get the full text. For example when I send "Hello Android!" to my device via bluetooth with my method and print the data, it would be printed like this:
#1 New data! "H"
#2 New data! "ello Android!
As you can see, the whole string cannot be sent at one instance which means that two strings get sent instead and I'm sure that most people know about that. That is why I am using RegEx to help me with that.
Instead, I am sending a randomly generated number between two different characters then try to parse them. For example "<128>". Now, I want to get the whole number so that I can use it, like parse it to an int or something like that. But only when my String buffer gets the whole data that is being sent which is determined by a RegEx pattern that goes like ([<])(-?\d+)([>]). character<followed by any positive/negative number followed by character '>'.
The problem is that it does not match the pattern at all for unknown reasons.
String szBuffer = "";
if(mmInputStream.available() > 0) {
StringBuilder builder = new StringBuilder();
byte[] bData = new byte[1024];
while(mmInputStream.available() > 0) {
int read = mmInputStream.read(bData);
builder.append(new String(bData, 0, read, StandardCharsets.UTF_8));
}
szBuffer += builder.toString();
Log.d("SZ_BUFFER", szBuffer); // For this example, "<128>" gets sent into pieces.
// <n>
if(Pattern.matches("([<])(-?\\d+)([>])", szBuffer)) {
Log.d("SZ_BUFFER_ISMATCH", "MATCH!");
szBuffer = ""; // Reset the buffer for new data
}
else
Log.d("SZ_BUFFER_ISMATCH", "NO MATCH...");
}
Here's a live output:
D/SZ_BUFFER: <
D/SZ_BUFFER_ISMATCH: NO MATCH...
D/SZ_BUFFER: <128>
D/SZ_BUFFER_ISMATCH: NO MATCH...
As you can see, it gets sent into two pieces, but when it gets the whole text together it should be a match, but isn't. Why? If I replace szBuffer with a constant String like this:
if(Pattern.matches("([<])(-?\\d+)([>])", "<128>"))
It's a match, meaning that the pattern should be correct, but when it checks for szBuffer it is never a match.

According to the docs, Pattern.matches(regex,string) is equivalent to Pattern.compile(regex).matcher(string).matches(), which is equivalent to string.matches(regex). This means that you are checking if the entire string matches the regex. If you want to check if the string contains your end-of-string marker, you can use Matcher.find instead:
Pattern pattern = Pattern.compile("([<])(-?\\d+)([>])");
String szBuffer = "";
if(mmInputStream.available() > 0) {
...
szBuffer += convertedBytes;
Log.d("SZ_BUFFER", szBuffer);
// <n>
if(pattern.matcher(szBuffer).find()) {
Log.d("SZ_BUFFER_ISMATCH", "MATCH!");
szBuffer = "";
}
else
Log.d("SZ_BUFFER_ISMATCH", "NO MATCH...");
}

The problem was that the data contains invisible characters which was confirmed by getting the data's length being wrong. The solution is to check every character then store them only if they match a certain criteria like this regular expression method:
for(char c : builder.toString().toCharArray()) {
String s = String.valueOf(c);
if(Pattern.matches("<|>|-|-?\\d+", s)){
szBuffer += s;
}
}
This is not the best nor cleanest solution, but it does solve my problem. I am still learning Regular Expressions and the many many combinations in algorithms. All new recommendations are welcome and will be added below this one.

Related

Extracting digits in the middle of a string using delimiters

String ccToken = "";
String result = "ssl_transaction_type=CCGETTOKENssl_result=0ssl_token=4366738602809990ssl_card_number=41**********9990ssl_token_response=SUCCESS";
String[] elavonResponse = result.split("=|ssl");
for (String t : elavonResponse) {
System.out.println(t);
}
ccToken = (elavonResponse[6]);
System.out.println(ccToken);
I want to be able to grab a specific part of a string and store it in a variable. The way I'm currently doing it, is by splitting the string and then storing the value of the cell into my variable. Is there a way to specify that I want to store the digits after "ssl_token="?
I want my code to be able to obtain the value of ssl_token without having to worry about changes in the string that are not related to the token since I wont have control over the string. I have searched online but I can't find answers for my specific problem or I maybe using the wrong words for searching.

You can use replaceAll with this regex .*ssl_token=(\\d+).* :
String number = result.replaceAll(".*ssl_token=(\\d+).*", "$1");
Outputs
4366738602809990

You can do it with regex. It would probably be better to change the specifications of the input string so that each key/value pair is separated by an ampersand (&) so you could split it (similar to HTTP POST parameters).
Pattern p = Pattern.compile(".*ssl_token=([0-9]+).*");
Matcher m = p.matcher(result);
if(m.matches()) {
long token = Long.parseLong(m.group(1));
System.out.println(String.format("token: [%d]", token));
} else {
System.out.println("token not found");
}

Search index of ssl_token. Create substring from that index. Convert substring to number. To number can extract number when it is at the beggining of the string.

How to convert special characters in a string to unicode?

I couldn't find an answer to this problem, having tried several answer here combined to find something that works, to no avail.
An application I'm working on uses a users name to create PDF's with that name in it. However, when someones name contains a special character like "Yağmur" the pdf creator freaks out and omits this special character.
However, when it gets the unicode equivalent ("Yağmur"), it prints "Yağmur" in the pdf as it should.
How do I check a name/string for any special character (regex = "[^a-z0-9 ]") and when found, replace that character with its unicode equivalent and returning the new unicoded string?

I will try to give the solution in generic way as the frame work you are using is not mentioned as the part of your problem statement.
I too faced the same kind of issue long time back. This should be handled by the pdf engine if you set the text/char encoding as UTF-8. Please find how you can set encoding in your framework for pdf generation and try it out. Hope it helps !!

One hackish way to do this would be as follows:
/*
* TODO: poorly named
*/
public static String convertUnicodePoints(String input) {
// getting char array from input
char[] chars = input.toCharArray();
// initializing output
StringBuilder sb = new StringBuilder();
// iterating input chars
for (int i = 0; i < input.length(); i++) {
// checking character code point to infer whether "conversion" is required
// here, picking an arbitrary code point 125 as boundary
if (Character.codePointAt(input, i) < 125) {
sb.append(chars[i]);
}
// need to "convert", code point > boundary
else {
// for hex representation: prepends as many 0s as required
// to get a hex string of the char code point, 4 characters long
// sb.append(String.format("&#xu%04X;", (int)chars[i]));
// for decimal representation, which is what you want here
sb.append(String.format("&#%d;", (int)chars[i]));
}
}
return sb.toString();
}
If you execute: System.out.println(convertUnicodePoints("Yağmur"));...
... you'll get: Yağmur.
Of course, you can play with the "conversion" logic and decide which ranges get converted.

Replace text with data & matched group contents

I don't believe I saw this when searching (believe me, I spent a good amount of time searching for this) for a solution to this so here goes.
Goal:
Match regex in a string and replace it with something that contains the matched value.
Regex used currently:
\b(Connor|charries96|Foo|Bar)\b
For the record I suck at regex incase this isn't the best way to do it.
My current code (and several other methods I tried) can only replace the text with the first match it encounters if there are multiple matches.
private Pattern regexFromList(List<String> input) {
if(input.size() < 1) {
return "";
}
StringBuilder builder = new StringBuilder();
builder.append("\\b");
builder.append("(");
for(String s : input) {
builder.append(s);
if(!s.equals(input.get(input.size() - 1)))
{
builder.append("|");
}
}
builder.append(")");
builder.append("\\b");
return Pattern.compile(builder.toString(), Pattern.CASE_INSENSITIVE);
}
Example input:
charries96's name is Connor.
Example result using TEST as the data to prepend the match with
TESTcharries96's name is TESTcharries96.
Desired result using example input:
TESTcharries96's name is TESTConnor.
Here is my current code for replacing the text:
if(highlight) {
StringBuilder builder = new StringBuilder();
Matcher match = pattern.matcher(event.getMessage());
String string = event.getMessage();
if (match.find()) {
string = match.replaceAll("TEST" + match.group());
// I do realise I'm using #replaceAll but that's mainly given it gives me the same result as other methods so why not just cut to the chase.
}
builder.append(string);
return builder.toString();
}
EDIT:
Working example of desired result on RegExr

There are a few problems here:
You are taking the user input as is and build the regex:
builder.append(s);
If there are special character in the user input, it might be recognized as meta character and cause unexpected behavior.
Always use Pattern.quote if you want to match a string as it is passed in.
builder.append(Pattern.quote(s));
Matcher.replaceAll is a high level function which resets the Matcher (start the match all over again), and search for all the matches and perform the replacement. In your case, it can be as simple as:
String result = match.replaceAll("TEST$1");
The StringBuilder should be thrown away along with the if statement.
Matcher.find, Matcher.group are lower level functions for fine grain control on what you want to do with a match.
When you perform replacement, you need to build the result with Matcher.appendReplacement and Matcher.appendTail.
A while loop (instead of if statement) should be used with Matcher.find to search for and perform replacement for all matched.

Making only the first letter of a word uppercase

I have a method that converts all the first letters of the words in a sentence into uppercase.
public static String toTitleCase(String s)
{
String result = "";
String[] words = s.split(" ");
for (int i = 0; i < words.length; i++)
{
result += words[i].replace(words[i].charAt(0)+"", Character.toUpperCase(words[i].charAt(0))+"") + " ";
}
return result;
}
The problem is that the method converts each other letter in a word that is the same letter as the first to uppercase. For example, the string title comes out as TiTle
For the input this is a title this becomes the output This Is A TiTle
I've tried lots of things. A nested loop that checks every letter in each word, and if there is a recurrence, the second is ignored. I used counters, booleans, etc. Nothing works and I keep getting the same result.
What can I do? I only want the first letter in upper case.

Instead of using the replace() method, try replaceFirst().
result += words[i].replaceFirst(words[i].charAt(0)+"", Character.toUpperCase(words[i].charAt(0))+"") + " ";
Will output:
This Is A Title

The problem is that you are using replace method which replaces all occurrences of described character. To solve this problem you can either
use replaceFirst instead
take first letter,
create its uppercase version
concatenate it with rest of string which can be created with a little help of substring method.
since you are using replace(String, String) which uses regex you can add ^ before character you want to replace like replace("^a","A"). ^ means start of input so it will only replace a that is placed after start of input.
I would probably use second approach.
Also currently in each loop your code creates new StringBuilder with data stored in result, append new word to it, and reassigns result of output from toString().
This is infective approach. Instead you should create StringBuilder before loop that will represent your result and append new words created inside loop to it and after loop ends you can get its String version with toString() method.

Doing some Regex-Magic can simplify your task:
public static void main(String[] args) {
final String test = "this is a Test";
final StringBuffer buffer = new StringBuffer(test);
final Pattern patter = Pattern.compile("\\b(\\p{javaLowerCase})");
final Matcher matcher = patter.matcher(buffer);
while (matcher.find()) {
buffer.replace(matcher.start(), matcher.end(), matcher.group().toUpperCase());
}
System.out.println(buffer);
}
The expression \\b(\\p{javaLowerCase}) matches "The beginning of a word followed by a lower-case letter", while matcher.group() is equal to whats inside the () in the part that matches. Example: Applying on "test" matches on "t", so start is 0, end is 1 and group is "t". This can easily run through even a huge amount of text and replace all those letters that need replacement.
In addition: it is always a good idea to use a StringBuffer (or similar) for String manipulation, because each String in Java is unique. That is if you do something like result += stringPart you actually create a new String (equal to result + stringPart) each time this is called. So if you do this with like 10 parts, you will in the end have at least 10 different Strings in memory, while you only need one, which is the final one.
StringBuffer instead uses something like char[] to ensure that if you change only a single character no extra memory needs to be allocated.
Note that a patter only need to be compiled once, so you can keep that as a class variable somewhere.

Java regex and pattern matching: finding "blanks" in pattern which do not include them?

So, I need to write a compiler scanner for a homework, and thought it'd be "elegant" to use regex. Fact is, I seldomly used them before, and it was a long time ago. So I forgot most of the stuff about them and needed to have a look around. I used them successfully for the identifiers (or at least I think so, I still need to do some further tests but for now they all look ok), but I have a problem with the numbers-recognition.
The function nextCh() reads the next character on the input (lookahead char). What I'd like to do here is to check if this char matches the regex [0-9]*. I append every matching char in the str field of my current token, then I read the int value of this field. It recognizes a single number input such as "123", but the problem I have is that for the input "123 456", the final str will be "123 456" while I should get 2 separate tokens with fields "123" and "456". Why is the " " being matched?
private void readNumber(Token t) {
t.str = "" + ch; // force conversion char --> String
final Pattern pattern = Pattern.compile("[0-9]*");
nextCh(); // get next char and check if it is a digit
Matcher match = pattern.matcher("" + ch);
while (match.find() && ch != EOF) {
t.str += ch;
nextCh();
match = pattern.matcher("" + ch);
}
t.kind = Kind.number;
try {
int value = Integer.parseInt(t.str);
t.val = value;
} catch(NumberFormatException e) {
error(t, Message.BIG_NUM, t.str);
}
Thank you!
PS: I did solve my problem using the code below. Nevertheless, I'd like to understand where the flaw is in my regex expression.
t.str = "" + ch;
nextCh(); // get next char and check if it is a number
while (ch>='0' && ch<='9') {
t.str += ch;
nextCh();
}
t.kind = Kind.number;
try {
int value = Integer.parseInt(t.str);
t.val = value;
} catch(NumberFormatException e) {
error(t, Message.BIG_NUM, t.str);
}
EDIT: turns out my regex also doesn't work for the identifiers recognition (again, includes blanks), so I had to switch to a system similar to my "solution" (while with a lot of conditions). Guess I'll need to study the regex again :O

I'm not 100% sure whether this is relevant in your case, but this:
Pattern.compile("[0-9]*");
matches zero or more numbers anywhere in the string, because of the asterisk. I think the space gets matched because it is a match for 'zero numbers'. If you wanted to make sure the char was a number, you would have to match one or more, using the plus sign:
Pattern.compile("[0-9]+");
or, since you are only comparing a single char at a time, just match one number:
Pattern.compile("^[0-9]$");

You should be using the matches method rather than the find method. From the documentation:
The matches method attempts to match the entire input sequence against the pattern
The find method scans the input sequence looking for the next subsequence that matches the pattern.
So in other words, by using find, if the string contains a digit anywhere at all, you'll get a match, but if you use matches the entire string must match the pattern.
For example, try this:
Pattern p = Pattern.compile("[0-9]*");
Matcher m123abc = p.matcher("123 abc");
System.out.println(m123abc.matches()); // prints false
System.out.println(m123abc.find()); // prints true

Use a simpler regex like
/\d+/
Where
\d means a digit
+ means one or more
In code:
final Pattern pattern = Pattern.compile("\\d+");

Develop Reference

Java is a programming language and computing platform first released by Sun Microsystems in 1995.

Why won't my buffered `String` match with the `RegEx` pattern? - java

Related

Extracting digits in the middle of a string using delimiters

How to convert special characters in a string to unicode?

Replace text with data & matched group contents

Making only the first letter of a word uppercase

Java regex and pattern matching: finding "blanks" in pattern which do not include them?

Categories

Resources