Regex Differences between Java and Ruby - java

I'm trying to write a regex for:
Strings of characters beginning and ending with a double quote character, that do not contain control characters, and for which the backslash is used to escape the next character.
The paren-star form of comments in Pascal: strings beginning with (* and ending with *) that do not contain *)
I'm trying to write one version in Ruby and another in Java, but I'm having trouble finding out how regex syntax differs between the two. Any help is appreciated!

Here is a good place to start:
specifics for Java (mostly usage of regex in general)
specifics for Ruby (mostly usage of regex in general)
flavor comparison (mostly regex syntax and features)
Mostly note that in Ruby you write regexes by delimiting them with /, and in Java you need to double every backslash in the string literal (\\ instead of \) so that the backslashes get through to the regex engine. Everything else you should find in the links above.
For the sake of completeness of this answer, I would also like to include Tom's link to this online regex tester, which supports a multitude of regex flavors.
You should go ahead and give both regexes a go. If you encounter any problems, you are more than welcome to ask a new (specific) question, showing your own attempts.
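To make the double-escaping point concrete, here is a minimal Java sketch of one possible pattern for each of the two tasks. The patterns are my own assumptions about what the task descriptions mean (for example, that "control characters" means the ASCII set matched by \p{Cntrl}), and the class and method names are made up for illustration. The comment above each pattern shows the raw text the regex engine sees, which is also roughly what you would write between / delimiters in Ruby.

```java
import java.util.regex.Pattern;

final class RegexFlavorDemo {
    // Raw pattern: "(?:[^"\\\p{Cntrl}]|\\[^\p{Cntrl}])*"
    // A quote, then any run of ordinary characters or backslash-escaped
    // characters (no control characters allowed), then a closing quote.
    static final Pattern QUOTED = Pattern.compile(
            "\"(?:[^\"\\\\\\p{Cntrl}]|\\\\[^\\p{Cntrl}])*\"");

    // Raw pattern: (?s)\(\*(?:(?!\*\)).)*\*\)
    // "(*", then any characters as long as "*)" does not start here, then "*)".
    // (?s) lets the dot match newlines, so comments may span lines.
    static final Pattern PASCAL_COMMENT = Pattern.compile(
            "(?s)\\(\\*(?:(?!\\*\\)).)*\\*\\)");

    static boolean isQuotedString(String s) {
        return QUOTED.matcher(s).matches();
    }

    static boolean isPascalComment(String s) {
        return PASCAL_COMMENT.matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println(isQuotedString("\"say \\\"hi\\\"\""));   // escaped quotes inside
        System.out.println(isQuotedString("\"tab\there\""));        // contains a control character
        System.out.println(isPascalComment("(* a comment *)"));
        System.out.println(isPascalComment("(* one *) two *)"));    // contains "*)" in the middle
    }
}
```

Every backslash in the raw pattern appears twice in the Java string literal; that is the only mechanical difference from the raw form.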

Related

Java regex not matching German "Umlaut" OR underscore

I'm trying to "play around" with some REST APIs and Java code.
As I mainly use German, I have already managed to get the Apache HTTP Client to work with UTF-8 encoding to make sure umlauts are handled the right way.
Still I can't get my regex to match my words correctly.
I try to find words/word combinations like "Büro_Licht" from string like ..."type":"Büro_Licht"....
Using the regex ".*?type\":\"(\\w+).*?" returns "B" for me, as it doesn't recognize the "ü" as a word character. That makes sense, as \w is documented as [a-zA-Z_0-9]. For strings with no special characters, such as "Office_Light", I get the full word.
So I tried another hint mentioned here in nearly the same question (which I could not comment on, because I lack reputation points).
Using regex expression ".*?type\":\"(\\p{L}).*?" returns "Büro" for me. But here again it cuts on the underscore for a reason I don't understand.
Is there a nice way to combine both expressions to get the "full" word including underscores and special characters?
If you have to keep using regex, which is not a great tool for parsing JSON, try the character class [\p{L}_]. In your case it would be:
String regex = ".*?type\":\"[\\p{L}_]+\"";
With on-line example: https://regex101.com/r/57oFD5/2
\p{L} matches any kind of letter from any language
_ matches the character _ literally (case sensitive)
This will get hectic if you need to support other languages, whitespace and various other Unicode code points. For example, do you need to support an arbitrary amount of whitespace around the :? Take a look at this answer on removing emojis; there are many corner cases.
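As a rough sketch of how the character class could be used with a capture group, the snippet below pulls the value out of a sample string. The class name, method name and sample JSON are made up for illustration. The second pattern is an alternative that is not part of the answer above: the (?U) flag (Pattern.UNICODE_CHARACTER_CLASS, available since Java 7) makes \w Unicode-aware, so it also accepts digits.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class TypeFieldExtractor {
    // Letters from any script plus underscore; digits are NOT included.
    static final Pattern LETTERS_OR_UNDERSCORE =
            Pattern.compile("\"type\":\"([\\p{L}_]+)\"");

    // (?U) switches on UNICODE_CHARACTER_CLASS, so \w covers Unicode
    // letters and digits as well as the underscore.
    static final Pattern UNICODE_WORD =
            Pattern.compile("(?U)\"type\":\"(\\w+)\"");

    // Returns the captured value, or null if the pattern is not found.
    static String extract(Pattern pattern, String json) {
        Matcher m = pattern.matcher(json);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        // Hypothetical input; U+00FC is the u-umlaut.
        String json = "{\"id\":7,\"type\":\"B\u00FCro_Licht\"}";
        System.out.println(extract(LETTERS_OR_UNDERSCORE, json));
        System.out.println(extract(UNICODE_WORD, json));
    }
}
```

Both patterns assume there is no whitespace around the colon, which is exactly the kind of corner case mentioned above.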

Detect repeated characters in Arabic tokens

I need your help with the following.
How can I detect repeated characters in tokens? For example,
if I have this sentence:
كييييف نستطيع التوااصل مع الطلاب؟
I want Java code that detects each word containing repeated characters, removes the repeats, and updates the word.
So, our sentence should be:
كيف نستطيع التواصل مع الطلاب؟
Notice that the word "كييييف" contains the repeated character "ي", so it should be updated to just "كيف", and "التوااصل" becomes "التواصل".
I appreciate your help.
Lolina, loops are not of much help. Have you heard of regular expressions? Java supports them, as do many other languages such as Perl and Python. I am familiar with Python, but regexes work in much the same way in most languages.
What you need now is to read about regular expressions in Java, and especially about the metacharacters * and +, which match 0 or more and 1 or more repetitions of the preceding element, respectively.
First try to compile simple regular expressions and then add extra stuff to them so that they perform what you actually want to do.
Finally, regular expressions are a bit confusing at the beginning, but they are worth the trouble. Remember that the Stanford Arabic POS tagger uses regular expressions to do things similar to what you are trying to do.
I am not familiar at all with Java, but in Python, I would do it as follows:
>>> import re
>>> p = re.compile('ي+')  # + matches one or more consecutive occurrences of ي
>>> p.sub('ي', 'كييييييييف نتواصل مع الطلاب')
'كيف نتواصل مع الطلاب'
Usually in Arabic we repeat typing the following three letters, ا, ي, and و. These are the vowels of Arabic. You can compile a regex for ي and strip them off. Then compile another one for ا and one more for و.
I hope this will help you!
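For what it's worth, a Java equivalent of the Python idea might look like the sketch below. It is a variation on that approach, not a translation: it uses a backreference so a single pattern covers all three letters. The class and method names are made up, and limiting the rule to these three letters is an assumption taken from the paragraph above, so words that legitimately double one of them would be altered too.

```java
import java.util.regex.Pattern;

final class RepeatedVowelCollapser {
    // U+0627 (alef), U+064A (yeh) or U+0648 (waw), captured as group 1,
    // followed by one or more repeats of that same letter.
    private static final Pattern REPEATED =
            Pattern.compile("([\u0627\u064A\u0648])\\1+");

    // Replaces every run of a repeated letter with a single occurrence.
    static String collapse(String text) {
        return REPEATED.matcher(text).replaceAll("$1");
    }

    public static void main(String[] args) {
        // kaf + four yehs + feh, written with escapes to avoid encoding issues
        String word = "\u0643\u064A\u064A\u064A\u064A\u0641";
        System.out.println(collapse(word));
    }
}
```

Because the pattern works on the whole string, there is no need to split the sentence into tokens first.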
One option (please bear in mind that my knowledge of Arabic is nonexistent) is to split the string on spaces and then check each resulting word for repeated characters, using charAt or indexOf with the Unicode values of the particular characters you wish to check for.

Java replaceAll to javascript regex

I want to move a user input check from Java to JavaScript. The code is supposed to remove wildcard characters from the user input string, at any position. I'm attempting to convert the following Java code to JavaScript, but I keep getting the error
"Invalid regular expression: /(?<!\")~[\\d\\.]*|\\?|\\*/: Invalid group".
I have almost no experience with regular expressions. Any help will be much appreciated:
JAVA:
str = str.replaceAll("(?<!\")~[\\d\\.]*|\\?|\\*","");
My failing javascript version:
input = input.replace( /(?<!\")~[\\d\\.]*|\\?|\\*/g, '');
The problem, as anubhava points out, is that JavaScript engines that predate ES2018 don't support lookbehind assertions. Sad but true. The lookbehind assertion in your original regex is (?<!\"). Specifically, it only allows the tilde to match when it is not immediately preceded by a double quotation mark.
However, all is not lost. There are some tricks you can use to achieve the same result as a lookbehind. In this case, the lookbehind is there only to prevent a tilde that follows a double quote from being replaced. We can accomplish this in JavaScript by matching the character before the tilde anyway (or the start of the string, so that a leading tilde is still handled), but then including it in the replacement:
input = input.replace( /(^|[^"])~[\d.]*|\?|\*/g, '$1' );
Note that for the alternations \? and \*, there will be no groups, so $1 will evaluate to the empty string, so it doesn't hurt to include it in the replacement.
NOTE: this is not 100% equivalent to the original regular expression. In particular, lookaround assertions (like the lookbehind above) also prevent the input stream from being consumed, which can sometimes be very helpful when matching things that are right next to each other. However, in this case, I can't think of a way that that would be a problem. To make a completely equivalent regex would be more difficult, but I believe this meets the need of the original regex.
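Since Java supports both forms, the difference described in the note can be tried out side by side there. The sketch below is only an illustration: the class and method names are made up, and the capture-group variant is written with a ^ alternative so that a tilde at the very start of the input is handled as well.

```java
final class LookbehindEmulation {
    // Original Java form: the lookbehind checks the previous character
    // without consuming it.
    static String withLookbehind(String s) {
        return s.replaceAll("(?<!\")~[\\d.]*|\\?|\\*", "");
    }

    // Emulation: consume the previous character (or the start of input)
    // in group 1 and put it back through $1 in the replacement.
    static String withCaptureGroup(String s) {
        return s.replaceAll("(^|[^\"])~[\\d.]*|\\?|\\*", "$1");
    }

    public static void main(String[] args) {
        String[] samples = { "foo~2 bar? baz*", "\"phrase\"~3", "~1~2" };
        for (String s : samples) {
            System.out.println(s + " -> [" + withLookbehind(s) + "] vs ["
                    + withCaptureGroup(s) + "]");
        }
    }
}
```

The two agree on the first two samples. On "~1~2" they differ: the emulation has to consume a character before each tilde, and after the first match there is nothing left in front of the second tilde for it to consume. Whether that matters depends on the input you expect.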

How to escape special characters in the regex ***(.*)

I am new to Java. Can somebody help me?
Is there any method available in Java which escapes the special characters in the below regex automatically?
Before escaping ***(.*) and after escaping \\*\\*\\*(.*)
I don't want to escape (.*) here.
On the face of it, Pattern.quote appears to do the job.
However, looking at the detail of your question, it appears that you want / expect to be able to escape some meta-characters and not others. Pattern.quote won't do that if you apply it to a single string. Rather, it will quote each and every character. (For the record, it doesn't use backslashes. It wraps the string in "\Q" and "\E", which neatly avoids the cost of parsing the string to find characters that need escaping.)
But the real problem is that you haven't said how the quoter should decide which meta-characters to escape and which ones to leave intact. For instance, how does it know to escape the first three '*' characters, but not the "."?
Without a clearer specification, your question is pretty much unanswerable. And even with a specification, there is little chance of finding an easy way to do this.
IMO, a better approach would be to do the escaping before you assemble the pattern from its component parts ... assuming that's what is going on here.
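A minimal sketch of that last suggestion, assuming you know which part of the pattern is literal text and which part is regex syntax (the class name, method name and parameter names are made up):

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class PartialQuoteDemo {
    // Quote only the literal part, then append the regex part untouched.
    static Pattern build(String literalPrefix, String rawRegexSuffix) {
        return Pattern.compile(Pattern.quote(literalPrefix) + rawRegexSuffix);
    }

    public static void main(String[] args) {
        System.out.println(Pattern.quote("***"));      // the quoted literal part

        Matcher m = build("***", "(.*)").matcher("***hello");
        if (m.matches()) {
            System.out.println(m.group(1));            // text after the three asterisks
        }
    }
}
```

The resulting pattern is not textually the same as \*\*\*(.*), but it should match the same inputs.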

Regex conversion from java to php

I have a regular expression in PHP and I need to convert it to Java.
Is it possible to do so? If yes, how can I do it?
Thanks in advance
$region_pattern = "/<a href=\"#\"><img src=\"images\/ponto_[^\.]+\.gif\"[^>]*>[&nbsp;]*<strong>(?P<neighborhood>[^\(<]+)\((?P<region>[^\)]+)\)<\/strong><\/a>/i" ;
A typical conversion from any regex to java is to:
Exclude pattern delimiters => remove starting and trailing /
Remove flags; these are applied to the Pattern object. Here that is the trailing i. You should either pass it when creating your Pattern object or prepend it to the regex, as in (?i)<regex>
Replace every \ with \\. The backslash already has a meaning in Java string literals (it is the escape character), so to get a backslash into a regex you have to write \\ instead of \. So \w becomes \\w, and \\ becomes \\\\
Above regex would become
Pattern.compile("<a href=\"#\"><img src=\"images\\/ponto_[^\\.]+\\.gif\"[^>]*>[&nbsp;]*<strong>(?P<neighborhood>[^\\(<]+)\\((?P<region>[^\\)]+)\\)<\\/strong><\\/a>", Pattern.CASE_INSENSITIVE);
This will fail, however. I think it is because of the (?P<name>...) syntax, which as far as I know does not exist in Java, so it is an invalid regex there.
There are some problems with the original regex that have to be cleared away first. First, there's [&nbsp;], which matches one of the characters &, n, b, s, p or ;. To match an actual non-breaking space character, you should use \xA0.
You also have a lot of unneeded backslashes in there. You can get rid of some by changing the regex delimiter to something other than /; others aren't needed because they're inside character classes, where most metacharacters lose their special meanings. That leaves you with this PHP regex:
"~<img src=\"images/ponto_[^.]+\.gif\"[^>]*>\xA0*<strong>(?P<neighborhood>[^(<]+)\((?P<region>[^)]+)\)</strong>~i"
There are three things that make this regex incompatible with Java. One is the delimiters (/ originally, ~ in the version above) along with the trailing i modifier. Java doesn't use regex delimiters at all, so just drop those. The modifier can be moved into the regex itself by using the inline form, (?i), at the beginning of the regex. (That will work in PHP too, by the way.)
Next is the backslashes. The ones that are used to escape quotation marks remain as they are, but all the others get doubled because Java is more strict about escape sequences in string literals.
Finally, there are the named groups. Up until Java 6, named groups weren't supported at all; Java 7 supports them, but they use the shorter (?<name>...) syntax favored by .NET, not the Pythonesque (?P<name>...) syntax. (By the way, the shorter (?<name>...) version should work in PHP, too, as should (?'name'...), also introduced by .NET.)
So the Java 7 version of your regex would be:
"(?i)<img src=\"images/ponto_[^.]+\\.gif\"[^>]*>\\xA0*<strong>(?<neighborhood>[^(<]+)\\((?<region>[^)]+)\\)</strong>"
For Java 6 or earlier you would use:
"(?i)<img src=\"images/ponto_[^.]+\\.gif\"[^>]*>\\xA0*<strong>([^(<]+)\\(([^)]+)\\)</strong>"
...and you'd have to use numbers instead of names to refer to the group captures.
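As a rough illustration of the Java 7 version in use, the sketch below reads both named groups from a sample fragment. The HTML fragment, class name and method name are made up, and the trim() calls are my addition, since the neighborhood group also captures the space before the parenthesis.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class NamedGroupDemo {
    static final Pattern REGION = Pattern.compile(
            "(?i)<img src=\"images/ponto_[^.]+\\.gif\"[^>]*>\\xA0*<strong>"
            + "(?<neighborhood>[^(<]+)\\((?<region>[^)]+)\\)</strong>");

    // Returns {neighborhood, region}, or null when the markup does not match.
    static String[] parse(String html) {
        Matcher m = REGION.matcher(html);
        if (!m.find()) {
            return null;
        }
        return new String[] { m.group("neighborhood").trim(), m.group("region").trim() };
    }

    public static void main(String[] args) {
        // Hypothetical input; U+00A0 is the non-breaking space that &nbsp; decodes to.
        String html = "<a href=\"#\"><img src=\"images/ponto_azul.gif\" border=\"0\">"
                + "\u00A0<strong>Centro (Zona Sul)</strong></a>";
        String[] result = parse(html);
        System.out.println(result[0] + " / " + result[1]);
    }
}
```

With the Java 6 version you would call m.group(1) and m.group(2) instead of using the names.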
Regex syntax is largely the same regardless of language. Most of the regex you've posted will work in both Java and PHP, but you do need to make some adjustments, as the two languages don't accept the pattern in exactly the same form.
Points to Consider
You should know that Java applies flags through the Pattern object (as an argument to Pattern.compile), so they don't have to be specified in the pattern string itself.
Delimiters should not be included either. Only the pattern itself.
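A small sketch of the flag point, using a made-up class name and a shortened pattern rather than the full regex from the question: the PHP-style trailing i can be expressed either as an argument to Pattern.compile or as an inline (?i) prefix.

```java
import java.util.regex.Pattern;

final class FlagDemo {
    // Flag passed separately from the pattern string.
    static final Pattern WITH_FLAG_ARGUMENT =
            Pattern.compile("</strong>", Pattern.CASE_INSENSITIVE);

    // Same flag written inline at the start of the pattern.
    static final Pattern WITH_INLINE_FLAG =
            Pattern.compile("(?i)</strong>");

    // No flag at all: matching is case-sensitive.
    static final Pattern WITHOUT_FLAG =
            Pattern.compile("</strong>");

    public static void main(String[] args) {
        String input = "text</STRONG>";
        System.out.println(WITH_FLAG_ARGUMENT.matcher(input).find());
        System.out.println(WITH_INLINE_FLAG.matcher(input).find());
        System.out.println(WITHOUT_FLAG.matcher(input).find());
    }
}
```

Note that none of the patterns carries / delimiters.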
