Java and C# Regex not producing same result - java

I have part of a regular expression that I am trying to use to split sentences into words. As part of this, I would like to split patterns such as "word." into "word", ".". I do this by using a positive lookahead for the punctuation mark, and a negative lookbehind for a space character.
In Java, the following code accomplishes this:
Pattern test = Pattern.compile("(?=[\\p{P}&&[^']])(?<!\\s)");
test.split("word."); // returns ["word", "."]
However, when I tried it in C#, with the same pattern, it doesn't work.
Regex.Split("word.", @"(?=[\p{P}&&[^']])(?<!\s)");
// returns ["word."]
Why doesn't C# behave the same way here?

The && (character class intersection) syntax is specific to Java's regex flavour and will not work in .NET.
However, I think you should be able to rewrite it in a simpler way in .NET as follows:
@"(?=[^'\P{P}])(?<!\s)"
It uses the \P character class, which is the negation of \p. The ^ then negates the whole class, so "not an apostrophe and not a non-punctuation character" ends up the right way round: any punctuation mark except the apostrophe.
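Java accepts both spellings of the class, so the equivalence can be checked there side by side. This is only a minimal sketch: the class name PunctuationSplit and the helper split are made up for illustration, and it says nothing about .NET beyond the class logic being the same.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

class PunctuationSplit {
    // Java-only form: intersection of "punctuation" and "not an apostrophe".
    static final Pattern INTERSECTION = Pattern.compile("(?=[\\p{P}&&[^']])(?<!\\s)");
    // Portable form: negated class of "apostrophe or non-punctuation".
    static final Pattern NEGATED = Pattern.compile("(?=[^'\\P{P}])(?<!\\s)");

    // Hypothetical helper: split the input and return the pieces as a list.
    static List<String> split(Pattern pattern, String input) {
        return Arrays.asList(pattern.split(input));
    }

    public static void main(String[] args) {
        for (String s : new String[] {"word.", "don't stop."}) {
            // Both patterns are expected to print the same pieces.
            System.out.println(split(INTERSECTION, s) + " " + split(NEGATED, s));
        }
    }
}
```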

Related

Replace string part with regex pattern

I would like to replace the following string.
img/s/430x250/
The problem is there are variations, like:
img/s/265x200/
or:
img/s/110x73/
So I would like to replace this part in whole, but the numbers are changeable, so how could I make a pattern that replaces it from a string?
Is your goal to match all three of those cases?
If so, this should work: img\/s\/\d+x\d+\/
It searches for img/s/[1 or more digits]x[1 or more digits]/
This regular expression will match your examples
img\/s\/\d+?x\d+?\/
the \/ matches a literal / (the backslash is only needed where / is the regex delimiter, as in JavaScript)
the \d matches a digit 0-9 and the + means 1 or more. The ? makes it lazy instead of greedy, although here that makes no difference to what is matched.
the img and s just match those characters literally
Check out https://regex101.com/ to try out regular expressions. It's much easier than testing them by debugging code. Once you find an expression that works, you can move on to make sure your specific code will perform the same.
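The question does not name a language, so here is one way the replacement might look in Java, where / needs no escaping. The class name, the sample paths and the replacement text "img/full/" are all invented for the example.

```java
class ImagePathReplace {
    // Fixed prefix, then digits, an 'x', digits, and the closing slash.
    static final String SIZE_SEGMENT = "img/s/\\d+x\\d+/";

    // Hypothetical helper: swap the whole size segment for the given text.
    static String replaceSize(String path, String replacement) {
        return path.replaceAll(SIZE_SEGMENT, replacement);
    }

    public static void main(String[] args) {
        // "img/full/" is an assumed replacement, not taken from the question.
        System.out.println(replaceSize("photos/img/s/430x250/cat.jpg", "img/full/"));
        System.out.println(replaceSize("photos/img/s/110x73/cat.jpg", "img/full/"));
        // Both lines print: photos/img/full/cat.jpg
    }
}
```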

Java replaceAll to javascript regex

I want to move a user input check from Java to JavaScript. The code is supposed to remove wildcard characters from the user input string, at any position. I'm attempting to convert the following Java notation to JavaScript, but I keep getting the error
"Invalid regular expression: /(?<!\")~[\\d\\.]*|\\?|\\*/: Invalid group".
I have almost no experience with regular expressions. Any help will be much appreciated:
JAVA:
str = str.replaceAll("(?<!\")~[\\d\\.]*|\\?|\\*","");
My failing javascript version:
input = input.replace( /(?<!\")~[\\d\\.]*|\\?|\\*/g, '');
The problem, as anubhava points out, is that JavaScript doesn't support lookbehind assertions (they were only added in ES2018, so older engines reject them with "Invalid group"). Sad but true. The lookbehind assertion in your original regex is (?<!\"). Specifically, it only allows a tilde to match when it is not immediately preceded by a double quotation mark.
However, all is not lost. There are some tricks you can use to achieve the same result as a lookbehind. In this case, the lookbehind is there only to prevent the character prior to the tilde from being replaced as well. We can accomplish this in JavaScript by matching that character anyway (or the start of the input, so a leading tilde is still caught), but then including it in the replacement:
input = input.replace( /(^|[^"])~[\d.]*|\?|\*/g, '$1' );
Note that for the alternations \? and \*, there will be no groups, so $1 will evaluate to the empty string, so it doesn't hurt to include it in the replacement.
NOTE: this is not 100% equivalent to the original regular expression. In particular, lookaround assertions (like the lookbehind above) also prevent the input stream from being consumed, which can sometimes be very helpful when matching things that are right next to each other. However, in this case, I can't think of a way that that would be a problem. To make a completely equivalent regex would be more difficult, but I believe this meets the need of the original regex.
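Because Java supports both forms, the trick and its caveat can be compared directly there. The sketch below is an illustration only: the class and method names are invented, and the capture variant uses (^|[^"]) so that a tilde at the very start of the input is also handled.

```java
class TildeStrip {
    // Original Java pattern: the tilde must not follow a double quote.
    static final String LOOKBEHIND = "(?<!\")~[\\d.]*|\\?|\\*";
    // Lookbehind-free variant: capture the preceding character (or the
    // start of the input) and put it back with $1.
    static final String CAPTURED = "(^|[^\"])~[\\d.]*|\\?|\\*";

    static String withLookbehind(String s) {
        return s.replaceAll(LOOKBEHIND, "");
    }

    static String withCapture(String s) {
        return s.replaceAll(CAPTURED, "$1");
    }

    public static void main(String[] args) {
        // Same result for ordinary input: foo bar baz
        System.out.println(withLookbehind("foo~0.5 bar* baz?"));
        System.out.println(withCapture("foo~0.5 bar* baz?"));
        // A tilde right after a quote is left alone by both.
        System.out.println(withLookbehind("\"phrase\"~2"));
        System.out.println(withCapture("\"phrase\"~2"));
        // Adjacent matches differ, because the capture consumes a character:
        System.out.println(withLookbehind("a~~1")); // a
        System.out.println(withCapture("a~~1"));    // a~1
    }
}
```

The last pair shows the consumption issue from the note: the first match already used up the character that the second tilde would need in front of it.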

Necessary to escape a java regular expression in matches()?

I'm currently doing a test on an HTTP Origin to determine if it came from SSL:
(HttpHeaders.Names.ORIGIN).matches("/^https:\\/\\//")
But I'm finding it's not working. Do I need to escape matches() strings like a regular expression or can I leave it like https://? Is there any way to do a simple string match?
Seems like it would be a simple question, but surprisingly I'm not getting anywhere even after using a RegEx tester http://www.regexplanet.com/advanced/java/index.html. Thanks.
Java's regex doesn't need delimiters. Simply do:
.matches("https://.*")
Note that matches validates the entire input string, hence the .* at the end. And if the input contains line break chars (which . will not match), enable DOT-ALL:
.matches("(?s)https://.*")
Of course, you could also simply do:
.startsWith("https://")
which takes a plain string (no regex pattern).
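A small sketch of the three variants, to make the difference visible. The class name, helper names and the origin value "https://example.com" are placeholders chosen for the example, not taken from the question.

```java
class OriginCheck {
    // Regex check: matches() must cover the whole input, hence the trailing .*
    static boolean viaMatches(String origin) {
        return origin.matches("https://.*");
    }

    // Plain string check, no regex involved.
    static boolean viaStartsWith(String origin) {
        return origin.startsWith("https://");
    }

    public static void main(String[] args) {
        String origin = "https://example.com"; // hypothetical header value

        // The slashes are not delimiters in Java, so they must match literally.
        System.out.println(origin.matches("/^https:\\/\\//")); // false
        // Without .* the pattern does not cover the whole input.
        System.out.println(origin.matches("https://"));         // false
        System.out.println(viaMatches(origin));                  // true
        System.out.println(viaStartsWith(origin));               // true
    }
}
```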
How about this regex (the / needs no escaping in Java):
"^(https:)//.*"
It works in your tester

How do I write a regular expression to find the following pattern?

I am trying to write a regular expression to do a find and replace operation. Assume Java regex syntax. Below are examples of what I am trying to find:
12341+1
12241+1R1
100001+1R2
So, I am searching for a string beginning with one or more digits, followed by a "1+1" substring, followed by 0 or more characters. I have the following regex:
^(\d+)(1\\+1).*
This regex will successfully find the examples above, however, my goal is to replace the strings with everything before "1+1". So, 12341+1 would become 1234, and 12241+1R1 would become 1224. If I use the first grouped expression $1 to replace the pattern, I get the wrong result as follows:
12341+1 becomes 12341
12241+1R1 becomes 12241
100001+1R2 becomes 100001
Any ideas?
Your existing regex works fine; you are just missing a \ before \d in the Java string:
String str = "100001+1R2";
str = str.replaceAll("^(\\d+)(1\\+1).*","$1");
IMHO, the regex is correct.
Perhaps you wrote it wrong in the code. If you want to code the regex ^(\d+)(1\+1).* in a string, you have to write something like String regex = "^(\\d+)(1\\+1).*".
Your output looks like the result of a ^(\d+)(1+1).* replacement, which is what you get when a backslash is missing in the string (e.g. "^(\\d+)(1\+1).*").
Your regex looks fine to me - I don't have access to java but in JavaScript the code..
"12341+1".replace(/(\d+)(1\+1)/g, "$1");
Returns 1234 as you'd expect. This works on a string with many 'codes' in too e.g.
"12341+1 54321+1".replace(/(\d+)(1\+1)/g, "$1");
gives 1234 5432.
Personally, I wouldn't use a Regex at all (it'd be like using a hammer on a thumbtack), I'd just create a substring from (Pseudocode)
stringName.substring(0, stringName.indexOf("1+1"))
But it looks like other posters have already mentioned the non-greedy operator.
In most regex syntaxes you can add a '?' after a '+' or '*' to indicate that you want it to match as little as possible before moving on in the pattern. Thus ^(\d+?)(1\+1) captures digits only up to the point where "1+1" begins, so the leading 1 of "1+1" stays out of the first group. Your greedy original first consumes that 1 as well and only gives it back by backtracking.
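To tie the answers together, here is a minimal Java sketch of both approaches: the properly escaped regex and the plain indexOf/substring route. The class and method names are invented for illustration.

```java
class PrefixBeforeOnePlusOne {
    // Regex route: every backslash is doubled inside the Java string literal.
    static String viaRegex(String s) {
        return s.replaceAll("^(\\d+)(1\\+1).*", "$1");
    }

    // Plain string route: cut at the first "1+1", or keep the input if absent.
    static String viaIndexOf(String s) {
        int cut = s.indexOf("1+1");
        return cut < 0 ? s : s.substring(0, cut);
    }

    public static void main(String[] args) {
        for (String s : new String[] {"12341+1", "12241+1R1", "100001+1R2"}) {
            System.out.println(s + " -> " + viaRegex(s) + " / " + viaIndexOf(s));
        }
        // 12341+1 -> 1234 / 1234
        // 12241+1R1 -> 1224 / 1224
        // 100001+1R2 -> 10000 / 10000
    }
}
```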

Why doesn't this Java regular expression work?

I need to create a regular expression that allows a string to contain any number of:
alphanumeric characters
spaces
(
)
&
.
No other characters are permitted. I used RegexBuddy to construct the following regex, which works correctly when I test it within RegexBuddy:
\w* *\(*\)*&*\.*
Then I used RegexBuddy's "Use" feature to convert this into Java code, but it doesn't appear to work correctly using a simple test program:
public class RegexTest
{
public static void main(String[] args)
{
String test = "(AT) & (T)."; // Should be valid
System.out.println("Test string matches: "
+ test.matches("\\w* *\\(*\\)*&*\\.*")); // Outputs false
}
}
I must admit that I have a bit of a blind spot when it comes to regular expressions. Can anyone explain why it doesn't work please?
That regular expression tests for any number of alphanumeric characters, followed by any number of spaces, followed by any number of open parens, followed by any number of close parens, followed by any number of ampersands, followed by any number of periods, in exactly that order.
What you want is...
test.matches("[\\w \\(\\)&\\.]*")
As mentioned by mmyers, this allows the empty string. If you do not want to allow the empty string...
test.matches("[\\w \\(\\)&\\.]+")
Though that will also allow a string that is only spaces, or only periods, etc.. If you want to ensure at least one alpha-numeric character...
test.matches("[\\w \\(\\)&\\.]*\\w+[\\w \\(\\)&\\.]*")
So you understand what the regular expression is saying... anything within the square brackets ("[]") indicates a set of characters. So, where "a*" means 0 or more a's, [abc]* means 0 or more characters, all of which being a's, b's, or c's.
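The difference between the sequence and the character class can be seen in a short test program. This is a sketch with made-up names; the class is written without the optional backslashes, since ( ) and . are literal inside square brackets, and note that \w also accepts the underscore.

```java
class AllowedChars {
    // Original: each kind of character, in a fixed order.
    static final String SEQUENCE = "\\w* *\\(*\\)*&*\\.*";
    // Character class: any of the allowed characters, in any order.
    static final String ANY_ORDER = "[\\w ()&.]*";
    // Same, but with at least one \w character somewhere.
    static final String NEEDS_ALNUM = "[\\w ()&.]*\\w+[\\w ()&.]*";

    public static void main(String[] args) {
        String test = "(AT) & (T).";
        System.out.println(test.matches(SEQUENCE));    // false
        System.out.println(test.matches(ANY_ORDER));   // true
        System.out.println("".matches(ANY_ORDER));     // true: empty is allowed
        System.out.println(test.matches(NEEDS_ALNUM)); // true
        System.out.println("().".matches(NEEDS_ALNUM)); // false: no \w character
    }
}
```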
Maybe I'm misunderstanding your description, but aren't you essentially defining a class of characters without an order rather than a specific sequence? Shouldn't your regexp have a structure of [xxxx]+, where xxxx are the actual characters you want ?
The difference between your Java code snippet and the Test tab in RegexBuddy is that the matches() method in Java requires the regular expression to match the whole string, while the Test tab in RegexBuddy allows partial matches. If you use your original regex in RegexBuddy, you'll see multiple blocks of yellow and blue highlighting. That indicates RegexBuddy found multiple partial matches in your string. To get a regex that works as intended with matches(), you need to edit it until the whole test subject is highlighted in yellow, or if you turn off highlighting, until the Find First button selects the whole text.
Alternatively, you can use the anchors \A and \Z at the start and the end of your regex to force it to match the whole string. When you do that, your regex always behaves in the same way, whether you test it in RegexBuddy, or whether you use matches() or another method in Java. Only matches() requires a full string match. All other Matcher methods in Java allow partial matches.
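The full-versus-partial distinction can be reproduced with java.util.regex directly. In this sketch (names invented for the example) the anchored variant uses \z rather than \Z, because \Z would still match just before a final line terminator.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class FullVersusPartial {
    static final String REGEX = "\\w* *\\(*\\)*&*\\.*";

    // matches(): the whole input must be matched.
    static boolean full(String s) {
        return Pattern.compile(REGEX).matcher(s).matches();
    }

    // find(): a match anywhere in the input is enough.
    static boolean partial(String s) {
        return Pattern.compile(REGEX).matcher(s).find();
    }

    // find() with explicit anchors behaves like matches().
    static boolean anchoredFind(String s) {
        return Pattern.compile("\\A" + REGEX + "\\z").matcher(s).find();
    }

    public static void main(String[] args) {
        String test = "(AT) & (T).";
        System.out.println(full(test));         // false
        System.out.println(partial(test));      // true
        System.out.println(anchoredFind(test)); // false

        // The first partial match is just the opening parenthesis.
        Matcher m = Pattern.compile(REGEX).matcher(test);
        if (m.find()) {
            System.out.println("first partial match: \"" + m.group() + "\"");
        }
    }
}
```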
the regex
\w* *\(*\)*&*\.*
will give you the items you described, but only in the order you described, and each one can be as many as wanted. So "skjhsklasdkjgsh((((())))))&&&&&....." works, but not mixing the characters.
You want a regex like this:
[\w \(\)&\.]+
which will allow a mix of all characters.
edit: my regex knowledge is limited, so the above syntax may not be perfect.
