Java Quotation Matching - java

I'm not sure if this is a regex question, but I need to be able to grab what's inside single quotes and surround it with something. For example:
this is a 'test' of 'something' and 'I' 'really' want 'to' 'solve this'
would turn into
this is a ('test') of ('something') and ('I') ('really') want ('to') ('solve this')
Any help you could provide would be great!
Thanks!

String str = "this is a 'test' of 'something'";
String rep = str.replaceAll("'[^']*'", "($0)"); // stand back, I know regex
What I did here is use the replaceAll() method, which searches for all matches of the regex "'[^']*'" and replaces each one with the replacement string "($0)".
The pattern "'[^']*'" matches all substrings that start and end with a single quote ('), with any characters between them except another single quote ([^']), repeated any number of times (*). Replacing those with "($0)" means taking every match ($0) and wrapping it in parentheses.
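For completeness, here is a minimal, self-contained sketch of the replaceAll() approach described above. The class and method names (QuoteWrapper, wrapQuoted) are made up for illustration; only the pattern and the replacement string come from the answer.

```java
class QuoteWrapper {
    // Wraps every single-quoted substring (quotes included) in parentheses.
    static String wrapQuoted(String input) {
        // $0 in the replacement string refers to the entire match.
        return input.replaceAll("'[^']*'", "($0)");
    }

    public static void main(String[] args) {
        String str = "this is a 'test' of 'something' and 'I' 'really' want 'to' 'solve this'";
        System.out.println(wrapQuoted(str));
    }
}
```

Strings are immutable in Java, so the result has to be assigned or returned; the original string is left untouched.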

One easy way (though not always valid) is the following.
If the opening quote is always preceded by a space ( ') and the closing quote is always followed by a space (' ), you can do this:
myString = myString.replace(" '", " ('"); // replaces every <space_apostrophe> with <space_bracket_apostrophe>
Do the same for the closing bracket :)
One more thing: why do you have apostrophe-surrounded words in the first place? Do they have to be like that? If you created them that way yourself, why do that and then look for another approach?

If you can ignore single apostrophes, you could do it like this (C# code, should be easy to translate):
string input = "this is a 'test' of 'something' and ...";
Console.WriteLine(Regex.Replace(input, "'([^']*)'", "('$1')"));
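A possible Java translation of the C# snippet above, as a sketch. It uses the same capturing-group pattern, and also shows how the captured contents could be collected on their own. The names QuotedText, wrap and contents are illustrative assumptions, not part of the original answer.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class QuotedText {
    // Group 1 captures the text between the two single quotes.
    private static final Pattern QUOTED = Pattern.compile("'([^']*)'");

    // Re-emits each quoted item as ('item'), like the C# Regex.Replace call.
    static String wrap(String input) {
        return QUOTED.matcher(input).replaceAll("('$1')");
    }

    // Returns only the text found inside each pair of quotes.
    static List<String> contents(String input) {
        List<String> result = new ArrayList<>();
        Matcher m = QUOTED.matcher(input);
        while (m.find()) {
            result.add(m.group(1));
        }
        return result;
    }

    public static void main(String[] args) {
        String input = "this is a 'test' of 'something'";
        System.out.println(wrap(input));
        System.out.println(contents(input));
    }
}
```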

Related

Regular expression non-greedy but still

I have some larger text which in essence looks like this:
abc12..manycharshere...hi - abc23...manyothercharshere...jk
Obviously there are two items, each starting with "abc", the numbers (12 and 23) are interesting as well as the "hi" and "jk" at the end.
I would like to create a regular expression that lets me parse out the numbers, but only if the two characters at the end match; that is, I am looking for the number related to "jk". However, the following regular expression matches the whole string and thus returns "12", not "23", even though the part in between is matched non-greedily:
abc([0-9]+).*?jk
Is there a way to construct a regular expression which matches text like the one above, i.e. retrieving "23" for items ending in "jk"?
Basically I would need something like: match "abc" followed by a number, but only if there is "jk" at the end before another instance of "abc" followed by a number appears.
Note: the texts/matches are an abstraction here. The actual text is more complicated, especially the things that can appear as "manyothercharshere"; I simplified to show the underlying problem more clearly.
Use a regex like this. .*abc([0-9]+).*?jk
I think you want something like this,
abc([0-9]+)(?=(?:(?!jk|abc[0-9]).)*jk)
You need to use a negative lookahead here to make it work:
abc(?!.*?abc)([0-9]+).*?jk
Here (?!.*?abc) is a negative lookahead that makes sure abc is matched only where it is NOT followed by another abc, so that the closest string between abc and jk is matched.
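The two lookahead patterns above can be compared in Java with a small sketch like the following. The class name, the helper firstNumber and the second test string are assumptions made for illustration. The second string shows a difference between the two: the (?!.*?abc) variant only ever considers the last "abc" in the text, while the tempered variant checks each item on its own.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LookaheadDemo {
    // Number is accepted only if "jk" is reached before the next "abc<digit>".
    static final Pattern TEMPERED =
            Pattern.compile("abc([0-9]+)(?=(?:(?!jk|abc[0-9]).)*jk)");

    // Number is accepted only for an "abc" that has no other "abc" after it.
    static final Pattern LAST_ABC =
            Pattern.compile("abc(?!.*?abc)([0-9]+).*?jk");

    // Returns group 1 of the first match, or null if nothing matches.
    static String firstNumber(Pattern p, String text) {
        Matcher m = p.matcher(text);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String sample = "abc12..manycharshere...hi - abc23...manyothercharshere...jk";
        String jkFirst = "abc1 x jk - abc2 y hi"; // hypothetical input, "jk" item comes first

        System.out.println(firstNumber(TEMPERED, sample));
        System.out.println(firstNumber(LAST_ABC, sample));
        System.out.println(firstNumber(TEMPERED, jkFirst));
        System.out.println(firstNumber(LAST_ABC, jkFirst));
    }
}
```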
Being non-greedy does not change the rule that the first (leftmost) match is returned. So abc([0-9]+).*?jk will find the first jk after "abcnumber" rather than the last one, but it still matches the first "abcnumber".
One way to solve this is to specify that the dot must not match at any position where abc([0-9]+) begins:
abc([0-9]+)((?!abc([0-9]+)).)*jk
If it is not important for the entire pattern to be an exact match, you can do it more simply:
.*(abc([0-9]+).*?jk)
In this case, it's group 1 that contains your intended match (and group 2 the number). The pattern uses a greedy match-all prefix to ensure that the last possible "abcnumber" is matched within the group.
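To make the leftmost-match rule concrete, this sketch runs the lazy pattern from the question and the two alternatives from this answer against the sample text. The class and method names are illustrative assumptions; the patterns are taken from the thread unchanged.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LeftmostMatchDemo {
    static final String SAMPLE =
            "abc12..manycharshere...hi - abc23...manyothercharshere...jk";

    // Returns the requested group of the first match, or null if none.
    static String find(String regex, int group, String text) {
        Matcher m = Pattern.compile(regex).matcher(text);
        return m.find() ? m.group(group) : null;
    }

    public static void main(String[] args) {
        // Lazy quantifier: the match still starts at the first "abc".
        System.out.println(find("abc([0-9]+).*?jk", 1, SAMPLE));

        // Dot is not allowed to step onto another "abc<number>".
        System.out.println(find("abc([0-9]+)((?!abc([0-9]+)).)*jk", 1, SAMPLE));

        // Greedy prefix pushes the group to the last "abc<number>"; number is group 2.
        System.out.println(find(".*(abc([0-9]+).*?jk)", 2, SAMPLE));
    }
}
```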
Assuming that hyphen separates "items", this regex will capture the numbers from the target item:
abc([0-9]+)[^-]*?jk

Java regular expression to match everything between same specific character sequences

Let's take string "Something foo part1 foo part2 foo part3" as an example.
I want to find all parts starting with "foo" continuing till another "foo" or end of string and replace (wrap inner part into HTML markup) afterwards. So the result should be "Something <bar> part1 </bar><bar> part2 </bar><bar> part3</bar>"
I've started with: "foo(.*?)(foo|$)" and replacing it with "<bar>$1</bar>". Replacing seems to be all right but I need help with regex itself.
I have tried many variations with negative lookbehind and others so far without success. Thanks for any suggestions.
Just change your second group into a lookahead
foo(.*?)(?=foo|$)
The problem is that you are matching (and thereby consuming) the "foo" that you want to use as the next starting point. You can avoid this by using a lookahead assertion. That way the following "foo" is not consumed and can therefore serve as the start of the next match.
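A short Java sketch of the lookahead version, using the example string and the replacement from the question. The class and method names are assumed for illustration.

```java
class FooWrapper {
    // Wraps each part that follows "foo" (up to the next "foo" or end of input) in <bar>.
    static String wrapParts(String input) {
        // (?=foo|$) only checks what comes next; it does not consume it,
        // so the following "foo" is still available to start the next match.
        return input.replaceAll("foo(.*?)(?=foo|$)", "<bar>$1</bar>");
    }

    public static void main(String[] args) {
        System.out.println(wrapParts("Something foo part1 foo part2 foo part3"));
    }
}
```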

How to remove a specific special character pattern from a string

I have a string named s:
String s = "<NOUN>Sam</NOUN> , a student of the University of oxford , won the Ethugalpura International Rating Chess Tournament which concluded on Dec.22 at the Blue Olympiad Hotel";
I want to remove all <NOUN> and </NOUN> tags from the string. I used this to remove tags,
s.replaceAll("[<NOUN>,</NOUN>]","");
Yes, it removes the tags, but it also removes the letters 'U' and 'O' from the string, which gives me the following output:
Sam , a student of the niversity of oxford , won the Ethugalpura International Rating Chess Tournament which concluded on Dec.22 at the Blue lympiad Hotel
Can anyone please tell me how to do this correctly?
Try:
s.replaceAll("<NOUN>|</NOUN>", "");
In regex, the syntax [...] is a character class: it matches any single character listed inside the brackets, regardless of the order they appear in. Therefore, in your example, every occurrence of "<", "N", "O", etc. is removed. Instead, use the pipe (|) to match either "<NOUN>" or "</NOUN>".
The following should also work (and could be considered more DRY and elegant) since it will match the tag both with and without the forward slash:
s.replaceAll("</?NOUN>", "");
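The difference between the character class and the optional-slash pattern can be seen side by side in a small sketch. The class and method names are assumptions for illustration. Note that replaceAll() returns a new string, so the result has to be assigned or returned.

```java
class NounTagStripper {
    // Removes <NOUN> and </NOUN>; the slash is optional via "/?".
    static String stripTags(String s) {
        return s.replaceAll("</?NOUN>", "");
    }

    // The pattern from the question: a character class, which removes every
    // single '<', '>', '/', ',', 'N', 'O' and 'U' wherever it occurs.
    static String stripWithCharClass(String s) {
        return s.replaceAll("[<NOUN>,</NOUN>]", "");
    }

    public static void main(String[] args) {
        String s = "<NOUN>Sam</NOUN> , a student of the University of oxford";
        System.out.println(stripTags(s));
        System.out.println(stripWithCharClass(s));
    }
}
```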
String.replaceAll() takes a regular expression as its first argument. The regexp:
"[<NOUN>,</NOUN>]"
defines within the brackets the set of characters to be identified and thus removed. Thus you're asking to remove the characters <,>,/,N,O,U and comma.
Perhaps the simplest method to do what you want is to do:
s.replaceAll("<NOUN>","").replaceAll("</NOUN>","");
which is explicit in what it's removing. More complex regular expressions are obviously possible.
You can use one regular expression for this: "<[/]*NOUN>"
so
s.replaceAll("<[/]*NOUN>","");
should do the trick. The "[/]*" matches zero or more "/" after the "<".
Try this:
String result = originValue.replaceAll("\\<.*?>", "");

What's wrong with this regex?

I am trying the following code on Java:
String test = "http://asda.aasd.sd.google.com/asdasdawrqwfqwfqwfqwf";
String regex = "[http://]{0,1}([a-zA-Z]*.)*\\.google\\.com/[-a-zA-Z/_.?&=]*";
System.out.println(test.matches(regex));
It runs for several minutes (after that I killed the VM) with no result.
Can anyone help me?
BTW: what would you recommend to speed up weblink-testing regexes in the future?
[http://] is a character class, meaning any one of those characters from the set.
Just leave those particular square brackets off if it must start with http://. If it's optional, you can use (http://)?.
One obvious problem is that you're looking for the sequence ([a-zA-Z]+.)*\\.google - this will do a lot of backtracking due to that naked . which means "any character" rather than the literal period that you wanted.
But even if you replace it with what you meant, ([a-zA-Z]+\\.)*\\.google, you still have a problem - this will then require two . characters immediately before google. You should instead try:
String regex = "(http://)?([a-zA-Z]+\\.)*google\\.com/[-a-zA-Z/_.?&=]*";
That returns immediately for me with a true match.
Keep in mind that this currently requires the / at the end of google.com. If that's a problem, it's a minor fix, but I've left it there since you had it in your original regex.
You are trying to match the scheme as a character class using square brackets. That means only zero or one of the characters from that set. You want a subpattern, with parentheses. You can also change {0,1} to just say ?.
Also, you should remove the period just before google\\.com because you're already looking for a period in the subdomain subpattern of your regex. As cherouvim points out, you forgot to escape that period as well.
String regex = "(http://)?([a-zA-Z]+\\.)*google\\.com/[-a-zA-Z/_.?&=]*";
In the ([a-zA-Z]*.) part you either need to escape the . (because right now it means "all characters") or remove it.
There are two problems with the regular expression.
The first is easy, as was mentioned by others. You need to match "http://" as a subpattern, not as a character class. Change the brackets to parentheses.
The second problem causes the very poor performance. It's causing the regex to backtrack repeatedly, trying to match the pattern.
What you're trying to do is match zero or more subdomains, which are groups of letters followed by a dot. Since you want to match the dot explicitly, escape the dot. Also remove the dot in front of "google" so you can match "http://google.com/etc" (ie, no leading dot in front of google).
So your expression becomes:
String regex = "(http://){0,1}([a-zA-Z]+\\.)*google\\.com/[-a-zA-Z/_.?&=]*";
Running this regex on your example takes just a fraction of a second.
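As a sketch, the corrected expression can be wrapped in a precompiled Pattern, which also avoids recompiling the regex on every call when many links are tested. The class and method names are illustrative assumptions; the pattern is the one given in the answers.

```java
import java.util.regex.Pattern;

class GoogleUrlMatcher {
    // Optional scheme, zero or more "letters." subdomains, then google.com/ and a path.
    private static final Pattern GOOGLE_URL = Pattern.compile(
            "(http://)?([a-zA-Z]+\\.)*google\\.com/[-a-zA-Z/_.?&=]*");

    static boolean isGoogleUrl(String url) {
        // matches() requires the whole input to match the pattern.
        return GOOGLE_URL.matcher(url).matches();
    }

    public static void main(String[] args) {
        System.out.println(isGoogleUrl("http://asda.aasd.sd.google.com/asdasdawrqwfqwfqwfqwf"));
        System.out.println(isGoogleUrl("http://...google.com/"));
    }
}
```

Because each subdomain now needs at least one letter followed by a literal dot, there is only one way to split the host into groups, which is what removes the heavy backtracking.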
Assuming you fix the ([a-zA-Z]*\\.) you need to change * to + so the part becomes ([a-zA-Z]+\\.). Otherwise you'll be accepting http://...google.com and this is not valid.
By grouping the part before google.com, I assume you are looking for part of the URL's host name. Regexp is a powerful tool, but you can simply use Java's URL class, which has a getHost() method. Then you can check whether the host name ends with google.com and split it, or use a simpler regexp on the host name alone.
URL url = new URL("http://asda.aasd.sd.google.com/asdasdawrqwfqwfqwfqwf");
String host = url.getHost();
if (host.endsWith("google.com"))
{
    String[] parts = host.split("\\.");
    for (String s : parts)
        System.out.println(s);
}
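A runnable variant of the host-name idea, as a sketch under a couple of assumptions that are not in the original answer: it swaps in java.net.URI (URI.create() throws an unchecked IllegalArgumentException, so no checked MalformedURLException has to be handled), and it tightens the check so that a host such as notgoogle.com is not accepted. The class and method names are made up for illustration.

```java
import java.net.URI;

class HostCheck {
    // True if the URL's host is google.com or one of its subdomains.
    static boolean isGoogleHost(String url) {
        String host = URI.create(url).getHost(); // null when the URI has no host part
        return host != null
                && (host.equals("google.com") || host.endsWith(".google.com"));
    }

    public static void main(String[] args) {
        String url = "http://asda.aasd.sd.google.com/asdasdawrqwfqwfqwfqwf";
        if (isGoogleHost(url)) {
            // Print each dot-separated label of the host name.
            for (String part : URI.create(url).getHost().split("\\.")) {
                System.out.println(part);
            }
        }
    }
}
```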

How do I write a regular expression to find the following pattern?

I am trying to write a regular expression to do a find and replace operation. Assume Java regex syntax. Below are examples of what I am trying to find:
12341+1
12241+1R1
100001+1R2
So, I am searching for a string beginning with one or more digits, followed by a "1+1" substring, followed by 0 or more characters. I have the following regex:
^(\d+)(1\\+1).*
This regex will successfully find the examples above, however, my goal is to replace the strings with everything before "1+1". So, 12341+1 would become 1234, and 12241+1R1 would become 1224. If I use the first grouped expression $1 to replace the pattern, I get the wrong result as follows:
12341+1 becomes 12341
12241+1R1 becomes 12241
100001+1R2 becomes 100001
Any ideas?
Your existing regex works fine; it's just that in a Java string you are missing a \ before \d:
String str = "100001+1R2";
str = str.replaceAll("^(\\d+)(1\\+1).*","$1");
IMHO, the regex is correct.
Perhaps you wrote it wrong in the code. If you want to code the regex ^(\d+)(1\+1).* in a string, you have to write something like String regex = "^(\\d+)(1\\+1).*".
Your output looks like the result of replacing with ^(\d+)(1+1).*, because a backslash is missing from the string (e.g. "^(\\d+)(1\+1).*").
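A self-contained sketch of the correctly escaped version, run against the three examples from the question. In a Java string literal every regex backslash is doubled, so \d becomes "\\d" and \+ becomes "\\+". The class and method names are illustrative assumptions.

```java
class OnePlusOneStripper {
    // Keeps only the digits that precede the literal "1+1".
    static String strip(String s) {
        return s.replaceAll("^(\\d+)(1\\+1).*", "$1");
    }

    public static void main(String[] args) {
        for (String s : new String[] {"12341+1", "12241+1R1", "100001+1R2"}) {
            System.out.println(s + " becomes " + strip(s));
        }
    }
}
```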
Your regex looks fine to me - I don't have access to java but in JavaScript the code..
"12341+1".replace(/(\d+)(1\+1)/g, "$1");
Returns 1234 as you'd expect. This works on a string with many 'codes' in too e.g.
"12341+1 54321+1".replace(/(\d+)(1\+1)/g, "$1");
gives 1234 5432.
Personally, I wouldn't use a Regex at all (it'd be like using a hammer on a thumbtack), I'd just create a substring from (Pseudocode)
stringName.substring(0, stringName.indexOf("1+1"))
But it looks like other posters have already mentioned the non-greedy operator.
In most regex syntaxes you can add a '?' after a '+' or '*' to indicate that you want it to match as little as possible before moving on in the pattern. (Thus ^(\d+?)(1\+1) matches as few digits as it can until it finds "1+1", so the capture does NOT INCLUDE the "1+1". Note that the greedy original ends up with the same capture here: \d+ first swallows the trailing 1 as well, but then backtracks by one digit so that 1\+1 can match.)
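The substring suggestion and the lazy quantifier can both be sketched in Java as follows. The names are assumptions for illustration, and the guard for a missing "1+1" is an addition: without it, substring(0, -1) would throw a StringIndexOutOfBoundsException.

```java
class PrefixExtractor {
    // Plain string approach: everything before the first "1+1", or the input unchanged.
    static String beforeMarker(String s) {
        int idx = s.indexOf("1+1");
        return idx >= 0 ? s.substring(0, idx) : s;
    }

    // Greedy digits: \d+ backtracks until the literal 1+1 can match.
    static String greedy(String s) {
        return s.replaceAll("^(\\d+)1\\+1.*", "$1");
    }

    // Lazy digits: \d+? grows one digit at a time until 1+1 follows.
    static String lazy(String s) {
        return s.replaceAll("^(\\d+?)1\\+1.*", "$1");
    }

    public static void main(String[] args) {
        String s = "100001+1R2";
        System.out.println(beforeMarker(s));
        System.out.println(greedy(s));
        System.out.println(lazy(s));
    }
}
```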
