How do I convert CamelCase into human-readable names in Java?

I'd like to write a method that converts CamelCase into a human-readable name.
Here's the test case:
public void testSplitCamelCase() {
assertEquals("lowercase", splitCamelCase("lowercase"));
assertEquals("Class", splitCamelCase("Class"));
assertEquals("My Class", splitCamelCase("MyClass"));
assertEquals("HTML", splitCamelCase("HTML"));
assertEquals("PDF Loader", splitCamelCase("PDFLoader"));
assertEquals("A String", splitCamelCase("AString"));
assertEquals("Simple XML Parser", splitCamelCase("SimpleXMLParser"));
assertEquals("GL 11 Version", splitCamelCase("GL11Version"));
}

This works with your test cases:
static String splitCamelCase(String s) {
return s.replaceAll(
String.format("%s|%s|%s",
"(?<=[A-Z])(?=[A-Z][a-z])",
"(?<=[^A-Z])(?=[A-Z])",
"(?<=[A-Za-z])(?=[^A-Za-z])"
),
" "
);
}
Here's a test harness:
String[] tests = {
"lowercase", // [lowercase]
"Class", // [Class]
"MyClass", // [My Class]
"HTML", // [HTML]
"PDFLoader", // [PDF Loader]
"AString", // [A String]
"SimpleXMLParser", // [Simple XML Parser]
"GL11Version", // [GL 11 Version]
"99Bottles", // [99 Bottles]
"May5", // [May 5]
"BFG9000", // [BFG 9000]
};
for (String test : tests) {
System.out.println("[" + splitCamelCase(test) + "]");
}
It uses zero-length matching regex with lookbehind and lookahead to find where to insert spaces. Basically there are three patterns, and I use String.format to put them together to make it more readable.
The three patterns are:
UC behind me, UC followed by LC in front of me
XMLParser AString PDFLoader
/\ /\ /\
non-UC behind me, UC in front of me
MyClass 99Bottles
/\ /\
Letter behind me, non-letter in front of me
GL11 May5 BFG9000
/\ /\ /\
References
regular-expressions.info/Lookarounds
Related questions
Using zero-length matching lookarounds to split:
Regex split string but keep separators
Java split is eating my characters

You can do it using org.apache.commons.lang.StringUtils (in Commons Lang 3 the class is org.apache.commons.lang3.StringUtils):
StringUtils.join(
StringUtils.splitByCharacterTypeCamelCase("ExampleTest"),
' '
);

A neater, shorter solution:
StringUtils.capitalize(StringUtils.join(StringUtils.splitByCharacterTypeCamelCase("yourCamelCaseText"), StringUtils.SPACE)); // Your Camel Case Text

If you don't like "complicated" regexes and aren't at all bothered about efficiency, I've used this approach to achieve the same effect in three stages.
String name =
camelName.replaceAll("([A-Z][a-z]+)", " $1") // Words beginning with UC
.replaceAll("([A-Z][A-Z]+)", " $1") // "Words" of only UC
.replaceAll("([^A-Za-z ]+)", " $1") // "Words" of non-letters
.trim();
It passes all the test cases above, including those with digits.
As I say, this isn't as good as using the one regular expression in some other examples here - but someone might well find it useful.

You can use org.modeshape.common.text.Inflector.
Specifically:
String humanize(String lowerCaseAndUnderscoredWords,
String... removableTokens)
Capitalizes the first word and turns underscores into spaces and strips trailing "_id" and any supplied removable tokens.
Maven artifact is: org.modeshape:modeshape-common:2.3.0.Final
on JBoss repository: https://repository.jboss.org/nexus/content/repositories/releases
Here's the JAR file: https://repository.jboss.org/nexus/content/repositories/releases/org/modeshape/modeshape-common/2.3.0.Final/modeshape-common-2.3.0.Final.jar

The following regex can be used to identify the capitals inside words:
"((?<=[a-z0-9])[A-Z]|(?<=[a-zA-Z])[0-9]|(?<=[A-Z])[A-Z](?=[a-z]))"
It matches every capital letter that either follows a lowercase letter or digit, or follows a capital and is followed by a lowercase letter, and every digit that follows a letter.
How to insert a space before them is beyond my Java skills =)
Edited to include the digit case and the PDF Loader case.
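Inserting the space can be done with a backreference in the replacement string. The following is a minimal sketch, with the digit alternative written as [0-9]; the class and method names are made up for illustration.

```java
import java.util.regex.Pattern;

class InsertSpaces {
    // The pattern from the answer above; group 1 is the character that starts a new word.
    private static final Pattern BOUNDARY_CHAR = Pattern.compile(
        "((?<=[a-z0-9])[A-Z]|(?<=[a-zA-Z])[0-9]|(?<=[A-Z])[A-Z](?=[a-z]))");

    // Hypothetical helper: puts a space in front of every matched character.
    static String insertSpaces(String s) {
        return BOUNDARY_CHAR.matcher(s).replaceAll(" $1");
    }

    public static void main(String[] args) {
        System.out.println(insertSpaces("SimpleXMLParser")); // Simple XML Parser
        System.out.println(insertSpaces("GL11Version"));     // GL 11 Version
    }
}
```

The lookbehinds are evaluated against the original input, so the inserted spaces do not influence later matches.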

I think you will have to iterate over the string and detect changes from lowercase to uppercase, uppercase to lowercase, alphabetic to numeric, numeric to alphabetic. On every change you detect insert a space with one exception though: on a change from upper- to lowercase you insert the space one character before.
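A loop-based version of this idea might look like the sketch below. It is not a literal transcription of the rule above: the boundaries are chosen to reproduce the test cases from the question, so a digit followed by a lowercase letter is left joined. That choice, and the class and method names, are assumptions made for illustration.

```java
class CamelCaseWalker {
    // Walks the string once and emits a space wherever a word boundary is detected.
    static String splitCamelCase(String s) {
        StringBuilder out = new StringBuilder(s.length() + 8);
        for (int i = 0; i < s.length(); i++) {
            if (i > 0 && isBoundary(s, i)) {
                out.append(' ');
            }
            out.append(s.charAt(i));
        }
        return out.toString();
    }

    // True if a new word starts at index i (i must be at least 1).
    private static boolean isBoundary(String s, int i) {
        char prev = s.charAt(i - 1);
        char cur = s.charAt(i);
        boolean nextIsLower = i + 1 < s.length() && Character.isLowerCase(s.charAt(i + 1));
        if (Character.isUpperCase(cur)) {
            // A capital after a non-capital, or the last capital of an acronym
            // when it begins a lowercase word ("PDFLoader" splits before 'L').
            return !Character.isUpperCase(prev) || nextIsLower;
        }
        // A letter followed by a non-letter ("GL11" splits before the first '1').
        return Character.isLetter(prev) && !Character.isLetter(cur);
    }

    public static void main(String[] args) {
        System.out.println(splitCamelCase("SimpleXMLParser")); // Simple XML Parser
        System.out.println(splitCamelCase("GL11Version"));     // GL 11 Version
    }
}
```

Because it relies on the Character class methods rather than [A-Z], this sketch also treats non-ASCII letters as letters.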

This works in .NET... optimize to your liking. I added comments so you can understand what each piece is doing. (RegEx can be hard to understand)
public static string SplitCamelCase(string str)
{
str = Regex.Replace(str, @"([A-Z])([A-Z][a-z])", "$1 $2"); // Capital followed by a capital and a lowercase letter.
str = Regex.Replace(str, @"([a-z])([A-Z])", "$1 $2"); // Lowercase letter followed by a capital.
str = Regex.Replace(str, @"(\D)(\d)", "$1 $2"); // Non-digit followed by a digit.
str = Regex.Replace(str, @"(\d)(\D)", "$1 $2"); // Digit followed by a non-digit.
return str;
}

For the record, here is an almost (*) compatible Scala version:
object Str { def unapplySeq(s: String): Option[Seq[Char]] = Some(s) }
def splitCamelCase(str: String) =
String.valueOf(
(str + "A" * 2) sliding (3) flatMap {
case Str(a, b, c) =>
(a.isUpper, b.isUpper, c.isUpper) match {
case (true, false, _) => " " + a
case (false, true, true) => a + " "
case _ => String.valueOf(a)
}
} toArray
).trim
Once compiled it can be used directly from Java if the corresponding scala-library.jar is in the classpath.
(*) it fails for the input "GL11Version" for which it returns "G L11 Version".

I took the Regex from polygenelubricants and turned it into an extension method on objects:
/// <summary>
/// Turns a given object into a sentence by:
/// Converting the given object into a <see cref="string"/>.
/// Adding spaces before each capital letter except for the first letter of the string representation of the given object.
/// Makes the entire string lower case except for the first word and any acronyms.
/// </summary>
/// <param name="original">The object to turn into a proper sentence.</param>
/// <returns>A string representation of the original object that reads like a real sentence.</returns>
public static string ToProperSentence(this object original)
{
Regex addSpacesAtCapitalLettersRegEx = new Regex(@"(?<=[A-Z])(?=[A-Z][a-z]) | (?<=[^A-Z])(?=[A-Z]) | (?<=[A-Za-z])(?=[^A-Za-z])", RegexOptions.IgnorePatternWhitespace);
string[] words = addSpacesAtCapitalLettersRegEx.Split(original.ToString());
if (words.Length > 1)
{
List<string> wordsList = new List<string> { words[0] };
wordsList.AddRange(words.Skip(1).Select(word => word.Equals(word.ToUpper()) ? word : word.ToLower()));
words = wordsList.ToArray();
}
return string.Join(" ", words);
}
This turns everything into a readable sentence. It does a ToString on the object passed. Then it uses the Regex given by polygenelubricants to split the string. Then it ToLowers each word except for the first word and any acronyms. Thought it might be useful for someone out there.

I'm not a regex ninja, so I'd iterate over the string, keeping the indexes of the current position being checked & the previous position. If the current position is a capital letter, I'd insert a space after the previous position and increment each index.

http://code.google.com/p/inflection-js/
You could chain the String.underscore().humanize() methods to take a CamelCase string and convert it into a human readable string.

Related

Regex for finding only single alphabets in a string and ignore consecutive double

I have searched a lot but I am unable to find a regex that could select only single alphabets and double them while those alphabets which are already double, should remain untouched.
I tried
String str = "yahoo";
str = str.replaceAll("(\\w)\\1+", "$0$0");
But since this (\\w)\\1+ selects all double elements, my output becomes yahoooo. I tried to add negation to it !(\\w)\\1+ but didn't work and output becomes same as input. I have tried
str.replaceAll(".", "$0$0");
But that doubles every character including which are already doubled.
Please help to write an regex that could replace all single character with double while double character should remain untouched.
Example
abc -> aabbcc
yahoo -> yyaahhoo (o should remain untouched)
opinion -> ooppiinniioonn
aaaaaabc -> aaaaaabbcc
You can match using this regex:
((.)\2+)|(.)
And replace it with:
$1$3$3
RegEx Explanation:
((.)\2+): Match a character and capture in group #2 and using \2+ next to it to make sure we match all multiple repeats of captured character. Capture all the repeated characters in group #1
|: OR
(.): Match any character and capture in group #3
Code Demo:
import java.util.List;
class Ideone {
public static void main(String[] args) {
List<String> input = List.of("aaa", "abc", "yahoo",
"opinion", "aaaaaabc");
for (String s: input) {
System.out.println( s + " => " +
s.replaceAll("((.)\\2+)|(.)", "$1$3$3") );
}
}
}
Output:
aaa => aaa
abc => aabbcc
yahoo => yyaahhoo
opinion => ooppiinniioonn
aaaaaabc => aaaaaabbcc
The solution by @anubhava, if viable in Java, is probably the best way to go. For a more brute-force approach, we can iterate over the matches of the following pattern:
(\\w)\\1+|\\w
This greedily matches a run of two or more identical word characters or, failing that, a single word character. For each match, we leave a multi-character run unchanged and double a single character. Here is a short Java snippet that does this:
List<String> inputs = Arrays.asList(new String[] {"abc", "yahoo", "opinion", "aaaaaabc"});
String pattern = "(\\w)\\1+|\\w";
Pattern r = Pattern.compile(pattern);
for (String input : inputs) {
Matcher m = r.matcher(input);
StringBuffer buffer = new StringBuffer();
while (m.find()) {
if (m.group().matches("(\\w)\\1+")) {
m.appendReplacement(buffer, m.group());
}
else {
m.appendReplacement(buffer, m.group() + m.group());
}
}
m.appendTail(buffer);
System.out.println(input + " => " + buffer.toString());
}
This prints:
abc => aabbcc
yahoo => yyaahhoo
opinion => ooppiinniioonn
aaaaaabc => aaaaaabbcc
I've got two different understandings of the question.
If the goal is to get an even number of each word character:
Search for (\w)\1? and replace with $1$1.
If only single characters should be doubled and the others left untouched:
Search for (\w)\1?(\1*) and replace with $1$1$2.
Both capture a word character \w into $1 and optionally match the same character once more. The second variant also captures any further repeats of that character into $2 so they can be appended in the replacement.
FYI: When using these in a Java string, remember to escape the backslashes, e.g. \1 -> \\1, \w -> \\w, ...
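As a rough Java sketch of the two variants, with the backslashes escaped for a Java string literal (the class and method names are hypothetical):

```java
class DoubleSingles {
    // Variant 1: every run of a word character ends up with an even length.
    static String evenRuns(String s) {
        return s.replaceAll("(\\w)\\1?", "$1$1");
    }

    // Variant 2: only single characters are doubled; longer runs are kept as they are.
    static String doubleSingles(String s) {
        return s.replaceAll("(\\w)\\1?(\\1*)", "$1$1$2");
    }

    public static void main(String[] args) {
        System.out.println(evenRuns("aaa"));           // aaaa
        System.out.println(doubleSingles("aaa"));      // aaa
        System.out.println(doubleSingles("yahoo"));    // yyaahhoo
        System.out.println(doubleSingles("aaaaaabc")); // aaaaaabbcc
    }
}
```

The two variants only differ on runs of odd length greater than one, such as "aaa".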

Replacing consecutive repeated characters in java

I am working on Twitter data normalization. Twitter users frequently write things like "I looooooove it" to emphasize the word love. I want to reduce such repeated characters to a proper English word by removing repeated characters until I get a meaningful word (I am aware that I cannot differentiate between good and god with this mechanism).
My strategy would be:
1. Identify the existence of such repeated strings. I would look for more than 2 identical characters in a row, as there is probably no English word with more than two repeated characters.
String[] strings = { "stoooooopppppppppppppppppp","looooooove", "good","OK", "boolean", "mee", "claaap" };
String regex = "([a-z])\\1{2,}";
Pattern pattern = Pattern.compile(regex);
for (String string : strings) {
Matcher matcher = pattern.matcher(string);
if (matcher.find()) {
System.out.println(string+" TRUE ");
}
}
2. Search for such words in a lexicon like WordNet.
3. Replace all but two of the repeated characters and check the lexicon.
4. If the word is not in the lexicon, remove one more repeated character (otherwise treat it as a misspelling).
Due to my poor Java knowledge I am unable to manage 3 and 4. The problem is that I cannot replace all but two of the repeated consecutive characters.
The following code snippet replaces all but one of the repeated characters: System.out.println(data.replaceAll("([a-zA-Z])\\1{2,}", "$1"));
Help is required to find out
A. How to replace all but 2 consecutive repeat characters
B. How to remove one more consecutive character from the output of A
[I think B can be managed by the following code snippet]
System.out.println(data.replaceAll("([a-zA-Z])\\1{1,}", "$1"));
Edit: Solution provided by Wiktor Stribiżew works perfectly in Java. I was wondering what changes are required to get the same result in python.
Python uses re.sub.
Your regex ([a-z])\\1{2,} matches and captures a lowercase ASCII letter into Group 1 and then matches 2 or more further occurrences of that letter. So all you need is to replace the match with a backreference, $1, that holds the captured value. With a single $1, aaaaa is replaced with a single a; with $1$1, it is replaced with aa.
String twoConsecutivesOnly = data.replaceAll(regex, "$1$1");
String noTwoConsecutives = data.replaceAll(regex, "$1");
If you need to make your regex case insensitive, use "(?i)([a-z])\\1{2,}" or even "(\\p{Alpha})\\1{2,}". If any Unicode letters must be handled, use "(\\p{L})\\1{2,}".
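Putting the two replacements together with the lexicon check from the question could look roughly like this. The Set<String> lexicon is a stand-in for a real WordNet lookup, and the class and method names are hypothetical.

```java
import java.util.Set;
import java.util.regex.Pattern;

class RepeatNormalizer {
    // Three or more identical lowercase letters in a row.
    private static final Pattern REPEATS = Pattern.compile("([a-z])\\1{2,}");

    // Tries the two-letter form first, then the one-letter form;
    // returns the input unchanged if neither is in the lexicon.
    static String normalize(String word, Set<String> lexicon) {
        String twoLeft = REPEATS.matcher(word).replaceAll("$1$1");
        if (lexicon.contains(twoLeft)) {
            return twoLeft;
        }
        String oneLeft = REPEATS.matcher(word).replaceAll("$1");
        if (lexicon.contains(oneLeft)) {
            return oneLeft;
        }
        return word; // unresolved, e.g. to be treated as a misspelling
    }

    public static void main(String[] args) {
        Set<String> lexicon = Set.of("stop", "love", "good", "clap"); // toy lexicon
        System.out.println(normalize("looooooove", lexicon)); // love
        System.out.println(normalize("good", lexicon));       // good
    }
}
```

This sketch reduces every run in a word the same way, so a word that needs two letters kept in one run and one letter in another is not resolved by it.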
BONUS: In a general case, to replace any amount of any repeated consecutive chars use
text = text.replaceAll("(?s)(.)\\1+", "$1"); // any chars
text = text.replaceAll("(.)\\1+", "$1"); // any chars but line breaks
text = text.replaceAll("(\\p{L})\\1+", "$1"); // any letters
text = text.replaceAll("(\\w)\\1+", "$1"); // any ASCII alnum + _ chars
/*This code checks a character in a given string repeated consecutively 3 times
if you want to check for 4 consecutive times change count==2--->count==3 OR
if you want to check for 2 consecutive times change count==2--->count==1*/
public class Test1 {
static char ch;
public static void main(String[] args) {
String str="aabbbbccc";
char[] charArray = str.toCharArray();
int count=0;
for(int i=0;i<charArray.length;i++){
if(i!=0 ){
if(charArray[i]==ch)continue;//ddddee
if(charArray[i]==charArray[i-1]) {
count++;
if(count==2){
System.out.println(charArray[i]);
count=0;
ch=charArray[i];
}
}
else{
count=0;//aabb
}
}
}
}
}

How can I split a String based on capitalization scheme? [duplicate]

I found a brilliant RegEx to extract the part of a camelCase or TitleCase expression.
(?<!^)(?=[A-Z])
It works as expected:
value -> value
camelValue -> camel / Value
TitleValue -> Title / Value
For example with Java:
String s = "loremIpsum";
String[] words = s.split("(?<!^)(?=[A-Z])");
// words equals new String[]{"lorem", "Ipsum"}
My problem is that it does not work in some cases:
Case 1: VALUE -> V / A / L / U / E
Case 2: eclipseRCPExt -> eclipse / R / C / P / Ext
To my mind, the result should be:
Case 1: VALUE
Case 2: eclipse / RCP / Ext
In other words, given n uppercase chars:
if the n chars are followed by lower case chars, the groups should be: (n-1 chars) / (n-th char + lower chars)
if the n chars are at the end, the group should be: (n chars).
Any idea on how to improve this regex?
The following regex works for all of the above examples:
public static void main(String[] args)
{
for (String w : "camelValue".split("(?<!(^|[A-Z]))(?=[A-Z])|(?<!^)(?=[A-Z][a-z])")) {
System.out.println(w);
}
}
It works by forcing the negative lookbehind to not only ignore matches at the start of the string, but to also ignore matches where a capital letter is preceded by another capital letter. This handles cases like "VALUE".
The first part of the regex on its own fails on "eclipseRCPExt" by failing to split between "RCP" and "Ext". This is the purpose of the second clause: (?<!^)(?=[A-Z][a-z]). This clause allows a split before every capital letter that is followed by a lowercase letter, except at the start of the string.
It seems you are making this more complicated than it needs to be. For camelCase, the split location is simply anywhere an uppercase letter immediately follows a lowercase letter:
(?<=[a-z])(?=[A-Z])
Here is how this regex splits your example data:
value -> value
camelValue -> camel / Value
TitleValue -> Title / Value
VALUE -> VALUE
eclipseRCPExt -> eclipse / RCPExt
The only difference from your desired output is with the eclipseRCPExt, which I would argue is correctly split here.
Addendum - Improved version
Note: This answer recently got an upvote and I realized that there is a better way...
By adding a second alternative to the above regex, all of the OP's test cases are correctly split.
(?<=[a-z])(?=[A-Z])|(?<=[A-Z])(?=[A-Z][a-z])
Here is how the improved regex splits the example data:
value -> value
camelValue -> camel / Value
TitleValue -> Title / Value
VALUE -> VALUE
eclipseRCPExt -> eclipse / RCP / Ext
Edit:20130824 Added improved version to handle RCPExt -> RCP / Ext case.
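For reference, here is a small Java sketch that applies the improved regex with String.split. The class and method names are made up for illustration.

```java
import java.util.Arrays;

class SplitOnCase {
    // Splits between a lowercase and an uppercase letter, and before the last
    // capital of an uppercase run when that capital starts a lowercase word.
    static String[] splitWords(String s) {
        return s.split("(?<=[a-z])(?=[A-Z])|(?<=[A-Z])(?=[A-Z][a-z])");
    }

    public static void main(String[] args) {
        for (String s : new String[] {"value", "camelValue", "TitleValue", "VALUE", "eclipseRCPExt"}) {
            System.out.println(s + " -> " + Arrays.toString(splitWords(s)));
        }
    }
}
```

Both alternatives are zero-width, so no characters are consumed by the split.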
Another solution would be to use a dedicated method in commons-lang: StringUtils#splitByCharacterTypeCamelCase
I couldn't get aix's solution to work (and it doesn't work on RegExr either), so I came up with my own that I've tested and seems to do exactly what you're looking for:
((^[a-z]+)|([A-Z]{1}[a-z]+)|([A-Z]+(?=([A-Z][a-z])|($))))
and here's an example of using it:
; Regex Breakdown: This will match against each word in Camel and Pascal case strings, while properly handling acronyms.
; (^[a-z]+) Match against any lower-case letters at the start of the string.
; ([A-Z]{1}[a-z]+) Match against Title case words (one upper case followed by lower case letters).
; ([A-Z]+(?=([A-Z][a-z])|($))) Match against multiple consecutive upper-case letters, leaving the last upper case letter out the match if it is followed by lower case letters, and including it if it's followed by the end of the string.
newString := RegExReplace(oldCamelOrPascalString, "((^[a-z]+)|([A-Z]{1}[a-z]+)|([A-Z]+(?=([A-Z][a-z])|($))))", "$1 ")
newString := Trim(newString)
Here I'm separating each word with a space, so here are some examples of how the string is transformed:
ThisIsATitleCASEString => This Is A Title CASE String
andThisOneIsCamelCASE => and This One Is Camel CASE
This solution above does what the original post asks for, but I also needed a regex to find camel and pascal strings that included numbers, so I also came up with this variation to include numbers:
((^[a-z]+)|([0-9]+)|([A-Z]{1}[a-z]+)|([A-Z]+(?=([A-Z][a-z])|($)|([0-9]))))
and an example of using it:
; Regex Breakdown: This will match against each word in Camel and Pascal case strings, while properly handling acronyms and including numbers.
; (^[a-z]+) Match against any lower-case letters at the start of the command.
; ([0-9]+) Match against one or more consecutive numbers (anywhere in the string, including at the start).
; ([A-Z]{1}[a-z]+) Match against Title case words (one upper case followed by lower case letters).
; ([A-Z]+(?=([A-Z][a-z])|($)|([0-9]))) Match against multiple consecutive upper-case letters, leaving the last upper case letter out the match if it is followed by lower case letters, and including it if it's followed by the end of the string or a number.
newString := RegExReplace(oldCamelOrPascalString, "((^[a-z]+)|([0-9]+)|([A-Z]{1}[a-z]+)|([A-Z]+(?=([A-Z][a-z])|($)|([0-9]))))", "$1 ")
newString := Trim(newString)
And here are some examples of how a string with numbers is transformed with this regex:
myVariable123 => my Variable 123
my2Variables => my 2 Variables
The3rdVariableIsHere => The 3 rdVariable Is Here
12345NumsAtTheStartIncludedToo => 12345 Nums At The Start Included Too
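Since the question is about Java, here is a sketch of the same replacement using String.replaceAll with the number-aware regex. The snippets above are not Java; this translation and its class and method names are assumptions made for illustration.

```java
class WordSpacer {
    // The number-aware regex from above, unchanged.
    private static final String WORDS =
        "((^[a-z]+)|([0-9]+)|([A-Z]{1}[a-z]+)|([A-Z]+(?=([A-Z][a-z])|($)|([0-9]))))";

    // Appends a space after every matched word, then trims the trailing one.
    static String spaceWords(String s) {
        return s.replaceAll(WORDS, "$1 ").trim();
    }

    public static void main(String[] args) {
        System.out.println(spaceWords("myVariable123"));          // my Variable 123
        System.out.println(spaceWords("ThisIsATitleCASEString")); // This Is A Title CASE String
        System.out.println(spaceWords("The3rdVariableIsHere"));   // The 3 rdVariable Is Here
    }
}
```

Lowercase letters that directly follow a digit, as in "3rd", are not matched by any alternative, which is why "rd" stays attached to the next word.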
To handle more letters than just A-Z:
s.split("(?<=\\p{Ll})(?=\\p{Lu})|(?<=\\p{L})(?=\\p{Lu}\\p{Ll})");
Either:
Split after any lowercase letter, that is followed by uppercase letter.
E.g parseXML -> parse, XML.
or
Split after any letter, that is followed by upper case letter and lowercase letter.
E.g. XMLParser -> XML, Parser.
In more readable form:
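import static java.util.stream.Collectors.joining;
import static org.junit.Assert.assertEquals; // assumes JUnit 4
import java.util.regex.Pattern;
import org.junit.Test;
public class SplitCamelCaseTest {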
static String BETWEEN_LOWER_AND_UPPER = "(?<=\\p{Ll})(?=\\p{Lu})";
static String BEFORE_UPPER_AND_LOWER = "(?<=\\p{L})(?=\\p{Lu}\\p{Ll})";
static Pattern SPLIT_CAMEL_CASE = Pattern.compile(
BETWEEN_LOWER_AND_UPPER +"|"+ BEFORE_UPPER_AND_LOWER
);
public static String splitCamelCase(String s) {
return SPLIT_CAMEL_CASE.splitAsStream(s)
.collect(joining(" "));
}
@Test
public void testSplitCamelCase() {
assertEquals("Camel Case", splitCamelCase("CamelCase"));
assertEquals("lorem Ipsum", splitCamelCase("loremIpsum"));
assertEquals("XML Parser", splitCamelCase("XMLParser"));
assertEquals("eclipse RCP Ext", splitCamelCase("eclipseRCPExt"));
assertEquals("VALUE", splitCamelCase("VALUE"));
}
}
Brief
Both top answers here use positive lookbehinds, which are not supported by all regex flavours. The regex below captures both PascalCase and camelCase words and can be used in multiple languages.
Note: I do realize this question is regarding Java, however, I also see multiple mentions of this post in other questions tagged for different languages, as well as some comments on this question for the same.
Code
([A-Z]+|[A-Z]?[a-z]+)(?=[A-Z]|\b)
Results
Sample Input
eclipseRCPExt
SomethingIsWrittenHere
TEXTIsWrittenHERE
VALUE
loremIpsum
Sample Output
eclipse
RCP
Ext
Something
Is
Written
Here
TEXT
Is
Written
HERE
VALUE
lorem
Ipsum
Explanation
Match one or more uppercase alpha character [A-Z]+
Or match zero or one uppercase alpha character [A-Z]?, followed by one or more lowercase alpha characters [a-z]+
Ensure what follows is an uppercase alpha character [A-Z] or word boundary character \b
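In Java, this regex is used by collecting matches rather than by splitting. A minimal sketch, with hypothetical class and method names:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class WordMatcher {
    // Only a lookahead is needed here, no lookbehind.
    private static final Pattern WORD =
        Pattern.compile("([A-Z]+|[A-Z]?[a-z]+)(?=[A-Z]|\\b)");

    // Returns each matched word in order of appearance.
    static List<String> words(String s) {
        List<String> result = new ArrayList<>();
        Matcher m = WORD.matcher(s);
        while (m.find()) {
            result.add(m.group(1));
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(words("eclipseRCPExt"));     // [eclipse, RCP, Ext]
        System.out.println(words("TEXTIsWrittenHERE")); // [TEXT, Is, Written, HERE]
    }
}
```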
You can use StringUtils.splitByCharacterTypeCamelCase("loremIpsum") from Apache Commons Lang.
You can use the expression below for Java:
(?<=[a-z])(?=[A-Z])|(?<=[A-Z])(?=[A-Z][a-z])|(?=[A-Z][a-z])|(?<=\\d)(?=\\D)|(?=\\d)(?<=\\D)
Instead of looking for separators that aren't there, you might also consider finding the name components (those are certainly there):
String test = "_eclipse福福RCPExt";
Pattern componentPattern = Pattern.compile("_? (\\p{Upper}?\\p{Lower}+ | (?:\\p{Upper}(?!\\p{Lower}))+ \\p{Digit}*)", Pattern.COMMENTS);
Matcher componentMatcher = componentPattern.matcher(test);
List<String> components = new LinkedList<>();
int endOfLastMatch = 0;
while (componentMatcher.find()) {
// matches should be consecutive
if (componentMatcher.start() != endOfLastMatch) {
// do something horrible if you don't want garbage in between
// we're lenient though, any Chinese characters are lucky and get through as group
String startOrInBetween = test.substring(endOfLastMatch, componentMatcher.start());
components.add(startOrInBetween);
}
components.add(componentMatcher.group(1));
endOfLastMatch = componentMatcher.end();
}
if (endOfLastMatch != test.length()) {
String end = test.substring(endOfLastMatch); // the matcher has no current match here, so start() cannot be used
components.add(end);
}
System.out.println(components);
This outputs [eclipse, 福福, RCP, Ext]. Conversion to an array is of course simple.
I can confirm that the regex string ([A-Z]+|[A-Z]?[a-z]+)(?=[A-Z]|\b) given by ctwheels, above, works with the Microsoft flavour of regex.
I would also like to suggest the following alternative, based on ctwheels' regex, which handles numeric characters: ([A-Z0-9]+|[A-Z]?[a-z]+)(?=[A-Z0-9]|\b).
This is able to split strings such as:
DrivingB2BTradeIn2019Onwards
to
Driving B2B Trade In 2019 Onwards
A JavaScript Solution
/**
* howToDoThis ===> ["", "how", "To", "Do", "This"]
* @param word word to be split
*/
export const splitCamelCaseWords = (word: string) => {
if (typeof word !== 'string') return [];
return word.replace(/([A-Z]+|[A-Z]?[a-z]+)(?=[A-Z]|\b)/g, '!$&').split('!');
};

Subtracting characters in a back reference from a character class in java.util.regex.Pattern

Is it possible to subtract the characters in a Java regex back reference from a character class?
e.g., I want to use String#matches(regex) to match either:
any group of characters that are [a-z'] that are enclosed by "
Matches: "abc'abc"
Doesn't match: "1abc'abc"
Doesn't match: 'abc"abc'
any group of characters that are [a-z"] that are enclosed by '
Matches: 'abc"abc'
Doesn't match: '1abc"abc'
Doesn't match: "abc'abc"
The following regex won't compile because [^\1] isn't supported:
(['"])[a-z'"&&[^\1]]*\1
Obviously, the following will work:
'[a-z"]*'|"[a-z']*"
But, this style isn't particularly legible when a-z is replaced by a much more complex character class that must be kept the same in each side of the "or" condition.
I know that, in Java, I can just use String concatenation like the following:
String charClass = "a-z";
String regex = "'[" + charClass + "\"]*'|\"[" + charClass + "']*\"";
But, sometimes, I need to specify the regex in a config file, like XML, or JSON, etc., where java code is not available.
I assume that what I'm asking is almost definitely not possible, but I figured it wouldn't hurt to ask...
One approach is to use a negative look-ahead to make sure that every character in between the quotes is not the quotes:
(['"])(?:(?!\1)[a-z'"])*+\1
^^^^^^
(I also make the quantifier possessive, since there is no use for backtracking here)
This approach is, however, rather inefficient, since the pattern will check for the quote character for every single character, on top of checking that the character is one of the allowed character.
The alternative with 2 branches in the question '[a-z"]*'|"[a-z']*" is better, since the engine only checks for the quote character once and goes through the rest by checking that the current character is in the character class.
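To see how the lookahead variant behaves with String#matches, here is a small sketch. The class and method names are hypothetical; the regex is the one shown above, escaped for a Java string literal.

```java
class QuotedMatcher {
    // Group 1 captures the opening quote; (?!\1) rejects that same quote inside the body.
    private static final String QUOTED = "(['\"])(?:(?!\\1)[a-z'\"])*+\\1";

    static boolean isQuoted(String s) {
        return s.matches(QUOTED);
    }

    public static void main(String[] args) {
        System.out.println(isQuoted("\"abc'abc\""));   // true
        System.out.println(isQuoted("'abc\"abc'"));    // true
        System.out.println(isQuoted("\"1abc'abc\""));  // false
        System.out.println(isQuoted("\"abc\"abc\""));  // false
    }
}
```

The single pattern accepts both quoting styles, which is what the two-branch alternation does as well.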
You could use two patterns in one OR-separated pattern, expressing both your cases:
// | case 1: [a-z'] enclosed by "
// | | OR
// | | case 2: [a-z"] enclosed by '
Pattern p = Pattern.compile("(?<=\")([a-z']+)(?=\")|(?<=')([a-z\"]+)(?=')");
String[] test = {
// will match group 1 (for case 1)
"abcd\"efg'h\"ijkl",
// will match group 2 (for case 2)
"abcd'efg\"h'ijkl",
};
for (String t: test) {
Matcher m = p.matcher(t);
while (m.find()) {
System.out.println(m.group(1));
System.out.println(m.group(2));
}
}
Output
efg'h
null
null
efg"h
Note
There is nothing stopping you from specifying the enclosing characters or the character class itself somewhere else, then building your Pattern with components unknown at compile-time.
Something in the lines of:
// both strings are emulating unknown-value arguments
String unknownEnclosingCharacter = "\"";
String unknownCharacterClass = "a-z'";
// probably want to catch a PatternSyntaxException here for potential
// issues with the given arguments
Pattern p = Pattern.compile(
String.format(
"(?<=%1$s)([%2$s]+)(?=%1$s)",
unknownEnclosingCharacter,
unknownCharacterClass
)
);
String[] test = {
"abcd\"efg'h\"ijkl",
"abcd'efg\"h'ijkl",
};
for (String t: test) {
Matcher m = p.matcher(t);
while (m.find()) {
// note: only main group here
System.out.println(m.group());
}
}
Output
efg'h

Regular Expression - inserting space after comma only if succeeded by a letter or number

In Java I want to insert a space after a comma, but only if the comma is followed by a digit or letter. I am hoping to use the replaceAll method, which takes a regular expression as a parameter. So far I have the following:
String s1="428.0,chf";
s1 = s1.replaceAll(",(\\d|\\w)",", ");
This code does successfully distinguish between the String above and one where there is already a space after the comma. My problem is that I can't figure out how to write the expression so that the space is inserted. The code above will replace the c in the String shown above with a space. This is not what I want.
s1 should look like this after executing the replaceAll: "428.0 chf"
s1.replaceAll(",(?=[\\da-zA-Z])"," ");
(?=[\da-zA-Z]) is a positive lookahead that checks for a digit or a letter after the comma. The lookahead is not replaced, since it is never included in the match. It is just a check.
NOTE
\w includes digits, letters and _, so there is no need for \d alongside it.
A better way to represent it is [\da-zA-Z] instead of \w, since \w also includes _, which you do not need to match.
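A short sketch of the lookahead in use follows. It shows both readings of the question: replacing the comma with a space, as in the expected output, and keeping the comma while adding a space after it. The class and method names are hypothetical.

```java
class CommaSpacing {
    // A comma that is directly followed by an ASCII letter or digit.
    private static final String COMMA_BEFORE_ALNUM = ",(?=[\\da-zA-Z])";

    // Replaces such a comma with a space: "428.0,chf" becomes "428.0 chf".
    static String replaceComma(String s) {
        return s.replaceAll(COMMA_BEFORE_ALNUM, " ");
    }

    // Keeps the comma and adds a space after it: "428.0,chf" becomes "428.0, chf".
    static String spaceAfterComma(String s) {
        return s.replaceAll(COMMA_BEFORE_ALNUM, ", ");
    }

    public static void main(String[] args) {
        System.out.println(replaceComma("428.0,chf"));    // 428.0 chf
        System.out.println(spaceAfterComma("428.0,chf")); // 428.0, chf
        System.out.println(spaceAfterComma("428.0, chf")); // 428.0, chf (unchanged)
    }
}
```

Because the lookahead consumes nothing, the character after the comma never needs to be put back in the replacement.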
Try this, and note that $1 refers to your matched grouping:
s1.replaceAll(",(\\d|\\w)"," $1");
Note that String.replaceAll() works in the same way as a Matcher.replaceAll(). From the doc:
The replacement string may contain references to captured subsequences
String s1="428.0,chf";
s1 = s1.replaceAll(",([^\\W_])"," $1"); //Match an alphanumeric character (no '_') after ','
System.out.println(s1);
Output:
428.0 chf
Since \w matches letters, digits, and the underscore, [^\W_] matches any word character except the underscore.
$1 represents the captured group. Here it captures the c after the comma, so ,c is replaced with a space followed by c.
Try this....
public class Tes {
public static void main(String[] args){
String s1="428.0,chf";
String[] sArr = s1.split(",");
String finalStr = String.join(" ", sArr); // join the parts with single spaces, without a leading space
System.out.println(finalStr);
}
}
