Check if string ends with certain pattern - java

If I have a string like:
This.is.a.great.place.too.work.
or:
This/is/a/great/place/too/work/
then my program should report that the sentence is valid and that it contains "work".
If I have:
This.is.a.great.place.too.work.hahahha
or:
This/is/a/great/place/too/work/hahahah
then my program should not give me that there is a "work" in the sentence.
So I am looking at java strings to find a word at the end of the sentence having . or , or / before it. How can I achieve this?

This is really simple, the String object has an endsWith method.
From your question it seems like you want either /, , or . as the delimiter set.
So:
String str = "This.is.a.great.place.to.work.";
if (str.endsWith(".work.") || str.endsWith("/work/") || str.endsWith(",work,"))
// ...
You can also do this with the matches method and a fairly simple regex:
if (str.matches(".*([.,/])work\\1$"))
This uses the character class [.,/], which matches a period, a comma, or a slash, and the backreference \1, which matches whichever of those delimiters was found before work.

You can test if a string ends with work followed by one character like this:
theString.matches(".*work.$");
If the trailing character is optional you can use this:
theString.matches(".*work.?$");
To make sure the last character is a period . or a slash / you can use this:
theString.matches(".*work[./]$");
To test for work followed by an optional period or slash you can use this:
theString.matches(".*work[./]?$");
To test for work surrounded by periods or slashes, you could do this:
theString.matches(".*[./]work[./]$");
If the tokens before and after work must match each other, you could do this:
theString.matches(".*([./])work\\1$");
Your exact requirement isn't precisely defined, but I think it would be something like this:
theString.matches(".*work[,./]?$");
In other words:
zero or more characters
followed by work
followed by zero or one , . OR /
followed by the end of the input
Explanation of various regex items:
. -- any character
* -- zero or more of the preceding expression
$ -- the end of the line/input
? -- zero or one of the preceding expression
[./,] -- either a period or a slash or a comma
[abc] -- matches a, b, or c
[abc]* -- zero or more of (a, b, or c)
[abc]? -- zero or one of (a, b, or c)
enclosing a pattern in parentheses is called "grouping"
([abc])blah\\1 -- a, b, or c followed by blah followed by "the first group"
Here's a test harness to play with:
class TestStuff {
public static void main (String[] args) {
String[] testStrings = {
"work.",
"work-",
"workp",
"/foo/work.",
"/bar/work",
"baz/work.",
"baz.funk.work.",
"funk.work",
"jazz/junk/foo/work.",
"funk/punk/work/",
"/funk/foo/bar/work",
"/funk/foo/bar/work/",
".funk.foo.bar.work.",
".funk.foo.bar.work",
"goo/balls/work/",
"goo/balls/work/funk"
};
for (String t : testStrings) {
print("word: " + t + " ---> " + matchesIt(t));
}
}
public static boolean matchesIt(String s) {
return s.matches(".*([./,])work\\1?$");
}
public static void print(Object o) {
String s = (o == null) ? "null" : o.toString();
System.out.println(s);
}
}

Of course you can use the StringTokenizer class to split the String with '.' or '/', and check if the last word is "work".
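A minimal sketch of the StringTokenizer idea, assuming the delimiters are ., , and / as in the question. The names LastTokenCheck and lastTokenIs are made up for illustration. Note that this only looks at the last token, so it also accepts a string that ends in work without a trailing delimiter.

```java
import java.util.StringTokenizer;

class LastTokenCheck {
    // Returns true if the last token of s, split on '.', ',' or '/', equals word.
    static boolean lastTokenIs(String s, String word) {
        StringTokenizer st = new StringTokenizer(s, ".,/");
        String last = null;
        while (st.hasMoreTokens()) {
            last = st.nextToken(); // keep overwriting until the final token remains
        }
        return word.equals(last); // false when there were no tokens at all
    }

    public static void main(String[] args) {
        System.out.println(lastTokenIs("This.is.a.great.place.too.work.", "work"));
        System.out.println(lastTokenIs("This/is/a/great/place/too/work/hahahah", "work"));
    }
}
```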

You can use the substring method:
String aString = "This.is.a.great.place.too.work.";
String aSubstring = "work";
String endString = aString.substring(aString.length() -
(aSubstring.length() + 1),aString.length() - 1);
if ( endString.equals(aSubstring) )
System.out.println("Equal " + aString + " " + aSubstring);
else
System.out.println("NOT equal " + aString + " " + aSubstring);

I tried all the different things mentioned here to get the index of the . character in a filename that ends with .[0-9][0-9]*, e.g. srcfile.1, srcfile.12, etc. Nothing worked. Finally, the following worked:
int dotIndex = inputfilename.lastIndexOf(".");
Weird! This is with java -version:
openjdk version "1.8.0_131"
OpenJDK Runtime Environment (build 1.8.0_131-8u131-b11-0ubuntu1.16.10.2-b11)
OpenJDK 64-Bit Server VM (build 25.131-b11, mixed mode)
Also, the official Java doc page for regex (from which there is a quote in one of the answers above) does not seem to specify how to look for the . character. Because \., \\., and [.] did not work for me, and I don't see any other options specified apart from these.
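For comparison, an escaped dot does work in a Java regex when the pattern is searched with Matcher.find() (String.matches() would have to match the whole file name). A small sketch, assuming the goal is the index of the dot that starts a trailing numeric suffix; DotSuffix and numericSuffixDotIndex are hypothetical names:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class DotSuffix {
    // "\\." in Java source is the regex \. (a literal dot);
    // [0-9]+$ requires one or more digits running to the end of the input.
    private static final Pattern NUMERIC_SUFFIX = Pattern.compile("\\.[0-9]+$");

    // Index of the dot before a trailing numeric suffix, or -1 if there is none.
    static int numericSuffixDotIndex(String fileName) {
        Matcher m = NUMERIC_SUFFIX.matcher(fileName);
        return m.find() ? m.start() : -1;
    }

    public static void main(String[] args) {
        System.out.println(numericSuffixDotIndex("srcfile.1"));
        System.out.println(numericSuffixDotIndex("srcfile.txt"));
    }
}
```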

String input1 = "This.is.a.great.place.too.work.";
String input2 = "This/is/a/great/place/too/work/";
String input3 = "This,is,a,great,place,too,work,";
String input4 = "This.is.a.great.place.too.work.hahahah";
String input5 = "This/is/a/great/place/too/work/hahaha";
String input6 = "This,is,a,great,place,too,work,hahahha";
String regEx = ".*work[.,/]";
System.out.println(input1.matches(regEx)); // true
System.out.println(input2.matches(regEx)); // true
System.out.println(input3.matches(regEx)); // true
System.out.println(input4.matches(regEx)); // false
System.out.println(input5.matches(regEx)); // false
System.out.println(input6.matches(regEx)); // false

Related

Validate a user input string to match specific pattern using regex in java

I need to validate a user input string to match specific pattern using regex in java. user input needs to match the following syntax: sv32i-a- where "sv" is always mandatory followed by 32 or 64, then "i" or "c" then "-" then "a" or "b" then "-" and then " " an empty space and then a possible repetition on the string like (sv32i-a- sv64c-b- ). Just getting confused. Thank you!
public class StringValidation {
static boolean result = true;
//Help needed here.
static String syntax = "^rv\\d{2}$"; //Code goes here but not sure about the syntax..
public static boolean isTrue(String stringToValidate) {
result = stringToValidate.matches(syntax);
return result;
}
}
sv here "sv" is always mandatory
(?:32|64) followed by 32 or 64,
[ic] then "i" or "c"
- then "-"
[ab] then "a" or "b"
- then "-"
and then " " an empty space
(?:xxx)+ and then a possible repetition on the string like (sv32i-a- sv64c-b- )
So: (?:sv(?:32|64)[ic]-[ab]- )+
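Plugged into String.matches, the pattern above could be used roughly like this. This is only a sketch: SvSyntax and isValid are illustrative names, and it assumes every group, including the last one, must end with a space as described in the question.

```java
class SvSyntax {
    // One or more groups of: sv, 32 or 64, i or c, -, a or b, -, space.
    static final String SYNTAX = "(?:sv(?:32|64)[ic]-[ab]- )+";

    static boolean isValid(String input) {
        // matches() requires the whole input to fit the pattern
        return input.matches(SYNTAX);
    }

    public static void main(String[] args) {
        System.out.println(isValid("sv32i-a- sv64c-b- "));
        System.out.println(isValid("sv16i-a- "));
    }
}
```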

Java regex: Replace all characters with `+` except instances of a given string

I have the following problem which states
Replace all characters in a string with + symbol except instances of the given string in the method
so for example if the string given was abc123efg and they want me to replace every character except every instance of 123 then it would become +++123+++.
I figured a regular expression is probably the best for this and I came up with this.
str.replaceAll("[^str]","+")
where str is a variable, but its not letting me use the method without putting it in quotations. If I just want to replace the variable string str how can I do that? I ran it with the string manually typed and it worked on the method, but can I just input a variable?
as of right now I believe its looking for the string "str" and not the variable string.
Here is the output its right for so many cases except for two :(
List of open test cases:
plusOut("12xy34", "xy") → "++xy++"
plusOut("12xy34", "1") → "1+++++"
plusOut("12xy34xyabcxy", "xy") → "++xy++xy+++xy"
plusOut("abXYabcXYZ", "ab") → "ab++ab++++"
plusOut("abXYabcXYZ", "abc") → "++++abc+++"
plusOut("abXYabcXYZ", "XY") → "++XY+++XY+"
plusOut("abXYxyzXYZ", "XYZ") → "+++++++XYZ"
plusOut("--++ab", "++") → "++++++"
plusOut("aaxxxxbb", "xx") → "++xxxx++"
plusOut("123123", "3") → "++3++3"
Looks like this is the plusOut problem on CodingBat.
I had 3 solutions to this problem, and wrote a new streaming solution just for fun.
Solution 1: Loop and check
Create a StringBuilder out of the input string, and check for the word at every position. Replace the character if it doesn't match, and skip the length of the word if it is found.
public String plusOut(String str, String word) {
StringBuilder out = new StringBuilder(str);
for (int i = 0; i < out.length(); ) {
if (!str.startsWith(word, i))
out.setCharAt(i++, '+');
else
i += word.length();
}
return out.toString();
}
This is probably the expected answer for a beginner programmer, though there is an assumption that the string doesn't contain any astral plane character, which would be represented by 2 char instead of 1.
Solution 2: Replace the word with a marker, replace the rest, then restore the word
public String plusOut(String str, String word) {
return str.replaceAll(java.util.regex.Pattern.quote(word), "#").replaceAll("[^#]", "+").replaceAll("#", word);
}
Not a proper solution since it assumes that a certain character or sequence of character doesn't appear in the string.
Note the use of Pattern.quote to prevent the word being interpreted as regex syntax by replaceAll method.
Solution 3: Regex with \G
public String plusOut(String str, String word) {
word = java.util.regex.Pattern.quote(word);
return str.replaceAll("\\G((?:" + word + ")*+).", "$1+");
}
Construct regex \G((?:word)*+)., which does more or less what solution 1 is doing:
\G makes sure the match starts from where the previous match leaves off
((?:word)*+) picks out 0 or more instances of word, if any, so that we can keep them in the replacement with $1. The key here is the possessive quantifier *+, which forces the regex to keep every instance of the word it finds. Otherwise, the regex would not work correctly when the word appears at the end of the string, because the regex would backtrack to match .
. will not be part of any word, since the previous part already picks out all consecutive appearances of word and disallows backtracking. We replace this with +
Solution 4: Streaming
public String plusOut(String str, String word) {
return String.join(word,
Arrays.stream(str.split(java.util.regex.Pattern.quote(word), -1))
.map((String s) -> s.replaceAll("(?s:.)", "+"))
.collect(Collectors.toList()));
}
The idea is to split the string by word, do the replacement on the rest, and join them back with word using String.join method.
Same as above, we need Pattern.quote to avoid split interpreting the word as regex. Since split by default removes empty string at the end of the array, we need to use -1 in the second parameter to make split leave those empty strings alone.
Then we create a stream out of the array and replace the rest as strings of +. In Java 11, we can use s -> "+".repeat(s.length()) instead.
The rest is just converting the Stream to an Iterable (List in this case) and joining them for the result
This is a bit trickier than you might initially think because you don't just need to match characters, but the absence of specific phrase - a negated character set is not enough. If the string is 123, you would need:
(?<=^|123)(?!123).*?(?=123|$)
https://regex101.com/r/EZWMqM/1/
That is - lookbehind for the start of the string or "123", make sure the current position is not followed by 123, then lazy-repeat any character until lookahead matches "123" or the end of the string. This will match all characters which are not in a "123" substring. Then, you need to replace each character with a +, after which you can use appendReplacement and a StringBuffer to create the result string:
String inputPhrase = "123";
String inputStr = "abc123efg123123hij";
StringBuffer resultString = new StringBuffer();
Pattern regex = Pattern.compile("(?<=^|" + inputPhrase + ")(?!" + inputPhrase + ").*?(?=" + inputPhrase + "|$)");
Matcher m = regex.matcher(inputStr);
while (m.find()) {
String replacement = m.group(0).replaceAll(".", "+");
m.appendReplacement(resultString, replacement);
}
m.appendTail(resultString);
System.out.println(resultString.toString());
Output:
+++123+++123123+++
Note that if the inputPhrase can contain character with a special meaning in a regular expression, you'll have to escape them first before concatenating into the pattern.
You can do it in one line:
input = input.replaceAll("((?:" + str + ")+)?(?!" + str + ").((?:" + str + ")+)?", "$1+$2");
This optionally captures "123" on either side of each character and puts it back (or an empty string if there's no "123").
Instead of coming up with a regular expression that matches the absence of a string, we might as well match the selected phrase and append a + for each skipped character.
StringBuilder sb = new StringBuilder();
Matcher m = Pattern.compile(Pattern.quote(str)).matcher(input);
while (m.find()) {
// pad with '+' from the end of the previous match up to the start of this one
for (int i = sb.length(); i < m.start(); i++) sb.append('+');
sb.append(str);
}
int remaining = input.length() - sb.length();
for (int i = 0; i < remaining; i++) {
sb.append('+');
}
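Wrapped into a complete method, the match-and-pad idea might look like the sketch below. The class name PlusOutMatcher is made up, and it assumes a non-empty word; the padding is always measured from the current length of the builder so that several matches are handled correctly.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class PlusOutMatcher {
    static String plusOut(String input, String word) {
        StringBuilder sb = new StringBuilder();
        // Pattern.quote makes the word a literal, so "++" or "$1" are safe
        Matcher m = Pattern.compile(Pattern.quote(word)).matcher(input);
        while (m.find()) {
            // fill the gap between the previous match and this one with '+'
            while (sb.length() < m.start()) sb.append('+');
            sb.append(word);
        }
        // fill whatever follows the last match
        while (sb.length() < input.length()) sb.append('+');
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(plusOut("12xy34xyabcxy", "xy"));
    }
}
```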
Absolutely just for the fun of it, a solution using CharBuffer (unexpectedly it took a lot more than I initially hoped for):
private static String plusOutCharBuffer(String input, String match) {
int size = match.length();
CharBuffer cb = CharBuffer.wrap(input.toCharArray());
CharBuffer word = CharBuffer.wrap(match);
int x = 0;
for (; cb.remaining() > 0;) {
if (!cb.subSequence(0, size < cb.remaining() ? size : cb.remaining()).equals(word)) {
cb.put(x, '+');
cb.clear().position(++x);
} else {
cb.clear().position(x = x + size);
}
}
return cb.clear().toString();
}
To make this work you need a beast of a pattern. Let's say you are operating on the following test case as an example:
plusOut("abXYxyzXYZ", "XYZ") → "+++++++XYZ"
What you need to do is build a series of clauses in your pattern to match a single character at a time:
Any character that is NOT "X", "Y" or "Z" -- [^XYZ]
Any "X" not followed by "YZ" -- X(?!YZ)
Any "Y" not preceded by "X" -- (?<!X)Y
Any "Y" not followed by "Z" -- Y(?!Z)
Any "Z" not preceded by "XY" -- (?<!XY)Z
An example of this replacement can be found here: https://regex101.com/r/jK5wU3/4
Here is an example of how this might work (most certainly not optimized, but it works):
import java.util.regex.Pattern;
public class Test {
public static void plusOut(String text, String exclude) {
StringBuilder pattern = new StringBuilder("");
for (int i=0; i<exclude.length(); i++) {
Character target = exclude.charAt(i);
String prefix = (i > 0) ? exclude.substring(0, i) : "";
String postfix = (i < exclude.length() - 1) ? exclude.substring(i+1) : "";
// add the look-behind (?<!X)Y
if (!prefix.isEmpty()) {
pattern.append("(?<!").append(Pattern.quote(prefix)).append(")")
.append(Pattern.quote(target.toString())).append("|");
}
// add the look-ahead X(?!YZ)
if (!postfix.isEmpty()) {
pattern.append(Pattern.quote(target.toString()))
.append("(?!").append(Pattern.quote(postfix)).append(")|");
}
}
// add in the other character exclusion
pattern.append("[^" + Pattern.quote(exclude) + "]");
System.out.println(text.replaceAll(pattern.toString(), "+"));
}
public static void main(String [] args) {
plusOut("12xy34", "xy");
plusOut("12xy34", "1");
plusOut("12xy34xyabcxy", "xy");
plusOut("abXYabcXYZ", "ab");
plusOut("abXYabcXYZ", "abc");
plusOut("abXYabcXYZ", "XY");
plusOut("abXYxyzXYZ", "XYZ");
plusOut("--++ab", "++");
plusOut("aaxxxxbb", "xx");
plusOut("123123", "3");
}
}
UPDATE: Even this doesn't quite work because it can't deal with exclusions that are just repeated characters, like "xx". Regular expressions are most definitely not the right tool for this, but I thought it might be possible. After poking around, I'm not so sure a pattern even exists that might make this work.
The problem with your solution is that str.replaceAll("[^str]","+") uses a character class: it excludes each individual character listed inside the brackets, not the sequence as a whole, so it will not solve your problem.
For example, str.replaceAll("[^XYZ]","+") leaves every X, Y, and Z untouched wherever it occurs, so for "abXYxyzXYZ" you get "++XY+++XYZ".
What you actually need is to exclude a sequence of characters in str.replaceAll.
You can describe a string that does not contain the character sequence XYZ with a negative lookahead: ^((?!XYZ).)*$
Check this solution for more info about this problem, but be aware that it may be complicated to find a regular expression that does this directly.
I have found two simple solutions for this problem :
Solution 1:
You can implement a method to replace all characters with '+' except the instance of given string:
String exWord = "XYZ";
String str = "abXYxyzXYZ";
for(int i = 0; i < str.length(); i++){
// exclude any instance string of exWord from replacing process in str
if(str.substring(i, str.length()).indexOf(exWord) + i == i){
i = i + exWord.length()-1;
}
else{
str = str.substring(0,i) + "+" + str.substring(i+1);//replace each character with '+' symbol
}
}
Note: the condition str.substring(i, str.length()).indexOf(exWord) + i == i is true when an instance of exWord starts at position i, so that instance is skipped instead of being replaced.
Output:
+++++++XYZ
Solution 2:
You can try this Approach using ReplaceAll method and it doesn't need any complex regular expression:
String exWord = "XYZ";
String str = "abXYxyzXYZ";
str = str.replaceAll(exWord,"*"); // replace instance string with * symbol
str = str.replaceAll("[^*]","+"); // replace all characters with + symbol except *
str = str.replaceAll("\\*",exWord); // replace * symbol with instance string
Note : This solution will work only if your input string str doesn't contain any * symbol.
You should also escape any character in exWord that has a special meaning in a regular expression, for example when exWord = "++".
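One way to handle that escaping is sketched below with Pattern.quote for the search side and Matcher.quoteReplacement for the replacement side. MarkerPlusOut is an illustrative name, and the sketch still assumes that the input contains no * and that exWord is not empty.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class MarkerPlusOut {
    static String plusOut(String str, String exWord) {
        return str
                // treat exWord literally, so "++" is not parsed as regex syntax
                .replaceAll(Pattern.quote(exWord), "*")
                // everything that is not the marker becomes '+'
                .replaceAll("[^*]", "+")
                // restore the word; quoteReplacement protects '$' and '\'
                .replaceAll("\\*", Matcher.quoteReplacement(exWord));
    }

    public static void main(String[] args) {
        System.out.println(plusOut("--++ab", "++"));
    }
}
```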

IllegalArgumentException: Illegal group reference while replaceFirst

I'm trying to replace the first occurrence of a String matching my regex, while iterating over those occurrences like this:
(this code is very simplified, so don't try to find some bigger sense of it)
Matcher tagsMatcher = Pattern.compile("\\{[sdf]\\}").matcher(value);
int i = 0;
while (tagsMatcher.find()) {
value = value.replaceFirst("\\{[sdf]\\}", "%" + i + "$s");
i++;
}
I'm getting IllegalArgumentException: Illegal group reference while executing replaceFirst. Why?
The replacement argument of replaceFirst(regex, replacement) can contain references to groups matched by the regex. It does this with
$x syntax, where x is an integer representing the group number,
${name}, where name is the name of a named group (?<name>...)
Because of this, $ is treated as a special character in the replacement, so if you want a literal $ you need to either
escape it with \ like replaceFirst(regex,"\\$whatever")
or let Matcher escape it for you using the Matcher.quoteReplacement method: replaceFirst(regex,Matcher.quoteReplacement("$whatever"))
BUT you shouldn't be using
value = value.replaceFirst("\\{[sdf]\\}", "%" + i + "\\$s");
inside a loop, because each call has to traverse the string from the beginning to find the part you want to replace, which is very inefficient.
The regex engine has a solution for this inefficiency in the form of matcher.appendReplacement(StringBuffer, replacement) and matcher.appendTail(StringBuffer).
appendReplacement appends to the StringBuffer all the text up to the current match, and lets you specify what should be put in place of the matched part
appendTail appends the part that remains after the last match
So your code should look more like
StringBuffer sb = new StringBuffer();
int i = 0;
Matcher tagsMatcher = Pattern.compile("\\{[sdf]\\}").matcher(value);
while (tagsMatcher.find()) {
tagsMatcher.appendReplacement(sb, Matcher.quoteReplacement("%" + (i++) + "$s"));
}
tagsMatcher.appendTail(sb);
value = sb.toString();
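As a self-contained illustration of the appendReplacement / appendTail loop, here is a sketch that numbers each tag in one pass. TagNumbering and numberTags are hypothetical names, and the replacement text simply mirrors the simplified "%" + i + "$s" from the question.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TagNumbering {
    private static final Pattern TAG = Pattern.compile("\\{[sdf]\\}");

    static String numberTags(String value) {
        StringBuffer sb = new StringBuffer();
        Matcher m = TAG.matcher(value);
        int i = 0;
        while (m.find()) {
            // quoteReplacement keeps "$s" from being read as a group reference
            m.appendReplacement(sb, Matcher.quoteReplacement("%" + (i++) + "$s"));
        }
        // copy the text after the last match; without this the tail is lost
        m.appendTail(sb);
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(numberTags("a {s} b {d} c"));
    }
}
```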
You need to escape the dollar symbol.
value = value.replaceFirst("\\{[sdf]\\}", "%" + i + "\\$s");
The "Illegal group reference" error mainly occurs when the replacement refers to a group that does not exist.
The special character $ can be handled in a simple way. Check the example below
public static void main(String args[]){
String test ="Other company in $ city ";
String test2 ="This is test company ";
try{
test2= test2.replaceFirst(java.util.regex.Pattern.quote("test"), Matcher.quoteReplacement(test));
System.out.println(test2);
test2= test2.replaceAll(java.util.regex.Pattern.quote("test"), Matcher.quoteReplacement(test));
System.out.println(test2);
}catch(Exception e){
e.printStackTrace();
}
}
Output:
This is Other company in $ city company
This is Other company in $ city company
I solved it by using apache commons, org.apache.commons.lang3.StringUtils.replaceOnce. This is regex safe.

Java regex and pattern matching: finding "blanks" in pattern which do not include them?

So, I need to write a compiler scanner for homework, and thought it'd be "elegant" to use regex. Fact is, I have seldom used them before, and it was a long time ago. So I forgot most of the stuff about them and needed to have a look around. I used them successfully for the identifiers (or at least I think so, I still need to do some further tests but for now they all look ok), but I have a problem with the number recognition.
The function nextCh() reads the next character on the input (lookahead char). What I'd like to do here is to check if this char matches the regex [0-9]*. I append every matching char in the str field of my current token, then I read the int value of this field. It recognizes a single number input such as "123", but the problem I have is that for the input "123 456", the final str will be "123 456" while I should get 2 separate tokens with fields "123" and "456". Why is the " " being matched?
private void readNumber(Token t) {
t.str = "" + ch; // force conversion char --> String
final Pattern pattern = Pattern.compile("[0-9]*");
nextCh(); // get next char and check if it is a digit
Matcher match = pattern.matcher("" + ch);
while (match.find() && ch != EOF) {
t.str += ch;
nextCh();
match = pattern.matcher("" + ch);
}
t.kind = Kind.number;
try {
int value = Integer.parseInt(t.str);
t.val = value;
} catch(NumberFormatException e) {
error(t, Message.BIG_NUM, t.str);
}
}
Thank you!
PS: I did solve my problem using the code below. Nevertheless, I'd like to understand where the flaw is in my regex expression.
t.str = "" + ch;
nextCh(); // get next char and check if it is a number
while (ch>='0' && ch<='9') {
t.str += ch;
nextCh();
}
t.kind = Kind.number;
try {
int value = Integer.parseInt(t.str);
t.val = value;
} catch(NumberFormatException e) {
error(t, Message.BIG_NUM, t.str);
}
EDIT: turns out my regex also doesn't work for the identifiers recognition (again, includes blanks), so I had to switch to a system similar to my "solution" (while with a lot of conditions). Guess I'll need to study the regex again :O
I'm not 100% sure whether this is relevant in your case, but this:
Pattern.compile("[0-9]*");
matches zero or more digits anywhere in the string, because of the asterisk. The space gets matched because find() succeeds on an empty match: "zero digits" can be found at any position. If you want to make sure the char is a digit, you have to match one or more, using the plus sign:
Pattern.compile("[0-9]+");
or, since you are only comparing a single char at a time, just match one number:
Pattern.compile("^[0-9]$");
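The empty-match behaviour can be checked in isolation. In this sketch (EmptyMatchDemo and its method names are made up for illustration), find() with [0-9]* succeeds even on a single space, while matches() with [0-9] accepts only a one-digit string.

```java
import java.util.regex.Pattern;

class EmptyMatchDemo {
    private static final Pattern ZERO_OR_MORE = Pattern.compile("[0-9]*");
    private static final Pattern ONE_DIGIT = Pattern.compile("[0-9]");

    // find() is satisfied by an empty match, so this is true for any input
    static boolean findsZeroOrMoreDigits(String s) {
        return ZERO_OR_MORE.matcher(s).find();
    }

    // matches() must consume the whole input, so only "0".."9" pass
    static boolean isSingleDigit(String s) {
        return ONE_DIGIT.matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println(findsZeroOrMoreDigits(" "));
        System.out.println(isSingleDigit(" "));
    }
}
```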
You should be using the matches method rather than the find method. From the documentation:
The matches method attempts to match the entire input sequence against the pattern
The find method scans the input sequence looking for the next subsequence that matches the pattern.
So in other words, with find you get a match if any part of the string matches the pattern (and [0-9]* can match an empty part, so it always succeeds), whereas with matches the entire string must match the pattern.
For example, try this:
Pattern p = Pattern.compile("[0-9]*");
Matcher m123abc = p.matcher("123 abc");
System.out.println(m123abc.matches()); // prints false
System.out.println(m123abc.find()); // prints true
Use a simpler regex like
/\d+/
Where
\d means a digit
+ means one or more
In code:
final Pattern pattern = Pattern.compile("\\d+");

How do I convert CamelCase into human-readable names in Java?

I'd like to write a method that converts CamelCase into a human-readable name.
Here's the test case:
public void testSplitCamelCase() {
assertEquals("lowercase", splitCamelCase("lowercase"));
assertEquals("Class", splitCamelCase("Class"));
assertEquals("My Class", splitCamelCase("MyClass"));
assertEquals("HTML", splitCamelCase("HTML"));
assertEquals("PDF Loader", splitCamelCase("PDFLoader"));
assertEquals("A String", splitCamelCase("AString"));
assertEquals("Simple XML Parser", splitCamelCase("SimpleXMLParser"));
assertEquals("GL 11 Version", splitCamelCase("GL11Version"));
}
This works with your testcases:
static String splitCamelCase(String s) {
return s.replaceAll(
String.format("%s|%s|%s",
"(?<=[A-Z])(?=[A-Z][a-z])",
"(?<=[^A-Z])(?=[A-Z])",
"(?<=[A-Za-z])(?=[^A-Za-z])"
),
" "
);
}
Here's a test harness:
String[] tests = {
"lowercase", // [lowercase]
"Class", // [Class]
"MyClass", // [My Class]
"HTML", // [HTML]
"PDFLoader", // [PDF Loader]
"AString", // [A String]
"SimpleXMLParser", // [Simple XML Parser]
"GL11Version", // [GL 11 Version]
"99Bottles", // [99 Bottles]
"May5", // [May 5]
"BFG9000", // [BFG 9000]
};
for (String test : tests) {
System.out.println("[" + splitCamelCase(test) + "]");
}
It uses zero-length matching regex with lookbehind and lookahead to find where to insert spaces. Basically there are 3 patterns, and I use String.format to put them together to make it more readable.
The three patterns are:
UC behind me, UC followed by LC in front of me: XML|Parser, A|String, PDF|Loader
non-UC behind me, UC in front of me: My|Class, 99|Bottles
Letter behind me, non-letter in front of me: GL|11, May|5, BFG|9000
(Here | marks the position where the space is inserted.)
References
regular-expressions.info/Lookarounds
Related questions
Using zero-length matching lookarounds to split:
Regex split string but keep separators
Java split is eating my characters
You can do it using org.apache.commons.lang.StringUtils
StringUtils.join(
StringUtils.splitByCharacterTypeCamelCase("ExampleTest"),
' '
);
The neat and shorter solution :
StringUtils.capitalize(StringUtils.join(StringUtils.splitByCharacterTypeCamelCase("yourCamelCaseText"), StringUtils.SPACE)); // Your Camel Case Text
If you don't like "complicated" regex's, and aren't at all bothered about efficiency, then I've used this example to achieve the same effect in three stages.
String name =
camelName.replaceAll("([A-Z][a-z]+)", " $1") // Words beginning with UC
.replaceAll("([A-Z][A-Z]+)", " $1") // "Words" of only UC
.replaceAll("([^A-Za-z ]+)", " $1") // "Words" of non-letters
.trim();
It passes all the test cases above, including those with digits.
As I say, this isn't as good as using the one regular expression in some other examples here - but someone might well find it useful.
You can use org.modeshape.common.text.Inflector.
Specifically:
String humanize(String lowerCaseAndUnderscoredWords,
String... removableTokens)
Capitalizes the first word and turns underscores into spaces and strips trailing "_id" and any supplied removable tokens.
Maven artifact is: org.modeshape:modeshape-common:2.3.0.Final
on JBoss repository: https://repository.jboss.org/nexus/content/repositories/releases
Here's the JAR file: https://repository.jboss.org/nexus/content/repositories/releases/org/modeshape/modeshape-common/2.3.0.Final/modeshape-common-2.3.0.Final.jar
The following Regex can be used to identify the capitals inside words:
"((?<=[a-z0-9])[A-Z]|(?<=[a-zA-Z])[0-9]|(?<=[A-Z])[A-Z](?=[a-z]))"
It matches every capital letter that is either after a non-capital letter or digit, or followed by a lower case letter, and every digit after a letter.
How to insert a space before them is beyond my Java skills =)
Edited to include the digit case and the PDF Loader case.
I think you will have to iterate over the string and detect changes from lowercase to uppercase, uppercase to lowercase, alphabetic to numeric, numeric to alphabetic. On every change you detect insert a space with one exception though: on a change from upper- to lowercase you insert the space one character before.
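The iteration described above could be sketched in Java as follows. CamelCaseWalker and isBoundary are illustrative names; the "one character before" exception is implemented by looking one character ahead when two capitals are adjacent.

```java
class CamelCaseWalker {
    // True if a space belongs between s[i - 1] and s[i].
    private static boolean isBoundary(String s, int i) {
        char prev = s.charAt(i - 1);
        char cur = s.charAt(i);
        // alphabetic -> numeric or numeric -> alphabetic
        if ((Character.isLetter(prev) && Character.isDigit(cur))
                || (Character.isDigit(prev) && Character.isLetter(cur))) {
            return true;
        }
        // lowercase -> uppercase
        if (Character.isLowerCase(prev) && Character.isUpperCase(cur)) {
            return true;
        }
        // uppercase -> lowercase happens at i + 1, so the space goes one
        // character earlier, before the last capital of the run
        return Character.isUpperCase(prev) && Character.isUpperCase(cur)
                && i + 1 < s.length() && Character.isLowerCase(s.charAt(i + 1));
    }

    static String splitCamelCase(String s) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            if (i > 0 && isBoundary(s, i)) {
                out.append(' ');
            }
            out.append(s.charAt(i));
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(splitCamelCase("SimpleXMLParser"));
    }
}
```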
This works in .NET... optimize to your liking. I added comments so you can understand what each piece is doing. (RegEx can be hard to understand)
public static string SplitCamelCase(string str)
{
str = Regex.Replace(str, @"([A-Z])([A-Z][a-z])", "$1 $2"); // Capital followed by capital AND a lowercase.
str = Regex.Replace(str, @"([a-z])([A-Z])", "$1 $2"); // Lowercase followed by a capital.
str = Regex.Replace(str, @"(\D)(\d)", "$1 $2"); //Letter followed by a number.
str = Regex.Replace(str, @"(\d)(\D)", "$1 $2"); // Number followed by letter.
return str;
}
For the record, here is an almost (*) compatible Scala version:
object Str { def unapplySeq(s: String): Option[Seq[Char]] = Some(s) }
def splitCamelCase(str: String) =
String.valueOf(
(str + "A" * 2) sliding (3) flatMap {
case Str(a, b, c) =>
(a.isUpper, b.isUpper, c.isUpper) match {
case (true, false, _) => " " + a
case (false, true, true) => a + " "
case _ => String.valueOf(a)
}
} toArray
).trim
Once compiled it can be used directly from Java if the corresponding scala-library.jar is in the classpath.
(*) it fails for the input "GL11Version" for which it returns "G L11 Version".
I took the Regex from polygenelubricants and turned it into an extension method on objects:
/// <summary>
/// Turns a given object into a sentence by:
/// Converting the given object into a <see cref="string"/>.
/// Adding spaces before each capital letter except for the first letter of the string representation of the given object.
/// Makes the entire string lower case except for the first word and any acronyms.
/// </summary>
/// <param name="original">The object to turn into a proper sentence.</param>
/// <returns>A string representation of the original object that reads like a real sentence.</returns>
public static string ToProperSentence(this object original)
{
Regex addSpacesAtCapitalLettersRegEx = new Regex(@"(?<=[A-Z])(?=[A-Z][a-z]) | (?<=[^A-Z])(?=[A-Z]) | (?<=[A-Za-z])(?=[^A-Za-z])", RegexOptions.IgnorePatternWhitespace);
string[] words = addSpacesAtCapitalLettersRegEx.Split(original.ToString());
if (words.Length > 1)
{
List<string> wordsList = new List<string> { words[0] };
wordsList.AddRange(words.Skip(1).Select(word => word.Equals(word.ToUpper()) ? word : word.ToLower()));
words = wordsList.ToArray();
}
return string.Join(" ", words);
}
This turns everything into a readable sentence. It does a ToString on the object passed. Then it uses the Regex given by polygenelubricants to split the string. Then it ToLowers each word except for the first word and any acronyms. Thought it might be useful for someone out there.
I'm not a regex ninja, so I'd iterate over the string, keeping the indexes of the current position being checked & the previous position. If the current position is a capital letter, I'd insert a space after the previous position and increment each index.
http://code.google.com/p/inflection-js/
You could chain the String.underscore().humanize() methods to take a CamelCase string and convert it into a human readable string.
