How to convert "string" to "*s*t*r*i*n*g*" - java

I need to convert a string like
"string"
to
"*s*t*r*i*n*g*"
What's the regex pattern? Language is Java.

You want to match an empty string, and replace with "*". So, something like this works:
System.out.println("string".replaceAll("", "*"));
// "*s*t*r*i*n*g*"
Or better yet, since the empty string can be matched literally without regex, you can just do:
System.out.println("string".replace("", "*"));
// "*s*t*r*i*n*g*"
Why this works
It's because any instance of a string startsWith(""), and endsWith(""), and contains(""). Between any two characters in any string, there's an empty string. In fact, there are an infinite number of empty strings at these locations.
(And yes, this is true for the empty string itself. That is, an "empty" string contains itself!)
The regex engine and String.replace automatically advance the index when looking for the next match in these kinds of cases to prevent an infinite loop.
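A minimal sketch of the behaviour described above. The class name EmptyMatchDemo and the helper star are made up for illustration; only standard String methods are used.

```java
public class EmptyMatchDemo {
    // Hypothetical helper: inserts a star at every position where the empty string matches.
    static String star(String s) {
        return s.replace("", "*");
    }

    public static void main(String[] args) {
        // Every string starts with, ends with and contains the empty string.
        System.out.println("string".startsWith("")); // true
        System.out.println("string".endsWith(""));   // true
        System.out.println("".contains(""));         // true

        // One star per position: before each character and once at the end.
        System.out.println(star("ab")); // *a*b*
        System.out.println(star(""));   // *
    }
}
```

A string of length n has n + 1 such positions, which is why the result always has one more star than the input has characters.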
A "real" regex solution
There's no need for this, but it's shown here for educational purposes. Something like this also works:
System.out.println("string".replaceAll(".?", "*$0"));
// "*s*t*r*i*n*g*"
This works by matching "any" character with ., and replacing it with * and that character, by backreferencing to group 0.
To add the asterisk after the last character, we allow . to be matched optionally with .?. This works because ? is greedy and will always take a character if possible, i.e. anywhere but at the very end of the string, where it matches the empty string instead.
If the string may contain newline characters, then use Pattern.DOTALL/(?s) mode.
References
regular-expressions.info/Dot Matches (Almost) Any Character and Grouping and Backreferences

I think "" is the regex you want.
System.out.println("string".replaceAll("", "*"));
This prints *s*t*r*i*n*g*.

If this is all you're doing, I wouldn't use a regex:
public static String glitzItUp(String text) {
    return insertPeriodically(text, "*", 1);
}
Putting char into a java string for each N characters
public static String insertPeriodically(
        String text, String insert, int period)
{
    StringBuilder builder = new StringBuilder(
            text.length() + insert.length() * (text.length()/period)+1);
    int index = 0;
    while (index <= text.length())
    {
        builder.append(insert);
        builder.append(text.substring(index,
                Math.min(index + period, text.length())));
        index += period;
    }
    return builder.toString();
}
Another benefit (besides simplicity) is that it's about ten times faster than a regex.
IDEOne | Working example

Just to be a jerk, I'm going to say use J:
I've spent a school year learning Java and taught myself a bit of J over the summer. If you're going to be doing this for yourself, it's probably most productive to use J, simply because inserting asterisks like this is easily done with one simple verb definition using one loop.
asterisked =: 3 : 0
i =. 0
running_String =. '*'
while. i < #y do.
NB. #y returns tally, or number of items in y: right operand to the verb
running_String =. running_String, (i{y) , '*'
i =. >: i
end.
]running_String
)
This is why I would use J: I know how to do this, and have only studied the language for a couple months loosely. This isn't as succinct as the whole .replaceAll() method, but you can do it yourself quite easily and edit it to your specifications later. Feel free to delete this/ troll this/ get inflamed at my suggestion of J, I really don't care: I'm not advertising it.

Related

java - Fix an invalid Duration

We get XML with an invalid duration, like PT10HMS (note the lack of numbers before M and S). I have handled this by reading the file and fixing the duration string: iterating over it character by character and inserting 0 between any two adjacent letters (except between P and T). I was wondering if there is a more elegant solution, maybe using a regex with sed or anything else?
thanks for any suggestions
An idea for a Java solution here (sure sed can be used too).
String incorrectDuration = "PT10HMS";
String dur = incorrectDuration.replaceAll("(?<!\\d+)[HMS]", "0$0");
This produces
PT10H0M0S
Personally I would prefer deleting the letters that do not have a number in front of them:
String dur = incorrectDuration.replaceAll("(?<!\\d+)[HMS]", "");
Now I get
PT10H
In both cases Duration.parse(dur) works and gives the expected result.
(?<!\\d+) is a negative lookbehind: with this the regex only matches if the H, M or S is not preceded by a string of digits.
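A small runnable sketch of the first variant. It is an illustration under assumptions, not the answer's exact code: the class name DurationFixDemo and the helper padMissingNumbers are made up, and the lookbehind is written as (?<!\\d) with a single digit. A letter that is not preceded by a digit is also not preceded by a run of digits, so the meaning should be the same, and a fixed-length lookbehind does not depend on how a given Java version treats + inside a lookbehind.

```java
import java.time.Duration;

public class DurationFixDemo {
    // Hypothetical helper: inserts 0 before any H, M or S that has no digit directly before it.
    static String padMissingNumbers(String duration) {
        return duration.replaceAll("(?<!\\d)[HMS]", "0$0");
    }

    public static void main(String[] args) {
        String fixed = padMissingNumbers("PT10HMS");
        System.out.println(fixed);                 // PT10H0M0S
        System.out.println(Duration.parse(fixed)); // PT10H
    }
}
```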
Edit: I am probably overdoing it in the following. I was just curious how I could produce my preferred string also in the case where you have got for example PTHMS as you mentioned in the comment. For production code you will probably want to stick with the simpler solution above.
String durationString = "PTHMS";
// if no digits, insert 0 before last letter
if (! durationString.matches(".*\\d.*")) {
durationString = durationString.replaceFirst("(?=[HMS]$)", "0");
}
// then delete letters that do not have a digit before them
durationString = durationString.replaceAll("(?<!\\d)[HMS]", "");
This produces
PT0S
(?=[HMS]$) is a lookahead. It matches the empty string but only if this empty string is followed by either H, M or S and then the end of the string. So replacing this empty string with 0 gives us PTHM0S. Confident that there is now at least one digit in the string, we can proceed to delete letters that don’t have a digit before them.
It still wouldn’t work if you had just PT. As I understand, this doesn’t happen. If it did, you would prefer for example durationString = "PT0S"; inside the if statement instead.

Please justify the output in Regex Java program

I came across a Java regex program.
Below is the program code:
import java.util.regex.*;

public class Regex_demo01 {
    public static void main(String[] args) {
        boolean b=true;
        Pattern p=Pattern.compile("\\d*");
        Matcher m=p.matcher("ab34ef");
        while(b=m.find())
        {
            System.out.println(b);
            System.out.println(">"+m.start()+"\t"+m.group()+"<");
        }
    }
}
Output :
true
>0 <
true
>1 <
true
>2 34<
true
>4 <
true
>5 <
true
>6 <
Doubt: As we all know, the find() method returns true if it gets a match and remembers the start position of the match. If find() returns true, you can call the start() method to get the starting position of the match, and you can call the group() method to get the string that represents the actual bit of source data that was matched.
My question is: how come ">6 <" is present in the output when the last index of the string is 5?
The answer is simple: x* matches any number of x, even 0.
Replace * with +, which matches one or more of the element to its left.
My question is how come >6 < is present in the output when the string indexing is till index 5 ?
That behavior is due to your regex i.e. \\d* which matches 0 or more digits.
As you can see it is showing start position 0 as well when there is no digit at the start.
Similarly 6 is last index +1 because there is an empty match past the last character as well.
You should use \\d+ as your regex.
The star quantifier (*) is defined as "zero or more times". That said, your pattern matches zero digits most of the time.
What you actually want is probably the plus quantifier (+), which means "one or more times".
Source: https://docs.oracle.com/javase/7/docs/api/java/util/regex/Pattern.html
Why is there a match at index 6?
RegEx doesn't work on a per-character basis, but rather at the positions between characters. When matching an empty string, it will look before and after every character. Duplicate findings are omitted, of course, so an empty string after the first char and before the second char will yield one match instead of two. By default the algorithm is greedy, which means it will match as many characters as possible.
Consider this example:
Input string is 1
RegEx is \\d*
In this case the RegEx engine starts before the first character and tries to match zero, one or more digits. Since it's greedy, it doesn't stop after the empty string it finds at the beginning. It finds a '1' with no digits following. This is the first match. Then it continues the search after the match. It finds an empty string and matches it too, since that equals zero digits.
For RegEx the string '1' looks rather like this:
"" + "1" + ""
The first two units (empty string and the "1") match the pattern, the third, empty string does, too.
In-depth article about this: http://www.regular-expressions.info/zerolength.html
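To see the zero-length matches side by side, here is a small sketch. The class name ZeroLengthMatchDemo and the helper matches are made up for illustration; each match is recorded as start:text so that empty matches stay visible.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ZeroLengthMatchDemo {
    // Hypothetical helper: collects every match as "start:text".
    static List<String> matches(String regex, String input) {
        List<String> found = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            found.add(m.start() + ":" + m.group());
        }
        return found;
    }

    public static void main(String[] args) {
        // \d* also matches zero digits, including at the position after the last character.
        System.out.println(matches("\\d*", "ab34ef")); // [0:, 1:, 2:34, 4:, 5:, 6:]
        // \d+ needs at least one digit, so only the real number is reported.
        System.out.println(matches("\\d+", "ab34ef")); // [2:34]
        // The one-character example from the explanation above.
        System.out.println(matches("\\d*", "1"));      // [0:1, 1:]
    }
}
```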

Check a string order in a sentence

I want to find out if a specific word comes before another. Partial words are not a match.
Some example tests:
“Hi my name is AB, I’m from London and I love it here ..."
if "from" is before "Hi" -> return false
if "Hi" is before "AB" -> return true
There are several ways of doing this:
Use indexOf - this is perhaps the simplest approach. Get the indexes of the strings and compare them. The string with the lower index comes before the other string.
Use regular expressions - construct a regex that matches the strings in the desired order, for example "from.*?Hi". This approach is likely to use multiple regular expressions.
One twist on the first approach would be to start searching for the second word at the index of the first word plus the length of the word, and avoid index comparisons. With many searches and long strings this could save you some CPU cycles.
Note: Depending on the requirements you may need to watch out for the Scunthorpe problem, when you get a false positive for a match on a substring. If your requirement is that "Hi, my friend AB" should be matched, but "Higher than AB" should not be matched, then the regex approach with \b anchors on both ends of the word would provide an easier solution than manipulating string indexes. The "from.*?Hi" regex above becomes "\\bfrom\\b.*?\\bHi\\b".
yourString.matches(".*? Hi\\b.*? AB\\b.*")
This will make sure that you have spaces in between and you're matching whole words.
If you're dealing with Latin American text, where punctuation can come before words, this is more general:
yourString.matches(".*?\\bHi\\b.*?\\bAB\\b.*")
Breaking that down you have
.*? = anything, even the empty string. Ignore the ? for now.
\\b = a word boundary
So that regex means
<anything><word boundary>Hi<word boundary><anything><word boundary>AB<word boundary><anything>
which is the same as
if "Hi" is before "AB" -> return true
which would be used as
if(yourString.matches(".*?\\bHi\\b.*?\\bAB\\b.*")){
    return true;
}
You can take a look at indexOf(String str), which returns an integer denoting the position of the substring, or -1 if not found. You could use that to see which string precedes the other.
You can use indexOf method and get the first occurrence of each word and then check. For example:
String sentence = "Hi my name is AB, I’m from London and I love it here …";
int fromIndex = sentence.indexOf("from");
int hiIndex = sentence.indexOf("Hi");
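// "from" comes before "Hi" only if both words are found and "from" occurs first
if (fromIndex != -1 && hiIndex != -1 && fromIndex < hiIndex)
    System.out.println("true");
else
    System.out.println("false");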
Note that if a word does not exist within the sentence, then indexOf will return -1.
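As a rough sketch that combines the whole-word requirement with the ordering check, the \b approach from the answers above can be wrapped in a helper. The class name WordOrderDemo and the method isBefore are made up for illustration. The sketch assumes that "before" means some whole-word occurrence of the first word is followed later by a whole-word occurrence of the second, and it uses a plain ASCII apostrophe in the sample sentence.

```java
import java.util.regex.Pattern;

public class WordOrderDemo {
    // Hypothetical helper: true if `first` occurs as a whole word somewhere before
    // a whole-word occurrence of `second`. Pattern.quote keeps the words literal.
    static boolean isBefore(String text, String first, String second) {
        String regex = "\\b" + Pattern.quote(first) + "\\b.*?\\b" + Pattern.quote(second) + "\\b";
        // DOTALL lets .*? cross line breaks in multi-line text.
        return Pattern.compile(regex, Pattern.DOTALL).matcher(text).find();
    }

    public static void main(String[] args) {
        String sentence = "Hi my name is AB, I'm from London and I love it here ...";
        System.out.println(isBefore(sentence, "from", "Hi"));       // false
        System.out.println(isBefore(sentence, "Hi", "AB"));         // true
        System.out.println(isBefore("Higher than AB", "Hi", "AB")); // false, partial word
    }
}
```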

Java String.replaceAll method to sanitize phone numbers

I have a database field called TelephoneName. This field contains telephone numbers in different formats.
What I need now is to separate them into country code and subscriber number.
For example, I saw a telephone number +49 (0)711 / 61947-xx.
I want to remove all slashes, brackets, minus signs and spaces. The result could be +49 (country code) and 071161947** (subscriber number).
How can I do that with replaceAll method?
replaceAll("//()-","") is that correct?
The thing is, I have a lot of unformatted telephone numbers such as:
+49 04261 85120
+32027400050
It is difficult to apply the same algorithm to every telephone number.
The replaceAll method takes a regular expression as argument. To remove everything except digits and +, you could thus do
str = str.replaceAll("[^0-9+]", "")
Here's a more complete example that also figures out the country code (based on the index of the ( symbol):
String str = "+49 (0)711 / 61947-12";
int lpar = str.indexOf('(');
String countryCode = str.substring(0, lpar).trim();
String subscriber = str.substring(lpar).trim();
subscriber = subscriber.replaceAll("[^0-9]", "");
System.out.println(countryCode); // prints +49
System.out.println(subscriber); // prints 07116194712
replaceAll("//()-","") is that correct?
No, not quite. That will remove all //- substrings. To remove those characters you need to put them in [...], like this: replaceAll("[/()-]", "") (and / does not need to be escaped).
The first argument of replaceAll() is a regex pattern, so what you want to do is make it match all non digits (and +). You can do this using the "[^...]" (not one of...) construct :
mystring.replaceAll("[^0-9+]", "")
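A quick sketch of that pattern applied to the sample numbers from the question. The class name PhoneCleanupDemo and the helper digitsAndPlus are made up for illustration, and the xx placeholder is replaced by 12 as in the earlier example. This only strips formatting; it does not decide where the country code ends.

```java
public class PhoneCleanupDemo {
    // Hypothetical helper: keeps only digits and the plus sign.
    static String digitsAndPlus(String raw) {
        return raw.replaceAll("[^0-9+]", "");
    }

    public static void main(String[] args) {
        System.out.println(digitsAndPlus("+49 (0)711 / 61947-12")); // +4907116194712
        System.out.println(digitsAndPlus("+49 04261 85120"));       // +490426185120
        System.out.println(digitsAndPlus("+32027400050"));          // +32027400050
    }
}
```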
No, that doesn't work.
replaceAll() replaces each substring of this string that matches the given regular expression with the given replacement.
So your expression would only replace occurrences of the exact sequence //- (the empty parentheses form an empty group) with an empty string.
You need to do something like
String output = "+49 (0)711 / 61947-xx".replaceAll("[//()-]","");
The square brackets make it a regex character class ('either slash or open bracket or close bracket or hyphen'), rather than a sequence that has to appear in that exact order.
This can be done simply by using:
s=s.replace("/","");
s=s.replace("(","");
s=s.replace(")","");
Then substring it to get country code.

How to tokenize in java without using the java.util tokenizer?

Consider the following as tokens:
+, -, ), (
alpha characters and underscore
integer
Implement:
1. getToken() - returns a string corresponding to the next token
2. getTokPos() - returns the position of the current token in the input string
Example input: (a+b)-21)
Output: (| a| +| b| )| -| 21| )|
Note: Cannot use the java string tokenizer class
Work in progress - Successfully tokenized +,-,),(. Need to figure out characters and numbers:
OUTPUT: +|-|+|-|(|(|)|)|)|(| |
java.util.StringTokenizer is a legacy class whose use is discouraged in new code.
Tokenizing Strings in Java is much easier with "String.split()" since Java 1.4:
String[] tokens = "(a+b)-21)".split("[-+()]");
If it is homework, you probably have to reimplement a "split" method:
read the String character by character
if the character is not a special char, add it to a buffer
when you encounter a special char, add the buffer content to a list and clear the buffer
Since it is (probably) homework, I'll let you implement it.
Java lets you examine the characters in a String one by one with the charAt method. So use that in a for loop and examine each character. When you encounter a TOKEN you wrap that token with the pipes and any other character you just append to the output.
public static final char PLUS_TOKEN = '+';
// add the remaining tokens as constants in the same way

public String doStuff(String input)
{
    StringBuilder output = new StringBuilder();
    for (int index = 0; index < input.length(); index++)
    {
        char current = input.charAt(index);
        if (current == PLUS_TOKEN)
        {
            // when you see a token you need to append the pipes (|) around it
            output.append('|');
            output.append(current);
            output.append('|');
        }
        // else if (current == ...) compare the current character with the other tokens
        else
        {
            // just add to new output
            output.append(current);
        }
    }
    return output.toString();
}
If it's not a homework assignment, use String.split(). If it is a homework assignment, say so and tag it so that we can give the appropriate level of help (I did so for you, just in case...).
Because the string needs to be cut in several different ways, not just on whitespace or parens, using the String.split method with any of the symbols there will not work. Split removes the character used as a separator. You could try to split on the empty string, but this wouldn't get compound symbols, like 21. To correctly parse this string, you will need to effectively implement your own tokenizer. Try thinking about how you could tell you had a complete token if you looked at the string one character at a time. You could probably start a string that collects the characters until you have identified a complete token, and then you can remove the characters from the original and return the string. Starting from this point, you can probably make a basic tokenizer.
If you'd rather learn how to make a full strength tokenizer, most of them are defined by creating a regular expression that only matches the tokens.
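One possible sketch of that regex-based idea, using the getToken() and getTokPos() names from the question. Everything else (the class name SimpleTokenizer, the tokenize helper, returning null at the end of input, and silently skipping characters that are not part of any token) is an assumption made for illustration, not a requirement from the assignment.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class SimpleTokenizer {
    // One alternative per token kind: a single symbol, a run of letters/underscores, a run of digits.
    private static final Pattern TOKEN = Pattern.compile("[-+()]|[A-Za-z_]+|[0-9]+");

    private final Matcher matcher;
    private int tokPos = -1;          // -1 until the first token has been read
    private boolean exhausted = false;

    public SimpleTokenizer(String input) {
        this.matcher = TOKEN.matcher(input);
    }

    /** Returns the next token, or null when no tokens remain (an assumed convention). */
    public String getToken() {
        if (exhausted || !matcher.find()) {
            exhausted = true;
            return null;
        }
        tokPos = matcher.start();
        return matcher.group();
    }

    /** Returns the start index of the most recently returned token. */
    public int getTokPos() {
        return tokPos;
    }

    // Hypothetical convenience helper: reads all tokens into a list.
    static List<String> tokenize(String input) {
        SimpleTokenizer tokenizer = new SimpleTokenizer(input);
        List<String> tokens = new ArrayList<>();
        for (String tok = tokenizer.getToken(); tok != null; tok = tokenizer.getToken()) {
            tokens.add(tok);
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Prints: (| a| +| b| )| -| 21| )|
        System.out.println(String.join("| ", tokenize("(a+b)-21)")) + "|");
    }
}
```

Because the digit alternative uses +, a compound token such as 21 comes back in one piece, which is the part that a plain split cannot provide.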
