simple regex in java split - java

I have a string blah-*-bleh-*-bloh
and I want to split it by -*- so I tried (amongst other things):
res.split("/-\\*-/g");
But it's not working. Does anyone have an idea?

In Java, the regex is passed as a plain string, so there is no need for the / delimiters before and the /g flag after; only the * has to be escaped:
String[] splittedArray = res.split("-\\*-");
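As a small sketch of the same idea: when the separator is meant literally, `Pattern.quote` can do the escaping instead of hand-written backslashes. The class and method names below are made up for illustration.

```java
import java.util.Arrays;
import java.util.regex.Pattern;

class SplitOnLiteral {
    // Splits on the literal separator "-*-"; Pattern.quote escapes the '*'
    // so it is not read as a regex quantifier.
    static String[] splitOnSeparator(String input) {
        return input.split(Pattern.quote("-*-"));
    }

    public static void main(String[] args) {
        String res = "blah-*-bleh-*-bloh";
        System.out.println(Arrays.toString(splitOnSeparator(res)));
        // Equivalent, with the metacharacter escaped by hand:
        System.out.println(Arrays.toString(res.split("-\\*-")));
    }
}
```

Both calls should yield the three parts blah, bleh and bloh.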

Related

Replace quote (‘NOA’) using groovy

Can anyone guide me on how to replace this char (‘ ’) using groovy or java?
When I try the code below (I assume this is a single quote), it doesn't work.
def a = "‘NOA’,’CTF’,’CLM’"
def rep = a.replaceAll("\'","")
My expected Output : NOA,CTF,CLM
Those are curly quotes in your source text. Your replaceAll is replacing straight quotes.
You should have copy-pasted the characters from your source.
System.out.println(
"‘NOA’,’CTF’,’CLM’"
.replaceAll( "‘" , "" )
.replaceAll( "’" , "" )
);
See this code run live at OneCompiler.
NOA,CTF,CLM
I would suggest this:
a.replaceAll("[‘’]", "")
or, even better, escape the Unicode characters in the source code:
a.replaceAll("[\u2018\u2019]", "")
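For reference, a minimal Java version of the same character-class approach. The class and method names are hypothetical; U+2018 and U+2019 are the left and right single curly quotes.

```java
class CurlyQuotes {
    // Removes left (U+2018) and right (U+2019) single curly quotes.
    static String stripCurlyQuotes(String s) {
        return s.replaceAll("[\u2018\u2019]", "");
    }

    public static void main(String[] args) {
        String a = "\u2018NOA\u2019,\u2019CTF\u2019,\u2019CLM\u2019";
        System.out.println(stripCurlyQuotes(a)); // NOA,CTF,CLM
    }
}
```

Writing the quotes as escapes keeps the source file safe from encoding mix-ups when it is copied between editors.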

stripAccents on Thai language

I am trying to strip accents from a Thai word using the stripAccents function in Scala, but it does not seem to strip them.
import org.apache.commons.lang3.StringUtils.stripAccents
println("stripped string " + stripAccents("CLEกอ่ตัRงขึนในปีR"))
stripped string CLEกอ่ตัRงขึนในปีR
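I am running this in IntelliJ on Windows. It strips accents correctly for many other languages, such as German and Dutch.
Has anyone faced a similar issue? How did you resolve it?
You can use Java's Normalizer and additionally remove every character in the Unicode mark category (M), because the Thai vowel and tone marks lie in the Thai block rather than in the Combining Diacritical Marks block: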
import java.text.Normalizer
val thaiString = "CLEกอ่ตัRงขึนในปีR"
val strippedString = Normalizer.normalize(thaiString, Normalizer.Form.NFD)
.replaceAll("[\\p{InCombiningDiacriticalMarks}\\p{IsM}]+", "")
println(strippedString)
//CLEกอตRงขนในปR
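The same approach in plain Java, as a sketch: decompose with NFD, then drop everything in the Unicode mark category. The class and method names are assumptions for illustration; `\p{M}` on its own is used here because it already covers the characters in the Combining Diacritical Marks block.

```java
import java.text.Normalizer;

class MarkStripper {
    // Decomposes the text (NFD) and removes all Unicode marks (category M),
    // which covers Latin accents as well as Thai vowel and tone marks.
    static String stripMarks(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFD)
                .replaceAll("\\p{M}+", "");
    }

    public static void main(String[] args) {
        System.out.println(stripMarks("caf\u00E9"));             // Latin e with acute accent
        System.out.println(stripMarks("\u0E01\u0E2D\u0E48"));    // two Thai letters plus a tone mark
    }
}
```

Note that removing Thai marks changes the spelling of the word, so this is only suitable for purposes such as fuzzy matching or search keys.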

Regular expression in Java and in Eclipse?

I want to remove all empty lines in Java. In Eclipse I would use:
\n( *)\n (or "\r\n( *)\r\n" in Windows)
. But in Java it doesn't work (I used:
str=str.replaceAll("\n( *)\n")
). How can I do this in Java using replaceAll? Sample:
package example
○○○○
public ... (where ○ is space)
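I would do it like this, matching each whitespace-only line together with its line break and replacing it with nothing:
// (?m) lets ^ match at the start of every line; \r?\n covers Unix and Windows line endings
java.util.regex.Pattern ws = java.util.regex.Pattern.compile("(?m)^[ \\t]*\\r?\\n");
java.util.regex.Matcher matcher = ws.matcher(str);
str = matcher.replaceAll("");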
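A self-contained sketch of blank-line removal with `String.replaceAll`, which needs both a pattern and a replacement argument. The class and method names are made up for illustration; the pattern assumes that "empty" means a line holding only spaces or tabs.

```java
class BlankLines {
    // Deletes every line that contains only spaces/tabs, including its line break.
    // (?m) makes ^ match at the start of each line; \r?\n accepts LF and CRLF.
    static String removeBlankLines(String s) {
        return s.replaceAll("(?m)^[ \\t]*\\r?\\n", "");
    }

    public static void main(String[] args) {
        String src = "package example\n    \npublic ...";
        System.out.println(removeBlankLines(src));
    }
}
```

A whitespace-only last line without a trailing line break is left in place by this pattern; trimming the end of the string separately would handle that case.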

Split on Regular Expression per Path

If I have this:
thisisgibberish 1234 /hello/world/
more gibberish 43/7 /good/timing/
just onemore 8888 /thanks/mate
what would the regular expression inside the Java String.split() method be to obtain the paths per line?
ie.
[0]: /hello/world/
[1]: /good/timing/
[2]: /thanks/mate
Doing
myString.split("\/[a-zA-Z]")
causes the splits to occur to every /h, /w, /g, /t, and /m.
How would I go about writing a regular expression to split it only once per line while only capturing the paths?
Thanks in advance.
Why split? I think running a match here is better. Try the following expression, which matches a slash followed by letters and slashes, but only when it is preceded by whitespace:
(?<=\s)/[a-zA-Z/]+
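A sketch of the matching approach in Java, using the pattern `(?<=\s)/[a-zA-Z/]+` so that the whole character class repeats and each path is captured in full. The class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class PathMatcher {
    // A path starts with '/' right after whitespace and continues
    // with letters and slashes.
    private static final Pattern PATH = Pattern.compile("(?<=\\s)/[a-zA-Z/]+");

    static List<String> extractPaths(String text) {
        List<String> paths = new ArrayList<>();
        Matcher m = PATH.matcher(text);
        while (m.find()) {
            paths.add(m.group());
        }
        return paths;
    }

    public static void main(String[] args) {
        String text = "thisisgibberish 1234 /hello/world/\n"
                + "more gibberish 43/7 /good/timing/\n"
                + "just onemore 8888 /thanks/mate";
        System.out.println(extractPaths(text));
        // [/hello/world/, /good/timing/, /thanks/mate]
    }
}
```

The lookbehind is what keeps the "/7" in "43/7" from being picked up, since it is preceded by a digit rather than whitespace.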
This uses split() :
String[] split = myString.split(myString.substring(0, myString.lastIndexOf(" ")));
OR
myString.split(myString.substring(0, myString.lastIndexOf(" ")))[1]; //works for current inputs
You must first remove the leading junk, then split on the intervening junk:
String[] paths = str.replaceAll("^.*? (?=/[a-zA-Z])", "")
.split("(?m)((?<=[a-zA-Z]/|[a-zA-Z])\\s|^).*? (?=/[a-zA-Z])");
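One important point here is the use of (?m), which switches on multiline mode so that ^ matches at the start of every line, not only at the start of the input. The line breaks themselves are consumed by \s; "dot matches newline" is a different flag, (?s), and is not needed here.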
Here's some test code:
String str = "thisisgibberish 1234 /hello/world/\nmore gibberish 43/7 /good/timing/\njust onemore 8888 /thanks/mate";
String[] paths = str.replaceAll("^.*? (?=/[a-zA-Z])", "")
.split("(?m)((?<=[a-zA-Z]/|[a-zA-Z])\\s|^).*? (?=/[a-zA-Z])");
System.out.println( Arrays.toString( paths));
Output (achieving requirements):
[/hello/world/, /good/timing/, /thanks/mate]
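Since the (?m) and (?s) flags are easy to mix up, here is a small sketch that shows what each one changes. The class and helper names are made up for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class FlagDemo {
    // Counts how many times the regex matches in the text.
    static int countMatches(String regex, String text) {
        Matcher m = Pattern.compile(regex).matcher(text);
        int count = 0;
        while (m.find()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        String text = "a\nb\nc";
        // Without (?m), ^ matches only at the start of the input.
        System.out.println(countMatches("^", text));      // 1
        // With (?m), ^ also matches after each line break.
        System.out.println(countMatches("(?m)^", text));  // 3
        // Without (?s), '.' does not match a line break.
        System.out.println("a\nb".matches("a.b"));        // false
        // With (?s), '.' matches any character, line breaks included.
        System.out.println("a\nb".matches("(?s)a.b"));    // true
    }
}
```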

A custom tokenizer for Java

I am developing an application in which I need to process text files containing emails. I need all the tokens from the text, and a token is defined as follows:
Alphanumeric
Case-sensitive (case to be preserved)
'!' and '$' are to be considered as constituent characters. Ex: FREE!!, $50 are tokens
'.' (dot) and ',' comma are to be considered as constituent characters if they occur between numbers. For ex:
192.168.1.1, $24,500
are tokens.
and so on..
Please suggest some open-source tokenizers for Java that are easy to customize to suit my needs. Will simply using StringTokenizer and regex be enough? I also have to perform stopping, which is why I was looking for an open-source tokenizer that also handles extras such as stopping and stemming.
A few comments up front:
From StringTokenizer javadoc:
StringTokenizer is a legacy class that is retained for compatibility
reasons although its use is discouraged in new code. It is recommended
that anyone seeking this functionality use the split method of String
or the java.util.regex package instead.
Always use Google first: the first result as of now is JTopas. I have not used it, but it looks like it could work for this.
As for regex, it really depends on your requirements. Given the above, this might work:
import java.util.regex.Matcher;
import java.util.regex.Pattern;
public class Mkt {
    public static void main(String[] args) {
        Pattern p = Pattern.compile("([$\\d.,]+)|([\\w\\d!$]+)");
        String str = "--- FREE!! $50 192.168.1.1 $24,500";
        System.out.println("input: " + str);
        Matcher m = p.matcher(str);
        while(m.find()) {
            System.out.println("token: " + m.group());
        }
    }
}
Here's a sample run:
$ javac Mkt.java && java Mkt
input: --- FREE!! $50 192.168.1.1 $24,500
token: FREE!!
token: $50
token: 192.168.1.1
token: $24,500
Now, you might need to tweak the regex, for example:
You gave $24,500 as an example. Should this work for $24,500abc or $24,500EUR?
You mentioned 192.168.1.1 should be included. Should it also include 192,168.1,1 (given . and , are to be included)?
and I guess there are other things to consider.
Hope this helps to get you started.
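One possible tightening, as a sketch only: the pattern below accepts '.' and ',' inside a token only when a digit sits on both sides, which is one reading of the "between numbers" rule. The class name, method name, and this interpretation of the rule are assumptions, not part of the original requirements.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class StrictTokenizer {
    // A run of ASCII letters, digits, '!' or '$', optionally continued by
    // '.' or ',' only where that separator has a digit directly on both sides.
    private static final Pattern TOKEN = Pattern.compile(
            "[\\p{Alnum}!$]+(?:(?<=\\d)[.,](?=\\d)[\\p{Alnum}!$]+)*");

    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(text);
        while (m.find()) {
            tokens.add(m.group());
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("--- FREE!! $50 192.168.1.1 $24,500, end."));
        // [FREE!!, $50, 192.168.1.1, $24,500, end]
    }
}
```

Under this pattern, $24,500EUR and 192,168.1,1 would each still come out as a single token, so the open questions listed above would need their own rules if those inputs should be treated differently.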
