StringUtils or any library class method to preserve the delimiter [duplicate] - java

This question already has answers here:
How to split a string, but also keep the delimiters?
(24 answers)
Closed 6 years ago.
I have a string "role1#role2#role3#role4$arole" whose parts are separated by the delimiters # and $. I used the Java code below:
String str = "role1#role2#role3#role4$arole";
String[] values = StringUtils.splitPreserveAllTokens(str, "\\#\\$");
for (String value : values) {
    System.out.println(value);
}
And got the result
role1
role2
role3
role4
arole
But I need to preserve the delimiters in the result, so the output should be:
role1
#role2
#role3
#role4
$arole
I looked through the Apache Commons StringUtils methods for a way to do that but was unable to find one.
Is there any library class that gives the intended result above?

You may use a simple split with a positive lookahead:
String str = "role1#role2#role3#role4$arole";
String[] res = str.split("(?=[#$])");
System.out.println(Arrays.toString(res));
// => [role1, #role2, #role3, #role4, $arole]
The (?=[#$]) regex matches any location in a string that is followed by a # or $ symbol (note that $ does not have to be escaped inside a [...] character class). Because the lookahead consumes no characters, the delimiter stays at the start of the next token.
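As a minimal sketch of how the lookaround direction decides which token keeps the delimiter: a lookahead splits before each delimiter, while a lookbehind splits after it. The class and method names below are illustrative only and are not part of any library.

```java
import java.util.Arrays;

class DelimiterSplit {
    // Split before each '#' or '$', so the delimiter starts the next token.
    static String[] splitBefore(String s) {
        return s.split("(?=[#$])");
    }

    // Split after each '#' or '$', so the delimiter ends the previous token.
    static String[] splitAfter(String s) {
        return s.split("(?<=[#$])");
    }

    public static void main(String[] args) {
        String str = "role1#role2#role3#role4$arole";
        System.out.println(Arrays.toString(splitBefore(str)));
        // => [role1, #role2, #role3, #role4, $arole]
        System.out.println(Arrays.toString(splitAfter(str)));
        // => [role1#, role2#, role3#, role4$, arole]
    }
}
```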


Implement re.search() functionality in Groovy or Java [duplicate]

This question already has answers here:
Create array of regex matches
(6 answers)
Closed 4 years ago.
I need a Groovy/Java function that searches a string for groups based on a regular expression.
For example:
function("([\w-]+)-([\w.-]+)\.([\w.-]+)", "commons-collections-3.2.2.jar")
should return the list ["commons-collections", "3.2.2", "jar"].
Python can do this with:
>>> import re
>>> result = re.search(r"([\w-]+)-([\w.-]+)\.([\w.-]+)", "commons-collections-3.2.2.jar")
>>> print(result.groups())
The output is ('commons-collections', '3.2.2', 'jar').
This is a simple task in Groovy:
"commons-collections-3.2.2.jar".findAll(/([\w-]+)-([\w.-]+)\.([\w.-]+)/) {
    println it
}
This produces the output:
[commons-collections-3.2.2.jar, commons-collections, 3.2.2, jar]
Update:
As @tim_yates mentioned in a comment:
println "commons-collections-3.2.2.jar".findAll(/([\w-]+)-([\w.-]+)\.([\w.-]+)/) { it.tail() }
This drops the full match and keeps only the captured groups, which is closer to what the task asks for.
Output:
[[commons-collections, 3.2.2, jar]]
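Since the question also allows plain Java, here is a minimal sketch of a rough equivalent of Python's re.search(...).groups() using java.util.regex. The class name RegexGroups and the method name searchGroups are hypothetical. Returning an empty list when nothing matches is a design choice of this sketch (Python returns None from re.search instead).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RegexGroups {
    // Returns the capture groups of the first match, or an empty list if nothing matches.
    static List<String> searchGroups(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        List<String> groups = new ArrayList<>();
        if (m.find()) {
            // Group 0 is the whole match, so start at 1 to collect only the captures.
            for (int i = 1; i <= m.groupCount(); i++) {
                groups.add(m.group(i));
            }
        }
        return groups;
    }

    public static void main(String[] args) {
        System.out.println(searchGroups(
                "([\\w-]+)-([\\w.-]+)\\.([\\w.-]+)",
                "commons-collections-3.2.2.jar"));
        // => [commons-collections, 3.2.2, jar]
    }
}
```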

substring using \(backslash) in java

I want to get file name from complete path of file.
Input : "D://amol//1/\15_amol.jpeg"
Expected Output : 15_amol.jpeg
I have written the code below for this:
public class JavaApplication9 {
    public static void main(String[] args) {
        String fname="D://amol//1/\15_amol.jpeg";
        System.out.println(fname.substring(fname.lastIndexOf("/")));
        System.out.println(fname.substring(fname.lastIndexOf("\\")));
    }
}
but I get the output below:
_amol.jpeg
Exception in thread "main" java.lang.StringIndexOutOfBoundsException:
String index out of range: -1
at java.lang.String.substring(String.java:1927)
at javaapplication9.JavaApplication9.main(JavaApplication9.java:6)
C:\Users\lakhan.kamble\AppData\Local\NetBeans\Cache\8.1\executor-snippets\run.xml:53:
Java returned: 1
The sequence \15 in a Java string literal is an "octal escape" for the carriage return character (0x0d, 13 decimal). There are two possibilities here.
1. You really meant \15 to be the octal escape, in which case you are trying to create a filename with an embedded carriage return. The actual contents of fname in this case could be expressed as
"D://amol//1/" + "\r" + "_amol.jpeg";
Windows will not allow that, and your program will throw an IOException when it tries to create the file.
2. You really meant
String fname="D://amol//1/\\15_amol.jpeg";
In this case the extra backslash is redundant and will be ignored by Windows, because the filename resolves (in Windows path terms) to D:\amol\1\\15_amol.jpeg and adjacent directory separators collapse to a single separator. So you could omit the extra backslash altogether without changing the effective path.
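As to your exception: the string as shown DOES NOT contain a backslash character (case 1 above), so
fname.lastIndexOf("\\")
returned -1, and substring(-1) threw the exception. The first println most likely shows only _amol.jpeg because the substring it prints is "/" followed by the carriage return, which moves the console cursor back to the start of the line before _amol.jpeg is written.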
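A minimal sketch of a safer way to take the file name, assuming the backslash was meant literally (written as \\ in the source) and that either / or \ may act as a separator. The class name FileNameDemo and the method fileName are illustrative only. Adding 1 to the index skips the separator itself and also handles the "not found" case, since -1 + 1 is 0.

```java
class FileNameDemo {
    // Returns the text after the last '/' or '\'; the whole string if neither occurs.
    static String fileName(String path) {
        int cut = Math.max(path.lastIndexOf('/'), path.lastIndexOf('\\'));
        return path.substring(cut + 1);
    }

    public static void main(String[] args) {
        // The backslash is escaped here, so the string really contains "\15".
        System.out.println(fileName("D://amol//1/\\15_amol.jpeg")); // 15_amol.jpeg
        System.out.println(fileName("plain.txt"));                  // plain.txt
    }
}
```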

Regex matching in python 2.7 [duplicate]

This question already has answers here:
How do you validate a URL with a regular expression in Python?
(12 answers)
Closed 6 years ago.
I am new to Python and would like to know how to build a regex pattern to match a URL.
I have the following code in Java and it works. I need a similar one in Python.
Java:
URI uri = new URI("http://localhost:8080");
Matcher m = Pattern.compile("(.*)" + "/client" + "/([0-9]+)")
        .matcher(uri.getPath());
Could someone guide me to an equivalent regex in Python?
Why not use urlparse? Batteries included :-).
>>> import urlparse
>>> urlparse.urlparse("http://localhost:8080")
ParseResult(scheme='http', netloc='localhost:8080', path='', params='', query='', fragment='')
Here's the equivalent in Python 2.7:
import re
from urlparse import urlparse
url = urlparse('http://localhost:8080')
match = re.match(r'(.*)/client/([0-9]+)', url.path)
EDIT
Here's how you would use match to get the individual components (just guessing as to what you want to do next):
if match:
    prefix = match.group(1)
    client_id = int(match.group(2))

Split on Regular Expression per Path

If I have this:
thisisgibberish 1234 /hello/world/
more gibberish 43/7 /good/timing/
just onemore 8888 /thanks/mate
what would the regular expression inside the Java String.split() method be to obtain the paths per line?
ie.
[0]: /hello/world/
[1]: /good/timing/
[2]: /thanks/mate
Doing
myString.split("\/[a-zA-Z]")
causes the splits to occur to every /h, /w, /g, /t, and /m.
How would I go about writing a regular expression to split it only once per line while only capturing the paths?
Thanks in advance.
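Why split? I think running a match is better here. Try the following expression (the + belongs inside the group, so that each path is matched in full rather than only its first two characters):
(?<=\s)(/[a-zA-Z/]+)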
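A minimal Java sketch of the matching approach, assuming the quantifier is meant to apply to the character class (that is, /[a-zA-Z/]+ preceded by whitespace). The class name PathMatcher and the method extractPaths are hypothetical. The lookbehind is what keeps the "/7" in "43/7" from being picked up, because that slash follows a digit rather than whitespace.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class PathMatcher {
    // A slash-led run of letters and slashes that is preceded by whitespace.
    private static final Pattern PATH = Pattern.compile("(?<=\\s)/[a-zA-Z/]+");

    // Collects every path-like match in the text, in order of appearance.
    static List<String> extractPaths(String text) {
        List<String> paths = new ArrayList<>();
        Matcher m = PATH.matcher(text);
        while (m.find()) {
            paths.add(m.group());
        }
        return paths;
    }

    public static void main(String[] args) {
        String text = "thisisgibberish 1234 /hello/world/\n"
                + "more gibberish 43/7 /good/timing/\n"
                + "just onemore 8888 /thanks/mate";
        System.out.println(extractPaths(text));
        // => [/hello/world/, /good/timing/, /thanks/mate]
    }
}
```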
This uses split():
String[] split = myString.split(myString.substring(0, myString.lastIndexOf(" ")));
or
myString.split(myString.substring(0, myString.lastIndexOf(" ")))[1]; // works for the current inputs
You must first remove the leading junk, then split on the intervening junk:
String[] paths = str.replaceAll("^.*? (?=/[a-zA-Z])", "")
.split("(?m)((?<=[a-zA-Z]/|[a-zA-Z])\\s|^).*? (?=/[a-zA-Z])");
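One important point here is the use of (?m), which turns on multiline mode, so that ^ matches at the start of every line rather than only at the start of the input. It does not make the dot match newlines (that is the separate (?s) flag); in this pattern the newline between lines is consumed by \\s.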
Here's some test code:
String str = "thisisgibberish 1234 /hello/world/\nmore gibberish 43/7 /good/timing/\njust onemore 8888 /thanks/mate";
String[] paths = str.replaceAll("^.*? (?=/[a-zA-Z])", "")
.split("(?m)((?<=[a-zA-Z]/|[a-zA-Z])\\s|^).*? (?=/[a-zA-Z])");
System.out.println( Arrays.toString( paths));
Output (achieving requirements):
[/hello/world/, /good/timing/, /thanks/mate]
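To make the difference between the two inline flags concrete, here is a small sketch that counts matches with and without them. The class name RegexFlagsDemo and the helper countMatches are illustrative only. (?m) is Pattern.MULTILINE and changes where ^ and $ match; (?s) is Pattern.DOTALL and lets the dot match line terminators.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RegexFlagsDemo {
    // Counts how many times the pattern is found in the input.
    static int countMatches(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        int n = 0;
        while (m.find()) {
            n++;
        }
        return n;
    }

    public static void main(String[] args) {
        String text = "one\ntwo\nthree";
        System.out.println(countMatches("^\\w+", text));     // 1: ^ matches only at the start of input
        System.out.println(countMatches("(?m)^\\w+", text)); // 3: ^ matches at the start of each line
        System.out.println(countMatches("e.t", text));       // 0: the dot does not match '\n'
        System.out.println(countMatches("(?s)e.t", text));   // 1: the dot matches '\n' in "e\nt"
    }
}
```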

A custom tokenizer for Java

I am developing an application in which I need to process text files containing emails. I need all the tokens from the text and the following is the definition of token:
Alphanumeric
Case-sensitive (case to be preserved)
'!' and '$' are to be considered as constituent characters. Ex: FREE!!, $50 are tokens
'.' (dot) and ',' comma are to be considered as constituent characters if they occur between numbers. For ex:
192.168.1.1, $24,500
are tokens.
and so on..
Please suggest some open-source tokenizers for Java that are easy to customize to suit my needs. Will simply using StringTokenizer and regex be enough? I also have to perform stop-word removal, which is why I am looking for an open-source tokenizer that also handles extras like stopping and stemming.
A few comments up front:
From StringTokenizer javadoc:
StringTokenizer is a legacy class that is retained for compatibility
reasons although its use is discouraged in new code. It is recommended
that anyone seeking this functionality use the split method of String
or the java.util.regex package instead.
Always use Google first: the first result as of now is JTopas. I have not used it, but it looks like it could work for this.
As for regex, it really depends on your requirements. Given the above, this might work:
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class Mkt {
    public static void main(String[] args) {
        Pattern p = Pattern.compile("([$\\d.,]+)|([\\w\\d!$]+)");
        String str = "--- FREE!! $50 192.168.1.1 $24,500";
        System.out.println("input: " + str);
        Matcher m = p.matcher(str);
        while (m.find()) {
            System.out.println("token: " + m.group());
        }
    }
}
Here's a sample run:
$ javac Mkt.java && java Mkt
input: --- FREE!! $50 192.168.1.1 $24,500
token: FREE!!
token: $50
token: 192.168.1.1
token: $24,500
Now, you might need to tweak the regex, for example:
You gave $24,500 as an example. Should this work for $24,500abc or $24,500EUR?
You mentioned 192.168.1.1 should be included. Should it also include 192,168.1,1 (given . and , are to be included)?
and I guess there are other things to consider.
Hope this helps to get you started.
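One possible tweak, sketched under the assumption that '.' and ',' should join a token only when a digit sits directly on both sides, as the question's definition suggests. The class name EmailTokenizer, the method tokenize, and the regex itself are illustrative and not taken from any library. Note that this variant keeps $24,500EUR as a single token and splits "end." into just "end", which may or may not be what you want.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class EmailTokenizer {
    // A run of ASCII alphanumerics, '!' or '$', optionally continued by '.' or ','
    // when that separator has a digit immediately before and after it.
    private static final Pattern TOKEN = Pattern.compile(
            "[\\p{Alnum}!$]+(?:(?<=\\d)[.,](?=\\d)[\\p{Alnum}!$]+)*");

    // Returns all tokens in order, preserving case.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(text);
        while (m.find()) {
            tokens.add(m.group());
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("--- FREE!! $50 192.168.1.1 $24,500 end."));
        // => [FREE!!, $50, 192.168.1.1, $24,500, end]
    }
}
```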
