Regular Expression WildCard matching split with java split method


I know similar questions have been asked before, but I want to do a custom operation and I don't know how to go about it.
I want to split a string with a regular expression where I know only the starting and ending characters, like:
String myString="Google is a great search engine<as:...s>";
Here <as: is the opening marker and s> is the closing marker.
The ... is dynamic; I can't predict its value.
I want to split the string on everything from the opening <as: to the closing s>, including the dynamic text in between.
Something like:
myString.split("<as:/*s>");
I also want to get all occurrences of <as:...s> in the string.
I know this can be done with regex, but I've never done it before. I need a simple and neat way to do this.
Thanks in advance.

Rather than using a .split(), I would just extract using Pattern and Matcher. This approach finds everything between <as: and s> and extracts it to a capture group. Group 1 then has the text you would like.
public static void main(String[] args)
{
final String myString="Google is a great search engine<as:Some stuff heres>";
Pattern pat = Pattern.compile("^[^<]+<as:(.*)s>$");
Matcher m = pat.matcher(myString);
if (m.matches()) {
System.out.println(m.group(1));
}
}
Output:
Some stuff here
If you need the text at the beginning, you can put it in a capture group as well.
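To illustrate that last point, here is a minimal sketch that captures the leading text as well. The class name PrefixAndContents and the parse helper are made up for this example; the pattern is the one above with the prefix wrapped in its own capture group.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class PrefixAndContents {
    // Group 1: text before the marker; group 2: text between "<as:" and "s>".
    private static final Pattern PAT = Pattern.compile("^([^<]+)<as:(.*)s>$");

    // Returns {prefix, contents}, or null when the input does not have that shape.
    static String[] parse(String input) {
        Matcher m = PAT.matcher(input);
        if (!m.matches()) {
            return null;
        }
        return new String[] { m.group(1), m.group(2) };
    }

    public static void main(String[] args) {
        String[] parts = parse("Google is a great search engine<as:Some stuff heres>");
        System.out.println(parts[0]);
        System.out.println(parts[1]);
    }
}
```

Like the original pattern, this only handles a single marker at the very end of the input.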
Edit: If there is more than one <as:...s> in the input, the following will gather all of them.
Edit 2: extended the logic and added checks for empty strings.
public static List<String> multiEntry(final String myString)
{
String[] parts = myString.split("<as:");
List<String> col = new ArrayList<>();
if (! parts[0].trim().isEmpty()) {
col.add(parts[0]);
}
Pattern pat = Pattern.compile("^(.*?)s>(.*)?");
for (int i = 1; i < parts.length; ++i) {
Matcher m = pat.matcher(parts[i]);
if (m.matches()) {
for (int j = 1; j <= m.groupCount(); ++j) {
String s = m.group(j).trim();
if (! s.isEmpty()) {
col.add(s);
}
}
}
}
return col;
}
Output:
[Google is a great search engine, Some stuff heress, Here is Facebook, More Stuff, Something else at the end]
Edit 3: This approach uses find and looping to do the parsing. It uses optional capture groups as well.
public static void looping()
{
final String myString="Google is a great search engine"
+ "<as:Some stuff heresss>Here is Facebook<as:More Stuffs>"
+ "Something else at the end" +
"<as:Stuffs>" +
"<as:Yet More Stuffs>";
Pattern pat = Pattern.compile("([^<]+)?(<as:(.*?)s>)?");
Matcher m = pat.matcher(myString);
List<String> col = new ArrayList<>();
while (m.find()) {
String prefix = m.group(1);
String contents = m.group(3);
if (prefix != null) { col.add(prefix); }
if (contents != null) { col.add(contents); }
}
System.out.println(col);
}
Output:
[Google is a great search engine, Some stuff heress, Here is Facebook, More Stuff, Something else at the end, Stuff, Yet More Stuff]
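For comparison, if the surrounding text and the marker contents are needed as two separate lists, a lazy wildcard may be enough on its own. This is only a sketch, not part of the answer above: the class AsMarkers and its two helpers are names invented for the example. It assumes the contents never contain the closing s> sequence themselves, because the lazy .*? stops at the first s> it finds.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class AsMarkers {
    // Lazy match from "<as:" up to the first following "s>".
    private static final Pattern MARKER = Pattern.compile("<as:(.*?)s>");

    // Text outside the markers: split() treats each whole marker as a delimiter.
    static String[] outside(String s) {
        return s.split("<as:.*?s>");
    }

    // Text inside the markers: collect capture group 1 of every match.
    static List<String> inside(String s) {
        List<String> found = new ArrayList<>();
        Matcher m = MARKER.matcher(s);
        while (m.find()) {
            found.add(m.group(1));
        }
        return found;
    }

    public static void main(String[] args) {
        String s = "Google is great<as:Some stuff heres>Here is Facebook<as:More Stuffs>";
        System.out.println(String.join(" | ", outside(s)));
        System.out.println(inside(s));
    }
}
```

Note that split() drops trailing empty strings but keeps a leading one, so an input that starts with a marker yields an empty first element.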
Additional edit: I wrote some quick test cases (with a hastily hacked helper class) to help validate. The updated multiEntry passes all of them:
public static void main(String[] args)
{
Input[] inputs = {
new Input("Google is a great search engine<as:Some stuff heres>", 2),
new Input("Google is a great search engine"
+ "<as:Some stuff heresss>Here is Facebook<as:More Stuffs>"
+ "Something else at the end" +
"<as:Stuffs>" +
"<as:Yet More Stuffs>" +
"ending", 8),
new Input("Google is a great search engine"
+ "<as:Some stuff heresss>Here is Facebook<as:More Stuffs>"
+ "Something else at the end" +
"<as:Stuffs>" +
"<as:Yet More Stuffs>", 7),
new Input("No as here", 1),
new Input("Here is angle < input", 1),
new Input("Angle < plus <as:Stuff in as:s><as:Other stuff in as:s>", 3),
new Input("Angle < plus <as:Stuff in as:s><as:Other stuff in as:s>blah", 4),
new Input("<as:To start with anglass>Some ending", 2),
};
List<String> res;
for (Input inp : inputs) {
res = multiEntry(inp.inp);
if (res.size() != inp.cnt) {
System.err.println("FAIL: " + res.size()
+ " did not match exp of " + inp.cnt
+ " on " + inp.inp);
System.err.println(res);
continue;
}
System.out.println(res);
}
}
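The helper class itself is not shown. Judging only from how it is used above (a two-argument constructor and the fields inp and cnt), a minimal reconstruction might look like the following; treat it as an assumption rather than the author's actual class.

```java
// Hypothetical reconstruction of the helper used by the test harness above.
class Input {
    final String inp; // the string passed to multiEntry
    final int cnt;    // the number of entries multiEntry is expected to return

    Input(String inp, int cnt) {
        this.inp = inp;
        this.cnt = cnt;
    }

    public static void main(String[] args) {
        Input sample = new Input("No as here", 1);
        System.out.println(sample.inp + " -> expected " + sample.cnt);
    }
}
```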

Related

Splitting a string based on " " and spaces [duplicate]

I have a String str, which is comprised of several words separated by single spaces.
If I want to create a set or list of strings, I can simply call str.split(" ") and I get what I want.
Now, assume that str is a little more complicated, for example it is something like:
str = "hello bonjour \"good morning\" buongiorno";
In this case I want to keep whatever is between the double quotes together, so that my list of strings is:
hello
bonjour
good morning
buongiorno
Clearly, if I used split(" ") in this case it won't work because I'd get
hello
bonjour
"good
morning"
buongiorno
So, how do I get what I want?

You can create a regex that finds either a single word or a group of words between double quotes, like:
\w+|(\"\w+(\s\w+)*\")
and search for them with the Pattern and Matcher classes.
For example:
String searchedStr = "";
Pattern pattern = Pattern.compile("\\w+|(\\\"\\w+(\\s\\w+)*\\\")");
Matcher matcher = pattern.matcher(searchedStr);
while(matcher.find()){
String word = matcher.group();
}
Edit: this now works for any number of words within the quotes. I had forgotten that.
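The snippet above leaves searchedStr empty and keeps the quotes in each match. A self-contained variant that returns the tokens with the quotes stripped could look like this. It is a sketch that swaps in a slightly different pattern (a quoted run, or any run of non-space characters) instead of the \w-based one, and the names QuotedTokens and tokenize are invented for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class QuotedTokens {
    // Group 1: contents of a double-quoted run (quotes excluded).
    // Group 2: a run of non-whitespace characters.
    private static final Pattern TOKEN = Pattern.compile("\"([^\"]*)\"|(\\S+)");

    static List<String> tokenize(String input) {
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(input);
        while (m.find()) {
            // Exactly one of the two groups takes part in each match.
            tokens.add(m.group(1) != null ? m.group(1) : m.group(2));
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("hello bonjour \"good morning\" buongiorno"));
    }
}
```

Escaped quotes inside a quoted run are not handled, and an unbalanced quote is simply kept as part of an ordinary token.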

You can do something like the code below. First split the String on "\"" and then split the remaining parts on a space " ". The even-numbered tokens (counting from 1) will be the ones between quotes.
public static void main(String args[]) {
String str = "hello bonjour \"good morning\" buongiorno";
System.out.println(str);
String[] parts = str.split("\"");
List<String> myList = new ArrayList<String>();
int i = 1;
for(String partStr : parts) {
if(i%2 == 0){
myList.add(partStr);
}
else {
myList.addAll(Arrays.asList(partStr.trim().split(" ")));
}
i++;
}
System.out.println("MyList : " + myList);
}
and the output is
hello bonjour "good morning" buongiorno
MyList : [hello, bonjour, good morning, buongiorno]

You may be able to find a solution using regular expressions, but what I'd do is simply manually write a string breaker.
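List<String> splitButKeepQuotes(String s, char splitter) {
    ArrayList<String> list = new ArrayList<String>();
    boolean inQuotes = false;
    int startOfWord = 0;
    for (int i = 0; i < s.length(); i++) {
        char c = s.charAt(i);
        if (c == '"') { // compare against a char literal, not a String
            inQuotes = !inQuotes;
        } else if (c == splitter && !inQuotes) {
            if (i != startOfWord) {
                list.add(s.substring(startOfWord, i));
            }
            startOfWord = i + 1; // always step past the splitter, even for empty tokens
        }
    }
    if (startOfWord < s.length()) {
        list.add(s.substring(startOfWord)); // the last word has no trailing splitter
    }
    return list;
}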

Trim() in Java not working the way I expect? [duplicate]

Possible Duplicate:
Query about the trim() method in Java
I am parsing usernames and other information from a site, and each value has a run of spaces after it (as well as ordinary spaces between the words).
For example: "Bob the Builder " or "Sam the welder ". The number of spaces varies from name to name. I figured I'd just use .trim(), since I've used this before.
However, it's giving me trouble. My code looks like this:
for (int i = 0; i < splitSource3.size(); i++) {
splitSource3.set(i, splitSource3.get(i).trim());
}
The result is just the same; no spaces are removed at the end.
Thank you in advance for your excellent answers!
UPDATE:
The full code is a bit more complicated, since there are HTML tags that are parsed out first. It goes exactly like this:
for (String s : splitSource2) {
if (s.length() > "<td class=\"dddefault\">".length() && s.substring(0, "<td class=\"dddefault\">".length()).equals("<td class=\"dddefault\">")) {
splitSource3.add(s.substring("<td class=\"dddefault\">".length()));
}
}
System.out.println("\n");
for (int i = 0; i < splitSource3.size(); i++) {
splitSource3.set(i, splitSource3.get(i).substring(0, splitSource3.get(i).length() - 5));
splitSource3.set(i, splitSource3.get(i).trim());
System.out.println(i + ": " + splitSource3.get(i));
}
}
UPDATE:
Calm down. I never said the fault lay with Java, and I never said it was a bug or broken or anything. I simply said I was having trouble with it and posted my code for you to collaborate on and help solve my issue. Note the phrase "my issue" and not "java's issue". I have actually had the code printing out
System.out.println(i + ": " + splitSource3.get(i) + "*");
in a for each loop afterward.
This is how I knew I had a problem.
By the way, the problem has still not been fixed.
UPDATE:
Sample output (minus single quotes):
'0: Olin D. Kirkland                                          '
'1: Sophomore                                          '
'2: Someplace, Virginia  12345<br />VA SomeCity<br />'
'3: Undergraduate                                          '
EDIT: The OP rephrased this question at "Query about the trim() method in Java", where the issue was found to be Unicode whitespace characters, which are not matched by String.trim().
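To make that concrete: String.trim() only removes characters with code points up to U+0020, so a non-breaking space (U+00A0), which is common in scraped HTML, survives it. The sketch below assumes U+00A0 is the culprit, which this page does not confirm; UnicodeTrim and trimUnicode are names invented for the example.

```java
class UnicodeTrim {
    // Treat a character as blank if either of Java's two definitions says so:
    // isWhitespace covers tabs, newlines and ordinary spaces,
    // isSpaceChar covers Unicode space separators such as U+00A0.
    private static boolean isBlank(char c) {
        return Character.isWhitespace(c) || Character.isSpaceChar(c);
    }

    // Strips blank characters from both ends of the string.
    static String trimUnicode(String s) {
        int start = 0;
        int end = s.length();
        while (start < end && isBlank(s.charAt(start))) {
            start++;
        }
        while (end > start && isBlank(s.charAt(end - 1))) {
            end--;
        }
        return s.substring(start, end);
    }

    public static void main(String[] args) {
        String raw = "Olin D. Kirkland\u00A0\u00A0";
        System.out.println("[" + raw.trim() + "]");       // non-breaking spaces remain
        System.out.println("[" + trimUnicode(raw) + "]"); // non-breaking spaces removed
    }
}
```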

It just occurred to me that I used to have this sort of issue when I worked on a screen-scraping project. The key is that sometimes the downloaded HTML sources contain non-printable characters which are not whitespace characters either. These are very difficult to copy-paste to a browser. I assume that this could have happened to you.
If my assumption is correct then you've got two choices:
Use a binary reader and figure out what those characters are - and delete them with String.replace(); E.g.:
private static String cutCharacters(String fromHtml) {
    String result = fromHtml;
    char[] problematicCharacters = {'\000', '\001', '\003'}; // this could be a private static final constant too
    for (char ch : problematicCharacters) {
        // String.replace has no (char, String) overload, so convert the char to a String first
        result = result.replace(String.valueOf(ch), "");
    }
    return result;
}
If you find some sort of reoccurring pattern in the HTML to be parsed then you can use regexes and substrings to cut the unwanted parts. E.g.:
private String getImportantParts(String fromHtml) {
Pattern p = Pattern.compile("(\\w*\\s*)"); //this could be a private static final constant as well.
Matcher m = p.matcher(fromHtml);
StringBuilder buff = new StringBuilder();
while (m.find()) {
buff.append(m.group(1));
}
return buff.toString().trim();
}

Works without a problem for me.
Here is your code, refactored a bit and (maybe) more readable:
final String openingTag = "<td class=\"dddefault\">";
final String closingTag = "</td>";
List<String> splitSource2 = new ArrayList<String>();
splitSource2.add(openingTag + "Bob the Builder " + closingTag);
splitSource2.add(openingTag + "Sam the welder " + closingTag);
for (String string : splitSource2) {
System.out.println("|" + string + "|");
}
List<String> splitSource3 = new ArrayList<String>();
for (String s : splitSource2) {
if (s.length() > openingTag.length() && s.startsWith(openingTag)) {
String nameWithoutOpeningTag = s.substring(openingTag.length());
splitSource3.add(nameWithoutOpeningTag);
}
}
System.out.println("\n");
for (int i = 0; i < splitSource3.size(); i++) {
String name = splitSource3.get(i);
int closingTagBegin = splitSource3.get(i).length() - closingTag.length();
String nameWithoutClosingTag = name.substring(0, closingTagBegin);
String nameTrimmed = nameWithoutClosingTag.trim();
splitSource3.set(i, nameTrimmed);
System.out.println("|" + splitSource3.get(i) + "|");
}
I know that's not a real answer, but I cannot post comments, and this code wouldn't fit in a comment anyway, so I made it an answer so that Olin Kirkland can check his code.

Using regular expression to find a set number of + JAVA

I have a program where I want to filter Strings with a set number of "+"'s at the beginning.
For example:
+++Adam is working very well.
++Adam is working well.
+Adam is doing OK.
How do I only pick up each particular case (i.e. only one plus sign, only two plus signs, only three plus signs)? I usually get a return of anything beginning with a +.
I have the following regex patterns compiled, but I either get only one return (usually the two ++) or all of them:
public static String regexpluschar = "^\\Q+\\E{1}[\\w <]";
public static String regexpluspluschar = "^\\Q+\\E{2}[\\w <]";
public static String regexpluspluspluschar = "^\\Q+\\E{3}[\\w <]";
Pattern plusplusplus = Pattern.compile(regexpluspluspluschar);
Pattern plusplus = Pattern.compile(regexpluspluschar);
Pattern plus = Pattern.compile(regexpluschar);
I then try to find using a Matcher class - I've used .find() and .matches() but don't get the result I'm after (java+regex newbie alert here).
Matcher matcherplusplusplus = plusplusplus.matcher(check);
Matcher matcherplusplus = plusplus.matcher(check);
Matcher matcherplus = plus.matcher(check);
//OK we have 3+'s
if ((matcherplusplusplus.find())==true){
System.out.println("Filtering 3 +s.");
System.out.println("filter is " + filter + " in the 3 + filter.");
String toChange = getItem(i);
setItemFiltered(i, toChange);
}
//OK - we have 2 +'s
if ((matcherplusplus.find())==true){
System.out.println("Filtering 2 +s.");
System.out.println("filter is " + filter + " in the 2 + filter.");
String toChange = getItem(i);
setItemFiltered(i, toChange);
}
//OK - we have 1 +'s
if ((matcherplus.find())==true){
System.out.println("Filtering 1 +.");
System.out.println("filter is " + filter + " in the 1 + filter.");
String toChange = getItem(i);
setItemFiltered(i, toChange);
}
For the very curious, the above if's are embedded in a for loop that cycles around some JTextFields. Full code at: http://pastebin.ca/2199327

Why not something simpler:
public static String regexpluschar = "^\\+[\\w <]";
public static String regexpluspluschar = "^\\+{2}[\\w <]";
public static String regexpluspluspluschar = "^\\+{3}[\\w <]";
or even
public static String regexpluschar = "^\\+[^\\+]";
public static String regexpluspluschar = "^\\+{2}[^\\+]";
public static String regexpluspluspluschar = "^\\+{3}[^\\+]";
Edit: It's working in my test program, but I had to remove your specific code:
String toChange = getItem(i);
setItemFiltered(i, toChange);
Proof: my output is:
Filtering 3 +s.
+++Adam is working very well. is in the 3 + filter.
Filtering 2 +s.
++Adam is working well. is in the 2 + filter.
Filtering 1 +.
+Adam is doing OK. is in the 1 + filter.
Your filter is working, but your specific code may not be... (maybe have a look at setItemFiltered?)

I was thinking something like this would be easier:
public static void main(String[] args) {
Pattern pattern = Pattern.compile("^(\\+{1,3}).*");
Matcher matcher = pattern.matcher(<your text>);
if (matcher.matches()) {
String pluses = matcher.group(1);
switch (pluses.length()) {
}
}
}
And if you want to be sure that ++++This is insane does not match then change the pattern to
Pattern pattern = Pattern.compile("^(\\+{1,3})[^+].*");
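The switch in the snippet above is left empty. One way to use the captured group is to return its length and let the caller branch on it. This is a sketch under that assumption; PlusCounter and countLeadingPluses are names invented for the example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class PlusCounter {
    // One to three leading pluses, followed by a character that is not a plus.
    private static final Pattern PLUSES = Pattern.compile("^(\\+{1,3})[^+].*");

    // Returns 1, 2 or 3 for a matching line, and 0 for anything else.
    static int countLeadingPluses(String line) {
        Matcher m = PLUSES.matcher(line);
        return m.matches() ? m.group(1).length() : 0;
    }

    public static void main(String[] args) {
        String[] lines = {
            "+++Adam is working very well.",
            "++Adam is working well.",
            "+Adam is doing OK.",
            "++++This is insane"
        };
        for (String line : lines) {
            switch (countLeadingPluses(line)) {
                case 3: System.out.println("Filtering 3 +s."); break;
                case 2: System.out.println("Filtering 2 +s."); break;
                case 1: System.out.println("Filtering 1 +."); break;
                default: System.out.println("No filter for: " + line);
            }
        }
    }
}
```

Because a single match decides the count, the three cases cannot overlap the way three independent if statements can.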

Split a quoted string with a delimiter

I want to split a string using whitespace as the delimiter, but it should handle quoted strings intelligently. E.g. for a string like
"John Smith" Ted Barry
It should return three strings: John Smith, Ted and Barry.

After messing around with it, you can use Regex for this. Run the equivalent of "match all" on:
((?<=("))[\w ]*(?=("(\s|$))))|((?<!")\w+(?!"))
A Java Example:
import java.util.regex.Pattern;
import java.util.regex.Matcher;
public class Test
{
public static void main(String[] args)
{
String someString = "\"Multiple quote test\" not in quotes \"inside quote\" \"A work in progress\"";
Pattern p = Pattern.compile("((?<=(\"))[\\w ]*(?=(\"(\\s|$))))|((?<!\")\\w+(?!\"))");
Matcher m = p.matcher(someString);
while(m.find()) {
System.out.println("'" + m.group() + "'");
}
}
}
Output:
'Multiple quote test'
'not'
'in'
'quotes'
'inside quote'
'A work in progress'
The regular expression breakdown with the example used above can be viewed here:
http://regex101.com/r/wM6yT9
With all that said, regular expressions should not be the go-to solution for everything; I was just having fun. This example has a lot of edge cases, such as handling Unicode characters, symbols, etc. You would be better off using a tried-and-true library for this sort of task. Take a look at the other answers before using this one.

Try this ugly bit of code.
String str = "hello my dear \"John Smith\" where is Ted Barry";
List<String> list = Arrays.asList(str.split("\\s"));
List<String> resultList = new ArrayList<String>();
StringBuilder builder = new StringBuilder();
for(String s : list){
if(s.startsWith("\"")) {
builder.append(s.substring(1)).append(" ");
} else {
resultList.add((s.endsWith("\"")
? builder.append(s.substring(0, s.length() - 1))
: builder.append(s)).toString());
builder.delete(0, builder.length());
}
}
System.out.println(resultList);

Well, I made a small snippet that does what you want and a few more things. Since you did not specify more conditions, I did not go to the trouble of handling them. I know this is a dirty way, and you can probably get better results with something that is already made, but for the fun of programming, here is the example:
String example = "hello\"John Smith\" Ted Barry lol\"Basi German\"hello";
int wordQuoteStartIndex=0;
int wordQuoteEndIndex=0;
int wordSpaceStartIndex = 0;
int wordSpaceEndIndex = 0;
boolean foundQuote = false;
for(int index=0;index<example.length();index++) {
if(example.charAt(index)=='\"') {
if(foundQuote==true) {
wordQuoteEndIndex=index+1;
//Print the quoted word
System.out.println(example.substring(wordQuoteStartIndex, wordQuoteEndIndex));//here you can remove quotes by changing to (wordQuoteStartIndex+1, wordQuoteEndIndex-1)
foundQuote=false;
if(index+1<example.length()) {
wordSpaceStartIndex = index+1;
}
}else {
wordSpaceEndIndex=index;
if(wordSpaceStartIndex!=wordSpaceEndIndex) {
//print the word in spaces
System.out.println(example.substring(wordSpaceStartIndex, wordSpaceEndIndex));
}
wordQuoteStartIndex=index;
foundQuote = true;
}
}
if(foundQuote==false) {
if(example.charAt(index)==' ') {
wordSpaceEndIndex = index;
if(wordSpaceStartIndex!=wordSpaceEndIndex) {
//print the word in spaces
System.out.println(example.substring(wordSpaceStartIndex, wordSpaceEndIndex));
}
wordSpaceStartIndex = index+1;
}
if(index==example.length()-1) {
if(example.charAt(index)!='\"') {
//print the word in spaces
System.out.println(example.substring(wordSpaceStartIndex, example.length()));
}
}
}
}
This also handles words that are not separated from the quotes by a space, such as the "hello" before "John Smith" and the one after "Basi German".
When the string is changed to "John Smith" Ted Barry, the output is three strings:
1) "John Smith"
2) Ted
3) Barry
The string in the example is hello"John Smith" Ted Barry lol"Basi German"hello, and it prints:
1)hello
2)"John Smith"
3)Ted
4)Barry
5)lol
6)"Basi German"
7)hello
Hope it helps

This is my own version, cleaned up from http://pastebin.com/aZngu65y (posted in the comments).
It can take care of Unicode. It collapses all excess whitespace (even inside quotes), which can be good or bad depending on the need. There is no support for escaped quotes.
private static String[] parse(String param) {
String[] output;
param = param.replaceAll("\"", " \" ").trim();
String[] fragments = param.split("\\s+");
int curr = 0;
boolean matched = fragments[curr].matches("[^\"]*");
if (matched) curr++;
for (int i = 1; i < fragments.length; i++) {
if (!matched)
fragments[curr] = fragments[curr] + " " + fragments[i];
if (!fragments[curr].matches("(\"[^\"]*\"|[^\"]*)"))
matched = false;
else {
matched = true;
if (fragments[curr].matches("\"[^\"]*\""))
fragments[curr] = fragments[curr].substring(1, fragments[curr].length() - 1).trim();
if (fragments[curr].length() != 0)
curr++;
if (i + 1 < fragments.length)
fragments[curr] = fragments[i + 1];
}
}
if (matched) {
return Arrays.copyOf(fragments, curr);
}
return null; // Parameter failure (double-quotes do not match up properly).
}
Sample input for comparison:
"sdfskjf" sdfjkhsd "hfrif ehref" "fksdfj sdkfj fkdsjf" sdf sfssd
asjdhj sdf ffhj "fdsf fsdjh"
日本語　中文 "Tiếng Việt" "English"
dsfsd
sdf " s dfs fsd f " sd f fs df fdssf "日本語　中文"
"" "" ""
" sdfsfds " "f fsdf
(2nd line is empty, 3rd line is spaces, last line is malformed).
Please judge against your own expected output, since it may vary, but the baseline is that the 1st case should return [sdfskjf, sdfjkhsd, hfrif ehref, fksdfj sdkfj fkdsjf, sdf, sfssd].

commons-lang has a StrTokenizer class to do this for you, and there is also the java-csv library.
Example with StrTokenizer:
String params = "\"John Smith\" Ted Barry";
// Initialize tokenizer with input string, delimiter character, quote character
StrTokenizer tokenizer = new StrTokenizer(params, ' ', '"');
for (String token : tokenizer.getTokenArray()) {
System.out.println(token);
}
Output:
John Smith
Ted
Barry

find substrings inside string

How can I find substrings inside a string, and then remember each one and delete it once I have found it?
EXAMPLE:
select * from (select a.iid_organizacijske_enote,
a.sifra_organizacijske_enote "Sifra OE",
a.naziv_organizacijske_enote "Naziv OE",
a.tip_organizacijske_enote "Tip OE"
I would like to get all the words inside " ", so
Sifra OE
Naziv OE
Tip OE
and return
select * from (select a.iid_organizacijske_enote,
a.sifra_organizacijske_enote,
a.naziv_organizacijske_enote,
a.tip_organizacijske_enote
I tried with regex and indexOf(), but neither works properly.

String.replace(..):
Replaces each substring of this string that matches the literal target sequence with the specified literal replacement sequence. The replacement proceeds from the beginning of the string to the end, for example, replacing "aa" with "b" in the string "aaa" will result in "ba" rather than "ab".
str = str.replace(wordToRemove, "");
If you don't know the words in advance, you can use the regex version:
str = str.replaceAll("\"[^\"]+\"", "");
This means that every substring that starts and ends with a quote, with any characters except quotes in between, will be replaced with an empty string.
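Two details are worth noting for the query in the question: the replaceAll call above leaves the space that preceded each alias (so the result reads "enote ," rather than "enote,"), and it does not remember what it removed. A sketch that does both follows. The names AliasStripper, aliases and strip are invented for the example, and it assumes aliases are always double-quoted and never contain a quote character.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class AliasStripper {
    // Optional whitespace before the alias, then the quoted alias itself (group 1).
    private static final Pattern ALIAS = Pattern.compile("\\s*\"([^\"]+)\"");

    // Collects the text of every quoted alias, in order of appearance.
    static List<String> aliases(String sql) {
        List<String> found = new ArrayList<>();
        Matcher m = ALIAS.matcher(sql);
        while (m.find()) {
            found.add(m.group(1));
        }
        return found;
    }

    // Removes every quoted alias together with the whitespace in front of it.
    static String strip(String sql) {
        return ALIAS.matcher(sql).replaceAll("");
    }

    public static void main(String[] args) {
        String sql = "select a.sifra_organizacijske_enote \"Sifra OE\",\n"
                   + "a.naziv_organizacijske_enote \"Naziv OE\"";
        System.out.println(aliases(sql));
        System.out.println(strip(sql));
    }
}
```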

Consider using regex with capturing groups. With Java's Matcher class, you can find the first match, and then use replaceFirst(String).
--EDIT--
example (not efficient for long inputs):
String in = "hello \"there\", \"friend!\"";
Pattern p = Pattern.compile("\\\"([^\"]*)\\\"");
Matcher m = p.matcher(in);
while(m.find()){
System.out.println(m.group(1));
in = m.replaceFirst("");
m = p.matcher(in);
}
System.out.println(in);

I tried this and wrote the function below (it is C#, not Java). It works fine and returns the output you want:
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
namespace ConsoleApplication1
{
class Program
{
static void Main(string[] args)
{
Program p = new Program();
string s = p.mystring("select * from (select a.iid_organizacijske_enote, a.sifra_organizacijske_enote 'Sifra OE', "
+"a.naziv_organizacijske_enote 'Naziv OE', "+
"a.tip_organizacijske_enote 'Tip OE'");
}
public string mystring(string s)
{
if (s.IndexOf("'") > 0)
{
string test = s.Substring(0, s.IndexOf("'"));
s = s.Replace(test+"'", "");
s = s.Remove(0, s.IndexOf("'") + 1);
test = test.Replace("'", "");
test = test + s;
return mystring(test);
}
else
{
return s;
}
}
}
}

Here is a simple manual check that compares character by character from each start position:
public static void main(String[] args){
    String mainStr = "abcdefgh";
    String ipStr = "efg";
    boolean substr = false;
    // Try every start position; a single pass with one counter only detects a subsequence.
    for(int i = 0; i + ipStr.length() <= mainStr.length() && !substr; i++){
        int j = 0;
        while(j < ipStr.length() && mainStr.charAt(i + j) == ipStr.charAt(j)){
            j++;
        }
        substr = (j == ipStr.length());
    }
    System.out.println("its a substring:"+substr);
}
