Splitting input string when it contains countries with multiple words - java

I get multiple countries as input that I have to split by space. If a country's name has multiple words, it is enclosed in double quotes. For example:
Chad Benin Angola Algeria Finland Romania "Democratic Republic of the Congo" Bolivia Uzbekistan Lesotho "United States of America"
At the moment I'm only able to split the input word by word, so United States of America doesn't stay together as one country.
BufferedReader reader = new BufferedReader(
new InputStreamReader(System.in));
// Reading data using readLine
String str = reader.readLine();
ArrayList<String> sets = new ArrayList<String>();
String[] newStr = str.split("[\\W]");
boolean check = false;
for (String s : newStr) {
sets.add(s);
}
System.out.print(sets);
How can I split these countries so that the multi-word countries don't get split?

Instead of matching what to split on, match the country names. You need to match either a run of letters, or letters and spaces between quotes. Match one or more letters with [a-zA-Z]+, or (|) match letters and spaces between quotes with "[a-zA-Z\s]+".
String input = "Chad Benin Angola Algeria Finland Romania \"Democratic Republic of the Congo\" Bolivia Uzbekistan Lesotho \"United States of America\"";
Pattern pattern = Pattern.compile("[a-zA-Z]+|\"[a-zA-Z\\s]+\"");
Matcher matcher = pattern.matcher(input);
while (matcher.find()) {
String result = matcher.group();
if (result.startsWith("\"")) {
//quotes are matched, so remove them
result = result.substring(1, result.length() - 1);
}
System.out.println(result);
}

Hm, maybe I am not intelligent enough, but I do not see any one-line solution. I can think of the following, though:
public static void main(String[] args) {
String inputString = "Chad Benin Angola Algeria Finland Romania \"Democratic Republic of the Congo\" Bolivia Uzbekistan Lesotho \"United States of America\"\n";
List<String> resultCountriesList = new ArrayList<>();
int currentIndex = 0;
boolean processingMultiWordsCountry = false;
for (int i = 0; i < inputString.length(); i++) {
Optional<String> substringAsOptional = extractNextSubstring(inputString, currentIndex);
if (substringAsOptional.isPresent()) {
String substring = substringAsOptional.get();
currentIndex += substring.length() + 1;
if (processingMultiWordsCountry) {
resultCountriesList.add(substring);
} else {
resultCountriesList.addAll(Arrays.stream(substring.split(" ")).peek(String::trim).filter(s -> !s.isEmpty()).collect(Collectors.toList()));
}
processingMultiWordsCountry = !processingMultiWordsCountry;
}
}
System.out.println(resultCountriesList);
}
private static Optional<String> extractNextSubstring(String inputString, int currentIndex) {
if (inputString.length() > currentIndex + 1) {
return Optional.of(inputString.substring(currentIndex, inputString.indexOf("\"", currentIndex + 1)));
}
return Optional.empty();
}
The resulting list of countries, as strings, ends up in resultCountriesList. The code iterates over the string, taking a substring of inputString from the end of the previous substring (currentIndex) up to the next occurrence of the " symbol. If such a substring is present, we continue processing. The boolean flag processingMultiWordsCountry separates countries enclosed in " symbols from countries that sit outside them.
So, at least for now, I cannot find anything better. I do not think this code is ideal, and there are probably many possible improvements, so if you can think of any, feel free to add a comment. Hope it helped, have a nice day!

Similar approach as in the accepted answer but with a shorter regex and without matching and replacing the double quotes (which is quite an expensive procedure, in my opinion):
String in = "Chad Benin Angola Algeria Finland Romania \"Democratic Republic of the Congo\" Bolivia Uzbekistan Lesotho \"United States of America\"";
Pattern p = Pattern.compile("\"([^\"]*)\"|(\\w+)");
Matcher m = p.matcher(in);
ArrayList<String> sets = new ArrayList<>();
while(m.find()) {
String multiWordCountry = m.group(1);
if (multiWordCountry != null) {
sets.add(multiWordCountry);
} else {
sets.add(m.group(2));
}
}
System.out.print(sets);
Result:
[Chad, Benin, Angola, Algeria, Finland, Romania, Democratic Republic of the Congo, Bolivia, Uzbekistan, Lesotho, United States of America]
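If Java 9 or later is available, the same two-group pattern can probably be wrapped in a small helper built on Matcher.results(), which gets fairly close to a one-liner. This is only a sketch: the class name CountrySplitter and the method name split are made up for illustration, and it assumes the quotes are balanced and never escaped.

```java
import java.util.List;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

class CountrySplitter {
    // Group 1: text inside double quotes; group 2: a single unquoted word.
    private static final Pattern TOKEN = Pattern.compile("\"([^\"]*)\"|(\\w+)");

    // Hypothetical helper: returns the country names with quotes removed.
    static List<String> split(String line) {
        return TOKEN.matcher(line).results()   // Matcher.results() needs Java 9+
                .map(r -> r.group(1) != null ? r.group(1) : r.group(2))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        String in = "Chad Benin \"Democratic Republic of the Congo\" Bolivia \"United States of America\"";
        System.out.println(split(in));
    }
}
```

Compiling the pattern once as a constant avoids rebuilding it for every input line.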

Related

How to print a substring with only the matching elements of a string?

Given a String that lists metadata about a book line by line, how do I print out only the lines that match the data I am looking for?
In order to do this, I've been trying to create a substring for each line using indexes. The substring starts at the beginning of a line and ends before a "\n". I have not seen lists, arrays or BufferedReader yet.
For each substring that I parse, I check if it contains my pattern. If it does, I add it to a string that only includes my results.
Here is an example of my list (in French); I'd like to match, say, all the books written in 2017.
Origine D. Brown 2017 Thriller Policier
Romance et de si belles fiancailles M. H. Clark 2018 thriller policier Romance
La fille du train P. Hawkins 2015 Policier
There is a flaw in how I am doing this and I am stuck with an IndexOutOfBounds exception that I can't figure out. Definitely new in creating algorithms like this.
public static String search() {
String list;
int indexLineStart = 0;
int indexLineEnd = list.indexOf("\n");
int indexFinal = list.length()-1;
String listToPrint = "";
while (indexLineStart <= indexFinal){
String listCheck = list.substring(indexLineStart, indexLineEnd);
if (listCheck.contains(dataToMatch)){
listToPrint = listToPrint + "\n" + listCheck;
}
indexLineStart = indexLineEnd +1 ;
indexLineEnd = list.indexOf("\n", indexLineStart);
}
return listToPrint;
}
Regardless of the comments about using split() and String[], which do have merit :-)
I believe the IndexOutOfBounds exception is being caused by the second of these two lines:
indexLineStart = indexLineEnd +1 ;
indexLineEnd = list.indexOf("\n", indexLineStart);
Once no further "\n" is found, indexOf returns -1, and the next substring call fails. You want them swapped around (I believe).
You don't need such complex logic with String.substring(). Instead, you can use String.split() to turn your string into an array with one book at each index. Then you can check each book against your matching criteria and add it to finalString if it matches.
Working Code:
public class stackString
{
public static void main(String[] args)
{
String list = "Origine D. Brown 2017 Thriller Policier\n Romance et de si belles fiancailles M. H. Clark 2018 thriller policier Romance\n La fille du train P. Hawkins 2015 Policier\n";
String[] listArray = list.split("\n"); // make a String Array on each index is new book
String finalString = ""; // final array to store the books that matches the search
String matchCondition = "2017";
for(int i =0; i<listArray.length;i++)
if(listArray[i].contains(matchCondition))
finalString += listArray[i]+"\n";
System.out.println(finalString);
}
}
Here is a solution using pattern matching
public static List<String> search(String input, String keyword)
{
Pattern pattern = Pattern.compile(".*" + keyword + ".*");
Matcher matcher = pattern.matcher(input);
List<String> linesContainingKeyword = new LinkedList<>();
while (matcher.find())
{
linesContainingKeyword.add(matcher.group());
}
return linesContainingKeyword;
}
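One thing to keep in mind with the pattern-based version is that keyword is interpreted as a regular expression, so a keyword such as "D." would match more than the literal text. If streams are allowed and Java 11 or later is available, a literal substring test over String.lines() might be enough. The sketch below is illustrative only; BookFilter and matchingLines are assumed names, not part of any answer above.

```java
import java.util.stream.Collectors;

class BookFilter {
    // Hypothetical helper: keeps the lines of 'text' that contain 'keyword'
    // literally (no regex interpretation) and joins them with newlines.
    static String matchingLines(String text, String keyword) {
        return text.lines()                      // String.lines() needs Java 11+
                .map(String::strip)              // drop stray leading/trailing spaces
                .filter(line -> line.contains(keyword))
                .collect(Collectors.joining("\n"));
    }

    public static void main(String[] args) {
        String list = "Origine D. Brown 2017 Thriller Policier\n"
                + "Romance et de si belles fiancailles M. H. Clark 2018 thriller policier Romance\n"
                + "La fille du train P. Hawkins 2015 Policier\n";
        System.out.println(matchingLines(list, "2017"));
    }
}
```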
Since I wasn't allowed to use lists and arrays, I got this to be functional this morning.
public static String linesWithPattern (String pattern){
String library;
library = library + "\n"; // Add an end of line at the end of the text so the last line can be parsed without problems.
String substring = "";
String substringWithPattern = "";
char endOfLine = '\n';
int nbrLines = countNbrLines(library, endOfLine); //Method to count number of '\n'
int lineStart = 0;
int lineEnd = 0;
for (int i = 0; i < nbrLines ; i++){
lineStart = lineEnd;
if (lineStart == 0){
lineEnd = library.indexOf('\n');
} else if (lineStart != 0){
lineEnd = library.indexOf('\n', (lineEnd + 1));
}
substring = library.substring(lineStart, lineEnd);
if (substring.toLowerCase().contains(pattern.toLowerCase())){
substringWithPattern = substring + substringWithPattern + '\n';
}
if (!library.toLowerCase().contains(pattern.toLowerCase())){
substringWithPattern = "\nNO ENTRY FOUND \n";
}
}
if (library.toLowerCase().contains(pattern.toLowerCase())){
substringWithPattern = "This or these books were found in the library \n" +
"--------------------------" + substringWithPattern;
}
return substringWithPattern;
}
An IndexOutOfBounds exception is thrown when an index lies outside the valid range. Going through the code, the exception comes from the line below, most likely because indexLineEnd is out of range for list (assuming list is not null, since your code doesn't show it being initialized). In particular, indexOf returns -1 once no further "\n" is found.
String listCheck = list.substring(indexLineStart, indexLineEnd);
Run the application in debug mode to see the exact values passed to substring and understand why it throws.
You need to be careful when calculating the value of indexLineEnd.
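To make the point about indexLineEnd concrete, here is a minimal sketch of the index-based loop that avoids arrays and lists and treats a missing final "\n" as the end of the text. The names LineScanner and linesContaining are assumptions for illustration, and StringBuilder could be swapped for plain string concatenation if it has not been covered yet.

```java
class LineScanner {
    // Hypothetical helper: returns every line of 'text' that contains 'needle',
    // each followed by a newline.
    static String linesContaining(String text, String needle) {
        StringBuilder out = new StringBuilder();
        int start = 0;
        while (start < text.length()) {
            int end = text.indexOf('\n', start);
            if (end == -1) {
                // No further newline: the last line runs to the end of the text.
                end = text.length();
            }
            String line = text.substring(start, end);
            if (line.contains(needle)) {
                out.append(line).append('\n');
            }
            start = end + 1; // continue just past the newline
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String list = "Origine D. Brown 2017 Thriller Policier\n"
                + "La fille du train P. Hawkins 2015 Policier";
        System.out.print(linesContaining(list, "2017"));
    }
}
```

Checking the result of indexOf before calling substring is what keeps the end index from ever being -1.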

Uppercase all characters but not those in quoted strings

I have a String and I would like to uppercase everything that is not quoted.
Example:
My name is 'Angela'
Result:
MY NAME IS 'Angela'
Currently, I am matching every quoted string then looping and concatenating to get the result.
Is it possible to achieve this in one regex expression maybe using replace?
List<String> matchList = new ArrayList<String>();
Pattern regex = Pattern.compile("\\'(.*?)\\'");
String input = "'s'Hello This is 'Java' Not '.NET'";
Matcher regexMatcher = regex.matcher(input);
StringBuffer sb = new StringBuffer();
int counter = 0;
while (regexMatcher.find())
{// Finds Matching Pattern in String
regexMatcher.appendReplacement(sb, "{"+counter+"}");
matchList.add(regexMatcher.group());// Fetching Group from String
counter++;
}
String format = MessageFormat.format(sb.toString().toUpperCase(), matchList.toArray());
System.out.println(input);
System.out.println("----------------------");
System.out.println(format);
Input: 's'Hello This is 'Java' Not '.NET'
Output: 's'HELLO THIS IS 'Java' NOT '.NET'
You could use a regular expression like this:
([^'"]+)(['"]+[^'"]+['"]+)(.*)
# match and capture everything up to a single or double quote (but not including)
# match and capture a quoted string
# match and capture any rest which might or might not be there.
This will only work with one quoted string, obviously.
Ok. This will do it for you.. Not efficient, but will work for all cases. I actually don't suggest this solution as it will be too slow.
public static void main(String[] args) {
String s = "'Peter' said, My name is 'Angela' and I will not change my name to 'Pamela'.";
Pattern p = Pattern.compile("('\\w+')");
Matcher m = p.matcher(s);
List<String> quotedStrings = new ArrayList<>();
while(m.find()) {
quotedStrings.add(m.group(1));
}
s=s.toUpperCase();
// System.out.println(s);
for (String str : quotedStrings)
s= s.replaceAll("(?i)"+str, str);
System.out.println(s);
}
O/P :
'Peter' SAID, MY NAME IS 'Angela' AND I WILL NOT CHANGE MY NAME TO 'Pamela'.
Adding to the answer by @jan_kiran, we need to call the appendTail() method. Updated code is:
List<String> matchList = new ArrayList<String>();
Pattern regex = Pattern.compile("\\'(.*?)\\'");
String input = "'s'Hello This is 'Java' Not '.NET'";
Matcher regexMatcher = regex.matcher(input);
StringBuffer sb = new StringBuffer();
int counter = 0;
while (regexMatcher.find())
{// Finds Matching Pattern in String
regexMatcher.appendReplacement(sb, "{"+counter+"}");
matchList.add(regexMatcher.group());// Fetching Group from String
counter++;
}
regexMatcher.appendTail(sb);
String formatted_string = MessageFormat.format(sb.toString().toUpperCase(), matchList.toArray());
I did not find my luck with these solutions, as they seemed to remove trailing non-quoted text.
This code works for me, and treats both ' and " by remembering the last opening quotation mark type. Replace toLowerCase appropriately, of course...
Maybe this is extremely slow; I don't know:
private static String toLowercaseExceptInQuotes(String line) {
StringBuffer sb = new StringBuffer(line);
boolean nowInQuotes = false;
char lastQuoteType = 0;
for (int i = 0; i < sb.length(); ++i) {
char cchar = sb.charAt(i);
if (cchar == '"' || cchar == '\''){
if (!nowInQuotes) {
nowInQuotes = true;
lastQuoteType = cchar;
}
else {
if (lastQuoteType == cchar) {
nowInQuotes = false;
}
}
}
else if (!nowInQuotes) {
sb.setCharAt(i, Character.toLowerCase(sb.charAt(i)));
}
}
return sb.toString();
}
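As for doing it in a single pass with one pattern: a possible sketch is to let the regex alternate between "a quoted run" and "a run without quotes" and uppercase only the latter. The names QuoteAwareUpper and upperOutsideQuotes are invented for this example, and it only handles single quotes. A quote with no closing partner is treated as an ordinary character here, which is an assumption about the desired behavior.

```java
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class QuoteAwareUpper {
    // Alternatives, tried in order: a complete quoted run, a run of
    // non-quote characters, or a lone (unmatched) quote.
    private static final Pattern PART = Pattern.compile("'[^']*'|[^']+|'");

    static String upperOutsideQuotes(String input) {
        Matcher m = PART.matcher(input);
        StringBuilder sb = new StringBuilder(input.length());
        while (m.find()) {
            String part = m.group();
            // Quoted parts (and lone quotes) are copied unchanged.
            sb.append(part.startsWith("'") ? part : part.toUpperCase(Locale.ROOT));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(upperOutsideQuotes("'s'Hello This is 'Java' Not '.NET'"));
    }
}
```

Because every character of the input is covered by one of the three alternatives, no trailing text is lost and no appendTail() call is needed.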

Splitting a string based on " " and spaces [duplicate]

I have a String str, which is comprised of several words separated by single spaces.
If I want to create a set or list of strings I can simply call str.split(" ") and I would get what I want.
Now, assume that str is a little more complicated, for example it is something like:
str = "hello bonjour \"good morning\" buongiorno";
In this case what is in between " " I want to keep so that my list of strings is:
hello
bonjour
good morning
buongiorno
Clearly, if I used split(" ") in this case it won't work because I'd get
hello
bonjour
"good
morning"
buongiorno
So, how do I get what I want?
You can create a regex that finds every word or words between "".. like:
\w+|(\"\w+(\s\w+)*\")
and search for them with the Pattern and Matcher classes.
ex.
String searchedStr = "";
Pattern pattern = Pattern.compile("\\w+|(\\\"\\w+(\\s\\w+)*\\\")");
Matcher matcher = pattern.matcher(searchedStr);
while(matcher.find()){
String word = matcher.group();
}
Edit: works for every number of words within "" now. XD forgot that
You can do something like the code below. First split the String on "\"", and then split the remaining parts on a space " ". The even-numbered tokens will be the ones between quotes "".
public static void main(String args[]) {
String str = "hello bonjour \"good morning\" buongiorno";
System.out.println(str);
String[] parts = str.split("\"");
List<String> myList = new ArrayList<String>();
int i = 1;
for(String partStr : parts) {
if(i%2 == 0){
myList.add(partStr);
}
else {
myList.addAll(Arrays.asList(partStr.trim().split(" ")));
}
i++;
}
System.out.println("MyList : " + myList);
}
and the output is
hello bonjour "good morning" buongiorno
MyList : [hello, bonjour, good morning, buongiorno]
You may be able to find a solution using regular expressions, but what I'd do is simply manually write a string breaker.
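List<String> splitButKeepQuotes(String s, char splitter) {
ArrayList<String> list = new ArrayList<String>();
boolean inQuotes = false;
int startOfWord = 0;
for (int i = 0; i < s.length(); i++) {
if (s.charAt(i) == splitter && !inQuotes) {
// Skip empty tokens produced by consecutive splitters
if (i != startOfWord) {
list.add(s.substring(startOfWord, i));
}
startOfWord = i + 1;
}
if (s.charAt(i) == '"') { // compare against a char literal, not a String
inQuotes = !inQuotes;
}
}
// Add the final token, which is not followed by a splitter
if (startOfWord < s.length()) {
list.add(s.substring(startOfWord));
}
return list;
}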

Equivalent to StringTokenizer with multiple characters delimiters

I am trying to split a String into tokens.
The token delimiters are not single characters, some delimiters are contained in others (for example, & and &&), and I need to have the delimiters returned as tokens.
StringTokenizer is not able to deal with multi-character delimiters. I presume it's possible with String.split, but I fail to guess the magical regular expression that will suit my needs.
Any ideas?
Example:
Token delimiters: "&", "&&", "=", "=>", " "
String to tokenize: a & b&&c=>d
Expected result: a string array containing "a", " ", "&", " ", "b", "&&", "c", "=>", "d"
--- Edit ---
Thanks to all for your help, Dasblinkenlight gives me the solution. Here is the "ready to use" code I wrote with his help:
private static String[] wonderfulTokenizer(String string, String[] delimiters) {
// First, create a regular expression that matches the union of the delimiters
// Be aware that, when one delimiter contains another (for example && and &),
// the longer one must come before the shorter one (&& before &), or the regex
// engine will recognize && as two &.
Arrays.sort(delimiters, new Comparator<String>() {
@Override
public int compare(String o1, String o2) {
return -o1.compareTo(o2);
}
});
// Build a string that will contain the regular expression
StringBuilder regexpr = new StringBuilder();
regexpr.append('(');
for (String delim : delimiters) { // For each delimiter
if (regexpr.length() != 1) regexpr.append('|'); // Add union separator if needed
// Quote the delimiter so that regex metacharacters in it are matched literally
// (blindly prefixing every character with \ would turn e.g. "d" into \d)
regexpr.append(Pattern.quote(delim));
}
regexpr.append(')'); // Close the union
Pattern p = Pattern.compile(regexpr.toString());
// Now, search for the tokens
List<String> res = new ArrayList<String>();
Matcher m = p.matcher(string);
int pos = 0;
while (m.find()) { // While there's a delimiter in the string
if (pos != m.start()) {
// If there's something between the current and the previous delimiter
// Add it to the tokens list
res.add(string.substring(pos, m.start()));
}
res.add(m.group()); // add the delimiter
pos = m.end(); // Remember end of delimiter
}
if (pos != string.length()) {
// If it remains some characters in the string after last delimiter
// Add this to the token list
res.add(string.substring(pos));
}
// Return the result
return res.toArray(new String[res.size()]);
}
It could be optimized, if you have many strings to tokenize, by creating the Pattern only once.
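A rough sketch of that optimization is shown below: the pattern is built once in a constructor and reused for every string. The class name DelimiterTokenizer and its methods are hypothetical. It sorts delimiters by length so that && is tried before &, uses Pattern.quote for escaping, and assumes at least one non-empty delimiter is given.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

final class DelimiterTokenizer {
    private final Pattern delimiterPattern; // compiled once, reused per call

    DelimiterTokenizer(String... delimiters) {
        String union = Arrays.stream(delimiters)
                // Longest first, so "&&" wins over "&" at the same position.
                .sorted((a, b) -> Integer.compare(b.length(), a.length()))
                .map(Pattern::quote)
                .collect(Collectors.joining("|"));
        this.delimiterPattern = Pattern.compile(union);
    }

    // Returns the text between delimiters and the delimiters themselves, in order.
    List<String> tokenize(String input) {
        List<String> tokens = new ArrayList<>();
        Matcher m = delimiterPattern.matcher(input);
        int pos = 0;
        while (m.find()) {
            if (pos != m.start()) {
                tokens.add(input.substring(pos, m.start()));
            }
            tokens.add(m.group());
            pos = m.end();
        }
        if (pos != input.length()) {
            tokens.add(input.substring(pos));
        }
        return tokens;
    }

    public static void main(String[] args) {
        DelimiterTokenizer t = new DelimiterTokenizer("&", "&&", "=", "=>", " ");
        System.out.println(t.tokenize("a & b&&c=>d"));
    }
}
```

Unlike the version above, this does not reorder the caller's delimiter array, since the sort happens on a stream copy.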
You can use the Pattern and a simple loop to achieve the results that you are looking for:
List<String> res = new ArrayList<String>();
Pattern p = Pattern.compile("([&]{1,2}|=>?| +)");
String s = "s=a&=>b";
Matcher m = p.matcher(s);
int pos = 0;
while (m.find()) {
if (pos != m.start()) {
res.add(s.substring(pos, m.start()));
}
res.add(m.group());
pos = m.end();
}
if (pos != s.length()) {
res.add(s.substring(pos));
}
for (String t : res) {
System.out.println("'"+t+"'");
}
This produces the result below:
's'
'='
'a'
'&'
'=>'
'b'
Split won't do it for you as it removes the delimiter. You probably need to tokenize the string on your own (i.e. a for-loop) or use a framework like
http://www.antlr.org/
Try this:
String test = "a & b&&c=>d=A";
String regEx = "(&[&]?|=[>]?)";
String[] res = test.split(regEx);
for(String s : res){
System.out.println("Token: "+s);
}
I added the '=A' at the end to show that that is also parsed.
As mentioned in another answer, if you need the atypical behaviour of keeping the delimiters in the result, you will probably need to create your parser yourself... but in that case you really have to think about what a "delimiter" is in your code.

Split a quoted string with a delimiter

I want to split a string using white space as the delimiter, but it should handle quoted strings intelligently. E.g. for a string like
"John Smith" Ted Barry
It should return three strings John Smith, Ted and Barry.
After messing around with it, you can use Regex for this. Run the equivalent of "match all" on:
((?<=("))[\w ]*(?=("(\s|$))))|((?<!")\w+(?!"))
A Java Example:
import java.util.regex.Pattern;
import java.util.regex.Matcher;
public class Test
{
public static void main(String[] args)
{
String someString = "\"Multiple quote test\" not in quotes \"inside quote\" \"A work in progress\"";
Pattern p = Pattern.compile("((?<=(\"))[\\w ]*(?=(\"(\\s|$))))|((?<!\")\\w+(?!\"))");
Matcher m = p.matcher(someString);
while(m.find()) {
System.out.println("'" + m.group() + "'");
}
}
}
Output:
'Multiple quote test'
'not'
'in'
'quotes'
'inside quote'
'A work in progress'
The regular expression breakdown with the example used above can be viewed here:
http://regex101.com/r/wM6yT9
With all that said, regular expressions should not be the go to solution for everything - I was just having fun. This example has a lot of edge cases such as the handling unicode characters, symbols, etc. You would be better off using a tried and true library for this sort of task. Take a look at the other answers before using this one.
Try this ugly bit of code.
String str = "hello my dear \"John Smith\" where is Ted Barry";
List<String> list = Arrays.asList(str.split("\\s"));
List<String> resultList = new ArrayList<String>();
StringBuilder builder = new StringBuilder();
boolean inQuotes = false; // true while collecting the words of a quoted phrase
for(String s : list){
if(!inQuotes && s.startsWith("\"")) {
inQuotes = true;
s = s.substring(1);
}
if(inQuotes && s.endsWith("\"")) {
inQuotes = false;
s = s.substring(0, s.length() - 1);
}
builder.append(s);
if(inQuotes) {
builder.append(" "); // more words of the same phrase follow
} else {
resultList.add(builder.toString());
builder.delete(0, builder.length());
}
}
System.out.println(resultList);
Well, I made a small snippet that does what you want and a few more things. Since you did not specify more conditions, I did not go to the trouble of handling them. I know this is a dirty way, and you can probably get better results with something that is already made, but for the fun of programming here is the example:
String example = "hello\"John Smith\" Ted Barry lol\"Basi German\"hello";
int wordQuoteStartIndex=0;
int wordQuoteEndIndex=0;
int wordSpaceStartIndex = 0;
int wordSpaceEndIndex = 0;
boolean foundQuote = false;
for(int index=0;index<example.length();index++) {
if(example.charAt(index)=='\"') {
if(foundQuote==true) {
wordQuoteEndIndex=index+1;
//Print the quoted word
System.out.println(example.substring(wordQuoteStartIndex, wordQuoteEndIndex));//here you can remove quotes by changing to (wordQuoteStartIndex+1, wordQuoteEndIndex-1)
foundQuote=false;
if(index+1<example.length()) {
wordSpaceStartIndex = index+1;
}
}else {
wordSpaceEndIndex=index;
if(wordSpaceStartIndex!=wordSpaceEndIndex) {
//print the word in spaces
System.out.println(example.substring(wordSpaceStartIndex, wordSpaceEndIndex));
}
wordQuoteStartIndex=index;
foundQuote = true;
}
}
if(foundQuote==false) {
if(example.charAt(index)==' ') {
wordSpaceEndIndex = index;
if(wordSpaceStartIndex!=wordSpaceEndIndex) {
//print the word in spaces
System.out.println(example.substring(wordSpaceStartIndex, wordSpaceEndIndex));
}
wordSpaceStartIndex = index+1;
}
if(index==example.length()-1) {
if(example.charAt(index)!='\"') {
//print the word in spaces
System.out.println(example.substring(wordSpaceStartIndex, example.length()));
}
}
}
}
This also handles words that are not separated by a space before or after the quotes, such as the "hello" before "John Smith" and after "Basi German".
When the string is changed to "John Smith" Ted Barry, the output is three strings:
1) "John Smith"
2) Ted
3) Barry
The string in the example is hello"John Smith" Ted Barry lol"Basi German"hello and prints
1)hello
2)"John Smith"
3)Ted
4)Barry
5)lol
6)"Basi German"
7)hello
Hope it helps
This is my own version, cleaned up from http://pastebin.com/aZngu65y (posted in the comment).
It can handle Unicode. It collapses all excess whitespace (even inside quotes), which can be good or bad depending on the need. There is no support for escaped quotes.
private static String[] parse(String param) {
String[] output;
param = param.replaceAll("\"", " \" ").trim();
String[] fragments = param.split("\\s+");
int curr = 0;
boolean matched = fragments[curr].matches("[^\"]*");
if (matched) curr++;
for (int i = 1; i < fragments.length; i++) {
if (!matched)
fragments[curr] = fragments[curr] + " " + fragments[i];
if (!fragments[curr].matches("(\"[^\"]*\"|[^\"]*)"))
matched = false;
else {
matched = true;
if (fragments[curr].matches("\"[^\"]*\""))
fragments[curr] = fragments[curr].substring(1, fragments[curr].length() - 1).trim();
if (fragments[curr].length() != 0)
curr++;
if (i + 1 < fragments.length)
fragments[curr] = fragments[i + 1];
}
}
if (matched) {
return Arrays.copyOf(fragments, curr);
}
return null; // Parameter failure (double-quotes do not match up properly).
}
Sample input for comparison:
"sdfskjf" sdfjkhsd "hfrif ehref" "fksdfj sdkfj fkdsjf" sdf sfssd
asjdhj sdf ffhj "fdsf fsdjh"
日本語 中文 "Tiếng Việt" "English"
dsfsd
sdf " s dfs fsd f " sd f fs df fdssf "日本語 中文"
"" "" ""
" sdfsfds " "f fsdf
(The 2nd line is empty, the 3rd line is only spaces, and the last line is malformed.)
Please judge against your own expected output, since it may vary, but the baseline is that the 1st case should return [sdfskjf, sdfjkhsd, hfrif ehref, fksdfj sdkfj fkdsjf, sdf, sfssd].
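For comparison, a character-by-character scanner is another way to get the baseline behavior without regular expressions. The sketch below is not the code above. QuotedSplitter and split are assumed names, and it makes different choices: whitespace inside quotes is preserved, an empty pair of quotes yields an empty token, malformed input throws instead of returning null, and escaped quotes are still unsupported.

```java
import java.util.ArrayList;
import java.util.List;

final class QuotedSplitter {
    static List<String> split(String input) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        boolean inQuotes = false;
        boolean tokenStarted = false; // lets "" count as an (empty) token
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            if (c == '"') {
                inQuotes = !inQuotes;   // the quote itself is not kept
                tokenStarted = true;
            } else if (Character.isWhitespace(c) && !inQuotes) {
                if (tokenStarted) {     // whitespace outside quotes ends a token
                    tokens.add(current.toString());
                    current.setLength(0);
                    tokenStarted = false;
                }
            } else {
                current.append(c);
                tokenStarted = true;
            }
        }
        if (inQuotes) {
            throw new IllegalArgumentException("Unbalanced double quote");
        }
        if (tokenStarted) {
            tokens.add(current.toString());
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(split("\"sdfskjf\" sdfjkhsd \"hfrif ehref\" \"fksdfj sdkfj fkdsjf\" sdf sfssd"));
    }
}
```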
commons-lang has a StrTokenizer class to do this for you, and there is also java-csv library.
Example with StrTokenizer:
String params = "\"John Smith\" Ted Barry";
// Initialize tokenizer with input string, delimiter character, quote character
StrTokenizer tokenizer = new StrTokenizer(params, ' ', '"');
for (String token : tokenizer.getTokenArray()) {
System.out.println(token);
}
Output:
John Smith
Ted
Barry
