Split the right substring in a list

I am trying to store the tokens of each line in a list so I can process them. With the current code only the first element is removed. I want to remove all of the leading text (the letters) from the line before processing it. How can I fix that?
I appreciate any help.
Sample:
stop 04:48 05:18 05:46 06:16 06:46 07:16 07:46 16:46 17:16 17:46 18:16 18:46 19:16
Apple chair car 04:52 05:22 05:50 06:20 06:50 07:20 07:50 16:50 17:20 17:50 18:20 18:50 19:20
Result:
[04:48, 05:18, 05:46, 06:16, 06:46, 07:16, 07:46, 16:46, 17:16, 17:46, 18:16, 18:46, 19:16]
[04:52, 05:22, 05:50, 06:20, 06:50, 07:20, 07:50, 16:50, 17:20, 17:50, 18:20, 18:50, 19:20]
Code:
if (line.contains(":")) {
String delims = " ";
String[] tokens = line.split(delims);
List<String> list = new ArrayList<String>(
Arrays.asList(tokens));
list.remove(0);
System.out.println(tokens);
}

First replace, then split:
string.replaceFirst("(?m)^.*?(?=\\d+:\\d+)", "").split("\\s+");
string.replaceFirst("(?m)^.*?(?=\\d+:\\d+)", "") replaces everything from the start of the line up to (but not including) the first time value, that is, the leading letters and spaces, with an empty string.
Splitting the resulting string on whitespace then gives you the desired output.
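As a minimal sketch of the replace-then-split idea, the snippet below wraps it in a helper. The class name LeadingTextStripper and the method extractTimes are made up for illustration, and the (?m) flag is left out on the assumption that each call receives a single line.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class LeadingTextStripper {
    // Hypothetical helper: drops everything before the first HH:mm-like token,
    // then splits what is left on runs of whitespace.
    static List<String> extractTimes(String line) {
        String timesOnly = line.replaceFirst("^.*?(?=\\d+:\\d+)", "");
        return new ArrayList<>(Arrays.asList(timesOnly.trim().split("\\s+")));
    }

    public static void main(String[] args) {
        System.out.println(extractTimes("stop 04:48 05:18 05:46"));
        System.out.println(extractTimes("Apple chair car 04:52 05:22 05:50"));
    }
}
```

If a line contains no time value, replaceFirst leaves it unchanged, so the contains(":") check from the question is still worth keeping.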

Here is an alternative without regex. The end result is a string that you can split on spaces.
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
public class StringReplace {
public static void main(String[] args) {
String output = replace("Apple chair car 04:52 05:22 05:50 06:20 06:50 07:20 07:50 16:50 17:20 17:50 18:20 18:50 19:20");
List<String> tokens = new ArrayList<>();
Collections.addAll(tokens, output.split(" "));
// print the tokens to check the result
System.out.println(tokens);
}
private static String replace(String input) {
char[] chars = input.toCharArray();
StringBuilder builder = new StringBuilder();
for (char character: chars) {
// keep only the characters '0' through ':' (ASCII 48 to 58) and the space
if ((character >= '0' && character <= ':') || character == ' ') {
builder.append(character);
}
}
return builder.toString().trim();
}
}
Result >> 04:52 05:22 05:50 06:20 06:50 07:20 07:50 16:50 17:20 17:50 18:20 18:50 19:20

Related

Splitting input string when it contains countries with multiple words

I get multiple countries as input that I have to split by space. If a country name has multiple words, it is enclosed in double quotes. For example:
Chad Benin Angola Algeria Finland Romania "Democratic Republic of the Congo" Bolivia Uzbekistan Lesotho "United States of America"
At the moment I'm only able to split the input word by word, so United States of America doesn't stay together as one country.
BufferedReader reader = new BufferedReader(
new InputStreamReader(System.in));
// Reading data using readLine
String str = reader.readLine();
ArrayList<String> sets = new ArrayList<String>();
String[] newStr = str.split("[\\W]");
boolean check = false;
for (String s : newStr) {
sets.add(s);
}
System.out.print(sets);
How can I split these countries so that the multi-word countries don't get split?

Instead of matching what to split on, match the country names. You need to match either a run of letters, or letters and spaces between quotes: [a-zA-Z]+ matches one or more letters, and the alternative after the | matches letters and spaces between quotes, "[a-zA-Z\s]+".
String input = "Chad Benin Angola Algeria Finland Romania \"Democratic Republic of the Congo\" Bolivia Uzbekistan Lesotho \"United States of America\"";
Pattern pattern = Pattern.compile("[a-zA-Z]+|\"[a-zA-Z\\s]+\"");
Matcher matcher = pattern.matcher(input);
while (matcher.find()) {
String result = matcher.group();
if (result.startsWith("\"")) {
//quotes are matched, so remove them
result = result.substring(1, result.length() - 1);
}
System.out.println(result);
}

Hm, maybe I'm missing something, but I don't see a one-line solution. I can think of the following, though:
public static void main(String[] args) {
String inputString = "Chad Benin Angola Algeria Finland Romania \"Democratic Republic of the Congo\" Bolivia Uzbekistan Lesotho \"United States of America\"\n";
List<String> resultCountriesList = new ArrayList<>();
int currentIndex = 0;
boolean processingMultiWordsCountry = false;
for (int i = 0; i < inputString.length(); i++) {
Optional<String> substringAsOptional = extractNextSubstring(inputString, currentIndex);
if (substringAsOptional.isPresent()) {
String substring = substringAsOptional.get();
currentIndex += substring.length() + 1;
if (processingMultiWordsCountry) {
resultCountriesList.add(substring);
} else {
resultCountriesList.addAll(Arrays.stream(substring.split(" ")).map(String::trim).filter(s -> !s.isEmpty()).collect(Collectors.toList()));
}
processingMultiWordsCountry = !processingMultiWordsCountry;
}
}
System.out.println(resultCountriesList);
}
private static Optional<String> extractNextSubstring(String inputString, int currentIndex) {
if (inputString.length() > currentIndex + 1) {
return Optional.of(inputString.substring(currentIndex, inputString.indexOf("\"", currentIndex + 1)));
}
return Optional.empty();
}
The resulting list of countries, as strings, ends up in resultCountriesList. The code walks through inputString, taking the substring from the current position (currentIndex) up to the next occurrence of the " character. If such a substring is present, we continue processing. The boolean flag processingMultiWordsCountry distinguishes countries enclosed in " characters from countries outside them.
At least for now, I cannot find anything better. I don't think this code is ideal and there are probably many possible improvements, so if you see any, feel free to add a comment. Hope it helps, have a nice day!
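The same alternate-between-quoted-and-unquoted idea can be sketched more compactly by splitting on the quote character itself. This is only an illustration, not the answer's code: the names QuoteAwareSplitter and splitCountries are invented here, and the sketch assumes the quotes are balanced and never nested.

```java
import java.util.ArrayList;
import java.util.List;

class QuoteAwareSplitter {
    // Hypothetical helper; assumes balanced, non-nested double quotes.
    static List<String> splitCountries(String input) {
        List<String> result = new ArrayList<>();
        String[] segments = input.split("\"");
        for (int i = 0; i < segments.length; i++) {
            if (i % 2 == 1) {
                // Odd segments were inside quotes: keep them whole.
                result.add(segments[i]);
            } else {
                // Even segments were outside quotes: split them on whitespace.
                for (String word : segments[i].trim().split("\\s+")) {
                    if (!word.isEmpty()) {
                        result.add(word);
                    }
                }
            }
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(splitCountries(
                "Chad Benin \"Democratic Republic of the Congo\" Bolivia \"United States of America\""));
    }
}
```

Because split("\"") alternates between text outside and inside quotes, the parity of the segment index plays the role of the boolean flag.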

A similar approach to the accepted answer, but with a shorter regex and without having to strip the double quotes from each match afterwards (which is quite an expensive procedure, in my opinion):
String in = "Chad Benin Angola Algeria Finland Romania \"Democratic Republic of the Congo\" Bolivia Uzbekistan Lesotho \"United States of America\"";
Pattern p = Pattern.compile("\"([^\"]*)\"|(\\w+)");
Matcher m = p.matcher(in);
ArrayList<String> sets = new ArrayList<>();
while(m.find()) {
String multiWordCountry = m.group(1);
if (multiWordCountry != null) {
sets.add(multiWordCountry);
} else {
sets.add(m.group(2));
}
}
System.out.print(sets);
Result:
[Chad, Benin, Angola, Algeria, Finland, Romania, Democratic Republic of the Congo, Bolivia, Uzbekistan, Lesotho, United States of America]

How to access each element after a split

I am trying to read from a text file and split each line into three separate categories: ID, address, and weight. However, whenever I try to access the address and weight I get an error. Does anyone see the problem?
import java.io.*;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.*;
class Project1
{
public static void main(String[] args)throws Exception
{
List<String> list = new ArrayList<String>();
List<String> packages = new ArrayList<String>();
List<String> addresses = new ArrayList<String>();
List<String> weights = new ArrayList<String>();
//Provide the file path
File file = new File(args[0]);
//Reads the file
BufferedReader br = new BufferedReader(new FileReader(file));
String str;
while((str = br.readLine()) != null)
{
if(str.trim().length() > 0)
{
//System.out.println(str);
//Splits the string by commas and trims whitespace
String[] result = str.trim().split("\\s*,\\s*", 3);
packages.add(result[0]);
//ERROR: Doesn't know what result[1] or result[2] is.
//addresses.add(result[1]);
//weights.add(result[2]);
System.out.println(result[0]);
//System.out.println(result[1]);
//System.out.println(result[2]);
}
}
for(int i = 0; i < packages.size(); i++)
{
System.out.println(packages.get(i));
}
}
}
Here is the text file (The format is intentional):
,123-ABC-4567, 15 W. 15th St., 50.1
456-BgT-79876, 22 Broadway, 24
QAZ-456-QWER, 100 East 20th Street, 50
Q2Z-457-QWER, 200 East 20th Street, 49
678-FGH-9845 ,, 45 5th Ave,, 12.2,
678-FGH-9846,45 5th Ave,12.2
123-A BC-9999, 46 Foo Bar, 220.0
347-poy-3465, 101 B'way,24
,123-FBC-4567, 15 West 15th St., 50.1
678-FGH-8465 45 5th Ave 12.2

Looking at your data, some lines start with an unneeded comma, some use multiple commas as the delimiter, and one line has no commas at all and uses spaces as the delimiter, so you need a regex that handles all of these cases. You can use this regex, which handles these variations and captures each field:
([\w- ]+?)[ ,]+([\w .']+)[ ,]+([\d.]+)
Here is an explanation of the regex above:
([\w- ]+?) - Captures the ID, which consists of word characters, hyphens and spaces, into group 1
[ ,]+ - Acts as the delimiter: one or more spaces or commas
([\w .']+) - Captures the address, which consists of word characters, spaces, periods and apostrophes, into group 2
[ ,]+ - The delimiter again, as described above
([\d.]+) - Captures the weight, which consists of digits and periods, into group 3
Here is the modified Java code. I've removed some of your variable declarations; you can add them back as needed. The code uses a Matcher to capture the fields and prints them in the form you wanted.
Pattern p = Pattern.compile("([\\w- ]+?)[ ,]+([\\w .']+)[ ,]+([\\d.]+)");
// Reads the file
try (BufferedReader br = new BufferedReader(new FileReader("data1.txt"))) {
String str;
while ((str = br.readLine()) != null) {
Matcher m = p.matcher(str);
if (m.matches()) {
System.out.println(String.format("Id: %s, Address: %s, Weight: %s",
new Object[] { m.group(1), m.group(2), m.group(3) }));
}
}
}
Prints:
Id: 456-BgT-79876, Address: 22 Broadway, Weight: 24
Id: QAZ-456-QWER, Address: 100 East 20th Street, Weight: 50
Id: Q2Z-457-QWER, Address: 200 East 20th Street, Weight: 49
Id: 678-FGH-9845, Address: 45 5th Ave, Weight: 12.2
Id: 678-FGH-9846, Address: 45 5th Ave, Weight: 12.2
Id: 123-A BC-9999, Address: 46 Foo Bar, Weight: 220.0
Id: 347-poy-3465, Address: 101 B'way, Weight: 24
Id: 678-FGH-8465, Address: 45 5th Ave, Weight: 12.2
Let me know if this works for you and if you have any further questions.
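The printed output has no entries for the two lines that begin with a comma: matches() needs the whole line to fit the pattern, and a leading delimiter prevents that. One possible workaround, sketched below, is to strip stray commas and whitespace from both ends before matching. The names PackageLineParser and parse are hypothetical, and treating leading and trailing delimiters as noise is an assumption about the data.

```java
import java.util.Arrays;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class PackageLineParser {
    // Same three groups as the regex above; the hyphen is moved to the end of
    // the first character class so that it is unambiguously a literal.
    private static final Pattern RECORD =
            Pattern.compile("([\\w -]+?)[ ,]+([\\w .']+)[ ,]+([\\d.]+)");

    // Hypothetical helper: returns {id, address, weight}, or null if the line does not fit.
    static String[] parse(String line) {
        // Assumption: commas and whitespace at either end of a line carry no data.
        String cleaned = line.replaceAll("^[\\s,]+|[\\s,]+$", "");
        Matcher m = RECORD.matcher(cleaned);
        if (!m.matches()) {
            return null;
        }
        return new String[] { m.group(1).trim(), m.group(2).trim(), m.group(3) };
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(parse(",123-ABC-4567, 15 W. 15th St., 50.1")));
        System.out.println(Arrays.toString(parse("678-FGH-8465 45 5th Ave 12.2")));
    }
}
```

Returning null for lines that do not fit keeps the decision about malformed input with the caller.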

The last line only contains one token. So split will only return an array with one element.
A minimal reproducing example:
import java.io.*;
class Project1 {
public static void main(String[] args) throws Exception {
//Provide the file path
File file = new File(args[0]);
//Reads the file
BufferedReader br = new BufferedReader(new FileReader(file));
String str;
while ((str = br.readLine()) != null) {
if (str.trim().length() > 0) {
String[] result = str.trim().split("\\s*,\\s*", 3);
System.out.println(result[1]);
}
}
}
}
With this input file:
678-FGH-8465 45 5th Ave 12.2
The output looks like this:
Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: 1
at Project1.main(a.java:22)
Process finished with exit code 1
So you will have to decide what your program should do in such cases. You might ignore those lines, print an error, or only add the first token to one of your lists.
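As one possible way to "ignore those lines and print an error", the sketch below checks the array length before indexing. SafeSplitDemo and parseLines are illustrative names, and requiring exactly three fields is an assumption about what counts as a valid line.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class SafeSplitDemo {
    // Hypothetical helper: keeps only lines that split into exactly three fields
    // and reports the rest instead of indexing past the end of the array.
    static List<String[]> parseLines(List<String> lines) {
        List<String[]> records = new ArrayList<>();
        for (String line : lines) {
            if (line.trim().isEmpty()) {
                continue; // skip blank lines, as the original code does
            }
            String[] result = line.trim().split("\\s*,\\s*", 3);
            if (result.length == 3) {
                records.add(result);
            } else {
                System.err.println("Skipping malformed line: " + line);
            }
        }
        return records;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
                "456-BgT-79876, 22 Broadway, 24",
                "678-FGH-8465 45 5th Ave 12.2");
        for (String[] record : parseLines(lines)) {
            System.out.println(Arrays.toString(record));
        }
    }
}
```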

You can add the following checks to your code:
if (result.length > 0) {
packages.add(result[0]);
}
if (result.length > 1) {
addresses.add(result[1]);
}
if (result.length > 2) {
weights.add(result[2]);
}

Find index of a string array

I have a string :
String str = "sces123 4096 May 27 16:22 sces123 abc";
I want to get sces123 abc from the string. My code is :
String[] line = str.split("\\s+");
String name = str.substring(str.indexOf(line[5]));
It returns the whole string.
I don't know how to fix this.
Any help is appreciated!

Your code should be
String[] line = str.split("\\s+");
String name = str.substring(str.lastIndexOf(line[5]));
because in your original code str.indexOf(line[5]) returns 0, so substring returns the whole string.

In your case you just need to change str.indexOf -> str.lastIndexOf.
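A small sketch that makes the difference visible, using the string from the question. The class IndexOfDemo and the method fromLastOccurrence are hypothetical names introduced for this example.

```java
class IndexOfDemo {
    // Hypothetical helper: the substring starting at the LAST occurrence of the given token.
    static String fromLastOccurrence(String str, int tokenIndex) {
        String[] line = str.split("\\s+");
        return str.substring(str.lastIndexOf(line[tokenIndex]));
    }

    public static void main(String[] args) {
        String str = "sces123 4096 May 27 16:22 sces123 abc";
        String[] line = str.split("\\s+");
        // indexOf finds the first "sces123", which sits at the very start of the string.
        System.out.println(str.indexOf(line[5]));      // 0
        // lastIndexOf finds the second one, so the substring starts where we want.
        System.out.println(str.lastIndexOf(line[5]));  // 26
        System.out.println(fromLastOccurrence(str, 5)); // sces123 abc
    }
}
```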

This is one easy solution :
String str = "sces123 4096 May 27 16:22 sces123 abc";
//split spaces
String[] line = str.split(" ");
//get 2 last columns
String name = (line[5] + " " + line[6]);
System.out.println(name);
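If the goal is "everything after the first five columns", a variation on this idea is to pass a limit to split so that the remainder is never broken apart in the first place. This is a sketch of a different technique from the one above, with made-up names (SplitWithLimitDemo, remainderAfter); it assumes the input has at least leadingFields + 1 fields.

```java
class SplitWithLimitDemo {
    // Hypothetical helper: returns whatever follows the first `leadingFields` fields.
    // With a limit of n, split returns at most n elements, and the last one
    // holds the unsplit rest of the string.
    static String remainderAfter(String str, int leadingFields) {
        String[] parts = str.split("\\s+", leadingFields + 1);
        return parts[parts.length - 1];
    }

    public static void main(String[] args) {
        String str = "sces123 4096 May 27 16:22 sces123 abc";
        System.out.println(remainderAfter(str, 5)); // sces123 abc
    }
}
```

Because the remainder is taken as-is, it keeps its original spacing and does not depend on how many words follow the fifth column.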

As Glorfindel said in the comments, sces123, which is the content of line[5], also appears as the first substring of the main string str. That is why you are getting the full string.
What is really happening here is:
indexOf( line[ 5 ]) --> returns 0
str.substring(0) --> returns the substring from index 0 to the end, which is the whole string
If you are only working with hard-coded values, I don't see the point of doing it this way.
But you can get what you want like this (if it serves your purpose):
String name = str.substring( str.indexOf( line[ 5 ]+" "+line[6] ) );

Try This:
String str = "sces123 4096 May 27 16:22 sces123 abc";
String[] line = str.split("\\s+");
System.out.println(str.substring(str.lastIndexOf(line[5])));

You could use a Matcher to find the end of the 5th match:
String str = "sces123 4096 May 27 16:22 sces123 abc";
Pattern p = Pattern.compile("\\s+");
Matcher m = p.matcher(str);
for (int i = 0; i < 5; i++) {
m.find();
}
String name = str.substring(m.end());
In my opinion this is better than using lastIndexOf or concatenating the elements at indices 5 and 6, for the following reasons:
It does not require line[5] to be the last occurrence of that string.
Using lastIndexOf doesn't work for the input
"sces123 4096 May 27 16:22 sces123 sces123"
It also works for separator strings of arbitrary length.
Using line[ 5 ]+" "+line[6] doesn't work for the input
"sces123 4096 May 27 16:22 sces123  abc"
It does not require the number of elements after the split to be 7.
Using line[ 5 ]+" "+line[6] doesn't work for the input
"sces123 4096 May 27 16:22 sces123 abc def"

Regex to match a number or nothing

I need a regex that can match something like this:
1234 <CIRCLE> 12 12 12 </CIRCLE>
1234 <RECTANGLE> 12 12 12 12 </RECTANGLE>
I've come up with this regex:
(\\d+?) <([A-Z]+?)> (\\d+?) (\\d+?) (\\d+?) (\\d*)? (</[A-Z]+?>)
It works fine when I'm trying to match the rectangle, but it doesn't work for the circle.
The problem is that my fifth group is not capturing, though it should be?

Try
(\\d+?) <([A-Z]+?)> (\\d+?) (\\d+?) (\\d+?) (\\d+ )?(</[A-Z]+?>)
(I changed the last "\d" group to make the space optional too.)
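A quick way to check the suggested pattern against both sample lines is sketched below. The class OptionalNumberDemo and its methods are illustrative names only.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class OptionalNumberDemo {
    // The suggested pattern: the fourth number and its trailing space are optional together.
    private static final Pattern SHAPE = Pattern.compile(
            "(\\d+?) <([A-Z]+?)> (\\d+?) (\\d+?) (\\d+?) (\\d+ )?(</[A-Z]+?>)");

    static boolean isShape(String s) {
        return SHAPE.matcher(s).matches();
    }

    // Returns the optional fourth value without its trailing space, or null if it is absent.
    static String fourthValue(String s) {
        Matcher m = SHAPE.matcher(s);
        if (!m.matches() || m.group(6) == null) {
            return null;
        }
        return m.group(6).trim();
    }

    public static void main(String[] args) {
        System.out.println(isShape("1234 <CIRCLE> 12 12 12 </CIRCLE>"));
        System.out.println(isShape("1234 <RECTANGLE> 12 12 12 12 </RECTANGLE>"));
        System.out.println(fourthValue("1234 <RECTANGLE> 12 12 12 12 </RECTANGLE>"));
    }
}
```

Note that group 6 includes the trailing space when it participates in the match and is null when it does not, so callers need to handle both cases.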

That is because only the (\\d*)? part is optional; the spaces before and after it are mandatory, so if the last (\\d*) is not found you end up requiring two spaces at the end. Try something like
(\\d+?) <([A-Z]+?)> (?:(\\d+?) ){3,4}(</[A-Z]+?>)
Oh, and if you want to make sure that the closing tag is the same as the opening one, you can use group references: \\1, for example, represents the match from the first group. So maybe update your regex to something like
(\\d+?) <([A-Z]+?)> (?:(\\d+?) ){3,4}(</\\2>)
//       ^^^^^^^^^----------------------^^^
//       group 2                        here the value needs to match the one from group 2
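The backreference idea can be tried out with the sketch below. It writes the repeated part as a non-capturing group, (?:...), and uses \\2 to require that the closing tag repeats the opening tag name; MatchingTagDemo and isWellFormed are hypothetical names.

```java
import java.util.regex.Pattern;

class MatchingTagDemo {
    // \\2 refers back to whatever group 2 (the opening tag name) captured.
    private static final Pattern SHAPE = Pattern.compile(
            "(\\d+?) <([A-Z]+?)> (?:(\\d+?) ){3,4}(</\\2>)");

    static boolean isWellFormed(String s) {
        return SHAPE.matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println(isWellFormed("1234 <CIRCLE> 12 12 12 </CIRCLE>"));          // true
        System.out.println(isWellFormed("1234 <RECTANGLE> 12 12 12 12 </RECTANGLE>")); // true
        System.out.println(isWellFormed("1234 <CIRCLE> 12 12 12 </RECTANGLE>"));       // false
    }
}
```

Because the number group sits inside a repetition, group 3 only holds the last number matched; the individual values would need a separate pass to extract.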

Solution for just the numbers:
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import javax.annotation.Nonnull;
public class Q26005150
{
private static final Pattern P = Pattern.compile("(\\d+)");
public static void main(String[] args)
{
final String s1 = "1234 <CIRCLE> 12 12 12 </CIRCLE>";
final String s2 = "1234 <RECTANGLE> 12 12 12 12 </RECTANGLE>";
final List<Integer> l1 = getAllMatches(s1);
final List<Integer> l2 = getAllMatches(s2);
System.out.println("l1 = " + l1);
System.out.println("l2 = " + l2);
}
private static List<Integer> getAllMatches(@Nonnull final String s)
{
final Matcher m = P.matcher(s);
final List<Integer> matches = new ArrayList<Integer>();
while(m.find())
{
matches.add(Integer.valueOf(m.group(1)));
}
return matches;
}
}
Outputs:
l1 = [1234, 12, 12, 12]
l2 = [1234, 12, 12, 12, 12]

Solution for the Numbers and the Tags
private static final Pattern P = Pattern.compile("(</?(\\w+)>|(\\d+))");
public static void main(String[] args)
{
final String s1 = "1234 <CIRCLE> 12 12 12 </CIRCLE>";
final String s2 = "1234 <RECTANGLE> 12 12 12 12 </RECTANGLE>";
final List<String> l1 = getAllMatches(s1);
final List<String> l2 = getAllMatches(s2);
System.out.println("l1 = " + l1);
System.out.println("l2 = " + l2);
}
private static List<String> getAllMatches(@Nonnull final String s)
{
final Matcher m = P.matcher(s);
final List<String> matches = new ArrayList<String>();
while(m.find())
{
final String match = m.group(1);
matches.add(match);
}
return matches;
}
Outputs:
l1 = [1234, <CIRCLE>, 12, 12, 12, </CIRCLE>]
l2 = [1234, <RECTANGLE>, 12, 12, 12, 12, </RECTANGLE>]

Assuming the labels between "<" and ">" have to match and the numbers in between are identical,
use this pattern:
^\d+\s<([A-Z]+)>\s(\d+\s)(\2)+<\/(\1)>$
or, if the numbers in the middle do not have to be identical and/or are optional:
^\d+\s<([A-Z]+)>\s(\d+\s)*<\/(\1)>$

How to get Java split to work on Cyrillic string

I have the following situation:
I read a field from the database that contains Cyrillic letters.
String title = (String)dbTable.getAttribute("title");
I show this title in a JSP page: if the title contains more than 10 words, show only the first 10 words, otherwise show the full title.
The full title displays fine.
To get 10 words from the title I used this code:
String t1 = (String)dbTable.getAttribute("title");
String[] t2 = t1.split("\\w", 11);
title = t2[10];
But I got strange results; obviously I'm missing something about the split method.
For example:
t1 = "Внасяне от осигурителя на осигурителните вноски за държавното обществено осигуряване и допълнително задължително пенсионно осигуряване върху начислени, но неизплатени възнаграждения или върху неначислени възнаграждения, отнасящи се за труд, положен през месец Март 2012 г. (първият работен ден след 30 Април 2012 г. е 02 Май 2012 г.)";
t2[10] returns "г. е 02 Май 2012 г.) "
which is not the result I want.
I tried to see what is in t2[0], t2[1] and so on, but also didn't get the expected results: t2[0] contained the first 5 words of the string, not just the first word.
The question is: what did I do wrong with split, how can I get split working on a Cyrillic string, or can you suggest a workaround?

I wouldn't use a regex here. For extremely simple parsing, doing it manually is faster than doing it with a regex (and, in this case, far simpler).
public class FirstTenTest {
public static void main (String... args) {
String myString = "Внасяне от осигурителя на осигурителните вноски за държавното обществено осигуряване и допълнително задължително пенсионно осигуряване върху начислени, но неизплатени възнаграждения или върху неначислени възнаграждения, отнасящи се за труд, положен през месец Март 2012 г. (първият работен ден след 30 Април 2012 г. е 02 Май 2012 г.)";
System.out.println(firstTenWords(myString));
}
public static String firstTenWords(String input) {
StringBuilder sb = new StringBuilder();
int spaceCount = 0;
for(char c : input.toCharArray()) {
if (c == ' ') spaceCount++;
if (spaceCount == 10) break;
sb.append(c);
}
return sb.toString();
}
}
Output:
Внасяне от осигурителя на осигурителните вноски за държавното обществено осигуряване

Try to use "\\s+" instead of "\\w"

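String[] t2 = t1.split("\\w", 11); actually means: split the string t1 at every word character (a-z, A-Z, 0-9 or _), and give me at most 11 pieces. By default Cyrillic letters are not word characters for \\w, so the string is only split at digits and Latin letters.
The character class for whitespace is \\s.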
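Putting the \\s suggestion to work, a minimal sketch for "the first ten words, or the whole title if it is shorter" could look like this. FirstWords and firstWords are made-up names, and words are assumed to be separated by whitespace only.

```java
import java.util.Arrays;

class FirstWords {
    // Hypothetical helper: returns at most maxWords whitespace-separated words.
    static String firstWords(String title, int maxWords) {
        String[] words = title.trim().split("\\s+");
        if (words.length <= maxWords) {
            return title; // short titles are returned unchanged
        }
        return String.join(" ", Arrays.copyOf(words, maxWords));
    }

    public static void main(String[] args) {
        String title = "Внасяне от осигурителя на осигурителните вноски";
        System.out.println(firstWords(title, 3));  // Внасяне от осигурителя
        System.out.println(firstWords(title, 10)); // the whole title
    }
}
```

Splitting on whitespace avoids the question of which letters count as word characters, so it behaves the same for Cyrillic and Latin text.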

Steps you could implement, as I understand the problem (I'm not sure about Cyrillic letters):
1. Get the length of the title
2. Check the length of the string
3. If the length is greater than 10, use title.substring(startIndex, endIndex) and return the result
4. If the length is less than 10, return the title as it is
