Splitting a text file at regular expressions and creating an array - java

I need some help with splitting a text file in Java. I have a text file with around 2000 words, in the following format:
"A", "HELLO", "RANDOM", "WORD"... etc
My goal is to put these words into a String array. Using the Scanner class, I read the text into a string, then try to split it and add the pieces to an array as follows.
Scanner input = new Scanner(file);
while(input.hasNext()){
String temp = input.next();
String[] wordArray = temp.split(",");
}
After they are added to the array, when I want to use or print each element, they are stored with the surrounding quotation marks. Instead of printing
A
HELLO
RANDOM
… etc they are printing
"A"
"HELLO"
"RANDOM"
So my question is: how can I get rid of these quotation marks and split the text at each word, so that I'm left with only the bare words?
Many Thanks

Try this
List<String> allMatches = new ArrayList<>();
Scanner input = new Scanner(new File(FILE_PATH), StandardCharsets.UTF_8);
while(input.hasNext()){
String temp = input.next();
Matcher m = Pattern.compile("\"(.*)\"").matcher(temp);
while (m.find()) {
allMatches.add(m.group(1));
}
}
allMatches.forEach(System.out::println);

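In your specific case, where all values look like "A", "HELLO", "RANDOM", "WORD"... etc and every value is separated by the delimiter ", ", you can use temp.split("\", \""); and you don't need the inner loop. This requires temp to hold the whole line (for example from nextLine()) rather than a single whitespace-delimited token, and the first element keeps its leading quote and the last element its trailing quote, so those two still need to be stripped.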

Use this to replace all occurrences of " with an empty string. Strings are immutable, so the result has to be assigned back:
wordArray[i] = wordArray[i].replace("\"", "");
Perform this in a loop over the array.
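Putting the suggestions above together, a minimal sketch could look like the following. It assumes that no word contains a comma or a double quote, and it works on a String rather than a file to stay self-contained; the names QuotedWords and parseWords are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;

final class QuotedWords {

    // Splits text such as "A", "HELLO", "RANDOM" on commas and strips the
    // double quotes and surrounding whitespace from each piece.
    static List<String> parseWords(String text) {
        List<String> words = new ArrayList<>();
        for (String piece : text.split(",")) {
            // String.replace works on literal text, so no regex escaping is needed
            String word = piece.replace("\"", "").trim();
            if (!word.isEmpty()) {
                words.add(word);
            }
        }
        return words;
    }

    public static void main(String[] args) {
        List<String> words = parseWords("\"A\", \"HELLO\", \"RANDOM\", \"WORD\"");
        String[] wordArray = words.toArray(new String[0]);
        for (String w : wordArray) {
            System.out.println(w);
        }
    }
}
```

Collecting into a List first and converting at the end avoids re-declaring the array on every loop iteration, which is what the code in the question does.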

Related

Space in list despite using delimiters and how to remove it?

I am solving project Euler problem 22, wherein the program reads a text file having text format as follows and then tries to alphabetically sort it:
"MARY","PATRICIA","LINDA","BARBARA","ELIZABETH","JENNIFER",
"MARIA","SUSAN","MARGARET","DOROTHY","LISA", etc...
I use a delimiter to eliminate both "" and ",". However, when the ArrayList is sorted, its first element is blank and the sorted result looks like this:
<I get blank space here>,ANNALISA, ANNAMAE, ANNAMARIA, ANNAMARIE,
ANNE, ANNELIESE, ANNELLE, ANNEMARIE, ANNETT, ANNETTA, ANNETTE,
ANNICE, ANNIE, ANNIKA, ANNIS, ANNITA, ANNMARIE, ANTHONY,
ANTIONE, ANTIONETTE, ANTOINE, ANTOINETTE, etc...
My code is
public class Problem22 {
public static void main(String[] args) throws FileNotFoundException {
Scanner scan = new Scanner (new File("file.txt"));
scan.useDelimiter(",|\"| ");
String name = null;
ArrayList<String> names = new ArrayList<>();
while(scan.hasNext()) {
name = scan.next();
names.add(name);
}
scan.close();
Collections.sort(names);
System.out.println(names);
}
}
I need help understanding why I get the blank element. I also tried to remove it, but was unable to:
Pattern b = Pattern.compile("\\|"+"\r\n");
scan.useDelimiter(b);
I changed the regex.
To understand regular expressions (regex):
1: https://docs.oracle.com/javase/7/docs/api/java/util/Scanner.html
2: https://regexone.com/ - practice online
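When I ran your code I actually had multiple empty strings in the result. Your mistake is your delimiter regex. ,|\"| (note the trailing space after the last |) means "split at each single comma, double quote, or space" and not "split at sequences of commas, double quotes, and spaces".
That means that "aaa", "bbb" is split into aaa and bbb plus an empty string for every pair of adjacent delimiter characters between them.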
Change your regex accordingly and it'll work. I used \\W+ (meaning "sequences of non-word characters"), which also dealt with line breaks nicely. If you need more control, use something like [, \"]+.
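As a rough sketch of that fix, the delimiter below matches whole runs of commas, double quotes, and whitespace, so no empty tokens appear between names. It reads from a String instead of file.txt to stay self-contained, and the names NameScanner and readNames are hypothetical.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Scanner;

final class NameScanner {

    // Reads names separated by any run of commas, double quotes, or whitespace
    // (including line breaks) and returns them sorted alphabetically.
    static List<String> readNames(String text) {
        List<String> names = new ArrayList<>();
        try (Scanner scan = new Scanner(text)) {
            // The + makes one delimiter out of a whole sequence such as ","
            scan.useDelimiter("[,\"\\s]+");
            while (scan.hasNext()) {
                names.add(scan.next());
            }
        }
        Collections.sort(names);
        return names;
    }

    public static void main(String[] args) {
        String sample = "\"MARY\",\"PATRICIA\",\"LINDA\",\n\"BARBARA\"";
        System.out.println(readNames(sample));
    }
}
```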

Java Regex expression to match and store any integers

Right now, using Java, I just want it to be able to tokenize any string of integers to an array
input = 1dsa23f hj23nma9123
array = 1,23,23,9123;
I have been trying a few different ways to do it, string.matches("") and then tokenising after it's in the right format and what not but it is too limiting to the user.
It looks like you are looking for something like
String[] nums = text.split("\\D+");
The regex \D is the negation of \d (it is like [^\d]), which means \D+ will match one or more non-digits.
The only problem with this solution is that if your text starts with non-digits, the result array will start with one empty string.
If you still want to use split then you can simply remove that non-digits part from start of your text.
String[] nums = text.replaceFirst("^\\D+","").split("\\D+");
An approach other than split, which focuses on finding delimiters, is to focus on finding the parts that interest us. So instead of searching for non-digits, let's find digits.
We can do it in a few ways, such as Pattern/Matcher#find or Scanner. The problem here is that these approaches don't return an array but single elements, which you need to store in some resizable structure like a List.
So solution using Pattern and Matcher could look like:
List<String> numbers = new ArrayList<>();
Matcher m = Pattern.compile("\\d+").matcher(yourText);
while(m.find()){
numbers.add(m.group());
}
The solution using Scanner is similar: we just need to set the proper delimiter (non-digits) and read everything that is not a delimiter (delimiters at the start of the text are skipped, which should prevent empty strings from being returned).
List<String> nums = new ArrayList<>();
Scanner sc = new Scanner(yourText);
sc.useDelimiter("\\D+");
while(sc.hasNext()){
nums.add(sc.next());
}
final String input = "1dsa23f hj23nma9123";
final String[] parts = input.split("[^0-9]+");
for (final String s: parts) {
final int i = Integer.parseInt(s);
}
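Note that the split-and-parseInt snippet above throws a NumberFormatException when the input starts with a non-digit, because the first array element is then an empty string. A minimal sketch that avoids this by matching the digits directly is shown below; it assumes every run of digits fits in an int, and the names IntExtractor and extractInts are made up for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class IntExtractor {

    private static final Pattern DIGITS = Pattern.compile("\\d+");

    // Returns every maximal run of digits in the text as an int.
    // Assumption: each run is small enough for Integer.parseInt.
    static int[] extractInts(String text) {
        List<Integer> found = new ArrayList<>();
        Matcher m = DIGITS.matcher(text);
        while (m.find()) {
            found.add(Integer.parseInt(m.group()));
        }
        int[] result = new int[found.size()];
        for (int i = 0; i < result.length; i++) {
            result[i] = found.get(i);
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(extractInts("1dsa23f hj23nma9123")));
    }
}
```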

How to use split function when input is new line?

The task is to split a string and report how many words it contains.
Scanner in = new Scanner(System.in);
String st = in.nextLine();
String[] tokens = st.split("[\\W]+");
When I give an empty line as input and print the number of tokens, I get one, but I want zero. What should I do? Here the delimiters are all the symbols.
Short answer: To get the tokens in str (determined by whitespace separators), you can do the following:
String str = ... //some string
str = str.trim() + " "; //modify the string for the reasons described below
String[] tokens = str.split("\\s+");
Longer answer:
First of all, the argument to split() is the delimiter - in this case one or more whitespace characters, which is "\\s+".
If you look carefully at the Javadoc of String#split(String, int) (which is what String#split(String) calls), you will see why it behaves like this.
Javadoc: "If the expression does not match any part of the input then the resulting array has just one element, namely this string."
This is why "".split("\\s+") would return an array with one empty string [""], so you need to append the space to avoid this. " ".split("\\s+") returns an empty array with 0 elements, as you want.
Javadoc: "When there is a positive-width match at the beginning of this string then an empty leading substring is included at the beginning of the resulting array."
This is why " a".split("\\s+") would return ["", "a"], so you need to trim() the string first to remove whitespace from the beginning.
Javadoc: "If n is zero then the pattern will be applied as many times as possible, the array can have any length, and trailing empty strings will be discarded."
Since String#split(String) calls String#split(String, int) with the limit argument of zero, you can add whitespace to the end of the string without changing the number of words (because trailing empty strings will be discarded).
UPDATE:
If the delimiter is "\\W+", it's slightly different because you can't use trim() for that:
String str = ...
str = str.replaceAll("^\\W+", "") + " ";
String[] tokens = str.split("\\W+");
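As an alternative to appending a space, the empty case can also be checked explicitly before splitting. The sketch below takes that route (it is a different technique from the one in the answer above), and the names WordCounter and countWords are hypothetical.

```java
final class WordCounter {

    // Counts the words in a line, where words are separated by runs of
    // non-word characters. Returns 0 for an empty or symbols-only line.
    static int countWords(String line) {
        // Drop leading delimiters so split() does not produce a leading ""
        String stripped = line.replaceAll("^\\W+", "");
        if (stripped.isEmpty()) {
            return 0;
        }
        // Trailing empty strings are discarded by split(regex)
        return stripped.split("\\W+").length;
    }

    public static void main(String[] args) {
        System.out.println(countWords(""));
        System.out.println(countWords("  hello, world!"));
    }
}
```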
public static void main(String[] args) {
Scanner in = new Scanner(System.in);
String line = null;
while (!(line = in.nextLine()).isEmpty()) {
//logic
}
System.out.print("Empty Line");
}
output
Empty Line

How to get the string between double quotes in a string in Java [duplicate]

For example, input will be like:
AddItem rt456 4 12 BOOK “File Structures” “Addison-Wesley” “Michael Folk”
and I want to read all of it using Scanner and put it in an array,
like:
info[0] = rt456
info[1] = 4
..
..
info[4] = File Structures
info[5] = Addison-Wesley
So how can I get the string between quotes?
EDIT: a part of my code->
public static void main(String[] args) {
String command;
String[] line = new String[6];
Scanner read = new Scanner(System.in);
Library library = new Library();
command = read.next();
if(command.matches("AddItem"))
{
line[0] = read.next(); // Serial Number
line[1] = read.next(); // Shelf Number
line[2] = read.next(); // Shelf Index
command = read.next(); // Type of the item. "Book" - "CD" - "Magazine"
if(command.matches("BOOK"))
{
line[3] = read.next(); // Name
line[4] = read.next(); // Publisher
line[5] = read.next(); // Author
Book yeni = new Book(line[0],Integer.parseInt(line[1]),Integer.parseInt(line[2]),line[3],line[4],line[5]);
}
}
}
so I use read.next to read String without quotes.
SOLVED BY USING REGEX AS
read.next("([^\"]\\S*|\".+?\")\\s*");
You can use StreamTokenizer for this in a pinch. If operating on a String, wrap it with a StringReader. If operating on a file just pass your Reader to it.
// Replace “ and ” with " to make parsing easier; do this only if you truly are
// using pretty quotes (as you are in your post).
inputString = inputString.replaceAll("[“”]", "\"");
StreamTokenizer tokenizer = new StreamTokenizer(new StringReader(inputString));
tokenizer.resetSyntax();
tokenizer.whitespaceChars(0, 32);
tokenizer.wordChars(33, 255);
tokenizer.quoteChar('\"');
while (tokenizer.nextToken() != StreamTokenizer.TT_EOF) {
// tokenizer.sval will contain the token
System.out.println(tokenizer.sval);
}
You will have to use an appropriate configuration for non-ASCII text, the above is just an example.
If you want to pull numbers out separately, then the default StreamTokenizer configuration is fine, although it uses double and provides no int numeric tokens. Annoyingly, it is not possible to simply disable number parsing without resetting the syntax from scratch.
If you don't want to mess with all this, you could also consider changing the input format to something more convenient, as in Steve Sarcinella's good suggestion, if it is appropriate.
As a reference, take a look at this: Scanner Docs
How you read from the scanner is determined by how you will present the data to your user.
If they are typing it all on one line:
Scanner scanner = new Scanner(System.in);
String result = "";
System.out.println("Enter Data:");
result = scanner.nextLine();
Otherwise if you split it up into input fields you could do:
Scanner scanner = new Scanner(System.in);
System.out.println("Enter Identifier:");
info[0] = scanner.nextLine();
System.out.println("Enter Num:");
info[1] = scanner.nextLine();
...
If you want to validate anything before assigning the data to a variable, try using scanner.next(""); where the quotes contain a regex pattern to match
EDIT:
Check here for regex info.
As an example, say I have a string
String foo = "The cat in the hat";
Regex (regular expressions) can be used to manipulate this string quickly and concisely. If I take that string and do foo = foo.replaceAll("\\s+", "");, this replaces every run of whitespace with nothing, thereby eliminating whitespace. Note that it has to be replaceAll: String.replace treats its first argument as literal text, not as a regex.
Breaking down the argument \\s+, we have \s which means match any character that is whitespace.
The extra \ before \s is an escape character: inside a Java string literal a backslash must itself be escaped, so \\s in source code produces the regex \s.
The + means match the previous expression 1 or more times.
So foo, after running replaceAll, would be "Thecatinthehat".
The same regex logic applies to scanner.next(String regex);
Hopefully this helps a bit more, I'm not the best at explanation :)
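To make the difference between String.replace and String.replaceAll concrete, here is a small sketch. The class and method names (WhitespaceDemo, stripWhitespace) are made up for illustration.

```java
final class WhitespaceDemo {

    // Removes every run of whitespace; replaceAll interprets its first
    // argument as a regular expression.
    static String stripWhitespace(String s) {
        return s.replaceAll("\\s+", "");
    }

    public static void main(String[] args) {
        String foo = "The cat in the hat";

        // replace() looks for the literal characters \s+ and finds none,
        // so the string comes back unchanged.
        System.out.println(foo.replace("\\s+", ""));

        // replaceAll() treats \s+ as a regex and removes the spaces.
        System.out.println(stripWhitespace(foo));
    }
}
```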
An alternative using a messy regular expression:
public static void main(String[] args) throws Exception {
Pattern p = Pattern.compile("^(\\w*)[\\s]+(\\w*)[\\s]+(\\w*)[\\s]+(\\w*)[\\s]+(\\w*)[\\s]+[“](.*)[”][\\s]+[“](.*)[”][\\s]+[“](.*)[”]");
Matcher m = p.matcher("AddItem rt456 4 12 BOOK “File Structures” “Addison-Wesley” “Michael Folk”");
if (m.find()) {
for (int i=1;i<=m.groupCount();i++) {
System.out.println(m.group(i));
}
}
}
That prints:
AddItem
rt456
4
12
BOOK
File Structures
Addison-Wesley
Michael Folk
I assumed the quotes are as you typed them in the question, “” and not "", so they don't need to be escaped.
You can try this. I have prepared the demo for your requirement
public static void main(String args[]) {
String str = "\"ABC DEF\"";
System.out.println(str);
String str1 = str.replaceAll("\"", "");
System.out.println(str1);
}
After reading just replace the double quotes with empty string
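For comparison with the approaches above, here is a minimal regex-based sketch that keeps a quoted phrase together as one token and drops the quotes. It assumes quotes are never nested or escaped, accepts both straight quotes and the curly quotes from the question (written as Unicode escapes), and uses the hypothetical names QuotedTokenizer and tokenize.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class QuotedTokenizer {

    // Alternatives, tried in order at each position:
    //   1. "..."  straight double quotes
    //   2. curly quotes U+201C ... U+201D
    //   3. a run of non-whitespace characters
    private static final Pattern TOKEN = Pattern.compile(
            "\"([^\"]*)\"|\u201C([^\u201D]*)\u201D|(\\S+)");

    static List<String> tokenize(String line) {
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(line);
        while (m.find()) {
            if (m.group(1) != null) {
                tokens.add(m.group(1));      // content of straight quotes
            } else if (m.group(2) != null) {
                tokens.add(m.group(2));      // content of curly quotes
            } else {
                tokens.add(m.group(3));      // plain word
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        String line = "AddItem rt456 4 12 BOOK \u201CFile Structures\u201D "
                + "\u201CAddison-Wesley\u201D \u201CMichael Folk\u201D";
        for (String token : tokenize(line)) {
            System.out.println(token);
        }
    }
}
```

With a full line read via nextLine(), the returned list could be copied into the info array from the question.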

StringTokenizer in java. Why is it adding one more space

I am using jdk 1.6 (it is older but ok). I have a function like this:
public static ArrayList gettokens(String input, String delim)
{
ArrayList tokenArray = new ArrayList();
StringTokenizer tokens = new StringTokenizer(input, delim);
while (tokens.hasMoreTokens())
{
tokenArray.add(tokens.nextToken());
}
return tokenArray;
}
My initial intention is to use tokens to clear the input string of duplicate emails (that is initial).
Let's say I have
input = ", email-1#email.com, email-2#email.com, email-3#email.com"; //yes with , at the beginning
delim = ";,";
And when I run above function the result is:
[email-1#email.com, email-2#email.com, email-3#email.com]
Which is fine, but there is one more space added between the , and the email.
Why is that, and how can I fix it?
Edit:
here is the function that prints the output:
List<String> tokens = StringUtility.gettokens(email, ";,");
Set<String> emailSet = new LinkedHashSet<String>(tokens);
emails = StringUtils.join(emailSet, ", ");
hehe, and now I see the answer.
Edit 2 - the root cause:
the root cause of the problem was that line of the code:
emails = StringUtils.join(emailSet, ", ");
was adding an extra ", " when joining tokens.
From the example above, one token looks like " email-1#email.com", and when join is applied it adds a comma and a space before the token. So if a token has a space at its beginning, there will be two spaces between the comma and the email.
Example:
", " + " email-1#email.com" = ",<space><space>email-1#email.com"
When printing an ArrayList, it prints all the objects separated by a comma and a space. Your input also has a space after each comma, which is what causes the two spaces.
You can use:
tokenArray.add(tokens.nextToken().trim());
to remove unwanted spaces from your input.
You've got spaces in your string, and ArrayList's implementation of toString adds a space before each element. The idea is that if you've got a list of "x", "y" and "z", the output should be "[x, y, z]" rather than "[x,y,z]".
Your real problem probably is that you've kept the spaces in the tokens. Fix:
public static List<String> gettokens(String input, String delim)
{
ArrayList<String> tokenArray = new ArrayList<String>();
StringTokenizer tokens = new StringTokenizer(input, delim);
while (tokens.hasMoreTokens())
{
tokenArray.add(tokens.nextToken().trim());
}
return tokenArray;
}
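You can change the delim to include the space (", "); StringTokenizer treats every character of delim as a separate delimiter, so the space would then not be contained in the tokens.
Easier would be to use the split() method, which returns a String array. Note that split() takes a regular expression, so the delimiters ; and , would have to be passed as "[;,]". The method would basically look like:
public static List<String> gettokens(String input, String delim)
{
// delim is a regex here, e.g. "[;,]"; Arrays.asList returns a fixed-size List, not an ArrayList
return Arrays.asList(input.split(delim));
}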
I think it would be a better approach to use split method of String, just because it would be shorter. All you would need to do is :
String[] values = input.split(delim);
It will return an array instead of a List.
The reason for the extra space is that you are adding it in your printing method.
List<String> tokens = StringUtility.gettokens(email, ";,");
Set<String> emailSet = new LinkedHashSet<String>(tokens);
emails = StringUtils.join(emailSet, ", "); //adds a space after a comma
So StringTokenizer works as expected.
In your case, without much modifying the code, you could use trim function to clear the spaces before removing duplicates, and then join with separator ", " like this:
tokenArray.add(tokens.nextToken().trim());
And you will get result without two spaces.
There is no space or comma in between.
Try to print your ArrayList as:
for(Object obj: tokenArray )
System.out.println(obj);
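The answers above can be combined into one small sketch: split on either delimiter, trim each token, drop duplicates while keeping the original order, then join. It assumes Java 8 or later for String.join (the question mentions JDK 1.6, where StringUtils.join from the question would be used instead), and the names EmailCleaner and dedupe are hypothetical.

```java
import java.util.LinkedHashSet;
import java.util.Set;

final class EmailCleaner {

    // Splits on ';' or ',', trims each token, removes duplicates while
    // preserving first-seen order, and joins the result with ", ".
    static String dedupe(String input) {
        Set<String> unique = new LinkedHashSet<>();
        for (String token : input.split("[;,]")) {
            String email = token.trim();
            if (!email.isEmpty()) {      // skips the empty token before a leading comma
                unique.add(email);
            }
        }
        return String.join(", ", unique);
    }

    public static void main(String[] args) {
        String input = ", email-1#email.com, email-2#email.com, email-1#email.com";
        System.out.println(dedupe(input));
    }
}
```

Trimming before inserting into the set also matters for the de-duplication itself, since " email-1#email.com" and "email-1#email.com" would otherwise count as different entries.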
