Splitting Strings in a stream in Java? - java

I have an assignment where we're reading textfiles and counting the occurrences of each word (ignoring punctuation). We don't have to use streams but I want to practice using them.
So far I am able to read a text file and put each line in a string, and all the strings in a list using this:
try (Stream<String> p = Files.lines(FOLDER_OF_TEXT_FILES)) {
list = p.map(line -> line.replaceAll("[^A-Za-z0-9 ]", ""))
.collect(Collectors.toList());
}
However, so far, it simply makes all the lines a single String, so each element of the list is not a word, but a line. Is there a way using streams that I can have each element be a single word, using something like String's split method with regex? Or will I have to handle this outside the stream itself?

I may have misunderstood your question, but if you just want comma-separated words, you can try the code below.
Replace line.replaceAll("[^A-Za-z0-9 ]", "") with Arrays.asList(line.replaceAll("[^A-Za-z0-9 ]", "").split(" ")).stream().collect(Collectors.joining(","))
Then use the joining collector again on the list to get a comma-separated String of words.
String commaSeperated = list.stream().collect(Collectors.joining(",")) ;
You can perform further operations on the final string as per your requirement.

Instead of applying replaceAll on a line, do it on words of the line as follows:
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;
public class Main {
public static void main(String[] args) {
String str = "Harry is a good cricketer. Tanya is an intelligent student. Bravo!";
List<String> words = Arrays.stream(str.split("\\s+")).map(s -> s.replaceAll("[^A-Za-z0-9 ]", ""))
.collect(Collectors.toList());
System.out.println(words);
}
}
Output:
[Harry, is, a, good, cricketer, Tanya, is, an, intelligent, student, Bravo]
Note: The regex \\s+ splits a string on one or more whitespace characters.

try this:
String fileName = "file.txt";
try {
Map<String, Long> wordCount = Files.lines(Path.of(fileName))
.flatMap(line -> Arrays.stream(line.split("\\s+")))
.filter(w->w.matches("[a-zA-Z]+"))
.sorted(Comparator.comparing(String::length)
.thenComparing(String.CASE_INSENSITIVE_ORDER))
.collect(Collectors.groupingBy(w -> w,
LinkedHashMap::new, Collectors.counting()));
wordCount.entrySet().forEach(System.out::println);
}catch (Exception e) {
e.printStackTrace();
}
This is relatively simple. It just splits on white space and counts the words by putting them in a map where the Key is the word and the Value is a long containing the count.
I included a filter to only capture words of nothing but letters.
The way this works is that the lines are put into a stream. Each line is then split into words using String.split. Since this creates an array for each line, Arrays.stream turns each array into a stream, and flatMap merges all these individual streams of words into a single stream where they are processed.
The workhorse of this is Collectors.groupingBy, which groups the values in a particular way for each key. In this case, I specified the Collectors.counting() collector to increase the count each time the key (i.e. the word) appeared.
As an option, I sorted the words first on length and then alphabetically, ignoring case.
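To see the pipeline described above without needing a file, here is a minimal sketch that runs the same steps on an in-memory list of lines. The class name, the method name, and the sample lines are made up for illustration; only the stream operations come from the answer.

```java
import java.util.Arrays;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Hypothetical class name; the pipeline mirrors the answer above.
class SortedWordCountSketch {

    // Counts letter-only words. LinkedHashMap keeps the sorted encounter order:
    // shorter words first, ties broken alphabetically (ignoring case).
    static Map<String, Long> countWords(List<String> lines) {
        return lines.stream()
                .flatMap(line -> Arrays.stream(line.split("\\s+")))
                .filter(w -> w.matches("[a-zA-Z]+"))
                .sorted(Comparator.comparing(String::length)
                        .thenComparing(String.CASE_INSENSITIVE_ORDER))
                .collect(Collectors.groupingBy(w -> w,
                        LinkedHashMap::new, Collectors.counting()));
    }

    public static void main(String[] args) {
        // Sample data, not from the original question.
        List<String> lines = List.of("the cat sat", "the dog sat down.");
        System.out.println(countWords(lines)); // prints {cat=1, dog=1, sat=2, the=2}
    }
}
```

Note that "down." does not appear in the result: the filter drops any token that is not made up purely of letters, rather than stripping the punctuation from it.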

First, for each line, we remove all non-alphanumeric characters (except spaces), then we split on whitespace, so all elements are single words. Since we're flat-mapping, the stream consists of all words. Then we simply collect using the groupingBy collector, with counting() as the downstream collector. That leaves us with a Map<String, Long> where the key is the word and the value is the number of occurrences.
Map<String, Long> counts = p
.flatMap(line -> Arrays.stream(line.replaceAll("[^0-9A-Za-z ]+", "").split("\\s+")))
.collect(Collectors.groupingBy(Function.identity(), Collectors.counting()));
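One detail worth checking with this approach: split("\\s+") returns an empty first element for a blank line or a line that starts with whitespace, and that empty string would be counted as a word. The sketch below is one possible way to guard against that; the class name, the sample lines, and the extra filter step are my additions, not part of the answer.

```java
import java.util.Arrays;
import java.util.Map;
import java.util.function.Function;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical class name; the filter step is an assumption added for illustration.
class CleanThenSplitSketch {

    static Map<String, Long> countWords(Stream<String> lines) {
        return lines
                // Strip everything except letters, digits and spaces, then split on whitespace.
                .flatMap(line -> Arrays.stream(
                        line.replaceAll("[^0-9A-Za-z ]+", "").split("\\s+")))
                // Blank or indented lines produce an empty first token; drop it.
                .filter(word -> !word.isEmpty())
                .collect(Collectors.groupingBy(Function.identity(), Collectors.counting()));
    }

    public static void main(String[] args) {
        // Sample data: includes an empty line and a line with leading spaces.
        Map<String, Long> counts = countWords(Stream.of("It's a dog.", "", "  A dog, a cat!"));
        System.out.println(counts.get("dog")); // prints 2
        System.out.println(counts.get(""));    // prints null
    }
}
```

The counts stay case-sensitive here ("A" and "a" are separate keys); adding .map(String::toLowerCase) before collecting would merge them.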

Since line boundaries are irrelevant when you want to process words, the preferred way is not to bother with splitting the file into lines and then the lines into words, but to split the file into words in the first place. You can use something like:
Map<String,Long> wordsAndCounts;
try(Scanner s = new Scanner(Paths.get(path))) {
wordsAndCounts = s.findAll("\\w+")
.collect(Collectors.groupingBy(MatchResult::group, Collectors.counting()));
}
wordsAndCounts.forEach((w,c) -> System.out.println(w+":\t"+c));
The findAll method of Scanner requires Java 9 or newer. This answer contains an implementation of findAll for Java 8, which lets you use it on Java 8 and easily migrate to newer versions by just switching to the standard method.
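For experimenting without a file, the same idea can be tried on a String, since Scanner also has a constructor that takes one. This is only a sketch: the class name and sample text are invented, and the result is copied into a TreeMap purely to make the printed order predictable. It needs Java 9 or newer.

```java
import java.util.Map;
import java.util.Scanner;
import java.util.TreeMap;
import java.util.regex.MatchResult;
import java.util.stream.Collectors;

// Hypothetical class name; requires Java 9+ for Scanner.findAll.
class ScannerFindAllSketch {

    static Map<String, Long> countWords(String text) {
        try (Scanner s = new Scanner(text)) {
            Map<String, Long> counts = s.findAll("\\w+")
                    .collect(Collectors.groupingBy(MatchResult::group, Collectors.counting()));
            // Sorted copy, only so that the output below is deterministic.
            return new TreeMap<>(counts);
        }
    }

    public static void main(String[] args) {
        // Line breaks are treated like any other non-word character.
        System.out.println(countWords("one fish,\ntwo fish.\nRed fish!"));
        // prints {Red=1, fish=3, one=1, two=1}
    }
}
```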

For the entire "read a text file and count each word using streams", I suggest using something like this:
try (Stream<String> lines = Files.lines(FOLDER_OF_TEXT_FILES)) {
lines.flatMap(l -> Arrays.stream(l.split(" ")))
.collect(Collectors.groupingBy(Function.identity(), Collectors.counting()));
}
There is no need to first collect everything into a list, this can be done inline.
Also it's good that you used try-with-resources.

One could use Pattern.splitAsStream to split a string in a performant way and at the same time remove non-word characters before creating a map of occurrence counts:
Pattern splitter = Pattern.compile("(\\W*\\s+\\W*)+");
String fileStr = Files.readString(Path.of(FOLDER_OF_TEXT_FILES));
Map<String, Long> collect = splitter.splitAsStream(fileStr)
.collect(groupingBy(Function.identity(), counting()));
System.out.println(collect);
For splitting and removal of non word characters we are using the pattern (\W*\s+\W*)+ where we look for optional non word characters then a space and then again for optional non word characters.
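A small sketch to see what this pattern does on a short sample; the class name and sample text are made up. Because every match must contain at least one whitespace character, punctuation that is not next to whitespace (for example at the very end of the text) stays attached to its word.

```java
import java.util.List;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

// Hypothetical class name; the pattern is the one from the answer above.
class SplitAsStreamSketch {

    private static final Pattern SPLITTER = Pattern.compile("(\\W*\\s+\\W*)+");

    static List<String> tokens(String text) {
        return SPLITTER.splitAsStream(text).collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // ", " and "! " are consumed as separators, but the final "." has no
        // whitespace next to it, so it remains part of the last token.
        System.out.println(tokens("Hello, world! Bye.")); // prints [Hello, world, Bye.]
    }
}
```

If that matters for your input, one option is to strip leading and trailing non-word characters from each token in a map step before grouping.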

Related

How can I use Java Stream to find the average of all values that share a key?

I'm having a lot of trouble trying to average the values of a map in Java. My method takes in a text file and computes the average length of the words starting with each letter (case insensitive, going through all words in the text file).
For example, let's say I have a text file that contains the following:
"Apple arrow are very common Because bees behave Cant you come home"
My method currently returns:
{A=5, a=8, B=7, b=10, c=10, C=5, v=4, h=4, y=3}
Because it is looking at the letters and finding the average length of the word, but it is still case sensitive.
It should return:
{a=4.3, b=5.5, c=5.0, v=4.0, h=4.0, y=3}
This is what I have so far.
public static Map<String, Integer> findAverageLength(String filename) {
Map<String, Integer> wordcount = new TreeMap<>(String.CASE_INSENSITIVE_ORDER);
try
{
Scanner in = new Scanner(new File(filename));
List<String> wordList = new ArrayList<>();
while (in.hasNext())
{
wordList.add(in.next());
}
wordcount = wordList.stream().collect(Collectors.toConcurrentMap(w->w.substring(0,1), w -> w.length(), Integer::sum));
System.out.println(wordcount);
}
catch (IOException e)
{
System.out.println("File: " + filename + " not found");
}
return wordcount;
}
You are almost there.
You could try the following.
We group by the first character of the word, converted to lowercase. This lets us collect into a Map<Character, …>, where the key is the first letter of each word. A typical map entry would then look like
a = [ Apple, arrow, are ]
Then, the average of each group of word lengths is calculated, using the averagingDouble method. A typical map entry would then look like
a = 4.33333333
Here is the code:
// groupingBy and averagingDouble are static imports from
// java.util.stream.Collectors
Map<Character, Double> map = Arrays.stream(str.split(" "))
.collect(groupingBy(word -> Character.toLowerCase(word.charAt(0)),
averagingDouble(String::length)));
Note that, for brevity, I left out additional things like null checks, empty strings and Locales.
Also note that this code was heavily improved responding to the comments of Olivier Grégoire and Holger below.
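For reference, here is a self-contained sketch of the above using the sample sentence from the question. The class and method names are made up, and the empty-string filter is an addition of mine (it covers one of the checks the answer says it left out).

```java
import static java.util.stream.Collectors.averagingDouble;
import static java.util.stream.Collectors.groupingBy;

import java.util.Arrays;
import java.util.Map;

// Hypothetical class name; the collector chain mirrors the answer above.
class AverageByFirstLetterSketch {

    static Map<Character, Double> averageLengths(String text) {
        return Arrays.stream(text.split(" "))
                // Assumption: skip empty tokens (e.g. from double spaces) so charAt(0) is safe.
                .filter(word -> !word.isEmpty())
                .collect(groupingBy(word -> Character.toLowerCase(word.charAt(0)),
                        averagingDouble(String::length)));
    }

    public static void main(String[] args) {
        String str = "Apple arrow are very common Because bees behave Cant you come home";
        Map<Character, Double> averages = averageLengths(str);
        // 'a' covers Apple (5), arrow (5) and are (3), so its average is 13/3.
        averages.forEach((letter, avg) -> System.out.println(letter + " = " + avg));
    }
}
```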
You can try with the following:
String str = "Apple arrow are very common Because bees behave Cant you come home";
Map<String, Double> map = Arrays.stream(str.split(" "))
.collect(Collectors.groupingBy(s -> String.valueOf(Character.toLowerCase(s.charAt(0))),
Collectors.averagingDouble(String::length)));
The split method splits the string into an array of strings using the delimiter " ". Then, you want to group the words and compute the average string length of each group; hence the use of the Collectors.groupingBy method with the downstream collector Collectors.averagingDouble(String::length). Finally, given the constraints that you have described, we need to group by the lower case (or upper case) of the first char in the String, i.e. Character.toLowerCase(s.charAt(0)).
and then print the map:
map.entrySet().forEach(System.out::println);
If you do not need to keep the map structure you can do it in one go:
Arrays.stream(str.split(" "))
.collect(Collectors.groupingBy(s -> String.valueOf(Character.toLowerCase(s.charAt(0))), Collectors.averagingDouble(String::length)))
.entrySet().forEach(System.out::println);
Just convert the first letter, which you obtain using substring, to the same case. Upper or lower, doesn't matter.
w.substring(0,1).toLowerCase()
You've defined a case-insensitive map, but you haven't used it. Try Collectors.toMap(w->w.substring(0,1), w -> w.length(), Integer::sum, () -> new TreeMap<String, Integer>(String.CASE_INSENSITIVE_ORDER)), or just Collectors.toMap(w->w.toUpperCase().substring(0,1), w -> w.length(), Integer::sum)
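To illustrate the map-supplier variant, here is a minimal sketch with the sample sentence from the question (class and method names are made up). Like the original code, it sums the lengths per letter rather than averaging them; the point is only that the case-insensitive TreeMap merges "A" and "a" into one entry.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

// Hypothetical class name; shows Collectors.toMap with a map supplier.
class CaseInsensitiveSumSketch {

    static Map<String, Integer> sumLengths(List<String> words) {
        return words.stream().collect(Collectors.toMap(
                w -> w.substring(0, 1),   // key: first letter, case left untouched
                w -> w.length(),          // value: word length
                Integer::sum,             // merge: add lengths for equal keys
                () -> new TreeMap<String, Integer>(String.CASE_INSENSITIVE_ORDER)));
    }

    public static void main(String[] args) {
        List<String> words = Arrays.asList(
                "Apple arrow are very common Because bees behave Cant you come home".split(" "));
        Map<String, Integer> sums = sumLengths(words);
        // Lookups ignore case because of the comparator.
        System.out.println(sums.get("a")); // prints 13 (Apple + arrow + are)
        System.out.println(sums.get("C")); // prints 14 (common + Cant + come)
    }
}
```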

ContainsIgnoreCase in stream filter to count one particular word occurence in list of String

I want to count the occurrences of a single word in a List of String in Java. This task seems easy, but I ran into a problem with words which start with a capital letter or have a , or . at the end.
My method looks like:
public static Long countWordOccurence(List<String> wordList, String word) {
return wordList.stream()
.filter(s -> word.contains(s))
.collect(Collectors.groupingBy(Function.identity(), Collectors.counting()))
.values()
.stream()
.findFirst()
.orElse((long) -1);
}
The above code works fine for the normal scenario, but the problem occurs for corner cases like a comma at the end of the string (e.g. Test,) or a String which starts with a capital letter.
I am splitting my string list like:
Arrays.asList(TEXT_TO_PARSE.split(" "));
If possible, I would like to avoid additional dependencies, but I will not rule them out if they are necessary.
I will be grateful for a suggestion on how to fix my filter clause in a stream to count strings properly.
There are several fundamental problems with your code.
- .filter(s -> word.contains(s)) performs a substring search. Contrary to your question’s title, it does not ignore case. Still, there can be strings of different content passing the filter.
- .collect(Collectors.groupingBy(Function.identity(), Collectors.counting())) creates groups according to the string’s actual content. So when multiple different strings passed the previous filter, multiple groups may exist.
- .values().stream().findFirst(): since the groupingBy created a map with an unspecified ordering, this will pick an arbitrary group. Besides that, it’s a very inefficient way to ask just for the count().
- .orElse((long) -1): the value -1 is a very strange fall-back for counting, as the most natural answer would be “zero” when there are no matches.
So a straight-forward solution would be
public static long countWordOccurence(List<String> wordList, String word) {
return Collections.frequency(wordList, word);
}
for counting case sensitive matches or
public static long countWordOccurence(List<String> wordList, String word) {
return wordList.stream().filter(word::equalsIgnoreCase).count();
}
for counting case insensitive.
But that’s an xy problem anyway.
When you want to count occurrences of a word in a string, it’s not necessary to split the string into words and to convert the array into a list (by the way, you can stream over an array directly), before performing the actual search.
You can use
public static long countWordOccurence(String sentence, String word) {
if(!word.codePoints().allMatch(Character::isLetter))
throw new IllegalArgumentException(word+" is not a word");
Pattern p = Pattern.compile("\\b"+word+"\\b");
return p.matcher(sentence).results().count();
}
for a count of case sensitive matches and
public static long countWordOccurence(String sentence, String word) {
if(!word.codePoints().allMatch(Character::isLetter))
throw new IllegalArgumentException(word+" is not a word");
Pattern p = Pattern.compile("\\b"+word+"\\b", Pattern.CASE_INSENSITIVE);
return p.matcher(sentence).results().count();
}
for the case insensitive matches. The \b pattern denotes word boundaries, which only makes sense if the search string is actually a word. So the methods above have a pre-test for that, which also ensures that the word does not contain characters that could be misinterpreted as regex patterns.
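As a quick check of how the word-boundary pattern behaves with punctuation and capital letters, here is a small sketch that folds both variants into one method. The class name, the boolean flag, and the sample sentence are my own; Matcher.results() needs Java 9 or newer.

```java
import java.util.regex.Pattern;

// Hypothetical class name; combines the two methods above behind a flag.
class WordBoundaryCountSketch {

    static long count(String sentence, String word, boolean ignoreCase) {
        // Same pre-test as above: only plain letters, so nothing is read as regex syntax.
        if (!word.codePoints().allMatch(Character::isLetter))
            throw new IllegalArgumentException(word + " is not a word");
        Pattern p = Pattern.compile("\\b" + word + "\\b",
                ignoreCase ? Pattern.CASE_INSENSITIVE : 0);
        return p.matcher(sentence).results().count();
    }

    public static void main(String[] args) {
        String sentence = "Test, test. testing TEST!";
        System.out.println(count(sentence, "test", false)); // prints 1
        System.out.println(count(sentence, "test", true));  // prints 3
    }
}
```

The trailing comma, full stop and exclamation mark do not get in the way, and "testing" is not counted because there is no word boundary after its first four letters.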
The results() method was introduced in Java 9. This answer shows a solution for creating such a stream under Java 8, however, for such a simple task as counting the occurrences, the alternative would be not to use streams here:
public static long countWordOccurence(String sentence, String word) {
if(!word.codePoints().allMatch(Character::isLetter))
throw new IllegalArgumentException(word+" is not a word");
Pattern p = Pattern.compile("\\b"+word+"\\b", Pattern.CASE_INSENSITIVE);
int count = 0;
for(Matcher m = p.matcher(sentence); m.find(); count++) {}
return count;
}

How to find total count of Words, total count of Vowels, total count of Special Character in a text file using java 8

I have a text file and I want to check
- total words count in file
- total vowels count in file
- total special character in file
using Java 8 Streams.
I want the output as a Map, in a single iteration if possible, i.e.
{"totalWordCount":10,"totalVowelCount":10,"totalSpecialCharacter":10}
I tried the code below
Long wordCount=Files.lines(child).parallel().flatMap(line -> Arrays.stream(line.trim().split(" ")))
.map(word -> word.replaceAll("[^a-zA-Z]", "").toLowerCase().trim())
.filter(word -> !word.isEmpty())
.collect(Collectors.groupingBy(Function.identity(), Collectors.counting())).values().stream().reduce(0L, Long::sum)
but it gives me only the total word count. I am wondering if it's possible to return a single map which contains all the counts, as in the output above.
If we only had to count special characters and vowels, we could use something like this:
Map<String,Long> result;
try(Stream<String> lines = Files.lines(path)) {
result = lines
.flatMap(Pattern.compile("\\s+")::splitAsStream)
.flatMapToInt(String::chars)
.filter(c -> !Character.isAlphabetic(c) || "aeiou".indexOf(c) >= 0)
.mapToObj(c -> "aeiou".indexOf(c)>=0? "totalVowelCount": "totalSpecialCharacter")
.collect(Collectors.groupingBy(Function.identity(), Collectors.counting()));
}
First we flatten the stream of lines to a stream of words, then to a stream of characters, to group them by their type. This works smoothly as “special character” and “vowel” are mutually exclusive. In principle, the flattening to words could have been omitted if we just extend the filter to skip white-space characters, but here, it helps getting to a solution counting words.
Since words are a different kind of entity than characters, counting them in the same operation is not that straight-forward. One solution is to inject a pseudo character for each word and count it just like other characters at the end. Since all actual characters are positive, we can use -1 for that:
Map<String,Long> result;
try(Stream<String> lines = Files.lines(path)) {
result = lines.flatMap(Pattern.compile("\\s+")::splitAsStream)
.flatMapToInt(w -> IntStream.concat(IntStream.of(-1), w.chars()))
.mapToObj(c -> c==-1? "totalWordCount": "aeiou".indexOf(c)>=0? "totalVowelCount":
Character.isAlphabetic(c)? "totalAlphabetic": "totalSpecialCharacter")
.collect(Collectors.groupingBy(Function.identity(), Collectors.counting()));
}
This adds a "totalAlphabetic" category in addition to the others into the result map. If you do not want that, you can insert a .filter(cat -> !cat.equals("totalAlphabetic")) step between the mapToObj and collect steps. Or use a filter like in the first solution before the mapToObj step.
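To make the pseudo-character trick easier to follow, here is a sketch that runs the same pipeline on two in-memory lines instead of a file. The class name, method name, and sample lines are invented for illustration.

```java
import java.util.Map;
import java.util.function.Function;
import java.util.regex.Pattern;
import java.util.stream.Collectors;
import java.util.stream.IntStream;
import java.util.stream.Stream;

// Hypothetical class name; the pipeline mirrors the solution above.
class CategoryCountSketch {

    static Map<String, Long> categorize(Stream<String> lines) {
        return lines.flatMap(Pattern.compile("\\s+")::splitAsStream)
                // -1 marks the start of a word; real chars are never negative.
                .flatMapToInt(w -> IntStream.concat(IntStream.of(-1), w.chars()))
                .mapToObj(c -> c == -1 ? "totalWordCount"
                        : "aeiou".indexOf(c) >= 0 ? "totalVowelCount"
                        : Character.isAlphabetic(c) ? "totalAlphabetic"
                        : "totalSpecialCharacter")
                .collect(Collectors.groupingBy(Function.identity(), Collectors.counting()));
    }

    public static void main(String[] args) {
        Map<String, Long> result = categorize(Stream.of("Hello, world!", "Java 8 streams"));
        System.out.println(result.get("totalWordCount"));        // prints 5
        System.out.println(result.get("totalVowelCount"));       // prints 7
        System.out.println(result.get("totalSpecialCharacter")); // prints 3
    }
}
```

With this definition the digit 8 counts as a special character, and only lower-case vowels are recognised, so an upper-case "A" would land in "totalAlphabetic".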
As an additional note, this solution does more work than necessary, because it splits the input into lines, which is not necessary as we can treat line breaks just like other white-space, i.e. as a word boundary. Starting with Java 9, we can use Scanner for the job:
Map<String,Long> result;
try(Scanner scanner = new Scanner(path)) {
result = scanner.findAll("\\S+")
.flatMapToInt(w -> IntStream.concat(IntStream.of(-1), w.group().chars()))
.mapToObj(c -> c==-1? "totalWordCount": "aeiou".indexOf(c)>=0? "totalVowelCount":
Character.isAlphabetic(c)? "totalAlphabetic": "totalSpecialCharacter")
.collect(Collectors.groupingBy(Function.identity(), Collectors.counting()));
}
This will split the input into words in the first place without treating line breaks specially. This answer contains a Java 8 compatible implementation of Scanner.findAll.
The solutions above consider every character which is neither white-space nor alphabetic as “special character”. If your definition of “special character” is different, it should not be too hard to adapt the solutions.

How to ignore punctuations and symbols appended to a word, so that they are all treated as same when considering for word count?

I am writing a program for word count of each word in any text file.
The contents of file are NOT known before-hand.
Desired Output :
e.g.
[book] [book!] [book-] [book?] [book,] [book's] and the likes to be treated as same for word count.
Current Output :
book=2, book.=1, book--=1, book?=5, book's=3, book,=2, book!=1
When I am actually looking for book=15
try(Stream<String> fileContents = Files.lines(filePath)){
Function<String, Stream<String>> splitIntoWords = line -> Pattern.compile(" ").splitAsStream(line);
Map<String, Long> wordFrequency = fileContents.flatMap(splitIntoWords)
.filter(word -> word.trim().length() > 4) //Consider only Words with length greater than 4
.map(String::toLowerCase)
.collect(Collectors.groupingBy(Function.identity(), Collectors.counting()));
System.out.println(wordFrequency);
}
I do not wish to hard-code specific symbols and punctuations in regex to ignore, since the exact contents of file is not known.
Is there any generic way to accomplish this?
Pattern.compile("\\P{L}+").split ...
This would split at any character (or more than one) which is NOT a letter of any language. I guess this gets you what you want?
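A rough sketch of how this suggestion could be plugged into the word-count pipeline from the question; the class name, sample lines, and the empty-token filter are assumptions of mine, and the length filter from the question is left out for brevity.

```java
import java.util.Map;
import java.util.function.Function;
import java.util.regex.Pattern;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical class name; splits on runs of non-letter characters.
class LetterSplitSketch {

    private static final Pattern NON_LETTERS = Pattern.compile("\\P{L}+");

    static Map<String, Long> countWords(Stream<String> lines) {
        return lines.flatMap(NON_LETTERS::splitAsStream)
                // A line starting with a non-letter yields an empty first token.
                .filter(word -> !word.isEmpty())
                .map(String::toLowerCase)
                .collect(Collectors.groupingBy(Function.identity(), Collectors.counting()));
    }

    public static void main(String[] args) {
        Map<String, Long> counts =
                countWords(Stream.of("book! book-- book? book's Book,", "\"Book.\""));
        System.out.println(counts.get("book")); // prints 6
        System.out.println(counts.get("s"));    // prints 1
    }
}
```

One side effect to be aware of: the apostrophe is also a non-letter, so "book's" is split into "book" and "s". The question's filter for words longer than four characters would hide such fragments.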

Change part of code to functional paradigm

I have few lines of code, which I'm trying to convert to functional paradigm. The code is:
private String test(List<String> strings, String input) {
for (String s : strings) {
input = input.replace(s, ", ");
}
return input;
}
I need to turn this into a single chain of calls. It must replace every string from the given list with a comma in the given String input. I tried to do it with the map method, but with no success. I'm aware I could do it if I added the input string to the beginning of the list and then called map, but the list is immutable, so I cannot do that.
I believe you can do this with a simple reduce:
strings.stream().reduce(input, (in, s) -> in.replace(s, ", "));
It takes the input and replaces each occurrence of the first string with ", ". Then it takes that result and uses it as the input along with the next string, repeating for every item in the list.
As Louis Wasserman points out, this approach cannot be used with parallelStream, so it won't work if you want parallelization.
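A minimal runnable sketch of the reduce approach, with made-up sample values and a made-up class name. It uses a sequential stream only, in line with the caveat above.

```java
import java.util.List;

// Hypothetical class name; sequential reduce as described above.
class ReduceReplaceSketch {

    static String replaceEach(List<String> strings, String input) {
        // The input is the starting value; each list element is replaced in turn.
        return strings.stream().reduce(input, (in, s) -> in.replace(s, ", "));
    }

    public static void main(String[] args) {
        System.out.println(replaceEach(List.of("foo", "bar"), "xfooybarz")); // prints x, y, z
    }
}
```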
The only thing I can think of -- which is pretty awkward -- is
strings.stream()
.map(s -> (Function<String, String>) (x -> x.replace(s, ", ")))
.reduce(Function.identity(), Function::andThen)
.apply(input)
The following does pretty much the same thing.
private String test(List<String> strings, String input) {
return input.replaceAll(
strings
.stream()
.map(Pattern::quote)
.collect(Collectors.joining("|"))
, ", "
);
}
The main difference is that it first combines all the search strings into a single regex, and applies them all at once. Depending on size of your input strings, this may perform even better than your original use case.
If the list of strings is fixed, or changes only rarely, you can get some more speed from precompiling the joined pattern.
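One way the precompiling idea might look, as a sketch: the class name and sample values are hypothetical, and it assumes the list is non-empty (an empty list would produce an empty pattern, which matches at every position).

```java
import java.util.List;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

// Hypothetical class; compiles the joined pattern once and reuses it.
class PrecompiledReplaceSketch {

    private final Pattern pattern;

    // Assumption: strings is non-empty.
    PrecompiledReplaceSketch(List<String> strings) {
        this.pattern = Pattern.compile(
                strings.stream()
                        .map(Pattern::quote) // treat each search string literally
                        .collect(Collectors.joining("|")));
    }

    String apply(String input) {
        return pattern.matcher(input).replaceAll(", ");
    }

    public static void main(String[] args) {
        PrecompiledReplaceSketch replacer = new PrecompiledReplaceSketch(List.of("a.b", "foo"));
        // "a+b" is untouched because the dot in "a.b" is quoted.
        System.out.println(replacer.apply("xa.bya+bzfoo")); // prints "x, ya+bz, "
    }
}
```

Because all search strings are matched in a single pass, text produced by one replacement is never searched again, which can differ from the loop version when the search strings overlap or match the replacement text.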
