Regular expression to retrieve words from files - java

I have a set of files in a particular directory.
After reading the contents of all the (text) files in the directory, I have a List of Strings.
Each String element holds the content of one file, so the first element of the list is the content of the first file.
Now I want to split each string into words (the words will later be stored in an array of Strings).
1) Words can be separated by a single space or by multiple spaces.
2) Sentences end with a '.', so a new word can start after a '.'.
3) A new word can start after '\n'.
Can anyone suggest a regular expression that fits the split() method?

Perhaps the StringTokenizer class is a better fit for your need. The constructor takes the string to tokenize and a list of delimiters (in your case: space, ., and line break).

String[] result = myString.split("[.\\s]+");
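A minimal sketch of how such a split behaves when the character class is followed by a + quantifier, so that a run of delimiters counts as one separator. The class and method names (WordSplitter, words) are made up for illustration, and the filter step is an assumption added to drop the empty leading element that split() produces when the text starts with a delimiter.

```java
import java.util.Arrays;

class WordSplitter {
    // Splits text on runs of '.' and whitespace (spaces, tabs, line breaks)
    // and drops any empty element left by a leading delimiter.
    static String[] words(String text) {
        return Arrays.stream(text.split("[.\\s]+"))
                     .filter(w -> !w.isEmpty())
                     .toArray(String[]::new);
    }

    public static void main(String[] args) {
        String content = "First  sentence.Second one.\nThird";
        // prints [First, sentence, Second, one, Third]
        System.out.println(Arrays.toString(words(content)));
    }
}
```

Without the quantifier, two consecutive spaces (or a '.' followed by a space) would yield empty strings between the words.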

You probably don't need a regexp for this: just replace every non-letter character in the file content with a space, then use a tokenizer (for example StringTokenizer) to read each word.

I would suggest tokenizing by hand: simply go through each character and decide what to do based on what the character is. Here's the pseudo-code:
string word = "";
while ( not EOF ){
    char = getNextChar()
    if ( char is not a space, line break or full stop ){
        append the char to the word
    }
    else {
        if ( the word is empty ){ continue /* ignore repeated delimiters */ }
        else {
            add the word to an array of words
            reset the word to ""
        }
    }
}
if ( the word is not empty ){ add the word to the array of words /* last word before EOF */ }
This way you have complete control over how the data is processed, and you don't have to work out which odd cases the regex rule must cover. It also needs only a single pass through the data, which is usually at least as fast as a regex-based split.
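The character-by-character approach can be sketched in Java roughly as follows. This is only an illustration: it assumes the whole file content is already in a String, and the names ManualWordReader and words are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

class ManualWordReader {
    // Collects words in a single pass; '.' and any whitespace end the current word.
    static List<String> words(String text) {
        List<String> words = new ArrayList<>();
        StringBuilder word = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c == '.' || Character.isWhitespace(c)) {
                if (word.length() > 0) {      // skip repeated delimiters
                    words.add(word.toString());
                    word.setLength(0);        // reset the word
                }
            } else {
                word.append(c);
            }
        }
        if (word.length() > 0) {              // last word before the end of the text
            words.add(word.toString());
        }
        return words;
    }

    public static void main(String[] args) {
        // prints [One, two, Three, four]
        System.out.println(words("One  two.Three\nfour"));
    }
}
```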

Related

How to count number of symbols like #,#,+ etc in Java

I'm trying to write code to count the number of letters, characters, spaces and symbols in a String, but I don't know how to count symbols.
Is there any such function available in java?

That very much depends on your definition of the term symbol.
A straightforward solution could be something like
Set<Character> SYMBOLS = Set.of('#', ' ', ....
int count = 0;
for (int i = 0; i < someString.length(); i++) {
    if (SYMBOLS.contains(someString.charAt(i))) { count++; }
}
That iterates over the chars of someString and checks whether each one is in the predefined SYMBOLS set.
Alternatively, you could use a regular expression to define "symbols", or you can rely on a variety of existing definitions. When you check the regex Pattern language for Java, you can find
\w A word character: [a-zA-Z_0-9]
\W A non-word character: [^\w]
for example, along with various other shortcuts that already denote particular sets of characters.
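Both ideas, a predefined set and a regex class, might look like this in full. The contents of SYMBOLS and the choice of "neither a word character nor whitespace" as the regex definition of a symbol are assumptions made for the example; the right definition depends on what you count as a symbol. Set.of requires Java 9 or later.

```java
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class SymbolCounter {
    // Hypothetical symbol set; adjust it to your own definition of "symbol".
    private static final Set<Character> SYMBOLS = Set.of('#', '@', '+', '!', '?');

    // Counts the characters of s that are members of SYMBOLS.
    static int countFromSet(String s) {
        int count = 0;
        for (int i = 0; i < s.length(); i++) {
            if (SYMBOLS.contains(s.charAt(i))) {
                count++;
            }
        }
        return count;
    }

    // Counts the characters of s that are neither word characters nor whitespace.
    static int countByRegex(String s) {
        Matcher matcher = Pattern.compile("[^\\w\\s]").matcher(s);
        int count = 0;
        while (matcher.find()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        System.out.println(countFromSet("a#b@@ c+"));   // prints 4
        System.out.println(countByRegex("Hi, there!")); // prints 2
    }
}
```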

Please post what you have tried so far.
If you need the count of each individual character, iterate over the string and use a map to track each character with its count.
Or
If just the overall count is enough, you can use a regex, like below:
while (matcher.find()) { count++; }
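A rough sketch of the map-based variant, which keeps one counter per character. The class name CharFrequency is hypothetical, and a TreeMap is used here only so that the printed order is predictable.

```java
import java.util.Map;
import java.util.TreeMap;

class CharFrequency {
    // Returns how often each character occurs in s.
    static Map<Character, Integer> frequencies(String s) {
        Map<Character, Integer> counts = new TreeMap<>();
        for (int i = 0; i < s.length(); i++) {
            // merge() inserts 1 for a new key or adds 1 to the existing count
            counts.merge(s.charAt(i), 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        // prints {#=3, a=1, b=1}
        System.out.println(frequencies("a#b##"));
    }
}
```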

One way of doing it would be to iterate over the String and compare each character with an ASCII code:
String str = "abcd!##";
for (int i = 0; i < str.length(); i++)
{
    // 33 is the ASCII code of '!'
    if (33 == str.charAt(i))
        System.out.println("Found !");
}
Look up ASCII values here: https://www.cs.cmu.edu/~pattis/15-1XX/common/handouts/ascii.html

How to search a text file for words ending with a and then append r to these words? (Java)

I have a file containing a movie script and I am trying to append 'r' to all the words that end with 'a'. Any help will be greatly appreciated!

This can be done using the String.replaceAll method like this:
String script = "a cuba is a bla bla country";
System.out.println(script.replaceAll("a(\\s+)", "ar$1"));
// this gives the result of "ar cubar is ar blar blar country"
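The first argument of replaceAll is a regular expression that matches an a followed by one or more whitespace characters. The second argument is the replacement: it appends r after the a and keeps the first capturing group, which is the whitespace. Note that an a at the very end of the text, or directly before punctuation, is not followed by whitespace and so is left unchanged.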

We can split this problem into three main stages:
Reading from the file
Modifying words ending with 'a'
Writing back to a file
We can access the file using a FileReader, then read each word using Scanner. An example of how to do this is available on the Oracle Basic IO tutorial.
To check if each word ends in 'a', we can compare the final character with 'a'. If this is the case, we can append 'r' to this using Java's string concatenation:
if(word.charAt(word.length()-1) == 'a'){
word = word + 'r';
}
Finally, we can write all of this back to a file, using Java's BufferedWriter.
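The three stages could be sketched as below. Note the substitutions: instead of a FileReader, Scanner and BufferedWriter, this version uses java.nio.file.Files to read and write whole lines, and a regex with a word boundary (a\b) instead of charAt, so that spacing and punctuation are preserved. The temporary file and the class name AppendR are assumptions that keep the example self-contained.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class AppendR {
    // Stage 2: append 'r' to every word in the line that ends with a lowercase 'a'.
    static String appendR(String line) {
        return line.replaceAll("a\\b", "ar");
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for the movie script: a temporary file with two lines.
        Path script = Files.createTempFile("script", ".txt");
        Files.write(script,
                Arrays.asList("a cuba is a bla bla country", "Pasta, pizza."),
                StandardCharsets.UTF_8);

        // Stage 1: read the file line by line.
        List<String> changed = new ArrayList<>();
        for (String line : Files.readAllLines(script, StandardCharsets.UTF_8)) {
            changed.add(appendR(line));
        }

        // Stage 3: write the modified lines back to the same file.
        Files.write(script, changed, StandardCharsets.UTF_8);
        System.out.println(Files.readAllLines(script, StandardCharsets.UTF_8));
        Files.delete(script);
    }
}
```

Words ending in an uppercase 'A' are left alone in this sketch; a case-insensitive pattern would be needed to cover them.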

Try this:
System.out.println(script.replaceAll("\\b\\w*a\\b", "$0r"));
\b\w*a\b matches a whole word ending with a; $0 in the replacement stands for the matched word, so r is appended to it.

Splitting a string in Java using multiple delimiters

I have a string like
String myString = "hello world~~hello~~world";
I am using the split method like this
String[] temp = myString.split("~|~~|~~~");
I want the array temp to contain only the strings separated by ~, ~~ or ~~~.
However, the temp array thus created has length 5, the 2 additional 'strings' being empty strings.
I want it to ONLY contain my non-empty string. Please help. Thank you!

You should use a quantifier with your character:
String[] temp = myString.split("~+");
String#split() takes a regex. ~+ matches one or more ~, so it will split on ~, ~~, ~~~, and so on.
Also, if you want to split only on ~, ~~, or ~~~, you can limit the repetition with the {m,n} quantifier, which matches a pattern from m to n times:
String[] temp = myString.split("~{1,3}");
When you split the way you are doing, the alternation tries ~ first, so a~~b is split twice on a single ~ and the middle element is an empty string.
You could also have solved the problem by reversing the order of your alternatives like this:
String[] temp = myString.split("~~~|~~|~");
That tries ~~~ first, then ~~, and only then a single ~, so it works fine. But you should prefer the first approach.
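A small sketch that compares the patterns side by side on the string from the question. The class name TildeSplit and the helper splitOnTildes are hypothetical.

```java
import java.util.Arrays;

class TildeSplit {
    // Splits on any run of one or more '~' characters.
    static String[] splitOnTildes(String s) {
        return s.split("~+");
    }

    public static void main(String[] args) {
        String myString = "hello world~~hello~~world";

        // Single ~ is tried first, so each ~~ produces an empty string: 5 elements.
        System.out.println(Arrays.toString(myString.split("~|~~|~~~")));

        // Quantifier: each run of ~ is one delimiter: 3 elements.
        System.out.println(Arrays.toString(splitOnTildes(myString)));

        // Longest alternative first also gives 3 elements.
        System.out.println(Arrays.toString(myString.split("~~~|~~|~")));
    }
}
```

If the input can start with a ~, split() still returns an empty first element, which has to be skipped separately.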

Just turn the pattern around:
String myString = "hello world~~hello~~world";
String[] temp = myString.split("~~~|~~|~");

Try this:
myString.split("~~~|~~|~");
It will definitely work. In your code, the first alternative ~ matches as soon as a ~ occurs, so the string is split at that point and the longer alternatives ~~ and ~~~ are never tried, even though they are present in the string. Like:
[hello world]~[]~[hello]~[]~[world]
The square brackets mark the 5 resulting string values.

how to get words from a sequence of characters in java?

I have a method getNextChar() which reads a string character by character. And I am writing a method to get the words in a character sequence provided by getNextChar().
The text contains punctuation marks and other special characters.
I am thinking of having an array which contains all the punctuation marks and special characters, and when I read the characters of the text, checking whether each character is in the array so that it can be ignored.
The method will recognize a word when it gets a space. The words will be stored in a collection (e.g. a Map), as I need to count the frequencies as well, by checking whether the word has already been inserted in the map and increasing its counter.
Is this the best and most efficient way of doing it? I am looking for the most efficient way.
Is there any complete list of punctuation marks and special characters?

I think there is an easier way to do this.
No matter what your input source is, I would read it using the Scanner class. You can instantiate this class with your input string and call the Scanner.next() method to get the next token in the string; by default Scanner splits on whitespace. Then you can use String.replaceAll("\\p{Punct}", "") to remove punctuation, insert the words into an ArrayList, and count frequencies etc.
Scanner reader = new Scanner(string);
String word = reader.next();
word = word.replaceAll("\\p{Punct}", "");
list.add(word);

You could use string.split("\\s+") to break the string apart into an array of strings separated by whitespace (for your words). You could also check each character with Character.isLetterOrDigit() to avoid punctuation. (Not necessarily in that order.)

The lookup for punctuation will perform better if you use a Set of characters:
Set<Character> punctuationChars ....
if (punctuationChars.contains(yourChar)) { ... }

Just use a Scanner to read in the Strings:
Scanner in = new Scanner(...);
while (in.hasNext()) {
String word = in.next();
/* do something with the word, check punctuation, etc. */
}
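Putting the Scanner loop together with the frequency count from the question might look like this. It is a sketch under a few assumptions: punctuation is whatever \p{Punct} matches (ASCII punctuation), words are compared case-sensitively, and the names WordFrequency and count are invented for the example.

```java
import java.util.Map;
import java.util.Scanner;
import java.util.TreeMap;

class WordFrequency {
    // Counts how often each word occurs; tokens are separated by whitespace
    // and ASCII punctuation is stripped from each token.
    static Map<String, Integer> count(String text) {
        Map<String, Integer> frequencies = new TreeMap<>();
        try (Scanner in = new Scanner(text)) {
            while (in.hasNext()) {
                String word = in.next().replaceAll("\\p{Punct}", "");
                if (!word.isEmpty()) {             // token was punctuation only
                    frequencies.merge(word, 1, Integer::sum);
                }
            }
        }
        return frequencies;
    }

    public static void main(String[] args) {
        // prints {cat=1, dog=1, the=2}
        System.out.println(count("the cat, the dog."));
    }
}
```

A HashMap would do equally well for counting; the TreeMap only makes the printed order deterministic.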

How to tokenize in java without using the java.util tokenizer?

Consider the following as tokens:
+, -, ), (
alpha characters and underscore
integers
Implement:
1. getToken() - returns a string corresponding to the next token
2. getTokPos() - returns the position of the current token in the input string
Example input: (a+b)-21)
Output: (| a| +| b| )| -| 21| )|
Note: Cannot use the Java StringTokenizer class.
Work in progress - successfully tokenized +, -, ), (. Need to figure out characters and numbers:
OUTPUT: +|-|+|-|(|(|)|)|)|(| |

java.util.StringTokenizer is a legacy class; it is not formally deprecated, but its documentation discourages its use in new code.
Tokenizing Strings in Java is much easier with String.split(), available since Java 1.4:
String[] tokens = "(a+b)-21)".split("[-+()]");
Note that split() discards the delimiters, so this returns only the pieces between the operators and parentheses ("", "a", "b", "", "21").
If it is homework, you probably have to reimplement a "split" method:
read the String character by character
if the character is not a special char, add it to a buffer
when you encounter a special char, add the buffer content to a list and clear the buffer
Since it is (probably) homework, I'll let you implement it.

Java lets you examine the characters in a String one by one with the charAt method. So use that in a for loop and examine each character. When you encounter a token, wrap it in pipes; any other character is simply appended to the output.
public static final char PLUS_TOKEN = '+';
// define the remaining token characters ('-', '(', ')') the same way

public String doStuff(String input)
{
    StringBuilder output = new StringBuilder();
    for (int index = 0; index < input.length(); index++)
    {
        char current = input.charAt(index);
        if (current == PLUS_TOKEN)
        {
            // when you see a token, wrap it in pipes (|)
            output.append('|');
            output.append(current);
            output.append('|');
        }
        // else if (...) compare the current character with the other tokens
        else
        {
            // any other character is copied to the output unchanged
            output.append(current);
        }
    }
    return output.toString();
}

If it's not a homework assignment, use String.split(). If it is a homework assignment, say so and tag it so that we can give the appropriate level of help (I did so for you, just in case...).

Because the string needs to be cut in several different ways, not just on whitespace or parens, using the String.split method with any of the symbols there will not work: split removes the character used as a separator. You could try to split on the empty string, but this wouldn't keep compound tokens, like 21, together. To correctly parse this string, you will need to effectively implement your own tokenizer. Try thinking about how you could tell that you had a complete token if you looked at the string one character at a time. You could start a string that collects characters until you have identified a complete token, and then remove those characters from the original and return the string. Starting from this point, you can probably make a basic tokenizer.
If you'd rather learn how to make a full-strength tokenizer, most of them are defined by creating a regular expression that matches only the tokens.
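As an illustration of the regex-driven approach, here is a minimal sketch with the two methods from the question, getToken() and getTokPos(). The class name SimpleTokenizer, the token pattern, and the choice to return null at the end of input are assumptions made for this example. Characters that match no token kind (such as spaces or '*') are silently skipped.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class SimpleTokenizer {
    // One alternative per token kind: operator or parenthesis, identifier, integer.
    private static final Pattern TOKEN = Pattern.compile("[-+()]|[A-Za-z_]+|\\d+");

    private final Matcher matcher;
    private int tokPos = -1;      // position of the most recent token, -1 before the first
    private boolean done = false; // set once the input is exhausted

    SimpleTokenizer(String input) {
        this.matcher = TOKEN.matcher(input);
    }

    // Returns the next token, or null when no token is left.
    String getToken() {
        if (done || !matcher.find()) {
            done = true;
            return null;
        }
        tokPos = matcher.start();
        return matcher.group();
    }

    // Returns the start index of the token last returned by getToken().
    int getTokPos() {
        return tokPos;
    }

    public static void main(String[] args) {
        SimpleTokenizer tokenizer = new SimpleTokenizer("(a+b)-21)");
        StringBuilder out = new StringBuilder();
        for (String tok = tokenizer.getToken(); tok != null; tok = tokenizer.getToken()) {
            out.append(tok).append("| ");
        }
        // prints (| a| +| b| )| -| 21| )|
        System.out.println(out.toString().trim());
    }
}
```

If the regex classes are off limits for the assignment as well, the same interface can be implemented with a charAt loop that keeps extending the current token while the next character is of the same kind (letter or underscore, or digit).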
