UVa #494 - regex [^a-zA-z]+ to split words using Java

I was playing with UVa #494 and I managed to solve it with the code below:
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
class Main {
public static void main(String[] args) throws IOException{
BufferedReader in = new BufferedReader(new InputStreamReader(System.in));
String line;
while((line = in.readLine()) != null){
String words[] = line.split("[^a-zA-z]+");
int cnt = words.length;
// for some reason it is counting two words for 234234ddfdfd and words[0] is empty
if(cnt != 0 && words[0].isEmpty()) cnt--; // ugly fix, if has words and the first is empty, reduce one word
System.out.println(cnt);
}
System.exit(0);
}
}
I built the regex "[^a-zA-z]+" to split the words, so for example the strings abc..abc or abc432abc should be split into ["abc", "abc"]. However, when I try the string 432abc, the result is ["", "abc"]: the first element of words[] is an empty string, but I was expecting just ["abc"]. I can't figure out why this regex gives me "" as the first element in this case.

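Check the reference page for split. The passage below describes the general idea (it appears to be written for a different split API, so the details differ slightly in Java):
Each element of separator defines a separate delimiter character. If
two delimiters are adjacent, or a delimiter is found at the beginning
or end of this instance, the corresponding array element contains
Empty. The following table provides examples.
In Java, the + in your regex already collapses consecutive delimiter characters into a single match, so adjacent delimiters are not the cause. The empty element appears because the input starts with a delimiter: String.split keeps the empty string that precedes a positive-width match at the beginning of the input, while trailing empty strings are dropped. That is why 432abc gives ["", "abc"] but abc432 gives ["abc"].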
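The behaviour of String.split with leading and trailing delimiters can be checked with a small sketch like the one below. The class name LeadingEmptyDemo and the helper countWords are made up for illustration, and the pattern uses A-Z rather than the A-z from the question:

```java
import java.util.Arrays;

class LeadingEmptyDemo {
    // Hypothetical helper: counts words by splitting on runs of non-letters
    // and skipping the empty element produced by a leading delimiter.
    static int countWords(String line) {
        int count = 0;
        for (String token : line.split("[^a-zA-Z]+")) {
            if (!token.isEmpty()) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        // Delimiter in the middle: no empty elements.
        System.out.println(Arrays.toString("abc432abc".split("[^a-zA-Z]+"))); // [abc, abc]
        // Delimiter at the start: an empty leading element is kept.
        System.out.println(Arrays.toString("432abc".split("[^a-zA-Z]+")));    // [, abc]
        // Delimiter at the end: the trailing empty element is dropped.
        System.out.println(Arrays.toString("abc432".split("[^a-zA-Z]+")));    // [abc]
        System.out.println(countWords("432abc"));                              // 1
    }
}
```

Skipping empty tokens inside the loop avoids the special case for words[0] and also handles empty lines, for which split returns a single empty string.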

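This prints the number of words by matching the words themselves rather than splitting on the separators (it needs java.util.regex.Pattern and java.util.regex.Matcher). Note that the range A-z also covers the characters [ \ ] ^ _ and `, so A-Z is used here:
public static void main(String[] args) throws IOException {
    BufferedReader in = new BufferedReader(new InputStreamReader(System.in));
    // A word is a run of one or more ASCII letters.
    Pattern pattern = Pattern.compile("[a-zA-Z]+");
    String line;
    while ((line = in.readLine()) != null) {
        Matcher matcher = pattern.matcher(line);
        int count = 0;
        while (matcher.find()) {
            count++;
            // Debug output only; remove this line before submitting to the judge.
            System.out.println(matcher.group());
        }
        System.out.println(count);
    }
}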

Related

Importing a .csv file into java, splitting it to an array or strings

As the title says, I have imported a CSV file with 3 columns separated by commas (,).
When I try to print each column on its own, the first column prints without problems. When I try the second and third columns (index [1] and [2]), they print as expected, but I get an "ArrayIndexOutOfBoundsException" at the end of the program before it exits. Why is that?
Here is the code:
public static void main(String[] args) throws IOException {
    String path = "C:\\Users\\UserONE\\Downloads\\inSic1.csv";
    String line = "";
    BufferedReader br = new BufferedReader(new FileReader(path));
    while ((line = br.readLine()) != null) {
        String[] value = line.split(",");
        System.out.println(value[1]);
    }
}

Try iterating over the values instead of using a hard-coded index.
For example:
while ((line = br.readLine()) != null)
{
String values[] = line.split(",");
for(String val: values)
{
System.out.println(val);
}
}
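This way you never access an index that does not exist. The exception most likely comes from a line with fewer than three columns, such as an empty line at the end of the file, for which split returns a shorter array. The loop also keeps working if you switch to a file with a different number of columns.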
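If a specific column is still needed, a length check before indexing avoids the exception. The following is only a sketch: the class CsvColumnDemo and the method column are hypothetical names, and it assumes a simple file without quoted fields or embedded commas:

```java
class CsvColumnDemo {
    // Hypothetical helper: returns the column at the given index,
    // or null when the line has too few columns.
    static String column(String line, int index) {
        // A limit of -1 keeps trailing empty columns, e.g. "x,," -> ["x", "", ""].
        String[] values = line.split(",", -1);
        return index < values.length ? values[index] : null;
    }

    public static void main(String[] args) {
        // Sample lines standing in for the file contents, including an empty line.
        String[] lines = { "a,b,c", "", "x,," };
        for (String line : lines) {
            String second = column(line, 1);
            if (second != null) {
                System.out.println(second);
            }
        }
    }
}
```

Without the -1 limit, split drops trailing empty strings, so a row such as x,, would yield a one-element array and value[1] would throw as well.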

How to split a file into several tokens

I am trying to tokenize an input file, splitting sentences into tokens (words).
For example,
"This is a test file." should become five words, "this" "is" "a" "test" "file", omitting punctuation and white space, and the words should be stored in an ArrayList.
I tried to write some code like this:
public static ArrayList<String> tokenizeFile(File in) throws IOException {
String strLine;
String[] tokens;
//create a new ArrayList to store tokens
ArrayList<String> tokenList = new ArrayList<String>();
if (null == in) {
return tokenList;
} else {
FileInputStream fStream = new FileInputStream(in);
DataInputStream dataIn = new DataInputStream(fStream);
BufferedReader br = new BufferedReader(new InputStreamReader(dataIn));
while (null != (strLine = br.readLine())) {
if (strLine.trim().length() != 0) {
//make sure strings are independent of capitalization and then tokenize them
strLine = strLine.toLowerCase();
//create regular expression pattern to split
//first letter to be alphabetic and the remaining characters to be alphanumeric or '
String pattern = "^[A-Za-z][A-Za-z0-9'-]*$";
tokens = strLine.split(pattern);
int tokenLen = tokens.length;
for (int i = 1; i <= tokenLen; i++) {
tokenList.add(tokens[i - 1]);
}
}
}
br.close();
dataIn.close();
}
return tokenList;
}
This code works fine, except that instead of breaking the whole file into individual words (tokens), it turns each whole line into one token. "area area" becomes a single token instead of "area" appearing twice. I don't see the error in my code. I suspect something is wrong with my trim().
Any advice is appreciated. Thank you so much.
Maybe I should use Scanner instead? I'm confused.

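I think Scanner is more appropriate for this task. As for this code, you should fix the regex: split expects a pattern that matches the delimiters between tokens, but your pattern describes a token itself and, because of ^ and $, has to match the whole line. A line such as "area area" is never matched, so split returns it unchanged as a single token. Try "\\s+" instead.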

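Try String pattern = "[^\\w]"; in the same code. It splits on any single character that is not a letter, digit, or underscore; use "[^\\w]+" to treat a run of such characters as one separator and avoid empty tokens.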
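Putting the suggestions together, a tokenizer for a single line might look like the sketch below. The class TokenizeDemo and the method tokenize are illustrative names. The delimiter pattern is an assumption derived from the character set in the question (letters, digits, apostrophe, hyphen) and does not enforce the rule that a token starts with a letter:

```java
import java.util.ArrayList;
import java.util.List;

class TokenizeDemo {
    // Hypothetical helper: lower-cases the line and splits it on runs of
    // characters that cannot be part of a token.
    static List<String> tokenize(String line) {
        List<String> tokens = new ArrayList<>();
        for (String token : line.toLowerCase().split("[^a-z0-9'-]+")) {
            // A leading delimiter produces one empty string; skip it.
            if (!token.isEmpty()) {
                tokens.add(token);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("This is a test file.")); // [this, is, a, test, file]
        System.out.println(tokenize("area area"));            // [area, area]
    }
}
```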

Search String only prints out searches without characters attached

I'm new to Java and I am working on a project. I am trying to search a text file for a few 4-character acronyms. A line is only output when it consists of just the 4 characters and nothing else. If a space or another character is attached to the acronym, the line is not displayed. I have tried to make it show the whole line, but have not been successful yet.
The contents of text file:
APLM
APLM12345
ABC0
ABC0123456
CSQV
CSQVABCDE
ZIAU
ZIAUABCDE
The output in console:
APLM
ABC0
CSQV
ZIAU
My Code:
import java.io.BufferedReader;
import java.io.FileReader;
import java.util.ArrayList;
import java.util.Arrays;
public class searchPdfText
{
public static void main(String args[]) throws Exception
{
int tokencount;
FileReader fr = new FileReader("TextSearchTest.txt");
BufferedReader br = new BufferedReader(fr);
String s = "";
int linecount = 0;
ArrayList<String> keywordList = new ArrayList<String>(Arrays.asList("APLM", "ABC0", "CSQV", "ZIAU" ));
String line;
while ((s = br.readLine()) != null)
{
String[] lineWordList = s.split(" ");
for (String word : lineWordList)
{
if (keywordList.contains(word))
{
System.out.println(s);
break;
}
}
}
}
}

If you take a look at the documentation for ArrayList.contains, you will see that it only returns true if the list contains an element that is equal to the provided string. As such, your code only outputs lines where a word exactly matches one of the strings in keywordList.
Instead, if you want a match whenever part of the word contains a keyword, iterate through the keywords and match the other way around:
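while ((s = br.readLine()) != null) {
    String[] lineWordList = s.split(" ");
    for (String word : lineWordList) {
        // Use ONE of the two variants below, not both.

        // Java 8+: prints the matching word (s cannot be used inside the
        // lambda because it is reassigned and so is not effectively final)
        keywordList.stream().filter(e -> word.contains(e)).findFirst()
                .ifPresent(e -> System.out.println(word));

        // Before Java 8: prints the whole line
        for (String keyword : keywordList) {
            if (word.contains(keyword)) {
                System.out.println(s);
                break;
            }
        }
    }
}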
Additionally, you may consider following Oracle's Java naming conventions for your class name: each word in a class name should be capitalized. For example, your class might be better named SearchPdfText.
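As a self-contained illustration of matching "the other way around", the sketch below filters a list of lines instead of reading a file. The class KeywordSearchDemo and its methods are hypothetical, and the sample data is taken from the question:

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

class KeywordSearchDemo {
    static final List<String> KEYWORDS = Arrays.asList("APLM", "ABC0", "CSQV", "ZIAU");

    // True when the line contains any keyword as a substring.
    static boolean containsKeyword(String line) {
        return KEYWORDS.stream().anyMatch(line::contains);
    }

    // Keeps only the lines that contain at least one keyword.
    static List<String> matchingLines(List<String> lines) {
        return lines.stream()
                .filter(KeywordSearchDemo::containsKeyword)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("APLM", "APLM12345", "ABC0123456", "XYZ99");
        // Prints every line except XYZ99, which contains no keyword.
        matchingLines(lines).forEach(System.out::println);
    }
}
```

If the acronym must appear at the start of the line, line::startsWith could be used in place of line::contains.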

You just need to change your while loop to get the output you want:
while ((s = br.readLine()) != null) {
if (s.length() == 4){
System.out.println(s);
}
}
If you want only those 4 specific values, create a method to check for them, like:
public static boolean hasIt(String text){
String [] list = { "APLM", "ABC0", "CSQV", "ZIAU" };
for ( String s : list ){
if (s.equals(text)){
return true;
}
}
return false;
}
And change your while loop to:
while ((s = br.readLine()) != null) {
if (hasIt(s)){
System.out.println(s);
}
}

Java - differentiating between strings

Is there a number wildcard character in Java? I'm opening a file and looking at a list of data, and I need to differentiate between three pieces of information that start with "M". One of them has numbers directly following the M, and the other two have letters following it. I was wondering if there is a way to check with a wildcard character whether a number follows the letter. I'm sure you could do this with ASCII values, but I am unsure how to do that either.
EDIT: I'm still having issues, so here is my code.
import java.io.*;
import java.util.*;
import java.util.regex.*;
public class addSevTest{
public static void main(String[] args) throws IOException{
FileReader fr = new FileReader("output6.txt");
BufferedReader br = new BufferedReader(fr);
String line;
Pattern pattern = Pattern.compile(br.readLine());
Matcher matcher = pattern.matcher(br.readLine());
List<String> list = new ArrayList<String>();
while ((line = br.readLine()) != null){
if(line.contains("100%") || line.contains("70%") || matcher.find("[.][1-9]")){
list.add(line);
list.add(" 2");
list.add("\n");
//System.out.println('Using String matches method: '+line.matches('.M'));
}else if(line.startsWith("MDRALM")){
list.add(line);
list.add(" 3");
list.add("\n");
}else if(line.startsWith("SOL") || line.startsWith("I/O") || line.startsWith("AH") || line.startsWith("LT")){
continue;
}else{
list.add(line);
list.add(" 1");
list.add("\n");
}
}
/*while ((line = br.readLine()) != null){
if(line.contains("CP")){
list.add(line);
list.add("\n");
}
}*/
br.close();
FileWriter writer = new FileWriter("addSevTest_O.txt");
for(String str: list){
writer.write(str);
}
writer.close();
}
}

You'd be best off using some simple regular expressions.
I found some basic tutorials you can skim through for the basics here:
http://www.vogella.com/articles/JavaRegularExpressions/article.html
http://docs.oracle.com/javase/tutorial/essential/regex/intro.html
http://www.javacodegeeks.com/2012/11/java-regular-expression-tutorial-with-examples.html
And a couple of tools to help you on your journey:
http://regexpal.com/
http://tools.netshiftmedia.com/regexlibrary/
EDIT
In your added code, try replacing this:
if(line.contains("100%") || line.contains("70%") || matcher.find("[.][1-9]"))
with this:
if(line.contains("100%") || line.contains("70%") || line.matches("M[1-9]+.*"))
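The M matches the first letter of the line. [1-9]+ matches one or more digits from 1 to 9 (use [0-9] or \\d if the digit after the M can be 0). Because matches has to match the whole line, the trailing .* is needed to accept whatever characters follow the number.
The Pattern/Matcher stuff you've got here is overkill for your purposes. Note also that Matcher.find does not take a pattern string, so matcher.find("[.][1-9]") does not compile.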
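The check can be tried in isolation with a sketch like this one. The class StartsWithMDigitDemo and its method are made-up names, and \\d is used so that a 0 after the M also counts as a digit, which is an assumption about the data:

```java
class StartsWithMDigitDemo {
    // True when the line starts with 'M' immediately followed by a digit.
    // String.matches must match the entire line, hence the trailing ".*".
    static boolean startsWithMAndDigit(String line) {
        return line.matches("M\\d.*");
    }

    public static void main(String[] args) {
        System.out.println(startsWithMAndDigit("M123 ALARM")); // true
        System.out.println(startsWithMAndDigit("MDRALM"));     // false
        System.out.println(startsWithMAndDigit("M"));          // false
    }
}
```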

Can I do this - token=str.split(" "||",");

import java.io.*;
import java.text.DecimalFormat;
import java.text.NumberFormat;
public class TrimTest{
public static void main(String args[]) throws IOException{
String[] token = new String[0];
String opcode;
String strLine="";
String str="";
try{
// Open and read the file
FileInputStream fstream = new FileInputStream("a.txt");
BufferedReader br = new BufferedReader(new InputStreamReader(fstream));
//Read file line by line and storing data in the form of tokens
if((strLine = br.readLine()) != null){
token = strLine.split(" ");// split w.r.t spaces
token = strLine.split(" "||",") // split if there is a space or comma encountered
}
in.close();//Close the input stream
}
catch (Exception e){//Catch exception if any
System.err.println("Error: " + e.getMessage());
}
int i;
int n = token.length;
for(i=0;i<n;i++){
System.out.println(token[i]);
}
}
}
If the input is MOVE R1,R2,R3, I want to split it on spaces or commas and save the result into an array token[].
I want output as:
MOVE
R1
R2
R3
Thanks in advance.

Try token = strLine.split(" |,").
split takes a regex as its argument, and "or" in a regex is |. You can also use a character class like [\\s,], which is equivalent to \\s|, and means "any whitespace (such as a space, tab, or newline) OR a comma".

You want
token = strLine.split("[ ,]"); // split if there is a space or comma encountered
Square brackets denote a character class. This class contains a space and a comma and the regex will match on any character of the character class.

Change it to strLine.split(" |,"), or maybe even strLine.split("\\s+|,").
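The suggestions above differ in how they treat a comma followed by a space. The sketch below compares them; the class SplitSpaceCommaDemo and the method tokens are illustrative names, and the + quantifier is an addition not present in the answers:

```java
import java.util.Arrays;

class SplitSpaceCommaDemo {
    // Splits on runs of whitespace and/or commas, so "R1, R2" does not
    // produce an empty token between the comma and the space.
    static String[] tokens(String line) {
        return line.trim().split("[\\s,]+");
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(tokens("MOVE R1,R2,R3")));      // [MOVE, R1, R2, R3]
        // Single-character alternation leaves an empty string for ", ".
        System.out.println(Arrays.toString("MOVE R1, R2".split(" |,")));   // [MOVE, R1, , R2]
        System.out.println(Arrays.toString(tokens("MOVE R1, R2")));        // [MOVE, R1, R2]
    }
}
```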
