Java - differentiating between strings - java

Is there a number wildcard character in Java? I'm reading a file that contains a list of data, and I need to differentiate between three kinds of entries that start with "M". One of them has digits directly after the "M"; the other two have letters. Is there a way to check, with a wildcard character, whether a digit follows the letter? I'm sure this could be done with ASCII values, but I am unsure how to do that either.
EDIT: I'm still having issues, so here is my code.
import java.io.*;
import java.util.*;
import java.util.regex.*;
public class addSevTest{
public static void main(String[] args) throws IOException{
FileReader fr = new FileReader("output6.txt");
BufferedReader br = new BufferedReader(fr);
String line;
Pattern pattern = Pattern.compile(br.readLine());
Matcher matcher = pattern.matcher(br.readLine());
List<String> list = new ArrayList<String>();
while ((line = br.readLine()) != null){
if(line.contains("100%") || line.contains("70%") || matcher.find("[.][1-9]")){
list.add(line);
list.add(" 2");
list.add("\n");
//System.out.println('Using String matches method: '+line.matches('.M'));
}else if(line.startsWith("MDRALM")){
list.add(line);
list.add(" 3");
list.add("\n");
}else if(line.startsWith("SOL") || line.startsWith("I/O") || line.startsWith("AH") || line.startsWith("LT")){
continue;
}else{
list.add(line);
list.add(" 1");
list.add("\n");
}
}
/*while ((line = br.readLine()) != null){
if(line.contains("CP")){
list.add(line);
list.add("\n");
}
}*/
br.close();
FileWriter writer = new FileWriter("addSevTest_O.txt");
for(String str: list){
writer.write(str);
}
writer.close();
}
}

You'd be best off using some simple regular expressions.
Here are some tutorials you can skim for the basics:
http://www.vogella.com/articles/JavaRegularExpressions/article.html
http://docs.oracle.com/javase/tutorial/essential/regex/intro.html
http://www.javacodegeeks.com/2012/11/java-regular-expression-tutorial-with-examples.html
And a couple of tools to help you on your journey:
http://regexpal.com/
http://tools.netshiftmedia.com/regexlibrary/
EDIT
In your added code, try replacing this:
if(line.contains("100%") || line.contains("70%") || matcher.find("[.][1-9]"))
with this:
if(line.contains("100%") || line.contains("70%") || line.matches("M[1-9]+.*"))
The M matches the first letter of the line. [1-9] matches any digit from 1 to 9, with the + meaning one or more; use [0-9] or \\d instead if a 0 can follow the M. .* means that zero or more additional characters following the number will also match.
The Pattern/Matcher stuff you've got here is overkill for your purposes.
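As a minimal sketch of the check discussed above, the following compares the regex form with a plain character test. The sample lines and the helper name startsWithMDigit are assumptions made for illustration; the real contents of output6.txt are not shown in the question.

```java
import java.util.Arrays;
import java.util.List;

class MPrefixDemo {
    // True when the line starts with 'M' immediately followed by a digit.
    // Note: Character.isDigit also accepts non-ASCII Unicode digits.
    static boolean startsWithMDigit(String line) {
        return line.length() >= 2
                && line.charAt(0) == 'M'
                && Character.isDigit(line.charAt(1));
    }

    public static void main(String[] args) {
        // Hypothetical sample lines, not taken from the asker's file.
        List<String> lines = Arrays.asList("M123 pump fault", "MDRALM high", "MOTOR stop");
        for (String line : lines) {
            // Regex form: 'M', one or more digits, then anything.
            boolean viaRegex = line.matches("M[0-9]+.*");
            System.out.println(line + " -> " + startsWithMDigit(line) + " / " + viaRegex);
        }
    }
}
```

Both tests should agree on ordinary ASCII input; the character test avoids compiling a regex for every line, which may matter for large files.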

Related

How to split a file into several tokens

I was trying to tokenize an input file from sentences into tokens(words).
For example,
"This is a test file." should become the five words "this" "is" "a" "test" "file", omitting punctuation and white space, and the words should be stored in an ArrayList.
I tried writing code like this:
public static ArrayList<String> tokenizeFile(File in) throws IOException {
String strLine;
String[] tokens;
//create a new ArrayList to store tokens
ArrayList<String> tokenList = new ArrayList<String>();
if (null == in) {
return tokenList;
} else {
FileInputStream fStream = new FileInputStream(in);
DataInputStream dataIn = new DataInputStream(fStream);
BufferedReader br = new BufferedReader(new InputStreamReader(dataIn));
while (null != (strLine = br.readLine())) {
if (strLine.trim().length() != 0) {
//make sure strings are independent of capitalization and then tokenize them
strLine = strLine.toLowerCase();
//create regular expression pattern to split
//first letter to be alphabetic and the remaining characters to be alphanumeric or '
String pattern = "^[A-Za-z][A-Za-z0-9'-]*$";
tokens = strLine.split(pattern);
int tokenLen = tokens.length;
for (int i = 1; i <= tokenLen; i++) {
tokenList.add(tokens[i - 1]);
}
}
}
br.close();
dataIn.close();
}
return tokenList;
}
This code works fine, except that instead of breaking the file into individual words (tokens), it turns each whole line into a single token. "area area" becomes one token instead of "area" appearing twice. I don't see the error in my code. I suspect something is wrong with my trim().
Any advice is appreciated. Thank you so much.
Maybe I should use Scanner instead? I'm confused.
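I think Scanner is more appropriate for this task. As for this code, you should fix the regex; try "\\s+".
Alternatively, keep the same code and change the pattern to String pattern = "[^\\w]"; (or "\\W+", which treats a run of punctuation and white space as a single delimiter).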
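The underlying issue is that String.split expects a pattern describing the delimiters, whereas the pattern in the question describes a whole token anchored with ^ and $, so it never matches inside a line such as "area area". A minimal sketch of delimiter-based tokenizing follows; the class and method names are assumptions for illustration, and the character class simply mirrors the token characters listed in the question.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

class TokenizeDemo {
    // Lower-cases the line and splits it on runs of characters that are
    // not letters, digits, apostrophes or hyphens.
    static List<String> tokenize(String line) {
        List<String> tokens = new ArrayList<>();
        for (String t : line.toLowerCase(Locale.ROOT).split("[^a-z0-9'-]+")) {
            if (!t.isEmpty()) { // a leading delimiter produces one empty token
                tokens.add(t);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("This is a test file."));
        System.out.println(tokenize("area area"));
    }
}
```

Calling tokenize once per line read from the BufferedReader and adding the result to tokenList with addAll would fit into the original method.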

Split paragraphs into sentences - a special case

I am a newbie to programming in Java. I want to split the paragraphs in one file into sentences and write them to a different file. There should also be a mechanism to identify which sentence comes from which paragraph. The code I have used so far is shown below, but it breaks:
Former Secretary of Finance Dr. P.B. Jayasundera is being questioned by the police Financial Crime Investigation Division.
into
Former Secretary of Finance Dr.
P.B.
Jayasundera is being questioned by the police Financial Crime Investigation Division.
How can I correct it? Thanks in advance.
import java.io.*;
class trial4{
public static void main(String args[]) throws IOException
{
FileReader fr = new FileReader("input.txt");
BufferedReader br = new BufferedReader(fr);
String s;
OutputStream out = new FileOutputStream("output10.txt");
String token[];
while((s = br.readLine()) != null)
{
token = s.split("(?<=[.!?])\\s* ");
for(int i=0;i<token.length;i++)
{
byte buf[]=token[i].getBytes();
for(int j=0;j<buf.length;j=j+1)
{
out.write(buf[j]);
if(j==buf.length-1)
out.write('\n');
}
}
}
fr.close();
}
}
I went through all the similar questions posted on Stack Overflow, but those answers did not help me solve this.
How about using a negative lookbehind in conjunction with a replace? Simply put: replace every sentence ending that does not have "something special" before it with that sentence ending followed by a newline.
A list of known abbreviations will be needed. There is no guarantee as to how long those can be, or how short a word at the end of a sentence might be. (See? 'be' is quite short already!)
import java.io.*;
class trial4{
public static void main(String args[]) throws IOException {
FileReader fr = new FileReader("input.txt");
BufferedReader br = new BufferedReader(fr);
PrintStream out = new PrintStream(new FileOutputStream("output10.txt"));
String s = br.readLine();
while(s != null) {
out.print( //Writes the line with a newline added after every sentence end that matches
s.replaceAll("(?i)" //Make the match case insensitive
+ "(?<!" //Negative lookbehind
+ "(\\W\\w)|" //Single non-word followed by word character (P.B.)
+ "(\\W\\d{1,2})|" //one or two digits (dates!)
+ "(\\W(dr|mr|mrs|ms))" //List of known abbreviations
+ ")" //End of lookbehind
+"([!?\\.])" //Match end-of-sentence
, "$5" //Replace with end-of-sentence found
+System.lineSeparator())); //Add newline if found
s = br.readLine();
}
br.close(); //Close the input
out.close(); //Flush and close the output file
}
}
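The same negative-lookbehind idea can also be used with split, which makes it easy to tag every sentence with its paragraph number as the question asks. This is only a sketch under stated assumptions: each input line is treated as one paragraph (as in the question's readLine loop), the abbreviation list is a small illustrative sample, and the class name, the sample text and the "paragraph.sentence" label format are invented for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

class SentenceSplitDemo {
    // Split on white space that follows '.', '!' or '?', unless the period
    // belongs to a listed title or to a single capital letter (an initial).
    private static final Pattern BOUNDARY = Pattern.compile(
            "(?<=[.!?])"                      // a sentence terminator just before
            + "(?<!\\b(?:Dr|Mr|Mrs|Ms)\\.)"   // but not a known abbreviation
            + "(?<!\\b\\p{Upper}\\.)"         // and not an initial such as "P."
            + "\\s+");

    static List<String> split(String paragraph) {
        List<String> sentences = new ArrayList<>();
        for (String s : BOUNDARY.split(paragraph.trim())) {
            if (!s.isEmpty()) {
                sentences.add(s);
            }
        }
        return sentences;
    }

    public static void main(String[] args) {
        // Hypothetical paragraphs; in the real program these come from input.txt.
        String[] paragraphs = {
            "Former Secretary of Finance Dr. P.B. Jayasundera is being questioned. He arrived early.",
            "Is this a second paragraph? Yes!"
        };
        for (int p = 0; p < paragraphs.length; p++) {
            List<String> sentences = split(paragraphs[p]);
            for (int i = 0; i < sentences.size(); i++) {
                // Label format "paragraph.sentence" is an arbitrary choice.
                System.out.println((p + 1) + "." + (i + 1) + "\t" + sentences.get(i));
            }
        }
    }
}
```

A known limitation of this sketch is that a sentence ending in a single capital letter ("See appendix A. Then ...") is not split, so the rules would need tuning for real text.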
As mentioned in the comments, it will be reasonably hard to break text into sentences without formalizing the requirements. Take a look at BreakIterator, especially its sentence instance (BreakIterator.getSentenceInstance). You might write your own BreakIterator, since the built-in one breaks in much the same places as your regexp, only behind a more abstract interface. Or try a third-party solution such as http://deeplearning4j.org/sentenceiterator.html, which can be trained to tokenize your input.
Example with BreakIterator:
String str = "Former Secretary of Finance Dr. P.B. Jayasundera is being questioned by the police Financial Crime Investigation Division.";
BreakIterator bilus = BreakIterator.getSentenceInstance(Locale.US);
bilus.setText(str);
int last = bilus.first();
int count = 0;
while (BreakIterator.DONE != last)
{
int first = last;
last = bilus.next();
if (BreakIterator.DONE != last)
{
String sentence = str.substring(first, last);
System.out.println("Sentence:" + sentence);
count++;
}
}
System.out.println("" + count + " sentences found.");

Search String only prints out searches without characters attached

I'm new to Java and I am working on a project. I am trying to search a text file for a few 4-character acronyms. My program only outputs a line when it consists of just the 4 characters and nothing else. If a space or another character is attached to the acronym, the line is not displayed. I have tried to make it show the whole line, but have yet to be successful.
The contents of text file:
APLM
APLM12345
ABC0
ABC0123456
CSQV
CSQVABCDE
ZIAU
ZIAUABCDE
The output in console:
APLM
ABC0
CSQV
ZIAU
My Code:
import java.io.BufferedReader;
import java.io.FileReader;
import java.util.ArrayList;
import java.util.Arrays;
public class searchPdfText
{
public static void main(String args[]) throws Exception
{
int tokencount;
FileReader fr = new FileReader("TextSearchTest.txt");
BufferedReader br = new BufferedReader(fr);
String s = "";
int linecount = 0;
ArrayList<String> keywordList = new ArrayList<String>(Arrays.asList("APLM", "ABC0", "CSQV", "ZIAU" ));
String line;
while ((s = br.readLine()) != null)
{
String[] lineWordList = s.split(" ");
for (String word : lineWordList)
{
if (keywordList.contains(word))
{
System.out.println(s);
break;
}
}
}
}
}
If you take a look at the documentation for ArrayList.contains, you will see that it only returns true if the list holds an element that equals the string you pass in. Your code therefore behaves as written: it outputs only the lines whose words match an entry in keywordList exactly.
Instead, if you want a match whenever a word from the file contains a keyword, iterate over the keywords and test it the other way around:
while ((s = br.readLine()) != null) {
String[] lineWordList = s.split(" ");
for (String word : lineWordList) {
// Java 8 and later (use either this or the loop below, not both)
keywordList.stream().filter(e -> word.contains(e)).findFirst()
.ifPresent(e -> System.out.println(word));
// Before Java 8
for (String keyword : keywordList) {
if (word.contains(keyword)) {
System.out.println(s);
break;
}
}
}
}
Additionally, you may consider following Oracle's Java naming conventions with regard to your class name. Each word in a class name should be capitalized. For example, your class might be better named SearchPdfText.
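For reference, here is a self-contained sketch of the "other way around" test, collecting every line that contains any keyword. The class and method names are assumptions for illustration, and the sample data is the file content quoted in the question.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class KeywordSearchDemo {
    // Returns the lines that contain at least one of the keywords.
    static List<String> matchingLines(List<String> lines, List<String> keywords) {
        List<String> hits = new ArrayList<>();
        for (String line : lines) {
            for (String keyword : keywords) {
                if (line.contains(keyword)) {
                    hits.add(line);
                    break; // report each line only once
                }
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        List<String> keywords = Arrays.asList("APLM", "ABC0", "CSQV", "ZIAU");
        List<String> lines = Arrays.asList(
                "APLM", "APLM12345", "ABC0", "ABC0123456",
                "CSQV", "CSQVABCDE", "ZIAU", "ZIAUABCDE");
        for (String hit : matchingLines(lines, keywords)) {
            System.out.println(hit);
        }
    }
}
```

If the acronym must appear at the start of the line, line.startsWith(keyword) could be used in place of line.contains(keyword).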
You just need to change your while loop to get the output you want:
while ((s = br.readLine()) != null) {
if (s.length() == 4){
System.out.println(s);
}
}
If you want only those 4 specific values, just create a method to check for them, like:
public static boolean hasIt(String text){
String [] list = { "APLM", "ABC0", "CSQV", "ZIAU" };
for ( String s : list ){
if (s.equals(text)){
return true;
}
}
return false;
}
And change your while loop to:
while ((s = br.readLine()) != null) {
if (hasIt(s)){
System.out.println(s);
}
}

BufferedReader - Search for string inside .txt file

I am trying to count the appearances of a string inside a .txt file using a BufferedReader. I am using:
File file = new File(path);
int appearances = 0; // declared outside the try block so it is still in scope for the final println
try {
BufferedReader br = new BufferedReader(new FileReader(file));
String line;
while ((line = br.readLine()) != null) {
if (line.contains("Hello")) {
appearances++;
}
}
} catch (FileNotFoundException e) {
e.printStackTrace();
} catch (IOException e) {
e.printStackTrace();
}
System.out.println("Found " + appearances);
But the problem is that if my .txt file contains for example the string "Hello, world\nHello, Hello, world!" and "Hello" is to be found then the appearances become two instead of three because it searches a line for only one appearance of the string. How could I fix this? Thanks a lot
The simplest solution is to do
while ((line = br.readLine()) != null)
appearances += line.split("Hello", -1).length-1;
Note that, if instead of "Hello", you search for anything with regex-reserved characters, you should escape the string before splitting:
String escaped = Pattern.quote("Hello."); // avoid '.' special meaning in regex
while ((line = br.readLine()) != null)
appearances += line.split(escaped, -1).length-1;
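To see why the quoting matters, here is a small sketch comparing the quoted and unquoted split counts on one line. The class name, method name and sample text are assumptions made for illustration.

```java
import java.util.regex.Pattern;

class SplitCountDemo {
    // Counts literal occurrences of needle in line via split.
    // The -1 limit keeps trailing empty strings, so a match at the
    // very end of the line is still counted.
    static int count(String line, String needle) {
        return line.split(Pattern.quote(needle), -1).length - 1;
    }

    public static void main(String[] args) {
        String line = "Hello, Hello. Hello!";
        // Unquoted: '.' matches any character, so all three "Hello?" runs match.
        int unquoted = line.split("Hello.", -1).length - 1;
        // Quoted: only the literal "Hello." matches.
        int quoted = count(line, "Hello.");
        System.out.println("unquoted=" + unquoted + ", quoted=" + quoted);
    }
}
```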
This is an efficient and correct solution:
String line;
int count = 0;
while ((line = br.readLine()) != null) {
int index = -1;
while((index = line.indexOf("Hello",index+1)) != -1){
count++;
}
}
return count;
It walks through the line and looks for the next index, starting from the previous index+1.
The problem with Peter's solution is that it is wrong (see my comment). The problem with TheLostMind's solution is that it creates a lot of new strings by replacement which is an unnecessary performance drawback.
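The indexOf walk described above can be sketched as a complete program. The class and method names are assumptions, and a StringReader stands in for the FileReader so that the example runs without a file; the text is the sample from the question.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;

class IndexOfCountDemo {
    // Counts occurrences of needle in line by repeatedly searching
    // from one position past the previous hit.
    static int countOccurrences(String line, String needle) {
        if (needle.isEmpty()) {
            return 0; // guard: an empty needle would match at every index
        }
        int count = 0;
        int index = line.indexOf(needle);
        while (index != -1) {
            count++;
            index = line.indexOf(needle, index + 1);
        }
        return count;
    }

    public static void main(String[] args) throws IOException {
        // In-memory stand-in for the .txt file.
        String text = "Hello, world\nHello, Hello, world!";
        int total = 0;
        try (BufferedReader br = new BufferedReader(new StringReader(text))) {
            String line;
            while ((line = br.readLine()) != null) {
                total += countOccurrences(line, "Hello");
            }
        }
        System.out.println("Found " + total);
    }
}
```

Because the search resumes at index + 1, overlapping occurrences are counted too ("aa" is found twice in "aaa"); resuming at index + needle.length() would count only non-overlapping ones.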
A regex-driven version:
String line;
Pattern p = Pattern.compile(Pattern.quote("Hello")); // quotes in case you need 'Hello.'
int count = 0;
while ((line = br.readLine()) != null) {
for (Matcher m = p.matcher(line); m.find(); count ++) { }
}
return count;
I am now curious as to performance between this and gexicide's version - will edit when I have results.
EDIT: benchmarked by running 100 times on a ~800k log file, looking for strings that were found once at the start, once around middle-ish, once at the end, and several times throughout. Results:
IndexFinder: 1579ms, 2407200hits. // gexicide's code
RegexFinder: 2907ms, 2407200hits. // this code
SplitFinder: 5198ms, 2407200hits. // Peter Lawrey's code, after quoting regexes
Conclusion: for non-regex strings, the repeated-indexOf approach is fastest by a nice margin.
Essential benchmark code (log file from vanilla Ubuntu 12.04 installation):
public static void main(String ... args) throws Exception {
Finder[] fs = new Finder[] {
new SplitFinder(), new IndexFinder(), new RegexFinder()};
File log = new File("/var/log/dpkg.log.1"); // around 800k in size
Find test = new Find();
for (int i=0; i<100; i++) {
for (Finder f : fs) {
test.test(f, log, "2014"); // start
test.test(f, log, "gnome"); // mid
test.test(f, log, "ubuntu1"); // end
test.test(f, log, ".1"); // multiple; not at start
}
}
test.printResults();
}
while (line.contains("Hello")) { // loop while the line still contains "Hello"
appearances++;
line = line.replaceFirst("Hello",""); // replace the first occurrence of "Hello" with an empty String
}

UVa #494 - regex [^a-zA-z]+ to split words using Java

I was playing with UVa #494 and I managed to solve it with the code below:
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
class Main {
public static void main(String[] args) throws IOException{
BufferedReader in = new BufferedReader(new InputStreamReader(System.in));
String line;
while((line = in.readLine()) != null){
String words[] = line.split("[^a-zA-z]+");
int cnt = words.length;
// for some reason it is counting two words for 234234ddfdfd and words[0] is empty
if(cnt != 0 && words[0].isEmpty()) cnt--; // ugly fix, if has words and the first is empty, reduce one word
System.out.println(cnt);
}
System.exit(0);
}
}
I built the regex "[^a-zA-z]+" to split the words, so for example the strings abc..abc or abc432abc should be split as ["abc", "abc"]. However, when I try the string 432abc, I get ["", "abc"] as a result: the first element of words[] is just an empty string, but I was expecting to have just ["abc"]. I can't figure out why this regex gives me "" as the first element in this case.
Check the split reference page:
Each element of separator defines a separate delimiter character. If
two delimiters are adjacent, or a delimiter is found at the beginning
or end of this instance, the corresponding array element contains
Empty. The following table provides examples.
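Since your input starts with a delimiter (the 432 in 432abc), the first array element is the empty string that precedes it. Consecutive delimiters are not the issue here, because the + in your regex already treats a run of them as a single delimiter. Note that Java's String.split drops trailing empty strings but keeps a leading one.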
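A short sketch of this behaviour in Java follows, together with one possible way to keep using split: strip any leading delimiters first. The class and method names are assumptions for illustration, and the regex uses A-Z rather than the A-z from the question.

```java
import java.util.Arrays;

class LeadingEmptyDemo {
    private static final String DELIMS = "[^a-zA-Z]+";

    // Counts words by removing leading delimiters and then splitting.
    static int countWords(String line) {
        String stripped = line.replaceFirst("^" + DELIMS, "");
        if (stripped.isEmpty()) {
            return 0; // "".split(...) would return one empty element
        }
        return stripped.split(DELIMS).length;
    }

    public static void main(String[] args) {
        // A delimiter at the start leaves an empty first element.
        System.out.println(Arrays.toString("432abc".split(DELIMS)));
        // A delimiter at the end does not: trailing empties are dropped.
        System.out.println(Arrays.toString("abc432".split(DELIMS)));
        System.out.println(countWords("432abc"));
        System.out.println(countWords("abc..abc"));
    }
}
```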
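The following avoids the problem by matching the words themselves instead of splitting on the delimiters; it prints each word and then the word count for the line. It also uses A-Z rather than A-z, because the range A-z additionally covers the characters [, \, ], ^, _ and `.
// requires java.io.* and java.util.regex.*
public static void main(String[] args) throws IOException {
BufferedReader in = new BufferedReader(new InputStreamReader(System.in));
String line;
while ((line = in.readLine()) != null) {
Pattern pattern = Pattern.compile("[a-zA-Z]+");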
Matcher matcher = pattern.matcher(line);
int count = 0;
while (matcher.find()) {
count++;
System.out.println(matcher.group());
}
System.out.println(count);
}
}
