Searching a text file for specific words using Java - java

I have a huge text file. I'd like to search it for specific words and print the three (or more) words that follow each match. So far I have done this:
public static void main(String[] args) {
String fileName = "C:\\Users\\Mishari\\Desktop\\Mesh.txt";
String line = null;
try {
FileReader fileReader =
new FileReader(fileName);
BufferedReader bufferedReader =
new BufferedReader(fileReader);
while((line = bufferedReader.readLine()) != null) {
System.out.println(line);
}
bufferedReader.close();
} catch(FileNotFoundException ex) {
System.out.println(
"Unable to open file '" +
fileName + "'");
} catch(IOException ex) {
System.out.println(
"Error reading file '"
+ fileName + "'");
}
}
This only prints the file. Can you advise me on the best way to do the search?

You can look up the index of the word in the line using this method:
int index = line.indexOf(word);
If the index is -1, the word does not occur in the line.
If it does occur, take the substring of the line starting at that index and running to the end of the line:
String nextWords = line.substring(index);
Now use String[] temp = nextWords.split(" ") to get all the words in that substring. Note that temp[0] begins with the search word itself, so the words that follow it start at temp[1].
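A minimal sketch of the "print the next few words" step, assuming words are separated by white space and that only a whole-word match should count. The names WordsAfter and wordsAfter are made up for illustration. It tokenizes the line first instead of using indexOf, so "sand" is not treated as a match for "and"; punctuation attached to a word (for example "fox,") would prevent a match and is not handled here.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class WordsAfter {
    // Returns up to `count` words that follow the first token equal to `word`,
    // or an empty list if the word does not occur in the line.
    static List<String> wordsAfter(String line, String word, int count) {
        String[] tokens = line.trim().split("\\s+");
        for (int i = 0; i < tokens.length; i++) {
            if (tokens[i].equals(word)) {
                int from = i + 1;
                int to = Math.min(tokens.length, from + count);
                return new ArrayList<>(Arrays.asList(tokens).subList(from, to));
            }
        }
        return new ArrayList<>();
    }

    public static void main(String[] args) {
        String line = "the quick brown fox jumps over the lazy dog";
        System.out.println(wordsAfter(line, "fox", 3)); // prints [jumps, over, the]
    }
}
```

Inside the question's readLine() loop you would call wordsAfter(line, word, 3) for each line and print the result when it is not empty.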

while((line = bufferedReader.readLine()) != null) {
System.out.println(line);
if (line.contains("YOUR_SPECIFIC_WORDS")) {
// do what you need here
}
}

By the sounds of it what you appear to be looking for is a basic Find & Replace All mechanism for each file line that is read in from file. In other words, if the current file line that is read happens to contain the Word or phrase you would like to add words after then replace that found word with the very same word plus the other words you want to add. In a sense it would be something like this:
String line = "This is a file line.";
String find = "file"; // word to find in line
String replaceWith = "file (plus this stuff)"; // the phrase to change the found word to.
line = line.replace(find, replaceWith); // Replace any found words
System.out.println(line);
The console output would be:
This is a file (plus this stuff) line.
The main thing here, though, is that you only want to deal with actual words and not with the same characters inside another word, for example the word "and" inside the word "sand". The characters that make up "and" also occur in "sand", so it too would be changed by the example code above. The String.contains() method locates strings the same way. In most cases this is undesirable if you want to deal with whole words only, so a simple solution is to use a regular expression (RegEx) with the String.replaceAll() method. Using your own code, it would look something like this:
String fileName = "C:\\Users\\Mishari\\Desktop\\Mesh.txt";
String findPhrase = "and"; //Word or phrase to find and replace
String replaceWith = findPhrase + " (adding this)"; // The text used for the replacement.
boolean ignoreLetterCase = false; // Change to true to ignore letter case
String line = "";
try {
FileReader fileReader = new FileReader(fileName);
BufferedReader bufferedReader = new BufferedReader(fileReader);
while ((line = bufferedReader.readLine()) != null) {
if (ignoreLetterCase) {
line = line.toLowerCase();
findPhrase = findPhrase.toLowerCase();
}
if (line.contains(findPhrase)) {
line = line.replaceAll("\\b(" + findPhrase + ")\\b", replaceWith);
}
System.out.println(line);
}
bufferedReader.close();
} catch (FileNotFoundException ex) {
System.out.println("Unable to open file: '" + fileName + "'");
} catch (IOException ex) {
System.out.println("Error reading file: '" + fileName + "'");
}
You will of course notice the escaped \b word boundary Meta Characters within the regular expression used in the String.replaceAll() method specifically in the line:
line = line.replaceAll("\\b(" + findPhrase + ")\\b", replaceWith);
This allows us to deal with whole words only.
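As a variation on the same idea, here is a small sketch that leaves the letter case of the rest of the line alone and treats the search phrase literally. It assumes Pattern.CASE_INSENSITIVE is an acceptable stand-in for lower-casing the line, and it wraps the phrase in Pattern.quote() so that characters such as "." or "|" in the phrase are not read as regex syntax. The class and method names are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class WholeWordReplace {
    // Replaces whole-word occurrences of `find` with `replaceWith`.
    static String replaceWholeWord(String line, String find,
                                   String replaceWith, boolean ignoreCase) {
        int flags = ignoreCase ? Pattern.CASE_INSENSITIVE : 0;
        // \b on both sides restricts the match to whole words.
        Pattern p = Pattern.compile("\\b" + Pattern.quote(find) + "\\b", flags);
        // quoteReplacement keeps '$' and '\' in the replacement literal.
        return p.matcher(line).replaceAll(Matcher.quoteReplacement(replaceWith));
    }

    public static void main(String[] args) {
        String line = "Sand and AND band";
        // prints: Sand and (adding this) and (adding this) band
        System.out.println(replaceWholeWord(line, "and", "and (adding this)", true));
    }
}
```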

Related

How to determine the delimiter in CSV file

I have a scenario in which I have to parse CSV files from different sources. The parsing code is very simple and straightforward:
String csvFile = "/Users/csv/country.csv";
String line = "";
String cvsSplitBy = ",";
try (BufferedReader br = new BufferedReader(new FileReader(csvFile))) {
while ((line = br.readLine()) != null) {
// use comma as separator
String[] country = line.split(cvsSplitBy);
System.out.println("Country [code= " + country[4] + " , name=" + country[5] + "]");
}
} catch (IOException e) {
e.printStackTrace();
}
My problem comes from the CSV delimiter character. I have many different formats: sometimes it is a , and sometimes it is a ;
Is there any way to determine the delimiter character before parsing the file?
univocity-parsers supports automatic detection of the delimiter (also line endings and quotes). Just use it instead of fighting with your code:
CsvParserSettings settings = new CsvParserSettings();
settings.detectFormatAutomatically();
CsvParser parser = new CsvParser(settings);
List<String[]> rows = parser.parseAll(new File("/path/to/your.csv"));
// if you want to see what it detected
CsvFormat format = parser.getDetectedFormat();
Disclaimer: I'm the author of this library and I made sure all sorts of corner cases are covered. It's open source and free (Apache 2.0 license)
Hope this helps.
Yes, but only if the delimiter characters are not allowed to exist as regular text
The simplest answer is to keep a list of all the possible delimiter characters and try to identify which one is being used. Even so, you have to place some limitations on the files or on the people who create them. Look at the following two scenarios:
Case 1 - Contents of file.csv
test1,test2,test3
Case 2 - Contents of file.csv
test1|test2,3|test4
If you have prior knowledge of the delimiter characters, then you would split the first string using , and the second one using |, getting the same result. But, if you try to identify the delimiter by parsing the file, both strings can be split using the , character, and you would end up with this:
Case 1 - Result of split using ,
test1
test2
test3
Case 2 - Result of split using ,
test1|test2
3|test4
Without prior knowledge of which delimiter character is being used, you cannot create a "magical" algorithm that will parse every combination of text; even regular expressions or counting the number of appearances of a character will not save you.
Worst case
test1,2|test3,4|test5
Looking at the text, a person would tokenize it using | as the delimiter. But , and | appear the same number of times. So, from an algorithm's perspective, both results are equally plausible:
Correct result
test1,2
test3,4
test5
Wrong result
test1
2|test3
4|test5
If you impose a set of guidelines, or you can somehow control the generation of the CSV files, then you could just try to find the delimiter with the String.contains() method, using the aforementioned list of characters. For example:
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;
public class MyClass {
// Candidate delimiters, checked in this order.
// The list is static so that the static method below can use it.
private static final List<String> delimiterList = Arrays.asList(
",",
";",
"\t"
// etc...
);
// Returns the first candidate found in the text, or "" if none is present.
private static String determineDelimiter(String text) {
for (String delimiter : delimiterList) {
if(text.contains(delimiter)) {
return delimiter;
}
}
return "";
}
public static void main(String[] args) {
String csvFile = "/Users/csv/country.csv";
String line = "";
String delimiter = "";
boolean firstLine = true;
try (BufferedReader br = new BufferedReader(new FileReader(csvFile))) {
while ((line = br.readLine()) != null) {
if(firstLine) {
// detect the delimiter once, from the first line only
delimiter = determineDelimiter(line);
if(delimiter.isEmpty()) {
System.out.println("No supported delimiter found in: " + line);
return;
}
firstLine = false;
}
// split() takes a regex; Pattern.quote() keeps delimiters such as | literal
String[] country = line.split(Pattern.quote(delimiter));
System.out.println("Country [code= " + country[4] + " , name=" + country[5] + "]");
}
} catch (IOException e) {
e.printStackTrace();
}
}
}
Update
For a more optimized way, in determineDelimiter() method instead of the for-each loop, you can employ regular expressions.
If the delimiter can appear in a data column, then you are asking for the impossible. For example, consider this first line of a CSV file:
one,two:three
This could be either a comma-separated or a colon-separated file. You can't tell which type it is.
If you can guarantee that the first line has all its columns surrounded by quotes, for example if it's always this format:
"one","two","three"
then you may be able to use this logic (although it's not 100% bullet-proof):
if (line.contains("\",\""))
delimiter = ',';
else if (line.contains("\";\""))
delimiter = ';';
If you can't guarantee a restricted format like that, then it would be better to pass the delimiter character as a parameter.
Then you can read the file using a widely-known open-source CSV parser such as Apache Commons CSV.
While I agree with Lefteris008 that no function can correctly determine the delimiter in all cases, we can have a function that is both efficient and gives mostly correct results in practice.
def head(filename: str, n: int):
    # Read the first n lines; fall back to the whole file if it is shorter.
    try:
        with open(filename) as f:
            head_lines = [next(f).rstrip() for x in range(n)]
    except StopIteration:
        with open(filename) as f:
            head_lines = f.read().splitlines()
    return head_lines
def detect_delimiter(filename: str, n=2):
    sample_lines = head(filename, n)
    if not sample_lines:
        return ','
    common_delimiters = [',', ';', '\t', ' ', '|', ':']
    for d in common_delimiters:
        ref = sample_lines[0].count(d)
        if ref > 0:
            # Compare against the lines actually read; the file may have fewer than n.
            if all(ref == line.count(d) for line in sample_lines[1:]):
                return d
    return ','
My efficient implementation is based on:
Prior knowledge, such as the list of common delimiters you often work with (',;\t |:'), or even how likely each delimiter is to be used, which is why I put the regular ',' at the top of the list.
The delimiter appearing the same number of times in each line of the text file. This resolves the problem of reading a single line in which two candidates appear equally often (the false detection Lefteris008 describes), or in which the right delimiter even appears less often than the wrong one.
An efficient head function that reads only the first n lines of the file.
As you increase the number of sample lines n, the likelihood of getting a false answer drops drastically. I have often found n=2 to be adequate.
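Since the question is about Java, here is a rough Java sketch of the same consistency check. It assumes the sample lines have already been read (for example the first two lines from a BufferedReader), and it keeps the same candidate list and the same fallback to ','. The names DelimiterGuess, CANDIDATES, count and detect are made up for illustration; like the Python version, it is a heuristic and can still guess wrong.

```java
import java.util.Arrays;
import java.util.List;

class DelimiterGuess {
    // Candidate delimiters, most likely first.
    static final List<Character> CANDIDATES =
            Arrays.asList(',', ';', '\t', ' ', '|', ':');

    // Counts how often character c occurs in s.
    static int count(String s, char c) {
        int n = 0;
        for (int i = 0; i < s.length(); i++) {
            if (s.charAt(i) == c) n++;
        }
        return n;
    }

    // Returns the first candidate that occurs in the first line and occurs
    // the same number of times in every other sample line; ',' otherwise.
    static char detect(List<String> sampleLines) {
        if (sampleLines.isEmpty()) return ',';
        for (char d : CANDIDATES) {
            int ref = count(sampleLines.get(0), d);
            if (ref == 0) continue;
            boolean consistent = true;
            for (String line : sampleLines) {
                if (count(line, d) != ref) {
                    consistent = false;
                    break;
                }
            }
            if (consistent) return d;
        }
        return ',';
    }

    public static void main(String[] args) {
        List<String> sample = Arrays.asList("test1,2|test3,4|test5", "a|b|c");
        System.out.println(detect(sample)); // prints |
    }
}
```

With only the first of those two lines as the sample, the sketch would return ',' instead, which is exactly the ambiguity described in the "worst case" above.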
Add a condition like this:
String[] country;
if (line.contains(","))
country = line.split(",");
else if (line.contains(";"))
country = line.split(";");
else
country = new String[] { line }; // no known delimiter: treat the line as a single field
That depends....
If your datasets are always the same length and/or the separator NEVER occurs in your datacolumns, you could just read the first line of the file, look at it for the longed for separator, set it and then read the rest of the file using that separator.
Something like
String csvFile = "/Users/csv/country.csv";
String line = "";
String cvsSplitBy = ",";
try (BufferedReader br = new BufferedReader(new FileReader(csvFile))) {
while ((line = br.readLine()) != null) {
// pick the separator that this line contains
if (line.contains(",")) {
cvsSplitBy = ",";
} else if (line.contains(";")) {
cvsSplitBy = ";";
} else {
System.out.println("Wrong separator!");
}
String[] country = line.split(cvsSplitBy);
System.out.println("Country [code= " + country[4] + " , name=" + country[5] + "]");
}
} catch (IOException e) {
e.printStackTrace();
}
Greetz Kai

Reading an array from file. (java)

Hello, this is my code to read from a file:
case 11: {
String line;
String temp[];
System.out.println("Enter the name of the file to read the playlist from.");
nazwa11 = odczyt.next();
try {
FileReader fileReader = new FileReader(nazwa11);
BufferedReader bufferedReader = new BufferedReader(fileReader);
playlists.add(new Playlist(bufferedReader.readLine()));
x++;
while((line = bufferedReader.readLine())!=null){
String delimiter = "|";
temp = line.split(delimiter);
int rok;
rok = Integer.parseInt(temp[2]);
playlists.get(x).dodajUtwor(temp[0], temp[1], rok);
}
bufferedReader.close();
} catch (FileNotFoundException ex) {
System.out.println("File not found: '" + nazwa11 + "'");
} catch (IOException ex) {
System.out.println("Error reading file '" + nazwa11 + "'");
}
break;
}
Example file looks like this:
Pop
Test|Test|2010
Test1|Test1|2001
It gives me this error:
Exception in thread "main" java.lang.NumberFormatException: For input string: "s"
Why doesn't my line.split split when it finds "|"? I guess it splits into T-e-s instead. Any tips?
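String.split() treats its argument as a regular expression, and the pipe character "|" is one of the metacharacters that carries a special meaning while performing the match: it separates alternatives. An unescaped "|" therefore means "empty or empty", which matches between every pair of characters, so the line is split into single characters.
The documentation of java.util.regex.Pattern gives the complete list of these special characters and their meanings.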
So, in your program, modify the following line,
String delimiter = "|";
to
String delimiter = "\\|";
This will give you the result that you want.
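A small sketch that shows the difference, using the sample line from the question. Pattern.quote() is shown as an alternative to escaping by hand; the class name PipeSplit and the helper splitOnPipe are hypothetical.

```java
import java.util.Arrays;
import java.util.regex.Pattern;

class PipeSplit {
    // Splits on a literal pipe; Pattern.quote() escapes the metacharacter.
    static String[] splitOnPipe(String line) {
        return line.split(Pattern.quote("|"));
    }

    public static void main(String[] args) {
        String line = "Test|Test|2010";
        // Unescaped: the regex matches the empty string, so the line
        // is broken up into single characters.
        System.out.println(Arrays.toString(line.split("|")));
        // Escaped by hand: prints [Test, Test, 2010]
        System.out.println(Arrays.toString(line.split("\\|")));
        // Escaped with Pattern.quote(): prints [Test, Test, 2010]
        System.out.println(Arrays.toString(splitOnPipe(line)));
    }
}
```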

How to trim the elements before assigning it into an array list?

I need to put the elements of a CSV file into an ArrayList. The CSV file contains filenames with the extension .tar. I need to trim those elements before I read them into the ArrayList, or trim the whole ArrayList afterwards. Please help me with it.
try
{
String strFile1 = "D:\\Ramakanth\\PT2573\\target.csv"; //csv file containing data
BufferedReader br1 = new BufferedReader( new FileReader(strFile1)); //create BufferedReader
String strLine1 = "";
StringTokenizer st1 = null;
while( (strLine1 = br1.readLine()) != null) //read comma separated file line by line
{
st1 = new StringTokenizer(strLine1, ","); //break comma separated line using ","
while(st1.hasMoreTokens())
{
array1.add(st1.nextToken()); //store csv values in array
}
}
}
catch(Exception e)
{
System.out.println("Exception while reading csv file: " + e);
}
If you want to remove the ".tar" string from your tokens, you can use:
String nextToken = st1.nextToken();
if (nextToken.endsWith(".tar")) {
nextToken = nextToken.replace(".tar", "");
}
array1.add(nextToken);
You shouldn't be using StringTokenizer. The JavaDoc says (in part): "StringTokenizer is a legacy class that is retained for compatibility reasons although its use is discouraged in new code. It is recommended that anyone seeking this functionality use the split method of String or the java.util.regex package instead." You should also close your BufferedReader; you could use a try-with-resources statement to do that. And you might use a for-each loop to iterate over the array produced by String.split(String). The regular expression below optionally matches whitespace before or after your comma, and you might continue the loop if the token endsWith ".tar", like this:
String strFile1 = "D:\\Ramakanth\\PT2573\\target.csv";
try (BufferedReader br1 = new BufferedReader(new FileReader(strFile1)))
{
String strLine1 = "";
while( (strLine1 = br1.readLine()) != null) {
String[] parts = strLine1.split("\\s*,\\s*");
for (String token : parts) {
if (token.endsWith(".tar")) continue; // <-- don't add "tar" files.
array1.add(token);
}
}
}
catch(Exception e)
{
System.out.println("Exception while reading csv file: " + e);
}
if(str.indexOf(".tar") > 0)
str = str.substring(0, str.indexOf(".tar"));
while(st1.hasMoreTokens())
{
String input = st1.nextToken();
int index = input.indexOf("."); // Get the position of '.'
if(index >= 0){ // indexOf returns -1 when there is no '.'; this check avoids a StringIndexOutOfBoundsException.
array1.add(input.substring(0, index)); // Keep the part before the first '.'.
} else {
array1.add(input); // No '.' in the token: keep it unchanged.
}
}
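Putting the pieces together, here is a minimal sketch that splits one CSV line, trims surrounding white space and removes a trailing ".tar" from each entry. It assumes the goal is to keep the file names without the extension and that values never contain quoted commas. TarNames and parseLine are made-up names.

```java
import java.util.ArrayList;
import java.util.List;

class TarNames {
    // Splits a CSV line on commas, trims each token and strips a
    // trailing ".tar" extension if present. Empty tokens are skipped.
    static List<String> parseLine(String csvLine) {
        List<String> names = new ArrayList<>();
        for (String token : csvLine.split(",")) {
            String name = token.trim();
            if (name.isEmpty()) continue;
            if (name.endsWith(".tar")) {
                name = name.substring(0, name.length() - ".tar".length());
            }
            names.add(name);
        }
        return names;
    }

    public static void main(String[] args) {
        System.out.println(parseLine(" a.tar , b.tar,c ")); // prints [a, b, c]
    }
}
```

In the question's loop, array1.addAll(parseLine(strLine1)) would take the place of the StringTokenizer code.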

Read text from file and correct it (commas and dots)[Java]

I have to correct the text in a file.
Wherever there is a comma or a dot, I have to move it to the correct position, e.g.
"Here is ,some text , please correct. this text. " to "Here is, some text, please correct. this text."
I noticed that my code does not work properly. For dots it does not work at all, and for commas it still leaves a space before the comma. Do you have any hints?
FileReader fr = null;
String line = "";
String result="";
String []array;
String []array2;
String result2="";
// open the file
try {
fr = new FileReader("file.txt");
} catch (FileNotFoundException e) {
System.out.println("Can not open the file!");
System.exit(1);
}
BufferedReader bfr = new BufferedReader(fr);
// read the lines:
try {
while((line = bfr.readLine()) != null){
array=line.split(",");
for(int i=0;i<array.length;i++){
//if i not equal to end(at the end has to be period)
if(i!=array.length-1){
array[i]+=",";
}
result+=array[i];
}
// System.out.println(result);
array2=result.split("\\.");
for(int i=0;i<array2.length;i++){
System.out.println(array2[i]);
array[i]+="\\.";
result2+=array2[i];
}
System.out.println(result2);
}
} catch (IOException e) {
System.out.println("Can not read the file!");
System.exit(2);
}
// close the file
try {
fr.close();
} catch (IOException e) {
System.out.println("error can not close the file");
System.exit(3);
}
Let's first assume you can use regex. Here is a simple way to do what you want:
import java.io.*;
class CorrectFile
{
public static void main(String[] args)
{
FileReader fr = null;
String line = "";
String result="";
// open the file
try {
fr = new FileReader("file.txt");
} catch (FileNotFoundException e) {
System.out.println("Can not open the file!");
System.exit(1);
}
BufferedReader bfr = new BufferedReader(fr);
// read the lines:
try {
while((line = bfr.readLine()) != null){
line = line.trim().replaceAll("\\s*([,,.])\\s*", "$1 ");
System.out.println(line);
}
} catch (IOException e) {
System.out.println("Can not read the file!");
System.exit(2);
}
// close the file
try {
fr.close();
} catch (IOException e) {
System.out.println("error can not close the file");
System.exit(3);
}
}
}
The most important thing is this line: line = line.trim().replaceAll("\\s*([,,.])\\s*", "$1 ");. First, each line you read may contain white space at both ends; String.trim() removes it. Next, in the trimmed string, we want to replace "any amount of white space + a comma + any amount of white space" with "a comma + one space", and the same for a dot.
"\s" is the regex for a white-space character and "\s*" means "zero or more white-space characters". "[]" denotes a character class, and "[,,.]" matches either a "," or a ".". Inside a character class there are no separators, so the second comma is simply redundant and "[,.]" would behave identically. In a Java string literal each "\" has to be escaped, which is why the pattern is written "\\s*([,,.])\\s*".
The parentheses make the character class a capture group, which "saves" the match found (here either a "," or a ".") so that we can refer to it in the replacement as "$1". That way each match is replaced with whichever character was actually found. Since you need a space after the comma or dot, we add one, which gives "$1 ".
Now, let's see what's wrong with your original thing and why I said String.split() may not be a good idea.
Aside from the fact that you are creating tons of new String objects, the most obvious problem is that you (perhaps through a typo) wrote array[i]+="\\."; instead of array2[i]+=".";. The less obvious problem comes from String.split() itself, which keeps the surrounding white space in the segments of the split arrays. The last array element even consists of only a white space.
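A compact sketch of the regex approach applied to the example sentence from the question. It differs from the line above in two small ways that are my own choices: the character class is written as [,.], and trim() is called after the replacement so that the space added behind a final dot is removed as well. The names PunctuationFix and fixSpacing are hypothetical. Note that this simple pattern would also turn a decimal number such as 3.14 into "3. 14".

```java
class PunctuationFix {
    // Removes white space before each comma or dot and leaves exactly one
    // space after it; the final trim() drops the space after a closing dot.
    static String fixSpacing(String line) {
        return line.replaceAll("\\s*([,.])\\s*", "$1 ").trim();
    }

    public static void main(String[] args) {
        String input = "Here is ,some text , please correct. this text. ";
        // prints: Here is, some text, please correct. this text.
        System.out.println(fixSpacing(input));
    }
}
```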

How to keep text file contents in same Format after replacing a string

I have a text file with the contents:
love
test
me
once
My Java program replaces the word "love" by "liverpool". But the text file loses its format and becomes like this:
Liverpool test me once
All the strings appear on a single line.
This is what I have so far:
import java.io.*;
public class Replace_Line {
public static void main(String args[]) {
try {
File file = new File("C:\\Users\\Antish\\Desktop\\Test_File.txt");
BufferedReader reader = new BufferedReader(new FileReader(file));
String line = "", oldtext = "";
while ((line = reader.readLine()) != null) {
oldtext += line + " ";
}
reader.close();
// replace a word in a file
// String newtext = oldtext.replaceAll("boy", "Love");
// To replace a line in a file
String newtext = oldtext.replaceAll("love", "Liverpool");
FileWriter writer = new FileWriter(
"C:\\Users\\Antish\\Desktop\\Test_File.txt");
writer.write(newtext);
writer.close();
} catch (IOException ioe) {
ioe.printStackTrace();
}
}
}
Any help to keep the file format the same and just replace the string.
Thanks
When you read by lines, you're removing all line breaks and then replacing them with a space.
oldtext += line + " ";
Needs to be
oldtext += line + System.lineSeparator();
You lose the line breaks while reading like this:
while((line = reader.readLine()) != null) {
oldtext += line + " ";
}
To fix it, you should replace the code inside the loop with
oldtext += line + "\n";
But be aware of the fact, that reading a file line by line and concatenating each line with += is very inefficient. You can do so when learning Java, but never in any production code. Use a StringBuilder or some external libraries to handle IO.
Replace
oldtext += line + " ";
by
oldtext += line + "\n";
Note that this will replace the original line terminators (which could be \r or \r\n) with \n, though. Also note that concatenating with the + operator in a loop is very inefficient because it produces a whole lot of large throw-away String instances. You'd be better off using a StringBuilder or StringWriter.
Why don't you simply read the whole file as a single String, then call replaceAll and rewrite everything? See Guava's method or the commons-io method.
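The same "read everything, replace, write back" idea can also be sketched with the JDK alone (java.nio.file, Java 7 or later). This assumes the file is UTF-8 and small enough to fit in memory. Because the content is never split into lines, the original line breaks are kept exactly as they were. ReplaceInFile and replaceInFile are made-up names, and the demo works on a temporary file rather than the path from the question.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

class ReplaceInFile {
    // Reads the whole file, replaces every literal occurrence of target,
    // and writes the result back. Line terminators are left untouched.
    static void replaceInFile(Path file, String target, String replacement)
            throws IOException {
        String content = new String(Files.readAllBytes(file), StandardCharsets.UTF_8);
        String updated = content.replace(target, replacement);
        Files.write(file, updated.getBytes(StandardCharsets.UTF_8));
    }

    public static void main(String[] args) throws IOException {
        Path tmp = Files.createTempFile("replace-demo", ".txt");
        Files.write(tmp, "love\ntest\nme\nonce\n".getBytes(StandardCharsets.UTF_8));
        replaceInFile(tmp, "love", "Liverpool");
        // Prints the four words on separate lines, with "Liverpool" first.
        System.out.print(new String(Files.readAllBytes(tmp), StandardCharsets.UTF_8));
        Files.delete(tmp);
    }
}
```

String.replace() is used here instead of replaceAll() because the search text is a plain word, not a regular expression.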
