Split text file into Strings on empty line

I want to read a local text file and then split its whole content into Strings, as in the example below.
Example:
Let's say the file contains:
abcdef
ghijkl
(empty line)
aededd
ededed
(empty line)
ededfe
efefeef
efefeff
......
......
I want to split this text into Strings:
s1 = "abcdef" + "\n" + "ghijkl";
s2 = "aededd" + "\n" + "ededed";
s3 = "ededfe" + "\n" + "efefeef" + "\n" + "efefeff";
........................
In other words, I want to split the text on empty lines.
I already know how to read a file. I need help splitting the text into Strings.

You can split a string into an array with
str.split(regex);
To split on empty lines (two consecutive newlines), it would be
str.split("\\n\\n");
UPDATE
If I understand what you are saying, John, then your code will essentially be
BufferedReader in = new BufferedReader(new FileReader("foo.txt"));
List<String> allStrings = new ArrayList<String>();
String str = "";
while (true)
{
    String tmp = in.readLine();
    if (tmp.isEmpty())
    {
        if (!str.isEmpty())
        {
            allStrings.add(str);
        }
        str = "";
    }
    else if (tmp == null)
    {
        break;
    }
    else
    {
        if (str.isEmpty())
        {
            str = tmp;
        }
        else
        {
            str += "\\n" + tmp;
        }
    }
}
This might be what you are trying to do.
Here allStrings is a list of all of your strings.

The code below works even if there are several consecutive empty lines between blocks of useful data.
import java.util.regex.*;
// read your file and store it in a string named str_file_data
// If your text file uses \r\n as the line terminator, use
// Pattern p = Pattern.compile("\\r\\n[\\r\\n]+"); instead.
Pattern p = Pattern.compile("\\n[\\n]+");
String[] result = p.split(str_file_data);
(I did not test the code, so there could be typos.)

I would suggest a more general regexp:
text.split("(?m)^\\s*$");
This works with any end-of-line convention, and it treats empty lines and whitespace-only lines the same way.
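To illustrate the regexp above, here is a minimal sketch. The class and method names are my own, purely for illustration. The pieces returned by split can still carry line terminators at their edges, and several blank lines in a row can yield empty pieces, so the sketch trims each piece and drops the empty ones.

```java
import java.util.ArrayList;
import java.util.List;

public class BlankLineSplit {
    // Splits text into blocks separated by empty or whitespace-only lines.
    // Hypothetical helper, not part of any library.
    static List<String> splitOnBlankLines(String text) {
        List<String> blocks = new ArrayList<>();
        for (String piece : text.split("(?m)^\\s*$")) {
            String block = piece.trim();   // drop terminators left at the edges
            if (!block.isEmpty()) {        // skip pieces that held only whitespace
                blocks.add(block);
            }
        }
        return blocks;
    }

    public static void main(String[] args) {
        String text = "abcdef\nghijkl\n\naededd\nededed\n  \nededfe\nefefeef\nefefeff\n";
        for (String block : splitOnBlankLines(text)) {
            System.out.println("[" + block + "]");
        }
    }
}
```

Keep in mind that trim() also removes leading spaces on the first line and trailing spaces on the last line of each block, which may or may not be what you want.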

It may depend on how the file is encoded, so I would likely do the following:
String.split("(\\n\\r|\\n|\\r){2}");
Some text files encode newlines as "\n\r" while others may be simply "\n". Two new lines in a row means you have an empty line.

Godwin was on the right track, but I think we can make this work a bit better. Using '[ ]' in a regex means "or", so in his example a single \r\n, which is just a new line and not an empty line, would also match: the regular expression would split on both the \r and the \n. I believe the example calls for an empty line, which requires either a \n\r\n\r, a \r\n\r\n, a \n\r\r\n, a \r\n\n\r, a \n\n, or a \r\r.
So first we want to look for either \n\r or \r\n twice, with any combination of the two being possible.
String.split("((\\n\\r)|(\\r\\n)){2}");
Next we need to look for \r without a \n after it:
String.split("\\r{2}");
Lastly, let's do the same for \n:
String.split("\\n{2}");
All together that should be
String.split("((\\n\\r)|(\\r\\n)){2}|(\\r){2}|(\\n){2}");
Note that this works only for the very specific case of newlines and carriage returns. In Ruby you can do the following, which would cover more cases. I don't know if there is an equivalent in Java.
.match(/^$/)
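On the problem of \r\n being read as two separate terminators, one possible approach in Java is an atomic group, which stops the regex engine from backtracking into a terminator it has already matched. This is only a sketch under my own assumptions: the class name and constants are hypothetical, and it treats \r\n, \r and \n as the only line terminators (so a \n\r pair counts as two).

```java
import java.util.regex.Pattern;

public class EmptyLineSplitter {
    // One line terminator, matched atomically so that "\r\n" is never
    // re-read as "\r" followed by "\n".
    private static final String EOL = "(?>\\r\\n|\\r|\\n)";

    // A terminator followed by one or more lines holding only spaces or tabs.
    private static final Pattern BLANK_LINES =
            Pattern.compile(EOL + "(?:[ \\t]*" + EOL + ")+");

    // Splits the text wherever at least one empty (or blank) line occurs.
    static String[] split(String text) {
        return BLANK_LINES.split(text);
    }

    public static void main(String[] args) {
        for (String block : split("abcdef\r\nghijkl\r\n\r\naededd\r\nededed")) {
            System.out.println("[" + block + "]");
        }
    }
}
```

A terminator at the very end of the text stays attached to the last block, so you may still want to strip it afterwards.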

@Kevin's code works fine once a few things are fixed. As he mentioned, the code was not tested; here are the 3 changes required:
1. The check for (tmp==null) should come first, otherwise there will be a NullPointerException.
2. The code leaves out the last set of lines when filling the ArrayList. To make sure the last one gets added, we have to include this code after the while loop: if(!str.isEmpty()) { allStrings.add(str); }
3. The line str += "\\n" + tmp; should be changed to use \n instead of \\n. I have added the entire code below so that it can help:
BufferedReader in = new BufferedReader(new FileReader("foo.txt"));
List<String> allStrings = new ArrayList<String>();
String str = "";
while (true)
{
    String tmp = in.readLine();
    if (tmp == null)
    {
        break; // end of file
    }
    else if (tmp.isEmpty())
    {
        // an empty line closes the current block
        if (!str.isEmpty())
        {
            allStrings.add(str);
        }
        str = "";
    }
    else
    {
        if (str.isEmpty())
        {
            str = tmp;
        }
        else
        {
            str += "\n" + tmp;
        }
    }
}
// add the last block, which is not followed by an empty line
if (!str.isEmpty())
{
    allStrings.add(str);
}

Related

Read file using delimiter and add to array

I am trying to read from a text file in my project workspace and then:
Create an object depending on the first element on each line of the file
Set some variables within the object
Add it to my ArrayList
I seem to be reading the file OK, but I am struggling to create the different objects based on the first element of each line in the text file.
The text file looks like this:
ul,1,gg,0,33.0
sl,2,hh,0,44.0
My expected result is to create an UltimateLanding object or a StrongLanding object based on the first element of each line in the example above.
Disclaimer - I know .equals is not used correctly in the IF statement; I've tried many ways to resolve this.
My Code -
Edited -
It seems the program is now reading the file correctly and adding to the array. However, it only does this for the first line in the file. There should be 2 objects created, as there are 2 lines in the text file.
Scanner myFile = new Scanner(fr);
String line;
myFile.useDelimiter(",");
while (myFile.hasNext()) {
line = myFile.next();
if (line.equals("sl")) {
StrongLanding sl = new StrongLanding();
sl.setLandingId(Integer.parseInt(myFile.next()));
sl.setLandingDesc(myFile.next());
sl.setNumLandings(Integer.parseInt(myFile.next()));
sl.setCost(Double.parseDouble(myFile.next()));
landings.add(sl);
} else if (line.equals("ul")) {
UltimateLanding ul = new UltimateLanding();
ul.setLandingId(Integer.parseInt(myFile.next()));
ul.setLandingDesc(myFile.next());
ul.setNumLandings(Integer.parseInt(myFile.next()));
ul.setCost(Double.parseDouble(myFile.next()));
landings.add(ul);
}
}
TIA

There are multiple issues with your current code.
myFile.equals("sl") compares your Scanner object with a String. You would actually want to compare your read string line, not your Scanner object. So line.equals("sl").
nextLine() will read the whole line. So line will never be equal to "sl". You should split the line using your specified delimiter, then use the split parts to build your object. This way, you will not have to worry about newline in combination with next().
Currently, your evaluation of the read input is outside of the while loop, so you will read all the content of the file, but only evaluate the last line (currently). You should move the evaluation of the input and creation of your landing objects inside the while loop.
All suggestions implemented:
...
Scanner myFile = new Scanner(fr);
// no need to specify a delimiter, since you want to read line by line
String line;
String[] splitLine;
while (myFile.hasNextLine()) {
line = myFile.nextLine();
splitLine = line.split(","); // split the line by ","
if (splitLine[0].equals("sl")) {
StrongLanding sl = new StrongLanding();
sl.setLandingId(Integer.parseInt(splitLine[1]));
sl.setLandingDesc(splitLine[2]);
sl.setNumLandings(Integer.parseInt(splitLine[3]));
sl.setCost(Double.parseDouble(splitLine[4]));
landings.add(sl);
} else if (splitLine[0].equals("ul")) {
UltimateLanding ul = new UltimateLanding();
ul.setLandingId(Integer.parseInt(splitLine[1]));
ul.setLandingDesc(splitLine[2]);
ul.setNumLandings(Integer.parseInt(splitLine[3]));
ul.setCost(Double.parseDouble(splitLine[4]));
landings.add(ul);
}
}
...
However, if you don't want to read the contents line by line (due to whatever requirement you have), you can keep reading it via next(), but you have to specify the delimiter correctly:
...
Scanner myFile = new Scanner(fr);
String line; // variable naming could be improved, since it's not the line
myFile.useDelimiter(",|\\r?\\n"); // comma and newline (\n or \r\n) as delimiters
while (myFile.hasNext()) {
line = myFile.next();
if (line.equals("sl")) {
StrongLanding sl = new StrongLanding();
sl.setLandingId(Integer.parseInt(myFile.next()));
sl.setLandingDesc(myFile.next());
sl.setNumLandings(Integer.parseInt(myFile.next()));
sl.setCost(Double.parseDouble(myFile.next()));
landings.add(sl);
} else if (line.equals("ul")) {
UltimateLanding ul = new UltimateLanding();
ul.setLandingId(Integer.parseInt(myFile.next()));
ul.setLandingDesc(myFile.next());
ul.setNumLandings(Integer.parseInt(myFile.next()));
ul.setCost(Double.parseDouble(myFile.next()));
landings.add(ul);
}
}
...
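To see what a combined comma-and-newline delimiter does to the sample data, here is a small self-contained sketch. The class and method names are hypothetical, and it reads from a String instead of a file so it can be run as-is. The delimiter used here also accepts Windows line endings (\r\n).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;

public class DelimiterDemo {
    // Collects every token produced by a Scanner that treats commas
    // and line terminators (\n or \r\n) as delimiters.
    static List<String> tokens(String input) {
        List<String> result = new ArrayList<>();
        try (Scanner scanner = new Scanner(input)) {
            scanner.useDelimiter(",|\\r?\\n");
            while (scanner.hasNext()) {
                result.add(scanner.next());
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Same two records as in the question, with Windows line endings.
        String data = "ul,1,gg,0,33.0\r\nsl,2,hh,0,44.0\r\n";
        System.out.println(tokens(data));
    }
}
```

Each record contributes five tokens in order, so the type tag ("ul" or "sl") is always followed by its four fields.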

A solution using streams:
List<Landing> landings;
// try-with-resources closes the file once the stream has been consumed
try (Stream<String> lines = Files.lines(Paths.get("LandingsData.txt"))) {
    landings = lines.map(line -> {
        String[] split = line.split(",");
        if (split[0].equals("sl")) {
            StrongLanding sl = new StrongLanding();
            sl.setLandingId(Integer.parseInt(split[1]));
            sl.setLandingDesc(split[2]);
            sl.setNumLandings(Integer.parseInt(split[3]));
            sl.setCost(Double.parseDouble(split[4]));
            return sl;
        } else if (split[0].equals("ul")) {
            UltimateLanding ul = new UltimateLanding();
            ul.setLandingId(Integer.parseInt(split[1]));
            ul.setLandingDesc(split[2]);
            ul.setNumLandings(Integer.parseInt(split[3]));
            ul.setCost(Double.parseDouble(split[4]));
            return ul;
        }
        return null; // unknown record type, filtered out below
    }).filter(t -> t != null).collect(Collectors.toList());
}

Remove stop words from file - going over it multiple times causes content duplication and does not remove the words

I am trying to go over a bunch of files, read each of them, and remove all stopwords from a specified list with such words. The result is a disaster - the content of the whole file copied over and over again.
What I tried:
- Saving the file as String and trying to look with regex
- Saving the file as String and going over line by line and comparing tokens to the stopwords that are stored in a LinkedHashSet, I can also store them in a file
- tried to twist the logic below in multiple ways, getting more and more ridiculous output.
- tried looking into text / line with the .contains() method, but no luck
My general logic is as follows:
for every word in the stopwords set:
    while(file has more lines):
        save current line into String
        while (current line has more tokens):
            assign current token into String
            compare token with current stopword:
            if(token equals stopword):
                write in the output file "" + " "
            else: write in the output file the token as is
Tried what's in this question and many other SO questions, but just can't achieve what I need.
Real code below:
private static void removeStopWords(File fileIn) throws IOException {
File stopWordsTXT = new File("stopwords.txt");
System.out.println("[Removing StopWords...] FILE: " + fileIn.getName() + "\n");
// create file reader and go over it to save the stopwords into the Set data structure
BufferedReader readerSW = new BufferedReader(new FileReader(stopWordsTXT));
Set<String> stopWords = new LinkedHashSet<String>();
for (String line; (line = readerSW.readLine()) != null; readerSW.readLine()) {
// trim() eliminates leading and trailing spaces
stopWords.add(line.trim());
}
File outp = new File(fileIn.getPath().substring(0, fileIn.getPath().lastIndexOf('.')) + "_NoStopWords.txt");
FileWriter fOut = new FileWriter(outp);
Scanner readerTxt = new Scanner(new FileInputStream(fileIn), "UTF-8");
while(readerTxt.hasNextLine()) {
String line = readerTxt.nextLine();
System.out.println(line);
Scanner lineReader = new Scanner(line);
for (String curSW : stopWords) {
while(lineReader.hasNext()) {
String token = lineReader.next();
if(token.equals(curSW)) {
System.out.println("---> Removing SW: " + curSW);
fOut.write("" + " ");
} else {
fOut.write(token + " ");
}
}
}
fOut.write("\n");
}
fOut.close();
}
What happens most often is that it looks for the first word from the stopWords set and that's it. The output contains all the other words even if I manage to remove the first one. And the first will be there in the next appended output in the end.
Part of my stopword list
about
above
after
again
against
all
am
and
any
are
as
at
With tokens I mean words, i.e. getting every word from the line and comparing it to the current stopword

After a while of debugging, I believe I have found the solution. This problem is tricky because you have to use several different scanners and file readers. Here is what I did:
I changed how you add to your stopWords set, as the words weren't being added correctly. I used a BufferedReader to read each line, then a Scanner to read each word, and added each word to the set.
Then, for the comparison, I got rid of one of your loops, since you can simply use the .contains() method to check whether a word is a stop word.
I left the part of writing the output file without the stop words to you, as I'm sure you can figure that out now that everything else is working.
-My sample stop words txt file:
Stop words
Words
-My samples input file was the exact same, so it should catch all three words.
The code:
// read the stop words file and store every word in the Set
BufferedReader readerSW = new BufferedReader(new FileReader("stopWords.txt"));
Set<String> stopWords = new LinkedHashSet<String>();
String stopWordsLine = readerSW.readLine();
while (stopWordsLine != null) {
    Scanner words = new Scanner(stopWordsLine);
    while (words.hasNext()) { // also copes with empty lines
        stopWords.add(words.next()); // add each word on the line to the set
    }
    words.close();
    stopWordsLine = readerSW.readLine();
}
readerSW.close();
// read the input file and check every word against the set
BufferedReader outp = new BufferedReader(new FileReader("Words.txt"));
String line = outp.readLine();
while (line != null) {
    Scanner lineReader = new Scanner(line);
    while (lineReader.hasNext()) {
        String word = lineReader.next();
        if (stopWords.contains(word)) {
            System.out.println("removing " + word);
        }
    }
    lineReader.close();
    line = outp.readLine();
}
outp.close();
Output:
removing Stop
removing words
removing Words
Let me know if I can elaborate any more on my code or why I did something!
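For the part that was left open (producing the text without the stop words), a minimal sketch could look like the following. The class and method names are my own, the comparison is case-sensitive, and tokens are assumed to be separated by whitespace. Writing the returned line to the output file is left out to keep it short.

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.Set;
import java.util.StringJoiner;

public class StopWordFilter {
    // Returns the line with every whitespace-separated token
    // that appears in stopWords removed.
    static String removeStopWords(String line, Set<String> stopWords) {
        StringJoiner kept = new StringJoiner(" ");
        for (String token : line.trim().split("\\s+")) {
            if (!token.isEmpty() && !stopWords.contains(token)) {
                kept.add(token);
            }
        }
        return kept.toString();
    }

    public static void main(String[] args) {
        Set<String> stopWords = new LinkedHashSet<>(Arrays.asList("about", "all", "and"));
        System.out.println(removeStopWords("all about the cats and dogs", stopWords));
    }
}
```

Because each token is checked once against the whole set, there is no need for an outer loop over the stop words, which is what caused the duplicated output in the question.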

How to determine the delimiter in CSV file

I have a scenario in which I have to parse CSV files from different sources. The parsing code is very simple and straightforward.
String csvFile = "/Users/csv/country.csv";
String line = "";
String cvsSplitBy = ",";
try (BufferedReader br = new BufferedReader(new FileReader(csvFile))) {
while ((line = br.readLine()) != null) {
// use comma as separator
String[] country = line.split(cvsSplitBy);
System.out.println("Country [code= " + country[4] + " , name=" + country[5] + "]");
}
} catch (IOException e) {
e.printStackTrace();
}
My problem comes from the CSV delimiter character. I have many different formats: sometimes it is a , and sometimes it is a ;
Is there any way to determine the delimiter character before parsing the file?

univocity-parsers supports automatic detection of the delimiter (also line endings and quotes). Just use it instead of fighting with your code:
CsvParserSettings settings = new CsvParserSettings();
settings.detectFormatAutomatically();
CsvParser parser = new CsvParser(settings);
List<String[]> rows = parser.parseAll(new File("/path/to/your.csv"));
// if you want to see what it detected
CsvFormat format = parser.getDetectedFormat();
Disclaimer: I'm the author of this library and I made sure all sorts of corner cases are covered. It's open source and free (Apache 2.0 license)
Hope this helps.

Yes, but only if the delimiter characters are not allowed to exist as regular text
The simplest answer is to keep a list of all the possible delimiter characters and try to identify which one is being used. Even so, you have to place some limitations on the files or on the people who created them. Look at the following two scenarios:
Case 1 - Contents of file.csv
test1,test2,test3
Case 2 - Contents of file.csv
test1|test2,3|test4
If you have prior knowledge of the delimiter characters, then you would split the first string using , and the second one using |, getting the same result. But if you try to identify the delimiter by parsing the file, both strings can be split using the , character, and you would end up with this:
Case 1 - Result of split using ,
test1
test2
test3
Case 2 - Result of split using ,
test1|test2
3|test4
Without prior knowledge of which delimiter character is being used, you cannot create a "magical" algorithm that will parse every combination of text; even regular expressions or counting the number of appearances of a character will not save you.
Worst case
test1,2|test3,4|test5
Looking at the text, a human would tokenize it using | as the delimiter. But , and | appear equally often. So, from an algorithm's perspective, both results are accurate:
Correct result
test1,2
test3,4
test5
Wrong result
test1
2|test3
4|test5
If you impose a set of guidelines, or you can somehow control the generation of the CSV files, then you could just try to find the delimiter with the String.contains() method, using the aforementioned list of characters. For example:
import java.io.*;
import java.util.*;
import java.util.regex.Pattern;
public class MyClass {
    // candidate delimiters, checked in this order; extend as needed
    private static final List<String> delimiterList = Arrays.asList(",", ";", "\t");
    private static String determineDelimiter(String text) {
        for (String delimiter : delimiterList) {
            if (text.contains(delimiter)) {
                return delimiter;
            }
        }
        return "";
    }
    public static void main(String[] args) {
        String csvFile = "/Users/csv/country.csv";
        String line = "";
        String delimiter = "";
        boolean firstLine = true;
        try (BufferedReader br = new BufferedReader(new FileReader(csvFile))) {
            while ((line = br.readLine()) != null) {
                if (firstLine) {
                    delimiter = determineDelimiter(line);
                    if (delimiter.isEmpty()) {
                        System.out.println("No supported delimiter found in: " + line);
                        return;
                    }
                    firstLine = false;
                }
                // split on the detected delimiter; quote it so that characters
                // such as | are not interpreted as regex syntax
                String[] country = line.split(Pattern.quote(delimiter));
                System.out.println("Country [code= " + country[4] + " , name=" + country[5] + "]");
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}
Update
For a more optimized way, in determineDelimiter() method instead of the for-each loop, you can employ regular expressions.

If the delimiter can appear in a data column, then you are asking for the impossible. For example, consider this first line of a CSV file:
one,two:three
This could be either a comma-separated or a colon-separated file. You can't tell which type it is.
If you can guarantee that the first line has all its columns surrounded by quotes, for example if it's always this format:
"one","two","three"
then you may be able to use this logic (although it's not 100% bullet-proof):
if (line.contains("\",\""))
delimiter = ',';
else if (line.contains("\";\""))
delimiter = ';';
If you can't guarantee a restricted format like that, then it would be better to pass the delimiter character as a parameter.
Then you can read the file using a widely-known open-source CSV parser such as Apache Commons CSV.

While I agree with Lefteris008 that no function can correctly determine the delimiter in all cases, we can have a function that is both efficient and gives mostly correct results in practice.
def head(filename: str, n: int):
    # read only the first n lines; fall back to the whole file if it is shorter
    try:
        with open(filename) as f:
            head_lines = [next(f).rstrip() for x in range(n)]
    except StopIteration:
        with open(filename) as f:
            head_lines = f.read().splitlines()
    return head_lines
def detect_delimiter(filename: str, n=2):
    sample_lines = head(filename, n)
    common_delimiters = [',', ';', '\t', ' ', '|', ':']
    for d in common_delimiters:
        ref = sample_lines[0].count(d)
        if ref > 0:
            # compare against the lines actually read (the file may have fewer than n)
            if all(ref == line.count(d) for line in sample_lines[1:]):
                return d
    return ','
My efficient implementation is based on:
Prior knowledge, such as the list of common delimiters you often work with (',;\t |:'), or even the likelihood of each delimiter being used, which is why I put the regular ',' at the top of the list.
The delimiter appearing the same number of times in each line of the text file. This resolves the problem that arises when we read a single line and see equal frequencies (the false detection described by Lefteris008), or even see the right delimiter appear less often than the wrong one in the first line.
An efficient head function that reads only the first n lines of the file.
As you increase the number of sample lines n, the likelihood of getting a false answer drops drastically. I have often found n=2 to be adequate.
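Since the question is about Java, here is a rough Java sketch of the same heuristic. It is my own port, not the code from this answer: the class name, method names and the candidate list are assumptions, and the sample lines are passed in as a list so that reading the first n lines of the file is left to the caller.

```java
import java.util.Arrays;
import java.util.List;

public class DelimiterDetector {
    // Candidate delimiters, most likely first (an assumed ordering).
    private static final List<Character> CANDIDATES =
            Arrays.asList(',', ';', '\t', ' ', '|', ':');

    // Counts how many times c occurs in line.
    private static long count(String line, char c) {
        return line.chars().filter(ch -> ch == c).count();
    }

    // Returns the first candidate that occurs at least once in the first
    // sample line and equally often in every other sample line.
    // Falls back to ',' when no candidate qualifies.
    static char detectDelimiter(List<String> sampleLines) {
        if (sampleLines.isEmpty()) {
            return ',';
        }
        for (char candidate : CANDIDATES) {
            long ref = count(sampleLines.get(0), candidate);
            if (ref == 0) {
                continue;
            }
            boolean consistent = true;
            for (String line : sampleLines) {
                if (count(line, candidate) != ref) {
                    consistent = false;
                    break;
                }
            }
            if (consistent) {
                return candidate;
            }
        }
        return ',';
    }

    public static void main(String[] args) {
        List<String> sample = Arrays.asList("test1,2|test3,4|test5", "a|b|c");
        System.out.println("Detected: '" + detectDelimiter(sample) + "'");
    }
}
```

As with the Python version, this is a heuristic: it can still be fooled when a non-delimiter character happens to occur equally often in every sampled line.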

Add a condition like this:
String[] country;
if (line.contains(","))
    country = line.split(",");
else if (line.contains(";"))
    country = line.split(";");

That depends....
If your records always have the same number of fields and/or the separator NEVER occurs in your data columns, you could just read the first line of the file, look for the separator in it, set it, and then read the rest of the file using that separator.
Something like
String csvFile = "/Users/csv/country.csv";
String line = "";
String cvsSplitBy = ",";
try (BufferedReader br = new BufferedReader(new FileReader(csvFile))) {
while ((line = br.readLine()) != null) {
// use comma as separator
if (line.contains(",")) {
cvsSplitBy = ",";
} else if (line.contains(";")) {
cvsSplitBy = ";";
} else {
System.out.println("Wrong separator!");
}
String[] country = line.split(cvsSplitBy);
System.out.println("Country [code= " + country[4] + " , name=" + country[5] + "]");
}
} catch (IOException e) {
e.printStackTrace();
}
Greetz Kai

Java compare strings from two places and exclude any matches

I'm trying to end up with a results.txt minus any matching items, having successfully compared some string inputs against another .txt file. Been staring at this code for way too long and I can't figure out why it isn't working. New to coding so would appreciate it if I could be steered in the right direction! Maybe I need a different approach? Apologies in advance for any loud tutting noises you may make. Using Java8.
//Sending a String[] into 'searchFile', contains around 8 small strings.
//Example of input: String[]{"name1","name2","name 3", "name 4.zip"}
^ This is my exclusions list.
public static void searchFile(String[] arr, String separator)
{
StringBuilder b = new StringBuilder();
for(int i = 0; i < arr.length; i++)
{
if(i != 0) b.append(separator);
b.append(arr[i]);
String findME = arr[i];
searchInfo(MyApp.getOptionsDir()+File.separator+"file-to-search.txt",findME);
}
}
^This works fine. I'm then sending the results to 'searchInfo' and trying to match and remove any duplicate (complete, not part) strings. This is where I am currently failing. Code runs but doesn't produce my desired output. It often finds part strings rather than complete ones. I think the 'results.txt' file is being overwritten each time...but I'm not sure tbh!
file-to-search.txt contains: "name2","name.zip","name 3.zip","name 4.zip" (text file is just a single line)
public static String searchInfo(String fileName, String findME)
{
StringBuffer sb = new StringBuffer();
try {
BufferedReader br = new BufferedReader(new FileReader(fileName));
String line = null;
while((line = br.readLine()) != null)
{
if(line.startsWith("\""+findME+"\""))
{
sb.append(line);
//tried various replace options with no joy
line = line.replaceFirst(findME+"?,", "");
//then goes off with results to create a txt file
FileHandling.createFile("results.txt",line);
}
}
} catch (Exception e) {
e.printStackTrace();
}
return sb.toString();
}
What I'm trying to end up with is a result file MINUS any matching complete strings (not partial strings):
e.g. results.txt to end up with: "name.zip","name 3.zip"

OK, with the information I have, what you can do is this:
List<String> result = new ArrayList<>();
String content = FileUtils.readFileToString(file, "UTF-8");
for (String s : content.split(", ")) {
if (!s.equals(findME)) { // assuming both have string quotes added already
result.add(s);
}
}
FileUtils.write(newFile, String.join(", ", result), "UTF-8");
This uses FileUtils from Apache Commons IO for ease. You may add or remove the space after the comma as needed.
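If you would rather avoid an extra library, the same filtering can be sketched in plain Java. The names below are hypothetical, and the sketch assumes the file holds a single line of comma-separated, double-quoted entries that contain no commas themselves. Reading the line and writing results.txt are left out.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class ExclusionFilter {
    // Removes every entry whose unquoted text exactly equals one of the
    // exclusions, and returns the remaining entries re-joined with commas.
    static String removeExcluded(String line, String[] exclusions) {
        Set<String> excluded = new HashSet<>(Arrays.asList(exclusions));
        List<String> kept = new ArrayList<>();
        for (String entry : line.split(",")) {
            String trimmed = entry.trim();
            // strip one leading and one trailing quote for the comparison
            String name = trimmed.replaceAll("^\"|\"$", "");
            if (!excluded.contains(name)) {
                kept.add(trimmed);
            }
        }
        return String.join(",", kept);
    }

    public static void main(String[] args) {
        String line = "\"name2\",\"name.zip\",\"name 3.zip\",\"name 4.zip\"";
        String[] exclusions = {"name1", "name2", "name 3", "name 4.zip"};
        System.out.println(removeExcluded(line, exclusions));
    }
}
```

Comparing whole entries through a Set is what prevents partial matches such as "name 3" removing "name 3.zip".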

Storing values from a file into array according to some multiple split() criteria

This is the file from where i am reading:
abc.txt
1,Arjun,12,GhandiNagar,Pune,411020
2,Deep,8,M.G.Road,Mumbai,411032
3,Deep,3,F.C.Road,Pune,411032
Now, how do I store the individual fields in a String array?
I have used
String content="";
while((line=br.readLine())!=null)
{
content=line+content;
}
String x[]=content.split(",");
But this splits only on ",", so the last field of one line runs together with the first field of another, e.g. 411020'2' / 411032'3'.
So how do I separate them and store them in an array like
x[0]=1,x[1]=Arjun,x[2]=12,x[3]=GhandiNagar,x[4]=Pune,x[5]=411020,x[6]=2,etc..?

You should do something like
String x[]=line.split(",");
within your while block. Because each line is split on its own, line breaks never end up inside a field.

Try adding a comma after the line is added to the content:
content = line + "," + content;
By the way, this effectively reverses the order of the lines in your file. If you don't want this to happen do this:
content = content + "," + line;
However, string concatenation in a loop (which is what you are doing) performs poorly and is best avoided; a StringBuilder/StringBuffer performs better:
StringBuilder content = new StringBuilder();
while ((line = br.readLine()) != null) {
content.append(line);
content.append(",");
}
String[] x = content.toString().split(",");

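Try:
String x[] = content.split(",|\\r?\\n");
This splits on multiple delimiters: at every "," and at every line terminator ("\n", or "\r\n" on Windows). | is the regex OR operator. Note that it has to be applied to the whole file content with its line terminators intact; a line returned by readLine() no longer contains them.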
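Putting the suggestions together, one way to get a single flat array is to split every line separately and collect the fields in a list. This is only a sketch with names of my own choosing. It takes the lines as a list so it can be run without a file.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class FieldReader {
    // Splits every non-empty line on "," and returns all fields
    // in file order as one flat array.
    static String[] flattenFields(List<String> lines) {
        List<String> fields = new ArrayList<>();
        for (String line : lines) {
            if (line.isEmpty()) {
                continue; // ignore blank lines
            }
            fields.addAll(Arrays.asList(line.split(",")));
        }
        return fields.toArray(new String[0]);
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
                "1,Arjun,12,GhandiNagar,Pune,411020",
                "2,Deep,8,M.G.Road,Mumbai,411032",
                "3,Deep,3,F.C.Road,Pune,411032");
        String[] x = flattenFields(lines);
        System.out.println(x[5] + " " + x[6]);
    }
}
```

For a real file, the list could come from Files.readAllLines(Paths.get("abc.txt")), which also takes care of the different line-ending conventions.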
