I want to split each line into two separate strings when reading through the txt file I'm using and later store them in a HashMap. But right now I can't seem to read through the file properly. This is what a small part of my file looks like:
....
CPI Clock Per Instruction
CPI Common Programming Interface [IBM]
.CPI Code Page Information (file name extension) [MS-DOS]
CPI-C Common Programming Interface for Communications [IBM]
CPIO Copy In and Out [Unix]
....
And this is what my code looks like:
try {
BufferedReader br = new BufferedReader(new FileReader("akronymer.txt"));
String line;
String akronym;
String betydning;
while((line = br.readLine()) != null) {
String[] linje = line.split("\\s+");
akronym = linje[0];
betydning = linje[1];
System.out.println(akronym + " || " + betydning);
}
} catch(Exception e) {
System.out.println("Feilen som ble fanget opp: " + e);
}
What I want is to store the acronym in one String and the definition in another String.
The problem is that whitespace in the definition is interpreted as additional fields. You're getting only the first word of the definition in linje[1] because the other words are in other array elements:
["CPI", "Clock", "Per", "Instruction"]
Supply a limit parameter in the two-arg overload of split, to stop at 2 fields:
String[] linje = line.split("\\s+", 2);
E.g. linje[0] will be CPI and linje[1] will be Clock Per Instruction.
If you want to limit the split to two parts, use split("\\s+", 2). Right now you are splitting the line at every run of whitespace, so every word ends up in a separate array element.
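Since the end goal is a HashMap, here is a minimal sketch of how the limited split could feed one. The class name AcronymLoader and the method load are made up for illustration, and the sketch assumes that a repeated acronym (CPI appears twice in the sample) may simply overwrite the earlier entry.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper, not part of the original question or answers.
class AcronymLoader {

    // Reads "ACRONYM definition ..." lines and maps each acronym to its definition.
    static Map<String, String> load(BufferedReader reader) throws IOException {
        Map<String, String> acronyms = new HashMap<>();
        String line;
        while ((line = reader.readLine()) != null) {
            // Limit 2: split only at the first run of whitespace.
            String[] parts = line.trim().split("\\s+", 2);
            if (parts.length < 2) {
                continue; // skip blank lines and lines without a definition
            }
            // Assumption: a repeated acronym overwrites the earlier entry.
            acronyms.put(parts[0], parts[1]);
        }
        return acronyms;
    }

    public static void main(String[] args) throws IOException {
        String sample = "CPI-C Common Programming Interface for Communications [IBM]\n"
                + "CPIO Copy In and Out [Unix]\n";
        Map<String, String> acronyms = load(new BufferedReader(new StringReader(sample)));
        System.out.println(acronyms.get("CPIO"));
    }
}
```

For the real file you would pass new BufferedReader(new FileReader("akronymer.txt")) instead of the StringReader. If you need to keep every definition of a repeated acronym, a Map<String, List<String>> is probably a better fit.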
Related
I have a scenario in which I have to parse CSV files from different sources. The parsing code is very simple and straightforward:
String csvFile = "/Users/csv/country.csv";
String line = "";
String cvsSplitBy = ",";
try (BufferedReader br = new BufferedReader(new FileReader(csvFile))) {
while ((line = br.readLine()) != null) {
// use comma as separator
String[] country = line.split(cvsSplitBy);
System.out.println("Country [code= " + country[4] + " , name=" + country[5] + "]");
}
} catch (IOException e) {
e.printStackTrace();
}
My problem comes from the CSV delimiter character: the files come in many different formats, and the delimiter is sometimes a , and sometimes a ;
Is there any way to determine the delimiter character before parsing the file?
univocity-parsers supports automatic detection of the delimiter (also line endings and quotes). Just use it instead of fighting with your code:
CsvParserSettings settings = new CsvParserSettings();
settings.detectFormatAutomatically();
CsvParser parser = new CsvParser(settings);
List<String[]> rows = parser.parseAll(new File("/path/to/your.csv"));
// if you want to see what it detected
CsvFormat format = parser.getDetectedFormat();
Disclaimer: I'm the author of this library and I made sure all sorts of corner cases are covered. It's open source and free (Apache 2.0 license)
Hope this helps.
Yes, but only if the delimiter characters are not allowed to exist as regular text
The simplest approach is to keep a list of all the candidate delimiter characters and try to identify which one is being used. Even so, you have to place some limitations on the files or on the people who create them. Look at the following two scenarios:
Case 1 - Contents of file.csv
test1,test2,test3
Case 2 - Contents of file.csv
test1|test2,3|test4
If you have prior knowledge of the delimiter characters, then you would split the first string using , and the second one using |, getting the same result. But, if you try to identify the delimiter by parsing the file, both strings can be split using the , character, and you would end up with this:
Case 1 - Result of split using ,
test1
test2
test3
Case 2 - Result of split using ,
test1|test2
3|test4
Without prior knowledge of which delimiter character is used, you cannot create a "magical" algorithm that will parse every combination of text; even regular expressions or counting the number of occurrences of each character will not save you.
Worst case
test1,2|test3,4|test5
Looking at the text, a human would tokenize it using | as the delimiter. But , and | occur equally often. So, from an algorithm's perspective, both results are valid:
Correct result
test1,2
test3,4
test5
Wrong result
test1
2|test3
4|test5
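To make the tie concrete, here is a small sketch that counts both candidates in the worst-case line. The class name DelimiterCounts and the helper count are invented for this illustration.

```java
// Hypothetical demo class, only meant to illustrate the worst case above.
class DelimiterCounts {

    // Counts how many times the character c occurs in text.
    static long count(String text, char c) {
        return text.chars().filter(ch -> ch == c).count();
    }

    public static void main(String[] args) {
        String line = "test1,2|test3,4|test5";
        // Both candidates occur twice, so frequency alone cannot pick a winner.
        System.out.println("commas: " + count(line, ','));
        System.out.println("pipes:  " + count(line, '|'));
    }
}
```

Both counts come out as 2, which is why a pure frequency check on a single line has nothing to decide on here.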
If you impose a set of guidelines, or can otherwise control how the CSV files are generated, then you can simply find the delimiter with the String.contains() method, checking against the aforementioned list of characters. For example:
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

public class MyClass {
    // Candidate delimiters in priority order. Static, so the static methods can use it.
    private static final List<String> delimiterList = Arrays.asList(",", ";", "\t" /* etc... */);

    // Returns the first candidate delimiter found in text, or "" if none is present.
    private static String determineDelimiter(String text) {
        for (String delimiter : delimiterList) {
            if (text.contains(delimiter)) {
                return delimiter;
            }
        }
        return "";
    }

    public static void main(String[] args) {
        String csvFile = "/Users/csv/country.csv";
        String line;
        String delimiter = "";
        boolean firstLine = true;
        try (BufferedReader br = new BufferedReader(new FileReader(csvFile))) {
            while ((line = br.readLine()) != null) {
                if (firstLine) {
                    // Detect the delimiter once, from the first line only.
                    delimiter = determineDelimiter(line);
                    if (delimiter.isEmpty()) {
                        System.out.println("No supported delimiter found in: " + line);
                        return;
                    }
                    firstLine = false;
                }
                // Quote the delimiter so regex metacharacters such as | are taken literally.
                String[] country = line.split(Pattern.quote(delimiter));
                System.out.println("Country [code= " + country[4] + " , name=" + country[5] + "]");
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}
Update
As an alternative, the determineDelimiter() method can use a regular expression instead of the for-each loop.
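A regex-based determineDelimiter() might look roughly like the sketch below. The class name RegexDelimiterFinder is invented here, and the candidate set (comma, semicolon, tab) is an assumption carried over from the list above. Note one behavioural difference: the loop returns the first candidate in list order, while this version returns whichever candidate appears first in the text.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical variant of determineDelimiter(), for illustration only.
class RegexDelimiterFinder {

    // Character class of candidate delimiters: comma, semicolon, tab.
    private static final Pattern CANDIDATES = Pattern.compile("[,;\\t]");

    // Returns the candidate delimiter that occurs earliest in text, or "" if none occurs.
    static String determineDelimiter(String text) {
        Matcher matcher = CANDIDATES.matcher(text);
        return matcher.find() ? matcher.group() : "";
    }
}
```

Whether this is actually faster than the loop for three or four candidates is not something I would assume without measuring.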
If the delimiter can appear in a data column, then you are asking for the impossible. For example, consider this first line of a CSV file:
one,two:three
This could be either a comma-separated or a colon-separated file. You can't tell which type it is.
If you can guarantee that the first line has all its columns surrounded by quotes, for example if it's always this format:
"one","two","three"
then you may be able to use this logic (although it's not 100% bullet-proof):
if (line.contains("\",\""))
delimiter = ',';
else if (line.contains("\";\""))
delimiter = ';';
If you can't guarantee a restricted format like that, then it would be better to pass the delimiter character as a parameter.
Then you can read the file using a widely-known open-source CSV parser such as Apache Commons CSV.
While I agree with Lefteris008 that no function can correctly determine the delimiter in every case, we can write one that is efficient and gives mostly correct results in practice.
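def head(filename: str, n: int):
    """Return up to the first n lines of the file."""
    try:
        with open(filename) as f:
            head_lines = [next(f).rstrip() for x in range(n)]
    except StopIteration:
        # The file has fewer than n lines: fall back to reading all of it.
        with open(filename) as f:
            head_lines = f.read().splitlines()
    return head_lines

def detect_delimiter(filename: str, n=2):
    sample_lines = head(filename, n)
    if not sample_lines:
        return ','  # empty file: fall back to the default
    common_delimiters = [',', ';', '\t', ' ', '|', ':']
    for d in common_delimiters:
        ref = sample_lines[0].count(d)
        if ref > 0:
            # Accept d only if every sampled line contains it the same number of times.
            if all(ref == line.count(d) for line in sample_lines[1:]):
                return d
    return ','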
My implementation is based on the following:
Prior knowledge, such as the list of common delimiters you often work with (',;\t |:'), or even how likely each delimiter is to be used. That is why I put the regular ',' at the top of the list.
The delimiter should appear the same number of times in each line of the text file. This guards against a false detection from reading a single line (as in Lefteris008's examples), and against cases where the right delimiter appears less often than a wrong one in the first line.
An efficient head function that reads only the first n lines of the file.
As you increase the number of sample lines n, the likelihood of a false answer drops drastically. I have often found n=2 to be adequate.
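Since the question is about Java, here is a rough Java port of the same consistency check. The class name ConsistentDelimiterDetector is invented, the candidate list and the ',' fallback are copied from the Python version, and the caller is assumed to supply the first few lines of the file itself.

```java
import java.util.List;

// Hypothetical Java port of the Python detect_delimiter() above.
class ConsistentDelimiterDetector {

    // Candidates in priority order, copied from the Python answer.
    private static final char[] CANDIDATES = {',', ';', '\t', ' ', '|', ':'};
    private static final char FALLBACK = ',';

    // Returns the first candidate that occurs in the first line and occurs
    // the same number of times in every sampled line; otherwise the fallback.
    static char detect(List<String> sampleLines) {
        if (sampleLines.isEmpty()) {
            return FALLBACK;
        }
        for (char candidate : CANDIDATES) {
            final long reference = count(sampleLines.get(0), candidate);
            if (reference > 0
                    && sampleLines.stream().allMatch(l -> count(l, candidate) == reference)) {
                return candidate;
            }
        }
        return FALLBACK;
    }

    // Counts how many times the character c occurs in text.
    private static long count(String text, char c) {
        return text.chars().filter(ch -> ch == c).count();
    }
}
```

For example, detect(List.of("a;b,c;d", "e;f;g")) rejects ',' because the second line has none, and settles on ';' because both lines contain it twice.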
Add a condition like this:
String[] country;
if (line.contains(","))
    country = line.split(",");
else if (line.contains(";"))
    country = line.split(";");
else
    country = new String[] { line }; // no known delimiter: keep the line as a single field
That depends....
If your records always have the same number of fields and/or the separator NEVER occurs in your data columns, you could just read the first line of the file, look for the separator in it, set it, and then read the rest of the file using that separator.
Something like:
String csvFile = "/Users/csv/country.csv";
String line = "";
String cvsSplitBy = ",";
try (BufferedReader br = new BufferedReader(new FileReader(csvFile))) {
while ((line = br.readLine()) != null) {
// use comma as separator
if (line.contains(",")) {
cvsSplitBy = ",";
} else if (line.contains(";")) {
cvsSplitBy = ";";
} else {
System.out.println("Wrong separator!");
}
String[] country = line.split(cvsSplitBy);
System.out.println("Country [code= " + country[4] + " , name=" + country[5] + "]");
}
} catch (IOException e) {
e.printStackTrace();
}
Greetz Kai
I am searching a directory with about 450 files, each file around 20kb. Here is my method:
public void search(String searchWord) throws IOException
{
this.directoryPath = FileSystems.getDefault().getPath(this.directoryString);
this.fileListStream = Files.newDirectoryStream(this.directoryPath);
int fileCount = 0;
for(Path path : this.fileListStream)
{
String fileName = path.getFileName().toString();
if(!fileName.startsWith("."))
{
BufferedReader br = Files.newBufferedReader(path, Charset.defaultCharset());
String line;
while((line = br.readLine()) != null)
{
System.out.println(fileName + ": " + line);
}
fileCount++;
br.close();
}
}
System.out.println("File Count: " + fileCount);
}
My goal is to go word by word and find a match for searchWord and print out the line number and the file name it was found in.
My question is whether I should split each line into an array, search the array for the word, and add it to a list. Or should I scan the entire file into an array of words and then search for the words and add them to a list? Or does it even matter? Also, if there is a better way to do this, please let me know! I'm trying to do this as efficiently as possible due to limited resources.
You shouldn't be looking word by word. Just read the entire line as a String and then use the String.indexOf() method to check whether the line contains the word.
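A minimal sketch of that idea, with line numbers tracked along the way. The class name LineSearch and the method matchingLineNumbers are made up for illustration. Keep in mind that indexOf() does a substring match, so searching for "cat" also hits "catalog".

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not taken from the question's class.
class LineSearch {

    // Returns the 1-based numbers of the lines that contain searchWord as a substring.
    static List<Integer> matchingLineNumbers(BufferedReader reader, String searchWord)
            throws IOException {
        List<Integer> matches = new ArrayList<>();
        String line;
        int lineNumber = 0;
        while ((line = reader.readLine()) != null) {
            lineNumber++;
            if (line.indexOf(searchWord) >= 0) {
                matches.add(lineNumber);
            }
        }
        return matches;
    }
}
```

Inside the directory loop from the question you could call this with the BufferedReader you already open for each path and print fileName together with each returned line number. Only one line is held in memory at a time, which suits the limited-resources constraint.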
You can use the Scanner class to parse the files and its next() method to read each word, so you won't need an array or any other storage. If possible, process the files in multiple threads (for example, one task per file), which may improve performance further.
This is the file I am reading from:
abc.txt
1,Arjun,12,GhandiNagar,Pune,411020
2,Deep,8,M.G.Road,Mumbai,411032
3,Deep,3,F.C.Road,Pune,411032
Now how do I store the individual values in a String array?
I have used
String content="";
while(line=br.readLine()!=null)
{
content=line+content;
}
String x[]=content.split(",");
But this splits on "," only, and as a result the last value of every line is fused with the first value of another line, e.g. 411020'2' / 411032'3'.
So how do I separate them and store them in an array like
x[0]=1,x[1]=Arjun,x[2]=12,x[3]=GhandiNagar,x[4]=Pune,x[5]=411020,x[6]=2,etc..?
You should do something like
String x[]=line.split(",");
within your while block. Since readLine() returns each line without its line terminator, splitting on "," there is not affected by line breaks.
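Splitting inside the loop gives you one array per line, so to end up with the single flat array from the question you still need to collect the pieces. A possible sketch (the class name FieldCollector and the method readFields are invented here):

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper for illustration.
class FieldCollector {

    // Splits every non-empty line on "," and returns all fields in file order.
    static List<String> readFields(BufferedReader reader) throws IOException {
        List<String> fields = new ArrayList<>();
        String line;
        while ((line = reader.readLine()) != null) {
            if (line.isEmpty()) {
                continue; // ignore blank lines
            }
            fields.addAll(Arrays.asList(line.split(",")));
        }
        return fields;
    }
}
```

If you really need an array afterwards, String[] x = fields.toArray(new String[0]); gives x[5] = 411020 and x[6] = 2 for the sample file.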
Try adding a comma after the line is added to the content:
content = line + "," + content;
By the way, this effectively reverses the order of the lines in your file. If you don't want this to happen do this:
content = content + "," + line;
But repeated string concatenation (which is what you are doing) performs poorly and is best avoided; use a StringBuilder/StringBuffer instead:
StringBuilder content = new StringBuilder();
while ((line = br.readLine()) != null) {
content.append(line);
content.append(",");
}
String[] x = content.toString().split(",");
Try:
String x[] = line.split(",|\\r?\\n");
This code splits the text on multiple delimiters: at every "," and at every line break ("\n", optionally preceded by "\r"). | is the regex OR operator.
Suppose I have a file called "Bill.txt".
The format:
ItemType ItemName Price
Now I want to add a new Description field. This description must be written next to the price.
The problem is, how to determine the position where to write it.
Yeah I agree with user2085282. You could read in the file using:
BufferedReader in = new BufferedReader(new FileReader("Bill.txt"));
To each line the reader reads, append a semicolon or some other character that does not appear in the original file. Then split the string into an array on that character.
while ((line = in.readLine()) != null) {
//string = line + semicolon
// then set an array to split(;)
}
Then, in another loop, build the result with something like result += array[i] + description;
Then write the string to a new file.
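A slightly different way to get the same effect is to copy the file line by line and append the description as you go, without the marker character. The sketch below is only an illustration: the class name DescriptionAppender, the lookup map keyed by item name, and the assumption that the three fields contain no spaces are all mine, not part of the question.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.util.Map;

// Hypothetical helper; assumes lines look like "ItemType ItemName Price".
class DescriptionAppender {

    // Copies every line from in to out, appending the description found for its item name.
    static void appendDescriptions(BufferedReader in, BufferedWriter out,
                                   Map<String, String> descriptions) throws IOException {
        String line;
        while ((line = in.readLine()) != null) {
            String[] fields = line.trim().split("\\s+");
            // Assumed layout: fields[0] = ItemType, fields[1] = ItemName, fields[2] = Price.
            String description = fields.length > 1
                    ? descriptions.getOrDefault(fields[1], "")
                    : "";
            // Lines without a known description are copied unchanged.
            out.write(description.isEmpty() ? line : line + " " + description);
            out.newLine();
        }
        out.flush();
    }
}
```

You would open Bill.txt with the BufferedReader shown above, point the BufferedWriter at a new file, and replace the original once the copy has finished.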
I am trying to write a Java program that simulates a record store shopping cart. The first step is to open up the inventory.txt file and read the contents which is basically what the "store has to offer". Then I need to read every line individually and process the id record and price.
The current method outputs a result that is very close to what I need, however, it picks up on the item id of the next line, as you can see below.
I was wondering if someone can assist me in figuring out how to process every line in the text document individually and store every piece of data in its own variable without picking up the id of the next item?
public void openFile(){
try{
x = new Scanner(new File("inventory.txt"));
x.useDelimiter(",");
}
catch(Exception e){
System.out.println("Could not find file");
}
}
public void readFile(){
while(x.hasNext()){
String id = x.next();
String record = x.next();
String price = x.next();
System.out.println(id + " " + record + " " + price);
break;
}
}
.txt document:
11111, "Hush Hush... - Pussycat Dolls", 12.95
22222, "Animal - Ke$ha", 9.95
33333, "Hanging By A Moment - Lifehouse - Single, 4.95
44444, "Have A Nice Day - Bon Jovi", 9.99
55555, "Day & Age - Killers", 10.99
66666, "She Wolf - Shakira", 15.99
77777, "Dark Horse - Nickelback", 12.99
88888, "The E.N.D. - Black Eyed Peas", 10.95
actual output
11111 "Hush Hush... - Pussycat Dolls" 12.95
22222
expected result
11111 "Hush Hush... - Pussycat Dolls" 12.95
So the problem here specifically is that you are breaking on commas, when you should be breaking on commas and newlines. But there are tons of other corner cases (for example, if your column is "abc,,,abc" you shouldn't break on those commas). Apache Commons CSV comes with a CSVParser that handles all of these corner cases; you should use it:
http://commons.apache.org/csv/apidocs/org/apache/commons/csv/CSVParser.html
You can use a Pattern as the argument to Scanner.useDelimiter. Use this to provide alternatives for the delimiter: either a comma or the line separator.
x.useDelimiter(",|" + System.getProperty("line.separator"));
Depending on what your input file uses as the line separator, you may need to change the second option.
The advice in other answers to use an existing CSV library is good: parsing CSV isn't as simple as breaking up the input around commas.
There are multiple ways to achieve this but going with your own way, you could use Scanner to first read lines (use Java's "line.separator" as delimiter) and then use Scanner class again with comma as delimiter.
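A rough sketch of the two-level approach. It uses Scanner.nextLine() for the outer level rather than setting "line.separator" as a delimiter, which is a swap on my part. The class name InventoryReader is invented, and the code assumes that no record title contains a comma; a title with a comma would be split into too many fields and the line skipped.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;

// Hypothetical helper for illustration; not a full CSV parser.
class InventoryReader {

    // Reads one record per line and returns {id, record, price} for each well-formed line.
    static List<String[]> readRecords(Scanner lines) {
        List<String[]> records = new ArrayList<>();
        while (lines.hasNextLine()) {
            String line = lines.nextLine();
            List<String> tokens = new ArrayList<>();
            // Inner Scanner: split this single line on commas, ignoring surrounding spaces.
            try (Scanner fields = new Scanner(line)) {
                fields.useDelimiter("\\s*,\\s*");
                while (fields.hasNext()) {
                    tokens.add(fields.next().trim());
                }
            }
            // Assumption: exactly three fields per line; anything else is skipped.
            if (tokens.size() == 3) {
                records.add(tokens.toArray(new String[0]));
            }
        }
        return records;
    }
}
```

Called as readRecords(new Scanner(new File("inventory.txt"))), each returned array holds the id, the quoted record title, and the price of one line, so nothing from the next line can leak in.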
The problem you're going to face is that CSV is more than just splitting a String on a comma. There are considerations to take into account with "escaped" commas (commas you don't want to delimit on).
I suggest you save yourself a lot of time and headaches and use an existing API.
Apache Commons has already been mentioned. I recently used OpenCSV and found it to be extremely simple to use and powerful.
IMHO
An easy way to read in the entire file into a list of Strings (lines)...
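import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

// Named LineReader rather than Scanner so it does not clash with java.util.Scanner.
public class LineReader {
    public static List<String> readLines(String filename) throws IOException {
        List<String> lines = new ArrayList<String>();
        // try-with-resources closes the reader even if reading fails.
        try (BufferedReader bufferedReader = new BufferedReader(new FileReader(filename))) {
            String line;
            while ((line = bufferedReader.readLine()) != null) {
                lines.add(line);
            }
        }
        return lines;
    }
}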
Then you can process the individual lines as before, as each line is its own String object. That is, if you don't use a CSVParser.