I have a scenario at which i have to parse CSV files from different sources, the parsing code is very simple and straightforward.
String csvFile = "/Users/csv/country.csv";
String line = "";
String cvsSplitBy = ",";
try (BufferedReader br = new BufferedReader(new FileReader(csvFile))) {
while ((line = br.readLine()) != null) {
// use comma as separator
String[] country = line.split(cvsSplitBy);
System.out.println("Country [code= " + country[4] + " , name=" + country[5] + "]");
}
} catch (IOException e) {
e.printStackTrace();
}
my problem come from the CSV delimiter character, i have many different formats, some time it is a , sometimes it is a ;
is there is any way to determine the delimiter character before parsing the file
univocity-parsers supports automatic detection of the delimiter (also line endings and quotes). Just use it instead of fighting with your code:
CsvParserSettings settings = new CsvParserSettings();
settings.detectFormatAutomatically();
CsvParser parser = new CsvParser(settings);
List<String[]> rows = parser.parseAll(new File("/path/to/your.csv"));
// if you want to see what it detected
CsvFormat format = parser.getDetectedFormat();
Disclaimer: I'm the author of this library and I made sure all sorts of corner cases are covered. It's open source and free (Apache 2.0 license)
Hope this helps.
Yes, but only if the delimiter characters are not allowed to exist as regular text
The most simple answer is to have a list with all the available delimiter characters and try to identify which character is being used. Even though, you have to place some limitations on the files or the person/people that created them. Look a the following two scenarios:
Case 1 - Contents of file.csv
test,test2,test3
Case 2 - Contents of file.csv
test1|test2,3|test4
If you have prior knowledge of the delimiter characters, then you would split the first string using , and the second one using |, getting the same result. But, if you try to identify the delimiter by parsing the file, both strings can be split using the , character, and you would end up with this:
Case 1 - Result of split using ,
test1
test2
test3
Case 2 - Result of split using ,
test1|test2
3|test4
By lacking the prior knowledge of which delimiter character is being used, you cannot create a "magical" algorithm that will parse every combination of text; even regular expressions or counting the number of appearance of a character will not save you.
Worst case
test1,2|test3,4|test5
By looking the text, one can tokenize it by using | as the delimiter. But the frequency of appearance of both , and | are the same. So, from an algorithm's perspective, both results are accurate:
Correct result
test1,2
test3,4
test5
Wrong result
test1
2|test3
4|test5
If you pose a set of guidelines or you can somehow control the generation of the CSV files, then you could just try to find the delimiter used with String.contains() method, employing the aforementioned list of characters. For example:
public class MyClass {
private List<String> delimiterList = new ArrayList<>(){{
add(",");
add(";");
add("\t");
// etc...
}};
private static String determineDelimiter(String text) {
for (String delimiter : delimiterList) {
if(text.contains(delimiter)) {
return delimiter;
}
}
return "";
}
public static void main(String[] args) {
String csvFile = "/Users/csv/country.csv";
String line = "";
String cvsSplitBy = ",";
String delimiter = "";
boolean firstLine = true;
try (BufferedReader br = new BufferedReader(new FileReader(csvFile))) {
while ((line = br.readLine()) != null) {
if(firstLine) {
delimiter = determineDelimiter(line);
if(delimiter.equalsIgnoreCase("")) {
System.out.println("Unsupported delimiter found: " + delimiter);
return;
}
firstLine = false;
}
// use comma as separator
String[] country = line.split(delimiter);
System.out.println("Country [code= " + country[4] + " , name=" + country[5] + "]");
}
} catch (IOException e) {
e.printStackTrace();
}
}
}
Update
For a more optimized way, in determineDelimiter() method instead of the for-each loop, you can employ regular expressions.
If the delimiter can appear in a data column, then you are asking for the impossible. For example, consider this first line of a CSV file:
one,two:three
This could be either a comma-separated or a colon-separated file. You can't tell which type it is.
If you can guarantee that the first line has all its columns surrounded by quotes, for example if it's always this format:
"one","two","three"
then you may be able to use this logic (although it's not 100% bullet-proof):
if (line.contains("\",\""))
delimiter = ',';
else if (line.contains("\";\""))
delimiter = ';';
If you can't guarantee a restricted format like that, then it would be better to pass the delimiter character as a parameter.
Then you can read the file using a widely-known open-source CSV parser such as Apache Commons CSV.
While I agree with Lefteris008 that it is not possible to have the function that correctly determine all the cases, we can have a function that is both efficient and give mostly correct result in practice.
def head(filename: str, n: int):
try:
with open(filename) as f:
head_lines = [next(f).rstrip() for x in range(n)]
except StopIteration:
with open(filename) as f:
head_lines = f.read().splitlines()
return head_lines
def detect_delimiter(filename: str, n=2):
sample_lines = head(filename, n)
common_delimiters= [',',';','\t',' ','|',':']
for d in common_delimiters:
ref = sample_lines[0].count(d)
if ref > 0:
if all([ ref == sample_lines[i].count(d) for i in range(1,n)]):
return d
return ','
My efficient implementation is based on
Prior knowledge such as list of common delimiter you often work with ',;\t |:' , or even the likely hood of the delimiter to be used so that I often put the regular ',' on the top of the list
The frequency of the delimiter appear in each line of the text file are equal. This is to resolve the problem that if we read a single line and see the frequency to be equal (false detection as Lefteris008) or even the right delimiter to appear less frequent as the wrong one in the first line
The efficient implementation of a head function that read only first n lines from the file
As you increase the number of test sample n, the likely hood that you get a false answer reduce drastically. I often found n=2 to be adequate
Add a condition like this,
String [] country;
if(line.contains(",")
country = line.split(",");
else if(line.contains(";"))
country=line.split(";");
That depends....
If your datasets are always the same length and/or the separator NEVER occurs in your datacolumns, you could just read the first line of the file, look at it for the longed for separator, set it and then read the rest of the file using that separator.
Something like
String csvFile = "/Users/csv/country.csv";
String line = "";
String cvsSplitBy = ",";
try (BufferedReader br = new BufferedReader(new FileReader(csvFile))) {
while ((line = br.readLine()) != null) {
// use comma as separator
if (line.contains(",")) {
cvsSplitBy = ",";
} else if (line.contains(";")) {
cvsSplitBy = ";";
} else {
System.out.println("Wrong separator!");
}
String[] country = line.split(cvsSplitBy);
System.out.println("Country [code= " + country[4] + " , name=" + country[5] + "]");
}
} catch (IOException e) {
e.printStackTrace();
}
Greetz Kai
The Input text file has content as following :
TIMINCY........ many arbitrary characters incl. white spaces and tabs
DETAIL........ many arbitrary characters incl. white spaces and tabs
DETAIL........ many arbitrary characters incl. white spaces and tabs
DETAIL........ many arbitrary characters incl. white spaces and tabs
.
.
. (any number of lines containing DETAILS)
TIMINCY........ many arbitrary characters incl. white spaces and tabs
DETAIL........ many arbitrary characters incl. white spaces and tabs
DETAIL........ many arbitrary characters incl. white spaces and tabs
DETAIL........ many arbitrary characters incl. white spaces and tabs
.
.
.(so on)
Q: I need to validate the file using regex so that if the file's content is NOT
in accordance with respect to the pattern given above then I can throw CustomException.
Please let know if you could help. Any help is appreciated cordially.
String patternString = "TMINCY"+"[.]\\{*\\}"+";"+"["+"DETAILS"+"[.]\\{*\\}"+";"+"]"+"\\{*\\}"+"]"+"\\{*\\};";
Pattern pattern = Pattern.compile(patternString );
String messageString = null;
StringBuilder builder = new StringBuilder();
try (BufferedReader reader = Files.newBufferedReader(curracFile.toPath(), charset)) {
String line;
while ((line = reader.readLine()) != null) {
builder.append(line);
builder.append(NEWLINE_CHAR_SEQUENCE);
}
messageString = builder.toString();
} catch (IOException ex) {
LOGGER.error(FILE_CREATION_ERROR, ex.getCause());
throw new BusinessConversionException(FILE_CREATION_ERROR, ex);
}
System.out.println("messageString is::"+messageString);
return pattern.matcher(messageString).matches();
But it is Returning FALSE for correct file. Please help me with the regex.
What about something like "^(TIMINCY|DETAIL)[\.]+[a-zA-z\s.]+"
"^" - matches the start of the line
"(TIMINCY|DETAIL)" - matches TIMINCY or DETAIL
"[\.]" - matches the dot character to occur one or more times
"[a-zA-z\s.]+" - Here you put the allowed characters to occur one or more time
Reference: Oracle Documentation
You could try line by line when you're iterating over the lines
Pattern p = Pattern.compile("^(?:TIMINCY|DETAILS)[.]{8}.*");
//Explanation:
// ^ : Matches the begining of the string.
// (?:): non capturing group.
// [.]{8}: Matches a dot (".") eight times in a row.
// .*: Matches everything until the end of the string
// | : Regex OR operator
String line = reader.readLine()
Matcher m;
while (line != null) {
m = p.matcher(line);
if(!m.matches(line))
throw new CustomException("Not valid");
builder.append(line);
builder.append(NEWLINE_CHAR_SEQUENCE);
line = reader.readLine();
}
Also: Matcher.matches() returns true if the ENTIRE STRING matches your regular expression, i would recommend using Matcher.find() to find patterns you don't want.
Matcher (Java 7)
I have to write a program that will parse baseball player info and hits,out,walk,ect from a txt file. For example the txt file may look something like this:
Sam Slugger,h,h,o,s,w,w,h,w,o,o,o,h,s
Jill Jenks,o,o,s,h,h,o,o
Will Jones,o,o,w,h,o,o,o,o,w,o,o
I know how to parse the file and can get that code running perfect. The only problem I am having is that we should only be printing the name for each player and 3 or their plays. For example:
Sam Slugger hit,hit,out
Jill Jenks out, out, sacrifice fly
Will Jones out, out, walk
I am not sure how to limit this and every time I try to cut it off at 3 I always get the first person working fine but it breaks the loop and doesn't do anything for all the other players.
This is what I have so far:
import java.util.Scanner;
import java.io.*;
public class ReadBaseBall{
public static void main(String args[]) throws IOException{
int count=0;
String playerData;
Scanner fileScan, urlScan;
String fileName = "C:\\Users\\Crust\\Documents\\java\\TeamStats.txt";
fileScan = new Scanner(new File(fileName));
while(fileScan.hasNext()){
playerData = fileScan.nextLine();
fileScan.useDelimiter(",");
//System.out.println("Name: " + playerData);
urlScan = new Scanner(playerData);
urlScan.useDelimiter(",");
for(urlScan.hasNext(); count<4; count++)
System.out.print(" " + urlScan.next() + ",");
System.out.println();
}
}
}
This prints out:
Sam Slugger, h, h, o,
but then the other players are voided out. I need help to get the other ones printing as well.
Here, try this one using FileReader
Assuming your file content format is like this
Sam Slugger,h,h,o,s,w,w,h,w,o,o,o,h,s
Jill Johns,h,h,o,s,w,w,h,w,o,o,o,h,s
with each player in the his/her own line then this can work for you
BufferedReader reader;
try {
reader = new BufferedReader(new FileReader(new File("file.txt")));
String line = "";
while ((line = reader.readLine()) != null) {
String[] values_per_line = line.split(",");
System.out.println("Name:" + values_per_line[0] + " "
+ values_per_line[1] + " " + values_per_line[2] + " "
+ values_per_line[3]);
line = reader.readLine();
}
reader.close();
} catch (IOException e) {
e.printStackTrace();
}
otherwise if they are lined all in like one line which would not make sense then modify this sample.
Sam Slugger,h,h,o,s,w,w,h,w,o,o,o,h,s| John Slugger,h,h,o,s,w,w,h,w,o,o,o,h,s
BufferedReader reader;
try {
reader = new BufferedReader(new FileReader(new File("file.txt")));
String line = "";
while ((line = reader.readLine()) != null) {
// token identifier is a space
String[] data = line.trim().split("|");
for (int i = 0; i < data.length; i++)
System.out.println("Name:" + data[0].split(",")[0] + " "
+ data[1].split(",")[1] + " "
+ data[2].split(",")[2] + " "
+ data[3].split(",")[3]);
line = reader.readLine();
}
reader.close();
} catch (IOException e) {
e.printStackTrace();
}
You need to reset your count car in the while loop:
while(fileScan.hasNext()){
count = 0;
...
}
First Problem
Change while(fileScan.hasNext())) to while(fileScan.hasNextLine()). Not a breaking problem but when using scanner you usually put sc.* right after a sc.has*.
Second Problem
Remove the line fileScan.useDelimiter(","). This line doesn't do anything in this case but replaces the default delimiter so the scanner no longer splits on whitespace. Which doesn't matter when using Scanner.nextLine, but can have some nasty side effects later on.
Third Problem
Change this line for(urlScan.hasNext(); count<4; count++) to while(urlScan.hasNext()). Honestly I'm surprised that line even compiled and if it did it only read the first 4 from the scanner.
If you want to limit the amount processed for each line you can replace it with
for( int count = 0; count < limit && urlScan.hasNext( ); count++ )
This will limit the amount read to limit while still handling lines that have less data than the limit.
Make sure that each of your data sets is separated by a line otherwise the output might not make much sense.
You shouldn't have multiple scanners on this - assuming the format you posted in your question you can use regular expressions to do this.
This demonstrates a regular expression to match a player and to use as a delimiter for the scanner. I fed the scanner in my example a string, but the technique is the same regardless of source.
int count = 0;
Pattern playerPattern = Pattern.compile("\\w+\\s\\w+(?:,\\w){1,3}");
Scanner fileScan = new Scanner("Sam Slugger,h,h,o,s,w,w,h,w,o,o,o,h,s Jill Jenks,o,o,s,h,h,o,o Will Jones,o,o,w,h,o,o,o,o,w,o,o");
fileScan.useDelimiter("(?<=,\\w)\\s");
while (fileScan.hasNext()){
String player = fileScan.next();
Matcher m = playerPattern.matcher(player);
if (m.find()) {
player = m.group(0);
} else {
throw new InputMismatchException("Players data not in expected format on string: " + player);
}
System.out.println(player);
count++;
}
System.out.printf("%d players found.", count);
Output:
Sam Slugger,h,h,o
Jill Jenks,o,o,s
Will Jones,o,o,w
The call to Scanner.delimiter() sets the delimiter to use for retrieving tokens. The regex (?<=,\\w)\\s:
(?< // positive lookbehind
,\w // literal comma, word character
)
\s // whitespace character
Which delimits the players by the space between their entries without matching anything but that space, and fails to match the space between the names.
The regular expression used to extract up to 3 plays per player is \\w+\\s\\w+(?:,\\w){1,3}:
\w+ // matches one to unlimited word characters
(?: // begin non-capturing group
,\w // literal comma, word character
){1,3} // match non-capturing group 1 - 3 times
I'm trying to get the result of a match with two lines and more, this is my text in a file (for JOURNAL ENTRIES for Wincor ATM):
DEMANDE SOLDE
N° CARTE : 1500000001180006
OPERATION NO. : 585068
========================================
RETRAIT
N° CARTE 1600001002200006
OPERATION NO. : 585302
MONTANT : MAD 200.00
========================================
... etc.
Theare more lines repeated for each operation : retrait(ATMs), demande de solde (balance inquiry), which I want to get a resul like: RETRAIT\nN° CARTE 1600001002200006
My java code:
String filename="20140604.jrn";
File file=new File(filename);
String regexe = ".*RETRAIT^\r\n.*CARTE.*\\d{16}"; // Work with .*CARTE.*\\d{16}: result: N° CARTE : 1500000001180006 N° CARTE 1600001002200006
Pattern pattern = Pattern.compile(regexe,Pattern.MULTILINE);
try {
BufferedReader in = new BufferedReader(new FileReader(file));
while (in.ready()) {
String s = in.readLine();
Matcher matcher = pattern.matcher(s);
while (matcher.find()) { // find the next match
System.out.println("found the pattern \"" + matcher.group());
}
}
in.close();
}
catch(IOException e) {
System.out.println("File 20140604.jrn not found");
}
Any Solution Please ?
I am unable to test this right now, but it looks like you have the boundary special character '^' in the wrong spot. It is trying to match RETRAIT followed by the beginning of a line followed by newline characters, when the beginning of the line won't start until after the newline characters.
UPDATE:
With an online java regex tool, I've been able to test this:
^RETRAIT\s*\w+.*CARTE\s+\d{16}
which matches what you want in multiline mode. The \s special character consumes whitespace (including carriage return and new line), which is more resilient than checking explicitly for \n or \r.
strong textI have a bunch of lines in a textfile and I want to match this ${ALPANUMERIC characters} and replace it with ${SAME ALPHANUMERIC characters plus _SOMETEXT(CONSTANT)}.
I've tried this expression ${(.+)} but it didn't work and I also don't know how to do the replace regex in java.
thank you for your feedback
Here is some of my code :
BufferedReader br = new BufferedReader(new FileReader(file));
String line;
StringBuilder sb = new StringBuilder();
while ((line = br.readLine()) != null) {
Pattern p = Pattern.compile("\\$\\{.+\\}");
Matcher m = p.matcher(line); // get a matcher object
if(m.find()) {
System.out.println("MATCH: "+m.group());
//TODO
//REPLACE STRING
//THEN APPEND String Builder
}
}
OK this above works but it only founds my variable and not the whole line for ex here is my input :
some text before ${VARIABLE_NAME} some text after
some text before ${VARIABLE_NAME2} some text after
some text before some text without variable some text after
... etc
so I just want to replace the ${VARIABLE_NAME} or ${VARIABLE_NAME} with ${VARIABLE_NAME2_SOMETHING} but leave preceding and following text line as it is
EDIT:
I though I though of a way like this :
if(line.contains("\\${([a-zA-Z0-9 ]+)}")){
System.out.println(line);
}
if(line.contains("\\$\\{.+\\}")){
System.out.println(line);
}
My idea was to capture the line containing this, then replace , but the regex is not ok, it works with pattern/matcher combination though.
EDIT II
I feel like I'm getting closer to the solution here, here is what I've come up with so far :
if(line.contains("$")){
System.out.println(line.replaceAll("\\$\\{.+\\}", "$1" +"_SUFFIX"));
}
What I meant by $1 is the string you just matched replace it with itself + _SUFFIX
I would use the String.replaceAll() method like so:
`String old="some string data";
String new=old.replaceAll("$([a-zA-Z0-9]+)","(\1) CONSTANT"); `
The $ is a special regular expression character that represents the end of a line. You'll need to escape it in order to match it. You'll also need to escape the backslash that you use for escaping the dollar sign because of the way Java handles strings.
Once you have your text in a string, you should be able to do the following:
str.replaceAll("\\${([a-zA-Z0-9 ]+)}", "\\${$1 _SOMETEXT(CONSTANT)}")
If you have other characters in your variable names (i.e. underscores, symbols, etc...) then just add them to the character class that you are matching for.
Edit: If you want to use a Pattern and Matcher then there are still a few changes. First, you probably want to compile your Pattern outside of the loop. Second, you can use this, although it is more verbose.
Pattern p = Pattern.compile("\\$\\{.+\\}");
Matcher m = p.matcher(line);
sb.append(m.replaceAll("\\${$1 _SOMETEXT(CONSTANT)}"));
THE SOLUTION :
while ((line = br.readLine()) != null) {
if(line.contains("$")){
sb.append(line.replaceAll("\\$\\{(.+)\\}", "\\${$1" +"_SUFFIX}") + "\n");
}else{
sb.append(line + "\n");
}
}
line = line.replaceAll("\\$\\{\\w+", "$0_SOMETHING");
There's no need to check for the presence of $ or whatever; that's part of what replaceAll() does. Anyway, contains() is not regex-powered like find(); it just does a plain literal text search.