How to read a text file and split each line on pipes in Java

I have a txt file with the following form:
1 | Argentina |Y|POSTAL_C |CAPITAL|STATES
I would like to split each line on the "|" character so that every field ends up at its own position in an array, for example:
0 | 1 |2|3 |4 |5
1 | Argentina |Y|POSTAL_C |CAPITAL|STATES
and then work with the values inside the array.
My code:
public void ReadFile2() throws IOException {
String referencePath = "C:\\Users\\Admin\\Desktop\\PRUEBA.txt";
BufferedReader br;
br = new BufferedReader(new FileReader(referencePath));
String lines = br.readLine();
lines.split("\n", 6);
this.logger.info("Starting...");
while (lines != null) {
if (!lines.isEmpty() && lines.length() >7) {
String[] values = lines.split("|");
System.out.println(values[0]); // should return 1
}
lines = br.readLine();
}
br.close();
}
I would be grateful for your comments.

The string you pass to the split method is treated as a regex, so you need to use lines.split("\\|").
In a regex the pipe sign has a special meaning (it works like an OR operator), so you need to escape it with a backslash. In a Java string literal the backslash also has a special meaning, because it escapes the following character, which is why two backslashes are needed here. The resulting string is in fact just \|.
Try this:
String test = "1 | Argentina |Y|POSTAL_C |CAPITAL|STATES";
String[] values = test.split("\\|");
for (String value : values) {
System.out.println(value);
}
This returns output:
1
Argentina
Y
POSTAL_C
CAPITAL
STATES
You will probably also want to call .trim() on each value to remove the surrounding whitespace.
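A minimal sketch of how the pieces might fit together when reading the file line by line. The class name PipeFileReader and its helpers splitFields and readRows are hypothetical, and the sample line is the one from the question. Folding the surrounding whitespace into the delimiter regex is one alternative to calling .trim() on every value.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of the original question or answer.
class PipeFileReader {

    // Splits one line on a literal pipe. The optional whitespace around each
    // pipe is part of the delimiter, so the fields come back already trimmed.
    // The limit -1 keeps trailing empty fields instead of dropping them.
    static String[] splitFields(String line) {
        return line.trim().split("\\s*\\|\\s*", -1);
    }

    // Reads every non-blank line and splits it into fields.
    static List<String[]> readRows(BufferedReader reader) throws IOException {
        List<String[]> rows = new ArrayList<>();
        String line;
        while ((line = reader.readLine()) != null) {
            if (!line.trim().isEmpty()) {
                rows.add(splitFields(line));
            }
        }
        return rows;
    }

    public static void main(String[] args) throws IOException {
        // A StringReader stands in for the real file so the sketch runs anywhere;
        // swap in new FileReader(path) to read from disk.
        String sample = "1 | Argentina |Y|POSTAL_C |CAPITAL|STATES\n";
        try (BufferedReader reader = new BufferedReader(new StringReader(sample))) {
            for (String[] row : readRows(reader)) {
                System.out.println(row[0] + " / " + row[1]);
            }
        }
    }
}
```

Passing -1 as the limit is a design choice: without it, split silently drops trailing empty fields, which would shift positions for lines that end in a pipe.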

Related

How to split records on whitespace when different lines have spaces at different positions

I have a file with records as below and I am trying to split each record on whitespace and replace the whitespace with commas.
file:
a 3w 12 98 header P6124
e 4t 2 100 header I803
c 12L 11 437 M12
BufferedReader reader = new BufferedReader(new FileReader("/myfile.txt"));
String line = reader.readLine();
while (line != null) {
System.out.println(line);
String[] splitLine = line.split("\\s+");
line = reader.readLine();
}
If the data is separated by multiple whitespace characters, I usually go for a regex split -> split("\\s+") or split(" +").
But in the above case, record c has no value in the header column. The regex "\\s+" or " +" treats that gap like any other separator, so I get c,12L,11,437,M12 instead of c,12L,11,437,,M12.
How do I properly split the lines based on any delimiter in this case so that I get data in the below format:
a,3w,12,98,header,P6124
e,4t,2,100,header,I803
c,12L,11,437,,M12
Could anyone let me know how I can achieve this?
Maybe you can try a more involved approach: use a regex that matches exactly six fields on each line and explicitly handles a missing value for the fifth one.
I rewrote your example adding some console log in order to clarify my suggestion:
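import java.io.BufferedReader;
import java.io.StringReader;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
public class RegexTest {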
private static final String Input = "a 3w 12 98 header P6124\n" +
"e 4t 2 100 header I803\n" +
"c 12L 11 437 M12";
public static void main(String[] args) throws Exception {
BufferedReader reader = new BufferedReader(new StringReader(Input));
String line = null;
Pattern pattern = Pattern.compile("^([^ ]+) +([^ ]+) +([^ ]+) +([^ ]+) +([^ ]+)? +([^ ]+)$");
do {
line = reader.readLine();
System.out.println(line);
if(line != null) {
String[] splitLine = line.split("\\s+");
System.out.println(splitLine.length);
System.out.println("Line: " + line);
Matcher matcher = pattern.matcher(line);
boolean matched = matcher.matches();
System.out.println("matches: " + matched);
System.out.println("groups: " + matcher.groupCount());
// group(i) throws IllegalStateException unless the match succeeded
for(int i = 1; matched && i <= matcher.groupCount(); i++) {
System.out.printf(" Group %d has value '%s'\n", i, matcher.group(i));
}
}
} while (line != null);
}
}
The key is that the pattern used to match each line requires a sequence of six fields:
for each field, the value is described as [^ ]+
separators between fields are described as " +" (one or more spaces)
the fifth (nullable) field is made optional by a ? after its group: ([^ ]+)?
each value is captured as a group using parentheses: ( ... )
start (^) and end ($) of each line are marked explicitly
Then, each line is matched against the given pattern, obtaining six groups: you can access each group using matcher.group(index), where index is 1-based because group(0) returns the full match.
This is a more complex approach but I think it can help you to solve your problem.
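As a rough sketch of the last step, the six groups could be joined with commas, treating a group that did not take part in the match (group(5) returns null for the short record) as an empty field. The class SixFieldCsv and the method toCsv are hypothetical names, and the sample assumes the missing column shows up as at least two spaces between its neighbours.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper built around the pattern from the answer.
class SixFieldCsv {

    private static final Pattern SIX_FIELDS = Pattern.compile(
            "^([^ ]+) +([^ ]+) +([^ ]+) +([^ ]+) +([^ ]+)? +([^ ]+)$");

    // Converts one space-separated record to a comma-separated one.
    static String toCsv(String line) {
        Matcher matcher = SIX_FIELDS.matcher(line);
        if (!matcher.matches()) {
            throw new IllegalArgumentException("Unexpected record: " + line);
        }
        StringBuilder csv = new StringBuilder();
        for (int i = 1; i <= matcher.groupCount(); i++) {
            if (i > 1) {
                csv.append(',');
            }
            String value = matcher.group(i);
            // An optional group that matched nothing is reported as null.
            if (value != null) {
                csv.append(value);
            }
        }
        return csv.toString();
    }

    public static void main(String[] args) {
        System.out.println(toCsv("a 3w 12 98 header P6124"));
        // Two spaces between 437 and M12 stand in for the missing column.
        System.out.println(toCsv("c 12L 11 437 " + " M12"));
    }
}
```

A record with only single spaces and five values does not match the pattern at all, which is why the sketch rejects it instead of guessing which column is missing.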
Put a limit on the number of whitespace chars that may be used to split the input.
In the case of your example data, a maximum of 5 works:
String[] splitLine = line.split("\\s{1,5}");
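To illustrate why the limit helps, here is a small sketch. It assumes, since the exact spacing of the original file is not shown, that ordinary columns are separated by one to five spaces and that a missing column leaves a wider gap (eight spaces in the sample). The class and method names are made up for the example, and String.repeat needs Java 11 or newer.

```java
// Hypothetical demonstration of splitting with a bounded whitespace run.
class BoundedWhitespaceSplit {

    // Each separator consumes at most five whitespace characters, so a wider
    // gap is split twice and leaves an empty field in between.
    static String toCsv(String line) {
        return String.join(",", line.split("\\s{1,5}", -1));
    }

    public static void main(String[] args) {
        String wideGap = " ".repeat(8); // assumed width of the missing column
        System.out.println(toCsv("a 3w 12 98 header P6124"));
        System.out.println(toCsv("c 12L 11 437" + wideGap + "M12"));
    }
}
```

With this limit a gap of 6 to 10 spaces yields exactly one empty field; a wider gap would yield more than one, so the bound has to be chosen to fit the actual file.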
Are you just trying to switch your delimiters from spaces to commas?
In that case:
cat myFile.txt | sed 's/   */  /g' | sed 's/ /,/g'
*edit: added a stage that collapses runs of two or more spaces into exactly two spaces, which is what is needed to retain the double comma.

How to use regex with String.split()

I have the following String:
String fullPDFContex = "Title1 Title2\r\nTitle3 Title4\r\n\r\nTitle5 Title6\r\n \r\n Title7 \r\n\r\n\r\n\r\n\r\n";
I want to convert it to an array of String which will look like this.
String[] Title = {"Title1 Title2","Title3 Title4","Title5 Title6","Title7"}
I am trying the following code.
String[] Title=fullPDFContex.split("\r\n\r\n|\r\n \r\n|\r\n");
But I am not getting the desired output.
You need to split with a pattern that matches any amount of whitespace that contains a line break:
String fullPDFContex = "Title1 Title2\r\nTitle3 Title4\r\n\r\nTitle5 Title6\r\n \r\n Title7 \r\n\r\n\r\n\r\n\r\n";
String separator = "\\p{javaWhitespace}*\\R\\p{javaWhitespace}*";
String results[] = fullPDFContex.split(separator);
System.out.println(Arrays.toString(results));
// => [Title1 Title2, Title3 Title4, Title5 Title6, Title7]
The \\p{javaWhitespace}*\\R\\p{javaWhitespace}* matches
\\p{javaWhitespace}* - 0+ whitespaces
\\R - a line break (you may replace it with [\r\n] for Java 7 and older)
\\p{javaWhitespace}* - 0+ whitespaces.
Alternatively, you may use a slightly more efficient pattern:
String separator = "[\\s&&[^\r\n]]*\\R\\s*";
Unfortunately, the \R construct cannot be used inside character classes. The pattern will match:
[\\s&&[^\r\n]]* - zero or more whitespace chars other than CR and LF (character class subtraction is used here)
\\R - a line break
\\s* - any 0+ whitespace chars.
Here is a solution using StringTokenizer. I collect the split values in a list, which helps when the number of values is not known in advance.
package com.sujit;
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;
public class UserInput {
public static void main(String[] args) {
String fullPDFContex = "Title1 Title2\r\nTitle3 Title4\r\n\r\nTitle5 Title6\r\n \r\n Title7 \r\n\r\n\r\n\r\n\r\n";
StringTokenizer token = new StringTokenizer(fullPDFContex, "\r\n");
List<String> list = new ArrayList<>();
while (token.hasMoreTokens()) {
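// Trim each token and skip blank ones so " Title7 " becomes "Title7"
String title = token.nextToken().trim();
if (!title.isEmpty()) {
list.add(title);
}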
}
for (String string : list) {
System.out.println(string);
}
}
}
With this code you get the output you want:
String[] Title = fullPDFContex.split(" *(\r\n ?)+ *");

parse csv, do not split within single OR double quotes

I try to parse a csv with java and have the following issue: The second column is a String (which may also contain comma) enclosed in double-quotes, except if the string itself contains a double quote, then the entire string is enclosed with a single quote. e.g.
Lines may look like this:
someStuff,"hello", someStuff
someStuff,"hello, SO", someStuff
someStuff,'say "hello, world"', someStuff
someStuff,'say "hello, world', someStuff
someStuff are placeholders for other elements, which can also include quotes in the same style
I'm looking for a generic way to split the lines at commas UNLESS enclosed in single OR double quotes in order to get the second column as a String. With second column I mean the fields:
hello
hello, SO
say "hello, world"
say "hello, world
I tried OpenCSV but it fails, as one can only specify one type of quote:
public class CSVDemo {
public static void main(String[] args) throws IOException {
CSVDemo demo = new CSVDemo();
demo.process("input.csv");
}
public void process(String fileName) throws IOException {
String file = this.getClass().getClassLoader().getResource(fileName)
.getFile();
CSVReader reader = new CSVReader(new FileReader(file));
String[] nextLine;
while ((nextLine = reader.readNext()) != null) {
System.out.println(nextLine[0] + " | " + nextLine[1] + " | "
+ nextLine[2]);
}
}
}
The solution with opencsv fails on the last line where there is only one double quote enclosed in single quotes:
someStuff | hello | someStuff
someStuff | hello, SO | someStuff
someStuff | 'say "hello, world"' | someStuff
Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: 1
If you truly cannot use a real CSV parser you could use a regex. This is generally not a good idea as there are always edge cases that you cannot handle but if the formatting is strictly as you describe then this may work.
public void test() {
String[] tests = {"numeStuff,\"hello\", someStuff, someStuff",
"numeStuff,\"hello, SO\", someStuff, someStuff",
"numeStuff,'say \"hello, world\"', someStuff, someStuff"
};
/* Matches a field and a potentially empty separator.
*
* ( - Field Group
* \" - Start with a quote
* [^\"]*? - Non-greedy match on anything that is not a quote
* \" - End with a quote
* | - Or
* ' - Start with a strop
* [^']*? - Non-greedy match on anything that is not a strop
* ' - End with a strop
* | - Or
* [^\"'] - Not starting with a quote or strop
* [^,\r\n]*? - Non-greedy match on anything that is not a comma or line break
* ) - End field group
* ( - Separator group
* [,\r\n]|$ - Comma, line break or end of input
* ) - End separator group
*/
Pattern p = Pattern.compile("(\"[^\"]*?\"|'[^\']*?\'|[^\"'][^,\r\n]*?)([,\r\n]|$)");
for (String t : tests) {
System.out.println("Matching: " + t);
Matcher m = p.matcher(t);
while (m.find()) {
System.out.println(m.group(1));
}
}
}
It does not appear that opencsv supports this out of the box. You could extend com.opencsv.CSVParser and implement your own algorithm for handling two types of quotes. This is the source of the method you would be changing and here is a stub to get you started.
class MyCSVParser extends CSVParser{
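// Assumes an opencsv version in which parseLine(String, boolean) is protected; a private method cannot be overridden.
@Override
protected String[] parseLine(String nextLine, boolean multi) throws IOException{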
//Your algorithm here
}
}
Basically you only need to track ," and ,' (trimming what's in the middle).
When you encounter one of those, set the appropriate flag (eg. singleQuoteOpen, doubleQuoteOpen) to true to indicate they're open and you are in ignore-commas mode.
When you meet the appropriate closing quote, reset the flag and keep slicing the elements.
To perform the check, stop at every comma (when not in ignore-commas mode) and look at the next char (if any, and trimming).
Note: the regex solution is good and also shorter, but less customizable for edge cases (at least without big headaches).
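The flag-based approach described above could be sketched roughly as follows. This is an illustration rather than opencsv code: the class QuoteAwareSplitter and its split method are hypothetical, a quote counts as an opening quote only when it is the first non-whitespace character of a field, a single char remembers which quote type is open instead of two boolean flags, and surrounding whitespace is trimmed from every field.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical splitter that ignores commas inside '...' or "..." fields.
class QuoteAwareSplitter {

    static List<String> split(String line) {
        List<String> fields = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        char openQuote = 0;        // 0 means "not inside a quoted field"
        boolean fieldStart = true; // only whitespace seen so far in this field

        for (int i = 0; i < line.length(); i++) {
            char c = line.charAt(i);
            if (openQuote != 0) {
                if (c == openQuote) {
                    openQuote = 0;         // matching quote closes the field
                } else {
                    current.append(c);     // commas and the other quote type are data
                }
            } else if (c == ',') {
                fields.add(current.toString().trim());
                current.setLength(0);
                fieldStart = true;
            } else if (fieldStart && (c == '"' || c == '\'')) {
                openQuote = c;             // remember which quote type opened the field
                fieldStart = false;
            } else {
                if (!Character.isWhitespace(c)) {
                    fieldStart = false;
                }
                current.append(c);
            }
        }
        fields.add(current.toString().trim());
        return fields;
    }

    public static void main(String[] args) {
        System.out.println(split("someStuff,\"hello, SO\", someStuff").get(1));
        System.out.println(split("someStuff,'say \"hello, world', someStuff").get(1));
    }
}
```

Because the state lives in plain variables, edge cases such as escaped quotes could be added as extra branches, which is the customizability the note above refers to.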
If the use of single and double quotes is consistent per line, one could choose the corresponding type of quote per line:
public class CSVDemo {
public static void main(String[] args) throws IOException {
CSVDemo demo = new CSVDemo();
demo.process("input.csv");
}
public void process(String fileName) throws IOException {
String file = this.getClass().getClassLoader().getResource(fileName)
.getFile();
CSVParser doubleParser = new CSVParser(',', '"');
CSVParser singleParser = new CSVParser(',', '\'');
String[] nextLine;
try (BufferedReader br = new BufferedReader(new FileReader(file))) {
String line;
while ((line = br.readLine()) != null) {
if (line.contains(",'") && line.contains("',")) {
nextLine = singleParser.parseLine(line);
} else {
nextLine = doubleParser.parseLine(line);
}
System.out.println(nextLine[0] + " | " + nextLine[1] + " | "
+ nextLine[2]);
}
}
}
}
It doesn't seem that opencsv supports this. However, have a look at this previous question and my answer, as well as the other answers, in case they help you: https://stackoverflow.com/a/15905916/1688441
Below is an example; please note that notInsideComma actually means "not inside quotes". The code could be extended to check for both single and double quotes.
public static ArrayList<String> customSplitSpecific(String s)
{
ArrayList<String> words = new ArrayList<String>();
boolean notInsideComma = true;
int start =0, end=0;
for(int i=0; i<s.length(); i++)
{
if(s.charAt(i)==',' && notInsideComma)
{
words.add(s.substring(start,i));
start = i+1;
}
else if(s.charAt(i)=='"')
notInsideComma=!notInsideComma;
}
words.add(s.substring(start));
return words;
}

How to replace an odd number of single quotes with two single quotes

I want to add an escape character "'" (single quote) to a string in Java using a regular expression, but only where a single quote occurs an odd number of times.
For example:
if string is like "string's property" then output should be "string''s property"
if string is like "string''s property" then output should be "string''s property"
Try this :
\'(\')?
Demo (replacing with ')
http://regexr.com?38eeh
Try this code (even count).
public static void main(String[] args) {
String str = "a''''''b";
str = str.replaceAll("[^']'('')*[^']", "###");
System.out.println(str);
}
Then try this one (odd count).
public static void main(String[] args) {
String str = "a'''''''b";
str = str.replaceAll("[^']'('')*[^']", "###");
System.out.println(str);
}
Try this:
// input that will be replaced
String replace = "string's property";
// input that won't be replaced
String noReplace = "string''s property";
// String representation of the Pattern for both inputs
// |no single quote before...
// | |single quote
// | | |... no single quote after
String pattern = "(?<!')'(?!')";
// Will replace found text with main group twice --> found
System.out.println(replace.replaceAll(pattern, "$0$0"));
// Will replace found text with main group twice --> not found, no replacement
System.out.println(noReplace.replaceAll(pattern, "$0$0"));
Output:
string''s property
string''s property
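The lookaround pattern above doubles only an isolated quote. If longer odd runs such as ''' should also be padded to an even length, which is an assumption about the intended behavior rather than something the question states, one possible sketch is to match each run of quotes and append one quote when its length is odd. The names OddQuoteEscaper and escapeOddRuns are hypothetical, and Matcher.replaceAll with a function argument requires Java 9 or newer.

```java
import java.util.regex.Pattern;

// Hypothetical helper that pads every odd-length run of single quotes.
class OddQuoteEscaper {

    private static final Pattern QUOTE_RUN = Pattern.compile("'+");

    static String escapeOddRuns(String input) {
        // For each run of quotes, add one more quote only if the run length is odd.
        return QUOTE_RUN.matcher(input).replaceAll(run ->
                run.group().length() % 2 == 1 ? run.group() + "'" : run.group());
    }

    public static void main(String[] args) {
        System.out.println(escapeOddRuns("string's property"));
        System.out.println(escapeOddRuns("string''s property"));
        System.out.println(escapeOddRuns("a'''b"));
    }
}
```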

Regular Expression Statement

I've never been good with regex and I can't seem to get this...
I am trying to match statements along these lines (these are two lines in a text file I'm reading)
Lname Fname 12.35 1
Jones Bananaman 7.1 3
Currently I am using this for a while statement
reader.hasNext("\\w+ \\w+ \\d*\\.\\d{1,2} [0-5]")
But it doesn't enter the while statement.
The program reads the text file just fine when I remove the while.
The code segment is this:
private void initializeFileData(){
try {
Scanner reader = new Scanner(openedPath);
while(reader.hasNext("\\w+ \\w+ \\d*\\.\\d{1,2} [0-5]")){
employeeInfo.add(new EmployeeFile(reader.next(), reader.next(), reader.nextDouble(), reader.nextInt(), new employeeRemove()));
}
for(EmployeeFile element: employeeInfo){
output.add(element);
}
} catch (FileNotFoundException e) {
e.printStackTrace();
}
}
Use the \s character class for the spaces between words:
while(reader.hasNext("\\w+\\s\\w+\\s\\d*\\.\\d{1,2}\\s[0-5]"))
Update:
According to the javadoc for the Scanner class, by default it splits its tokens using whitespace. You can change the delimiter it uses with the useDelimiter(String pattern) method of Scanner.
private void initializeFileData(){
try {
Scanner reader = new Scanner(openedPath).useDelimiter("\\n");
...
while(reader.hasNext("\\w+\\s\\w+\\s\\d*\\.\\d{1,2}\\s[0-5]")){
...
http://docs.oracle.com/javase/1.5.0/docs/api/java/util/Scanner.html
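A small sketch of why the delimiter matters: hasNext(pattern) tests only the next token, so with the default whitespace delimiter the token is just "Lname" and a pattern describing the whole line cannot match. The class and method names are invented for the demonstration, the input is a string instead of a file, and the \R line-break delimiter assumes Java 8 or newer.

```java
import java.util.Scanner;

// Hypothetical demonstration of how the Scanner delimiter affects hasNext(pattern).
class ScannerTokenDemo {

    static final String LINE_PATTERN = "\\w+\\s\\w+\\s\\d*\\.\\d{1,2}\\s[0-5]";

    // Default delimiter: tokens are single words, so a whole-line pattern never matches.
    static boolean matchesWithDefaultDelimiter(String text) {
        try (Scanner scanner = new Scanner(text)) {
            return scanner.hasNext(LINE_PATTERN);
        }
    }

    // Line-break delimiter: each token is a full line, so the pattern can match it.
    static boolean matchesWithLineDelimiter(String text) {
        try (Scanner scanner = new Scanner(text).useDelimiter("\\R")) {
            return scanner.hasNext(LINE_PATTERN);
        }
    }

    public static void main(String[] args) {
        String text = "Lname Fname 12.35 1\nJones Bananaman 7.1 3";
        System.out.println("default delimiter: " + matchesWithDefaultDelimiter(text));
        System.out.println("line delimiter:    " + matchesWithLineDelimiter(text));
    }
}
```

Note that once the delimiter is a line break, next() returns the whole line as well, so the individual values still have to be extracted from it afterwards.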
From what I can see (And correct me if I'm wrong, because regex always seems to trick my brain :p), you're not handling the spaces correctly. You need to use \s, not just the standard ' ' character
EDIT: Sorry, \s. Someone else beat me to it :p
Actually
\w+
is going to catch [Lname, Fname, 12, 35, 1] for Lname Fname 12.35 1. So you can just store reader.nextLine() and then extract all regex matches from there. You can then abstract it a bit, for instance like this:
class EmployeeFile {
.....
public EmployeeFile(String firstName, String lastName,
Double firstDouble, int firstInt,
EmployeeRemove er){
.....
}
public EmployeeFile(String line) {
//TODO : extract all the required info from the string array
// instead of doing it while reading at the same time.
// Keep input parsing separate from input reading.
// Turn this into a string array using the regex pattern
// mentioned above
}
}
I created my own version, without files and the last loop, that goes like that:
private static void initializeFileData() {
String[] testStrings = {"Lname Fname 12.35 1", "Jones Bananaman 7.1 3"};
Pattern myPattern = Pattern.compile("(\\w+)\\s+(\\w+)\\s+(\\d*\\.\\d{1,2})\\s+([0-5])");
for (String s : testStrings) {
Matcher myMatcher = myPattern.matcher(s);
// matches() must run before group(); groupCount() is 4 even when nothing matched
if (myMatcher.matches()) {
String lastName = myMatcher.group(1);
String firstName = myMatcher.group(2);
double firstValue = Double.parseDouble(myMatcher.group(3) );
int secondValue = Integer.parseInt(myMatcher.group(4));
//employeeInfo.add(new EmployeeFile(lastName, firstName, firstValue, secondValue, new employeeRemove()));
}
}
}
Notice that I kept the backslash before the dot (you want a literal dot, not any character), replaced the single spaces with \\s+, and added the parentheses in order to create the groups.
I hope it helps.
