Java extract multiline values from a file

I'm reading a file line by line, and some keys have values that span multiple lines, as shown below. Because of this, my loop breaks and returns unexpected results.
TSNK/Metadata/tk.filename=PZSIIF-anefnsadual-rasdfepdasdort.pdf
TSNK/Metadata/tk_ISIN=LU0291600822,LU0871812862,LU0327774492,LU0291601986,LU0291605201
,LU0291595725,LU0291599800,LU0726995649,LU0726996290,LU0726995995,LU0726995136,LU0726995482,LU0726995219,LU0855227368
TSNK/Metadata/tk_GroupCode=PZSIIF
TSNK/Metadata/tk_GroupCode/PZSIIF=y
TSNK/Metadata/tk_oneTISNumber=16244,17007,16243,11520,19298,18247,20755
TSNK/Metadata/tk_oneTISNumber_TEXT=Neo Emerging Market Corporate Debt
Neo Emerging Market Debt Opportunities II
Neo Emerging Market Investment Grade Debt
Neo Floating Rate II
Neo Upper Tier Floating Rate
Global Balanced Regulation 28
Neo Multi-Sector Credit Income
Here TSNK/Metadata/tk_ISIN and TSNK/Metadata/tk_oneTISNumber_TEXT have multiline values. While reading the file line by line, how do I read each of these fields as a single line?
I have tried the logic below, but it did not produce the expected result:
String sCurrentLine;
int i = 1;
CharSequence OneTIS = "TSNK/Metadata/tk_oneTISNumber_TEXT";
StringBuilder builder = new StringBuilder();
// try-with-resources closes the reader even if reading fails
try (BufferedReader br = new BufferedReader(new FileReader(FILENAME))) {
    while ((sCurrentLine = br.readLine()) != null) {
        if (sCurrentLine.contains(OneTIS)) {
            System.out.println("Line number here -> " + i);
            builder.append(sCurrentLine);
            builder.append(",");
        } else {
            System.out.println("else --->");
        }
        //System.out.println("Line number"+i+" Value is---->>>> "+sCurrentLine);
        i++;
    }
    System.out.println("Line number" + i + " Value is---->>>> " + builder);
} catch (IOException e) {
    e.printStackTrace();
}

The solution involves Scanner and multiline regular expressions.
This assumes that every key in your file starts with TSNK/Metadata/.
Scanner scanner = new Scanner(new File("file.txt"));
// Each token is one "key=value" entry, including any continuation lines
scanner.useDelimiter("TSNK/Metadata/");
// DOTALL lets '.' match line breaks, so a value can span several lines
Pattern p = Pattern.compile("(.*)=(.*)", Pattern.DOTALL | Pattern.MULTILINE);
// Loop until the scanner runs out of entries
while (scanner.hasNext()) {
    String s = scanner.next();
    Matcher matcher = p.matcher(s);
    if (matcher.find()) {
        System.out.println("key = '" + matcher.group(1) + "'");
        String[] values = matcher.group(2).split("[,\n]");
        int i = 0; // zero-based index for the printed values
        for (String value : values) {
            System.out.println(String.format(" val(%d)='%s',", (i++), value));
        }
    }
}
scanner.close();
The above produces the following output:
key = 'tk.filename'
val(0)='PZSIIF-anefnsadual-rasdfepdasdort.pdf',
key = 'tk_ISIN'
val(0)='LU0291600822',
val(1)='LU0871812862',
val(2)='LU0327774492',
val(3)='LU0291601986',
val(4)='LU0291605201',
val(5)='',
val(6)='LU0291595725',
val(7)='LU0291599800',
val(8)='LU0726995649',
val(9)='LU0726996290',
val(10)='LU0726995995',
val(11)='LU0726995136',
val(12)='LU0726995482',
val(13)='LU0726995219',
val(14)='LU0855227368',
key = 'tk_GroupCode'
val(0)='PZSIIF',
key = 'tk_GroupCode/PZSIIF'
val(0)='y',
key = 'tk_oneTISNumber'
val(0)='16244',
val(1)='17007',
val(2)='16243',
val(3)='11520',
val(4)='19298',
val(5)='18247',
val(6)='20755',
key = 'tk_oneTISNumber_TEXT'
val(0)='Neo Emerging Market Corporate Debt ',
val(1)='Neo Emerging Market Debt Opportunities II ',
val(2)='Neo Emerging Market Investment Grade Debt ',
val(3)='Neo Floating Rate II ',
val(4)='Neo Upper Tier Floating Rate ',
val(5)='Global Balanced Regulation 28 ',
val(6)='Neo Multi-Sector Credit Income',
Please note the empty entry (val(5) for key tk_ISIN). It appears because a line break is followed by a comma in that value. This is easy to fix, either by rejecting empty strings or by adjusting the splitting pattern.
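As a rough illustration of both fixes together, here is a minimal sketch. The class name ValueSplitter and the method splitValues are made up for this example; the idea is simply to treat a run of commas and line breaks as one separator and to skip anything that is blank after trimming.

```java
import java.util.ArrayList;
import java.util.List;

class ValueSplitter {
    // Hypothetical helper: splits a raw multiline value on commas and line breaks.
    // A run of separators (such as "\n,") counts as one, and blank items are dropped.
    static List<String> splitValues(String raw) {
        List<String> values = new ArrayList<>();
        for (String part : raw.split("[,\\r\\n]+")) {
            String trimmed = part.trim();
            if (!trimmed.isEmpty()) {
                values.add(trimmed);
            }
        }
        return values;
    }

    public static void main(String[] args) {
        String raw = "LU0291600822,LU0291605201\n,LU0291595725,LU0855227368\n";
        // prints [LU0291600822, LU0291605201, LU0291595725, LU0855227368]
        System.out.println(splitValues(raw));
    }
}
```

Trimming also removes a stray carriage return or trailing space, which may matter if the file uses Windows line endings.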
Hope this helps!

Related

Splitting input string when it contains countries with multiple words

I get multiple countries as input, which I have to split by spaces. If a country name has multiple words, it is enclosed in double quotes. For example:
Chad Benin Angola Algeria Finland Romania "Democratic Republic of the Congo" Bolivia Uzbekistan Lesotho "United States of America"
At the moment I'm able to split the input word by word, so United States of America doesn't stay together as one country.
BufferedReader reader = new BufferedReader(
new InputStreamReader(System.in));
// Reading data using readLine
String str = reader.readLine();
ArrayList<String> sets = new ArrayList<String>();
String[] newStr = str.split("[\\W]");
boolean check = false;
for (String s : newStr) {
sets.add(s);
}
System.out.print(sets);
How can I split these countries so that the multiword countries don't get split?

Instead of matching what to split on, match the country names. You need to match either a run of letters, or letters and spaces between quotes: [a-zA-Z]+ matches one or more letters, and the alternative after the | matches letters and spaces between quotes, "[a-zA-Z\s]+".
String input = "Chad Benin Angola Algeria Finland Romania \"Democratic Republic of the Congo\" Bolivia Uzbekistan Lesotho \"United States of America\"";
Pattern pattern = Pattern.compile("[a-zA-Z]+|\"[a-zA-Z\\s]+\"");
Matcher matcher = pattern.matcher(input);
while (matcher.find()) {
String result = matcher.group();
if (result.startsWith("\"")) {
//quotes are matched, so remove them
result = result.substring(1, result.length() - 1);
}
System.out.println(result);
}

Hm, maybe I am not intelligent enough, but I do not see a one-line solution. I can think of the following, though:
public static void main(String[] args) {
String inputString = "Chad Benin Angola Algeria Finland Romania \"Democratic Republic of the Congo\" Bolivia Uzbekistan Lesotho \"United States of America\"\n";
List<String> resultCountriesList = new ArrayList<>();
int currentIndex = 0;
boolean processingMultiWordsCountry = false;
for (int i = 0; i < inputString.length(); i++) {
Optional<String> substringAsOptional = extractNextSubstring(inputString, currentIndex);
if (substringAsOptional.isPresent()) {
String substring = substringAsOptional.get();
currentIndex += substring.length() + 1;
if (processingMultiWordsCountry) {
resultCountriesList.add(substring);
} else {
resultCountriesList.addAll(Arrays.stream(substring.split(" ")).map(String::trim).filter(s -> !s.isEmpty()).collect(Collectors.toList()));
}
processingMultiWordsCountry = !processingMultiWordsCountry;
}
}
System.out.println(resultCountriesList);
}
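private static Optional<String> extractNextSubstring(String inputString, int currentIndex) {
if (inputString.length() > currentIndex + 1) {
int nextQuote = inputString.indexOf("\"", currentIndex + 1);
// No further quote: take the rest of the string instead of failing on index -1
int end = nextQuote >= 0 ? nextQuote : inputString.length();
return Optional.of(inputString.substring(currentIndex, end));
}
return Optional.empty();
}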
The resulting list of countries, as strings, is stored in resultCountriesList. The code iterates over the string and takes a substring of the original string (inputString) from the end of the previous substring (currentIndex) to the next occurrence of the \" symbol. If a substring is present, we continue processing. The boolean flag processingMultiWordsCountry separates countries enclosed in \" symbols from countries outside them.
For now I cannot find anything better. I do not think this code is ideal and there is a lot of room for improvement, so if you have any suggestions, feel free to add a comment. Hope it helped, have a nice day!

Similar approach as in the accepted answer but with a shorter regex and without matching and replacing the double quotes (which is quite an expensive procedure, in my opinion):
String in = "Chad Benin Angola Algeria Finland Romania \"Democratic Republic of the Congo\" Bolivia Uzbekistan Lesotho \"United States of America\"";
Pattern p = Pattern.compile("\"([^\"]*)\"|(\\w+)");
Matcher m = p.matcher(in);
ArrayList<String> sets = new ArrayList<>();
while(m.find()) {
String multiWordCountry = m.group(1);
if (multiWordCountry != null) {
sets.add(multiWordCountry);
} else {
sets.add(m.group(2));
}
}
System.out.print(sets);
Result:
[Chad, Benin, Angola, Algeria, Finland, Romania, Democratic Republic of the Congo, Bolivia, Uzbekistan, Lesotho, United States of America]

How can I split a string then edit the string then place back the delimiters to their original spot

I am working on a Pig Latin project that translates any sentence entered by the user into Pig Latin. I have the conversion down and it works. However, I have issues with punctuation: when I split my string to work on each individual word, the punctuation gets in the way. How can I split the input string into its individual words but keep the delimiters, so that I can put the punctuation and whitespace back in their original places?
Thank you
public static void main(String[] args) {
Scanner scanner = new Scanner(System.in);
System.out.print("Enter a word or phrase: ");
String convert = scanner.nextLine();
String punctuations = ".,?!;";
//convert = convert.replaceAll("\\p{Punct}+", ""); //idk if this is useful for me
String finalSentence = "";
if (convert.contains(" ")) {
String[] arr = convert.split("[ ,?!;:.]+");
for (int index = 0; index < arr.length; index++) {
if (vowel(arr[index]) == true) {
System.out.println(arr[index] + "yay");
finalSentence = (finalSentence + arr[index] + "yay ");
} else {
System.out.println(newConvert(arr[index]));
finalSentence = (finalSentence + newConvert(arr[index]) + " ");
}
}

public static void main(String[] args) {
String convert = "The quick? brown!!fox jumps__over the lazy333 dog.";
StringBuilder finalSentence = new StringBuilder();
List<String> tokens = Arrays.asList(convert.split(""));
Iterator<String> it = tokens.iterator();
while (it.hasNext()) {
String token = it.next();
StringBuilder sb = new StringBuilder();
while (token.matches("[A-Za-z]")) {
sb.append(token);
if (it.hasNext()) {
token = it.next();
} else {
token = "";
break;
}
}
String word = sb.toString();
if (!word.isEmpty()) {
finalSentence.append(magic(word));
}
finalSentence.append(token);
}
//prints "The1 quick1? brown1!!fox1 jumps1__over1 the1 lazy1333 dog1."
System.out.println(finalSentence.toString());
}
private static String magic(String word) {
return word + 1;
}
Do the Pig Latin translation in the magic method.
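For illustration, here is one way such a method could look. This is only a sketch under assumed rules: words starting with a vowel get "yay" appended (as in the question's code), and otherwise the leading consonants move to the end followed by "ay". The names PigLatin, toPigLatin and translate are hypothetical. Instead of walking the characters by hand, translate uses Matcher.appendReplacement, which copies everything between the matched words unchanged, so punctuation and whitespace stay where they were.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class PigLatin {
    private static final Pattern WORD = Pattern.compile("[A-Za-z]+");

    // Assumed rules: vowel-initial words get "yay"; otherwise the leading
    // consonants move behind the rest of the word, followed by "ay".
    static String toPigLatin(String word) {
        int firstVowel = -1;
        for (int i = 0; i < word.length(); i++) {
            if ("aeiouAEIOU".indexOf(word.charAt(i)) >= 0) {
                firstVowel = i;
                break;
            }
        }
        if (firstVowel == 0) {
            return word + "yay";
        }
        if (firstVowel < 0) {
            return word + "ay"; // no vowel at all, nothing to move
        }
        return word.substring(firstVowel) + word.substring(0, firstVowel) + "ay";
    }

    // Translates every run of letters; all other characters are copied as they are.
    static String translate(String sentence) {
        Matcher m = WORD.matcher(sentence);
        StringBuffer sb = new StringBuffer();
        while (m.find()) {
            m.appendReplacement(sb, Matcher.quoteReplacement(toPigLatin(m.group())));
        }
        m.appendTail(sb);
        return sb.toString();
    }

    public static void main(String[] args) {
        // prints "eThay uickqay? ownbray!!oxfay."
        System.out.println(translate("The quick? brown!!fox."));
    }
}
```

The sketch does not try to restore capitalization, so "The" becomes "eThay"; that would need an extra step if it matters.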

I defined two methods for the Pig Latin translation: the convert_word_to_pig_latin method is to convert each word into Pig Latin, and the convert_sentence_to_pig_latin method is for a sentence using the convert_word_to_pig_latin method.
def convert_word_to_pig_latin(word)
vowels = "aeiou"
punctuations = ".,?!'\":;-"
if vowels.include?(word[0])
return word
else
if punctuations.include?(word[-1])
punctuation = word[-1]
word = word.chop
end
first_vowel_index = word.chars.find_index { |letter| vowels.include?(letter) }
new_word = word[first_vowel_index..-1] + word[0...first_vowel_index] + "ay"
return punctuation ? new_word += punctuation : new_word
end
end
def convert_sentence_to_pig_latin(sentence)
sentence_array = sentence.split(" ")
sentence_array.map { |word| convert_word_to_pig_latin(word) }.join(" ")
end
NOTE: Please feel free to add any additional punctuation marks as you'd like.
Lastly, here is my RSpec to ensure both of my methods pass all tests:
require_relative('../pig_latin')
describe 'Converting single words to Pig Latin' do
word1 = "beautiful"
word2 = "easy"
word3 = "straight"
it "converts word to Pig Latin" do
expect(convert_word_to_pig_latin(word1)).to eq "eautifulbay"
end
it "does not change word if it begins with a vowel" do
expect(convert_word_to_pig_latin(word2)).to eq "easy"
end
it "converts word to Pig Latin" do
expect(convert_word_to_pig_latin(word3)).to eq "aightstray"
end
end
describe 'Converting a sentence to Pig Latin' do
sentence1 = "Make your life a masterpiece; imagine no limitations on what you can be, have, or do."
sentence2 = "The pessimist sees difficulty in every opportunity. The optimist sees the opportunity in every difficulty."
it "converts motivational quote from Brian Tracy to Pig Latin" do
expect(convert_sentence_to_pig_latin(sentence1)).to eq "akeMay ouryay ifelay a asterpiecemay; imagine onay imitationslay on atwhay ouyay ancay ebay, avehay, or oday."
end
it "converts motivational quote from Winston Churchill to Pig Latin" do
expect(convert_sentence_to_pig_latin(sentence2)).to eq "eThay essimistpay eessay ifficultyday in every opportunity. eThay optimist eessay ethay opportunity in every ifficultyday."
end
end

Uppercase all characters but not those in quoted strings

I have a String and I would like to uppercase everything that is not quoted.
Example:
My name is 'Angela'
Result:
MY NAME IS 'Angela'
Currently, I am matching every quoted string then looping and concatenating to get the result.
Is it possible to achieve this in one regex expression maybe using replace?

List<String> matchList = new ArrayList<String>();
Pattern regex = Pattern.compile("\\'(.*?)\\'");
String input = "'s'Hello This is 'Java' Not '.NET'";
Matcher regexMatcher = regex.matcher(input);
StringBuffer sb = new StringBuffer();
int counter = 0;
while (regexMatcher.find())
{// Finds Matching Pattern in String
regexMatcher.appendReplacement(sb, "{"+counter+"}");
matchList.add(regexMatcher.group());// Fetching Group from String
counter++;
}
String format = MessageFormat.format(sb.toString().toUpperCase(), matchList.toArray());
System.out.println(input);
System.out.println("----------------------");
System.out.println(format);
Input: 's'Hello This is 'Java' Not '.NET'
Output: 's'HELLO THIS IS 'Java' NOT '.NET'

You could use a regular expression like this:
([^'"]+)(['"]+[^'"]+['"]+)(.*)
# match and capture everything up to a single or double quote (but not including)
# match and capture a quoted string
# match and capture any rest which might or might not be there.
This will only work with one quoted string, obviously.
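To handle any number of quoted strings in one pass, a possible variation is to match either a complete quoted section or a run of unquoted text, and upper-case only the latter. The sketch below assumes single quotes only and no escaped quotes inside a quoted section; the class name QuoteAwareUpper and the method upperOutsideQuotes are made up for the example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class QuoteAwareUpper {
    // Either a complete single-quoted section, or a run of text without quotes
    private static final Pattern TOKEN = Pattern.compile("'[^']*'|[^']+");

    static String upperOutsideQuotes(String input) {
        Matcher m = TOKEN.matcher(input);
        StringBuffer sb = new StringBuffer();
        while (m.find()) {
            String token = m.group();
            // Only the quoted alternative can start with a quote character
            String replacement = token.startsWith("'") ? token : token.toUpperCase();
            m.appendReplacement(sb, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(sb); // keep any text after the last match
        return sb.toString();
    }

    public static void main(String[] args) {
        // prints MY NAME IS 'Angela'
        System.out.println(upperOutsideQuotes("My name is 'Angela'"));
    }
}
```

A lone quote without a partner (as in "it's") matches neither alternative and is copied through unchanged, while the text around it is still upper-cased.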

OK, this will do it for you. It is not efficient, but it will work for all cases. I don't actually recommend this solution, as it will be too slow.
public static void main(String[] args) {
String s = "'Peter' said, My name is 'Angela' and I will not change my name to 'Pamela'.";
Pattern p = Pattern.compile("('\\w+')");
Matcher m = p.matcher(s);
List<String> quotedStrings = new ArrayList<>();
while(m.find()) {
quotedStrings.add(m.group(1));
}
s=s.toUpperCase();
// System.out.println(s);
for (String str : quotedStrings)
s= s.replaceAll("(?i)"+str, str);
System.out.println(s);
}
Output:
'Peter' SAID, MY NAME IS 'Angela' AND I WILL NOT CHANGE MY NAME TO 'Pamela'.

Adding to the answer by #jan_kiran: we need to call the appendTail() method. The updated code is:
List<String> matchList = new ArrayList<String>();
Pattern regex = Pattern.compile("\\'(.*?)\\'");
String input = "'s'Hello This is 'Java' Not '.NET'";
Matcher regexMatcher = regex.matcher(input);
StringBuffer sb = new StringBuffer();
int counter = 0;
while (regexMatcher.find())
{// Finds Matching Pattern in String
regexMatcher.appendReplacement(sb, "{"+counter+"}");
matchList.add(regexMatcher.group());// Fetching Group from String
counter++;
}
regexMatcher.appendTail(sb);
String formatted_string = MessageFormat.format(sb.toString().toUpperCase(), matchList.toArray());

I had no luck with these solutions, as they seemed to remove trailing non-quoted text.
This code works for me and handles both ' and " by remembering the type of the last opening quotation mark. Replace toLowerCase as appropriate, of course.
This may be extremely slow; I don't know:
private static String toLowercaseExceptInQuotes(String line) {
StringBuffer sb = new StringBuffer(line);
boolean nowInQuotes = false;
char lastQuoteType = 0;
for (int i = 0; i < sb.length(); ++i) {
char cchar = sb.charAt(i);
if (cchar == '"' || cchar == '\''){
if (!nowInQuotes) {
nowInQuotes = true;
lastQuoteType = cchar;
}
else {
if (lastQuoteType == cchar) {
nowInQuotes = false;
}
}
}
else if (!nowInQuotes) {
sb.setCharAt(i, Character.toLowerCase(sb.charAt(i)));
}
}
return sb.toString();
}

Parsing using Pattern in Java

I want to parse the lines of a file using parsingMethod.
test.csv
Frank George,Henry,Mary / New York,123456
,Beta Charli,"Delta,Delta Echo
", 25/11/1964, 15/12/1964,"40,000,000.00",0.0975,2,"King, Lincoln ",Alpha
This is how I read the lines:
public static void main(String[] args) throws Exception {
File file = new File("C:\\Users\\test.csv");
BufferedReader reader = new BufferedReader(new FileReader(file));
String line2;
while ((line2= reader.readLine()) !=null) {
String[] tab = parsingMethod(line2, ",");
for (String i : tab) {
System.out.println( i );
}
}
}
public static String[] parsingMethod(String line,String parser) {
List<String> liste = new LinkedList<String>();
String patternString ="(([^\"][^"+parser+ "]*)|\"([^\"]*)\")" +parser+"?";
Pattern pattern = Pattern.compile(patternString);
Matcher matcher =pattern.matcher(line);
while (matcher.find()) {
if(matcher.group(2) != null){
liste.add(matcher.group(2).replace("\n","").trim());
}else if(matcher.group(3) != null){
liste.add(matcher.group(3).replace("\n","").trim());
}
}
String[] result = new String[liste.size()];
return liste.toArray(result);
}
}
Output :
Frank George
Henry
Mary / New York
123456
Beta Charli
Delta
Delta Echo
"
25/11/1964
15/12/1964
40,000,000.00
0.0975
2
King
Lincoln
"
Alpha
Delta
Delta Echo
I want to remove this stray " from the output.
Can anyone help me improve my pattern?
Expected output
Frank George
Henry
Mary / New York
123456
Beta Charli
Delta
Delta Echo
25/11/1964
15/12/1964
40,000,000.00
0.0975
2
King
Lincoln
Alpha
Delta
Delta Echo
Output for line 3
25/11/1964
15/12/1964
40
000
000.00
0.0975
2
King
Lincoln

Your code didn't compile properly but that was caused by some of the " not being escaped.
But this should do the trick:
String patternString = "(?:^.,|)([^\"]*?|\".*?\")(?:,|$)";
Pattern pattern = Pattern.compile(patternString, Pattern.MULTILINE);
(?:^.,|) is a non-capturing group that matches either a single character followed by a comma at the start of a line, or nothing.
([^\"]*?|\".*?\") is a capturing group that matches either a run of characters other than " or anything between two " characters.
(?:,|$) is a non-capturing group that matches a comma or the end of a line.
Note: ^ and $ match at line boundaries only when the pattern is compiled with the Pattern.MULTILINE flag.

I can't reproduce your result but I'm thinking maybe you want to leave the quotes out of the second captured group, like this:
"(([^\"][^"+parser+ "]*)|\"([^\"]*))\"" +parser+"?"
Edit: Sorry, this won't work. Maybe you want to let any number of ^\" in the first group as well, like this: (([^,\"]*)|\"([^\"]*)\"),?

As far as I can see the lines are related, so try this:
public static void main(String[] args) throws Exception {
File file = new File("C:\\Users\\test.csv");
BufferedReader reader = new BufferedReader(new FileReader(file));
StringBuilder line = new StringBuilder();
String lineRead;
while ((lineRead = reader.readLine()) != null) {
line.append(lineRead);
}
String[] tab = parsingMethod(line.toString());
for (String i : tab) {
System.out.println(i);
}
}
public static String[] parsingMethod(String line) {
List<String> liste = new LinkedList<String>();
String patternString = "(([^\"][^,]*)|\"([^\"]*)\"),?";
Pattern pattern = Pattern.compile(patternString);
Matcher matcher = pattern.matcher(line);
while (matcher.find()) {
if (matcher.group(2) != null) {
liste.add(matcher.group(2).replace("\n", "").trim());
} else if (matcher.group(3) != null) {
liste.add(matcher.group(3).replace("\n", "").trim());
}
}
String[] result = new String[liste.size()];
return liste.toArray(result);
}
Output:
Frank George
Henry
Mary / New York
123456
Beta Charli
Delta,Delta Echo
25/11/1964
15/12/1964
40,000,000.00
0.0975
2
King, Lincoln
Alpha
Since Delta,Delta Echo is inside quotation marks, it should appear on a single line, just like King, Lincoln.
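If the regex gets hard to maintain, a small character scanner is another option. The sketch below is an assumption-laden simplification, not a full CSV parser: it treats commas and line breaks outside double quotes as separators, drops line breaks inside quotes, skips empty fields, and does not handle escaped quotes (""). The names CsvFields, parse and addField are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

class CsvFields {
    // Splits the text on commas and line breaks that are outside double quotes.
    static List<String> parse(String text) {
        List<String> fields = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        boolean inQuotes = false;
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            boolean lineBreak = (c == '\n' || c == '\r');
            if (c == '"') {
                inQuotes = !inQuotes;          // the quote itself is not copied
            } else if (!inQuotes && (c == ',' || lineBreak)) {
                addField(fields, current);     // separator outside quotes ends the field
            } else if (!lineBreak) {
                current.append(c);             // line breaks inside quotes are dropped
            }
        }
        addField(fields, current);
        return fields;
    }

    // Stores the trimmed field unless it is empty, then resets the buffer.
    private static void addField(List<String> fields, StringBuilder current) {
        String field = current.toString().trim();
        if (!field.isEmpty()) {
            fields.add(field);
        }
        current.setLength(0);
    }

    public static void main(String[] args) {
        String text = "Beta Charli,\"Delta,Delta Echo\n\", 25/11/1964,\"40,000,000.00\",\"King, Lincoln \",Alpha";
        // prints [Beta Charli, Delta,Delta Echo, 25/11/1964, 40,000,000.00, King, Lincoln, Alpha]
        System.out.println(parse(text));
    }
}
```

Because the scanner works on the whole text, it expects the file content as one string with its line breaks intact, rather than one call per line.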

Java Scanner Class useDelimiter Method

I have to read from a text file containing all the NCAA Division 1 championship games since 1933. The file is in this format:
1939:Villanova:42:Brown:30
1945:New York University:70:Ohio State:65
The fact that some university names contain spaces is giving me lots of trouble, because we are only supposed to read the school names and discard the year, the points, and the colons. I do not know whether I have to use a delimiter that discards white space, but the bottom line is that I am very lost.
I am slightly familiar with the useDelimiter method, but I have read that a .split("") might be useful. I am having a great deal of trouble due to my lack of knowledge of patterns.
THIS IS WHAT I HAVE SO FAR:
class NCAATeamTester
{
public static void main(String[]args)throws IOException
{
NCAATeamList myList = new NCAATeamList(); //ArrayList containing teams
Scanner in = new Scanner(new File("ncaa2012.data"));
in.useDelimiter("[A-Za-z]+"); //String Delimeter excluding non alphabetic chars or ints
while(in.hasNextLine()){
String line = in.nextLine();
String name = in.next(line);
String losingTeam = in.next(line);
//Creating team object with winning team
NCAATeamStats win = new NCAATeamStats(name);
myList.addToList(win); //Adds to List
//Creating team object with losing team
NCAATeamStats lose = new NCAATeamStats(losingTeam);
myList.addToList(lose); //Adds to List
}
}
}

What about
String[] spl = line.split(":"); // split takes a regex String, not a char
String name1 = spl[1];
String name2 = spl[3];
?
Or, if there are more records on the same line, use regular expressions:
String line = "1939:Villanova:42:Brown:30 1945:New York University:70:Ohio State:65";
// Four "something:" parts followed by the final score
Pattern p = Pattern.compile("(.*?:){4}[0-9]+");
Matcher m = p.matcher(line);
while (m.find())
{
    String[] spl = m.group().split(":");
    String name = spl[1];
    String name2 = spl[3];
}
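Since the question asks about useDelimiter specifically, here is a minimal sketch of that route. It assumes every record has exactly the five fields year:winner:points:loser:points, with the winning team listed first as in the question's code; the class name TeamNames and the method readTeams are invented for the example. The delimiter is a run of colons and line breaks, so spaces stay inside the school names.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;

class TeamNames {
    // Returns {winner, loser} pairs from records shaped like year:winner:points:loser:points
    static List<String[]> readTeams(String data) {
        List<String[]> games = new ArrayList<>();
        try (Scanner in = new Scanner(data)) {
            // A colon or a line break ends a token; spaces are part of the token
            in.useDelimiter("[:\\r\\n]+");
            while (in.hasNext()) {
                in.next();                  // year, discarded
                String winner = in.next();
                in.next();                  // winner's points, discarded
                String loser = in.next();
                in.next();                  // loser's points, discarded
                games.add(new String[] {winner, loser});
            }
        }
        return games;
    }

    public static void main(String[] args) {
        String data = "1939:Villanova:42:Brown:30\n1945:New York University:70:Ohio State:65\n";
        for (String[] game : readTeams(data)) {
            System.out.println(game[0] + " / " + game[1]);
        }
    }
}
```

To read from the file instead, the Scanner could be created with new Scanner(new File("ncaa2012.data")). A record with missing fields would make next() throw NoSuchElementException, so real input may need a check per line.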
