Using regex to parse a string from text that includes a newline


Given the following text, I'm trying to parse out the string "TestFile" after Address:
File: TestFile
Branch
OFFICE INFORMATION
Address: TestFile
City: L.A.
District.: 43
State: California
Zip Code: 90210
DISTRICT INFORMATION
Address: TestFile2
....
I understand that lookbehinds need a bounded length, so unbounded quantifiers such as * are not allowed, meaning this won't work:
(?<=OFFICE INFORMATION\n\s*Address:).*(?=\n)
I could use this
(?<=OFFICE INFORMATION\n Address:).*
but it depends on consistent spacing, so it isn't flexible and thus not ideal.
How do I reliably parse out "TestFile" and not "TestFile2" as shown in my example above? Note that Address appears twice but I only need the first value.
Thank you

You don't really need a lookbehind here. Get the matched text using a capturing group:
(?:\bOFFICE INFORMATION\s+Address:\s*)(\S+)
Captured group #1 will have the value TestFile.
JS Code:
var re = /(?:\bOFFICE INFORMATION\s+Address:\s*)(\S+)/;
var m;
var matches = [];
if ((m = re.exec(input)) !== null) {
    if (m.index === re.lastIndex)
        re.lastIndex++;
    matches.push(m[1]);
}
console.log(matches);
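Since the question is about Java, the same capturing-group idea can be sketched there as well. This is a minimal sketch rather than the answerer's code: the class name OfficeAddressDemo, the helper firstOfficeAddress, and the shortened sample input are made up for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class OfficeAddressDemo {
    // Same pattern as above, with backslashes doubled for a Java string literal.
    private static final Pattern OFFICE_ADDRESS =
            Pattern.compile("\\bOFFICE INFORMATION\\s+Address:\\s*(\\S+)");

    // Returns the first address after "OFFICE INFORMATION", or null if there is none.
    static String firstOfficeAddress(String text) {
        Matcher m = OFFICE_ADDRESS.matcher(text);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String input = "File: TestFile\nBranch\nOFFICE INFORMATION\n    Address: TestFile\n"
                + "City: L.A.\nDISTRICT INFORMATION\nAddress: TestFile2\n";
        System.out.println(firstOfficeAddress(input)); // TestFile
    }
}
```

Because \s+ and \s* also match newlines and any amount of indentation, the pattern does not depend on consistent spacing.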

Working with Array:
// A sample String
String questions = "File: TestFile Branch OFFICE INFORMATION Address: TestFile City: L.A. District.: 43 State: California Zip Code: 90210 DISTRICT INFORMATION Address: TestFile2";
// A list to store the split elements
ArrayList<String> arr = new ArrayList<>();
// Split on colons and whitespace.
// Including whitespace also handles newlines etc.
for (String x : questions.split(":|\\s"))
    // Ignore blank elements, so we get a clean list
    if (!x.trim().isEmpty())
        arr.add(x);
This will give you a list containing:
[File, TestFile, Branch, OFFICE, INFORMATION, Address, TestFile, City, L.A., District., 43, State, California, Zip, Code, 90210, DISTRICT, INFORMATION, Address, TestFile2]
Now let's analyze. Suppose you want the value that belongs to the Address element. Address is at index 5 in the list (the count starts from 0), so the value you want is at index 6.
So you would do this:
String address = arr.get(6);
This returns TestFile.
Similarly, for City the value you want is at index 8. You can of course modify my split pattern, or write a loop, to find better ways to do this task. This is just a hint.
Here is one such example loop:
// From index 6 on, every second element is a property value;
// the element right before it is the property name.
// Skip the first 6 elements because they are of no real use to us
for (int i = 6; i < (arr.size() / 2) + 6; i += 2)
    System.out.println(arr.get(i));
This gives following output:
TestFile
L.A.
43
California
Code
Of course this loop is unrefined; refine it a little and you will get every element correctly, even the last one. Or better yet, use ZipCode instead of Zip Code (no spaces inside a key) and the loop works with little else to add.
The advantage over using a direct regex: you won't have to specify a regex for every single element. Iteration is handier for getting things done automatically.
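One way to refine the loop so that multi-word keys such as Zip Code survive is to work line by line instead of splitting on every space. The sketch below assumes that each field sits on its own "key: value" line and that any non-empty line without a colon is a section heading; SectionParser and parseSection are hypothetical names.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class SectionParser {
    // Collects the "key: value" pairs that appear under the given section heading.
    static Map<String, String> parseSection(String text, String heading) {
        Map<String, String> fields = new LinkedHashMap<>();
        boolean inSection = false;
        for (String rawLine : text.split("\\R")) {
            String line = rawLine.trim();
            if (line.isEmpty()) {
                continue;
            }
            int colon = line.indexOf(':');
            if (colon < 0) {
                // Assumption: a line without a colon starts a new section.
                inSection = line.equals(heading);
            } else if (inSection) {
                fields.put(line.substring(0, colon).trim(), line.substring(colon + 1).trim());
            }
        }
        return fields;
    }

    public static void main(String[] args) {
        String input = "File: TestFile\nBranch\nOFFICE INFORMATION\n  Address: TestFile\nCity: L.A.\n"
                + "District.: 43\nState: California\nZip Code: 90210\n"
                + "DISTRICT INFORMATION\nAddress: TestFile2\n";
        Map<String, String> office = parseSection(input, "OFFICE INFORMATION");
        System.out.println(office.get("Address"));  // TestFile
        System.out.println(office.get("Zip Code")); // 90210
    }
}
```

Looking values up by key also removes the need to remember that Address happens to be at index 5.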

See this
//read input from file
BufferedReader reader = new BufferedReader(new InputStreamReader(new FileInputStream(new File("D:/tests/sample.txt"))));
StringBuilder string = new StringBuilder();
String line = "";
while((line = reader.readLine()) != null){
string.append(line);
string.append("\n");
}
//now string will contain the input as
/*File: TestFile
Branch
OFFICE INFORMATION
Address: TestFile
City: L.A.
District.: 43
State: California
Zip Code: 90210
DISTRICT INFORMATION
Address: TestFile2
....*/
Pattern regex = Pattern.compile("(OFFICE INFORMATION.*\\r?\\n.*Address:[ \\t]*(?<officeAddress>.*)\\r?\\n)"); // [ \t]* keeps leading blanks out of the group
Matcher regexMatcher = regex.matcher(string.toString());
while (regexMatcher.find()) {
System.out.println(regexMatcher.group("officeAddress"));//prints TestFile
}
The named group officeAddress in the pattern captures the value that needs to be extracted.

Related

Java - String splitting

I read a text file with data in the following format: Name Address Hobbies
Example: Bob Smith ABC Street Swimming
and assigned it to a String z.
Then I used z.split to separate each field using " " (a space) as the delimiter, but it separated Bob Smith into two different strings when it should be one field; the same happens with the address. Is there a method I can use to get it in the particular format I want?
P.S. Apologies if I explained it vaguely; English isn't my first language.
String z;
try {
BufferedReader br = new BufferedReader(new FileReader("desc.txt"));
z = br.readLine();
} catch(IOException io) {
io.printStackTrace();
}
String[] temp = z.split(" ");

If the format of name and address parts is fixed to consist of two parts, you could just join them:
String z = ""; // z must be initialized
// use try-with-resources to ensure the reader is closed properly
try (BufferedReader br = new BufferedReader(new FileReader("desc.txt"))) {
z = br.readLine();
} catch(IOException io) {
io.printStackTrace();
}
String[] temp = z.split(" ");
String name = String.join(" ", temp[0], temp[1]);
String address = String.join(" ", temp[2], temp[3]);
String hobby = temp[4];
Another option could be to create a format string as a regular expression and use it to parse the input line using named groups (?<group_name>capturing text):
// use named groups to define parts of the line
Pattern format = Pattern.compile("(?<name>\\w+\\s\\w+)\\s(?<address>\\w+\\s\\w+)\\s(?<hobby>\\w+)");
Matcher match = format.matcher(z);
if (match.matches()) {
String name = match.group("name");
String address = match.group("address");
String hobby = match.group("hobby");
System.out.printf("Input line matched: name=%s address=%s hobby=%s%n", name, address, hobby);
} else {
System.out.println("Input line not matching: " + z);
}

I can think of three solutions.
In order from best to worst:
1. Use a different delimiter.
2. Enforce the format to always have two names, two address parts and one hobby.
3. Have a dictionary with names and hobbies, check each word to determine which type it is, and then group them together as needed.
(The 3rd option is not meant as a serious alternative.)

As others have mentioned, using spaces as both field delimiter and inside fields is problematic. You could use a regex pattern to split the line (paste (\w+ \w+) (\w+ \w+) (.+) in Regex101 for an explanation):
Pattern pattern = Pattern.compile("(\\w+ \\w+) (\\w+ \\w+) (.+)");
Matcher matcher = pattern.matcher("Bob Smith ABC Street Bowling Fishing Rollerblading");
System.out.println("matcher.matches() = " + matcher.matches());
for (int i = 0; i <= matcher.groupCount(); i++) {
System.out.println("matcher.group(" + i + ") = " + matcher.group(i));
}
This would give the following output:
matcher.matches() = true
matcher.group(0) = Bob Smith ABC Street Bowling Fishing Rollerblading
matcher.group(1) = Bob Smith
matcher.group(2) = ABC Street
matcher.group(3) = Bowling Fishing Rollerblading
However this only works for this exact format. If you get a line with three name parts for example:
John B Smith ABC Street Swimming
This will get split into John B as the name, Smith ABC as the address and Street Swimming as hobbies.
So either make 100% sure your input will always match this format or use a different delimiter.

The split() method mainly works with two things:
the delimiter and
the String object.
Sometimes it also takes a limit.
Whatever limit you provide, split() does its work according to it.
It doesn't know whether the left substring is a name or not, and the same goes for the right substring.
Have a look at this code snippet:
String assets = "Gold:Stocks:Fixed Income:Commodity:Interest Rates";
String[] splits = assets.split(":");
System.out.println("splits.size: " + splits.length);
for (String asset : splits) {
    System.out.println(asset);
}
Output
splits.size: 5
Gold
Stocks
Fixed Income // with space
Commodity
Interest Rates // with space
The output kept the spaces because I provided : as the delimiter, so spaces are not treated as separators.
Hopefully this helps you find your answer.
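The limit mentioned above can be illustrated with the line from the question. This is only a sketch (SplitLimitDemo and splitWithLimit are made-up names): a positive limit n makes split() return at most n pieces, and the last piece holds the unsplit remainder.

```java
import java.util.Arrays;

class SplitLimitDemo {
    // Thin wrapper around String.split(regex, limit) so the effect is easy to compare.
    static String[] splitWithLimit(String line, int limit) {
        return line.split(" ", limit);
    }

    public static void main(String[] args) {
        String line = "Bob Smith ABC Street Swimming";
        // No limit: every space is a split point.
        System.out.println(Arrays.toString(line.split(" ")));
        // prints [Bob, Smith, ABC, Street, Swimming]

        // Limit 3: the delimiter is applied at most twice, so the third piece keeps its spaces.
        System.out.println(Arrays.toString(splitWithLimit(line, 3)));
        // prints [Bob, Smith, ABC Street Swimming]
    }
}
```

A limit alone still cannot tell where the name ends and the address begins; it only controls how many pieces come back.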

It depends on the data you're dealing with. Will the name always consist of a first and last name? Then you can simply combine the first two elements from the resulting array into a new string.
Otherwise, you might have to find a different way to separate out the different pieces within the txt file. Possibly a comma? Some character that you know won't ever be used in your normal data.
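If you control the file format, a delimiter that never occurs inside a field makes the split straightforward. A minimal sketch, assuming a comma-separated line (the comma format, the class DelimitedLine, and the method parseFields are assumptions for illustration, not part of the original file):

```java
class DelimitedLine {
    // Splits "name,address,hobbies" into exactly three trimmed fields.
    static String[] parseFields(String line) {
        String[] fields = line.split(",", 3);
        if (fields.length != 3) {
            throw new IllegalArgumentException("Expected 3 comma-separated fields: " + line);
        }
        for (int i = 0; i < fields.length; i++) {
            fields[i] = fields[i].trim();
        }
        return fields;
    }

    public static void main(String[] args) {
        // A three-part name no longer breaks the parsing.
        String[] f = parseFields("John B Smith, ABC Street, Swimming");
        System.out.println(f[0] + " | " + f[1] + " | " + f[2]);
        // prints John B Smith | ABC Street | Swimming
    }
}
```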

Assuming that every line follows the format
Bob Smith ABC Street Swimming
i.e. first name, surname, and then the remaining fields, this code merges the first two elements for you:
String[] temp = z.split(" ");
String[] temp2 = new String[temp.length - 1];
// the first two words become one field
temp2[0] = temp[0] + " " + temp[1];
// shift the remaining words one position to the left
for (int i = 2; i < temp.length; i++) {
    temp2[i - 1] = temp[i];
}
temp = temp2;

Extracting all DATES from a .txt file

Hopefully this is short and to the point.
In the below program I have successfully extracted ALL data from a notepad doc named "pad.txt", which consists of 3 sets vertically aligned with an 'ID' followed by 'Name' followed by 'Date Joined', that pattern is consistent.
The notepad doc consists solely of this:
ID: 1
Name: Bob
Date Joined: 01/12/2014
ID: 2
Name: Jim
Date Joined: 8/21/1993
ID: 3
Name: Steve
Date Joined: 6/07/2016
I have also defined a regex for an acceptable date format: 1-2 digits, a slash, 1-2 digits again, a slash, then 2 to 4 digits for the year. At the beginning of the regex I specified the wildcard character "." (the dot) with the greedy quantifier "*" (the star), meaning that ANY number of ANY characters is accepted before the date, and I specified ".*" after the date as well.
My main goal with this code is to EXTRACT ONLY the DATES within the pad.txt file and store them in a String or something similar.
public class Main {
public static void main(String args[]) throws Exception{
StringBuilder builder = new StringBuilder();
FileReader reader = new FileReader(new File("pad.txt"));
// Define valid date format via regex
String dateRegex = ".* (\\d{1,2})/(\\d{1,2})/(\\d{2,4}) .* ";
int fileContent = 0;
// read the entire file one character at a time until read() returns -1 (end of file)
while((fileContent = reader.read()) !=-1){
builder.append((char)fileContent);
}//encapsulating loop
reader.close();
String extracted = builder.toString();
System.out.println("Extracted: " + extracted);
System.out.println();
Matcher m = null;
// Validate that file contents conform with 'dateRegex'
m = Pattern.compile(dateRegex).matcher(extracted);
if(m.find()){
System.out.println("Entire group : " + m.group());
}
}
}
Unfortunately, printing m.group() only returns:
"Entire group : 6/07/2016"
As stated, my goal is to extract ALL of the dates, but I can't work with all of them if the matcher ONLY catches the last one.
In my mind, I am saying that ANY number of ANY characters is allowed before and AFTER the date, so it scans to the very bottom and finds ONLY the LAST date. How do I define the regex so that it pulls out ALL of the dates, not just the very LAST one, and why is it only pulling the last one?
I've tried relentlessly and cannot figure out how.
Thanks in advance

Well, that's relatively easy. You can't write a regex that matches all dates at once, but you can use the Matcher as it was intended to be used: find() returns true each time another match can be found.
So you have to modify your regex and remove the .* on both ends. Then you can simply do this:
StringBuilder dateListBuilder = new StringBuilder();
while (m.find()) {
    // one date per line, so the dates don't run together
    dateListBuilder.append(m.group()).append('\n');
}
System.out.println(dateListBuilder.toString());
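Put together, the find() loop might look like the following self-contained sketch. DateExtractor and extractDates are hypothetical names, and the sample text is the file content from the question inlined as a string rather than read from pad.txt.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class DateExtractor {
    // The date part of the original regex, without the leading and trailing ".*".
    private static final Pattern DATE = Pattern.compile("\\b\\d{1,2}/\\d{1,2}/\\d{2,4}\\b");

    // Returns every date found in the text, in order of appearance.
    static List<String> extractDates(String text) {
        List<String> dates = new ArrayList<>();
        Matcher m = DATE.matcher(text);
        while (m.find()) {
            dates.add(m.group());
        }
        return dates;
    }

    public static void main(String[] args) {
        String text = "ID: 1\nName: Bob\nDate Joined: 01/12/2014\n"
                + "ID: 2\nName: Jim\nDate Joined: 8/21/1993\n"
                + "ID: 3\nName: Steve\nDate Joined: 6/07/2016\n";
        System.out.println(extractDates(text)); // [01/12/2014, 8/21/1993, 6/07/2016]
    }
}
```

Collecting into a List instead of one long String keeps each date available for later processing.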

How to find if HL7 Segment has ended or not if Carriage return is not present

I am working on a tool which will construct an HL7 message in the following way:
The message will start with: 0B
Each segment will end with: 0D
And the message will end with: 1C0D
So far I am able to add 0B at the start and 1C0D at the end of the HL7 message. I am also able to add 0D at the end of each segment. I accomplish this with code that checks whether the character before a segment name is 0D or not.
But the issue is that if text inside the message looks like ...PID|, my code will add 0D before PID|, which is not correct. It should check whether this is the start of a segment or not.
Please help if someone has worked on a similar requirement.
Link to my code is :
Arraylist Sublist IndexOutOfBounds Exception

I had some time to look at this problem. As far as I could understand, you have some piece of code that generates the HL7v2 segments for you and then you want to create a message with the following delimiters:
1. Segment delimiter: 0x0D (ASCII 13 - Carriage Return). It's the segment separator, as per the HL7v2 standard;
2. Message start delimiter: 0x0B (ASCII 11 - Vertical Tab);
3. Message finish delimiter: 0x1C0D. My guess is that this value is supposed to be the concatenation of 0x1C (ASCII 28 - File Separator) and 0x0D (ASCII 13 - Carriage Return).
With #1 you get standard-compliant HL7v2 messages. With #2 and #3 you can clearly mark where a message starts and ends, so that it can be processed and parsed later by some custom processor.
So I took a shot writing some simple code and here's the result:
public class App
{
public static void main( String[] args ) throws Exception
{
String msg = "MSH|^~\\&|HIS|RIH|EKG|EKG|199904140038||ADT^A01||P|2.5" +
"PID|0001|00009874|00001122|A00977|SMITH^JOHN^M|MOM|19581119|F|NOTREAL^LINDA^M|C|564 SPRING ST^^NEEDHAM^MA^02494^US" +
"AL1||SEV|001^POLLEN";
String[] segments = msg.split("(?=PID|AL1)");
System.out.println("Initial message:");
for (String s : segments)
System.out.println(s);
byte hexStartMessage = 0x0B;
byte hexFinishMessage1 = 0x1C;
byte hexFinishMessage2 = 0x0D;
byte hexFinishSegment = 0x0D;
String finalMessage = Byte.toString(hexStartMessage) +
intersperse(segments, hexFinishSegment) +
Byte.toString(hexFinishMessage1) +
Byte.toString(hexFinishMessage2);
System.out.println("\nFinal message:\n" + finalMessage);
}
public static String intersperse(String[] segments, byte delimiter) throws UnsupportedEncodingException {
// uncomment this line if you wish to show the delimiter in the output
//System.out.printf("Byte Delimiter: %s", String.format("%04x", (int)delimiter));
StringBuilder sb = new StringBuilder();
String defaultDelimiter = "";
for (String segment : segments) {
sb.append(defaultDelimiter).append(segment);
defaultDelimiter = Byte.toString(delimiter);
}
return sb.toString();
}
}
I picked a simple HL7v2 message and split it into segments, according to the segment names used in the message, with the help of a regex that uses a lookahead. This means that for your messages you'll need to know which segments are going to be used (you can get that from the standard).
I then interspersed the segment delimiter between the segments and added the message start and end delimiters. In this case, for the message end delimiter, I used the 0x1C and 0x0D values separately, but if you need to use a single value then you only need to change the final appends.
Here's the output:
Initial message:
MSH|^~\&|HIS|RIH|EKG|EKG|199904140038||ADT^A01||P|2.5
PID|0001|00009874|00001122|A00977|SMITH^JOHN^M|MOM|19581119|F|NOTREAL^LINDA^M|C|564 SPRING ST^^NEEDHAM^MA^02494^US
AL1||SEV|001^POLLEN
Final message:
11MSH|^~\&|HIS|RIH|EKG|EKG|199904140038||ADT^A01||P|2.5
PID|0001|00009874|00001122|A00977|SMITH^JOHN^M|MOM|19581119|F|NOTREAL^LINDA^M|C|564 SPRING ST^^NEEDHAM^MA^02494^US
AL1||SEV|001^POLLEN2813
As you see, the final message begins with value 11 (0x0B) and ends with 28 (0x1C) and 13 (0x0D). The 13 (0x0D) at the end of each segment is not shown because Java's System.out.println() recognizes it as being the '\r' character and starts a new line because I'm running in Mac OS X. If you try to intersperse the segments with any other character (ex: 0x25 = '%') you'll notice that the final message is printed in a single line:
11MSH|^~\&|HIS|RIH|EKG|EKG|199904140038||ADT^A01||P|2.5%PID|0001|00009874|00001122|A00977|SMITH^JOHN^M|MOM|19581119|F|NOTREAL^LINDA^M|C|564 SPRING ST^^NEEDHAM^MA^02494^US%AL1||SEV|001^POLLEN2813
If I run in Ubuntu, you get to see the message in one line with the segment delimiter:
11MSH|^~\&|HIS|RIH|EKG|EKG|199904140038||ADT^A01||P|2.513PID|0001|00009874|00001122|A00977|SMITH^JOHN^M|MOM|19581119|F|NOTREAL^LINDA^M|C|564 SPRING ST^^NEEDHAM^MA^02494^US13AL1||SEV|001^POLLEN2813
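One thing to keep in mind: Byte.toString(0x0B) produces the two-character text "11", not the control character itself, which is why the outputs above contain the digits 11, 13 and 28. If the receiver expects the actual control characters, a sketch along these lines may be closer to what is needed. The class Hl7Framer and the method frame are hypothetical names, and the framing (0x0B, then each segment followed by 0x0D, then 0x1C 0x0D) is taken from the description in the question.

```java
class Hl7Framer {
    static final char START_BLOCK = 0x0B;     // vertical tab
    static final char END_BLOCK = 0x1C;       // file separator
    static final char CARRIAGE_RETURN = 0x0D; // segment terminator

    // Wraps the segments as: 0x0B, segment 0x0D (repeated), 0x1C, 0x0D.
    static String frame(String[] segments) {
        StringBuilder sb = new StringBuilder();
        sb.append(START_BLOCK);
        for (String segment : segments) {
            sb.append(segment).append(CARRIAGE_RETURN);
        }
        sb.append(END_BLOCK).append(CARRIAGE_RETURN);
        return sb.toString();
    }

    public static void main(String[] args) {
        String framed = frame(new String[] {"MSH|^~\\&|HIS|RIH", "PID|0001|00009874"});
        // Print the numeric values of the framing characters instead of the raw bytes.
        System.out.println((int) framed.charAt(0));                   // 11
        System.out.println((int) framed.charAt(framed.length() - 2)); // 28
        System.out.println((int) framed.charAt(framed.length() - 1)); // 13
    }
}
```

Appending char values keeps each delimiter to a single character, so a parser on the other side can split on 0x0D directly.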

Cleaning a file name in Java

I want to write a script that will clean up the names of my .mp3 files.
I was able to write a few lines that change the name, but I want to write an automatic script that will erase all the undesired characters ($%_!?7 etc.) while changing the name to the format "Artist - Song".
File file = new File("C://Users//nikita//Desktop//$%#Artis8t_-_35&Son5g.mp3");
String Original = file.toString();
String New = "Code to change 'Original' to 'Artist - Song'";
File file2 = new File("C://Users//nikita//Desktop//" + New + ".mp3");
file.renameTo(file2);
I feel like I should make a list with all possible characters and then run the String through this list and erase all of the listed characters but I am not sure how to do it.
String test = "$%$#Arti56st_-_54^So65ng.mp3";
Edit 1:
When I try using the method remove, it still doesn't change the name.
String test = "$%$#Arti56st_-_54^So65ng.mp3";
System.out.println("Original: " + test);
test.replace( "[0-9]%#&\\$", "");
System.out.println("New: " + test);
The code above returns the following output
Original: $%$#Arti56st_-_54^So65ng.mp3
New: $%$#Arti56st_-_54^So65ng.mp3

I'd suggest something like this:
public static String sanitizeFilename(String original){
Pattern p = Pattern.compile("(.*)-(.*)\\.mp3");
Matcher m = p.matcher(original);
if (m.matches()){
String artist = m.group(1).replaceAll("[^a-zA-Z ]", "");
String song = m.group(2).replaceAll("[^a-zA-Z ]", "");
return String.format("%s - %s", artist, song);
}
else {
throw new IllegalArgumentException("Failed to match filename : "+original);
}
}
(Edit - changed whitelist regex to exclude digits and underscores)
Two points in particular - when sanitizing strings, it's a good idea to whitelist permitted characters, rather than blacklisting the ones you want to exclude, so you won't be surprised by edge cases later. (You may want a less restrictive whitelist than I've used here, but it's easy to vary)
It's also a good idea to handle the case where the filename doesn't match the expected pattern. If your code comes across something other than an MP3, how would you like it to respond? Here I've thrown an exception, so the calling code can catch and handle that appropriately.

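String cleaned = original.replaceAll("[0-9%#&$]", "");
This should remove almost all the characters you don't want. Note that replaceAll (not replace) is the method that takes a regex, that new is a reserved word and cannot be used as a variable name, and that the result has to be assigned because strings are immutable.
Or you can come up with your own regex: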
https://docs.oracle.com/javase/tutorial/essential/regex/
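Two details explain why the code in the question changes nothing: Java strings are immutable, so the result of replace() has to be assigned, and replace() treats its first argument as literal text while replaceAll() treats it as a regex. A small sketch built on those two points (Mp3NameCleaner and clean are made-up names; the extension is split off first so that the 3 in .mp3 survives the digit removal):

```java
class Mp3NameCleaner {
    // Turns e.g. "$%$#Arti56st_-_54^So65ng.mp3" into "Artist - Song.mp3".
    static String clean(String fileName) {
        int dot = fileName.lastIndexOf('.');
        String base = dot < 0 ? fileName : fileName.substring(0, dot);
        String extension = dot < 0 ? "" : fileName.substring(dot);
        // Keep only letters and the dash that separates artist from song.
        String lettersAndDash = base.replaceAll("[^a-zA-Z-]", "");
        // Put spaces around the dash: "Artist-Song" becomes "Artist - Song".
        return lettersAndDash.replace("-", " - ") + extension;
    }

    public static void main(String[] args) {
        System.out.println(clean("$%$#Arti56st_-_54^So65ng.mp3")); // Artist - Song.mp3
    }
}
```

This assumes the name contains exactly one dash between artist and song; a name with more dashes would need the pattern-based check shown in the other answer.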

Find a character or group of characters in a string and remove them

I am trying to build a program that
locates the # symbol, then
locates the .edu part of an educational email address, and finally
removes the #school.edu section, and return the rest.
I've tried using charAt, but I keep receiving an incompatible-types error, and I'm not sure how to remove a section of a string that could be in a different location each time. Any guidance would be welcome.
here is what I have so far:
if (UserEmail.charAt(0) == (".edu"))
String UserName = UserEmail.substring(0,//location of #//)
else
System.out.print(UserEmail + "is not an acceptable email address.
System.out.print("Type your email address.");
UserEmail = kb.nextLine();

You could use the String.replaceAll method.
string.replaceAll("#\\S+?\\.edu\\b", "");
\\S+? will do a non-greedy match of one or more non-space characters.
Example:
String r = "foo#school.edu bar";
System.out.println(r.replaceAll("#\\S+?\\.edu\\b", ""));
Output:
foo bar
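If you would rather keep the validation from the original attempt, indexOf and endsWith avoid the incompatible-types error (charAt returns a single char, which cannot be compared with the String ".edu"). A minimal sketch, using the # separator exactly as written in the question; EduEmail and userName are hypothetical names:

```java
class EduEmail {
    // Returns the part before '#', or null if the address is not of the form name#host.edu.
    static String userName(String email) {
        int separator = email.indexOf('#');
        if (separator <= 0 || !email.endsWith(".edu")) {
            return null;
        }
        return email.substring(0, separator);
    }

    public static void main(String[] args) {
        System.out.println(userName("foo#school.edu")); // foo
        System.out.println(userName("foo#school.com")); // null
    }
}
```

The caller can then print the "not an acceptable email address" message and ask again whenever null comes back.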
