Extracting all DATES from a .txt file - java

hopefully this is short and to the question..
In the below program I have successfully extracted ALL data from a notepad doc named "pad.txt", which consists of 3 sets vertically aligned with an 'ID' followed by 'Name' followed by 'Date Joined', that pattern is consistent.
The notepad doc consists solely of this:
ID: 1
Name: Bob
Date Joined: 01/12/2014
ID: 2
Name: Jim
Date Joined: 8/21/1993
ID: 3
Name: Steve
Date Joined: 6/07/2016
I have also defined a regex that accepts an acceptable date format: 1-2 digits, a slash, 1-2 digits again, a slash, then 2 to 4 digits for the year. At the beginning I specified the wildcard character "." (the dot) with the greedy quantifier "*" (the star), to say that ANY number of ANY character is accepted before the date, and I specified the same ".*" after the date as well.
My main goal with this code is to EXTRACT ONLY all of the DATES within the pad.txt file, and store them in a String or something..
public class Main {
public static void main(String args[]) throws Exception{
StringBuilder builder = new StringBuilder();
FileReader reader = new FileReader(new File("pad.txt"));
// Define valid date format via regex
String dateRegex = ".* (\\d{1,2})/(\\d{1,2})/(\\d{2,4}) .* ";
int fileContent = 0;
// read the entire notepad doc one character at a time, until read() returns -1 (end of file)
while((fileContent = reader.read()) !=-1){
builder.append((char)fileContent);
}//encapsulating loop
reader.close();
String extracted = builder.toString();
System.out.println("Extracted: " + extracted);
System.out.println();
Matcher m = null;
// Validate that file contents conform with 'dateRegex'
m = Pattern.compile(dateRegex).matcher(extracted);
if(m.find()){
System.out.println("Entire group : " + m.group());
}
}
}
Unfortunately, the m.group(); outprint only returns:
"Entire group : 6/07/2016"
As stated, my goal is to extract ALL of the dates, but I can't fiddle with all of the dates if the .matcher call ONLY catches the "Entire group : 6/07/2016"
In my mind, I'm saying that ANY character of ANY amount is allowed before and AFTER the date, so it scrolls to the very bottom and finds ONLY the LAST date. How do I define the regex so that it pulls out ALL of the dates, not just the very LAST one, and why is it only pulling the last one?
I've tried relentlessly with this and cannot figure out how..
Thanks in advance

Well, that's relatively easy. You can't write a regex that matches all dates at once, but you can use matcher as it was intended to be used, i.e. find() returns true as often as another match can be found.
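So you have to modify your regex and remove the .* on both ends. Then you can simply do this:
// Date pattern from the question, without the surrounding .*
Matcher m = Pattern.compile("\\d{1,2}/\\d{1,2}/\\d{2,4}").matcher(extracted);
StringBuilder dateListBuilder = new StringBuilder();
while(m.find()){
    // Append each date on its own line so they don't run together
    dateListBuilder.append(m.group()).append('\n');
}
System.out.println(dateListBuilder.toString());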
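For completeness, here is a minimal self-contained sketch of the same find() loop that collects the dates into a List instead of a StringBuilder. The class name DateExtractor and the method extractDates are made up for illustration, and the sample text is inlined rather than read from pad.txt:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class; the name is illustrative only
class DateExtractor {
    // Same date shape as in the question, without the leading and trailing ".*"
    private static final Pattern DATE = Pattern.compile("\\d{1,2}/\\d{1,2}/\\d{2,4}");

    // Returns every date-like substring of text, in order of appearance
    public static List<String> extractDates(String text) {
        List<String> dates = new ArrayList<>();
        Matcher m = DATE.matcher(text);
        while (m.find()) { // each call to find() moves on to the next match
            dates.add(m.group());
        }
        return dates;
    }

    public static void main(String[] args) {
        String text = "ID: 1\nName: Bob\nDate Joined: 01/12/2014\n"
                + "ID: 2\nName: Jim\nDate Joined: 8/21/1993\n"
                + "ID: 3\nName: Steve\nDate Joined: 6/07/2016\n";
        System.out.println(extractDates(text)); // [01/12/2014, 8/21/1993, 6/07/2016]
    }
}
```

Keeping the dates in a List makes it easy to work with each one individually afterwards.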


Remove elements from Date Format String using a Regular Expression

I want to remove elements from a supplied date format string; for example, convert the format "dd/MM/yyyy" to "MM/yyyy" by removing any non-M/y element.
What I'm trying to do is create a localised month/year format based on the existing day/month/year format provided for the Locale.
I've done this using regular expressions, but the solution seems longer than I'd expect.
An example is below:
public static void main(final String[] args) {
System.out.println(filterDateFormat("dd/MM/yyyy HH:mm:ss", 'M', 'y'));
System.out.println(filterDateFormat("MM/yyyy/dd", 'M', 'y'));
System.out.println(filterDateFormat("yyyy-MMM-dd", 'M', 'y'));
}
/**
* Keeps only {@code charsToRetain} in {@code format}, removing all other
* characters and any redundant separators.
*/
private static String filterDateFormat(final String format, final char...charsToRetain) {
// Match e.g. "ddd-"
final Pattern pattern = Pattern.compile("[" + new String(charsToRetain) + "]+\\p{Punct}?");
final Matcher matcher = pattern.matcher(format);
final StringBuilder builder = new StringBuilder();
while (matcher.find()) {
// Append each match
builder.append(matcher.group());
}
// If the last match is "mmm-", remove the trailing punctuation symbol
return builder.toString().replaceFirst("\\p{Punct}$", "");
}
Let's try a solution for the following date format strings:
String[] formatStrings = { "dd/MM/yyyy HH:mm:ss",
"MM/yyyy/dd",
"yyyy-MMM-dd",
"MM/yy - yy/dd",
"yyabbadabbadooMM" };
The following will analyze strings for a match, then print the first group of the match.
Pattern p = Pattern.compile(REGEX);
for(String formatStr : formatStrings) {
Matcher m = p.matcher(formatStr);
if(m.matches()) {
System.out.println(m.group(1));
}
else {
System.out.println("Didn't match!");
}
}
Now, there are two separate regular expressions I've tried. First:
final String REGEX = "(?:[^My]*)([My]+[^\\w]*[My]+)(?:[^My]*)";
With program output:
MM/yyyy
MM/yyyy
yyyy-MMM
Didn't match!
Didn't match!
Second:
final String REGEX = "(?:[^My]*)((?:[My]+[^\\w]*)+[My]+)(?:[^My]*)";
With program output:
MM/yyyy
MM/yyyy
yyyy-MMM
MM/yy - yy
Didn't match!
Now, let's see what the first regex actually matches to:
(?:[^My]*)([My]+[^\\w]*[My]+)(?:[^My]*)   First regex =
(?:[^My]*)                                 any amount of non-Ms and non-ys (non-capturing)
([My]+                                     followed by one or more Ms and ys
[^\\w]*                                    optionally separated by non-word characters
                                           (implying they are also not Ms or ys)
[My]+)                                     followed by one or more Ms and ys
(?:[^My]*)                                 finished by any number of non-Ms and non-ys
                                           (non-capturing)
What this means is that at least 2 M/ys are required to match the regex, although you should be careful that something like MM-dd or yy-DD will match as well, because they have two M-or-y regions 1 character long. You can avoid getting into trouble here by just keeping a sanity check on your date format string, such as:
if(formatStr.contains("y") && formatStr.contains("M") && m.matches())
{
String yMString = m.group(1);
... // other logic
}
As for the second regex, here's what it means:
(?:[^My]*)((?:[My]+[^\\w]*)+[My]+)(?:[^My]*)   Second regex =
(?:[^My]*)                                      any amount of non-Ms and non-ys
                                                (non-capturing)
(                     )                         followed by
(?:[My]+       )+[My]+                          at least two text segments consisting of
                                                one or more Ms or ys, where each segment is
[^\\w]*                                         optionally separated by non-word characters
(?:[^My]*)                                      finished by any number of non-Ms and non-ys
                                                (non-capturing)
This regex will match a slightly broader series of strings, but it still requires that any separations between Ms and ys be non-words ([^a-zA-Z_0-9]). Additionally, keep in mind that this regex will still match "yy", "MM", or similar strings like "yyy", "yyyy"..., so it would be useful to have a sanity check as described for the previous regular expression.
Additionally, here's a quick example of how one might use the above to manipulate a single date format string:
LocalDateTime date = LocalDateTime.now();
String dateFormatString = "dd/MM/yyyy H:m:s";
System.out.println("Old Format: \"" + dateFormatString + "\" = " +
date.format(DateTimeFormatter.ofPattern(dateFormatString)));
Pattern p = Pattern.compile("(?:[^My]*)([My]+[^\\w]*[My]+)(?:[^My]*)");
Matcher m = p.matcher(dateFormatString);
if(dateFormatString.contains("y") && dateFormatString.contains("M") && m.matches())
{
dateFormatString = m.group(1);
System.out.println("New Format: \"" + dateFormatString + "\" = " +
date.format(DateTimeFormatter.ofPattern(dateFormatString)));
}
else
{
throw new IllegalArgumentException("Couldn't shorten date format string!");
}
Output:
Old Format: "dd/MM/yyyy H:m:s" = 14/08/2019 16:55:45
New Format: "MM/yyyy" = 08/2019
I'll try to answer based on my understanding of the question: how do I remove from a list/table/array of Strings the elements that do not exactly follow the pattern 'dd/MM'?
So I'm looking for a function that looks like
public List<String> removeUnWantedDateFormat(List<String> input)
As far as I know DateFormat, there are only 4 possibilities that you would want (hoping I don't miss any): "MM/yyyy", "MMM/yyyy", "MM/yy", "MMM/yy". Since we know what we are looking for, we can write an easy function.
public List<String> removeUnWantedDateFormat(List<String> input) {
    String s1 = "MM/yyyy";
    String s2 = "MMM/yyyy";
    String s3 = "MM/yy";
    String s4 = "MMM/yy";
    // removeIf avoids the ConcurrentModificationException that calling
    // input.remove() inside a for-each loop over the same list would throw
    input.removeIf(format -> !s1.equals(format) && !s2.equals(format)
            && !s3.equals(format) && !s4.equals(format));
    return input;
}
It's better not to use regex when you don't need it, since it can be comparatively expensive; for a fixed set of known formats, plain string comparison is enough. A great improvement would be to use an enum of the date formats you accept; that way you have better control over them and can even replace them.
Hope this will help, cheers
Edit: after I saw the comment, I think it would be better to use contains instead of equals, and, instead of removing entries, to replace each input with the expected format string.
So it would look more like:
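public List<String> removeUnWantedDateFormat(List<String> input) {
    // Longer formats come first, so "MMM/yyyy" is tried before "MM/yy"
    List<String> comparisons = new ArrayList<>();
    comparisons.add("MMM/yyyy");
    comparisons.add("MMM/yy");
    comparisons.add("MM/yyyy");
    comparisons.add("MM/yy");
    // Reassigning the loop variable would not change the list, so collect results
    List<String> result = new ArrayList<>();
    for (String format : input) {
        for (String comparison : comparisons) {
            if (format.contains(comparison)) {
                result.add(comparison);
                break;
            }
        }
    }
    return result;
}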
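The enum idea mentioned above could be sketched roughly like this. The type MonthYearFormat, its constants, and the findIn method are all hypothetical names chosen for illustration; the sketch also assumes, like the answer, that only '/'-separated month-then-year formats are accepted:

```java
import java.util.Optional;

// Hypothetical enum; the type and constant names are illustrative only
enum MonthYearFormat {
    // Longer patterns come first so that "MMM/yyyy" is tried before "MM/yy"
    TEXT_MONTH_FULL_YEAR("MMM/yyyy"),
    TEXT_MONTH_SHORT_YEAR("MMM/yy"),
    NUMERIC_MONTH_FULL_YEAR("MM/yyyy"),
    NUMERIC_MONTH_SHORT_YEAR("MM/yy");

    private final String pattern;

    MonthYearFormat(String pattern) {
        this.pattern = pattern;
    }

    public String pattern() {
        return pattern;
    }

    // Returns the first accepted format contained in the given format string, if any
    public static Optional<MonthYearFormat> findIn(String format) {
        for (MonthYearFormat candidate : values()) {
            if (format.contains(candidate.pattern)) {
                return Optional.of(candidate);
            }
        }
        return Optional.empty();
    }
}
```

With this, MonthYearFormat.findIn("dd/MM/yyyy HH:mm:ss") yields NUMERIC_MONTH_FULL_YEAR, while a format such as "yyyy-MMM-dd" yields an empty Optional because it uses a different separator and order.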

how to extract date from the given filename in java

I have my file names as below
C:\Users\name\Documents\repository\zzz\xxx_yyy\new\aaa_bbb_ccc_ddd_eee_ZZ_E_20160801_20160831_v1-0.csv
C:\Users\name\Documents\repository\zzz\xxx_yyy\new\aaa_bbb_ppp_ccc_ddd_eee_ZZ_E_20160801_20160831_v1-0.csv
I have to write a single java script that handles both file formats and extracts both dates from each filename.
Can you please help.
You should use Regular expressions to extract dates from filenames like these.
private static Date[] extractDatesFromFileName(File file) throws ParseException {
Date[] dates = new Date[2];
SimpleDateFormat dateFormatter = new SimpleDateFormat("yyyyMMdd");
String regex = ".*(\\d{8})_(\\d{8}).*";
Pattern pattern = Pattern.compile(regex);
Matcher m = pattern.matcher(file.getName());
if (m.find()) {
dates[0] = dateFormatter.parse(m.group(1));
dates[1] = dateFormatter.parse(m.group(2));
}
System.out.println(dates[0]);
System.out.println(dates[1]);
return dates;
}
Little explanation:
In regex .*(\\d{8})_(\\d{8}).*:
.* stands for any character repeated zero or more times
(\\d{8}) stands for exactly eight digits (if they are in brackets they are considered capturing groups, we have 2 capturing groups in this regex, one for each date)
_ stands for _ sign :)
If filename matches provided pattern both dates are extracted, parsed and returned as array. You should add some error handling etc.
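As for the error handling mentioned above, one possible sketch is shown below. Note that it swaps SimpleDateFormat and Date for java.time (LocalDate with DateTimeFormatter.BASIC_ISO_DATE, which parses yyyyMMdd), and that the class name FileNameDates and the method extract are made up for illustration. It returns an empty Optional when the name has no date pair or when the digits are not a valid date:

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeParseException;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class; the name is illustrative only
class FileNameDates {
    // Two eight-digit groups separated by an underscore
    private static final Pattern RANGE = Pattern.compile("(\\d{8})_(\\d{8})");

    // Returns {start, end} if the name contains two valid yyyyMMdd dates
    public static Optional<LocalDate[]> extract(String fileName) {
        Matcher m = RANGE.matcher(fileName);
        if (!m.find()) {
            return Optional.empty(); // no date pair in the name
        }
        try {
            LocalDate start = LocalDate.parse(m.group(1), DateTimeFormatter.BASIC_ISO_DATE);
            LocalDate end = LocalDate.parse(m.group(2), DateTimeFormatter.BASIC_ISO_DATE);
            return Optional.of(new LocalDate[] { start, end });
        } catch (DateTimeParseException e) {
            return Optional.empty(); // eight digits that are not a real date
        }
    }

    public static void main(String[] args) {
        extract("aaa_bbb_ccc_ddd_eee_ZZ_E_20160801_20160831_v1-0.csv")
                .ifPresent(d -> System.out.println(d[0] + " to " + d[1]));
    }
}
```

Returning an Optional forces the caller to deal with the "no dates" case instead of getting an array of nulls.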
If you mean a Java script (not Javascript) you can use regexp, something like the following:
String in = "C:\\Users\\name\\Documents\\repository\\zzz\\xxx_yyy\\new\\aaa_bbb_ppp_ccc_ddd_eee_ZZ_E_20160801_20160831_v1-0.csv";
Pattern p = Pattern.compile("_(\\d{8})_v1-0");
Matcher m = p.matcher(in);
if (m.find()){
System.out.println(m.group(1));
}
I think you want to extract two dates which are present in each file path.
This could be done as follows:
String filename1 = "C:\\Users\\name\\Documents\\repository\\zzz\\xxx_yyy\\new\\aaa_bbb_ccc_ddd_eee_ZZ_E_20160801_20160831_v1-0.csv";
Pattern p = Pattern.compile("[0-9]{8}+_[0-9]{8}+");
Matcher m = p.matcher(filename1);
String[] dateStrArr = m.find()?m.group(0).split("_"): null;
First date will be in 0 index and second date will be in 1 index position.
Same goes for second file name.
Hope this helps.
Also once extracted you can convert them to date object using SimpleDateFormat.

Using regex to parse a string from text that includes a newline

Given the following text, I'm trying to parse out the string "TestFile" after Address::
File: TestFile
Branch
OFFICE INFORMATION
Address: TestFile
City: L.A.
District.: 43
State: California
Zip Code: 90210
DISTRICT INFORMATION
Address: TestFile2
....
I understand that lookbehinds require zero-width so quantifiers are not allowed, meaning this won't work:
(?<=OFFICE INFORMATION\n\s*Address:).*(?=\n)
I could use this
(?<=OFFICE INFORMATION\n Address:).*
but it depends on consistent spacing, which isn't dynamic and thus not ideal.
How do I reliably parse out "TestFile" and not "TestFile2" as shown in my example above. Note that Address appears twice but I only need the first value.
Thank you
You don't really need to use a lookbehind here. Get your matched text using captured group:
(?:\bOFFICE INFORMATION\s+Address:\s*)(\S+)
captured group #1 will have value TestFile
JS Code:
var re = /(?:\bOFFICE INFORMATION\s+Address:\s*)(\S+)/;
var m;
var matches = [];
if ((m = re.exec(input)) !== null) {
if (m.index === re.lastIndex)
re.lastIndex++;
matches.push(m[1]);
}
console.log(matches);
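Since the other answers here use Java, the same capture-group pattern can be sketched in Java as well. The class name OfficeAddress and the method firstOfficeAddress are made up for illustration, and the sample text is a shortened, inlined copy of the input from the question:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class; the name is illustrative only
class OfficeAddress {
    // \s+ and \s* absorb the newline and any indentation, so no lookbehind is needed
    private static final Pattern ADDRESS =
            Pattern.compile("\\bOFFICE INFORMATION\\s+Address:\\s*(\\S+)");

    // Returns the address under OFFICE INFORMATION, or null if there is none
    public static String firstOfficeAddress(String text) {
        Matcher m = ADDRESS.matcher(text);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String text = "File: TestFile\nBranch\nOFFICE INFORMATION\n"
                + "   Address: TestFile\n   City: L.A.\n"
                + "DISTRICT INFORMATION\n   Address: TestFile2\n";
        System.out.println(firstOfficeAddress(text)); // TestFile
    }
}
```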
Working with Array:
// A sample String
String questions = "File: TestFile Branch OFFICE INFORMATION Address: TestFile City: L.A. District.: 43 State: California Zip Code: 90210 DISTRICT INFORMATION Address: TestFile2";
// An array list to store split elements
ArrayList<String> arr = new ArrayList<>();
// Split based on colon and spaces.
// Including spaces resolves problems for new lines etc
for(String x : questions.split(":|\\s"))
// Ignore blank elements, so we get a clean array
if(!x.trim().isEmpty())
arr.add(x);
This will give you an array which is:
[File, TestFile, Branch, OFFICE, INFORMATION, Address, TestFile, City, L.A., District., 43, State, California, Zip, Code, 90210, DISTRICT, INFORMATION, Address, TestFile2]
Now lets analyze... suppose you want information corresponding to Address, or element Address. This element is at position 5 in array. That means element 6 is what you want.
So you would do this:
String address = arr.get(6);
This will return TestFile.
Similarly for City, element 8 is what you want. The count starts from 0. You can of course modify my matching pattern, or even create a loop and find even better ways to do this task. This is just a hint.
Here is one such example loop:
// From index 5 on, the elements alternate between a property tag and its value,
// so skip the first 6 elements and print every second one (the values)
for(int i = 6; i<(arr.size()/2)+6; i+=2)
System.out.println(arr.get(i));
This gives following output:
TestFile
L.A.
43
California
Code
Of course this loop is unrefined; refine it a little and you will get every element correctly, even the last one. Or better yet, use ZipCode instead of Zip Code (no spaces inside a tag) and you will have a perfect loop with nothing much left to do.
The advantage over using direct regex: You wont have to specify the regex for every single element. Iteration is always more handy to get things done automatically.
See this
//read input from file
BufferedReader reader = new BufferedReader(new InputStreamReader(new FileInputStream(new File("D:/tests/sample.txt"))));
StringBuilder string = new StringBuilder();
String line = "";
while((line = reader.readLine()) != null){
string.append(line);
string.append("\n");
}
//now string will contain the input as
/*File: TestFile
Branch
OFFICE INFORMATION
Address: TestFile
City: L.A.
District.: 43
State: California
Zip Code: 90210
DISTRICT INFORMATION
Address: TestFile2
....*/
Pattern regex = Pattern.compile("(OFFICE INFORMATION.*\\r?\\n.*Address:[ \\t]*(?<officeAddress>.*)\\r?\\n)"); // [ \\t]* keeps the space after the colon out of the group
Matcher regexMatcher = regex.matcher(string.toString());
while (regexMatcher.find()) {
System.out.println(regexMatcher.group("officeAddress"));//prints TestFile
}
You can see the named group officeAddress in the pattern which is needed to be extracted.

Regex expression to get the file name

I want to extract only filename from the complete file name + time stamp . below is the input.
String filePath = "fileName1_20150108.csv";
expected output should be: "fileName1"
String filePath2 = "fileName1_filedesc1_20150108_002_20150109013841.csv"
And expected output should be: "fileName1_filedesc1"
I wrote the code below in Java to get the file name. It works for the first case (filePath) but not for filePath2.
Pattern pattern = Pattern.compile(".*.(?=_)");
String filePath = "fileName1_20150108.csv";
String filePath2 = "fileName1_filedesc1_20150108_002_20150109013841.csv";
Matcher matcher = pattern.matcher(filePath);
while (matcher.find()) {
System.out.print("Start index: " + matcher.start());
System.out.print(" End index: " + matcher.end() + " ");
System.out.println(matcher.group());
}
Can somebody please help me to correct the regex so i can parse both filepath using same regex?
Thanks
Anchor the start, and make the .* non-greedy:
^.*?(_\D.*?)?(?=[_.])
Update: change the second group (for fileDesc) to optional, and enforce that it starts with a non-digit character. This will work as long as your fileDesc strings never start with numbers.
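Here is a small sketch of how that pattern could be used from Java on both inputs from the question. The class name BaseName and the method baseName are made up for illustration, and falling back to the full name when nothing matches is an assumption on my part:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class; the name is illustrative only
class BaseName {
    // Lazy prefix, optional "_description" part that must not start with a digit,
    // then a lookahead that stops before the next underscore or dot
    private static final Pattern NAME = Pattern.compile("^.*?(_\\D.*?)?(?=[_.])");

    // Returns the name without timestamp and extension, or the input if nothing matches
    public static String baseName(String fileName) {
        Matcher m = NAME.matcher(fileName);
        return m.find() ? m.group() : fileName;
    }

    public static void main(String[] args) {
        System.out.println(baseName("fileName1_20150108.csv"));
        // fileName1
        System.out.println(baseName("fileName1_filedesc1_20150108_002_20150109013841.csv"));
        // fileName1_filedesc1
    }
}
```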
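You can get the characters before the first underscore, the first underscore, and then the characters until the next underscore:
^[^_]*_[^_]*
Note that for a name with only one underscore, such as fileName1_20150108.csv, this matches the whole name, so it only suits names that always include a description part.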
This should work: "^(.*?)_([0-9_]*)\\.([^.]*)$"
It will return you 3 groups:
the base name (assuming not a single part will be all numbers)
the timestamp info
the extension.
You can test here: http://fiddle.re/v0hne6 (RegexPlanet)

How to pull numbers from a string/file name in Java?

Hopefully somebody can help me with this.. or at least point me in the right direction.
First off, I have a bunch of files with names such as:
vendor.2012-07-25
vendor.2012-07-25 2
ven_dor.2012-05-18
ven_dor.2012-05-18 2
Basically a vendor name (Sometimes one word, sometimes two with an underscore) + (period ".") + (year) + (month) + (day). Year, month, day are separated by (-). Possibly multiple files with the same name, denoted by a 2/3/4 etc after the date.
I obtain these as strings by doing file.getName(); where 'file' is the selected file from a JFileChooser
Then I need to chart some of the data based on date. Should I try to split the initial file name string by a "." first, so that the vendor and date are separated, and then split/divide up the remaining part by "-" to have the individual values for year/month/day?
I was thinking this could be a regex thing, but I'm pretty weak in that area.. so the double splitting is what I came up with. Anybody have input or suggestions? Thanks!
Indeed, you can use a regular expression:
String s = "vendor.2012-07-25 2";
Pattern p = Pattern.compile("([^.]+)\\.(\\d{4})-(\\d{2})-(\\d{2}) ?(\\d?)");
Matcher m = p.matcher(s);
if (m.find()) {
String vendorName = m.group(1);
String year = m.group(2);
String month = m.group(3);
String day = m.group(4);
String multipleFiles = m.groupCount() > 4 ? m.group(5) : "";
System.out.printf("%s %s %s %s %s", vendorName, year, month, day, multipleFiles);
}
Each expression wrapped with parentheses () is called a capturing group, and it basically tells the regex engine to save its content, so that it can be retrieved later on.
In sum, here's what each capturing group does:
([^.]+) - Everything but a dot (.), so we are basically capturing the vendor name part;
(\\d{4}) - \d matches a digit. \d{4} matches 4 digits (year);
(\\d{2}) - Month;
(\\d{2}) - Day;
(\\d?) - Matches an optional (?) last digit.
If you want to parse the date part as a java.util.Date instance, you can use a single capturing group for it, and then use SimpleDateFormat:
Pattern p = Pattern.compile("([^.]+)\\.(\\d{4}-\\d{2}-\\d{2}) ?(\\d?)");
Matcher m = p.matcher(s);
if (m.find()) {
    String vendorName = m.group(1);
    String dateString = m.group(2);
    SimpleDateFormat df = new SimpleDateFormat("yyyy-MM-dd");
    // parse() throws ParseException, so the enclosing method must declare or catch it
    Date date = df.parse(dateString);
    String multipleFiles = m.group(3); // empty string when there is no trailing digit
}
String.split on the . (it will probably require escaping). Take the dotSplitString[1] as being the part after vendor. or ven_dor.
Split that part on space (spaceSplitString).
Parse the first part using DateFormat.parse(String) to get a Date
If the 2nd part (of the spaceSplitString) is present, use Integer.parseInt(spaceSplitString[1])
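The steps above can be sketched as follows. This is only an illustration: the class name VendorFile and its fields are made up, LocalDate.parse (which reads ISO yyyy-MM-dd directly) is used in place of DateFormat.parse, and treating a missing trailing number as copy 1 is an assumption:

```java
import java.time.LocalDate;

// Hypothetical value class; the name and fields are illustrative only
class VendorFile {
    final String vendor;
    final LocalDate date;
    final int copy;

    private VendorFile(String vendor, LocalDate date, int copy) {
        this.vendor = vendor;
        this.date = date;
        this.copy = copy;
    }

    // Assumes the name always has the form vendor.yyyy-MM-dd, optionally followed by " n"
    public static VendorFile parse(String fileName) {
        // "." is a regex metacharacter, so it has to be escaped for split
        String[] dotSplit = fileName.split("\\.", 2);
        String[] spaceSplit = dotSplit[1].split(" ");
        LocalDate date = LocalDate.parse(spaceSplit[0]);
        // Assumption: a name without a trailing number is the first copy
        int copy = spaceSplit.length > 1 ? Integer.parseInt(spaceSplit[1]) : 1;
        return new VendorFile(dotSplit[0], date, copy);
    }

    public static void main(String[] args) {
        VendorFile f = parse("ven_dor.2012-05-18 2");
        System.out.println(f.vendor + " " + f.date + " " + f.copy); // ven_dor 2012-05-18 2
    }
}
```

A name that does not follow the expected shape will throw here (for example an ArrayIndexOutOfBoundsException when there is no dot), so real code would want to validate the input first.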
Look at the StringTokenizer class in the Java API.
What you can do is:
StringTokenizer tokenizer = new StringTokenizer(file.getName(), ".");
tokenizer.nextElement();
You get the picture. Or you can use Scanner to parse it as well.
I tend to make use of StringTokenizers in my code a lot. To tokenize the above example you could use something akin to the following:
StringTokenizer tok = new StringTokenizer(filename, ".- "); //tokenizes on '.', '-' and ' ', so a trailing copy number becomes its own token
String name = tok.nextToken();
int year = Integer.parseInt(tok.nextToken());
int month = Integer.parseInt(tok.nextToken());
int day = Integer.parseInt(tok.nextToken());
int cnt = 1; //default one copy of the file
if(tok.hasMoreTokens()){
cnt = Integer.parseInt(tok.nextToken());
}
...and so on.
However, I endorse the regex solution above, if only because it looks less comprehensible to a layman. I'm just including this here for completeness.
