Validating a Text File content using regex - java

The input text file has the following content:
TIMINCY........ many arbitrary characters incl. white spaces and tabs
DETAIL........ many arbitrary characters incl. white spaces and tabs
DETAIL........ many arbitrary characters incl. white spaces and tabs
DETAIL........ many arbitrary characters incl. white spaces and tabs
.
.
. (any number of lines containing DETAILS)
TIMINCY........ many arbitrary characters incl. white spaces and tabs
DETAIL........ many arbitrary characters incl. white spaces and tabs
DETAIL........ many arbitrary characters incl. white spaces and tabs
DETAIL........ many arbitrary characters incl. white spaces and tabs
.
.
.(so on)
I need to validate the file with a regex so that, if its content does NOT follow
the pattern shown above, I can throw a CustomException.
Any help is appreciated. This is what I have so far:
String patternString = "TMINCY"+"[.]\\{*\\}"+";"+"["+"DETAILS"+"[.]\\{*\\}"+";"+"]"+"\\{*\\}"+"]"+"\\{*\\};";
Pattern pattern = Pattern.compile(patternString );
String messageString = null;
StringBuilder builder = new StringBuilder();
try (BufferedReader reader = Files.newBufferedReader(curracFile.toPath(), charset)) {
String line;
while ((line = reader.readLine()) != null) {
builder.append(line);
builder.append(NEWLINE_CHAR_SEQUENCE);
}
messageString = builder.toString();
} catch (IOException ex) {
LOGGER.error(FILE_CREATION_ERROR, ex.getCause());
throw new BusinessConversionException(FILE_CREATION_ERROR, ex);
}
System.out.println("messageString is::"+messageString);
return pattern.matcher(messageString).matches();
But it returns false even for a correct file. Please help me with the regex.

What about something like "^(TIMINCY|DETAIL)[\.]+[a-zA-Z\s.]+"
"^" - matches the start of the line
"(TIMINCY|DETAIL)" - matches TIMINCY or DETAIL
"[\.]+" - matches one or more dot characters
"[a-zA-Z\s.]+" - put the allowed characters here; they must occur one or more times
In a Java string literal each backslash has to be doubled, e.g. "\\s".
Reference: Oracle Documentation

You could validate line by line while you are iterating over the lines:
Pattern p = Pattern.compile("^(?:TIMINCY|DETAIL)[.]{8}.*");
// Explanation:
// ^ : matches the beginning of the string.
// (?:) : non-capturing group.
// | : regex OR operator.
// [.]{8} : matches a dot (".") eight times in a row.
// .* : matches everything up to the end of the string.
String line = reader.readLine();
Matcher m;
while (line != null) {
    m = p.matcher(line);
    if (!m.matches()) // matches() takes no arguments
        throw new CustomException("Not valid");
    builder.append(line);
    builder.append(NEWLINE_CHAR_SEQUENCE);
    line = reader.readLine();
}
Also: Matcher.matches() returns true only if the ENTIRE string matches your regular expression. I would recommend Matcher.find() when you are searching for patterns that must not appear anywhere in the text.
Matcher (Java 7)
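The per-line check above confirms that every line starts with a known keyword, but not that the lines appear in the expected order. Below is a minimal sketch of a structural check. It assumes every block must start with a TIMINCY line followed by at least one DETAIL line; the "at least one" requirement is an assumption, since the question does not state it. The class and method names are hypothetical, and the caller would throw its own CustomException when the method returns false.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

final class TimincyFileValidator {
    // A header or detail line: the keyword followed by anything up to the end of the line.
    private static final Pattern HEADER = Pattern.compile("TIMINCY.*");
    private static final Pattern DETAIL = Pattern.compile("DETAIL.*");

    /** Returns true if the lines form one or more blocks of a TIMINCY line followed by DETAIL lines. */
    static boolean isValid(List<String> lines) {
        boolean inBlock = false;        // has a TIMINCY header been seen yet?
        boolean blockHasDetail = false; // does the current block contain a DETAIL line?
        for (String line : lines) {
            if (HEADER.matcher(line).matches()) {
                if (inBlock && !blockHasDetail) {
                    return false;       // previous header had no DETAIL lines (assumed invalid)
                }
                inBlock = true;
                blockHasDetail = false;
            } else if (DETAIL.matcher(line).matches()) {
                if (!inBlock) {
                    return false;       // DETAIL line before any TIMINCY header
                }
                blockHasDetail = true;
            } else {
                return false;           // unknown line type
            }
        }
        return inBlock && blockHasDetail;
    }

    public static void main(String[] args) {
        List<String> good = Arrays.asList("TIMINCY........ a b", "DETAIL........ c", "DETAIL........ d");
        List<String> bad = Arrays.asList("DETAIL........ c", "TIMINCY........ a b");
        System.out.println(isValid(good)); // true
        System.out.println(isValid(bad));  // false
    }
}
```

The list of lines could come from Files.readAllLines(path, charset), which avoids building one large string just to run a regex over it.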

Related

Java Regular Expression Multiline

I'm trying to get the result of a match with two lines and more, this is my text in a file (for JOURNAL ENTRIES for Wincor ATM):
DEMANDE SOLDE
N° CARTE : 1500000001180006
OPERATION NO. : 585068
========================================
RETRAIT
N° CARTE 1600001002200006
OPERATION NO. : 585302
MONTANT : MAD 200.00
========================================
... etc.
There are more lines like these, repeated for each operation: retrait (ATM withdrawal), demande de solde (balance inquiry). I want to get a result like: RETRAIT\nN° CARTE 1600001002200006
My java code:
String filename="20140604.jrn";
File file=new File(filename);
String regexe = ".*RETRAIT^\r\n.*CARTE.*\\d{16}"; // Work with .*CARTE.*\\d{16}: result: N° CARTE : 1500000001180006 N° CARTE 1600001002200006
Pattern pattern = Pattern.compile(regexe,Pattern.MULTILINE);
try {
BufferedReader in = new BufferedReader(new FileReader(file));
while (in.ready()) {
String s = in.readLine();
Matcher matcher = pattern.matcher(s);
while (matcher.find()) { // find the next match
System.out.println("found the pattern \"" + matcher.group());
}
}
in.close();
}
catch(IOException e) {
System.out.println("File 20140604.jrn not found");
}
Any solution, please?

I am unable to test this right now, but it looks like you have the boundary special character '^' in the wrong spot. It is trying to match RETRAIT followed by the beginning of a line followed by newline characters, when the beginning of the line won't start until after the newline characters.
UPDATE:
With an online java regex tool, I've been able to test this:
^RETRAIT\s*\w+.*CARTE\s+\d{16}
which matches what you want in multiline mode. The \s special character consumes whitespace (including carriage return and new line), which is more resilient than checking explicitly for \n or \r.
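Note that a pattern spanning two lines can never match while the input is fed to the matcher one readLine() result at a time, as in the question's loop. A minimal sketch, assuming the whole journal has already been read into a single String; the class and method names are hypothetical:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class JournalMatcher {
    // MULTILINE makes ^ match at the start of every line, not only at the start of the input.
    private static final Pattern RETRAIT =
            Pattern.compile("^RETRAIT\\s*\\w+.*CARTE\\s+\\d{16}", Pattern.MULTILINE);

    /** Returns every "RETRAIT" line together with the card-number line that follows it. */
    static List<String> findWithdrawals(String journalText) {
        List<String> found = new ArrayList<>();
        Matcher m = RETRAIT.matcher(journalText);
        while (m.find()) {
            found.add(m.group());
        }
        return found;
    }

    public static void main(String[] args) {
        String journal = "DEMANDE SOLDE\n"
                + "N\u00B0 CARTE : 1500000001180006\n"
                + "OPERATION NO. : 585068\n"
                + "========================================\n"
                + "RETRAIT\n"
                + "N\u00B0 CARTE 1600001002200006\n"
                + "OPERATION NO. : 585302\n";
        for (String match : findWithdrawals(journal)) {
            System.out.println("found the pattern: " + match);
        }
    }
}
```

For a real file, the text could be loaded with new String(Files.readAllBytes(path), charset) before calling findWithdrawals.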

Tokenize Arabic text files java

I am trying to tokenize some text files into words, and I wrote this code. It works perfectly for English, but when I try it on Arabic text it does not work.
I added UTF-8 to read the Arabic files. Did I miss something?
public void parseFiles(String filePath) throws FileNotFoundException, IOException {
File[] allfiles = new File(filePath).listFiles();
BufferedReader in = null;
for (File f : allfiles) {
if (f.getName().endsWith(".txt")) {
fileNameList.add(f.getName());
Reader fstream = new InputStreamReader(new FileInputStream(f),"UTF-8");
// BufferedReader br = new BufferedReader(fstream);
in = new BufferedReader(fstream);
StringBuilder sb = new StringBuilder();
String s=null;
String word = null;
while ((s = in.readLine()) != null) {
Scanner input = new Scanner(s);
while(input.hasNext()) {
word = input.next();
if(stopword.isStopword(word)==true)
{
word= word.replace(word, "");
}
//String stemmed=stem.stem (word);
sb.append(word+"\t");
}
//System.out.print(sb); ///here the arabic text is outputed without stopwords
}
String[] tokenizedTerms = sb.toString().replaceAll("[\\W&&[^\\s]]", "").split("\\W+"); //to get individual terms
for (String term : tokenizedTerms) {
if (!allTerms.contains(term)) { //avoid duplicate entry
allTerms.add(term);
System.out.print(term+"\t"); //here the problem.
}
}
termsDocsArray.add(tokenizedTerms);
}
}
}
Please any ideas to help me proceed.
Thanks

The problem lies with your regex, which works well for English but not for Arabic, because by definition
[\\W&&[^\\s]]
means
// matches any single non-word character except whitespace.
\W A non-word character, i.e. anything other than [a-zA-Z_0-9]. (All Arabic characters satisfy this condition.)
\s A whitespace character, short for [ \t\n\x0b\r\f]
So, by this logic, every Arabic character is matched by this regex. When you call
sb.toString().replaceAll("[\\W&&[^\\s]]", "")
it replaces every non-word character that is not whitespace with "", which for Arabic means every character. All the Arabic characters are removed, so nothing is output. You will have to adjust the regex to work for Arabic text, or simply split the string on whitespace:
sb.toString().split("\\s+")
which gives you an array of the Arabic words separated by whitespace.
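If punctuation still has to be stripped, one option is to replace \W with Unicode property classes, which are not limited to ASCII. The sketch below is one possible approach, not the original poster's code: it keeps letters (\p{L}), combining marks such as Arabic diacritics (\p{M}), decimal digits (\p{Nd}) and whitespace, and removes everything else. The class and method names are hypothetical.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

final class UnicodeTokenizer {
    /** Strips punctuation and symbols in any script, then splits the text on whitespace. */
    static List<String> tokenize(String text) {
        // Keep letters, combining marks, decimal digits and whitespace; drop the rest.
        String cleaned = text.replaceAll("[^\\p{L}\\p{M}\\p{Nd}\\s]", "").trim();
        if (cleaned.isEmpty()) {
            return Collections.emptyList();
        }
        return Arrays.asList(cleaned.split("\\s+"));
    }

    public static void main(String[] args) {
        // An Arabic greeting containing an Arabic comma (U+060C) and an exclamation mark.
        String arabic = "\u0645\u0631\u062D\u0628\u0627\u060C \u0628\u0627\u0644\u0639\u0627\u0644\u0645!";
        System.out.println(tokenize(arabic).size()); // 2
        System.out.println(tokenize("Hello, world!")); // [Hello, world]
    }
}
```

Compiling the original pattern with the Pattern.UNICODE_CHARACTER_CLASS flag is another route, since it makes \W follow Unicode definitions instead of ASCII ones.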

In addition to worrying about character encoding as in bgth's response, tokenizing Arabic has the added complication that words are not necessarily separated by white space:
http://www1.cs.columbia.edu/~rambow/papers/habash-rambow-2005a.pdf
If you're not familiar with Arabic, you'll need to read up on some of the tokenization methods:
http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.120.9748

Replace a particular String from a text file

I'm trying to replace every occurrence of a certain String in a given text file. Here's the code I've written:
BufferedReader tempFileReader = new BufferedReader(new InputStreamReader(new FileInputStream(tempFile)));
File tempFileBuiltForUse = new File("C:\\testing\\anotherTempFile.txt");
Writer changer = new BufferedWriter(new FileWriter(tempFileBuiltForUse));
String lineContents ;
while( (lineContents = tempFileReader.readLine()) != null)
{
Pattern pattern = Pattern.compile("/.");
Matcher matcher = pattern.matcher(lineContents);
String lineByLine = null;
while(matcher.find())
{
lineByLine = lineContents.replaceAll(matcher.group(),System.getProperty("line.separator"));
changer.write(lineByLine);
}
}
changer.close();
tempFileReader.close();
Suppose the contents of my tempFile are:
This/DT is/VBZ a/DT sample/NN text/NN ./.
I want the anotherTempFile to contain :
This/DT is/VBZ a/DT sample/NN text/NN .
with a new line.
But I'm not getting the desired output. And I'm not able to see where I'm going wrong. :-(
Kindly help. :-)

A dot means "any character" in regular expressions. Try escaping it:
Pattern pattern = Pattern.compile("\\./\\.");
(You need two backslashes to escape the backslash itself inside the String, so that Java knows you want a literal backslash and not an escape sequence such as the newline character \n.)

In a regex, the dot (.) matches any character (except newlines), so it needs to be escaped if you want it to match a literal dot. Also, you appear to be missing the first dot in your regex since you want the pattern to match ./.:
Pattern pattern = Pattern.compile("\\./\\.");

Your regular expression has a problem. Also, you don't need Pattern and Matcher here. Simply use the replaceAll() method of the String class for the replacement; it is easier. Try the code below:
BufferedReader tempFileReader = new BufferedReader(
        new InputStreamReader(new FileInputStream("c:\\test.txt")));
File tempFileBuiltForUse = new File("C:\\anotherTempFile.txt");
Writer changer = new BufferedWriter(new FileWriter(tempFileBuiltForUse));
String lineContents;
while ((lineContents = tempFileReader.readLine()) != null) {
    // replaceAll() returns the line unchanged when the pattern does not occur
    String lineByLine = lineContents.replaceAll("\\./\\.", System.getProperty("line.separator"));
    changer.write(lineByLine);
}
changer.close();
tempFileReader.close();

As a regular expression, /. means "a slash followed by any character".
Change it to "/\\." so that the dot is matched literally.
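The two suggested patterns give different results on the sample line. Replacing "\\./\\." removes the sentence-final period together with its tag, whereas replacing only "/\\." keeps the period and drops the tag, which is what the desired output in the question shows. A minimal sketch of the second variant; the class and method names are hypothetical:

```java
import java.util.regex.Matcher;

final class TagStripper {
    /** Replaces the "/." tag with a line separator, keeping the period in front of it. */
    static String stripDotTag(String line) {
        // quoteReplacement guards against '$' or '\' being interpreted in the replacement text.
        return line.replaceAll("/\\.", Matcher.quoteReplacement(System.lineSeparator()));
    }

    public static void main(String[] args) {
        String input = "This/DT is/VBZ a/DT sample/NN text/NN ./.";
        // Prints the line ending in "text/NN ." followed by a line break.
        System.out.print(stripDotTag(input));
    }
}
```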

strip data from a text file

I'm going to start by posting what the data in the text file looks like. This is just 4 lines of it; the actual file is a couple hundred lines long.
Friday, September 9 2011
-STV 101--------05:00 - 23:59 SSB 4185 Report Printed on 9/08/2011 at 2:37
0-AH 104--------07:00 - 23:00 AH GYM Report Printed on 9/08/2011 at 2:37
-BG 105--------07:00 - 23:00 SH GREAT HALL Report Printed on 9/08/2011 at 2:37
What I want to do with this text file is ignore the first line with the date on it, then ignore the '-' on the next line but read in "STV 101", "5:00" and "23:59" and save them to variables, then ignore all other characters on that line, and so on for each line after that.
Here is how I am currently reading the lines in their entirety. I call this function once the user has put the path in the scheduleTxt JTextField. It reads and prints each line fine.
public void readFile () throws IOException
{
try
{
FileInputStream fstream = new FileInputStream(scheduleTxt.getText());
DataInputStream in = new DataInputStream(fstream);
BufferedReader br = new BufferedReader(new InputStreamReader(in));
String strLine;
while ((strLine = br.readLine()) != null)
{
System.out.println (strLine);
}
in.close();
}
catch (Exception e){//Catch exception if any
System.err.println("Error: " + e.getMessage());
}
}
UPDATE: it turns out I also need to strip Friday out of the top line and put it in a variable as well
Thanks! Beef.

Did not test it thoroughly, but this regular expression would capture the info you need in groups 2, 5 and 7: (Assuming you're only interested in "AH 104" in the example of "0-AH 104----")
^(\S)*-(([^-])*)(-)+((\S)+)\s-\s((\S)+)\s(.)*
String regex = "^(\\S)*-(([^-])*)(-)+((\\S)+)\\s-\\s((\\S)+)\\s(.)*";
Pattern pattern = Pattern.compile(regex);
while ((strLine = br.readLine()) != null){
Matcher matcher = pattern.matcher(strLine);
boolean matchFound = matcher.find();
if (matchFound){
String s1 = matcher.group(2);
String s2 = matcher.group(5);
String s3 = matcher.group(7);
System.out.println (s1 + " " + s2 + " " + s3);
}
}
The expression could be tuned with non-capturing groups in order to capture only the information you want.
Explanation of the regexp's elements:
^(\S)*- Matches a run of non-whitespace characters ending with -. Note: because \S excludes whitespace, this fails if there is whitespace before the first -; a lazy ^(.)*?- would tolerate that.
(([^-])*) Matches a run of any characters except -. This is captured in group 2.
(-)+ Matches one or more -.
((\S)+) Matches one or more non-whitespace characters. This is captured in group 5.
\s-\s Matches whitespace followed by - followed by whitespace.
((\S)+) Same as the ((\S)+) element above. This is captured in group 7.
\s(.)* Matches white-space followed by anything, which will be skipped.
More info on regular expression can be found on this tutorial.
There are also several useful cheatsheets around. When designing/debugging an expression, a regexp testing tool can prove quite useful, too.
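As a sketch of the tuning mentioned above, the version below drops the inner groups so that only the three wanted values are captured (groups 1 to 3), and adds a second pattern for the weekday on the date line that the UPDATE in the question asks about. The class and method names are hypothetical, and the patterns assume the line layout shown in the question.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class ScheduleLineParser {
    // Groups: 1 = room (e.g. "STV 101"), 2 = start time, 3 = end time.
    private static final Pattern ENTRY =
            Pattern.compile("^\\S*-([^-]+)-+(\\S+)\\s-\\s(\\S+)\\s.*");
    // The weekday is the first word of the date line, up to the comma.
    private static final Pattern DATE_LINE = Pattern.compile("^(\\w+),.*");

    /** Returns {room, start, end}, or null if the line is not a schedule entry. */
    static String[] parseEntry(String line) {
        Matcher m = ENTRY.matcher(line);
        return m.matches() ? new String[] {m.group(1), m.group(2), m.group(3)} : null;
    }

    /** Returns the weekday from a line such as "Friday, September 9 2011", or null. */
    static String parseWeekday(String line) {
        Matcher m = DATE_LINE.matcher(line);
        return m.matches() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        System.out.println(parseWeekday("Friday, September 9 2011")); // Friday
        String[] entry = parseEntry(
                "-STV 101--------05:00 - 23:59 SSB 4185 Report Printed on 9/08/2011 at 2:37");
        System.out.println(entry[0] + " " + entry[1] + " " + entry[2]); // STV 101 05:00 23:59
    }
}
```

Inside the existing readLine() loop, a line for which parseEntry returns null can be passed to parseWeekday to pick up the date line.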

Java regex matching

I have a bunch of lines in a text file and I want to match ${ALPHANUMERIC characters} and replace it with ${SAME ALPHANUMERIC characters plus _SOMETEXT(CONSTANT)}.
I've tried the expression ${(.+)} but it didn't work, and I also don't know how to do a regex replace in Java.
Thank you for your feedback.
Here is some of my code :
BufferedReader br = new BufferedReader(new FileReader(file));
String line;
StringBuilder sb = new StringBuilder();
while ((line = br.readLine()) != null) {
Pattern p = Pattern.compile("\\$\\{.+\\}");
Matcher m = p.matcher(line); // get a matcher object
if(m.find()) {
System.out.println("MATCH: "+m.group());
//TODO
//REPLACE STRING
//THEN APPEND String Builder
}
}
OK, the code above works, but it only finds my variable and not the whole line. For example, here is my input:
some text before ${VARIABLE_NAME} some text after
some text before ${VARIABLE_NAME2} some text after
some text before some text without variable some text after
... etc
So I just want to replace ${VARIABLE_NAME} or ${VARIABLE_NAME2} with ${VARIABLE_NAME_SOMETHING} or ${VARIABLE_NAME2_SOMETHING}, but leave the preceding and following text on the line as it is.
EDIT:
I thought of a way like this:
if(line.contains("\\${([a-zA-Z0-9 ]+)}")){
System.out.println(line);
}
if(line.contains("\\$\\{.+\\}")){
System.out.println(line);
}
My idea was to capture the line containing this and then replace it, but the regex does not work here; it works with the Pattern/Matcher combination, though.
EDIT II
I feel like I'm getting closer to the solution here, here is what I've come up with so far :
if(line.contains("$")){
System.out.println(line.replaceAll("\\$\\{.+\\}", "$1" +"_SUFFIX"));
}
What I meant by $1 is: replace the string you just matched with itself + _SUFFIX.

I would use the String.replaceAll() method like so:
String old = "some string data";
String updated = old.replaceAll("\\$\\{([a-zA-Z0-9]+)\\}", "\\${$1_CONSTANT}");

The $ is a special regular expression character that represents the end of a line. You'll need to escape it in order to match it. You'll also need to escape the backslash that you use for escaping the dollar sign because of the way Java handles strings. The opening brace has to be escaped as well, because in a Java regex an unescaped { starts a quantifier.
Once you have your text in a string, you should be able to do the following:
str.replaceAll("\\$\\{([a-zA-Z0-9 ]+)\\}", "\\${$1 _SOMETEXT(CONSTANT)}")
If you have other characters in your variable names (i.e. underscores, symbols, etc...) then just add them to the character class that you are matching for.
Edit: If you want to use a Pattern and Matcher then there are still a few changes. First, you probably want to compile your Pattern outside of the loop. Second, the pattern needs a capturing group so that $1 refers to something. You can then use this, although it is more verbose.
Pattern p = Pattern.compile("\\$\\{([a-zA-Z0-9 ]+)\\}"); // compile once, outside the loop
Matcher m = p.matcher(line);
sb.append(m.replaceAll("\\${$1 _SOMETEXT(CONSTANT)}"));

THE SOLUTION :
while ((line = br.readLine()) != null) {
if(line.contains("$")){
sb.append(line.replaceAll("\\$\\{(.+)\\}", "\\${$1" +"_SUFFIX}") + "\n");
}else{
sb.append(line + "\n");
}
}

line = line.replaceAll("\\$\\{\\w+", "$0_SOMETHING");
There's no need to check for the presence of $ or whatever; that's part of what replaceAll() does. Anyway, contains() is not regex-powered like find(); it just does a plain literal text search.
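For reference, here is a self-contained sketch that combines the points made in the answers: the $ and { are escaped, the variable name is captured in group 1, and the replacement re-emits it followed by the suffix. It assumes variable names consist only of word characters (\w). The class and method names are hypothetical.

```java
import java.util.regex.Matcher;

final class VariableSuffixer {
    /** Appends the suffix to the name inside every ${NAME} placeholder in the line. */
    static String addSuffix(String line, String suffix) {
        // "\\${" in the replacement is a literal "${"; "$1" is the captured variable name.
        return line.replaceAll(
                "\\$\\{(\\w+)\\}",
                "\\${$1" + Matcher.quoteReplacement(suffix) + "}");
    }

    public static void main(String[] args) {
        System.out.println(addSuffix("some text before ${VARIABLE_NAME} some text after", "_SOMETHING"));
        // Two placeholders on one line are handled separately because \w+ cannot span a '}'.
        System.out.println(addSuffix("${A} and ${B}", "_X"));
        // Lines without a placeholder come back unchanged.
        System.out.println(addSuffix("some text without variable", "_SOMETHING"));
    }
}
```

Restricting the name to \w+ instead of .+ matters when a line contains more than one placeholder: a greedy .+ would run from the first ${ to the last } on the line.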
