Multiline regexp matcher - java

There is an input file with this content:
XX00002200000
XX00003300000
regexp:
(.{6}22.{5}\W)(.{6}33.{5})
I tried it in The Regex Coach (an app for testing regexps), and the strings match fine.
Java:
pattern = Pattern.compile(patternString);
inputStream = resource.getInputStream();
scanner = new Scanner(inputStream, charsetName);
scanner.useDelimiter("\r\n");
patternString is the regexp mentioned above, injected as a bean property from an .xml file.
The match fails in Java.

Simple solution: ".{6}22.{5}\\s+.{6}33.{5}". Note that \s+ matches one or more consecutive whitespace characters, including line breaks.
Here's an example:
public static void main(String[] argv) throws FileNotFoundException {
String input = "yXX00002200000\r\nXX00003300000\nshort", regex = ".{6}22.{5}\\s+.{6}33.{5}", result = "";
Pattern pattern = Pattern.compile(regex);
Matcher m = pattern.matcher(input);
while (m.find()) {
result = m.group();
System.out.println(result);
}
}
With output:
XX00002200000
XX00003300000
To play around with Java Regex you can use: Regular Expression Editor (free online editor)
Edit: I think that you are changing the input when you read the data. Try:
public static String readFile(String filename) throws FileNotFoundException {
    Scanner sc = new Scanner(new File(filename));
    StringBuilder sb = new StringBuilder();
    while (sc.hasNextLine())
        // nextLine() strips the line terminator, so put one back for \s+ to match
        sb.append(sc.nextLine()).append('\n');
    sc.close();
    return sb.toString();
}
Or
static String readFile(String path) throws IOException {
    // try-with-resources closes the stream and the channel even if mapping fails
    try (FileInputStream stream = new FileInputStream(new File(path));
         FileChannel channel = stream.getChannel()) {
        MappedByteBuffer buffer = channel.map(FileChannel.MapMode.READ_ONLY, 0,
                channel.size());
        // decode the mapped bytes with the platform default charset
        return Charset.defaultCharset().decode(buffer).toString();
    }
}
With imports like:
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.Charset;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

Try changing the delimiter:
scanner.useDelimiter("\\s+");
Also, why not use a more general regex like this:
".{6}[0-9]{2}.{5}"
The regex you mentioned above spans two lines. Since you set the delimiter to a line break, each token is a single line, so you should use a regex that fits a single line.
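To make the delimiter point concrete, here is a minimal sketch (the class and method names are made up for illustration). It shows that with "\r\n" as the delimiter each token holds one line only, so the two-line pattern cannot match a token, while Scanner.findWithinHorizon searches the input regardless of delimiters. It assumes the \\s+ variant of the pattern suggested in the other answer.

```java
import java.util.Scanner;
import java.util.regex.Pattern;

class ScannerMultilineDemo {
    // The two-line pattern, with \s+ bridging the line break.
    static final Pattern TWO_LINES = Pattern.compile(".{6}22.{5}\\s+.{6}33.{5}");

    // Returns the first token the Scanner produces when "\r\n" is the delimiter.
    static String firstToken(String input) {
        try (Scanner scanner = new Scanner(input)) {
            scanner.useDelimiter("\r\n");
            return scanner.next();
        }
    }

    // Searches the whole input, ignoring delimiters; a horizon of 0 means "no limit".
    static String findAcrossLines(String input) {
        try (Scanner scanner = new Scanner(input)) {
            return scanner.findWithinHorizon(TWO_LINES, 0);
        }
    }

    public static void main(String[] args) {
        String input = "XX00002200000\r\nXX00003300000\r\n";
        String token = firstToken(input);
        System.out.println("token: " + token);
        System.out.println("pattern matches token: " + TWO_LINES.matcher(token).find());
        System.out.println("found across lines: " + (findAcrossLines(input) != null));
    }
}
```

With a real file you would pass the InputStream and charset name to the Scanner constructor, as in the question, instead of a String.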

Pardon my ignorance, but I am still not sure what exactly you are trying to search for. If you are trying to find the string (with the line break)
XX00002200000
XX00003300000
then why are you reading it with a line-break delimiter?
To read the above string as it is, the following code works:
Pattern p = Pattern.compile(".{6}22.{5}\\W+.{6}33.{5}");
try (FileInputStream in = new FileInputStream("C:\\new.txt")) {
    byte[] f = new byte[100];
    int n = in.read(f); // number of bytes actually read, or -1 at end of file
    String s = new String(f, 0, Math.max(n, 0)); // decode only the bytes that were read
    Matcher m = p.matcher(s);
    if (m.find())
        System.out.println(m.group());
} catch (IOException e) {
    e.printStackTrace();
}
NB: here the file new.txt contains the string
XX00002200000
XX00003300000

Related

Reading words from a file with a stream

I am trying to read the words of a file into a stream and then count the number of times the word "the" appears in the file. I cannot figure out an efficient way of doing this with only streams.
Example: If the file contained a sentence such as "The boy jumped over the river.", the output would be 2.
This is what I've tried so far:
public static void main(String[] args){
String filename = "input1";
try (Stream<String> words = Files.lines(Paths.get(filename))){
long count = words.filter( w -> w.equalsIgnoreCase("the"))
.count();
System.out.println(count);
} catch (IOException e){
}
}
As the name suggests, Files.lines returns a stream of lines, not words. If you want to iterate over words, you can use a Scanner like this:
Scanner sc = new Scanner(new File(fileLocation));
while(sc.hasNext()){
String word = sc.next();
//handle word
}
If you really want to use streams, you can split each line and then flat-map your stream to those words:
try (Stream<String> lines = Files.lines(Paths.get(filename))){
long count = lines
.flatMap(line->Arrays.stream(line.split("\\s+"))) //add this
.filter( w -> w.equalsIgnoreCase("the"))
.count();
System.out.println(count);
} catch (IOException e){
e.printStackTrace();//at least print the exception so you know what went wrong
}
BTW, you shouldn't leave catch blocks empty; at least print the exception that was thrown so you have more information about the problem.
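One caveat with splitting on "\\s+" is that punctuation stays attached, so a token like "the," would not be counted. A possible variant, sketched under the assumption that a "word" is a run of word characters, splits on "\\W+" instead. The class and method names are only illustrative.

```java
import java.util.regex.Pattern;
import java.util.stream.Stream;

class WordCountDemo {
    // Splits on runs of non-word characters, so "river." yields "river".
    private static final Pattern NON_WORD = Pattern.compile("\\W+");

    // Counts case-insensitive occurrences of word in a stream of lines.
    static long countWord(Stream<String> lines, String word) {
        return lines
                .flatMap(NON_WORD::splitAsStream)
                .filter(w -> w.equalsIgnoreCase(word))
                .count();
    }

    public static void main(String[] args) {
        Stream<String> lines = Stream.of("The boy jumped over the river.");
        System.out.println(countWord(lines, "the"));
    }
}
```

For a file, pass Files.lines(Paths.get(filename)) opened in a try-with-resources block instead of Stream.of(...).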
You could use Java's StreamTokenizer for this purpose.
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.StreamTokenizer;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
public class Main {
public static void main(String[] args) throws IOException {
long theWordCount = 0;
String input = "The boy jumped over the river.";
try (InputStream stream = new ByteArrayInputStream(
input.getBytes(StandardCharsets.UTF_8))) {
StreamTokenizer tokenizer =
new StreamTokenizer(new InputStreamReader(stream, StandardCharsets.UTF_8));
int tokenType = 0;
while ( (tokenType = tokenizer.nextToken())
!= StreamTokenizer.TT_EOF) {
if (tokenType == StreamTokenizer.TT_WORD) {
String word = tokenizer.sval;
if ("the".equalsIgnoreCase(word)) {
theWordCount++;
}
}
}
}
System.out.println("The word 'the' count is: " + theWordCount);
}
}
The StreamTokenizer reads tokens from the reader, and the loop counts every word token that equals "the", ignoring case.

Java - Groovy : regex parse text block

I know that this is a common question, and I've been through a lot of forums trying to figure out what the problem in my code is.
I have to read a text file with several blocks in the following format:
import com.myCompanyExample.gui.Layout
/*some comments here*/
#Layout
LayoutModel currentState() {
MyBuilder builder = new MyBuilder()
form example
title form{
row_1
row_1
row_n
}
return build.get()
}
#Layout
LayoutModel otherState() {
....
....
return build.get()
}
I have this code to read the whole file, and I'd like to extract each block between the keyword "#Layout" and the keyword "return". I also need to keep all the newlines so that I can later split each matched block into a list of lines:
private void myReadFile(File fileLayout){
String line = null;
StringBuilder allText = new StringBuilder();
try{
FileReader fileReader = new FileReader(fileLayout);
BufferedReader bufferedReader = new BufferedReader(fileReader);
while((line = bufferedReader.readLine()) != null) {
allText.append(line)
}
bufferedReader.close();
}
catch(FileNotFoundException ex) {
System.out.println("Unable to open file");
}
catch(IOException ex) {
System.out.println("Error reading file");
}
Pattern pattern = Pattern.compile("(?s)#Layout.*?return",Pattern.DOTALL);
Matcher matcher = pattern.matcher(allText);
while(matcher.find()){
String [] layoutBlock = (matcher.group()).split("\\r?\\n")
for(index in layoutBlock){
//check each line of the current block
}
}
layoutBlock ends up with size 1.
I think this may be a so-called XY problem. Anyway, if the Groovy source consists only of #Layout-annotated blocks of code, you can use a tempered greedy token to select everything up to the next annotation.
Change the pattern line to this:
Pattern pattern = Pattern.compile( "#Layout(?:(?!#Layout).)*", Pattern.DOTALL );
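PS: the inline flag (?s) inside the regex and the Pattern.DOTALL parameter do the same thing: they enable dotall mode (called single-line mode in Perl), in which . also matches line terminators. This is not the same as Pattern.MULTILINE, which changes how ^ and $ match. Use only one of the two.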
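Since DOTALL and MULTILINE are easy to confuse, here is a small sketch of the difference. The helper name find is hypothetical; the flags and the inline (?s) syntax are standard java.util.regex.

```java
import java.util.regex.Pattern;

class DotallVsMultilineDemo {
    // Returns true if regex, compiled with the given flags, is found in input.
    static boolean find(String regex, int flags, String input) {
        return Pattern.compile(regex, flags).matcher(input).find();
    }

    public static void main(String[] args) {
        String input = "a\nb";
        // DOTALL affects only '.', letting it match the line break.
        System.out.println(find("a.b", 0, input));
        System.out.println(find("a.b", Pattern.DOTALL, input));
        System.out.println(find("(?s)a.b", 0, input));
        // MULTILINE affects only '^' and '$', letting them match at line boundaries.
        System.out.println(find("^b", 0, input));
        System.out.println(find("^b", Pattern.MULTILINE, input));
    }
}
```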
UPDATE
I tried your code. The problem (preserving newlines) is in the way you slurp the file: bufferedReader.readLine() removes the line terminator at the end of each line.
Simply re-add a newline when appending to allText:
String ln = System.lineSeparator();
while((line = bufferedReader.readLine()) != null) {
allText.append(line + ln);
}
Or you can replace all the code to slurp the file with this:
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
//can throw an IOException
String filePath = "/path/to/layout.groovy";
String allText = new String(Files.readAllBytes(Paths.get(filePath)),StandardCharsets.UTF_8);
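Putting the pieces together, a rough sketch of the extraction could look like this, assuming the text still contains its line breaks. The class name, method name and sample text are invented for the example; only the pattern and the split expression come from this thread.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LayoutBlocksDemo {
    // Tempered greedy token: consume any character that does not start the next "#Layout".
    private static final Pattern BLOCK =
            Pattern.compile("#Layout(?:(?!#Layout).)*", Pattern.DOTALL);

    // Returns each #Layout block as an array of its lines.
    static List<String[]> extractBlocks(String text) {
        List<String[]> blocks = new ArrayList<>();
        Matcher m = BLOCK.matcher(text);
        while (m.find()) {
            blocks.add(m.group().split("\\r?\\n"));
        }
        return blocks;
    }

    public static void main(String[] args) {
        String text = "import x\n"
                + "#Layout\nLayoutModel a() {\n  return 1\n}\n"
                + "#Layout\nLayoutModel b() {\n  return 2\n}\n";
        for (String[] block : extractBlocks(text)) {
            System.out.println(block.length + " lines, first: " + block[0]);
        }
    }
}
```

If the line breaks were stripped while reading, each block would come back as a single line, which is the size=1 symptom from the question.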

how to use escape characters for patterns read from file in java

I am reading some patterns from a file and using them in the String matches method, but when the patterns are read from the file, the escape characters do not work.
For example, I have data such as "abc.1", "abcd.1", "abce.1", "def.2".
I want to do some activity if the string matches "abc.1", i.e. abc. followed by any characters or numbers.
I have a file that stores the pattern to be matched, e.g. the pattern abc\..*
But when I read the pattern from the file and use it in the String matches method, it does not work.
Any suggestions?
A sample Java program to demonstrate the issue:
package com.test.resync;
import java.io.BufferedReader;
import java.io.File;
import java.io.FileNotFoundException;
import java.io.FileReader;
import java.io.IOException;
public class TestPattern {
public static void main(String args[]) {
// raw data against which the pattern is to be matched
String[] data = { "abc.1", "abcd.1", "abce.1", "def.2" };
String regex_data = ""; // variable to hold the regexpattern after
// reading from the file
// regex.txt the file containing the regex pattern
File file = new File(
"/home/ekhaavi/Documents/WORKSPACE/TESTproj/src/com/test/regex.txt");
try {
BufferedReader br = new BufferedReader(new FileReader(file));
String str = "";
while ((str = br.readLine()) != null) {
if (str.startsWith("matchedpattern")) {
regex_data = str.split("=")[1].toString(); // setting the
// regex pattern
}
}
} catch (FileNotFoundException e) {
// TODO Auto-generated catch block
e.printStackTrace();
} catch (IOException e) {
// TODO Auto-generated catch block
e.printStackTrace();
}
/*if the regex is set by the below String literal it works fine*/
//regex_data = "abc\\..*";
for (String st : data) {
if (st.matches(regex_data)) {
System.out.println(" data matched "); // this is not printed when the pattern is read from the file instead of setting it through literals
}
}
}
}
The regex.txt file has the following entry:
matchedpattern=abc\..*
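Use the Pattern.quote(String) method if the text read from the file should be matched literally. Note that it turns the whole string into a literal, so regex syntax in it such as \. and .* loses its special meaning:
if (st.matches(Pattern.quote(regex_data))) {
System.out.println(" data matched "); // printed only when st equals the text in regex_data exactly
}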
There are some other issues that you should consider resolving:
You're overwriting the value of regex_data in the while loop. Did you intend to store all the regex patterns in a list?
String#split() returns a String[], so split("=")[1] is already a String. You don't need to invoke toString() on it.
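For what it's worth, a pattern read from a file should not need any extra escaping: the doubled backslash in "abc\\..*" is Java string-literal syntax only, and readLine() returns the single backslash exactly as it is stored in the file. The sketch below simulates the file line with a literal (the class and method names are hypothetical) and compares plain matching with Pattern.quote.

```java
import java.util.regex.Pattern;

class PatternFromFileDemo {
    // Extracts the text after the first '=' of a "key=value" line.
    // The limit of 2 keeps any further '=' characters inside the pattern.
    static String patternFromLine(String line) {
        return line.split("=", 2)[1];
    }

    public static void main(String[] args) {
        // "\\" in this Java literal is ONE backslash, i.e. what readLine()
        // returns for the file line: matchedpattern=abc\..*
        String line = "matchedpattern=abc\\..*";
        String regex = patternFromLine(line);
        for (String st : new String[] {"abc.1", "abcd.1", "abce.1", "def.2"}) {
            System.out.println(st
                    + " regex=" + st.matches(regex)
                    + " quoted=" + st.matches(Pattern.quote(regex)));
        }
    }
}
```

If the file itself contained abc\\..* (two backslashes), the regex would instead require a literal backslash in the data, so it is worth checking what the file really holds.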

Regex extract string, why doesn't my pattern work?

I have a long string in this format (a long single line in file):
"1":"Aname","2":"AnotherName","3":"Sempronio"
I want to extract the number and the name and save them in a Map.
I tried this:
FileReader fileReader = null;
BufferedReader br = null;
File file = new File("./SingleLineFileNames.txt");
try {
fileReader = new FileReader(file);
br = new BufferedReader(fileReader);
String line;
Pattern p = Pattern.compile("\"(\\d+)\":\"([\\w-.' ]+)\"");
Matcher matcher;
while((line = br.readLine()) != null) {
matcher = p.matcher(line);
String name;
int i = 1;
while((name = matcher.group(i)) != null){
// save in map
i++;
}
}
}
catch (FileNotFoundException e) {
e.printStackTrace();
}
catch (IOException e) {
e.printStackTrace();
}
finally {
try {
br.close();
fileReader.close();
}
catch (IOException e) {
e.printStackTrace();
}
}
return null;
The result is java.lang.IllegalStateException: No match found.
Is this the right way to iterate over groups?
Where am I going wrong?
First split the String at , (String#split) and then split each resulting array element at : to get the key and value. With input strings like these, I wonder what kind of masochism drives developers to use regex sledgehammers to crack such simple nuts.
If you use a hyphen inside [], always place it first or last so that it cannot be read as a range operator:
Pattern p = Pattern.compile("\"(\\d+)\":\"([-\\w.' ]+)\"");
(the hyphen is now the first character inside the character class)
Also, the way you call group() is not correct. group() throws IllegalStateException unless find() or matches() has succeeded first, and the group index selects a capture group within one match, not the next match. Iterate like this:
while(matcher.find()){
System.out.println(matcher.group(1));
System.out.println(matcher.group(2));
}
Remove the broken square bracket construct ([\\w-.' ]+). For names containing word characters only, (\\w+) is enough.
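As a sketch of how the find() loop can fill the Map the question asks for (the class and method names are made up, and the hyphen-first character class from the answer above is assumed):

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class NameMapDemo {
    // Group 1: the number; group 2: the name.
    private static final Pattern ENTRY =
            Pattern.compile("\"(\\d+)\":\"([-\\w.' ]+)\"");

    // Collects every "number":"name" pair on the line, in order of appearance.
    static Map<Integer, String> parse(String line) {
        Map<Integer, String> names = new LinkedHashMap<>();
        Matcher matcher = ENTRY.matcher(line);
        while (matcher.find()) { // each successful find() is one pair
            names.put(Integer.valueOf(matcher.group(1)), matcher.group(2));
        }
        return names;
    }

    public static void main(String[] args) {
        String line = "\"1\":\"Aname\",\"2\":\"AnotherName\",\"3\":\"Sempronio\"";
        System.out.println(parse(line));
    }
}
```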

Can't match SRT subtitle using regex in Java

I am trying to parse an SRT subtitle with this code:
public class MatchArray {
public static void main(String args[]) {
File file = new File(
"C:/Users/Thiago/workspace/SubRegex/src/Dirty Harry VOST - Clint Eastwood.srt");
{
try {
Scanner in = new Scanner(file);
try {
String contents = in.nextLine();
while (in.hasNextLine()) {
contents = contents + "\n" + in.nextLine();
}
String pattern = "([\\d]+)\r([\\d]{2}:[\\d]{2}:[\\d]{2}),([\\d]{3})[\\s]*-->[\\s]*([\\d]{2}:[\\d]{2}:[\\d]{2}),([\\d]{3})\r(([^|\r]+(\r|$))+)";
Pattern r = Pattern.compile(pattern);
// Now create matcher object.
Matcher m = r.matcher(contents);
ArrayList<String> start = new ArrayList<String>();
while (m.find()) {
start.add(m.group(1));
start.add(m.group(2));
start.add(m.group(3));
start.add(m.group(4));
start.add(m.group(5));
start.add(m.group(6));
start.add(m.group(7));
System.out.println(start);
}
}
finally {
in.close();
}
} catch (IOException e) {
e.printStackTrace();
}
}
}
But when I execute it, it doesn't capture any group. When I try to capture only the time with this pattern:
([\\d]{2}:[\\d]{2}:[\\d]{2}),([\\d]{3})[\\s]*-->[\\s]*([\\d]{2}:[\\d]{2}:[\\d]{2}),([\\d]{3})
It works. So how do I make it capture the entire subtitle?
I do not quite understand what you need, but I think this can help.
Please try this regex:
(\\d+?)\\s*(\\d+?:\\d+?:\\d+?,\\d+?)\\s+-->\\s+(\\d+?:\\d+?:\\d+?,\\d+?)\\s+(.+)
I tried it on http://www.myregextester.com/index.php and it worked.
I hope this helps.
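As far as I can tell, the original pattern never matches because the code joins the lines with "\n" while the pattern expects "\r" between them; the time-only pattern contains no line break, which is why it works. Below is a minimal sketch that uses \R (any line break, Java 8+) instead. The class name, the method name and the sample cues are assumptions for illustration, and the pattern expects the line break to follow the end time directly.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class SrtDemo {
    // \R matches \n, \r\n or \r, so the pattern does not depend on how lines were joined.
    private static final Pattern CUE = Pattern.compile(
            "(\\d+)\\R"                                   // group 1: cue number
            + "(\\d{2}:\\d{2}:\\d{2},\\d{3})\\s*-->\\s*"  // group 2: start time
            + "(\\d{2}:\\d{2}:\\d{2},\\d{3})\\R"          // group 3: end time
            + "((?:.+(?:\\R|$))+)");                      // group 4: one or more text lines

    // Returns one {number, start, end, text} array per cue.
    static List<String[]> parse(String contents) {
        List<String[]> cues = new ArrayList<>();
        Matcher m = CUE.matcher(contents);
        while (m.find()) {
            cues.add(new String[] {m.group(1), m.group(2), m.group(3), m.group(4).trim()});
        }
        return cues;
    }

    public static void main(String[] args) {
        String contents = "1\n00:00:01,000 --> 00:00:02,500\nHello\nworld\n\n"
                + "2\n00:00:03,000 --> 00:00:04,000\nBye\n";
        for (String[] cue : parse(contents)) {
            System.out.println(String.join(" | ", cue));
        }
    }
}
```

Each cue gets its own array here, unlike the single shared ArrayList in the question, which keeps growing across matches.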
