Java - Groovy: regex parse text block

I know that this is a common question, and I've been through a lot of forums trying to figure out what the problem in my code is.
I have to read a text file with several blocks in the following format:
import com.myCompanyExample.gui.Layout
/*some comments here*/
#Layout
LayoutModel currentState() {
MyBuilder builder = new MyBuilder()
form example
title form{
row_1
row_1
row_n
}
return build.get()
}
#Layout
LayoutModel otherState() {
....
....
return build.get()
}
I have this code to read the whole file, and I'd like to extract each block between the keyword "#Layout" and the keyword "return". I also need to capture all the newlines so that I can later split each matched block into a list of lines.
private void myReadFile(File fileLayout){
String line = null;
StringBuilder allText = new StringBuilder();
try{
FileReader fileReader = new FileReader(fileLayout);
BufferedReader bufferedReader = new BufferedReader(fileReader);
while((line = bufferedReader.readLine()) != null) {
allText.append(line)
}
bufferedReader.close();
}
catch(FileNotFoundException ex) {
System.out.println("Unable to open file");
}
catch(IOException ex) {
System.out.println("Error reading file");
}
Pattern pattern = Pattern.compile("(?s)#Layout.*?return",Pattern.DOTALL);
Matcher matcher = pattern.matcher(allText);
while(matcher.find()){
String [] layoutBlock = (matcher.group()).split("\\r?\\n")
for(index in layoutBlock){
//check each line of the current block
}
}
But layoutBlock always has size 1.

I think this could be a so-called XY problem. In any case, if the Groovy source consists only of #Layout-annotated blocks of code, you can use a tempered greedy token to select everything up to the next annotation.
Change the pattern line to this:
Pattern pattern = Pattern.compile( "#Layout(?:(?!#Layout).)*", Pattern.DOTALL );
PS: the dotall flag (?s) inside the regex and the parameter Pattern.DOTALL do the same thing (they enable the so-called single-line mode, in which . also matches line terminators), so use only one of them.
UPDATE
I tried your code. The problem (preserving newlines) is in the method you use to slurp the file: bufferedReader.readLine() removes the newline at the end of each line.
Simply add a newline back when appending to allText:
String ln = System.lineSeparator();
while((line = bufferedReader.readLine()) != null) {
allText.append(line + ln);
}
Or you can replace all the code to slurp the file with this:
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
//can throw an IOException
String filePath = "/path/to/layout.groovy";
String allText = new String(Files.readAllBytes(Paths.get(filePath)), StandardCharsets.UTF_8);
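To see the tempered greedy token and the line split working together, here is a minimal sketch. It assumes the file content is already in a string with its newlines intact; the class name LayoutBlocks, the method extractBlocks and the shortened sample text are made up for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LayoutBlocks {
    // Tempered greedy token: consume any character (newlines included,
    // because of DOTALL) as long as it does not start the next "#Layout".
    private static final Pattern BLOCK =
            Pattern.compile("#Layout(?:(?!#Layout).)*", Pattern.DOTALL);

    // Returns one list of lines per "#Layout" block found in the text.
    static List<List<String>> extractBlocks(String allText) {
        List<List<String>> blocks = new ArrayList<>();
        Matcher matcher = BLOCK.matcher(allText);
        while (matcher.find()) {
            // Split on \n or \r\n; split() drops trailing empty strings.
            blocks.add(Arrays.asList(matcher.group().split("\\r?\\n")));
        }
        return blocks;
    }

    public static void main(String[] args) {
        // Hypothetical, shortened version of the file from the question.
        String sample = String.join("\n",
                "import com.myCompanyExample.gui.Layout",
                "#Layout",
                "LayoutModel currentState() {",
                "return build.get()",
                "}",
                "#Layout",
                "LayoutModel otherState() {",
                "return build.get()",
                "}");
        for (List<String> block : extractBlocks(sample)) {
            System.out.println(block.size() + " lines, second line: " + block.get(1));
        }
    }
}
```

Note that each block here runs up to the next #Layout (or the end of the text), so the closing brace after return is part of the block.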

Related

finding character count between two special symbols

I am trying to find the character count between = and the \n newline character using the Java code below, but \n is not being matched in my case.
I am using the import org.apache.commons.lang3.StringUtils; package.
Please find my Java code below.
public class CharCountInLine {
public static void main(String[] args)
{
BufferedReader reader = null;
try
{
reader = new BufferedReader(new FileReader("C:\\wordcount\\sample.txt"));
String currentLine = reader.readLine();
String[] line = currentLine.split("=");
while (currentLine != null ){
String res = StringUtils.substringBetween(currentLine, "=", "\n"); // \n is not working.
if(res != null) {
System.out.println("line -->"+res.length());
}
currentLine = reader.readLine();
}
}
catch (IOException e)
{
e.printStackTrace();
}
finally
{
try
{
reader.close();
}
catch (IOException e)
{
e.printStackTrace();
}
}
}
}
Please find my sample text file.
sample.txt
Karthikeyan=123456
sathis= 23546
Arun = 23564
Well, you're reading the string using readLine(), which according to the Javadoc (emphasis mine):
Returns:
A String containing the contents of the line, not including
any line-termination characters, or null if the end of the stream has
been reached
So your code doesn't work because the string does not contain a newline character.
You can address this in a number of ways:
Use StringUtils.substringAfter() instead of StringUtils.substringBetween().
If it meets the requirements, treat your file as a Java properties file so you don't need to parse it yourself.
Use String.split().
Use String.lastIndexOf().
Some simple regex matching and grouping.
You don't need to change how you read the lines, simply change your logic to extract the text after =.
Pattern p = Pattern.compile("(?:.+)=(.+)$");
Matcher m = p.matcher("Karthikeyan=123456");
if (m.find()) {
System.out.println(m.group(1).length());
}
No need for Apache StringUtils either, simple Java regex will do. If you don't want to count whitespace, trim the string before calling length().
Alternatively, you can also split the line around = and take the length of the second part.
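Simpler code for reading the file:
Path p = Paths.get("C:\\wordcount\\sample.txt");
// Files.lines can throw an IOException; try-with-resources closes the stream
try (Stream<String> lines = Files.lines(p)) {
lines.forEach(line -> {
// Put the above code here
});
}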
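As a small sketch of the "extract the text after =" idea from the answers above, without regex or StringUtils: the class ValueLength and the method valueLength are hypothetical names, and the code assumes that whitespace around the value should not be counted.

```java
class ValueLength {
    // Returns the number of characters after the first '=' in the line,
    // ignoring surrounding whitespace, or -1 if the line contains no '='.
    static int valueLength(String line) {
        int eq = line.indexOf('=');
        if (eq < 0) {
            return -1;
        }
        return line.substring(eq + 1).trim().length();
    }

    public static void main(String[] args) {
        // The three lines of sample.txt from the question.
        String[] sample = {"Karthikeyan=123456", "sathis= 23546", "Arun = 23564"};
        for (String line : sample) {
            System.out.println(line + " --> " + valueLength(line));
        }
    }
}
```

Drop the trim() call if leading or trailing spaces should count toward the length.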

Java - Reading a text file when a certain sequence occurs

I haven't been able to find a way to read from a .txt file when a certain sequence occurs.
This is what an entry in my file looks like:
&1551:John:Packard:83:Heavy:Blonde&
I want my file to be read from &1551 (1551 is the unique ID number of the user) until the next "&". Do you guys have any suggestions as to how to accomplish this? The ":" is later used for splitting the string.
Thanks!
A simple JDK Scanner has the ability to read a file stopping at certain patterns:
public String findWithinHorizon(String pattern,
int horizon)
Attempts to find the next occurrence of the specified pattern.
public Scanner skip(Pattern pattern)
Skips input that matches the specified pattern, ignoring delimiters. This method will skip input if an anchored match of the specified pattern succeeds.
If a match to the specified pattern is not found at the current position, then no input is skipped and a NoSuchElementException is thrown.
So this should be enough:
// skip anything up to "&1551:" (but keep "1551:" for the next read)
Pattern toSkip = Pattern.compile(".*?&(?=1551:)", Pattern.DOTALL);
sc.skip(toSkip);
// get everything starting at "1551:" up to the next "&" sign on the same line
String line = sc.findWithinHorizon(".*?(?=&)", 0);
If line terminators can occur between the & signs, compile the second pattern with the DOTALL flag as well, as I did for toSkip.
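Here is a runnable sketch of the Scanner approach, adapted to the & delimiter and the ID 1551 from the question. It reads from a string instead of a file so that it is self-contained; EntryFinder, findEntry and the first sample entry are invented for illustration.

```java
import java.util.NoSuchElementException;
import java.util.Scanner;
import java.util.regex.Pattern;

class EntryFinder {
    // Returns the text between "&" and the next "&" for the entry that starts
    // with the given id, or null if no such entry exists.
    static String findEntry(String source, String id) {
        try (Scanner sc = new Scanner(source)) {
            // Skip everything up to and including the "&" that precedes "<id>:".
            Pattern toSkip = Pattern.compile(
                    ".*?&(?=" + Pattern.quote(id) + ":)", Pattern.DOTALL);
            sc.skip(toSkip);
            // Read reluctantly up to (not including) the closing "&".
            return sc.findWithinHorizon(".*?(?=&)", 0);
        } catch (NoSuchElementException e) {
            // skip() throws this when the pattern does not match.
            return null;
        }
    }

    public static void main(String[] args) {
        String data = "&1550:Jane:Doe:70:Light:Brown&\n"
                + "&1551:John:Packard:83:Heavy:Blonde&\n";
        String entry = findEntry(data, "1551");
        if (entry != null) {
            for (String field : entry.split(":")) {
                System.out.println(field);
            }
        }
    }
}
```

To read from a file, the same two calls can be made on a Scanner constructed from a File or Path.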
First you have to extract the input string without the starting & and the ending &,
then split the string on :
Assuming each entry is on its own line, the code below should work:
public static void main(String args[]) {
BufferedReader reader = null;
try {
reader = new BufferedReader(new FileReader("D://test.txt"));
String sCurrentLine;
String[] fields;
while ((sCurrentLine = reader.readLine()) != null) {
sCurrentLine = sCurrentLine.substring(sCurrentLine.indexOf('&') + 1);
sCurrentLine = sCurrentLine.substring(0, sCurrentLine.indexOf('&'));
fields = sCurrentLine.split(":");
for (String tmp : fields)
System.out.println(tmp);
}
} catch (Exception e) {
System.out.println("Error in accepting String");
} finally {
try {
if (reader != null) reader.close(); // reader is still null if the file could not be opened
} catch (IOException e) {
e.printStackTrace();
}
}
}
Hope this helps.

Regex for replacing Exact String match [duplicate]

My input:
1. end
2. end of the day or end of the week
3. endline
4. something
5. "something" end
Based on the above discussions, if I try to replace a single string using this snippet, it successfully removes the matching words from the line:
public class DeleteTest {
public static void main(String[] args) {
// TODO Auto-generated method stub
try {
File file = new File("C:/Java samples/myfile.txt");
File temp = File.createTempFile("myfile1", ".txt", file.getParentFile());
String delete="end";
BufferedReader reader = new BufferedReader(new InputStreamReader(new FileInputStream(file)));
PrintWriter writer = new PrintWriter(new OutputStreamWriter(new FileOutputStream(temp)));
for (String line; (line = reader.readLine()) != null;) {
line = line.replaceAll("\\b"+delete+"\\b", "");
writer.println(line);
}
reader.close();
writer.close();
}
catch (Exception e) {
System.out.println("Something went Wrong");
}
}
}
My output if I use the above snippet (which is also my expected output):
1.
2. of the day or of the week
3. endline
4. something
5. "something"
But when I want to delete more words and use a Set for that purpose, as in the snippet below:
public static void main(String[] args) {
// TODO Auto-generated method stub
try {
File file = new File("C:/Java samples/myfile.txt");
File temp = File.createTempFile("myfile1", ".txt", file.getParentFile());
BufferedReader reader = new BufferedReader(new InputStreamReader(new FileInputStream(file)));
PrintWriter writer = new PrintWriter(new OutputStreamWriter(new FileOutputStream(temp)));
Set<String> toDelete = new HashSet<>();
toDelete.add("end");
toDelete.add("something");
for (String line; (line = reader.readLine()) != null;) {
line = line.replaceAll("\\b"+toDelete+"\\b", "");
writer.println(line);
}
reader.close();
writer.close();
}
catch (Exception e) {
System.out.println("Something went Wrong");
}
}
I get my output as: (It just removes the space)
1. end
2. endofthedayorendoftheweek
3. endline
4. something
5. "something" end
Can you help me with this?
You need to create an alternation group out of the set with
String.join("|", toDelete)
and use as
line = line.replaceAll("\\b(?:"+String.join("|", toDelete)+")\\b", "");
The pattern will look like
\b(?:end|something)\b
Here, (?:...) is a non-capturing group that is used to group several alternatives without storing the captured text (you do not need it since you remove the matches).
Or, better, compile the regex before entering the loop:
Pattern pat = Pattern.compile("\\b(?:" + String.join("|", toDelete) + ")\\b");
...
line = pat.matcher(line).replaceAll("");
UPDATE:
To allow matching whole "words" that may contain special characters, you need to Pattern.quote those words to escape the special characters. You also need unambiguous word boundaries: a negative lookbehind (?<!\w) instead of the initial \b, to make sure there is no word character before the match, and a negative lookahead (?!\w) instead of the final \b, to make sure there is no word character after it.
In Java 8, you may use this code:
Set<String> nToDel = toDelete.stream()
.map(Pattern::quote)
.collect(Collectors.toCollection(HashSet::new));
String pattern = "(?<!\\w)(?:" + String.join("|", nToDel) + ")(?!\\w)";
For example, if the set contains +end and something-, the regex will look like (?<!\w)(?:\Q+end\E|\Qsomething-\E)(?!\w). Note that the symbols between \Q and \E are parsed as literal symbols.
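Putting the quoting and the lookarounds together, a minimal sketch might look like this. WordRemover and buildPattern are hypothetical names, the word c++ is added only to show a term with special characters, and the code assumes the collection of words is not empty.

```java
import java.util.Arrays;
import java.util.Collection;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

class WordRemover {
    // Builds one pattern that matches any of the given words as a whole "word".
    // Each word is quoted, so characters like + or . are taken literally.
    static Pattern buildPattern(Collection<String> words) {
        String alternation = words.stream()
                .map(Pattern::quote)
                .collect(Collectors.joining("|"));
        // (?<!\w) and (?!\w) replace \b so that words starting or ending
        // with a non-word character still get a sensible boundary check.
        return Pattern.compile("(?<!\\w)(?:" + alternation + ")(?!\\w)");
    }

    public static void main(String[] args) {
        // Compile once, outside the loop over the lines.
        Pattern pat = buildPattern(Arrays.asList("end", "something", "c++"));
        String[] lines = {
            "2. end of the day or end of the week",
            "3. endline",
            "6. c++ or c"
        };
        for (String line : lines) {
            System.out.println(pat.matcher(line).replaceAll(""));
        }
    }
}
```

Only the matched words are removed; the spaces around them stay, so a follow-up cleanup of double spaces may be wanted.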
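The problem is that you're not creating the correct regex for replacing the words in the set.
"\\b"+toDelete+"\\b" concatenates the set's toString() output and produces the String \b[end, something]\b, which is not what you need. In a regex, [end, something] is a character class that matches a single character (one of the listed letters, the comma, or the space), which is why the spaces between words were removed.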
To fix that you can do something like this:
for(String del : toDelete){
line = line.replaceAll("\\b"+del+"\\b", "");
}
What this does is to go through the set, produce a regex from each word and remove that word from the line String.
Another approach will be to produce a single regex from all the words in the set.
Eg:
String regex = "";
for(String word : toDelete){
regex+=(regex.isEmpty() ? "" : "|") + "(\\b"+word+"\\b)";
}
....
line = line.replaceAll(regex, "");
This should produce a regex that looks something like this: (\bend\b)|(\bsomething\b)

JAVA: Getting the content of specific strings from text files

I have a text file like this:
text
text
text
.
.
#data
instances1
instances2
.
.
instancesN
I want to get the contents of this file from #data until the end of the file. How can I do that?
I found this method of the FileUtils class (from Apache Commons IO), but it's usable only if I already know the line number.
String ln = FileUtils.readLines(new File("arff_file/"+results.get(0)))
.get(lineNumber);
Since you are using Apache Commons, you can do it in one line:
String contents = FileUtils.readFileToString(new File("arff_file/"+results.get(0)), "UTF-16").replaceAll("(?s)^.*?(?=#data)", "");
This works by
reading the whole file into a single String
using regex-based replaceAll() to remove (by replacing with a blank) everything up to, but not including, #data
The regex breakdown of (?s)^.*?(?=#data) is:
(?s) the DOTALL flag, so that . also matches line terminators (without it the match could not span lines)
^ start of input
.*? a reluctantly quantified wildcard
(?=#data) a positive (non-consuming) look ahead that asserts that the next input is #data
The reluctant quantifier matters: it stops at the first #data instead of skipping past it, in case #data appears more than once in the input.
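The regex part can be tried without Commons IO on a plain string. This is a sketch only: DataSection and fromDataMarker are hypothetical names, the pattern carries the (?s) flag so the wildcard can span lines, and replaceFirst is used because the pattern is anchored at the start of the input anyway.

```java
class DataSection {
    // Removes everything before the first "#data" marker.
    // If the marker is absent, the input is returned unchanged.
    static String fromDataMarker(String contents) {
        return contents.replaceFirst("(?s)^.*?(?=#data)", "");
    }

    public static void main(String[] args) {
        String contents = "text\ntext\n#data\ninstances1\ninstances2\n";
        System.out.print(fromDataMarker(contents));
    }
}
```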
try {
String file = "fileName";
BufferedReader br = new BufferedReader(new FileReader(file));
String line;
while ((line = br.readLine()) != null) {
if (line.equals("#data"))
nowRead(br);//I just do this for more efficiency, you can set a boolean flag instead
}
br.close();
}catch (IOException e) {
//OMG Exception again!
}
}
static ArrayList<String> nowRead(BufferedReader br) throws IOException {
ArrayList<String> s = new ArrayList<String>();// do it as you wish
String line;
while ((line = br.readLine()) != null) {
s.add(line);
}
return s;
}
Path start = Paths.get("test.txt");
try
{
List<String> lines = Files.readAllLines(start);
for (Iterator<String> it = lines.iterator(); it.hasNext();)
{
String line = it.next();
if (!"#data".equals(line.trim()))
{
it.remove();
}
else
{
break;
}
}
System.out.println(lines);
}
catch (IOException e)
{
e.printStackTrace();
}
I was reading about Path online, so why not something like this as an alternative to Bohemian's code?
Maybe something could be done using the stream() API of Java 8, but I have not found anything yet...
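One possible stream-based variant is sketched below. It relies on Stream.dropWhile, which was added in Java 9, so it does not work on Java 8; DataLines and fromDataMarker are hypothetical names, and a list stands in for the file so that the example is self-contained.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

class DataLines {
    // Drops all lines before the first "#data" line and keeps the rest,
    // including the "#data" line itself.
    static List<String> fromDataMarker(List<String> lines) {
        return lines.stream()
                .dropWhile(line -> !"#data".equals(line.trim()))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
                "text", "text", "#data", "instances1", "instances2");
        System.out.println(fromDataMarker(lines));
    }
}
```

The same dropWhile call can be applied to the stream returned by Files.lines(path), ideally inside a try-with-resources block.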

Regex extract string, why doesn't my pattern work?

I have a long string in this format (a long single line in file):
"1":"Aname","2":"AnotherName","3":"Sempronio"
I want to extract the number and the name and save them on a Map.
I tried this:
FileReader fileReader = null;
BufferedReader br = null;
File file = new File("./SingleLineFileNames.txt");
try {
fileReader = new FileReader(file);
br = new BufferedReader(fileReader);
String line;
Pattern p = Pattern.compile("\"(\\d+)\":\"([\\w-.' ]+)\"");
Matcher matcher;
while((line = br.readLine()) != null) {
matcher = p.matcher(line);
String name;
int i = 1;
while((name = matcher.group(i)) != null){
// save in map
i++;
}
}
}
catch (FileNotFoundException e) {
e.printStackTrace();
}
catch (IOException e) {
e.printStackTrace();
}
finally {
try {
br.close();
fileReader.close();
}
catch (IOException e) {
e.printStackTrace();
}
}
return null;
The result is java.lang.IllegalStateException: No match found
Is this the right way to iterate over groups?
Where am I going wrong?
First split the String at , (String#split) and then split each resulting array element at : to get key and value. With input strings like these, I wonder what kind of masochism drives developers to use regex sledgehammers to crack such simple nuts.
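If you use a hyphen inside [], always place it first or last so that it is read as a literal hyphen rather than as a range operator.
Pattern p = Pattern.compile("\"(\\d+)\":\"([-\\w.' ]+)\""); // hyphen moved to the start of the character class
Also, the way you are calling group() is not correct: group() throws IllegalStateException unless a match attempt such as find() has succeeded first, and group(i) selects a capturing group of the current match, not the i-th match. Loop over find() instead: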
while(matcher.find()){
System.out.println(matcher.group(1));
System.out.println(matcher.group(2));
}
Remove the broken character class ([\\w-.' ]+). If the names contain only word characters, (\\w+) is enough there.
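Combining the answers above, a sketch that fills a Map from the single-line input could look like this. NameParser and parse are hypothetical names, and the code assumes the numeric keys fit into an int.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class NameParser {
    // Group 1 is the number, group 2 the name; the hyphen comes first in the
    // character class so that it is taken literally.
    private static final Pattern ENTRY =
            Pattern.compile("\"(\\d+)\":\"([-\\w.' ]+)\"");

    static Map<Integer, String> parse(String line) {
        Map<Integer, String> names = new LinkedHashMap<>();
        Matcher matcher = ENTRY.matcher(line);
        // Each successful find() moves to the next "number":"name" pair.
        while (matcher.find()) {
            names.put(Integer.valueOf(matcher.group(1)), matcher.group(2));
        }
        return names;
    }

    public static void main(String[] args) {
        String line = "\"1\":\"Aname\",\"2\":\"AnotherName\",\"3\":\"Sempronio\"";
        System.out.println(parse(line));
    }
}
```

A LinkedHashMap is used here only to keep the entries in the order they appear in the line.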
