Java sequentially parse information from file

Java sequentially parse information from file - java

lets say I have a file with a structure like this:
Line 0:
354858 Some String That Is Important AA OTHER STUFF SOMESTUFF
THAT SHOULD BE IGNORED
Line 1:
543788 Another String That Is Important AA OTHER STUFF
SOMESTUFF THAT SHOULD BE IGNORED
and so on...
Now I would like to get the information that is marked in my example (see gray background). The sequence AA is always present (and could be used as a break and skip to the next line) while the information string varies in length.
What will be the best way to parse the information? A buffered reader with if, then, else or is there some kind of parser that you can tell, read a number of lenth XYZ then read everything into a String until you find AA then skip line.

To tell you which is best for your problem is not possible without more information.
One solution might be
String s = "354858 Some String That Is Important AA OTHER STUFF SOMESTUFF THAT SHOULD BE IGNORED";
String[] split = s.substring(0, s.indexOf(" AA")).split(" ", 2);
System.out.println("split = " + Arrays.toString(split));
output
split = [354858, Some String That Is Important]

You can read the file line by line and exclude the part which contains the AA charSequence:
final String charSequence = "AA";
String line;
BufferedReader r = new BufferedReader(new InputStreamReader(new FileInputStream("yourfilename")));
try {
while ((line = r.readLine()) != null) {
int pos = line.indexOf(charSequence);
if (pos > 0) {
String myImportantStuff = line.substring(0, pos);
//do something with your useful string
}
}
} finally {
r.close();
}

I would read the file line by line and match each line against a regular expression. I hope my comments in the code below will be detailed enough.
// The pattern to use
Pattern p = Pattern.compile("^([0-9]+)\\s+(([^A]|A[^A])+)AA");
// Read file line by line
BufferedReader br = new BufferedReader(new FileReader(myFile));
String line;
while((line = br.readLine()) != null) {
// Match line against our pattern
Matcher m = p.matcher(line);
if(m.find()) {
// Line is valid, process it however you want
// m.group(1) contains the number
// m.group(2) contains the text between number and AA
} else {
// Line has invalid format (pattern does not match)
}
}
Explanation of the regular expression (Pattern) I used:
^([0-9]+)\s+(([^A]|A[^A])+)AA
^ matches the start of the line
([0-9]+) matches any integral number
\s+ matches one or more whitespace characters
(([^A]|A[^A])+) matches any characters which are either not A or not followed by another A
AA matches the terminating AA
Update as a reply to comment:
If every line has a preceding | character, the expression looks like this:
^\|([0-9]+)\s+(([^A]|A[^A])+)AA
In JAVA, you need to escape it like this:
"^\\|([0-9]+)\\s+(([^A]|A[^A])+)AA"
The character | has a special meaning in regular expressions and has to be escaped.

Here is a solution for you:
public static void main(String[] args) {
InputStream source; //select a text source (should be a FileInputStream)
{
String fileContent = "354858 Some String That Is Important AA OTHER STUFF SOMESTUFF THAT SHOULD BE IGNORED\n" +
"543788 Another String That Is Important AA OTHER STUFF SOMESTUFF THAT SHOULD BE IGNORED";
source = new ByteArrayInputStream(fileContent.getBytes(StandardCharsets.UTF_8));
}
try(BufferedReader stream = new BufferedReader(new InputStreamReader(source))) {
Pattern pattern = Pattern.compile("^([0-9]+) (.*?) AA .*$");
while(true) {
String line = stream.readLine();
if(line == null) {
break;
}
Matcher matcher = pattern.matcher(line);
if(matcher.matches()) {
String someNumber = matcher.group(1);
String someText = matcher.group(2);
//do something with someNumber and someText
} else {
throw new ParseException(line, 0);
}
}
} catch (IOException | ParseException e) {
e.printStackTrace(); // TODO ...
}
}

You could use a regular expression, but if you know every line contains AA and you want the content up to AA you could can simply do substring(int,int) to get the part of the line up to AA
public List read(Path path) throws IOException {
return Files.lines(path)
.map(this::parseLine)
.collect(Collectors.toList());
}
public String parseLine(String line){
int index = line.indexOf("AA");
return line.substring(0,index);
}
Here's the non-Java8 version of read
public List read(Path path) throws IOException {
List<String> content = new ArrayList<>();
try(BufferedReader reader = new BufferedReader(new FileReader(path.toFile()))){
String line;
while((line = reader.readLine()) != null){
content.add(parseLine(line));
}
}
return content;
}

Use Regex : .+?(?=AA).
Check Here is the Demo

Related

Troubleshooting Java replaceAll

I am trying to write a method that accepts an input string to be found and an input string to replace all instances of the found word and to return the number of replacements made. I am trying to use pattern and matcher from JAVA regex. I have a text file called "text.txt" which includes "this is a test this is a test this is a test". When I try to search for "test" and replace it with "mess", the method returns 1 each time and none of the words test are replaced.
public int findAndRepV2(String word, String replace) throws FileNotFoundException, IOException
{
int cnt = 0;
BufferedReader input = new BufferedReader( new FileReader(this.filename));
Writer fw = new FileWriter("test.txt");
String line = input.readLine();
while (line != null)
{
Pattern pattern = Pattern.compile(word, Pattern.CASE_INSENSITIVE);
Matcher matcher = pattern.matcher(line);
while (matcher.find()) {matcher.replaceAll(replace); cnt++;}
line = input.readLine();
}
fw.close();
return cnt;
}

First, you need to ensure that the text you are searching for is not interpreted as a regex. You should do:
Pattern pattern = Pattern.compile(Pattern.quote(word), Pattern.CASE_INSENSITIVE);
Second, replaceAll does something like this:
public String replaceAll(String replacement) {
reset();
boolean result = find();
if (result) {
StringBuffer sb = new StringBuffer();
do {
appendReplacement(sb, replacement);
result = find();
} while (result);
appendTail(sb);
return sb.toString();
}
return text.toString();
}
Note how it calls find until it can't find anything. This means that your loop will only be run once, since after the first call to replaceAll, the matcher has already found everything.
You should use appendReplacement instead:
StringBuffer buffer = new StringBuffer();
while (matcher.find()) {
matcher.appendReplacement(buffer, replace);
cnt++;
}
buffer.append(line.substring(matcher.end()));
// "buffer" contains the string after the replacement
I noticed that in your method, you didn't actually do anything with the string after the replacement. If that's the case, just count how many times find returns true:
while (matcher.find()) {
cnt++;
}

Regex for replacing Exact String match [duplicate]

My input:
1. end
2. end of the day or end of the week
3. endline
4. something
5. "something" end
Based on the above discussions, If I try to replace a single string using this snippet, it removes the appropriate words from the line successfully
public class DeleteTest {
public static void main(String[] args) {
// TODO Auto-generated method stub
try {
File file = new File("C:/Java samples/myfile.txt");
File temp = File.createTempFile("myfile1", ".txt", file.getParentFile());
String delete="end";
BufferedReader reader = new BufferedReader(new InputStreamReader(new FileInputStream(file)));
PrintWriter writer = new PrintWriter(new OutputStreamWriter(new FileOutputStream(temp)));
for (String line; (line = reader.readLine()) != null;) {
line = line.replaceAll("\\b"+delete+"\\b", "");
writer.println(line);
}
reader.close();
writer.close();
}
catch (Exception e) {
System.out.println("Something went Wrong");
}
}
}
My output If I use the above snippet:(Also my expected output)
1.
2. of the day or of the week
3. endline
4. something
5. "something"
But when I include more words to delete, and for that purpose when I use Set, I use the below code snippet:
public static void main(String[] args) {
// TODO Auto-generated method stub
try {
File file = new File("C:/Java samples/myfile.txt");
File temp = File.createTempFile("myfile1", ".txt", file.getParentFile());
BufferedReader reader = new BufferedReader(new InputStreamReader(new FileInputStream(file)));
PrintWriter writer = new PrintWriter(new OutputStreamWriter(new FileOutputStream(temp)));
Set<String> toDelete = new HashSet<>();
toDelete.add("end");
toDelete.add("something");
for (String line; (line = reader.readLine()) != null;) {
line = line.replaceAll("\\b"+toDelete+"\\b", "");
writer.println(line);
}
reader.close();
writer.close();
}
catch (Exception e) {
System.out.println("Something went Wrong");
}
}
I get my output as: (It just removes the space)
1. end
2. endofthedayorendoftheweek
3. endline
4. something
5. "something" end
Can u guys help me on this?
Click here to follow the thread

You need to create an alternation group out of the set with
String.join("|", toDelete)
and use as
line = line.replaceAll("\\b(?:"+String.join("|", toDelete)+")\\b", "");
The pattern will look like
\b(?:end|something)\b
See the regex demo. Here, (?:...) is a non-capturing group that is used to group several alternatives without creating a memory buffer for the capture (you do not need it since you remove the matches).
Or, better, compile the regex before entering the loop:
Pattern pat = Pattern.compile("\\b(?:" + String.join("|", toDelete) + ")\\b");
...
line = pat.matcher(line).replaceAll("");
UPDATE:
To allow matching whole "words" that may contain special chars, you need to Pattern.quote those words to escape those special chars, and then you need to use unambiguous word boundaries, (?<!\w) instead of the initial \b to make sure there is no word char before and (?!\w) negative lookahead instead of the final \b to make sure there is no word char after the match.
In Java 8, you may use this code:
Set<String> nToDel = new HashSet<>();
nToDel = toDelete.stream()
.map(Pattern::quote)
.collect(Collectors.toCollection(HashSet::new));
String pattern = "(?<!\\w)(?:" + String.join("|", nToDel) + ")(?!\\w)";
The regex will look like (?<!\w)(?:\Q+end\E|\Qsomething-\E)(?!\w). Note that the symbols between \Q and \E is parsed as literal symbols.

The problem is that you're not creating the correct regex for replacing the words in the set.
"\\b"+toDelete+"\\b" will produce this String \b[end, something]\b which is not what you need.
To fix that you can do something like this:
for(String del : toDelete){
line = line.replaceAll("\\b"+del+"\\b", "");
}
What this does is to go through the set, produce a regex from each word and remove that word from the line String.
Another approach will be to produce a single regex from all the words in the set.
Eg:
String regex = "";
for(String word : toDelete){
regex+=(regex.isEmpty() ? "" : "|") + "(\\b"+word+"\\b)";
}
....
line = line.replace(regex, "");
This should produce a regex that looks something like this: (\bend\b)|(\bsomething\b)

Java, How can I find a pattern in a File and read the whole line?

I want to find a special charsequence in a file and I want to read the whole line where the occurrences are.
The following code just checks the first line and fetchess this ( the first ) line.
How can I fix it?
Scanner scanner = new Scanner(file);
String output = "";
output = output + scanner.findInLine(pattern) + scanner.next();
pattern and file are parameters

UPDATED ANSWER according to the comments on this very answer
In fact, what is used is Scanner#findWithHorizon, which in fact calls the Pattern#compile method with a set of flags (Pattern#compile(String, int)).
The result seems to be applying this pattern over and over again in the input text over lines of a file; and this supposes of course that a pattern cannot match multiple lines at once.
Therefore:
public static final String findInFile(final Path file, final String pattern,
final int flags)
throws IOException
{
final StringBuilder sb = new StringBuilder();
final Pattern p = Pattern.compile(pattern, flags);
String line;
Matcher m;
try (
final BufferedReader br = Files.newBufferedReader(path);
) {
while ((line = br.readLine()) != null) {
m = p.matcher(line);
while (m.find())
sb.append(m.group());
}
}
return sb.toString();
}
For completeness I should add that I have developed some time ago a package which allows a text file of arbitrary length to be read as a CharSequence and which can be used to great effect here: https://github.com/fge/largetext. It would work beautifully here since a Matcher matches against a CharSequence, not a String. But this package needs some love.
One example returning a List of matching strings in a file can be:
private static List<String> findLines(final Path path, final String pattern)
throws IOException
{
final Predicate<String> predicate = Pattern.compile(pattern).asPredicate();
try (
final Stream<String> stream = Files.lines(path);
) {
return stream.filter(predicate).collect(Collectors.toList());
}
}

How to split a file into several tokens

I was trying to tokenize an input file from sentences into tokens(words).
For example,
"This is a test file." into five words "this" "is" "a" "test" "file", omitting the punctuations and the white spaces. And store them into an arraylist.
I tried to write some codes like this:
public static ArrayList<String> tokenizeFile(File in) throws IOException {
String strLine;
String[] tokens;
//create a new ArrayList to store tokens
ArrayList<String> tokenList = new ArrayList<String>();
if (null == in) {
return tokenList;
} else {
FileInputStream fStream = new FileInputStream(in);
DataInputStream dataIn = new DataInputStream(fStream);
BufferedReader br = new BufferedReader(new InputStreamReader(dataIn));
while (null != (strLine = br.readLine())) {
if (strLine.trim().length() != 0) {
//make sure strings are independent of capitalization and then tokenize them
strLine = strLine.toLowerCase();
//create regular expression pattern to split
//first letter to be alphabetic and the remaining characters to be alphanumeric or '
String pattern = "^[A-Za-z][A-Za-z0-9'-]*$";
tokens = strLine.split(pattern);
int tokenLen = tokens.length;
for (int i = 1; i <= tokenLen; i++) {
tokenList.add(tokens[i - 1]);
}
}
}
br.close();
dataIn.close();
}
return tokenList;
}
This code works fine except I found out that instead of make a whole file into several words(tokens), it made a whole line into a token. "area area" becomes a token, instead of "area" appeared twice. I don't see the error in my codes. I believe maybe it's something wrong with my trim().
Any valuable advices is appreciated. Thank you so much.
Maybe I should use scanner instead?? I'm confused.

I think Scanner is more approprate for this task. As to this code, you should fix regex, try "\\s+";

Try pattern as String pattern = "[^\\w]"; in the same code

Java - Groovy : regex parse text block

I know that this is a common question and I've been through a lot of forums to figure out whats the problem in my code.
I have to read a text file with several blocks in the following format:
import com.myCompanyExample.gui.Layout
/*some comments here*/
#Layout
LayoutModel currentState() {
MyBuilder builder = new MyBuilder()
form example
title form{
row_1
row_1
row_n
}
return build.get()
}
#Layout
LayoutModel otherState() {
....
....
return build.get()
}
I have this code to read all the file and I'd like to extract each block between the keyword "#Layout" and the keyword "return". I need also to catch all newline so later I'll be able to split each matched block into a list
private void myReadFile(File fileLayout){
String line = null;
StringBuilder allText = new StringBuilder();
try{
FileReader fileReader = new FileReader(fileLayout);
BufferedReader bufferedReader = new BufferedReader(fileReader);
while((line = bufferedReader.readLine()) != null) {
allText.append(line)
}
bufferedReader.close();
}
catch(FileNotFoundException ex) {
System.out.println("Unable to open file");
}
catch(IOException ex) {
System.out.println("Error reading file");
}
Pattern pattern = Pattern.compile("(?s)#Layout.*?return",Pattern.DOTALL);
Matcher matcher = pattern.matcher(allText);
while(matcher.find()){
String [] layoutBlock = (matcher.group()).split("\\r?\\n")
for(index in layoutBlock){
//check each line of the current block
}
}
layoutBlock returns size=1

I think this can potentially be a so called XY problem anyway...if the groovy source is composed only by #Layout annotated blocks of code you can use a tempered greedy token to select till the next annotation (view online demo).
Change the pattern loc as this:
Pattern pattern = Pattern.compile( "#Layout(?:(?!#Layout).)*", Pattern.DOTALL );
PS: the dotall flag (?s) inside the regex and the parameter Pattern.DOTALL do the same thing (enable the so called multiline mode), use only one of them indifferently.
UPDATE
I tried your code, the problem (preserving newline) is in the method you use to slurp the file (bufferedReader.readline() remove the newline at the end of the string).
Simply readd a newline when append to allText:
String ln = System.lineSeparator();
while((line = bufferedReader.readLine()) != null) {
allText.append(line + ln);
}
Or you can replace all the code to slurp the file with this:
import java.nio.file.Files;
import java.nio.file.Paths;
//can throw an IOException
String filePath = "/path/to/layout.groovy";
String allText = new String(Files.readAllBytes(Paths.get(filePath)),StandardCharsets.UTF_8);

Develop Reference

Java is a programming language and computing platform first released by Sun Microsystems in 1995.

Java sequentially parse information from file - java

Use Regex : .+?(?=AA). Check Here is the Demo

Related

Troubleshooting Java replaceAll

Regex for replacing Exact String match [duplicate]

Java, How can I find a pattern in a File and read the whole line?

How to split a file into several tokens

Java - Groovy : regex parse text block

Categories

Resources