Regex for replacing Exact String match [duplicate] - java

My input:
1. end
2. end of the day or end of the week
3. endline
4. something
5. "something" end
Based on the above discussions, If I try to replace a single string using this snippet, it removes the appropriate words from the line successfully
public class DeleteTest {
public static void main(String[] args) {
// TODO Auto-generated method stub
try {
File file = new File("C:/Java samples/myfile.txt");
File temp = File.createTempFile("myfile1", ".txt", file.getParentFile());
String delete="end";
BufferedReader reader = new BufferedReader(new InputStreamReader(new FileInputStream(file)));
PrintWriter writer = new PrintWriter(new OutputStreamWriter(new FileOutputStream(temp)));
for (String line; (line = reader.readLine()) != null;) {
line = line.replaceAll("\\b"+delete+"\\b", "");
writer.println(line);
}
reader.close();
writer.close();
}
catch (Exception e) {
System.out.println("Something went Wrong");
}
}
}
My output If I use the above snippet:(Also my expected output)
1.
2. of the day or of the week
3. endline
4. something
5. "something"
But when I include more words to delete, and for that purpose when I use Set, I use the below code snippet:
public static void main(String[] args) {
// TODO Auto-generated method stub
try {
File file = new File("C:/Java samples/myfile.txt");
File temp = File.createTempFile("myfile1", ".txt", file.getParentFile());
BufferedReader reader = new BufferedReader(new InputStreamReader(new FileInputStream(file)));
PrintWriter writer = new PrintWriter(new OutputStreamWriter(new FileOutputStream(temp)));
Set<String> toDelete = new HashSet<>();
toDelete.add("end");
toDelete.add("something");
for (String line; (line = reader.readLine()) != null;) {
line = line.replaceAll("\\b"+toDelete+"\\b", "");
writer.println(line);
}
reader.close();
writer.close();
}
catch (Exception e) {
System.out.println("Something went Wrong");
}
}
I get my output as: (It just removes the space)
1. end
2. endofthedayorendoftheweek
3. endline
4. something
5. "something" end
Can u guys help me on this?
Click here to follow the thread

You need to create an alternation group out of the set with
String.join("|", toDelete)
and use as
line = line.replaceAll("\\b(?:"+String.join("|", toDelete)+")\\b", "");
The pattern will look like
\b(?:end|something)\b
See the regex demo. Here, (?:...) is a non-capturing group that is used to group several alternatives without creating a memory buffer for the capture (you do not need it since you remove the matches).
Or, better, compile the regex before entering the loop:
Pattern pat = Pattern.compile("\\b(?:" + String.join("|", toDelete) + ")\\b");
...
line = pat.matcher(line).replaceAll("");
UPDATE:
To allow matching whole "words" that may contain special chars, you need to Pattern.quote those words to escape those special chars, and then you need to use unambiguous word boundaries, (?<!\w) instead of the initial \b to make sure there is no word char before and (?!\w) negative lookahead instead of the final \b to make sure there is no word char after the match.
In Java 8, you may use this code:
Set<String> nToDel = new HashSet<>();
nToDel = toDelete.stream()
.map(Pattern::quote)
.collect(Collectors.toCollection(HashSet::new));
String pattern = "(?<!\\w)(?:" + String.join("|", nToDel) + ")(?!\\w)";
The regex will look like (?<!\w)(?:\Q+end\E|\Qsomething-\E)(?!\w). Note that the symbols between \Q and \E is parsed as literal symbols.

The problem is that you're not creating the correct regex for replacing the words in the set.
"\\b"+toDelete+"\\b" will produce this String \b[end, something]\b which is not what you need.
To fix that you can do something like this:
for(String del : toDelete){
line = line.replaceAll("\\b"+del+"\\b", "");
}
What this does is to go through the set, produce a regex from each word and remove that word from the line String.
Another approach will be to produce a single regex from all the words in the set.
Eg:
String regex = "";
for(String word : toDelete){
regex+=(regex.isEmpty() ? "" : "|") + "(\\b"+word+"\\b)";
}
....
line = line.replace(regex, "");
This should produce a regex that looks something like this: (\bend\b)|(\bsomething\b)

Related

Java - Reading a text file when a certain sequence occurs

I haven't been able to find a way to read from a .txt file when a certain sequence occurs.
This is how an entry from my file looks like:
&1551:John:Packard:83:Heavy:Blonde&
I want my file to be read from &1551 (1551 is the unique ID number of the user) until the next "&". Do you guys have any suggestions as to how to accomplish this? The ":" is later used for splitting the string.
Thanks!
A simple JDK Scanner has the ability to read a file stopping at certain patterns:
public String findWithinHorizon(String pattern,
int horizon)
Attempts to find the next occurrence of the specified pattern.
public Scanner skip(Pattern pattern)
Skips input that matches the specified pattern, ignoring delimiters. This method will skip input if an anchored match of the specified pattern succeeds.
If a match to the specified pattern is not found at the current position, then no input is skipped and a NoSuchElementException is thrown.
So this should be enough:
// skip anything up to "$1551:" (but keep "1551:" for next read)
Pattern toSkip = Pattern.compile(".*?\\$(?=1511:)", Pattern.DOTALL);
sc.skip(toSkip);
// get everything starting at the "1551:" up to a "$" sign on same line
String line = sc.findWithinHorizon(".*(?=\\$)", 0);
If end of lines can be included between the $ signs, then you should compile the pattern with the DOTALL flag as I did for toSkip.
Firstly you will have to get input from string without staring & and ending &,
then split string by :
So, Assuming all inputs will be in new line below code should work,
public static void main(String args[]) {
BufferedReader reader = null;
try {
reader = new BufferedReader(new FileReader("D://test.txt"));
String sCurrentLine;
String[] fields;
while ((sCurrentLine = reader.readLine()) != null) {
sCurrentLine = sCurrentLine.substring(sCurrentLine.indexOf('&') + 1);
sCurrentLine = sCurrentLine.substring(0, sCurrentLine.indexOf('&'));
fields = sCurrentLine.split(":");
for (String tmp : fields)
System.out.println(tmp);
}
} catch (Exception e) {
System.out.println("Error in accepting String");
} finally {
try {
reader.close();
} catch (IOException e) {
e.printStackTrace();
}
}
}
Hope this helps.

How to determine the delimiter in CSV file

I have a scenario at which i have to parse CSV files from different sources, the parsing code is very simple and straightforward.
String csvFile = "/Users/csv/country.csv";
String line = "";
String cvsSplitBy = ",";
try (BufferedReader br = new BufferedReader(new FileReader(csvFile))) {
while ((line = br.readLine()) != null) {
// use comma as separator
String[] country = line.split(cvsSplitBy);
System.out.println("Country [code= " + country[4] + " , name=" + country[5] + "]");
}
} catch (IOException e) {
e.printStackTrace();
}
my problem come from the CSV delimiter character, i have many different formats, some time it is a , sometimes it is a ;
is there is any way to determine the delimiter character before parsing the file
univocity-parsers supports automatic detection of the delimiter (also line endings and quotes). Just use it instead of fighting with your code:
CsvParserSettings settings = new CsvParserSettings();
settings.detectFormatAutomatically();
CsvParser parser = new CsvParser(settings);
List<String[]> rows = parser.parseAll(new File("/path/to/your.csv"));
// if you want to see what it detected
CsvFormat format = parser.getDetectedFormat();
Disclaimer: I'm the author of this library and I made sure all sorts of corner cases are covered. It's open source and free (Apache 2.0 license)
Hope this helps.
Yes, but only if the delimiter characters are not allowed to exist as regular text
The most simple answer is to have a list with all the available delimiter characters and try to identify which character is being used. Even though, you have to place some limitations on the files or the person/people that created them. Look a the following two scenarios:
Case 1 - Contents of file.csv
test,test2,test3
Case 2 - Contents of file.csv
test1|test2,3|test4
If you have prior knowledge of the delimiter characters, then you would split the first string using , and the second one using |, getting the same result. But, if you try to identify the delimiter by parsing the file, both strings can be split using the , character, and you would end up with this:
Case 1 - Result of split using ,
test1
test2
test3
Case 2 - Result of split using ,
test1|test2
3|test4
By lacking the prior knowledge of which delimiter character is being used, you cannot create a "magical" algorithm that will parse every combination of text; even regular expressions or counting the number of appearance of a character will not save you.
Worst case
test1,2|test3,4|test5
By looking the text, one can tokenize it by using | as the delimiter. But the frequency of appearance of both , and | are the same. So, from an algorithm's perspective, both results are accurate:
Correct result
test1,2
test3,4
test5
Wrong result
test1
2|test3
4|test5
If you pose a set of guidelines or you can somehow control the generation of the CSV files, then you could just try to find the delimiter used with String.contains() method, employing the aforementioned list of characters. For example:
public class MyClass {
private List<String> delimiterList = new ArrayList<>(){{
add(",");
add(";");
add("\t");
// etc...
}};
private static String determineDelimiter(String text) {
for (String delimiter : delimiterList) {
if(text.contains(delimiter)) {
return delimiter;
}
}
return "";
}
public static void main(String[] args) {
String csvFile = "/Users/csv/country.csv";
String line = "";
String cvsSplitBy = ",";
String delimiter = "";
boolean firstLine = true;
try (BufferedReader br = new BufferedReader(new FileReader(csvFile))) {
while ((line = br.readLine()) != null) {
if(firstLine) {
delimiter = determineDelimiter(line);
if(delimiter.equalsIgnoreCase("")) {
System.out.println("Unsupported delimiter found: " + delimiter);
return;
}
firstLine = false;
}
// use comma as separator
String[] country = line.split(delimiter);
System.out.println("Country [code= " + country[4] + " , name=" + country[5] + "]");
}
} catch (IOException e) {
e.printStackTrace();
}
}
}
Update
For a more optimized way, in determineDelimiter() method instead of the for-each loop, you can employ regular expressions.
If the delimiter can appear in a data column, then you are asking for the impossible. For example, consider this first line of a CSV file:
one,two:three
This could be either a comma-separated or a colon-separated file. You can't tell which type it is.
If you can guarantee that the first line has all its columns surrounded by quotes, for example if it's always this format:
"one","two","three"
then you may be able to use this logic (although it's not 100% bullet-proof):
if (line.contains("\",\""))
delimiter = ',';
else if (line.contains("\";\""))
delimiter = ';';
If you can't guarantee a restricted format like that, then it would be better to pass the delimiter character as a parameter.
Then you can read the file using a widely-known open-source CSV parser such as Apache Commons CSV.
While I agree with Lefteris008 that it is not possible to have the function that correctly determine all the cases, we can have a function that is both efficient and give mostly correct result in practice.
def head(filename: str, n: int):
try:
with open(filename) as f:
head_lines = [next(f).rstrip() for x in range(n)]
except StopIteration:
with open(filename) as f:
head_lines = f.read().splitlines()
return head_lines
def detect_delimiter(filename: str, n=2):
sample_lines = head(filename, n)
common_delimiters= [',',';','\t',' ','|',':']
for d in common_delimiters:
ref = sample_lines[0].count(d)
if ref > 0:
if all([ ref == sample_lines[i].count(d) for i in range(1,n)]):
return d
return ','
My efficient implementation is based on
Prior knowledge such as list of common delimiter you often work with ',;\t |:' , or even the likely hood of the delimiter to be used so that I often put the regular ',' on the top of the list
The frequency of the delimiter appear in each line of the text file are equal. This is to resolve the problem that if we read a single line and see the frequency to be equal (false detection as Lefteris008) or even the right delimiter to appear less frequent as the wrong one in the first line
The efficient implementation of a head function that read only first n lines from the file
As you increase the number of test sample n, the likely hood that you get a false answer reduce drastically. I often found n=2 to be adequate
Add a condition like this,
String [] country;
if(line.contains(",")
country = line.split(",");
else if(line.contains(";"))
country=line.split(";");
That depends....
If your datasets are always the same length and/or the separator NEVER occurs in your datacolumns, you could just read the first line of the file, look at it for the longed for separator, set it and then read the rest of the file using that separator.
Something like
String csvFile = "/Users/csv/country.csv";
String line = "";
String cvsSplitBy = ",";
try (BufferedReader br = new BufferedReader(new FileReader(csvFile))) {
while ((line = br.readLine()) != null) {
// use comma as separator
if (line.contains(",")) {
cvsSplitBy = ",";
} else if (line.contains(";")) {
cvsSplitBy = ";";
} else {
System.out.println("Wrong separator!");
}
String[] country = line.split(cvsSplitBy);
System.out.println("Country [code= " + country[4] + " , name=" + country[5] + "]");
}
} catch (IOException e) {
e.printStackTrace();
}
Greetz Kai

searching in text file specific words using java

I've a huge text file, I'd like to search for specific words and print three or more then this number OF THE WORDS AFTER IT so far I have done this
public static void main(String[] args) {
String fileName = "C:\\Users\\Mishari\\Desktop\\Mesh.txt";
String line = null;
try {
FileReader fileReader =
new FileReader(fileName);
BufferedReader bufferedReader =
new BufferedReader(fileReader);
while((line = bufferedReader.readLine()) != null) {
System.out.println(line);
}
bufferedReader.close();
} catch(FileNotFoundException ex) {
System.out.println(
"Unable to open file '" +
fileName + "'");
} catch(IOException ex) {
System.out.println(
"Error reading file '"
+ fileName + "'");
}
}
It's only for printing the file can you advise me what's the best way of doing it.
You can look for the index of word in line using this method.
int index = line.indexOf(word);
If the index is -1 then that word does not exist.
If it exist than takes the substring of line starting from that index till the end of line.
String nextWords = line.substring(index);
Now use String[] temp = nextWords.split(" ") to get all the words in that substring.
while((line = bufferedReader.readLine()) != null) {
System.out.println(line);
if (line.contains("YOUR_SPECIFIC_WORDS")) { //do what you need here }
}
By the sounds of it what you appear to be looking for is a basic Find & Replace All mechanism for each file line that is read in from file. In other words, if the current file line that is read happens to contain the Word or phrase you would like to add words after then replace that found word with the very same word plus the other words you want to add. In a sense it would be something like this:
String line = "This is a file line.";
String find = "file"; // word to find in line
String replaceWith = "file (plus this stuff)"; // the phrase to change the found word to.
line = line.replace(find, replaceWith); // Replace any found words
System.out.println(line);
The console output would be:
This is a file (plus this stuff) line.
The main thing here though is that you only want to deal with actual words and not the same phrase within another word, for example the word "and" and the word "sand". You can clearly see that the characters that make up the word 'and' is also located in the word 'sand' and therefore it too would be changed with the above example code. The String.contains() method also locates strings this way. In most cases this is undesirable if you want to specifically deal with whole words only so a simple solution would be to use a Regular Expression (RegEx) with the String.replaceAll() method. Using your own code it would look something like this:
String fileName = "C:\\Users\\Mishari\\Desktop\\Mesh.txt";
String findPhrase = "and"; //Word or phrase to find and replace
String replaceWith = findPhrase + " (adding this)"; // The text used for the replacement.
boolean ignoreLetterCase = false; // Change to true to ignore letter case
String line = "";
try {
FileReader fileReader = new FileReader(fileName);
BufferedReader bufferedReader = new BufferedReader(fileReader);
while ((line = bufferedReader.readLine()) != null) {
if (ignoreLetterCase) {
line = line.toLowerCase();
findPhrase = findPhrase.toLowerCase();
}
if (line.contains(findPhrase)) {
line = line.replaceAll("\\b(" + findPhrase + ")\\b", replaceWith);
}
System.out.println(line);
}
bufferedReader.close();
} catch (FileNotFoundException ex) {
System.out.println("Unable to open file: '" + fileName + "'");
} catch (IOException ex) {
System.out.println("Error reading file: '" + fileName + "'");
}
You will of course notice the escaped \b word boundary Meta Characters within the regular expression used in the String.replaceAll() method specifically in the line:
line = line.replaceAll("\\b(" + findPhrase + ")\\b", replaceWith);
This allows us to deal with whole words only.

Java - Groovy : regex parse text block

I know that this is a common question and I've been through a lot of forums to figure out whats the problem in my code.
I have to read a text file with several blocks in the following format:
import com.myCompanyExample.gui.Layout
/*some comments here*/
#Layout
LayoutModel currentState() {
MyBuilder builder = new MyBuilder()
form example
title form{
row_1
row_1
row_n
}
return build.get()
}
#Layout
LayoutModel otherState() {
....
....
return build.get()
}
I have this code to read all the file and I'd like to extract each block between the keyword "#Layout" and the keyword "return". I need also to catch all newline so later I'll be able to split each matched block into a list
private void myReadFile(File fileLayout){
String line = null;
StringBuilder allText = new StringBuilder();
try{
FileReader fileReader = new FileReader(fileLayout);
BufferedReader bufferedReader = new BufferedReader(fileReader);
while((line = bufferedReader.readLine()) != null) {
allText.append(line)
}
bufferedReader.close();
}
catch(FileNotFoundException ex) {
System.out.println("Unable to open file");
}
catch(IOException ex) {
System.out.println("Error reading file");
}
Pattern pattern = Pattern.compile("(?s)#Layout.*?return",Pattern.DOTALL);
Matcher matcher = pattern.matcher(allText);
while(matcher.find()){
String [] layoutBlock = (matcher.group()).split("\\r?\\n")
for(index in layoutBlock){
//check each line of the current block
}
}
layoutBlock returns size=1
I think this can potentially be a so called XY problem anyway...if the groovy source is composed only by #Layout annotated blocks of code you can use a tempered greedy token to select till the next annotation (view online demo).
Change the pattern loc as this:
Pattern pattern = Pattern.compile( "#Layout(?:(?!#Layout).)*", Pattern.DOTALL );
PS: the dotall flag (?s) inside the regex and the parameter Pattern.DOTALL do the same thing (enable the so called multiline mode), use only one of them indifferently.
UPDATE
I tried your code, the problem (preserving newline) is in the method you use to slurp the file (bufferedReader.readline() remove the newline at the end of the string).
Simply readd a newline when append to allText:
String ln = System.lineSeparator();
while((line = bufferedReader.readLine()) != null) {
allText.append(line + ln);
}
Or you can replace all the code to slurp the file with this:
import java.nio.file.Files;
import java.nio.file.Paths;
//can throw an IOException
String filePath = "/path/to/layout.groovy";
String allText = new String(Files.readAllBytes(Paths.get(filePath)),StandardCharsets.UTF_8);

Java sequentially parse information from file

lets say I have a file with a structure like this:
Line 0:
354858 Some String That Is Important AA OTHER STUFF SOMESTUFF
THAT SHOULD BE IGNORED
Line 1:
543788 Another String That Is Important AA OTHER STUFF
SOMESTUFF THAT SHOULD BE IGNORED
and so on...
Now I would like to get the information that is marked in my example (see gray background). The sequence AA is always present (and could be used as a break and skip to the next line) while the information string varies in length.
What will be the best way to parse the information? A buffered reader with if, then, else or is there some kind of parser that you can tell, read a number of lenth XYZ then read everything into a String until you find AA then skip line.
To tell you which is best for your problem is not possible without more information.
One solution might be
String s = "354858 Some String That Is Important AA OTHER STUFF SOMESTUFF THAT SHOULD BE IGNORED";
String[] split = s.substring(0, s.indexOf(" AA")).split(" ", 2);
System.out.println("split = " + Arrays.toString(split));
output
split = [354858, Some String That Is Important]
You can read the file line by line and exclude the part which contains the AA charSequence:
final String charSequence = "AA";
String line;
BufferedReader r = new BufferedReader(new InputStreamReader(new FileInputStream("yourfilename")));
try {
while ((line = r.readLine()) != null) {
int pos = line.indexOf(charSequence);
if (pos > 0) {
String myImportantStuff = line.substring(0, pos);
//do something with your useful string
}
}
} finally {
r.close();
}
I would read the file line by line and match each line against a regular expression. I hope my comments in the code below will be detailed enough.
// The pattern to use
Pattern p = Pattern.compile("^([0-9]+)\\s+(([^A]|A[^A])+)AA");
// Read file line by line
BufferedReader br = new BufferedReader(new FileReader(myFile));
String line;
while((line = br.readLine()) != null) {
// Match line against our pattern
Matcher m = p.matcher(line);
if(m.find()) {
// Line is valid, process it however you want
// m.group(1) contains the number
// m.group(2) contains the text between number and AA
} else {
// Line has invalid format (pattern does not match)
}
}
Explanation of the regular expression (Pattern) I used:
^([0-9]+)\s+(([^A]|A[^A])+)AA
^ matches the start of the line
([0-9]+) matches any integral number
\s+ matches one or more whitespace characters
(([^A]|A[^A])+) matches any characters which are either not A or not followed by another A
AA matches the terminating AA
Update as a reply to comment:
If every line has a preceding | character, the expression looks like this:
^\|([0-9]+)\s+(([^A]|A[^A])+)AA
In JAVA, you need to escape it like this:
"^\\|([0-9]+)\\s+(([^A]|A[^A])+)AA"
The character | has a special meaning in regular expressions and has to be escaped.
Here is a solution for you:
public static void main(String[] args) {
InputStream source; //select a text source (should be a FileInputStream)
{
String fileContent = "354858 Some String That Is Important AA OTHER STUFF SOMESTUFF THAT SHOULD BE IGNORED\n" +
"543788 Another String That Is Important AA OTHER STUFF SOMESTUFF THAT SHOULD BE IGNORED";
source = new ByteArrayInputStream(fileContent.getBytes(StandardCharsets.UTF_8));
}
try(BufferedReader stream = new BufferedReader(new InputStreamReader(source))) {
Pattern pattern = Pattern.compile("^([0-9]+) (.*?) AA .*$");
while(true) {
String line = stream.readLine();
if(line == null) {
break;
}
Matcher matcher = pattern.matcher(line);
if(matcher.matches()) {
String someNumber = matcher.group(1);
String someText = matcher.group(2);
//do something with someNumber and someText
} else {
throw new ParseException(line, 0);
}
}
} catch (IOException | ParseException e) {
e.printStackTrace(); // TODO ...
}
}
You could use a regular expression, but if you know every line contains AA and you want the content up to AA you could can simply do substring(int,int) to get the part of the line up to AA
public List read(Path path) throws IOException {
return Files.lines(path)
.map(this::parseLine)
.collect(Collectors.toList());
}
public String parseLine(String line){
int index = line.indexOf("AA");
return line.substring(0,index);
}
Here's the non-Java8 version of read
public List read(Path path) throws IOException {
List<String> content = new ArrayList<>();
try(BufferedReader reader = new BufferedReader(new FileReader(path.toFile()))){
String line;
while((line = reader.readLine()) != null){
content.add(parseLine(line));
}
}
return content;
}
Use Regex : .+?(?=AA).
Check Here is the Demo

Categories

Resources