Troubleshooting Java replaceAll - java

I am trying to write a method that takes a string to search for and a replacement string, replaces every occurrence of the search string, and returns the number of replacements made. I am trying to use Pattern and Matcher from Java's regex package. I have a text file called "text.txt" that contains "this is a test this is a test this is a test". When I search for "test" and replace it with "mess", the method returns 1 each time and none of the occurrences of "test" are replaced.
public int findAndRepV2(String word, String replace) throws FileNotFoundException, IOException
{
int cnt = 0;
BufferedReader input = new BufferedReader( new FileReader(this.filename));
Writer fw = new FileWriter("test.txt");
String line = input.readLine();
while (line != null)
{
Pattern pattern = Pattern.compile(word, Pattern.CASE_INSENSITIVE);
Matcher matcher = pattern.matcher(line);
while (matcher.find()) {matcher.replaceAll(replace); cnt++;}
line = input.readLine();
}
fw.close();
return cnt;
}

First, you need to ensure that the text you are searching for is not interpreted as a regex. You should do:
Pattern pattern = Pattern.compile(Pattern.quote(word), Pattern.CASE_INSENSITIVE);
Second, replaceAll does something like this:
public String replaceAll(String replacement) {
reset();
boolean result = find();
if (result) {
StringBuffer sb = new StringBuffer();
do {
appendReplacement(sb, replacement);
result = find();
} while (result);
appendTail(sb);
return sb.toString();
}
return text.toString();
}
Note how it calls find until it can't find anything. This means that your loop will only be run once, since after the first call to replaceAll, the matcher has already found everything.
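A minimal sketch of this behaviour, using the sample text from the question (the class and method names are made up for illustration). After the first replaceAll call the matcher has nothing left to find, so the counter stops at 1, and because the string returned by replaceAll is discarded, nothing is actually replaced:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ReplaceAllLoopDemo {
    // Mirrors the loop from the question and returns its counter.
    static int countLikeTheQuestion(String line, String word, String replace) {
        Matcher matcher = Pattern.compile(word, Pattern.CASE_INSENSITIVE).matcher(line);
        int cnt = 0;
        while (matcher.find()) {
            // replaceAll resets the matcher, consumes every match and
            // returns a NEW string, which is thrown away here.
            matcher.replaceAll(replace);
            cnt++;
        }
        return cnt;
    }

    public static void main(String[] args) {
        String line = "this is a test this is a test this is a test";
        System.out.println(countLikeTheQuestion(line, "test", "mess"));
    }
}
```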
You should use appendReplacement instead:
StringBuffer buffer = new StringBuffer();
while (matcher.find()) {
matcher.appendReplacement(buffer, replace);
cnt++;
}
matcher.appendTail(buffer); // copies the text after the last match
// "buffer" contains the string after the replacement
I noticed that in your method, you didn't actually do anything with the string after the replacement. If that's the case, just count how many times find returns true:
while (matcher.find()) {
cnt++;
}
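Putting both points together, a per-line helper could look like the sketch below. The names ReplaceCounter, Result and replaceAndCount are hypothetical, and the use of Matcher.quoteReplacement is an added assumption that the replacement should be taken literally (so a "$" in it is not read as a group reference). The caller would still have to write the returned text to the output file.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ReplaceCounter {
    // Holds the rewritten line together with the number of replacements.
    static final class Result {
        final String text;
        final int count;

        Result(String text, int count) {
            this.text = text;
            this.count = count;
        }
    }

    static Result replaceAndCount(String line, String word, String replacement) {
        // Pattern.quote makes the search word a literal, not a regex.
        Pattern pattern = Pattern.compile(Pattern.quote(word), Pattern.CASE_INSENSITIVE);
        Matcher matcher = pattern.matcher(line);
        StringBuffer buffer = new StringBuffer();
        int count = 0;
        while (matcher.find()) {
            matcher.appendReplacement(buffer, Matcher.quoteReplacement(replacement));
            count++;
        }
        // Copy whatever follows the last match.
        matcher.appendTail(buffer);
        return new Result(buffer.toString(), count);
    }

    public static void main(String[] args) {
        Result r = replaceAndCount("this is a test this is a test this is a test", "test", "mess");
        System.out.println(r.count + ": " + r.text);
    }
}
```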

Related

Fastest/Most efficient way to parse a document, search for strings and replace them in document with Java

So I have been working on a Java program that scans and parses a number of files, replacing terms (such as func_123) with their readable names.
There are three files that provide definitions, so each file needs to be parsed thrice.
The program loads each definition into a class called Pair and puts that pair into an ArrayList.
Then the program goes through each file line by line and replaces any matched string, creating and running a new thread for each file.
So what would be the fastest/most efficient way to parse, replace and write these changes to the new file?
Below is what I have so far.
Code that parses through each file:
Thread thread = new Thread() {
@Override
public void run() {
try {
File temp = File.createTempFile("temp", "tmp");
BufferedReader br = new BufferedReader(new FileReader(file));
BufferedWriter bw = new BufferedWriter(new FileWriter(temp));
String s = null;
while ((s = br.readLine()) != null) {
s = Deobfuscator2.deobfuscate(s);
bw.write(s);
bw.newLine();
}
bw.close();
br.close();
writeFromFileTo(temp, file);
temp.delete();
} catch (IOException e) {
e.printStackTrace();
}
}
};
Code that decodes each string:
public static String deobfuscate(String s) {
for (Pair<String, String> pair : fieldsMappings) {
s = s.replaceAll(pair.key, pair.value);
}
for (Pair<String, String> pair : methodsMappings) {
s = s.replaceAll(pair.key, pair.value);
}
for (Pair<String, String> pair : paramsMappings) {
s = s.replaceAll(pair.key, pair.value);
}
return s;
}
Pair Class:
public static class Pair <K,V> {
private K key;
private V value;
public Pair(K key, V value) {
this.key = key;
this.value = value;
}
public K getKey() {
return key;
}
public V getValue() {
return value;
}
}
Helper function to copy contents from one file to another:
private void writeFromFileTo(File file1, File file2) throws IOException {
BufferedReader br = new BufferedReader(new FileReader(file1));
BufferedWriter bw = new BufferedWriter(new FileWriter(file2));
String s = null;
while ((s = br.readLine()) != null) {
bw.write(s);
bw.newLine();
}
bw.close();
br.close();
}
I tried to be as clear as possible and give all the relevant code, but if you need/want anything else let me know.
My code works, but my problem is that it seems to take some time doing so and can be pretty resource intensive (if I don't limit the threads) when there are a lot of files to parse. In total there are about 33,000+ (10,000+ each) total definitions that would need to potentially be replaced.
Repeatedly calling replaceAll is expensive, as the regular expressions will be recompiled on every pass, and also you're creating new instances of the string for each replacement. A better approach is to precompile a regexp matching any key, then iterate across the string and replace each found key with the corresponding value:
static Pattern pattern;
static List<String> replacements = new ArrayList<>();
static {
StringBuilder sb = new StringBuilder();
for (List<Pair<String, String>> mapping : Arrays.asList(
fieldsMappings, methodsMappings, paramsMappings)) {
for (Pair<String, String> pair : mapping) {
sb.append("(");
sb.append(pair.key);
sb.append(")|");
replacements.add(Matcher.quoteReplacement(pair.value));
}
}
// Remove trailing "|" character in regexp.
if (sb.length() > 0) {
sb.setLength(sb.length() - 1);
}
pattern = Pattern.compile(sb.toString());
}
public static String deobfuscate(String s) {
StringBuffer sb = new StringBuffer();
Matcher matcher = pattern.matcher(s);
while (matcher.find()) {
// Figure out which key matched and fetch the corresponding replacement.
String replacement = null;
for (int i = 0; i < replacements.size(); i++) {
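// Group 0 is the whole match; key i is captured by group i + 1
// (assuming the keys contain no capturing groups of their own).
if (matcher.group(i + 1) != null) {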
replacement = replacements.get(i);
break;
}
}
if (replacement == null) {
// Should never happen.
throw new RuntimeException("Regexp matched, but no group matched");
}
matcher.appendReplacement(sb, replacement);
}
matcher.appendTail(sb);
return sb.toString();
}
The above code assumes that each key is a regexp. If the keys are instead fixed strings, there's no need to use regexp groups to identify which key matched; you can use a map instead. This would look like:
static Pattern pattern;
static Map<String, String> replacements = new HashMap<>();
static {
StringBuilder sb = new StringBuilder();
for (List<Pair<String, String>> mapping : Arrays.asList(
fieldsMappings, methodsMappings, paramsMappings)) {
for (Pair<String, String> pair : mapping) {
sb.append(Pattern.quote(pair.key));
sb.append("|");
replacements.put(pair.key, Matcher.quoteReplacement(pair.value));
}
}
// Remove trailing "|" character in regexp.
if (sb.length() > 0) {
sb.setLength(sb.length() - 1);
}
pattern = Pattern.compile(sb.toString());
}
public static String deobfuscate(String s) {
StringBuffer sb = new StringBuffer();
Matcher matcher = pattern.matcher(s);
while (matcher.find()) {
matcher.appendReplacement(sb, replacements.get(matcher.group()));
}
matcher.appendTail(sb);
return sb.toString();
}
Note that replacements are quoted with Matcher.quoteReplacement when building the replacement list/map, to ensure replacements are treated literally, since regexp backreferences won't work anyway when building a composite regexp from all the keys. If you depend on backreferences in the replacements, this approach won't work.
Be warned that the code above hasn't been tested (or even compiled).
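One detail worth checking in the fixed-string variant: Java's regex alternation tries the alternatives from left to right, so if one key is a prefix of another, the shorter key can win when it is listed first. The sketch below is a self-contained version that sorts the keys longest-first to avoid that. The class name SinglePassReplacer and the sample keys func_12 and func_123 in the test are hypothetical, and a plain Map stands in for the Pair lists from the question.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class SinglePassReplacer {
    private final Pattern pattern;          // null when there are no keys
    private final Map<String, String> replacements;

    SinglePassReplacer(Map<String, String> mappings) {
        this.replacements = new HashMap<>(mappings);
        List<String> keys = new ArrayList<>(mappings.keySet());
        // Longest keys first, so "func_123" is tried before "func_12".
        keys.sort((a, b) -> Integer.compare(b.length(), a.length()));
        StringBuilder sb = new StringBuilder();
        for (String key : keys) {
            if (sb.length() > 0) {
                sb.append('|');
            }
            sb.append(Pattern.quote(key));
        }
        this.pattern = keys.isEmpty() ? null : Pattern.compile(sb.toString());
    }

    String apply(String s) {
        if (pattern == null) {
            return s;
        }
        Matcher m = pattern.matcher(s);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            // The matched text is itself the key, since keys are literals.
            m.appendReplacement(out, Matcher.quoteReplacement(replacements.get(m.group())));
        }
        m.appendTail(out);
        return out.toString();
    }
}
```

The pattern is compiled once in the constructor, so one instance can be shared across all lines and files.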
The replaceAll() method in String is slow, since the regex Patterns are recompiled for every key on every call. One idea is to cache the compiled Patterns instead of the Strings and then run replaceAll with each of them. That alone should be much faster than the current version.
Another idea is to optimize the scan of s with a prefix trie.
For example, suppose s looks like
'qqq aaa 111 bbb 222 ccc rgege'
and the keys are aaa, bbb and ccc. Your current algorithm then examines the characters of s three times. But if you examine the characters one by one, look them up in the prefix trie, and keep the indices of the matched positions and their values, a single pass over s is enough to know that you must
replace aaa with aaaValue at 4, replace bbb at 12, and replace ccc at 20.
This would probably also improve speed significantly. There are Java libraries for this, like the concurrent-tree jar. If the performance is not as expected, there are practice trie implementations online, including ones built on primitive arrays, which should perform best.
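A minimal sketch of the trie idea, written by hand rather than with the library mentioned above. The class name TrieReplacer and its methods are hypothetical, and it uses a HashMap per node for brevity instead of primitive arrays. At each position it walks the trie as far as the input allows and takes the longest key found. It can re-read a few characters after a failed walk, so it is not a strict single pass (an Aho-Corasick automaton would avoid that).

```java
import java.util.HashMap;
import java.util.Map;

class TrieReplacer {
    private static final class Node {
        final Map<Character, Node> children = new HashMap<>();
        String value; // non-null when a key ends at this node
    }

    private final Node root = new Node();

    void put(String key, String value) {
        Node node = root;
        for (int i = 0; i < key.length(); i++) {
            node = node.children.computeIfAbsent(key.charAt(i), c -> new Node());
        }
        node.value = value;
    }

    String replace(String s) {
        StringBuilder out = new StringBuilder(s.length());
        int i = 0;
        while (i < s.length()) {
            Node node = root;
            String best = null;
            int bestEnd = -1;
            // Walk the trie from position i, remembering the longest key seen.
            for (int j = i; j < s.length(); j++) {
                node = node.children.get(s.charAt(j));
                if (node == null) {
                    break;
                }
                if (node.value != null) {
                    best = node.value;
                    bestEnd = j + 1;
                }
            }
            if (best != null) {
                out.append(best);
                i = bestEnd;
            } else {
                out.append(s.charAt(i));
                i++;
            }
        }
        return out.toString();
    }
}
```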

Java, How can I find a pattern in a File and read the whole line?

I want to find a special charsequence in a file and I want to read the whole line where the occurrences are.
The following code only checks the first line and fetches that (the first) line.
How can I fix it?
Scanner scanner = new Scanner(file);
String output = "";
output = output + scanner.findInLine(pattern) + scanner.next();
pattern and file are parameters
UPDATED ANSWER according to the comments on this very answer
In fact, what is used is Scanner#findWithHorizon, which in turn calls Pattern#compile with a set of flags (Pattern#compile(String, int)).
The result seems to be that this pattern is applied over and over to the input text, line by line; this supposes, of course, that a pattern cannot match across multiple lines at once.
Therefore:
public static final String findInFile(final Path file, final String pattern,
final int flags)
throws IOException
{
final StringBuilder sb = new StringBuilder();
final Pattern p = Pattern.compile(pattern, flags);
String line;
Matcher m;
try (
final BufferedReader br = Files.newBufferedReader(file);
) {
while ((line = br.readLine()) != null) {
m = p.matcher(line);
while (m.find())
sb.append(m.group());
}
}
return sb.toString();
}
For completeness I should add that some time ago I developed a package which allows a text file of arbitrary length to be read as a CharSequence and which can be used to great effect here: https://github.com/fge/largetext. It would work beautifully here since a Matcher matches against a CharSequence, not a String. But this package needs some love.
One example returning a List of matching strings in a file can be:
private static List<String> findLines(final Path path, final String pattern)
throws IOException
{
final Predicate<String> predicate = Pattern.compile(pattern).asPredicate();
try (
final Stream<String> stream = Files.lines(path);
) {
return stream.filter(predicate).collect(Collectors.toList());
}
}
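Staying closer to the Scanner from the question, a sketch that walks every line and keeps the whole line whenever the pattern occurs somewhere in it. The class and method names are hypothetical; the caller is assumed to construct the Scanner (for example from a File) and to close it afterwards.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;
import java.util.regex.Pattern;

class LineFinder {
    // Returns every complete line in which the pattern occurs at least once.
    static List<String> matchingLines(Scanner scanner, Pattern pattern) {
        List<String> result = new ArrayList<>();
        while (scanner.hasNextLine()) {
            String line = scanner.nextLine();
            // find() looks for the pattern anywhere in the line,
            // unlike matches(), which requires the whole line to match.
            if (pattern.matcher(line).find()) {
                result.add(line);
            }
        }
        return result;
    }
}
```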

Java sequentially parse information from file

Let's say I have a file with a structure like this:
Line 0:
354858 Some String That Is Important AA OTHER STUFF SOMESTUFF
THAT SHOULD BE IGNORED
Line 1:
543788 Another String That Is Important AA OTHER STUFF
SOMESTUFF THAT SHOULD BE IGNORED
and so on...
Now I would like to get the information that is marked in my example (see gray background). The sequence AA is always present (and could be used as a break and skip to the next line) while the information string varies in length.
What would be the best way to parse this information? A BufferedReader with if/else logic, or is there some kind of parser that can be told to read a number of length XYZ, then read everything into a String until it finds AA, then skip to the next line?
To tell you which is best for your problem is not possible without more information.
One solution might be
String s = "354858 Some String That Is Important AA OTHER STUFF SOMESTUFF THAT SHOULD BE IGNORED";
String[] split = s.substring(0, s.indexOf(" AA")).split(" ", 2);
System.out.println("split = " + Arrays.toString(split));
output
split = [354858, Some String That Is Important]
You can read the file line by line and exclude the part which contains the AA charSequence:
final String charSequence = "AA";
String line;
BufferedReader r = new BufferedReader(new InputStreamReader(new FileInputStream("yourfilename")));
try {
while ((line = r.readLine()) != null) {
int pos = line.indexOf(charSequence);
if (pos > 0) {
String myImportantStuff = line.substring(0, pos);
//do something with your useful string
}
}
} finally {
r.close();
}
I would read the file line by line and match each line against a regular expression. I hope my comments in the code below will be detailed enough.
// The pattern to use
Pattern p = Pattern.compile("^([0-9]+)\\s+(([^A]|A[^A])+)AA");
// Read file line by line
BufferedReader br = new BufferedReader(new FileReader(myFile));
String line;
while((line = br.readLine()) != null) {
// Match line against our pattern
Matcher m = p.matcher(line);
if(m.find()) {
// Line is valid, process it however you want
// m.group(1) contains the number
// m.group(2) contains the text between number and AA
} else {
// Line has invalid format (pattern does not match)
}
}
Explanation of the regular expression (Pattern) I used:
^([0-9]+)\s+(([^A]|A[^A])+)AA
^ matches the start of the line
([0-9]+) matches any integral number
\s+ matches one or more whitespace characters
(([^A]|A[^A])+) matches one or more characters, each of which is either not an A or is an A followed by a non-A character
AA matches the terminating AA
Update as a reply to comment:
If every line has a preceding | character, the expression looks like this:
^\|([0-9]+)\s+(([^A]|A[^A])+)AA
In Java, you need to escape it like this:
"^\\|([0-9]+)\\s+(([^A]|A[^A])+)AA"
The character | has a special meaning in regular expressions and has to be escaped.
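To see what the two capture groups contain, here is a small sketch that applies the first pattern to one of the sample lines. The class name NumberTextParser is hypothetical. Group 2 still includes the space that precedes AA, so the sketch trims it.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class NumberTextParser {
    private static final Pattern LINE =
            Pattern.compile("^([0-9]+)\\s+(([^A]|A[^A])+)AA");

    // Returns {number, text} or null when the line has a different format.
    static String[] parse(String line) {
        Matcher m = LINE.matcher(line);
        if (!m.find()) {
            return null;
        }
        // group(1) is the leading number, group(2) the text up to "AA".
        return new String[] { m.group(1), m.group(2).trim() };
    }

    public static void main(String[] args) {
        String[] parts = parse("354858 Some String That Is Important AA OTHER STUFF SOMESTUFF");
        System.out.println(parts[0] + " | " + parts[1]);
    }
}
```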
Here is a solution for you:
public static void main(String[] args) {
InputStream source; //select a text source (should be a FileInputStream)
{
String fileContent = "354858 Some String That Is Important AA OTHER STUFF SOMESTUFF THAT SHOULD BE IGNORED\n" +
"543788 Another String That Is Important AA OTHER STUFF SOMESTUFF THAT SHOULD BE IGNORED";
source = new ByteArrayInputStream(fileContent.getBytes(StandardCharsets.UTF_8));
}
try(BufferedReader stream = new BufferedReader(new InputStreamReader(source))) {
Pattern pattern = Pattern.compile("^([0-9]+) (.*?) AA .*$");
while(true) {
String line = stream.readLine();
if(line == null) {
break;
}
Matcher matcher = pattern.matcher(line);
if(matcher.matches()) {
String someNumber = matcher.group(1);
String someText = matcher.group(2);
//do something with someNumber and someText
} else {
throw new ParseException(line, 0);
}
}
} catch (IOException | ParseException e) {
e.printStackTrace(); // TODO ...
}
}
You could use a regular expression, but if you know every line contains AA and you want the content up to AA, you can simply use substring(int, int) to get the part of the line up to AA:
public List<String> read(Path path) throws IOException {
return Files.lines(path)
.map(this::parseLine)
.collect(Collectors.toList());
}
public String parseLine(String line){
int index = line.indexOf("AA");
return line.substring(0,index);
}
Here's the non-Java8 version of read
public List<String> read(Path path) throws IOException {
List<String> content = new ArrayList<>();
try(BufferedReader reader = new BufferedReader(new FileReader(path.toFile()))){
String line;
while((line = reader.readLine()) != null){
content.add(parseLine(line));
}
}
return content;
}
Use the regex .+?(?=AA), which lazily matches everything up to (but not including) the first AA.
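A sketch of how that lookahead expression might be used from Java (the class and method names are hypothetical). The lookahead (?=AA) checks that AA follows without consuming it, so the match ends just before AA; the result is trimmed because the space before AA is part of the match.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LookaheadExtract {
    private static final Pattern BEFORE_AA = Pattern.compile(".+?(?=AA)");

    // Returns the text before the first "AA", or null if there is none.
    static String beforeAA(String line) {
        Matcher m = BEFORE_AA.matcher(line);
        return m.find() ? m.group().trim() : null;
    }

    public static void main(String[] args) {
        System.out.println(beforeAA("543788 Another String That Is Important AA OTHER STUFF"));
    }
}
```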

How to replace a substring without using replace() methods

I am trying to convert a text document to shorthand, without using any of the replace() methods in java. One of the strings I am converting is "the" to "&". The problem is, that I do not know the substring of each word that contains the "the" string. So how do I replace that part of a string without using the replace() method?
Ex: "their" would become "&ir", "together" would become "toge&r"
This is what I have started with,
String the = "the";
Scanner wordScanner = new Scanner(word);
if (wordScanner.contains(the)) {
the = "&";
}
I am just not sure how to go about the replacement.
You could try this :
String word = "your string with the";
word = StringUtils.join(word.split("the", -1), "&"); // StringUtils is from Apache Commons Lang; limit -1 keeps a trailing "the"
Scanner wordScanner = new Scanner(word);
I do not get your usage of Scanner for this, but you can read each character into a buffer (StringBuilder) until you read "the" into the buffer. Once you've done that, you can delete the word and then append the word you want to replace with.
public static void main(String[] args) throws Exception {
String data = "their together the them forever";
String wordToReplace = "the";
String wordToReplaceWith = "&";
Scanner wordScanner = new Scanner(data);
// Using this delimiter to get one character at a time from the scanner
wordScanner.useDelimiter("");
StringBuilder buffer = new StringBuilder();
while (wordScanner.hasNext()) {
buffer.append(wordScanner.next());
// Check if the word you want to replace is in the buffer
int wordToReplaceIndex = buffer.indexOf(wordToReplace);
if (wordToReplaceIndex > -1) {
// Delete the word you don't want in the buffer
buffer.delete(wordToReplaceIndex, wordToReplaceIndex + wordToReplace.length());
// Append the word to replace the deleted word with
buffer.append(wordToReplaceWith);
}
}
// Output results
System.out.println(buffer);
}
Results:
&ir toge&r & &m forever
This can be done without a Scanner using just a while loop and StringBuilder
public static void main(String[] args) throws Exception {
String data = "their together the them forever";
StringBuilder buffer = new StringBuilder(data);
String wordToReplace = "the";
String wordToReplaceWith = "&";
int wordToReplaceIndex = -1;
while ((wordToReplaceIndex = buffer.indexOf(wordToReplace)) > -1) {
buffer.delete(wordToReplaceIndex, wordToReplaceIndex + wordToReplace.length());
buffer.insert(wordToReplaceIndex, wordToReplaceWith);
}
System.out.println(buffer);
}
Results:
&ir toge&r & &m forever
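A variant of the same idea, sketched under the assumption that String.indexOf and StringBuilder are allowed: it copies the text into a new builder while scanning forward with indexOf(target, from). Because it never re-scans text it has already written, it also terminates when the replacement itself contains the target (the delete/insert loop above would not). The class and method names are hypothetical.

```java
class ManualReplace {
    static String replaceAllOccurrences(String text, String target, String replacement) {
        if (target.isEmpty()) {
            return text; // nothing sensible to search for
        }
        StringBuilder out = new StringBuilder();
        int from = 0;
        int idx;
        while ((idx = text.indexOf(target, from)) >= 0) {
            out.append(text, from, idx);   // text before the match
            out.append(replacement);       // the substitute
            from = idx + target.length();  // continue after the match
        }
        out.append(text, from, text.length()); // remainder after the last match
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(replaceAllOccurrences("their together the them forever", "the", "&"));
    }
}
```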
You can use Pattern and Matcher Regex:
Pattern pattern = Pattern.compile("the ");
Matcher matcher = pattern.matcher("the cat and their owners");
StringBuffer sb = new StringBuffer();
while(matcher.find()){
matcher.appendReplacement(sb, "& ");
}
matcher.appendTail(sb);
System.out.println(sb.toString());

How do you create an array to store regex string matches in java?

I'm trying to take a file that stores data of this form:
Name=”Biscuit”
LatinName=”Retrieverus Aurum”
ImageFilename=”Biscuit.png”
DNA=”ITAYATYITITIAAYI”
and read it with a regex to locate the useful information; namely, the fields and their contents.
I have created the regex already, but I can only seem to get one match at any given time, and would like instead to put each of the matches from each line in the file in its own index of a string array.
Here's what I have so far:
Scanner scanFile = new Scanner(file);
while (scanFile.hasNextLine()){
System.out.println(scanFile.findInLine(".*"));
scanFile.nextLine();
}
MatchResult result = null;
scanFile.findInLine(Constants.ANIMAL_INFO_REGEX);
result = scanFile.match();
for (int i=1; i<=result.groupCount(); i++){
System.out.println(result.group(i));
System.out.println(result.groupCount());
}
scanFile.close();
MySpecies species = new MySpecies(null, null, null, null);
return species;
Thanks so much for your help!
I hope I understand your question correctly... Here is an example that is coming from the Oracle website:
/*
* This code writes "one dog, two dogs in the yard"
* to the standard-output stream:
*/
import java.util.regex.*;
public class Replacement {
public static void main(String[] args)
throws Exception {
// Create a pattern to match cat
Pattern p = Pattern.compile("cat");
// Create a matcher with an input string
Matcher m = p.matcher("one cat," +
" two cats in the yard");
StringBuffer sb = new StringBuffer();
boolean result = m.find();
// Loop through and create a new String
// with the replacements
while(result) {
m.appendReplacement(sb, "dog");
result = m.find();
}
// Add the last segment of input to
// the new String
m.appendTail(sb);
System.out.println(sb.toString());
}
}
Hope this helps...
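For the actual question of collecting every field from the file, a sketch along these lines may be closer. Since Constants.ANIMAL_INFO_REGEX is not shown, the pattern below is a hypothetical stand-in that accepts a field name, an equals sign, and a value in either straight or curly double quotes (the sample data is displayed with curly ones). The class name FieldParser is also made up.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class FieldParser {
    // Hypothetical pattern: name, '=', value in straight or curly double quotes.
    private static final Pattern FIELD = Pattern.compile(
            "(\\w+)=[\"\u201C\u201D]([^\"\u201C\u201D]*)[\"\u201C\u201D]");

    // Collects field name -> value, keeping the order in which fields appear.
    static Map<String, String> parse(List<String> lines) {
        Map<String, String> fields = new LinkedHashMap<>();
        for (String line : lines) {
            Matcher m = FIELD.matcher(line);
            while (m.find()) {
                fields.put(m.group(1), m.group(2));
            }
        }
        return fields;
    }

    public static void main(String[] args) {
        Map<String, String> fields = parse(Arrays.asList(
                "Name=\"Biscuit\"", "LatinName=\"Retrieverus Aurum\""));
        // If a plain array is needed, the values can be copied out.
        String[] values = fields.values().toArray(new String[0]);
        System.out.println(fields + " " + values.length);
    }
}
```

The lines could come from the Scanner loop in the question, adding each scanFile.nextLine() to a list before calling parse.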
