I'm using a regex to search for a very specific pattern against a directory that's only about 106 MB in size. It takes about 10 seconds to complete.
Is there anything that I can do to improve the performance?
package com.JFileReader;

import java.io.BufferedReader;
import java.io.File;
import java.io.FileReader;
import java.io.IOException;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class FileData {

    public static void main(String[] args) {
        File dir = new File("/Users/me/Desktop/");
        if (dir.isFile()) { handleFile(dir); }
        if (dir.isDirectory()) { handleDir(dir); }
    }

    public static void handleFile(File aFile) {
        String regex = "[a-zA-Z]+[.][a-zA-Z]+[#][a-zA-Z]+[.][a-zA-Z]+";
        Pattern pattern = Pattern.compile(regex);
        try {
            BufferedReader br = new BufferedReader(new FileReader(aFile));
            Matcher m;
            String line;
            while ((line = br.readLine()) != null) {
                m = pattern.matcher(line);
                if (m.find()) {
                    System.out.println("Found: " + aFile);
                }
            }
            br.close();
        } catch (IOException ioe) {
            ioe.printStackTrace();
        }
    }

    public static void handleDir(File dir) {
        for (File file : dir.listFiles()) {
            if (file.isFile()) { handleFile(file); }
            if (file.isDirectory()) { handleDir(file); }
        }
    }
}
You can use possessive quantifiers:
String regex = "[a-zA-Z]++\\.[a-zA-Z]++#[a-zA-Z]++\\.[a-zA-Z]++";
When you use possessive quantifiers, the regex engine doesn't record backtracking positions and never goes back to try other possibilities when a match fails. This can save a lot of time on lines that almost match.
Compiling your regex pattern repeatedly (once per file) is a relatively expensive waste.
You could compile it once and keep using the same instance.
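For example, a minimal sketch of handleFile with the pattern hoisted into a static final field (the class name FastFileData is just for illustration, and the early return is an added assumption that one match per file is enough to report it):

import java.io.BufferedReader;
import java.io.File;
import java.io.FileReader;
import java.io.IOException;
import java.util.regex.Pattern;

public class FastFileData {
    // Compiled once when the class is loaded, then reused for every file.
    private static final Pattern PATTERN =
            Pattern.compile("[a-zA-Z]++\\.[a-zA-Z]++#[a-zA-Z]++\\.[a-zA-Z]++");

    public static void handleFile(File aFile) {
        try (BufferedReader br = new BufferedReader(new FileReader(aFile))) {
            String line;
            while ((line = br.readLine()) != null) {
                if (PATTERN.matcher(line).find()) {
                    System.out.println("Found: " + aFile);
                    return; // stop scanning this file after the first hit
                }
            }
        } catch (IOException ioe) {
            ioe.printStackTrace();
        }
    }
}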
Here is a code snippet from my main Java function:
try (MultiFileReader multiReader = new MultiFileReader(inputs)) {
    PriorityQueue<WordEntry> words = new PriorityQueue<>();
    for (BufferedReader reader : multiReader.getReaders()) {
        String word = reader.readLine();
        if (word != null) {
            words.add(new WordEntry(word, reader));
        }
    }
}
Here is how I get my BufferedReader readers from another Java file:
public List<BufferedReader> getReaders() {
    return Collections.unmodifiableList(readers);
}
But for some reason, when I run my code, here is what I get:
The error happens exactly at the line where I wrote String word = reader.readLine(); and what's weird is that reader.readLine() is not null; in fact, multiReader.getReaders() returns a list of 100 objects (readers for files read from a directory). I would like some help solving this issue.
I posted where the issue is; now let me give a broader view of my code. To run it, it suffices to compile it under the src/ directory with javac *.java and run java MergeShards shards/ sorted.txt, provided that shards/ is present under src/ and contains .txt files in my scenario.
This is MergeShards.java where I have my main function:
import java.io.BufferedReader;
import java.io.Writer;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.Objects;
import java.util.PriorityQueue;
import java.util.stream.Collectors;

public final class MergeShards {

    public static void main(String[] args) throws Exception {
        if (args.length != 2) {
            System.out.println("Usage: MergeShards [input folder] [output file]");
            return;
        }
        List<Path> inputs = Files.walk(Path.of(args[0]), 1).skip(1).collect(Collectors.toList());
        Path outputPath = Path.of(args[1]);
        try (MultiFileReader multiReader = new MultiFileReader(inputs)) {
            PriorityQueue<WordEntry> words = new PriorityQueue<>();
            for (BufferedReader reader : multiReader.getReaders()) {
                String word = reader.readLine();
                if (word != null) {
                    words.add(new WordEntry(word, reader));
                }
            }
            try (Writer writer = Files.newBufferedWriter(outputPath)) {
                while (!words.isEmpty()) {
                    WordEntry entry = words.poll();
                    writer.write(entry.word);
                    writer.write(System.lineSeparator());
                    String word = entry.reader.readLine();
                    if (word != null) {
                        words.add(new WordEntry(word, entry.reader));
                    }
                }
            }
        }
    }

    private static final class WordEntry implements Comparable<WordEntry> {
        private final String word;
        private final BufferedReader reader;

        private WordEntry(String word, BufferedReader reader) {
            this.word = Objects.requireNonNull(word);
            this.reader = Objects.requireNonNull(reader);
        }

        @Override
        public int compareTo(WordEntry other) {
            return word.compareTo(other.word);
        }
    }
}
This is my MultiFileReader.java file:
import java.io.BufferedReader;
import java.io.Closeable;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public final class MultiFileReader implements Closeable {

    private final List<BufferedReader> readers;

    public MultiFileReader(List<Path> paths) {
        readers = new ArrayList<>(paths.size());
        try {
            for (Path path : paths) {
                readers.add(Files.newBufferedReader(path));
            }
        } catch (IOException e) {
            e.printStackTrace();
        } finally {
            close();
        }
    }

    public List<BufferedReader> getReaders() {
        return Collections.unmodifiableList(readers);
    }

    @Override
    public void close() {
        for (BufferedReader reader : readers) {
            try {
                reader.close();
            } catch (Exception ignored) {
            }
        }
    }
}
The finally block in your constructor closes all of your readers as soon as they have been opened, so the very first readLine() afterwards fails with an IOException ("Stream closed"). Remove that.
public MultiFileReader(List<Path> paths) {
    readers = new ArrayList<>(paths.size());
    try {
        for (Path path : paths) {
            readers.add(Files.newBufferedReader(path));
        }
    } catch (IOException e) {
        e.printStackTrace();
    } /* Not this. finally {
        close();
    } */
}
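If you still want the constructor to clean up after itself, a common pattern is to close the already-opened readers only on the failure path and rethrow, so the caller sees the original error (a sketch, assuming you are free to let the constructor throw IOException):

public MultiFileReader(List<Path> paths) throws IOException {
    readers = new ArrayList<>(paths.size());
    try {
        for (Path path : paths) {
            readers.add(Files.newBufferedReader(path));
        }
    } catch (IOException e) {
        // Opening failed partway: release the readers opened so far,
        // then propagate the original error to the caller.
        close();
        throw e;
    }
}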
I'm a Java beginner and I'm doing a small project about a dictionary. I want to save each word and its translation to a file. Translations in my native language often contain spaces (for example, chicken is "con gà"), so I can't separate the two fields with a single space; instead, each line holds a word and its translation separated by a tab (which looks like multiple spaces), e.g. "chicken    con gà". I want to read the two fields of each line and store them in my array of Words, which I created before, so I want to do something like:
w1 = word1inline;
w2 = word2inline;
Word(word1inline, word2inline); // this becomes a member of the array
Please help me, thanks a lot. I know how to read a line from a text file and use split to get words, but I'm not sure how to split on the tab separator.
import java.io.BufferedReader;
import java.io.File;
import java.io.FileReader;
import java.io.IOException;
import java.util.ArrayList;

public class docfile {
    public static void main(String[] args)
    {
        String readLine;
        ArrayList<String> str = new ArrayList<>(String);
        try {
            File file = new File("text.txt");
            BufferedReader b = new BufferedReader(new FileReader(file));
            while ((readLine = b.readLine()) != null) {
                str.add() = readLine.split(" ");
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}
If you stick to using tabs as a separator, this should work:
package Application;

import java.io.BufferedReader;
import java.io.File;
import java.io.FileReader;
import java.io.IOException;
import java.util.ArrayList;

public class Application {
    public static void main(String[] args) {
        String line;
        ArrayList<String> str = new ArrayList<>();
        try {
            File file = new File("text.txt");
            BufferedReader b = new BufferedReader(new FileReader(file));
            while ((line = b.readLine()) != null) {
                for (String s : line.split("\t")) {
                    str.add(s);
                }
            }
            str.forEach(System.out::println);
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}
Why not just use a properties file?
dict.properties:
chicken=con gà
Dict.java:
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Properties;

public class Dict {
    public static void main(String[] args) throws IOException {
        Properties dict = new Properties();
        dict.load(Files.newBufferedReader(Paths.get("dict.properties")));
        System.out.println(dict.getProperty("chicken"));
    }
}
Output:
con gà
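Properties can also write the dictionary back out; a sketch (the file name dict.properties matches the example above; note that store(OutputStream, ...) escapes non-ASCII characters as \uXXXX, while store(Writer, ...) writes them as-is):

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Properties;

public class DictWriter {
    public static void main(String[] args) throws IOException {
        Properties dict = new Properties();
        dict.setProperty("chicken", "con gà");
        // Writes "chicken=con gà" plus a comment header to dict.properties.
        dict.store(Files.newBufferedWriter(Paths.get("dict.properties")), "my dictionary");
    }
}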
If your line looks like "chicken con gà", you can use the indexOf() method to find the first space in the string.
Then you can extract each word using the substring() method.
readLine = b.readLine();
ArrayList<String> str = new ArrayList<>();
int i = readLine.indexOf(' ');
String firstWord = readLine.substring(0, i);
String secondWord = readLine.substring(i + 1, readLine.length());
str.add(firstWord);
str.add(secondWord);
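Putting the pieces together for the original goal, here is a sketch that reads tab-separated lines and keeps both fields (the Word class from the question is assumed to have a (String, String) constructor, so the pair is kept in a String[] here instead):

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

public class DictLoader {
    public static void main(String[] args) {
        List<String[]> entries = new ArrayList<>();
        try (BufferedReader b = new BufferedReader(new FileReader("text.txt"))) {
            String line;
            while ((line = b.readLine()) != null) {
                // Split on the first tab only, so the translation itself may contain spaces.
                String[] parts = line.split("\t", 2);
                if (parts.length == 2) {
                    entries.add(parts); // or: words.add(new Word(parts[0], parts[1]));
                }
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
        entries.forEach(e -> System.out.println(e[0] + " -> " + e[1]));
    }
}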
I'm quite new to Java and I'm facing a situation I can't solve. I have some HTML code and I'm trying to run a regular expression over it to store all matches in an array. Here's my code:
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

public class RegexMatch {
    boolean foundMatch = false;
    public String[] arrayResults;

    public String[] TestRegularExpression(String sourceCode, String pattern) {
        try {
            Pattern regex = Pattern.compile(pattern, Pattern.CASE_INSENSITIVE | Pattern.MULTILINE);
            Matcher regexMatcher = regex.matcher(sourceCode);
            while (regexMatcher.find()) {
                arrayResults[matches] = regexMatcher.group();
                matches++;
            }
        } catch (PatternSyntaxException ex) {
            // Exception occurred
        }
        return arrayResults;
    }
}
I'm passing a string containing HTML code and the regular expression pattern to extract all meta tags and store them in the array. Here's how I call the method:
RegexMatch regex = new RegexMatch();
regex.TestRegularExpression(sourceCode, "<meta.*?>");
String[] META_TAGS = regex.arrayResults;
Any hint?
Thanks!
Firstly, parsing HTML with regular expressions is a bad idea. There are alternatives which will convert the HTML into a DOM etc - you should look into those.
Assuming you still want the "match multiple results" idea though, it seems to me that a List<E> of some form would be more useful, so you don't need to know the size up-front. You can also build that in the method itself, rather than having state. For example:
import java.util.*;
import java.util.regex.*;

public class Test
{
    public static void main(String[] args)
        throws PatternSyntaxException
    {
        // Want to get x10 and x5 from this
        String text = "x10 y5 x5 xyz";
        String pattern = "x\\d+";
        List<String> matches = getAllMatches(text, pattern);
        for (String match : matches) {
            System.out.println(match);
        }
    }

    public static List<String> getAllMatches(String text, String pattern)
        throws PatternSyntaxException
    {
        Pattern regex = Pattern.compile(pattern);
        List<String> results = new ArrayList<String>();
        Matcher regexMatcher = regex.matcher(text);
        while (regexMatcher.find()) {
            results.add(regexMatcher.group());
        }
        return results;
    }
}
It's possible that there's something similar to this within the Matcher class itself, but I can't immediately see it...
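Since Java 9, Matcher does offer this directly via results(), which streams every match; a minimal sketch:

import java.util.List;
import java.util.regex.MatchResult;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

public class MatcherResultsDemo {
    public static void main(String[] args) {
        // results() returns a Stream<MatchResult>, one element per match.
        List<String> matches = Pattern.compile("x\\d+")
                .matcher("x10 y5 x5 xyz")
                .results()
                .map(MatchResult::group)
                .collect(Collectors.toList());
        System.out.println(matches); // [x10, x5]
    }
}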
With Jsoup, you could do something as simple as...
import java.util.ArrayList;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class GetMeta {
    private static final String META_QUERY = "meta";

    public static List<String> parseForMeta(String htmlText) {
        Document jsDocument = Jsoup.parse(htmlText);
        Elements metaElements = jsDocument.select(META_QUERY);
        List<String> metaList = new ArrayList<String>();
        for (Element element : metaElements) {
            metaList.add(element.toString());
        }
        return metaList;
    }
}
For example:
import java.io.IOException;
import java.net.*;
import java.util.*;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class GetMeta {
    private static final String META_QUERY = "meta";
    private static final String MAIN_URL = "http://www.yahoo.com";

    public static void main(String[] args) {
        try {
            Scanner scan = new Scanner(new URL(MAIN_URL).openStream());
            StringBuilder sb = new StringBuilder();
            while (scan.hasNextLine()) {
                sb.append(scan.nextLine() + "\n");
            }
            List<String> metaList = parseForMeta(sb.toString());
            for (String metaStr : metaList) {
                System.out.println(metaStr);
            }
        } catch (MalformedURLException e) {
            e.printStackTrace();
        } catch (IOException e) {
            e.printStackTrace();
        }
    }

    public static List<String> parseForMeta(String htmlText) {
        Document jsDocument = Jsoup.parse(htmlText);
        Elements metaElements = jsDocument.select(META_QUERY);
        List<String> metaList = new ArrayList<String>();
        for (Element element : metaElements) {
            metaList.add(element.toString());
        }
        return metaList;
    }
}
I was wondering if anyone has logic in Java that removes duplicate lines while maintaining the lines' order.
I would prefer a no-regex solution.
import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.util.HashSet;
import java.util.Set;

public class UniqueLineReader extends BufferedReader {
    Set<String> lines = new HashSet<String>();

    public UniqueLineReader(Reader arg0) {
        super(arg0);
    }

    @Override
    public String readLine() throws IOException {
        String uniqueLine;
        if (lines.add(uniqueLine = super.readLine()))
            return uniqueLine;
        return "";
    }

    //for testing..
    public static void main(String args[]) {
        try {
            // Open the file that is the first
            // command line parameter
            FileInputStream fstream = new FileInputStream(
                    "test.txt");
            UniqueLineReader br = new UniqueLineReader(new InputStreamReader(fstream));
            String strLine;
            // Read File Line By Line
            while ((strLine = br.readLine()) != null) {
                // Print the content on the console
                if (!strLine.isEmpty())
                    System.out.println(strLine);
            }
            // Close the input stream
            br.close();
        } catch (Exception e) { // Catch exception if any
            System.err.println("Error: " + e.getMessage());
        }
    }
}
Modified Version:
import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.util.HashSet;
import java.util.Set;

public class UniqueLineReader extends BufferedReader {
    Set<String> lines = new HashSet<String>();

    public UniqueLineReader(Reader arg0) {
        super(arg0);
    }

    @Override
    public String readLine() throws IOException {
        String uniqueLine;
        while (lines.add(uniqueLine = super.readLine()) == false); // read until encountering a unique line
        return uniqueLine;
    }

    public static void main(String args[]) {
        try {
            // Open the file that is the first
            // command line parameter
            FileInputStream fstream = new FileInputStream(
                    "/home/emil/Desktop/ff.txt");
            UniqueLineReader br = new UniqueLineReader(new InputStreamReader(fstream));
            String strLine;
            // Read File Line By Line
            while ((strLine = br.readLine()) != null) {
                // Print the content on the console
                System.out.println(strLine);
            }
            // Close the input stream
            br.close();
        } catch (Exception e) { // Catch exception if any
            System.err.println("Error: " + e.getMessage());
        }
    }
}
If you feed the lines into a LinkedHashSet, it ignores the repeated ones, since it's a set, but preserves the order, since it's linked. If you just want to know whether you've seen a given line before, feed them into a simple Set as you go, and ignore those which the Set already contains.
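For example, a minimal sketch (the file name input.txt is assumed):

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.LinkedHashSet;
import java.util.Set;

public class DedupLines {
    public static void main(String[] args) throws IOException {
        // The LinkedHashSet drops repeated lines but keeps first-seen order.
        Set<String> unique = new LinkedHashSet<>(Files.readAllLines(Path.of("input.txt")));
        unique.forEach(System.out::println);
    }
}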
Duplicate lines can be removed from text or from a file quite easily with the Java Stream API. Streams support aggregate operations such as sorted and distinct, and they work with Java's existing data structures and their methods. The following example can be used to remove duplicates from, or sort, the contents of a file using the Stream API:
package removeword;

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.stream.Stream;

import static java.nio.file.StandardOpenOption.*;
import static java.util.stream.Collectors.joining;

public class Java8UniqueWords {
    public static void main(String[] args) throws IOException {
        Path sourcePath = Paths.get("C:/Users/source.txt");
        Path changedPath = Paths.get("C:/Users/removedDouplicate_file.txt");
        try (final Stream<String> lines = Files.lines(sourcePath)
                // .map(line -> line.toLowerCase()) /* optional: use existing String methods */
                .distinct()
                // .sorted() /* aggregate operation to sort the distinct lines */
        ) {
            final String uniqueWords = lines.collect(joining("\n"));
            System.out.println("Final Output:" + uniqueWords);
            Files.write(changedPath, uniqueWords.getBytes(), WRITE, TRUNCATE_EXISTING);
        }
    }
}
Read the text file using a BufferedReader and store it in a LinkedHashSet. Print it back out.
Here's an example:
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.Set;

public class DuplicateRemover {
    public String stripDuplicates(String aHunk) {
        StringBuilder result = new StringBuilder();
        Set<String> uniqueLines = new LinkedHashSet<String>();
        String[] chunks = aHunk.split("\n");
        uniqueLines.addAll(Arrays.asList(chunks));
        for (String chunk : uniqueLines) {
            result.append(chunk).append("\n");
        }
        return result.toString();
    }
}
Here are some unit tests to verify it (ignore my evil copy-paste ;) ):
import org.junit.Test;
import static org.junit.Assert.*;

public class DuplicateRemoverTest {
    @Test
    public void removesDuplicateLines() {
        String input = "a\nb\nc\nb\nd\n";
        String expected = "a\nb\nc\nd\n";
        DuplicateRemover remover = new DuplicateRemover();
        String actual = remover.stripDuplicates(input);
        assertEquals(expected, actual);
    }

    @Test
    public void removesDuplicateLinesUnalphabetized() {
        String input = "z\nb\nc\nb\nz\n";
        String expected = "z\nb\nc\n";
        DuplicateRemover remover = new DuplicateRemover();
        String actual = remover.stripDuplicates(input);
        assertEquals(expected, actual);
    }
}
Here's another solution. Let's just use UNIX!
awk '!seen[$0]++' MyFile.java > MyFile.unique
(Plain uniq only removes adjacent duplicates, and redirecting a pipeline back into its own input file truncates the file before it is read, so cat MyFile.java | uniq > MyFile.java would destroy it.)
Edit: Oh wait, I re-read the topic. Is this a legal solution since I managed to be language agnostic?
For better performance, you can use Java 8's Streams and method references together with a LinkedHashSet as the collection, as below:
import java.io.IOException;
import java.io.PrintWriter;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.LinkedHashSet;
import java.util.stream.Collectors;

public class UniqueOperation {

    private static PrintWriter pw;

    public static void main(String[] args) throws IOException {
        pw = new PrintWriter("abc.txt");
        for (String p : Files.newBufferedReader(Paths.get("C:/Users/as00465129/Desktop/FrontEndUdemyLinks.txt"))
                .lines()
                .collect(Collectors.toCollection(LinkedHashSet::new)))
            pw.println(p);
        pw.flush();
        pw.close();
        System.out.println("File operation performed successfully");
    }
}
Here I'm using a HashSet to store the lines seen so far:
Scanner scan; // input
Set<String> lines = new HashSet<String>();
StringBuilder strb = new StringBuilder();
while (scan.hasNextLine()) {
    String line = scan.nextLine();
    if (lines.add(line)) strb.append(line).append(System.lineSeparator());
}
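For completeness, here is the same idea wired up end to end (a sketch; the input.txt file name is assumed):

import java.io.File;
import java.io.FileNotFoundException;
import java.util.HashSet;
import java.util.Scanner;
import java.util.Set;

public class SeenLines {
    public static void main(String[] args) throws FileNotFoundException {
        Scanner scan = new Scanner(new File("input.txt"));
        Set<String> lines = new HashSet<String>();
        StringBuilder strb = new StringBuilder();
        while (scan.hasNextLine()) {
            String line = scan.nextLine();
            // add() returns false for a line that was already seen, so duplicates are skipped.
            if (lines.add(line)) strb.append(line).append(System.lineSeparator());
        }
        scan.close();
        System.out.print(strb);
    }
}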
I am trying to read the contents of a file using a StringTokenizer and store all the tokens in an array, but I keep getting an Exception in thread "main" error. I need advice on how to do this. Below is the code I am using:
import java.io.BufferedReader;
import java.io.File;
import java.io.FileReader;
import java.io.IOException;
import java.io.Reader;
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;
import java.util.StringTokenizer;

public class FileTokenizer
{
    private static final String DEFAULT_DELIMITERS = "< , { } >";
    private static final String DEFAULT_TEST_FILE = "trans1.txt";

    public List<String> tokenize(Reader reader) throws IOException
    {
        List<String> tokens = new ArrayList<String>();
        BufferedReader br = null;
        try
        {
            int i = 0;
            br = new BufferedReader(reader);
            Scanner scanner = new Scanner(br);
            while (scanner.hasNext())
            {
                StringTokenizer st = new StringTokenizer(scanner.next(), DEFAULT_DELIMITERS, true);
                while (st.hasMoreElements())
                {
                    String[] t = new String[200];
                    tokens.add(st.nextToken());
                    t[i] = st.nextToken();
                    System.out.println(t[i]);
                    i++;
                }
            }
        }
        finally
        {
            close(br);
        }
        return tokens;
    }

    public static void close(Reader r)
    {
        try
        {
            if (r != null)
            {
                r.close();
            }
        }
        catch (IOException e)
        {
            e.printStackTrace();
        }
    }

    public static void main(String[] args)
    {
        try
        {
            String fileName = ((args.length > 0) ? args[0] : DEFAULT_TEST_FILE);
            FileReader fileReader = new FileReader(new File(fileName));
            FileTokenizer fileTokenizer = new FileTokenizer();
            List<String> tokens = fileTokenizer.tokenize(fileReader);
            //System.out.println(tokens);
        }
        catch (IOException e)
        {
            e.printStackTrace();
        }
    }
}
My file looks like;
PDA = (
{ q1, q2, q3, q4},
{ 0, 1 },
{ 0, $ },
{ (q1, #, #) -> { (q2, $) }, (q2, 0, #) -> { (q2, 0) },
(q2, 1, 0) -> { (q3, #) }, (q3, 1, 0) -> { (q3, #) },
(q3, #, $) -> { (q4, #) } },
q1,
{ q1, q4}
)
You will get a java.util.NoSuchElementException since you are calling st.nextToken() twice within the loop:
while (st.hasMoreElements())
Modifying harigm's example, you can then add t[i] to tokens as you require (note the single nextToken() call per iteration):
String[] t = new String[200];
t[i] = st.nextToken();
System.out.println(t[i]);
tokens.add(t[i]);
Delimiters shouldn't be separated by spaces:
private static final String DEFAULT_DELIMITERS = "<,{}>";
Also, keep the following in mind (from the Javadoc):
StringTokenizer is a legacy class that is retained for compatibility reasons although its use is discouraged in new code. It is recommended that anyone seeking this functionality use the split method of String or the java.util.regex package instead.
String.split() was introduced in JDK 1.4.
That said:
Using a Scanner to tokenize a stream together with a StringTokenizer looks a bit weird to me;
You call st.nextToken() twice in the inner loop;
t is useless. You re-create it each time in your inner loop and use only one element of it.
It seems that what you are trying to build is a lexical analyzer. Maybe you should look up some documentation on the subject.
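As a rough illustration of the split-based approach the Javadoc recommends (a sketch; unlike the StringTokenizer call above it does not return the delimiters themselves, and the character class of delimiters plus whitespace is an assumption about the file format):

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

public class SplitTokenizer {
    public static void main(String[] args) throws IOException {
        List<String> tokens = new ArrayList<>();
        for (String line : Files.readAllLines(Paths.get("trans1.txt"))) {
            // Split on runs of the delimiter characters < , { } > and whitespace.
            for (String token : line.split("[<,{}>\\s]+")) {
                if (!token.isEmpty()) {
                    tokens.add(token);
                }
            }
        }
        System.out.println(tokens);
    }
}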
Hi,
I have modified your code and now it works fine; check this:
package org.sample;

import java.io.BufferedReader;
import java.io.File;
import java.io.FileReader;
import java.io.IOException;
import java.io.Reader;
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;
import java.util.StringTokenizer;

public class FileTokenizer
{
    private static final String DEFAULT_DELIMITERS = "< , { } >";
    // private static final String DEFAULT_TEST_FILE = "trans1.txt";

    public List<String> tokenize(Reader reader) throws IOException
    {
        List<String> tokens = new ArrayList<String>();
        BufferedReader br = null;
        try
        {
            br = new BufferedReader(reader);
            Scanner scanner = new Scanner(br);
            while (scanner.hasNext())
            {
                StringTokenizer st = new StringTokenizer(scanner.next(), DEFAULT_DELIMITERS, true);
                while (st.hasMoreElements())
                {
                    // Call nextToken() only once per iteration.
                    String token = st.nextToken();
                    tokens.add(token);
                    System.out.println(token);
                }
            }
        }
        finally
        {
            close(br);
        }
        return tokens;
    }

    public static void close(Reader r)
    {
        try
        {
            if (r != null)
            {
                r.close();
            }
        }
        catch (IOException e)
        {
            e.printStackTrace();
        }
    }

    public static void main(String[] args)
    {
        try
        {
            // String fileName = ((args.length > 0) ? args[0] : DEFAULT_TEST_FILE);
            FileReader fileReader = new FileReader(new File("c:\\DevTest\\1.txt"));
            FileTokenizer fileTokenizer = new FileTokenizer();
            List<String> tokens = fileTokenizer.tokenize(fileReader);
            //System.out.println(tokens);
        }
        catch (IOException e)
        {
            e.printStackTrace();
        }
    }
}
Looking at your input file, I should point out that its hierarchical and irregular structure makes it better suited to being parsed by an actual parser. You may have to learn how to use a parser generator and write a lexer and grammar for it, etc., but in the end you'll have much more maintainable code. Doing this yourself is rather painstaking and error-prone.
I recommend ANTLR. It's quite mature, and it has a wide enough user base that I'm sure you can get help easily.