Pattern matching in Thousands of files


I have a regex pattern of words such as welcome1|welcome2|changeme... that I need to search for in thousands of files (between 100 and 8000), each ranging from 1 KB to 24 MB in size.
I would like to know whether there is a faster way of pattern matching than what I have been trying.
Environment:
jdk 1.8
Windows 10
Unix4j Library
Here's what I have tried so far:
try (Stream<Path> stream = Files.walk(Paths.get(FILES_DIRECTORY))
.filter(FilePredicates.isFileAndNotDirectory())) {
List<String> obviousStringsList = Strings_PASSWORDS.stream()
.map(s -> ".*" + s + ".*").collect(Collectors.toList()); //because Unix4j apparently needs this
Pattern pattern = Pattern.compile(String.join("|", obviousStringsList));
GrepOptions options = new GrepOptions.Default(GrepOption.count,
GrepOption.ignoreCase,
GrepOption.lineNumber,
GrepOption.matchingFiles);
Instant startTime = Instant.now();
final List<Path> filesWithObviousStringss = stream
.filter(path -> !Unix4j.grep(options, pattern, path.toFile()).toStringResult().isEmpty())
.collect(Collectors.toList());
System.out.println("Time taken = " + Duration.between(startTime, Instant.now()).getSeconds() + " seconds");
}
I get Time taken = 60 seconds, which makes me think I'm doing something really wrong.
I've tried different ways with the stream, and on average every method takes about a minute to process my current folder of 6660 files.
grep on msys2/mingw64 takes about 15 seconds, and exec('grep...') in node.js consistently takes about 12 seconds.
I chose Unix4j because it provides a grep implemented in Java and clean code.
Is there a way to get better results in Java that I'm missing?

The main reason why native tools can process such text files much faster is that they assume one particular charset, typically an ASCII-based 8-bit encoding, whereas Java performs a byte-to-character conversion whose abstraction supports arbitrary charsets.
When we likewise assume a single charset with the properties named above, we can use low-level tools, which may increase the performance dramatically.
For such an operation, we define the following helper methods:
private static char[] getTable(Charset cs) {
if(cs.newEncoder().maxBytesPerChar() != 1f)
throw new UnsupportedOperationException("Not an 8 bit charset");
byte[] raw = new byte[256];
IntStream.range(0, 256).forEach(i -> raw[i] = (byte)i);
char[] table = new char[256];
cs.newDecoder().onUnmappableCharacter(CodingErrorAction.REPLACE)
.decode(ByteBuffer.wrap(raw), CharBuffer.wrap(table), true);
for(int i = 0; i < 128; i++)
if(table[i] != i) throw new UnsupportedOperationException("Not ASCII based");
return table;
}
and
private static CharSequence mapAsciiBasedText(Path p, char[] table) throws IOException {
try(FileChannel fch = FileChannel.open(p, StandardOpenOption.READ)) {
long actualSize = fch.size();
int size = (int)actualSize;
if(size != actualSize) throw new UnsupportedOperationException("file too large");
MappedByteBuffer mbb = fch.map(FileChannel.MapMode.READ_ONLY, 0, actualSize);
final class MappedCharSequence implements CharSequence {
final int start, size;
MappedCharSequence(int start, int size) {
this.start = start;
this.size = size;
}
public int length() {
return size;
}
public char charAt(int index) {
if(index < 0 || index >= size) throw new IndexOutOfBoundsException();
byte b = mbb.get(start + index);
return b<0? table[b+256]: (char)b;
}
public CharSequence subSequence(int start, int end) {
int newSize = end - start;
if(start<0 || end < start || end > size)
throw new IndexOutOfBoundsException();
return new MappedCharSequence(start + this.start, newSize);
}
public String toString() {
return new StringBuilder(size).append(this).toString();
}
}
return new MappedCharSequence(0, size);
}
}
This maps a file into virtual memory and projects it directly to a CharSequence without copy operations, assuming that the mapping can be done with a simple table. For ASCII-based charsets, the majority of the characters do not even need a table lookup, as their numerical value is identical to the Unicode code point.
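The key idea is that java.util.regex accepts any CharSequence, so the bytes never have to pass through a charset decoder. The following is a minimal sketch of that idea only, not the memory-mapped version above: it assumes ISO-8859-1 content held in a byte array (so no lookup table is needed), and the class and method names are made up for illustration.

```java
import java.nio.charset.StandardCharsets;
import java.util.regex.Pattern;

public class ByteCharSequenceDemo {
    /** Read-only view of ISO-8859-1 bytes as chars; no copy and no decoder involved. */
    static final class Latin1Sequence implements CharSequence {
        private final byte[] data;
        private final int start, length;

        Latin1Sequence(byte[] data, int start, int length) {
            this.data = data;
            this.start = start;
            this.length = length;
        }

        @Override public int length() { return length; }

        @Override public char charAt(int index) {
            if (index < 0 || index >= length) throw new IndexOutOfBoundsException();
            // In ISO-8859-1 every byte value equals the Unicode code point.
            return (char) (data[start + index] & 0xFF);
        }

        @Override public CharSequence subSequence(int from, int to) {
            if (from < 0 || to > length || from > to) throw new IndexOutOfBoundsException();
            return new Latin1Sequence(data, start + from, to - from);
        }

        @Override public String toString() {
            return new String(data, start, length, StandardCharsets.ISO_8859_1);
        }
    }

    /** Hypothetical helper: true if the pattern occurs anywhere in the raw bytes. */
    static boolean containsAny(byte[] content, Pattern pattern) {
        return pattern.matcher(new Latin1Sequence(content, 0, content.length)).find();
    }

    public static void main(String[] args) {
        Pattern p = Pattern.compile("welcome1|changeme");
        byte[] hit = "user=admin\npassword=changeme\n".getBytes(StandardCharsets.ISO_8859_1);
        byte[] miss = "nothing to see here\n".getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(containsAny(hit, p));  // true
        System.out.println(containsAny(miss, p)); // false
    }
}
```

The mapped variant above follows the same shape, with MappedByteBuffer.get in place of the array access and the table lookup for bytes outside the ASCII range.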
With these methods, you may implement the operation as
// You need this only once per JVM.
// Note that running inside IDEs like Netbeans may change the default encoding
char[] table = getTable(Charset.defaultCharset());
try(Stream<Path> stream = Files.walk(Paths.get(FILES_DIRECTORY))
.filter(Files::isRegularFile)) {
Pattern pattern = Pattern.compile(String.join("|", Strings_PASSWORDS));
long startTime = System.nanoTime();
final List<Path> filesWithObviousStringss = stream//.parallel()
.filter(path -> {
try {
return pattern.matcher(mapAsciiBasedText(path, table)).find();
} catch(IOException ex) {
throw new UncheckedIOException(ex);
}
})
.collect(Collectors.toList());
System.out.println("Time taken = "
+ TimeUnit.NANOSECONDS.toSeconds(System.nanoTime()-startTime) + " seconds");
}
This runs much faster than the normal text conversion, but still supports parallel execution.
Besides requiring an ASCII based single byte encoding, there’s the restriction that this code doesn’t support files larger than 2 GiB. While it is possible to extend the solution to support larger files, I wouldn’t add this complication unless really needed.

I don’t know what “Unix4j” provides that isn’t already in the JDK, as the following code does everything with built-in features:
try(Stream<Path> stream = Files.walk(Paths.get(FILES_DIRECTORY))
.filter(Files::isRegularFile)) {
Pattern pattern = Pattern.compile(String.join("|", Strings_PASSWORDS));
long startTime = System.nanoTime();
final List<Path> filesWithObviousStringss = stream
.filter(path -> {
try(Scanner s = new Scanner(path)) {
return s.findWithinHorizon(pattern, 0) != null;
} catch(IOException ex) {
throw new UncheckedIOException(ex);
}
})
.collect(Collectors.toList());
System.out.println("Time taken = "
+ TimeUnit.NANOSECONDS.toSeconds(System.nanoTime()-startTime) + " seconds");
}
One important property of this solution is that it doesn’t read the whole file, but stops at the first encountered match. Also, it doesn’t deal with line boundaries, which is suitable for the words you’re looking for, as they never contain line breaks anyway.
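To see this behavior in isolation, here is a small sketch that runs findWithinHorizon over an in-memory string instead of a file. The class name, method name, and sample text are assumptions made for the example.

```java
import java.util.Scanner;
import java.util.regex.Pattern;

public class HorizonDemo {
    /** Returns the first match of the pattern in the text, or null if there is none. */
    static String firstMatch(String text, Pattern pattern) {
        try (Scanner s = new Scanner(text)) {
            // A horizon of 0 means "no bound": the search may run to the end of the
            // input, and delimiters (including line breaks) are ignored.
            return s.findWithinHorizon(pattern, 0);
        }
    }

    public static void main(String[] args) {
        Pattern p = Pattern.compile("welcome1|welcome2|changeme");
        // Two candidate words are present; only the first one encountered is returned.
        System.out.println(firstMatch("line one\npass=welcome2\npass=changeme\n", p)); // welcome2
        System.out.println(firstMatch("nothing here\n", p));                           // null
    }
}
```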
After analyzing the findWithinHorizon operation, I consider that line by line processing may be better for larger files, so, you may try
try(Stream<Path> stream = Files.walk(Paths.get(FILES_DIRECTORY))
.filter(Files::isRegularFile)) {
Pattern pattern = Pattern.compile(String.join("|", Strings_PASSWORDS));
long startTime = System.nanoTime();
final List<Path> filesWithObviousStringss = stream
.filter(path -> {
try(Stream<String> s = Files.lines(path)) {
return s.anyMatch(pattern.asPredicate());
} catch(IOException ex) {
throw new UncheckedIOException(ex);
}
})
.collect(Collectors.toList());
System.out.println("Time taken = "
+ TimeUnit.NANOSECONDS.toSeconds(System.nanoTime()-startTime) + " seconds");
}
instead.
You may also try to turn the stream to parallel mode, e.g.
try(Stream<Path> stream = Files.walk(Paths.get(FILES_DIRECTORY))
.filter(Files::isRegularFile)) {
Pattern pattern = Pattern.compile(String.join("|", Strings_PASSWORDS));
long startTime = System.nanoTime();
final List<Path> filesWithObviousStringss = stream
.parallel()
.filter(path -> {
try(Stream<String> s = Files.lines(path)) {
return s.anyMatch(pattern.asPredicate());
} catch(IOException ex) {
throw new UncheckedIOException(ex);
}
})
.collect(Collectors.toList());
System.out.println("Time taken = "
+ TimeUnit.NANOSECONDS.toSeconds(System.nanoTime()-startTime) + " seconds");
}
It’s hard to predict whether this has a benefit, as in most cases, the I/O dominates such an operation.
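One detail worth checking when moving away from Unix4j: the original call passed GrepOption.ignoreCase, while the snippets above compile the alternation without any flags. If case-insensitive matching is still wanted, a sketch like the following could be used. The helper name anyOf and the sample words are assumptions; quoting each word is an extra precaution in case a word contains regex metacharacters.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

public class WordPattern {
    /** Builds one alternation that matches any of the words literally, ignoring case. */
    static Pattern anyOf(List<String> words) {
        String alternation = words.stream()
                .map(Pattern::quote)                // treat regex metacharacters as plain text
                .collect(Collectors.joining("|"));
        return Pattern.compile(alternation, Pattern.CASE_INSENSITIVE);
    }

    public static void main(String[] args) {
        Pattern p = anyOf(Arrays.asList("welcome1", "changeme", "p@ss.word"));
        System.out.println(p.matcher("Password: ChangeMe").find()); // true, case is ignored
        System.out.println(p.matcher("p@ssXword").find());          // false, '.' is literal
    }
}
```

The resulting Pattern is immutable and can be shared by all threads of a parallel stream; only the Matcher instances are per-use.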

I have never used Unix4j, but Java provides nice file APIs as well nowadays. Also, Unix4j#grep seems to return all the matches found (as you're using .toStringResult().isEmpty()), while you only need to know whether at least one match exists, which means you should be able to stop as soon as one match is found. Maybe the library provides another method that better suits your needs, e.g. something like #contains? Without Unix4j, Stream#anyMatch could be a good candidate here. Here is a vanilla Java solution if you want to compare it with yours:
private boolean lineContainsObviousStrings(String line) {
return Strings_PASSWORDS // <-- weird naming BTW
.stream()
.anyMatch(line::contains);
}
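private boolean fileContainsObviousStrings(Path path) {
try (Stream<String> stream = Files.lines(path)) {
return stream.anyMatch(this::lineContainsObviousStrings);
} catch (IOException ex) {
// Files.lines declares IOException; rethrow unchecked so the method can be used in Stream#filter
throw new UncheckedIOException(ex);
}
}
public List<Path> findFilesContainingObviousStrings() throws IOException {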
Instant startTime = Instant.now();
try (Stream<Path> stream = Files.walk(Paths.get(FILES_DIRECTORY))) {
return stream
.filter(FilePredicates.isFileAndNotDirectory())
.filter(this::fileContainsObviousStrings)
.collect(Collectors.toList());
} finally {
Instant endTime = Instant.now();
System.out.println("Time taken = " + Duration.between(startTime, endTime).getSeconds() + " seconds");
}
}
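The benefit of Stream#anyMatch here is that it short-circuits: once one line matches, the remaining lines are not inspected. A small sketch that makes this visible by counting the lines pulled through a sequential stream (class and method names are made up for the example):

```java
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

public class ShortCircuitDemo {
    /** Counts how many lines a sequential stream inspects before anyMatch returns. */
    static int linesInspected(List<String> lines, String word) {
        AtomicInteger inspected = new AtomicInteger();
        lines.stream()
             .peek(line -> inspected.incrementAndGet()) // counts every line that is pulled
             .anyMatch(line -> line.contains(word));    // stops at the first matching line
        return inspected.get();
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("user=admin", "pass=changeme", "host=a", "port=1");
        System.out.println(linesInspected(lines, "changeme")); // 2: stops after the match
        System.out.println(linesInspected(lines, "welcome1")); // 4: no match, all lines read
    }
}
```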

Please try this out too (if it is possible), I am curious how it performs on your files.
import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Optional;
import java.util.concurrent.TimeUnit;
import java.util.stream.Collectors;
import java.util.stream.Stream;
public class Filescan {
public static void main(String[] args) throws IOException {
Filescan sc = new Filescan();
sc.findWords("src/main/resources/files", new String[]{"author", "book"}, true);
}
// kind of Tuple/Map.Entry
static class Pair<K,V>{
final K key;
final V value;
Pair(K key, V value){
this.key = key;
this.value = value;
}
@Override
public String toString() {
return key + " " + value;
}
}
public void findWords(String directory, String[] words, boolean ignorecase) throws IOException{
final String[] searchWords = ignorecase ? toLower(words) : words;
try (Stream<Path> stream = Files.walk(Paths.get(directory)).filter(Files::isRegularFile)) {
long startTime = System.nanoTime();
List<Pair<Path,Map<String, List<Integer>>>> result = stream
// you can test it with parallel execution, maybe it is faster
.parallel()
// searching
.map(path -> findWordsInFile(path, searchWords, ignorecase))
// filtering out empty optionals
.filter(Optional::isPresent)
// unwrap optionals
.map(Optional::get).collect(Collectors.toList());
System.out.println("Time taken = " + TimeUnit.NANOSECONDS.toSeconds(System.nanoTime()
- startTime) + " seconds");
System.out.println("result:");
result.forEach(System.out::println);
}
}
private String[] toLower(String[] words) {
String[] ret = new String[words.length];
for (int i = 0; i < words.length; i++) {
ret[i] = words[i].toLowerCase();
}
return ret;
}
private static Optional<Pair<Path,Map<String, List<Integer>>>> findWordsInFile(Path path, String[] words, boolean ignorecase) {
try (BufferedReader br = new BufferedReader(new InputStreamReader(new FileInputStream(path.toFile())))) {
String line = br.readLine();
line = ignorecase && line != null ? line.toLowerCase() : line;
Map<String, List<Integer>> map = new HashMap<>();
int linecount = 0;
while(line != null){
for (String word : words) {
if(line.contains(word)){
if(!map.containsKey(word)){
map.put(word, new ArrayList<Integer>());
}
map.get(word).add(linecount);
}
}
line = br.readLine();
line = ignorecase && line != null ? line.toLowerCase() : line;
linecount++;
}
if(map.isEmpty()){
// returning empty optional when nothing in the map
return Optional.empty();
}else{
// returning a path-map pair with the words and the rows where each word has been found
return Optional.of(new Pair<Path,Map<String, List<Integer>>>(path, map));
}
} catch (IOException ex) {
throw new UncheckedIOException(ex);
}
}
}

Related

Can you rebalance an unbalanced Spliterator of unknown size?

I want to use a Stream to parallelize processing of a heterogenous set of remotely stored JSON files of unknown number (the number of files is not known upfront). The files can vary widely in size, from 1 JSON record per file up to 100,000 records in some other files. A JSON record in this case means a self-contained JSON object represented as one line in the file.
I really want to use Streams for this and so I implemented this Spliterator:
public abstract class JsonStreamSpliterator<METADATA, RECORD> extends AbstractSpliterator<RECORD> {
abstract protected JsonStreamSupport<METADATA> openInputStream(String path);
abstract protected RECORD parse(METADATA metadata, Map<String, Object> json);
private static final int ADDITIONAL_CHARACTERISTICS = Spliterator.IMMUTABLE | Spliterator.DISTINCT | Spliterator.NONNULL;
private static final int MAX_BUFFER = 100;
private final Iterator<String> paths;
private JsonStreamSupport<METADATA> reader = null;
public JsonStreamSpliterator(Iterator<String> paths) {
this(Long.MAX_VALUE, ADDITIONAL_CHARACTERISTICS, paths);
}
private JsonStreamSpliterator(long est, int additionalCharacteristics, Iterator<String> paths) {
super(est, additionalCharacteristics);
this.paths = paths;
}
private JsonStreamSpliterator(long est, int additionalCharacteristics, Iterator<String> paths, String nextPath) {
this(est, additionalCharacteristics, paths);
open(nextPath);
}
@Override
public boolean tryAdvance(Consumer<? super RECORD> action) {
if(reader == null) {
String path = takeNextPath();
if(path != null) {
open(path);
}
else {
return false;
}
}
Map<String, Object> json = reader.readJsonLine();
if(json != null) {
RECORD item = parse(reader.getMetadata(), json);
action.accept(item);
return true;
}
else {
reader.close();
reader = null;
return tryAdvance(action);
}
}
private void open(String path) {
reader = openInputStream(path);
}
private String takeNextPath() {
synchronized(paths) {
if(paths.hasNext()) {
return paths.next();
}
}
return null;
}
@Override
public Spliterator<RECORD> trySplit() {
String nextPath = takeNextPath();
if(nextPath != null) {
return new JsonStreamSpliterator<METADATA,RECORD>(Long.MAX_VALUE, ADDITIONAL_CHARACTERISTICS, paths, nextPath) {
@Override
protected JsonStreamSupport<METADATA> openInputStream(String path) {
return JsonStreamSpliterator.this.openInputStream(path);
}
@Override
protected RECORD parse(METADATA metaData, Map<String,Object> json) {
return JsonStreamSpliterator.this.parse(metaData, json);
}
};
}
else {
List<RECORD> records = new ArrayList<RECORD>();
while(tryAdvance(records::add) && records.size() < MAX_BUFFER) {
// loop
}
if(records.size() != 0) {
return records.spliterator();
}
else {
return null;
}
}
}
}
The problem I'm having is that while the Stream parallelizes beautifully at first, eventually the largest file is left processing in a single thread. I believe the proximal cause is well documented: the spliterator is "unbalanced".
More concretely, it appears that the trySplit method is not called after a certain point in the Stream.forEach's lifecycle, so the extra logic to distribute small batches at the end of trySplit is rarely executed.
Notice how all the spliterators returned from trySplit share the same paths iterator. I thought this was a really clever way to balance the work across all spliterators, but it hasn't been enough to achieve full parallelism.
I would like the parallel processing to proceed first across files and then, when only a few large files are still being processed, to parallelize across chunks of the remaining files. That was the intent of the else block at the end of trySplit.
Is there an easy / simple / canonical way around this problem?

Your trySplit should output splits of equal size, regardless of the size of the underlying files. You should treat all the files as a single unit and fill up the ArrayList-backed spliterator with the same number of JSON objects each time. The number of objects should be such that processing one split takes between 1 and 10 milliseconds: lower than 1 ms and you start approaching the costs of handing off the batch to a worker thread, higher than that and you start risking uneven CPU load due to tasks which are too coarse-grained.
The spliterator is not obliged to report a size estimate, and you are already doing this correctly: your estimate is Long.MAX_VALUE, which is a special value meaning "unbounded". However, if you have many files with a single JSON object, resulting in batches of size 1, this will hurt your performance in two ways: the overhead of opening-reading-closing the file may become a bottleneck and, if you manage to escape that, the cost of thread handoff may be significant compared to the cost of processing one item, again causing a bottleneck.
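As a quick illustration of the Long.MAX_VALUE convention, the sketch below compares a spliterator built from a plain iterator with one built from a list. The class and method names are assumptions for the example; the JDK calls are standard.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Spliterator;
import java.util.Spliterators;

public class SizeEstimateDemo {
    /** Wraps a plain iterator, much as a spliterator over files of unknown length would. */
    static Spliterator<String> unknownSize(List<String> items) {
        return Spliterators.spliteratorUnknownSize(items.iterator(), Spliterator.NONNULL);
    }

    public static void main(String[] args) {
        List<String> items = Arrays.asList("a", "b", "c");
        Spliterator<String> unknown = unknownSize(items);
        Spliterator<String> sized = items.spliterator();

        System.out.println(unknown.estimateSize() == Long.MAX_VALUE);      // true: unknown size
        System.out.println(unknown.getExactSizeIfKnown());                 // -1: not SIZED
        System.out.println(sized.estimateSize());                          // 3
        System.out.println(sized.hasCharacteristics(Spliterator.SIZED));   // true
    }
}
```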
Five years ago I solved a similar problem; you can have a look at my solution.
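The fixed-size batching described in this answer can be sketched roughly as follows. This is not the linked solution, only a minimal illustration under stated assumptions: the class name FixedBatchSpliterator, the batch size of 64, and the integer source are all made up for the example, and a real version would pull JSON records across file boundaries instead of integers.

```java
import java.util.Iterator;
import java.util.Spliterator;
import java.util.Spliterators;
import java.util.function.Consumer;
import java.util.stream.IntStream;
import java.util.stream.StreamSupport;

public class FixedBatchSpliterator<T> implements Spliterator<T> {
    private final Iterator<T> source; // single logical source spanning all files
    private final int batchSize;      // tune so one batch takes roughly 1 to 10 ms to process

    public FixedBatchSpliterator(Iterator<T> source, int batchSize) {
        this.source = source;
        this.batchSize = batchSize;
    }

    @Override
    public boolean tryAdvance(Consumer<? super T> action) {
        if (!source.hasNext()) return false;
        action.accept(source.next());
        return true;
    }

    @Override
    public Spliterator<T> trySplit() {
        // Hand off a prefix of at most batchSize items as an array-backed spliterator;
        // every split has the same size no matter which file the items came from.
        Object[] batch = new Object[batchSize];
        int n = 0;
        while (n < batchSize && source.hasNext()) batch[n++] = source.next();
        if (n == 0) return null; // source exhausted, nothing left to split
        return Spliterators.spliterator(batch, 0, n, characteristics());
    }

    @Override public long estimateSize() { return Long.MAX_VALUE; } // total size unknown
    @Override public int characteristics() { return ORDERED; }

    /** Example use: sums 1..n through a parallel stream fed in fixed batches. */
    static long parallelSum(int n, int batchSize) {
        Iterator<Integer> it = IntStream.rangeClosed(1, n).boxed().iterator();
        return StreamSupport.stream(new FixedBatchSpliterator<>(it, batchSize), true)
                .mapToLong(Integer::longValue)
                .sum();
    }

    public static void main(String[] args) {
        System.out.println(parallelSum(10_000, 64)); // 50005000
    }
}
```

The source iterator needs no synchronization here because the stream framework never calls trySplit and tryAdvance on the same spliterator concurrently, and the returned batches are independent arrays.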

After much experimentation, I was still not able to get any added parallelism by playing with the size estimates. Basically, any value other than Long.MAX_VALUE will tend to cause the spliterator to terminate too early (and without any splitting), while on the other hand a Long.MAX_VALUE estimate will cause trySplit to be called relentlessly until it returns null.
The solution I found is to internally share resources among the spliterators and let them rebalance amongst themselves.
Working code:
public class AwsS3LineSpliterator<LINE> extends AbstractSpliterator<AwsS3LineInput<LINE>> {
public final static class AwsS3LineInput<LINE> {
final public S3ObjectSummary s3ObjectSummary;
final public LINE lineItem;
public AwsS3LineInput(S3ObjectSummary s3ObjectSummary, LINE lineItem) {
this.s3ObjectSummary = s3ObjectSummary;
this.lineItem = lineItem;
}
}
private final class InputStreamHandler {
final S3ObjectSummary file;
final InputStream inputStream;
InputStreamHandler(S3ObjectSummary file, InputStream is) {
this.file = file;
this.inputStream = is;
}
}
private final Iterator<S3ObjectSummary> incomingFiles;
private final Function<S3ObjectSummary, InputStream> fileOpener;
private final Function<InputStream, LINE> lineReader;
private final Deque<S3ObjectSummary> unopenedFiles;
private final Deque<InputStreamHandler> openedFiles;
private final Deque<AwsS3LineInput<LINE>> sharedBuffer;
private final int maxBuffer;
private AwsS3LineSpliterator(Iterator<S3ObjectSummary> incomingFiles, Function<S3ObjectSummary, InputStream> fileOpener,
Function<InputStream, LINE> lineReader,
Deque<S3ObjectSummary> unopenedFiles, Deque<InputStreamHandler> openedFiles, Deque<AwsS3LineInput<LINE>> sharedBuffer,
int maxBuffer) {
super(Long.MAX_VALUE, 0);
this.incomingFiles = incomingFiles;
this.fileOpener = fileOpener;
this.lineReader = lineReader;
this.unopenedFiles = unopenedFiles;
this.openedFiles = openedFiles;
this.sharedBuffer = sharedBuffer;
this.maxBuffer = maxBuffer;
}
public AwsS3LineSpliterator(Iterator<S3ObjectSummary> incomingFiles, Function<S3ObjectSummary, InputStream> fileOpener, Function<InputStream, LINE> lineReader, int maxBuffer) {
this(incomingFiles, fileOpener, lineReader, new ConcurrentLinkedDeque<>(), new ConcurrentLinkedDeque<>(), new ArrayDeque<>(maxBuffer), maxBuffer);
}
@Override
public boolean tryAdvance(Consumer<? super AwsS3LineInput<LINE>> action) {
AwsS3LineInput<LINE> lineInput;
synchronized(sharedBuffer) {
lineInput=sharedBuffer.poll();
}
if(lineInput != null) {
action.accept(lineInput);
return true;
}
InputStreamHandler handle = openedFiles.poll();
if(handle == null) {
S3ObjectSummary unopenedFile = unopenedFiles.poll();
if(unopenedFile == null) {
return false;
}
handle = new InputStreamHandler(unopenedFile, fileOpener.apply(unopenedFile));
}
for(int i=0; i < maxBuffer; ++i) {
LINE line = lineReader.apply(handle.inputStream);
if(line != null) {
synchronized(sharedBuffer) {
sharedBuffer.add(new AwsS3LineInput<LINE>(handle.file, line));
}
}
else {
return tryAdvance(action);
}
}
openedFiles.addFirst(handle);
return tryAdvance(action);
}
@Override
public Spliterator<AwsS3LineInput<LINE>> trySplit() {
synchronized(incomingFiles) {
if (incomingFiles.hasNext()) {
unopenedFiles.add(incomingFiles.next());
return new AwsS3LineSpliterator<LINE>(incomingFiles, fileOpener, lineReader, unopenedFiles, openedFiles, sharedBuffer, maxBuffer);
} else {
return null;
}
}
}
}

This is not a direct answer to your question. But I think it is worth a try with Stream in library abacus-common:
void test_58601518() throws Exception {
final File tempDir = new File("./temp/");
// Prepare the test files:
// if (!(tempDir.exists() && tempDir.isDirectory())) {
// tempDir.mkdirs();
// }
//
// final Random rand = new Random();
// final int fileCount = 1000;
//
// for (int i = 0; i < fileCount; i++) {
// List<String> lines = Stream.repeat(TestUtil.fill(Account.class), rand.nextInt(1000) * 100 + 1).map(it -> N.toJSON(it)).toList();
// IOUtil.writeLines(new File("./temp/_" + i + ".json"), lines);
// }
N.println("Xmx: " + IOUtil.MAX_MEMORY_IN_MB + " MB");
N.println("total file size: " + Stream.listFiles(tempDir).mapToLong(IOUtil::sizeOf).sum() / IOUtil.ONE_MB + " MB");
final AtomicLong counter = new AtomicLong();
final Consumer<Account> yourAction = it -> {
counter.incrementAndGet();
it.toString().replace("a", "bbb");
};
long startTime = System.currentTimeMillis();
Stream.listFiles(tempDir) // the file/data source could be local file system or remote file system.
.parallel(2) // thread number used to load the file/data and convert the lines to Java objects.
.flatMap(f -> Stream.lines(f).map(line -> N.fromJSON(Account.class, line))) // only certain lines (less 1024) will be loaded to memory.
.parallel(8) // thread number used to execute your action.
.forEach(yourAction);
N.println("Took: " + ((System.currentTimeMillis()) - startTime) + " ms" + " to process " + counter + " lines/objects");
// IOUtil.deleteAllIfExists(tempDir);
}
Until the end of the run, the CPU usage on my laptop stayed fairly high (about 70%), and it took about 70 seconds to process 51,899,100 lines/objects from 1000 files on an Intel(R) Core(TM) i5-8365U CPU with -Xmx256m. The total file size is about 4524 MB. If yourAction is not a heavy operation, a sequential stream could be even faster than a parallel one.
FYI: I'm the developer of abacus-common.

Why does the java DirectoryStream perform so slow?

I've done some testing with Streams, in particular with the DirectoryStreams of the nio package. I simply try to get a list of all files in a directory, sorted by last-modified date and size.
The JavaDoc of the old File.listFiles() contains a note pointing to a method in Files:
Note that the Files class defines the newDirectoryStream method to
open a directory and iterate over the names of the files in the
directory. This may use less resources when working with very large
directories.
I ran the code below many times; the first three runs are shown here:
First-run:
Run time of Arrays.sort: 1516
Run time of Stream.sorted as Array: 2912
Run time of Stream.sorted as List: 2875
Second-run:
Run time of Arrays.sort: 1557
Run time of Stream.sorted as Array: 2978
Run time of Stream.sorted as List: 2937
Third-run:
Run time of Arrays.sort: 1563
Run time of Stream.sorted as Array: 2919
Run time of Stream.sorted as List: 2896
My question is: Why do the streams perform so badly?
import java.io.File;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.attribute.FileTime;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;
public class FileSorter {
// This sorts from old to young and from big to small
Comparator<Path> timeSizeComparator = (Path o1, Path o2) -> {
int sorter = 0;
try {
FileTime lm1 = Files.getLastModifiedTime(o1);
FileTime lm2 = Files.getLastModifiedTime(o2);
if (lm2.compareTo(lm1) == 0) {
Long s1 = Files.size(o1);
Long s2 = Files.size(o2);
sorter = s2.compareTo(s1);
} else {
sorter = lm1.compareTo(lm2);
}
} catch (IOException ex) {
throw new UncheckedIOException(ex);
}
return sorter;
};
public String[] getSortedFileListAsArray(Path dir) throws IOException {
Stream<Path> stream = Files.list(dir);
return stream.sorted(timeSizeComparator).
map(Path::getFileName).map(Path::toString).toArray(String[]::new);
}
public List<String> getSortedFileListAsList(Path dir) throws IOException {
Stream<Path> stream = Files.list(dir);
return stream.sorted(timeSizeComparator).
map(Path::getFileName).map(Path::toString).collect(Collectors.
toList());
}
public String[] sortByDateAndSize(File[] fileList) {
Arrays.sort(fileList, (File o1, File o2) -> {
int r = Long.compare(o1.lastModified(), o2.lastModified());
if (r != 0) {
return r;
}
return Long.compare(o1.length(), o2.length());
});
String[] fileNames = new String[fileList.length];
for (int i = 0; i < fileNames.length; i++) {
fileNames[i] = fileList[i].getName();
}
return fileNames;
}
public static void main(String[] args) throws IOException {
// File (io package)
File f = new File("C:\\Windows\\system32");
// Path (nio package)
Path dir = Paths.get("C:\\Windows\\system32");
FileSorter fs = new FileSorter();
long before = System.currentTimeMillis();
String[] names = fs.sortByDateAndSize(f.listFiles());
long after = System.currentTimeMillis();
System.out.println("Run time of Arrays.sort: " + ((after - before)));
long before2 = System.currentTimeMillis();
String[] names2 = fs.getSortedFileListAsArray(dir);
long after2 = System.currentTimeMillis();
System.out.
println("Run time of Stream.sorted as Array: " + ((after2 - before2)));
long before3 = System.currentTimeMillis();
List<String> names3 = fs.getSortedFileListAsList(dir);
long after3 = System.currentTimeMillis();
System.out.
println("Run time of Stream.sorted as List: " + ((after3 - before3)));
}
}
Update
After applying the code from Peter, I get these results:
Run time of Arrays.sort: 1615
Run time of Stream.sorted as Array: 3116
Run time of Stream.sorted as List: 3059
Run time of Stream.sorted as List with caching: 378
Update 2
After doing some research on Peter's solution, I can say that reading file attributes with, for example, Files.getLastModifiedTime must be an expensive operation. Changing only that part of the Comparator to:
Comparator<Path> timeSizeComparator = (Path o1, Path o2) -> {
File f1 = o1.toFile();
File f2 = o2.toFile();
long lm1 = f1.lastModified();
long lm2 = f2.lastModified();
int cmp = Long.compare(lm2, lm1);
if (cmp == 0) {
cmp = Long.compare(f2.length(), f1.length());
}
return cmp;
};
gives an even better result on my computer:
Run time of Arrays.sort: 1968
Run time of Stream.sorted as Array: 1999
Run time of Stream.sorted as List: 1975
Run time of Stream.sorted as List with caching: 488
But as you can see, caching the values in an object is by far the best way. And, as jtahlborn mentioned, it also gives a kind of stable sort.
Update 3 (best solution I've found)
After a bit more research, I've seen that the methods Files.getLastModifiedTime and Files.size both do a lot of work on the same thing: the file attributes. So I made three versions of the PathInfo class to test:
Peter's version, as described below
An old-style File version, where I call Path.toFile() once in the constructor and get all values from that file with f.lastModified and f.length
A version of Peter's solution in which I read an attributes object once with Files.readAttributes(path, BasicFileAttributes.class) and take all values from it
Running each of them 100 times in a loop, I came up with these results:
After all one hundred runs
Mean performance of Peters solution: 432.26
Mean performance of old File solution: 343.11
Mean performance of read attribute object once solution: 255.66
Code in constructor of PathInfo for the best solution:
public PathInfo(Path path) {
try {
// read the whole attributes once
BasicFileAttributes bfa = Files.readAttributes(path, BasicFileAttributes.class);
fileName = path.getFileName().toString();
modified = bfa.lastModifiedTime().toMillis();
size = bfa.size();
} catch (IOException ex) {
throw new UncheckedIOException(ex);
}
}
My conclusion: never read the attributes twice; caching them in an object boosts performance considerably.
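Put together, the read-once approach might look like the sketch below. It assumes the "old to young, big to small" order stated in the comparator comment of the original code; the class name CachedAttributeSort, the helper methods, and the temporary demo files are made up for illustration.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.attribute.BasicFileAttributes;
import java.nio.file.attribute.FileTime;
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class CachedAttributeSort {
    /** Snapshot of the values needed for sorting, read with one attribute call per file. */
    static final class PathInfo {
        final String fileName;
        final long modified;
        final long size;

        PathInfo(Path path) {
            try {
                BasicFileAttributes bfa = Files.readAttributes(path, BasicFileAttributes.class);
                fileName = path.getFileName().toString();
                modified = bfa.lastModifiedTime().toMillis();
                size = bfa.size();
            } catch (IOException ex) {
                throw new UncheckedIOException(ex);
            }
        }
    }

    // Oldest first; for equal timestamps, largest first. Works on cached values only.
    static final Comparator<PathInfo> OLD_TO_YOUNG_BIG_TO_SMALL =
            Comparator.comparingLong((PathInfo p) -> p.modified)
                      .thenComparing(Comparator.comparingLong((PathInfo p) -> p.size).reversed());

    static List<String> sortedNames(Path dir) {
        try (Stream<Path> files = Files.list(dir)) { // close the directory handle when done
            return files.filter(Files::isRegularFile)
                        .map(PathInfo::new)          // the only place that touches the file system
                        .sorted(OLD_TO_YOUNG_BIG_TO_SMALL)
                        .map(p -> p.fileName)
                        .collect(Collectors.toList());
        } catch (IOException ex) {
            throw new UncheckedIOException(ex);
        }
    }

    private static void write(Path dir, String name, int bytes, long millis) throws IOException {
        Path p = Files.write(dir.resolve(name), new byte[bytes]);
        Files.setLastModifiedTime(p, FileTime.fromMillis(millis));
    }

    /** Creates three small files in a temp directory (left in place) and sorts them. */
    static List<String> demo() {
        try {
            Path dir = Files.createTempDirectory("sort-demo");
            write(dir, "a.txt", 1, 2_000_000_000_000L); // newest
            write(dir, "b.txt", 5, 1_000_000_000_000L); // old, small
            write(dir, "c.txt", 9, 1_000_000_000_000L); // old, big
            return sortedNames(dir);
        } catch (IOException ex) {
            throw new UncheckedIOException(ex);
        }
    }

    public static void main(String[] args) {
        System.out.println(demo()); // [c.txt, b.txt, a.txt]
    }
}
```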

Files.list() is an O(N) operation, whereas sorting is O(N log N). It is far more likely that the operations inside the sort are what matter. Given that the two comparisons don't do the same thing, this is the most likely explanation. There are a lot of files with the same modification date under C:/Windows/System32, meaning the size would be checked quite often.
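To get a feeling for how often a comparator runs, the following sketch counts comparator calls while sorting shuffled integers. The class name, the list size of 1000, and the seed are arbitrary assumptions, and the exact count depends on the input order, so no specific number is claimed.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;
import java.util.concurrent.atomic.AtomicLong;

public class ComparisonCount {
    /** Sorts n shuffled integers and returns how many times the comparator was called. */
    static long comparisonsWhenSorting(int n, long seed) {
        List<Integer> values = new ArrayList<>();
        for (int i = 0; i < n; i++) values.add(i);
        Collections.shuffle(values, new Random(seed)); // fixed seed for repeatable input

        AtomicLong calls = new AtomicLong();
        values.sort((a, b) -> {
            calls.incrementAndGet(); // in the question, each call hits the file system
            return Integer.compare(a, b);
        });
        return calls.get();
    }

    public static void main(String[] args) {
        int n = 1000;
        long calls = comparisonsWhenSorting(n, 42L);
        System.out.println(n + " elements, " + calls + " comparator calls");
    }
}
```

With the comparator from the question, each of these calls reads the modification time of two files, plus two sizes whenever the times are equal, so the number of attribute lookups is a multiple of the call count rather than of N.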
To show that most of the time is not spent in the Files.list(dir) Stream, I have optimised the comparison so that the data about a file is obtained only once per file.
import java.io.File;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.attribute.FileTime;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;
public class FileSorter {
// This sorts from old to young and from big to small
Comparator<Path> timeSizeComparator = (Path o1, Path o2) -> {
int sorter = 0;
try {
FileTime lm1 = Files.getLastModifiedTime(o1);
FileTime lm2 = Files.getLastModifiedTime(o2);
if (lm2.compareTo(lm1) == 0) {
Long s1 = Files.size(o1);
Long s2 = Files.size(o2);
sorter = s2.compareTo(s1);
} else {
sorter = lm1.compareTo(lm2);
}
} catch (IOException ex) {
throw new UncheckedIOException(ex);
}
return sorter;
};
public String[] getSortedFileListAsArray(Path dir) throws IOException {
Stream<Path> stream = Files.list(dir);
return stream.sorted(timeSizeComparator).
map(Path::getFileName).map(Path::toString).toArray(String[]::new);
}
public List<String> getSortedFileListAsList(Path dir) throws IOException {
Stream<Path> stream = Files.list(dir);
return stream.sorted(timeSizeComparator).
map(Path::getFileName).map(Path::toString).collect(Collectors.
toList());
}
public String[] sortByDateAndSize(File[] fileList) {
Arrays.sort(fileList, (File o1, File o2) -> {
int r = Long.compare(o1.lastModified(), o2.lastModified());
if (r != 0) {
return r;
}
return Long.compare(o1.length(), o2.length());
});
String[] fileNames = new String[fileList.length];
for (int i = 0; i < fileNames.length; i++) {
fileNames[i] = fileList[i].getName();
}
return fileNames;
}
public List<String> getSortedFile(Path dir) throws IOException {
return Files.list(dir).map(PathInfo::new).sorted().map(p -> p.getFileName()).collect(Collectors.toList());
}
static class PathInfo implements Comparable<PathInfo> {
private final String fileName;
private final long modified;
private final long size;
public PathInfo(Path path) {
try {
fileName = path.getFileName().toString();
modified = Files.getLastModifiedTime(path).toMillis();
size = Files.size(path);
} catch (IOException ex) {
throw new UncheckedIOException(ex);
}
}
@Override
public int compareTo(PathInfo o) {
int cmp = Long.compare(modified, o.modified);
if (cmp == 0)
cmp = Long.compare(size, o.size);
return cmp;
}
public String getFileName() {
return fileName;
}
}
public static void main(String[] args) throws IOException {
// File (io package)
File f = new File("C:\\Windows\\system32");
// Path (nio package)
Path dir = Paths.get("C:\\Windows\\system32");
FileSorter fs = new FileSorter();
long before = System.currentTimeMillis();
String[] names = fs.sortByDateAndSize(f.listFiles());
long after = System.currentTimeMillis();
System.out.println("Run time of Arrays.sort: " + ((after - before)));
long before2 = System.currentTimeMillis();
String[] names2 = fs.getSortedFileListAsArray(dir);
long after2 = System.currentTimeMillis();
System.out.println("Run time of Stream.sorted as Array: " + ((after2 - before2)));
long before3 = System.currentTimeMillis();
List<String> names3 = fs.getSortedFileListAsList(dir);
long after3 = System.currentTimeMillis();
System.out.println("Run time of Stream.sorted as List: " + ((after3 - before3)));
long before4 = System.currentTimeMillis();
List<String> names4 = fs.getSortedFile(dir);
long after4 = System.currentTimeMillis();
System.out.println("Run time of Stream.sorted as List with caching: " + ((after4 - before4)));
}
}
This prints on my laptop.
Run time of Arrays.sort: 1980
Run time of Stream.sorted as Array: 1295
Run time of Stream.sorted as List: 1228
Run time of Stream.sorted as List with caching: 185
As you can see, about 85% of the time is spent obtaining the modification date and size of the files repeatedly.
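A further variation on the caching idea, as a minimal sketch: both values can be read with a single Files.readAttributes call per file instead of two separate lookups. The class and method names here (AttributeSortSketch, Entry, sortedNames) are made up for illustration, and I have not timed this against the numbers above.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.attribute.BasicFileAttributes;
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

class AttributeSortSketch {
    // Hypothetical holder: the attributes are read once, before sorting starts.
    static final class Entry {
        final String name;
        final long modified;
        final long size;

        Entry(Path path) {
            try {
                // One call returns both the modification time and the size.
                BasicFileAttributes attrs = Files.readAttributes(path, BasicFileAttributes.class);
                name = path.getFileName().toString();
                modified = attrs.lastModifiedTime().toMillis();
                size = attrs.size();
            } catch (IOException ex) {
                throw new UncheckedIOException(ex);
            }
        }
    }

    // Sorts by modification time, then by size, using only the cached values.
    static List<String> sortedNames(Path dir) throws IOException {
        try (Stream<Path> files = Files.list(dir)) {
            return files.map(Entry::new)
                    .sorted(Comparator.comparingLong((Entry e) -> e.modified)
                            .thenComparingLong(e -> e.size))
                    .map(e -> e.name)
                    .collect(Collectors.toList());
        }
    }
}
```

The try-with-resources block also closes the directory stream that Files.list opens.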

Test code to compare jackson Stream vs Map - Is this working right?

I have a server side application that I am profiling using VisualVM that makes use of Streaming API.
However, since there are a lot of factors in that code I also made a toy example to compare streaming vs mapping.
I have a feeling that something may be off in that there is a lot of randomness in the results.
Is it the measuring? Would using other types of timers make a difference? Is there something multi-threaded going on that I don't know about?
Currently I am writing to the NUL file object, the Windows equivalent of /dev/null. I am running this at high priority in case the operating system affects it.
Toy Example Code:
import java.io.File;
import java.io.IOException;
import java.io.StringWriter;
import java.util.ArrayList;
import java.util.Map.Entry;
import java.util.Scanner;
import java.util.TreeMap;
import com.fasterxml.jackson.core.JsonEncoding;
import com.fasterxml.jackson.core.JsonFactory;
import com.fasterxml.jackson.core.JsonGenerator;
import com.fasterxml.jackson.databind.ObjectMapper;
public class TestStreamingMapping {
public final static int NUM_SIMULATED_CATALOGS = 10000;
public final static int CATALOG_SIZE = 1000; //1000 Items in CATALOG, 500 requests per second
public final static boolean WRITE_TO_FILE = false; //Write to file, or write to string
public final static boolean DEBUG_PRINT_100_CHAR = false; //Print out part of string to see all ok
public static final String mappingFile = "mapping.txt"; //If writing to file, where?
public static final String streamingFile = "streaming.txt"; //If streaming to file, where?
public static final boolean PRINT_INTERMEDIATE_RESULTS = false;
public static TreeMap<Long,Double> iterationPercentages = new TreeMap<Long,Double>();
ObjectMapper mapper= new ObjectMapper();
JsonFactory f = new JsonFactory();
JsonGenerator g;
public static long totalCountStream = 0, totalCountMap = 0;
public static void main(String args[])
{
System.out.println("Press enter when profiler is connected...");
new Scanner(System.in).nextLine();
System.out.println("Starting iterations of JSON generation.");
double percentage;
for(long i=0; i<NUM_SIMULATED_CATALOGS; i++)
{
performTest();
percentage = (totalCountStream*100.0d / totalCountMap);
iterationPercentages.put(i, percentage);
if(!PRINT_INTERMEDIATE_RESULTS && i%100 == 0)System.out.print(i+"-");
}
System.out.println("Total Streaming API: " + totalCountStream + " ns.");
System.out.println("Total Mapping API: " + totalCountMap + " ns.");
System.out.println("Total Stream(as % of map): " + totalCountStream*100.0d / totalCountMap + " %\r\n" );
System.out.println("Iteration\tStreamPercent");
for(Entry<Long, Double> entry : iterationPercentages.entrySet())
if(entry.getKey() % 20 ==0)
System.out.println(entry.getKey() + "\t\t" + Math.round(entry.getValue()) + "%" );
}
public static void performTest()
{
TestStreamingMapping test = new TestStreamingMapping();
long time1, time2;
double percentage = 0;
try {
long starttime1 = System.nanoTime();
test.streamingToFile();
totalCountStream+=time1=System.nanoTime() - starttime1;
long starttime2 = System.nanoTime();
test.objectMapping();
totalCountMap+=time2=System.nanoTime() - starttime2;
percentage = (time1*100.0d / time2);
if(PRINT_INTERMEDIATE_RESULTS)
{
System.out.println("Streaming API: " + time1 + " ns.");
System.out.println("Mapping API: " + time2 + " ns.");
System.out.println("Stream(as % of map): " + percentage + " %" );
System.out.println("----------------------------------------------\r\n");
}
} catch (IOException e) {
e.printStackTrace();
}
}
public String[] numbers;
public ArrayList<String> arrayList = new ArrayList<String>();
public TestStreamingMapping()
{
numbers=new String[62];
for(int i=0; i<60; i++) numbers[i] = String.valueOf(Math.random()*i);
for(int i=0; i<60; i++) arrayList.add(String.valueOf(Math.random()*i));
}
public void initializeGenerator(StringWriter writer) throws IOException
{
if(WRITE_TO_FILE)
g = f.createGenerator(new File(mappingFile), JsonEncoding.UTF8);
else
g = f. createGenerator(writer);
}
public void objectMapping() throws IOException
{
StringWriter writer = new StringWriter();
initializeGenerator(writer);
for(int j=0; j<CATALOG_SIZE; j++)
mapper.writeValue(g, this);
g.close();
writer.close();
if(DEBUG_PRINT_100_CHAR)
System.out.println(writer.toString().substring(0,100));
}
public void streamingToFile() throws IOException
{
StringWriter writer = new StringWriter();
initializeGenerator(writer);
for(int j=0; j<CATALOG_SIZE; j++)
{
g.writeStartObject();
g.writeFieldName("numbers_streaming");
g.writeStartArray();
for(int i=0; i<numbers.length; i++) g.writeString(numbers[i]);
g.writeEndArray();
g.writeFieldName("arrayList"); g.writeStartArray();
for(String num: arrayList) g.writeString(num);
g.writeEndArray();
g.writeEndObject();
}
g.close();
writer.close();
if(DEBUG_PRINT_100_CHAR)
System.out.println(writer.toString().substring(0,100));
}
}
The code above simulates a service that generates a JSON catalog document with 1000 Product objects. The hotspot obviously is the serialization of the products (streamingToFile() vs objectMapping()).

Ok, couple of things.
Most importantly, you should create just one JsonFactory instance, similar to how you reuse ObjectMapper. Reuse of these objects is one of the key things for performance with Jackson. See here for more ideas.
Another thing to consider is that use of File adds I/O overhead, which should be about the same for both approaches, and diminishes the difference in actual processing times. You may want to separate this out to see how much of the time is spent on file access. I realize that this may be a bogus file (as per the note on how the OS deals with that), but even without physical overhead, the OS typically incurs some syscall overhead.
And then one general aspect is that when measuring performance on the JVM, you always need to keep warm-up time in mind: you should always warm up tests for multiple seconds (5 or 10 seconds minimum), and run the actual test for a sufficient time (30 seconds or more), to get more stable results.
This is where test frameworks can help, as they can actually statistically measure things and figure out when results stabilize enough to be meaningful.
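To make the warm-up point concrete, here is a rough hand-rolled sketch of a warm-up phase followed by a measured phase. All names in it (WarmupTimingSketch, runFor, averageNanos) are invented for illustration, and it is no substitute for a real benchmark framework, which also deals with dead-code elimination and statistical stability.

```java
import java.util.function.LongSupplier;

class WarmupTimingSketch {
    // Results are accumulated here so the JIT cannot drop the work as unused.
    static long sink;

    // Runs the task repeatedly for roughly the given wall-clock time
    // (at least once) and returns how many iterations were completed.
    static long runFor(long millis, LongSupplier task) {
        long deadline = System.nanoTime() + millis * 1_000_000L;
        long iterations = 0;
        do {
            sink += task.getAsLong();
            iterations++;
        } while (System.nanoTime() < deadline);
        return iterations;
    }

    // Warms up first (timings discarded), then measures the average time per call.
    static double averageNanos(long warmupMillis, long measureMillis, LongSupplier task) {
        runFor(warmupMillis, task);
        long start = System.nanoTime();
        long iterations = runFor(measureMillis, task);
        long elapsed = System.nanoTime() - start;
        return (double) elapsed / iterations;
    }
}
```

With the durations suggested above, a call would look like averageNanos(10_000, 30_000, task), where task wraps one streaming or one mapping run and returns some value derived from its output.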
Hope this helps!

Filter (search and replace) array of bytes in an InputStream

I have an InputStream which takes an HTML file as its input. I have to get the bytes from the input stream.
I have a string: "XYZ". I'd like to convert this string to bytes and check whether it matches anywhere in the byte sequence obtained from the InputStream. If it does, I have to replace the match with the byte sequence for some other string.
Is there anyone who could help me with this? I have used regex to find and replace in strings, but I don't know how to find and replace in a byte stream.
Previously, I used jsoup to parse the HTML and replace the string, but due to some UTF encoding problems the file appears corrupted when I do that.
TL;DR: My question is:
Is there a way to find and replace a string in byte format in a raw InputStream in Java?

Not sure you have chosen the best approach to solve your problem.
That said, I don't like to (and have as policy not to) answer questions with "don't" so here goes...
Have a look at FilterInputStream.
From the documentation:
A FilterInputStream contains some other input stream, which it uses as its basic source of data, possibly transforming the data along the way or providing additional functionality.
It was a fun exercise to write it up. Here's a complete example for you:
import java.io.*;
import java.util.*;
class ReplacingInputStream extends FilterInputStream {
LinkedList<Integer> inQueue = new LinkedList<Integer>();
LinkedList<Integer> outQueue = new LinkedList<Integer>();
final byte[] search, replacement;
protected ReplacingInputStream(InputStream in,
byte[] search,
byte[] replacement) {
super(in);
this.search = search;
this.replacement = replacement;
}
private boolean isMatchFound() {
Iterator<Integer> inIter = inQueue.iterator();
for (int i = 0; i < search.length; i++)
if (!inIter.hasNext() || (search[i] & 0xFF) != inIter.next()) // compare as unsigned, since read() yields 0..255
return false;
return true;
}
private void readAhead() throws IOException {
// Work up some look-ahead.
while (inQueue.size() < search.length) {
int next = super.read();
inQueue.offer(next);
if (next == -1)
break;
}
}
@Override
public int read() throws IOException {
// Next byte already determined.
if (outQueue.isEmpty()) {
readAhead();
if (isMatchFound()) {
for (int i = 0; i < search.length; i++)
inQueue.remove();
for (byte b : replacement)
outQueue.offer(b & 0xFF); // unsigned, so a byte >= 0x80 is never mistaken for -1 (EOF)
} else
outQueue.add(inQueue.remove());
}
return outQueue.remove();
}
// TODO: Override the other read methods.
}
Example Usage
class Test {
public static void main(String[] args) throws Exception {
byte[] bytes = "hello xyz world.".getBytes("UTF-8");
ByteArrayInputStream bis = new ByteArrayInputStream(bytes);
byte[] search = "xyz".getBytes("UTF-8");
byte[] replacement = "abc".getBytes("UTF-8");
InputStream ris = new ReplacingInputStream(bis, search, replacement);
ByteArrayOutputStream bos = new ByteArrayOutputStream();
int b;
while (-1 != (b = ris.read()))
bos.write(b);
System.out.println(new String(bos.toByteArray()));
}
}
Given the bytes for the string "hello xyz world." it prints:
hello abc world.

The following approach will work, but I don't know how big the impact on performance is.
Wrap the InputStream with an InputStreamReader,
wrap the InputStreamReader with a FilterReader that replaces the strings, then
wrap the FilterReader with a ReaderInputStream.
It is crucial to choose the appropriate encoding, otherwise the content of the stream will become corrupted.
If you want to use regular expressions to replace the strings, then you can use Streamflyer, a tool of mine, which is a convenient alternative to FilterReader. You will find an example for byte streams on the webpage of Streamflyer. Hope this helps.

I needed something like this as well and decided to roll my own solution instead of using the example above by @aioobe. Have a look at the code. You can pull the library from Maven Central, or just copy the source code.
This is how you use it. In this case, I'm using a nested instance to replace two patterns to fix DOS and Mac line endings.
new ReplacingInputStream(new ReplacingInputStream(is, "\r\n", "\n"), "\r", "\n");
Here's the full source code:
/**
* Simple FilterInputStream that can replace occurrences of bytes with something else.
*/
public class ReplacingInputStream extends FilterInputStream {
// while matching, this is where the bytes go.
int[] buf=null;
int matchedIndex=0;
int unbufferIndex=0;
int replacedIndex=0;
private final byte[] pattern;
private final byte[] replacement;
private State state=State.NOT_MATCHED;
// simple state machine for keeping track of what we are doing
private enum State {
NOT_MATCHED,
MATCHING,
REPLACING,
UNBUFFER
}
/**
* @param is input
* @return nested replacing stream that replaces \r\n (DOS) and \r (MAC) line endings with UNIX ones "\n".
*/
public static InputStream newLineNormalizingInputStream(InputStream is) {
return new ReplacingInputStream(new ReplacingInputStream(is, "\r\n", "\n"), "\r", "\n");
}
/**
* Replace occurrences of pattern in the input. Note: input is assumed to be UTF-8 encoded. If that is not the case, use the byte[] based pattern and replacement.
* @param in input
* @param pattern pattern to replace.
* @param replacement the replacement or null
*/
public ReplacingInputStream(InputStream in, String pattern, String replacement) {
this(in,pattern.getBytes(StandardCharsets.UTF_8), replacement==null ? null : replacement.getBytes(StandardCharsets.UTF_8));
}
/**
* Replace occurrences of pattern in the input.
* @param in input
* @param pattern pattern to replace
* @param replacement the replacement or null
*/
public ReplacingInputStream(InputStream in, byte[] pattern, byte[] replacement) {
super(in);
Validate.notNull(pattern);
Validate.isTrue(pattern.length>0, "pattern length should be > 0", pattern.length);
this.pattern = pattern;
this.replacement = replacement;
// we will never match more than the pattern length
buf = new int[pattern.length];
}
@Override
public int read(byte[] b, int off, int len) throws IOException {
// copy of parent logic; we need to call our own read() instead of super.read(), which delegates instead of calling our read
if (b == null) {
throw new NullPointerException();
} else if (off < 0 || len < 0 || len > b.length - off) {
throw new IndexOutOfBoundsException();
} else if (len == 0) {
return 0;
}
int c = read();
if (c == -1) {
return -1;
}
b[off] = (byte)c;
int i = 1;
try {
for (; i < len ; i++) {
c = read();
if (c == -1) {
break;
}
b[off + i] = (byte)c;
}
} catch (IOException ee) {
}
return i;
}
@Override
public int read(byte[] b) throws IOException {
// call our own read
return read(b, 0, b.length);
}
@Override
public int read() throws IOException {
// use a simple state machine to figure out what we are doing
int next;
switch (state) {
case NOT_MATCHED:
// we are not currently matching, replacing, or unbuffering
next=super.read();
if((pattern[0] & 0xFF) == next) { // unsigned compare: read() returns 0..255
// clear whatever was there
buf=new int[pattern.length]; // clear whatever was there
// make sure we start at 0
matchedIndex=0;
buf[matchedIndex++]=next;
if(pattern.length == 1) {
// edgecase when the pattern length is 1 we go straight to replacing
state=State.REPLACING;
// reset replace counter
replacedIndex=0;
} else {
// pattern is longer than one byte, keep matching
state=State.MATCHING;
}
// recurse to continue matching
return read();
} else {
return next;
}
case MATCHING:
// the previous bytes matched part of the pattern
next=super.read();
if((pattern[matchedIndex] & 0xFF)==next) { // unsigned compare: read() returns 0..255
buf[matchedIndex++]=next;
if(matchedIndex==pattern.length) {
// we've found a full match!
if(replacement==null || replacement.length==0) {
// the replacement is empty, go straight to NOT_MATCHED
state=State.NOT_MATCHED;
matchedIndex=0;
} else {
// start replacing
state=State.REPLACING;
replacedIndex=0;
}
}
} else {
// mismatch -> unbuffer
buf[matchedIndex++]=next;
state=State.UNBUFFER;
unbufferIndex=0;
}
return read();
case REPLACING:
// we've fully matched the pattern and are returning bytes from the replacement
next=replacement[replacedIndex++] & 0xFF; // unsigned, so bytes >= 0x80 are not returned as negative values
if(replacedIndex==replacement.length) {
state=State.NOT_MATCHED;
replacedIndex=0;
}
return next;
case UNBUFFER:
// we partially matched the pattern before encountering a non matching byte
// we need to serve up the buffered bytes before we go back to NOT_MATCHED
next=buf[unbufferIndex++];
if(unbufferIndex==matchedIndex) {
state=State.NOT_MATCHED;
matchedIndex=0;
}
return next;
default:
throw new IllegalStateException("no such state " + state);
}
}
@Override
public String toString() {
return state.name() + " " + matchedIndex + " " + replacedIndex + " " + unbufferIndex;
}
}

There isn't any built-in functionality for search-and-replace on byte streams (InputStream).
And, a method for completing this task efficiently and correctly is not immediately obvious. I have implemented the Boyer-Moore algorithm for streams, and it works well, but it took some time. Without an algorithm like this, you have to resort to a brute-force approach where you look for the pattern starting at every position in the stream, which can be slow.
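For illustration only, here is a sketch of the Horspool simplification of Boyer-Moore, searching a byte array that is already fully in memory. This is not the stream implementation mentioned above (that code is not shown here); the class and method names are made up, and a stream version would additionally need buffering across read boundaries.

```java
import java.util.Arrays;

class HorspoolSearchSketch {
    /** Returns the index of the first occurrence of needle in haystack, or -1 if absent. */
    static int indexOf(byte[] haystack, byte[] needle) {
        if (needle.length == 0) {
            return 0;
        }
        // For each possible byte value: how far the window may slide when that
        // byte is aligned with the last position of the pattern.
        int[] shift = new int[256];
        Arrays.fill(shift, needle.length);
        for (int i = 0; i < needle.length - 1; i++) {
            shift[needle[i] & 0xFF] = needle.length - 1 - i;
        }
        int pos = 0;
        while (pos <= haystack.length - needle.length) {
            // Compare the window against the pattern from right to left.
            int j = needle.length - 1;
            while (j >= 0 && haystack[pos + j] == needle[j]) {
                j--;
            }
            if (j < 0) {
                return pos;
            }
            // Skip ahead based on the byte under the end of the window.
            pos += shift[haystack[pos + needle.length - 1] & 0xFF];
        }
        return -1;
    }
}
```

The gain over brute force comes from the shift table: when the byte under the end of the window does not occur in the pattern at all, the window jumps by the full pattern length.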
Even if you decode the HTML as text, using a regular expression to match patterns might be a bad idea, since HTML is not a "regular" language.
So, even though you've run into some difficulties, I suggest you pursue your original approach of parsing the HTML as a document. While you are having trouble with the character encoding, it will probably be easier, in the long run, to fix the right solution than it will be to jury-rig the wrong solution.

I needed a solution to this, but found the answers here incurred too much memory and/or CPU overhead. The below solution significantly outperforms the others here in these terms based on simple benchmarking.
This solution is especially memory-efficient, incurring no measurable cost even with >GB streams.
That said, this is not a zero-CPU-cost solution. The CPU/processing-time overhead is probably reasonable for all but the most demanding/resource-sensitive scenarios, but the overhead is real and should be considered when evaluating the worthiness of employing this solution in a given context.
In my case, our max real-world file size that we are processing is about 6MB, where we see added latency of about 170ms with 44 URL replacements. This is for a Zuul-based reverse-proxy running on AWS ECS with a single CPU share (1024). For most of the files (under 100KB), the added latency is sub-millisecond. Under high-concurrency (and thus CPU contention), the added latency could increase, however we are currently able to process hundreds of the files concurrently on a single node with no humanly-noticeable latency impact.
The solution we are using:
import java.io.IOException;
import java.io.InputStream;
public class TokenReplacingStream extends InputStream {
private final InputStream source;
private final byte[] oldBytes;
private final byte[] newBytes;
private int tokenMatchIndex = 0;
private int bytesIndex = 0;
private boolean unwinding;
private int mismatch;
private int numberOfTokensReplaced = 0;
public TokenReplacingStream(InputStream source, byte[] oldBytes, byte[] newBytes) {
assert oldBytes.length > 0;
this.source = source;
this.oldBytes = oldBytes;
this.newBytes = newBytes;
}
@Override
public int read() throws IOException {
if (unwinding) {
if (bytesIndex < tokenMatchIndex) {
return oldBytes[bytesIndex++] & 0xFF; // unsigned, so bytes >= 0x80 are not returned as negative values
} else {
bytesIndex = 0;
tokenMatchIndex = 0;
unwinding = false;
return mismatch;
}
} else if (tokenMatchIndex == oldBytes.length) {
if (bytesIndex == newBytes.length) {
bytesIndex = 0;
tokenMatchIndex = 0;
numberOfTokensReplaced++;
} else {
return newBytes[bytesIndex++] & 0xFF; // unsigned, so bytes >= 0x80 are not returned as negative values
}
}
int b = source.read();
if (b == (oldBytes[tokenMatchIndex] & 0xFF)) { // unsigned compare: read() returns 0..255
tokenMatchIndex++;
} else if (tokenMatchIndex > 0) {
mismatch = b;
unwinding = true;
} else {
return b;
}
return read();
}
@Override
public void close() throws IOException {
source.close();
}
public int getNumberOfTokensReplaced() {
return numberOfTokensReplaced;
}
}

I came up with this simple piece of code when I needed to serve a template file in a Servlet replacing a certain keyword by a value. It should be pretty fast and low on memory. Then using Piped Streams I guess you can use it for all sorts of things.
/JC
public static void replaceStream(InputStream in, OutputStream out, String search, String replace) throws IOException
{
replaceStream(new InputStreamReader(in), new OutputStreamWriter(out), search, replace);
}
public static void replaceStream(Reader in, Writer out, String search, String replace) throws IOException
{
char[] searchChars = search.toCharArray();
int[] buffer = new int[searchChars.length];
int x, r, si = 0, sm = searchChars.length;
while ((r = in.read()) != -1) { // -1 marks end of stream; a NUL char (0) is valid data
if (searchChars[si] == r) {
// The char matches our pattern
buffer[si++] = r;
if (si == sm) {
// We have reached a matching string
out.write(replace);
si = 0;
}
} else if (si > 0) {
// No match and buffered char(s), empty buffer and pass the char forward
for (x = 0; x < si; x++) {
out.write(buffer[x]);
}
si = 0;
out.write(r);
} else {
// No match and nothing buffered, just pass the char forward
out.write(r);
}
}
// Empty buffer
for (x = 0; x < si; x++) {
out.write(buffer[x]);
}
}

How to check if a string is a number [duplicate]

This question already has answers here:
How to check if a String is numeric in Java
(41 answers)
Closed 5 years ago.
I have a conversion-to-Map problem in core Java.
Below is the requirement:
Given the String array below
String str[] = {"abc","123","def","456","ghi","789","lmn","101112","opq"};
Convert it into a Map such that the resultant output is below
Output
======  ======
key     Value
======  ======
abc     true
123     false
def     true
456     false
The above should be printed for each element in the array. I have written the code but it's not working and I'm stuck. Please let me know how it can be resolved. Thanks in advance.
import java.util.HashMap;
import java.util.Iterator;
import java.util.Map;
public class CoversionToMap {
/**
* @param args
*/
public static void main(String[] args) {
String str[] = {"abc","123","def","456","ghi","789","lmn","101112","opq"};
Map m = new HashMap();
for(int i=0;i<str.length;i++){
if(Integer.parseInt(str[i]) < 0){
m.put(str[i],true);
}else{
m.put(str[i],false);
}
}
//Print the map values finally
printMap(m);
}
public static void printMap(Map mp) {
Iterator it = mp.entrySet().iterator();
while (it.hasNext()) {
Map.Entry pairs = (Map.Entry)it.next();
System.out.println(pairs.getKey() + " = " + pairs.getValue());
}
}
}
exception:
Exception in thread "main" java.lang.NumberFormatException: For input string: "abc"
at java.lang.NumberFormatException.forInputString(Unknown Source)
at java.lang.Integer.parseInt(Unknown Source)
at java.lang.Integer.parseInt(Unknown Source)
at CoversionToMap.main(CoversionToMap.java:22)

Everyone is suggesting exception handling for this, but there is nothing exceptional here to warrant using exceptions like this. You don't try turning left in your car and, if you crash, go right instead, do you? Something like this should do it:
Map<String, Boolean> m = new HashMap<String, Boolean>();
for (String str: strs) {
m.put(str, isInteger(str));
}
public boolean isInteger(String str) {
int size = str.length();
for (int i = 0; i < size; i++) {
if (!Character.isDigit(str.charAt(i))) {
return false;
}
}
return size > 0;
}
Much clearer and more efficient than throwing and catching an exception, even when 99% of the inputs are integers, since the integer value is not even needed and so no conversion is required.

Integer.parseInt(..) throws an exception for invalid input.
Your if clause should look like this:
if (isNumber(str[i])) {
...
} else {
...
}
Where isNumber can be implemented in multiple ways. For example:
using try { Integer.parseInt(..) } catch (NumberFormatException ex) (see this related question)
using commons-lang NumberUtils.isNumber(..)

You check if parseInt returns a number smaller than 0 to see if the input is non-numeric.
However, that method doesn't return any value at all, if the input is non-numeric. Instead it throws an exception, as you have seen.
The simplest way to do what you want is to catch that exception and act accordingly:
try {
Integer.parseInt(str[i]);
// str[i] is numeric
} catch (NumberFormatException ignored) {
// str[i] is not numeric
}

If you want to check whether the string is a valid Java number, you can use the isNumber method of NumberUtils in the org.apache.commons.lang.math package (doc here: http://commons.apache.org/lang/api-2.4/org/apache/commons/lang/math/NumberUtils.html).
This way you won't have to write your own implementation of isNumber

You need to use a try/catch block instead of testing the return value for parseInt.
try {
Integer.parseInt(str[i]);
m.put(str[i],true);
} catch(NumberFormatException e) {
m.put(str[i],false);
}

Your error occurs here:
if(Integer.parseInt(str[i]) < 0){
Integer.parseInt throws a NumberFormatException when the input isn't a number, so you need to use a try/catch block, for example:
try{
int number = Integer.parseInt(str[i]);
m.put(str[i],false);
} catch (NumberFormatException nfe) {
m.put(str[i],true);
}

Assuming you won't use any external libraries, you can also use a Regular Expression Matcher to do that. Just like
for (String element : str) {
m.put(element, element.matches("\\d+"));
}
Note that this works only with non-negative integers, but you can adapt the regular expression to match the number formats you want to map as true. Also, if element is null, you'll get a NullPointerException, so a little defensive code is required here.
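As a sketch of the adaptation and the defensive null check mentioned above: the pattern below is only one assumed definition of "number" (optional minus sign, digits, optional fractional part), and the names NumericCheckSketch and isNumeric are made up for this example.

```java
import java.util.regex.Pattern;

class NumericCheckSketch {
    // Assumed number format: optional minus sign, digits, optional ".digits" part.
    // Compiled once, since String.matches would recompile the regex on every call.
    private static final Pattern NUMBER = Pattern.compile("-?\\d+(\\.\\d+)?");

    // Null-safe: a null input is simply reported as non-numeric.
    static boolean isNumeric(String s) {
        return s != null && NUMBER.matcher(s).matches();
    }
}
```

In the loop above, the map entry would then be written as m.put(element, isNumeric(element)), or negated if non-numeric strings should map to true as in the question.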

Here is an improved answer which can be used for numbers with negative value, decimal points etc. It uses Regular Expressions.
Here it is:
public class StringValidator {
public static void printMap(Map<String, Boolean> map) {
Iterator it = map.entrySet().iterator();
for(Map.Entry<String, Boolean> entry:map.entrySet()){
System.out.println(entry.getKey()+" = "+ entry.getValue());
}
}
}
class ValidateArray{
public static void main(String[] args) {
String str[] = {"abcd", "123", "101.112", "-1.54774"};
Map<String, Boolean> m = new HashMap<String, Boolean>();
for (String s : str) {
m.put(s, isNumber(s));
}
StringValidator.printMap(m);
}
public static boolean isNumber(String str) {
Pattern pattern = Pattern.compile("^-?\\d+\\.?\\d*$");
Matcher matcher = pattern.matcher(str);
return matcher.matches();
}
}

Replace your parseInt line with a call to isInteger(str[i]) where isInteger is defined by:
public static boolean isInteger(String text) {
try {
Integer.parseInt(text); // same validation; the Integer(String) constructor is deprecated in newer Java versions
return true;
} catch (NumberFormatException e) {
return false;
}
}

I would like to enter the contrary view on 'don't use exception handling' here. The following code:
try
{
InputStream in = new FileInputStream(file);
}
catch (FileNotFoundException exc)
{
// ...
}
is entirely equivalent to:
if (!file.exists())
{
// ...
}
else
try
{
InputStream in = new FileInputStream(file);
}
catch (FileNotFoundException exc)
{
// ...
}
except that in the former case:
The existence of the file is only checked once
There is no timing-window between the two checks during which things can change.
The processing at // ... is only programmed once.
So you don't see code like the second case. At least you shouldn't.
The present case is identical except that because it's a String there is no timing window. Integer.parseInt() has to check the input for validity anyway, and it throws an exception which must be caught somewhere anyway (unless you like RTEs stopping your threads). So why do everything twice?
The counter-argument that you shouldn't use exceptions for normal flow control just begs the question. Is it normal flow control? or is it an error in the input? [In fact I've always understood that principle to mean more specifically 'don't throw exceptions to your own code' within the method, and even then there are rare cases when it's the best answer. I'm not a fan of blanket rules of any kind.]
Another example is detecting EOF on an ObjectInputStream. You do it by catching EOFException. There is no other way apart from prefixing a count to the stream, which is a design change and a format change. So, is EOF part of the normal flow, or is it an exception? And how can it be part of the normal flow, given that it is only reported via an exception?
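A minimal sketch of that EOF case, assuming the objects were written back to back with no count prefix. The class and method names are invented for the example, and it works on in-memory bytes only to stay self-contained.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.util.ArrayList;
import java.util.List;

class ReadUntilEofSketch {
    // Serializes the given values one after another, without writing a count first.
    static byte[] write(Object... values) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            for (Object value : values) {
                out.writeObject(value);
            }
        }
        return bytes.toByteArray();
    }

    // Reads objects until the stream ends; EOFException is the only end signal.
    static List<Object> readAll(byte[] data) throws IOException, ClassNotFoundException {
        List<Object> result = new ArrayList<>();
        try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(data))) {
            while (true) {
                result.add(in.readObject());
            }
        } catch (EOFException endOfStream) {
            // Expected: this is how the end of the data is reported.
        }
        return result;
    }
}
```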

Here's a more general way to validate, avoiding exceptions, and using what the Format subclasses already know. For example the SimpleDateFormat knows that Feb 31 is not valid, as long as you tell it not to be lenient.
import java.text.Format;
import java.text.NumberFormat;
import java.text.ParsePosition;
import java.text.SimpleDateFormat;
import java.util.HashMap;
import java.util.Map;
public class ValidatesByParsePosition {
private static NumberFormat _numFormat = NumberFormat.getInstance();
private static SimpleDateFormat _dateFormat = new SimpleDateFormat(
"MM/dd/yyyy");
public static void printMap(Map<String, Boolean> map) {
for (Map.Entry<String, Boolean> entry : map.entrySet()) {
System.out.println(entry.getKey() + " = " + entry.getValue());
}
}
public static void main(String[] args) {
System.out.println("Validating Nums with ParsePosition:");
String numStrings[] = { "abcd", "123", "101.112", "-1.54774", "1.40t3" };
Map<String, Boolean> rslts = new HashMap<String, Boolean>();
for (String s : numStrings) {
rslts.put(s, isOk(_numFormat, s));
}
ValidatesByParsePosition.printMap(rslts);
System.out.println("\nValidating dates with ParsePosition:");
String dateStrings[] = { "3/11/1952", "02/31/2013", "03/14/2014",
"05/25/2014", "3/uncle george/2015" };
rslts = new HashMap<String, Boolean>();
_dateFormat.setLenient(false);
for (String s : dateStrings) {
rslts.put(s, isOk(_dateFormat, s));
}
ValidatesByParsePosition.printMap(rslts);
}
public static boolean isOk(Format format, String str) {
boolean isOK = true;
int errorIndx = -1;
int parseIndx = 0;
ParsePosition pos = new ParsePosition(parseIndx);
while (isOK && parseIndx < str.length() - 1) {
format.parseObject(str, pos);
parseIndx = pos.getIndex();
errorIndx = pos.getErrorIndex();
isOK = errorIndx < 0;
}
if (!isOK) {
System.out.println("value \"" + str
+ "\" not parsed; error at char index " + errorIndx);
}
return isOK;
}
}

boolean intVal = false;
for(int i=0;i<str.length;i++) {
intVal = false;
try {
Integer.parseInt(str[i]);
intVal = true; // parsing succeeded, so the string is an integer (including 0 and negatives)
} catch (java.lang.NumberFormatException e) {
intVal = false;
}
m.put(str[i], !intVal);
}
