Execution Time: Iterativ vs Instance - java

I just stumbled across a strange thing while coding in Java:
I read a file into a bytearray (byte[] file_bytes) and what I want is a hexdump output (like the utilities hexdump or xxd in Linux). Basically this works (see the for-loop-code that is not commented out), but for larger Files (>100 KiB) it takes a bit, to go through the bytearray-chunks, do proper formatting, and so on.
But if I swap the for-loop-code with the code that is commented out (using a class with the same for-loop-code for calculation!), it works very fast.
What is the reason for this behavior?
Codesnippet:
[...]
long counter = 1;
int chunk_size = 512;
int chunk_count = (int) Math.ceil((double) file_bytes.length / chunk_size);
for (int i = 0; i < chunk_count; i++) {
byte[] chunk = Arrays.copyOfRange(file_bytes, i * chunk_size, (i + 1) * chunk_size);
// this commented two lines calculate way more faster than the for loop below, even though the calculation algorithm is the same!
/*
* String test = new BytesToHexstring(chunk).getHexstring();
* hex_string = hex_string.concat(test);
*/
for (byte b : chunk) {
if( (counter % 4) != 0 ){
hex_string = hex_string.concat(String.format("%02X ", b));
} else{
hex_string = hex_string.concat(String.format("%02X\n", b));
}
counter++;
}
}
[...]
class BytesToHexstring:
class BytesToHexstring {
private String m_hexstring;
public BytesToHexstringTask(byte[] ba) {
m_hexstring = "";
m_hexstring = bytes_to_hex_string(ba);
}
private String bytes_to_hex_string(byte[] ba) {
String hexstring = "";
int counter = 1;
// same calculation algorithm like in the codesnippet above!
for (byte b : ba) {
if ((counter % 4) != 0) {
hexstring = hexstring.concat(String.format("%02X ", b));
} else {
hexstring = hexstring.concat(String.format("%02X\n", b));
}
counter++;
}
return hexstring;
}
public String getHexstring() {
return m_hexstring;
}
}
String hex_string:
00 11 22 33
44 55 66 77
88 99 AA BB
CC DD EE FF
Benchmarks:
file_bytes.length = 102400 Bytes = 100 KiB
via class: ~0,7 sec
without class: ~5,2 sec
file_bytes.length = 256000 Bytes = 250 KiB
via class: ~1,2 sec
without class: ~36 sec

There's an important difference between the two options. In the slow version, you are concatenating each iteration onto the entire hex string you built up for each byte. String concatenation is a slow operation since it requires copying the entire string. As you string gets larger this copying takes longer and you copy the whole thing every byte.
In the faster version you are building each chunk up individually and only concatenating whole chunks with the output string rather than each individual bytes. This mean much fewer expensive concatenations. You are still using concatenation while building uyp the chunk, but because a chunk is much smaller than the whole output those concatenations are faster.
You could do much better though by using a string builder instead of string concatenation. StringBuilder is a class designed for efficiently building up strings incrementally. It avoids the full copy on every append that concatenation does. I expect that if you remake this to use StringBuilder both versions would perform about the same, and be faster than either version you already have.

Related

Generating millions of random string in Java

I would like to generate millions of passwords randomly between 4 millions and 50 millions. The problem is the time it takes the processor to process it.
I would like to know if there is a solution to generate a lot of passwords in only a few seconds (max 1 minute for 50 millions).
I've done that for now but it takes me more than 3 min (with a very good config and I would like to run it on small config).
private final static String policy = "azertyuiopqsdfghjklmwxcvbnAZERTYUIOPQSDFGHJKLMWXCVBN1234567890";
private static List<String> names = new ArrayList<String>();
public static void main(String[] args) {
names.add("de");
init();
}
private static String generator(){
String password="";
int randomWithMathRandom = (int) ((Math.random() * ( - 6)) + 6);
for(var i=0;i<8;i++){
randomWithMathRandom = (int) ((Math.random() * ( - 6)) + 6);
password+= policy.charAt(randomWithMathRandom);
}
return password;
}
public static void init() {
for (int i = 0; i < 40000000; i++) {
names.add(generator());
}
}
btw I can't take a ready-made list. I think the most 'expensive' waste of time is the input into the list.
My current config :
ryzen 7 4800h
rtx 2600
SSD NVME
RAM 3200MHZ
UPDATE :
I tried with 20Millions and it's display an error: java.lang.OutOfMemoryError thrown from the UncaughtExceptionHandler in thread "main"
Storing 50 million passwords as Strings in-memory could cause problems since either the stack or the heap may overflow. From this point of view, I think the best we can do is to generate a chunk of passwords, store them in a file, generate the next chunk, append them to the file... until the desired amount of passwords is created. I hacked together a small program that generates random Strings of length 32. As alphabet, I used all ASCII-characters between '!' (ASCII-value 33) and '~' (ASCII-value 126).
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.text.DecimalFormat;
import java.text.DecimalFormatSymbols;
import java.util.Random;
import java.util.concurrent.TimeUnit;
class Scratch {
private static final int MIN = '!';
private static final int MAX = '~';
private static final Random RANDOM = new Random();
public static void main(final String... args) throws IOException {
final Path passwordFile = Path.of("passwords.txt");
if (!Files.exists(passwordFile)) {
Files.createFile(passwordFile);
}
final DecimalFormat df = new DecimalFormat();
final DecimalFormatSymbols ds = df.getDecimalFormatSymbols();
ds.setGroupingSeparator('_');
df.setDecimalFormatSymbols(ds);
final int numberOfPasswordsToGenerate = 50_000_000;
final int chunkSize = 1_000_000;
final int passwordLength = 32;
int generated = 0;
int chunk = 0;
final long start = System.nanoTime();
while (generated < numberOfPasswordsToGenerate) {
final StringBuilder passwords = new StringBuilder();
for (
int index = chunk * chunkSize;
index < (chunk + 1) * chunkSize && index < numberOfPasswordsToGenerate;
++index) {
final StringBuilder password = new StringBuilder();
for (int character = 0; character < passwordLength; ++character) {
password.append(fetchRandomLetterFromAlphabet());
}
passwords.append(password.toString()).append(System.lineSeparator());
++generated;
if (generated % 500_000 == 0) {
System.out.printf(
"%s / %s%n",
df.format(generated),
df.format(numberOfPasswordsToGenerate));
}
}
++chunk;
Files.writeString(passwordFile, passwords.toString(), StandardOpenOption.APPEND);
}
final long consumed = System.nanoTime() - start;
System.out.printf("Done. Took %d seconds%n", TimeUnit.NANOSECONDS.toSeconds(consumed));
}
private static char fetchRandomLetterFromAlphabet() {
return (char) (RANDOM.nextInt(MAX - MIN + 1) + MIN);
}
}
On my laptop, the program yields good results. It completes in about 33 seconds and all passwords are stored in a single file.
This program is a proof of concept and not production-ready. For example, if a password.txt does already exist, the content will be appended. For me, the file already has 1.7 GB after one run, so be aware of this. Furthermore, the generated passwords are temporarily stored in a StringBuilder, which may present a security risk since a StringBuilder cannot be cleared (i.e. its internal memory structured cannot be zeroed). Performance could further be improved by running the password generation multi-threaded, but I will leave this as an exercise to the reader.
To use the alphabet presented in the question, we can remove static fields MIN and MAX, define one new static field private static final char[] ALPHABET = "azertyuiopqsdfghjklmwxcvbnAZERTYUIOPQSDFGHJKLMWXCVBN1234567890".toCharArray(); and re-implement fetchRandomLetterFromAlphabet as:
private static char fetchRandomLetterFromAlphabet() {
return ALPHABET[RANDOM.nextInt(ALPHABET.length)];
}
We can use the following code-snippet to read-back the n-th (starting at 0) password from the file in constant time:
final int n = ...;
final RandomAccessFile raf = new RandomAccessFile(passwordFile.toString(), "r");
final long start = System.nanoTime();
final byte[] bytes = new byte[passwordLength];
// byte-length of the first n passwords, including line breaks:
final int offset = (passwordLength + System.lineSeparator().toCharArray().length) * n;
raf.seek(offset); // skip the first n passwords
raf.read(bytes);
// reset to the beginning of the file, in case we want to read more passwords later:
raf.seek(0);
System.out.println(new String(bytes));
I can give you some tips to optimize your code and make it faster, you can use them along with other.
If you know the number of passwords you need, you should create a string array and fill it with the variable in your loop.
If you have to use a dynamic size data structure, use linked list.
Linked list is better than array list when adding elements is your main target, and worse if you want to access them more then add them.
Use string builder instead of += operator on strings.
The += operator is very 'expensive' in complexity of time, because it always creates new strings. Using string builder append method can speed up your code.
Instead of using Math.random() and multiple the result to your range number, create a static Random object and use yourRandomInstance.next(int range).
Consider use the ascii table to get random character instead of using str.charAt(int index) method, it may speed up your code too, i offer you to check it.

Trying to assign a random ID to an object [duplicate]

I've been looking for a simple Java algorithm to generate a pseudo-random alpha-numeric string. In my situation it would be used as a unique session/key identifier that would "likely" be unique over 500K+ generation (my needs don't really require anything much more sophisticated).
Ideally, I would be able to specify a length depending on my uniqueness needs. For example, a generated string of length 12 might look something like "AEYGF7K0DM1X".
Algorithm
To generate a random string, concatenate characters drawn randomly from the set of acceptable symbols until the string reaches the desired length.
Implementation
Here's some fairly simple and very flexible code for generating random identifiers. Read the information that follows for important application notes.
public class RandomString {
/**
* Generate a random string.
*/
public String nextString() {
for (int idx = 0; idx < buf.length; ++idx)
buf[idx] = symbols[random.nextInt(symbols.length)];
return new String(buf);
}
public static final String upper = "ABCDEFGHIJKLMNOPQRSTUVWXYZ";
public static final String lower = upper.toLowerCase(Locale.ROOT);
public static final String digits = "0123456789";
public static final String alphanum = upper + lower + digits;
private final Random random;
private final char[] symbols;
private final char[] buf;
public RandomString(int length, Random random, String symbols) {
if (length < 1) throw new IllegalArgumentException();
if (symbols.length() < 2) throw new IllegalArgumentException();
this.random = Objects.requireNonNull(random);
this.symbols = symbols.toCharArray();
this.buf = new char[length];
}
/**
* Create an alphanumeric string generator.
*/
public RandomString(int length, Random random) {
this(length, random, alphanum);
}
/**
* Create an alphanumeric strings from a secure generator.
*/
public RandomString(int length) {
this(length, new SecureRandom());
}
/**
* Create session identifiers.
*/
public RandomString() {
this(21);
}
}
Usage examples
Create an insecure generator for 8-character identifiers:
RandomString gen = new RandomString(8, ThreadLocalRandom.current());
Create a secure generator for session identifiers:
RandomString session = new RandomString();
Create a generator with easy-to-read codes for printing. The strings are longer than full alphanumeric strings to compensate for using fewer symbols:
String easy = RandomString.digits + "ACEFGHJKLMNPQRUVWXYabcdefhijkprstuvwx";
RandomString tickets = new RandomString(23, new SecureRandom(), easy);
Use as session identifiers
Generating session identifiers that are likely to be unique is not good enough, or you could just use a simple counter. Attackers hijack sessions when predictable identifiers are used.
There is tension between length and security. Shorter identifiers are easier to guess, because there are fewer possibilities. But longer identifiers consume more storage and bandwidth. A larger set of symbols helps, but might cause encoding problems if identifiers are included in URLs or re-entered by hand.
The underlying source of randomness, or entropy, for session identifiers should come from a random number generator designed for cryptography. However, initializing these generators can sometimes be computationally expensive or slow, so effort should be made to re-use them when possible.
Use as object identifiers
Not every application requires security. Random assignment can be an efficient way for multiple entities to generate identifiers in a shared space without any coordination or partitioning. Coordination can be slow, especially in a clustered or distributed environment, and splitting up a space causes problems when entities end up with shares that are too small or too big.
Identifiers generated without taking measures to make them unpredictable should be protected by other means if an attacker might be able to view and manipulate them, as happens in most web applications. There should be a separate authorization system that protects objects whose identifier can be guessed by an attacker without access permission.
Care must be also be taken to use identifiers that are long enough to make collisions unlikely given the anticipated total number of identifiers. This is referred to as "the birthday paradox." The probability of a collision, p, is approximately n2/(2qx), where n is the number of identifiers actually generated, q is the number of distinct symbols in the alphabet, and x is the length of the identifiers. This should be a very small number, like 2‑50 or less.
Working this out shows that the chance of collision among 500k 15-character identifiers is about 2‑52, which is probably less likely than undetected errors from cosmic rays, etc.
Comparison with UUIDs
According to their specification, UUIDs are not designed to be unpredictable, and should not be used as session identifiers.
UUIDs in their standard format take a lot of space: 36 characters for only 122 bits of entropy. (Not all bits of a "random" UUID are selected randomly.) A randomly chosen alphanumeric string packs more entropy in just 21 characters.
UUIDs are not flexible; they have a standardized structure and layout. This is their chief virtue as well as their main weakness. When collaborating with an outside party, the standardization offered by UUIDs may be helpful. For purely internal use, they can be inefficient.
Java supplies a way of doing this directly. If you don't want the dashes, they are easy to strip out. Just use uuid.replace("-", "")
import java.util.UUID;
public class randomStringGenerator {
public static void main(String[] args) {
System.out.println(generateString());
}
public static String generateString() {
String uuid = UUID.randomUUID().toString();
return "uuid = " + uuid;
}
}
Output
uuid = 2d7428a6-b58c-4008-8575-f05549f16316
static final String AB = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz";
static SecureRandom rnd = new SecureRandom();
String randomString(int len){
StringBuilder sb = new StringBuilder(len);
for(int i = 0; i < len; i++)
sb.append(AB.charAt(rnd.nextInt(AB.length())));
return sb.toString();
}
If you're happy to use Apache classes, you could use org.apache.commons.text.RandomStringGenerator (Apache Commons Text).
Example:
RandomStringGenerator randomStringGenerator =
new RandomStringGenerator.Builder()
.withinRange('0', 'z')
.filteredBy(CharacterPredicates.LETTERS, CharacterPredicates.DIGITS)
.build();
randomStringGenerator.generate(12); // toUpperCase() if you want
Since Apache Commons Lang 3.6, RandomStringUtils is deprecated.
You can use an Apache Commons library for this, RandomStringUtils:
RandomStringUtils.randomAlphanumeric(20).toUpperCase();
In one line:
Long.toHexString(Double.doubleToLongBits(Math.random()));
Source: Java - generating a random string
This is easily achievable without any external libraries.
1. Cryptographic Pseudo Random Data Generation (PRNG)
First you need a cryptographic PRNG. Java has SecureRandom for that and typically uses the best entropy source on the machine (e.g. /dev/random). Read more here.
SecureRandom rnd = new SecureRandom();
byte[] token = new byte[byteLength];
rnd.nextBytes(token);
Note: SecureRandom is the slowest, but most secure way in Java of generating random bytes. I do however recommend not considering performance here since it usually has no real impact on your application unless you have to generate millions of tokens per second.
2. Required Space of Possible Values
Next you have to decide "how unique" your token needs to be. The whole and only point of considering entropy is to make sure that the system can resist brute force attacks: the space of possible values must be so large that any attacker could only try a negligible proportion of the values in non-ludicrous time1.
Unique identifiers such as random UUID have 122 bit of entropy (i.e., 2^122 = 5.3x10^36) - the chance of collision is "*(...) for there to be a one in a billion chance of duplication, 103 trillion version 4 UUIDs must be generated2". We will choose 128 bits since it fits exactly into 16 bytes and is seen as highly sufficient for being unique for basically every, but the most extreme, use cases and you don't have to think about duplicates. Here is a simple comparison table of entropy including simple analysis of the birthday problem.
For simple requirements, 8 or 12 byte length might suffice, but with 16 bytes you are on the "safe side".
And that's basically it. The last thing is to think about encoding so it can be represented as a printable text (read, a String).
3. Binary to Text Encoding
Typical encodings include:
Base64 every character encodes 6 bit, creating a 33% overhead. Fortunately there are standard implementations in Java 8+ and Android. With older Java you can use any of the numerous third-party libraries. If you want your tokens to be URL safe use the URL-safe version of RFC4648 (which usually is supported by most implementations). Example encoding 16 bytes with padding: XfJhfv3C0P6ag7y9VQxSbw==
Base32 every character encodes 5 bit, creating a 40% overhead. This will use A-Z and 2-7, making it reasonably space efficient while being case-insensitive alpha-numeric. There isn't any standard implementation in the JDK. Example encoding 16 bytes without padding: WUPIL5DQTZGMF4D3NX5L7LNFOY
Base16 (hexadecimal) every character encodes four bit, requiring two characters per byte (i.e., 16 bytes create a string of length 32). Therefore hexadecimal is less space efficient than Base32, but it is safe to use in most cases (URL) since it only uses 0-9 and A to F. Example encoding 16 bytes: 4fa3dd0f57cb3bf331441ed285b27735. See a Stack Overflow discussion about converting to hexadecimal here.
Additional encodings like Base85 and the exotic Base122 exist with better/worse space efficiency. You can create your own encoding (which basically most answers in this thread do), but I would advise against it, if you don't have very specific requirements. See more encoding schemes in the Wikipedia article.
4. Summary and Example
Use SecureRandom
Use at least 16 bytes (2^128) of possible values
Encode according to your requirements (usually hex or base32 if you need it to be alpha-numeric)
Don't
... use your home brew encoding: better maintainable and readable for others if they see what standard encoding you use instead of weird for loops creating characters at a time.
... use UUID: it has no guarantees on randomness; you are wasting 6 bits of entropy and have a verbose string representation
Example: Hexadecimal Token Generator
public static String generateRandomHexToken(int byteLength) {
SecureRandom secureRandom = new SecureRandom();
byte[] token = new byte[byteLength];
secureRandom.nextBytes(token);
return new BigInteger(1, token).toString(16); // Hexadecimal encoding
}
//generateRandomHexToken(16) -> 2189df7475e96aa3982dbeab266497cd
Example: Base64 Token Generator (URL Safe)
public static String generateRandomBase64Token(int byteLength) {
SecureRandom secureRandom = new SecureRandom();
byte[] token = new byte[byteLength];
secureRandom.nextBytes(token);
return Base64.getUrlEncoder().withoutPadding().encodeToString(token); //base64 encoding
}
//generateRandomBase64Token(16) -> EEcCCAYuUcQk7IuzdaPzrg
Example: Java CLI Tool
If you want a ready-to-use CLI tool you may use dice:
Example: Related issue - Protect Your Current Ids
If you already have an id you can use (e.g., a synthetic long in your entity), but don't want to publish the internal value, you can use this library to encrypt it and obfuscate it: https://github.com/patrickfav/id-mask
IdMask<Long> idMask = IdMasks.forLongIds(Config.builder(key).build());
String maskedId = idMask.mask(id);
// Example: NPSBolhMyabUBdTyanrbqT8
long originalId = idMask.unmask(maskedId);
Using Dollar should be as simple as:
// "0123456789" + "ABCDE...Z"
String validCharacters = $('0', '9').join() + $('A', 'Z').join();
String randomString(int length) {
return $(validCharacters).shuffle().slice(length).toString();
}
#Test
public void buildFiveRandomStrings() {
for (int i : $(5)) {
System.out.println(randomString(12));
}
}
It outputs something like this:
DKL1SBH9UJWC
JH7P0IT21EA5
5DTI72EO6SFU
HQUMJTEBNF7Y
1HCR6SKYWGT7
Here it is in Java:
import static java.lang.Math.round;
import static java.lang.Math.random;
import static java.lang.Math.pow;
import static java.lang.Math.abs;
import static java.lang.Math.min;
import static org.apache.commons.lang.StringUtils.leftPad
public class RandomAlphaNum {
public static String gen(int length) {
StringBuffer sb = new StringBuffer();
for (int i = length; i > 0; i -= 12) {
int n = min(12, abs(i));
sb.append(leftPad(Long.toString(round(random() * pow(36, n)), 36), n, '0'));
}
return sb.toString();
}
}
Here's a sample run:
scala> RandomAlphaNum.gen(42)
res3: java.lang.String = uja6snx21bswf9t89s00bxssu8g6qlu16ffzqaxxoy
A short and easy solution, but it uses only lowercase and numerics:
Random r = new java.util.Random ();
String s = Long.toString (r.nextLong () & Long.MAX_VALUE, 36);
The size is about 12 digits to base 36 and can't be improved further, that way. Of course you can append multiple instances.
Surprising, no one here has suggested it, but:
import java.util.UUID
UUID.randomUUID().toString();
Easy.
The benefit of this is UUIDs are nice, long, and guaranteed to be almost impossible to collide.
Wikipedia has a good explanation of it:
" ...only after generating 1 billion UUIDs every second for the next 100 years, the probability of creating just one duplicate would be about 50%."
The first four bits are the version type and two for the variant, so you get 122 bits of random. So if you want to, you can truncate from the end to reduce the size of the UUID. It's not recommended, but you still have loads of randomness, enough for your 500k records easy.
An alternative in Java 8 is:
static final Random random = new Random(); // Or SecureRandom
static final int startChar = (int) '!';
static final int endChar = (int) '~';
static String randomString(final int maxLength) {
final int length = random.nextInt(maxLength + 1);
return random.ints(length, startChar, endChar + 1)
.collect(StringBuilder::new, StringBuilder::appendCodePoint, StringBuilder::append)
.toString();
}
public static String generateSessionKey(int length){
String alphabet =
new String("0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz"); // 9
int n = alphabet.length(); // 10
String result = new String();
Random r = new Random(); // 11
for (int i=0; i<length; i++) // 12
result = result + alphabet.charAt(r.nextInt(n)); //13
return result;
}
import java.util.Random;
public class passGen{
// Version 1.0
private static final String dCase = "abcdefghijklmnopqrstuvwxyz";
private static final String uCase = "ABCDEFGHIJKLMNOPQRSTUVWXYZ";
private static final String sChar = "!##$%^&*";
private static final String intChar = "0123456789";
private static Random r = new Random();
private static StringBuilder pass = new StringBuilder();
public static void main (String[] args) {
System.out.println ("Generating pass...");
while (pass.length () != 16){
int rPick = r.nextInt(4);
if (rPick == 0){
int spot = r.nextInt(26);
pass.append(dCase.charAt(spot));
} else if (rPick == 1) {
int spot = r.nextInt(26);
pass.append(uCase.charAt(spot));
} else if (rPick == 2) {
int spot = r.nextInt(8);
pass.append(sChar.charAt(spot));
} else {
int spot = r.nextInt(10);
pass.append(intChar.charAt(spot));
}
}
System.out.println ("Generated Pass: " + pass.toString());
}
}
This just adds the password into the string and... yeah, it works well. Check it out... It is very simple; I wrote it.
Using UUIDs is insecure, because parts of the UUID aren't random at all. The procedure of erickson is very neat, but it does not create strings of the same length. The following snippet should be sufficient:
/*
* The random generator used by this class to create random keys.
* In a holder class to defer initialization until needed.
*/
private static class RandomHolder {
static final Random random = new SecureRandom();
public static String randomKey(int length) {
return String.format("%"+length+"s", new BigInteger(length*5/*base 32,2^5*/, random)
.toString(32)).replace('\u0020', '0');
}
}
Why choose length*5? Let's assume the simple case of a random string of length 1, so one random character. To get a random character containing all digits 0-9 and characters a-z, we would need a random number between 0 and 35 to get one of each character.
BigInteger provides a constructor to generate a random number, uniformly distributed over the range 0 to (2^numBits - 1). Unfortunately 35 is not a number which can be received by 2^numBits - 1.
So we have two options: Either go with 2^5-1=31 or 2^6-1=63. If we would choose 2^6 we would get a lot of "unnecessary" / "longer" numbers. Therefore 2^5 is the better option, even if we lose four characters (w-z). To now generate a string of a certain length, we can simply use a 2^(length*numBits)-1 number. The last problem, if we want a string with a certain length, random could generate a small number, so the length is not met, so we have to pad the string to its required length prepending zeros.
I found this solution that generates a random hex encoded string. The provided unit test seems to hold up to my primary use case. Although, it is slightly more complex than some of the other answers provided.
/**
* Generate a random hex encoded string token of the specified length
*
* #param length
* #return random hex string
*/
public static synchronized String generateUniqueToken(Integer length){
byte random[] = new byte[length];
Random randomGenerator = new Random();
StringBuffer buffer = new StringBuffer();
randomGenerator.nextBytes(random);
for (int j = 0; j < random.length; j++) {
byte b1 = (byte) ((random[j] & 0xf0) >> 4);
byte b2 = (byte) (random[j] & 0x0f);
if (b1 < 10)
buffer.append((char) ('0' + b1));
else
buffer.append((char) ('A' + (b1 - 10)));
if (b2 < 10)
buffer.append((char) ('0' + b2));
else
buffer.append((char) ('A' + (b2 - 10)));
}
return (buffer.toString());
}
#Test
public void testGenerateUniqueToken(){
Set set = new HashSet();
String token = null;
int size = 16;
/* Seems like we should be able to generate 500K tokens
* without a duplicate
*/
for (int i=0; i<500000; i++){
token = Utility.generateUniqueToken(size);
if (token.length() != size * 2){
fail("Incorrect length");
} else if (set.contains(token)) {
fail("Duplicate token generated");
} else{
set.add(token);
}
}
}
Change String characters as per as your requirements.
String is immutable. Here StringBuilder.append is more efficient than string concatenation.
public static String getRandomString(int length) {
final String characters = "abcdefghijklmnopqrstuvwxyzABCDEFGHIJLMNOPQRSTUVWXYZ1234567890!##$%^&*()_+";
StringBuilder result = new StringBuilder();
while(length > 0) {
Random rand = new Random();
result.append(characters.charAt(rand.nextInt(characters.length())));
length--;
}
return result.toString();
}
import java.util.Date;
import java.util.Random;
public class RandomGenerator {
private static Random random = new Random((new Date()).getTime());
public static String generateRandomString(int length) {
char[] values = {'a','b','c','d','e','f','g','h','i','j',
'k','l','m','n','o','p','q','r','s','t',
'u','v','w','x','y','z','0','1','2','3',
'4','5','6','7','8','9'};
String out = "";
for (int i=0;i<length;i++) {
int idx=random.nextInt(values.length);
out += values[idx];
}
return out;
}
}
I don't really like any of these answers regarding a "simple" solution :S
I would go for a simple ;), pure Java, one liner (entropy is based on random string length and the given character set):
public String randomString(int length, String characterSet) {
return IntStream.range(0, length).map(i -> new SecureRandom().nextInt(characterSet.length())).mapToObj(randomInt -> characterSet.substring(randomInt, randomInt + 1)).collect(Collectors.joining());
}
#Test
public void buildFiveRandomStrings() {
for (int q = 0; q < 5; q++) {
System.out.println(randomString(10, "ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789")); // The character set can basically be anything
}
}
Or (a bit more readable old way)
public String randomString(int length, String characterSet) {
StringBuilder sb = new StringBuilder(); // Consider using StringBuffer if needed
for (int i = 0; i < length; i++) {
int randomInt = new SecureRandom().nextInt(characterSet.length());
sb.append(characterSet.substring(randomInt, randomInt + 1));
}
return sb.toString();
}
#Test
public void buildFiveRandomStrings() {
for (int q = 0; q < 5; q++) {
System.out.println(randomString(10, "ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789")); // The character set can basically be anything
}
}
But on the other hand you could also go with UUID which has a pretty good entropy:
UUID.randomUUID().toString().replace("-", "")
I'm using a library from Apache Commons to generate an alphanumeric string:
import org.apache.commons.lang3.RandomStringUtils;
String keyLength = 20;
RandomStringUtils.randomAlphanumeric(keylength);
It's fast and simple!
You mention "simple", but just in case anyone else is looking for something that meets more stringent security requirements, you might want to take a look at jpwgen. jpwgen is modeled after pwgen in Unix, and is very configurable.
import java.util.*;
import javax.swing.*;
public class alphanumeric {
public static void main(String args[]) {
String nval, lenval;
int n, len;
nval = JOptionPane.showInputDialog("Enter number of codes you require: ");
n = Integer.parseInt(nval);
lenval = JOptionPane.showInputDialog("Enter code length you require: ");
len = Integer.parseInt(lenval);
find(n, len);
}
public static void find(int n, int length) {
String str1 = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ";
StringBuilder sb = new StringBuilder(length);
Random r = new Random();
System.out.println("\n\t Unique codes are \n\n");
for(int i=0; i<n; i++) {
for(int j=0; j<length; j++) {
sb.append(str1.charAt(r.nextInt(str1.length())));
}
System.out.println(" " + sb.toString());
sb.delete(0, length);
}
}
}
Here is the one-liner by abacus-common:
String.valueOf(CharStream.random('0', 'z').filter(c -> N.isLetterOrDigit(c)).limit(12).toArray())
Random doesn't mean it must be unique. To get unique strings, use:
N.uuid() // E.g.: "e812e749-cf4c-4959-8ee1-57829a69a80f". length is 36.
N.guid() // E.g.: "0678ce04e18945559ba82ddeccaabfcd". length is 32 without '-'
You can use the following code, if your password mandatory contains numbers and alphabetic special characters:
private static final String NUMBERS = "0123456789";
private static final String UPPER_ALPHABETS = "ABCDEFGHIJKLMNOPQRSTUVWXYZ";
private static final String LOWER_ALPHABETS = "abcdefghijklmnopqrstuvwxyz";
private static final String SPECIALCHARACTERS = "##$%&*";
private static final int MINLENGTHOFPASSWORD = 8;
public static String getRandomPassword() {
StringBuilder password = new StringBuilder();
int j = 0;
for (int i = 0; i < MINLENGTHOFPASSWORD; i++) {
password.append(getRandomPasswordCharacters(j));
j++;
if (j == 3) {
j = 0;
}
}
return password.toString();
}
private static String getRandomPasswordCharacters(int pos) {
Random randomNum = new Random();
StringBuilder randomChar = new StringBuilder();
switch (pos) {
case 0:
randomChar.append(NUMBERS.charAt(randomNum.nextInt(NUMBERS.length() - 1)));
break;
case 1:
randomChar.append(UPPER_ALPHABETS.charAt(randomNum.nextInt(UPPER_ALPHABETS.length() - 1)));
break;
case 2:
randomChar.append(SPECIALCHARACTERS.charAt(randomNum.nextInt(SPECIALCHARACTERS.length() - 1)));
break;
case 3:
randomChar.append(LOWER_ALPHABETS.charAt(randomNum.nextInt(LOWER_ALPHABETS.length() - 1)));
break;
}
return randomChar.toString();
}
You can use the UUID class with its getLeastSignificantBits() message to get 64 bit of random data, and then convert it to a radix 36 number (i.e. a string consisting of 0-9,A-Z):
Long.toString(Math.abs( UUID.randomUUID().getLeastSignificantBits(), 36));
This yields a string up to 13 characters long. We use Math.abs() to make sure there isn't a minus sign sneaking in.
Here it is a Scala solution:
(for (i <- 0 until rnd.nextInt(64)) yield {
('0' + rnd.nextInt(64)).asInstanceOf[Char]
}) mkString("")
Using an Apache Commons library, it can be done in one line:
import org.apache.commons.lang.RandomStringUtils;
RandomStringUtils.randomAlphanumeric(64);
Documentation
public static String randomSeriesForThreeCharacter() {
Random r = new Random();
String value = "";
char random_Char ;
for(int i=0; i<10; i++)
{
random_Char = (char) (48 + r.nextInt(74));
value = value + random_char;
}
return value;
}
I think this is the smallest solution here, or nearly one of the smallest:
public String generateRandomString(int length) {
String randomString = "";
final char[] chars = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz01234567890".toCharArray();
final Random random = new Random();
for (int i = 0; i < length; i++) {
randomString = randomString + chars[random.nextInt(chars.length)];
}
return randomString;
}
The code works just fine. If you are using this method, I recommend you to use more than 10 characters. A collision happens at 5 characters / 30362 iterations. This took 9 seconds.
public class Utils {
private final Random RANDOM = new SecureRandom();
private final String ALPHABET = "0123456789QWERTYUIOPASDFGHJKLZXCVBNMqwertyuiopasdfghjklzxcvbnm";
private String generateRandomString(int length) {
StringBuffer buffer = new StringBuffer(length);
for (int i = 0; i < length; i++) {
buffer.append(ALPHABET.charAt(RANDOM.nextInt(ALPHABET.length())));
}
return new String(buffer);
}
}

I need to decrypt a file efficiently [closed]

Closed. This question does not meet Stack Overflow guidelines. It is not currently accepting answers.
This question does not appear to be about programming within the scope defined in the help center.
Closed 4 years ago.
Improve this question
I am trying to decrypt an encrypted file with unknown key - the only thing I know about it is that the key is an integer x, 0 <= x < 1010 (i.e. a maximum of 10 decimal digits).
public static String enc(String msg, long key) {
String ans = "";
Random rand = new Random(key);
for (int i = 0; i < msg.length(); i = i + 1) {
char c = msg.charAt(i);
int s = c;
int rd = rand.nextInt() % (256 * 256);
int s2 = s ^ rd;
char c2 = (char) (s2);
ans += c2;
}
return ans;
}
private static String tryToDecode(String string) {
String returnedString = "";
long key;
String msg = reader(string);
for (long i = 0; i <= 999999999; i++) {
System.out.println("decoding message with key + " + i);
key = i;
System.out.println("decoding with key: " + i + "\n" + enc(msg, key));
}
return returnedString;
}
I expect to find the plain text
The program works very slowly, is there any way to make it more efficient?
You can use Parallel Array Operations added in JAVA 8 if you are using Java 8 to achive this.
The best fit for you would be to use Spliterator
public void spliterate() {
System.out.println("\nSpliterate:");
int[] src = getData();
Spliterator<Integer> spliterator = Arrays.spliterator(src);
spliterator.forEachRemaining( n -> action(n) );
}
public void action(int value) {
System.out.println("value:"+value);
// Perform some real work on this data here...
}
I am still not clear about your situation. Here some great tutorials and articles to figure out which parallel array operations of java 8 is going to help you ?
http://www.drdobbs.com/jvm/parallel-array-operations-in-java-8/240166287
https://blog.rapid7.com/2015/10/28/java-8-introduction-to-parallelism-and-spliterator/
First things first: You can't println billions of lines. This will take forever, and it's pointless - you won't be able to see the text as it scrolls by, and your buffer won't save billion of lines so you couldn't scroll back up later even if you wanted to. If you prefer (and don't mind it being 2-3% slower than it otherwise would be), you can output once every hundred million keys, just so you can verify your program is making progress.
You can optimize things by not concatenating Strings inside the loop. Strings are immutable, so the old code was creating a rather large number of Strings, especially in the enc method. Normally I'd use a StringBuilder, but in this case a simple character array will meet our needs.
And there's one more thing we need to do that your current code doesn't do: Detect when we have the answer. If we assume that the message will only contain characters from 0-127 with no Unicode or extended ASCII, then we know we have a possible answer when the entire message contains only characters in this range. And we can also use this to further optimize, as we can then immediately discard any message that has a character outside of this range. We don't even have to finish decoding it and can move on to the next key. (If the message is of any length, the odds are that only one key will produce a decoded message with characters in that range - but it's not guaranteed, which is why I do not stop when I get to a valid message. You could probably do that, though.)
Due to the way random numbers are generated in Java, anything in the seed above 32 bits is not used by the encoding/decoding algorithm. So you only need to go up to 4294967295 instead of 9999999999. (This also means the key that was originally used to encode the message might not be the key this program uses to decode it, since 2-3 keys in the 10 digit range will produce the same encoding/decoding.)
private static String tryToDecode4(String msg) {
String returnedString = "";
for (long i=0; i<=4294967295l; i++)
{
if (i % 100000000 == 0) // This part is just to see that it's making progress. Remove if desired for a small speed gain.
System.out.println("Trying " + i);
char[] decoded = enc4(msg, i);
if (decoded == null)
continue;
returnedString = String.valueOf(decoded);
System.out.println("decoding with key: " + i + " " + returnedString);
}
return returnedString;
}
private static char[] enc4(String msg, long key) {
char[] ansC = new char[msg.length()];
Random rand = new Random(key);
for(int i=0;i<msg.length();i=i+1)
{
char c = msg.charAt(i);
int s = c;
int rd = rand.nextInt()%(256*256);
int s2 = s^rd;
char c2 = (char)(s2);
if (c2 > 127)
return null;
ansC[i] = c2;
}
return ansC;
}
This code finished running in a little over 3 minutes on my machine, with a message of "Hello World".
This code will not work well for very short messages (3-4 characters or less.) It will not work if the message contains Unicode or extended ASCII, although it could easily be modified to do so if you know the range of characters that might be in the message.

Save to binary file

I'd like to save data to binary file using java. For example i have the number 101 and in my program the output file have 4 Bytes. How can i save the number only in three bits (101) in the output file ?
My program looks like this:
public static void main(String args[]) throws FileNotFoundException, IOException {
int i = 101;
DataOutputStream os = new DataOutputStream(new FileOutputStream("file"));
os.writeInt(i);
os.close();
}
I found something like that: http://www.developer.nokia.com/Community/Wiki/Bit_Input/Output_Stream_utility_classes_for_efficient_data_transfer
You can not write less than one byte to a file. If you want to write the binary number 101 then do int i = 5 and use os.write(i) instead. This will write one byte: 0000 0101.
First off, you can't write just 3 bits to a file, memory is aligned at specific values (8, 16, 32, 64 or even 128 bit, this is compiler/platform specific). If you write smaller sizes than that, they will be expanded to match the alignment.
Secondly, the decimal number 101, written in binary is 0b01100101. the binary number 0b00000101, is decimal 5.
Thirdly, these numbers are now only 1 Byte (8 bit) long, so you can use a char instead of an int.
And last but not least, to write non-integer numbers, use os.write()
So to get to what you want, first check if you want to write 0b01100101 or 0b00000101. Change the int to a char and to the appropriate number (you can write 0b01100101 in Java). And use os.write()
A really naive implementation, I hope it helps you grasp the idea. Also untested, likely to contain off-by-one errors etc...
class BitArray {
// public fields for example code, in real code encapsulate
public int bits=0; // actual count of stored bits
public byte[] buf=new byte[1];
private void doubleBuf() {
byte [] tmp = new byte[buf.length * 2];
System.arraycopy(buf, 0, tmp, 0, buf.length);
buf = tmp;
}
private int arrayIndex(int bitNum) {
return bitNum / 8;
}
private int bitNumInArray(int bitNum) {
return bitNum & 7; // change to change bit order in buf's bytes
}
// returns how many elements of buf are actually in use, for saving etc.
// note that last element usually contains unused bits.
public int getUsedArrayElements() {
return arrayIndex(this.bits-1) + 1;
}
// bitvalue is 0 for 0, non-0 for 1
public void setBit(byte bitValue, int bitNum) {
if (bitNum >= this.bits || bitNum < 0) throw new InvalidArgumentException();
if (bitValue == 0) this.buf[arrayIndex(bitNum)] &= ~((byte)1 << bitNumInArray(bitNum));
else this.buf[arrayIndex(bitNum)] |= (byte)1 << bitNumInArray(bitNum);
}
public void addBit(int bitValue) {
// this.bits is old bit count, which is same as index of new last bit
if (this.buf.length <= arrayIndex(this.bits)) doubleBuf();
++this.bits;
setBit(bitValue, this.bits-1);
}
int readBit(int bitNum) { // return 0 or 1
if (bitNum >= this.bits || bitNum < 0) throw new InvalidArgumentException();
byte value = buf[arrayIndex(bitNum)] & ((byte)1 << bitNumInArray(bitNum));
return (value == 0) ? 0 : 1;
}
void addBits(int bitCount, int bitValues) {
for (int num = bitCount - 1 ; num >= 0 ; --num) {
// change loop iteration order to change bit order of bitValues
addBit(bitValues & (1 << num));
}
}
For efficient solution, it should use int or long array instead of byte array, and include more efficient method for multiple bit addition (add parts of bitValues an entire buf array element at a time, not bit-by-bit like above).
To save this, you need to save the right number of bytes from buf, calculated by getUsedArrayElements().

Efficient way to search a stream for a string

Let's suppose that have a stream of text (or Reader in Java) that I'd like to check for a particular string. The stream of text might be very large so as soon as the search string is found I'd like to return true and also try to avoid storing the entire input in memory.
Naively, I might try to do something like this (in Java):
public boolean streamContainsString(Reader reader, String searchString) throws IOException {
char[] buffer = new char[1024];
int numCharsRead;
while((numCharsRead = reader.read(buffer)) > 0) {
if ((new String(buffer, 0, numCharsRead)).indexOf(searchString) >= 0)
return true;
}
return false;
}
Of course this fails to detect the given search string if it occurs on the boundary of the 1k buffer:
Search text: "stackoverflow"
Stream buffer 1: "abc.........stack"
Stream buffer 2: "overflow.......xyz"
How can I modify this code so that it correctly finds the given search string across the boundary of the buffer but without loading the entire stream into memory?
Edit: Note when searching a stream for a string, we're trying to minimise the number of reads from the stream (to avoid latency in a network/disk) and to keep memory usage constant regardless of the amount of data in the stream. Actual efficiency of the string matching algorithm is secondary but obviously, it would be nice to find a solution that used one of the more efficient of those algorithms.
There are three good solutions here:
If you want something that is easy and reasonably fast, go with no buffer, and instead implement a simple nondeterminstic finite-state machine. Your state will be a list of indices into the string you are searching, and your logic looks something like this (pseudocode):
String needle;
n = needle.length();
for every input character c do
add index 0 to the list
for every index i in the list do
if c == needle[i] then
if i + 1 == n then
return true
else
replace i in the list with i + 1
end
else
remove i from the list
end
end
end
This will find the string if it exists and you will never need a
buffer.
Slightly more work but also faster: do an NFA-to-DFA conversion that figures out in advance what lists of indices are possible, and assign each one to a small integer. (If you read about string search on Wikipedia, this is called the powerset construction.) Then you have a single state and you make a state-to-state transition on each incoming character. The NFA you want is just the DFA for the string preceded with a state that nondeterministically either drops a character or tries to consume the current character. You'll want an explicit error state as well.
If you want something faster, create a buffer whose size is at least twice n, and user Boyer-Moore to compile a state machine from needle. You'll have a lot of extra hassle because Boyer-Moore is not trivial to implement (although you'll find code online) and because you'll have to arrange to slide the string through the buffer. You'll have to build or find a circular buffer that can 'slide' without copying; otherwise you're likely to give back any performance gains you might get from Boyer-Moore.
I did a few changes to the Knuth Morris Pratt algorithm for partial searches. Since the actual comparison position is always less or equal than the next one there is no need for extra memory. The code with a Makefile is also available on github and it is written in Haxe to target multiple programming languages at once, including Java.
I also wrote a related article: searching for substrings in streams: a slight modification of the Knuth-Morris-Pratt algorithm in Haxe. The article mentions the Jakarta RegExp, now retired and resting in the Apache Attic. The Jakarta Regexp library “match” method in the RE class uses a CharacterIterator as a parameter.
class StreamOrientedKnuthMorrisPratt {
var m: Int;
var i: Int;
var ss:
var table: Array<Int>;
public function new(ss: String) {
this.ss = ss;
this.buildTable(this.ss);
}
public function begin() : Void {
this.m = 0;
this.i = 0;
}
public function partialSearch(s: String) : Int {
var offset = this.m + this.i;
while(this.m + this.i - offset < s.length) {
if(this.ss.substr(this.i, 1) == s.substr(this.m + this.i - offset,1)) {
if(this.i == this.ss.length - 1) {
return this.m;
}
this.i += 1;
} else {
this.m += this.i - this.table[this.i];
if(this.table[this.i] > -1)
this.i = this.table[this.i];
else
this.i = 0;
}
}
return -1;
}
private function buildTable(ss: String) : Void {
var pos = 2;
var cnd = 0;
this.table = new Array<Int>();
if(ss.length > 2)
this.table.insert(ss.length, 0);
else
this.table.insert(2, 0);
this.table[0] = -1;
this.table[1] = 0;
while(pos < ss.length) {
if(ss.substr(pos-1,1) == ss.substr(cnd, 1))
{
cnd += 1;
this.table[pos] = cnd;
pos += 1;
} else if(cnd > 0) {
cnd = this.table[cnd];
} else {
this.table[pos] = 0;
pos += 1;
}
}
}
public static function main() {
var KMP = new StreamOrientedKnuthMorrisPratt("aa");
KMP.begin();
trace(KMP.partialSearch("ccaabb"));
KMP.begin();
trace(KMP.partialSearch("ccarbb"));
trace(KMP.partialSearch("fgaabb"));
}
}
The Knuth-Morris-Pratt search algorithm never backs up; this is just the property you want for your stream search. I've used it before for this problem, though there may be easier ways using available Java libraries. (When this came up for me I was working in C in the 90s.)
KMP in essence is a fast way to build a string-matching DFA, like Norman Ramsey's suggestion #2.
This answer applied to the initial version of the question where the key was to read the stream only as far as necessary to match on a String, if that String was present. This solution would not meet the requirement to guarantee fixed memory utilisation, but may be worth considering if you have found this question and are not bound by that constraint.
If you are bound by the constant memory usage constraint, Java stores arrays of any type on the heap, and as such nulling the reference does not deallocate memory in any way; I think any solution involving arrays in a loop will consume memory on the heap and require GC.
For simple implementation, maybe Java 5's Scanner which can accept an InputStream and use a java.util.regex.Pattern to search the input for might save you worrying about the implementation details.
Here's an example of a potential implementation:
public boolean streamContainsString(Reader reader, String searchString)
throws IOException {
Scanner streamScanner = new Scanner(reader);
if (streamScanner.findWithinHorizon(searchString, 0) != null) {
return true;
} else {
return false;
}
}
I'm thinking regex because it sounds like a job for a Finite State Automaton, something that starts in an initial state, changing state character by character until it either rejects the string (no match) or gets to an accept state.
I think this is probably the most efficient matching logic you could use, and how you organize the reading of the information can be divorced from the matching logic for performance tuning.
It's also how regexes work.
Instead of having your buffer be an array, use an abstraction that implements a circular buffer. Your index calculation will be buf[(next+i) % sizeof(buf)], and you'll have to be careful to full the buffer one-half at a time. But as long as the search string fits in half the buffer, you'll find it.
I believe the best solution to this problem is to try to keep it simple. Remember, beacause I'm reading from a stream, I want to keep the number of reads from the stream to a minimum (as network or disk latency may be an issue) while keeping the amount of memory used constant (as the stream may be very large in size). Actual efficiency of the string matching is not the number one goal (as that has been studied to death already).
Based on AlbertoPL's suggestion, here's a simple solution that compares the buffer against the search string character by character. The key is that because the search is only done one character at a time, no back tracking is needed and therefore no circular buffers, or buffers of a particular size are needed.
Now, if someone can come up with a similar implementation based on Knuth-Morris-Pratt search algorithm then we'd have a nice efficient solution ;)
public boolean streamContainsString(Reader reader, String searchString) throws IOException {
char[] buffer = new char[1024];
int numCharsRead;
int count = 0;
while((numCharsRead = reader.read(buffer)) > 0) {
for (int c = 0; c < numCharsRead; c++) {
if (buffer[c] == searchString.charAt(count))
count++;
else
count = 0;
if (count == searchString.length()) return true;
}
}
return false;
}
If you're not tied to using a Reader, then you can use Java's NIO API to efficiently load the file. For example (untested, but should be close to working):
public boolean streamContainsString(File input, String searchString) throws IOException {
Pattern pattern = Pattern.compile(Pattern.quote(searchString));
FileInputStream fis = new FileInputStream(input);
FileChannel fc = fis.getChannel();
int sz = (int) fc.size();
MappedByteBuffer bb = fc.map(FileChannel.MapMode.READ_ONLY, 0, sz);
CharsetDecoder decoder = Charset.forName("UTF-8").newDecoder();
CharBuffer cb = decoder.decode(bb);
Matcher matcher = pattern.matcher(cb);
return matcher.matches();
}
This basically mmap()'s the file to search and relies on the operating system to do the right thing regarding cache and memory usage. Note however that map() is more expensive the just reading the file in to a large buffer for files less than around 10 KiB.
A very fast searching of a stream is implemented in the RingBuffer class from the Ujorm framework. See the sample:
Reader reader = RingBuffer.createReader("xxx ${abc} ${def} zzz");
String word1 = RingBuffer.findWord(reader, "${", "}");
assertEquals("abc", word1);
String word2 = RingBuffer.findWord(reader, "${", "}");
assertEquals("def", word2);
String word3 = RingBuffer.findWord(reader, "${", "}");
assertEquals("", word3);
The single class implementation is available on the SourceForge:
For more information see the link.
Implement a sliding window. Have your buffer around, move all elements in the buffer one forward and enter a single new character in the buffer at the end. If the buffer is equal to your searched word, it is contained.
Of course, if you want to make this more efficient, you can look at a way to prevent moving all elements in the buffer around, for example by having a cyclic buffer and a representation of the strings which 'cycles' the same way the buffer does, so you only need to check for content-equality. This saves moving all elements in the buffer.
I think you need to buffer a small amount at the boundary between buffers.
For example if your buffer size is 1024 and the length of the SearchString is 10, then as well as searching each 1024-byte buffer you also need to search each 18-byte transition between two buffers (9 bytes from the end of the previous buffer concatenated with 9 bytes from the start of the next buffer).
I'd say switch to a character by character solution, in which case you'd scan for the first character in your target text, then when you find that character increment a counter and look for the next character. Every time you don't find the next consecutive character restart the counter. It would work like this:
public boolean streamContainsString(Reader reader, String searchString) throws IOException {
char[] buffer = new char[1024];
int numCharsRead;
int count = 0;
while((numCharsRead = reader.read(buffer)) > 0) {
if (buffer[numCharsRead -1] == searchString.charAt(count))
count++;
else
count = 0;
if (count == searchString.size())
return true;
}
return false;
}
The only problem is when you're in the middle of looking through characters... in which case there needs to be a way of remembering your count variable. I don't see an easy way of doing so except as a private variable for the whole class. In which case you would not instantiate count inside this method.
You might be able to implement a very fast solution using Fast Fourier Transforms, which, if implemented properly, allow you to do string matching in times O(nlog(m)), where n is the length of the longer string to be matched, and m is the length of the shorter string. You could, for example, perform FFT as soon as you receive an stream input of length m, and if it matches, you can return, and if it doesn't match, you can throw away the first character in the stream input, wait for a new character to appear through the stream, and then perform FFT again.
You can increase the speed of search for very large strings by using some string search algorithm
If you're looking for a constant substring rather than a regex, I'd recommend Boyer-Moore. There's plenty of source code on the internet.
Also, use a circular buffer, to avoid think too hard about buffer boundaries.
Mike.
I also had a similar problem: skip bytes from the InputStream until specified string (or byte array). This is the simple code based on circular buffer. It is not very efficient but works for my needs:
private static boolean matches(int[] buffer, int offset, byte[] search) {
final int len = buffer.length;
for (int i = 0; i < len; ++i) {
if (search[i] != buffer[(offset + i) % len]) {
return false;
}
}
return true;
}
public static void skipBytes(InputStream stream, byte[] search) throws IOException {
final int[] buffer = new int[search.length];
for (int i = 0; i < search.length; ++i) {
buffer[i] = stream.read();
}
int offset = 0;
while (true) {
if (matches(buffer, offset, search)) {
break;
}
buffer[offset] = stream.read();
offset = (offset + 1) % buffer.length;
}
}
Here is my implementation:
static boolean containsKeywordInStream( Reader ir, String keyword, int bufferSize ) throws IOException{
SlidingContainsBuffer sb = new SlidingContainsBuffer( keyword );
char[] buffer = new char[ bufferSize ];
int read;
while( ( read = ir.read( buffer ) ) != -1 ){
if( sb.checkIfContains( buffer, read ) ){
return true;
}
}
return false;
}
SlidingContainsBuffer class:
class SlidingContainsBuffer{
private final char[] keyword;
private int keywordIndexToCheck = 0;
private boolean keywordFound = false;
SlidingContainsBuffer( String keyword ){
this.keyword = keyword.toCharArray();
}
boolean checkIfContains( char[] buffer, int read ){
for( int i = 0; i < read; i++ ){
if( keywordFound == false ){
if( keyword[ keywordIndexToCheck ] == buffer[ i ] ){
keywordIndexToCheck++;
if( keywordIndexToCheck == keyword.length ){
keywordFound = true;
}
} else {
keywordIndexToCheck = 0;
}
} else {
break;
}
}
return keywordFound;
}
}
This answer fully qualifies the task:
The implementation is able to find the searched keyword even if it was split between buffers
Minimum memory usage defined by the buffer size
Number of reads will be minimized by using bigger buffer

Categories

Resources