I've got a text file that contains 1 000 002 numbers in the following format:
123 456
1 2 3 4 5 6 .... 999999 100000
Now I need to read that data, assign the first two numbers to int variables, and store the remaining 1 000 000 numbers in an int[] array.
It's not a hard task, but it's horribly slow.
My first attempt was java.util.Scanner:
Scanner stdin = new Scanner(new File("./path"));
int n = stdin.nextInt();
int t = stdin.nextInt();
int array[] = new int[n];
for (int i = 0; i < n; i++) {
array[i] = stdin.nextInt();
}
It works as expected, but it takes about 7500 ms to execute, and I need to fetch that data within several hundred milliseconds.
Then I tried java.io.BufferedReader:
Using BufferedReader.readLine() and String.split() I got the same result in about 1700 ms, but that's still too slow.
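Roughly, that attempt looked like this (a sketch, assuming the two header values sit on the first line and the million numbers on the second, as in the sample):
BufferedReader br = new BufferedReader(new FileReader("./path"));
String[] header = br.readLine().split(" ");
int n = Integer.parseInt(header[0]);
int t = Integer.parseInt(header[1]);
String[] tokens = br.readLine().split(" ");
int[] array = new int[n];
for (int i = 0; i < n; i++) {
    array[i] = Integer.parseInt(tokens[i]);
}
br.close();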
How can I read this amount of data in less than 1 second? The final result should be equal to:
int n = 123;
int t = 456;
int array[] = { 1, 2, 3, 4, ..., 999999, 100000 };
According to trashgod's answer:
The StreamTokenizer solution is fast (it takes about 1400 ms), but it's still too slow:
StreamTokenizer st = new StreamTokenizer(new FileReader("./test_grz"));
st.nextToken();
int n = (int) st.nval;
st.nextToken();
int t = (int) st.nval;
int array[] = new int[n];
for (int i = 0; st.nextToken() != StreamTokenizer.TT_EOF; i++) {
array[i] = (int) st.nval;
}
PS. There is no need for validation. I'm 100% sure that the data in the ./test_grz file is correct.
Thanks for all the answers, but I've already found a method that meets my criteria:
BufferedInputStream bis = new BufferedInputStream(new FileInputStream("./path"));
int n = readInt(bis);
int t = readInt(bis);
int array[] = new int[n];
for (int i = 0; i < n; i++) {
array[i] = readInt(bis);
}
private static int readInt(InputStream in) throws IOException {
    // Accumulates digits into an int, skipping any non-digit separators;
    // the first non-digit after at least one digit ends the number.
    int ret = 0;
    boolean dig = false;
    for (int c = 0; (c = in.read()) != -1; ) {
        if (c >= '0' && c <= '9') {
            dig = true;
            ret = ret * 10 + c - '0';
        } else if (dig) break;
    }
    return ret;
}
It takes only about 300 ms to read 1 million integers!
StreamTokenizer may be faster, as suggested here.
You can reduce the time for the StreamTokenizer result by using a BufferedReader:
Reader r = null;
try {
r = new BufferedReader(new FileReader(file));
final StreamTokenizer st = new StreamTokenizer(r);
...
} finally {
if (r != null)
r.close();
}
Also, don't forget to close your files, as I've shown here.
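On Java 7 and later, try-with-resources gives the same closing guarantee more concisely; a sketch of the equivalent:
try (Reader r = new BufferedReader(new FileReader(file))) {
    final StreamTokenizer st = new StreamTokenizer(r);
    ...
}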
You can also shave some more time off by using a custom tokenizer just for your purposes:
import java.io.*;

public class CustomTokenizer {
private final Reader r;
public CustomTokenizer(final Reader r) {
this.r = r;
}
public int nextInt() throws IOException {
int i = r.read();
if (i == -1)
throw new EOFException();
char c = (char) i;
// Skip any whitespace
while (c == ' ' || c == '\n' || c == '\r') {
i = r.read();
if (i == -1)
throw new EOFException();
c = (char) i;
}
int result = (c - '0');
while ((i = r.read()) >= 0) {
c = (char) i;
if (c == ' ' || c == '\n' || c == '\r')
break;
result = result * 10 + (c - '0');
}
return result;
}
}
Remember to use a BufferedReader for this. This custom tokenizer assumes the input data is always completely valid and contains only spaces, new lines, and digits.
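For example, hypothetical usage for the original task (the file name and the loop are taken from the question):
try (Reader r = new BufferedReader(new FileReader("./path"))) {
    CustomTokenizer tokenizer = new CustomTokenizer(r);
    int n = tokenizer.nextInt();
    int t = tokenizer.nextInt();
    int[] array = new int[n];
    for (int i = 0; i < n; i++) {
        array[i] = tokenizer.nextInt();
    }
}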
If you read these results often and they do not change much, you should probably cache the array and keep track of the file's last-modified time. Then, if the file has not changed, just use the cached copy of the array; this will speed things up significantly. For example:
public class ArrayRetriever {
private File inputFile;
private long lastModified;
private int[] lastResult;
public ArrayRetriever(File file) {
this.inputFile = file;
}
public int[] getResult() {
if (lastResult != null && inputFile.lastModified() == lastModified)
return lastResult;
lastModified = inputFile.lastModified();
// do logic to actually read the file here
lastResult = array; // the array variable from your examples
return lastResult;
}
}
How much memory do you have in the computer? You could be running into GC issues.
The best thing to do is to process the data one line at a time if possible. Don't load it all into an array: load what you need, process it, write it out, and continue. This reduces your memory footprint while still doing the same amount of file IO.
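A minimal sketch of that streaming pattern (process here is a hypothetical per-line step):
try (BufferedReader in = new BufferedReader(new FileReader("./path"));
     PrintWriter out = new PrintWriter(new BufferedWriter(new FileWriter("./out")))) {
    String line;
    while ((line = in.readLine()) != null) {
        out.println(process(line)); // hypothetical per-line processing
    }
}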
If it's possible to reformat the input so that each integer is on a separate line (instead of one long line with a million integers), you should see much improved performance using Integer.parseInt(BufferedReader.readLine()), thanks to buffering by line and not having to split one long line into a separate array of Strings.
Edit: I tested this and managed to read the output produced by seq 1 1000000 into an array of int in well under half a second, but of course this depends on the machine.
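For reference, a minimal sketch of that layout (assuming the file is reformatted so the two header values and then each integer sit on their own lines):
try (BufferedReader br = new BufferedReader(new FileReader("./path"))) {
    int n = Integer.parseInt(br.readLine());
    int t = Integer.parseInt(br.readLine());
    int[] array = new int[n];
    for (int i = 0; i < n; i++) {
        array[i] = Integer.parseInt(br.readLine());
    }
}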
I would extend FilterReader and parse the string as it is read in the read() method. Have a getNextNumber method return the numbers. Code left as an exercise for the reader.
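A rough sketch of that idea, since the original was left as an exercise (the class shape and getNextNumber are my own assumptions):
import java.io.*;

public class NumberReader extends FilterReader {
    public NumberReader(Reader in) {
        super(in);
    }

    // Returns the next whitespace-delimited non-negative integer.
    public int getNextNumber() throws IOException {
        int c;
        do {
            c = read(); // skip everything up to the first digit
            if (c == -1) throw new EOFException();
        } while (c < '0' || c > '9');
        int value = 0;
        do {
            value = value * 10 + (c - '0');
            c = read();
        } while (c >= '0' && c <= '9');
        return value;
    }
}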
Using a StreamTokenizer on a BufferedReader will already give you quite good performance; you shouldn't need to write your own readInt() function.
Here is the code I used to do some local performance testing:
import java.io.*;
import java.util.Scanner;

/**
* Created by zhenhua.xu on 11/27/16.
*/
public class MyReader {
private static final String FILE_NAME = "./1m_numbers.txt";
private static final int n = 1000000;
public static void main(String[] args) {
try {
readByScanner();
readByStreamTokenizer();
readByStreamTokenizerOnBufferedReader();
readByBufferedInputStream();
} catch (Exception e) {
e.printStackTrace();
}
}
public static void readByScanner() throws Exception {
long startTime = System.currentTimeMillis();
Scanner stdin = new Scanner(new File(FILE_NAME));
int array[] = new int[n];
for (int i = 0; i < n; i++) {
array[i] = stdin.nextInt();
}
long endTime = System.currentTimeMillis();
System.out.println(String.format("Total time by Scanner: %d ms", endTime - startTime));
}
public static void readByStreamTokenizer() throws Exception {
long startTime = System.currentTimeMillis();
StreamTokenizer st = new StreamTokenizer(new FileReader(FILE_NAME));
int array[] = new int[n];
for (int i = 0; st.nextToken() != StreamTokenizer.TT_EOF; i++) {
array[i] = (int) st.nval;
}
long endTime = System.currentTimeMillis();
System.out.println(String.format("Total time by StreamTokenizer: %d ms", endTime - startTime));
}
public static void readByStreamTokenizerOnBufferedReader() throws Exception {
long startTime = System.currentTimeMillis();
StreamTokenizer st = new StreamTokenizer(new BufferedReader(new FileReader(FILE_NAME)));
int array[] = new int[n];
for (int i = 0; st.nextToken() != StreamTokenizer.TT_EOF; i++) {
array[i] = (int) st.nval;
}
long endTime = System.currentTimeMillis();
System.out.println(String.format("Total time by StreamTokenizer with BufferedReader: %d ms", endTime - startTime));
}
public static void readByBufferedInputStream() throws Exception {
long startTime = System.currentTimeMillis();
BufferedInputStream bis = new BufferedInputStream(new FileInputStream(FILE_NAME));
int array[] = new int[n];
for (int i = 0; i < n; i++) {
array[i] = readInt(bis);
}
long endTime = System.currentTimeMillis();
System.out.println(String.format("Total time with BufferedInputStream: %d ms", endTime - startTime));
}
private static int readInt(InputStream in) throws IOException {
int ret = 0;
boolean dig = false;
for (int c = 0; (c = in.read()) != -1; ) {
if (c >= '0' && c <= '9') {
dig = true;
ret = ret * 10 + c - '0';
} else if (dig) break;
}
return ret;
}
}
Results I got:
Total time by Scanner: 789 ms
Total time by StreamTokenizer: 226 ms
Total time by StreamTokenizer with BufferedReader: 80 ms
Total time by BufferedInputStream: 95 ms
So I'm making a program that removes duplicate letters in a string. The last step is updating the old string to the new string and looping through the new string. I believe everything works besides the looping-through-the-new-string part. Any ideas what might be causing it not to work? It works as intended for one pass through, but after that it won't step through the new loop.
import java.util.Scanner;

public class homework20_5 {
public static void main(String[] arg) {
Scanner scanner = new Scanner(System.in);
String kb = scanner.nextLine();
int i;
for (i = 0; i < kb.length(); i++) {
char temp = kb.charAt(i);
if(temp == kb.charAt(i+1)) {
kb = kb.replace(""+temp, "");
i = kb.length() + i;
}
}
System.out.println(kb);
}
}
Instead of using complex algorithms and loops like this, you can just use a LinkedHashSet, which works like a list but won't allow any duplicate elements (and keeps insertion order):
private static String removeDuplicateChars(String str) {
HashSet<Character> xChars = new LinkedHashSet<>();
for(char c: str.toCharArray()) {
xChars.add(c);
}
StringBuilder sb = new StringBuilder();
for (char c: xChars) {
sb.append(c);
}
return sb.toString();
}
So you actually want to remove all occurrences that appear more than once entirely and not just the duplicate appearances (while preserving one instance)?
"Yea that’s exactly right "
In that case your idea won't cut it, because your duplicate-letter detection can only detect contiguous runs of duplicates. A very simple way would be to use two sets in order to identify unique letters in one pass.
public class RemoveLettersSeenMultipleTimes {
public static void main(String []args){
String input = "abcabdgag";
Set<Character> lettersSeenOnce = lettersSeenOnceIn(input);
StringBuilder output = new StringBuilder();
for (Character c : lettersSeenOnce) {
output.append(c);
}
System.out.println(output);
}
private static Set<Character> lettersSeenOnceIn(String input) {
Set<Character> seenOnce = new LinkedHashSet<>();
Set<Character> seenMany = new HashSet<>();
for (Character c : input.toCharArray()) {
if (seenOnce.contains(c)) {
seenMany.add(c);
seenOnce.remove(c);
continue;
}
if (!seenMany.contains(c)) {
seenOnce.add(c);
}
}
return seenOnce;
}
}
There are a few problems here:
Problem 1
for (i = 0; i < kb.length(); i++) {
should be
for (i = 0; i < kb.length() - 1; i++) {
Because this
if (temp == kb.charAt(i+1))
will throw a StringIndexOutOfBoundsException otherwise.
Problem 2
Delete this line:
i = kb.length() + i;
I don't understand what the intention is there, but nevertheless it must be deleted.
Problem 3
Rather than lots of code, there's a one-line solution:
String deduped = kb.replaceAll("[" + kb.replaceAll("(.)(?=.*\\1)|.", "$1") + "]", "");
This works by:
finding all dupe chars via kb.replaceAll("(.)(?=.*\\1)|.", "$1"), which in turn works by consuming every character, either capturing it as group 1 if it has a later dupe or just consuming it if it's a non-dupe
building a regex character class from the dupes, which is used to delete them all (replace with a blank)
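For example, a quick trace (assuming plain letters only; a dupe set containing regex metacharacters would need quoting):
String kb = "AAABBC";
String dupes = kb.replaceAll("(.)(?=.*\\1)|.", "$1"); // "AAB": every char that has a later duplicate
String deduped = kb.replaceAll("[" + dupes + "]", ""); // "C": all repeated letters removed entirely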
Say you feed the program with the input "AAABBC", then the expected output should be "ABC".
Now the for-loop is supposed to step i from 0 up to 5.
After the 1st iteration:
kb becomes BBC (String.replace removes every occurrence of 'A'), and i becomes 3 + 0 = 3, which is then incremented to 4.
Now the for-loop condition i < kb.length() equates to 4 < 3, returning false. Hence the for-loop ends after just one iteration.
So the problematic line of code is i = kb.length() + i; and also the loop condition keeps changing as the size of kb changes.
I would suggest using a while loop like the following example if you don't worry too much about the efficiency.
public static void main(String[] arg) {
String kb = "XYYYXAC";
int i = 0;
while (i < kb.length()) {
char temp = kb.charAt(i);
for (int j = i + 1; j < kb.length(); j++) {
char dup = kb.charAt(j);
if (temp == dup) {
kb = removeCharByIndex(kb, j);
j--;
}
}
i++;
}
System.out.println(kb);
}
private static String removeCharByIndex(String str, int index) {
return new StringBuilder(str).deleteCharAt(index).toString();
}
Output: XYAC
EDIT: I misunderstood your requirements. So looking at the above comments, you want all the duplicates and the target character removed. So the above code can be changed like this.
public static void main(String[] arg) {
String kb = "XYYYXAC";
int i = 0;
while (i < kb.length()) {
char temp = kb.charAt(i);
boolean hasDup = false;
for (int j = i + 1; j < kb.length(); j++) {
if (temp == kb.charAt(j)) {
hasDup = true;
kb = removeCharByIndex(kb, j);
j--;
}
}
if (hasDup) {
kb = removeCharByIndex(kb, i);
i--;
}
i++;
}
System.out.println(kb);
}
private static String removeCharByIndex(String str, int index) {
return new StringBuilder(str).deleteCharAt(index).toString();
}
Output: AC
Although this is not the best and definitely not an efficient solution, I think you can get the idea of iterating through the input string character by character and removing a character if it has duplicates.
The following answer concerns only the transformation of XYYYXACX to ACX. If we wanted AC instead, it would be a whole different answer; the other answers already cover that, and I also invite you to look at String's contains method.
We should usually avoid modifying the things we iterate over. A temporary variable is one kind of solution, and using it requires a change of mindset: instead of erasing the undesired letters, we save the ones we want.
To identify a desired character, we need to test whether all surrounding letters are different from the tested one. That is the opposite of what you did with if (temp == kb.charAt(i+1)) {, namely if (temp != kb.charAt(i+1)) {. And since the tested string will no longer change, we need to test the previous letter too, as in if (temp != kb.charAt(i-1) && temp != kb.charAt(i+1)) {.
As previously said, once we have identified the letter, we keep its value with a temporary variable. That means replacing kb = kb.replace(""+temp, ""); with buffer = buffer + temp;, where buffer is our temporary variable initialized to an empty string (i.e. String buffer = "";). At the end, we overwrite our base value with the temporary one.
At this step, we will have:
public static void main(String[] arg) {
Scanner scanner = new Scanner(System.in);
String kb = scanner.nextLine();
String buffer = "";
int i;
for (i = 1; i < kb.length(); i++) {
char temp = kb.charAt(i);
if(temp != kb.charAt(i-1) && temp != kb.charAt(i+1)) {
buffer = buffer + temp;
}
}
kb = buffer;
System.out.println(kb);
}
That will sadly not work: it tries to access invalid indexes of our string. We need special handling for the first and the last letter, because each is adjacent to only one other letter. For these letters we have only one comparison, which we can do inside or outside the loop. For clarity, we will do it outside.
For the first letter it will look like if (kb.charAt(0) != kb.charAt(1)) {, and for the last, if (kb.charAt(kb.length() - 1) != kb.charAt(kb.length() - 2)) {. The body of the condition remains the same as the one in the loop.
Once done, we reduce the scope of our loop to exclude these characters, with for (i = 1; i < (kb.length() - 1); i++) {.
Now we will have something working, but only for one iteration:
public static void main(String[] arg) {
Scanner scanner = new Scanner(System.in);
String kb = scanner.nextLine();
String buffer = "";
int i;
if (kb.charAt(0) != kb.charAt(1)) {
buffer = buffer + kb.charAt(0);
}
for (i = 1; i < (kb.length() - 1); i++) {
char temp = kb.charAt(i);
if(temp != kb.charAt(i-1) && temp != kb.charAt(i+1)) {
buffer = buffer + temp;
}
}
if (kb.charAt(kb.length() - 1) != kb.charAt(kb.length() - 2)) {
buffer = buffer + kb.charAt(kb.length() - 1);
}
kb = buffer;
System.out.println(kb);
}
XYYYXACX will become XXACX.
That said, the index problem can occur again if the string has only one letter. In that case all of this work would be pointless anyway, since a single letter obviously can't have a duplicate. So we should wrap the whole thing in a check that we have at least two letters:
public static void main(String[] arg) {
Scanner scanner = new Scanner(System.in);
String kb = scanner.nextLine();
if (kb.length() >= 2) {
String buffer = "";
int i;
if (kb.charAt(0) != kb.charAt(1)) {
buffer = buffer + kb.charAt(0);
}
for (i = 1; i < (kb.length() - 1); i++) {
char temp = kb.charAt(i);
if (temp != kb.charAt(i - 1) && temp != kb.charAt(i + 1)) {
buffer = buffer + temp;
}
}
if (kb.charAt(kb.length() - 1) != kb.charAt(kb.length() - 2)) {
buffer = buffer + kb.charAt(kb.length() - 1);
}
kb = buffer;
}
System.out.println(kb);
}
The last thing to do is to repeat this treatment until no more undesired letters remain. The do { ... } while ( ... ) loop seems perfect for this. For the loop condition we can compare the lengths of the strings: when the length from the previous iteration equals the length of the temporary variable, we know we are finished.
We need to perform this comparison before assigning the temporary value to the base one; otherwise the two would always be equal.
In the end, the following thing should be a potential solution:
public static void main(String[] arg) {
Scanner scanner = new Scanner(System.in);
String kb = scanner.nextLine();
boolean modified;
do {
modified = false;
if (kb.length() >= 2) {
String buffer = "";
int i;
if (kb.charAt(0) != kb.charAt(1)) {
buffer = buffer + kb.charAt(0);
}
for (i = 1; i < (kb.length() - 1); i++) {
char temp = kb.charAt(i);
if (temp != kb.charAt(i - 1) && temp != kb.charAt(i + 1)) {
buffer = buffer + temp;
}
}
if (kb.charAt(kb.length() - 1) != kb.charAt(kb.length() - 2)) {
buffer = buffer + kb.charAt(kb.length() - 1);
}
modified = (kb.length() != buffer.length());
kb = buffer;
}
} while (modified);
System.out.println(kb);
}
Take note that this code is deliberately clumsy for the sole purpose of the explanation. It should be refactored, and it can be improved a lot for the sake of brevity and, why not, performance.
I'm currently trying to determine how to use BufferedReader to read from a console program. I know the correct syntax to read from the console, and I know the program works for smaller text. However, any text greater than 5118 characters is truncated; the console itself will also not print any text greater than 5118 characters. The goal is to create a Java program that reads from the console regardless of the size of the data.
The following is the code I have created.
package countAnagrams;
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.util.*;
public class TestClass {
public static int check_For_Missing_Characters(String a1, String b1){
int first_String_Length = a1.length();
int missing_Characters = 0;
for( int y = 0; y < first_String_Length; y++ ){
final char character_To_Check_String = a1.charAt(y);
if (b1.chars().filter(ch -> ch == character_To_Check_String).count() == 0) {
missing_Characters+=1;
}
}
return missing_Characters;
}
public static int check_For_Duplicate_Characters(String a1, String b1) {
int first_String_Length = a1.length();
int duplicat_Characters = 0;
String found_Characters = "";
for( int y = 0; y < first_String_Length; y++ ){
final char current_Character_To_Check = a1.charAt(y);
long first_String_Count = b1.chars().filter(ch -> ch == current_Character_To_Check).count();
long second_String_Count = a1.chars().filter(ch -> ch == current_Character_To_Check).count();
long found_String_Count = found_Characters.chars().filter(ch -> ch == current_Character_To_Check).count();
if ( first_String_Count > 0 && second_String_Count > 0 && found_String_Count == 0 ){
duplicat_Characters+=Math.abs(first_String_Count - second_String_Count);
found_Characters = found_Characters + current_Character_To_Check;
}
}
return duplicat_Characters;
}
public static void main(String[] args) throws Exception {
// TODO Auto-generated method stub
BufferedReader br = new BufferedReader(new InputStreamReader(System.in));
int test_Case_Count = Integer.parseInt(br.readLine()); // Reading input from STDIN
for(int x = 0; x < test_Case_Count; x++ ){
int total_Count_Of_Diff_Chars = 0;
StringBuilder first_StringBuilder = new StringBuilder();
int first_String = '0';
while(( first_String = br.read()) != -1 ) {
first_StringBuilder.append((char) first_String );
}
StringBuilder second_StringBuilder = new StringBuilder();
String second_String = "";
while((( second_String = br.readLine()) != null )){
second_StringBuilder.append(second_String);
}
total_Count_Of_Diff_Chars = total_Count_Of_Diff_Chars + check_For_Missing_Characters(first_StringBuilder.toString(), second_StringBuilder.toString());
total_Count_Of_Diff_Chars = total_Count_Of_Diff_Chars + check_For_Missing_Characters(second_StringBuilder.toString(), first_StringBuilder.toString());
total_Count_Of_Diff_Chars = total_Count_Of_Diff_Chars + check_For_Duplicate_Characters(second_StringBuilder.toString(), first_StringBuilder.toString());
System.out.println(total_Count_Of_Diff_Chars);
}
br.close();
}
}
The above code works for input shorter than 5118 characters. I would like to understand what is needed to make it read beyond that limit. I'm not sure whether the page I'm using is causing the limit or there is something I'm missing.
I am trying to write to a file with redirection from the command line.
My program is very slow when reading a 25 MB file, and 90% of the execution time is spent in System.out.println. I tried some methods other than System.out.print but couldn't fix it.
Which method should I use to print a big ArrayList (with redirection)?
I would appreciate your help and an example. Thanks.
Here is my code:
import java.io.*;
import java.util.*;

public class Ask0 {
public static void main(String args[]) throws IOException {
int i = 0, token0, token1;
String[] tokens;
List<String> inputList = new ArrayList<>();
Map<Integer, List<Integer>> map = new HashMap<>();
BufferedReader br = new BufferedReader(new InputStreamReader(System.in));
String input;
while ((input = br.readLine()) != null) {
tokens = input.split("\\|");
inputList.add(tokens[0] + "|" + tokens[1]);
token0 = Integer.parseInt(tokens[0]);
token1 = Integer.parseInt(tokens[1]);
List<Integer> l = map.get(token0);
if (l == null) {
l = new ArrayList<>();
map.put(token0, l);
}
if (l.contains(token1) == false) {
l.add(token1);
}
i++;
}
i = 0;
for (int j = inputList.size(); j > 0; j--) {
tokens = inputList.get(i).split("\\|");
token0 = Integer.parseInt(tokens[0]);
token1 = Integer.parseInt(tokens[1]);
List l = map.get(token0);
System.out.println(tokens[0] + "|" + tokens[1] + "["
+ (l.indexOf(token1) + 1) + "," + l.size() + "]");
i++;
}
}
}
Input (one pair per line):
3|78
4|7765
3|82
2|8
4|14
3|78
2|8
4|12
Desired result:
3|78[1,2] 4|7765[1,3] 3|82[2,2] 2|8[1,1] 4|14[2,3] 3|78[1,2] 2|8[1,1] 4|12[3,3]
For speed, the below code:
Uses a StringBuilder for fast concatenation into a resulting String and fast output since only one massive String is printed at the end, saving unnecessary buffer flushes.
Doesn't create a bunch of Strings when parsing the input, just a small byte[] and Integers in an ArrayList.
Manually uses a 64kiB buffer for reading.
Doesn't rejoin the tokens with "|" in the middle only to split them again later.
Uses a HashMap<Integer, HashMap<Integer, Integer>> instead of a HashMap<Integer, ArrayList<Integer>> to save time on element lookups in the list (turns the algorithm from O(n²) time to O(n) time).
Some speedups that might not work as you want:
Doesn't waste time properly handling Unicode.
Doesn't waste time properly handling negative or overflowed numbers.
Doesn't care what the separator characters are (you could input "1,2,3,4,5,6" instead and it would still work just like "1|2\n3|4\n5|6\n").
You can see that it gives the correct results for your test input here (except that it separates the outputs by newlines like in your code).
private static final int BUFFER_SIZE = 65536;
private static enum InputState { START, MIDDLE }
public static void main(final String[] args) throws IOException {
// Input the numbers
final byte[] inputBuffer = new byte[BUFFER_SIZE];
final List<Integer> inputs = new ArrayList<>();
int inputValue = 0;
InputState inputState = InputState.START;
while (true) {
final int bytesRead = System.in.read(inputBuffer, 0, BUFFER_SIZE);
if (bytesRead == -1) {
if (inputState == InputState.MIDDLE) {
inputs.add(inputValue);
}
break;
}
for (int i = 0; i < bytesRead; i++) {
byte ch = inputBuffer[i];
if (ch < 48 || ch > 57) {
if (inputState == InputState.MIDDLE) {
inputs.add(inputValue);
inputState = InputState.START;
}
}
else {
if (inputState == InputState.START) {
inputValue = ch - 48;
inputState = InputState.MIDDLE;
}
else {
inputValue = 10*inputValue + ch - 48;
}
}
}
}
System.in.close();
// Put the numbers into a map
final Map<Integer, Map<Integer, Integer>> map = new HashMap<>();
for (int i = 0; i < inputs.size();) {
final Integer left = inputs.get(i++);
final Integer right = inputs.get(i++);
final Map<Integer, Integer> rights;
if (map.containsKey(left)) {
rights = map.get(left);
}
else {
rights = new HashMap<>();
map.put(left, rights);
}
rights.putIfAbsent(right, rights.size() + 1);
}
// Prepare StringBuilder with results
final StringBuilder results = new StringBuilder();
for (int i = 0; i < inputs.size();) {
final Integer left = inputs.get(i++);
final Integer right = inputs.get(i++);
final Map<Integer, Integer> rights = map.get(left);
results.append(left).append('|').append(right);
results.append('[').append(rights.get(right)).append(',');
results.append(rights.size()).append(']').append('\n');
}
System.out.print(results);
}
Alternatively, you can manually use a 64 kiB byte[] output buffer with System.out.write(outputBuffer, 0, bytesToWrite); System.out.flush(); if you want to save memory, though that's a lot more work.
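A sketch of that manual output buffering (the names here are hypothetical):
private static final byte[] outputBuffer = new byte[65536]; // 64 KiB
private static int bytesToWrite = 0;

private static void writeByte(byte b) {
    if (bytesToWrite == outputBuffer.length) {
        flushOutput();
    }
    outputBuffer[bytesToWrite++] = b;
}

private static void flushOutput() {
    // PrintStream never throws a checked IOException, so no throws clause is needed
    System.out.write(outputBuffer, 0, bytesToWrite);
    System.out.flush();
    bytesToWrite = 0;
}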
Also, if you know the minimum and maximum values that you'll see, you can use int[] or int[][] arrays instead of Map<Integer, Integer> or Map<Integer, Map<Integer, Integer>>, though that's somewhat more involved as well. It would be very fast.
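A sketch of that array variant, assuming (hypothetically) that every value lies in [0, MAX_VAL):
private static final int MAX_VAL = 10000; // hypothetical upper bound on all input values
private static final int[][] firstSeen = new int[MAX_VAL][]; // firstSeen[left][right] = 1-based first-seen index, 0 = unseen
private static final int[] distinctRights = new int[MAX_VAL]; // distinct right values recorded per left value

private static void record(int left, int right) {
    if (firstSeen[left] == null) {
        firstSeen[left] = new int[MAX_VAL]; // lazily allocate one row per distinct left value
    }
    if (firstSeen[left][right] == 0) {
        firstSeen[left][right] = ++distinctRights[left];
    }
}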
I am trying to iterate through a txt file and count all characters, including \n newline characters and anything else. I can only read through the file once. I am also recording letter frequency, the number of lines, the number of words, etc. I can't quite figure out where to count the total number of characters (see the code below). I know I need to do it before I use the StringTokenizer, which I have to use, by the way. I have tried multiple ways but can't quite figure it out. Any help would be appreciated; thanks in advance. Note: my variable numChars only counts alphabetic characters (a, b, c, etc.). Edit: posting the class variables to make more sense of the code.
private final int NUMCHARS = 26;
private int[] characters = new int[NUMCHARS];
private final int WORDLENGTH = 23;
private int[] wordLengthCount = new int[WORDLENGTH];
private int numChars = 0;
private int numWords = 0;
private int numLines = 0;
private int numTotalChars = 0;
DecimalFormat df = new DecimalFormat("#.##");
public void countLetters(Scanner scan) {
char current;
//int word;
String token1;
while (scan.hasNext()) {
String line = scan.nextLine().toLowerCase();
numLines++;
StringTokenizer token = new StringTokenizer(line,
" , .;:'\"&!?-_\n\t12345678910[]{}()##$%^*/+-");
for (int w = 0; w < token.countTokens(); w++) {
numWords++;
}
while (token.hasMoreTokens()) {
token1 = token.nextToken();
if (token1.length() >= wordLengthCount.length) {
wordLengthCount[wordLengthCount.length - 1]++;
} else {
wordLengthCount[token1.length() - 1]++;
}
}
for (int ch = 0; ch < line.length(); ch++) {
current = line.charAt(ch);
if (current >= 'a' && current <= 'z') {
characters[current - 'a']++;
numChars++;
}
}
}
}
Use string.toCharArray(), something like:
while (scan.hasNext()) {
String line = scan.nextLine();
numberchars += line.toCharArray().length;
// ...
}
An alternative would be to use the string's length directly:
while (scan.hasNext()) {
String line = scan.nextLine();
numberchars += line.length();
// ...
}
Using a BufferedReader you can do it like this:
BufferedReader reader = new BufferedReader(
new InputStreamReader(
new FileInputStream(file), charsetName));
int charCount = 0;
while (reader.read() > -1) {
charCount++;
}
I would read the file char by char with a BufferedReader and use a Guava Multiset to count the chars:
BufferedReader rdr = Files.newBufferedReader(path, charSet);
HashMultiset<Character> ms = HashMultiset.create();
for (int c; (c = rdr.read()) != -1; ) {
    ms.add((char) c);
}
for (Multiset.Entry<Character> e : ms.entrySet()) {
    char c = e.getElement();
    int n = e.getCount();
}
I have a huge file with millions of columns, split by spaces, but only a limited number of rows:
examples.txt:
1 2 3 4 5 ........
3 1 2 3 5 .........
l 6 3 2 2 ........
Now, I just want to read in the second column:
2
1
6
How do I do that in Java with high performance?
Thanks
Update: the file is usually around 1.4 GB, containing hundreds of rows.
If your file is not statically structured, your only option is the naive one: read through the file byte sequence by byte sequence looking for newlines and grab the second column after each one. Use FileReader.
If your file were statically structured, you could calculate where in the file the second column would be for a given line and seek() to it directly.
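For the statically structured case, a sketch (the row length and column offset here are purely hypothetical):
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;

public class FixedWidthColumn {
    static final int ROW_LEN = 80;   // hypothetical: every row is exactly 80 bytes, newline included
    static final int COL_OFF = 2;    // hypothetical: the second column starts 2 bytes into each row
    static final int COL_WIDTH = 1;  // hypothetical: the column is 1 byte wide

    public static void main(String[] args) throws IOException {
        try (RandomAccessFile raf = new RandomAccessFile("examples.txt", "r")) {
            byte[] col = new byte[COL_WIDTH];
            for (long pos = COL_OFF; pos + COL_WIDTH <= raf.length(); pos += ROW_LEN) {
                raf.seek(pos); // jump straight to the column, skipping the rest of the row
                raf.readFully(col);
                System.out.println(new String(col, StandardCharsets.US_ASCII).trim());
            }
        }
    }
}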
I have to concur with @gene: try a BufferedReader and readLine() first; it's simple and easy to code. Just be careful not to alias the backing array between the result of readLine() and any substring operation you use. String.substring() is a particularly common culprit, and I have had multi-MB char arrays locked in memory because a 3-char substring was referencing them.
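On the JVMs of that era (before 7u6), substring() shared the parent String's backing array; a defensive copy breaks the alias (start and end are hypothetical offsets):
String token = new String(line.substring(start, end)); // copies only the characters you keep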
Assuming ASCII, my preference when doing this is to drop down to the byte level. Use mmap to view the file as a ByteBuffer and then do a linear scan for 0x20 and 0x0A (assuming unix-style line separators). Then convert the relevant bytes to a String. If you are using an 8-bit charset it is extremely difficult to be faster than this.
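A compressed sketch of that scan (assuming ASCII, single-space separators, unix '\n' line ends, and a file small enough for one mapping, which 1.4 GB is):
import java.io.RandomAccessFile;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;

public class SecondColumnMmap {
    public static void main(String[] args) throws Exception {
        try (RandomAccessFile raf = new RandomAccessFile("examples.txt", "r");
             FileChannel ch = raf.getChannel()) {
            MappedByteBuffer buf = ch.map(FileChannel.MapMode.READ_ONLY, 0, ch.size());
            int col = 0, start = -1; // current column index and start offset of the current token
            for (int i = 0; i < buf.limit(); i++) {
                byte b = buf.get(i);
                if (b == 0x20 || b == 0x0A) {      // space or newline ends the current token
                    if (col == 1 && start >= 0) {  // the token that just ended is the second column
                        byte[] bytes = new byte[i - start];
                        for (int j = start; j < i; j++) {
                            bytes[j - start] = buf.get(j);
                        }
                        System.out.println(new String(bytes, StandardCharsets.US_ASCII));
                    }
                    col = (b == 0x0A) ? 0 : col + 1;
                    start = -1;
                } else if (start < 0) {
                    start = i;
                }
            }
        }
    }
}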
If you are using Unicode, the problem is sufficiently more complicated that I strongly urge you to use BufferedReader unless that performance really is unacceptable. If readLine() doesn't work, then consider just looping on a call to read().
Regardless, you should always specify the Charset when initialising a String from an external byte stream; it documents your charset assumption explicitly. So I recommend a minor modification to gene's suggestion, one of:
int i = Integer.parseInt(new String(buffer, start, length, "US-ASCII"));
int i = Integer.parseInt(new String(buffer, start, length, "ISO-8859-1"));
int i = Integer.parseInt(new String(buffer, start, length, "UTF-8"));
as appropriate.
Here is a little state machine that uses a FileInputStream as its input and handles its own buffering. There is no locale conversion.
On my 7-year old 1.4 GHz laptop with 1/2 Gb of memory it takes 48 seconds to go through 1.28 billion bytes of data. Buffers bigger than 4Kb seem to run slower.
On a new 1-year old MacBook with 4Gb it runs in 14 seconds. After the file is in cache it runs in 2.7 seconds. Again there is no difference with buffers bigger than 4Kb. This is the same 1.2 billion byte data file.
I expect memory-mapped IO would do better, but this is probably more portable.
It will fetch any column you tell it to.
import java.io.*;
import java.util.Random;
public class Test {
public static class ColumnReader {
private final InputStream is;
private final int colIndex;
private final byte [] buf;
private int nBytes = 0;
private int colVal = -1;
private int bufPos = 0;
public ColumnReader(InputStream is, int colIndex, int bufSize) {
this.is = is;
this.colIndex = colIndex;
this.buf = new byte [bufSize];
}
/**
* States for a tiny DFA to recognize columns.
*/
private static final int START = 0;
private static final int IN_ANY_COL = 1;
private static final int IN_THE_COL = 2;
private static final int WASTE_REST = 3;
/**
* Return value of colIndex'th column or -1 if none is found.
*
* @return value of column or -1 if none found.
*/
public int getNext() {
colVal = -1;
bufPos = parseLine(bufPos);
return colVal;
}
/**
* If getNext() returns -1, this can be used to check if
* we're at the end of file.
*
* Otherwise the column did not exist.
*
* @return end of file indication
*/
public boolean atEoF() {
return nBytes == -1;
}
/**
* Parse a line.
* The buffer is automatically refilled if p reaches the end.
* This uses a standard DFA pattern.
*
* @param p position of line start in buffer
* @return position of next unread character in buffer
*/
private int parseLine(int p) {
colVal = -1;
int iCol = -1;
int state = START;
for (;;) {
if (p == nBytes) {
try {
nBytes = is.read(buf);
} catch (IOException ex) {
nBytes = -1;
}
if (nBytes == -1) {
return -1;
}
p = 0;
}
byte ch = buf[p++];
if (ch == '\n') {
return p;
}
switch (state) {
case START:
if ('0' <= ch && ch <= '9') {
if (++iCol == colIndex) {
state = IN_THE_COL;
colVal = ch - '0';
}
else {
state = IN_ANY_COL;
}
}
break;
case IN_THE_COL:
if ('0' <= ch && ch <= '9') {
colVal = 10 * colVal + (ch - '0');
}
else {
state = WASTE_REST;
}
break;
case IN_ANY_COL:
if (ch < '0' || ch > '9') {
state = START;
}
break;
case WASTE_REST:
break;
}
}
}
}
public static void main(String[] args) {
final String fn = "data.txt";
if (args.length > 0 && args[0].equals("--create-data")) {
PrintWriter pw;
try {
pw = new PrintWriter(fn);
} catch (FileNotFoundException ex) {
System.err.println(ex.getMessage());
return;
}
Random gen = new Random();
for (int row = 0; row < 100; row++) {
int rowLen = 4 * 1024 * 1024 + gen.nextInt(10000);
for (int col = 0; col < rowLen; col++) {
pw.print(gen.nextInt(32));
pw.print((col < rowLen - 1) ? ' ' : '\n');
}
}
pw.close();
}
FileInputStream fis;
try {
fis = new FileInputStream(fn);
} catch (FileNotFoundException ex) {
System.err.println(ex.getMessage());
return;
}
ColumnReader cr = new ColumnReader(fis, 1, 4 * 1024);
int val;
long start = System.currentTimeMillis();
while ((val = cr.getNext()) != -1) {
System.out.print('.');
}
long stop = System.currentTimeMillis();
System.out.println("\nelapsed = " + (stop - start) / 1000.0);
}
}