I'm trying to search through an ArrayList in Java and build a histogram of string length versus the frequency with which that length occurs in large text files. I've come up with a brute-force algorithm, but it's much too slow to be of use on large data files. Is there a more efficient way of processing an ArrayList? I've included the brute-force method I came up with.
for (int i = 0; i < (maxLen + 1); i++)
{
int hit = 0;
for (int j = 0; j < list.size(); j++)
{
if (i == list.get(j).length())
++hit;
histogram[i] = hit;
}
}
This is terribly inefficient.
How about instead of looping through each possible length value, then each available word, you simply loop through the available words in the document and count their lengths?
For example:
Map<Integer, Integer> frequencies = new HashMap<Integer, Integer>();
for(int i=0; i<list.size(); i++) {
String thisWord = list.get(i);
Integer theLength = Integer.valueOf(thisWord.length());
if(frequencies.containsKey(theLength)) {
frequencies.put(theLength, Integer.valueOf(frequencies.get(theLength).intValue()+1));
}
else {
frequencies.put(theLength, Integer.valueOf(1));
}
}
Then, if the key does not exist in the HashMap, you know no words of that length exist in the document. If the key does exist, you can look up exactly how many times that occurred.
Note: Some aspects of this code example are deliberately verbose to avoid any additional confusion about boxing and unboxing. It could be written slightly more cleanly, and I would certainly do so in a production environment. Also, it assumes that you don't know the minimum or maximum word length in advance (and is thus slightly more flexible and scalable). Otherwise, the other technique of simply declaring a primitive array will work just as well (see Jon Skeet's answer).
For a cleaner version that takes advantage of autoboxing:
Map<Integer, Integer> frequencies = new HashMap<Integer, Integer>();
for(int i=0; i<list.size(); i++) {
String thisWord = list.get(i);
if(frequencies.containsKey(thisWord.length())) {
frequencies.put(thisWord.length(), frequencies.get(thisWord.length())+1);
}
else {
frequencies.put(thisWord.length(), 1);
}
}
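As a further variation on the map-based count above, here is a minimal sketch that assumes Java 8 or later, where Map.merge() folds the "key present?" check and the update into one call. The class name LengthHistogram, the method name countLengths and the sample words are made up for illustration.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper class, not part of the original answers.
class LengthHistogram {
    // Counts how many words have each length; the TreeMap keeps the lengths sorted.
    static Map<Integer, Integer> countLengths(List<String> words) {
        Map<Integer, Integer> frequencies = new TreeMap<>();
        for (String word : words) {
            // merge() stores 1 for a new key, otherwise adds 1 to the existing count
            frequencies.merge(word.length(), 1, Integer::sum);
        }
        return frequencies;
    }

    public static void main(String[] args) {
        List<String> words = Arrays.asList("a", "to", "be", "cat", "dog", "owl");
        System.out.println(countLengths(words)); // prints {1=1, 2=2, 3=3}
    }
}
```

This still makes a single pass over the list, so the cost stays proportional to the number of words.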
Why don't you just loop over the list once?
int[] histogram = new int[maxLen + 1]; // All entries will be 0 to start with
for (String text : list) {
if (text.length() <= maxLen) {
histogram[text.length()]++;
}
}
This is now just O(N).
I wrote the code below to collect the duplicate elements of an ArrayList. My aerospikePIDs list doesn't contain any duplicate values, but when I run the code the if condition is still satisfied.
ArrayList<Integer> aerospikePIDs = new ArrayList<Integer>();
ArrayList<Integer> duplicates = new ArrayList<Integer>();
boolean flag;
for(int j=0;j<aerospikePIDs.size();j++) {
for(int k=1;k<aerospikePIDs.size();k++) {
if(aerospikePIDs.get(j)==aerospikePIDs.get(k)) {
duplicates.add(aerospikePIDs.get(k));
flag=true;
}
if(flag=true)
System.out.println("duplicate elements for term " +searchTerm+duplicates);
}
}
Your inner loop should start from j + 1, not from 1. Otherwise, when j = 1 (the second iteration of j) and k = 1 (the first iteration of k), both indices refer to the same element, so
aerospikePIDs.get(j)==aerospikePIDs.get(k)
returns true.
So the code should be:
ArrayList<Integer> aerospikePIDs = new ArrayList<Integer>();
ArrayList<Integer> duplicates = new ArrayList<Integer>();
for (int j = 0; j < aerospikePIDs.size(); j++) {
for (int k = j + 1; k < aerospikePIDs.size(); k++) {
if (aerospikePIDs.get(j).equals(aerospikePIDs.get(k))) { // equals() compares values; == compares Integer references
duplicates.add(aerospikePIDs.get(k));
System.out.println("duplicate elements for term " +searchTerm+duplicates);
}
}
}
Note: the flag is not necessary. Once you have added a duplicate you can print it directly inside the if, without defining extra variables and code.
Use higher level abstractions:
Push all list elements into a Map<Integer, Integer> - key is the entry in your PIDs list, value is a counter. The corresponding loop simply checks "key present? yes - increase counter; else, add key with counter 1".
In the end, you can iterate that map, and each entry that has a counter > 1 ... has duplicates in your list; and you even get the number of duplicates for free.
And questions/answers that show you nice ways to do such things are posted here on an almost daily basis. You can start here for example; you only need to adapt from a "String" key to an "Integer" key.
Really: when working with collections, your first step should always be to find the most high-level way of getting things done, instead of sitting down and writing error-prone low-level code like the above.
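The counting-map idea described in this answer can be sketched as follows. This is only an illustration, assuming Java 8 or later; the class name DuplicateCounter, the method name findDuplicates and the sample PIDs are hypothetical.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper class, not taken from the question.
class DuplicateCounter {
    // Returns each value that occurs more than once, mapped to its occurrence count.
    static Map<Integer, Integer> findDuplicates(List<Integer> pids) {
        // LinkedHashMap keeps the order in which values were first seen
        Map<Integer, Integer> counts = new LinkedHashMap<>();
        for (Integer pid : pids) {
            // key present? add 1 to its counter; otherwise store a counter of 1
            counts.merge(pid, 1, Integer::sum);
        }
        // Drop the values that were seen only once
        counts.values().removeIf(count -> count == 1);
        return counts;
    }

    public static void main(String[] args) {
        List<Integer> pids = Arrays.asList(1001, 2002, 1001, 3003, 2002, 1001);
        System.out.println(findDuplicates(pids)); // prints {1001=3, 2002=2}
    }
}
```

Because the map is keyed on the Integer values, the comparison goes through equals() and hashCode(), and the list is traversed only once instead of once per element.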
You are iterating over the same ArrayList in both loops. Since the inner loop checks every element, including the one at index j itself, it is certain to report duplicates.
I'm comparing two functions for use as my permutation generator. This question is about a lot of things: the string intern table, the pros and cons of using iteration vs recursion for this problem, etc.
public static List<String> permute1(String input) {
LinkedList<StringBuilder> permutations = new LinkedList<StringBuilder>();
permutations.add(new StringBuilder(""+input.charAt(0)));
for(int i = 1; i < input.length(); i++) {
char c = input.charAt(i);
int size = permutations.size();
for(int k = 0; k < size ; k++) {
StringBuilder permutation = permutations.removeFirst(),
next;
for(int j = 0; j < permutation.length(); j++) {
next = new StringBuilder();
for(int b = 0; b < permutation.length(); next.append(permutation.charAt(b++)));
next.insert(j, c);
permutations.addLast(next);
}
permutation.append(c);
permutations.addLast(permutation);
}
}
List<String> formattedPermutations = new LinkedList<String>();
for(int i = 0; i < permutations.size(); formattedPermutations.add(permutations.get(i++).toString()));
return formattedPermutations;
}
public static List<String> permute2(String str) {
return permute2("", str);
}
private static List<String> permute2(String prefix, String str) {
int n = str.length();
List<String> permutations = new LinkedList<String>();
if (n == 0) permutations.add(prefix);
else
for (int i = 0; i < n; i++)
permutations.addAll(permute2(prefix + str.charAt(i), str.substring(0, i) + str.substring(i+1, n)));
return permutations;
}
I think these two algorithms should be roughly equivalent; however, the recursive implementation does well up to n=10, whereas permute1, the iterative solution, throws an OutOfMemoryError at n=8, where n is the input string length. Is the fact that I'm using StringBuilder and then converting to Strings a bad idea? If so, why? I thought that whenever you add to a string it creates a new one, which would be bad because Java would then intern it, right? So you'd end up with a bunch of intermediate strings that aren't permutations but are stuck in the intern table.
EDIT:
I replaced StringBuilder with String, which removed the need for StringBuilder.insert(). However, I do have to use String.substring() to build up the permutation strings, which may not be the best way to do it, but it's empirically better than StringBuilder.insert(). I did not use a char array as Alex Suo suggested: since my method is supposed to return a list of strings, I would have to convert those char arrays into strings, which would induce more garbage collection on the char arrays (the reason for the OutOfMemoryError). With this change in place, both the OutOfMemoryError and the slowness are resolved.
public static List<String> permute3(String input) {
LinkedList<String> permutations = new LinkedList<String>();
permutations.add(""+input.charAt(0));
for(int i = 1; i < input.length(); i++) {
char c = input.charAt(i);
int size = permutations.size();
for(int k = 0; k < size ; k++) {
String permutation = permutations.removeFirst(),
next;
for(int j = 0; j < permutation.length(); j++) {
next = permutation.substring(0, j) + c + permutation.substring(j); // insert c at position j, like insert(j, c) in permute1
permutations.addLast(next);
}
permutations.addLast(permutation + c);
}
}
return permutations;
}
Firstly, since you got an OutOfMemoryError, that suggests to me that you have a lot of GC going on, and GC can be a serious performance killer. Since young-generation GC is stop-the-world, you are probably getting much worse performance because of those collections.
Looking at your code, if you dive into the actual implementation of StringBuilder, you can see that insert() is a very expensive operation involving System.arraycopy() and potentially expandCapacity(). Since you don't mention your n for the permutation, I'd assume n < 10, in which case you won't hit that last problem; for longer strings you would get memory re-allocation, since the default buffer of StringBuilder has a capacity of only 16. StringBuilder is basically an auto-growing char array, but it's not magic: whatever you would need to do when coding it up from scratch, StringBuilder also needs to do.
Having said that, if you really want to achieve maximum performance, and since the length of your strings is known in advance, why not just use a char array with length = String.length()? That's probably the best in terms of performance.
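One possible reading of the char-array suggestion is sketched below. It swaps a different algorithm in for the insertion-based one in the question: in-place swapping with backtracking on a single char[] of fixed length. The class name CharArrayPermutations is hypothetical, and like the original functions this sketch does not remove duplicates when the input contains repeated characters.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical class: swap-based backtracking, not the asker's insertion approach.
class CharArrayPermutations {
    static List<String> permute(String input) {
        List<String> result = new ArrayList<>();
        permute(input.toCharArray(), 0, result);
        return result;
    }

    // Positions [0, start) are fixed; try every remaining char at position start.
    private static void permute(char[] chars, int start, List<String> result) {
        if (start >= chars.length - 1) {
            // One String is allocated per finished permutation, none for intermediates
            result.add(new String(chars));
            return;
        }
        for (int i = start; i < chars.length; i++) {
            swap(chars, start, i);
            permute(chars, start + 1, result);
            swap(chars, start, i); // undo the swap before trying the next char
        }
    }

    private static void swap(char[] chars, int i, int j) {
        char tmp = chars[i];
        chars[i] = chars[j];
        chars[j] = tmp;
    }

    public static void main(String[] args) {
        System.out.println(permute("abc")); // prints [abc, acb, bac, bca, cba, cab]
    }
}
```

The only allocations here are the n! result strings themselves, which any method returning a List<String> has to create anyway; whether that beats the substring version in practice would need measuring.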
I have two HashMaps of different sizes holding query results, and I'm trying to find the records that exist in HashMap A but not in HashMap B.
I'll post my code so far. I did the comparison in SQL and got the result I want, but I haven't managed to reproduce it in code. I hope you can point me in the right direction. Thanks in advance.
HashMap<Integer, HashMap<String, Object>> mapA = new HashMap<>();
HashMap<Integer, HashMap<String, Object>> mapB = new HashMap<>();
int m=0;
for (int j = 0; j < mapA.size(); j++) {
for (int k = 0; k < mapB.size(); k++) {
if (!mapA.get(j).get("folio").toString().equals(
mapB.get(k).get("folio").toString())) {
m++; // count many records not exist on mapB
}
}
}
System.out.println(m);
There is an error in the logic. You want to find the records of A that do not exist in B, but you increment the counter every time the values from the two HashMaps being iterated are not equal, so you end up with a much larger m. Your code should look like this:
for (int j = 0; j < mapA.size(); j++) {
boolean found=false;
for (int k = 0; k < mapB.size(); k++) {
if (mapA.get(j).get("folio").toString().equals(
mapB.get(k).get("folio").toString())) {
found=true;
break;
}
}
if (!found){
m++; // count many records not exist on mapB
}
}
There is also another possible error. In the general case you should compare the objects themselves rather than the results of toString(). (I assume you didn't override the toString method of your objects to return a valid identifier for comparison, and in most cases it will not return what you need.) In other words, you should override the equals method of every kind of object that can appear in the hashmaps and use the following code for the comparison:
mapA.get(j).get("folio").equals(mapB.get(k).get("folio"))
In your case (with toString) the comparison may always return false, because the default toString returns the class name and the hash code of the object.
I'm not sure I understood your approach. A more proper way to do that would be (in pseudo code):
For each element in Hashmap A:
If !HashmapB.contains(element):
++Counter;
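The pseudocode above could be turned into Java roughly as follows. This is a sketch under two assumptions: records are matched on their "folio" value, as in the question, and those values have a meaningful equals()/hashCode() (for example String or Integer). The class FolioDiff and its methods countMissing and record are hypothetical names.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical helper class, not taken from the question.
class FolioDiff {
    // Counts the records of mapA whose "folio" value appears in no record of mapB.
    static int countMissing(Map<Integer, ? extends Map<String, Object>> mapA,
                            Map<Integer, ? extends Map<String, Object>> mapB) {
        // Collect every folio of B once, so each later lookup is a hash lookup
        Set<Object> foliosInB = new HashSet<>();
        for (Map<String, Object> recordB : mapB.values()) {
            foliosInB.add(recordB.get("folio"));
        }
        int missing = 0;
        for (Map<String, Object> recordA : mapA.values()) {
            if (!foliosInB.contains(recordA.get("folio"))) {
                missing++;
            }
        }
        return missing;
    }

    // Builds a one-field record for the demo below.
    static HashMap<String, Object> record(Object folio) {
        HashMap<String, Object> r = new HashMap<>();
        r.put("folio", folio);
        return r;
    }

    public static void main(String[] args) {
        HashMap<Integer, HashMap<String, Object>> mapA = new HashMap<>();
        HashMap<Integer, HashMap<String, Object>> mapB = new HashMap<>();
        mapA.put(0, record("F-1"));
        mapA.put(1, record("F-2"));
        mapA.put(2, record("F-3"));
        mapB.put(0, record("F-2"));
        mapB.put(1, record("F-1"));
        System.out.println(countMissing(mapA, mapB)); // prints 1 (only F-3 is missing from B)
    }
}
```

Iterating over values() also removes the assumption that the map keys are exactly 0 to size() - 1, which the index-based loops rely on.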
Say you have a List of Strings or whatever, and you want to produce another List containing every possible combination of two strings from the original list (concatenated together). Is there a more efficient way to do this than using a nested for loop to combine each String with all the others?
Some sample code:
for(String s: bytes) {
for(String a: bytes) {
if(!(bytes.indexOf(a) == bytes.indexOf(s))) {
if(s.concat(a).length() == targetLength) {
String combination = s.concat(a);
validSolutions.add(combination);
}
}
}
}
The time for execution gets pretty bad pretty quickly as the size of the original list of Strings grows.
Any more efficient way to do this?
You can avoid checking the i != j condition by setting j = i + 1. Also, the length of bytes gets re-evaluated at each iteration of both loops; save it into a variable and reuse it. Calling a.length() inside the inner loop asks for the length of the same string multiple times, so you can save some runtime on that as well. Here are the updates:
int len = bytes.size(); // bytes is a List<String>; evaluate its size once
int aLength;
String a, b;
for(int i=0; i<len; i++) {
a = bytes.get(i);
aLength = a.length();
for(int j=i; j<len; j++) {
b = bytes.get(j);
if (b.length() + aLength == targetLength) {
validSolutions.add(a.concat(b));
if (i != j) { // a string paired with itself yields only one result
validSolutions.add(b.concat(a));
}
}
}
}
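Edit: j = i because you want to consider the combination of a string with itself (use j = i + 1 instead if, as in the question's code, a string must not be paired with itself). Also, both concatenation orders have to be added, since the inner loop only visits each pair of indices once but both orders are valid strings.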
You can't do better than O(N^2), because there are that many combinations. But you can speed up your algorithm a bit (from O(N^3)) by removing the indexOf calls:
for(int i=0; i<bytes.size(); i++) {
for(int j=0; j<bytes.size(); j++) {
String s = bytes.get(i);
String a = bytes.get(j);
if (i != j && s.length() + a.length() == targetLength) {
validSolutions.add(s.concat(a));
}
}
}
In addition to what Jimmy and lynxoid say, the fact that the total length is constrained gives you a further optimization. Sort your strings in order of length; then for each s you know that you only need those strings a for which a.length() == targetLength - s.length().
So as soon as you hit a string longer than that you can break out of the inner loop (since all the rest will be longer), and you can start at the "right" place for example with a lower-bound binary search into the array.
Complexity is still O(n^2), since in the worst case all the strings are the same length, equal to half of targetLength. Typically, though, it should do somewhat better than considering all pairs of strings.
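A variation on the same idea, sketched here with hash buckets instead of sorting plus binary search: group the list indices by string length, then for each s look up only the bucket holding length targetLength - s.length(). It assumes Java 8 or later and, as in the question's code, never pairs an element with itself; the class name LengthBuckets and the method name combine are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper class: buckets by length rather than sort + binary search.
class LengthBuckets {
    static List<String> combine(List<String> bytes, int targetLength) {
        // Group the list indices by the length of the string stored at each index
        Map<Integer, List<Integer>> indicesByLength = new HashMap<>();
        for (int i = 0; i < bytes.size(); i++) {
            indicesByLength
                    .computeIfAbsent(bytes.get(i).length(), k -> new ArrayList<>())
                    .add(i);
        }
        List<String> validSolutions = new ArrayList<>();
        for (int i = 0; i < bytes.size(); i++) {
            String s = bytes.get(i);
            // Only strings of exactly the complementary length can match
            List<Integer> partners = indicesByLength.getOrDefault(
                    targetLength - s.length(), Collections.<Integer>emptyList());
            for (int j : partners) {
                if (i != j) { // skip pairing an element with itself
                    validSolutions.add(s.concat(bytes.get(j)));
                }
            }
        }
        return validSolutions;
    }

    public static void main(String[] args) {
        List<String> bytes = Arrays.asList("ab", "cd", "e", "fgh");
        System.out.println(combine(bytes, 4)); // prints [abcd, cdab, efgh, fghe]
    }
}
```

The worst case is unchanged (all strings the same length, half of targetLength), but the inner loop now touches only strings that actually produce a result.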
I've been struggling to write a function that finds the indices of all duplicate elements in an unsorted multi-dimensional array (in this case a 5x5 array) and then uses those indices to change the parallel elements in a score array. It should only find duplicates within each column, not across different columns. Here is what I've done so far, with research online. The main problem with this code is that it finds all the duplicate elements but not the originals. For example, if the array holds the elements:
{{"a","a","a"},{"b","b","b"},{"a","c","a"}}, then it should change the parallel score array to: {{0,1,0},{1,1,1},{0,1,0}}. But instead it only recognizes the duplicates in the last row, not those in the top row.
Code:
public static void findDuplicates(String a[][])
{
System.out.println("*Duplicates*");
Set set = new HashSet();
for(int j = 0; j<a.length; j++)
{
for(int i=0; i < a[0].length; i++)
{
if(!set.contains(a[i][j]))
{
set.add(a[i][j]);
}
else
{
System.out.println("Duplicate string found at index " + i + "," + j);
scores[i][j] -= scores[i][j];
}
}
set = new HashSet();
}
}
I know my explanation is a bit complicated, but hopefully it is understandable enough. Thanks,
Jake.
Your logic is incorrect. Your outer loop is j and inner loop is i but you're doing:
set.add(a[i][j]);
It should be the other way around:
set.add(a[j][i]);
Technically you could get an out of bounds exception if the array isn't NxN. But you can state that as a precondition.
For some reason you're also setting to 0 with:
scores[i][j] -= scores[i][j];
Why not just:
scores[i][j] = 0;
But to find duplicates within columns:
public static void findDuplicates(String a[][]) {
for (int col=0; col<a[0].length; col++) {
Map<String, Integer> values = new HashMap<String, Integer>();
for (int row=0; row<a.length; row++) {
Integer current = values.put(a[row][col], row);
if (current != null) {
scores[row][col] = 0;
scores[current][col] = 0;
}
}
}
}
How does this work?
I've renamed the loop variables to row and col. There's no reason to use i and j when row and col are far more descriptive;
Like you I assume the input array is correct as a precondition. It can be NxM (rather than just NxN) however;
I use a Map to store the index of each value. Map.put() returns the old value if key is already in the Map. If that's the case you've found a duplicate;
The cells at (row,col) and (current,col) are both set to 0. Why subtract the score from itself rather than simply setting it to 0?
If the value "a" is found 3+ times in a column then scores[current][col] will be set to 0 more than once, which is unnecessary but not harmful and makes for simpler code;
I've declared the Map using generics. This is useful and advisable. It says the Map has String keys and Integer values, which saves some casting;
It also uses auto-boxing and auto-unboxing to convert an int (the loop variable) to and from the wrapper class Integer.
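To see the column-wise approach run end to end, here is a self-contained sketch using the example array from the question. Two things are assumptions made for the demo: scores is passed in as a parameter rather than being a field, and every score starts at 1. The class name ColumnDuplicates is hypothetical.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

// Hypothetical demo class wrapping the column-wise duplicate search.
class ColumnDuplicates {
    // Zeroes scores[row][col] for every cell whose value occurs more than once in its column.
    static void findDuplicates(String[][] a, int[][] scores) {
        for (int col = 0; col < a[0].length; col++) {
            // Maps each value to the last row of this column in which it was seen
            Map<String, Integer> lastRowOf = new HashMap<>();
            for (int row = 0; row < a.length; row++) {
                // put() returns the previous row for this value, or null if it is new
                Integer previous = lastRowOf.put(a[row][col], row);
                if (previous != null) {
                    scores[row][col] = 0;
                    scores[previous][col] = 0;
                }
            }
        }
    }

    public static void main(String[] args) {
        String[][] a = {{"a", "a", "a"}, {"b", "b", "b"}, {"a", "c", "a"}};
        int[][] scores = {{1, 1, 1}, {1, 1, 1}, {1, 1, 1}};
        findDuplicates(a, scores);
        System.out.println(Arrays.deepToString(scores)); // prints [[0, 1, 0], [1, 1, 1], [0, 1, 0]]
    }
}
```

The output matches the score array the question asks for: both the first occurrence and the later one are zeroed in columns 0 and 2, while column 1 has no duplicates.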