Linked list of Linked List of Integers, Java.

I first ran into my problem trying to create an int[][] of very large size (7k by 30k) for a dictionary gap list postings program. But alas, I run out of memory trying to allocate the array. How might I create a 2-D array of integers?
What I want is a list of list in which each list in the list is a list of integers. Here is a sample of my code.
Code:
static final int numberOfTerms = 6782;
static final int numberOfLines = 30383;
byte[][] countMatrix = new byte[numberOfLines][numberOfTerms];
int[][] gapsMatrix = new int[numberOfLines][numberOfTerms]; // Too big!!
This list of lists is going to be filled with integers that represent the gaps between two occurrences of the same word in a specific text. In the count matrix I hold a byte indicating whether a word occurs at a given index. In the function I am writing right now, I go through countMatrix and, if I find a byte set there, I take the current index minus the last found index and save that number in my 2D array of integers, which gives me just the gaps between occurrences of the same word in the text.
So how might I create a data structure I need to accomplish this?

I don't know whether this will work for you, but you can try a sparse matrix if you want to stick with arrays. There are several other options: Map, List, weak-reference collections, etc.
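One way to read the "store only what is there" idea, as a minimal sketch: keep one growable list of gaps per term, so memory is proportional to the number of occurrences rather than to lines x terms. The class name GapLists and the tiny sample matrix are made up for illustration, the [line][term] orientation follows the question's countMatrix, and the sketch assumes the first occurrence of a term produces no gap.

```java
import java.util.ArrayList;
import java.util.List;

class GapLists {
    // Builds one list of gaps per term from a [line][term] presence matrix.
    static List<List<Integer>> buildGaps(byte[][] countMatrix, int numberOfTerms) {
        List<List<Integer>> gaps = new ArrayList<>(numberOfTerms);
        for (int term = 0; term < numberOfTerms; term++) {
            List<Integer> termGaps = new ArrayList<>();
            int last = -1; // index of the previous line containing the term, -1 if none yet
            for (int line = 0; line < countMatrix.length; line++) {
                if (countMatrix[line][term] != 0) {
                    if (last >= 0) {
                        termGaps.add(line - last);
                    }
                    last = line;
                }
            }
            gaps.add(termGaps);
        }
        return gaps;
    }

    public static void main(String[] args) {
        byte[][] countMatrix = {
            {1, 0},
            {0, 1},
            {1, 0},
            {1, 1},
        };
        System.out.println(buildGaps(countMatrix, 2)); // prints [[2, 1], [2]]
    }
}
```

Each inner list only grows when a term actually occurs, which is the property the fixed-size int[][] lacks.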

To create an array you need to have enough memory to create it.
An int uses 4 bytes per value, and an N x M array of ints uses at least 4 * N * M bytes.
e.g. 4 * 30383 * 6782 is about 820 MB, which you need to have free to create this.
This is about $8 worth of memory, so this should not be a big problem unless you don't have this much or you set your maximum memory too low.
I would increase your maximum memory by 1 GB at least and it should work.
Alternatives include
use a smaller element type, e.g. char or short or byte, which is 2-4x smaller.
use off heap memory such as a memory mapped file. This doesn't use much heap but does use disk space which is usually cheaper.
increase your maximum memory size.
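The arithmetic above can be checked against the running JVM with a small sketch. The class and method names are hypothetical, and the estimate ignores per-array object headers, so treat the result as a lower bound.

```java
class ArrayMemoryEstimate {
    // Approximate payload size of an int[rows][cols], ignoring array headers.
    static long intMatrixBytes(long rows, long cols) {
        return rows * cols * Integer.BYTES;
    }

    public static void main(String[] args) {
        long needed = intMatrixBytes(30383, 6782);       // 824,230,024 bytes
        long maxHeap = Runtime.getRuntime().maxMemory(); // limit set by -Xmx (or the default)
        System.out.println("needed   ~" + needed / 1_000_000 + " MB");
        System.out.println("max heap ~" + maxHeap / 1_000_000 + " MB");
        System.out.println(needed < maxHeap
                ? "may fit"
                : "will not fit; raise -Xmx or use a smaller element type");
    }
}
```

If the second number is smaller than the first, the allocation cannot succeed regardless of how much physical RAM the machine has.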

You simply have insufficient memory to do that.
http://www.javamex.com/tutorials/memory/array_memory_usage.shtml
Sorry I didn't make it clear, but it is unlikely that using another data structure of the same dimensions is going to change this.

So how might I create a data structure I need to accomplish this?
If I understand correctly, you want to record the gaps between occurrences of the same term.
Let us say you have an array of terms you need to analyze, then:
// needs java.util.ArrayList, java.util.List, java.util.Map, java.util.TreeMap
String[] terms = ...;
Map<String, List<Integer>> map = new TreeMap<String, List<Integer>>();
for (int i = 0; i < terms.length; i++) {
    String term = terms[i];
    // Look up the positions recorded so far for this term
    List<Integer> positions = map.get(term);
    if (positions == null) {
        positions = new ArrayList<Integer>();
    }
    positions.add(i);
    map.put(term, positions);
}
Later you just look at the positions of each term and can calculate the gaps between them. (You may integrate the gap calculation into this code, but I leave that as an exercise for you.)
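For readers who want to see the gap step spelled out, here is one possible sketch. The class name TermGaps and the sample terms are illustrative only, and it assumes a gap is simply the difference between consecutive positions of the same term.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class TermGaps {
    // Records every position of each term, then converts positions to gaps.
    static Map<String, List<Integer>> gaps(String[] terms) {
        Map<String, List<Integer>> positions = new TreeMap<>();
        for (int i = 0; i < terms.length; i++) {
            positions.computeIfAbsent(terms[i], k -> new ArrayList<>()).add(i);
        }
        Map<String, List<Integer>> result = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> e : positions.entrySet()) {
            List<Integer> p = e.getValue();
            List<Integer> g = new ArrayList<>();
            for (int j = 1; j < p.size(); j++) {
                g.add(p.get(j) - p.get(j - 1)); // distance to the previous occurrence
            }
            result.put(e.getKey(), g);
        }
        return result;
    }

    public static void main(String[] args) {
        String[] terms = {"a", "b", "a", "c", "a", "b"};
        System.out.println(gaps(terms)); // prints {a=[2, 2], b=[4], c=[]}
    }
}
```

A term that occurs only once ends up with an empty gap list, so the structure never stores more integers than there are term occurrences.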

Related

What is the time complexity of adding an element to a Java array?

Hello, I have been researching this, but I cannot find anything on the Oracle website.
The question is as follows.
If you are using a static array like this
int[] foo = new int[10];
and you want to add a value at position 4 this way
foo[4] = 4;
that doesn't shift the elements of the array, so the time complexity will be O(1): if your array starts at 0x000001 and has 10 slots, and you want to put a value at position x, you can access it at (x*sizeOf(int))+initialMemoryPosition (this is pseudocode).
Is this right? Is this how this type of array works in Java, and is its time complexity O(1)?
Thanks
The question is based on a misconception: in Java, you can't add elements to an array.
An array gets allocated once, initially, with a predefined number of entries. It is not possible to change that number later on.
In other words:
int a[] = new int[5];
a[4] = 5;
doesn't add anything. It just sets a value in memory.
So, if at all, we could say that we have somehow "O(1)" for accessing an address in memory, as nothing related to arrays depends on the number of entries.
Note: if you ask about ArrayList, things are different, as here adding to the end of the array can cause the creation of a new, larger (underlying) array, and moving of data.
An array is somewhere in memory. You don't have control over where, and you should not care where it is. The array is initialized when the new type[size] syntax is used.
Accessing the array is done using the [] index operator. It will never modify size or order. Just the indexed location if you assign to it.
See also https://www.w3schools.com/java/java_arrays.asp
The time complexity is already correctly commented on. But that is the concern after getting the syntax right.
An old post regarding time complexity of collections can be found here.
Yes, it takes O(1) time. When you initialize an array with, say, int[] foo = new int[10],
it creates a new array filled with 0s. Since an int is 4 bytes (32 bits), assigning to one element, e.g. foo[4] = 5, writes the value at an offset of 4 bytes x 4 (the index) from the start of the array's data. That offset is computed directly from the index without visiting any other element, which is why arrays are 0-indexed and why assignment takes O(1) time.
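The distinction the answers draw between assigning into an array and adding to an ArrayList can be seen in a short sketch (the class name is made up for illustration):

```java
import java.util.ArrayList;
import java.util.List;

class ArrayVsList {
    public static void main(String[] args) {
        int[] foo = new int[10];          // ten slots, all initialized to 0
        foo[4] = 4;                       // overwrites slot 4; nothing is shifted
        System.out.println(foo.length);   // prints 10: the length never changes

        List<Integer> list = new ArrayList<>();
        list.add(4);                      // appends, so the size grows
        System.out.println(list.size());  // prints 1
    }
}
```

Only the second operation is an "add" in the sense the question asks about; the first is a plain indexed write.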

Techniques to redimension an array in Java : is "nullifying" the array bad?

I'm learning Java and, surprisingly, I found out that Java arrays are not dynamic, even though its cousin languages have dynamic arrays.
So I came out with ideas to kind of imitate a dynamic array in java on my own.
One thought I had was to copy the original array references to a temporary array, then turn the original array to null, re-set its index to a bigger value and then finally re-copy the values from the temporary array.
Example.:
if(numberOfEntries == array.length){
    Type[] temp = new Type[numberOfEntries];
    for(int x=0; x < numberOfEntries; x++){
        temp[x] = array[x];
    }
    array = null;
    array = new Type[numberOfEntries+1];
    for(int x=0; x < numberOfEntries; x++){
        array[x] = temp[x];
    }
}
I know that this can result in data loss if the process is interrupted, but aside from that, is this a bad idea? What are my options?
Thanks!
Your idea is in the right ballpark. But for a task that you propose you should never implement your own version, unless it is for academic purposes and fun.
What you propose is roughly implemented by the ArrayList class.
This has an internal array and a size 'counter'. The internal array is filled as items are added. When the internal array is full, all elements are copied to a bigger array. The internal array is never exposed to the user of the class (to make sure its state is always valid).
In your example code, because an array variable is just a reference, you don't really need the temp array. Just create a new, larger array, copy all elements into it, and assign it to your array variable.
You might want to look into thrashing. Changing the size of the array by 1 is likely to be very inefficient. Depending on your use case, you might want to increase the array size by double, and similarly halve the array when it's only a quarter full.
ArrayList is convenient, but when its internal array is full, adding an element takes linear time because everything is copied to a larger array (additions are still constant time on average). You can reserve space ahead of time with the ensureCapacity() method. I'd recommend becoming more familiar with Java's Collections framework so you can make the best decisions in future by yourself.
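The doubling strategy mentioned above might look like the following sketch. It is not how ArrayList is implemented, just an illustration of the idea; GrowableIntArray is a hypothetical name, int is used instead of the question's Type to keep it short, and shrinking is left out.

```java
import java.util.Arrays;

class GrowableIntArray {
    private int[] data = new int[4]; // initial capacity is an arbitrary choice
    private int size = 0;

    void add(int value) {
        if (size == data.length) {
            // Double the capacity; copyOf allocates the new array and copies
            // the old elements, so no temporary array is needed.
            data = Arrays.copyOf(data, data.length * 2);
        }
        data[size++] = value;
    }

    int get(int index) {
        if (index < 0 || index >= size) {
            throw new IndexOutOfBoundsException("index " + index);
        }
        return data[index];
    }

    int size() {
        return size;
    }

    int capacity() {
        return data.length;
    }

    public static void main(String[] args) {
        GrowableIntArray a = new GrowableIntArray();
        for (int i = 0; i < 10; i++) {
            a.add(i * i);
        }
        System.out.println(a.size() + " elements, capacity " + a.capacity()); // prints 10 elements, capacity 16
    }
}
```

Ten additions trigger only two copies here (at 4 and at 8 elements), whereas growing by one would copy on every addition.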
Arrays are not dynamic: their size can't change. Right now you aren't changing the same object; you are replacing a smaller array object with a larger one.
int[] Arr = new int[5]; // Create an array of size 5
Arr = new int[10]; // A different object is assigned. It is not the same object.
So, we can't change the size of an array dynamically. You can use ArrayList for that.
But keep trying!
Please take a look at java.util.ArrayList, which is dynamic; it is part of the Collections framework. Making an array dynamic yourself is likely to be slower and more error-prone.
Have you thought about time complexity? Every time you grow the array you copy every old element into the new array. If you have 1 million elements in the array, think about how long it takes to copy them all on each insert.
One more thing: the ArrayList implementation uses the same logic, except for the new length you are using (it grows by more than one element at a time).

How to delete duplicate/aggregate rows faster in a file using Java (no DB)

I have a 2GB text file with 5 columns delimited by tabs.
A row is considered a duplicate only if 4 out of 5 columns match.
Right now, I am deduping by first loading each column into a separate List,
then iterating through the lists, deleting the duplicate rows as they are encountered and aggregating.
The problem: it is taking more than 20 hours to process one file.
I have 25 such files to process.
Can anyone please share their experience of how they would go about such deduping?
This deduping will be throwaway code, so I am looking for a quick and dirty solution to get the job done as soon as possible.
Here is my pseudo code (roughly)
Iterate over the rows
i=current_row_no.
Iterate over the row no. i+1 to last_row
if(col1 matches //find duplicate
&& col2 matches
&& col3 matches
&& col4 matches)
{
col5List.set(i,get col5); //aggregate
}
Duplicate example
A and B will be duplicate A=(1,1,1,1,1), B=(1,1,1,1,2), C=(2,1,1,1,1) and output would be A=(1,1,1,1,1+2) C=(2,1,1,1,1) [notice that B has been kicked out]
A HashMap will be your best bet. In a single, constant time operation, you can both check for duplication and fetch the appropriate aggregation structure (a Set in my code). This means that you can traverse the entire file in O(n). Here's some example code:
public void aggregate() throws Exception
{
BufferedReader bigFile = new BufferedReader(new FileReader("path/to/file.csv"));
// Notice the parameter for initial capacity. Use something that is large enough to prevent rehashing.
Map<String, HashSet<String>> map = new HashMap<String, HashSet<String>>(500000);
while (bigFile.ready())
{
String line = bigFile.readLine();
int lastTab = line.lastIndexOf('\t');
String firstFourColumns = line.substring(0, lastTab);
// See if the map already contains an entry for the first 4 columns
HashSet<String> set = map.get(firstFourColumns);
// If set is null, then the map hasn't seen these columns before
if (set==null)
{
// Make a new Set (for aggregation), and add it to the map
set = new HashSet<String>();
map.put(firstFourColumns, set);
}
// At this point we either found set or created it ourselves
String lastColumn = line.substring(lastTab+1);
set.add(lastColumn);
}
bigFile.close();
// A demo that shows how to iterate over the map and set structures
for (Map.Entry<String, HashSet<String>> entry : map.entrySet())
{
String firstFourColumns = entry.getKey();
System.out.print(firstFourColumns + "=");
HashSet<String> aggregatedLastColumns = entry.getValue();
for (String column : aggregatedLastColumns)
{
System.out.print(column + ",");
}
System.out.println("");
}
}
A few points:
The initialCapacity parameter for the HashMap is important. If the number of entries exceeds the capacity multiplied by the load factor (0.75 by default), the structure is rehashed, which is slow. The default initial capacity is 16, which will cause many rehashes for you. Pick a value large enough that the number of unique combinations of the first four columns stays below that threshold.
If ordered output in the aggregation is important, you can switch the HashSet for a TreeSet.
This implementation will use a lot of memory. If your text file is 2GB, then you'll probably need a lot of RAM in the jvm. You can add the jvm arg -Xmx4096m to increase the maximum heap size to 4GB. If you don't have at least 4GB this probably won't work for you.
This is also a parallelizable problem, so if you're desperate you could thread it. That would be a lot of effort for throw-away code, though. [Edit: This point is likely not true, as pointed out in the comments]
I would sort the whole list on the first four columns, and then traverse through the list knowing that all the duplicates are together. This would give you O(NlogN) for the sort and O(N) for the traverse, rather than O(N^2) for your nested loops.
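A minimal in-memory sketch of the sort-then-scan idea follows. It assumes every line contains at least one tab and that the whole file fits in memory as a list of lines; the class and method names are made up, and joining the last-column values with "+" simply mirrors the example output in the question.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

class SortedDedupe {
    // The first four columns: everything before the last tab.
    static String key(String line) {
        return line.substring(0, line.lastIndexOf('\t'));
    }

    static List<String> aggregate(List<String> lines) {
        List<String> sorted = new ArrayList<>(lines);
        sorted.sort(Comparator.comparing(SortedDedupe::key)); // rows sharing a key become adjacent
        List<String> out = new ArrayList<>();
        String currentKey = null;
        StringBuilder current = null;
        for (String line : sorted) {
            String k = key(line);
            if (k.equals(currentKey)) {
                // Same key as the previous row: fold the fifth column into it.
                current.append('+').append(line.substring(line.lastIndexOf('\t') + 1));
            } else {
                if (current != null) {
                    out.add(current.toString());
                }
                currentKey = k;
                current = new StringBuilder(line);
            }
        }
        if (current != null) {
            out.add(current.toString());
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> rows = Arrays.asList("1\t1\t1\t1\t1", "2\t1\t1\t1\t1", "1\t1\t1\t1\t2");
        for (String row : aggregate(rows)) {
            System.out.println(row);
        }
    }
}
```

The single pass after sorting only ever compares a row with its predecessor, which is where the O(N) traversal cost comes from.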
I would use a HashSet of the records. This can lead to an O(n) timing instead of O(n^2). You can create a class which has each of the fields with one instance per row.
You need to have a decent amount of memory, but 16 to 32 GB is pretty cheap these days.
I would do something similar to Eric's solution, but instead of storing the actual strings in the HashMap, I'd just store line numbers. So for a particular four-column hash, you'd store a list of line numbers which hash to that value. And then on a second pass through the data, you can remove the duplicates at those line numbers and add the +x as needed.
This way, your memory requirements will be a LOT smaller.
The solutions already posted are nice if you have enough (free) RAM. As Java tends to "still work" even if it is heavily swapping, make sure you don't have too much swap activity if you presume RAM could have been the limiting factor.
An easy "throwaway" solution in case you really have too little RAM is partitioning the file into multiple files first, depending on data in the first four columns (for example, if the third column values are more or less uniformly distributed, partition by the last two digits of that column). Just go over the file once, and write the records as you read them into 100 different files, depending on the partition value. This will need minimal amount of RAM, and then you can process the remaining files (that are only about 20MB each, if the partitioning values were well distributed) with a lot less required memory, and concatenate the results again.
Just to be clear: If you have enough RAM (don't forget that the OS wants to have some for disk cache and background activity too), this solution will be slower (maybe even by a factor of 2, since twice the amount of data needs to be read and written), but in case you are swapping to death, it might be a lot faster :-)
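The partitioning pass could be sketched as below. Note one substitution: instead of the last two digits of a column, this sketch picks the bucket from a hash of the first four columns, which serves the same purpose of keeping all rows with the same key in the same file. The class name, the part-N.tsv file names and the temp-directory demo are all assumptions made for illustration.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;

class FilePartitioner {
    // Bucket index in [0, buckets), derived from the first four columns.
    static int bucketOf(String line, int buckets) {
        String key = line.substring(0, line.lastIndexOf('\t'));
        return Math.floorMod(key.hashCode(), buckets);
    }

    // Writes each input line to outDir/part-<bucket>.tsv in a single pass.
    static void partition(Path input, Path outDir, int buckets) throws IOException {
        BufferedWriter[] writers = new BufferedWriter[buckets];
        try (BufferedReader in = Files.newBufferedReader(input, StandardCharsets.UTF_8)) {
            String line;
            while ((line = in.readLine()) != null) {
                int b = bucketOf(line, buckets);
                if (writers[b] == null) { // open bucket files lazily
                    writers[b] = Files.newBufferedWriter(
                            outDir.resolve("part-" + b + ".tsv"), StandardCharsets.UTF_8);
                }
                writers[b].write(line);
                writers[b].newLine();
            }
        } finally {
            for (BufferedWriter w : writers) {
                if (w != null) {
                    w.close();
                }
            }
        }
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("parts");
        Path input = dir.resolve("input.tsv");
        Files.write(input, Arrays.asList("1\t1\t1\t1\t1", "1\t1\t1\t1\t2", "2\t1\t1\t1\t1"));
        partition(input, dir, 100);
        System.out.println("partitions written to " + dir);
    }
}
```

Each part file can then be deduplicated on its own with any of the in-memory approaches, since duplicates never span two files.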

Filling in uninitialized array in java? (or workaround!)

I'm currently in the process of creating an OBJ importer for an opengles android game. I'm relatively new to the language java, so I'm not exactly clear on a few things.
I have an array which will hold the vertices of the model (along with a few other arrays as well):
float vertices[];
The problem is that I don't know how many vertices there are in the model before I read the file using the inputstream given to me.
Would I be able to fill it in as I need to like this?:
vertices[95] = 5.004f; //vertices was defined like the example above
or do I have to initialize it beforehand?
If the latter is the case, then what would be a good way to find out the number of vertices in the file? Once I read it using inputstreamreader.read() it goes to the next line until it reads the whole file. The only thing I can think of would be to read the whole file, count the number of vertices, then read it AGAIN to fill in the newly initialized array.
Is there a way to dynamically allocate the data as is needed?
You can use an ArrayList which will give you the dynamic size that you need.
List<Float> vertices = new ArrayList<Float>();
You can add a value like this:
vertices.add(5.0F);
and the list will grow to suit your needs.
Some things to note: The ArrayList will hold objects, not primitive types. So it stores the float values you provide as Float objects. However, it is easy to get the original float value from this.
If you absolutely need an array then after you read in the entire list of values you can easily get an array from the List.
You can start reading about Java Collections here.
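Getting a primitive array back out of the list, as mentioned above, takes a short loop, because List.toArray would yield Float[] rather than float[]. A minimal sketch (the class and method names are hypothetical):

```java
import java.util.ArrayList;
import java.util.List;

class VertexBuffer {
    // Copies a List<Float> into a primitive float[] of exactly the right size.
    static float[] toArray(List<Float> values) {
        float[] out = new float[values.size()];
        for (int i = 0; i < out.length; i++) {
            out[i] = values.get(i); // auto-unboxing Float -> float
        }
        return out;
    }

    public static void main(String[] args) {
        List<Float> vertices = new ArrayList<>();
        vertices.add(5.004f);
        vertices.add(1.0f);
        float[] array = toArray(vertices);
        System.out.println(array.length); // prints 2
    }
}
```

This lets the file be read once into the list, with the exact-size array created only after the vertex count is known.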
In java arrays have to be initialised beforehand. In your case you have the following options:
1) Use an ArrayList (or some other implementation of List interface), as suggested by others. Such lists can grow dynamically so this will help.
2) If you have control over the file format, add information on the number of vertices to the beginning of the file, so you can pre-initialise your array with correct size.
3) If you don't have control over it, try guessing the number of vertices based on file size (float is 4 bytes, so maybe divide File.length() by 4, for example). If the guessed number is too small, you can dynamically create a bigger array (say, 120% of the previous array size), then copy all data from the previous array into the new one and carry on. This may be costly, but if your guess of the array size is precise it will not be a problem.
We might be able to give you more ideas if you give us more information on the file format and/or how this array of vertices is going to be used (like: stored for a long time, or thrown away quickly).
No, you can't fill in an uninitialized array.
If you need a dynamic structure that allows storing data + indexes (which seem to be important in your case), I would go for Map (key of Map would be your index):
Map<Integer, Float> vertices = new HashMap<Integer, Float>();
vertices.put(95, 5.004f);

Find and list duplicates in an unordered array consisting of 10,000,000,00 elements

How can duplicate elements in an array, that consists of
unordered 10,000,000,00 elements, be determined? How can they be listed?
Please ensure the performance is taken care of while writing the logic of Java code.
What is the space complexity and time complexity of the logic?
Consider an example array, DuplicateArray[], as shown below.
String DuplicateArray[] = {"tom","wipro","hcl","Ibm","rachael","tom","wipro","hcl","Ibm","rachael",
"Bill","HP","hcl","Ibm","rachael","tom","wipro","hcl","Ibm","rachael","tom","wipro","hcl","Ibm","rachael",
"Bill","HP","hcl","Ibm","rachael","tom","wipro","hcl","Ibm","rachael","tom","wipro","hcl","Ibm","rachael",
"Agnus","wipro","hcl","Ibm","rachael","tom","wipro","hcl","Ibm","rachael","tom","wipro","hcl","Ibm","rachael",
"Obama","wipro","hcl","Ibm","rachael","tom","wipro","hcl","Ibm","rachael","rachael","tom","wipro","hcl","Ibm","rachael",
"Obama","HP","TCS","CTS","rachael","tom","wipro","hcl","Ibm","rachael","rachael","tom","wipro","hcl","Ibm","rachael"};
I suggest you use a Set; a HashSet will suit you best. Put your elements into it one by one, and check for existence on every insert.
Something like this:
HashSet<String>hs = new HashSet<String>();
HashSet<String>Answer = new HashSet<String>();
for(String s: DuplicateArray){
if(!hs.contains(s))
hs.add(s);
else
Answer.add(s);
}
The code depends on the assumption that the element type of your array is String.
Here you go
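// needs java.util.HashMap and java.util.Map
class MyValues {
    public int i = 1; // number of times this value has been seen
    private final String value;
    public MyValues(String v) {
        value = v;
    }
    @Override
    public int hashCode() {
        return value.hashCode();
    }
    @Override
    public boolean equals(Object obj) {
        return obj instanceof MyValues && value.equals(((MyValues) obj).value);
    }
}
Now iterate for duplicates
// A HashMap is used here instead of a TreeSet so that the stored instance
// can be looked up and its counter incremented.
Map<MyValues, MyValues> values = new HashMap<MyValues, MyValues>();
for (String s : DuplicateArray) {
    MyValues v = new MyValues(s);
    MyValues seen = values.get(v);
    if (seen == null) {
        values.put(v, v); // first occurrence
    } else {
        seen.i++; // duplicate: count it
    }
}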
Time and space are both linear.
How many duplicates are expected? A few or comparable to the number of entries or something in between?
Do you know anything else about the values? E.g are they from some specific dictionary?
If not, iterate over the array, build a HashSet, noting when you are about to add an entry that's already there and keeping those in a list. I can't see anything else is going to be faster.
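The "note when you are about to add an entry that's already there" step can lean on the fact that Set.add returns false when the element is already present. A minimal sketch, with a hypothetical class name and a shortened sample array:

```java
import java.util.HashSet;
import java.util.LinkedHashSet;
import java.util.Set;

class DuplicateFinder {
    // Returns each value that occurs more than once, in order of first repeat.
    static Set<String> duplicates(String[] items) {
        Set<String> seen = new HashSet<>();
        Set<String> dups = new LinkedHashSet<>();
        for (String s : items) {
            if (!seen.add(s)) { // add returns false if s was already in the set
                dups.add(s);
            }
        }
        return dups;
    }

    public static void main(String[] args) {
        String[] sample = {"tom", "wipro", "hcl", "tom", "Bill", "hcl"};
        System.out.println(duplicates(sample)); // prints [tom, hcl]
    }
}
```

This does one hash lookup per element, so time is linear in the array length and space is linear in the number of distinct values.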
Firstly, do you mean 10,000,000,00 as one billion or 10 billion? If you mean the latter, you cannot have more than about 2 billion elements in an array or a Set. The suggestions you have so far will not work in this situation. To have 10 billion Strings in memory you will need at least 640 GB and, AFAIK, there is no server available which will allow this volume of memory in a single JVM.
For a task this large, you may have to consider a solution which breaks up the work, either across multiple machines or put the work into files to be processed later.
You have to either:
Assume you have a relatively small number of unique Strings. In this case, you can build a Set in memory of the words you have seen so far. These will fit into memory. (Or you might assume they do.)
Break up the files into manageable sizes. A simple way to do this would be to write to a few hundred work files based on hashcode. The hashcode for the same strings will be the same so as you process each file in memory, you know that it will contain all the duplicates, if there are any.
