performance finding elements in collection - java

I have two collections
The first contains all the elements.
The second contains the elements that I am interested in from the first collection.
The data is in an alphanumeric format:
AAA
AA.12.AA
BBB.234.B1
CC.89
…
The first collection contains roughly 300,000 records.
Now, if I want to pull 10,000 records from the first collection, it takes up to 40 seconds to find them.
Collection types: firstColl = ArrayList, secondColl = List
Action: I iterate over all elements in firstColl and, for every element, check whether secondColl contains it.
I just want to know if anyone knows a more performant way to do this, maybe using BigList, Streams, ...
CODE:
List<RegionPolygon> regionPolygons = new ArrayList<>();
for (RegionPolygon regionPolygon : result) {
    if (regionsArray.contains(regionPolygon.getRegionRef())) {
        regionPolygons.add(regionPolygon);
    }
}
Note: RegionPolygon has a property which is a String with a very long value (easily more than 2,000 characters), although I am not using that property to check whether it is the region I am looking for. I just wanted to mention this because I don't know if it is part of the problem.
result = firstColl
regionsArray = secondColl
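For reference: contains() on a List is a linear scan, so this loop does on the order of 300,000 × 10,000 element comparisons in the worst case. A minimal sketch of the usual hash-based alternative, assuming getRegionRef() returns a String and regionsArray is a List<String>:

import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Copy the wanted refs into a HashSet once; contains() then runs in
// O(1) on average instead of scanning the whole list on every check.
Set<String> wantedRefs = new HashSet<>(regionsArray);

List<RegionPolygon> regionPolygons = new ArrayList<>();
for (RegionPolygon regionPolygon : result) {
    if (wantedRefs.contains(regionPolygon.getRegionRef())) {
        regionPolygons.add(regionPolygon);
    }
}

The long String property is irrelevant to the lookup cost as long as it is not part of equals()/hashCode().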
Thanks,

Related

Hash items in a 2d array, but only on one index

So, I have a 2d array (really, a List of Lists) that I need to squish down and remove any duplicates, but only for a specific field.
The basic layout is a list of Matches, with each Match having an ID number and a date. I need to remove all duplicates such that each ID only appears once. If an ID appears multiple times in the List of Matches, then I want to take the Match with the most recent date.
My current solution has me taking the List of Matches, adding it to a HashSet, and then converting that back to an ArrayList. However all that does is remove any exact Match duplicates, which still leaves me with the same ID appearing multiple times if they have different dates.
Set<Match> deDupedMatches = new HashSet<Match>();
deDupedMatches.addAll(originalListOfMatches);
List<Match> finalList = new ArrayList<Match>(deDupedMatches);
If my original data coming in is
{(1, 1-1-1999),(1, 2-2-1999),(1, 1-1-1999),(2, 3-3-2000)}
then what I get back is
{(1, 1-1-1999),(1, 2-2-1999),(2, 3-3-2000)}
But what I am really looking for is a solution that would give me
{(1, 2-2-1999),(2, 3-3-2000)}
I had some vague idea of hashing the original list in the same basic way, but only using the IDs. Basically I would end up with "buckets" based on the ID that I could iterate over, and any bucket that had more than one Match in it I could choose the correct one for. The thing that is hanging me up is the actual hashing. I am just not sure how or if I can get the Matches broken up in the way that I am thinking of.
If I understand your question correctly, you want to keep one Match per distinct ID, namely the one with the latest date.
Because Match is a class, a Set cannot compare instances on just one field; it uses equals()/hashCode(), which look at the whole object.
What I would do to get around this problem is use a HashMap, which links distinct keys to values.
Keys cannot be repeated; values can.
I would do something like this while looping through:
if (map.putIfAbsent(match.getID(), match) != null
        && map.get(match.getID()).getDate().compareTo(match.getDate()) < 0) {
    map.replace(match.getID(), match);
}
So what that does is loop through your matches.
It puts the current Match in under its ID if that ID doesn't exist yet.
.putIfAbsent returns the old value, which is null if the key did not exist.
You then check whether there was already an item in the map for that ID using putIfAbsent's return value (two birds with one stone).
After that it is safe to compare the two dates (the one in the map and the one from the iteration; compareTo stands in for whatever comparison your date type supports).
If the new one is later, replace the current Match.
And finally, to get your list, you use .values().
This will remove duplicate IDs and leave only the latest ones.
Apologies for typos or code errors, this was done on a phone. Please notify me of any errors in the comments.
Java 7 does not have the .putIfAbsent and .replace methods, but they can be substituted with .containsKey, .get and .put.
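Putting the pieces together, a minimal sketch of the whole approach (assuming Match exposes getID() returning an int and getDate() returning a Comparable type; the names come from the question):

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

Map<Integer, Match> latestById = new HashMap<>();
for (Match match : originalListOfMatches) {
    // Keep the first Match seen for an ID; a later-dated one replaces it.
    if (latestById.putIfAbsent(match.getID(), match) != null
            && latestById.get(match.getID()).getDate().compareTo(match.getDate()) < 0) {
        latestById.replace(match.getID(), match);
    }
}
List<Match> finalList = new ArrayList<>(latestById.values());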

1 to 1 association of 2 string lists in java

I am a relatively new programmer and am working on my first project to build a portfolio. In my project I have 2 rather large lists of strings (about 3.1 million entries) and I need to "associate" the elements in each one in a 1-to-1 relationship, with the pairs chosen from predetermined positions (elements are selected according to a set method), not just linearly (from top to bottom). For example:
lista(0) = list1(5);
listb(0) = list2(2);
lista(1) = list1(1);
listb(1) = list2(4);
lista(2) = list1(3);
listb(2) = list2(1);
The point of this is to reorder the lists in a manner that can be recreated at a later time or by a different program by "remembering" a set of values. I am using 2 lists because I need to be able to search one list for a String then pull the value from the corresponding element in the other list.
I have tried many different methods, like storing each list in an ArrayList, accessing the elements in the preset order, storing them in new ArrayLists in the new order, and then removing the elements from the old ArrayLists. This would be ideal, but it didn't work because removing elements from a really large ArrayList was very slow. The removal was meant to prevent an element from being used again.
I also tried storing them in String arrays, accessing each element in the predefined order, storing them in another array, and then nulling out the elements so they wouldn't be used again. But the null gaps made searching a nightmare: whenever the program hit a null element during the predefined "move", I had to add checks for nulls and then more movement, which made things more complicated and harder to reproduce later.
I need an easy, and efficient way to create these associations between these 2 lists and ANY ideas are welcome.
This is my first post to stackoverflow and I apologize if its formatted improperly or confusing, but please be gentle.
If you need to pull one value from a given string, why not use a Map? The key would be the value from the first list and the value would be the corresponding value from the second list.
Use a Map<String, String>, which stores both the key and the value as strings. The best part is that the time complexity of removing an element is O(1) on average.
As mentioned before, a Map is an option, more specifically a HashMap; another option could be a Hashtable. Make sure you look at what each has to offer. Some major differences: HashMap allows nulls but is not synchronized, while Hashtable is synchronized and does not accept null keys.
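A minimal sketch of the Map idea, assuming the predetermined selection is available as an array of index pairs (pairs, list1 and list2 are hypothetical names for illustration):

import java.util.HashMap;
import java.util.List;
import java.util.Map;

// pairs[k] = {indexIntoList1, indexIntoList2}, produced by the
// predetermined selection method (assumed representation).
Map<String, String> association = new HashMap<>();
for (int[] pair : pairs) {
    association.put(list1.get(pair[0]), list2.get(pair[1]));
}

// Searching one list and pulling the corresponding element of the
// other is now a single lookup, O(1) on average:
String partner = association.get(someStringFromList1);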

Building an inverted index in Java - logic

I have a collection of around 1500 documents. I parsed through each document and extracted tokens. These tokens are stored in a HashMap (as keys), and the total number of times they occur in the collection (i.e. their frequency) is stored as the value.
I have to extend this to build an inverted index. That is: term (key) | number of documents it occurs in --> DocNo | frequency in that document. For example,
Term      DocFreq   DocNum   TermFreq
data      3         1        12
                    23       31
                    100      17
customer  2         22       43
                    19       2
Currently, I have the following in Java,
HashMap<String, Integer>

for (each document) {
    extract line
    for (each line) {
        extract word
        for (each word) {
            perform some operations
            get value for word from hashmap and increment by one
        }
    }
}
I have to build on this code. I can't really think of a good way to implement an inverted index.
So far, I have thought of making the value a 2D array. The term would be the key, and the value (i.e. the 2D array) would store the docId and the termFreq.
Please let me know if my logic is correct.
I would do it by using a Map<String, TermFrequencies>. This map would maintain a TermFrequencies object for each term found. The TermFrequencies object would have the following methods:
void addOccurrence(String documentId);
int getTotalNumberOfOccurrences();
Set<String> getDocumentIds();
int getNumberOfOccurrencesInDocument(String documentId);
It would use a Map<String, Integer> internally to associate each document the term occurs in with the number of occurrences of the term in the document.
The algorithm would be extremely simple:
for (each document) {
    extract line
    for (each line) {
        extract word
        for (each word) {
            TermFrequencies termFrequencies = map.get(word);
            if (termFrequencies == null) {
                termFrequencies = new TermFrequencies(word);
                map.put(word, termFrequencies);
            }
            termFrequencies.addOccurrence(document);
        }
    }
}
The addOccurrence() method would simply increment a counter for the total number of occurrences, and would insert or update the number of occurrences in the internal map.
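A minimal sketch of what such a TermFrequencies class could look like (only the method names come from the answer above; the internal layout is an assumption):

import java.util.HashMap;
import java.util.Map;
import java.util.Set;

class TermFrequencies {
    private final String term;
    // documentId -> number of occurrences of the term in that document
    private final Map<String, Integer> occurrencesByDocument = new HashMap<>();
    private int totalOccurrences = 0;

    TermFrequencies(String term) {
        this.term = term;
    }

    void addOccurrence(String documentId) {
        totalOccurrences++;
        Integer current = occurrencesByDocument.get(documentId);
        occurrencesByDocument.put(documentId, current == null ? 1 : current + 1);
    }

    int getTotalNumberOfOccurrences() {
        return totalOccurrences;
    }

    Set<String> getDocumentIds() {
        return occurrencesByDocument.keySet();
    }

    int getNumberOfOccurrencesInDocument(String documentId) {
        Integer count = occurrencesByDocument.get(documentId);
        return count == null ? 0 : count;
    }
}

The DocFreq column from the question is then simply getDocumentIds().size().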
I think it is best to have two structures: a Map<docnum, Map<term, termFreq>> and a Map<term, Set<docnum>>. Your docFreqs can be read off as set.size() in the values of the second map. This solution involves no custom classes and allows quick retrieval of everything needed.
The first map contains all the information, and the second one is a derivative that allows quick lookup by term. As you process a document, you fill the first map. You can derive the second map afterwards, but it is also easy to do it in one pass.
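A sketch of filling both maps in one pass (docnum is assumed to be an Integer here; count() is a hypothetical helper called once per word occurrence):

import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

Map<Integer, Map<String, Integer>> termFreqByDoc = new HashMap<>();
Map<String, Set<Integer>> docsByTerm = new HashMap<>();

void count(int docNum, String term) {
    // first map: per-document term frequencies
    Map<String, Integer> freqs = termFreqByDoc.get(docNum);
    if (freqs == null) {
        freqs = new HashMap<>();
        termFreqByDoc.put(docNum, freqs);
    }
    Integer f = freqs.get(term);
    freqs.put(term, f == null ? 1 : f + 1);

    // second map: which documents a term occurs in;
    // docFreq of a term is docsByTerm.get(term).size()
    Set<Integer> docs = docsByTerm.get(term);
    if (docs == null) {
        docs = new HashSet<>();
        docsByTerm.put(term, docs);
    }
    docs.add(docNum);
}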
I once implemented what you're asking for. The problem with your approach is that it is not abstract enough. You should model Terms, Documents and their relationships using objects. In a first run, create the term index and document objects and iterate over all terms in the documents while populating the term index. Afterwards, you have a representation in memory that you can easily transform into the desired output.
Do not start by thinking about 2d-arrays in an object oriented language. Unless you want to solve a mathematical problem or optimize something it's not the right approach most of the time.
I don't know if this is still a hot question, but I would recommend doing it like this:
You run over all your documents and give them ids in increasing order. For each document you run over all the words.
Now you have a HashMap that maps Strings (your words) to lists of DocTermObjects. A DocTermObject contains a docId and a TermFrequency.
Now, for each word in a document, you look it up in your HashMap. If there is no list of DocTermObjects for it yet, you create one; otherwise you look at its very LAST element only (this is important for the runtime, think about it). If that element has the docId of the document you are treating at the moment, you increase its TermFrequency. Otherwise, or if the list is empty, you add a new DocTermObject with the current docId and set its TermFrequency to 1. A sketch follows below.
Later you can use this data structure to compute scores, for example. The scores could also be saved in the DocTermObjects, of course.
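A minimal sketch of that inner step (DocTermObject is the class named above; a List stands in for the array):

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class DocTermObject {
    int docId;
    int termFrequency;

    DocTermObject(int docId, int termFrequency) {
        this.docId = docId;
        this.termFrequency = termFrequency;
    }
}

Map<String, List<DocTermObject>> index = new HashMap<>();

void addWord(String word, int docId) {
    List<DocTermObject> postings = index.get(word);
    if (postings == null) {
        postings = new ArrayList<>();
        index.put(word, postings);
    }
    // Documents are processed in increasing docId order, so only the
    // LAST entry can belong to the current document.
    if (!postings.isEmpty() && postings.get(postings.size() - 1).docId == docId) {
        postings.get(postings.size() - 1).termFrequency++;
    } else {
        postings.add(new DocTermObject(docId, 1));
    }
}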
Hope it helped :)

IndexOutOfBoundsException - only sometimes?

I keep getting random java.lang.IndexOutOfBoundsException errors in my program.
What am I doing wrong?
The program runs fine; it's a really long for loop, but for some elements I seem to get that error, and then it continues on to the next element, which works fine.
for (int i = 0; i < response.getSegments().getSegmentInfo().size() - 1; i++) {
    reservedSeats = response.getSegments().getSegmentInfo().get(i).getCabinSummary().getCabinClass().get(i).getAmountOfResSeat();
    usedSeats = response.getSegments().getSegmentInfo().get(i).getCabinSummary().getCabinClass().get(i).getAmountOfUsedSeat();
    System.out.println("Reserved Seats: " + reservedSeats);
    System.out.println("Used Seats : " + usedSeats);
}
How can I prevent these errors?
For those thinking this is an array, it is more likely a list.
Let me guess, you used to be getting ConcurrentModificationExceptions, so you rewrote the loop to use indexed lookup of elements (avoiding the iterator). Congratulations, you fixed the exception, but not the issue.
You are changing your List while this loop is running. Every now and then, you remove an element. Every now and then you look at the last element size()-1. When the order of operations looks like:
(some thread)
remove an element from response.getSegments().getSegmentInfo()
(some possibly other thread)
look up the size()-1 element of the above
You access an element that no longer exists, and will raise an IndexOutOfBoundsException.
You need to fix the logic around this List by controlling access to it: if you need to check all elements, don't assume the list stays the same while you walk across it, or (the much better solution) freeze the list for the duration of the loop.
A simple way of doing the latter is to make a copy of the List (but not of the list's elements) and iterate over the copy.
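A minimal sketch of that defensive copy (SegmentInfo is a guessed type name, as in the refactored code further down):

import java.util.ArrayList;
import java.util.List;

// Snapshot the live list; removals from the original list no longer
// affect this loop.
List<SegmentInfo> snapshot = new ArrayList<>(response.getSegments().getSegmentInfo());
for (SegmentInfo info : snapshot) {
    // ... process info ...
}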
--- Edited as the problem dramatically changed in an edit after the above was written ---
You added a lot of extra code, including a few extra list lookups. You are using the same index for all the list lookups, but there is nothing to indicate that all of the lists are the same size.
Also, you probably don't want to skip across elements, odds are you really want to access all of the cabin classes in a segmentInfo, not just the 3rd cabinClass in the 3rd segmentInfo, etc.
You seem to be using i to index into two entirely separate List objects:
response.getSegments().getSegmentInfo().get(i) // indexing into response.getSegments().getSegmentInfo()
.getCabinSummary().getCabinClass().get(i) // indexing into getCabinSummary().getCabinClass()
.getAmountOfResSeat();
This looks wrong to me. Is this supposed to happen this way? And is the list returned by getCabinClass() guaranteed to be at least as long as the one returned by getSegmentInfo()?
You're using i both as an index for the list of segment infos and for the list of cabin classes. This smells like the source of your problem.
I don't know your domain model but I'd expect that we need two different counters here.
Refactored code to show the problem (I guessed the types; replace them with the correct class names):
List<SegmentInfo> segmentInfos = response.getSegments().getSegmentInfo();
for (int i = 0; i < segmentInfos.size() - 1; i++) {
    // use i to get the actual segmentInfo
    SegmentInfo segmentInfo = segmentInfos.get(i);
    List<CabinClass> cabinClasses = segmentInfo.getCabinSummary().getCabinClass();
    // use i again to get the actual cabin class ???
    CabinClass cabinClass = cabinClasses.get(i);
    reservedSeats = cabinClass.getAmountOfResSeat();
    usedSeats = cabinClass.getAmountOfUsedSeat();
    System.out.println("Reserved Seats: " + reservedSeats);
    System.out.println("Used Seats : " + usedSeats);
}
Assuming that response.getSegments().getSegmentInfo() always returns a list of the same size, calling .get(i) on it should be safe, given the loop header (but are you aware that you are skipping the last element?). However, are you sure that .getCabinSummary().getCabinClass() will return a list that is at least as large as the getSegmentInfo() list? It looks suspicious that you are using i to perform lookups in two different lists.
You could split the first line in the loop body into two separate lines (I'm only guessing the type names here):
SegmentInfo segmentInfo = response.getSegments().getSegmentInfo().get(i);
reservedSeats = segmentInfo.getCabinSummary().getCabinClass().get(i).getAmountOfResSeat();
Then you'll see which lookup causes the crash.

Find and list duplicates in an unordered array consisting of 10,000,000,00 elements

How can duplicate elements in an unordered array of 10,000,000,00 elements be determined? How can they be listed?
Please ensure that performance is taken care of while writing the logic of the Java code.
What is the space complexity and time complexity of the logic?
Consider an example array, DuplicateArray[], as shown below.
String DuplicateArray[] = {"tom","wipro","hcl","Ibm","rachael","tom","wipro","hcl","Ibm","rachael",
"Bill","HP","hcl","Ibm","rachael","tom","wipro","hcl","Ibm","rachael","tom","wipro","hcl","Ibm","rachael",
"Bill","HP","hcl","Ibm","rachael","tom","wipro","hcl","Ibm","rachael","tom","wipro","hcl","Ibm","rachael",
"Agnus","wipro","hcl","Ibm","rachael","tom","wipro","hcl","Ibm","rachael","tom","wipro","hcl","Ibm","rachael",
"Obama","wipro","hcl","Ibm","rachael","tom","wipro","hcl","Ibm","rachael","rachael","tom","wipro","hcl","Ibm","rachael",
"Obama","HP","TCS","CTS","rachael","tom","wipro","hcl","Ibm","rachael","rachael","tom","wipro","hcl","Ibm","rachael"}
I suggest you use a Set. Best for you will be a HashSet. Put your elements into it one by one, and check for existence on every insert operation.
Something like this:
HashSet<String> hs = new HashSet<String>();
HashSet<String> answer = new HashSet<String>();
for (String s : DuplicateArray) {
    if (!hs.contains(s))
        hs.add(s);
    else
        answer.add(s);
}
The code depends on the assumption that the type of the elements of your array is String.
Here you go
class MyValues {
    public int i = 1;
    private String value = null;

    public MyValues(String v) {
        value = v;
    }

    @Override
    public int hashCode() {
        return value.hashCode();
    }

    @Override
    public boolean equals(Object obj) {
        return obj instanceof MyValues && ((MyValues) obj).value.equals(value);
    }
}
Now iterate to find the duplicates:
private Set<MyValues> values = new HashSet<MyValues>();

for (String s : DuplicateArray) {
    MyValues v = new MyValues(s);
    if (!values.add(v)) {
        // add() returns false when an equal element is already
        // present, i.e. s is a duplicate
        v.i++;
    }
}
Time and space are both linear.
How many duplicates are expected? A few, a number comparable to the number of entries, or something in between?
Do you know anything else about the values? E.g. are they from some specific dictionary?
If not, iterate over the array building a HashSet, noting when you are about to add an entry that's already there and keeping those in a list. I can't see anything else that is going to be faster.
Firstly, do you mean 10,000,000,00 as one billion or 10 billion? If you mean the latter, you cannot have more than 2 billion elements in an array or a Set. The suggestions you have so far will not work in this situation. To hold 10 billion Strings in memory you would need at least 640 GB, and AFAIK there is no server available that allows this much memory in a single JVM.
For a task this large, you may have to consider a solution which breaks up the work, either across multiple machines or by putting the work into files to be processed later.
You have to either:
Assume you have a relatively small number of unique Strings. In this case you can build an in-memory Set of the words you have seen so far; these will fit into memory (or you might assume they do).
Or break the input up into manageable sizes. A simple way to do this would be to write to a few hundred work files based on hashcode, as sketched below. The hashcode for equal strings is the same, so as you process each file in memory, you know it will contain all the duplicates of its strings, if there are any.
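A minimal sketch of that hash-partitioning step (the bucket count and file names are arbitrary choices for illustration; hugeInput stands for your streamed source of strings):

import java.io.BufferedWriter;
import java.io.FileWriter;
import java.io.IOException;
import java.io.PrintWriter;

void partition(Iterable<String> hugeInput) throws IOException {
    int numBuckets = 256;
    PrintWriter[] writers = new PrintWriter[numBuckets];
    for (int b = 0; b < numBuckets; b++) {
        writers[b] = new PrintWriter(new BufferedWriter(new FileWriter("bucket-" + b + ".txt")));
    }
    for (String s : hugeInput) {
        // mask off the sign bit so the bucket index is never negative
        int bucket = (s.hashCode() & 0x7fffffff) % numBuckets;
        writers[bucket].println(s);
    }
    for (PrintWriter w : writers) {
        w.close();
    }
    // Each bucket file now holds every occurrence of its strings, so each
    // can be de-duplicated independently with an in-memory HashSet.
}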
