Searching an array list for most common String - java

I was wondering how I could search an ArrayList of Strings to find the most commonly occurring 'destination' in an 'Itinerary' object I've created (which contains a list of different destinations.)
So far I have:
public static String commonName(ArrayList<Itinerary> itinerary){
int count = 0;
int total = 0;
ArrayList<String> names = new ArrayList<String>();
Iterator<String>itr2 = names.iterator();
while(itr.hasNext()){
Itinerary temp = itr.next();
if(temp.iterator().hasNext()){ //if its has destinations
// Destination object in itinerary object
Destination temp2 = temp.iterator().next();
String name = temp2.getDestination().toLowerCase().replace(" ", "");
if(names.contains(name)){
count = count + 1;
//do something with counting the occurence of string name here
}
I'm having problems making an algorithm to search the array for the most commonly occurring string, or strings if there is a tie; and then displaying the number of the 'Itinerary object' (the parameter value) the string is found in. Any help would be great, thank you!!

I would make a HashMap<String,Integer>. Then I would go through each itinerary, and if the destination wasn't in the Map I would create an entry with put(destination, 1); otherwise I would increment the existing count with put(destination, get(destination)+1). Afterwards I'd go through the Map entries and look for the one with the highest count.
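A minimal sketch of that approach, assuming the normalized destination names have already been pulled out into a List<String>. The class and method names are made up for illustration. It returns every name that shares the highest count, so a tie yields more than one result:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class MostCommon {
    // Returns all names with the highest count (more than one on a tie).
    static List<String> mostCommon(List<String> names) {
        // First pass: count each name.
        Map<String, Integer> counts = new HashMap<>();
        for (String name : names) {
            Integer old = counts.get(name);
            counts.put(name, old == null ? 1 : old + 1);
        }
        // Second pass: keep the entries with the largest count seen so far.
        int best = 0;
        List<String> winners = new ArrayList<>();
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            if (e.getValue() > best) {
                best = e.getValue();
                winners.clear();
                winners.add(e.getKey());
            } else if (e.getValue() == best) {
                winners.add(e.getKey());
            }
        }
        return winners;
    }

    public static void main(String[] args) {
        System.out.println(mostCommon(Arrays.asList("paris", "rome", "paris", "oslo")));
    }
}
```

The order of the winners on a tie follows the HashMap's iteration order, so it should not be relied on.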

If you don't mind using an external jar, you could use HashBag from Apache Commons Collections to do this easily.
public static String commonName(ArrayList<Itinerary> itinerary){
    Bag names = new HashBag();
    Iterator<Itinerary> itr = itinerary.iterator();
    while(itr.hasNext()){ // while the list of Itinerary objects has a next element
        Itinerary temp = itr.next(); // temp = current Itinerary object
        if(temp.iterator().hasNext()){ // if it has destinations
            Destination temp2 = temp.iterator().next(); // first Destination object in the Itinerary
            String name = temp2.getDestination().toLowerCase().replace(" ", "");
            names.add(name, 1);
        }
    }
And then later you can call names.getCount("destination1") to get the number of occurrences of destination1.
See http://commons.apache.org/collections/userguide.html#Bags

Try the group feature of the lambdaj library. To solve your problem you could group the Itinerary objects on the destination property and then find the group with the biggest size, as in the following example:
Group<Itinerary> group = selectMax(group(itineraries,
    by(on(Itinerary.class).getDestination())).subgroups(), on(Group.class).getSize());

In statistics, this is called the "mode". A vanilla Java 8 solution looks like this:
itinerary
.stream()
.flatMap(i -> StreamSupport.stream(
Spliterators.spliteratorUnknownSize(i.iterator(), 0), false // false = sequential stream
))
.collect(Collectors.groupingBy(
s -> s.getDestination().toLowerCase().replace(" ", ""),
Collectors.counting()
))
.entrySet()
.stream()
.max(Comparator.comparing(Entry::getValue))
.ifPresent(System.out::println);
jOOλ is a library that supports mode() on streams. The following program:
System.out.println(
Seq.seq(itinerary)
.flatMap(i -> Seq.seq(i.iterator()))
.map(s -> s.getDestination().toLowerCase().replace(" ", ""))
.mode()
);
(disclaimer: I work for the company behind jOOλ)

Count and remove similar elements in a list while iterating through it

I used many references on the site to build up my program, but I'm kind of stuck right now. I think using an iterator will do the job. Sadly, even though I went through questions that used iterators, I couldn't work out how to use one properly in my code.
I want to:
1. remove the similar elements found in the list fname
2. count each element found in fname and add that count to counter.
Please help me do the above using an iterator or any other method. Following is my code:
List<String> fname = new ArrayList<>(Arrays.asList(fullname.split(""))); //Assigning the string to a list//
int count = 1;
ArrayList<Integer> counter = new ArrayList<>();
List<String> holder = new ArrayList<>();
for(int element=0; element<=fname.size; element++)
{
for(int run=(element+1); run<=fname.size; run++)
{
if((fname.get(element)).equals(fname.get(run)))
{
count++;
holder.add(fname.get(run));
}
counter.add(count);
}
holder.add(fname.get(element));
fname.removeAll(holder);
}
System.out.println(fname);
System.out.println(counter);
Thanks.
From your question, you basically want to:
1. Eliminate duplicates from the given String list
You can simply convert your List to a HashSet (which doesn't allow duplicates) and then convert it back to a List (if you want the end result to be a List so you can do something else with it).
2. Count all occurrences of unique words in your list
The quickest way to code this is with Java 8 Streams (code borrowed from here: How to count the number of occurrences of an element in a List)
Complete code
public static void main(String[] args) {
String fullname = "a b c d a b c"; //something
List<String> fname = new ArrayList<>(Arrays.asList(fullname.split(" ")));
// Convert input to Set, and then back to List (your program output)
Set<String> uniqueNames = new HashSet<>(fname);
List<String> uniqueNamesInList = new ArrayList<>(uniqueNames);
System.out.println(uniqueNamesInList);
// Collects (reduces) your list
Map<String, Long> counts = fname.stream().collect(Collectors.groupingBy(Function.identity(), Collectors.counting()));
System.out.println(counts);
}
I do not think you need iterators here. There are many other possible solutions, such as recursion, but I have simply modified your code as follows:
final List<String> fname = new ArrayList<String>(Arrays.asList(fullname.split("")));
// defining a list that will hold the unique elements.
final List<String> resultList = new ArrayList<>();
// defining a list that will hold the number of occurrences of every item in the fname list; the order here matches the order in resultList
final ArrayList<Integer> counter = new ArrayList<>();
for (int element = 0; element < fname.size(); element++) {
int count = 1;
for (int run = (element + 1); run < fname.size(); run++) {
if ((fname.get(element)).equals(fname.get(run))) {
count++;
// we remove the element that has been already counted and return the index one step back to start counting over.
fname.remove(run--);
}
}
// we add the element to the resulted list and counter of that element
counter.add(count);
resultList.add(fname.get(element));
}
// here we print out both lists.
System.out.println(resultList);
System.out.println(counter);
Assuming String fullname = "StringOfSomeStaff"; the output will be as follows:
[S, t, r, i, n, g, O, f, o, m, e, a]
[3, 2, 1, 1, 1, 1, 1, 3, 1, 1, 1, 1]
You can try something like this:
Set<String> mySet = new HashSet<>();
mySet.addAll( fname ); // Now you have unique values
for(String s : mySet) {
count = 0;
for(String x : fname) {
if( s.equals(x) ) { count++; }
}
counter.add( count );
}
This way we don't have a specific order. But I hope it helps.
In Java 8, there's a one-liner:
List<Integer> result = fname
.stream()
.collect(Collectors.groupingBy(s -> s))
.entrySet()
.stream()
.map(e -> e.getValue().size())
.collect(Collectors.toList());
I am using a LinkedHashMap to preserve the order of elements. The for-each loop I am using also uses an Iterator implicitly. The code example uses the Map.merge method, which is available since Java 8.
List<String> fname = new ArrayList<>(Arrays.asList(fullname.split("")));
/*
Create a Map which will contain key=value pairs
(in this case the key is a name and the value is the counter).
Here we are using LinkedHashMap (instead of the common HashMap)
to preserve the order in which each name first occurs in the list.
*/
Map<String, Integer> countByName = new LinkedHashMap<>();
for (String name : fname) {
/*
The 'merge' method puts the key into the map (first parameter 'name').
The second parameter is the value which we want to associate with the key.
The last (3rd) parameter is a function which will merge the two values
(new and old) if the map already contains this key.
*/
countByName.merge(name, 1, Integer::sum);
}
System.out.println(fname); // original list [a, d, e, a, a, f, t, d]
System.out.println(countByName.values()); // counts [3, 2, 1, 1, 1]
System.out.println(countByName.keySet()); // unique names [a, d, e, f, t]
The same can be done using the Stream API, but it will probably be hard to understand if you are not familiar with Streams.
Map<String, Long> countByName = fname.stream()
.collect(Collectors.groupingBy(Function.identity(), LinkedHashMap::new, Collectors.counting()));

Search array list for substring

Let's say I have an array list with names, and the names are stored like this...
John(2),
Bob(anytext),
Rick
I'm trying to iterate over my array list and check for "(" basically and just take the rest of the string behind it and return that as a string, and null if nothing there. I've seen methods to do similar things but I can't seem to find something to just return the rest of the string if it finds the "("
for(int i=0; i<list.size(); i++) {
String s = list.get(i);
int x = s.indexOf('(');
if(x==-1) break;
return s.substring(x+1);
}
Pass the strings you want to check to a method that does something like this:
if(str.contains("(")){
return str.substring(str.indexOf("("));
}else{
return null;
}
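A minimal sketch of how that check could be applied to every element of the list. Note one deliberate difference: it returns the text after the "(" (index + 1, as in the question's own loop) rather than including the parenthesis. The class and method names are hypothetical:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class ParenSuffix {
    // Text after the first '(' in s, or null when s has no '('.
    static String afterParen(String s) {
        int x = s.indexOf('(');
        return x == -1 ? null : s.substring(x + 1);
    }

    // Applies afterParen to every element, keeping null for names without '('.
    static List<String> afterParenAll(List<String> names) {
        List<String> out = new ArrayList<>();
        for (String s : names) {
            out.add(afterParen(s));
        }
        return out;
    }

    public static void main(String[] args) {
        // prints [2), anytext), null]
        System.out.println(afterParenAll(Arrays.asList("John(2)", "Bob(anytext)", "Rick")));
    }
}
```

The loop in the question stops at the first element either way (break on a miss, return on a hit), which is why the per-element work is moved into its own method here.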
Java 8 version
List<String> list = Arrays.asList("John(2)", "Bob(anytext)", "Rick");
String result = list.stream()
.filter(x -> x.contains("("))
.findFirst()
.map(x -> x.substring(x.indexOf("(")))
.orElse(null);

Replace strings populated in an ArrayList<String> with other values

I am currently working on a project where I need to check an arraylist for a certain string and if that condition is met, replace it with the new string.
I will only show the relevant code but basically what happened before is a long string is read in, split into groups of three, then those strings populate an array. I need to find and replace those values in the array, and then print them out. Here is the method that populates the arraylist:
private static ArrayList<String> splitText(String text)
{
ArrayList<String> DNAsplit = new ArrayList<String>();
for (int i = 0; i < text.length(); i += 3)
{
DNAsplit.add(text.substring(i, Math.min(i + 3, text.length())));
}
return DNAsplit;
}
How would I search this arraylist for multiple strings (Here's an example aminoAcids = aminoAcids.replaceAll ("TAT", "Y");) and then print the new values out.
Any help is greatly appreciated.
In Java 8
list.replaceAll(s-> s.replace("TAT", "Y"));
Before Java 8 there is no such "replace all" method on a list (Java 8 added List.replaceAll, which takes a function rather than two strings). You need to apply the replacement element-wise; the only difference vs doing this on a single string is that you need to get the value out of the list, and set the new value back into the list:
ListIterator<String> it = DNAsplit.listIterator();
while (it.hasNext()) {
// Get from the list.
String current = it.next();
// Apply the transformation.
String newValue = current.replace("TAT", "Y");
// Set back into the list.
it.set(newValue);
}
And if you want to print the new values out:
System.out.println(DNAsplit);
Why don't you create a HashMap of key-value pairs and use it at load time to populate this list, instead of revising the list later?
Map<String,String> dnaMap = new HashMap<String,String>();
dnaMap.put("XXX","X"); // key: three-letter group, value: its replacement
.
.
.
dnaMap.put("ZZZ","Z");
And use it like below:
// Use the hash map to look up the temp key
temp = text.substring(i, Math.min(i + 3, text.length()));
DNAsplit.add(dnaMap.get(temp));
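A minimal sketch of that lookup-at-load-time idea, assuming groups without a mapping should be kept unchanged (a plain get would put null into the list for them). Only the "TAT" to "Y" mapping comes from the question; the class and method names are made up:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class CodonSplit {
    // Splits text into groups of three and translates each group via the map;
    // groups with no mapping are kept unchanged.
    static List<String> splitAndTranslate(String text, Map<String, String> codons) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i < text.length(); i += 3) {
            String triplet = text.substring(i, Math.min(i + 3, text.length()));
            out.add(codons.getOrDefault(triplet, triplet));
        }
        return out;
    }

    public static void main(String[] args) {
        Map<String, String> codons = new HashMap<>();
        codons.put("TAT", "Y"); // mapping taken from the question
        // prints [Y, GGC, Y, AC]
        System.out.println(splitAndTranslate("TATGGCTATAC", codons));
    }
}
```

Unlike String.replace on the whole text, this only matches whole three-letter groups, so a "TAT" that straddles two groups is not replaced.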

Split a list into multiple sublist based on element properties in Java

Is there a way to split a list into multiple lists, i.e. to split a given list into two or more lists based on a particular condition of its elements?
final List<AnswerRow> answerRows= getAnswerRows(.........);
final AnswerCollection answerCollections = new AnswerCollection();
answerCollections.addAll(answerRows);
The AnswerRow has properties like rowId and collectionId.
Based on collectionId I want to create one or more AnswerCollections.
If you just want to group elements by collectionId you could try something like
List<AnswerCollection> collections = answerRows.stream()
.collect(Collectors.groupingBy(x -> x.collectionId))
.entrySet().stream()
.map(e -> { AnswerCollection c = new AnswerCollection(); c.addAll(e.getValue()); return c; })
.collect(Collectors.toList());
Above code will produce one AnswerCollection per collectionId.
With Java 6 and Apache Commons Collections, the following code produces the same results as the above code using Java 8 streams:
ListValuedMap<Long, AnswerRow> groups = new ArrayListValuedHashMap<Long, AnswerRow>();
for (AnswerRow row : answerRows)
groups.put(row.collectionId, row);
List<AnswerCollection> collections = new ArrayList<AnswerCollection>(groups.size());
for (Long collectionId : groups.keySet()) {
AnswerCollection c = new AnswerCollection();
c.addAll(groups.get(collectionId));
collections.add(c);
}
Is there a way to split a list to multiple list?
Yes, You can do it like this:
answerRows.subList(startIndex, endIndex);
Given list into two or more list based on a particular condition of it
elements.
You'll have to calculate the start and end indices based on your specific condition and then you can mint the subList out of your ArrayList using the above function.
For Example, if you want to pass batches of 1000 answerRows to a specific function then you can do something like this:
final int batchSize = 1000;
for (int start = 0; start < answerRows.size(); start += batchSize) {
    // subList's end index is exclusive, so the last batch ends at size()
    int end = Math.min(start + batchSize, answerRows.size());
    /* Prepare SubList & Call Function */
    someFunction(answerRows.subList(start, end));
}
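When the condition is "same key as the previous element", the start and end indices fall out of a single pass over a list that is already sorted by that key. A minimal sketch of that index-based idea, using plain strings keyed by their first character in place of AnswerRow and collectionId (the class and method names are hypothetical):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Function;

class RunSplitter {
    // Splits a list into sublists of consecutive elements that share a key.
    // The list must already be sorted (or at least grouped) by that key.
    static <T, K> List<List<T>> splitRuns(List<T> items, Function<T, K> key) {
        List<List<T>> runs = new ArrayList<>();
        int start = 0;
        for (int i = 1; i <= items.size(); i++) {
            // A run ends at the end of the list or where the key changes.
            boolean boundary = i == items.size()
                    || !key.apply(items.get(i)).equals(key.apply(items.get(start)));
            if (boundary) {
                runs.add(items.subList(start, i)); // view on [start, i)
                start = i;
            }
        }
        return runs;
    }

    public static void main(String[] args) {
        List<String> rows = Arrays.asList("a1", "a2", "b1", "c1", "c2");
        // prints [[a1, a2], [b1], [c1, c2]]
        System.out.println(splitRuns(rows, s -> s.charAt(0)));
    }
}
```

The sublists are views on the original list, so they cost no copying but become invalid if the original list is structurally modified afterwards.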

PairWise matching millions of records

I have an algorithmic problem at hand. To easily explain the problem, I will be using a simple analogy.
I have an input file
Country,Exports
Australia,Sheep
US,Apple
Australia,Beef
End Goal:
I have to find the common products between the pairs of countries, so
{"Australia","New Zealand"}:{"apple","sheep"}
{"Australia","US"}:{"apple"}
{"New Zealand","US"}:{"apple","milk"}
Process :
I read in the input and store it in a TreeMap<String, List<String>>, where the strings in each List are interned because there are many duplicates.
Essentially, I am aggregating by country,
where the key is the country and the values are its exports.
{"australia":{"apple","sheep","koalas"}}
{"new zealand":{"apple","sheep","milk"}}
{"US":{"apple","beef","milk"}}
I have about 1200 keys (countries) and the total number of values (exports) is 80 million altogether.
I sort all the values of each key:
{"australia":{"apple","sheep","koalas"}} --> {"australia":{"apple","koalas","sheep"}}
This is fast as there are only 1200 Lists to sort.
for(k1:keys)
for(k2:keys)
if(k1.compareTo(k2) <0){ //Dont want to double compare
List<String> intersectList = intersectList_func(k1's exports,k2's exports);
countriespair.put({k1,k2},intersectList)
}
This code block takes very long. I realise it is O(n^2), around 1200*1200 comparisons, and it has been running for almost 3 hours so far.
Is there any way I can speed it up or optimise it?
Is a better algorithm the best option, or are there other technologies to consider?
Edit:
Since both Lists are sorted beforehand, intersectList is O(n), where n is at most listOne.size() + listTwo.size(), and NOT O(n^2) as discussed below
private static List<String> intersectList(List<String> listOne, List<String> listTwo) {
    int i = 0, j = 0;
    List<String> listResult = new LinkedList<String>();
    while (i != listOne.size() && j != listTwo.size()) {
        int compareVal = listOne.get(i).compareTo(listTwo.get(j));
        if (compareVal == 0) {
            // same string in both lists: keep it and advance both
            listResult.add(listOne.get(i));
            i++; j++;
        }
        else if (compareVal < 0) i++; // listOne is behind
        else j++; // listTwo is behind
    }
    return listResult;
}
Update 22 Nov
My current implementation is still running for almost 18 hours. :|
Update 25 Nov
I had run the new implementation as suggested by Vikram and a few others. It's been running this Friday.
My question is: how does grouping by exports rather than by country reduce the computational complexity? I find that the complexity is the same. As Groo mentioned, the complexity of the second part is O(E*C^2), where E is the number of exports and C is the number of countries.
This can be done in one statement as a self-join using SQL:
Test data. First create a test data set:
Lines <- "Country,Exports
Australia,Sheep
Australia,Apple
New Zealand,Apple
New Zealand,Sheep
New Zealand,Milk
US,Apple
US,Milk
"
DF <- read.csv(text = Lines, as.is = TRUE)
sqldf. Now that we have DF, issue this command:
library(sqldf)
sqldf("select a.Country, b.Country, group_concat(Exports) Exports
from DF a, DF b using (Exports)
where a.Country < b.Country
group by a.Country, b.Country
")
giving this output:
      Country     Country     Exports
1   Australia New Zealand Sheep,Apple
2   Australia          US       Apple
3 New Zealand          US  Apple,Milk
With index. If it's too slow, add an index to the Country column (and be sure not to forget the main. prefixes):
sqldf(c("create index idx on DF(Country)",
"select a.Country, b.Country, group_concat(Exports) Exports
from main.DF a, main.DF b using (Exports)
where a.Country < b.Country
group by a.Country, b.Country
"))
If you run out of memory, add the dbname = tempfile() sqldf argument so that it uses disk.
Store something like the following data structure (the following is pseudocode):
ValuesSet = {
  apple = {"Australia","New Zealand"..}
  sheep = {"Australia","New Zealand"..}
}
for k in ValuesSet
  for k1 in k.values()
    for k2 in k.values()
      if(k1<k2)
        Set(k1,k2).add(k)
Time complexity: O(number of distinct pairs with similar products)
Note: I might be wrong, but I do not think you can reduce this time complexity.
Following is a Java implementation for your problem:
public class PairMatching {
HashMap Country;
ArrayList CountNames;
HashMap ProdtoIndex;
ArrayList ProdtoCount;
ArrayList ProdNames;
ArrayList[][] Pairs;
int products=0;
int countries=0;
public void readfile(String filename) {
try {
BufferedReader br = new BufferedReader(new FileReader(new File(filename)));
String line;
CountNames = new ArrayList();
Country = new HashMap<String,Integer>();
ProdtoIndex = new HashMap<String,Integer>();
ProdtoCount = new ArrayList<ArrayList>();
ProdNames = new ArrayList();
products = countries = 0;
while((line=br.readLine())!=null) {
String[] s = line.split(",");
s[0] = s[0].trim();
s[1] = s[1].trim();
int k;
if(!Country.containsKey(s[0])) {
CountNames.add(s[0]);
Country.put(s[0],countries);
k = countries;
countries++;
}
else {
k =(Integer) Country.get(s[0]);
}
if(!ProdtoIndex.containsKey(s[1])) {
ProdNames.add(s[1]);
ArrayList n = new ArrayList();
ProdtoIndex.put(s[1],products);
n.add(k);
ProdtoCount.add(n);
products++;
}
else {
int ind =(Integer)ProdtoIndex.get(s[1]);
ArrayList c =(ArrayList) ProdtoCount.get(ind);
c.add(k);
}
}
System.out.println(CountNames);
System.out.println(ProdtoCount);
System.out.println(ProdNames);
} catch (FileNotFoundException ex) {
Logger.getLogger(PairMatching.class.getName()).log(Level.SEVERE, null, ex);
} catch (IOException ex) {
Logger.getLogger(PairMatching.class.getName()).log(Level.SEVERE, null, ex);
}
}
void FindPairs() {
Pairs = new ArrayList[countries][countries];
for(int i=0;i<ProdNames.size();i++) {
ArrayList curr = (ArrayList)ProdtoCount.get(i);
for(int j=0;j<curr.size();j++) {
for(int k=j+1;k<curr.size();k++) {
int u =(Integer)curr.get(j);
int v = (Integer)curr.get(k);
//System.out.println(u+","+v);
if(Pairs[u][v]==null) {
if(Pairs[v][u]!=null)
Pairs[v][u].add(i);
else {
Pairs[u][v] = new ArrayList();
Pairs[u][v].add(i);
}
}
else Pairs[u][v].add(i);
}
}
}
for(int i=0;i<countries;i++) {
for(int j=0;j<countries;j++) {
if(Pairs[i][j]==null)
continue;
ArrayList a = Pairs[i][j];
System.out.print("\n{"+CountNames.get(i)+","+CountNames.get(j)+"} : ");
for(int k=0;k<a.size();k++) {
System.out.print(ProdNames.get((Integer)a.get(k))+" ");
}
}
}
}
public static void main(String[] args) {
PairMatching pm = new PairMatching();
pm.readfile("Input data/BigData.txt");
pm.FindPairs();
}
}
[Update] The algorithm presented here shouldn't improve time complexity compared to the OP's original algorithm. Both algorithms have the same asymptotic complexity, and iterating through sorted lists (as OP does) should generally perform better than using a hash table.
You need to group the items by product, not by country, in order to be able to quickly fetch all countries belonging to a certain product.
This would be the pseudocode:
inputList contains a list of pairs {country, product}
// group by product
prepare mapA (product) => (list_of_countries)
for each {country, product} in inputList
{
if mapA does not contain (product)
create a new empty (list_of_countries)
and add it to mapA with (product) as key
add this (country) to the (list_of_countries)
}
// now group by country_pair
prepare mapB (country_pair) => (list_of_products)
for each {product, list_of_countries} in mapA
{
for each pair {countryA, countryB} in list_of_countries
{
if mapB does not contain country_pair {countryA, countryB}
create a new empty (list_of_products)
and add it to mapB with country_pair {countryA, countryB} as key
add this (product) to the (list_of_products)
}
}
If your input list is length N, and you have C distinct countries and P distinct products, then the running time of this algorithm should be O(N) for the first part and O(P*C^2) for the second part. Since your final list needs to have pairs of countries mapping to lists of products, I don't think you will be able to lose the P*C^2 complexity in any case.
I don't code in Java too much, so I added a C# example which I believe you'll be able to port pretty easily:
// mapA maps each product to a list of countries
var mapA = new Dictionary<string, List<string>>();
foreach (var t in inputList)
{
List<string> countries = null;
if (!mapA.TryGetValue(t.Product, out countries))
{
countries = new List<string>();
mapA[t.Product] = countries;
}
countries.Add(t.Country);
}
// note (this is very important):
// CountryPair tuple must have value-type comparison semantics,
// i.e. you need to ensure that two CountryPairs are compared
// by value to allow hashing (mapping) to work correctly, in O(1).
// In C# you can also simply use a Tuple<string,string> to
// represent a pair of countries (which implements this correctly),
// but I used a custom class to emphasize the algorithm
// mapB maps each CountryPair to a list of products
var mapB = new Dictionary<CountryPair, List<string>>();
foreach (var kvp in mapA)
{
var product = kvp.Key;
var countries = kvp.Value;
for (int i = 0; i < countries.Count; i++)
{
for (int j = i + 1; j < countries.Count; j++)
{
var pair = CountryPair.Create(countries[i], countries[j]);
List<string> productsForCountryPair = null;
if (!mapB.TryGetValue(pair, out productsForCountryPair))
{
productsForCountryPair = new List<string>();
mapB[pair] = productsForCountryPair;
}
productsForCountryPair.Add(product);
}
}
}
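For readers who want the same two steps in Java, here is a rough port. It is a sketch under a few assumptions that are not in the answer: rows arrive as {country, product} string arrays, each (country, product) row appears only once, and a "countryA|countryB" string (smaller name first) stands in for the CountryPair class, which works because strings compare by value. The class and method names are invented:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class PairGrouping {
    // rows: each element is {country, product}.
    // Returns a map from "countryA|countryB" (countryA < countryB) to shared products.
    static Map<String, List<String>> commonProducts(List<String[]> rows) {
        // Step 1: group by product (mapA in the pseudocode).
        Map<String, List<String>> byProduct = new HashMap<>();
        for (String[] row : rows) {
            byProduct.computeIfAbsent(row[1], k -> new ArrayList<>()).add(row[0]);
        }
        // Step 2: add each product to every pair of its countries (mapB).
        Map<String, List<String>> byPair = new TreeMap<>();
        for (Map.Entry<String, List<String>> e : byProduct.entrySet()) {
            List<String> countries = e.getValue();
            for (int i = 0; i < countries.size(); i++) {
                for (int j = i + 1; j < countries.size(); j++) {
                    String a = countries.get(i);
                    String b = countries.get(j);
                    // Order the names so {A,B} and {B,A} share one key.
                    String pair = a.compareTo(b) < 0 ? a + "|" + b : b + "|" + a;
                    byPair.computeIfAbsent(pair, k -> new ArrayList<>()).add(e.getKey());
                }
            }
        }
        return byPair;
    }

    public static void main(String[] args) {
        List<String[]> rows = Arrays.asList(
                new String[] {"Australia", "Sheep"}, new String[] {"Australia", "Apple"},
                new String[] {"New Zealand", "Apple"}, new String[] {"New Zealand", "Sheep"},
                new String[] {"US", "Apple"});
        System.out.println(commonProducts(rows));
    }
}
```

The order of products inside each list follows the HashMap's iteration order; sort the lists afterwards if a stable order matters.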
This is a great use case for MapReduce.
In the map phase you just collect all the exports that belong to each country.
Then the reducer sorts the products (which all belong to the same country, because of the mapper).
You will benefit from a parallel algorithm that can be distributed across a cluster.
You are actually taking O(n^2 * time required for 1 intersect).
Let's see if we can improve the time for one intersect. We can maintain a map for every country which stores its products, so you have n hash maps for n countries. You just need to iterate through all products once to initialize them. If you want quick lookup, maintain a map of maps:
HashMap<String,HashMap<String,Boolean>> countryMap = new HashMap<String, HashMap<String,Boolean>>();
Now if you want to find the common products for countries str1 and str2 do:
HashMap<String,Boolean> map1 = countryMap.get(str1);
HashMap<String,Boolean> map2 = countryMap.get(str2);
ArrayList<String> common = new ArrayList<String>();
Iterator<Map.Entry<String,Boolean>> it = map1.entrySet().iterator();
while (it.hasNext()) {
    Map.Entry<String,Boolean> pairs = it.next();
    // Add to common if it is also present in the other map
    if (map2.containsKey(pairs.getKey()))
        common.add(pairs.getKey());
}
So in total it will be O(n^2 * k) if there are k entries in one map, assuming hash map lookup is O(1) (which is the expected cost for Java's HashMap).
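If the Boolean values are never read, a plain Set<String> per country does the same job, and the JDK's retainAll performs the intersection. A minimal sketch under that assumption (the class and method names are made up):

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

class SetIntersect {
    // Products present in both sets; neither input is modified.
    static Set<String> common(Set<String> a, Set<String> b) {
        // Copy the smaller set, then keep only what the other one also contains.
        Set<String> small = a.size() <= b.size() ? a : b;
        Set<String> large = small == a ? b : a;
        Set<String> result = new HashSet<>(small);
        result.retainAll(large);
        return result;
    }

    public static void main(String[] args) {
        Set<String> australia = new HashSet<>(Arrays.asList("apple", "sheep", "koalas"));
        Set<String> newZealand = new HashSet<>(Arrays.asList("apple", "sheep", "milk"));
        System.out.println(common(australia, newZealand));
    }
}
```

Copying the smaller set keeps the work proportional to the smaller of the two sizes when the larger one is a HashSet.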
Using hashmaps where necessary to speed things up:
1) Go through the data and create a map with keys Items and values a list of countries associated with that item. So e.g. Sheep:Australia, US, UK, New Zealand....
2) Create a hashmap with keys each pair of countries and (initially) an empty list as values.
3) For each Item retrieve the list of countries associated with it and for each pair of countries within that list, add that item to the list created for that pair in step (2).
4) Now output the updated list for each pair of countries.
The largest costs are in steps (3) and (4) and both of these costs are linear in the amount of output produced, so I think this is not too far from optimal.
