Data Analysis and Processing of Variables and their Concreteness - java

I have a variable_text file with 100K variables (some unique, some not) and an excel file with (unique) variables in one column and their respective concreteness values in another column. I already wrote some code to read the variables from the text file and search their concreteness value from the excel file and spit out the results in another result_text file.
My problem is I need to use an appropriate data structure to store the variables and their concreteness and count the frequency of repeating variables from the variable_text file. I've looked at HashTables and HashMaps but dont know if I should chose from these or if there's another viable option.
This data structure must represent a sort of table or map:
Variable | Frequency | Concreteness

Here is some code that can help you with that, i recommend that you do use a hashmap like this
public static void main(String[] args)
{
Map <String, Data> map = new HashMap <String, Data>();
String [] variables={"variable1", "variable2", "variable3", "variable4", "variable4", "variable1","variable1"};
int Concreteness=5;//for this example every variable will have the same cncreteness
for(int i=0; i<variables.length;i++)
{
Data variable_exists=map.get(variables[i]);
if(variable_exists!=null)
variable_exists.setFrecuency(variable_exists.getFrecuency()+1);
else
map.put(variables[i], new Data(Concreteness,1));
}
for (Map.Entry<String, Data> entry : map.entrySet())
{ System.out.println("variable = " + entry.getKey() + ", Frecuency = " + entry.getValue().getFrecuency()+ ", Concreteness = " + entry.getValue().getConcreteness()); }
}
the output for this example would be
variable = variable4, Frecuency = 2, Concreteness = 5
variable = variable1, Frecuency = 3, Concreteness = 5
variable = variable2, Frecuency = 1, Concreteness = 5
variable = variable3, Frecuency = 1, Concreteness = 5
and here is the Data class i used
public class Data
{
private int frecuency;
private int Concreteness;
Data (int Concreteness, int frecuency)
{
setFrecuency(frecuency);
setConcreteness(Concreteness);
}
public int getFrecuency()
{
return frecuency;
}
public void setFrecuency(int frecuenxy)
{
this.frecuency = frecuenxy;
}
public int getConcreteness()
{
return Concreteness;
}
public void setConcreteness(int Concreteness)
{
this.Concreteness = Concreteness;
}
}

Related

Java Hash map / Array List Count distinct values

I am pretty new into programming and I have an assignment to make, but I got stuck.
I have to implement a program which will read a CSV file (1 million+ lines) and count how many clients ordered "x" distinct products on a specific day.
The CSV looks like this:
Product Name | Product ID | Client ID | Date
Name 544 86 10/12/2017
Name 545 86 10/12/2017
Name 644 87 10/12/2017
Name 644 87 10/12/2017
Name 9857 801 10/12/2017
Name 3022 801 10/12/2017
Name 3021 801 10/12/2017
The result from my code is:
801: 2 - incorrect
86: 2 - correct
87: 2 - incorrect
Desired output is:
Client 1 (801): 3 distinct products
Client 2 (86): 2 distinct products
Client 3 (87): 1 distinct product
Additionally,
If I want to know how many clients ordered 2 distinct products I would like a result to look like this:
Total: 1 client ordered 2 distinct products
If I want to know the maximum number of distinct products ordered in a day, I would like the result to look like this:
The maximum number of distinct products ordered is: 3
I tried to use a Hash Map and Multimap by Google Guava (my best guess here), but I couldn't wrap my head around it.
My code looks like this:
package Test;
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.Map;
import com.google.common.collect.ArrayListMultimap;
import com.google.common.collect.HashMultimap;
public class Test {
public static void main(String[] args) {
//HashMultimap<String, String> myMultimap = HashMultimap.create();
Map<String, MutableInteger> map = new HashMap<String, MutableInteger>();
ArrayList<String> linesList = new ArrayList<>();
// Input of file which needs to be parsed
String csvFile = "file.csv";
BufferedReader csvReader;
// Data split by 'TAB' in CSV file
String csvSplitBy = "\t";
try {
// Read the CSV file into an ArrayList array for easy processing.
String line;
csvReader = new BufferedReader(new FileReader(csvFile));
while ((line = csvReader.readLine()) !=null) {
linesList.add(line);
}
csvReader.close();
} catch (IOException e) {
e.printStackTrace();
}
// Process each CSV file line which is now contained within
// the linesList list Array
for (int i = 0; i < linesList.size(); i++) {
String[] data = linesList.get(i).split(csvSplitBy);
String col2 = data[1];
String col3 = data[2];
String col4 = data[3];
// Determine if Column 4 has the desired date
// and count the values
if (col4.contains("10/12/2017")) {
String key = col3;
if (map.containsKey(key)) {
MutableInteger count = map.get(key);
count.set(count.get() + 1);
} else {
map.put(key, new MutableInteger(1));
}
}
}
for (final String k : map.keySet()) {
if (map.get(k).get() == 2) {
System.out.println(k + ": " + map.get(k).get());
}
}
}
}
Any advise or suggestion on how this can be implemented would be greatly appreciated.
Thank you in advance guys.
You could store a Setof productIds per clientId, and just take the size of that.
As a Set does not allow duplicate values, this will effectively give you the distinct number of productIds.
Also, I recommend that you give your variables meaningful name instead of col2, k, map... This will make your code more readable.
Map<String, Set<String>> distinctProductsPerClient = new HashMap<String, Set<String>>();
// Process each CSV file line which is now contained within
// the linesList list Array
// Start from 1 to skip the first line
for (int i = 1; i < linesList.size(); i++) {
String line = linesList.get(i);
String[] data = line.split(csvSplitBy);
String productId = data[1];
String clientId = data[2];
String date = data[3];
// Determine if Column 4 has the desired date
// and count the values
if (date.contains("10/12/2017")) {
if (!distinctProductsPerClient.containsKey(clientId)) {
distinctProductsPerClient.put(clientId, new HashSet<>());
}
distinctProductsPerClient.get(clientId).add(productId);
}
}
for (final String clientId : distinctProductsPerClient.keySet()) {
System.out.println(clientId + ": " + distinctProductsPerClient.get(clientId).size());
}
More advanced solution using Stream API (requires Java 9)
If you introduce the class OrderData(that represents a single line in the CSV) like this:
private static class OrderData {
private final String productName;
private final String productId;
private final String clientId;
private final String date;
public OrderData(String csvLine) {
String[] data = csvLine.split("\t");
this.productName = data[0];
this.productId = data[1];
this.clientId = data[2];
this.date = data[3];
}
public String getProductName() {
return productName;
}
public String getProductId() {
return productId;
}
public String getClientId() {
return clientId;
}
public String getDate() {
return date;
}
}
you can replace the for loop with this:
Map<String, Set<String>> distinctProductsPerClient2 = linesList.stream()
.skip(1)
.map(OrderData::new)
.collect(groupingBy(OrderData::getClientId, mapping(OrderData::getProductId, toSet())));
But I reckon this might be a little bit to complex if you're new into programming (although it might be a good exercise if you would try to understand what the above code does).

How to create nested object and array in parquet file?

How do I create a parquet file with nested fields? I have the following:
public static void main(String[] args) throws IOException {
int fileNum = 10; //num of files constructed
int fileRecordNum = 50; //record num of each file
int rowKey = 0;
for (int i = 0; i < fileNum; ++i) {
Map<String, String> metas = new HashMap<>();
metas.put(HConstants.START_KEY, genRowKey("%10d", rowKey + 1));
metas.put(HConstants.END_KEY, genRowKey("%10d", rowKey + fileRecordNum));
ParquetWriter<Group> writer = initWriter("pfile/scanner_test_file" + i, metas);
for (int j = 0; j < fileRecordNum; ++j) {
rowKey++;
Group group = sfg.newGroup().append("rowkey", genRowKey("%10d", rowKey))
.append("cf:name", "wangxiaoyi" + rowKey)
.append("cf:age", String.format("%10d", rowKey))
.append("cf:job", "student")
.append("timestamp", System.currentTimeMillis());
writer.write(group);
}
writer.close();
}
}
I want to create two fields:
Hobbies which contains a list of hobbies ("Swimming", "Kickboxing")
A teacher object that contains subfields like:
{
'teachername': 'Rachel',
'teacherage':50
}
Can someone provide an example how to do this in Java?
Parquet is columned (mini-storages) key-value storage... I.e. this kind of storage cannot keep nested data, but this storage accepts converting logical types of data to binary format (byte array with header that contains data to understand what kind of convertation should be applied to this data).
I'm not sure about how should you implement your converter, but basically you should work with Binary class as data container and create some converter... sample converter you can find for String data type.

manipulate and sort text file

I am working on a project where I have been given a text file and I have to add up the points for each team and printout the top 5 teams.
The text file looks like this:
FRAMae Berenice MEITE 455.455<br>
CHNKexin ZHANG 454.584<br>
UKRNatalia POPOVA 453.443<br>
GERNathalie WEINZIERL 452.162<br>
RUSEvgeny PLYUSHCHENKO 191.399<br>
CANPatrick CHAN 189.718<br>
CHNHan YAN 185.527<br>
CHNCheng & Hao 271.018<br>
ITAStefania & Ondrej 270.317<br>
USAMarissa & Simon 264.256<br>
GERMaylin & Daniel 260.825<br>
FRAFlorent AMODIO 179.936<br>
GERPeter LIEBERS 179.615<br>
JPNYuzuru HANYU 197.9810<br>
USAJeremy ABBOTT 165.654<br>
UKRYakov GODOROZHA 160.513<br>
GBRMatthew PARR 157.402<br>
ITAPaul Bonifacio PARKINSON 153.941<br>
RUSTatiana & Maxim 283.7910<br>
CANMeagan & Eric 273.109<br>
FRAVanessa & Morgan 257.454<br>
JPNNarumi & Ryuichi 246.563<br>
JPNCathy & Chris 352.003<br>
UKRSiobhan & Dmitri 349.192<br>
CHNXintong &Xun 347.881<br>
RUSYulia LIPNITSKAYA 472.9010<br>
ITACarolina KOSTNER 470.849<br>
JPNMao ASADA 464.078<br>
UKRJulia & Yuri 246.342<br>
GBRStacey & David 244.701<br>
USAMeryl &Charlie 375.9810<br>
CANTessa & Scott 372.989<br>
RUSEkaterina & Dmitri 370.278<br>
FRANathalie & Fabian 369.157<br>
ITAAnna & Luca 364.926<br>
GERNelli & Alexander 358.045<br>
GBRPenny & Nicholas 352.934<br>
USAAshley WAGNER 463.107<br>
CANKaetlyn OSMOND 462.546<br>
GBRJenna MCCORKELL 450.091<br>
The first three letters represent the team.
the rest of the text is the the competitors name.
The last digit is the score the competitor recived.
Code so far:
import java.util.Arrays;
public class project2 {
public static void main(String[] args) {
// TODO Auto-generated method stub
String[] array = new String[41];
String[] info = new String[41];
String[] stats = new String[41];
String[] team = new String[41];
//.txt file location
FileInput fileIn = new FileInput();
fileIn.openFile("C:\\Users\\O\\Desktop\\turn in\\team.txt");
// txt file to array
int i = 0;
String line = fileIn.readLine();
array[i] = line;
i++;
while (line != null) {
line = fileIn.readLine();
array[i] = line;
i++;
}
//Splitting up Info/team/score into seprate arrays
for (int j = 0; j < 40; j++) {
team[j] = array[j].substring(0, 3).trim();
info[j] = array[j].substring(3, 30).trim();
stats[j] = array[j].substring(36).trim();
}
// Random stuff i have been trying
System.out.println(team[1]);
System.out.println(info[1]);
System.out.println(stats[1]);
MyObject ob = new MyObject();
ob.setText(info[0]);
ob.setNumber(7, 23);
ob.setNumber(3, 456);
System.out.println("Text is " + ob.getText() + " and number 3 is " + ob.getNumber(7));
}
}
I'm pretty much stuck at this point because I am not sure how to add each teams score together.
This looks like homework... First of all you need to examine how you are parsing the strings in the file.
You're saying: the first 3 characters are the country, which looks correct, but then you set the info to the 4th through the 30th characters, which isn't correct. You need to dynamically figure out where that ends and the score begins. There is a space between the "info" and the "stats," knowing that you could use String's indexOf function. (http://docs.oracle.com/javase/7/docs/api/java/lang/String.html#indexOf(int))
Have a look at Maps.
A map is a collection that allows you to get data associated with a key in a very short time.
You can create a Map where the key is a country name, with value being the total points.
example:
Map<String,Integer> totalScore = new HashMap<>();
if (totalScore.containsKey("COUNTRYNAME"))
totalScore.put("COUNTRYNAME", totalScore.get("COUNTRYNAME") + playerScore)
else
totalScore.put("COUNTRYNAME",0)
This will add to the country score if the score exists, otherwise it will create a new totalScore for a country initialized to 0.
Not tested, but should give you some ideas:
public static void main(String... args)
throws Exception {
class Structure implements Comparable<Structure> {
private String team;
private String name;
private Double score;
public Structure(String team, String name, Double score) {
this.team = team;
this.name = name;
this.score = score;
}
public String getTeam() {
return team;
}
public String getName() {
return name;
}
public Double getScore() {
return score;
}
#Override
public int compareTo(Structure o) {
return this.score.compareTo(o.score);
}
}
File file = new File("path to your file");
List<String> lines = Files.readAllLines(Paths.get(file.toURI()), StandardCharsets.UTF_8);
Pattern p = Pattern.compile("(\\d+(?:\\.\\d+))");
List<Structure> structures = new ArrayList<Structure>();
for (String line : lines) {
Matcher m = p.matcher(line);
while (m.find()) {
String number = m.group(1);
String text = line.substring(0, line.indexOf(number) - 1);
double d = Double.parseDouble(number);
String team = text.substring(0, 3);
String name = text.substring(3, text.length());
structures.add(new Structure(team, name, d));
}
}
Collections.sort(structures);
List<Structure> topFive = structures.subList(0, 5);
for (Structure structure : topFive) {
System.out.println("Team: " + structure.getTeam());
System.out.println("Name: " + structure.getName());
System.out.println("Score: " + structure.getScore());
}
}
Just remove <br> from your file.
Loading file into memory
Your string splitting logic looks fine.
Create a class like PlayerData. Create one instance of that class for each row and set all the three fields into that using setters.
Keep adding the PlayerData objects into an array list.
Accumulating
Loop through the arraylist and accumulate the team scores into a hashmap. Create a Map to accumulate the team scores by mapping teamCode to totalScore.
Always store row data in a custom object for each row. String[] for each column is not a good way of holding data in general.
Take a look in File Utils. After that you can extract the content from last space character using String Utils e removing the <br> using it as a key for a TreeMap. Than you can have your itens ordered.
List<String> lines = FileUtils.readLines(yourFile);
Map<String, String> ordered = new TreeMap<>();
for (String s : lines) {
String[] split = s.split(" ");
String name = split[0].trim();
String rate = splt[1].trim().substring(0, key.length - 4);
ordered.put(rate, name);
}
Collection<String> rates = ordered.values(); //names ordered by rate
Of course that you need to adjust the snippet.

Create Custom InputFormat of ColumnFamilyInputFormat for cassandra

I am working on a project, using cassandra 1.2, hadoop 1.2
I have created my normal cassandra mapper and reducer, but I want to create my own Input format class, which will read the records from cassandra, and I'll get the desired column's value, by splitting that value using splitting and indexing ,
so, I planned to create custom Format class. but I'm confused and not able to know, how would I make it? What classes are to be extend and implement, and how I will able to fetch the row key, column name, columns value etc.
I have my Mapperclass as follow:
public class MyMapper extends
Mapper<ByteBuffer, SortedMap<ByteBuffer, IColumn>, Text, Text> {
private Text word = new Text();
MyJDBC db = new MyJDBC();
public void map(ByteBuffer key, SortedMap<ByteBuffer, IColumn> columns,
Context context) throws IOException, InterruptedException {
long std_id = Long.parseLong(ByteBufferUtil.string(key));
long newSavePoint = 0;
if (columns.values().isEmpty()) {
System.out.println("EMPTY ITERATOR");
sb.append("column_N/A" + ":" + "N/A" + " , ");
} else {
for (IColumn cell : columns.values()) {
name = ByteBufferUtil.string(cell.name());
String value = null;
if (name.contains("int")) {
value = String.valueOf(ByteBufferUtil.toInt(cell.value()));
} else {
value = ByteBufferUtil.string(cell.value());
}
String[] data = value.toString().split(",");
// if (data[0].equalsIgnoreCase("login")) {
Long[] dif = getDateDiffe(d1, d2);
// logics i want to perform inside my custominput class , rather here, i just want a simple mapper class
if (condition1 && condition2) {
myhits++;
sb.append(":\t " + data[0] + " " + data[2] + " "+ data[1] /* + " " + data[3] */+ "\n");
newSavePoint = d2;
}
}
sb.append("~" + like + "~" + newSavePoint + "~");
word.set(sb.toString().replace("\t", ""));
}
db.setInterval(Long.parseLong(ByteBufferUtil.string(key)), newSavePoint);
db.setHits(Long.parseLong(ByteBufferUtil.string(key)), like + "");
context.write(new Text(ByteBufferUtil.string(key)), word);
}
I want to decrease my Mapper Class logics, and want to perform same calculations on my custom input class.
Please help, i wish for the positive r4esponse from stackies...
You can do the intended task by moving the Mapper logic to your custom input class (as you have indicated already)
I found this nice post which explains a similar problem statement as you have. I think it might solve your problem.

incompatible type of double array and properties string.split()

public static void main(String[] args)
{
String input="jack=susan,kathy,bryan;david=stephen,jack;murphy=bruce,simon,mary";
String[][] family = new String[50][50];
//assign family and children to data by ;
StringTokenizer p = new StringTokenizer (input,";");
int no_of_family = input.replaceAll("[^;]","").length();
no_of_family++;
System.out.println("family= "+no_of_family);
String[] data = new String[no_of_family];
int i=0;
while(p.hasMoreTokens())
{
data[i] = p.nextToken();
i++;
}
for (int j=0;j<no_of_family;j++)
{
family[j][0] = data[j].split("=")[0];
//assign child to data by commas
StringTokenizer v = new StringTokenizer (data[j],",");
int no_of_child = data[j].replaceAll("[^,]","").length();
no_of_child++;
System.out.println("data from input = "+data[j]);
for (int k=1;k<=no_of_child;k++)
{
family[j][k]= data[j].split("=")[1].split(",");
System.out.println(family[j][k]);
}
}
}
i have a list of family in input string and i seperate into a family and i wanna do it in double array family[i][j].
my goal is:
family[0][0]=1st father's name
family[0][1]=1st child name
family[0][2]=2nd child name and so on...
family[0][0]=jack
family[0][1]=susan
family[0][2]=kathy
family[0][3]=bryan
family[1][0]=david
family[1][1]=stephen
family[1][2]=jack
family[2][0]=murphy
family[2][1]=bruce
family[2][2]=simon
family[2][3]=mary
but i got the error as title: in compatible types
found:java.lang.String[]
required:java.lang.String
family[j][k]= data[j].split("=")[1].split(",");
what can i do?i need help
nyone know how to use StringTokenizer for this input?
Trying to understand why you can't just use split for your nested operation as well.
For example, something like this should work just fine
for (int j=0;j<no_of_family;j++)
{
String[] familySplit = data[j].split("=");
family[j][0] = familySplit[0];
String[] childrenSplit = familySplit[1].split(",");
for (int k=0;k<childrenSplit.length;k++)
{
family[j][k+1]= childrenSplit[k];
}
}
You are trying to assign an array of strings to a string. Maybe this will make it more clear?
String[] array = data.split("=")[1].split(",");
Now, if you want the first element of that array you can then do:
family[j][k] = array[0];
I always avoid to use arrays directly. They are hard to manipulate versus dynamic list. I implemented the solution using a Map of parent to a list of childrens Map<String, List<String>> (read Map<Parent, List<Children>>).
public static void main(String[] args) {
String input = "jack=susan,kathy,bryan;david=stephen,jack;murphy=bruce,simon,mary";
Map<String, List<String>> parents = new Hashtable<String, List<String>>();
for ( String family : input.split(";")) {
final String parent = family.split("=")[0];
final String allChildrens = family.split("=")[1];
List<String> childrens = new Vector<String>();
for (String children : allChildrens.split(",")) {
childrens.add(children);
}
parents.put(parent, childrens);
}
System.out.println(parents);
}
The output is this:
{jack=[susan, kathy, bryan], murphy=[bruce, simon, mary], david=[stephen, jack]}
With this method you can directory access to a parent using the map:
System.out.println(parents.get("jack"));
and this output:
[susan, kathy, bryan]

Categories

Resources