I'm trying to understand this code for canopy clustering. The purpose of these two classes (one map, one reduce) is to find canopy centers. My problem is that I don't understand the difference between the map and reduce functions. They are nearly the same.
So is there a difference? Or am I just repeating the same process again in the reducer?
I think the answer is that there is a difference in how the map and reduce functions handle the data: they perform different actions even though the code is similar.
So can someone please explain the process of the map and reduce when we try to find the canopy centers?
I know, for example, that a map might look like this: (joe, 1) (dave, 1) (joe, 1) (joe, 1)
and then the reduce will go like this: (joe, 3) (dave, 1)
Does the same type of thing happen here?
Or maybe I'm performing the same task twice?
Thanks so much.
map function:
package nasdaq.hadoop;
import java.io.*;
import java.util.*;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapreduce.*;
import org.apache.hadoop.util.*;
public class CanopyCentersMapper extends Mapper<LongWritable, Text, Text, Text> {
//A list with the centers of the canopy
private ArrayList<ArrayList<String>> canopyCenters;
@Override
public void setup(Context context) {
this.canopyCenters = new ArrayList<ArrayList<String>>();
}
@Override
public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
//Separate the stock name from the values to create a key of the stock and a list of values - what is the list of values?
//What exactly are we splitting here?
ArrayList<String> stockData = new ArrayList<String>(Arrays.asList(value.toString().split(",")));
//Remove the stock name and use the remaining values as a candidate canopy center
String stockKey = stockData.remove(0);
//?
String stockValue = StringUtils.join(",", stockData);
//Check whether the stock is available for use as a new canopy center
boolean isClose = false;
for (ArrayList<String> center : canopyCenters) { //Run over the centers
//I think... let's say at this point we have a few centers. Then we have our next point to check.
//We have to compare that point with EVERY center already created. If the distance to EVERY center is larger than T1,
//then that point becomes a new center! But the more canopies we have, the better the chance it is within
//the radius of one of the canopies...
//Measure the distance between this center and the point currently being checked
if (ClusterJob.measureDistance(center, stockData) <= ClusterJob.T1) {
//Center is too close
isClose = true;
break;
}
}
//No existing center is within T1 of this point, so it becomes a new canopy center
if (!isClose) {
//Center is not too close, add the current data to the center
canopyCenters.add(stockData);
//Prepare hadoop data for output
Text outputKey = new Text();
Text outputValue = new Text();
outputKey.set(stockKey);
outputValue.set(stockValue);
//Output the stock key and values to reducer
context.write(outputKey, outputValue);
}
}
}
Reduce function:
package nasdaq.hadoop;
import java.io.*;
import java.util.*;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapreduce.*;
public class CanopyCentersReducer extends Reducer<Text, Text, Text, Text> {
//The canopy centers list
private ArrayList<ArrayList<String>> canopyCenters;
@Override
public void setup(Context context) {
//Create a new list for the canopy centers
this.canopyCenters = new ArrayList<ArrayList<String>>();
}
public void reduce(Text key, Iterable<Text> values, Context context) throws IOException, InterruptedException {
for (Text value : values) {
//Format the value and key to fit the format
String stockValue = value.toString();
ArrayList<String> stockData = new ArrayList<String>(Arrays.asList(stockValue.split(",")));
String stockKey = key.toString();
//Check whether the stock is available for use as a new canopy center
boolean isClose = false;
for (ArrayList<String> center : canopyCenters) { //Run over the centers
//Measure the distance between this center and the point currently being checked
if (ClusterJob.measureDistance(center, stockData) <= ClusterJob.T1) {
//Center is too close
isClose = true;
break;
}
}
//No existing center is within T1 of this point, so it becomes a new canopy center
if (!isClose) {
//Center is not too close, add the current data to the center
canopyCenters.add(stockData);
//Prepare hadoop data for output
Text outputKey = new Text();
Text outputValue = new Text();
outputKey.set(stockKey);
outputValue.set(stockValue);
//Write the stock key and values to the final output
context.write(outputKey, outputValue);
}
}
}
}
Edit -- more code and explanation:
StockKey is the key value representing stocks (NASDAQ and things like that).
ClusterJob.measureDistance():
public static double measureDistance(ArrayList<String> origin, ArrayList<String> destination)
{
double deltaSum = 0.0;
//Run over all points in the origin vector and calculate the sum of the squared deltas
for (int i = 0; i < origin.size(); i++) {
if (destination.size() > i) //Only add to sum if there is a destination to compare to
{
deltaSum = deltaSum + Math.pow(Math.abs(Double.valueOf(origin.get(i)) - Double.valueOf(destination.get(i))),2);
}
}
//Return the square root of the sum
return Math.sqrt(deltaSum);
}
Ok, the straightforward interpretation of the code is:
- Each mapper walks over some (presumably random) subset of the data and generates canopy centers, all of which are at least T1 distance from each other. These centers are emitted.
- The reducer then walks over all the canopy centers that belong to each specific stock key (like MSFT, GOOG, etc) from all the mappers and then ensures that there are no canopy centers that are within T1 of each other for each stock key value (e.g., no two centers in GOOG are within T1 of each other, although a center in MSFT and a center in GOOG may be close together).
The goal of the code is unclear; personally, I think there has to be a bug. The reducers basically solve the problem as if you were trying to generate centers for each stock key independently (i.e., calculating canopy centers over all data points for GOOG), while the mappers seem to solve the problem of generating centers for all stocks at once. Put together like that, you get a contradiction, so neither problem is actually being solved.
If you want centers for all stock keys:
- Then the map output must send everything to ONE reducer: set the map output key to something trivial, like a NullWritable. The reducer will then perform the correct operations without change.
If you want centers for EACH stock key:
- Then the mapper needs to be changed so that you effectively keep one separate canopy list per stock key. You can either keep a separate ArrayList for each stock key (preferred, since it will be faster), or change the distance metric so that points belonging to different stock keys are an infinite distance apart (so they never interact).
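A minimal, Hadoop-free sketch of the per-key variant (the class name and the T1 value here are made up for illustration; the real code would live inside the mapper and use ClusterJob.T1):

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class PerKeyCanopies {
    static final double T1 = 10.0; // placeholder threshold; the real job would use ClusterJob.T1

    // One canopy list per stock key, so keys never interact
    private final Map<String, List<double[]>> centersByKey = new HashMap<>();

    // Returns true if the point became a new canopy center for its key
    public boolean offer(String stockKey, double[] point) {
        List<double[]> centers =
                centersByKey.computeIfAbsent(stockKey, k -> new ArrayList<>());
        for (double[] center : centers) {
            if (distance(center, point) <= T1) {
                return false; // too close to an existing center for this key
            }
        }
        centers.add(point);
        return true;
    }

    static double distance(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return Math.sqrt(sum);
    }
}
```

With this structure, a GOOG point is only ever compared against GOOG centers, which is exactly the behavior the reducer already has.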
P.S. By the way, there are also some unrelated issues with your distance metric. First, you're parsing the data using Double.valueOf but not catching NumberFormatException. Since you're giving it stockData, which includes non-numeric strings like 'GOOG' in the very first field, you're going to crash the job as soon as you run it. Second, the distance metric ignores any fields with missing values. That is an incorrect implementation of an L2 (Pythagorean) distance metric. To see why, consider that the string "," has distance 0 from any other point, and if it is chosen as a canopy center, no other centers can be chosen. Instead of just setting the delta for a missing dimension to zero, you might consider setting it to something reasonable, like the population mean for that attribute, or (to be safe) just discarding that row from the data set for the purposes of clustering.
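To illustrate that last point, here is a hedged sketch of a distance function that refuses to silently skip fields: non-numeric values and length mismatches raise an error instead of contributing a zero delta (whether you then impute a mean or drop the row is a separate decision):

```java
import java.util.List;

public class StrictDistance {
    // Throws IllegalArgumentException on malformed input instead of
    // silently treating missing or non-numeric fields as distance 0
    public static double measureDistance(List<String> origin, List<String> destination) {
        if (origin.size() != destination.size()) {
            throw new IllegalArgumentException("vectors differ in length: "
                    + origin.size() + " vs " + destination.size());
        }
        double deltaSum = 0.0;
        for (int i = 0; i < origin.size(); i++) {
            double d;
            try {
                d = Double.parseDouble(origin.get(i)) - Double.parseDouble(destination.get(i));
            } catch (NumberFormatException e) {
                throw new IllegalArgumentException("non-numeric field at index " + i, e);
            }
            deltaSum += d * d; // sum of squared deltas, as in the original
        }
        return Math.sqrt(deltaSum);
    }
}
```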
Related
I'm working on a school project with MPAndroidChart, specifically a realtime graph, and I would like to display the time of each value.
I set an IndexAxisValueFormatter with getFormattedValue. It works, but it refreshes every label instead of just the last one. I'm trying to give each entry in my graph an X label that shows its time. I really don't know how to do that; your help would be welcome.
Code to create an entry and display it:
void creaGraph() {
ArrayList<ILineDataSet> dataSets = new ArrayList<>();
if (boxCO2.isChecked()) {
A_CO2.add(new Entry(indice, listData.recup_data(indice - 1).getCO2()));
LineDataSet setCO2 = new LineDataSet(A_CO2, "CO2");
setCO2.setAxisDependency(YAxis.AxisDependency.LEFT);
paramSet(setCO2);
setCO2.setColor(Color.RED);
setCO2.setCircleColor(Color.RED);
dataSets.add(setCO2);
}
LineData data = new LineData(dataSets);
graph.setData(data);
data.notifyDataChanged();
graph.notifyDataSetChanged();
graph.invalidate();
}
My override of getFormattedValue:
@Override
public String getFormattedValue(float value) {
return listData.recup_data(GraphPage.indice - 1).getTemps();
}
And a picture of my issue:
Every label refreshes when a new entry comes in.
Also, I see that after the 7th value, entries no longer have a time value.
You never use value in getFormattedValue. The string you construct there should be based on value, or it will show the same thing for every axis entry.
Something like this:
@Override
public String getFormattedValue(float value) {
return makeDateStringAt(value);
}
For example, if you have a chart with axis values at 0, 1, 2, 3, then getFormattedValue will be called 4 times with 0f, 1f, 2f, and 3f as its arguments, and you should use those inputs to create the appropriate string to show at those positions on the axis.
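A minimal sketch of that idea with the MPAndroidChart types stripped out: the formatter-style method below looks the label up by the float position it is given, rather than always returning the latest one (the class and list here are made up for illustration; in the real app you would keep the list in step with your entries):

```java
import java.util.ArrayList;
import java.util.List;

public class TimeLabels {
    private final List<String> times = new ArrayList<>();

    // Call this each time you add an Entry, with that entry's timestamp
    public void addEntryTime(String time) {
        times.add(time);
    }

    // Same contract as getFormattedValue(float): value is the axis position
    public String getFormattedValue(float value) {
        int index = Math.round(value);
        if (index < 0 || index >= times.size()) {
            return ""; // no label for positions without an entry
        }
        return times.get(index);
    }
}
```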
1) I'm practicing with graphs in order to add that feature to my app. I want the upper labels (the xAxis base) to be shown only where entries occur.
I haven't found a suitable solution online yet, and currently a label appears at every xAxis position from the first entry to the last, as in the picture below:
I want it to be without the ones I deleted, as shown in the picture below:
2) The second question I'm struggling with is that I want to be able to draw, for example, at (x=5, y=7) and afterwards draw at (x=1, y=3), but it won't let me add an entry with a smaller x than any entry already in the graph.
You have to extend the ValueFormatter class.
For more detail, take a look at link.
You can pick your desired logic to make a label disappear by returning "".
for example:
public String getFormattedValue(float value) {
if ((int)value <= 0) //your logic to evaluate correctness
return ""; // make the label go away
//...
}
UPDATE 2 (in Kotlin):
There is another overload of getFormattedValue which has an AxisBase parameter, so you can use mEntryCount or mEntries.
override fun getFormattedValue(value: Float, axis: AxisBase?): String {
if (axis == null || axis.mEntryCount <= 0)
return ""
return value.toString() // otherwise fall back to the numeric value (or build your own label)
}
I'm trying to implement a class to check if two game objects intersect. Can anyone give me a better solution / more elegant to this problem?
Basically I want to addCollision and know if one object collidesWith another. A double-entry matrix seemed like a good idea.
private class CollisionMatrix {
private boolean[][] matrix;
private HashMap<Tag, Integer> matrixIndexes = new HashMap<Tag, Integer>();
public CollisionMatrix() {
int i = 0;
for (Tag tag : Tag.values())
matrixIndexes.put(tag, i++);
matrix = new boolean[i][i];
}
private void addCollision(Tag tag1, Tag tag2) {
int p1 = matrixIndexes.get(tag1);
int p2 = matrixIndexes.get(tag2);
matrix[p1][p2] = true;
matrix[p2][p1] = true;
}
private boolean collidesWith(Tag tag1, Tag tag2) {
int p1 = matrixIndexes.get(tag1);
int p2 = matrixIndexes.get(tag2);
return matrix[p1][p2] || matrix[p2][p1];
}
}
This is not a complete answer, but it should set you on a path to get a more complete solution.
The simplest (not efficient) way to do this is to have a list of the objects that can collide with each other and then, for every frame in time, go through every object in the list and check whether it collides (shares the same space or bounding volume) with another one in the list.
pseudo code:
L: list of objects that can potentially collide.
t: time
for each frame in t {
for each object obj in L {
P: list of objects without obj
for each object otherObj in P {
does obj collide with otherObj
}
}
}
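The pseudocode above, sketched in plain Java with axis-aligned boxes (the Box class is invented for illustration; note the inner loop starts at i + 1 so each pair is tested exactly once):

```java
import java.util.ArrayList;
import java.util.List;

public class NaiveCollision {
    public static class Box {
        final double x, y, w, h; // bottom-left corner plus width and height
        public Box(double x, double y, double w, double h) {
            this.x = x; this.y = y; this.w = w; this.h = h;
        }
        boolean intersects(Box o) {
            return x < o.x + o.w && o.x < x + w && y < o.y + o.h && o.y < y + h;
        }
    }

    // O(n^2) check of every pair; fine for a few objects, too slow for many
    public static List<int[]> collidingPairs(List<Box> objects) {
        List<int[]> pairs = new ArrayList<>();
        for (int i = 0; i < objects.size(); i++) {
            for (int j = i + 1; j < objects.size(); j++) {
                if (objects.get(i).intersects(objects.get(j))) {
                    pairs.add(new int[]{i, j});
                }
            }
        }
        return pairs;
    }
}
```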
While this technically works, it's not a good solution as it will be very slow as soon as you start having many objects, and it doesn't take that many to make it slow.
To make this possible in real time, you would need to add some acceleration techniques.
One of these acceleration techniques is using "Bounding volume hierarchy" or BVH. https://en.wikipedia.org/wiki/Bounding_volume_hierarchy
In a nutshell, BVH is a technique, or algorithm, for enabling quick lookups of which objects are likely to collide.
It typically uses some type of tree structure to keep track of the positions and volumes occupied by the said objects. Tree structures provide faster lookup times than just linearly iterating a list multiple times.
Each level of the tree provides a hierarchy of bounding volumes (the space an object is likely to occupy). Top levels of the tree give a bigger volume for a particular object (rougher, less granular, less fitting to the object's shape), but make it easy to discard objects that are not in that same space (with very little calculation you would know the object could never collide with anything in that bounding volume). The deeper in the tree you go, the more granular, or more fitting to the object's shape, the bounding volumes get, until you reach the objects that actually collide.
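The core idea behind the coarse levels can be shown without a tree at all: run a cheap bounding-volume test first and only fall through to the expensive narrow-phase test when it passes. This is just a sketch of that two-phase pattern; the Circle class and the counter are invented for illustration, and the "exact" test is a stand-in for real mesh-level geometry:

```java
public class BoundingReject {
    public static class Circle {
        final double cx, cy, r; // bounding circle: centre and radius
        public Circle(double cx, double cy, double r) { this.cx = cx; this.cy = cy; this.r = r; }
    }

    public static int exactChecks = 0; // counts how often the expensive test actually runs

    // Cheap coarse test: do the bounding circles overlap at all?
    static boolean coarseOverlap(Circle a, Circle b) {
        double dx = a.cx - b.cx, dy = a.cy - b.cy;
        double rr = a.r + b.r;
        return dx * dx + dy * dy <= rr * rr;
    }

    // Stand-in for an expensive narrow-phase test on the real geometry
    static boolean exactOverlap(Circle a, Circle b) {
        exactChecks++;
        return coarseOverlap(a, b); // placeholder: real code would test the actual shapes
    }

    public static boolean collide(Circle a, Circle b) {
        if (!coarseOverlap(a, b)) return false; // rejected without the expensive test
        return exactOverlap(a, b);
    }
}
```

A BVH generalizes this by nesting such volumes in a tree, so whole groups of objects are rejected at once.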
Hope this helps :)
I am developing an algorithm to extract text and images from PDF files in the reading order. I use iText java for this purpose and basically my algorithm works as follows.
- The coordinates of every text chunk on the page are extracted using iText.
- A Rectangle object is created from the extracted coordinates. After this step we have a whole bunch of rectangle objects representing the actual text chunks on the page.
- Group the rectangles into larger text blocks, which will correspond to the actual columns on the PDF page.
- Order the text blocks by Y, then X.
- Apply the LocationTextExtractionStrategy to the text blocks one by one.
This approach gives me around 80% or slightly better accuracy for PDF files with medium to complex layouts. I know it will be almost impossible to reach 100% accuracy because PDF files do not store information in reading order.
What I want to do is increase my accuracy here, but iText stops me from doing that. I have identified a problem in iText: it sometimes extracts false locations for text chunks, which makes my algorithm incorrect. The following images are a good example of that.
You can see that in the actual PDF page there is a clear gap between columns. But the resulting rectangles contains some faulty rectangles in between that gap which prevents me from identifying the correct columns.
Following is the code that I use to extract locations of text chunks.
package com.InteliText.Extract;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.Map;
import java.util.Map.Entry;
import com.itextpdf.text.Rectangle;
import com.itextpdf.text.pdf.parser.ImageRenderInfo;
import com.itextpdf.text.pdf.parser.LineSegment;
import com.itextpdf.text.pdf.parser.SimpleTextExtractionStrategy;
import com.itextpdf.text.pdf.parser.TextExtractionStrategy;
import com.itextpdf.text.pdf.parser.TextRenderInfo;
import com.itextpdf.text.pdf.parser.Vector;
/*
* THIS CLASS ACTS AS THE TEXT EXTRACTOR FOR THE PREPROCESSOR
*/
public class PreProcessorStrategy extends SimpleTextExtractionStrategy{
private StringBuilder result = new StringBuilder();
private ArrayList<Double> fontSizes = new ArrayList<Double>();
private ArrayList<Double> lineSpaces = new ArrayList<Double>();
private ArrayList<TextSegment> textSegments = new ArrayList<TextSegment>();
Vector previousBaseLine = null;
@Override
public void beginTextBlock() {
// TODO Auto-generated method stub
}
@Override
public void endTextBlock() {
// TODO Auto-generated method stub
}
@Override
public void renderImage(ImageRenderInfo arg0) {
// TODO Auto-generated method stub
}
@Override
public void renderText(TextRenderInfo renderInfo) {
//This code assumes that if the baseline changes then we're on a newline
Vector curBaseline = renderInfo.getBaseline().getStartPoint();
Vector topRight = renderInfo.getAscentLine().getEndPoint();
//System.out.println(renderInfo.getText()+"\t"+curBaseline.get(0)+"\t"+topRight.get(0));
if(curBaseline.get(1) < 800 && curBaseline.get(1) > 50 ) {
// Chunk of text as a rectangle
Rectangle rect = new Rectangle(curBaseline.get(0), curBaseline.get(1), topRight.get(0), topRight.get(1));
double curFontSize = rect.getHeight();
fontSizes.add(curFontSize);
String text = renderInfo.getText();
boolean isBullet = text.contains("•");
if(!text.trim().isEmpty() && !isBullet) { //Skip whitespace-only chunks (the original compared against several space variants)
double endX = topRight.get(0);
if(text.endsWith(" "))
endX -= 8;
textSegments.add(new TextSegment(curBaseline.get(0),endX,curBaseline.get(1),topRight.get(1),renderInfo.getText(),curFontSize));
}
result.append(renderInfo.getText());
}
previousBaseLine = topRight;
}
@Override
public String getResultantText() {
// TODO Auto-generated method stub
return result.toString();
}
public ArrayList<TextSegment> getResultantTextSegments() {
return this.textSegments;
}
}
I use the resulting textSegments ArrayList to create rectangle objects from the coordinates stored in those textSegments. I suspect this might be a bug in iText.
As you can see, I'm currently shrinking a text chunk a little if its content ends with a white space. But this is a temporary fix, and I don't want to do it because it shrinks the correct text chunks too.
So is there a workaround for this? Or if it is a problem in my code, please help me fix it.
I am assuming here that if you knew where the columns were you could assign each rectangle to the correct column. It looks to me that if you drew a line down the left edge of the right hand column you could assign almost all of the rectangles correctly based on whether their centre was to the right or left of that edge. So the problem is to find the parameters that describe the data best (in particular the left hand edge of the rightmost column) in the presence of outliers.
The absolutely correct way is probably to fit some sort of statistical model, but I think there are a couple of easier hacks that might work.
1) All of the overlapping rectangles in your image seem to be very small. Perhaps you can simply remove rectangles below a given size, work out where the columns should be, and then assign each small rectangle according to whether its centre is to the left or right of the left hand edge of the right hand column.
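Hack 1 can be as simple as an area filter before column detection. The Rect class and threshold below are invented for illustration; the real code would work with the TextSegment coordinates already being collected:

```java
import java.util.ArrayList;
import java.util.List;

public class RectFilter {
    public static class Rect {
        final float left, bottom, right, top;
        public Rect(float left, float bottom, float right, float top) {
            this.left = left; this.bottom = bottom; this.right = right; this.top = top;
        }
        float area() { return (right - left) * (top - bottom); }
    }

    // Keep only rectangles big enough to be trusted for column detection;
    // the small ones get assigned to a column afterwards by their centre
    public static List<Rect> dropSmall(List<Rect> rects, float minArea) {
        List<Rect> kept = new ArrayList<>();
        for (Rect r : rects) {
            if (r.area() >= minArea) kept.add(r);
        }
        return kept;
    }
}
```

The minArea threshold has to be tuned against your data, e.g. from the typical chunk height times a few character widths.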
2) There is a general strategy for fitting data contaminated by outliers you can derive from https://en.wikipedia.org/wiki/RANSAC.
2a) Start by fitting the model to only a small amount of the data. You will be repeating 2a and 2b multiple times, and picking the best result. You are hoping that the initial points chosen for one of these cases are completely free of outliers. Note that if there are N outliers and you divide the data into N+1 chunks, at least one of these chunks must be completely free of outliers.
2b) Once you have an initial fit, look at all the data, work out which points are outliers, and ignore them temporarily (i.e., put aside the k worst-fitting points). Then fit the model again using the remaining points. In many cases you can prove that if you repeat this step indefinitely it will eventually converge, because changing the set of points identified as the k worst fits improves the fit, as does re-fitting the model; each iteration improves the fit until there is no change, at which point you declare that the process has converged.
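A toy one-dimensional version of steps 2a/2b: fit a mean, set aside the k worst-fitting points, refit, and stop when the fit no longer changes. A real column-edge fit would use the same loop with a different model (the edge position instead of a mean); all names here are invented for illustration:

```java
public class TrimmedMean {
    // Repeatedly fit the mean to all but the k worst-fitting points
    public static double fit(double[] data, int k) {
        double model = mean(data, new boolean[data.length]); // initial fit on everything
        while (true) {
            boolean[] excluded = worstK(data, model, k); // step 2b: set aside outliers
            double next = mean(data, excluded);          // refit on the rest
            if (Math.abs(next - model) < 1e-12) return next; // converged
            model = next;
        }
    }

    // Mark the k points with the largest residuals against the current model
    static boolean[] worstK(double[] data, double model, int k) {
        boolean[] excluded = new boolean[data.length];
        for (int n = 0; n < k; n++) {
            int worst = -1;
            double worstErr = -1;
            for (int i = 0; i < data.length; i++) {
                double err = Math.abs(data[i] - model);
                if (!excluded[i] && err > worstErr) { worstErr = err; worst = i; }
            }
            excluded[worst] = true;
        }
        return excluded;
    }

    static double mean(double[] data, boolean[] excluded) {
        double sum = 0; int count = 0;
        for (int i = 0; i < data.length; i++) {
            if (!excluded[i]) { sum += data[i]; count++; }
        }
        return sum / count;
    }
}
```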
I'm trying to color the text of every row in a table depending on one of the columns in the table. I'm having trouble grasping the concept of renderers, and I've tried out several different renderers but don't seem to understand what they do.
I am trying to load the top ten racers from a certain API given to us by our lecturer into the table model, colouring each row based on the gender of the racer (which is returned by the getCategory() method of a Finisher/Racer object).
FYI, DataTable is an object written by our lecturer. It's basically a 2D array object.
public void showRacers(DefaultTableModel tblModel,
@SuppressWarnings("rawtypes") JList listOfRaces) {
// Clear the model of any previous searches
tblModel.setRowCount(0);
// Initialize an object to the selected city
CityNameAndKey city = (CityNameAndKey) listOfRaces.getSelectedValue();
// Get the runners for this city
DataTable runners = this.getRunners(city);
// Set the column headers
this.setColumnHeaders(tblModel);
// Make an array list of object Finisher
ArrayList<Finisher> finisherList = new ArrayList<Finisher>();
// Make an array that holds the data of each finisher
Object[] finisherData = new Object[6];
// Make a finisher object
Finisher f;
for (int r = 0; r < 10; r++) {
// Assign the data to the finisher object
finisherList.add(f = new Finisher(runners.getCell(r, 0), runners
.getCell(r, 1), runners.getCell(r, 2), runners
.getCell(r, 3), runners.getCell(r, 4), runners
.getCell(r, 5)));
// Add the data into the array
finisherData[0] = f.getPosition();
finisherData[1] = f.getBibNo();
finisherData[2] = f.getTime();
finisherData[3] = f.getGender();
finisherData[4] = f.getCategory();
finisherData[5] = f.getRuns();
// Put it into the table model
tblModel.addRow(finisherData);
}
}
I would greatly appreciate an explanation rather than just the answer to my question. Guidance toward the answer would be great, and some code would be extremely helpful, but please no "You should have written this:" followed by ten lines of code I don't get.
Thank you very much! :)
Using a TableCellRenderer will only allow you to color one column; you would need one for each column. A much easier approach is to override prepareRenderer(...) in JTable to color an entire row.
See trashgod's answer here or camickr's answer here
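A sketch of the prepareRenderer approach, assuming the category string ends up in column 4 as in your showRacers code. The column index and colors are placeholders; adjust both to your model:

```java
import java.awt.Color;
import java.awt.Component;
import javax.swing.JTable;
import javax.swing.table.DefaultTableModel;
import javax.swing.table.TableCellRenderer;

public class RowColorTable extends JTable {
    private static final int CATEGORY_COLUMN = 4; // where getCategory() lands in your model

    public RowColorTable(DefaultTableModel model) {
        super(model);
    }

    @Override
    public Component prepareRenderer(TableCellRenderer renderer, int row, int column) {
        Component c = super.prepareRenderer(renderer, row, column);
        if (!isRowSelected(row)) {
            // Look at the category cell of this row and color the whole row from it
            Object category = getValueAt(row, CATEGORY_COLUMN);
            c.setForeground("F".equals(category) ? Color.MAGENTA : Color.BLUE);
            c.setBackground(Color.WHITE);
        }
        return c;
    }
}
```

The key point is that prepareRenderer is called for every cell, so one override colors all columns of a row consistently, whereas a per-column TableCellRenderer only sees its own column.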