// Compute the first principal component and project the data onto it
RowMatrix mat = new RowMatrix(parsedData.rdd());
Matrix pc = mat.computePrincipalComponents(1);
RowMatrix projected = mat.multiply(pc);
I need to print the elements of the RowMatrix projected using Java.
RowMatrix is a distributed data structure, and the only way to reliably output its content is to fetch the data to the driver and print it locally. It is typically an operation you want to avoid, but the general approach is as follows:
val mat: RowMatrix = ???
mat
.rows // Extract RDD[org.apache.spark.mllib.linalg.Vector]
.collect // you can use toLocalIterator to limit memory usage
.foreach(println) // Iterate over local Iterator and print
The Java equivalent is something like this:
List<Vector> vs = mat.rows().toJavaRDD().collect();
for (Vector v: vs) {
    System.out.println(v);
}
In practice there should be no need for an operation like this. If your data is small enough to be handled locally, there is no reason to use a DistributedMatrix. If the data is large but wide, then RowMatrix is a poor choice of distributed data structure.
Newbie to Java and Spark here looking for some help:
Is there a way to create a Dataset with a single column containing increasing values from 1 to n?
Dataset<Row> ds = ss.createDataSet("column-name", 1, 1000);
The above is kind of crude in that there is no such method as createDataSet, but I am looking for something along those lines which can lazily create the contents of ds.
Assuming that you are using Java 8+:
int n = 1000;
List<Integer> intList = IntStream.range(1, n+1).boxed().collect(Collectors.toList());
Encoder<Integer> integerEncoder = Encoders.INT();
Dataset<Integer> primitiveDS = spark.createDataset(intList, integerEncoder);
Also keep in mind that when you create a Dataset this way in Spark, you need to create/pass a self-generated List of data, which is built entirely in the driver.
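If you only want a lazily generated column of consecutive numbers, SparkSession.range can produce it on the executors without building a local list first. A minimal sketch (renaming the default "id" column is optional):
long n = 1000;
// range(start, end) is end-exclusive, so use n + 1 to get 1..n
Dataset<Long> ids = spark.range(1, n + 1);
// Optionally give the single column your own name instead of the default "id"
Dataset<Row> ds = ids.toDF("column-name");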
I am creating a list from a Spark DataFrame using the collectAsList method and reading the columns by iterating through the rows. The Spark Java job runs on a multi-node cluster, where the config is set to spawn multiple executors. Please suggest an alternative way to do the below functionality in Java.
List<Row> list = df.collectAsList();
List<Row> responseList = new ArrayList<>();
for (Row r: list) {
    String colVal1 = r.getAs(colName1);
    String colVal2 = r.getAs(colName2);
    String[] nestedValues = new String[allCols.length];
    nestedValues[0] = colVal1;
    nestedValues[1] = colVal2;
    .
    .
    .
    responseList.add(RowFactory.create(nestedValues));
}
Thanks
The benefit of Spark is that you can process a large amount of data using the memory and processors across multiple executors on multiple nodes. The problem you are having is likely due to using collectAsList and then processing the data. collectAsList brings all the data into the "driver", which is a single JVM on a single node. It should be used as a final step, to gather results after you have processed the data.

If you try to bring a very large amount of data into the driver and then process it, it can be very slow or fail, and at that point you are not actually using Spark to process your data. Instead of using collectAsList, use the methods available on the DataFrame to process the data, such as map().
Definitions of driver and executor: https://spark.apache.org/docs/latest/cluster-overview.html
In the Java API a DataFrame is a Dataset<Row>. Here's the Dataset documentation; use the methods there to process your data: https://spark.apache.org/docs/latest/api/java/index.html?org/apache/spark/sql/Dataset.html
I faced the same issue. I used the toLocalIterator() method, which returns an Iterator and holds less data at a time. collectAsList() returns all of the data at once as a List, whereas the iterator pulls the data to the driver one partition at a time as you read it.
Code like:
Iterator<Row> itr = df.toLocalIterator();
while (itr.hasNext()) {
    Row row = itr.next();
    // do something
}
@Tony is spot on. Here are a few more points:
Big data processing needs to be scalable, so that with more processing power more data can be processed in the same time. This is achieved with parallel processing using multiple executors.
Spark is also resilient: if some executors die, it can recover easily.
Using collect() makes your processing heavily dependent on a single process/node, the driver. It won't scale and is more prone to failure.
Essentially you can use all Spark APIs in production-grade code except a few of these: collect, collectAsList, show. You can use them for testing and for small amounts of data, but you shouldn't use them for large amounts of data.
In your case you can simply do something like this:
Dataset<Row> df = // create your DataFrame from a source like a file, table etc.
// col() and array() come from a static import of org.apache.spark.sql.functions
df.select(col("column1"), col("column2"), array(col("column1"), col("column2")).as("column3"))
  .write().save(.... file name ..)
You can use the tons of column-based functions available at https://spark.apache.org/docs/latest/api/java/org/apache/spark/sql/functions.html. These functions cover pretty much everything you need; if not, you can always read the data into Scala/Java/Python and operate on it using that language's syntax.
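For example, a few of those built-in column functions chained together (the column names here are hypothetical, purely to illustrate the style):
import static org.apache.spark.sql.functions.*;

// "name", "city" and "age" are hypothetical columns, just for illustration
Dataset<Row> out = df
        .withColumn("name_upper", upper(col("name")))
        .withColumn("full_label", concat_ws("-", col("name"), col("city")))
        .filter(col("age").gt(21));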
Creating a new answer to address the new specifics in your question.
Call .map on the DataFrame, and put your logic in a lambda that converts one row to a new row.
// Do your data manipulation in a call to `.map`,
// which will return another DataFrame (Dataset<Row>).
// Requires org.apache.spark.api.java.function.MapFunction and
// org.apache.spark.sql.catalyst.encoders.RowEncoder.
Dataset<Row> df2 = df.map(
    // This work will be spread out across all your nodes,
    // which is the real power of Spark.
    (MapFunction<Row, Row>) r -> {
        // I'm assuming the code you put in the question works,
        // and just copying it here.
        // Note the explicit <String> type argument with getAs.
        String colVal1 = r.<String>getAs(colName1);
        String colVal2 = r.<String>getAs(colName2);
        String[] nestedValues = new String[allCols.length];
        nestedValues[0] = colVal1;
        nestedValues[1] = colVal2;
        .
        .
        .
        // Return a single Row
        return RowFactory.create(nestedValues);
    },
    // In Java, map needs an Encoder for the resulting rows;
    // newSchema stands for the StructType describing the new rows.
    RowEncoder.apply(newSchema)
);
// When you are done, get local results as Rows.
List<Row> localResultRows = df2.collectAsList();
https://spark.apache.org/docs/1.6.1/api/java/org/apache/spark/sql/DataFrame.html
I want to filter two sets of vertices using a like search, and then I want to add edges between these vertices if a property, e.g. location, matches.
Step 1: Do a like search where mgrNo starts with 100.
Step 2: Do a like search where mgrNo starts with 200.
Step 3: Add an edge between the vertices generated by steps 1 and 2 if a property, e.g. the location of vertex A and vertex B, matches.
I would like to know how to do this in Java using GremlinPipeline.
I'm not sure you really need complex Gremlin to do this:
// using some groovy for simplicity note that this is graph query syntax
// and not a "pipeline". to convert to java, you will just need to iterate
// the result of vertices() into an ArrayList and convert the use of
// each{} to foreach or some other java construct
mgr100 = g.query().has("mgrNo",CONTAINS_PREFIX,"100").vertices().toList()
mgr200 = g.query().has("mgrNo",CONTAINS_PREFIX,"200").vertices().toList()
mgr100.each {
mgr200.findAll{x -> x.location == it.location}.each{x -> it.addEdge('near', x)}
}
Note the use of some Titan-specific syntax around CONTAINS_PREFIX. While you could probably try to convert this code into a Gremlin pipeline somehow, I'm not so sure that it would be that much more readable than this.
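If it helps, here is a rough sketch of the Java conversion the comments above describe, assuming Titan's Text.CONTAINS_PREFIX predicate and the Blueprints 2.x Vertex API (getProperty/addEdge), with g being your TitanGraph instance:
import com.thinkaurelius.titan.core.attribute.Text;
import com.tinkerpop.blueprints.Vertex;
import java.util.ArrayList;
import java.util.List;

// Materialize both vertex sets locally, as the comments above suggest
List<Vertex> mgr100 = new ArrayList<Vertex>();
for (Vertex v : g.query().has("mgrNo", Text.CONTAINS_PREFIX, "100").vertices()) {
    mgr100.add(v);
}
List<Vertex> mgr200 = new ArrayList<Vertex>();
for (Vertex v : g.query().has("mgrNo", Text.CONTAINS_PREFIX, "200").vertices()) {
    mgr200.add(v);
}

// Add a "near" edge whenever the location properties match
for (Vertex a : mgr100) {
    Object location = a.getProperty("location");
    for (Vertex b : mgr200) {
        if (location != null && location.equals(b.getProperty("location"))) {
            a.addEdge("near", b);
        }
    }
}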
This question is about MLlib (Spark 1.2.1+).
What is the best way to manipulate local matrices (of moderate size, under 100x100, so they do not need to be distributed)?
For instance, after computing the SVD of a dataset, I need to perform some matrix operation.
RowMatrix only provides a multiply function. The toBreeze method returns a DenseMatrix<Object>, but the API does not seem Java-friendly:
public final <TT,B,That> That $plus(B b, UFunc.UImpl2<OpAdd$,TT,B,That> op)
In Spark + Java, how can I do any of the following operations:
transpose a matrix
add/subtract two matrices
crop a Matrix
perform element-wise operations
etc
Javadoc RowMatrix: https://spark.apache.org/docs/latest/api/java/org/apache/spark/mllib/linalg/distributed/RowMatrix.html
RDD<Vector> data = ...;
RowMatrix matrix = new RowMatrix(data);
SingularValueDecomposition<RowMatrix, Matrix> svd = matrix.computeSVD(15, true, 1e-9d);
RowMatrix U = svd.U();
Vector s = svd.s();
Matrix V = svd.V();
//Example 1: How to compute transpose(U)*matrix
//Example 2: How to compute transpose(U(:,1:k))*matrix
EDIT: Thanks to dlwh for pointing me in the right direction; the following solution works:
import no.uib.cipr.matrix.DenseMatrix;
// ...
RowMatrix U = svd.U();
DenseMatrix U_mtj = new DenseMatrix((int) U.numCols(), (int) U.numRows(), U.toBreeze().toArray$mcD$sp(), true);
// From there, matrix operations are available on U_mtj
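For instance, Example 1 (transpose(U) * matrix) can then be done on the local MTJ copies. This is a rough sketch, assuming MTJ's transAmult operation (which computes C = A^T * B); A and B stand for whatever local copies you have built, and are allocated empty here just to keep the sketch self-contained:
import no.uib.cipr.matrix.DenseMatrix;

// A and B are whatever local MTJ matrices you have built (e.g. U_mtj above);
// here they are just allocated empty so the sketch compiles on its own.
DenseMatrix A = new DenseMatrix(5, 3);
DenseMatrix B = new DenseMatrix(5, 4);
// MTJ expects the result to be pre-allocated with the right shape.
// transAmult computes C = A^T * B, i.e. Example 1 without an explicit transpose.
DenseMatrix C = new DenseMatrix(A.numColumns(), B.numColumns());
A.transAmult(B, C);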
Breeze just doesn't provide a Java-friendly API. (And, speaking as the main author, I have no plans to: it would hamstring the API too much.)
You can probably exploit the fact that MTJ uses the same dense matrix representation as we do. (Well, almost. Their API doesn't expose majorStride, but that shouldn't be an issue for you.)
That is, you can do something like this:
import no.uib.cipr.matrix.DenseMatrix;
// ...
breeze.linalg.DenseMatrix<Object> Ubreeze = U.toBreeze();
// the underlying array is double[] at runtime, so the cast is safe
new DenseMatrix(Ubreeze.cols(), Ubreeze.rows(), (double[]) Ubreeze.data());
I have 7 lines of data in a text file (shown below).
name: abcd
temp: 623.2
vel: 8
name: xyz
temp: 432
vel: 7.6
Using regex, I was able to read this data and I have been able to print it out. Now I need to store this data in some variable. I'm leaning towards storing this data in an array/matrix. So physically, it would look something like this:
data = [abcd, 623.2, 8
xyz, 432, 7.6]
So in effect, the 1st row contains the first 3 lines and the 2nd row contains lines 5 to 7. My reason for choosing this type of variable for storage is that, in the long run, calling out the data will be simpler, as in:
data[0][0] = abcd
data[1][1] = 432
I can't use the Java matrix files from math.nist.gov because I'm not the root user, and getting the IT dept to install stuff on my machine is proving to be a MAJOR waste of time. So I want to work with the resources I have, which are Eclipse and a Java 1.6 installation.
I want to get this data and store it into a Java array variable. What I wanted to know is: is choosing the array variable the right way to proceed? Or should I use a vector variable (although, in my opinion, using a vector variable will complicate stuff)? Or is there some other variable that will allow me to store data easily and call it out easily?
Btw, a little more details regarding my java installation - in case it helps in some way:
OpenJDK Runtime Environment (build 1.6.0-b09)
OpenJDK 64-bit Server VM (build 1.6.0-b09, mixed mode)
Thank you for your help
It seems to me that
name: abcd
temp: 623.2
vel: 8
is some sort of object, and you'd do well to store a list of these. For example, you could define an object:
public class MyObject {
private String name;
private double temp;
private double vel;
// etc...
}
(perhaps - there may be more appropriate types), and store these in a list:
List<MyObject>
If you need to index them via their name attribute, then perhaps store a map (e.g. Map<String, MyObject>) where the key is the name of the object.
I'm suggesting creating an object for these since it's trivially easy to ask for obj.getName() etc. rather than remember or calculate array index offsets. Going forwards, you'll be able to add behaviour to these objects (e.g. you have a temp field; with an object you can retrieve that in centigrade/kelvin/fahrenheit etc.). Storing the raw data in arrays doesn't really allow you to leverage the functionality of an OO language.
(Note re your installation woes: these classes are native to the Java JRE/JDK and don't require installations. They're fundamental to many Java programs.)
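To make this concrete, here is a minimal sketch of the parsing loop, placed inside a method that throws IOException, assuming each record is exactly three lines in name/temp/vel order, that MyObject has the usual setters, and that the file is called data.txt (Java 6 style, since that's the installation mentioned above):
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

// Assumes MyObject has setName/setTemp/setVel setters
List<MyObject> records = new ArrayList<MyObject>();
BufferedReader reader = new BufferedReader(new FileReader("data.txt"));
try {
    String line;
    MyObject current = null;
    while ((line = reader.readLine()) != null) {
        if (line.trim().isEmpty()) {
            continue; // skip the blank line between records
        }
        String value = line.substring(line.indexOf(':') + 1).trim();
        if (line.startsWith("name")) {
            current = new MyObject();
            current.setName(value);
        } else if (line.startsWith("temp") && current != null) {
            current.setTemp(Double.parseDouble(value));
        } else if (line.startsWith("vel") && current != null) {
            current.setVel(Double.parseDouble(value));
            records.add(current); // a record is complete after "vel"
        }
    }
} finally {
    reader.close();
}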
You can use an array, but rather than using a two-dimensional array, create a data class that holds the elements and then have an array of those elements.
For example:
public class MyData {
    String name;
    float temp;
    float vel;
}
then you could define
MyData arr[];
You could also use a List instead of an array, depending on whether you have sorting/searching type criteria. This approach gives you a lot more flexibility if you ever add an element, or if you want to find duplicates or do searching.
Wrap this information
name: xyz
temp: 432
vel: 7.6
in a class of its own.
And use whichever implementation of a List<T> you prefer.
Provided that all keys in the key-value pairs that you are reading are unique, why don't you store the items in a java.util.Map?
// The second group allows decimal values such as 623.2
Pattern pattern = Pattern.compile("(\\w+): ([\\w.]+)");
try (BufferedReader reader = new BufferedReader(new FileReader("data.txt"))) {
    Map<String, String> items = new LinkedHashMap<>();
    String line = null;
    while ((line = reader.readLine()) != null) {
        Matcher matcher = pattern.matcher(line);
        while (matcher.find()) {
            items.put(matcher.group(1), matcher.group(2));
        }
    }
    System.out.println(items);
} catch (IOException e) {
    System.out.println(e.getMessage());
}
The map would then contain: {name=xyz, temp=432, vel=7.6}
And you could easily read a particular element like: items.get("name")
I think you can rely on the Java Collections framework.
You can use an ArrayList instead of arrays if there is a particular sequence to the data.
Moreover, if you want to store the data in key-value pairs, then use a Map.
Note: If you need sorted values, then use an ArrayList with a Comparator or the Comparable interface.
If you are using a Map and you need unique, sorted keys, then use a TreeMap.
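For example, sorting could look roughly like this, assuming the MyObject/records list from the answers above and a getTemp() getter (both assumptions):
import java.util.Collections;
import java.util.Comparator;

// Sort the parsed records by temperature (Java 6 style anonymous Comparator);
// records and getTemp() are assumed from the earlier answers.
Collections.sort(records, new Comparator<MyObject>() {
    public int compare(MyObject a, MyObject b) {
        return Double.compare(a.getTemp(), b.getTemp());
    }
});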