I have to pivot the data in a file and then store it in another file, and I am having some difficulty pivoting the data.
I have multiple files that contain data looking somewhat like what I show below. The columns are of variable length. I am trying to merge the files first, but for some reason the output is not correct. I have not yet tried the pivot method either, and I am not sure how to use it.
How can this be achieved?
File 1:
0,26,27,30,120
201008,100,1000,10,400
201009,200,2000,20,500
201010,300,3000,30,600
File 2:
0,26,27,30,120,145
201008,100,1000,10,400,200
201009,200,2000,20,500,100
201010,300,3000,30,600,150
File 3:
0,26,27,120,145
201008,100,10,400,200
201009,200,20,500,100
201010,300,30,600,150
Output:
201008,26,100
201008,27,1000
201008,30,10
201008,120,400
201008,145,200
201009,26,200
201009,27,2000
201009,30,20
201009,120,500
201009,145,100
.....
I am not very familiar with Spark, but am trying to use flatMap and flatMapValues. I am not sure how to use them yet, and would appreciate some guidance.
import org.apache.commons.lang.StringUtils;
import org.apache.log4j.Level;
import org.apache.log4j.Logger;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.Function;
import org.apache.spark.api.java.function.VoidFunction;
import org.apache.spark.sql.SparkSession;
import lombok.extern.slf4j.Slf4j;
@Slf4j
public class ExecutionTest {
public static void main(String[] args) {
Logger.getLogger("org.apache").setLevel(Level.WARN);
Logger.getLogger("org.spark_project").setLevel(Level.WARN);
Logger.getLogger("io.netty").setLevel(Level.WARN);
log.info("Starting...");
// Step 1: Create a SparkContext.
boolean isRunLocally = Boolean.valueOf(args[0]);
String filePath = args[1];
SparkConf conf = new SparkConf().setAppName("Variable File").set("serializer",
"org.apache.spark.serializer.KryoSerializer");
if (isRunLocally) {
log.info("System is running in local mode");
conf.setMaster("local[*]").set("spark.executor.memory", "2g");
}
SparkSession session = SparkSession.builder().config(conf).getOrCreate();
JavaSparkContext jsc = new JavaSparkContext(session.sparkContext());
jsc.textFile(filePath, 2)
.map(new Function<String, String[]>() {
private static final long serialVersionUID = 1L;
@Override
public String[] call(String v1) throws Exception {
return StringUtils.split(v1, ",");
}
})
.foreach(new VoidFunction<String[]>() {
private static final long serialVersionUID = 1L;
@Override
public void call(String[] t) throws Exception {
for (String string : t) {
log.info(string);
}
}
});
}
}
Here is a solution in Scala; as I am not a Java person, you should be able to adapt it, and add sorting, caching, etc.
The data is as follows: 3 files, with a duplicate entry evident; remove it if you do not want it.
0, 5,10, 15, 20
202008, 5,10, 15, 20
202009,10,20,100,200
8 rows are generated from the file above (2 data rows x 4 value columns).
0,888,999
202008, 5, 10
202009, 10, 20
4 rows are generated from the file above.
0, 5
202009,10
1 row is generated, which is a duplicate.
// A bit lazy with column names, but anyway.
import org.apache.spark.sql.functions.input_file_name
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._
import spark.implicits._
val inputPath: String = "/FileStore/tables/g*.txt"
val rdd = spark.read.text(inputPath)
.select(input_file_name, $"value")
.as[(String, String)]
.rdd
val rdd2 = rdd.zipWithIndex
val rdd3 = rdd2.map(x => (x._1._1, x._2, x._1._2.split(",").toList.map(_.toInt)))
val rdd4 = rdd3.map { case (pfx, pfx2, list) => (pfx,pfx2,list.zipWithIndex) }
val df = rdd4.toDF()
df.show(false)
df.printSchema()
val df2 = df.withColumn("rankF", row_number().over(Window.partitionBy($"_1").orderBy($"_2".asc)))
df2.show(false)
df2.printSchema()
val df3 = df2.withColumn("elements", explode($"_3"))
df3.show(false)
df3.printSchema()
val df4 = df3.select($"_1", $"rankF", $"elements".getField("_1"), $"elements".getField("_2")).toDF("fn", "line_num", "val", "col_pos")
df4.show(false)
df4.printSchema()
df4.createOrReplaceTempView("df4temp")
val df51 = spark.sql("""SELECT hdr.fn, hdr.line_num, hdr.val AS pfx, hdr.col_pos
FROM df4temp hdr
WHERE hdr.line_num <> 1
AND hdr.col_pos = 0
""")
df51.show(100,false)
val df52 = spark.sql("""SELECT t1.fn, t1.val AS val1, t1.col_pos, t2.line_num, t2.val AS val2
FROM df4temp t1, df4temp t2
WHERE t1.col_pos <> 0
AND t1.col_pos = t2.col_pos
AND t1.line_num <> t2.line_num
AND t1.line_num = 1
AND t1.fn = t2.fn
""")
df52.show(100,false)
df51.createOrReplaceTempView("df51temp")
df52.createOrReplaceTempView("df52temp")
val df53 = spark.sql("""SELECT DISTINCT t1.pfx, t2.val1, t2.val2
FROM df51temp t1, df52temp t2
WHERE t1.fn = t2.fn
AND t1.line_num = t2.line_num
""")
df53.show(false)
returns:
+------+----+----+
|pfx |val1|val2|
+------+----+----+
|202008|888 |5 |
|202009|999 |20 |
|202009|20 |200 |
|202008|5 |5 |
|202008|10 |10 |
|202009|888 |10 |
|202008|15 |15 |
|202009|5 |10 |
|202009|10 |20 |
|202009|15 |100 |
|202008|20 |20 |
|202008|999 |10 |
+------+----+----+
What we see here is data wrangling: massaging the data into temp views and JOINing them appropriately with SQL.
The key is knowing how to massage the data to make things easy. Note that there is no groupBy, etc. Because each file has a varying number of columns, JOINing is not attempted at the RDD level; it is too inflexible there. The rank shows the line number, so you know which line is the header starting with 0.
This is what we call data wrangling, and also what we call hard work for a few points on SO. It is one of my best efforts, and also one of the last of such efforts.
A weakness of the solution is the amount of work needed to get the first record of each file; there are alternatives (see https://www.cyberciti.biz/faq/unix-linux-display-first-line-of-file/), and preprocessing is what I would realistically consider.
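For completeness, here is a much simpler per-file sketch of the flatMap approach the question mentions. It is not part of the answer above; it assumes the header is the line whose first field is 0, that every data row has the same number of fields as its file's header, and the input/output paths are hypothetical:
// read one file, pair each value with its header column id, emit date,columnId,value
val lines  = spark.sparkContext.textFile("/path/to/file1.csv").map(_.split(","))
val header = lines.filter(_.head == "0").first()           // e.g. Array("0", "26", "27", "30", "120")
val pivoted = lines
  .filter(_.head != "0")                                    // keep only the data rows
  .flatMap { row =>
    val date = row.head
    row.tail.zip(header.tail).map { case (value, colId) => s"$date,$colId,$value" }
  }
pivoted.saveAsTextFile("/path/to/output_file1")             // e.g. 201008,26,100 ...
The per-file results can then be unioned and de-duplicated to obtain the merged output.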
Related
I have a custom function which depends on the order of the data. I want to apply this function to each group in Spark in parallel (parallel groups). How can I do this?
For example:
public ArrayList<Integer> my_logic(ArrayList<Integer> glist) {
    boolean b = true;
    ArrayList<Integer> result = new ArrayList<>();
    for (int i = 1; i < glist.size(); i++) { // size is around 30000
        if (b && glist.get(i - 1) > glist.get(i)) {
            // some logic, then set b to false
            result.add(glist.get(i));
        } else {
            // some logic, then set b to true
        }
    }
    return result;
}
My data,
Col1 Col2
a 1
b 2
a 3
c 4
c 3
…. ….
I want something similar to the below:
df.group_by(col("Col1")).apply(my_logic(col("Col2")));
// output
a [1,3,5…]
b [2,5,8…]
…. ….
In Spark, you can use Window aggregate functions directly; I will show that here in Scala.
Here is your input data (my preparation):
import scala.collection.JavaConversions._
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._
import org.apache.spark.sql.Row
val schema = StructType(
StructField("Col1", StringType, false) ::
StructField("Col2", IntegerType, false) :: Nil
)
val row = Seq(Row("a", 1),Row("b", 8),Row("b", 2),Row("a", 5),Row("b", 5),Row("a", 3))
val df = spark.createDataFrame(row, schema)
df.show(false)
//input:
// +----+----+
// |Col1|Col2|
// +----+----+
// |a |1 |
// |b |8 |
// |b |2 |
// |a |5 |
// |b |5 |
// |a |3 |
// +----+----+
Here is the code to obtain the desired logic:
import org.apache.spark.sql.expressions.Window
df
// NEWCOLUMN: EVALUATE/CREATE LIST OF VALUES FOR EACH RECORD OVER THE WINDOW AS FRAME MOVES
.withColumn(
"collected_list",
collect_list(col("Col2")) over Window
.partitionBy(col("Col1"))
.orderBy(col("Col2"))
)
// NEWCOLUMN: MAX SIZE OF COLLECTED LIST IN EACH WINDOW
.withColumn(
"max_size",
max(size(col("collected_list"))) over Window.partitionBy(col("Col1"))
)
// FILTER TO GET ONLY HIGHEST SIZED ARRAY ROW
.where(col("max_size") - size(col("collected_list")) === 0)
.orderBy(col("Col1"))
.drop("Col2", "max_size")
.show(false)
// output:
// +----+--------------+
// |Col1|collected_list|
// +----+--------------+
// |a |[1, 3, 5] |
// |b |[2, 5, 8] |
// +----+--------------+
Note:
you can use the collect_list() aggregate function with groupBy directly, but then you cannot guarantee the order of the collected list (see the sketch after these notes for one workaround).
you can explore the collect_set() aggregate function if you want to eliminate duplicates (with some changes to the above query).
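If the list only needs to be ordered by the collected values themselves, a shorter alternative (a sketch, not part of the answer above) is sort_array on top of a plain groupBy/collect_list:
// collect_list after groupBy has no guaranteed order; sort_array then sorts by value
df.groupBy(col("Col1"))
  .agg(sort_array(collect_list(col("Col2"))).as("collected_list"))
  .orderBy(col("Col1"))
  .show(false)
// +----+--------------+
// |Col1|collected_list|
// +----+--------------+
// |a   |[1, 3, 5]     |
// |b   |[2, 5, 8]     |
// +----+--------------+
This only works when the ordering key is the collected value itself; to order by a different column, use the window approach above or collect structs and sort them.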
EDIT 2: You can write your custom collect_list() as a UDAF (UserDefinedAggregateFunction) for DataFrames in Scala Spark, like this:
Online Docs
For Spark2.3.0
For Latest Version
The code below is for Spark version == 2.3.0.
import scala.collection.mutable
import org.apache.spark.sql.expressions.{MutableAggregationBuffer, UserDefinedAggregateFunction}
object Your_Collect_Array extends UserDefinedAggregateFunction {
override def inputSchema: StructType = StructType(
StructField("yourInputToAggFunction", LongType, false) :: Nil
)
override def dataType: ArrayType = ArrayType(LongType, false)
override def deterministic: Boolean = true
override def bufferSchema: StructType = {
StructType(
StructField("yourCollectedArray", ArrayType(LongType, false), false) :: Nil
)
}
override def initialize(buffer: MutableAggregationBuffer): Unit = {
buffer(0) = new Array[Long](0)
}
override def update(buffer: MutableAggregationBuffer, input: Row): Unit = {
buffer.update(
0,
buffer.getAs[mutable.WrappedArray[Long]](0) :+ input.getLong(0)
)
}
override def merge(
buffer1: MutableAggregationBuffer,
buffer2: Row
): Unit = {
buffer1.update(
0,
buffer1.getAs[mutable.WrappedArray[Long]](0) ++ buffer2
.getAs[mutable.WrappedArray[Long]](0)
)
}
override def evaluate(buffer: Row): Any =
buffer.getAs[mutable.WrappedArray[Long]](0)
}
// Below is the query with just one line changed, i.e., calling the custom UDAF written above
df
// NEWCOLUMN : USING OUR CUSTOM UDF
.withColumn(
"your_collected_list",
Your_Collect_Array(col("Col2")) over Window
.partitionBy(col("Col1"))
.orderBy(col("Col2"))
)
// NEWCOLUMN: MAX SIZE OF COLLECTED LIST IN EACH WINDOW
.withColumn(
"max_size",
max(size(col("your_collected_list"))) over Window.partitionBy(col("Col1"))
)
// FILTER TO GET ONLY HIGHEST SIZED ARRAY ROW
.where(col("max_size") - size(col("your_collected_list")) === 0)
.orderBy(col("Col1"))
.drop("Col2", "max_size")
.show(false)
//Output:
// +----+-------------------+
// |Col1|your_collected_list|
// +----+-------------------+
// |a |[1, 3, 5] |
// |b |[2, 5, 8] |
// +----+-------------------+
Note:
UDFs are not that efficient in Spark, so use them only when you absolutely need them. They are mainly intended for data analytics use cases.
I have two dataframes: left and right. They are identical, consisting of the same three columns (src, predicate, dst) with the same values.
1- I tried to join these dataframes where the condition is that dst in the left equals src in the right, but it was not working. Where is the error?
Dataset<Row> r = left
.join(right, left.col("dst").equalTo(right.col("src")));
Result:
+---+---------+---+---+---------+---+
|src|predicate|dst|src|predicate|dst|
+---+---------+---+---+---------+---+
+---+---------+---+---+---------+---+
2- If I rename dst in the left as dst2 and the src column in the right as dst2, and then apply a join, it works. But if I try to select some columns from the obtained dataframe, it raises an exception. Where is my error?
Dataset<Row> left = input_df.withColumnRenamed("dst", "dst2");
Dataset<Row> right = input_df.withColumnRenamed("src", "dst2");
Dataset<Row> r = left.join(right, left.col("dst2").equalTo(right.col("dst2")));
Then:
left.show();
gives:
+---+---------+----+
|src|predicate|dst2|
+---+---------+----+
| a| r1| :b1|
| a| r2| k|
|:b1| r3| :b4|
|:b1| r10| d|
|:b4| r4| f|
|:b4| r5| :b5|
|:b5| r9| t|
|:b5| r10| e|
+---+---------+----+
and
right.show();
gives:
+----+---------+---+
|dst2|predicate|dst|
+----+---------+---+
| a| r1|:b1|
| a| r2| k|
| :b1| r3|:b4|
| :b1| r10| d|
| :b4| r4| f|
| :b4| r5|:b5|
| :b5| r9| t|
| :b5| r10| e|
+----+---------+---+
result:
+---+---------+----+----+---------+---+
|src|predicate|dst2|dst2|predicate|dst|
+---+---------+----+----+---------+---+
| a| r1| b1| b1 | r10| d|
| a| r1| b1| b1 | r3| b4|
|b1 | r3| b4| b4 | r5| b5|
|b1 | r3| b4| b4 | r4| f|
+---+---------+----+----+---------+---+
Dataset<Row> r = left
.join(right, left.col("dst2").equalTo(right.col("dst2")))
.select(left.col("src"),right.col("dst"));
result:
Exception in thread "main" org.apache.spark.sql.AnalysisException: resolved attribute(s) dst#45 missing from dst2#177,src#43,predicate#197,predicate#44,dst2#182,dst#198 in operator !Project [src#43, dst#45];
3- Supposing the select works, how can I add the obtained dataframe to the left dataframe?
I'm working in Java.
You were using:
r = r.select(left.col("src"), right.col("dst"));
It seems that Spark does not find the lineage back to the right dataframe, which is not shocking as the plan goes through a lot of optimization.
Assuming your desired output is:
+---+---+
|src|dst|
+---+---+
| b1|:b5|
| b1| f|
|:b4| e|
|:b4| t|
+---+---+
You could use one of these 3 options:
Using the col() method
Dataset<Row> resultOption1Df = r.select(left.col("src"), r.col("dst"));
resultOption1Df.show();
Using the col() static function
Dataset<Row> resultOption2Df = r.select(col("src"), col("dst"));
resultOption2Df.show();
Using the column names
Dataset<Row> resultOption3Df = r.select("src", "dst");
resultOption3Df.show();
Here is the complete source code:
package net.jgp.books.spark.ch12.lab990_others;
import static org.apache.spark.sql.functions.col;
import java.util.ArrayList;
import java.util.List;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructField;
import org.apache.spark.sql.types.StructType;
/**
* Self join.
*
* @author jgp
*/
public class SelfJoinAndSelectApp {
/**
* main() is your entry point to the application.
*
* @param args
*/
public static void main(String[] args) {
SelfJoinAndSelectApp app = new SelfJoinAndSelectApp();
app.start();
}
/**
* The processing code.
*/
private void start() {
// Creates a session on a local master
SparkSession spark = SparkSession.builder()
.appName("Self join")
.master("local[*]")
.getOrCreate();
Dataset<Row> inputDf = createDataframe(spark);
inputDf.show(false);
Dataset<Row> left = inputDf.withColumnRenamed("dst", "dst2");
left.show();
Dataset<Row> right = inputDf.withColumnRenamed("src", "dst2");
right.show();
Dataset<Row> r = left.join(
right,
left.col("dst2").equalTo(right.col("dst2")));
r.show();
Dataset<Row> resultOption1Df = r.select(left.col("src"), r.col("dst"));
resultOption1Df.show();
Dataset<Row> resultOption2Df = r.select(col("src"), col("dst"));
resultOption2Df.show();
Dataset<Row> resultOption3Df = r.select("src", "dst");
resultOption3Df.show();
}
private static Dataset<Row> createDataframe(SparkSession spark) {
StructType schema = DataTypes.createStructType(new StructField[] {
DataTypes.createStructField(
"src",
DataTypes.StringType,
false),
DataTypes.createStructField(
"predicate",
DataTypes.StringType,
false),
DataTypes.createStructField(
"dst",
DataTypes.StringType,
false) });
List<Row> rows = new ArrayList<>();
rows.add(RowFactory.create("a", "r1", ":b1"));
rows.add(RowFactory.create("a", "r2", "k"));
rows.add(RowFactory.create("b1", "r3", ":b4"));
rows.add(RowFactory.create("b1", "r10", "d"));
rows.add(RowFactory.create(":b4", "r4", "f"));
rows.add(RowFactory.create(":b4", "r5", ":b5"));
rows.add(RowFactory.create(":b5", "r9", "t"));
rows.add(RowFactory.create(":b5", "r10", "e"));
return spark.createDataFrame(rows, schema);
}
}
I'm trying to replicate a single row from a Dataset n times and create a new Dataset from it. But while replicating, I need one column's value to change for each replica, since it will end up as the primary key when finally stored.
Below is the Scala code from the SO post Replicate Spark Row N-times:
import org.apache.spark.sql.functions._
val result = singleRowDF
.withColumn("dummy", explode(array((1 until 100).map(lit): _*)))
.selectExpr(singleRowDF.columns: _*)
How can I create a column from an array of values in Java and pass it to the explode function? Suggestions are helpful.
Thanks.
This is the Java program to replicate a single row from a Dataset n times.
import static org.apache.spark.sql.functions.col;
import static org.apache.spark.sql.functions.explode;
import static org.apache.spark.sql.functions.lit;
import java.util.ArrayList;
import java.util.List;
import java.util.stream.IntStream;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoders;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import scala.Tuple2;
public class SparkSample{
public static void main(String[] args) {
SparkSession spark = SparkSession
.builder()
.appName("SparkSample")
.master("local[*]")
.getOrCreate();
//Create Dataset
List<Tuple2<String,Double>> inputList = new ArrayList<Tuple2<String,Double>>();
inputList.add(new Tuple2<String,Double>("A",1.0));
Dataset<Row> df = spark.createDataset(inputList, Encoders.tuple(Encoders.STRING(), Encoders.DOUBLE())).toDF();
df.show(false);
//Java 8 style of creating Array. You can create by using for loop as well
int[] array = IntStream.range(0, 5).toArray();
//With Dummy Column
Dataset<Row> df1 = df.withColumn("dummy", explode(lit(array)));
df1.show(false);
//Drop Dummy Column
Dataset<Row> df2 = df1.drop(col("dummy"));
df2.show(false);
}
}
Below is the output of this program.
+---+---+
|_1 |_2 |
+---+---+
|A |1.0|
+---+---+
+---+---+-----+
|_1 |_2 |dummy|
+---+---+-----+
|A |1.0|0 |
|A |1.0|1 |
|A |1.0|2 |
|A |1.0|3 |
|A |1.0|4 |
+---+---+-----+
+---+---+
|_1 |_2 |
+---+---+
|A |1.0|
|A |1.0|
|A |1.0|
|A |1.0|
|A |1.0|
+---+---+
I have two fields of java.sql.Timestamp type in my dataframe and I want to find the number of days between these two columns.
Below is the format of my data: 2016-12-23 23:56:02.0 (yyyy-MM-dd HH:mm:ss.S)
I have tried lots of methods but did not find any solution. Can anyone help here?
org.apache.spark.sql.functions is a treasure trove. For example, there is the datediff method that does exactly what you want: here is the ScalaDoc.
An example:
val spark: SparkSession = ??? // your spark session
val sc: SparkContext = ??? // your spark context
import spark.implicits._ // to better work with spark sql
import java.sql.Timestamp
final case class Data(id: Int, from: Timestamp, to: Timestamp)
val ds =
spark.createDataset(sc.parallelize(Seq(
Data(1, Timestamp.valueOf("2017-01-01 00:00:00"), Timestamp.valueOf("2017-01-11 00:00:00")),
Data(2, Timestamp.valueOf("2017-01-01 00:00:00"), Timestamp.valueOf("2017-01-21 00:00:00")),
Data(3, Timestamp.valueOf("2017-01-01 00:00:00"), Timestamp.valueOf("2017-01-23 00:00:00")),
Data(4, Timestamp.valueOf("2017-01-01 00:00:00"), Timestamp.valueOf("2017-01-07 00:00:00"))
)))
import org.apache.spark.sql.functions._
ds.select($"id", datediff($"from", $"to")).show()
By running this snippet you would end up with the following output:
+---+------------------+
| id|datediff(from, to)|
+---+------------------+
| 1| -10|
| 2| -20|
| 3| -22|
| 4| -6|
+---+------------------+
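Note that datediff(end, start) counts the days from start to end, which is why the values above are negative; swapping the arguments yields positive day counts:
ds.select($"id", datediff($"to", $"from")).show()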
SOLVED
OK, I just messed up the neo4j-server.properties config file: I should not have written the db path using quotes ("...").
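For example, with the (hypothetical) store path used later in the question, the property should read
org.neo4j.server.database.location={some_path}/neo4j-test-store
rather than
org.neo4j.server.database.location="{some_path}/neo4j-test-store"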
I have created a Neo4j database using Java's batch inserter and I am trying to access it with py2neo. Here's my Java code:
///opt/java/64/jdk1.6.0_45/bin/javac -classpath $HOME/opt/usr/neo4j-community-1.8.2/lib/*:. neo_batch.java
///opt/java/64/jdk1.6.0_45/bin/java -classpath $HOME/opt/usr/neo4j-community-1.8.2/lib/*:. neo_batch
import org.neo4j.graphdb.GraphDatabaseService;
import org.neo4j.graphdb.Node;
import org.neo4j.graphdb.Transaction;
import org.neo4j.graphdb.factory.GraphDatabaseFactory;
import org.neo4j.graphdb.index.Index;
import java.io.File;
import java.io.IOException;
import java.io.InputStream;
import java.io.Writer;
import java.util.HashMap;
import java.util.Map;
import java.lang.Long;
import org.neo4j.graphdb.Direction;
import org.neo4j.graphdb.DynamicRelationshipType;
import org.neo4j.graphdb.GraphDatabaseService;
import org.neo4j.graphdb.Node;
import org.neo4j.graphdb.RelationshipType;
import org.neo4j.helpers.collection.MapUtil;
import org.neo4j.unsafe.batchinsert.BatchInserter;
import org.neo4j.unsafe.batchinsert.BatchInserterImpl;
import org.neo4j.unsafe.batchinsert.BatchInserters;
import org.neo4j.unsafe.batchinsert.BatchInserterIndex;
import org.neo4j.unsafe.batchinsert.BatchInserterIndexProvider;
import org.neo4j.unsafe.batchinsert.LuceneBatchInserterIndexProvider;
public class neo_batch{
private static final String KEY = "id";
public static void main(String[] args) {
//create & connect 2 neo db folder
String batch_dir = "neo4j-batchinserter-store";
BatchInserter inserter = BatchInserters.inserter( batch_dir );
//set up neo index
BatchInserterIndexProvider indexProvider =
new LuceneBatchInserterIndexProvider( inserter );
BatchInserterIndex OneIndex =
indexProvider.nodeIndex( "one", MapUtil.stringMap( "type", "exact" ) );
OneIndex.setCacheCapacity( "id", 100000 );
//batchin graph, index and relationships
RelationshipType linked = DynamicRelationshipType.withName( "linked" );
for (int i=0;i<10;i++){
System.out.println(i);
long Node1 = createIndexedNode(inserter, OneIndex, i);
long Node2 = createIndexedNode(inserter, OneIndex, i+10);
inserter.createRelationship(Node1, Node2, linked, null);
}
indexProvider.shutdown();
inserter.shutdown();
}
// START SNIPPET: helperMethods
private static long createIndexedNode(BatchInserter inserter,BatchInserterIndex OneIndex,final long id)
{
Map<String, Object> properties = new HashMap<String, Object>();
properties.put( KEY, id );
long node = inserter.createNode( properties );
OneIndex.add( node, properties);
return node;
}
// END SNIPPET: helperMethods
}
Then I modify the neo4j-server.properties config file accordingly and run neo4j start.
The following Python code suggests the graph is empty:
from py2neo import neo4j
graph = neo4j.GraphDatabaseService("http://localhost:7474/db/data/")
graph.size()
Out[8]: 0
graph.get_indexed_node("one",'id',1)
What's wrong with my approach? Thanks.
EDIT
Nor can I count the nodes with a Cypher query:
neo4j-sh (?)$ START n=node(*)
> return count(*);
+----------+
| count(*) |
+----------+
| 0 |
+----------+
1 row
0 ms
EDIT 2
I can check with the Java API that the indexes and nodes exist:
private static void query_batched_db(){
GraphDatabaseService graphDb = new GraphDatabaseFactory().newEmbeddedDatabase( batch_dir);
IndexManager indexes = graphDb.index();
boolean oneExists = indexes.existsForNodes("one");
System.out.println("Does the 'one' index exists: "+oneExists);
System.out.println("list indexes: "+graphDb.index().nodeIndexNames());
//search index 'one'
Index<Node> oneIndex = graphDb.index().forNodes( "one" );
for (int i=0;i<25;i++){
IndexHits<Node> hits = oneIndex.get( KEY, i );
System.out.println(hits.size());
}
graphDb.shutdown();
}
Where the output is
Does the 'one' index exists: true
list indexes: [Ljava.lang.String;@26ae533a
1
1
...
1
1
0
0
0
0
0
Now if I populate the graph using Python, I cannot access those nodes with the previous Java method (it will count the 20 batch-inserted nodes again).
from py2neo import neo4j
graph = neo4j.GraphDatabaseService("http://localhost:7474/db/data/")
idx=graph.get_or_create_index(neo4j.Node,"idx")
for k in range(100):
graph.get_or_create_indexed_node('idx','id',k,{'id':k})
EDIT 3
Now I delete the store I created with the batch inserter, namely neo4j-test-store, while the neo4j-server.properties config file continues to point to the deleted store, namely org.neo4j.server.database.location="{some_path}/neo4j-test-store".
Now if I run a Cypher count, I get 100, 100 being the number of nodes I inserted using py2neo.
I am going crazy with this stuff!