Java, Python - How to convert Java flatMap into Python LinkedList

I am working on a formulation of the transportation problem using linear programming. I searched the web and found code for it written in Java, but I have to write everything in Python, so I am converting it. I don't claim to be good at Java, nor at Python. I have converted most of it and everything is fine, but I don't know how to convert the snippet below, which uses Java's LinkedList and Stream APIs.
static LinkedList<Shipment> matrixToList() {
return stream(matrix)
.flatMap(row -> stream(row))
.filter(s -> s != null)
.collect(toCollection(LinkedList::new));
}
If you are interested in how I converted the Java code linked above (including the Shipment class), here is my (incomplete) Python code:
import sys
class TransportationProblem:
demand = list()
supply = list()
costs = list(list())
matrix = list(list())
def __init__(self):
pass
class Shipment:
costPerUnit = 0.0
quantity = 0.0
r = 0
c = 0
def __init__(self, quantity, costPerUnit, r, c):
self.quantity = quantity
self.costPerUnit = costPerUnit
self.r = r
self.c = c
def init(self, f_name=""):
try:
with open(f_name) as f:
val = [int(x) for x in f.readline().strip().split(' ')]
numSources, numDestinations = val[0], val[1]
src = list()
dst = list()
val = [int(x) for x in f.readline().strip().split(' ')]
for i in range(0,numSources):
src.append(val[i])
val = [int(x) for x in f.readline().strip().split(' ')]
for i in range(0, numDestinations):
dst.append(val[i])
totalSrc = sum(src)
totalDst = sum(dst)
if totalSrc > totalDst:
dst.append(totalSrc - totalDst)
elif totalDst > totalSrc:
src.append(totalDst - totalSrc)
self.supply = src
self.demand = dst
self.costs = [[0 for j in range(len(dst))] for i in range(len(src))]
self.matrix = [[None for j in range(len(dst))] for i in range(len(src))]
for i in range(0,len(src)):
val = [int(x) for x in f.readline().strip().split(' ')]
for j in range(0, len(dst)):
self.costs[i][j] = val[j]
print self.costs
except IOError:
print "Error: can\'t find file or read data"
def northWestCornerRule(self):
northwest = 0
for r in range(0, len(self.supply)):
for c in range(northwest, len(self.demand)):
quantity = min(self.supply[r], self.demand[c])
if quantity > 0:
self.matrix[r][c] = self.Shipment(quantity=quantity, costPerUnit=self.costs[r][c], r=r, c=c)
self.supply[r] = self.supply[r] - quantity
self.demand[c] = self.demand[c] - quantity
if self.supply[r] == 0:
northwest = c
break
def steppingStone(self):
maxReduction = 0
move = None
leaving = None
self.fixDegenerateCase()
for r in range(0,len(self.supply)):
for c in range(0,len(self.demand)):
if self.matrix[r][c] is not None:
continue
trail = self.Shipment(quantity=0, costPerUnit=self.costs[r][c], r=r, c=c)
path = self.getClosedPath(trail)
reduction = 0
lowestQuantity = sys.maxint
leavingCandidate = None
plus = True
for s in path:
if plus == True:
reduction = reduction + s.costPerUnit
else:
reduction = reduction - s.costPerUnit
if s.quantity < lowestQuantity:
leavingCandidate = s
lowestQuantity = s.quantity
plus = not plus
if reduction < maxReduction:
move = path
leaving = leavingCandidate
maxReduction = reduction
if move is not None:
q = leaving.quantity
plus = True
for s in move:
s.quantity = s.quantity + q if plus else s.quantity - q
self.matrix[s.r][s.c] = None if s.quantity == 0 else s
plus = not plus
self.steppingStone()
def fixDegenerateCase(self):
pass
def getClosedPath(self, s):
pass
def matrixToList(self):
pass

We can break this into steps. You start with a matrix variable, which is some iterable that contains iterables of type Shipment.
Streaming a collection means processing its elements one at a time through a pipeline of operations.
A map on a stream takes each element, say of type A, and transforms it into some type B. A flatMap is the special case used when the mapping itself produces a Stream<B>: flatMap concatenates those streams into a single stream.
Say each A maps to a stream of 3 objects {A1, A2} -> {{B11, B12, B13}, {B21, B22, B23}}
flatMap will make this one stream {A1, A2} -> {B11, B12, B13, B21, B22, B23}
In this case a matrix produces a stream of row objects. Each row is mapped into a stream of Shipment and flatMap is used to concatenate them.
Finally, filter is used to remove empty shipments (i.e. null values), and collect transforms the stream of Shipment into a LinkedList.
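In Python, the closest analogue of flatMap is itertools.chain.from_iterable (or an equivalent nested comprehension). As a rough illustration of the flattening step above (the names here are made up for the example):
from itertools import chain

rows = [["B11", "B12", "B13"], ["B21", "B22", "B23"]]  # what map alone would give you
flat = list(chain.from_iterable(rows))                 # what flatMap gives you
# flat == ["B11", "B12", "B13", "B21", "B22", "B23"]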
Recreating this in Java without streams might look like below:
static LinkedList<Shipment> matrixToList() {
LinkedList<Shipment> result = new LinkedList<>();
for (Shipment[] row : matrix) {
for (Shipment shipment : row) {
if (shipment != null) {
result.add(shipment);
}
}
}
return result;
}
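Since your end goal is Python, the same logic collapses to a single comprehension. A minimal sketch, assuming matrix is the list of lists from your TransportationProblem class and that empty cells hold None:
def matrixToList(self):
    # Flatten the matrix row by row and drop the empty (None) cells,
    # mirroring stream -> flatMap -> filter -> collect in the Java version.
    return [s for row in self.matrix for s in row if s is not None]
A plain Python list is the usual stand-in for Java's LinkedList here; if you specifically need cheap removal from both ends, you can wrap the same comprehension in collections.deque.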

Related

Sorting list of vector clocks (total order)?

I understand that vector clocks only provide a partial order. So you can't directly sort them. For this reason you use a tie-breaker for vectors that are concurrent, resulting in a total order.
However, sorting the vector clocks so that every cause comes before every effect in the resulting list doesn't seem to work, and I don't entirely understand why.
I have extensive tests that show me that comparing two vectors works:
@Override
public int compareTo(VectorClock<K> that) {
var res = 0;
if (this.isAfter(that))
res = 1;
else if (that.isAfter(this))
res = -1;
else
res = this.timestamp.compareTo(that.timestamp);
System.out.println("compare " + this + " : " + that + " => " + res);
return res;
}
public boolean isAfter(VectorClock<K> that) {
boolean anyClockGreater = false;
var set = new HashSet<K>();
set.addAll(this.keySet());
set.addAll(that.keySet());
for (K key : set) {
final Clock thatClock = that.get(key);
final Clock thisClock = this.get(key);
if (thisClock == null || thisClock.isBefore(thatClock)) {
return false;
} else if (thisClock.isAfter(thatClock)) {
anyClockGreater = true;
}
}
// there is at least one local timestamp greater or local vector clock has additional timestamps
return anyClockGreater || that.entrySet().size() < entrySet().size();
}
However, when sorting a list of vector clocks (e.g. two vectors with a happened-before relationship plus a third vector that is concurrent to both), it may happen that only the concurrent one is compared against the other two, so the vectors that depend on each other are never compared directly. Instead, their order is (wrongly) decided transitively by the tie-breaker:
VectorClock<String> v1 = VectorClock.fromString("{0=23, 1=28, 2=15, 3=23, 4=15, 5=22, 6=14, 7=19}"); // after v3
VectorClock<String> v2 = VectorClock.fromString("{0=11, 1=16, 2=28, 3=17, 4=24, 5=15, 6=10, 7=8}");
VectorClock<String> v3 = VectorClock.fromString("{0=15, 1=19, 2=15, 3=20, 4=15, 5=22, 6=14, 7=19}"); // before v1
var s = new ArrayList<>(List.of(v1, v2, v3));
s.sort(VectorClock::compareTo);
assertTrue(s.indexOf(v3) < s.indexOf(v1));
Prints (and fails):
compare {0=11, 1=16, 2=28, 3=17, 4=24, 5=15, 6=10, 7=8} : {0=23, 1=28, 2=15, 3=23, 4=15, 5=22, 6=14, 7=19} => 1
compare {0=15, 1=19, 2=15, 3=20, 4=15, 5=22, 6=14, 7=19} : {0=11, 1=16, 2=28, 3=17, 4=24, 5=15, 6=10, 7=8} => 1
What is the underlying reason for this? Is this generally impossible or is there an error?

Why is my Scala imperative-style map creation snippet slower than the Java one?

I'm new to Scala. I'm trying to create a big map from an IndexedSeq, and I found a mention on SO that functional-style map creation is much slower than the imperative Java style, so I decided to test it myself. So far I have found that not only the functional-style Scala code is slower, but the imperative style is too. What am I doing wrong, and why is my Scala code several times slower? On my home computer it runs in 220 ms (Java) and 460 ms (Scala).
Scala version
private val testSize: Int = 1000000
val seq: IndexedSeq[Int] = for (i <- 0 until testSize) yield Random.nextInt()
val warmupMapt0 = System.nanoTime()
var warmupMap: mutable.HashMap[Int, Int] = new mutable.HashMap[Int, Int]
warmupMap.sizeHint(testSize)
for (i <- 0 until testSize) warmupMap.put(i, seq(i))
val t0 = System.nanoTime()
var map: mutable.HashMap[Int, Int] = new mutable.HashMap[Int, Int]
map.sizeHint(testSize)
for (i <- 0 until testSize) map.put(i, seq(i))
println((System.nanoTime() - t0)/ 1000000 + " ms.")
Java version
private static final int TEST_SIZE = 1_000_000;
public static void main(String[] args) {
int[] ar = new int[TEST_SIZE];
Random random = new Random();
for (int i = 0; i < TEST_SIZE; i++) {
ar[i] = random.nextInt();
}
Map<Integer, Integer> warmupMap = new HashMap<>(TEST_SIZE);
for (int i = 0; i < TEST_SIZE; i++) {
warmupMap.put(i, ar[i]);
}
Map<Integer, Integer> map = new HashMap<>(TEST_SIZE);
long t0 = System.nanoTime();
for (int i = 0; i < TEST_SIZE; i++) {
map.put(i, ar[i]);
}
System.out.println((System.nanoTime() - t0) / 1_000_000 + " ms.");
}
I think one source of the problem is the use of an IndexedSeq. It is by default implemented by Vector, which is generally a smart collection, but in your case it adds quite a large constant factor both to creating the "array" of numbers and to accessing them by index. If you would like your code to be more equivalent to the Java counterpart, the following would be closer:
val ar = new Array[Int](testSize)
for (i <- 0 until testSize) ar(i) = Random.nextInt()
I read somewhere about foreach-loop optimisation (I can't find where now), but basically, given enough warmup runs, a foreach loop should have efficiency similar to a while loop, provided the function passed to it can be inlined.
Edit
Code can be further simplified:
val ar = Array.fill(testSize)(Random.nextInt())
Proposed by Alexey Romanov in a comment.
It's probably the for comprehension. In Scala, for comprehensions work very differently from for loops in Java and produce code that the JVM can't optimize well enough. See e.g. http://www.scalablescala.com/roller/scala/entry/why_is_using_for_foreach or http://downloads.typesafe.com/website/presentations/ScalaDaysSF2015/T2_Rytz_Backend_Optimizer.pdf (starting with slide 37). You can work around this by using a while loop or a macro library such as http://scala-blitz.github.io/, https://github.com/non/spire or https://github.com/nativelibs4java/scalaxy-streams.

Fisher's method to combine p-values in Java

I have two genes with their p-values, and I have to implement Fisher's method on them using Java (it's a requirement). I did extensive research but couldn't find any help.
There is also a built-in implementation of Fisher's method in R, which is as follows:
function (p)
{
keep <- (p > 0) & (p <= 1)
lnp <- log(p[keep])
chisq <- (-2) * sum(lnp)
df <- 2 * length(lnp)
if (sum(1L * keep) < 2)
stop("Must have at least two valid p values")
if (length(lnp) != length(p)) {
warning("Some studies omitted")
}
res <- list(chisq = chisq, df = df, p = pchisq(chisq, df,
lower.tail = FALSE), validp = p[keep])
class(res) <- c("sumlog", "metap")
res
}
Can anyone help me understand this code or, if possible, share an implementation of Fisher's method in Java?
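For reference, the computation the R function performs is: keep only p-values in (0, 1], compute chisq = -2 * sum(ln p) and df = 2 * (number of p-values kept), then take the upper-tail chi-square probability. A rough Python sketch of those same steps (scipy is used here only for the chi-square tail; this is not the requested Java code, but a Java port would do the same arithmetic with a chi-square distribution from any statistics library):
import math
from scipy.stats import chi2  # assumes scipy is available

def fisher_combine(pvalues):
    # Keep only valid p-values in (0, 1], like the `keep` mask in the R code.
    valid = [p for p in pvalues if 0 < p <= 1]
    if len(valid) < 2:
        raise ValueError("Must have at least two valid p values")
    chisq = -2.0 * sum(math.log(p) for p in valid)
    df = 2 * len(valid)
    # Upper-tail probability, matching pchisq(chisq, df, lower.tail = FALSE).
    combined_p = chi2.sf(chisq, df)
    return chisq, df, combined_p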

Mapping of elements gone bad

I am implementing k-means and I want to create the new centroids. But the mapping leaves one element out! However, when K has a smaller value, like 15, it works fine.
Based on that, here is the code I have:
val K = 25 // number of clusters
val data = sc.textFile("dense.txt").map(
t => (t.split("#")(0), parseVector(t.split("#")(1)))).cache()
val count = data.count()
println("Number of records " + count)
var centroids = data.takeSample(false, K, 42).map(x => x._2)
do {
var closest = data.map(p => (closestPoint(p._2, centroids), p._2))
var pointsGroup = closest.groupByKey()
println(pointsGroup)
pointsGroup.foreach { println }
var newCentroids = pointsGroup.mapValues(ps => average(ps.toSeq)).collectAsMap()
//var newCentroids = pointsGroup.mapValues(ps => average(ps)).collectAsMap() this will produce an error
println(centroids.size)
println(newCentroids.size)
for (i <- 0 until K) {
tempDist += centroids(i).squaredDist(newCentroids(i))
}
..
and in the for loop I get an error that the key can't be found (the missing key is not always the same and depends on K):
java.util.NoSuchElementException: key not found: 2
Output before the error comes up:
Number of records 27776
ShuffledRDD[5] at groupByKey at kmeans.scala:72
25
24 <- IT SHOULD BE 25
What is the problem?
>>> println(newCentroids)
Map(23 -> (-0.0050852959701492536, 0.005512245104477607, -0.004460964477611937), 17 -> (-0.005459583045685268, 0.0029015278781725795, -8.451635532994901E-4), 8 -> (-4.691649213483123E-4, 0.0025375451685393366, 0.0063490755505617585), 11 -> (0.30361112034069937, -0.0017342255382385204, -0.005751167731061906), 20 -> (-5.839587918939964E-4, -0.0038189763756820145, -0.007067070459859708), 5 -> (-0.3787612396704685, -0.005814121628643806, -0.0014961713117870657), 14 -> (0.0024755681263616547, 0.0015191503267973836, 0.003411769193899781), 13 -> (-0.002657690932944597, 0.0077671050923225635, -0.0034652379980563263), 4 -> (-0.006963114731610361, 1.1751361829025871E-4, -0.7481135105367823), 22 -> (0.015318187079953534, -1.2929035958285013, -0.0044176372190034684), 7 -> (-0.002321059060773483, -0.006316359116022083, 0.006164669723756913), 16 -> (0.005341800955165691, -0.0017540737037037035, 0.004066574093567247), 1 -> (0.0024547379611650484, 0.0056298656504855955, 0.002504618082524296), 10 -> (3.421068671121009E-4, 0.0045169004751299275, 5.696239049740164E-4), 19 -> (-0.005453716071428539, -0.001450277556818192, 0.003860007248376626), 9 -> (-0.0032921685273631807, 1.8477108457711313E-4, -0.003070412228855717), 18 -> (-0.0026803160958904053, 0.00913904078767124, -0.0023528013698630146), 3 -> (0.005750011594202901, -0.003607098309178754, -0.003615918896940412), 21 -> (0.0024925166025641056, -0.0037607353461538507, -2.1588444871794858E-4), 12 -> (-7.920202960526356E-4, 0.5390774232894769, -4.928884539473694E-4), 15 -> (-0.0018608492323232324, -0.006973787272727284, -0.0027266663434343404), 24 -> (6.151173211963486E-4, 7.081812613784045E-4, 5.612962808842611E-4), 6 -> (0.005323933953732931, 0.0024014750473186123, -2.969338590956889E-4), 0 -> (-0.0015991676750160377, -0.003001317289659613, 0.5384176139563245))
Question with relevant error: spark scala throws java.util.NoSuchElementException: key not found: 0 exception
EDIT:
After zero323's observation that two centroids were the same, I changed the code so that all centroids are unique. However, the behaviour remains the same, so I suspect that closestPoint() may return the same index for two different centroids. Here is the function:
def closestPoint(p: Vector, centers: Array[Vector]): Int = {
var index = 0
var bestIndex = 0
var closest = Double.PositiveInfinity
for (i <- 0 until centers.length) {
val tempDist = p.squaredDist(centers(i))
if (tempDist < closest) {
closest = tempDist
bestIndex = i
}
}
return bestIndex
}
How can I get around this? I am running the code as I describe in Spark cluster.
It can happen in the "E-step" (the assignment of points to cluster-indices is analogous to the E-step of the EM algorithm) that one of your indices will not be assigned any points. If this happens then you need to have a way of associating that index with some point, otherwise you're going to wind up with fewer clusters after the "M-step" (the assignment of centroids to the indices is analogous to the M-step of the EM algorithm.) Something like this should probably work:
val newCentroids = {
val temp = pointsGroup.mapValues(ps => average(ps.toSeq)).collectAsMap()
val nMissing = K - temp.size
val sample = data.takeSample(false, nMissing, seed)
var c = -1
(for (i <- 0 until K) yield {
val point = temp.getOrElse(i, { c += 1; sample(c)._2 })
(i, point)
}).toMap
}
Just substitute that code for the line you are currently using to compute newCentroids.
There are other ways of dealing with this issue, and the approach above is probably not the best (is it a good idea to call takeSample multiple times, once for each iteration of the k-means algorithm? what if data contains a lot of repeated values? etc.), but it is a simple starting point.
By the way, you might want to think about how you can replace the groupByKey with a reduceByKey.
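As a rough sketch of that idea (written in PySpark syntax purely for illustration; the Scala RDD API offers the same mapValues/reduceByKey/collectAsMap methods), you can sum vectors and counts per cluster index and then divide, so no per-key list of points is ever materialized:
# Hypothetical sketch: `closest` is an RDD of (clusterIndex, vector) pairs,
# where the vectors support elementwise + and / (e.g. numpy arrays).
sums_and_counts = closest.mapValues(lambda v: (v, 1)).reduceByKey(
    lambda a, b: (a[0] + b[0], a[1] + b[1]))
newCentroids = sums_and_counts.mapValues(lambda t: t[0] / t[1]).collectAsMap()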
Note: For the curious, here's a reference describing the similarities between the EM-algorithm and the k-means algorithm: http://papers.nips.cc/paper/989-convergence-properties-of-the-k-means-algorithms.pdf.

Introduce a counter into a loop within scala

I'm writing a small program which will convert a very large file into multiple smaller files, each of which will contain 100 lines.
I'm iterating over a lines iterator:
while (lines.hasNext) {
val line = lines.next()
}
I want to introduce a counter, and when it reaches a certain value, reset it and proceed. In Java I would do something like:
int counter = 0;
while (lines.hasNext()) {
String line = lines.next();
if(counter == 100){
counter = 0;
}
++counter;
}
Is there something similar in Scala, or an alternative method?
Traditionally in Scala you use .zipWithIndex:
scala> List("foo","bar")
res0: List[java.lang.String] = List(foo, bar)
scala> for((x,i) <- res0.zipWithIndex) println(i + " : " +x)
0 : foo
1 : bar
(this will work with your lines too, as long as they are in an Iterator, i.e. something with hasNext and next() methods, or some other Scala collection)
But if you need more complicated logic, like resetting the counter, you can write it the same way as in Java:
var counter = 0
while (lines.hasNext) {
val line = lines.next()
if(counter % 100 == 0) {
// now write to another file
}
counter += 1
}
Maybe you can tell us why you want to reset the counter, so we can suggest a better way to do it?
EDIT
According to your update, this is better done using the grouped method, as @pr1001 proposed:
lines.grouped(100).foreach(l => l.foreach(/* write line to file*/))
If resetting the counter reflects the fact that there are repeated groups of data in the original list, you might want to use the grouped method:
scala> val l = List("one", "two", "three", "four")
l: List[java.lang.String] = List(one, two, three, four)
scala> l.grouped(2).toList
res0: List[List[java.lang.String]] = List(List(one, two), List(three, four))
Update: Since you're reading from a file, you should be able to pretty efficiently iterate over the file:
val bigFile = io.Source.fromFile("/tmp/verybigfile")
val groupedLines = bigFile.getLines.grouped(2).zipWithIndex
groupedLines.foreach(group => {
val (lines, index) = group
val p = new java.io.PrintWriter("/tmp/" + index)
lines.foreach(p.println)
p.close()
})
Of course this could also be written as a for comprehension...
You might even be able to get better performance by converting groupedLines to a parallel collection with .par before writing out each group of lines to its own file.
This would work:
lines grouped 100 flatMap (_.zipWithIndex) foreach {
case (line, count) => //whatever
}
You may use zipWithIndex along with some transformation.
scala> List(10, 20, 30, 40, 50).zipWithIndex.map(p => (p._1, p._2 % 3))
res0: List[(Int, Int)] = List((10,0), (20,1), (30,2), (40,0), (50,1))
