How to use array of doubles as feature vector in Mallet - java

From what I've seen in the documentation and various examples,
the typical workflow with data in Mallet requires you to work with a feature list, which you usually obtain by passing your data through "pipes" while iterating over it with some sort of iterator. The data is usually stored in some CSV file.
I am trying to obtain a feature list from two arrays of doubles.
One array stores the actual features and is of size n x m (where n is the number of features and m is the number of feature vectors); the other is of size 1 x m and contains binary labels. How should I convert these into a feature list, so I can use them in classifiers?

I ended up writing a custom Iterator similar to the one present in Mallet called "ArrayDataAndTargetIterator". I also had to use a pipe defined like this:
new SerialPipes(Arrays.asList(new Target2Label(), new Array2FeatureVector()));
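For illustration, here is a minimal sketch of how such arrays could be fed through that pipe into an InstanceList. The String conversion of the binary labels and the method name are my own assumptions, not part of the original answer:

import cc.mallet.pipe.Array2FeatureVector;
import cc.mallet.pipe.Pipe;
import cc.mallet.pipe.SerialPipes;
import cc.mallet.pipe.Target2Label;
import cc.mallet.types.Instance;
import cc.mallet.types.InstanceList;
import java.util.Arrays;

// features[i] is the i-th feature vector, labels[i] its binary label
static InstanceList toInstanceList(double[][] features, int[] labels) {
    Pipe pipe = new SerialPipes(Arrays.asList(
            new Target2Label(),           // raw label object -> Label
            new Array2FeatureVector()));  // double[] -> FeatureVector
    InstanceList instances = new InstanceList(pipe);
    for (int i = 0; i < features.length; i++) {
        instances.addThruPipe(new Instance(
                features[i],               // data
                String.valueOf(labels[i]), // target
                "instance-" + i,           // name
                null));                    // source
    }
    return instances;
}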

Related

Apache Beam count of unique elements

I have an Apache Beam job which ingests data from PubSub and then loads it into BigQuery. I transform each PubSub message into a POJO with the fields id, name, and count, where count means the number of duplicate (non-unique) elements in a single ingest.
If I load 3 elements from PubSub, two of which are the same, then I need to load 2 elements into BigQuery, and one of them will have count 2.
I wonder how to do this easily in Apache Beam.
I tried to do it via DoFn or MapElements, but there I can only process a single element.
I also tried converting each element to a KV and then counting, but I get a non-deterministic coder.
In a usual Java app I could simply use equals or a Map, but here in Apache Beam everything is different.
The simple and right approach would be to use Count.<T>perElement(), like this:
Pipeline p = ...;
PCollection<T> elements = p.apply(...); // read elements
PCollection<KV<T, Long>> elementsCounts =
    elements.apply(Count.<T>perElement());
PCollection<TableRow> results = elementsCounts.apply(
    ParDo.of(new FormatOutputFn()));
Though, right, you need to have a deterministic element coder for that. So if that's not the case (as I understand from what you said above), you need to add a step before Count to transform each element into a different representation where a deterministic coder is possible (like AvroCoder, for example).
If that's not possible for some reason, another workaround could be to calculate a unique hash for every element (the hash value must be deterministic as well), create a KV for every element with the hash as the key and the element as the value, and use GroupByKey downstream to get a grouped tuple of identical values.
Also, please note that since PubSub is an unbounded source, you need to "window" your input with some windowing strategy (other than the Global one), since all group/combine operations should be done inside a window. Take a look at WindowedWordCount as an example solution for a similar problem.
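As a minimal sketch of the above, assuming String elements (which have a deterministic coder), a placeholder subscription name, and pipeline options created elsewhere:

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.gcp.pubsub.PubsubIO;
import org.apache.beam.sdk.transforms.Count;
import org.apache.beam.sdk.transforms.windowing.FixedWindows;
import org.apache.beam.sdk.transforms.windowing.Window;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.PCollection;
import org.joda.time.Duration;

Pipeline p = Pipeline.create(options); // options created elsewhere
PCollection<KV<String, Long>> counts = p
    .apply("ReadPubSub", PubsubIO.readStrings()
        .fromSubscription("projects/my-project/subscriptions/my-sub")) // placeholder
    .apply("Window", Window.<String>into(FixedWindows.of(Duration.standardMinutes(1))))
    .apply("CountPerElement", Count.<String>perElement());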

How can I store a series of 2 integers with the default Collection APIs in Java

I want to store a series of numbers using the Collection APIs in Java. I don't want to use a standard two-dimensional array, so how can I do the same?
ArrayList<Integer,Integer>
I want to do something like this, so what method should I use?
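One common approach (a sketch of my own, not from the original thread): wrap each pair in a small value class, since ArrayList takes only one type parameter.

import java.util.ArrayList;
import java.util.List;

public class PairStore {
    // A small value class holding one pair of ints.
    static class IntPair {
        final int first;
        final int second;
        IntPair(int first, int second) {
            this.first = first;
            this.second = second;
        }
    }

    public static void main(String[] args) {
        List<IntPair> pairs = new ArrayList<>();
        pairs.add(new IntPair(1, 2));
        pairs.add(new IntPair(3, 4));
        System.out.println(pairs.get(0).first); // prints 1
    }
}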

spark - How to reduce the shuffle size of a JavaPairRDD<Integer, Integer[]>?

I have a JavaPairRDD<Integer, Integer[]> on which I want to perform a groupByKey action.
The groupByKey action gives me a:
org.apache.spark.shuffle.MetadataFetchFailedException: Missing an output location for shuffle
which is practically an OutOfMemory error, if I am not mistaken. This occurs only in big datasets (in my case when "Shuffle Write" shown in the Web UI is ~96GB).
I have set:
spark.serializer org.apache.spark.serializer.KryoSerializer
in $SPARK_HOME/conf/spark-defaults.conf, but I am not sure if Kryo is used to serialize my JavaPairRDD.
Is there something else that I should do to use Kryo, apart from setting this conf parameter, to serialize my RDD? I can see in the serialization instructions that:
Spark automatically includes Kryo serializers for the many commonly-used core Scala classes covered in the AllScalaRegistrar from the Twitter chill library.
and that:
Since Spark 2.0.0, we internally use Kryo serializer when shuffling RDDs with simple types, arrays of simple types, or string type.
I also noticed that when I set spark.serializer to be Kryo, the Shuffle Write in the Web UI increases from ~96GB (with default serializer) to 243GB!
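For reference, a minimal sketch of registering the shuffled classes with Kryo programmatically, the usual complement to setting spark.serializer (the class list here is my assumption, not from the original post):

import org.apache.spark.SparkConf;

SparkConf conf = new SparkConf()
    .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    // register the classes that actually travel through the shuffle
    .registerKryoClasses(new Class<?>[]{Integer[].class});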
EDIT: In a comment, I was asked about the logic of my program, in case groupByKey can be replaced with reduceByKey. I don't think it's possible, but here it is anyway:
Input has the form:
key: index bucket id,
value: Integer array of entity ids in this bucket
The shuffle write operation produces pairs of the form:
key: entityId,
value: Integer array of all entity ids in the same bucket (call them neighbors)
The groupByKey operation gathers all the neighbor arrays of each entity, some possibly appearing more than once (in many buckets).
After the groupByKey operation, I keep a weight for each bucket (based on the number of negative entity ids it contains) and for each neighbor id I sum up the weights of the buckets it belongs to.
I normalize the scores of each neighbor id with another value (let's say it's given) and emit the top-3 neighbors per entity.
The number of distinct keys that I get is around 10 million (around 5 million positive entity ids and 5 million negatives).
EDIT2: I tried using Hadoop's Writables (VIntWritable and VIntArrayWritable extending ArrayWritable) instead of Integer and Integer[], respectively, but the shuffle size was still bigger than the default JavaSerializer.
Then I increased spark.shuffle.memoryFraction from 0.2 to 0.4 (it is deprecated in version 2.1.0, but there is no description of what should be used instead) and enabled off-heap memory, and the shuffle size was reduced by ~20GB. Even though this does what the title asks, I would prefer a more algorithmic solution, or one that includes better compression.
Short Answer: Use fastutil and maybe increase spark.shuffle.memoryFraction.
More details:
The problem with this RDD is that Java needs to store Object references, which consume much more space than primitive types. In this example, I need to store Integers, instead of int values. A Java Integer takes 16 bytes, while a primitive Java int takes 4 bytes. Scala's Int type, on the other hand, is a 32-bit (4-byte) type, just like Java's int, that's why people using Scala may not have faced something similar.
Apart from increasing spark.shuffle.memoryFraction to 0.4, another nice solution was to use the fastutil library, as suggested in Spark's tuning documentation:
The first way to reduce memory consumption is to avoid the Java features that add overhead, such as pointer-based data structures and wrapper objects. There are several ways to do this: Design your data structures to prefer arrays of objects, and primitive types, instead of the standard Java or Scala collection classes (e.g. HashMap). The fastutil library provides convenient collection classes for primitive types that are compatible with the Java standard library.
This made it possible to store each element of the int arrays in my RDD pairs as a primitive int (i.e., using 4 bytes instead of 16 for each element of the array). In my case, I used IntArrayList instead of Integer[]. This made the shuffle size drop significantly and allowed my program to run in the cluster. I also used this library in other parts of the code, where I was making some temporary Map structures. Overall, by increasing spark.shuffle.memoryFraction to 0.4 and using the fastutil library, the shuffle size dropped from 96GB to 50GB (!) using the default Java serializer (not Kryo).
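As an illustration, a sketch of that replacement (the buckets variable stands in for the original (Integer, Integer[]) pair RDD):

import it.unimi.dsi.fastutil.ints.IntArrayList;
import org.apache.spark.api.java.JavaPairRDD;

// Replace the boxed Integer[] values with fastutil's IntArrayList, which
// stores the ids as primitive ints (4 bytes each instead of 16).
JavaPairRDD<Integer, IntArrayList> compact = buckets.mapValues(ids -> {
    IntArrayList list = new IntArrayList(ids.length);
    for (Integer id : ids) {
        list.add(id.intValue()); // unbox once, store as primitive from here on
    }
    return list;
});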
Alternative: I also tried sorting each int array of an RDD pair and storing the deltas using Hadoop's VIntArrayWritable type (smaller numbers use less space than bigger numbers), but this also required registering VIntWritable and VIntArrayWritable in Kryo, which didn't save any space after all. In general, I think Kryo only makes things faster, but does not decrease the space needed, though I am still not sure about that.
I am not marking this answer as accepted yet, because someone else might have a better idea, and because I didn't use Kryo after all, as my original question was asking. I hope reading it will help someone else with the same issue. I will update this answer if I manage to further reduce the shuffle size.
Still not really sure what you want to do. However, since you use groupByKey and say that there is no way to do it with reduceByKey, that makes me more confused.
I think you have rdd = (Integer, Integer[]) and you want something like (Integer, Iterable[Integer[]]), which is why you are using groupByKey.
Anyway, I am not really familiar with Java in Spark, but in Scala I would use reduceByKey to avoid the shuffle with
rdd.mapValues(Iterable(_)).reduceByKey(_ ++ _)
Basically, you want to convert each value to a one-element Iterable of arrays and then concatenate the Iterables together.
I think the best approach that can be recommended here (without more specific knowledge of the input data) is, in general, to use the persist API on your input RDD.
As step one, I'd try calling .persist(MEMORY_ONLY_SER) on the input RDD to lower memory usage (albeit at a certain CPU overhead, which shouldn't be that much of a problem for ints in your case).
If that is not sufficient, you can try .persist(MEMORY_AND_DISK_SER); or, if your shuffle still takes so much memory that the input dataset needs to be made easier on the memory, .persist(DISK_ONLY) may be an option, but one that will strongly deteriorate performance.
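In Java that would look roughly like this (a sketch; input stands for the input RDD):

import org.apache.spark.storage.StorageLevel;

input.persist(StorageLevel.MEMORY_ONLY_SER());        // serialized in memory
// input.persist(StorageLevel.MEMORY_AND_DISK_SER()); // spill to disk if needed
// input.persist(StorageLevel.DISK_ONLY());           // last resort: slow, but light on memory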

Data structure for holding the content of a parsed CSV file

I'm trying to figure out what the best approach would be to parse a CSV file in Java. Each line will have a varying amount of information. For example, the first line can have up to 5 string words (with commas separating them), while the next few lines can have maybe 3 or 6 or whatever.
My problem isn't reading the strings from the file, just to be clear. My problem is: what data structure would be best to hold each line and also each word in that line?
At first I thought about using a 2D array, but the problem with that is that array sizes are static (the second dimension would hold how many words there are in each line, which can differ from line to line).
Here's the first few lines of the CSV file:
0,MONEY
1,SELLING
2,DESIGNING
3,MAKING
DIRECTOR,3DENT95VGY,EBAD,SAGHAR,MALE,05/31/2011,null,0,10000,07/24/2011
3KEET95TGY,05/31/2011,04/17/2012,120050
3LERT9RVGY,04/17/2012,03/05/2013,132500
3MEFT95VGY,03/05/2013,null,145205
DIRECTOR,XKQ84P6CDW,AGHA,ZAIN,FEMALE,06/06/2011,null,1,1000,01/25/2012
XK4P6CDW,06/06/2011,09/28/2012,105000
XKQ8P6CW,09/28/2012,null,130900
DIRECTOR,YGUSBQK377,AYOUB,GRAMPS,FEMALE,10/02/2001,12/17/2007,2,12000,01/15/2002
You could use a Map<Integer, List<String>>, the keys being the line numbers in the CSV file and the List holding the words in each line.
An additional point: you will probably end up using the List#get(int) method quite often. Do not use a LinkedList if this is the case, because get(int) for a linked list is O(n). I think an ArrayList is your best option here.
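A quick sketch of that structure (the line numbers and words are taken from the sample file above):

import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

Map<Integer, List<String>> lines = new HashMap<Integer, List<String>>();
lines.put(0, Arrays.asList("0", "MONEY"));
lines.put(1, Arrays.asList("1", "SELLING"));
String word = lines.get(0).get(1); // "MONEY"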
Edit (based on AlexWien's observation):
In this particular case, since the keys are line numbers, thus yielding a contiguous set of integers, an even better data structure could be ArrayList<ArrayList<String>>. This will lead to faster key retrievals.
Use ArrayList. It is an array with dynamic size.
The best way is to use a CSV parser, like http://opencsv.sourceforge.net/. This parser uses List of String[] to hold data.
Use a List<String>, which can expand dynamically in size.
If you want to have 2 dimensions, use a List<List<String>>.
Here's an example:
List<List<String>> data = new ArrayList<List<String>>();
// split one CSV line on commas and keep the words as one row
List<String> temp = Arrays.asList(someString.split(","));
data.add(temp);
Put this in some kind of loop and collect your data like that.
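For instance, a minimal sketch of that loop (the file name is a placeholder):

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

List<List<String>> data = new ArrayList<List<String>>();
try (BufferedReader reader = new BufferedReader(new FileReader("input.csv"))) {
    String line;
    while ((line = reader.readLine()) != null) {
        data.add(Arrays.asList(line.split(","))); // one row per CSV line
    }
} catch (IOException e) {
    e.printStackTrace();
}
List<String> row = data.get(4); // e.g. the first DIRECTOR line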

Filling in uninitialized array in java? (or workaround!)

I'm currently in the process of creating an OBJ importer for an OpenGL ES Android game. I'm relatively new to Java, so I'm not exactly clear on a few things.
I have an array which will hold the vertices of the model (along with a few other arrays as well):
float vertices[];
The problem is that I don't know how many vertices there are in the model before I read the file using the InputStream given to me.
Would I be able to fill it in as needed, like this?
vertices[95] = 5.004f; // vertices was defined as in the example above
Or do I have to initialize it beforehand?
If the latter is the case, what would be a good way to find out the number of vertices in the file? Once I read it using InputStreamReader.read(), it goes to the next line until it has read the whole file. The only thing I can think of would be to read the whole file, count the number of vertices, then read it AGAIN to fill in the newly initialized array.
Is there a way to dynamically allocate the data as needed?
You can use an ArrayList which will give you the dynamic size that you need.
List<Float> vertices = new ArrayList<Float>();
You can add a value like this:
vertices.add(5.0F);
and the list will grow to suit your needs.
Some things to note: The ArrayList will hold objects, not primitive types. So it stores the float values you provide as Float objects. However, it is easy to get the original float value from this.
If you absolutely need an array then after you read in the entire list of values you can easily get an array from the List.
You can start reading about Java Collections here.
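For instance, a sketch of that final step, converting the vertices list above into a primitive array:

// Copy the boxed Float values into a primitive float[] once reading is done.
float[] vertexArray = new float[vertices.size()];
for (int i = 0; i < vertexArray.length; i++) {
    vertexArray[i] = vertices.get(i); // auto-unboxing Float -> float
}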
In Java, arrays have to be initialized beforehand. In your case you have the following options:
1) Use an ArrayList (or some other implementation of the List interface), as suggested by others. Such lists can grow dynamically, so this will help.
2) If you have control over the file format, add the number of vertices to the beginning of the file, so you can pre-initialize your array with the correct size.
3) If you don't have control over it, try guessing the number of vertices based on the file size (a float is 4 bytes, so maybe divide File.length() by 4, for example). If the guessed number is too small, you can dynamically create a bigger array (say, 120% of the previous array size), then copy all data from the previous array into the new one and carry on, as sketched just below this list. This may be costly, but if your guess of the array size is precise it will not be a problem.
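A minimal sketch of option 3's grow-and-copy step (the variable names are illustrative):

import java.util.Arrays;

// Grow by roughly 20% once the array is full, then keep filling it.
if (count == vertices.length) {
    int newSize = Math.max(vertices.length + 1, vertices.length * 6 / 5);
    vertices = Arrays.copyOf(vertices, newSize); // copies the old data over
}
vertices[count++] = value;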
We might be able to give you more ideas if you provide more information on the file format and/or how this array of vertices is going to be used (e.g., stored for a long time, or thrown away quickly).
No, you can't fill in an uninitialized array.
If you need a dynamic structure that allows storing data + indexes (which seems to be important in your case), I would go for a Map (the key of the Map would be your index):
Map<Integer, Float> vertices = new HashMap<Integer, Float>();
vertices.put(95, 5.004f);
