How to pipe() an RDD grouped by key? - java

I have done the following workflow so far:
1) JavaPairRDD< Integer, String > aRDD = fooRDD.mapToPair( )
2) JavaPairRDD< Integer, Iterable< String > > bRDD = aRDD.groupByKey( )
3) JavaPairRDD< Integer, List<String> > cRDD = bRDD.mapToPair( )
Now I have a problem: I need to call cRDD.pipe("myscript.sh"), but I noticed myscript.sh is receiving the whole list for each key at once.
The long version: there is a bash script that takes each group of lines and creates a PDF with the data. So bRDD groups lines by key, cRDD sorts and removes some undesirable data inside each group, and the next step is to create one PDF report for each data group.
I'm thinking of converting the List<String> representing each group's content into a new JavaPairRDD< Integer, String > per group, but I don't know how to do this, or even whether this is the correct way to proceed.
Example:
(1,'foo,b,tom'), (1,'bar,c,city'), (1,'fly,Marty'), (2,'newFoo,Jerry'), (2,'newBar,zed,Mark'), (2,'newFly,boring,data') (2,'jack,big,deal')
After groupBy:
(1, 'foo,b,tom','bar,c,city','fly,Marty')
(2, 'newFoo,Jerry','newBar,zed,Mark','newFly,boring,data','jack,big,deal')
How `myscript.sh` is receiving the data (note: one string for the entire group):
(1,['foo,b,tom,bar,c,city,fly,Marty'])
(2,['newFoo,Jerry,newBar,zed,Mark,newFly,boring,data,jack,big,deal'])
how I'm expecting to receive:
For partition 1 or worker 1:
1,'foo,b,tom'
1,'bar,c,city'
1,'fly,Marty'
For partition 2 or worker 2:
2,'newFoo,Jerry'
2,'newBar,zed,Mark'
2,'newFly,boring,data'
2,'jack,big,deal'
So I can process one line at a time while still keeping the group together, ensuring that group 1 goes into one PDF report and group 2 into another. The major problem is that each data line is already comma-separated, so I can't tell where one line value ends and the next begins once all the lines of a group are merged into a single comma-separated string.
I'm working with Java. Please give your answer in Java too.

You can't create an RDD inside an RDD. If you want to process all records belonging to a particular key together, then you shouldn't flatMap the grouped RDDs (bRDD, cRDD) again. Instead, I would suggest changing the separator between the grouped RDDs' (bRDD, cRDD) values to some other character.
e.g.
cRDD.map(s -> {
    StringBuilder sb = new StringBuilder();
    Iterator<String> ite = s._2().iterator();
    while (ite.hasNext()) {
        // change the delimiter to a colon (:) or some other character
        // that never occurs inside the data itself
        sb.append(ite.next()).append(":");
    }
    // the key is an Integer here, so the tuple type must match
    return new Tuple2<Integer, String>(s._1(), sb.toString());
}).pipe("myscript.sh");
In myscript.sh, split the records on the colon (:). I hope this helps.
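For reference, a minimal sketch of the same idea expressed with String.join, since cRDD's values are already a List<String>. The "|" between the key and the joined values is just an arbitrary second delimiter chosen for illustration, on the assumption that it never appears in the data:
// Each piped line then looks like: 1|foo,b,tom:bar,c,city:fly,Marty
JavaRDD<String> piped = cRDD
        .map(t -> t._1() + "|" + String.join(":", t._2()))
        .pipe("myscript.sh");
myscript.sh can then split each incoming line on "|" to recover the key and on ":" to recover the group's original comma-separated lines.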

Related

Amazon S3 Select issue: not supporting line breaks occurring inside fields

I am trying to use Amazon S3 Select to read records from a CSV file, and if a field contains a line break (\n), the record is not being parsed as a single record. The line break inside the field is properly escaped by double quotes as per the standard CSV format.
For example, the below CSV file
Id,Name,Age,FamilyName,Place
p1,Albert Einstein,25,"Einstein
Cambridge",Cambridge
p2,Thomas Edison,30,"Edison
Cardiff",Cardiff
is being parsed as
Line 1 : Id,Name,Age,FamilyName,Place
Line 2 : p1,Albert Einstein,25,"Einstein
Line 3 : Cambridge",Cambridge
Line 4 : p2,Thomas Edison,30,"Edison
Line 5 : Cardiff",Cardiff
Ideally it should have been parsed as given below:
Line 1:
Id,Name,Age,FamilyName,Place
Line 2:
p1,Albert Einstein,25,"Einstein
Cambridge",Cambridge
Line 3:
p2,Thomas Edison,30,"Edison
Cardiff",Cardiff
I'm setting AllowQuotedRecordDelimiter to TRUE in the SelectObjectContentRequest as given in their documentation. It's still not working.
Does anyone know if Amazon S3 Select supports line breaks inside fields as described in the case above? Or are there any other parameters I need to change or set to make this work?
This is being parsed and printed correctly; the confusion comes from the literal newline being printed in the output. You can test this by running the following expression on the given CSV:
SELECT COUNT(*) from s3Object s
Output: 2
Note that if you select only the fourth column, you get just the correct values:
SELECT s._4 from s3Object s
You get only the parts of each record that make up that field:
"Einstein
Cambridge"
"Edison
Cardiff"
What's happening is that the character inside the field is the same as the default CSVOutput.RecordDelimiter value (\n), which causes a clash. If you want to separate each record in a different way, you could add the following to the CSVOutput part of the OutputSerialization:
"RecordDelimiter": "\r\n"
or use some other one- or two-character sequence in place of \r\n.
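For completeness, here is a hedged sketch of how this could look with the AWS SDK for Java v1 (class and method names are from com.amazonaws.services.s3.model; the bucket and object key are placeholders, so adjust for your setup and SDK version):
import com.amazonaws.services.s3.model.*;

// Allow quoted fields to contain \n on input, and use a non-default
// record delimiter on output so record boundaries stay unambiguous.
InputSerialization inputSer = new InputSerialization()
        .withCsv(new CSVInput()
                .withFileHeaderInfo(FileHeaderInfo.USE)
                .withAllowQuotedRecordDelimiter(true));

OutputSerialization outputSer = new OutputSerialization()
        .withCsv(new CSVOutput().withRecordDelimiter("\r\n"));

SelectObjectContentRequest request = new SelectObjectContentRequest()
        .withBucketName("my-bucket")                   // placeholder
        .withKey("people.csv")                         // placeholder
        .withExpression("SELECT * FROM s3Object s")
        .withExpressionType(ExpressionType.SQL)
        .withInputSerialization(inputSer)
        .withOutputSerialization(outputSer);
With \r\n as the output record delimiter, a newline that legitimately appears inside a quoted field no longer collides with the delimiter between records.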

Writing/appending data to a CSV file, column-wise, in Java

I want to write/append data to a CSV file, column-by-column, in below fashion:
query1 query2 query3
data_item1 data_item7 data_item12
data_item2 data_item8 data_item13
data_item3 data_item9 data_item14
data_item4 data_item10
data_item5 data_item11
data_item6
I have the data in a hashMap, with the queryID (i.e. query1,query2) being the key and data_items for the
corresponding queries being the values.
The values(data_items for every query) are in a list.
Therefore, my hash map looks like this :
HashMap<String,List<String>> hm = new HashMap<String,List<String>>();
How can I write this data, column by column, to a CSV as demonstrated above, using Java?
I tried CSVWriter, but couldn't do it. Can anyone please help me out?
CSV files are mostly used to persist data structured like a table, meaning data with columns and rows that belong to a close context.
In your example there seems to be only a very loose connection between query1, 2 and 3, and no horizontal connection between items 1, 7 and 12, or 2, 8 and 13, and so on.
On top of that, writing to files is usually done along rows or lines: you open your file, write one line, then another, and so on.
So to write the data column-wise as you are asking, you have to either restructure the data in your code so that everything written into one line is available when that line is written, or run through your CSV file and its lines several times, each time adding another item to a row. Of course the latter option is very time consuming and would not make much sense.
So I would suggest: if there is really no connection between the data of the three queries, write your data into three different CSV files: query1.csv, query2.csv and query3.csv.
Or, if you do have a horizontal connection, i.e. between items 1, 7 and 12, and so on, write it into one CSV file, organizing the data into rows and columns. Something like:
queryNo columnX columnY columnZ
1 item1 item2 item3
2 item7 item8 item9
3 item12 item13 item14
How to do that is well described in this thread: Java - Writing strings to a CSV file.
Other examples you can also find here https://mkyong.com/java/how-to-export-data-to-csv-file-java/
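As a rough illustration of the "restructure into rows first" idea, here is a minimal sketch. It assumes the hm map from the question and an already opened opencsv-style CSVWriter named writer; columns shorter than the longest one are padded with empty strings:
// assumes: import java.util.*; and an opened com.opencsv.CSVWriter named writer
List<String> keys = new ArrayList<>(hm.keySet());                        // column order
int maxRows = hm.values().stream().mapToInt(List::size).max().orElse(0); // longest value list

writer.writeNext(keys.toArray(new String[0]));                           // header row: query1 query2 query3
for (int row = 0; row < maxRows; row++) {
    String[] line = new String[keys.size()];
    for (int col = 0; col < keys.size(); col++) {
        List<String> values = hm.get(keys.get(col));
        line[col] = row < values.size() ? values.get(row) : "";          // pad short columns
    }
    writer.writeNext(line);
}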
After days of tinkering around, I finally succeeded. Here is the implementation:
for (int k = 0; k < maxRows; k++) {
    List<String> rowValues = new ArrayList<String>();
    for (int i = 0; i < queryIdListArr.length; i++) {
        // the i-th value list, wrapped as a one-element list of lists
        List<List<String>> subList = qValuesList.subList(i, i + 1);
        List<String> subList2 = subList.stream().flatMap(List::stream).collect(Collectors.toList());
        if (subList2.size() <= k) {
            rowValues.add("");               // no value at this index for this query
        } else {
            rowValues.add(subList2.get(k));
        }
    }
    String[] rowValuesArr = new String[rowValues.size()];
    rowValuesArr = rowValues.toArray(rowValuesArr);
    // System.out.println(rowValues);
    writer.writeNext(rowValuesArr);
}
maxRows : the size of the longest value list. I have a list of values for each key. My hash map looks like this:
HashMap<String,List<String>> hm = new HashMap<String,List<String>>();
queryIdListArr : array of all the query IDs (keys) obtained from the hash map.
qValuesList : list of all the value lists.
List<List<String>> qValuesList = new ArrayList<List<String>>();
subList2 : sublist obtained from qValuesList using the syntax below:
qValuesList.subList(i, i+1);
rowValuesArr is an array that gets populated with the value at index k fetched from each list in qValuesList.
The idea is to fetch the value at each index from all the sublists and then write those values to the row. If no value is found at that index, write an empty string.

How can I write values in a column using a comma

As per my JMeter test plan, I am saving the following information into a CSV file using a Beanshell PostProcessor:
username = vars.get("username");
password = vars.get("password");
f = new FileOutputStream("/path/user_details.csv", true);
p = new PrintStream(f);
this.interpreter.setOut(p);
print(username + "," + password);
f.close();
How can I save those values into a single column using a comma (username,password)?
Put double quotes around the entire string, so that the comma becomes part of a single field's value rather than a column separator.
In practice the character you use as the column separator and the character you use for quoting are both configurable in a CSV library (which should almost always be used instead of trying to get the syntax details right on your own).
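For example, a minimal tweak to the Beanshell snippet from the question, quoting the pair so the embedded comma is treated as data rather than a separator:
username = vars.get("username");
password = vars.get("password");

f = new FileOutputStream("/path/user_details.csv", true);
p = new PrintStream(f);
this.interpreter.setOut(p);
// wrap the pair in double quotes so the embedded comma stays inside one column
print("\"" + username + "," + password + "\"");
f.close();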

How to limit a Trident DRPC result to contain the fields of only the last function of the topology?

I've got a simple Trident topology running in a LocalDRPC where one of the functions outputs the result field, but when I run it, the results I get back seem to contain all the information from every tuple, instead of just the result field as I would have expected from the DRPC docs. E.g.:
[["http:\/\/www.smbc-comics.com\/rss.php",http://www.smbc-comics.com/rss.php,[#document: null],[item: null],[link: null],[description: null],http://feedproxy.google.com/~r/smbc-comics/PvLb/~3/CBpJmAiJSxs/index.php,http://www.smbc-comics.com/comics/20141001.png,"http:\/\/www.smbc-comics.com\/comics\/20141001.png"], ...]
It would be okay to get all the information from every tuple back, but there's no indication of which of the fields is called result. As it stands it's not even valid JSON!
So how can I extract the value that corresponds to a specific field that I specified in the topology?
Storm returns every field that was produced during the execution chain in a JSON array. The order of the values is the same as the order in which they were produced, so if you are interested only in the result of the last function, read only the last value from the array. If for any reason you are not interested in the intermediate results, you can limit the output with the projection method. For example, if you have a stream:
stream.each(new Fields("args"), new AddExclamation(), new Fields(EX_1))
.each(new Fields(EX_1), new AddPlus(), new Fields(P1, P2));
that returns
[["hello","hello!1","hello!1+1","hello!1+2"],["hello","hello!2","hello!2+1","hello!2+2"]]
then by setting a projection you can limit the output to P2:
stream.each(new Fields("args"), new AddExclamation(), new Fields(EX_1))
.each(new Fields(EX_1), new AddPlus(), new Fields(P1, P2))
.project(new Fields(P2));
so the output will be only this
[["hello!1+2"],["hello!2+2"]]
You can see this in action here:
https://github.com/ExampleDriven/storm-example/blob/master/src/test/java/org/exampledriven/ExclamationPlusTridentTopologyTest.java

Load Social Network Data into Neo4J

I have a dataset similar to Twitter's graph. The data is in the following form:
<user-id1> <List of ids which he follows separated by spaces>
<user-id2> <List of ids which he follows separated by spaces>
...
I want to model this in the form of a unidirectional graph, expressed in the cypher syntax as:
(A:Follower)-[:FOLLOWS]->(B:Followee)
The same user can appear more than once in the dataset, since he might be in the friend list of more than one person, and he might also have his own friend list as part of the dataset. The challenge here is to make sure that there are no duplicate nodes for any user. And if a user appears as both a Follower and a Followee in the dataset, then the node should carry both labels, i.e., Follower:Followee. There are about 980k nodes in the graph and the size of the dataset is 1.4 GB.
I am not sure if Cypher's LOAD CSV will work here, because each line of the dataset has a variable number of columns, making it impossible to write a query that generates the nodes for each of the columns. So what would be the best way to import this data into Neo4j without creating any duplicates?
I actually did exactly the same thing for the Friendster dataset, which has almost the same format as yours.
There the separator for the many friends was ":".
The queries I used there are these:
create index on :User(id);
USING PERIODIC COMMIT 1000
LOAD CSV FROM "file:///home/michael/import/friendster/friends-000______.txt" as line FIELDTERMINATOR ":"
MERGE (u1:User {id:line[0]})
;
USING PERIODIC COMMIT 1000
LOAD CSV FROM "file:///home/michael/import/friendster/friends-000______.txt" as line FIELDTERMINATOR ":"
WITH line[1] as id2
WHERE id2 <> '' AND id2 <> 'private' AND id2 <> 'notfound'
UNWIND split(id2,",") as id
WITH distinct id
MERGE (:User {id:id})
;
USING PERIODIC COMMIT 1000
LOAD CSV FROM "file:///home/michael/import/friendster/friends-000______.txt" as line FIELDTERMINATOR ":"
WITH line[0] as id1, line[1] as id2
WHERE id2 <> '' AND id2 <> 'private' AND id2 <> 'notfound'
MATCH (u1:User {id:id1})
UNWIND split(id2,",") as id
MATCH (u2:User {id:id})
CREATE (u1)-[:FRIEND_OF]->(u2)
;
