How do you merge arrays with the same id using Spark DataFrames? - java

My table looks like this
+----+------------------+
| id | c1               |
+----+------------------+
|  1 | ArrayBuffer(a,b) |
|  1 | ArrayBuffer(c)   |
|  2 | ArrayBuffer(d)   |
|  2 | ArrayBuffer(e,f) |
|  2 | ArrayBuffer(g)   |
|  3 | ArrayBuffer(h)   |
+----+------------------+
I want the output to look like this:
+----+----------------------+
| id | c1                   |
+----+----------------------+
|  1 | ArrayBuffer(a,b,c)   |
|  2 | ArrayBuffer(d,e,f,g) |
|  3 | ArrayBuffer(h)       |
+----+----------------------+
Here's what I was thinking:
String sqlQuery = "SELECT table.id, join(table.c1) FROM table GROUP BY table.id";
sqlContext.udf().register("join",
        new UDF1<ArrayBuffer, ArrayBuffer>() {
            @Override
            public ArrayBuffer call(ArrayBuffer idArray) {
                // how do I join them?
                return idArray;
            }
        }, DataTypes.createArrayType(DataTypes.StringType)); // the return type must describe an array, not a string
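Note that a registered UDF like this can't do the merging on its own, since plain UDFs operate on one row at a time; grouping with an aggregate is what combines rows. A minimal sketch, assuming Spark 2.4+ (where the flatten and collect_list built-ins are available) and an existing Dataset<Row> named df holding the table above:

import static org.apache.spark.sql.functions.*;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

// Sketch: collect_list gathers each id's arrays into an array of arrays,
// then flatten concatenates them into a single array per id.
Dataset<Row> merged = df
        .groupBy(col("id"))
        .agg(flatten(collect_list(col("c1"))).alias("c1"));
merged.show(false);

The SQL equivalent would be: SELECT id, flatten(collect_list(c1)) AS c1 FROM table GROUP BY id.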

Related

ANTLR4: How would I extract Python expression variables?

Using the following ANTLR grammar: https://github.com/bkiers/python3-parser/blob/master/src/main/antlr4/nl/bigo/pythonparser/Python3.g4 I want to extract, from a given expression, let's say:
x.split(y, 3)
or
x + y
the variables x and y. How would I achieve this?
I tried the following approach but it seems cumbersome since I must add all built-in Python functions:
Define a Listener interface
const listener = new MyPythonListener()
antlr.tree.ParseTreeWalker.DEFAULT.walk(listener, abstractTree)
Use regex + pattern matching:
const symbolicNames = ['TRUE', 'FALSE', 'NUMBERS', 'STRING', 'LIST', 'TUPLE', 'DICTIONARY', 'INT', 'LONG', 'FLOAT', 'COMPLEX',
    'BOOL', 'STR', 'INT', 'RANGE', 'NONE', 'LEN']
class MyPythonListener extends Python3Listener {
    variables = []

    enterExpr(ctx) {
        const text = this.getElementText(ctx)
        if (text && this.verifyIsVariable(text)) {
            this.variables.push(text)
        }
    }

    verifyIsVariable(leafText) {
        return !leafText.includes('"') && !leafText.includes('\'') && isNaN(leafText) &&
            !symbolicNames.includes(leafText.toUpperCase()) && leafText.match(/^[0-9a-zA-Z_]+$/)
    }
}
I didn't look too closely at it, but after inspecting the parse tree for the Python code:
def some_method_name(some_param_name):
    x.split(y, 3)
it appears that the variable names are children of the atom rule:
atom
 : '(' ( yield_expr | testlist_comp )? ')'
 | '[' testlist_comp? ']'
 | '{' dictorsetmaker? '}'
 | NAME
 | number
 | str+
 | '...'
 | NONE
 | TRUE
 | FALSE
 ;
where NAME is a variable name.
So you could do something like this:
String source = "def some_method_name(some_param_name):\n x.split(y, 3)\n";
Python3Lexer lexer = new Python3Lexer(CharStreams.fromString(source));
Python3Parser parser = new Python3Parser(new CommonTokenStream(lexer));

ParseTreeWalker.DEFAULT.walk(new Python3BaseListener() {
    @Override
    public void enterAtom(Python3Parser.AtomContext ctx) {
        if (ctx.NAME() != null) {
            System.out.println(ctx.NAME().getText());
        }
    }
}, parser.file_input());
which will print:
x
y
and not the method and parameter names.
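If you'd rather collect the names than print them, an untested variant of the same walk (assuming Java 8+, since the anonymous listener captures a local; imports java.util.Set and java.util.LinkedHashSet) could be:

Set<String> variables = new LinkedHashSet<>();
ParseTreeWalker.DEFAULT.walk(new Python3BaseListener() {
    @Override
    public void enterAtom(Python3Parser.AtomContext ctx) {
        if (ctx.NAME() != null) {
            // atom-level NAMEs are the variable references
            variables.add(ctx.NAME().getText());
        }
    }
}, parser.file_input());
System.out.println(variables); // prints: [x, y]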
Again: not thoroughly tested, I leave that for you. You can pretty print the parse tree like this:
String source = "def some_method_name(some_param_name):\n x.split(y, 3)\n";
System.out.println(new Builder.Tree(source).toStringASCII());
to inspect for yourself where the nodes you're interested in occur in the parse tree. The code above will print:
'- file_input
|- stmt
| '- compound_stmt
| '- funcdef
| |- def
| |- some_method_name
| |- parameters
| | |- (
| | |- typedargslist
| | | '- tfpdef
| | | '- some_param
| | '- )
| |- :
| '- suite
| |- <NEWLINE>
| |- <INDENT>
| |- stmt
| | '- simple_stmt
| | |- small_stmt
| | | '- expr_stmt
| | | '- testlist_star_expr
| | | '- test
| | | '- or_test
| | | '- and_test
| | | '- not_test
| | | '- comparison
| | | '- star_expr
| | | '- expr
| | | '- xor_expr
| | | '- and_expr
| | | '- shift_expr
| | | '- arith_expr
| | | '- term
| | | '- factor
| | | '- power
| | | |- atom
| | | | '- x
| | | |- trailer
| | | | |- .
| | | | '- split
| | | '- trailer
| | | |- (
| | | |- arglist
| | | | |- argument
| | | | | '- test
| | | | | '- or_test
| | | | | '- and_test
| | | | | '- not_test
| | | | | '- comparison
| | | | | '- star_expr
| | | | | '- expr
| | | | | '- xor_expr
| | | | | '- and_expr
| | | | | '- shift_expr
| | | | | '- arith_expr
| | | | | '- term
| | | | | '- factor
| | | | | '- power
| | | | | '- atom
| | | | | '- y
| | | | |- ,
| | | | '- argument
| | | | '- test
| | | | '- or_test
| | | | '- and_test
| | | | '- not_test
| | | | '- comparison
| | | | '- star_expr
| | | | '- expr
| | | | '- xor_expr
| | | | '- and_expr
| | | | '- shift_expr
| | | | '- arith_expr
| | | | '- term
| | | | '- factor
| | | | '- power
| | | | '- atom
| | | | '- number
| | | | '- integer
| | | | '- 3
| | | '- )
| | '- <NEWLINE>
| '- <DEDENT>
'- <EOF>
Note that the Builder.Tree class is not part of the ANTLR library; it resides in the repo you linked to in your question (my repo): https://github.com/bkiers/python3-parser/blob/master/src/main/java/nl/bigo/pythonparser/Builder.java

Select 1 item per attribute value in Spring Data MongoRepository

I have a collection of objects in MongoDB and am using Spring Data MongoDB.
My collection of entities looks something like this:
--------------------------------------------
| id | snapshot | name |
--------------------------------------------
| 2 | somedate | bla |
| 2 | somedate | foo |
| 3 | somedate | bar |
| 3 | somedate | cheese |
| 6 | somedate | milk |
| 6 | somedate | lorum |
| 6 | somedate | ipsum |
| 9 | somedate | do |
| 10 | somedate | re |
| 10 | somedate | mi |
| 15 | somedate | fa |
--------------------------------------------
I want to get a list of objects containing only one object per distinct id, where the object kept for each id is the one with the latest snapshot date.
My result should be something like this:
--------------------------------------------
| id | snapshot | name |
--------------------------------------------
| 2 | somedate | bla |
| 3 | somedate | bar |
| 6 | somedate | milk |
| 9 | somedate | do |
| 10 | somedate | mi |
| 15 | somedate | fa |
--------------------------------------------
Is this possible using a MongoRepository query?
I'd appreciate any help.
With the aggregation framework it's possible. Run the following aggregation operation to get the desired result:
db.collection.aggregate([
    { "$sort": { "snapshot": -1 } },
    {
        "$group": {
            "_id": "$id",
            "snapshot": { "$first": "$snapshot" },
            "name": { "$first": "$name" }
        }
    }
])
The above native aggregation operation can then be translated to Spring Data MongoDB aggregation as:
import static org.springframework.data.domain.Sort.Direction.DESC;
import static org.springframework.data.mongodb.core.aggregation.Aggregation.*;

TypedAggregation<Entity> aggregation = newAggregation(Entity.class,
        sort(DESC, "snapshot"),
        group("id")
                .first("snapshot").as("snapshot")
                .first("name").as("name")
);
AggregationResults<EntityStats> result = mongoTemplate.aggregate(aggregation, EntityStats.class);
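For completeness, here's a sketch of how that could be wrapped in a Spring service; EntityService and findLatestPerId are illustrative names, not from the original post, and Entity/EntityStats stand in for your own domain and result classes:

import static org.springframework.data.domain.Sort.Direction.DESC;
import static org.springframework.data.mongodb.core.aggregation.Aggregation.*;

import java.util.List;

import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.data.mongodb.core.MongoTemplate;
import org.springframework.data.mongodb.core.aggregation.AggregationResults;
import org.springframework.data.mongodb.core.aggregation.TypedAggregation;
import org.springframework.stereotype.Service;

// Hypothetical service wrapping the aggregation above.
@Service
public class EntityService {

    @Autowired
    private MongoTemplate mongoTemplate;

    public List<EntityStats> findLatestPerId() {
        TypedAggregation<Entity> aggregation = newAggregation(Entity.class,
                sort(DESC, "snapshot"),
                group("id")
                        .first("snapshot").as("snapshot")
                        .first("name").as("name"));
        AggregationResults<EntityStats> results =
                mongoTemplate.aggregate(aggregation, EntityStats.class);
        return results.getMappedResults();
    }
}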

Parsing a SPARQL result into a JTable

I'm working on an Apache Jena project. I've got a Fuseki server running on my localhost.
I want to create a Java program for my Fuseki server that shows all the data in the triplestore in a JTable. I just have no idea how to parse the result from my query into a JTable.
My code so far (I've left out the part where the window, table, frame, etc. are created):
private void Go() {
    String query = "SELECT ?subject ?predicate ?object \n" +
            "WHERE { \n" +
            "?subject ?predicate ?object }";
    Query sparqlQuery = QueryFactory.create(query, Syntax.syntaxARQ);
    QueryEngineHTTP httpQuery = new QueryEngineHTTP("http://localhost:3030/AnimalDataSet/", sparqlQuery);
    ResultSet results = httpQuery.execSelect();
    // note: asText() consumes the ResultSet, so the loop below sees no rows
    System.out.println(ResultSetFormatter.asText(results));
    while (results.hasNext()) {
        QuerySolution solution = results.next();
    }
    httpQuery.close();
}
The sysout prints this, which is the correct data:
-------------------------------------------------------------------------------------------------------------------------------------
| subject | predicate | object |
=====================================================================================================================================
| <urn:animals:data> | <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> | <http://www.w3.org/1999/02/22-rdf-syntax-ns#Seq> |
| <urn:animals:data> | <http://www.w3.org/1999/02/22-rdf-syntax-ns#_1> | <urn:animals:lion> |
| <urn:animals:data> | <http://www.w3.org/1999/02/22-rdf-syntax-ns#_2> | <urn:animals:tarantula> |
| <urn:animals:data> | <http://www.w3.org/1999/02/22-rdf-syntax-ns#_3> | <urn:animals:hippopotamus> |
| <urn:animals:lion> | <http://www.some-ficticious-zoo.com/rdf#name> | "Lion" |
| <urn:animals:lion> | <http://www.some-ficticious-zoo.com/rdf#species> | "Panthera leo" |
| <urn:animals:lion> | <http://www.some-ficticious-zoo.com/rdf#class> | "Mammal" |
| <urn:animals:tarantula> | <http://www.some-ficticious-zoo.com/rdf#name> | "Tarantula" |
| <urn:animals:tarantula> | <http://www.some-ficticious-zoo.com/rdf#species> | "Avicularia avicularia" |
| <urn:animals:tarantula> | <http://www.some-ficticious-zoo.com/rdf#class> | "Arachnid" |
| <urn:animals:hippopotamus> | <http://www.some-ficticious-zoo.com/rdf#name> | "Hippopotamus" |
| <urn:animals:hippopotamus> | <http://www.some-ficticious-zoo.com/rdf#species> | "Hippopotamus amphibius" |
| <urn:animals:hippopotamus> | <http://www.some-ficticious-zoo.com/rdf#class> | "Mammal" |
-------------------------------------------------------------------------------------------------------------------------------------
I really hope someone here knows how to parse the data from the query into a JTable :D
Thanks in advance!
I've done some further research and finally found the solution! It's quite easy actually.
You simply change the while loop like this:
DefaultTableModel model = (DefaultTableModel) table.getModel();
while (results.hasNext()) {
    QuerySolution sol = results.nextSolution();
    RDFNode subject = sol.get("subject");
    RDFNode predicate = sol.get("predicate");
    RDFNode object = sol.get("object");
    model.addRow(new Object[]{subject, predicate, object});
}
And that works fine for me!
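For reference, here's a minimal sketch of the omitted window/table setup this loop assumes (the column names are my assumption, matching the query's variables):

import javax.swing.JFrame;
import javax.swing.JScrollPane;
import javax.swing.JTable;
import javax.swing.table.DefaultTableModel;

// Table with one column per query variable; rows get added by the loop above.
DefaultTableModel model = new DefaultTableModel(
        new Object[]{"subject", "predicate", "object"}, 0);
JTable table = new JTable(model);

JFrame frame = new JFrame("Triplestore contents");
frame.add(new JScrollPane(table));
frame.setSize(800, 600);
frame.setDefaultCloseOperation(JFrame.EXIT_ON_CLOSE);
frame.setVisible(true);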
For everyone who's interested, I've published my current version, with comments, to Pastebin:
The link to the full (current) version of my project

Java code in Hadoop

I am running a map-only job in Hadoop. The dataset is a set of HTML pages in a single file (returned by a crawler).
The mapper code is written in Java. I am using jsoup to parse. What I want as my output is a key that has both the contents of the title tag and the content of a meta tag. Ideally I should get 1592 map output records; I am getting 3184.
The concatenation I attempt with this line of code is not happening:
String MN_Job = (jobT + "\t" + jobsDetail);
What I get instead is each of them separately, hence double the number of outputs. What am I doing wrong here?
public class JobsDataMapper extends Mapper<LongWritable, Text, Text, Text> {
    private Text keytext = new Text();
    private Text valuetext = new Text();

    @Override
    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String line = value.toString();
        Document doc = Jsoup.parse(line);
        Elements desc = doc.select("head title, meta[name=twitter:description]");
        for (Element jobhtml : desc) {
            Elements title = jobhtml.select("title");
            String jobT = "";
            for (Element titlehtml : title) {
                jobT = titlehtml.text();
            }
            Elements meta = jobhtml.select("meta[name=twitter:description]");
            String jobsDetail = "";
            for (Element metahtml : meta) {
                String content = metahtml.attr("content");
                String content1 = content.replaceAll("\\p{Punct}+", " ");
                jobsDetail = content1.replaceAll(" (?i)a | (?i)able | (?i)about | (?i)across | (?i)after | (?i)all | (?i)almost | (?i)also | (?i)am | (?i)among | (?i)an | (?i)and | (?i)any | (?i)are | (?i)as | (?i)at | (?i)be | (?i)because | (?i)been | (?i)but | (?i)by | (?i)can | (?i)cannot | (?i)could | (?i)dear | (?i)did | (?i)do | (?i)does | (?i)either | (?i)else | (?i)ever | (?i)every | (?i)for | (?i)from | (?i)get | (?i)got | (?i)had | (?i)has | (?i)have | (?i)he | (?i)her | (?i)hers | (?i)him | (?i)his | (?i)how | (?i)however | (?i)i | (?i)if | (?i)in | (?i)into | (?i)is | (?i)it | (?i)its | (?i)just | (?i)least | (?i)let | (?i)like | (?i)likely | (?i)may | (?i)me | (?i)might | (?i)most | (?i)must | (?i)my | (?i)neither | (?i)no | (?i)nor | (?i)not | (?i)nbsp | (?i)of | (?i)off | (?i)often | (?i)on | (?i)only | (?i)or | (?i)other | (?i)our | (?i)own | (?i)rather | (?i)said | (?i)say | (?i)says | (?i)she | (?i)should | (?i)since | (?i)so | (?i)some | (?i)than | (?i)that | (?i)the | (?i)their | (?i)them | (?i)then | (?i)there | (?i)these | (?i)they | (?i)this | (?i)tis | (?i)to | (?i)too | (?i)twas | (?i)us | (?i)wants | (?i)was | (?i)we | (?i)were | (?i)what | (?i)when | (?i)where | (?i)which | (?i)while | (?i)who | (?i)whom | (?i)why | (?i)will | (?i)with | (?i)would | (?i)yet | (?i)you | (?i)your ", " ");
            }
            String IT_Job = (jobT + "\t" + jobsDetail);
            keytext.set(IT_Job);
            valuetext.set("JobDetail");
            context.write(keytext, valuetext);
        }
    }
}
Edit: I know what the problem is, but the solution might not be obvious in MapReduce: you might have to write a custom RecordReader. Let me explain the problem.
In your code you read line by line. Then you apply this to the line you read:
Elements desc = doc.select("head title, meta[name=twitter:description]");
But evidently, a given line might only have a title or a <meta name=twitter:description> tag. So you read one of those and store it; the other one remains blank. So at any given time, only one of your variables, jobT and jobsDetail, has any data. So for the code snippet:
String IT_Job = (jobT + "\t" + jobsDetail);
one time the first one is blank, and the next time the other one is blank. So if you are expecting n records, you get 2n records. Similarly, if you attempt to extract three fields, you should get 3n records. You can test this theory by extracting another field and then checking whether you get three times the expected number of records.
If the theory turns out to be correct, you might want to delimit the webpages you extract with a specific delimiter string. Then you can write a custom RecordReader which reads one HTML file at a time according to the delimiter and processes the entire HTML file at once. That way you'll get the title and the meta tags together; see the sketch below.
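If the crawler can be made to write a known marker between pages, a lighter-weight alternative to a fully custom RecordReader (assuming Hadoop 2.x, whose TextInputFormat honours the textinputformat.record.delimiter property; the marker string here is hypothetical) would be:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

// Sketch: make TextInputFormat hand each page to the mapper as one record,
// splitting on a marker the crawler writes between pages (hypothetical here).
Configuration conf = new Configuration();
conf.set("textinputformat.record.delimiter", "<!--END_OF_PAGE-->");
Job job = Job.getInstance(conf, "jobs-data");
// ... set mapper class, input/output paths, etc. as usual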
Just by looking at the numbers: 3184 / 2 = 1592.
I think your file is simply duplicated in the input folder. I can't tell for sure, because you have not shown the code that submits the job, but maybe you can verify it with a simple:
bin/hadoop fs -ls /your/input_path
When submitting, either make sure that there is just the single file in there, or just reference the single file in your submission logic.
I made changes to the original code, removing the loops that were not necessary. What was happening in the older code was that when there was a title in the record, it was output, and later when there was content, it was output as well. So there were two writes per HTML file.
public class JobsDataMapper extends Mapper<LongWritable, Text, Text, Text> {
    private Text keytext = new Text();
    private Text valuetext = new Text();
    private String jobT = new String();
    private String jobName = new String();

    @Override
    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String line = value.toString();
        Document doc = Jsoup.parse(line);
        Elements desc = doc.select("head title, meta[name=twitter:description]");
        for (Element jobhtml : desc) {
            Elements title = jobhtml.select("title");
            String jobTT = title.text();
            jobT = jobTT;
            if (jobT.length() > 0) {
                jobName = jobTT;
            }
            Elements meta = jobhtml.select("meta[name=twitter:description]");
            String jobsDetail = "";
            String content = meta.attr("content");
            String content1 = content.replaceAll("\\p{Punct}+", " ");
            // lower-case first, then strip the stop words from the lowered text
            jobsDetail = content1.toLowerCase().replaceAll(" a| able | about | across | after | all | almost | also | am | among | an | and | any | are | as | at | be| because | been | but | by | can | cannot | could | dear | did | do | does | either | else | ever | every | for | from | get | got | had | has | have | he | her | hers | him | his | how | however | i | if | in | into | is | it | its | just | least | let | like | likely | may | me | might | most | must | my | neither | no | nor | not | nbsp | of | off | often | on | only | or | other | our | own | rather | said | say | says | she | should | since | so | some | than | that | the | their | them | then | there | these | they | this | tis | to | too | twas | us | wants | was | we | were | what | when | where | which | while | who | whom | why | will | with | would | yet | you | your ", " ");
            if (jobsDetail.length() > 0) {
                String MN_Job = (jobName + "\t" + jobsDetail);
                keytext.set(MN_Job);
                valuetext.set("JobInIT");
                context.write(keytext, valuetext);
            }
        }
    }
}

Spring MVC 3 with Hibernate - What's my model for use in ModelAndView?

I've just set up my first Spring MVC 3 project with Hibernate 3 using Maven.
Now I'm used to having a controller package with my controllers and a model package with my models,
but with Hibernate integrated, what I now have is:
.
|____main
| |____java
| | |____com
| | | |____cqrify
| | | | |____tellus
| | | | | |____App.java
| | | | | |____controller
| | | | | | |____ContactController.java
| | | | | |____dao
| | | | | | |____ContactDAO.java
| | | | | | |____impl
| | | | | | | |____ContactDAOImpl.java
| | | | | |____form
| | | | | | |____Contact.java
| | | | | |____service
| | | | | | |____ContactService.java
| | | | | | |____impl
| | | | | | | |____ContactServiceImpl.java
| |____resources
| | |____config.properties
| | |____log4j.xml
| | |____Messages.properties
| | |____META-INF
| |____webapp
| | |____resources
| | | |____css
| | | |____gfx
| | | |____js
| | |____WEB-INF
| | | |____classes
| | | |____spring
| | | | |____appServlet
| | | | | |____servlet-context.xml
| | | | |____root-context.xml
| | | |____views
| | | | |____editContact.jsp
| | | | |____newContact.jsp
| | | | |____showContacts.jsp
| | | | |____includes
| | | | |____taglib_includes.jsp
| | | |____web.xml
|____test
| |____java
| |____resources
| | |____log4j.xml
|____test.txt
What I understand is that I'm to autowire "ContactService" as that is now my "model", but how do I use it with ModelAndView?
My controller
import com.cqrify.tellus.form.Contact;
import com.cqrify.tellus.service.ContactService;

@Controller
public class ContactController {

    @Autowired
    private ContactService contactService;

    @RequestMapping(value = "/")
    public ModelAndView listContacts() {
        Map<String, Object> contactMap = new HashMap<String, Object>(); // must be initialised before use
        contactMap.put("contactList", contactService.listContacts());
        ModelAndView modelAndView = new ModelAndView("showContacts", "ContactService", contactMap);
        return modelAndView;
    }
}
As seen above:
ModelAndView modelAndView = new ModelAndView("showContacts", "ContactService", contactMap);
Is it right that "ContactService" will now be my model name, or have I completely missed something?
In your case you can simply return:
new ModelAndView("showContacts", "contactList", contactService.listContacts());
This means that you want to render the showContacts view, and that the list of contacts will be available to the view under the name contactList.
ContactService is a business object used to find (fetch) the model; IMHO it should not be used to name the model itself.
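Putting that together, a sketch of the simplified controller could look like this:

import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.stereotype.Controller;
import org.springframework.web.bind.annotation.RequestMapping;
import org.springframework.web.servlet.ModelAndView;

import com.cqrify.tellus.service.ContactService;

@Controller
public class ContactController {

    @Autowired
    private ContactService contactService;

    // The view name is "showContacts"; the JSP reads the list under the
    // model attribute "contactList".
    @RequestMapping(value = "/")
    public ModelAndView listContacts() {
        return new ModelAndView("showContacts", "contactList",
                contactService.listContacts());
    }
}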
