I have two files, one called a-records
123^record1
222^record2
333^record3
and the other file called b-records
123^jim
123^jim
222^mike
333^joe
You can see that in file A I have the token 123 one time, while in file B it appears twice. Is there a way, using Apache Pig, that I can join the data such that I only get ONE joined record from the A file?
Here is my current script, which outputs the results shown below.
arecords = LOAD '$a' USING PigStorage('^') as (token:chararray, type:chararray);
brecords = LOAD '$b' USING PigStorage('^') as (token:chararray, name:chararray);
x = JOIN arecords BY token, brecords BY token;
dump x;
which yields:
(123,record1,123,jim)
(123,record1,123,jim)
(222,record2,222,mike)
(333,record3,333,joe)
when what I REALLY want is (notice token 123 is only in there once after the join):
(123,record1,123,jim)
(222,record2,222,mike)
(333,record3,333,joe)
Any ideas? Thanks so much.
I would do something like this:
arecords = LOAD '$a' USING PigStorage('^') as (token:chararray, type:chararray);
brecords = LOAD '$b' USING PigStorage('^') as (token:chararray, name:chararray);
bdistinct = DISTINCT brecords;
x = JOIN arecords BY token, bdistinct BY token;
dump x;
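Since the repeated 123^jim line in b-records is an exact duplicate row, DISTINCT collapses it to a single tuple before the join, so the dump should now produce exactly the output you asked for:
(123,record1,123,jim)
(222,record2,222,mike)
(333,record3,333,joe)
(If the duplicates in b-records were not identical rows, DISTINCT alone would not be enough; you would need to group by token and pick one row per group instead.)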
I am executing the following query from my web application and from the Access 2007 query wizard, and I am getting two different results.
SELECT R.Rept_Name, D.Dist_Name,S.State_Name FROM (tblReporter AS R LEFT JOIN tblDist AS D ON R.Dist_Id=D.Dist_Id) LEFT JOIN tblState AS S ON S.State_Id=R.State_Id WHERE R.Rept_Name LIKE '*Ra*' ORDER BY R.Rept_Name;
The result from the web application has 0 rows, while the query wizard returns 2 rows. If I remove the WHERE condition, both results are the same. Please help me understand what is wrong with the query. If any other info is required, please tell me.
Web application code:
public DataTable getRept(string rept, string mobno)
{
    DataTable dt = new DataTable();
    using (OleDbConnection conn = new OleDbConnection(getConnection()))
    {
        using (OleDbCommand cmd = conn.CreateCommand())
        {
            cmd.CommandType = CommandType.Text;
            cmd.CommandText = "SELECT R.Rept_Name, D.Dist_Name,S.State_Name FROM (tblReporter AS R LEFT JOIN tblDist AS D ON R.Dist_Id=D.Dist_Id) LEFT JOIN tblState AS S ON S.State_Id=R.State_Id WHERE R.Rept_Name LIKE '*" + rept + "*' ORDER BY R.Rept_Name;";
            conn.Open();
            using (OleDbDataReader sdr = cmd.ExecuteReader())
            {
                if (sdr.HasRows)
                    dt.Load(sdr);
            }
        }
    }
    return dt;
}
You are getting tripped up by the difference in LIKE wildcard characters between queries run in Access itself and queries run from an external application.
When running a query from within Access itself you need to use the asterisk as the wildcard character: LIKE '*Ra*'.
When running a query from an external application (like your C# app) you need to use the percent sign as the wildcard character: LIKE '%Ra%'.
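Applied to the code above, the CommandText stays exactly the same except for the wildcards (and note that concatenating rept directly into the SQL string leaves the query open to SQL injection, so an OleDb parameter would be safer):
cmd.CommandText = "SELECT R.Rept_Name, D.Dist_Name,S.State_Name FROM (tblReporter AS R LEFT JOIN tblDist AS D ON R.Dist_Id=D.Dist_Id) LEFT JOIN tblState AS S ON S.State_Id=R.State_Id WHERE R.Rept_Name LIKE '%" + rept + "%' ORDER BY R.Rept_Name;";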
I'm trying to create a multi-value parameter in SpagoBI.
Here is my dataset query, whose last line appears to be causing the issue:
select C."CUSTOMERNAME", C."CITY", D."YEAR", P."NAME"
from "CUSTOMER" C, "DAY" D, "PRODUCT" P, "TRANSACTIONS" T
where C."CUSTOMERID" = T."CUSTOMERID"
and D."DAYID" = T."DAYID"
and P."PRODUCTID" = T."PRODUCTID"
and _CITY_
I created a beforeOpen script in my dataset, which looks like this:
this.queryText = this.queryText.replace(_CITY_, " CUSTOMER.CITY in ( "+params["cp"].value+" ) ");
My parameter is set as a string, with display type dynamic list box.
When I run the report I get the following error:
org.eclipse.birt.report.engine.api.EngineException: There are errors evaluating script "
this.queryText = this.queryText.replace(_CITY_, " CUSTOMER.CITY in ( "+params["cp"].value+" ) ");
":
Fail to execute script in function __bm_beforeOpen(). Source:
Could anyone please help me?
Hello, I managed to solve the problem. Here is my code:
var substring = "" ;
var strParamValsSelected=reportContext.getParameterValue("citytext");
substring += "?," + strParamValsSelected ;
this.queryText = this.queryText.replace("'xxx'",substring);
As you can see, the "?" is necessary before my parameter. Maybe it will help somebody. Thank you so much for your comments.
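For reference: this working version replaces a quoted literal 'xxx' placeholder, whereas the original script passed _CITY_ as a bare (undefined) JavaScript identifier, which is what made the beforeOpen script fail to evaluate. So the last line of the dataset query presumably now reads something like:
and C."CITY" in ('xxx')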
If you are using SpagoBI Server with the Highcharts (JFreeChart) engine or the JSChart engine, you can just use ($P{param_url}) in the query,
or build a dynamic query using JavaScript / Groovy script,
so your query could also be:
select C."CUSTOMERNAME", C."CITY", D."YEAR", P."NAME"
from "CUSTOMER" C, "DAY" D, "PRODUCT" P, "TRANSACTIONS" T
where C."CUSTOMERID" = T."CUSTOMERID"
and D."DAYID" = T."DAYID"
and P."PRODUCTID" = T."PRODUCTID"
and CUSTOMER."CITY" in ('$P{param_url}')
I'm trying to read .txt text files with more than 10,000 lines per file, splitting them and inserting the data into an Access database using Java and UCanAccess. The problem is that it becomes slower and slower every time (as the database gets bigger).
Now, after reading 7 text files and inserting them into the database, it takes the project more than 20 minutes to read another file.
I tried doing just the reading, and it works fine, so the problem is the actual inserting into the database.
N.B.: This is my first time using UCanAccess with Java, because I found that the JDBC-ODBC Bridge is no longer available. Any suggestions for an alternative solution would also be appreciated.
If your current task is simply to import a large amount of data from text files straight into the database, and it does not require any sophisticated SQL manipulations, then you might consider using the Jackcess API directly. For example, to import a CSV file you could do something like this:
String csvFileSpec = "C:/Users/Gord/Desktop/BookData.csv";
String dbFileSpec = "C:/Users/Public/JackcessTest.accdb";
String tableName = "Book";
try (Database db = new DatabaseBuilder()
        .setFile(new File(dbFileSpec))
        .setAutoSync(false)
        .open()) {
    new ImportUtil.Builder(db, tableName)
            .setDelimiter(",")
            .setUseExistingTable(true)
            .setHeader(false)
            .importFile(new File(csvFileSpec));
    // this is a try-with-resources block,
    // so db.close() happens automatically
}
Or, if you need to manually parse each line of input, insert a row, and retrieve the AutoNumber value for the new row, then the code would be more like this:
String dbFileSpec = "C:/Users/Public/JackcessTest.accdb";
String tableName = "Book";
try (Database db = new DatabaseBuilder()
        .setFile(new File(dbFileSpec))
        .setAutoSync(false)
        .open()) {
    // sample data (e.g., from parsing of an input line)
    String title = "So, Anyway";
    String author = "Cleese, John";
    Table tbl = db.getTable(tableName);
    Object[] rowData = tbl.addRow(Column.AUTO_NUMBER, title, author);
    int newId = (int)rowData[0]; // retrieve generated AutoNumber
    System.out.printf("row inserted with ID = %d%n", newId);
    // this is a try-with-resources block,
    // so db.close() happens automatically
}
To update an existing row based on its primary key, the code would be:
Table tbl = db.getTable(tableName);
Row row = CursorBuilder.findRowByPrimaryKey(tbl, 3); // i.e., ID = 3
if (row != null) {
    // Note: column names are case-sensitive
    row.put("Title", "The New Title For This Book");
    tbl.updateRow(row);
}
Note that for maximum speed I used .setAutoSync(false) when opening the Database, but bear in mind that disabling AutoSync does increase the chance of leaving the Access database file in a damaged (and possibly unusable) state if the application terminates abnormally while performing the updates.
Also, if you need to use SQL via UCanAccess, you have to call setAutoCommit(false) on the connection at the beginning, and do a commit every 200/300 records. Performance will improve dramatically (by about 99%).
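For instance, here is a minimal sketch of that batching pattern with the UCanAccess JDBC driver (the file path, the Book table and its columns, and the ";" delimiter are assumptions; adjust them to your data):
import java.nio.file.Files;
import java.nio.file.Paths;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.util.List;

public class BulkInsert {
    public static void main(String[] args) throws Exception {
        List<String> lines = Files.readAllLines(Paths.get("C:/data/input1.txt"));
        try (Connection conn = DriverManager.getConnection(
                "jdbc:ucanaccess://C:/Users/Public/JackcessTest.accdb")) {
            conn.setAutoCommit(false); // stop UCanAccess from committing after every INSERT
            try (PreparedStatement ps = conn.prepareStatement(
                    "INSERT INTO Book (Title, Author) VALUES (?, ?)")) {
                int count = 0;
                for (String line : lines) {
                    String[] fields = line.split(";"); // assumed delimiter
                    ps.setString(1, fields[0]);
                    ps.setString(2, fields[1]);
                    ps.executeUpdate();
                    if (++count % 250 == 0) {
                        conn.commit(); // commit every ~250 records, per the advice above
                    }
                }
                conn.commit(); // commit whatever is left
            }
        }
    }
}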
I'm using the azure-documentdb Java SDK in order to create and use User Defined Functions (UDFs).
So, from the official documentation, I finally found the way (with a Java client) to create a UDF:
String regexUdfJson = "{"
+ "id:\"REGEX_MATCH\","
+ "body:\"function (input, pattern) { return input.match(pattern) !== null; }\","
+ "}";
UserDefinedFunction udfREGEX = new UserDefinedFunction(regexUdfJson);
getDC().createUserDefinedFunction(
        myCollection.getSelfLink(),
        udfREGEX,
        new RequestOptions());
And here is a sample query:
SELECT * FROM root r WHERE udf.REGEX_MATCH(r.name, "mytest_.*")
I had to create the UDF only once, because I get an exception if I try to recreate an existing UDF:
DocumentClientException: Message: {"Errors":["The input name presented is already taken. Ensure to provide a unique name property for this resource type."]}
How can I tell whether the UDF already exists?
I tried to use the "readUserDefinedFunctions" function without success. Any example / other ideas?
Maybe, for the long term, we should suggest a "createOrReplaceUserDefinedFunction(...)" on Azure feedback?
You can check for existing UDFs by running a query using queryUserDefinedFunctions.
Example:
List<UserDefinedFunction> udfs = client.queryUserDefinedFunctions(
        myCollection.getSelfLink(),
        new SqlQuerySpec("SELECT * FROM root r WHERE r.id=@id",
                new SqlParameterCollection(new SqlParameter("@id", myUdfId))),
        null).getQueryIterable().toList();
if (udfs.size() > 0) {
    // Found UDF.
}
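If you wrap that check and the create call together, you get, in effect, the "create if absent" behaviour asked about above. A sketch using the same SDK calls (client, myCollection, and the UDF object are as in the earlier snippets; the helper name is made up):
void createUdfIfAbsent(DocumentClient client, DocumentCollection collection,
        UserDefinedFunction udf) throws DocumentClientException {
    List<UserDefinedFunction> existing = client.queryUserDefinedFunctions(
            collection.getSelfLink(),
            new SqlQuerySpec("SELECT * FROM root r WHERE r.id=@id",
                    new SqlParameterCollection(new SqlParameter("@id", udf.getId()))),
            null).getQueryIterable().toList();
    if (existing.isEmpty()) {
        // Not found, so creating it now cannot trigger the "name already taken" error.
        client.createUserDefinedFunction(collection.getSelfLink(), udf, new RequestOptions());
    }
}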
An answer for .NET users.
var collectionAltLink = documentCollections["myCollection"].AltLink; // Target collection's AltLink
var udfLink = $"{collectionAltLink}/udfs/{sampleUdfId}"; // sampleUdfId is your UDF Id
var result = await _client.ReadUserDefinedFunctionAsync(udfLink);
var resource = result.Resource;
if (resource != null)
{
    // The UDF with udfId exists
}
Here _client is Azure's DocumentClient and documentCollections is a dictionary of your DocumentDB collections.
Note that if there is no such UDF in the collection, _client throws a NotFound exception rather than returning a null Resource, so the null check above will never actually catch the missing-UDF case.
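A sketch of handling that case explicitly (exception filters require C# 6; HttpStatusCode comes from System.Net):
try
{
    var result = await _client.ReadUserDefinedFunctionAsync(udfLink);
    // The UDF exists; result.Resource holds it.
}
catch (DocumentClientException e) when (e.StatusCode == HttpStatusCode.NotFound)
{
    // The UDF does not exist; safe to create it here.
}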
I am new to both Jena-TDB and SPARQL, so it might be a silly question. I am using tdb-0.9.0, on Windows XP.
I am creating the TDB model for my trail_1.rdf file. My understanding here(correct me if I am wrong) is that the following code will read the given rdf file in TDB model and also stores/load (not sure what's the better word) the model in the given directory D:\Project\Store_DB\data1\tdb:
// open TDB dataset
String directory = "D:\\Project\\Store_DB\\data1\\tdb";
Dataset dataset = TDBFactory.createDataset(directory);
Model tdb = dataset.getDefaultModel();
// read the input file
String source = "D:\\Project\\Store_DB\\tmp\\trail_1.rdf";
FileManager.get().readModel( tdb, source);
tdb.close();
dataset.close();
Is this understanding correct?
As per my understanding, since the model is now stored in the D:\Project\Store_DB\data1\tdb directory, I should be able to run queries on it at some later point in time.
So, to query the TDB store at D:\Project\Store_DB\data1\tdb, I tried the following, but it prints nothing:
String directory = "D:\\Project\\Store_DB\\data1\\tdb" ;
Dataset dataset = TDBFactory.createDataset(directory) ;
Iterator<String> graphNames = dataset.listNames();
while (graphNames.hasNext()) {
    String graphName = graphNames.next();
    System.out.println(graphName);
}
I also tried this, which also did not print anything:
String directory = "D:\\Project\\Store_DB\\data1\\tdb" ;
Dataset dataset = TDBFactory.createDataset(directory) ;
String sparqlQueryString = "SELECT (count(*) AS ?count) { ?s ?p ?o }" ;
Query query = QueryFactory.create(sparqlQueryString) ;
QueryExecution qexec = QueryExecutionFactory.create(query, dataset) ;
ResultSet results = qexec.execSelect() ;
ResultSetFormatter.out(results) ;
What am I doing incorrectly? Is there anything wrong with the understanding I have described above?
For part (i) of your question, yes, your understanding is correct.
For part (ii), the reason that listNames does not return any results is that you have not put your data into a named graph. In particular,
Model tdb = dataset.getDefaultModel();
means that you are storing data into TDB's default graph, i.e. the graph with no name. If you wish listNames to return something, change that line to:
Model tdb = dataset.getNamedModel( "graph42" );
or something similar. You will, of course, then need to refer to that graph by name when you query the data.
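For example, the counting query from part (iii) would then need a GRAPH clause, along the lines of (assuming the "graph42" name used above; substitute whatever graph IRI you actually chose):
SELECT (count(*) AS ?count) { GRAPH <graph42> { ?s ?p ?o } }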
If your goal is simply to test whether or not you have successfully loaded data into the store, try the command line tools bin/tdbdump (Linux) or bat\tdbdump.bat (Windows).
For part (iii), I tried your code on my system, pointing at one of my TDB images, and it works just as one would expect. So: either the TDB image you're using doesn't have any data in it (test with tdbdump), or the code you actually ran was different to the sample above.
The problem in your part 1 code, I think, is that you are not committing the data.
Try with this version of your part 1 code:
String directory = "D:\\Project\\Store_DB\\data1\\tdb";
Dataset dataset = TDBFactory.createDataset(directory);
dataset.begin(ReadWrite.WRITE); // commit() is only valid inside a write transaction
Model tdb = dataset.getDefaultModel();
// read the input file
String source = "D:\\Project\\Store_DB\\tmp\\trail_1.rdf";
FileManager.get().readModel( tdb, source);
dataset.commit(); // INCLUDE THIS STATEMENT
dataset.end();
tdb.close();
dataset.close();
and then try with your part 3 code :) ....