I find the existing Spark implementations for accessing a traditional database very restrictive and limited. In particular:
* Use of bind variables is not possible.
* Passing partitioning parameters to your generated SQL is very restricted.
* Most bothersome, there is no way to customize how partitioning takes place: all it allows is identifying a partitioning column and upper/lower boundaries, and only a numeric column and values are allowed.
I understand I can present the query to my database the way you do a subquery, and map my partitioning column to a numeric value, but that will cause very inefficient execution plans on my database, where partition pruning (true Oracle table partitions) and/or use of indexes is not efficient.
Is there any way for me to get around those restrictions? Can I customize my query better, or build my own partitioning logic? Ideally I want to wrap my own custom JDBC code in an Iterator that can be executed lazily and does not cause the entire result set to be loaded into memory (the way JdbcRDD works).
Oh - I prefer to do all this using Java, not Scala.
Take a look at the JdbcRDD source code. There's not much to it.
You can get the flexibility you're looking for by writing a custom RDD type based on this code, or even by subclassing it and overriding getPartitions() and compute().
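For the lazy-iterator part of the question, here is a minimal sketch of what that JDBC wrapping could look like (this is illustrative, not JdbcRDD's actual code): it hints the driver to fetch in batches and only pulls a row from the ResultSet when next() is called, so the full result set is never held in memory.

```java
import java.sql.*;
import java.util.Iterator;
import java.util.NoSuchElementException;

/** Streams rows from a JDBC query lazily; rows are fetched as the iterator advances. */
public class LazyJdbcIterator implements Iterator<Object[]>, AutoCloseable {
    private final Connection conn;
    private final PreparedStatement stmt;
    private final ResultSet rs;
    private final int columnCount;
    private boolean hasNext;

    public LazyJdbcIterator(Connection conn, String sql, Object... binds) throws SQLException {
        this.conn = conn;
        this.stmt = conn.prepareStatement(sql);
        stmt.setFetchSize(500);                      // hint the driver to stream in batches
        for (int i = 0; i < binds.length; i++) {
            stmt.setObject(i + 1, binds[i]);         // real bind variables
        }
        this.rs = stmt.executeQuery();
        this.columnCount = rs.getMetaData().getColumnCount();
        this.hasNext = rs.next();
    }

    @Override public boolean hasNext() { return hasNext; }

    @Override public Object[] next() {
        if (!hasNext) throw new NoSuchElementException();
        try {
            Object[] row = new Object[columnCount];
            for (int i = 0; i < columnCount; i++) row[i] = rs.getObject(i + 1);
            hasNext = rs.next();                     // advance lazily, one row at a time
            return row;
        } catch (SQLException e) {
            throw new RuntimeException(e);
        }
    }

    @Override public void close() throws SQLException {
        rs.close(); stmt.close(); conn.close();
    }
}
```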
I studied both JdbcRDD and the new Spark SQL Data Sources API. Neither of them supports your requirements.
Most likely this will be your own implementation. I recommend writing against the new Data Sources API instead of subclassing JdbcRDD, which became obsolete in Spark 1.3.
To get good performance with Spark, I'm wondering whether it is better to use SQL queries via SQLContext, or whether it is better to do queries via DataFrame functions like df.select().
Any idea? :)
There is no performance difference whatsoever. Both methods use exactly the same execution engine and internal data structures. At the end of the day, it all boils down to personal preference.
Arguably, DataFrame queries are much easier to construct programmatically and provide minimal type safety.
Plain SQL queries can be significantly more concise and easier to understand. They are also portable and can be used without any modifications in every supported language. With HiveContext, they can also be used to expose functionality which can be inaccessible in other ways (for example, UDFs without Spark wrappers).
Ideally, Spark's Catalyst optimizer should compile both calls to the same execution plan, so the performance should be the same. Which one to use is just a matter of style.
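A quick way to check the equivalence yourself is to compare the physical plans. A minimal sketch (Spark 1.x Java API; the people.json input and its name column are hypothetical):

```java
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.SQLContext;
import static org.apache.spark.sql.functions.col;

public class SqlVsDataFrame {
    public static void main(String[] args) {
        JavaSparkContext sc = new JavaSparkContext(
                new SparkConf().setAppName("sql-vs-df").setMaster("local[*]"));
        SQLContext sqlContext = new SQLContext(sc);

        // Hypothetical input: a JSON file with a "name" column.
        DataFrame df = sqlContext.read().json("people.json");
        df.registerTempTable("people");

        // The same aggregation expressed both ways:
        DataFrame viaSql = sqlContext.sql(
                "SELECT name, COUNT(*) AS cnt FROM people GROUP BY name");
        DataFrame viaApi = df.groupBy(col("name")).count();

        // Apart from the output column alias, both print the same physical
        // plan, produced by the same Catalyst optimizer.
        viaSql.explain();
        viaApi.explain();
    }
}
```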
In reality, there is a difference, according to a report by Hortonworks (https://community.hortonworks.com/articles/42027/rdd-vs-dataframe-vs-sparksql.html), where SQL outperforms DataFrames in a case where you need grouped records with their total counts, sorted descending by record name.
By using DataFrames, one can break the SQL into multiple statements/queries, which helps with debugging, easy enhancements, and code maintenance.
Breaking complex SQL queries into simpler queries and assigning the results to DataFrames brings better understanding.
By splitting the query into multiple DataFrames, the developer gains the advantage of using cache and repartitioning (to distribute data evenly across the partitions using a unique/close-to-unique key).
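A minimal sketch of that style, assuming an existing SQLContext and a hypothetical orders table (note that repartitioning by a column requires Spark 1.6+):

```java
import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.SQLContext;
import static org.apache.spark.sql.functions.col;

public class SplitQueryExample {
    // Splits one big query into cached, reusable stages.
    static void run(SQLContext sqlContext) {
        // Stage 1: filter once and cache; both aggregations below reuse it.
        DataFrame active = sqlContext.table("orders")
                .filter(col("status").equalTo("ACTIVE"))
                .cache();

        // Stage 2: redistribute by a close-to-unique key before heavy work.
        DataFrame balanced = active.repartition(col("order_id"));

        // Stage 3: independent aggregations over the cached stage.
        DataFrame byCustomer = balanced.groupBy(col("customer_id")).count();
        DataFrame byRegion = active.groupBy(col("region")).count();

        byCustomer.show();
        byRegion.show();
    }
}
```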
The only thing that matters is what kind of underlying algorithm is used for grouping.
HashAggregation is more efficient than SortAggregation. SortAggregation sorts the rows and then gathers the matching rows together: O(n log n).
HashAggregation creates a HashMap keyed on the grouping columns, with the rest of the columns as the values in the map.
Spark SQL uses HashAggregation where possible (i.e. when the data for the value is mutable): O(n).
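You can see which strategy the planner chose by inspecting the physical plan. A rough sketch (the sales table is hypothetical, and the exact operator names, e.g. TungstenAggregate vs. HashAggregate, vary across Spark versions):

```java
import org.apache.spark.sql.SQLContext;
import static org.apache.spark.sql.functions.col;
import static org.apache.spark.sql.functions.collect_list;
import static org.apache.spark.sql.functions.sum;

public class AggregationPlanCheck {
    static void run(SQLContext sqlContext) {
        // sum() keeps a mutable fixed-width buffer, so the plan should show
        // a hash-based aggregate.
        sqlContext.table("sales")
                .groupBy(col("region"))
                .agg(sum(col("amount")))
                .explain();

        // collect_list() needs a non-mutable buffer, so the planner falls
        // back to a sort-based aggregate.
        sqlContext.table("sales")
                .groupBy(col("region"))
                .agg(collect_list(col("sku")))
                .explain();
    }
}
```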
Every database I've ever seen has a method for retrieving the count of a query prior to actually executing it. But I can't figure out how to do this simple task in Accumulo.
Just for clarity, I want the Accumulo analog of this Mongo feature.
I checked the Scanner apidocs but I can't find anything. I'm using Java but answers for other languages would be greatly helpful too.
Accumulo is a lower-level application than a traditional RDBMS. It is based on Google's BigTable and is not like a relational database. It's more accurately described as a massively parallel sorted map than a database.
It is designed to do different kinds of tasks than a relational database, and its focus is on big data.
To achieve the equivalent of the MongoDB feature you mentioned in Accumulo (a count of the size of an arbitrary query's result set), you can write a server-side Iterator that returns counts from each tablet server, which can then be summed on the client side to get a total. If you can anticipate your queries, you can also create an index which keeps track of counts during the ingest of your data.
Creating custom Iterators is an advanced activity. Typically, there are important trade-offs (time/space/consistency/convenience) to implementing something as seemingly simple as a count of a result set, so proceed with caution. I would recommend consulting the user mailing list for information and advice.
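That said, if a plain client-side count is acceptable for modest result sets (every matching entry still streams to the client, which is exactly the cost a server-side iterator avoids), a minimal sketch looks like this; the connector, table name, and range are assumptions:

```java
import java.util.Map;
import org.apache.accumulo.core.client.Connector;
import org.apache.accumulo.core.client.Scanner;
import org.apache.accumulo.core.client.TableNotFoundException;
import org.apache.accumulo.core.data.Key;
import org.apache.accumulo.core.data.Range;
import org.apache.accumulo.core.data.Value;
import org.apache.accumulo.core.security.Authorizations;

public class ClientSideCount {
    static long count(Connector connector) throws TableNotFoundException {
        // Scan a (hypothetical) table over a (hypothetical) row range and
        // simply count the entries as they stream back in batches.
        Scanner scanner = connector.createScanner("mytable", Authorizations.EMPTY);
        scanner.setRange(new Range("row_000", "row_999"));
        long count = 0;
        for (Map.Entry<Key, Value> entry : scanner) {
            count++;
        }
        scanner.close();
        return count;
    }
}
```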
I have a use case wherein I need to read rows from a file, transform them using an engine, and then write the output to a database (which can be configured).
While I could write a query builder of my own, I was interested in knowing if there's already an available solution (library).
I searched online and found the jOOQ library, but it looks like it is type-safe and has a code-gen tool, so it is probably suited to static database schemas. In my use case, databases can be configured dynamically and the metadata is read programmatically and made available for write purposes (so a list of tables would be made available, the user can select the columns to write, and the insert script for these columns needs to be created dynamically).
Is there any library that could help me with the use case?
If I understand correctly, you need to query the database structure, display the result via a GUI, and have the user map data from a file to that structure?
Assuming this is the case, you're not looking for a 'library', you're looking for an ETL tool.
Alternatively, if you're set on writing something yourself, the (very) basic way to do this is:
* Read the structure of the database using Connection.getMetaData(). The exact usage can vary between drivers, so you'll need to create an abstraction layer that meets your needs - I'd assume you're just interested in the table structure here.
* Map the format of the file to a structure similar to the tables.
* Provide a GUI that allows the user to connect elements from the file to columns in the table, including any type mapping that is needed.
* Create a parameterized insert statement based on the file-element-to-column mapping - this is just a simple bit of string concatenation (see the sketch after this list).
* Loop through the rows in the file, performing a batch insert for each.
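As a rough illustration of the metadata, insert-building, and batch steps, here is a minimal JDBC-only sketch (TARGET_TABLE and the pre-mapped rows are hypothetical stand-ins for whatever the user configured in the GUI):

```java
import java.sql.Connection;
import java.sql.DatabaseMetaData;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class DynamicInsert {
    static void insertRows(Connection connection, List<Object[]> mappedRows) throws SQLException {
        // 1. Read the column names of the (hypothetical) target table from metadata.
        List<String> columns = new ArrayList<>();
        DatabaseMetaData meta = connection.getMetaData();
        try (ResultSet rs = meta.getColumns(null, null, "TARGET_TABLE", null)) {
            while (rs.next()) {
                columns.add(rs.getString("COLUMN_NAME"));
            }
        }

        // 2. Build a parameterized INSERT by simple string concatenation.
        String sql = "INSERT INTO TARGET_TABLE (" + String.join(", ", columns)
                + ") VALUES (" + String.join(", ", Collections.nCopies(columns.size(), "?")) + ")";

        // 3. Bind each mapped file row and execute as one batch.
        try (PreparedStatement ps = connection.prepareStatement(sql)) {
            for (Object[] row : mappedRows) {
                for (int i = 0; i < row.length; i++) {
                    ps.setObject(i + 1, row[i]);
                }
                ps.addBatch();
            }
            ps.executeBatch();
        }
    }
}
```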
My advice: get an ETL tool. This sounds like a simple problem, but it's full of idiosyncrasies - getting even an 80% solution will be tough and time-consuming.
jOOQ (the library you referenced in your question) can be used without code generation as indicated in the jOOQ manual:
http://www.jooq.org/doc/latest/manual/getting-started/use-cases/jooq-as-a-standalone-sql-builder
http://www.jooq.org/doc/latest/manual/sql-building/plain-sql
When searching through the user group, you'll find other users leveraging jOOQ in the way you intend.
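For illustration, here is a minimal sketch of building such a dynamic INSERT without code generation; the PERSON table and its columns stand in for names read from your metadata at runtime:

```java
import static org.jooq.impl.DSL.*;

import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;
import org.jooq.DSLContext;
import org.jooq.Field;
import org.jooq.SQLDialect;

public class JooqDynamicInsert {
    public static void main(String[] args) {
        // Table and column names arriving at runtime (hypothetical here).
        String tableName = "PERSON";
        List<String> columns = Arrays.asList("FIRST_NAME", "LAST_NAME");

        List<Field<?>> fields = columns.stream()
                .map(c -> field(name(c)))
                .collect(Collectors.toList());

        // No code generation and no Connection needed just to render SQL.
        DSLContext create = using(SQLDialect.POSTGRES);

        // Renders: insert into "PERSON" ("FIRST_NAME", "LAST_NAME") values (?, ?)
        String sql = create.insertInto(table(name(tableName)), fields)
                .values("John", "Doe")
                .getSQL();
        System.out.println(sql);
    }
}
```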
The steps you need to take are:
* read the rows
* build each row into an object
* transform the above object into the target object
* insert the target object into the db
Among the above four steps, the only one you really need to write yourself is step 3.
And for the above purpose, you can use Transmorph, EZMorph, Commons-BeanUtils, Dozer, etc.
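For example, with Commons BeanUtils step 3 can be as small as this (SourceRow and TargetRecord are hypothetical beans whose same-named properties get copied):

```java
import org.apache.commons.beanutils.BeanUtils;

public class RowTransformer {
    // Copies all properties with matching names from source to target.
    static TargetRecord transform(SourceRow source) throws Exception {
        TargetRecord target = new TargetRecord();
        // Note the Commons BeanUtils argument order: (destination, source).
        BeanUtils.copyProperties(target, source);
        return target;
    }
}
```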
I have found the jQuery DataTables plug-in extremely useful for simple, read-only applications where I'd like to give the user pagination, sorting, and searching of very large sets of data (millions of rows using server-side processing).
I have a system for reusing this code, but I end up doing the same thing over and over a lot. I'd like to write a very generalized API where I essentially just need to configure the SQL needed to retrieve the data used in the table. I am looking for a good design pattern/approach to do this. I've seen articles like this one (http://www.codeproject.com/Articles/359750/jQuery-DataTables-in-Java-Web-Applications) and have a complete understanding of how server-side processing works (I have done it in Java and ASP.NET many times). To answer, you will probably need a deep understanding of how server-side processing works in Java, but here are some issues that come up when attempting this:
I generally run three separate queries: a count without the search clause, a count with the clause included, and the query for the actual data. I haven't found an efficient way to do all three at once, and doing so requires a lot of extra data to come back from the db (i.e. counts over and over). The API needs to support behavior based on these three different queries, and complex queries at that. I generally use row_number() over an index for the pagination to be relatively speedy with large data.
* the where clause changes dynamically (the user can search over a variable number of columns).
* the order by clause changes for the same reason.
Overall, each case is often pretty specific to the data we need. Is there a good way to abstract this so that I can do minimal work when I want to use the plug-in server side?
So, the steps are as follows in most projects:
* extract the params the plug-in sends to the server (a lot of the time my own are added, mostly date ranges)
* build the unfiltered count query (this is rarely dynamic)
* build the filtered count query (this is dynamic)
* build the data query
* construct a model object of the table and return it as JSON
A lot of the issues occur when setting the prepared statements with a variable number of parameters. Dynamically generating the SQL in a general way (say, based on just column names) seems unlikely. I am wondering if someone else has created something they are using for this, or if a specific pattern is applicable. It has just occurred to me that creating a reusable filter may be helpful in Java. Any advice would be greatly appreciated. Feel free to be language agnostic, as the architecture is what I'm trying to figure out.
We have a base search criteria class where all request parameters relevant to DataTables are mapped onto class properties (fields), and a custom search criteria class that extends the base one and contains the business-logic-specific fields for custom search. On the server side we also have a repository class that takes the custom search criteria as an argument and makes queries to the database.
If you are familiar with C#, you could check out the custom binding code and an example of its usage.
You could do such custom binding in your Java code as well.
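A minimal Java sketch of that kind of binding: collect the predicates and their bind values in lockstep, whitelist the order-by column, then bind by position (the customer table and columns are hypothetical; LIMIT/OFFSET is MySQL/PostgreSQL syntax):

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class DataTablesQuery {
    static void query(Connection connection, String searchTerm, int sortColumnIndex,
                      boolean sortAsc, int firstRow, int pageSize) throws SQLException {
        // Whitelist of sortable columns; never concatenate raw user input.
        List<String> allowedColumns = Arrays.asList("id", "first_name", "last_name");

        // Build the WHERE clause and its bind values in lockstep.
        StringBuilder where = new StringBuilder(" WHERE 1=1");
        List<Object> binds = new ArrayList<>();
        if (searchTerm != null && !searchTerm.isEmpty()) {
            where.append(" AND (first_name LIKE ? OR last_name LIKE ?)");
            binds.add("%" + searchTerm + "%");
            binds.add("%" + searchTerm + "%");
        }

        String orderBy = " ORDER BY " + allowedColumns.get(sortColumnIndex)
                + (sortAsc ? " ASC" : " DESC");

        String sql = "SELECT id, first_name, last_name FROM customer"
                + where + orderBy + " LIMIT ? OFFSET ?";
        binds.add(pageSize);
        binds.add(firstRow);

        try (PreparedStatement ps = connection.prepareStatement(sql)) {
            for (int i = 0; i < binds.size(); i++) {
                ps.setObject(i + 1, binds.get(i));   // bind by position
            }
            try (ResultSet rs = ps.executeQuery()) {
                while (rs.next()) {
                    // map each row onto the model object returned as JSON
                }
            }
        }
    }
}
```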
I have a large amount of tabular data that I'm trying to display in a JSF datatable. Are there any implementations or components out there that can handle database paging and sorting?
I currently have to pull all the rows in the table back and handle paging and sorting client side in JSF. Unfortunately this is not very performant and occasionally causes my app server to run out of memory.
Ideally, this datatable implementation would be able to wrap a JDBC query or a Hibernate query somehow. I'm not stuck on JSF 1.2; I plan on upgrading to 2.0 sometime soon, if that matters.
Check out the RichFaces datatable with table filtering and table sorting.
Or the extended datatable
But yes, I forgot one small note :) If you need to go to the database and handle your filtering/sorting at the db level, you will need to provide your own implementation of DataModel by extending org.ajax4jsf.model.SerializableDataModel.
There are some blogs going that way; see this one.
You will always have memory problems when the Java code hauls/copies the entire dataset of a datastore (e.g. an RDBMS) into Java's memory and then does the sorting and filtering right there using Java code. It gets even worse when you store it in the session scope of a web application.
The most memory-efficient approach is to let the DB do the task it was invented for. The SQL language offers you, among others, the ORDER BY clause to do the sorting, the WHERE clause to do the filtering, and the (DB vendor-specific) LIMIT/OFFSET clauses/subselects/functions to return only a subset of records based on firstrow and rowcount or lastrow. This way you end up with only the dataset in Java's memory which is actually to be displayed.
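A minimal sketch of such a paged query (LIMIT/OFFSET as shown is MySQL/PostgreSQL syntax; Oracle would need ROWNUM or row_number() instead, and the item table is hypothetical):

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.util.ArrayList;
import java.util.List;

public class ItemDao {
    // Fetch exactly one page; the database does the sorting and slicing.
    static List<String[]> listPage(Connection conn, int firstRow, int rowCount) throws SQLException {
        String sql = "SELECT id, name FROM item ORDER BY name LIMIT ? OFFSET ?";
        List<String[]> page = new ArrayList<>();
        try (PreparedStatement ps = conn.prepareStatement(sql)) {
            ps.setInt(1, rowCount);  // page size
            ps.setInt(2, firstRow);  // zero-based index of the first row
            try (ResultSet rs = ps.executeQuery()) {
                while (rs.next()) {
                    page.add(new String[] { rs.getString("id"), rs.getString("name") });
                }
            }
        }
        return page;
    }
}
```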
There is no standard JSF component which does exactly that. It would always require the entire dataset to be available in Java memory, because the filtering and sorting would need to happen in pure Java code or JavaScript. JSF knows nothing about SQL; you'll need to provide a custom implementation of javax.faces.model.DataModel to the datatable and/or control/manage that in the data access layer yourself. You can get a lot of new insights/ideas and find a kick-off example in this article. You can also find examples of the needed SQL queries in this JSP-targeted answer I posted before.
Good luck.
Try this example:
http://weblogs.java.net/blog/caroljmcdonald/archive/2007/05/pagination_of_d.html