Get TableSchema from BigQuery result PCollection<TableRow> - java

When I run a query in the BigQuery Web UI, the results are displayed in a table where both the name and the type of each field are known (even when a field is the result of a COUNT(), AVG(), ... operation, its type is known, of course).
The results can then be directly exported as a table/JSON/CSV.
My question is: when I retrieve query results in my Java project, e.g. with a query:
String query = "SELECT nationality, COUNT(DISTINCT personID) AS population "
    + "FROM Dataset.Table "
    + "GROUP BY nationality";
PCollection<TableRow> result = p.apply(BigQueryIO.Read.fromQuery(query));
... is it possible to obtain the schema of the TableRows in the result PCollection, without explicitly defining it?
I think it must be possible, since the BigQuery Web UI manages it for the same query.
But I can't figure out how to do it ...
TableSchema schema = // function of PCollection<TableRow> result ?
result.apply(BigQueryIO.Write
    .named("Write Results Table")
    .to(getTableReference(tableName))
    .withSchema(schema));
That way, query results could always be automatically exported/saved into a new table (only the table name would then need to be explicitly provided).
Any ideas? Any help would be appreciated :)

Unfortunately, the Dataflow SDK doesn't expose the schema returned by BigQuery via its BigQueryIO API, and there's no "good" workaround within the Dataflow API alone.
Defining the schema manually is one workaround.
Alternatively, you could make a separate query to BigQuery directly via jobs.query at pipeline construction time and derive the schema from its result, which can then be passed to the BigQueryIO.Write transform. This may incur additional cost, but that can probably be mitigated by altering the query slightly to reduce the amount of data processed. The correctness of the output is not relevant, since you'd be keeping only the schema.
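For illustration, here is a minimal sketch of that construction-time query using the google-cloud-bigquery client (not the Dataflow SDK itself); the LIMIT 0 suffix and the type mapping are assumptions meant to avoid processing any data, not an official recipe:

import com.google.api.services.bigquery.model.TableFieldSchema;
import com.google.api.services.bigquery.model.TableSchema;
import com.google.cloud.bigquery.BigQuery;
import com.google.cloud.bigquery.BigQueryOptions;
import com.google.cloud.bigquery.Field;
import com.google.cloud.bigquery.QueryJobConfiguration;
import com.google.cloud.bigquery.TableResult;

import java.util.ArrayList;
import java.util.List;

static TableSchema fetchSchema(String query) throws InterruptedException {
    // Run the query with LIMIT 0 at pipeline construction time: no rows come
    // back, but the result still carries the schema BigQuery inferred.
    BigQuery bigquery = BigQueryOptions.getDefaultInstance().getService();
    TableResult emptyResult = bigquery.query(
        QueryJobConfiguration.newBuilder(query + " LIMIT 0").build());

    // Convert the client-side Schema into the TableSchema that BigQueryIO.Write expects.
    List<TableFieldSchema> fields = new ArrayList<>();
    for (Field f : emptyResult.getSchema().getFields()) {
        fields.add(new TableFieldSchema()
            .setName(f.getName())
            .setType(f.getType().toString()));  // e.g. "STRING", "INTEGER"
    }
    return new TableSchema().setFields(fields);
}

The resulting schema could then be passed to BigQueryIO.Write.withSchema(...) as in the question.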

Conceptually, you could write a function that iterates through all the cells of a given TableRow and, for each one, gets its name and type, building the corresponding TableSchema as it goes.
For simple schemas I would expect this to be relatively easy.
For schemas with records, repeated fields, etc., it could be more complex.
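A minimal sketch of that idea for flat schemas; the type inference from Java runtime types is a naive assumption (BigQuery delivers many scalar values as strings inside a TableRow), so treat this as a starting point only:

import com.google.api.services.bigquery.model.TableFieldSchema;
import com.google.api.services.bigquery.model.TableRow;
import com.google.api.services.bigquery.model.TableSchema;

import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical helper: derive a flat TableSchema from one sample row.
static TableSchema inferSchema(TableRow sampleRow) {
    List<TableFieldSchema> fields = new ArrayList<>();
    for (Map.Entry<String, Object> cell : sampleRow.entrySet()) {
        Object value = cell.getValue();
        String type = (value instanceof Integer || value instanceof Long) ? "INTEGER"
                    : (value instanceof Double || value instanceof Float) ? "FLOAT"
                    : (value instanceof Boolean) ? "BOOLEAN"
                    : "STRING";  // fallback; nested records/repeated fields not handled
        fields.add(new TableFieldSchema().setName(cell.getKey()).setType(type));
    }
    return new TableSchema().setFields(fields);
}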

Which is more efficient when querying the database?

Should we query the table with more filtering, by adding multiple conditions/WHERE clauses to the SQL query, to get only the specific data, or pull all the data and do the filtering in our Java class?
I'm looking for the more efficient coding practice.
Example:
A table with columns Id, Name, Place.
I need to pull the list of ids where Place is in place_list and Name is in name_list.
1)
@SqlQuery("SELECT id "
        + "FROM Person p "
        + "WHERE p.name IN (<name_list>) "
        + "AND p.place IN (<place_list>) "
        + "ORDER BY p.id ASC")
public List<Long> getIds(@BindIn("name_list") List<String> name_list,
                         @BindIn("place_list") List<String> place_list);
or
2)
@SqlQuery("SELECT id FROM Person p")
public List<Long> getIds();
and then apply Java 8 stream filters to the result (see the sketch below).
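For reference, here is roughly what option 2 amounts to in practice. Note that to filter by name and place in Java you would actually have to fetch whole rows rather than just ids; the Person type here is hypothetical:

import java.util.List;
import java.util.stream.Collectors;

// Option 2 in practice: fetch everything, then filter in memory with streams.
static List<Long> getIdsInMemory(List<Person> allPersons,
                                 List<String> name_list, List<String> place_list) {
    return allPersons.stream()
        .filter(p -> name_list.contains(p.getName()))
        .filter(p -> place_list.contains(p.getPlace()))
        .map(Person::getId)
        .sorted()
        .collect(Collectors.toList());
}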
Note: I used name and place in the example above for easy understanding. In reality the data is huge, the table has many fields and rows, and the lists used for filtering are also large.
The best approach is to query with the required filters on the database side. That reduces the amount of data going back and forth between the application and the database, and also the time spent on I/O operations (transferring large amounts of data over the network involves real latency).
It also reduces the memory overhead of processing large amounts of data on the application side.
Also, when you run queries filtering on multiple fields, you can add indexes (if necessary) on those fields, which will improve query fetch time.
Hope this answers your question.
You always want to perform things in the database if possible. You want to avoid transferring data from the database to your application, using up memory, just to throw it away there.
Databases are very efficient at this kind of work, so you'll want to use them to their full extent.
Query with the filters on the database directly instead of downloading all the data into the Java application; it reduces the amount of data travelling from the database to your application.
But be very careful when using user input in the filters. Make sure you have sanitized the user input, or better, bound it as query parameters, before sending it to the database, to avoid SQL injection.
If you are worried about security more than performance, then filter the data in the Java app (if the data is not massive in size).
But I strongly recommend filtering the data on the database itself, with the necessary safeguards in place.
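As a concrete illustration of such a safeguard, here is a minimal JDBC sketch (the person table and the variable names are hypothetical) in which user input is bound as parameters instead of being concatenated into the SQL text:

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.util.ArrayList;
import java.util.List;

// A parameterized query: the user input is bound as data, never parsed as SQL.
static List<Long> findIds(Connection conn, String name, String place) throws SQLException {
    String sql = "SELECT id FROM person WHERE name = ? AND place = ? ORDER BY id";
    try (PreparedStatement ps = conn.prepareStatement(sql)) {
        ps.setString(1, name);
        ps.setString(2, place);
        try (ResultSet rs = ps.executeQuery()) {
            List<Long> ids = new ArrayList<>();
            while (rs.next()) {
                ids.add(rs.getLong("id"));
            }
            return ids;
        }
    }
}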

Is there a way to make a query return a ResultSet?

I have the following query:
#Select("SELECT* FROM "+MyData.TABLE_NAME+" where data_date = #{refDate}")
public List<MyData> getMyData(#Param("refDate") Date refDate);
This table's data is HUGE! Loading so many rows into memory is not the best way!
Is it possible to have this same query return a ResultSet so that I can iterate over one item at a time?
Edit:
I tried adding:
@ResultType(java.sql.ResultSet.class)
public ResultSet getMyData(@Param("refDate") Date refDate);
but it gives me:
nested exception is org.apache.ibatis.reflection.ReflectionException: Error instantiating interface java.sql.ResultSet with invalid types () or values (). Cause: java.lang.NoSuchMethodException: java.sql.ResultSet.<init>()
I'd suggest you use LIMIT in your query; the LIMIT X, Y syntax would suit you. Try it.
However, if the table is huge, offset-based queries become slower and slower, so the best way to iterate is to filter on the id and use LIMIT, such as:
select * from table where id > 0 limit 100
and then
select * from table where id > 100 limit 100
and so on (see the Java sketch below).
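In Java terms, that id-based paging would look roughly like this; the MyDataMapper interface and its getPage method are hypothetical, backed by the WHERE id > ... LIMIT ... query shown above:

import java.util.List;

// Hypothetical MyBatis mapper method behind the queries above:
// @Select("SELECT * FROM my_data WHERE id > #{lastId} ORDER BY id LIMIT #{pageSize}")
// List<MyData> getPage(@Param("lastId") long lastId, @Param("pageSize") int pageSize);

static void iterateAll(MyDataMapper mapper) {
    long lastId = 0;
    List<MyData> page;
    do {
        page = mapper.getPage(lastId, 100);     // fetch the next slice after lastId
        for (MyData row : page) {
            process(row);                       // hypothetical per-row handler
            lastId = Math.max(lastId, row.getId());
        }
    } while (!page.isEmpty());
}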
There are multiple options you have ...
Use pagination on the database side
I will suppose the database is Oracle; however, other DB vendors work similarly. In Oracle you have ROWNUM, with which you can limit the number of records returned; to return the desired records you prepare a WHERE clause using ROWNUM. The question, then, is how to supply a dynamic ROWNUM value in the query, and this is where MyBatis dynamic SQL comes in: you pass the ROWNUM bounds inside a parameter map and use them in your query in the mapper XML via the #{} syntax. With this approach you filter the records at the database level itself, and only bring over (and build Java objects for) the records that are needed, i.e. those on the current page.
Use pagination on the MyBatis side
The select methods on SqlSession accept a RowBounds argument. Populate it as per your needs and MyBatis will bring you only that window of records. Here you limit the number of records on the MyBatis side, whereas the first approach does the same on the database side, which performs better.
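A minimal sketch of the RowBounds approach; the statement id and the refDate parameter are assumptions carried over from the question:

import org.apache.ibatis.session.RowBounds;
import org.apache.ibatis.session.SqlSession;
import org.apache.ibatis.session.SqlSessionFactory;

import java.util.Date;
import java.util.List;

static List<MyData> firstPage(SqlSessionFactory factory, Date refDate) {
    try (SqlSession session = factory.openSession()) {
        // Skip 0 rows, return at most 100; MyBatis trims the result set for you.
        return session.selectList("com.example.MyDataMapper.getMyData",
                                  refDate, new RowBounds(0, 100));
    }
}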
Use a ResultHandler
MyBatis gives you control of the actual JDBC result set, so you can iterate over the results one by one here.
See this blog entry for more details.
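A minimal sketch of the ResultHandler approach, again assuming the statement id from the question and a hypothetical process() method; rows are handed to you one at a time instead of being collected into a list:

import org.apache.ibatis.session.ResultContext;
import org.apache.ibatis.session.ResultHandler;
import org.apache.ibatis.session.SqlSession;
import org.apache.ibatis.session.SqlSessionFactory;

import java.util.Date;

static void streamMyData(SqlSessionFactory factory, Date refDate) {
    try (SqlSession session = factory.openSession()) {
        session.select("com.example.MyDataMapper.getMyData", refDate,
            new ResultHandler<MyData>() {
                @Override
                public void handleResult(ResultContext<? extends MyData> ctx) {
                    process(ctx.getResultObject());  // one row at a time, nothing accumulates
                }
            });
    }
}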

Join PostgreSQL rows with MongoDB documents based on specific columns

I'm using MongoDB and PostgreSQL in my application. The reason for using MongoDB is that any number of new fields might get inserted, and we store the data for those in MongoDB.
We store our fixed field values in PostgreSQL and the custom field values in MongoDB.
E.g.
Employee table (RDBMS):
id | Name  | Salary
1  | Krish | 40000
Employee collection (MongoDB):
{
    _id: <some autogenerated id of MongoDB>,
    instanceId: 1,        // the id from SQL, manually assigned
    employeeCode: "A001"
}
We fetch the records from SQL and, using their ids, fetch the related records from MongoDB, then map the result to get the values of the new fields and send it all to the UI.
Now I'm searching for an optimized solution to get the MongoDB results into the PostgreSQL POJO/model, so that I don't have to fetch the data from MongoDB manually by passing SQL ids and mapping them again.
Is there any way to connect MongoDB with PostgreSQL through columns (here, the id of the RDBMS row and the instanceId of the MongoDB document) so that with one fetch I can get the related Mongo results too? Any kind of return type is acceptable, but I need all of the data in one call.
I'm using Hibernate and Spring in my application.
Using Spring Data might be the best solution for your use case, since it supports both:
JPA
MongoDB
You can still get all the data in one request, but that doesn't mean you have to use a single DB call: you can have one service call that spans two database calls. Because the PostgreSQL row is probably the primary entity, I'd advise you to share the PostgreSQL primary key with MongoDB too.
There's no need for separate IDs. This way you can simply fetch the SQL row and the Mongo document by the same ID. Sharing the same ID also lets you process the two requests concurrently and merge the results prior to returning from the service call, so the service method's duration is not the sum of the two repository calls but the max of the two.
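A minimal sketch of that concurrent fetch-and-merge, assuming Spring Data repositories employeeRepository (JPA) and employeeDocRepository (MongoDB) that share the same ID, plus a hypothetical EmployeeView type that merges the two results:

import java.util.concurrent.CompletableFuture;

// Inside a service method; both lookups run concurrently, so total latency
// approaches the max of the two calls rather than their sum.
public EmployeeView getEmployee(long id) {
    CompletableFuture<Employee> sqlPart = CompletableFuture.supplyAsync(
        () -> employeeRepository.findById(id).orElseThrow(IllegalStateException::new));
    CompletableFuture<EmployeeDoc> mongoPart = CompletableFuture.supplyAsync(
        () -> employeeDocRepository.findById(id).orElseThrow(IllegalStateException::new));

    // Merge the SQL row and the Mongo document into one view before returning.
    return sqlPart.thenCombine(mongoPart, EmployeeView::new).join();
}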
Astonishingly, yes, you potentially can. There's a foreign data wrapper named mongo_fdw that allows PostgreSQL to query MongoDB. I haven't used it and have no opinion as to its performance, utility, or quality.
I would be very surprised if you could use it effectively via Hibernate, unless you can convince Hibernate that the FDW-mapped "tables" are just views. You might have more luck with EclipseLink and its "NoSQL" support if you want to do this at the Java level.
Separately, this sounds like a monstrosity of a design. There are many sane ways to do what you want within a decent RDBMS, without going for a hybrid database platform. There's a time and a place for hybrid, but I really doubt your situation justifies the complexity.
Just use PostgreSQL's json/jsonb support for the dynamic mappings, or use traditional options like storing JSON as text fields, storing XML, or even EAV mapping. Don't build a Rube Goldberg machine.
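To make the jsonb suggestion concrete, here is a small JDBC sketch; the employee table with a custom_fields jsonb column is a hypothetical layout, but it shows the fixed columns and a dynamic field coming back in a single query:

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

// One round trip: fixed columns plus a dynamic field extracted from the jsonb column.
static void printEmployee(Connection conn, long id) throws SQLException {
    String sql = "SELECT id, name, salary, custom_fields->>'employeeCode' AS employee_code "
               + "FROM employee WHERE id = ?";
    try (PreparedStatement ps = conn.prepareStatement(sql)) {
        ps.setLong(1, id);
        try (ResultSet rs = ps.executeQuery()) {
            while (rs.next()) {
                System.out.println(rs.getString("name") + " / "
                                 + rs.getString("employee_code"));
            }
        }
    }
}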

How to check if a table exists in jOOQ?

After opening a database connection, I want to check whether the database is newly minted or not. I am using H2, which automatically creates a DB if one doesn't exist.
I tried this check:
db.Public.PUBLIC.getTables().isEmpty()
but that returns a static list of tables (without querying the schema in the database).
I could write raw SQL to get the table list, but that would be specific to the database engine. Is there a generic alternative in jOOQ?
You cannot use:
db.Public.PUBLIC.getTables().isEmpty()
because the generated meta information is not connected to the database. Instead, you may want to take a look at DSLContext.meta(). In your case, you'd simply write:
DSL.using(configuration).meta().getTables().isEmpty();
If you run this test very often, that's of course not a very performant way of checking whether there are any tables, as it will fetch all tables into memory just to run an isEmpty() check. I suggest issuing an actual query instead:
int numberOfTables =
    DSL.using(configuration)
       .select(count())
       .from("information_schema.tables")
       .where("table_schema = 'PUBLIC'")
       .fetchOne(0, int.class);
A future jOOQ version (after 3.11) will be able to offer you actual object existence predicates that can be used in SQL or elsewhere:
https://github.com/jOOQ/jOOQ/issues/8038

Joint data set limited to two tables in BIRT

I have a requirement to extract data from three tables A, B, and C, based on two conditions; these tables are all located in the same data source.
Does BIRT 3.1 support joint data sets with more than two tables?
If not, is there a way to overcome this limitation?
You don't say what your data source is, but assuming it is a SQL database, you can do something like the following in the SQL itself. You only need BIRT joins if the data is in different data sources.
select TableA.Field
, TableB.OtherField
, TableC.SomeOtherField
from dbo.TableA
left join dbo.TableB
on TableA.Same = TableB.Same
left join dbo.TableC
on TableA.Same = TableC.Same
where TableA.Important = 'Something'
In addition to James' answer:
In many cases, just joining the tables using SQL is the best solution (you should know SQL if you are developing with BIRT, unless someone else prepared the data sets and corresponding report items for you).
As an alternative, keep in mind that BIRT does not have a "data model" like other report designers (e.g. Oracle Reports); instead, you link data from different data sets by creating a corresponding layout structure, with data set parameter bindings.
You didn't mention the logical structure of your data.
If it's master-detail-detail (for example, artist-album-title), then you would use, say, a list item bound to DS "artist", containing a list or table item bound to DS "album", which in turn contains a table bound to DS "title".
The DS "album" would need a DS parameter like "artist_id" (which you use in the WHERE clause of its SELECT statement), and in the list/table item bound to DS "album" you would use row["artist_id"] as the value for the DS parameter "artist_id".
It is similar for the table item bound to DS "title". However, if the primary key consists of (artist_id, album_id, title_no), you probably need access to the current artist from the outermost list item; to get it, you use row._outer["artist_id"].
Another solution to this problem is to use a stored procedure query: you define your procedure with whatever SQL you want, compile it with your DBMS, and call it from BIRT with the syntax
{call nameOfYourProcedure(?, ?, ?, ...)}
The question marks refer to the parameters that you pass to your stored procedure.
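For reference, the same JDBC escape syntax applies when calling the procedure directly from Java; the procedure name and its two condition parameters below are hypothetical:

import java.sql.CallableStatement;
import java.sql.Connection;
import java.sql.ResultSet;
import java.sql.SQLException;

static void runJoinProcedure(Connection conn) throws SQLException {
    // {call ...} is the standard JDBC escape for invoking a stored procedure.
    try (CallableStatement cs = conn.prepareCall("{call joinAbcProcedure(?, ?)}")) {
        cs.setString(1, "conditionOne");   // the two conditions from the question
        cs.setString(2, "conditionTwo");
        try (ResultSet rs = cs.executeQuery()) {
            while (rs.next()) {
                // read the joined columns from tables A, B, and C here
            }
        }
    }
}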
