How to read from Hive without MapReduce? I am trying to read a column from a table created in Hive, but I don't want the overhead that comes from MapReduce. Basically, I want to retrieve the values from a Hive table without that overhead and get them the fastest way possible.
Instead of MapReduce, you can use Tez or Spark as your execution engine in Hive.
See hive.execution.engine in Hive Configuration Properties.
There are also quite a few SQL engines compatible with the Hive metastore, e.g. Presto, Spark SQL, Impala.
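For example, with the Hive JDBC driver you can switch the engine per session; a minimal sketch, where the connection URL, table and column names are placeholders:
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

// Load the Hive JDBC driver and open a HiveServer2 connection.
Class.forName("org.apache.hive.jdbc.HiveDriver");
Connection conn = DriverManager.getConnection("jdbc:hive2://hive-host:10000/default", "user", "");
Statement stmt = conn.createStatement();
// Switch the execution engine for this session (mr, tez or spark).
stmt.execute("SET hive.execution.engine=tez");
// Simple projections like this may even be answered by a local fetch task, without launching any job.
ResultSet rs = stmt.executeQuery("SELECT my_column FROM my_table LIMIT 10");
while (rs.next()) {
    System.out.println(rs.getString(1));
}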
Generally, if you do a SELECT * FROM on a table in Hive, MapReduce won't run.
In your case, since you are just selecting a column from a Hive table, MapReduce won't run either.
Or you can create a subtable from the main table with only the needed columns and rows, and just do a SELECT * on that table.
I'm new to Spark. I'm loading a huge CSV file using the DataFrame code given below
Dataset<Row> df = sqlContext.read().format("com.databricks.spark.csv").schema(customSchema)
.option("delimiter", "|").option("header", true).load(inputDataPath);
After loading the CSV data into the data frame, I want to iterate through each row and, based on some columns, query a PostgreSQL DB (performing some geometry operation). Later I want to merge some fields returned from the DB with the data frame records. What's the best way to do this, considering the huge number of records?
Any help appreciated. I'm using Java.
As @mck also pointed out: the best way is to use a join.
With Spark you can read an external JDBC table using the DataFrame API.
For example:
val props = Map(....)
spark.read.format("jdbc").options(props).load()
See the DataFrameReader scaladoc for more options and which properties and values you need to set.
Then use a join to merge the fields.
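In Java (since you mentioned you're using Java), a rough sketch of the same idea; df and sqlContext are from your snippet above, while the JDBC URL, table name and join columns are placeholders, not taken from your data:
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

// Read the PostgreSQL table through the JDBC data source.
Dataset<Row> pgDf = sqlContext.read()
        .format("jdbc")
        .option("url", "jdbc:postgresql://pg-host:5432/mydb")   // placeholder URL
        .option("dbtable", "public.geo_data")                   // placeholder table
        .option("user", "user")
        .option("password", "password")
        .load();

// Join on the shared key instead of querying the DB row by row,
// then pick the fields you need from both sides.
Dataset<Row> merged = df.join(pgDf, df.col("geo_id").equalTo(pgDf.col("id")), "left")
        .select(df.col("*"), pgDf.col("geometry"));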
1) Is it possible to create a Phoenix table backed by existing HBase table?
Based on this info here
http://phoenix.apache.org/language/#create_table
it should be possible but what options exactly one needs to pass in, I am not sure. I see no examples there.
"HBase table and column configuration options may be passed through as key/value pairs to configure the HBase table as desired. Note that when using the IF NOT EXISTS clause, if a table already exists, then no change will be made to it."
2) Also, is it possible that in the process of table creation, I control myself the mapping of phoenix column names to HBase column names?
3) I know that a Phoenix view (backed by an HBase table) has certain problems/limitations to maintain its indexes if the writing process writes directly to the underlying HBase table and not to the Phoenix view.
https://issues.apache.org/jira/browse/PHOENIX-1499
https://issues.apache.org/jira/browse/PHOENIX-1619
So... will that be a problem also if we create a Phoenix table backed by an HBase table (and write to the HBase table behind Phoenix's back)? I mean, if the answer to 1) is positive, will I have the same problem with a Phoenix table as with a Phoenix view (provided that my writes don't go through Phoenix)?
Taking a direct stab at the answers here.
a) Specific properties can be passed to column families or to the table in general; the available options are listed here: http://phoenix.apache.org/language/#options. You can also create a view that references an existing HBase table. I prefer views because I can drop and recreate them without issues, unlike tables, where a drop will cause the underlying HBase table to vanish as well (see the sketch after this answer).
b) This is not possible AFAIK. There are no mapping options between existing HBase tables and their corresponding Phoenix views (i.e. having fN in Phoenix refer to firstName in HBase).
c) That is correct. At least in the 4.x versions of Phoenix, this is true.
i) If you create a Phoenix table (the HBase table will get created automatically) and write to HBase directly, you will have to take care to use Phoenix types when writing so that the values can be read back properly through Phoenix (https://github.com/apache/phoenix/blob/master/phoenix-core/src/main/java/org/apache/phoenix/schema/types). Also note that you will have to handle the Phoenix salt byte yourself if you have defined the table with SALT_BUCKETS.
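A sketch of a) over Phoenix JDBC; the HBase table name USER_TABLE, the column family d and the qualifiers firstName/lastName are made-up examples:
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

// Load the Phoenix driver and connect via the ZooKeeper quorum (jdbc:phoenix:<quorum>).
Class.forName("org.apache.phoenix.jdbc.PhoenixDriver");
Connection conn = DriverManager.getConnection("jdbc:phoenix:zk-host:2181");
Statement stmt = conn.createStatement();
// Quoted names must match the HBase table and column qualifiers exactly;
// the HBase row key becomes the primary key column.
stmt.execute(
    "CREATE VIEW IF NOT EXISTS \"USER_TABLE\" (" +
    " pk VARCHAR PRIMARY KEY," +
    " \"d\".\"firstName\" VARCHAR," +
    " \"d\".\"lastName\" VARCHAR)");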
I want to use a result set to hold my user data. Just one wrinkle - the data is in a CSV (comma delimited) format, not in a database.
The CSV file contains a header row at the top, then data rows. I would like to dynamically create a result set but not using an SQL query. The input would be the CSV file.
I am expecting (from my Java experience) that there would be methods like rs.newRow, rs.newColumn, rs.updateRow and so on, so I could use the structure of a result set without having any database query first.
Are there such methods on result sets? I read the docs and did not find any. It would be highly useful to just build a result set from the CSV and then use it as if I had run an SQL query. I find it difficult to use a Java bean, an ArrayList of beans, and such, because I want to read in CSV files with different structures, the equivalent of SELECT * FROM in SQL.
I don't know what the names or numbers of columns will be in advance. So other SO questions like "Manually add data to a Java ResultSet" do not answer my needs. In that case, he already has an SQL created result set, and wants to add rows.
In my case, I want to create a result set from scratch, and add columns as the first row is read in. Now that I am thinking about it, I could create a query from the first row, using a statement like SELECT COL1, COL2, COL3 ... FROM DUMMY. Then use the INSERT ROW statement to read and insert the rest. (Later:) But on trying that, it fails because there is no real table associated.
The CsvJdbc open source project should do just that:
// Standard JDBC imports (the CsvJdbc jar must be on the classpath).
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

// Load the driver.
Class.forName("org.relique.jdbc.csv.CsvDriver");
// Create a connection using the directory containing the file(s).
Connection conn = DriverManager.getConnection("jdbc:relique:csv:/path/to/files");
// Create a Statement object to execute the query with.
Statement stmt = conn.createStatement();
// Query the table. The name of the table is the name of the file without ".csv".
ResultSet results = stmt.executeQuery("SELECT col1, col2 FROM myfile");
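Since the column names are not known in advance, you can also run SELECT * and discover them from the ResultSet metadata; continuing the snippet above ("myfile" is the same example table):
// Columns are not known up front, so read their names from the metadata.
ResultSet rs = stmt.executeQuery("SELECT * FROM myfile");
java.sql.ResultSetMetaData meta = rs.getMetaData();
while (rs.next()) {
    for (int i = 1; i <= meta.getColumnCount(); i++) {
        System.out.print(meta.getColumnName(i) + "=" + rs.getString(i) + "  ");
    }
    System.out.println();
}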
Just another idea to complement @Mureinik's answer, which I find really good.
Alternatively, you could use any CSV reader library (there are many out there) and load the file into an in-memory table using an in-memory database such as H2, HyperSQL, Derby, etc. These give you a full SQL engine where you can run complex queries.
It requires more work but you get a lot of flexibility to use the data afterwards.
After you try the in-memory solution, switching to a persistent database is really easy (just change the URL). That way you could load the CSV file only once into the database; from the second execution on, the database would be ready to use, with no need to load the CSV again.
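As an illustration of that route, H2 can load a CSV directly with its built-in CSVREAD function; a minimal sketch, where the file path and table name are made up:
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

// In-memory H2 database; DB_CLOSE_DELAY=-1 keeps it alive between connections.
Connection conn = DriverManager.getConnection("jdbc:h2:mem:csvdb;DB_CLOSE_DELAY=-1");
Statement stmt = conn.createStatement();
// Create a table from the CSV; H2 takes the column names from the header row.
stmt.execute("CREATE TABLE mydata AS SELECT * FROM CSVREAD('/path/to/myfile.csv')");
// Now query it like any SQL table (or switch the URL to a file-based DB to persist it).
ResultSet rs = stmt.executeQuery("SELECT * FROM mydata");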
I have about 10 thousand records (stored in an ArrayList in Java). I want to insert these records into Impala.
Should I use INSERT INTO table PARTITION ... VALUES to insert directly into Impala? (I am not sure how many records can be inserted in one SQL statement.)
Or should I write these records to HDFS and then alter the Impala table?
Which way is preferred? Or are there any other solutions?
Also, if I do this every 5 minutes, how can I avoid ending up with many small files in one partition (partitioned by hour)? That would produce 12 small files in each partition, so will this affect query speed?
The best thing you can do is:
Create your table in Impala as an external table associated with an HDFS path.
Make the insertions directly into HDFS, if possible daily; per hour is probably too little.
Execute the INVALIDATE METADATA table_name command so that the data becomes visible (a sketch follows below).
I hope the answer helps you. Regards!
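A rough Java sketch of that flow, assuming an external table partitioned by hour whose data lives under /warehouse/mytable; the paths, file layout and record format are assumptions, not something prescribed by Impala:
import java.io.BufferedWriter;
import java.io.OutputStreamWriter;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.List;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// The ~10k records from your ArrayList, already formatted as delimited lines
// (two sample rows shown here).
List<String> records = Arrays.asList("1,foo,2021-01-01 00:01", "2,bar,2021-01-01 00:02");

// Write one file per batch into the hour partition directory of the external table.
Configuration hadoopConf = new Configuration();
FileSystem fs = FileSystem.get(hadoopConf);
Path file = new Path("/warehouse/mytable/hour=2021010100/batch_" + System.currentTimeMillis() + ".csv");
try (BufferedWriter out = new BufferedWriter(
        new OutputStreamWriter(fs.create(file, false), StandardCharsets.UTF_8))) {
    for (String record : records) {
        out.write(record);
        out.newLine();
    }
}
// Afterwards run INVALIDATE METADATA mytable (or REFRESH mytable) from any Impala
// client so the new file becomes visible to queries.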
I have a table with millions of records. I want to execute a Hive query and return the result set in chunks to the client. For example, on the client's first fetch request I want to return the first 1000 records, then on subsequent requests the next 1000 records, and so on.
One way is to fetch the complete result set while executing the Hive query, save it, and iterate over it as the client requests. But if my result set is very large, this can create out-of-memory issues while holding the complete result set in memory.
Is it possible to get the data for the same Hive query in chunks from Hive? From my exploration, I found that Hive does not support pagination, and I also cannot just re-execute the query with a LIMIT clause each time, as the Hive documentation says that the LIMIT clause picks records randomly.
I am using JDBC for hive query execution. Is there any solution provided in JDBC that can work with hive?
Or Is there any other approach to address this use case?
Thanks in advance.
The below is just an alternative approach:
Make your Hive table bucketed and use the column or columns that are unique and have values in a range as the CLUSTERED BY fields. Since you use CLUSTER BY, your data would be sorted globally and distributed, so you could always run the SELECT query with these columns as your filter condition (see the sketch below).
The above is just a suggestion. Hope it helps.
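A rough sketch of that filtering approach over Hive JDBC; the connection URL, the table name big_table and the clustered-by column id are placeholders:
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

Class.forName("org.apache.hive.jdbc.HiveDriver");
Connection conn = DriverManager.getConnection("jdbc:hive2://hive-host:10000/default", "user", "");
Statement stmt = conn.createStatement();

// Fetch the next slice of the id range on each client request; keep lastSeenId
// between requests. How many rows a slice returns depends on how dense the ids are.
long lastSeenId = 0L;
long sliceSize = 1000L;
ResultSet rs = stmt.executeQuery(
        "SELECT * FROM big_table WHERE id > " + lastSeenId +
        " AND id <= " + (lastSeenId + sliceSize));
while (rs.next()) {
    // stream the row back to the client here
}
// For the next client request, advance the window: lastSeenId += sliceSize.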