Using the Java API, I'm trying to Put() the content of some files into HBase 1.1.x. To do so, I have created a WholeFileInput class (ref: Using WholeFileInputFormat with Hadoop MapReduce still results in Mapper processing 1 line at a time) so that MapReduce reads the entire file instead of one line at a time. Unfortunately, I cannot figure out how to form my rowkey from the given filename.
Example:
Input:
file-123.txt
file-524.txt
file-9577.txt
...
file-"anotherNumber".txt
Result on my HBase table:
Row-----------------Value
123-----------------"content of 1st file"
524-----------------"content of 2nd file"
...etc
If anyone has already faced this situation, please help me with it.
Thanks in advance.
Your rowkey can be like this:
rowkey = prefix + (filename part or full file name) + MurmurHash(fileContent)
where the prefix can be any value that falls between the presplits you defined at table creation time.
For example:
create 'tableName', {NAME => 'colFam', VERSIONS => 2, COMPRESSION => 'SNAPPY'},
{SPLITS => ['0','1','2','3','4','5','6','7']}
The prefix can be any random id generated within the range of the pre-splits.
This kind of rowkey also avoids hot-spotting as the data grows, and the data will be spread across the region servers.
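A minimal mapper sketch of that idea (assumed, not from the question: the key/value types match the WholeFileInputFormat from the linked post, the column qualifier is a placeholder, and a plain hash of the file number stands in for the Murmur hash of the content):

import java.io.IOException;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

public class WholeFileToHBaseMapper
        extends Mapper<NullWritable, BytesWritable, ImmutableBytesWritable, Put> {

    private static final byte[] CF = Bytes.toBytes("colFam");    // column family from the create statement above
    private static final byte[] COL = Bytes.toBytes("content");  // placeholder qualifier
    private static final int NUM_SPLITS = 8;                     // matches the '0'..'7' presplits

    @Override
    protected void map(NullWritable key, BytesWritable value, Context context)
            throws IOException, InterruptedException {
        // The file name comes from the input split, e.g. "file-123.txt".
        String fileName = ((FileSplit) context.getInputSplit()).getPath().getName();
        String filePart = fileName.replaceAll("\\D+", "");       // "123"

        // Salt prefix keeps rows spread across the presplit regions (avoids hot-spotting).
        String prefix = String.valueOf(Math.abs(filePart.hashCode()) % NUM_SPLITS);
        byte[] rowkey = Bytes.toBytes(prefix + "-" + filePart);

        Put put = new Put(rowkey);
        put.addColumn(CF, COL, value.copyBytes());
        context.write(new ImmutableBytesWritable(rowkey), put);
    }
}

The job would then be wired to HBase's TableOutputFormat as usual.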
I just wrote a toy class to test Spark DataFrames (actually a Dataset, since I'm using Java).
Dataset<Row> ds = spark.sql("select id,name,gender from test2.dummy where dt='2018-12-12'");
ds = ds.withColumn("dt", lit("2018-12-17"));
ds.cache();
ds.write().mode(SaveMode.Append).insertInto("test2.dummy");
//
System.out.println(ds.count());
According to my understanding, there are 2 actions here, "insertInto" and "count".
I debugged the code step by step; when running "insertInto", I see several lines of:
19/01/21 20:14:56 INFO FileScanRDD: Reading File path: hdfs://ip:9000/root/hive/warehouse/test2.db/dummy/dt=2018-12-12/000000_0, range: 0-451, partition values: [2018-12-12]
When running "count", I still see similar logs:
19/01/21 20:15:26 INFO FileScanRDD: Reading File path: hdfs://ip:9000/root/hive/warehouse/test2.db/dummy/dt=2018-12-12/000000_0, range: 0-451, partition values: [2018-12-12]
I have 2 questions:
1) When there are 2 actions on the same dataframe like above, and I don't call ds.cache or ds.persist explicitly, will the 2nd action always cause the SQL query to be re-executed?
2) If I understand the logs correctly, both actions trigger HDFS file reading. Does that mean ds.cache() doesn't actually work here? If so, why doesn't it work?
Many thanks.
It's because you append into the table that ds is created from, so ds needs to be recomputed because the underlying data changed. In such cases, Spark invalidates the cache. See e.g. this Jira (https://issues.apache.org/jira/browse/SPARK-24596):
When invalidating a cache, we invalid other caches dependent on this
cache to ensure cached data is up to date. For example, when the
underlying table has been modified or the table has been dropped
itself, all caches that use this table should be invalidated or
refreshed.
Try to run the ds.count before inserting into the table.
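Applied to the snippet in the question, that just means reordering the two actions so the cache is materialized before the source table is mutated (a sketch based on the question's own code, not a guaranteed fix for every Spark version):

Dataset<Row> ds = spark.sql("select id,name,gender from test2.dummy where dt='2018-12-12'");
ds = ds.withColumn("dt", lit("2018-12-17"));
ds.cache();
long n = ds.count();   // materializes the cache while the source table is still unchanged
ds.write().mode(SaveMode.Append).insertInto("test2.dummy");
System.out.println(n);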
I found that the other answer doesn't work. What I had to do was break the lineage, so that the df I was writing does not know that one of its sources is the table I am writing to. To break the lineage, I created a copy of the df using
copy_of_df = sql_context.createDataFrame(df.rdd)
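For the Java question above, the equivalent lineage-breaking copy would look roughly like this (a sketch; ds and spark are the objects from the question):

Dataset<Row> copyOfDs = spark.createDataFrame(ds.javaRDD(), ds.schema());
copyOfDs.write().mode(SaveMode.Append).insertInto("test2.dummy");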
In Java Spark, I have a dataframe that has a 'bucket_timestamp' column, which represents the time of the bucket that the row belongs to.
I want to write the dataframe to a Cassandra DB. The data must be written to the DB with a TTL. The TTL should depend on the bucket timestamp: each row's TTL should be calculated as ROW_TTL = CONST_TTL - (CurrentTime - bucket_timestamp), where CONST_TTL is a constant TTL that I configured.
Currently I am writing to Cassandra with spark using a constant TTL, with the following code:
df.write().format("org.apache.spark.sql.cassandra")
.options(new HashMap<String, String>() {
{
put("keyspace", "key_space_name");
put("table, "table_name");
put("spark.cassandra.output.ttl, Long.toString(CONST_TTL)); // Should be depended on bucket_timestamp column
}
}).mode(SaveMode.Overwrite).save();
One possible way I thought about is, for each possible bucket_timestamp, to filter the data by that timestamp, calculate the TTL and write the filtered data to Cassandra. But this seems very inefficient and not the Spark way. Is there a way in Java Spark to provide a Spark column as the TTL option, so that the TTL will differ for each row?
The solution should work with Java and Dataset<Row>: I encountered some solutions for doing this with RDDs in Scala, but didn't find one for Java and dataframes.
Thanks!
From the Spark-Cassandra connector options (https://github.com/datastax/spark-cassandra-connector/blob/v2.3.0/spark-cassandra-connector/src/main/java/com/datastax/spark/connector/japi/RDDAndDStreamCommonJavaFunctions.java) you can set the TTL as:
constant value (withConstantTTL)
automatically resolved value (withAutoTTL)
column-based value (withPerRowTTL)
In your case you could try the last option and compute the TTL as a new column of the starting Dataset with the rule you provided in the question.
For a usage example you can see the test here: https://github.com/datastax/spark-cassandra-connector/blob/master/spark-cassandra-connector/src/it/scala/com/datastax/spark/connector/writer/TableWriterSpec.scala#L612
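A sketch of computing such a column in Java (assuming bucket_timestamp is a timestamp column, CONST_TTL is in seconds, and the column name row_ttl is made up):

import static org.apache.spark.sql.functions.col;
import static org.apache.spark.sql.functions.lit;
import static org.apache.spark.sql.functions.unix_timestamp;

// CONST_TTL - (current time - bucket_timestamp), all in seconds
Dataset<Row> withTtl = df.withColumn(
        "row_ttl",
        lit(CONST_TTL).minus(unix_timestamp().minus(unix_timestamp(col("bucket_timestamp")))));

The resulting column could then serve as the per-row TTL when writing through the RDD API mentioned below.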
For the DataFrame API there is no support for such functionality yet. There is a JIRA for it - https://datastax-oss.atlassian.net/browse/SPARKC-416 - which you can watch to get notified when it's implemented.
So the only choice you have is to use the RDD API, as described in #bartosz25's answer.
I have saved a remote DB table in Hive using the saveAsTable method. Now when I try to access the Hive table data using the CLI command select * from table_name, I get the error below:
2016-06-15 10:49:36,866 WARN [HiveServer2-Handler-Pool: Thread-96]:
thrift.ThriftCLIService (ThriftCLIService.java:FetchResults(681)) -
Error fetching results: org.apache.hive.service.cli.HiveSQLException:
java.io.IOException: parquet.io.ParquetDecodingException: Can not read
value at 0 in block -1 in file hdfs:
Any idea what I might be doing wrong here?
Problem:
Facing the issue below while querying the data in impyla (data written by a Spark job):
ERROR: Error while processing statement: FAILED: Execution Error, return code 2 from org.apache.hadoop.hive.ql.exec.tez.TezTask. Vertex failed, vertexName=Map 1, vertexId=vertex_1521667682013_4868_1_00, diagnostics=[Task failed, taskId=task_1521667682013_4868_1_00_000082, diagnostics=[TaskAttempt 0 failed, info=[Error: Failure while running task:java.lang.RuntimeException: java.lang.RuntimeException: java.io.IOException: org.apache.parquet.io.ParquetDecodingException: Can not read value at 0 in block -1 in file hdfs://shastina/sys/datalake_dev/venmo/data/managed_zone/integration/ACCOUNT_20180305/part-r-00082-bc0c080c-4080-4f6b-9b94-f5bafb5234db.snappy.parquet
at org.apache.hadoop.hive.ql.exec.tez.TezProcessor.initializeAndRunProcessor(TezProcessor.java:173)
at org.apache.hadoop.hive.ql.exec.tez.TezProcessor.run(TezProcessor.java:139)
at org.apache.tez.runtime.LogicalIOProcessorRuntimeTask.run(LogicalIOProcessorRuntimeTask.java:347)
at org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable$1.run(TezTaskRunner.java:194)
at org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable$1.run(TezTaskRunner.java:185)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1724)
at org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable.callInternal(TezTaskRunner.java:185)
at org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable.callInternal(TezTaskRunner.java:181)
at org.apache.tez.common.CallableWithNdc.call(CallableWithNdc.java:36)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Root Cause:
This issue is caused by the different Parquet conventions used in Hive and Spark. In Hive, the decimal datatype is represented as fixed bytes (INT 32). In Spark 1.4 or later the default convention is to use the standard Parquet representation for the decimal data type. Under the standard Parquet representation, the underlying physical type changes based on the precision of the column's datatype.
e.g.:
DECIMAL can be used to annotate the following types:
int32: for 1 <= precision <= 9
int64: for 1 <= precision <= 18; precision < 10 will produce a warning
Hence this issue happens only with datatypes that have different representations in the different Parquet conventions. If the datatype is DECIMAL(10,3), both conventions represent it as INT32, so we won't face an issue. If you are not aware of the internal representation of the datatypes, it is safe to read with the same convention that was used for writing. With Hive, you do not have the flexibility to choose the Parquet convention, but with Spark, you do.
Solution:
The convention used by Spark to write Parquet data is configurable. It is determined by the property spark.sql.parquet.writeLegacyFormat, whose default value is false. If set to "true", Spark will use the same convention as Hive for writing the Parquet data, which solves the issue.
--conf "spark.sql.parquet.writeLegacyFormat=true"
References:
Hive LanguageManual Types - Decimals
SPARK-20937 (issues.apache.org)
I had a similar error (but at a positive index in a non-negative block), and it came from the fact that I had created the Parquet data with some Spark dataframe types marked as non-nullable when they were actually null.
In my case, I thus interpret the error as Spark attempting to read data from a certain non-nullable type and stumbling across an unexpected null value.
To add to the confusion, after reading the Parquet file, Spark reports with printSchema() that all the fields are nullable, whether they are or not. However, in my case, making them really nullable in the original Parquet file solved the problem.
Now, the fact that the error in the question happens at "0 in block -1" is suspicious: it actually almost looks as if the data was not found, since block -1 looks like Spark has not even started reading anything (just a guess).
It looks like a schema mismatch problem here.
If you set your schema to be not nullable and create your dataframe with a None value, Spark will throw a ValueError: This field is not nullable, but got None error.
[Pyspark]
from pyspark.sql.functions import * #udf, concat, col, lit, ltrim, rtrim
from pyspark.sql.types import *
schema = ArrayType(StructType([StructField('A', IntegerType(), nullable=False)]))
# It will throw "ValueError".
df = spark.createDataFrame([[[None]],[[2]]],schema=schema)
df.show()
But that is not the case if you use a udf.
Using the same schema, if you use a udf for the transformation, it won't throw a ValueError even if your udf returns a None. And this is where the data/schema mismatch happens.
For example:
df = spark.createDataFrame([[[1]],[[2]]], schema=schema)
def throw_none():
def _throw_none(x):
if x[0][0] == 1:
return [['I AM ONE']]
else:
return x
return udf(_throw_none, schema)
# Since the value col only accepts IntegerType, it will yield null for the
# string "I AM ONE" in the first row. But Spark does not throw a ValueError
# this time! This is where the data/schema type mismatch happens.
df = df.select(throw_none()(col("value")).name('value'))
df.show()
Then, the following parquet write and read will throw you the parquet.io.ParquetDecodingException error.
df.write.parquet("tmp")
spark.read.parquet("tmp").collect()
So be very careful with null values if you are using a udf: return the right data type from your udf. And unless it is really necessary, please don't set nullable=False in your StructField. Setting nullable=True will solve the problem.
One more way to catch a possible discrepancy is to eyeball the difference in the schemata of the Parquet files produced by the two sources, say Hive and Spark. You can dump a schema with parquet-tools (brew install parquet-tools on macOS):
λ $ parquet-tools schema /usr/local/Cellar/apache-drill/1.16.0/libexec/sample-data/nation.parquet
message root {
required int64 N_NATIONKEY;
required binary N_NAME (UTF8);
required int64 N_REGIONKEY;
required binary N_COMMENT (UTF8);
}
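To compare this with what Spark itself reads back, you can print the schema of the same file (a Java sketch, assuming an existing SparkSession named spark):

spark.read()
     .parquet("/usr/local/Cellar/apache-drill/1.16.0/libexec/sample-data/nation.parquet")
     .printSchema();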
I had a similar error; in my case I was missing the default constructor.
Are you able to use Avro instead of Parquet to store your Hive table? I ran into this issue because I was using Hive's Decimal datatype, and Parquet from Spark doesn't play nice with Decimal. If you post your table schema and some data samples, debugging will be easier.
Another possible option, from the DataBricks Forum, is to use a Double instead of a Decimal, but that was not an option for my data so I can't report on whether it works.
I have two different RDDs; each RDD has some common fields, and based on those fields I want to get the unmatched records from RDD1 or RDD2 [records available in RDD1 but not in RDD2; records available in RDD2 but not in RDD1].
It seems we could use subtract or subtractByKey.
Sample Input:
**File 1:**
sam,23,cricket
alex,34,football
ann,21,football
**File 2:**
ruby,25,football,usa
alex,34,cricket,usa
ann,21,cricket,usa
**expected output:**
sam,23,cricket
Update:
Currently I am using Spark SQL to get the unmatched records from the RDDs (writing a query to get the unmatched records).
What I am looking for is whether we can do this with Spark Core itself instead of Spark SQL. I am not looking for the code, just whether there is any operation available in Spark Core.
Please advise on this.
Regards,
Shankar.
You could bring both RDDs to the same shape and use subtract to remove the common elements.
Given rdd1 from file1 and rdd2 from file2 as presented above, you could do something like:
val userScore2 = rdd2.map{case (name, score, sport, country) => (name, score, sport)}
val in1andNotin2 = rdd1 subtract userScore2
val in2andNotIn1 = userScore2 subtract rdd1
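If you are working in Java rather than Scala, the same idea applies to JavaRDDs of the raw CSV lines (a sketch; rdd1 and rdd2 are assumed to be JavaRDD<String> loaded from the two files):

// Drop the extra country column from file 2 so both sides have the same shape.
JavaRDD<String> userScore2 = rdd2.map(line -> {
    String[] f = line.split(",");
    return f[0] + "," + f[1] + "," + f[2];
});

JavaRDD<String> in1AndNotIn2 = rdd1.subtract(userScore2);
JavaRDD<String> in2AndNotIn1 = userScore2.subtract(rdd1);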
Is there an open-source, file-based (NOT in-memory) JDBC driver for CSV files? My CSVs are dynamically generated from the UI according to the user's selections, and each user will have a different CSV file. I'm doing this to reduce database hits, since the information is contained in the CSV file. I only need to perform SELECT operations.
HSQLDB allows indexed searches if we specify an index, but I won't be able to provide a unique column that can be used as an index, so it does its SQL operations in memory.
Edit:
I've tried CsvJdbc, but it doesn't support simple operations like ORDER BY and GROUP BY. It is also unclear whether it reads from the file or loads it into memory.
I've tried xlSQL, but that again relies on HSQLDB and only works with Excel, not CSV. Plus, it is no longer developed or supported.
I've looked at H2, but it only reads CSV and doesn't support SQL over it.
You can solve this problem using the H2 database.
The following groovy script demonstrates:
Loading data into the database
Running a "GROUP BY" and "ORDER BY" sql query
Note: H2 supports in-memory databases, so you have the choice of persisting the data or not.
import groovy.sql.Sql   // the H2 driver must also be on the classpath

// Create the database
def sql = Sql.newInstance("jdbc:h2:db/csv", "user", "pass", "org.h2.Driver")
// Load CSV file
sql.execute("CREATE TABLE data (id INT PRIMARY KEY, message VARCHAR(255), score INT) AS SELECT * FROM CSVREAD('data.csv')")
// Print results
def result = sql.firstRow("SELECT message, score, count(*) FROM data GROUP BY message, score ORDER BY score")
assert result[0] == "hello world"
assert result[1] == 0
assert result[2] == 5
// Cleanup
sql.close()
Sample CSV data:
0,hello world,0
1,hello world,1
2,hello world,0
3,hello world,1
4,hello world,0
5,hello world,1
6,hello world,0
7,hello world,1
8,hello world,0
9,hello world,1
10,hello world,0
If you check out the SourceForge project csvjdbc, please report your experiences. The documentation says it is useful for importing CSV files.
Project page
This was discussed on Superuser https://superuser.com/questions/7169/querying-a-csv-file.
You can use the Text Tables feature of hsqldb: http://hsqldb.org/doc/2.0/guide/texttables-chapt.html
csvsql/gcsvsql are also possible solutions (but there is no JDBC driver, you will have to run a command line program for your query).
SQLite is another solution, but you have to import the CSV file into a database before you can query it.
Alternatively, there is commercial software such as http://www.csv-jdbc.com/ which will do what you want.
To do anything with a file you have to load it into memory at some point. What you could do is open the file and read it line by line, discarding the previous line as you read in a new one. The only downside to this approach is that access is linear. Have you thought about using something like memcache on a server, where you query an in-memory key-value store instead of dumping to a CSV file?
You can either use a specialized JDBC driver, like CsvJdbc (http://csvjdbc.sourceforge.net), or you may choose to configure a database engine such as MySQL to treat your CSV as a table and then manipulate your CSV through a standard JDBC driver.
The trade-off here is available SQL features vs. performance.
Direct access to the CSV via CsvJdbc (or similar) allows very quick operations on big data volumes, but without the ability to sort or group records using SQL commands.
The MySQL CSV engine provides a rich set of SQL features, but at the cost of performance.
So if the size of your table is relatively small, go with MySQL. However, if you need to process big files (> 100 MB) without grouping or sorting, go with CsvJdbc.
If you need both - to handle very big files and to manipulate them using SQL - the optimal course of action is to load the CSV into a normal database table (e.g. MySQL) first and then handle the data as a usual SQL table.
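For reference, a minimal CsvJdbc sketch (assuming the driver from csvjdbc.sourceforge.net is on the classpath; the directory path and the queried file name are placeholders):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class CsvQuery {
    public static void main(String[] args) throws Exception {
        Class.forName("org.relique.jdbc.csv.CsvDriver");
        // Each CSV file in the directory is exposed as a table named after the file (without the extension).
        try (Connection conn = DriverManager.getConnection("jdbc:relique:csv:/path/to/user/csv/dir");
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery("SELECT * FROM file_123")) {
            while (rs.next()) {
                System.out.println(rs.getString(1));
            }
        }
    }
}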