I want to connect Hive to Elasticsearch. I followed the instructions from here.
I did the following steps:
1. start-dfs.sh
2. start-yarn.sh
3. launch elasticsearch
4. launch kibana
5. launch hive
Inside Hive:
a- create a database
b- create a table
c- load data into the table (LOAD DATA LOCAL INPATH '/home/myuser/Documents/datacsv/myfile.csv' OVERWRITE INTO TABLE students; )
d- add jar /home/myuser/elasticsearch-hadoop-7.10.1/dist/elasticsearch-hadoop-hive-7.10.1.jar
e- create a table for Elastic.
create table students_es (
  stt int not null,
  mahocvien varchar(10),
  tenho string,
  ten string,
  namsinh date,
  gioitinh string,
  noisinh string,
  namvaodang date,
  trinhdochuyenmon string,
  hesoluong float,
  phucaptrachnhiem float,
  chucvudct string,
  chucdqh string,
  dienuutien int,
  ghichu int)
STORED BY 'org.elasticsearch.hadoop.hive.EsStorageHandler'
TBLPROPERTIES('es.nodes' = '127.0.0.1', 'es.port' = '9201', 'es.resource' = 'students/student');
f- insert overwrite table students_es select * from students;
Then the error I got was the following:
FAILED: Execution Error, return code -101 from org.apache.hadoop.hive.ql.exec.mr.MapRedTask. org/apache/commons/httpclient/protocol/ProtocolSocketFactory
I used the following components:
kibana: 7.10.1
hive : 3.1.2
hadoop: 3.1.2
I finally found how to solve it.
You need to download the jar file commons-httpclient-3.1.jar and put it into your Hive lib directory.
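For reference, the session-level equivalent of steps d and f with the extra jar registered looks roughly like this; whether ADD JAR alone is enough for the MapReduce task is not guaranteed (the lib-directory copy is the confirmed fix), and the download location of the httpclient jar is an assumption:

-- register the ES-Hive connector and the missing HTTP client classes for this session
ADD JAR /home/myuser/elasticsearch-hadoop-7.10.1/dist/elasticsearch-hadoop-hive-7.10.1.jar;
ADD JAR /home/myuser/commons-httpclient-3.1.jar;  -- download location is an assumption
-- re-run the load into the Elasticsearch-backed table
INSERT OVERWRITE TABLE students_es SELECT * FROM students;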
I need to create a Hive table on a Unicode-delimited file (the delimiter is the Unicode character "\uFFFD", the replacement character).
To do this we are submitting Hive jobs to the cluster.
Tried with LazySimpleSerDe using ROW FORMAT DELIMITED:
gcloud dataproc jobs submit hive --cluster --region \
  --execute "CREATE EXTERNAL TABLE hiveuni_test_01(
    codes string, telephone_num string, finding_name string, given_name string,
    alt_finding_name string, house_num string, street_name string, locality string,
    state string, reserved string, zip_code string, directive_text string,
    special_listing_text string, id string, latitude string, longitude string,
    rboc_sent_date string)
  ROW FORMAT DELIMITED FIELDS TERMINATED BY '\uFFFD'
  LINES TERMINATED BY '\n'
  STORED AS TEXTFILE
  LOCATION 'gs://hive-idaas-dev-warehouse/datasets/unicode_file';"
But this does not create the table correctly; the entire row is put into the first column only.
We are using a Cloud SQL MySQL server as the Hive metastore and have checked that MySQL uses UTF-8 encoding as well.
Tried with MultiDelimitSerDe:
gcloud dataproc jobs submit hive --cluster dev-sm-35cb3516-ed82-4ec2-bf0d-89bd7e0e60f0 --region us-central1 \
  --jars gs://hive-idaas-dev-warehouse/hive-jar/hive-contrib-0.14.0.jar \
  --execute "CREATE EXTERNAL TABLE hiveuni_test_05 (
    codes string, telephone_num string, finding_name string, given_name string,
    alt_finding_name string, house_num string, street_name string, locality string,
    state string, reserved string, zip_code string, directive_text string,
    special_listing_text string, id string, latitude string, longitude string,
    rboc_sent_date string)
  ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.MultiDelimitSerDe'
  WITH SERDEPROPERTIES ('field.delim'='\uFFFD')
  STORED AS INPUTFORMAT 'org.apache.hadoop.mapred.TextInputFormat'
  OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
  LOCATION 'gs://hive-idaas-dev-warehouse/datasets/unicode_file';"
This gives an exception - java.lang.ClassNotFoundException: Class org.apache.hadoop.hive.serde2.MultiDelimitSerDe not found
I have added an initialization script that runs during cluster startup and places hive-contrib-0.14.0.jar, containing the class org.apache.hadoop.hive.serde2.MultiDelimitSerDe, in /usr/lib/hadoop/lib/. I can see that the jar is in that folder after SSHing into the cluster.
Is there a way for the Hive client to read Unicode delimiter characters while creating the table, and why do I still get a ClassNotFoundException even after placing the jar in the Hadoop lib directory?
hive-contrib-0.14.0 does not contain org.apache.hadoop.hive.serde2.MultiDelimitSerDe. Instead, the fully qualified class name is org.apache.hadoop.hive.contrib.serde2.MultiDelimitSerDe. Notice the extra contrib in the package name.
So change your query to use the correct fully qualified class name and see if that solves the issue. You probably don't have to explicitly add the hive-contrib jar; it should already be under /usr/lib/hive/lib.
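For example, here is the query from the second attempt with only the SerDe class name changed; everything else, including the --jars flag on the gcloud command, can stay the same:

CREATE EXTERNAL TABLE hiveuni_test_05 (
  codes string, telephone_num string, finding_name string, given_name string,
  alt_finding_name string, house_num string, street_name string, locality string,
  state string, reserved string, zip_code string, directive_text string,
  special_listing_text string, id string, latitude string, longitude string,
  rboc_sent_date string)
ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.MultiDelimitSerDe'
WITH SERDEPROPERTIES ('field.delim'='\uFFFD')
STORED AS INPUTFORMAT 'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION 'gs://hive-idaas-dev-warehouse/datasets/unicode_file';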
HIVE-20020 and HIVE-20619 were done in Hive 4.0; since you are using Dataproc, which does not ship Hive 4.0 yet, they don't apply here.
I am running a Spark job on a multi-node cluster and trying to insert (append) a DataFrame into an external Hive table that is partitioned by two columns, dt and hr.
dataframe.write().insertInto(hiveTable);
Hive table structure is as below:
CREATE EXTERNAL TABLE `database.hiveTable`(
`col1` string,
`col2` string,
`col3_json` string
)
PARTITIONED BY (
`dt` string,
`hr` string)
ROW FORMAT SERDE
'org.apache.hadoop.hive.ql.io.orc.OrcSerde'
STORED AS INPUTFORMAT
'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat'
LOCATION '/data/hdfs/tmp/test';
Note: the col3_json column will hold JSON string data like:
{"group":[{"action":"Change","gid":"111","isId":"Y"},{"action":"Add","gid":"111","isId":"Y"},{"action":"Delete","gid":"111","isId":"N"}]}
The data is inserted successfully when the table is not partitioned.
But the insert into the partitioned table above throws the error below:
org.apache.spark.SparkException: Task failed while writing rows.
at org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:285)
.
.
.
Caused by: org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.protocol.FSLimitException$PathComponentTooLongException): The maximum path component name limit of hr=%7B%22group%22%7B%22action%22%3A%22Change%22,%22gid%22%22,%22isId%22%3A%22Y%22},%7B%22action%22Add%22,%22gid%111%22,%22isId%%22},%7B%22action%22%3A%22Delete%22,%22gid%2524%22,%22isId%22N%22}%5D} in directory /data/hdfs/tmp/test/.hive-staging_hive_2020-01-04_00-27-05_24_76879687968796-1/-ext-10000/_temporary/0/_temporary/attempt_20200104002705_0027_m_000000_0/dt=N is exceeded: limit=255 length=399
at org.apache.hadoop.hdfs.server.namenode.FSDirectory.verifyMaxComponentLength(FSDirectory.java:1113)
I notice that the error contains a few strings that are present in the JSON data, like group, Change, gid, etc.
Not sure if this is related to the JSON data being inserted into col3_json.
Please suggest.
When I try to copy a table to cassandra using the command:
copy images from 'images.csv'
I get the error:
'PicklingError: Can't pickle <class 'cqlshlib.copyutil.ImmutableDict'>: attribute lookup cqlshlib.copyutil.ImmutableDict failed'
I have successfully imported all of my other tables, but this one is not working. The only difference with this one is that it contains large binary blobs for images.
Here is a sample row from the csv file:
b267ba01-5420-4be5-b962-7e563dc245b0,,0x89504e...[large binary blob]...426082,0,7e700538-cce3-495f-bfd2-6a4fa968bdf6,pentium_e6600,01fa819e-3425-47ca-82aa-a3eec319a998,0,7e700538-cce3-495f-bfd2-6a4fa968bdf6,,,png,0
And here is the file that causes the error:
https://www.dropbox.com/s/5mrl6nuwelpf3lz/images.csv?dl=0
Here is my schema:
CREATE TABLE dealtech.images (
id uuid PRIMARY KEY,
attributes map<text, text>,
data blob,
height int,
item_id uuid,
name text,
product_id uuid,
scale double,
seller_id uuid,
text_bottom int,
text_top int,
type text,
width int
)
The tables were exported using cassandra 2.x and I am currently using cassandra 3.0.9 to import them.
I ran into this same issue with apache cassandra 3.9, although my datasets were fairly small (46 rows in one table, 262 rows in another table).
PicklingError: Can't pickle <class 'cqlshlib.copyutil.link'>: attribute lookup cqlshlib.copyutil.link failed
PicklingError: Can't pickle <class 'cqlshlib.copyutil.attribute'>: attribute lookup cqlshlib.copyutil.attribute failed
Where link and attribute are types I defined.
The COPY commands were part of a .cql script that was being run inside a Docker container as part of its setup process.
I read in a few places that people were seeing this PicklingError on Windows (it seemed to be related to NTFS), but the Docker container in this case was using Alpine Linux.
The fix was to add these options to the end of my COPY commands:
WITH MINBATCHSIZE=1 AND MAXBATCHSIZE=1 AND PAGESIZE=10;
http://docs.datastax.com/en/cql/3.3/cql/cql_reference/cqlshCopy.html
I was not seeing the PicklingError when running these .cql scripts containing COPY commands locally, so it seems to be an issue that only rears its head in a low-memory situation.
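Applied to the COPY command from the question above, that would look like:

COPY images FROM 'images.csv' WITH MINBATCHSIZE=1 AND MAXBATCHSIZE=1 AND PAGESIZE=10;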
Related issues:
Pickling Error running COPY command: CQLShell on Windows
Cassandra multiprocessing can't pickle _thread.lock objects
Working with apache-hive-0.13.1.
While creating a table, Hive throws an error as below:
FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. Cannot validate serde: com.cloudera.hive.serde.JSONSerDe
The table structure is:
create external table tweets(id BigInt, created_at String, scource String, favorited Boolean, retweet_count int,
retweeted_status Struct <
text:String,user:Struct<
screen_name:String, name:String>>,
entities Struct<
urls:Array<Struct<
expanded_url:String>>,
user_mentions:Array<Struct<
screen_name:String,
name:String>>,
hashtags:Array<Struct<text:String>>>,
text String,
user Struct<
screen_name:String,
name:String,
friends_count:int,
followers_count:int,
statuses_count:int,
verified:boolean,
utc_offset:int,
time_zone:String> ,
in_reply_to_screen_name String)
partitioned by (datehour int)
ROW FORMAT SERDE 'com.cloudera.hive.serde.JSONSerDe'
location '/home/edureka/sachinG'
I added json-serde-1.3.6-SNAPSHOT-jar-with-dependencies.jar to the classpath to resolve the issue, but with no success.
Finally, I got a solution for this. The issue is with json-serde-1.3.6-SNAPSHOT-jar-with-dependencies.jar.
Different distributions (Cloudera, Azure, etc.) need different JSON SerDe jar files; that is, the SerDe jar should be compatible with your distribution.
I changed the jar and it worked for me.
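As a rough sketch of what that change looks like inside Hive: the jar path below is an assumption, and the SerDe class name depends on which build you pick (the openx build of json-serde 1.3.x, for instance, ships org.openx.data.jsonserde.JsonSerDe rather than com.cloudera.hive.serde.JSONSerDe).

-- register the JSON SerDe jar that matches your distribution (path is an assumption)
ADD JAR /home/edureka/json-serde-1.3.6-SNAPSHOT-jar-with-dependencies.jar;
-- then, in the CREATE TABLE above, point ROW FORMAT SERDE at the class that actually
-- ships in that jar, for example:
--   ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'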
I faced a similar issue while working with Hive 1.2.1 and HBase 0.98. I followed the steps below and the issue was resolved.
1) Copied all the hbase-* files from hbase/lib location to hive/lib directory
2) Verified that the hive-hbase-handler-1.2.1.jar was present in hive/lib
3) Verified that hive-serde-1.2.1.jar was present in hive/lib
4) Verified that zookeeper-3.4.6.jar was present in hive/lib (if not, copy it from hbase/lib to hive/lib)
5) In hive-site.xml (if not present, use hive-default.xml.template), located in hive/conf, set both
a) the hive.aux.jars.path property and
b) the hive.added.jars.path property
to the path '/usr/local/hive/lib/'.
6) Open the Hive terminal and create the table using the command below:
CREATE TABLE emp_hive (
RowKey_HIVE String,
Employee_No int,
Employee_Name String,
Job String,
Mgr int,
Hiredate String,
Salary int,
Commision int,
Department_No int
)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ("hbase.columns.mapping" =":key,details:Employee_No_hbase,details:Employee_Name_hbase,details:job_hbase,details:Mgr_hbase,details:hiredate_hbase,details:salary_hbase,details:commision_hbase,details:department_no_hbase")
TBLPROPERTIES("hbase.table.name"="emp_hbase");
CREATE TABLE data.banks (
    id text,
    codes frozen<map<text, text>>,
    PRIMARY KEY (id, codes));
I added a corresponding model class with the @Frozen("map<text, text>") annotation on the codes field.
The insert goes in properly, but when I open cqlsh and run
select * from data.banks I get the following error:
Traceback (most recent call last):
File "/usr/bin/cqlsh", line 1078, in perform_simple_statement
rows = self.session.execute(statement, trace=self.tracing_enabled)
File "/usr/share/cassandra/lib/cassandra-driver-internal-only-2.6.0c2.post.zip/cassandra-driver-2.6.0c2.post/cassandra/cluster.py", line 1594, in execute
result = future.result(timeout)
File "/usr/share/cassandra/lib/cassandra-driver-internal-only-2.6.0c2.post.zip/cassandra-driver-2.6.0c2.post/cassandra/cluster.py", line 3296, in result
raise self._final_exception
error: unpack requires a string argument of length 4
One more problem: when I add a row with the values ('1', {'code2':'435sdfd','code1':'2132sd'}), it shows one row inserted. But when I add another row with ('1', {'code2':'435sdfe','code1':'2132sd'}),
it throws a TimedOut exception.
Using Cassandra 2.1.8, cassandra-driver-mapping 2.1.8, and kundera-cassandra-pelops 3.0.
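For reference, the two inserts described above correspond to CQL along these lines (values copied from the description):

INSERT INTO data.banks (id, codes) VALUES ('1', {'code2':'435sdfd', 'code1':'2132sd'});
-- second insert, same partition key, slightly different map value
INSERT INTO data.banks (id, codes) VALUES ('1', {'code2':'435sdfe', 'code1':'2132sd'});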