PicklingError when copying a very large cassandra table using cqlsh - java

When I try to copy a table to cassandra using the command:
copy images from 'images.csv'
I get the error:
'PicklingError: Can't pickle <class 'cqlshlib.copyutil.ImmutableDict'>: attribute lookup cqlshlib.copyutil.ImmutableDict failed'
I have successfully imported all of my other tables, but this one is not working. The only difference with this one is that it contains large binary blobs for images.
Here is a sample row from the csv file:
b267ba01-5420-4be5-b962-7e563dc245b0,,0x89504e...[large binary blob]...426082,0,7e700538-cce3-495f-bfd2-6a4fa968bdf6,pentium_e6600,01fa819e-3425-47ca-82aa-a3eec319a998,0,7e700538-cce3-495f-bfd2-6a4fa968bdf6,,,png,0
And here is the file that causes the error:
https://www.dropbox.com/s/5mrl6nuwelpf3lz/images.csv?dl=0
Here is my schema:
CREATE TABLE dealtech.images (
id uuid PRIMARY KEY,
attributes map<text, text>,
data blob,
height int,
item_id uuid,
name text,
product_id uuid,
scale double,
seller_id uuid,
text_bottom int,
text_top int,
type text,
width int
)
The tables were exported using cassandra 2.x and I am currently using cassandra 3.0.9 to import them.

I ran into this same issue with apache cassandra 3.9, although my datasets were fairly small (46 rows in one table, 262 rows in another table).
PicklingError: Can't pickle <class 'cqlshlib.copyutil.link'>: attribute lookup cqlshlib.copyutil.link failed
PicklingError: Can't pickle <class 'cqlshlib.copyutil.attribute'>: attribute lookup cqlshlib.copyutil.attribute failed
Where link and attribute are types I defined.
The COPY commands were part of a .cql script that was being run inside a Docker container as part of its setup process.
I read in a few places that people were seeing this PicklingError on Windows (it seemed to be related to NTFS), but the Docker container in this case was using Alpine Linux.
The fix was to add these options to the end of my COPY commands:
WITH MINBATCHSIZE=1 AND MAXBATCHSIZE=1 AND PAGESIZE=10;
http://docs.datastax.com/en/cql/3.3/cql/cql_reference/cqlshCopy.html
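Applied to a COPY command like the one in the question, the full command would look something like this (the batch and page sizes are just the values that worked here and may need tuning for very large blobs):
COPY images FROM 'images.csv' WITH MINBATCHSIZE=1 AND MAXBATCHSIZE=1 AND PAGESIZE=10;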
I was not seeing the PicklingError when running these .cql scripts containing COPY commands locally, so it seems to be an issue that only rears its head in a low-memory situation.
Related issues:
Pickling Error running COPY command: CQLShell on Windows
Cassandra multiprocessing can't pickle _thread.lock objects

Related

creating hive table using gcloud dataproc not working for unicode delimiter

I need to create a Hive table on a Unicode-delimited file (the Unicode character "\uFFFD", the replacement character).
To do this we are submitting Hive jobs to the cluster.
Tried with LazySimpleSerDe using ROW FORMAT DELIMITED:
gcloud dataproc jobs submit hive --cluster --region \
  --execute "CREATE EXTERNAL TABLE hiveuni_test_01(
    codes string, telephone_num string, finding_name string, given_name string,
    alt_finding_name string, house_num string, street_name string, locality string,
    state string, reserved string, zip_code string, directive_text string,
    special_listing_text string, id string, latitude string, longitude string,
    rboc_sent_date string)
  ROW FORMAT DELIMITED FIELDS TERMINATED BY '\uFFFD'
  LINES TERMINATED BY '\n'
  STORED AS TEXTFILE
  LOCATION 'gs://hive-idaas-dev-warehouse/datasets/unicode_file';"
But this does not create the table correctly; the entire row is put into the first column only.
We are using a Cloud SQL MySQL server as the Hive metastore, and have checked that MySQL also uses utf8 encoding.
Tried with MultiDelimitSerDe:
gcloud dataproc jobs submit hive --cluster dev-sm-35cb3516-ed82-4ec2-bf0d-89bd7e0e60f0 --region us-central1 \
  --jars gs://hive-idaas-dev-warehouse/hive-jar/hive-contrib-0.14.0.jar \
  --execute "CREATE EXTERNAL TABLE hiveuni_test_05 (
    codes string, telephone_num string, finding_name string, given_name string,
    alt_finding_name string, house_num string, street_name string, locality string,
    state string, reserved string, zip_code string, directive_text string,
    special_listing_text string, id string, latitude string, longitude string,
    rboc_sent_date string)
  ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.MultiDelimitSerDe'
  WITH SERDEPROPERTIES ('field.delim'='\uFFFD')
  STORED AS INPUTFORMAT 'org.apache.hadoop.mapred.TextInputFormat'
  OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
  LOCATION 'gs://hive-idaas-dev-warehouse/datasets/unicode_file';"
This gives an exception - java.lang.ClassNotFoundException: Class org.apache.hadoop.hive.serde2.MultiDelimitSerDe not found
I have added an initialization script that runs at cluster startup and places hive-contrib-0.14.0.jar, which contains the class org.apache.hadoop.hive.serde2.MultiDelimitSerDe, in /usr/lib/hadoop/lib/. I can see that the jar is placed in that folder by SSHing into the cluster.
Is there a way for the Hive client to read the Unicode delimiter while creating the table, and why do I still get a ClassNotFoundException even after placing the jar in the Hadoop lib directory?
hive-contrib-0.14.0 does not have org.apache.hadoop.hive.serde2.MultiDelimitSerDe. Instead, the fully qualified class name is org.apache.hadoop.hive.contrib.serde2.MultiDelimitSerDe. Notice the extra contrib there.
So change your query to use the correct fully qualified class name and see if it solves the issue. You probably don't have to explicitly add a hive-contrib jar; it should already be under /usr/lib/hive/lib.
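For instance, the SERDE clause from the second attempt would then read (the rest of the statement stays the same):
ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.MultiDelimitSerDe'
WITH SERDEPROPERTIES ('field.delim'='\uFFFD')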
HIVE-20020 and HIVE-20619 were done in Hive 4.0; since Dataproc does not have Hive 4.0 yet, they should not apply to your cluster.

Connection and read data from elasticsearch to hive

I want to connect Hive to Elasticsearch. I followed the instructions from here.
I do the following steps:
1. start-dfs.sh
2. start-yarn.sh
3. launch elasticsearch
4. launch kibana
5. launch hive
inside hive
a- create a database
b- create a table
c- load data into the table (LOAD DATA LOCAL INPATH '/home/myuser/Documents/datacsv/myfile.csv' OVERWRITE INTO TABLE students; )
d- add jar /home/myuser/elasticsearch-hadoop-7.10.1/dist/elasticsearch-hadoop-hive-7.10.1.jar
e- create a table for Elastic.
create table students_es (stt int not null, mahocvien varchar(10), tenho string, ten string, namsinh date, gioitinh string, noisinh string, namvaodang date, trinhdochuyenmon string, hesoluong float, phucaptrachnhiem float, chucvudct string, chucdqh string, dienuutien int, ghichu int) STORED BY 'org.elasticsearch.hadoop.hive.EsStorageHandler' TBLPROPERTIES('es.nodes' = '127.0.0.1', 'es.port' = '9201', 'es.resource' = 'students/student');
f- insert overwrite table students_es select * from students;
Then the error I got is the following
FAILED: Execution Error, return code -101 from org.apache.hadoop.hive.ql.exec.mr.MapRedTask. org/apache/commons/httpclient/protocol/ProtocolSocketFactory
I used the components
kibana: 7.10.1
hive : 3.1.2
hadoop: 3.1.2
I finally found how to solve it.
You need to download the jar file commons-httpclient-3.1.jar and put it into your Hive lib directory.
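For example, a rough sketch (the HIVE_HOME path is an assumption; adjust it to your installation):
# assuming commons-httpclient-3.1.jar has been downloaded to the current directory
cp commons-httpclient-3.1.jar "$HIVE_HOME/lib/"
# restart the Hive CLI so the jar is picked up on its classpath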

saving dataset to cassandra using java spark

I'm trying to save a dataset to a Cassandra DB using Java Spark.
I'm able to read data into a dataset successfully using the code below:
Dataset<Row> readdf = sparkSession.read().format("org.apache.spark.sql.cassandra")
.option("keyspace","dbname")
.option("table","tablename")
.load();
But when I try to write the dataset I'm getting IOException: Could not load or find table, found similar tables in keyspace
readdf.write().format("org.apache.spark.sql.cassandra")
.option("keyspace","dbname")
.option("table","tablename")
.save();
I'm setting the host and port in the SparkSession.
The thing is, I'm able to write in overwrite and append modes, but not able to create a table.
Versions which I'm using are below:
spark java 2.0
spark cassandra connector 2.3
Tried with different jar versions but nothing worked.
I have also gone through various Stack Overflow and GitHub links.
Any help is greatly appreciated.
The write operation in Spark doesn't have a mode that will automatically create a table for you; there are multiple reasons for that. One of them is that you need to define a primary key for your table, otherwise you may just overwrite data if you set an incorrect primary key. Because of this, the Spark Cassandra Connector provides a separate method to create a table based on your dataframe structure, but you need to provide a list of partition and clustering key columns. In Java it will look like the following (full code is here):
DataFrameFunctions dfFunctions = new DataFrameFunctions(dataset);
// partition key column(s) for the new table
Option<Seq<String>> partitionSeqlist = new Some<>(
    JavaConversions.asScalaBuffer(Arrays.asList("part")).seq());
// clustering key columns for the new table
Option<Seq<String>> clusteringSeqlist = new Some<>(
    JavaConversions.asScalaBuffer(Arrays.asList("clust", "col2")).seq());
CassandraConnector connector = new CassandraConnector(
    CassandraConnectorConf.apply(spark.sparkContext().getConf()));
// create table "widerows6" in keyspace "test" from the dataset's schema
dfFunctions.createCassandraTable("test", "widerows6",
    partitionSeqlist, clusteringSeqlist, connector);
and then you can write data as usual:
dataset.write()
.format("org.apache.spark.sql.cassandra")
.options(ImmutableMap.of("table", "widerows6", "keyspace", "test"))
.save();

Facing java heap space error on Hive

While trying to just copy data from one existing table to a new table with Create table clone as select * from t_table, it works just fine. On the other hand, while trying to copy data from an existing table to another existing table with Insert into table_clone select column1,col2.... from t_table, it throws a heap space error. The source tables are the same in both cases.
I have tried different sizes for the container, mapper, and reducer (mapreduce.map.java.opts -Xmx5124m, and so on), but it throws the same error every time.
A few settings are:
yarn.scheduler.minimum-allocation-mb : 4GB
yarn.scheduler.maximum-allocation-mb : 6GB
Container memory ( yarn.nodemanager.resource.memory-mb ) : 18 GB
mapreduce.map.memory.mb : 6 GB
mapreduce.reduce.memory.mb : 8 GB
mapreduce.map.java.opts : -Xmx5124m
mapreduce.reduce.java.opts : -Xmx6144m
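For reference, the same overrides can also be applied per session in the Hive shell before the failing statement, along these lines (the values mirror the settings above; the INSERT is the one described earlier):
SET mapreduce.map.memory.mb=6144;
SET mapreduce.reduce.memory.mb=8192;
SET mapreduce.map.java.opts=-Xmx5124m;
SET mapreduce.reduce.java.opts=-Xmx6144m;
INSERT INTO table_clone SELECT column1, col2, ... FROM t_table;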
I am not able to copy data from a non-partitioned table to another non-partitioned table, though the main requirement is to copy from a non-partitioned table to a partitioned table.
Here I am attaching the YARN log in parts on some file-hosting sites:
1. http://textuploader.com/522pt
2. http://textuploader.com/522pq
3. http://textuploader.com/522ph
4. http://textuploader.com/522pf
We are using Cloudera quickstart which has MapReduce2 embedded into the setup.

file (not in memory) based JDBC driver for CSV files

Is there an open-source, file-based (NOT in-memory) JDBC driver for CSV files? My CSVs are dynamically generated from the UI according to the user's selections, and each user will have a different CSV file. I'm doing this to reduce database hits, since the information is contained in the CSV file. I only need to perform SELECT operations.
HSQLDB allows for indexed searches if we specify an index, but I won't be able to provide a unique column that can be used as an index, hence it does SQL operations in memory.
Edit:
I've tried CSVJDBC, but that doesn't support simple operations like ORDER BY and GROUP BY. It is also still unclear whether it reads from the file or loads it into memory.
I've tried xlSQL, but that again relies on HSQLDB and only works with Excel, not CSV. Plus, it's no longer in development or supported.
H2, but that only reads CSV. Doesn't support SQL.
You can solve this problem using the H2 database.
The following Groovy script demonstrates:
Loading data into the database
Running a "GROUP BY" and "ORDER BY" SQL query
Note: H2 supports in-memory databases, so you have the choice of persisting the data or not.
import groovy.sql.Sql

// Create the database
def sql = Sql.newInstance("jdbc:h2:db/csv", "user", "pass", "org.h2.Driver")
// Load CSV file
sql.execute("CREATE TABLE data (id INT PRIMARY KEY, message VARCHAR(255), score INT) AS SELECT * FROM CSVREAD('data.csv')")
// Print results
def result = sql.firstRow("SELECT message, score, count(*) FROM data GROUP BY message, score ORDER BY score")
assert result[0] == "hello world"
assert result[1] == 0
assert result[2] == 5
// Cleanup
sql.close()
Sample CSV data:
0,hello world,0
1,hello world,1
2,hello world,0
3,hello world,1
4,hello world,0
5,hello world,1
6,hello world,0
7,hello world,1
8,hello world,0
9,hello world,1
10,hello world,0
If you check the SourceForge project csvjdbc, please report your experiences. The documentation says it is useful for importing CSV files.
Project page
This was discussed on Superuser https://superuser.com/questions/7169/querying-a-csv-file.
You can use the Text Tables feature of hsqldb: http://hsqldb.org/doc/2.0/guide/texttables-chapt.html
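A minimal sketch of a text table (table, column, and file names here are placeholders; note that text tables require a file-based catalog rather than an in-memory one):
CREATE TEXT TABLE data (id INT, message VARCHAR(255), score INT);
SET TABLE data SOURCE "data.csv";  -- comma is the default field separator; a different one can be set with e.g. "data.csv;fs=|"
SELECT message, COUNT(*) FROM data GROUP BY message;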
csvsql/gcsvsql are also possible solutions (but there is no JDBC driver, you will have to run a command line program for your query).
sqlite is another solution but you have to import the CSV file into a database before you can query it.
Alternatively, there is commercial software such as http://www.csv-jdbc.com/ which will do what you want.
To do anything with a file you have to load it into memory at some point. What you could do is just open the file and read it line by line, discarding the previous line as you read in a new one. The only downside to this approach is that it is linear. Have you thought about using something like memcache on a server, where you use key-value stores in memory that you can query, instead of dumping to a CSV file?
You can use either a specialized JDBC driver, like CsvJdbc (http://csvjdbc.sourceforge.net), or you may choose to configure a database engine such as MySQL to treat your CSV as a table and then manipulate your CSV through a standard JDBC driver.
The trade-off here is available SQL features vs. performance.
Direct access to CSV via CsvJdbc (or similar) will allow you very quick operations on big data volumes, but without the capability to sort or group records using SQL commands;
the MySQL CSV engine can provide a rich set of SQL features, but at the cost of performance.
So if the size of your table is relatively small, go with MySQL. However, if you need to process big files (> 100 MB) without the need for grouping or sorting, go with CsvJdbc.
If you need both, that is, to handle very big files and be able to manipulate them using SQL, then the optimal course of action is to load the CSV into a normal database table (e.g. MySQL) first and then handle the data as a usual SQL table.
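To illustrate the CsvJdbc route, here is a minimal sketch (the directory, the file name users.csv, and its columns are hypothetical; it assumes the csvjdbc jar is on the classpath):
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class CsvQuery {
    public static void main(String[] args) throws Exception {
        // Register the CsvJdbc driver; the JDBC URL points at the directory holding the CSV files
        Class.forName("org.relique.jdbc.csv.CsvDriver");
        try (Connection conn = DriverManager.getConnection("jdbc:relique:csv:/path/to/csvdir");
             Statement stmt = conn.createStatement();
             // each CSV file in the directory is exposed as a table named after the file
             ResultSet rs = stmt.executeQuery("SELECT id, name FROM users")) {
            while (rs.next()) {
                System.out.println(rs.getString("id") + " " + rs.getString("name"));
            }
        }
    }
}
Whether ORDER BY and GROUP BY behave the way you need should still be verified against the CsvJdbc version you use, given the limitation noted in the question.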
