While copying data from one existing table to a new table with Create table clone as select * from t_table, it works just fine. However, while copying data from an existing table into another existing table with Insert into table_clone select column1,col2.... from t_table, it throws a Heap space error. The source table is the same in both cases.
I have tried different sizes for the container, mapper, and reducer (mapreduce.map.java.opts -Xmx5124m, and so on), but it throws the same error every time.
A few of the settings are:
yarn.scheduler.minimum-allocation-mb : 4GB
yarn.scheduler.maximum-allocation-mb : 6GB
Container memory ( yarn.nodemanager.resource.memory-mb ) : 18 GB
mapreduce.map.memory.mb : 6 GB
mapreduce.reduce.memory.mb : 8 GB
mapreduce.map.java.opts : -Xmx5124m
mapreduce.reduce.java.opts : -Xmx6144m
I am not able to copy data from a non-partitioned table to another non-partitioned table, though the main requirement is to copy from a non-partitioned table to a partitioned table.
Here I am attaching the YARN log, split into parts, on a file hosting site:
1. http://textuploader.com/522pt
2. http://textuploader.com/522pq
3. http://textuploader.com/522ph
4. http://textuploader.com/522pf
We are using the Cloudera QuickStart VM, which has MapReduce2 embedded in the setup.
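Assuming this is Hive on MapReduce (which the container/mapper/reducer settings suggest), here is a sketch of the statements involved; table, column, and partition names are placeholders, and the dynamic-partition settings apply only to the partitioned-target case:

-- Works fine:
CREATE TABLE table_clone AS SELECT * FROM t_table;

-- Throws the heap space error:
INSERT INTO TABLE table_clone SELECT column1, col2 FROM t_table;

-- Eventual goal: non-partitioned source into a partitioned target
-- (part_col is a placeholder partition column)
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;
INSERT INTO TABLE table_clone_part PARTITION (part_col)
SELECT column1, col2, part_col FROM t_table;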
Related
I have written code for reading data from Oracle using Apache Beam with Dataflow as a Runner.
I am facing one weird error for a few tables while ingesting data.
There are various tables that have 200+ Columns with various data types such as Date, Number, and VARCHAR2(***) in Oracle.
Our requirement is to migrate all columns to BigQuery through Dataflow. When we select all the columns in the SELECT query, it gives the NullPointerException mentioned below, so we tried selecting only some of the columns in the query. In that case:
When the combined datatype size of the selected columns is less than ~46752 bytes, the pipeline runs successfully.
When it crosses this limit, it starts giving the NullPointerException.
To explain in more detail: if the limit were 2000 bytes (an assumption; the actual value we are seeing is around 46752 bytes), we would only be able to select two columns of type VARCHAR2(1000), or four columns of type VARCHAR2(500), and so on.
Note: we calculated the ~46752-byte threshold by adding columns to the query one by one and executing the code.
We are not sure whether any such limit is attached to Apache Beam's Java JDBC connector, but we face this challenge during migration whenever the selected columns cross this limit.
Please help me out if I am missing something here, or some parameter, while reading data through JdbcIO.
Below is the code snippet that gives the error; it is the entry point of our pipeline.
The code fails while reading only. In the code below I have not included the write to BigQuery, since it never gets executed because of the failure in JdbcIO.read() (cross-checked by commenting out the BigQuery logic as well).
Read Data from Oracle Code
// Read from JDBC
Pipeline p2 = Pipeline.create(
        PipelineOptionsFactory.fromArgs(args).withValidation().create());

String query2 = "SELECT Col1...Col00 FROM table WHERE rownum<=1000";

PCollection<TableRow> rows = p2.apply(JdbcIO.<TableRow>read()
        .withDataSourceConfiguration(JdbcIO.DataSourceConfiguration.create(
                "oracle.jdbc.OracleDriver", "jdbc:oracle:thin:@//localhost:1521/orcl")
            .withUsername("root")
            .withPassword("password"))
        .withQuery(query2)
        .withRowMapper(new JdbcIO.RowMapper<TableRow>() {
            @Override
            public TableRow mapRow(ResultSet resultSet) throws Exception {
                schema = getSchemaFromResultSet(resultSet);
                TableRow tableRow = new TableRow();
                List<TableFieldSchema> columnNames = schema.getFields();
                // ... populate tableRow from resultSet using columnNames ...
                return tableRow;
            }
        }));

p2.run().waitUntilFinish();
Error (only when the selected columns cross the byte limit described above)
Error message from worker: java.lang.NullPointerException
oracle.sql.converter.CharacterConverter1Byte.toUnicodeChars(CharacterConverter1Byte.java:344)
oracle.sql.CharacterSet1Byte.toCharWithReplacement(CharacterSet1Byte.java:134)
oracle.jdbc.driver.DBConversion._CHARBytesToJavaChars(DBConversion.java:964)
oracle.jdbc.driver.DBConversion.CHARBytesToJavaChars(DBConversion.java:867)
oracle.jdbc.driver.T4CVarcharAccessor.unmarshalOneRow(T4CVarcharAccessor.java:298)
oracle.jdbc.driver.T4CTTIrxd.unmarshal(T4CTTIrxd.java:934)
oracle.jdbc.driver.T4CTTIrxd.unmarshal(T4CTTIrxd.java:853)
oracle.jdbc.driver.T4C8Oall.readRXD(T4C8Oall.java:699)
oracle.jdbc.driver.T4CTTIfun.receive(T4CTTIfun.java:337)
oracle.jdbc.driver.T4CTTIfun.doRPC(T4CTTIfun.java:191)
oracle.jdbc.driver.T4C8Oall.doOALL(T4C8Oall.java:523)
oracle.jdbc.driver.T4CPreparedStatement.doOall8(T4CPreparedStatement.java:207)
oracle.jdbc.driver.T4CPreparedStatement.executeForDescribe(T4CPreparedStatement.java:863)
oracle.jdbc.driver.OracleStatement.executeMaybeDescribe(OracleStatement.java:1153)
oracle.jdbc.driver.OracleStatement.doExecuteWithTimeout(OracleStatement.java:1275)
oracle.jdbc.driver.OraclePreparedStatement.executeInternal(OraclePreparedStatement.java:3576)
oracle.jdbc.driver.OraclePreparedStatement.executeQuery(OraclePreparedStatement.java:3620)
oracle.jdbc.driver.OraclePreparedStatementWrapper.executeQuery(OraclePreparedStatementWrapper.java:1491)
org.apache.commons.dbcp2.DelegatingPreparedStatement.executeQuery(DelegatingPreparedStatement.java:122)
org.apache.commons.dbcp2.DelegatingPreparedStatement.executeQuery(DelegatingPreparedStatement.java:122)
org.apache.beam.sdk.io.jdbc.JdbcIO$ReadFn.processElement(JdbcIO.java:1381)
More Details
Oracle Version - 11g
Apache Beam SDK - Java
A few of the columns we are selecting also contain null values, but that should not cause any issues.
There is no issue with specific columns since I have tried all the possible combinations of the columns.
There is no limit on the number of selected columns as such, since I am able to read data from an Oracle table with more than 300 columns as long as the total byte size is less than 46752 bytes.
The JdbcIO read transform has a property called fetchSize that defines how much data is fetched from the database per call; the default value is 50,000. You can change it at runtime using the withFetchSize(int fetchSize) method. Docs.
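As an untested sketch of where that would go, here is the read from the question with an explicit, much smaller fetch size, so that fewer of these wide rows are buffered per database round trip (the value 100 is only an example to tune):

PCollection<TableRow> rows = p2.apply(JdbcIO.<TableRow>read()
        .withDataSourceConfiguration(JdbcIO.DataSourceConfiguration.create(
                "oracle.jdbc.OracleDriver", "jdbc:oracle:thin:@//localhost:1521/orcl")
            .withUsername("root")
            .withPassword("password"))
        .withQuery(query2)
        .withFetchSize(100) // rows buffered per database call instead of the default
        .withRowMapper(new JdbcIO.RowMapper<TableRow>() {
            @Override
            public TableRow mapRow(ResultSet resultSet) throws Exception {
                TableRow tableRow = new TableRow();
                // ... populate tableRow from resultSet as in the question ...
                return tableRow;
            }
        }));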
I just wrote a toy class to test Spark dataframe (actually Dataset since I'm using Java).
Dataset<Row> ds = spark.sql("select id,name,gender from test2.dummy where dt='2018-12-12'");
ds = ds.withColumn("dt", lit("2018-12-17"));
ds.cache();
ds.write().mode(SaveMode.Append).insertInto("test2.dummy");
//
System.out.println(ds.count());
According to my understanding, there are 2 actions here, "insertInto" and "count".
I debugged the code step by step; when running "insertInto", I see several lines like:
19/01/21 20:14:56 INFO FileScanRDD: Reading File path: hdfs://ip:9000/root/hive/warehouse/test2.db/dummy/dt=2018-12-12/000000_0, range: 0-451, partition values: [2018-12-12]
When running "count", I still see similar logs:
19/01/21 20:15:26 INFO FileScanRDD: Reading File path: hdfs://ip:9000/root/hive/warehouse/test2.db/dummy/dt=2018-12-12/000000_0, range: 0-451, partition values: [2018-12-12]
I have 2 questions:
1) When there are 2 actions on the same dataframe like above, if I don't call ds.cache or ds.persist explicitly, will the 2nd action always cause the SQL query to be re-executed?
2) If I understand the log correctly, both actions trigger HDFS file reading. Does that mean ds.cache() doesn't actually work here? If so, why doesn't it work here?
Many thanks.
It's because you append into the table that ds is created from, so ds needs to be recomputed because the underlying data has changed. In such cases, Spark invalidates the cache. See for example this Jira (https://issues.apache.org/jira/browse/SPARK-24596):
When invalidating a cache, we invalid other caches dependent on this
cache to ensure cached data is up to date. For example, when the
underlying table has been modified or the table has been dropped
itself, all caches that use this table should be invalidated or
refreshed.
Try to run the ds.count before inserting into the table.
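A sketch of that ordering, reusing the toy code from the question, with count() moved before the append so the first action materializes the cache before the table's underlying data changes:

Dataset<Row> ds = spark.sql("select id,name,gender from test2.dummy where dt='2018-12-12'");
ds = ds.withColumn("dt", lit("2018-12-17"));
ds.cache();
System.out.println(ds.count());                              // action 1: fills the cache
ds.write().mode(SaveMode.Append).insertInto("test2.dummy");  // action 2: append afterwards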
I found that the other answer doesn't work. What I had to do was break the lineage, so that the df I was writing does not know that one of its sources is the table I am writing to. To break the lineage, I created a copy of the df using
copy_of_df = sql_context.createDataFrame(df.rdd)
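In Java, as used in the question, a roughly equivalent lineage-breaking copy might look like the following sketch; copyOfDs is just a placeholder name, and the point is the round trip through the RDD plus an explicit schema:

// Rebuild the Dataset from its underlying RDD so the copy no longer carries
// the logical plan that references the target table.
Dataset<Row> copyOfDs = spark.createDataFrame(ds.toJavaRDD(), ds.schema());
copyOfDs.write().mode(SaveMode.Append).insertInto("test2.dummy");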
When I try to copy a table to cassandra using the command:
copy images from 'images.csv'
I get the error:
'PicklingError: Can't pickle <class 'cqlshlib.copyutil.ImmutableDict'>: attribute lookup cqlshlib.copyutil.ImmutableDict failed'
I have successfully imported all of my other tables, but this one is not working. The only difference with this one is that it contains large binary blobs for images.
Here is a sample row from the csv file:
b267ba01-5420-4be5-b962-7e563dc245b0,,0x89504e...[large binary blob]...426082,0,7e700538-cce3-495f-bfd2-6a4fa968bdf6,pentium_e6600,01fa819e-3425-47ca-82aa-a3eec319a998,0,7e700538-cce3-495f-bfd2-6a4fa968bdf6,,,png,0
And here is the file that causes the error:
https://www.dropbox.com/s/5mrl6nuwelpf3lz/images.csv?dl=0
Here is my schema:
CREATE TABLE dealtech.images (
id uuid PRIMARY KEY,
attributes map<text, text>,
data blob,
height int,
item_id uuid,
name text,
product_id uuid,
scale double,
seller_id uuid,
text_bottom int,
text_top int,
type text,
width int
)
The tables were exported using cassandra 2.x and I am currently using cassandra 3.0.9 to import them.
I ran into this same issue with apache cassandra 3.9, although my datasets were fairly small (46 rows in one table, 262 rows in another table).
PicklingError: Can't pickle <class 'cqlshlib.copyutil.link'>: attribute lookup cqlshlib.copyutil.link failed
PicklingError: Can't pickle <class 'cqlshlib.copyutil.attribute'>: attribute lookup cqlshlib.copyutil.attribute failed
Where link and attribute are types I defined.
The COPY commands were part of a .cql script being run inside a Docker container as part of its setup process.
I read in a few places that people were seeing this PicklingError on Windows (it seemed to be related to NTFS), but the Docker container in this case was using Alpine Linux.
The fix was to add these options to the end of my COPY commands:
WITH MINBATCHSIZE=1 AND MAXBATCHSIZE=1 AND PAGESIZE=10;
http://docs.datastax.com/en/cql/3.3/cql/cql_reference/cqlshCopy.html
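Applied to the COPY command from the question, that would look like this sketch:

COPY images FROM 'images.csv' WITH MINBATCHSIZE=1 AND MAXBATCHSIZE=1 AND PAGESIZE=10;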
I was not seeing the PicklingError when running these .cql scripts containing COPY commands locally, so it seems to be an issue that only rears its head in a low-memory situation.
Related issues:
Pickling Error running COPY command: CQLShell on Windows
Cassandra multiprocessing can't pickle _thread.lock objects
I'm trying to export data from a BigQuery table to GCS using the Java API.
The BigQuery table is in ProjectA while the GCS bucket is in ProjectB, and I have 2 different accounts (keys) to access them.
It seems there is no way in the JobConfigurationExtract object to specify the destination credentials and project details; they can only be given for the BigQuery object/table.
Is there a way to overcome this limitation? Anyone experiencing similar issues?
Code snippet
JobConfigurationExtract extract =
    new JobConfigurationExtract().setSourceTable(table).setDestinationUri(cloudStoragePath);

return bigquery
    .jobs()
    .insert(
        table.getProjectId(),
        new Job().setConfiguration(new JobConfiguration().setExtract(extract)))
    .execute();
}
Thanks!
Using the Java API, I'm trying to Put() the content of some files into HBase 1.1.x. To do so, I created a WholeFileInput class (ref: Using WholeFileInputFormat with Hadoop MapReduce still results in Mapper processing 1 line at a time) so that MapReduce reads the entire file instead of one line at a time. But unfortunately, I cannot figure out how to form my rowkey from the given filename.
Example:
Input:
file-123.txt
file-524.txt
file-9577.txt
...
file-"anotherNumber".txt
Result on my HBase table:
Row-----------------Value
123-----------------"content of 1st file"
524-----------------"content of 2nd file"
...etc
If anyone has already faced this situation, I'd appreciate help with it.
Thanks in advance.
Your rowkey can be like this:
rowkey = prefix + (filename part or full file name) + MurmurHash(fileContent)
where the prefix can be any value that falls between the pre-splits you defined at table creation time.
For example:
create 'tableName', {NAME => 'colFam', VERSIONS => 2, COMPRESSION => 'SNAPPY'},
{SPLITS => ['0','1','2','3','4','5','6','7']}
The prefix can be any random id generated within the range of the pre-splits.
This kind of rowkey will also avoid hot-spotting as the data grows, and the data will be spread across the region servers.
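As a rough, untested sketch of the mapper side: it assumes your WholeFileInput format hands the mapper <NullWritable, BytesWritable> pairs (one whole file per record), that the job writes Puts through TableOutputFormat, and that "colFam"/"value" are placeholder column family/qualifier names. The numeric part of the filename is taken from the input split and combined with a prefix like the one above; if you want the bare number as the rowkey, as in the question's example table, just drop the prefix part.

import java.io.IOException;

import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

public class FileToHBaseMapper
        extends Mapper<NullWritable, BytesWritable, ImmutableBytesWritable, Put> {

    @Override
    protected void map(NullWritable key, BytesWritable value, Context context)
            throws IOException, InterruptedException {
        // The file name comes from the split assigned to this mapper.
        String fileName = ((FileSplit) context.getInputSplit()).getPath().getName();
        // "file-123.txt" -> "123"
        String number = fileName.replaceAll("\\D+", "");

        // Prefix chosen inside the pre-split range '0'..'7' from the answer above,
        // so rows spread across regions instead of hot-spotting one of them.
        String prefix = String.valueOf((number.hashCode() & 0x7fffffff) % 8);
        byte[] rowKey = Bytes.toBytes(prefix + "-" + number);

        Put put = new Put(rowKey);
        put.addColumn(Bytes.toBytes("colFam"), Bytes.toBytes("value"), value.copyBytes());
        context.write(new ImmutableBytesWritable(rowKey), put);
    }
}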