I created a table in HBase using:
create 'Province','ProvinceINFO'
Now I want to import my data from a TSV file into it. The file has two columns: ProvinceID (as the primary key) and ProvinceName.
I am using the following command for the import:
bin/hbase org.apache.hadoop.hbase.mapreduce.ImportTsv '-Dimporttsv.separator=,'
-Dimporttsv.columns= HBASE_ROW_KEY, ProvinceINFO:ProvinceName Province /usr/data/Province.csv
but it gives me this error:
ERROR: No columns specified. Please specify with -Dimporttsv.columns=...
Usage: importtsv -Dimporttsv.columns=a,b,c <tablename> <inputdir>
Imports the given input directory of TSV data into the specified table.
The column names of the TSV data must be specified using the -Dimporttsv.columns
option. This option takes the form of comma-separated column names, where each
column name is either a simple column family, or a columnfamily:qualifier. The special
column name HBASE_ROW_KEY is used to designate that this column should be used
as the row key for each imported record. You must specify exactly one column
to be the row key, and you must specify a column name for every column that exists in
the input data. Another special column HBASE_TS_KEY designates that this column should be
used as timestamp for each record. Unlike HBASE_ROW_KEY, HBASE_TS_KEY is optional.
You must specify at most one column as timestamp key for each imported record.
Record with invalid timestamps (blank, non-numeric) will be treated as bad record.
Note: if you use this option, then 'importtsv.timestamp' option will be ignored.
By default importtsv will load data directly into HBase. To instead generate
HFiles of data to prepare for a bulk data load, pass the option:
-Dimporttsv.bulk.output=/path/for/output
Note: if you do not use this option, then the target table must already exist in HBase
Other options that may be specified with -D include:
-Dimporttsv.skip.bad.lines=false - fail if encountering an invalid line
'-Dimporttsv.separator=|' - eg separate on pipes instead of tabs
-Dimporttsv.timestamp=currentTimeAsLong - use the specified timestamp for the import
-Dimporttsv.mapper.class=my.Mapper - A user-defined Mapper to use instead of
org.apache.hadoop.hbase.mapreduce.TsvImporterMapper
-Dmapred.job.name=jobName - use the specified mapreduce job name for the import
For performance consider the following options:
-Dmapred.map.tasks.speculative.execution=false
-Dmapred.reduce.tasks.speculative.execution=false
Maybe also try wrapping the columns argument in quotes, i.e.
bin/hbase org.apache.hadoop.hbase.mapreduce.ImportTsv -Dimporttsv.separator=','
-Dimporttsv.columns="HBASE_ROW_KEY, ProvinceINFO:ProvinceName" Province /usr/data/Province.csv
You should try something like:
bin/hbase org.apache.hadoop.hbase.mapreduce.ImportTsv -Dimporttsv.separator=','
-Dimporttsv.columns=HBASE_ROW_KEY,ProvinceINFO:ProvinceName Province /usr/data/Province.csv
Try to remove the spaces in -Dimporttsv.columns=a,b,c.
Can I generate a Liquibase changelog from a DB based on a table name prefix?
Example:
If I have a DB schema and it has the following tables:
abc
abcd
abcdef
xyz
I just want to generate a changelog for the tables starting with "abc", i.e. for abc, abcd, and abcdef.
Can someone tell me if there's a way to do this?
It's possible with Maven or the Liquibase command line if you're using Liquibase version 3.3.2 or later.
Take a look at the release notes:
Liquibase 3.3.2 is officially released. It is primarily a bugfix
release, but has one major new feature:
diffChangeLog/generateChangeLog object filtering.
includeObjects/excludeObjects logic
You can now set an includeObjects or excludeObjects parameter on the
command line or Ant. For Maven, the parameters are diffExcludeObjects
and diffIncludeObjects. The format for these parameters is:
An object name (actually a regexp) will match any object whose name matches the regexp.
A type:name syntax that matches the regexp name for objects of the given type
If you want multiple expressions, comma separate them
The type:name logic will be applied to the tables containing columns, indexes, etc.
NOTE: name comparison is case sensitive. If you want insensitive
logic, use the (?i) regexp flag.
Example Filters:
“table_name” will match a table called “table_name” but not “other_table” or “TABLE_NAME”
“(?i)table_name” will match a table called “table_name” and “TABLE_NAME”
“table_name” will match all columns in the table table_name
“table:table_name” will match a table called table_name but not a column named table_name
“table:table_name, column:*._lock” will match a table called table_name and all columns that end with “_lock”
So try using the excludeObjects or includeObjects parameter with the generateChangeLog command.
UPDATE
I've used the Liquibase command line, and this command does the trick (for a MySQL database):
liquibase
--changeLogFile=change.xml
--username=username
--password=password
--driver=com.mysql.cj.jdbc.Driver
--url=jdbc:mysql://localhost:3306/mydatabase
--classpath=mysql-connector-java-8.0.18.jar
--includeObjects="table:abc.*"
generateChangeLog
This works for me on Windows 10:
liquibase.properties:
changeLogFile=dbchangelog.xml
classpath=C:/Program\ Files/liquibase/lib/mysql-connector-java-8.0.20.jar
driver=com.mysql.cj.jdbc.Driver
url=jdbc:mysql://localhost:3306/liquibase?serverTimezone=UTC
username=root
password=password
schemas=liquibase
includeSchema=true
includeTablespace=true
includeObjects=table:persons
C:\Users\username\Desktop>liquibase generateChangeLog
Liquibase Community 4.0.0 by Datical
Starting Liquibase at 11:34:35 (version 4.0.0 #19 built at 2020-07-13 19:45+0000)
Liquibase command 'generateChangeLog' was executed successfully.
You can download mysql-connector here, find the generateChangeLog documentation here and more information on includeObjects here.
I am reading a CSV file which has a duplicate column.
I want to preserve the column's name in the DataFrame.
I tried setting spark.sql.caseSensitive to true in my SparkContext conf, but unfortunately it has no effect.
The duplicate column name is NU_CPTE. Spark renames the duplicates by appending the column index, 0 and 7:
NU_CPTE0|CD_EVT_FINANCIER|TYP_MVT_ELTR|DT_OPERN_CLI|LI_MVT_ELTR| MT_OPERN_FINC|FLSENS|NU_CPTE7
SparkSession spark = SparkSession
.builder()
.master("local[2]")
.appName("Application Test")
.getOrCreate();
spark.sparkContext().getConf().set("spark.sql.caseSensitive","true");
Dataset<Row> df = spark.read().option("header", "true").option("delimiter", ";").csv("FILE_201701.csv");
df.show(10);
I want something like this as result:
NU_CPTE|CD_EVT_FINANCIER|TYP_MVT_ELTR|DT_OPERN_CLI|LI_MVT_ELTR| MT_OPERN_FINC|FLSENS|NU_CPTE
Spark was fixed to allow duplicate column names by appending a number to them; that is why you are getting the numbers appended to the duplicate column names. Please see the following link:
https://issues.apache.org/jira/browse/SPARK-16896
The way you're trying to set the caseSensitive property will indeed be ineffective. Try replacing:
spark.sparkContext().getConf().set("spark.sql.caseSensitive","true");
with:
spark.sql("set spark.sql.caseSensitive=true");
However, this still assumes your original columns have some sort of difference in casing. If they have the same casing, they will still be identical and will be suffixed with the column number.
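If you prefer the setting to be in place before the session is created, you can also pass it through the builder. A minimal sketch based on the code above (with the caveat just mentioned: duplicates with exactly the same casing will still be suffixed):

// Set spark.sql.caseSensitive when building the session instead of
// mutating the conf afterwards, which does not affect SQL settings.
SparkSession spark = SparkSession
        .builder()
        .master("local[2]")
        .appName("Application Test")
        .config("spark.sql.caseSensitive", "true")
        .getOrCreate();

Dataset<Row> df = spark.read()
        .option("header", "true")
        .option("delimiter", ";")
        .csv("FILE_201701.csv");
df.show(10);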
I am converting geodata (coordinates, attributes, ...) to a DXF file.
I write the attributes into extended data, but group code 1001 must contain an application name. I tried writing "Test" and some other words there, but nothing works.
I receive the error message:
Invalid application name in 1001 group on line 50.
What is the application name in this context, and where do I get it?
You are correct that DXF group 1001 should contain the Application ID of the Extended Entity Data (xData) attached to your entity.
This application ID may be an arbitrary name which fulfils the requirements of a symbol table name (these are documented as part of the AutoLISP snvalid function). When specifying an Application ID, you should try to ensure that it is unique, and you should AVOID using ACAD, as this is reserved and used internally by AutoCAD.
The key point that is causing your file to fail to be parsed is that every Application ID referenced by xData within the file must also appear as a symbol table name within the APPID symbol table.
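For illustration, here is a minimal, R12-style fragment using the hypothetical application name GEODATA. First, the entry that must exist in the APPID symbol table of the TABLES section:

0
TABLE
2
APPID
70
1
0
APPID
2
GEODATA
70
0
0
ENDTAB

Then the xData attached to the entity can reference the same name:

1001
GEODATA
1000
example attribute value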
LOAD DATA INFILE '/testing.csv'
IGNORE INTO TABLE Test_table FIELDS TERMINATED BY ',' LINES TERMINATED BY '\n'
IGNORE 1 ROWS (#col1)
SET test_ID="100",
test_reg_ID ="26003",
VALUE =#col1;
testing.csv:
TABLE Test_table
Question 1: How can I convert this into jOOQ 3.6?
Question 2: I want to skip the empty data that falls on row 1.
jOOQ has a CSV import API for that purpose. Here's how you'd translate that MySQL command to jOOQ:
DSL.using(configuration)
.loadInto(TEST_TABLE)
.loadCSV(new File("/testing.csv"))
.fields(TEST_TABLE.VALUE)
.separator(',')
// Available in jOOQ 3.10 only: https://github.com/jOOQ/jOOQ/issues/5737
// .lineSeparator("\n")
.ignoreRows(1)
.execute();
Note that jOOQ's Loader API doesn't support those default expressions as MySQL does (see #5740):
SET test_ID="100",
test_reg_ID ="26003"
There are a few workarounds:
You could patch the CSV data and prepend those columns before loading them.
You could use DSLContext.fetchFromCSV() and then use stream().map() to prepend the missing data, before using the alternative Record import API rather than the suggested CSV import API.
You could run a simple UPDATE statement right after the import for this data (a sketch follows this list).
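For the third workaround, a sketch of such an UPDATE in jOOQ, assuming generated columns named TEST_ID and TEST_REG_ID that correspond to the MySQL statement above (adjust the names to your generated code):

// Fill in the constant values for the rows that were just imported.
DSL.using(configuration)
   .update(TEST_TABLE)
   .set(TEST_TABLE.TEST_ID, "100")
   .set(TEST_TABLE.TEST_REG_ID, "26003")
   .where(TEST_TABLE.TEST_ID.isNull()) // only touch rows that still lack the value
   .execute();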
A note on performance
Do note that jOOQ's loader API can be fine-tuned by specifying bulk, batch, and commit sizes. Nevertheless, the database's out-of-the-box import implementation is very likely to still be much faster than any client side import that has to go through JDBC.
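As a rough illustration of those knobs (method availability depends on your jOOQ version; bulkAfter() and batchAfter() are only present in more recent releases):

DSL.using(configuration)
   .loadInto(TEST_TABLE)
   .bulkAfter(500)    // rows per bulk INSERT statement
   .batchAfter(10)    // bulk statements per JDBC batch
   .commitAfter(100)  // batches per commit
   .loadCSV(new File("/testing.csv"))
   .fields(TEST_TABLE.VALUE)
   .separator(',')
   .ignoreRows(1)
   .execute();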
I have data available to me in CSV files. Each CSV is different from another, i.e. the column names differ. For example, in FileA the unique identifier is called ID but in FileB it is called UID. Similarly, in FileA the amount is called AMT but in FileB it is called CUST_AMT. The meaning is the same but the column names are different.
I want to create a general solution for saving this varying data from CSV files into a DB table. The solution must take into consideration additional formats that may become available in future.
Is there a best approach for such a scenario?
There are many solutions to this problem. But I think the easiest might be to generate a mapping from each input file format to a combined row format. You could create a configuration file that has column name to database field name mappings, and create a program that, given a CSV and a mapping file, can insert all the data into the database.
However, you would still have to alter the table for every new column you want to add.
More design work would require more details on how the data will be used after it enters the database.
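As a rough sketch of the mapping-file idea above (class, file, and method names are only examples; ID/AMT and UID/CUST_AMT are taken from the question), each input format could ship with a small properties file such as "ID=id, AMT=amount" for FileA and "UID=id, CUST_AMT=amount" for FileB, and the loader could translate headers like this:

import java.io.FileReader;
import java.util.Properties;

public class HeaderMapper {

    // Translate the CSV header names of one input format into the
    // database column names, using a per-format mapping file.
    public static String[] mapHeader(String mappingFile, String[] csvHeader) throws Exception {
        Properties mapping = new Properties();
        try (FileReader reader = new FileReader(mappingFile)) {
            mapping.load(reader);
        }
        String[] dbColumns = new String[csvHeader.length];
        for (int i = 0; i < csvHeader.length; i++) {
            // Fall back to the original name if no mapping is defined
            dbColumns[i] = mapping.getProperty(csvHeader[i], csvHeader[i]);
        }
        return dbColumns;
    }
}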
I can think of the "Chain of Responsibility" pattern at the start of the processing: read the header and let the chain of responsibility select the appropriate parser for that file.
The code could look like this:
interface Parser {
    // Returns true if this parser recognizes this format.
    boolean accept(String fileHeader);

    // Each parser can convert a line in the file into insert parameters to be
    // used with a PreparedStatement.
    Object[] getInsertParameters(String row);
}
This allows you to add new file formats by adding a new Parser object to the chain.
You would first initialize the Chain as follows:
List<Parser> parserChain = new ArrayList<Parser>();
parserChain.add(new ParserImplA());
parserChain.add(new ParserImplB());
// ... add further Parser implementations here
Then you will use it as follows:
// read the header row from file
Parser getParser(String header) {
    for (Parser parser : parserChain) {
        if (parser.accept(header)) {
            return parser;
        }
    }
    // Unchecked exception so callers don't have to declare it
    throw new IllegalArgumentException("Unrecognized format!");
}
Then you can create a PreparedStatement for inserting a row into the table. Processing each row of the file means binding the values returned by parser.getInsertParameters(row) to that statement and executing it; note that PreparedStatement.execute() does not accept the parameter array directly.
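A minimal sketch of that per-row step, assuming the INSERT statement's placeholders are in the same order as the array returned by the parser:

Object[] params = parser.getInsertParameters(row);
for (int i = 0; i < params.length; i++) {
    // JDBC parameter indexes are 1-based
    preparedStatement.setObject(i + 1, params[i]);
}
preparedStatement.executeUpdate();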