I am trying to read a CSV file that does not actually contain comma-separated values; these are columns of NASDAQ stock data. I want to read a particular column, say the 3rd, but I do not know how to get the column items. Is there any method to read data column-wise in Hadoop? Please help.
My CSV File Format is:
exchange stock_symbol date stock_price_open stock_price_high stock_price_low stock_price_close stock_volume stock_price_adj_close
NASDAQ ABXA 12/9/2009 2.55 2.77 2.5 2.67 158500 2.67
NASDAQ ABXA 12/8/2009 2.71 2.74 2.52 2.55 131700 2.55
Edited Here:
Column A : exchange
Column B : stock_symbol
Column C : date
Column D : stock_price_open
Column E : stock_price_high
and so on.
These are columns, not comma-separated values. I need to read this file column-wise.
In Pig it will look like this:
Q1 = LOAD 'file.csv' USING PigStorage('\t') AS (exchange, stock_symbol, stock_date:chararray, stock_price_open, stock_price_high, stock_price_low, stock_price_close, stock_volume, stock_price_adj_close);
Q2 = FOREACH Q1 GENERATE stock_date;
DUMP Q2;
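If you want to stay in plain Java rather than Pig, a minimal sketch (assuming the file is tab-separated, as in the Pig example above) would split each line on tabs and collect the third field; in an actual Hadoop mapper the same per-line split applies:

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

public class ThirdColumn {
    // Returns the 3rd column (index 2) of every tab-separated line.
    static List<String> thirdColumn(BufferedReader reader) throws IOException {
        List<String> values = new ArrayList<>();
        String line;
        while ((line = reader.readLine()) != null) {
            String[] fields = line.split("\t");
            if (fields.length > 2) {
                values.add(fields[2]);   // the date column in this file
            }
        }
        return values;
    }

    public static void main(String[] args) throws IOException {
        String data = "NASDAQ\tABXA\t12/9/2009\t2.55\n"
                    + "NASDAQ\tABXA\t12/8/2009\t2.71\n";
        System.out.println(thirdColumn(new BufferedReader(new StringReader(data))));
        // prints [12/9/2009, 12/8/2009]
    }
}
```

In a real job you would wrap the file in a `BufferedReader` over an HDFS input stream instead of the `StringReader` used here for illustration.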
You can also format the Excel sheet by joining the columns into a single text value with a formula like:
=CONCATENATE(A2,";",B2,";",C2,";",D2,";",E2,";",F2,";",G2,";",H2,";",I2)
which concatenates the columns with your required separator. I have used ; here; use whatever delimiter you need.
I have one column containing the data below and want to split it into multiple columns using Java code. The problem I am facing is that the string contains double quotes with a comma inside, and that part falls into another column. I have to split the data as follows (target). Can anyone help fix this?
I/P:
Column:
abc,"test,data",valid
xyz,"sample,data",invalid
Target:
Col1|Col2|Col3
abc|"test,data"|valid
xyz|"sample,data"|invalid
I highly recommend that you use a library for this instead of doing it yourself.
Your data is in CSV format, so you should take a look at Apache Commons CSV.
You can solve your problem with simple code:
import java.io.IOException;
import org.apache.commons.csv.CSVFormat;
import org.apache.commons.csv.CSVParser;
import org.apache.commons.csv.CSVRecord;

// CSVParser.parse throws IOException
CSVParser records = CSVParser.parse("abc,\"test,data\",valid", CSVFormat.DEFAULT);
for (CSVRecord csvRecord : records) {
    for (String value : csvRecord) {
        System.out.println(value);
    }
}
Output:
abc
test,data
valid
Read more at https://www.baeldung.com/apache-commons-csv
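If pulling in a library is not an option, a small quote-aware splitter in plain Java is enough for the input shown. This is a sketch, not a full RFC 4180 parser (it does not handle escaped quotes inside quoted fields); it splits only on commas that fall outside double quotes, then joins the fields with | to produce the target rows:

```java
import java.util.ArrayList;
import java.util.List;

public class QuoteAwareSplit {
    // Splits on commas that are NOT inside double quotes.
    static List<String> split(String line) {
        List<String> fields = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        boolean inQuotes = false;
        for (char c : line.toCharArray()) {
            if (c == '"') {
                inQuotes = !inQuotes;
                current.append(c);          // keep the quotes, as in the target
            } else if (c == ',' && !inQuotes) {
                fields.add(current.toString());
                current.setLength(0);
            } else {
                current.append(c);
            }
        }
        fields.add(current.toString());     // last field has no trailing comma
        return fields;
    }

    public static void main(String[] args) {
        System.out.println(String.join("|", split("abc,\"test,data\",valid")));
        // prints abc|"test,data"|valid
    }
}
```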
I have one DataFrame on which I am performing a UDF operation; the UDF produces only one column as output.
How can I append it to the previous DataFrame?
Example:
Dataframe 1:
sr_no , name, salary
Dataframe 2: the UDF applied to Dataframe 1 gives ABS(salary) as output, only one column.
How can I get the output DataFrame as Dataframe 1 + Dataframe 2 in Java,
i.e. sr_no, name, salary, ABS(salary)?
Looks like you are searching for the .withColumn method:
df1.withColumn("ABS(salary)", yourUdf.apply(col("salary")))
(The snippet requires a static import of the col method from org.apache.spark.sql.functions.)
Got the answer.
Just do it like this: df = df.selectExpr("*", "ABS(salary)"); This gives you the UDF output together with your entire DataFrame; otherwise it gives only the one column.
I am adding partition columns to a Spark DataFrame. The new columns contain the year, month, and day.
I have a timestamp column in my DataFrame.
DataFrame dfPartition = df.withColumn("year", df.col("date").substr(0, 4));
dfPartition = dfPartition.withColumn("month", dfPartition.col("date").substr(6, 2));
dfPartition = dfPartition.withColumn("day", dfPartition.col("date").substr(9, 2));
I can see the correct values in the columns when I print the DataFrame, e.g. 2016 01 08.
But when I export this DataFrame to a Hive table like
dfPartition.write().partitionBy("year", "month","day").mode(SaveMode.Append).saveAsTable("testdb.testtable");
I see that the generated directory structure is missing the leading zeroes.
I tried casting the columns to String, but it did not work.
Is there a way to keep two-digit day/month values in the Hive partitions?
Thanks
Per the Spark documentation, partition-column type inference is a feature enabled by default. The OP's string values, being interpretable as ints, were converted as such. If this is undesirable for the Spark session as a whole, one can disable it by setting the corresponding configuration attribute to false:
SparkSession.builder.config("spark.sql.sources.partitionColumnTypeInference.enabled", value = false)
or by running the corresponding SET key=value command in SQL. Otherwise, one can counteract it individually at the column level with the corresponding Spark-native format-string function, as J.Doe suggests.
Refer to Add leading zeros to Columns in a Spark Data Frame
You can see how to add leading 0's in this answer:
val df2 = df
.withColumn("month", format_string("%02d", $"month"))
I tried this in my code using the snippet below, and it worked!
.withColumn("year", year(col("my_time")))
.withColumn("month", format_string("%02d",month(col("my_time")))) //pad with leading 0's
.withColumn("day", format_string("%02d",dayofmonth(col("my_time")))) //pad with leading 0's
.withColumn("hour", format_string("%02d",hour(col("my_time")))) //pad with leading 0's
.writeStream
.partitionBy("year", "month", "day", "hour")
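The padding itself is ordinary format-string behavior; format_string applies the same %02d conversion specifier that plain Java does, so you can check the expected partition names without a Spark session (a plain-Java illustration, not Spark code):

```java
public class PadDemo {
    public static void main(String[] args) {
        // %02d left-pads to two digits with zeros, giving month=01 .. day=08
        System.out.println(String.format("%02d", 1));   // prints 01
        System.out.println(String.format("%02d", 8));   // prints 08
        System.out.println(String.format("%02d", 12));  // prints 12 (no extra padding)
    }
}
```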
I am now in the difficult situation of needing to write a parser for a formatted document from Tekla, to be processed into a database.
In the .CSV I have this:
,SMS-PW-BM31,,1,,,,287.9
,,SMS-PW-BM31,1,H350*175*7*11,SS400,5805,287.9
,------------,--------------,----,---------------,--------,------------,---------
,SMS-PW-BM32,,1,,,,405.8
,,SMSPW-H707,1,H350*175*7*11,SS400,6697,332.2
,,SMSPW-EN12,1,PLT12x175,SS400,500,8.2
,,SMSPW-EN14,1,PLT16x175,SS400,500,11
,------------,--------------,----,---------------,--------,------------,---------
That is the document generated by the Tekla software. What I expect as output is something like this:
HEAD-MARK COMPONENT-TYPE QUANTITY PROFILE GRADE LENGTH WEIGHT
SMS-PW-BM31 1 287.9
SMS-PW-BM31 SMS-PW-BM31 1 H350*175*7*11 SS400 5805 287.9
SMS-PW-BM32 1 405.8
SMS-PW-BM32 SMSPW-H707 1 H350*175*7*11 SS400 6697 332.2
SMS-PW-BM32 SMSPW-EN12 1 PLT12X175 SS400 500 8.2
SMS-PW-BM32 SMSPW-EN14 1 PLT16X175 SS400 500 11
How do I start with this in Java? The most complicated part is distributing the head mark, whose groups are separated by the '-' rows.
The CSV format is quite simple: there is a column delimiter, a comma (,), and a row delimiter, a newline (\n). Some columns will be surrounded by quotes (") to contain column data, but it looks like you won't have to worry about that given your current file.
Look at String.split and you will find your answer after pondering it a bit.
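As a starting point, here is a plain-Java sketch built on String.split. It assumes the layout shown above: a head-mark row has its second comma-separated field filled and its third empty, a detail row has the second field empty, and the dashed rows are separators. The current head mark is carried forward onto each detail row:

```java
import java.util.ArrayList;
import java.util.List;

public class TeklaParser {
    // Turns the raw rows into rows prefixed with the current head mark.
    static List<String[]> parse(List<String> lines) {
        List<String[]> rows = new ArrayList<>();
        String headMark = "";
        for (String line : lines) {
            // limit -1 keeps trailing empty fields
            String[] f = line.split(",", -1);
            if (f[1].startsWith("-")) {
                continue;                   // dashed separator row, skip it
            }
            if (!f[1].isEmpty()) {
                headMark = f[1];            // head row starts a new group
                rows.add(new String[] { headMark, "", f[3], "", "", "", f[7] });
            } else {
                // detail row: head-mark column is empty, reuse the current one
                rows.add(new String[] { headMark, f[2], f[3], f[4], f[5], f[6], f[7] });
            }
        }
        return rows;
    }

    public static void main(String[] args) {
        List<String> input = List.of(
            ",SMS-PW-BM31,,1,,,,287.9",
            ",,SMS-PW-BM31,1,H350*175*7*11,SS400,5805,287.9",
            ",------------,--------------,----,---------------,--------,------------,---------");
        for (String[] row : parse(input)) {
            System.out.println(String.join("\t", row));
        }
    }
}
```

Each returned array is one output row in HEAD-MARK, COMPONENT-TYPE, QUANTITY, PROFILE, GRADE, LENGTH, WEIGHT order; writing it to the database is left out of the sketch.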
I have a string like "2,345" and I want to put it into an Excel cell. I did so successfully, but in my Excel file I got "2,345" as a string. Please suggest how I can get "2,345" as a numeric value but with the same (comma-separated) format as above.
Thanks in advance.
Remove the comma from the string and cast it to a number before inserting it into Excel, then format the column to show the comma.
String replace
In Excel the code to format a Range with commas is:
SomeRange.Style = "Comma" 'or, recorded version
SomeRange.NumberFormat = "_-* #,##0_-;-* #,##0_-;_-* ""-""??_-;_-#_-"
'a simpler version..
SomeRange.NumberFormat = "#,##0"
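On the Java side, the parse-then-reformat step can be sketched with the JDK's NumberFormat/DecimalFormat (a plain-Java sketch; writing the value into the actual cell, e.g. via a spreadsheet library, is omitted). The "#,##0" pattern is the same one used in the Excel number format above:

```java
import java.text.DecimalFormat;
import java.text.DecimalFormatSymbols;
import java.text.NumberFormat;
import java.text.ParseException;
import java.util.Locale;

public class CommaNumber {
    // "2,345" (grouped string) -> 2345.0 (numeric value to store in the cell)
    static double parseGrouped(String s) throws ParseException {
        return NumberFormat.getNumberInstance(Locale.US).parse(s).doubleValue();
    }

    // 2345 -> "2,345", using the same #,##0 pattern as the Excel number format
    static String formatGrouped(double value) {
        DecimalFormatSymbols us = DecimalFormatSymbols.getInstance(Locale.US);
        return new DecimalFormat("#,##0", us).format(value);
    }

    public static void main(String[] args) throws ParseException {
        double n = parseGrouped("2,345");
        System.out.println(n);                  // prints 2345.0
        System.out.println(formatGrouped(n));   // prints 2,345
    }
}
```

Store the numeric value in the cell and let the cell's number format (as in the VBA above) display the comma.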