UDF to extract only the file name from path in Spark SQL - java

There is an input_file_name function in Apache Spark which I use to add a new column to a Dataset with the name of the file that is currently being processed.
The problem is that I'd like to somehow customize this function to return only the file name, omitting the full s3 path to it.
For now, I am replacing the path in a second step using a map function:
val initialDs = spark.sqlContext.read
.option("dateFormat", conf.dateFormat)
.schema(conf.schema)
.csv(conf.path).withColumn("input_file_name", input_file_name)
...
...
def fromFile(fileName: String): String = {
val baseName: String = FilenameUtils.getBaseName(fileName)
val tmpFileName: String = baseName.substring(0, baseName.length - 8) //here is magic conversion ;)
this.valueOf(tmpFileName)
}
But I'd like to use something like
val initialDs = spark.sqlContext.read
.option("dateFormat", conf.dateFormat)
.schema(conf.schema)
.csv(conf.path).withColumn("input_file_name", customized_input_file_name_function)

In Scala:
import org.apache.spark.sql.functions.input_file_name
// register the udf (keep the returned UserDefinedFunction so it can be used in withColumn)
val get_only_file_name = spark.udf
.register("get_only_file_name", (fullPath: String) => fullPath.split("/").last)
// use the udf to get the last token (the file name) in the full path
val initialDs = spark.read
.option("dateFormat", conf.dateFormat)
.schema(conf.schema)
.csv(conf.path)
.withColumn("input_file_name", get_only_file_name(input_file_name()))
Edit: In Java as per comment
// register udf
spark.udf()
.register("get_only_file_name", (String fullPath) -> {
int lastIndex = fullPath.lastIndexOf("/");
return fullPath.substring(lastIndex + 1);
}, DataTypes.StringType);
import static org.apache.spark.sql.functions.callUDF;
import static org.apache.spark.sql.functions.input_file_name;
// use the udf to get the last token (the file name) in the full path
Dataset<Row> initialDs = spark.read()
.option("dateFormat", conf.dateFormat)
.schema(conf.schema)
.csv(conf.path)
.withColumn("input_file_name", callUDF("get_only_file_name", input_file_name()));

Borrowing from a related question here, the following method is more portable and does not require a custom UDF.
Spark SQL Code Snippet: reverse(split(path, '/'))[0]
Spark SQL Sample:
WITH sample_data as (
SELECT 'path/to/my/filename.txt' AS full_path
)
SELECT
full_path
, reverse(split(full_path, '/'))[0] as basename
FROM sample_data
Explanation:
The split() function breaks the path into its chunks and reverse() puts the final item (the file name) in front of the array so that [0] can extract just the filename.
Full code example:
spark.sql(
"""
|WITH sample_data as (
| SELECT 'path/to/my/filename.txt' AS full_path
| )
| SELECT
| full_path
| , reverse(split(full_path, '/'))[0] as basename
| FROM sample_data
|""".stripMargin).show(false)
Result :
+-----------------------+------------+
|full_path |basename |
+-----------------------+------------+
|path/to/my/filename.txt|filename.txt|
+-----------------------+------------+
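
To tie this back to the original question, the same expression can be applied to the path that Spark exposes per row. A small sketch in Java, assuming initialDs is the Dataset from the question and expr comes from org.apache.spark.sql.functions:
import static org.apache.spark.sql.functions.expr;

// reverse(split(...)) applied to the value of input_file_name() for each row.
Dataset<Row> withBasename = initialDs.withColumn(
        "basename", expr("reverse(split(input_file_name(), '/'))[0]"));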

commons-io is the natural/easiest import in Spark, meaning there is no need to add an additional dependency...
import org.apache.commons.io.FilenameUtils
getBaseName(String fileName)
Gets the base name, minus the full path and extension, from a full fileName.
val baseNameOfFile = udf((longFilePath: String) => FilenameUtils.getBaseName(longFilePath))
Usage is like...
yourdataframe.withColumn("shortpath" ,baseNameOfFile(yourdataframe("input_file_name")))
.show(1000,false)
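
Since the question is tagged Java, here is a minimal, self-contained sketch of the same commons-io approach; the class name, the UDF name and the s3 path are placeholders:
import org.apache.commons.io.FilenameUtils;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.types.DataTypes;
import static org.apache.spark.sql.functions.callUDF;
import static org.apache.spark.sql.functions.input_file_name;

public class BaseNameExample {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder().appName("basename-example").getOrCreate();

        // Register a UDF that strips the directory part (and the extension) via commons-io.
        spark.udf().register("base_name_of_file",
                (String longFilePath) -> FilenameUtils.getBaseName(longFilePath),
                DataTypes.StringType);

        // "s3://bucket/path/" is a placeholder for whatever conf.path points to.
        Dataset<Row> df = spark.read()
                .csv("s3://bucket/path/")
                .withColumn("shortpath", callUDF("base_name_of_file", input_file_name()));

        df.show(1000, false);
    }
}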

Related

GoogleAds API - Java / How to get all existing Keyword Plans?

I figured out how to create & delete keyword plans, but I couldn't figure out how to get a list of all my existing keyword plans (resource names / plan IDs).
final long customerId = Long.valueOf("XXXXXXXXXX");
GoogleAdsClient googleAdsClient = new ...
KeywordPlanServiceClient client = googleAdsClient.getVersion8().createKeywordPlanServiceClient();
String[] allExistingKeywordPlans = client. ???
<dependency>
<groupId>com.google.api-ads</groupId>
<artifactId>google-ads</artifactId>
<version>16.0.0</version>
</dependency>
Further resources:
https://developers.google.com/google-ads/api/docs/samples/add-keyword-plan
Any hints on how this can be solved are highly appreciated! Many thanks in advance!
Maybe you can try to fetch the keyword_plan resource from your account.
This is how I've done it to create remove operations for all the existing keywordPlans.
GoogleAdsServiceClient.SearchPagedResponse response = client.search(SearchGoogleAdsRequest.newBuilder()
.setQuery("SELECT keyword_plan.resource_name FROM keyword_plan")
.setCustomerId(Objects.requireNonNull(googleAdsClient.getLoginCustomerId()).toString())
.build());
List<KeywordPlanOperation> keywordPlanOperations = response.getPage().getResponse().getResultsList().stream()
.map(x -> KeywordPlanOperation.newBuilder()
.setRemove(x.getKeywordPlan().getResourceName())
.build())
.collect(Collectors.toList());
Of course this can also be applied to your use-case.
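For the original question (listing the existing plans rather than removing them), a sketch along the same lines might look like this; it assumes the v8 client from the question's dependency, and that customerId and googleAdsClient are set up as in the question:
import com.google.ads.googleads.v8.services.GoogleAdsRow;
import com.google.ads.googleads.v8.services.GoogleAdsServiceClient;
import java.util.ArrayList;
import java.util.List;

// Query the keyword_plan resource and collect each plan's resource name.
List<String> allExistingKeywordPlans = new ArrayList<>();
try (GoogleAdsServiceClient googleAdsServiceClient =
        googleAdsClient.getVersion8().createGoogleAdsServiceClient()) {
    GoogleAdsServiceClient.SearchPagedResponse response = googleAdsServiceClient.search(
            String.valueOf(customerId),
            "SELECT keyword_plan.resource_name, keyword_plan.id, keyword_plan.name FROM keyword_plan");
    for (GoogleAdsRow row : response.iterateAll()) {
        allExistingKeywordPlans.add(row.getKeywordPlan().getResourceName());
    }
}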
This is for PHP, if you'd like to remove all of the existing keyword plans:
$googleAdsServiceClient = $this->googleAdsClient->getGoogleAdsServiceClient();
/** @var GoogleAdsServerStreamDecorator $stream */
$stream = $googleAdsServiceClient->searchStream(
$linkedCustomerId,
'SELECT keyword_plan.resource_name FROM keyword_plan'
);
$keywordPlanServiceClient = $this->googleAdsClient->getKeywordPlanServiceClient();
/** @var GoogleAdsRow $googleAdsRow */
foreach ($stream->iterateAllElements() as $googleAdsRow) {
$keywordPlanOperation = new KeywordPlanOperation();
$keywordPlanOperation->setRemove($googleAdsRow->getKeywordPlan()->getResourceName());
$keywordPlanServiceClient->mutateKeywordPlans($this->linkedCustomerId, [$keywordPlanOperation]);
}
For python:
import argparse
import sys
from google.ads.googleads.client import GoogleAdsClient
from google.ads.googleads.errors import GoogleAdsException
def main(client, customer_id):
    ga_service = client.get_service("GoogleAdsService")
    query = """
        SELECT keyword_plan.name, keyword_plan.id, keyword_plan.forecast_period, keyword_plan.resource_name
        FROM keyword_plan
    """
    # Issues a search request using streaming.
    search_request = client.get_type("SearchGoogleAdsStreamRequest")
    search_request.customer_id = customer_id
    search_request.query = query
    stream = ga_service.search_stream(search_request)
    for batch in stream:
        for row in batch.results:
            resource_name = row.keyword_plan.resource_name
            forecast_period = row.keyword_plan.forecast_period
            id = row.keyword_plan.id
            name = row.keyword_plan.name
            print(
                f'plan resource name "{resource_name}" with '
                f'forecast period "{forecast_period.date_interval}" '
                f"and ID {id} "
                f' name "{name}" '
            )


if __name__ == "__main__":
    # GoogleAdsClient will read the google-ads.yaml configuration file in the
    # home directory if none is specified.
    googleads_client = GoogleAdsClient.load_from_storage(path='your-google-ads.yml-file-path', version="v10")
    parser = argparse.ArgumentParser(
        description=("Retrieves a campaign's negative keywords.")
    )
    # The following argument(s) should be provided to run the example.
    parser.add_argument(
        "-c",
        "--customer_id",
        type=str,
        required=True,
        help="The Google Ads customer ID.",
    )
    args = parser.parse_args()
    try:
        main(googleads_client, args.customer_id)
    except GoogleAdsException as ex:
        print(
            f'Request with ID "{ex.request_id}" failed with status '
            f'"{ex.error.code().name}" and includes the following errors:'
        )
        for error in ex.failure.errors:
            print(f'\tError with message "{error.message}".')
            if error.location:
                for field_path_element in error.location.field_path_elements:
                    print(f"\t\tOn field: {field_path_element.field_name}")
        sys.exit(1)

how to populate select clause of dataframe dynamically? giving AnalysisException

I am using spark-sql 2.4.1 and Java 8.
val country_df = Seq(
("us",2001),
("fr",2002),
("jp",2002),
("in",2001),
("fr",2003),
("jp",2002),
("in",2003)
).toDF("country","data_yr")
val col_df = country_df.select("country").where($"data_yr" === 2001)
val data_df = Seq(
("us_state_1","fr_state_1" ,"in_state_1","jp_state_1"),
("us_state_2","fr_state_2" ,"in_state_2","jp_state_1"),
("us_state_3","fr_state_3" ,"in_state_3","jp_state_1")
).toDF("us","fr","in","jp")
data_df.select("us","in").show()
How can I populate this select clause (of data_df) dynamically, from country_df, for a given year?
i.e. from the first dataframe I will get the column values, and those are the columns I need to select from the second dataframe. How can this be done?
Tried this :
List<String> aa = col_df.select(functions.lower(col("data_item_code"))).map(row -> row.mkString(" ",", "," "), Encoders.STRING()).collectAsList();
data_df.select(aa.stream().map(s -> new Column(s)).toArray(Column[]::new));
Error :
.AnalysisException: cannot resolve '` un `' given input columns: [abc,.....all columns ...]
So what is wrong here, and how can I fix this?
You can try the below code.
Select the column names from the first dataset:
List<String> columns = country_df.where(functions.col("data_yr").equalTo(2001)).select("country").as(Encoders.STRING()).collectAsList();
Use the column names in selectExpr on the second dataset:
public static Seq<String> convertListToSeq(List<String> inputList) {
return JavaConverters.asScalaIteratorConverter(inputList.iterator()).asScala().toSeq();
}
//using selectExpr
data_df.selectExpr(convertListToSeq(columns)).show(true);
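Since the question uses Java 8, here is a small sketch of the same idea in plain Java; country_df and data_df are the DataFrames from the question, and note that no padding is added around the collected values (the extra spaces from mkString(" ",", "," ") are what produced the unresolvable ` un ` column):
import java.util.List;
import org.apache.spark.sql.Column;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoders;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.functions;

// Collect the country codes for the requested year as plain strings.
List<String> countries = country_df
        .where(functions.col("data_yr").equalTo(2001))
        .select("country")
        .as(Encoders.STRING())
        .collectAsList();

// Turn each name into a Column and select those columns from the second DataFrame.
Dataset<Row> result = data_df.select(
        countries.stream().map(functions::col).toArray(Column[]::new));
result.show();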
scala> val colname = col_df.rdd.collect.toList.map(x => x(0).toString).toSeq
scala> data_df.select(colname.head, colname.tail: _*).show()
+----------+----------+
| us| in|
+----------+----------+
|us_state_1|in_state_1|
|us_state_2|in_state_2|
|us_state_3|in_state_3|
+----------+----------+
Using pivot you can get the values as column names directly like this:
val selectCols = col_df.groupBy().pivot($"country").agg(lit(null)).columns
data_df.select(selectCols.head, selectCols.tail: _*)

How to do logging of spark Dataset printSchema in info/debug level in spark-java project

Trying to convert my Spark Scala project into a spark-java project.
I have logging in Scala as below:
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
class ClassName{
val logger = LoggerFactory.getLogger("ClassName")
...
val dataframe1 = ....///read dataframe from text file.
...
logger.debug("dataframe1.printSchema : \n " + dataframe1.printSchema; //this is working fine.
}
Now I am trying to write it in java 1.8 as below
public class ClassName{
public static final Logger logger = LoggerFactory.getLogger("ClassName");
...
Dataset<Row> dataframe1 = ....///read dataframe from text file.
...
logger.debug("dataframe1.printSchema : \n " + dataframe1.printSchema()); //this is not working
}
I tried several ways but nothing worked to log printSchema output at debug/info level.
dataframe1.printSchema() // this actually returns void, hence it cannot be appended to a string.
How is logging actually done in production-grade spark-java projects?
What is the best approach I should follow to log this while debugging?
How do I handle the above scenario, i.e. log.debug(dataframe1.printSchema()) in Java?
You can use df.schema.treeString. It returns a String, whereas df.printSchema returns Unit (the Scala equivalent of void in Java). This is true in Scala and I believe it is the same in Java. Let me know if that helps.
scala> val df = Seq(1, 2, 3).toDF()
df: org.apache.spark.sql.DataFrame = [value: int]
scala> val x = df.schema.treeString
x: String =
"root
|-- value: integer (nullable = false)
"
scala> val y = df.printSchema
root
|-- value: integer (nullable = false)
y: Unit = ()
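For the Java side, a minimal sketch of the same idea, assuming the SLF4J logger and the dataframe1 variable from the question:
// schema() returns a StructType; treeString() produces the same text that
// printSchema() writes to stdout, so it can be passed straight to the logger.
logger.debug("dataframe1.printSchema :\n{}", dataframe1.schema().treeString());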
The printSchema method already prints the schema to the console without returning it in any form. You can simply call the method and redirect console output somewhere else. There are other workarounds as well.

Pivoting DataFrame - Spark SQL

I have a DataFrame containing the following:
TradeId|Source
ABC|"USD,333.123,20170605|USD,-789.444,20170605|GBP,1234.567,20150602"
I want to pivot this data so that it turns into the below:
TradeId|CCY|PV
ABC|USD|333.123
ABC|USD|-789.444
ABC|GBP|1234.567
The number of CCY|PV|Date triplets in the "Source" column is not fixed. I could do it with an ArrayList, but that requires loading the data into the JVM and defeats the whole point of Spark.
Let's say my DataFrame looks as below:
DataFrame tradesSnap = this.loadTradesSnap(reportRequest);
String tempTable = getTempTableName();
tradesSnap.registerTempTable(tempTable);
tradesSnap = tradesSnap.sqlContext().sql("SELECT TradeId, Source FROM " + tempTable);
If you read the Databricks documentation on pivot, it says "A pivot is an aggregation where one (or more in the general case) of the grouping columns has its distinct values transposed into individual columns." And this is not what you desire, I guess.
I would suggest you use withColumn and built-in functions to get the final output you desire. You can do the following, considering dataframe is what you have:
+-------+----------------------------------------------------------------+
|TradeId|Source |
+-------+----------------------------------------------------------------+
|ABC |USD,333.123,20170605|USD,-789.444,20170605|GBP,1234.567,20150602|
+-------+----------------------------------------------------------------+
You can do the following using explode, split and withColumn to get the desired output
val explodedDF = dataframe.withColumn("Source", explode(split(col("Source"), "\\|")))
val finalDF = explodedDF.withColumn("CCY", split($"Source", ",")(0))
.withColumn("PV", split($"Source", ",")(1))
.withColumn("Date", split($"Source", ",")(2))
.drop("Source")
finalDF.show(false)
The final output is
+-------+---+--------+--------+
|TradeId|CCY|PV |Date |
+-------+---+--------+--------+
|ABC |USD|333.123 |20170605|
|ABC |USD|-789.444|20170605|
|ABC |GBP|1234.567|20150602|
+-------+---+--------+--------+
I hope this solves your issue
Rather than pivoting, what you are trying to achieve looks more like flatMap.
To put it simply, by using flatMap on a Dataset you apply to each row a function (map) that itself would produce a sequence of rows. Each set of rows is then concatenated into a single sequence (flat).
The following program shows the idea:
import org.apache.spark.sql.SparkSession
case class Input(TradeId: String, Source: String)
case class Output(TradeId: String, CCY: String, PV: String, Date: String)
object FlatMapExample {
// This function will produce more rows of output for each line of input
def splitSource(in: Input): Seq[Output] =
in.Source.split("\\|", -1).map {
source =>
println(source)
val Array(ccy, pv, date) = source.split(",", -1)
Output(in.TradeId, ccy, pv, date)
}
def main(args: Array[String]): Unit = {
// Initialization and loading
val spark = SparkSession.builder().master("local").appName("pivoting-example").getOrCreate()
import spark.implicits._
val input = spark.read.options(Map("sep" -> "|", "header" -> "true")).csv(args(0)).as[Input]
// For each line in the input, split the source and then
// concatenate each "sub-sequence" in a single `Dataset`
input.flatMap(splitSource).show
}
}
Given your input, this would be the output:
+-------+---+--------+--------+
|TradeId|CCY| PV| Date|
+-------+---+--------+--------+
| ABC|USD| 333.123|20170605|
| ABC|USD|-789.444|20170605|
| ABC|GBP|1234.567|20150602|
+-------+---+--------+--------+
You can now take the result and save it to a CSV, if you want.

How to get an App category from play store by its package name in Android?

I want to fetch the app category from the Play Store through its unique identifier, i.e. the package name. I am using the following code but it does not return any data. I also tried AppsRequest.newBuilder().setAppId(query), still no help.
Thanks.
String AndroidId = "dead000beef";
MarketSession session = new MarketSession();
session.login("email", "passwd");
session.getContext().setAndroidId(AndroidId);
String query = "package:com.king.candycrushsaga";
AppsRequest appsRequest = AppsRequest.newBuilder().setQuery(query).setStartIndex(0)
.setEntriesCount(10).setWithExtendedInfo(true).build();
session.append(appsRequest, new Callback<AppsResponse>() {
@Override
public void onResult(ResponseContext context, AppsResponse response) {
String response1 = response.toString();
Log.e("reponse", response1);
}
});
session.flush();
Use this script:
######## Fetch App names and genre of apps from playstore url, using package names #############
"""
Requirements for running this script:
1. requests library
Note: Run "pip install --upgrade ndg-httpsclient" to avoid the InsecurePlatform warning
2. bs4
pip install requests
pip install bs4
"""
import requests
import csv
from bs4 import BeautifulSoup
# url to be used for package
APP_LINK = "https://play.google.com/store/apps/details?id="
output_list = []; input_list = []
# get input file path
print "Need input CSV file (absolute) path \nEnsure csv is of format: <package_name>, <id>\n\nEnter Path:"
input_file_path = str(raw_input())
# store package names and ids in list of tuples
with open(input_file_path, 'rb') as csvfile:
    for line in csvfile.readlines():
        (p, i) = line.strip().split(',')
        input_list.append((p, i))

print "\n\nSit back and relax, this might take a while!\n\n"

for package in input_list:
    # generate url, get html
    url = APP_LINK + package[0]
    r = requests.get(url)
    if not (r.status_code == 404):
        data = r.text
        soup = BeautifulSoup(data, 'html.parser')
        # parse result
        x = ""; y = ""
        try:
            x = soup.find('div', {'class': 'id-app-title'})
            x = x.text
        except:
            print "Package name not found for: %s" % package[0]
        try:
            y = soup.find('span', {'itemprop': 'genre'})
            y = y.text
        except:
            print "ID not found for: %s" % package[0]
        output_list.append([x, y])
    else:
        print "App not found: %s" % package[0]

# write to csv file
with open('results.csv', 'w') as fp:
    a = csv.writer(fp, delimiter=",")
    a.writerows(output_list)
This is what I did, the best and easiest solution:
https://androidquery.appspot.com/api/market?app=your.unique.package.name
Or otherwise you can get the source HTML and pull the string out of it...
https://play.google.com/store/apps/details?id=your.unique.package.name
Get this string out of it - use split or substring methods
<span itemprop="genre">Sports</span>
In this case sports is your category
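As a rough sketch of that "fetch the page and substring out the genre" idea, the following uses only the JDK; the URL pattern and the span itemprop="genre" markup are taken from this answer and may have changed on Google's side since it was written:
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class PlayStoreGenre {
    public static void main(String[] args) throws Exception {
        String packageName = "com.king.candycrushsaga"; // example package
        URL url = new URL("https://play.google.com/store/apps/details?id=" + packageName);

        // Download the store page HTML.
        StringBuilder html = new StringBuilder();
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(url.openStream(), "UTF-8"))) {
            String line;
            while ((line = reader.readLine()) != null) {
                html.append(line);
            }
        }

        // Pull the text between <span itemprop="genre"> and </span>.
        Matcher m = Pattern.compile("<span itemprop=\"genre\"[^>]*>([^<]+)</span>")
                .matcher(html);
        System.out.println(m.find() ? m.group(1) : "genre not found");
    }
}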
Use android-market-api; it gives all the information about the application.
