I have a directory with several text files, and I read all of them in Spark as follows:
JavaRDD<String> filesRDD = sc.textFile(directoryName);
In each file, the first line is a header that contains some mapping values. For example:
"1,apple|4,banana|3,lemon"
This means that if there is a "3" in the content, it maps to "lemon".
An example of the content:
I like 1
John eat 3 and 1
and so on.
Now what I need to do is filter lines from the content first and then replace the codes with the original values from the mapping. For example, if I first filter by the string "like", I get "I like 1"; after applying the mapping, it becomes "I like apple".
Please note that the mapping header is different for each file. How can I do this? Since I'm new to Spark, I don't have much idea of how to achieve this.
Do you want something like this?
var fruitPair = sc.parallelize(List("1,apple","4,banana","3,lemon")).map{ str =>
var temp = str.split(",")
(temp(0), temp(1))
}
fruitPair.toDF.show()
+---+------+
| _1| _2|
+---+------+
| 1| apple|
| 4|banana|
| 3| lemon|
+---+------+
var contents = List("I like 1", "John eat 3 and 1")
var results = contents.map { content =>
var tmpContent = content
fruitPair.collect.foreach { item =>
var index = tmpContent.indexOf(item._1)
if (index >= 0) {
tmpContent = tmpContent.replace(item._1, item._2)
}
}
tmpContent
}
results.foreach{ it => println(it) }
I like apple
John eat lemon and apple
results: List[String] = List(I like apple, John eat lemon and apple)
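Since the question uses the Java API and the mapping differs per file, the per-file logic might also be sketched in plain Java. This is only a sketch under assumptions: the class and method names (HeaderMapping, parseHeader, substitute) are hypothetical and not part of Spark, and codes are assumed to appear as whitespace-separated tokens. Unlike a plain substring replace, it swaps whole tokens only, so the key "1" does not touch "10".

```java
import java.util.HashMap;
import java.util.Map;
import java.util.StringJoiner;

// Hypothetical helper (not a Spark class): handles the header and lines of ONE file.
final class HeaderMapping {

    // Parses a header such as "1,apple|4,banana|3,lemon" into {1=apple, 4=banana, 3=lemon}.
    static Map<String, String> parseHeader(String header) {
        Map<String, String> mapping = new HashMap<>();
        // Drop the surrounding quotes shown in the question, if present.
        String cleaned = header.replace("\"", "").trim();
        // "|" is a regex metacharacter, so it is escaped here.
        for (String pair : cleaned.split("\\|")) {
            String[] kv = pair.split(",", 2);
            if (kv.length == 2) {
                mapping.put(kv[0].trim(), kv[1].trim());
            }
        }
        return mapping;
    }

    // Replaces whole whitespace-separated tokens that are keys of the mapping.
    static String substitute(String line, Map<String, String> mapping) {
        StringJoiner out = new StringJoiner(" ");
        for (String token : line.trim().split("\\s+")) {
            out.add(mapping.getOrDefault(token, token));
        }
        return out.toString();
    }
}
```

To keep each header together with its own lines in Spark, one option would be to read the directory with sc.wholeTextFiles(directoryName), which yields one (path, content) pair per file, and then apply parseHeader to the first line and substitute to the filtered remaining lines of that file.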
I've got a simple problem, but I'm new to Java coming from PHP. I need to split a delimited text file into an array. I've broken it down into an array of lines, each one would look something like this:
{
{Bob | Smithers | Likes Cats | Doesnt Like Dogs},
{Jane | Haversham | Likes Bats | Doesnt Like People}
}
I need to turn this into a 2 dimensional array.
In PHP, it's a cinch. You just use explode(). I tried using String.split on a 1D array and it wasn't that bad either. The thing is, I haven't yet learned how to be nice to Java, so I don't know how to loop through the array and turn it into a 2D array. This is what I have:
for (i = 0; i < array.length; i++) {
String[i][] 2dArray = array[i].split("|", 4);
}
PHP would be
for ($i = 0; $i < count($array); $i++) {
$array[i][] = explode(",", $array[i]);
}
You can loop the array like this:
import java.util.Arrays;

// Initialize array
String[] array = {
"Bob | Smithers | Likes Cats | Doesnt Like Dogs",
"Jane | Haversham | Likes Bats | Doesnt Like People"
};
// Convert 1d to 2d array; each row is sized by its own split result
String[][] array2d = new String[array.length][];
for (int i = 0; i < array.length; i++) {
    array2d[i] = array[i].split(" \\| ");
}
// Print the array
for(int i=0;i<array2d.length;i++) {
System.out.println(Arrays.toString(array2d[i]));
}
Note: I used \\| to split on the pipe character, because split takes a regular expression and | is a regex metacharacter that has to be escaped.
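The escaping point can be illustrated with a small sketch. The class and method names here (PipeSplitDemo, fields, naive) are made up for illustration; Pattern.quote is used as an alternative to writing the backslashes by hand.

```java
import java.util.regex.Pattern;

// Hypothetical demo class contrasting an escaped and an unescaped pipe in String.split.
final class PipeSplitDemo {

    // Splits on a literal pipe and trims the spaces around each field.
    static String[] fields(String line) {
        String[] parts = line.split(Pattern.quote("|"));
        for (int i = 0; i < parts.length; i++) {
            parts[i] = parts[i].trim();
        }
        return parts;
    }

    // Unescaped "|" is a regex alternation of two empty patterns: it matches
    // the empty string, so the input is cut between every pair of characters.
    static String[] naive(String line) {
        return line.split("|");
    }
}
```

With the escaped version each line yields its four fields; the unescaped version returns far more pieces than there are fields, which is usually the first symptom of this mistake.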
Problem
If I got you right you have an input like this:
{{Bob | Smithers | Likes Cats | Doesnt Like Dogs},{Jane | Haversham | Likes Bats | Doesnt Like People}}
Readable version:
{
{Bob | Smithers | Likes Cats | Doesnt Like Dogs},
{Jane | Haversham | Likes Bats | Doesnt Like People}
}
And you want to represent that structure in a 2-dimensional String array, String[][].
Solution
The key is the method String#split, which splits a given String into substrings around matches of a given regular expression. In your example the delimiters are , and |.
First of all, we remove all { and } characters, as we don't need them (this works as long as the text itself does not contain the delimiters):
String input = ...
String inputWithoutCurly = input.replaceAll("[{}]", "");
The text is now:
Bob | Smithers | Likes Cats | Doesnt Like Dogs,Jane | Haversham | Likes Bats | Doesnt Like People
Next we want to create the outer dimension of the array, that is split by ,:
String[] entries = inputWithoutCurly.split(",");
Structure now is:
[
"Bob | Smithers | Likes Cats | Doesnt Like Dogs",
"Jane | Haversham | Likes Bats | Doesnt Like People"
]
We now want to split each of the inner texts into their components. We therefore iterate all entries, split them by | and collect them to the result:
// Declaring a new 2-dim array with unknown inner dimension
String[][] result = new String[entries.length][];
// Iterating all entries
for (int i = 0; i < entries.length; i++) {
String[] data = entries[i].split(" \\| "); // the pipe must be escaped, split takes a regex
// Collect data to result
result[i] = data;
}
Finally we have the desired structure of:
[
[ "Bob", "Smithers", "Likes Cats", "Doesnt Like Dogs" ],
[ "Jane", "Haversham", "Likes Bats", "Doesnt Like People"]
]
Everything compact:
String[] entries = input.replaceAll("[{}]", "").split(",");
String[][] result = new String[entries.length][];
for (int i = 0; i < entries.length; i++) {
result[i] = entries[i].split(" \\| ");
}
Stream
If you have Java 8 or newer you can use the Stream API for a compact functional style:
String[][] result = Arrays.stream(input.replaceAll("[{}]", "").split(","))
.map(entry -> entry.split(" \\| "))
.toArray(String[][]::new);
I have a database table with two columns, key and value. Sample record:
------------------------------------
| key | value |
------------------------------------
| A | 1,desc 1;2,desc 2;3,desc 3 |
------------------------------------
I want to split the value column and turn it into JSON format:
[{"key":"1","value":"desc 1"},{"key":"2","value":"desc 2"},{"key":"3", "value":"desc 3"}]
Where should I put the split function? In the service? It seems too difficult because it needs two splits. How can I solve this problem?
Thanks,
Bobby
That depends on how your application is usually working with this value. If the usual case is using some specific data from this column, I would parse this at repository level already:
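// Assumes the org.json library; the class name ValueToJson is only illustrative
import java.util.stream.Stream;
import org.json.JSONArray;
import org.json.JSONObject;

public class ValueToJson {
    public static void main(String[] args) {
        // You actually get this from DB
        String value = "1,desc 1;2,desc 2;3,desc 3";
        JSONArray j = new JSONArray();
        Stream.of(value.split(";")).forEach(pair -> {
            // Limit 2 keeps any further commas inside the description
            String[] keyValue = pair.split(",", 2);
            JSONObject o = new JSONObject();
            o.put("key", keyValue[0]);
            o.put("value", keyValue[1]);
            j.put(o);
        });
        System.out.println(j);
    }
}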
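If the parsing should live at repository level without tying that layer to a JSON library, a dependency-free variant could return plain collections and leave serialization to whatever the application already uses. This is a sketch under assumptions: the class name KeyValueParser is hypothetical, and malformed pairs (no comma) are simply skipped.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper: turns "1,desc 1;2,desc 2" into a list of {key, value} maps.
final class KeyValueParser {

    static List<Map<String, String>> parse(String value) {
        List<Map<String, String>> result = new ArrayList<>();
        if (value == null || value.trim().isEmpty()) {
            return result;
        }
        // First split: one entry per ";"-separated pair.
        for (String pair : value.split(";")) {
            // Second split: limit 2, so commas inside the description are kept.
            String[] kv = pair.split(",", 2);
            if (kv.length < 2) {
                continue; // assumption: silently skip malformed pairs
            }
            Map<String, String> entry = new LinkedHashMap<>();
            entry.put("key", kv[0].trim());
            entry.put("value", kv[1].trim());
            result.add(entry);
        }
        return result;
    }
}
```

A list of maps in this shape corresponds one-to-one to the JSON array in the question, so the service layer only has to serialize it.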
After reading a CSV file into a Dataset, I want to remove spaces from the String-type data using the Java API.
Apache Spark 2.0.0
Dataset<Row> dataset = sparkSession.read().format("csv").option("header", "true").load("/pathToCsv/data.csv");
Dataset<String> dataset2 = dataset.map(new MapFunction<Row,String>() {
@Override
public String call(Row value) throws Exception {
return value.getString(0).replace(" ", "");
// But this will remove space from only first column
}
}, Encoders.STRING());
With MapFunction, I am not able to remove spaces from all columns.
In Scala, however, I can perform the desired operation in spark-shell in the following way.
val ds = spark.read.format("csv").option("header", "true").load("/pathToCsv/data.csv")
val opds = ds.select(ds.columns.map(c => regexp_replace(col(c), " ", "").alias(c)): _*)
The Dataset opds holds the data without spaces. I want to achieve the same in Java, but in the Java API the columns method returns String[], and I am not able to apply the same functional style to the Dataset.
Input Data
+----------------+----------+-----+---+---+
| x| y| z| a| b|
+----------------+----------+-----+---+---+
| Hello World|John Smith|There| 1|2.3|
|Welcome to world| Bob Alice|Where| 5|3.6|
+----------------+----------+-----+---+---+
Expected Output Data
+--------------+---------+-----+---+---+
| x| y| z| a| b|
+--------------+---------+-----+---+---+
| HelloWorld|JohnSmith|There| 1|2.3|
|Welcometoworld| BobAlice|Where| 5|3.6|
+--------------+---------+-----+---+---+
Try:
// Requires: import static org.apache.spark.sql.functions.regexp_replace;
for (String col : dataset.columns()) {
    dataset = dataset.withColumn(col, regexp_replace(dataset.col(col), " ", ""));
}
You can try the following regex to remove whitespace inside the strings.
value.getString(0).replaceAll("\\s+", "");
About \s+ : it matches any whitespace character between one and unlimited times, as many times as possible.
Note that replace(" ", "") already removes every space character, because it treats its argument as literal text; replaceAll takes a regular expression, so with \s+ it also removes tabs and other whitespace.
For more about the two methods, see "Difference between String replace() and replaceAll()".
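The difference between the two methods is easy to check in isolation. A minimal sketch, with a hypothetical class name (WhitespaceDemo):

```java
// Hypothetical demo class comparing String.replace and String.replaceAll.
final class WhitespaceDemo {

    // Literal replacement: removes every ' ' character, but nothing else.
    static String stripSpaces(String s) {
        return s.replace(" ", "");
    }

    // Regex replacement: \s+ matches runs of any whitespace (space, tab, newline, ...).
    static String stripAllWhitespace(String s) {
        return s.replaceAll("\\s+", "");
    }
}
```

For data that contains only ordinary spaces, as in the sample table, both give the same result; they differ once tabs or line breaks appear in a value.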
Background: I'm using Talend to do something (I guess) that is pretty common: generating multiple rows from one. For example:
ID | Name | DateFrom | DateTo
01 | Marco| 01/01/2014 | 04/01/2014
...could be split into:
new_ID | ID | Name | DateFrom | DateTo
01 | 01 | Marco | 01/01/2014 | 02/01/2014
02 | 01 | Marco | 02/01/2014 | 03/01/2014
03 | 01 | Marco | 03/01/2014 | 04/01/2014
The number of output rows is dynamic, depending on the date period in the original row.
Question: how can I do this? Maybe using tSplitRow? I am going to check those periods with tJavaRow. Any suggestions?
Expanding on the answer given by Balazs Gunics
Your first part is to calculate the number of rows one row will become, which is easy enough with a date diff function on the to and from dates.
Part 2 is to pass that value to a tFlowToIterate and pick it up with a tJavaFlex, which uses it in its start code to control a for loop:
tJavaFlex start:
int currentId = (Integer)globalMap.get("out1.id");
String currentName = (String)globalMap.get("out1.name");
Long iterations = (Long)globalMap.get("out1.iterations");
Date dateFrom = (java.util.Date)globalMap.get("out1.dateFrom");
for (int i = 0; i < iterations; i++) {
Main
row2.id = currentId;
row2.name = currentName;
row2.dateFrom = TalendDate.addDate(dateFrom, i, "dd");
row2.dateTo = TalendDate.addDate(dateFrom, i+1, "dd");
End
}
and sample output:
1|Marco|01-01-2014|02-01-2014
1|Marco|02-01-2014|03-01-2014
1|Marco|03-01-2014|04-01-2014
2|Polo|01-01-2014|02-01-2014
2|Polo|02-01-2014|03-01-2014
2|Polo|03-01-2014|04-01-2014
2|Polo|04-01-2014|05-01-2014
2|Polo|05-01-2014|06-01-2014
2|Polo|06-01-2014|07-01-2014
2|Polo|07-01-2014|08-01-2014
2|Polo|08-01-2014|09-01-2014
2|Polo|09-01-2014|10-01-2014
2|Polo|10-01-2014|11-01-2014
2|Polo|11-01-2014|12-01-2014
2|Polo|12-01-2014|13-01-2014
2|Polo|13-01-2014|14-01-2014
2|Polo|14-01-2014|15-01-2014
2|Polo|15-01-2014|16-01-2014
2|Polo|16-01-2014|17-01-2014
2|Polo|17-01-2014|18-01-2014
2|Polo|18-01-2014|19-01-2014
2|Polo|19-01-2014|20-01-2014
2|Polo|20-01-2014|21-01-2014
2|Polo|21-01-2014|22-01-2014
2|Polo|22-01-2014|23-01-2014
2|Polo|23-01-2014|24-01-2014
2|Polo|24-01-2014|25-01-2014
2|Polo|25-01-2014|26-01-2014
2|Polo|26-01-2014|27-01-2014
2|Polo|27-01-2014|28-01-2014
2|Polo|28-01-2014|29-01-2014
2|Polo|29-01-2014|30-01-2014
2|Polo|30-01-2014|31-01-2014
2|Polo|31-01-2014|01-02-2014
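Outside of Talend, the same expansion logic can be sketched in plain Java with java.time, which may help when checking the expected number of rows. The class name DateRangeExpander is hypothetical, and the range is assumed to be half-open, so the last generated row ends on dateTo.

```java
import java.time.LocalDate;
import java.time.temporal.ChronoUnit;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: expands one [dateFrom, dateTo) period into one-day periods.
final class DateRangeExpander {

    // Each element is a two-item array: {rowDateFrom, rowDateTo}.
    static List<LocalDate[]> expand(LocalDate dateFrom, LocalDate dateTo) {
        // Number of output rows, i.e. the "date diff" step of the answer.
        long days = ChronoUnit.DAYS.between(dateFrom, dateTo);
        List<LocalDate[]> rows = new ArrayList<>();
        for (long i = 0; i < days; i++) {
            rows.add(new LocalDate[] { dateFrom.plusDays(i), dateFrom.plusDays(i + 1) });
        }
        return rows;
    }
}
```

For the Marco row (01/01/2014 to 04/01/2014) this gives three rows, and for the Polo row (01/01/2014 to 01/02/2014) it gives 31, matching the sample output.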
You can use tJavaFlex to do this.
If you have a small number of columns, the tFlowToIterate -> tJavaFlex option could be fine.
In the begin part you can start to iterate, and in the main part you assign values to the output schema. If your output is named row6, then:
row6.id = (String)globalMap.get("id");
and so on.
I came here because I wanted to add all context parameters to an Excel data sheet. The solution below works when you take 0 input lines, but it can be adapted to generate several lines for each input line.
The design is actually straightforward:
tJava –trigger-on-OK→ tFileInputDelimited → tDoSomethingOnRowSet
↓ ↑
[write into a CSV] [read the CSV]
And here is the kind of code structure usable in the tJava.
try {
StringBuffer wad = new StringBuffer();
wad.append("Key;Nub"); // Header
context.stringPropertyNames().forEach(
key -> wad.
append(System.getProperty("line.separator")).
append(key + ";" + context.getProperty(key) )
);
// Here context.metadata contains the path to the CSV file
FileWriter output = new FileWriter(context.metadata);
output.write(wad.toString());
output.close();
} catch (IOException mess) {
System.out.println("An error occurred.");
mess.printStackTrace();
}
Of course, if you have a set of rows as input, you can adapt the process to use a tJavaRow instead of a tJava.
You might prefer to use an Excel file as an on-disk buffer, but dealing with this file format takes more work, at least the first time, when you don't have the Java libraries already configured in Talend. Apache POI might help you if you nonetheless choose to go this way.
Firstly, I am aware that there are other similar posts, but since mine uses a URL and I am not always sure what my delimiter will be, I feel that I am alright posting my question. My assignment is to make a crude web browser. I have a textField that a user enters the desired URL into. I then obviously have to navigate to that webpage. Here is an example from my teacher of what my code would look kind of like. This is the text I'm supposed to be sending to my socket. Sample url: http://en.wikipedia.org/wiki/Hypertext_Transfer_Protocol
GET /wiki/Hypertext_Transfer_Protocol HTTP/1.1\n
Host: en.wikipedia.org\n
\n
So my question is this: I am going to read in the url as just one complete string, so how do I extract just the "en.wikipedia.org" part and just the extension? I tried this as a test:
String url = "http://en.wikipedia.org/wiki/Hypertext Transfer Protocol";
String done = " ";
String[] hope = url.split(".org");
for ( int i = 0; i < hope.length; i++)
{
done = done + hope[i];
}
System.out.println(done);
This just prints out the URL without the ".org" in it. I think I'm on the right track, but I am just not sure. Also, I know that websites can have different endings (.org, .com, .edu, etc.), so I am assuming I'll have to have a few if statements that compensate for the possible different endings. Basically, how do I get the URL into the two parts that I need?
The URL class pretty much does this, look at the tutorial. For example, given this URL:
http://example.com:80/docs/books/tutorial/index.html?name=networking#DOWNLOADING
This is the kind of information you can expect to obtain:
protocol = http
authority = example.com:80
host = example.com
port = 80
path = /docs/books/tutorial/index.html
query = name=networking
filename = /docs/books/tutorial/index.html?name=networking
ref = DOWNLOADING
This is how you should split your URL parts: http://docs.oracle.com/javase/tutorial/networking/urls/urlInfo.html
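Applied to the question, the host and path can be read directly from a parsed URL and turned into the request text. The following is a minimal sketch using java.net.URI; the class and method names (HttpRequestBuilder, buildGet) are hypothetical. It ends lines with \r\n because HTTP/1.1 specifies CRLF line endings, whereas the teacher's sample shows \n.

```java
import java.net.URI;

// Hypothetical helper: builds the GET request text for a given absolute URL.
final class HttpRequestBuilder {

    static String buildGet(String url) {
        URI uri = URI.create(url);

        // Host header: host name, plus the port if one was given explicitly.
        String host = uri.getHost();
        if (uri.getPort() != -1) {
            host = host + ":" + uri.getPort();
        }

        // Request target: path (or "/" if empty), plus the query string if any.
        String target = uri.getRawPath();
        if (target == null || target.isEmpty()) {
            target = "/";
        }
        if (uri.getRawQuery() != null) {
            target = target + "?" + uri.getRawQuery();
        }

        return "GET " + target + " HTTP/1.1\r\n"
             + "Host: " + host + "\r\n"
             + "\r\n";
    }
}
```

No per-ending if statements are needed, because the parser does not care whether the host ends in .org, .com or .edu. Note that URI.create throws an IllegalArgumentException for input containing unencoded spaces, such as the test string in the question, so user input may need to be encoded or validated first.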
Instead of url.split(".org"); try url.split("/"); and iterate through your array of strings.
Or you can look into regular expressions. This is a good example to start with.
Good luck on your homework.
Even though the answer with URL class is great, here is one more way to split URL to components using REGEXP:
"^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\\?([^#]*))?(#(.*))?"
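group 2 - scheme (group 1 also includes the trailing ":")
group 4 - authority, includes hostname/IP and port number (group 3 also includes the leading "//")
group 5 - path
group 7 - query (group 6 also includes the leading "?")
group 9 - fragment (group 8 also includes the leading "#")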
You can use it with Pattern class:
// Requires: import java.util.regex.Pattern;
var regex = "^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\\?([^#]*))?(#(.*))?";
var pattern = Pattern.compile(regex);
var matcher = pattern.matcher("http://example.com:80/docs/books/tutorial/index.html?name=networking#DOWNLOADING");
if (matcher.matches()) {
System.out.println("scheme: " + matcher.group(2));
System.out.println("authority: " + matcher.group(4));
System.out.println("path: " + matcher.group(5));
System.out.println("query: " + matcher.group(7));
System.out.println("fragment: " + matcher.group(9));
}
You can use the String class's split() and store the result in a String array, then iterate over the array and store each variable and its value in a Map.
import java.util.HashMap;
import java.util.Map;

public class URLSPlit {
    public static Map<String, String> splitString(String s) {
        // Split on runs of '=', '&' and '?'
        String[] split = s.split("[=&?]+");
        Map<String, String> maps = new HashMap<>();
        // Stop before a dangling key so that split[i + 1] is always in bounds
        for (int i = 0; i + 1 < split.length; i += 2) {
            maps.put(split[i], split[i + 1]);
        }
        return maps;
    }
    public static void main(String[] args) {
        String word = "q=java+online+compiler&rlz=1C1GCEA_enIN816IN816&oq=java+online+compiler&aqs=chrome..69i57j69i60.18920j0j1&sourceid=chrome&ie=UTF-8?k1=v1";
        Map<String, String> newmap = splitString(word);
        for (Map.Entry<String, String> entry : newmap.entrySet()) {
            System.out.println(entry.getKey() + " = " + entry.getValue());
        }
    }
}
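Splitting on every = and & at once pairs keys and values by position, which shifts all later entries if one parameter has no value. A slightly more defensive sketch splits on & first and then on the first = of each pair, and also decodes the percent and plus encoding. The class name QueryParser is hypothetical, and the Charset overload of URLDecoder.decode assumes Java 10 or newer.

```java
import java.net.URLDecoder;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper: parses "a=1&b=two+words&flag" into an ordered map.
final class QueryParser {

    static Map<String, String> parse(String query) {
        Map<String, String> params = new LinkedHashMap<>();
        if (query == null || query.isEmpty()) {
            return params;
        }
        for (String pair : query.split("&")) {
            if (pair.isEmpty()) {
                continue; // tolerate "a=1&&b=2"
            }
            // Limit 2: only the first '=' separates key and value.
            String[] kv = pair.split("=", 2);
            String key = URLDecoder.decode(kv[0], StandardCharsets.UTF_8);
            // A parameter without '=' is stored with an empty value.
            String value = kv.length > 1
                    ? URLDecoder.decode(kv[1], StandardCharsets.UTF_8)
                    : "";
            params.put(key, value);
        }
        return params;
    }
}
```

Repeated keys overwrite each other in this sketch; if the application needs all values, a Map<String, List<String>> would be the natural extension.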