I have three MapReduce jobs that operate on the same input files and each produce tab-delimited output. In every one of these outputs, the first value is the key.
What I want to do now is use MapReduce to "stitch" these files together by key. What would be the best Mapper output and Reducer input? I tried using ArrayWritable, but because of the shuffle, for some records the ArrayWritable from one file ends up in the third position instead of the second.
I want this:
Key \t Values-from-first-MR-job \t Values-from-second-MR-job \t Values-from-third-MR-job
And this should be the same for all records. But, as I said, because of the shuffle, sometimes this happens for a few records:
Key \t Values-from-third-MR-job \t Values-from-first-MR-job \t Values-from-second-MR-job
How should I set up my Mapper and Reducer to fix this?
It's possible with simple tagging on the emitted value, since only three types of files are involved. In the map, extract the path of the input split, identify which job's output it came from, and add a suitable prefix to the value. For clarity, say the outputs are in 3 directories:
path1/mr_out_1
path2/mr_out_2
path3/mr_out_3
Using TextInputFormat for all these paths, in the map you would do:
// Split the line into the key and the rest of the values
String[] keyVal = value.toString().split("\t", 2);
// Find out which job's output directory this split came from
Path filePath = ((FileSplit) context.getInputSplit()).getPath();
String dirName = filePath.getParent().getName();
Text outValue = new Text();
if (dirName.equals("mr_out_1")) {
    outValue.set("1_" + keyVal[1]);
} else if (dirName.equals("mr_out_2")) {
    outValue.set("2_" + keyVal[1]);
} else {
    outValue.set("3_" + keyVal[1]);
}
context.write(new Text(keyVal[0]), outValue);
If you have all the files in the same directory, use the file name instead of dirName. Then identify the tag based on the name (a regex match may be suitable):
String fileName = filePath.getName();
if(fileName.matches("regex")){ ... }
In the reduce, just put the incoming values into a List and sort it. The rest is simple enough.
// Collect the tagged values for this key
List<String> list = new ArrayList<String>(3);
for (Text v : values) {
    list.add(v.toString());
}
// Sorting puts them in "1_", "2_", "3_" order
Collections.sort(list);
StringBuilder builder = new StringBuilder();
for (String s : list) {
    // Strip the two-character tag before appending
    builder.append(s.substring(2)).append("\t");
}
context.write(key, new Text(builder.toString().trim()));
I think it will serve the purpose. Keep in mind that the Collections.sort strategy will fail if there are more than 9 files, because the tags are compared as strings ("10_" sorts before "2_"). In that case you can extract the tag separately, parse it as an Integer and use a TreeMap<tag, actualString> for the sorting.
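A minimal sketch of that TreeMap variant, assuming the same "N_" tagging as above:
TreeMap<Integer, String> sorted = new TreeMap<Integer, String>();
for (Text v : values) {
    String s = v.toString();
    int sep = s.indexOf('_');
    // Order by the numeric tag instead of the string prefix
    sorted.put(Integer.parseInt(s.substring(0, sep)), s.substring(sep + 1));
}
StringBuilder builder = new StringBuilder();
for (String s : sorted.values()) {
    builder.append(s).append("\t");
}
context.write(key, new Text(builder.toString().trim()));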
NB: All the above snippets use the new API. I didn't use an IDE to write them, so a few syntax errors may exist. I also didn't follow proper conventions; for example, the map's output key could be a class member, and calling outKey.set(keyVal[0]) would avoid creating a new Text object for every record.
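For what it's worth, a sketch of that reuse pattern inside the mapper class (the input key type assumes TextInputFormat):
// Mapper fields, reused across map() calls instead of allocating per record
private final Text outKey = new Text();
private final Text outValue = new Text();

@Override
protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
    String[] keyVal = value.toString().split("\t", 2);
    outKey.set(keyVal[0]);
    // set outValue with the "1_"/"2_"/"3_" tag exactly as shown earlier
    context.write(outKey, outValue);
}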
Related
I want to replace some of the columns in one CSV file with the column values from another CSV file, and the files cannot fit in memory together. Language constraints: Java, Scala. No framework constraints.
One of the files has a key-value kind of mapping and the other file has a large number of columns. We have to replace the values in the large CSV file with the values from the file that holds the key-value mapping.
Under the assumption that you can hold all the key-value mappings in memory, you can process the big file in a streaming fashion:
import java.io.{File, PrintWriter}
import scala.io.Source

// Construct a simple key -> value map from the small file
val kv_file = Source.fromFile("key_values.csv")
val kv: Map[String, String] = kv_file.getLines().map { line =>
  val cols = line.split(";")
  cols(0) -> cols(1)
}.toMap

// Stream the big file and apply the key-value replace logic (as I understood it)
val big_file = Source.fromFile("big_file.csv") // input file name assumed
val writer = new PrintWriter(new File("processed_big_file.csv"))
big_file.getLines().foreach { line =>
  val processed_cols = line.split(";").map { s => kv.getOrElse(s, s) }
  writer.println(processed_cols.mkString(";"))
}
// close the output file
writer.close()
Under the assumption that you cannot fully load the key-value mapping, you could partially load the key-value file into memory and still process the big one in a streaming fashion. Of course, you then have to iterate over the files several times to get all the keys processed.
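Since the question allows Java as well, here is a rough sketch of that multi-pass idea; the file names, the ";" delimiter and the chunk size are all assumptions, and each pass rewrites the output of the previous one:
import java.io.BufferedReader;
import java.io.IOException;
import java.io.PrintWriter;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.HashMap;
import java.util.Map;

public class ChunkedReplace {
    public static void main(String[] args) throws IOException {
        final int chunkSize = 100_000;                 // how many mappings to hold at once
        Path current = Paths.get("big_file.csv");      // assumed input file name

        try (BufferedReader kvReader = Files.newBufferedReader(Paths.get("key_values.csv"))) {
            int pass = 0;
            boolean more = true;
            while (more) {
                // Load at most chunkSize mappings into memory
                Map<String, String> kv = new HashMap<>();
                String kvLine;
                while (kv.size() < chunkSize && (kvLine = kvReader.readLine()) != null) {
                    String[] cols = kvLine.split(";");
                    kv.put(cols[0], cols[1]);
                }
                if (kv.isEmpty()) break;               // no mappings left
                more = kv.size() == chunkSize;         // a partial chunk means we hit the end

                // Stream the big file and replace the values this chunk knows about
                Path next = Paths.get("pass_" + (pass++) + ".csv");
                try (BufferedReader in = Files.newBufferedReader(current);
                     PrintWriter out = new PrintWriter(Files.newBufferedWriter(next))) {
                    String line;
                    while ((line = in.readLine()) != null) {
                        String[] cols = line.split(";");
                        for (int i = 0; i < cols.length; i++) {
                            cols[i] = kv.getOrDefault(cols[i], cols[i]);
                        }
                        out.println(String.join(";", cols));
                    }
                }
                current = next;                        // next pass reads this pass's output
            }
        }
    }
}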
I am writing a small java method that needs to read test data from a file on my win10 laptop.
The test data has not been formed yet but it will be text based.
I need to write a method that reads the data and analyses it character by character.
My questions are:
What is the simplest format to create and read the file? I was looking at JSON, which does not look particularly complex, but is it the best choice for a very simple application?
My second question (and I am a novice): if the data is in a text file on my laptop, how do I tell my Java code where to find it? How do I ask Java to navigate the Win10 file system?
You can also map the text file into Java objects (it depends on your text file).
For example, say we have a text file that contains a person's name and family name on each line, like:
Foo,bar
John,doe
So to parse the above text file and map it into a Java object, we can:
1- Create a Person Object
2- Read and parse the file (line by line)
Create Person Class
public class Person {
    private String name;
    private String family;
    // setters and getters
}
Read The File and Parse line by line
import com.google.common.base.Splitter; // Guava
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.List;
import java.util.stream.Collectors;

public static void main(String[] args) throws IOException {
    // Read the file, parse it line by line and map each line into a Person object
    List<Person> personList = Files
            .lines(Paths.get("D:\\Project\\Code\\src\\main\\resources\\person.txt"))
            .map(line -> {
                // Split the line on ",", e.g. "John,doe" -> [John, doe]
                List<String> nameAndFamily = Splitter.on(",").trimResults().omitEmptyStrings().splitToList(line);
                // Create a new Person from the parsed values
                Person person = new Person();
                person.setName(nameAndFamily.get(0));
                person.setFamily(nameAndFamily.get(1));
                return person;
            })
            .collect(Collectors.toList());

    // Process the person list
    personList.forEach(person -> {
        // Do whatever you want with each person, e.g. print it
        System.out.println(person.getName());
        System.out.println(person.getFamily());
    });
}
Regarding your first question, I can't say much without knowing anything about the data you'd like to write/read.
For your second question, you would normally do something like this:
String pathToFile = "C:/Users/SomeUser/Documents/testdata.txt";
InputStream in = new FileInputStream(pathToFile);
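And since you mentioned analysing the data character by character, a minimal sketch of that might look like this (the path is just a placeholder):
import java.io.FileReader;
import java.io.IOException;
import java.io.Reader;

public class CharByChar {
    public static void main(String[] args) throws IOException {
        String pathToFile = "C:/Users/SomeUser/Documents/testdata.txt"; // placeholder path
        try (Reader reader = new FileReader(pathToFile)) {
            int c;
            while ((c = reader.read()) != -1) {  // -1 signals end of file
                char ch = (char) c;
                // analyse each character here
                System.out.print(ch);
            }
        }
    }
}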
As your data gains more complexity, you should probably think about using a well-defined format if that is possible, such as JSON or YAML.
Hope this helps a bit. Good luck with your project.
As for the format the text file needs to take, you should elaborate a bit on the kind of data. So I can't say much there.
But to navigate the file system, you just need to write the path a bit differently:
keep the drive letter and its colon at the beginning of the path
replace each backslash with a forward slash (or escape it as \\)
then you should be set.
So for example...
C:\users\johndoe\documents\projectfiles\mydatafile.txt
becomes
C:/users/johndoe/documents/projectfiles/mydatafile.txt
With this path, you can use all the IO classes for file manipulation.
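For example, a small sketch using that path (file name and location assumed, matching the example above):
import java.io.BufferedReader;
import java.io.File;
import java.io.FileReader;
import java.io.IOException;

public class ReadMyData {
    public static void main(String[] args) throws IOException {
        File dataFile = new File("C:/users/johndoe/documents/projectfiles/mydatafile.txt");
        if (dataFile.exists()) {
            try (BufferedReader reader = new BufferedReader(new FileReader(dataFile))) {
                String line;
                while ((line = reader.readLine()) != null) {
                    System.out.println(line); // or analyse it character by character
                }
            }
        }
    }
}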
I am working on JavaRDD code where I have to load a CSV into a JavaRDD named restaurantDetailRDD. The restaurantDetailRDD has an address column which must be extracted into another RDD named addressRDD. I just need the filter/transformation where I can split out the address column, based on the header provided in the CSV.
// provide path to input text file
String path = "/home/lingesh/Downloads/newitems.csv";
// read text file to RDD
JavaRDD<String> restaurantDetailRDD = sc.textFile(path);
// collect RDD for printing
for (String line : restaurantDetailRDD.collect()) {
    System.out.println(line);
}
As you can see, I have only created the restaurantDetailRDD so far.
I expect the address column to be placed in a different RDD.
If you know the position of your address column, you can just use a map function to transform the RDD into another RDD.
JavaRDD<String> columnRdd = rdd.map(f -> {
    String[] arr = f.split(",");
    return arr[position];
});
System.out.println("new count " + columnRdd.count());
It's better this way because you're using Spark transformations, which means the work is distributed across Spark partitions and the computation stays fast. Don't fall back to collecting and looping with plain Java unless you really need to print results for testing.
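If the position isn't known up front, a rough sketch of deriving it from the header line could look like this (it assumes the first line of the file is the header and the column is literally named "address"):
// Needs: java.util.Arrays, java.util.List, org.apache.spark.api.java.JavaRDD
String header = restaurantDetailRDD.first();                 // first line = header row (assumed)
List<String> columns = Arrays.asList(header.split(","));
int position = columns.indexOf("address");                   // assumed column name

JavaRDD<String> addressRDD = restaurantDetailRDD
        .filter(line -> !line.equals(header))                // drop the header row
        .map(line -> line.split(",")[position]);             // keep only the address column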
I'm trying to list all so-called folders and sub-folders in an s3 bucket.
Now, as I am trying to list all the folders in a path recursively, I am not using the withDelimiter() function.
All the so-called folder names should end with / and this is my logic to list all the folders and sub-folders.
Here's the scala code (Intentionally not pasting the catch code here):
val awsCredentials = new BasicAWSCredentials(awsKey, awsSecretKey)
val client = new AmazonS3Client(awsCredentials)

def listFoldersRecursively(bucketName: String, fullPath: String): List[String] = {
  try {
    val objects = client.listObjects(bucketName).getObjectSummaries
    val listObjectsRequest = new ListObjectsRequest()
      .withPrefix(fullPath)
      .withBucketName(bucketName)
    val folderPaths = client
      .listObjects(listObjectsRequest)
      .getObjectSummaries()
      .map(_.getKey)
    folderPaths.filter(_.endsWith("/")).toList
  }
}
Here's the structure of my bucket through an s3 client
Here's the list I am getting using this scala code
Without any apparent pattern, many folders are missing from the list of retrieved folders.
I did not use
client.listObjects(listObjectsRequest).getCommonPrefixes.toList
because it was returning an empty list for some reason.
P.S: Couldn't add photos in post directly because of being a new user.
Without any apparent pattern, many folders are missing from the list of retrieved folders.
Here's your problem: you are assuming there should always be objects with keys ending in / to symbolize folders.
This is an incorrect assumption. They will only be there if you created them, either via the S3 console or the API. There's no reason to expect them, as S3 doesn't actually need them or use them for anything, and the S3 service does not create them spontaneously, itself.
If you use the API to upload an object with key foo/bar.txt, this does not create the foo/ folder as a distinct object. It will appear as a folder in the console for convenience, but it isn't there unless at some point you deliberately created it.
Of course, the only way to upload such an object with the console is to "create" the folder unless it already appears -- but appears in the console does not necessarily equate to exists as a distinct object.
Filtering on endsWith("/") is invalid logic.
This is why the underlying API includes CommonPrefixes with each ListObjects response if delimiter and prefix are specified. This is a list of the next level of "folders", which you have to recursively drill down into in order to find the next level.
If you specify a prefix, all keys that contain the same string between the prefix and the first occurrence of the delimiter after the prefix are grouped under a single result element called CommonPrefixes. If you don't specify the prefix parameter, the substring starts at the beginning of the key. The keys that are grouped under the CommonPrefixes result element are not returned elsewhere in the response.
https://docs.aws.amazon.com/AmazonS3/latest/API/RESTBucketGET.html
You need to access this functionality with whatever library you are using, or you need to iterate the entire list of keys and discover the actual common prefixes on / boundaries using string splitting.
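For reference, a minimal Java sketch of the delimiter-based call on the same v1 SDK client the Scala code uses; each returned prefix is one immediate "sub-folder", which you would then recurse into:
// Needs: com.amazonaws.services.s3.model.ListObjectsRequest, ObjectListing
ListObjectsRequest request = new ListObjectsRequest()
        .withBucketName(bucketName)
        .withPrefix(fullPath)          // e.g. "some/path/"
        .withDelimiter("/");
ObjectListing listing = client.listObjects(request);
for (String prefix : listing.getCommonPrefixes()) {
    System.out.println(prefix);        // one entry per immediate sub-folder
}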
Well, in case someone faces the same problem in future, here is the alternative logic I used, as suggested by #Michael above: I iterated through all the keys and split each one at the last occurrence of /. The first part of the split, with / appended, was the key of a folder, and I appended it to another list. At the end, I returned the distinct entries of the list I had been appending to. This gave me all the folders and sub-folders under a certain prefix location.
Note that I didn't use CommonPrefixes because I wasn't using any delimiter, and that's because I didn't want the list of folders at a certain level but instead wanted to recursively get all the folders and sub-folders.
def listFoldersRecursively(bucketName: String, fullPath: String): List[String] = {
  try {
    val listObjectsRequest = new ListObjectsRequest()
      .withPrefix(fullPath)
      .withBucketName(bucketName)
    val folderPaths = client.listObjects(listObjectsRequest)
      .getObjectSummaries()
      .map(_.getKey)
      .toList
    // Derive the "folder" part of every key by cutting it at the last "/"
    val foldersList: ArrayBuffer[String] = ArrayBuffer()
    for (folderPath <- folderPaths) {
      val split = folderPath.splitAt(folderPath.lastIndexOf("/"))
      if (!split._1.equals(""))
        foldersList += split._1 + "/"
    }
    foldersList.toList.distinct
  } // catch block intentionally omitted
}
P.S.: The catch block is intentionally missing due to irrelevancy.
The listObjects function (and others) paginates, returning up to 1,000 keys per response.
From the doc:
Because buckets can contain a virtually unlimited number of keys, the
complete results of a list query can be extremely large. To manage
large result sets, Amazon S3 uses pagination to split them into
multiple responses. Always check the ObjectListing.isTruncated()
method to see if the returned listing is complete or if additional
calls are needed to get more results. Alternatively, use the
AmazonS3Client.listNextBatchOfObjects(ObjectListing) method as an easy
way to get the next page of object listings.
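A rough Java sketch of that paging loop with the v1 SDK, reusing the listObjectsRequest idea from the code above:
// Needs: java.util.ArrayList, java.util.List,
// com.amazonaws.services.s3.model.ObjectListing, S3ObjectSummary
List<String> keys = new ArrayList<>();
ObjectListing listing = client.listObjects(listObjectsRequest);
while (true) {
    for (S3ObjectSummary summary : listing.getObjectSummaries()) {
        keys.add(summary.getKey());
    }
    if (!listing.isTruncated()) {
        break;                                        // last page reached
    }
    listing = client.listNextBatchOfObjects(listing); // fetch the next page
}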
I am new to Java and trying to figure out how to combine several TreeMaps into a table.
I have a Java program that reads a text file and creates a TreeMap indexing the words in the file. The output has individual words as the key and the list of pages each word appears on as the value. An example looks like this:
a 1:4:7
b 1:7
d 2
My program currently creates a thread for each of several text files and builds a TreeMap per file. I would like to combine these TreeMaps into one output. So say we have a second text file that looks like this:
a 1:2:4
b 3
c 7
The final output I am trying to create is a csv table that looks like this:
key,file1,file2
a,1:4:7,1:2:4
b,1:7,3
c,,7
d,2,
Is there a method to combine maps like this? I am primarily a SQL developer, so my idea was to print each map to a text file along with the file name and then pivot this list based on the file name. That didn't seem like a very Java-like way to approach the problem, though.
I think you need to do it manually.
I didn't compile my solution and it doesn't write to a CSV file, but it should give you a hint:
public void writeCsv(List<TreeMap<String, String>> list) {
    // ids will store every unique key across the maps; in your example: a, b, c, d
    Set<String> ids = new TreeSet<String>();
    for (TreeMap<String, String> m : list) {
        ids.addAll(m.keySet());
    }
    // iterate ids [a, b, c, d] and build one CSV line per key
    for (String id : ids) {
        StringBuilder line = new StringBuilder();
        line.append(id);
        for (TreeMap<String, String> m : list) {
            // pages will contain "1:4:7" like your example,
            // or null if this file doesn't contain the key; emit an empty column then
            String pages = m.get(id);
            line.append(",");
            line.append(pages == null ? "" : pages);
        }
        System.out.println(line);
    }
}
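A hypothetical driver using the two indexes from the question (call it from the class that defines writeCsv); print the header row yourself and you get the CSV layout you described:
// Needs: java.util.Arrays, java.util.TreeMap
TreeMap<String, String> file1 = new TreeMap<>();
file1.put("a", "1:4:7");
file1.put("b", "1:7");
file1.put("d", "2");

TreeMap<String, String> file2 = new TreeMap<>();
file2.put("a", "1:2:4");
file2.put("b", "3");
file2.put("c", "7");

System.out.println("key,file1,file2");      // header row
writeCsv(Arrays.asList(file1, file2));      // a,1:4:7,1:2:4 ... d,2,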