Hadoop Distributed Cache to process large look up text file - java

I am trying to implement a MapReduce job that processes a large text file (as a look up file) in addition to the actual dataset (input). the look up file is more than 2GB.
I tried to load the text file as a third argument as follows:
but I got Java Heap Space Error.
After doing some search, it is suggested to use Distributed Cache. this is what I have done so far
First, I used this method to read the look up file:
public static String readDistributedFile(Context context) throws IOException {
URI[] cacheFiles = context.getCacheFiles();
Path path = new Path(cacheFiles[0].getPath().toString());
FileSystem fs = FileSystem.get(new Configuration());
StringBuilder sb = new StringBuilder();
BufferedReader br = new BufferedReader(new InputStreamReader(fs.open(path)));
String line;
while ((line = br.readLine()) != null) {
// split line
sb.append(line);
sb.append("\n");
}
br.close();
return sb.toString();
}
Second, In the Mapper:
protected void setup(Context context)
throws IOException, InterruptedException {
super.setup(context);
String lookUpText = readDistributedFile(context);
//do something with the text
}
Third, to run the job
hadoop jar mapReduceJob.jar the.specific.class -files ../LargeLookUpFileInStoredLocally.txt /user/name/inputdataset/*.gz /user/name/output
But the problem is that the job is taking long time to be load.
May be it was not a good idea to use the distributed cache or may be I am missing something in my code.
I am working with Hadoop 2.5.
I have already checked some related questions such as [1].
Any ideas will be great!
[1] Hadoop DistributedCache is deprecated - what is the preferred API?

Distributed cache is mostly used to move the files which are needed by Map reduce at the task nodes and are not part of jar.
Other usage is when performing joins which include a big and small data set, so that, rather than using Multiple input paths, we use a single input(big) file, and get the other small file using distributed cache and then compare(or join) both the data sets.
The reason for more time in your case is because you are trying to read entire 2 gb file before the map reduce starts(as it is started in setup method).
Can you give the reason why you are loading the the huge 2gb file using distributed cache.

Related

Reading a file in order in Spark

My Spark Master needs to read a file in order. Here is what I am trying to avoid (in pseudocode):
if file-path starts with "hdfs://"
Read via HDFS API
else
Read via native FS API
I think the following would do the trick, letting Spark deal with distinguishing between local/HDFS:
JavaSparkContext sc = new JavaSparkContext(new SparkConf());
List<String> lines = sc.textFile(path).collect();
Is it safe to assume that lines will be in order; i.e. that lines.get(0) is the first line of the file, lines.get(1) is the second line; etc?
If not, any suggestions on how to avoid explicitly checking FS type?

Regarding stitching of multiple files into a single file

I work on query latencies and have a requirement where I have several files which contain data. I want to aggregate this data into a single file. I use a naive technique where I open each file and collect all the data in a global file. I do this for all the files but this is time taking. Is there a way in which you can stitch the end of one file to the beginning of another and create a big file containing all the data. I think many people might have faced this problem before. Can anyone kindly help ?
I suppose you are currently doing the opening and appending by hand; otherwise I do not know why it would take a long time to aggregate the data, especially since you describe the amount of files using multiple and several which seem to indicate it's not an enormous number.
Thus, I think you are just looking for a way to automatically to the opening and appending for you. In that case, you can use an approach similar to below. Note this creates the output file or overwrites it if it already exists, then appends the contents of all specified files. If you want to call the method multiple times and append to the same file instead of overwriting an existing file, an alternative is to use a FileWriter instead with true as a second argument to its constructor so it will append to an existing file.
void aggregateFiles(List<String> fileNames, String outputFile) {
PrintWriter writer = null;
try {
writer = new PrintWriter(outputFile);
for(String fileName : fileNames) {
Path path = Paths.get(fileName);
String fileContents = new String(Files.readAllBytes(path));
writer.println(fileContents);
}
} catch(IOException e) {
// Handle IOException
} finally {
if(writer != null) writer.close();
}
}
List<String> files = new ArrayList<>();
files.add("f1.txt");
files.add("someDir/f2.txt");
files.add("f3.txt");
aggregateFiles(files, "output.txt");

BufferedReader was never closed, but file was able to delete

Recently, I reviewed our application code, and I found one issue in our code.
/**
* truncate cat tree(s) from the import file
*/
private void truncateCatTreesInFile(File file, String userImplCode) throws Exception
{
String rowStr = null, treeCode = null;
BufferedReader reader = new BufferedReader(new FileReader(file));
rowStr = reader.readLine(); // skip 1st row - header
Impl impl;
List<String> row = null;
Set<String> truncatedTrees = new HashSet<String>();
while ((rowStr = reader.readLine()) != null)
{
row = CrudServiceHelper.getRowFromFile(rowStr);
if (row == null) continue;
impl = getCatImportImpl(row.get(ECatTreeExportImportData.IMPL.getIndex()), userImplCode);
treeCode = row.get(ECatTreeExportImportData.TREE_CODE.getIndex());
if(truncatedTrees.contains(treeCode)) continue;
truncatedTrees.add(treeCode);
CatTree catTree = _treeDao.findByCodeAndImpl(treeCode, impl.getId());
if(catTree!= null) _treeDao.makeTransient(catTree);
}
_treeDao.flush();
}
Looking at the above code, the "reader" was never closed, I was thinking it could be an issue, but actually, it just works fine, the file is able to delete by tomcat.
javax.servlet.context.tempdir>
[java] 2013-03-27 17:45:54,285 INFO [org.apache.struts2.dispatcher.Dispatch
er] -
Basically, what I am trying to do is uploading one file from browser, and generate sql based on the file to insert data into our database. After all done, delete the file.
I am surprised this code works fine, does anybody have an idea here? I tried to google it, but I did not get any idea.
Thanks,
Jack
Not closing a reader may result in a resource leak. Deleting an open file may still be perfectly fine.
Under Linux (and other Unix variants) deleting a file if just unlinking a name from it. A file without any names left gets actually freed. So opening a file, deleting it (removing its name) and then reading and writing to it is a well-known way to obtain a temporary file. Once the file is closed, the space is freed, but not earlier.
Under Windows, certain programs lock files they read, this prevents other processes from removing such a file. But not all programs do so. I don't have a Windows machine around to actually test how does Java handle this.
The fact that the code does not crash does not mean that the code works completely correctly. The problem you noticed might become visible only much later, if the app just consumes more and more RAM due to the leak. This is unlikely, though: the garbage collector will eventually close readers, and probably soon enough, because reader is local and never leaks out of the method.

Read a CSV file in Java

I wrote a programm that reads a csv file and puts it into a TableModel. My problem is that I want to expand the programm so, that if the csv file gets changes from outside my tablemodel gets updated and gets the new values.
I would now programm a scheduler so that the thread sleeps for about a minute and checks it every minute if the timestamp of the file changed. If so it would read the file again. But i dont know what happens to the whole programm if i use a scheduler because this little software i write will be a part of a much much bigger software wich is running on JDK 6. So I search for a performant and independent from the bigger software solution to get the changes in the tablemodel.
Can someone help out?
java.nio.file package now contains the Watch Service API. This, effectively:
This API enables you to register a directory (or directories) with the
watch service. When registering, you tell the service which types of
events you are interested in: file creation, file deletion, or file
modification. When the service detects an event of interest, it is
forwarded to the registered process. The registered process has a
thread (or a pool of threads) dedicated to watching for any events it
has registered for. When an event comes in, it is handled as needed.
See reference here.
Oh! This API is only available from JDK 7 (onwards).
**OpenCsv is a best way to read csv file in java.
if your are using maven then you can use below dependency or download it's jar from web.**
#SuppressWarnings({"rawtypes", "unchecked"})
public void readCsvFile() {
CSVReader csvReader;
CsvToBean csv;
File fileEntry;
try {
fileEntry = new File("path of your file");
csv = new CsvToBean();
csvReader = new CSVReader(new FileReader(fileEntry), ',', '"', 1);
List list = csv.parse(setColumMapping(), csvReader);
//List of LabReportSampleData class
} catch (IOException e) {
e.printStackTrace();
}
}
//Below function is used to map the your csv file to your mapping object.
//columns String array: The value inside your csv file. means 0 index map with degree variable in your mapping class.
#SuppressWarnings({"rawtypes", "unchecked"})
private static ColumnPositionMappingStrategy setColumMapping() {
ColumnPositionMappingStrategy strategy = new ColumnPositionMappingStrategy();
strategy.setType(LabReportSampleData.class);
String[] columns =
new String[] {"degree", "radian", "shearStress", "shearingStrain", "sourceUnit"};
strategy.setColumnMapping(columns);
return strategy;
}

Can multiple threads write data into a file at the same time?

If you have ever used a p2p downloading software, they can download a file with multi-threading, and they created only one file, So I wonder how the threads write data into that file. Sequentially or in parallel?
Imagine that you want to dump a big database table to a file, and how to make this job faster?
You can use multiple threads writing a to a file e.g. a log file. but you have to co-ordinate your threads as #Thilo points out. Either you need to synchronize file access and only write whole record/lines, or you need to have a strategy for allocating regions of the file to different threads e.g. re-building a file with known offsets and sizes.
This is rarely done for performance reasons as most disk subsystems perform best when being written to sequentially and disk IO is the bottleneck. If CPU to create the record or line of text (or network IO) is the bottleneck it can help.
Image that you want to dump a big database table to a file, and how to make this job faster?
Writing it sequentially is likely to be the fastest.
Java nio package was designed to allow this. Take a look for example at http://docs.oracle.com/javase/1.5.0/docs/api/java/nio/channels/FileChannel.html .
You can map several regions of one file to different buffers, each buffer can be filled separately by a separate thread.
The synchronized declaration enables doing this. Try the below code which I use in a similar context.
package hrblib;
import java.io.*;
public class FileOp {
static int nStatsCount = 0;
static public String getContents(String sFileName) {
try {
BufferedReader oReader = new BufferedReader(new FileReader(sFileName));
String sLine, sContent = "";
while ((sLine=oReader.readLine()) != null) {
sContent += (sContent=="")?sLine: ("\r\n"+sLine);
}
oReader.close();
return sContent;
}
catch (IOException oException) {
throw new IllegalArgumentException("Invalid file path/File cannot be read: \n" + sFileName);
}
}
static public void setContents(String sFileName, String sContent) {
try {
File oFile = new File(sFileName);
if (!oFile.exists()) {
oFile.createNewFile();
}
if (oFile.canWrite()) {
BufferedWriter oWriter = new BufferedWriter(new FileWriter(sFileName));
oWriter.write (sContent);
oWriter.close();
}
}
catch (IOException oException) {
throw new IllegalArgumentException("Invalid folder path/File cannot be written: \n" + sFileName);
}
}
public static synchronized void appendContents(String sFileName, String sContent) {
try {
File oFile = new File(sFileName);
if (!oFile.exists()) {
oFile.createNewFile();
}
if (oFile.canWrite()) {
BufferedWriter oWriter = new BufferedWriter(new FileWriter(sFileName, true));
oWriter.write (sContent);
oWriter.close();
}
}
catch (IOException oException) {
throw new IllegalArgumentException("Error appending/File cannot be written: \n" + sFileName);
}
}
}
You can have multiple threads write to the same file - but one at a time. All threads will need to enter a synchronized block before writing to the file.
In the P2P example - one way to implement it is to find the size of the file and create a empty file of that size. Each thread is downloading different sections of the file - when they need to write they will enter a synchronized block - move the file pointer using seek and write the contents of the buffer.
What kind of file is this? Why do you need to feed it with more threads? It depends on the characteristics (I don't know better word for it) of the file usage.
Transferring a file from several places over network (short: Torrent-like)
If you are transferring an existing file, the program should
as soon, as it gets know the size of the file, create it with empty content: this prevents later out-of-disk error (if there's not enough space, it will turns out at the creation, before downloading anything of it), also it helps the the performance;
if you organize the transfer well (and why not), each thread will responsible for a distinct portion of the file, thus file writes will be distinct,
even if somehow two threads pick the same portion of the file, it will cause no error, because they write the same data for the same file positions.
Appending data blocks to a file (short: logging)
If the threads just appends fixed or various-lenght info to a file, you should use a common thread. It should use a relatively large write buffer, so it can serve client threads quick (just taking the strings), and flush it out optimal scheduling and block size. It should use dedicated disk or even computer.
Also, there can be several performance issues, that's why are there logging servers around, even expensive commercial ones.
Reading and writing random time, random position (short: database)
It requires complex design, with mutexes etc., I never done this kinda stuff, but I can imagine. Ask Oracle for some tricks :)

Categories

Resources