I am working with Hadoop 0.19 on openSUSE Linux. I am not using any cluster; rather, I am running my Hadoop code on my machine itself. I am following the standard technique for putting files in the distributed cache, but instead of accessing the files from the distributed cache again and again, I stored the contents of the file in an array. This extraction from the file is done in the configure() function. I am getting a NullPointerException when I try to use the file name. This is the part of the code:
// ...part of main()
DistributedCache.addCacheFile(new URI("/home/hmobile/hadoop-0.19.2/output/part-00000"), conf2);
DistributedCache.addCacheFile(new URI("/home/hmobile/hadoop-0.19.2/output/part-00001"), conf2);
// ...part of the mapper
public void configure(JobConf conf2)
{
    String wrd;
    String line;
    try {
        localFiles = DistributedCache.getLocalCacheFiles(conf2);
        System.out.println(localFiles[0].getName()); // error: NullPointerException
    } catch (IOException ex) {
        Logger.getLogger(blur2.class.getName()).log(Level.SEVERE, null, ex);
    }
    for (Path f : localFiles) // error: NullPointerException
    {
        if (!f.getName().endsWith("crc"))
        {
            BufferedReader br = null;
            try {
                br = new BufferedReader(new FileReader(f.toString()));
                // ... (rest of the method reads the file contents into the array)
Can such processing not be done in configure()?
This will depend on whether you are using the local job runner (mapred.job.tracker=local) or running in pseudo-distributed mode (i.e. mapred.job.tracker=localhost:8021 or =mynode.mydomain.com:8021). The distributed cache does NOT work in local mode, only in pseudo-distributed and fully distributed modes.
Using the distributed cache in configure() is fine, otherwise.
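If you are not sure which mode the job ended up in, a quick guard in configure() makes the failure explicit instead of an NPE. This is only a minimal sketch against the old 0.19 JobConf API; the log message and the early return are my own additions:

public void configure(JobConf conf2) {
    try {
        // "local" means the local job runner, where the distributed cache is not populated
        String tracker = conf2.get("mapred.job.tracker", "local");
        Path[] localFiles = DistributedCache.getLocalCacheFiles(conf2);
        if ("local".equals(tracker) || localFiles == null) {
            System.err.println("Distributed cache not available (mapred.job.tracker=" + tracker + ")");
            return;
        }
        // safe to iterate over localFiles here
    } catch (IOException ex) {
        ex.printStackTrace();
    }
}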
Related
I have a problem with saving files and then downloading them after generating a .war file.
I need to handle the generation of many files after the admin presses a button in the application. The files are generated partly from data sent via a POST request and partly from the database.
There are hundreds or thousands of files, so it is impossible to do this manually. The admin generates files from time to time, and users should be able to download these files from the application.
When I run the application in IntelliJ, the app has access to the folders on disk, so the following code works:
(part of the backend class, responsible for saving files to a path)
private void saveTextToFile(String text, String fileName) {
    String filePathAndName = "/static/myFiles/" + fileName + ".txt";
    ClassLoader classLoader = getClass().getClassLoader();
    File file = new File(classLoader.getResource(".").getFile() + filePathAndName);
    FileWriter fileWriter = null;
    try {
        fileWriter = new FileWriter(file);
        PrintWriter printWriter = new PrintWriter(fileWriter);
        printWriter.print(text);
        printWriter.close();
    } catch (IOException e) {
        e.printStackTrace();
    }
}
The file was saved in folder:
C:\Users...\myProject\target\classes\static.
(and this is the link to the generated file in Thymeleaf)
<html xmlns:th="http://www.thymeleaf.org">
<a th:href="@{|/myFiles/${thisIsMyFileName}|}">Download file</a>
</html>
Unfortunately, when I generate the .war file and run it, the files are not saved in the application's "resources" folder. As a result, the user cannot download this file via the link generated by thymeleaf.
In general, you do not want to upload anything into your application's files - it opens you to many security problems if someone figures out how to overwrite parts of the application, and in most application servers, it is simply not writable.
A much better approach is to have a designated server folder where you can write things. For example, you could have the following in your configuration:
myapp.base-folder = /any/server/folder/you/want
And then, in the code, you would find that folder as follows:
// env is an @Autowired private Environment
File baseFolder = new File(env.getProperty("myapp.base-folder"));
I find this better than using a database (as @Stultuske suggested in the comments), because databases are great for relations but mostly overkill for actual files. Files can be accessed externally without firing up the database and with minimal hassle, and keeping them separate makes your database much easier to back up.
To generate links to the file, simply create a link as you would for any other type of request:
<a th:href="@{|/file/${fileId}|}">Download file</a>
-- and handle it on the server by returning the contents of the file:
@GetMapping(value = "/file/{id}")
public StreamingResponseBody getFile(@PathVariable long id) throws IOException {
    File f = new File(baseFolder, "" + id); // numerical id prevents filesystem traversal
    InputStream in;
    if (f.exists()) {
        in = new BufferedInputStream(new FileInputStream(f));
    } else {
        // you could also signal the error by returning a 404
        in = new BufferedInputStream(getClass().getClassLoader()
                .getResourceAsStream("static/img/unknown-id.jpg"));
    }
    return new StreamingResponseBody() {
        @Override
        public void writeTo(OutputStream os) throws IOException {
            FileCopyUtils.copy(in, os);
        }
    };
}
I prefer numerical IDs to avoid hassles with path traversal - but you can easily use string filenames instead, and deal with the security issues by carefully checking that the canonical path of the requested file starts with the canonical path of your baseFolder.
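If you do go with string filenames, a minimal sketch of that canonical-path check might look like this (baseFolder as above; requestedName and the exception type are illustrative assumptions):

File requested = new File(baseFolder, requestedName);
String canonicalBase = baseFolder.getCanonicalPath() + File.separator;
if (!requested.getCanonicalPath().startsWith(canonicalBase)) {
    // the request tried to escape the base folder, e.g. with "../" segments
    throw new IllegalArgumentException("Invalid file name: " + requestedName);
}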
I was trying to run my Selenium automation code using Java on a Tomcat server. It works fine when I run it using javac, but when it runs on Tomcat as a jar, it shows "com.google.common.base.Preconditions.checkState(ZLjava/lang/String;Ljava/lang/Object;Ljava/lang/Object;Ljava/lang/Object;)V" in the log. Here my Selenium Chrome driver is placed on the desktop of my local machine and the path is defined (Tomcat is also a local server).
I would go with a buffered file reader like this:
public static void main(String[] args) {
    File f = new File("data.txt");
    try (BufferedReader b = new BufferedReader(new FileReader(f))) {
        String readLine;
        while ((readLine = b.readLine()) != null) {
            if (readLine.contains("WORD"))
                System.out.println("Found WORD in: " + readLine);
        }
    } catch (IOException e) {
        e.printStackTrace();
    }
}
where "WORD" is the word you are searching for.
The advantage of a BufferedReader is that it reads ahead to reduce the number of I/O roundtrips - or as they put it in the JavaDoc: "Reads text from a character-input stream, buffering characters so as to provide for the efficient reading of characters, arrays, and lines."
FileChannel is a slightly newer invention, arriving in the NIO with Java 1.4. It might perform better than the BufferedReader - but I also find it a lot more low-level in its API, so unless you have very special performance requirements, I would leave the readahead/buffering to BufferedReader and FileReader.
You can also say that BufferedReader is "line oriented" whereas FileChannel is "byte oriented".
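For comparison, a byte-oriented read with FileChannel might look like the sketch below (my own illustration, not tuned; the buffer size is arbitrary and IOException handling is left to the caller):

try (FileChannel channel = new FileInputStream("data.txt").getChannel()) {
    ByteBuffer buffer = ByteBuffer.allocate(8192); // arbitrary buffer size
    while (channel.read(buffer) != -1) {
        buffer.flip();
        // buffer now holds raw bytes; decoding them to characters and splitting them
        // into lines is up to you, which is the extra work BufferedReader.readLine() hides
        buffer.clear();
    }
}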
I like the BufferedReader from java.io with a FileReader the most:
https://docs.oracle.com/javase/7/docs/api/java/io/FileReader.html
https://docs.oracle.com/javase/7/docs/api/java/io/BufferedReader.html
https://www.mkyong.com/java/how-to-read-file-from-java-bufferedreader-example/
It is easy to use and has most functions. But your file must be character-based to use it (like a text file).
I am trying to optimize a method, writeZipResults, which takes a map of ByteArrayOutputStreams and converts them into a single zip file written to a ZipOutputStream.
method definition:
public void writeZipResults(Map<String, ByteArrayOutputStream> files, OutputStream outputStream) throws IOException {
    Stopwatch writeZipResultsTimer = Stopwatch.createStarted();
    if (files != null) {
        ZipOutputStream zip = new ZipOutputStream(outputStream);
        for (String filename : files.keySet()) {
            ByteArrayOutputStream pdfInMemory = files.get(filename);
            if (pdfInMemory != null) {
                ZipEntry entry = new ZipEntry(filename + fileTypes.getExtension());
                zip.putNextEntry(entry);
                pdfInMemory.writeTo(zip);
                zip.closeEntry();
                pdfInMemory.close();
            }
        }
        zip.close();
        logger.info("Took {} ms to complete writeZipResults method", writeZipResultsTimer.elapsed(TimeUnit.MILLISECONDS));
    }
}
To optimize the above method I added zip.setLevel(0), i.e. no compression, which reduced the method execution time to a great extent on my local Windows system.
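For reference, the change amounts to the two lines below; Deflater.NO_COMPRESSION is the named constant for level 0, and wrapping the target stream in a BufferedOutputStream is an extra assumption on my part, not part of the original method:

ZipOutputStream zip = new ZipOutputStream(new BufferedOutputStream(outputStream));
zip.setLevel(Deflater.NO_COMPRESSION); // same effect as zip.setLevel(0): entries are stored uncompressed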
But when I run the same code with zip.setLevel(0) in a Linux environment, I am not getting the same performance as under Windows.
To illustrate my point, I compared the application logs from Linux and from my local Windows system, captured for the same scenario with exactly the same data set.
To add more information:
Java Version: 7
Use case: for a set of attributes, create a PDF file for each attribute, combine all the PDF files into a zip file, and return it in the HTTP response. The whole file creation process is in memory.
Please suggest how to optimize the method in the Linux environment.
I am trying to implement a MapReduce job that processes a large text file (as a lookup file) in addition to the actual dataset (input). The lookup file is more than 2 GB.
I tried to load the text file as a third argument, but I got a Java heap space error.
After doing some searching, it was suggested to use the distributed cache. This is what I have done so far.
First, I used this method to read the lookup file:
public static String readDistributedFile(Context context) throws IOException {
    URI[] cacheFiles = context.getCacheFiles();
    Path path = new Path(cacheFiles[0].getPath().toString());
    FileSystem fs = FileSystem.get(new Configuration());
    StringBuilder sb = new StringBuilder();
    BufferedReader br = new BufferedReader(new InputStreamReader(fs.open(path)));
    String line;
    while ((line = br.readLine()) != null) {
        // split line
        sb.append(line);
        sb.append("\n");
    }
    br.close();
    return sb.toString();
}
Second, in the Mapper:
protected void setup(Context context) throws IOException, InterruptedException {
    super.setup(context);
    String lookUpText = readDistributedFile(context);
    // do something with the text
}
Third, to run the job
hadoop jar mapReduceJob.jar the.specific.class -files ../LargeLookUpFileInStoredLocally.txt /user/name/inputdataset/*.gz /user/name/output
But the problem is that the job takes a long time to load.
Maybe it was not a good idea to use the distributed cache, or maybe I am missing something in my code.
I am working with Hadoop 2.5.
I have already checked some related questions such as [1].
Any ideas will be great!
[1] Hadoop DistributedCache is deprecated - what is the preferred API?
The distributed cache is mostly used to move files that are needed by the MapReduce tasks on the task nodes and are not part of the job jar.
Another usage is when performing joins that include a big and a small data set: rather than using multiple input paths, we use a single (big) input file, fetch the small file from the distributed cache, and then compare (or join) the two data sets, as sketched below.
The reason for the extra time in your case is that you are trying to read the entire 2 GB file before the map tasks start (since the read is done in the setup method).
Can you give the reason why you are loading the huge 2 GB file using the distributed cache?
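To illustrate the join usage mentioned above, here is a minimal sketch (my own, for the Hadoop 2.x new API) of loading a small cached file into a map in setup() so that map() can probe it per record; the tab-separated key/value layout is an assumption:

private final Map<String, String> lookup = new HashMap<String, String>();

@Override
protected void setup(Context context) throws IOException, InterruptedException {
    URI[] cacheFiles = context.getCacheFiles();
    FileSystem fs = FileSystem.get(context.getConfiguration());
    try (BufferedReader br = new BufferedReader(
            new InputStreamReader(fs.open(new Path(cacheFiles[0].getPath()))))) {
        String line;
        while ((line = br.readLine()) != null) {
            String[] parts = line.split("\t", 2); // assumed tab-separated key/value layout
            if (parts.length == 2) {
                lookup.put(parts[0], parts[1]);
            }
        }
    }
}
// map() can then call lookup.get(key) for each input record instead of re-reading the file.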
I have a problem with the DistributedCache in the new Hadoop 2.x API. I found some people working around this issue, but their example does not solve my problem.
This solution does not work for me, because I get a NullPointerException when trying to retrieve the data from the DistributedCache.
My Configuration is as follows:
Driver
public int run(String[] arg) throws Exception {
    Configuration conf = this.getConf();
    Job job = new Job(conf, "job Name");
    ...
    job.addCacheFile(new URI(arg[1]));
Setup
protected void setup(Context context) throws IOException, InterruptedException {
    Configuration conf = context.getConfiguration();
    URI[] cacheFiles = context.getCacheFiles();
    BufferedReader dtardr = new BufferedReader(new FileReader(cacheFiles[0].toString()));
Here, when it starts creating the buffered reader, it throws the NullPointerException; this happens because context.getCacheFiles() always returns null. How can I solve this problem, and where are the cache files stored (HDFS, or the local file system)?
If you use the local JobRunner in Hadoop (non-distributed mode, as a single Java process), then no local data directory is created; the getLocalCacheFiles() or getCacheFiles() call will return an empty set of results. Can you make sure that you are running your job in distributed or pseudo-distributed mode?
The Hadoop framework will copy the files set in the distributed cache to the local working directory of each task in the job.
There are copies of all cached files placed in the local file system of each worker machine (they will be in a subdirectory of mapred.local.dir).
You can refer to this link for understanding more about the DistributedCache.
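If it helps, one common pattern once the job runs in pseudo-distributed or distributed mode (a sketch of my own, not part of the answer above) is to attach a symlink name to the cached URI in the driver and then open it as a plain local file from the task's working directory:

// Driver: the "#lookup" fragment becomes the symlink name in each task's working directory
job.addCacheFile(new URI(arg[1] + "#lookup"));

// Mapper setup(): the symlinked file can be opened directly by that name
BufferedReader dtardr = new BufferedReader(new FileReader("lookup"));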