Spark: reading local file, file should exist on all nodes? - java

I have a spark cluster with 2 machines say mach-1 and mach-2.
I write the code on my local machine, export it to a JAR, and copy it to mach-1.
Then I run the code on mach-1 using spark-submit.
The code tries to read a local file, which exists on mach-1.
It works well most of the time, but sometimes it gives me errors like "File does not exist". So I copied the file to mach-2 as well, and now the code works.
Similarly, when writing the output to a local file, it sometimes worked with the output folder present only on mach-1, but then it gave an error, so I created the output folder on mach-2 as well. Now it writes the output on both machines (some parts on mach-1 and some on mach-2).
Is this expected behavior? Any pointers to texts explaining this?
P.S.: I do not collect my RDDs before writing to the local file (I do it in a foreach). If I do collect first, the code works fine with the output folder present only on mach-1.

Your input data has to exist on every node. You can achieve this by copying the data to the nodes, or by using shared storage such as NFS or HDFS.
For your output you can likewise write to NFS or HDFS. Alternatively you can call collect(), but only do that when your dataset fits into the driver's memory; when it doesn't, use rdd.toLocalIterator() or take(n) instead.
Is it possible that you are running your code in cluster mode rather than client mode?
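In case it helps, here is a rough Java sketch of that advice (the paths and the app name are placeholders): read the input from shared storage such as HDFS, and write the local output only on the driver by pulling partitions back with toLocalIterator() instead of writing from foreach on the executors.

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Iterator;

public class WriteOnDriver {
    public static void main(String[] args) throws IOException {
        SparkConf conf = new SparkConf().setAppName("write-on-driver");
        try (JavaSparkContext sc = new JavaSparkContext(conf)) {
            // Reading from HDFS (or NFS) sidesteps the "file must exist on
            // every node" problem for the input.
            JavaRDD<String> lines = sc.textFile("hdfs:///data/input.txt");

            // toLocalIterator() streams one partition at a time back to the
            // driver, so only the driver (mach-1) needs the local output folder.
            try (BufferedWriter out = Files.newBufferedWriter(
                    Paths.get("/home/user/output/result.txt"))) {
                Iterator<String> it = lines.toLocalIterator();
                while (it.hasNext()) {
                    out.write(it.next());
                    out.newLine();
                }
            }
        }
    }
}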

Related

Deleting data files in java

To my knowledge, when you delete a file only the pointer to it is removed; the file's data is still on disk, ready to be overwritten. On Linux there is shred, which by default overwrites the file three times.
My question is: how do I write code that does the same thing as shred? I don't know which keywords to use in my search, and I have never found the source code of shred or anything like it.
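For what it's worth, here is a rough Java sketch of the basic shred idea: overwrite the file's contents with random bytes for a few passes, force each pass to disk, then delete the file. Treat it as best-effort only; journaling filesystems, SSD wear-levelling, and copies made by the OS can still leave old data behind.

import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.security.SecureRandom;

public class Shred {
    public static void shred(Path file, int passes) throws IOException {
        SecureRandom random = new SecureRandom();
        byte[] buffer = new byte[8192];
        try (RandomAccessFile raf = new RandomAccessFile(file.toFile(), "rw")) {
            long length = raf.length();
            for (int pass = 0; pass < passes; pass++) {
                raf.seek(0);
                long written = 0;
                while (written < length) {
                    random.nextBytes(buffer);
                    int chunk = (int) Math.min(buffer.length, length - written);
                    raf.write(buffer, 0, chunk);
                    written += chunk;
                }
                raf.getFD().sync(); // force this pass onto the disk
            }
        }
        Files.delete(file); // finally remove the directory entry
    }

    public static void main(String[] args) throws IOException {
        shred(Paths.get("secret.txt"), 3); // 3 passes, like shred's default
    }
}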

Running a multithreaded program sychronized very slow Java

This question is a little complicated but I will do my best to make it simple.
I have a program which I want to run multithreaded.
This is what the program does:
initializes an executable (commandline utility)
loads a file into the executable (files are provided from a data provider method)
sends commands to that executable based on the file which was loaded
parses the received responses from executable
writes results to a csv file
All this takes place in a single method.
However, when running in multithreaded mode, everything runs fine except that all the results written to the CSV file are wrong and out of order.
When I add the keyword synchronized to the method declaration and run the program with multiple threads, it works just fine:
public synchronized void run(Dataprovider data) {
...
}
However, the program then runs at the same speed as if it were single-threaded. How can I fix this? This is driving me nuts...
How can I run this program properly multithreaded?
I'm looking for ideas and/or guidance.
Edit:
"However, when running in multithreaded mode, everything runs fine except that all the results written to the CSV file are wrong and out of order."
I load a file into the executable, run some calculations on it, then save it. I then get the size in bytes (file.length) of the newly generated file. I compare the new file with the old file (the one that was loaded) and see that the new file is smaller than the old one, which is totally wrong. The new file's size is consistently 12263 bytes, which is incorrect.
Edit:
Here is the partial code that does the writing to the CSV file:
Edit:
(Code example removed for simplicity.)
"However, when running in multithreaded mode, everything runs fine except that all the results written to the CSV file are wrong and out of order."
I can make some guesses as to what you mean by this statement, but it would help if it were more specific.
Is it the case that the results are wrong because outputs from different threads are being jumbled together into the same line or even the same token within a line?
In a CSV file, records are typically separated by newline characters. Can you refactor your solution so that each thread produces a complete line and writes that line to the output all in one go?
Does your solution already do it that way? (It's not clear... there is no code in the question.)
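To illustrate the refactoring suggested above, here is a hedged Java sketch: each worker thread produces one complete CSV line, and a single writer thread is the only code that touches the output file, so lines from different threads can never be interleaved. The processFile() method is just a placeholder for the executable interaction described in the question.

import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;

public class CsvWorkers {
    private static final String POISON = "__DONE__"; // marks end of output

    public static void main(String[] args) throws Exception {
        List<String> files = Arrays.asList("a.dat", "b.dat", "c.dat"); // placeholder inputs
        BlockingQueue<String> lines = new LinkedBlockingQueue<>();
        ExecutorService pool = Executors.newFixedThreadPool(4);

        // Single writer thread: the only place that writes to the CSV file.
        Thread writer = new Thread(() -> {
            try (BufferedWriter out = Files.newBufferedWriter(Paths.get("results.csv"))) {
                String line;
                while (!(line = lines.take()).equals(POISON)) {
                    out.write(line);
                    out.newLine();
                }
            } catch (IOException | InterruptedException e) {
                throw new RuntimeException(e);
            }
        });
        writer.start();

        // Workers run in parallel and only hand finished lines to the queue.
        for (String file : files) {
            pool.submit(() -> lines.add(processFile(file)));
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.HOURS);
        lines.add(POISON); // tell the writer we are done
        writer.join();
    }

    // Placeholder for: start the executable, load the file, send commands,
    // parse the responses, and return one finished CSV line.
    private static String processFile(String file) {
        return file + ",ok";
    }
}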

Self exploding and rejaring jar file during execution

I am currently working on a program to make seating charts for my teacher's classroom. I have put all of the data files in the jar. These are read in and put into a table. After running the main function of the program, it updates the files to match the table's values. I know I need to explode the jar and then re-jar it during execution in order to edit the files, but I can't find any explanation of how to re-jar during execution. Does anyone have any ideas?
Short answer:
Put data files outside of the binary and ship together with JAR in a separate folder.
Long one:
It seems like you are approaching the problem from the wrong direction. A JAR file is something like an executable (.exe) on the Windows platform: a read-only binary containing code.
You can (although it is bad practice) put resources like data files, multimedia, etc. inside a JAR (just as you can inside an .exe). But a better solution is to place these resources outside the binary so you can swap them without recompiling/rebuilding.
If you need to modify the resources on the fly while the application is running, you basically have no choice: the data files have to live outside the binary. Once again, you'll never see a Windows .exe file modifying itself while running.
Tomasz is right that the following is bad practice, but it is possible.
The contents of the classpath are read into memory during bootstrapping; the files themselves are modifiable, but changes will not be reflected after initialisation. I would recommend putting the data into another file, separate from your class files, but if you insist on keeping them together, you could look at:
JarInputStream or ZipInputStream to read the contents of the JAR file
Get the JarEntry for the appropriate file
Read and modify the contents as you desire
JarOutputStream or ZipOutputStream to write the contents back out
Make sure you're not reading the resource through the classpath and that it's coming from a file on disk / network.
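If you do go down that road, here is a rough sketch of those steps using the Zip variants (which avoid JarInputStream's special handling of the manifest). It assumes the data lives in an entry named data/config.txt inside app.jar; both names are placeholders. Also note that whether you can replace the jar the running JVM loaded its classes from depends on the OS; on Windows the file is typically locked.

import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardCopyOption;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

public class RewriteJarEntry {
    public static void main(String[] args) throws IOException {
        Path oldJar = Paths.get("app.jar");
        Path newJar = Paths.get("app.jar.tmp");
        byte[] newData = "updated=true".getBytes(StandardCharsets.UTF_8);

        try (ZipInputStream in = new ZipInputStream(Files.newInputStream(oldJar));
             ZipOutputStream out = new ZipOutputStream(Files.newOutputStream(newJar))) {
            ZipEntry entry;
            byte[] buf = new byte[8192];
            while ((entry = in.getNextEntry()) != null) {
                out.putNextEntry(new ZipEntry(entry.getName()));
                if (entry.getName().equals("data/config.txt")) {
                    out.write(newData);           // replace this entry's contents
                } else {
                    int n;
                    while ((n = in.read(buf)) != -1) {
                        out.write(buf, 0, n);     // copy the entry unchanged
                    }
                }
                out.closeEntry();
            }
        }
        // Swap the rewritten jar into place.
        Files.move(newJar, oldJar, StandardCopyOption.REPLACE_EXISTING);
    }
}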

PHP synchronization

I'm unsure of the best solution for this but this is what I've done.
I'm using PHP to look into a directory that contains zip files.
These zip files contain text files that are to be loaded into an oracle database through SqlLoader (sqlldr).
I want to be able to start more than one PHP process via the command line to load these zip files into the db.
If other 'php loader' processes are running, they shouldn't overlap and try to load the same zip file. I know I could start one process and let it process each zip file but I'd rather start up a new process for incoming zip files so I can load concurrently.
Right now, I've created a class that will 'lock' a zip file, a directory, or a generic text file by creating a file called 'filename.ext.lock'. Other processes that start up will check whether a file has been 'locked' this way; if it has, they will skip that file and move on to another one for processing.
I've made a class that uses a directory and creates 'process id' files so that each PHP process has an id it can use for logging purposes and for identifying which PHP process has locked the file.
I'm on a windows machine and it isn't in the plan to make this an ubuntu machine, for those of you that might suggest pcntl.
What other solutions do you see? I know that this isn't truly synchronized, because a lock file might be about to be created, a context switch occurs, and another PHP process 'locks' the file before the first one can create its lock file.
Can you please provide me with some ideas about how I can make this solution better? A java implementation? Erlang?
Also forgot to mention, the PHP process connects to the DB to fetch metadata about the files that it is going to load via SqlLoader. I don't think that is important but just in case.
Quick note: I'm aware that sqlldr locks the table it is loading and that if multiple processes try to load into the same table it will become a bottleneck. To alleviate this, I plan on making a directory that will contain files named after the tables currently being loaded. After a table has finished loading, the respective file will be deleted, and other processes will check that it is safe to load that table.
Extra information : I'm using 7zip to unzip the files and php's exec to perform these commands.
I'm using exec to call sqlldr as well.
The zip files can be huge (1 GB) and loading one table can take up to an hour.
Rather than creating a .lock file, you can just rename the zip file when a loader starts to process it, e.g. to "foobar.zip.bar"; renaming should be faster than creating a new file on disk.
But this doesn't ensure that the next loader starts only after the rename, so you should at least have some control over launching new loaders in another script.
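The claim-by-rename idea looks roughly like this (sketched in Java, since the rest of this page is Java-centric; PHP's rename() behaves the same way on a single filesystem): only one process can successfully rename a given zip, so whichever loader wins the rename owns the file and everyone else just moves on.

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardCopyOption;

public class ClaimByRename {
    // Try to claim a zip for this loader; returns the claimed path or null.
    static Path tryClaim(Path zip, String loaderId) {
        Path claimed = zip.resolveSibling(zip.getFileName() + "." + loaderId);
        try {
            // The rename either succeeds for exactly one process or throws,
            // because the source file disappears as soon as someone wins.
            return Files.move(zip, claimed, StandardCopyOption.ATOMIC_MOVE);
        } catch (IOException alreadyClaimedOrGone) {
            return null; // another loader got it first; try the next zip
        }
    }

    public static void main(String[] args) {
        Path zip = Paths.get("incoming/batch1.zip"); // placeholder path
        Path mine = tryClaim(zip, "loader-42");
        if (mine != null) {
            System.out.println("Processing " + mine);
            // ...unzip here and hand the text files to sqlldr...
        }
    }
}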
Also, just a side suggestion: it's possible to emulate threading in PHP using cURL, so you might want to try that out.
https://web.archive.org/web/20091014034235/http://www.ibuildings.co.uk/blog/archives/811-Multithreading-in-PHP-with-CURL.html
I don't know if I understand correctly, but I have a suggestion: create the lock files with a priority prefix.
Example:
10-script.php started
20-script.php started (enters a loop waiting for a 10-foobar.ext.lock)
while 10-foobar.ext.lock has not been generated by 10-script.php, it keeps waiting
30-script.php will have to wait for 10-foobar.ext.lock and 20-example.ext.lock
I tried to find pcntl_fork with cygwin, but found nothing that works

Java - How to find that the user has changed the configuration file?

I am developing a Java desktop application. The app needs a configuration to start, so I want to ship a defaultConfig.properties or defaultConfig.xml file with the application; if the user doesn't select any configuration, the application will start using the defaultConfig file.
But I am afraid my application will crash if the user accidentally edits the defaultConfig file. So is there any mechanism through which I can check, before the application starts, whether the config file has been changed?
How do other applications out in the market deal with this kind of situation, where the application depends on a configuration file?
If the user edits the config file accidentally or intentionally, the application won't run any more unless they re-install it.
I agree with David in that using a MD5 hash is a good and simple way to accomplish what you want.
Basically you would use the MD5 hashing code provided by the JDK (or elsewhere) to generate a hash code from the default data in Config.xml, and save that hash code to a file (or hard-code it into the function that does the checking). Then, each time your application starts, load the saved hash code, load Config.xml, and generate a hash code from it again. Compare the saved hash code with the one generated from the loaded config file: if they are the same, the data has not changed; if they differ, the data has been modified.
However, as others have suggested, if the file should not be editable by the user, you should consider storing the configuration in a way the user cannot easily edit. The easiest thing I can think of is to wrap the OutputStream you use to write Config.xml in a GZIPOutputStream. Not only will this make it difficult for the user to edit the configuration file, it will also make Config.xml take up less space.
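A minimal sketch of that check, assuming the expected hash of the shipped default file is stored alongside the application in a file such as defaultConfig.md5 (both file names here are placeholders):

import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.security.MessageDigest;

public class ConfigIntegrityCheck {

    // Compute the MD5 digest of a file and return it as a hex string.
    static String md5Hex(Path file) throws Exception {
        MessageDigest md = MessageDigest.getInstance("MD5");
        byte[] digest = md.digest(Files.readAllBytes(file));
        StringBuilder hex = new StringBuilder();
        for (byte b : digest) {
            hex.append(String.format("%02x", b));
        }
        return hex.toString();
    }

    public static void main(String[] args) throws Exception {
        String expected = new String(
                Files.readAllBytes(Paths.get("defaultConfig.md5")), "UTF-8").trim();
        String actual = md5Hex(Paths.get("defaultConfig.properties"));
        if (!expected.equalsIgnoreCase(actual)) {
            System.err.println("defaultConfig.properties was modified; "
                    + "falling back to built-in defaults.");
            // ...load hard-coded defaults here instead of the edited file...
        }
    }
}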
I am not at all sure that this is a good approach, but if you want to go ahead with it, you can compute a hash of the configuration file (say MD5) and recompute and compare it every time the app starts.
Come to think of it, if the user is forbidden to edit the file, why expose it at all? Stick it in a jar file, for example, far away from the user's eyes.
If the default configuration is not supposed to be edited, perhaps you don't really want to store it in a file in the first place? Could you not store the default values of the configuration in the code directly?
Remove write permissions for the file. This way the user gets a warning before trying to change the file.
Add a hash or checksum and verify this before loading file
For added security, you can replace the simple hash with a cryptographic signature.
From what I have found online so far, there seem to be several different approaches code-wise, and none appears to be a hundred percent fix. For example:
"The DirectoryWatcher implements AbstractResourceWatcher to monitor a specified directory."
Code found here: twit88.com, develop-a-java-file-watcher.
"One problem encountered was: if I copy a large file from a remote network source to the local directory being monitored, that file will still show up in the directory listing before the network copy has completed. If I try to do almost anything non-trivial to the file at that moment, like move it to another directory or open it for writing, an exception will be thrown, because the file is not really completely there yet and the OS still has a write lock on it."
Found on the same site, further down.
How the program works: it accepts a ResourceListener class, which is FileListener. If a change is detected, an onAdd, onChange, or onDelete event is raised and the affected file is passed along.
I will keep searching for more solutions.
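For completeness, the JDK itself has shipped a directory-watching API since Java 7 (java.nio.file.WatchService). Below is a minimal sketch of it as an alternative to the third-party DirectoryWatcher quoted above; note it does not solve the "file still being copied" problem either. The watched directory name is a placeholder.

import java.nio.file.FileSystems;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardWatchEventKinds;
import java.nio.file.WatchEvent;
import java.nio.file.WatchKey;
import java.nio.file.WatchService;

public class ConfigWatcher {
    public static void main(String[] args) throws Exception {
        Path dir = Paths.get("config"); // directory containing defaultConfig.properties
        try (WatchService watcher = FileSystems.getDefault().newWatchService()) {
            dir.register(watcher,
                    StandardWatchEventKinds.ENTRY_MODIFY,
                    StandardWatchEventKinds.ENTRY_DELETE);
            while (true) {
                WatchKey key = watcher.take(); // blocks until something changes
                for (WatchEvent<?> event : key.pollEvents()) {
                    System.out.println(event.kind() + ": " + event.context());
                }
                if (!key.reset()) {
                    break; // directory is no longer accessible
                }
            }
        }
    }
}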
