Submitting multiple Hadoop jobs through Java

I need to submit several jobs to Hadoop which are all related (which is why they are launched by the same driver class) but completely independent of each other. Right now I start jobs like this:
int res = ToolRunner.run(new Configuration(), new MapReduceClass(params), args);
which runs a job, gets the return code, and moves on.
What I'd like to do is submit several such jobs to run in parallel, retrieving the return code of each one.
The obvious (to me) idea would be to launch several threads, each responsible for a single Hadoop job, but I'm wondering whether Hadoop has a better way to accomplish this. I don't have any experience writing concurrent code, so I'd rather not spend a lot of time learning its intricacies unless it's necessary here.

This is more of a suggestion, but since it involves code I'll post it as an answer.
In this code (personal code), I just iterate over some input paths and submit a job (the same job) several times.
Using job.waitForCompletion(false) will help you submit several jobs.
while (processedInputPaths < inputPaths.length) {
    if (processedInputPaths + inputPathsLimit < inputPaths.length) {
        end = processedInputPaths + inputPathsLimit - 1;
    } else {
        end = inputPaths.length - 1;
    }
    start = processedInputPaths;

    Job job = this.createJob(configuration, inputPaths, cycle, start, end, outputPath + "/" + cycle);
    boolean success = job.waitForCompletion(true);

    if (success) {
        cycle++;
        processedInputPaths = end + 1;
    } else {
        LOG.info("Cycle did not end successfully :" + cycle);
        return -1;
    }
}

psabbate's answer led me to find a couple of pieces of the API that I was missing. This is how I solved it:
In the driver class, start the jobs with code like this:
List<RunningJob> runningJobs = new ArrayList<RunningJob>();
for (String jobSpec : jobSpecs) {
    // Configure, for example, a params map that gets passed into the MR class's constructor
    ToolRunner.run(new Configuration(), new MapReduceClass(params, runningJobs), null);
}

for (RunningJob rj : runningJobs) {
    System.err.println("Waiting on job " + rj.getID());
    rj.waitForCompletion();
}
Then, in the MapReduceClass, define a private field List<RunningJob> runningJobs and a constructor like this:
public MergeAndScore(Map<String, String> p, List<RunningJob> rj) throws IOException {
    params = Collections.unmodifiableMap(p);
    runningJobs = rj;
}
And in the run() method that ToolRunner calls, define your JobConf and submit the job with
JobClient jc = new JobClient();
jc.init(conf);
jc.setConf(conf);
runningJobs.add(jc.submitJob(conf));
With this, run() returns immediately, and the jobs can be accessed via the runningJobs object in the driver class.
Note that I am working on an older version of Hadoop, so jc.init(conf) and/or jc.setConf(conf) may or may not be necessary depending on your setup, though probably at least one of them is required.
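For reference, on newer Hadoop versions the org.apache.hadoop.mapreduce.Job API supports the same submit-then-wait pattern directly; a minimal sketch under that assumption (job configuration is omitted and the driver class name is made up):

import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class ParallelJobsDriver {
    public static void main(String[] args) throws Exception {
        List<Job> submitted = new ArrayList<Job>();
        for (String jobSpec : args) {
            Configuration conf = new Configuration();
            Job job = Job.getInstance(conf, "job-" + jobSpec);
            // set jar, mapper/reducer, input/output paths here as usual
            job.submit();                                // returns immediately
            submitted.add(job);
        }
        for (Job job : submitted) {
            boolean ok = job.waitForCompletion(false);   // block until this job ends
            System.err.println("Job " + job.getJobID() + " succeeded: " + ok);
        }
    }
}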

Related

Multiple threads writing to same list in java and return that list to a function

So I have a really large list of zip codes (about 80,000) that I want to pass to a URL and get the JSON data from that URL for each zip code.
I query that JSON to see if it has end_lat, and if it does, I want to save that zip code to a list.
Since I am fetching and matching JSON for a lot of zip codes, it's taking forever.
So I tried a few different ways to make it a multi-threaded application. I tried the good old Thread approach with the Runnable interface.
I tried executor services. But everything stops abruptly, which makes me believe I should be making synchronized writes to that list.
public void breakingZipCodesForThreads() {
    List<String> zip_Codes = Serenity.sessionVariableCalled("zipCodes");
    int size = (int) Math.ceil(zip_Codes.size() / 5.0);
    ExecutorService executor = Executors.newFixedThreadPool(4);
    for (int start = 0; start < zip_Codes.size(); start += size) {
        int end = Math.min(start + size, zip_Codes.size());
        Runnable worker = new MyRunnable(zip_Codes.subList(start, end));
        executor.execute(worker);
    }
}

// The run() method basically has this code for a function
for (String zipCode : zip_Codes) {
    currentPage = pageUrl + zipCode;
    Response response = given().urlEncodingEnabled(false)
            .when()
            .get(currentPage);
    try {
        Object end_lat = response.getBody().path("end_lat");
        if (end_lat != null && !end_lat.toString().isEmpty()) {
            resultantZipCode.add(zipCode);
        }
    } catch (Exception e) {
        // Something else
    }
}
So essentially I want all my threads to write concurrently to the list "resultantZipCode" and, in the end, give me a single list of all the zip codes that satisfy my condition.
So how do I break my zip codes into pieces, run the run() function in parallel, save all the resulting zip codes, and get that list back? What am I missing?
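One common way to do this without synchronizing on a shared list is to have each worker return its own partial list through a Callable and merge the Futures at the end; a rough sketch with the HTTP/JSON check stubbed out (checkZipCodes and the chunking are illustrative, not the poster's code):

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class ZipCodeChecker {

    // Placeholder for the HTTP/JSON check from the question.
    static List<String> checkZipCodes(List<String> chunk) {
        List<String> matches = new ArrayList<String>();
        for (String zipCode : chunk) {
            // fetch pageUrl + zipCode, parse the JSON, and add zipCode to
            // matches when "end_lat" is present (omitted here)
        }
        return matches;
    }

    public static List<String> collectMatches(List<String> zipCodes, int chunkSize) throws Exception {
        ExecutorService executor = Executors.newFixedThreadPool(4);
        List<Future<List<String>>> futures = new ArrayList<>();
        for (int start = 0; start < zipCodes.size(); start += chunkSize) {
            List<String> chunk = zipCodes.subList(start, Math.min(start + chunkSize, zipCodes.size()));
            futures.add(executor.submit((Callable<List<String>>) () -> checkZipCodes(chunk)));
        }
        List<String> result = new ArrayList<>();
        for (Future<List<String>> f : futures) {
            result.addAll(f.get());   // get() blocks until that chunk is done
        }
        executor.shutdown();
        return result;               // one merged list, no shared writes needed
    }
}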

Spark java.lang.StackOverflowError

I'm using Spark to calculate the PageRank of user reviews, but I keep getting Spark java.lang.StackOverflowError when I run my code on a big dataset (40k entries). When running the code on a small number of entries it works fine, though.
Entry Example :
product/productId: B00004CK40 review/userId: A39IIHQF18YGZA review/profileName: C. A. M. Salas review/helpfulness: 0/0 review/score: 4.0 review/time: 1175817600 review/summary: Reliable comedy review/text: Nice script, well acted comedy, and a young Nicolette Sheridan. Cusak is in top form.
The Code:
public void calculatePageRank() {
    sc.clearCallSite();
    sc.clearJobGroup();

    JavaRDD<String> rddFileData = sc.textFile(inputFileName).cache();
    sc.setCheckpointDir("pagerankCheckpoint/");

    JavaRDD<String> rddMovieData = rddFileData.map(new Function<String, String>() {
        @Override
        public String call(String arg0) throws Exception {
            String[] data = arg0.split("\t");
            String movieId = data[0].split(":")[1].trim();
            String userId = data[1].split(":")[1].trim();
            return movieId + "\t" + userId;
        }
    });

    JavaPairRDD<String, Iterable<String>> rddPairReviewData = rddMovieData.mapToPair(new PairFunction<String, String, String>() {
        @Override
        public Tuple2<String, String> call(String arg0) throws Exception {
            String[] data = arg0.split("\t");
            return new Tuple2<String, String>(data[0], data[1]);
        }
    }).groupByKey().cache();

    JavaRDD<Iterable<String>> cartUsers = rddPairReviewData.map(f -> f._2());
    List<Iterable<String>> cartUsersList = cartUsers.collect();
    JavaPairRDD<String, String> finalCartesian = null;
    int iterCounter = 0;
    for (Iterable<String> out : cartUsersList) {
        JavaRDD<String> currentUsersRDD = sc.parallelize(Lists.newArrayList(out));
        if (finalCartesian == null) {
            finalCartesian = currentUsersRDD.cartesian(currentUsersRDD);
        } else {
            finalCartesian = currentUsersRDD.cartesian(currentUsersRDD).union(finalCartesian);
            if (iterCounter % 20 == 0) {
                finalCartesian.checkpoint();
            }
        }
    }

    JavaRDD<Tuple2<String, String>> finalCartesianToTuple = finalCartesian.map(m -> new Tuple2<String, String>(m._1(), m._2()));
    finalCartesianToTuple = finalCartesianToTuple.filter(x -> x._1().compareTo(x._2()) != 0);
    JavaPairRDD<String, String> userIdPairs = finalCartesianToTuple.mapToPair(m -> new Tuple2<String, String>(m._1(), m._2()));

    JavaRDD<String> userIdPairsString = userIdPairs.map(new Function<Tuple2<String, String>, String>() {
        // Tuple2<Tuple2<MovieId, userId>, Tuple2<movieId, userId>>
        @Override
        public String call(Tuple2<String, String> t) throws Exception {
            return t._1 + " " + t._2;
        }
    });

    try {
        // calculate pagerank using this https://github.com/apache/spark/blob/master/examples/src/main/java/org/apache/spark/examples/JavaPageRank.java
        JavaPageRank.calculatePageRank(userIdPairsString, 100);
    } catch (Exception e) {
        // TODO Auto-generated catch block
        e.printStackTrace();
    }
    sc.close();
}
I have several suggestions that will help you greatly improve the performance of the code in your question.
Caching: caching should be used on data sets that you need to refer to again and again for the same or different operations (iterative algorithms).
An example is RDD.count — to tell you the number of lines in the file, the file needs to be read. So if you write RDD.count, at this point the file will be read, the lines will be counted, and the count will be returned. What if you call RDD.count again? The same thing: the file will be read and counted again.

So what does RDD.cache do? Now, if you run RDD.count the first time, the file will be loaded, cached, and counted. If you call RDD.count a second time, the operation will use the cache. It will just take the data from the cache and count the lines, no recomputing.
Read more about caching here.
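A tiny illustration of that difference with the Java API (the file name is just a placeholder):

JavaRDD<String> lines = sc.textFile("reviews.txt");   // lazy: nothing is read yet
long first = lines.count();      // file is read and the lines are counted
long second = lines.count();     // file is read AGAIN, then counted

JavaRDD<String> cached = sc.textFile("reviews.txt").cache();
long a = cached.count();         // file read once, partitions kept in memory
long b = cached.count();         // answered from the cache, no re-read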
In your code sample you are not reusing anything that you've cached, so you can remove the .cache() calls.
Parallelization: in the code sample, you've parallelized every individual element of your RDD, which is already a distributed collection. I suggest merging the rddFileData, rddMovieData and rddPairReviewData steps so that it happens in one go.
Get rid of .collect(), since that brings the results back to the driver and may be the actual reason for your error.
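A sketch of what the merged pipeline could look like, reusing the names from the question (not tested against the poster's data):

// Straight from the raw lines to (movieId -> userIds) pairs, in one pass,
// without the intermediate rddMovieData and without collect():
JavaPairRDD<String, Iterable<String>> rddPairReviewData =
        rddFileData.mapToPair(line -> {
            String[] data = line.split("\t");
            String movieId = data[0].split(":")[1].trim();
            String userId = data[1].split(":")[1].trim();
            return new Tuple2<>(movieId, userId);
        })
        .groupByKey();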
This problem occurs when your DAG grows large and too many levels of transformations happen in your code. The JVM can no longer hold all the lazily built operations when an action is finally performed.
Checkpointing is one option. I would suggest using Spark SQL for this kind of aggregation: if your data is structured, try loading it into DataFrames and performing the grouping and other SQL functions there.
When your for loop grows really large, Spark can no longer keep track of the lineage. Enable checkpointing in your for loop to checkpoint your RDD every 10 iterations or so; checkpointing will fix the problem. Don't forget to clean up the checkpoint directory afterwards.
http://spark.apache.org/docs/latest/streaming-programming-guide.html#checkpointing
The following fixed the StackOverflowError. As others pointed out, it's caused by the lineage that Spark keeps building, especially when you have a loop/iteration in the code.
Set the checkpoint directory:
spark.sparkContext.setCheckpointDir("./checkpoint")
Checkpoint the DataFrame/RDD you are modifying/operating on in the iteration:
modifyingDf.checkpoint()
Cache DataFrames that are reused in each iteration (see the combined sketch below):
reusedDf.cache()
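Putting the three together, a minimal sketch of an iterative loop with periodic checkpointing (transform, initialRdd and the iteration count are placeholders):

sc.setCheckpointDir("checkpoint/");              // must be set before checkpointing

JavaRDD<String> current = initialRdd;            // the RDD you modify each iteration
for (int i = 0; i < iterations; i++) {
    current = current.map(v -> transform(v))     // whatever per-iteration work you do
                     .cache();                   // reused below, keep it in memory
    if (i % 10 == 0) {
        current.checkpoint();                    // truncate the lineage
        current.count();                         // an action forces the checkpoint to run
    }
}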

Making massive amounts of individual row updates faster or more efficient

I'm writing a Java application that copies one database's information (DB2) to another database (SQL Server). The order of operations is very simple:
Check to see if anything has been updated in a certain time frame
Grab everything from the first database that is within the designated time frame
Map database information to POJOs
Divide subsets of POJOs among threads (pre-defined number in a properties file)
Threads cycle through each POJO Individually
Update the second database
I have everything working just fine, but at certain times of the day there is a huge jump in the number of updates that need to take place (it can get into the hundreds of thousands).
Below you can see a generic version of my code. It follows the basic algorithm of the application. Object is generic; the actual application has 5 different types of specific objects, each with its own updater thread class. But the generic functions below are exactly what they all look like, and in the updateDatabase() method they all get added to threads and run at the same time.
private void updateDatabase()
{
    List<Thread> threads = new ArrayList<>();
    addObjectThreads( threads );
    startThreads( threads );
    joinAllThreads( threads );
}

private void addObjectThreads( List<Thread> threads )
{
    List<Object> objects = getTransformService().getObjects();
    logger.info( "Found " + objects.size() + " Objects" );
    createThreads( threads, objects, ObjectUpdaterThread.class );
}

private void createThreads( List<Thread> threads, List<?> objects, Class threadClass )
{
    final int BASE_OBJECT_LOAD = 1;
    int objectLoad = objects.size() / Database.getMaxThreads() > 0 ? objects.size() / Database.getMaxThreads() + BASE_OBJECT_LOAD : BASE_OBJECT_LOAD;
    for (int i = 0; i < (objects.size() / objectLoad); ++i)
    {
        int startIndex = i * objectLoad;
        int endIndex = (i + 1) * objectLoad;
        try
        {
            List<?> objectSubList = objects.subList( startIndex, endIndex > objects.size() ? objects.size() : endIndex );
            threads.add( new Thread( (Thread) threadClass.getConstructor( List.class ).newInstance( objectSubList ) ) );
        }
        catch (Exception exception)
        {
            logger.error( exception.getMessage() );
        }
    }
}

public class ObjectUpdaterThread extends BaseUpdaterThread
{
    private List<Object> objects;
    final private Logger logger = Logger.getLogger( ObjectUpdaterThread.class );

    public ObjectUpdaterThread( List<Object> objects )
    {
        this.objects = objects;
    }

    public void run()
    {
        for (Object object : objects)
        {
            logger.info( "Now Updating Object: " + object.getId() );
            getTransformService().updateObject( object );
        }
    }
}
All of these go to a Spring service that looks like the code below. Again it's generic, but each type of object has exactly the same logic. The getObjects() call from the code above is just a one-line pass-through to the DAO, so there's no need to post that.
@Service
@Scope(value = "prototype")
public class TransformServiceImpl implements TransformService
{
    final private Logger logger = Logger.getLogger( TransformServiceImpl.class );

    @Autowired
    private TransformDao transformDao;

    @Override
    public void updateObject( Object object )
    {
        String sql;
        if ( object.exists() )
        {
            sql = Object.Mapper.UPDATE;
        }
        else
        {
            sql = Object.Mapper.INSERT;
        }

        boolean isCompleted = false;
        while ( !isCompleted )
        {
            try
            {
                transformDao.updateObject( object, sql );
                isCompleted = true;
            }
            catch (Exception exception)
            {
                logger.error( exception.getMessage() );
                threadSleep();
                logger.info( "Now retrying update for Object: " + object.getId() );
            }
        }
        logger.info( "Updated Object: " + object.getId() );
    }
}
Finally these all go to the DAO that looks like this:
@Repository
@Scope(value = "prototype")
public class TransformDaoImpl implements TransformDao
{
    // @Resource is like @Autowired but with the added option of being able to specify the name
    // Good for autowiring two different instances of the same class [NamedParameterJdbcTemplate]
    // Another alternative = @Autowired @Qualifier(BEAN_NAME)
    @Resource(name = "db2")
    private NamedParameterJdbcTemplate db2;

    @Resource(name = "sqlServer")
    private NamedParameterJdbcTemplate sqlServer;

    final private Logger logger = Logger.getLogger( TransformerImpl.class );

    @Override
    public void updateObject( Object object, String sql )
    {
        MapSqlParameterSource source = new MapSqlParameterSource();
        source.addValue( "column1_value", object.getColumn1Value() );
        // put all source values from the POJO in just like above
        sqlServer.update( sql, source );
    }
}
My insert statements look like this:
"INSERT INTO dbo.OBJECT_TABLE " +
"(COLUMN1, COLUMN2...) " +
"VALUES(:column1_value, :column2_value... "
And my update statements look like this:
"UPDATE dbo.OBJECT_TABLE SET " +
"COLUMN1 = :column1_value, COLUMN2 = :column2_value, " +
"WHERE PRIMARY_KEY_COLUMN = :primary_key_value"
It's a lot of code and stuff, I know, but I just wanted to lay out everything I have in the hope of getting help making this faster or more efficient. It takes hours upon hours to update so many rows, and it would be nice if it only took a couple/few hours instead. Thanks for any help. I welcome all learning experiences about Spring, threads and databases.
If you're sending large amounts of SQL to the server, you should consider batching it using the Statement.addBatch and Statement.executeBatch methods. The batches are finite in size (I always limited mine to 64K of SQL), but they dramatically lower the round trips to the database.
As I was iterating and creating SQL, I would keep track of how much I had batched already; when the SQL crossed the 64K boundary, I'd fire off an executeBatch and start a fresh one.
You may want to experiment with the 64K number, it may have been an Oracle limitation, which I was using at the time.
I can't speak to Spring, but batching is a part of the JDBC Statement. I'm sure it's straightforward to get to this.
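In the Spring setup from the question, the same idea is exposed through NamedParameterJdbcTemplate.batchUpdate; a rough sketch (the chunking, the MyObject type and its getters are made up for illustration):

// Send a chunk of rows as one JDBC batch instead of one update() per row.
private void updateBatch(List<MyObject> chunk, String sql) {
    MapSqlParameterSource[] batch = new MapSqlParameterSource[chunk.size()];
    for (int i = 0; i < chunk.size(); i++) {
        batch[i] = new MapSqlParameterSource()
                .addValue("column1_value", chunk.get(i).getColumn1Value())
                .addValue("primary_key_value", chunk.get(i).getId());
        // ...add the remaining columns exactly as in updateObject()
    }
    sqlServer.batchUpdate(sql, batch);   // one round trip per batch
}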
Check to see if anything has been updated in a certain time frame
Grab everything from the first database that is within the designated time frame
Is there an index on the LAST_UPDATED_DATE column (or whatever you're using) in the source table? Rather than put the burden on your application, if it's within your control, why not write some triggers in the source database that create entries in an "update log" table? That way, all that your app would need to do is consume and execute those entries.
How are you managing your transactions? If you're creating a new transaction for each operation it's going to be brutally slow.
Regarding the threading code, have you considered using something more standard rather than writing your own? What you have is a pretty typical producer/consumer and Java has excellent support for that type of thing with ThreadPoolExecutor and numerous queue implementations to move data between threads that perform different tasks.
The benefit of using something off the shelf is that 1) it's well tested and 2) there are numerous tuning options and sizing strategies that you can adjust to increase performance.
Also, rather than use 5 different thread types for each type of object that needs to be processed, have you considered encapsulating the processing logic for each type into separate strategy classes? That way, you could use a single pool of worker threads (which would be easier to size and tune).
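A minimal sketch of that direction, reusing the names from the question (Database.getMaxThreads() and getTransformService()); the pool sizing and error handling are left plain for brevity, so treat it as a starting point rather than a drop-in replacement:

private void updateDatabaseWithPool() throws Exception {
    ExecutorService pool = Executors.newFixedThreadPool(Database.getMaxThreads());
    List<Future<?>> futures = new ArrayList<>();
    for (Object object : getTransformService().getObjects()) {
        // one small task per row; the pool handles the splitting and scheduling
        futures.add(pool.submit(() -> getTransformService().updateObject(object)));
    }
    for (Future<?> f : futures) {
        f.get();            // wait for all tasks, surfacing any failure
    }
    pool.shutdown();
}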

Calling Groovy scripts from Java and refreshing the Groovy scripts periodically

I want to call the Groovy scripts from Java and refresh the Groovy scripts periodically.
For example,
public class AppTest {
    public static void main(String args[]) throws Exception {
        TestVO test = new TestVO();
        AnotherInput input = new AnotherInput();
        test.setName("Maruthi");
        input.setCity("Newark");

        GroovyClassLoader loader = new GroovyClassLoader(AppTest.class.getClassLoader());
        Class groovyClass = loader.parseClass(new File("src/main/resources/groovy/MyTestGroovy.groovy"));
        GroovyObject groovyObject = (GroovyObject) groovyClass.newInstance();

        Object[] inputs = {test, null};
        Map<String, String> result = (Map<String, String>) groovyObject.invokeMethod("checkInput", inputs);
        System.out.println(result);
    }
}
And my Groovy script is
class MyTestGroovy {
    def x = "Maruthi";

    def checkInput = { TestVO input, AnotherInput city ->
        if (input.getName().equals(x)) {
            input.setName("Deepan");
            println "Name changed Please check the name";
        } else {
            println "Still Maruthi Rocks";
        }
        Map<String, String> result = new HashMap<String, String>();
        result.put("Status", "Success");
        if (city != null && city.getCity().equalsIgnoreCase("Newark")) {
            result.put("requested_State", "Newark");
        }
        return result;
    }

    def executeTest = {
        println("Test Executed");
    }
}
How efficiently would memory be managed when I create multiple instances of the Groovy script and execute them? Is it advisable to use a number of Groovy scripts as my customized rule engine? Please advise.
It is usually better to create several instances of the same script class than to parse the class every time you want an instance. Performance-wise, that is because compiling the script takes time that you pay in addition to creating the instance. Memory-wise, you use up the pool of available classes faster; even if old classes are collected, you can still run out if you have many scripts active, though that normally means hundreds or even thousands of them (it depends on the JVM version and your memory settings).
Of course, once the script has changed, you will have to recompile the class anyway. So if in your scenario only one instance of the class is active at a time, and a new instance is only required after a change to the source, you can recompile every time.
I mention that especially because you might even be able to write the script in a way that lets you reuse the same instance, but that is of course beyond the scope of this question.
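One way to get that reuse while still picking up edits is to cache the compiled class and reparse only when the script file changes; a sketch in Java (the class and field names are made up for illustration):

import java.io.File;
import groovy.lang.GroovyClassLoader;
import groovy.lang.GroovyObject;

public class ScriptCache {
    private final File scriptFile;
    private final GroovyClassLoader loader =
            new GroovyClassLoader(ScriptCache.class.getClassLoader());
    private Class<?> scriptClass;
    private long lastCompiled = -1;

    public ScriptCache(File scriptFile) {
        this.scriptFile = scriptFile;
    }

    // Reparse only when the file on disk is newer than the cached class.
    public synchronized GroovyObject newInstance() throws Exception {
        long modified = scriptFile.lastModified();
        if (scriptClass == null || modified > lastCompiled) {
            scriptClass = loader.parseClass(scriptFile);
            lastCompiled = modified;
        }
        return (GroovyObject) scriptClass.getDeclaredConstructor().newInstance();
    }
}

Groovy also ships groovy.util.GroovyScriptEngine, which watches source files and reloads them when they change, so that may be worth a look as well.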

Creating Performance Counters in Java

Does anyone know how can I create a new Performance Counter (perfmon tool) in Java?
For example: a new performance counter for monitoring the number / duration of user actions.
I created such performance counters in C# and it was quite easy, however I couldn’t find anything helpful for creating it in Java…
If you want to develop your performance counter independently of the main code, you should look at aspect-oriented programming (AspectJ, Javassist).
You can then plug your performance counter into the method(s) you want without modifying the main code.
Java does not immediately work with perfmon (but you should see DTrace under Solaris).
Please see this question for suggestions: Java app performance counters viewed in Perfmon
Not sure what you are expecting this tool to do, but I would create some data structures to record these times and counts, like:
class UserActionStats {
    int count;
    long durationMS;
    long start = 0;

    public void startAction() {
        start = System.currentTimeMillis();
    }

    public void endAction() {
        durationMS += System.currentTimeMillis() - start;
        count++;
    }
}
A collection for these could look like
private static final Map<String, UserActionStats> map =
        new HashMap<String, UserActionStats>();

public static UserActionStats forUser(String userName) {
    synchronized (map) {
        UserActionStats uas = map.get(userName);
        if (uas == null)
            map.put(userName, uas = new UserActionStats());
        return uas;
    }
}
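Usage would then look something like this (the user name and surrounding code are illustrative):

UserActionStats stats = forUser("alice");
stats.startAction();
// ... perform the user action being measured ...
stats.endAction();
System.out.println("count=" + stats.count + ", totalMs=" + stats.durationMS);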

Categories

Resources