Suggestions for faster file processing? - java

I have a cluster of 4 servers. A file consists of many logical documents, and each file is started as a workflow. So, in summary, one workflow runs on a server for each physical input file, which can contain as many as 300,000 logical documents. At any given time, 80 workflows are running concurrently across the cluster. Is there a way to speed up the file processing? Is file splitting a good alternative? Any suggestions? Everything is Java based, running on a Tomcat servlet engine.

Try processing the files in Oracle Coherence, which gives you grid processing. Coherence also provides data persistence.
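Independent of Coherence, a lot of the gain usually comes from splitting each file into its logical documents and processing them in parallel inside one JVM. Here is a minimal sketch under that assumption; the class, the process() method, and the idea that logical documents can be handled independently are illustrative, not taken from your setup:

import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class ParallelDocumentProcessor {

    // Processes one file's logical documents with a bounded thread pool.
    public void processFile(List<String> logicalDocuments) throws InterruptedException {
        // Size the pool to the cores available; keep it small since ~80 workflows
        // are already running concurrently across the cluster.
        int threads = Math.max(2, Runtime.getRuntime().availableProcessors() / 2);
        ExecutorService pool = Executors.newFixedThreadPool(threads);

        for (String document : logicalDocuments) {
            pool.submit(() -> process(document));  // each logical document is an independent task
        }

        pool.shutdown();                           // stop accepting new tasks
        pool.awaitTermination(1, TimeUnit.HOURS);  // wait for the whole file to finish
    }

    private void process(String document) {
        // ... your existing per-document logic ...
    }
}

Whether this helps depends on the per-document work being CPU-bound rather than the disk or a downstream system being the bottleneck.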

Related

Is it possible to virtually divide hadoop cluster into small clusters

We are working to build a big cluster of 100 nodes with 300 TB of storage. We then have to serve it to different users (clients) with restricted resource limits, i.e. we do not want to expose the complete cluster to each user. Is that possible? If it is not possible, what are the other ways to do it? Are there any built-in solutions available? It is essentially cluster partitioning on demand.
In Hadoop 2 there is the concept of HDFS Federation, which can partition the file system namespace over multiple separate NameNodes, each of which manages a portion of the file system namespace.
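As a rough illustration of what a federated setup looks like from the client side (the nameservice IDs and NameNode hostnames below are made up), each nameservice is backed by its own NameNode and a client simply talks to the one that owns the part of the namespace it needs:

import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class FederationClientSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Two nameservices, each with its own NameNode (hostnames are hypothetical).
        conf.set("dfs.nameservices", "users-ns,logs-ns");
        conf.set("dfs.namenode.rpc-address.users-ns", "nn1.example.com:8020");
        conf.set("dfs.namenode.rpc-address.logs-ns", "nn2.example.com:8020");

        // Each FileSystem handle addresses one namespace volume.
        FileSystem usersFs = FileSystem.get(URI.create("hdfs://nn1.example.com:8020"), conf);
        FileSystem logsFs = FileSystem.get(URI.create("hdfs://nn2.example.com:8020"), conf);

        System.out.println(usersFs.exists(new Path("/users")));
        System.out.println(logsFs.exists(new Path("/logs")));
    }
}

Note that federation partitions the namespace; it is not by itself a multi-tenancy or quota mechanism, so per-user resource limits still need something on top of it.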

Hadoop MapReduce without cluster - is it possible?

Is it possible to run a Hadoop MapReduce program without a cluster? I am just trying to fiddle around a little with map/reduce for educational purposes, so all I want is to run a few MapReduce programs on my computer. I don't need any job splitting across multiple nodes, performance boosts, or anything like that; as I said, it's just for educational purposes. Do I still need to run a VM to achieve this? I am using IntelliJ Ultimate and am trying to run a simple WordCount. I believe I've set up all the necessary libraries and the entire project, and upon running I get this exception:
Exception in thread "main" java.io.IOException: Cannot initialize Cluster.
Please check your configuration for mapreduce.framework.name and the correspond server addresses.
I've found some posts saying that the entire map/reduce process can be run locally in the JVM, but I haven't yet found out how to do it.
The installation tutorial for "pseudo-distributed" mode specifically walks you through setting up a single-node Hadoop cluster.
There's also the "mini cluster", which you'll find some Hadoop projects use for unit and integration tests.
I suspect you're really asking whether you need HDFS or YARN, though, and the answer is no: Hadoop can read file:// prefixed file paths from local disk, with or without a cluster (see the sketch after this answer).
Keep in mind that splitting is not just across nodes but also across multiple cores of a single machine. If you're not doing any parallel processing, there's not much reason to use Hadoop other than to learn the API semantics.
Aside: from an "educational perspective", in my career so far I see more people writing Spark than MapReduce, and not many jobs asking specifically for MapReduce code.
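As a reference for the local route, here is a minimal sketch of a WordCount driver that forces the local job runner and the local file system so the whole job runs in one JVM; the input/output paths are placeholders:

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class LocalWordCount {

    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        public void map(Object key, Text value, Context context) throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable value : values) {
                sum += value.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("mapreduce.framework.name", "local");  // local job runner instead of YARN
        conf.set("fs.defaultFS", "file:///");           // local file system instead of HDFS

        Job job = Job.getInstance(conf, "wordcount-local");
        job.setJarByClass(LocalWordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        FileInputFormat.addInputPath(job, new Path("file:///tmp/wordcount/input"));
        FileOutputFormat.setOutputPath(job, new Path("file:///tmp/wordcount/output"));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

With the hadoop-client dependency on the classpath, this runs straight from the IDE without a VM or any installed Hadoop daemons.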

Adding new node to a scalable system with zero downtime

I am working as a developer on a batch processing solution. The way it works is that we split a big file and process it across JVMs: there are 4 processor JVMs, each of which takes a chunk of the file and processes it, and 1 gateway JVM. The gateway JVM's job is to split the file into as many chunks as there are processor JVMs (i.e. 4) and send a REST request to each processor JVM; the request contains the location of the chunk that processor has to pick up and some other details.
Now, if I want to add another processor JVM without any downtime, is there a way to do it? Currently we maintain the URLs of the 4 JVMs in a property file. Is there a better way to do this that would let me add more JVMs without restarting any component?
You can consider setting up a load balancer and putting your JVMs behind it. The load balancer would be responsible for distributing the incoming requests to the JVMs.
This way you can scale the number of JVMs up or down depending on the workload. Also, if one of the JVMs stops working, the rest of your system no longer needs to care about it.
I'm not sure what your use case and tech stack are, but it sounds like you need a distributed system with auto-scaling and dynamic provisioning capabilities. Have you considered Hadoop or Spark clusters, or Akka?
If you cannot use any of those, the solution is to maintain the list of JVMs in some datastore (let's say a table); it is dynamic data, meaning JVMs can be added, removed, or updated. You then need a resource manager that monitors the entire system and decides whether to spin up a new JVM based on load or any other conditional logic. Whenever you create a task, chunk, or slice of data, distribute it using a message queue such as Apache ActiveMQ (see the sketch below); you can also consider Kafka for complex use cases. Nowadays application servers such as WebSphere (Liberty profile) and WebLogic also provide auto-scaling capabilities, so if you are already using such an application server you can think about making use of that. I hope this helps.
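As a rough sketch of the queue-based variant (the broker URL and queue name below are made up): instead of the gateway POSTing to a fixed list of processor URLs from a property file, it publishes one message per chunk to a queue, and any number of processor JVMs consume from that queue, so adding a processor is just starting another consumer:

import javax.jms.Connection;
import javax.jms.ConnectionFactory;
import javax.jms.MessageProducer;
import javax.jms.Queue;
import javax.jms.Session;
import javax.jms.TextMessage;

import org.apache.activemq.ActiveMQConnectionFactory;

public class ChunkPublisher {
    public static void main(String[] args) throws Exception {
        // Hypothetical broker URL and queue name.
        ConnectionFactory factory = new ActiveMQConnectionFactory("tcp://broker.example.com:61616");
        Connection connection = factory.createConnection();
        try {
            connection.start();
            Session session = connection.createSession(false, Session.AUTO_ACKNOWLEDGE);
            Queue queue = session.createQueue("file.chunks");
            MessageProducer producer = session.createProducer(queue);

            // One message per chunk; the body carries the chunk location a processor should pick up.
            for (int chunk = 0; chunk < 4; chunk++) {
                TextMessage message = session.createTextMessage("/shared/input/bigfile.part" + chunk);
                producer.send(message);
            }
        } finally {
            connection.close();
        }
    }
}

The broker then balances the chunk messages across however many processor JVMs happen to be listening, which removes the hard-coded URL list entirely.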

How does truly big data import into HDFS before the data scientists grow old and die?

I'm brand-spanking-new to Hadoop and believe I'm beginning to see how different data analytics ("offline") is from the super-low-latency world of web apps. One major thing I'm still struggling to understand is how truly "big" data makes it onto HDFS in the first place.
Say I have 500TB of data stored across a variety of systems (RDBMS, NoSQL, log data, whatever). My understanding is that, if I want to write MR jobs to query and analyze this data, I need to first import/ingest it all into HDFS.
But even if I had, say, a 1 Gbps network connection between each disparate system and my Hadoop cluster, 500 TB is 500,000 GB, and 1 Gbps is only about 0.125 GB/s, so porting all the data onto my HDFS cluster would take roughly 4,000,000 seconds, or about 46 days. That's over a month.
And if my understanding of big data is correct, the terabyte scale is actually pretty low-key, with many big data systems scaling into the petabyte range. Now we'd be up to months, maybe even years, just to be able to run MR jobs against them. If we have systems that are orders of magnitude beyond petabytes, then we're looking at having "flying rocket scooters" buzzing around everywhere before the data is even ready to be queried.
Am I missing something fundamental here? This just doesn't seem right to me.
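For reference, a quick back-of-envelope check of the numbers above (assuming the link really sustains 1 Gbps):

public class TransferTimeEstimate {
    public static void main(String[] args) {
        double gigabytes = 500.0 * 1000.0;        // 500 TB = 500,000 GB
        double gbPerSecond = 1.0 / 8.0;           // 1 Gbps is about 0.125 GB/s
        double seconds = gigabytes / gbPerSecond; // ~4,000,000 s
        System.out.printf("%.0f s ~ %.0f hours ~ %.1f days%n",
                seconds, seconds / 3600, seconds / 86400);
    }
}

which prints roughly 4000000 s ~ 1111 hours ~ 46.3 days.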
Typically, data is loaded as it's being generated. However, there are a few tools out there to help with loading data into HDFS.
Apache Flume - https://flume.apache.org/ - Designed for aggregating large amounts of log data. Flume comes with many bundled 'sources' that can consume log data, including reading from files, directories, and queuing systems, or even accepting incoming data over TCP/UDP/HTTP. With those you can set up Flume on a number of hosts to parallelize the data aggregation.
Apache Sqoop - http://sqoop.apache.org/ - Designed for bulk loading from structured datastores such as relational databases. Sqoop uses connectors to connect to, structure, and load data into HDFS. The built-in one can connect to anything that adheres to the JDBC 4 specification.
500 TB is a lot of data to load, but if it's spread out across multiple systems and formats, using Sqoop and/or Flume should make relatively quick work of it.
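For the "loaded as it's being generated" case, applications can also write straight to HDFS through the Java FileSystem API instead of bulk-copying later. A minimal sketch, with a placeholder NameNode address and path:

import java.net.URI;
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class DirectHdfsWriter {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Hypothetical NameNode address.
        FileSystem fs = FileSystem.get(URI.create("hdfs://namenode.example.com:8020"), conf);

        // Write records to HDFS as they are produced, so there is no separate ingest step later.
        Path target = new Path("/ingest/events/part-0000.log");
        try (FSDataOutputStream out = fs.create(target, true)) {  // true = overwrite if present
            out.write("event-id=42\tstatus=OK\n".getBytes(StandardCharsets.UTF_8));
        }
    }
}

This only moves the cost around, of course: the same bandwidth is spent, just continuously and in small pieces rather than in one multi-week bulk transfer.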

using ehcache to handle file processing

I am new to the Ehcache concept and its usage. In my application I am loading many files using java.io (let's say 100 at a time; it may be more than that) and processing these files using multiple threads.
From a performance perspective I want to implement a caching mechanism for this. Can anyone please let me know how I should do this and what the best practice would be?
PS - the file processing steps are:
1. Read the file.
2. Create a Java file object.
3. Process the file.
4. Move the file to a different location.
(I am using Spring in my application.)
Thank you all in advance.
We operate a high-traffic portal with about 95M page impressions (PIs) per month.
We use proxy servers and Varnish (https://www.varnish-cache.org/) to cache static content.
That way you outsource caching from your application servers, and they have more free memory to work with. I think it would be the right solution in your case, too.
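Coming back to the original question (Ehcache inside the Spring application rather than an HTTP cache in front of it), here is a minimal sketch using Spring's caching abstraction, which can be backed by Ehcache; the cache name, the ParsedFile type, and the method names are illustrative. Note that caching only pays off if the same file, or its parsed result, is actually requested more than once:

import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardCopyOption;

import org.springframework.cache.annotation.Cacheable;
import org.springframework.stereotype.Service;

@Service
public class FileProcessingService {

    // Stand-in for whatever your per-file result object looks like.
    public static class ParsedFile {
        public final String name;
        public final String content;

        public ParsedFile(String name, String content) {
            this.name = name;
            this.content = content;
        }
    }

    // Steps 1-3: read the file, build the object, process it.
    // Results are cached by file name, so a repeated request for the same file skips the disk read.
    // Requires @EnableCaching in your Spring configuration and a cache named "parsedFiles"
    // (for example defined in ehcache.xml).
    @Cacheable("parsedFiles")
    public ParsedFile readAndParse(String fileName) throws IOException {
        Path source = Paths.get(fileName);
        String content = new String(Files.readAllBytes(source), StandardCharsets.UTF_8);
        return new ParsedFile(fileName, content);
    }

    // Step 4: move the file to a different location after processing.
    public void archive(String fileName, String archiveDir) throws IOException {
        Path source = Paths.get(fileName);
        Files.move(source, Paths.get(archiveDir).resolve(source.getFileName()),
                StandardCopyOption.REPLACE_EXISTING);
    }
}

Since the worker threads still do the reading and moving, the cache mainly helps when several threads or requests need the same parsed content; it does not make a one-shot read-process-move pipeline faster.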
