Text file split libraries in Java

Text file split libraries in Java - java

My program receives large CSV files and transforms them to XML files. In order to have better performance I would like to split this files in smaller segments of (for example) 500 lines. What are the available Java libraries for splitting text files?

I don't understand what you'd be gaining by splitting up the CSV file into smaller ones? With Java, you can read and process the file as you go, you don't have to read it all at once...

What do you intend to do with those data ?
If it is just record by record processing then event oriented (SAX or StaX) parsing will be the way to go. For record by record processing, an existing "pipeline" toolkit may be applicable.
You can pre-process your file with a splitter function like this one or this Splitter.java.

How are you planning on distributing the work once the files have been split?
I have done something similar to this on a framework called GridGain - it's a grid computing framework which allows you to execute tasks on a grid of computers.
With this in hand you can then use a cache provider such as JBoss Cache to distribute the file to multiple nodes, specify a start and end line number and process. This is outlined in the following GridGain example: http://www.gridgainsystems.com/wiki/display/GG15UG/Affinity+MapReduce+with+JBoss+Cache
Alternatively you could look at something like Hadoop and the Hadoop File System for moving the file between different nodes.
The same concept could be done on your local machine by loading the file into a cache and then assigning certain "chunks" of the file to be worked on by seperate threads. The grid computing stuff really is only for really large problems, or to provide some level of scalability transparently to your solution. You might need to watch out for IO bottlenecks and locks, but a simple thread pool which you dispatch "jobs" into after the file is split could work.

Related

HDFS - load mass amount of files

For testing purposes I'm trying to load a massive amount of small files into HDFS. Actually we talk about 1 Million (1'000'000) files with a size from 1KB to 100KB. I generated those files with an R-Script on a Linux-System in one folder. Every file has a information structure that contains a header with product information and a different number of columns with numeric information.
The problem is when I try to upload those local files into HDFS with the command:
hdfs dfs -copyFromLocal /home/user/Documents/smallData /
Then i get one of the following Java-Heap-Size errors:
Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
Exception in thread "main" java.lang.OutOfMemoryError: GC overhead limit exceeded
I use the Cloudera CDH5 distribution with a Java-Heap-Size about 5 GB. Is there another way than increasing this Java-Heap-Size even more? Maybe a better way to load this mass amount of data into HDFS?
I'm very thankfully for every helpful comment!

If you will increase the memory and store the files in HDFS. After this you will get many problems at the time of processing.
Problems with small files and HDFS
A small file is one which is significantly smaller than the HDFS block size (default 64MB). If you’re storing small files, then you probably have lots of them (otherwise you wouldn’t turn to Hadoop), and the problem is that HDFS can’t handle lots of files.
Every file, directory and block in HDFS is represented as an object in the namenode’s memory, each of which occupies 150 bytes, as a rule of thumb. So 10 million files, each using a block, would use about 3 gigabytes of memory. Scaling up much beyond this level is a problem with current hardware. Certainly a billion files is not feasible.
Furthermore, HDFS is not geared up to efficiently accessing small files: it is primarily designed for streaming access of large files. Reading through small files normally causes lots of seeks and lots of hopping from datanode to datanode to retrieve each small file, all of which is an inefficient data access pattern.
Problems with small files and MapReduce
Map tasks usually process a block of input at a time (using the default FileInputFormat). If the file is very small and there are a lot of them, then each map task processes very little input, and there are a lot more map tasks, each of which imposes extra bookkeeping overhead. Compare a 1GB file broken into 16 64MB blocks, and 10,000 or so 100KB files. The 10,000 files use one map each, and the job time can be tens or hundreds of times slower than the equivalent one with a single input file.
There are a couple of features to help alleviate the bookkeeping overhead: task JVM reuse for running multiple map tasks in one JVM, thereby avoiding some JVM startup overhead (see the mapred.job.reuse.jvm.num.tasks property), and MultiFileInputSplit which can run more than one split per map.
Solution
Hadoop Archives (HAR files)
Create .HAR File
Hadoop Archives (HAR files) were introduced to HDFS in 0.18.0 to alleviate the problem of lots of files putting pressure on the namenode’s memory. HAR files work by building a layered filesystem on top of HDFS. A HAR file is created using the hadoop archive command, which runs a MapReduce job to pack the files being archived into a small number of HDFS files
hadoop archive -archiveName name -p <parent> <src>* <dest>
hadoop archive -archiveName foo.har -p /user/hadoop dir1 dir2 /user/zoo
Sequence Files
The usual response to questions about “the small files problem” is: use a SequenceFile. The idea here is that you use the filename as the key and the file contents as the value. This works very well in practice. Going back to the 10,000 100KB files, you can write a program to put them into a single SequenceFile, and then you can process them in a streaming fashion (directly or using MapReduce) operating on the SequenceFile. There are a couple of bonuses too. SequenceFiles are splittable, so MapReduce can break them into chunks and operate on each chunk independently. They support compression as well, unlike HARs. Block compression is the best option in most cases, since it compresses blocks of several records (rather than per record)
HBase
If you are producing lots of small files, then, depending on the access pattern, a different type of storage might be more appropriate. HBase stores data in MapFiles (indexed SequenceFiles), and is a good choice if you need to do MapReduce style streaming analyses with the occasional random look up. If latency is an issue, then there are lots of other choices

First of all: If this isn't a stress test on your namenode it's ill advised to do this. But I assume you know what you are doing. (expect slow progress on this)
If the objective is to just get the files on HDFS, try doing this in smaller batches or set a higher heap size on your hadoop client.
You do this like rpc1 mentioned in his answer by prefixing HADOOP_HEAPSIZE=<mem in Mb here> to your hadoop -put command.

Try to increase HEAPSIZE
HADOOP_HEAPSIZE=2048 hdfs dfs -copyFromLocal /home/user/Documents/smallData
look here

Hadoop Distributed File system is not good with many small files but with many big files. HDFS keep a record in a look up table that points to every file/block in HDFS and this Look up table usually is loaded in memory. So you should not just increase java heap size but also increase the heap size of the name node inside hadoop-env.sh, this is the default:
export HADOOP_HEAPSIZE=1000
export HADOOP_NAMENODE_INIT_HEAPSIZE="1000"
If you are going to do processing on those files, you should expect low performance on the first MapReduce job you run on them (Hadoop creates number of map tasks as the number of files/blocks and this will overload your system except when you use combineinputformat). advice you to either merge the files into big files (64MB/ 128MB) or use another data source (not HDFS).

For solve this problem, I build a single file with some format. The content of file are all the small files. The format will be like that:
<DOC>
<DOCID>1</DOCID>
<DOCNAME>Filename</DOCNAME>
<DOCCONTENT>
Content of file 1
</DOCCONTENT>
</DOC>
This structure could be more or less field, but the idea is the same. For example, I have use this stucture:
<DOC>
<DOCID>1</DOCID>
Content of file 1
</DOC>
And handle more of six million files.
If you desire process each file for a one map task, you could be delete \n char between and tags. After this, you only parse the structure and have the doc identifier and Content.

Combining large number of small files for mapreduce input

I am new to Hadoop & MapReduce .We are developing a network monitoring tool (in java).We collect various information of monitored devices periodically , say in every 5 sec. and write that information to HDFS through java client each information as new file(since we'r not using hdfs append facility).In HDFS our data organization would be like this:
/monitored_info
/f1.txt
/f2.txt
.......
/f1020010.txt
Thus each file typically less than 2KB in size.
I know each map task can take upto 1 file, and it will spawn as much as map task and the job will be inefficient. To get rid of this we used merging facility of FileUtil before submitting job:
FileUtil.copyMerge(fileSystem, new Path("monitored_info"), fileSystem,
new Path("mapInputfile"), false, conf, null);
Is it a good practice ? Or is there any other mechanism used for such requirements? Please help...

Check for Apache Kafka and Apache Flume. You can aggregate logs and move to your data store with them.
I'd use Flume personally. Easier to use imho.

If you want to use mapreduce there are different ways we can do that
Hadoop Archives, or HAR files, are a file archiving facility that packs files into HDFS
blocks more efficiently, thereby reducing namenode memory usage while still allowing
transparent access to files
The situation is alleviated somewhat by CombineFileInputFormat, which was designed
to work well with small files. Where FileInputFormat creates a split per file,
CombineFileInputFormat packs many files into each split so that each mapper has more
to process. Crucially, CombineFileInputFormat takes node and rack locality into account
when deciding which blocks to place in the same split, so it does not compromise the
speed at which it can process the input in a typical MapReduce job.
One technique for avoiding the many small files case is to merge small files
into larger files by using a SequenceFile: the keys can act as filenames (or a constant such as NullWritable, if not needed) and the values as file contents.

Concurrency while reading files from file system

We have an application that reads files from a particular folder, processes them and copies(some business logic) it to another folder.
The problem here is when there are very large number of files to be processed, running a single instance of an application or a single thread is no longer enough to process this files.
One approach we have for this is to start multiple instances of the application(I feel something is wrong with this approach. Suggest me an alternative if there is one).
Spawning threads or starting multiple instances of the application, care should be taken that, if a thread reads one file and starts processing it, another thread should not pick it up.
We are trying to achieve this by having a database table with the list of file names in the folder, so that when a thread first reads the table for the file name ,we will change the status to in-process or completed and pessimistically lock the table so that other threads cannot read it.
Is there any better solution to the problem ?

You can use most of your existing implementation as the front-end processor to feed file streams to worker threads that you can start/stop as demand dictates. Only the front-end thread opens files, so there is no possibility of one worker interfering with another.
EDIT: Added the word 'no' as it changes the meaning quite a bit...

Also have a look at JDK 7. It has a new file I/O API and a fork/ join framework which might help.

Take a look at Apache Camel (http://camel.apache.org), and its File component (http://camel.apache.org/file2.html). Using Camel allows you to very easily define a set of processing instructions to consume files in a directory atomically, and also to configure a thread pool to deal with multiple files at the same time. Camel in Action's a great book to get you started.

What you describe reminds me of the classical style to develop on UNIX.
In this classical style, you would move a file to a work-in-progress directory so that other files do not pick it up. In general: You could use one directory per processing state and than move files from state to state.
This works essentially because file moves are atomic (at least under Unix systems and NFTS).
What is nice with this approach, is that it is pretty easy to handle problematic situations like crashes and it has automatically a nice management interface everyone is familiar with (the filesystem GUI, ls, Windows Explorer, ...).

Processing large set of small files with Hadoop

I am using Hadoop example program WordCount to process large set of small files/web pages (cca. 2-3 kB). Since this is far away from optimal file size for hadoop files, the program is very slow. I guess it is because cost of setting and tearing the job are far greater then the job itself. Such small files also cause depletion of namespaces for file names.
I read that in this case I should use HDFS archive (HAR), but I am not sure how to modify this program WordCount to read from this archives. Can program continue to work without modification or some modification is necessary?
Even if I pack a lot of files in archives, the question remains if this will improve performance. I read that even if I pack multiple files, this files inside one archive will not be processed by one mapper, but many, which in my case (I guess) will not improve performance.
If this question is too simple, please understand that I am newbie to the Hadoop and have very little experience with it.

Using the HDFS won't change that you are causing hadoop to handle a large quantity of small files. The best option in this case is probably to cat the files into a single (or few large) file(s).
This will reduce the number of mappers you have, which will reduce the number of things required to be processed.
To use the HDFS can improve performance if you are operating on a distributed system. If you are only doing psuedo-distributed (one machine) then the HDFS isn't going to improve performance. The limitation is the machine.
When you are operating on a large number of small files, that will require a large number of mappers and reducers. The setup/down can be comparable to the processing time of the file itself, causing a large overhead. cating the files should reduce the number of mappers hadoop runs for the job, which should improve performance.
The benefit you could see from using the HDFS to store the files would be in distributed mode, with multiple machines. The files would be stored in blocks (default 64MB) across machines and each machine would be capable of processing a block of data that resides on the machine. This reduces network bandwidth use so it doesn't become a bottleneck in processing.
Archiving the files, if hadoop is going to unarchive them will just result in hadoop still having a large number of small files.
Hope this helps your understanding.

From my still limited understanding og Hadoop, I believe the right solution would be to create SequenceFile(s) containing your HTML files as values and possibly the URL as the key. If you do a M/R job over the SequenceFile(s), each mapper will process many files (depending on the split size). Each file will be presented to the map function as a single input.
You may want to use SequenceFileAsTextInputFormat as the InputFormat to read these files.
Also see: Providing several non-textual files to a single map in Hadoop MapReduce

I bookmarked this article recently to read it later and found the same question here :) The entry is a bit old, not exactly sure how relevant it is now. The changes to Hadoop are happening at a very rapid pace.
http://www.cloudera.com/blog/2009/02/the-small-files-problem/
The blog entry is by Tom White, who is also the author of "Hadoop: The Definitive Guide, Second Edition", a recommended read for those who are getting started with Hadoop.
http://oreilly.com/catalog/0636920010388

Can you concatenate files before submitting them to Hadoop?

CombineFileInputFormat can be used in this case which works well for large numaber of small files. This packs many of such files in a single split thus each mapper has more to process (1 split = 1 map task).
The overall processing time for mapreduce will also will also fall since there are lesser number of mappers running.
Since ther are no archive-aware InputFormat using CombineFileInputFormat will improve performance.

What are some tips for processing large files in Java

I need to perform a simple grep and other manipulations on large files in Java. I am not that familiar with the Java NIO utilities, but I am assuming that is what I need to use. What resources or helpful tips do you have for reading/writing large files. Also, I am working on a SWT application and need to display parts of that data within a text area on a GUI.

java.io.RandomAccessFile uses long for file-pointer offset so should be able to cope. However, you should read a chunk at a time otherwise overheads will be high. FileInputStream works similarly.
Java NIO shouldn't be too difficult. You don't need to mess around with Selectors or similar. In fact, prior to JDK7 you can't select with files. However, avoid mapping files. There is no unmap, so if you try to do it lots you'll run out of address space on 32-bit systems, or run into other problems (NIO does attempt to call GC, but it's a bit of a hack).

If all you are doing is reading the entire file a chunk at a time, with no special processing, then nio and java.io.RandomAccessFile are probably overkill. Just read and process the content of the file a block at a time. Ensure that you use a BufferedInputStream or BufferedReader.
If you have to read the entire file to do what you are doing, and you read only one file at a time, then you will gain little benefit from nio.

Maybe a little bit off topic: Have a look on VFS by apache. It's originally meant to be a library for hiding the ftp-http-file-whatever system behind a file system facade from your application's point of view. I mentioning it here because I have positive experience with accessing large files (via ftp) for searching, reading, copying etc. (large in that context means > 15MB) with this library.

Develop Reference

Java is a programming language and computing platform first released by Sun Microsystems in 1995.