How to convert .txt file to Hadoop's sequence file format - java

To effectively utilise map-reduce jobs in Hadoop, I need the data to be stored in Hadoop's sequence file format. However, the data is currently only in flat .txt format. Can anyone suggest a way I can convert a .txt file to a sequence file?

The simplest answer is an "identity" job that writes SequenceFile output.
In Java it looks like this:
public static void main(String[] args) throws IOException,
        InterruptedException, ClassNotFoundException {
    Configuration conf = new Configuration();
    Job job = new Job(conf);
    job.setJobName("Convert Text");
    job.setJarByClass(Mapper.class);

    job.setMapperClass(Mapper.class);
    job.setReducerClass(Reducer.class);

    // increase if you need sorting or a special number of files
    job.setNumReduceTasks(0);

    job.setOutputKeyClass(LongWritable.class);
    job.setOutputValueClass(Text.class);

    job.setOutputFormatClass(SequenceFileOutputFormat.class);
    job.setInputFormatClass(TextInputFormat.class);

    TextInputFormat.addInputPath(job, new Path("/lol"));
    SequenceFileOutputFormat.setOutputPath(job, new Path("/lolz"));

    // submit and wait for completion
    job.waitForCompletion(true);
}

Alternatively, you can write a sequence file directly with the SequenceFile.Writer API, as in the following example:

import java.io.IOException;
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

// White, Tom (2012-05-10). Hadoop: The Definitive Guide (Kindle Locations 5375-5384). O'Reilly Media. Kindle Edition.
public class SequenceFileWriteDemo {

    private static final String[] DATA = {
        "One, two, buckle my shoe",
        "Three, four, shut the door",
        "Five, six, pick up sticks",
        "Seven, eight, lay them straight",
        "Nine, ten, a big fat hen"
    };

    public static void main(String[] args) throws IOException {
        String uri = args[0];
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create(uri), conf);
        Path path = new Path(uri);
        IntWritable key = new IntWritable();
        Text value = new Text();
        SequenceFile.Writer writer = null;
        try {
            writer = SequenceFile.createWriter(fs, conf, path, key.getClass(), value.getClass());
            for (int i = 0; i < 100; i++) {
                key.set(100 - i);
                value.set(DATA[i % DATA.length]);
                System.out.printf("[% s]\t% s\t% s\n", writer.getLength(), key, value);
                writer.append(key, value);
            }
        } finally {
            IOUtils.closeStream(writer);
        }
    }
}

It depends on what the format of the TXT file is. Is it one line per record? If so, you can simply use TextInputFormat which creates one record for each line. In your mapper you can parse that line and use it whichever way you choose.
If it isn't one line per record, you might need to write your own InputFormat implementation. Take a look at this tutorial for more info.

You can also just create an intermediate table, LOAD DATA the csv contents straight into it, then create a second table stored as sequencefile (partitioned, clustered, etc.) and insert into it with a select from the intermediate table. You can also set options for compression, e.g.:
set hive.exec.compress.output = true;
set io.seqfile.compression.type = BLOCK;
set mapred.output.compression.codec = org.apache.hadoop.io.compress.SnappyCodec;
create table... stored as sequencefile;
insert overwrite table ... select * from ...;
The MR framework will then take care of the heavy lifting for you, saving you the trouble of having to write Java code.

Be careful with the format specifiers.
For example (note the space between % and s), System.out.printf("[% s]\t% s\t% s\n", writer.getLength(), key, value); will throw java.util.FormatFlagsConversionMismatchException: Conversion = s, Flags =
Instead, we should use:
System.out.printf("[%s]\t%s\t%s\n", writer.getLength(), key, value);

If your data is not on HDFS, you need to upload it to HDFS. Two options:
i) Use hdfs dfs -put on your .txt file and, once it is on HDFS, convert it to a sequence file.
ii) Take the text file as input on your HDFS client box and convert it to a SequenceFile using the Sequence File APIs, by creating a SequenceFile.Writer and appending (key, value) pairs to it.
If you don't care about the key, you can use the line number as the key and the complete line as the value.
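A minimal sketch of option ii, using the same SequenceFile.createWriter overload as the book example above; the local input path and the HDFS output URI are placeholder arguments:

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class TxtToSequenceFile {
    public static void main(String[] args) throws IOException {
        String localTxt = args[0];             // e.g. a local .txt file (placeholder)
        String seqUri = args[1];               // e.g. an hdfs://... output path (placeholder)
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create(seqUri), conf);
        LongWritable key = new LongWritable(); // line number as key
        Text value = new Text();               // complete line as value
        SequenceFile.Writer writer = null;
        BufferedReader reader = new BufferedReader(new FileReader(localTxt));
        try {
            writer = SequenceFile.createWriter(fs, conf, new Path(seqUri),
                    key.getClass(), value.getClass());
            String line;
            long lineNo = 0;
            while ((line = reader.readLine()) != null) {
                key.set(lineNo++);
                value.set(line);
                writer.append(key, value);
            }
        } finally {
            IOUtils.closeStream(writer);
            reader.close();
        }
    }
}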

If you have Mahout installed, it has a command called seqdirectory which can do this.

Printing to Excel File From Java Based on Columns

Apologies in advance for any formatting or other errors; I am still very much a newbie to Java.
I am conducting a gene expression analysis in which I have a program that prints ~6 million gene names and their expression values in 23,000 sets of 249 (there are 249 patients total and they each have 23,000 genes/gene expression values). Right now, the program loops through all 249 individual patient files, obtains the 23,000 gene values, and prints them to a text file (with 6 million rows and 2 columns, one column for the gene name and one for the expression value).
However, I would like the program to print to an Excel file instead, so that there are 249 rows (one for each patient) and 23,000 columns (one for each gene). I have been trying for a couple of days to do this (with Apache POI) and still am unable to. I found this example code: https://www.scientecheasy.com/2019/01/write-excel-file-in-java.html, which is what I have been trying to modify to fit my program, but nothing seems to be working. I have included my original program below (it prints to the text file, but I have also included the POI imports from the jars I downloaded). Any help would be MUCH appreciated!
import java.io.*;
import java.io.FileOutputStream;
import java.util.Scanner;
import org.apache.poi.xssf.usermodel.XSSFCell;
import org.apache.poi.xssf.usermodel.XSSFRow;
import org.apache.poi.xssf.usermodel.XSSFSheet;
import org.apache.poi.xssf.usermodel.XSSFWorkbook;

public class CreateExcel {

    public static final File folder = new File("C:/Users/menon/OneDrive/Documents/R/TARGET");
    private static PrintStream output;

    XSSFWorkbook wb = new XSSFWorkbook();
    XSSFSheet sheet1 = wb.createSheet("values");

    public static void main(String[] args) throws FileNotFoundException {
        output = new PrintStream(new File("final_CSRSEF_data.txt"));
        listFilesForFolder(folder);
    }

    public static double listFilesForFolder(final File folder) throws FileNotFoundException {
        double value = 0.0;
        // contains names of all the 23k genes in order to loop through the 249 files and collect the needed names each time
        File list = new File("C:/Users/menon/OneDrive/Documents/NamesOfGenes.txt");
        Scanner names = new Scanner(list);
        String data;
        while (names.hasNext()) {
            String name = names.next();
            // looping through all separate 249 patient files in folder and searching for gene name
            for (final File fileEntry : folder.listFiles()) {
                Scanner files = new Scanner(fileEntry);
                if (fileEntry.isDirectory()) {
                    listFilesForFolder(fileEntry);
                } else {
                    while (files.hasNextLine()) {
                        final String lineFromFile = files.nextLine();
                        if (lineFromFile.contains(name)) {
                            //System.out.print(name + " in file " + fileEntry.getName());
                            String[] thisOne = lineFromFile.split("\\s");
                            String res = thisOne[0];
                            //System.out.println(res);
                            if (res.equals(name)) {
                                print(lineFromFile);
                                print("\n");
                            }
                        }
                    }
                }
            }
            print("----------------");
            print("\n");
        }
        return 0.0;
    }

    // print to final_CSRSEF_data.txt
    private static void print(String stat) {
        output.print(stat);
    }
}
So basically, what I am printing before each "---------------" in the text file should instead go into a separate column (not row) of an Excel sheet.
Once again, thank you in advance!
Try looking at this: www.pela.it. On my home page there is a link to download a "gene expression" tool.
I wrote it in Java two years ago, and if it is what you are trying to do I'll be happy to help.
It processes raw XML output from a PCR tool and finally prints a paginated Excel file with all of the processing phases. There is also a PPT that explains it in detail.
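As a rough sketch of the Apache POI side of the question (one row per patient, one cell per gene value; the class, method, and data layout below are illustrative assumptions, not taken from the original program), writing the workbook could look something like this:

import java.io.FileOutputStream;
import java.io.IOException;
import org.apache.poi.ss.usermodel.Row;
import org.apache.poi.xssf.usermodel.XSSFSheet;
import org.apache.poi.xssf.usermodel.XSSFWorkbook;

public class ExpressionSheetSketch {
    // patientValues[p][g] holds the expression value of gene g for patient p (hypothetical input)
    public static void writeSheet(String[] geneNames, double[][] patientValues, String outFile) throws IOException {
        try (XSSFWorkbook wb = new XSSFWorkbook()) {
            XSSFSheet sheet = wb.createSheet("values");
            // header row: one column per gene name
            Row header = sheet.createRow(0);
            for (int g = 0; g < geneNames.length; g++) {
                header.createCell(g).setCellValue(geneNames[g]);
            }
            // one row per patient, one cell per gene value
            for (int p = 0; p < patientValues.length; p++) {
                Row row = sheet.createRow(p + 1);
                for (int g = 0; g < patientValues[p].length; g++) {
                    row.createCell(g).setCellValue(patientValues[p][g]);
                }
            }
            try (FileOutputStream out = new FileOutputStream(outFile)) {
                wb.write(out);
            }
        }
    }
}

One caveat: an .xlsx sheet is limited to 16,384 columns, so 23,000 genes as columns will not fit in a single sheet; putting genes in rows and patients in columns (249 columns) stays within the limits.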

How to read and write Parquet files efficiently?

I am working on a utility which reads multiple Parquet files at a time and writes them into one single output file.
The implementation is very straightforward: the utility reads the Parquet files from a directory, reads the Groups from all the files and puts them into a list, then uses ParquetWriter to write all these Groups into a single file. After reading 600 MB it throws an OutOfMemoryError for Java heap space. It also takes 15-20 minutes to read and write 500 MB of data.
Is there a way to make this operation more efficient?
Read method looks like this:
ParquetFileReader reader = new ParquetFileReader(conf, path, ParquetMetadataConverter.NO_FILTER);
ParquetMetadata readFooter = reader.getFooter();
MessageType schema = readFooter.getFileMetaData().getSchema();
ParquetFileReader r = new ParquetFileReader(conf, path, readFooter);
reader.close();
PageReadStore pages = null;
try {
    while (null != (pages = r.readNextRowGroup())) {
        long rows = pages.getRowCount();
        System.out.println("Number of rows: " + pages.getRowCount());
        MessageColumnIO columnIO = new ColumnIOFactory().getColumnIO(schema);
        RecordReader<Group> recordReader = columnIO.getRecordReader(pages, new GroupRecordConverter(schema));
        for (int i = 0; i < rows; i++) {
            Group g = (Group) recordReader.read();
            //printGroup(g);
            groups.add(g);
        }
    }
} finally {
    System.out.println("close the reader");
    r.close();
}
Write method is like this:
for (Path file : files) {
    groups.addAll(readData(file));
}
System.out.println("Number of groups from the parquet files " + groups.size());

Configuration configuration = new Configuration();
Map<String, String> meta = new HashMap<String, String>();
meta.put("startkey", "1");
meta.put("endkey", "2");

GroupWriteSupport.setSchema(schema, configuration);
ParquetWriter<Group> writer = new ParquetWriter<Group>(
        new Path(outputFile),
        new GroupWriteSupport(),
        CompressionCodecName.SNAPPY,
        2147483647,   // block size
        268435456,    // page size
        134217728,    // dictionary page size
        true,         // enable dictionary
        false,        // no validation
        ParquetProperties.WriterVersion.PARQUET_2_0,
        configuration);
System.out.println("Number of groups to write:" + groups.size());
for (Group g : groups) {
    writer.write(g);
}
writer.close();
I use these functions to merge Parquet files, but they are in Scala. Anyway, they may give you a good starting point.
import java.util
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.parquet.hadoop.{ParquetFileReader, ParquetFileWriter}
import org.apache.parquet.hadoop.util.{HadoopInputFile, HadoopOutputFile}
import org.apache.parquet.schema.MessageType
import scala.collection.JavaConverters._

object ParquetFileMerger {
  def mergeFiles(inputFiles: Seq[Path], outputFile: Path): Unit = {
    val conf = new Configuration()
    val mergedMeta = ParquetFileWriter.mergeMetadataFiles(inputFiles.asJava, conf).getFileMetaData
    val writer = new ParquetFileWriter(conf, mergedMeta.getSchema, outputFile, ParquetFileWriter.Mode.OVERWRITE)

    writer.start()
    inputFiles.foreach(input => writer.appendFile(HadoopInputFile.fromPath(input, conf)))
    writer.end(mergedMeta.getKeyValueMetaData)
  }

  def mergeBlocks(inputFiles: Seq[Path], outputFile: Path): Unit = {
    val conf = new Configuration()
    val parquetFileReaders = inputFiles.map(getParquetFileReader)
    val mergedSchema: MessageType =
      parquetFileReaders.
        map(_.getFooter.getFileMetaData.getSchema).
        reduce((a, b) => a.union(b))

    val writer = new ParquetFileWriter(HadoopOutputFile.fromPath(outputFile, conf), mergedSchema, ParquetFileWriter.Mode.OVERWRITE, 64 * 1024 * 1024, 8388608)

    writer.start()
    parquetFileReaders.foreach(_.appendTo(writer))
    writer.end(new util.HashMap[String, String]())
  }

  def getParquetFileReader(file: Path): ParquetFileReader = {
    ParquetFileReader.open(HadoopInputFile.fromPath(file, new Configuration()))
  }
}
I faced the very same problem. On not very big files (up to 100 megabytes), the writing time could be up to 20 minutes.
Try using the kite-sdk API. I know it looks abandoned, but some things in it are done very efficiently. Also, if you like Spring, you can try spring-data-hadoop (which is a kind of wrapper over the kite-sdk API). In my case the use of these libraries reduced the writing time to 2 minutes.
For example, you can write to Parquet (using spring-data-hadoop, though writing with the kite-sdk API looks quite similar) in this manner:
final DatasetRepositoryFactory repositoryFactory = new DatasetRepositoryFactory();
repositoryFactory.setBasePath(basePath);
repositoryFactory.setConf(configuration);
repositoryFactory.setNamespace("my-parquet-file");

DatasetDefinition datasetDefinition = new DatasetDefinition(targetClass, true, Formats.PARQUET.getName());
try (DataStoreWriter<T> writer = new ParquetDatasetStoreWriter<>(targetClass, repositoryFactory, datasetDefinition)) {
    for (T record : records) {
        writer.write(record);
    }
    writer.flush();
}
Of course you need to add some dependencies to your project (in my case this is spring-data-hadoop):
<dependency>
    <groupId>org.springframework.data</groupId>
    <artifactId>spring-data-hadoop</artifactId>
    <version>${spring.hadoop.version}</version>
</dependency>
<dependency>
    <groupId>org.springframework.data</groupId>
    <artifactId>spring-data-hadoop-boot</artifactId>
    <version>${spring.hadoop.version}</version>
</dependency>
<dependency>
    <groupId>org.springframework.data</groupId>
    <artifactId>spring-data-hadoop-config</artifactId>
    <version>${spring.hadoop.version}</version>
</dependency>
<dependency>
    <groupId>org.springframework.data</groupId>
    <artifactId>spring-data-hadoop-store</artifactId>
    <version>${spring.hadoop.version}</version>
</dependency>
If you absolutely want to do it using only the native Hadoop API, in any case it will be useful to take a look at the source code of these libraries in order to implement writing to Parquet files efficiently.
What you are trying to achieve is already possible using the merge command of parquet-tools. However, it is not recommended for merging small files, since it doesn't actually merge the row groups, it only places them one after the other (exactly as you describe in your question). The resulting file will probably have bad performance characteristics.
If you would like to implement it yourself nevertheless, you can either increase the heap size, or modify the code so that it does not read all of the files into memory before writing the new file, but instead reads them one by one (or, even better, row group by row group) and immediately writes them to the new file. This way you only need to keep a single file or row group in memory.
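A rough sketch of that streaming approach, reusing the classes from the question's own read and write methods (it assumes conf, files, and a ParquetWriter<Group> writer configured with a schema compatible with all inputs, as in the question's code):

// Stream records from each input file straight into the writer, so only one
// row group's worth of records is materialized at a time.
for (Path file : files) {
    ParquetMetadata footer = ParquetFileReader.readFooter(conf, file, ParquetMetadataConverter.NO_FILTER);
    MessageType schema = footer.getFileMetaData().getSchema();
    ParquetFileReader r = new ParquetFileReader(conf, file, footer);
    try {
        PageReadStore pages;
        while ((pages = r.readNextRowGroup()) != null) {
            long rows = pages.getRowCount();
            MessageColumnIO columnIO = new ColumnIOFactory().getColumnIO(schema);
            RecordReader<Group> recordReader = columnIO.getRecordReader(pages, new GroupRecordConverter(schema));
            for (long i = 0; i < rows; i++) {
                // write each record immediately instead of buffering it in a list
                writer.write(recordReader.read());
            }
        }
    } finally {
        r.close();
    }
}
writer.close();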
I implemented a solution using Spark with a PySpark script; below is sample code for the same. It loads multiple Parquet files from a directory location, and if the Parquet file schemas differ slightly between files it merges them as well.
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("App_name") \
    .getOrCreate()

dataset_DF = spark.read.option("mergeSchema", "true").load("/dir/parquet_files/")
dataset_DF.write.parquet("file_name.parquet")
Hope this short solution helps.

serialize array in hadoop

I want to serialize a String array "textData" and send it from the mapper to the reducer.
public void map(LongWritable key, Text value, OutputCollector<IntWritable, Text> output,
        Reporter reporter) throws IOException {
    Path pt = new Path("E:\\spambase.txt");
    FileSystem fs = FileSystem.get(new Configuration());
    BufferedReader textReader = new BufferedReader(new InputStreamReader(fs.open(pt)));
    int numberOfLines = readLines();
    String[] textData = new String[numberOfLines];
    int i;
    for (i = 0; i < numberOfLines; i++) {
        textData[i] = textReader.readLine();
    }
    textReader.close();
You seem to have some misunderstanding about how the MapReduce process works.
The mapper should ideally not read an entire file within itself.
A Job object generates a collection of InputSplits for a given input path.
By default, Hadoop reads one line of each split in the path (the input can be a directory), or just of the given file.
Each line is passed one at a time into Text value of your map class at the LongWritable key offset of the input.
It's not clear what you are trying to output, but you're looking for the ArrayWritable class, and you serialize data to a reducer using output.collect(). However, you need to modify your mapper output types from IntWritable, Text so that you can call output.collect(some_key, new ArrayWritable(textData)).
It's worth pointing out that you're using the deprecated mapred libraries, not the mapreduce ones, and that E:\\ is not an HDFS path but a local filesystem path.
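A minimal sketch of that ArrayWritable approach (old mapred API to match the question; the subclass is needed so Hadoop knows the element type when it deserializes the array on the reduce side, and the IntWritable key used here is arbitrary):

// Subclass so Hadoop can instantiate it reflectively and knows the element type.
public static class TextArrayWritable extends ArrayWritable {
    public TextArrayWritable() {
        super(Text.class);
    }
    public TextArrayWritable(Text[] values) {
        super(Text.class, values);
    }
}

// Inside map(), after textData has been filled, assuming the mapper/job output
// value type has been changed to TextArrayWritable:
Text[] texts = new Text[textData.length];
for (int i = 0; i < textData.length; i++) {
    texts[i] = new Text(textData[i]);
}
output.collect(new IntWritable(1), new TextArrayWritable(texts));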

Map Reduce program to merge multiple xml files to a single xml file

I am new to HDInsight and the Hadoop MapReduce concept. I am trying to merge multiple XML files into a single XML file using a MapReduce program. My intention is to merge each XML file into a destination XML file by prepending and appending the file name as start and end tags.
For example, the XML files below should be merged into the single XML shown below.
Input XML Files
<xml><a></a></xml>
<xml><b></b></xml>
<xml><c></c></xml>
Output XML File
<xml>
<File1Name><xml><a></a></xml></File1Name>
<File2Name><xml><b></b></xml></File2Name>
<File3Name><xml><c></c></xml></File3Name>
</xml>
Question 1: Is it possible to map one XML file to each mapper and create a key/value pair, with the key being the file name and the value being the XML content wrapped in the file name as start and end tags, and then have the reducer merge all the XMLs into a single context and output the XML shown above?
Question 2: How can I get the file name as the key in the mapper code?
Answer 1:
I don't suggest sending just a single XML to a mapper unless the files are over 1 GB apiece. You can send a list of XML locations to your mapper and then, in your mapper code, open each location and extract the data into your output.
Answer 2:
If you are using Azure blob storage, you can list all the blobs in a container and assign them to the input split.
How to create your list of InputSplits:
ArrayList<InputSplit> ret = new ArrayList<InputSplit>();
/* Do this for each path we receive. Creates a directory of splits in this order:
   s = input path (S1,1),(s2,1)...(sN,1),(s1,2),(sN,2),(sN,3) etc. */
for (int i = numMinNameHashSplits; i <= Math.min(numMaxNameHashSplits, numNameHashSplits - 1); i++) {
    for (Path inputPath : inputPaths) {
        ret.add(new ParseDirectoryInputSplit(inputPath.toString(), i));
        System.out.println(i + " " + inputPath.toString());
    }
}
return ret;
}
}
Once the List<InputSplit> is assembled, each InputSplit is handed to a record reader class, where each key/value pair is read and then passed to the map task. The initialization of the record reader class uses the InputSplit, a string representing the location of a "folder" of invoices in blob storage, to return a list of all blobs within the folder (the blobs variable below). The Java code below demonstrates the creation of the record reader for each hash slot and the resulting list of blobs in that location.
public class ParseDirectoryFileNameRecordReader
        extends RecordReader<IntWritable, Text> {
    private int nameHashSlot;
    private int numNameHashSlots;
    private Path myDir;
    private Path currentPath;
    private Iterator<ListBlobItem> blobs;
    private int currentLocation;

    public void initialize(InputSplit split, TaskAttemptContext context)
            throws IOException, InterruptedException {
        myDir = ((ParseDirectoryInputSplit) split).getDirectoryPath();

        // getNameHashSlot tells us which slot this record reader is responsible for
        nameHashSlot = ((ParseDirectoryInputSplit) split).getNameHashSlot();

        // gets the total number of hash slots
        numNameHashSlots = getNumNameHashSplits(context.getConfiguration());

        // gets the input credentials for the storage account assigned to this record reader
        String inputCreds = getInputCreds(context.getConfiguration());

        // break the directory path apart to get the account name
        String[] authComponents = myDir.toUri().getAuthority().split("#");
        String accountName = authComponents[1].split("\\.")[0];
        String containerName = authComponents[0];
        String accountKey = Utils.returnInputkey(inputCreds, accountName);

        System.out.println("This mapper is assigned the following account:" + accountName);

        StorageCredentials creds = new StorageCredentialsAccountAndKey(accountName, accountKey);
        CloudStorageAccount account = new CloudStorageAccount(creds);
        CloudBlobClient client = account.createCloudBlobClient();
        CloudBlobContainer container = client.getContainerReference(containerName);

        blobs = container.listBlobs(myDir.toUri().getPath().substring(1) + "/", true,
                EnumSet.noneOf(BlobListingDetails.class), null, null).iterator();
        currentLocation = -1;
        return;
    }
Once initialized, the record reader is used to pass the next key to the map task. This is controlled by the nextKeyValue method, and it is called each time the framework needs the next record. The Java code below demonstrates this.
// This checks whether the next key/value is assigned to this task or to another mapper.
// If it is assigned to this task, the location is passed to the mapper; otherwise return false.
@Override
public boolean nextKeyValue() throws IOException, InterruptedException {
    while (blobs.hasNext()) {
        ListBlobItem currentBlob = blobs.next();
        // Returns a number between 1 and the number of hash slots. If it matches the number assigned
        // to this mapper and its length is greater than 0, return the path to the map function.
        if (doesBlobMatchNameHash(currentBlob) && getBlobLength(currentBlob) > 0) {
            String[] pathComponents = currentBlob.getUri().getPath().split("/");
            String pathWithoutContainer =
                    currentBlob.getUri().getPath().substring(pathComponents[1].length() + 1);
            currentPath = new Path(myDir.toUri().getScheme(), myDir.toUri().getAuthority(), pathWithoutContainer);
            currentLocation++;
            return true;
        }
    }
    return false;
}
The logic in the map function is then simply as follows, with inputStream containing the entire XML string:
Path inputFile = new Path(value.toString());
FileSystem fs = inputFile.getFileSystem(context.getConfiguration());
//Input stream contains all data from the blob in the location provided by Text
FSDataInputStream inputStream = fs.open(inputFile);
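From there, a rough sketch of what the rest of the map body could do to produce the (file name, wrapped XML) records the question asks for, assuming the map output types are Text/Text (the variable names beyond those shown above are illustrative):

// Read the whole blob into a buffer and wrap it with the file name as start/end tags.
String fileName = inputFile.getName();
ByteArrayOutputStream buffer = new ByteArrayOutputStream();
IOUtils.copyBytes(inputStream, buffer, context.getConfiguration(), false);
inputStream.close();

String wrapped = "<" + fileName + ">" + buffer.toString("UTF-8") + "</" + fileName + ">";
// Emit the file name as the key and the wrapped XML as the value; a single reducer
// can then concatenate all values inside one enclosing root element.
context.write(new Text(fileName), new Text(wrapped));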
Resources:
http://www.andrewsmoll.com/3-hacks-for-hadoop-and-hdinsight-clusters/ "Hack 3"
http://blogs.msdn.com/b/mostlytrue/archive/2014/04/10/merging-small-files-on-hdinsight.aspx

Change the default delimiter of the mapreduce

Hi, I am a beginner with MapReduce, and I want to program WordCount so that it outputs the K/V pairs. The problem is that I don't want to use 'tab' as the key/value delimiter for the file. How can I change it?
The code I use is slightly different from the example one. Here is the driver class:
Configuration conf = new Configuration();
Job job = Job.getInstance(conf, "Job1");
job.setJarByClass(Simpletask.class);
job.setMapperClass(TokenizerMapper.class);
//job.setCombinerClass(IntSumReducer.class);
job.setReducerClass(IntSumReducer.class);
job.setOutputKeyClass(IntWritable.class);
job.setOutputValueClass(Text.class);
FileInputFormat.addInputPath(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
LazyOutputFormat.setOutputFormatClass(job, TextOutputFormat.class);
Since I want the file name to correspond to the partition of the reducer, I use MultipleOutputs.write() in the reduce function, so the code is slightly different:
public void reduce(IntWritable key, Iterable<Text> values, Context context) throws IOException, InterruptedException {
    String accu = "";
    for (Text val : values) {
        String[] entry = val.toString().split(",");
        String MBR = entry[1];
        // ASSUME MBR IS ENTRY 1. IT CAN BE REPLACED BY INVOKING FUNCTION TO CALCULATE MBR([COORDINATES])
        String mes_line = entry[0] + ",MBR" + MBR + " ";
        result.set(mes_line);
        mos.write(key, result, generateFileName(key));
    }
Any help will be appreciated! Thank you!
Since you are using FileInputFormat, the key is the line offset in the file, and the value is a line from the input file. It's up to the mapper to split the input line with any delimiter; you can use that to split the record read in the map method. The default behavior comes from the specific input format, e.g. TextInputFormat.
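A minimal sketch of that idea for the mapper, assuming comma-separated input records whose first field is an integer key (the class name and field layout are illustrative assumptions, matched to the IntWritable/Text output types in the driver above):

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class CustomDelimiterMapper extends Mapper<LongWritable, Text, IntWritable, Text> {
    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // The value is one full line; split it on whatever delimiter the data uses.
        String[] fields = value.toString().split(",");
        if (fields.length >= 2) {
            // Emit the first field as an int key and the second field as the value (illustrative layout).
            context.write(new IntWritable(Integer.parseInt(fields[0].trim())), new Text(fields[1]));
        }
    }
}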
