How to disable the native zlib compression library in Hadoop (Java)

I have a large number of files stored in gz format and am trying to run a map-reduce program (using Pig) that reads those files. The problem I am running into is that the native decompressor in Hadoop (ZlibDecompressor) fails to decompress some of them with a data check error, even though I can read the same files successfully using Java's GZIPInputStream. My question is: is there a way to disable zlib? Or is there an alternative GzipCodec in Hadoop (2.7.2) that I can use to decompress gzip input files?
The error is given below:
org.apache.hadoop.mapred.TaskAttemptListenerImpl: Task: attempt_1475882463863_0108_m_000022_0 - exited : java.io.IOException: incorrect data check
at org.apache.hadoop.io.compress.zlib.ZlibDecompressor.inflateBytesDirect(Native Method)
at org.apache.hadoop.io.compress.zlib.ZlibDecompressor.decompress(ZlibDecompressor.java:228)
at org.apache.hadoop.io.compress.DecompressorStream.decompress(DecompressorStream.java:91)
at org.apache.hadoop.io.compress.DecompressorStream.read(DecompressorStream.java:85)
at java.io.InputStream.read(InputStream.java:101)
at org.apache.hadoop.util.LineReader.fillBuffer(LineReader.java:180)
at org.apache.hadoop.util.LineReader.readDefaultLine(LineReader.java:216)
at org.apache.hadoop.util.LineReader.readLine(LineReader.java:174)
Thank you very much for your help.

I found the answer myself. You can set the following property to disable all native libraries:
io.native.lib.available=false
Alternatively, you can extend org.apache.hadoop.io.compress.GzipCodec to remove the native zlib implementation for the gzip codec only, as sketched below.
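A rough sketch of the second option, assuming Hadoop 2.7.x (the class name is mine, and you would still need to make sure the CompressionCodecFactory resolves this codec rather than the stock GzipCodec for .gz files, e.g. via the io.compression.codecs property):

import org.apache.hadoop.io.compress.Decompressor;
import org.apache.hadoop.io.compress.GzipCodec;
import org.apache.hadoop.io.compress.zlib.BuiltInGzipDecompressor;

// Gzip codec that always decompresses with the pure-Java BuiltInGzipDecompressor
// instead of the native ZlibDecompressor.
public class PureJavaGzipCodec extends GzipCodec {

    @Override
    public Decompressor createDecompressor() {
        // Never hand out the native zlib decompressor.
        return new BuiltInGzipDecompressor();
    }

    @Override
    public Class<? extends Decompressor> getDecompressorType() {
        return BuiltInGzipDecompressor.class;
    }
}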

Related

NetCDF Java library error when reading HDF5

I am new to the NetCDF library and am currently using it to read the metadata from an HDF5 file in Java. After some reading I decided that NetCDF is a decent enough library for my use case, so I am using it.
However, at the first step, when I try to read the file, it throws an error:
LOGGER.debug("Inside Try");
//InputStream fileStream = new FileInputStream(h5File);
//parser.parse(fileStream, handler, metadata);
LOGGER.debug("path is :"+ h5File.getPath());
NetcdfFile hf5File = NetcdfFile.open(h5File.getPath());
LOGGER.debug("Got NetCdFile");
I am assuming the problem occurs when I try to open it, as the log says:
Inside Try
13:42:04.393 [main] DEBUG e.k.n.c.m.e.HDF5MetadataExtractor - path is :/var/www/webdav/admin/1151/data/XXXX.h5
13:42:04.495 [main] DEBUG ucar.nc2.NetcdfFile - Using IOSP ucar.nc2.iosp.hdf5.H5iosp
13:42:04.544 [main] ERROR ucar.nc2.iosp.hdf5.H5header - shape[0]=0 must be > 0
My HDF5 file contains a two-dimensional array of integers. I am not interested in the array as such, but in the metadata group associated with the file.
NetCDF-4 creates HDF5 files, it's true. The HDF5 library can read the HDF5-formatted files produced by NetCDF-4, but NetCDF-4 cannot read arbitrary HDF5 files.
Either you have found a bug in Netcdf-Java, or you have an odd HDF5 file. Have you confirmed that the file is not corrupted in some way? Things I would try:
- use the C utility 'ncdump -h' to inspect the header
- use the HDF5 C utility 'h5dump -H' to inspect the file via HDF5
If both of those commands give sensible output, then the issue might rest with netcdf-java.
Netcdf-Java is a pure Java implementation that can read most HDF5 files, details here: http://www.unidata.ucar.edu/software/thredds/current/netcdf-java/CDM/index.html#HDF5
If you find a failure, submit a bug report (and the file!) to netcdf-java@unidata.ucar.edu (must sign up here: http://www.unidata.ucar.edu/support/index.html#mailinglists). But first make sure you have tried the latest version (currently 4.6.3).
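If the file itself turns out to be fine and it opens, the group and attribute metadata the question is after can be listed with the plain NetcdfFile API. A rough sketch against netcdf-java 4.x (the exact attribute handling is illustrative):

import java.io.IOException;
import ucar.nc2.Attribute;
import ucar.nc2.Group;
import ucar.nc2.NetcdfFile;

public class H5MetadataDump {
    public static void main(String[] args) throws IOException {
        NetcdfFile ncfile = NetcdfFile.open(args[0]);   // e.g. /path/to/file.h5
        try {
            // File-level (global) attributes.
            for (Attribute att : ncfile.getGlobalAttributes()) {
                System.out.println("global: " + att);   // Attribute.toString() prints name and value
            }
            // Attributes attached to each top-level group.
            for (Group g : ncfile.getRootGroup().getGroups()) {
                System.out.println("group " + g.getShortName() + ":");
                for (Attribute att : g.getAttributes()) {
                    System.out.println("  " + att);
                }
            }
        } finally {
            ncfile.close();
        }
    }
}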
Netcdf-Java can write netCDF-4 files (which are HDF5 files) through a JNI interface to the netcdf-4 C library. HDF5 java libraries are also JNI interfaces, except directly to the HDF5 C library. HDF5 java libraries cannot write netCDF4 files (unless you really know what you are doing) because HDF5 does not implement shared dimensions, which are essential to the netCDF data model.
For earth science, this leads to the argument (disclaimer: by me) that you should write netCDF-4, not directly HDF5, details here: http://www.unidata.ucar.edu/blogs/developer/en/entry/dimensions_scales.
Well, that's probably more information than you wanted ;^)

How to get HDFS configuration info by using libhdfs.so

As the title says: in the Java API there are several methods in org.apache.hadoop.conf.Configuration to get details about what has been configured in the HDFS configuration files, such as hdfs-site.xml and core-site.xml. But I want to get this by using the C API, libhdfs.so. Could anybody help me?
For an example program using libhdfs, the C library for working with HDFS (Hadoop Distributed File System), use the following link:
libhdfs

How to get Acoustid (Chromaprint) from Java to identify mp3/m4a/etc

Has anyone managed to use AcoustID (http://acoustid.org/chromaprint) in a Java application? Accessing the Chromaprint C library should be easy, but I can't just pass in the audio file; it requires the raw uncompressed audio data.
I've tried using Xuggler to get the uncompressed audio but didn't get anywhere. Basically I have no idea how to get the raw audio from encoded files like mp3/m4a/etc.
Has anybody managed to make this work? Anyone mind sharing their code?
I suggest you use the fpcalc command-line tool (included in Chromaprint; binaries for Windows/Mac/Linux are available on the website) and run it in a subprocess from your Java application. You get output like the following, which should be easy to parse:
FILE=/path/to/file.mp3
DURATION=398
FINGERPRINT=AQADtEqkRIkkrQ...
That's how most programs integrate AcoustID and I believe it's the easiest way.
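A minimal sketch of that approach, assuming fpcalc is on the PATH (class and method names are mine):

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.HashMap;
import java.util.Map;

public class FpcalcRunner {

    // Runs fpcalc on the given audio file and returns its KEY=VALUE output as a map
    // (keys: FILE, DURATION, FINGERPRINT).
    public static Map<String, String> fingerprint(String audioPath)
            throws IOException, InterruptedException {
        Process p = new ProcessBuilder("fpcalc", audioPath)
                .redirectErrorStream(true)
                .start();
        Map<String, String> result = new HashMap<>();
        try (BufferedReader r = new BufferedReader(new InputStreamReader(p.getInputStream()))) {
            String line;
            while ((line = r.readLine()) != null) {
                int eq = line.indexOf('=');
                if (eq > 0) {
                    result.put(line.substring(0, eq), line.substring(eq + 1));
                }
            }
        }
        p.waitFor();
        return result;
    }

    public static void main(String[] args) throws Exception {
        Map<String, String> fp = fingerprint(args[0]);
        System.out.println("duration    = " + fp.get("DURATION"));
        System.out.println("fingerprint = " + fp.get("FINGERPRINT"));
    }
}

The duration and fingerprint values are what an AcoustID lookup request needs.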

How to read WAL files in the pg_xlog directory through Java

I am trying to read PostgreSQL WAL files. Can anybody tell me how to do that, and what type of binary encoding is used in the WAL files?
Use pg_xlogdump to read WAL files (this contrib program was added in PostgreSQL 9.3; see the PG 9.3 release docs).
This utility can only be run by the user who installed the server,
because it requires read-only access to the data directory.
pg_xlogdump --help
pg_xlogdump decodes and displays PostgreSQL transaction logs for debugging.
Usage:
pg_xlogdump [OPTION]... [STARTSEG [ENDSEG]]
Options:
-b, --bkp-details output detailed information about backup blocks
-e, --end=RECPTR stop reading at log position RECPTR
-f, --follow keep retrying after reaching end of WAL
-n, --limit=N number of records to display
-p, --path=PATH directory in which to find log segment files
(default: ./pg_xlog)
-r, --rmgr=RMGR only show records generated by resource manager RMGR
use --rmgr=list to list valid resource manager names
-s, --start=RECPTR start reading at log position RECPTR
-t, --timeline=TLI timeline from which to read log records
(default: 1 or the value used in STARTSEG)
-V, --version output version information, then exit
-x, --xid=XID only show records with TransactionId XID
-z, --stats[=record] show statistics instead of records
(optionally, show per-record statistics)
-?, --help show this help, then exit
For example: pg_xlogdump 000000010000005A00000096
See the PostgreSQL documentation or this blog post.
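Since the question asks for Java, the simplest route is to run pg_xlogdump in a subprocess and read its text output rather than decode the WAL bytes yourself. A rough sketch, assuming pg_xlogdump is on the PATH and the segment file is readable:

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;

public class WalDump {
    public static void main(String[] args) throws IOException, InterruptedException {
        // e.g. java WalDump /path/to/pg_xlog/000000010000005A00000096
        Process p = new ProcessBuilder("pg_xlogdump", args[0])
                .redirectErrorStream(true)
                .start();
        try (BufferedReader r = new BufferedReader(new InputStreamReader(p.getInputStream()))) {
            String line;
            while ((line = r.readLine()) != null) {
                System.out.println(line);   // one decoded WAL record per line
            }
        }
        System.exit(p.waitFor());
    }
}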
You can't really do that. It's easy enough to read the bytes from a WAL archive, but it sounds like you want to make sense of them. You will struggle with that.
WAL archives are a binary log showing what blocks changed in the database. They aren't SQL-level or row-level change logs, so you cannot just examine them to get a list of changed rows.
You probably want to investigate trigger-based replication or audit triggers instead.
The format is complicated and low-level as other answers imply.
However, if you have time to learn and understand the data that is stored, and know how to build the binary from source, there is a published reader for versions 8.3 to 9.2: xlogdump
The usual way to build it is as a contrib (Postgres add-on):
First get the source for the version of Postgres that you wish to view WAL data for.
./configure and make this, but no need to install
Then copy the xlogdump folder to the contrib folder (a git clone in that folder works fine)
Run make for xlogdump - it should find the parent postgres structure and build the binary
You can copy the binary to your path, or use it in situ. Be warned, there is still a lot of internal knowledge of Postgres required before you will understand what you are looking at. If you have the database available, it is possible to attempt to reverse out SQL statements from the log.
To perform this in Java, you could either wrap the executable, link the C library as a hybrid, or figure out how to do the parsing you need from source. Any of those options are likely to involve a lot of detailed work.
The WAL files are in the same format as the actual database files themselves, and depend on the exact version of PostgreSQL that you are using. You will probably need to examine the source code for your particular version to determine the exact format.

How to analyze JVM crash file hs_err_pidXYZ.log

While working on a webapp in Eclipse with Tomcat (WTP), Tomcat crashes and creates a file: hs_err_pid20216.log
I tried to use Eclipse MAT to analyze the file, but MAT doesn't recognize it as something it can handle; I also tried DAT and it was the same thing. The file won't show up in the open-file dialog.
What kind of file is it?
What should I use to analyze it?
Do I have to make changes to this file so that it will be possible for these tools to parse it?
The log file is available as a GitHub gist
UPDATE:
See @Dan Cruz's reply for more information on how to deal with the hs_err_pidXYZ.log file. For the curious, the cause of the crash was Jackson being confused by a cyclic relationship (a bidirectional one-to-many), but that is another story...
What kind of file is it?
It's a HotSpot error log file in text format.
What should I use to analyze it?
Start by downloading the OpenJDK 6 source bundle. Search through the hotspot *.cpp files for strings in the error log. Review the source files for an explanation of what the error log contains.
For example, using OpenJDK 7 sources, you can find siginfo (the operating system process signal information) in the os::print_siginfo() method of os_linux.cpp, Registers (the CPU registers' values) in the os::print_context() method of os_linux_x86.cpp, etc.
Do I have to make changes to this file so that it will be possible for these tools to parse it?
That would be impossible since the Eclipse Memory Analyzer requires a heap file, which the HotSpot error log is not.
https://fastthread.io gives a fairly descriptive analysis of the file. You just need to upload it, and it will report the following items:
Reason to crash
Recommended Solutions
Active Thread (when app crashed)
Core Dump Location
All threads
...
It's a text file. Open it in an editor and try to understand what it means.
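Since it is plain text, you can also pull the headline facts out programmatically; a small sketch (the "#" header block and the "Current thread" line are what typical HotSpot error logs contain):

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;

public class HsErrSummary {
    public static void main(String[] args) throws IOException {
        // Print the "#" header block (crash reason, problematic frame)
        // and the current-thread line from an hs_err_pidXYZ.log file.
        for (String line : Files.readAllLines(Paths.get(args[0]))) {
            if (line.startsWith("#") || line.startsWith("Current thread")) {
                System.out.println(line);
            }
        }
    }
}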
