Best way to read a TSV file using Apache Spark in Java

I have a TSV file, where the first line is the header. I want to create a JavaPairRDD from this file. Currently, I'm doing so with the following code:
TsvParser tsvParser = new TsvParser(new TsvParserSettings());
List<String[]> allRows;
List<String> headerRow;
try (BufferedReader reader = new BufferedReader(new FileReader(myFile))) {
    allRows = tsvParser.parseAll(reader);
    // Removes the header row
    headerRow = Arrays.asList(allRows.remove(0));
}
JavaPairRDD<String, MyObject> myObjectRDD = javaSparkContext
        .parallelize(allRows)
        .mapToPair(row -> new Tuple2<>(row[0], myObjectFromArray(row)));
I was wondering if there was a way to have the javaSparkContext read and process the file directly instead of splitting the operation into two parts.
EDIT: This is not a duplicate of How do I convert csv file to rdd, because I'm looking for an answer in Java, not Scala.

Use https://github.com/databricks/spark-csv:
import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.SQLContext;

SQLContext sqlContext = new SQLContext(sc);
DataFrame df = sqlContext.read()
    .format("com.databricks.spark.csv")
    .option("inferSchema", "true")
    .option("header", "true")
    .option("delimiter", "\t")
    .load("cars.csv");

df.select("year", "model").write()
    .format("com.databricks.spark.csv")
    .option("header", "true")
    .save("newcars.csv");
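Since the question asks for a JavaPairRDD rather than a DataFrame, here is a minimal sketch of the conversion, continuing from the df above (myObjectFromRow is a hypothetical variant of the asker's myObjectFromArray that builds a MyObject from a Spark Row instead of a String[]):
// Key each Row by its first column and map the rest to the domain object.
JavaPairRDD<String, MyObject> myObjectRDD = df.javaRDD()
        .mapToPair(row -> new Tuple2<>(row.getString(0), myObjectFromRow(row)));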

Try the code below to read a CSV file and create a JavaPairRDD.
public class SparkCSVReader {

    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("CSV Reader");
        JavaSparkContext sc = new JavaSparkContext(conf);
        JavaRDD<String> allRows = sc.textFile("c:\\temp\\test.csv"); // read csv file
        String header = allRows.first(); // take out header
        JavaRDD<String> filteredRows = allRows.filter(row -> !row.equals(header)); // filter header
        JavaPairRDD<String, MyCSVFile> filteredRowsPairRDD = filteredRows.mapToPair(parseCSVFile); // create pairs
        filteredRowsPairRDD.foreach(data -> {
            System.out.println(data._1() + " ### " + data._2().toString()); // print row and object
        });
        sc.stop();
        sc.close();
    }

    private static PairFunction<String, String, MyCSVFile> parseCSVFile = (row) -> {
        String[] fields = row.split(",");
        return new Tuple2<String, MyCSVFile>(row, new MyCSVFile(fields[0], fields[1], fields[2]));
    };
}
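For the asker's tab-separated file, only the delimiter in the pair function changes; a sketch (parseTsvLine is just a hypothetical rename of parseCSVFile above):
private static PairFunction<String, String, MyCSVFile> parseTsvLine = (row) -> {
    String[] fields = row.split("\t"); // split on tabs instead of commas
    return new Tuple2<String, MyCSVFile>(row, new MyCSVFile(fields[0], fields[1], fields[2]));
};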
You can also use Databricks spark-csv (https://github.com/databricks/spark-csv); its functionality is built into Spark 2.0.0.

Apache Spark 2.x has a built-in CSV reader, so you don't have to use https://github.com/databricks/spark-csv:
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

/**
 * @author cpu11453local
 */
public class Main {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .master("local")
                .appName("meowingful")
                .getOrCreate();
        Dataset<Row> df = spark.read()
                .option("header", "true")
                .option("delimiter", "\t")
                .csv("hdfs://127.0.0.1:9000/data/meow_data.csv");
        df.show();
    }
}
And the Maven pom.xml file:
<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>
    <groupId>com.meow.meowingful</groupId>
    <artifactId>meowingful</artifactId>
    <version>1.0-SNAPSHOT</version>
    <packaging>jar</packaging>
    <properties>
        <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
        <maven.compiler.source>1.8</maven.compiler.source>
        <maven.compiler.target>1.8</maven.compiler.target>
    </properties>
    <dependencies>
        <!-- https://mvnrepository.com/artifact/org.apache.spark/spark-core_2.11 -->
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-core_2.11</artifactId>
            <version>2.2.0</version>
        </dependency>
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-sql_2.11</artifactId>
            <version>2.2.0</version>
        </dependency>
    </dependencies>
</project>
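If you still want the JavaPairRDD from the original question rather than a Dataset, here is a minimal sketch of one way to get there from the df read in the Main class above (it keys by the first column and keeps the whole Spark Row as the value):
JavaPairRDD<String, Row> pairs = df.toJavaRDD()
        .keyBy(row -> row.getString(0)); // key = first column, value = the Row itself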

I'm the author of uniVocity-parsers and can't help you much with Spark, but I believe something like this can work for you:
parserSettings.setHeaderExtractionEnabled(true); // captures the header row
parserSettings.setProcessor(new AbstractRowProcessor() {
    @Override
    public void rowProcessed(String[] row, ParsingContext context) {
        String[] headers = context.headers(); // not sure if you need them
        JavaPairRDD<String, MyObject> myObjectRDD = javaSparkContext
                .mapToPair(row -> new Tuple2<>(row[0], myObjectFromArray(row)));
        // process your stuff.
    }
});
If you want to parallelize the processing of each row, you can wrap it in a ConcurrentRowProcessor:
parserSettings.setProcessor(new ConcurrentRowProcessor(new AbstractRowProcessor() {
    @Override
    public void rowProcessed(String[] row, ParsingContext context) {
        String[] headers = context.headers(); // not sure if you need them
        JavaPairRDD<String, MyObject> myObjectRDD = javaSparkContext
                .mapToPair(row -> new Tuple2<>(row[0], myObjectFromArray(row)));
        // process your stuff.
    }
}, 1000)); // 1000 rows loaded in memory.
Then just call parse:
new TsvParser(parserSettings).parse(myFile);
Hope this helps!
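For reference, here is a minimal sketch combining header extraction with the asker's original two-step approach, assuming setHeaderExtractionEnabled also applies to parseAll (so the manual remove(0) is no longer needed):
TsvParserSettings settings = new TsvParserSettings();
settings.setHeaderExtractionEnabled(true);        // header row is consumed by the parser
TsvParser parser = new TsvParser(settings);

List<String[]> rows;
try (BufferedReader reader = new BufferedReader(new FileReader(myFile))) {
    rows = parser.parseAll(reader);               // data rows only, no header
}
JavaPairRDD<String, MyObject> myObjectRDD = javaSparkContext
        .parallelize(rows)
        .mapToPair(row -> new Tuple2<>(row[0], myObjectFromArray(row)));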

Related

Unexpected behaviour of Apache Commons CollectionUtils addAll(Collection collection, Object[] elements)

Brief description:
Basically, inside my code I want to add a List<String> as a value to every key inside a ListValuedMap<String, List<String>>. I did some testing on the created ListValuedMap spreadNormal with
System.out.println(spreadNormal.keySet());
System.out.println(spreadNormal.values());
and it showed me that the keys are inside the map (unsorted), but the corresponding values are empty. I am clearing the inserted List<String> with list.clear() after each loop, after filling it via Collections.addAll(...).
I would have expected that a copy of these values stays in my ListValuedMap, but my results are:
[Agios Pharmaceuticals Inc., Vicor Corp., EDP Renováveis S.A., Envista Holdings Corp., JENOPTIK AG,...]
[[], [], [], [], [], ...]
My expected result is more like this:
[Agios Pharmaceuticals Inc., ...] =
[["US00847X1046", "30,60", "30,80", "0,65"], ....]
Can you provide some explanations on that? Is the default behaviour of the Collections.addAll method to copy the reference to an object, instead of the object itself?
The corresponding code section is highlighted with // ++++++++++
Full code example (Eclipse):
import java.io.*;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.select.Elements;
import java.util.*;
import org.apache.commons.collections4.ListValuedMap;
import org.apache.commons.collections4.MultiSet;
import org.apache.commons.collections4.multimap.ArrayListValuedHashMap;

public class Webdata {

    public static void main(String[] args) throws IOException
    {
        long start = System.currentTimeMillis();
        parseData();
        System.out.println(secondsElapsed(start) + " seconds processing time");
    }

    private static void parseData() throws IOException
    {
        List<String> subdirectories = new ArrayList<>();
        String chars = "ABCDEFGHIJKLMNOPQRSTUVWXYZ";
        String errorMessage1 = "String formatting problem";
        String errorMessage2 = "String object non existent";
        for (int i = 0; i < chars.length(); i++)
        {
            subdirectories.add("https://www.tradegate.de/indizes.php?buchstabe=" + chars.charAt(i));
        }

        List<String> stockMetadata = new ArrayList<>();
        ListValuedMap<String, List<String>> nonError = new ArrayListValuedHashMap<>();
        ListValuedMap<String, List<String>> numFormatError = new ArrayListValuedHashMap<>();
        ListValuedMap<String, List<String>> nullPointerError = new ArrayListValuedHashMap<>();
        ListValuedMap<String, List<String>> spreadTooHigh = new ArrayListValuedHashMap<>();
        ListValuedMap<String, List<String>> spreadNormal = new ArrayListValuedHashMap<>();
        int cap1 = 44;
        int cap2 = 56;

        for (int suffix = 0; suffix < chars.length(); suffix++)
        {
            Document doc = Jsoup.connect(subdirectories.get(suffix).toString()).get();
            Elements htmlTableRows = doc.getElementById("kursliste_abc").select("tr");
            htmlTableRows.forEach(tr ->
            {
                String stockName = tr.child(0).text();
                String bid_price = tr.child(1).text();
                String ask_price = tr.child(2).text();
                String isin = tr.child(0).selectFirst("a").absUrl("href").substring(cap1, cap2);
                // +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
                try
                {
                    if (calcSpread(bid_price, ask_price) < 5)
                    {
                        Collections.addAll(stockMetadata, isin, bid_price, ask_price, calcSpread(bid_price, ask_price).toString());
                        spreadNormal.put(stockName, stockMetadata);
                    }
                    else if (calcSpread(bid_price, ask_price) > 5)
                    {
                        Collections.addAll(stockMetadata, isin, bid_price, ask_price, calcSpread(bid_price, ask_price).toString());
                        spreadTooHigh.put(stockName, stockMetadata);
                    }
                    stockMetadata.clear();
                }
                catch (NumberFormatException e)
                {
                    Collections.addAll(stockMetadata, e.getMessage());
                    numFormatError.put(stockName, stockMetadata);
                    stockMetadata.clear();
                }
                catch (NullPointerException Ne)
                {
                    Collections.addAll(stockMetadata, Ne.getMessage());
                    nullPointerError.put(stockName, stockMetadata);
                    stockMetadata.clear();
                } // end of try-catch
                // +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
            }); // end of for-each loop htmlTableRows
        } // end of JSOUP method

        System.out.println(spreadNormal.keySet());
        System.out.println(spreadNormal.values());
    } // end of parseData()

    public static Float calcSpread(String arg1, String arg2)
    {
        try
        {
            Float bid = Float.parseFloat(arg1.replace(",", "."));
            Float ask = Float.parseFloat(arg2.replace(",", "."));
            Float spread = ((ask - bid) / ask) * 100;
            return spread;
        }
        catch (NumberFormatException e)
        {
            return null;
        }
    }

    public static Long secondsElapsed(Long start) {
        Long startTime = start;
        Long endTime = System.currentTimeMillis();
        Long timeDifference = endTime - startTime;
        return timeDifference / 1000;
    }
} // end of class
pom.xml:
<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 https://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>
    <groupId>TEST</groupId>
    <artifactId>TEST</artifactId>
    <version>0.0.1-SNAPSHOT</version>
    <properties>
        <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
        <maven.compiler.source>1.7</maven.compiler.source>
        <maven.compiler.target>1.7</maven.compiler.target>
    </properties>
    <dependencies>
        <dependency>
            <groupId>org.jsoup</groupId>
            <artifactId>jsoup</artifactId>
            <version>1.11.3</version>
        </dependency>
        <dependency>
            <groupId>org.apache.commons</groupId>
            <artifactId>commons-collections4</artifactId>
            <version>4.4</version>
        </dependency>
        <dependency>
            <groupId>org.apache.commons</groupId>
            <artifactId>commons-lang3</artifactId>
            <version>3.0</version>
        </dependency>
    </dependencies>
    <build>
        <sourceDirectory>src</sourceDirectory>
        <plugins>
            <plugin>
                <artifactId>maven-compiler-plugin</artifactId>
                <version>3.8.1</version>
                <configuration>
                    <release>18</release>
                </configuration>
            </plugin>
        </plugins>
    </build>
</project>
There is nothing wrong with Collections.addAll.
I believe you expect all of your maps to be ListValuedMap<String, String>, which is effectively a Map<String, List<String>>. As declared, your ListValuedMap<String, List<String>> is effectively a Map<String, List<List<String>>>.
Just update each of your maps to be like below so each key is mapped to a List<String>:
ListValuedMap<String, String> nonError = new ArrayListValuedHashMap<>();
ListValuedMap<String, String> numFormatError = new ArrayListValuedHashMap<>();
ListValuedMap<String, String> nullPointerError = new ArrayListValuedHashMap<>();
ListValuedMap<String, String> spreadTooHigh = new ArrayListValuedHashMap<>();
ListValuedMap<String, String> spreadNormal = new ArrayListValuedHashMap<>();
And then, instead of using ListValuedMap.put(K key, V value) you have to use ListValuedMap.putAll(K key, Iterable<? extends V> values) like this:
spreadNormal.putAll(stockName, stockMetadata);
spreadTooHigh.putAll(stockName, stockMetadata);
numFormatError.putAll(stockName, stockMetadata);
nullPointerError.putAll(stockName, stockMetadata);
The putAll method will iterate over the stockMetadata iterable and add the elements one by one into the map's underlying list.
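Here is a minimal, self-contained sketch of the suggested change, using values from the question, to show that the copied elements survive clearing the source list (the class name PutAllDemo is just illustrative):
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import org.apache.commons.collections4.ListValuedMap;
import org.apache.commons.collections4.multimap.ArrayListValuedHashMap;

public class PutAllDemo {
    public static void main(String[] args) {
        ListValuedMap<String, String> spreadNormal = new ArrayListValuedHashMap<>();
        List<String> stockMetadata = new ArrayList<>();

        Collections.addAll(stockMetadata, "US00847X1046", "30,60", "30,80", "0,65");
        spreadNormal.putAll("Agios Pharmaceuticals Inc.", stockMetadata); // copies the elements into the map's list
        stockMetadata.clear();                                            // does not affect the map

        System.out.println(spreadNormal.get("Agios Pharmaceuticals Inc."));
        // prints [US00847X1046, 30,60, 30,80, 0,65]
    }
}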

Can't find a codec for class org.json.JSONArray

private void getUsersWithin24Hours(String id, Map<String, Object> payload) throws JSONException {
    JSONObject json = new JSONObject(String.valueOf(payload.get("data")));
    Query query = new Query();
    query.addCriteria(Criteria.where("user_id").is(id)
            .and("timezone").in(json.get("timezone"))
            .and("gender").in(json.get("gender"))
            .and("locale").in(json.get("language"))
            .and("time").gt(getDate()));
    mongoTemplate.getCollection("user_log").distinct("user_id", query.getQueryObject());
}
I wanted to make a query and get results from MongoDB, and I succeeded with this mongo shell command:
db.getCollection('user_log').find({"user_id" : "1", "timezone" : {$in: [5,6]}, "gender" : {$in : ["male", "female"]}, "locale" : {$in : ["en_US"]}, "time" : {$gt : new ISODate("2017-01-26T16:57:52.354Z")}})
but when I tried it from Java, it gave me the error below.
org.bson.codecs.configuration.CodecConfigurationException: Can't find
a codec for class org.json.JSONArray
What is the ideal way to do this?
Hint: I think the error in my code occurs at json.get("timezone"), because it contains an array. When I use hardcoded string arrays, this code works.
You don't have to use JSONObject/JSONArray for conversion.
Replace it with the line below if payload.get("data") is a Map:
BasicDBObject json = new BasicDBObject((Map<String, Object>) payload.get("data"));
Replace it with the line below if payload.get("data") holds a JSON string:
BasicDBObject json = (BasicDBObject) JSON.parse((String) payload.get("data"));
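Another option, if you want to keep org.json, is to convert each JSONArray into a plain java.util.List before building the query, since the driver can encode standard collections. A sketch (toList is a hypothetical helper; Spring Data's Criteria.in(...) also accepts a Collection):
private static List<Object> toList(JSONArray array) throws JSONException {
    List<Object> values = new ArrayList<>();
    for (int i = 0; i < array.length(); i++) {
        values.add(array.get(i));
    }
    return values;
}

// Then build the criteria with plain Lists instead of JSONArray values:
query.addCriteria(Criteria.where("user_id").is(id)
        .and("timezone").in(toList(json.getJSONArray("timezone")))
        .and("gender").in(toList(json.getJSONArray("gender")))
        .and("locale").in(toList(json.getJSONArray("language")))
        .and("time").gt(getDate()));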
Here's an example from a MongoDB University course, using a MongoDB database named "students" with a collection named "grades":
pom.xml
<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>
    <groupId>com.mongodb</groupId>
    <artifactId>test</artifactId>
    <version>1.0-SNAPSHOT</version>
    <packaging>jar</packaging>
    <name>test</name>
    <url>http://maven.apache.org</url>
    <properties>
        <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
    </properties>
    <dependencies>
        <dependency>
            <groupId>org.mongodb</groupId>
            <artifactId>mongodb-driver</artifactId>
            <version>3.2.2</version>
        </dependency>
    </dependencies>
</project>
com/mongo/Main.java
package com.mongo;

import com.mongodb.MongoClient;
import com.mongodb.client.*;
import org.bson.Document;
import org.bson.conversions.Bson;
import javax.print.Doc;

public class Main {
    public static void main(String[] args) {
        MongoClient client = new MongoClient();
        MongoDatabase database = client.getDatabase("students");
        final MongoCollection<Document> collection = database.getCollection("grades");
        Bson sort = new Document("student_id", 1).append("score", 1);
        MongoCursor<Document> cursor = collection.find().sort(sort).iterator();
        try {
            Integer student_id = -1;
            while (cursor.hasNext()) {
                Document document = cursor.next();
                // Doing more stuff
            }
        } finally {
            cursor.close();
        }
    }
}

Error while inserting an array of JSON documents in to MongoDB using Java

I am trying to insert a JSON string which contains an array of documents, but I am getting the following exception.
MongoDB server version: 3.0.6
Mongo-Java driver version: 3.1.0
I understand that the insertOne() method is used to insert just one document, but here it's an array of documents. I am not sure how to use the insertMany() method here.
Please guide.
JSON String that I want to insert:
json = [{"freightCompanyId":201,"name":"USPS","price":8.00},{"freightCompanyId":202,"name":"FedEx","price":10.00},{"freightCompanyId":203,"name":"UPS","price":12.00},{"freightCompanyId":204,"name":"Other","price":15.00}]
Exception Log:
Exception in thread "main" org.bson.BsonInvalidOperationException: readStartDocument can only be called when CurrentBSONType is DOCUMENT, not when CurrentBSONType is ARRAY.
at org.bson.AbstractBsonReader.verifyBSONType(AbstractBsonReader.java:655)
at org.bson.AbstractBsonReader.checkPreconditions(AbstractBsonReader.java:687)
at org.bson.AbstractBsonReader.readStartDocument(AbstractBsonReader.java:421)
at org.bson.codecs.DocumentCodec.decode(DocumentCodec.java:138)
at org.bson.codecs.DocumentCodec.decode(DocumentCodec.java:45)
at org.bson.Document.parse(Document.java:105)
at org.bson.Document.parse(Document.java:90)
at com.ebayenterprise.ecp.jobs.Main.insert(Main.java:52)
at com.ebayenterprise.ecp.jobs.Main.main(Main.java:31)
Main.java
public class Main {

    private static final Logger LOG = Logger.getLogger(Main.class);

    public static void main(String[] args) throws IOException {
        String json = getAllFreightCompanies();
        insert(json);
    }

    private static String getAllFreightCompanies() throws IOException {
        FreightCompanyDao freightCompanyDao = new FreightCompanyDaoImpl(DataSourceFactory.getDataSource(DatabaseType.POSTGRES.name()));
        List<FreightCompany> freightCompanies = freightCompanyDao.getAllFreightCompanies();
        return GenericUtils.toJson(freightCompanies);
    }

    private static void insert(String json) {
        MongoClient mongoClient = new MongoClient("GSI-547576", 27017);
        MongoDatabase database = mongoClient.getDatabase("test");
        MongoCollection<Document> table = database.getCollection("fc");
        Document document = Document.parse(json);
        table.insertOne(document);
    }
}
GenericUtils.java
public final class GenericUtils {

    private static final Logger LOG = Logger.getLogger(GenericUtils.class);

    private GenericUtils() {
    }

    public static String toJson(List<FreightCompany> freightCompanies) throws IOException {
        String json = new ObjectMapper().writer().writeValueAsString(freightCompanies);
        LOG.debug("json = " + json);
        return json;
    }
}
pom.xml
<dependency>
    <groupId>org.mongodb</groupId>
    <artifactId>mongo-java-driver</artifactId>
    <version>3.1.0</version>
    <type>jar</type>
</dependency>
<dependency>
    <groupId>org.codehaus.jackson</groupId>
    <artifactId>jackson-mapper-asl</artifactId>
    <version>1.9.13</version>
</dependency>
You should either insert the documents one by one, or create a List of documents and use insertMany().
Here's an example:
MongoClient mongoClient = new MongoClient("GSI-547576", 27017);
MongoDatabase database = mongoClient.getDatabase("test");
MongoCollection<Document> table = database.getCollection("fc");
FreightCompanyDao freightCompanyDao = new FreightCompanyDaoImpl(DataSourceFactory.getDataSource(DatabaseType.POSTGRES.name()));
List<FreightCompany> freightCompanies = freightCompanyDao.getAllFreightCompanies();
for (FreightCompany company : freightCompanies) {
    // assumes a toJson variant that serializes a single FreightCompany
    Document doc = Document.parse(GenericUtils.toJson(company));
    table.insertOne(doc);
}
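And a sketch of the insertMany() variant, building the List<Document> first; each company is serialized with Jackson's ObjectMapper here, since the question's GenericUtils.toJson takes a whole list (writeValueAsString throws a subclass of IOException, which the question's main already declares):
List<Document> documents = new ArrayList<>();
ObjectMapper mapper = new ObjectMapper();
for (FreightCompany company : freightCompanies) {
    documents.add(Document.parse(mapper.writeValueAsString(company)));
}
table.insertMany(documents); // one round trip for the whole batch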

Hadoop: java.lang.IncompatibleClassChangeError: Found interface org.apache.hadoop.mapreduce.JobContext, but class was expected

My MapReduce job runs OK when assembled in Eclipse, with all possible Hadoop and Hive jars included in the Eclipse project as dependencies. (These are the jars that come with a single-node, local Hadoop installation.)
Yet when I try to run the same program assembled using a Maven project (see below), I get:
Exception in thread "main" java.lang.IncompatibleClassChangeError: Found interface org.apache.hadoop.mapreduce.JobContext, but class was expected
This exception happens when the program is assembled using the following Maven project:
<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>
    <groupId>com.bigdata.hadoop</groupId>
    <artifactId>FieldCounts</artifactId>
    <version>0.0.1-SNAPSHOT</version>
    <packaging>jar</packaging>
    <name>FieldCounts</name>
    <url>http://maven.apache.org</url>
    <properties>
        <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
    </properties>
    <dependencies>
        <dependency>
            <groupId>junit</groupId>
            <artifactId>junit</artifactId>
            <version>3.8.1</version>
            <scope>test</scope>
        </dependency>
        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-hdfs</artifactId>
            <version>2.2.0</version>
        </dependency>
        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-common</artifactId>
            <version>2.2.0</version>
        </dependency>
        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-mapreduce-client-jobclient</artifactId>
            <version>2.2.0</version>
        </dependency>
        <dependency>
            <groupId>org.apache.hive.hcatalog</groupId>
            <artifactId>hcatalog-core</artifactId>
            <version>0.12.0</version>
        </dependency>
        <dependency>
            <groupId>com.google.guava</groupId>
            <artifactId>guava</artifactId>
            <version>16.0.1</version>
        </dependency>
    </dependencies>
    <build>
        <plugins>
            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-compiler-plugin</artifactId>
                <version>2.3.2</version>
                <configuration>
                    <source>${jdk.version}</source>
                    <target>${jdk.version}</target>
                </configuration>
            </plugin>
            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-assembly-plugin</artifactId>
                <executions>
                    <execution>
                        <goals>
                            <goal>attached</goal>
                        </goals>
                        <phase>package</phase>
                        <configuration>
                            <descriptorRefs>
                                <descriptorRef>jar-with-dependencies</descriptorRef>
                            </descriptorRefs>
                            <archive>
                                <manifest>
                                    <mainClass>com.bigdata.hadoop.FieldCounts</mainClass>
                                </manifest>
                            </archive>
                        </configuration>
                    </execution>
                </executions>
            </plugin>
        </plugins>
    </build>
</project>
Please advise where and how to find compatible Hadoop jars.
[update_1]
I am running Hadoop 2.2.0.2.0.6.0-101
As I have found here: https://github.com/kevinweil/elephant-bird/issues/247
Hadoop 1.0.3: JobContext is a class
Hadoop 2.0.0: JobContext is an interface
In my pom.xml I have three jars with version 2.2.0
hadoop-hdfs 2.2.0
hadoop-common 2.2.0
hadoop-mapreduce-client-jobclient 2.2.0
hcatalog-core 0.12.0
The only exception is hcatalog-core, whose version is 0.12.0. I could not find any more recent version of this jar, and I need it!
How can I find which of these four jars produces java.lang.IncompatibleClassChangeError: Found interface org.apache.hadoop.mapreduce.JobContext, but class was expected?
Please give me an idea of how to solve this. (The only solution I see is to compile everything from source!)
[/update_1]
Full text of my MapReduce job:
package com.bigdata.hadoop;

import java.io.IOException;
import java.util.*;
import org.apache.hadoop.conf.*;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapreduce.*;
import org.apache.hadoop.util.*;
import org.apache.hcatalog.mapreduce.*;
import org.apache.hcatalog.data.*;
import org.apache.hcatalog.data.schema.*;
import org.apache.log4j.Logger;

public class FieldCounts extends Configured implements Tool {

    public static class Map extends Mapper<WritableComparable, HCatRecord, TableFieldValueKey, IntWritable> {

        static Logger logger = Logger.getLogger("com.foo.Bar");
        static boolean firstMapRun = true;
        static List<String> fieldNameList = new LinkedList<String>();

        /**
         * Return a list of field names not containing `id` field name
         * @param schema
         * @return
         */
        static List<String> getFieldNames(HCatSchema schema) {
            // Filter out `id` name just once
            if (firstMapRun) {
                firstMapRun = false;
                List<String> fieldNames = schema.getFieldNames();
                for (String fieldName : fieldNames) {
                    if (!fieldName.equals("id")) {
                        fieldNameList.add(fieldName);
                    }
                }
            } // if (firstMapRun)
            return fieldNameList;
        }

        @Override
        protected void map( WritableComparable key,
                            HCatRecord hcatRecord,
                            //org.apache.hadoop.mapreduce.Mapper
                            //<WritableComparable, HCatRecord, Text, IntWritable>.Context context)
                            Context context)
                throws IOException, InterruptedException {
            HCatSchema schema = HCatBaseInputFormat.getTableSchema(context.getConfiguration());
            //String schemaTypeStr = schema.getSchemaAsTypeString();
            //logger.info("******** schemaTypeStr ********** : "+schemaTypeStr);
            //List<String> fieldNames = schema.getFieldNames();
            List<String> fieldNames = getFieldNames(schema);
            for (String fieldName : fieldNames) {
                Object value = hcatRecord.get(fieldName, schema);
                String fieldValue = null;
                if (null == value) {
                    fieldValue = "<NULL>";
                } else {
                    fieldValue = value.toString();
                }
                //String fieldNameValue = fieldName+"."+fieldValue;
                //context.write(new Text(fieldNameValue), new IntWritable(1));
                TableFieldValueKey fieldKey = new TableFieldValueKey();
                fieldKey.fieldName = fieldName;
                fieldKey.fieldValue = fieldValue;
                context.write(fieldKey, new IntWritable(1));
            }
        }
    }

    public static class Reduce extends Reducer<TableFieldValueKey, IntWritable,
                                               WritableComparable, HCatRecord> {

        protected void reduce( TableFieldValueKey key,
                               java.lang.Iterable<IntWritable> values,
                               Context context)
                               //org.apache.hadoop.mapreduce.Reducer<Text, IntWritable,
                               //WritableComparable, HCatRecord>.Context context)
                throws IOException, InterruptedException {
            Iterator<IntWritable> iter = values.iterator();
            int sum = 0;
            // Sum up occurrences of the given key
            while (iter.hasNext()) {
                IntWritable iw = iter.next();
                sum = sum + iw.get();
            }
            HCatRecord record = new DefaultHCatRecord(3);
            record.set(0, key.fieldName);
            record.set(1, key.fieldValue);
            record.set(2, sum);
            context.write(null, record);
        }
    }

    public int run(String[] args) throws Exception {
        Configuration conf = getConf();
        args = new GenericOptionsParser(conf, args).getRemainingArgs();
        // To fix Hadoop "META-INFO" (http://stackoverflow.com/questions/17265002/hadoop-no-filesystem-for-scheme-file)
        conf.set("fs.hdfs.impl",
                org.apache.hadoop.hdfs.DistributedFileSystem.class.getName());
        conf.set("fs.file.impl",
                org.apache.hadoop.fs.LocalFileSystem.class.getName());
        // Get the input and output table names as arguments
        String inputTableName = args[0];
        String outputTableName = args[1];
        // Assume the default database
        String dbName = null;
        Job job = new Job(conf, "FieldCounts");
        HCatInputFormat.setInput(job,
                InputJobInfo.create(dbName, inputTableName, null));
        job.setJarByClass(FieldCounts.class);
        job.setMapperClass(Map.class);
        job.setReducerClass(Reduce.class);
        // An HCatalog record as input
        job.setInputFormatClass(HCatInputFormat.class);
        // Mapper emits TableFieldValueKey as key and an integer as value
        job.setMapOutputKeyClass(TableFieldValueKey.class);
        job.setMapOutputValueClass(IntWritable.class);
        // Ignore the key for the reducer output; emitting an HCatalog record as
        // value
        job.setOutputKeyClass(WritableComparable.class);
        job.setOutputValueClass(DefaultHCatRecord.class);
        job.setOutputFormatClass(HCatOutputFormat.class);
        HCatOutputFormat.setOutput(job,
                OutputJobInfo.create(dbName, outputTableName, null));
        HCatSchema s = HCatOutputFormat.getTableSchema(job);
        System.err.println("INFO: output schema explicitly set for writing:"
                + s);
        HCatOutputFormat.setSchema(job, s);
        return (job.waitForCompletion(true) ? 0 : 1);
    }

    public static void main(String[] args) throws Exception {
        String classpath = System.getProperty("java.class.path");
        //System.out.println("*** CLASSPATH: "+classpath);
        int exitCode = ToolRunner.run(new FieldCounts(), args);
        System.exit(exitCode);
    }
}
And the class for the complex key:
package com.bigdata.hadoop;

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.WritableComparable;
import com.google.common.collect.ComparisonChain;

public class TableFieldValueKey implements WritableComparable<TableFieldValueKey> {

    public String fieldName;
    public String fieldValue;

    public TableFieldValueKey() {} // must have a default constructor

    public void readFields(DataInput in) throws IOException {
        fieldName = in.readUTF();
        fieldValue = in.readUTF();
    }

    public void write(DataOutput out) throws IOException {
        out.writeUTF(fieldName);
        out.writeUTF(fieldValue);
    }

    public int compareTo(TableFieldValueKey o) {
        return ComparisonChain.start().compare(fieldName, o.fieldName)
                .compare(fieldValue, o.fieldValue).result();
    }
}
Hadoop went through a huge code refactoring from Hadoop 1.0 to Hadoop 2.0. One side effect
is that code compiled against Hadoop 1.0 is not compatible with Hadoop 2.0, and vice versa.
However, the source code is mostly compatible, so one just needs to recompile the code against
the target Hadoop distribution.
The exception "Found interface X, but class was expected" is very common when you're running
code that is compiled for Hadoop 1.0 on Hadoop 2.0 or vice-versa.
You can find the correct Hadoop version used in the cluster, then specify that version in the pom.xml file. Build your project with the same version of Hadoop used in the cluster and deploy it.
You need to recompile "hcatalog-core" to support Hadoop 2.0.0.
Currently "hcatalog-core" only supports Hadoop 1.0
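HCatalog was later folded into the Hive project, so a Hadoop-2-compatible build of the same classes is published under the hive-hcatalog-core artifact (the 0.13.0 jar mentioned in the last answer below). If recompiling is not an option, swapping the dependency may be enough; a sketch, assuming that version suits your Hive installation:
<dependency>
    <groupId>org.apache.hive.hcatalog</groupId>
    <artifactId>hive-hcatalog-core</artifactId>
    <version>0.13.0</version>
</dependency>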
Obviously, you have a version incompatibility between your Hadoop and Hive versions. You need to upgrade (or downgrade) your Hadoop version or Hive version.
This is due to the incompatibility between Hadoop 1 and Hadoop 2. Look for entries like this
<dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-core</artifactId>
    <version>1.2.1</version>
</dependency>
in your pom.xml.
These define the Hadoop version to use. Change or remove them as per your requirements.
I ran into this problem as well.
I was trying to use HCatMultipleInputs with hive-hcatalog-core-0.13.0.jar, on Hadoop 2.5.1.
The following code change helped me fix the issue:
//JobContext ctx = new JobContext(conf,jobContext.getJobID());
JobContext ctx = new Job(conf);
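For reference, Job implements JobContext in Hadoop 2, and Job.getInstance is the non-deprecated way to obtain one (a sketch; new Job(conf) is deprecated in the 2.x API):
// Job implements JobContext in Hadoop 2; getInstance throws IOException.
JobContext ctx = Job.getInstance(conf);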

Java string expression parser

I was asked to include math expressions inside a string,
say: "price: ${price}, tax: ${price}*${tax}".
The string is given at run-time, together with a Map of values.
I used Velocity for this:
Maven:
<properties>
    <velocity.version>1.6.2</velocity.version>
    <velocity.tools.version>2.0</velocity.tools.version>
</properties>
<dependencies>
    <dependency>
        <groupId>org.apache.velocity</groupId>
        <artifactId>velocity</artifactId>
        <version>${velocity.version}</version>
    </dependency>
    <dependency>
        <groupId>org.apache.velocity</groupId>
        <artifactId>velocity-tools</artifactId>
        <version>${velocity.tools.version}</version>
    </dependency>
</dependencies>
Java:
public class VelocityUtils {

    public static String mergeTemplateIntoString(String template, Map<String, String> model)
    {
        try
        {
            final VelocityEngine ve = new VelocityEngine();
            ve.init();

            final VelocityContext context = new VelocityContext();
            context.put("math", new MathTool());
            context.put("number", new NumberTool());
            for (final Map.Entry<String, String> entry : model.entrySet())
            {
                final String macroName = entry.getKey();
                context.put(macroName, entry.getValue());
            }

            final StringWriter wr = new StringWriter();
            final String logStr = "";
            ve.evaluate(context, wr, logStr, template);
            return wr.toString();
        } catch (Exception e)
        {
            return "";
        }
    }
}
Test class:
public class VelocityUtilsTest
{
    @Test
    public void testMergeTemplateIntoString() throws Exception
    {
        Map<String, String> model = new HashMap<>();
        model.put("price", "100");
        model.put("tax", "22");

        String parsedString = VelocityUtils.mergeTemplateIntoString("price: ${price} tax: ${tax}", model);
        assertEquals("price: 100 tax: 22", parsedString);

        String parsedStringWithMath = VelocityUtils.mergeTemplateIntoString("price: $number.integer($math.div($price,2))", model);
        assertEquals("price: 50", parsedStringWithMath);
    }
}
Would it be better to use SpEL instead?
I agree that this is kind of off topic, but I think it merits an answer nonetheless.
The whole idea of using a templating engine is that you need to have access to the templates at runtime. If that is the case, then sure, Velocity is a good choice. Then you could provide new versions of the HTML, and assuming you didn't change the variables that were used, you would not have to provide a new version of the application itself (recompiled).
However, if you are just using Velocity to save yourself time, it's not saving much here: you could do this with the StringTokenizer in only a few more lines of code.
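The "few more lines of code" alternative could look like the sketch below, using java.util.regex rather than StringTokenizer; unlike Velocity it handles only plain ${var} placeholders, not math expressions:
// Minimal ${var} substitution without a template engine (java.util.regex.Pattern/Matcher).
public static String substitute(String template, Map<String, String> model) {
    Pattern placeholder = Pattern.compile("\\$\\{(\\w+)\\}");
    Matcher matcher = placeholder.matcher(template);
    StringBuffer result = new StringBuffer();
    while (matcher.find()) {
        // Unknown placeholders are left untouched.
        String value = model.getOrDefault(matcher.group(1), matcher.group(0));
        matcher.appendReplacement(result, Matcher.quoteReplacement(value));
    }
    matcher.appendTail(result);
    return result.toString();
}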
