I am trying to stream Twitter feeds to HDFS and then use Hive. But the first part, streaming the data and loading it into HDFS, is not working and gives a NullPointerException.
This is what I have tried:
1. Downloaded apache-flume-1.4.0-bin.tar, extracted it, and copied all the contents to /usr/lib/flume/.
In /usr/lib/ I changed the owner of the flume directory to my user.
When I run ls in /usr/lib/flume/, it shows:
bin CHANGELOG conf DEVNOTES docs lib LICENSE logs NOTICE README RELEASE-NOTES tools
2. Moved to the conf/ directory. I copied flume-env.sh.template to flume-env.sh and set JAVA_HOME to my Java path, /usr/lib/jvm/java-7-oracle.
3. Next I created a file called flume.conf in the same conf directory and added the following contents:
TwitterAgent.sources = Twitter
TwitterAgent.channels = MemChannel
TwitterAgent.sinks = HDFS
TwitterAgent.sources.Twitter.type = com.cloudera.flume.source.TwitterSource
TwitterAgent.sources.Twitter.channels = MemChannel
TwitterAgent.sources.Twitter.consumerKey = <Twitter Application API key>
TwitterAgent.sources.Twitter.consumerSecret = <Twitter Application API secret>
TwitterAgent.sources.Twitter.accessToken = <Twitter Application Access token>
TwitterAgent.sources.Twitter.accessTokenSecret = <Twitter Application Access token secret>
TwitterAgent.sources.Twitter.keywords = hadoop, big data, analytics, bigdata, couldera, data science, data scientist, business intelligence, mapreduce, datawarehouse, data ware housing, mahout, hbase, nosql, newsql, businessintelligence, cloudcomputing
TwitterAgent.sinks.HDFS.channel = MemChannel
TwitterAgent.sinks.HDFS.type = hdfs
TwitterAgent.sinks.HDFS.hdfs.path = hdfs://localhost:8020/user/flume/tweets/%Y/%m/%d/%H/
TwitterAgent.sinks.HDFS.hdfs.fileType = DataStream
TwitterAgent.sinks.HDFS.hdfs.writeFormat = Text
TwitterAgent.sinks.HDFS.hdfs.batchSize = 1000
TwitterAgent.sinks.HDFS.hdfs.rollSize = 0
TwitterAgent.sinks.HDFS.hdfs.rollCount = 600
TwitterAgent.channels.MemChannel.type = memory
TwitterAgent.channels.MemChannel.capacity = 10000
TwitterAgent.channels.MemChannel.transactionCapacity = 100
I created an app in Twitter, generated the tokens, and added all the keys to the file above (the API key as the consumer key).
I downloaded the flume-sources jar from the Cloudera files, as mentioned in the linked post.
4. I added flume-sources-1.0-SNAPSHOT.jar to /usr/lib/flume/lib.
5. Started Hadoop and ran the following:
hadoop fs -mkdir /user/flume/tweets
hadoop fs -chown -R flume:flume /user/flume
hadoop fs -chmod -R 770 /user/flume
6. I ran the following in /usr/lib/flume:
/usr/lib/flume/conf$ bin/flume-ng agent -n TwitterAgent -c conf -f conf/flume-conf
It shows the JARs it is including and then exits.
When I checked HDFS with hadoop fs -ls /user/flume/tweets, there were no files there.
In Hadoop, the core-site.xml file has the following configuration:
<configuration>
<property>
<name>fs.default.name</name>
<value>hdfs://localhost:8020</value>
<final>true</final>
</property>
</configuration>
Thanks
I ran the following command and it worked:
bin/flume-ng agent --conf ./conf/ -f conf/flume.conf -Dflume.root.logger=DEBUG,console -n TwitterAgent
I used this command and it is working:
flume-ng agent --conf /etc/flume-ng/conf/ -f /etc/flume-ng/conf/flume.conf -Dflume.root.logger=DEBUG,console -n TwitterAgent
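For reference, once the agent is running, here is a minimal hedged Java sketch (not part of the original setup) to verify that events are actually landing in HDFS; the namenode URI is taken from the core-site.xml in the question, and the class name is just an illustration.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ListTweets {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // namenode address from the core-site.xml shown in the question
        conf.set("fs.defaultFS", "hdfs://localhost:8020");
        try (FileSystem fs = FileSystem.get(conf)) {
            // the HDFS sink writes under /user/flume/tweets/%Y/%m/%d/%H/
            for (FileStatus status : fs.listStatus(new Path("/user/flume/tweets"))) {
                System.out.println(status.getPath() + "  " + status.getLen() + " bytes");
            }
        }
    }
}
If this prints nothing while the agent is running, the problem is on the Flume side (credentials, source jar, or config file name) rather than in HDFS.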
I'm using the Hadoop DFSAdmin API to report the dead blocks for an HDFS backup utility.
But invoking my jar in the environment (with the Hadoop jars included on the classpath):
java -classpath "/usr/local/flytxt/hadoop:/usr/local/HBackup/conf:/usr/local/HBackup/lib/*:/usr/local/HBackup/hadoop-distcp-2.7.5.jar" -Djava.library.path="/usr/local/HBackup/lib/native" -jar hdfs-br-dfsadmin-0.0.1-SNAPSHOT.jar
fails, returning:
report: FileSystem file:/// is not an HDFS file system
Usage: hdfs dfsadmin [-report] [-live] [-dead] [-decommissioning]
This is the code:
String[] argv= {"-report", "-dead" };
DFSAdmin.main(argv);
I tried the suggestions from a related thread, but no help!
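For what it's worth, a hedged sketch of one direction to check: when a program is started with java -jar, the -classpath option is ignored, so core-site.xml may never be loaded and DFSAdmin can fall back to the local file:/// filesystem. The sketch below sets the namenode address explicitly instead; the hostname is a placeholder, not taken from the question.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hdfs.tools.DFSAdmin;
import org.apache.hadoop.util.ToolRunner;

public class DeadNodeReport {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // hypothetical namenode address; replace with your cluster's fs.defaultFS
        conf.set("fs.defaultFS", "hdfs://namenode-host:8020");
        // run dfsadmin -report -dead against that explicit HDFS URI
        int rc = ToolRunner.run(conf, new DFSAdmin(conf), new String[] {"-report", "-dead"});
        System.exit(rc);
    }
}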
I have created a Spark Cluster with 3 workers on Kubernetes and a JupyterHub deployment to attach to it so I can run huge queries.
My Parquet files are stored in IBM Cloud Object Storage (COS), and when I run a simple piece of code to read from COS, I get the following error:
Could not read footer: java.io.IOException: Could not read footer for file FileStatus{path=file:/path/myfile.parquet/_common_metadata; isDirectory=false; length=413; replication=0; blocksize=0; modification_time=0; access_time=0; owner=; group=; permission=rw-rw-rw-; isSymlink=false} at parquet.hadoop.ParquetFileReader.readAllFootersInParallel
I have added all the required libraries to the /jars directory under SPARK_HOME on the driver.
This is the code I'm using to connect:
# Initial Setup - Once
import os
from pyspark import SparkConf, SparkContext
from pyspark.sql import SparkSession
sc = SparkContext.getOrCreate()  # the PySpark kernel may already provide sc; created here if missing
spark_session = SparkSession(sc)
credentials_staging_parquet = {
'bucket_dm':'mybucket1',
'bucket_eid':'bucket2',
'secret_key':'XXXXXXXX',
'iam_url':'https://iam.ng.bluemix.net/oidc/token',
'api_key':'XXXXXXXX',
'resource_instance_id':'crn:v1:bluemix:public:cloud-object-storage:global:a/XXXXX:XXXXX::',
'access_key':'XXXXX',
'url':'https://s3-api.us-geo.objectstorage.softlayer.net'
}
conf = {
'fs.cos.service.access.key': credentials_staging_parquet.get('access_key'),
'fs.cos.service.endpoint': credentials_staging_parquet.get('url'),
'fs.cos.service.secret.key': credentials_staging_parquet.get('secret_key'),
'fs.cos.service.iam.endpoint': credentials_staging_parquet.get('iam_url'),
'fs.cos.service.iam.service.id': credentials_staging_parquet.get('resource_instance_id'),
'fs.stocator.scheme.list': 'cos',
'fs.cos.impl': 'com.ibm.stocator.fs.ObjectStoreFileSystem',
'fs.stocator.cos.impl': 'com.ibm.stocator.fs.cos.COSAPIClient',
'fs.stocator.cos.scheme': 'cos',
'fs.cos.client.execution.timeout': '18000000',
'fs.stocator.glob.bracket.support': 'true'
}
hadoop_conf = sc._jsc.hadoopConfiguration()
for key in conf:
hadoop_conf.set(key, conf.get(key))
parquet_path = 'store/MY_FILE/*'
cos_url = 'cos://{bucket}.service/{parquet_path}'.format(bucket=credentials_staging_parquet.get('bucket_eid'), parquet_path=parquet_path)
df2 = spark_session.read.parquet(cos_url)
I got a similar error and found this post through Google. I then realized I had a file format issue: the saved file was Avro but the reader expected ORC. So check that your saved file format and your reader format match.
I found the cause of my issue: the required libraries were not available to all workers in the cluster.
There are 2 ways to fix that:
Make sure you add the dependencies to the spark-submit command so they are distributed to the whole cluster; in this case that should be done in the kernel.json file for JupyterHub, located at /usr/local/share/jupyter/kernels/pyspark/kernel.json (assuming you created that).
OR
Add the dependencies on the /jars directory on your SPARK_HOME for each worker in the cluster and the driver (if you didn't do so).
I used the second approach: during my Docker image creation I added the libs, so when I start my cluster, all containers already have the required libraries.
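To illustrate the first approach, here is a minimal hedged sketch in Java (the same idea applies to the spark-submit arguments in the PySpark kernel.json): jars listed in spark.jars are shipped to the driver and every executor when the session starts. The jar path, app name, and cos URL below are assumptions, not values from the question.
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class CosReadExample {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("cos-parquet-read")
                // equivalent to `spark-submit --jars ...`: this jar is distributed
                // to the driver and all executors (path is hypothetical)
                .config("spark.jars", "/opt/jars/stocator.jar")
                .getOrCreate();

        // the fs.cos.* credentials go into the Hadoop configuration,
        // as in the PySpark snippet above
        spark.sparkContext().hadoopConfiguration()
                .set("fs.stocator.scheme.list", "cos");

        Dataset<Row> df = spark.read().parquet("cos://bucket2.service/store/MY_FILE/*");
        df.printSchema();
    }
}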
Try restarting your system or server; it should work afterwards.
I faced the same problem. It generally happens when you upgrade your Java version but the Spark libraries still point to the old Java version. Rebooting the server/system resolves the problem.
I installed a HDP 2.5 Hadoop/Spark cluster using cloudbreak on Azure.
Everything works except the Spark history server. The log says the default URI for the event log, hdfs:///spark-history, is invalid because the hostname is missing.
So I replaced it with a direct reference to the actual location on the Azure blob storage: wasb://<host>:<port>/spark-history. This URI works when used with hdfs dfs -ls, but the Spark history server still won't start. Now it complains about a class not found: Caused by: java.lang.NoClassDefFoundError: com/microsoft/azure/storage/blob/BlobListingDetails.
So it seems it doesn't load some driver during startup. I did find /usr/hdp/current/hadoop-client/lib/azure-storage-2.2.0.jar, which might be it. But I'm not sure how to make the history server load the jar during startup using the Ambari config editor, or whether this is even the right solution to the original problem.
The strangest thing is that Azure HDInsight uses blob storage and there the spark history server simply runs using the default hdfs:///spark-history setting.
Any suggestions on how to load the azure-storage driver or any other approach to this problem?
Thanks
I'll answer my own question. Someone on the Hortonworks community forum had the answer: the Spark assembly jar contains an invalid version of the Azure storage classes. Updating the assembly jar solves the issue:
mkdir -p /tmp/jarupdate && cd /tmp/jarupdate
find /usr/hdp/ -name "azure-storage*.jar"
cp /usr/hdp/2.5.0.1-210/hadoop/lib/azure-storage-2.2.0.jar .
cp /usr/hdp/current/spark-historyserver/lib/spark-assembly-1.6.3.2.5.0.1-210-hadoop2.7.3.2.5.0.1-210.jar .
unzip azure-storage-2.2.0.jar
jar uf spark-assembly-1.6.3.2.5.0.1-210-hadoop2.7.3.2.5.0.1-210.jar com/
mv -f spark-assembly-1.6.3.2.5.0.1-210-hadoop2.7.3.2.5.0.1-210.jar /usr/hdp/current/spark-historyserver/lib/spark-assembly-1.6.3.2.5.0.1-210-hadoop2.7.3.2.5.0.1-210.jar
cd .. && rm -rf /tmp/jarupdate
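As a hedged sanity check (not from the original answer), a tiny Java program can confirm that the patched assembly now contains the class the history server failed to find; the jar path is the one used in the steps above.
import java.net.URL;
import java.net.URLClassLoader;

public class AzureClassCheck {
    public static void main(String[] args) throws Exception {
        // the patched assembly jar from the steps above
        URL jar = new URL("file:/usr/hdp/current/spark-historyserver/lib/"
                + "spark-assembly-1.6.3.2.5.0.1-210-hadoop2.7.3.2.5.0.1-210.jar");
        try (URLClassLoader cl = new URLClassLoader(new URL[] { jar }, null)) {
            cl.loadClass("com.microsoft.azure.storage.blob.BlobListingDetails");
            System.out.println("BlobListingDetails found - the jar update looks good");
        } catch (ClassNotFoundException e) {
            System.out.println("Still missing - the assembly was not updated correctly");
        }
    }
}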
I am trying to use Hive 1.2.0 over Hadoop 2.6.0. I have created an employee table. However, when I run the following query:
hive> load data local inpath '/home/abc/employeedetails' into table employee;
I get the following error:
Failed with exception Unable to move source file:/home/abc/employeedetails to destination hdfs://localhost:9000/user/hive/warehouse/employee/employeedetails_copy_1
FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.MoveTask
What am I doing wrong here? Are there any specific permissions that I need to set? Thanks in advance!
As mentioned by Rio, the issue was a lack of permissions to load data into the Hive table. I figured out that the following command solves my problem:
hadoop fs -chmod g+w /user/hive/warehouse
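For reference, a hedged Java sketch of the same change through the FileSystem API; the namenode URI is taken from the error message in the question, and the resulting bits assume the warehouse directory started out as 755.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.permission.FsAction;
import org.apache.hadoop.fs.permission.FsPermission;

public class WarehousePermissions {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://localhost:9000"); // from the error in the question
        try (FileSystem fs = FileSystem.get(conf)) {
            Path warehouse = new Path("/user/hive/warehouse");
            // sets rwxrwxr-x, i.e. the usual result of adding group write to a 755 directory
            fs.setPermission(warehouse,
                    new FsPermission(FsAction.ALL, FsAction.ALL, FsAction.READ_EXECUTE));
            System.out.println("New permission: " + fs.getFileStatus(warehouse).getPermission());
        }
    }
}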
Check the permissions on the HDFS directory:
hdfs dfs -ls /user/hive/warehouse/employee/employeedetails_copy_1
It seems you may not have permission to load data into the Hive table.
The error might be due to a permission issue on the local filesystem.
Change the permissions on the local filesystem:
sudo chmod -R 777 /home/abc/employeedetails
Now, run:
hive> load data local inpath '/home/abc/employeedetails' into table employee;
If we face the same error after running the above command in distributed mode, we can try the command below as a superuser on all nodes.
sudo usermod -a -G hdfs yarn
Note: I got this error after restarting all the YARN services (in Ambari), and this resolved my problem. It is an admin command, so take care when running it.
I met the same problem and searched for two days. Finally I found the reason: the datanode started for a moment and then shut down.
Steps to solve it:
hadoop fs -chmod -R 777 /home/abc/employeedetails
hadoop fs -chmod -R 777 /user/hive/warehouse/employee/employeedetails_copy_1
vi hdfs-site.xml and add the following property:
<property>
<name>dfs.permissions.enabled</name>
<value>false</value>
</property>
hdfs --daemon start datanode
vi hdfs-site.xml # find the locations of 'dfs.datanode.data.dir' and 'dfs.namenode.name.dir'. If they point to the same location, you must change one of them; this was why my datanode could not start.
Under 'dfs.datanode.data.dir'/data/current, edit the VERSION file and copy its clusterID into the VERSION file under 'dfs.namenode.name.dir'/data/current.
start-all.sh
If the above does not solve it, be careful with the steps below because of data safety, but they are what finally solved my problem.
stop-all.sh
Delete the data folder under 'dfs.datanode.data.dir', the data folder under 'dfs.namenode.name.dir', and the tmp folder.
hdfs namenode -format
start-all.sh
This solved the problem.
You may also meet another problem like this:
org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.server.namenode.SafeModeException):
Cannot create directory /opt/hive/tmp/root/1be8676a-56ac-47aa-ab1c-aa63b21ce1fc. Name node is in safe mode
Solution: hdfs dfsadmin -safemode leave
It might be because your hive user does not have access to the relevant HDFS directories.
I am trying to execute hadoop fs -put <source> <destination> from Java code. When I execute this command directly from the terminal, it works fine but when I try to execute this command from within the Java code using
String[] str = {"/usr/bin/hadoop","fs -put", source, dest};
Runtime.getRuntime().exec(str);
I get the error Error: Could not find or load main class fs. I tried executing some non-Hadoop commands like ls and mkdir from Java and they worked fine, but the Hadoop commands do not get executed even though they work fine from the terminal.
What could be the possible reason for this and how can I solve it?
Java API attempt: I tried to use the Java API to perform the copy operation, but I get an error. The Java code is:
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

String source = "/home/tmpe/file1.csv";
String dest = "/user/tmpe/file1.csv";
Configuration conf = new Configuration();
conf.set("fs.defaultFS", "hdfs://node1:8020");
FileSystem fs = FileSystem.get(conf);
Path targetPath = new Path(dest);
Path sourcePath = new Path(source);
// delSrc = false, overwrite = true
fs.copyFromLocalFile(false, true, sourcePath, targetPath);
The error which I get is:
Exception in thread "main" java.io.IOException: Mkdirs failed to create /user/tmpe
at org.apache.hadoop.fs.ChecksumFileSystem.create(ChecksumFileSystem.java:378)
at org.apache.hadoop.fs.ChecksumFileSystem.create(ChecksumFileSystem.java:364)
at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:564)
at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:545)
at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:452)
at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:229)
at org.apache.hadoop.fs.FileSystem.copyFromLocalFile(FileSystem.java:1230)
I have already created the /user/tmpe folder and it has full read-write permissions, but the error still occurs. I am unable to get the issue resolved.
I guess you probably do not have the HADOOP_HOME environment variable set.
But since you're in Java, why on earth would you want to do a hadoop fs -put in an external process when the Java API is even friendlier than the shell?
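If the external-process route is still needed, a hedged sketch: each command-line token has to be its own array element, since "fs -put" passed as a single string is likely what produces the "main class fs" error (this also assumes HADOOP_HOME and the path to the hadoop launcher are correct in your environment). The paths are the ones from the question.
import java.io.IOException;

public class HdfsPut {
    public static void main(String[] args) throws IOException, InterruptedException {
        String source = "/home/tmpe/file1.csv";   // local file (from the question)
        String dest = "/user/tmpe/file1.csv";     // HDFS destination

        // "fs" and "-put" must be separate arguments
        ProcessBuilder pb = new ProcessBuilder(
                "/usr/bin/hadoop", "fs", "-put", source, dest);
        pb.inheritIO();                            // surface hadoop's own output
        int exitCode = pb.start().waitFor();
        System.out.println("hadoop fs -put exited with " + exitCode);
    }
}
That said, the FileSystem API shown in the question is still the cleaner option.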
Coming across this old post: if you haven't tried it already, execute it with hadoop jar app_name.jar instead of java -jar. That way, if the classpath of your jar does not include all the Hadoop jars, it will pick up the jars predefined in $HADOOP_CLASSPATH.