Errors when loading data to HDFS - Java

I have a Java program trying to load data to HDFS:
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class CopyFileToHDFS {
    public static void main(String[] args) {
        try {
            Configuration configuration = new Configuration();
            String msg = "message1";
            String file = "hdfs://localhost:8020/user/user1/input.txt";
            FileSystem hdfs = FileSystem.get(new URI(file), configuration);
            FSDataOutputStream outputStream = hdfs.create(new Path(file), true);
            outputStream.write(msg.getBytes());
            outputStream.close(); // flush and close, or the write may never reach HDFS
        } catch (Exception e) {
            System.out.println(e.getMessage());
        }
    }
}
When I run the program, it gives me an error:
java.util.ServiceConfigurationError: org.apache.hadoop.fs.FileSystem: Provider org.apache.hadoop.fs.s3.S3FileSystem not found
It looks like a configuration issue. Can anyone give me some suggestions?
Thanks

Something on your classpath is declaring that org.apache.hadoop.fs.FileSystem includes an S3 provider. One possible cause is an old, stale META-INF/services file; see this Spark bug report.
If you're creating an uber-jar, it could be somewhere in there. If you can't find and eliminate the declaration that's causing the problem, a workaround is to include the AWS and Hadoop jars where the Spark driver/executors can find them; see this Stack Overflow question.
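To locate the stale declaration, one option is to print every META-INF/services registration for FileSystem that is visible on the classpath; the jar whose entry still lists the S3 provider is the culprit. A minimal diagnostic sketch (the class name is illustrative):
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;
import java.util.Enumeration;

public class FindStaleServiceFiles {
    public static void main(String[] args) throws Exception {
        // Every jar can contribute one of these service files; print them all.
        Enumeration<URL> urls = FindStaleServiceFiles.class.getClassLoader()
                .getResources("META-INF/services/org.apache.hadoop.fs.FileSystem");
        while (urls.hasMoreElements()) {
            URL url = urls.nextElement();
            System.out.println("-- " + url);
            try (BufferedReader r = new BufferedReader(new InputStreamReader(url.openStream()))) {
                String line;
                while ((line = r.readLine()) != null) {
                    System.out.println(line); // provider class names declared by this jar
                }
            }
        }
    }
}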

Related

Loading XML files from the classpath in Spring Boot

I am trying to load and validate XML files from a directory on the classpath at startup of a Spring Boot application. I am seeing the following error, which indicates that the files are being loaded using an absolute file path rather than the classpath:
java.io.FileNotFoundException: class path resource [converters/mapper.xml] cannot be resolved to absolute file path because it does not reside in the file system: jar:file:/opt/core/home/libexec/boss/core-service-2.0.0.jar!/BOOT-INF/lib/core-api-2.0.0.jar!/converters/mapper.xml
Below is a code snippet that loads the files:
..
@Autowired
public FieldsMapTypeConvertersRegistry(@Value("${core.files-location:converters}")
                                       String mapperFilesLocation) {
    this.mapperFilesLocation = mapperFilesLocation;
}
..
try {
    // TODO: we need to replace this when we enable multi-tenancy
    ClassLoader classLoader = ClassUtils.getDefaultClassLoader();
    ResourcePatternResolver resolver = new PathMatchingResourcePatternResolver(classLoader);
    Resource[] xmlResources = resolver.getResources(mapperFilesLocation + "/*.xml");
    for (Resource xmlResource : xmlResources) {
        File file = ResourceUtils.getFile(xmlResource.getURL());
        registerTypeConverter(file);
    }
} catch (IOException e) {
    // do stuff
} catch (JAXBException e) {
    // do stuff
}
I think the issue is in this statement in the code above:
File file = ResourceUtils.getFile(xmlResource.getURL());
but I am not sure how else to do it. Any help is really appreciated.
I'm just wondering why you are using ResourceUtils.getFile(xmlResource.getURL()) when xmlResource.getFile() is already available to get the File handle. Ideally, you should catch the FileNotFoundException in the catch block and check the detailed message wrapped inside the exception.
Edit:
The exception is being thrown because the XML file is not found on the classpath at runtime. Most probably, the file target/converters/mapper.xml is not available.
Try something like MyService.class.getClassLoader().getResourceAsStream("file.xml") (note that class-loader resource names must not start with a slash) and then create a File from the stream, as sketched below.
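Since the resource lives inside the boot jar, it cannot be resolved to a java.io.File in place; a minimal sketch of that stream-to-file step (assuming Java 7+; the class and resource names are illustrative) copies the resource to a temp file so that File-based code such as registerTypeConverter keeps working:
import java.io.File;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.StandardCopyOption;

public class ResourceToFile {
    // Copies a classpath resource to a temporary file and returns it.
    public static File toTempFile(String resource) throws Exception {
        try (InputStream in = ResourceToFile.class.getClassLoader().getResourceAsStream(resource)) {
            if (in == null) {
                throw new IllegalArgumentException("resource not found: " + resource);
            }
            File tmp = File.createTempFile("mapper", ".xml");
            tmp.deleteOnExit();
            Files.copy(in, tmp.toPath(), StandardCopyOption.REPLACE_EXISTING);
            return tmp;
        }
    }
}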
Try the commons-io:commons-io:2.7 Maven artifact and use the following code:
InputStream inputStream = obj.getClass()
.getClassLoader()
.getResourceAsStream("converters/mapper.xml");
String data = IOUtils.toString(inputStream, "UTF-8");

Jar that reads a file

I have code that deals with an Elasticsearch index. In one of its steps, the program needs to read a JSON Schema file in order to continue its execution. The code works well on my machine, but when I execute it as a jar file inside a Docker container, it gives me the following error:
java.io.FileNotFoundException: dm.jsonschema (No such file or directory)
the code that loads the schema is:
public class loadSchema {
    private static final String JSON_SCHEMA_DOCUMENT = "dm.jsonschema";
    ....
    public static JsonSchema tryLoadJSONSchema() {
        JsonSchemaFactory factory = JsonSchemaFactory.byDefault();
        JsonNode cdmSchema = null;
        try {
            cdmSchema = JsonLoader.fromPath(JSON_SCHEMA_DOCUMENT);
        } catch (IOException e) {
            System.out.println(e);
            System.exit(-1);
        }
I placed the jsonschema file next to the jar file in the container, but it keeps giving me the same error. Any idea how to solve this problem?
The relative path of the File is resolved against the working directory from which you start the java executable (the user.dir system property), not against the location of the jar archive.
The best approach would be externalising this file location using a system property. For example, if you define a jsonSchemaPath property:
String path = System.getProperty("jsonSchemaPath");
JsonSchemaFactory factory = JsonSchemaFactory.byDefault();
JsonNode cdmSchema = JsonLoader.fromPath(path);
Then you can set it on the java command line (note that -D options must come before -jar, or they are passed to your program as arguments instead):
java -DjsonSchemaPath=/path/to/dm.jsonschema -jar yourcode.jar
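An alternative (my suggestion, not part of the answer above): if the schema can ship inside the jar, place dm.jsonschema under src/main/resources and load it from the classpath, so the container needs no external file at all. The JsonLoader class used in your code also has a fromResource helper for this:
import com.github.fge.jackson.JsonLoader;
...
// The resource path must start with a slash and is resolved against the
// classpath, not the filesystem, so it works from inside the jar.
JsonNode cdmSchema = JsonLoader.fromResource("/dm.jsonschema");
JsonSchemaFactory factory = JsonSchemaFactory.byDefault();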

Java/Gradle reading external config files

My project structure looks like below. I do not want to include the config file as a resource, but instead read it at runtime so that we can simply change settings without having to recompile and deploy. My problem is twofold:
reading the file just isn't working despite the various ways I have tried (see my current implementation below)
When using Gradle, do I need to tell it how to build/deploy the file and where? I think that may be part of the problem: the config file is not getting deployed when doing a Gradle build or trying to debug in Eclipse.
My project structure:
myproj\
\src
\main
\config
\com\my_app1\
config.dev.properties
config.qa.properties
\java
\com\myapp1\
\model\
\service\
\util\
Config.java
\test
Config.java:
public Config() {
    try {
        String configPath = "/config.dev.properties"; // TODO: pass env in as parameter
        System.out.println(configPath);
        final File configFile = new File(configPath);
        FileInputStream input = new FileInputStream(configFile);
        Properties prop = new Properties();
        prop.load(input);
        String prop1 = prop.getProperty("PROP1");
        System.out.println(prop1);
    } catch (IOException ex) {
        ex.printStackTrace();
    }
}
Ans 1.
reading the file just isn't working despite the various ways I have tried
(see my current implementation below)
With the location of your config file as depicted, change
String configPath = "/config.dev.properties";
to
String configPath = "src/main/config/com/my_app1/config.dev.properties";
(use forward slashes; backslashes like \m and \c are invalid escape sequences in a Java string literal). However, read the second answer first.
Ans 2:
When using Gradle, do I need to tell it how to build/deploy the file
and where? I think that may be part of the problem: the config
file is not getting deployed when doing a Gradle build or trying to
debug in Eclipse.
You have two choices:
Rename your config directory to resources. Gradle automatically bundles resources under the "src/main/resources" directory.
Let Gradle know about the additional directory to be treated as resources (note the forward slashes; backslashes would be invalid escapes in the Groovy string):
sourceSets {
    main {
        resources {
            srcDirs = ["src/main/config/com/my_app1"]
            includes = ["**/*.properties"]
        }
    }
}
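With either choice the properties end up on the runtime classpath, so they can be loaded as a resource instead of via a filesystem path. A sketch of the constructor body under that assumption (the resource name matches the file above):
import java.io.InputStream;
import java.util.Properties;
...
Properties prop = new Properties();
try (InputStream in = Config.class.getResourceAsStream("/config.dev.properties")) {
    if (in == null) {
        throw new IllegalStateException("config.dev.properties not found on classpath");
    }
    prop.load(in);
}
System.out.println(prop.getProperty("PROP1"));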
reading the file just isn't working despite the various ways I have tried (see my current implementation below)
You need to clarify this statement. Are you trying to load properties from an existing file? The code you posted that loads the Properties object is correct, so the error is probably in the file path.
Anyway, I'm just guessing at what you are trying to do. You need to clarify your question. Is your application an executable jar like the example below? Are you trying to load an external file that is outside the jar (in that case Gradle can't help you)?
If you build a simple application like this as an executable jar
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.util.Properties;

public class Main {
    public static void main(String[] args) {
        File configFile = new File("test.properties");
        System.out.println("Reading config from = " + configFile.getAbsolutePath());
        FileInputStream fis = null;
        Properties properties = new Properties();
        try {
            fis = new FileInputStream(configFile);
            properties.load(fis);
        } catch (IOException e) {
            e.printStackTrace();
            return;
        } finally {
            if (fis != null) {
                try {
                    fis.close();
                } catch (IOException e) {}
            }
        }
        System.out.println("user = " + properties.getProperty("user"));
    }
}
When you run the jar, the application will try to load properties from a file called test.properties that is located in the application working directory.
So if you have test.properties that looks like this
user=Flood2d
The output will be
Reading config from = C:\test.properties
user = Flood2d
And that's because the jar file and the test.properties file are located in C:\ and I'm running it from there.
Some Java applications load configuration from locations like %APPDATA% on Windows or ~/Library/Application Support on macOS. This approach is used when an application has configuration that can change (either by manually editing the file or by the application itself), so there's no need to recompile the application with new configs.
Let me know if I have misunderstood something, so we can figure out what you are trying to ask.
Your question is slightly vague, but I get the feeling that you want the config file(s) to live "outside" of the jars.
I suggest you take a look at the application plugin. This will create a zip of your application and will also generate a start script to start it. I think you'll need to:
Customise the distZip task to add an extra folder for the config files
Customise the startScripts task to add the extra folder to the classpath of the start script (a sketch of both follows below)
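A hedged sketch of those two customisations (Gradle Groovy DSL; the config folder name and the placeholder trick are my assumptions, not something the application plugin provides out of the box):
// 1. Ship src/main/config inside the distribution zip under config/
distributions {
    main {
        contents {
            from('src/main/config') {
                into 'config'
            }
        }
    }
}

// 2. Put config/ on the start-script classpath: the plugin only expects jars
// under lib/, so add a placeholder entry and rewrite it after generation.
startScripts {
    classpath += files('placeholder')
    doLast {
        unixScript.text = unixScript.text
                .replace('$APP_HOME/lib/placeholder', '$APP_HOME/config')
        windowsScript.text = windowsScript.text
                .replace('%APP_HOME%\\lib\\placeholder', '%APP_HOME%\\config')
    }
}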
The solution for me, to be able to read an external (non-resource) file, was to create a configs folder at the root of the application.
myproj/
/configs
Doing this allowed me to read the configs using 'configs/config.dev.properties'.
I am not familiar with Gradle, so I can only give some advice about your question 1. I think you can give the full path of your property file as a parameter of FileInputStream, then load it using prop.load.
FileInputStream input = new FileInputStream("src/main/.../config.dev.properties");
Properties prop = new Properties();
prop.load(input);
// ....your code

How to read a resource file? (Google Cloud Dataflow)

My Dataflow pipeline needs to read a resource file GeoLite2-City.mmdb. I added it to my project and ran the pipeline. I confirmed that the project package zip file exists in the staging bucket on GCS.
However, when I try to read the resource file GeoLite2-City.mmdb, I get a FileNotFoundException. How can I fix this? This is my code:
String path = myClass.class.getResource("/GeoLite2-City.mmdb").getPath();
File database = new File(path);
try {
    DatabaseReader reader = new DatabaseReader.Builder(database).build(); // <- this line gets a FileNotFoundException
} catch (IOException e) {
    LOG.info(e.toString());
}
My project package zip file is "classes-WOdCPQCHjW-hRNtrfrnZMw.zip"
(it contains class files and GeoLite2-City.mmdb)
The path value is "file:/dataflow/packages/staging/classes-WOdCPQCHjW-hRNtrfrnZMw.zip!/GeoLite2-City.mmdb", however it cannot be opened.
And these are the options:
--runner=BlockingDataflowPipelineRunner
--project=peak-myproject
--stagingLocation=gs://mybucket/staging
--input=gs://mybucket_log/log.68599ca3.gz
The goal is to transform the log file on GCS and insert the transformed data into BigQuery.
When I ran it locally, the import into BigQuery succeeded.
I think there is a difference between my local PC and GCE in how the resource path is resolved.
I think the issue might be that DatabaseReader does not support paths to resources located inside a .zip or .jar file.
If that's the case, then your program worked with DirectPipelineRunner not because it's direct, but because the resource was simply located on the local filesystem rather than within the .zip file (as your comment says, the path was C:/Users/Jennie/workspace/DataflowJavaSDK-master/eclipse/starter/target/classes/GeoLite2-City.mmdb, while in the other case it was file:/dataflow/packages/staging/classes-WOdCPQCHjW-hRNtrfrnZMw.zip!/GeoLite2-City.mmdb).
I searched the web for which DatabaseReader class you might be talking about, and it seems to be https://github.com/maxmind/GeoIP2-java/blob/master/src/main/java/com/maxmind/geoip2/DatabaseReader.java .
In that case, there's a good chance that your code will work with the following minor change:
try {
    InputStream stream = myClass.class.getResourceAsStream("/GeoLite2-City.mmdb");
    DatabaseReader reader = new DatabaseReader.Builder(stream).build();
} catch (IOException e) {
    ...
}

How to index entire local Hard Drive into Apache Solr?

Is there a good approach, with Solr or a client library feeding into Solr, to index an entire hard drive? This should include content in zip files, including zip files nested recursively within other zip files.
This should be able to run on Linux (no windows-only clients).
This will of course involve making a single scan over the entire file-system from the root (or any folder actually). I'm not concerned at this point with keeping the index up to date, just creating it initially. This would be similar to the old "Google Desktop" app, which Google discontinued.
You can manipulate Solr using the SolrJ API.
Here's the API documentation: http://lucene.apache.org/solr/4_0_0/solr-solrj/index.html
And here's an article on how to use SolrJ to index files on your hard drive:
http://blog.cloudera.com/blog/2012/03/indexing-files-via-solr-and-java-mapreduce/
Files are represented by SolrInputDocument, and you use .addField to attach the fields that you'd like to search on at a later time.
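For instance, a minimal SolrJ sketch (assuming a recent SolrJ, 6+, plus an illustrative core name and field names; the linked article targets older APIs):
import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.common.SolrInputDocument;

public class IndexOneFile {
    public static void main(String[] args) throws Exception {
        try (SolrClient solr = new HttpSolrClient.Builder("http://localhost:8983/solr/files").build()) {
            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", "/home/user/notes.txt");        // unique key
            doc.addField("path_s", "/home/user/notes.txt");    // stored path
            doc.addField("content_txt", "file contents here"); // text to search on
            solr.add(doc);
            solr.commit();
        }
    }
}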
Here's example code for an Index Driver:
public class IndexDriver extends Configured implements Tool {

    public static void main(String[] args) throws Exception {
        // TODO: Add some checks here to validate the input path
        int exitCode = ToolRunner.run(new Configuration(), new IndexDriver(), args);
        System.exit(exitCode);
    }

    @Override
    public int run(String[] args) throws Exception {
        JobConf conf = new JobConf(getConf(), IndexDriver.class);
        conf.setJobName("Index Builder - Adam S @ Cloudera");
        conf.setSpeculativeExecution(false);

        // Set input and output paths
        FileInputFormat.setInputPaths(conf, new Path(args[0].toString()));
        FileOutputFormat.setOutputPath(conf, new Path(args[1].toString()));

        // Use TextInputFormat
        conf.setInputFormat(TextInputFormat.class);

        // Mapper has no output
        conf.setMapperClass(IndexMapper.class);
        conf.setMapOutputKeyClass(NullWritable.class);
        conf.setMapOutputValueClass(NullWritable.class);

        conf.setNumReduceTasks(0);
        JobClient.runJob(conf);
        return 0;
    }
}
Read the article for more info.
Compressed files
Here's info on handling compressed files: Using Solr CELL's ExtractingRequestHandler to index/extract files from package formats.
There seems to be a bug with Solr not handling zip files; here's the bug report with a fix: https://issues.apache.org/jira/browse/SOLR-2416
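Before wiring up Solr CELL, here is a hedged sketch of the scan itself (plain java.util.zip, my own illustration, not from the links above): walk the filesystem and enumerate zip entries recursively, including zips nested inside zips, printing each candidate document an indexer would receive:
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.stream.Stream;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

public class ZipWalker {

    public static void main(String[] args) throws IOException {
        Path root = Paths.get(args.length > 0 ? args[0] : ".");
        // Note: Files.walk throws an unchecked exception on unreadable
        // directories; real code would use Files.walkFileTree to skip them.
        try (Stream<Path> paths = Files.walk(root)) {
            paths.filter(Files::isRegularFile).forEach(ZipWalker::visit);
        }
    }

    static void visit(Path file) {
        System.out.println(file); // candidate document for the indexer
        if (file.toString().endsWith(".zip")) {
            try (InputStream in = Files.newInputStream(file)) {
                visitZip(in, file.toString());
            } catch (IOException e) {
                System.err.println("unreadable zip: " + file);
            }
        }
    }

    // Reads entries from a zip stream; a nested .zip entry is parsed by
    // wrapping the current stream in another ZipInputStream.
    static void visitZip(InputStream in, String prefix) throws IOException {
        ZipInputStream zis = new ZipInputStream(in);
        ZipEntry entry;
        while ((entry = zis.getNextEntry()) != null) {
            String name = prefix + "!/" + entry.getName();
            System.out.println(name); // candidate document for the indexer
            if (entry.getName().endsWith(".zip")) {
                visitZip(zis, name);
            }
        }
    }
}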
