How to index an entire local hard drive into Apache Solr? - java

Is there a good approach, with Solr or a client library feeding into Solr, to index an entire hard drive? This should include the content of zip files, including zip files nested recursively within other zip files.
This should be able to run on Linux (no windows-only clients).
This will of course involve making a single scan over the entire file-system from the root (or any folder actually). I'm not concerned at this point with keeping the index up to date, just creating it initially. This would be similar to the old "Google Desktop" app, which Google discontinued.

You can manipulate Solr using the SolrJ API.
Here's the API documentation: http://lucene.apache.org/solr/4_0_0/solr-solrj/index.html
And here's an article on how to use SolrJ to index files on your hard drive.
http://blog.cloudera.com/blog/2012/03/indexing-files-via-solr-and-java-mapreduce/
Files are represented by SolrInputDocument, and you use .addField to attach the fields that you'd like to search on later.
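Before the MapReduce example below, here is a minimal, hedged SolrJ sketch of that idea (it assumes Solr 4.x running locally with a core named collection1 and a schema that has id, path and content fields; adjust the URL and field names to your setup):

import java.io.File;
import org.apache.commons.io.FileUtils;
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class SimpleFileIndexer {
    public static void main(String[] args) throws Exception {
        // Hypothetical local Solr 4.x core; change the URL/core name for your install
        SolrServer server = new HttpSolrServer("http://localhost:8983/solr/collection1");

        File file = new File(args[0]);
        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", file.getAbsolutePath());
        doc.addField("path", file.getAbsolutePath());
        doc.addField("content", FileUtils.readFileToString(file, "UTF-8"));

        server.add(doc);
        server.commit();
    }
}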
Here's example code for an Index Driver:
public class IndexDriver extends Configured implements Tool {

    public static void main(String[] args) throws Exception {
        // TODO: Add some checks here to validate the input path
        int exitCode = ToolRunner.run(new Configuration(),
                new IndexDriver(), args);
        System.exit(exitCode);
    }

    @Override
    public int run(String[] args) throws Exception {
        JobConf conf = new JobConf(getConf(), IndexDriver.class);
        conf.setJobName("Index Builder - Adam S @ Cloudera");
        conf.setSpeculativeExecution(false);

        // Set Input and Output paths
        FileInputFormat.setInputPaths(conf, new Path(args[0].toString()));
        FileOutputFormat.setOutputPath(conf, new Path(args[1].toString()));

        // Use TextInputFormat
        conf.setInputFormat(TextInputFormat.class);

        // Mapper has no output
        conf.setMapperClass(IndexMapper.class);
        conf.setMapOutputKeyClass(NullWritable.class);
        conf.setMapOutputValueClass(NullWritable.class);

        conf.setNumReduceTasks(0);
        JobClient.runJob(conf);
        return 0;
    }
}
Read the article for more info.
Compressed files
Here's info on handling compressed files: Using Solr CELL's ExtractingRequestHandler to index/extract files from package formats
There seems to be a bug with Solr not handling zip files; here's the bug report with a fix: https://issues.apache.org/jira/browse/SOLR-2416
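If you go the Solr Cell route, a hedged SolrJ sketch for sending an archive to the /update/extract handler could look like this (the core URL, the use of the path as the document id, and the "application/zip" content type are assumptions; the ExtractingRequestHandler must be enabled in solrconfig.xml):

import java.io.File;
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.client.solrj.request.AbstractUpdateRequest;
import org.apache.solr.client.solrj.request.ContentStreamUpdateRequest;

public class ZipExtractExample {
    public static void main(String[] args) throws Exception {
        SolrServer server = new HttpSolrServer("http://localhost:8983/solr/collection1");

        // Send the zip to the extracting request handler (Solr Cell / Tika)
        ContentStreamUpdateRequest req = new ContentStreamUpdateRequest("/update/extract");
        req.addFile(new File(args[0]), "application/zip");
        req.setParam("literal.id", args[0]);            // use the path as the document id
        req.setAction(AbstractUpdateRequest.ACTION.COMMIT, true, true);

        server.request(req);
    }
}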

Related

How to have my Java project use some files without using their absolute paths?

I have written a project where some images are used for the application's appearance and some text files get created and deleted along the way. I only used the absolute paths of all the files in order to see how the project would work, and now that it is finished I want to send it to someone else. What I'm asking is how I can link those files to the project so that the other person doesn't have to adjust those absolute paths for their computer. Something like turning the final jar file, together with the necessary files, into a zip file, so that the person extracts the zip file, imports the jar file, and when they run it, the program works without any problems.
By the way, I add the images using the ImageIcon class.
I'm using Eclipse.
For files that you just want to read, such as images used in your app's icons:
Ship them the same way you ship your class files: In your jar or jmod file.
Use YourClassName.class.getResource or .getResourceAsStream to read these. They are not files, so any API that needs a File object can't work with them. Don't use those APIs (they are bad); good APIs take a URI, URL, or InputStream, which works fine with this.
Example:
package com.foo;

public class MyMainApp {

    public void example() {
        Image image = new Image(MyMainApp.class.getResource("img/send.png"));
    }

    public void example2() throws IOException {
        try (var raw = MyMainApp.class.getResourceAsStream("/data/countries.txt")) {
            BufferedReader in = new BufferedReader(
                    new InputStreamReader(raw, StandardCharsets.UTF_8));
            for (String line = in.readLine(); line != null; line = in.readLine()) {
                // do something with each country
            }
        }
    }
}
This class file will end up in your jar as /com/foo/MyMainApp.class. That same jar file should also contain /com/foo/img/send.png and /data/countries.txt. (Note how the string argument you pass to getResource(AsStream) can start with a slash or not, which controls whether it's relative to the location of the class or to the root of the jar. Your choice as to what you find nicer.)
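Since the question uses ImageIcon, the same pattern applies there too; a small sketch (the resource path /img/send.png is an assumption):

import javax.swing.ImageIcon;

public class IconLoader {
    // Loads an icon bundled inside the jar rather than from an absolute path on disk
    public static ImageIcon loadSendIcon() {
        return new ImageIcon(IconLoader.class.getResource("/img/send.png"));
    }
}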
For files that your app will create / update:
This shouldn't be anywhere near where your jar file is. That's late-80s/silly-Windows thinking. Applications are (or should be!) installed in places that the app itself cannot write to. In general the installation directory of an application is a read-only affair, and it most certainly should not contain a user's documents. These should go in the user's home directory, or possibly in e.g. 'My Documents'.
Example:
public void save() throws IOException {
    Path p = Paths.get(System.getProperty("user.home"), "navids-app.save");
    // save to that file.
}

How to fix this FindBugs "Potential Path Traversal" warning in new File(filePath)?

public void createFile(String filePath) {
    File file = new File(filePath);
}

change to ====>

public void createFile(String filePath) {
    File file = new File(FilenameUtils.getFullPath(filePath),
            FilenameUtils.getName(filePath));
}
But it still reports the Potential Path Traversal bug. How can I fix this FindBugs warning? Thank you.
With a hard-coded path it is OK, but that doesn't fit my requirement:
public void createFile(String filePath) {
    File file = new File("resource/image/",
            FilenameUtils.getName(filePath));
}
One of the ways to fix it is to not use variable input to access files on the server. Here is one of the suggestions from https://cwe.mitre.org/data/definitions/73.html:
When the set of filenames is limited or known, create a mapping from a
set of fixed input values (such as numeric IDs) to the actual
filenames, and reject all other inputs. For example, ID 1 could map to
"inbox.txt" and ID 2 could map to "profile.txt". Features such as the
ESAPI AccessReferenceMap provide this capability.
Your code is still flagged by SonarQube because, even if you use FilenameUtils.getName(), it still uses user-provided variables as part of the file path.
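A minimal sketch of that ID-to-filename mapping approach (the map contents and the base directory are assumptions, not part of the original question):

import java.io.File;
import java.util.HashMap;
import java.util.Map;

public class SafeFileResolver {

    // Fixed whitelist: only these IDs map to real files
    private static final Map<Integer, String> FILES = new HashMap<>();
    static {
        FILES.put(1, "inbox.txt");
        FILES.put(2, "profile.txt");
    }

    public File resolve(int id) {
        String name = FILES.get(id);
        if (name == null) {
            throw new IllegalArgumentException("Unknown file id: " + id);
        }
        // The path is built only from trusted constants, never from user input
        return new File("resource/image", name);
    }
}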

Hadoop Distributed Cache to process large look up text file

I am trying to implement a MapReduce job that processes a large text file (as a lookup file) in addition to the actual dataset (input). The lookup file is more than 2 GB.
I tried to load the text file as a third argument as follows:
but I got a Java Heap Space error.
After doing some searching, it was suggested to use the Distributed Cache. This is what I have done so far.
First, I used this method to read the lookup file:
public static String readDistributedFile(Context context) throws IOException {
    URI[] cacheFiles = context.getCacheFiles();
    Path path = new Path(cacheFiles[0].getPath().toString());
    FileSystem fs = FileSystem.get(new Configuration());
    StringBuilder sb = new StringBuilder();
    BufferedReader br = new BufferedReader(new InputStreamReader(fs.open(path)));
    String line;
    while ((line = br.readLine()) != null) {
        // split line
        sb.append(line);
        sb.append("\n");
    }
    br.close();
    return sb.toString();
}
Second, In the Mapper:
protected void setup(Context context)
        throws IOException, InterruptedException {
    super.setup(context);
    String lookUpText = readDistributedFile(context);
    // do something with the text
}
Third, to run the job
hadoop jar mapReduceJob.jar the.specific.class -files ../LargeLookUpFileInStoredLocally.txt /user/name/inputdataset/*.gz /user/name/output
But the problem is that the job takes a long time to load.
Maybe it was not a good idea to use the distributed cache, or maybe I am missing something in my code.
I am working with Hadoop 2.5.
I have already checked some related questions such as [1].
Any ideas will be great!
[1] Hadoop DistributedCache is deprecated - what is the preferred API?
The distributed cache is mostly used to ship files that the map-reduce tasks need on the task nodes and that are not part of the job jar.
Another usage is when performing joins between a big and a small data set: rather than using multiple input paths, we use a single (big) input file, fetch the other, small file via the distributed cache, and then compare (or join) the two data sets.
The reason for the extra time in your case is that you are trying to read the entire 2 GB file before the map-reduce starts (since it is read in the setup method).
Can you give the reason why you are loading the huge 2 GB file using the distributed cache?
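For the join use case mentioned above, a hedged sketch of a mapper that loads a (small) cached file into a HashMap in setup() and joins against it in map() might look like this; the local file name is assumed to match the basename passed to -files, and a tab-separated key/value layout is an assumption:

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class JoinMapper extends Mapper<LongWritable, Text, Text, Text> {

    private final Map<String, String> lookup = new HashMap<String, String>();

    @Override
    protected void setup(Context context) throws IOException, InterruptedException {
        // The -files option places the named file in the task's working directory
        BufferedReader br = new BufferedReader(new FileReader("LargeLookUpFileInStoredLocally.txt"));
        String line;
        while ((line = br.readLine()) != null) {
            String[] parts = line.split("\t", 2);   // assumed format: key<TAB>value
            if (parts.length == 2) {
                lookup.put(parts[0], parts[1]);
            }
        }
        br.close();
    }

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String[] parts = value.toString().split("\t", 2);
        String joined = lookup.get(parts[0]);       // join on the first field
        if (parts.length == 2 && joined != null) {
            context.write(new Text(parts[0]), new Text(parts[1] + "\t" + joined));
        }
    }
}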

DistributedCache in Hadoop 2.x

I have a problem with DistributedCache in the new Hadoop 2.x API. I found some people working around this issue, but it does not solve my problem; for example, this solution does not work for me because I get a NullPointerException when trying to retrieve the data from the DistributedCache.
My configuration is as follows:
Driver
public int run(String[] arg) throws Exception {
    Configuration conf = this.getConf();
    Job job = new Job(conf, "job Name");
    ...
    job.addCacheFile(new URI(arg[1]));
Setup
protected void setup(Context context)
        throws IOException, InterruptedException {
    Configuration conf = context.getConfiguration();
    URI[] cacheFiles = context.getCacheFiles();
    BufferedReader dtardr = new BufferedReader(new FileReader(cacheFiles[0].toString()));
Here, when it starts creating the buffered reader, it throws the NullPointerException. This is happening because context.getCacheFiles() always returns null. How do I solve this problem, and where are the cache files stored (HDFS, or the local file system)?
If you use the local JobRunner in Hadoop (non-distributed mode, as a single Java process), then no local data directory is created; the getLocalCacheFiles() or getCacheFiles() call will return an empty set of results. Can you make sure that you are running your job in distributed or pseudo-distributed mode?
The Hadoop framework will copy the files set in the distributed cache to the local working directory of each task in the job.
There are copies of all cached files, placed in the local file system of each worker machine. (They will be in a subdirectory of mapred.local.dir.)
You can refer to this link to understand more about DistributedCache.
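For completeness, a hedged sketch of adding a cache file with a '#' fragment alias in the driver and then reading it from the task's working directory in setup() (the HDFS path and the alias name "lookup" are assumptions):

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.net.URI;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;

public class CacheFileExample {

    // Driver side: register the HDFS file with a '#alias' so each task sees a stable local name
    public static void addLookupFile(Job job) throws Exception {
        job.addCacheFile(new URI("/user/name/lookup.txt#lookup"));
    }

    // Mapper side: the framework symlinks the cached file into the task's working directory
    public static class CacheAwareMapper extends Mapper<LongWritable, Text, Text, Text> {
        @Override
        protected void setup(Context context) throws IOException, InterruptedException {
            BufferedReader reader = new BufferedReader(new FileReader("lookup"));
            String line;
            while ((line = reader.readLine()) != null) {
                // ... use the cached data ...
            }
            reader.close();
        }
    }
}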

How to put a serialized object into the Hadoop DFS and get it back inside the map function?

I'm new to Hadoop and recently I was asked to do a test project using Hadoop.
So while I was reading about Big Data, I happened to learn about Pail. Now what I want to do is something like this: first create a simple object, then serialize it using Thrift and put it into HDFS using Pail. Then I want to get that object inside the map function and do whatever I want with it. But I have no idea how to get that object inside the map function.
Can someone please tell me of any references or explain how to do that?
I can think of three options:
Use the -files option and name the file in HDFS (preferable as the task tracker will download the file once for all jobs running on that node)
Use the DistributedCache (similar logic to the above), but you configure the file via some API calls rather than through the command line
Load the file directly from HDFS (less efficient as you're pulling the file over HDFS for each task)
As for some code, put the load logic into your mapper's setup(...) or configure(..) method (depending on whether you're using the new or old API) as follows:
protected void setup(Context context) throws IOException, InterruptedException {
    // the -files option makes the named file available in the local directory
    File file = new File("filename.dat");
    // open file and load contents ...

    // or load the file directly from HDFS
    FileSystem fs = FileSystem.get(context.getConfiguration());
    InputStream hdfsInputStream = fs.open(new Path("/path/to/file/in/hdfs/filename.dat"));
    // load file contents from stream...
}
DistributedCache has some example code in its Javadocs.
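For option 2 above (configuring the file via API calls instead of the command line), a hedged sketch using the old-API DistributedCache helper might look like this (the HDFS path and the symlink name are assumptions):

import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.filecache.DistributedCache;

public class CacheSetup {
    public static void configure(Configuration conf) throws Exception {
        // The '#filename.dat' fragment controls the local symlink name in the task directory
        DistributedCache.addCacheFile(new URI("/path/to/file/in/hdfs/filename.dat#filename.dat"), conf);
        DistributedCache.createSymlink(conf);
    }
}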
