DFC reading a file - java

I am using the DFC to access documentum. I am trying to read a file. I have the r_object_id and I now wish to return the document assoicated with this. How would I do this in java?

assuming you also have a valid session with at least read access to the file:
String docId= getDocId();
IDfSysObject doc = (IDfSysObject)session.getObject(new DfId(docId));
ByteArrayInputStream stream = doc.getContent();
see the javadocs for the return type here for info on how to process the return. Also I've noticed you've been asking quite a few Documentum Foundation Classes questions. depending on the version of the DFCs you are using, you can find the javadocs online at either powerlink or subscribenet and probably answer many of your own questions.

Related

S3 Implementation for org.apache.parquet.io.InputFile?

I am trying to write a Scala-based AWS Lambda to read Snappy compressed Parquet files based in S3. The process will write them backout in partitioned JSON files.
I have been trying to use the org.apache.parquet.hadoop.ParquetFileReader class to read the files... the non-deprecated way to do this appears to pass it a implementation of the org.apache.parquet.io.InputFile interface. There is one for Hadoop (HadoopInputFile)... but I cannot find one for S3. I also tried some of the deprecated ways for this class, but could not get them to work with S3 either.
Any solution to this dilemma?
Just in case anyone is interested... why I am doing this in Scala? Well... I cannot figure out another way to do it. The Python implementations for Parquet (pyarrow and fastparquet) both seem to struggle with complicated list/struct based schemas.
Also, I have seen some AvroParquetReader based code (Read parquet data from AWS s3 bucket) that might be a different solution, but I could not get these to work without a known schema. but maybe I am missing something there.
I'd really like to get the ParquetFileReader class to work, as it seem clean.
Appreciate any ideas.
Hadoop uses its own filesystem abstraction layer, which has an implementation for s3 (https://hadoop.apache.org/docs/current/hadoop-aws/tools/hadoop-aws/index.html#S3A).
The setup should look someting like the following (java, but same should work with scala):
Configuration conf = new Configuration();
conf.set(Constants.ENDPOINT, "https://s3.eu-central-1.amazonaws.com/");
conf.set(Constants.AWS_CREDENTIALS_PROVIDER,
DefaultAWSCredentialsProviderChain.class.getName());
// maybe additional configuration properties depending on the credential provider
URI uri = URI.create("s3a://bucketname/path");
org.apache.hadoop.fs.Path path = new Path(uri);
ParquetFileReader pfr = ParquetFileReader.open(HadoopInputFile.fromPath(path, conf))

How to use Weka JSONLoader in Java IDE?

I want to use Weka in order to parse an existing json file in java eclipse. I believe this can be done using the JSONLoader class. After I read the classes' specifications (http://weka.sourceforge.net/doc.dev/weka/core/converters/JSONLoader.html#JSONLoader--) I thought that this could be easily done by doing this:
JSONLoader jsonLoader = new JSONLoader(jsonFile);
Then I thought by just doing jsonLoader.getFileDescription() or jsonLoader.getSource() would give me results. This is not how it's done though and I can't find anywhere how to use the JSONLoader class in my java code. So in order not to make this question too broad, how can I create a JSONLoader object that reads a source that is in JSON format?
First of all it has nothing to do with eclipse so you should edit your question.
A brief look at the documentation of JSONLoader(in the link you provided) can tell that you need to set the data source you want to parse using setSource (the constructor is empty):
JSONLoader jsonLoader = new JSONLoader();
File f = new File("PATH_TO_YOUR_JSON_FILE");
jsonLoader.setSource(f); //you can also use InputStream instead of a File
After doing that you can use other methods that parse your JSON:
Instances dataset = jsonLoader.getDataSet();
jsonLoader.getFileDescription();
...

Convert RTF string to HTML String using Java

I tried using the simple EditorKit option, but that doesn't seem to support all the RTF formats.
So I turned into using either Tika,JODConverter or POI.
As of now I managed to make it work with JODConverter and openOffice by using this
OfficeManager officeManager = new DefaultOfficeManagerConfiguration()
.setPortNumbers(8100, 8101).buildOfficeManager();
officeManager.start();
OfficeDocumentConverter converter - new
OfficeDocumentConverter(officeManger);
try{
File tempFile = File.createTempFile("tempRtf", ".rtf");
BufferedWriter bw = new BufferedWriter(new FileWriter(tempFile));
bw.write(rtfString);
bw.close;
File outputTempFile = File.createTempFile("otuputTepFile", ".html");
converter.convert(tempFile, outputTempFile);
return FileUtils.readFileToString(outputTempFile);
This works.
My problem is that I actually set up a server and close it, which takes a lot of time.
I tried to see if I can bring up the process on the first run\report (I use it as a Handler in birt report) and then just to check if the process is running, if so use it to convert, and that's it, it'll save lots of time I see is wasted on initiating and closing process ( I don't care it will stay up)
My problem is that it seems like these classes as noted here are not present on my version of JODConverter.
After farther investigation, I found out that they are on the JODConverter 2.2 API and i use the 3.0 core-beta-4.
JODConverter seems to be kinda complex to a my simple need.
so if anyone knows how to start the office manger once and then just check if its up I'd love a code sample, and Of course if anyone got better solution than JODConverter to my need ill be glad to hear it.
EDIT: I need my Handler to do 2 things, 1. check if there is an instance of officemanager up, and connect to it (we skip the officeManager.start())
and 2. if the instance isn't up, then ill basically do what the code sample i wrote sent.
This code is written in a BIRT Handler, so i can't create the officeManager globally and just share it, cause the handler class runs everytime i call birt engine.
Maybe i can set up the officeManager in the Birt itself? then ill have the instance in the handler?

Microdata extraction from HTML in Java

I really need help to extract Mircodata which is embedded in HTML5. My purpose is to get structured data from a webpage just like this tool of google: http://www.google.com/webmasters/tools/richsnippets. I have searched a lot but there is no possible solution.
Currently, I use the any23 library but I can’t find any documentation, just only javadocs which dont provide enough information for me.
I use any23's Microdata Extractor but getting stuck at the third parameter: "org.w3c.dom.Document in". I can't parse a HTML content to be a w3cDom. I have used JTidy as well as JSoup but the DOM objects in these library are not fixed with the Extractor constructor. In addition, I also doubt about the 2nd parameter of the Microdata Extractor.
I hope that anyone can help me to do with any23 or suggest another library can solve this extraction issues.
Edit: I found solution myself by using the same way as any23 command line tool did. Here is the snippet of code:
HTTPDocumentSource doc = new HTTPDocumentSource(DefaultHTTPClient.createInitializedHTTPClient(), value);
InputStream documentInputInputStream = doc.openInputStream();
TagSoupParser tagSoupParser = new TagSoupParser(documentInputInputStream, doc.getDocumentURI());
Document document = tagSoupParser.getDOM();
ByteArrayOutputStream byteArrayOutput = new ByteArrayOutputStream();
MicrodataParser.getMicrodataAsJSON(tagSoupParser.getDOM(),new PrintStream(byteArrayOutput));
String result = byteArrayOutput.toString("UTF-8");
These line of code only extract microdata from HTML and write them in JSON format. I tried to use MicrodataExtractor which can change the output format to others(Rdf, turtle, ...) but the input document seems to only accept XML format. It throws "Document didn't start" when I put in a HTML document.
If anyone found the way to use MicrodataExtractor, please leave the answer here.
Thank you.
xpath is generally the way to consume html or xml.
have a look at: How to read XML using XPath in Java

Get real file extension -Java code

I would like to determine real file extension for security reason.
How can I do that?
Supposing you really mean to get the true content type of a file (ie it's MIME type) you should refer to this excellent answer.
You can get the true content type of a file in Java using the following code:
File file = new File("filename.asgdsag");
InputStream is = new BufferedInputStream(new FileInputStream(file));
String mimeType = URLConnection.guessContentTypeFromStream(is);
There are a number of ways that you can do this, some more complicated (and more reliable) than others. The page I linked to discusses quite a few of these approaches.
Not sure exactly what you mean, but however you do this it is only going to work for the specific set of file formats which are known to you
you could exclude executables (are you talking windows here?) - there's some file header information here http://support.microsoft.com/kb/65122 - you could scan and block files which look like they have an exe header - is this getting close to what you mean when you say 'real file extension'?

Categories

Resources