Upload documents into Watson's Retrieve & Rank service - java

I'm implementing a solution using Watson's Retrieve & Rank service.
When I use the tooling interface, I upload my documents and they appear as a list, where I can click on any of them to open all the titles inside the document (Answer Units), as you can see in Picture 1 and Picture 2.
When I try to upload documents via Java, it won't recognize the documents as wholes; they get uploaded in parts (Answer Units as documents), each part as a new document.
How can I upload each document as an entire document and not only as parts of it?
Here's the code for the upload function in Java:
public Answers ConvertToUnits(File doc, String collection) throws ParseException, SolrServerException, IOException{
DC.setUsernameAndPassword(USERNAME,PASSWORD);
Answers response = DC.convertDocumentToAnswer(doc).execute();
SolrInputDocument newdoc = new SolrInputDocument();
WatsonProcessing wp = new WatsonProcessing();
Collection<SolrInputDocument> newdocs = new ArrayList<SolrInputDocument>();
for(int i=0; i<response.getAnswerUnits().size(); i++)
{
String titulo = response.getAnswerUnits().get(i).getTitle();
String id = response.getAnswerUnits().get(i).getId();
newdoc.addField("title", titulo);
for(int j=0; j<response.getAnswerUnits().get(i).getContent().size(); j++)
{
String texto = response.getAnswerUnits().get(i).getContent().get(j).getText();
newdoc.addField("body", texto);
}
wp.IndexDocument(newdoc,collection);
newdoc.clear();
}
wp.ComitChanges(collection);
return response;
}
public void IndexDocument(SolrInputDocument newdoc, String collection) throws SolrServerException, IOException
{
UpdateRequest update = new UpdateRequest();
update.add(newdoc);
UpdateResponse addResponse = solrClient.add(collection, newdoc);
}

You can specify config options in this line:
Answers response = DC.convertDocumentToAnswer(doc).execute();
I think something like this should do the trick:
String configAsString = "{ \"conversion_target\":\"answer_units\", \"answer_units\": { \"selector_tags\": [] } }";
JsonParser jsonParser = new JsonParser();
JsonObject customConfig = jsonParser.parse(configAsString).getAsJsonObject();
Answers response = DC.convertDocumentToAnswer(doc, null, customConfig).execute();
I've not tried it out, so might not have got the syntax exactly right, but hopefully this will get you on the right track.
Essentially, what I'm trying to do here is use the selector_tags option in the config (see https://www.ibm.com/watson/developercloud/doc/document-conversion/customizing.shtml#htmlau for doc on this) to specify which tags the document should be split on. By specifying an empty list with no tags in, it results in it not being split at all - and coming out in a single answer unit as you want.
(Note that you can do this through the tooling interface, too - by unticking the "Split my documents up into individual answers for me" option when you upload the document)
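For readability, the escaped configuration string in the answer above corresponds to the following JSON document. It has the same keys and values, only pretty-printed, and like the answer itself it is untested:

```json
{
  "conversion_target": "answer_units",
  "answer_units": {
    "selector_tags": []
  }
}
```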


Stop Bullet number to be updated automatically when merging word docs using docx4j

I am trying to merge 2 docx files, each of which has its own bullet numbering. After merging the Word docs, the bullet numbers are automatically updated.
E.g.:
Doc A has 1 2 3
Doc B has 1 2 3
After merging, the bullet numbering is updated to 1 2 3 4 5 6.
How can I stop this?
I am using the following code:
if(counter==1)
{
FirstFileByteStream = org.apache.commons.codec.binary.Base64.decodeBase64(strFileData.getBytes());
FirstFileIS = new java.io.ByteArrayInputStream(FirstFileByteStream);
FirstWordFile = org.docx4j.openpackaging.packages.WordprocessingMLPackage.load(FirstFileIS);
main = FirstWordFile.getMainDocumentPart();
//Add page break for Table of Content
main.addObject(objBr);
if (htmlCode != null) {
main.addAltChunk(org.docx4j.openpackaging.parts.WordprocessingML.AltChunkType.Html,htmlCode.toString().getBytes());
}
//Table of contents - End
}
else
{
FileByteStream = org.apache.commons.codec.binary.Base64.decodeBase64(strFileData.getBytes());
FileIS = new java.io.ByteArrayInputStream(FileByteStream);
byte[] bytes = IOUtils.toByteArray(FileIS);
AlternativeFormatInputPart afiPart = new AlternativeFormatInputPart(new PartName("/part" + (chunkCount++) + ".docx"));
afiPart.setContentType(new ContentType(CONTENT_TYPE));
afiPart.setBinaryData(bytes);
Relationship altChunkRel = main.addTargetPart(afiPart);
CTAltChunk chunk = Context.getWmlObjectFactory().createCTAltChunk();
chunk.setId(altChunkRel.getId());
main.addObject(objBr);
htmlCode = new StringBuilder();
htmlCode.append("<html>");
htmlCode.append("<h2><br/><br/><br/><br/><br/><br/><br/><br/><br/><br/><br/><p style=\"font-family:'Arial Black'; color: #f35b1c\">"+ReqName+"</p></h2>");
htmlCode.append("</html>");
if (htmlCode != null) {
main.addAltChunk(org.docx4j.openpackaging.parts.WordprocessingML.AltChunkType.Html,htmlCode.toString().getBytes());
}
//Add Page Break before new content
main.addObject(objBr);
//Add new content
main.addObject(chunk);
}
Looking at your code, you are adding HTML altChunks to your document.
For these to display in Word, the HTML is converted to normal docx content.
An altChunk is usually converted by Word when you open the docx.
(Alternatively, docx4j-ImportXHTML can do it for an altChunk of type XHTML)
The upshot is that what happens with the bullets (when Word converts your HTML) is largely outside your control. You could experiment with CSS but I think Word will mostly ignore it.
An alternative may be to use XHTML altChunks and have docx4j-ImportXHTML convert them, via main.convertAltChunks().
If the same problem occurs when you try that, well, at least we can address it.
I was able to fix my issue using the following code. I found it at (http://webapp.docx4java.org/OnlineDemo/forms/upload_MergeDocx.xhtml). You can also generate your own custom code there; they have a nice demo that generates code according to your requirements :).
public final static String DIR_IN = System.getProperty("user.dir")+ "/";
public final static String DIR_OUT = System.getProperty("user.dir")+ "/";
public static void main(String[] args) throws Exception
{
String[] files = {"part1docx_20200717t173750539gmt.docx", "part1docx_20200717t173750539gmt (1).docx", "part1docx_20200717t173750539gmt.docx"};
List<BlockRange> blockRanges = new ArrayList<BlockRange>();
for (int i=0 ; i< files.length; i++) {
BlockRange block = new BlockRange(WordprocessingMLPackage.load(new File(DIR_IN + files[i])));
blockRanges.add( block );
block.setStyleHandler(StyleHandler.RENAME_RETAIN);
block.setNumberingHandler(NumberingHandler.ADD_NEW_LIST);
block.setRestartPageNumbering(false);
block.setHeaderBehaviour(HfBehaviour.DEFAULT);
block.setFooterBehaviour(HfBehaviour.DEFAULT);
block.setSectionBreakBefore(SectionBreakBefore.NEXT_PAGE);
}
// Perform the actual merge
DocumentBuilder documentBuilder = new DocumentBuilder();
WordprocessingMLPackage output = documentBuilder.buildOpenDocument(blockRanges);
// Save the result
SaveToZipFile saver = new SaveToZipFile(output);
saver.save(DIR_OUT+"OUT_MergeWholeDocumentsUsingBlockRange.docx");
}

Modify XML File with SAX

I'm currently trying to generate a set of models (specified via XML). In order to achieve this, I need to change a single attribute inside the file and save it under a new file name.
The XML File looks like this:
(...)
<place id="P19" initialMarking="0" invariant="&lt; inf" markingOffsetX="0.0" markingOffsetY="0.0" name="P19" nameOffsetX="-5.0" nameOffsetY="35.0" positionX="615.0" positionY="375.0"/>
<place id="P20" initialMarking="0" invariant="&lt; inf" markingOffsetX="0.0" markingOffsetY="0.0" name="P20" nameOffsetX="-5.0" nameOffsetY="35.0" positionX="375.0" positionY="225.0"/>
(...)
What needs changing is the value of initialMarking to values from 2 through 999.
Here is what I have so far:
This is where I get the list of files to change and pass them to the parser
public void parse(String dir){
getFiles(dir);
try {
XMLReader xmlReader = XMLReaderFactory.createXMLReader();
for(int i = 0; i < fileList.length; i++) {
FileReader reader = new FileReader(fileList[i]);
InputSource inputSource = new InputSource(reader);
xmlReader.setContentHandler(new ModelContentHandler());
xmlReader.parse(inputSource);
}
(...)
This is where I'm searching for the element I need to change:
public void startElement(String uri, String localName, String qName, Attributes atts) throws SAXException {
if(localName.equals("place") && atts.getValue(0).equals("P14") && atts.getValue(1).equals("2")){
System.out.println("Initial Marking of " + atts.getValue(0) + " is: " + atts.getValue(1) + "\n");
while(currentTokens <= Configuration.MAX_TOKENS){
System.out.println("Setting initial Tokens to: " + currentTokens);
}
}
}
Now, instead of printing out "Setting..." I'd like to change the according value and just save the whole file under some new name like "Model_X_Y_Token.xml".
Seems like a fairly simple thing to do, but I've never used SAX before and looking at the JavaDoc, I can't even find a place to start.
Maybe someone can point me in the right direction?
One of the best approaches here is to use dom4j. I don't exactly get the big picture of what you're trying to do, but I understand the result you want to get. Note that you will also need jaxen for this.
Step 1: read the file into an XML document
for(int i=0; i<fileList.length; i++){
Document doc = new SAXReader().read(fileList[i]);
}
Step 2: select the elements you need. For this you need to know a bit of XPath. //*/place will fetch all the place elements. //*/place[@id="P14"] will fetch only one place element.
Element place14 = (Element) doc.selectSingleNode("//*/place[@id='P14' and @initialMarking='2']");
Step 3: change the attributes of the element
place14.attribute("attributename").setValue("attributeValue");
The most efficient way is probably vtd-xml, which supports what it calls incremental update: only the modified token is rewritten, and the rest of the document is copied through unchanged.
import com.ximpleware.*;
public class changeAttrVal {
public static void main(String s[]) throws VTDException,java.io.UnsupportedEncodingException,java.io.IOException{
VTDGen vg = new VTDGen();
if (!vg.parseFile("input.xml", false))
return;
VTDNav vn = vg.getNav();
AutoPilot ap = new AutoPilot(vn);
XMLModifier xm = new XMLModifier(vn);
ap.selectXPath("/*/place[@id=\"P14\" and @initialMarking=\"2\"]/@initialMarking");
int i=0;
while((i=ap.evalXPath())!=-1){
xm.updateToken(i+1, "499");// change initial marking from 2 to 499
}
xm.output("new.xml"); // output to a new document called new.xml
}
}
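Since the question asked about SAX specifically, the same change can also be made with JDK classes alone. The sketch below is not taken from either answer. It assumes the element is called place and is matched by its id attribute, and the names MarkingRewriter and rewrite are made up for illustration. It wraps the parser in an XMLFilterImpl that swaps the attribute value as events pass through, then lets an identity Transformer serialize the filtered events.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.SAXParserFactory;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.sax.SAXSource;
import javax.xml.transform.stream.StreamResult;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.AttributesImpl;
import org.xml.sax.helpers.XMLFilterImpl;

// Hypothetical helper: rewrites initialMarking on one <place> element.
class MarkingRewriter extends XMLFilterImpl {
    private final String placeId;
    private final String newMarking;

    MarkingRewriter(XMLReader parent, String placeId, String newMarking) {
        super(parent);
        this.placeId = placeId;
        this.newMarking = newMarking;
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts)
            throws SAXException {
        if ("place".equals(qName) && placeId.equals(atts.getValue("id"))) {
            // Attributes is read-only, so edit a copy and pass that downstream.
            AttributesImpl copy = new AttributesImpl(atts);
            int index = copy.getIndex("initialMarking");
            if (index >= 0) {
                copy.setValue(index, newMarking);
            }
            atts = copy;
        }
        super.startElement(uri, localName, qName, atts);
    }

    // Parses xml, applies the filter, and returns the serialized result.
    static String rewrite(String xml, String placeId, String newMarking) {
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setNamespaceAware(true);
            XMLReader reader = factory.newSAXParser().getXMLReader();
            MarkingRewriter filter = new MarkingRewriter(reader, placeId, newMarking);
            StringWriter out = new StringWriter();
            // An identity transform copies the filtered SAX events to the output.
            TransformerFactory.newInstance().newTransformer().transform(
                    new SAXSource(filter, new InputSource(new StringReader(xml))),
                    new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException("Could not rewrite XML", e);
        }
    }
}
```

Calling rewrite once per token count from 2 through 999 and writing each result to its own file would produce the set of models described in the question.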

XPages: Creating JSON String from large amount of document

I have been trying to create a JSON string from a large number of documents using the code below, but I either get an out-of-range error or have to wait up to 5 minutes before the string is created. Any idea how I could optimise the code?
public String getJson() throws NotesException {
...
View view1 = ...;
ViewNavigator nav =view1.createViewNav();
ViewEntry ve = nav.getFirst();
JSONObject jsonMain = new JSONObject();
JSONArray items = new JSONArray();
Document docRoot = null;
while (ve != null) {
docRoot= ve.getDocument();
items.add(getJsonDocAndChildren(docRoot));
ViewEntry veTemp = nav.getNextSibling(ve);
ve.recycle();
ve = veTemp;
}
jsonMain.put("identifier", "name");
jsonMain.put("label", "name");
jsonMain.put("items", items);
return jsonMain.toJSONString();
}
private JSONObject getJsonDocAndChildren(Document doc) throws NotesException {
String name = doc.getItemValueString("Name");
JSONObject jsonDoc = new JSONObject();
jsonDoc.put("name", name);
jsonDoc.put("field", doc.getItemValueString("field"));
DocumentCollection responses = doc.getResponses();
JSONArray children = new JSONArray();
getDocEntry(name, children); // this adds to children every doc whose field has the same value as name
if (responses.getCount() > 0) {
Document docResponse = responses.getFirstDocument();
while (docResponse != null) {
children.add(getJsonDocAndChildren(docResponse));
Document docTemp = responses.getNextDocument(docResponse);
docResponse.recycle();
docResponse = docTemp;
}
}
jsonDoc.put("children", children);
return jsonDoc;
}
There are a few things here, ranging from general efficiency to optimizations based on how you want to use the code.
The big one that would likely speed up your processing would be to do view operations only, without cracking open the documents. Since it looks like you want to get responses indiscriminately, you could add the response documents to the original view, with the "Show responses in hierarchy" option turned on. Then, if you have columns for Name and field in the view (and no "Show responses only" columns), a nav.getNext() walk down the view will get them in turn. By storing the entry.getIndentLevel() value for each previous entry and comparing it at the start of the loop, you could "step" up and down the JSON tree: when the indent level increases by one, create a new array and add it to the existing object; when it decreases, step up one. It may be a little conceptually awkward at first, having to track previous states in a flat loop, but it'd be much more efficient.
Another option, also having the benefit of not having to crack open each individual document, would be to have a view of the response documents categorized by @Text($REF) and then make your recursive method look more like:
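The indent-level bookkeeping can be sketched without any Domino classes. This is only an illustration under assumptions: Row is a hypothetical stand-in for a view entry (its indent field plays the role of entry.getIndentLevel()), and Node stands in for the JSON object being built. Instead of comparing against the previous level explicitly, it keeps a stack of the currently open ancestors, which amounts to the same stepping up and down.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Hypothetical helper: turns a flat, indent-annotated list into a tree.
final class IndentTreeBuilder {

    // Stand-in for a view entry: column value plus indent level (0 = top level).
    static final class Row {
        final String name;
        final int indent;

        Row(String name, int indent) {
            this.name = name;
            this.indent = indent;
        }
    }

    // Stand-in for the JSON object that would be emitted for each entry.
    static final class Node {
        final String name;
        final List<Node> children = new ArrayList<>();

        Node(String name) {
            this.name = name;
        }
    }

    static List<Node> build(List<Row> rows) {
        List<Node> roots = new ArrayList<>();
        // Open ancestors of the current row; the stack size equals the current depth.
        Deque<Node> path = new ArrayDeque<>();
        for (Row row : rows) {
            Node node = new Node(row.name);
            // Step up until the top of the stack is this row's parent.
            while (path.size() > row.indent) {
                path.pop();
            }
            if (path.isEmpty()) {
                roots.add(node);
            } else {
                path.peek().children.add(node);
            }
            path.push(node);
        }
        return roots;
    }
}
```

The loop stays flat and touches each row once, which is the property that makes the view-only walk cheaper than recursing through getResponses().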
public static void walkTree(final View treeView, final String documentId) {
ViewNavigator nav = treeView.createViewNavFromCategory(documentId);
nav.setBufferMaxEntries(400);
for (ViewEntry entry : nav) {
// Do code here
walkTree(treeView, entry.getUniversalID());
}
}
(That example is using the OpenNTF Domino API, but, if you're not using that, you could down-convert the for loop to the legacy style)
As a minor improvement any time you traverse through ViewNavigators, you can set view.setAutoUpdate(false) and then nav.setBufferMaxEntries(400) to improve the internal caching.
And finally, depending on your needs - say, if you're outputting the JSON directly to an HTTP response's output stream - you could use JsonWriter instead of JsonObject to stream the content out instead of building a huge object in memory. I wrote about it with some simple code here: https://frostillic.us/blog/posts/EF0B875453B3CFC285257D570072F78F
You should first determine where the time is spent in your code. Maybe it is in doc.getResponses() or responses.getNextDocument() which you did not show here.
The obvious optimization which could be done within your code snippet is the following:
Basically you have some data structure called Document and build up a corresponding in memory JSON structure consisting of JSONObjects and JSONArrays. This JSON structure is then serialized to a String and returned.
Instead of building the JSON structure you could directly use a JsonWriter (don't know what JSON library you are using but there must be something like a JsonWriter). This avoids the memory allocations for the temporary JSON structure.
In getJson() you start:
StringWriter stringOut = new StringWriter();
JsonWriter out = new JsonWriter(stringOut);
and end
return stringOut.toString();
Now everywhere you are creating JSONObjects or JSONArrays, you invoke the corresponding writer methods, e.g.:
private void getJsonDocAndChildren(Document doc, JsonWriter out) throws NotesException {
out.name("name");
out.value(doc.getItemValueString("Name"));
out.name("field");
out.value(doc.getItemValueString("field"));
DocumentCollection responses = doc.getResponses();
if (responses.getCount() > 0) {
Document docResponse = responses.getFirstDocument();
out.startArray();
...
Hope you get the idea.
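To make the streaming idea concrete without tying it to a particular JSON library, here is a minimal sketch that writes straight to a java.io.Writer. It makes several assumptions: Doc is a hypothetical stand-in for a Notes document and its responses, only a name field is emitted, and the hand-rolled escape method covers quotes, backslashes and control characters only. A real JsonWriter from whichever library you use would replace the manual string handling.

```java
import java.io.IOException;
import java.io.StringWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: streams a document tree as JSON without building it in memory.
final class StreamingTreeWriter {

    // Stand-in for a document with response documents.
    static final class Doc {
        final String name;
        final List<Doc> responses = new ArrayList<>();

        Doc(String name) {
            this.name = name;
        }
    }

    // Escapes quotes, backslashes and control characters for use in a JSON string.
    static String escape(String s) {
        StringBuilder sb = new StringBuilder(s.length() + 8);
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c == '"' || c == '\\') {
                sb.append('\\').append(c);
            } else if (c < 0x20) {
                sb.append(String.format("\\u%04x", (int) c));
            } else {
                sb.append(c);
            }
        }
        return sb.toString();
    }

    // Writes one document and, recursively, its responses directly to out.
    static void write(Doc doc, Writer out) throws IOException {
        out.write("{\"name\":\"");
        out.write(escape(doc.name));
        out.write("\",\"children\":[");
        boolean first = true;
        for (Doc child : doc.responses) {
            if (!first) {
                out.write(',');
            }
            write(child, out);
            first = false;
        }
        out.write("]}");
    }

    // Convenience wrapper for callers that still want a String.
    static String toJson(Doc root) {
        StringWriter out = new StringWriter();
        try {
            write(root, out);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toString();
    }
}
```

Passing the HTTP response's writer to write instead of a StringWriter avoids holding the whole string in memory as well.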

Having trouble extracting values from JSON

Examples:
{"name":"tv.twitch:twitch:5.16"}
{"name":"tv.twitch:twitch-external-platform:4.5","extract":{"exclude":["META-INF/"]},"natives":{"windows":"natives-windows-${arch}"},"rules":[{"os":{"name":"windows"},"action":"allow"}]}
These lines came from a JSONArray, I'd like to extract the "natives" portion. The problem is, not all items in the JSONArray have the "natives" value. Here is my current code to extract the "name" value
JSONObject json = new JSONObject(readUrl(url.toString()));
JSONArray jsonArray = json.getJSONArray("libraries");
ArrayList<String> libraries = new ArrayList<String>();
for (int i = 0; i < jsonArray.length(); i++) {
JSONObject next = jsonArray.getJSONObject(i);
String lib = next.getString("name");
libraries.add(lib);
}
I'm not exactly sure about this since I am new to java/JSON parsing, but would an object in the array without the "natives" value cause the program to end?
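You can use the has method of JSONObject to determine whether it contains the specified key. From the API documentation:
Determine if the JSONObject contains a specific key.
Reading a missing key with getString or getJSONObject throws a JSONException, so check each array element first. In your case you can do it like this:
JSONArray jsonArray = new JSONObject(readUrl(url.toString())).getJSONArray("libraries");
for (int i = 0; i < jsonArray.length(); i++) {
JSONObject next = jsonArray.getJSONObject(i);
if (next.has("natives")) {
// Logic to extract natives, e.g. next.getJSONObject("natives")
} else {
// Logic for entries without natives
}
}
These simple lines should suffice for your requirement. See the JSONObject API documentation.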
You seem to want to extract content at JSON Pointers /name and /natives/windows.
In this case, using this library (which depends on Jackson), it is as simple as:
// All of these are thread safe
private static final ObjectReader READER = JacksonUtils.getReader();
private static final JsonPointer NAME_POINTER = JsonPointer.of("name");
private static final JsonPointer WINDOWS_POINTER
= JsonPointer.of("natives", "windows");
// Fetch content from URL
final JsonNode content = READER.readTree(url.getInputStream());
// Get content at pointers, if any
final JsonNode nameNode = NAME_POINTER.path(content);
final JsonNode windowsNode = WINDOWS_POINTER.path(content);
Then, to check if a node actually exists, check against .isMissingNode():
if (windowsNode.isMissingNode())
// deal with no windows content
Alternatively, use .get() instead of .path() and check for null instead.

Search in spreadsheets not working for new files created

I create copies of my spreadsheet template on Google Docs with the Document List API, and I noticed that:
1. title queries work fine
2. content queries are not working (*) or only partially working (**)
(*) for the majority of spreadsheets: I searched for every word in the content of a spreadsheet and got no results
(**) for a few spreadsheets, I get results for some words that were copied from the template; queries for words specific to that spreadsheet do not work
3. If I update the spreadsheet after a few minutes, all queries work fine.
(I run these searches from the UI.)
These are the steps for creating these files:
1. Copy spreadsheet template to root
private String sendPostCopyRequest(String authorizationToken, String resourceID, String title, int noRetries) throws IOException{
/*
resourceId = resource id for the template that i want to copy
title = the title of the new file created
*/
String urlStr = "https://docs.google.com/feeds/default/private/full";
URL url = new URL(urlStr);
HttpURLConnection copyHttpUrlConn = (HttpURLConnection) url.openConnection();
copyHttpUrlConn.setDoOutput(true);
copyHttpUrlConn.setRequestMethod("POST");
String outputString = "<?xml version='1.0' encoding='UTF-8'?>" +
"<entry xmlns=\"http://www.w3.org/2005/Atom\"> " +
"<id>https://docs.google.com/feeds/default/private/full/" + resourceID +"</id>" +
" <title>" + title + "</title></entry>";
copyHttpUrlConn.setRequestProperty("GData-Version", "3.0");
copyHttpUrlConn.setRequestProperty("Content-Type","application/atom+xml");
// Content-Length must count bytes, not characters, so encode the body once and reuse it
byte[] body = outputString.getBytes("UTF-8");
copyHttpUrlConn.setRequestProperty("Content-Length", String.valueOf(body.length));
copyHttpUrlConn.setRequestProperty("Authorization", "GoogleLogin auth=" + authorizationToken);
OutputStream outputStream = copyHttpUrlConn.getOutputStream();
outputStream.write(body);
copyHttpUrlConn.getResponseCode();
return readIdFromResponse(copyHttpUrlConn.getInputStream());
}
2. I update some cells using this method:
public boolean setCellValue(SpreadsheetService spreadSheetService, SpreadsheetEntry entry, int worksheetNumber, String position, String value) throws IOException, ServiceException {
List<WorksheetEntry> worksheets = entry.getWorksheets();
WorksheetEntry worksheet = worksheets.get(worksheetNumber);
URL cellFeedUrl = worksheet.getCellFeedUrl();
CellQuery query = new CellQuery(cellFeedUrl);
query.setReturnEmpty(true);
query.setRange(position);
CellFeed cellFeed = spreadSheetService.query(query, CellFeed.class);
CellEntry cell = cellFeed.getEntries().get(0);
cell.changeInputValueLocal(value);
cell.update();
return true;
}
3. I move the created file to a new folder (collection)
public DocumentListEntry moveSpreadSheet(DocsService docsService, String entryId, String destinationFolderDocId) throws MalformedURLException, IOException, ServiceException {
DocumentListEntry newEntry = null;
newEntry = new com.google.gdata.data.docs.SpreadsheetEntry();
newEntry.setId(entryId);
String destFolderUri = "https://docs.google.com/feeds/default/private/full/folder%3A"+ destinationFolderDocId + "/contents";
return docsService.insert(new URL(destFolderUri), newEntry);
}
(I get the same results with the gdata Java SDK 1.4.5, 1.4.6, and 1.4.7.)
This has been happening since approximately 2011-12-23. For all the spreadsheets created with the same code before this date, all queries work fine.
I can provide any other information on request.
Update:
This issue also seems to appear when uploading spreadsheets with conversion.
If I update the files some time after creation/upload (~2 hours), the queries return them in the results.
Your issue could be related to slowish Google indexing of spreadsheet contents.
https://groups.google.com/a/googleproductforums.com/d/msg/docs/vEhI_HkKX3I/MGKqkryrx90J
"at the moment it can take about 10 minutes to index the content you've written into your spreadsheet. So if you type something in, and then search for it right away, it might not show up yet in your list of document results. Give it a few more minutes (we are working on making this faster)"
