Search in spreadsheets not working for newly created files - Java

I create copies of my spreadsheet template on Google Docs with the Documents List API, and I noticed that:
1. Title queries work fine.
2. Content queries do not work (*) or work only partially (**).
(*) For the majority of spreadsheets: I searched for every word from the content of a spreadsheet and got no results.
(**) For a few spreadsheets I find results for some words copied from the template; queries for the remaining words do not work.
3. If I update the spreadsheet after a few minutes, all queries work fine.
(I run these searches from the UI.)
These are the steps for creating the files:
1. Copy the spreadsheet template to the root collection:
private String sendPostCopyRequest(String authorizationToken, String resourceID, String title, int noRetries) throws IOException {
    /*
     * resourceID = resource id of the template that I want to copy
     * title = the title of the new file created
     */
    String urlStr = "https://docs.google.com/feeds/default/private/full";
    URL url = new URL(urlStr);
    HttpURLConnection copyHttpUrlConn = (HttpURLConnection) url.openConnection();
    copyHttpUrlConn.setDoOutput(true);
    copyHttpUrlConn.setRequestMethod("POST");
    String outputString = "<?xml version='1.0' encoding='UTF-8'?>"
            + "<entry xmlns=\"http://www.w3.org/2005/Atom\">"
            + "<id>https://docs.google.com/feeds/default/private/full/" + resourceID + "</id>"
            + "<title>" + title + "</title></entry>";
    // use the byte length, not the character count, for Content-Length
    byte[] body = outputString.getBytes("UTF-8");
    copyHttpUrlConn.setRequestProperty("GData-Version", "3.0");
    copyHttpUrlConn.setRequestProperty("Content-Type", "application/atom+xml");
    copyHttpUrlConn.setRequestProperty("Content-Length", String.valueOf(body.length));
    copyHttpUrlConn.setRequestProperty("Authorization", "GoogleLogin auth=" + authorizationToken);
    OutputStream outputStream = copyHttpUrlConn.getOutputStream();
    outputStream.write(body);
    outputStream.close();
    copyHttpUrlConn.getResponseCode();
    return readIdFromResponse(copyHttpUrlConn.getInputStream());
}
2. I update some cells using this method:
public boolean setCellValue(SpreadsheetService spreadSheetService, SpreadsheetEntry entry,
        int worksheetNumber, String position, String value) throws IOException, ServiceException {
    List<WorksheetEntry> worksheets = entry.getWorksheets();
    WorksheetEntry worksheet = worksheets.get(worksheetNumber);
    URL cellFeedUrl = worksheet.getCellFeedUrl();
    CellQuery query = new CellQuery(cellFeedUrl);
    query.setReturnEmpty(true);
    query.setRange(position);
    CellFeed cellFeed = spreadSheetService.query(query, CellFeed.class);
    CellEntry cell = cellFeed.getEntries().get(0);
    cell.changeInputValueLocal(value);
    cell.update();
    return true;
}
3. I move the created file to a new folder (collection):
public DocumentListEntry moveSpreadSheet(DocsService docsService, String entryId,
        String destinationFolderDocId) throws MalformedURLException, IOException, ServiceException {
    DocumentListEntry newEntry = new com.google.gdata.data.docs.SpreadsheetEntry();
    newEntry.setId(entryId);
    String destFolderUri = "https://docs.google.com/feeds/default/private/full/folder%3A"
            + destinationFolderDocId + "/contents";
    return docsService.insert(new URL(destFolderUri), newEntry);
}
(I get the same results with GData Java SDK 1.4.5, 1.4.6, and 1.4.7.)
This has been happening since approximately 2011-12-23. For all the spreadsheets created with the same code before that date, all queries work fine.
I can provide any other information on request.
Update:
This issue also appears when uploading spreadsheets with conversion.
If I update the files some time after creation/upload (~2 hours), the queries return them in the results.

Your issue could be related to Google's slowish indexing of spreadsheet contents:
https://groups.google.com/a/googleproductforums.com/d/msg/docs/vEhI_HkKX3I/MGKqkryrx90J
"at the moment it can take about 10 minutes to index the content you've written into your spreadsheet. So if you type something in, and then search for it right away, it might not show up yet in your list of document results. Give it a few more minutes (we are working on making this faster)"

Related

Upload documents into Watson's Retrieve & Rank service

I'm implementing a solution using Watson's Retrieve & Rank service.
When I use the tooling interface, I upload my documents and they appear as a list, where I can click on any of them to open up all the titles inside the document (answer units), as you can see in Picture 1 and Picture 2.
When I try to upload documents via Java, it won't recognize the documents; they get uploaded in parts (answer units as documents), each part as a new document.
I would like to know how I can upload my documents as entire documents and not only parts of them.
Here's the code for the upload function in Java:
public Answers ConvertToUnits(File doc, String collection)
        throws ParseException, SolrServerException, IOException {
    DC.setUsernameAndPassword(USERNAME, PASSWORD);
    Answers response = DC.convertDocumentToAnswer(doc).execute();
    SolrInputDocument newdoc = new SolrInputDocument();
    WatsonProcessing wp = new WatsonProcessing();
    Collection<SolrInputDocument> newdocs = new ArrayList<SolrInputDocument>();
    for (int i = 0; i < response.getAnswerUnits().size(); i++) {
        String titulo = response.getAnswerUnits().get(i).getTitle();
        String id = response.getAnswerUnits().get(i).getId();
        newdoc.addField("title", titulo);
        for (int j = 0; j < response.getAnswerUnits().get(i).getContent().size(); j++) {
            String texto = response.getAnswerUnits().get(i).getContent().get(j).getText();
            newdoc.addField("body", texto);
        }
        wp.IndexDocument(newdoc, collection);
        newdoc.clear();
    }
    wp.ComitChanges(collection);
    return response;
}
public void IndexDocument(SolrInputDocument newdoc, String collection)
        throws SolrServerException, IOException {
    UpdateRequest update = new UpdateRequest();
    update.add(newdoc);
    UpdateResponse addResponse = solrClient.add(collection, newdoc);
}
You can specify config options in this line:
Answers response = DC.convertDocumentToAnswer(doc).execute();
I think something like this should do the trick:
String configAsString = "{ \"conversion_target\":\"answer_units\", \"answer_units\": { \"selector_tags\": [] } }";
JsonParser jsonParser = new JsonParser();
JsonObject customConfig = jsonParser.parse(configAsString).getAsJsonObject();
Answers response = DC.convertDocumentToAnswer(doc, null, customConfig).execute();
I've not tried it out, so might not have got the syntax exactly right, but hopefully this will get you on the right track.
Essentially, what I'm trying to do here is use the selector_tags option in the config (see https://www.ibm.com/watson/developercloud/doc/document-conversion/customizing.shtml#htmlau for the doc on this) to specify which tags the document should be split on. By specifying an empty list with no tags in it, the document is not split at all - it comes out in a single answer unit, as you want.
(Note that you can do this through the tooling interface, too - by unticking the "Split my documents up into individual answers for me" option when you upload the document)
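If the conversion does come back as a single answer unit, the indexing loop from the question collapses to one Solr document per file. A rough sketch under that assumption, reusing the question's DC, wp, and collection, and the customConfig built above:
// Convert with the custom config, then index the single resulting answer unit.
Answers response = DC.convertDocumentToAnswer(doc, null, customConfig).execute();
SolrInputDocument solrDoc = new SolrInputDocument();
solrDoc.addField("title", response.getAnswerUnits().get(0).getTitle());
StringBuilder body = new StringBuilder();
for (int j = 0; j < response.getAnswerUnits().get(0).getContent().size(); j++) {
    body.append(response.getAnswerUnits().get(0).getContent().get(j).getText()).append('\n');
}
solrDoc.addField("body", body.toString());
wp.IndexDocument(solrDoc, collection);
wp.ComitChanges(collection);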

How to get spreadsheets from a specific Google Drive folder?

The code provided in this tutorial (snippet given below) retrieves a list of all the spreadsheets for the authenticated user.
public class MySpreadsheetIntegration {
    public static void main(String[] args) throws AuthenticationException,
            MalformedURLException, IOException, ServiceException {
        SpreadsheetService service = new SpreadsheetService("MySpreadsheetIntegration-v1");
        // TODO: Authorize the service object for a specific user (see other sections)

        // Define the URL to request. This should never change.
        URL SPREADSHEET_FEED_URL = new URL(
                "https://spreadsheets.google.com/feeds/spreadsheets/private/full");

        // Make a request to the API and get all spreadsheets.
        SpreadsheetFeed feed = service.getFeed(SPREADSHEET_FEED_URL, SpreadsheetFeed.class);
        List<SpreadsheetEntry> spreadsheets = feed.getEntries();

        // Iterate through all of the spreadsheets returned
        for (SpreadsheetEntry spreadsheet : spreadsheets) {
            // Print the title of this spreadsheet to the screen
            System.out.println(spreadsheet.getTitle().getPlainText());
        }
    }
}
But I don't want to get all the spreadsheets. I only want the spreadsheets that are in a particular folder (if the folder exists; otherwise the program should terminate). Is that possible using this API? If yes, how?
As far as I understand, the SpreadsheetFeed has to be changed, but I couldn't find any example snippet for that.
I worked out the solution as follows:
First, get the fileId of that particular folder. Use setQ() to pass a query that checks for the folder MIME type and the folder name. The following snippet will be useful:
result = driveService.files().list()
        .setQ("mimeType='application/vnd.google-apps.folder' and title='" + folderName + "'")
        .setPageToken(pageToken)
        .execute();
Then, get the list of files in that particular folder. I found it in this tutorial. The snippet is as follows:
private static void printFilesInFolder(Drive service, String folderId) throws IOException {
    Children.List request = service.children().list(folderId);
    do {
        try {
            ChildList children = request.execute();
            for (ChildReference child : children.getItems()) {
                System.out.println("File Id: " + child.getId());
            }
            request.setPageToken(children.getNextPageToken());
        } catch (IOException e) {
            System.out.println("An error occurred: " + e);
            request.setPageToken(null);
        }
    } while (request.getPageToken() != null &&
            request.getPageToken().length() > 0);
}
Lastly, check for spreadsheets and get worksheet feeds for them. The following snippet might help.
URL WORKSHEET_FEED_URL = new URL("https://spreadsheets.google.com/feeds/worksheets/" + fileId + "/private/full");
WorksheetFeed feed = service.getFeed(WORKSHEET_FEED_URL, WorksheetFeed.class);
worksheets = feed.getEntries();
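To tell which of those children are actually spreadsheets, one option is to check each file's MIME type before building the worksheet feed URL. A sketch against the Drive API v2 used above (service is the Drive client, spreadsheetService is an authorized gdata SpreadsheetService):
// Fetch the child's metadata and keep it only if it is a Google spreadsheet.
com.google.api.services.drive.model.File f = service.files().get(child.getId()).execute();
if ("application/vnd.google-apps.spreadsheet".equals(f.getMimeType())) {
    URL worksheetFeedUrl = new URL(
            "https://spreadsheets.google.com/feeds/worksheets/" + f.getId() + "/private/full");
    WorksheetFeed worksheetFeed = spreadsheetService.getFeed(worksheetFeedUrl, WorksheetFeed.class);
    System.out.println(f.getTitle() + ": " + worksheetFeed.getEntries().size() + " worksheet(s)");
}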

Save file from a website with Java

I'm trying to build a jsoup-based Java app to automatically download English subtitles for films (I'm lazy, I know; it was inspired by a similar Python-based app). It's supposed to ask you the name of the film and then download an English subtitle for it from Subscene.
I can make it reach the download link, but I get an "Unhandled content type" error when I try to 'go' to that link. Here's my code:
public static void main(String[] args) {
    try {
        String videoName = JOptionPane.showInputDialog("Title: ");
        subscene(videoName);
    } catch (Exception e) {
        System.out.println(e.getMessage());
    }
}
public static void subscene(String videoName) {
    try {
        String siteName = "http://www.subscene.com";
        String[] splits = videoName.split("\\s+");
        String codeName = "";
        String text = "";
        if (splits.length > 1) {
            for (int i = 0; i < splits.length; i++) {
                codeName = codeName + splits[i] + "-";
            }
            videoName = codeName.substring(0, videoName.length()); // drop the trailing hyphen
        }
        System.out.println("videoName is " + videoName);
        // String url = "http://www.subscene.com/subtitles/" + videoName + "/english";
        String url = "http://www.subscene.com/subtitles/title?q=" + videoName + "&l=";
        System.out.println("url is " + url);
        Document doc = Jsoup.connect(url).get();
        Element exact = doc.select("h2.exact").first();
        Element yuel = exact.nextElementSibling();
        Elements lis = yuel.children();
        System.out.println(lis.first().children().text());
        String hRef = lis.select("div.title > a").attr("href");
        hRef = siteName + hRef + "/english";
        System.out.println("hRef is " + hRef);
        doc = Jsoup.connect(hRef).get();
        Element nonHI = doc.select("td.a40").first();
        Element papa = nonHI.parent();
        Element link = papa.select("a").first();
        text = link.text();
        System.out.println("Subtitle is " + text);
        hRef = link.attr("href");
        hRef = siteName + hRef;
        Document subDownloadPage = Jsoup.connect(hRef).get();
        hRef = siteName + subDownloadPage.select("a#downloadButton").attr("href");
        Jsoup.connect(hRef).get(); // <-- Here's where the problem lies
    } catch (java.io.IOException e) {
        System.out.println(e.getMessage());
    }
}
Can someone please help me so I don't have to download subs manually?
I just found out that using
java.awt.Desktop.getDesktop().browse(java.net.URI.create(hRef));
instead of
Jsoup.connect(hRef).get();
downloads the file after prompting me to save it. But I don't want to be prompted, because then I can't read the name of the downloaded zip file (I want to unzip it after saving, using Java).
Assuming that your files are small, you can do it like this. Note that you can tell Jsoup to ignore the content type.
// get the file content
Connection connection = Jsoup.connect(path);
connection.timeout(5000);
Connection.Response resultImageResponse = connection.ignoreContentType(true).execute();
// save to file
FileOutputStream out = new FileOutputStream(localFile);
out.write(resultImageResponse.bodyAsBytes());
out.close();
I would recommend verifying the content before saving, because some servers will just return an HTML page when the file cannot be found, i.e. a broken hyperlink.
...
String body = resultImageResponse.body();
if (body == null || body.toLowerCase().contains("<body>")) {
    throw new IllegalStateException("invalid file content");
}
...
Here:
Document subDownloadPage = Jsoup.connect(hRef).get();
hRef = siteName+subDownloadPage.select("a#downloadButton").attr("href");
//specifically here
Jsoup.connect(hRef).get();
Looks like jsoup expects that the result of Jsoup.connect(hRef) should be an HTML or some text that it's able to parse, that's why the message states:
Unhandled content type. Must be text/*, application/xml, or application/xhtml+xml
I followed the execution of your code manually and the last URL you're trying to access returns a content type of application/x-zip-compressed, thus the cause of the exception.
In order to download this file, you should use a different approach. You could use the old but still useful URLConnection / URL, or a third-party library like Apache HttpComponents, to fire a GET request, retrieve the result as an InputStream, and copy it to a file on disk.
Here's an example about doing this using URL:
URL url = new URL(hRef);
InputStream in = url.openStream();
OutputStream out = new BufferedOutputStream(new FileOutputStream("D:\\foo.zip"));
final int BUFFER_SIZE = 1024 * 4;
byte[] buffer = new byte[BUFFER_SIZE];
BufferedInputStream bis = new BufferedInputStream(in);
int length;
while ( (length = bis.read(buffer)) > 0 ) {
out.write(buffer, 0, length);
}
out.close();
in.close();
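Since the goal is to unzip the subtitle archive after saving it, here is a minimal sketch using java.util.zip from the standard library (it assumes a flat archive, which subtitle zips usually are; the paths are made up for the example):
import java.io.*;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

// Extract every entry of the downloaded zip into an output directory.
public static void unzip(File zipFile, File outDir) throws IOException {
    outDir.mkdirs();
    try (ZipInputStream zis = new ZipInputStream(new FileInputStream(zipFile))) {
        ZipEntry entry;
        byte[] buffer = new byte[4096];
        while ((entry = zis.getNextEntry()) != null) {
            System.out.println("Extracting " + entry.getName()); // e.g. the .srt file
            File outFile = new File(outDir, entry.getName());
            try (OutputStream out = new BufferedOutputStream(new FileOutputStream(outFile))) {
                int length;
                while ((length = zis.read(buffer)) > 0) {
                    out.write(buffer, 0, length);
                }
            }
        }
    }
}
Usage would be something like unzip(new File("D:\\foo.zip"), new File("subtitles")); this also gives you each entry's name, which solves the problem of not knowing the downloaded file's name.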

Apache POI: find characters in Word document without spaces

I want to read the number of characters without spaces in a Word document using Apache POI.
I can get the number of characters with spaces using the SummaryInformation.getCharCount() method, as in the following code:
public void countCharacters() throws FileNotFoundException, IOException {
    File wordFile = new File(BASE_PATH, "test.doc");
    POIFSFileSystem p = new POIFSFileSystem(new FileInputStream(wordFile));
    HWPFDocument doc = new HWPFDocument(p);
    SummaryInformation props = doc.getSummaryInformation();
    int numOfCharsWithSpaces = props.getCharCount();
    System.out.println(numOfCharsWithSpaces);
}
However, there seems to be no method that returns the number of characters without spaces.
How do I find this value?
If you want to base this on the metadata of the document, all you will get is estimates (according to the Microsoft specs). There are essentially two values you can play around with:
GKPIDSI_CHARCOUNT (which is what you already accessed in your own code sample)
GKPIDDSI_CCHWITHSPACES
Don't ask me about the exact differences between those two values, though. I haven't designed this stuff...
Below is a code sample to illustrate the access to them (GKPIDDSI_CCHWITHSPACES is a little awkward):
HWPFDocument document = [...];
SummaryInformation summaryInformation = document.getSummaryInformation();
System.out.println("GKPIDSI_CHARCOUNT: " + summaryInformation.getCharCount());
DocumentSummaryInformation documentSummaryInformation = document.getDocumentSummaryInformation();
Integer count = null;
for (Property property : documentSummaryInformation.getProperties()) {
    if (property.getID() == 0x11) {
        count = (Integer) property.getValue();
        break;
    }
}
System.out.println("GKPIDDSI_CCHWITHSPACES: " + count);
The moment at which Word's internal algorithm updates those values is rather unpredictable to me. So what you see in Word's own statistics may not necessarily match what the code above reports.
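If you need an exact figure rather than Word's cached estimate, an alternative is to skip the metadata entirely, extract the text, and count the non-whitespace characters yourself. A sketch (this counts whatever text POI extracts, which may differ slightly from Word's own statistics for fields, footnotes, and the like):
import org.apache.poi.hwpf.HWPFDocument;
import org.apache.poi.hwpf.extractor.WordExtractor;

// Count characters excluding whitespace by scanning the extracted text.
static int countCharsWithoutSpaces(HWPFDocument doc) {
    WordExtractor extractor = new WordExtractor(doc);
    String text = extractor.getText();
    int count = 0;
    for (int i = 0; i < text.length(); i++) {
        if (!Character.isWhitespace(text.charAt(i))) {
            count++;
        }
    }
    return count;
}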

Saving the first image from a URL

Here's my problem: I have a txt file called "sites.txt" in which I type random internet sites. My goal is to save the first image of each site. I tried to filter the server response by the img tag, and it actually works for some sites, but not for others.
On the sites where it works, the img src starts with http://; on the sites where it doesn't, the src starts with something else.
I also tried to prepend http:// to the img src values that didn't have it, but I still get the same error:
Exception in thread "main" java.net.MalformedURLException: no protocol:
at java.net.URL.<init>(Unknown Source)
My current code is:
public static void main(String[] args) throws IOException {
    try {
        File file = new File("sites.txt");
        Scanner scanner = new Scanner(file);
        String url;
        int counter = 0;
        while (scanner.hasNext()) {
            url = scanner.nextLine();
            URL page = new URL(url);
            URLConnection yc = page.openConnection();
            BufferedReader in = new BufferedReader(new InputStreamReader(yc.getInputStream()));
            String inputLine = in.readLine();
            while (!inputLine.toLowerCase().contains("img")) inputLine = in.readLine();
            in.close();
            String[] parts = inputLine.split(" ");
            int i = 0;
            while (!parts[i].contains("src")) i++;
            String destinationFile = "image" + (counter++) + ".jpg";
            saveImage(parts[i].substring(5, parts[i].length() - 1), destinationFile);
            String tmp = scanner.nextLine();
            System.out.println(url);
        }
        scanner.close();
    } catch (FileNotFoundException e) {
        System.out.println("File not found!");
        System.exit(0);
    }
}
public static void saveImage(String imageUrl, String destinationFile) throws IOException {
    URL url = new URL(imageUrl);
    String fileName = url.getFile();
    String destName = fileName.substring(fileName.lastIndexOf("/"));
    System.out.println(destName);
    InputStream is = url.openStream();
    OutputStream os = new FileOutputStream(destinationFile);
    byte[] b = new byte[2048];
    int length;
    while ((length = is.read(b)) != -1) {
        os.write(b, 0, length);
    }
    is.close();
    os.close();
}
I also got a tip to use the Apache Jakarta HTTP client libraries, but I have absolutely no idea how to use those. I would appreciate any help.
A URL (a type of URI) requires a scheme in order to be valid; in this case, http.
When you type www.google.com into your browser, the browser infers that you mean http:// and automatically prepends it for you. Java doesn't do this, hence your exception.
Make sure you always have http://. You can easily fix this using a regex:
String fixedUrl = stringUrl.replaceAll("^((?!http://).{7})", "http://$1");
or
if(!stringUrl.startsWith("http://"))
stringUrl = "http://" + stringUrl;
An alternative solution
Simply try ImageIO, which contains static convenience methods for locating ImageReaders and ImageWriters and for performing simple encoding and decoding.
Sample code:
// Read an image from the URL
// (I used the URL that is your profile pic on Stack Overflow)
BufferedImage image = ImageIO.read(new URL(
        "https://www.gravatar.com/avatar/3935223a285ab35a1b21f31248f1e721?s=32&d=identicon&r=PG&f=1"));
// Save the image
ImageIO.write(image, "jpg", new File("resources/avatar.jpg"));
When you're scraping the site's HTML for image elements and their src attributes, you'll run into several different representations of URLs.
Some examples are:
resource = https://google.com/images/srpr/logo9w.png
resource = google.com/images/srpr/logo9w.png
resource = //google.com/images/srpr/logo9w.png
resource = /images/srpr/logo9w.png
resource = images/srpr/logo9w.png
For the second through fifth ones, you'll need to build the rest of the URL.
The second one may be more difficult to differentiate from the fourth and fifth ones, but I'm sure there are workarounds. The URL Standard leads me to believe you won't see it as often, because I don't think it's technically valid.
The third case is pretty simple. If the resource variable starts with //, then you just need to prepend the protocol/scheme to it. You can do this with the site object you have:
url = site.getProtocol() + ":" + resource
For the fourth and fifth cases, you'll need to prepend the resource with the entire site's URL.
Here's a sample application that uses jsoup to parse the HTML, and a simple utility method to build the resource URL. You're interested in the buildResourceUrl method. Also, it doesn't handle the second case; I'll leave that to you.
import java.io.*;
import java.net.*;
import org.jsoup.*;
import org.jsoup.nodes.*;
import org.jsoup.select.*;

public class SiteScraper {

    public static void main(String[] args) throws IOException {
        URL site = new URL("https://google.com/");
        Document doc = Jsoup.connect(site.toString()).get();
        Elements images = doc.select("img");
        for (Element image : images) {
            String src = image.attr("src");
            System.out.println(buildResourceUrl(site, src));
        }
    }

    static URL buildResourceUrl(URL site, String resource)
            throws MalformedURLException {
        if (!resource.matches("^(http|https|ftp)://.*$")) {
            if (resource.startsWith("//")) {
                return new URL(site.getProtocol() + ":" + resource);
            } else {
                return new URL(site.getProtocol() + "://" + site.getHost() + "/"
                        + resource.replaceAll("^/", ""));
            }
        }
        return new URL(resource);
    }
}
This obviously won't cover everything, but it's a start. You may run into problems when the URL you're trying to access is in a subdirectory of the root of the site (i.e., http://some.place/under/the/rainbow.html). You may even encounter base64 encoded data URI's in the src attribute... It really depends on the individual case and how far you're willing to go.
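For the second case that the sample leaves unhandled, one rough heuristic (a sketch, and certainly not bulletproof) is to treat a scheme-less value whose first path segment contains a dot as a bare hostname:
// Heuristic for values like "google.com/images/srpr/logo9w.png":
// if the first segment looks like a host, prepend the page's scheme;
// otherwise fall back to resolving it as a path relative to the page.
static URL buildFromBareHost(URL site, String resource) throws MalformedURLException {
    String firstSegment = resource.split("/", 2)[0];
    if (firstSegment.contains(".")) {
        return new URL(site.getProtocol() + "://" + resource);
    }
    return new URL(site, resource); // relative to the page URL
}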
