How can I extract some data from a website and email it - Java

As an example, suppose I want my program to:
Visit Stack Overflow every day
Find the top question in some tag for that day
Format it and then send it to my email address
I don't know how to do it. I know PHP better, but I have some understanding of Java, J2EE, and Spring MVC, just not Java network programming.
Any guidelines on how I should proceed?

I'd start by looking at the Stack Exchange API.
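For example (a rough sketch, not tested against the current API version): the /questions endpoint on api.stackexchange.com can be filtered by tag and date, and the API gzip-compresses every response, so something along these lines should get you the raw JSON to format and mail:

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;
import java.net.URLConnection;
import java.util.zip.GZIPInputStream;

public class TopQuestionsFetcher {

    public static void main(String[] args) throws Exception {
        // Questions tagged "java" created in the last 24 hours, highest-voted first.
        long dayAgo = System.currentTimeMillis() / 1000 - 24 * 60 * 60;
        URL url = new URL("https://api.stackexchange.com/2.3/questions"
                + "?order=desc&sort=votes&tagged=java&site=stackoverflow&fromdate=" + dayAgo);
        URLConnection conn = url.openConnection();

        // The Stack Exchange API always returns gzip-compressed JSON.
        BufferedReader in = new BufferedReader(
                new InputStreamReader(new GZIPInputStream(conn.getInputStream()), "UTF-8"));
        StringBuilder json = new StringBuilder();
        String line;
        while ((line = in.readLine()) != null) {
            json.append(line);
        }
        in.close();

        System.out.println(json); // parse this JSON with the library of your choice
    }
}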

What you can do is extract the contents of the URL into a string buffer and then use jsoup (a library for parsing HTML) to pull out the content of your choice. I have a small sample that does exactly that: it reads all the contents of the URL into a string and then parses the content based on a CSS class (in this case question-hyperlink).
package com.tps.examples;

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;
import java.net.URLConnection;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.select.Elements;

public class URLGetcontent {

    public static void main(String args[]) {
        try {
            URL url = new URL("http://stackoverflow.com/questions");
            URLConnection conn = url.openConnection();

            // Read the whole response into one string.
            BufferedReader rd = new BufferedReader(new InputStreamReader(conn.getInputStream()));
            String line;
            StringBuffer str = new StringBuffer();
            while ((line = rd.readLine()) != null) {
                str.append(line);
            }
            rd.close();

            // Parse the HTML and select every element with the class "question-hyperlink".
            Document doc = Jsoup.parse(str.toString());
            Elements content = doc.getElementsByClass("question-hyperlink");
            for (int i = 0; i < content.size(); i++) {
                System.out.println("Question: " + (i + 1) + ". " + content.eq(i).text());
                System.out.println("");
            }
            System.out.println("*********************************");
        } catch (Exception e) {
            e.printStackTrace(); // don't swallow failures silently
        }
    }
}
Once the data is extracted you can use the JavaMail API to send the content by email.
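A minimal JavaMail sketch; the SMTP host and addresses below are placeholders you would replace with your own:

import java.util.Properties;
import javax.mail.Message;
import javax.mail.Session;
import javax.mail.Transport;
import javax.mail.internet.InternetAddress;
import javax.mail.internet.MimeMessage;

public class MailSender {

    public static void sendDigest(String body) throws Exception {
        Properties props = new Properties();
        props.put("mail.smtp.host", "smtp.example.com"); // placeholder SMTP host

        Session session = Session.getInstance(props);
        MimeMessage msg = new MimeMessage(session);
        msg.setFrom(new InternetAddress("digest@example.com"));              // placeholder sender
        msg.setRecipients(Message.RecipientType.TO,
                InternetAddress.parse("me@example.com"));                    // placeholder recipient
        msg.setSubject("Top questions today");
        msg.setText(body);
        Transport.send(msg);
    }
}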

As you want to retrieve data from a website (i.e. over HTTP), you probably want to look into using one of the many HTTP clients already written in Java or PHP.
Apache HTTP Client is a good Java client used by many people. If you're invoking a RESTful interface, Jersey has a nice client library.
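For example, a GET with Apache HttpClient 4.x looks roughly like this (the URL is just an example):

import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.util.EntityUtils;

public class HttpFetch {

    public static void main(String[] args) throws Exception {
        try (CloseableHttpClient client = HttpClients.createDefault()) {
            HttpGet get = new HttpGet("http://stackoverflow.com/questions");
            try (CloseableHttpResponse response = client.execute(get)) {
                // Read the response body as a string; pass it on to your parser.
                String html = EntityUtils.toString(response.getEntity());
                System.out.println(html);
            }
        }
    }
}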
On the PHP side of things, someone already mentioned file_get_contents, but you could also look into the cURL library.
As far as emailing goes, there's the JavaMail API (I'll admit I'm not familiar with it), or depending on your email server you might jump through other hoops (for example, Exchange can send email through its SOAP interfaces).

With file_get_contents() in PHP you can also fetch files via HTTP:
$stackoverflow = file_get_contents("http://stackoverflow.com");
Then you have to parse this. Many sites offer dedicated APIs that you can query for JSON or XML instead.
If you know shell scripting (that's the way I do it for many sites - works great with a cronjob :)), then you can use sed, wget, w3m, grep, and mail to do it...

Stack Overflow and the other Stack Exchange sites provide a simple API (see Stack Apps). Please check it out.

Related

Java: make a call to a website and find all photos

So I'm about to create a project which basically makes an API call, then takes the data, looks for photos, and displays them to the user as a slideshow.
I want to make an API call to National Geographic Photo Of The Day, and I have found the National Geographic Photo Of The Day Archive,
and I want to make a call to that website, save all the photos from that gallery somewhere, and then let the user decide whether they like the photos or not. How can I approach my goal? For now I have only tried to establish a connection with the linked gallery:
package javaapplication1;

import java.net.*;
import java.io.*;
import javax.imageio.ImageIO;

public class JavaApplication1 {

    public static void main(String[] args) throws Exception {
        URL natgeo = new URL("https://www.nationalgeographic.com/photography/photo-of-the-day/archive/");
        URLConnection yc = natgeo.openConnection();
        BufferedReader in = new BufferedReader(new InputStreamReader(yc.getInputStream()));
        String inputLine;
        while ((inputLine = in.readLine()) != null)
            System.out.println(inputLine);
        in.close();
    }
}
and read the output in the console, but I have no idea how to approach reading what came back as the answer. I don't know if a National Geographic API exists, so I don't know which approach would be better: finding an API and making a call for those photos, or parsing the page, looking for images, and saving them locally.
I appreciate all help!
What you're trying to do is called "web scraping". You don't just have to make a connection with the gallery, you also have to parse the HTML, pull out the URL of each image, and then download the image. I suggest you look into jsoup, a Java library built for this kind of thing. For image downloading and manipulation, the Java Image I/O library has a lot of great functionality.
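A rough sketch with jsoup and Image I/O; note that the img selector is a guess, so you will need to inspect the archive page and adapt it to the real markup:

import java.awt.image.BufferedImage;
import java.io.File;
import java.net.URL;
import javax.imageio.ImageIO;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class GalleryScraper {

    public static void main(String[] args) throws Exception {
        // Fetch and parse the gallery page.
        Document doc = Jsoup.connect(
                "https://www.nationalgeographic.com/photography/photo-of-the-day/archive/").get();

        int i = 0;
        // "img" is a placeholder selector: replace it with whatever the gallery actually uses.
        for (Element img : doc.select("img")) {
            String src = img.absUrl("src"); // resolve relative URLs against the page URL
            if (src.isEmpty()) {
                continue;
            }
            BufferedImage image = ImageIO.read(new URL(src));
            if (image != null) {
                ImageIO.write(image, "jpg", new File("photo-" + (i++) + ".jpg"));
            }
        }
    }
}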

Handling file downloads via REST API

I want to set up a REST API that supports file downloads from Java (the Java part is not needed at the moment -- I mention it here so you can make your answer more specific to my problem).
How would I do that?
For example, I have this file in a folder (./java.jar); how can I stream it in such a way that it is downloadable by a Java client?
I forgot to say that this is for some paid content.
My app should be able to do this:
Client: posts to the server with username and password.
REST: responds according to what the user has bought (so if they have bought that file, allow the download).
Client: downloads the file and puts it in folder x.
I thought of encoding the file in Base64 and then posting the encoded result inside the usual JSON (maybe with a nice name -- useful for the Java application -- and with the content inside, though I would not know how I should rebuild the file at that point). <- Is this plausible? Or is there an easier way?
Also, please do not downvote unnecessarily; although there is no code in the question, that doesn't mean I haven't researched it, it just means that I found nothing suitable for my situation.
Thanks.
What you need is regular file streaming, using a valid URL.
The code below is an excerpt from here:
import java.net.*;
import java.io.*;

public class URLReader {

    public static void main(String[] args) throws Exception {
        URL oracle = new URL("http://www.oracle.com/");
        BufferedReader in = new BufferedReader(
                new InputStreamReader(oracle.openStream()));
        String inputLine;
        while ((inputLine = in.readLine()) != null)
            System.out.println(inputLine);
        in.close();
    }
}
For your needs, based on your updated comments on the answer above, you could call your REST endpoint after the user logs in (with auth and whatever other headers/body you wish to receive) and proceed to the download.
Convert your jar/downloadable content to bytes. More on this:
Java Convert File to Byte Array and vice versa
Later, in case you don't want the regular streaming mentioned in the previous answers, you can put the byte content in the body as a Base64 string. You can encode your byte array to Base64 with something like the following (this is the Android Base64 helper; the flags are combined with a bitwise OR):
Base64.encodeToString(bytes, Base64.NO_WRAP | Base64.URL_SAFE);
Reference from here: How to send byte[] and strings to a restful webservice and retrieve this information in the web method implementation
Again, there are many ways to do this; this is one of the ways you can do it using REST.
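For instance, with the plain JDK (java.util.Base64, Java 8+) rather than the Android helper shown above, encoding the jar for a JSON body could look like this sketch:

import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Base64;

public class JarPayload {

    public static void main(String[] args) throws Exception {
        // Read the downloadable artifact and encode it for embedding in a JSON body.
        byte[] bytes = Files.readAllBytes(Paths.get("./java.jar"));
        String encoded = Base64.getEncoder().encodeToString(bytes);

        String json = "{\"name\":\"java.jar\",\"data\":\"" + encoded + "\"}";
        System.out.println(json.length());
        // The client reverses it with Base64.getDecoder().decode(...) and writes the bytes to disk.
    }
}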

Office Web Apps Word Editing

The idea is to build a proprietary Java back end document system using Office Web Apps.
We have created the WOPI client which allows us to view/edit PowerPoint and Excel web app documents but we can only view Word Documents.
In order to edit Word Web App documents you need to implement MS-FSSHTTP.
It appears there is no information about how to actually do this in code. Has anyone done this, or would anyone know how?
Recently my team and I implemented a WOPI host that supports viewing and editing of Word, PowerPoint, and Excel documents. You can take a look at https://github.com/marx-yu/WopiHost, which is a console project that listens on port 8080 and enables editing and viewing of Word documents through Microsoft Office Web Apps.
We have implemented this solution in a Web API and it works great. I hope this sample project will help you out.
As requested, I will try to add code samples to clarify the way to implement it based on my Web API implementation, but there is a lot of code to write to actually make it work properly.
First things first: to enable editing you will need to capture HTTP POSTs in a FilesController. Each POST that concerns the actual editing will have the header X-WOPI-Override equal to COBALT. In these POSTs you will find that the InputStream is an Atom type. Based on the MS-WOPI documentation, your response will need to include the X-WOPI-CorrelationID and request-id headers.
Here is the code of my Web API POST method (it is not complete since I'm still implementing the WOPI protocol):
string wopiOverride = Request.Headers.GetValues("X-WOPI-Override").First();
if (wopiOverride.Equals("COBALT"))
{
    string filename = name;
    EditSession editSession = CobaltSessionManager.Instance.GetSession(filename);
    var filePath = HostingEnvironment.MapPath("~/App_Data/");
    if (editSession == null)
    {
        var fileExt = filename.Substring(filename.LastIndexOf('.') + 1);
        if (fileExt.ToLower().Equals(@"xlsx"))
            editSession = new FileSession(filename, filePath + "/" + filename, @"yonggui.yu", @"yuyg", @"yonggui.yu@emacle.com", false);
        else
            editSession = new CobaltSession(filename, filePath + "/" + filename, @"patrick.racicot", @"Patrick Racicot", @"patrick.racicot@hospitalis.com", false);
        CobaltSessionManager.Instance.AddSession(editSession);
    }

    // cobalt, for docx and pptx
    var ms = new MemoryStream();
    HttpContext.Current.Request.InputStream.CopyTo(ms);
    AtomFromByteArray atomRequest = new AtomFromByteArray(ms.ToArray());
    RequestBatch requestBatch = new RequestBatch();
    Object ctx;
    ProtocolVersion protocolVersion;

    requestBatch.DeserializeInputFromProtocol(atomRequest, out ctx, out protocolVersion);
    editSession.ExecuteRequestBatch(requestBatch);

    foreach (Request request in requestBatch.Requests)
    {
        if (request.GetType() == typeof(PutChangesRequest) && request.PartitionId == FilePartitionId.Content)
        {
            //upload file to hdfs
            editSession.Save();
        }
    }

    var responseContent = requestBatch.SerializeOutputToProtocol(protocolVersion);
    var host = Request.Headers.GetValues("Host");
    var correlationID = Request.Headers.GetValues("X-WOPI-CorrelationID").First();
    response.Headers.Add("X-WOPI-CorrelationID", correlationID);
    response.Headers.Add("request-id", correlationID);

    MemoryStream memoryStream = new MemoryStream();
    var streamContent = new PushStreamContent((outputStream, httpContext, transportContent) =>
    {
        responseContent.CopyTo(outputStream);
        outputStream.Close();
    });
    response.Content = streamContent;
    response.Content.Headers.ContentType = new MediaTypeHeaderValue("application/octet-stream");
    response.Content.Headers.ContentLength = responseContent.Length;
}
As you can see, in this method I make use of CobaltSessionManager and CobaltSession, which are used to create and manage editing sessions over the Cobalt protocol. You will also need what I call a CobaltHostLockingStore, which is used to handle the different requests when communicating with the Office Web Apps server during edit initialization.
I won't be posting the code for these three classes since they are already coded in the sample GitHub project I posted, and they are fairly simple to understand even though they are big.
If you have more questions or if it's not clear enough, don't hesitate to comment and I will update my post accordingly.
Patrick Racicot provided a great answer, but I had a problem saving docx files (an exception in CobaltCore.dll), and I even started using the dotPeek decompiler to try to figure it out.
But after I locked the editSession variable in my Web API method, everything started working like magic. It seems that OWA sends requests that should be handled as a chain, not in parallel as a controller method usually runs.

How to work with HTML code read in Java?

I know how to read the HTML code of a website. For example, the following Java code reads all the HTML from http://www.transfermarkt.co.uk/en/fc-barcelona/startseite/verein_131.html, a page that lists all the football players of F.C. Barcelona.
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.URL;

public class ReadWebPage {

    public static void main(String[] args) throws IOException {
        String urltext = "http://www.transfermarkt.co.uk/en/fc-barcelona/startseite/verein_131.html";
        URL url = new URL(urltext);
        BufferedReader in = new BufferedReader(new InputStreamReader(url.openStream()));
        String inputLine;
        while ((inputLine = in.readLine()) != null) {
            // Process each line.
            System.out.println(inputLine);
        }
        in.close();
    }
}
OK, but now I need to work with the HTML code. I need to obtain the names ("Valdés, Victor", "Pinto, José Manuel", etc.) and the positions (Goalkeeper, Defence, Midfield, Striker) of each of the players of the team. For example, I need to create an ArrayList<String> PlayerNames and an ArrayList<String> PlayerPositions and put all the names and positions of all the players into these lists.
How can I do it? I can't find a code example that does this on Google. Code examples are welcome.
Thanks.
I would recommend using HtmlUnit, which gives you access to the DOM tree of the HTML page and can even execute JavaScript in case the data is put into the page dynamically via AJAX.
You could also use jsoup: no JavaScript, but more lightweight, and with support for CSS selectors.
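As a rough jsoup sketch (the CSS selectors below are hypothetical; you have to inspect the page and use the classes the site actually uses):

import java.util.ArrayList;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class SquadScraper {

    public static void main(String[] args) throws Exception {
        Document doc = Jsoup.connect(
                "http://www.transfermarkt.co.uk/en/fc-barcelona/startseite/verein_131.html").get();

        ArrayList<String> playerNames = new ArrayList<String>();
        ArrayList<String> playerPositions = new ArrayList<String>();

        // "td.name a" and "td.position" are placeholder selectors: replace them
        // with whatever markup the squad table really uses.
        for (Element e : doc.select("td.name a")) {
            playerNames.add(e.text());
        }
        for (Element e : doc.select("td.position")) {
            playerPositions.add(e.text());
        }

        System.out.println(playerNames);
        System.out.println(playerPositions);
    }
}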
I think the best approach is first to purify the HTML into valid XHTML and then apply an XSL transformation; for retrieving parts of the information you can use XPath expressions. The best available HTML tag balancer is, in my opinion, NekoHTML (http://nekohtml.sourceforge.net/).
You might like to take a look at htmlparser
I used this for something similar.
Usage is something like this:
Parser fullWebpage = new Parser("WEBADDRESS");
NodeList nl = fullWebpage.extractAllNodesThatMatch(new TagNameFilter("<insert html tag>"));
NodeList tds = nl.extractAllNodesThatMatch(new TagNameFilter("a"), true);
String data = tds.toHtml();
Java has its own built-in HTML parser. A positive feature of this parser is that it is error tolerant and will assume some tags even if they are missing or misspelled. Although it lives in javax.swing.text.html.parser, it actually has nothing to do with Swing (and with text only as much as HTML is text). Use ParserDelegator. You need to write a callback for use with this parser; otherwise it is not complex to use. A code example (written as a ParserDelegator test) can be found here. Some say it is a leftover of the HotJava browser. The only problem with it is that it does not seem to have been upgraded to the most recent versions of HTML.
A simple code example would be:
Reader reader; // read HTML from somewhere
HTMLEditorKit.ParserCallback callback = new MyCallBack(); // MyCallBack extends HTMLEditorKit.ParserCallback
ParserDelegator delegator = new ParserDelegator();
delegator.parse(reader, callback, false);
I've found a link that is just what you were looking for:
http://tiny-url.org/work_with_html_java

Java URL library for grabbing lines on a website

I want to be able to grab N lines (HTML text content that starts on new lines) from a specific URL, e.g. www.sitename.com, and store them as strings in an array.
Something like:
public void grabLines(){
//create instance of class from imported library
//pass sitename into it
//from the instance, call a method for grabbing the lines on the site and pass in "N" as a parameter
//the method returns an array/list of N Strings that I can access later
}
Is there a native Java library I can import to do this? Does it allow me to do what I want easily?
Thanks
Are you trying to make a screen scraper? You will be pulling raw HTML, as opposed to just what you see, and if the website is dynamic you won't be able to pull everything that you can see. If you just want the HTML, you can try something like the following. I tried to build a Bloomberg screen scraper and then parse out the random HTML tags.
try {
    URL bbg = new URL("http://www.bloomberg.com/markets/economic-calendar/");
    BufferedReader r = new BufferedReader(new InputStreamReader(bbg.openStream()));
    String temp; // holds one line of the response at a time
    while ((temp = r.readLine()) != null) {
        System.out.println(temp);
    }
    r.close();
} catch (Exception e) {
    e.printStackTrace();
}
Apache HttpClient is an abstraction over the URL/Reader technique above, but similar: Apache HTTP Client
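If you just want the first N lines in a list, here is a small sketch along the lines of the question's grabLines idea (the names are made up):

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;
import java.util.ArrayList;
import java.util.List;

public class LineGrabber {

    // Returns up to n lines of the raw HTML served at the given address.
    public static List<String> grabLines(String address, int n) throws Exception {
        List<String> lines = new ArrayList<String>();
        BufferedReader r = new BufferedReader(
                new InputStreamReader(new URL(address).openStream()));
        String line;
        while (lines.size() < n && (line = r.readLine()) != null) {
            lines.add(line);
        }
        r.close();
        return lines;
    }
}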
