Java program to read information from a website

I am writing a program in Java that will help track a "Fantasy College Basketball" league for my friends. I am struggling with finding the best implementation to automatically update the statistics for each player drafted.
As some background, every day individuals in the fantasy league earn points based on the statistics that the college basketball players they drafted earned that week. Right now, I do this manually:
1: Go to a player's ESPN profile
ESPN tracks individual player stats with a URL that is based on a random and unique player ID number. Frank Kaminsky's ID is 56759, so his ESPN profile is: http://espn.go.com/mens-college-basketball/player/_/id/56769/. We can assume that the user will input a player's ESPN ID when the player is drafted and we will have that information when updating stats.
2: Parse HTML page to get relevant stats
Looking at the URL above - the important information is in the "2014 - 2015 Game Log" section. I would want to obtain the most recent game's PTS, REB, AST, BLK, STL, PF, and TO to use elsewhere in my program.
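A minimal sketch of what these two steps boil down to, independent of how the HTML ends up being parsed: one helper builds the profile URL from a player ID using the pattern quoted above, and another picks the seven stats out of a single game-log row once its header and cell texts are available. The class and method names are hypothetical, and the column layout and numbers in main are invented for illustration, not taken from the ESPN page.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class GameLogRow {
    // Stat columns named in the question; their order on the real page is not assumed.
    static final String[] WANTED = {"PTS", "REB", "AST", "BLK", "STL", "PF", "TO"};

    /** Builds a profile URL from a player ID, following the pattern quoted in the question. */
    static String profileUrl(int playerId) {
        return "http://espn.go.com/mens-college-basketball/player/_/id/" + playerId + "/";
    }

    /** Picks the wanted stats from one row by matching each cell to its column header. */
    static Map<String, Integer> pickStats(String[] headers, String[] cells) {
        if (headers.length != cells.length) {
            throw new IllegalArgumentException("Header and cell counts differ");
        }
        Map<String, Integer> stats = new LinkedHashMap<>();
        for (String wanted : WANTED) {
            for (int i = 0; i < headers.length; i++) {
                if (headers[i].trim().equalsIgnoreCase(wanted)) {
                    stats.put(wanted, Integer.parseInt(cells[i].trim()));
                }
            }
        }
        return stats;
    }

    public static void main(String[] args) {
        // Hypothetical header row and most recent game row (invented values).
        String[] headers = {"DATE", "OPP", "PTS", "REB", "AST", "BLK", "STL", "PF", "TO"};
        String[] cells = {"Sat 1/3", "vs Example", "20", "8", "3", "1", "2", "2", "1"};
        System.out.println(profileUrl(56759));
        System.out.println(pickStats(headers, cells));
    }
}
```

Matching cells to headers by name rather than by fixed index keeps the sketch working if the page adds or reorders columns, as long as the header labels stay the same.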
What is the best approach to this?
My first reaction was to use a .openStream() on a URL, but this would require a lot of careful string parsing. The HTML really isn't pretty line by line...
I have heard of jsoup, but haven't used it ever before. If people here think that is the best way to proceed, I'd be happy to learn how to use it.

Use jsoup; it is easy to learn and made for the job.
The jsoup website has a nice tutorial on it.
Have a look here: http://jsoup.org/cookbook/input/load-document-from-url
Then parse your document with the methods explained here: http://jsoup.org/cookbook/extracting-data/selector-syntax

I would recommend http://www.seleniumhq.org/
This is an external library but it is really easy to use and learn. Normally it is used to test websites but it is really multi-purposed.
WebDriver driver = new ChromeDriver();
driver.get("http://yoursitehere.iamnotarealsite");
That code opens a Chrome browser and navigates to your site. To find elements you can do things like:
WebElement stats = driver.findElement(By.cssSelector("div#statsOrSomething"));
You can then call the standard getText() method on a WebElement:
stats.getText(); // Gets the player's stats
Also did I mention that it has many language bindings including Java? Also: I don't work for selenium or its parent company so this is not a shameless plug.

Related

Use google-api/mediawiki-api to retrieve information

I am currently working on a University project under the theme of "search-engine".
For this purpose we were given access to a database of scientific publications
(http://dblp.uni-trier.de)
It is a 2GB XML file which looks something like this:
<article key="GottlobSR96">
<author>Georg Gottlob</author>
<author>Michael Schrefl</author>
<author>Brigitte Röck</author>
<title>Extending Object-Oriented Systems with Roles.</title>
<pages>268-296</pages>
<year>1996</year>
<volume>14</volume>
<journal>TOIS</journal>
<number>3</number>
<url>db/journals/tois/tois14.html#GottlobSR96</url>
</article>
As you can see, the "article" tag contains various information such as the authors, the title of the paper, and the year of publication. My job now is to implement a Java solution that takes search terms of different categories (author, university, title) as input and provides the user with additional information.
For example, if you enter the name of a professor, it should return data such as their date of birth, the university they work at, their number of publications, etc.
I suppose this would work by using the Google API to find a person's entry on the university homepage and then somehow parsing the page to find the needed information. For universities there should be a Wikipedia page.
I already tried using the MediaWiki API but couldn't figure out how to get only the specific information I want (I could only get the intro paragraph).
I've never worked on a project of this scale, so I don't really have a clue how to integrate third-party APIs, libraries, etc. into my own code.
So I guess my question is:
How do I get specific information based on a Google search, whether through Wikipedia or otherwise?
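For the local half of the task, looking up a search term in the publication file itself, a streaming parser avoids loading the whole 2GB document into memory. The following is a minimal sketch using the JDK's StAX API; the class and method names are hypothetical, it assumes that author and title elements contain plain text only, and the full file may need additional parser configuration (for example for a DTD) that is not shown here.

```java
import java.io.Reader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

final class DblpArticleScanner {
    /** Returns the titles of all article records that list the given author. */
    static List<String> titlesByAuthor(Reader xml, String author) {
        List<String> titles = new ArrayList<>();
        try {
            XMLStreamReader reader = XMLInputFactory.newInstance().createXMLStreamReader(xml);
            boolean inArticle = false;
            boolean authorMatches = false;
            String title = null;
            while (reader.hasNext()) {
                int event = reader.next();
                if (event == XMLStreamConstants.START_ELEMENT) {
                    String name = reader.getLocalName();
                    if (name.equals("article")) {
                        // A new record starts: reset the per-record state.
                        inArticle = true;
                        authorMatches = false;
                        title = null;
                    } else if (inArticle && name.equals("author")) {
                        // getElementText() assumes the element holds text only.
                        if (reader.getElementText().trim().equals(author)) {
                            authorMatches = true;
                        }
                    } else if (inArticle && name.equals("title")) {
                        title = reader.getElementText().trim();
                    }
                } else if (event == XMLStreamConstants.END_ELEMENT
                        && reader.getLocalName().equals("article")) {
                    // The record is complete: keep its title if the author matched.
                    if (authorMatches && title != null) {
                        titles.add(title);
                    }
                    inArticle = false;
                }
            }
            reader.close();
        } catch (XMLStreamException e) {
            throw new IllegalStateException("Could not read the XML input", e);
        }
        return titles;
    }
}
```

Because only one record's state is held at a time, memory use stays flat regardless of file size. Counting the returned titles would give the "number of publications" mentioned above, while details such as a date of birth are not part of these records and would still have to come from another source.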

Implementing twitter and facebook like hashtags

This might look really silly, like a question with no research behind it, but trust me, it is not. I have done some research on it; one source is the following link:
http://www.quora.com/Twitter-1/How-does-Twitter-implement-hashtags
Also, I am not looking for a complete solution here. I will do the hard work myself; I just need some guidance on which approach to take.
I want to implement Twitter-style (and now Facebook-style) hashtags for my application, so that users can add messages with hashtags and others can search over them, for example to see what is trending and what is relevant.
We are using MySQL, MongoDB and Elasticsearch in our storage stack. Any ideas on how I could start implementing this? Would I need another storage system? One option is to store the hashtags in the database and then do a text search for them in Elasticsearch.
What can people with more experience in this field suggest here?

A start with MongoDB would be to parse each message for hashtags the user used and put these into a sub-array of the document. Example status update:
Peter
April 29th 2014 12:28:34
Hello friends, I visited the #tradeshow in #washington and drank a delicious #coffee
This message would look like this in MongoDB:
{
    author: "Peter",
    date: ISODate("2014-04-29 12:28:34"),
    text: "Hello friends, I visited the #tradeshow in #washington and drank a delicious #coffee",
    hashtags: [
        "tradeshow",
        "washington",
        "coffee"
    ]
}
When you then create an index on db.collection.hashtags you can quickly search for all messages which include one of these hashtags. You likely want to order and limit the results by date so the user sees the most recent results first. When you make it a compound index which also includes the date, you can also speed that up.
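The parsing step at the start of this answer can be sketched in plain Java as follows. This is only one possible approach: the class name is hypothetical, and the rule for what counts as a hashtag (a "#" followed by letters, digits or underscores, stored in lower case without duplicates) is an assumption rather than anything Twitter or Facebook document.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class HashtagExtractor {
    // '#' followed by one or more letters, digits or underscores.
    private static final Pattern HASHTAG = Pattern.compile("#([\\p{L}\\p{N}_]+)");

    /** Returns the distinct hashtags of a message, lower-cased, in order of first use. */
    static List<String> extract(String text) {
        Set<String> tags = new LinkedHashSet<>();
        Matcher matcher = HASHTAG.matcher(text);
        while (matcher.find()) {
            tags.add(matcher.group(1).toLowerCase(Locale.ROOT));
        }
        return new ArrayList<>(tags);
    }

    public static void main(String[] args) {
        String message = "Hello friends, I visited the #tradeshow in #washington "
                + "and drank a delicious #coffee";
        // The resulting list is what would be stored in the "hashtags" array.
        System.out.println(extract(message));
    }
}
```

Normalizing the tags to lower case before storing them means a search for "coffee" also finds messages tagged "#Coffee" without needing a case-insensitive index.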
How to implement "trending" topics is quite a complex question. It is also very subjective, depending on what you consider "trending". The exact algorithms Twitter and Facebook use to determine which topics are trending are not public. According to various social media analysts, they also change them frequently, so we can assume they are quite complex by now.
That means we cannot help you come up with an algorithm of your own. But once you have an algorithm in mind for calculating the "trendiness" of a hashtag, we could help you find a good implementation.

Java - Extracting plaintext from web-page source code (getting massive quantities of lyrics from website)

O community, I'm in the process of writing the pseudocode for an application that extracts song lyrics from a remote host (web-server, not my own) by reading the page's source code.
This is assuming that:
Lyrics are being displayed in plaintext
Portion of source code containing lyrics is readable by Java front-end application
I'm not looking for source code to answer the question, but what is the technical term used for querying a remote webpage for plaintext content?
If I can determine the webpage naming scheme, I could set the pointer of the URL object to the appropriate webpage, right? The only limitations would be irregular capitalization, and would only be effective if the plaintext was found in EXACTLY the same place.
Do you have any suggestions?
I was thinking something like this for "Buck 65", singing "I look good"
URL url = new URL("http://www.elyrics.net/read/b/buck-65-lyrics/i-look-good-lyrics.html");
I could substitute "buck-65-lyrics" & "i-look-good-lyrics" to reflect user input?
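A minimal sketch of that substitution, assuming the naming scheme really is what the single example URL suggests: the first character of the artist name, then the artist and song names in lower case with hyphens, each followed by "-lyrics". The class and method names are hypothetical, and the scheme itself is inferred from one URL only, so real input may not always map onto an existing page.

```java
import java.util.Locale;

final class LyricsUrlBuilder {
    /** Lower-cases the input and turns every run of other characters into one hyphen. */
    static String slug(String input) {
        return input.toLowerCase(Locale.ROOT)
                .replaceAll("[^a-z0-9]+", "-")
                .replaceAll("^-+|-+$", "");
    }

    /** Builds a lyrics page URL following the pattern of the example above. */
    static String lyricsUrl(String artist, String song) {
        String artistSlug = slug(artist);
        String songSlug = slug(song);
        if (artistSlug.isEmpty() || songSlug.isEmpty()) {
            throw new IllegalArgumentException("Artist and song must contain letters or digits");
        }
        return "http://www.elyrics.net/read/" + artistSlug.charAt(0) + "/"
                + artistSlug + "-lyrics/" + songSlug + "-lyrics.html";
    }

    public static void main(String[] args) {
        System.out.println(lyricsUrl("Buck 65", "I look good"));
    }
}
```

Normalizing case and punctuation in one place takes care of the irregular capitalization concern, but it cannot correct misspelled or abbreviated names.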
Input re-directed to PostgreSQL table
Current objective:
User will request name of {song, artist, album}, Java front-end will query remote webpage
Full source code (containing plaintext) will be extracted with Java front-end
Lyrics will be extracted from source code (somehow)
If song is not currently indexed by PostgreSQL server, will be added to table.
Operations will be made on the plaintext to suit the objectives of the program
I'm only looking for direction. If I'm headed completely in the wrong direction, please let me know. This is only for the pseudocode. I'm not looking for answers, or hand-outs, I need assistance in determining what I need to do. Are there external libraries for extracting plaintext that you know of? What technical names are there for what I'm trying to accomplish?
Thanks, Tyler

This approach is referred to as screen or data scraping. Note that employing it often breaks the target service's terms of service. Usually, this is not a robust approach, which is why API-like services with guarantees about how they operate are preferable.
Your approach sounds like it will work for the most part, but there are a few things to keep in mind.
If the web service you're interacting with requires a very precise URL scheme, you should not feed your user-provided data directly into it, since it is likely to be muddied by missing words, abbreviations, or misspellings. You might be better off doing some sort of search, first, and using that search's best result.
Reading HTML data is more complicated than you think. Use an existing library like jsoup to assist you.

The technical term for extracting content from a site is web scraping; you can google that. There are a lot of libraries for this; for Java there is jsoup. That said, it's easy to write your own regex.
The first thing I would do is use curl to fetch the content from the site, just for testing; this will give you a fair idea of what to do.

You will have to use an HTML parser. One of the most popular is jsoup.
Take care about the legal aspect of what you do ;)

Getting data/information for android app use

I have been wondering about this, which is why I have put off learning app development for so long. Let's say I was making a school timetable app where all the user has to do is enter the name of their course, and then the app shows the timetable for that course.
The question is: can I get the information from the college, or do I have to hard-code it into the database myself?
How does one obtain the information an app needs?
Thanks

It depends. Does the college provide you an interface you can use? Probably not one that was meant to be used by a third party app.
If not, then you have to get the information into your database somehow, either by parsing their online HTML schedules or by entering it by hand (obviously one of the last options to consider).

If the college had a website that you could view, you could scan the page for class listings and pull that data in - but more than likely that sort of data will need to be entered manually by you when you ship the app.

If the college has a website and the website provides an RSS feed for the timetable, you can parse that XML and show the parsed data, or you can save the timetable information for each course in the database and display it using cursors.
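If such a feed exists, reading it could look roughly like the sketch below, which uses the DOM parser from the standard javax.xml.parsers package. The class name is hypothetical, and so is the assumption that each timetable entry is an RSS item whose title holds the course and time; an actual college feed may be structured differently.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

final class TimetableFeedReader {
    /** Returns the title text of every item element in the given RSS document. */
    static List<String> itemTitles(String rssXml) {
        List<String> titles = new ArrayList<>();
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(rssXml)));
            NodeList items = doc.getElementsByTagName("item");
            for (int i = 0; i < items.getLength(); i++) {
                Element item = (Element) items.item(i);
                // Look for the title inside this item only, not the channel title.
                NodeList titleNodes = item.getElementsByTagName("title");
                if (titleNodes.getLength() > 0) {
                    titles.add(titleNodes.item(0).getTextContent().trim());
                }
            }
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalArgumentException("Could not parse the feed", e);
        }
        return titles;
    }
}
```

The returned strings could then be shown directly or written to the app's database so that the timetable is also available offline.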

Recognizing colors/patterns in webpage

I want to try to create a learning chess application as a school project. My first plan was to simply pit this AI against itself, but to really show whether it has been successful, it needs to be able to show how well it progresses. To do this, I want it to play rated games on sites such as chess.com. However, they do not (yet) have a public API, I believe.
Therefore, I wanted to make a program in Java that recognizes colors and images. It keeps an internal two-dimensional array of all the positions and recognizes the pieces on the board. I think I have found a way to do this in a window using something like the Java Robot class.
What I would like it to do, however, is to open this webpage in an internal window and keep doing this in the background. Is there a way to recognize colors within the application's own window, without it needing to be in the foreground?
Edit: I'm planning on using this browser component I just found. I noticed that it is possible to create a full-page snapshot of the page and save it as a BufferedImage(?). Would this make it easier to do this?
Edit 2: I just read that 'Outside assistance from other people, computers/chess engines, or endgame tablebases is entirely prohibited'. I suppose letting a computer do all the playing certainly falls under that. So I might try using another site, which means answers that are specific to chess.com won't cut it!

I don't know if it helps, but maybe you can have a look at the Sikuli project.
http://sikuli.org/
Sikuli is a program (and an API) to handle the interactions with the User Interface. For instance, you can write a script to click on an image or a button in certain conditions.
Especially interesting for you, there is a Java integration: http://sikuli.org/docx/faq/030-java-dev.html
Here is an extract from the website to give you an idea of the code you can write.
EDIT: in this code it is important to notice that you are defining new Patterns with the images. Sikuli will be able to find matching patterns.
import org.sikuli.script.*;

public class TestSikuli {
    public static void main(String[] args) {
        Screen s = new Screen();
        try {
            s.click("imgs/spotlight.png", 0);
            s.wait("imgs/spotlight-input.png");
            s.type(null, "hello world\n", 0);
        } catch (FindFailed e) {
            e.printStackTrace();
        }
    }
}

You should consider playing on a chess server where an API is available and chess engines are allowed. There is the Internet Chess Club (ICC), where you must pay for a human account and can then get a free computer account for your engine. There is also the Free Internet Chess Server (FICS), where you and your engine can get free accounts.
The ICC is usually preferred because the level of play is higher there, with lots of international masters and chess masters playing.
The best way to interface with these sites is to implement the xboard protocol. This will allow your engine to play through the WinBoard or XBoard interface (among others), and these interfaces can be used to connect to FICS or ICC and play there automatically.
I hope this helps, even if it does not directly answer the question.

I'm not sure what your input is, but you have two options:
You can work on a PNG image. Load the image into a BufferedImage object and examine it there. You can use a screenshot tool to create those.
It seems chess.com uses HTML with JavaScript. You can download the HTML using HttpComponents and examine it to see where the pieces are. This has the additional benefit that you don't have to guess which piece goes where, since the HTML contains the source information.
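For the first option, examining the image could start with something like the sketch below, which reads the color at the center of each square. It assumes the BufferedImage has already been cropped to exactly the 8x8 board; the class and method names are hypothetical, and turning the sampled colors into piece identities is left open.

```java
import java.awt.image.BufferedImage;

final class BoardSampler {
    /** Returns the RGB color (0xRRGGBB) at the center of each square of an 8x8 grid. */
    static int[][] sampleSquareCenters(BufferedImage board) {
        int squareWidth = board.getWidth() / 8;
        int squareHeight = board.getHeight() / 8;
        int[][] colors = new int[8][8];
        for (int row = 0; row < 8; row++) {
            for (int col = 0; col < 8; col++) {
                int x = col * squareWidth + squareWidth / 2;
                int y = row * squareHeight + squareHeight / 2;
                // getRGB returns ARGB; mask off the alpha byte.
                colors[row][col] = board.getRGB(x, y) & 0xFFFFFF;
            }
        }
        return colors;
    }
}
```

A screenshot saved as a PNG can be loaded with ImageIO.read(new File(...)) from javax.imageio and passed to this method. Comparing each sampled color against the known colors of empty light and dark squares is one simple way to tell occupied squares from empty ones.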

