How to find and extract "main" image in website - java

I need help tackling a problem. I need a program which, given a site, finds and extracts the "main" picture, i.e. the one which represents the site. (To say it is the biggest or the first picture is sometimes but not always true).
How should I approach this? Are there any libraries that could help me with this?
Thanks!

OPTION 1
You could check out Goose. It does something similar to what Pocket and Readability do, i.e. it tries to extract the main article from a given webpage using a set of heuristics. It can apparently also extract the main image from that article, but it is a bit hit and miss, so 60% of the time it works every time.
It used to be a Java project but was rewritten in Scala.
From the readme:
Goose will try to extract the following information:
Main text of an article
Main image of article
Any Youtube/Vimeo movies embedded in article
Meta Description
Meta tags
Publish Date
Try it here: http://jimplush.com/blog/goose
OPTION 2
You could use a Java wrapper (e.g. GhostDriver) to run a headless browser such as PhantomJS. Then fetch the website and find the img element with the largest dimensions. This GhostDriver test case shows how to query the DOM for elements and get their rendered size.
OPTION 3
Use a library like jsoup that helps you parse HTML. Then get the value of the src attribute of every img tag. Request each image URL you find and measure the image's dimensions. The one with the biggest dimensions is likely to be the website's main image.
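A rough sketch of the "largest image wins" heuristic from Option 3, using only the JDK. It assumes the src values were already collected (for example with jsoup) and that each image was decoded with ImageIO.read, which returns null for data it cannot decode. The class and method names are made up for illustration.

```java
import java.awt.image.BufferedImage;
import java.util.Map;
import java.util.Optional;

class LargestImagePicker {

    /** Returns the URL whose decoded image covers the largest pixel area, if there is one. */
    static Optional<String> pickLargest(Map<String, BufferedImage> imagesByUrl) {
        String bestUrl = null;
        long bestArea = -1;
        for (Map.Entry<String, BufferedImage> entry : imagesByUrl.entrySet()) {
            BufferedImage image = entry.getValue();
            if (image == null) {
                continue; // skip entries that could not be decoded
            }
            long area = (long) image.getWidth() * image.getHeight();
            if (area > bestArea) {
                bestArea = area;
                bestUrl = entry.getKey();
            }
        }
        return Optional.ofNullable(bestUrl);
    }
}
```

Pixel area is only one possible measure. Tracking pixels, sprites and wide banners can fool it, so filtering by a minimum size or aspect ratio may be worth adding.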

Another solution would be to extract the meta tags for social media sharing first. If they are present, you are in luck; otherwise you can still try the other solutions.
<meta property="og:image" content="http://www.example.com/image.jpg"/>
<meta name="twitter:image" content="http://www.example.com/image.jpg">
<meta itemprop="image" content="http://www.example.com/image.jpg">
If you are using jsoup, the code would look like this:
// Each lookup yields null when the page has no matching meta tag.
String imageUrlOpenGraph = document.select("meta[property=og:image]").stream()
        .findFirst()
        .map(meta -> meta.attr("content").trim())
        .orElse(null);
String imageUrlTwitter = document.select("meta[name=twitter:image]").stream()
        .findFirst()
        .map(meta -> meta.attr("content").trim())
        .orElse(null);
String imageUrlGooglePlus = document.select("meta[itemprop=image]").stream()
        .findFirst()
        .map(meta -> meta.attr("content").trim())
        .orElse(null);
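The three lookups each produce either a URL, an empty string (jsoup returns "" for a missing attribute) or null, so some glue is needed to pick one. A minimal sketch of that step, using only the JDK; the helper name firstUsable is hypothetical:

```java
import java.util.Arrays;
import java.util.Objects;
import java.util.Optional;

class MetaImageFallback {

    /** Returns the first candidate that is neither null nor blank, in the order given. */
    static Optional<String> firstUsable(String... candidates) {
        return Arrays.stream(candidates)
                .filter(Objects::nonNull)
                .map(String::trim)
                .filter(value -> !value.isEmpty())
                .findFirst();
    }
}
```

It could be called as MetaImageFallback.firstUsable(imageUrlOpenGraph, imageUrlTwitter, imageUrlGooglePlus). An empty result is the signal to fall back to one of the other approaches.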

You could use a service like embedly. Among a lot of other information they allow you to extract the main image of any page. Works particularly well for articles. You can try it here.

You need artificial intelligence to do so, namely computer vision.
It is too big to fit in an answer. This link might help.
If you are a mathematician with experience in probability and Bayes' rule, then you can just take the unit called Image Processing and Computer Vision.
If you are looking for existing software to use, check this out...
This stackoverflow thread might help...
There's this software called moodstocks which might help.

ImageResolver can do that for you without the need of server side interaction, except for a small proxy script.

Related

Scraping issue (data-reactid)

I'm trying to scrape a website and compile a spreadsheet based on what data I pull.
The website I am trying to scrape is WEARVR.
I am not too experienced with scraping, but my approach would be to find unique attributes within html tags and use this to scrape what I want.
So for this website my approach would be, firstly, to scrape a list of the URLs of the pages you are taken to upon clicking on one of the experiences, for example: https://www.wearvr.com/#game_id=game_1041, and then, secondly, to cycle through this list, scraping the relevant attributes each time.
However I am stuck at the first step as instead of working with simple "a href" tags, I come across "data-reactid" tags which confuse the matter.
I do my scraping with iMacros but I'm pretty decent at Java now so would learn scraping in Java if need be (which seems likely as iMacros is pretty limited).
My question is, how do these "data-reactid" tags work, and as such how can I utilise them for my scraping purposes?
Additionally if this is an XY problem, please let me know and suggest a better approach.
Thanks for reading!
The simplest way to approach scraping is to treat the page like a big string (because ultimately, that is what it is). You can search within that string for certain things (like href=) to grab links. You can also intelligently assume that whatever is in the a tags is relevant to the link and grab that.
You really don't have to understand HTML, and you don't have to understand how the page or any additional CSS or markup works; you just need to identify which recognizable string combinations surround the text you want. I will say this is probably much easier to implement in Java than with iMacros, and probably more accurate.
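As a minimal sketch of this string-search approach, the following uses a regular expression to pull every quoted href value out of a page that has already been downloaded into a String. The class and method names are invented for the example, and unquoted attribute values are deliberately ignored.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class HrefScanner {

    // Matches href="..." or href='...' and captures the quoted value in group 2.
    private static final Pattern HREF =
            Pattern.compile("href\\s*=\\s*([\"'])(.*?)\\1", Pattern.CASE_INSENSITIVE);

    /** Returns all quoted href values in the order they appear in the page. */
    static List<String> extractLinks(String html) {
        List<String> links = new ArrayList<>();
        Matcher matcher = HREF.matcher(html);
        while (matcher.find()) {
            links.add(matcher.group(2));
        }
        return links;
    }
}
```

If the page builds its links with JavaScript after loading, the downloaded string may not contain them at all, and no amount of string searching will find them.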
The other way you can handle it, which requires a little more knowledge of HTML and XML, is to treat the entire page as an XML document. This doesn't always work with HTML, particularly if it is older or badly formed, so the string approach is easier. You get some utility out of the various XML mapping libraries that exist, but otherwise it's similar to the above.
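For comparison, here is a sketch of the XML route using the JDK's built-in parser and XPath. It only works when the markup is well-formed XML; anything else is rejected with an exception, which is exactly the weakness mentioned above. The class name is hypothetical.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class XmlLinkReader {

    /** Parses the markup as XML and returns the href of every a element. */
    static List<String> extractLinks(String wellFormedMarkup) {
        try {
            Document document = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(wellFormedMarkup)));
            NodeList hrefs = (NodeList) XPathFactory.newInstance()
                    .newXPath()
                    .evaluate("//a/@href", document, XPathConstants.NODESET);
            List<String> links = new ArrayList<>();
            for (int i = 0; i < hrefs.getLength(); i++) {
                links.add(hrefs.item(i).getNodeValue());
            }
            return links;
        } catch (Exception e) {
            // Typical for real-world HTML: unclosed tags make the XML parser give up.
            throw new IllegalArgumentException("Markup is not well-formed XML", e);
        }
    }
}
```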

Create a pdf with text at given coordinates (PDFBox?)

My Situation:
I'm programming in Java
Using a library written by someone at my university, I'm able to read PDFs and create an XML document out of them
This XML document contains additional information, e.g. the coordinates of the text in the original document
My Problem
I would like to recreate the PDF I read, with the content placed at its original coordinates (again: I have the coordinates)
My Question:
-> Do you know a way to create a pdf and set the text of the pdf at given coordinates? <-
I've been doing a lot of research on this lately, but maybe I tried the wrong Google search terms, since I can't find many helpful results. So I thought I might ask here, in the forum where I have found the most help so far in my young "programmer's life" :)
Most of the results I get, even here, are about people trying to get the coordinates, but I already have them.
I heard during a discussion that PDFBox might be able to do this, but I'm also happy to work with any other framework or library that can solve my problem.
Thanks for any help and thoughts you share with me.
Thanks a lot for your comments. In the end I decided on iText, which allowed me to do all my tasks (placing text at absolute coordinates, giving it a background color based on certain criteria) in a quite easy and efficient way.
If someone here is searching for inspiration and has a similar task, check my related post here on Stack Overflow for some code snippets: How can I add a background color to my (pdf-) text using iText to create it with Java

Rapid Miner 101

I'm back with a question. I'm playing with RapidMiner for automatic text classification and can't get it to work. I'm getting an error that says, "no example set in the example, offending operator Performance". Any idea what that is referring to?
In RapidMiner you have to use the converter components before data can be used as an example set. So, if you have an output of type 'doc', for example, you have to use the 'Documents to Data' component in order to link it to the next 'exa' input. That's all!
Could you provide more details about your RapidMiner text mining process?
Without more context, your question is difficult to answer.
For more help with RapidMiner, you may want to check out the RapidMiner user forum: http://forum.rapid-i.com/
At RapidMiner Resources, you can find RapidMiner tutorial videos about how to do text mining with RapidMiner:
http://rapidminerresources.com/index.php?page=text-mining-3
Rapid-I also offers a 90-minute text mining webinar. You can find it on the Rapid-I web page under "services" and "training" or in the web shop.
I hope these links help you to get started with automatic text classification with RapidMiner. If you provide more details about your RapidMiner text mining process, I may also be able to directly answer your question.
If it says that there is no Example Set, then the issue is probably with your original data. Can you post an image of your process?
For instance, make sure that you have connected the initial input to your operator - what two operators does the error occur at?
One thought: the example set in text mining is usually your document collection, but if you are really using documents (PDF, Word) then your format will be Documents (Doc), and you may need to transform your documents to data (Documents to Data operator). Then you should have an Example Set that you can feed into your Performance operator.
Hope this helps - as the earlier comment said, without knowing the process, it is hard to tell exactly where the error is.

How to webscrape scholar.google.com in Java?

I want to write a Java func grabTopResults(String f) such that grabTopResults("automata theory") returns me a list of the top 100 cited papers on scholar.google.com for "automata theory".
Does anyone have suggestions for what libraries will make my life easy?
Thanks!
As I'm sure Google can afford the bandwidth, I'll ignore the question of whether this is immoral/illegal/prohibited by Google's T&C
First thing you need to do is figure out what HTTP request (or requests) you need to issue in order to obtain the page with the data you need. Once you've figured this out, use HttpClient to issue the same request from Java code. The previous link shows example code that explains how to do this.
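A sketch of this request step using the JDK's own java.net.http.HttpClient (Java 11 and later) rather than the HttpClient library linked above. The endpoint and the q parameter are assumptions that should be checked against what the browser actually sends, and the class and method names are invented for the example.

```java
import java.io.IOException;
import java.net.URI;
import java.net.URLEncoder;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;

class ScholarRequest {

    // Assumed search endpoint; confirm it in the browser's network tab first.
    private static final String SEARCH_URL = "https://scholar.google.com/scholar";

    /** Builds a GET request for the given search terms without sending it. */
    static HttpRequest buildSearchRequest(String query) {
        String encoded = URLEncoder.encode(query, StandardCharsets.UTF_8);
        return HttpRequest.newBuilder(URI.create(SEARCH_URL + "?q=" + encoded))
                .header("Accept", "text/html")
                .GET()
                .build();
    }

    /** Sends the request and returns the response body as a string. */
    static String fetch(HttpRequest request) throws IOException, InterruptedException {
        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());
        return response.body();
    }
}
```

Keeping request construction separate from sending makes it possible to inspect the URL before any traffic is generated. The returned HTML would then be handed to the parser.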
Once you've downloaded the content of the relevant page, you'll need to use an HTML parser to extract the data you're interested in. The Jericho parser suggested by peperg is a good choice.
If the Google police come knocking, you've never heard of me, OK?
I use http://jericho.htmlparser.net/docs/index.html . Google Scholar doesn't have an API ( http://code.google.com/p/google-ajax-apis/issues/detail?id=109 ). Of course, scraping is not allowed by Google (read the terms of use: automatic requests are forbidden).
Below is a bit of example code which gets the titles on the first page using the open source product TestPlan. It is a standalone product, but if you really need it I could help you integrate it into your Java code (it is written in Java itself).
GotoURL http://scholar.google.com/
SubmitForm with
%Params:q% automate theory
end
set %Items% as response //div[@class='gs_r']
foreach %Item% in %Items%
set %Title% as selectIn %Item% h3
Notice %Title%
end
This produces output like the below (my IP is in Germany, thus a German response). Obviously you could format it however you like, or write it to a file; this is just a rough test.
00000000-00 GOTOURL http://scholar.google.com/
00000001-00 SUBMITFORM default
00000002-00 NOTICE [ZITATION] Stochastic complexity in statistical inquiry theory
00000003-00 NOTICE AUTOMATED THEORY FORMATION IN MATHEMATICS1
00000004-00 NOTICE Constraint generation via automated theory formation
00000005-00 NOTICE [BUCH] Automated theorem proving: after 25 years
00000006-00 NOTICE [BUCH] Introduction to the Theory of Computation
00000007-00 NOTICE [ZITATION] Computer-controlled systems: theory and design
00000008-00 NOTICE [BUCH] … , randomness & incompleteness: papers on algorithmic information theory
00000009-00 NOTICE [BUCH] Automatic control systems
00000010-00 NOTICE [BUCH] VLSI physical design automation: theory and practice
00000011-00 NOTICE Singular Control Systems.

Generate Images for formulas in Java

I'd like to generate an image file showing some mathematical expression, taking a String like "(x+a)^n=∑_(k=0)^n" as input and getting a more (human) readable image file as output. AFAIK stuff like that is used in Wikipedia for example. Are there maybe any Java libraries that do that?
Or maybe I use the wrong approach. What would you do if the requirement was to enable pasting of formulas from MS Word into an HTML-document? I'd ask the user to just make a screenshot himself, but that would be the lazy way^^
Edit: Thanks for the answers so far, but I really do not control the input. What I get is some messy Word-style formula, not a clean LaTeX-formatted one.
Edit2: http://www.panschk.de/text.tex
Looks a bit like LaTeX doesn't it? That's what I get when I do
clipboard.getContents(RTFTransfer.getInstance()) after having pasted a formula from Word07.
First and foremost you should familiarize yourself with TeX (and LaTeX) - a famous typesetting system created by Donald Knuth. Typesetting mathematical formulae is an advanced topic with many opinions and much attention to detail - therefore use something that builds upon TeX. That way you are sure to get it right ;-)
Edit: Take a look at texvc
It can output to PNG, HTML, MathML. Check out the README
Edit #2: Convert that messy Word-stuff to TeX or MathML?
My colleague found a surprisingly simple solution for this very specific problem: When you copy formulas from Word 2007, they are also stored as "HTML" in the clipboard. As representing formulas in HTML isn't easy either, Word just creates a temporary image file on the fly and embeds it into the HTML code. You can then simply take the temporary formula image and copy it somewhere else. Problem solved ;)
What you're looking for is LaTeX.
MiKTeX is a nice little application for churning out images using LaTeX.
I'd like to look into creating them on-the-fly though...
Steer clear of LaTeX. Seriously.
Check out JEuclid. It can convert MathML expressions into images.
