How does Facebook implement Attach Link - Java

Facebook can (almost) always extract the most important text content and images from a page. I don't think a single generic parsing rule can do this.
How does Facebook implement it?
Has it prepared parsing rules for popular sites?
Or is there a smarter way to find the real content in the HTML?

Meta tags. Many websites even optimize for Facebook by using Open Graph (og) <meta> tags. Even sites that don't use Open Graph often have <meta> tags with useful information such as a summary, title, and image.
https://developers.facebook.com/docs/opengraph/keyconcepts/
So to answer your question - they don't. The websites do it for them.
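As an illustration (my own sketch, not Facebook's actual code), reading those tags can be as simple as scanning the fetched HTML for og: properties. This assumes double-quoted <meta> tags with the property attribute before content; a real implementation should use an HTML parser such as Jsoup rather than a regex:

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class OpenGraphReader {
    // Matches <meta property="og:NAME" content="VALUE"> (attribute order assumed).
    private static final Pattern OG = Pattern.compile(
            "<meta\\s+property=\"og:([^\"]+)\"\\s+content=\"([^\"]*)\"",
            Pattern.CASE_INSENSITIVE);

    // Returns og property -> content, keeping the first occurrence of each.
    public static Map<String, String> read(String html) {
        Map<String, String> props = new LinkedHashMap<>();
        Matcher m = OG.matcher(html);
        while (m.find()) {
            props.putIfAbsent(m.group(1), m.group(2));
        }
        return props;
    }
}
```

With no og: tags present the map comes back empty, and you would fall back to ordinary <meta name="description"> tags, the <title>, and prominent <img> elements.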

Related

Scraping issue (data-reactid)

I'm trying to scrape a website and compile a spreadsheet based on what data I pull.
The website I am trying to scrape is WEARVR.
I am not too experienced with scraping, but my approach would be to find unique attributes within HTML tags and use those to scrape what I want.
So for this website, my approach would be firstly to scrape a list of the URLs of the pages you are taken to upon clicking one of the experiences, for example https://www.wearvr.com/#game_id=game_1041, and then secondly to cycle through this list, scraping the relevant attributes each time.
However, I am stuck at the first step: instead of working with simple "a href" tags, I come across "data-reactid" tags, which confuse the matter.
I do my scraping with iMacros, but I'm pretty decent at Java now, so I would learn scraping in Java if need be (which seems likely, as iMacros is pretty limited).
My question is, how do these "data-reactid" tags work, and as such how can I utilise them for my scraping purposes?
Additionally if this is an XY problem, please let me know and suggest a better approach.
Thanks for reading!
The simplest way to approach scraping is to treat the page like a big string (because ultimately, that is what it is). You can search within that string for certain things (like href=) to grab links. You can also intelligently assume that whatever is in the a tags is relevant to the link and grab that.
You really don't have to understand HTML, and you don't have to understand how the page or any additional CSS or markup works; you just need to identify what sort of identifiable string combinations surround the text you want. I will say this is probably much easier to implement in Java than with iMacros, and probably more accurate.
The other way you can handle it, which requires a little more knowledge of HTML and XML, is to treat the entire page as an XML document. This... doesn't always work with HTML, particularly if it is older or badly formed, so the string approach is easier. You get some utility out of the various XML mapping libraries that exist, but otherwise it's similar to the above.
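The string-search idea above can be sketched like this (a minimal illustration; the class and method names are mine):

```java
import java.util.ArrayList;
import java.util.List;

public class LinkFinder {
    // Treat the page as one big string and scan it for href="..." values.
    public static List<String> findLinks(String page) {
        List<String> links = new ArrayList<>();
        int i = 0;
        while ((i = page.indexOf("href=\"", i)) != -1) {
            int start = i + "href=\"".length();
            int end = page.indexOf('"', start);
            if (end == -1) {
                break; // unterminated attribute; stop scanning
            }
            links.add(page.substring(start, end));
            i = end + 1;
        }
        return links;
    }
}
```

The same indexOf/substring pattern works for any other identifiable string combination around the text you want; it will of course miss single-quoted or unquoted attributes, which is the usual trade-off of the string approach.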

How to transform an HTML fragment to another HTML fragment?

I have a browser editor, of type contentEditable where users can copy/paste or select html fragments to put inside.
These fragments can be any kind of HTML, so we must sanitize the content so that it does not contain security-sensitive tags (like <script>, etc.).
I know some sanitizer libraries that allow some Whitelist policy (like JSoup on the JVM), but these rules are generally very simple, like saying which tags/attributes are whitelisted and nothing else.
We want more advanced rules like:
Define which inline styles to keep or not,
Transform relative links to absolute links,
Blacklist or whitelist some tags according to their className,
Allow some URI attributes according to the URI pattern (like allowing only links to a certain domain).
In some cases we want forbidden DOM nodes to be "replaced" by their children (to remove formatting and HTML layout elements, but not to lose the text nodes that were in the blacklisted tags).
So far we have written some code to handle this, but I find it very hacky. Is there a known library, standard, or algorithm to handle such things? I'm not an XML parse/transform expert; is there anything like XSLT, SAX, or something else that could help me solve my problem?
I'm looking for solutions on both the browser (JS) and the JVM (Java or Scala). Any idea on how to achieve this?
Maybe Showdown.js can help you? https://github.com/showdownjs/showdown
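For the "replace forbidden nodes by their children" rule specifically, here is a rough JVM sketch using only the stdlib DOM API (the tag sets and names are illustrative; it assumes the fragment is already well-formed XHTML, so real-world input should first pass through a parser like Jsoup or HtmlCleaner, and attribute filtering is not shown):

```java
import java.io.StringReader;
import java.io.StringWriter;
import java.util.Set;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

public class FragmentSanitizer {
    // Tags kept as-is; anything else is unwrapped (replaced by its children).
    private static final Set<String> KEEP = Set.of("div", "p", "b", "i", "a", "ul", "li");
    // Tags removed together with their entire content.
    private static final Set<String> DROP = Set.of("script", "style", "iframe");

    public static String sanitize(String xhtmlFragment) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xhtmlFragment)));
        clean(doc.getDocumentElement());
        Transformer t = TransformerFactory.newInstance().newTransformer();
        t.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
        StringWriter out = new StringWriter();
        t.transform(new DOMSource(doc), new StreamResult(out));
        return out.toString();
    }

    private static void clean(Node parent) {
        Node child = parent.getFirstChild();
        while (child != null) {
            Node next = child.getNextSibling();
            if (child.getNodeType() == Node.ELEMENT_NODE) {
                String tag = child.getNodeName().toLowerCase();
                if (DROP.contains(tag)) {
                    parent.removeChild(child); // drop the node and its content
                } else if (!KEEP.contains(tag)) {
                    // Unwrap: promote the children, then remove the wrapper,
                    // so text inside blacklisted layout tags is not lost.
                    Node promoted = child.getFirstChild();
                    while (child.getFirstChild() != null) {
                        parent.insertBefore(child.getFirstChild(), child);
                    }
                    parent.removeChild(child);
                    if (promoted != null) {
                        next = promoted; // reprocess the promoted nodes
                    }
                } else {
                    clean(child); // kept element: recurse into it
                }
            }
            child = next;
        }
    }
}
```

The other rules from the list (rewriting relative links, filtering by className or URI pattern) fit the same walk: inspect the element's attributes at the point where the tag name is checked.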

Retrieve information from a URL

I want to make a program that will retrieve some information from a URL.
For example, I give the URL below, from
LibraryThing.
How can I retrieve all the words below the "TAGS" tab, like
Black Library, fantasy, Thanquol & Boneripper, Thanquol and Bone Ripper, Warhammer?
I am thinking of using Java and designing a data-mining wrapper, but I am not sure how to start. Can anyone give me some advice?
EDIT:
You gave me excellent help, but I want to ask something else.
For every tag, we can see how many times it has been used when we press the "number" button. How can I retrieve that number as well?
You could use an HTML parser like Jsoup. It allows you to select HTML elements of interest using simple CSS selectors:
E.g.
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

Document document = Jsoup.connect("http://www.librarything.com/work/9767358/78536487").get();
Elements tags = document.select(".tags .tag a");
for (Element tag : tags) {
    System.out.println(tag.text());
}
which prints
Black Library
fantasy
Thanquol & Boneripper
Thanquol and Bone Ripper
Warhammer
Please note that you should read the website's robots.txt (if any) and the website's terms of service (if any), or your server might be IP-banned sooner or later.
I've done this before using PHP with a page scrape, then parsing the HTML as a string using Regular Expressions.
I imagine there's something similar in java and other languages. The concept would be similar:
Load page data.
Parse the data (e.g. with a regex, or via the DOM model using CSS selectors or XPath selectors).
Do what you want with the data :)
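Those three steps might look like this in Java (a sketch with illustrative names; the regex route shown here is fragile on messy real-world HTML, so prefer a parser when you can):

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class PageMiner {
    // Step 1: load the page data.
    static String fetch(String url) throws Exception {
        HttpRequest request = HttpRequest.newBuilder(URI.create(url)).build();
        return HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString())
                .body();
    }

    // Step 2: parse the data; the capture group is each anchor's text.
    static List<String> anchorTexts(String html) {
        List<String> texts = new ArrayList<>();
        Matcher m = Pattern.compile("(?s)<a[^>]*>(.*?)</a>").matcher(html);
        while (m.find()) {
            texts.add(m.group(1).trim());
        }
        return texts;
    }

    // Step 3: do what you want with the data.
    public static void main(String[] args) throws Exception {
        anchorTexts(fetch(args[0])).forEach(System.out::println);
    }
}
```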
It's worth remembering that some people might not appreciate you data mining their site and profiting from / redistributing it on a large scale.

How to detect different data types inside HTML page?

What is the best way to detect data types inside an HTML page using Java facilities (DOM API, regex, etc.)?
I'd like to detect types the way the Skype plugin does for phone/Skype numbers, and similarly for addresses, emails, times, etc.
'Types' is an inappropriate term for the kind of information you are referring to. The choice between a DOM API and regex depends on the structure of the information within the page.
If you know the structure (for example, tables are used for displaying the information, and you already know in which cell you can find the phone number and in which cell you can find the email address), it makes sense to go with a DOM API.
Otherwise, you should use regex on the plain HTML text without parsing it.
I'd use regexes in the following order:
Extract only the BODY content
Remove all tags to leave just plain text
Match relevant patterns in text
Of course, this assumes that markup isn't providing hints, and that you're purely extracting data, not modifying page context.
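A rough sketch of that order of operations (the patterns and names are illustrative simplifications; real phone and email detection needs far more careful patterns):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class DataTypeDetector {
    private static final Pattern EMAIL = Pattern.compile(
            "[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\\.[A-Za-z]{2,}");
    private static final Pattern PHONE = Pattern.compile(
            "\\+?\\d[\\d\\s().-]{6,}\\d");

    // Steps 1 and 2: keep only the BODY content, then strip the remaining tags.
    // This crude regex stripping is where an HTML parser would do a better job.
    static String toPlainText(String html) {
        String body = html.replaceAll("(?is).*<body[^>]*>(.*)</body>.*", "$1");
        return body.replaceAll("(?s)<[^>]+>", " ");
    }

    // Step 3: match the relevant patterns in the plain text.
    static List<String> findAll(Pattern p, String html) {
        List<String> hits = new ArrayList<>();
        Matcher m = p.matcher(toPlainText(html));
        while (m.find()) {
            hits.add(m.group());
        }
        return hits;
    }

    static List<String> findEmails(String html) { return findAll(EMAIL, html); }
    static List<String> findPhones(String html) { return findAll(PHONE, html); }
}
```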
Hope this helps,
Phil Lello

Java HTML normalizer?

Is there a library which can transform any given HTML page, with JS and CSS all over it, into a minimalistic uniform format?
For instance, if we render stackoverflow homepage, I want it to be shown in a minimal format. I want all other sites to be rendered down.
Sort of like Lynx web browser but with minimal graphics.
The best tool for HTML to Lynx style text I have come across is Jericho's Renderer.
It's easy to use:
import java.net.URL;
import net.htmlparser.jericho.Source;

// Parse from a URL, or pass raw HTML: new Source("<html>raw html string</html>")
Source source = new Source(new URL(sourceUrlString));
String renderedText = source.getRenderer().toString();
System.out.println("\nSimple rendering of the HTML document:\n");
System.out.println(renderedText);
and handles HTML in the wild (badly formatted) very well.
Here are the first few lines of this page formatted this way using Jericho:
Stack Exchange log in | careers | chat
| meta | about | faq
Stack Overflow
* Questions
* Tags
* Users
* Badges
* Unanswered
* Ask Question
Java HTML normalizer?
**
IS there a library which can transform
any given HTML page with JS, CSS all
over it, into a minimalistic uniform
format?
For instance, if we render
stackoverflow homepage, I want it to
be shown in a minimal format. I want
all other sites to be rendered down.
Sort of like Lynx web browser but with
minimal graphics.
To answer your first question: no, I don't think there is a library for that purpose (at least, that is what my googling turned up).
And I think the reason is that what you want is a very special need.
So as a solution to your problem, you can parse the HTML and display it the way you want in a JEditorPane or whatever you are using for display.
I can only suggest the way I would do it (because I am familiar with XML and everything around it):
Use a library to ensure that your HTML conforms to XHTML: http://htmlcleaner.sourceforge.net/release.php
then either parse the XML with DOM or SAX parsers and display it the way you want,
or
use XSLT to transform the document into some other HTML document that results in a view that fits your needs,
or
use one of the available HTML parser libraries (most of the ones I found were kind of outdated, circa 2006), but they could be an option for you.
This is just one suggestion for how you could do it. I'm sure there are thousands of other ways to do the same thing.
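The XSLT option is easy to try with nothing but the JDK once the page is valid XHTML. Here is a minimal sketch (my own illustration, not from the answer above) using an identity transform that copies everything except script and style elements:

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

public class XsltSimplifier {
    // Identity transform plus an empty template that swallows script/style.
    private static final String STYLESHEET =
            "<xsl:stylesheet version='1.0' xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>"
          + "<xsl:output method='xml' omit-xml-declaration='yes'/>"
          + "<xsl:template match='@*|node()'>"
          + "<xsl:copy><xsl:apply-templates select='@*|node()'/></xsl:copy>"
          + "</xsl:template>"
          + "<xsl:template match='script|style'/>"
          + "</xsl:stylesheet>";

    public static String simplify(String xhtml) throws Exception {
        Transformer t = TransformerFactory.newInstance()
                .newTransformer(new StreamSource(new StringReader(STYLESHEET)));
        StringWriter out = new StringWriter();
        t.transform(new StreamSource(new StringReader(xhtml)), new StreamResult(out));
        return out.toString();
    }
}
```

Growing this into the "minimalistic uniform format" is then a matter of adding templates: match the elements you want to restyle and emit simpler markup for them.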
