MediaWiki syntax parser for Android - Java

I need some help fetching the content of a MediaWiki page and displaying it in my Android app. I've found that MediaWiki has an API which can export the markup text, but I need some way to parse it.
Other solutions are also welcome, but I don't just want to include the whole page (with menus and everything) in a WebView.

You can get the HTML for just the content section (i.e. not including the menus and everything) by specifying action=render when fetching the page normally. You can get about the same result using the API's action=parse. In either case, you could then supply your own framing HTML, stylesheets, and the like for proper display to the user.
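A minimal sketch of the action=parse route on Android, assuming you read the JSON with the org.json classes that ship with the platform (the wiki URL and page title are just examples, and the request should run off the main thread):
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;
import org.json.JSONObject;

public class WikiFetcher {

    // Fetches the rendered HTML of one page via the MediaWiki API (action=parse).
    public static String fetchParsedHtml(String pageTitle) throws Exception {
        String api = "https://en.wikipedia.org/w/api.php"
                + "?action=parse&format=json&prop=text&page="
                + java.net.URLEncoder.encode(pageTitle, "UTF-8");
        HttpURLConnection conn = (HttpURLConnection) new URL(api).openConnection();
        try {
            BufferedReader in = new BufferedReader(
                    new InputStreamReader(conn.getInputStream(), "UTF-8"));
            StringBuilder json = new StringBuilder();
            String line;
            while ((line = in.readLine()) != null) {
                json.append(line);
            }
            // The content HTML lives under parse -> text -> "*" in the JSON response.
            return new JSONObject(json.toString())
                    .getJSONObject("parse")
                    .getJSONObject("text")
                    .getString("*");
        } finally {
            conn.disconnect();
        }
    }
}
You could then wrap the returned fragment in your own framing HTML and CSS and hand it to a WebView with loadDataWithBaseURL().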

If you are in control of the wiki you may be best served by using the mobile format that Wikipedia uses. Otherwise it will depend on what the returned markup text looks like. If the markup is what you would see in the wiki editor you may have issues. Wikipedia, for example, makes extensive use of templates (or whatever they call them). I'm not sure you can do much if you don't have the actual template code behind that.
You also might want to look at http://www.mediawiki.org/wiki/Markup_spec if you are headed down this path.

Related

How to deal in Android with a RESTful API that returns an HTML table

This API returns a whole HTML table. I'm trying to find out how to add this table (as is) into my UI, but I've never seen an API returning an HTML table, and browsing the Internet for an answer is not giving me any hope either.
Is it possible to put it into a WebView, or any other UI object? My application sends a word to the API, and I get the table in return.
I'd appreciate some code example.
You can certainly just show that exact same page in a WebView. If you want to parse the table and display only certain information, there is a library called Jsoup that makes it very convenient to parse HTML.
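A rough sketch of the Jsoup route (the URL is a placeholder, and on Android the request has to run off the main thread):
import java.util.ArrayList;
import java.util.List;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class TableScraper {

    // Returns the text of every cell in the first table on the page, row by row.
    public static List<List<String>> scrape(String url) throws Exception {
        Document doc = Jsoup.connect(url).get();
        Element table = doc.select("table").first();
        List<List<String>> rows = new ArrayList<>();
        for (Element tr : table.select("tr")) {
            List<String> cells = new ArrayList<>();
            for (Element cell : tr.select("td, th")) {
                cells.add(cell.text());
            }
            rows.add(cells);
        }
        return rows;
    }
}
From there you can feed the values into a ListView/RecyclerView adapter instead of rendering the raw HTML.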
It looks like you don't mind displaying the whole thing in a WebView - if that is acceptable, then you just load the page into a WebView widget. WebView will take care of rendering the page exactly as you see it in a browser. You only have to tell it what to load.
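If the WebView route is enough, the loading part is only a couple of lines; this assumes your layout contains a WebView with the hypothetical id webView and that your manifest declares the INTERNET permission:
WebView webView = (WebView) findViewById(R.id.webView);
webView.getSettings().setJavaScriptEnabled(true); // only needed if the page uses JavaScript
webView.loadUrl("http://example.com/lookup?word=test"); // the same URL you already send the word to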
You parse the output like you would any other web request. If you wanted to include the table in your own webpage, you could. Or you could parse the response for the specific info you need.
Don't think of it as an API, think of it as a URL you're requesting and now you need to do something with the contents. That might help with your Googling. You're essentially doing page scraping.

Crawler4j vs. Jsoup for crawling and parsing pages in Java

I want to get the content of a page and extract specific parts of it. As far as I know, there are at least two solutions for such a task: Crawler4j and Jsoup.
Both of them are capable of retrieving the content of a page and extracting sub-parts of it. The only thing I'm not sure about is what the difference between them is. There is a similar question, which is marked as answered:
Crawler4j is a crawler, Jsoup is a parser.
But I just checked: Jsoup is also capable of crawling a page in addition to its parsing functionality, while Crawler4j is capable not only of crawling a page but also of parsing its content.
What is the difference between Crawler4j and Jsoup?
Crawling is something bigger than just retrieving the contents of a single URI. If you just want to retrieve the content of some pages then there is no real benefit from using something like Crawler4J.
Let's take a look at an example. Assume you want to crawl a website. The requirements would be:
Give a base URI (home page).
Take all the URIs from each page and retrieve the contents of those too.
Move recursively for every URI you retrieve.
Retrieve the contents only of URIs that are inside this website (there could be external URIs referencing another website; we don't need those).
Avoid circular crawling. Page A has a URI for page B (of the same site). Page B has a URI for page A, but we already retrieved the content of page A (the About page has a link to the Home page, but we already got the contents of the Home page, so don't visit it again).
The crawling operation must be multithreaded.
The website is vast. It contains a lot of pages. We only want to retrieve 50 URIs, beginning from the home page.
This is a simple scenario. Try solving this with Jsoup: all of this functionality must be implemented by you. Crawler4J, or any crawler micro-framework for that matter, would or should have an implementation of the actions above. Jsoup's strong qualities shine when you get to decide what to do with the content.
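As a rough, hedged sketch of that scenario with Crawler4J (class and package names follow the Crawler4J 4.x API; older versions use a slightly different shouldVisit signature, so check against the version you actually use):
import edu.uci.ics.crawler4j.crawler.CrawlConfig;
import edu.uci.ics.crawler4j.crawler.CrawlController;
import edu.uci.ics.crawler4j.crawler.Page;
import edu.uci.ics.crawler4j.crawler.WebCrawler;
import edu.uci.ics.crawler4j.fetcher.PageFetcher;
import edu.uci.ics.crawler4j.robotstxt.RobotstxtConfig;
import edu.uci.ics.crawler4j.robotstxt.RobotstxtServer;
import edu.uci.ics.crawler4j.url.WebURL;

public class SiteCrawler extends WebCrawler {

    // Only follow URIs that stay inside the site, ignoring external links.
    @Override
    public boolean shouldVisit(Page referringPage, WebURL url) {
        return url.getURL().startsWith("http://www.example.com/");
    }

    // Called once per fetched page; hand the HTML to your parsing code here.
    @Override
    public void visit(Page page) {
        System.out.println("Fetched: " + page.getWebURL().getURL());
    }

    public static void main(String[] args) throws Exception {
        CrawlConfig config = new CrawlConfig();
        config.setCrawlStorageFolder("/tmp/crawl"); // intermediate crawl data
        config.setMaxPagesToFetch(50);              // "only 50 URIs from the home page"
        PageFetcher fetcher = new PageFetcher(config);
        RobotstxtServer robots = new RobotstxtServer(new RobotstxtConfig(), fetcher);
        CrawlController controller = new CrawlController(config, fetcher, robots);
        controller.addSeed("http://www.example.com/");
        controller.start(SiteCrawler.class, 4);     // 4 crawler threads
    }
}
Crawler4J keeps track of already-visited URLs and runs the crawler threads for you, which covers the circular-crawling and multithreading requirements above.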
Let's take a look at some requirements for parsing.
Get all paragraphs of a page
Get all images
Remove invalid tags (tags that do not comply with the HTML specs)
Remove script tags
This is where Jsoup comes into play. Of course, there is some overlap here: some things might be possible with both Crawler4J and Jsoup, but that doesn't make them equivalent. You could remove the content-retrieval mechanism from Jsoup and it would still be an amazing tool to use. If Crawler4J dropped retrieval, it would lose half of its functionality.
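A hedged sketch of those four parsing requirements with Jsoup (the clean-up step uses Jsoup's Whitelist, which recent versions rename to Safelist):
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.safety.Whitelist;
import org.jsoup.select.Elements;

public class PageParser {

    public static void parse(String html, String baseUri) {
        Document doc = Jsoup.parse(html, baseUri);

        // Get all paragraphs of the page.
        Elements paragraphs = doc.select("p");

        // Get all images, resolved to absolute URLs.
        for (Element img : doc.select("img[src]")) {
            String imageUrl = img.absUrl("src");
        }

        // Remove script tags.
        doc.select("script").remove();

        // Drop tags that aren't in a known-good whitelist
        // (Whitelist is called Safelist in recent Jsoup versions).
        String cleaned = Jsoup.clean(doc.body().html(), Whitelist.relaxed());
    }
}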
I used both of them in the same project in a real life scenario.
I crawled a site, leveraging the strong points of Crawler4J, for all the problems mentioned in the first example. Then I passed the content of each page I retrieved to Jsoup, in order to extract the information I needed. Could I have not used one or the other? Yes, I could, but I would have had to implement all the missing functionality.
Hence the difference: Crawler4J is a crawler with some simple operations for parsing (you could extract the images in one line), but there is no implementation of complex CSS queries. Jsoup is a parser that also gives you a simple API for HTTP requests, but there is no implementation of anything more complex than that.

Is it advisable to use the 'include' tag of JSP for setting the general structure of a website?

I am making a website with around 20 pages in it. Almost all the pages have the same general layout, like the menu bar, header, footer, etc. I've made a JSP page which contains this common content, and with the help of the 'include' tag I'm using it in the other pages. So is it advisable to follow this technique? Kindly inform me about the pros and cons of using it.
Thanks in advance.
Remember that with each include tag, the whole JSP will be converted into a servlet, which then produces the required HTML for the browser. So there is no doubt that for a large application it can create unnecessary performance issues.
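For reference, the two standard include variants the question refers to look roughly like this (header.jsp is a hypothetical fragment name); the first is merged into the page at translation time, the second runs as a separate request at runtime:
<%-- Static include: the fragment is pasted into this page before it is compiled into a servlet. --%>
<%@ include file="header.jsp" %>
<%-- Dynamic include: the fragment is executed as its own request and its output inserted here. --%>
<jsp:include page="header.jsp" />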
Instead of doing this you may use the iframe tag, which is widely used in web development.
You can modify the iframe source as you want.
So it totally depends on which way you want to proceed and on your application context; there is no fixed rule that says you must use one technique or the other.

How to make Liferay not produce condensed HTML code?

I found that Liferay serves my JSP output in a somewhat "condensed" way, putting most of the text into a few very long lines.
This makes it uncomfortable to debug JavaScript.
Is it possible to turn off this feature temporarily?
For others looking at this post, if you simply want to do this on an ad-hoc basis you can add these params to the URL:
/web/guest/page?js_fast_load=0&css_fast_load=0&strip=0
Note this is for JS, CSS and HTML
HTML minification is on regardless of whether you're in developer mode or not, since HTML stripping can itself produce problems that you want to see in developer mode.
You can add strip=0 parameter to the URL to prevent the served HTML page being stripped.
In order to turn HTML stripping off completely, change this in your system.properties:
com.liferay.filters.strip.StripFilter=false
But as @BalusC said, you should use a tool which does the formatting when debugging, so you're not bothered by the stripping.
There are two ways to do it. Copy the following into portal-ext.properties and restart the server:
javascript.fast.load=false
Or, if you don't want to restart and it's just for a temporary purpose, add the js_fast_load parameter to the URL and set its value to 0.
For example, if you are on a page http://localhost:8080/web/guest/home in which your portlet or the JavaScript is present, use this URL instead: http://localhost:8080/web/guest/home?js_fast_load=0
Liferay has a file named portal-developer.properties as a template in WEB-INF/classes. You can either reference this or just copy/paste its content into your portal-ext.properties.
This has several options that turn off minification of HTML, JS, CSS and others. You'll hurt your loading time, i.e. you really only want these options at development time, but then it really helps.
By default all files are also combined into a single one (for js, another for css etc.) - with the development options you'll get a separate request for every file on every page request.
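For quick orientation, the entries involved typically look something like this in portal-ext.properties; exact property names can vary between Liferay versions, so cross-check them against the portal-developer.properties that ships with your release:
# Serve individual, unminified JavaScript files instead of the combined bundle.
javascript.fast.load=false
# Serve individual, unminified CSS files.
theme.css.fast.load=false
# Do not strip whitespace out of the generated HTML.
com.liferay.portal.servlet.filters.strip.StripFilter=false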
I just want to update the package name for Liferay 6.2 from @Fabian Barney's answer:
com.liferay.portal.servlet.filters.strip.StripFilter=false

Is there a way to change or reskin an incoming website on the fly?

I have a project where they want me to embed a website into a Java application, and they want the website to have a similar color scheme to the rest of the application. I know a lot about CSS and building websites, but I am not aware of a way to change the look of a website as it comes in on the fly. Is there someone who can help?
Update:
I don't have access to the header because it is not my website. To give more info about the project: we have a browser embedded in a Java client application. The user needs to access a website that displays the contents of a database. I have no access to the original HTML or CSS of the site.
What I need is to change the background color and font sizes of the incoming web page to match the look and feel of the Java application.
One approach would be to replace their CSS with your own.
You could also take the approach used by the Stylish plugin, which involves a lot of !important declarations to override the site's CSS. Since this is a Java app, I assume the user will not have the opportunity to supply their own CSS, so using !important here doesn't precisely go against the standard.
In your particular situation, I'd look into data scraping: all you need to do is scrape the website for the data, and then re-style it to present it how you want.
Good luck
The Greasemonkey add-on for Firefox does just this. You can write a bit of Javascript code and have it run when certain web pages load. One common thing to use it for is to make changes to the DOM to move page elements around, hide or resize elements, change colors, etc. There are a bunch of examples at userscripts.org if you want to get an idea of what I am talking about.
Your code would simply need to do something similar. Load the page (including the normal style sheets) and then use the DOM to make changes to style elements as desired. Browse through the source of the page to get the names/ids of important elements, and your code can key off of those. Loading an additional CSS file containing your changes is an option, but doing it programmatically will likely give you more flexibility in the event that the target website changes.
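What that looks like in code depends on which embedded browser you use; as one hedged example, with a JavaFX WebView (an assumption, since the question doesn't say which component is embedded) you could wait for the page to finish loading and then append a style element from Java (the colors are placeholders):
import javafx.concurrent.Worker;
import javafx.scene.web.WebEngine;
import javafx.scene.web.WebView;

// Somewhere in your JavaFX UI code:
WebView browser = new WebView();
WebEngine engine = browser.getEngine();

engine.getLoadWorker().stateProperty().addListener((obs, oldState, newState) -> {
    if (newState == Worker.State.SUCCEEDED) {
        // Append our own rules after the site's stylesheets so they win.
        engine.executeScript(
            "var s = document.createElement('style');" +
            "s.innerHTML = 'body { background: #2b2b2b !important; color: #ddd !important; }';" +
            "document.head.appendChild(s);");
    }
});
engine.load("http://example.com/database-view");
Other embedded browsers (SWT Browser, JxBrowser, etc.) expose similar "execute JavaScript after load" hooks.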
It depends on what you use to show the pages in Java. Most browser implementations support dynamic changes to the DOM, so you can simply add a CSS file to the header as the last element, and it will be applied.
You need to know the markup of the HTML/CSS so you can make the best skin.
You could theoretically do it by styling just the basic tags (h1 to h6, p, etc.), but it would not be as good; it would probably fail to produce the best results at times and even produce horrible things at times.
If you KNOW the site markup then you can make a skin and simply use CSS/images to style it as you want.
Just include your CSS LAST so that it overrides the stylesheet already present on the site you want to skin differently.
It should not be a difficult thing per se; the skin itself is probably the bigger (more effort required) part of the job.
"On the fly" should mean changing the HTML as it is fetched, so parsing and replacing tokens seems to be the way.
You could change the locations of the stylesheet files by replacing the href value in a link that points to a CSS file, setting the value to your own stylesheet (a different URI).
<link type="text/css" href="mylocalURI" rel="stylesheet" />
(this should be the result of a process/replacement)
I think you understand what should happen for inline styles.
I would use JTidy to normalize the original site HTML to XHTML, then use XSLT to filter only the interesting/relevant information, obtaining XML; and finally (since I wouldn't want to convert the XML to objects), XSLT again to transform the "pure" XML into the HTML look & feel I need/want.
All of this can be assembled as streams, using more or less 4 KB of buffer per filter (12 KB in total) per thread, which also means it will run fast enough. And it's all built on standard, openly available open-source components.
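A rough outline of that pipeline in Java, assuming JTidy for the HTML-to-XHTML step and the standard javax.xml.transform API for the two XSLT passes (the stylesheet file names are placeholders):
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.File;
import java.io.InputStream;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;
import org.w3c.tidy.Tidy;

public class ReskinPipeline {

    public static String reskin(InputStream originalHtml) throws Exception {
        // 1. Normalize the incoming HTML to well-formed XHTML with JTidy.
        Tidy tidy = new Tidy();
        tidy.setXHTML(true);
        tidy.setQuiet(true);
        ByteArrayOutputStream xhtml = new ByteArrayOutputStream();
        tidy.parse(originalHtml, xhtml);

        TransformerFactory tf = TransformerFactory.newInstance();

        // 2. First XSLT pass: keep only the interesting content as plain XML.
        Transformer extract = tf.newTransformer(new StreamSource(new File("extract.xsl")));
        ByteArrayOutputStream xml = new ByteArrayOutputStream();
        extract.transform(new StreamSource(new ByteArrayInputStream(xhtml.toByteArray())),
                new StreamResult(xml));

        // 3. Second XSLT pass: render that XML with our own look & feel.
        Transformer render = tf.newTransformer(new StreamSource(new File("skin.xsl")));
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        render.transform(new StreamSource(new ByteArrayInputStream(xml.toByteArray())),
                new StreamResult(out));

        return out.toString("UTF-8");
    }
}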
Cheers.
