I am parsing a Wikipedia dump in Java. In my module I want to know the page IDs of the internal wiki pages referenced by the current page. Getting the internal links, and thus the URLs, is easy. But how do I get the page ID from a URL?
Do I have to use the MediaWiki API for this? If yes, how?
Is there any other alternative?
For example, for http://en.wikipedia.org/wiki/United_States
I want to get its page ID, i.e. 3434750.
You can use the API for that. Specifically, the query would look something like:
http://en.wikipedia.org/w/api.php?action=query&titles=United_States
(You can also specify more than one page title in the titles parameter, separated by |.)
As an alternative, you could download the page.sql dump (about 1 GB compressed for the English Wikipedia), which also contains this information. To actually query it, you could either import it into a MySQL database and query that, or parse the SQL directly.
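As a minimal sketch of the API route (the endpoint and parameters follow the public MediaWiki API; adding format=json means the response's pages map carries a "pageid" field you would then read with your JSON library of choice), a small helper that turns an article URL into the matching API query could look like:

```java
public class WikiApi {
    // Pull the title out of a /wiki/Title-style URL
    public static String titleFromUrl(String url) {
        int i = url.indexOf("/wiki/");
        return url.substring(i + "/wiki/".length());
    }

    // Build the query whose JSON response contains the "pageid" field
    public static String apiUrl(String title) {
        return "https://en.wikipedia.org/w/api.php"
             + "?action=query&format=json&titles=" + title;
    }

    public static void main(String[] args) {
        System.out.println(apiUrl(titleFromUrl(
            "http://en.wikipedia.org/wiki/United_States")));
    }
}
```

Fetching that URL and parsing the JSON is then ordinary HTTP-client work; the helper just keeps the URL construction in one place.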
If you can't use the API, you can always get the page ID from the info page, reached by appending ?action=info to the URL. It should make a better starting point for a parser.
For your example above: https://en.wikipedia.org/wiki/United_States?action=info
I am working on content parsing. I wrote a sample program for this and took a sample link.
Please visit the link below:
http://www.equitymaster.com/stockquotes/sector.asp?sector=0%2CSOFTL&utm_source=top-menu&utm_medium=website&utm_campaign=performance&utm_content=key-sector
From the above link I parsed the table data and stored it into a Java object.
BSE and NSE are not my exact requirement; I have just taken them as a sample. The page above is built with tables, and they do not use IDs or classes. In my example I parsed the data using XPath.
This is my XPath:
/html/body/table[4]/tbody/tr/td/table[2]/tbody/tr[2]/td[2]/font/table[2]
I selected it and the parsing works fine. The problem is that if they change the website structure in the future, my program will certainly break. Is there another way to parse data dynamically, store it in a database, and display results based on a condition even if the webpage structure changes? I used the JSoup API for this. Are there any other APIs that provide good support for this type of requirement?
If you're trying to parse a page without any clear id/class to select your nodes, you have to rely on something else. Spelling out the whole tree is indeed the weakest way of doing it: if anything is added or changed, everything collapses.
You could try relying on the color: //table[@bgcolor="#c9d0e0"], the "GET MORE INFO" field: //table[tr/td//text()="GET MORE INFO"], the "More Info" there is on every line: //table[.//td//text()=" More Info "]...
The idea is to find something ideally unique (if you can't find any unique criterion, table[color condition selecting a few tables][2] is still stronger than walking the whole tree), present every time, and use that as an id.
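To make the idea concrete, here is a small sketch using the JDK's built-in XPath support. The HTML fragment and the bgcolor value are stand-ins for the real page, not its actual markup; the point is that anchoring on a stable attribute survives structural churn that would break an absolute path like the one above.

```java
import java.io.ByteArrayInputStream;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;

public class RobustXPath {
    // Hypothetical, well-formed stand-in for the real page
    static final String PAGE =
          "<html><body>"
        + "<table><tr><td>Navigation</td></tr></table>"
        + "<table bgcolor=\"#c9d0e0\"><tr><td>Sector data</td></tr></table>"
        + "</body></html>";

    // Select the data cell by attribute, not by absolute tree position
    public static String sectorCell(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes("UTF-8")));
            XPath xp = XPathFactory.newInstance().newXPath();
            return xp.evaluate("//table[@bgcolor='#c9d0e0']//td", doc);
        } catch (Exception e) {
            return null;
        }
    }
}
```

Note that javax.xml requires well-formed XML; for real-world HTML you would first clean the markup (JSoup can do this), then apply the same attribute-anchored selection.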
I have implemented pagination like below:
http://myhost.com/product-2/213-1
This means there are 213 products in total and this is the first page.
When I check which pages Google crawled on my website, I see results like:
http://myhost.com/product-2/213-1-2/144-0/144-1/144-14/125-1/125-12/125-1/151-15/108-10/131-1/134-13/140-14/140-1/118-11/126-1/126-12/110-1/270-27/270-1/270-27
This means Google is somehow appending all the page numbers to the end of the URL and crawling that URL. Could someone give me a solution to stop this? For this particular case I want Google to crawl only one page containing all the product information.
Use canonical URLs to tell Google which page is the one page you want to show in the search results.
That is weird. It looks like you are using relative links in your pagination, and your URL router is somehow accepting this without throwing a 404; instead it shows content by interpreting only part of the URL rather than the whole thing. So search engines can crawl these URLs.
Example:
You are linking to
next-side/
instead of
/path/to/next-side/
If you post a link, the community can try it out!
By the way, I wouldn't recommend varying the URL by number of items. Using fixed URLs is much better, and the item count is of no interest. Better to use something like /shop/category/subcategory/product.
I will add to the great answers you already have that you can use the rel="next"/rel="prev" pagination elements
to let Google know that the next link is the next page in your list.
You can find more information on the Google Webmaster blog; there is a post and a video tutorial
that explain how to implement and use the pagination tags.
I have some content displayed using computed fields inside a repeat in my XPage.
I now need to send out a newsletter (by email) every week with the content of this repeat. The content can be both plain text and HTML.
My site is also translated into different languages, so I need the code to be able to specify the language and return the content in that language.
I am thinking about creating a scheduled LotusScript or Java agent that somehow reads the content of the repeat. Is this possible? If so, some sample code to get me started would be great.
Edit: the content is only available to logged-in users.
thanks
Thomas
Use a Java agent, and instead of going to the content natively, open the page over the web as if in a browser, then process the result. (You could also make a special version of the web page that hides all extraneous content, if you wanted.)
How is the data for the repeat evaluated? Can it be translated into a LotusScript database.search?
If so, it would be best to forget about the actual XPage and concentrate on working out how to get the same data via LotusScript, then write your scheduled agent to loop through the document collection and generate the email that way.
Going through the XPage would generate a lot of extra work: you would need to be authenticated as the user (if the data in the repeat differs from one user to the next) to get exactly the data that the particular user would see, and then you would have to parse the page to extract the data.
If your newsletter is complicated enough that you want to use an XPage rather than build the HTML yourself in the agent, you could build a single XPage that changes what is rendered based on a special query string, then in your agent get the HTML from a URLConnection and pass it into the body of your email.
You could build the URL based on a view that shows documents with today's date.
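A rough sketch of that agent-side fetch follows. The URL shape and the render/lang parameter names are my assumptions for illustration, not your application's real ones; the URLConnection part is plain JDK.

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;
import java.net.URLConnection;

public class NewsletterFetch {
    // Hypothetical URL shape: "render=mail" switches the XPage to mail output
    public static String buildUrl(String base, String lang) {
        return base + "/newsletter.xsp?render=mail&lang=" + lang;
    }

    // Read the rendered page so the HTML can go into the mail body
    public static String fetchHtml(String url) throws Exception {
        URLConnection conn = new URL(url).openConnection();
        StringBuilder sb = new StringBuilder();
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(conn.getInputStream(), "UTF-8"))) {
            String line;
            while ((line = in.readLine()) != null) {
                sb.append(line).append('\n');
            }
        }
        return sb.toString();
    }
}
```

Since the content is restricted to logged-in users, the agent would also need to authenticate, e.g. by sending a session cookie or basic-auth header on the connection before reading.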
I would solve this by giving the user a teaser of what to read and a link to the full content.
You should check out my colleague Weihang Chen's article about rendering an XPage as MIME and sending it as a mail.
http://www.bleedyellow.com/blogs/weihang/entry/render_a_xpages_programmtically_and_send_it_as_a_mail?lang=en_us
We got this working in house and it is very convenient.
He describes 3 different approaches to the problem.
Is there any way to get data from other sites and display it in our JSP pages dynamically?
See this URL: http://www.dictionary30.com/meaning/Misty
On that page there is a block like:
Wikipedia Meaning and Definition on 'Misty'
In that block they fetch data from Wikipedia and display it on dictionary30.
Question:
How do they fetch wiki data into their site?
I need to display data like that in my JSP page by fetching it from another site.
You can use URLConnection and read the other site's data,
or better, use JSoup, which will also parse specific data for you from the other site.
For your case:
import java.net.URL;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

// Fetch the page with a 10-second timeout, then pull out the content block
Document document = Jsoup.parse(new URL("http://www.dictionary30.com/meaning/Misty"), 10000);
Element div = document.select("div[id=contentbox]").first();
System.out.println(div.html());
You can fetch data from other site on your server side using URLConnection and provide this data to jsp page.
Do make sure you get permission first from the site owners before doing anything like that.
Most people don't take kindly to their data being leeched by others, especially as it costs them money and doesn't generate any (advertising) income.
It's also very risky in that your own site/application will quickly fail as soon as the site you're leeching from gets changed to a different layout.
Using Java (.jsp or whatever), is there a way I can send a request for this page:
http://www.mystore.com/store/shelf.jsp?category=mens#page=2
and have the Java code parse the URL, see the #page=2, and respond accordingly?
Basically, I'm looking for Java code that lets me access the characters following the hash.
The reason I'm doing this is that I want to load subsequent pages via AJAX (on my shelf) and then allow the user to copy and paste the URL and send it to a friend. Without Java being able to read the characters following the hash, I'm uncertain how I would manipulate the URL with JavaScript in a way that the server could also read without causing the page to reload.
I'm having trouble even figuring out how to access the entire URL (http://www.mystore.com/store/shelf.jsp?category=mens#page=2) from within my Java code...
You can't.
The fragment identifier is only used client side, so it isn't sent to the server.
You have to parse it out with JavaScript, and then run your Ajax routines.
If you are loading entire pages (and just leaving some navigation and branding behind) then it almost certainly isn't worth using Ajax for this in the first place. Regular links work better.
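One addition: if your client-side script ships the full URL back to the server anyway (say, as a parameter in the Ajax call), Java's standard java.net.URI class can pull the fragment out of that string, since at that point it is just data, not part of the request path:

```java
import java.net.URI;

public class FragmentDemo {
    // Returns the part after '#', or null if there is no fragment
    public static String fragmentOf(String url) {
        try {
            return new URI(url).getFragment();
        } catch (Exception e) {
            return null;
        }
    }
}
```

This does not change the answer above: the browser still never sends the fragment with the request, so the client has to hand it over explicitly.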
Why can't you use a URL like this:
http://www.mystore.com/store/shelf.jsp?category=mens&page=2
If you want the data to reach the server, it's gotta be in the query string.
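In a servlet or JSP the container already does this split for you (request.getParameter("page") would return "2" for the URL above). As a self-contained sketch of what that amounts to for a query string like this one:

```java
import java.util.HashMap;
import java.util.Map;

public class QueryString {
    // Minimal split of "a=1&b=2" into a map (no URL-decoding here)
    public static Map<String, String> parse(String qs) {
        Map<String, String> out = new HashMap<>();
        for (String pair : qs.split("&")) {
            int eq = pair.indexOf('=');
            if (eq > 0) {
                out.put(pair.substring(0, eq), pair.substring(eq + 1));
            }
        }
        return out;
    }
}
```

For real input you would also run each key and value through java.net.URLDecoder.decode, which the sketch skips for brevity.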