I have a huge wiki dump (~50 GB after extracting the tar.bz file), from which I want to extract the individual articles. I am using the wikixmlj library to extract the contents, and it does give the title, text, the categories mentioned at the end, and a few other attributes. But I am more interested in the external links/references associated with each article, which this library doesn't provide any API for.
Is there an elegant and efficient way to extract those, other than parsing the wikitext we get from the getWikiText() API?
Or is there another Java library for extracting from this dump file that gives me the title, content, categories and the references/external links?
The XML dump contains exactly what the library is offering you: the page text along with some basic metadata. It doesn't contain any metadata about categories or external links.
The way I see it, you have three options:
Use the specific SQL dumps for the data you need, e.g. categorylinks.sql for categories or externallinks.sql for external links. But there is no dump for references (because MediaWiki doesn't track those).
Parse the wikitext from the XML dump yourself (a rough sketch of this is below). This would have problems with templates.
Use your own instance of MediaWiki to parse the wikitext into HTML and then parse that. This could potentially handle templates too.
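If you go with option 2, a minimal sketch of pulling external links straight out of the wikitext could look like this. It assumes the String comes from something like getWikiText(), and the regex only catches bracketed or bare http/https links written directly in the text, not links produced by templates:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Rough sketch of option 2: extract external links from raw wikitext.
// Only handles [http://... label] style links; template-generated links
// and other edge cases would need a real wikitext parser.
public class ExternalLinkExtractor {

    private static final Pattern EXTERNAL_LINK =
            Pattern.compile("\\[(https?://\\S+)(?:\\s+[^\\]]*)?\\]");

    public static List<String> extract(String wikiText) {
        List<String> links = new ArrayList<>();
        Matcher m = EXTERNAL_LINK.matcher(wikiText);
        while (m.find()) {
            links.add(m.group(1));
        }
        return links;
    }
}
```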
Maybe too late, but this link could help: http://www.cs.technion.ac.il/~gabr/resources/code/wikiprep/wikiprep.html
Here is an example output of the above program: http://www.cs.technion.ac.il/~gabr/resources/code/wikiprep/sample.hgw.xml
I am currently working on a project which generates contracts. The idea is that I put the data into a form and save it in a simple database.
So far, this has been my favorite place to search for good ideas and simple solutions.
Now I am facing another problem and I don't know how to solve it. I want to create a PDF and replace some placeholders with data from my form.
One idea was, that I use an existing Word template with some bookmarks and replace them with the data from my form. Maybe there is a way to do that, and I am just too stupid to find it.
Another idea was to use XML. So I thought I was clever and just converted the Word template to a PDF, so that I could convert that PDF to XML. Attached, you find the XML file. But now I need the XSL file - is there an easy way to create the XSL file?
Or maybe there is another simple solution to solve my problem.
In these attachments you find the PDF file, the Word template and the XML:
Thank you a lot :)
Using a template is a good idea - it makes some changes much quicker to make and deploy. The comments above are focused on conversion, but don't forget you need to merge your data in first (population).
If you can use Adobe tools, you can have a PDF template and use the Adobe tools to populate. This saves a "conversion" stage.
You mentioned using Word for templates. This means you have to run through two stages of processing:
population - docx is a zipped set of XML files - so you can process them with your own code or using a library.
conversion - you need pdf, so you have to convert the docx to pdf. You also have to watch out for fonts at this stage (ie make sure they are available on your host).
The population stage you could do yourself, since you are familiar with XML (a rough sketch is below), but it is definitely complicated. The conversion needs a tool that is built for it; there are a few mentioned in the comments already.
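To illustrate the hand-rolled population route: a .docx is just a zip of XML parts, so a bare-bones pass could copy every entry and rewrite word/document.xml. This is only a sketch, assuming a hypothetical ${CUSTOMER_NAME} placeholder in the template; a real engine (docx4j, Docmosis, etc.) copes with split text runs and formatting far more robustly:

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Collections;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;
import java.util.zip.ZipOutputStream;

// Sketch of the "population" stage by hand: copy the docx zip entry by entry
// and replace a placeholder inside word/document.xml.
// ${CUSTOMER_NAME} is a made-up placeholder for illustration only.
public class DocxPopulate {

    public static void populate(Path template, Path output, String customerName) throws IOException {
        try (ZipFile zip = new ZipFile(template.toFile());
             ZipOutputStream out = new ZipOutputStream(Files.newOutputStream(output))) {

            for (ZipEntry entry : Collections.list(zip.entries())) {
                out.putNextEntry(new ZipEntry(entry.getName()));
                byte[] data = zip.getInputStream(entry).readAllBytes();
                if ("word/document.xml".equals(entry.getName())) {
                    String xml = new String(data, StandardCharsets.UTF_8)
                            .replace("${CUSTOMER_NAME}", customerName);
                    data = xml.getBytes(StandardCharsets.UTF_8);
                }
                out.write(data);
                out.closeEntry();
            }
        }
    }
}
```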
There are some free/open-source and commercial tools that can do both parts:
docx4j
JOD Reports
Libre Office (using the Java Uno API) (I blogged this once - Java Convert Word to PDF with UNO)
Docmosis (please note I work for Docmosis)
I suggest starting with the simple example you have attached and proving you can both populate and convert it. Then switch to a more complicated example to see if you can do the other things that might be required during the population stage (e.g. repetition, conditions or other logic).
Is it a good idea? I have used third-party libraries like jsoup and they work great, but this project is different. Is it worth loading and parsing a whole document when you just want to get one item from it? Some of the HTML pages are simple, so I could use String methods instead. The reason is that memory will be an issue, and loading the document also takes time. When parsing XML I always use a SAX parser, because it doesn't load the document into memory and it is fast. Could I use the same approach on HTML documents, or is there already something like that out there? A lightweight, non-DOM HTML parser would be great.
If the HTML is XML compliant (i.e. it's XHTML), then you can use a standard SAX parser. Here you can find a list of HTML parsers in Java to choose from: http://java-source.net/open-source/html-parsers. HotSax will probably handle all your use cases.
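For the XHTML case, a minimal sketch with the JDK's own SAX parser, streaming the document and collecting every href attribute without building a DOM in memory (real-world HTML is rarely well formed, which is where HotSax or a tag-soup parser comes in):

```java
import java.io.File;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

// Streams well-formed XHTML with the JDK's SAX parser and collects every
// href attribute of <a> elements, without loading the whole tree.
public class XhtmlLinkHandler extends DefaultHandler {

    private final List<String> hrefs = new ArrayList<>();

    @Override
    public void startElement(String uri, String localName, String qName, Attributes attrs) {
        if ("a".equalsIgnoreCase(qName)) {
            String href = attrs.getValue("href");
            if (href != null) {
                hrefs.add(href);
            }
        }
    }

    public static void main(String[] args) throws Exception {
        SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
        XhtmlLinkHandler handler = new XhtmlLinkHandler();
        parser.parse(new File(args[0]), handler);
        System.out.println(handler.hrefs);
    }
}
```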
Which APIs in java help in extracting table metadata from a pdf, and presenting that table in a web page?
The result should be that when the source of the page is viewed, it shows the HTML code of that table.
iText is useful in this context:
http://itextpdf.com/
I assume that you need a PDF library for Java.
PDFBox is one of the popular libraries for PDF manipulation, and I think it is worth looking at.
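As a starting point, a minimal sketch (assuming PDFBox 2.x) that pulls the plain text out of a PDF so it can be post-processed into HTML. Note that PDFBox does not reconstruct tables for you; detecting rows and columns is still up to your own code:

```java
import java.io.File;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;

// Dumps the text content of a PDF with PDFBox 2.x. Table structure is not
// preserved; the output is plain text in reading order.
public class PdfTextDump {
    public static void main(String[] args) throws Exception {
        try (PDDocument doc = PDDocument.load(new File(args[0]))) {
            String text = new PDFTextStripper().getText(doc);
            System.out.println(text);
        }
    }
}
```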
Try the Metadata Extract Tool, which extracts metadata from specific file types, including PDF. Then you can parse the XML output with any Java XML parser. Once you're able to parse it, the elements can easily be laid out in your view page.
I have a huge pdf file (20 mb/800 pages) which contains some information.
It has an index with hyperlinks. Also, most of the remaining information is in tabular format (in the PDF). I need to retrieve this information using Java and store it in SQL Server.
Which is the best API available to read this kind of file from Java?
It is unlikely to be in tabular format inside the PDF, as PDF does not contain structure information unless it is explicitly added at creation time. I wrote an article explaining some of the issues with text extraction from a PDF at http://www.jpedal.org/PDFblog/2009/04/pdf-text/
Have you tried iText?
iText
Download iText
iText in Action — 2nd Edition
List of the Examples
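Possibly useful as a starting point, a minimal sketch assuming iText 5 that walks all the pages and dumps each page's text, which you could then parse and insert into SQL Server with plain JDBC. As noted in the other answer, table structure is not preserved; you only get text in reading order:

```java
import com.itextpdf.text.pdf.PdfReader;
import com.itextpdf.text.pdf.parser.PdfTextExtractor;

// Walks every page of a PDF with iText 5 and prints its extracted text.
// Parsing that text into table rows/columns is left to your own code.
public class ItextPageDump {
    public static void main(String[] args) throws Exception {
        PdfReader reader = new PdfReader(args[0]);
        for (int page = 1; page <= reader.getNumberOfPages(); page++) {
            String text = PdfTextExtractor.getTextFromPage(reader, page);
            System.out.println("== page " + page + " ==");
            System.out.println(text);
        }
        reader.close();
    }
}
```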
I'm trying to generate some graphs with prefuse, and it seems like the easiest way to load the data into prefuse is to use a GraphML file.
Is there an easy way to write these files from my data?
Or is there an easier way to load my data into prefuse?
Thanks
yEd can export graphs in GraphML format, and JGraphT has a GraphMLExporter. That leaves the problem of how to get your data into those products or libraries, but at least both can create the desired format.
On the other hand, GraphML is an XML format, so you can easily use JDOM or dom4j to create a DOM, add the nodes based on your data, and serialize it to an XML file. This shouldn't be too complicated.
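As a concrete illustration of that approach (using the JDK's built-in DOM and Transformer classes instead of JDOM/dom4j, so there is no extra dependency), a minimal sketch that writes a two-node, one-edge GraphML file; adapt the node/edge loop to your own data:

```java
import java.io.File;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

// Builds a tiny GraphML document (two nodes, one edge) with the JDK's DOM
// classes and writes it to out.graphml, which prefuse can then load.
public class GraphMlWriter {
    public static void main(String[] args) throws Exception {
        String ns = "http://graphml.graphdrawing.org/xmlns";
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();

        Element graphml = doc.createElementNS(ns, "graphml");
        doc.appendChild(graphml);
        Element graph = doc.createElementNS(ns, "graph");
        graph.setAttribute("id", "G");
        graph.setAttribute("edgedefault", "undirected");
        graphml.appendChild(graph);

        for (String id : new String[] {"n0", "n1"}) {
            Element node = doc.createElementNS(ns, "node");
            node.setAttribute("id", id);
            graph.appendChild(node);
        }
        Element edge = doc.createElementNS(ns, "edge");
        edge.setAttribute("source", "n0");
        edge.setAttribute("target", "n1");
        graph.appendChild(edge);

        Transformer t = TransformerFactory.newInstance().newTransformer();
        t.setOutputProperty(OutputKeys.INDENT, "yes");
        t.transform(new DOMSource(doc), new StreamResult(new File("out.graphml")));
    }
}
```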
You could use the Network Workbench, which allows you to load data in a lot of different forms including edge lists. Edge lists are usually the easiest format to generate.
I'm not completely sure whether you can export from NWB to, say, GraphML, but NWB includes a number of visualizations, some of which are based on Prefuse.
If you want to do more with your data than just visualize it then NWB might help you.
Check PyGraphML, a basic Python library designed to parse and generate GraphML files. http://hadim.github.io/pygraphml/index.html