Convert all data in a text file to XML format in Java - java

I have a text file and I need to convert all of its data to XML to make it more readable.
Text file
How can I convert it to XML?
Is there a Java library, or any other way, that I can use to do it?

Your question is rather vague (and you could probably find the answer yourself with just a little research), but I'll give you a hint.
Your sample appears to be an INI file (as traditionally used for configuration files on Windows & DOS). So, look for an "INI file parser." If you can't find one, you should be able to write a simple parser yourself using regular expressions. It's a simple file format, consisting of section headings like [SectionTitle] and data fields like Key=Value. That's all.
As for generating XML ... it shouldn't be hard, but "xml format" is not a useful description. Can you be more specific? E.g., what will the XML be used for?
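To make the hint concrete, here is a minimal sketch of a regex-based INI reader that emits XML through the JDK's DOM and transformer classes. The target layout (a `config` root, `section` elements with a `name` attribute, `entry` elements with a `key` attribute) is invented for illustration, since the question does not say what the XML should look like. Treating lines that start with `;` or `#` as comments is also an assumption about the input.

```java
import java.io.StringWriter;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

class IniToXml {
    // [SectionTitle]
    private static final Pattern SECTION = Pattern.compile("^\\[([^\\]]+)\\]$");
    // Key=Value
    private static final Pattern FIELD = Pattern.compile("^([^=]+)=(.*)$");

    static String convert(String iniText) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
        Element root = doc.createElement("config"); // hypothetical element name
        doc.appendChild(root);
        Element current = root; // fields before any section heading attach to the root

        for (String raw : iniText.split("\\R")) {
            String line = raw.trim();
            // Assumption: blank lines and ';' or '#' comment lines are skipped.
            if (line.isEmpty() || line.startsWith(";") || line.startsWith("#")) {
                continue;
            }
            Matcher section = SECTION.matcher(line);
            Matcher field = FIELD.matcher(line);
            if (section.matches()) {
                current = doc.createElement("section"); // hypothetical element name
                current.setAttribute("name", section.group(1).trim());
                root.appendChild(current);
            } else if (field.matches()) {
                Element entry = doc.createElement("entry"); // hypothetical element name
                entry.setAttribute("key", field.group(1).trim());
                entry.setTextContent(field.group(2).trim());
                current.appendChild(entry);
            }
            // Lines matching neither pattern are silently ignored in this sketch.
        }

        // Serialize with an identity transform so escaping is handled by the library.
        Transformer identity = TransformerFactory.newInstance().newTransformer();
        identity.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
        StringWriter out = new StringWriter();
        identity.transform(new DOMSource(doc), new StreamResult(out));
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(convert("[Database]\nhost=localhost\nport=5432\n"));
    }
}
```

Building a DOM rather than concatenating strings means characters such as `<` and `&` in values are escaped automatically.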

Try this: http://www.smooks.org/mediawiki/index.php?title=Main_Page. I've used it and it's great.
A more sophisticated solution would be to use Mule Data Mapper. On the server side, obviously.

Related

OneNote parsing - how to get to the Text Blobs in the document?

I am creating a parser for the .one file extension, which when finished I will add to the Apache Tika project.
Here is the APL 2.0 licensed Open Source project I'm creating: https://github.com/nddipiazza/onenote-parser-java
I used the specification document here: https://learn.microsoft.com/en-us/openspecs/office_file_formats/ms-one/73d22548-a613-4350-8c23-07d15576be50
As a starting point, I ported over the code from this open source C++ project: https://github.com/dropbox/onenote-parser
I have gotten a long way in the parsing of the documents, but I've hit a road block.
Here is the OneNote file I'm using to parse: https://drive.google.com/file/d/1uROTEnKeBKU08CG_K5zdDTGHa178LgBK/view?usp=sharing
I am unable to view the Section1TextArea1 and Section1TextArea2 in my parsed results. So I'm missing some sort of key data parsing element or something.
It is definitely in the OneNote file itself. I can see it in the Hex viewer:
Here is the JSON parse output: https://gist.github.com/nddipiazza/02d2252d357b3b02a6b9ab1050474267
I feel like the spec document is missing some very important information needed in order to parse this proprietary format.
What major element(s) am I missing resulting in me not getting the actual text content?
I figured it out. It was a matter of understanding that property values in OneNote can have either:
Binary contents
ASCII text contents
UTF-16LE contents.
There is a variety of them sprinkled throughout.
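As a rough illustration of handling those three kinds of property value, here is a sketch that guesses how to decode a raw byte array. This is not the code from the linked project, and the detection heuristics (zero high bytes for UTF-16LE, printable range for ASCII) are assumptions that only hold for mostly Latin text; the class and method names are made up.

```java
import java.nio.charset.StandardCharsets;

class PropertyText {
    /** Heuristic (an assumption): Latin text in UTF-16LE has a zero byte after every character byte. */
    static boolean looksLikeUtf16Le(byte[] data) {
        if (data.length < 2 || data.length % 2 != 0) {
            return false;
        }
        for (int i = 0; i < data.length; i += 2) {
            if (data[i] == 0 || data[i + 1] != 0) {
                return false;
            }
        }
        return true;
    }

    /** Heuristic (an assumption): every byte is printable ASCII or common whitespace. */
    static boolean looksLikeAscii(byte[] data) {
        if (data.length == 0) {
            return false;
        }
        for (byte b : data) {
            boolean printable = b >= 0x20 && b < 0x7F;
            boolean whitespace = b == '\t' || b == '\n' || b == '\r';
            if (!printable && !whitespace) {
                return false;
            }
        }
        return true;
    }

    /** Returns the decoded text, or null when the bytes look like binary contents. */
    static String decode(byte[] data) {
        if (looksLikeUtf16Le(data)) {
            return new String(data, StandardCharsets.UTF_16LE);
        }
        if (looksLikeAscii(data)) {
            return new String(data, StandardCharsets.US_ASCII);
        }
        return null;
    }

    public static void main(String[] args) {
        byte[] sample = "Section1TextArea1".getBytes(StandardCharsets.UTF_16LE);
        System.out.println(decode(sample));
    }
}
```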
Also, I just went ahead and parsed the entire root file tree. It results in lots of duplicate text, but I don't really care.
The project is updated with test cases and the fix here: https://github.com/nddipiazza/onenote-parser-java/tree/master/src/main/java/org/apache/tika/onenote
UPDATE:
Just created the Apache Tika PR: https://github.com/apache/tika/pull/300

How to extract data from a PDF file using Tika or any other library and store it in CSV/excel format

I want to extract the data inside a PDF file and present it as a CSV/Excel sheet. I learned that this can be done with the Tika library in Java. I found out how to extract the data as plain text, but I want to know how to store it in an Excel sheet.
If someone has done this kind of work before, please help me.
The first part (and the hard one) is to parse the original data and interpret it as a table. Apache Tika will give you an XHTML representation (or call your own handler with SAX events), but it usually won't construct a table for you. Not from a PDF file, I mean, since PDF isn't a tabular format by itself.
So you'll have to take the Tika-produced paragraphs, split them, and pass the resulting cells to some csv/xls/xlsx writer.
It might work if you have a regular table in your PDF (one line per table row, clean logical separation of cells, etc.). But it will look like parsing plain text, of course.
If that doesn't work, you'll have to take a PDF parser (like Apache PDFBox) and try to interpret its output.
The second part (output) is simple. If csv/ssv/tsv is suitable for you, use your preferred library to produce it (I can recommend Apache commons-csv).
But take into account that MS Excel requires a BOM in UTF-8 and UTF-16 CSV files to understand that the file isn't in a one-byte encoding (like CP-1252).
If you want Excel xls or xlsx format -- just use Apache POI to write it.
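The BOM point can be sketched with the plain JDK, assuming the table cells have already been extracted into lists of strings. The answer recommends a library such as commons-csv for the CSV itself; the hand-rolled quoting below is only there to keep the example free of dependencies, and the class and method names are made up.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.List;

class BomCsvWriter {
    /** Quotes a cell when it contains a comma, quote, or line break; inner quotes are doubled. */
    static String quote(String cell) {
        if (cell.contains(",") || cell.contains("\"") || cell.contains("\n") || cell.contains("\r")) {
            return "\"" + cell.replace("\"", "\"\"") + "\"";
        }
        return cell;
    }

    static void write(OutputStream out, List<List<String>> rows) throws IOException {
        Writer writer = new OutputStreamWriter(out, StandardCharsets.UTF_8);
        writer.write('\uFEFF'); // byte order mark; encoded as EF BB BF in UTF-8
        for (List<String> row : rows) {
            StringBuilder line = new StringBuilder();
            for (int i = 0; i < row.size(); i++) {
                if (i > 0) {
                    line.append(',');
                }
                line.append(quote(row.get(i)));
            }
            writer.write(line.append("\r\n").toString());
        }
        writer.flush();
    }

    public static void main(String[] args) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        write(buffer, Arrays.asList(
                Arrays.asList("name", "note"),
                Arrays.asList("caf\u00e9", "a,b")));
        System.out.println(buffer.size() + " bytes written");
    }
}
```

In real use the `OutputStream` would be a `FileOutputStream` for the target `.csv` file.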

Java Text Parser with customized open and closing tags?

I am looking for a simple Java text parser that parses a piece of text just like an XML SAX parser, but with customized opening and closing tags. Instead of "<" I have to deal with "[".
Which one can I use, or do I have to write one myself?
Example text:
[hlpLnk key=hlpId]some text and [lnk key=PrfS]products[/lnk] and some more text[/hlpLnk]
I found this parser on JavaWorld: LINKE
It's simple and does what I need; it only needs some refactoring, cleanup and testing.
And the Nano XML parser source looks useful: clear, small, easy to read. With some changes I could use this, I think. (Projects like Xerces contain a lot that I don't need or want.)
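For comparison, a hand-written version can be quite small. The sketch below is not the JavaWorld or NanoXML code; it is a regex-based tokenizer that fires SAX-like callbacks for the bracket syntax in the example text. The `Handler` interface and its method names are made up, attributes are assumed to be unquoted `key=value` pairs, and there is no check that tags are properly nested.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class BracketParser {
    /** Hypothetical callback interface, loosely modelled on SAX. */
    interface Handler {
        void startElement(String name, Map<String, String> attributes);
        void endElement(String name);
        void characters(String text);
    }

    // [name attr=value ...] or [/name]
    private static final Pattern TAG = Pattern.compile("\\[(/?)(\\w+)([^\\]]*)\\]");
    // Assumption: attribute values are unquoted and contain no whitespace.
    private static final Pattern ATTR = Pattern.compile("(\\w+)=(\\S+)");

    static void parse(String input, Handler handler) {
        Matcher tag = TAG.matcher(input);
        int pos = 0;
        while (tag.find()) {
            // Text between the previous tag and this one.
            if (tag.start() > pos) {
                handler.characters(input.substring(pos, tag.start()));
            }
            if (tag.group(1).isEmpty()) {
                Map<String, String> attrs = new LinkedHashMap<>();
                Matcher attr = ATTR.matcher(tag.group(3));
                while (attr.find()) {
                    attrs.put(attr.group(1), attr.group(2));
                }
                handler.startElement(tag.group(2), attrs);
            } else {
                handler.endElement(tag.group(2));
            }
            pos = tag.end();
        }
        // Trailing text after the last tag.
        if (pos < input.length()) {
            handler.characters(input.substring(pos));
        }
    }

    public static void main(String[] args) {
        parse("[hlpLnk key=hlpId]some text and [lnk key=PrfS]products[/lnk][/hlpLnk]", new Handler() {
            public void startElement(String name, Map<String, String> attributes) {
                System.out.println("start " + name + " " + attributes);
            }
            public void endElement(String name) {
                System.out.println("end " + name);
            }
            public void characters(String text) {
                System.out.println("text \"" + text + "\"");
            }
        });
    }
}
```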

how to write special characters(interpunct) in a xml file in java?

I have a problem writing an XML file with UTF-8 in Java.
Problem: I have a file whose name contains an interpunct (middot, ·). When I try to write the filename inside an XML tag from Java code, I get a junk number in the filename instead of ·.
OutputStreamWriter osw =new OutputStreamWriter(file_output_stream,"UTF8");
Above is the Java code I used to write the XML file. Can anybody help me understand and sort out the problem? Thanks in advance.
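Java strings are UTF-16 internally, but javac reads source files in the platform default encoding unless told otherwise.
If that encoding cannot represent your character, use a Unicode escape: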
String a = "\u00b7";
Or tell your compiler to use UTF-8 (javac -encoding UTF-8) and simply write the character in the code as-is.
That character is code point 183 (decimal), which is outside ASCII, so you can escape it as &#183;. Here is a demonstration: if I type "&#183;" into this answer, I get "·".
The browser is printing your character because this web page is HTML, which resolves numeric character references the same way XML does.
There are utility methods that can do this for you, such as apache commons-lang library's StringEscapeUtils.escapeXml() method, which will correctly and safely escape the entire input.
In general it is a good idea to use UTF-8 everywhere.
The editor has to know that the source is in UTF-8. You could use the free programmer's editor jEdit, which can deal with many encodings.
The javac compiler has to know that the Java source is in UTF-8. In Java you can use the solution of @OndraŽižka.
This makes for two settings in your IDE.
Don't try to create XML by hand. Use a library for the purpose. You are just scratching the surface of the heap of special cases that will break a hand-made solution.
One way, using core Java classes, is to create a DOM, then serialize it using a no-op XSL transform that writes to a StreamResult. (If your document is large, you can do something similar by driving a SAX event handler.)
There are many third party libraries that will help you do the same thing very easily.
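The core-Java route can be sketched as follows. The element name `file` and the sample filename are made up for illustration. The character is written as a `\u00b7` escape so the example does not depend on the source file's encoding.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

class FileNameXml {
    static byte[] toXml(String fileName) throws Exception {
        // Build the document in memory; no manual escaping is needed.
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
        Element file = doc.createElement("file"); // hypothetical element name
        file.setTextContent(fileName);
        doc.appendChild(file);

        // A Transformer created without a stylesheet performs an identity (no-op) transform.
        Transformer identity = TransformerFactory.newInstance().newTransformer();
        identity.setOutputProperty(OutputKeys.ENCODING, "UTF-8");

        // Writing to a byte stream lets the serializer apply the encoding itself.
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        identity.transform(new DOMSource(doc), new StreamResult(out));
        return out.toByteArray();
    }

    public static void main(String[] args) throws Exception {
        byte[] xml = toXml("report\u00b7final & notes.txt"); // hypothetical filename
        System.out.println(new String(xml, StandardCharsets.UTF_8));
    }
}
```

Handing the serializer an `OutputStream` rather than a `Writer` keeps the bytes on disk consistent with the encoding named in the XML declaration.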

Generate HTML from plain text using Java

I have to convert a .log file into a nice HTML file with tables. Right now I just want to get the HTML header down. My current method is to println every single line of the HTML file to the output file. For example:
p.println("<html>");
p.println("<script>");
etc. There has to be a simpler way, right?
How about using a JSP scriptlet and JSTL? You could create a custom object that holds all the important information and display it formatted using the Expression Language.
Printing raw HTML text as strings is probably the "easiest" (most straightforward) way to do what you're asking but it has its drawbacks (e.g. properly escaping the content text).
You could use the DOM (e.g. Document et al) interface provided by Java but that would hardly be "easy". Perhaps there are "DOM builder" type tools/libraries for Java that would simplify this task for you; I suggest looking at dom4j.
Look at this Java HTML Generator library (easy to use). It should make generating the actual HTML muuuch clearer. There are complications when creating HTML with Java Strings (what happens if you want to change something like a rowspan?) that can be avoided with this library. Especially when dealing with tables.
There are many templating engines available. Have a look at https://stackoverflow.com/questions/174204/suggestions-for-a-java-based-templating-engine
This way you can define a template in a txt file and have the java code fill in the variables.
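As a very rough sketch of the template idea without any engine, the example below fills a template string using plain `String.replace`. The `${title}` and `${rows}` placeholder syntax is made up, and the template is inlined here where a real program would load it from a file. A real templating engine does considerably more, and this naive substitution would misbehave if a value itself contained placeholder text.

```java
import java.util.Arrays;
import java.util.List;

class LogToHtml {
    // Hypothetical template; in practice this would be read from a text file.
    static final String TEMPLATE =
            "<html>\n<head><title>${title}</title></head>\n<body>\n<table>\n${rows}</table>\n</body>\n</html>\n";

    /** Escapes the characters that are special in HTML text and attribute values. */
    static String escape(String s) {
        return s.replace("&", "&amp;")
                .replace("<", "&lt;")
                .replace(">", "&gt;")
                .replace("\"", "&quot;");
    }

    /** Renders one table row per log line and substitutes the placeholders. */
    static String render(String title, List<String> logLines) {
        StringBuilder rows = new StringBuilder();
        for (String line : logLines) {
            rows.append("<tr><td>").append(escape(line)).append("</td></tr>\n");
        }
        return TEMPLATE.replace("${title}", escape(title))
                       .replace("${rows}", rows.toString());
    }

    public static void main(String[] args) {
        System.out.println(render("Build log", Arrays.asList("INFO started", "WARN a < b")));
    }
}
```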
