HL7 version 2.7 parser using java except Hapi

Is there any good parser, other than HAPI, that can parse HL7 v2.7 messages using Java? My goal is to convert the message into an XML file.

There is my own open source alternative called HL7X, which works with any HL7v2 version. It converts your HL7 String into an XML String.
Example:
MSH|^~\&|||||20121116122025||ADT^A01|5730224|P|2.5||||||UNICODE UTF-8
EVN|A01|20130120151827
PID||0|123||Name^Firstname^^^^||193106170000|w
PV1||E|
Gets transformed to
<?xml version="1.0" encoding="UTF-8"?>
<HL7X>
<MSH>
<MSH.1>^~\&</MSH.1>
<MSH.6>20121116122025</MSH.6>
<MSH.8>
<MSH.8.1>ADT</MSH.8.1>
<MSH.8.2>A01</MSH.8.2>
</MSH.8>
<MSH.9>5730224</MSH.9>
<MSH.10>P</MSH.10>
<MSH.11>2.5</MSH.11>
<MSH.17>UNICODE UTF-8</MSH.17>
</MSH>
<EVN>
<EVN.1>A01</EVN.1>
<EVN.2>20130120151827</EVN.2>
</EVN>
<PID>
<PID.2>0</PID.2>
<PID.3>123</PID.3>
<PID.5>
<PID.5.1>Name</PID.5.1>
<PID.5.2>Firstname</PID.5.2>
</PID.5>
<PID.7>193106170000</PID.7>
<PID.8>F</PID.8>
</PID>
<PV1>
<PV1.2>E</PV1.2>
</PV1>
</HL7X>
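The element numbers in the output above correspond to the position of each field after splitting a segment on |, and of each component after splitting a field on ^. The following is a minimal sketch of that idea only; it is not HL7X's actual implementation. The class and method names are hypothetical, and it assumes the default delimiters and ignores repetitions (~), subcomponents (&) and HL7 escape sequences.

```java
import java.util.regex.Pattern;

/** Hypothetical sketch: turns pipe-delimited HL7 v2 text into index-numbered XML. */
final class Hl7ToXmlSketch {

    /** Escapes the characters that are not allowed in XML text content. */
    static String escape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }

    /** Appends <tag>escaped text</tag> followed by a newline. */
    static void leaf(StringBuilder out, String tag, String text) {
        out.append('<').append(tag).append('>').append(escape(text))
           .append("</").append(tag).append(">\n");
    }

    static String toXml(String hl7) {
        StringBuilder out = new StringBuilder("<HL7X>\n");
        // Segments are separated by carriage returns; line feeds are tolerated here.
        for (String segment : hl7.split("[\\r\\n]+")) {
            if (segment.isEmpty()) continue;
            String[] fields = segment.split(Pattern.quote("|"), -1);
            String name = fields[0];
            out.append('<').append(name).append(">\n");
            for (int i = 1; i < fields.length; i++) {
                String field = fields[i];
                if (field.isEmpty()) continue; // empty fields are skipped
                String tag = name + "." + i;
                // MSH.1 holds the encoding characters themselves, so it is never split.
                boolean encodingChars = name.equals("MSH") && i == 1;
                if (encodingChars || !field.contains("^")) {
                    leaf(out, tag, field);
                    continue;
                }
                out.append('<').append(tag).append(">\n");
                String[] components = field.split(Pattern.quote("^"), -1);
                for (int j = 0; j < components.length; j++) {
                    if (!components[j].isEmpty()) {
                        leaf(out, tag + "." + (j + 1), components[j]);
                    }
                }
                out.append("</").append(tag).append(">\n");
            }
            out.append("</").append(name).append(">\n");
        }
        return out.append("</HL7X>\n").toString();
    }

    public static void main(String[] args) {
        System.out.println(toXml("EVN|A01|20130120151827\rPV1||E|"));
    }
}
```

Unlike the sample output above, this sketch escapes & in text content so that the result is well-formed XML.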

The open source Java software at http://www.dcm4che.org/confluence/display/ee2/Home can receive various HL7 messages through the MLLP protocol, convert them to XML, run them through an XSLT transformer, then load them into a database and serve them to DICOM clients as needed. To do this, the code base contains HL7->XML conversion code. Just find it, copy/paste it and use it.
I once knew exactly where this code is, because I was troubleshooting a message character set problem. At that time I found that the HL7 parser is rather simple-minded and understands only the one character set provided in the configuration. It does not read or use the character set provided in the messages (MSH-18, Table 0211, Grahame Grieve's encoding tips), nor does it support switching character sets during message decoding (see chapter "Escape sequences supporting multiple character sets" in the HL7 specification).
So I know the parser code is there. It is in Java. It produces XML inputs for the customer-specific XSLT transformation script. It should be quite easy to reuse.
You should be able to find it yourself. Otherwise your question would amount to a plain request to find a tool, which is off-topic (§4) :)

Related

How to read xml file and How do I append a node inside XML tag file in java

I am new to Java and am struggling with one program; I don't know how to write it.
I need some code that will read in the tasks and apply the appropriate task-findings markup to the val.xml file.
For example:
A task in val.xml:
<task name="12-19" additionalIntervalInformationNeeded="No">
Converter (Cleaning)
</task>
The matching task-findings markup in the findings.xml:
<tf taskid="olive-12-19">
<task-findings val="28">
<task-finding>
<title>Left Converter</title>
</task-finding>
</task-findings>
</tf>
So the goal is to use the taskid attribute value from the <tf> element to locate the correct task-findings markup.
Incorporate the <tf> element and all child elements into the task markup (just inside the ending </task> tag).
The result to the above examples would be as such:
<task name="12-19" additionalIntervalInformationNeeded="No">
Converter (Cleaning)
<tf taskid="olive-12-19">
<task-findings val="28">
<task-finding>
<title>Left Converter</title>
</task-finding>
</task-findings>
</tf>
</task>
Please suggest how to write this code.

From your use case, it appears that you can write a program that reads in the two XML files, edits them, and writes the result to an output file. XML files can be read and written just like TXT files in Java; only the file extension differs. Going this way means writing your own parser or using regex or similar methods.
Another way to go is to use JAXP, the Java API for XML Processing, which ships with the JDK. It will help you read, process and edit XML files in Java.
JAXP includes the DOM parser API and the SAX (Simple API for XML) API. DOM loads the whole document into memory and lets you read and alter it, which makes it a good fit for small XML files. For large files, a streaming API such as SAX or StAX (Streaming API for XML) is usually the better choice.
The tutorial blog here will help you get an idea of how the StAX library parses XML files in Java.
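For files as small as the ones in the question, the DOM route could look roughly like the sketch below. It is only an illustration under stated assumptions: the class name TaskFindingsMerger is made up, both documents are assumed to have a single root element around the snippets shown, and a taskid such as "olive-12-19" is assumed to belong to the task whose name is "12-19".

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

/** Hypothetical helper: copies each matching <tf> element into its <task>. */
final class TaskFindingsMerger {

    static String merge(String valXml, String findingsXml) throws Exception {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        Document val = builder.parse(new InputSource(new StringReader(valXml)));
        Document findings = builder.parse(new InputSource(new StringReader(findingsXml)));

        NodeList tasks = val.getElementsByTagName("task");
        NodeList tfs = findings.getElementsByTagName("tf");
        for (int i = 0; i < tasks.getLength(); i++) {
            Element task = (Element) tasks.item(i);
            String name = task.getAttribute("name");
            for (int j = 0; j < tfs.getLength(); j++) {
                Element tf = (Element) tfs.item(j);
                // Assumption: the taskid is "<prefix>-<task name>", e.g. "olive-12-19".
                if (tf.getAttribute("taskid").endsWith("-" + name)) {
                    // importNode deep-copies the subtree into the val.xml document;
                    // appendChild places it just before the closing </task> tag.
                    task.appendChild(val.importNode(tf, true));
                }
            }
        }

        // Serialize the modified document back to a string.
        Transformer transformer = TransformerFactory.newInstance().newTransformer();
        transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
        StringWriter out = new StringWriter();
        transformer.transform(new DOMSource(val), new StreamResult(out));
        return out.toString();
    }
}
```

To work on files instead of strings, DocumentBuilder.parse(File) and a StreamResult built from a File can be used in the same places.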

CharConversionException while transforming xml file

I have a Java program which processes XML files. When transforming an XML file into another XML file based on a certain schema (xsd/xsl), it throws the following error.
The error is thrown only for one XML file, which has an XML tag like this:
<abc>xxx yyyy “ggggg vvvv” uuuu</abc>
After removing or retyping the two quotes, the error goes away.
Can anybody help me resolve this issue?
java.io.CharConversionException: Character larger than 4 bytes are not supported: byte 0x93 implies a length of more than 4 bytes
at org.apache.xmlbeans.impl.piccolo.xml.UTF8XMLDecoder.decode(UTF8XMLDecoder.java:162)
<?xml version= “1.0’ encoding =“UTF-8” standalone =“yes “?><xyz xml s=“http://pqr.yy”><Header><abc> aaa “cccc” aaaaa vvv</abc></Header></xyz>.

As others have reported in comments, it has failed because the typographical quotation marks are encoded in Windows-1252 encoding, not in UTF-8, so the parser hasn't managed to decode them.
The encoding declared in the XML declaration must match the actual encoding used for the characters.
To find out how this error arose, and to prevent it happening again, we would need to know where this (wannabe) XML file came from, and how it was created.
My guess would be that someone used a "smart" editor; Microsoft editors in particular are notorious for changing what you type to what Microsoft think you wanted to type. If you're editing XML by hand it's best to use an XML-aware editor.
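If the file cannot be regenerated at the source, one possible workaround is to transcode it before parsing. The sketch below assumes the whole input really is Windows-1252 (where byte 0x93 is the left curly double quote); the class name is hypothetical.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

/** Hypothetical helper: re-encodes Windows-1252 bytes as UTF-8. */
final class Cp1252ToUtf8 {

    static byte[] toUtf8(byte[] cp1252Bytes) {
        // Decode with the encoding the bytes were actually written in...
        String text = new String(cp1252Bytes, Charset.forName("windows-1252"));
        // ...then encode with the encoding the XML declaration promises.
        return text.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // 0x93 and 0x94 are the curly double quotes in Windows-1252.
        byte[] raw = {(byte) 0x93, 'g', 'g', (byte) 0x94};
        System.out.println(new String(toUtf8(raw), StandardCharsets.UTF_8));
    }
}
```

This only helps when the input is consistently Windows-1252; bytes that are already valid UTF-8 multi-byte sequences would be mangled by it.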

Is there any way to process my rest of xml file despite of any fatal error like SAXParserException encountered [duplicate]

Currently, I'm working on a feature that involves parsing XML that we receive from another product. I decided to run some tests against some actual customer data, and it looks like the other product is allowing input from users that should be considered invalid. Anyways, I still have to try and figure out a way to parse it. We're using javax.xml.parsers.DocumentBuilder and I'm getting an error on input that looks like the following.
<xml>
...
<description>Example:Description:<THIS-IS-PART-OF-DESCRIPTION></description>
...
</xml>
As you can tell, the description has what appears to be an invalid tag inside of it (<THIS-IS-PART-OF-DESCRIPTION>). Now, this description tag is known to be a leaf tag and shouldn't have any nested tags inside of it. Regardless, this is still an issue and yields an exception on DocumentBuilder.parse(...)
I know this is invalid XML, but it's predictably invalid. Any ideas on a way to parse such input?

That "XML" is worse than invalid – it's not well-formed; see Well Formed vs Valid XML.
An informal assessment of the predictability of the transgressions does not help. That textual data is not XML. No conformant XML tools or libraries can help you process it.
Options, most desirable first:
1. Have the provider fix the problem on their end. Demand well-formed XML. (Technically the phrase well-formed XML is redundant but may be useful for emphasis.)
2. Use a tolerant markup parser to clean up the problem ahead of parsing as XML:
  Standalone: xmlstarlet has robust recovering and repair capabilities (credit: RomanPerekhrest)
  xmlstarlet fo -o -R -H -D bad.xml 2>/dev/null
  Standalone and C/C++: HTML Tidy works with XML too. Taggle is a port of TagSoup to C++.
  Python: Beautiful Soup is Python-based. See notes in the Differences between parsers section. See also answers to this question for more suggestions for dealing with not-well-formed markup in Python, including especially lxml's recover=True option. See also this answer for how to use codecs.EncodedFile() to clean up illegal characters.
  Java: TagSoup and JSoup focus on HTML. FilterInputStream can be used for preprocessing cleanup.
  .NET:
    XmlReaderSettings.CheckCharacters can be disabled to get past illegal XML character problems.
    @jdweng notes that XmlReaderSettings.ConformanceLevel can be set to ConformanceLevel.Fragment so that XmlReader can read XML Well-Formed Parsed Entities lacking a root element.
    @jdweng also reports that XmlReader.ReadToFollowing() can sometimes be used to work around XML syntactical issues, but note the rule-breaking warning in #3 below.
    Microsoft.Language.Xml.XMLParser is said to be “error-tolerant”.
  Go: Set Decoder.Strict to false as shown in this example by @chuckx.
  PHP: See DOMDocument::$recover and libxml_use_internal_errors(true). See nice example here.
  Ruby: Nokogiri supports “Gentle Well-Formedness”.
  R: See htmlTreeParse() for fault-tolerant markup parsing in R.
  Perl: See XML::Liberal, a "super liberal XML parser that parses broken XML."
3. Process the data as text manually using a text editor or programmatically using character/string functions. Doing this programmatically can range from tricky to impossible as what appears to be predictable often is not -- rule breaking is rarely bound by rules.
  For invalid character errors, use regex to remove/replace invalid characters:
  PHP: preg_replace('/[^\x{0009}\x{000a}\x{000d}\x{0020}-\x{D7FF}\x{E000}-\x{FFFD}]+/u', ' ', $s);
  Ruby: string.tr("^\u{0009}\u{000a}\u{000d}\u{0020}-\u{D7FF}\u{E000}-\u{FFFD}", ' ')
  JavaScript: inputStr.replace(/[^\x09\x0A\x0D\x20-\xFF\x85\xA0-\uD7FF\uE000-\uFDCF\uFDE0-\uFFFD]/gm, '')
  For ampersands, use regex to replace matches with &amp;: (credit: blhsin, demo)
  &(?!(?:#\d+|#x[0-9a-f]+|\w+);)
  Note that the above regular expressions won't take comments or CDATA sections into account.
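Since the question is about Java, here is how the ampersand regex above might be applied there. This is a sketch with a hypothetical class name; like the regex itself, it does not take comments or CDATA sections into account.

```java
import java.util.regex.Pattern;

/** Hypothetical helper: escapes ampersands that do not start a reference. */
final class AmpersandFixer {

    // Matches "&" unless it begins a numeric, hex, or named reference ending in ";".
    private static final Pattern BARE_AMPERSAND =
            Pattern.compile("&(?!(?:#\\d+|#x[0-9a-f]+|\\w+);)", Pattern.CASE_INSENSITIVE);

    static String escapeBareAmpersands(String text) {
        return BARE_AMPERSAND.matcher(text).replaceAll("&amp;");
    }

    public static void main(String[] args) {
        System.out.println(escapeBareAmpersands("<a>Tom & Jerry &amp; &#38;</a>"));
    }
}
```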

A standard XML parser will NEVER accept invalid XML, by design.
Your only option is to pre-process the input to remove the "predictably invalid" content, or wrap it in CDATA, prior to parsing it.
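As a rough illustration of the CDATA approach for the input in the question, the sketch below wraps the body of every description element in a CDATA section before handing the text to DocumentBuilder. It assumes that description is a leaf element written without attributes and that its body contains no entity references that should still be resolved; the class name is hypothetical.

```java
import java.io.StringReader;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

/** Hypothetical pre-processor: wraps <description> bodies in CDATA. */
final class DescriptionCdataWrapper {

    private static final Pattern DESCRIPTION =
            Pattern.compile("<description>(.*?)</description>", Pattern.DOTALL);

    static String wrapDescriptions(String raw) {
        Matcher m = DESCRIPTION.matcher(raw);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            // "]]>" cannot appear inside CDATA, so split it across two sections.
            String body = m.group(1).replace("]]>", "]]]]><![CDATA[>");
            String wrapped = "<description><![CDATA[" + body + "]]></description>";
            // quoteReplacement keeps "$" and "\" in the body from being interpreted.
            m.appendReplacement(out, Matcher.quoteReplacement(wrapped));
        }
        m.appendTail(out);
        return out.toString();
    }

    static Document parse(String raw) throws Exception {
        return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(wrapDescriptions(raw))));
    }
}
```

After parsing, getTextContent() on the description element returns the original text, including the stray <THIS-IS-PART-OF-DESCRIPTION>.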

The accepted answer is good advice, and contains very useful links.
I'd like to add that this, and many other cases of not-well-formed and/or DTD-invalid XML, can be repaired using SGML, the ISO-standardized superset of HTML and XML. In your case, what works is to declare the bogus THIS-IS-PART-OF-DESCRIPTION element as an SGML empty element and then use e.g. the osx program (part of the OpenSP/OpenJade SGML package) to convert it to XML. For example, if you supply the following to osx
<!DOCTYPE xml [
<!ELEMENT xml - - ANY>
<!ELEMENT description - - ANY>
<!ELEMENT THIS-IS-PART-OF-DESCRIPTION - - EMPTY>
]>
<xml>
<description>blah blah
<THIS-IS-PART-OF-DESCRIPTION>
</description>
</xml>
it will output well-formed XML for further processing with the XML tools of your choice.
Note, however, that your example snippet has another problem in that element names starting with the letters xml or XML or Xml etc. are reserved in XML and should be avoided, even though many parsers will not reject them.

IMO these cases should be solved by using JSoup.
Below is not really an answer to this specific case, but I found it on the web (thanks to inuyasha82 on Coderwall). This bit of code inspired me on a similar problem with malformed XML, so I share it here.
The text below is taken from the original website.
To be well-formed, an XML document must declare a single root element.
So for example a valid xml is:
<root>
<element>...</element>
<element>...</element>
</root>
But if you have a document like:
<element>...</element>
<element>...</element>
<element>...</element>
<element>...</element>
This will be considered malformed XML, so many XML parsers just throw an Exception complaining that there is no root element.
This example shows how to solve that problem and successfully parse the malformed XML above.
Basically, what we will do is add a root element programmatically.
So first of all you have to open the resource that contains your "malformed" XML (i.e. a file):
File file = new File(pathtofile);
Then open a FileInputStream:
FileInputStream fis = new FileInputStream(file);
If we try to parse this stream with any XML library at this point, it will raise the malformed document Exception.
Now we create a list of InputStream objects with three elements:
A ByteArrayInputStream that contains the string: <root>
Our FileInputStream
A ByteArrayInputStream with the string: </root>
So the code is:
List<InputStream> streams =
Arrays.asList(
new ByteArrayInputStream("<root>".getBytes()),
fis,
new ByteArrayInputStream("</root>".getBytes()));
Now using a SequenceInputStream, we create a container for the List created above:
InputStream cntr =
new SequenceInputStream(Collections.enumeration(streams));
Now we can use any XML parser library on cntr, and it will be parsed without any problem. (Checked with the StAX library.)
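Put together, the steps above might look like the following self-contained sketch. It uses the JDK's DocumentBuilder rather than StAX, the class and method names are hypothetical, and it assumes the input has no XML declaration or byte order mark, since those would no longer be at the start of the combined stream.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.io.SequenceInputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

/** Hypothetical helper: parses root-less XML by surrounding it with <root>. */
final class RootWrapper {

    static Document parseFragments(InputStream fragments) throws Exception {
        List<InputStream> streams = Arrays.asList(
                new ByteArrayInputStream("<root>".getBytes(StandardCharsets.UTF_8)),
                fragments,
                new ByteArrayInputStream("</root>".getBytes(StandardCharsets.UTF_8)));
        // SequenceInputStream reads the three streams one after another
        // and closes each of them, including the caller's stream.
        try (InputStream in = new SequenceInputStream(Collections.enumeration(streams))) {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder().parse(in);
        }
    }
}
```

For a file, pass new FileInputStream(file) as the fragments argument.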

How to check encoding in java?

I am facing a problem about encoding.
For example, I have a message in XML, whose format encoding is "UTF-8".
<message>
<product_name>apple</product_name>
<price>1.3</price>
<product_name>orange</product_name>
<price>1.2</price>
.......
</message>
Now, this message supports multiple languages:
Traditional Chinese (big5),
Simplified Chinese (gb),
English (utf-8)
And it will only change the encoding in specific fields.
For example (Traditional Chinese),
<product_name>蘋果</product_name>
<price>1.3</price>
<product_name>橙</product_name>
<price>1.2</price>
.......
Only "蘋果" and "橙" are using big5, "<product_name>" and "</product_name>" are still using utf-8.
<price>1.3</price> and <price>1.2</price> are using utf-8.
How do I know which word is using different encoding?

It looks like whoever is providing the XML is providing incorrect XML. They should be using a consistent encoding.
http://sourceforge.net/projects/jchardet/files/ is a pretty good heuristic charset detector.
It's a port of the one used in Firefox to detect the encoding of pages that are missing a charset in content-type or a BOM.
You could use that to try and figure out the encoding for substrings in a malformed XML file if you can't get the provider to fix their output.
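If an external library is not an option, a cruder heuristic is to attempt a strict UTF-8 decode of each field's bytes and fall back to a second charset only when that fails. The sketch below shows the idea with the JDK alone; the class name is hypothetical, and the check is not reliable in general, because some Big5 byte sequences also happen to be valid UTF-8.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

/** Hypothetical helper: strict UTF-8 decoding with a fallback charset. */
final class FallbackDecoder {

    static String decode(byte[] bytes, Charset fallback) {
        try {
            // REPORT makes the decoder throw instead of silently substituting U+FFFD.
            return StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(bytes))
                    .toString();
        } catch (CharacterCodingException notUtf8) {
            // The bytes are not valid UTF-8, so assume the fallback encoding.
            return new String(bytes, fallback);
        }
    }
}
```

A call such as FallbackDecoder.decode(fieldBytes, Charset.forName("Big5")) has to work on the raw bytes of a field, before anything has turned them into a String with the wrong charset.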

You should use only one encoding in one XML file. All Big5 characters have counterparts in the UTF-8 encoding.

Because I cannot get the provider to fix the output, I have to handle it myself, and I cannot use an external library in this project.
The only way I can solve it is like this,
String str = new String(big5String.getBytes("UTF-8"));
before displaying the message.

How do you embed binary data in XML?

I have two applications written in Java that communicate with each other using XML messages over the network. I'm using a SAX parser at the receiving end to get the data back out of the messages. One of the requirements is to embed binary data in an XML message, but SAX doesn't like this. Does anyone know how to do this?
UPDATE: I got this working with the Base64 class from the apache commons codec library, in case anyone else is trying something similar.

You could encode the binary data using base64 and put it into a Base64 element; the below article is a pretty good one on the subject.
Handling Binary Data in XML Documents
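On Java 8 and later, the JDK's own java.util.Base64 can do the encoding without an extra library. A minimal sketch follows; the class name and the data element name are made up for illustration.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

/** Hypothetical helper: carries binary data in an XML element as Base64 text. */
final class Base64Element {

    /** Sender side: the Base64 alphabet contains no XML special characters. */
    static String wrap(byte[] data) {
        return "<data>" + Base64.getEncoder().encodeToString(data) + "</data>";
    }

    /** Receiver side: pass in the element text collected by the SAX handler. */
    static byte[] decodeText(String elementText) {
        // trim() drops whitespace that pretty-printing may add around the payload.
        return Base64.getDecoder().decode(elementText.trim());
    }

    public static void main(String[] args) {
        System.out.println(wrap("testdata".getBytes(StandardCharsets.UTF_8)));
    }
}
```

With SAX, the element text may arrive in several characters() calls, so it should be accumulated in a buffer and decoded in endElement().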

XML is so versatile...
<DATA>
<BINARY>
<BIT index="0">0</BIT>
<BIT index="1">0</BIT>
<BIT index="2">1</BIT>
...
<BIT index="n">1</BIT>
</BINARY>
</DATA>
XML is like violence - If it doesn't solve your problem, you're not using enough of it.
EDIT:
BTW: Base64 + CDATA is probably the best solution
(EDIT2:
Whoever upmods me, please also upmod the real answer. We don't want any poor soul to come here and actually implement my method because it was the highest ranked on SO, right?)

Base64 is indeed the right answer but CDATA is not, that's basically saying: "this could be anything", however it must not be just anything, it has to be Base64 encoded binary data. XML Schema defines Base 64 binary as a primitive datatype which you can use in your xsd.

I had this problem just last week. I had to serialize a PDF file and send it, inside an XML file, to a server.
If you're using .NET, you can convert a binary file directly to a base64 string and stick it inside an XML element.
string base64 = Convert.ToBase64String(File.ReadAllBytes(fileName));
Or, there is a method built right into the XmlWriter object. In my particular case, I had to include Microsoft's datatype namespace:
StringBuilder sb = new StringBuilder();
System.Xml.XmlWriter xw = XmlWriter.Create(sb);
xw.WriteStartElement("doc");
xw.WriteStartElement("serialized_binary");
xw.WriteAttributeString("types", "dt", "urn:schemas-microsoft-com:datatypes", "bin.base64");
byte[] b = File.ReadAllBytes(fileName);
xw.WriteBase64(b, 0, b.Length);
xw.WriteEndElement();
xw.WriteEndElement();
string abc = sb.ToString();
The string abc looks something like this:
<?xml version="1.0" encoding="utf-16"?>
<doc>
<serialized_binary types:dt="bin.base64" xmlns:types="urn:schemas-microsoft-com:datatypes">
JVBERi0xLjMKJaqrrK0KNCAwIG9iago8PCAvVHlwZSAvSW5mbw...(plus lots more)
</serialized_binary>
</doc>

I usually encode the binary data with MIME Base64 or URL encoding.

Try Base64 encoding/decoding your binary data. Also look into CDATA sections

Any binary-to-text encoding will do the trick. I use something like that
<data encoding="yEnc">
<![CDATA[ encoded binary data ]]>
</data>

Maybe encode them into a known set - something like base 64 is a popular choice.

Base64 overhead is 33%.
BaseXML for XML 1.0 has an overhead of only 20%. But it's not a standard and so far has only a C implementation. Check it out if you're concerned with data size. Note, however, that browsers tend to implement compression, which makes it less necessary.
I developed it after the discussion in this thread: Encoding binary data within XML : alternatives to base64.

While the other answers are mostly fine, you could try another, more space-efficient, encoding method like yEnc. (yEnc wikipedia link) With yEnc you also get checksum capability right "out of the box". See the links below. Of course, because XML does not have a native yEnc type, your XML schema should be updated to properly describe the encoded node.
Why: Due to their encoding strategies, Base64, uuencode and similar encodings increase the amount of data (overhead) you need to store and transfer by roughly 33-40% (vs. yEnc's 1-2%). Depending on what you're encoding, that overhead could be or become an issue.
yEnc - Wikipedia abstract:
https://en.wikipedia.org/wiki/YEnc
yEnc is a binary-to-text encoding scheme for transferring binary files in messages on Usenet or via e-mail. ... An additional advantage of yEnc over previous encoding methods, such as uuencode and Base64, is the inclusion of a CRC checksum to verify that the decoded file has been delivered intact.

You can also uuencode your original binary data. This format is a bit older but it does the same thing as Base64 encoding.

If you have control over the XML format, you should turn the problem inside out. Rather than attaching the binary data to the XML, you should think about how to enclose a document that has multiple parts, one of which contains XML.
The traditional solution to this is an archive (e.g. tar). But if you want to keep your enclosing document in a text-based format or if you don't have access to a file archiving library, there is also a standardized scheme that is used heavily in email and HTTP, which is multipart/* MIME with Content-Transfer-Encoding: binary.
For example if your servers communicate through HTTP and you want to send a multipart document, the primary being an XML document which refers to a binary data, the HTTP communication might look something like this:
POST / HTTP/1.1
Content-Type: multipart/related; boundary="qd43hdi34udh34id344"
... other headers elided ...
--qd43hdi34udh34id344
Content-Type: application/xml
<myxml>
<data href="cid:data.bin"/>
</myxml>
--qd43hdi34udh34id344
Content-Id: <data.bin>
Content-type: application/octet-stream
Content-Transfer-Encoding: binary
... binary data ...
--qd43hdi34udh34id344--
As in the above example, the XML refers to the binary data in the enclosing multipart by using the cid URI scheme, which points to the part with the matching Content-Id header. The overhead of this scheme would be just the MIME headers. A similar scheme can also be used for the HTTP response. Of course, in the HTTP protocol you also have the option of sending the parts of a multipart document as separate requests/responses.
If you want to avoid wrapping your data in a multipart, an alternative is to use a data URI:
<myxml>
<data href="data:application/something;charset=utf-8;base64,dGVzdGRhdGE="/>
</myxml>
But this has the base64 overhead.
