Need to handle special characters in URL - java

My input HTML is:
<p>
<span>first
</span>
<span>Google Cloud Connect for Microsoft Office</span>
</p>
I am using XSLT 1.0 to convert the HTML to XML. My output XML is:
<Relationship Id="rId12700703801" Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/hyperlink" Target="http://tools.google.com/dlpage/cloudconnect#utm_campaign=launch&utm_source=en-na-us-gdb-GCC-Appsperience_02242011&utm_medium=blog" TargetMode="External"/></Relationships>
It fails with the error "XML Parsing Error: not well-formed" at the = after launch&utm_source in the Target attribute.
I want to escape the special characters in the URL through XSLT so that the resulting XML is well-formed.
Please help me. Thanks in advance.

Are you generating the input HTML? If so, you can use URLEncoder.encode to properly encode the string so the transformer doesn't complain about the syntax.
If this is just a random HTML page and you have no control over it, then you probably need to use an HTML parser, such as TagSoup, to pre-correct it, as most HTML files are not well-formed.

XSLT expects XML as input, not HTML. You need to turn your HTML into XML if you want to transform it with XSLT.
I think it might be possible to do it with HTML Tidy.
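For reference, the parser stops at that position because a bare & is not allowed inside an XML attribute value; it has to be written as &amp;. Below is a minimal Java sketch, assuming the Relationship element is built with the JDK's DOM API and written out with an identity Transformer. The class and method names are made up for illustration and are not from the question. The point is that a real XML serializer escapes the ampersands on its own, so the URL can be passed in raw.

```java
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

// Hypothetical helper, not part of the original question.
class RelationshipWriter {

    // Builds one <Relationship> element and serializes it to a string.
    static String relationshipXml(String id, String target) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder().newDocument();
            Element rel = doc.createElement("Relationship");
            rel.setAttribute("Id", id);
            rel.setAttribute("Target", target); // raw URL, '&' is not escaped by hand
            rel.setAttribute("TargetMode", "External");
            doc.appendChild(rel);

            // An identity transform: it only serializes the DOM tree.
            Transformer identity = TransformerFactory.newInstance().newTransformer();
            identity.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            identity.transform(new DOMSource(doc), new StreamResult(out));
            return out.toString(); // the serializer writes '&' as "&amp;"
        } catch (Exception e) {
            throw new IllegalStateException("Could not serialize element", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(relationshipXml("rId1",
                "http://tools.google.com/dlpage/cloudconnect#utm_campaign=launch&utm_source=x"));
    }
}
```

The same applies inside a stylesheet: if the URL reaches the XSLT processor as part of well-formed input, the processor's XML output method escapes it when writing the attribute.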

Related

Get all legal Text from HTML File without Library

We have to extract all the text from an HTML file without using Jsoup or a similar library. What's the best (or only) way to do that? Our example looks like this:
<ul><li>Coffee</li><li>Tea</li><li>Milk</li></ul>
<h2>An Ordered HTML List</h2>
<ol><li>Coffee</li><li>Tea</li><li>Milk</li></ol>
We need to get all the text out of these HTML tags without using any libraries, and if a tag is not closed correctly, print an error message. Any help is appreciated.
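One possible way to approach this with only the standard library is a single pass over the string with a stack of open tag names. The sketch below is an illustration under simplifying assumptions, not a full HTML parser: tags contain no > inside attribute values, there are no comments or void elements such as <br>, and every opening tag must be closed in order. The class and method names are invented for the example.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Hypothetical helper: collects text between tags and checks tag nesting.
class TagTextExtractor {

    // Returns the non-empty text chunks; throws if a tag is not closed correctly.
    static List<String> extractText(String html) {
        List<String> texts = new ArrayList<>();
        Deque<String> open = new ArrayDeque<>(); // names of currently open tags
        int pos = 0;
        while (pos < html.length()) {
            int lt = html.indexOf('<', pos);
            // Everything up to the next '<' is text.
            String text = (lt < 0 ? html.substring(pos) : html.substring(pos, lt)).trim();
            if (!text.isEmpty()) {
                texts.add(text);
            }
            if (lt < 0) {
                break;
            }
            int gt = html.indexOf('>', lt);
            if (gt < 0) {
                throw new IllegalArgumentException("Tag opened at index " + lt + " has no '>'");
            }
            String tag = html.substring(lt + 1, gt).trim().toLowerCase();
            if (tag.startsWith("/")) {
                // Closing tag: it must match the most recently opened one.
                String name = tag.substring(1).trim();
                if (open.isEmpty() || !open.pop().equals(name)) {
                    throw new IllegalArgumentException("Unexpected closing tag </" + name + ">");
                }
            } else if (!tag.endsWith("/")) {
                // Opening tag: remember only the name, dropping any attributes.
                open.push(tag.split("\\s+")[0]);
            }
            pos = gt + 1;
        }
        if (!open.isEmpty()) {
            throw new IllegalArgumentException("Tag <" + open.peek() + "> is never closed");
        }
        return texts;
    }

    public static void main(String[] args) {
        System.out.println(extractText("<ul><li>Coffee</li><li>Tea</li><li>Milk</li></ul>"));
        try {
            extractText("<ul><li>Coffee</ul>");
        } catch (IllegalArgumentException e) {
            System.out.println("Error: " + e.getMessage());
        }
    }
}
```

Catching the IllegalArgumentException at the call site gives the place to print the required error message.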

Parse text response to object java

I have consumed a REST web service and I get a response in text format. I want to convert this response to a Java object in order to do some logic.
I have tried to parse the text as XML, but unfortunately that doesn't work.
Response from web service:
<html>
<title>rest response</title>
<body>
XY111NWA1
XY112NWA1
XY113NWA1
XY114NWA1
XY115NWA1
XY116NWA1
XY117NWA1
XY118NWA1
XY119NWA1
</body>
</html>
I expect the output to be in XML format so I can then parse it into a Java object, or
to get the data between the tags, so that the result is a Java list as follows:
XY111NWA1
XY112NWA1
XY113NWA1
XY114NWA1
XY115NWA1
XY116NWA1
XY117NWA1
XY118NWA1
XY119NWA1
That's not a typical REST response (it is neither JSON nor properly formatted XML), so you need to parse it as an HTML page (or HTML-like response).
WAY 1: I would recommend a tool such as jsoup:
Simple document parsing https://jsoup.org/cookbook/input/parse-document-from-string
More complete example which handles a list of links:
https://jsoup.org/cookbook/extracting-data/example-list-links
This way you can extract any node/value/attribute from your html response/file.
WAY 2: You can try the more general SAX or JAXB parsers for parsing XML (even if it's HTML), as in https://www.mkyong.com/java/jaxb-hello-world-example/ . But that only works if you have some contract for your response, and from what I see that's not the best approach here.
Relevant link: Which is the best library for XML parsing in java
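If adding jsoup is not an option, a dependency-free fallback for this particular response could look like the sketch below. It assumes the response always has exactly the shape shown in the question (one <body>...</body> pair with one value per line); the class and method names are made up for illustration. For anything less regular, a real HTML parser as recommended above is the safer choice.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for the response format shown in the question.
class BodyLines {

    // Returns the non-blank lines found between <body> and </body>.
    static List<String> bodyLines(String response) {
        String openTag = "<body>";
        int start = response.indexOf(openTag);
        int end = response.indexOf("</body>");
        if (start < 0 || end < start) {
            throw new IllegalArgumentException("Response has no <body>...</body> section");
        }
        String body = response.substring(start + openTag.length(), end);
        List<String> lines = new ArrayList<>();
        for (String line : body.split("\\R")) { // \R matches any line break
            String trimmed = line.trim();
            if (!trimmed.isEmpty()) {
                lines.add(trimmed);
            }
        }
        return lines;
    }

    public static void main(String[] args) {
        String response = "<html>\n<title>rest response</title>\n<body>\n"
                + "XY111NWA1\nXY112NWA1\nXY113NWA1\n</body>\n</html>";
        System.out.println(bodyLines(response));
    }
}
```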

Extract some data using Regex

I have been struggling for some time to extract JSON data from an HTML tag. To be more specific, it's a script tag, and using the JSOUP library I can get the data between the script tags. But inside there is some JSON data which I can't extract. Here is the tag:
<script type="text/javascript">jwplayer.key="WbtWzGvcRNi6Tk+gtKldIbx+nn6lXZFvKiaO2g==";jwplayer("tvplayer").setup({playlist:[{image: "http://img.canlitvlive.io/yayin/trt1_480.jpg?1509735585",title:"TRT 1 Canlı Yayın - CanliTVLive.io",file : "http://yayin.canlitvlive.io/trt1/live.m3u8?tkn=8JD95lXv9dOUXwtgOTBYfw&tms=1509749985"}],...</script>
I need the URL from the file property, which is inside the jwplayer setup call. I tried using a regular expression, for example something like this:
"playlist[\":\\s\\{]+file[\":\\s\\{]+\"([^\"]+)\""
But I don't have much experience with regex and can't figure out the right pattern. Can someone help with this? Thanks.
I'm guessing you just need to allow for some whitespace:
file\s*:\s*"(.*?)"
https://regex101.com/r/4HldaP/3
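In Java, the suggested pattern needs its backslashes and quotes escaped inside the string literal. A minimal sketch, assuming the script content has already been pulled out as a plain string (the class and method names are invented for the example):

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper around the pattern suggested in the answer.
class FileUrlExtractor {

    // file, optional whitespace, a colon, optional whitespace, then a quoted value.
    private static final Pattern FILE_URL = Pattern.compile("file\\s*:\\s*\"(.*?)\"");

    // Returns the first quoted value that follows a "file :" key, if there is one.
    static Optional<String> fileUrl(String script) {
        Matcher m = FILE_URL.matcher(script);
        return m.find() ? Optional.of(m.group(1)) : Optional.empty();
    }

    public static void main(String[] args) {
        String script = "jwplayer(\"tvplayer\").setup({playlist:[{"
                + "image: \"http://img.canlitvlive.io/yayin/trt1_480.jpg?1509735585\","
                + "file : \"http://yayin.canlitvlive.io/trt1/live.m3u8"
                + "?tkn=8JD95lXv9dOUXwtgOTBYfw&tms=1509749985\"}]});";
        System.out.println(fileUrl(script).orElse("no file entry found"));
    }
}
```

The lazy (.*?) stops at the first closing quote, so this only holds as long as the URL itself contains no escaped quotes.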

How to parse xml having html tags within xml tags

I've got an XML document which has HTML inside its XML tags, and I'm not able to parse it.
When I start parsing the XML, the str tag has HTML in it.
Can anyone help me extract the HTML with all of its tags?
It is a good idea to store XHTML within CDATA tags (<![CDATA[ and ]]>), so that it can be retrieved normally:
<str name="body">
<![CDATA[<font face="arial" size="2"><ul><li><p align="justify">india’s first</p></li></ul></font>]]>
</str>
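Reading such a CDATA section back with the JDK's DOM parser could look like the following sketch. It assumes the document contains at least one str element as in the snippet above; the class and method names are made up for illustration.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper: returns the embedded HTML of the first <str> element.
class CdataReader {

    static String firstStrContent(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            NodeList strs = doc.getElementsByTagName("str");
            if (strs.getLength() == 0) {
                throw new IllegalArgumentException("No <str> element found");
            }
            Node str = strs.item(0);
            // getTextContent() includes CDATA text; trim() drops the surrounding line breaks.
            return str.getTextContent().trim();
        } catch (IllegalArgumentException e) {
            throw e;
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<doc><str name=\"body\">\n"
                + "<![CDATA[<font face=\"arial\" size=\"2\"><ul><li>first</li></ul></font>]]>\n"
                + "</str></doc>";
        System.out.println(firstStrContent(xml));
    }
}
```

Because the markup sits inside CDATA, the XML parser treats it as plain character data and hands it back with the tags intact.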
The problem is not the HTML but improper HTML. If this HTML is under your control, ensure it complies with XHTML and the XML parser will treat it as normal XML. Otherwise, you may use tools like "HTML Tidy" to fix your HTML, or use HTML parsers. For example:
http://www.codeproject.com/KB/dotnet/apmilhtml.aspx

How to keep the HTML tags specified

I am using this pattern to remove all HTML tags (Java code):
String html="text <a href=#>link</a> <b>b</b> pic<img src=#>";
html=html.replaceAll("\\<.*?\\>", "");
System.out.println(html);
Now, I want to keep tag <a ...> (with </a>) and tag <img ...>
I want the result to be:
text <a href=#>link</a> b pic<img src=#>
How to do this?
I don't want to use an HTML parser for this,
because I need this regex pattern to filter a lot of HTML fragments,
so I want a solution with regex.
You could do this using a negative lookahead:
"<(?!(?:a|/a|img)\\b).*?>"
Rubular
However this has a number of problems and I would recommend instead that you use an HTML parser if you want a robust solution.
For more information see this question:
What HTML parsing libraries do you recommend in Java
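Applied to the string from the question, the negative lookahead could be used as in this sketch (the class and method names are invented for the example). The \b after the group is what stops tags such as <abbr> from being mistaken for <a>.

```java
import java.util.regex.Pattern;

// Hypothetical helper around the lookahead pattern from the answer.
class TagFilter {

    // Matches any tag except <a ...>, </a> and <img ...>.
    private static final Pattern UNWANTED_TAG =
            Pattern.compile("<(?!(?:a|/a|img)\\b).*?>");

    // Removes every tag that is not an anchor or an image tag.
    static String keepLinksAndImages(String html) {
        return UNWANTED_TAG.matcher(html).replaceAll("");
    }

    public static void main(String[] args) {
        String html = "text <a href=#>link</a> <b>b</b> pic<img src=#>";
        // Prints: text <a href=#>link</a> b pic<img src=#>
        System.out.println(keepLinksAndImages(html));
    }
}
```

The caveats from the answer still apply: a > inside an attribute value, a comment, or a tag split across lines will defeat this pattern.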
Check this out: http://sourceforge.net/projects/regexcreator/ . It is a very handy GUI regex editor.
Hey! Here is your answer:
You can’t parse [X]HTML with regex.
Use a proper HTML parser, for example htmlparser, Jericho or the validator.nu HTML parser. Then use the parser’s API, SAX or DOM to pull out the stuff you’re interested in.
If you insist on using regular expressions, you’re almost certain to make some small mistake that will lead to breakage, and possibly to cross-site scripting attacks, depending on what you’re doing with the markup.
See also this answer.
I recommend you use strip_tags (a PHP function)
string strip_tags ( string $str [, string $allowable_tags ] )
<?php
$text = '<p>Test paragraph.</p><!-- Comment --> Other text';
echo strip_tags($text);
echo "\n";
// Allow <p> and <a>
echo strip_tags($text, '<p><a>');
?>
OUTPUT
Test paragraph. Other text
<p>Test paragraph.</p> Other text
