JTidy java API toConvert HTML to XHTML

JTidy java API toConvert HTML to XHTML - java

I am using JTidy to convert from HTML to XHTML but I found in my XHTML file this tag .
Can i prevent it ?
this is my code
//from html to xhtml
try
{
fis = new FileInputStream(htmlFileName);
}
catch (java.io.FileNotFoundException e)
{
System.out.println("File not found: " + htmlFileName);
}
Tidy tidy = new Tidy();
tidy.setShowWarnings(false);
tidy.setXmlTags(false);
tidy.setInputEncoding("UTF-8");
tidy.setOutputEncoding("UTF-8");
tidy.setXHTML(true);//
tidy.setMakeClean(true);
Document xmlDoc = tidy.parseDOM(fis, null);
try
{
tidy.pprint(xmlDoc,new FileOutputStream("c.xhtml"));
}
catch(Exception e)
{
}

I had only success, when the input is treated as XML as well. So either set xmltags to true
tidy.setXmlTags(true);
and live with the errors and warnings or do the conversion twice.
First conversion to sanitize the html (html to xhtml) and a second conversion from xhtml to xhtml with set xmltags, thus no errors and warnings occur.
String htmlFileName = "test.html";
try( InputStream in = Thread.currentThread().getContextClassLoader().getResourceAsStream(htmlFileName);
FileOutputStream fos = new FileOutputStream("tmp.xhtml");) {
Tidy tidy = new Tidy();
tidy.setShowWarnings(true);
tidy.setInputEncoding("UTF-8");
tidy.setOutputEncoding("UTF-8");
tidy.setXHTML(true);
tidy.setMakeClean(true);
Document xmlDoc = tidy.parseDOM(in, fos);
} catch (Exception e) {
e.printStackTrace();
}
try( InputStream in = new FileInputStream("tmp.xhtml");
FileOutputStream fos = new FileOutputStream("c.xhtml");) {
Tidy tidy = new Tidy();
tidy.setShowWarnings(true);
tidy.setXmlTags(true);
tidy.setInputEncoding("UTF-8");
tidy.setOutputEncoding("UTF-8");
tidy.setXHTML(true);
tidy.setMakeClean(true);
Document xmlDoc = tidy.parseDOM(in, null);
tidy.pprint(xmlDoc, fos);
} catch (Exception e) {
e.printStackTrace();
}
I used the latest jtidy version 938.

i created a function that parse the the xhtml code and remove the unwelcome tags
and to add a link to the css File "tableStyle.css"
public static String xhtmlparser(){
String Cleanline="";
try {
// the file url
FileInputStream fstream = new FileInputStream("c.xhtml");
// Use DataInputStream to read binary NOT text.
BufferedReader br = new BufferedReader(new InputStreamReader(fstream));
String strLine = null;
int linescounter=0;
while ((strLine = br.readLine()) != null) {// read every line in the file
String m=strLine.replaceAll(" ", "");
linescounter++;
if(linescounter==5)
m=m+"\n"+ "<link rel="+ "\"stylesheet\" "+"type="+ "\"text/css\" "+"href= " +"\"tableStyle.css\""+ "/>";
Cleanline+=m+"\n";
}
}
catch(IOException e){}
return Cleanline;
}
but as a performance issue is it good?
by the way it works will

You can use the following method to get xhtml from html
public static String getXHTMLFromHTML(String inputFile,
String outputFile) throws Exception {
File file = new File(inputFile);
FileOutputStream fos = null;
InputStream is = null;
try {
fos = new FileOutputStream(outputFile);
is = new FileInputStream(file);
Tidy tidy = new Tidy();
tidy.setXHTML(true);
tidy.parse(is, fos);
} catch (FileNotFoundException e) {
e.printStackTrace();
}finally{
if(fos != null){
try {
fos.close();
} catch (IOException e) {
fos = null;
}
fos = null;
}
if(is != null){
try {
is.close();
} catch (IOException e) {
is = null;
}
is = null;
}
}
return outputFile;
}

Related

Instead of rendering tables and other html tags in docx these are saved as plain text using docx4j-ImportXHTML

I want to render html code to docx. Instead of rendering html(i.e. tables in tabular format) it simply writes html code in it as plain text. I am using docx4j-ImportXHTML jar. I used the code from here and modified it to save in a file.
What am I doing wrong?
public static void xhtmlToDocx(String xhtml, String destinationPath, String fileName)
{
File dir = new File (destinationPath);
File actualFile = new File (dir, fileName);
WordprocessingMLPackage wordMLPackage = null;
try
{
wordMLPackage = WordprocessingMLPackage.createPackage();
}
catch (InvalidFormatException e)
{
e.printStackTrace();
}
XHTMLImporterImpl XHTMLImporter = new XHTMLImporterImpl(wordMLPackage);
//XHTMLImporter.setDivHandler(new DivToSdt());
//OutputStream os = null;
OutputStream fos = null;
try
{
fos = new FileOutputStream(actualFile);
wordMLPackage.getMainDocumentPart().getContent().addAll(
XHTMLImporter.convert( xhtml, null) );
System.out.println(XmlUtils.marshaltoString(wordMLPackage
.getMainDocumentPart().getJaxbElement(), true, true));
// Back to XHTML
HTMLSettings htmlSettings = Docx4J.createHTMLSettings();
htmlSettings.setWmlPackage(wordMLPackage);
// output to an OutputStream.
//os = new ByteArrayOutputStream();
// If you want XHTML output
Docx4jProperties.setProperty("docx4j.Convert.Out.HTML.OutputMethodXML",
true);
Docx4J.toHTML(htmlSettings, fos, Docx4J.FLAG_EXPORT_PREFER_XSL);
}
catch (Docx4JException | FileNotFoundException e)
{
// TODO Auto-generated catch block
e.printStackTrace();
}
finally{
try {
fos.close();
} catch (IOException e) {
e.printStackTrace();
}
}
}

I corrected my code as below:
Use ByteArrayStream instead of FileOutputStream i.e.
Instead of
fos = new FileOutputStream(actualFile);
wordMLPackage.getMainDocumentPart().getContent().addAll(
XHTMLImporter.convert( xhtml, null) );
Use:
fos = new ByteArrayOutputStream();
Add wordMLPackage.save(actualFile)
Full code:
public static void xhtmlToDocx1(String xhtml, String destinationPath, String fileName)
{
File dir = new File (destinationPath);
File actualFile = new File (dir, fileName);
WordprocessingMLPackage wordMLPackage = null;
try
{
wordMLPackage = WordprocessingMLPackage.createPackage();
}
catch (InvalidFormatException e)
{
e.printStackTrace();
}
XHTMLImporterImpl XHTMLImporter = new XHTMLImporterImpl(wordMLPackage);
OutputStream fos = null;
try
{
fos = new ByteArrayOutputStream();
System.out.println(XmlUtils.marshaltoString(wordMLPackage
.getMainDocumentPart().getJaxbElement(), true, true));
HTMLSettings htmlSettings = Docx4J.createHTMLSettings();
htmlSettings.setWmlPackage(wordMLPackage);
Docx4jProperties.setProperty("docx4j.Convert.Out.HTML.OutputMethodXML",
true);
Docx4J.toHTML(htmlSettings, fos, Docx4J.FLAG_EXPORT_PREFER_XSL);
wordMLPackage.save(actualFile);
}
catch (Docx4JException e)
{
// TODO Auto-generated catch block
e.printStackTrace();
}
finally{
try {
fos.close();
} catch (IOException e) {
e.printStackTrace();
}
}
}

JSF Primefaces p:fileDownload file name contains UTF-8 characters

I am working on Java 8, JSF 2, Primefaces 5.1.
Conversation to PDF or Docx works, but when I am displaying file name, it just skips UTF-8 encoded letters, in my case, Lithuanian letters like ą,č,ę,ė,į,š,ų,ū
What I have tried so farm is :
<h:form enctype="multipart/form-data;charset=UTF-8">
Charset.forName("UTF-8").encode(myString)
or
byte[] bytes = templateTitle.getBytes(Charset.forName("UTF-8"));
String title = new String(bytes, Charset.forName("UTF-8"));
or
UTF-8 text is garbled when form is posted as multipart/form-data
checked some tuttorials about encoding, still, no use,
also checked this, but I just do not understand this example...
Primefaces fileDownload non-english file names corrupt
my code:
Download file as docx
public void downloadTemplateAsDocx() throws Exception {
try {
InputStream content = null;
String objID = this.actData.getMainActs().get(0).getId();
ContentStream cmisStream = folderCatalogue.getDocumentContentStream(objID);
content = cmisStream.getStream();
WordprocessingMLPackage wordMLPackage = WordprocessingMLPackage.createPackage();
AlternativeFormatInputPart afiPart = new AlternativeFormatInputPart(new PartName("/hw.html"));
afiPart.setBinaryData(content);
afiPart.setContentType(new ContentType("text/html"));
Relationship altChunkRel = wordMLPackage.getMainDocumentPart().addTargetPart(afiPart);
CTAltChunk ac = Context.getWmlObjectFactory().createCTAltChunk();
ac.setId(altChunkRel.getId());
wordMLPackage.getMainDocumentPart().addObject(ac);
wordMLPackage.getContentTypeManager().addDefaultContentType("html", "text/html");
File fileTmp = File.createTempFile("tempDocFile", "docx");
wordMLPackage.save(fileTmp);
streamedContent = new DefaultStreamedContent(new FileInputStream(fileTmp), cmisStream.getMimeType(),
templateTitle + ".docx", "UTF-8");
} catch (FileNotFoundException e) {
e.printStackTrace();
} catch (InvalidFormatException eInv) {
eInv.printStackTrace();
} catch (IOException ioEx) {
ioEx.printStackTrace();
} catch (Docx4JException docxEx) {
docxEx.printStackTrace();
}
}
code for .Pdf file download.
public void downloadTemplateAsPdf() {
try {
InputStream content = null;
String objID = this.actData.getMainActs().get(0).getId();
ContentStream cmisStream = folderCatalogue.getDocumentContentStream(objID);
content = cmisStream.getStream();
File fileTmp = File.createTempFile("tempFile", "pdf");
OutputStream fileStream = new FileOutputStream(fileTmp);
Document document = new Document();
PdfWriter writer = PdfWriter.getInstance(document, fileStream);
document.open();
XMLWorkerHelper worker = XMLWorkerHelper.getInstance();
worker.parseXHtml(writer, document, content, Charset.forName("UTF-8"));
document.close();
fileStream.close();
streamedContent = new DefaultStreamedContent(new FileInputStream(fileTmp), cmisStream.getMimeType(),
templateTitle + ".pdf");
} catch (FileNotFoundException e) {
e.printStackTrace();
System.out.println("File was not found");
} catch (IOException ex) {
ex.printStackTrace();
} catch (Exception exeption) {
exeption.printStackTrace();
}
}
EDIT:
<p:fileDownload value="#{controller.streamedContent}" />
private StreamedContent streamedContent;

Solution,
String title = URLEncoder.encode(templateTitle, "UTF-8");
StringBuilder fileName = new StringBuilder(title);
if (title.contains("+")) {
for (int i = 0; i < title.length(); i++) {
if (title.charAt(i) == '+') {
fileName.setCharAt(i, ' ');
}
}
}
This Encoding works fine, just it replaces all spaces to + that's why I loop over it.

Is it possible to convert HTML into XHTML with Jsoup 1.8.1?

String body = "<br>";
Document document = Jsoup.parseBodyFragment(body);
document.outputSettings().escapeMode(EscapeMode.xhtml);
String str = document.body().html();
System.out.println(str);
expect: <br />
result: <br>
Can Jsoup convert value HTML into XHTML?

See Document.OutputSettings.Syntax.xml:
private String toXHTML( String html ) {
final Document document = Jsoup.parse(html);
document.outputSettings().syntax(Document.OutputSettings.Syntax.xml);
return document.html();
}

You should tell that syntax you want to leave the string in HTML or XML.
public String parserXHtml(String html) {
org.jsoup.nodes.Document document = Jsoup.parseBodyFragment(html);
document.outputSettings().syntax(org.jsoup.nodes.Document.OutputSettings.Syntax.xml); //This will ensure the validity
document.outputSettings().charset("UTF-8");
return document.toString();
}

You can use JTidy API to do this. Use jtidy-r938.jar
You can use the following method to get xhtml from html
public static String getXHTMLFromHTML(String inputFile,
String outputFile) throws Exception {
File file = new File(inputFile);
FileOutputStream fos = null;
InputStream is = null;
try {
fos = new FileOutputStream(outputFile);
is = new FileInputStream(file);
Tidy tidy = new Tidy();
tidy.setXHTML(true);
tidy.parse(is, fos);
} catch (FileNotFoundException e) {
e.printStackTrace();
}finally{
if(fos != null){
try {
fos.close();
} catch (IOException e) {
fos = null;
}
fos = null;
}
if(is != null){
try {
is.close();
} catch (IOException e) {
is = null;
}
is = null;
}
}
return outputFile;
}

Lucene - how to index a content of files in two different fields

How can I index first line from files in a field and other lines in a different field?
My code is:
FileInputStream fis;
try {
fis = new FileInputStream(file);
} catch (FileNotFoundException fnfe) {
return;
}
try {
Document doc = new Document();
doc.add(new TextField("contents", new BufferedReader(new InputStreamReader(fis, StandardCharsets.UTF_8))));
} finally {
fis.close();
}
Please help me!

I did it!
FileInputStream fis;
try {
fis = new FileInputStream(file);
} catch (FileNotFoundException fnfe) {
return;
}
try {
Document doc = new Document();
String line = null;
try (BufferedReader reader = new BufferedReader(new InputStreamReader(fis, StandardCharsets.UTF_8))) {
line = reader.readLine();
Field headerField = new TextField("header", line, Field.Store.YES);
headerField.setBoost(2.0F);
doc.add(headerField);
while ((line = reader.readLine()) != null ) {
doc.add(new TextField("contents", line, Field.Store.YES));
}
} catch (IOException e) {
System.err.println(e);
}
} finally {
fis.close();
}

How to open a txt file located on the web and read its contents to a string?

I have looked around on how to do this and I keep finding different solutions, none of which has worked fine for me and I don't understand why. Does FileReader only work for local files? I tried a combination of scripts found on the site and it still doesn't quite work, it just throws an exception and leaves me with ERROR for the variable content. Here's the code I've been using unsuccessfully:
public String downloadfile(String link){
String content = "";
try {
URL url = new URL(link);
URLConnection conexion = url.openConnection();
conexion.connect();
InputStream is = url.openStream();
BufferedReader br = new BufferedReader( new InputStreamReader(is));
StringBuilder sb = new StringBuilder();
String line;
while ((line = br.readLine()) != null) {
sb.append(line);
}
content = sb.toString();
br.close();
is.close();
} catch (Exception e) {
content = "ERROR";
Log.e("ERROR DOWNLOADING",
"File not Found" + e.getMessage());
}
return content;
}

Use this as a downloader(provide a path to save your file(along with the extension) and the exact link of the text file)
public static void downloader(String fileName, String url) throws IOException {
File file = new File(fileName);
url = url.replace(" ", "%20");
URL website = new URL(url);
if (file.exists()) {
file.delete();
}
if (!file.exists()) {
ReadableByteChannel rbc = Channels.newChannel(website.openStream());
FileOutputStream fos = new FileOutputStream(fileName);
fos.getChannel().transferFrom(rbc, 0, Long.MAX_VALUE);
fos.close();
}
}
Then call this function to read the text file
public static String[] read(String fileName) {
String result[] = null;
Vector v = new Vector(10, 2);
BufferedReader br = null;
try {
br = new BufferedReader(new FileReader(fileName));
String tmp = "";
while ((tmp = br.readLine()) != null) {
v.add(tmp);
}
Iterator i = v.iterator();
result = new String[v.toArray().length];
int count = 0;
while (i.hasNext()) {
result[count++] = i.next().toString();
}
} catch (IOException ioe) {
ioe.printStackTrace();
}
return (result);
}
And then finally the main method
public static void main(){
downloader("D:\\file.txt","http://www.abcd.com/textFile.txt");
String data[]=read("D:\\file.txt");
}

try this:
try {
// Create a URL for the desired page
URL url = new URL("mysite.com/thefile.txt");
// Read all the text returned by the server
BufferedReader in = new BufferedReader(new InputStreamReader(url.openStream()));
String str;
StringBuilder sb = new StringBuilder();
while ((str = in.readLine()) != null) {
// str is one line of text; readLine() strips the newline character(s)
sb.append(str );
}
in.close();
String serverTextAsString = sb.toString();
} catch (MalformedURLException e) {
} catch (IOException e) {
}

Develop Reference

Java is a programming language and computing platform first released by Sun Microsystems in 1995.

JTidy java API toConvert HTML to XHTML - java

Related

Instead of rendering tables and other html tags in docx these are saved as plain text using docx4j-ImportXHTML

JSF Primefaces p:fileDownload file name contains UTF-8 characters

Is it possible to convert HTML into XHTML with Jsoup 1.8.1?

Lucene - how to index a content of files in two different fields

How to open a txt file located on the web and read its contents to a string?

Categories

Resources