Convert .doc with images to .html using xdocreport

I am converting a Word document to HTML using the following code:
private static final String docName = "This is a test page.docx";
private static final String outputlFolderPath = "C://";
String htmlNamePath = "docHtml1.html";
String zipName="_tmp.zip";
static File docFile = new File(outputlFolderPath+docName);
File zipFile = new File(zipName);
public void ConvertWordToHtml() {
try {
InputStream doc = new FileInputStream(new File(outputlFolderPath+docName));
System.out.println("InputStream"+doc);
XWPFDocument document = new XWPFDocument(doc);
XHTMLOptions options = XHTMLOptions.create(); //.URIResolver(new FileURIResolver(new File("word/media")));;
String root = "target";
File imageFolder = new File( root + "/images/" + doc );
options.setExtractor( new FileImageExtractor( imageFolder ) );
options.URIResolver( new FileURIResolver( imageFolder ) );
OutputStream out = new FileOutputStream(new File(htmlPath()));
XHTMLConverter.getInstance().convert(document, out, options);
} catch (Exception ex) {
}
}
public static void main(String[] args) throws IOException, ParserConfigurationException, Exception {
Convertion cwoWord=new Convertion();
cwoWord.ConvertWordToHtml();
}
public String htmlPath(){
return outputlFolderPath+htmlNamePath;
}
public String zipPath(){
// d:/_tmp.zip
return outputlFolderPath+zipName;
}
The code above converts the document to HTML correctly. The problem appears when I convert a document that contains graphics such as a circle (shown in the screenshot): in that case the graphics do not appear in the HTML file.
How can I preserve the graphics when converting the document to HTML? Thanks in advance.

You can embed the images in the html by using the following code:
Base64ImageExtractor imageExtractor = new Base64ImageExtractor();
options.setExtractor(imageExtractor);
options.URIResolver(imageExtractor);
where Base64ImageExtractor looks like:
public class Base64ImageExtractor implements IImageExtractor, IURIResolver {
private byte[] picture;
public void extract(String imagePath, byte[] imageData) throws IOException {
this.picture = imageData;
}
private static final String EMBED_IMG_SRC_PREFIX = "data:;base64,";
public String resolve(String uri) {
StringBuilder sb = new StringBuilder(picture.length + EMBED_IMG_SRC_PREFIX.length())
.append(EMBED_IMG_SRC_PREFIX)
.append(Base64Utility.encode(picture));
return sb.toString();
}
}
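The resolve method above returns a data URI, so the image bytes travel inside the HTML instead of in a separate file. As a rough, standalone sketch of that string format, here is the same idea using only the JDK's java.util.Base64. The class name DataUriSketch and the mimeType parameter are assumptions made for illustration; the answer's code uses Base64Utility and leaves the media type empty.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

// Hypothetical helper, not part of xdocreport: shows the shape of a data URI.
class DataUriSketch {

    // Builds "data:<mimeType>;base64,<encoded bytes>".
    // Passing an empty mimeType yields the "data:;base64," prefix used in the answer.
    static String toDataUri(String mimeType, byte[] data) {
        return "data:" + mimeType + ";base64," + Base64.getEncoder().encodeToString(data);
    }

    public static void main(String[] args) {
        // Two ASCII bytes stand in for real image data.
        byte[] bytes = "hi".getBytes(StandardCharsets.US_ASCII);
        System.out.println(toDataUri("image/png", bytes));
    }
}
```

A browser can then use the returned string directly as the src attribute of an img element.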

Related

How to generate .dot file using Schemacrawler

Using SchemaCrawler, I've generated an HTML file:
public final class ExecutableExample {
public static void main(final String[] args) throws Exception {
// Set log level
new LoggingConfig(Level.OFF);
final LimitOptionsBuilder limitOptionsBuilder = LimitOptionsBuilder.builder()
.includeSchemas(new IncludeAll())
.includeTables(new IncludeAll());
final LoadOptionsBuilder loadOptionsBuilder =
LoadOptionsBuilder.builder()
// Set what details are required in the schema - this affects the
// time taken to crawl the schema
.withSchemaInfoLevel(SchemaInfoLevelBuilder.standard());
final SchemaCrawlerOptions options =
SchemaCrawlerOptionsBuilder.newSchemaCrawlerOptions()
.withLimitOptions(limitOptionsBuilder.toOptions())
.withLoadOptions(loadOptionsBuilder.toOptions());
final Path outputFile = getOutputFile(args);
final OutputOptions outputOptions =
OutputOptionsBuilder.newOutputOptions(TextOutputFormat.html, outputFile);
final String command = "schema";
try (Connection connection = getConnection()) {
final SchemaCrawlerExecutable executable = new SchemaCrawlerExecutable(command);
executable.setSchemaCrawlerOptions(options);
executable.setOutputOptions(outputOptions);
executable.setConnection(connection);
executable.execute();
}
System.out.println("Created output file, " + outputFile);
}
private static Connection getConnection() {
final String connectionUrl = "jdbc:postgresql://localhost:5433/table_accounts";
final DatabaseConnectionSource dataSource = new DatabaseConnectionSource(connectionUrl);
dataSource.setUserCredentials(new SingleUseUserCredentials("postgres", "new_password"));
return dataSource.get();
}
private static Path getOutputFile(final String[] args) {
final String outputfile;
if (args != null && args.length > 0 && !isBlank(args[0])) {
outputfile = args[0];
} else {
outputfile = "./schemacrawler_output.html";
}
final Path outputFile = Paths.get(outputfile).toAbsolutePath().normalize();
return outputFile;
}
But I want the output to be a .dot file that contains the diagram (nodes, graph, edges, etc.). How can I do that with my code, or is there another way to do it in Java?

Simply change the output format from TextOutputFormat.html to DiagramOutputFormat.scdot.
Sualeh Fatehi, SchemaCrawler
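For orientation, a .dot file is plain text in the Graphviz DOT language: a graph block that lists nodes and the edges between them. The sketch below writes a tiny graph by hand to show that structure. It is only an illustration of the file format, not what SchemaCrawler emits; the class name DotSketch and the table names are made up.

```java
// Hand-written illustration of the DOT format; SchemaCrawler's real output is richer.
class DotSketch {

    // Returns a directed graph with a single edge between two quoted node names.
    static String toDot(String graphName, String from, String to) {
        StringBuilder sb = new StringBuilder();
        sb.append("digraph ").append(graphName).append(" {\n");
        sb.append("  \"").append(from).append("\" -> \"").append(to).append("\";\n");
        sb.append("}\n");
        return sb.toString();
    }

    public static void main(String[] args) {
        // Hypothetical tables: orders has a foreign key pointing at customers.
        System.out.print(toDot("schema", "orders", "customers"));
    }
}
```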

Itext7 Hebrew reverse issue

I have a simple piece of code that writes a PDF. Sometimes this PDF will contain RTL languages like Hebrew or Arabic.
I was able to manipulate the text and mirror it using Bidi (from the IBM ICU library),
but the lines of text still appear in reverse order.
In English, the effect would be something like this. Instead of:
The quick
brown fox
jumps over
the lazy dog
It appears as:
the lazy dog
jumps over
brown fox
The quick
Complete code:
@Test
public void generatePdf() {
SimpleDateFormat formatter = new SimpleDateFormat("yyyy-MM-dd-hh.mm.ss");
String dest = "c:\\temp\\" + formatter.format(Calendar.getInstance().getTime()) + ".pdf";
String fontPath = "C:\\Windows\\Fonts\\ARIALUNI.TTF";
FontProgramFactory.registerFont(fontPath, "arialUnicode");
OutputStream pdfFile = null;
Document doc = null;
try {
ByteArrayOutputStream output = new ByteArrayOutputStream();
PdfFont PdfFont = PdfFontFactory.createRegisteredFont("arialUnicode", PdfEncodings.IDENTITY_H, true);
PdfDocument pdfDoc = new PdfDocument(new PdfWriter(output));
pdfDoc.setDefaultPageSize(PageSize.A4);
pdfDoc.addFont(PdfFont);
doc = new Document(pdfDoc);
doc.setBaseDirection(BaseDirection.RIGHT_TO_LEFT);
String txt = "בתשרי נתן הדקל פרי שחום נחמד בחשוון ירד יורה ועל גגי רקד בכסלו נרקיס הופיע בטבת ברד ובשבט חמה הפציעה ליום אחד. 1234 באדר עלה ניחוח מן הפרדסים בניסן הונפו בכוח כל החרמשים";
Bidi bidi = new Bidi();
bidi.setPara(txt, Bidi.RTL, null);
String mirrTxt = bidi.writeReordered(Bidi.DO_MIRRORING);
Paragraph paragraph1 = new Paragraph(mirrTxt)
.setFont(PdfFont)
.setFontSize(9)
.setTextAlignment(TextAlignment.CENTER)
.setHeight(200)
.setWidth(70);
paragraph1.setBorder(new SolidBorder(3));
doc.add(paragraph1);
Paragraph paragraph2 = new Paragraph(txt)
.setFont(PdfFont)
.setFontSize(9)
.setTextAlignment(TextAlignment.CENTER)
.setHeight(200)
.setWidth(70);
paragraph2.setBorder(new SolidBorder(3));
doc.add(paragraph2);
doc.close();
doc.flush();
pdfFile = new FileOutputStream(dest);
pdfFile.write(output.toByteArray());
ProcessBuilder b = new ProcessBuilder("cmd.exe","/C","explorer " + dest);
b.start();
} catch (Exception e) {
e.printStackTrace();
}finally {
try {pdfFile.close();} catch (IOException e) {e.printStackTrace();}
}
}

The only solution that I have found with iText7 and IBM ICU4J without any other third party libraries is to first render the lines and then mirror them one by one. This requires a helper class LineMirroring and it's not precisely the most elegant solution, but will produce the output that you expect.
Lines mirroring class:
public class LineMirroring {
private final PageSize pageSize;
private final String fontName;
private final int fontSize;
public LineMirroring(PageSize pageSize, String fontName, int fontSize) {
this.pageSize = pageSize;
this.fontName = fontName;
this.fontSize = fontSize;
}
public String mirrorParagraph(String input, int height, int width, Border border) {
final StringBuilder mirrored = new StringBuilder();
try (ByteArrayOutputStream output = new ByteArrayOutputStream()) {
PdfFont font = PdfFontFactory.createRegisteredFont(fontName, PdfEncodings.IDENTITY_H, true);
final PdfWriter writer = new PdfWriter(output);
final PdfDocument pdfDoc = new PdfDocument(writer);
pdfDoc.setDefaultPageSize(pageSize);
pdfDoc.addFont(font);
final Document doc = new Document(pdfDoc);
doc.setBaseDirection(BaseDirection.RIGHT_TO_LEFT);
final LineTrackingParagraph paragraph = new LineTrackingParagraph(input);
paragraph.setFont(font)
.setFontSize(fontSize)
.setTextAlignment(TextAlignment.RIGHT)
.setHeight(height)
.setWidth(width)
.setBorder(border);
LineTrackingParagraphRenderer renderer = new LineTrackingParagraphRenderer(paragraph);
doc.add(paragraph);
Bidi bidi;
for (LineRenderer lr : paragraph.getWrittenLines()) {
bidi = new Bidi(((TextRenderer) lr.getChildRenderers().get(0)).getText().toString(), Bidi.RTL);
mirrored.append(bidi.writeReordered(Bidi.DO_MIRRORING));
}
doc.close();
pdfDoc.close();
writer.close();
} catch (IOException ioe) {
ioe.printStackTrace();
}
return mirrored.toString();
}
private class LineTrackingParagraph extends Paragraph {
private List<LineRenderer> lines;
public LineTrackingParagraph(String text) {
super(text);
}
public void addWrittenLines(List<LineRenderer> lines) {
this.lines = lines;
}
public List<LineRenderer> getWrittenLines() {
return lines;
}
@Override
protected IRenderer makeNewRenderer() {
return new LineTrackingParagraphRenderer(this);
}
}
private class LineTrackingParagraphRenderer extends ParagraphRenderer {
public LineTrackingParagraphRenderer(LineTrackingParagraph modelElement) {
super(modelElement);
}
@Override
public void drawChildren(DrawContext drawContext) {
((LineTrackingParagraph)modelElement).addWrittenLines(lines);
super.drawChildren(drawContext);
}
@Override
public IRenderer getNextRenderer() {
return new LineTrackingParagraphRenderer((LineTrackingParagraph) modelElement);
}
}
}
Minimal JUnit Test:
public class Itext7HebrewTest {
@Test
public void generatePdf() {
final SimpleDateFormat formatter = new SimpleDateFormat("yyyy-MM-dd-hh.mm.ss");
final String dest = "F:\\Temp\\" + formatter.format(Calendar.getInstance().getTime()) + ".pdf";
final String fontPath = "C:\\Windows\\Fonts\\ARIALUNI.TTF";
final String fontName = "arialUnicode";
FontProgramFactory.registerFont(fontPath, "arialUnicode");
try (ByteArrayOutputStream output = new ByteArrayOutputStream()) {
PdfFont arial = PdfFontFactory.createRegisteredFont(fontName, PdfEncodings.IDENTITY_H, true);
PdfDocument pdfDoc = new PdfDocument(new PdfWriter(output));
pdfDoc.setDefaultPageSize(PageSize.A4);
pdfDoc.addFont(arial);
LineMirroring mirroring = new LineMirroring(pdfDoc.getDefaultPageSize(), fontName,9);
Document doc = new Document(pdfDoc);
doc.setBaseDirection(BaseDirection.RIGHT_TO_LEFT);
final String txt = "בתשרי נתן הדקל פרי שחום נחמד בחשוון ירד יורה ועל גגי רקד בכסלו נרקיס הופיע בטבת ברד ובשבט חמה הפציעה ליום אחד. 1234 באדר עלה ניחוח מן הפרדסים בניסן הונפו בכוח כל החרמשים";
final int height = 200;
final int width = 70;
final Border border = new SolidBorder(3);
Paragraph paragraph1 = new Paragraph(mirroring.mirrorParagraph(txt, height, width, border));
paragraph1.setFont(arial)
.setFontSize(9)
.setTextAlignment(TextAlignment.RIGHT)
.setHeight(height)
.setWidth(width)
.setBorder(border);
doc.add(paragraph1);
doc.close();
doc.flush();
try (FileOutputStream pdfFile = new FileOutputStream(dest)) {
pdfFile.write(output.toByteArray());
}
} catch (Exception e) {
e.printStackTrace();
}
}
}
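To see why mirroring line by line matters, here is a library-free sketch of the ordering problem. It assumes a naive word wrapper and uses plain string reversal as a stand-in for Bidi.writeReordered, so it only models the line order, not real bidirectional reordering. All names in it are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model: reversal stands in for visual reordering, wrap() stands in for layout.
class PerLineReorderSketch {

    // Greedy word wrap: starts a new line when the next word would exceed width.
    static List<String> wrap(String text, int width) {
        List<String> lines = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (String word : text.split(" ")) {
            if (current.length() > 0 && current.length() + 1 + word.length() > width) {
                lines.add(current.toString());
                current.setLength(0);
            }
            if (current.length() > 0) {
                current.append(' ');
            }
            current.append(word);
        }
        if (current.length() > 0) {
            lines.add(current.toString());
        }
        return lines;
    }

    static String reverse(String s) {
        return new StringBuilder(s).reverse().toString();
    }

    // Reorders the whole paragraph first, then wraps: the end of the text lands on line 1.
    static List<String> reorderThenWrap(String text, int width) {
        return wrap(reverse(text), width);
    }

    // Wraps the logical text first, then reorders each line: line order is preserved.
    static List<String> wrapThenReorder(String text, int width) {
        List<String> result = new ArrayList<>();
        for (String line : wrap(text, width)) {
            result.add(reverse(line));
        }
        return result;
    }

    public static void main(String[] args) {
        String text = "The quick brown fox jumps over the lazy dog";
        System.out.println(reorderThenWrap(text, 10));
        System.out.println(wrapThenReorder(text, 10));
    }
}
```

In the first list the opening line is built from the last words of the text, which matches the symptom in the question. In the second, each line is reordered on its own, so the first line still holds the first words.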

pdfbox2.0.4 convert pdf with Chinese to png

I have imported pdfbox-2.0.4.jar, fontbox-2.0.4.jar and commons-logging-1.1.1.jar into Eclipse Kepler. The program runs on Windows 10.
The console prints many warnings like this:
org.apache.pdfbox.pdmodel.font.PDCIDFontType2 <init>
WARNING: Using fallback font ArialUnicodeMS for CID-keyed TrueType font KaiTi_GB2312.
The generated image files are also missing part of the content. How can I fix this?
My code looks like this:
public class PdfboxTest {
private static final String filePath = "xxx";
private static final String outputFilePath = "xxx";
public static void change(File inputFile, File outputFolder) throws IOException {
String totalFileName = inputFile.getName();
String fileName = totalFileName.substring(0,totalFileName.lastIndexOf("."));
PDDocument doc = null;
try {
doc = PDDocument.load(inputFile);
PDFRenderer pdfRenderer = new PDFRenderer(doc);
int pageCounter = 0;
for(PDPage page : doc.getPages())
{
BufferedImage bim = pdfRenderer.renderImageWithDPI(pageCounter, 300, ImageType.RGB);
ImageIOUtil.writeImage(bim, outputFilePath + "\\" + fileName + (pageCounter++) +".png", 300);
}
doc.close();
} finally {
if (doc != null) {
doc.close();
}
}
}
public static void main(String[] args) {
File inputFile = new File(filePath);
File outputFolder = new File(outputFilePath);
if(!outputFolder.exists()){
outputFolder.mkdirs();
}
try {
change(inputFile, outputFolder);
} catch (IOException e) {
e.printStackTrace();
}
}
}

As seen in the comments - the best solution is to install the missing font KaiTi_GB2312. The message Using fallback font means that the PDF references the mentioned font and didn't embed it, but can't find it on your computer, so PDFBox tried a fallback solution, in this case the ArialUnicodeMS font. Sadly such fallback solutions are not always perfect, which is why some glyphs were missing in the rendered image.
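A quick way to see which font families the JVM can find on the machine is to list them through AWT. This is only a rough check and rests on an assumption: PDFBox performs its own font lookup, and the family name the system reports may not match the font name stored in the PDF, so a miss here is a hint rather than proof. The class and method names are hypothetical.

```java
import java.awt.GraphicsEnvironment;
import java.util.Arrays;

// Rough diagnostic, not PDFBox's own font lookup.
class FontCheckSketch {

    // Case-insensitive search for a family name in a list of installed families.
    static boolean containsFamily(String[] families, String name) {
        return Arrays.stream(families).anyMatch(f -> f.equalsIgnoreCase(name));
    }

    public static void main(String[] args) {
        String[] families = GraphicsEnvironment
                .getLocalGraphicsEnvironment()
                .getAvailableFontFamilyNames();
        // The font name is taken from the warning in the question.
        System.out.println("KaiTi_GB2312 visible to the JVM: "
                + containsFamily(families, "KaiTi_GB2312"));
    }
}
```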

Getting java.lang.ClassNotFoundException: org.apache.pdfbox.io.RandomAccessRead console error after pdfbox request

I'm working on a servlet for a web project, and this is my code.
I use version 2.0.0 of the PDFBox library, and the same code works in a plain Java application.
pdfManager.java:
public class pdfManager {
private PDFParser parser;
private PDFTextStripper pdfStripper;
private PDDocument pdDoc ;
private COSDocument cosDoc ;
private String Text ;
private String filePath;
private File file;
public pdfManager() {
}
public String ToText() throws IOException
{
this.pdfStripper = null;
this.pdDoc = null;
this.cosDoc = null;
file = new File(filePath);
parser = new PDFParser(new RandomAccessFile(file,"r")); // update for PDFBox V 2.0
parser.parse();
cosDoc = parser.getDocument();
pdfStripper = new PDFTextStripper();
pdDoc = new PDDocument(cosDoc);
pdDoc.getNumberOfPages();
pdfStripper.setStartPage(1);
pdfStripper.setEndPage(10);
// reading text from page 1 to 10
// if you want to get text from full pdf file use this code
// pdfStripper.setEndPage(pdDoc.getNumberOfPages());
Text = pdfStripper.getText(pdDoc);
return Text;
}
public void setFilePath(String filePath) {
this.filePath = filePath;
}
}
The servlet file:
PrintWriter out = response.getWriter() ;
out.println("\ndata we gottoo : ") ;
pdfManager pdfManager = new pdfManager();
pdfManager.setFilePath("/Users/rami/Desktop/pdf2.pdf");
System.out.println(pdfManager.ToText());
This is called in the doGet method.

Either the library you need is not on the classpath, or something else goes wrong when the classloader tries to load the class. If the code runs on a server, make sure the library is on the web application's classpath (for a servlet application, typically the WEB-INF/lib folder). You can copy it there by hand, or your application's packaging has to deliver it. Since it is not clear how your app is deployed or delivered, there can be many reasons.
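One way to narrow this down is to ask the classloader directly whether it can find the class named in the exception. The sketch below is a minimal diagnostic under that assumption; the class name ClasspathCheckSketch is hypothetical, and inside a servlet you might prefer Thread.currentThread().getContextClassLoader() over the loader used here.

```java
// Minimal diagnostic: reports whether a class can be found by this class's loader.
class ClasspathCheckSketch {

    // Returns true if the named class can be located, without running its static initializers.
    static boolean isLoadable(String className) {
        try {
            Class.forName(className, false, ClasspathCheckSketch.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // Class name taken from the exception message in the question.
        System.out.println("RandomAccessRead loadable: "
                + isLoadable("org.apache.pdfbox.io.RandomAccessRead"));
    }
}
```

Running the same check from the servlet and from the standalone application shows whether the two environments really see the same jars.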

How to Create a nutch plugin that returns raw html to the parser

I am trying to create a plugin for Nutch. I am using Nutch 1.7 and Solr. I want the plugin to return the raw HTML data. I followed several tutorials, mainly the standard Nutch wiki and the following one: http://sujitpal.blogspot.nl/2009/07/nutch-custom-plugin-to-parse-and-add.html
I created two files, getDivinfohtml.java and getDivinfo.java.
getDivinfohtml.java needs to read the content and then return the complete source code, or at least a part of it:
package org.apache.nutch.indexer;
public class getDivInfohtml implements HtmlParseFilter
{
private static final Log LOG = LogFactory.getLog(getDivInfohtml.class);
private Configuration conf;
public static final String TAG_KEY = "source";
// Logger logger = Logger.getLogger("mylog");
// FileHandler fh;
//FileSystem fs = FileSystem.get(conf);
//Path file = new Path(segment, Content.DIR_NAME + "/part-00000/data");
//SequenceFile.Reader reader = new SequenceFile.Reader(fs, file, conf);
//Text key = new Text();
// Content content = new Content();
// fh = new FileHandler("/root/JulienKulkerNutch/mylogfile.log");
// logger.addHandler(fh);
// SimpleFormatter formatter = new SimpleFormatter();
//fh.setFormatter(formatter);
public ParseResult filter(Content content, ParseResult parseResult, HTMLMetaTags metaTags, DocumentFragment doc)
{
try
{
LOG.info("Parsing Url:" + content.getUrl());
LOG.info("Julien: "+content.toString().substring(content.toString().indexOf("<!DOCTYPE html")));
Parse parse = parseResult.get(content.getUrl());
Metadata metadata = parse.getData().getParseMeta();
String fullContent = metadata.get("fullcontent");
Document document = Jsoup.parse(fullContent);
Element contentwrapper = document.select("div#jobBodyContent").first();
String source = contentwrapper.text();
metadata.add("SOURCE", source);
return parseResult;
}
catch(Exception e)
{
LOG.info(e);
}
return parseResult;
}
public Configuration getConf()
{
return conf;
}
public void setConf(Configuration conf)
{
this.conf = conf;
}
}
Right now it reads the complete content and then extracts the text in jobBodyContent.
Then there is the indexing filter, which needs to put the data into the fields.
getDivinfo (parser):
public NutchDocument filter(NutchDocument doc, Parse parse, Text url, CrawlDatum datum, Inlinks inlinks)
{
// LOG.info("Julien is sukkel");
try
{
fh = new FileHandler("/root/JulienKulkerNutch/mylogfile2.log");
SimpleFormatter formatter = new SimpleFormatter();
fh.setFormatter(formatter);
logger.info("Julien is sukkel");
Metadata metadata = parse.getData().getParseMeta();
logger.info("julien is gek:");
String fullContent = metadata.get("SOURCE");
logger.info("Output:" + metadata);
logger.info(fullContent);
String fullSource = parse.getData().getParseMeta().getValues(getDivInfohtml.TAG_KEY);
logger.info(fullSource);
doc.add("divcontent", fullContent);
}
catch(Exception e)
{
//LOG.info(e);
}
return doc;
}
The error is in getDivinfo, on this line: String fullSource = parse.getData().getParseMeta().getValues(getDivInfohtml.TAG_KEY);
[javac] /root/JulienKulkerNutch/apache-nutch-1.8/src/plugin/myDivSelector/src/java/org/apache/nutch/indexer/getDivInfo.java:58: error: cannot find symbol
[javac] String fullSource = parse.getData().getParseMeta().getValues(getDivInfohtml.TAG_KEY);

You may need to implement HTMLParser. In your getFields implementation:
private static final Collection<WebPage.Field> FIELDS = new HashSet<WebPage.Field>();
static {
FIELDS.add(WebPage.Field.CONTENT);
FIELDS.add(WebPage.Field.OUTLINKS);
}
public Collection<Field> getFields() {
return FIELDS;
}
