MS Word to XML/HTML using Apache Tika - java

I happen to know Tika, which is very useful for extracting text from Word documents:
curl www.vit.org/downloads/doc/tariff.doc \
| java -jar tika-app-1.3.jar --text
But is there a way to use it to convert the MS Word file into XML/HTML?

Yes, it involves changing a whopping 4 characters in your command!
If you run java -jar tika-app-1.3.jar --help you'll get something that starts with:
usage: java -jar tika-app.jar [option...] [file|port...]
Options:
-? or --help Print this usage message
-v or --verbose Print debug level messages
-V or --version Print the Apache Tika version number
-g or --gui Start the Apache Tika GUI
-s or --server Start the Apache Tika server
-f or --fork Use Fork Mode for out-of-process extraction
-x or --xml Output XHTML content (default)
-h or --html Output HTML content
-t or --text Output plain text content
-T or --text-main Output plain text content (main content only)
-m or --metadata Output only metadata
.....
From that, you'll see that if you change your --text option to --xml or --html, you'll get nicely formatted XHTML/HTML instead of just the plain text.

Although this has already been answered, since the OP tagged the question with the java tag, for completeness I'll add a reference showing how to do this easily in Java.
The TikaTest.java superclass from Tika's unit tests is the easiest reference for converting Word to HTML via its getXML method. It's a pity that they recognized the usefulness of such an API when writing their unit tests but chose not to expose it as a convenience method, forcing everyone to deal with content handlers and so on, which is unfortunate boilerplate for the common use case.
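A minimal sketch of that handler-based approach, assuming tika-core and tika-parsers (or the tika-app jar) are on the classpath; the input file name is just a placeholder:

import java.io.FileInputStream;
import java.io.InputStream;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.sax.ToXMLContentHandler;

public class WordToXhtml {
    public static void main(String[] args) throws Exception {
        // Collects the parsed content as XHTML instead of plain text
        ToXMLContentHandler handler = new ToXMLContentHandler();
        AutoDetectParser parser = new AutoDetectParser();
        Metadata metadata = new Metadata();
        try (InputStream stream = new FileInputStream("tariff.doc")) {
            parser.parse(stream, handler, metadata);
        }
        System.out.println(handler.toString());
    }
}

Swapping ToXMLContentHandler for BodyContentHandler would give you plain text again, which is essentially what the --text option of the command-line app does.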


how to convert java class encoding to utf-8 [duplicate]

What is the fastest, easiest tool or method to convert text files between character sets?
Specifically, I need to convert from UTF-8 to ISO-8859-15 and vice versa.
Everything goes: one-liners in your favorite scripting language, command-line tools or other utilities for OS, web sites, etc.
Best solutions so far:
On Linux/UNIX/OS X/cygwin:
GNU iconv suggested by Troels Arvin is best used as a filter. It seems to be universally available. Example:
$ iconv -f UTF-8 -t ISO-8859-15 in.txt > out.txt
As pointed out by Ben, there is an online converter using iconv.
recode (manual) suggested by Cheekysoft will convert one or several files in-place. Example:
$ recode UTF8..ISO-8859-15 in.txt
This one uses shorter aliases:
$ recode utf8..l9 in.txt
Recode also supports surfaces, which can be used to convert between different line ending types and encodings:
Convert newlines from LF (Unix) to CR-LF (DOS):
$ recode ../CR-LF in.txt
Base64 encode file:
$ recode ../Base64 in.txt
You can also combine them.
Convert a Base64 encoded UTF8 file with Unix line endings to Base64 encoded Latin 1 file with Dos line endings:
$ recode utf8/Base64..l1/CR-LF/Base64 file.txt
On Windows with Powershell (Jay Bazuzi):
PS C:\> gc -en utf8 in.txt | Out-File -en ascii out.txt
(No ISO-8859-15 support though; it says that supported charsets are unicode, utf7, utf8, utf32, ascii, bigendianunicode, default, and oem.)
Edit
Do you mean ISO-8859-1 support? Using the "String" encoding does this, e.g. for the reverse direction:
gc -en string in.txt | Out-File -en utf8 out.txt
Note: The possible enumeration values are "Unknown, String, Unicode, Byte, BigEndianUnicode, UTF8, UTF7, Ascii".
CsCvt - Kalytta's Character Set Converter is another great command-line conversion tool for Windows.
Stand-alone utility approach
iconv -f ISO-8859-1 -t UTF-8 in.txt > out.txt
-f ENCODING the encoding of the input
-t ENCODING the encoding of the output
You don't have to specify either of these arguments. They will default to your current locale, which is usually UTF-8.
Try VIM
If you have vim you can use this:
Not tested for every encoding.
The cool part about this is that you don't have to know the source encoding
vim +"set nobomb | set fenc=utf8 | x" filename.txt
Be aware that this command modifies the file in place.
Explanation part!
+ : Used by vim to run a command directly when opening a file. Usually used to open a file at a specific line: vim +14 file.txt
| : Separator of multiple commands (like ; in bash)
set nobomb : no UTF-8 BOM
set fenc=utf8 : Set the new encoding to UTF-8
x : Save and close the file
filename.txt : path to the file
" : quotes are here because of the pipes (otherwise Bash would interpret them as shell pipes)
Under Linux you can use the very powerful recode command to try and convert between the different charsets as well as any line ending issues. recode -l will show you all of the formats and encodings that the tool can convert between. It is likely to be a VERY long list.
Get-Content -Encoding UTF8 FILE-UTF8.TXT | Out-File -Encoding UTF7 FILE-UTF7.TXT
The shortest version, if you can assume that the input BOM is correct:
gc FILE.TXT | Out-File -en utf7 file-utf7.txt
iconv(1)
iconv -f FROM-ENCODING -t TO-ENCODING file.txt
Also there are iconv-based tools in many languages.
Try iconv Bash function
I've put this into .bashrc:
utf8()
{
    iconv -f ISO-8859-1 -t UTF-8 "$1" > "$1.tmp" && mv "$1.tmp" "$1"
}
...to be able to convert files like so:
utf8 MyClass.java
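If you'd rather do the same thing from Java (the question is, after all, tagged java), here's a minimal sketch along the same lines; it assumes the source encoding is ISO-8859-1 and overwrites the file in place:

import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class ToUtf8 {
    public static void main(String[] args) throws IOException {
        Path file = Paths.get(args[0]);
        // Decode the bytes as ISO-8859-1, then write them back out as UTF-8
        byte[] raw = Files.readAllBytes(file);
        String text = new String(raw, StandardCharsets.ISO_8859_1);
        Files.write(file, text.getBytes(StandardCharsets.UTF_8));
    }
}

Usage: java ToUtf8 MyClass.java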
Try Notepad++
On Windows I was able to use Notepad++ to do the conversion from ISO-8859-1 to UTF-8. Click "Encoding" and then "Convert to UTF-8".
Oneliner using find, with automatic character set detection
The character encoding of all matching text files gets detected automatically and all matching text files are converted to utf-8 encoding:
$ find . -type f -iname "*.txt" -exec sh -c 'iconv -f $(file -bi "$1" | sed -e "s/.*[ ]charset=//") -t utf-8 -o converted "$1" && mv converted "$1"' -- {} \;
To perform these steps, a subshell sh is used with -exec, running a one-liner with the -c flag and passing the filename as the positional argument "$1" with -- {}. In between, the UTF-8 output file is temporarily named converted.
Here, file -bi means:
-b, --brief
Do not prepend filenames to output lines (brief mode).
-i, --mime
Causes the file command to output mime type strings rather than the more traditional human readable ones. Thus it may say for example text/plain; charset=us-ascii rather than ASCII text. The sed command cuts this to only us-ascii as is required by iconv.
The find command is very useful for such file management automation.
Assuming you don't know the input encoding and still wish to automate most of the conversion, here is a one-liner concluded from summing up the previous answers:
iconv -f "$(chardetect input.text | awk '{print $2}')" -t utf-8 -o output.text input.text
DOS/Windows: use Code page
chcp 65001>NUL
type ascii.txt > unicode.txt
The chcp command changes the active code page. Code page 65001 is the Microsoft name for UTF-8. After setting the code page, the output produced by subsequent commands will use that code page.
PHP iconv()
iconv("UTF-8", "ISO-8859-15", $input);
Try EncodingChecker
EncodingChecker on github
File Encoding Checker is a GUI tool that allows you to validate the text encoding of one or more files. The tool can display the encoding for all selected files, or only the files that do not have the encodings you specify.
File Encoding Checker requires .NET 4 or above to run.
For encoding detection, File Encoding Checker uses the UtfUnknown Charset Detector library. UTF-16 text files without byte-order-mark (BOM) can be detected by heuristics.
To write a properties file (Java), I normally use this on Linux (Mint and Ubuntu distributions):
$ native2ascii filename.properties
For example:
$ cat test.properties
first=Execução número um
second=Execução número dois
$ native2ascii test.properties
first=Execu\u00e7\u00e3o n\u00famero um
second=Execu\u00e7\u00e3o n\u00famero dois
PS: I wrote "execution number one/two" in Portuguese to force special characters.
In my case, on the first run I received this message:
$ native2ascii teste.txt
The program 'native2ascii' can be found in the following packages:
* gcj-5-jdk
* openjdk-8-jdk-headless
* gcj-4.8-jdk
* gcj-4.9-jdk
Try: sudo apt install <selected package>
When I installed the first option (gcj-5-jdk), the problem was solved.
I hope this helps someone.
With ruby:
ruby -e "File.write('output.txt', File.read('input.txt').encode('UTF-8', 'binary', invalid: :replace, undef: :replace, replace: ''))"
Source: https://robots.thoughtbot.com/fight-back-utf-8-invalid-byte-sequences
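If you need the same "strip invalid byte sequences" behaviour from Java code, here's a rough equivalent sketch (the file names are placeholders); it decodes leniently and then drops the U+FFFD replacement characters:

import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

public class ScrubUtf8 {
    public static void main(String[] args) throws IOException {
        byte[] raw = Files.readAllBytes(Paths.get("input.txt"));
        // Replace malformed or unmappable byte sequences instead of throwing
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPLACE)
                .onUnmappableCharacter(CodingErrorAction.REPLACE);
        String text = decoder.decode(ByteBuffer.wrap(raw)).toString();
        // Drop the U+FFFD replacement characters to mimic the Ruby one-liner above
        Files.write(Paths.get("output.txt"),
                text.replace("\uFFFD", "").getBytes(StandardCharsets.UTF_8));
    }
}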
Simply change the encoding of the loaded file in the IntelliJ IDEA IDE, on the right of the status bar (bottom), where the current charset is indicated. It prompts you to Reload or Convert; use Convert. Make sure you back up the original file in advance.
In powershell:
function Recode($InCharset, $InFile, $OutCharset, $OutFile) {
# Read input file in the source encoding
$Encoding = [System.Text.Encoding]::GetEncoding($InCharset)
$Text = [System.IO.File]::ReadAllText($InFile, $Encoding)
# Write output file in the destination encoding
$Encoding = [System.Text.Encoding]::GetEncoding($OutCharset)
[System.IO.File]::WriteAllText($OutFile, $Text, $Encoding)
}
Recode Windows-1252 "$pwd\in.txt" utf8 "$pwd\out.txt"
For a list of supported encoding names:
https://learn.microsoft.com/en-us/dotnet/api/system.text.encoding
There is also a web tool to convert file encoding: https://webtool.cloud/change-file-encoding
It supports a wide range of encodings, including some rare ones, like IBM code page 37.
Use this Python script: https://github.com/goerz/convert_encoding.py
Works on any platform. Requires Python 2.7.
My favorite tool for this is jEdit (a Java-based text editor), which has two very convenient features:
One which enables the user to reload a text with a different encoding (and, as such, to visually check the result)
Another one which enables the user to explicitly choose the encoding (and end of line char) before saving
If macOS GUI applications are your bread and butter, SubEthaEdit is the text editor I usually go to for encoding-wrangling — its "conversion preview" allows you to see all invalid characters in the output encoding, and fix/remove them.
And it's open-source now, so yay for them 😉.
Visual Studio Code
Open your file in Visual Studio Code
Reopen with Encoding: In the bottom status bar, to the right, you should see your current file encoding (eg "UTF-8"). Click this and select "Reopen with Encoding".
Select the correct encoding of the file (eg: ISO 8859-2).
Confirm that your content is displaying as expected.
Save with Encoding: The bottom status bar should now display your new encoding format (eg: ISO 8859-2). Click this and choose "Save with Encoding" and select UTF-8 (or whatever new encoding you want).
NOTE: THIS WILL OVERWRITE YOUR ORIGINAL FILE. MAKE A BACKUP FIRST.
As described in "How do I correct the character encoding of a file?", Synalyze It! lets you easily convert between all encodings supported by the ICU library on OS X.
Additionally, you can display some bytes of a file translated to Unicode from all the encodings, to quickly see which one is right for your file.

How to execute a perl program inside Map Reduce in Hadoop?

I have a Perl program which takes an input file, processes it, and produces an output file as a result. Now I need to use this Perl program on Hadoop, so that it runs on the data chunks stored on the cluster nodes. The thing is, I shouldn't modify the Perl code. I don't know how to start this. Can someone please give me any advice or suggestions?
Can I write a Java program that calls the Perl program from the mapper class using ProcessBuilder and combines the results in the reducer class?
Is there any other way to achieve this ?
I believe you can do this with hadoop streaming.
As per Tom White, author of Hadoop: The Definitive Guide, 3rd edition, page 622, Appendix C:
He used Hadoop to execute a bash shell script as a mapper.
In your case you would use your Perl script instead of that shell script.
Use case: he has a lot of small files (one big tar file as input), and his shell script converts them into a few big files (one big tar file as output).
He used Hadoop to process them in parallel by supplying the shell script as the mapper, so the mapper works on the input files in parallel and produces the results.
Example Hadoop command (copy-pasted):
hadoop jar $HADOOP_INSTALL/contrib/streaming/hadoop-*-streaming.jar \
-D mapred.reduce.tasks=0 \
-D mapred.map.tasks.speculative.execution=false \
-D mapred.task.timeout=12000000 \
-input ncdc_files.txt \
-inputformat org.apache.hadoop.mapred.lib.NLineInputFormat \
-output output \
-mapper load_ncdc_map.sh \
-file load_ncdc_map.sh
Replace load_ncdc_map.sh with your xyz.perl in both places (the last 2 lines in the command).
Replace ncdc_files.txt with another text file which contains the list of your input files to be processed (5th line from the bottom).
Assumptions: you have a fully functional Hadoop cluster running and your Perl script is error-free.
Please try and let me know.
ProcessBuilder in a Java program is used to call non-Java applications or scripts. ProcessBuilder should work when called from the mapper class. You need to make sure that the Perl script, the Perl executable, and the Perl libraries are available to all mappers.
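A minimal sketch of that idea, assuming each input line is an argument (for example a file path) that a hypothetical process.pl accepts, and that the script has been shipped to every node (for example with the -files option):

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Hypothetical mapper: each input record is passed as an argument to the Perl script
public class PerlWrapperMapper extends Mapper<LongWritable, Text, Text, NullWritable> {

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Launch the Perl script on the value from this input record
        ProcessBuilder pb = new ProcessBuilder("perl", "process.pl", value.toString());
        pb.redirectErrorStream(true);
        Process p = pb.start();

        // Emit each line the script prints as a map output record
        try (BufferedReader out = new BufferedReader(new InputStreamReader(p.getInputStream()))) {
            String line;
            while ((line = out.readLine()) != null) {
                context.write(new Text(line), NullWritable.get());
            }
        }
        if (p.waitFor() != 0) {
            throw new IOException("perl exited with non-zero status");
        }
    }
}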
Bit late to the party...
I'm about to start using Hadoop::Streaming. This seems to be the consensus module to use.

Creating Weka classifier model without evaluation

I am trying to use Java to feed a training dataset to Weka and get the model as output.
I found this instruction in the Weka wiki:
You save a trained classifier with the -d option (dumping), e.g.:
java weka.classifiers.trees.J48 -t /some/where/train.arff -d /other/place/j48.model
The problem is that when I use the mentioned command, it first builds the model (which takes seconds) and then evaluates it using 10-fold cross-validation, which takes minutes and is not needed.
The question is: how can I use Weka to model the data without evaluating it?
java weka.classifiers.trees.J48 -no-cv -t /some/where/train.arff -d /other/place/j48.model
How I got there:
java weka.classifiers.trees.J48 --help
lists the available options, among others:
-no-cv Do not perform any cross validation.
So when I use your command and add the -no-cv flag, that seems to do what you want.
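And since the original goal was to drive this from Java: a minimal sketch of building and saving the model programmatically, where no evaluation happens at all (the paths simply mirror the ones in the question, and the class attribute is assumed to be the last one):

import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.SerializationHelper;
import weka.core.converters.ConverterUtils.DataSource;

public class BuildModel {
    public static void main(String[] args) throws Exception {
        // Load the training data and mark the last attribute as the class
        Instances train = DataSource.read("/some/where/train.arff");
        train.setClassIndex(train.numAttributes() - 1);

        // Build the classifier; no cross-validation happens here
        J48 tree = new J48();
        tree.buildClassifier(train);

        // Serialize the trained model, equivalent to the -d option
        SerializationHelper.write("/other/place/j48.model", tree);
    }
}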

Parsing javadoc with Python-Sphinx

I use a shared repository containing both Java and Python code. The code base is mainly Python, but some libraries are written in Java.
Is there a way to parse or preprocess the Java documentation in order to use it later in Python Sphinx, perhaps via a plugin?
javasphinx (Github) (Documentation)
It took me way too long to find all the important details to set this up, so here's a brief write-up for all my trouble.
Installation
# Recommend working in virtual environments with latest pip:
mkdir docs; cd docs
python3 -m venv env
source ./env/bin/activate
pip install --upgrade pip
# Recommend installing from source:
pip install git+https://github.com/bronto/javasphinx.git
The PyPI version seemed to have broken imports; these issues did not seem to exist in the latest checkout.
Setup & Configuration
Assuming you've got a working sphinx setup already:
Important: add the Java "domain" to Sphinx. This is embedded in the javasphinx package and does not follow the common .ext. extension-namespace format (this is the detail I missed for hours):
# docs/sources/conf.py
extensions = ['javasphinx']
Optional: If you want external javadoc linking:
# docs/sources/conf.py
javadoc_url_map = {
'<namespace_here>' : ('<base_url_here>', 'javadoc'),
}
Generating Documentation
The javasphinx package adds the shell tool javasphinx-apidoc. If your current environment is active you can call it as just javasphinx-apidoc, or use its full path ./env/bin/javasphinx-apidoc:
$ javasphinx-apidoc -o docs/source/ --title='<name_here>' ../path/to/java_dirtoscan
This tool takes arguments nearly identical to sphinx-apidoc:
$ javasphinx-apidoc --help
Usage: javasphinx-apidoc [options] -o <output_path> <input_path> [exclude_paths, ...]
Options:
-h, --help show this help message and exit
-o DESTDIR, --output-dir=DESTDIR
Directory to place all output
-f, --force Overwrite all files
-c CACHE_DIR, --cache-dir=CACHE_DIR
Directory to stored cachable output
-u, --update Overwrite new and changed files
-T, --no-toc Don't create a table of contents file
-t TOC_TITLE, --title=TOC_TITLE
Title to use on table of contents
--no-member-headers Don't generate headers for class members
-s SUFFIX, --suffix=SUFFIX
file suffix (default: rst)
-I INCLUDES, --include=INCLUDES
Additional input paths to scan
-p PARSER_LIB, --parser=PARSER_LIB
Beautiful Soup---html parser library option.
-v, --verbose verbose output
Include Generated Docs in Index
In the output directory of the javasphinx-apidoc command a packages.rst table-of-contents file will have been generated; you will likely want to include it in your index.rst's table of contents like so:
#docs/sources/index.rst
Contents:
.. toctree::
:maxdepth: 2
packages
Compile Documentation (html)
With either your python environment active or your path modified:
$ cd docs
$ make html
or
$ PATH=$PATH:./env/bin/ make html
The javadoc command allows you to write and use your own doclet classes to generate documentation in whatever form you choose. The output doesn't need to be directly human-readable ... so there's nothing stopping you from outputting in a Sphinx-compatible format.
However, I couldn't find any existing doclet that does this specific job.
References:
Oracle's Doclet Overview
UPDATE
The javasphinx extension may be a better alternative. It allows you to generate Sphinx documentation from javadoc comments embedded in Java source code.
Sphinx does not provide a built-in way to parse JavaDoc, and I do not know of any 3rd party extension for this task.
You'll likely have to write your own documenter for the Sphinx autodoc extension. There are different approaches you may follow:
Parse JavaDoc manually. I do not think that there is a JavaDoc parser for Python, though.
Use Doxygen to parse JavaDoc into XML, and parse that XML. The Sphinx extension breathe does this, though for C++.
Write a Doclet for Java to turn JavaDoc into whatever output format you can handle, and parse this output (a rough sketch follows below).
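For illustration only, a hypothetical doclet along those lines using the old com.sun.javadoc API (available up to JDK 12); it dumps class and method comments as javasphinx-style directives, which is just one possible target format:

import com.sun.javadoc.ClassDoc;
import com.sun.javadoc.MethodDoc;
import com.sun.javadoc.RootDoc;

// Hypothetical example: compile it, then run e.g.
//   javadoc -docletpath . -doclet RstDoclet -sourcepath src com.example > api.rst
public class RstDoclet {
    public static boolean start(RootDoc root) {
        for (ClassDoc cls : root.classes()) {
            System.out.println(".. java:type:: " + cls.qualifiedName());
            System.out.println();
            System.out.println("   " + cls.commentText());
            for (MethodDoc method : cls.methods()) {
                System.out.println(".. java:method:: " + method.name() + method.flatSignature());
                System.out.println();
                System.out.println("   " + method.commentText());
            }
        }
        return true;
    }
}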

Embed an Executable Binary in a shell script

First, I already googled but only found examples where a compressed file (say a .tar.gz) is embedded into a shell script.
Basically, I have a C program (hello.c) that prints a string, say Hello World!.
I compile it to get an executable binary:
gcc hello.c -o hello
Now I have a shell script testEmbed.sh
What I am asking is if it is possible to embed the binary (hello) inside the shell script so that when I run
./testEmbed.sh
it executes the binary to print Hello World!.
Clarification:
One alternative is that I compress the executable into an archive and then extract it when the script runs. What I am asking is if it is possible to run the program without that.
Up until now, I was trying the method linked here, but it does not work for me; I guess the author was using some other distribution on another architecture. :P
Also, if the workflow for a C program differs from a Java jar, I would like to know that too!
Yes, this can be done. It's actually quite similar in concept to your linked article. The trick is to use uuencode to encode the binary into text format, then tack it onto the end of your script.
Your script is then written in such a way that it runs uudecode on itself to create a binary file, changes the permissions, and then executes it.
uuencode and uudecode were originally created for shifting binary content around on the precursor to the internet, which didn't handle binary information that well. The conversion into text means that it can be shipped inside a shell script as well. If, for some reason, your distribution complains when you try to run uuencode, it probably means you have to install it. For example, on Debian Squeeze:
sudo aptitude install sharutils
will get the relevant executables for you. Here's the process I went through. First create and compile your C program hello.c:
pax> cat hello.c
#include <stdio.h>
int main (void) {
printf ("Hello\n");
return 0;
}
pax> gcc -o hello hello.c
Then create a shell script testEmbed.sh, which will decode itself:
pax> cat testEmbed.sh
#!/bin/bash
rm -f hello
uudecode $0
./hello
rm -f hello
exit
The first rm statement demonstrates that the hello executable is being created anew by this script, not left hanging around from your compilation. Since you need the payload in the file as well, attach the encoded executable to the end of it:
pax> uuencode hello hello >>testEmbed.sh
Afterwards, when you execute the script testEmbed.sh, it extracts the executable and runs it.
The reason this works is because uudecode looks for certain marker lines in its input (begin and end) which are put there by uuencode, so it only tries to decode the encoded program, not the entire script:
pax> cat testEmbed.sh
#!/bin/bash
rm -f hello
uudecode $0
./hello
rm -f hello
exit
begin 755 hello
M?T5,1#$!`0````````````(``P`!````$(,$"#0```#`!#```````#0`(``'
M`"#`'#`;``8````T````-(`$"#2`!`C#````X`````4````$`````P```!0!
: : :
M:&%N9&QE`%]?1%1/4E]%3D1?7P!?7VQI8F-?8W-U7VEN:70`7U]B<W-?<W1A
M<G0`7V5N9`!P=71S0$!'3$E"0U\R+C``7V5D871A`%]?:38X-BYG971?<&-?
4=&AU;FLN8G#`;6%I;#!?:6YI=```
`
end
There are other things you should probably worry about, such as the possibility that your program may require shared libraries that don't exist on the target system, but the process above is basically what you need.
The process for a JAR file is very similar, except that the way you run it is different. It's still a single file but you need to replace the line:
./hello
with something capable of running JAR files, such as:
java -jar hello.jar
I think makeself is what you're describing.
The portable way to do this is with the printf command and octal escapes:
printf '\001\002\003'
to print bytes 1, 2, and 3. Since you probably don't want to write that all by hand, the od -b command can be used to generate an octal dump of the file, then you can use a sed script to strip off the junk and put the right backslashes in place.
