How can I detect Farsi web pages with Tika? - java

I need sample code to help me detect Farsi-language web pages with the Apache Tika toolkit.
LanguageIdentifier identifier = new LanguageIdentifier("فارسی");
String language = identifier.getLanguage();
I have downloaded the Apache Tika jar files and added them to the classpath, but this code gives an error for Farsi while it works for English.
How can I add Farsi to the LanguageIdentifier package of Tika?

Tika doesn't ship with a language profile for Farsi yet. As of version 1.0, 27 languages are supported out of the box:
languages=be,ca,da,de,eo,et,el,en,es,fi,fr,gl,hu,is,it,lt,nl,no,pl,pt,ro,ru,sk,sl,sv,th,uk
In your example the input is misdetected as lt (Lithuanian) with a distance of 0.41, which is above the certainty threshold of 0.022. See the source code for more information on the inner workings of LanguageIdentifier.
The Farsi language (Persian, ISO 639-1 2-letter code fa) is not recognized by default.
If you want Tika to recognize another language, you have to create a language profile first.
For this the following steps are necessary:
Find a text corpus for your language. I found the Hamshahri Collection. This should be sufficient. Download the corpus or parts of it and create a plain text file out of the XML.
Create an ngram file for the language identifier. This can be done using TikaCLI:
java -jar tika-app-1.0.jar --create-profile=fa -eUTF-8 fa-corpus.txt
This will create a file called fa.ngp, which contains the n-grams.
Configure Tika so that it recognizes the new language. Either do this programmatically using LanguageIdentifier.initProfiles() or put a property file with the name tika.language.override.properties into the classpath. Make sure the ngram file is in the classpath as well.
If you now run Tika, it should correctly detect your language.
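For illustration, here is a rough sketch of the programmatic route. The sample text, file names and override-file keys below are assumptions that mirror the layout of the bundled tika.language.properties, not code from the question:
tika.language.override.properties (on the classpath, next to fa.ngp):
languages=fa
name.fa=Persian
Then, in Java:
import org.apache.tika.language.LanguageIdentifier;

public class FarsiDetection {
    public static void main(String[] args) {
        // Re-read the language profiles so the new fa profile is picked up
        LanguageIdentifier.initProfiles();

        LanguageIdentifier identifier = new LanguageIdentifier("این یک متن فارسی است");
        System.out.println(identifier.getLanguage());          // expected: fa
        System.out.println(identifier.isReasonablyCertain());  // should be true with a good profile
    }
}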
Update:
Detailed the steps necessary to create a language profile.

Related

Opensagres - Generate Word document that support Cambodian language

All the documents in our application are generated with Java 11 + opensagres/xdocreport-2.0.2 + the Freemarker template engine.
The documents are generated correctly in multiple languages like: Russian and Chinese.
We've observed that when the input is in the Cambodian language, the generated Word document contains empty placeholder boxes instead of Cambodian characters.
I've explained the issue in more detail here: https://github.com/opensagres/xdocreport/issues/575, but I haven't received any answer so far.
Did anyone manage to generate documents containing this language with opensagres?
Thanks in advance!
The answer was to use the Aspose framework (which, unlike opensagres, is not free).
The biggest advantage is that Aspose lets you force the framework to use sets of fonts from the application resources, along with other useful features (like smooth and simple PDF conversions).
The only trouble was that Aspose doesn't integrate with Freemarker templates. In our case that would have meant changing a lot of quite big, complex existing documents.
After some analysis, and based on Aspose's really kind support, we decided on a hybrid solution:
Documents are still generated in memory with opensagres and Freemarker.
After that the documents are loaded with Aspose and rendered based on fonts from the application resources (a rough sketch is shown below). The native font for Cambodian characters is the DaunPenh font, which was placed in the application resources.
The full topic can be found here: https://forum.aspose.com/t/support-cambodian-language/252057
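A rough sketch of that second step, assuming Aspose.Words for Java; the stream handling and the fonts-folder path are placeholders, not the exact production code:
import com.aspose.words.Document;
import com.aspose.words.FontSettings;
import com.aspose.words.SaveFormat;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;

public class CambodianRendering {
    public static byte[] renderWithResourceFonts(byte[] opensagresDocx) throws Exception {
        // Point Aspose at the fonts bundled with the application (e.g. DaunPenh)
        FontSettings.getDefaultInstance().setFontsFolder("src/main/resources/fonts", false);

        // Load the document that opensagres + Freemarker produced in memory
        Document doc = new Document(new ByteArrayInputStream(opensagresDocx));

        ByteArrayOutputStream out = new ByteArrayOutputStream();
        doc.save(out, SaveFormat.DOCX);   // or SaveFormat.PDF for the PDF conversion
        return out.toByteArray();
    }
}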

Java PC application - exported JAR does not behave as in development

I have a classic Java application for PC. The result of the build is a JAR file which runs on a Windows machine.
The application reads some XML files and creates an HTML document as the end result. The XML files contain language-specific characters that are not native to English.
While in development in the IDE (Apache NetBeans 13), with Build -> Run, the exported HTML file contains the language-specific characters.
When I run the JAR file from the Project -> dist directory, the HTML does not contain the language-specific characters.
For example, characters like č, ć, đ, š are exported as Ä�, while running from NetBeans they are exported correctly, not as that strange symbol.
The letters in question are from Serbian, Croatian and Bosnian.
When I export the project from NetBeans, I made sure to have this option enabled:
Project -> Project properties -> Build -> Packaging where the "Copy Dependent Libraries" option is selected.
I am puzzled at this point. If anybody has any idea why something works one way in the IDE and another way when exported, please let me know.
The likely problem is that your HTML file needs to identify its character encoding. Nowadays it is generally best to use UTF-8 as the encoding for most purposes.
Determine the file’s encoding
If you have access to the source code of your Java app, examine that to see what character encoding is being used when producing the HTML file. But I assume you have no such access.
Open the HTML file in a text-editor to examine its raw source code. See if it specifies a character encoding. If it does, and that character encoding indicator is incorrect, you will need to alter your HTML file.
If no character encoding is indicated within the HTML, you will need to experiment to discover the encoding. Open the HTML file in a web browser, then use the "view" or developer tools available in most browsers (Firefox, Safari, Edge, etc.) to explicitly switch between encodings.
If switching to a particular encoding causes the text to appear as expected, then you know the likely encoding.
Specify the file’s encoding
In the modern version of HTML, HTML5, UTF-8 is the default encoding assumed by the web browser. But if the web browser switches into Quirks Mode, the browser may assume another encoding. To help avoid Quirks Mode, an HTML5 document should start with <!DOCTYPE html>.
So, best to be explicit about the encoding. Once you determine the encoding being used by your Java app creating the HTML file, either alter that app (if you have source code) to write an indicator of the encoding, or else write another Java app to edit the produced HTML file to include the indicator. If you are not a Java developer, you could use any programming language or even a shell script to edit the produced HTML file.
To indicate the encoding of an HTML5 file, add a meta element.
For UTF-8:
<meta charset="UTF-8">
For Latin-1:
<meta charset="ISO-8859-1">
If your Java app was developed exclusively on Microsoft Windows, the developer may have knowingly or unwittingly used one of the Microsoft-defined character encodings. Older versions of Java defaulted to a character encoding specific to the host platform; be aware that in Java 18+ the default changed to UTF-8 across platforms.
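If you do have the source, a common culprit is writing the HTML with a writer that silently uses the platform default encoding (for example new FileWriter(...)). A minimal sketch of writing the file explicitly as UTF-8, meta tag included; the file name and content are placeholders:
import java.io.BufferedWriter;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

public class WriteHtmlUtf8 {
    public static void main(String[] args) throws Exception {
        String html = "<!DOCTYPE html><html><head><meta charset=\"UTF-8\"></head>"
                    + "<body>č ć đ š</body></html>";
        // Explicit charset: the output no longer depends on the JVM's default encoding
        try (BufferedWriter writer = Files.newBufferedWriter(Paths.get("report.html"), StandardCharsets.UTF_8)) {
            writer.write(html);
        }
    }
}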
For more info
You can read about these issues in many places. Like here and in Wikipedia.
If you are not savvy with character sets and character encoding, I highly recommend reading the surprisingly entertaining article, The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!), by Joel Spolsky.

How to parse Java code to generate class diagram?

I have to generate class diagrams from Java source files; the parser must be executable on the command line with the following format:
umlparser classpath output_file_name
umlparser is my .java file name
classpath is a folder name where all the .java source files will be
output file name is the name of the output image file your program will generate (.jpg, .png or .pdf format)
Tools for parsing Java Source and generating UML diagrams can be used.
I was looking at http://yuml.me/diagram/class/draw and found that it is a good way to generate the class diagram.
However, I can't figure out how I can get the code in the form of
[Customer|forname:string;surname:string|doPost();doGet()]<>-orders*>[Order]
[Order]++-0..*>[LineItem]
[Order]-[note:Aggregate root{bg:wheat}]
Any insight on how to generate this code?
Any other suggestions are also welcomed.
Maybe you need to take a look at doxygraph. Its maintainers define the tool as follows:
It relies on Doxygen to parse your source code and create an intermediate XML representation of the information it collects, so it supports all the same programming languages that Doxygen supports: C, C++, C#, Objective C, Java, Python, PHP, Tcl, D, IDL, VHDL, and Fortran.
Reverse-engineering functionality is also present in IntelliJ IDEA and Visual Paradigm editors, but it is a paid feature as far as I can remember.
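As a rough illustration of the target string format only (not part of either tool), here is a reflection-based sketch that emits the [Name|fields|methods()] notation for an already-compiled class. Note that the assignment parses .java sources, so a real solution would use a source parser rather than reflection:
import java.lang.reflect.Field;
import java.lang.reflect.Method;
import java.util.StringJoiner;

public class YumlSketch {
    static String toYuml(Class<?> cls) {
        StringJoiner fields = new StringJoiner(";");
        for (Field f : cls.getDeclaredFields()) {
            fields.add(f.getName() + ":" + f.getType().getSimpleName());
        }
        StringJoiner methods = new StringJoiner(";");
        for (Method m : cls.getDeclaredMethods()) {
            methods.add(m.getName() + "()");
        }
        return "[" + cls.getSimpleName() + "|" + fields + "|" + methods + "]";
    }

    // Tiny example class, only here so the sketch runs standalone
    static class Order {
        long id;
        java.math.BigDecimal total;
        void addLineItem() {}
    }

    public static void main(String[] args) {
        // Typically prints: [Order|id:long;total:BigDecimal|addLineItem()]
        System.out.println(toYuml(Order.class));
    }
}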

Trouble getting accents to show up in my java app

We recently got a localization file that contains Portuguese translations of all the strings in our Java app. The file they gave me was a .csv file and I use FileMaker to create a .tab file, which is what we need for our purposes. Unfortunately none of the accents seem to work. For example, the string vocÍ in our localization file shows up as vocΩ inside the application. I tried switching the language settings to Portuguese before compiling, but I still get this problem. Does anyone have any idea of what else I might need to try?
I think that your problem is related to the file encoding used.
Java has full Unicode support, so there shouldn't be any problems unless the file you are reading (the one made with FileMaker) is encoded in something different from UTF-8 (which is the default used by Java).
You can try saving the file in a different encoding, or specifying which encoding to use when opening it from Java (see here). Many API classes support additional parameters to specify which charset to use when opening a file. Just take a look at the documentation.
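A minimal sketch of forcing the charset when reading the .tab file; the file name and the ISO-8859-1 guess are assumptions, so substitute whatever encoding FileMaker actually wrote:
import java.io.BufferedReader;
import java.nio.charset.Charset;
import java.nio.file.Files;
import java.nio.file.Paths;

public class ReadLocalization {
    public static void main(String[] args) throws Exception {
        // Use the encoding the file was really saved in, e.g. ISO-8859-1 or MacRoman
        Charset charset = Charset.forName("ISO-8859-1");
        try (BufferedReader reader = Files.newBufferedReader(Paths.get("strings_pt.tab"), charset)) {
            reader.lines().forEach(System.out::println);
        }
    }
}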

Is there any way to get a WordprocessingML Clipboard content in java?

I've got a customer who managed to paste WordprocessingML content into our application. As far as I know it was a direct copy & paste from Word 2000 to our Java application. I tried every Word and Java version combination, but I can't reproduce this behavior, especially since our application filters for HTML and text/plain.
I'm pretty sure that the older Office versions had their own clipboards and exported only the formats which should be available to other programs. Every Office version I know (except maybe 2007) exports HTML, RTF and plain text.
Is there any way to get WordprocessingML content into the clipboard, and maybe to get Java to mix up the data flavours?
Apache POI is a Java API for accessing Microsoft format files. HWPF is its component for reading and writing MS Word files. Apache Tika is a toolkit for detecting and extracting metadata and structured text content from various documents using existing parser libraries. It also provides some support for MS Word documents. I suggest you see if they fit your use case.
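On the clipboard side, one way to see which flavours Word actually exports on a given machine is to list them with the standard AWT API; this is only a diagnostic sketch:
import java.awt.Toolkit;
import java.awt.datatransfer.Clipboard;
import java.awt.datatransfer.DataFlavor;

public class ListClipboardFlavors {
    public static void main(String[] args) {
        // Copy something in Word first, then run this to print every offered MIME type
        Clipboard clipboard = Toolkit.getDefaultToolkit().getSystemClipboard();
        for (DataFlavor flavor : clipboard.getAvailableDataFlavors()) {
            System.out.println(flavor.getMimeType());
        }
    }
}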
