I'm trying to index an Arabic text file using the ArabicAnalyzer provided by Apache Lucene. The following code shows what I am trying to do:
public class Indexer {
    public static void main(String[] args) throws Exception {
        String indexDir = "E:/workspace/IRThesisCorpusByApacheLucene/indexDir";
        String dataDir = "E:/workspace/IRThesisCorpusByApacheLucene/dataDir";
        long start = System.currentTimeMillis();
        Indexer indexer = new Indexer(indexDir);
        int numIndexed;
        try {
            numIndexed = indexer.index(dataDir, new TextFilesFilter());
        } finally {
            indexer.close();
        }
        long end = System.currentTimeMillis();
        System.out.println("Indexing " + numIndexed + " files took "
                + (end - start) + " milliseconds");
    }

    private IndexWriter writer;

    public Indexer(String indexDir) throws IOException {
        Directory dir = FSDirectory.open(new File(indexDir));
        writer = new IndexWriter(dir, new IndexWriterConfig(
                Version.LUCENE_45, new ArabicAnalyzer(Version.LUCENE_45)));
    }

    public void close() throws IOException {
        writer.close();
    }

    public int index(String dataDir, FileFilter filter) throws Exception {
        System.out.println(" Dir Path :::::" + new File(dataDir).getAbsolutePath());
        File[] files = new File(dataDir).listFiles();
        System.out.println(" Files number :::::" + files.length);
        for (File f : files) {
            System.out.println(" File is :::::" + f);
            if (!f.isDirectory() &&
                    !f.isHidden() &&
                    f.exists() &&
                    f.canRead() &&
                    (filter == null || filter.accept(f))) {
                indexFile(f);
            }
        }
        return writer.numDocs();
    }

    private static class TextFilesFilter implements FileFilter {
        public boolean accept(File path) {
            return path.getName().toLowerCase().endsWith(".txt");
        }
    }

    protected Document getDocument(File f) throws Exception {
        Document doc = new Document();
        InputStreamReader reader = new InputStreamReader(
                new FileInputStream(f), "UTF-8");
        System.out.println(" Encoding is ::::" + reader.getEncoding());
        doc.add(new TextField("contents", reader));
        doc.add(new TextField("filename", f.getName(), Field.Store.YES));
        doc.add(new TextField("fullpath", f.getCanonicalPath(), Field.Store.YES));
        return doc;
    }

    private void indexFile(File f) throws Exception {
        System.out.println("Indexing " + f.getCanonicalPath());
        Document doc = getDocument(f);
        System.out.println(" In indexFile :::::::: doc is ::" + doc + " writer:::" + writer);
        writer.addDocument(doc, new ArabicAnalyzer(Version.LUCENE_45));
    }
}
My text file contains:
{سم الله الرحمن الرحيم
اهلا و سهلا بكم ، ماذا بعد
كتب يكتب كاتب مكتوب سيكتب }
When I run it and then open the file _0.cfs, I get some words, but also undefined characters.
What is the problem here? Why doesn't it show the Arabic correctly?
You shouldn't be looking at .cfs files directly. The .cfs is a compound index file, and is not, in any way, a plain-text document. You are intended to use the Lucene API to search and retrieve data from an index, not to look at the files in an editor. If you want to know more about Lucene file formats, feel free to look at the codec documentation.
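For instance, here is a minimal sketch of reading the index back through the API (assuming the Lucene 4.5 setup above; "filename" and "fullpath" are stored fields and can be retrieved, while "contents" was fed from a Reader and is not stored; the query term is just an illustration):

Directory dir = FSDirectory.open(new File(indexDir));
DirectoryReader reader = DirectoryReader.open(dir);
IndexSearcher searcher = new IndexSearcher(reader);
// Parse the query with the same analyzer that was used at index time.
QueryParser parser = new QueryParser(Version.LUCENE_45, "contents",
        new ArabicAnalyzer(Version.LUCENE_45));
Query query = parser.parse("كتب");
TopDocs hits = searcher.search(query, 10);
for (ScoreDoc sd : hits.scoreDocs) {
    // Retrieve the stored fields of each matching document.
    Document d = searcher.doc(sd.doc);
    System.out.println(d.get("filename") + " -> " + d.get("fullpath"));
}
reader.close();

If the Arabic prints as the original words here, the index is fine and only your way of inspecting it was wrong.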
Related
I want it so that, if you try to copy a directory, a message is displayed and the program shuts down. In the case of a file, it should display the file size and the time it was last modified. I don't know exactly how I can print the file size and the last-modified time.
import java.io.*;

public class KopeeriFail {

    private static void kopeeri(String start, String end) throws Exception {
        InputStream sisse = new FileInputStream(start);
        OutputStream välja = new FileOutputStream(end);
        byte[] puhver = new byte[1024];
        int loetud = sisse.read(puhver);
        while (loetud > 0) {
            välja.write(puhver, 0, loetud);
            loetud = sisse.read(puhver);
        }
        sisse.close();
        välja.close();
    }

    public static void main(String[] args) throws Exception {
        if (args.length != 1) {
            System.out.println("Did you give a name for the file?");
            System.exit(1);
        }
        kopeeri(args[0], args[0] + ".copy");
    }
}
You can easily fetch BasicFileAttributes, which stores the size and the last-modification timestamp.
public static void main(String[] args) throws IOException {
    if (args.length != 1) {
        System.err.println("Specify a file name");
        return;
    }
    Path initial = Paths.get(args[0]);
    if (!Files.exists(initial)) {
        System.err.println("Path does not exist");
        return;
    }
    if (Files.isDirectory(initial)) {
        System.err.println("Path is a directory");
        return;
    }
    BasicFileAttributes attributes =
            Files.readAttributes(initial, BasicFileAttributes.class);
    System.out.println("Size is " + attributes.size() + " bytes");
    System.out.println("Last modified time " + attributes.lastModifiedTime());
    Files.copy(initial, initial.getParent()
            .resolve(initial.getFileName().toString() + ".copy"));
}
Hope it helps!
I'm making a backup program, and I want everything the program backs up to be displayed on a JTextArea. Well, it works, but only after the program is finished with the backup. How do I fix this? The code I have running this is here:
The backup method:
public void startBackup() throws Exception {
    // txtArea is the JTextArea
    Panel.txtArea.append("Starting Backup...\n");
    for (int i = 0; i < al.size(); i++) {
        // al is an ArrayList that holds all of the backup assignments
        // selected from the JFileChooser
        File file = new File((String) al.get(i));
        File directory = new File(dir);
        CopyFolder.copyFolder(file, directory);
    }
}
The CopyFolder class:

public class CopyFolder {

    public static void copyFolder(File src, File dest) throws IOException {
        if (src.isDirectory()) {
            // if the directory does not exist, create it
            if (!dest.exists()) {
                dest.mkdir();
                Panel.txtArea.append("Folder " + src.getName()
                        + " was created\n");
            }
            // list all the directory contents
            String[] files = src.list();
            for (String file : files) {
                // construct the src and dest file structure
                File srcFile = new File(src, file);
                File destFile = new File(dest, file);
                // recursive copy
                copyFolder(srcFile, destFile);
            }
        } else {
            try {
                CopyFile.copyFile(src, dest);
            } catch (Exception e) {
                // exceptions are silently swallowed here
            }
        }
    }
}
The CopyFile class:

public class CopyFile {

    public static void copyFile(File src, File dest) throws Exception {
        // if file, then copy it
        // use byte streams to support all file types
        InputStream in = new FileInputStream(src);
        OutputStream out = new FileOutputStream(dest);
        byte[] buffer = new byte[1024];
        int length;
        // copy the file content in bytes
        while ((length = in.read(buffer)) > 0) {
            out.write(buffer, 0, length);
        }
        in.close();
        out.close();
        // System.out.println("File copied from " + src + " to " + dest);
        Panel.txtArea.append("File copied " + src.getName() + "\n");
    }
}
Thanks for the help in advance, and let me know of any further details I can give. I did a Google search on this, and it does seem to be a common problem, but I just can't think of how to fix it. Oh, and please don't downvote this just because it doesn't apply to you; it's very aggravating. Thanks in advance again!
EDIT:
This is what I've got:
public class test extends SwingWorker<Void, String> {

    String txt;
    JTextArea txtArea = null;

    public test(JTextArea txtArea, String str) {
        txt = str;
        this.txtArea = txtArea;
    }

    protected Void doInBackground() throws Exception {
        return null;
    }

    protected void process(String str) {
        txtArea.append(str);
    }

    protected void getString() {
        publish(txt);
    }
}
The main problem you're having is that you're trying to perform blocking actions in the Event Dispatching Thread. This will prevent the UI from being updated, as repaint requests don't reach the repaint manager until AFTER you've finished.
To overcome this, you're going to need to offload the blocking work (i.e. the backup process) to a separate thread.
For this I suggest you have a read through the Concurrency in Swing trail, which will provide you with some useful strategies to solve your particular problem. In particular, you'll probably benefit from using a SwingWorker.
Take a close look at the doInBackground and process methods.
UPDATED with Example
Okay, so this is a REALLY simple example. It basically walks your C:\ drive to three directories deep and dumps the contents to the supplied JTextArea.
public class BackgroundWorker extends SwingWorker<Object, File> {

    private JTextArea textArea;

    public BackgroundWorker(JTextArea textArea) {
        this.textArea = textArea;
    }

    @Override
    protected Object doInBackground() throws Exception {
        list(new File("C:\\"), 0);
        return null;
    }

    @Override
    protected void process(List<File> chunks) {
        for (File file : chunks) {
            textArea.append(file.getPath() + "\n");
        }
        textArea.setCaretPosition(textArea.getText().length() - 1);
    }

    protected void list(File path, int level) {
        if (level < 4) {
            System.out.println(level + " - Listing " + path);
            File[] files = path.listFiles(new FileFilter() {
                @Override
                public boolean accept(File pathname) {
                    return pathname.isFile();
                }
            });
            publish(path);
            for (File file : files) {
                System.out.println(file);
                publish(file);
            }
            files = path.listFiles(new FileFilter() {
                @Override
                public boolean accept(File pathname) {
                    return pathname.isDirectory() && !pathname.isHidden();
                }
            });
            for (File folder : files) {
                list(folder, level + 1);
            }
        }
    }
}
You would simply call new BackgroundWorker(textArea).execute() and walk away :D
UPDATED with explicit example
public class BackgroundWorker extends SwingWorker<Object, String> {

    private JTextArea textArea;
    private File sourceDir;
    private File destDir;

    public BackgroundWorker(JTextArea textArea, File sourceDir, File destDir) {
        this.textArea = textArea;
        this.sourceDir = sourceDir;
        this.destDir = destDir;
    }

    @Override
    protected Object doInBackground() throws Exception {
        if (sourceDir.isDirectory()) {
            // if the directory does not exist, create it
            if (!destDir.exists()) {
                destDir.mkdir();
                publish("Folder " + sourceDir.getName() + " was created");
            }
            // list all the directory contents
            String[] files = sourceDir.list();
            for (String file : files) {
                // construct the src and dest file structure
                File srcFile = new File(sourceDir, file);
                File destFile = new File(destDir, file);
                // recursive copy
                copyFolder(srcFile, destFile);
            }
        } else {
            try {
                copyFile(sourceDir, destDir);
            } catch (Exception e) {
                // exceptions are silently ignored here
            }
        }
        return null;
    }

    public void copyFolder(File src, File dest) throws IOException {
        if (src.isDirectory()) {
            // if the directory does not exist, create it
            if (!dest.exists()) {
                dest.mkdir();
                publish("Folder " + src.getName() + " was created");
            }
            // list all the directory contents
            String[] files = src.list();
            for (String file : files) {
                // construct the src and dest file structure
                File srcFile = new File(src, file);
                File destFile = new File(dest, file);
                // recursive copy
                copyFolder(srcFile, destFile);
            }
        } else {
            try {
                copyFile(src, dest);
            } catch (Exception e) {
                // exceptions are silently ignored here
            }
        }
    }

    public void copyFile(File src, File dest) throws Exception {
        // if file, then copy it
        // use byte streams to support all file types
        InputStream in = new FileInputStream(src);
        OutputStream out = new FileOutputStream(dest);
        byte[] buffer = new byte[1024];
        int length;
        // copy the file content in bytes
        while ((length = in.read(buffer)) > 0) {
            out.write(buffer, 0, length);
        }
        in.close();
        out.close();
        publish("File copied " + src.getName());
    }

    @Override
    protected void process(List<String> chunks) {
        for (String msg : chunks) {
            textArea.append(msg + "\n");
        }
        textArea.setCaretPosition(textArea.getText().length() - 1);
    }
}
Now to run...
new BackgroundWorker(textArea, sourceDir, destDir).execute();
I wrote (mostly copied from the Lucene in Action ebook) an indexing example using Tika, but it doesn't index the documents at all. There is no error on compile or run. I tried indexing a .pdf, .ppt, .doc, even a .txt document, to no avail: a search returns 0 hits, and I paid attention to the words in my documents. Please take a look at the code:
public class TikaIndexer extends Indexer {

    private boolean DEBUG = false;

    static Set textualMetadataFields = new HashSet();
    static {
        textualMetadataFields.add(Metadata.TITLE);
        textualMetadataFields.add(Metadata.AUTHOR);
        textualMetadataFields.add(Metadata.COMMENTS);
        textualMetadataFields.add(Metadata.KEYWORDS);
        textualMetadataFields.add(Metadata.DESCRIPTION);
        textualMetadataFields.add(Metadata.SUBJECT);
    }

    public TikaIndexer(String indexDir) throws IOException {
        super(indexDir);
    }

    @Override
    protected boolean acceptFile(File f) {
        return true;
    }

    @Override
    protected Document getDocument(File f) throws Exception {
        Metadata metadata = new Metadata();
        metadata.set(Metadata.RESOURCE_NAME_KEY, f.getCanonicalPath());
        InputStream is = new FileInputStream(f);
        Parser parser = new AutoDetectParser();
        ContentHandler handler = new BodyContentHandler(10 * 1024 * 1024);
        try {
            parser.parse(is, handler, metadata, new ParseContext());
        } finally {
            is.close();
        }
        Document doc = new Document();
        doc.add(new Field("contents", handler.toString(),
                Field.Store.NO, Field.Index.ANALYZED));
        if (DEBUG) {
            System.out.println(" full text: " + handler.toString());
        }
        for (String name : metadata.names()) {
            String value = metadata.get(name);
            if (textualMetadataFields.contains(name)) {
                doc.add(new Field("contents", value,
                        Field.Store.NO, Field.Index.ANALYZED));
            }
            doc.add(new Field(name, value, Field.Store.YES, Field.Index.NO));
            if (DEBUG) {
                System.out.println(" " + name + ": " + value);
            }
        }
        if (DEBUG) {
            System.out.println();
        }
        return doc;
    }
}
And the main class:
public static void main(String[] args) {
    String indexDir = "src/indexDirectory";
    String dataDir = "src/filesDirectory";
    try {
        TikaConfig config = TikaConfig.getDefaultConfig();
        List<MediaType> parsers = new ArrayList<MediaType>(
                config.getParser().getSupportedTypes(new ParseContext()));
        Collections.sort(parsers);
        Iterator<MediaType> it = parsers.iterator();
        System.out.println(parsers.size());
        System.out.println("Parser types:");
        while (it.hasNext()) {
            System.out.println(" " + it.next());
        }
        System.out.println();

        long start = new Date().getTime();
        TikaIndexer indexer = new TikaIndexer(indexDir);
        int numIndexed = indexer.index(dataDir);
        long end = new Date().getTime();
        System.out.println("Indexing " + numIndexed + " files took "
                + (end - start) + " milliseconds.");
        System.out.println();
        System.out.println("--------------------------------------------------------------");
        System.out.println();
    } catch (Exception ex) {
        System.out.println("Indexing could not be performed: ");
        ex.printStackTrace();
        Logger.getLogger(Main.class.getName()).log(Level.SEVERE, null, ex);
    }
}
I started learning to use Lucene and first tried to compile the example of an Indexer class from a book I found. The class looks like this:
public class Indexer {

    private IndexWriter writer;

    public static void main(String[] args) throws Exception {
        String indexDir = "src/indexDirectory";
        String dataDir = "src/filesDirectory";
        long start = System.currentTimeMillis();
        Indexer indexer = new Indexer(indexDir);
        int numIndexed = indexer.index(dataDir);
        indexer.close();
        long end = System.currentTimeMillis();
        System.out.println("Indexing " + numIndexed + " files took "
                + (end - start) + " milliseconds");
    }

    public Indexer(String indexDir) throws IOException {
        Directory dir = new FSDirectory(new File(indexDir), null) {};
        writer = new IndexWriter(dir, new StandardAnalyzer(Version.LUCENE_35),
                true, IndexWriter.MaxFieldLength.UNLIMITED);
    }

    public void close() throws IOException {
        writer.close();
    }

    public int index(String dataDir) throws Exception {
        File[] files = new File(dataDir).listFiles();
        for (int i = 0; i < files.length; i++) {
            File f = files[i];
            if (!f.isDirectory() && !f.isHidden() && f.exists() && f.canRead()
                    && acceptFile(f)) {
                indexFile(f);
            }
        }
        return writer.numDocs();
    }

    protected boolean acceptFile(File f) {
        return f.getName().endsWith(".txt");
    }

    protected Document getDocument(File f) throws Exception {
        Document doc = new Document();
        doc.add(new Field("contents", new FileReader(f)));
        doc.add(new Field("filename", f.getCanonicalPath(),
                Field.Store.YES, Field.Index.NOT_ANALYZED));
        return doc;
    }

    private void indexFile(File f) throws Exception {
        System.out.println("Indexing " + f.getCanonicalPath());
        Document doc = getDocument(f);
        if (doc != null) {
            writer.addDocument(doc);
        }
    }
}
When I run it, I get
Exception in thread "main" java.lang.StackOverflowError
at org.apache.lucene.store.FSDirectory.openInput(FSDirectory.java:345)
at org.apache.lucene.store.Directory.openInput(Directory.java:143)
and it goes on like this dozens of times.
The constructor of my class
public Indexer(String indexDir) throws IOException {
    Directory dir = new FSDirectory(new File(indexDir), null) {};
    writer = new IndexWriter(dir, new StandardAnalyzer(Version.LUCENE_35),
            true, IndexWriter.MaxFieldLength.UNLIMITED);
}
has an IndexWriter call that is deprecated (because the book was written for Lucene 3.0.0) and uses the method IndexWriter.MaxFieldLength.UNLIMITED (also deprecated). Could this be causing the overflow? If so, what should I use instead?
No, don't make your own anonymous subclass of FSDirectory! Use FSDirectory.open instead, or instantiate a concrete subclass of FSDirectory provided by Lucene, like NIOFSDirectory (but in that case you must carefully read the Javadoc of your chosen implementation; every one has OS-specific pitfalls). Lucene never, even in version 3.0, expected you to make your own subclasses of FSDirectory.
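As a minimal sketch, the constructor could look like this (it keeps the deprecated Lucene 3.5 IndexWriter constructor from the book; the deprecation warnings are unrelated to the StackOverflowError):

public Indexer(String indexDir) throws IOException {
    // Let FSDirectory.open pick a suitable implementation for the platform.
    Directory dir = FSDirectory.open(new File(indexDir));
    writer = new IndexWriter(dir, new StandardAnalyzer(Version.LUCENE_35),
            true, IndexWriter.MaxFieldLength.UNLIMITED);
}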
I am indexing documents with Lucene and am trying to apply the SnowballAnalyzer for punctuation and stopword removal from the text, but I keep getting the following error:
IllegalAccessError: tried to access method org.apache.lucene.analysis.Tokenizer.<init>(Ljava/io/Reader;)V from class org.apache.lucene.analysis.snowball.SnowballAnalyzer
Here is the code; I would very much appreciate help! I am new to this.
public class Indexer {

    private Indexer() {}

    private String[] stopWords = {....};

    private String indexName;
    private IndexWriter iWriter;
    private static String FILES_TO_INDEX = "/Users/ssi/forindexing";

    public static void main(String[] args) throws Exception {
        Indexer m = new Indexer();
        m.index("./newindex");
    }

    public void index(String indexName) throws Exception {
        this.indexName = indexName;
        final File docDir = new File(FILES_TO_INDEX);
        if (!docDir.exists() || !docDir.canRead()) {
            System.err.println("Something wrong... " + docDir.getPath());
            System.exit(1);
        }
        Date start = new Date();
        PerFieldAnalyzerWrapper analyzers =
                new PerFieldAnalyzerWrapper(new SimpleAnalyzer());
        analyzers.addAnalyzer("text", new SnowballAnalyzer("English", stopWords));
        Directory directory = FSDirectory.open(new File(this.indexName));
        IndexWriter.MaxFieldLength maxLength = IndexWriter.MaxFieldLength.UNLIMITED;
        iWriter = new IndexWriter(directory, analyzers, true, maxLength);
        System.out.println("Indexing to dir..........." + indexName);
        if (docDir.isDirectory()) {
            File[] files = docDir.listFiles();
            if (files != null) {
                for (int i = 0; i < files.length; i++) {
                    try {
                        indexDocument(files[i]);
                    } catch (FileNotFoundException fnfe) {
                        fnfe.printStackTrace();
                    }
                }
            }
        }
        System.out.println("Optimizing...... ");
        iWriter.optimize();
        iWriter.close();
        Date end = new Date();
        System.out.println("Time to index was "
                + (end.getTime() - start.getTime()) + " milliseconds");
    }

    private void indexDocument(File someDoc) throws IOException {
        Document doc = new Document();
        Field name = new Field("name", someDoc.getName(),
                Field.Store.YES, Field.Index.ANALYZED);
        Field text = new Field("text", new FileReader(someDoc),
                Field.TermVector.WITH_POSITIONS_OFFSETS);
        doc.add(name);
        doc.add(text);
        iWriter.addDocument(doc);
    }
}
This says that one Lucene class is inconsistent with another Lucene class: one is accessing a member of the other that it can't. This strongly suggests you have two different, incompatible versions of Lucene on your classpath somehow.
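One quick way to confirm this (a diagnostic sketch, not part of the original code) is to print which jar each of the two classes involved in the error was actually loaded from:

// If the two locations point at different Lucene jars, that's the clash.
System.out.println(org.apache.lucene.analysis.Tokenizer.class
        .getProtectionDomain().getCodeSource().getLocation());
System.out.println(org.apache.lucene.analysis.snowball.SnowballAnalyzer.class
        .getProtectionDomain().getCodeSource().getLocation());

Remove or align the duplicate jar so that only one Lucene version remains on the classpath.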