TREC - How to use a corpus (AP88) with Apache Lucene - java

I have been given two zip files: AP88.zip, which contains a bunch of text files, and a file containing queries. I'm supposed to run a test using these two files with Apache Lucene. I think the basic idea is to use the queries to search the various files contained in the AP88 zip file.
I understand the basics of information retrieval and the theory behind it. However, I have no idea where to start in order to run a test with the given data.
Could you please help me find pre-existing code, and explain how to use the given files to run a test with Apache Lucene?
Many thanks.
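As a starting point, here is a minimal sketch of the index-then-search loop (assuming Lucene 8+ on the classpath and Java 11+; the directory names and the query string are placeholders, and real TREC AP88 files often bundle many <DOC> entries per file, which you would split out before indexing):

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.store.FSDirectory;

import java.nio.charset.StandardCharsets;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class Ap88Demo {
    public static void main(String[] args) throws Exception {
        StandardAnalyzer analyzer = new StandardAnalyzer();
        FSDirectory indexDir = FSDirectory.open(Paths.get("ap88-index"));

        // 1. Index: treat each file in the unzipped AP88 folder as one document.
        try (IndexWriter writer = new IndexWriter(indexDir, new IndexWriterConfig(analyzer));
             DirectoryStream<Path> files = Files.newDirectoryStream(Paths.get("AP88"))) {
            for (Path file : files) {
                Document doc = new Document();
                doc.add(new StringField("path", file.toString(), Field.Store.YES));
                // ISO-8859-1 never rejects a byte, so old corpora read safely.
                doc.add(new TextField("contents",
                        Files.readString(file, StandardCharsets.ISO_8859_1), Field.Store.NO));
                writer.addDocument(doc);
            }
        }

        // 2. Search: run one query (placeholder text) and print the top 10 hits.
        try (DirectoryReader reader = DirectoryReader.open(indexDir)) {
            IndexSearcher searcher = new IndexSearcher(reader);
            Query query = new QueryParser("contents", analyzer).parse("your query terms here");
            for (ScoreDoc hit : searcher.search(query, 10).scoreDocs) {
                System.out.println(searcher.doc(hit.doc).get("path") + "  score=" + hit.score);
            }
        }
    }
}

For a full TREC-style evaluation you would loop over every query in the query file and record the ranked lists, but this skeleton covers the core indexing and searching steps.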

Related

Using the code to input data for backpropagation

I am learning to build neural nets and I came across this code on github,
https://github.com/PavelJunek/back-propagation-java
A training set and a validation set are required, but I don't know where to input the files. The README doesn't quite explain how to use them. How do I test this code with my own CSV files?
How so? It tells you exactly what to do. The program takes two CSV files: one containing all of the training data and a second containing all of the validation data.
If you have a look at the Program.java file (in the main method), you'll see that you need to pass both files as command-line arguments.
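For example, an invocation might look like this (hypothetical file names; the exact classpath or package prefix depends on how you build the project):

java -cp target/classes Program training.csv validation.csv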

java - Quick way to extract the company/publisher/vendor information from the File

I am crawling files and directories from my local file system and trying to filter out files based on their company/publisher/vendor name. Let's assume I want to filter all files that belong to Microsoft Corporation. I wonder if I could get this information from java.nio.file, from java.io.File, or from the MIME type using Apache Tika, but I am not finding any quick way to get the company/publisher/vendor information from the files.
I found a link that does exactly what I want, but it slows down the whole process, so I am looking for a quicker way to extract the company/publisher/vendor details.
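If you do go the Tika route, a minimal sketch looks like this (assuming tika-core and tika-parsers are on the classpath; note this still fully parses each file, so it may be no faster than what you already found, and the key holding the company name varies by format - for Office documents it is typically "extended-properties:Company", while many file types carry no vendor metadata at all):

import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.sax.BodyContentHandler;

import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;

public class VendorMetadata {
    public static void main(String[] args) throws Exception {
        AutoDetectParser parser = new AutoDetectParser();
        Metadata metadata = new Metadata();
        try (InputStream in = Files.newInputStream(Path.of(args[0]))) {
            // We only want the metadata, not the body text, so a
            // throwaway handler with no size limit (-1) is enough.
            parser.parse(in, new BodyContentHandler(-1), metadata);
        }
        // Dump every key/value pair Tika extracted; company/publisher
        // fields, when the format has them, appear in this list.
        for (String name : metadata.names()) {
            System.out.println(name + " = " + metadata.get(name));
        }
    }
}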

Generate XLS File from Page List in Pega PRPC

I'm working in PRPC 7.x; I need to generate an XLS file with two worksheets, each of which is basically just information copied from one of two page lists.
We don't have any jars aside from the default PRPC ones, which means we don't have Apache POI. We do have access to iText; I am unfamiliar with the library, but am told it may be of use here.
Is there a way in PRPC to generate simple XLS files with a high degree of control?
There are two OOTB activities you can use to generate the XLS:
- ExportToExcel in @baseclass
- ViewExportToExcel in Rule-Obj-HTML
One of these should help; let me know if you need any more info.
Rulesware

Indexing XML files with tags as fields

I'm trying to index a bunch of XML files on my hard drive with Apache Lucene in Java. My idea was, for the best performance, to use the tags in the files as fields in the Lucene index, so when you search for a specific tag in a file (for example, an Ordernumber), the Lucene query would simply be ordernumber:123.
For example, I have this part of a file:
<contactPerson>
  <identification>
    <source>CUSTOMER</source>
    <sourceId>12345678</sourceId>
  </identification>
  <lastName>Vader</lastName>
  <firstName>Darth</firstName>
  <telefon>0000</telefon>
  <emailAddress>darth.vader@Imperium.com</emailAddress>
  <roleType>ORDERER</roleType>
</contactPerson>
Is it now possible to index the XML file, so that I can use the query lastname:Vader to look for the last name? Would you recommend a better solution to search in XML files? Is there maybe an out-of-the-box feature I could use? Would it be easier if the XML files were stored in a database?
Also, later I will use Elasticsearch or Solr for the same task, but then the XML files will be in a database. What different options would be available when using one of these?
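To the Lucene part of the question: yes, this works. Below is a minimal sketch (the file name contact.xml and the index directory are placeholders) that walks the DOM and turns every leaf element into a field named after its lower-cased tag, so the query lastname:Vader matches. StringField indexes each value as a single untokenized, case-sensitive term, which fits exact lookups; switch to TextField if you need analyzed, full-text matching.

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.FSDirectory;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

import javax.xml.parsers.DocumentBuilderFactory;
import java.io.File;
import java.nio.file.Paths;

public class XmlIndexer {
    public static void main(String[] args) throws Exception {
        Document doc = new Document();
        Element root = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new File("contact.xml"))
                .getDocumentElement();
        addLeafFields(root, doc);

        try (IndexWriter writer = new IndexWriter(
                FSDirectory.open(Paths.get("xml-index")),
                new IndexWriterConfig(new StandardAnalyzer()))) {
            writer.addDocument(doc);
        }
    }

    // Recursively walk the DOM; every element without child elements
    // becomes a field named after its (lower-cased) tag.
    static void addLeafFields(Element element, Document doc) {
        NodeList children = element.getChildNodes();
        boolean hasElementChild = false;
        for (int i = 0; i < children.getLength(); i++) {
            Node child = children.item(i);
            if (child.getNodeType() == Node.ELEMENT_NODE) {
                hasElementChild = true;
                addLeafFields((Element) child, doc);
            }
        }
        if (!hasElementChild) {
            doc.add(new StringField(element.getTagName().toLowerCase(),
                    element.getTextContent().trim(), Field.Store.YES));
        }
    }
}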

How to read Pig output in a separate Java program

I have some Pig output files and want to read them on another machine (without a Hadoop installation). I just want to read a tab-separated plain-text line and parse it into a Java object. I am guessing we should be able to use pig.jar as a dependency to read it, but I could not find relevant documentation. I think this class could be used? How can we also provide the schema?
I suggest you store the data in the Avro serialization format. It is Pig-independent, and it can handle complex data structures like the ones you described (so you don't need to write your own parser). See this article for examples.
Your Pig output files are just text files, right? Then you don't need any Pig or Hadoop jars.
The last time I worked with Pig was on Amazon's EMR platform, and the output files were stashed in an S3 bucket. They were just text files, and standard Java can read them in.
The class you referenced is for reading into Pig from some text format, not the other way around.
Are you asking for a library to parse the Pig data model into Java objects, i.e. the text representation of tuples, bags, etc.? If so, it's probably easier to write it yourself. It's a very simple data model with only three-ish data types.
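To illustrate the "just text files" point, here is a minimal sketch (assuming the script stored its output with the default tab-delimited PigStorage; the file name part-r-00000, the Order record, and its three-column schema are hypothetical, and the record syntax needs Java 16+):

import java.io.BufferedReader;
import java.nio.file.Files;
import java.nio.file.Path;

public class PigOutputReader {
    // Hypothetical record mirroring the schema of your STORE statement.
    record Order(String id, String customer, double amount) {}

    public static void main(String[] args) throws Exception {
        try (BufferedReader reader = Files.newBufferedReader(Path.of("part-r-00000"))) {
            String line;
            while ((line = reader.readLine()) != null) {
                // PigStorage writes one tuple per line, fields separated by tabs.
                String[] cols = line.split("\t", -1);
                System.out.println(new Order(cols[0], cols[1], Double.parseDouble(cols[2])));
            }
        }
    }
}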
