matrix computation from XML values - java

good day everyone. i try learn some about parsing files, and i decide start with DOM, xPath or SAX,
for example how can i getting numeric values form two 2x2 matrix.
and afterwards perform computation in JAVA of these two matrix to output result.
what is best to use for parsing,and then obtain result for computation matrixes with some algorithm?

You might look at the JScience library, which uses javolution for XML marshaling.

Related

Java: How to use TF-IDF to compute similarity of two documents?

My goals is to find a similarity value between two documents (collections of words). I have already found several answers like this SO post or this SO post which provide Python libraries that achieve this, but I have trouble understanding the approach and making it work for my use case.
If I understand correctly, TF-IDF of a document is computed with respect to a given term, right? That's how I interpret it from the Wikipedia article on this: "tf-idf...is a numerical statistic that is intended to reflect how important a word is to a document".
In my case, I don't have a specific search term which I want to compare to the document, but I have two different documents. I assume I need to first compute vectors for the documents, and then take the cosine between these vectors. But all the answers I found with respect to constructing these vectors always assume a search term, which I don't have in my case.
Can't wrap my head around this, any conceptual help or links to Java libraries that achieve this would be highly appreciated.
I suggest running terminology extraction first, together with their frequencies. Note that stemming can also be applied to the extracted terms to avoid noise in during the subsequent cosine similarity calculation. See Java library for keywords extraction from input text SO thread for more help and ideas on that.
Then, as you yourself mention, for each of those terms, you will have to compute the TF-IDF values, get the vectors and compute the cosine similarity.
When calculating TF-IDF, mind that 1 + log(N/n) (N standing for the total number of corpora and n standing for the number of corpora that include the term) formula is better since it avoids the issue when TF is not 0 and IDF turns out equal to 0.

How do I classify numerical data in Apache Mahout?

I have a numerical dataset of the format class, unigram count, bigram count, sentiment. I went through some of the Apache Mahout documentation and it was all about text data. I am aware that I need to perform 3 steps to classify: Convert to sequence files, Vectorize sequence files, Pass it to train the Naive Bayes Classifier. But I am having a hard time to understand the difference between classifying a text dataset vs classifying a numerical dataset in Mahout. What do I need to do differently in my case? I would appreciate any help.
As you might know, mahout can not use text data to train a model. If you start from a numerical dataset, the classification will be even easier because the vectors that mahout handle are numerical data vectors.
I used mahout on a text dataset and I know that in that case, I had to use dictionnary to convert text data to numerical data. Some algorithms handle it better than others ( for example Naive Bayes strongly prefers text-like data).
So in your case, try to use other classifiers like random forrest or online logistic regression to obtain more efficient result. In my experience, using random forrest, you can just define the type of features that you have (in your case all your features are numerical) so the classification could be done pretty easily. If you want to stick with Naive Bayes, I am sure it is still possible to classify your numerical dataset but I never used it so I can not give more help.

Software pattern for matching object with handles

I have been thinking in an approach for this problem but I have not found any solution which convince me. I am programming a crawler and I have a downloading task for every url from a urls list. In addition, the different html documents are parsed in different mode depending of the site url and the information that I want to take. So my problem is how to link every task with its appropriate parse.
The ideas are:
Creating an huge 'if' where check the download type and to associate a parse.
(Avoided, because the 'if' is growing with every new different site added to crawler)
Using polymorphism, to create a download task different for every different site and related to type of information which I want to get, and then use a post-action where link its parse.
(Increase the complexity again with every new parser)
So I am looking for some kind of software pattern or idea for say:
Hey I am a download task with this information
Really? Then you need this parse for extract it. Here is the parse you need.
Additional information:
The architecture is very simple. A list with urls which are seeds for the crawler. A producer which download the pages. Other list with html documents downloaded. And a consumer who will should apply the right parse for the page.
Depending of the page download sometimes we need use a parse A, or a parse B, etc..
EDIT
An example:
We have three site webs: site1.com, site2.com and site3.com
There are three urls type which we want parsing: site1.com/A, site1.com/B, site1.com/C, site2.com/A, site2.com/B, site2.com/C, ... site3.com/C
Every url it parsed different and usually the same information is between site1.com/A - site2.com/A - site3.com/A ; ... ; site1.com/C - site2.com/C - site3.com/C
Looks like a Genetic Algorithm aproached solution fits for your description of the problem, what you need to find first is the basics (atomic) solutions.
Here's a tiny description from wikipedia:
In a genetic algorithm, a population of candidate solutions (called individuals, creatures, or phenotypes) to an optimization problem is evolved toward better solutions. Each candidate solution has a set of properties (its chromosomes or genotype) which can be mutated and altered; traditionally, solutions are represented in binary as strings of 0s and 1s, but other encodings are also possible.[2]
The evolution usually starts from a population of randomly generated individuals, and is an iterative process, with the population in each iteration called a generation. In each generation, the fitness of every individual in the population is evaluated; the fitness is usually the value of the objective function in the optimization problem being solved. The more fit individuals are stochastically selected from the current population, and each individual's genome is modified (recombined and possibly randomly mutated) to form a new generation. The new generation of candidate solutions is then used in the next iteration of the algorithm. Commonly, the algorithm terminates when either a maximum number of generations has been produced, or a satisfactory fitness level has been reached for the population.
A typical genetic algorithm requires:
a genetic representation of the solution domain,
a fitness function to evaluate the solution domain.
A standard representation of each candidate solution is as an array of bits.[2] Arrays of other types and structures can be used in essentially the same way. The main property that makes these genetic representations convenient is that their parts are easily aligned due to their fixed size, which facilitates simple crossover operations. Variable length representations may also be used, but crossover implementation is more complex in this case. Tree-like representations are explored in genetic programming and graph-form representations are explored in evolutionary programming; a mix of both linear chromosomes and trees is explored in gene expression programming.
Once the genetic representation and the fitness function are defined, a GA proceeds to initialize a population of solutions and then to improve it through repetitive application of the mutation, crossover, inversion and selection operators.
I would externalize the parsing pattern / structure in some form ( like XML ) and use them dynamically.
For example, I have to download site1.com an site2.com . Both are having two different layout . I will create two xml which holds the layout pattern .
And one master xml which can hold which url should use which xml .
While startup load this master xml and use it as dictionary. When you have to download , download the page and find the xml from dictionary and pass the dictionary and stream to the parser ( single generic parser) which can read the stream based on Xml flow and xml information.
In this way, we can create common patterns in xml and use it to read similar sites. Use Regular expressions in xml patterns to cover most of sites in single xml.
If the layout is completely different , just create one xml and modify master xml that's it.
The secret / success of this design is how you create such generic xmls and it is purely depends on what you need and what you are doing after parsing.
This seems to be a connectivity problem. I'd suggest considering the quick find algorithm.
See here for more details.
http://jaysonlagare.blogspot.com.au/2011/01/union-find-algorithms.html
and here's a simple java sample,
https://gist.github.com/gtkesh/3604922

XML transformation in Java with best Performance

I want to do some manipulation on xml content in Java. See below xml
From Source XML:
<ns1:Order xmlns:ns1="com.test.ns" xmlns:xsd="http://www.w3.org/2001/XMLSchema">
<OrderHeader>
<Image>Image as BinaryData of size 250KB</Image>
</OrderHeader>
</ns1:Order>
Target XML:
<OrderData>
<OrderHeader>
<Image>Image as BinaryData of size 250KB</Image>
</OrderHeader>
</OrderData>
As shown, I have Source xml and I want target xml for that .. The only difference we can observe is root_element "ns1:Order" is replace with "OrderData" in target xml.
Fyi, OrderHeader has one sub-element Image which holds binary image of 250KB (so this xml going to be large one) .. also root element of target xml "OrderData" is well-known in advance.
Now, I want to achieve above result in java with best performance .. I have Source xml content already as byte[] and I want target xml content also as byte[] .. I am open to use Sax parser too.
Please provide the solution which has best performance for doing above stuff.
Thanks in advance,
Nurali
Do you mean machine performance or human performance? Spending an infinite amount of programmer time to achieve a microscopic gain in machine performance is a strange trade-off to make these days, when a powerful computer costs about the same as half a day of a contract programmer's time.
I would recommend using XSLT. It might not be fastest, but it will be fast enough. For a simple transformation like this, XSLT performance will be dominated by parsing and serialization costs, and those won't be any worse than for any other solution.
Not much will beat direct bytes/String manipulation, for instance, a regular expression.
But be warned, manipulating XML with Regex is always a hot debate
I used XLST to transform XML documents. That's another way to do it. There are several Java implementations of XLST processors.
The fastest way to manipulate strings in Java is using direct manipulation and the StringBuilder for the results. I wrote code to modify 20 mb strings that built a table of change locations and then copied and modified the string into a new StringBuilder. For Strings XSLT and RegEx are much slower than direct manipulation and SAX/DOM parsers are slower still.

Need help in latent semantic indexing

I am sorry, if my question sounds stupid :)
Can you please recommend me any pseudo code or good algo for LSI implementation in java?
I am not math expert. I tried to read some articles on wikipedia and other websites about
LSI ( latent semantic indexing ) they were full of math.
I know LSI is full of math. But if i see some source code or algo. I understand things more
easily. That's why i asked here, because so many GURU are here !
Thanks in advance
An idea of LSA is based on one assumption: the more two words occur in same documents, the more similar they are. Indeed, we can expect that words "programming" and "algorithm" will occur in same documents much more often then, say, "programming" and "dog-breeding".
Same for documents: the more common/similar words two documents have, the more similar themselves they are. So, you can express similarity of documents by frequencies of words and vice versa.
Knowing this, we can construct a co-occurrence matrix, where column names represent documents, row names - words and each cells[i][j] represents frequency of word words[i] in document documents[j]. Frequency may be computed in many ways, IIRC, original LSA uses tf-idf index.
Having such matrix, you can find similarity of two documents by comparing corresponding columns. How to compare them? Again, there are several ways. The most popular is a cosine distance. You must remember from school maths, that matrix may be treated as a bunch of vectors, so each column is just a vector in some multidimensional space. That's why this model is called "Vector Space Model". More on VSM and cosine distance here.
But we have one problem with such matrix: it is big. Very very big. Working with it is too computationally expensive, so we have to reduce it somehow. LSA uses SVD technique to keep the most "important" vectors. After reduction matrix is ready to use.
So, algorithm for LSA will look something like this:
Collect all documents and all unique words from them.
Extract frequency information and build co-occurrence matrix.
Reduce matrix with SVD.
If you're going to write LSA library by yourself, the good point to start is Lucene search engine, which will make much easier steps 1 and 2, and some implementation of high-dimensional matrices with SVD capability like Parallel Colt or UJMP.
Also pay attention to other techinques, which grown up from LSA, like Random Indexing. RI uses same idea and shows approximately same results, but doesn't use full matrix stage and is completely incremental, which makes it much more computationally efficient.
This maybe a bit late but I always liked Sujit Pal's blog http://sujitpal.blogspot.com/2008/09/ir-math-with-java-tf-idf-and-lsi.html and I have written a bit on my site if you are interested.
The process is way less complicated than it is often written up as. And really all you need is a library that can do single value decomposition of a matrix.
If you are interested I can explain in a couple of the short take away bits:
1) you create a matrix/dataset/etc with word counts of various documents - the different documents will be your columns and the rows the distinct words.
2) Once you've created the matrix you use a library like Jama (for Java) or SmartMathLibrary (for C#) and run the single value decomposition. All this does is take your original matrix and break it up in to three different parts/matrix that essentially represent your documents, your words, and kind of a multiplier (sigma) these are called the vectors.
3) Once you have you word, document, sigma vectors you shrink them equally (k) by just copying smaller parts of the vector/matrix and then multiply them back together. By shrinking them it kind of normalizes your data and this is LSI.
here are some fairly clear resources:
http://puffinwarellc.com/index.php/news-and-articles/articles/30-singular-value-decomposition-tutorial.html
http://lsa.colorado.edu/papers/JASIS.lsi.90.pdf
http://www.soe.ucsc.edu/classes/cmps290c/Spring07/proj/Flynn_talk.pdf
Hope this help you out a bit.
Eric

Categories

Resources