Rapid Miner 101 - java

I'm back with a question. I'm playing with Rapid Miner for automatic text classification and cant get it work. I'm getting an error that says, "no example set in the example, offending operator Performance ". Any idea what that is referring to ?

In RapidMiner you have to use the converter components before using it as example sets. So, if you have an output as 'doc', for example, you have to use the component 'Documents to Data' in order to link it to the next input 'exa'. That´s all!

Could you provide more details about your RapidMiner text mining process?
Without more context, your question is difficult to answer.
For more help with RapidMiner, you may want to check out the RapidMiner user forum: http://forum.rapid-i.com/
At RapidMiner Resources, you can find RapidMiner tutorial videos about how to text mining with RapidMiner:
http://rapidminerresources.com/index.php?page=text-mining-3
Rapid-I also offers a 90 minutes text mining webinar. You can find it at the Rapid-I web page under "services" and "training" or in the web shop.
I hope these links help you to get started with automatic text classification with RapidMiner. If you provide more details about your RapidMiner text mining process, I may also be able to directly answer your question.

If it says that there is no Example Set, then the issue is probably with your original data. Can you post an image of your process?
For instance, make sure that you have connected the initial input to your operator - what two operators does the error occur at?
One thought: the example set in text mining is usually your document collection, but if you are really using documents (PDF, Word) then your format will be Documents (Doc), and you may need to transform your documents to data (Documents to Data operator). Then you should have an Example Set that you can feed into your Performance operator.
Hope this helps - as the earlier comment said, without knowing the process, it is hard to tell exactly where the error is.

Related

How to get the passage from where the concept was found in concept insights using conceptual search?

I am using concept insights for a project of mine and I was able to do label and conceptual search successfully using the GET and POST commands.
As I was done with the conceptual search, I realized that they weren't showing the passage from where the related concepts were found. I would like to do something similar to what they have done in the concept insights demo where they show a small piece of text from where the concept and the related concepts were found(They have used the TED TALKS Corpus and I am using my own corpus).
Is there any way we can do that in Java ?
Thank you for your help.
You could try to use the Text Index Parameter of the Concept Insights output. This can be found in the explanation tag. I usually do a bit of post processing by taking 500 characters before the start of Text Index and 500 characters after the end of Text Index, trim the incomplete sentences at the beginning and end of the snippet I get, and use the result as the relevant snippet for the concept.

Java - Extracting plaintext from web-page source code (getting massive quantities of lyrics from website)

O community, I'm in the process of writing the pseudocode for an application that extracts song lyrics from a remote host (web-server, not my own) by reading the page's source code.
This is assuming that:
Lyrics are being displayed in plaintext
Portion of source code containing lyrics is readable by Java front-end application
I'm not looking for source code to answer the question, but what is the technical term used for querying a remote webpage for plaintext content?
If I can determine the webpage naming scheme, I could set the pointer of the URL object to the appropriate webpage, right? The only limitations would be irregular capitalization, and would only be effective if the plaintext was found in EXACTLY the same place.
Do you have any suggestions?
I was thinking something like this for "Buck 65", singing "I look good"
URL url = new URL(http://www.elyrics.net/read/b/buck-65-lyrics/i-look-good-lyrics.html);
I could substitute "buck-65-lyrics" & "i-look-good-lyrics" to reflect user input?
Input re-directed to PostgreSQL table
Current objective:
User will request name of {song, artist, album}, Java front-end will query remote webpage
Full source code (containing plaintext) will be extracted with Java front-end
Lyrics will be extracted from source code (somehow)
If song is not currently indexed by PostgreSQL server, will be added to table.
Operations will be made on the plaintext to suit the objectives of the program
I'm only looking for direction. If I'm headed completely in the wrong direction, please let me know. This is only for the pseudocode. I'm not looking for answers, or hand-outs, I need assistance in determining what I need to do. Are there external libraries for extracting plaintext that you know of? What technical names are there for what I'm trying to accomplish?
Thanks, Tyler
This approach is referred to as screen or data scraping. Note that employing it often breaks the target service's terms of service. Usually, this is not a robust approach, which is why API-like services with guarantees about how they operate are preferable.
Your approach sounds like it will work for the most part, but a few things to keep in mind.
If the web service you're interacting with requires a very precise URL scheme, you should not feed your user-provided data directly into it, since it is likely to be muddied by missing words, abbreviations, or misspellings. You might be better off doing some sort of search, first, and using that search's best result.
Reading HTML data is more complicated than you think. Use an existing library like jsoup to assist you.
The technical term to extract content from a site is web scraping, you can google that. There are a lot of online libraries, for java there is jsoup. Though its easy to write your own regex.
1st thing I would do i use curl and get the content from the site just for testing, this will give you a fair idea of what to do.
You will have to use a HTML parser. One of the most popular is jsoup.
Take care abut the legal aspect fo what you you do ;)

Create a pdf with text at given coordinates (PDFBox?)

My Situation:
I'm programming in java
Using a library from a person from my university I'm able to read pdfs and create a XML document out of it
This XML document contains additional informations e.g. the coordinates of the text in the original document
My Problem
I would like to create the read PDF again with the content set at its original coordinates (Again: I have the coordinates)
My Question:
-> Do you know a way to create a pdf and set the text of the pdf at given coordinates? <-
I'm doing a lot of research these days about, but maybe I tried the wrong google search terms since I cant find much helpful results. So i thought I might be able to ask here, in the forum where I found the most help so far in my young "programmers life" :)
Most of the results I get, even here, are about people trying to get the coordinates, but I already have them.
I heard during a discussion that PDFBox might be able to do this, but I'm also happy to work with any other framework or library that is capable for my problem.
Thanks for every help and thought you're sharing with me.
Thanks a lot for your comments. In the end I've decided for iText, which allowed me to do all my tasks (placing text at absolute coordinates, give it a background color by certain criterias) in a quite easy and efficient way.
If someone here is searching for inspiration and has a similar task, check my related post here on stackoverflow for some code snippets How can I add a background color to my (pdf-) text using iText to create it with Java

What technologies are there for formatted, structured data input and output?

I am working on a project here that ingests internal resumes from people at my company, strips out the skills and relevant content from them and stores it in a database. This was all done using docx4j and Grails. This required the resumes to first be submitted via a template that formatted everything just right so that the ingest tool knew what to look for to strip the data.
The 2nd portion of this, is what if we want to get out a "reduced" resume from the database. In other words, I want to search the uploaded content I now have, and only print out new resumes for people who have Java programming experience lets say. So I can go into my database, find the people who originally had java as a skill, and output a new set of resumes that are also still in a nice templated format, and only have the relevant info in them, instead of ALL the content.
I have been writing some software to do this in Java that will basically use a docx template, overwriting the items in customXML which are bound to the content controls in the doc, so the new data shows up and can eb saved as a new docx with that custom data.
This seems really cumbersome to me, and has some limitations. For one, lets say my template has a place for 3 Skills, and the particular person has 8 skills. There seems to be no good way to add those 5 additional skills to the docx other than painstakingly inserting the data with all of the formatting XML tags and such. This is a real pain, because if the template changes, I dont want to have to go back into my software and edit source code to change that additional data input XML tag to bold instead of italic.
I was doing some reading up on using Infopath to create a form that I could use to get the input, connecting to some sharepoint data source or something to store the stripped out data. However, I can't seem to find out if it is possible using sharepoint to get the data back out, in a nice formatted way. What would the general steps for this be? It seems like I couldnt find very much about this topic with any quick googling.
Thanks
You could set up the skills:
<skills>
<skill>..</skill>
<skill>..</skill>
and use a "repeat" content control pointing to the container. This would handle any number of <skill> entries.

How to webscrape scholar.google.com in Java?

I want to write a Java func grabTopResults(String f) such that grabTopResults("automata theory") returns me a list of the top 100 cited papers on scholar.google.com for "automata theory".
Does anyone have suggestions for what libraries will make my life easy?
Thanks!
As I'm sure Google can afford the bandwidth, I'll ignore the question of whether this is immoral/illegal/prohibited by Google's T&C
First thing you need to do is figure out what HTTP request (or requests) you need to issue in order to obtain the page with the data you need. Once you've figured this out, use HttpClient to issue the same request from Java code. The previous link shows example code that explains how to do this.
Once you've downloaded the content of the relevant page, you'll need to use a HTML parser to extract the data you're interested in. The Jericho parser suggested by peperg is a good choice.
If the Google police come knocking, you've never heard of me, OK?
I use http://jericho.htmlparser.net/docs/index.html . Google Scholar doesn't have API ( http://code.google.com/p/google-ajax-apis/issues/detail?id=109 ). Of course it is not allowed by Google (read terms of use. Automatic requestr are forbidden).
Below is a bit of example code which gets the titles on the first page using the open source product TestPlan. It is a standalone product, but if you really need it I could help you integrated it into your Java code (it is written in Java itself).
GotoURL http://scholar.google.com/
SubmitForm with
%Params:q% automate theory
end
set %Items% as response //div[#class='gs_r']
foreach %Item% in %Items%
set %Title% as selectIn %Item% h3
Notice %Title%
end
This produces output like the below (my IP is Germany, thus a german response). Obviously you could format it however you like, or write it to a file; this is just a rough test.
00000000-00 GOTOURL http://scholar.google.com/
00000001-00 SUBMITFORM default
00000002-00 NOTICE [ZITATION] Stochastic complexity in statistical inquiry theory
00000003-00 NOTICE AUTOMATED THEORY FORMATION IN MATHEMATICS1
00000004-00 NOTICE Constraint generation via automated theory formation
00000005-00 NOTICE [BUCH] Automated theorem proving: after 25 years
00000006-00 NOTICE [BUCH] Introduction to the Theory of Computation
00000007-00 NOTICE [ZITATION] Computer-controlled systems: theory and design
00000008-00 NOTICE [BUCH] … , randomness & incompleteness: papers on algorithmic information theory
00000009-00 NOTICE [BUCH] Automatic control systems
00000010-00 NOTICE [BUCH] VLSI physical design automation: theory and practice
00000011-00 NOTICE Singular Control Systems.

Categories

Resources