First off, let me say my language of preference is Java, but any language is acceptable for the answer since I know most languages.
Question: Say I have a link, http://ca.answers.yahoo.com/question/index?qid=20140218053709AAM0WfI (a random link for this example). Is there any possible way to get the title of the question and store it in a string, then get what the asker typed as the description for the question (i.e. this section here) and store it in a separate string? I know how to grab strings from a site, but the problem I run into is that I keep grabbing answers as well as the question.
Additional details:
I will not know the specific Yahoo answer ahead of time, so the code
needs to be able to work with all basic questions (i.e. ones without
pictures or other complications).
The code needs to work with all question/answer forum kinds of sites, not just Yahoo.
I'm not asking anyone to write the entire code or anything; I know that is not how the site works. I'm just asking: are there any specific functions that can easily obtain this information?
What you want is data scraping. You can use Beautiful Soup for that purpose. Below is a link to one of the tutorials:
http://kochi-coders.com/2011/05/30/lets-scrape-the-page-using-python-beautifulsoup/
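Since the asker prefers Java, the same idea can be sketched with plain java.util.regex: the key is to anchor the pattern to the question's own container so the answers never end up in the captured strings. The class names in this snippet are made up for illustration (the real Yahoo Answers markup was different), and a proper HTML parser is still the more robust choice:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class QuestionExtractor {
    // Return the first capture group of regex in html, or null if no match.
    static String firstGroup(String html, String regex) {
        Matcher m = Pattern.compile(regex, Pattern.DOTALL).matcher(html);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        // Hypothetical markup; real Q&A sites use different class names.
        String html = "<h1 class=\"title\">Why is the sky blue?</h1>"
                + "<div class=\"question-text\">I have always wondered this.</div>"
                + "<div class=\"answer\">Rayleigh scattering.</div>";
        // Targeting the question's own elements is what keeps the answers out.
        String title = firstGroup(html, "<h1 class=\"title\">(.*?)</h1>");
        String body  = firstGroup(html, "<div class=\"question-text\">(.*?)</div>");
        System.out.println(title); // Why is the sky blue?
        System.out.println(body);  // I have always wondered this.
    }
}
```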
Click here to see a screenshot of the assignment
Here is what the Navigation.csv looks like, where I take my data from.
https://pastebin.com/JXnaRTzi <-- click on the link for my code so far; I am reading the file and making objects from each entry.
Guys, I need help with this assignment. I chose to do it in Java, but it really doesn't matter. I need advice and help on the right approach to this kind of problem so it can work on a larger file. What kind of data structures should I use, and if it's no bother, maybe give me a solution. I am a new developer trying to get into backend work, and frankly I'm a bit lost.
Click on links above to see the task details.
Closed. This question needs to be more focused. It is not currently accepting answers.
Closed 9 years ago.
I want to implement a small search engine in Java, with a NoSQL database and XML, and install it on a site to search that site. I have some questions about it:
1. Is this really a good idea?
2. The most important question: where is the NoSQL database used? In this project the search engine takes a word from the user, searches for where this word is used, and returns those phrases to the user, so what is the role of the database here?
3. What is the role of XML?
4. What is the best search method for it?
5. I have read in these two links (first link and second one) that I should use Lucene or Solr. Can these two be used in this project, and how and where?
6. What is the best NoSQL database to use for it?
7. Is this a hard project?
I will be really, really thankful for your help.
I will try to give you my opinions, and I will be glad to have constructive feedback in the comments.
First of all, you are tackling a very soft topic and you may not like my point of view. The following points are numbered to answer your questions.
1) Yes and no. Yes, because you can do a smart search over the keywords stored in your HTML code, but you don't know how many pages you have to explore. Furthermore, your content may change dynamically, and the keywords could be potentially useless. This last part introduces the no part. No, because you need a way to know the content of the pages, like the questions here on Stack Overflow are marked with tags. I guess they are stored somewhere.
2) You take a word from the user, and you should run a "web spider" on your own web site to find where this word occurs. It will take time to open all the pages you have, search them, and filter them; eventually, if you write good enough code, you can parse a page in a few seconds. "Good" here means something like a map-reduce algorithm.
EDIT: Well, the point is pretty clear. You don't know what kind of string or input (call it x from now on) the user will enter. That said, you store it somewhere and you start your search:
You write a script that checks all the pages on your web site. This is a pretty bad idea. Keep considering the Stack Overflow example: how can you know exactly how many pages you have? Do you have a fixed number of pages (static)? Or does your content change dynamically (like the text and the number of pages on Stack Overflow)? To find out, you have to run an "algorithm" that opens all your pages and looks at the content.
You can look for a particular type of content, since you can use the keyword tag of HTML pages to constrain your search. If x is in the keywords you are done for a single page, and you have to loop the search until you have checked all your web pages. That is a waste of time and memory. Suppose constant time to open a socket to a web page, and say you have n pages, each containing m keywords, and that x contains l words: this takes roughly O(n*m*l) (without considering that you may want to analyze the whole page).
If you have a lot of resources, you can write this "algorithm" using a map-reduce model (map-reduce is explained quite well here).
Instead, if you use something like a tag system, simply mapping each tag to the pages it appears on and saving them in a simple table (in the simple case three columns: ID, TAG, PAGE), you can allow a fast search on your database by looking in the TAG column for x, which seems much faster.
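The tag-to-pages mapping described above can be sketched as a plain in-memory map in Java (the tags and page paths here are made up for illustration; a real site would back this with the database table):

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class TagIndex {
    // tag -> pages it appears on; the in-memory analogue of the ID/TAG/PAGE table.
    private final Map<String, List<String>> tagToPages = new HashMap<>();

    void add(String tag, String page) {
        tagToPages.computeIfAbsent(tag, k -> new ArrayList<>()).add(page);
    }

    // Lookup is a single map access instead of scanning every page.
    List<String> search(String tag) {
        return tagToPages.getOrDefault(tag, Collections.emptyList());
    }

    public static void main(String[] args) {
        TagIndex index = new TagIndex();
        index.add("java", "/questions/1");
        index.add("nosql", "/questions/2");
        index.add("java", "/questions/3");
        System.out.println(index.search("java")); // [/questions/1, /questions/3]
    }
}
```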
3) This question doesn't ring a bell for me. Instead: what will you do with XML? Do you want to put it somewhere? Are your pages in XML? Do you want to save the search results as XML?
4) I think Google already provides something like that. Anyhow, a good way to do it is to open each page, read the XML/HTML depending on the page, and run a regexp to match your word.
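The regexp match in point 4 might look like this in Java; the page snippet and search word are invented for the example:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class WordMatcher {
    // Count whole-word occurrences of word in text; \b is a word boundary,
    // so "Lucene" does not match inside a longer token like "LuceneX".
    static int count(String text, String word) {
        Matcher m = Pattern.compile("\\b" + Pattern.quote(word) + "\\b").matcher(text);
        int c = 0;
        while (m.find()) c++;
        return c;
    }

    public static void main(String[] args) {
        String page = "<p>Lucene is a search library. Searching with Lucene is fast.</p>";
        System.out.println(count(page, "Lucene")); // 2
        System.out.println(count(page, "Solr"));   // 0
    }
}
```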
5) The two links are self-explanatory; in the answers you will really find what you need.
6) No clue.
7) No, but you should define "hard". It will take you a long time to think about and find an appropriate design for it; then you will decide whether Lucene is good for you, whether you want to use SQL, or whatever.
O community, I'm in the process of writing the pseudocode for an application that extracts song lyrics from a remote host (web-server, not my own) by reading the page's source code.
This is assuming that:
Lyrics are being displayed in plaintext
Portion of source code containing lyrics is readable by Java front-end application
I'm not looking for source code to answer the question, but what is the technical term used for querying a remote webpage for plaintext content?
If I can determine the webpage naming scheme, I could set the pointer of the URL object to the appropriate webpage, right? The only limitations would be irregular capitalization, and it would only be effective if the plaintext was found in EXACTLY the same place.
Do you have any suggestions?
I was thinking something like this for "Buck 65", singing "I look good":
URL url = new URL("http://www.elyrics.net/read/b/buck-65-lyrics/i-look-good-lyrics.html");
Could I substitute "buck-65-lyrics" and "i-look-good-lyrics" to reflect user input?
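A hypothetical sketch of that substitution, assuming the slug rule is simply "lowercase, non-alphanumeric runs become single hyphens" (a guess from the example URL, not a documented scheme for that site):

```java
public class SlugBuilder {
    // Guessed slug rule: lowercase, collapse non-alphanumerics to hyphens,
    // trim leading/trailing hyphens. Real sites may differ.
    static String slug(String s) {
        return s.toLowerCase()
                .replaceAll("[^a-z0-9]+", "-")
                .replaceAll("(^-|-$)", "");
    }

    public static void main(String[] args) {
        String artist = slug("Buck 65") + "-lyrics";      // buck-65-lyrics
        String song   = slug("I look good") + "-lyrics";  // i-look-good-lyrics
        String url = "http://www.elyrics.net/read/b/" + artist + "/" + song + ".html";
        System.out.println(url);
        // http://www.elyrics.net/read/b/buck-65-lyrics/i-look-good-lyrics.html
    }
}
```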
Input re-directed to PostgreSQL table
Current objective:
User will request the name of {song, artist, album}; the Java front-end will query the remote webpage
Full source code (containing the plaintext) will be extracted with the Java front-end
Lyrics will be extracted from the source code (somehow)
If the song is not currently indexed by the PostgreSQL server, it will be added to the table.
Operations will be made on the plaintext to suit the objectives of the program
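The second step above (pulling down the page source) needs nothing beyond the Java standard library. A minimal sketch, with the actual network call shown only in a comment so the example stays self-contained:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

public class PageFetcher {
    // Read an entire stream into a String.
    static String readAll(InputStream in) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[4096];
        int n;
        while ((n = in.read(buf)) != -1) out.write(buf, 0, n);
        return new String(out.toByteArray(), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        // In the real program this would be the lyrics URL, e.g.:
        // String html = readAll(new java.net.URL("http://www.elyrics.net/...").openStream());
        String html = readAll(new ByteArrayInputStream(
                "<html>demo</html>".getBytes(StandardCharsets.UTF_8)));
        System.out.println(html); // <html>demo</html>
    }
}
```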
I'm only looking for direction. If I'm headed completely in the wrong direction, please let me know. This is only for the pseudocode. I'm not looking for answers, or hand-outs, I need assistance in determining what I need to do. Are there external libraries for extracting plaintext that you know of? What technical names are there for what I'm trying to accomplish?
Thanks, Tyler
This approach is referred to as screen or data scraping. Note that employing it often breaks the target service's terms of service. Usually, this is not a robust approach, which is why API-like services with guarantees about how they operate are preferable.
Your approach sounds like it will work for the most part, but a few things to keep in mind.
If the web service you're interacting with requires a very precise URL scheme, you should not feed your user-provided data directly into it, since it is likely to be muddied by missing words, abbreviations, or misspellings. You might be better off doing some sort of search, first, and using that search's best result.
Reading HTML data is more complicated than you think. Use an existing library like jsoup to assist you.
The technical term for extracting content from a site is web scraping; you can Google that. There are a lot of libraries online; for Java there is jsoup, though it's easy to write your own regex.
The first thing I would do is use curl to get the content from the site, just for testing; this will give you a fair idea of what to do.
You will have to use an HTML parser. One of the most popular is jsoup.
Take care about the legal aspect of what you do ;)
I can make a login easily, so that's not the problem. My problem is that I don't know how to check whether the user's name and password are correct. I had a few ideas, so here they are:
1) Saved in the game, updated every time someone registers -> not practical
2) A MySQL database with something -> I'm just too stupid for that.
3) A website (PHP) that takes ?name= and &password= in the URL; if the pair exists it echoes true, else false. Then when I want to log in, I just connect to that page (the user won't see that, of course) and read what it returns. I think this is the best idea for me, but I don't know how to connect to the website and read what it says.
Just to make it clear, I have a domain and a website.
You're correct that #1 is really impractical and #3 seems effectively the same as you'd need to store your collection of username / password pairs somewhere. You should really consider #2. At some point we all felt "too stupid" for something new, but check it out, do some tutorials, and I'm sure you'll be well on your way.
An appropriate solution should include password hashing (ideally with a salt). Since you mentioned PHP, check out crypt. Also take a look at PDO, probably the best (in my opinion) MySQL interface for PHP. Again, take a look at the official docs here.
You seem like you're quite new to this, so some of that may be over your head at first. If so, just Google around with those keywords and you'll be sure to find many great tutorials.
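If the checking ends up being done on the Java side instead of PHP, a salted-hash sketch using only the standard library (PBKDF2) looks like this. The password, iteration count, and key length here are illustrative, not a drop-in auth system:

```java
import javax.crypto.SecretKeyFactory;
import javax.crypto.spec.PBEKeySpec;
import java.security.SecureRandom;
import java.util.Base64;

public class PasswordHasher {
    // Derive a salted hash; store hash + salt, never the plain password.
    static String hash(char[] password, byte[] salt) throws Exception {
        PBEKeySpec spec = new PBEKeySpec(password, salt, 65536, 256);
        byte[] key = SecretKeyFactory.getInstance("PBKDF2WithHmacSHA256")
                                     .generateSecret(spec).getEncoded();
        return Base64.getEncoder().encodeToString(key);
    }

    public static void main(String[] args) throws Exception {
        byte[] salt = new byte[16];
        new SecureRandom().nextBytes(salt);
        String stored = hash("s3cret".toCharArray(), salt);
        // Login check: re-hash the attempt with the stored salt and compare.
        System.out.println(stored.equals(hash("s3cret".toCharArray(), salt))); // true
    }
}
```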
I'm back with a question. I'm playing with RapidMiner for automatic text classification and can't get it to work. I'm getting an error that says, "no example set in the example, offending operator Performance". Any idea what that is referring to?
In RapidMiner you have to use the converter components before using something as an example set. So, if you have an output of type 'doc', for example, you have to use the 'Documents to Data' component in order to link it to the next 'exa' input. That's all!
Could you provide more details about your RapidMiner text mining process?
Without more context, your question is difficult to answer.
For more help with RapidMiner, you may want to check out the RapidMiner user forum: http://forum.rapid-i.com/
At RapidMiner Resources, you can find RapidMiner tutorial videos about how to do text mining with RapidMiner:
http://rapidminerresources.com/index.php?page=text-mining-3
Rapid-I also offers a 90-minute text mining webinar. You can find it on the Rapid-I web page under "services" and "training", or in the web shop.
I hope these links help you to get started with automatic text classification with RapidMiner. If you provide more details about your RapidMiner text mining process, I may also be able to directly answer your question.
If it says that there is no Example Set, then the issue is probably with your original data. Can you post an image of your process?
For instance, make sure that you have connected the initial input to your operator: which two operators does the error occur between?
One thought: the example set in text mining is usually your document collection, but if you are really using documents (PDF, Word) then your format will be Documents (Doc), and you may need to transform your documents to data (Documents to Data operator). Then you should have an Example Set that you can feed into your Performance operator.
Hope this helps - as the earlier comment said, without knowing the process, it is hard to tell exactly where the error is.