Closed. This question needs to be more focused. It is not currently accepting answers. Closed 9 years ago.
This may be too broad of a question, but how does one access a Bible document in an Android application? Is it a text file that has an index to find specific verses or is it something much more complicated? I hope that is enough to answer from.
The first step would be to actually find a structured bible dataset from somewhere.
You can search for an XML version of your favourite translation and download that.
Once you've got it (as XML, JSON, or whatever format), you can write some Java code to parse the file and load it into an appropriate data structure that lets you do what you want with it efficiently (e.g. search it by verse).
You could even put it into a database (e.g. MySQL or MongoDB), which would allow you to search it efficiently.
But really, how you want to structure the data depends on how you're going to use it, and on what formats it's already available in (since it can be a pain to clean up the XML).
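A minimal sketch of that idea, assuming a hypothetical XML layout with <book>, <chapter> and <verse> elements (the real structure depends on whichever dataset you find):

```java
import org.w3c.dom.*;
import javax.xml.parsers.DocumentBuilderFactory;
import java.io.File;
import java.util.HashMap;
import java.util.Map;

public class BibleLoader {
    // book -> chapter -> verse -> text
    private final Map<String, Map<Integer, Map<Integer, String>>> bible = new HashMap<>();

    // Assumes a hypothetical layout:
    // <bible><book name="Genesis"><chapter num="1"><verse num="1">...</verse>...</chapter>...</book></bible>
    public void load(File xmlFile) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().parse(xmlFile);
        NodeList books = doc.getElementsByTagName("book");
        for (int b = 0; b < books.getLength(); b++) {
            Element book = (Element) books.item(b);
            Map<Integer, Map<Integer, String>> chapters = new HashMap<>();
            bible.put(book.getAttribute("name"), chapters);
            NodeList chapterNodes = book.getElementsByTagName("chapter");
            for (int c = 0; c < chapterNodes.getLength(); c++) {
                Element chapter = (Element) chapterNodes.item(c);
                Map<Integer, String> verses = new HashMap<>();
                chapters.put(Integer.parseInt(chapter.getAttribute("num")), verses);
                NodeList verseNodes = chapter.getElementsByTagName("verse");
                for (int v = 0; v < verseNodes.getLength(); v++) {
                    Element verse = (Element) verseNodes.item(v);
                    verses.put(Integer.parseInt(verse.getAttribute("num")), verse.getTextContent());
                }
            }
        }
    }

    // Look up e.g. ("John", 3, 16); returns null if not present.
    public String getVerse(String book, int chapter, int verse) {
        Map<Integer, Map<Integer, String>> chapters = bible.get(book);
        if (chapters == null) return null;
        Map<Integer, String> verses = chapters.get(chapter);
        return verses == null ? null : verses.get(verse);
    }
}
```

On Android you'd typically bundle such a file in the app's assets and open it via the AssetManager, but the parsing and lookup idea is the same.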
You might find the following resources useful:
Web Service APIs to directly get verses: http://www.4-14.org.uk/xml-bible-web-service-api
These would mean avoiding a lot of the headaches of dealing with file formats, indexing, and all kinds of other stuff.
Web service APIs generally work by your program submitting a query to a website (e.g. including the biblical reference), and you get back some structured data (e.g. XML/JSON) containing the verse(s) you requested; see the client sketch after these resources.
Download a structured offline copy: http://www.bibletechnologies.net./osistext/
This would mean you have to find, download, parse, and index your own data structure for dealing with the text, but it would be much faster (if done right) than using a web service to do it.
The link I posted here has only some example books from the bible, but if you look you'll find more around the web.
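For the web-service route, the client sketch mentioned above is just an HTTP request plus parsing; the endpoint and parameters below are purely hypothetical, so check whichever API you actually pick:

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;
import java.net.URLEncoder;

public class VerseClient {
    public static void main(String[] args) throws Exception {
        // Hypothetical endpoint and parameters; real APIs differ.
        String reference = URLEncoder.encode("John 3:16", "UTF-8");
        URL url = new URL("https://example.org/api/verses?ref=" + reference + "&format=json");
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setRequestMethod("GET");

        StringBuilder body = new StringBuilder();
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(conn.getInputStream(), "UTF-8"))) {
            String line;
            while ((line = in.readLine()) != null) {
                body.append(line);
            }
        }
        // From here you'd hand the JSON/XML to a parser (e.g. org.json or Jackson).
        System.out.println(body);
    }
}
```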
It completely depends on the format of the file.
Any book or text document has multiple ways it can be stored and distributed. It could simply be a .pdf file, or it could be stored as XML or .epub.
The question is beyond broad: there are so many ways to do it that it's impossible to guess without more information.
This link has some information about the e-book formats:
http://en.wikipedia.org/wiki/Comparison_of_e-book_formats
And that's just one small subsection of ways text can be stored.
Closed. This question needs to be more focused. It is not currently accepting answers. Closed 3 years ago.
I'm writing a service which stores millions of files (20-30 MB each) on disk, and I need to write a search function to find a file by name (there is no need to search file content) or to view files in an explorer-like way (for example, navigating a folder structure in the browser). I want to make it fast, reliable and simple in Java. Say I plan to run two services, both of which can be used to upload a file or search files by name pattern. What would be the best technology/approach to implement this? Store the file on disk and the path and name in a database, search against the database and fetch findings by path from the database? Any other good ideas? I thought about Elasticsearch, but it looks like a heavy solution.
This question is too broad and not really in the usual SO format (concrete programming questions, mostly with code snippets, that address a specific technical difficulty with a given set of technologies).
There are many ways to fulfill your requirements. Yet, based solely on the information presented in your question, it's impossible to recommend something because we don't really know your requirements. I'll explain:
I plan to run two services both of which can be used to upload a file or search files by name pattern.
Does this mean that the file system has to be distributed?
If so, consider cloud solutions like AWS S3.
If you can't run in the cloud, here you can find a comprehensive list of distributed filesystems.
Elasticsearch can also work, of course, as a search engine, but it's more of a full-fledged search engine, so it looks like overkill to me in this case.
You might want to work directly with Lucene so that you won't need to run an additional process that might also fail (ES is built on top of Lucene). Lucene will store its index directly on the filesystem, again if that meets the requirements.
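A minimal sketch of that Lucene idea, indexing only each file's name and path (class names are from recent Lucene releases; check them against the version you use):

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.search.WildcardQuery;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import java.nio.file.Paths;

public class FileNameIndex {
    public static void main(String[] args) throws Exception {
        Directory dir = FSDirectory.open(Paths.get("/tmp/name-index"));

        // Index a file's name and path (you'd call this for every uploaded file).
        try (IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(new StandardAnalyzer()))) {
            Document doc = new Document();
            doc.add(new TextField("name", "monthly-report-2021.pdf", Field.Store.YES));
            doc.add(new StringField("path", "/data/a/b/monthly-report-2021.pdf", Field.Store.YES));
            writer.addDocument(doc);
        }

        // Search by name pattern.
        try (DirectoryReader reader = DirectoryReader.open(dir)) {
            IndexSearcher searcher = new IndexSearcher(reader);
            TopDocs hits = searcher.search(new WildcardQuery(new Term("name", "*report*")), 10);
            for (ScoreDoc hit : hits.scoreDocs) {
                System.out.println(searcher.doc(hit.doc).get("path"));
            }
        }
    }
}
```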
You also mention a database - again a possible direction, especially if you already have one in your project. In general, relational database management systems have some support for searching, but there are more advanced options: in PostgreSQL, for example, you have GIN indexes (inverted indexes), the same concept used for full-text search, which go way beyond standard SQL's LIKE operator.
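And a small sketch of the database route; the table and column names are made up, the LIKE query is standard SQL, and the GIN/trigram index in the comment is PostgreSQL-specific (it needs the pg_trgm extension):

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

public class FileLookup {
    public static void main(String[] args) throws Exception {
        // Hypothetical schema:
        //   CREATE TABLE files (id BIGSERIAL PRIMARY KEY, name TEXT NOT NULL, path TEXT NOT NULL);
        // On PostgreSQL, pattern search can be sped up with a trigram GIN index (pg_trgm extension):
        //   CREATE INDEX files_name_trgm ON files USING GIN (name gin_trgm_ops);
        try (Connection conn = DriverManager.getConnection(
                 "jdbc:postgresql://localhost:5432/filedb", "user", "secret");
             PreparedStatement ps = conn.prepareStatement(
                 "SELECT path FROM files WHERE name LIKE ? ORDER BY name LIMIT 100")) {
            ps.setString(1, "%report%");
            try (ResultSet rs = ps.executeQuery()) {
                while (rs.next()) {
                    System.out.println(rs.getString("path"));
                }
            }
        }
    }
}
```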
Yet another idea: go with a local disk. If you're on Linux, there is an indexing utility called "locate" that you can delegate the index creation to.
So the choice is yours.
Closed. This question needs to be more focused. It is not currently accepting answers. Closed 3 years ago.
I have a report generated from a database containing about 100,000 entries; each entry has about 10 columns. The data is stored on Amazon S3 and is generated monthly. I'm looking for some pointers on how you would recommend presenting this many pages of data. I want it to be sortable, and because no single sort order will suit all users, ideally it should be searchable as well.
Is it possible to do this purely client-side, or is that unfeasible and I need to go back to the server? I don't have the database available, but if need be, the website is backed by a Java servlet application running on Tomcat. A self-contained library for doing this would be very useful.
To paraphrase the discussion above:
Providing search/paging in JavaScript is not sensible, because it would still require the user to download all the data in one go, and representing that amount of data in HTML is not going to work well.
So you either have to provide a server backend with a mechanism for searching and paging, or provide the data in a spreadsheet format so the user can use the capabilities of their spreadsheet tool, which is well suited to dealing with large volumes of data.
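For the spreadsheet route, a rough sketch of a servlet that streams the report as a CSV download (the data source below is a placeholder; in practice you'd read the monthly report from S3 or wherever it lives):

```java
import javax.servlet.annotation.WebServlet;
import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;
import java.io.IOException;
import java.io.PrintWriter;
import java.util.Arrays;
import java.util.List;

@WebServlet("/report.csv")
public class ReportExportServlet extends HttpServlet {
    @Override
    protected void doGet(HttpServletRequest req, HttpServletResponse resp) throws IOException {
        resp.setContentType("text/csv");
        resp.setCharacterEncoding("UTF-8");
        // Tells the browser to download rather than render, so it opens in Excel/Calc.
        resp.setHeader("Content-Disposition", "attachment; filename=\"report.csv\"");

        PrintWriter out = resp.getWriter();
        out.println("id,name,amount");                 // header row
        for (List<String> row : loadRows()) {          // placeholder data source
            out.println(String.join(",", row));        // real data may need quoting/escaping
        }
    }

    // Placeholder; replace with code that streams the monthly report from S3.
    private List<List<String>> loadRows() {
        return Arrays.asList(Arrays.asList("1", "widget", "9.99"));
    }
}
```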
I'm going to try the spreadsheet idea.
Closed. This question does not meet Stack Overflow guidelines. It is not currently accepting answers. Closed 8 years ago.
I am creating a Spring application and I have the need to integrate with Wikipedia. In particular, I would like to extract data on a given (large) set of Cities, e.g. country, website and coordinates.
I am trying to understand which libraries or frameworks could be useful, but the big issue I am dealing with is that there is no consistent structure for the pages I would like to extract information from. The main problem is not that some information is missing, which would be totally acceptable, but rather that the city representation changes from city to city. E.g. the DBpedia ontology (e.g. http://dbpedia.org/ontology/City) does not reflect what I can extract via a SPARQL query from dbpedia.org/sparql. As a result, I don't know how to extract the data I need systematically (i.e. for my entire set).
Is there any technology that can support my task of extracting data on a predefined set of cities?
One solution could be to put in place some Natural Language Processing to look for the required info in the entire Wikipedia page, but that requires a lot of effort if I have to write it on my own.
Another solution could be leveraging a source of structured data that pre-processed Wikipedia for me and gave some structure to the contained information, but I could not find one.
A third solution could be trying to make different queries to Wikipedia, but I cannot figure out a way to extract the information I need via the Wikipedia APIs.
Data from Wikipedia is being transferred to Wikidata. Using their API you could get what you want. If you want a shortcut, you could use the Wikidata query tool: http://wdq.wmflabs.org/api_documentation.html
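A minimal sketch of calling the Wikidata API from Java; the entity ID (Q84 for London) and the property IDs mentioned in the comments (P17 country, P625 coordinates, P856 official website) are how Wikidata models these fields as far as I know, so verify them for your own set:

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;

public class WikidataClient {
    public static void main(String[] args) throws Exception {
        // Fetch all claims for one entity (Q84 = London); you'd map your city list to entity IDs first.
        URL url = new URL("https://www.wikidata.org/w/api.php"
                + "?action=wbgetentities&ids=Q84&props=claims&format=json");
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setRequestProperty("User-Agent", "city-importer/0.1 (contact@example.org)");

        StringBuilder json = new StringBuilder();
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(conn.getInputStream(), "UTF-8"))) {
            String line;
            while ((line = in.readLine()) != null) {
                json.append(line);
            }
        }
        // Parse with a JSON library (Jackson, Gson, ...) and read the claims you need,
        // e.g. P17 (country), P625 (coordinate location), P856 (official website).
        System.out.println(json);
    }
}
```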
I'm not a Java guy, but I did something like this in .NET.
You need some kind of web scraping framework.
In .NET there is HtmlAgilityPack, where you fetch the site and then, with e.g. XPath, go through the elements of the page. Of course you need to know where on the site the information is - that could be the tags around the heading, the text, and so on.
For Java, the frameworks I just found were (a jsoup sketch follows the list):
Tag Soup
HtmlUnit
Web-Harvest
jARVEST
jsoup
Jericho HTML Parser
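For example, a small jsoup sketch that pulls label/value rows out of a Wikipedia infobox; the table.infobox selector matches many city pages but not all, which is exactly the inconsistency the question describes:

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class InfoboxScraper {
    public static void main(String[] args) throws Exception {
        Document doc = Jsoup.connect("https://en.wikipedia.org/wiki/Berlin")
                .userAgent("Mozilla/5.0 (city-scraper demo)")
                .get();

        // Wikipedia infoboxes are usually a table with class "infobox";
        // each row is a <tr> with a <th> label and a <td> value.
        for (Element row : doc.select("table.infobox tr")) {
            Element label = row.select("th").first();
            Element value = row.select("td").first();
            if (label != null && value != null) {
                System.out.println(label.text() + " = " + value.text());
            }
        }
    }
}
```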
Closed. This question is opinion-based. It is not currently accepting answers. Closed 9 years ago.
So I am trying to write a program which can collect certain information from different articles and combine them. The step in which I am having trouble is extracting the article from the web page.
I was wondering whether you could provide any suggestions to java libraries/methods for extracting text from a web page?
I have also found this product:
http://www.diffbot.com/products/automatic/article/
and was wondering whether you think this is the way to go? If so, can someone point me to a Java implementation - I cannot seem to find one, although apparently it exists.
Many thanks
Clarification - I am really looking for an algorithm/library/method for detecting where in an HTML DOM tree a block of text that could be an article is located, like Safari's Reader function.
PS: if you think this is much easier done in something like Python, just say so - although my program has to run in Java (it should eventually run on a server using a Java framework), I could have it make use of Python scripts, but I would only do this if you advise that Python is the way to go.
Have a look at Apache Tika. It's meant to be used together with a crawler and can extract both text and metadata for you. You can also select various output types.
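A minimal sketch using Tika's simple facade; note that Tika extracts all the text on the page rather than detecting the article body (that is what boilerpipe, mentioned below, adds):

```java
import org.apache.tika.Tika;
import java.net.URL;

public class TikaTextExtractor {
    public static void main(String[] args) throws Exception {
        Tika tika = new Tika();
        // Detects the content type and extracts plain text from the document at the URL.
        String text = tika.parseToString(new URL("https://example.com/some-article.html"));
        System.out.println(text);
    }
}
```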
I have found an open source solution which was extremely highly rated.
https://code.google.com/p/boilerpipe/
A review on different text extraction algorithms:
http://tomazkovacic.com/blog/122/evaluating-text-extraction-algorithms/
It appears that Diffbot performs very well but is not open source. So in terms of open source, boilerpipe could be the way to go.
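Using boilerpipe is roughly a one-liner; ArticleExtractor is the extractor tuned for article/news pages:

```java
import de.l3s.boilerpipe.extractors.ArticleExtractor;
import java.net.URL;

public class ArticleTextExtractor {
    public static void main(String[] args) throws Exception {
        URL url = new URL("https://example.com/some-article.html");
        // Heuristically strips navigation, ads and other boilerplate, keeping the main article text.
        String articleText = ArticleExtractor.INSTANCE.getText(url);
        System.out.println(articleText);
    }
}
```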
This is not the answer to every piece of malformed HTML you may get, but most of the time JTidy does a good job of cleaning the HTML and giving you an interface for accessing the various DOM nodes, and with that, access to the text inside those nodes.
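A small JTidy sketch of that approach, cleaning the HTML into a DOM and then walking the nodes yourself (which tags you pull out depends on the page):

```java
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.w3c.tidy.Tidy;
import java.io.InputStream;
import java.net.URL;

public class TidyExtractor {
    public static void main(String[] args) throws Exception {
        Tidy tidy = new Tidy();
        tidy.setQuiet(true);
        tidy.setShowWarnings(false);

        try (InputStream in = new URL("https://example.com/some-article.html").openStream()) {
            Document dom = tidy.parseDOM(in, null);   // cleans the markup and returns a DOM
            NodeList paragraphs = dom.getElementsByTagName("p");
            for (int i = 0; i < paragraphs.getLength(); i++) {
                Node text = paragraphs.item(i).getFirstChild();
                if (text != null) {
                    System.out.println(text.getNodeValue());
                }
            }
        }
    }
}
```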
Closed. This question needs to be more focused. It is not currently accepting answers. Closed 9 years ago.
What I want to do is, given a URL to a CSV file, build a web service that takes the URL as a parameter, downloads the file, parses the data in it, and finally visualizes that data. The web service is supposed to work with any given URL pointing to a CSV file. I'm trying to build the web service using JAX-RS right now. Any hints on a slightly more detailed architecture that could work for this purpose? Thanks in advance!
This is a very broad question, so I will outline how you could do it. I've done this before with XML and NOAA weather prediction data.
Make sure the URL you are parsing from contains pure .csv/.xml data, e.g. NOAA data. For testing purposes, it's a good idea to download the .csv directly from the site and write code doing IO on the local .csv file, but once you're done with that, it's much better to just read directly from the URL. I'm not sure what it's like for REST services, but for SOAP, the URL contains input parameters, so you can specify everything from the longitude/latitude of a location to date ranges, etc.
Use a CSV parsing library, or follow a tutorial.
Store the parsed items. You could use a multi-dimensional array for testing, and transition to a database for long-term storage and processing.
Process/visualize the data. Once you have the data structure set up, you can get creative with visualizing the data.
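A bare-bones sketch of the download-and-parse step in plain Java (real CSV can contain quoted fields with embedded commas, which is why a proper CSV library is the better choice):

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;
import java.util.ArrayList;
import java.util.List;

public class CsvDownloader {
    public static List<String[]> fetch(String csvUrl) throws Exception {
        List<String[]> rows = new ArrayList<>();
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(new URL(csvUrl).openStream(), "UTF-8"))) {
            String line;
            while ((line = in.readLine()) != null) {
                rows.add(line.split(",", -1));   // naive split; use opencsv or similar for quoted fields
            }
        }
        return rows;
    }

    public static void main(String[] args) throws Exception {
        for (String[] row : fetch("https://example.com/data.csv")) {
            System.out.println(String.join(" | ", row));
        }
    }
}
```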
Go and read some tutorials; this is a very general question. Don't expect people to write all the code for your service. Here is a tutorial link: http://docs.oracle.com/cd/E17802_01/webservices/webservices/docs/1.6/tutorial/doc/
And here is an explanation of how to download a CSV file and save it in Java: Programmatically Downloading CSV Files with Java
Here is a nice tutorial about how to read and parse a CSV file in Java: http://www.mkyong.com/java/how-to-read-and-parse-csv-file-in-java/
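Since the question mentions JAX-RS, here is a hedged sketch of what the resource class could look like; the path, parameter name and returned structure are all just illustrative, and it reuses the hypothetical CsvDownloader helper sketched above:

```java
import javax.ws.rs.GET;
import javax.ws.rs.Path;
import javax.ws.rs.Produces;
import javax.ws.rs.QueryParam;
import javax.ws.rs.core.MediaType;
import javax.ws.rs.core.Response;
import java.util.List;

@Path("/csv")
public class CsvResource {

    // e.g. GET /csv/summary?url=https://example.com/data.csv
    @GET
    @Path("/summary")
    @Produces(MediaType.APPLICATION_JSON)
    public Response summarize(@QueryParam("url") String url) {
        if (url == null || url.isEmpty()) {
            return Response.status(Response.Status.BAD_REQUEST)
                           .entity("{\"error\":\"missing url parameter\"}")
                           .build();
        }
        try {
            List<String[]> rows = CsvDownloader.fetch(url);   // the helper sketched above
            String json = "{\"rows\":" + rows.size() + "}";   // stand-in for real visualization data
            return Response.ok(json).build();
        } catch (Exception e) {
            return Response.serverError().entity("{\"error\":\"could not read CSV\"}").build();
        }
    }
}
```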