How do I identify a JanusGraph's name? - java

I want to port a social networking application from SQL to JanusGraph. I'll be building the backend in Java because it has excellent documentation on JanusGraph's official website. I have some beginner questions.
JanusGraph graph = JanusGraphFactory.open("my_setup.properties");
Is the .properties file the only identifier used to access a graph, or is it the file path? (In SQL we have a name for a database. Is there anything like a graph name?)
If I make a copy of the properties file with the same preferences and rename it to my_setup_2.properties, will it access the same graph or will it create a new graph?
Is there any way I can identify, from my storage backend or search backend, which vertices belong to which graph?
For what kind of queries is the storage backend used, and for what kind of queries is the search backend used?
Is there any way to dump my database (for porting the graph from one server to another, just like a SQL dump)?
I have only found hosting service providers for JanusGraph 0.1.1, which is outdated (the latest is 0.2.1, which supports the latest Elasticsearch). If I go to production with JanusGraph 0.1.1, how badly will it affect me if I use Elasticsearch as the search backend?

Is the .properties file the only identifier used to access a graph, or is it the file path? (In SQL we have a name for a database. Is there anything like a graph name?)
JanusGraph has pluggable storage and index backends. The .properties file just tells JanusGraph which backends to use and how they are configured. Different graph instances simply point to different storage folders, indexes, etc. Looking at the documentation for the config file, it seems you can also specify a graph.graphname, which can be used with the ConfiguredGraphFactory to open a graph like this: ConfiguredGraphFactory.open("graphName").
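As a minimal sketch (the backend choice and directory below are assumptions, not from the original question), the identity of a graph is whatever the configuration points at, not the name of the file that holds it:

import org.janusgraph.core.JanusGraph;
import org.janusgraph.core.JanusGraphFactory;

public class OpenGraphSketch {
    public static void main(String[] args) {
        // Equivalent to a my_setup.properties file containing:
        //   storage.backend=berkeleyje
        //   storage.directory=/tmp/socialgraph
        JanusGraph graph = JanusGraphFactory.build()
                .set("storage.backend", "berkeleyje")
                .set("storage.directory", "/tmp/socialgraph")
                .open();

        // Any other configuration (file or builder) that points at the same
        // storage opens the same graph, whatever the file is called.
        graph.close();
    }
}

ConfiguredGraphFactory.open("graphName") additionally requires a ConfigurationManagementGraph to be set up on the server; the graph.graphname is then just a label stored in that management graph.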
If I make a copy of the properties file with the same preferences and rename it to my_setup_2.properties, will it access the same graph or will it create a new graph?
Yes, it will access the same data and hence the same graph.
Is there any way I can identify, from my storage backend or search backend, which vertices belong to which graph?
I don't know exactly for every storage backend, but in the case of Elasticsearch, the indices created by JanusGraph are prefixed with janusgraph. I think there are similar mechanisms for other backends.
For what kind of queries is the storage backend used, and for what kind of queries is the search backend used?
The index backend is used whenever you add a has step on a property indexed with a mixed index. I think all other queries, including a has step on a property covered by a composite index, will use the storage backend. For OLAP workloads you can even plug Spark or Giraph into your storage backend to do the heavy lifting.
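As a hedged sketch of that distinction (the property names, index names, and the "search" index alias below are assumptions), a composite index answers exact-match has steps from the storage backend, while a mixed index delegates richer predicates to the index backend:

import org.apache.tinkerpop.gremlin.structure.Vertex;
import org.janusgraph.core.JanusGraph;
import org.janusgraph.core.PropertyKey;
import org.janusgraph.core.attribute.Text;
import org.janusgraph.core.schema.JanusGraphManagement;

public class IndexSketch {
    static void defineSchema(JanusGraph graph) {
        JanusGraphManagement mgmt = graph.openManagement();
        PropertyKey name = mgmt.makePropertyKey("name").dataType(String.class).make();
        PropertyKey bio  = mgmt.makePropertyKey("bio").dataType(String.class).make();

        // Composite index: exact-match lookups, served by the storage backend.
        mgmt.buildIndex("byNameExact", Vertex.class).addKey(name).buildCompositeIndex();

        // Mixed index: full-text / range predicates, served by the index backend
        // (the "search" argument must match the index backend name in the config).
        mgmt.buildIndex("byBioText", Vertex.class).addKey(bio).buildMixedIndex("search");

        mgmt.commit();
    }

    static void query(JanusGraph graph) {
        // Uses the composite index (storage backend).
        graph.traversal().V().has("name", "alice").toList();

        // Uses the mixed index (index backend, e.g. Elasticsearch).
        graph.traversal().V().has("bio", Text.textContains("hiking")).toList();
    }
}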
Is there any way to dump my database (for porting the graph from one server to another, just like a SQL dump)?
Graphs can be exported to and imported from graph file formats like GraphML. This also lets you interface with other graph tools, such as Gephi. You won't be able to take a SQL dump from your SQL database and import it directly into JanusGraph, though. If you plan to load a lot of vertices and edges at once, please go through the documentation on bulk loading.
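For example, a minimal export/import sketch using the TinkerPop IO API (the file name is made up):

import org.apache.tinkerpop.gremlin.structure.io.IoCore;
import org.janusgraph.core.JanusGraph;
import org.janusgraph.core.JanusGraphFactory;

public class GraphMlDump {
    public static void main(String[] args) throws Exception {
        JanusGraph graph = JanusGraphFactory.open("my_setup.properties");

        // Dump the whole graph to a GraphML file...
        graph.io(IoCore.graphml()).writeGraph("social-graph.graphml");

        // ...and load it again on another server / another JanusGraph instance:
        // JanusGraph otherGraph = JanusGraphFactory.open("my_setup_2.properties");
        // otherGraph.io(IoCore.graphml()).readGraph("social-graph.graphml");

        graph.close();
    }
}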
I have only found hosting service providers for JanusGraph 0.1.1, which is outdated (the latest is 0.2.1, which supports the latest Elasticsearch). If I go to production with JanusGraph 0.1.1, how badly will it affect me if I use Elasticsearch as the search backend?
I don't know of any hosting providers for JanusGraph 0.2.x. You will, however, easily find hosted services for the pluggable storage backends compatible with JanusGraph 0.2.x.

Related

Relation of SOLR to DB to App in a Text Search Engine

I recently overheard a few coworkers talking about an article one of them had read involving the use of SOLR in conjunction with a database and an app to provide a "super-charged" text search engine for the app itself. From what I could make out, SOLR is a web service that exposes Lucene's text searching capabilities to a web-enabled app.
I wasn't able to find the article they were talking about, but doing a few relevant Google searches chalks up several super-abstract articles on text search engines using SOLR.
What I'm wondering is: what's the relationship between all 3 components here?
Who calls who? Does Lucene somehow regularly extract and cache text data from the DB, and then the app queries SOLR for Lucene's text content? What's a typical software stack/setup for a Java-based, SOLR-powered text search engine? Thanks in advance!
You're right in your basic outline here: SOLR is a webservice and syntax helper that sits on top of Lucene.
Essentially, SOLR is configured to index specific data based on a number of configuration options (including weighting, string manipulation, etc.). SOLR can either be pointed at a DB as its source of data to index, or individual documents (e.g., XML files) can be submitted via the web API for indexing.
A web application would typically make an HTTP(s) request to the SOLR API, and SOLR would return indexed data that matches the query. For all intents and purposes, the web app sees SOLR as an HTTP API; it doesn't need to be aware of Lucene in any way. So essentially, the data flow looks like:
Website --> SOLR API --> indexed datasource (DB or document collection)
In terms of "when" SOLR looks at the DB to index new or updated data, this can be configured in a number of ways, but is most typically triggered by calling a specific function of the SOLR API that causes a reindex. This could occur manually, via a scheduled job, programmatically from the web app, etc.
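As a rough illustration of that flow from the web app's side, a Java application would just issue an HTTP query through SolrJ; the Solr URL, core name, and field names below are assumptions:

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.common.SolrDocument;

public class SolrSearchExample {
    public static void main(String[] args) throws Exception {
        // The core name "articles" and the field "body" are hypothetical.
        HttpSolrClient solr = new HttpSolrClient.Builder("http://localhost:8983/solr/articles").build();

        SolrQuery query = new SolrQuery("body:\"super-charged\"");
        query.setRows(10);

        QueryResponse response = solr.query(query);
        for (SolrDocument doc : response.getResults()) {
            System.out.println(doc.getFieldValue("id"));
        }
        solr.close();
    }
}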
This is what I understood when I started implementing it for my project:
SOLR can be described as a middleman between your application server and the DB. SOLR runs its own server (Jetty), which is up and listening for any request coming from your app server.
Your application server calls SOLR, giving it the module name and the search pattern.
SOLR is fed some XML config files which tell it which table of your schema has to be cached (or indexed) for the given module name.
SOLR uses Lucene's text search capabilities to interpret the "search pattern" and get the desired result from the already cached/indexed data.
SOLR indexing (full or partial) can be done manually (by executing commands through GET URLs, as sketched below) or at regular intervals using the SOLR config files.
You can refer to the Apache SOLR site for more information.
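To make the GET-URL point concrete, here is a hedged sketch of triggering a full reindex over HTTP, assuming the DataImportHandler is configured for a core named articles (the host, core, and handler path are assumptions):

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class SolrReindexTrigger {
    public static void main(String[] args) throws Exception {
        HttpClient client = HttpClient.newHttpClient();

        // Same effect as opening this URL in a browser or calling it from a cron job.
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("http://localhost:8983/solr/articles/dataimport?command=full-import"))
                .GET()
                .build();

        HttpResponse<String> response = client.send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.body());
    }
}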

Versioning support in Google Cloud Storage from Java?

Previously I asked a question regarding building a document management system on top of GAE using Google Cloud Storage (Document management system using Google Cloud Storage). I think I got appropriate answers for it. This question is just an extension of it. So my question is: can I handle versioning through my Java code as described in this link (developers.google.com/storage/docs/object-versioning), like listing all versions of an object, retrieving a specific version of an object, etc.?
I found list APIs for listing and deleting objects and doing several other operations on Google Cloud Storage, but can I handle versioning through any APIs provided for it from Java?
Thanks in advance.
Thanks in advance.
As the Google Cloud Storage documentation states (https://developers.google.com/storage/docs/developer-guide), stored objects are immutable.
That is, after storing an object you can only delete it and store a new one, even with the same name.
So to have versioning you can organize data in pseudo-folders, like: bucket/file-name/version-1, bucket/file-name/version-2, etc.
Then you need to add some business logic to handle these versions (access the most recent one when needed, delete outdated ones, etc.). However, in a document management system it's good to think about transactions, conflicts, etc. So you will probably want to manage versions in a DB (on GAE?) and just store the version contents in the cloud as files (e.g. named by file content hashes).
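A hedged sketch of the pseudo-folder idea with the current Java client library (which postdates the original answer; the bucket and object names are made up):

import com.google.cloud.storage.Blob;
import com.google.cloud.storage.Storage;
import com.google.cloud.storage.StorageOptions;

public class ListObjectVersionsSketch {
    public static void main(String[] args) {
        Storage storage = StorageOptions.getDefaultInstance().getService();

        // List every "version" stored under the pseudo-folder my-report.pdf/
        for (Blob blob : storage.list("my-dms-bucket",
                Storage.BlobListOption.prefix("my-report.pdf/")).iterateAll()) {
            System.out.println(blob.getName());   // e.g. my-report.pdf/version-1
        }
    }
}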

Ideal place to store Binary data that can be rendered by calling a url

I am looking for an ideal (performant and maintainable) place to store binary data. In my case these are images. I have to do some image processing, scale the images, and store them in a suitable place that can be accessed via a RESTful service.
From my research so far I have a few options, like:
A NoSQL solution like MongoDB/GridFS
Storing the files in a file system in a directory hierarchy and then using a web server to access the images by URL
Apache Jackrabbit document repository
Storing in a cache, something like Memcache or a Squid proxy
Any thoughts on which one you would pick and why would be useful, or is there a better way to do it?
Just started using GridFS to do exactly what you described.
From my experience thus far, the main advantage of GridFS is that it obviates the need for a separate file storage system. Our entire persistence layer is already in Mongo, so the next logical step was to store our filesystem there as well. The flat namespacing just rocks and gives you a rich query language to fetch your files based on whatever metadata you want to attach to them. In our app we used an 'appdata' object that embedded all the ownership information.
Another thing to consider with NoSQL file storage, and especially GridFS, is that it will shard and expand along with your other data. If you've got your entire DB key-value store inside the Mongo server, then eventually, if you ever have to expand your server cluster with more machines, your filesystem will grow along with it.
It can feel a little 'black box' since the binary data itself is split into chunks, a prospect that frightens those used to a classic directory-based filesystem. This is alleviated with the help of admin programs like RockMongo.
All in all, storing images in GridFS is as easy as inserting the docs themselves; most of the drivers for all the major languages handle everything for you. In our environment we took image uploads at an endpoint and used PIL to perform resizing. The images were then fetched from Mongo at another endpoint that just output the data with a JPEG mimetype.
Best of luck!
EDIT:
To give you an example of a trivial file upload with GridFS, here's the simplest approach in PyMongo, the python library.
from pymongo import MongoClient  # Connection was replaced by MongoClient in newer PyMongo
import gridfs

binary_data = b'Hello, world!'  # GridFS stores raw bytes

db = MongoClient().test_db
fs = gridfs.GridFS(db)

# the filename kwarg sets the filename in the mongo doc, but you can pass anything in
# and make custom key-values too.
file_id = fs.put(binary_data, filename='helloworld.txt', anykey='foo')
output = fs.get(file_id).read()
print(output)
# >>> Hello, world!
You can also query against your custom values if you like, which can be REALLY useful if you want your queries to be based off custom information relative to your application.
try:
    file = fs.get_last_version({'anykey': 'foo'})
    return file.read()
except gridfs.errors.NoFile:
    return None
These are just some simple examples, and the drivers for a lot of the other languages (PHP, Ruby, etc.) all have cognates.
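For instance, the Java driver's cognate looks roughly like this (a sketch; the database name, file name, and metadata key are made up):

import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoDatabase;
import com.mongodb.client.gridfs.GridFSBucket;
import com.mongodb.client.gridfs.GridFSBuckets;
import com.mongodb.client.gridfs.model.GridFSUploadOptions;
import org.bson.Document;
import org.bson.types.ObjectId;

import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

public class GridFsJavaExample {
    public static void main(String[] args) throws Exception {
        MongoDatabase db = MongoClients.create().getDatabase("test_db");
        GridFSBucket bucket = GridFSBuckets.create(db);

        // Store the bytes with a custom metadata key, mirroring the Python example.
        GridFSUploadOptions options = new GridFSUploadOptions()
                .metadata(new Document("anykey", "foo"));
        ObjectId fileId = bucket.uploadFromStream("helloworld.txt",
                new ByteArrayInputStream("Hello, world!".getBytes(StandardCharsets.UTF_8)), options);

        // Read it back.
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        bucket.downloadToStream(fileId, out);
        System.out.println(out.toString(StandardCharsets.UTF_8.name()));
    }
}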
I would go for Jackrabbit in combination with its REST framework Sling (http://sling.apache.org).
Sling allows you to upload/download files via REST calls or WebDAV, while the underlying Jackrabbit repository gives you performant storage with the possibility to store your files in a tree structure (or flat, if you like).
Both Jackrabbit and Sling support an event mechanism where you can asynchronously process the image after upload, for example to create thumbnails.
The manual at http://sling.apache.org/site/manipulating-content-the-slingpostservlet-servletspost.html describes how to manipulate data using the REST interface provided by Sling.
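A minimal sketch of creating/updating content through the Sling POST servlet, assuming a local Sling instance at http://localhost:8080 with default credentials; the content path and the "title" property are made up for illustration:

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.Base64;

public class SlingPostExample {
    public static void main(String[] args) throws Exception {
        HttpClient client = HttpClient.newHttpClient();
        String auth = Base64.getEncoder().encodeToString("admin:admin".getBytes());

        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("http://localhost:8080/content/images/cat"))
                .header("Authorization", "Basic " + auth)
                .header("Content-Type", "application/x-www-form-urlencoded")
                // Request parameters become properties of the node at the target path.
                .POST(HttpRequest.BodyPublishers.ofString("title=My+cat+picture"))
                .build();

        HttpResponse<String> response = client.send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.statusCode());
    }
}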
Storing the images as blobs in an RDBMS is another option; you immediately get some guarantees about integrity, security, etc. (if this is set up properly on the database), can store extra metadata, and can manage the collection with SQL.
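A hedged JDBC sketch of the blobs-in-an-RDBMS option (the images table, column names, connection URL, and credentials are hypothetical):

import java.io.FileInputStream;
import java.io.InputStream;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;

public class BlobInsertExample {
    public static void main(String[] args) throws Exception {
        try (Connection conn = DriverManager.getConnection(
                     "jdbc:postgresql://localhost/images_db", "user", "secret");
             InputStream image = new FileInputStream("cat.jpg");
             PreparedStatement ps = conn.prepareStatement(
                     "INSERT INTO images (name, content) VALUES (?, ?)")) {
            ps.setString(1, "cat.jpg");
            ps.setBinaryStream(2, image);   // stream the binary data into the blob column
            ps.executeUpdate();
        }
    }
}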

How to share a library for data access in tomcat 7?

I'm fairly new to the whole web programming stuff and have the following problem:
I have two webapps, one an Axis web service and the other a Spring application. Both should get a set of data from a library which holds the data in memory. This data is large, so copying it for each app is not an option.
What I did so far is develop the library, which loads and holds the data in a static container. The plan was that both apps instantiate the class containing the container and can then access the data.
Sadly, this doesn't work. I get an exception saying that the objects I want to use are in different classloaders.
My question is: how can I provide such a container for both webapps in Tomcat 7?
BTW: a database is not an option, because it's too slow.
Edit: I should have been clearer about the data. The data is a Topic Map stored in a topic map engine (see http://www.isotopicmaps.org). The engine is used to access the data and is therefore the access point to it. We have our own engine, which holds the data in memory and is faster than a database backend.
I want to have a servlet which provides the configuration and loading of topic maps, and then the two servlets above should be able to read and modify a topic map. That's why I need some sort of shared access point to the engine.
This is what distributed caches, key-value stores, document stores, and NoSQL databases are built for. There are many options, and new ones appear every day. The free and open-source options are likely to meet your needs and provide you with as much support as you will need. The one that is currently my favorite is Membase.
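For example, a hedged sketch with the spymemcached client (which speaks the memcached protocol that Membase also understands; the host, port, key, and value are made up):

import net.spy.memcached.MemcachedClient;
import java.net.InetSocketAddress;

public class SharedCacheSketch {
    public static void main(String[] args) throws Exception {
        // Both webapps would create their own client pointing at the same cache node(s),
        // so the shared data lives outside either webapp's classloader.
        MemcachedClient cache = new MemcachedClient(new InetSocketAddress("localhost", 11211));

        cache.set("topicmap:config", 3600, "{\"maps\": [\"people\", \"places\"]}");
        Object value = cache.get("topicmap:config");
        System.out.println(value);

        cache.shutdown();
    }
}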
So you want a distributed in-memory cache for a server cluster. You can use, among others, Terracotta for this. You can find here a nice introduction to Terracotta.
Update: I actually disagree with the argument that a database is "too slow". If it's slow, then the data model and/or the data access code is simply badly designed.

How do I put data into the datastore of Google's app engine?

I have a little application written in PHP+MySQL that I want to port to App Engine, but I just can't find the way to port my MySQL data to the datastore.
How am I supposed to save the data into my datastore? Is that even possible? I can only see documentation for persistence of Java objects. Does that mean I have to port my database to a bunch of fake objects, one per line?
Edit: I say fake objects because I don't want to use them; they're just a way to get over a shortcoming of the GAE design.
I have a 30 MB table I need to check on every GET; by using objects I would need to create an object for every row, so I'd have a Java class of maybe 45 MB with thousands upon thousands of lines like:
Row Row23423 = new Row (123,346,75,34,"a cow");
I just can't believe this is the only way.
Here's an idea, what about populating the data store by POST-ing the objects one by one? I mean, like the posts in a blog. You write a class that generates and persists the data, and then you Curl the url with the data, one by one. Slow, but it may work?
How to upload data with the bulk loader is described here. It's not supported directly in Java yet, but that doesn't have to stop you - just do the following:
Create an app.yaml that looks something like this:
application: myapp
version: upload
runtime: python
api_version: 1

handlers:
- url: /remote_api
  script: $PYTHON_LIB/google/appengine/ext/remote_api/handler.py
  login: admin
Make sure the application name is the same as your Java app's, and the version is not the same as the version you're using for Java. Upload this 'empty' app using appcfg.py.
Now, follow the directions for bulk loading in the page linked to above. When it comes time to run the tool, specify the server address with --server=upload.latest.myapp.appspot.com .
Since multiple versions of the same app share the same datastore - even across runtimes - the data uploaded with the Python version will be accessible to the Java one.
There is documentation on the datastore here.
I can't see anything about a raw data-porting service, but if you can extract the data from your MySQL database into text files, it should be relatively easy to write a script to import it into the App Engine datastore using the persistence frameworks it provides.
Your script would take your raw data, convert it into a (Java) object model, and import those Java objects into the store.
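For instance, a hedged sketch using the low-level datastore API (the kind name and properties mirror the asker's Row example and are purely illustrative):

import com.google.appengine.api.datastore.DatastoreService;
import com.google.appengine.api.datastore.DatastoreServiceFactory;
import com.google.appengine.api.datastore.Entity;

public class RowImporter {
    // Called once per parsed line of the exported text/CSV file.
    static void importRow(long a, long b, long c, long d, String label) {
        DatastoreService datastore = DatastoreServiceFactory.getDatastoreService();

        Entity row = new Entity("Row");      // the kind name "Row" is illustrative
        row.setProperty("a", a);
        row.setProperty("b", b);
        row.setProperty("c", c);
        row.setProperty("d", d);
        row.setProperty("label", label);

        datastore.put(row);
    }
}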
Migrating an application to Google's App Engine would, I think, be quite some task. As you have seen, App Engine does not have a relational database; instead it uses BigTable. The migration will likely involve exporting your data to Java objects (serialized in some way) and then inserting them.
You say "fake" objects in your post, but as you will have to use Java objects anyway, I don't think they would be fake, unless you plan on using one set of objects for the migration and a new set for the application.
There is no (good) general answer to the question of how to port a relational application to the GAE datastore, because the notion of "data" is incompatible between the two. Relational databases are all about the schema. GAE doesn't even have one. It's a schemaless persistent object datastore with very specific APIs. The environment is great for certain types of apps if you're developing from scratch, but it's pretty tricky to port to.
That said, you can import CSV files, as Nick explains, which you should be able to export from MySQL fairly easily. GAE supports Java and Python "at the same time" using the versions mechanism. So you can set up your data store in Python, and then run against it for your application in Java. (A Java version of the bulk loader is under development.)
