Is there a way to use my file system as a cache? - java

I am taking a data structures course and am developing a project in Java. The project is mostly complete except for one aspect: the implementation of a cache. My professor has been vague about how to implement this, as he is (and should be) with everything. The only hint he has given is that the operating system's file system is itself a map, and that we can use it to build a cache. I will paste the assignment details below. Any help would be greatly appreciated.
Almost forgot: my OS is Windows 10.
Requirements
This assignment asks you to create a web page categorization program.
The program reads 20 (or more) web pages. The URLs for some of these pages can be maintained in a control file that is read when the program starts; the others should be links from those pages. (Wikipedia is a recommended source.) For each page, the program maintains word frequencies along with any other related information that you choose.
The user can enter any other URL, and the program reports which other known page is most closely related, using a similarity metric of your choosing.
The implementation restrictions are:
Create a cache based on a custom hash table class you implement to keep track of pages that have not been modified since accessed; keep them in local files.
Use library collections or your own data structures for all other data stores. Read through the Collections tutorial.
Establish a similarity metric. This must be in part based on word-frequencies, but may include other attributes. If you follow the recommended approach of hash-based TF-IDF, create a hash table storing these.
A GUI allows a user to indicate one entity, and displays one or more similar ones.
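The professor's hint, that the file system is itself a map, suggests one possible shape for the cache: hash each URL into a file name, and let the cache directory act as the hash table's bucket array. Below is a minimal sketch under that assumption; the class name and layout are hypothetical, and it assumes the page text has already been fetched.

```java
import java.io.IOException;
import java.math.BigInteger;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

/** Hypothetical sketch of a file-system-backed cache: the directory acts
 *  as the map, and a hash of the URL acts as the key (the file name). */
public class FileCache {
    private final Path dir;

    public FileCache(Path dir) throws IOException {
        this.dir = Files.createDirectories(dir);
    }

    // Hash the URL so it becomes a safe, fixed-length file name.
    private Path keyToPath(String url) {
        try {
            byte[] digest = MessageDigest.getInstance("SHA-256")
                    .digest(url.getBytes(StandardCharsets.UTF_8));
            return dir.resolve(new BigInteger(1, digest).toString(16) + ".html");
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e); // SHA-256 is always available
        }
    }

    public void put(String url, String pageContent) throws IOException {
        Files.writeString(keyToPath(url), pageContent);
    }

    /** Returns the cached page, or null on a cache miss. */
    public String get(String url) throws IOException {
        Path p = keyToPath(url);
        return Files.exists(p) ? Files.readString(p) : null;
    }
}
```

Note that the assignment also asks for a custom hash table class to track which pages have not been modified since they were last accessed; this sketch only covers the local-file side of that requirement.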

Related

Anomaly detection in a Java application

What I'm trying to do is integrate an anomaly detection module into an existing Java application, allowing the user to choose from different algorithms and forecasting models.
The EGADS library looks promising, but I'm not sure whether it fits my purposes: when new data comes in, should I store and update the existing model, or pass the whole data set in again? Also, if I want to forecast only a 15-minute time window by passing in only 15 minutes of data, the results surely won't be precise.
Maybe there are other useful techniques, and someone could share their experience with similar tasks. Unfortunately, I can't find any other Java libraries for this purpose.
What I found out is that we can't store the initially trained model and apply it to arbitrary incoming data: as soon as the initial time series changes, an exception is thrown. So the only possible option here is to retrain the model every time new data comes in; fortunately, this doesn't have a significant performance impact on our system yet.
The library itself looks fine and could serve as a base for building anomaly detection systems, but it's still not as flexible as its Python competitors. However, it is open source and can be modified at any time to suit your needs.
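The "retrain on every new point" approach described above can be sketched in a library-agnostic way. This is not the EGADS API; it is a minimal sliding-window z-score detector that recomputes its statistics from scratch each time, which is the same trade-off (simple, always current, modest cost per update).

```java
import java.util.ArrayDeque;
import java.util.Deque;

/** Library-agnostic sketch: keep a sliding window of recent values,
 *  recompute mean/stddev on every arrival ("retraining"), and flag
 *  points more than `threshold` standard deviations from the mean. */
public class SlidingZScore {
    private final Deque<Double> window = new ArrayDeque<>();
    private final int size;
    private final double threshold;

    public SlidingZScore(int size, double threshold) {
        this.size = size;
        this.threshold = threshold;
    }

    /** Returns true if the new value is anomalous w.r.t. the window. */
    public boolean add(double value) {
        boolean anomaly = false;
        if (window.size() >= 2) {
            double mean = window.stream().mapToDouble(d -> d).average().orElse(0);
            double var = window.stream()
                    .mapToDouble(d -> (d - mean) * (d - mean)).sum() / window.size();
            double sd = Math.sqrt(var);
            anomaly = sd > 0 && Math.abs(value - mean) / sd > threshold;
        }
        window.addLast(value);                           // model always reflects
        if (window.size() > size) window.removeFirst();  // only recent data
        return anomaly;
    }
}
```

A detector like this obviously lacks EGADS's forecasting models, but it illustrates why retraining on each arrival can be cheap when the window is small.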

What to do after writing the user requirements document?

Suppose I have to develop a desktop application using Java SE. I have finished writing the user requirements document, in which I described the functionalities of my future application. I analyzed the users' needs and established what the ideal application has to do.
Now I have to design the application's architecture and produce a detailed design of the app. This is what I don't know how to do.
I have an idea, which is as follows: elaborate a use case diagram; then, for every use case, make a sequence diagram; and finally produce a class diagram from which I can generate the code.
Is this correct? And what about using a database management system: at which level do I add the DBMS? Starting from the first UML diagram?
Please, any help is welcome.
Well, you know the functions you will implement, and you know the requirements. While at this point you should know, or be able to infer, some database requirements, you don't have the whole picture yet. If you want to do iterative software development, you can start with whatever you feel you can make the most progress on, then go back to your other tasks and work in increments. Because you are doing an iterative process, you will be erasing bits here and there, polishing your work as you go.
To work sequentially, you'd finish all the analysis documentation before doing design, and finish design before touching code. Initial databases can be generated from Java classes (beans), so that's when the database comes in.
Under your chosen methodology, the wiki link you provided lists what is expected to be done, in order. For the high-level design part, which you say you have problems with, you'll want the appropriate UML diagrams; use components for modules and the software architecture.
Because it is high-level design, keep it high level; don't delve into details. For example, for a video game: Graphics, Audio, Network, etc., and how they will interact (interfaces). Don't define anything smaller: no classes, no methods; main packages/libraries are fine. For the hardware architecture you might use a deployment diagram, with each cube representing the hardware of a box that will run your code. You aren't prepared for deployment yet, but you can revise your initial proposal in the next iteration if needed.
Database design comes at the end, but the wiki specifically tells you to define only the tables, not the columns. You will define those in the low-level design phase.

Architecture with Neo4j and Mysql for social networking website

We are designing the architecture of a social networking website with a highly interconnected dataset: users can follow other users, places, and interests, with recommendations based on those connections. The feed would come from directly followed entities as well as from indirectly connected ones (places and interests can be connected to other places and interests in an inverted-tree-like hierarchy).
We plan to use Neo4j to store the complex relationships between entities, keyed by their IDs, and to store the actual data for each entity in MySQL. We want to keep the graph database content minimal in size, but with all the relationships (which is very important for the feeds), so that we can load the entire graph into RAM at run time for fast retrieval. Once we get object IDs from Neo4j, we can run normal SQL queries against MySQL.
We are using a PHP and MySQL stack. We have learned that Neo4j, when run in embedded mode, is suitable for complex algorithms and fast data retrieval, so we need to integrate Neo4j with PHP. Our plan is to expose the Neo4j implementation through RESTful Java APIs (or SOAP).
We would have at least 1 million nodes and 10 million relationships. Can Neo4j traverse 1 million nodes in 1-5 seconds without performance glitches, given proper indexing?
Please let me know whether this would work, particularly if you have done this kind of thing before. Any guidance in this regard would be very useful to me.
Thank you.
P.S.: I am attaching some project relationship diagrams to give you more context. Please ask if you need more input from me.
https://drive.google.com/file/d/0B-XA2uVZaFFTWDdwUEViZ2ZsbkE/edit?usp=sharing
https://drive.google.com/file/d/0B-XA2uVZaFFTTGV4d1IySXlWRGs/edit?usp=sharing
I published an unmanaged extension some time ago that implements a kind of activity stream. Feel free to have a look; you would consume it from PHP via a simple HTTP REST call.
https://github.com/jexp/neo4j-activity-stream
A picture of the domain model is here:
Yes, 10M relationships and 1M nodes should be no problem to hold even in memory. For the fastest retrieval, I would build a server extension in Java, use the embedded API or even Cypher, and expose a custom REST endpoint that your PHP environment talks to; see http://docs.neo4j.org/chunked/milestone/server-plugins.html

Basic Java application data storage

I'm working on (essentially) a calendar application written in Java, and I need a way to store calendar events. This is the first "real" application I've written, as opposed to simple projects (usually for classes) that either don't store information between program sessions or store it as text or .dat files in the same directory as the program, so I have a few very basic questions about data storage.
How should the event objects and other data be stored? (.dat files, database of some type, etc)
Where should they be stored?
I'm guessing it's not good to load all the objects into memory when the program starts and not update them on the hard drive until the program closes. So what do I do instead?
If there's some sort of tutorial (or multiple tutorials) that covers the answers to my questions, links to those would be perfectly acceptable answers.
(I know there are somewhat similar questions already asked, but none of them I could find address a complete beginner perspective.)
EDIT: Like I said in one of the comments, in general with this, I'm interested in using it as an opportunity to learn how to do things the "right" (reasonably scalable, reasonably standard) way, even if there are simpler solutions that would work in this basic case.
For a quick solution, if your data structures (and of course the way you access them) are sufficiently simple, reading and writing the data to files, using your own format (e.g. binary, XML, ...), or perhaps standard formats such as iCalendar might be more suited to your problem. Libraries such as iCal4J might help you with that.
Taking into account the more general aspects of your question, this is a broader topic, but you may want to read about databases (relational or not). Whether you want to use them or not will depend on the overall complexity of your application.
A number of relational databases can be used from Java via JDBC, which should allow you to connect to the relational (SQL) database of your choice. Some of them run as their own server application (e.g. MS SQL Server, Oracle, MySQL, PostgreSQL), but some can be embedded within your Java application, for example: Java DB (Oracle's distribution of Apache Derby), Apache Derby DB, HSQLDB, H2, or SQLite.
These embeddable SQL databases will essentially store the data on files on the same machine the application is running on (in a format specific to them), but allow you to use the data using SQL queries.
The benefits include a certain structure to your data (which you build when designing your tables and possible constraints) and (when supported by the engine) the ability to handle concurrent access via transactions. Even in a desktop application, this may be useful.
This may imply a learning curve if you have to learn SQL, but it should save you the trouble of handling the details of defining your own file format. Giving structure to your data via SQL (often known by other developers) can be better than defining your own data structures that you would have to save into and read from your own files anyway.
In addition, if you want to deal with objects directly, without knowing much about SQL, you may be interested in Object-Relational Mapping frameworks such as Hibernate. Their aim is to hide the SQL details from you by being able to store/load objects directly. Not everyone likes them and they also come with their own learning curve (which may entail learning some details of how SQL works too). Their pros and cons could be discussed at length (there are certainly questions about this on StackOverflow or even DBA.StackExchange).
There are also other forms of databases, for example XML databases or Semantic-Web/RDF databases, which may or may not suit your needs.
How should the event objects and other data be stored? (.dat files, database of some type, etc)
It depends on the size of the data to be stored (and loaded), and if you want to be able to perform queries on your data or not.
Where should they be stored?
A file in the user directory (or in a subdirectory of the user directory) is a good choice. Use System.getProperty("user.home") to get it.
I'm guessing it's not good to load all the objects into memory when the program starts and not update them on the hard drive until the program closes. So what do I do instead?
It might be a perfectly valid thing to do, unless the amount of data is so great that it would eat far too much memory. I don't think it would be a problem for a simple calendar application. If you don't want to do that, then store the events in a database and perform queries to only load the events that must be displayed.
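For a simple calendar, the "keep everything in memory" approach above can be sketched with Java's built-in object serialization. This is a hypothetical class (the Event fields are invented for illustration); note that it saves after every change rather than only at shutdown, so a crash loses at most the last edit. In a real app the file would live under System.getProperty("user.home"), as suggested earlier.

```java
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch: all events live in memory; the whole list is
 *  persisted with Java serialization after every modification. */
public class EventStore {
    public record Event(String date, String description) implements Serializable {}

    private final Path file;
    private final List<Event> events = new ArrayList<>();

    @SuppressWarnings("unchecked")
    public EventStore(Path file) throws IOException, ClassNotFoundException {
        this.file = file;
        if (Files.exists(file)) {           // load everything at startup
            try (ObjectInputStream in =
                     new ObjectInputStream(Files.newInputStream(file))) {
                events.addAll((List<Event>) in.readObject());
            }
        }
    }

    public void add(Event e) throws IOException {
        events.add(e);
        save();                             // persist immediately, not at shutdown
    }

    public List<Event> all() { return List.copyOf(events); }

    private void save() throws IOException {
        try (ObjectOutputStream out =
                 new ObjectOutputStream(Files.newOutputStream(file))) {
            out.writeObject(new ArrayList<>(events));
        }
    }
}
```

Serialization is the least-effort option; switching to a database later mostly means replacing the load/save methods while keeping the in-memory API the same.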
A simple sequential file should suffice. Basically, each line in your file represents a record, in your case an event. Separate the fields in each record with a field delimiter; something like the pipe (|) symbol works nicely. Remember to store each record in the same format, for example:
date|description|etc
This way you can read each line of the file back as a record, extract the fields by splitting the string on your delimiter (|), and use the data.
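The format described above can be sketched in a few lines. One caveat worth flagging: a naive split breaks if a field itself contains the pipe character, so real code would need escaping or a different delimiter.

```java
import java.util.Arrays;
import java.util.List;

/** Sketch of the pipe-delimited record format described above. */
public class PipeRecords {
    public static String toLine(String... fields) {
        return String.join("|", fields);
    }

    public static List<String> parseLine(String line) {
        // limit -1 keeps trailing empty fields instead of dropping them
        return Arrays.asList(line.split("\\|", -1));
    }
}
```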
Storing the data in the same folder as your application should be fine.
The best way I've found to handle the objects, for the most part, is to determine whether the amount of data you are storing will be large enough to have consequences for the user's memory. Based on your description, it should be fine in this program.
The right answer depends on details, but probably you want to write your events to a database. There are several good free databases out there, like MySQL and Postgres, so you can (relatively) easily grab one and play with it.
Learning to use a database well is a big subject, bigger than I'm going to answer in a forum post. (I could recommend that you read my book, "A Sane Approach to Database Design", but making such a shameless plug on a forum would be tacky!)
Basically, though, you want to read the data from the database when you need it, and update it when it changes. Don't read everything at start up and write it all back at shut-down.
If the amount of data is small and rarely changes, keeping it all in memory and writing it to a flat file is simpler and faster. But most applications don't fit that description.

Loading Facebook's big text file into memory (39MB) for autocompletion

I'm trying to implement part of the Facebook Ads API, specifically the autocomplete function ads.getAutoCompleteData.
Basically, Facebook supplies this 39MB file, updated weekly, which contains ad-targeting data including colleges, college majors, workplaces, locales, countries, regions, and cities.
Our application needs to access all of those objects and supply auto completion using this file's data.
I'm trying to find the best way to solve this, and I was thinking about one of the following options:
Loading it into memory using a trie (Patricia trie); the disadvantage, of course, is that it will take too much memory on the server.
Using a dedicated search platform such as Solr on a different machine; the disadvantage is that it is perhaps over-engineering (though the file size will probably grow considerably in the future).
(Fill in your cool, easy, speed-of-light option here)?
Well, what do you think?
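For option 1, a minimal trie can be sketched as follows. This is a hypothetical illustration: a production version would cap the number of results returned, normalize case and accents, and use a more memory-compact node representation (which is what a Patricia trie buys you).

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Minimal trie sketch for in-memory autocompletion: insert targeting
 *  names, then list every stored entry starting with a given prefix. */
public class Trie {
    private static class Node {
        final Map<Character, Node> children = new HashMap<>();
        boolean isWord;
    }

    private final Node root = new Node();

    public void insert(String word) {
        Node n = root;
        for (char c : word.toCharArray())
            n = n.children.computeIfAbsent(c, k -> new Node());
        n.isWord = true;
    }

    public List<String> complete(String prefix) {
        Node n = root;
        for (char c : prefix.toCharArray()) {
            n = n.children.get(c);
            if (n == null) return List.of();   // nothing matches this prefix
        }
        List<String> out = new ArrayList<>();
        collect(n, new StringBuilder(prefix), out);
        return out;
    }

    // Depth-first walk below the prefix node, emitting complete words.
    private void collect(Node n, StringBuilder sb, List<String> out) {
        if (n.isWord) out.add(sb.toString());
        for (Map.Entry<Character, Node> e : n.children.entrySet()) {
            sb.append(e.getKey());
            collect(e.getValue(), sb, out);
            sb.setLength(sb.length() - 1);
        }
    }
}
```

The memory cost is driven by the per-node HashMap overhead rather than the 39MB of raw text, which is why compressed variants (Patricia tries, DAWGs) matter at this scale.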
I would stick with a service-oriented architecture (especially if the product is supposed to handle high volumes) and go with Solr. That being said, 39 MB is not a lot to hold in memory if it's going to be a singleton. With indexes and everything, how big will this get? 400MB? It depends, of course, on what your product does and what kind of hardware you wish to run it on.
I would go with Solr, or write your own service that reads the file into a fast DB such as a MySQL MyISAM table (or even an in-memory table) and uses MySQL's text search feature to serve up results. Barring that, I would try to use Solr as a service.
The benefit of writing my own service is that I know what is going on; the downside is that it will be nowhere near as powerful as Solr. However, I suspect writing my own service would take less time to implement.
Consider writing your own service that serves up requests in an asynchronous manner (if your product is a website, then using Ajax). The trouble with Solr or Lucene is that if you get stuck, there is not a lot of help out there.
Just my 2 cents.
