How can I analyse ~13GB of data? - java

I have ~300 text files that contain data on trackers, torrents and peers. Each file is organised like this:
tracker.txt
time torrent
time peer
time peer
...
time torrent
...
I have several files per tracker and much of the information is repeated (same information, different time).
I'd like to be able to analyse what I have and report statistics on things like
How many torrents are at each tracker
How many trackers are torrents listed on
How many peers do torrents have
How many torrents do peers have
The sheer quantity of data is making this hard for me to do. Here's what I've tried.
MySQL
I put everything into a database; one table per entity type and tables to hold the relationships (e.g. this torrent is on this tracker).
Adding the information to the database was slow (and I didn't have 13GB of it when I tried this) but analysing the relationships afterwards was a no-go. Every mildly complex query took over 24 hours to complete (if at all).
An example query would be:
SELECT COUNT(DISTINCT torrent)
FROM TorrentAtPeer, Peer
WHERE TorrentAtPeer.peer = Peer.id
GROUP BY Peer.ip;
I tried bumping up the memory allocations in my my.cnf file but it didn't seem to help. I used the my-innodb-heavy-4G.cnf settings file.
EDIT: Adding table details
Here's what I was using:
Peer                 Torrent                   Tracker
-----------          -----------------------   ------------------
id (bigint)          id (bigint)               id (bigint)
ip* (int)            infohash* (varchar(40))   url (varchar(255))
port (int)

TorrentAtPeer        TorrentAtTracker
-----------------    ----------------
id (bigint)          id (bigint)
torrent* (bigint)    torrent* (bigint)
peer* (bigint)       tracker* (bigint)
time (int)           time (int)
*indexed field. Navicat reports them as being of normal type and Btree method.
id - Always the primary key
There are no foreign keys. I was confident in my ability to only use IDs that corresponded to existing entities, so adding a foreign key check seemed like a needless delay. Is this naive?
Matlab
This seemed like an application that was designed for some heavy lifting but I wasn't able to allocate enough memory to hold all of the data in one go.
I didn't have numerical data, so I was using cell arrays; I moved from these to tries in an effort to reduce the footprint. I couldn't get it to work.
Java
My most successful attempt so far. I found an implementation of Patricia Tries provided by the people at Limewire. Using this I was able to read in the data and count how many unique entities I had:
13 trackers
1.7mil torrents
32mil peers
I'm still finding it too hard to work out the frequencies of the number of torrents at peers. I'm attempting to do so by building tries like this:
Trie<String, Trie<String, Object>> peers = new Trie<String, Trie<String, Object>>(...);
String infohash = null;
for (String line : file) {
    if (containsTorrent(line)) {
        infohash = getInfohash(line);
    } else if (containsPeer(line)) {
        String peer = getPeer(line);
        Trie<String, Object> torrents = peers.get(peer);
        if (torrents == null) {              // first time this peer is seen
            torrents = new Trie<String, Object>(...);
            peers.put(peer, torrents);
        }
        torrents.put(infohash, null);        // record that this peer has the current torrent
    }
}
From what I've been able to do so far, if I can get this peers trie built then I can easily find out how many torrents are at each peer. I ran it all yesterday and when I came back I noticed that the log file wasn't being written to. I ^Z'd the application and time reported the following:
real 565m41.479s
user 0m0.001s
sys 0m0.019s
This doesn't look right to me; should user and sys be so low? I should mention that I've also increased the JVM's heap size to 7GB (max and start); without that I rather quickly get an out-of-memory error.
I don't mind waiting for several hours/days but it looks like the thing grinds to a halt after about 10 hours.
I guess my question is, how can I go about analysing this data? Are the things I've tried the right things? Are there things I'm missing? The Java solution seems to be the best so far; is there anything I can do to get it to work?

You state that your MySQL queries took too long. Have you ensured that proper indices are in place to support the kind of request you submitted? In your example, that would be an index for Peer.ip (or even a nested index (Peer.ip,Peer.id)) and an index for TorrentAtPeer.peer.
As I understand your Java results, you have a lot of data but not that many different strings. So you could perhaps save some time by assigning a unique number to each tracker, torrent and peer, using one table for each, with an indexed column holding the string and a numeric primary key as the id. That way, all tables relating these entities would only have to deal with those numbers, which could save a lot of space and make your operations a lot faster.
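A minimal sketch of that numbering idea in Java (the class and method names are just illustrative, not from any particular library) could look like this:

import java.util.HashMap;
import java.util.Map;

// Maps each distinct string (tracker URL, infohash, peer) to a small numeric id,
// so the relationship tables only need to store numbers.
class IdDictionary {
    private final Map<String, Integer> ids = new HashMap<String, Integer>();

    // Returns the existing id for this value, or assigns the next free one.
    int idFor(String value) {
        Integer id = ids.get(value);
        if (id == null) {
            id = ids.size();      // next free id: 0, 1, 2, ...
            ids.put(value, id);
        }
        return id;
    }

    int size() {
        return ids.size();
    }
}

You would keep one such dictionary per entity type (trackers, torrents, peers) and write only the numeric ids into the relationship tables.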

I would give MySQL another try but with a different schema:
do not use id-columns here
use natural primary keys here:
Peer: ip, port
Torrent: infohash
Tracker: url
TorrentPeer: peer_ip, torrent_infohash, peer_port, time
TorrentTracker: tracker_url, torrent_infohash, time
use the InnoDB engine for all tables
This has several advantages:
InnoDB uses a clustered index for the primary key. This means that all data can be retrieved directly from the index, without an additional lookup, when you only request data from primary key columns. In that sense, InnoDB tables are index-organized tables.
Smaller size, since you do not have to store the surrogate keys. That means speed, because less I/O is needed for the same results.
You may be able to do some queries now without using (expensive) joins, because you use natural primary and foreign keys. For example, the linking table TorrentAtPeer directly contains the peer ip as a foreign key to the peer table. If you need to query the torrents used by peers in a subnetwork, you can now do this without a join, because all relevant data is in the linking table.
If you want the torrent count per peer and you want the peer's ip in the results too then we again have an advantage when using natural primary/foreign keys here.
With your schema you have to join to retrieve the ip:
SELECT Peer.ip, COUNT(DISTINCT torrent)
FROM TorrentAtPeer, Peer
WHERE TorrentAtPeer.peer = Peer.id
GROUP BY Peer.ip;
With natural primary/foreign keys:
SELECT peer_ip, COUNT(DISTINCT torrent)
FROM TorrentAtPeer
GROUP BY peer_ip;
EDIT
Well, the originally posted schema was not the real one. Now the Peer table has a port field. I would suggest using the primary key (ip, port) here and still dropping the id column. This also means that the linking table needs multicolumn foreign keys. I have adjusted the answer accordingly.

If you could use C++, you should take a look at Boost flyweight.
Using flyweight, you can write your code as if you had strings, but each instance of a string (your tracker name, etc.) uses only the size of a pointer.
Regardless of the language, you should convert the IP address to an int (take a look at this question) to save some more memory.
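For reference, a minimal Java sketch of that conversion (assuming well-formed IPv4 addresses; add validation for real use) might be:

// Packs a dotted-quad IPv4 address such as "192.168.0.1" into a single int.
// The result may be negative for addresses >= 128.0.0.0, but it is still unique.
static int ipToInt(String ip) {
    String[] octets = ip.split("\\.");
    int packed = 0;
    for (String octet : octets) {
        packed = (packed << 8) | Integer.parseInt(octet);
    }
    return packed;   // e.g. "192.168.0.1" -> 0xC0A80001
}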

You most likely have a problem that can be solved by NoSQL and distributed technologies.
i) I would write a distributed system using Hadoop/HBase.
ii) Rent several tens or hundreds of AWS machines, but only for a few seconds (it'll still cost you less than $0.50)
iii) Profit!!!

Related

DynamoDB Scan Query and BatchGet

We have a DynamoDB table whose primary key consists of a Hash and a Range key.
Hash = date.random_number
Range = timestamp
How do we get items between timestamps X and Y? Since the hash key has a random_number attached to it, the query has to be fired that many times. Is it possible to give multiple hash values and a single RangeKeyCondition?
What would be most efficient in terms of cost and time?
The random number range is from 1 to 10.
If I understood correctly, you have a table with the following definition of Primary Keys:
Hash Key : date.random_number
Range Key : timestamp
One thing that you have to keep in mind is that, whether you are using GetItem or Query, you have to be able to calculate the Hash Key in your application in order to successfully retrieve one or more items from your table.
It makes sense to use the random numbers as part of your Hash Key so your records can be evenly distributed across the DynamoDB partitions, however, you have to do it in a way that your application can still calculate those numbers when you need to retrieve the records.
With that in mind, let's create the query needed for the specified requirements. The native AWS DynamoDB operations that you have available to obtain several items from your table are:
Query, BatchGetItem and Scan
In order to use BatchGetItem you would need to know beforehand the entire primary key (Hash Key and Range Key), which is not the case.
The Scan operation will literally go through every record of your table, something that in my opinion is unnecessary for your requirements.
Lastly, the Query operation allows you to retrieve one or more items from a table applying the EQ (equality) operator to the Hash Key and a number of other operators that you can use when you don't have the entire Range Key or would like to match more than one.
The operator options for the Range Key condition are: EQ | LE | LT | GE | GT | BEGINS_WITH | BETWEEN
It seems to me that the most suitable one for your requirements is the BETWEEN operator. That being said, let's see how you could build the query with the chosen SDK:
Table table = dynamoDB.getTable(tableName);

String hashKey = "<YOUR_COMPUTED_HASH_KEY>";
String timestampX = "<YOUR_TIMESTAMP_X_VALUE>";
String timestampY = "<YOUR_TIMESTAMP_Y_VALUE>";

RangeKeyCondition rangeKeyCondition =
    new RangeKeyCondition("RangeKeyAttributeName").between(timestampX, timestampY);

ItemCollection<QueryOutcome> items = table.query("HashKeyAttributeName", hashKey,
    rangeKeyCondition,
    null,  // FilterExpression - not used in this example
    null,  // ProjectionExpression - not used in this example
    null,  // ExpressionAttributeNames - not used in this example
    null); // ExpressionAttributeValues - not used in this example
You might want to look at the following post to get more information about DynamoDB Primary Keys:
DynamoDB: When to use what PK type?
QUESTION: My concern is querying multiple times because of the random_number attached to the hash key. Is there a way to combine these queries and hit DynamoDB once?
Your concern is completely understandable; however, the only way to fetch all the records via BatchGetItem is by knowing the entire primary key (HASH + RANGE) of all the records you intend to get. Although minimizing the HTTP round trips to the server might seem to be the best solution at first sight, the documentation actually suggests doing exactly what you are doing, to avoid hot partitions and uneven use of your provisioned throughput:
Design For Uniform Data Access Across Items In Your Tables
"Because you are randomizing the hash key, the writes to the table on
each day are spread evenly across all of the hash key values; this
will yield better parallelism and higher overall throughput. [...] To
read all of the items for a given day, you would still need to Query
each of the 2014-07-09.N keys (where N is 1 to 200), and your
application would need to merge all of the results. However, you will
avoid having a single "hot" hash key taking all of the workload."
Source: http://docs.aws.amazon.com/amazondynamodb/latest/developerguide/GuidelinesForTables.html
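In code, that per-suffix querying and merging could look roughly like the sketch below. It reuses the table, timestampX and timestampY variables from the earlier snippet; the date prefix, attribute names and the 1-10 suffix range are assumptions taken from the question.

// Issue one Query per random suffix and merge the results in the application.
List<Item> merged = new ArrayList<Item>();
RangeKeyCondition rangeKeyCondition =
    new RangeKeyCondition("RangeKeyAttributeName").between(timestampX, timestampY);

for (int n = 1; n <= 10; n++) {              // random_number goes from 1 to 10
    String hashKey = "2014-07-09." + n;      // hypothetical date.random_number hash key
    ItemCollection<QueryOutcome> items =
        table.query("HashKeyAttributeName", hashKey, rangeKeyCondition);
    for (Item item : items) {
        merged.add(item);
    }
}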
Here is another interesting point suggesting moderate use of reads in a single partition... if you remove the random number from the hash key to be able to get all records in one shot, you are likely to run into this issue, regardless of whether you are using Scan, Query or BatchGetItem:
Guidelines for Query and Scan - Avoid Sudden Bursts of Read Activity
"Note that it is not just the burst of capacity units the Scan uses
that is a problem. It is also because the scan is likely to consume
all of its capacity units from the same partition because the scan
requests read items that are next to each other on the partition. This
means that the request is hitting the same partition, causing all of
its capacity units to be consumed, and throttling other requests to
that partition. If the request to read data had been spread across
multiple partitions, then the operation would not have throttled a
specific partition."
And lastly, because you are working with time series data, it might be helpful to look into some best practices suggested by the documentation as well:
Understand Access Patterns for Time Series Data
For each table that you create, you specify the throughput
requirements. DynamoDB allocates and reserves resources to handle your
throughput requirements with sustained low latency. When you design
your application and tables, you should consider your application's
access pattern to make the most efficient use of your table's
resources.
Suppose you design a table to track customer behavior on your site,
such as URLs that they click. You might design the table with hash and
range type primary key with Customer ID as the hash attribute and
date/time as the range attribute. In this application, customer data
grows indefinitely over time; however, the applications might show
uneven access pattern across all the items in the table where the
latest customer data is more relevant and your application might
access the latest items more frequently and as time passes these items
are less accessed, eventually the older items are rarely accessed. If
this is a known access pattern, you could take it into consideration
when designing your table schema. Instead of storing all items in a
single table, you could use multiple tables to store these items. For
example, you could create tables to store monthly or weekly data. For
the table storing data from the latest month or week, where data
access rate is high, request higher throughput and for tables storing
older data, you could dial down the throughput and save on resources.
You can save on resources by storing "hot" items in one table with
higher throughput settings, and "cold" items in another table with
lower throughput settings. You can remove old items by simply deleting
the tables. You can optionally backup these tables to other storage
options such as Amazon Simple Storage Service (Amazon S3). Deleting an
entire table is significantly more efficient than removing items
one-by-one, which essentially doubles the write throughput as you do
as many delete operations as put operations.
Source: http://docs.aws.amazon.com/amazondynamodb/latest/developerguide/GuidelinesForTables.html

Strange Cassandra ReadTimeoutExceptions, depending on which client is querying

I have a cluster of three Cassandra nodes with more or less default configuration. On top of that, I have a web layer consisting of two nodes for load balancing, both web nodes querying Cassandra all the time. After some time, with the data stored in Cassandra becoming non-trivial, one and only one of the web nodes started getting ReadTimeoutException on a specific query. The web nodes are identical in every way.
The query is very simple (? is placeholder for date, usually a few minutes before the current moment):
SELECT * FROM table WHERE time > ? LIMIT 1 ALLOW FILTERING;
The table is created with this query:
CREATE TABLE table (
    user_id varchar,
    article_id varchar,
    time timestamp,
    PRIMARY KEY (user_id, time));
CREATE INDEX articles_idx ON table(article_id);
When it times out, the client waits for a bit more than 10s, which, not surprisingly, is the timeout configured in cassandra.yaml for most connects and reads.
There are a couple of things that are baffling me:
the query only times out when one of the web nodes executes it - one of the nodes always fails, the other always succeeds.
the query returns instantaneously when I run it from cqlsh (although it seems it only hits one node when I run it from there)
there are other queries issued which take 2-3 minutes (a lot longer than the 10s timeout) that do not time out at all
I cannot trace the query in Java because it times out. Tracing the query in cqlsh didn't provide much insight. I'd rather not change the Cassandra timeouts as this is production system and I'd like to exhaust non-invasive options first. The Cassandra nodes all have plenty of heap, their heap is far from full, and GC times seem normal.
Any ideas/directions will be much appreciated, I'm totally out of ideas. Cassandra version is 2.0.2, using com.datastax.cassandra:cassandra-driver-core:2.0.2 Java client.
A few things I noticed:
While you are using time as a clustering key, it doesn't really help you because your query is not restricting by your partition key (user_id). Cassandra only orders by clustering keys within a partition. So right now your query is pulling back the first row which satisfies your WHERE clause, ordered by the hashed token value of user_id. If you really do have tens of millions of rows, then I would expect this query to pull back data from the same user_id (or same select few) every time.
"although it seems it only hits one node when I run it from there" Actually, your queries should only hit one node when you run them. Introducing network traffic into a query makes it really slow. I think the default consistency in cqlsh is ONE. This is where Carlo's idea comes into play.
What is the cardinality of article_id? Remember, secondary indexes work the best on "middle-of-the-road" cardinality. High (unique) and low (boolean) are both bad.
The ALLOW FILTERING clause should not be used in (production) application-side code. Like ever. If you have 50 million rows in this table, then ALLOW FILTERING is first pulling all of them back, and then trimming down the result set based on your WHERE clause.
Suggestions:
Carlo might be on to something with the suggestion of trying a different (lower) consistency level. Try setting a consistency level of ONE in your application and see if that helps.
Either perform an ALLOW FILTERING query, or a secondary index query. They both suck, but definitely do not do both together. I would not use either. But if I had to pick, I would expect a secondary index query to suck less than an ALLOW FILTERING query.
To solve this adequately at the scale you are describing, I would duplicate the data into a query table, since it looks like you are concerned with organizing time-sensitive data and with getting the most recent data. A query table like this should do it:
CREATE TABLE tablebydaybucket (
    user_id varchar,
    article_id varchar,
    time timestamp,
    day_bucket varchar,
    PRIMARY KEY (day_bucket, time))
WITH CLUSTERING ORDER BY (time DESC);
Populate this table with your data, and then this query will work:
SELECT * FROM tablebydaybucket
WHERE day_bucket='20150519' AND time > '2015-05-19 15:38:49-0500' LIMIT 1;
This will partition your data by day_bucket, and cluster your data by time. This way, you won't need ALLOW FILTERING or a secondary index. Also your query is guaranteed to hit only one node, and Cassandra will not have to pull all of your rows back and apply your WHERE clause after-the-fact. And clustering on time in DESCending order helps your most recent rows come back quicker.
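Purely as a hedged sketch (the keyspace, contact point and cutoff timestamp are made up here), running that bucketed query from the DataStax Java driver the question already uses, with the consistency level dropped to ONE as suggested above, might look like this:

import java.util.Date;
import com.datastax.driver.core.*;

Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();  // assumed contact point
Session session = cluster.connect("mykeyspace");                           // assumed keyspace

PreparedStatement ps = session.prepare(
    "SELECT * FROM tablebydaybucket WHERE day_bucket = ? AND time > ? LIMIT 1");

// Bucket for the current day and a cutoff a few minutes in the past, as in the original query.
Date cutoff = new Date(System.currentTimeMillis() - 5 * 60 * 1000);
BoundStatement bound = ps.bind("20150519", cutoff);
bound.setConsistencyLevel(ConsistencyLevel.ONE);   // lower consistency, as suggested above

Row row = session.execute(bound).one();            // most recent row in the bucket, if any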

PHP Generating unique id from database

In my project, I am generating a unique id by taking the largest integer value of a 'serial_key' attribute in a database table and adding 1 to that number. This generates the unique index used to add a new record.
But this mechanism failed when I deployed the application on multiple PCs on an intranet or the internet: it was generating the same 'unique' id on different machines at the same instant. I also have plenty of data on the server, so I have to keep the same id pattern, since it was constructed with a specific format. Please suggest how to resolve this problem. Thanks.
You can use the HiLo algorithm to generate unique keys. It can be tuned for performance or for continuous keys. If you implement it in all your clients (I guess Java and PHP), you get keys that are unique across as many clients as you like (or as your database performance allows). You would also not be dependent on any particular database, and if you tune it for throughput you wouldn't need many additional database queries.
See this SO-Answer.
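To make the idea concrete, here is a minimal in-memory sketch of a hi/lo generator in Java; in a real deployment the hi block would be reserved through the database (see the linked answer), so fetchNextHi below is only a placeholder:

// Each client reserves a "hi" block and hands out "lo" values locally,
// so ids stay unique across clients without a database round trip per id.
class HiLoGenerator {
    private final int blockSize;
    private long hi;    // block number reserved for this client
    private int lo;     // position inside the current block

    HiLoGenerator(long initialHi, int blockSize) {
        this.hi = initialHi;
        this.blockSize = blockSize;
        this.lo = 0;
    }

    synchronized long nextId() {
        if (lo >= blockSize) {
            hi = fetchNextHi();   // reserve a new block when the current one is exhausted
            lo = 0;
        }
        return hi * blockSize + lo++;
    }

    // Placeholder: a real implementation would atomically increment a value in the database.
    private long fetchNextHi() {
        return hi + 1;
    }
}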
You can solve this by using an AUTO_INCREMENT attribute on your serial_key column. That way you don't have to worry about data collision. This is a common practice for primary keys.
Proper "allocator table" design for portable DB-based key allocation: to be used in preference to Scott Ambler's misguided "hi-lo" idea.
create table KEY_ALLOC (
    SEQ varchar(32) not null,
    NEXT bigint not null,
    primary key (SEQ)
);
To allocate the next, say, 20 keys (which are then held as a range in the server & used as needed):
select NEXT from KEY_ALLOC where SEQ=?;
update KEY_ALLOC set NEXT=(old value+20) where SEQ=? and NEXT=(old value);
Providing you can commit this transaction (use retries to handle contention), you have allocated 20 keys & can dispense them as needed.
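A rough JDBC sketch of that allocate-a-chunk transaction (same table and columns as above; it assumes auto-commit is disabled on the connection, that the SEQ row already exists, and it leaves resource handling simplified):

import java.sql.*;

// Reserves a block of `chunk` keys from KEY_ALLOC for the given sequence name,
// retrying if another client updated NEXT concurrently (optimistic locking).
static long allocateChunk(Connection con, String seq, int chunk) throws SQLException {
    while (true) {
        PreparedStatement sel = con.prepareStatement(
            "select NEXT from KEY_ALLOC where SEQ = ?");
        sel.setString(1, seq);
        ResultSet rs = sel.executeQuery();
        rs.next();
        long next = rs.getLong(1);
        rs.close();
        sel.close();

        PreparedStatement upd = con.prepareStatement(
            "update KEY_ALLOC set NEXT = ? where SEQ = ? and NEXT = ?");
        upd.setLong(1, next + chunk);
        upd.setString(2, seq);
        upd.setLong(3, next);
        int updated = upd.executeUpdate();
        upd.close();

        if (updated == 1) {
            con.commit();
            return next;        // caller may hand out keys next .. next+chunk-1
        }
        con.rollback();         // someone else allocated first; retry
    }
}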
This scheme is 10x faster than allocating from an Oracle sequence, and is 100% portable amongst all databases.
Unlike Scott Ambler's hi-lo idea, it treats the keyspace as a contiguous linear numberline -- from which it efficiently allocates small chunks of configurable size. These keys are human-friendly and large chunks are not wasted. Mr Ambler's idea allocates the high 16- or 32-bits, and requires either ugly composite keys or generating large human-unfriendly key values as the hi-words increment.
Comparison of allocated keys:
Linear_Chunk    Hi_Lo
100             65536
101             65537
102             65538
..  server restart
120             131072
121             131073
122             131074
..  server restart
140             196608
Guess which keys are easier to work with for a developer or DB admin.
I actually corresponded with him back in the 90's to suggest this improved scheme to him, but he was too stuck & obstinate to acknowledge the advantages of using a linear number-line.

performance improvement of queries against encrypted table without changing the application code

I have tagged this problem with both Oracle and Java because both Oracle and Java solutions would be accepted for this problem.
I am new to Oracle security and have been presented with the problem below to solve. I have done some research on the internet but I have had no luck so far. At first, I thought Oracle TDE might be helpful for my problem, but here: Can Oracle TDE protect data from the DBA? it seems TDE doesn't protect data against the DBA, and this is an issue which is not to be tolerated.
Here is the problem:
I have a table containing millions of records. I have a Java application which queries this table using equality or range criteria against a column which is the table's primary key. The primary key column contains sensitive data and has therefore already been encrypted. As a result, querying with normal (i.e. decrypted) values from the application cannot use the primary key's unique index access path. I need to improve the queries' performance without any changes to the application code (application config can be modified if necessary, but not the code). It would be OK to make any changes necessary on the database side, as long as that column remains encrypted.
Oracle people please: What solution(s) do you suggest to this problem? How can I create an index on decrypted column values and somehow force Oracle to utilize this index? How can I use partitioning such as hash-partitioning? How about views? Any, Any solution?
Java people please: I myself have a very vague idea, which is to create a separate application in between (i.e. between the database and the application) which acts as a proxy: it receives the queries from the application, replaces the decrypted values with encrypted values, sends the query on to the database, then receives the response and returns the results to the application. The proxy should behave like a database, so that the application can connect to it by changing only the connection string in its configuration file. Would this work? How?
Thanks for all your help in advance!
which queries this table using equality or range criteria against a column in the table which is the primary key column of the table
To find a specific value it's simple enough - you can store the data encrypted any way you like - even as a hash and still retrieve a specific value using an index. But as per my comment elsewhere, you can't do range queries without either:
decrypting each and every row in the table
or
using an algorithm that can be cracked in a few seconds.
Using a linked list (or a related table) to define order instead of an algorithm with intrinsic ordering would force a brute force check on a much larger set of values - but it's nowhere near as secure as a properly encrypted value.
It doesn't matter if you use Oracle, Java or pencil and paper. Might be possible using quantum computing - but if you can't afford to ensure the security of your application / pay for good advice from an expert cryptographer, then you certainly won't be able to afford that.
How can I create an index on decrypted column values and somehow force Oracle to utilize this index?
Maybe you could create a function based index in which you index the decrypted value.
create index ix1 on tablename (decryptfunction(pk1));

Duplicate set of columns from one table to another table

My requirement is to read some set of columns from a table.
The source table has many - around 20-30 - numeric columns, and I would like to read only a subset of those columns from the source table and keep appending their values to the destination table. My DB is Oracle and the programming language is JDBC/Java.
The source table is very dynamic - frequent inserts and deletes happen on it. At the destination table, on the other hand, I would like to keep the data for at least 30 days.
My Setup is described as below -
Database is Oracle.
Number of rows in the source table = 20 Million rows with 30 columns
Number of rows in destination table = 300 Million rows with 2-3 columns
The columns are all Numeric.
I am thinking of not simply opening a vanilla JDBC connection and transferring the data, which might be pretty slow given the size of the tables.
I am trying to take a dump of the selected columns of the source table using some SQL like:
SQL> spool on
SQL> select c1,c5,c6 from SRC_Table;
SQL> spool off
And later use SQL*Loader to load the data into the destination database.
The source table stores time series data, and the data gets purged/deleted from the source table within 2 days. It is part of an OLTP environment. The destination table has a larger retention period - 30 days of data can be stored there - and it is part of an OLAP environment. So a view on the source table that selects only a subset of its columns does not work in this environment.
Any suggestion or review comments on this approach is welcome.
EDIT
My tables are partitioned. The easiest way to copy data is to exchange partitions between tables:
ALTER TABLE <table_name>
  EXCHANGE PARTITION <partition_name>
  WITH TABLE <new_table_name>
  <including | excluding> INDEXES
  <with | without> VALIDATION
  EXCEPTIONS INTO <schema.table_name>;
but since my source and destination tables have different columns, I think exchange partition will not work.
Shamik, okay, you're loading an OLAP database with OLTP data.
What's the acceptable latency? Does your OLAP need today's data before people come in to the office tomorrow morning, or is it closer to real time?
Saying the Inserts are "frequent" doesn't mean anything. Some of us are used to thousands of txns/sec - to others 1/sec is a lot.
And you say there's a lot of data. Same idea. I've read people's posts where they have HUGE tables with a couple of million records; I have tables with hundreds of billions of records. So again, a real number is very helpful.
Do not go with the trigger suggested by Schwern. If you believe your insert volume is large, it means you've probably had issues in that area. A trigger will just make it worse.
Oracle provides lots of different choices for getting data from OLTP to OLAP. Instead of reinventing the wheel, use something already written. Oracle Streams was BORN to do this exact job. You can roll your own streams using Oracle AQ. You can capture inserted rows without a trigger by using either Database Change Notification or Change Data Capture.
This is an extremely common problem, which is why I've listed 4 technologies designed to solve it.
Advanced Queuing
Streams
Change Data Capture
Database Change Notification
Start googling these terms and come back with questions on those. You'll be better off than building your own from the ground up or using triggers.
The problem seems a little vague, and frankly a little odd. The fact that there are hundreds of columns in a single table, and that you're duplicating data within the database, suggests a hosed database design.
Rather than do it manually, it sounds like a job for a trigger. Create an insert trigger on the source table to copy columns to the destination table just after they're inserted.
Another possibility is that since it seems all you want is a slice of the data in your original table, rather than duplicating it, a cardinal sin of database design, create a view which only includes the columns and ranges you want. Then just access that view like any other table.
I'm willing to guess that the root of the problem is that accessing just the information you want in your source table is too slow. This suggests you might be able to fix that with better indexing. Also, your source table is probably just too damn wide.
Since I'm not an Oracle person, I leave the syntax of this as an exercise for the reader, but the concept should be sound.
On a tangential note, you might want to look at Oracle's partitioning here and here.
Partitioning enables tables and indexes to be split into smaller, more manageable components and is a key requirement for any large database with high performance and high availability requirements. Oracle Database 11g offers the widest choice of partitioning methods including interval, reference, list, and range in addition to composite partitions of two methods such as order date (range) and region (list) or region (list) and customer type (list).
Faster Performance—Lowers query times from minutes to seconds
Increases Availability—24 by 7 access to critical information
Improves Manageability—Manage smaller 'chunks' of data
Enables Information Lifecycle Management—Cost-efficient use of storage
Partitioning the table into daily partitions would make archiving easier as described here
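As a hedged illustration only - the column names c1/c5/c6 come from the question's spool example, while the table name and the load_date partition key are assumptions - creating an interval-partitioned (daily) destination table from JDBC on Oracle 11g could look like this:

import java.sql.Connection;
import java.sql.SQLException;
import java.sql.Statement;

// Sketch: a destination table partitioned by day, so old data can be archived
// or dropped one partition at a time instead of row by row.
static void createDailyPartitionedTable(Connection con) throws SQLException {
    Statement stmt = con.createStatement();
    stmt.execute(
        "CREATE TABLE dest_table (" +
        "  c1        NUMBER, " +
        "  c5        NUMBER, " +
        "  c6        NUMBER, " +
        "  load_date DATE " +
        ") PARTITION BY RANGE (load_date) " +
        "  INTERVAL (NUMTODSINTERVAL(1, 'DAY')) " +
        "  (PARTITION p_initial VALUES LESS THAN (DATE '2010-01-01'))");
    stmt.close();
}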
