Cassandra - Data modelling for UserProfile

#1
I have a user model with the following attributes:

from cassandra.cqlengine import columns
from cassandra.cqlengine.models import Model

class User(Model):
    user_id = columns.Integer(primary_key=True)
    username = columns.Text()
    email = columns.Text()
    fname = columns.Text()
    lname = columns.Text()
    age = columns.Text()
    state = columns.Text()
    city = columns.Text()
    country = columns.Text()
    gender = columns.Text()
    phone = columns.Text()
    school_name = columns.Text()
    created_at = columns.Text()
    race = columns.Boolean()


This is my normal RDBMS model. My queries are as follows:

1) Get all users with city = 'something'

2) Get a user with email = 'something'

3) Get a user with username = 'something'

4) Get all users with phone IN ('something')

5) Get all users with state = 'something'

6) Get all users with age > something

7) Get all users with gender = 'something'

8) Get all users with race = 'something'

9) Get count(*), school_name from users GROUP BY school_name

10) Get all users with created_date > 'something' LIMIT 1000

11) Get all users with username IN ('something') AND age IN ('something') AND phone IN ('something') AND state IN ('something') AND so on LIMIT 1000


I can get the results for all of the above with simple SELECT queries in an RDBMS, but the problem lies in Cassandra.

To get results for the above queries in Cassandra, the usual recommendation is a different model (table) per query, which speeds up reads. Disk is far cheaper than it used to be, but it isn't always easy to just throw more disk at a problem, and the bigger problem I see is adjusting the DAO layer of the application to keep 10 different tables in sync. (Also, my inner instinct is not convinced by having 10 models for different queries. :P)
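For context, that query-per-table pattern would look roughly like the sketch below (a minimal illustration using cqlengine; the users_by_city table name and layout are my own assumption, not an agreed design):

# Hypothetical query table backing "get all users with city = X";
# the user data is duplicated, with city as the partition key.
from cassandra.cqlengine import columns
from cassandra.cqlengine.models import Model

class UserByCity(Model):
    __table_name__ = 'users_by_city'
    city = columns.Text(primary_key=True)        # partition key
    user_id = columns.Integer(primary_key=True)  # clustering key, keeps rows unique
    username = columns.Text()
    email = columns.Text()
    # ... the remaining user columns would be repeated here

# Query 1 then becomes a single-partition read:
# UserByCity.objects(city='something')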

Can someone please explain to me the proper model in Cassandra to get the results for these queries?

PS: The actions on the above model can be Read/Write/Update/Delete. **Query 11** is the most important query.

The most important thing is to make these queries really fast on large amounts of data, keeping in mind that the information about a particular user can be updated.

#2
Probably the easiest way is to use DataStax Enterprise with the Search (Solr) or Analytics (Spark) option; you can download it for free for test purposes. As long as your SLA for data to be available for query is over 1 second, the Lucene indexes should be able to handle these varied search options.

The bigger question is why you want to use Cassandra here, and what you mean by large amounts of data. Typically, Cassandra is best served when your application has low-latency read and write needs, must replicate to many servers and many data centers, and needs zero downtime. That is not typically the profile of a data mart/warehouse or analytic database, which is what your type of queries and the need to run them on "large amounts of data" seem to point to. You typically don't want to put more than 1-3 TB of data per Cassandra node, though there are some outliers out there...
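For instance, with DSE Search enabled the ad-hoc filters could be pushed into a Solr query via the solr_query column. This is only a rough sketch with the Python driver: it assumes a search core already exists on the users table, the keyspace name and field values are placeholders, and the exact syntax depends on your DSE version.

from cassandra.cluster import Cluster

session = Cluster(['127.0.0.1']).connect('my_keyspace')

# Lucene-style filtering across several columns in one query (sketch only).
rows = session.execute(
    "SELECT * FROM users "
    "WHERE solr_query = 'city:something AND gender:something AND age:[30 TO *]' "
    "LIMIT 1000"
)
for row in rows:
    print(row.username, row.email)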

#3
You are facing a real Cassandra limitation: if you are sure you want to go with Cassandra, you need to follow the "Cassandra rules". Among these are:

- Denormalize
- Choose indexes wisely

So let's start. Each user should have a unique id, username, email and phone. This means these columns are not good candidates for indexing (read here why: http://docs.datastax.com/en/cql/3.1/cql/ddl/ddl_when_use_index_c.html), so denormalization is the right way.

From your queries you will have a user_by_username, a user_by_email and a user_by_phone table. You might think that repeating all the data each time is onerous in terms of updates and disk usage, so a compromise is to have each of these tables contain only the ID of the user as its value, e.g.:

user_email        | user_id
------------------+-------------
[email protected]   | 123-456-7aa
[email protected]   | efg-123-ghi

In another table inside the same keyspace you retrieve all the information concerning a user by id. This solves the `update problem`: if you need to update an email address or a phone number you only have to update a couple of tables instead of N. The dark side is that you have to perform two queries to get your data.
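A minimal sketch of that compromise with cqlengine might look like the following (the model/table names and the text-typed ids are just for illustration):

from cassandra.cqlengine import columns
from cassandra.cqlengine.models import Model

class UserByEmail(Model):
    # Thin lookup table: email -> user_id only, so updates stay cheap.
    __table_name__ = 'user_by_email'
    user_email = columns.Text(primary_key=True)
    user_id = columns.Text()

class UserById(Model):
    # Single table holding the full user record, keyed by id.
    __table_name__ = 'users_by_id'
    user_id = columns.Text(primary_key=True)
    username = columns.Text()
    email = columns.Text()
    # ... remaining user columns

def get_user_by_email(email):
    # Two reads: the lookup table first, then the main table by id.
    lookup = UserByEmail.objects(user_email=email).first()
    return UserById.objects(user_id=lookup.user_id).first() if lookup else None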

Let's go on.

`state`, `gender` and `race` are good candidates for indexing, for the following reasons (see the sketch after the list):

1. *Low Cardinality*
2. *Many rows will contain these values*
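A rough sketch of creating those indexes through the Python driver (the keyspace and table names are assumptions carried over from the sketch above):

from cassandra.cluster import Cluster

# Secondary indexes on the low-cardinality columns of the assumed users_by_id table.
session = Cluster(['127.0.0.1']).connect('my_keyspace')
for col in ('state', 'gender', 'race'):
    session.execute(
        "CREATE INDEX IF NOT EXISTS %s_idx ON users_by_id (%s)" % (col, col)
    )

# Query 5 ("all users with state = X") can then use the index:
rows = session.execute("SELECT * FROM users_by_id WHERE state = 'something'")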

By indexing you will solve some other queries. The hardest part is queries like

select * from users where age > xyz

This kind of query is not allowed in Cassandra, since inequality operations are only permitted on clustering columns within a partition. To handle it you need to "organize" users under some kind of common partition key, like the state or a "state group"; this means that to find all users above a certain age you will have to query each partition.
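A rough sketch of that approach (the table layout, the choice of state as the grouping key, and storing age as an integer are my own assumptions for illustration):

from cassandra.cqlengine import columns
from cassandra.cqlengine.models import Model

class UserByStateAge(Model):
    # Partition by state, cluster by age, so "age > x" becomes a
    # range scan inside each state partition.
    __table_name__ = 'users_by_state_age'
    state = columns.Text(primary_key=True)                           # partition key
    age = columns.Integer(primary_key=True, clustering_order='ASC')  # clustering key
    user_id = columns.Text(primary_key=True)                         # keeps rows unique
    username = columns.Text()

# "All users with age > 30" then means asking every partition, e.g.:
# for state in all_states:
#     UserByStateAge.objects(state=state, age__gt=30)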

Take care: I am not providing a solution and this is not my goal -- what I'm trying to do is to provide an approach to solve this problem with Cassandra.

HTH,
Carlo


