jeudi 28 avril 2016

Large Scale App - Choosing the right database, and serving data from it

I have been doing a lot of research recently non-stop. I know this is a vague question, but I will try be as specific as I can. I have been developing small to medium web systems for a while, but never dealt with noSQL databases, or made something with this scale. The purpose of the question is to make the right decision, as I don't want to shoot myself in the foot. And hopefully it can help someone else like me, new to this and a specific example can clarify more things in my opinion.

To make this easier, I will give an example very close to what I'm trying to achieve, which is very simple and covers a lot of similar systems out there. Hopefully by this example I will clear out the type of data I'm dealing with, the kind of system, and operations that need to be performed.

Example System Overview

1) System Overview: Let's say we are dealing with a multi-tenant system, that allows users to track traffic to specific pages on their sites. The user/customer needs to plant a code of my system into their page, and this code, will call a service on my server that simply stores the data in the database. The data is simply the page visitor's data, stuff like the page they visited, browser, device, time of visit..etc. Here, the data is STORED

2) The users/customers can login to the system, and view their data. Here the data is SERVED

3) In addition to storing, and serving data, there's another service in the system, which will be processing this data. And finding patterns/conclusions about the site visitors. For example, specific page, is more visited via mobile devices.. (and do whatever with this conclusion). Here the data is PROCESSED

4) We are talking about Large volumes of data. Probably starts with 3-4 Million writes per day (visits), and I want to be sure it can scale up to Billion+

Question 1:

Now, I have been cracking my head over Google Cloud Solutions. I thought that it's best to go with App Engine, since I can't afford to maintain a virtual server. I decided to go with the DataStore and here comes my question:

1) Is DataStore the most appropriate solution for this case?

According to my research, I found that datastore is most appropriate, but of course I'm not 100% confident due to my lack of experience. But I chose this because I come from using SQL a lot. I didn't go for Cloud SQL, because my data is not that relational, and due to the volume and the need of reliability and scalability of my system, I thought a NoSQL solution would be better. Datastore has GQL syntax, which is similar to SQL querying language. Properties of objects are also indexed. I will probably need to read/filter data by Device type for example. I also saw that BigTable is very different (probably not used to it), and all data you have is Key:Value, where I think it's not possible/hard to query by specific property for example?

Eventhough I don't believe my data is really structured. Specially that I will be using Namespaces for users/customers (multi-tenant), and it's automatically included in the key. The entities I have at the moment is User (will probably include few entities under it), and Visit as a "standalone" entity.

Question 2

What's the best practice for serving data to customers?

I don't want to show customers Millions of visits. I would give them overall stats. For example how many visits via Android Mobile Devices. Or stats within Time-Range. So I believe there are 2 approaches. First, make separate database. Or maybe separate entity within the same DataStore, that holds total daily stats per customer. And then make serving by adding up "days stats" which doesn't feel right. Second, Every time I want to serve data, I make a query to the one big database. But also the second approach doesn't feel right. Because as I know, Datastore doesn't have "Group by" functionality for example, or aggregates. So every time, a customer wants to see stats "by device" I need to make the "group by" functionality by code on memory? or if I just want to display the "count" of specific filter query result, which is not directly supported with datastore I believe? Also making these calculations in real time when serving the data sounds like a lot.

Now I know there's a solution called BigQuery by Google, which I believe does what I want and allows me to serve the data I want to customers with high flexibility and efficiency, but as I understood it works only on datastore "backups", I need to serve data in real time.


I appreciate a lot anyone taking the time to give any input on the matter. Thank you very much in advance.

Aucun commentaire:

Enregistrer un commentaire