Real-time statistics: MySQL (/Drizzle) or MongoDB?


We are working on a project that features real-time statistics of actions (e.g. clicks). On every click we log information such as the date, age, gender (these come from Facebook), location, etc.

We are discussing the best place to store this information and use it for real-time stats. We want to display aggregate statistics: for example, the total number of clicks, the number of clicks made by males/females, and the number of clicks broken down by age group (e.g. 18-24, 24-30, ...).

Since we are using MongoDB everywhere else on the site, my colleague thought we should store the statistics there as well. I, however, would prefer an SQL-based database for this task, such as MySQL (or maybe Drizzle), because I believe SQL is better suited to operations like data aggregation. Although there is the overhead of parsing SQL, I think MySQL/Drizzle may be faster than NoSQL databases here. And inserts are not slow either, when using INSERT DELAYED queries.

Please note that we do not need to perform joins or collect data from multiple tables/collections, so we don't mind that the databases differ in that respect. However, we do care about scalability and reliability. We are building something that will (hopefully) become big, and we've designed every single line of code with scalability in mind.

What do you think? Is there any reason to prefer MongoDB over MySQL/Drizzle here, or is it indifferent? Which one would you use, if you were us?

Thank you, Alessandro

So BuddyMedia is using some of this. Gilt Groupe has done something pretty cool with Hummingbird (node.js + MongoDB).

Having worked for a large online advertiser in the social media space, I can attest that real-time reporting is a pain. Trying to "roll up" 500M impressions a day is a challenge; trying to do it in real time worked, but carried significant limitations (like being delayed by 5 minutes :).

Frankly, this type of problem is one of the reasons I started using MongoDB, and I'm not the only one. People are using MongoDB for all kinds of real-time analytics: server monitoring, centralized logging, dashboard reporting.

The real key when doing this type of reporting is to understand that the data structure is different with MongoDB. You're going to avoid "aggregation" queries, so the queries and the way you output charts are going to be different. There's some coding work on the client side.

Here's a key that may point you in the right direction for doing this with MongoDB. Take a look at the following data structure:

{
    date: "20110430",
    gender: "m",
    age: 1,                                  // an age bucket
    impression_hour: [ 100, 50, ... ],       // 24 of these
    impression_minute: [ 2, 5, 19, 8, ... ], // 1440 of these
    clicks_hour: [ 10, 2, ... ],
    ...
}
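One practical note before the updates below: if the array field doesn't exist yet, a dotted counter update will create an embedded document keyed by the index string rather than an array, so a common companion pattern is to pre-allocate each day's document with zeroed counters. A minimal mongo-shell sketch of that, assuming a collection named stats, a date+gender+age composite _id, and a small zero-filling helper (these names are just illustrative):

// Build an array of n zeroed counters.
function zeros(n) {
    var a = [];
    for (var i = 0; i < n; i++) a.push(0);
    return a;
}

// Pre-allocate the daily document so later in-place $inc updates
// never have to grow it or create the arrays on the fly.
db.stats.insert({
    _id: "20110430:m:1",            // date + gender + age bucket mushed together
    date: "20110430",
    gender: "m",
    age: 1,
    impression_hour: zeros(24),
    impression_minute: zeros(1440),
    clicks_hour: zeros(24),
    clicks_minute: zeros(1440)      // assumed to mirror the hour/minute pattern above
});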

There are some tweaks here, like appropriate indexes and maybe mushing date+gender+age into the _id, but that's the basic structure of click analytics with MongoDB. It's easy to update impressions and clicks with { $inc: { "clicks_hour.0": 1 } }, which updates the whole document atomically. And it's pretty natural to report on: you have an array containing the hourly or minute-level data points.
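To make that concrete, here is a small mongo-shell sketch of the write side and the read side, assuming the pre-allocated stats documents above; the hour and minute indexes come from the click timestamp, and the rollup at the end is the kind of client-side coding work mentioned earlier:

// Record a click at 14:37 on 2011-04-30 for gender "m", age bucket 1.
// One atomic update bumps both the hourly and the minute-level counters in place.
db.stats.update(
    { _id: "20110430:m:1" },
    { $inc: { "clicks_hour.14": 1, "clicks_minute.877": 1 } }   // 877 = 14 * 60 + 37
);

// Reporting: pull the day's documents and do the small rollup on the client,
// e.g. total clicks per gender for a chart.
var totals = {};
db.stats.find({ date: "20110430" }).forEach(function (doc) {
    var sum = 0;
    doc.clicks_hour.forEach(function (n) { sum += n; });
    totals[doc.gender] = (totals[doc.gender] || 0) + sum;
});
printjson(totals);   // e.g. { "m": 1234, "f": 987 }

The query side is just a find() plus some array summing in application code: no aggregation query, but the document shape is already close to what the chart needs.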

Hopefully that points you in the right direction.

