Tuesday, October 17, 2017

Graphing login events, but need extra data. Rewrite history or post-process?

We have been tracking user login events for a while now in a MongoDB collection. Each event contains the userID, datetime, and a couple of other basic attributes about the event.
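For concreteness, a login event document might look roughly like the following TypeScript shape (the field names here are illustrative, not our actual schema):

    // Illustrative only -- our real collection's field names may differ.
    interface LoginEvent {
      _id: object;        // MongoDB ObjectId
      userId: number;     // ID of the user who logged in
      occurredAt: Date;   // datetime of the login
      // ...a couple of other basic attributes about the event itself
    }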

For a new feature, we want to present a graph of these login events, with different groups representing cohorts related to the user who did the event. Specifically, we want to group by the "Graduation Year" attribute of the user.

In our event log, we do not record the Graduation Year of the user who's logging in, so we cannot easily query that directly. We see two ways to go forward, plus a third "in-between" option:

  1. Instead of making a single MongoDB query to get the logins, we make that query PLUS a second one to our relational DB to get the secondary user data we require, and merge the two together.

    We could either query for all the users, load them into memory, and loop through the events, or we could scan the events first, collect only the user IDs that actually logged in, and query for just those specific users. (Then loop again, merging them in; there's a sketch of this after the list.)

    The post-processing could be done server-side, or we could send all the data to the client. (Currently our plan is to just send the raw event data to the client for processing into the graph.)

    Upsides: The event log is made to track events. User "Graduation Year" is not relevant to the event in question; it's relevant to the user who did the event, so this seems to separate concerns more properly. And if we later decide we want to group on a different piece of metadata (say, male vs. female), it's easy to join that data in as well.

    Downsides: Part of the beauty of our event log is that it can quickly spit out tons of aggregate data that's ready to use. If there are 10,000 users, we may have 100,000 logins. It seems crazy to need to loop through 100,000 logins every time this data is requested fresh (as in, not cached).

  2. We can write a script that does a one-time load of all the events (presumably in batches), then requests the user metadata and merges it in, rewriting the event log to include the relevant data. (This is also sketched after the list.)

    Upsides: The event log is our single point of interaction when loading the data. The client requests all the logins; gets 100,000 rows; sorts and groups them by Graduation Year; [caches them;] and graphs them. We would also have a script ready to backfill more data if it came to that, down the road.

    Downsides: We're essentially rewriting history. We're polluting our event log with secondary data that isn't explicitly about the event we claim to be tracking. And we'd need to rewrite or modify the script whenever we discover another attribute we wish we'd been tracking, down the road.

  3. We replicate the Users table in MongoDB, perhaps only as needed (say, when an event's metadata is unavailable), and join against that collection (a "$lookup" in Mongo, I believe; see the sketch after the list).

    Upsides: MongoDB does the heavy lifting of merging the data.

    Downsides: We need to replicate, and somehow keep up to date, a secondary collection of our users' relevant metadata. I don't think MongoDB's $lookup works like a join in MySQL, and it may not be any more performant in practice; I'd look into this before we implemented anything.
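To make these options concrete: here's a rough sketch of option 1, assuming a Node/TypeScript backend using the official mongodb driver, with the relational lookup stubbed out as a hypothetical fetchGraduationYears helper (our real code, collection names, and schema will differ):

    import { MongoClient } from "mongodb";

    // Hypothetical helper: given user IDs, returns a Map of
    // userId -> Graduation Year from our relational DB.
    declare function fetchGraduationYears(userIds: number[]): Promise<Map<number, number>>;

    async function loginsWithGraduationYear(mongoUrl: string) {
      const client = await MongoClient.connect(mongoUrl);
      try {
        const events = client.db("analytics").collection("loginEvents");

        // 1. Pull the raw login events from MongoDB.
        const logins = await events.find({}).toArray();

        // 2. Collect only the user IDs that actually appear in the events.
        const userIds = [...new Set(logins.map((e) => e.userId as number))];

        // 3. One query to the relational DB for just those users.
        const gradYears = await fetchGraduationYears(userIds);

        // 4. Merge the secondary data into each event in memory.
        return logins.map((e) => ({
          ...e,
          graduationYear: gradYears.get(e.userId) ?? null,
        }));
      } finally {
        await client.close();
      }
    }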
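Option 2's one-time backfill script might look something like this sketch (same assumptions, same hypothetical fetchGraduationYears helper); it walks the events in batches and writes each user's Graduation Year onto their events:

    import { MongoClient } from "mongodb";

    declare function fetchGraduationYears(userIds: number[]): Promise<Map<number, number>>;

    async function backfillGraduationYear(mongoUrl: string) {
      const client = await MongoClient.connect(mongoUrl);
      try {
        const events = client.db("analytics").collection("loginEvents");
        const batchSize = 1000;
        let batch: any[] = [];

        // Stream only the events that haven't been backfilled yet, in batches.
        const cursor = events.find({ graduationYear: { $exists: false } });
        for await (const event of cursor) {
          batch.push(event);
          if (batch.length >= batchSize) {
            await writeBatch(events, batch);
            batch = [];
          }
        }
        if (batch.length > 0) await writeBatch(events, batch);
      } finally {
        await client.close();
      }
    }

    async function writeBatch(events: any, batch: any[]) {
      const userIds = [...new Set(batch.map((e) => e.userId as number))];
      const gradYears = await fetchGraduationYears(userIds);

      // Rewrite each event in place to include the user's Graduation Year.
      await events.bulkWrite(
        batch.map((e) => ({
          updateOne: {
            filter: { _id: e._id },
            update: { $set: { graduationYear: gradYears.get(e.userId) ?? null } },
          },
        }))
      );
    }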
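And for option 3, assuming we had a users collection replicated into the same MongoDB database (the collection and field names below are illustrative), the $lookup could look roughly like this:

    import { MongoClient } from "mongodb";

    async function loginsViaLookup(mongoUrl: string) {
      const client = await MongoClient.connect(mongoUrl);
      try {
        const events = client.db("analytics").collection("loginEvents");

        // Join each login event to the replicated users collection,
        // then flatten the joined array down to a single graduationYear field.
        return await events
          .aggregate([
            {
              $lookup: {
                from: "users",            // replicated copy of the Users table
                localField: "userId",
                foreignField: "userId",
                as: "user",
              },
            },
            { $unwind: { path: "$user", preserveNullAndEmptyArrays: true } },
            { $addFields: { graduationYear: "$user.graduationYear" } },
            { $project: { user: 0 } },
          ])
          .toArray();
      } finally {
        await client.close();
      }
    }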

For the sake of estimation, let's just say that any given visitor to our site will never have to load more than 100,000 logins and 10,000 users.

For what it's worth, Option #2 seems preferable to me, for performance reasons, even though it involves rewriting history. That said, I'm aware that if we're sending a user's browser multiple years of login data (that is, all 100,000 imaginary logins), that may already be too much for the browser to process and render quickly, and perhaps we'd be better off grouping and aggregating it as some sort of regularly scheduled process on the backend. (I don't know!)
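If we did go that pre-aggregated route, a scheduled backend job could collapse the raw logins into per-day counts per Graduation Year before anything reaches the browser. A minimal sketch, assuming the events have already been backfilled with a graduationYear field as in option 2 (the occurredAt field name is illustrative):

    import { MongoClient } from "mongodb";

    async function aggregateLoginsByCohort(mongoUrl: string) {
      const client = await MongoClient.connect(mongoUrl);
      try {
        const events = client.db("analytics").collection("loginEvents");

        // Count logins per (graduation year, day) so the client only ever
        // receives a few thousand pre-grouped rows instead of 100,000 events.
        return await events
          .aggregate([
            {
              $group: {
                _id: {
                  graduationYear: "$graduationYear",
                  day: { $dateToString: { format: "%Y-%m-%d", date: "$occurredAt" } },
                },
                logins: { $sum: 1 },
              },
            },
            { $sort: { "_id.day": 1 } },
          ])
          .toArray();
      } finally {
        await client.close();
      }
    }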
