System design for tracking viewed posts and returning unseen posts
I came across this system design question and have been wondering what is a good approach. Requirement : We have a typical blog or a mini social media kind of website where users create "Posts". People then view posts created by other users. We need to design a system which tracks the posts viewed by each user and when asked for posts returns only those posts which the user hasn't already viewed. Each request might ask for different number of posts and we have to support pagination. Up to this point it is a fairly standard problem and this question has appeared in this forum previously. What changes is this : The posts themselves come from a different system (a popularity based post ranking system). It could be a DB which a different system populates or a back-end service which we have to call to fetch the posts (sorted by rank), we filter out the ones viewed by the current user and return only the rest. So we store the views information in a DB by ourselves where as the list of posts comes from a different one. In a single DB system, we would do a NOT IN kind of query as mentioned in this answer. That would take care of retrieving any number of posts and consequently support pagination also. However, things get a little tricky with two systems. Scale : Hundreds of millions of posts. Billions of views. One solution One simple approach I thought of was something like this (for brevity I will refer to the system from where we get the posts as posts DB, irrespective of whether it is a DB we talk to or a different service). I have put in representative SQL queries also, but those can be different if we use a different data store (like Redis for views DB). Fetch about 100 posts from the posts DB (which we get in sorted order based on rank). SELECT * FROM posts ORDER BY rank LIMIT 100 Check which of those posts have been viewed from the views DB and remove those. SELECT * FROM views WHERE uid = user.id AND post_id IN See if the remaining posts are greater than or equal to requested count. If yes, return the requested count. If not, go back to step 1 asking for the next 100 and repeat until we either find the requested number of posts or we run out of posts altogether. NOTE 1 : I chose 100 as certain DBs (like MongoDB) return 100 records at a time, at the driver level, when we fetch a large number of records. After that the driver keeps doing a getMore. NOTE 2 : I have skipped the tracking views part. We can assume that it is a separate service or end point which populates the views DB mentioned above. Question Is there a better approach, especially one that avoids repeated fetch of posts or is this the only practical approach, especially with the above mentioned scale? If the incoming requests ask for page = 3 and per_page = 50, then I have to 150 unseen posts and the above loop might run a few times. Is there a better way?
I came across this system design question and have been wondering what is a good approach.
Requirement :
We have a typical blog or a mini social media kind of website where users create "Posts". People then view posts created by other users. We need to design a system which tracks the posts viewed by each user and when asked for posts returns only those posts which the user hasn't already viewed. Each request might ask for different number of posts and we have to support pagination.
Up to this point it is a fairly standard problem and this question has appeared in this forum previously. What changes is this :
The posts themselves come from a different system (a popularity based post ranking system). It could be a DB which a different system populates or a back-end service which we have to call to fetch the posts (sorted by rank), we filter out the ones viewed by the current user and return only the rest. So we store the views information in a DB by ourselves where as the list of posts comes from a different one.
In a single DB system, we would do a NOT IN
kind of query as mentioned in this answer. That would take care of retrieving any number of posts and consequently support pagination also. However, things get a little tricky with two systems.
Scale : Hundreds of millions of posts. Billions of views.
One solution
One simple approach I thought of was something like this (for brevity I will refer to the system from where we get the posts as posts DB, irrespective of whether it is a DB we talk to or a different service). I have put in representative SQL queries also, but those can be different if we use a different data store (like Redis for views DB).
- Fetch about 100 posts from the posts DB (which we get in sorted order based on rank).
SELECT * FROM posts ORDER BY rank LIMIT 100
- Check which of those posts have been viewed from the views DB and remove those.
SELECT * FROM views WHERE uid = user.id AND post_id IN
- See if the remaining posts are greater than or equal to requested count.
- If yes, return the requested count.
- If not, go back to step 1 asking for the next 100 and repeat until we either find the requested number of posts or we run out of posts altogether.
NOTE 1 : I chose 100
as certain DBs (like MongoDB) return 100 records at a time, at the driver level, when we fetch a large number of records. After that the driver keeps doing a getMore
.
NOTE 2 : I have skipped the tracking views part. We can assume that it is a separate service or end point which populates the views DB
mentioned above.
Question
Is there a better approach, especially one that avoids repeated fetch of posts or is this the only practical approach, especially with the above mentioned scale?
If the incoming requests ask for page = 3
and per_page = 50
, then I have to 150 unseen posts and the above loop might run a few times. Is there a better way?
What's Your Reaction?