Saturday, 24 August 2013

Updating 300 million records in a collection

I have a collection called TimeSheet that currently holds a few thousand
records. This will grow to about 300 million records within a year. In this
collection I embed a few fields from another collection called Department,
which almost never receives updates; only rarely are some records changed.
By rarely I mean once or twice a year, and even then fewer than 1% of the
records in the collection are affected.
In most cases, once a department is created it is never updated; if it is
updated at all, that happens early on, while there are still few related
records in TimeSheet.
Now suppose someone updates a department after a year. In the worst case,
TimeSheet will have about 300 million records in total, with about 5 million
matching records for the department being updated. The update query's
condition is on an indexed field.
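As a rough sketch, that update might look like the following with pymongo;
the database name, field names, and new value are all assumptions, not taken
from the real system:

    from pymongo import MongoClient

    client = MongoClient("mongodb://localhost:27017")
    db = client["mydb"]  # hypothetical database name

    # Rewrite the embedded Department copy in every matching TimeSheet
    # document. With ~5 million matches this is the slow, lock-heavy
    # update described above, even though the filter field is indexed.
    result = db.timesheet.update_many(
        {"department.departmentId": 7},                       # indexed field
        {"$set": {"department.name": "Engineering (EMEA)"}},
    )
    print(result.matched_count, result.modified_count)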
Since this update is time consuming and takes locks, I'm wondering whether
there is a better way to do it. One option I'm considering is to run the
update in batches by adding an extra condition such as
UpdatedDateTime > somedate && UpdatedDateTime < somedate, as sketched below.
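A minimal sketch of that batching idea, again with pymongo and assuming each
document carries an UpdatedDateTime field; the window size and date range
are placeholders:

    from datetime import datetime, timedelta
    from pymongo import MongoClient

    client = MongoClient("mongodb://localhost:27017")
    db = client["mydb"]  # hypothetical database name

    # Update one time window at a time so each update_many touches a
    # bounded slice of the ~5M matching documents instead of all of them,
    # which keeps each individual lock shorter.
    start = datetime(2012, 1, 1)     # placeholder range boundaries
    end = datetime(2013, 9, 1)
    step = timedelta(days=30)

    window = start
    while window < end:
        db.timesheet.update_many(
            {
                "department.departmentId": 7,
                "UpdatedDateTime": {"$gte": window, "$lt": window + step},
            },
            {"$set": {"department.name": "Engineering (EMEA)"}},
        )
        window += step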
Other details: a single document is about 3 or 4 KB in size, and we have a
replica set with three members.
Is there a better way to do this? What do you think of this kind of design?
And what if the numbers were smaller, as in the scenarios below?
1) 100 million total records and 100,000 matching records for the update
query
2) 10 million total records and 10,000 matching records for the update query
3) 1 million total records and 1000 matching records for the update query
Note: the collection names Department and TimeSheet, and their purpose, are
fictional, not the real collections, but the statistics I have given are
real.
