Deduplication of patchwork mail content?

Thu Oct 24 02:23:15 AEDT 2019

On Wed, Oct 09, 2019 at 05:35:03PM +1100, Daniel Axtens wrote:
> Hi sfr, jk, Konstantin and any other admins lurking,
> 
> I'm in the process of reworking the patchwork db schema to avoid one of
> our very big and very annoying (and slow) JOINs.
> 
> While I'm at it, it occurred to me that for both the ozlabs and
> kernel.org instances, there are a lot of mails that are sent across
> multiple projects. ATM the entire contents of the mail - content,
> headers, diff, what have you, will be stored in full for each project.
> 
> Would it be of value for your deployments if I used this opportunity to
> normalise the database and deduplicate emails? I was thinking of
> splitting the big raw text fields (diff, content, headers) into their
> own table and then indexing into that by message-id.

Daniel:

I think space is pretty cheap, and it's going to be a lot of work for
little saving. In my view, adding some indexes would be a much more
effective way of improving performance.

Best,
-K
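
For readers following the thread, here is a minimal sketch of the two
options being weighed: moving the raw text fields into a table keyed by
message-id so a cross-posted mail is stored once, and adding an index
for lookups. It uses Python's standard sqlite3 module purely for
illustration. The table, column, and function names (mail_content,
submission, store_mail) are hypothetical and are not Patchwork's actual
schema.

```python
import sqlite3


def make_db():
    """Create an in-memory database with a hypothetical deduplicated layout."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        -- Raw text fields live here once per message-id (assumed layout).
        CREATE TABLE mail_content (
            msgid   TEXT PRIMARY KEY,
            headers TEXT NOT NULL,
            content TEXT NOT NULL,
            diff    TEXT
        );
        -- One row per (project, mail); points at the shared content.
        CREATE TABLE submission (
            id      INTEGER PRIMARY KEY,
            project TEXT NOT NULL,
            msgid   TEXT NOT NULL REFERENCES mail_content(msgid),
            UNIQUE (project, msgid)
        );
        -- The alternative raised in the reply: an index for msgid lookups.
        CREATE INDEX submission_msgid_idx ON submission (msgid);
        """
    )
    return conn


def store_mail(conn, project, msgid, headers, content, diff=None):
    """Record a mail for a project, storing its raw text only once."""
    # If this message-id was already stored, the insert is skipped.
    conn.execute(
        "INSERT OR IGNORE INTO mail_content (msgid, headers, content, diff) "
        "VALUES (?, ?, ?, ?)",
        (msgid, headers, content, diff),
    )
    conn.execute(
        "INSERT OR IGNORE INTO submission (project, msgid) VALUES (?, ?)",
        (project, msgid),
    )
    conn.commit()


def count_rows(conn, table):
    """Return the number of rows in one of the two known tables."""
    if table not in ("mail_content", "submission"):
        raise ValueError("unknown table: " + table)
    return conn.execute("SELECT COUNT(*) FROM " + table).fetchone()[0]
```

Storing the same message-id under two projects with store_mail leaves
one row in mail_content and two in submission. Whether that saving
justifies the migration effort is exactly the trade-off questioned
above.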