Deduplication of patchwork mail content?

Thu Oct 24 09:36:05 AEDT 2019

Konstantin Ryabitsev <konstantin at linuxfoundation.org> writes:

> On Wed, Oct 09, 2019 at 05:35:03PM +1100, Daniel Axtens wrote:
>> Hi sfr, jk, Konstantin and any other admins lurking,
>> 
>> I'm in the process of reworking the patchwork db schema to avoid one of
>> our very big and very annoying (and slow) JOINs.
>> 
>> While I'm at it, it occurred to me that for both the ozlabs and
>> kernel.org instances, there are a lot of mails that are sent across
>> multiple projects. At the moment the entire contents of the mail (content,
>> headers, diff, what have you) are stored in full for each project.
>> 
>> Would it be of value for your deployments if I used this opportunity to
>> normalise the database and deduplicate emails? I was thinking of
>> splitting the big raw text fields (diff, content, headers) into their
>> own table and then indexing into that by message-id.
>
> Daniel:
>
> I think space is pretty cheap, and it's going to be a lot of work for
> little savings. Adding some indexes would be a much more effective way
> of improving performance in my view.
>
Cool, thanks to both of you. I will keep things the way they are, and
look at what indexes can be added.

Regards,
Daniel

> Best,
> -K
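
For reference, the normalisation floated in the quoted message (moving the
big raw text fields into their own table, keyed by message-id) could be
sketched roughly as below. This is only an illustration using Python's
sqlite3 module. The table names (mail_body, submission), the column names and
the store_mail helper are hypothetical and are not patchwork's actual schema
or code.

```python
import sqlite3


def make_db():
    """Create an in-memory database with a hypothetical deduplicated layout."""
    db = sqlite3.connect(":memory:")
    db.executescript(
        """
        -- One row per distinct mail, keyed by message-id (hypothetical table).
        CREATE TABLE mail_body (
            msgid   TEXT PRIMARY KEY,
            headers TEXT NOT NULL,
            content TEXT NOT NULL,
            diff    TEXT
        );
        -- One row per (project, mail) pair, pointing at the shared body.
        CREATE TABLE submission (
            id      INTEGER PRIMARY KEY,
            project TEXT NOT NULL,
            msgid   TEXT NOT NULL REFERENCES mail_body(msgid),
            UNIQUE (project, msgid)
        );
        """
    )
    return db


def store_mail(db, project, msgid, headers, content, diff=None):
    """Store a mail for one project; the raw text is written at most once."""
    # INSERT OR IGNORE skips the row if this message-id is already present.
    db.execute(
        "INSERT OR IGNORE INTO mail_body (msgid, headers, content, diff) "
        "VALUES (?, ?, ?, ?)",
        (msgid, headers, content, diff),
    )
    db.execute(
        "INSERT OR IGNORE INTO submission (project, msgid) VALUES (?, ?)",
        (project, msgid),
    )
    db.commit()


db = make_db()
# The same mail arrives on two lists (project names are only examples).
for project in ("project-a", "project-b"):
    store_mail(
        db,
        project,
        "<msg-1@example.org>",
        "Subject: [PATCH] example",
        "body text",
        "--- a/f\n+++ b/f\n",
    )

bodies = db.execute("SELECT COUNT(*) FROM mail_body").fetchone()[0]
links = db.execute("SELECT COUNT(*) FROM submission").fetchone()[0]
print(bodies, links)
```

In this sketch a mail sent to two projects yields two submission rows but a
single mail_body row, which is where the space saving would come from. The
cost is an extra join whenever the raw text is needed.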