Deduplication of patchwork mail content?

Wed Oct 9 17:35:03 AEDT 2019

Hi sfr, jk, Konstantin and any other admins lurking,

I'm in the process of reworking the patchwork db schema to avoid one of
our very big and very annoying (and slow) JOINs.

While I'm at it, it occurred to me that for both the ozlabs and
kernel.org instances, there are a lot of mails that are sent across
multiple projects. At the moment the entire contents of the mail
(content, headers, diff, what have you) are stored in full for each
project.

Would it be of value for your deployments if I used this opportunity to
normalise the database and deduplicate emails? I was thinking of
splitting the big raw text fields (diff, content, headers) into their
own table and then indexing into that by message-id.
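
To make the idea concrete, here is a minimal sketch of the split using
Python's sqlite3 module. The table and column names (mail_content,
submission) and the store_mail() helper are made up for illustration
and are not patchwork's actual models; a real change would go through
the Django ORM and a migration.

```python
import sqlite3

# Hypothetical schema, not patchwork's real one: the raw text is stored
# once per Message-ID, and each project keeps only a small row pointing at it.
SCHEMA = """
CREATE TABLE mail_content (
    msgid   TEXT PRIMARY KEY,
    headers TEXT NOT NULL,
    content TEXT,
    diff    TEXT
);
CREATE TABLE submission (
    id      INTEGER PRIMARY KEY,
    project TEXT NOT NULL,
    msgid   TEXT NOT NULL REFERENCES mail_content(msgid),
    UNIQUE (project, msgid)
);
"""


def store_mail(db, project, msgid, headers, content, diff):
    """Store the raw text once per Message-ID, then link it to the project."""
    # The first copy to arrive wins; later copies of the same Message-ID
    # leave the stored text untouched.
    db.execute(
        "INSERT OR IGNORE INTO mail_content (msgid, headers, content, diff) "
        "VALUES (?, ?, ?, ?)",
        (msgid, headers, content, diff),
    )
    db.execute(
        "INSERT OR IGNORE INTO submission (project, msgid) VALUES (?, ?)",
        (project, msgid),
    )


db = sqlite3.connect(":memory:")
db.executescript(SCHEMA)

# The same mail delivered via two lists ends up as one content row.
store_mail(db, "linuxppc-dev", "<1@example.com>", "From: a", "body", "--- a/f")
store_mail(db, "linux-kernel", "<1@example.com>", "From: a", "body", "--- a/f")

content_rows = db.execute("SELECT COUNT(*) FROM mail_content").fetchone()[0]
submission_rows = db.execute("SELECT COUNT(*) FROM submission").fetchone()[0]
```

In this sketch the first copy's headers are the ones kept, which is only
acceptable if per-list header differences don't matter to anyone.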

I don't know how much space this would save you, or whether you think
it's worth it, but I figured I'd ask. I also haven't checked to see if this
gets messed up by list footers, or if we need to be more selective about
what headers we store - as they might differ from project to project
depending on the path the mail took - but I figured I'd run it up the
flagpole before investing too much time.
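
For what it's worth, a rough way to answer both questions from a dump of
the existing rows might look like the sketch below. dedup_report() and
the tuple layout it expects are assumptions for illustration, not
anything in patchwork today. It counts bytes rather than on-disk size,
so treat the numbers as an upper-bound estimate.

```python
from collections import defaultdict


def dedup_report(rows):
    """Estimate savings from storing one copy of each Message-ID.

    rows: iterable of (project, msgid, headers, content, diff) strings.
    Returns (total_bytes, saved_bytes, msgids_whose_copies_differ).
    """
    by_msgid = defaultdict(list)
    for _project, msgid, headers, content, diff in rows:
        by_msgid[msgid].append((headers, content, diff))

    total = saved = 0
    differing = []
    for msgid, copies in by_msgid.items():
        sizes = [sum(len(part.encode("utf-8")) for part in c) for c in copies]
        total += sum(sizes)
        # Keeping only the largest copy saves the size of all the others.
        saved += sum(sizes) - max(sizes)
        # Headers are expected to vary with the delivery path, so only the
        # content and diff are compared (list footers would show up here).
        if len({(content, diff) for _headers, content, diff in copies}) > 1:
            differing.append(msgid)
    return total, saved, differing


rows = [
    ("proj-a", "<1@x>", "H1", "body", "d"),
    ("proj-b", "<1@x>", "H2", "body", "d"),
    ("proj-b", "<2@x>", "H", "solo", ""),
]
total, saved, differing = dedup_report(rows)
```

Any Message-ID that lands in the "differing" list would need a closer
look before its copies could be merged.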

[I don't really want to experiment with completely different object
stores at this point - I want to get this schema thing done first. Maybe
in the future.]

Regards,
Daniel