Deduplication of patchwork mail content?
jk at ozlabs.org
Thu Oct 10 12:26:38 AEDT 2019
> While I'm at it, it occurred to me that for both the ozlabs and
> kernel.org instances, there are a lot of mails that are sent across
> multiple projects. ATM the entire contents of the mail - content,
> headers, diff, what have you, will be stored in full for each project.
The headers will be different, as they've gone through different lists.
This may not be too relevant to the actual purpose of patchwork though.
The comments (apart from the first) may diverge, depending on whether
responders keep both lists on CC.
The diffs will be the same, so we could deduplicate those, if it's worth it:
patchwork=# select sum(dup_size)
    from (select octet_length(diff) * (n-1) as dup_size, a.msgid, n
          from (select msgid, count(msgid) as n, min(id) as id
                from patchwork_submission
                group by msgid
                having count(msgid) > 1) as a
          inner join patchwork_patch
            on patchwork_patch.submission_ptr_id = a.id) as b;
patchwork=# select sum(octet_length(diff)) from patchwork_patch;
So 221MB out of 6.2GB is duplicate; around 3.5%.
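
For anyone who wants to check the logic of the first query without a database, here is a minimal Python sketch of the same calculation over in-memory rows. The (id, msgid, diff) tuple shape, the sample rows and the function name duplicate_diff_bytes are made up for illustration; none of it is patchwork code.

```python
from collections import defaultdict


def duplicate_diff_bytes(rows):
    """Sum the bytes held by redundant copies of each diff.

    rows is an iterable of (submission_id, msgid, diff) tuples.
    This shape is an assumption for illustration, not patchwork's schema.
    """
    groups = defaultdict(list)
    for sub_id, msgid, diff in rows:
        groups[msgid].append((sub_id, diff))

    total = 0
    for entries in groups.values():
        n = len(entries)
        if n > 1:
            # Mirror the query: measure the diff of the lowest id and
            # count the other n-1 rows as duplicates of it.
            _, diff = min(entries, key=lambda entry: entry[0])
            total += len(diff.encode("utf-8")) * (n - 1)
    return total


# Hypothetical sample: one mail stored under two projects, one stored once.
rows = [
    (1, "<a@example>", "diff --git a/x b/x\n"),
    (2, "<a@example>", "diff --git a/x b/x\n"),
    (3, "<b@example>", "diff --git a/y b/y\n"),
]
print(duplicate_diff_bytes(rows))
```

Like the query, this assumes that every submission sharing a msgid carries an identical diff, so it only measures the copy with the lowest id.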