Deduplication of patchwork mail content?

Jeremy Kerr jk at ozlabs.org
Thu Oct 10 12:26:38 AEDT 2019


Hi Daniel,

> While I'm at it, it occurred to me that for both the ozlabs and
> kernel.org instances, there are a lot of mails that are sent across
> multiple projects. ATM the entire contents of the mail - content,
> headers, diff, what have you, will be stored in full for each project.

The headers will be different, as they've gone through different lists.
This may not be too relevant to the actual purpose of patchwork though.

The comments (apart from the first) may diverge, depending on whether
responders keep both lists on CC.

The diffs will be the same, so we could deduplicate those, if it's worth
your trouble:

   patchwork=# select sum(dup_size) from (
                 select octet_length(diff) * (n - 1) as dup_size, a.msgid, n
                 from (select msgid, count(msgid) as n, min(id) as id
                       from patchwork_submission
                       group by msgid having count(msgid) > 1) as a
                 inner join patchwork_patch
                   on patchwork_patch.submission_ptr_id = a.id
               ) as b;
       sum    
   -----------
    221334709
   (1 row)

and:

   patchwork=# select sum(octet_length(diff)) from patchwork_patch;
       sum     
   ------------
    6261083055
   (1 row)


So about 221MB of the 6.26GB of stored diffs is duplicated; around 3.5%.
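
For anyone who wants to check the counting logic without a database to hand, here is a rough Python sketch of what the first query measures, followed by the percentage arithmetic on the two totals reported above. The function name and the toy rows are made up for illustration, and it assumes every row is a patch with a UTF-8 diff; it is not patchwork code.

```python
from collections import defaultdict


def duplicate_bytes(submissions):
    """Estimate bytes held by repeated diffs.

    `submissions` is an iterable of (id, msgid, diff) tuples; this is a
    hypothetical stand-in for patchwork_submission joined to patchwork_patch.
    For each msgid seen more than once, the diff of the lowest-id row is
    counted (n - 1) times, mirroring the query's min(id) and (n-1) terms.
    """
    groups = defaultdict(list)
    for sub_id, msgid, diff in submissions:
        groups[msgid].append((sub_id, diff))

    total = 0
    for rows in groups.values():
        n = len(rows)
        if n > 1:
            _, diff = min(rows, key=lambda row: row[0])
            # Assumption: UTF-8 byte length approximates octet_length(diff).
            total += len(diff.encode("utf-8")) * (n - 1)
    return total


# Toy data: one message delivered via three lists, one via a single list.
toy = [
    (1, "<a@example>", "abcd"),
    (2, "<a@example>", "abcd"),
    (3, "<b@example>", "xyz"),
    (4, "<a@example>", "abcd"),
]
toy_dup = duplicate_bytes(toy)

# Percentage from the two sums reported by the queries above.
dup_total = 221334709
all_total = 6261083055
pct = dup_total / all_total * 100
print(f"{pct:.1f}% of diff bytes are duplicates")
```

In the toy data only the first message repeats, so its 4-byte diff is counted twice and the single-list message contributes nothing.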

Cheers,


Jeremy


