====== Merge Mailman - or other - mbox Archives ======

{{tag>project software computing }}
{{process:shack-heritage-badge.png}}

===== Situation =====

Mailman archives list email in an mbox-formatted mailbox, a file to which new emails are simply appended. There are several situations in which two copies of such an mbox file fork at some point in the past:

  * two mail servers, migration from one to the other
  * two mail servers, one of them crashed, and one restored from backup
  * other cases of desynchronized mbox folders

===== Problem =====

An mbox folder is just one (sometimes large) file containing several emails one after another. So it's not easy to tell which emails are present in which folder, and it's even nastier to put a missing email into a folder at a defined position.

===== Solution =====

The first step is to split the folder into its individual emails. This is where formail (it comes with procmail) comes in handy. It splits an mbox folder and calls a given tool once per mail, providing the mail on the tool's stdin. What is still needed is a tool that catches this input and writes it to files with enumerated filenames. As I need this functionality sometimes, I wrote this and called it 'enumcat':

<code bash>
#!/bin/bash
# enumcat: write stdin to the next numbered file, e.g. enumcat.000001

pfx="enumcat."   # file name prefix, may include a path
seqf=".seq"      # file that remembers the last number used
c=0              # counter start, used only if no sequence file exists yet
width=6          # zero-padded width of the number

usage () {
    cat << EOF
$0 [ -p prefix ] [ -s sequencefile ] [ -c count ] [ -w width ]
  -p prefix        file name prefix, may include path. default: 'enumcat.'
  -s sequencefile  sequence file, may include path. default: '.seq'
  -c count         counter start. default: 0
  -w width         length of numbers. default: 6
EOF
}

doopt () {
    local x
    while [ $# -gt 0 ]; do
        x=$1; shift
        case "$x" in
            -p) pfx="$1";   shift ;;
            -s) seqf="$1";  shift ;;
            -c) c="$1";     shift ;;
            -w) width="$1"; shift ;;
            *)  usage; exit 1 ;;
        esac
    done
}

doopt "$@"

# Continue from the last number stored in the sequence file, if any.
[ -e "$seqf" ] && c=$(< "$seqf")
c=$(( c + 1 ))
echo "$c" > "$seqf"

# Store the mail arriving on stdin under the next numbered file name.
cat > "$pfx$( printf "%0${width}d" "$c" )"
</code>

Now we can split the mbox mail folder. You have to do this on both folders.
<code bash>
mkdir /tmp/folder1
formail -s enumcat -s /tmp/folder1/.seq -p /tmp/folder1/email. < /my/original/folder.mbox
mkdir /tmp/folder2
formail -s enumcat -s /tmp/folder2/.seq -p /tmp/folder2/email. < /my/other/folder.mbox
</code>

The first ''-s'' belongs to formail and tells it to split; everything after ''enumcat'' is passed on to enumcat itself, which therefore has to be in your PATH.

Next, build a checksum list for each folder and compare the two lists:

<code bash>
cd /tmp/folder1 && md5sum email.* > md5sum.all
cd /tmp/folder2 && md5sum email.* > md5sum.all
diff /tmp/folder1/md5sum.all /tmp/folder2/md5sum.all
</code>

Collect the files unique to folder1 (!):

<code bash>
F=$( diff /tmp/folder2/md5sum.all /tmp/folder1/md5sum.all | awk '/^ [...] /my/other/folder.mbox
</code>

That's it. Please make sure that nothing (neither your mail server nor your mailing list manager) writes to the folders while this is going on. ;-)

===== Remarks =====

  * You have to use bash >= 3.0 or 3.2 for that.
  * If you use this on otherwise differing mbox mail folders, you will get a unified merged folder, but without the original chronological order reconstructed.
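
As a rough sketch of the collecting step, assuming the ''md5sum.all'' lists have been built as described under Solution: the two helper functions below are hypothetical (the names ''unique_by_checksum'' and ''merge_unique'' are made up for this example) and are not the original command. Instead of diff they compare the checksum column only, using awk, so a mail that ended up under a different number in the other folder is still recognized as present. They also assume the default text-mode ''md5sum'' output (checksum, two spaces, file name) and file names without spaces.

```shell
#!/bin/bash

# Hypothetical helper: print the file names listed in md5sum list $1
# whose checksum does not occur anywhere in md5sum list $2.
unique_by_checksum () {
    # While reading the first awk argument ($2, the other list),
    # remember every checksum. While reading the second argument
    # ($1, our own list), print the file name of each unseen checksum.
    awk 'FILENAME == ARGV[1] { seen[$1] = 1; next }
         !($1 in seen)       { print $2 }' "$2" "$1"
}

# Hypothetical helper: append every mail that exists only in split
# directory $1 (compared against split directory $2) to mbox file $3.
merge_unique () {
    unique_by_checksum "$1/md5sum.all" "$2/md5sum.all" |
    while IFS= read -r name; do
        cat "$1/$name" >> "$3"
    done
}
```

With the example paths from this page, the call would be ''merge_unique /tmp/folder1 /tmp/folder2 /my/other/folder.mbox''. As noted in the remarks, the missing mails are simply appended at the end, so the merged folder is complete but not in chronological order.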