Merge Mailman - or other - mbox Archives
Situation
Mailman archives list email in an mbox-formatted mailbox. New emails are simply appended to these mbox files, and there are several situations in which two copies of such a file can fork at some point in the past:
- two Mailservers, migration from one to another
- two Mailservers, one of them crashed, and one restored from backup
- other cases of desynchronized mbox folders
Problem
An mbox folder is just one (sometimes large) file containing several emails one after another.
So it is not easy to tell which emails are present in a folder, and it is even harder to put a missing email into a folder at a defined position.
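To get a feel for the format: each message in an mbox file starts with a separator line beginning with "From " (the word From followed by a space), so counting those lines gives the number of messages. A minimal sketch on a made-up two-message folder in a temporary directory (the file name demo.mbox and its contents are invented for illustration):

```shell
#!/bin/bash
# Build a tiny throwaway mbox with two messages (contents are invented).
tmp=$(mktemp -d)
cat > "$tmp/demo.mbox" << 'EOF'
From alice@example.org Mon Jan  1 10:00:00 2024
Subject: first

Hello.

From bob@example.org Mon Jan  1 11:00:00 2024
Subject: second

>From here on: body lines that start with "From " are usually quoted like this.

EOF

# Every message starts with a "From " separator line, so counting those
# lines gives the number of messages in the folder.
count=$(grep -c '^From ' "$tmp/demo.mbox")
echo "messages: $count"
```

The quoted ">From" line in the second body is not counted, because it does not start with "From ".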
Solution
So the first step is to split the folder into individual emails. This is where formail (it comes with procmail) comes in handy: with -s it splits an mbox folder and calls a given tool once per mail, feeding the mail to the tool's stdin.
Now we need a tool that simply catches this input and writes it to files with enumerated file names. As I need this functionality from time to time, I wrote one and called it 'enumcat':
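To illustrate what the splitting step amounts to, here is a rough stand-in that uses awk instead of formail. It is only a sketch on invented data: it starts a new numbered file at every line beginning with "From " and, unlike formail, does no further checking of the separator line. The path prefix and the mail contents are assumptions made for the example.

```shell
#!/bin/bash
# Throwaway mbox with two invented messages.
tmp=$(mktemp -d)
printf '%s\n' \
    'From alice@example.org Mon Jan  1 10:00:00 2024' \
    'Subject: first' '' 'Hello.' '' \
    'From bob@example.org Mon Jan  1 11:00:00 2024' \
    'Subject: second' '' 'Hi again.' '' > "$tmp/demo.mbox"

# Open a new output file whenever a "From " separator line is seen,
# then copy every line into the currently open file.
awk -v pfx="$tmp/email." '
    /^From / { if (out) close(out); out = sprintf("%s%06d", pfx, ++n) }
    out      { print > out }
' "$tmp/demo.mbox"

# Count the files that were written.
files=( "$tmp"/email.* )
nfiles=${#files[@]}
echo "split into $nfiles files"
```

For real archives formail remains the better choice, since it knows more about the mbox format than this one-liner does.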
#!/bin/bash
# enumcat: write stdin to the next numbered file (prefix + zero-padded counter)

pfx="enumcat."
seqf=".seq"
c=0
width=6

usage () {
    cat << EOF
$0 [ -p prefix ] [ -s sequencefile ] [ -c count ] [ -w width ]
  -p prefix        file name prefix, may include path. default: 'enumcat.'
  -s sequencefile  sequence file, may include path. default: '.seq'
  -c count         counter start. default: 0
  -w width         length of numbers. default: 6
EOF
}

doopt () {
    local x
    while [ $# -gt 0 ]; do
        x=$1; shift
        case "$x" in
            -p) pfx="$1";   shift ;;
            -s) seqf="$1";  shift ;;
            -c) c="$1";     shift ;;
            -w) width="$1"; shift ;;   # was assigned to an unused variable 'w'
            *)  usage; exit 1 ;;
        esac
    done
}

doopt "$@"

# Continue from the stored counter if a sequence file already exists
[ -e "$seqf" ] && c=$(< "$seqf")
c=$(( c + 1 ))
echo "$c" > "$seqf"

# Copy the mail from stdin into the next numbered file
cat > "$pfx$( printf "%0${width}d" "$c" )"
Now we can split the mbox mailfolder. You have to do this on both folders.
mkdir /tmp/folder1
formail -s enumcat -s /tmp/folder1/.seq -p /tmp/folder1/email. < /my/original/folder.mbox

mkdir /tmp/folder2
formail -s enumcat -s /tmp/folder2/.seq -p /tmp/folder2/email. < /my/other/folder.mbox
Next, compute a checksum for every mail in both folders and compare the two lists:
cd /tmp/folder1 && md5sum email.* > md5sum.all
cd /tmp/folder2 && md5sum email.* > md5sum.all
diff /tmp/folder1/md5sum.all /tmp/folder2/md5sum.all
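Collect the files unique to folder1 (!). diff marks lines that occur only in its first argument with '<', so folder1's list has to come first:
F=$( diff /tmp/folder1/md5sum.all /tmp/folder2/md5sum.all | awk '/^</ { print $3 }' )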
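One caveat: every line in md5sum.all holds both the checksum and the file name, so diff also reports mails that exist in both folders but ended up under different numbers. If that matters for your archives, comparing by checksum alone is one possible alternative. The following is a sketch on invented data and not part of the procedure described here; it uses awk in place of diff, and the directory layout and mail contents are made up:

```shell
#!/bin/bash
tmp=$(mktemp -d)
mkdir "$tmp/folder1" "$tmp/folder2"

# folder1 holds mails A, X, B; folder2 holds only A and B.
# Mail B is the same in both folders but carries different numbers.
echo "mail A" > "$tmp/folder1/email.000001"
echo "mail X" > "$tmp/folder1/email.000002"
echo "mail B" > "$tmp/folder1/email.000003"
echo "mail A" > "$tmp/folder2/email.000001"
echo "mail B" > "$tmp/folder2/email.000002"

( cd "$tmp/folder1" && md5sum email.* > md5sum.all )
( cd "$tmp/folder2" && md5sum email.* > md5sum.all )

# First file: remember every checksum present in folder2.
# Second file: print folder1 file names whose checksum was never seen.
F=$(awk 'NR == FNR { seen[$1]; next } !($1 in seen) { print $2 }' \
        "$tmp/folder2/md5sum.all" "$tmp/folder1/md5sum.all")
echo "$F"
```

Here only the file holding mail X is reported, whereas a line-by-line diff would also flag mail B because of its shifted number.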
The files are numbered, and the numbering reflects the order in which the mails arrived. The files in $F should therefore sort before the files with the same numbers in the other folder. To achieve that, we go one number back and generate file names that sort between that number and the next one. The shell's sorting of file names will help us here:
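cd /tmp/folder1

FIRSTFILE=$( head -1 <<< "$F" )
FIRSTNUM=${FIRSTFILE#*.}                 # numeric part of the first file name
FIRSTBASE=${FIRSTFILE%$FIRSTNUM}         # prefix, including the dot
FLEN=${#FIRSTNUM}                        # width of the number
FCOUNT=$(( $( wc -l <<< "$F" ) ))        # number of files to rename
FCLEN=${#FCOUNT}                         # digits needed for the sub-counter
# 10# forces base 10, so leading zeros are not read as octal
PREVNUM=$( printf "%0${FLEN}d" $(( 10#$FIRSTNUM - 1 )) )
CNT=0
FNEW=""
for i in $F ; do
    # previous number plus a sub-counter: sorts right after the previous file
    NFILE=${FIRSTBASE}${PREVNUM}$( printf "%0${FCLEN}d" "$CNT" )
    mv "$i" "$NFILE"
    FNEW="$FNEW $NFILE"
    CNT=$(( CNT + 1 ))
done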
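To see why the generated names land in the right place, here is the same name calculation on invented input, without touching any files. Three files starting at email.000124 are assumed; the neighbouring names email.000123 and email.000124 stand for files already present in the other folder.

```shell
#!/bin/bash
# Invented example input: three files unique to folder1.
F=$'email.000124\nemail.000125\nemail.000126'

FIRSTFILE=$(head -1 <<< "$F")
FIRSTNUM=${FIRSTFILE#*.}
FIRSTBASE=${FIRSTFILE%$FIRSTNUM}
FCOUNT=$(( $(wc -l <<< "$F") ))
FCLEN=${#FCOUNT}
PREVNUM=$(printf "%0${#FIRSTNUM}d" $(( 10#$FIRSTNUM - 1 )))

# Generate the new names only; nothing is renamed in this sketch.
names=()
CNT=0
for i in $F; do
    names+=( "${FIRSTBASE}${PREVNUM}$(printf "%0${FCLEN}d" "$CNT")" )
    CNT=$(( CNT + 1 ))
done

# Sort the new names together with their future neighbours.
sorted=$(printf '%s\n' email.000123 email.000124 "${names[@]}" | LC_ALL=C sort)
echo "$sorted"
# prints:
#   email.000123
#   email.0001230
#   email.0001231
#   email.0001232
#   email.000124
```

The new names extend the previous number by a sub-counter, so they sort after email.000123 and before email.000124.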
$FNEW now contains the new names of the files to be moved to /tmp/folder2 (-i just to be really sure nothing gets overwritten):
mv -i $FNEW /tmp/folder2/
Now glue it together:
mv /my/other/folder.mbox /my/other/folder.mbox.BAK
cat /tmp/folder2/email.* > /my/other/folder.mbox
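A quick sanity check after gluing: the number of "From " separator lines in the new mbox should equal the number of split files. This is sketched here on throwaway data; the paths and mail contents are invented, and on real data you would point the two counts at /tmp/folder2 and the new mbox file instead.

```shell
#!/bin/bash
tmp=$(mktemp -d)
mkdir "$tmp/folder2"

# Three invented one-message files, one of them with an inserted name.
for n in 000001 0000010 000002; do
    printf 'From list@example.org Mon Jan  1 10:00:00 2024\nSubject: mail %s\n\nbody\n\n' "$n" \
        > "$tmp/folder2/email.$n"
done

# Glue the files together, as in the step above.
cat "$tmp/folder2"/email.* > "$tmp/folder.mbox"

# Compare the number of files with the number of separator lines.
files=( "$tmp/folder2"/email.* )
nfiles=${#files[@]}
nmails=$(grep -c '^From ' "$tmp/folder.mbox")
if [ "$nfiles" -eq "$nmails" ]; then
    echo "OK: $nmails mails"
else
    echo "MISMATCH: $nfiles files, $nmails mails"
fi
```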
That's it. Make sure that nothing (neither your mail server nor your mailing list manager) writes to the folders while this is going on.
Remarks
- You need bash >= 3.0, or possibly >= 3.2, for this.
- If you use this on mbox folders that differ in other ways, you still get a single merged folder, but the original chronological order is not reconstructed.