Time to sort two million 8-byte records vs. two million 40-byte records
I was trying to get an idea of how much faster or slower sorting gets as the element size grows. This relates to low-level data structure organization, in particular to when it makes sense to index data by pointer versus by value. The L1 cache line on a modern Intel CPU is 64 bytes, and I was thinking that would make a difference. What I got wasn't quite what I had expected: apparently, it takes only about 4 times as much time to sort two million 1024-byte records as to sort two million 2-byte records.
I found this pretty amazing.
Here's my program and the test results (element size in bytes, sort time in seconds):
kostja@atlas:~$ cat foo.sh
#!/bin/sh
for el_size in 2 4 8 16 32 64 128 256 512 1024
do
	printf '%s\t' "$el_size"
	./a.out "$el_size"
done
kostja@atlas:~$ cat foo.c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/time.h>
#define EL_CNT 2000000
int EL_SIZE;
int my_cmp(const void *a, const void *b) { return memcmp(a, b, EL_SIZE); }
int usage(const char *progname)
{
	printf("usage: %s <element_size>\n", progname);
	return 1;
}
int main(int argc, char **argv)
{
	EL_SIZE = argc > 1 ? atoi(argv[1]) : 0;
	if (EL_SIZE <= 0)
		exit(usage(argv[0]));
	char *c = malloc(EL_CNT * EL_SIZE);
	if (c == NULL) {
		fprintf(stderr, "out of memory\n");
		return 1;
	}
	/* Fill the buffer with random bytes in the 'A'..'^' range. */
	int i;
	for (i = 0; i < EL_CNT * EL_SIZE; ++i)
		c[i] = (rand() % 30) + 65;
	struct timeval tv, tv1;
	gettimeofday(&tv, NULL);
	qsort(c, EL_CNT, EL_SIZE, my_cmp);
	gettimeofday(&tv1, NULL);
	/* Report elapsed wall-clock time in seconds. */
	printf("%lf\n", ((tv1.tv_sec - tv.tv_sec) * 1000000 +
			 tv1.tv_usec - tv.tv_usec) / 1000000.0);
	free(c);
	return 0;
}
kostja@atlas:~$ gcc -O3 foo.c; sh ./foo.sh
2 0.552472
4 0.502978
8 0.559762
16 0.682934
32 0.895612
64 1.203204
128 1.300302
256 1.583563
512 1.782502
1024 2.142894
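A back-of-the-envelope explanation for why the slowdown is so mild: with 30 possible byte values, the chance that two random bytes tie is 1/30, so memcmp examines about 30/29 ≈ 1.03 bytes per comparison on average. The comparison cost is therefore nearly independent of the element size, and what grows with EL_SIZE is presumably mostly the cost of moving the records around. To look at the other side of the by-pointer vs. by-value question, here is a minimal sketch of the same benchmark sorting an array of pointers into the record buffer instead of the records themselves. I haven't timed this variant, and the names ptr_cmp and idx are mine, not part of the program above:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/time.h>
#define EL_CNT 2000000
int EL_SIZE;
/* Compare the records the pointers refer to, not the pointers themselves. */
int ptr_cmp(const void *a, const void *b)
{
	return memcmp(*(char * const *)a, *(char * const *)b, EL_SIZE);
}
int main(int argc, char **argv)
{
	EL_SIZE = argc > 1 ? atoi(argv[1]) : 0;
	if (EL_SIZE <= 0)
		return 1;
	char *c = malloc(EL_CNT * EL_SIZE);
	char **idx = malloc(EL_CNT * sizeof(char *));
	if (c == NULL || idx == NULL)
		return 1;
	int i;
	for (i = 0; i < EL_CNT * EL_SIZE; ++i)
		c[i] = (rand() % 30) + 65;
	/* One pointer per record; only these pointers get swapped by qsort. */
	for (i = 0; i < EL_CNT; ++i)
		idx[i] = c + (size_t)i * EL_SIZE;
	struct timeval tv, tv1;
	gettimeofday(&tv, NULL);
	qsort(idx, EL_CNT, sizeof(char *), ptr_cmp);
	gettimeofday(&tv1, NULL);
	printf("%lf\n", ((tv1.tv_sec - tv.tv_sec) * 1000000 +
			 tv1.tv_usec - tv.tv_usec) / 1000000.0);
	free(idx);
	free(c);
	return 0;
}

Every swap is now a fixed 8-byte pointer move regardless of the record size, but every comparison pays for a dependent load into a buffer that mostly isn't in cache, which is exactly the trade-off I was trying to get a feel for.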