Mikael Eliasson - Linq and the cost of skip

When using Linq you need to be aware of some performance pitfalls. Here we look at Skip() and it's hidden cost.

I was chunking up some work using this algorithm:

for (int i = 0; i < ordered.Length; i += chunkSize)
{
        //Create file
        foreach (var item in ordered.Skip(i).Take(chunkSize))
        {
            //Write line to file
        }

        //Run sql to load file into mysql
}

It got me thinking if that Skip really was a good idea. In this case my collection was a List which can access an element at any index in O(1) (constant time) but Linq is an abstraction over IEnumerable and has to handle that the only method of interest on IEnumerable is GetEnumerator and with the Enumerator we can only iterate by doing .MoveNext() which means to Skip(1000) we need to call MoveNext 1000 times.

Now it could be that Linq had someway optimized Skip for certain collections like List and Array but the only way to find out is to measure so that's what we do.

class Program
{

    private const int chunkSize = 10000;
    static void Main(string[] args)
    {
        var watch = new Stopwatch();

        var collection = Enumerable.Range(0, 9999999).ToList();
        var collectionAsArray = collection.ToArray();
        var collectionAsHashSet = new HashSet<int>(collection);

        watch.Start();
        long items = 0;
        for (int i = 0; i < collection.Count;)
        {
            var limit = Math.Min(i + chunkSize, collection.Count);
            for (; i < limit; i++)
            {
                items += collection[i]; 
            }
        }
        watch.Stop();
        Console.WriteLine("Loops {0} | {1}", watch.ElapsedMilliseconds, items);

        watch.Reset();

        Console.WriteLine("Test List");
        RunSkipTest(collection, watch);

        Console.WriteLine("Test Array");
        RunSkipTest(collectionAsArray, watch);

        Console.WriteLine("Test HashSet");
        RunSkipTest(collectionAsHashSet, watch);

    }

    private static void RunSkipTest(ICollection<int> collection, Stopwatch watch)
    {
        watch.Start();
        long items = 0;
        for (int i = 0; i < collection.Count; i += chunkSize)
        {
            foreach (var item in collection.Skip(i).Take(chunkSize))
            {
                items += item;
            }
        }
        watch.Stop();
        Console.WriteLine("Skips {0} | {1}", watch.ElapsedMilliseconds, items);
        watch.Reset();
    }
}

Some explanations. The for loops in Main is what I would use before Linq came around. *items *is really only a way to confirm that the algorithms does the same thing and to ensure the compiler doesn't remove the loop.

The result:

 **(The format is: xxx ms | verfication value)**

 Test with 99999

Loops 1 | 4999850001  
Test List  
Skips 18 | 4999850001  
Test Array  
Skips 10 | 4999850001  
Test HashSet  
Skips 13 | 4999850001

Test with 300000 items

Loops 3 | 44999850000  
Test List  
Skips 63 | 44999850000  
Test Array  
Skips 56 | 44999850000  
Test HashSet  
Skips 89 | 44999850000

Test with 9999999 items

Loops 122 | 49999985000001  
Test List  
Skips 51884 | 49999985000001  
Test Array  
Skips 50442 | 49999985000001  
Test HashSet  
Skips 85992 | 49999985000001

As you can see there is a noticable difference once the collection grows. The conclusion is that even though it would be possible Linq is not optimizing Skip for any collection. If you need performance you need to do the indexing yourself (unless the collection is small where it doesn't matter).