Entity Framework pitfalls: Include()

Published on 1 January 2014

While EF is not as fast as plain SQL, it's actually quite fast nowadays. But there are a few pitfalls to be aware of. This time: Include().

Before I explain the problem with Include(), let's look at some background. If you feel you have a good knowledge of EF and SQL, you can probably skim the first parts.

To make a long story short: Include() causes a lot of data duplication in the returned result set, which in many cases slows down your query a lot.

What can Include() do for me?

Include is a convenient way to load related collections eagerly in an EF query.

var contacts = db.ContactInformations
	.Include(x => x.ContactInformationMails)
	.Include(x => x.ContactInformationPhonenumbers)
	.AsNoTracking()
	.ToList();

That query will load all contacts with their phone numbers and emails populated in a single query. Pretty nice for the developer!

What's the catch?

For those of us who have been around since before ORMs, this should ring a warning bell. Remember how this would be done in plain SQL?

There are a few alternatives, but they basically all come back to splitting the loading into multiple queries. A horrible, but not uncommon, way to do it was to load all contacts, loop over them, and for each contact load the related phone numbers and emails. That leaves us with 2N + 1 queries (1 query to load N contacts and then 2 queries for each contact), as sketched below.
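As a rough sketch of that anti-pattern in plain ADO.NET (table names are borrowed from the model used later in this post; connectionString is assumed to exist):

// A sketch of the 2N + 1 anti-pattern in plain ADO.NET (System.Data.SqlClient).
using (var conn = new SqlConnection(connectionString))
{
    conn.Open();

    var ids = new List<int>();
    using (var cmd = new SqlCommand("SELECT ID FROM dbo.ContactInformations", conn))
    using (var reader = cmd.ExecuteReader())
        while (reader.Read())
            ids.Add(reader.GetInt32(0));          // 1 query to load N contacts

    foreach (var id in ids)
    {
        using (var cmd = new SqlCommand(
            "SELECT * FROM dbo.ContactInformationPhonenumbers WHERE ContactInformationID = @id", conn))
        {
            cmd.Parameters.AddWithValue("@id", id);           // N queries for phones
            using (var reader = cmd.ExecuteReader()) { /* read the phone rows */ }
        }
        using (var cmd = new SqlCommand(
            "SELECT * FROM dbo.ContactInformationMails WHERE ContactInformationID = @id", conn))
        {
            cmd.Parameters.AddWithValue("@id", id);           // N queries for mails
            using (var reader = cmd.ExecuteReader()) { /* read the mail rows */ }
        }
    }
    // Total: 2N + 1 round trips to the database.
}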

That's really bad for performance, so the performance-aware developer would take another approach and load the data in three separate queries, one per table, and then merge the data in memory. This is faster than the 2N + 1 alternative by orders of magnitude.

Those are the two main approaches I know about. There are some alternatives to the second one, like packing the 3 queries into a single command with multiple result sets, which might be a bit faster, but it's the same approach, just slightly optimized.

What does that have to do with anything?

EF, or any other ORM, is not magic. It will translate your query to SQL, and if the query is hard to write in an efficient way manually, it will probably not be any better when the ORM creates it.

The reason we didn't load the data in a single query before ORMs is that relational databases are not built to return hierarchical data. The result of a query is always a table. But unlike HTML, SQL doesn't support nested tables, so we cannot get a result looking like this:

Col1    Col2        Col3
Id      Number1     Mail1
        Number2     Mail2

This is why I think every ORM developer should know the underlying datastore decently well. If you do, you will at least recognize that there is a "hack" involved in getting the query above to work.

If you don't understand how the ORM does something, it's time to look under the hood. SQL Profiler or any similar tool is great for understanding what's really going on.

So how does EF load that data in one query?

Here is where SQL Profiler comes in very handy. A quick look reveals the query to be:

(I'll explain the concept below)

SELECT 
    [UnionAll1].[ID] AS [C1], 
    [UnionAll1].[ID1] AS [C2], 
    [UnionAll1].[Lastname] AS [C3], 
    [UnionAll1].[Firstname] AS [C4], 
    [UnionAll1].[Birthdate] AS [C5], 
    [UnionAll1].[LastUpdate] AS [C6], 
    [UnionAll1].[Gender] AS [C7], 
    [UnionAll1].[OnlyOwnerCanEdit] AS [C8], 
    [UnionAll1].[MemberType] AS [C9], 
    [UnionAll1].[Deleted] AS [C10], 
    [UnionAll1].[UserName] AS [C11], 
    [UnionAll1].[SICard] AS [C12], 
    [UnionAll1].[Comment] AS [C13], 
    [UnionAll1].[Address] AS [C14], 
    [UnionAll1].[C1] AS [C15], 
    [UnionAll1].[ID2] AS [C16], 
    [UnionAll1].[ContactInformationID] AS [C17], 
    [UnionAll1].[LastUpdate1] AS [C18], 
    [UnionAll1].[ShowToPublic] AS [C19], 
    [UnionAll1].[Mail] AS [C20], 
    [UnionAll1].[ReciveMailOnThis] AS [C21], 
    [UnionAll1].[MailCategory] AS [C22], 
    [UnionAll1].[C2] AS [C23], 
    [UnionAll1].[C3] AS [C24], 
    [UnionAll1].[C4] AS [C25], 
    [UnionAll1].[C5] AS [C26], 
    [UnionAll1].[C6] AS [C27], 
    [UnionAll1].[C7] AS [C28]
    FROM  (SELECT 
        CASE WHEN ([Extent2].[ID] IS NULL) THEN CAST(NULL AS int) ELSE 1 END AS [C1], 
        [Extent1].[ID] AS [ID], 
        [Extent1].[ID] AS [ID1], 
        [Extent1].[Lastname] AS [Lastname], 
        [Extent1].[Firstname] AS [Firstname], 
        [Extent1].[Birthdate] AS [Birthdate], 
        [Extent1].[LastUpdate] AS [LastUpdate], 
        [Extent1].[Gender] AS [Gender], 
        [Extent1].[OnlyOwnerCanEdit] AS [OnlyOwnerCanEdit], 
        [Extent1].[MemberType] AS [MemberType], 
        [Extent1].[Deleted] AS [Deleted], 
        [Extent1].[UserName] AS [UserName], 
        [Extent1].[SICard] AS [SICard], 
        [Extent1].[Comment] AS [Comment], 
        [Extent1].[Address] AS [Address], 
        [Extent2].[ID] AS [ID2], 
        [Extent2].[ContactInformationID] AS [ContactInformationID], 
        [Extent2].[LastUpdate] AS [LastUpdate1], 
        [Extent2].[ShowToPublic] AS [ShowToPublic], 
        [Extent2].[Mail] AS [Mail], 
        [Extent2].[ReciveMailOnThis] AS [ReciveMailOnThis], 
        [Extent2].[MailCategory] AS [MailCategory], 
        CAST(NULL AS int) AS [C2], 
        CAST(NULL AS int) AS [C3], 
        CAST(NULL AS datetime2) AS [C4], 
        CAST(NULL AS varchar(1)) AS [C5], 
        CAST(NULL AS bit) AS [C6], 
        CAST(NULL AS int) AS [C7]
        FROM  [dbo].[ContactInformations] AS [Extent1]
        LEFT OUTER JOIN [dbo].[ContactInformationMails] AS [Extent2] ON [Extent1].[ID] = [Extent2].[ContactInformationID]
    UNION ALL
        SELECT 
        2 AS [C1], 
        [Extent3].[ID] AS [ID], 
        [Extent3].[ID] AS [ID1], 
        [Extent3].[Lastname] AS [Lastname], 
        [Extent3].[Firstname] AS [Firstname], 
        [Extent3].[Birthdate] AS [Birthdate], 
        [Extent3].[LastUpdate] AS [LastUpdate], 
        [Extent3].[Gender] AS [Gender], 
        [Extent3].[OnlyOwnerCanEdit] AS [OnlyOwnerCanEdit], 
        [Extent3].[MemberType] AS [MemberType], 
        [Extent3].[Deleted] AS [Deleted], 
        [Extent3].[UserName] AS [UserName], 
        [Extent3].[SICard] AS [SICard], 
        [Extent3].[Comment] AS [Comment], 
        [Extent3].[Address] AS [Address], 
        CAST(NULL AS int) AS [C2], 
        CAST(NULL AS int) AS [C3], 
        CAST(NULL AS datetime2) AS [C4], 
        CAST(NULL AS bit) AS [C5], 
        CAST(NULL AS varchar(1)) AS [C6], 
        CAST(NULL AS bit) AS [C7], 
        CAST(NULL AS int) AS [C8], 
        [Extent4].[ID] AS [ID2], 
        [Extent4].[ContactInformationID] AS [ContactInformationID], 
        [Extent4].[LastUpdate] AS [LastUpdate1], 
        [Extent4].[PhoneNumber] AS [PhoneNumber], 
        [Extent4].[ShowToPublic] AS [ShowToPublic], 
        [Extent4].[PhoneType] AS [PhoneType]
        FROM  [dbo].[ContactInformations] AS [Extent3]
        INNER JOIN [dbo].[ContactInformationPhonenumbers] AS [Extent4] ON [Extent3].[ID] = [Extent4].[ContactInformationID]) AS [UnionAll1]
    ORDER BY [UnionAll1].[ID1] ASC, [UnionAll1].[C1] ASC

Ughhhhh! That is ugly and if you don't understand it I will not blame you.

What is going on here is that EF joins ContactInformations against ContactInformationMails and against ContactInformationPhonenumbers, and combines the two results into one large table with a UNION ALL.

Now remember that we cannot have nested tables and that the rows in the result set must be uniform, which means that a row returning a phone number must still include the columns for the email and the contact.

Here is an example of my contact data (with all its 28 columns).

Notice that we are actually loading the ContactInformation 3 times: for each number or email, the basic contact information is loaded an extra time.

This is the pitfall with Include(): it duplicates data in the result set.

So Include() does data duplication. How bad is that?

Let's look at the numbers.

But first let's briefly describe what affects the performance of an EF query. This is a simplified list so remember that there are more things in play.

  1. Translating the LINQ to SQL (in newer versions of EF this can be cached)
  2. Sending the SQL command to SQL Server
  3. Executing the query (compiling the query plan and executing it)
  4. Returning the result set
  5. Binding the data in the DataReader to entities and returning them (materializing)

Numbers 2 and 4 directly depend on the number of bytes sent over the wire. Number 5 could depend on the size of the result. I have not looked at how this is done in EF, but imagine you were going to parse the 3 rows above into one entity. Not so clear how you would do that, huh? It most likely takes a bunch of conditionals, which will reduce performance.
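To get a feel for why, here is a rough sketch (my guess, not EF's actual materializer) of what parsing the flattened result could look like. The ReadXxxColumns helpers are hypothetical; the column positions follow the generated SQL above, where the outer [C15] column carries the UNION ALL discriminator.

// A rough sketch (not EF's actual materializer) of turning the
// flattened UNION ALL rows back into entities.
var contactsById = new Dictionary<int, ContactInformation>();

while (reader.Read())
{
    var id = reader.GetInt32(0);                 // [C1] = the contact's ID

    ContactInformation contact;
    if (!contactsById.TryGetValue(id, out contact))
    {
        contact = ReadContactColumns(reader);    // hypothetical helper, columns [C2]..[C14]
        contactsById.Add(id, contact);
    }
    // Note: every extra row for this contact still carries (and wastes)
    // all of the contact's columns.

    // [C15] is the UNION ALL discriminator: 1 (or NULL) = mail branch, 2 = phone branch.
    var branch = reader.IsDBNull(14) ? 0 : reader.GetInt32(14);
    if (branch == 1)
        contact.ContactInformationMails.Add(ReadMailColumns(reader));            // hypothetical
    else if (branch == 2)
        contact.ContactInformationPhonenumbers.Add(ReadPhoneColumns(reader));    // hypothetical
    // branch == 0: the contact simply had no mails on this row.
}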

The test data set and queries

That's the model. The dataset is quite small: 692 contact informations, 540 emails and 897 phone numbers.

The code for the test is below. There are three methods. 

  • The first one is a simple Include().
  • The second loads all 3 collections separately and merges them in memory.
  • The last one also loads all 3 collections separately, but does it in a single command with multiple result sets.

NOTE: This example contains no filtering. For the tests where I add filtering, I simply add a Where clause to each query.
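For example, the gender-filtered variant of the queries could look like this (the gender variable and the navigation property on the child entity are assumptions on my part, not from the measured code):

// Hypothetical filtered variants: the same criteria is added to each query.
var contacts = db.ContactInformations
    .Where(x => x.Gender == gender)
    .AsNoTracking()
    .ToList();

// In the multi-query approaches the filter is repeated for the child tables
// too, e.g. via an (assumed) ContactInformation navigation property:
var phones = db.ContactInformationPhonenumbers
    .Where(x => x.ContactInformation.Gender == gender)
    .AsNoTracking()
    .ToLookup(x => x.ContactInformationID);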

private static void TestContacts()
{
    Benchmark("Load with include", () =>
    {
        using (var db = new gmokContext())
        {
            var contacts = db.ContactInformations
                .Include(x => x.ContactInformationMails)
                .Include(x => x.ContactInformationPhonenumbers)
                .AsNoTracking()
                .ToList();
        }
    });

    Benchmark("Load separate and fix", () =>
    {
        using (var db = new gmokContext())
        {
            var contacts = db.ContactInformations.AsNoTracking().ToList();
            var phones = db.ContactInformationPhonenumbers.AsNoTracking().ToLookup(x => x.ContactInformationID);
            var mails = db.ContactInformationMails.AsNoTracking().ToLookup(x => x.ContactInformationID);

            foreach (var c in contacts)
            {
                c.ContactInformationMails = mails.Contains(c.ID) ? mails[c.ID].ToList() : new List<ContactInformationMail>();
                c.ContactInformationPhonenumbers = phones.Contains(c.ID) ? phones[c.ID].ToList() : new List<ContactInformationPhonenumber>();
            }
        }
    });

    Benchmark("Load separate as batch", () =>
    {
        using (var db = new gmokContext())
        {
            var cmd = db.Database.Connection.CreateCommand();

            Benchmark("generate sql", () =>
            {
                var csql = db.ContactInformations.ToString();
                var psql = db.ContactInformationPhonenumbers.ToString();
                var msql = db.ContactInformationMails.ToString();

                cmd.CommandText =
                    csql + ";" + Environment.NewLine + Environment.NewLine +
                    psql + ";" + Environment.NewLine + Environment.NewLine +
                    msql + ";";

            });

            db.Database.Connection.Open();
            // Run the batched command
            var reader = cmd.ExecuteReader();

            var contacts = ((IObjectContextAdapter)db)
                .ObjectContext
                .Translate<ContactInformation>(reader, "ContactInformations", MergeOption.NoTracking).ToList();
            reader.NextResult();

            var phones = ((IObjectContextAdapter)db)
                .ObjectContext
                .Translate<ContactInformationPhonenumber>(reader, "ContactInformationPhonenumbers", MergeOption.NoTracking).ToLookup(c => c.ContactInformationID);
            reader.NextResult();

            var mails = ((IObjectContextAdapter)db)
                .ObjectContext
                .Translate<ContactInformationMail>(reader, "ContactInformationMails", MergeOption.NoTracking).ToLookup(c => c.ContactInformationID);

            foreach (var c in contacts)
            {
                c.ContactInformationMails = mails.Contains(c.ID) ? mails[c.ID].ToList() : new List<ContactInformationMail>();
                c.ContactInformationPhonenumbers = phones.Contains(c.ID) ? phones[c.ID].ToList() : new List<ContactInformationPhonenumber>();
            }
        }
    });
}
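The Benchmark helper isn't shown above; a minimal Stopwatch-based sketch matching how it is called could look like this:

// Not from the original code: a minimal sketch of the Benchmark helper,
// matching how it is called in TestContacts().
private static void Benchmark(string name, Action action)
{
    var sw = System.Diagnostics.Stopwatch.StartNew();
    action();
    sw.Stop();
    Console.WriteLine("{0}: {1}ms", name, sw.ElapsedMilliseconds);
}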

The overview

I did my small tests on two different machines. One is my desktop, where the SQL Server is located (i7-3820, 64GB RAM, 240GB Intel SSD 520 series).

The other is my ultrabook, an ASUS Zenbook Prime UX31A, which is connected over WLAN and has about 1ms ping to the desktop.

Loading everything gives 692 contacts, filtering by gender gives 388, and filtering by name gives 5.

Results on WLAN

Method / iteration    Load all    Filtered by gender    Filtered by name
Include / 1           287ms       68ms                  42ms
Include / 2           65ms        45ms                  9ms
Include / 3           59ms        36ms                  8ms
3 queries / 1         81ms        66ms                  46ms
3 queries / 2         27ms        24ms                  10ms
3 queries / 3         26ms        20ms                  10ms
MultiQuery / 1        75ms        23ms                  25ms
MultiQuery / 2        22ms        19ms                  8ms
MultiQuery / 3        27ms        15ms                  9ms

Results on localhost

Method / iteration    Load all    Filtered by gender    Filtered by name
Include / 1           223ms       49ms                  20ms
Include / 2           44ms        25ms                  5ms
Include / 3           37ms        22ms                  4ms
3 queries / 1         43ms        44ms                  30ms
3 queries / 2         6ms         5ms                   6ms
3 queries / 3         5ms         5ms                   5ms
MultiQuery / 1        54ms        9ms                   12ms
MultiQuery / 2        6ms         4ms                   5ms
MultiQuery / 3        9ms         4ms                   5ms

As you can see, when the dataset is large and the query is cheap, Include is much slower.

Include closes the gap as the amount of returned data decreases and the filter in the query gets more expensive. (Include evaluates the filter once and then joins, while the other methods filter on the same criteria in each query.)

Going deeper

Let's look in Management Studio at what the client statistics say about the queries.

Trial 3 is the 3 separate queries approach and Trial 4 is with includes.

As you can see, the data returned almost doubles when using Include, even though the actual number of rows is lower.

Summary

Hopefully this will make it clear what Include() actually does and how that can impact performance.

One thing we should note is that the dataset I used for the example is very far from a worst-case scenario. There are 2 major factors that help Include on this dataset.

  1. On average each contact has about one email and one phone number, which means we get about 2 rows per contact. That isn't much duplication.

  2. Each ContactInformation only contains "small" data. Remember that the ContactInformation is duplicated for each phone and email.

A situation where I expect it to be much worse is loading a BlogPost with Include() on Tags. That means we load the BlogPost once for each tag. Remember that the BlogPost probably has a long text, which would be duplicated.
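As a hypothetical illustration (this model is not part of the measured dataset):

// Hypothetical worst case: the potentially long Text column of the
// post comes back once per tag in the flattened result set.
var posts = db.BlogPosts
    .Include(x => x.Tags)
    .AsNoTracking()
    .ToList();
// A post with 10 tags returns its full Text 10 times over the wire.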

I did some tests on data shaped like that, which led to this tweet:

Wow, using include() instead of separate queries on this dataset increase loadtime from ~500ms to ~3700ms on wlan. #ef

— Mikael Eliasson (@MikaelEliasson) December 31, 2013

Also, it's nice to see that with small datasets Include is actually on par with the other methods. This means that if you only return a few rows and the root entity doesn't contain heavy data, you can keep using Include, which is much faster to code.

It's also interesting to see how little I gain by batching the calls into one command compared to 3 separate queries. With connection pooling, making a simple query is very cheap, which is good to know because multiple simple queries might be much more performant than one complex query.

So how do you load the data efficiently when Include() is slow?

I think the methods I used in my example are the best way. If you have any other ways, I would be happy to hear about them. I'm thinking about adding some utility functions to https://github.com/MikaelEliasson/EntityFramework.Utilities to make the fixup easier, but I'll have to see what I can come up with.

I expect a stored procedure with multiple result sets would be the fastest, but I haven't tried it. It's only another variation on the technique I used anyway, so I don't expect much better performance, though a small difference is likely.
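A sketch of what that could look like, reusing the Translate technique from the test code above (the stored procedure name is hypothetical):

// Sketch only: the same Translate approach as in "Load separate as batch",
// but calling a hypothetical stored procedure returning all three result sets.
var cmd = db.Database.Connection.CreateCommand();
cmd.CommandText = "GetContactsWithDetails";   // hypothetical sproc
cmd.CommandType = System.Data.CommandType.StoredProcedure;

db.Database.Connection.Open();
using (var reader = cmd.ExecuteReader())
{
    var contacts = ((IObjectContextAdapter)db).ObjectContext
        .Translate<ContactInformation>(reader, "ContactInformations", MergeOption.NoTracking)
        .ToList();
    // ...then reader.NextResult() and the same fixup as before.
}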

A few words about how to think about performance

I'm a strong believer in constantly measuring performance, even from the first day of development. Some people will call that premature optimization, but I don't agree. I don't optimize everything, but always measuring allows me to make an informed decision. From the second I first run a query, I know how fast it is (in development, though).

The example above is actually from one of my projects from several years ago (I used Include and haven't changed it yet). It's on an old version of EF, and query time including parsing is about 250ms in the best case on the production server.

Nowadays I know from experience that a simple query takes a couple of ms to run, most likely 5-15ms for the data sizes and server in question here. It's very easy to see that I would benefit from doing it as 3 queries and end up at about 50ms in the worst case on that version of EF.

For me, "premature optimization" is not about when in the development cycle you optimize, but rather about whether the choice is informed or not. If you expect something to be a problem later on, there is no better time to fix the code than when it's fresh in your memory.

If you have any comments or questions, mention @MikaelEliasson on Twitter.
