Working with Execution Plans

Interesting stuff.  With the student information system that I’m developing, there’s one query that’s central to all users, it’s called vwStudentsFiltered.  Since I need to enforce row filtering based on what district(s) a user has access to, this became the core.  The original view looked something like this:

SELECT st.*
FROM Students AS st
JOIN SeekerUsers AS u
   ON u.SeekerUserDBLogin = SYSTEM_USER
   AND st.StatusCode in ('0', '1', '9')
   AND (u.SeekerUserRole = 1 --superuser sees all, otherwise filter
      OR st.DistrictNum IN (SELECT td.SeekerUserDistrict
         FROM SeekerUsersDistricts AS td
         WHERE u.SeekerUserid = td.SeekerUserid)
);

Obviously I wasn’t doing a Select *, but it’s a long field list and irrelevant to the discussion.

Student Status Codes of 0, 1, and 9 indicated an active student.  District users were only allowed to see students that were not assigned to a district, or students assigned to their district.  And SeekerUserRole 1 is a superuser who is allowed to see everyone, regardless of status code.  The problem is that if you looked at it through SSMS Tools Pack by Mladen Prajdić, it would report that the query had an estimated cost of 200%. (SSMS Tools Pack is pretty neat, I recommend checking it out)  The SeekerUsersDistricts table consists of two fields: SeekerUserID and a district number.  For district users, they’d have one record in this table, teachers could have more than one district number.

The problem is that the u.SeekerUserRole = 1 OR st.DistrictNum IN confuse the optimizer: it could execute either path, and you’ll only know at runtime.  Kind of hard to optimize for something like that.  Turning on client statistics gives us the following:

Table 'Worktable'. Scan count 0, logical reads 0, physical reads 0 (all other reads are zero)
Table 'Students'. Scan count 1, logical reads 23, physical reads 0 ...
Table 'SeekerUsers'. Scan count 1, logical reads 2, physical reads 0 ...

The optimizer added a worktable.  That can’t be good for performance.

There are three types of users: superusers, district users, and teachers.  The front-end app gives them different menus to control what they can do and what data they can see.  In this post I’ll be looking at how to improve the view but only for the superuser.

Here’s the code that I came up with:

--Variation 1
SELECT st.*
FROM Students AS st
WHERE 1 = (SELECT 1
   FROM SeekerUsers AS u
   WHERE u.SeekerUserDBLogin = SYSTEM_USER
     AND u.SeekerUserRole = 1);
--Variation 2
SELECT st.*
FROM Students AS st
JOIN SeekerUsers u
   ON u.SeekerUserDBLogin = SYSTEM_USER
   AND u.SeekerUserRole = 1;
--Variation 3
SELECT st.*
FROM Students AS st
WHERE EXISTS (SELECT 1
   FROM SeekerUsers AS u
   WHERE u.SeekerUserDBLogin = SYSTEM_USER
      AND u.SeekerUserRole = 1);

Running all three in the same query window shows each result set having the same number of rows returned, 939 students.  So far, so good.

Variation 3 made sense to me.  There are no common fields between the Student table and the SeekerUsers table, so a join, on the surface, doesn’t make sense.  So why not a Where Exists?  The devil, as they say, is in the details.  The most glaring thing to show in the graphic execution plan was an operator that I had never seen before: a Row Count Spool (Lazy Spool) with a cost of 78%.  With all three queries running together, Variation 3 had a cost relative to the batch of 70%, the other two each being 15%.  The Row Count Spool was doing rebinds and rewinds, which are not good.  The estimated number of rows and the estimated row size on that operator were farcical: it estimated one row and returned 939 rows (no, updating statistics and reindexing did nothing to improve that mismatch).  On top of that, the entire estimated subtree cost was 0.12, the other two variations were each 0.026.

So variation 3 is out.  What’s the difference between 1 and 2?  The graphic plans are identical, as are the numbers on every operator, but there’s a subtle difference: the cache plan size for #2 is 96 kb, #1 is 88 kb.  Kind of trivial, but this plan is going to be called regularly, so it’ll live in the cache pretty much forever.  At this point there’s no way to forecast how much cache memory pressure there will be, but 8k is 8k, so since every statistic between the two variations is identical, I’ll go with #1.  Aesthetically, I personally prefer #2, but I can live with #1.

Here’s the execution plans:

queryplan

Yes, I’m getting scans on my Students base table.  Right now the queries are returning 99% of students, so a seek doesn’t improve things since a covering index would require every field in the table.  As the number of students grow with diverging status codes, then indexing might come in to play.

Data Cleanup

I’m working on a project to randomize name and address data for a project.  Basically I want to be able to take live data and do a one-way (non-reversible) scramble for showing to the public and for training purposes.  I’ve downloaded a lot of records from GenerateData.com, a site that I really like for the purpose, but it’s a little too ‘international’ for my needs.  So I’ve grabbed data from a number of places.

One places that I took was my iPhone contact list.  I generated a PDF of all records with just the name and address.  All I wanted was street addresses, but I very quickly ran in to a major problem after I did a Select All and pasted it in to Excel for cleanup before loading it in to a database.

The data looked like this:

Zeffiro's Pizzaria                                                    135 N. Water St

The problem was that there was an unpredictable and inconsistent number of spaces between the company name and the address.  Incidentally, if you ever find yourself in Las Cruces, NM, Zeffiro’s is fantastic pizza from a wood-fired oven.  And yes, that’s their address.

I figured out a simple Excel formula to handle the extraction:

=TRIM(RIGHT(<datacolumn>,50))

It wasn’t 100% effective, but it was pretty darn good for extracting just that part of the address.  Grab the rightmost X columns, and the TRIM() function removes both leading and trailing spaces, and Robert is your parent’s brother.

The process was a little more complicated than that, but ultimately I ended up with what I needed.

Good luck if you have a similar need!