Parsing First and Last Names out of one field

I don’t know if you do a lot of string parsing, like splitting a First/Middle/Last name out of one field. It’s something I do fairly regularly. I got a spreadsheet of kids that I needed to add to my system, and it had their full names in one field, so I had to break them apart. Shouldn’t be a big deal.

Famous last words.

I actually couldn’t do it in Access, which was a first for me and got my knickers in quite a twist. Normally I’d do an update query with an Update From, but it didn’t work. I thought maybe the space between the first and last name was actually something weird in hex, but it wasn’t. So I copied the data from Access back into Excel (it was fewer than 500 records, and only eight fields or so). The Excel formula is as follows:

=LEFT(B2,FIND(" ",B2))

And that should give you the first name. But it didn’t. It turned out that the field had a leading space! So I used the following formula:

=LEFT(TRIM(B2),FIND(" ",TRIM(B2)))

And it worked. Then I had a clever thought for getting the last name:

=TRIM(RIGHT(B2,LEN(B2)-LEN(I2)))

(I2 being the first name.) Subtract the length of the first name from the length of the full string to get the length of the last name, then RIGHT() gives you the rest.

I was mildly proud of myself. After getting the data back into an Access table in preparation for uploading it to SQL Server, I went ahead and trimmed leading and trailing spaces.

It’s not perfect. If someone has a middle name, or a non-hyphenated compound last name, it’ll all get plugged into the last name and you’ll have to go back and do some manual cleanup. But it’ll get you most of the way there.
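
For what it’s worth, once the data lands in SQL Server the same split can be done in T-SQL. Here’s a rough sketch; the table and column names (dbo.ImportedKids, FullName) are made up, so adjust them to your own schema:

--same idea as the Excel formulas: LTRIM handles the leading space,
--CHARINDEX finds the first space, LEFT and SUBSTRING split around it
SELECT FullName,
    LEFT(LTRIM(FullName), CHARINDEX(' ', LTRIM(FullName)) - 1) AS FirstName,
    LTRIM(SUBSTRING(LTRIM(FullName), CHARINDEX(' ', LTRIM(FullName)) + 1, LEN(FullName))) AS LastName
FROM dbo.ImportedKids
WHERE CHARINDEX(' ', LTRIM(FullName)) > 0;  --skip single-word names so LEFT doesn't get a negative length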


How I Do SQL Server Database Maintenance

(One of the reasons for this blog is for me to document my processes for myself.  These are the ways that I do certain things.  In your career as a DBA, if that is your job, you’ll develop your own preferred techniques.  If these are useful for you, bonus!  Regardless, it gives me a source code repository, so if I go somewhere to do some consulting and help them get their system maintaining itself properly, I can pull up posts like this and get them going quickly.)


The most critical thing as a SQL Server DBA is to ensure that your databases can be restored in the event of the loss of a server for whatever reason: disk crash, fire in the server room, tribble invasion, whatever.  To do this, not only do you have to back up your databases, you also have to test restores!  Create a database and restore the backups of your production DB to it.  It’s the safest way to make sure that everything works.  This test restore can be automated to run every night, but that’s outside the scope of what I want to talk about right now.

There are lots of places where problems can creep in; this is just one part of how you’ll need to monitor your systems.  This is how I’ve done things for a number of years, and thus far it has served me well.

First and foremost, backups.  Your requirements will vary depending on your Recovery Point Objectives (RPO, how much data loss is acceptable) and Recovery Time Objectives (RTO, how long before the system is available).  If your normal business hours are Monday to Friday 8-5 with your largest database just a few gigabytes, you’ll have radically different requirements than someone with databases in the hundreds or thousands of gigabytes that require 24/7 uptime.

We’ll demonstrate a simple installation.

When your operation is 8-5 Mon-Fri, you have a huge maintenance window to work with, so there’s no reason not to get everything done.

My personal method is to do my maintenance tasks at night.  These include DBCC CheckDB, Update Statistics, and checking the suspect pages table.  Note that I DO NOT USE MAINTENANCE PLANS.  Gail Shaw recently described them as brain dead, doing far more than what is needed.  I particularly don’t like the minuscule amount of information that they report, so you’re never clear on what was actually done.  So I do everything through SQL jobs.  And if I have a SQL Express installation, since everything is based on .CMD and .SQL files, it’s not a big deal to run everything through the Windows Task Scheduler.
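
(As a preview of the statistics piece, which I’ll come back to later, a job step for it can be a one-liner in the same sp_MSforeachdb style as the DBCC step further down.  This is just a sketch, not the exact step from my jobs:)

use master
go
--run sp_updatestats in every database, same wildcard idea as the DBCC job
exec sp_MSforeachdb @command1="print getdate() print '? sp_updatestats' USE [?]; EXEC sp_updatestats;"
go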

Since my systems normally see no activity at night, that’s when I do my maintenance.  I do a full backup of each database, followed by a transaction log backup.  Those will be caught by the server backup later that night and written to tape, which will then be moved to another building the next work day.  I do differential database backups at noon, and transaction log backups throughout the day, typically every 10 minutes.  As a rule you don’t want to start multiple backups at the exact same time, so stagger the start times slightly.  If you start the transaction log backups at 7:30 to capture people who come in early, fine.  I have the first transaction log backup initialize the backup file, and the appending t-log backups then begin at 7:40.  Have the noon differential fire at 12:02pm.  Stop the t-log backups at 6pm and start your nightly maintenance at 7pm.  You’ll develop a feel for how long each task takes based on the size of the databases.
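
To make that concrete, the commands inside those jobs are just plain BACKUP statements.  Something like the following, with a made-up database name and backup paths:

--nightly full backup
BACKUP DATABASE MyAppDB
    TO DISK = 'D:\Backups\MyAppDB_Full.bak'
    WITH INIT, CHECKSUM;

--first t-log backup of the morning initializes the backup file...
BACKUP LOG MyAppDB
    TO DISK = 'D:\Backups\MyAppDB_Log.trn'
    WITH INIT, CHECKSUM;

--...and the rest of the day's t-log backups append to it
BACKUP LOG MyAppDB
    TO DISK = 'D:\Backups\MyAppDB_Log.trn'
    WITH NOINIT, CHECKSUM;

--noon differential
BACKUP DATABASE MyAppDB
    TO DISK = 'D:\Backups\MyAppDB_Diff.bak'
    WITH DIFFERENTIAL, INIT, CHECKSUM;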

On to the job specifics.

My first step calls a file called NightlyDBCC.cmd, which is this:

osql -S"[servername]" -E -id:\dbccs\nightlydbcc.sql -od:\dbccs\dbccresult.txt -w200
osql -S"[servername]" -E -id:\dbccs\nightlysuspectpages.sql -od:\dbccs\suspectpagesresult.txt -w200

This .CMD file is two lines, each starting with osql -S.

OSQL is a SQL Server command line tool that lets you run T-SQL commands and script files.  The switches that I use (note that they are case sensitive) are:

  • -S: the name of your SQL Server
  • -E: use the current user’s Windows credentials for authentication
  • -i: the input file that contains the commands to be run
  • -o: where to write the output
  • -w: the column width of the output

Like most Windows utilities, you can get a full list of options by typing osql -? in a command prompt window.

The nightlydbcc.sql file contains the following:

use master
go
exec sp_MSforeachdb @command1="print getdate() print '? checkdb' DBCC CHECKDB(?)"
go

The stored procedure, sp_MSforeachdb, is an undocumented system stored procedure that runs a command string against every database on a server.  It uses a question mark as a placeholder for each database name.  So the string being run might be better visualized as:

For Every Database:
print getdate()      --print the date for the output log
print '? checkdb'    --print header record of the database name
DBCC CHECKDB(?)      --run DBCC CHECKDB against every database

And the output is routed to dbccresult.txt.  Inspect that file every morning and you’ll have a good idea if your system is clean or if something nasty is lurking inside waiting to leap out, like the chestburster from the movie Alien.

NightlySuspectPages.sql has a little bit more going on.

--nightlysuspectpages.sql
SET NOCOUNT ON;
SELECT 'Run date:', GETDATE();
SET NOCOUNT OFF;
--Server and SQL uptime
SELECT DATEADD(SECOND, -1 * (ms_ticks / 1000), GETDATE()) AS ServerStartDate,
    sqlserver_start_time AS SQLStartDate
FROM sys.dm_os_sys_info;
SELECT 'Suspect Pages' AS SuspectPages, *
FROM msdb.dbo.suspect_pages;
SELECT GETDATE() AS RunTime, LEFT(name, 15) AS DBName,
    LEFT(filename, 60) AS Filename, LEFT(status, 6) AS Status
FROM sysdatabases
WHERE dbid > 4
--    AND status <> 528
ORDER BY name;

The first block selects a time stamp of when the code ran into the output file.  The second block prints when the server was last started and when SQL Server started up.  They should be fairly close, but they can differ if you’ve installed SQL Server service packs or updates that restarted the SQL services without rebooting the server.

The third select looks for any pages in msdb.dbo.suspect_pages.  These pages represent I/O errors, which could be transient glitches or a warning of a problem somewhere in your disk system: a SAN, a local disk, whatever, it’s something to look at.  Take a look at Books Online for more information; just enter suspect_pages into the search box.  Normally there won’t be any records here and you can move on.  But if something ever appears here, you have a problem!

And finally, the fourth select lists every database on the instance and shows the Status field.  It will be a number like 65536 or 65544.  That number changes depending on what options are set for the database, like the recovery model.  The thing I watch for is whether it changes: if someone switches a database from full recovery to simple, I want to know about it!
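
If you’d rather read words than decode a bitmask, a rough equivalent against the newer sys.databases catalog view (a sketch, not the query from my job) spells out the recovery model and state directly:

SELECT GETDATE() AS RunTime,
    name AS DBName,
    recovery_model_desc AS RecoveryModel,
    state_desc AS DBState
FROM sys.databases
WHERE database_id > 4
ORDER BY name;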

More on this soon.

RAID vs MTBF of hard drives

It pays to be aware of what disks your databases reside upon, if only to reduce spindle contention. Lots of installations have internal RAID systems with data striped across multiple drives, and overall this is a good thing.

But as a sysadmin you need to be aware of MTBF, Mean Time Between Failures. Hard drives are physical devices and eventually they’re going to die. Even solid-state drives will eventually fail; there’s an interesting torture test of SSDs at Tech Report. The test concluded last year, with the last two drives to die having had over 2.5 PETABYTES of data written to them! It’s an interesting report.

Here’s the problem. Typically when we build or buy servers, we’ll order X number of hard drives from the same vendor at the same time. There is a strong likelihood that all of the drives will be from the same manufacturing batch, which means that they will have approximately the same MTBF, or the number of hours of operation before they die.

What does this have to do with SQL Server, or any server for that matter?

You come in on a Monday morning and you hear a beep in the computer room. With a little investigation you find the server that’s crying for help, or perhaps your network operations monitor sees the problem: a drive in a RAID array has failed. No problem – go into the spares, find a drive with the same capacity, replace the dead drive.

Now is when the MTBF comes in. If all of the drives were from the same batch, then they have approximately the same MTBF. One drive failed, so the rest are probably not far from failure themselves. And what happens when the failed drive is replaced? The RAID controller rebuilds it. How does it rebuild the new drive? It reads the existing drives and uses the parity data to reconstruct the contents of the new drive. So you now have a VERY I/O-intensive operation, with heavy read activity, running against a bunch of drives that are probably pushing end of life.

Additionally, even before the failed drive is replaced and rebuilt, the remaining drives are already under extra strain: they have to read the parity data and recalculate the missing data on the fly just to keep the array running.

Your likelihood of an additional drive failing just went up dramatically. It might happen during the rebuild, it might happen later after the system is back in full swing. And it might be a long way down the line – you never know when a drive or many drives will exceed their expected life or fall short of it.

This is just something to be aware of, and don’t be surprised if it happens. Definitely an important reason to make sure your backups are running well and restorable.

There is a way to try to dodge this issue, but it isn’t terribly easy, and that is to multi-source your drives. Buy drives from multiple suppliers and keep track of which drives from what batch are in what array. Gee, sounds like the job for a database!
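
Something along these lines would do as a starting point; the columns are just my guess at what you’d want to track:

CREATE TABLE dbo.DriveInventory (
    DriveID      int IDENTITY(1,1) PRIMARY KEY,
    Manufacturer varchar(50)  NOT NULL,
    Model        varchar(50)  NOT NULL,
    SerialNumber varchar(50)  NOT NULL,
    Vendor       varchar(50)  NOT NULL,   --who you bought it from
    PurchaseDate date         NOT NULL,
    ServerName   varchar(128) NULL,       --where it's installed
    ArrayName    varchar(50)  NULL,       --which RAID array it belongs to
    Notes        varchar(500) NULL
);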

I’m not sure about mixing drives from different manufacturers. I would think that if they were the same capacity and speed you’d be OK, but I’m not 100% confident of that; you might introduce some performance difficulties. Back in Ye Olde Days of network administration, when we needed to know things like cylinder and sector counts, things were more complicated. But we don’t bother with that (for the most part) these days.

Modifying Existing Indexes to add Included Fields

As is not unusual in our profession, we occasionally have weird things that drive us nuts.  I was rewriting a fairly important view and examining the actual query plan after I’d completed my changes but before it went live.  The query had a key lookup: there were three fields that were being pulled from that lookup at a moderate cost to the overall plan.  I decided to change one of the indexes to include the fields in question.

It turns out to not be as simple as I would’ve liked.

First, I tried scripting out the index from SSMS, and got an error popup saying ‘Discover dependencies failed’.  Well, that’s not much fun.

Next, I fell back to an old favorite: scripting the table at the database level.  Right-click on the database, select the specific table, script it out to a window.  Find the index that you want, copy it into a new query window, modify the index.  Execute.  Done.

Of course not.  When I went to run the original query, the key lookup was still there.  So I double-clicked on the index, clicked on the Included Columns tab, and the columns that I wanted were not there.  I don’t know why; the code to recreate the index seemed to run without an error, it just didn’t take.  And yes, after modifying the index I did rebuild it, just to make sure.

But with that Included Columns tab open, I just added the fields that I wanted, and the next time I ran my original query I had a seek instead of a key lookup.

Win.

So the takeaway is that, as we’re fond of saying on SQL Server Central, there’s more than one way to do that.  It’s definitely helpful to know more than one way to get things done.
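
Another way, purely in T-SQL, is to recreate the index with the included columns in one statement using DROP_EXISTING.  The table, index, and column names here are made up for illustration:

--rebuilds the existing index in place, now covering the columns the key lookup was fetching
CREATE NONCLUSTERED INDEX IX_Orders_CustomerID
    ON dbo.Orders (CustomerID)
    INCLUDE (OrderDate, ShipCity, TotalDue)
    WITH (DROP_EXISTING = ON);

DROP_EXISTING = ON drops and recreates the index in a single operation, so you don’t have to script a separate DROP INDEX first.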

How do you defend your data when vendors sign on to your server?

I had an incident last week with my hosted server. Even though it was not yet in production, it had live data on it. A vendor was going to sign on to configure our Terminal Services gateway and licensing, something that we were having problems doing ourselves. It wasn’t just that the data was live; the problem was that the data represented HIPAA- and FERPA-sensitive information.

And when someone signs on as a server administrator, there are limits to how well you can protect your data.

I decided the best thing to do was to drop the databases and delete them from the server. I could pick them up again Monday from backups. Thus my data was secure.
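
The mechanics are nothing fancy; roughly something like this, with a made-up database name and backup path:

--final backup before the vendor gets admin access
BACKUP DATABASE SensitiveDB
    TO DISK = 'D:\Backups\SensitiveDB_PreVendor.bak'
    WITH INIT, CHECKSUM;

DROP DATABASE SensitiveDB;

--and once the vendor is done
RESTORE DATABASE SensitiveDB
    FROM DISK = 'D:\Backups\SensitiveDB_PreVendor.bak'
    WITH RECOVERY;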

I made one mistake, and it was kind of a biggie. When I created the utility jobs (backups, DBCCs, etc.), I set them to email me at both my work and private addresses when the jobs failed.

And I didn’t stop SQL Agent.

And I headed out of town Sunday for three days.

I had over 400 emails in both accounts waiting for me, all complaining that for some odd reason the databases weren’t available.

My personal email account was a Gmail account, so the messages collapsed nicely under common headers and were very easy to clean up. The ones in my work email also weren’t difficult: in Outlook you right-click on one of the messages, click Find Related, then Messages in This Conversation. No big deal; it took maybe 15 minutes to delete them all and have Outlook clean up my Deleted Items folder.

Tomorrow we have another vendor accessing the server, so I dropped the databases and copied the MDFs and backups to another server that’s not on the same domain or linked. When the vendor is done, I’ll disable their login, copy my databases back to my server, attach them, and turn SQL Agent back on.