Data Cleanup

I’m working on a project to randomize name and address data for a project.  Basically I want to be able to take live data and do a one-way (non-reversible) scramble for showing to the public and for training purposes.  I’ve downloaded a lot of records from GenerateData.com, a site that I really like for the purpose, but it’s a little too ‘international’ for my needs.  So I’ve grabbed data from a number of places.

One places that I took was my iPhone contact list.  I generated a PDF of all records with just the name and address.  All I wanted was street addresses, but I very quickly ran in to a major problem after I did a Select All and pasted it in to Excel for cleanup before loading it in to a database.

The data looked like this:

Zeffiro's Pizzaria                                                    135 N. Water St

The problem was that there was an unpredictable and inconsistent number of spaces between the company name and the address.  Incidentally, if you ever find yourself in Las Cruces, NM, Zeffiro’s is fantastic pizza from a wood-fired oven.  And yes, that’s their address.

I figured out a simple Excel formula to handle the extraction:

=TRIM(RIGHT(<datacolumn>,50))

It wasn’t 100% effective, but it was pretty darn good for extracting just that part of the address.  Grab the rightmost X columns, and the TRIM() function removes both leading and trailing spaces, and Robert is your parent’s brother.

The process was a little more complicated than that, but ultimately I ended up with what I needed.

Good luck if you have a similar need!