In another era, I would be an Excel jockey; instead, my true love is Google Sheets.

As Scott Fennell and I have continued to hammer away at the new LexBlog.com, my eyes have gone red staring at more spreadsheets in Google Sheets than I’d care to admit. I’m using these spreadsheets for two reasons:

  • Validate that the shape of the data on our test aggregation site (i.e., the future LexBlog.com) matches the shape of the data on each test source site (i.e., all of the client sites that we manage)
  • Derive some understanding of the organization of things on the current LexBlog.com

The reason I’m using Google Sheets for all this is simple: It’s fast, easy, and requires very little from me to maintain the approach.

A few months ago, Scott and I debated building a testing apparatus for this project, but after many arguments in Slack, we decided against it due to the changing shape of the project. With only two full-time resources working on this at any given time, it’s hard to justify sinking our time and effort into yet another thing to manage and debug. However, I needed a way to efficiently validate that the things our test aggregator was responsible for capturing were being captured. This includes:

  • Post titles
  • Post content
  • Post date
  • Post authors
  • Author meta

And probably a handful of other things. Our goal is to keep all of this in sync between the sites that we manage and LexBlog.com so that when a post on a source site is updated in any way, LexBlog.com’s version of the post adequately reflects that.

Our current test environment is managing to do this with about 6,000 posts and 500 users (all of the sites we manage have a cumulative total of 370,000+ posts and 16,000+ users), but as we test and take various bulk actions on these posts and users, it’s hard to manually review everything.

In doing my testing, I’ve developed a nice workflow for users wherein I use WP All Export to create a customized export of users and their meta from both the aggregation site and the source sites. I first concatenate all of the CSVs from the source sites into one monolithic CSV, which I do on the command line, inside the directory that holds these files, with this command: cat *.csv > merged.csv
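One caveat worth noting: a bare cat *.csv > merged.csv also repeats each file’s header row in the merged output. A small, header-aware Python sketch of the same merge (the column names in the usage comment are just for illustration, not the actual export format):

```python
import csv
import glob

def merge_csvs(paths, out_path):
    """Concatenate CSV files, keeping only the first file's header row."""
    header_written = False
    with open(out_path, "w", newline="") as out:
        writer = csv.writer(out)
        for path in paths:
            with open(path, newline="") as f:
                rows = list(csv.reader(f))
                if not rows:
                    continue  # skip empty files
                if not header_written:
                    writer.writerow(rows[0])
                    header_written = True
                writer.writerows(rows[1:])  # data rows only

# e.g. merge_csvs(sorted(glob.glob("*.csv")), "merged.csv")
```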

From there, I bring the aggregation site’s and source sites’ CSVs into a single Google Spreadsheet and place them in two different sheets. The first thing I do is follow this handy tutorial, using the Google Sheets Script Editor to run a bit of JavaScript on the sheet to remove all duplicate entries. Then, because I know there are some users on the source sites that I don’t care about (remember, I’m only interested in users who have a post on the source site), I use a VLOOKUP to build a sheet containing all of the users and their meta that appear in both sheets, using the data from the source sites to populate it.
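In spirit, that dedupe-then-VLOOKUP step is exact-duplicate removal followed by an inner join on a key column. A rough Python equivalent, purely as a sketch (the key column position is an assumption, not the actual export layout):

```python
def dedupe_rows(rows):
    """Remove exact duplicate rows, preserving order (like the Apps Script dedupe)."""
    seen = set()
    unique = []
    for row in rows:
        key = tuple(row)
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique

def intersect_users(source_rows, aggregator_rows, key_col=0):
    """Keep only source rows whose key also appears on the aggregator (VLOOKUP-style),
    populating the result from the source-site data."""
    agg_keys = {row[key_col] for row in aggregator_rows}
    return [row for row in source_rows if row[key_col] in agg_keys]
```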

After everything is cleaned up, I download the new users sheet and the sheet from the aggregation site and bring them into Sublime Text (my text editor of choice), which has great file-diff functionality for comparing two files to see whether they’re identical. I do a similar bit of work with post data, only with less cleanup necessary thanks to WP All Export’s ability to run more advanced queries on posts than on users.
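For anyone without Sublime Text handy, Python’s standard-library difflib can do the same identical-or-not check. A minimal sketch, assuming plain-text exports:

```python
import difflib

def diff_files(path_a, path_b):
    """Return a unified diff of two text files; an empty list means they're identical."""
    with open(path_a) as a, open(path_b) as b:
        return list(difflib.unified_diff(
            a.read().splitlines(),
            b.read().splitlines(),
            fromfile=path_a,
            tofile=path_b,
            lineterm="",
        ))
```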

This is easy enough, although time-consuming, and aided greatly by WP All Export. Unfortunately, data collection on the current LexBlog.com is far more difficult.

Recently, I wanted to develop a list of all RSS feeds syndicated by FeedWordPress and the categories assigned to posts from those feeds. As we migrate from FeedWordPress to our WordPress REST solution, I want to respect the work our editorial team has done in organizing this content. Unfortunately, the way FeedWordPress keeps this data is fairly unstructured. Each feed added to FeedWordPress creates a corresponding wp_link (which is already annoying, as that object is no longer a core part of WordPress). Much of the data about the feed ends up in a “Notes” section of this object, which is a plain-text field.

To extract the information, I first had to run a SQL query in phpMyAdmin and download the result as a CSV. I then pulled it into – you guessed it – Google Sheets, where I used the following functions:

  • REGEXEXTRACT to pull all strings from the “Notes” cell that look like cats: {category#36}\n{category#4}\n{category#5} into another cell
  • SPLIT to extract numeric values into individual cells
  • JOIN to combine the contents of many cells with a delimiter (along with the & concatenation operator)
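The same extraction can be sketched outside of Sheets. Assuming the Notes field marks categories as {category#ID}, as in the example above, a single regular expression recovers the IDs, and a join with a delimiter mirrors the last step:

```python
import re

def extract_category_ids(notes):
    """Pull category IDs out of a FeedWordPress-style Notes blob.

    Assumes entries look like {category#36}; the real Notes layout may vary.
    """
    return re.findall(r"\{category#(\d+)\}", notes)

notes = "cats: {category#36}\n{category#4}\n{category#5}"
ids = extract_category_ids(notes)
combined = ", ".join(ids)  # like JOIN in Sheets: one cell listing every category
```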

At the end of this, I had a cell that contained all of the categories that each feed posted to and was able to bring this into a single sheet that I’ll use as we perform our migration from FeedWordPress to our own aggregation solutions.

Thanks Google Sheets 🙂 (and WP All Export!)