
Why is cleaning up duplicated URLs in Google Analytics reports so important?


When I was writing the article about using Filters to get correct Shopify numbers in Google Analytics, the idea was to help the numerous Shopify store owners by covering a fairly niche topic, as there was no material on it on the Web. But after I got feedback from some colleagues that I should have started with the broader topic first (duplicated URLs in the Google Analytics Pages report, and why removing them matters so much), I figured a new article was due.

Reading up on Regex and the Google Analytics Filters that use it would probably be a smart idea in case you feel you are not on solid ground with Regular Expressions. Another useful article on cleaning up GA URLs, although written eight years ago, is still a helpful introduction to the topic we are covering today, and you should definitely read it before you delve into my post.

Google Analytics Pages report

Let’s begin by brushing up on the Pages report and the metrics it contains. You can safely skip to the next section if you are already comfortable with this report.

Pageview

A number that shows how many times one particular page was loaded into the visitor’s browser. Reloading the page thus triggers another pageview.

Note that loading a page in the browser doesn’t necessarily mean its content was read; that can, however, be tracked with a scroll depth script for GA.

Unique Pageview

This metric is similar to the Pageview, except that it is counted only once per session (visit) to the website. You can have multiple pageviews of one page, but only one unique pageview of that page, for as long as your visit lasts.
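To make the difference concrete, here is a minimal Python sketch. The hit log and its session IDs are made up for illustration; GA does not expose data in this form, and the counting below only mirrors the two definitions above.

```python
from collections import Counter

# Hypothetical hit log: (session_id, page_path), one tuple per page load.
hits = [
    ("s1", "/articles/"),
    ("s1", "/articles/"),  # reload within the same session
    ("s1", "/contact/"),
    ("s2", "/articles/"),
]

def pageviews(hits):
    """Every page load counts."""
    return Counter(page for _, page in hits)

def unique_pageviews(hits):
    """Each page counts at most once per session."""
    return Counter(page for _, page in set(hits))

pv = pageviews(hits)
upv = unique_pageviews(hits)
print(pv["/articles/"], upv["/articles/"])  # 3 2
```

The reload in session s1 adds a Pageview but not a Unique Pageview.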

Average Time on Page

As the name says, this metric should describe how long, on average, visitors have stayed on a page.

Unfortunately, that is not quite what it describes, because GA measures time on page only as the difference between the timestamps of two consecutive hits: page loads or any other tracking hits sent to GA, including events. In effect, GA doesn’t know how much time you spent on a page if you just loaded it, reviewed its content and then left without going to another page of the same site.

This default behavior of GA can be modified and tuned so that time on page is tracked fairly accurately, but that modification also heavily skews and reduces Bounce Rate, which is usually not what you want. You can read more about that here.
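The timestamp logic described above can be sketched in a few lines of Python. The session data is hypothetical, and the function only illustrates the "difference between two hits" idea; it is not GA's actual processing code.

```python
# Hypothetical hits of one session: (seconds since session start, page path).
session = [
    (0, "/home/"),
    (40, "/articles/"),
    (100, "/contact/"),  # last hit: there is no later timestamp to subtract from
]

def time_on_page(session):
    """Return (page, seconds) pairs; the last page gets None because no later hit exists."""
    result = []
    for i, (timestamp, page) in enumerate(session):
        if i + 1 < len(session):
            result.append((page, session[i + 1][0] - timestamp))
        else:
            result.append((page, None))
    return result

print(time_on_page(session))
# [('/home/', 40), ('/articles/', 60), ('/contact/', None)]
```

The None for the last page is exactly the blind spot mentioned above: without a following hit, there is nothing to measure against.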

Entrances

A number that shows in how many cases this page was a landing page — one that visitors began their visit through.

Notice the difference from Unique Pageviews: you can start your visit to the website on one particular page, and that yields one Entrance and one Unique Pageview. But you can also reach this page after landing on some other page, which yields a Unique Pageview without an Entrance. The number of Unique Pageviews will therefore be equal to or higher than the number of Entrances for that page.

Bounce Rate

Percentage of visits (sessions) that end with only one page viewed.

As mentioned when Average Time on Page was discussed, sending additional events from a page can reduce its Bounce Rate as well as increase the time tracked on that page.

% Exit

This number shows the percentage of visits that end on this particular page.

Since a visit might or might not begin on this page, the number of Exits will be equal to or higher than the number of Bounces from that page. Said differently, if visitors start their visit on this page and then leave without visiting any other page, that is one Exit and one Bounce; but if they start on some other page and exit on this one, it is one Exit and zero Bounces.
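Here is a small Python sketch of how Entrances, Bounces and Exits relate to each other. The three sessions are invented for illustration, and a Bounce is simplified here to "a session with exactly one page viewed", as defined above.

```python
from collections import Counter

# Hypothetical sessions, each an ordered list of the page paths viewed.
sessions = [
    ["/articles/"],               # lands and leaves: entrance, bounce and exit
    ["/home/", "/articles/"],     # exits on /articles/ but did not land there
    ["/articles/", "/contact/"],  # lands on /articles/, exits elsewhere
]

entrances, bounces, exits, uniques = Counter(), Counter(), Counter(), Counter()
for pages in sessions:
    entrances[pages[0]] += 1      # first page of the session
    exits[pages[-1]] += 1         # last page of the session
    if len(pages) == 1:
        bounces[pages[0]] += 1    # only one page viewed
    for viewed in set(pages):
        uniques[viewed] += 1      # once per session

page = "/articles/"
print(entrances[page], bounces[page], exits[page], uniques[page])  # 2 1 2 3
```

For /articles/, Unique Pageviews (3) are at least as many as Entrances (2), and Exits (2) are at least as many as Bounces (1).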

Page Value

An often neglected metric, but quite an important one for a website that has a monetary value attached to GA goals or that tracks transactions. Page Value is the transaction revenue plus the total goal value from sessions in which the page was viewed before the conversion, divided by the number of unique pageviews of that page. As a result, this metric shows how much a page contributes to the overall revenue or monetary gain of the website.
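As a rough worked example in Python, assume that Page Value counts only the value of sessions in which the page was viewed before the conversion, and that each such session contributes one unique pageview. The sessions and amounts are hypothetical.

```python
# Hypothetical sessions: (pages viewed before the conversion, revenue + goal value).
sessions = [
    (["/home/", "/articles/"], 100.0),
    (["/articles/"], 0.0),  # no conversion in this session
    (["/home/"], 50.0),
]

def page_value(page, sessions):
    """Value of the sessions that included the page, divided by its unique pageviews."""
    values = [value for pages, value in sessions if page in pages]
    if not values:
        return 0.0
    return sum(values) / len(values)

print(page_value("/articles/", sessions))  # 50.0
print(page_value("/home/", sessions))      # 75.0
```

Note how the session without a conversion still counts as a unique pageview and pulls the value of /articles/ down.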

Duplicated URLs and the impact on the Google Analytics Page metrics

For the purpose of this article, let me divide the Page URLs reported in GA into two groups: those which are “clean” and those with some additional “parameters”.

The “clean” URLs contain only the path to the page on the website, and nothing else. An example of this case is the path to this article: /articles/google-analytics/cleaning-up-duplicated-google-analytics-reports-important/

On the other side, URLs with parameters usually contain a question mark (?) followed by one or more name=value pairs separated by an ampersand (&). An example would be: /articles/?page=2&trkid=1lkjdsf. Notice the two different parameters, “page” and “trkid”, with their values.

These pages might have parameters for various reasons: they were attached by a third-party service or tracking script that you are using; they were attached by your own website (to show pagination, the ordering of elements on product category pages, a session ID, etc.); or they come from sessions generated by your developer while making changes and testing your site (“preview_id”, for example).
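If you want to see this split for yourself, Python's standard urllib.parse module can take such a URL apart. This is only an illustration of the URL anatomy; the helper name split_page_url is my own.

```python
from urllib.parse import urlsplit, parse_qsl

def split_page_url(url):
    """Return the clean path and a dict of the query parameters."""
    parts = urlsplit(url)
    return parts.path, dict(parse_qsl(parts.query))

path, params = split_page_url("/articles/?page=2&trkid=1lkjdsf")
print(path)    # /articles/
print(params)  # {'page': '2', 'trkid': '1lkjdsf'}
```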

So, imagine you are analyzing the performance of one page, trying to determine whether its content needs optimization (for example, because of a high bounce rate on a landing page that, like a homepage, should send visitors further into the site) or how valuable it is in achieving your financial goal for the website. Now, instead of seeing only one /articles/ row in the Pages report, you see /articles/?page=2, then /articles/?page=3&ssid=123543, or any other variation of these.

Naturally, this fragments the data and gives you inaccurate metrics. Instead of one row for the page, you have multiple variations of it, even though it may be the very same page. As a result, Pageviews are underreported, Average Time on Page and Bounce Rate are incorrect, and Page Value is lower than it should be. All of these can lead you to wrong conclusions about the performance of that page.
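A short Python sketch shows how much is hidden by this fragmentation. The report rows and their numbers are invented; the point is only that the true Pageviews of /articles/ are the sum over all its variations.

```python
from collections import Counter
from urllib.parse import urlsplit

# Hypothetical rows from a Pages report: (page, pageviews).
report = [
    ("/articles/", 120),
    ("/articles/?page=2", 30),
    ("/articles/?page=3&ssid=123543", 5),
    ("/contact/", 40),
]

def merge_by_path(report):
    """Sum pageviews over all variations that share one path."""
    totals = Counter()
    for page, views in report:
        totals[urlsplit(page).path] += views
    return totals

merged = merge_by_path(report)
print(merged["/articles/"])  # 155
```

Looking only at the first row, you would read 120 Pageviews instead of 155.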

Cleaning it all up

I have covered the cleanup pretty thoroughly in the article about Shopify URLs so I won’t repeat the whole procedure here, with all the screenshots.

Briefly, you mitigate this issue by creating Google Analytics Filters. If you need the values of those parameters, you can store them first, as the other article describes. If you don’t, you can skip creating Custom Dimensions and the Filters that output to them, and proceed directly to the Filters that remove these additional parameters.

There is a different approach to this matter, cleaning those parameters in Google Tag Manager before they reach Analytics, but it adds a layer of complexity (you either need to set up GTM from scratch or add custom JavaScript to an existing container), so I won’t deal with it here.

How to Use Google Analytics Filters to Clean Up Shopify Pages Data


Shopify is a well-built e-commerce platform, in my opinion, and I would undoubtedly support any online business owner in a decision to build a new website on it or move an existing one to it. Primarily, that is because they have done an excellent job of making e-shop management easy, but also because of the secure environment and fast enough servers.

In the past, honestly, I was critical of their Google Analytics integration, which was one of the reasons Shopify was not my platform of choice. That has changed since, and nowadays the integration needs just a little outside help and setup to collect robust, accurate and useful data.

What you will find in this article are the Google Analytics Filters that I personally find useful for cleaning up Shopify data in GA’s Pages report. If you would like an introduction to the Pages report in general, start with the other article, which focuses on why duplicated URLs in Google Analytics should be resolved.

How to make Shopify numbers in the Google Analytics Pages report more accurate

Problem

If you have gone to the All Pages report in GA (Behavior –> Site Content –> All Pages), which you would do to see how a single page performs (Bounce Rate, Avg. Time on Page, and the undervalued Page Value metric), you will have noticed that one page is repeated in numerous variations:

  • /example-products
  • /example-products?sort_by=price-ascending
  • /example-products?page=3

The appearance of these parameters (the letters and numbers after the question mark) means that one page’s metrics have been split across multiple rows instead of showing as a single row per Page path. Consequently, the number of Pageviews for a single page is not just 6,587, or whatever the main row shows, but more, because there might be 10, 20, 50 or more variations of this page with a parameter in the Page path, each with 50, 60 or only 1 or 2 Pageviews. The page could in fact have had 7,659 Pageviews, for example.

The same issue affects all the other essential Page metrics: Bounce Rate, Entrances, % Exit, etc.

Let’s now explain what the two Shopify parameters shown in this case are:

  • “sort_by” (see the Shopify documentation) in the Page path means that someone sorted the products on a Collection page by some criterion: price, date, featured, etc.
  • “page” in the Page path means that someone went beyond the first page of a Collection, and shows which page they visited: 2nd, 3rd, etc.

Now, simply removing these would not be advisable, because they are useful. You do want to know whether someone viewed products beyond the first page, or by which criteria they sorted. Both can tell you, for example, which products to add to the featured ones. So we will first store them in GA Custom Dimensions and then remove them to clean up the Pages report, as we initially intended.

Solution

Solving this issue requires adding new Filters to GA: one to remove “sort_by”, one to remove “page” and one to remove the remaining characters (like ? and &). Before doing that, we will create new GA Custom Dimensions to store those values, and Filters that “pull” the values from the URL before they are removed.

This is all done from the GA Admin panel. The steps follow:

  1. Creating new Custom Dimensions

You create one for each value that needs to be stored, so we need two new Custom Dimensions. They are created in the Admin panel, under Property –> Custom Definitions –> Custom Dimensions.

You can name them however you want, but I would propose “Shopify Collections Pagination” and “Shopify Collections Sorting”. Both should be of the “Hit” Scope.

  2. Creating new Filters

First, you should build Filters to pull the values out of the URL. This is done by making a new Filter, choosing Custom as the Filter Type, and then selecting the Advanced radio button.

In Field A, choose Request URI from the list and put (sort_by=[^&]*&?) into the empty value box. Field B should be left empty and unchanged. Output To should be the Custom Dimension for sorting that you created earlier, with $A1 in its value box ($A1 refers to the first parenthesized group matched in Field A).

Do the same for the other Custom Dimension, this time with (page=[^&]*&?) in the Field A box and the pagination Custom Dimension chosen in the Output To list.
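Before saving the Filters, it can help to check outside GA what the two patterns actually capture. The sketch below uses Python's re module, which handles these particular patterns the same way, to mimic what $A1 would hold; the extract helper is my own and not part of GA.

```python
import re

SORT_BY = re.compile(r"(sort_by=[^&]*&?)")
PAGE = re.compile(r"(page=[^&]*&?)")

def extract(pattern, request_uri):
    """Mimic $A1: the first capture group, or None when the pattern does not match."""
    match = pattern.search(request_uri)
    return match.group(1) if match else None

uri = "/example-products?sort_by=price-ascending&page=3"
print(extract(SORT_BY, uri))  # sort_by=price-ascending&
print(extract(PAGE, uri))     # page=3
```

Notice that with these patterns the captured text includes the parameter name and, when another parameter follows, the trailing &. If you would rather store only the value, a pattern such as sort_by=([^&]*) captures just that part.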

After that is done, creating the Filters that clean up these values is next on the list.

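The first one is for “sort_by” and the second one for “page”. Notice that for these Filters we are using Search and Replace, instead of the Advanced Filter. The Search String values are (sort_by=[^&]*&?) and (page=[^&]*&?) respectively, applied to the Request URI, with the Replace String left empty so that the matched text is removed.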

With that sorted out, what is left is to clean up the remaining characters. That requires only one additional Filter, with the value ([?&]$), which matches a ? or & left at the very end of the URL.
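To see the three clean-up patterns working together, here is a small Python simulation. It uses re.sub in place of GA's Search and Replace, and assumes each parameter appears at most once per URL; the sample URLs, including the “color” parameter, are made up.

```python
import re

# The three patterns from the text, in the order they should run.
REMOVALS = [
    r"(sort_by=[^&]*&?)",
    r"(page=[^&]*&?)",
    r"([?&]$)",
]

def clean_request_uri(uri):
    """Apply each pattern in turn, replacing what it matches with an empty string."""
    for pattern in REMOVALS:
        uri = re.sub(pattern, "", uri)
    return uri

for uri in [
    "/example-products",
    "/example-products?sort_by=price-ascending",
    "/example-products?sort_by=price-ascending&page=3",
    "/example-products?color=red&page=3",
]:
    print(uri, "->", clean_request_uri(uri))
```

The first three URLs all collapse to /example-products, while the last one keeps its unrelated parameter and becomes /example-products?color=red. One caveat worth knowing: the patterns are not anchored to the start of a parameter name, so a hypothetical parameter like per_page=10 would be partly matched by the “page” pattern.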

IMPORTANT: Removing query strings from Google Analytics should not be done in the most expected place, the Exclude URL Query Parameters box in the Settings of the GA View.

The reason is that the exclusion of these query parameters (like “sort_by”) is processed before Filters, so the parameters would be stripped from the URL before the Filters could pick them up and store them. This means that if you put the parameters in that box, the Filters that store the values won’t work, even though you would still get clean Page data as a result.

  3. Assigning Filter order

Google Analytics applies Filters in the order in which they appear in the list: the first one is processed before the second. This is why we need to make sure that we don’t remove these parameters before we have stored them in the Custom Dimensions, and that the final “clean-up” Filter comes last.

Filter order is assigned from the Filters list of the View (Admin –> View –> Filters), using the Assign Filter Order button.

It is done by choosing a Filter and moving it up or down with the buttons available. For our case, the final order should be: the two Filters that store the values first, then the two that remove “sort_by” and “page”, and finally the one that removes the remaining characters.
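Why the order matters can be shown with a tiny Python simulation. The store and remove functions are my own stand-ins for the Advanced and the Search and Replace Filters; they are not GA code, and the final clean-up Filter is left out to keep the sketch short.

```python
import re

SORT_BY = r"(sort_by=[^&]*&?)"

def store(uri, dims):
    """Advanced-style filter: copy the first capture group into a custom dimension."""
    match = re.search(SORT_BY, uri)
    if match:
        dims["Shopify Collections Sorting"] = match.group(1)
    return uri

def remove(uri, dims):
    """Search-and-replace-style filter: strip the parameter from the URI."""
    return re.sub(SORT_BY, "", uri)

def run(filters, uri):
    """Apply the filters top to bottom, as in the filter list."""
    dims = {}
    for apply_filter in filters:
        uri = apply_filter(uri, dims)
    return uri, dims

uri = "/example-products?sort_by=price-ascending"
print(run([store, remove], uri))  # value is stored, then removed from the URI
print(run([remove, store], uri))  # value is already gone, nothing is stored
```

In the second run the dictionary of dimensions stays empty, which is what happens in GA when a removing Filter sits above the storing one.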

With this, we have completed the setup for cleaning up parameters from the Page path and getting valid data in your GA reports.

You should also know that Shopify can generate other parameters that appear in the Page path, like “limit” (which shows how many products the visitor chose to see at once on the Collections page). For these cases, you can apply the same method: create a Custom Dimension to store the value, create one Filter that pulls it from the URL and stores it, and then create one that removes the parameter. All you need to change is the text of the parameter, for example (limit=[^&]*&?) instead of (page=[^&]*&?). Don’t forget to assign the Filter order, and always keep the Filter that removes the remaining & and ? characters as the last one.
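Since only the parameter name changes, the pattern can be generated for any parameter. The following Python sketch does that; param_pattern and clean are hypothetical helper names, and the result is meant for checking patterns before you paste them into GA, not as a replacement for the Filters.

```python
import re

def param_pattern(name):
    """Build the pattern used in the text for an arbitrary parameter name."""
    return "(" + re.escape(name) + "=[^&]*&?)"

def clean(uri, names):
    """Remove each named parameter, then the leftover ? or & at the end."""
    for name in names:
        uri = re.sub(param_pattern(name), "", uri)
    return re.sub(r"([?&]$)", "", uri)

print(param_pattern("limit"))  # (limit=[^&]*&?)
print(clean("/example-products?limit=48&page=2", ["sort_by", "page", "limit"]))
# /example-products
```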