Exploratory data analysis

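Download your data and save it in a data frame called gadata:

# Get the sessions by day in 2014
query.list <- Init(start.date = "2014-01-01",
                   end.date = "2014-12-31",
                   dimensions = "ga:date",
                   metrics = "ga:sessions",
                   table.id = "ga:00000000")

# Validate the query, then fetch the report into a data frame
# (assumes `token` is an OAuth token created earlier with Auth())
ga.query <- QueryBuilder(query.list)
gadata <- GetReportData(ga.query, token)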

Let's do some basic operations on the data.

Min

What is the minimum number of sessions in 2014?

min(gadata$sessions)
[1] 0

Number of days with 0 sessions recorded

It seems there was a tracking error and no data was recorded on some days. Which days were they? Display the days with 0 sessions.

subset(gadata, gadata$sessions == 0)
        date sessions
7   20140107        0
8   20140108        0
129 20140509        0
130 20140510        0
131 20140511        0
132 20140512        0
133 20140513        0
134 20140514        0
135 20140515        0

How many days had 0 sessions? Use the nrow() function to count the rows that meet this condition.

nrow(subset(gadata, gadata$sessions == 0))
[1] 9

There were 9 days with 0 sessions.
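The dates in the output above are stored as character strings in the form YYYYMMDD. A minimal sketch of converting them to proper Date values before filtering is shown here. The data frame sample_data and its values are hypothetical stand-ins for gadata, used only so the example runs on its own.

```r
# Hypothetical sample with the same columns as gadata (not real data)
sample_data <- data.frame(
  date = c("20140106", "20140107", "20140108", "20140109"),
  sessions = c(12, 0, 0, 31),
  stringsAsFactors = FALSE
)

# Convert the YYYYMMDD strings to Date objects
sample_data$date <- as.Date(sample_data$date, format = "%Y%m%d")

# Keep only the days with no recorded sessions
zero_days <- sample_data[sample_data$sessions == 0, ]
print(zero_days)

# Count them
nrow(zero_days)
```

With real Date values you can sort, compare, and plot the days directly instead of working with strings.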

Max

What was the highest daily traffic on your website? Use the max() function.

max(gadata$sessions)
[1] 204

The highest traffic was 204 sessions in one day. When was it?

subset(gadata, gadata$sessions == 204)
       date sessions
59 20140228      204

You can get the same result in one step by replacing the value with max(). This is shorter but harder to read:

subset(gadata, gadata$sessions == max(gadata$sessions))
       date sessions
59 20140228      204
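As an alternative sketch, base R's which.max() returns the index of the first maximum, so it can select the peak row directly. Note that it returns only the first match, while the subset() approach returns every day that ties for the maximum. The sample_data frame below is hypothetical and deliberately contains a tie to show the difference.

```r
# Hypothetical sample with a tie for the maximum (not real data)
sample_data <- data.frame(
  date = c("20140226", "20140227", "20140228"),
  sessions = c(35, 204, 204),
  stringsAsFactors = FALSE
)

# which.max() gives the position of the FIRST maximum only
first_peak <- sample_data[which.max(sample_data$sessions), ]
print(first_peak)

# Comparing against max() keeps every row that reaches the maximum
all_peaks <- subset(sample_data, sessions == max(sessions))
print(all_peaks)
```

If ties matter for your analysis, the comparison against max() is the safer choice.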

Mean

What is the mean number of sessions per day? To calculate this, use the mean() function.

mean(gadata$sessions)
[1] 27.6

The average number of sessions per day is equal to 27.6.

Standard deviation

You can check how much the number of sessions varies from day to day. Use the sd() function.

sd(gadata$sessions)
[1] 22.12984

The average number of sessions is 27.6 with a standard deviation of about 22.13. The spread is large relative to the mean, so it is better not to rely on the average value alone.
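One way to put a number on how large the spread is relative to the mean is the coefficient of variation (standard deviation divided by mean). This measure is not part of the original walkthrough. It is sketched here using the two values reported above, and the variable names are only illustrative.

```r
# Values taken from the output above (the mean is rounded)
mean_sessions <- 27.6
sd_sessions <- 22.12984

# Coefficient of variation: spread expressed as a fraction of the mean
cv <- sd_sessions / mean_sessions
round(cv, 2)
```

A value of roughly 0.8 means the typical deviation is about 80% of the mean. On your own data you could compute the same thing with sd(gadata$sessions) / mean(gadata$sessions).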

Median

If a dataset has a high standard deviation, it is better to also calculate the median (the middle value of the sorted dataset), which is less sensitive to extreme values.

median(gadata$sessions)
[1] 21

The median is 21 sessions per day: about half of the days had fewer sessions than this and about half had more.
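To illustrate why the median is less affected by extreme days than the mean, here is a small sketch on a hypothetical vector of daily sessions that contains one outlier. The numbers are made up for the example.

```r
# Hypothetical daily sessions with one unusually busy day (not real data)
sessions <- c(10, 20, 21, 30, 204)

# The mean is pulled up by the single large value
mean(sessions)    # 57

# The median is the middle value of the sorted vector
median(sessions)  # 21
```

One busy day is enough to nearly triple the mean here, while the median stays at the middle value.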

Summary

If you want, you can get all of these statistics with one function: summary().

summary(gadata)
     date              sessions    
 Length:365         Min.   :  0.0  
 Class :character   1st Qu.: 12.0  
 Mode  :character   Median : 21.0  
                    Mean   : 27.6  
                    3rd Qu.: 40.0  
                    Max.   :204.0

As a result, you get basic statistics for numeric variables and a description for character variables.
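The 1st Qu., Median, and 3rd Qu. rows in the summary output are quartiles. If you need them as plain numbers for further calculations, base R's quantile() function is one option. The sketch below uses a hypothetical vector rather than the real gadata.

```r
# Hypothetical daily sessions (not real data)
sessions <- c(10, 20, 21, 30, 204)

# Quartiles as a named numeric vector: "25%", "50%", "75%"
q <- quantile(sessions, probs = c(0.25, 0.5, 0.75))
print(q)

# Individual values can be extracted by name
q[["50%"]]

# Interquartile range: the spread of the middle half of the data
iqr <- q[["75%"]] - q[["25%"]]
iqr
```

On your own data you would pass gadata$sessions instead of the sample vector.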

Source code

The complete source code of the examples shown above is in my GitHub repository:

github.com/michalbrys/R-Google-Analytics/blob/master/2_eda.R
