Aggregating Data¶
Each data service should have an aggregate_data.py. This includes a class which defines functions that create and return data tables in a set format.
The main aim is for the report to be able to create an instance of the data class and call the data tables it needs. Example:
def generate_html(self):
get_data = AnalyticsData(all_sites, this_period, this_frequency)
first_table = get_data.get_summary_table()
second_table = get_data.get_article_table()
third_table = get_data.get_traffic_source_table()
#send tables to template
The functions within the aggregate_data files should query the analytics class to get the data, then use shared functions from the utilities file to aggregate and format the data in a set form.
- class AnalyticsData(sites, period, frequency)
- Initialises class from which to call analytics and get the data tablesInit function creates a list of periods to gather data for based on the period and the frequency.Using StatsRange objects it will get the previous period and yearly period to calculate percentage change with
- Data Table Functions
- These functions query analytics with specific metrics, dimensions and filters needed to get the dataThey call the data for each date period (current, previous and yearly) and for each site specified when the class was initialisedThen using utility functions they aggregate, format and add percentage changes to the data structureReturn the data as a list of dictionaries, each dictionary will be an inidividual dimension type, e.g. for the dimension social networks, you might have a dictionary for Facebook, Twitter, reddit etc
- Helper Functions
- These are specific functionss to be used by the service to help format the data into a more readable and sharable way for the reports and templates to use
Google Analytics Data Aggregator¶
Data Tables¶
- summary_table()
- Metrics: pageviews, users, sessions, pageviews per session, average session durationDimensions: None.This will return a single dictionary with the metrics accumulated for all the sites
- site_summary_table()
- Metrics: pageviews, users, sessions, pageviews per session, average session durationDimensions: None.Adds “site” key which is a string of the site name to each dictionaryReturns list of dictionaries where each dictionary is data for a separate site
- article_table()
- Metrics: pageviewsDimensions: page path, page title, host nameDoes not return data or percentage change for yearly as articles change so quicklyReturns a list of dictionaries, each dictionary being a single article. The length will be 20 and sorted by descending pageviews
- country_table()
- Metrics: pageviews, usersDimensions: countryList of countries specify what countries will be individuall listed in the table, the rest will be included in a “Rest of World” all inclusive dictionaryReturn a list of dictionaries sorted by descending pageviews, length will be the length of list of countries plus 1
- traffic_source_table()
- return with length limited to 10
- referring_sites_table(num_articles)
- gets specified number of articles for each source in the source listusing referral_articles() function
- social_network_table(num_articles)
- Metrics: pageviews, users, sessionsDimensions: social networkFilter: Don’t include data where social network is not setSort by descending usersUsing referral_articles() get num_articles for each network
- referral_articles(filter, limit)
- Metrics: pageviewsDimensions: page path, page title, host nameFilter: set from filter could be empty or specify a social network or source for exampleDoes not return data or percentage change for yearly as articles change so quicklyReturn a list of dictionaries, sorted by descending pageviews, length = limit
- device_table()
- Metrics: usersDimensions: device category (desktop, mobile or tablet)Sorted by descending users
- device_chart(data)
- return chart data for number of users on devices
- social_chart()
- return chart data for pageviews, users and sessions of socail networks
Helper Functions¶
- check_available_data()
- for each site check if data is availablereturn dictionary with boolean of true/false and list of sites with no data available
- _remove_ga_names(rows)
- remove the “ga:” in front of the google analytics metrics and dimension keys
- _remove_query_string(path)
- remove the query string from the end of article page paths so similar articles can be aggregated properlyreturn the new article path
- _get_title(path, title)
- check if path includes the “amp” stringif it does, add “AMP” to the end of the article title to show it is an amp version
- _get_source_list()
- Metrics: pageviews, usersDimensions: source / mediumSort by descending users
Youtube Analytics Data Aggregator¶
Data Tables¶
- country_table()
- Metrics: views, estimated minutes watched, subscribers gained, subscribers lostDimensions: countryAdd “subscriberChange” key, by subs gained - subs lostReturns top 20 countries sorted by descending estimated minutes watched
- channel_summary_table()
- Metrics: subscribers gained, subscribers lost, estimated minutes watchedDimensions: NoneAdds “channel” key which is a string of the channel name to each dictionaryAdds “subscriberCount” using the current total number of subscribersAdds “subscriberChange” key by subs gained - subs lostEach dictionary is the data for a separate channel
- channel_stats_table()
- Metrics: views, likes, dislikes, comments, shares, subscribers gainedDimensions: NoneAdds “channel” key which is a string of the channel name to each dictionaryAdds a rate per 1000 views key value pair for likes, shares, comments and subscribersAdds a like to dislike ratio key value pair
- video_table()
- Metrics: views, estimated minutes watchedDimensions: videoAdds “channel” key which is a string of the channel name to each dictionaryAdds “title” key for the title of the videoReturns top 10 videos sorted by descending estimated minutes watched
- traffic_source_table()
- Metrics: estimated minutes watchedDimensions: traffic source typeReturns a list of lists of dictionariesEach list describes one channel, and contains a dictionary for each traffic sourceFor traffic sources, the total of all channels for each one is calculated and it is ordered by descending estimated minutes watched, this order defines how the sources are ordered within each listThe total estimated minutes watched for each channel is calculated and this defines the order of the lists, descending estimated minutes watchedFor each channel list, the percentage for each traffic source for that channel is calculated, added under the key “source_percentage”
Helper Functions¶
- check_available_data()
- queries analytics to see if data is available for each sitereturns a dictionary with a value of True/False and a list of sites where data is not available
Utilities¶
Main Functions¶
Used mainly by aggregate_data.py in service folders to aggregate and format the data.
Main format is set as a list of dictionaries, where each dictionary holds the data for one dimension.
For instance, the data for a social network table would be a list of dictionaries, where each dictionary is the data for one social network. You can then sort and the set the length of the list as needed.
- format_data_rows(results)
- Google analytics returns data in a set formatThe function converts each row to a dictionary with the correct key value pairs, {metric:value, metric:value}It returns a list of these dictionaries
- aggregate_data(table, aggregate_keys, match_key=None)
- If match_key exists (a dimension field that the data needs to be separated by)uses list_search to create a new list of dictionaries with total accumulated metrics for each individual dimensionElse returns a dictionary of total accumulated metrics
- add_change(this_period, previous_period, change_keys, label, match_key=None)
- Adds specified percentage change to this_period dataFor each metric in change_keys, will add the previous figure, the difference and the percentage differenceIf a dimension is specified in match_key then must use list_search as list of dictionaries structure inputted and returned, otherwise simple dictionary
- list_search(to_search, key, value)
- Searches through the list of dictionaries to_search and returns the dictionary that matches the key, value pair given
- sort_data(unsorted_list, metric, limit=1000, reverse=True)
- Sorts a list of dictionaries by the metric given
- convert_to_floats(row, metrics)
- For list of dictionaries, rowconverts all the metrics specified into floatsreturns rows
- change_key_names(rows, changes)
- For list of dictionaries, rowschanges key names specified in changeswhere changes = {new_key: original_key, new_key: original_key}returns rows with new key names
- percentage(change, total)
- returns the percentage of change over total
- sig_fig(sf, num)
- returns num with specified number of significant figures
- rate_per_1000(metric, comparative)
- returns the rate of the metric per 1000 comparative
- convert_values_list(id_dict)
- converts values of the dictionary given into listsUsed to generalise the site ID values, so if only one ID makes into a list with one element, otherwise leaves as a list of multiple elements
- chart(title, x_labels, data, x_title, y_title)
- Uses pygal library to create a line chart
Date Utils¶
Used mainly in scheduler to calculate next runs.
- get_month_day_range(date)
- return start and end date of the month in date
- add_one_month(t)
- returns datetime object that has had one month added to t
- subtract_one_month(t)
- returns datetime object that has had one much subtracted from t
- find_last_weekday(start_date, weekday)
- returns the date of the previous closest day of the week that matches the string weekday
- find_next_weekday(start_date, weekday)
- returns the date of the next closest day of the week that matches the string weekday
StatsRange¶
Used mainly in reports to create a date range for the report to run This is where the frequency argument is mainly used. Frequency can be:
- DAILY
- a one day period, where it’s previous period is the day before
- WOW_DAILY
- a one day period, where it’s previous period is that day last week (-7days)
- WEEKLY
- a 7 day period, where it’s previous period is the week before
- MONTHLY
- a one month period (utilises the subtract_one_month function), where it’s previous period is the month before
- YEARLY
- a one year period, where it’s previous period is the year before
To create a StatsRange object, input a name as a string and two python datetime objects (start and end of the period). Example:
from datetime import date
monthly_period = utils.StatsRange("July", date(2016, 07, 01), date(2016, 07, 31))
- class StatsRange(name, start_date, end_date)
Object that defines a date period from start_date to end_date, used to show the report period.
- get_start()
- return start date in isoformat
- get_end()
- return end date in isoformat
- days_in_range()
- return number of days in period
- get_period(date, frequency)
- return a StatsRange object based on the date and frequency given
- get_previous_period(current_period, frequency)
- return a StatsRange object based on the current period and frequency
- get_one_day_period(date)
- return a StatsRange object for the date given
- get_one_week_period(date)
- return a StatsRange object of a week ending on the date given
- get_one_month_period(date)
- return a StatsRange object of a month ending on the date given