Delivering deep insight on advertising metrics and providing customers easy data access becomes a challenge as scale increases. In this session, Neustar, a global provider of real-time analytics, shows how they use Redshift to help advertisers and agencies reach the highest-performing customers using data science at scale. Neustar dives into the queries they use to determine how best to target ads based on their real reach, how much to pay for ads using multi-touch attribution, and how frequently to show ads. Finally, Neustar discusses how they operate a fleet of Redshift clusters to run workloads in parallel and generate daily reports on billions of events within hours. Session includes how Neustar provides daily feeds of event-level data to their customers for ad-hoc data science.
November 13, 2014 | Las Vegas, NV
Timon Karnezos, Director Infrastructure, Neustar
Vidhya Srinivasan, Sr. Manager Software Development, Amazon Redshift
Petabyte scale
Massively parallel
Relational data warehouse
Fully managed; zero admin
10 GigE (HPC)
Ingestion / Backup / Restore
JDBC/ODBC
Ad Tech
Use Cases
692.8s
34.9s
< 0.76%
[Figure: space-filling curve for two dimensions — a Z-shaped path through a 4x4 grid of interleaved 2-bit coordinates (00, 01, 10, 11)]
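The space-filling curve on the slide packs two dimensions into a single sort order by interleaving coordinate bits (Z-order / Morton encoding). A minimal illustrative sketch, not Neustar's code:

```python
def morton_interleave(x: int, y: int, bits: int = 2) -> int:
    """Map (x, y) to a Z-order (Morton) index by interleaving their bits."""
    z = 0
    for i in range(bits):
        z |= ((y >> i) & 1) << (2 * i)      # y bit -> even position
        z |= ((x >> i) & 1) << (2 * i + 1)  # x bit -> odd position
    return z

# Visiting the 4x4 grid of 2-bit coordinates in index order traces the
# Z-shaped space-filling curve from the slide: nearby cells get nearby indices.
order = sorted((morton_interleave(x, y), x, y)
               for x in range(4) for y in range(4))
```

Nearby (x, y) cells land close together in the one-dimensional index, which is why this kind of encoding is useful when a single sortkey has to serve multi-dimensional range queries.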
Frequency
Attribution
Overlap
Ad-hoc
0.7B / day
2B / week
8B / month
21B / quarter
-- Number of ads seen per user
WITH frequency_intermediate AS (
    SELECT user_id,
           SUM(1)       AS impression_count,
           SUM(cost)    AS cost,
           SUM(revenue) AS revenue
    FROM impressions
    WHERE record_date BETWEEN <...>
    GROUP BY 1
)
-- Number of people who saw N ads
SELECT impression_count, SUM(1), SUM(cost), SUM(revenue)
FROM frequency_intermediate
GROUP BY 1;
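The two-pass aggregation above (per-user counts, then a histogram over those counts) can be exercised locally with Python's sqlite3. Table and column names follow the slide; the rows are invented, and the ordinal GROUP BY is spelled out as column names for portability:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE impressions (user_id INTEGER, cost REAL, revenue REAL);
    INSERT INTO impressions VALUES
        (1, 0.01, 0.0), (1, 0.01, 0.5),   -- user 1 saw 2 ads
        (2, 0.02, 0.0),                   -- user 2 saw 1 ad
        (3, 0.01, 0.0), (3, 0.03, 0.0);   -- user 3 saw 2 ads
""")

# Same shape as the slide's query: roll up per user, then histogram the counts.
rows = con.execute("""
    WITH frequency_intermediate AS (
        SELECT user_id,
               SUM(1)       AS impression_count,
               SUM(cost)    AS cost,
               SUM(revenue) AS revenue
        FROM impressions
        GROUP BY user_id
    )
    SELECT impression_count, SUM(1) AS users, SUM(cost), SUM(revenue)
    FROM frequency_intermediate
    GROUP BY impression_count
    ORDER BY impression_count
""").fetchall()
# each row: (impression_count, users, total_cost, total_revenue)
```

With this toy data, one user saw a single ad and two users saw two ads, so the histogram has rows for impression counts 1 and 2.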
CREATE TABLE frequency_intermediate (
    record_date      date   ENCODE <...> NOT NULL,
    campaign_id      bigint ENCODE <...> NOT NULL,
    site_id          bigint ENCODE <...> NOT NULL,
    user_id          bigint ENCODE <...> NOT NULL DISTKEY,
    impression_count int    ENCODE <...> NOT NULL,
    cost             bigint ENCODE <...> NOT NULL,
    revenue          bigint ENCODE <...> NOT NULL
) SORTKEY(<...>, <...>, <...>, <...>);
WITH user_frequency AS (
    SELECT user_id, campaign_id, site_id,
           SUM(impression_count) AS frequency,
           SUM(cost)    AS cost,
           SUM(revenue) AS revenue
    FROM frequency_intermediate
    WHERE record_date BETWEEN <...>
    GROUP BY 1,2,3
)
SELECT campaign_id, site_id, frequency,
       SUM(1), SUM(cost), SUM(revenue)
FROM user_frequency
GROUP BY 1,2,3;
-- Basic sessionization query, assemble user activity
-- that ended in a conversion into a timeline.
SELECT <...>
FROM impressions i
JOIN conversions c ON
i.user_id = c.user_id AND
i.record_date < c.record_date
ORDER BY i.record_date;
[Figure: example session timelines — impressions before a conversion labeled with funnel position (1, 2, 3) and hour offset from the conversion (3, 12, 16 hours)]
-- Sessionize user activity per conversion, partition by campaign (45-day lookback window)
SELECT c.record_date AS conversion_date ,
c.event_id AS conversion_id ,
i.campaign_id AS campaign_id ,
i.site_id AS site_id ,
i.user_id AS user_id ,
c.revenue AS conversion_revenue,
DATEDIFF('hour', i.record_date, c.record_date) AS hour_offset,
SUM(1) OVER (PARTITION BY i.user_id, i.campaign_id, c.event_id
ORDER BY i.record_date DESC ROWS UNBOUNDED PRECEDING) AS position
FROM impressions i
JOIN conversions c ON
i.user_id = c.user_id AND
i.campaign_id = c.campaign_id AND
i.record_date < c.record_date AND
i.record_date > (c.record_date - interval '45 days') AND
c.record_date BETWEEN <...>;
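The windowed sessionization above (position as a running count over impressions ordered newest-first) can be reproduced with SQLite 3.25+ window functions via Python's sqlite3. Names follow the slide, data is invented; `DATEDIFF` is Redshift-specific and omitted here:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE impressions (user_id INT, campaign_id INT, site_id INT,
                              record_date TEXT);
    CREATE TABLE conversions (user_id INT, campaign_id INT, event_id INT,
                              record_date TEXT, revenue REAL);
    INSERT INTO impressions VALUES
        (1, 10, 100, '2014-11-01 08:00'),
        (1, 10, 101, '2014-11-01 12:00'),
        (1, 10, 100, '2014-11-01 20:00');
    INSERT INTO conversions VALUES (1, 10, 555, '2014-11-01 23:00', 9.99);
""")

# Position 1 is the impression closest to the conversion (the last touch),
# mirroring the slide's SUM(1) OVER (... ORDER BY record_date DESC).
rows = con.execute("""
    SELECT c.event_id,
           i.site_id,
           SUM(1) OVER (PARTITION BY i.user_id, i.campaign_id, c.event_id
                        ORDER BY i.record_date DESC
                        ROWS UNBOUNDED PRECEDING) AS position
    FROM impressions i
    JOIN conversions c
      ON i.user_id = c.user_id
     AND i.campaign_id = c.campaign_id
     AND i.record_date < c.record_date
    ORDER BY position
""").fetchall()
# rows -> [(555, 100, 1), (555, 101, 2), (555, 100, 3)]
```

The 20:00 impression gets position 1 and the 08:00 impression position 3, i.e. positions count backwards from the conversion.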
-- Compute statistics on sessions (funnel placement, last-touch, site-count, etc...)
-- The window aggregate is computed in a CTE first, since window functions
-- cannot be nested inside GROUP BY aggregates.
WITH session_site_counts AS (
    SELECT campaign_id,
           site_id,
           conversion_date,
           position,
           conversion_revenue,
           COUNT(DISTINCT site_id)
               OVER (PARTITION BY user_id, campaign_id, conversion_id
                     ORDER BY position DESC
                     ROWS UNBOUNDED PRECEDING) AS preceding_site_count
    FROM sessions
)
SELECT campaign_id,
       site_id,
       conversion_date,
       AVG(position) AS average_position,
       SUM(CASE WHEN position = 1 THEN conversion_revenue ELSE 0 END) AS lta_attributed,
       AVG(preceding_site_count) AS average_unique_preceding_site_count
FROM session_site_counts
GROUP BY 1,2,3;
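Last-touch attribution, as used above, credits all conversion revenue to the position-1 impression. A sketch over a toy sessions table (names follow the slide; the rows are invented):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE sessions (campaign_id INT, site_id INT, conversion_date TEXT,
                           position INT, conversion_revenue REAL);
    INSERT INTO sessions VALUES
        (10, 100, '2014-11-01', 1, 9.99),   -- last touch: site 100 gets credit
        (10, 101, '2014-11-01', 2, 9.99),
        (10, 100, '2014-11-01', 3, 9.99);
""")

rows = con.execute("""
    SELECT campaign_id, site_id,
           AVG(position) AS average_position,
           SUM(CASE WHEN position = 1 THEN conversion_revenue ELSE 0 END)
               AS lta_attributed
    FROM sessions
    GROUP BY campaign_id, site_id
    ORDER BY site_id
""").fetchall()
# Site 100 appears at positions 1 and 3; only position 1 earns attribution.
```

Site 101, which only ever appears mid-funnel, attributes zero revenue under last-touch even though it touched the same converting user.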
        Site A  Site B  Site C
Site A          20%     60%
Site B                  90%
Site C
CPM     $0.06   $1.05   $9.50
CREATE TABLE overlap_intermediate (
    user_id bigint ENCODE <...> NOT NULL DISTKEY,
    site_id bigint ENCODE <...> NOT NULL
) SORTKEY(<...>);
WITH co_occurrences AS (
    SELECT oi.site_id  AS site1,
           oi2.site_id AS site2
    FROM overlap_intermediate oi
    JOIN overlap_intermediate oi2 ON
        oi.site_id > oi2.site_id AND
        oi.user_id = oi2.user_id
)
SELECT site1, site2, SUM(1)
FROM co_occurrences
GROUP BY 1,2;
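The triangular self-join (site1 > site2) counts each pair of sites once per user who appears on both. It can be checked with Python's sqlite3 on invented data:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE overlap_intermediate (user_id INT, site_id INT);
    INSERT INTO overlap_intermediate VALUES
        (1, 100), (1, 101),            -- user 1 seen on sites 100 and 101
        (2, 100), (2, 101), (2, 102),  -- user 2 seen on all three sites
        (3, 102);                      -- user 3 seen on one site only
""")

# site_id > site_id dedupes pairs: (101, 100) is counted, (100, 101) is not.
rows = con.execute("""
    SELECT oi.site_id AS site1, oi2.site_id AS site2, SUM(1) AS users
    FROM overlap_intermediate oi
    JOIN overlap_intermediate oi2
      ON oi.site_id > oi2.site_id
     AND oi.user_id = oi2.user_id
    GROUP BY site1, site2
    ORDER BY site1, site2
""").fetchall()
# rows -> [(101, 100, 2), (102, 100, 1), (102, 101, 1)]
```

Users 1 and 2 overlap on sites 100/101, while only user 2 links site 102 to the other two; user 3 contributes no pairs.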
CREATE TABLE overlap_intermediate (
    record_date date   ENCODE <...> NOT NULL,
    campaign_id bigint ENCODE <...> NOT NULL,
    site_id     bigint ENCODE <...> NOT NULL,
    user_id     bigint ENCODE <...> NOT NULL DISTKEY
) SORTKEY(<...>, <...>);
WITH
site_overlap_intermediate AS (
    SELECT user_id, site_id, campaign_id
    FROM overlap_intermediate
    WHERE record_date BETWEEN <...>
    GROUP BY 1,2,3
),
site_co_occurrences AS (
    SELECT oi.campaign_id AS c_id, oi.site_id AS site1, oi2.site_id AS site2
    FROM site_overlap_intermediate oi
    JOIN site_overlap_intermediate oi2 ON
        oi.site_id > oi2.site_id AND
        oi.user_id = oi2.user_id AND
        oi.campaign_id = oi2.campaign_id
)
SELECT c_id, site1, site2, SUM(1)
FROM site_co_occurrences
GROUP BY 1,2,3;
8 fact tables
26 dimension tables
7 mapping tables
42 views
121 joins
1100 sloc
$ pg_dump -Fc -f some_file --table=foo --table=bar <dbname>
$ pg_restore --schema-only --clean -Fc some_file > schema.sql
$ pg_restore --data-only --table=foo -Fc some_file > foo.tsv
$ aws s3 cp schema.sql s3://metadata-bucket/YYYYMMDD/schema.sql
$ aws s3 cp foo.tsv s3://metadata-bucket/YYYYMMDD/foo.tsv
> \i schema.sql
> COPY foo FROM 's3://metadata-bucket/YYYYMMDD/foo.tsv' <...>
# or combine 'COPY <...> FROM <...> SSH' and pg_restore/psql
UNLOAD
('
  SELECT i.*
  FROM impressions i
  JOIN client_to_campaign_mapping m ON
      m.campaign_id = i.campaign_id
  WHERE i.record_date >= \'{{yyyy}}-{{mm}}-{{dd}}\' - interval \'1 day\' AND
        i.record_date <  \'{{yyyy}}-{{mm}}-{{dd}}\' AND
        m.client_id = <...>
')
TO 's3://{{bucket}}/us_eastern/{{yyyy}}/{{mm}}/{{dd}}/dsdk_events/{{vers}}/impressions/'
WITH CREDENTIALS 'aws_access_key_id={{key}};aws_secret_access_key={{secret}}'
DELIMITER ',' NULL '\\N' ADDQUOTES ESCAPE GZIP MANIFEST;
Workload                                     Node Count  Node Type    Restore  Maint.  Exec.
Frequency & Attribution & Overlap & Ad-hoc       16      dw2.8xlarge    2h       1h     6h
= $691.20

Workload     Node Count  Node Type    Restore  Maint.  Exec.
Frequency        8       dw2.8xlarge   1.5h    0.5h    2.5h
Attribution      8       dw2.8xlarge   1.5h    0.5h    2h
Overlap          8       dw2.8xlarge   1h      0.5h    2.5h
Ad-hoc           8       dw2.8xlarge   0h      0.5h    1.5h
= $556.80 (-19%)
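The dollar figures are consistent with node-hours times a single hourly rate of about $4.80 per dw2.8xlarge node; the rate is inferred from the totals rather than stated on the slide. A quick arithmetic check:

```python
rate = 4.80  # $/node-hour, implied by $691.20 / (16 nodes * 9 h)

# One 16-node cluster running all workloads: 2h restore + 1h maint + 6h exec.
single = 16 * (2 + 1 + 6) * rate

# Four 8-node clusters, one per workload, each restore + maint + exec hours.
split = sum(8 * hours * rate for hours in
            [1.5 + 0.5 + 2.5,   # Frequency
             1.5 + 0.5 + 2.0,   # Attribution
             1.0 + 0.5 + 2.5,   # Overlap
             0.0 + 0.5 + 1.5])  # Ad-hoc

print(round(single, 2), round(split, 2), round(1 - split / single, 3))
# -> 691.2 556.8 0.194  (about 19% cheaper)
```

Splitting the fleet wins because the ad-hoc cluster skips the restore entirely and the per-workload clusters spend fewer total node-hours than one shared cluster serializing everything.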
http://bit.ly/awsevals