
(Mobile Web Applications track) "Profiling User Activities with Minimal Traffic Traces" - Tiep Mai, Deepak Ajwani and Alessandra Sala (ICWE, v3 b)


COPYRIGHT © 2011 ALCATEL-LUCENT. ALL RIGHTS RESERVED. ALCATEL-LUCENT — INTERNAL PROPRIETARY — USE PURSUANT TO COMPANY INSTRUCTION


Profiling User Activities With Minimal Traffic Traces

Tiep Mai, Deepak Ajwani and Alessandra Sala

Bell Laboratories, Ireland


Outline

Telecom data and privacy issue

Truncated URL dataset

User behavior analysis on limited data

• Micro-action burst decomposition

• Representative URL selection

Future work and Conclusions


End-to-End View of the Telecom Network

Mobile user

Webservices

Client-side data

Server-side data

Telecom data

Huge data but with limited features

Empower telecom data analysis with this data


Providing Personalized Services

• Personalized services require user activity profiling

Traditional approaches rely on features extracted from rich data sources:

Server-side data: full URLs of visited pages, page categories, transaction data, search queries, click-through rate, etc.

Client-side data: full URLs (cookies), application data (web browsing), etc.

Network-side data: full URLs, HTTP packet content, etc.

• Our goal: Provide medium-grained user profiling for a large user pool from a limited, privacy-preserving dataset

User privacy considerations


Mobile Web Traces: User Behavioral Analysis from Timestamped Data

• Mobile traces provide valuable insights into user behavior

Critical to enable service personalization and enrich the user’s online experience

• Complete mobile web traces risk revealing sensitive information

http://finance.yahoo.com/q?s=BAC : Bank of America Corp. stock price

https://www.google.ie/#q=postnatal+depression : sensitive health condition

http://www.amazon.com/Dell-Inspiron-i15R-15-6-inch-Laptop/dp/B009US2BKA : specific purchased product


Removing Sensitive Data from URL Traces

• Telecom operators are subject to restrictive privacy legislation

• Conservative approach to sharing data: anonymized, truncated and sampled data

Traces from 10,000 anonymized users over 30 days, i.e. more than 130 million records

• Focus on the dataset of truncated URLs or IP addresses

• Resulting data:

1. Truncated: www.amazon.com/Dell-Inspiron-i15R-15-6-inch-Laptop/dp/B009US2BKA is reduced to www.amazon.com

2. Noisy: unintentional web traffic such as advertisements, web analytics, etc.

Quality of behavior analysis depends on effectively separating unintentional traffic from user activities on truncated URLs


Web Browsing Behaviors Across Time & Users

• Collection of web traces of several URL types

• Aim: filter out traces that do not represent explicit user actions

Identify features to drive the detection of unintentional traces

Validate across different users

• Diversity of web domains:

[Figure panels: High diversity in user activities; High diversity across users]


Methodology Approach

• User activities as a collection of micro user actions, i.e. bursts

Web clicks, chat replies

• Assumption: Each burst represents an atomic user activity

Combination of intended and unintended web traffic

• Methodology

1. Burst decomposition

2. Activity extraction:

Domain classification: leverage specialized features of domain appearance in the burst

Online representative URL selection and activity association

Increases prediction accuracy by 20%


Burst Decomposition – Statistical Parametric Distribution Fitting

• Goal: Decompose the web-trace back into constituent data bursts

• This requires a threshold on the packet inter-arrival time (IAT) to separate traces into bursts

• Study the inter-arrival time distribution

• No parametric distribution would match most user traces
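
Once an IAT threshold is available, the decomposition itself can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `split_into_bursts`, the use of plain float timestamps and the threshold value in the example are all assumptions.

```python
from typing import List, Sequence


def split_into_bursts(timestamps: Sequence[float],
                      iat_threshold: float) -> List[List[float]]:
    """Group record timestamps of one user into bursts.

    A new burst starts whenever the gap to the previous record exceeds
    iat_threshold (same time unit as the timestamps). Hypothetical helper.
    """
    bursts: List[List[float]] = []
    for t in sorted(timestamps):
        if bursts and t - bursts[-1][-1] <= iat_threshold:
            bursts[-1].append(t)   # gap is small: same burst
        else:
            bursts.append([t])     # gap is large (or first record): new burst
    return bursts


# Illustrative values only (seconds); the threshold 2.0 is an assumption.
example_bursts = split_into_bursts([0.0, 0.4, 0.9, 12.0, 12.3, 40.0],
                                   iat_threshold=2.0)
```

With these illustrative values the six records fall into three bursts, separated at the two gaps longer than the threshold.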


Burst Decomposition Algorithm

• Robust burst decomposition algorithm that is independent of the distribution shape

• Starting from the smallest inter-arrival time, find the threshold at which the additional probability gained by increasing the threshold further is insignificant compared to the probability accumulated up to that point
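
One possible reading of this rule, sketched on the empirical distribution of inter-arrival times. The slides do not give the exact criterion, so the growth factor `growth`, the significance ratio `eps` and the function name `find_iat_threshold` are assumptions made for illustration only.

```python
from bisect import bisect_right
from typing import Sequence


def find_iat_threshold(iats: Sequence[float],
                       growth: float = 2.0,
                       eps: float = 0.05) -> float:
    """Distribution-free threshold search on inter-arrival times.

    Walks the observed IATs from the smallest upwards and returns the first
    value t where the extra probability gained by extending t to t * growth
    is below eps times the probability accumulated up to t.
    growth and eps are hypothetical tuning parameters.
    """
    values = sorted(iats)
    n = len(values)
    if n == 0:
        raise ValueError("need at least one inter-arrival time")

    def cdf(x: float) -> float:
        # Empirical CDF: fraction of observed IATs that are <= x.
        return bisect_right(values, x) / n

    for t in values:
        if t <= 0:
            continue  # multiplying by growth would not extend the range
        accumulated = cdf(t)
        extended = cdf(t * growth) - accumulated
        if extended < eps * accumulated:
            return t
    return values[-1]


# Illustrative IATs in seconds: many short gaps, two long pauses.
threshold = find_iat_threshold([0.1, 0.2, 0.2, 0.3, 0.4, 0.5, 30.0, 45.0])
```

The search makes no assumption about the shape of the distribution; it only needs the sorted observations, which matches the stated motivation that no parametric family fits most user traces.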


Domain Classification – Initial Insight

• Goal: automatically identify URLs representing user activities

• Measurements are aggregated for all users for each domain

Record-level measurements

Burst-level measurements


Domain Classification - Methodology

• Logistic regression

• Validation error and AIC, BIC

• Two discriminating features

$o_{b,j=1} - u_{b,j=1}$ (~ 22.87): probability that a domain comes first in bursts with more than one unique domain

$u_{b,j=2}$ (~ -9.51): probability that a domain appears in bursts with two unique domains
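
The two burst-level features can be estimated per domain roughly as follows. This is a sketch under stated assumptions: each burst is represented as an ordered list of truncated domains, probabilities are estimated relative to the number of bursts that contain the domain (the slides do not state the denominator), and `burst_features` is a hypothetical helper name.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def burst_features(bursts: List[List[str]]) -> Dict[str, Tuple[float, float]]:
    """Per-domain estimates of the two burst-level features.

    Returns {domain: (first_in_multi, in_two_unique)} where
      first_in_multi: share of the domain's bursts in which it comes first
                      and the burst has more than one unique domain,
      in_two_unique:  share of the domain's bursts that contain exactly
                      two unique domains.
    The denominator (bursts containing the domain) is an assumption.
    """
    containing: Dict[str, int] = defaultdict(int)
    first_multi: Dict[str, int] = defaultdict(int)
    two_unique: Dict[str, int] = defaultdict(int)

    for burst in bursts:
        unique = set(burst)
        for domain in unique:
            containing[domain] += 1
            if len(unique) == 2:
                two_unique[domain] += 1
        if len(unique) > 1:
            first_multi[burst[0]] += 1  # burst[0] is the first record's domain

    return {d: (first_multi[d] / containing[d], two_unique[d] / containing[d])
            for d in containing}


# Made-up domains for illustration only.
features = burst_features([
    ["news.example", "ads.example", "cdn.example"],
    ["news.example", "ads.example"],
    ["ads.example"],
])
```

In this toy input the content domain leads every multi-domain burst it appears in, while the advertisement domain never comes first, which is the kind of contrast the first feature is meant to capture.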


Trade-offs of Domain Classification Results

• Trade-off between accuracy, sensitivity, precision and specificity

Maximizing accuracy

Maximizing sensitivity and specificity
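
For reference, the four measures named above follow the standard confusion-matrix definitions. The sketch below uses made-up counts, and the function name `classification_metrics` is hypothetical; which class counts as positive (for example, domains representing user activity) is an assumption.

```python
import math
from typing import Dict


def classification_metrics(tp: int, fp: int, tn: int, fn: int) -> Dict[str, float]:
    """Standard binary-classification measures from confusion-matrix counts."""

    def ratio(num: int, den: int) -> float:
        # Return NaN instead of raising when a measure is undefined.
        return num / den if den else math.nan

    return {
        "accuracy": ratio(tp + tn, tp + fp + tn + fn),
        "sensitivity": ratio(tp, tp + fn),   # true positive rate (recall)
        "precision": ratio(tp, tp + fp),
        "specificity": ratio(tn, tn + fp),   # true negative rate
    }


# Made-up counts for illustration only.
metrics = classification_metrics(tp=40, fp=10, tn=45, fn=5)
```

Moving the decision threshold of the logistic regression changes these counts, so a threshold chosen to maximize accuracy generally differs from one chosen to balance sensitivity and specificity.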


Future Work

• Mapping domains to activities (reading, shopping, browsing) and identifying user activities online

• Activity query and recommendation

• Correlating truncated URL data with user location data

Spatio-temporal study of user activities


Conclusions and Remarks

• Telecom data: Huge but limited; Strict privacy regulations

• URL trace data:

Privacy preservation with truncation

Noisy data

Burst property of micro user actions

• Goal: Perform activity extraction and behavior analysis for a large user pool with limited and noisy data

• Method:

Burst decomposition and feature extraction

Representative URL identification and activity extraction

Medium-grained behavior analysis is feasible with limited, noisy, privacy-preserving URL data


Thank you

• Questions?