Content-Based Access Control Wenrong Zeng [email protected] Dept. of Electrical Engineering and Computer Science Advisor: Dr. Luo Committee Members: Dr

Content-Based Access Control Wenrong Zeng

[email protected]

Dept. of Electrical Engineering and Computer Science

Advisor: Dr. Luo

Committee Members: Dr. Agah

Dr. Grzymala-Busse

Dr. Kulkarni

Dr. Ho

Acknowledgements

I owe my thanks to my committee members:

• Dr. Luo
• Dr. Agah
• Dr. Grzymala-Busse
• Dr. Kulkarni
• Dr. Ho

Outline

• Introduction

• The Content-Based Access Control Model

• Content-Based Access Control Enforcement

• Content-Based Blocking and Tagging

• Multi-Label Learning

• Discussion and Conclusion


Introduction

• Education Background & Publication

• Motivation & Goal

• Background & Related Work


Education Background

• B.S., Peking University, 2006
– Major: Electrical Engineering

• M.E., Chinese Academy of Sciences, 2009
– Major: Computer Science

• PhD student, University of Kansas, present
– Major: Computer Science

Publication

– Wenrong Zeng, Yuhao Yang, Bo Luo: Using Data Content to Assist Access Control for Large-Scale Content-Centric Databases. In IEEE International Conference on Big Data (IEEE BigData), 2014 (Acceptance rate: 18.5%).

– Wenrong Zeng, Xuewen Chen, Hong Cheng: Pseudo labels for imbalanced multi-label learning. DSAA 2014: 25-31.

– Wenrong Zeng, Yuhao Yang, Bo Luo: Access control for big data using data content. In IEEE International Conference on Big Data (IEEE BigData), 2014: 45-47 (Poster).

– Wenrong Zeng, Xuewen Chen, Hong Cheng, Jing Hua: Multi-Space Learning for Image Classification Using AdaBoost and Markov Random Fields. Solving Complex Machine Learning Problems with Ensemble Methods Workshop, 2013.

– Yi Jia, Wenrong Zeng, Jun Huan: Non-stationary bayesian networks based on perfect simulation. In ACM Conference on Information and Knowledge Management, 2012: 1095-1104 (Acceptance rate: 13.4%).

– Weiming Hu, Dan Xie, Zhouyu Fu, Wenrong Zeng, Stephen J. Maybank: Semantic-Based Surveillance Video Retrieval. IEEE Transactions on Image Processing 16(4): 1168-1181 (2007).

Motivation

• Data tends to be content-centric.
– Healthcare: 500 million patient databases nationwide.
– Telecom: the largest single database by volume, at 312 TB, holds AT&T’s calling records.
– Business: by 2020, business transactions on the Internet will reach 450 billion per day (IDC).

• Nearly every field faces big data issues.

[Figure: Big Data characterized by Volume, Velocity, and Variety. Source: http://stablekernel.com/blog/wp-content/uploads/2015/02/Big-Data.jpg]

Motivation

• With big data, conventional database access control mechanisms may be insufficient.

• Long-term goal: smart access control decisions for big data without extensive labor from the DBA.

[Image source: http://blog.varonis.com/big-data-security/]

Motivation

• Example
– A law enforcement agency (e.g. FBI) holds a database of highly sensitive case records.
    • Large number of records
    • Unstructured content
– A supervisor assigns a case to agent Alice for investigation.
– Naturally, the supervisor also needs to grant Alice access to all related or similar cases.

Motivation

• Manual assignment
– The supervisor manually selects “related cases”.
– Extremely labor intensive, practically impossible.

• Multi-level security
– Alice can access all the cases with equal or lower security levels.
– Over-privileged users!

• Attribute-based access control
– E.g. Alice can access all the robbery records within 5 years, in Area X, in which the suspect is 6 feet tall.
– Attributes require manual input and are usually not available.

Goals

• Assumptions:
– Basic privileges: users are authenticated with basic trust (e.g. with MLS).
– Data-driven: with large amounts of content-centric data, the access control model must be data-driven.
– Lack of explicit authorization.
– Approximation is allowed.

Smart access control decisions:

• Develop a content-based access control model that is data-driven.

• Enforce the content-based access control model efficiently.

Conventional Methods

• Role-Based Access Control:
Bob (subject), an adult (role), can drink wine (object).

• Attribute-Based Access Control:
Bob (subject), who is 24 years old (age attribute), can drink wine (object).

Current Issues

• Difficult to define granular access controls.

• Lack the ability to implement abstract access control policies (e.g. similar documents).

• Access control models are NOT content-driven.

“A truly comprehensive approach for data protection must include mechanisms for enforcing access control policies based on data contents ….”

E. Bertino et al. Access Control for Databases: Concepts and Systems. Vol. 3, No. 1-2. Now Publishers Inc, 2011.

Text Feature Extraction

– TF-IDF: Term Frequency-Inverse Document Frequency
– Topic Modeling: Non-negative Matrix Factorization based on TF-IDF

Both are inherently term-distribution features.

Text Feature Extraction

– Where term-distributed features fall short:

D1: privacy preserving similarity assessment for semi-structured data

D2: private XML document matching

According to TF-IDF, the cosine similarity of D1 and D2 is 0, because the two documents share no terms.
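The zero similarity between D1 and D2 can be reproduced with a minimal sketch. It assumes whitespace tokenization and a smoothed IDF; both are illustrative choices, not the weighting used in the talk's Oracle setup.

```python
import math
from collections import Counter

def tokenize(text):
    # Assumption: lowercase whitespace tokenization, no stemming.
    return text.lower().split()

def tfidf_vectors(docs):
    tokenized = [tokenize(d) for d in docs]
    n = len(tokenized)
    # Document frequency: number of documents containing each term.
    df = Counter(term for toks in tokenized for term in set(toks))
    vectors = []
    for toks in tokenized:
        tf = Counter(toks)
        # Smoothed IDF (an assumption) keeps weights positive in a tiny corpus.
        vectors.append({t: tf[t] * (math.log((1 + n) / (1 + df[t])) + 1) for t in tf})
    return vectors

def cosine(u, v):
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

d1 = "privacy preserving similarity assessment for semi-structured data"
d2 = "private XML document matching"
v1, v2 = tfidf_vectors([d1, d2])
sim_d1_d2 = cosine(v1, v2)
print(sim_d1_d2)  # 0.0: "privacy" and "private" are different terms
```

Without stemming or semantic annotation, "privacy" and "private" occupy different dimensions, so no term overlap exists between the two documents.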


Text Feature Extraction

– TAGME: Topic Modeling with Wikipedia Curated Annotation

Doc No.  Word(s)               Topic Annotation             Weight
D1       privacy               Privacy                      0.1279
         preserving            Historic preservation        0.0017
         similarity            Homology (biology)           0.0354
         assessment            Homology (biology)           0.0521
         for
         semi-structured data  Semi-structured data         0.2727
D2       private               Privacy                      0.1256
         XML                   XML                          0.5375
         document              Document management system   0.1475
         matching              Matching principle           0.0509
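With the topic annotations in the table above, D1 and D2 share the topic "Privacy", so a topic-level cosine similarity is no longer zero. The sketch below uses the table's weights. Summing the two "Homology (biology)" weights for D1 is an assumption made for illustration; the talk does not say how repeated topics are aggregated.

```python
import math

# Topic vectors built from the TAGME table; weights copied from the slide.
d1_topics = {
    "Privacy": 0.1279,
    "Historic preservation": 0.0017,
    "Homology (biology)": 0.0354 + 0.0521,  # assumption: repeated topic weights are summed
    "Semi-structured data": 0.2727,
}
d2_topics = {
    "Privacy": 0.1256,
    "XML": 0.5375,
    "Document management system": 0.1475,
    "Matching principle": 0.0509,
}

def cosine(u, v):
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

topic_sim = cosine(d1_topics, d2_topics)
print(topic_sim > 0.0)  # True: the shared "Privacy" topic contributes to the dot product
```

The similarity stays small because only one topic is shared, but it is enough to rank D2 above documents with no topical overlap at all.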


The Content-Based Access Control Model

• Two-phase model
– Initial authorization
– Content-based authorization

• Initial authorization
– Conventional access control policy
– Each CBAC user is explicitly granted access to a small set of records, the seed set (referred to as the base set in later slides), chosen by:
    • Manual selection
    • Attribute-based rules
    • Request by the user

The Content-Based Access Control Model

• Content-based authorization
– Content-based access control policy
– Dynamic “sign” function
– To be evaluated on-the-fly
– {true, false} based on content similarity between the base set and the object record
– Similarity function
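The content-based authorization step can be sketched as a boolean function over a base set and a candidate record. The sketch assumes cosine similarity over sparse term-weight vectors, a maximum over the base set, and a fixed threshold. The name `cbac_allows`, the aggregation rule, and the threshold value are hypothetical, since the slide's formulas did not survive extraction.

```python
import math

def cosine(u, v):
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def cbac_allows(base_set, record, threshold):
    """Hypothetical 'sign' function: True if the record is similar enough
    to at least one record in the user's base set."""
    return any(cosine(seed, record) >= threshold for seed in base_set)

# Toy term-weight vectors (illustrative values only).
base_set = [{"robbery": 1.0, "bank": 1.0}]
similar_case = {"robbery": 1.0, "bank": 0.5}
unrelated_case = {"fraud": 1.0}

allow_similar = cbac_allows(base_set, similar_case, threshold=0.5)
allow_unrelated = cbac_allows(base_set, unrelated_case, threshold=0.5)
print(allow_similar, allow_unrelated)  # True False
```

Because the decision depends on the current base set and record content, it has to be evaluated at query time rather than stored as a static grant.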


The Content-Based Access Control Model

• Content modeling

– In

similarity function defined for term or topic ax

– Unstructured text attributes (CLOB, Text)
    • Any text modeling approach could be used
    • We utilized the vector space model (TF/IDF) in Oracle.


Content-Based Access Control Enforcement

• Settings
– UCI KDD NSF award data: abstracts represent the content-rich information
– Use MIT’s SCIgen to add approximately 20x noise data: 2.7M records
– Base set: awards PI-ed by the user
– CBAC enforced with Oracle Virtual Private Database
– Tables as follows

Content-Based Access Control Enforcement

• Experiment
– The database runs on a 64-bit Windows 7 system with an Intel(R) Core(TM) 2 Duo CPU E8500 @ 3.16GHz and 4.0GB RAM
– Log in as 60 randomly selected users to issue the following queries via PL/SQL:

Content-Based Access Control Enforcement

• Experiment
– Three different scenarios for access control:
    • (R1) an attribute-based access control (ABAC) rule: the user is allowed to access records in a division where he/she has PI-ed an award
    • (R2) a content-based access control (CBAC) rule: the user is only allowed to access awards whose abstracts are similar to those of the awards in his/her base set
    • (R3) a combined (ABAC+CBAC) rule: R1 AND R2.
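The three rules can be expressed as predicates over award records. This is a sketch under stated assumptions: the `Award` type, the division codes, the Jaccard term-overlap similarity (a stand-in for the TF/IDF similarity used in the experiments), and the 0.4 threshold are all hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Award:
    award_id: int
    division: str
    terms: frozenset  # abstract reduced to a set of terms (illustrative)

def jaccard(a, b):
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def r1_abac(base_set, award):
    # R1: the award is in a division where the user has PI-ed an award.
    return award.division in {seed.division for seed in base_set}

def r2_cbac(base_set, award, threshold=0.4):
    # R2: the abstract is similar to some abstract in the user's base set.
    return any(jaccard(seed.terms, award.terms) >= threshold for seed in base_set)

def r3_combined(base_set, award, threshold=0.4):
    # R3: R1 AND R2.
    return r1_abac(base_set, award) and r2_cbac(base_set, award, threshold)

base_set = [Award(1, "CCF", frozenset({"privacy", "database", "access"}))]
candidates = [
    Award(2, "CCF", frozenset({"privacy", "access", "control"})),
    Award(3, "CCF", frozenset({"compiler", "optimization"})),
    Award(4, "DMS", frozenset({"privacy", "database", "access", "audit"})),
]

r1_ids = [a.award_id for a in candidates if r1_abac(base_set, a)]
r2_ids = [a.award_id for a in candidates if r2_cbac(base_set, a)]
r3_ids = [a.award_id for a in candidates if r3_combined(base_set, a)]
print(r1_ids, r2_ids, r3_ids)  # [2, 3] [2, 4] [2]
```

In this toy data, R1 alone admits an unrelated award from the same division, R2 alone admits a similar award from another division, and R3 keeps only the award that satisfies both.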


Content-Based Access Control Enforcement

• Basic On-the-Fly CBAC Threshold Results

[Figure panels: ABAC Query1; ABAC Query2; CBAC Threshold Query1; CBAC Threshold Query2; ABAC+CBAC Threshold Query1; ABAC+CBAC Threshold Query2]

Content-Based Access Control Enforcement

• Basic On-the-Fly Top-10 CBAC Results

[Figure panel: Offline CBAC Results]

Content-Based Access Control Enforcement

• Issues with CBAC:
– Efficiency: content-based similarity assessment is slow
– Accuracy: the vector space model suffers from lexical ambiguity, especially for short text snippets (e.g. tweet messages)


Content-Based Blocking and Tagging

• Content-based Blocking:
– Pre-partition records into semantically similar clusters
– Base set s is first compared with cluster centroids
– Query is only evaluated against the top x clusters
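The blocking steps above can be sketched as follows, assuming the clusters and their centroids were computed offline. The cluster names, record ids, and the rule of scoring a cluster by its best match against any base-set record are hypothetical; the base set is assumed to be non-empty.

```python
import math

def cosine(u, v):
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def top_clusters(base_set, centroids, x):
    # Score each cluster by its best similarity to any base-set record.
    scores = {cid: max(cosine(seed, c) for seed in base_set)
              for cid, c in centroids.items()}
    return sorted(scores, key=scores.get, reverse=True)[:x]

def blocked_candidates(base_set, clusters, centroids, x):
    # Only records in the top x clusters are passed on to the full CBAC check.
    return [rid for cid in top_clusters(base_set, centroids, x)
            for rid in clusters[cid]]

# Toy centroids and cluster membership (illustrative values only).
centroids = {
    "security": {"privacy": 1.0, "access": 1.0},
    "biology": {"gene": 1.0, "protein": 1.0},
    "compilers": {"compiler": 1.0},
}
clusters = {"security": ["r1", "r2"], "biology": ["r3"], "compilers": ["r4", "r5"]}
base_set = [{"privacy": 1.0}]

selected = top_clusters(base_set, centroids, x=1)
candidates = blocked_candidates(base_set, clusters, centroids, x=1)
print(selected, candidates)  # ['security'] ['r1', 'r2']
```

The per-record similarity check then runs over two records instead of five; on a 2.7M-record table, the saving depends on how evenly the clusters are sized and how small x can be kept.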

[Figure: Before Blocking]

Content-Based Blocking and Tagging

[Figure: After Blocking]

[Figure panels: CBAC Threshold Query1 with Blocking; CBAC Threshold Query2 with Blocking; ABAC+CBAC Threshold Query1 with Blocking; ABAC+CBAC Threshold Query2 with Blocking; CBAC Top-10 with Blocking]

Content-Based Blocking and Tagging

• Data annotation is performed off-line, so efficiency is not an issue.

• We use:
– Non-negative Matrix Factorization with 10, 20, 50 and 100 “topics”
– TAGME: Wikipedia annotation of text

Content-Based Blocking and Tagging

• Tag quality is further ensured by removing noisy tags with a threshold cut-off.

[Figure panels: NMF with 100 topics; TAGME]
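A threshold cut-off on tag weights can be sketched with the D1 annotations from the earlier TAGME table. The cut-off value of 0.1 is a hypothetical choice for illustration; the talk does not state the value used.

```python
# (word, topic annotation, weight) triples for D1, copied from the TAGME table.
d1_tags = [
    ("privacy", "Privacy", 0.1279),
    ("preserving", "Historic preservation", 0.0017),
    ("similarity", "Homology (biology)", 0.0354),
    ("assessment", "Homology (biology)", 0.0521),
    ("semi-structured data", "Semi-structured data", 0.2727),
]

def prune_tags(tags, cutoff):
    # Keep only annotations whose weight reaches the cut-off.
    return [(word, topic, w) for word, topic, w in tags if w >= cutoff]

kept = prune_tags(d1_tags, cutoff=0.1)  # cutoff is an assumed value
kept_topics = [topic for _, topic, _ in kept]
print(kept_topics)  # ['Privacy', 'Semi-structured data']
```

The spurious annotations ("Historic preservation", "Homology (biology)") carry low weights and are dropped, while the two topics that describe the document survive.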


[Figure panels: CBAC Threshold Query1 with Tagging; CBAC Threshold Query2 with Tagging; ABAC+CBAC Threshold Query1 with Tagging; ABAC+CBAC Threshold Query2 with Tagging; CBAC Top-10 with Tagging]

[Figure: CBAC Top-10 with Blocking + Tagging]

• Soundness of CBAC


Multi-Label Learning

• Motivation
– We learned that curated annotation with domain knowledge provides accurate annotation and boosts efficiency.
– The question arises: if the domain is not covered by the Wikipedia database (e.g. “ylw banana” mapped to food, fruit, snack), where should the topic annotation come from?
– Multi-label learning can learn from a labeled subset of samples and predict the labels of the remaining samples, which facilitates topic (label) annotation.
– We will dive into multi-label learning.

Multi-Label Learning

• Multi-component labels: Yosemite Valley, Waterfall, Mountain, “View”, Snow, Blue Sky

• Hierarchical labels: Women’s Apparel, Tops, Silk Tops, Silk Short Sleeve

• Multi-facet labels: Painter, Sculptor, Architect, Musician, Mathematician, Engineer, Inventor, Anatomist, Geologist, Cartographer, Botanist, Writer

• Ambiguity of Labels
– Example labels: Mountain View, Yosemite, Apple
– Label correlation helps to eliminate ambiguity of labels

Multi-Label Learning

• Uneven label distribution leads to the imbalance problem

[Figure labels: Fruit; Grape]

Multi-Label Learning

• Pros & cons of current problem transformations for MLL
– Binary relevance treats MLL as a set of independent binary classification problems, one per label
    • Pros: simple, easy to parallelize
    • Cons: completely neglects the dependencies among labels
– Label power-set adds a preprocessing step that treats each distinct label combination as a new label
    • Pros: encodes the correlation among labels
    • Cons: introduces false correlations; worsens the imbalance problem
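The two problem transformations can be contrasted on a toy label set. The labels and samples below are made up for illustration; only the transformations themselves are standard.

```python
labels = ["fruit", "grape", "snack"]
# Each training sample carries a set of labels (toy data).
Y = [{"fruit"}, {"fruit", "grape"}, {"fruit", "snack"}, {"fruit"}]

def binary_relevance_targets(Y, labels):
    # One independent 0/1 target column per label.
    return {label: [int(label in y) for y in Y] for label in labels}

def label_powerset_targets(Y):
    # Each distinct label combination becomes one class id,
    # numbered in order of first appearance.
    class_ids = {}
    targets = []
    for y in Y:
        key = frozenset(y)
        if key not in class_ids:
            class_ids[key] = len(class_ids)
        targets.append(class_ids[key])
    return targets, class_ids

br = binary_relevance_targets(Y, labels)
lp, lp_classes = label_powerset_targets(Y)
print(br)  # {'fruit': [1, 1, 1, 1], 'grape': [0, 1, 0, 0], 'snack': [0, 0, 1, 0]}
print(lp)  # [0, 1, 2, 0]
```

Binary relevance trains "grape" with no knowledge that it always co-occurs with "fruit". Label power-set captures that co-occurrence but leaves classes 1 and 2 with a single training sample each, which is the worsened imbalance noted above.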


Multi-Label Learning

• Observations & Assumptions
– Observations:
    • Error-correction coding methods are able to boost the accuracy of detecting binary strings.
– Assumptions:
    • Error-correction coding brings in calibration to correct misclassified samples by appending new digits to the end of the binary strings.
    • The added digits should be as far apart as possible for different messages, to remove ambiguity.
    • The added digits should be balanced across the entire set of binary strings, to correct errors due to the imbalance problem.

Multi-Label Learning

• Preliminary

– Training Data:

– Unique Label Vectors:

– Occurrence Weight Vector:

– Pseudo Label Set:

– Objective Function:

Let


Multi-Label Learning

• Algorithms
– Generate pseudo labels for training data
– Perform binary relevance transformation
– Make individual predictions on different labels and pseudo labels for testing data
– Calibrate the prediction with pseudo labels
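The final calibration step can be illustrated with nearest-codeword decoding, in the spirit of error-correction coding. This is only one plausible reading of "calibrate": the talk's objective function for generating pseudo labels did not survive extraction, so the pseudo-label bits, the `calibrate` helper, and the Hamming-distance rule below are all assumptions.

```python
def hamming(a, b):
    # Number of positions where two equal-length bit tuples differ.
    return sum(x != y for x, y in zip(a, b))

def calibrate(predicted_bits, codebook, n_labels):
    """Snap a raw prediction (label bits + pseudo-label bits) to the
    closest codeword seen in training and return its label part."""
    best = min(codebook, key=lambda codeword: hamming(codeword, predicted_bits))
    return best[:n_labels]

n_labels = 3
# Each codeword: 3 real label bits followed by 2 hypothetical pseudo-label bits,
# one codeword per unique label vector in the training data.
codebook = [
    (1, 0, 0, 0, 1),
    (1, 1, 0, 1, 0),
    (0, 0, 1, 1, 1),
]

# Raw per-bit predictions from the binary relevance classifiers.
raw_prediction = (1, 1, 0, 0, 1)
calibrated = calibrate(raw_prediction, codebook, n_labels)
print(calibrated)  # (1, 0, 0)
```

Here the raw label bits say (1, 1, 0), but the pseudo-label bits agree with the first codeword, so the second label is treated as a misclassification and corrected.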


Multi-Label Learning

• Data Sets

Multi-label data sets from diverse domains are used for the experiment. In general, they are all imbalanced.

Multi-Label Learning

• Experiment
– SVMs with linear and radial basis function (RBF) kernels, and Random Forest, are chosen as our binary relevance classifiers.
– BPL versions of binary relevance MLL outperform naïve binary relevance methods.


[Figure: BPL outperforms other state-of-the-art methods in Macro-Averaging F1]

[Figure: BPL outperforms other state-of-the-art methods in Micro-Averaging F1]

[Figure: BPL outperforms other state-of-the-art methods in Subset Accuracy]
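The three reported metrics differ in how they aggregate over labels, which matters on imbalanced data. The sketch below computes them on made-up predictions; the data is illustrative and unrelated to the experiments, and scoring an empty label as F1 = 0 is an assumed convention.

```python
def f1(tp, fp, fn):
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0  # assumption: F1 = 0 when undefined

def label_counts(Y_true, Y_pred, j):
    # True positives, false positives, false negatives for label column j.
    tp = sum(1 for t, p in zip(Y_true, Y_pred) if t[j] == 1 and p[j] == 1)
    fp = sum(1 for t, p in zip(Y_true, Y_pred) if t[j] == 0 and p[j] == 1)
    fn = sum(1 for t, p in zip(Y_true, Y_pred) if t[j] == 1 and p[j] == 0)
    return tp, fp, fn

def macro_f1(Y_true, Y_pred):
    # Average of per-label F1: every label counts equally, rare or not.
    n = len(Y_true[0])
    return sum(f1(*label_counts(Y_true, Y_pred, j)) for j in range(n)) / n

def micro_f1(Y_true, Y_pred):
    # F1 over pooled counts: frequent labels dominate.
    n = len(Y_true[0])
    totals = [label_counts(Y_true, Y_pred, j) for j in range(n)]
    return f1(sum(c[0] for c in totals), sum(c[1] for c in totals), sum(c[2] for c in totals))

def subset_accuracy(Y_true, Y_pred):
    # Fraction of samples whose whole label vector is predicted exactly.
    return sum(1 for t, p in zip(Y_true, Y_pred) if t == p) / len(Y_true)

Y_true = [(1, 0, 1), (0, 1, 0), (1, 1, 0), (0, 0, 1)]
Y_pred = [(1, 0, 1), (0, 1, 1), (1, 0, 0), (0, 0, 1)]

macro = macro_f1(Y_true, Y_pred)
micro = micro_f1(Y_true, Y_pred)
subset = subset_accuracy(Y_true, Y_pred)
print(round(macro, 4), round(micro, 4), subset)  # 0.8222 0.8333 0.5
```

Macro-averaged F1 is the most sensitive of the three to errors on rare labels, and subset accuracy is the strictest because a single wrong bit fails the whole sample.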


Discussion and Conclusion

• Computational complexity:
– CBAC: O(N*m*t), where N is the number of records, m the size of the base set, and t the size of the dictionary (TF/IDF)
– CBAC with multi-level blocking: O(m*t*log(N*x)), where x is the number of clusters

Discussion and Conclusion

• Parallelization of CBAC:
– CBAC, blocking and labeling processes could be parallelized.
– May work with map-reduce.

• Overall, CBAC requires a reasonable overhead. It is scalable.

Discussion and Conclusion

• Security Analysis
– CBAC != relaxed security: content-based access control does not imply weakened or relaxed security.
– Rather, it enforces an additional layer of access control on top of existing “precise” access control methods.

Security Guarantee

When CBAC is correctly enforced and managed, a malicious user cannot obtain access to sensitive information by manipulating his/her accessible records, creating spoofing records, or gaining (non-base-set) access to similar insensitive information.

Discussion and Conclusion

• Conclusion
– CBAC is an access control model focused on protecting data content.
– CBAC makes access control decisions based on the semantic similarity between the requester’s credentials and the content of the remaining data in the database.
– Offline CBAC is efficient for databases that are not updated frequently.
– With optimized CBAC enforcement, there is little overhead compared to queries without CBAC.
– Multi-label learning is useful when curated database annotation is not available.

Questions
