Content-Based Access Control
Wenrong Zeng
Dept. of Electrical Engineering and Computer Science
Advisor: Dr. Luo
Committee Members: Dr. Agah, Dr. Grzymala-Busse, Dr. Kulkarni, Dr. Ho
Acknowledgements
I owe my thanks to my committee members:
• Dr. Luo
• Dr. Agah
• Dr. Grzymala-Busse
• Dr. Kulkarni
• Dr. Ho
Outline
• Introduction
• The Content-Based Access Control Model
• Content-Based Access Control Enforcement
• Content-Based Blocking and Tagging
• Multi-Label Learning
• Discussion and Conclusion
Introduction
• Education Background & Publication
• Motivation & Goal
• Background & Related Work
Education Background
• B.S., Peking University, 2006
– Major: Electrical Engineering
• M.E., Chinese Academy of Sciences, 2009
– Major: Computer Science
• Ph.D. student, University of Kansas, present
– Major: Computer Science
Publication
– Wenrong Zeng, Yuhao Yang, Bo Luo: Using Data Content to Assist Access Control for Large-Scale Content-Centric Databases. In IEEE International Conference on Big Data (IEEE BigData), 2014. (Acceptance rate: 18.5%)
– Wenrong Zeng, Xuewen Chen, Hong Cheng: Pseudo labels for imbalanced multi-label learning. DSAA 2014: 25-31.
– Wenrong Zeng, Yuhao Yang, Bo Luo: Access control for big data using data content. In IEEE International Conference on Big Data (IEEE BigData), 2014: 45-47. (Poster)
– Wenrong Zeng, Xuewen Chen, Hong Cheng, Jing Hua: Multi-Space Learning for Image Classification Using AdaBoost and Markov Random Fields. Solving Complex Machine Learning Problems with Ensemble Methods Workshop, 2013.
– Yi Jia, Wenrong Zeng, Jun Huan: Non-stationary Bayesian networks based on perfect simulation. In ACM Conference on Information and Knowledge Management, 2012: 1095-1104. (Acceptance rate: 13.4%)
– Weiming Hu, Dan Xie, Zhouyu Fu, Wenrong Zeng, Stephen J. Maybank: Semantic-Based Surveillance Video Retrieval. IEEE Transactions on Image Processing 16(4): 1168-1181 (2007).
Motivation
• Data tends to be content-centric.
– Healthcare: 500 million patient databases nationwide.
– Telecom: the largest single database, AT&T’s calling records, holds 312 TB.
– Business: by 2020, business transaction data on the Internet will reach 450 billion per day (IDC).
• Nearly every field faces big data issues.
[Figure: Volume, Velocity, Variety. Source: http://stablekernel.com/blog/wp-content/uploads/2015/02/Big-Data.jpg]
Motivation
• With big data, conventional database access control mechanisms may be insufficient.
• Long-term goal: smart access control decisions for big data without extensive labor by the DBA.
http://blog.varonis.com/big-data-security/
Motivation
• Example
– A law enforcement agency (e.g. FBI) holds a database of highly sensitive case records.
  • Large number of records
  • Unstructured content
– A supervisor assigns a case to agent Alice for investigation.
– Naturally, the supervisor also needs to grant Alice access to all related or similar cases.
Motivation
• Manual assignment
– The supervisor manually selects “related cases”.
– Extremely labor-intensive, practically impossible.
• Multi-level security
– Alice can access all the cases with equal or lower security levels.
– Over-privileged users!
• Attribute-based access control
– E.g. Alice can access all the robbery records within 5 years, in Area X, in which the suspect is 6 feet tall.
– Attributes require manual input and are usually not available.
Goals
• Assumptions:
– Basic privileges: users are authenticated with basic trust (e.g. with MLS).
– Data-driven: with large amounts of content-centric data, the access control model must be data-driven.
– Lack of explicit authorization.
– Approximation is allowed.
Smart access control decisions:
• Develop a content-based access control model that is data-driven.
• Enforce the content-based access control model efficiently.
Conventional Methods
• Role-Based Access Control:
Bob, an adult, can drink wine.
(subject: Bob; role: adult; object: wine)
• Attribute-Based Access Control:
Bob, who is 24 years old, can drink wine.
(subject: Bob; age attribute: 24 years old; object: wine)
Current Issues
• Difficult to define granular access controls.
• Lack of ability to implement abstract access control policies (e.g. similar documents).
• Access control models are NOT content-driven.
“A truly comprehensive approach for data protection must include mechanisms for enforcing access control policies based on data contents ….”
E. Bertino et al. Access Control for Databases: Concepts and Systems. Vol. 3, No. 1-2. Now Publishers Inc, 2011.
Text Feature Extraction
– TF-IDF: Term Frequency-Inverse Document Frequency
– Topic Modeling: Non-negative Matrix Factorization based on TF-IDF
Both are inherently based on term distributions.
Text Feature Extraction
– Where term-distributed features fall short:
D1: privacy preserving similarity assessment for semi-structured data
D2: private XML document matching
D1 and D2 share no terms, so according to TF-IDF their cosine similarity is 0, although the two titles are topically related.
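The zero similarity of D1 and D2 can be reproduced with a few lines of Python. This is a minimal sketch: the whitespace tokenizer and the smoothed idf formula are assumptions chosen for illustration, not the settings used in the experiments. Any TF-IDF variant gives the same result here, because the dot product of two vectors without a common term is always zero.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Return one {term: tf-idf weight} dict per document."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        # Smoothed idf (an assumption; other variants behave the same here).
        vectors.append(
            {t: tf[t] * (math.log((1 + n_docs) / (1 + df[t])) + 1) for t in tf}
        )
    return vectors

def cosine(u, v):
    """Cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

d1 = "privacy preserving similarity assessment for semi-structured data"
d2 = "private XML document matching"
v1, v2 = tfidf_vectors([d1, d2])
print(cosine(v1, v2))  # 0.0: "privacy" and "private" are different terms
```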
Text Feature Extraction
– TAGME: Topic Modeling with Wikipedia Curated Annotation
Doc No. | Word(s) | Topic Annotation | Weight
D1 | privacy | Privacy | 0.1279
D1 | preserving | Historic preservation | 0.0017
D1 | similarity | Homology (biology) | 0.0354
D1 | assessment | Homology (biology) | 0.0521
D1 | for | (none) | (none)
D1 | semi-structured data | Semi-structured data | 0.2727
D2 | private | Privacy | 0.1256
D2 | XML | XML | 0.5375
D2 | document | Document management system | 0.1475
D2 | matching | Matching principle | 0.0509
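With topic annotations, D1 and D2 are no longer orthogonal, because both map to the topic "Privacy". The sketch below builds topic vectors from the TAGME weights listed above and compares them with cosine similarity. Summing the two "Homology (biology)" weights for D1 is an assumption made for this illustration; the slides do not say how repeated topics are aggregated.

```python
import math

# Topic weights copied from the TAGME table. "Homology (biology)" is annotated
# twice for D1; summing the two weights is an assumption of this sketch.
d1_topics = {
    "Privacy": 0.1279,
    "Historic preservation": 0.0017,
    "Homology (biology)": 0.0354 + 0.0521,
    "Semi-structured data": 0.2727,
}
d2_topics = {
    "Privacy": 0.1256,
    "XML": 0.5375,
    "Document management system": 0.1475,
    "Matching principle": 0.0509,
}

def cosine(u, v):
    """Cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

similarity = cosine(d1_topics, d2_topics)
print(similarity > 0.0)  # True: the shared "Privacy" topic links D1 and D2
```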
The Content-Based Access Control Model
• Two-phase model
– Initial authorization
– Content-based authorization
• Initial authorization
– Conventional access control policy
– Each CBAC user is explicitly granted access to a small set of records: the seed set (referred to as the base set in later slides).
  • Manual selection
  • Attribute-based rules
  • Requested by the user
The Content-Based Access Control Model
• Content-based authorization
– Content-based access control policy
– Dynamic “sign” function
– To be evaluated on-the-fly
– {true, false} based on content similarity between the base set and the object record
– Similarity function
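A minimal sketch of such a sign function follows. It assumes that records are sparse term-weight vectors, that the similarity function is cosine similarity, and that a record is authorized when its best match in the base set reaches a threshold. The name `cbac_sign`, the max aggregation and the threshold value are illustrative assumptions, not the definitions used in the thesis.

```python
import math

def cosine(u, v):
    """Cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def cbac_sign(base_set, record, threshold):
    """Return True if the record is similar enough to the user's base set."""
    # Assumption: the record is compared with every base-set record and the
    # best match decides.
    best = max((cosine(seed, record) for seed in base_set), default=0.0)
    return best >= threshold

# Hypothetical toy vectors.
base_set = [{"robbery": 1.0, "bank": 0.5}, {"fraud": 1.0}]
related = {"robbery": 0.8, "bank": 0.6, "downtown": 0.2}
unrelated = {"traffic": 1.0, "parking": 0.7}

print(cbac_sign(base_set, related, threshold=0.5))    # True
print(cbac_sign(base_set, unrelated, threshold=0.5))  # False
```

Because the function depends only on the base set and the object record, it can be evaluated at query time for each candidate record.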
The Content-Based Access Control Model
• Content modeling
– In [expression lost in extraction], similarity function defined for term or topic ax
– Unstructured text attributes (CLOB, Text)
  • Any text modeling approach could be used.
  • We utilized the vector space model (TF/IDF) in Oracle.
Content-Based Access Control Enforcement
• Settings
– UCI KDD NSF award data: abstracts represent the content-rich information
– Use MIT’s SCIgen to add approximately 20x noise data: 2.7M records
– Base set: awards PI-ed by the user
– CBAC enforced with Oracle Virtual Private Database
– Tables as follows
Content-Based Access Control Enforcement
• Experiment
– The database runs on a 64-bit Windows 7 system with an Intel® Core™ 2 Duo CPU E8500 @ 3.16GHz and 4.0GB RAM.
– Log in as 60 randomly selected users to issue the following queries via PL/SQL:
Content-Based Access Control Enforcement
• Experiment
– Three different scenarios for access control:
  • (R1) an attribute-based access control (ABAC) rule: the user is allowed to access records in a division where he/she has PI-ed an award;
  • (R2) a content-based access control (CBAC) rule: the user is only allowed to access awards whose abstracts are similar to those of the awards in his/her base set; and
  • (R3) a combined (ABAC+CBAC) rule: R1 AND R2.
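The three rules can be read as row-level predicates. The sketch below is plain Python rather than the Oracle Virtual Private Database policies used in the experiments. The `Award` fields, the division names and the threshold are hypothetical, and the similarity to the base set is assumed to be already computed.

```python
from dataclasses import dataclass

@dataclass
class Award:
    division: str
    similarity_to_base_set: float  # assumed to be computed by the CBAC layer

def r1_abac(award, user_divisions):
    """R1: the award belongs to a division where the user has PI-ed an award."""
    return award.division in user_divisions

def r2_cbac(award, threshold):
    """R2: the abstract is similar enough to the user's base set."""
    return award.similarity_to_base_set >= threshold

def r3_combined(award, user_divisions, threshold):
    """R3: R1 AND R2."""
    return r1_abac(award, user_divisions) and r2_cbac(award, threshold)

# Hypothetical rows and user profile.
awards = [Award("DIV-A", 0.72), Award("DIV-A", 0.10), Award("DIV-B", 0.65)]
user_divisions = {"DIV-A"}

visible = [a for a in awards if r3_combined(a, user_divisions, threshold=0.5)]
print(len(visible))  # 1
```

R3 never returns more rows than R1 or R2 alone, since it is the conjunction of both predicates.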
Content-Based Access Control Enforcement
• Basic On-the-Fly CBAC Threshold Results
[Figure panels: ABAC Query1; ABAC Query2; CBAC Threshold Query1; CBAC Threshold Query2; ABAC+CBAC Threshold Query1; ABAC+CBAC Threshold Query2]
Content-Based Access Control Enforcement
• Basic On-the-Fly Top-10 CBAC Results
[Figure panel: Offline CBAC Results]
Content-Based Access Control Enforcement
• Issues with CBAC:
– Efficiency: content-based similarity assessment is slow.
– Accuracy: the vector space model suffers from lexical ambiguity, especially for short text snippets (e.g. tweets).
Content-Based Blocking and Tagging
• Content-based blocking:
– Pre-partition records into semantically similar clusters.
– The base set s is first compared with the cluster centroids.
– The query is evaluated only against the top x clusters.
[Figure: Before Blocking]
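The blocking steps can be sketched as follows. The sketch assumes sparse term-weight vectors, cosine similarity and a centroid defined as the term-wise mean of a cluster. The cluster names, the toy vectors and the helper names are hypothetical, and in practice the centroids would be computed offline rather than on each call.

```python
import math

def cosine(u, v):
    """Cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def centroid(vectors):
    """Average sparse vectors term by term."""
    total = {}
    for vec in vectors:
        for term, w in vec.items():
            total[term] = total.get(term, 0.0) + w
    return {term: w / len(vectors) for term, w in total.items()}

def top_clusters(base_set, clusters, x):
    """Rank clusters by the best base-set-to-centroid similarity, keep x."""
    scored = []
    for name, records in clusters.items():
        c = centroid(records)
        scored.append((max(cosine(seed, c) for seed in base_set), name))
    scored.sort(reverse=True)
    return [name for _, name in scored[:x]]

def candidates(base_set, clusters, x):
    """Records that still need a full similarity check after blocking."""
    return [rec for name in top_clusters(base_set, clusters, x)
            for rec in clusters[name]]

# Hypothetical pre-partitioned records.
clusters = {
    "security": [{"privacy": 1.0, "access": 0.5}, {"access": 1.0, "control": 0.8}],
    "biology": [{"protein": 1.0, "gene": 0.6}, {"gene": 1.0, "cell": 0.4}],
}
base_set = [{"privacy": 0.9, "control": 0.4}]

print(top_clusters(base_set, clusters, x=1))     # ['security']
print(len(candidates(base_set, clusters, x=1)))  # 2
```

Only the records in the selected clusters go through the record-level similarity check, which is where the savings come from.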
Content-Based Blocking and Tagging
[Figure: After Blocking]
Content-Based Blocking and Tagging
[Figure panels: CBAC Threshold Query1 with Blocking; CBAC Threshold Query2 with Blocking; ABAC+CBAC Threshold Query1 with Blocking; ABAC+CBAC Threshold Query2 with Blocking; CBAC Top-10 with Blocking]
Content-Based Blocking and Tagging
• Data annotation is performed offline, so efficiency is not an issue.
• We use:
– Non-negative Matrix Factorization with 10, 20, 50 and 100 “topics”.
– TAGME: Wikipedia annotation of text
Content-Based Blocking and Tagging
• Tag quality is further ensured by removing noisy tags with a threshold cut-off.
[Figure panels: NMF with 100 topics; TAGME]
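A threshold cut-off of this kind can be sketched in a few lines. The tags below are the TAGME annotations of D1 from the earlier example. The cut-off value of 0.1 is a hypothetical choice for illustration, since the value used in the experiments is not stated on the slide.

```python
# (word, topic annotation, weight) triples for D1, from the TAGME example.
d1_tags = [
    ("privacy", "Privacy", 0.1279),
    ("preserving", "Historic preservation", 0.0017),
    ("similarity", "Homology (biology)", 0.0354),
    ("assessment", "Homology (biology)", 0.0521),
    ("semi-structured data", "Semi-structured data", 0.2727),
]

def prune_tags(tags, cutoff):
    """Keep only the tags whose weight reaches the cut-off."""
    return [tag for tag in tags if tag[2] >= cutoff]

kept = prune_tags(d1_tags, cutoff=0.1)  # 0.1 is a hypothetical cut-off
print([topic for _, topic, _ in kept])  # ['Privacy', 'Semi-structured data']
```

With this cut-off the off-topic annotations "Historic preservation" and "Homology (biology)" are dropped.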
Content-Based Blocking and Tagging
[Figure panels: CBAC Threshold Query1 with Tagging; CBAC Threshold Query2 with Tagging; ABAC+CBAC Threshold Query1 with Tagging; ABAC+CBAC Threshold Query2 with Tagging; CBAC Top-10 with Tagging]
Content-Based Blocking and Tagging
[Figure panel: CBAC Top-10 with Blocking + Tagging]
Content-Based Blocking and Tagging
• Soundness of CBAC
Multi-Label Learning
• Motivation
– We learned that curated annotation with domain knowledge provides accurate annotation and boosts efficiency.
– The question arises: if the domain is not covered by the Wikipedia database (e.g. “ylw banana” mapped to food, fruit, snack), where should the topic annotation come from?
– Multi-label learning can learn from a labeled subset of samples and predict the labels of the remaining samples, which facilitates topic (label) annotation.
– We will dive into multi-label learning.
Multi-Label Learning
• Multi-component labels: Yosemite Valley, Waterfall, Mountain “View”, Snow, Blue Sky
• Hierarchical labels: Silk Short Sleeve, Silk Tops, Tops, Women’s Apparel
• Multi-facet labels: Painter, Sculptor, Architect, Musician, Mathematician, Engineer, Inventor, Anatomist, Geologist, Cartographer, Botanist, Writer
Multi-Label Learning
• Ambiguity of Labels
[Figure labels: Mountain View; Yosemite; Apple]
Label correlation helps to eliminate the ambiguity of labels.
Multi-Label Learning
• Uneven label distribution leads to the imbalance problem.
[Figure labels: Fruit; Grape]
Multi-Label Learning
• Pros & cons of current problem transformations for MLL
– Binary relevance treats MLL as a set of independent binary classification problems, one per label.
  • Pros: simple, easy to parallelize
  • Cons: completely ignores the dependencies among labels
– Label power-set adds a pre-processing step that constructs the power set of labels and uses each label subset as a new label.
  • Pros: encodes the correlation among labels
  • Cons: introduces false correlations; worsens the imbalance problem
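The two problem transformations can be illustrated on a toy label matrix. The label names and samples below are hypothetical. The sketch shows only how the targets are rewritten, not how the classifiers are trained.

```python
from collections import Counter

labels = ["mountain", "snow", "sky"]
# Label set of each training sample (toy data).
y = [
    {"mountain", "snow"},
    {"mountain", "sky"},
    {"mountain", "snow"},
    {"sky"},
]

def binary_relevance(y, labels):
    """One independent 0/1 target vector per label."""
    return {label: [int(label in sample) for sample in y] for label in labels}

def label_powerset(y):
    """One multi-class target: each distinct label set becomes a class."""
    return [tuple(sorted(sample)) for sample in y]

br = binary_relevance(y, labels)
lp = label_powerset(y)

print(br["snow"])        # [1, 0, 1, 0]
print(len(Counter(lp)))  # 3 distinct classes for 4 samples
```

Binary relevance yields three separate problems that know nothing about each other. Label power-set keeps the co-occurrence of "mountain" and "snow" but spreads four samples over three classes, which illustrates how it can worsen imbalance.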
Multi-Label Learning
• Observations & Assumptions
– Observations:
  • Error-correction coding methods are able to boost the accuracy of detecting binary strings.
– Assumptions:
  • Error-correction coding calibrates misclassified samples by appending new digits to the end of the binary strings.
  • The appended digits should be as far apart as possible for different messages, to remove ambiguity.
  • The appended digits should be balanced across the entire set of binary strings, to correct errors caused by the imbalance problem.
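The intuition behind the first two assumptions can be shown with a generic error-correcting example. This is not the pseudo-label construction itself: the appended digits below are hand-picked, hypothetical values, and nearest-codeword decoding by Hamming distance is a standard technique used here only for illustration.

```python
def hamming(a, b):
    """Number of positions in which two equal-length bit strings differ."""
    return sum(x != y for x, y in zip(a, b))

def decode(received, codebook):
    """Return the codeword closest to the received bit string."""
    return min(codebook, key=lambda c: hamming(received, c))

# Two label vectors that differ in a single bit.
messages = ["110", "100"]
# Hypothetical appended digits chosen to push the codewords apart.
codebook = ["110" + "11", "100" + "00"]

print(hamming(*messages))  # 1
print(hamming(*codebook))  # 3

# One flipped bit in the first codeword is still decoded correctly.
print(decode("10011", codebook))  # 11011
```

Without the appended digits, a single wrong bit turns one label vector into the other. With them, the two codewords are three bits apart, so a single error can be corrected.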
Multi-Label Learning
• Preliminary
– Training Data:
– Unique Label Vectors:
– Occurrence Weight Vector:
– Pseudo Label Set:
– Objective Function:
Let
Multi-Label Learning
• Algorithm
– Generate pseudo labels for the training data.
– Perform the binary relevance transformation.
– Make individual predictions for each label and pseudo label on the testing data.
– Calibrate the predictions with the pseudo labels.
Multi-Label Learning
• Data Sets
Multi-label data sets from diverse domains are used in the experiments. In general, they are all imbalanced.
Multi-Label Learning
• Experiment
– SVMs with linear and radial basis function (RBF) kernels, and Random Forest, are chosen as our binary relevance classifiers.
– BPL versions of binary relevance MLL outperform naïve binary relevance methods.
Multi-Label Learning
BPL outperforms other state-of-the-art methods in macro-averaged F1.
Multi-Label Learning
BPL outperforms other state-of-the-art methods in micro-averaged F1.
Multi-Label Learning
BPL outperforms other state-of-the-art methods in subset accuracy.
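For reference, the three evaluation measures can be computed as in the sketch below, using their standard definitions. The label matrices are made-up toy data and do not come from the experiments.

```python
def f1(tp, fp, fn):
    """F1 from true positives, false positives and false negatives."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def counts(y_true, y_pred, j):
    """(tp, fp, fn) for label column j."""
    tp = sum(t[j] == 1 and p[j] == 1 for t, p in zip(y_true, y_pred))
    fp = sum(t[j] == 0 and p[j] == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t[j] == 1 and p[j] == 0 for t, p in zip(y_true, y_pred))
    return tp, fp, fn

def macro_f1(y_true, y_pred):
    """Average of per-label F1: every label counts equally."""
    n_labels = len(y_true[0])
    return sum(f1(*counts(y_true, y_pred, j)) for j in range(n_labels)) / n_labels

def micro_f1(y_true, y_pred):
    """F1 over pooled counts: frequent labels dominate."""
    n_labels = len(y_true[0])
    per_label = (counts(y_true, y_pred, j) for j in range(n_labels))
    tp, fp, fn = (sum(c) for c in zip(*per_label))
    return f1(tp, fp, fn)

def subset_accuracy(y_true, y_pred):
    """Fraction of samples whose whole label vector is predicted exactly."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Toy data: 4 samples, 3 labels.
y_true = [[1, 0, 1], [0, 1, 0], [1, 1, 0], [0, 0, 1]]
y_pred = [[1, 0, 1], [0, 1, 1], [1, 0, 0], [0, 0, 1]]

print(round(macro_f1(y_true, y_pred), 3))  # 0.822
print(round(micro_f1(y_true, y_pred), 3))  # 0.833
print(subset_accuracy(y_true, y_pred))     # 0.5
```

Macro-averaged F1 is the measure most sensitive to rare labels, which is why it matters for imbalanced data sets.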
Discussion and Conclusion
• Computational complexity:
– CBAC: O(N*m*t). N: number of records; m: size of the base set; t: size of the dictionary (TF/IDF)
– CBAC with multi-level blocking: O(m*t*log(N*x)). x: number of clusters
Discussion and Conclusion
• Parallelization of CBAC:
– The CBAC, blocking and labeling processes could be parallelized.
– May work with MapReduce.
• Overall, CBAC incurs a reasonable overhead and is scalable.
Discussion and Conclusion
• Security Analysis
– CBAC != relaxed security: content-based access control does not imply weakened or relaxed security.
– Rather, it enforces an additional layer of access control on top of existing “precise” access control methods.
Security Guarantee
When CBAC is correctly enforced and managed, a malicious user cannot obtain access to sensitive information by manipulating his/her accessible records, creating spoofing records, or gaining (non-base-set) access to similar insensitive information.
Discussion and Conclusion
• Conclusion
– CBAC is an access control model focusing on protecting data content.
– CBAC makes access control decisions based on the semantic similarity between the requester’s credentials and the content of the remaining data in the database.
– Applying offline CBAC to databases that are not updated frequently is efficient.
– With optimized CBAC enforcement, there is little overhead compared to queries without CBAC.
– Multi-label learning is useful when curated database annotation is not available.
Questions