1
Software Industry in India and Keyword Search Over Dynamic Categorized Information
Manish Bhide
manish.bhide@gmail.com
2
My Background
BE in CSE, 2000, from VRCE (now VNIT)
MTech in CSE, 2002, from IITB
Working with IBM India Research Lab since 2002
Started a part-time PhD at IITB in 2005
3
Types of Software Companies (type of work)
Services Companies
Product development companies
Research and Development Companies
Many companies do work that falls into all three of the above categories
4
Services Companies
Services companies work for other companies using the “outsourcing model”
  E.g., a bank wants to focus on its core business and not worry about the software needed to run it
  The IT part is outsourced to a services company
Services companies can do the following types of work
  Support
  Product development
Support
  L1: First level of contact (BPO companies)
  L2: If a problem is not solved by L1, it is escalated to L2
  L3: If a problem is not solved by L2, it is escalated to L3
5
Services Companies
Support work can be categorized as
  Business process outsourcing
  Application Support & Maintenance (L2, L3)
  L3 work involves bug fixing
Support work might not be that great!
Product development in services companies
  The product companies outsource the development of their products to services companies
  Conceptualization and design are done by the product company
  Development and testing are done by the services company
6
Product Development
Involves development of products
  Also involves testing of products, which is not great
Development part is more exciting than services work
Pays better than services work (L1, L2, L3)
Quality of people hired by product companies is better than that of those hired by services companies
  People from CS@IIT do not apply to services companies
7
Research and Development
R&D is a misused term
Some companies try to pass off product development as R&D
True meaning: Involves conceptualization of new product ideas or enhancements to existing product ideas
  The relational “database” model originated at IBM Research
  Job role can be thought of as originating ideas and developing new products
People hired typically hold a PhD or a master's degree in Computer Science from the IITs or the best universities worldwide
Hires the best and pays the best amongst all types of software companies
8
How it all fits together
Consider an example of a bank
  It wants to focus on its core business: banking
  IT part is outsourced to “Services Companies”
  Software used by the bank will consist of a database, web-server, etc.
  Services company will use these software products to build a “solution” for the bank: banking software
Someone needs to build the products like the database, web-server, etc. for the services companies to use. This is done by the product companies
Someone needs to think of a need for a new product. This is where R&D companies play a role
9
Take-away for you…
As far as possible, try to find a job in product development companies
  Getting into R&D right after BTech is difficult
If interested in doing quality work, try to do an MTech/MS
If still interested in further studies, register for a PhD
Caveat Emptor:
  Not everyone will get a job in product development
  This is not the right time to try out all the above
10
How to find a Job in good companies
First and foremost: You need to be good academically!
Enhance your coding skills
  Try to participate in coding contests such as:
    Google India Code Jam
    International Online Programming Contest (organized by the IITs)
  Try to participate in open source software development
Try to do a summer internship in product companies
  Try to contact VNIT alumni in these companies to improve your chances
Realize your potential! From my personal experience I believe that
  The top 10% of the folks in CS@VNIT are at par with those in the IITs
  The rest are better than most of the guys from other engineering colleges in India
  The faculty in VNIT is amongst the best in India
11
Keyword Search Over Dynamic Categorized Information
Joint work with: Venkatesan Chakravarthy, Krithi Ramamritham and Prasan Roy
12
Motivating Example
The prime ministerial candidate of a political party “PP” wants to assess the reaction of different voter categories to the party's manifesto
Current Approach:
  Keyword Query: “PP manifesto”
  Results consist of a large number of blog posts
  Cannot form a consolidated opinion
Desirable Result: Most relevant categories
  Blogs about education issues
  Blogs about tax rebates
13
Motivating Example (contd..)
Alternate Approach: Use traditional search, group results into categories
Problems:
  Difficult to assign labels to generated clusters
  Unpredictability of generated results
Solution: Categorized search (Faceted search) over pre-defined categories
14
Problem Statement
CS* (Categorized Search) system supports top-K keyword search over categories
[Figure: a keyword query Q(t1, t2, .., tl) is evaluated over categories C1, C2, .., CN, each defined by a predicate pc; the categories draw on an information repository of data-items d1, d2, d3, .., where each di has attributes A(di) and text T(di); the output is the top-K categories]
Example:
  di = blog posts
  A(di) = attributes in user profile
  T(di) = text of blog
  pc = text classifier (e.g., blog posts about educational issues)
  Keyword query: “PP Manifesto”
  Top-K categories: blogs about educational issues, blogs about tax rebates
15
Scoring Function
We use a standard tf-idf based scoring function to compute the relevance of a category to a keyword query
Term Frequency:
Inverse Document Frequency:
Score:
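The exact formulas used by CS* are not reproduced here. As a minimal sketch, assuming one common tf-idf variant (which may differ from the one in the paper), tf(c,t) is taken as the fraction of term occurrences in category c that equal t, idf(t) as log(N / number of categories containing t), and Score(c,Q) as the sum of tf x idf over the query keywords. All function names and the sample data are hypothetical.

```python
import math
from collections import Counter

def build_term_counts(category_docs):
    """Map each category to a Counter of term occurrences over its data-items."""
    return {c: Counter(w for doc in docs for w in doc.lower().split())
            for c, docs in category_docs.items()}

def tf(counts, c, t):
    # Assumed variant: occurrences of t in c divided by all term occurrences in c.
    total = sum(counts[c].values())
    return counts[c][t] / total if total else 0.0

def idf(counts, t):
    # Assumed variant: log(number of categories / categories containing t).
    n_with_t = sum(1 for c in counts if counts[c][t] > 0)
    return math.log(len(counts) / n_with_t) if n_with_t else 0.0

def score(counts, c, query):
    """Score(c, Q): sum of tf x idf over the keywords of Q."""
    return sum(tf(counts, c, t) * idf(counts, t) for t in query)

def top_k(counts, query, k):
    ranked = sorted(counts, key=lambda c: score(counts, c, query), reverse=True)
    return ranked[:k]

# Made-up data: three categories with a few blog posts each.
category_docs = {
    "education": ["pp manifesto promises new schools", "schools need teachers"],
    "tax": ["pp manifesto offers tax rebates", "tax rebates for farmers"],
    "sports": ["cricket season starts"],
}
counts = build_term_counts(category_docs)
print(top_k(counts, ["manifesto", "schools"], 2))
```

With this toy data the "education" category ranks first because "schools" is both frequent in it and rare across categories.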
16
Computing Top-K Categories
Scoring Function:
Use stored meta-data to compute Score(c,Q) values
[Figure: the same architecture as on the Problem Statement slide (keyword query Q(t1, t2, .., tl), categories C1, C2, .., CN with predicates pc, information repository d1, d2, d3, .., dN with A(di) and T(di), top-K categories as output), with a Meta-Data component added]
17
Naïve Approach: Update-all Strategy
Refresh all the categories when a new data-item is added
  Evaluate pc of each category with respect to the data-item
  Update meta-data for those categories whose pc evaluates to true
pc can be a text classifier or could involve expensive joins
  High-value customer: transactions of more than 10K in the last 15 days
If one pc evaluation takes 25 milliseconds, for 1000 categories it will take 25 seconds!
While one data-item is being processed, more data-items could be added
  As per a 2006 estimate, 13 blog posts are created per second
Meta-data will become stale, affecting quality of results
Need for an intelligent selective update strategy!
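The arithmetic behind the update-all bottleneck can be checked with a few lines. This is only an illustration of the figures quoted above (25 ms per pc evaluation, 1000 categories, 13 new blog posts per second); the function names are hypothetical.

```python
PC_EVAL_MS = 25            # cost of one pc evaluation, in milliseconds
NUM_CATEGORIES = 1000      # categories refreshed for every new data-item
ARRIVALS_PER_SECOND = 13   # 2006 estimate of new blog posts per second

def update_all_cost_ms(num_categories, pc_eval_ms):
    """Time to evaluate pc of every category against one data-item."""
    return num_categories * pc_eval_ms

def arrivals_during(cost_ms, arrivals_per_second):
    """Data-items that arrive while one data-item is being processed."""
    return arrivals_per_second * cost_ms // 1000

cost_ms = update_all_cost_ms(NUM_CATEGORIES, PC_EVAL_MS)
new_items = arrivals_during(cost_ms, ARRIVALS_PER_SECOND)
print(f"{cost_ms / 1000:.0f} s per data-item; {new_items} new data-items arrive meanwhile")
```

Processing one data-item takes 25 seconds, during which 325 further data-items arrive, so the backlog grows without bound under update-all.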
18
CS* Approach: Selective update of categories with selective data
Identify a sub-set of categories (of size ImpCat) that are deemed important
Identify a sub-set of data-items (of size ImpData) that can provide maximum impact in terms of update to meta-data
Refresh important categories using the sub-set of data-items
CS* consists of two components:
  Meta-data Refresher
  Query Answering Module
19
Overview
Motivation, Problem Statement, Naïve Strategy
Statistics used by CS*
Meta-Data Refresher
Query Answering Module
Experimental Evaluation
Conclusions
20
Statistics Maintained by CS*
Contiguous Refreshing: CS* refreshes a category in a contiguous manner
  When the statistics of a category are refreshed using data-item di, they are also refreshed using all the data-items added before di
Last Refresh Time rt(c): Largest time-step till which the statistics of c have been refreshed
[Figure: timeline of data-items d1 .. d9 arriving at time-steps s1 .. s9; category Ci (predicate pc) has been refreshed up to s6]
  Example: rt(Ci) = s6, so tf_s6(Ci,t) will be available
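Contiguous refreshing and the last refresh time can be sketched as follows. This is a minimal illustration, not the CS* implementation: the class `CategoryStats`, its fields, and the sample data-items are hypothetical, and time-steps are assumed to be the integers 1, 2, 3, .. in arrival order.

```python
from collections import Counter

class CategoryStats:
    """Per-category statistics under contiguous refreshing (illustrative)."""

    def __init__(self, predicate):
        self.predicate = predicate  # pc: data-item text -> bool
        self.rt = 0                 # last refresh time-step; 0 means never refreshed
        self.tf = Counter()         # term counts as of time-step rt

    def refresh(self, data_items, upto):
        """Refresh with every data-item in (rt, upto], in arrival order.

        data_items[i - 1] is the data-item added at time-step i, so no
        data-item before `upto` is ever skipped.
        """
        for step in range(self.rt + 1, upto + 1):
            text = data_items[step - 1]
            if self.predicate(text):
                self.tf.update(text.lower().split())
        self.rt = max(self.rt, upto)

items = [
    "schools manifesto", "cricket score", "schools budget",
    "tax rebates", "new schools", "schools exams",
    "schools strike", "tax cut", "schools fees",
]
ci = CategoryStats(lambda text: "schools" in text)
ci.refresh(items, 2)   # uses d1, d2
ci.refresh(items, 6)   # continues with d3 .. d6 only
print(ci.rt, ci.tf["schools"])
```

After the second call rt(Ci) is 6 and the term counts reflect d1 .. d6 only; d7 .. d9 have not yet been seen by this category.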
21
Estimating approximate tf
Need to find tf at current time s*: tf_s*(c,t)
Use principle of locality
  Find rate of change of term frequency Δ(c,t): estimate of change in term frequency per data-item
  Δ(c,t) updated whenever c is refreshed
[Figure: timeline of data-items d1 .. d8, d* arriving at time-steps s1 .. s8, s*; category Ci (predicate pc) was last refreshed at rt(Ci) = s6, so tf_s6(Ci,t) is available; s* is the current time]
Estimated term frequency tfest_s* calculated as
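A minimal sketch of the estimate, assuming a linear extrapolation from the last refresh: the estimated term frequency at s* is the stored value at rt(c) plus Δ(c,t) times the number of data-items added since rt(c). The rule used here to update Δ (average change per data-item between two consecutive refreshes) is an assumption for illustration; the paper may use a different or smoothed rule. Function names are hypothetical and time-steps are integers.

```python
def update_delta(tf_prev, rt_prev, tf_new, rt_new):
    """Assumed rule: average change in tf per data-item between two refreshes."""
    if rt_new <= rt_prev:
        raise ValueError("rt_new must be later than rt_prev")
    return (tf_new - tf_prev) / (rt_new - rt_prev)

def estimate_tf(tf_at_rt, delta, rt, s_now):
    """tf_est at s_now = tf at last refresh + delta * (data-items added since rt)."""
    return tf_at_rt + delta * (s_now - rt)

# Category refreshed at s2 (tf = 1) and again at s6 (tf = 4); current time is s9.
delta = update_delta(tf_prev=1, rt_prev=2, tf_new=4, rt_new=6)
print(delta, estimate_tf(tf_at_rt=4, delta=delta, rt=6, s_now=9))
```

Because the estimate depends on s*, two categories can swap order as time passes even when neither is refreshed.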
23
Determining Important Categories
What categories will be important?
  Categories which will be useful for answering queries in the future
What queries are likely to be asked in the future?
  Need to predict the queries
What categories will be useful for answering those queries?
  Look at history and find categories used for answering queries in the past
How to compute the benefit of a set of data-items?
  How many categories can be refreshed using the data-items?
  What is the importance of those categories?
Importance is a measure of the likelihood of the category being used to answer a query in the future
24
Range Selection Problem
Input:
  Sequence of categories c1, c2, ..., cN
  Width ImpData
Output: Set of data-items such that
  Total number of data-items selected is at most ImpData
  Total benefit is maximized
We use a dynamic programming algorithm to solve this problem
Details are in paper
26
Query Answering Module
Given keyword query Q = {t1, t2, ..., tl}, use tfest and idfest to find the top-K categories using the scoring function:
Naïve approach: Compute the score for all categories containing any one of the keywords in Q, and return the top-K categories
We use the threshold algorithm (TA) to do this efficiently
  TA solves the problem of finding the top-K objects amongst a set of objects using a scoring function consisting of multiple components
  TA requires the input objects to be sorted on each of the components
In our setup, the score of a category c is a combination of the tf.idf scores for each keyword ti
27
Query Answering Module
Algorithm Overview
Set up l ordered lists, one for each keyword
  List for keyword ti provides an ordering of categories based on tfest_s*(*,ti) x idfest_s*(ti)
Lists are merged using the TA algorithm to get top-K categories
[Figure: l lists of categories, the i-th sorted on tfest_s*(*,ti) x idfest_s*(ti), are fed into TA, which outputs the categories ranked by Scoreest_s*(*,Q)]
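The merging step can be sketched with the generic threshold algorithm. This is a single-level illustration under stated assumptions, not the two-level variant used by CS*: the aggregate score is assumed to be the sum of the per-keyword scores, scores are non-negative, a category missing from a list contributes 0, and the integer scores below are made up.

```python
import heapq

def threshold_algorithm(sorted_lists, k):
    """Return the top-k (category, score) pairs.

    sorted_lists holds one list per keyword of (category, partial_score)
    pairs, sorted by partial_score in descending order.
    """
    lookups = [dict(lst) for lst in sorted_lists]  # random access by category

    def full_score(category):
        return sum(lookup.get(category, 0) for lookup in lookups)

    seen = set()
    top = []  # min-heap of (score, category), at most k entries
    max_depth = max((len(lst) for lst in sorted_lists), default=0)
    for depth in range(max_depth):
        threshold = 0  # best score any still-unseen category could reach
        for lst in sorted_lists:
            if depth >= len(lst):
                continue  # exhausted list: unseen categories score 0 here
            category, partial = lst[depth]
            threshold += partial
            if category not in seen:
                seen.add(category)
                heapq.heappush(top, (full_score(category), category))
                if len(top) > k:
                    heapq.heappop(top)
        # Stop once k categories score at least as high as the threshold.
        if len(top) == k and top[0][0] >= threshold:
            break
    return [(c, s) for s, c in sorted(top, reverse=True)]

list_t1 = [("C3", 9), ("C1", 8), ("C9", 3), ("C2", 1)]
list_t2 = [("C5", 7), ("C2", 6), ("C6", 5), ("C1", 4)]
print(threshold_algorithm([list_t1, list_t2], 2))
```

In this example the loop stops after reading three entries from each list, without scanning the lists to the end, which is the saving TA offers over the naïve approach.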
28
Query Answering Module
Recall formula for tfests*:
Maintaining a sorted list as per tfest_s* is not easy
  It depends on a function of time s*, so the ordering changes with time
  Problem solved by using another level of threshold algorithm
29
Conclusion
First to identify the problem of keyword search over categorized dynamic data
Developed the CS* system consisting of two components:
Query Answering Module: Two level threshold algorithm
Meta-data Refresher: Formulated an interval selection problem and proposed a dynamic programming solution
Provides accuracy in excess of 90% using 57% fewer resources than the Update-All Strategy
30
Thank You & Questions!