chad-koziel
Jobs & Skills
Team Grant
MMA 865
What if you wanted to…
Plan your career: how your key skills are trending
Develop labour policy: skill deficits by region or by industry
Train job-ready graduates: add skills to programs and syllabi
(Syllabi is an awesome word)
2014 Aug 16 Team Grant for Queen's School of Business
Source Extract Store Distill Analyze
Answer Questions
from linkedin import linkedin
import json
import time
import pymongo

authentication = linkedin.LinkedInDeveloperAuthentication(...)
application = linkedin.LinkedInApplication(authentication)

client = pymongo.MongoClient()
db = client.jobengine
# Highest LinkedIn job id already stored; used to skip postings seen before
max_id = db.posting.find({'source': 'linkedin'}).sort('raw_data.id', -1).limit(1)[0]['raw_data']['id']

while True:
    list_of_jobs = application.search_job(
        selectors=[{'jobs': ['id', 'posting-date', ...]}],
        params={'count': 100, 'sort': 'DD', ...})
    # Walk the results in reverse so max_id only moves forward
    for job in reversed(list_of_jobs):
        if job['id'] <= max_id:
            continue
        max_id = job['id']
        location = job['locationDescription']
        raw_date = job['postingDate']
        posteddate = time.strftime("%d/%m/%Y", ...)
        skills = job['skillsAndExperience']
        db.posting.insert_one({"posted_date": posteddate, "skills": skills, "city": location,
                               "source": 'linkedin', "raw_data": job})
    time.sleep(300)  # wait five minutes before polling again
from careerbuilder import CareerBuilder
import json
import time
import pymongo

cb = CareerBuilder(DEV_KEY)  # DEV_KEY must hold your CareerBuilder developer key

search = cb.job_search(HostSite='CA', PostedWithin='1')
list_of_jobs = search['ResponseJobSearch']['Results']['JobSearchResult']
client = pymongo.MongoClient()
db = client.jobengine
for job in list_of_jobs:
    location = job['Location']
    # Parse CareerBuilder's month/day/year and store day/month/year, like the other sources
    posteddate = time.strftime("%d/%m/%Y", time.strptime(job['PostedDate'], "%m/%d/%Y"))
    skills = job['Skills']['Skill']
    db.posting.insert_one({"posted_date": posteddate, "skills": skills, "city": location,
                           "source": 'careerbuilder', "raw_data": job})
LinkedIn to CareerBuilder, Indeed
from indeed import IndeedClient
import json
import pymongo
import time

client = IndeedClient('123456')
params = {
    'l': "Anywhere",
    'co': "ca",
    'userip': "1.2.3.4",
    'useragent': "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_2)"
}
search_response = client.search(**params)
list_of_jobs = search_response['results']

client = pymongo.MongoClient()
db = client.jobengine

for job in list_of_jobs:
    location = job['city']
    posteddate = time.strftime("%d/%m/%Y", time.strptime(job['date'], "%a, %d %b %Y %H:%M:%S GMT"))
    # Indeed results carry no skills field, so it is stored empty
    db.posting.insert_one({"posted_date": posteddate, "skills": "", "city": location,
                           "source": 'indeed', "raw_data": job})
Results from Canada
60k results per week
300 MB per week
3+ data structures
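A rough back-of-the-envelope reading of these figures, as a sketch only: it assumes 1 MB = 1,000 KB and a steady weekly volume, neither of which the slide states.

```python
# Figures from the slide; the unit convention and the steady-volume
# assumption are ours, not the deck's.
RESULTS_PER_WEEK = 60000
MB_PER_WEEK = 300

# Average stored size of one posting, in KB
kb_per_posting = MB_PER_WEEK * 1000.0 / RESULTS_PER_WEEK

# Storage growth over 52 weeks, in GB
gb_per_year = MB_PER_WEEK * 52 / 1000.0

print("~%.1f KB per posting, ~%.1f GB per year" % (kb_per_posting, gb_per_year))
```

At that size a single MongoDB instance is likely enough for storage; the per-posting average is what matters when estimating growth from additional sources.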
"formattedRelativeTime": "5 days ago",
"city": "Lillooet",
"date": "Thu, 24 Jul 2014 20:21:52 GMT",
"formattedLocationFull": "Lillooet, BC",
"url": "http://ca.indeed.com/viewjob?jk=7779e5fbf4d0613f&qd=cvKKr6L_4R6jh64NGGBfipMcUh0i4g5C-X18qE0gAzC3Ws-qTrT0d3CswmkqzrsGxdgmiLA9Fpf3adh66N9NEAN9-HvuJGR2pUApIXI2XAs&indpubnum=1243433210984925&atk=18u2anmkg0mqi68p",
"jobtitle": "Executive Assistant",
"company": "Xaxli'p",
"onmousedown": "indeed_clk(this, '834');",
"snippet": "The Executive Assistant is responsible for providing administrative and secretarial services and support to the Chief and Council and the Band Administrator... ",
"source": "WorkBC",
"state": "BC",
"sponsored": false,
"country": "CA",
"formattedLocation": "Lillooet, BC",
"jobkey": "7779e5fbf4d0613f",
"expired": false,
"indeedApply": false
Sample result
Source API, Import IO, MongoDB
Python, Hadoop, SAS
Unstructured, Structured
Storage & structure
“Postings” collection: stores documents from different sources, with different structures
Wrapper structure allows uniform retrieval: posted date, skills, source, raw data, location
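The wrapper idea can be sketched as a single helper that builds the same outer document for every source. This is an illustration under assumptions: `wrap_posting` and `DATE_FORMATS` are hypothetical names (the deck shows one insert statement per source, not a shared function), and only the input date formats come from the deck's scripts.

```python
import time

# Input date formats as used in the deck's per-source scripts.
DATE_FORMATS = {
    "careerbuilder": "%m/%d/%Y",
    "indeed": "%a, %d %b %Y %H:%M:%S GMT",
}

def wrap_posting(source, raw, date_field, city_field, skills=""):
    """Hypothetical helper: build the uniform wrapper around one raw result."""
    parsed = time.strptime(raw[date_field], DATE_FORMATS[source])
    return {
        "posted_date": time.strftime("%d/%m/%Y", parsed),  # one format for all sources
        "skills": skills,
        "city": raw[city_field],
        "source": source,
        "raw_data": raw,  # original result kept untouched, whatever its structure
    }

# Values taken from the sample results shown in the deck
indeed_raw = {"city": "Lillooet", "date": "Thu, 24 Jul 2014 20:21:52 GMT"}
cb_raw = {"Location": "Toronto-M5J 2T3", "PostedDate": "7/29/2014"}

print(wrap_posting("indeed", indeed_raw, "date", "city")["posted_date"])
print(wrap_posting("careerbuilder", cb_raw, "PostedDate", "Location")["posted_date"])
```

Queries can then rely on the five wrapper fields and only reach into `raw_data` when source-specific detail is needed.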
Challenge & Solution
Identifying new information
Differing data formats
Duplicates between sources
Differing skill set data structures
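The slide does not show how duplicates between sources were handled. One possible approach, sketched here under assumptions, is to fingerprint each posting on a few normalized fields. The function names, the flattened `title`/`company`/`city` fields, and the second posting below are all hypothetical.

```python
import hashlib

def fingerprint(title, company, city):
    """Hash of lower-cased, whitespace-collapsed fields."""
    parts = [" ".join(p.lower().split()) for p in (title, company, city)]
    return hashlib.sha1("|".join(parts).encode("utf-8")).hexdigest()

def drop_duplicates(postings):
    """Keep the first posting seen for each fingerprint."""
    seen = set()
    unique = []
    for p in postings:
        key = fingerprint(p["title"], p["company"], p["city"])
        if key not in seen:
            seen.add(key)
            unique.append(p)
    return unique

# Illustrative postings; the careerbuilder entry is made up to show a collision
postings = [
    {"source": "indeed", "title": "Executive Assistant", "company": "Xaxli'p", "city": "Lillooet"},
    {"source": "careerbuilder", "title": "executive  assistant", "company": "XAXLI'P", "city": "lillooet"},
    {"source": "indeed", "title": "Web Developer", "company": "Robert Half Technology", "city": "Toronto"},
]
print([p["source"] for p in drop_duplicates(postings)])
```

Exact matching on normalized fields misses near-duplicates, such as reworded titles, so this is a lower bound on what a real deduplication step would catch.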
# getPostedDateSkill.py
import json
import pymongo

client = pymongo.MongoClient()
db = client.jobengine

# Query to get only the skills and posted_date fields
postings = db.posting.find({}, {"posted_date": 1, "skills": 1, "_id": 0})

# Iterate over each posting
for posting in postings:
    # Continue processing only if the skills field is not empty
    if posting['skills'] != "":
        skills = posting['skills']
        # If the skills field is a list, print the date with each skill;
        # otherwise print the date with the content of the skills field
        if isinstance(skills, list):
            for skill in skills:
                print("%s,%s" % (posting['posted_date'], skill.replace(',', '').lower()))
        else:
            print("%s,%s" % (posting['posted_date'], skills.replace(',', '').lower()))

# getSkillsCount.py
from mrjob.job import MRJob

class skillsCount(MRJob):
    def mapper(self, _, value):
        # Each input line is "<date>,<skill>"; commas were stripped from the skill upstream
        date, skill = value.split(",", 1)
        yield skill, 1

    def reducer(self, key, values):
        # Emit the count first so the output can be sorted numerically
        yield sum(values), key

if __name__ == '__main__':
    skillsCount.run()
Example: identify in-demand skills
getPostedDateSkill.py emits <date>,<skill> lines; getSkillsCount.py counts them; sort -n orders the output:
…
4 "html"
4 "system integration"
5 "software development"
6 "database"
7 "bookkeeping"
8 "audit"
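For checking the counts on a small sample without a Hadoop cluster, the same aggregation can be sketched in plain Python. `count_skills` is a hypothetical name and the input lines are illustrative, not real data; the line format is the `<date>,<skill>` output of getPostedDateSkill.py.

```python
from collections import Counter

def count_skills(lines):
    """Count skill mentions in '<date>,<skill>' lines; the date is ignored."""
    counts = Counter()
    for line in lines:
        _date, skill = line.rstrip("\n").split(",", 1)
        counts[skill] += 1
    return counts

# Illustrative input, not real data
sample = ["24/07/2014,audit", "24/07/2014,database", "25/07/2014,audit"]

# Ascending by count, mirroring the numeric sort on the slide
for skill, n in sorted(count_skills(sample).items(), key=lambda kv: kv[1]):
    print(n, skill)
```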
Trends
Run MR algorithms to return skill mention frequencies by date
Leverage analytics to understand trends, identify seasonality and predict growth / decline
Package to help employers find untapped labour sources and governments target immigration policies
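The deck does not say which forecasting technique produced its forecasts. As a minimal sketch of the idea, assuming a plain least-squares line is an acceptable stand-in, the snippet below tallies one skill's mentions per date and extrapolates a linear trend. Both function names and the sample numbers are hypothetical, and a straight line ignores the seasonality the slide mentions.

```python
from collections import Counter

def daily_counts(lines, skill):
    """Mentions of one skill per date, from '<date>,<skill>' lines."""
    pairs = (line.split(",", 1) for line in lines)
    return Counter(date for date, s in pairs if s == skill)

def linear_forecast(ys, steps_ahead=1):
    """Fit y = a + b*x by ordinary least squares over x = 0..n-1, then extrapolate.

    Needs at least two observations.
    """
    n = len(ys)
    x_mean = (n - 1) / 2.0
    y_mean = sum(ys) / float(n)
    sxx = sum((x - x_mean) ** 2 for x in range(n))
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in enumerate(ys))
    slope = sxy / sxx
    intercept = y_mean - slope * x_mean
    return intercept + slope * (n - 1 + steps_ahead)

# Illustrative counts per period, not taken from the deck's charts
history = [10, 12, 14, 16]
print(linear_forecast(history))
```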
Banks: “communication”
[Chart: Actual vs. Forecast, Jun-01 to Aug-01; vertical axis 0 to 70]
Banks: “SAS”
[Chart: Actual vs. Forecast, Jun-01 to Aug-01; vertical axis 0 to 10]
Clustering
Run algorithms to return complementary clusters of skills
Analyze for frequency of association to understand relative importance and trends over time
Package to help job seekers learn “next” skills and post-secondary institutions adapt programs and course syllabi
(Used twice in a single presentation!)
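The deck does not name the clustering algorithm. A simple starting point for "frequency of association", offered here as a sketch rather than the team's method, is to count how often two skills appear in the same posting. `cooccurrence` is a hypothetical name and the postings are illustrative.

```python
from collections import Counter
from itertools import combinations

def cooccurrence(skill_lists):
    """Count how often each unordered pair of skills shares a posting."""
    pairs = Counter()
    for skills in skill_lists:
        # Lower-case and de-duplicate within a posting; sort so each pair
        # always appears in the same order
        unique = sorted(set(s.lower() for s in skills))
        pairs.update(combinations(unique, 2))
    return pairs

# Illustrative postings, not real data
postings = [
    ["SAS", "SQL", "Communication"],
    ["SAS", "SQL"],
    ["Bookkeeping", "Audit"],
]
for pair, n in cooccurrence(postings).most_common(2):
    print(pair, n)
```

The pair counts can feed a clustering step, or be read directly: a skill that frequently accompanies one a job seeker already has is a candidate "next" skill.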
Big data…
Big questions?
Syllabi (third time’s the charm)
Appendix 1: LinkedIn API
from linkedin import linkedin
import json

# Placeholders: supply your own LinkedIn API credentials rather than hard-coding real ones
CONSUMER_KEY = 'YOUR_CONSUMER_KEY'
CONSUMER_SECRET = 'YOUR_CONSUMER_SECRET'
USER_TOKEN = 'YOUR_USER_TOKEN'
USER_SECRET = 'YOUR_USER_SECRET'
RETURN_URL = 'http://127.0.0.1'

authentication = linkedin.LinkedInDeveloperAuthentication(CONSUMER_KEY, CONSUMER_SECRET,
                                                          USER_TOKEN, USER_SECRET,
                                                          RETURN_URL, linkedin.PERMISSIONS.enums.values())
application = linkedin.LinkedInApplication(authentication)

# Fetch the authenticated user's profile, including listed skills
profile = application.get_profile(selectors=['id', 'first-name', 'last-name', 'skills'])
print(json.dumps(profile, indent=3))
print("*" * 120)

# Search for two job postings with "python" in the title
jobs = application.search_job(selectors=[{'jobs': ['id', 'customer-job-code', 'posting-date']}],
                              params={'title': 'python', 'count': 2})
print(json.dumps(jobs, indent=3))
Appendix 2: CareerBuilder API
from careerbuilder import CareerBuilder
import json
import time
import pymongo

cb = CareerBuilder(DEV_KEY)  # DEV_KEY must hold your CareerBuilder developer key
search = cb.job_search(HostSite='CA', PostedWithin='1')
list_of_jobs = search['ResponseJobSearch']['Results']['JobSearchResult']
client = pymongo.MongoClient()
db = client.jobengine
for job in list_of_jobs:
    location = job['Location']
    # Parse CareerBuilder's month/day/year and store day/month/year, like the other sources
    posteddate = time.strftime("%d/%m/%Y", time.strptime(job['PostedDate'], "%m/%d/%Y"))
    skills = job['Skills']['Skill']
    db.posting.insert_one({"posted_date": posteddate, "skills": skills, "city": location,
                           "source": 'careerbuilder', "raw_data": job})
Appendix 3: CareerBuilder Result
"Company": "Robert Half Technology",
"CompanyDID": "c8432266b3wfjhdhwpx",
"CompanyDetailsURL": "http://www.careerbuilder.ca/jobs/company-name/c8432266b3wfjhdhwpx/robert-half-technology/?sc_cmp1=13_JobRes_ComDet",
"DID": "J3G6PM69F3QVJ2MY15G",
"OnetCode": "15-1099.04",
"ONetFriendlyTitle": "Web Developers",
"DescriptionTeaser": "Ref ID: 05090-9688475 Classification: Programmer/Analyst Compensation: DOE Our client is currently looking for candidate with strong understanding of...",
"Distance": null,
"EmploymentType": "Full-Time Employee",
"EducationRequired": "Not Specified",
"ExperienceRequired": "Not Specified",
"JobDetailsURL": "http://api.careerbuilder.com/v1/joblink?TrackingID=UNTRKD&HostSite=CA&DID=J3G6PM69F3QVJ2MY15G",
"JobServiceURL": "https://api.careerbuilder.com/v1/job?DID=J3G6PM69F3QVJ2MY15G&HostSite=CA&DeveloperKey=WDHT5Y26MLSBGLS2HC7G",
"Location": "Toronto-M5J 2T3",
"LocationLatitude": "43.6432",
"LocationLongitude": "-79.3806",
"PostedDate": "7/29/2014",
"PostedTime": "7/29/2014 8:16:48 PM",
"Pay": "N/A",
…
Appendix 4: Indeed API
from indeed import IndeedClient
import json
import pymongo
import time

client = IndeedClient('123456')
params = {
    'l': "Anywhere",
    'co': "ca",
    'userip': "1.2.3.4",
    'useragent': "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_2)"
}
search_response = client.search(**params)
list_of_jobs = search_response['results']
client = pymongo.MongoClient()
db = client.jobengine
for job in list_of_jobs:
    location = job['city']
    posteddate = time.strftime("%d/%m/%Y", time.strptime(job['date'], "%a, %d %b %Y %H:%M:%S GMT"))
    # Indeed results carry no skills field, so it is stored empty
    db.posting.insert_one({"posted_date": posteddate, "skills": "", "city": location,
                           "source": 'indeed', "raw_data": job})
Appendix 5: Indeed Result
"formattedRelativeTime": "5 days ago",
"city": "Lillooet",
"date": "Thu, 24 Jul 2014 20:21:52 GMT",
"formattedLocationFull": "Lillooet, BC",
"url": "http://ca.indeed.com/viewjob?jk=7779e5fbf4d0613f&qd=cvKKr6L_4R6jh64NGGBfipMcUh0i4g5C-X18qE0gAzC3Ws-qTrT0d3CswmkqzrsGxdgmiLA9Fpf3adh66N9NEAN9-HvuJGR2pUApIXI2XAs&indpubnum=1243433210984925&atk=18u2anmkg0mqi68p",
"jobtitle": "Executive Assistant",
"company": "Xaxli'p",
"onmousedown": "indeed_clk(this, '834');",
"snippet": "The Executive Assistant is responsible for providing administrative and secretarial services and support to the Chief and Council and the Band Administrator... ",
"source": "WorkBC",
"state": "BC",
"sponsored": false,
"country": "CA",
"formattedLocation": "Lillooet, BC",
"jobkey": "7779e5fbf4d0613f",
"expired": false,
"indeedApply": false
Appendix 6: getPostedDateSkill
import json
import pymongo

client = pymongo.MongoClient()
db = client.jobengine

# Query to get only the skills and posted_date fields
postings = db.posting.find({}, {"posted_date": 1, "skills": 1, "_id": 0})

# Iterate over each posting
for posting in postings:
    # Continue processing only if the skills field is not empty
    if posting['skills'] != "":
        skills = posting['skills']
        # If the skills field is a list, print the date with each skill;
        # otherwise print the date with the content of the skills field
        if isinstance(skills, list):
            for skill in skills:
                print("%s,%s" % (posting['posted_date'], skill.replace(',', '').lower()))
        else:
            print("%s,%s" % (posting['posted_date'], skills.replace(',', '').lower()))
Appendix 7: getSkillsCount
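from mrjob.job import MRJob

class skillsCount(MRJob):
    def mapper(self, _, value):
        # Each input line is "<date>,<skill>"; commas were stripped from the skill upstream
        date, skill = value.split(",", 1)
        yield skill, 1

    def reducer(self, key, values):
        # Emit the count first so the output can be sorted numerically
        yield sum(values), key

if __name__ == '__main__':
    skillsCount.run()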
Attributions
Text for Big Data graphic: http://www.bigdata-startups.com/job-descriptions/
Big Data graphic: http://www.wordle.net/