Photo: http://cliparts.co/clipart/3666251
Has anyone ever written crawlers?
Has anyone ever used cron?
Has anyone ever used Sidekiq?
Gary (Chien-Wei Chu) @icarus4 / @icarus4.chu
Was a C programmer. Fell in love with Ruby in 2013.
CTO of Statementdog
I Play
Photo: https://static01.nyt.com/images/2016/08/19/sports/19BADMINTONweb3/19BADMINTONweb3-master675.jpg
Photo: http://classic.battle.net/images/battle/scc/protoss/pix/units/screenshots/d05.jpg
Photo: http://resources.workable.com/wp-content/uploads/2015/08/ruby-560x224.jpg
• Introduction to Statementdog
• Data behind Statementdog
• Past practice of Statementdog
• Problems of the past practice
• How we design our system to solve the problems
Focus on:
• More reliable job scheduling
• Dealing with throttling issues
Revenue, EPS, Gross Margin, Net Income, Assets, Liabilities, Operating Cash Flow, Free Cash Flow, Investing Cash Flow, ROE, ROA, Accounts Receivable, Accounts Payable, PMI, GDP
Taiwan Market Observation Post System
Taiwan Stock Exchange
Taiwan Depository & Clearing Corporation
Yahoo Stock Feed
…
Yearly - dividend, remuneration of directors and supervisors
Quarterly - quarterly financial statements
Monthly - revenue
Weekly -
Daily - closing price
Hourly - stock news from Yahoo stock feed
Minutely - important news from Taiwan Market Observation Post System
Something like this, but written in PHP:
a super long-running process (1 hour+) that loops from the first stock to the last one.

Stock.find_each do |stock|
  # download XML financial report data …
  # extract XML data …
  # calculate advanced data …
end
A super long-running process for quarterly reports
A super long-running process for monthly revenue
A super long-running process for daily prices
A super long-running process for news
…
• Really slow
• Inefficient - unable to retry only the failed one
• Unpredictable server loading
[Diagram: server loading over time. When the server loading is low, Jobs 1-5 fit comfortably. When the server loading is HIGH, Jobs 1-5 stack on top of other tasks.]
Too many crawler processes executed at the same time
• Really slow
• Inefficient - unable to retry only the failed one
• Unpredictable server loading
• Scaling out is not easy
• Inherent problems of Unix Cron:
  • Unreliable scheduling
  • High availability is not easy
  • Hard to prioritize jobs by popularity
• Not easy to deal with bandwidth throttling issues
Sidekiq, created by Mike Perham
[Diagram: requests arrive at the web server process, which pushes jobs to the job queue (push to queue is very fast). Worker processes on one or more worker servers pull jobs from the queue. The web server is the producer; the worker processes are the consumers. Add extra worker servers when needed.]
[Diagram: a single worker process hosts many threads (thread 1 … thread 25). Single process vs. multi-thread: 1 process to 25 threads, with the same degree of memory consumption.]
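For readers new to Sidekiq, here is a minimal sketch of the producer/consumer flow above. CrawlWorker and the stock_id argument are illustrative names, not part of the original deck:

require 'sidekiq'

# Hypothetical example worker; the deck's actual workers appear later
class CrawlWorker
  include Sidekiq::Worker

  # Runs later on a worker thread, not inside the web request
  def perform(stock_id)
    puts "crawling stock #{stock_id}"
  end
end

# Producer side (e.g. a Rails controller): enqueue and return immediately
CrawlWorker.perform_async(2330)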
Three editions: Sidekiq (OSS), Sidekiq Pro, Sidekiq Enterprise

Sidekiq Pro: Batches, Enhanced Reliability, Search in Web UI, Worker Metrics, Expiring Jobs
Sidekiq Enterprise: Rate Limiting, Periodic Jobs, Unique Jobs, Historical Metrics, Multi-process, Encryption
Parallelism makes things faster
With Sidekiq, the old problems become:
• Fast - parallelism makes things faster
• Efficient - only retry the failed one
• Predictable server loading
• Easy to scale out

Remaining problems:
• Inherent problems of Unix Cron:
  • Unreliable scheduling
  • High availability is not easy
  • Hard to prioritize jobs by popularity
• Not easy to deal with bandwidth throttling issues
–Mike Perham, CEO, Contributed Systems, Creator of Sidekiq
Keep the state of cron executions in the most robust part of our system: the database.
All scheduled jobs are invoked by a single job that runs every minute.
Create a table for storing cron settings (table name: cron_jobs):

create_table :cron_jobs do |t|
  t.string    :klass, null: false            # worker class name
  t.string    :cron_expression, null: false  # something like "0 */2 * * *"
  t.timestamp :next_run_at, null: false, index: true  # when the job should next be executed
end
klass                      | cron_expression | next_run_at
Push2000NewsJobs           | "0 */2 * * *"   | …
Push2000DailyPriceJobs     | "0 2 * * 1-5"   | …
Push2000MonthlyRevenueJobs | "0 0 10 * *"    | …
…
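Seeding this table is plain ActiveRecord. A sketch, assuming a CronJob model backed by the table above:

CronJob.create!(
  klass: "Push2000NewsJobs",
  cron_expression: "0 */2 * * *",
  next_run_at: Time.now  # picked up on the next minute tick
)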
# Add to your cron setting (whenever-gem syntax)
every :minute do
  runner 'CronJobWorker.perform_async'
end

Cron only schedules one job, every minute.
CronJobWorker invokes all of your crawlers:

class CronJobWorker
  include Sidekiq::Worker

  def perform
    # Find jobs that should be executed
    CronJob.where("next_run_at <= ?", Time.now).find_each do |job|
      # Push jobs to the job queue
      Sidekiq::Client.push(
        'class' => job.klass.constantize,
        'args'  => ['foo', 'bar']
      )
      # Set up the next execution time
      x = Sidekiq::CronParser.new(job.cron_expression)
      job.update!(next_run_at: x.next.to_time)
    end
  end
end

Missed job executions will be executed at the next minute.
Drawbacks solved:
• Inherent problems of Unix Cron:
  • Unreliable scheduling ✓ (missed executions simply run on the next minute tick)
  • Hard to prioritize jobs by popularity
  • High availability is not easy
• Not easy to deal with bandwidth throttling issues
table: cron_jobs

klass            | cron_expression | args                 | next_run_at
Push2000NewsJobs | "0 */2 * * *"   | []                   | …
NewsWorker       | "*/30 * * * *"  | [popular_stock_id_1] | …
NewsWorker       | "*/30 * * * *"  | [popular_stock_id_2] | …
…
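Rows like these can be generated per popular stock. A sketch, assuming an args column has been added to cron_jobs and that Stock.popular is a hypothetical scope:

Stock.popular.each do |stock|
  CronJob.create!(
    klass: "NewsWorker",
    cron_expression: "*/30 * * * *",  # popular stocks crawled every 30 minutes
    args: [stock.id],
    next_run_at: Time.now
  )
end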
Drawbacks solved:
• Inherent problems of Unix Cron:
  • Unreliable scheduling ✓
  • Hard to prioritize jobs by popularity ✓ (one cron_jobs row per popular stock)
  • High availability is not easy
• Not easy to deal with bandwidth throttling issues
Sidekiq.configure_server do |config|
  config.periodic do |mgr|
    mgr.register("* * * * *", CronJobWorker)  # every minute
  end
end
With Sidekiq Enterprise's periodic jobs, no single cron host is needed. Drawbacks solved:
• Inherent problems of Unix Cron:
  • Unreliable scheduling ✓
  • Hard to prioritize jobs by popularity ✓
  • High availability is not easy ✓
• Not easy to deal with bandwidth throttling issues
You always want your crawler to be as fast as possible. However, your target server doesn't always allow you to crawl at an unlimited rate.
If you want to crawl data for your 2000 stocks:

Stock.pluck(:id).each do |stock_id|
  SomeWorker.perform_async(stock_id)
end

This inserts 2000 jobs into the queue at the same time.
Assume the target server accepts requests at a maximum rate of 1 request per second.
[Diagram: at second 1, job1 through job2000 all fire at once]
If you insert 2000 jobs into the queue at the same time, all of your jobs may be blocked (except the first one).
Improvement 1: Schedule jobs with incremental delays

Stock.pluck(:id).each_with_index do |stock_id, index|
  SomeWorker.perform_in(index, stock_id)  # delay each job by `index` seconds
end

[Diagram: job1 at second 1, job2 at second 2, job3 at second 3, …, job2000 at second 2000]
Workable, but…
[Diagram: if the target server is unreachable for a while, the scheduled delays expire during the outage, and job3 ~ job2000 will still execute at the same time]
• Limit your worker threads to perform specific jobs at a bounded rate
• Sidekiq Enterprise provides two types of rate-limiting APIs, shown below
CONCURRENT_LIMITER = Sidekiq::Limiter.concurrent('price', 10)

def perform(*)
  CONCURRENT_LIMITER.within_limit do
    # crawl stock data
  end
end

Only 10 concurrent operations inside the block can happen at any given moment.
BUCKET_LIMITER = Sidekiq::Limiter.bucket('price', 10, :second)

def perform(*)
  BUCKET_LIMITER.within_limit do
    # crawl stock data
  end
end

For every second, you can perform up to 10 operations.
You must fine-tune your limiter parameters for each data source to get the best performance.
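As one hedged example of such tuning: Sidekiq Enterprise's limiters accept options such as wait_timeout (how long a thread blocks waiting for capacity) and lock_timeout. The names and numbers below are illustrative, not the deck's actual settings:

# Illustrative tuning only; pick limits per data source
PRICE_LIMITER = Sidekiq::Limiter.bucket('price', 10, :second, wait_timeout: 5)
NEWS_LIMITER  = Sidekiq::Limiter.concurrent('news', 3, wait_timeout: 2, lock_timeout: 60)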
By now, you've already got better performance. However, the throttling policy of your target server may not always be static: many websites apply throttling dynamically.
If throttling is detected, pause your workers for a while.
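How you detect throttling depends on the target site. A minimal sketch, assuming the server signals throttling with HTTP 429 or 503 (the URL is a placeholder):

require 'net/http'

def throttled?(response)
  # 429 Too Many Requests / 503 Service Unavailable are common throttling signals
  %w[429 503].include?(response.code)
end

response = Net::HTTP.get_response(URI('https://example.com/stock/2330'))
puts 'throttled!' if throttled?(response)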
[Diagram: Redis (the job queue) holds several queues: critical, default, low, and yahoo. Worker threads pull jobs from them.]
Pause the yahoo queue when throttled.
Schedule a job in another queue, executed after a few seconds, to "unpause" the paused queue.
The yahoo queue resumes after the unpause job executes.
class SomeWorker
  include Sidekiq::Worker

  def perform
    # try to crawl something
    # ...
    if throttled  # result of your throttling detection (see above)
      queue_name = self.class.get_sidekiq_options['queue']
      queue = Sidekiq::Queue.new(queue_name)
      queue.pause!
      ResumeJobQueueWorker.perform_in(30.seconds, queue_name)
    end
  end
end

class ResumeJobQueueWorker
  include Sidekiq::Worker
  sidekiq_options queue: :queue_control, unique: :until_executed

  def perform(queue_name)
    queue = Sidekiq::Queue.new(queue_name)
    queue.unpause! if queue.paused?
  end
end
The queue for ResumeJobQueueWorker MUST NOT be the same as the paused queue, otherwise the unpause job itself would never run. We have a dedicated queue for ResumeJobQueueWorker.
Decrease the Sidekiq server's poll interval for more precise timing control.
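The poll interval governs how often Sidekiq checks the scheduled-job set, and thus how promptly perform_in jobs like the unpause job fire. A sketch using Sidekiq's average_scheduled_poll_interval setting; 5 seconds is an illustrative value:

Sidekiq.configure_server do |config|
  # Check the scheduled set roughly every 5 seconds instead of the default ~15
  config.average_scheduled_poll_interval = 5
end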
Queue pausing alleviates throttling issues. Is it possible for us to do even better?
Most throttling controls aim to block requests from the same IP address. We can change our IP address via a proxy service.
[Diagram: without a proxy, every request from the Sidekiq server (a.b.c.d) reaches the target server from the same IP. With a proxy service, the Sidekiq server sends all requests to a single proxy endpoint, which routes them through different proxy servers (e.f.g.h, i.j.k.l, m.n.o.p, q.r.s.t), so the target server sees a different IP for each request.]
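Routing requests through such an endpoint is a small change with Net::HTTP; the proxy host, port, and target URL below are placeholders:

require 'net/http'

# Hypothetical endpoint provided by the proxy service
PROXY_HOST = 'proxy.example.com'
PROXY_PORT = 8080

uri = URI('https://target.example.com/stock/2330')
Net::HTTP.start(uri.host, uri.port, PROXY_HOST, PROXY_PORT, use_ssl: true) do |http|
  # The request exits through the proxy, so the target sees the proxy's IP
  response = http.get(uri.request_uri)
  puts response.code
end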
• With Sidekiq (Enterprise) and a proper design, the following problems are solved:
  • Slow crawler
  • Inefficient - unable to retry only the failed one
  • Unpredictable server loading
  • Scaling out is not easy
  • Inherent problems of Unix Cron
  • Not easy to deal with bandwidth throttling issues