Analysis of non-stochastic time varying data - FINGRID Lee Gillam Department of Computing, University of Surrey


Fingrid (RES-149-25-0028) ASW on Quant. Methods in e-Social Science, 6 April 2005

Financial Decision Making

Challenge:

analysis of streaming financial (time-series) data and financial and political news

At the interface of quantitative and qualitative?


FINGRID Project

Aimed at an information management/processing challenge in the social sciences: analysis and fusion of distributed quantitative and qualitative data and programs.

12-month eSS PDP involving econometrics (Essex) and computing academics, particularly in grid computing and artificial intelligence, at Surrey (social anthropologists & criminologists)

Third project at Surrey that deals with qualitative data (news and reports) and quantitative data (time series), after the EU projects ACE (1996-99) and GIDA (2001-03).


Motivation

Market sentiment - quantifying effects of news in the Efficient Market Hypothesis?

Technicalists (chart patterns, stats) and fundamentalists (intrinsic -book- value) locked away from the outside world - no CNN?

Challenge of treating multiple data sources

Bounded rationality (Simon 1972, Kahneman 2002)?

Self-deception of investors rejecting new evidence in favour of prior (incorrect) information (Lakonishok, Lee & Poteshman 2003, Kindleberger 2001) - e.g. “.com” bubble

Buy/sell - human (re-)action is documented in the dataset


FINGRID methods/techniques

sentiment analysis: automatic terminology extraction; ontology learning; local grammars.

Learning the rules for Information Extraction (IE). Patterns derived from a corpus (MB to GB) of texts (arbitrary domain)

time series analysis (bootstrapping, wavelet analysis)

visualization of large volume time series and texts

Grid - Globus, Condor, OGSA-DAI, SRB


FINGRID Technologies

24 computers (dual-proc, hyperthreaded) provide:
1. Globus Toolkit 3.0.2 (GT3),
2. Java and FORTRAN software compilers,
3. Java Commodity Grid kit (CoG Kit), and
4. Local security certification.

FINGRID uses the Java CoG Kit to integrate:
(i) the MATLAB wavelet toolbox via JMatLink;
(ii) Reuters data via the Reuters SSL SDK;
(iii) bootstrap simulation written in FORTRAN; and
(iv) System Quirk components via the Quirk Java SDK.

Condor (management of distributed processing, 76 procs in pool) and Storage Resource Broker (Data Grids) also configured: expansion and testing in progress.


Streaming Data (Reuters)

FOREX (GBP/USD) tick data


Datasets

HFDF data (O&A), e.g. 5-minute compression GBP/USD, 1992 to 2003 inclusive; 1.25M datapoints (12*24*365*12), approximately 4MB.

Text corpora: RCV 1 (over 800000 news stories in 12 months of 1996-7); RCV 2 (13 languages)

Copyrights/contracts?

Numerical data: time series of price/value movement of financial instruments; c. 5MB/day per instrument (XML), including sources of quote (>1GB/year/instrument)

Textual data: text streams of news items; financial reports; company brochures; government documents...; c. 40MB/day (>10GB/year)
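The volumes quoted on this slide can be checked with a few lines of arithmetic. The sketch below only reproduces the slide's own figures; the constant and function names are illustrative, and it assumes (as the slide's formula does) 365-day years with no market closures.

```python
# Back-of-envelope check of the data volumes quoted on the Datasets slide.
SAMPLES_PER_HOUR = 12   # one observation every 5 minutes
HOURS_PER_DAY = 24
DAYS_PER_YEAR = 365     # the slide's formula ignores leap days and closures
YEARS = 12              # 1992 to 2003 inclusive

# 12 * 24 * 365 * 12, the product given on the slide
datapoints = SAMPLES_PER_HOUR * HOURS_PER_DAY * DAYS_PER_YEAR * YEARS


def yearly_megabytes(mb_per_day, days=DAYS_PER_YEAR):
    """Convert a daily volume in MB into a yearly volume in MB."""
    return mb_per_day * days


numeric_mb_per_year = yearly_megabytes(5)    # c. 5MB/day per instrument
text_mb_per_year = yearly_megabytes(40)      # c. 40MB/day of text

print(datapoints)            # about 1.26 million observations
print(numeric_mb_per_year)   # well over 1GB per year per instrument
print(text_mb_per_year)      # well over 10GB per year
```

The exact product is slightly above the rounded 1.25M on the slide, and both yearly totals sit comfortably above the quoted lower bounds.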


Non-stochastic?

Encyclopedia of Chart Patterns
Japanese Candlestick Charting Techniques
If price increases, demand decreases?


Methods - Bootstrap

With many financial series, it may be difficult to select and fit an appropriate model. Block bootstrap is a procedure for generating bootstrap samples from time series when a parametric model is not available. The blocking procedure consists of dividing the data into blocks and sampling the blocks randomly with replacement.

Bootstrap techniques are inherently computationally demanding, even using efficient computational algorithms (Nankervis 2002). The bootstrap can be iterated so that a further layer of resampling is performed (a double bootstrap), which results in improved properties of estimators and test statistics.

To make realistic statistical inferences from data using bootstrapping, a significant number of replications (c. 10000) should be used (Lobato, Nankervis & Savin 2001).

Other bootstrap-based procedures applicable to financial data include estimating the distribution of returns for Value at Risk (VaR) models (Ruiz and Pascual, 2002).
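The blocking procedure can be illustrated with a minimal sketch. This is not the project's FORTRAN implementation: it assumes the moving-block variant (overlapping blocks of fixed length), and the function names, the toy return series, the block length of 3 and the 1000 replications are all illustrative choices.

```python
import random
from statistics import mean


def block_bootstrap(series, block_len, rng):
    """Build one resample by drawing contiguous blocks with replacement."""
    n = len(series)
    if not 1 <= block_len <= n:
        raise ValueError("block_len must be between 1 and len(series)")
    starts = range(n - block_len + 1)   # every admissible block start
    sample = []
    while len(sample) < n:
        start = rng.choice(starts)
        sample.extend(series[start:start + block_len])
    return sample[:n]                   # trim the last block to length n


def bootstrap_distribution(series, statistic, block_len, replications, seed=0):
    """Evaluate `statistic` on `replications` independent block resamples."""
    rng = random.Random(seed)
    return [statistic(block_bootstrap(series, block_len, rng))
            for _ in range(replications)]


# Toy return series (made-up numbers, for illustration only).
returns = [0.4, -0.1, 0.3, 0.2, -0.5, 0.1, 0.0, 0.6, -0.2, 0.3, -0.4, 0.2]
means = sorted(bootstrap_distribution(returns, mean, block_len=3,
                                      replications=1000))
lower, upper = means[25], means[974]    # rough 95% percentile interval
print(lower, upper)
```

Each replication is independent of the others once its random seed is fixed, which is what makes the procedure a natural candidate for distribution across machines.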


Simple Bootstrapping

[Figure: time in seconds (0 to 2500) against number of machines (1, 2, 4, 8), for 500 and 1000 bootstrap replications]

1000 bootstrap replications:
2 nodes: 1050 seconds (17.5 mins)
8 nodes: 404 seconds (6.73 mins)

10000+ replications? Linear speedup?
Hypothesis testing: dismiss bad ideas more quickly?

Methods - Bootstrap

Bespoke FORTRAN implementation of bootstrapping [Nankervis] algorithm (Globus, Java CoGkit – Grid service)


Distributed bootstrapping

Bootstrap is partially parallelizable: Amdahl’s law: the fraction of code f, which cannot be parallelised, affects speedup factor - replication seeds, results.

Condor and Condor DAGs (compose metalevel description)

[Diagram: seed -> four parallel calculate jobs -> results]

Condor DAG description:

Job A seed.cmd
Job B calculate1.cmd
Job C calculate2.cmd
Job D calculate3.cmd
Job E calculate4.cmd
Job F results.cmd
PARENT A CHILD B C D E
PARENT B C D E CHILD F

Condor submit description for one calculate job:

executable = calculate.exe
input =
output = calculate.1.out
error = calculate.1.err
transfer_input_files = outs_aa
transfer_files = ALWAYS
log = calculate.1.log
arguments = outs_aa 250
queue
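Amdahl's law can be applied to the two timings reported for 1000 bootstrap replications (1050 seconds on 2 nodes, 404 seconds on 8 nodes). The sketch below assumes the simplest model, T(n) = serial + parallel / n, and fits it through those two points; with only two measurements the fit is exact by construction, so the resulting serial fraction is an illustration rather than a measured property of the system. The function names are hypothetical.

```python
def amdahl_speedup(serial_fraction, machines):
    """Amdahl's law: speedup on `machines` when `serial_fraction` cannot be parallelised."""
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / machines)


def fit_serial_fraction(n1, t1, n2, t2):
    """Fit T(n) = serial + parallel / n through two timings.

    Returns (serial_fraction, estimated_single_machine_time).
    """
    parallel = (t1 - t2) / (1.0 / n1 - 1.0 / n2)
    serial = t1 - parallel / n1
    total = serial + parallel
    return serial / total, total


# Timings quoted on the "Simple Bootstrapping" slide.
f, t_single = fit_serial_fraction(2, 1050.0, 8, 404.0)

print(round(f, 3))                      # estimated serial fraction
print(round(t_single, 1))               # implied one-machine time in seconds
print(round(1050.0 / 404.0, 2))         # observed speedup from 2 to 8 nodes
print(round(amdahl_speedup(f, 8), 2))   # modelled speedup on 8 machines
print(round(1.0 / f, 1))                # upper bound on speedup under this model
```

Under this model roughly a tenth of the run is serial (seeding and collecting results), so quadrupling the nodes from 2 to 8 yields a speedup of about 2.6 rather than 4, and the speedup would level off at about 10 however many machines are added.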


Wavelet analysis

Conventional Signal Processing:

• Variation in time-domain OR variation in frequency domain applicable to stationary series

Wavelet-based Analysis:

• Variation in time-domain AND variation in frequency domain applicable to non-stationary series.

Aussem & Murtagh (1997) use wavelet analysis combined with neural networks to provide time series forecasts


Wavelet Multiscale Analysis

Fourier Power Spectra can be computed for each scale – discover cyclicals
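A multiscale decomposition followed by a per-scale power spectrum can be sketched as follows. The project used the MATLAB wavelet toolbox; this sketch swaps in the Haar wavelet and a naive discrete Fourier transform purely because both fit in a few lines of standard-library Python. The function names and the toy series are assumptions.

```python
import cmath
import math


def haar_step(signal):
    """One level of the orthonormal Haar transform: (approximation, detail)."""
    s = math.sqrt(2.0)
    approx = [(signal[i] + signal[i + 1]) / s for i in range(0, len(signal), 2)]
    detail = [(signal[i] - signal[i + 1]) / s for i in range(0, len(signal), 2)]
    return approx, detail


def haar_multiscale(signal, levels):
    """Return (coarsest approximation, [detail at level 1, ..., level `levels`])."""
    if len(signal) % (2 ** levels) != 0:
        raise ValueError("length must be divisible by 2**levels")
    approx, details = list(signal), []
    for _ in range(levels):
        approx, detail = haar_step(approx)
        details.append(detail)
    return approx, details


def energy(values):
    """Sum of squares, used to see how much variation each scale carries."""
    return sum(v * v for v in values)


def dominant_period(values):
    """Period, in samples, of the strongest non-zero frequency in a naive DFT."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two samples")
    powers = {}
    for k in range(1, n // 2 + 1):
        coeff = sum(v * cmath.exp(-2j * math.pi * k * t / n)
                    for t, v in enumerate(values))
        powers[k] = abs(coeff) ** 2
    return n / max(powers, key=powers.get)


# Toy data: a pure cycle of period 8, and the same cycle on a slow trend.
cycle = [math.sin(2 * math.pi * t / 8) for t in range(64)]
series = [0.05 * t + c for t, c in enumerate(cycle)]

trend, details = haar_multiscale(series, levels=3)
print([round(energy(d), 2) for d in details], round(energy(trend), 2))
print(dominant_period(cycle))
```

In a workflow like the one described here, `dominant_period` would be applied to each scale's series in turn, with the period read in units of that scale's sampling interval.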


Methods - Wavelet analysis

Most dominant cycle (brown rectified sine wave) has a period of 85.3333 and starts at 57.3333

Next dominant cycle (green rectified sine wave) has a period of 42.6667 and starts at 41.3333

Other cycles in order of their importance are 23.27, 11.6364, 5.68 and 3.12

UPTREND from 1 to 260 with a slope of 12.57 and a y-intercept of 2626.37

DOWNTREND from 261 to 358 with a slope of -8.6956 and a y-intercept of 8166.4091

The series loses its stationarity (variance change occurs) at 141 (black vertical line)

Possible turning points (black circles):

68, 144, 148, 152, 154, 165, 212, 220, 228, 260, 298, 299, 300, 348, 358, and 358

[Slide panels: Visualization; Textual Summary]


Methods - Wavelet analysis

Matlab toolboxes for Wavelet and Signal processing analysis
Matlab -> JMatLink (Java) -> Java CoGkit - Grid service
Parallel/performance evaluation?

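// Assumes the JMatLink library (class jmatlink.JMatLink) and a local MATLAB installation.
JMatLink engine = new JMatLink();
engine.engOpen();                                // start a MATLAB engine session
engine.engEvalString("array=randn(500)");        // 500 x 500 matrix of normal random numbers
double[][] array = engine.engGetArray("array");  // copy the matrix back into Java
engine.engClose();                               // shut the session down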

public class TSAanalysisServiceGridLocator extends org.globus.ogsa.impl.core.service.ServiceLocator implements org.globus.ogsa.GridLocator {


Workflow

Select instrument tick data
Use sampling rule (OHLC) to create a time series [4 series, C at equally-spaced intervals]
Use close series for n-scale Wavelet transform [n-series]
Identify trends in low-frequency scale; apply Fourier analysis to each n-series to discover cycles
Apply bootstrap to modelling individual series?
Combination of model and trends = prediction?
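The OHLC sampling step of the workflow above can be sketched as below. The tick format (a timestamp in seconds paired with a price), the function names and the rule of repeating the previous close for intervals without ticks are all assumptions made for illustration; the slides do not say how gaps were handled.

```python
def ohlc_bars(ticks, interval_seconds):
    """Group (timestamp_seconds, price) ticks into fixed-width bars.

    Returns (bar_start, open, high, low, close) tuples in time order.
    Intervals without any tick produce no bar.
    """
    bars = {}
    # Stable sort on the timestamp keeps arrival order for simultaneous ticks.
    for ts, price in sorted(ticks, key=lambda tick: tick[0]):
        start = ts - ts % interval_seconds
        if start not in bars:
            bars[start] = [price, price, price, price]
        else:
            bar = bars[start]
            bar[1] = max(bar[1], price)   # high
            bar[2] = min(bar[2], price)   # low
            bar[3] = price                # close is the latest tick so far
    return [(start, *values) for start, values in bars.items()]


def close_series(bars, interval_seconds):
    """Closing prices at equally spaced intervals.

    Empty intervals repeat the previous close (an assumption, see above).
    """
    if not bars:
        return []
    close_by_start = {bar[0]: bar[4] for bar in bars}
    closes, last = [], None
    t = bars[0][0]
    while t <= bars[-1][0]:
        last = close_by_start.get(t, last)
        closes.append(last)
        t += interval_seconds
    return closes


# Made-up GBP/USD ticks, sampled into 5-minute (300 second) bars.
ticks = [(0, 1.50), (100, 1.52), (250, 1.49), (310, 1.51), (900, 1.55)]
bars = ohlc_bars(ticks, 300)
closes = close_series(bars, 300)
print(bars)
print(closes)
```

The resulting close series is what the next workflow step would pass to the wavelet transform.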


Methods - Textual time series

Streaming news text
Named entity identification (e.g. company name)
Sentiment discovery (local grammars)
Up/down series for market / company (qual -> quant?)
System Quirk JDK + Java CoGkit = Grid Service
-> time series analysis
-> covariance analysis


Methods - Textual time series

Local Grammar | Example | Frequency
said PN, TITLE at ORG. | said Alex Scott, research analyst at Seven Investment Management. | 23.49%
said PN, TITLE at MOD ORG. | said Mike Lenhoff, chief strategist at private client fund manager Gerrard. | 4.23%
said PN at ORG. | said Alex Bannister at Nationwide. | 3.26%
said PN of ORG. | said Andrew Pendrill of ABN AMRO. | 2.06%
said PN, TITLE at ORG in PLACE. | said David Marshall, analyst at NCB Stockbrokers in Dublin. | 2.04%
said PN at MOD ORG. | said Simon Rubinsohn at brokerage Gerrard Ltd. | 1.97%
said PN, a TITLE at ORG. | said Conor Bill, a partner at Lawrence & Co. | 1.40%
said PN, an TITLE at ORG. | said David Pope, an industry analyst at Brewin Dolphin Securities. | 1.25%
said PN, TITLE at the ORG. | said Oliver Froehlich, euro-transition team leader at the Frankfurter Volksbank. | 0.83%
said PN, TITLE of the ORG. | said Mohammad Nazri Fatullah, chairman of the Afghan Trade Association. | 0.74%
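Two of the local grammar patterns in the table can be approximated with hand-written regular expressions, as a rough illustration of what such a pattern matches. This is a stand-in only: the project learned its patterns from a corpus, whereas the character classes below (capitalised words for PN and ORG, lower-case words for TITLE) and the function name are assumptions.

```python
import re

# PN: two or more capitalised words (an assumption about person names).
NAME = r"(?P<person>[A-Z][\w'-]+(?: [A-Z][\w'-]+)+)"

PATTERNS = [
    ("said PN, TITLE at ORG.",
     re.compile(r"said " + NAME
                + r", (?P<title>[a-z][a-z -]*?) at (?P<org>[A-Z][\w&.' -]*?)\.$")),
    ("said PN at ORG.",
     re.compile(r"said " + NAME + r" at (?P<org>[A-Z][\w&.' -]*?)\.$")),
]


def match_attribution(sentence):
    """Return (pattern label, captured fields) for the first matching pattern, else None."""
    for label, regex in PATTERNS:
        found = regex.search(sentence)
        if found:
            return label, found.groupdict()
    return None


examples = [
    "said Alex Scott, research analyst at Seven Investment Management.",
    "said Alex Bannister at Nationwide.",
    "said Mike Lenhoff, chief strategist at private client fund manager Gerrard.",
]
for sentence in examples:
    print(match_attribution(sentence))
```

The third example is deliberately left unmatched: it belongs to the "MOD ORG" variant, which these two expressions do not cover.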


Methods - Textual time series

Patterns identified for Chinese also: “up” (上升 ) in Chinese

上半年 /NTN 地產 /NN 投資 /NN 收入 /NN

上升 /NN 約 /FPM 百分之八 /MM ﹐ 至 /I 十九億 /MM 元 /U ﹔

first half of this year, estate investment

up about 8 percent, to 19 billion dollars

月 /NTN 期 /NN 指 /VT 全 /PA 日 /NTN 收 /VT 報 /NN 一萬一千三百 /MM 點 /U ﹐

上升 /VI 二十 /MM 點 /U ﹐ 低 /A 水 /NN 四十五 /MM 點 /U ﹐ 成交 /VT 合約 /NN

day-close value of the monthly index was 11300 points,

up 20 points, 45 points below average


Methods - Textual time series

Text Analysis Throughput tested with various sizes of corpora

– against benchmark (wordlists – Hughes et al 2004)

[Figure: Text Analysis; time in seconds (0 to 600) against number of machines (1, 2, 4, 8); legend: Text Analysis (process time in ms)]

Time required to process one month’s news.
RCV1 takes about 95 minutes on 16 machines.
Further experiments in progress


Qual. meets Quant.

Decision Matrix / probability of direction

Columns: Market (security); JRC Fractal present?; Divergence exists?; Momentum changed; Volume increased; Reversal exists?; Overbought/oversold; Leverage sufficient; Other, e.g. UniS Sentiment; confidence level

Example row: EUR/USD + + + + + + up 4


Qual. meets Quant.

SYSTEM QUIRK: Reuters News Feed -> Up / Down -> Time Series of Up and Down

Financial instrument (Reuters), e.g. FTSE 100 INDEX

[Figure, two panels: Ratio (0 to 1.2) against Date (1 to 30); series: Good words, FTSE100]

Generate Signal (Buy / Sell)
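The pipeline on this slide (news text, then up/down counts, then a ratio series, then a signal) can be sketched in a few lines. Everything specific here is hypothetical: the two word lists, the definition of the ratio as the share of "up" words among all sentiment words, and the buy/sell thresholds are placeholders, not the lexicon or rules used by System Quirk.

```python
UP_WORDS = {"up", "rise", "rose", "gain", "gains", "rally"}        # hypothetical lexicon
DOWN_WORDS = {"down", "fall", "fell", "loss", "losses", "slump"}   # hypothetical lexicon


def up_ratio(text):
    """Share of 'up' words among all sentiment words, or None if there are none."""
    tokens = [token.strip(".,;:!?\"'()").lower() for token in text.split()]
    ups = sum(token in UP_WORDS for token in tokens)
    downs = sum(token in DOWN_WORDS for token in tokens)
    total = ups + downs
    return ups / total if total else None


def signal(ratio, buy_above=0.6, sell_below=0.4):
    """Map a ratio to 'buy', 'sell' or 'hold' using placeholder thresholds."""
    if ratio is None:
        return "hold"
    if ratio > buy_above:
        return "buy"
    if ratio < sell_below:
        return "sell"
    return "hold"


# One made-up news text per trading day.
daily_news = {
    1: "Shares rose and the index was up, despite one loss.",
    2: "Banks fell again; the slump deepened after a brief rally.",
    5: "No market-moving news today.",
}
ratios = {day: up_ratio(text) for day, text in daily_news.items()}
signals = {day: signal(ratio) for day, ratio in ratios.items()}
print(ratios)
print(signals)
```

A ratio series built this way could then be plotted against the instrument series, as in the figure, or passed to the time series and covariance analysis mentioned earlier.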


Qual. meets Quant.

FINGRID’s Sentiment and Time Series: Financial analysis system (SATISFI): for visualising and correlating the sentiment and instrument time series

Composition of Grid services


FINGRID -> qual.

• System Quirk

• text + terminology + ontology + local grammars + ….

• Neural network classifiers (Hebbian networks, Websom)

• Case-based and fuzzy reasoning

• Automatic Text Summarisation

• Text alignment

• Metadata

• ISO-standardized (ISO 11179-3 conformant data registries - LIRICS project); application to text management (Virtual Corpora); Text Categorisation (+ Terminology lookup)

• ISO 639 (codes for the names of languages); ISO 16642 (Terminology Markup Framework); LMF; MAF and other TLAs


Recap

sentiment analysis: automatic terminology extraction; ontology learning; local grammars.

Learning the rules for Information Extraction (IE). Patterns derived from a corpus (MB to GB) of texts (arbitrary domain)

time series analysis (bootstrapping, wavelet analysis)

visualization of large volume time series and texts

Grid - Globus, Condor, OGSA-DAI, SRB


Acknowledgements

Saif Ahmad, Research Student, Wavelet Analysis;
David Cheng, Research Officer, Text Analysis;
Gary Dear, Computing Officer, Grid Implementation;
Pensiri Manomaisupat, Research Student, Text Categorisation;
Ademola Popoola, Research Student, Fuzzy Logic Analysis;
Hayssam Traboulsi, Research Student, Named Entity Extraction;
Tuğba Taşkaya-Temizel, Tutor, Grid Computing, Grid Architect;
Khurshid Ahmad, Principal Investigator;
Jon Nankervis, Co-Investigator (Essex)


Outlook

Lessons from: Value at Risk Computation (RiskGrid - BeSC); Aircraft vibration time-series (DAME - York), ....
Proposed activity on qual analysis (content analysis meets code-based); qual-quant integration/fusion?
Integration with Sheffield’s GATE system
Complement and draw upon the work of eSS PDPs and the existing nodes: text analysis (Nottingham), modelling & simulation (Leeds), mixed media (Bristol), and quantitative analysis (Lancaster).
Additional activities (Surrey): EPSRC: REVEAL (auto-annotation of crime-related CCTV); EU eContent: LIRICS


Further information

http://www.computing.surrey.ac.uk/grid/