ANOMALY DETECTION IN THE ETHEREUM NETWORK
A thesis submitted in partial fulfillment of the requirements
for the degree of Master of Technology
by
AJAY SINGH
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
INDIAN INSTITUTE OF TECHNOLOGY KANPUR
June 2019
Abstract
Ethereum is a platform on which users can build and deploy decentralized applications and
smart contracts. Participants in the Ethereum network are 'pseudo-anonymous', which
makes it very hard to detect anomalous behaviour in the system. Thus, it serves as
a noteworthy place to perform malicious activity and then go undetected. With the
sudden rise of blockchain technology, anomaly detection has also received much attention
in the past decade. Anomalies in the network are participants who execute fraudulent
transactions or whose behaviour is otherwise abnormal. Such abnormalities must be
detected and removed as early as possible to preserve the faith of participants in the
platform. A large body of work exists on the Bitcoin cryptocurrency, with good results;
to the best of our knowledge, this thesis presents the first work on anomaly detection in
Ethereum.
In this thesis, we consider anomaly detection in the Ethereum network using machine
learning techniques. Our goal is to detect which users are most suspicious. To this end,
we apply various machine learning classifiers to Ethereum transaction data, and we
evaluate the accuracy and precision of each method, backing them with experimental
results. Next, we perform graph-based analyses of the Ethereum data and attempt to
derive a similarity index for smart contracts based on user interaction. These methods
can be used in any setting with an internal graph structure; we chose Ethereum due to
the availability and popularity of its dataset. This work provides a good starting point
for anomaly detection on the Ethereum network.
Acknowledgements
I want to express my sincerest gratitude to my supervisor, Prof. Sandeep Shukla of the
Department of Computer Science and Engineering, who has guided me through my thesis.
Without his support and expertise, this dissertation would not have been possible. Also,
I would like to extend my gratitude to Prof. Medha Atre and Shubham Sahai who were
always there to help me in various phases of my thesis. Last but not least, I would
like to thank the Infura community, from which I collected my data.
I also want to express my gratitude to my parents and my dear brother Abhay who always
supported me in each of my endeavours throughout my life. I want to thank Akanksha,
Abhishek, Pankaj, Prerit and my colleagues for their constant support and for helping me
throughout my M.Tech journey.
Ajay Singh
Contents
Abstract v
Acknowledgements vi
Contents vii
List of Figures ix
List of Tables x
1 Introduction 1
1.1 Problem Definition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Contributions of the Thesis . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.3 Organisation of the Thesis . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
2 Background 3
2.1 What is Blockchain? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
2.2 Evolution of Blockchain . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
2.2.1 Blockchain 1.0 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
2.2.2 Blockchain 2.0 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
2.2.3 Blockchain 3.0 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
2.3 What is Ethereum? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
2.3.1 Consensus Protocols . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
2.3.1.1 Proof of Work (POW) . . . . . . . . . . . . . . . . . . . . . 6
2.3.1.2 Proof of Stake (POS) . . . . . . . . . . . . . . . . . . . . . 6
2.3.2 Ethereum Accounts . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
2.3.3 How is Ethereum data stored? . . . . . . . . . . . . . . . . . . . . . 7
2.3.3.1 Merkle Patricia Tree . . . . . . . . . . . . . . . . . . . . . . 8
2.3.4 Gas and Payment . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
2.3.5 Ether(ETH) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
2.3.6 Token . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
2.3.7 Smart Contracts . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
2.3.7.1 Benefits of Smart Contract . . . . . . . . . . . . . . . . . . 11
2.3.7.2 How do Smart Contracts Work? . . . . . . . . . . . . . . . 11
2.3.8 Decentralized Applications(Dapps) . . . . . . . . . . . . . . . . . . . 12
2.4 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
2.4.1 What are the anomalies in Ethereum Network? . . . . . . . . . . . . 13
3 Related Work 14
4 Dataset 18
4.1 Dataset Collection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
4.2 Block Structure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
4.3 Transaction Structure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
4.4 Dataset Statistics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
5 Methods and Approaches 22
5.1 Data Preprocessing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
5.1.1 Raw Feature Extraction . . . . . . . . . . . . . . . . . . . . . . . . . 22
5.1.2 Feature Engineering . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
5.1.3 Feature Importance . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
5.2 Machine Learning Classifiers . . . . . . . . . . . . . . . . . . . . . . . . . . 25
5.2.1 Decision Tree Classifier . . . . . . . . . . . . . . . . . . . . . . . . . 25
5.2.2 Support Vector Machine (SVM) . . . . . . . . . . . . . . . . . . . . 25
5.2.3 K-nearest Neighbour Classifier(KNN) . . . . . . . . . . . . . . . . . 26
5.2.4 Multi-layer Perceptron Classifier(MLP) . . . . . . . . . . . . . . . . 27
5.2.5 Naive Bayes Classifier . . . . . . . . . . . . . . . . . . . . . . . . . . 28
5.3 Smart Contract Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
5.4 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
6 Results and Analysis 30
6.1 Experimental Setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
6.2 Evaluation Metrics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
6.3 Machine Learning Classifiers . . . . . . . . . . . . . . . . . . . . . . . . . . 32
6.3.1 Decision Tree Classifier . . . . . . . . . . . . . . . . . . . . . . . . . 32
6.3.2 Random Forest Classifier . . . . . . . . . . . . . . . . . . . . . . . . 36
6.3.3 Other Classifier Results . . . . . . . . . . . . . . . . . . . . . . . . . 39
6.4 Graph Based Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
6.5 Smart Contract Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
7 Conclusion and Future Work 45
7.1 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
7.2 Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
Bibliography 47
List of Figures
2.1 Energy Consumption by Bitcoin compared to other Countries [1] . . . . . . 6
2.2 Account Interaction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
2.3 Data stored on block header [2] . . . . . . . . . . . . . . . . . . . . . . . . . 8
2.4 Example of Merkle Patricia Tree . . . . . . . . . . . . . . . . . . . . . . . . 9
2.5 Payment Structure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
2.6 Working of Smart Contract . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
2.7 Dapps Roadmap [3] . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
3.1 BitIodine Architecture [4] . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
3.2 System Architecture [5] . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
4.1 Block Attributes [6] . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
4.2 Transaction Attributes [6] . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
5.1 Feature Importance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
5.2 SVM Classifier . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
5.3 Neural Network Structure [7] . . . . . . . . . . . . . . . . . . . . . . . . . . 28
6.1 Confusion Matrix for Ground Truth Result . . . . . . . . . . . . . . . . . 33
6.2 ROC curve for Ground Truth Result . . . . . . . . . . . . . . . . . . . . . 33
6.3 Confusion Matrix for Level 1 Indegree Result . . . . . . . . . . . . . . . . . 34
6.4 ROC curve for Level 1 Indegree Result . . . . . . . . . . . . . . . . . . . . 35
6.5 Confusion Matrix for Level 1 Outdegree Result . . . . . . . . . . . . . . . . 36
6.6 ROC curve for Level 1 Outdegree Result . . . . . . . . . . . . . . . . . . . 36
6.7 Confusion Matrix for Ground Truth Result . . . . . . . . . . . . . . . . . 37
6.8 Confusion Matrix for Level 1 Indegree Result . . . . . . . . . . . . . . . . . 38
6.9 Confusion Matrix for Level 1 Outdegree Result . . . . . . . . . . . . . . . . 38
6.10 Block Publish time analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
6.11 Transaction count per 1 lakh block . . . . . . . . . . . . . . . . . . . . . . . 40
6.12 Cumulative address growth over time . . . . . . . . . . . . . . . . . . . . . 41
6.13 Variation of block size with number of Transactions . . . . . . . . . . . . . 42
6.14 Variation of block size with chain data size . . . . . . . . . . . . . . . . . . 42
List of Tables
2.1 Ether denominations and their value . . . . . . . . . . . . . . . . . . . . . . 10
4.1 Dataset Used for analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
5.1 Feature ranking . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
6.1 Confusion Matrix . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
6.2 Test result on Ground Truth . . . . . . . . . . . . . . . . . . . . . . . . . . 32
6.3 Test result on Level 1 Indegree nodes . . . . . . . . . . . . . . . . . . . . . . 34
6.4 Test result on Level 1 Outdegree nodes . . . . . . . . . . . . . . . . . . . . . 35
6.5 Test result on Ground Truth . . . . . . . . . . . . . . . . . . . . . . . . . . 37
6.6 Test result on Level 1 Indegree nodes . . . . . . . . . . . . . . . . . . . . . . 37
6.7 Test result on Level 1 Outdegree nodes . . . . . . . . . . . . . . . . . . . . . 38
6.8 Test result on Ground Truth . . . . . . . . . . . . . . . . . . . . . . . . . . 39
6.9 Test result on Level 1 Indegree nodes . . . . . . . . . . . . . . . . . . . . . . 39
6.10 Similarity Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
Dedicated to the Universe
Chapter 1
Introduction
The introduction of Bitcoin [8] as a cryptocurrency in 2008 changed the world for
investors, developers, researchers, bankers, money launderers and hackers. The pseudo-
anonymity of participants led hackers and money launderers to join the network without
fear of being caught or traced. Researchers, however, foiled them by deducing patterns
in the system. On the technical side, blockchain is secure, robust and decentralized,
which suits sectors where the security of data is a prime concern in an untrusted
environment. It also has pros and cons depending on how it is used: using
cryptocurrencies, participants can evade taxes owed to the government, which is why
cryptocurrencies are still illegal in some countries. Moreover, as of April 2019, only
56.1% [9] of the world's population has access to the Internet. These are some of the
major concerns with cryptocurrencies. Much research has been done in the recent past
on the Bitcoin blockchain network, in which investigators were able to identify illegal
transactions related to Silk Road and more. We therefore decided to do the same for the
Ethereum network, searching for irregularities, i.e. abnormal behaviour within the
system. Since we might categorise some genuine users as malicious, we mark suspicious
users with a probability score for further inspection.
1.1 Problem Definition
In this thesis, we investigate Ethereum transactional data in search of abnormal
activities in the Ethereum network. A large body of literature exists on anomaly
detection in many different network settings; we have chosen machine learning
techniques for the task.
Is it possible to detect anomalous behaviour in the Ethereum network using
machine learning techniques?
This question is further subdivided into the following:
• How do we collect Ethereum transactional data?
• Which raw features should we select from the Ethereum transactional data?
• Which features can we engineer from the raw features for the classification problem?
• How do we deal with a highly imbalanced dataset?
• Which machine learning methods should be used?
• How do we evaluate the results on the engineered dataset?
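The first question can be addressed through Infura (acknowledged earlier as the data source), which exposes the standard Ethereum JSON-RPC interface. The sketch below is illustrative only: the endpoint URL shape follows Infura's public convention and `<PROJECT_ID>` is a placeholder, not the exact setup used for this thesis.

```python
import json

# Hypothetical endpoint: replace <PROJECT_ID> with a real Infura project key.
INFURA_URL = "https://mainnet.infura.io/v3/<PROJECT_ID>"

def block_request(block_number, full_transactions=True):
    """Build a JSON-RPC payload requesting one block (optionally with full txs)."""
    return {
        "jsonrpc": "2.0",
        "method": "eth_getBlockByNumber",
        "params": [hex(block_number), full_transactions],
        "id": 1,
    }

payload = block_request(46147)     # an early mainnet block
body = json.dumps(payload)         # POST this body to INFURA_URL over HTTPS
```

Iterating such requests over block numbers yields the raw block and transaction records from which features are later extracted.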
1.2 Contributions of the Thesis
The key contribution of this thesis is that we can detect malicious nodes in the
Ethereum network with very high probability, and our methods do not classify genuine
users as malicious. This is essential for upholding users' faith in the system. Our
method can be applied to any system with an internal graph structure.
1.3 Organisation of the Thesis
The rest of the thesis is organised as follows. In Chapter 2, we provide background on
blockchain that is essential for understanding its functioning. In Chapter 3, we discuss
related work. In Chapter 4, we describe our dataset, i.e. what data is used and how it
was collected. In Chapter 5, we explain how the data is preprocessed and the
classification methods used for evaluation. Chapter 6 presents the experimental
evaluation and a comparison of the various heuristics discussed in Chapter 5. Finally,
in Chapter 7, we provide a brief conclusion of our work and prospects for future work
in this direction.
Chapter 2
Background
2.1 What is Blockchain?
Blockchain, as the name suggests, is a set of blocks of data cryptographically linked to
one another. It is somewhat similar to a linked-list data structure with a specific set
of rules for modifying the list. Is it merely a linked list that has received such hype
in the past few years? No: a linked list is the core concept behind the blockchain, but
it is much more than that. The salient features of blockchain are a decentralized
system, an immutable ledger, cryptographic security of data, and privacy [10].
Decentralization refers to the level of control over the network: in these systems,
control is distributed across the different entities within the network. A 51% attack
[11] is possible if a mining protocol such as Proof of Work is used; this attack is
feasible when a single entity controls a majority of the network's mining power.
An immutable ledger signifies that blocks cannot be altered once published on the
blockchain. Cryptographic hash functions are used to maintain the security of the data:
a hash function returns a fixed-size hash for given data, and this hash is
collision-resistant and non-invertible, i.e. given the hash of the data, it is
computationally infeasible to recover the data. Hashing also helps maintain the privacy
of users, especially in cryptocurrencies: every user is mapped to an address, which
maintains anonymity and privacy. We now define blockchain precisely:
Definition 2.1. Blockchain is a decentralized ledger on which the data is cryptographi-
cally secure and immutable in nature.
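The interplay of hash linking and immutability in Definition 2.1 can be seen in a toy model. This is illustrative only: real blocks hash an encoded header rather than raw strings, but the detection property is the same.

```python
import hashlib

def block_hash(prev_hash, data):
    """Hash a block's contents together with its parent's hash (toy model)."""
    return hashlib.sha256((prev_hash + data).encode()).hexdigest()

# Build a three-block toy chain starting from a genesis block.
chain = [{"prev": "0" * 64, "data": "genesis"}]
chain[0]["hash"] = block_hash(chain[0]["prev"], chain[0]["data"])
for data in ["tx: A->B 5", "tx: B->C 2"]:
    prev = chain[-1]["hash"]
    chain.append({"prev": prev, "data": data, "hash": block_hash(prev, data)})

def verify(chain):
    """A chain is valid only if every stored hash still matches its contents."""
    for i, blk in enumerate(chain):
        if blk["hash"] != block_hash(blk["prev"], blk["data"]):
            return False
        if i > 0 and blk["prev"] != chain[i - 1]["hash"]:
            return False
    return True

assert verify(chain)                  # untouched chain verifies
chain[1]["data"] = "tx: A->B 500"     # tamper with one block
assert not verify(chain)              # tampering is detected
```

Because each block commits to its parent's hash, rewriting any block invalidates every block after it, which is why published data is effectively immutable.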
2.2 Evolution of Blockchain
Various technology components used in blockchain technology had been developed long
before its global introduction as the Bitcoin cryptocurrency in 2008. The initial aim
of blockchain was to combat the double-spending problem and thereby improve trust in
cryptocurrency. To describe the evolution of blockchain, researchers categorise it into
three generations.
2.2.1 Blockchain 1.0
Blockchain 1.0 marks the beginning of the blockchain era with Bitcoin, developed by a
pseudonymous software developer named Satoshi Nakamoto. Bitcoin is a public blockchain
supporting financial transactions on a decentralised system, bypassing central
authorities such as banks or payment gateways; 'public blockchain' signifies that
anyone can join or leave the network. Bitcoin uses the Proof of Work consensus
algorithm to select a miner who decides the contents of the next block, a block being a
collection of valid transactions. To be chosen as miner, a participant needs to solve a
particular mathematical problem of the following type: given a value 'X', find a number
'n' such that the hash of 'n' appended to 'X' results in a number 'Y' with 'k' leading
zeros. The miner's incentive for publishing a block is 12.5 bitcoins (at present), and
the reward is halved every four years. The current market capitalisation and other
real-time details of Bitcoin can be found at https://www.blockchain.com/explorer.
Bitcoin has limitations with respect to the nature of transactions: it can only perform
money transfers over the network. It cannot perform computationally intensive jobs
because no looping construct is supported, which laid the foundation for Blockchain
2.0.
2.2.2 Blockchain 2.0
Blockchain 2.0 [12] laid the foundation for programmable transactions: pieces of code
that execute automatically when certain conditions are met. Such an automated piece of
code is called a "smart contract". The scripting language used in Ethereum is
Turing-complete [13], which signifies that it is capable of simulating any Turing
machine, and Decentralized Applications (Dapps) can be launched on top of Ethereum.
This makes Ethereum much more than a cryptocurrency, justifying its designation as
Blockchain 2.0. Another significant improvement over Bitcoin is that the block time is
reduced by roughly 40 times, from 10 minutes to 12-15 seconds [14]. The block reward is
constant in the Ethereum network. The consensus algorithm is a refined form of Proof of
Work designed to negate the advantage of dedicated mining hardware.
2.2.3 Blockchain 3.0
All transaction data is available on the blockchain, which can create privacy issues
for some institutions, since much research has shown that public blockchain addresses
are not fully anonymous. To make blockchains more secure and scalable, designers
introduced block-less and miner-less architectures, leading to the next generation of
blockchain, called Blockchain 3.0. This increases the usability of blockchain in
sectors such as health care, supply chains, data storage, etc. Some blockchain
platforms in this category are IOTA, Stellar, and Hyperledger Fabric.
2.3 What is Ethereum?
Ethereum [15] was proposed by Vitalik Buterin in 2013. It is a blockchain-based
technology that helps developers build and deploy smart contracts and decentralized
applications. Ethereum nodes run a computing infrastructure called the Ethereum Virtual
Machine (EVM), on which any blockchain application can be created efficiently; its
Turing completeness signifies that it is capable of simulating any Turing machine. The
consensus algorithm currently used by the Ethereum blockchain is Proof of Work, but
Ethereum has announced a switch to Proof of Stake by mid-2019, the reason being that
Proof of Work is computationally intensive and consumes a great deal of electricity.
The cryptocurrency of Ethereum is Ether, which fuels the Ethereum network.
2.3.1 Consensus Protocols
Wikipedia says, “Consensus decision-making is a group decision-making process in which
group members develop, and agree to support a decision in the best interest of the whole.
Consensus may be defined professionally as an acceptable resolution, one that can be
supported, even if not the “favourite” of each individual. Consensus is defined by Merriam-
Webster as, first, general agreement, and second, group solidarity of belief or sentiment.”
These protocols are used in decentralized systems where there is no central authority
to make decisions. They are designed so that no single participant can bias the
outcome: the vote of each participant carries equal weight.
The consensus protocols for Ethereum are:
Figure 2.1: Energy Consumption by Bitcoin compared to other Countries [1]
2.3.1.1 Proof of Work (POW)
The Proof of Work protocol was designed by Satoshi Nakamoto for Bitcoin and is also
used in many other cryptocurrencies. A participant needs to solve a computationally
hard mathematical problem to become the miner of the current block: miners must
brute-force the entire search space to find a solution, yet the result can be verified
easily. The puzzle everyone tries to solve is of the form: find a nonce for which
SHA256(nonce + block hash) begins with a given number of zero bits. The difficulty
level is set by the number of leading zero bits required, and the computation required
to solve the puzzle grows accordingly. The biggest drawback of Bitcoin is its energy
consumption [1]: it uses more electricity than some countries do. Figure 2.1 shows the
comparison with different countries.
Image source: https://digiconomist.net/bitcoin-energy-consumption.
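The puzzle described above can be sketched in a few lines. This toy version searches for leading zero hex digits rather than zero bits and omits the double hashing and target arithmetic of real Bitcoin, but the brute-force search and cheap verification are the same in shape.

```python
import hashlib

def proof_of_work(block_hash, k):
    """Find a nonce such that SHA256(nonce + block_hash) starts with k zero hex digits.

    Finding the nonce requires brute force; verifying it takes a single hash.
    """
    nonce, target = 0, "0" * k
    while True:
        digest = hashlib.sha256((str(nonce) + block_hash).encode()).hexdigest()
        if digest.startswith(target):
            return nonce, digest
        nonce += 1

nonce, digest = proof_of_work("deadbeef", k=4)   # small k keeps the demo fast
assert digest.startswith("0000")                 # anyone can verify in one hash
```

Raising `k` by one multiplies the expected search effort by sixteen here (by two per zero bit in Bitcoin), which is how the network tunes difficulty.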
2.3.1.2 Proof of Stake (POS)
Proof of Stake selects the miner of the next block based on attributes of the
participants, such as how many coins they have at stake, the time since they last mined
a block, or random selection. It does not require solving any mathematical problem, so
its electricity consumption is far lower than Bitcoin's. Peercoin and Blackcoin were
the first to use this consensus protocol.
In POS, the miners are validators. Initially, they put some coins at stake; during
validation of blocks, they place a bet and are rewarded if their block is selected as
the next block of the chain, in proportion to the stake each participant made.
The main problem with this protocol is "nothing at stake": if a validator splits his
bets across all candidate blocks, then one of those candidate blocks is certain to
become the next block, so the validator wins a reward in every case. To overcome this
shortcoming, the Casper protocol [16] was proposed, which penalizes nodes that act
maliciously in this way by slashing their stakes.
2.3.2 Ethereum Accounts
Ethereum has two types of account:
• Externally Owned Account (EOA)
End users create EOAs to be part of the Ethereum network. Each participant holds the
private key of their account, which is used to sign transactions.
• Contract Account
These are self-executing code which can be invoked by EOAs or by another contract as an
internal transaction.
Figure 2.2: Account Interaction
2.3.3 How is Ethereum data stored?
Like other cryptocurrencies, Ethereum also started its story from a genesis block. From
that point on, transactions, contract creations, contract invocations, and the mining
of subsequent blocks have constantly changed the state of the Ethereum blockchain.
Ethereum uses "trie" data structures for storing data [17]. The different tries used
are:
• State Trie
• Storage Trie
• Transaction Trie
• Receipts Trie
Only the roots of the different tries are stored in the block header, as represented in
Figure 2.3 below.
Figure 2.3: Data stored on block header [2]
Image source: https://medium.com/cybermiles/diving-into-ethereums-world
2.3.3.1 Merkle Patricia Tree
The tree structure used in Ethereum is the Merkle Patricia tree [18], which combines a
Merkle tree with a Patricia (radix) trie. Data is stored at the leaf nodes; each
intermediate node stores a hash derived from the hashes of its children, and the hash
of the root is stored in the block header. This structure is used for the verification
of data, since the root hash no longer matches if any data is modified. An example of a
Merkle Patricia tree is given in Figure 2.4.
Figure 2.4: Example of Merkle Patricia Tree
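The verification idea can be illustrated with a plain binary Merkle tree. Ethereum's tries are hexary Patricia structures rather than binary, so this is a simplification, but the property relied upon is identical: any change to a leaf changes the root committed in the block header.

```python
import hashlib

def h(x):
    return hashlib.sha256(x).digest()

def merkle_root(leaves):
    """Fold leaf hashes pairwise up to a single root hash (binary Merkle tree)."""
    level = [h(leaf.encode()) for leaf in leaves]
    while len(level) > 1:
        if len(level) % 2:                  # duplicate the last node on odd levels
            level.append(level[-1])
        level = [h(a + b) for a, b in zip(level[::2], level[1::2])]
    return level[0].hex()

txs = ["A->B:5", "B->C:2", "C->D:1"]
root = merkle_root(txs)
assert merkle_root(txs) == root                              # deterministic
assert merkle_root(["A->B:6", "B->C:2", "C->D:1"]) != root   # one leaf change moves the root
```

This is why storing only the root in the block header suffices: a light client can verify a single transaction against the root without downloading the whole tree.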
2.3.4 Gas and Payment
Gas is the metric by which the cost of computation on the EVM is decided. Because the
value of ether fluctuates in real time, there must be a stable metric for measuring
computation cost; gas can be thought of as the number of CPU cycles used for execution
on the EVM. To keep the real cost of gas steady, the gas price and the price of ether
have to move inversely. Other metrics associated with gas are the gas price, gas cost,
gas limit, and gas fee. Figure 2.5 explains how payment for each transaction is made.
Figure 2.5: Payment Structure
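The payment in Figure 2.5 reduces to a single product, fee = gas used × gas price. The sketch below uses the fixed 21,000-gas cost of a plain ether transfer; the gas price chosen is purely illustrative.

```python
WEI_PER_GWEI = 10**9
WEI_PER_ETHER = 10**18

def tx_fee_wei(gas_used, gas_price_gwei):
    """Fee charged for a transaction: gas consumed times the price per unit of gas."""
    return gas_used * gas_price_gwei * WEI_PER_GWEI

# A plain ether transfer costs 21,000 gas; a 20 gwei gas price is illustrative.
fee = tx_fee_wei(gas_used=21_000, gas_price_gwei=20)
assert fee == 420_000 * WEI_PER_GWEI      # 420,000 gwei, i.e. 0.00042 ether
```

Senders also set a gas limit; if execution runs out of gas before that limit, the transaction reverts but the gas already consumed is still paid for.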
2.3.5 Ether(ETH)
Ether is the digital currency which fuels the network. Powering the network means that
it distributes the 'token'; these tokens are further used to execute smart contracts
and decentralized applications. More precisely, it is the unit which
is used as payment for any computation done on the Ethereum Virtual Machine (EVM) [19].
Each transaction has a computation job and a transaction fee associated with it; the
cost is computed from how much gas is used.
A total of 60,102,216 ether was distributed to crowdsale contributors in the
crowdfunding campaign at the start of the Ethereum blockchain network in 2014 [20].
Twelve million ether was given to the Ethereum Foundation and early contributors, the
research group behind Ethereum. According to the terms agreed by all parties in the
2014 presale, 18 million ether will be issued every year, i.e. roughly 25% of the 2014
crowdfunding amount. If the ether creation rate is high, the difficulty of mining a
block increases to maintain the rate of ether creation per year, and vice versa; the
difficulty level is decided by the consensus protocol. The miner is rewarded with five
ether for every block, and a new block is published/mined every 12-15 seconds. The
denominations of ether can be found in Table 2.1 [21].
S no. Unit Wei Value Wei
1 Kwei(babbage) 1e3 wei 1,000
2 Mwei (lovelace) 1e6 wei 1,000,000
3 Gwei (shannon) 1e9 wei 1,000,000,000
4 microether (szabo) 1e12 wei 1,000,000,000,000
5 milliether (finney) 1e15 wei 1,000,000,000,000,000
6 ether 1e18 wei 1,000,000,000,000,000,000
Table 2.1: Ether denominations and their value
2.3.6 Token
A token is the native currency of a decentralized application. The tokens for a Dapp
are distributed in a crowd-sale called an 'ICO', in exchange for ether. Tokens make the
process of interacting with smart contracts and Dapps easier. They are broadly
classified as:
• Usage Tokens
• Work Tokens
2.3.7 Smart Contracts
A smart contract is a piece of code containing a set of rules which the interacting
parties have to follow. These contracts are executed on top of the blockchain, i.e., on
the EVM, and a contract executes when its required conditions are satisfied. The smartness of
the contract depends on the developer of the contract. The main aim of a contract is to
build trust between the parties without relying on an intermediary. Some of the
properties which make contracts more reliable are:
• Self verification
• Tamper proof
• Autonomous execution
2.3.7.1 Benefits of Smart Contract
• Trust
• Safety
• Speed
• Saving
• Accuracy
• Availability
• Autonomy
• Reduce reliance on trusted third party
2.3.7.2 How do Smart Contracts Work?
Let us consider an example to understand the working of a smart contract [22]. Suppose
there are two users, A and B, where A wants to buy a car and B wants to sell his car.
The contract between A and B is: "If A pays 5000 ether to B, then A will receive
ownership of B's car". Once deployed, this contract cannot be changed, i.e., it is
immutable, which establishes trust in the system. There is no need for a middleman
between A and B, such as a bank or broker, as the contract executes automatically when
the conditions are met: when A deposits the money into the contract, the contract
verifies it and transfers the ownership from B to A. This scenario is depicted in
Figure 2.6.
Figure 2.6: Working of Smart Contract
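The logic of Figure 2.6 can be mimicked as a small state machine. This is a Python analogy, not actual Solidity, and a real contract would also hold the deposited ether in escrow and release it to the seller; here only the ownership condition is modelled.

```python
class CarSaleContract:
    """Toy escrow mirroring the A/B example: ownership moves only after payment."""

    def __init__(self, seller, buyer, price_ether):
        self.seller, self.buyer, self.price = seller, buyer, price_ether
        self.owner = seller            # B owns the car until the condition is met
        self.deposited = 0

    def deposit(self, sender, amount):
        if sender != self.buyer:
            raise ValueError("only the buyer may deposit")
        self.deposited += amount
        if self.deposited >= self.price:   # condition met: execute automatically
            self.owner = self.buyer

sale = CarSaleContract(seller="B", buyer="A", price_ether=5000)
sale.deposit("A", 5000)
assert sale.owner == "A"    # ownership transferred without any middleman
```

Once such code is deployed on the EVM it cannot be edited, so both parties can read exactly the rule that will be enforced.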
2.3.8 Decentralized Applications(Dapps)
A Dapp is open-source software which runs on a peer-to-peer network [23]. A
blockchain-based Dapp must possess some additional features: its data must be
cryptographically secure, and the application must have a digital asset which fuels the
network. In an initial coin offering, the Dapp's tokens are put on sale in exchange for
digital or fiat currencies. The roadmap for launching a Dapp is given in Figure 2.7.
Figure 2.7: Dapps Roadmap [3]
2.4 Summary
In this chapter, we introduced blockchain technology and its properties, then described
how the technology has evolved over the years and briefly summarized each generation.
We also discussed some important terminology essential for understanding the Ethereum
network, which is the focal point of this thesis. Maintaining the faith of participants
in the network is always the first priority, as participants play an important role in
its functioning. In this thesis, we try to detect and eliminate the malicious
participants in the network.
2.4.1 What are the anomalies in Ethereum Network?
Anomalous addresses are those which try to perform tasks for which they are not
authorized, or which try to execute fraudulent transactions. Some examples are
mentioned below:
• Issuing fake tokens
• Fake admins in ICOs (Initial Coin Offerings)
• Scambot phishers
• Slackbot scams
• Fake Etherscan sites
• Fake sites asking for private keys
• Fake crowdsale sites, etc.
In this work, we attempt to mark and eliminate malicious addresses so that the innocent
participants within the network are not affected.
Chapter 3
Related Work
In this chapter, we survey existing work related to anomaly detection in blockchains,
specifically the Bitcoin and Ethereum blockchains.
BitIodine: Extracting Intelligence from the Bitcoin Network [4] is a framework to
de-anonymize users. The authors were able to label addresses automatically or
semi-automatically using information fetched by web scraping: the scrapers search the
web to associate addresses with real users. The labels they used for addresses include
gambling, exchanges, wallets, donations, scammer, disposable, miner, malware, FBI,
killer, Silk Road, shareholder, etc. BitIodine first parses the transaction data from
the Bitcoin blockchain, then clusters addresses based on user interaction and labels
the clusters and users; the objective is to label every address in the network with one
of the above-mentioned categories. By manually investigating and tracing transactions,
they detected some anomalous addresses in the network, and they verified their system's
performance on several known thefts and frauds in Bitcoin: BitIodine was able to detect
addresses belonging to the Silk Road cold wallet and to the CryptoLocker ransomware.
The modular structure they developed can also be used for other blockchains. The system
architecture of BitIodine is described in Figure 3.1.
Figure 3.1: BitIodine Architecture [4]
Graph-based forensic investigation of Bitcoin transactions [24] performs analysis on Bitcoin
transaction data and also evaluates the network data. The dataset used includes 34,839,029
Bitcoin transactions and 35,770,360 distinct addresses. Their objective is to detect money
theft, fraudulent transactions and illegal payments made to black markets. They designed a
framework which retrieves all the transaction details of a given address. They do not detect
anomalous addresses in the network, but they provide detailed information about a given
address. They used clustering to group users together, and multiple graph-based techniques
to analyze the money flow within the network: the 'Breadth First Search (BFS)' algorithm,
edge-convergent patterns, and the existence of cycles in the network, to detect any sort
of money laundering.
Thai T. Pham et al. [25] [26] proposed anomaly detection in the Bitcoin network using
unsupervised learning. The aim is to detect suspicious transactions that took place within
the network and to mark the users based on these transactions. The unsupervised methods
they used are K-means clustering, Mahalanobis distance and the Support Vector Machine (SVM).
They verified their model against 30 known cases in the Bitcoin network, of which they were
able to mark two known cases of theft and one case of loss.
Xiapu Luo et al. [5] proposed understanding Ethereum via graph analysis. They claim
to be the first to perform a graph-based analysis of the Ethereum blockchain. They
constructed three different graphs to analyze money transfer, smart contract creation, and
smart contract invocation. Their dataset contains 28,502,131 external transactions and
19,759,821 internal transactions. After analyzing the above-mentioned graphs, they arrive
at five preliminary insights:
• Participants use Ethereum more than smart contracts for money transfer
• Smart contracts are not used extensively
• Ethereum is not frequently used by all
• Very few people create smart contracts
• Exchange markets dominate the Ethereum network
These insights are fairly obvious: the number of transactions made by a regular user
cannot be compared with the number of transactions made by exchanges, which is surely
much higher. Hence, the exchange market will dominate the Ethereum network. We cannot
expect every user to know Solidity or Go well enough to deploy their own contracts;
hence, very few of them deploy and use contracts. All participants have different
requirements for interacting with the Ethereum network, so we cannot expect the same
behaviour from all. Their complete approach is depicted in Figure 3.2.
Figure 3.2: System Architecture [5]
Although some of the above approaches try to find anomalies in the Bitcoin network,
none of them has a sophisticated method for anomaly detection. In BitIodine [4] they
attempted detection by manually searching paths in the network; they do not have an
automated program to detect malicious addresses. In [25] a machine learning technique
was tried for anomaly detection, but the accuracy is not very good, i.e., 10%, and only
two machine learning models were tried, namely K-means clustering and Support Vector
Machine (SVM). Therefore, there is a need for a system which can detect the anomalous
addresses in any blockchain network with high accuracy.
Chapter 4
Dataset
4.1 Dataset Collection
We have used the Infura API [27] to fetch the Ethereum blockchain data. Infura provides
secure and reliable access to Ethereum APIs and IPFS gateways. The Infura APIs we used
for data collection are:
• eth getBlockByNumber
It returns the complete block data in JSON format for a given block number.
• eth getTransactionByHash
It returns the complete transaction data in JSON format for the given transaction
hash.
• eth getTransactionReceipt
It returns the status of post-Byzantium transactions, i.e., '1' for success and '0'
for failure.
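For illustration, a minimal sketch of building and sending one of these JSON-RPC calls with only the standard library (the project ID here is a placeholder, and the helper names are our own, not part of the Infura API):

```python
import json
import urllib.request

# Hypothetical endpoint; a real Infura project ID is required.
INFURA_URL = "https://mainnet.infura.io/v3/YOUR_PROJECT_ID"

def rpc_payload(method, params):
    """Build a JSON-RPC 2.0 request body for an Ethereum API call."""
    return {"jsonrpc": "2.0", "method": method, "params": params, "id": 1}

def get_block_by_number(block_number):
    """Fetch a full block (with transaction objects) for a block number.
    Block numbers are passed to the API as hex strings."""
    body = json.dumps(rpc_payload("eth_getBlockByNumber",
                                  [hex(block_number), True])).encode()
    req = urllib.request.Request(INFURA_URL, data=body,
                                 headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["result"]

# The request body for the last block of our dataset:
print(rpc_payload("eth_getBlockByNumber", [hex(5139999), True]))
```

The same payload shape, with a different `method`, serves `eth_getTransactionByHash` and `eth_getTransactionReceipt`.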
The files are stored in JSON format. The file size of a block is 1-10 KB, while that of
a transaction is 0.5-2 KB. As the number of files is enormous, we used 28 cloud instances,
each running a multiprocessing script to fetch the data, with a download rate of about
150K files per day per instance. The data fetched is only on-chain data, i.e., it does
not include internal transactions; internal transactions are not published on the
blockchain but are only executed on the EVM. The dataset we consider for analysis is from
block 0 to block 5,139,999, which includes a total of 169,192,702 transactions.
4.2 Block Structure
The blocks are linked together cryptographically in a chronological manner to form a
blockchain [6]. These blocks are packed with a set of successful transactions. A block
consists of:
• Block Header
• Transaction Hashes
• Uncle’s Hash
Figure 4.1 gives an idea of how data is stored in a block. The 'result' section shown in
Figure 4.1 contains all the data as key-value pairs. The set of transactions included in
this block goes under the 'transactions' section.
Figure 4.1: Block Attributes [6]
4.3 Transaction Structure
To interact with the Ethereum blockchain one has to perform a transaction [6]. The
different types of transactions are ether transfer, token distribution, contract creation
and contract invocation. Figure 4.2 shows the data that goes into a transaction record.
The r, s, v values constitute the signature used to verify the sender.
Figure 4.2: Transaction Attributes [6]
4.4 Dataset Statistics
S no.  Dataset                  Count
1      Blocks                   6.8 million
2      External Transactions    169,192,702
3      Unique Addresses         24,693,053
4      Zero Ether Transactions  50,468,270
5      Smart Contracts          ~1.8 million
6      Unique Smart Contracts   ~90K
7      Malicious Addresses      125

Table 4.1: Dataset used for analysis
The External Transactions are those stored on the Ethereum main chain. The transactions
stored on the main chain are of two types:
• Transaction between two Externally owned Accounts
• Contract invocation by Externally owned Account
The Unique Addresses gives the count of unique addresses across the total of 169,192,702
external transactions.
The Zero Ether Transactions gives the count of transactions which transfer zero ether
among the 169,192,702 external transactions.
The Smart Contracts gives the total number of smart contracts deployed from block 0 to
block 6.8 million.
The Unique Smart Contracts gives the count of unique smart contracts deployed, determined
by comparing the MD5 hashes of the contract code.
The Malicious Addresses gives the count of unique malicious addresses, collected by web
scraping, that lie within blocks 0 to 5,139,999 [28] [29].
Chapter 5
Methods and Approaches
5.1 Data Preprocessing
Data preprocessing is a crucial step when using machine learning algorithms. It is a method
to remove or reduce noise in the dataset, and it plays a vital role in the accuracy of the
model. Some of the key steps involved are: handling missing values, handling categorical
values, scaling and normalization of feature values, and efficient, correct splitting of
the raw data into training and test sets.
The data retrieved from the previously mentioned Infura APIs is in JSON format.
5.1.1 Raw Feature Extraction
The raw features are fetched from the transaction data files. They include:
• blockNumber: The number (in hexadecimal) of the block to which this transaction belongs.
• from: The sender address of the transaction.
• to: The receiver address of the transaction. It is NULL in a contract creation
transaction.
• value: The amount transferred in the transaction, stored in Wei as a hexadecimal value.
• timestamp: The time at which the block was published/mined.
5.1.2 Feature Engineering
Feature engineering is a significant and crucial step in machine learning. It is the
process of deriving values from the raw data for the learning algorithms which are to be
trained and tested. From the raw features mentioned above, we extracted 14 numeric
features [26] for each address to solve the proposed problem. The features are:
• Outdegree: The total number of outgoing transactions from a given address.
• Indegree: The total number of incoming transactions to a given address.
• Balance Out: The total outgoing ether value from a given address.
• Balance In: The total incoming ether value to a given address.
• Absolute Balance: (Balance In) - (Balance Out)
• Unique Outdegree: The total number of outgoing transactions to unique addresses
from a given address.
• Unique Indegree: The total number of incoming transactions from unique ad-
dresses to a given address.
• Start Date: The timestamp of the block in which the given address has made its
first ever transaction.
• End date: The timestamp of the block in which the given address has made its last
transaction so far.
• Active duration: (End Date) - (Start Date)
• Last Transaction Bit: 0/1 (0 if last transaction made is incoming else 1)
• Last Transaction Value: The ether value transferred in the last transaction made
by the address.
• In Transaction Average: Average ether value per incoming transaction.
• Out Transaction Average: Average ether value per outgoing transaction.
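As a sketch, several of these per-address features can be aggregated in one pass over the transaction records. The record layout below (plain `from`/`to`/`value`/`timestamp` fields, already decoded from the raw hex JSON) is an assumption for the example, not the thesis code:

```python
from collections import defaultdict

def address_features(transactions):
    """Aggregate per-address features from simplified transaction records."""
    raw = defaultdict(lambda: {"out": 0, "in": 0, "bal_out": 0.0,
                               "bal_in": 0.0, "out_peers": set(),
                               "in_peers": set(), "times": []})
    for tx in transactions:
        s, r = tx["from"], tx["to"]
        # Sender side: outgoing transaction.
        raw[s]["out"] += 1
        raw[s]["bal_out"] += tx["value"]
        raw[s]["out_peers"].add(r)
        raw[s]["times"].append(tx["timestamp"])
        # Receiver side: incoming transaction.
        raw[r]["in"] += 1
        raw[r]["bal_in"] += tx["value"]
        raw[r]["in_peers"].add(s)
        raw[r]["times"].append(tx["timestamp"])
    feats = {}
    for addr, f in raw.items():
        feats[addr] = {
            "outdegree": f["out"],
            "indegree": f["in"],
            "balance_out": f["bal_out"],
            "balance_in": f["bal_in"],
            "absolute_balance": f["bal_in"] - f["bal_out"],
            "unique_outdegree": len(f["out_peers"]),
            "unique_indegree": len(f["in_peers"]),
            "active_duration": max(f["times"]) - min(f["times"]),
        }
    return feats
```

The remaining features (start/end date, last-transaction fields, per-transaction averages) follow the same aggregation pattern.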
5.1.3 Feature Importance
While working with machine learning models, it is always a tough decision to choose the
features. Having extracted 14 features from the raw data, we determine the importance of
each feature by feeding the data to a decision tree model. The ExtraTreeClassifier [30]
is a completely randomized tree classifier: it looks for the best split by performing
random splits for each of the selected features. The importances of the above-mentioned
features are:
Chapter 5. Methods and Approaches 24
Figure 5.1: Feature Importance
Feature no.  Feature Name             Value
feature 6    Unique Indegree          0.215800
feature 9    Active duration          0.097585
feature 1    Indegree                 0.096118
feature 8    End Date                 0.095395
feature 7    Start Date               0.093047
feature 12   In Transaction Average   0.084508
feature 11   Last Transaction Value   0.069127
feature 5    Unique Outdegree         0.051988
feature 13   Out Transaction Average  0.044768
feature 0    Outdegree                0.042575
feature 2    Balance Out              0.039066
feature 4    Absolute Balance         0.033495
feature 3    Balance In               0.027815
feature 10   Last Transaction Bit     0.008714

Table 5.1: Feature ranking
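A hedged sketch of how such a ranking can be produced with a randomized tree ensemble; synthetic data stands in here for the 14 address features:

```python
import numpy as np
from sklearn.ensemble import ExtraTreesClassifier

# Synthetic stand-in: 200 samples, 3 features, only feature 0 informative.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = (X[:, 0] > 0).astype(int)

clf = ExtraTreesClassifier(n_estimators=100, random_state=0).fit(X, y)
# Rank features from most to least important, as in Table 5.1.
ranking = np.argsort(clf.feature_importances_)[::-1]
print(ranking)  # the informative feature should rank first
```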
5.2 Machine Learning Classifiers
5.2.1 Decision Tree Classifier
A decision tree [31] is a supervised learning method mostly used in classification
problems. The algorithm can be visualized as a tree structure where every node splits the
data into two parts based on the most distinctive feature value at that node. The
algorithm can be used for categorical as well as continuous data. At the beginning, the
complete training dataset is fed to the root node, and it is then partitioned at finer
granularity by the features. The algorithm terminates when a stopping criterion of the
classifier is met or all leaf nodes are pure. Some terms related to decision trees are:
• Gini Impurity: It measures how likely a data point is to be misclassified.

  G = \sum_{i=1}^{J} P(i) (1 - P(i))    (5.1)

  where P(i) is the probability of class i and J is the number of classes.
• Entropy: Entropy is the measure of uncertainty in the given data.

  H = -\sum_{x} p(x) \log p(x)    (5.2)

  where p(x) is the probability of x.
• Information Gain: Information gain (IG) measures the "information" about the class that
  we get from a feature. It helps to partition the data at every node.

  IG = H(x) - \sum_{y} \frac{n_y}{n_x} H(y)    (5.3)

  where x is the parent node, the y are its child nodes, and n_y / n_x is the fraction
  of samples reaching child y.
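The three measures above can be computed directly from class labels; a small self-contained sketch:

```python
import math
from collections import Counter

def gini(labels):
    """Gini impurity: G = sum_i P(i) * (1 - P(i))  (Eq. 5.1)."""
    n = len(labels)
    return sum((c / n) * (1 - c / n) for c in Counter(labels).values())

def entropy(labels):
    """Entropy: H = -sum_x p(x) log2 p(x)  (Eq. 5.2)."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(parent, children):
    """IG = H(parent) - weighted average of children's entropies (Eq. 5.3)."""
    n = len(parent)
    return entropy(parent) - sum(len(ch) / n * entropy(ch) for ch in children)

# A pure split of a balanced two-class parent yields the maximal gain:
parent = [0, 0, 1, 1]
print(information_gain(parent, [[0, 0], [1, 1]]))  # 1.0
```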
5.2.2 Support Vector Machine (SVM)
A Support Vector Machine [32] is a supervised learning algorithm which categorizes data
points by constructing an optimal hyperplane. The SVM fits the best possible hyperplane
such that the separation between the different classes is maximized and the error is
minimized. An example SVM classifier is depicted in Figure 5.2.
Figure 5.2: SVM Classifier
Some of the terminology associated with SVMs:
• Kernel
A kernel is a method used for pattern analysis. It transforms the data into another
(often higher-dimensional) space in which a better separation between the classes can
be obtained.
• Regularization
The regularization parameter tells the classifier how much misclassification can be
tolerated.
• Margin
The margin is the distance between the separating hyperplane and the nearest data points
of either class. A good margin means that the two classes are roughly equidistant from
the hyperplane.
5.2.3 K-nearest Neighbour Classifier(KNN)
KNN [33] assigns a data point to a class by considering the majority class among its
neighbours. K (the number of neighbours to consider) needs to be specified. The K nearest
neighbours of a data point can be determined using different distance measures. Some of
them are:
• Euclidean Distance

  \sqrt{\sum_{i=1}^{k} (x_i - y_i)^2}    (5.4)

• Manhattan Distance

  \sum_{i=1}^{k} |x_i - y_i|    (5.5)

• Minkowski Distance

  \left( \sum_{i=1}^{k} |x_i - y_i|^q \right)^{1/q}    (5.6)
The feature values should be standardized or scaled before computing distances, as
different features have different ranges of values:

  X_s = \frac{X - \min}{\max - \min}    (5.7)
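The distance measures and min-max scaling above can be sketched in a few lines:

```python
def minkowski(x, y, q):
    """Minkowski distance (Eq. 5.6); q=1 gives Manhattan, q=2 Euclidean."""
    return sum(abs(a - b) ** q for a, b in zip(x, y)) ** (1 / q)

def minmax_scale(column):
    """Min-max scaling (Eq. 5.7): X_s = (X - min) / (max - min)."""
    lo, hi = min(column), max(column)
    return [(v - lo) / (hi - lo) for v in column]

print(minkowski([0, 0], [3, 4], 2))   # Euclidean: 5.0
print(minkowski([0, 0], [3, 4], 1))   # Manhattan: 7.0
print(minmax_scale([10, 20, 30]))     # [0.0, 0.5, 1.0]
```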
5.2.4 Multi-layer Perceptron Classifier(MLP)
MLP classifier [34] is a supervised learning method which classifies by learning a
function

  f(\cdot) : R^m \rightarrow R^o    (5.8)

where m is the dimension of the input and o is the dimension of the output. MLP can
extract significant information from unbalanced or imprecise data and use it to discover
patterns by which data points can be classified. The basic type of neural network has
three units:
1. Input Unit: It feeds the raw information into the network.
2. Hidden Unit: It computes weighted nonlinear functions of the values from the input
layer (or the previous hidden layer).
3. Output Unit: It converts the values from the last hidden layer to generate output.
MLP uses different loss functions for different problems. The loss function used for the
classification problem is Cross-Entropy, which in the binary case is given as

  Loss(\hat{y}, y, W) = -y \ln \hat{y} - (1 - y) \ln(1 - \hat{y}) + \alpha ||W||_2^2    (5.9)

where the penalty term \alpha ||W||_2^2 is an L2 regularization.
Figure 5.3: Neural Network Structure [7]
5.2.5 Naive Bayes Classifier
It is a supervised learning method [35] which computes, for every data point, a
probability for each class; the data point is assigned to the class with the maximum
probability. Using Bayes' theorem, the conditional probability can be decomposed as

  p(C_k | x) = \frac{p(C_k) \, p(x | C_k)}{p(x)}    (5.10)

where x is the feature vector and C_k is the class under consideration.
5.3 Smart Contract Analysis
A smart contract is a piece of code which executes automatically when its conditions are
satisfied. There exist many ways to write code with the same behaviour. We tried to group
similar types of contracts based on user interaction. To do this, we considered three
parameters for each contract:
• Total number of invocations
• Total number of unique invocations
• MD5 hash of contract code
5.4 Summary
This chapter presented how the data is processed and which machine learning techniques
are used for anomaly detection in the Ethereum network. Data preparation is a crucial
step in any machine learning model. Section 5.1 describes how features are extracted
from the raw JSON files. After feature extraction, Decision Tree, k-nearest neighbours,
Random Forest, SVM, MLP and Naive Bayes classifiers were used for data modeling.
Hyper-parameters of the different classifiers were optimized for better results, and
5-fold cross validation was performed in order to avoid over-fitting.
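A sketch of 5-fold cross validation together with a small hyper-parameter search, on synthetic stand-in data (the parameter grid is illustrative, not the thesis configuration):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, cross_val_score

# Synthetic stand-in for the address-feature matrix.
rng = np.random.default_rng(1)
X = rng.normal(size=(120, 4))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

# 5-fold cross validation of a single model...
scores = cross_val_score(RandomForestClassifier(random_state=0), X, y, cv=5)

# ...and a small hyper-parameter search, also with 5-fold CV.
search = GridSearchCV(RandomForestClassifier(random_state=0),
                      {"n_estimators": [50, 100]}, cv=5).fit(X, y)
print(scores.mean(), search.best_params_)
```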
Chapter 6
Results and Analysis
6.1 Experimental Setup
The dataset we have is highly unbalanced: only 125 addresses are marked malicious [28]
out of 24 million. These 24 million addresses are externally owned accounts (EOAs), i.e.,
the count does not include smart contract addresses. We split the 125 malicious addresses
into 75 for training and 50 for testing as ground truth. Next, we need some non-malicious
addresses to test our machine learning models rigorously. We included the addresses of
Ethereum developers, Ethereum contributors, exchanges and some other known addresses in
the non-malicious set for testing; we extracted these addresses from the genesis block of
Ethereum and others by web crawling [28] [29]. A total of 250 non-malicious addresses
were extracted for testing. We used label '0' for non-malicious and '1' for malicious
addresses. In this work, we try to label the 24 million addresses as either malicious or
non-malicious using the 75 known malicious addresses.
To deal with this highly unbalanced dataset, we make an assumption: "we mark an address
as malicious if it has an outgoing transaction to an address already marked malicious".
Under this assumption, we have a total of 3,830 addresses marked malicious. Finally, we
have two settings in which to evaluate our models:
1. Testing on the 50 originally marked malicious addresses
2. Testing on 50 addresses randomly chosen from the 3,830 marked malicious under the
assumption
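The labeling assumption can be sketched as a one-level expansion over the transaction edges (the field names are illustrative):

```python
def expand_malicious(known_malicious, transactions):
    """Level-1 indegree expansion: any address with an outgoing transaction
    to a known-malicious address is also marked malicious."""
    bad = set(known_malicious)
    for tx in transactions:
        # Check against the original set only, so expansion stays one level.
        if tx["to"] in known_malicious:
            bad.add(tx["from"])
    return bad

txs = [
    {"from": "a", "to": "m1"},   # a pays a malicious address -> suspicious
    {"from": "b", "to": "c"},
    {"from": "m1", "to": "d"},   # receiving from m1 does NOT mark d
]
print(sorted(expand_malicious({"m1"}, txs)))  # ['a', 'm1']
```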
SMOTE [36] is a technique to generate synthetic points when the data points are
separable. However, analyzing the distribution of our data points shows a high degree of
overlap, which makes SMOTE unsuitable for our dataset. 5-fold cross validation is
performed in order to avoid over-fitting.
6.2 Evaluation Metrics
For the evaluation of the experiments, we treat non-malicious as the positive class and
malicious as the negative class. The metrics used are:
1. Confusion Matrix
It is used to describe the performance of the model using labeled test data.
                        Non-Malicious (Predicted)  Malicious (Predicted)
Non-Malicious (Actual)  True Positive              False Negative
Malicious (Actual)      False Positive             True Negative
Table 6.1: Confusion Matrix
It has four entries:
• True Positive (TP): The algorithm correctly classifies an address as non-malicious.
• False Positive (FP): The algorithm incorrectly classifies an address as non-malicious.
• True Negative (TN): The algorithm correctly classifies an address as malicious.
• False Negative (FN): The algorithm incorrectly classifies an address as malicious.
2. Accuracy
“It is the ratio of correct results to the total returned by the algorithm”

  Accuracy = \frac{TP + TN}{TP + FP + TN + FN}    (6.1)
3. Precision
“What fraction of positives identified by the algorithm is actually correct”

  Precision = \frac{TP}{TP + FP}    (6.2)
4. Recall
“What fraction of real positives were identified as positives by the algorithm”

  Recall = \frac{TP}{TP + FN}    (6.3)
5. F-score
It is the harmonic mean of precision and recall.

  F_{score} = 2 \times \frac{Precision \times Recall}{Precision + Recall}    (6.4)
The general formula for any β:

  F_\beta = (1 + \beta^2) \cdot \frac{precision \cdot recall}{\beta^2 \cdot precision + recall}    (6.5)

where β is a parameter that assigns different levels of importance to precision and
recall.
6. AUC - ROC Curve
It is a measure used to check the performance of classification algorithms; it tells us
how capable the model is of distinguishing between the classes. ROC is the probability
curve and AUC (area under the curve) is the measure of separability.
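Equations 6.1-6.4 can be computed directly from the four confusion-matrix counts; the counts below are illustrative, not the thesis results:

```python
def metrics(tp, fp, tn, fn):
    """Accuracy, precision, recall and F-score from confusion-matrix counts,
    with non-malicious treated as the positive class (Eqs. 6.1-6.4)."""
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f_score = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f_score

# Illustrative counts only:
acc, prec, rec, f1 = metrics(tp=90, fp=10, tn=80, fn=20)
print(acc, prec, rec, f1)
```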
6.3 Machine Learning Classifiers
6.3.1 Decision Tree Classifier
The first classifier we considered is the Decision Tree Classifier, trained on the ground
truth, i.e., 75 addresses from each class, to generate the baseline results. Table 6.2
contains the result when the model is evaluated on the test set, which contains 50
malicious and 250 non-malicious data points.
Evaluation Metric Value
Accuracy 0.9353
Precision 0.9641
Recall 0.8060
F-score 0.8589
Table 6.2: Test result on Ground Truth
Figure 6.1: Confusion Matrix for Ground Truth Result
Figure 6.1 shows the result in the setting considered above, in which 37 out of 50
malicious addresses are correctly classified. We can see that the model does not classify
any normal user as malicious, which is very important in this type of network, as we may
lose users if we wrongly mark them as malicious.
Figure 6.2: ROC curve for Ground Truth Result
The ROC curve in Figure 6.2 shows the true positive rate (sensitivity) as a function of
the false positive rate (1 − specificity) for different cut-off points. Each point on the
ROC curve represents a (sensitivity, specificity) pair corresponding to a particular
decision threshold. The accuracy of the model is reflected by the area under the curve:
the closer the curve is to the upper-left corner, the greater the accuracy. Here, 87% of
the area lies under the curve, which shows that our model performed well.
Next, we evaluated the model trained using the assumption discussed in Section 6.1. We
adopted this assumption because, when computing feature importance, unique indegree came
out as the most important feature: addresses that send ether to nodes known to be
malicious are highly likely to be malicious themselves. Further, we show that considering
indegree nodes is the right choice, as the results are worse when outdegree nodes are
considered.
Evaluation Metric Value
Accuracy 0.9966
Precision 0.9980
Recall 0.9900
F-score 0.9939
Table 6.3: Test result on Level 1 Indegree nodes
We can observe from Table 6.3 that our assumption, introduced to slightly balance the
dataset, does not hurt the accuracy of our model; we even obtain better results with it.
This signifies that the nodes directly connected to the malicious nodes and sending ether
to them are the most suspicious.
Figure 6.3: Confusion Matrix for Level 1 Indegree Result
The confusion matrix in Figure 6.3 shows that we were able to classify 49 out of 50
malicious nodes as malicious.
Figure 6.4 shows that 99% of the area lies under the ROC curve, which means this model
beats the result obtained by considering only the ground truth, where the area under the
curve (AUC) was 87%.
Figure 6.4: ROC curve for Level 1 Indegree Result
To support our assumption further, we evaluated our model while considering as malicious
the nodes to which the malicious addresses have sent ether. The results obtained under
this reversed assumption are given below.
Evaluation Metric Value
Accuracy 0.8366
Precision 0.9180
Recall 0.5100
F-score 0.4749
Table 6.4: Test result on Level 1 Outdegree nodes
We can see in Table 6.4 that the accuracy of the model is drastically reduced, which
supports our assumption of marking only the indegree nodes of malicious addresses as
malicious. The confusion matrix in Figure 6.5 shows that we were able to correctly
classify only 1 out of 50 known malicious addresses as malicious. The ROC curve in
Figure 6.6 is very close to the line y = x, with AUC = 0.51, far lower than in the two
settings above. This may be because a malicious node may transact with non-malicious
nodes in order to spend its ether without being caught.
Figure 6.5: Confusion Matrix for Level 1 Outdegree Result
Figure 6.6: ROC curve for Level 1 Outdegree Result
6.3.2 Random Forest Classifier
Next, we tried the Random Forest Classifier [37], an improved version of the decision
tree. It is an ensemble learning method for classification which creates multiple trees
rather than the single tree of the Decision Tree. We repeated the experiments done with
the Decision Tree, expecting to obtain accuracy greater than or equal to that of the
Decision Tree with the Random Forest. The results are:
Evaluation Metric Value
Accuracy 0.9893
Precision 0.9936
Recall 0.9680
F-score 0.9802
Table 6.5: Test result on Ground Truth
Figure 6.7: Confusion Matrix for Ground Truth Result
The results obtained when the indegree nodes of malicious nodes are considered malicious:
Evaluation Metric Value
Accuracy 0.9966
Precision 0.9980
Recall 0.9900
F-score 0.9939
Table 6.6: Test result on Level 1 Indegree nodes
Figure 6.8: Confusion Matrix for Level 1 Indegree Result
The results obtained when the outdegree nodes of malicious nodes are considered malicious:
Evaluation Metric Value
Accuracy 0.8366
Precision 0.9180
Recall 0.5100
F-score 0.4749
Table 6.7: Test result on Level 1 Outdegree nodes
Figure 6.9: Confusion Matrix for Level 1 Outdegree Result
We get the improved results expected for the Random Forest classifier with respect to the
Decision Tree method. For setting 1, in which only the ground-truth malicious nodes are
considered, we correctly classified 47 out of 50 malicious addresses, as depicted in
Figure 6.7, which is 10 more than with the Decision Tree. For the other settings we get
the same accuracy as the Decision Tree.
6.3.3 Other Classifier Results
We tried four other classifiers, whose results are given in Table 6.8 and Table 6.9:
Table 6.8 contains the results evaluated on setting 1 and Table 6.9 those on setting 2.
The MLP classifier is better suited to classification problems with time-series datasets.
As the data is not separable, the KNN accuracy is low compared to the MLP and Naive Bayes
classifiers. On setting 1 the MLP classifier performs better than all the other
classifiers that we tried.
SVMs, on the other hand, can efficiently perform non-linear classification using what is
called the kernel trick, implicitly mapping their inputs into high-dimensional feature
spaces. SVM is one of the most robust and accurate of the classification algorithms, but
it requires relatively more training data than the other classifiers. As we have
relatively more data in setting 2, SVM performed well there, achieving an accuracy of
99.66%. We performed 5-fold cross validation to avoid over-fitting of the model.
SVM MLP Classifier KNN(K=5) Naive Bayes Classifier
Accuracy 0.8366 0.9152 0.8566 0.8933
Precision 0.9180 0.9538 0.9266 0.9432
Recall 0.5100 0.7460 0.5700 0.6799
F-score 0.4749 0.8051 0.5832 0.7346
Table 6.8: Test result on Ground Truth
SVM MLP Classifier KNN(K=5) Naive Bayes Classifier
Accuracy 0.9966 0.9473 0.9166 0.8833
Precision 0.9980 0.9708 0.9545 0.9385
Recall 0.9900 0.8420 0.7500 0.6500
F-score 0.9939 0.8818 0.8095 0.6980
Table 6.9: Test result on Level 1 Indegree nodes
6.4 Graph Based Analysis
In Ethereum, the block publish time is about 12-15 seconds, but when plotting a histogram
of publish times we observed that a large number of blocks are published within 3-6
seconds. We infer from the plot in Figure 6.10 that this significant shift in publish
time occurs either because the difficulty level set for the block is not appropriate, or
because participants use dedicated devices such as Application-Specific Integrated
Circuits (ASICs) for mining. Ethereum claims that its consensus protocol 'Ethash' is
independent of such mining devices [38]; our observation suggests that Ethash is still
not able to provide protection against ASICs.
Figure 6.10: Block Publish time analysis
From the plot in Figure 6.11 we observe exponential growth in the number of transactions
per block. The sudden change starts from the block series around 320K, which is around
early 2017. One possible reason might be the migration of users from other
cryptocurrencies: the Bitcoin block reward was reduced from 25 BTC to 12.5 BTC in July
2016 [39].
Figure 6.11: Transaction count per 1 lakh block
The plot in Figure 6.12 shows the cumulative growth in the number of addresses in the
Ethereum network over time. The x-axis represents the date, from 30 July 2015 to 20 Feb
2018, and the y-axis gives the corresponding count of addresses. We observe the same
behaviour as for the number of transactions per block: the plot shows a sudden rise in
the number of addresses from the 320K block onwards.
Figure 6.12: Cumulative address growth over time
The plot in Figure 6.13 shows the variation of block size with the number of transactions
per day. Block size is directly proportional to the number of transactions occurring in a
day; we infer that the number of transactions packed into a block grows in proportion to
the transaction rate, i.e., it is not biased by the miners.
The plot in Figure 6.14 shows the variation of block size with chain data size. Chain
data size refers to the total size of the Ethereum blockchain; one can refer to [17] to
learn more about data storage in Ethereum. We can see that block size is directly
proportional to the chain data size.
Figure 6.13: Variation of block size with number of Transactions
Figure 6.14: Variation of block size with chain data size
6.5 Smart Contract Analysis
We have done a basic analysis based on user interaction. We randomly chose 1000 contracts
for evaluation and then generated all possible pairs of these contracts. For each pair we
computed:
• Total invocations of both addresses
• Unique invocations of both addresses
• The intersection of the unique invocation sets of the two addresses
• Whether they have the same MD5 hash of the code or not
The results for the total of 499,500 pairs are given in the table below:

Same MD5 hash   Intersection of addresses == 0   Intersection of addresses != 0
NO              442,174                          7,861
YES             42,312                           7,153

Table 6.10: Similarity Evaluation
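A sketch of how such pairwise statistics might be computed; the data layout, mapping each contract address to its code hash and caller set, is an assumption for the example:

```python
from itertools import combinations

def pair_stats(contracts):
    """For every contract pair, report caller-set overlap and whether the two
    code MD5 hashes match. `contracts` maps address -> (md5_hash, caller_set)."""
    rows = []
    for a, b in combinations(sorted(contracts), 2):
        md5_a, users_a = contracts[a]
        md5_b, users_b = contracts[b]
        rows.append({
            "pair": (a, b),
            "same_md5": md5_a == md5_b,
            "common_users": len(users_a & users_b),
        })
    return rows

contracts = {
    "0xA": ("h1", {"u1", "u2"}),
    "0xB": ("h1", {"u2", "u3"}),   # same code as 0xA, one shared caller
    "0xC": ("h2", {"u4"}),         # different code, no shared callers
}
for row in pair_stats(contracts):
    print(row)
```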
The analysis tries to group the contracts together based on user interaction. It revealed
a high degree of overlap between the users who invoked similar contracts (contracts with
the same MD5 hash). We took the intersection of the addresses that invoke similar as well
as dissimilar contracts. The insights we found are:
• There are 49,465 pairs out of the total 499,500 which have the same MD5 hash. 42,312 of
these 49,465 pairs do not share any common user, but 34,367 of them were invoked just
once. These might be test contracts deployed before the actual contract; such testing is
necessary, as contract code, once deployed, cannot be modified. 7,153 of the 49,465 pairs
have at least one user in common, so we can infer that these users are somehow related to
each other, as they all perform similar types of tasks using different contracts.
• Similarly, there are 450,035 pairs out of the 499,500 possible which do not have the
same MD5 hash. Of these, 442,174 pairs do not share a common user. This value is high, as
expected: if the contracts are different (different MD5 hashes of their code), then the
number of pairs without a common user will be high. There are only 7,861 pairs which have
a common user between them; from these we can infer that some contracts with different
MD5 hashes of their code nevertheless show similar behaviour.
Chapter 7
Conclusion and Future Work
7.1 Conclusion
Chapters 4 and 5 explained how the dataset was collected and pre-processed. After
processing the dataset, we applied different machine learning classifiers for anomaly
detection in the Ethereum network. We were able to detect 47 out of 50 known cases using
the Random Forest Classifier in the first setting, i.e., testing on the ground truth.
Next, we detected 49 out of 50 cases using our assumption of marking indegree nodes as
malicious; in this second setting the SVM, Decision Tree and Random Forest classifiers
produced results with the same accuracy. We also backed our assumption by evaluating the
results when outdegree nodes are considered instead: we were then able to detect only 1
out of 50.
We can conclude from this work that it is possible to detect patterns in Ethereum transaction data using machine learning techniques. Using these machine learning models, one can label all addresses as malicious or non-malicious and assign a suspicion level using the class probability of each data point. These models are applicable to any dataset that possesses an inherent graph structure. Our dataset is highly unbalanced, i.e., we have only 125 samples in one class and 24 million in the other, yet we were still able to produce good results. This signifies that our models work well with unbalanced datasets.
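The idea of turning class probabilities into suspicion levels on an unbalanced dataset can be sketched with scikit-learn. The data below is synthetic (the thesis's real per-address features such as in-degree and out-degree are not reproduced here), and `class_weight="balanced"` is one common way to compensate for class skew; the thesis's exact hyperparameters may differ.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
# Synthetic, highly unbalanced stand-in: 2000 "normal" addresses and
# 20 "malicious" ones, each described by 4 numeric features.
X_normal = rng.normal(0.0, 1.0, size=(2000, 4))
X_malicious = rng.normal(4.0, 1.0, size=(20, 4))
X = np.vstack([X_normal, X_malicious])
y = np.array([0] * 2000 + [1] * 20)

# class_weight="balanced" reweights samples inversely to class frequency,
# so the minority (malicious) class is not drowned out.
clf = RandomForestClassifier(n_estimators=100, class_weight="balanced",
                             random_state=0).fit(X, y)

# The predicted probability of the malicious class doubles as a
# per-address suspicion score; sorting it ranks addresses by suspicion.
suspicion = clf.predict_proba(X)[:, 1]
ranking = np.argsort(suspicion)[::-1]   # most suspicious first
```

Rather than a hard malicious/non-malicious label, the continuous `suspicion` score lets an analyst triage the most suspicious addresses first, which matters when one class outnumbers the other by orders of magnitude.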
7.2 Future Work
Anomaly detection in any network is an interesting as well as challenging problem. Researchers have built multiple techniques to solve it, but there are still several other directions worth exploring. The prominent ones are:
• We can also apply the parallel partitioning approach [40] for anomaly detection. In this setting, we look for structural behaviour [41], in which the regularity of the graph is used for anomaly detection.
• We have collected and analyzed a fixed range of data in this work. We could create a modular framework that performs this kind of analysis on real-time data.
• Here, we manually computed feature values for each node; we could instead obtain an embedding for each node of the graph [42] using different machine learning techniques. This might yield additional features for analysis.
• The smart contract analysis done in Chapter 6 is based on user interaction to reveal interesting behaviour of the addresses. There might be other ways of detecting the similarity between groups of addresses.
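To make the contrast between hand-crafted node features and learned embeddings concrete, here is a minimal sketch on a toy 5-node transaction graph. Spectral embedding is used purely as an illustrative embedding method (the thesis does not prescribe one, and [42] discusses large-scale alternatives); the adjacency matrix is invented for the example.

```python
import numpy as np
from sklearn.manifold import SpectralEmbedding

# Toy directed transaction graph over 5 addresses: A[i, j] = 1 means
# address i sent a transaction to address j (illustrative values).
A = np.array([[0, 1, 1, 0, 0],
              [0, 0, 1, 0, 0],
              [0, 0, 0, 1, 0],
              [0, 0, 0, 0, 1],
              [1, 0, 0, 0, 0]], dtype=float)

# Hand-crafted features of the kind computed manually in this work:
in_degree = A.sum(axis=0)    # transactions received per address
out_degree = A.sum(axis=1)   # transactions sent per address

# Learned alternative: a 2-dimensional spectral embedding per node.
# Spectral embedding expects a symmetric affinity matrix, so the
# directed graph is symmetrized first.
S = np.maximum(A, A.T)
emb = SpectralEmbedding(n_components=2, affinity="precomputed").fit_transform(S)
```

Each row of `emb` is a learned feature vector for one address and can be concatenated with the manual degree features before classification, which is the direction the bullet above suggests.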
Bibliography
[1] Bitcoin energy consumption. https://digiconomist.net/
bitcoin-energy-consumption. Accessed: May 13, 2019.
[2] Blockchain data. https://medium.com/cybermiles/diving-into-ethereums-world. Accessed: May 13, 2019.
[3] Dapp phases. https://coinsutra.com/dapps-decentralized-applications/.
Accessed: May 13, 2019.
[4] Michele Spagnuolo, Federico Maggi, and Stefano Zanero. BitIodine: Extracting intelligence from the Bitcoin network. Volume 8437, pages 457–468, 03 2014. ISBN 978-3-662-45471-8. doi: 10.1007/978-3-662-45472-5_29.
[5] T. Chen, Y. Zhu, Z. Li, J. Chen, X. Li, X. Luo, X. Lin, and X. Zhange. Understand-
ing ethereum via graph analysis. In IEEE INFOCOM 2018 - IEEE Conference on
Computer Communications, pages 1484–1492, April 2018. doi: 10.1109/INFOCOM.
2018.8486401.
[6] Etherscan. https://etherscan.io/. Accessed: May 13, 2019.
[7] Nn structure. https://www.google.com/search?q=neural+network&tbm=isch&
source=iu&ictx=1&fir=vj5KYf7zG80QYM%253A%252CynYusGDc2AddHM%252C%
252Fm%252F05dhw&vet=1&usg=AI4_-kS1zRBtf1J-UIETu-25mMYcnUUotQ&sa=X&ved=
2ahUKEwiJqq6F9KHiAhUDXn0KHaX1CGMQ9QEwAHoECBAQBg#imgrc=vj5KYf7zG80QYM:.
Accessed: May 13, 2019.
[8] Satoshi Nakamoto. "Bitcoin: A peer-to-peer electronic cash system," http://bitcoin.org/bitcoin.pdf, 2008.
[9] Global internet usage. https://en.wikipedia.org/wiki/Global_Internet_usage.
Accessed: May 13, 2019.
[10] What is blockchain? https://blockgeeks.com/guides/what-is-blockchain-technology/. Accessed: May 13, 2019.
[11] 51% attack. https://bitcoin.org/en/blockchain-guide#
block-height-and-forking. Accessed: May 13, 2019.
[12] Blockchain 2.0. https://medium.com/xpa-2-0/blockchain-2-0. Accessed: May 13, 2019.
[13] Turing complete. https://en.wikipedia.org/wiki/Turing_completeness. Ac-
cessed: May 13, 2019.
[14] Time to publish block. https://ethereum.stackexchange.com/questions/9617/
how-many-blocks-are-created-at-one-point-of-time. Accessed: May 13, 2019.
[15] Ethereum. https://www.ethereum.org/. Accessed: May 13, 2019.
[16] Casper protocol. https://blockgeeks.com/guides/ethereum-casper/. Accessed:
May 13, 2019.
[17] How ethereum data is stored ? https://hackernoon.com/
getting-deep-into-ethereum-how-data-is-stored-in-ethereum-e3f669d96033.
Accessed: May 13, 2019.
[18] Merkle patricia tree. https://medium.com/codechain/
modified-merkle-patricia-trie-how-ethereum-saves-a-state-e6d7555078dd.
Accessed: May 13, 2019.
[19] Ethereum virtual machine. https://medium.com/mycrypto/
the-ethereum-virtual-machine-how-does-it-work-9abac2b7c9e. Accessed:
May 13, 2019.
[20] Crowdsale stats. https://medium.com/tendermint/
examining-funding-token-allocation-of-blockchain-foundations-a2d0fb29b5ca.
Accessed: May 13, 2019.
[21] Ether value. http://ethdocs.org/en/latest/ether.html. Accessed: May 13, 2019.
[22] Smart contract. https://www.coindesk.com/information/
ethereum-smart-contracts-work. Accessed: May 13, 2019.
[23] Dapps. https://blockchainhub.net/decentralized-applications-dapps. Ac-
cessed: May 13, 2019.
[24] Chen Zhao. Graph-based forensic investigation of bitcoin transactions. 2014.
[25] Thai Pham and Steven Lee. Anomaly detection in bitcoin network using unsupervised
learning methods. CoRR, abs/1611.03941, 2016. URL http://arxiv.org/abs/1611.
03941.
[26] Thai Pham and Steven Lee. Anomaly detection in the bitcoin system - A network per-
spective. CoRR, abs/1611.03942, 2016. URL http://arxiv.org/abs/1611.03942.
[27] Infura. https://infura.io/docs/ethereum/json-rpc/. Accessed: May 13, 2019.
[28] Malicious address repository. https://github.com/MyEtherWallet/ethereum-lists/blob/master/src/addresses/addresses-darklist.json. Accessed: May 13, 2019.
[29] Malicious db. https://etherscamdb.info/. Accessed: May 13, 2019.
[30] Extra tree classifier. https://scikit-learn.org/stable/modules/generated/
sklearn.ensemble.ExtraTreesClassifier.html. Accessed: May 13, 2019.
[31] Decision tree. https://scikit-learn.org/stable/modules/generated/sklearn.
tree.DecisionTreeClassifier.html. Accessed: May 13, 2019.
[32] Svm classifier. https://scikit-learn.org/stable/modules/generated/sklearn.
svm.SVC.html. Accessed: May 13, 2019.
[33] Knn classifier. https://scikit-learn.org/stable/modules/generated/sklearn.
neighbors.KNeighborsClassifier.html. Accessed: May 13, 2019.
[34] Mlp classifier. https://scikit-learn.org/stable/modules/generated/sklearn.
neural_network.MLPClassifier.html. Accessed: May 13, 2019.
[35] Naive bayes classifier. https://scikit-learn.org/stable/modules/generated/
sklearn.naive_bayes.GaussianNB.html/. Accessed: May 13, 2019.
[36] Smote. https://jair.org/index.php/jair/article/view/10302. Accessed: May
13, 2019.
[37] Random forest classifier. https://scikit-learn.org/stable/modules/
generated/sklearn.ensemble.RandomForestClassifier.html. Accessed: May
13, 2019.
[38] Ethash ASIC. https://en.bitcoinwiki.org/wiki/Ethash. Accessed: May 13, 2019.
[39] Bitcoin halving. https://thenextweb.com/hardfork/2019/01/30/
the-bitcoin-halvening-is-happening-heres-what-you-need-to-know/. Ac-
cessed: May 13, 2019.
[40] W. Eberle and L. Holder. Incremental anomaly detection in graphs. In 2013 IEEE
13th International Conference on Data Mining Workshops, pages 521–528, Dec 2013.
doi: 10.1109/ICDMW.2013.93.
[41] Lawrence B. Holder, Diane J. Cook, and Surnjani Djoko. Substructure discovery in
the subdue system. In Proceedings of the 3rd International Conference on Knowledge
Discovery and Data Mining, AAAIWS’94, pages 169–180. AAAI Press, 1994. URL
http://dl.acm.org/citation.cfm?id=3000850.3000868.
[42] Generating graph embeddings for extremely large
graph (ai lab fb). https://ai.facebook.com/blog/
open-sourcing-pytorch-biggraph-for-faster-embeddings-of-large-graphs/.
Accessed: May 13, 2019.
[43] Ethereum white paper. https://github.com/ethereum/wiki/wiki/White-Paper.
Accessed: May 13, 2019.
[44] Ethereum yellow paper. https://ethereum.github.io/yellowpaper/paper.pdf.
Accessed: May 13, 2019.