ANOMALY DETECTION IN THE ETHEREUM NETWORK
A thesis submitted in partial fulfillment of the requirements
for the degree of Master of Technology
by
AJAY SINGH
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
INDIAN INSTITUTE OF TECHNOLOGY KANPUR
June 2019
Abstract
Ethereum is a platform on which users can build and deploy decentralized applications and
smart contracts. Participants in the Ethereum network are 'pseudo-anonymous', which
makes it very hard to detect anomalous behaviour in the system. Thus, it serves as
a noteworthy place to perform malicious activity and then go undetected. With the
sudden rise of blockchain technology, anomaly detection has also received much attention
in the past decade. Anomalies in the network are participants who execute fraudulent
transactions or whose behaviour is otherwise abnormal. Such abnormalities must be
detected and removed as early as possible to preserve the faith of participants in the
platform. A large body of work exists on the Bitcoin cryptocurrency, with good results;
to the best of our knowledge, this thesis presents the first work on anomaly detection in
Ethereum.
In this thesis, we consider anomaly detection in the Ethereum network using machine
learning techniques. Our goal is to detect which users are most suspicious. To this end,
we apply various machine learning classifiers to Ethereum transaction data, and we
evaluate the accuracy and precision of each method, backing them with experimental
results. Next, we perform graph-based analyses of the Ethereum data and attempt to
derive a similarity index for smart contracts based on user interaction. These methods
can be used in any setting with an internal graph structure; we chose Ethereum due to
the availability and popularity of its dataset. This work provides a good starting point
for anomaly detection on the Ethereum network.
Acknowledgements
I want to express my sincerest gratitude to my supervisor, Prof. Sandeep Shukla of the
Department of Computer Science and Engineering, who has guided me through my thesis.
Without his support and expertise, this dissertation would not have been possible. Also,
I would like to extend my gratitude to Prof. Medha Atre and Shubham Sahai who were
always there to help me in various phases of my thesis. Last but not least, I would
like to thank the Infura community, from which I collected my data.
I also want to express my gratitude to my parents and my dear brother Abhay who always
supported me in each of my endeavours throughout my life. I want to thank Akanksha,
Abhishek, Pankaj, Prerit and my colleagues for their constant support and for helping me
throughout my M.Tech journey.
Ajay Singh
Contents
Abstract v
Acknowledgements vi
Contents vii
List of Figures ix
List of Tables x
1 Introduction 1
1.1 Problem Definition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Contributions of the Thesis . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.3 Organisation of the Thesis . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
2 Background 3
2.1 What is Blockchain? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
2.2 Evolution of Blockchain . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
2.2.1 Blockchain 1.0 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
2.2.2 Blockchain 2.0 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
2.2.3 Blockchain 3.0 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
2.3 What is Ethereum? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
2.3.1 Consensus Protocols . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
2.3.1.1 Proof of Work (POW) . . . . . . . . . . . . . . . . . . . . . 6
2.3.1.2 Proof of Stake (POS) . . . . . . . . . . . . . . . . . . . . . 6
2.3.2 Ethereum Accounts . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
2.3.3 How is Ethereum data stored? . . . . . . . . . . . . . . . . . . . . . 7
2.3.3.1 Merkle Patricia Tree . . . . . . . . . . . . . . . . . . . . . . 8
2.3.4 Gas and Payment . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
2.3.5 Ether(ETH) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
2.3.6 Token . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
2.3.7 Smart Contracts . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
2.3.7.1 Benefits of Smart Contract . . . . . . . . . . . . . . . . . . 11
2.3.7.2 How do Smart Contracts Work? . . . . . . . . . . . . . . . 11
2.3.8 Decentralized Applications(Dapps) . . . . . . . . . . . . . . . . . . . 12
2.4 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
2.4.1 What are the anomalies in Ethereum Network? . . . . . . . . . . . . 13
3 Related Work 14
4 Dataset 18
4.1 Dataset Collection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
4.2 Block Structure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
4.3 Transaction Structure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
4.4 Dataset Statistics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
5 Methods and Approaches 22
5.1 Data Preprocessing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
5.1.1 Raw Feature Extraction . . . . . . . . . . . . . . . . . . . . . . . . . 22
5.1.2 Feature Engineering . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
5.1.3 Feature Importance . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
5.2 Machine Learning Classifiers . . . . . . . . . . . . . . . . . . . . . . . . . . 25
5.2.1 Decision Tree Classifier . . . . . . . . . . . . . . . . . . . . . . . . . 25
5.2.2 Support Vector Machine (SVM) . . . . . . . . . . . . . . . . . . . . 25
5.2.3 K-nearest Neighbour Classifier(KNN) . . . . . . . . . . . . . . . . . 26
5.2.4 Multi-layer Perceptron Classifier(MLP) . . . . . . . . . . . . . . . . 27
5.2.5 Naive Bayes Classifier . . . . . . . . . . . . . . . . . . . . . . . . . . 28
5.3 Smart Contract Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
5.4 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
6 Results and Analysis 30
6.1 Experimental Setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
6.2 Evaluation Metrics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
6.3 Machine Learning Classifiers . . . . . . . . . . . . . . . . . . . . . . . . . . 32
6.3.1 Decision Tree Classifier . . . . . . . . . . . . . . . . . . . . . . . . . 32
6.3.2 Random Forest Classifier . . . . . . . . . . . . . . . . . . . . . . . . 36
6.3.3 Other Classifier Results . . . . . . . . . . . . . . . . . . . . . . . . . 39
6.4 Graph Based Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
6.5 Smart Contract Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
7 Conclusion and Future Work 45
7.1 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
7.2 Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
Bibliography 47
List of Figures
2.1 Energy Consumption by Bitcoin compared to other Countries [1] . . . . . . 6
2.2 Account Interaction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
2.3 Data stored on block header [2] . . . . . . . . . . . . . . . . . . . . . . . . . 8
2.4 Example of Merkle Patricia Tree . . . . . . . . . . . . . . . . . . . . . . . . 9
2.5 Payment Structure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
2.6 Working of Smart Contract . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
2.7 Dapps Roadmap [3] . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
3.1 BitIodine Architecture [4] . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
3.2 System Architecture [5] . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
4.1 Block Attributes [6] . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
4.2 Transaction Attributes [6] . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
5.1 Feature Importance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
5.2 SVM Classifier . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
5.3 Neural Network Structure [7] . . . . . . . . . . . . . . . . . . . . . . . . . . 28
6.1 Confusion Matrix for Ground Truth Result . . . . . . . . . . . . . . . . . 33
6.2 ROC curve for Ground Truth Result . . . . . . . . . . . . . . . . . . . . . 33
6.3 Confusion Matrix for Level 1 Indegree Result . . . . . . . . . . . . . . . . . 34
6.4 ROC curve for Level 1 Indegree Result . . . . . . . . . . . . . . . . . . . . 35
6.5 Confusion Matrix for Level 1 Outdegree Result . . . . . . . . . . . . . . . . 36
6.6 ROC curve for Level 1 Outdegree Result . . . . . . . . . . . . . . . . . . . 36
6.7 Confusion Matrix for Ground Truth Result . . . . . . . . . . . . . . . . . 37
6.8 Confusion Matrix for Level 1 Indegree Result . . . . . . . . . . . . . . . . . 38
6.9 Confusion Matrix for Level 1 Outdegree Result . . . . . . . . . . . . . . . . 38
6.10 Block Publish time analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
6.11 Transaction count per 1 lakh block . . . . . . . . . . . . . . . . . . . . . . . 40
6.12 Cumulative address growth over time . . . . . . . . . . . . . . . . . . . . . 41
6.13 Variation of block size with number of Transactions . . . . . . . . . . . . . 42
6.14 Variation of block size with chain data size . . . . . . . . . . . . . . . . . . 42
List of Tables
2.1 Ether denominations and their value . . . . . . . . . . . . . . . . . . . . . . 10
4.1 Dataset Used for analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
5.1 Feature ranking . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
6.1 Confusion Matrix . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
6.2 Test result on Ground Truth . . . . . . . . . . . . . . . . . . . . . . . . . . 32
6.3 Test result on Level 1 Indegree nodes . . . . . . . . . . . . . . . . . . . . . . 34
6.4 Test result on Level 1 Outdegree nodes . . . . . . . . . . . . . . . . . . . . . 35
6.5 Test result on Ground Truth . . . . . . . . . . . . . . . . . . . . . . . . . . 37
6.6 Test result on Level 1 Indegree nodes . . . . . . . . . . . . . . . . . . . . . . 37
6.7 Test result on Level 1 Outdegree nodes . . . . . . . . . . . . . . . . . . . . . 38
6.8 Test result on Ground Truth . . . . . . . . . . . . . . . . . . . . . . . . . . 39
6.9 Test result on Level 1 Indegree nodes . . . . . . . . . . . . . . . . . . . . . . 39
6.10 Similarity Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
Dedicated to the Universe
Chapter 1
Introduction
The introduction of Bitcoin [8] as a cryptocurrency in 2008 changed the world for
investors, developers, researchers, bankers, money launderers and hackers. The pseudo-
anonymity of participants led hackers and money launderers to join the network without
fear of being caught or traced. Researchers, however, foiled them by deducing patterns
in the system. On the technical side, blockchain is secure, robust and decentralized,
which suits sectors where the security of data is a prime concern in an untrusted
environment. It also has pros and cons depending on how it is used: using
cryptocurrencies, participants can evade taxes owed to the government, which is why
cryptocurrencies are still illegal in some countries. Moreover, as of April 2019, only
56.1% [9] of the world's population has access to the Internet. These are some of the
major concerns with cryptocurrencies. Much research has been done in the recent past
on the Bitcoin blockchain network, in which investigators were able to identify illegal
transactions related to Silk Road and more. We therefore decided to do the same for the
Ethereum network, searching for irregularities, i.e. abnormal behaviour within the
system. Since we might categorise some genuine users as malicious, we mark suspicious
users with a probability score for further inspection.
1.1 Problem Definition
In this thesis, we investigate Ethereum transactional data in search of abnormal
activities in the Ethereum network. A large body of literature exists on anomaly
detection in many different network settings; we have chosen machine learning
techniques for the task.
Is it possible to detect anomalous behaviour in the Ethereum network using
machine learning techniques?
This question is further subdivided into the following:
• How do we collect Ethereum transactional data?
• Which raw features should we select from the Ethereum transactional data?
• Which features can we engineer from the raw features for the classification problem?
• How do we deal with a highly imbalanced dataset?
• Which machine learning methods should be used?
• How do we evaluate the results on the engineered dataset?
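The first question can be addressed through Infura (acknowledged earlier as the data source), which exposes the standard Ethereum JSON-RPC interface. The sketch below is illustrative only: the endpoint URL shape follows Infura's public convention and `<PROJECT_ID>` is a placeholder, not the exact setup used for this thesis.

```python
import json

# Hypothetical endpoint: replace <PROJECT_ID> with a real Infura project key.
INFURA_URL = "https://mainnet.infura.io/v3/<PROJECT_ID>"

def block_request(block_number, full_transactions=True):
    """Build a JSON-RPC payload requesting one block (optionally with full txs)."""
    return {
        "jsonrpc": "2.0",
        "method": "eth_getBlockByNumber",
        "params": [hex(block_number), full_transactions],
        "id": 1,
    }

payload = block_request(46147)     # an early mainnet block
body = json.dumps(payload)         # POST this body to INFURA_URL over HTTPS
```

Iterating such requests over block numbers yields the raw block and transaction records from which features are later extracted.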
1.2 Contributions of the Thesis
The key contribution of this thesis is that we can detect malicious nodes in the
Ethereum network with very high probability, and our methods do not classify genuine
users as malicious. This is essential for upholding users' faith in the system. Our
method can be applied to any system with an internal graph structure.
1.3 Organisation of the Thesis
The rest of the thesis is organised as follows. In Chapter 2, we provide background on
blockchain that is essential for understanding its functioning. In Chapter 3, we discuss
related work. In Chapter 4, we describe our dataset, i.e. what data is used and how it
was collected. In Chapter 5, we explain how the data is preprocessed and the
classification methods used for evaluation. Chapter 6 presents the experimental
evaluation and a comparison of the various heuristics discussed in Chapter 5. Finally,
in Chapter 7, we provide a brief conclusion of our work and prospects for future work
in this direction.
Chapter 2
Background
2.1 What is Blockchain?
Blockchain, as the name suggests, is a set of blocks of data cryptographically linked to
one another. It is somewhat similar to a linked-list data structure with a specific set
of rules for modifying the list. Is it merely a linked list that has received such hype
in the past few years? No: a linked list is the core concept behind the blockchain, but
it is much more than that. The salient features of blockchain are a decentralized
system, an immutable ledger, cryptographic security of data, and privacy [10].
Decentralization refers to the level of control over the network: in these systems,
control is distributed across the different entities within the network. A 51% attack
[11] is possible if a mining protocol such as Proof of Work is used; this attack is
feasible when a single entity controls a majority of the network's mining power.
An immutable ledger signifies that blocks cannot be altered once published on the
blockchain. Cryptographic hash functions are used to maintain the security of the data:
a hash function returns a fixed-size hash for given data, and this hash is
collision-resistant and non-invertible, i.e. given the hash of the data, it is
computationally infeasible to recover the data. Hashing also helps maintain the privacy
of users, especially in cryptocurrencies: every user is mapped to an address, which
maintains anonymity and privacy. We now define blockchain precisely:
Definition 2.1. Blockchain is a decentralized ledger on which the data is cryptographi-
cally secure and immutable in nature.
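The interplay of hash linking and immutability in Definition 2.1 can be seen in a toy model. This is illustrative only: real blocks hash an encoded header rather than raw strings, but the detection property is the same.

```python
import hashlib

def block_hash(prev_hash, data):
    """Hash a block's contents together with its parent's hash (toy model)."""
    return hashlib.sha256((prev_hash + data).encode()).hexdigest()

# Build a three-block toy chain starting from a genesis block.
chain = [{"prev": "0" * 64, "data": "genesis"}]
chain[0]["hash"] = block_hash(chain[0]["prev"], chain[0]["data"])
for data in ["tx: A->B 5", "tx: B->C 2"]:
    prev = chain[-1]["hash"]
    chain.append({"prev": prev, "data": data, "hash": block_hash(prev, data)})

def verify(chain):
    """A chain is valid only if every stored hash still matches its contents."""
    for i, blk in enumerate(chain):
        if blk["hash"] != block_hash(blk["prev"], blk["data"]):
            return False
        if i > 0 and blk["prev"] != chain[i - 1]["hash"]:
            return False
    return True

assert verify(chain)                  # untouched chain verifies
chain[1]["data"] = "tx: A->B 500"     # tamper with one block
assert not verify(chain)              # tampering is detected
```

Because each block commits to its parent's hash, rewriting any block invalidates every block after it, which is why published data is effectively immutable.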
2.2 Evolution of Blockchain
Various technology components used in blockchain technology had been developed long
before its global introduction as the Bitcoin cryptocurrency in 2008. The initial aim
of blockchain was to combat the double-spending problem and thereby improve trust in
cryptocurrency. To describe the evolution of blockchain, researchers categorise it into
three generations.
2.2.1 Blockchain 1.0
Blockchain 1.0 marks the beginning of the blockchain era with Bitcoin, developed by a
pseudonymous software developer named Satoshi Nakamoto. Bitcoin is a public blockchain
supporting financial transactions on a decentralised system, bypassing central
authorities such as banks or payment gateways; 'public blockchain' signifies that
anyone can join or leave the network. Bitcoin uses the Proof of Work consensus
algorithm to select a miner who decides the contents of the next block, a block being a
collection of valid transactions. To be chosen as miner, a participant needs to solve a
particular mathematical problem of the following type: given a value 'X', find a number
'n' such that the hash of 'n' appended to 'X' results in a number 'Y' with 'k' leading
zeros. The miner's incentive for publishing a block is 12.5 bitcoins (at present), and
the reward is halved every four years. The current market capitalisation and other
real-time details of Bitcoin can be found at https://www.blockchain.com/explorer.
Bitcoin has limitations with respect to the nature of transactions: it can only perform
money transfers over the network. It cannot perform computationally intensive jobs
because no looping construct is supported, which laid the foundation for Blockchain
2.0.
2.2.2 Blockchain 2.0
Blockchain 2.0 [12] laid the foundation for programmable transactions: pieces of code
that execute automatically when certain conditions are met. Such an automated piece of
code is called a "smart contract". The scripting language used in Ethereum is
Turing-complete [13], which signifies that it is capable of simulating any Turing
machine, and Decentralized Applications (Dapps) can be launched on top of Ethereum.
This makes Ethereum much more than a cryptocurrency, justifying its designation as
Blockchain 2.0. Another significant improvement over Bitcoin is that the block time is
reduced by roughly 40 times, from 10 minutes to 12-15 seconds [14]. The block reward is
constant in the Ethereum network. The consensus algorithm is a refined form of Proof of
Work designed to negate the advantage of dedicated mining hardware.
2.2.3 Blockchain 3.0
All transaction data is available on the blockchain, which can create privacy issues
for some institutions, since much research has shown that public blockchain addresses
are not fully anonymous. To make blockchains more secure and scalable, designers
introduced block-less and miner-less architectures, leading to the next generation of
blockchain, called Blockchain 3.0. This increases the usability of blockchain in
sectors such as health care, supply chains, data storage, etc. Some blockchain
platforms in this category are IOTA, Stellar, and Hyperledger Fabric.
2.3 What is Ethereum?
Ethereum [15] was proposed by Vitalik Buterin in 2013. It is a blockchain-based
technology that helps developers build and deploy smart contracts and decentralized
applications. Ethereum nodes run a computing infrastructure called the Ethereum Virtual
Machine (EVM), on which any blockchain application can be created efficiently; its
Turing completeness signifies that it is capable of simulating any Turing machine. The
consensus algorithm currently used by the Ethereum blockchain is Proof of Work, but
Ethereum has announced a switch to Proof of Stake by mid-2019, the reason being that
Proof of Work is computationally intensive and consumes a great deal of electricity.
The cryptocurrency of Ethereum is Ether, which fuels the Ethereum network.
2.3.1 Consensus Protocols
Wikipedia says, “Consensus decision-making is a group decision-making process in which
group members develop, and agree to support a decision in the best interest of the whole.
Consensus may be defined professionally as an acceptable resolution, one that can be
supported, even if not the “favourite” of each individual. Consensus is defined by Merriam-
Webster as, first, general agreement, and second, group solidarity of belief or sentiment.”
These protocols are used in decentralized systems where there is no central authority
to make decisions. They are designed so that no single participant can bias the
outcome: the vote of each participant carries equal weight.
The consensus protocols for Ethereum are:
Figure 2.1: Energy Consumption by Bitcoin compared to other Countries [1]
2.3.1.1 Proof of Work (POW)
The Proof of Work protocol was designed by Satoshi Nakamoto for Bitcoin and is also
used in many other cryptocurrencies. A participant needs to solve a computationally
hard mathematical problem to become the miner of the current block: miners must
brute-force the entire search space to find a solution, yet the result can be verified
easily. The puzzle everyone tries to solve is of the form: find a nonce for which
SHA256(nonce + block hash) begins with a given number of zero bits. The difficulty
level is set by the number of leading zero bits required, and the computation required
to solve the puzzle grows accordingly. The biggest drawback of Bitcoin is its energy
consumption [1]: it uses more electricity than some countries do. Figure 2.1 shows the
comparison with different countries.
Image source: https://digiconomist.net/bitcoin-energy-consumption.
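The puzzle described above can be sketched in a few lines. This toy version searches for leading zero hex digits rather than zero bits and omits the double hashing and target arithmetic of real Bitcoin, but the brute-force search and cheap verification are the same in shape.

```python
import hashlib

def proof_of_work(block_hash, k):
    """Find a nonce such that SHA256(nonce + block_hash) starts with k zero hex digits.

    Finding the nonce requires brute force; verifying it takes a single hash.
    """
    nonce, target = 0, "0" * k
    while True:
        digest = hashlib.sha256((str(nonce) + block_hash).encode()).hexdigest()
        if digest.startswith(target):
            return nonce, digest
        nonce += 1

nonce, digest = proof_of_work("deadbeef", k=4)   # small k keeps the demo fast
assert digest.startswith("0000")                 # anyone can verify in one hash
```

Raising `k` by one multiplies the expected search effort by sixteen here (by two per zero bit in Bitcoin), which is how the network tunes difficulty.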
2.3.1.2 Proof of Stake (POS)
Proof of Stake selects the miner of the next block based on attributes of the
participants, such as how many coins they have at stake, the time since they last mined
a block, or random selection. It does not require solving any mathematical problem, so
its electricity consumption is far lower than Bitcoin's. Peercoin and Blackcoin were
the first to use this consensus protocol.
In POS, the miners are validators. Initially, they put some coins at stake; during
validation of blocks, they place a bet and are rewarded if their block is selected as
the next block of the chain, in proportion to the stake each participant made.
The main problem with this protocol is "nothing at stake": if a validator splits his
bets across all candidate blocks, then one of those candidate blocks is certain to
become the next block, so the validator wins a reward in every case. To overcome this
shortcoming, the Casper protocol [16] was proposed, which penalizes nodes that act
maliciously in this way by slashing their stakes.
2.3.2 Ethereum Accounts
Ethereum has two types of account:
• Externally Owned Account (EOA)
End users create EOAs to be part of the Ethereum network. Each participant holds the
private key of their account, which is used to sign transactions.
• Contract Account
These are self-executing code which can be invoked by EOAs or by another contract as an
internal transaction.
Figure 2.2: Account Interaction
2.3.3 How is Ethereum data stored?
Like other cryptocurrencies, Ethereum also started its story from a genesis block. From
that point on, transactions, contract creations, contract invocations, and the mining
of subsequent blocks have constantly changed the state of the Ethereum blockchain.
Ethereum uses "trie" data structures for storing data [17]. The different tries used
are:
• State Trie
• Storage Trie
• Transaction Trie
• Receipts Trie
Only the roots of the different tries are stored in the block header, as represented in
Figure 2.3 below.
Figure 2.3: Data stored on block header [2]
Image source: https://medium.com/cybermiles/diving-into-ethereums-world
2.3.3.1 Merkle Patricia Tree
The tree structure used in Ethereum is the Merkle Patricia tree [18], which combines a
Merkle tree with a Patricia (radix) trie. Data is stored at the leaf nodes; each
intermediate node stores a hash derived from the hashes of its children, and the hash
of the root is stored in the block header. This structure is used for the verification
of data, since the root hash no longer matches if any data is modified. An example of a
Merkle Patricia tree is given in Figure 2.4.
Figure 2.4: Example of Merkle Patricia Tree
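The verification idea can be illustrated with a plain binary Merkle tree. Ethereum's tries are hexary Patricia structures rather than binary, so this is a simplification, but the property relied upon is identical: any change to a leaf changes the root committed in the block header.

```python
import hashlib

def h(x):
    return hashlib.sha256(x).digest()

def merkle_root(leaves):
    """Fold leaf hashes pairwise up to a single root hash (binary Merkle tree)."""
    level = [h(leaf.encode()) for leaf in leaves]
    while len(level) > 1:
        if len(level) % 2:                  # duplicate the last node on odd levels
            level.append(level[-1])
        level = [h(a + b) for a, b in zip(level[::2], level[1::2])]
    return level[0].hex()

txs = ["A->B:5", "B->C:2", "C->D:1"]
root = merkle_root(txs)
assert merkle_root(txs) == root                              # deterministic
assert merkle_root(["A->B:6", "B->C:2", "C->D:1"]) != root   # one leaf change moves the root
```

This is why storing only the root in the block header suffices: a light client can verify a single transaction against the root without downloading the whole tree.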
2.3.4 Gas and Payment
Gas is the metric by which the cost of computation on the EVM is decided. Because the
value of ether fluctuates in real time, there must be a stable metric for measuring
computation cost; gas can be thought of as the number of CPU cycles used for execution
on the EVM. To keep the real cost of gas steady, the gas price and the price of ether
have to move inversely. Other metrics associated with gas are the gas price, gas cost,
gas limit, and gas fee. Figure 2.5 explains how payment for each transaction is made.
Figure 2.5: Payment Structure
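The payment in Figure 2.5 reduces to a single product, fee = gas used × gas price. The sketch below uses the fixed 21,000-gas cost of a plain ether transfer; the gas price chosen is purely illustrative.

```python
WEI_PER_GWEI = 10**9
WEI_PER_ETHER = 10**18

def tx_fee_wei(gas_used, gas_price_gwei):
    """Fee charged for a transaction: gas consumed times the price per unit of gas."""
    return gas_used * gas_price_gwei * WEI_PER_GWEI

# A plain ether transfer costs 21,000 gas; a 20 gwei gas price is illustrative.
fee = tx_fee_wei(gas_used=21_000, gas_price_gwei=20)
assert fee == 420_000 * WEI_PER_GWEI      # 420,000 gwei, i.e. 0.00042 ether
```

Senders also set a gas limit; if execution runs out of gas before that limit, the transaction reverts but the gas already consumed is still paid for.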
2.3.5 Ether(ETH)
Ether is the digital currency which fuels the network. Powering the network means that
it distributes the 'token'; these tokens are further used to execute smart contracts
and decentralized applications. More precisely, it is the unit which
is used as payment for any computation done on the Ethereum Virtual Machine (EVM) [19].
Each transaction has a computation job and a transaction fee associated with it; the
cost is computed from how much gas is used.
A total of 60,102,216 ether was distributed to crowdsale contributors in the
crowdfunding campaign at the start of the Ethereum blockchain network in 2014 [20].
Twelve million ether was given to the Ethereum Foundation and early contributors, the
research group behind Ethereum. According to the terms agreed by all parties in the
2014 presale, 18 million ether will be issued every year, i.e. roughly 25% of the 2014
crowdfunding amount. If the ether creation rate is high, the difficulty of mining a
block increases to maintain the rate of ether creation per year, and vice versa; the
difficulty level is decided by the consensus protocol. The miner is rewarded with five
ether for every block, and a new block is published/mined every 12-15 seconds. The
denominations of ether can be found in Table 2.1 [21].
S no. Unit Wei Value Wei
1 Kwei(babbage) 1e3 wei 1,000
2 Mwei (lovelace) 1e6 wei 1,000,000
3 Gwei (shannon) 1e9 wei 1,000,000,000
4 microether (szabo) 1e12 wei 1,000,000,000,000
5 milliether (finney) 1e15 wei 1,000,000,000,000,000
6 ether 1e18 wei 1,000,000,000,000,000,000
Table 2.1: Ether denominations and their value
2.3.6 Token
A token is the native currency of a decentralized application. The tokens for a Dapp
are distributed in a crowd-sale called an 'ICO', in exchange for ether. Tokens make the
process of interacting with smart contracts and Dapps easier. They are broadly
classified as:
• Usage Tokens
• Work Tokens
2.3.7 Smart Contracts
A smart contract is a piece of code containing a set of rules which the interacting
parties have to follow. These contracts are executed on top of the blockchain, i.e., on
the EVM, and a contract executes when its required conditions are satisfied. The smartness of
the contract depends on the developer of the contract. The main aim of a contract is to
build trust between the parties without relying on an intermediary. Some of the
properties which make contracts more reliable are:
• Self verification
• Tamper proof
• Autonomous execution
2.3.7.1 Benefits of Smart Contract
• Trust
• Safety
• Speed
• Saving
• Accuracy
• Availability
• Autonomy
• Reduce reliance on trusted third party
2.3.7.2 How do Smart Contracts Work?
Let us consider an example to understand the working of a smart contract [22]. Suppose
there are two users, A and B, where A wants to buy a car and B wants to sell his car.
The contract between A and B is: "If A pays 5000 ether to B, then A will receive
ownership of B's car". Once deployed, this contract cannot be changed, i.e., it is
immutable, which establishes trust in the system. There is no need for a middleman
between A and B, such as a bank or broker, as the contract executes automatically when
the conditions are met: when A deposits the money into the contract, the contract
verifies it and transfers the ownership from B to A. This scenario is depicted in
Figure 2.6.
Figure 2.6: Working of Smart Contract
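The logic of Figure 2.6 can be mimicked as a small state machine. This is a Python analogy, not actual Solidity, and a real contract would also hold the deposited ether in escrow and release it to the seller; here only the ownership condition is modelled.

```python
class CarSaleContract:
    """Toy escrow mirroring the A/B example: ownership moves only after payment."""

    def __init__(self, seller, buyer, price_ether):
        self.seller, self.buyer, self.price = seller, buyer, price_ether
        self.owner = seller            # B owns the car until the condition is met
        self.deposited = 0

    def deposit(self, sender, amount):
        if sender != self.buyer:
            raise ValueError("only the buyer may deposit")
        self.deposited += amount
        if self.deposited >= self.price:   # condition met: execute automatically
            self.owner = self.buyer

sale = CarSaleContract(seller="B", buyer="A", price_ether=5000)
sale.deposit("A", 5000)
assert sale.owner == "A"    # ownership transferred without any middleman
```

Once such code is deployed on the EVM it cannot be edited, so both parties can read exactly the rule that will be enforced.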
2.3.8 Decentralized Applications(Dapps)
A Dapp is open-source software which runs on a peer-to-peer network [23]. A
blockchain-based Dapp must possess some additional features: its data must be
cryptographically secure, and the application must have a digital asset which fuels the
network. In an initial coin offering, the Dapp's tokens are put on sale in exchange for
digital or fiat currencies. The roadmap for launching a Dapp is given in Figure 2.7.
Figure 2.7: Dapps Roadmap [3]
2.4 Summary
In this chapter, we introduced blockchain technology and its properties, then described
how the technology has evolved over the years and briefly summarized each generation.
We also discussed some important terminology essential for understanding the Ethereum
network, which is the focal point of this thesis. Maintaining the faith of participants
in the network is always the first priority, as participants play an important role in
its functioning. In this thesis, we try to detect and eliminate the malicious
participants in the network.
2.4.1 What are the anomalies in Ethereum Network?
Anomalous addresses are those which try to perform tasks for which they are not
authorized, or which try to execute fraudulent transactions. Some examples are
mentioned below:
• Issuing fake tokens
• Fake admins in ICOs (Initial Coin Offerings)
• Scambot phishers
• Slackbot scams
• Fake Etherscan sites
• Fake sites asking for private keys
• Fake crowdsale sites, etc.
In this work, we attempt to mark and eliminate malicious addresses so that the innocent
participants within the network are not affected.
Chapter 3
Related Work
In this chapter, we survey existing work related to anomaly detection in blockchains,
specifically the Bitcoin and Ethereum blockchains.
BitIodine: Extracting Intelligence from the Bitcoin Network [4] is a framework to
de-anonymize users. The authors were able to label addresses automatically or
semi-automatically using information fetched by web scraping: the scrapers search the
web to associate addresses with real users. The labels they used for addresses include
gambling, exchanges, wallets, donations, scammer, disposable, miner, malware, FBI,
killer, Silk Road, shareholder, etc. BitIodine first parses the transaction data from
the Bitcoin blockchain, then clusters addresses based on user interaction and labels
the clusters and users; the objective is to label every address in the network with one
of the above-mentioned categories. By manually investigating and tracing transactions,
they detected some anomalous addresses in the network, and they verified their system's
performance on several known thefts and frauds in Bitcoin: BitIodine was able to detect
addresses belonging to the Silk Road cold wallet and to the CryptoLocker ransomware.
The modular structure they developed can also be used for other blockchains. The system
architecture of BitIodine is described in Figure 3.1.
Figure 3.1: BitIodine Architecture [4]
Graph-based forensic investigation of Bitcoin transactions [24] performs analysis on Bitcoin
transaction data and also evaluates the network data. The dataset used includes 34,839,029
Bitcoin transactions and 35,770,360 distinct addresses. Their objective is to detect money
theft, fraudulent transactions and illegal payments made to black markets. They designed a
framework which retrieves all the transaction details of a given address. They do not detect
anomalous addresses in the network, but they provide detailed information about a given
address. They used clustering to group users together, and multiple graph-based techniques
to analyze the money flow within the network: the 'Breadth First Search (BFS)' algorithm,
edge-convergent patterns, and the existence of cycles in the network, to detect any sort
of money laundering.
Thai T. Pham et al. [25] [26] proposed anomaly detection in the Bitcoin network using
unsupervised learning. The aim is to detect suspicious transactions that took place within
the network and to mark the users based on these transactions. The unsupervised methods
they used are K-means clustering, Mahalanobis distance and the Support Vector Machine (SVM).
They verified their model against 30 known cases in the Bitcoin network, of which they were
able to mark two known cases of theft and one case of loss.
Xiapu Luo et al. [5] proposed understanding Ethereum via graph analysis. They claim
to be the first to perform a graph-based analysis of the Ethereum blockchain. They
constructed three different graphs to analyze money transfer, smart contract creation, and
smart contract invocation. Their dataset contains 28,502,131 external transactions and
19,759,821 internal transactions. After analyzing the above-mentioned graphs, they arrive
at five preliminary insights:
• Participants use Ethereum more than smart contracts for money transfer
• Smart contracts are not used extensively
• Ethereum is not frequently used by all
• Very few people create smart contracts
• Exchange markets dominate the Ethereum network
These insights are fairly obvious: the number of transactions made by a regular user
cannot be compared with the number of transactions made by exchanges, which is surely
much higher. Hence, the exchange market will dominate the Ethereum network. We cannot
expect every user to know Solidity or Go well enough to deploy their own contracts;
hence, very few of them deploy and use contracts. All participants have different
requirements for interacting with the Ethereum network, so we cannot expect the same
behaviour from all. Their complete approach is depicted in Figure 3.2.
Figure 3.2: System Architecture [5]
Although some of the above approaches try to find anomalies in the Bitcoin network,
none of them has a sophisticated method for anomaly detection. In BitIodine [4] they
attempted detection by manually searching paths in the network; they do not have an
automated program to detect malicious addresses. In [25] a machine learning technique
was tried for anomaly detection, but the accuracy is not very good, i.e., 10%, and only
two machine learning models were tried, namely K-means clustering and Support Vector
Machine (SVM). Therefore, there is a need for a system which can detect the anomalous
addresses in any blockchain network with high accuracy.
Chapter 4
Dataset
4.1 Dataset Collection
We have used the Infura API [27] to fetch the Ethereum blockchain data. Infura provides
secure and reliable access to Ethereum APIs and IPFS gateways. The Infura APIs we used
for data collection are:
• eth getBlockByNumber
It returns the complete block data in JSON format for a given block number.
• eth getTransactionByHash
It returns the complete transaction data in JSON format for the given transaction
hash.
• eth getTransactionReceipt
It returns the status of post-Byzantium transactions, i.e., '1' for success and '0'
for failure.
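For illustration, a minimal sketch of building and sending one of these JSON-RPC calls with only the standard library (the project ID here is a placeholder, and the helper names are our own, not part of the Infura API):

```python
import json
import urllib.request

# Hypothetical endpoint; a real Infura project ID is required.
INFURA_URL = "https://mainnet.infura.io/v3/YOUR_PROJECT_ID"

def rpc_payload(method, params):
    """Build a JSON-RPC 2.0 request body for an Ethereum API call."""
    return {"jsonrpc": "2.0", "method": method, "params": params, "id": 1}

def get_block_by_number(block_number):
    """Fetch a full block (with transaction objects) for a block number.
    Block numbers are passed to the API as hex strings."""
    body = json.dumps(rpc_payload("eth_getBlockByNumber",
                                  [hex(block_number), True])).encode()
    req = urllib.request.Request(INFURA_URL, data=body,
                                 headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["result"]

# The request body for the last block of our dataset:
print(rpc_payload("eth_getBlockByNumber", [hex(5139999), True]))
```

The same payload shape, with a different `method`, serves `eth_getTransactionByHash` and `eth_getTransactionReceipt`.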
The files are stored in JSON format. The file size of a block is 1-10 KB, while that of
a transaction is 0.5-2 KB. As the number of files is enormous, we used 28 cloud instances,
each running a multiprocessing script to fetch the data, with a download rate of about
150K files per day per instance. The data fetched is only on-chain data, i.e., it does
not include internal transactions; internal transactions are not published on the
blockchain but are only executed on the EVM. The dataset we consider for analysis is from
block 0 to block 5,139,999, which includes a total of 169,192,702 transactions.
4.2 Block Structure
The blocks are linked together cryptographically in a chronological manner to form a
blockchain [6]. These blocks are packed with a set of successful transactions. A block
consists of:
• Block Header
• Transaction Hashes
• Uncle’s Hash
Figure 4.1 gives an idea of how data is stored in a block. The 'result' section shown in
Figure 4.1 contains all the data as key-value pairs. The set of transactions included in
this block goes under the 'transactions' section.
Figure 4.1: Block Attributes [6]
4.3 Transaction Structure
To interact with the Ethereum blockchain one has to perform a transaction [6]. The
different types of transactions are ether transfer, token distribution, contract creation
and contract invocation. Figure 4.2 shows the data that goes into a transaction record.
The r, s, v values constitute the signature used to verify the sender.
Figure 4.2: Transaction Attributes [6]
4.4 Dataset Statistics
S no.  Dataset                  Count
1      Blocks                   6.8 million
2      External Transactions    169,192,702
3      Unique Addresses         24,693,053
4      Zero Ether Transactions  50,468,270
5      Smart Contracts          ~1.8 million
6      Unique Smart Contracts   ~90K
7      Malicious Addresses      125

Table 4.1: Dataset used for analysis
The External Transactions are those stored on the Ethereum main chain. The transactions
stored on the main chain are of two types:
• Transaction between two Externally owned Accounts
• Contract invocation by Externally owned Account
The Unique Addresses gives the count of unique addresses across the total of 169,192,702
external transactions.
The Zero Ether Transactions gives the count of transactions which transfer zero ether
among the 169,192,702 external transactions.
The Smart Contracts gives the total number of smart contracts deployed from block 0 to
block 6.8 million.
The Unique Smart Contracts gives the count of unique smart contracts deployed, determined
by comparing the MD5 hashes of the contract code.
The Malicious Addresses gives the count of unique malicious addresses, collected by web
scraping, that lie within blocks 0 to 5,139,999 [28] [29].
Chapter 5
Methods and Approaches
5.1 Data Preprocessing
Data preprocessing is a crucial step when using machine learning algorithms. It is a method
to remove or reduce noise in the dataset, and it plays a vital role in the accuracy of the
model. Some of the key steps involved are: handling missing values, handling categorical
values, scaling and normalization of feature values, and efficient, correct splitting of
the raw data into training and test sets.
The data retrieved from the previously mentioned Infura APIs is in JSON format.
5.1.1 Raw Feature Extraction
The raw features are fetched from the transaction data files. They include:
• blockNumber: The number (in hexadecimal) of the block to which this transaction belongs.
• from: The sender address of the transaction.
• to: The receiver address of the transaction. It is NULL in a contract creation
transaction.
• value: The amount transferred in the transaction, stored in Wei as a hexadecimal value.
• timestamp: The time at which the block was published/mined.
5.1.2 Feature Engineering
Feature engineering is a significant and crucial step in machine learning. It is the
process of deriving values from the raw data for the learning algorithms which are to be
trained and tested. From the raw features mentioned above, we extracted 14 numeric
features [26] for each address to solve the proposed problem. The features are:
• Outdegree: The total number of outgoing transactions from a given address.
• Indegree: The total number of incoming transactions to a given address.
• Balance Out: The total outgoing ether value from a given address.
• Balance In: The total incoming ether value to a given address.
• Absolute Balance: (Balance In) - (Balance Out)
• Unique Outdegree: The total number of outgoing transactions to unique addresses
from a given address.
• Unique Indegree: The total number of incoming transactions from unique ad-
dresses to a given address.
• Start Date: The timestamp of the block in which the given address has made its
first ever transaction.
• End date: The timestamp of the block in which the given address has made its last
transaction so far.
• Active duration: (End Date) - (Start Date)
• Last Transaction Bit: 0/1 (0 if last transaction made is incoming else 1)
• Last Transaction Value: The ether value transferred in the last transaction made
by the address.
• In Transaction Average: Average ether value per incoming transaction.
• Out Transaction Average: Average ether value per outgoing transaction.
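As a sketch, several of these per-address features can be aggregated in one pass over the transaction records. The record layout below (plain `from`/`to`/`value`/`timestamp` fields, already decoded from the raw hex JSON) is an assumption for the example, not the thesis code:

```python
from collections import defaultdict

def address_features(transactions):
    """Aggregate per-address features from simplified transaction records."""
    raw = defaultdict(lambda: {"out": 0, "in": 0, "bal_out": 0.0,
                               "bal_in": 0.0, "out_peers": set(),
                               "in_peers": set(), "times": []})
    for tx in transactions:
        s, r = tx["from"], tx["to"]
        # Sender side: outgoing transaction.
        raw[s]["out"] += 1
        raw[s]["bal_out"] += tx["value"]
        raw[s]["out_peers"].add(r)
        raw[s]["times"].append(tx["timestamp"])
        # Receiver side: incoming transaction.
        raw[r]["in"] += 1
        raw[r]["bal_in"] += tx["value"]
        raw[r]["in_peers"].add(s)
        raw[r]["times"].append(tx["timestamp"])
    feats = {}
    for addr, f in raw.items():
        feats[addr] = {
            "outdegree": f["out"],
            "indegree": f["in"],
            "balance_out": f["bal_out"],
            "balance_in": f["bal_in"],
            "absolute_balance": f["bal_in"] - f["bal_out"],
            "unique_outdegree": len(f["out_peers"]),
            "unique_indegree": len(f["in_peers"]),
            "active_duration": max(f["times"]) - min(f["times"]),
        }
    return feats
```

The remaining features (start/end date, last-transaction fields, per-transaction averages) follow the same aggregation pattern.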
5.1.3 Feature Importance
While working with machine learning models, it is always a tough decision to choose the
features. Having extracted 14 features from the raw data, we determine the importance of
each feature by feeding the data to a decision tree model. The ExtraTreeClassifier [30]
is a completely randomized tree classifier: it looks for the best split by performing
random splits for each of the selected features. The importances of the above-mentioned
features are:
Chapter 5. Methods and Approaches 24
Figure 5.1: Feature Importance
Feature no.  Feature Name             Value
feature 6    Unique Indegree          0.215800
feature 9    Active duration          0.097585
feature 1    Indegree                 0.096118
feature 8    End Date                 0.095395
feature 7    Start Date               0.093047
feature 12   In Transaction Average   0.084508
feature 11   Last Transaction Value   0.069127
feature 5    Unique Outdegree         0.051988
feature 13   Out Transaction Average  0.044768
feature 0    Outdegree                0.042575
feature 2    Balance Out              0.039066
feature 4    Absolute Balance         0.033495
feature 3    Balance In               0.027815
feature 10   Last Transaction Bit     0.008714

Table 5.1: Feature ranking
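A hedged sketch of how such a ranking can be produced with a randomized tree ensemble; synthetic data stands in here for the 14 address features:

```python
import numpy as np
from sklearn.ensemble import ExtraTreesClassifier

# Synthetic stand-in: 200 samples, 3 features, only feature 0 informative.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = (X[:, 0] > 0).astype(int)

clf = ExtraTreesClassifier(n_estimators=100, random_state=0).fit(X, y)
# Rank features from most to least important, as in Table 5.1.
ranking = np.argsort(clf.feature_importances_)[::-1]
print(ranking)  # the informative feature should rank first
```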
5.2 Machine Learning Classifiers
5.2.1 Decision Tree Classifier
A decision tree [31] is a supervised learning method mostly used in classification
problems. The algorithm can be visualized as a tree structure where every node splits the
data into two parts based on the most distinctive feature value at that node. The
algorithm can be used for categorical as well as continuous data. At the beginning, the
complete training dataset is fed to the root node, and it is then partitioned at finer
granularity by the features. The algorithm terminates when a stopping criterion of the
classifier is met or all leaf nodes are pure. Some terms related to decision trees are:
• Gini Impurity: It measures how likely a data point is to be misclassified.

  G = \sum_{i=1}^{J} P(i) (1 - P(i))    (5.1)

  where P(i) is the probability of class i and J is the number of classes.
• Entropy: Entropy is the measure of uncertainty in the given data.

  H = -\sum_{x} p(x) \log p(x)    (5.2)

  where p(x) is the probability of x.
• Information Gain: Information gain (IG) measures the "information" about the class that
  we get from a feature. It helps to partition the data at every node.

  IG = H(x) - \sum_{y} \frac{n_y}{n_x} H(y)    (5.3)

  where x is the parent node, the y are its child nodes, and n_y / n_x is the fraction
  of samples reaching child y.
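The three measures above can be computed directly from class labels; a small self-contained sketch:

```python
import math
from collections import Counter

def gini(labels):
    """Gini impurity: G = sum_i P(i) * (1 - P(i))  (Eq. 5.1)."""
    n = len(labels)
    return sum((c / n) * (1 - c / n) for c in Counter(labels).values())

def entropy(labels):
    """Entropy: H = -sum_x p(x) log2 p(x)  (Eq. 5.2)."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(parent, children):
    """IG = H(parent) - weighted average of children's entropies (Eq. 5.3)."""
    n = len(parent)
    return entropy(parent) - sum(len(ch) / n * entropy(ch) for ch in children)

# A pure split of a balanced two-class parent yields the maximal gain:
parent = [0, 0, 1, 1]
print(information_gain(parent, [[0, 0], [1, 1]]))  # 1.0
```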
5.2.2 Support Vector Machine (SVM)
A Support Vector Machine [32] is a supervised learning algorithm which categorizes data
points by constructing an optimal hyperplane. The SVM fits the best possible hyperplane
such that the separation between the different classes is maximized and the error is
minimized. An example SVM classifier is depicted in Figure 5.2.
Figure 5.2: SVM Classifier
Some of the terminology associated with SVMs:
• Kernel
A kernel is a method used for pattern analysis. It transforms the data into another
(often higher-dimensional) space in which a better separation between the classes can
be obtained.
• Regularization
The regularization parameter tells the classifier how much misclassification can be
tolerated.
• Margin
The margin is the distance between the separating hyperplane and the nearest data points
of either class. A good margin means that the two classes are roughly equidistant from
the hyperplane.
5.2.3 K-nearest Neighbour Classifier(KNN)
KNN [33] assigns a data point to a class by considering the majority class among its
neighbours. K (the number of neighbours to consider) needs to be specified. The K nearest
neighbours of a data point can be determined using different distance measures. Some of
them are:
• Euclidean Distance

  \sqrt{\sum_{i=1}^{k} (x_i - y_i)^2}    (5.4)

• Manhattan Distance

  \sum_{i=1}^{k} |x_i - y_i|    (5.5)

• Minkowski Distance

  \left( \sum_{i=1}^{k} |x_i - y_i|^q \right)^{1/q}    (5.6)
The feature values should be standardized or scaled before computing distances, as
different features have different ranges of values:

  X_s = \frac{X - \min}{\max - \min}    (5.7)
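The distance measures and min-max scaling above can be sketched in a few lines:

```python
def minkowski(x, y, q):
    """Minkowski distance (Eq. 5.6); q=1 gives Manhattan, q=2 Euclidean."""
    return sum(abs(a - b) ** q for a, b in zip(x, y)) ** (1 / q)

def minmax_scale(column):
    """Min-max scaling (Eq. 5.7): X_s = (X - min) / (max - min)."""
    lo, hi = min(column), max(column)
    return [(v - lo) / (hi - lo) for v in column]

print(minkowski([0, 0], [3, 4], 2))   # Euclidean: 5.0
print(minkowski([0, 0], [3, 4], 1))   # Manhattan: 7.0
print(minmax_scale([10, 20, 30]))     # [0.0, 0.5, 1.0]
```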
5.2.4 Multi-layer Perceptron Classifier(MLP)
MLP classifier [34] is a supervised learning method which classifies by learning a
function

  f(\cdot) : R^m \rightarrow R^o    (5.8)

where m is the dimension of the input and o is the dimension of the output. MLP can
extract significant information from unbalanced or imprecise data and use it to discover
patterns by which data points can be classified. The basic type of neural network has
three units:
1. Input Unit: It feeds the raw information into the network.
2. Hidden Unit: It computes weighted nonlinear functions of the values from the input
layer (or the previous hidden layer).
3. Output Unit: It converts the values from the last hidden layer to generate output.
MLP uses different loss functions for different problems. The loss function used for the
classification problem is Cross-Entropy, which in the binary case is given as

  Loss(\hat{y}, y, W) = -y \ln \hat{y} - (1 - y) \ln(1 - \hat{y}) + \alpha ||W||_2^2    (5.9)

where the penalty term \alpha ||W||_2^2 is an L2 regularization.
Figure 5.3: Neural Network Structure [7]
5.2.5 Naive Bayes Classifier
It is a supervised learning method [35] which computes, for every data point, a
probability for each class; the data point is assigned to the class with the maximum
probability. Using Bayes' theorem, the conditional probability can be decomposed as

  p(C_k | x) = \frac{p(C_k) \, p(x | C_k)}{p(x)}    (5.10)

where x is the feature vector and C_k is the class under consideration.
5.3 Smart Contract Analysis
A smart contract is a piece of code which executes automatically when its conditions are
satisfied. There exist many ways to write code with the same behaviour. We tried to group
similar types of contracts based on user interaction. To do this, we considered three
parameters for each contract:
• Total number of invocations
• Total number of unique invocations
• MD5 hash of contract code
5.4 Summary
This chapter presented how the data is processed and which machine learning techniques
are used for anomaly detection in the Ethereum network. Data preparation is a crucial
step in any machine learning model. Section 5.1 describes how features are extracted
from the raw JSON files. After feature extraction, Decision Tree, k-nearest neighbours,
Random Forest, SVM, MLP and Naive Bayes classifiers were used for data modeling.
Hyper-parameters of the different classifiers were optimized for better results, and
5-fold cross validation was performed in order to avoid over-fitting.
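A sketch of 5-fold cross validation together with a small hyper-parameter search, on synthetic stand-in data (the parameter grid is illustrative, not the thesis configuration):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, cross_val_score

# Synthetic stand-in for the address-feature matrix.
rng = np.random.default_rng(1)
X = rng.normal(size=(120, 4))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

# 5-fold cross validation of a single model...
scores = cross_val_score(RandomForestClassifier(random_state=0), X, y, cv=5)

# ...and a small hyper-parameter search, also with 5-fold CV.
search = GridSearchCV(RandomForestClassifier(random_state=0),
                      {"n_estimators": [50, 100]}, cv=5).fit(X, y)
print(scores.mean(), search.best_params_)
```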
Chapter 6
Results and Analysis
6.1 Experimental Setup
The dataset we have is highly unbalanced: only 125 addresses are marked malicious [28]
out of 24 million. These 24 million addresses are externally owned accounts (EOAs), i.e.,
the count does not include smart contract addresses. We split the 125 malicious addresses
into 75 for training and 50 for testing as ground truth. Next, we need some non-malicious
addresses to test our machine learning models rigorously. We included the addresses of
Ethereum developers, Ethereum contributors, exchanges and some other known addresses in
the non-malicious set for testing; we extracted these addresses from the genesis block of
Ethereum and others by web crawling [28] [29]. A total of 250 non-malicious addresses
were extracted for testing. We used label '0' for non-malicious and '1' for malicious
addresses. In this work, we try to label the 24 million addresses as either malicious or
non-malicious using the 75 known malicious addresses.
To deal with this highly unbalanced dataset, we make an assumption: "we mark an address
as malicious if it has an outgoing transaction to an address already marked malicious".
Under this assumption, we have a total of 3,830 addresses marked malicious. Finally, we
have two settings in which to evaluate our models:
1. Testing on the 50 originally marked malicious addresses
2. Testing on 50 addresses randomly chosen from the 3,830 marked malicious under the
assumption
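The labeling assumption can be sketched as a one-level expansion over the transaction edges (the field names are illustrative):

```python
def expand_malicious(known_malicious, transactions):
    """Level-1 indegree expansion: any address with an outgoing transaction
    to a known-malicious address is also marked malicious."""
    bad = set(known_malicious)
    for tx in transactions:
        # Check against the original set only, so expansion stays one level.
        if tx["to"] in known_malicious:
            bad.add(tx["from"])
    return bad

txs = [
    {"from": "a", "to": "m1"},   # a pays a malicious address -> suspicious
    {"from": "b", "to": "c"},
    {"from": "m1", "to": "d"},   # receiving from m1 does NOT mark d
]
print(sorted(expand_malicious({"m1"}, txs)))  # ['a', 'm1']
```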
SMOTE [36] is a technique to generate synthetic points when the data points are
separable. However, analyzing the distribution of our data points shows a high degree of
overlap, which makes SMOTE unsuitable for our dataset. 5-fold cross validation is
performed in order to avoid over-fitting.
6.2 Evaluation Metrics
For the evaluation of the experiments, we treat non-malicious as the positive class and
malicious as the negative class. The metrics used are:
1. Confusion Matrix
It is used to describe the performance of the model using labeled test data.
                        Non-Malicious (Predicted)  Malicious (Predicted)
Non-Malicious (Actual)  True Positive              False Negative
Malicious (Actual)      False Positive             True Negative
Table 6.1: Confusion Matrix
It has four entries:
• True Positive (TP): The algorithm correctly classifies an address as non-malicious.
• False Positive (FP): The algorithm incorrectly classifies an address as non-malicious.
• True Negative (TN): The algorithm correctly classifies an address as malicious.
• False Negative (FN): The algorithm incorrectly classifies an address as malicious.
2. Accuracy
“It is the ratio of correct results to the total returned by the algorithm”

  Accuracy = \frac{TP + TN}{TP + FP + TN + FN}    (6.1)
3. Precision
“What fraction of positives identified by the algorithm is actually correct”

  Precision = \frac{TP}{TP + FP}    (6.2)
4. Recall
“What fraction of real positives were identified as positives by the algorithm”

  Recall = \frac{TP}{TP + FN}    (6.3)
5. F-score
It is the harmonic mean of precision and recall.

  F_{score} = 2 \times \frac{Precision \times Recall}{Precision + Recall}    (6.4)
The general formula for any β:

  F_\beta = (1 + \beta^2) \cdot \frac{precision \cdot recall}{\beta^2 \cdot precision + recall}    (6.5)

where β is a parameter that assigns different levels of importance to precision and
recall.
6. AUC - ROC Curve
It is a measure used to check the performance of classification algorithms; it tells us
how capable the model is of distinguishing between the classes. ROC is the probability
curve and AUC (area under the curve) is the measure of separability.
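Equations 6.1-6.4 can be computed directly from the four confusion-matrix counts; the counts below are illustrative, not the thesis results:

```python
def metrics(tp, fp, tn, fn):
    """Accuracy, precision, recall and F-score from confusion-matrix counts,
    with non-malicious treated as the positive class (Eqs. 6.1-6.4)."""
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f_score = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f_score

# Illustrative counts only:
acc, prec, rec, f1 = metrics(tp=90, fp=10, tn=80, fn=20)
print(acc, prec, rec, f1)
```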
6.3 Machine Learning Classifiers
6.3.1 Decision Tree Classifier
The first classifier we considered is the Decision Tree Classifier, trained on the ground
truth, i.e., 75 addresses from each class, to generate the baseline results. Table 6.2
contains the result when the model is evaluated on the test set, which contains 50
malicious and 250 non-malicious data points.
Evaluation Metric Value
Accuracy 0.9353
Precision 0.9641
Recall 0.8060
F-score 0.8589
Table 6.2: Test result on Ground Truth
Figure 6.1: Confusion Matrix for Ground Truth Result
Figure 6.1 shows the result in the setting considered above, in which 37 out of 50
malicious addresses are correctly classified. We can see that the model does not classify
any normal user as malicious, which is very important in this type of network, as we may
lose users if we wrongly mark them as malicious.
Figure 6.2: ROC curve for Ground Truth Result
The ROC curve in Figure 6.2 shows the true positive rate (sensitivity) as a function of
the false positive rate (1 − specificity) for different cut-off points. Each point on the
ROC curve represents a (sensitivity, specificity) pair corresponding to a particular
decision threshold. The accuracy of the model is reflected by the area under the curve:
the closer the curve is to the upper-left corner, the greater the accuracy. Here, 87% of
the area lies under the curve, which shows that our model performed well.
Next, we evaluated the model trained using the assumption discussed in Section 6.1. We
adopted this assumption because, when computing feature importance, unique indegree came
out as the most important feature: addresses that send ether to nodes known to be
malicious are highly likely to be malicious themselves. Further, we show that considering
indegree nodes is the right choice, as the results are worse when outdegree nodes are
considered.
Evaluation Metric Value
Accuracy 0.9966
Precision 0.9980
Recall 0.9900
F-score 0.9939
Table 6.3: Test result on Level 1 Indegree nodes
We can observe from Table 6.3 that our assumption, introduced to slightly balance the
dataset, does not hurt the accuracy of our model; we even obtain better results with it.
This signifies that the nodes directly connected to the malicious nodes and sending ether
to them are the most suspicious.
Figure 6.3: Confusion Matrix for Level 1 Indegree Result
The confusion matrix in Figure 6.3 shows that we were able to classify 49 out of 50
malicious nodes as malicious.
Figure 6.4 shows that 99% of the area lies under the ROC curve, which means this model
beats the result obtained by considering only the ground truth, where the area under the
curve (AUC) was 87%.
Figure 6.4: ROC curve for Level 1 Indegree Result
To support our assumption further, we evaluated our model while considering as malicious
the nodes to which the malicious addresses have sent ether. The results obtained under
this reversed assumption are given below.
Evaluation Metric Value
Accuracy 0.8366
Precision 0.9180
Recall 0.5100
F-score 0.4749
Table 6.4: Test result on Level 1 Outdegree nodes
We can see in Table 6.4 that the accuracy of the model is drastically reduced, which
supports our assumption of marking only the indegree nodes of malicious addresses as
malicious. The confusion matrix in Figure 6.5 shows that we were able to correctly
classify only 1 out of 50 known malicious addresses as malicious. The ROC curve in
Figure 6.6 is very close to the line y = x, with AUC = 0.51, far lower than in the two
settings above. This may be because a malicious node may transact with non-malicious
nodes in order to spend its ether without being caught.
Figure 6.5: Confusion Matrix for Level 1 Outdegree Result
Figure 6.6: ROC curve for Level 1 Outdegree Result
6.3.2 Random Forest Classifier
Next, we tried the Random Forest Classifier [37], an improved version of the decision
tree. It is an ensemble learning method for classification which creates multiple trees
rather than the single tree of the Decision Tree. We repeated the experiments done with
the Decision Tree, expecting to obtain accuracy greater than or equal to that of the
Decision Tree with the Random Forest. The results are:
Evaluation Metric Value
Accuracy 0.9893
Precision 0.9936
Recall 0.9680
F-score 0.9802
Table 6.5: Test result on Ground Truth
Figure 6.7: Confusion Matrix for Ground Truth Result
The results obtained when the indegree nodes of malicious nodes are considered malicious:
Evaluation Metric Value
Accuracy 0.9966
Precision 0.9980
Recall 0.9900
F-score 0.9939
Table 6.6: Test result on Level 1 Indegree nodes
Figure 6.8: Confusion Matrix for Level 1 Indegree Result
The results obtained when the outdegree nodes of malicious nodes are considered malicious:
Evaluation Metric Value
Accuracy 0.8366
Precision 0.9180
Recall 0.5100
F-score 0.4749
Table 6.7: Test result on Level 1 Outdegree nodes
Figure 6.9: Confusion Matrix for Level 1 Outdegree Result
We get the improved results expected for the Random Forest classifier with respect to the
Decision Tree method. For setting 1, in which only the ground-truth malicious nodes are
considered, we correctly classified 47 out of 50 malicious addresses, as depicted in
Figure 6.7, which is 10 more than with the Decision Tree. For the other settings we get
the same accuracy as the Decision Tree.
6.3.3 Other Classifier Results
We tried four other classifiers, whose results are given in Table 6.8 and Table 6.9:
Table 6.8 contains the results evaluated on setting 1 and Table 6.9 those on setting 2.
The MLP classifier is better suited to classification problems with time-series datasets.
As the data is not separable, the KNN accuracy is low compared to the MLP and Naive Bayes
classifiers. On setting 1 the MLP classifier performs better than all the other
classifiers that we tried.
SVMs, on the other hand, can efficiently perform non-linear classification using what is
called the kernel trick, implicitly mapping their inputs into high-dimensional feature
spaces. SVM is one of the most robust and accurate of the classification algorithms, but
it requires relatively more training data than the other classifiers. As we have
relatively more data in setting 2, SVM performed well there, achieving an accuracy of
99.66%. We performed 5-fold cross validation to avoid over-fitting of the model.
SVM MLP Classifier KNN(K=5) Naive Bayes Classifier
Accuracy 0.8366 0.9152 0.8566 0.8933
Precision 0.9180 0.9538 0.9266 0.9432
Recall 0.5100 0.7460 0.5700 0.6799
F-score 0.4749 0.8051 0.5832 0.7346
Table 6.8: Test result on Ground Truth
SVM MLP Classifier KNN(K=5) Naive Bayes Classifier
Accuracy 0.9966 0.9473 0.9166 0.8833
Precision 0.9980 0.9708 0.9545 0.9385
Recall 0.9900 0.8420 0.7500 0.6500
F-score 0.9939 0.8818 0.8095 0.6980
Table 6.9: Test result on Level 1 Indegree nodes
6.4 Graph Based Analysis
In Ethereum, the block publish time is about 12-15 seconds, but when plotting a histogram
of publish times we observed that a large number of blocks are published within 3-6
seconds. We infer from the plot in Figure 6.10 that this significant shift in publish
time occurs either because the difficulty level set for the block is not appropriate, or
because participants use dedicated devices such as Application-Specific Integrated
Circuits (ASICs) for mining. Ethereum claims that its consensus protocol 'Ethash' is
independent of such mining devices [38]; our observation suggests that Ethash is still
not able to provide protection against ASICs.
Figure 6.10: Block Publish time analysis
From the plot in Figure 6.11 we observe exponential growth in the number of transactions
per block. The sudden change starts from the block series around 320K, which is around
early 2017. One possible reason might be the migration of users from other
cryptocurrencies: the Bitcoin block reward was reduced from 25 BTC to 12.5 BTC in July
2016 [39].
Figure 6.11: Transaction count per 1 lakh block
The plot in Figure 6.12 shows the cumulative growth in the number of addresses in the
Ethereum network over time. The x-axis represents the date, from 30 July 2015 to 20 Feb
2018, and the y-axis gives the corresponding count of addresses. We observe the same
behaviour as for the number of transactions per block: the plot shows a sudden rise in
the number of addresses from the 320K block onwards.
Figure 6.12: Cumulative address growth over time
The plot in Figure 6.13 shows the variation of block size with the number of transactions
per day. Block size is directly proportional to the number of transactions occurring in a
day; we infer that the number of transactions packed into a block grows in proportion to
the transaction rate, i.e., it is not biased by the miners.
The plot in Figure 6.14 shows the variation of block size with chain data size. Chain
data size refers to the total size of the Ethereum blockchain; one can refer to [17] to
learn more about data storage in Ethereum. We can see that block size is directly
proportional to the chain data size.
Figure 6.13: Variation of block size with number of Transactions
Figure 6.14: Variation of block size with chain data size
6.5 Smart Contract Analysis
We have done a basic analysis based on user interaction. We randomly chose 1000 contracts
for evaluation and then generated all possible pairs of these contracts. For each pair we
computed:
• Total invocations of both addresses
• Unique invocations of both addresses
• The intersection of the unique invocation sets of the two addresses
• Whether they have the same MD5 hash of the code or not
The results for the total of 499,500 pairs are given in the table below:

Same MD5 hash   Intersection of addresses == 0   Intersection of addresses != 0
NO              442,174                          7,861
YES             42,312                           7,153

Table 6.10: Similarity Evaluation
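A sketch of how such pairwise statistics might be computed; the data layout, mapping each contract address to its code hash and caller set, is an assumption for the example:

```python
from itertools import combinations

def pair_stats(contracts):
    """For every contract pair, report caller-set overlap and whether the two
    code MD5 hashes match. `contracts` maps address -> (md5_hash, caller_set)."""
    rows = []
    for a, b in combinations(sorted(contracts), 2):
        md5_a, users_a = contracts[a]
        md5_b, users_b = contracts[b]
        rows.append({
            "pair": (a, b),
            "same_md5": md5_a == md5_b,
            "common_users": len(users_a & users_b),
        })
    return rows

contracts = {
    "0xA": ("h1", {"u1", "u2"}),
    "0xB": ("h1", {"u2", "u3"}),   # same code as 0xA, one shared caller
    "0xC": ("h2", {"u4"}),         # different code, no shared callers
}
for row in pair_stats(contracts):
    print(row)
```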
The analysis tries to group the contracts together based on user interaction. It revealed
a high degree of overlap between the users who invoked similar contracts (contracts with
the same MD5 hash). We took the intersection of the addresses that invoke similar as well
as dissimilar contracts. The insights we found are:
• There are 49,465 pairs out of the total 499,500 which have the same MD5 hash. 42,312 of
these 49,465 pairs do not share any common user, but 34,367 of them were invoked just
once. These might be test contracts deployed before the actual contract; such testing is
necessary, as contract code, once deployed, cannot be modified. 7,153 of the 49,465 pairs
have at least one user in common, so we can infer that these users are somehow related to
each other, as they all perform similar types of tasks using different contracts.
• Similarly, there are 450,035 pairs out of the 499,500 possible which do not have the
same MD5 hash. Of these, 442,174 pairs do not share a common user. This value is high, as
expected: if the contracts are different (different MD5 hashes of their code), then the
number of pairs without a common user will be high. There are only 7,861 pairs which have
a common user between them; from these we can infer that some contracts with different
MD5 hashes of their code nevertheless show similar behaviour.
Chapter 7
Conclusion and Future Work
7.1 Conclusion
Chapters 4 and 5 explained how the dataset was collected and pre-processed. After
processing the dataset, we applied different machine learning classifiers for anomaly
detection in the Ethereum network. We were able to detect 47 out of 50 known cases using
the Random Forest Classifier in the first setting, i.e., testing on the ground truth.
Next, we detected 49 out of 50 cases using our assumption of marking indegree nodes as
malicious; in this second setting the SVM, Decision Tree and Random Forest classifiers
produced results with the same accuracy. We also backed our assumption by evaluating the
results when outdegree nodes are considered instead: we were then able to detect only 1
out of 50.
We can conclude from this work that it is possible to detect patterns in Ethereum transaction data using machine learning techniques. Using these machine learning models, one can label all addresses as malicious or non-malicious and assign a suspicion level using the class probability of each data point. These models are applicable to any dataset that possesses an inherent graph structure. Our dataset is highly unbalanced, i.e., we have only 125 samples in one class and 24 million in the other, yet we were still able to produce good results. This signifies that our models work well with unbalanced datasets.
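The idea of turning class probabilities into suspicion levels on an unbalanced dataset can be sketched with scikit-learn. The data below is synthetic (the thesis's real per-address features such as in-degree and out-degree are not reproduced here), and `class_weight="balanced"` is one common way to compensate for class skew; the thesis's exact hyperparameters may differ.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
# Synthetic, highly unbalanced stand-in: 2000 "normal" addresses and
# 20 "malicious" ones, each described by 4 numeric features.
X_normal = rng.normal(0.0, 1.0, size=(2000, 4))
X_malicious = rng.normal(4.0, 1.0, size=(20, 4))
X = np.vstack([X_normal, X_malicious])
y = np.array([0] * 2000 + [1] * 20)

# class_weight="balanced" reweights samples inversely to class frequency,
# so the minority (malicious) class is not drowned out.
clf = RandomForestClassifier(n_estimators=100, class_weight="balanced",
                             random_state=0).fit(X, y)

# The predicted probability of the malicious class doubles as a
# per-address suspicion score; sorting it ranks addresses by suspicion.
suspicion = clf.predict_proba(X)[:, 1]
ranking = np.argsort(suspicion)[::-1]   # most suspicious first
```

Rather than a hard malicious/non-malicious label, the continuous `suspicion` score lets an analyst triage the most suspicious addresses first, which matters when one class outnumbers the other by orders of magnitude.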
7.2 Future Work
Anomaly detection in any network is an interesting as well as challenging problem. Researchers have built multiple techniques to solve it, but there are still several other directions worth exploring. The prominent ones are:
• We can also apply the parallel partitioning approach [40] for anomaly detection. In this setting, we look for structural behaviour [41], in which the regularity of the graph is used for anomaly detection.
• We have collected and analyzed a fixed range of data in this work. We could create a modular framework that performs this kind of analysis on real-time data.
• Here, we manually computed feature values for each node; we could instead obtain an embedding for each node of the graph [42] using different machine learning techniques. This might yield additional features for analysis.
• The smart contract analysis done in Chapter 6 is based on user interaction to reveal interesting behaviour of the addresses. There might be other ways of detecting the similarity between groups of addresses.
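To make the contrast between hand-crafted node features and learned embeddings concrete, here is a minimal sketch on a toy 5-node transaction graph. Spectral embedding is used purely as an illustrative embedding method (the thesis does not prescribe one, and [42] discusses large-scale alternatives); the adjacency matrix is invented for the example.

```python
import numpy as np
from sklearn.manifold import SpectralEmbedding

# Toy directed transaction graph over 5 addresses: A[i, j] = 1 means
# address i sent a transaction to address j (illustrative values).
A = np.array([[0, 1, 1, 0, 0],
              [0, 0, 1, 0, 0],
              [0, 0, 0, 1, 0],
              [0, 0, 0, 0, 1],
              [1, 0, 0, 0, 0]], dtype=float)

# Hand-crafted features of the kind computed manually in this work:
in_degree = A.sum(axis=0)    # transactions received per address
out_degree = A.sum(axis=1)   # transactions sent per address

# Learned alternative: a 2-dimensional spectral embedding per node.
# Spectral embedding expects a symmetric affinity matrix, so the
# directed graph is symmetrized first.
S = np.maximum(A, A.T)
emb = SpectralEmbedding(n_components=2, affinity="precomputed").fit_transform(S)
```

Each row of `emb` is a learned feature vector for one address and can be concatenated with the manual degree features before classification, which is the direction the bullet above suggests.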
Bibliography
[1] Bitcoin energy consumption. https://digiconomist.net/
bitcoin-energy-consumption. Accessed: May 13, 2019.
[2] Blockchain data. https://medium.com/cybermiles/diving-into-ethereums-world. Accessed: May 13, 2019.
[3] Dapp phases. https://coinsutra.com/dapps-decentralized-applications/.
Accessed: May 13, 2019.
[4] Michele Spagnuolo, Federico Maggi, and Stefano Zanero. BitIodine: Extracting intelligence from the Bitcoin network. Volume 8437, pages 457–468, 03 2014. ISBN 978-3-662-45471-8. doi: 10.1007/978-3-662-45472-5_29.
[5] T. Chen, Y. Zhu, Z. Li, J. Chen, X. Li, X. Luo, X. Lin, and X. Zhange. Understand-
ing ethereum via graph analysis. In IEEE INFOCOM 2018 - IEEE Conference on
Computer Communications, pages 1484–1492, April 2018. doi: 10.1109/INFOCOM.
2018.8486401.
[6] Etherscan. https://etherscan.io/. Accessed: May 13, 2019.
[7] Nn structure. https://www.google.com/search?q=neural+network&tbm=isch&
source=iu&ictx=1&fir=vj5KYf7zG80QYM%253A%252CynYusGDc2AddHM%252C%
252Fm%252F05dhw&vet=1&usg=AI4_-kS1zRBtf1J-UIETu-25mMYcnUUotQ&sa=X&ved=
2ahUKEwiJqq6F9KHiAhUDXn0KHaX1CGMQ9QEwAHoECBAQBg#imgrc=vj5KYf7zG80QYM:.
Accessed: May 13, 2019.
[8] Satoshi Nakamoto. "Bitcoin: A peer-to-peer electronic cash system," http://bitcoin.org/bitcoin.pdf, 2008.
[9] Global internet usage. https://en.wikipedia.org/wiki/Global_Internet_usage.
Accessed: May 13, 2019.
[10] What is blockchain? https://blockgeeks.com/guides/what-is-blockchain-technology/. Accessed: May 13, 2019.
[11] 51% attack. https://bitcoin.org/en/blockchain-guide#
block-height-and-forking. Accessed: May 13, 2019.
[12] Blockchain 2.0. https://medium.com/xpa-2-0/blockchain-2-0. Accessed: May 13, 2019.
[13] Turing complete. https://en.wikipedia.org/wiki/Turing_completeness. Ac-
cessed: May 13, 2019.
[14] Time to publish block. https://ethereum.stackexchange.com/questions/9617/
how-many-blocks-are-created-at-one-point-of-time. Accessed: May 13, 2019.
[15] Ethereum. https://www.ethereum.org/. Accessed: May 13, 2019.
[16] Casper protocol. https://blockgeeks.com/guides/ethereum-casper/. Accessed:
May 13, 2019.
[17] How ethereum data is stored ? https://hackernoon.com/
getting-deep-into-ethereum-how-data-is-stored-in-ethereum-e3f669d96033.
Accessed: May 13, 2019.
[18] Merkle patricia tree. https://medium.com/codechain/
modified-merkle-patricia-trie-how-ethereum-saves-a-state-e6d7555078dd.
Accessed: May 13, 2019.
[19] Ethereum virtual machine. https://medium.com/mycrypto/
the-ethereum-virtual-machine-how-does-it-work-9abac2b7c9e. Accessed:
May 13, 2019.
[20] Crowdsale stats. https://medium.com/tendermint/
examining-funding-token-allocation-of-blockchain-foundations-a2d0fb29b5ca.
Accessed: May 13, 2019.
[21] Ether value. http://ethdocs.org/en/latest/ether.html. Accessed: May 13, 2019.
[22] Smart contract. https://www.coindesk.com/information/
ethereum-smart-contracts-work. Accessed: May 13, 2019.
[23] Dapps. https://blockchainhub.net/decentralized-applications-dapps. Ac-
cessed: May 13, 2019.
[24] Chen Zhao. Graph-based forensic investigation of bitcoin transactions. 2014.
[25] Thai Pham and Steven Lee. Anomaly detection in bitcoin network using unsupervised
learning methods. CoRR, abs/1611.03941, 2016. URL http://arxiv.org/abs/1611.
03941.
[26] Thai Pham and Steven Lee. Anomaly detection in the bitcoin system - A network per-
spective. CoRR, abs/1611.03942, 2016. URL http://arxiv.org/abs/1611.03942.
[27] Infura. https://infura.io/docs/ethereum/json-rpc/. Accessed: May 13, 2019.
[28] Malicious address repository. https://github.com/MyEtherWallet/ethereum-lists/blob/master/src/addresses/addresses-darklist.json. Accessed: May 13, 2019.
[29] Malicious db. https://etherscamdb.info/. Accessed: May 13, 2019.
[30] Extra tree classifier. https://scikit-learn.org/stable/modules/generated/
sklearn.ensemble.ExtraTreesClassifier.html. Accessed: May 13, 2019.
[31] Decision tree. https://scikit-learn.org/stable/modules/generated/sklearn.
tree.DecisionTreeClassifier.html. Accessed: May 13, 2019.
[32] Svm classifier. https://scikit-learn.org/stable/modules/generated/sklearn.
svm.SVC.html. Accessed: May 13, 2019.
[33] Knn classifier. https://scikit-learn.org/stable/modules/generated/sklearn.
neighbors.KNeighborsClassifier.html. Accessed: May 13, 2019.
[34] Mlp classifier. https://scikit-learn.org/stable/modules/generated/sklearn.
neural_network.MLPClassifier.html. Accessed: May 13, 2019.
[35] Naive bayes classifier. https://scikit-learn.org/stable/modules/generated/
sklearn.naive_bayes.GaussianNB.html/. Accessed: May 13, 2019.
[36] Smote. https://jair.org/index.php/jair/article/view/10302. Accessed: May
13, 2019.
[37] Random forest classifier. https://scikit-learn.org/stable/modules/
generated/sklearn.ensemble.RandomForestClassifier.html. Accessed: May
13, 2019.
[38] Ethash ASIC. https://en.bitcoinwiki.org/wiki/Ethash. Accessed: May 13, 2019.
[39] Bitcoin halving. https://thenextweb.com/hardfork/2019/01/30/
the-bitcoin-halvening-is-happening-heres-what-you-need-to-know/. Ac-
cessed: May 13, 2019.
[40] W. Eberle and L. Holder. Incremental anomaly detection in graphs. In 2013 IEEE
13th International Conference on Data Mining Workshops, pages 521–528, Dec 2013.
doi: 10.1109/ICDMW.2013.93.
[41] Lawrence B. Holder, Diane J. Cook, and Surnjani Djoko. Substructure discovery in
the subdue system. In Proceedings of the 3rd International Conference on Knowledge
Discovery and Data Mining, AAAIWS’94, pages 169–180. AAAI Press, 1994. URL
http://dl.acm.org/citation.cfm?id=3000850.3000868.
[42] Generating graph embeddings for extremely large
graph (ai lab fb). https://ai.facebook.com/blog/
open-sourcing-pytorch-biggraph-for-faster-embeddings-of-large-graphs/.
Accessed: May 13, 2019.
[43] Ethereum white paper. https://github.com/ethereum/wiki/wiki/White-Paper.
Accessed: May 13, 2019.
[44] Ethereum yellow paper. https://ethereum.github.io/yellowpaper/paper.pdf.
Accessed: May 13, 2019.