
IEEE TRANSACTIONS ON XXX, VOL. XX, NO. YY, MARCH 2019

A Review of Deep Learning with Special Emphasis on Architectures, Applications and Recent Trends

Saptarshi Sengupta, Sanchita Basak, Pallabi Saikia, Sayak Paul, Vasilios Tsalavoutis, Frederick Ditliac Atiah, Vadlamani Ravi and Richard Alan Peters II

Abstract—Deep learning has solved a problem that as little as five years ago was thought by many to be intractable - the automatic recognition of patterns in data; and it can do so with an accuracy that often surpasses that of human beings. It has solved problems beyond the realm of traditional, hand-crafted machine learning algorithms and captured the imagination of practitioners trying to make sense of the flood of data that now inundates our society. As public awareness of the efficacy of deep learning increases, so does the desire to make use of it. But even for highly trained professionals it can be daunting to approach the rapidly increasing body of knowledge produced by experts in the field. Where does one start? How does one determine if a particular deep learning model is applicable to their problem? How does one train and deploy such a network? A primer on the subject can be a good place to start. With that in mind, we present an overview of some of the key multilayer artificial neural networks that comprise deep learning. We also discuss some new automatic architecture optimization protocols that use multi-agent approaches. Further, since guaranteeing system uptime is becoming critical to many computer applications, we include a section on using neural networks for fault detection and subsequent mitigation. This is followed by an exploratory survey of several application areas where deep learning has emerged as a game-changing technology: anomalous behavior detection in financial applications, financial time-series forecasting, predictive and prescriptive analytics, medical image processing and analysis, and power systems research. The thrust of this review is to outline emerging areas of application-oriented research within the deep learning community as well as to provide a handy reference to researchers seeking to use deep learning in their work for what it does best: statistical pattern recognition with unparalleled learning capacity and the ability to scale with information.

Index Terms—Deep Neural Network Architectures, Supervised Learning, Unsupervised Learning, Testing Neural Networks, Applications of Deep Learning, Evolutionary Computation

I. INTRODUCTION

Artificial neural networks (ANNs), now one of the most widely-used approaches to computational intelligence, started as an attempt to mimic adaptive biological nervous systems

Manuscript received March XX, 2019.
S. Sengupta, S. Basak and R. A. Peters II are with the EECS Department at Vanderbilt University, USA. e-mail: [email protected]; [email protected]; [email protected]
P. Saikia is with the CSE Department at IIT Guwahati, India. e-mail: [email protected]
Sayak Paul is with DataCamp Inc, India. e-mail: [email protected]
V. Tsalavoutis is with the School of Mechanical Engineering, Industrial Engineering Laboratory, National Technical University of Athens, Greece. e-mail: [email protected]
F. Atiah is with the Computer Science Department at the University of Pretoria. e-mail: [email protected]
R. Vadlamani is with the Institute for Development and Research in Banking Technology, India. e-mail: [email protected]

in software and customized hardware [1]. ANNs have been studied for more than 70 years [2] during which time they have waxed and waned in the attention of researchers. Recently they have made a strong resurgence as pattern recognition tools following pioneering work by a number of researchers [3]. It has been demonstrated unequivocally that multilayered artificial neural architectures can learn complex, non-linear functional mappings, given sufficient computational resources and training data. Importantly, unlike more traditional approaches, their results scale with training data. Following these remarkable, significant results in robust pattern recognition, the field has seen exponential growth, both in terms of academic and industrial research. Moreover, multilayer ANNs reduce much of the manual work that until now has been needed to set up classical pattern recognizers. They are, in effect, black box systems that can deliver, with minimal human attention, excellent performance in applications that require insights from unstructured, high-dimensional data [4], [5], [6], [7], [8], [9]. These facts motivate this review of the topic.

A. What is an Artificial Neural Network?

An artificial neural network comprises many interconnected, simple functional units, or neurons, that act in concert as parallel information-processors to solve classification or regression problems. That is, they can separate the input space (the range of all possible values of the inputs) into a discrete number of classes, or they can approximate the function (the black box) that maps inputs to outputs. If the network is created by stacking layers of these multiply connected neurons, the resulting computational system can:

1) Interact with the surrounding environment by using one layer of neurons to receive information (these units are known to be part of the input layers of the neural network),
2) Pass information back-and-forth between layers within the black box for processing by invoking certain design goals and learning rules (these units are known to be part of the hidden layers of the neural network),
3) Relay processed information out to the surrounding environment via some of its atomic units (these units are known to be part of the output layers of the neural network).

Within a hidden layer, each neuron is connected to the outputs of a subset (or all) of the neurons in the previous layer, each multiplied by a number called a weight. The neuron computes the sum of the products of those outputs (its inputs) and their corresponding weights. This computation is the dot product between an input vector and a weight vector, which can be thought of as the projection of one vector onto the other or as a measure of similarity between the two. Assume the input vectors and weights are both n-dimensional and there are m neurons in the layer. Each neuron has its own weight vector, so the output of the layer is an m-dimensional vector computed as the input vector pre-multiplied by an m x n matrix of weights. That is, the output is an m-dimensional vector that is the linear transformation of an n-dimensional input vector. The output of each neuron is in effect a linear classifier where the weight vector defines a borderline between two classes and where the input vector lies some distance to one side of it or the other. The combined result of all m neurons is an m-dimensional hyperplane that independently classifies the n dimensions of the input into two m-dimensional classes in the output. If the weights are derived via least mean-squared (LMS) estimation from matched pairs of input-output data, they form a linear regression, i.e. the hyperplane that is closest in the LMS sense to all the outputs given the inputs.

The hyperplane maps new input points to output points that are consistent with the original data, in the sense that some error function between the computed outputs and the actual outputs in the training data is minimized. Multiple layers of linear maps, wherein the output of one linear classifier or regression is the input of another, is actually equivalent to a different single linear classifier or regression. This is because the output of k different layers reduces to the multiplication of the inputs by a single q x n matrix that is the product of the k matrices, one per layer.
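A minimal NumPy sketch (our illustrative addition, not code from the paper) makes this collapse concrete: composing two purely linear layers is indistinguishable from a single linear layer whose matrix is the product of the two.

```python
import numpy as np

rng = np.random.default_rng(0)

n, k, m = 4, 8, 3             # input, hidden and output dimensions
x = rng.normal(size=n)        # an arbitrary input vector
W1 = rng.normal(size=(k, n))  # first linear layer (k x n)
W2 = rng.normal(size=(m, k))  # second linear layer (m x k)

# Two stacked linear layers, no activation function in between.
y_stacked = W2 @ (W1 @ x)

# One equivalent linear layer: the product of the two matrices.
y_single = (W2 @ W1) @ x

assert np.allclose(y_stacked, y_single)  # identical outputs
```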

Fig. 1: The Perceptron Learning Model

To classify inputs non-linearly, or to approximate a nonlinear function with a regression, each neuron adds a numerical bias value to the result of its input sum of products (the linear classifier) and passes that through a nonlinear activation function. The actual form of the activation function is a design parameter. But they all have the characteristic that they map the real line through a monotonic increasing function that has an inflection point at zero. In a single neuron, the bias effectively shifts the inflection point of the activation function to the value of the bias itself. So the sum of products is mapped through an activation function centered on the bias. Any pair of activation functions so defined is capable of producing a pulse between their inflection points if each one is scaled and one is subtracted from the other. In effect, each pair of neurons samples the input space and outputs a specific value for all inputs within the limits of the pulse. Given training data consisting of input-output pairs - input vectors each with a corresponding output vector - the ANN learns an approximation to the function that produced each of the outputs from its corresponding input. That approximation is the partition of the input space into samples that minimizes the error function between the output of the ANN given its training inputs and the training outputs. This is stated mathematically by the universal approximation theorem, which implies that any functional mapping between input vectors and output vectors can be approximated to arbitrary accuracy with an ANN provided that it has a sufficient number of neurons in a sufficient number of layers with a specific activation function [10], [11], [12], [13].
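The pulse construction described above can be seen in a few lines of NumPy (a sketch we add here for illustration, assuming sigmoid activations): subtracting two sigmoids whose inflection points sit at different biases yields a localized bump over the input axis.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x = np.linspace(-10, 10, 9)

# Two neurons with the same weight but different biases: their
# activations are sigmoids whose inflection points sit at b1 and b2.
b1, b2 = -2.0, 2.0
pulse = sigmoid(x - b1) - sigmoid(x - b2)

# The difference is close to 1 between the inflection points and close
# to 0 elsewhere: a pulse that "samples" the interval [b1, b2].
print(np.round(pulse, 2))
```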

Given the dimensions of the input and output vectors, the number of layers, the number of neurons in each layer, the form of the activation function, and an error function, the weights are computed via optimization over input-output pairs to minimize the error function. That way the resulting network is a best approximation of the known input-output data.

B. How do these networks learn?

Neural networks are capable of learning: by changing the distribution of weights it is possible to approximate a function representative of the patterns in the input. The key idea is to re-stimulate the black box using new excitation (data) until a sufficiently well-structured representation is achieved. Each stimulation redistributes the neural weights a little bit - hopefully in the right direction, provided the learning algorithm is appropriate for the task - until the approximation error w.r.t. some well-defined metric falls below a practitioner-defined lower bound. Learning, then, is the aggregation of variable-length causal chains of neural computations [14] seeking to approximate a certain pattern recognition task through linear/nonlinear modulation of the activation of the neurons across the architecture. In instances where chains of purely linear activations fail to learn the underlying structure, non-linearity aids the modulation process. The term 'deep' in this context is a direct indicator of the space complexity of the aggregation chain across many hidden layers needed to learn sufficiently detailed representations. Theorists and empiricists alike have contributed to an exponential growth in studies using deep neural networks, although generally speaking, the existing constraints of the field are well-acknowledged [15], [16], [17]. Deep learning has grown to be one of the principal components of contemporary research in artificial intelligence in light of its ability to scale with input data and its capacity to generalize across problems with similar underlying feature distributions, in stark contrast to the hard-coded, problem-specific pattern recognition architectures of yesteryear.


TABLE I: Some Key Advances in Neural Networks Research

People Involved | Contribution
McCulloch & Pitts | ANN models with adjustable weights (1943) [2]
Rosenblatt | The Perceptron Learning Algorithm (1957) [18]
Widrow and Hoff | Adaline (1960), Madaline Rule I (1961) & Madaline Rule II (1988) [19] [20]
Minsky & Papert | The XOR Problem (1969) [21]
Werbos (Doctoral Dissertation) | Backpropagation (1974) [22]
Hopfield | Hopfield Networks (1982) [23]
Rumelhart, Hinton & Williams | Renewed interest in backpropagation: multilayer adaptive backpropagation (1986) [24]
Vapnik, Cortes | Support Vector Networks (1995) [25]
Hochreiter & Schmidhuber | Long Short Term Memory Networks (1997) [26]
LeCun et al. | Convolutional Neural Networks (1998) [27]
Hinton & Ruslan | Hierarchical Feature Learning in Deep Neural Networks (2006) [28]

C. Why are deep neural networks garnering so much attention now?

Multi-layer neural networks have been around through the better part of the latter half of the previous century. A natural question to ask is why deep neural networks have gained the undivided attention of academics and industrialists alike in recent years. There are many factors contributing to this meteoric rise in research funding and volume. Some of these are outlined below:

• A surge in the availability of large training data sets with high quality labels
• Advances in parallel computing capabilities and multi-core, multi-threaded implementations
• Niche software platforms such as PyTorch [29], Tensorflow [30], Caffe [31], Chainer [32], Keras [33], BigDL [34] etc. that allow seamless integration of architectures into a GPU computing framework without the complexity of addressing low-level details such as derivatives and environment setup. Table II provides a summary of popular Deep Learning frameworks.
• Better regularization techniques introduced over the years help avoid overfitting as we scale up: techniques like batch normalization, dropout, data augmentation, early stopping etc. are highly effective in avoiding overfitting and can improve model performance with scaling.
• Robust optimization algorithms that produce near-optimal solutions: algorithms with adaptive learning rates (AdaGrad, RMSProp, Adam), stochastic gradient descent (with standard momentum or Nesterov momentum, see the update equations below), Particle Swarm Optimization, Differential Evolution, etc.
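For reference, the classical momentum and Nesterov momentum updates mentioned above can be written as follows (standard textbook forms, stated here for context rather than taken from the paper; $\mu$ is the momentum coefficient and $\eta$ the learning rate):

```latex
\begin{align*}
% SGD with classical momentum
v_{t+1} &= \mu v_t - \eta \nabla_\theta E(\theta_t) \\
\theta_{t+1} &= \theta_t + v_{t+1} \\[4pt]
% Nesterov momentum: gradient evaluated at the look-ahead point
v_{t+1} &= \mu v_t - \eta \nabla_\theta E(\theta_t + \mu v_t) \\
\theta_{t+1} &= \theta_t + v_{t+1}
\end{align*}
```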

TABLE II: A Collection of Popular Deployment Platforms

Software Platform | Purpose
Tensorflow [30] | Software library with high performance numerical computation and support for machine learning and deep learning architectures, deployable on CPU, GPU and TPU. url: https://www.tensorflow.org/
Theano [35] | GPU compatible Python library with tight integration to NumPy involving smooth mathematical operations on multidimensional arrays. url: http://deeplearning.net/software/theano/
CNTK [36] | Microsoft Cognitive Toolkit (CNTK) is a deep learning framework describing computations through directed graphs. url: https://www.microsoft.com/en-us/cognitive-toolkit/
Keras [33] | Runs on top of Tensorflow, CNTK or Theano and can be deployed on CPU and GPU. url: https://keras.io/
PyTorch [29] | Distributed training and performance evaluation platform integrated with Python and supported by major cloud platforms. url: https://pytorch.org/
Caffe [31] | Convolutional Architecture for Fast Feature Embedding (Caffe) is a deep learning framework with a focus on image classification and segmentation, deployable on both CPU and GPU. url: http://caffe.berkeleyvision.org/
Chainer [32] | Supports CUDA computation and multiple GPU implementations. url: https://chainer.org/
BigDL [34] | Distributed deep learning library for Apache Spark supporting the programming languages Scala and Python. url: https://software.intel.com/en-us/articles/bigdl-distributed-deep-learning-on-apache-spark

D. Review Methodology

The article, in its present form, serves to present a collection of notable work carried out by researchers in and related to the deep learning niche. It is by no means exhaustive and is limited in its ability to capture the global scheme of proceedings in the ever-evolving complex web of interactions among the deep learning community. While cognizant of the difficulty of achieving the stated goal, we nonetheless try to present to the reader an overview of pertinent scholarly collections in varied niches in a single article.

The article makes the following contributions from a practitioner's reading perspective:

• It walks through the foundations of biomimicry leading from biological neural networks to artificial ones, commenting on how neural network architectures learn and why deeper layers of neural units are needed for certain pattern recognition tasks.

• It discusses how several different deep architectures work, starting from deep feed-forward networks (DFNNs) and Restricted Boltzmann Machines (RBMs) through Deep Belief Networks (DBNs) and autoencoders. It also briefly sweeps across convolutional neural networks (CNNs), recurrent neural networks (RNNs), generative adversarial networks (GANs) and some of the more recent deep architectures. This cluster within the article serves as a baseline for further reading or as a refresher for the sections which build on it and follow.


Fig. 2: Organization of the Review

• The article surveys two major computational areas of research in the present day deep learning community that we feel have not been adequately surveyed yet: (a) multi-agent approaches to automatic architecture generation and learning rule optimization of deep neural networks using swarm intelligence, and (b) testing, troubleshooting and robustness analysis of deep neural architectures, which are of prime importance in guaranteeing up-time and ensuring fault-tolerance in mission-critical applications.

• A general survey of developments in certain application modalities is presented. These include:

1) Anomaly Detection in Financial Services,
2) Financial Time Series Forecasting,
3) Prognostics and Health Monitoring,
4) Medical Imaging and
5) Power Systems

Figure 2 captures a high-level hierarchical abstraction of the organization of the review with emphasis on current practices, challenges and future research directions. The content organization is as follows: Section II outlines some commonly used deep architectures with the high-level working mechanisms of each, Section III discusses the infusion of swarm intelligence techniques within the context of deep learning, and Section IV details diagnostic approaches to assuring fault-tolerant implementations of deep learning systems. Section V makes an exploratory survey of several pertinent applications highlighted in the previous paragraph, while Section VI critically dissects the general successes and pitfalls of the field as of now, and Section VII concludes the article.

II. DEEP ARCHITECTURES: WORKING MECHANISMS

There are numerous deep architectures available in the literature. Comparing architectures is difficult, as different architectures have different advantages depending on the application and the characteristics of the data involved; for example, convolutional neural networks [27] are preferred for vision, while recurrent neural networks [37] are preferred for sequence and time series modelling. However, deep learning is a fast evolving field, and every year new architectures with new learning algorithms are developed to meet the need for efficient, human-like machines in different domains of application.

A. Deep Feed-forward Networks

The deep feedforward neural network is the most basic deep architecture, in which connections between the nodes move only forward. Basically, when a multilayer neural network contains multiple hidden layers, we call it a deep neural network [38]. An example of a deep feed-forward network with n hidden layers is provided in Figure 3. Multiple hidden layers help in modelling complex nonlinear relations more efficiently compared to a shallow architecture. A complex function can be modelled with fewer computational units compared to a similarly performing shallow network, due to the hierarchical learning made possible by the multiple levels of nonlinearity [39]. Owing to the simplicity of the architecture and its training, this model remains popular among researchers and practitioners in almost all domains of engineering.


Fig. 3: Deep Feed-forward Neural Network with n hidden layers, p input units and q output units with weights W.

Backpropagation using gradient descent [40] is the most common learning algorithm used to train this model. The algorithm first initialises the weights randomly; the weights are then tuned to minimise the error using gradient descent. The learning procedure involves multiple consecutive forward and backward passes. In the forward pass, the input is propagated towards the output through multiple hidden layers of nonlinearity, and ultimately the computed output is compared with the actual output of the corresponding input. In the backward pass, the error derivatives with respect to the network parameters are back-propagated to adjust the weights in order to minimise the error in the output. The process continues multiple times until the desired improvement in the model prediction is obtained. If $X_i$ is the input and $f_i$ the nonlinear activation function in layer $i$, the output of layer $i$ can be represented by

$X_{i+1} = f_i(W_i X_i + b_i)$ (1)

$X_{i+1}$ then becomes the input for the next layer. $W_i$ and $b_i$ are the parameters connecting layer $i$ with the previous layer. In the backward pass, these parameters can be updated with

$W_{new} = W - \eta\, \partial E/\partial W$ (2)

$b_{new} = b - \eta\, \partial E/\partial b$ (3)

where $W_{new}$ and $b_{new}$ are the updated parameters for $W$ and $b$ respectively, $E$ is the cost function and $\eta$ is the learning rate. The cost function of the model is chosen depending on the task to be performed, e.g. regression or classification: for regression, root mean square error is common, while for classification, cross-entropy over a softmax output layer is typical.
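The following NumPy sketch (our illustrative addition, assuming a single hidden layer, sigmoid activations and mean squared error) shows one forward pass and one gradient descent update of the form of equations (2) and (3):

```python
import numpy as np

rng = np.random.default_rng(1)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Toy data: 8 samples, 4 features, scalar regression target.
X, y = rng.normal(size=(8, 4)), rng.normal(size=(8, 1))

# One hidden layer with 5 units, linear output layer.
W1, b1 = rng.normal(size=(4, 5)), np.zeros(5)
W2, b2 = rng.normal(size=(5, 1)), np.zeros(1)
eta = 0.1  # learning rate

# Forward pass: X_{i+1} = f_i(W_i X_i + b_i), cf. equation (1).
h = sigmoid(X @ W1 + b1)
y_hat = h @ W2 + b2
E = np.mean((y_hat - y) ** 2)  # cost function

# Backward pass: error derivatives w.r.t. each parameter.
d_out = 2 * (y_hat - y) / len(X)
dW2, db2 = h.T @ d_out, d_out.sum(axis=0)
d_h = (d_out @ W2.T) * h * (1 - h)      # sigmoid derivative
dW1, db1 = X.T @ d_h, d_h.sum(axis=0)

# Parameter updates, cf. equations (2) and (3).
W1, b1 = W1 - eta * dW1, b1 - eta * db1
W2, b2 = W2 - eta * dW2, b2 - eta * db2
```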

Fig. 4: RBM with m visible units and n hidden units

Many issues such as overfitting, getting trapped in local minima and vanishing gradients can arise if a deep neural network is trained naively. This was the reason neural networks were forsaken by the machine learning community in the late 1990s. However, in 2006 [28], [41], with the advent of the unsupervised pretraining approach for deep neural networks, they were revived and applied to complex tasks like vision and speech. Lately, many other techniques, like l1 and l2 regularisation [42], dropout [43], batch normalisation [44], good weight initialisation techniques [45], [46], [47], [48] and good activation functions [49], have been introduced to combat the issues in training deep neural networks.

B. Restricted Boltzmann Machines

The Restricted Boltzmann Machine (RBM) [50] can be interpreted as a stochastic neural network. It is one of the popular deep learning frameworks due to its ability to learn the input probability distribution in a supervised as well as unsupervised manner. It was first introduced by Paul Smolensky in 1986 under the name Harmonium [51]. However, it was popularised by Hinton in 2002 [52] with the advent of improved training algorithms for RBMs. After that, it found wide application in various tasks like representation learning [53], dimensionality reduction [54] and prediction problems [55]. However, deep belief network training using the RBM as a building block [28] was the most prominent application in the history of the RBM, marking the start of the deep learning era. Recently the RBM has gained immense popularity in the field of collaborative filtering [56] due to state of the art performance in the Netflix competition.

The Restricted Boltzmann Machine is a variation of the Boltzmann machine with the restriction that there are no intra-layer connections between units, hence the name restricted. It is an undirected graphical model containing two layers, visible and hidden, forming a bipartite graph. Different variations of RBMs have been introduced in the literature, improving the learning algorithms for different tasks: the Temporal RBM [57] and conditional RBM [58] were proposed and applied to model multivariate time series data and to generate motion captures, the Gated RBM [59] to learn transformations between two input images, the Convolutional RBM [60], [61] to understand the time structure of input time series, and the mean-covariance RBM [62], [63], [64] to represent the covariance structure of the data; there are many more, like the Recurrent TRBM [65] and the factored conditional RBM (fcRBM) [66]. Different types of units, like Bernoulli and Gaussian [67], have been introduced to cope with the characteristics of the data used. However, the basic RBM modelling concept was introduced with Bernoulli units. Each node in an RBM is a computational unit that processes the input it receives to make a stochastic decision whether to transmit that input or not. An RBM with m visible and n hidden units is shown in Figure 4.

The joint probability distribution of a standard RBM can be defined by the Gibbs distribution $p(v,h) = \frac{1}{Z} e^{-E(v,h)}$,


where the energy function $E(v,h)$ is given by

$E(v,h) = -\sum_{i=1}^{n}\sum_{j=1}^{m} w_{ij}\, h_i v_j - \sum_{j=1}^{m} b_j v_j - \sum_{i=1}^{n} c_i h_i$ (4)

where $m, n$ are the numbers of visible and hidden units, $v_j$, $h_i$ are the states of the $j$th visible unit and $i$th hidden unit, $b_j$, $c_i$ are the real-valued biases corresponding to the $j$th visible unit and $i$th hidden unit respectively, and $w_{ij}$ is the real-valued weight connecting visible unit $j$ with hidden unit $i$. $Z$ is the normalisation constant (the sum of $e^{-E(v,h)}$ over all possible configurations) ensuring the probability distribution sums to 1. The restriction on intra-layer connections makes the RBM hidden layer variables independent given the states of the visible layer variables, and vice versa. This eases the complexity of modelling the probability distribution, and hence the distribution of each variable can be represented by the conditional probability distributions given below:

$p(h|v) = \prod_{i=1}^{n} p(h_i|v)$ (5)

$p(v|h) = \prod_{j=1}^{m} p(v_j|h)$ (6)

The RBM is trained to maximise the expected probability of the training samples. The contrastive divergence algorithm proposed by Hinton [52] is popular for training RBMs. Training brings the model to a stable state by minimising its energy through updates to the parameters of the model. The parameters can be updated using the following equations:

$\Delta w_{ij} = \varepsilon(\langle h_i v_j\rangle_{data} - \langle h_i v_j\rangle_{model})$ (7)

$\Delta b_j = \varepsilon(\langle v_j\rangle_{data} - \langle v_j\rangle_{model})$ (8)

$\Delta c_i = \varepsilon(\langle h_i\rangle_{data} - \langle h_i\rangle_{model})$ (9)

where $\varepsilon$ is the learning rate, and $\langle\cdot\rangle_{data}$, $\langle\cdot\rangle_{model}$ denote expected values under the data and the model distribution respectively.
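A minimal CD-1 (single step contrastive divergence) update for a Bernoulli-Bernoulli RBM might look as follows; this is our illustrative sketch of the updates in equations (7)-(9), not code from the paper:

```python
import numpy as np

rng = np.random.default_rng(2)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

m, n, eps = 6, 4, 0.05              # visible units, hidden units, learning rate
W = 0.01 * rng.normal(size=(n, m))  # weights w_ij
b = np.zeros(m)                     # visible biases b_j
c = np.zeros(n)                     # hidden biases c_i

v0 = rng.integers(0, 2, size=m).astype(float)  # a binary training sample

# Positive phase: sample hidden units given the data.
p_h0 = sigmoid(W @ v0 + c)
h0 = (rng.random(n) < p_h0).astype(float)

# Negative phase (one Gibbs step): reconstruct v, then resample h.
p_v1 = sigmoid(W.T @ h0 + b)
v1 = (rng.random(m) < p_v1).astype(float)
p_h1 = sigmoid(W @ v1 + c)

# Parameter updates, cf. equations (7)-(9).
W += eps * (np.outer(p_h0, v0) - np.outer(p_h1, v1))
b += eps * (v0 - v1)
c += eps * (p_h0 - p_h1)
```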

C. Deep Belief Networks

A deep belief network (DBN) is a generative graphical model composed of multiple layers of latent variables. The latent variables are typically binary and represent the hidden features present in the input observations. The connection between the top two layers of a DBN is undirected, as in an RBM model; hence a DBN with one hidden layer is just an RBM. All other connections in a DBN are directed, pointing towards the input layer. Since a DBN is a generative model, generating a sample from it follows a top-down approach: we first draw samples from the RBM on the top layer, usually by Gibbs sampling, and then sample the visible units by a simple pass of ancestral sampling in a top-down fashion. A standard DBN model [68] with three hidden layers is shown in Figure 5.

Fig. 5: DBN with input vector x and 3 hidden layers

Inference in a DBN is an intractable problem due to the explaining-away effect in latent variable models. However, in 2006 Hinton [28] proposed a fast and efficient way of training DBNs by stacking Restricted Boltzmann Machines (RBMs) one above the other. During training, the lowest level RBM learns the distribution of the input data. The next RBM block learns higher order correlations between the hidden units of the previous hidden layer by sampling those hidden units. This process is repeated for each hidden layer up to the top. A DBN with L hidden layers models the joint distribution between its visible layer $v$ and the hidden layers $h^l$, $l = 1, 2, \ldots, L$, as follows:

$p(v, h^1, \ldots, h^L) = p(v|h^1)\left(\prod_{l=1}^{L-2} p(h^l|h^{l+1})\right) p(h^{L-1}, h^L)$ (10)

The log-probability of the training data can be improved by

adding layers to the network, which, in turn, increases the true representational power of the network [69]. The DBN training procedure proposed in 2006 by Hinton [28] led to the deep learning era of today and revived the neural network. It was the first deep architecture in history to be trained efficiently; before that, it was almost impossible to train deep architectures. Deep architectures built by initialising the weights with a DBN outperformed the kernel machines that dominated the research landscape at that time. Besides their use as generative models, DBNs have been applied significantly as discriminative models by appending a discrimination layer at the end and fine-tuning the model using the provided target labels [3]. In most applications, this approach of pretraining a deep architecture led to state of the art performance in discriminative models [70], [28], [41], [71], [54], for example in recognising handwritten digits, detecting pedestrians and time series prediction, even when the amount of labelled data was limited [72]. It has recently gained immense popularity in acoustic modelling [73], as such models could provide up to 20% improvement over state of the art models such as Hidden Markov Models and Gaussian Mixture Models. The approach creates feature detectors hierarchically, as features of features, during pretraining, providing a good set of initialised weights to the discriminative model. The initialised weights lie in a region near the optimal weights, which can improve both modelling and convergence during fine-tuning [70], [74]. DBNs have been used as initialisation for classification in many applications, such as phone recognition [62] and computer vision [63], where a DBN is used to train a higher order factorized Boltzmann machine, speech recognition [75], [76], [77] for pretraining DNNs, and pretraining of deep convolutional neural networks (CNNs) [60], [78], [61]. The improved performance is due to the ability of the hidden layers of the network to learn abstract features. Some work analysing these features, to understand what is captured and what is lost during training, is demonstrated in [64], [79], [80].
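A sketch of the greedy layer-wise procedure described above (our illustration, under the assumption that `train_rbm` is a CD-1 trainer like the earlier sketch; the names here are hypothetical) pretrains one RBM per layer on the activations of the layer below:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_rbm(data, n_hidden):
    """Placeholder for a CD-1 RBM trainer like the earlier sketch.
    Returns (W, b, c): weights, visible and hidden biases."""
    rng = np.random.default_rng(3)
    W = 0.01 * rng.normal(size=(n_hidden, data.shape[1]))
    b, c = np.zeros(data.shape[1]), np.zeros(n_hidden)
    # ... CD-1 updates over `data` would go here ...
    return W, b, c

def pretrain_dbn(X, layer_sizes):
    """Greedy layer-wise pretraining: each RBM learns the distribution
    of the hidden activations of the RBM below it."""
    params, data = [], X
    for n_hidden in layer_sizes:
        W, b, c = train_rbm(data, n_hidden)
        params.append((W, b, c))
        data = sigmoid(data @ W.T + c)  # activations feed the next RBM
    return params  # used to initialise a DNN for fine-tuning

# Example: a 784-500-200 DBN on random stand-in "data".
X = (np.random.default_rng(4).random((32, 784)) > 0.5).astype(float)
dbn = pretrain_dbn(X, layer_sizes=[500, 200])
```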


Fig. 6: Autoencoder with 3 neurons in the hidden layer

D. Autoencoders

An autoencoder is a three-layer neural network, as shown in Figure 6, that tries to reconstruct its input in its output layer. Hence, the output layer of an autoencoder contains the same number of units as the input layer. The hidden layer typically contains fewer neurons than the visible layer and tries to encode or represent the input in a more compact form. It shares the same idea as the RBM, but it typically uses deterministic units instead of stochastic units with a particular distribution, as in the case of the RBM.

Like a feedforward neural network, an autoencoder is typically trained using the backpropagation algorithm. The processing consists of two phases: encoding and decoding. In the encoding phase, the model encodes the input into a hidden representation using the weight matrices of the lower half of the layers, and in the decoding phase, it tries to reconstruct the same input from the encoded representation using the matrices of the upper half of the layers. Hence, the weights in encoding and decoding are forced to be transposes of each other. The encoding and decoding operations of an autoencoder can be represented by the equations below. In the encoding phase,

$y' = f(wx + b)$ (11)

where $w$, $b$ are the parameters to be tuned, $f$ is the activation function, $x$ is the input vector, and $y'$ is the hidden representation. In the decoding phase,

$x' = f(w'y' + c)$ (12)

where $w'$ is the transpose of $w$, $c$ is the bias of the output layer, and $x'$ is the reconstructed input at the output layer. The parameters of the autoencoder can be updated using the following equations:

$w_{new} = w - \eta\, \partial E/\partial w$ (13)

$b_{new} = b - \eta\, \partial E/\partial b$ (14)

where $w_{new}$ and $b_{new}$ are the updated parameters for $w$ and $b$ respectively at the end of the current iteration, and $E$ is the reconstruction error of the input at the output layer.
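A compact NumPy sketch of one training step of such a tied-weight autoencoder (our illustrative addition, assuming sigmoid activations and a squared reconstruction error) is given below:

```python
import numpy as np

rng = np.random.default_rng(5)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

d, k, eta = 8, 3, 0.1              # input size, hidden size, learning rate
w = 0.1 * rng.normal(size=(k, d))  # tied weights: decoder uses w.T
b, c = np.zeros(k), np.zeros(d)    # hidden and output biases

x = rng.random(d)                  # one input vector

# Encoding: y' = f(w x + b), cf. equation (11).
y = sigmoid(w @ x + b)
# Decoding: x' = f(w^T y' + c), cf. equation (12).
x_rec = sigmoid(w.T @ y + c)

# Reconstruction error gradients (squared error, sigmoid units).
err = x_rec - x
d_out = err * x_rec * (1 - x_rec)
d_hid = (w @ d_out) * y * (1 - y)

# Tied weights receive gradient contributions from both phases.
w -= eta * (np.outer(d_hid, x) + np.outer(y, d_out))
b -= eta * d_hid
c -= eta * d_out
```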

An autoencoder with multiple hidden layers forms a deep autoencoder. As with deep neural networks, training may be difficult due to the multiple layers. This can be overcome by training each layer of the deep autoencoder as a simple autoencoder [28], [41]. The approach has been successfully applied to encoding documents for faster subsequent retrieval [81], image retrieval, efficient speech features [82] etc. Like the stacking of RBMs to form a DBN [28] for layerwise pretraining of a DNN, the autoencoder [41], along with the sparse-coding energy-based model [71], was independently developed at that time; both were effectively used to pre-train deep neural networks, much like the DBN. Unsupervised pretraining using autoencoders has been successfully applied in many fields, for example image recognition and dimensionality reduction on MNIST [54], [82], [83], and multimodal learning over speech and video images [84], [85]. The autoencoder has also gained immense popularity as a generative model in recent years [38], [86]. The non-probabilistic and non-generative nature of the conventional autoencoder has been generalised to generative modelling [87], [42], [88], [89], [90], allowing samples to be meaningfully generated from the network.

Several variations of the autoencoder have been introduced, with quite different properties and implementations, to learn more efficient representations of data. One popular variation that is robust to input variations is the denoising autoencoder [89], [42], [90]. The model can be used for a good compact representation of the input, with the number of hidden units less than the number of input units. It can also be used to perform robust modelling of the input distribution with a higher number of neurons in the hidden layer. The robustness of the denoising autoencoder is achieved by the dropout trick or by introducing Gaussian noise to the input data [91], [92] or to the hidden layers [93]. This approach helps in many ways to improve performance: it virtually increases the training set, hence reducing overfitting, and makes the representation of the input robust. The sparse autoencoder [93] was introduced to allow a larger number of hidden units than visible units while keeping an easy and efficient representation of the input distribution in the hidden layer; the larger hidden layer represents the input by turning units on and off. The variational autoencoder [86], [94], using quite a similar concept to the RBM, learns a stochastic distribution over latent variables instead of a deterministic one. The transforming autoencoder [95] was proposed as an autoencoder with transformation invariance: its encoded features effectively reflect the transformation invariance property. The encoder has been applied for image recognition [95], [96] and contains capsules as its building blocks. A capsule is an independent sub-network that extracts local features within a limited window of viewing to determine whether a feature entity is present with a certain probability.


Pretraining CNNs using regularised deep autoencoders has become very popular in recent years in computer vision work. Robust CNN models are obtained with the denoising autoencoder [88] and the sparse autoencoder with pooling and local contrast normalization [97], which provide not only translation-invariant features but also scaling and out-of-plane rotation invariant features.

E. Convolutional Neural Networks

Convolutional Neural Networks are a class of neural networks that are extremely good at processing images. Although the idea was proposed back in 1998 by LeCun et al. in their paper entitled "Gradient-based learning applied to document recognition" [98], the deep learning world actually saw it in action when Krizhevsky et al. won the ILSVRC-2012 competition. The architecture that Krizhevsky et al. proposed is popularly known as AlexNet [99]. This remarkable win started a new era of artificial intelligence, and the computation community witnessed the real power of CNNs. Soon after this, several architectures were proposed and still are being proposed, and in many cases these CNN architectures have been able to beat human recognition power as well. It is worth noting that the deep learning revolution actually began with the usage of Convolutional Neural Networks (CNNs). CNNs are extremely useful for a set of computer vision related tasks such as image detection, image segmentation, image classification and so on, and all of these tasks are practically well aligned. On a very high level, deep learning is all about learning data representations, and in order to do so, deep learning systems typically break down complex representations into a set of simpler representations. As mentioned earlier, CNNs are particularly useful when it comes to images, as images have a special spatial property in their formation. An image has several characteristics like edges, contours, strokes, textures, gradients, orientation and colour. A CNN breaks down an image in terms of simple properties like these and learns them as representations in different layers [100]. Figure 8 is a good representative of this learning scheme.

The layers involved in any CNN model are the convolution layers and the subsampling/pooling layers, which allow the network to learn filters specific to particular parts of an image. The convolution layers help the network retain the spatial arrangement of pixels present in any image, whereas the pooling layers allow the network to summarize the pixel information [101]. There are several CNN architectures such as ZFNet, AlexNet, VGG, YOLO, SqueezeNet, ResNet and so on, and some of these are discussed in section II-H.
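To make the convolution/pooling division concrete, here is a minimal NumPy sketch (our illustration, assuming a single channel, no padding and unit stride) of the two layer types described above:

```python
import numpy as np

def conv2d(img, kernel):
    """Valid 2D convolution: slides `kernel` over `img`,
    preserving the spatial arrangement of pixels."""
    kh, kw = kernel.shape
    out_h, out_w = img.shape[0] - kh + 1, img.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for r in range(out_h):
        for c in range(out_w):
            out[r, c] = np.sum(img[r:r+kh, c:c+kw] * kernel)
    return out

def max_pool(fmap, size=2):
    """Max pooling: summarizes each size x size patch by its maximum."""
    h, w = fmap.shape[0] // size, fmap.shape[1] // size
    return fmap[:h*size, :w*size].reshape(h, size, w, size).max(axis=(1, 3))

img = np.random.default_rng(6).random((8, 8))          # a toy "image"
edge_kernel = np.array([[1.0, -1.0]])                  # crude edge detector
feature_map = np.maximum(conv2d(img, edge_kernel), 0)  # conv + ReLU
pooled = max_pool(feature_map)                         # 2x2 summary
print(feature_map.shape, pooled.shape)                 # (8, 7) (4, 3)
```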

F. Recurrent Neural Networks

Although Hidden Markov Models (HMMs) can express time dependencies, they become computationally unfeasible when modelling long term dependencies, which RNNs are capable of. A detailed derivation of the recurrent neural network from differential equations can be found in [102]. RNNs are a form of feed-forward network spanning adjacent time steps, such that at any time instant a node of the network takes the current data input as well as the hidden node values capturing information from previous time steps. During the backpropagation of errors across multiple timesteps, the problem of vanishing and exploding gradients takes place; this can be avoided by Long Short Term Memory (LSTM) networks, introduced by Hochreiter and Schmidhuber [103]. The amount of information to be retained from previous time steps is controlled by a sigmoid layer known as the 'forget' gate, whereas the sigmoid-activated 'input gate' decides upon the new information to be stored in the cell, followed by a hyperbolic tangent activated layer that produces new candidate values, which are combined with the forget gate coefficient weighted old state's candidate value to update the cell state. Finally, the output is produced, controlled by the output gate and the hyperbolic tangent activated candidate value of the state; the gate equations are given below.
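Written out (the standard LSTM formulation, added here for reference; $\sigma$ denotes the logistic sigmoid, $\odot$ elementwise multiplication and $[h_{t-1}, x_t]$ the concatenation of the previous hidden state with the current input):

```latex
\begin{align*}
f_t &= \sigma(W_f [h_{t-1}, x_t] + b_f) && \text{forget gate} \\
i_t &= \sigma(W_i [h_{t-1}, x_t] + b_i) && \text{input gate} \\
\tilde{C}_t &= \tanh(W_C [h_{t-1}, x_t] + b_C) && \text{candidate values} \\
C_t &= f_t \odot C_{t-1} + i_t \odot \tilde{C}_t && \text{cell state update} \\
o_t &= \sigma(W_o [h_{t-1}, x_t] + b_o) && \text{output gate} \\
h_t &= o_t \odot \tanh(C_t) && \text{output}
\end{align*}
```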

LSTM networks with peephole connections [104] update the three gates using the cell state information. A single update gate replacing the forget and input gates was introduced in the Gated Recurrent Unit (GRU) [105], merging the hidden and cell states. In [106], Sak et al. came up with training LSTM RNNs in a distributed way on multicore CPUs using asynchronous SGD (Stochastic Gradient Descent) optimization for the purpose of acoustic modelling. They presented a two-layer deep LSTM architecture with each layer having a linear recurrent projection layer, making more efficient use of the model parameters. Doetsch et al. [107] proposed an LSTM based training framework composed of sequence chunks forming mini-batches, for the purpose of handwriting recognition. With a reduction of runtime by a factor of 3, the architecture uses modified gating units with layer-specific weights for each gate. Palangi et al. [108] implemented a sentence embedding model using an LSTM-RNN that sequentially extracts information from each word and embeds it in a semantic vector until the end of the sentence, obtaining an overall semantic representation of the entire sentence. The model, capable of attenuating unimportant words and identifying salient keywords, is specifically useful in web document retrieval applications.

G. Generative Adversarial Networks

Goodfellow et al. [109] introduced a novel framework for Generative Adversarial Nets with simultaneous training of a generative and a discriminative model. The proposed new generative model bypasses the difficulty of approximating unmanageable probabilistic measures in Maximum Likelihood Estimation faced previously. The generative model tries to capture the data distribution, whereas the discriminative model learns to estimate the probability of a sample coming either from the training data or from the distribution captured by the generative model. If the two models described above are multilayer perceptrons, only the backpropagation and dropout algorithms are required to train them.

The goal in this process is to train the generative network in a way that maximizes the probability of the discriminative network making a mistake. A unique solution can be obtained in the function space where the generative model recovers the distribution of the training data and the discriminative model assigns a probability of 50% to each sample.


Fig. 7: Convolution and Pooling Layers in a CNN

Fig. 8: Representations of an image of a handwritten digit learned by a CNN

This can be viewed as a minimax two-player game between the two models: the generative model produces adversarial examples while the discriminative model tries to identify them correctly, and both try to improve their efficiency until the adversarial examples are indistinguishable from the original ones. The value function of this game is shown below.
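The well-known minimax objective from [109], reproduced here for reference, makes the two-player game explicit: the discriminator $D$ maximizes the value function while the generator $G$ minimizes it.

```latex
\min_G \max_D V(D, G) =
  \mathbb{E}_{x \sim p_{\mathrm{data}}(x)}[\log D(x)]
  + \mathbb{E}_{z \sim p_z(z)}[\log(1 - D(G(z)))]
```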

In [110], the authors presented training procedures to be applied to GANs, focusing on producing visually sensible images. The proposed model was successful in producing MNIST samples visually indistinguishable from the original data, and also in learning recognizable features from the Imagenet dataset in a semi-supervised way. This work provides insight about appropriate evaluation metrics for generative models in GANs and a stable semi-supervised training approach. In [111], the authors identified distinct features of GANs from a Turing perspective. The discriminators were allowed to behave as interrogators, as in the Turing Test, by interacting with data sample generating processes, and the increase in accuracy of the models was affirmed by verification with two case studies. The first was about inferring an agent's behavior based on a hidden stochastic process while managing its environment. The second example concerns active self-discovery exercised by a robot to draw conclusions about its own sensors through controlled movements.

Wu et al. [112] proposed a 3D Generative Adversarial Network (3DGAN) for three dimensional object generation using volumetric convolutional networks, with a mapping from a probabilistic space of lower dimension to the three dimensional object space, so that 3D objects can be sampled or explored without any reference image. As a result, high quality 3D objects can be generated employing an efficient shape descriptor learnt in an unsupervised manner by the adversarial discriminator.

Vondrick et al. [113] came up with a video recognition/classification and video generation/prediction model using a Generative Adversarial Network (GAN), with separation of foreground from background employing a spatio-temporal convolutional architecture. The proposed model is efficient in predicting futuristic versions of static images, extracting meaningful features and recognizing actions embedded in the video in a minimally supervised way. Thus, learning scene dynamics from unlabeled videos using adversarial learning is the main objective of the proposed framework.

Another interesting application is generating images from detailed visual descriptions [114]. The authors trained a deep convolutional generative adversarial network (DC-GAN) based on text features encoded by a hybrid character-level convolutional recurrent neural network, and used a manifold interpolation regularizer. The generalizability of the approach was tested by generating images of various objects with changing backgrounds.

H. Recent Deep Architectures

When it comes to deep learning and computer vision, datasets like Cats and Dogs, ImageNet, CIFAR-10 and MNIST are used for benchmarking purposes. Throughout this section, the ImageNet dataset is used for benchmarking results, as it is more generalized than the other datasets just mentioned. Every year a competition named ILSVRC (ImageNet Large Scale Visual Recognition Competition), an image classification competition based on the ImageNet dataset, is organized, and it is widely accepted by the deep learning community [115]. Several deep neural network architectures have been proposed in the literature, and still are being proposed, with the objective of achieving general artificial intelligence. The LeNet architecture, for example, was proposed by LeCun et al. in 1998, originally as a digit classification model. Later, LeNet was incorporated to identify handwritten numbers on cheques [98]. Several architectures have been proposed after LeNet, among which AlexNet certainly deserves the most notable mention. It was proposed by Krizhevsky et al. in 2012 and was able to beat all the competitors of the ILSVRC challenge. The discovery of AlexNet marked a significant turn in the history of deep learning for several reasons.


Fig. 9: A Recurrent Neural Network Architecture

For example, AlexNet incorporated dropout regularization, which had just been developed at that time, and made use of efficient GPU computing to reduce training time, which was a first of its kind back in 2012 [99]. Soon after AlexNet, ZFNet was proposed by Zeiler et al. in 2013 and showed state-of-the-art results on the ILSVRC challenge. It was an enhancement of the AlexNet architecture: it uses expanded mid convolution layers and incorporates smaller strides and filters in the first convolution layer to capture pixel information in great detail [116]. In 2014, Google researchers came up with a better model known as GoogleNet or the Inception Network, which won the ILSVRC 2014 challenge. The main catch of this architecture is the inception layer, which allows convolving in parallel with different kernel sizes; this in turn allows the network to learn the smaller pixel information of an image in a better way [117]. It is worth mentioning the VGGNet (also called VGG) architecture here. It was the runner-up in the ILSVRC 2014 challenge and was proposed by Simonyan et al. VGG uses a 3x3 kernel throughout its entire architecture and achieves tremendous generalization with this fixation [118]. The winner of the ILSVRC 2015 challenge was the ResNet architecture, proposed by He et al. This architecture, more formally known as Residual Networks, is deeper than the VGG architecture while still being less complex than it. ResNet was able to beat human performance on the ImageNet dataset and is still being quite actively used in production [119], [120].

III. SWARM INTELLIGENCE IN DEEP LEARNING

The introduction of heuristic and meta-heuristic algorithms in designing complex neural network architectures, aimed at tuning the network parameters to optimize the learning process, has brought improvements in the performance of several deep learning frameworks.

Fig. 10: A Generative Adversarial Network Architecture

In order to design Artificial Neural Networks (ANNs) automatically with evolutionary computation, a Deep Evolutionary Network Structured Representation (DENSER) was proposed in [121], where the optimal design for the network is achieved by a bi-level representation. The outer level deals with the number of layers and their sequence, whereas the inner level optimizes the parameters and hyper-parameters associated with each layer, defined by a context-free human-perceivable grammar. Through automatic design of CNNs, the proposed approach performed well on the CIFAR-10, CIFAR-100, MNIST and Fashion-MNIST datasets. On the other hand, Garro et al. [122] proposed a methodology to automatically design ANNs using basic Particle Swarm Optimization (PSO), Second Generation Particle Swarm Optimization (SGPSO), and a New Model of PSO (NMPSO) to simultaneously evolve and optimize the synaptic weights, the transfer function for each neuron, and the architecture itself. The ANNs designed in this way were evaluated over eight fitness functions, aimed at dimensionality reduction of the input pattern, and compared to traditional design architectures using the well known Back-Propagation and Levenberg-Marquardt algorithms. Das et al. [123] used PSO to optimize the number of layers, the neurons, the kinds of transfer functions involved, and the topology of ANNs aimed at building channel equalizers that perform better in the presence of all noise scenarios. The canonical PSO update underlying these methods is sketched below.
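For context, the canonical PSO update that these approaches build on (the standard formulation, added here for reference) moves each particle $x_i$ with velocity $v_i$ toward its personal best $p_i$ and the swarm's global best $g$:

```latex
\begin{align*}
v_i^{t+1} &= \omega v_i^{t} + c_1 r_1 (p_i^{t} - x_i^{t}) + c_2 r_2 (g^{t} - x_i^{t}) \\
x_i^{t+1} &= x_i^{t} + v_i^{t+1}
\end{align*}
% \omega: inertia weight; c_1, c_2: acceleration coefficients;
% r_1, r_2 \sim U(0,1): uniform random numbers drawn per dimension.
```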

Wang et al. [124] used Variable-length Particle Swarm Optimization for the automatic evolution of deep Convolutional Neural Network architectures for image classification purposes. They proposed a novel encoding strategy to encode CNN layers in particle vectors and introduced a Disabled layer that hides certain dimensions of the particle vector in order to obtain variable-length particles. In addition, to speed up the process, the authors randomly picked partial datasets for evaluation. Thus several variants of PSO, along with its hybridized versions [125] as well as a host of recent swarm intelligence algorithms such as the Quantum Double Delta Swarm Algorithm (QDDS) [126] and its chaotic implementation [127] proposed by Sengupta et al., can be used, among others, for the automatic generation of architectures used in deep learning applications.
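As an illustration of how such swarm-based tuning works, the following is a minimal sketch of a global-best PSO loop in Python. The two-dimensional search space (a learning rate and a hidden-layer width) and the quadratic stand-in fitness function are assumptions, placeholders for a full train-and-validate cycle:

import numpy as np

def fitness(position):
    # Stand-in for training a network with these hyperparameters and
    # returning validation loss; here a toy quadratic bowl.
    return np.sum((position - np.array([0.01, 128.0])) ** 2)

rng = np.random.default_rng(0)
n_particles, dim, iters = 20, 2, 50
w, c1, c2 = 0.7, 1.5, 1.5                      # inertia and acceleration terms
x = rng.uniform([0.0, 16.0], [0.1, 512.0], (n_particles, dim))
v = np.zeros_like(x)
pbest, pbest_f = x.copy(), np.array([fitness(p) for p in x])
gbest = pbest[pbest_f.argmin()].copy()

for _ in range(iters):
    r1, r2 = rng.random((n_particles, dim)), rng.random((n_particles, dim))
    v = w * v + c1 * r1 * (pbest - x) + c2 * r2 * (gbest - x)  # velocity update
    x = x + v
    f = np.array([fitness(p) for p in x])
    improved = f < pbest_f
    pbest[improved], pbest_f[improved] = x[improved], f[improved]
    gbest = pbest[pbest_f.argmin()].copy()

print("best (learning rate, hidden units):", gbest)

The published methods differ mainly in what the particle encodes (weights, layer types, whole architectures) and how variable-length encodings are handled, but the velocity/position update above is the common core.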

The problem of the changing dimensionality of the information perceived by each agent in the domain of deep reinforcement learning (RL) for swarm systems has been addressed in [128] using an end-to-end learned mean feature embedding as state information. The research concluded that an end-to-end embedding using neural network features helps to scale up the RL architecture with increasing numbers of agents towards better-performing policies, as well as ensuring fast convergence.

IV. TESTING NEURAL NETWORKS

Software employed in safety-critical systems needs to be rigorously tested through white-box or black-box testing. In white-box testing, the internal structure of the software/program is known and utilized in generating test cases as per the test criteria/requirements, whereas in black-box testing the inputs and outputs of the program are compared, as the internal code of the software cannot be accessed. Some previous works dealing with the generation of fault-revealing test cases can be found in [129] and, using Principal Component Analysis, in [130]. In [131] the authors implemented a black-box testing methodology by feeding randomly generated input test cases to an original version of a real-world test program and recording the corresponding outputs, so that input-output pairs are generated to train a neural network. Each test case is then applied to a mutated, faulty version of the test program and compared against the output of the trained ANN to calculate the distance between the two outputs, indicating whether the faulty program has produced a valid or invalid result. Thus the ANN is treated as an automated oracle, which produces satisfactory results when the training set comprises data ensuring good coverage over the whole range of inputs.

Y. Sun et al. [132] proposed a set of four test coverage criteria drawing inspiration from the traditional Modified Condition/Decision Coverage (MC/DC) criteria. They also proposed algorithms, built upon linear programming, for generating test cases for each criterion. A new test case (an input to the Deep Neural Network) is produced by perturbing a given one, where the stated algorithms encode the test requirement and a fragment of the DNN by fixing the activation pattern obtained from the given input example, and then minimize the difference between the new and the current inputs. The utility of this method lies in bug finding, determining DNN safety statistics, measuring testing accuracy and analyzing the internal structure of the DNN. The paper discusses the sign change, value change and distance change of a neuron pair, with two neurons in adjacent layers, in terms of the change in their activation values across two given test cases. Four covering methods are explained: sign-sign cover, distance-sign cover, sign-value cover and distance-value cover, along with test requirements and test criteria that compute the percentage of neuron pairs covered by test cases with respect to the covering method.

For each test requirement an automatic test case generation algorithm is implemented based on Linear Programming (LP). The objective is to find a test input variable, whose value is to be synthesized with LP, with an activation pattern identical to that of a given input. A pair of inputs that satisfy the closeness definition are called adversarial examples if only one of them is correctly labeled by the DNN. The testing criteria necessitate that the (sign or distance) changes of the condition neurons support the (sign or value) change of every decision neuron. For a pair of neurons with a specified testing criterion, two activation patterns need to be found such that the two patterns together exhibit the changes required by the corresponding testing criterion. The inputs matching these patterns are added to the final test suite. The authors put forward results on 10 DNNs with the sign-sign, distance-sign, sign-value and distance-value covering methods, showing that the test generation algorithms are effective, as they reach high coverage for all covering criteria. The covering methods designed are also useful; this is supported by the fact that a significant portion of adversarial examples were identified. To evaluate the quality of the obtained adversarial examples, a distance curve was plotted showing how close an adversarial example is to the correct input. It is observed that going deeper into the DNN, covering neuron pairs can become harder. Under such circumstances, a larger data set is needed when generating test pairs in order to improve coverage. Interestingly, it seems that most adversarial examples can be found around the middle layers of all the DNNs tested. In the future the authors propose to find more efficient test case generation algorithms that do not require linear programming.
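To make the covering relation concrete, the toy check below implements one plausible reading of the sign-sign cover of [132]: the condition neuron and the decision neuron must both flip sign between the two test cases, while the remaining neurons of the condition layer keep their signs. The activation vectors are fabricated purely for illustration:

import numpy as np

def sign_sign_covered(act1_k, act2_k, act1_k1, act2_k1, i, j):
    """Check whether a test pair (x1, x2) sign-sign covers neuron pair (i, j),
    where i indexes layer k and j indexes layer k+1 (a simplified reading of
    [132]): the signs of condition neuron i and decision neuron j both change,
    while every other neuron in layer k keeps its sign."""
    s1, s2 = np.sign(act1_k), np.sign(act2_k)
    if s1[i] == s2[i]:                               # condition neuron must flip
        return False
    if np.sign(act1_k1[j]) == np.sign(act2_k1[j]):   # decision neuron must flip
        return False
    others = np.delete(np.arange(len(s1)), i)
    return bool(np.all(s1[others] == s2[others]))    # rest of layer k unchanged

# Toy activations for layer k (3 neurons) and layer k+1 (2 neurons)
a1k, a2k = np.array([0.5, -1.0, 2.0]), np.array([-0.5, -1.2, 1.8])
a1k1, a2k1 = np.array([1.0, -0.3]), np.array([-0.7, -0.2])
print(sign_sign_covered(a1k, a2k, a1k1, a2k1, i=0, j=0))  # True

The coverage percentage reported by the criterion is then simply the fraction of adjacent-layer neuron pairs for which some pair of test cases in the suite satisfies this relation.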

Katz et al. [133] provided methods for verifying the adversarial robustness of neural networks with the Reluplex algorithm, to prove that a small perturbation to a correctly classified input should not result in misclassification. Huang et al. [134] proposed an automated verification framework based on Satisfiability Modulo Theories (SMT) to test the safety of a neural network by searching for adversarial manipulations through exploration of the space around a given data point. The adversarial examples discovered were used to fine-tune the network.

A. Different Methods of Adversarial Test Generation

Despite the success of deep learning in various domains, the robustness of the architectures needs to be studied before neural networks are applied in safety-critical systems. In this subsection we discuss the kinds of malicious attacks that can fool or mislead a NN into outputting wrong decisions, and ways to overcome them. The work presented by Tuncali et al. [135] deals with generating scenarios leading to unexpected behaviors by introducing perturbations in the testing conditions. For identifying falsification and critical system behaviors in autonomous driving systems, the authors focused on finding glancing counterexamples, which refer to the borderline behavior of the system where it is on the verge of failing. They introduced a Signal Temporal Logic (STL) formula for the problem at hand, which in this case is a combination of predicates over the speed of the target car, the distances of all other objects (including cars and pedestrians) and their relative positions. A list of test cases is then created and evaluated against the STL specification. A covering array spanning all possible combinations of the values the variables can take is generated. To find a glancing behavior, the discrete parameters from the covering array that correspond to the trace minimizing the STL conditions are used to create test cases, either uniformly at random or with a cost function guiding a search over the continuous variables. Thus, a glancing test case for a trace is obtained. The proposed closed-loop architecture operates in an integrated way with the controller and the Deep Neural Network (DNN) based perception system to search for critical behaviors of the vehicle.

In [136] Yuan et al. discuss the adversarial falsification problem, explaining false positive and false negative attacks, white-box attacks where there is complete knowledge of the trained NN model, and black-box attacks where no information about the model can be accessed. With respect to adversarial specificity, there are targeted and non-targeted attacks, where the class output of the adversarial input is predefined in the first case and arbitrary in the second. They also discuss perturbation scope, where individual attacks are geared towards generating unique perturbations per input, whereas universal attacks generate a similar attack for the whole dataset. The perturbation measurement is computed as the p-norm distance between the actual and the adversarial input. The paper discusses various attack methods including the L-BFGS attack and the Fast Gradient Sign Method (FGSM), which performs a one-step gradient update along the direction of the sign of the gradient at every pixel, expressed as [137]:

η = ε · sign(∇_x J_θ(x, l))    (15)

where ε is the magnitude of the perturbation η which, when added to an input, yields an adversarial example.
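Equation (15) translates almost directly into code. Below is a minimal PyTorch sketch of the one-step FGSM perturbation; the placeholder classifier, input shape and ε value are assumptions for illustration:

import torch
import torch.nn as nn

def fgsm_attack(model, x, label, epsilon):
    """One-step FGSM: perturb x by epsilon in the direction of the sign
    of the loss gradient, as in Eq. (15)."""
    x = x.clone().detach().requires_grad_(True)
    loss = nn.functional.cross_entropy(model(x), label)
    loss.backward()
    eta = epsilon * x.grad.sign()              # eta = eps * sign(grad_x J(x, l))
    return (x + eta).detach().clamp(0.0, 1.0)  # keep pixels in a valid range

# Toy usage with a placeholder classifier on 28x28 images
model = nn.Sequential(nn.Flatten(), nn.Linear(28 * 28, 10))
x = torch.rand(1, 1, 28, 28)
label = torch.tensor([3])
x_adv = fgsm_attack(model, x, label, epsilon=0.1)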

FGSM has been extended by the Basic Iterative Method (BIM) and the Iterative Least-Likely Class Method (ILLC). Moosavi-Dezfooli et al. [138] proposed DeepFool, in which an iterative attack is performed with a linear approximation in order to surpass the nonlinearity in multidimensional cases.

B. Countermeasures for Adversarial Examples

The paper [136] deals with reactive countermeasures such as Adversarial Detecting, Input Reconstruction and Network Verification, and proactive countermeasures such as Network Distillation, Adversarial (Re)training and Classifier Robustifying. In Network Distillation, a high-temperature softmax activation reduces the sensitivity of the model towards small perturbations. In Adversarial (Re)training, adversarial examples are used during training. Adversarial Detecting deals with estimating the probability of a given input being adversarial. In the Input Reconstruction technique, a denoising autoencoder is used to transform adversarial examples back to actual data before passing them as input to the deep NN prediction module. Also, Gaussian Process hybrid Deep Neural Networks (GPDNNs) have proven to be more robust towards adversarial inputs.
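The temperature-scaled softmax at the core of network distillation can be sketched in a few lines; the temperature value below is illustrative:

import numpy as np

def softmax_with_temperature(logits, T=20.0):
    """Softmax at temperature T; a large T flattens the distribution,
    reducing the sensitivity of the outputs to small input perturbations."""
    z = np.asarray(logits, dtype=float) / T
    z -= z.max()                     # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

print(softmax_with_temperature([2.0, 1.0, 0.1], T=1.0))   # peaked
print(softmax_with_temperature([2.0, 1.0, 0.1], T=20.0))  # nearly uniform

In distillation, the student network is trained on the soft labels produced at high temperature, which smooths the learned decision surface.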

There are also ensemble defense strategies to counter multifaceted adversarial examples. However, the defense strategies discussed here are mostly applicable to computer vision tasks, whereas the need of the day is real-time adversarial input detection and mitigation measures for safety-critical systems.

In [139] Rouhani et al. proposed DeepFense, an online defense framework against adversarial deep learning. They formulated it as an unsupervised optimization problem that minimizes the less-observed regions in the latent feature hyperspace spanned by a deep learning network, and were able to decrease the risk of integrated attacks. With an integrated design of software and hardware algorithms, the proposed framework aims to maximize model reliability.

It is necessary to build robust countermeasures for different types of adversarial scenarios in order to provide a reliable infrastructure, since none of the countermeasures is universally applicable to all sorts of adversaries. A detailed list of specific attack generation methods and corresponding countermeasures can be found in [140].

TABLE III: Distribution of Articles by Application Areas

Fraud Detection in Financial Services: Pumsirirat et al. [141], Schreyer et al. [142], Wang et al. [143], Zheng et al. [144], Dong et al. [145], Gomez et al. [146], Ryman-Tubb et al. [147], Fiore et al. [148]

Financial Time Series Forecasting: Cavalcante et al. [149], Li et al. [150], Fama et al. [151], Lu et al. [152], Tk & Verner [153], Pandey et al. [154], Lasfer et al. [155], Gudelek et al. [156], Fischer & Krauss [157], Santos Pinheiro & Dras [158], Bao et al. [159], Hossain et al. [160], Calvez and Cliff [161]

Prognostics and Health Monitoring: Basak et al. [162], Tamilselvan & Wang [163], Kuremoto et al. [164], Qiu et al. [165], Gugulothu et al. [166], Filonov et al. [167], Botezatu et al. [168]

Medical Image Processing: Suk, Lee & Shen [169], van Tulder & de Bruijne [170], Brosch & Tam [171], Esteva et al. [172], Rajaraman et al. [173], Kang et al. [174], Hwang & Kim [175], Andermatt et al. [176], Cheng et al. [177], Miao et al. [178], Oktay et al. [179], Golkov et al. [180]

Power Systems: Vankayala & Rao [181], Chow et al. [182], Guo et al. [183], Bourguet & Antsaklis [184], Bunn & Farmer [185], Hippert et al. [186], Kuster et al. [187], Aggarwal & Song [188], Zhai [189], Park et al. [190], Mocanu et al. [191], Chen et al. [192], Bouktif et al. [193], Dedinec et al. [194], Rahman et al. [195], Kong et al. [196], Dong et al. [197], Kalogirou et al. [198], Wang et al. [199], Das et al. [200], Dabra et al. [201], Liu et al. [202], Jang et al. [203], Gensler et al. [204], Abdel-Nasser et al. [205], Manwell et al. [206], Marugan et al. [207], Wu et al. [208], Wang et al. [209], Wang et al. [210], Feng et al. [211], Qureshi et al. [212]

V. APPLICATIONS

A. Fraud Detection in Financial Services

Fraud detection is an interesting problem in that it can be formulated in an unsupervised, a supervised or a one-class classification setting. In the unsupervised learning category, class labels are either unknown or assumed to be unknown, and clustering techniques are employed to figure out (i) distinct clusters containing fraudulent samples or (ii) far-off fraudulent samples that do not belong to any cluster, where all clusters contain genuine samples, in which case it is treated as an outlier detection problem. In the supervised learning category, class labels are known and a binary classifier is built in order to classify fraudulent samples. Examples of these techniques include logistic regression, Naive Bayes, supervised neural networks, decision trees, support vector machines, fuzzy rule based classifiers, rough set based classifiers etc. Finally, in the one-class classification category, only samples of the genuine class are available, or fraud samples are not considered for training even if available. These are called one-class classifiers. Examples include the one-class support vector machine (aka Support Vector Data Description or SVDD) and auto-association neural networks (aka autoencoders). In this category, models are trained on the genuine class data and are tested on the fraud class. The literature abounds with studies involving traditional neural networks with various architectures to deal with the above-mentioned three categories. Having said that, fraud (including cyber fraud) detection is becoming increasingly menacing, and fraudsters always appear to be a few notches ahead of organizations in finding new loopholes in the system and circumventing them effortlessly. On the other hand, organizations invest heavily in money, time and resources to predict fraud in near real time, if not real time, and to mitigate its consequences. Financial fraud manifests itself in various areas such as banking, insurance and investments (stock markets). It can be both offline and online. Online fraud includes credit/debit card fraud, transaction fraud and cyber fraud involving security, while offline fraud includes accounting fraud, forgeries etc.

Deep learning algorithms have proliferated during the last five years, finding immense application in many fields where traditional neural networks were applied with great success. Fraud detection is one of them. In what follows, we review the works that employed deep learning for fraud detection and appeared in refereed international journals; one article is from the arXiv repository. Papers published in international conferences are excluded.

Pumsirirat (2018) [141] proposed an unsupervised deep autoencoder (AE) based on a restricted Boltzmann machine (RBM) in order to detect novel frauds, because fraudsters always try to be innovative in their modus operandi so that they are not caught while perpetrating the fraud. He employed a backpropagation-trained deep autoencoder based on an RBM that reconstructs normal transactions in order to find anomalies deviating from normal patterns. He used Google's TensorFlow library to implement the AE and RBM, along with H2O's deep learning tools. The results were reported in terms of mean squared error, root mean squared error and area under the curve.
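The reconstruction-error idea behind such autoencoder-based detectors can be sketched as follows. This is not the author's exact model; the network sizes, synthetic data and threshold are assumptions. The point is only that an autoencoder trained on genuine transactions scores poorly-reconstructed inputs as suspicious:

import torch
import torch.nn as nn

# Train an autoencoder on normal transactions only; transactions that
# reconstruct poorly are flagged as potential frauds. Sizes are synthetic.
ae = nn.Sequential(
    nn.Linear(30, 16), nn.ReLU(), nn.Linear(16, 8), nn.ReLU(),   # encoder
    nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 30),              # decoder
)
opt = torch.optim.Adam(ae.parameters(), lr=1e-3)
normal = torch.randn(1024, 30)                 # stand-in for genuine samples

for _ in range(200):
    opt.zero_grad()
    loss = nn.functional.mse_loss(ae(normal), normal)
    loss.backward()
    opt.step()

def anomaly_score(x):
    with torch.no_grad():
        return ((ae(x) - x) ** 2).mean(dim=1)  # per-sample reconstruction MSE

scores = anomaly_score(torch.randn(5, 30))
flags = scores > scores.mean() + 2 * scores.std()   # illustrative threshold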

Schreyer (2017) [142] observed the disadvantage of business- and experiential-knowledge-driven rules in failing to generalize beyond known scenarios in large-scale accounting fraud. He therefore proposed a deep autoencoder for this purpose and tested its effectiveness on two real-world datasets. Chartered accountants appreciated the power of the deep autoencoder in predicting anomalous accounting entries.

Automobile insurance fraud has traditionally been predicted by considering only structured data; the textual data present in the claims was never analyzed. Wang and Xu (2018) [143] proposed a novel method, wherein Latent Dirichlet Allocation (LDA) was first used to extract the text features hidden in the text descriptions of the accidents appearing in the claims, and deep neural networks were then trained on these features together with the traditional numeric features. Based on a real-world insurance fraud dataset, they concluded that their hybrid approach outperformed random forests and support vector machines.

Telecom fraud has assumed large proportions and its impact can be seen in services involving mobile banking. Zheng et al. (2018) [144] proposed a novel generative adversarial network (GAN) based model to compute the probability of fraud for each large transfer, so that the bank can prevent potential frauds if the probability exceeds a threshold. The model uses a deep denoising autoencoder to learn the complex probabilistic relationships among the input features, and employs adversarial training to accurately discriminate between positive and negative samples in the data. They concluded that the model outperformed traditional classifiers, and that by using it two commercial banks reduced losses by about 10 million RMB in twelve weeks, thereby significantly improving their reputation.

In today's word-of-mouth marketing, online reviews posted by customers influence buyers' purchase decisions more critically than ever before. However, fraud can be perpetrated in these reviews too, by posting fake and meaningless reviews that cannot reflect customers'/users' genuine purchase experiences and opinions. Such reviews pose great challenges for users trying to make the right choices. It is therefore desirable to build a fraud detection model to identify and weed out fake reviews. Dong et al. (2018) [145] presented a combination of an autoencoder and a random forest, where a stochastic decision tree model fine-tunes the parameters. Extensive experiments were conducted on a large Amazon review dataset.

Gomez et al. (2018) [146] presented a neural network based system for fraud detection in banking. They analyzed a real-world dataset and proposed an end-to-end solution from the practitioner's perspective, focusing especially on issues such as data imbalance, data processing and cost metric evaluation. They reported that their proposed solution performed comparably with state-of-the-art solutions.

Ryman-Tubb et al. (2018) [147] observed that payment card fraud dented economies to the tune of USD 416 billion in 2017. This fraud is perpetrated primarily to finance terrorism, arms and drug crime. Until recently, the patterns of fraud and the criminals' modus operandi had remained unsophisticated. However, smart phones, mobile payments, cloud computing and contactless payments emerged almost simultaneously with large-scale data breaches, making the extant methods less effective. They surveyed extant methods using transactional volumes in 2017. Their benchmark showed that only eight traditional methods offer practical performance suitable for deployment in industry. Further, they suggested that cognitive computing approaches and deep learning are promising research directions.

Fiore et al. (2019) [148] observed that data imbalance is a crucial issue in payment card fraud detection and that oversampling has some drawbacks. They proposed Generative Adversarial Networks (GAN) for oversampling: they trained a GAN to output mimicked minority-class examples, which were then merged with the training data into an augmented training set so that the effectiveness of a classifier could be improved. They concluded that a classifier trained on the augmented set outperformed the same classifier trained on the original data, especially as far as sensitivity is concerned, resulting in an effective fraud detection mechanism.
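A compact sketch of this oversampling idea follows; the layer sizes, latent dimension and training schedule are illustrative assumptions rather than the configuration used in [148]:

import torch
import torch.nn as nn

# GAN-based minority oversampling: train G to mimic fraud samples, then
# merge generated samples into the training set of a downstream classifier.
G = nn.Sequential(nn.Linear(8, 32), nn.ReLU(), nn.Linear(32, 30))
D = nn.Sequential(nn.Linear(30, 32), nn.ReLU(), nn.Linear(32, 1))
opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)
bce = nn.BCEWithLogitsLoss()
fraud = torch.randn(256, 30)                   # stand-in minority samples

for _ in range(500):
    z = torch.randn(64, 8)
    fake = G(z)
    real = fraud[torch.randint(0, len(fraud), (64,))]
    # Discriminator step: push real -> 1, fake -> 0
    opt_d.zero_grad()
    d_loss = bce(D(real), torch.ones(64, 1)) + bce(D(fake.detach()), torch.zeros(64, 1))
    d_loss.backward()
    opt_d.step()
    # Generator step: try to fool the discriminator
    opt_g.zero_grad()
    g_loss = bce(D(G(z)), torch.ones(64, 1))
    g_loss.backward()
    opt_g.step()

synthetic_minority = G(torch.randn(1000, 8)).detach()  # merge into training set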

In summary, as far as fraud detection is concerned, some progress has been made in the application of a few deep learning architectures. However, there is immense potential to contribute to this field, especially through the application of ResNet, gated recurrent units, capsule networks etc. to detect frauds, including cyber frauds.

B. Financial Time Series Forecasting

Advances in technology and breakthroughs in deep learning models have led to an increase in intelligent automated trading and decision support systems in financial markets, especially the stock and foreign exchange (FOREX) markets. However, time series problems are difficult to predict, especially financial time series [149]. On the other hand, NN and deep learning models have shown great success in forecasting financial time series [150], despite the contradictory claims of the efficient market hypothesis (EMH) [151] that the FOREX and stock markets follow a random walk and any profit made is by chance. This success can be attributed to the ability of NNs to self-adapt to any nonlinear data set without statistical assumptions or prior knowledge of the data set [152].

Deep learning algorithms have used both fundamental and technical analysis data, the two most commonly used techniques for financial time series forecasting, to train and build deep learning models [149]. Fundamental analysis is the use or mining of textual information such as financial news, company financial reports and other economic factors like government policies to predict price movement. Technical analysis, on the other hand, is the analysis of historical data of the stock and FOREX markets.

Deep Learning NNs (DLNN), or Multilayer Feed-forward NNs (MFF), are the most used algorithms for financial markets [153]. The experimental analysis by Pandey et al. [154] showed that an MFF with Bayesian learning performed better than an MFF learning with backpropagation for the FOREX market.

Deep neural networks, like other machine learning models, are considered black boxes, because their internal workings are not fully understood. The performance of a DNN is highly influenced by its parameters for a particular domain. Lasfer et al. [155] performed an analysis of the influence of parameters (such as the number of neurons, the learning rate, the activation function etc.) on stock price forecasting. Their work showed that a larger NN produces better results than a smaller one; however, the effect of the activation function on a large NN is smaller.

Although CNNs are best known for their successes in image recognition and have seen fewer applications in the financial markets, they have also shown good performance in forecasting the stock market. As indicated by [155], the deeper the network, the better a NN can generalize to produce good results. However, the more layers a NN has, the more likely it is to overfit a given data set. CNNs, on the other hand, with their convolution, pooling and dropout mechanisms, reduce the tendency to overfit [156].

In order to apply a CNN to the financial market, the input data needs to be transformed or adapted for the CNN. With the help of a sliding window, Gudelek et al. [156] used images generated by taking snapshots of stock time series data and fed them into a 2D-CNN to perform daily predictions and classification of trends (downward or upward). The model was able to achieve 72 percent accuracy on 17 exchange-traded-fund data sets, although it was not compared against other NN architectures. Fischer and Krauss [157] adapted LSTM for stock prediction and compared its performance with memory-free algorithms such as random forest, a logistic regression classifier and a deep neural network. LSTM performed better than the other algorithms; random forest, however, outperformed LSTM during the financial crisis in 2008.
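A sketch of how a price series can be turned into image-like snapshots for a 2D-CNN, loosely following the windowing idea in [156], is given below; the window size, scaling and label definition are assumptions:

import numpy as np

def series_to_snapshots(prices, window=28):
    """Turn a 1-D price series into square 'images' for a 2D-CNN: each
    window is min-max scaled and tiled into a window x window grid, and
    labeled by whether the next step moves up. Sizes are illustrative."""
    snapshots, labels = [], []
    for t in range(len(prices) - window - 1):
        w = prices[t:t + window]
        w = (w - w.min()) / (w.max() - w.min() + 1e-8)      # scale to [0, 1]
        img = np.tile(w, (window, 1))                       # window x window
        snapshots.append(img)
        labels.append(int(prices[t + window] > prices[t + window - 1]))
    return np.stack(snapshots)[:, None, :, :], np.array(labels)

x, y = series_to_snapshots(np.cumsum(np.random.randn(500)))
print(x.shape, y.shape)   # (471, 1, 28, 28) (471,)

In practice each row of the snapshot could carry a different technical indicator rather than a repeated price window; the tiling here is only to keep the sketch short.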

The EMH [151] holds the view that financial news affecting price movements is incorporated into the price immediately or gradually; therefore, any investor who can analyze the news first and craft a good trading strategy can profit. Based on this view and the rise of big data, there has been an upward trend in sentiment analysis and text mining research that utilizes blogs, financial news and social media to forecast the stock or FOREX markets [149]. Santos Pinheiro and Dras [158] explored the impact of news articles on company stock prices by implementing an LSTM neural network, pre-trained by a character-level language model, to predict the changes in a company's prices for both inter-day and intra-day trading. The results showed that a CNN with a word-wise model outperformed the other models; however, the LSTM character-level model performed better than the RNN-based models and also has less architectural complexity than the other algorithms.

Moreover, there have been hybrid architectures combining the strengths of more than one deep learning model to forecast financial time series. Bao et al. [159] combined wavelet transforms, stacked autoencoders and LSTM for stock price prediction, with the output of one model fed into the next as input. The hybrid model performed better than standalone LSTM and RNN. Hossain et al. [160] also created a hybrid model, combining LSTM and the Gated Recurrent Unit (GRU) to predict the S&P 500 stock price. The model was compared against standalone models such as LSTM and GRU with different architectural layers, and the hybrid model outperformed all the other algorithms.
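The stacking pattern behind such hybrids is easy to sketch. The following PyTorch model, in the spirit of [160], feeds an LSTM layer into a GRU layer and predicts the next value from the final hidden state; the layer sizes and window length are illustrative:

import torch
import torch.nn as nn

class LSTMGRUForecaster(nn.Module):
    """Hybrid recurrent forecaster: an LSTM layer feeds a GRU layer,
    whose last hidden state predicts the next value."""
    def __init__(self, n_features=1, hidden=32):
        super().__init__()
        self.lstm = nn.LSTM(n_features, hidden, batch_first=True)
        self.gru = nn.GRU(hidden, hidden, batch_first=True)
        self.head = nn.Linear(hidden, 1)

    def forward(self, x):                 # x: (batch, time, features)
        out, _ = self.lstm(x)
        out, _ = self.gru(out)
        return self.head(out[:, -1, :])   # predict from the last time step

model = LSTMGRUForecaster()
pred = model(torch.randn(16, 60, 1))      # 60-step windows -> next-step value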

Calvez and Cliff [161] introduced a new approach to trading on the stock market with a DLNN model: the model learns by observing the trading behaviors of traders. The authors used a limit order book (LOB) and quotes made by successful traders (both automated and human) as input data. The DLNN was able to learn from, and outperform, both human traders and automated traders. This approach to learning might be the breakthrough for intelligent automated trading in the financial markets.

C. Prognostics and Health Management

The service reliability of the ever-encompassing cyber-physical systems around us has started to garner the undivided attention of the prognostics community in recent years. Factors such as revenue loss, system downtime, failure in mission-critical deployments and market competitive index are emergent motivations behind making accurate predictions about the State of Health (SoH) and Remaining Useful Life (RUL) of components and systems. Industry niches such as manufacturing, electronics, automotive, defense and aerospace are increasingly reliant on expert diagnosis of system health and smart recommender systems for maximizing system uptime and adaptive scheduling of maintenance. Given the surge in sensor influx, if there exists sufficient structured information in historical or transient data, accurate models describing the system evolution may be proposed. The general idea behind such approaches is that there is a point in the operational cycle of a component beyond which it no longer delivers optimum performance. In this regard, the most widely used metric for determining the critical operational cycle is the Remaining Useful Life (RUL), a measure of the time from measurement to the critical cycle beyond which sub-optimal performance is anticipated. Prognostic approaches may be divided into three categories: (a) model-driven, (b) data-driven, and (c) hybrid, i.e. any combination of (a) and (b). The last three decades have seen extensive usage of model-driven approaches with Gaussian Processes and Sequential Monte Carlo (SMC) methods, which continue to be popular for capturing patterns in relatively simple sensor data streams. However, one shortcoming of the model-driven approaches used to date is their dependence on physical evolution equations recommended by an expert with problem-specific domain knowledge. For model-driven approaches to continue to perform well as problem complexity scales, the prior distribution (physical equations) must continue to capture the embedded causalities in the data accurately. However, it has been observed that as sensor data scales, the ability of model-driven approaches to learn the inherent structures in the data has lagged. This is due to the use of simplistic priors and updates that are unable to capture the complex functional relationships in high-dimensional input data. With the introduction of self-regulated learning paradigms such as deep learning, this problem of learning the structure in sensor data was mitigated to a large extent, because it was no longer necessary for an expert to hand-design the physical evolution scheme of the system. With the recent advancements in parallel computational capabilities, techniques leveraging the volume of available data have begun to shine. One key issue to keep in mind is that the performance of data-driven approaches is only as good as the labeled data available for training. While the surplus of sensor data may act as a motivation for choosing such approaches, it is critical that the precursor to the supervised part of learning, i.e. data labeling, be accurate. This often requires laborious and time-consuming effort and is not guaranteed to result in near-accurate ground truth. However, when adequate precautions are in place and a strategic implementation facilitating optimal learning is achieved, it is possible to deliver customized solutions to complex prediction problems with an accuracy unmatched by simpler, model-driven approaches. Therein lies the holy grail of deep learning: the ability to scale learning with training data.

Recent works on device health forecasting are as follows. Basak et al. [162] carried out Remaining Useful Life (RUL) prediction of hard disks on Backblaze hard disk data, along with discussions of effective feature normalization strategies. A Deep Belief Network (DBN) based multisensor health diagnosis methodology was proposed in [163] and employed in aircraft engine and electric power transformer health diagnosis to show the effectiveness of the approach.

Kuremoto et al. [164] applied a DBN composed of two Restricted Boltzmann Machines (RBM) to capture the input feature distribution, and then optimized the size of the network and the learning rate through Particle Swarm Optimization for forecasting with time series data. Qiu et al. [165] proposed an early warning model in which feature extraction through a DNN, with hidden state analysis by a Hidden Markov Model (HMM), is carried out for the health maintenance of an equipment chain in a gas pipeline. Gugulothu et al. [166] proposed a forecasting scheme using a Recurrent Neural Network (RNN) model to generate embeddings that capture the trends of multivariate time series data, which are expected to differ for healthy and unhealthy devices. The idea of using RNNs to capture intricate dependencies among various time cycles of sensor observations is emphasized in [167] for prognostic applications. Botezatu et al. came up with rules for directly identifying the healthy or unhealthy state of a device in [168], employing a disk replacement prediction algorithm with changepoint detection applied to time series Backblaze data.
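A common way to cast RUL prediction as supervised learning, consistent with the works above though not tied to any one of them, is to pair sliding windows of run-to-failure sensor data with the number of cycles remaining at the end of each window; the sketch below uses synthetic runs:

import numpy as np

def make_rul_windows(sensor_runs, window=30):
    """Frame RUL prediction as supervised regression: each length-`window`
    slice of a run-to-failure sensor series is paired with the cycles
    remaining at its end. Shapes are illustrative."""
    X, y = [], []
    for run in sensor_runs:               # run: (cycles, n_sensors)
        n = len(run)
        for t in range(n - window):
            X.append(run[t:t + window])
            y.append(n - (t + window))    # cycles left until failure
    return np.stack(X), np.array(y, dtype=float)

runs = [np.random.randn(np.random.randint(120, 200), 5) for _ in range(10)]
X, y = make_rul_windows(runs)
print(X.shape, y.shape)                   # e.g. (~1300, 30, 5) and (~1300,)

Any sequence model (an RNN, LSTM or 1-D CNN) can then be trained to regress y from X.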

A typical flow of prognostics and health management for any system under test using data-driven approaches thus starts with data collection, including instances of device performance or features under both normal and degraded operating conditions. Data preprocessing and feature selection play a crucial role before deep learning approaches are applied to learn a model capturing the degradation of the device under test, which can then be employed to diagnose faults and prognosticate future states of the device, ensuring proper maintenance. Here, maintaining a balance between false positives and false negatives becomes crucial under real-world industrial constraints, and accuracy measures such as precision and recall must be validated before deployment.

D. Medical Image Processing

Deep learning techniques have pervaded the entire discipline of medical image processing, and the number of studies highlighting their application in canonical tasks such as image classification, detection, enhancement, image generation, registration and segmentation has been rising sharply. A recent survey by Litjens et al. [213] presents a collective picture of the prevalence and applications of deep learning models within the community, as does a fairly rigorous treatise of the same by Shen et al. [214]. A concise overview of recent work in some of these canonical tasks follows.

The purpose of image/exam classification is to identify the presence of a disease based on images from medical examinations. Over the last few years, various neural network architectures have been used in this field, including stacked autoencoders applied to the diagnosis of Alzheimer's disease and mild cognitive impairment, exploiting the latent non-linear, complicated relations among various features [169]; Restricted Boltzmann Machines applied to lung CT analysis, combining generative as well as discriminative learning techniques [170]; and Deep Belief Networks trained on three-dimensional medical images [171]. Recently, a trend towards using Convolutional Neural Networks in this field has been observed. In 2017, Esteva et al. [172] fine-tuned the Inception v3 [215] model to classify clinical images pertaining to skin cancer examinations into benign and malignant variants. Validation experiments were carried out by testing model performance against a good number of dermatologists. In 2018, Rajaraman et al. [173] used specialized CNN architectures like ResNet for detecting malarial parasites in thin blood smear images. Kang et al. [174] improved on the performance of a 2D CNN by using a 3D multi-view CNN for lung nodule classification, exploiting spatial contextual information with the help of a 3D Inception-ResNet architecture.

Fig. 11: MRI Brain Slice and its different segmentations [217]

Object/lesion detection aims to identify different parts or lesions in an image. Although object classification and object detection are quite similar to each other, the challenges are specific to each category. In object detection, the problem of class imbalance can pose a major hurdle to model performance. Object detection also involves identifying localized information, specific to different parts of an image, from the full image space; the task of object detection is therefore a combination of localization and classification [216]. In 2016, Hwang and Kim proposed a self-transfer learning (STL) framework that optimizes both aspects of the medical object detection task. They tested the STL framework on the detection of nodules in chest radiographs and lesions in mammography [175].

Segmentation is one of the most common subjects of interest in the application of deep learning to medical image processing. Organ and substructure segmentation allows for advanced fine-grained analysis of a medical image, and it is widely practiced in the analysis of cardiac and brain images. A demonstration is shown in Figure 11, where different segmented parts of an MRI brain slice are shown along with the original slice. Segmentation involves both the local and the global context of pixels with respect to a given image, and the performance of a segmentation model can suffer from inconsistencies due to class imbalance, which makes segmentation a difficult task. The most widely used CNN architecture for medical image segmentation is U-Net, proposed by Ronneberger et al. [218] in 2015. U-Net takes care of the sampling required to counter class imbalance, and it can scan an entire image in a single forward pass, which enables it to consider the full context of the image. RNN-based architectures have also been proposed for segmentation tasks. In 2016, Andermatt et al. [176] presented a method to automatically segment 3D volumes of biomedical images, using multi-dimensional gated recurrent units (GRU) as the main layers of their neural network model. The proposed method also involves on-the-fly data augmentation, which enables the model to be trained with a smaller amount of training data.
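The encoder-decoder-with-skip-connection pattern that defines U-Net can be illustrated at toy scale; the sketch below has a single down/up level and is far smaller than the published architecture [218]:

import torch
import torch.nn as nn

class TinyUNet(nn.Module):
    """A two-level encoder-decoder with one skip connection, illustrating
    the U-Net idea at toy scale; the published network is much deeper."""
    def __init__(self, in_ch=1, n_classes=2):
        super().__init__()
        self.enc = nn.Sequential(nn.Conv2d(in_ch, 16, 3, padding=1), nn.ReLU())
        self.down = nn.MaxPool2d(2)
        self.mid = nn.Sequential(nn.Conv2d(16, 32, 3, padding=1), nn.ReLU())
        self.up = nn.ConvTranspose2d(32, 16, 2, stride=2)
        self.dec = nn.Sequential(nn.Conv2d(32, 16, 3, padding=1), nn.ReLU())
        self.head = nn.Conv2d(16, n_classes, 1)   # per-pixel class scores

    def forward(self, x):
        e = self.enc(x)                            # full-resolution features
        m = self.mid(self.down(e))                 # coarse features
        u = self.up(m)                             # back to full resolution
        d = self.dec(torch.cat([u, e], dim=1))     # skip connection: concat
        return self.head(d)

logits = TinyUNet()(torch.randn(1, 1, 64, 64))     # (1, 2, 64, 64)

The skip connection reinjects fine spatial detail from the encoder into the decoder, which is what gives U-Net its sharp per-pixel boundaries.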

Other applications of deep learning in medical image processing include image registration, which refers to coordinate transformation from a reference image space to a target image space. Cheng et al. [177] used a multi-modal stacked denoising autoencoder to compute an effective similarity measure between images, using normalized mutual information and local cross-correlation. Miao et al. [178], on the other hand, developed CNN regressors to directly estimate the registration transformation parameters. In addition, image generation and enhancement techniques have been discussed in [179], [180].

So far, the application of deep learning in medical image processing has produced satisfactory results in most cases. However, in a sensitive field like medical image processing, prior knowledge should be incorporated into image detection, recognition and reconstruction, so that the data-driven approaches do not produce implausible results [219].

E. Power Systems

Artificial Neural Networks (ANN) have rapidly gained popularity among power system researchers [181]. Since their introduction to the power systems area in 1988 [182], numerous applications of ANNs to problems in electric power systems have been proposed. However, recent developments in Deep Learning (DL) methods have resulted in powerful tools that can handle large data sets and often outperform traditional machine learning methods in problems related to the power sector [183]. For this reason, deep architectures are currently receiving the attention of researchers in power industry applications. Here, we focus on describing some approaches applying deep ANN architectures to three basic problems of the power industry: load forecasting and prediction of the power output of wind and solar energy systems.

Load forecasting is one of the most important tasks for efficient power system operation. It allows the system operator to schedule spinning reserve allocation, decide on possible interchanges with other utilities and assess the system's security [184]. A small decrease in load forecasting error may result in a significant reduction of the total operating cost of the power system [185]. Among the artificial intelligence techniques applied to load forecasting, methods based on ANNs have undoubtedly received the largest share of attention [186]. A basic reason for their popularity lies in the fact that ANN techniques are well suited to energy forecasting [187]; they may obtain adequate estimates in cases where data is incomplete [188] and can consistently deal with complex non-linear problems [189]. Park et al. [190] presented one of the first approaches applying ANNs to load forecasting. The efficiency of the proposed Multi-Layer Perceptron (MLP) was demonstrated by benchmarking it against a numerical forecasting method frequently used by utilities. As an evolution of ANN forecasting techniques, DL methods are expected to increase prediction accuracy by allowing higher levels of abstraction [191]. Thus, DL methods are gradually gaining popularity due to their ability to capture data behaviour when considering complex non-linear patterns and large amounts of data. In [192], an end-to-end model based on deep residual neural networks is proposed for hourly load forecasting of a single day. Only raw data of past load and temperature were used as inputs to the model. Initially, the inputs are processed by several fully connected layers to produce a preliminary forecast; these forecasts are then passed through a deep neural network structure constructed from residual blocks. The efficiency of the proposed model was demonstrated on data sets from the North American Utility and ISO-NE. In [193], a Long Short-Term Memory (LSTM) based neural network was proposed for short- and medium-term load forecasting; in order to optimize its effectiveness, a Genetic Algorithm is used to find the optimal values for the time lags and the number of layers of the LSTM model. The performance of the proposed structure was verified using electricity consumption data from Metropolitan France. Mocanu et al. [191] utilized two deep learning approaches based on Restricted Boltzmann Machines (RBM), i.e. conditional RBM and factored conditional RBM, for single-meter residential load forecasting. The method was benchmarked against several shallow ANN architectures and a Support Vector Machine approach, demonstrating increased efficiency compared to the competing methods. Dedinec et al. [194] employed a Deep Belief Network (DBN) for short-term load forecasting of the Former Yugoslav Republic of Macedonia; the proposed network comprised several stacks of RBMs, which were pre-trained layer-wise. Rahman et al. [195] proposed two models based on Recurrent Neural Networks (RNN), aiming to predict medium- and long-term electricity consumption in residential and commercial buildings at one-hour resolution; the approach utilized an MLP in combination with an LSTM-based model using an encoder-decoder architecture. A model based on an LSTM-RNN framework with appliance consumption sequences for short-term residential load forecasting was proposed in [196]; the researchers showed that their method outperforms other state-of-the-art methods for load forecasting. In [197] a Convolutional Neural Network (CNN) with k-means clustering was proposed: k-means is used to partition the large amount of data into clusters, which are then used to train the networks. The method showed improved performance compared to the case where k-means is not employed.
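The cluster-then-train pattern of [197] can be sketched as follows; here k-means partitions synthetic daily load profiles and a linear regressor stands in for the per-cluster CNN, purely for brevity:

import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LinearRegression

# Partition daily load profiles with k-means and fit a separate forecaster
# per cluster. [197] trains CNNs; a linear model stands in here.
rng = np.random.default_rng(1)
profiles = rng.random((365, 24))            # one year of hourly daily loads
next_day_peak = profiles.max(axis=1) + 0.1 * rng.random(365)  # toy target

km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(profiles)
models = {}
for c in range(3):
    mask = km.labels_ == c
    models[c] = LinearRegression().fit(profiles[mask], next_day_peak[mask])

new_profile = rng.random((1, 24))
c = km.predict(new_profile)[0]              # route to its cluster's model
print(models[c].predict(new_profile))

Routing each new profile to a model trained only on similar profiles is what gives the clustered variant its edge over a single global model.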

The utilization of DL techniques for modelling and forecasting in renewable energy systems is progressively increasing. Since the data in such systems are inherently noisy, they may be adequately handled with ANN architectures [198]. Moreover, because renewable energy data is complicated in nature, shallow learning models may be insufficient to identify and learn the corresponding deep non-linear and non-stationary features and traits [199]. Among the various renewable energy sources, wind and solar energy have gained the most popularity due to their potential and high availability [200]. As a result, research endeavours in recent years have focused on developing DL techniques for problems related to the deployment of these renewable energy sources.

Photovoltaic (PV) energy has received much attention due to its many advantages: it is abundant, inexhaustible and clean [201]. However, due to the chaotic and erratic nature of weather systems, the power output of PV energy systems is intermittent, volatile and random [202]. These uncertainties may potentially degrade real-time control performance and reduce system economics, and thus pose a great challenge for the management and operation of electric power and energy systems [203]. For these reasons, the accuracy of PV power output forecasting plays a major role in ensuring optimal planning and modelling of PV plants. In [199] a deep neural network architecture is proposed for deterministic and probabilistic PV power forecasting. The deep architecture for deterministic forecasting comprises a Wavelet Transform and a deep CNN, while the probabilistic forecasting model combines the deterministic model with a spline Quantile Regression (QR) technique. The method was evaluated on historical PV power data sets obtained from two PV farms in Belgium, exhibiting high forecasting stability and robustness. Gensler et al. [204] examined several deep network architectures, i.e. MLP, LSTM networks, DBN and Autoencoders, with respect to their forecasting accuracy for PV power output; the performance of the methods was validated on actual data from PV facilities in Germany. The architecture that exhibited the best performance was the Auto-LSTM network, which combines the feature extraction ability of the Autoencoder with the forecasting ability of the LSTM. In [205] an LSTM-RNN is proposed for forecasting the output power of solar PV systems; in particular, the authors examine five different LSTM network architectures in order to obtain the one with the highest forecasting accuracy on the examined data sets, which were retrieved from two cities in Egypt. The network that provided the highest accuracy was the LSTM with memory between batches.

With the advantages of zero pollution, low costs and remarkable benefits of scale, wind power is considered one of the most important sources of energy [206]. ANNs have been widely employed for processing the large amounts of data obtained from the data acquisition systems of wind turbines [207]. In recent years, many approaches based on DL architectures have been proposed for predicting the power output of wind power systems. In [208], a deep neural network architecture combining CNN and LSTM networks is proposed for deterministic wind power forecasting. The results of the model are further analyzed and evaluated based on the wind power forecasting error in order to perform probabilistic forecasting. The method was validated on data obtained from a wind farm in China; it managed to perform better than other statistical methods, i.e. ARIMA and the persistence method, as well as artificial intelligence based techniques, in both deterministic and probabilistic wind power forecasting. Wang et al. [209] proposed a wind power forecasting method based on the Wavelet Transform, CNNs and an ensemble technique. Their method was compared with the persistence method and two shallow ANN architectures, i.e. Back-Propagation ANN (BPANN) and Support Vector Machine, on data sets collected from wind farms in China; the results validate that their method outperforms the benchmark approaches in terms of reliability, sharpness and overall skill. In [210] a DBN model in conjunction with the k-means clustering algorithm is proposed for wind power forecasting. The proposed approach demonstrated significantly increased forecasting accuracy compared to a BPANN and a Morlet wavelet neural network on data sets obtained from a wind farm in Spain. A data-driven multi-model wind forecasting methodology with deep feature selection is proposed in [211]. In particular, a two-layer ensemble technique is developed: the first layer comprises multiple machine learning models, which generate individual forecasts, and in the second layer a blending algorithm merges the forecasts derived during the first stage. Numerical results validate the efficiency of the proposed methodology compared to models employing a single algorithm. Finally, in [212] an approach is proposed for wind power forecasting which combines deep Autoencoders, DBNs and the concept of transfer learning. The method was tested on data sets containing power measurements and meteorological forecasts of wind components, obtained from wind farms in Europe. It was compared to commonly used baseline regression models, i.e. ARIMA and the Support Vector Regressor, and produced better results in terms of MAE, RMSE and SDE.

VI. DISCUSSIONS

In this paper we presented several deep learning architectures, starting from the foundational architectures up to recent developments, covering their modifications and evolution over time as well as their applications to specific domains. We provided a category-wise list of publicly available data repositories for deep learning practitioners in Table V. We discussed the blend of swarm intelligence and deep learning approaches, and how the influence of one enriches the other when applied to real-world problems. The rapidly growing use of deep learning architectures, especially in safety-critical systems, brings us to the question of how reliable the architectures are in providing decisions even in the presence of adversarial scenarios. To address this, we gave an overview of testing neural network architectures, various methods of adversarial test generation, and countermeasures to be adopted against adversarial examples. Next we moved on to specific applications of deep learning, including medical imaging, prognostics and health management, applications in financial services, financial time series forecasting and, lastly, applications in power systems. For each application, current research trends as well as future research directions were discussed. Table IV lists a collection of recent reviews in different fields of deep learning, including computer vision, forecasting, image processing, adversarial cases, autonomous vehicles, natural language processing, recommender systems and big data analytics.

VII. CONCLUSIONS AND FUTURE WORK

In conclusion, we highlight a few open areas of research and elaborate on some of the existing lines of thought and study addressing the challenges that lie within them.

• Challenges with scarcity of data: With the growing availability of data as well as powerful, distributed processing units, deep learning architectures can be successfully applied to major industrial problems. However, deep learning is traditionally big-data driven and lacks the efficiency to learn abstractions through clear verbal definitions [220] if not trained with billions of training samples. Moreover, the heavy reliance on Convolutional Neural Networks (CNNs), especially for video recognition purposes, could face exponential inefficiency leading to their demise [221]; this can be avoided by capsules [222], which capture critical spatial hierarchical relationships more efficiently than CNNs, with smaller data requirements. To make DL work with smaller available data sets, some of the approaches in use are data augmentation, transfer learning, recursive classification techniques and synthetic data generation. One-shot learning [223] is also opening new avenues to learn from very few training examples, and has already started showing progress in language processing and image classification tasks. Developing more generalized techniques in this domain, so that DL models can learn from sparse or fewer data representations, is a current research thrust.

• Adopting unsupervised approaches: A major thrust is towards combining deep learning with unsupervised learning methods. Systems that set their own goals [220] and develop problem-solving approaches of their own while exploring the environment are the future research direction, surpassing supervised approaches that require a lot of data a priori. Thus the thrust of AI research, including deep learning, is towards meta-learning, i.e., learning to learn, which involves automated model design and decision-making capabilities in the algorithms themselves, optimizing the ability to learn various tasks from little training data [224].

• Influence of cognitive neuroscience: Inspiration drawn from cognitive neuroscience and developmental psychology to decipher human behavioral patterns is able to bring about major breakthroughs in applications such as enabling artificial agents to learn spatial navigation on their own, something that comes naturally to most living beings [225].

• Neural networks and reinforcement learning: Meta-modeling approaches using Reinforcement Learning (RL) are being used to design problem-specific neural network architectures. In [226] the authors introduced MetaQNN, an RL-based meta-modeling algorithm to automatically generate CNN architectures for image classification, using Q-learning [227] with ε-greedy exploration (a minimal sketch of such a Q-learning update follows this list). AlphaGo, the computer program built by combining reinforcement learning and CNNs to play the game of Go, achieved great success by beating professional human Go players. Deep convolutional neural networks can also work as function approximators to predict 'Q' values in a reinforcement learning problem. Thus, a major thrust of current research is on the combination of deep neural networks and reinforcement learning.
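For reference, the tabular Q-learning update with ε-greedy exploration [227] that underlies such meta-modeling schemes can be sketched on a toy chain environment (the environment and constants here are illustrative):

import numpy as np

# Minimal tabular Q-learning; in MetaQNN-style schemes the "actions" would
# instead select layer types while assembling an architecture.
n_states, n_actions = 5, 2
Q = np.zeros((n_states, n_actions))
alpha, gamma, eps = 0.1, 0.9, 0.2
rng = np.random.default_rng(0)

def step(s, a):
    s2 = min(s + 1, n_states - 1) if a == 1 else max(s - 1, 0)
    return s2, float(s2 == n_states - 1)      # reward on reaching the end

for _ in range(2000):
    s = 0
    for _ in range(20):
        a = rng.integers(n_actions) if rng.random() < eps else int(Q[s].argmax())
        s2, r = step(s, a)
        Q[s, a] += alpha * (r + gamma * Q[s2].max() - Q[s, a])  # Bellman update
        s = s2

print(Q)   # action 1 (move right) should dominate in every state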


TABLE IV: A Collection of Recent Reviews on Deep Learning

Computer Vision:
Deep learning for visual understanding: A review [228]
Deep Learning for Computer Vision: A Brief Review [229]
A Survey on Deep Learning Methods for Robot Vision [230]
Deep Learning Advances in Computer Vision with 3D Data: A Survey [231]
Visualizations of Deep Neural Networks in Computer Vision: A Survey [232]

Forecasting:
Machine Learning in Financial Crisis Prediction: A Survey [233]
Deep Learning for Time-Series Analysis [234]
A Survey on Machine Learning and Statistical Techniques in Bankruptcy Prediction [235]
Time series forecasting using artificial neural networks methodologies: A systematic review [236]
A Review of Deep Learning Methods Applied on Load Forecasting [237]
Trends in Machine Learning Applied to Demand & Sales Forecasting: A Review [238]
A survey on retail sales forecasting and prediction in fashion markets [239]
Electric load forecasting: Literature survey and classification of methods [240]
A review of unsupervised feature learning and deep learning for time-series modeling [241]

Image Processing:
A Survey on Deep Learning in Medical Image Analysis [213]
A Comprehensive Survey of Deep Learning for Image Captioning [242]
Biological image analysis using deep learning-based methods: Literature review [243]
Deep learning for remote sensing image classification: A survey [244]
Deep Convolutional Neural Networks for Image Classification: A Comprehensive Review [245]
Deep Learning for Medical Image Processing: Overview, Challenges and the Future [246]
An overview of deep learning in medical imaging focusing on MRI [247]
Deep Learning in Medical Ultrasound Analysis: A Review [248]

Adversarial Cases:
Threat of Adversarial Attacks on Deep Learning in Computer Vision: A Survey [249]
Adversarial Learning in Statistical Classification: A Comprehensive Review of Defenses Against Attacks [250]
Adversarial Attacks and Defenses Against Deep Neural Networks: A Survey [251]
Adversarial Machine Learning: A Literature Review [252]
Review of Artificial Intelligence Adversarial Attack and Defense Technologies [253]
A Survey of Adversarial Machine Learning in Cyber Warfare [254]

Autonomous Vehicles:
Planning and Decision-Making for Autonomous Vehicles [255]
A Review of Deep Learning Methods and Applications for Unmanned Aerial Vehicles [256]
MIT Autonomous Vehicle Technology Study: Large-Scale Deep Learning Based Analysis of Driver Behavior and Interaction with Automation [257]
Perception, Planning, Control, and Coordination for Autonomous Vehicles [258]
Survey of neural networks in autonomous driving [259]
Self-Driving Cars: A Survey [260]


Natural Language Processing:
Recent Trends in Deep Learning Based Natural Language Processing [261]
A Survey of the Usages of Deep Learning in Natural Language Processing [262]
A survey on the state-of-the-art machine learning models in the context of NLP [263]
Inflectional Review of Deep Learning on Natural Language Processing [264]
Deep learning for natural language processing: advantages and challenges [265]
Deep Learning for Natural Language Processing [266]

Recommender Systems:
Deep Learning based Recommender System: A Survey and New Perspectives [267]
A Survey of Recommender Systems Based on Deep Learning [268]
A review on deep learning for recommender systems: challenges and remedies [269]
Deep Learning Methods on Recommender System: A Survey of State-of-the-art [270]
Deep Learning-Based Recommendation: Current Issues and Challenges [271]
A Survey and Critique of Deep Learning on Recommender Systems [272]

Big Data Analytics:
Efficient Machine Learning for Big Data: A Review [273]
A survey on deep learning for big data [274]
A Survey on Data Collection for Machine Learning: a Big Data - AI Integration Perspective [275]
A survey of machine learning for big data processing [276]
Deep learning in big data Analytics: A comparative study [277]
Deep learning applications and challenges in big data analytics [278]


TABLE V: A Collection of Data Repositories for Deep Learning Practitioners

Image Datasets:
MNIST: http://yann.lecun.com/exdb/mnist/
CIFAR-100: http://www.cs.utoronto.ca/~kriz/cifar.html
Caltech 101: http://www.vision.caltech.edu/Image_Datasets/Caltech101/
Caltech 256: http://www.vision.caltech.edu/Image_Datasets/Caltech256/
Imagenet: http://www.image-net.org/
COIL100: http://www1.cs.columbia.edu/CAVE/software/softlib/coil-100.php
STL-10: http://www.stanford.edu/~acoates//stl10/
Google Open Images: https://ai.googleblog.com/2016/09/introducing-open-images-dataset.html
Labelme: http://labelme.csail.mit.edu/Release3.0/browserTools/php/dataset.php

Speech Datasets:
Google Audioset: https://research.google.com/audioset/dataset/index.html
TIMIT: http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC93S1
VoxForge: http://www.voxforge.org/
CHIME: http://spandh.dcs.shef.ac.uk/chime_challenge/data.html
2000 HUB5 English: https://catalog.ldc.upenn.edu/LDC2002T43
LibriSpeech: http://www.openslr.org/12/
VoxCeleb: http://www.robots.ox.ac.uk/~vgg/data/voxceleb/
Open SLR: https://www.openslr.org/51
CALLHOME American English Speech: https://catalog.ldc.upenn.edu/LDC97S42

Text Datasets:
English Broadcast News: https://catalog.ldc.upenn.edu/LDC97S44
SQuAD: https://rajpurkar.github.io/SQuAD-explorer/
Billion Word Dataset: http://www.statmt.org/lm-benchmark/
20 Newsgroups: http://qwone.com/~jason/20Newsgroups/
Google Books Ngrams: https://aws.amazon.com/datasets/google-books-ngrams/
UCI Spambase: https://archive.ics.uci.edu/ml/datasets/Spambase
Common Crawl: http://commoncrawl.org/the-data/
Yelp Open Dataset: https://www.yelp.com/dataset

Natural Language Datasets:
Web 1T 5-gram: https://catalog.ldc.upenn.edu/LDC2006T13
Blizzard Challenge 2018: https://www.synsig.org/index.php/Blizzard_Challenge_2018
Flickr personal taxonomies: https://www.isi.edu/~lerman/downloads/flickr/flickr_taxonomies.html
Multi-Domain Sentiment Dataset: http://www.cs.jhu.edu/~mdredze/datasets/sentiment/
Enron Email Dataset: https://www.cs.cmu.edu/~./enron/
Blogger Corpus: http://u.cs.biu.ac.il/~koppel/BlogCorpus.htm
Wikipedia Links Data: https://code.google.com/archive/p/wiki-links/downloads
Gutenberg eBooks List: http://www.gutenberg.org/wiki/Gutenberg:Offline_Catalogs
SMS Spam Collection: http://www.dt.fee.unicamp.br/~tiago/smsspamcollection/
UCI's Spambase data: https://archive.ics.uci.edu/ml/datasets/Spambase

Geospatial Datasets:
OpenStreetMap: https://www.openstreetmap.org
Landsat8: https://landsat.gsfc.nasa.gov/landsat-8/
NEXRAD: https://www.ncdc.noaa.gov/data-access/radar-data/nexrad
ESRI Open Data: https://hub.arcgis.com/pages/open-data
USGS EarthExplorer: https://earthexplorer.usgs.gov/
OpenTopography: https://opentopography.org/
NASA SEDAC: https://sedac.ciesin.columbia.edu/
NASA Earth Observations: https://neo.sci.gsfc.nasa.gov/
Terra Populus: https://terra.ipums.org/


Recommender Systems Datasets:
Movielens: https://grouplens.org/datasets/movielens/
Million Song Dataset: https://www.kaggle.com/c/msdchallenge
Last.fm: https://grouplens.org/datasets/hetrec-2011/
Book-Crossing Dataset: http://www2.informatik.uni-freiburg.de/~cziegler/BX/
Jester: https://goldberg.berkeley.edu/jester-data/
Netflix Prize: https://www.netflixprize.com/
Pinterest Fashion Compatibility: http://cseweb.ucsd.edu/~jmcauley/datasets.html#pinterest
Amazon Question and Answer Data: http://cseweb.ucsd.edu/~jmcauley/datasets.html#amazon_qa
Social Circles Data: http://cseweb.ucsd.edu/~jmcauley/datasets.html#socialcircles

Economics and Finance Datasets:
Quandl: https://www.quandl.com/
World Bank Open Data: https://data.worldbank.org/
IMF Data: https://www.imf.org/en/Data
Financial Times Market Data: https://markets.ft.com/data/
Google Trends: https://trends.google.com/trends/?q=google&ctab=0&geo=all&date=all&sort=0
American Economic Association: https://www.aeaweb.org/resources/data/us-macro-regional
US Stock Data: https://github.com/eliangcs/pystock-data
World Factbook: https://www.cia.gov/library/publications/download/
Dow Jones Index Data Set: http://archive.ics.uci.edu/ml/datasets/Dow+Jones+Index

Autonomous Vehicles Datasets:
BDD100k: https://bdd-data.berkeley.edu/
Baidu Apolloscapes: http://apolloscape.auto/
Comma.ai: https://archive.org/details/comma-dataset
Oxford's Robotic Car: https://robotcar-dataset.robots.ox.ac.uk/
Cityscape Dataset: https://www.cityscapes-dataset.com/
CSSAD Dataset: http://aplicaciones.cimat.mx/Personal/jbhayet/ccsad-dataset
KUL Belgium Traffic Sign Dataset: http://www.vision.ee.ethz.ch/~timofter/traffic_signs/
LISA: http://cvrr.ucsd.edu/LISA/datasets.html
Bosch Small Traffic Light: https://hci.iwr.uni-heidelberg.de/node/6132
LaRa Traffic Light Recognition: http://www.lara.prd.fr/benchmarks/trafficlightsrecognition
WPI Datasets: http://computing.wpi.edu/dataset.html



This review has aimed to help both the beginner and the practitioner in the field make informed choices; it has presented an in-depth analysis of some recent deep learning architectures as well as an exploratory dissection of some pertinent application areas. It is the authors' hope that readers find the material engaging and informative, and feedback is openly encouraged to make the organization and content of this article better aligned with a formal extension of the literature within the deep learning community.

REFERENCES

[1] M. van Gerven and S. Bohte, "Editorial: Artificial neural networks as models of neural information processing," Frontiers in Computational Neuroscience, vol. 11, p. 114, 2017. [Online]. Available: https://www.frontiersin.org/article/10.3389/fncom.2017.00114
[2] W. S. McCulloch and W. Pitts, "A logical calculus of the ideas immanent in nervous activity," The Bulletin of Mathematical Biophysics, vol. 5, no. 4, pp. 115–133, Dec 1943. [Online]. Available: https://doi.org/10.1007/BF02478259
[3] Y. LeCun, Y. Bengio, and G. Hinton, "Deep learning," Nature, vol. 521, no. 7553, p. 436, 2015.
[4] S. Lawrence, C. L. Giles, and A. D. Back, "Face recognition: a convolutional neural-network approach," IEEE Transactions on Neural Networks, vol. 8, no. 1, pp. 98–113, Jan 1997.
[5] J. Long, E. Shelhamer, and T. Darrell, "Fully convolutional networks for semantic segmentation," in 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2015, pp. 3431–3440.
[6] J. Donahue, L. A. Hendricks, M. Rohrbach, S. Venugopalan, S. Guadarrama, K. Saenko, and T. Darrell, "Long-term recurrent convolutional networks for visual recognition and description," IEEE Trans. Pattern Anal. Mach. Intell., vol. 39, no. 4, pp. 677–691, Apr. 2017. [Online]. Available: https://doi.org/10.1109/TPAMI.2016.2599174
[7] X. Wu, R. He, and Z. Sun, "A lightened CNN for deep face representation," CoRR, vol. abs/1511.02683, 2015. [Online]. Available: http://arxiv.org/abs/1511.02683
[8] A. Diba, V. Sharma, A. Pazandeh, H. Pirsiavash, and L. V. Gool, "Weakly supervised cascaded convolutional networks," in 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), July 2017, pp. 5131–5139.
[9] W. Ouyang, X. Zeng, X. Wang, S. Qiu, P. Luo, Y. Tian, H. Li, S. Yang, Z. Wang, H. Li, K. Wang, J. Yan, C. Loy, and X. Tang, "DeepID-Net: Object detection with deformable part based convolutional neural networks," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 39, no. 7, pp. 1320–1334, July 2017.
[10] G. Cybenko, "Approximation by superpositions of a sigmoidal function," Mathematics of Control, Signals and Systems, vol. 2, no. 4, pp. 303–314, Dec 1989. [Online]. Available: https://doi.org/10.1007/BF02551274
[11] K. Hornik, "Approximation capabilities of multilayer feedforward networks," Neural Networks, vol. 4, no. 2, pp. 251–257, 1991. [Online]. Available: http://www.sciencedirect.com/science/article/pii/089360809190009T
[12] Z. Lu, H. Pu, F. Wang, Z. Hu, and L. Wang, "The expressive power of neural networks: A view from the width," CoRR, vol. abs/1709.02540, 2017. [Online]. Available: http://arxiv.org/abs/1709.02540
[13] B. Hanin, "Universal function approximation by deep neural nets with bounded width and ReLU activations," 08 2017. [Online]. Available: https://arxiv.org/abs/1708.02691
[14] J. Schmidhuber, "Deep learning in neural networks: An overview," Neural Networks, vol. 61, pp. 85–117, 2015. [Online]. Available: http://www.sciencedirect.com/science/article/pii/S0893608014002135
[15] G. Marcus, "Deep learning: A critical appraisal," arXiv e-prints, p. arXiv:1801.00631, Jan. 2018.
[16] N. Papernot, P. McDaniel, S. Jha, M. Fredrikson, Z. B. Celik, and A. Swami, "The limitations of deep learning in adversarial settings," in 2016 IEEE European Symposium on Security and Privacy (EuroS&P), March 2016, pp. 372–387.
[17] E. Abbe and C. Sandon, "Provable limitations of deep learning," arXiv e-prints, p. arXiv:1812.06369, Dec. 2018.
[18] F. Rosenblatt, "The perceptron: A probabilistic model for information storage and organization in the brain," Psychological Review, vol. 65, no. 6, pp. 386–408, 1958.
[19] R. Winter and B. Widrow, "Madaline rule II: a training algorithm for neural networks," in IEEE 1988 International Conference on Neural Networks, July 1988, pp. 401–408 vol. 1.
[20] B. Widrow and M. A. Lehr, "30 years of adaptive neural networks: perceptron, madaline, and backpropagation," Proceedings of the IEEE, vol. 78, no. 9, pp. 1415–1442, Sep. 1990.
[21] M. Minsky and S. Papert, "Perceptrons - an introduction to computational geometry," 1969.
[22] P. J. Werbos, The Roots of Backpropagation: From Ordered Derivatives to Neural Networks and Political Forecasting. John Wiley & Sons, 1994, vol. 1.
[23] J. J. Hopfield, "Neural networks and physical systems with emergent collective computational abilities," Proceedings of the National Academy of Sciences, vol. 79, no. 8, pp. 2554–2558, 1982. [Online]. Available: https://www.pnas.org/content/79/8/2554
[24] D. E. Rumelhart, G. E. Hinton, and R. J. Williams, "Learning internal representations by error propagation," in Parallel Distributed Processing: Explorations in the Microstructure of Cognition, Vol. 1, D. E. Rumelhart, J. L. McClelland, and the PDP Research Group, Eds. Cambridge, MA, USA: MIT Press, 1986, pp. 318–362. [Online]. Available: http://dl.acm.org/citation.cfm?id=104279.104293
[25] C. Cortes and V. Vapnik, "Support-vector networks," Machine Learning, vol. 20, no. 3, pp. 273–297, Sep 1995. [Online]. Available: https://doi.org/10.1023/A:1022627411411
[26] S. Hochreiter and J. Schmidhuber, "Long short-term memory," Neural Computation, vol. 9, no. 8, pp. 1735–1780, 1997. [Online]. Available: https://doi.org/10.1162/neco.1997.9.8.1735
[27] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, "Gradient-based learning applied to document recognition," Proceedings of the IEEE, vol. 86, no. 11, pp. 2278–2324, 1998.
[28] G. E. Hinton, S. Osindero, and Y.-W. Teh, "A fast learning algorithm for deep belief nets," Neural Computation, vol. 18, no. 7, pp. 1527–1554, 2006.
[29] A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin, A. Desmaison, L. Antiga, and A. Lerer, "Automatic differentiation in PyTorch," in NIPS-W, 2017.
[30] M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G. S. Corrado, A. Davis, J. Dean, M. Devin, S. Ghemawat, I. Goodfellow, A. Harp, G. Irving, M. Isard, Y. Jia, R. Jozefowicz, L. Kaiser, M. Kudlur, J. Levenberg, D. Mane, R. Monga, S. Moore, D. Murray, C. Olah, M. Schuster, J. Shlens, B. Steiner, I. Sutskever, K. Talwar, P. Tucker, V. Vanhoucke, V. Vasudevan, F. Viegas, O. Vinyals, P. Warden, M. Wattenberg, M. Wicke, Y. Yu, and X. Zheng, "TensorFlow: Large-scale machine learning on heterogeneous systems," 2015, software available from tensorflow.org. [Online]. Available: http://tensorflow.org/
[31] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell, "Caffe: Convolutional architecture for fast feature embedding," in Proceedings of the 22nd ACM International Conference on Multimedia, ser. MM '14. New York, NY, USA: ACM, 2014, pp. 675–678. [Online]. Available: http://doi.acm.org/10.1145/2647868.2654889
[32] S. Tokui, K. Oono, S. Hido, and J. Clayton, "Chainer: a next-generation open source framework for deep learning," in Proceedings of Workshop on Machine Learning Systems (LearningSys) in The Twenty-ninth Annual Conference on Neural Information Processing Systems (NIPS), 2015. [Online]. Available: http://learningsys.org/papers/LearningSys_2015_paper_33.pdf
[33] F. Chollet et al., "Keras," https://github.com/fchollet/keras, 2015.
[34] J. Dai, Y. Wang, X. Qiu, D. Ding, Y. Zhang, Y. Wang, X. Jia, C. Zhang, Y. Wan, Z. Li, J. Wang, S. Huang, Z. Wu, Y. Wang, Y. Yang, B. She, D. Shi, Q. Lu, K. Huang, and G. Song, "BigDL: A distributed deep learning framework for big data," CoRR, vol. abs/1804.05839, 2018.
[35] Theano Development Team, "Theano: A Python framework for fast computation of mathematical expressions," arXiv e-prints, vol. abs/1605.02688, May 2016. [Online]. Available: http://arxiv.org/abs/1605.02688
[36] F. Seide and A. Agarwal, "CNTK: Microsoft's open-source deep-learning toolkit," in Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ser. KDD '16. New York, NY, USA: ACM, 2016, pp. 2135–2135. [Online]. Available: http://doi.acm.org/10.1145/2939672.2945397


[37] S. Kombrink, T. Mikolov, M. Karafiat, and L. Burget, "Recurrent neural network based language modeling in meeting recognition," in Twelfth Annual Conference of the International Speech Communication Association, 2011.
[38] L. Deng, D. Yu, et al., "Deep learning: methods and applications," Foundations and Trends® in Signal Processing, vol. 7, no. 3–4, pp. 197–387, 2014.
[39] Y. Bengio et al., "Learning deep architectures for AI," Foundations and Trends® in Machine Learning, vol. 2, no. 1, pp. 1–127, 2009.
[40] D. E. Rumelhart, G. E. Hinton, and R. J. Williams, "Learning internal representations by error propagation," California Univ San Diego La Jolla Inst for Cognitive Science, Tech. Rep., 1985.
[41] Y. Bengio, P. Lamblin, D. Popovici, and H. Larochelle, "Greedy layer-wise training of deep networks," in Advances in Neural Information Processing Systems, 2007, pp. 153–160.
[42] Y. Bengio, N. Boulanger-Lewandowski, and R. Pascanu, "Advances in optimizing recurrent networks," in 2013 IEEE International Conference on Acoustics, Speech and Signal Processing. IEEE, 2013, pp. 8624–8628.
[43] G. E. Dahl, T. N. Sainath, and G. E. Hinton, "Improving deep neural networks for LVCSR using rectified linear units and dropout," in Acoustics, Speech and Signal Processing (ICASSP), 2013 IEEE International Conference on. IEEE, 2013, pp. 8609–8613.
[44] S. Ioffe and C. Szegedy, "Batch normalization: Accelerating deep network training by reducing internal covariate shift," arXiv preprint arXiv:1502.03167, 2015.
[45] D. Sussillo and L. Abbott, "Random walk initialization for training very deep feedforward networks," arXiv preprint arXiv:1412.6558, 2014.
[46] D. Mishkin and J. Matas, "All you need is a good init," arXiv preprint arXiv:1511.06422, 2015.
[47] X. Glorot and Y. Bengio, "Understanding the difficulty of training deep feedforward neural networks," in Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, 2010, pp. 249–256.
[48] S. K. Kumar, "On weight initialization in deep neural networks," arXiv preprint arXiv:1704.08863, 2017.
[49] A. L. Maas, A. Y. Hannun, and A. Y. Ng, "Rectifier nonlinearities improve neural network acoustic models," in Proc. ICML, vol. 30, no. 1, 2013, p. 3.
[50] A. Fischer and C. Igel, "An introduction to restricted Boltzmann machines," in Iberoamerican Congress on Pattern Recognition. Springer, 2012, pp. 14–36.
[51] P. Smolensky, "Information processing in dynamical systems: Foundations of harmony theory," Colorado Univ at Boulder Dept of Computer Science, Tech. Rep., 1986.
[52] G. E. Hinton, "Training products of experts by minimizing contrastive divergence," Neural Computation, vol. 14, no. 8, pp. 1771–1800, 2002.
[53] A. Coates, A. Ng, and H. Lee, "An analysis of single-layer networks in unsupervised feature learning," in Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, 2011, pp. 215–223.
[54] G. E. Hinton and R. R. Salakhutdinov, "Reducing the dimensionality of data with neural networks," Science, vol. 313, no. 5786, pp. 504–507, 2006.
[55] H. Larochelle and Y. Bengio, "Classification using discriminative restricted Boltzmann machines," in Proceedings of the 25th International Conference on Machine Learning. ACM, 2008, pp. 536–543.
[56] R. Salakhutdinov, A. Mnih, and G. Hinton, "Restricted Boltzmann machines for collaborative filtering," in Proceedings of the 24th International Conference on Machine Learning. ACM, 2007, pp. 791–798.
[57] I. Sutskever and G. Hinton, "Learning multilevel distributed representations for high-dimensional sequences," in Artificial Intelligence and Statistics, 2007, pp. 548–555.
[58] G. W. Taylor, G. E. Hinton, and S. T. Roweis, "Modeling human motion using binary latent variables," in Advances in Neural Information Processing Systems, 2007, pp. 1345–1352.
[59] R. Memisevic and G. Hinton, "Unsupervised learning of image transformations," in Computer Vision and Pattern Recognition, 2007. CVPR '07. IEEE Conference on. IEEE, 2007, pp. 1–8.
[60] H. Lee, R. Grosse, R. Ranganath, and A. Y. Ng, "Convolutional deep belief networks for scalable unsupervised learning of hierarchical representations," in Proceedings of the 26th Annual International Conference on Machine Learning. ACM, 2009, pp. 609–616.
[61] H. Lee, P. Pham, Y. Largman, and A. Y. Ng, "Unsupervised feature learning for audio classification using convolutional deep belief networks," in Advances in Neural Information Processing Systems, 2009, pp. 1096–1104.
[62] G. Dahl, A.-r. Mohamed, G. E. Hinton, et al., "Phone recognition with the mean-covariance restricted Boltzmann machine," in Advances in Neural Information Processing Systems, 2010, pp. 469–477.
[63] G. E. Hinton et al., "Modeling pixel means and covariances using factorized third-order Boltzmann machines," in Computer Vision and Pattern Recognition (CVPR), 2010 IEEE Conference on. IEEE, 2010, pp. 2551–2558.
[64] A.-r. Mohamed, G. Hinton, and G. Penn, "Understanding how deep belief networks perform acoustic modelling," in Acoustics, Speech and Signal Processing (ICASSP), 2012 IEEE International Conference on. IEEE, 2012, pp. 4273–4276.
[65] I. Sutskever, G. E. Hinton, and G. W. Taylor, "The recurrent temporal restricted Boltzmann machine," in Advances in Neural Information Processing Systems, 2009, pp. 1601–1608.
[66] G. W. Taylor and G. E. Hinton, "Factored conditional restricted Boltzmann machines for modeling motion style," in Proceedings of the 26th Annual International Conference on Machine Learning. ACM, 2009, pp. 1025–1032.
[67] G. E. Hinton, "A practical guide to training restricted Boltzmann machines," in Neural Networks: Tricks of the Trade. Springer, 2012, pp. 599–619.
[68] I. Goodfellow, Y. Bengio, and A. Courville, Deep Learning. MIT Press, Cambridge, 2016, vol. 1.
[69] N. Le Roux and Y. Bengio, "Representational power of restricted Boltzmann machines and deep belief networks," Neural Computation, vol. 20, no. 6, pp. 1631–1649, 2008.
[70] G. E. Hinton et al., "What kind of graphical model is the brain?" in IJCAI, vol. 5, 2005, pp. 1765–1775.
[71] C. Poultney, S. Chopra, Y. L. Cun, et al., "Efficient learning of sparse representations with an energy-based model," in Advances in Neural Information Processing Systems, 2007, pp. 1137–1144.
[72] P. Sermanet, K. Kavukcuoglu, S. Chintala, and Y. LeCun, "Pedestrian detection with unsupervised multi-stage feature learning," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2013, pp. 3626–3633.
[73] A.-r. Mohamed, G. E. Dahl, G. Hinton, et al., "Acoustic modeling using deep belief networks," IEEE Trans. Audio, Speech & Language Processing, vol. 20, no. 1, pp. 14–22, 2012.
[74] D. Erhan, Y. Bengio, A. Courville, P.-A. Manzagol, P. Vincent, and S. Bengio, "Why does unsupervised pre-training help deep learning?" Journal of Machine Learning Research, vol. 11, no. Feb, pp. 625–660, 2010.
[75] S. M. Siniscalchi, J. Li, and C.-H. Lee, "Hermitian polynomial for speaker adaptation of connectionist speech recognition systems," IEEE Transactions on Audio, Speech, and Language Processing, vol. 21, no. 10, pp. 2152–2161, 2013.
[76] S. M. Siniscalchi, D. Yu, L. Deng, and C.-H. Lee, "Exploiting deep neural networks for detection-based speech recognition," Neurocomputing, vol. 106, pp. 148–157, 2013.
[77] D. Yu, S. M. Siniscalchi, L. Deng, and C.-H. Lee, "Boosting attribute and phone estimation accuracies with deep neural networks for detection-based speech recognition," in Acoustics, Speech and Signal Processing (ICASSP), 2012 IEEE International Conference on. IEEE, 2012, pp. 4169–4172.
[78] H. Lee, R. Grosse, R. Ranganath, and A. Y. Ng, "Unsupervised learning of hierarchical representations with convolutional deep belief networks," Communications of the ACM, vol. 54, no. 10, pp. 95–103, 2011.
[79] J. Susskind, V. Mnih, G. Hinton, et al., "On deep generative models with applications to recognition," in Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on. IEEE, 2011, pp. 2857–2864.
[80] V. Stoyanov, A. Ropson, and J. Eisner, "Empirical risk minimization of graphical model parameters given approximate inference, decoding, and model structure," in Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, 2011, pp. 725–733.
[81] R. Salakhutdinov and G. Hinton, "Semantic hashing," International Journal of Approximate Reasoning, vol. 50, no. 7, pp. 969–978, 2009.
[82] L. Deng, M. L. Seltzer, D. Yu, A. Acero, A.-r. Mohamed, and G. Hinton, "Binary coding of speech spectrograms using a deep auto-encoder," in Eleventh Annual Conference of the International Speech Communication Association, 2010.
[83] L. Deng, "The MNIST database of handwritten digit images for machine learning research [best of the web]," IEEE Signal Processing Magazine, vol. 29, no. 6, pp. 141–142, 2012.


[84] J. Ngiam, Z. Chen, P. W. Koh, and A. Y. Ng, "Learning deep energy models," in Proceedings of the 28th International Conference on Machine Learning (ICML-11), 2011, pp. 1105–1112.
[85] J. Ngiam, A. Khosla, M. Kim, J. Nam, H. Lee, and A. Y. Ng, "Multimodal deep learning," in Proceedings of the 28th International Conference on Machine Learning (ICML-11), 2011, pp. 689–696.
[86] D. P. Kingma and M. Welling, "Auto-encoding variational Bayes," arXiv preprint arXiv:1312.6114, 2013.
[87] G. Alain and Y. Bengio, "What regularized auto-encoders learn from the data-generating distribution," The Journal of Machine Learning Research, vol. 15, no. 1, pp. 3563–3593, 2014.
[88] Y. Bengio, A. Courville, and P. Vincent, "Representation learning: A review and new perspectives," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 35, no. 8, pp. 1798–1828, 2013.
[89] Y. Bengio, E. Laufer, G. Alain, and J. Yosinski, "Deep generative stochastic networks trainable by backprop," in International Conference on Machine Learning, 2014, pp. 226–234.
[90] Y. Bengio, "Deep learning of representations: Looking forward," in International Conference on Statistical Language and Speech Processing. Springer, 2013, pp. 1–37.
[91] P. Vincent, "A connection between score matching and denoising autoencoders," Neural Computation, vol. 23, no. 7, pp. 1661–1674, 2011.
[92] P. Vincent, H. Larochelle, I. Lajoie, Y. Bengio, and P.-A. Manzagol, "Stacked denoising autoencoders: Learning useful representations in a deep network with a local denoising criterion," Journal of Machine Learning Research, vol. 11, no. Dec, pp. 3371–3408, 2010.
[93] G. E. Hinton, N. Srivastava, A. Krizhevsky, I. Sutskever, and R. R. Salakhutdinov, "Improving neural networks by preventing co-adaptation of feature detectors," arXiv preprint arXiv:1207.0580, 2012.
[94] C. Doersch, "Tutorial on variational autoencoders," arXiv preprint arXiv:1606.05908, 2016.
[95] G. E. Hinton, "A better way to learn features: technical perspective," Communications of the ACM, vol. 54, no. 10, pp. 94–94, 2011.
[96] G. E. Hinton, A. Krizhevsky, and S. D. Wang, "Transforming auto-encoders," in International Conference on Artificial Neural Networks. Springer, 2011, pp. 44–51.
[97] Q. V. Le, "Building high-level features using large scale unsupervised learning," in Acoustics, Speech and Signal Processing (ICASSP), 2013 IEEE International Conference on. IEEE, 2013, pp. 8595–8598.
[98] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, "Gradient-based learning applied to document recognition," Proceedings of the IEEE, vol. 86, no. 11, pp. 2278–2324, Nov 1998.
[99] A. Krizhevsky, I. Sutskever, and G. E. Hinton, "ImageNet classification with deep convolutional neural networks," in Proceedings of the 25th International Conference on Neural Information Processing Systems - Volume 1, ser. NIPS'12. USA: Curran Associates Inc., 2012, pp. 1097–1105. [Online]. Available: http://dl.acm.org/citation.cfm?id=2999134.2999257
[100] Y. LeCun, Y. Bengio, and G. E. Hinton, "Deep learning," Nature, vol. 521, no. 7553, pp. 436–444, 2015. [Online]. Available: https://doi.org/10.1038/nature14539
[101] I. Goodfellow, Y. Bengio, and A. Courville, Deep Learning. MIT Press, 2016, http://www.deeplearningbook.org.
[102] A. Sherstinsky, "Fundamentals of recurrent neural network (RNN) and long short-term memory (LSTM) network," CoRR, vol. abs/1808.03314, 2018.
[103] S. Hochreiter and J. Schmidhuber, "Long short-term memory," Neural Computation, vol. 9, pp. 1735–80, 12 1997.
[104] F. A. Gers and J. Schmidhuber, "Recurrent nets that time and count," in Proceedings of the IEEE-INNS-ENNS International Joint Conference on Neural Networks. IJCNN 2000. Neural Computing: New Challenges and Perspectives for the New Millennium, vol. 3, July 2000, pp. 189–194 vol. 3.
[105] J. Chung, Ç. Gülçehre, K. Cho, and Y. Bengio, "Empirical evaluation of gated recurrent neural networks on sequence modeling," CoRR, vol. abs/1412.3555, 2014.
[106] H. Sak, A. W. Senior, and F. Beaufays, "Long short-term memory recurrent neural network architectures for large scale acoustic modeling," in INTERSPEECH, 2014.
[107] P. Doetsch, M. Kozielski, and H. Ney, "Fast and robust training of recurrent neural networks for offline handwriting recognition," 2014 14th International Conference on Frontiers in Handwriting Recognition, pp. 279–284, 2014.
[108] H. Palangi, L. Deng, Y. Shen, J. Gao, X. He, J. Chen, X. Song, and R. K. Ward, "Deep sentence embedding using long short-term memory networks: Analysis and application to information retrieval," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 24, pp. 694–707, 2016.
[109] I. J. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. C. Courville, and Y. Bengio, "Generative adversarial nets," in NIPS, 2014.
[110] T. Salimans, I. Goodfellow, W. Zaremba, V. Cheung, A. Radford, and X. Chen, "Improved techniques for training GANs," in Proceedings of the 30th International Conference on Neural Information Processing Systems, ser. NIPS'16. USA: Curran Associates Inc., 2016, pp. 2234–2242. [Online]. Available: http://dl.acm.org/citation.cfm?id=3157096.3157346
[111] R. Groß, Y. Gu, W. Li, and M. Gauci, "Generalizing GANs: A Turing perspective," in NIPS, 2017.
[112] J. Wu, C. Zhang, T. Xue, W. T. Freeman, and J. B. Tenenbaum, "Learning a probabilistic latent space of object shapes via 3D generative-adversarial modeling," in NIPS, 2016.
[113] C. Vondrick, H. Pirsiavash, and A. Torralba, "Generating videos with scene dynamics," in NIPS, 2016.
[114] S. E. Reed, Z. Akata, X. Yan, L. Logeswaran, B. Schiele, and H. Lee, "Generative adversarial text to image synthesis," in ICML, 2016.
[115] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei, "ImageNet Large Scale Visual Recognition Challenge," International Journal of Computer Vision (IJCV), vol. 115, no. 3, pp. 211–252, 2015.
[116] M. D. Zeiler and R. Fergus, "Visualizing and understanding convolutional networks," CoRR, vol. abs/1311.2901, 2013. [Online]. Available: http://arxiv.org/abs/1311.2901
[117] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. E. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich, "Going deeper with convolutions," CoRR, vol. abs/1409.4842, 2014. [Online]. Available: http://arxiv.org/abs/1409.4842
[118] K. Simonyan and A. Zisserman, "Very deep convolutional networks for large-scale image recognition," CoRR, vol. abs/1409.1556, 2014. [Online]. Available: http://arxiv.org/abs/1409.1556
[119] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," CoRR, vol. abs/1512.03385, 2015. [Online]. Available: http://arxiv.org/abs/1512.03385
[120] L. N. Smith, "A disciplined approach to neural network hyper-parameters: Part 1 - learning rate, batch size, momentum, and weight decay," CoRR, vol. abs/1803.09820, 2018. [Online]. Available: http://arxiv.org/abs/1803.09820
[121] F. Assunção, N. Lourenço, P. Machado, and B. Ribeiro, "DENSER: deep evolutionary network structured representation," Genetic Programming and Evolvable Machines, pp. 1–31, 2018.
[122] B. A. Garro and R. A. Vazquez, "Designing artificial neural networks using particle swarm optimization algorithms," in Comp. Int. and Neurosc., 2015.
[123] G. Das, P. K. Pattnaik, and S. K. Padhy, "Artificial neural network trained by particle swarm optimization for non-linear channel equalization," Expert Systems with Applications, vol. 41, no. 7, pp. 3491–3496, 2014. [Online]. Available: http://www.sciencedirect.com/science/article/pii/S0957417413008701
[124] B. Wang, Y. Sun, B. Xue, and M. Zhang, "Evolving deep convolutional neural networks by variable-length particle swarm optimization for image classification," 2018 IEEE Congress on Evolutionary Computation (CEC), pp. 1–8, 2018.
[125] S. Sengupta, S. Basak, and R. A. Peters, "Particle swarm optimization: A survey of historical and recent developments with hybridization perspectives," Machine Learning and Knowledge Extraction, vol. 1, no. 1, pp. 157–191, 2018. [Online]. Available: http://www.mdpi.com/2504-4990/1/1/10
[126] S. Sengupta, S. Basak, and R. A. Peters, "QDDS: A novel quantum swarm algorithm inspired by a double Dirac delta potential," in 2018 IEEE Symposium Series on Computational Intelligence (SSCI), Nov 2018, pp. 704–711.
[127] S. Sengupta, S. Basak, and R. A. Peters, "Chaotic quantum double delta swarm algorithm using Chebyshev maps: Theoretical foundations, performance analyses and convergence issues," Journal of Sensor and Actuator Networks, vol. 8, no. 1, 2019. [Online]. Available: http://www.mdpi.com/2224-2708/8/1/9
[128] M. Huttenrauch, A. Sosic, and G. Neumann, "Deep reinforcement learning for swarm systems," CoRR, vol. abs/1807.06613, 2018.
[129] C. Anderson, A. V. Mayrhauser, and R. Mraz, "On the use of neural networks to guide software testing activities," in Proceedings of 1995 IEEE International Test Conference (ITC), Oct 1995, pp. 720–729.


[130] T. M. Khoshgoftaar and R. M. Szabo, "Using neural networks to predict software faults during testing," IEEE Transactions on Reliability, vol. 45, no. 3, pp. 456–462, Sep. 1996.
[131] M. Vanmali, M. Last, and A. Kandel, "Using a neural network in the software testing process," Int. J. Intell. Syst., vol. 17, pp. 45–62, 2002.
[132] Y. Sun, X. Huang, and D. Kroening, "Testing deep neural networks," CoRR, vol. abs/1803.04792, 2018.
[133] G. Katz, C. W. Barrett, D. L. Dill, K. Julian, and M. J. Kochenderfer, "Towards proving the adversarial robustness of deep neural networks," in FVAV@iFM, 2017.
[134] X. Huang, M. Z. Kwiatkowska, S. Wang, and M. Wu, "Safety verification of deep neural networks," in CAV, 2017.
[135] C. E. Tuncali, G. Fainekos, H. Ito, and J. Kapinski, "Simulation-based adversarial test generation for autonomous vehicles with machine learning components," 2018 IEEE Intelligent Vehicles Symposium (IV), pp. 1555–1562, 2018.
[136] X. Yuan, P. He, Q. Zhu, R. R. Bhat, and X. Li, "Adversarial examples: Attacks and defenses for deep learning," CoRR, vol. abs/1712.07107, 2017.
[137] I. J. Goodfellow, J. Shlens, and C. Szegedy, "Explaining and harnessing adversarial examples," CoRR, vol. abs/1412.6572, 2014.
[138] S.-M. Moosavi-Dezfooli, A. Fawzi, and P. Frossard, "DeepFool: A simple and accurate method to fool deep neural networks," 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2574–2582, 2016.
[139] B. D. Rouhani, M. Samragh, M. Javaheripi, T. Javidi, and F. Koushanfar, "DeepFense: Online accelerated defense against adversarial deep learning," in Proceedings of the International Conference on Computer-Aided Design, ser. ICCAD '18. New York, NY, USA: ACM, 2018, pp. 134:1–134:8. [Online]. Available: http://doi.acm.org/10.1145/3240765.3240791
[140] A. Chakraborty, M. Alam, V. Dey, A. Chattopadhyay, and D. Mukhopadhyay, "Adversarial attacks and defences: A survey," CoRR, vol. abs/1810.00069, 2018.
[141] A. Pumsirirat and L. Yan, "Credit card fraud detection using deep learning based on auto-encoder and restricted Boltzmann machine," International Journal of Advanced Computer Science and Applications, vol. 9, no. 1, pp. 18–25, 2018.
[142] M. Schreyer, T. Sattarov, D. Borth, A. Dengel, and B. Reimer, "Detection of anomalies in large scale accounting data using deep autoencoder networks," CoRR, vol. abs/1709.05254, 2017. [Online]. Available: http://arxiv.org/abs/1709.05254
[143] Y. Wang and W. Xu, "Leveraging deep learning with LDA-based text analytics to detect automobile insurance fraud," Decision Support Systems, vol. 105, pp. 87–95, 2018.
[144] Y.-J. Zheng, X.-H. Zhou, W.-G. Sheng, Y. Xue, and S.-Y. Chen, "Generative adversarial network based telecom fraud detection at the receiving bank," Neural Networks, vol. 102, pp. 78–86, 2018. [Online]. Available: http://www.sciencedirect.com/science/article/pii/S0893608018300698
[145] M. Dong, L. Yao, X. Wang, B. Benatallah, C. Huang, and X. Ning, "Opinion fraud detection via neural autoencoder decision forest," Pattern Recognition Letters, 2018. [Online]. Available: http://www.sciencedirect.com/science/article/pii/S0167865518303039
[146] J. A. Gómez, J. Arévalo, R. Paredes, and J. Nin, "End-to-end neural network architecture for fraud scoring in card payments," Pattern Recognition Letters, vol. 105, pp. 175–181, 2018, Machine Learning and Applications in Artificial Intelligence. [Online]. Available: http://www.sciencedirect.com/science/article/pii/S016786551730291X
[147] N. F. Ryman-Tubb, P. Krause, and W. Garn, "How artificial intelligence and machine learning research impacts payment card fraud detection: A survey and industry benchmark," Engineering Applications of Artificial Intelligence, vol. 76, pp. 130–157, 2018. [Online]. Available: http://www.sciencedirect.com/science/article/pii/S0952197618301520
[148] U. Fiore, A. D. Santis, F. Perla, P. Zanetti, and F. Palmieri, "Using generative adversarial networks for improving classification effectiveness in credit card fraud detection," Information Sciences, vol. 479, pp. 448–455, 2019. [Online]. Available: http://www.sciencedirect.com/science/article/pii/S0020025517311519
[149] R. C. Cavalcante, R. C. Brasileiro, V. L. Souza, J. P. Nobrega, and A. L. Oliveira, "Computational intelligence and financial markets: A survey and future directions," Expert Systems with Applications, vol. 55, pp. 194–211, 2016.
[150] X. Li, Z. Deng, and J. Luo, "Trading strategy design in financial investment through a turning points prediction scheme," Expert Systems with Applications, vol. 36, no. 4, pp. 7818–7826, 2009. [Online]. Available: http://www.sciencedirect.com/science/article/pii/S0957417408008622
[151] E. F. Fama, "Random walks in stock market prices," Financial Analysts Journal, vol. 51, no. 1, pp. 75–80, 1995.
[152] C.-J. Lu, T.-S. Lee, and C.-C. Chiu, "Financial time series forecasting using independent component analysis and support vector regression," Decision Support Systems, vol. 47, no. 2, pp. 115–125, 2009. [Online]. Available: http://www.sciencedirect.com/science/article/pii/S0167923609000323
[153] M. Tkáč and R. Verner, "Artificial neural networks in business: Two decades of research," Applied Soft Computing, vol. 38, pp. 788–804, 2016. [Online]. Available: http://www.sciencedirect.com/science/article/pii/S1568494615006122
[154] T. N. Pandey, A. K. Jagadev, S. Dehuri, and S.-B. Cho, "A novel committee machine and reviews of neural network and statistical models for currency exchange rate prediction: An experimental analysis," Journal of King Saud University - Computer and Information Sciences, 2018. [Online]. Available: http://www.sciencedirect.com/science/article/pii/S1319157817303816
[155] A. Lasfer, H. El-Baz, and I. Zualkernan, "Neural network design parameters for forecasting financial time series," in Modeling, Simulation and Applied Optimization (ICMSAO), 2013 5th International Conference on. IEEE, 2013, pp. 1–4.
[156] M. U. Gudelek, S. A. Boluk, and A. M. Ozbayoglu, "A deep learning based stock trading model with 2-D CNN trend detection," in Computational Intelligence (SSCI), 2017 IEEE Symposium Series on. IEEE, 2017, pp. 1–8.
[157] T. Fischer and C. Krauss, "Deep learning with long short-term memory networks for financial market predictions," European Journal of Operational Research, vol. 270, no. 2, pp. 654–669, 2018.
[158] L. dos Santos Pinheiro and M. Dras, "Stock market prediction with deep learning: A character-based neural language model for event-based trading," in Proceedings of the Australasian Language Technology Association Workshop 2017, 2017, pp. 6–15.
[159] W. Bao, J. Yue, and Y. Rao, "A deep learning framework for financial time series using stacked autoencoders and long-short term memory," PLOS ONE, vol. 12, no. 7, p. e0180944, 2017.
[160] A. H. Mohammad, K. Rezaul, T. Ruppa, D. B. B. Neil, and W. Yang, "Hybrid deep learning model for stock price prediction," in IEEE Symposium Series on Computational Intelligence (SSCI), 2018. IEEE, 2018, pp. 1837–1844.
[161] A. le Calvez and D. Cliff, "Deep learning can replicate adaptive traders in a limit-order-book financial market," 2018 IEEE Symposium Series on Computational Intelligence (SSCI), pp. 1876–1883, 2018.
[162] S. Basak, S. Sengupta, and A. Dubey, "Mechanisms for integrated feature normalization and remaining useful life estimation using LSTMs applied to hard-disks," arXiv e-prints, p. arXiv:1810.08985, Oct 2018.
[163] P. Tamilselvan and P. Wang, "Failure diagnosis using deep belief learning based health state classification," Reliability Engineering & System Safety, vol. 115, pp. 124–135, 2013. [Online]. Available: http://www.sciencedirect.com/science/article/pii/S0951832013000574
[164] T. Kuremoto, S. Kimura, K. Kobayashi, and M. Obayashi, "Time series forecasting using a deep belief network with restricted Boltzmann machines," Neurocomputing, vol. 137, pp. 47–56, 2014, Advanced Intelligent Computing Theories and Methodologies. [Online]. Available: http://www.sciencedirect.com/science/article/pii/S0925231213007388
[165] J. Qiu, W. Liang, L. Zhang, X. Yu, and M. Zhang, "The early-warning model of equipment chain in gas pipeline based on DNN-HMM," Journal of Natural Gas Science and Engineering, vol. 27, pp. 1710–1722, 2015. [Online]. Available: http://www.sciencedirect.com/science/article/pii/S1875510015302407
[166] N. Gugulothu, T. Vishnu, P. Malhotra, L. Vig, P. Agarwal, and G. Shroff, "Predicting remaining useful life using time series embeddings based on recurrent neural networks," CoRR, vol. abs/1709.01073, 2017.
[167] P. Filonov, A. Lavrentyev, and A. Vorontsov, "Multivariate industrial time series with cyber-attack simulation: Fault detection using an LSTM-based predictive data model," CoRR, vol. abs/1612.06676, 2016.
[168] M. M. Botezatu, I. Giurgiu, J. Bogojeska, and D. Wiesmann, "Predicting disk replacement towards reliable data centers," in Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ser. KDD '16. New York, NY, USA: ACM, 2016, pp. 39–48. [Online]. Available: http://doi.acm.org/10.1145/2939672.2939699


[169] H.-I. Suk, S.-W. Lee, and D. Shen, "Latent feature representation with stacked auto-encoder for AD/MCI diagnosis," Brain Structure and Function, vol. 220, pp. 841–859, 2013.
[170] G. van Tulder and M. de Bruijne, "Combining generative and discriminative representation learning for lung CT analysis with convolutional restricted Boltzmann machines," IEEE Transactions on Medical Imaging, vol. 35, no. 5, pp. 1262–1272, May 2016.
[171] T. Brosch and R. Tam, "Manifold learning of brain MRIs by deep learning," in Medical Image Computing and Computer-Assisted Intervention – MICCAI 2013, K. Mori, I. Sakuma, Y. Sato, C. Barillot, and N. Navab, Eds. Berlin, Heidelberg: Springer Berlin Heidelberg, 2013, pp. 633–640.
[172] A. Esteva, B. Kuprel, R. A. Novoa, J. Ko, S. M. Swetter, H. M. Blau, and S. Thrun, "Dermatologist-level classification of skin cancer with deep neural networks," Nature, vol. 542, pp. 115–118, Jan. 2017. [Online]. Available: http://dx.doi.org/10.1038/nature21056
[173] S. Rajaraman, S. K. Antani, M. Poostchi, K. Silamut, M. A. Hossain, R. J. Maude, S. Jaeger, and G. R. Thoma, "Pre-trained convolutional neural networks as feature extractors toward improved malaria parasite detection in thin blood smear images," PeerJ, vol. 6, pp. e4568–e4568, Apr 2018. [Online]. Available: https://www.ncbi.nlm.nih.gov/pubmed/29682411
[174] G. Kang, K. Liu, B. Hou, and N. Zhang, "3D multi-view convolutional neural networks for lung nodule classification," in PLOS ONE, 2017.
[175] S. Hwang and H. Kim, "Self-transfer learning for fully weakly supervised object localization," CoRR, vol. abs/1602.01625, 2016. [Online]. Available: http://arxiv.org/abs/1602.01625
[176] S. Andermatt, S. Pezold, and P. Cattin, "Multi-dimensional gated recurrent units for the segmentation of biomedical 3D-data," in Deep Learning and Data Labeling for Medical Applications, G. Carneiro, D. Mateus, L. Peter, A. Bradley, J. M. R. S. Tavares, V. Belagiannis, J. P. Papa, J. C. Nascimento, M. Loog, Z. Lu, J. S. Cardoso, and J. Cornebise, Eds. Cham: Springer International Publishing, 2016, pp. 142–151.
[177] X. Cheng, X. Lin, and Y. Zheng, "Deep similarity learning for multimodal medical images," CMBBE: Imaging & Visualization, vol. 6, pp. 248–252, 2018.
[178] S. Miao, Z. J. Wang, and R. Liao, "A CNN regression approach for real-time 2D/3D registration," IEEE Transactions on Medical Imaging, vol. 35, no. 5, pp. 1352–1363, May 2016.
[179] O. Oktay, W. Bai, M. C. H. Lee, R. Guerrero, K. Kamnitsas, J. Caballero, A. de Marvao, S. A. Cook, D. P. O'Regan, and D. Rueckert, "Multi-input cardiac image super-resolution using convolutional neural networks," in MICCAI, 2016.
[180] V. Golkov, A. Dosovitskiy, J. I. Sperl, M. I. Menzel, M. Czisch, P. Sämann, T. Brox, and D. Cremers, "q-Space deep learning: Twelve-fold shorter and model-free diffusion MRI scans," IEEE Transactions on Medical Imaging, vol. 35, no. 5, pp. 1344–1351, May 2016.
[181] V. S. S. Vankayala and N. D. Rao, "Artificial neural networks and their applications to power systems - a bibliographical survey," Electric Power Systems Research, vol. 28, no. 1, pp. 67–79, 1993.
[182] M.-y. Chow, P. Mangum, and R. Thomas, "Incipient fault detection in DC machines using a neural network," in Signals, Systems and Computers, 1988. Twenty-Second Asilomar Conference on, vol. 2. IEEE, 1988, pp. 706–709.
[183] Z. Guo, K. Zhou, X. Zhang, and S. Yang, "A deep learning model for short-term power load and probability density forecasting," Energy, vol. 160, pp. 1186–1200, 2018.
[184] R. E. Bourguet and P. J. Antsaklis, "Artificial neural networks in electric power industry," ISIS, vol. 94, p. 007, 1994.
[185] J. Sharp, "Comparative models for electrical load forecasting: D. H. Bunn and E. D. Farmer, eds. (Wiley, New York, 1985) £24.95, pp. 232," International Journal of Forecasting, vol. 2, no. 2, pp. 241–242, 1986. [Online]. Available: https://EconPapers.repec.org/RePEc:eee:intfor:v:2:y:1986:i:2:p:241-242
[186] H. S. Hippert, C. E. Pedreira, and R. C. Souza, "Neural networks for short-term load forecasting: A review and evaluation," IEEE Transactions on Power Systems, vol. 16, no. 1, pp. 44–55, 2001.
[187] C. Kuster, Y. Rezgui, and M. Mourshed, "Electrical load forecasting models: A critical systematic review," Sustainable Cities and Society, vol. 35, pp. 257–270, 2017.
[188] R. Aggarwal and Y. Song, "Artificial neural networks in power systems. I. General introduction to neural computing," Power Engineering Journal, vol. 11, no. 3, pp. 129–134, 1997.
[189] Y. Zhai, "Time series forecasting competition among three sophisticated paradigms," Ph.D. dissertation, University of North Carolina at Wilmington, 2005.
[190] D. C. Park, M. El-Sharkawi, R. Marks, L. Atlas, and M. Damborg, "Electric load forecasting using an artificial neural network," IEEE Transactions on Power Systems, vol. 6, no. 2, pp. 442–449, 1991.
[191] E. Mocanu, P. H. Nguyen, M. Gibescu, and W. L. Kling, "Deep learning for estimating building energy consumption," Sustainable Energy, Grids and Networks, vol. 6, pp. 91–99, 2016.
[192] K. Chen, K. Chen, Q. Wang, Z. He, J. Hu, and J. He, "Short-term load forecasting with deep residual networks," IEEE Transactions on Smart Grid, 2018.
[193] S. Bouktif, A. Fiaz, A. Ouni, and M. Serhani, "Optimal deep learning LSTM model for electric load forecasting using feature selection and genetic algorithm: Comparison with machine learning approaches," Energies, vol. 11, no. 7, p. 1636, 2018.
[194] A. Dedinec, S. Filiposka, A. Dedinec, and L. Kocarev, "Deep belief network based electricity load forecasting: An analysis of Macedonian case," Energy, vol. 115, pp. 1688–1700, 2016.
[195] A. Rahman, V. Srikumar, and A. D. Smith, "Predicting electricity consumption for commercial and residential buildings using deep recurrent neural networks," Applied Energy, vol. 212, pp. 372–385, 2018.
[196] W. Kong, Z. Y. Dong, Y. Jia, D. J. Hill, Y. Xu, and Y. Zhang, "Short-term residential load forecasting based on LSTM recurrent neural network," IEEE Transactions on Smart Grid, 2017.
[197] X. Dong, L. Qian, and L. Huang, "Short-term load forecasting in smart grid: A combined CNN and k-means clustering approach," in Big Data and Smart Computing (BigComp), 2017 IEEE International Conference on. IEEE, 2017, pp. 119–125.
[198] S. A. Kalogirou, "Artificial neural networks in renewable energy systems applications: a review," Renewable and Sustainable Energy Reviews, vol. 5, no. 4, pp. 373–401, 2001.
[199] H. Wang, H. Yi, J. Peng, G. Wang, Y. Liu, H. Jiang, and W. Liu, "Deterministic and probabilistic forecasting of photovoltaic power based on deep convolutional neural network," Energy Conversion and Management, vol. 153, pp. 409–422, 2017.
[200] U. K. Das, K. S. Tey, M. Seyedmahmoudian, S. Mekhilef, M. Y. I. Idris, W. Van Deventer, B. Horan, and A. Stojcevski, "Forecasting of photovoltaic power generation and model optimization: A review," Renewable and Sustainable Energy Reviews, vol. 81, pp. 912–928, 2018.
[201] V. Dabra, K. K. Paliwal, P. Sharma, and N. Kumar, "Optimization of photovoltaic power system: a comparative study," Protection and Control of Modern Power Systems, vol. 2, no. 1, p. 3, 2017.
[202] J. Liu, W. Fang, X. Zhang, and C. Yang, "An improved photovoltaic power forecasting model with the assistance of aerosol index data," IEEE Transactions on Sustainable Energy, vol. 6, no. 2, pp. 434–442, 2015.
[203] H. S. Jang, K. Y. Bae, H.-S. Park, and D. K. Sung, "Solar power prediction based on satellite images and support vector machine," IEEE Trans. Sustain. Energy, vol. 7, no. 3, pp. 1255–1263, 2016.
[204] A. Gensler, J. Henze, B. Sick, and N. Raabe, "Deep learning for solar power forecasting - an approach using autoencoder and LSTM neural networks," in Systems, Man, and Cybernetics (SMC), 2016 IEEE International Conference on. IEEE, 2016, pp. 002858–002865.
[205] M. Abdel-Nasser and K. Mahmoud, "Accurate photovoltaic power forecasting models using deep LSTM-RNN," Neural Computing and Applications, pp. 1–14, 2017.
[206] J. F. Manwell, J. G. McGowan, and A. L. Rogers, Wind Energy Explained: Theory, Design and Application. John Wiley & Sons, 2010.
[207] A. P. Marugan, F. P. G. Marquez, J. M. P. Perez, and D. Ruiz-Hernandez, "A survey of artificial neural network in wind energy systems," Applied Energy, vol. 228, pp. 1822–1836, 2018.
[208] W. Wu, K. Chen, Y. Qiao, and Z. Lu, "Probabilistic short-term wind power forecasting based on deep neural networks," in Probabilistic Methods Applied to Power Systems (PMAPS), 2016 International Conference on. IEEE, 2016, pp. 1–8.
[209] H.-z. Wang, G.-q. Li, G.-b. Wang, J.-c. Peng, H. Jiang, and Y.-t. Liu, "Deep learning based ensemble approach for probabilistic wind power forecasting," Applied Energy, vol. 188, pp. 56–70, 2017.
[210] K. Wang, X. Qi, H. Liu, and J. Song, "Deep belief network based k-means cluster approach for short-term wind power forecasting," Energy, vol. 165, pp. 840–852, 2018.
[211] C. Feng, M. Cui, B.-M. Hodge, and J. Zhang, "A data-driven multi-model methodology with deep feature selection for short-term wind forecasting," Applied Energy, vol. 190, pp. 1245–1257, 2017.
[212] A. S. Qureshi, A. Khan, A. Zameer, and A. Usman, "Wind power prediction using deep neural network based meta regression and transfer learning," Applied Soft Computing, vol. 58, pp. 742–755, 2017.


[213] G. J. S. Litjens, T. Kooi, B. E. Bejnordi, A. A. A. Setio, F. Ciompi,M. Ghafoorian, J. A. W. M. van der Laak, B. van Ginneken,and C. I. Sanchez, “A survey on deep learning in medical imageanalysis,” CoRR, vol. abs/1702.05747, 2017. [Online]. Available:http://arxiv.org/abs/1702.05747

[214] D. Shen, G. Wu, and H.-I. Suk, “Deep learning in medicalimage analysis,” Annual Review of Biomedical Engineering, vol. 19,no. 1, pp. 221–248, 2017, pMID: 28301734. [Online]. Available:https://doi.org/10.1146/annurev-bioeng-071516-044442

[215] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna, “Re-thinking the inception architecture for computer vision,” 2016 IEEEConference on Computer Vision and Pattern Recognition (CVPR), pp.2818–2826, 2016.

[216] T. Zhang, B. Ghanem, S. Liu, and N. Ahuja, “Robust visual trackingvia multi-task sparse learning,” in 2012 IEEE Conference on ComputerVision and Pattern Recognition, June 2012, pp. 2042–2049.

[217] “Brain mri image segmentation using stacked denoising autoencoders,”https://goo.gl/tpnDx3.

[218] O. Ronneberger, P. Fischer, and T. Brox, “U-net: Convolutionalnetworks for biomedical image segmentation,” CoRR, vol.abs/1505.04597, 2015. [Online]. Available: http://arxiv.org/abs/1505.04597

[219] A. Maier, C. Syben, T. Lasser, and C. Riess, “A gentle introduction todeep learning in medical image processing,” Zeitschrift fr MedizinischePhysik, vol. 29, no. 2, pp. 86 – 101, 2019. [Online]. Available:http://www.sciencedirect.com/science/article/pii/S093938891830120X

[220] G. Marcus, “Deep learning: A critical appraisal,” CoRR, vol.abs/1801.00631, 2018.

[221] S. Sabour, N. Frosst, and G. E. Hinton, “Dynamic routing betweencapsules,” in NIPS 2017, 2017.

[222] G. E. Hinton, A. Krizhevsky, and S. D. Wang, “Transforming auto-encoders,” in ICANN, 2011.

[223] O. Vinyals, C. Blundell, T. Lillicrap, K. Kavukcuoglu, andD. Wierstra, “Matching networks for one shot learning,” inProceedings of the 30th International Conference on NeuralInformation Processing Systems, ser. NIPS’16. USA: CurranAssociates Inc., 2016, pp. 3637–3645. [Online]. Available: http://dl.acm.org/citation.cfm?id=3157382.3157504

[224] K. Hsu, S. Levine, and C. Finn, “Unsupervised learning via meta-learning,” CoRR, vol. abs/1810.02334, 2018.

[225] A. Banino, C. Barry, B. Uria, C. Blundell, T. P. Lillicrap, P. Mirowski,A. Pritzel, M. J. Chadwick, T. Degris, J. Modayil, G. Wayne, H. Soyer,F. Viola, B. Zhang, R. Goroshin, N. C. Rabinowitz, R. Pascanu,C. Beattie, S. Petersen, A. Sadik, S. Gaffney, H. King, K. Kavukcuoglu,D. Hassabis, R. Hadsell, and D. Kumaran, “Vector-based navigationusing grid-like representations in artificial agents,” Nature, vol. 557,pp. 429–433, 2018.

[226] B. Baker, O. Gupta, N. Naik, and R. Raskar, “Designing neu-ral network architectures using reinforcement learning,” CoRR, vol.abs/1611.02167, 2016.

[227] C. J. C. H. Watkins and P. Dayan, “Q-learning,” Machine Learning,vol. 8, no. 3, pp. 279–292, May 1992. [Online]. Available:https://doi.org/10.1007/BF00992698

[228] Y. Guo, Y. Liu, A. Oerlemans, S. Lao, S. Wu, and M. S. Lew, “Deeplearning for visual understanding: A review,” Neurocomputing, vol.187, pp. 27–48, 2016.

[229] A. Voulodimos, N. D. Doulamis, A. D. Doulamis, and E. Protopa-padakis, “Deep learning for computer vision: A brief review,” in Comp.Int. and Neurosc., 2018.

[230] J. R. del Solar, P. Loncomilla, and N. Soto, “A survey on deep learning methods for robot vision,” CoRR, vol. abs/1803.10862, 2018.

[231] A. Ioannidou, E. Chatzilari, S. Nikolopoulos, and I. Kompatsiaris, “Deep learning advances in computer vision with 3D data: A survey,” ACM Computing Surveys, vol. 50, Jun. 2017.

[232] C. Seifert, A. Aamir, A. Balagopalan, D. Jain, A. Sharma, S. Grottel, and S. Gumhold, Visualizations of Deep Neural Networks in Computer Vision: A Survey. Cham: Springer International Publishing, 2017, pp. 123–144. [Online]. Available: https://doi.org/10.1007/978-3-319-54024-5_6

[233] W.-Y. Lin, Y.-H. Hu, and C.-F. Tsai, “Machine learning in financial crisis prediction: A survey,” IEEE Transactions on Systems, Man, and Cybernetics, vol. 42, pp. 421–436, Jul. 2012.

[234] J. C. B. Gamboa, “Deep learning for time-series analysis,” CoRR, vol. abs/1701.01887, 2017.

[235] S. Sarojini Devi and Y. Radhika, “A survey on machine learning and statistical techniques in bankruptcy prediction,” International Journal of Machine Learning and Computing, vol. 8, pp. 133–139, Apr. 2018.

[236] A. Tealab, “Time series forecasting using artificial neural networks methodologies: A systematic review,” Future Computing and Informatics Journal, vol. 3, no. 2, pp. 334–340, 2018. [Online]. Available: http://www.sciencedirect.com/science/article/pii/S2314728817300715

[237] A. Almalaq and G. Edwards, “A review of deep learning methods applied on load forecasting,” in 2017 16th IEEE International Conference on Machine Learning and Applications (ICMLA), Dec. 2017, pp. 511–516.

[238] J. P. Usuga Cadavid, S. Lamouri, and B. Grabot, “Trends in machine learning applied to demand & sales forecasting: A review,” in International Conference on Information Systems, Logistics and Supply Chain, Lyon, France, Jul. 2018. [Online]. Available: https://hal.archives-ouvertes.fr/hal-01881362

[239] S. Beheshti-Kashi, H. Reza Karimi, K.-D. Thoben, M. Lütjen, and M. Teucke, “A survey on retail sales forecasting and prediction in fashion markets,” Systems Science & Control Engineering: An Open Access Journal, vol. 3, pp. 154–161, Jan. 2015.

[240] H. K. Alfares and M. Nazeeruddin, “Electric load forecasting: Literature survey and classification of methods,” Int. J. Systems Science, vol. 33, pp. 23–34, 2002.

[241] M. Längkvist, L. Karlsson, and A. Loutfi, “A review of unsupervised feature learning and deep learning for time-series modeling,” Pattern Recognition Letters, vol. 42, pp. 11–24, 2014. [Online]. Available: http://www.sciencedirect.com/science/article/pii/S0167865514000221

[242] M. Z. Hossain, F. Sohel, M. F. Shiratuddin, and H. Laga, “A comprehensive survey of deep learning for image captioning,” ACM Comput. Surv., vol. 51, no. 6, pp. 118:1–118:36, Feb. 2019. [Online]. Available: http://doi.acm.org/10.1145/3295748

[243] H. Wang, S. Shang, L. Long, R. Hu, Y. Wu, N. Chen, S. Zhang, F. Cong, and S. Lin, “Biological image analysis using deep learning-based methods: Literature review,” Digital Medicine, vol. 4, no. 4, pp. 157–165, 2018. [Online]. Available: http://www.digitmedicine.com/article.asp?issn=2226-8561;year=2018;volume=4;issue=4;spage=157;epage=165;aulast=Wang;t=6

[244] Y. Li, H. Zhang, X. Xue, Y. Jiang, and Q. Shen, “Deep learning for remote sensing image classification: A survey,” Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, vol. 8, no. 6, p. e1264, 2018. [Online]. Available: https://onlinelibrary.wiley.com/doi/abs/10.1002/widm.1264

[245] W. Rawat and Z. Wang, “Deep convolutional neural networks for image classification: A comprehensive review,” Neural Computation, vol. 29, no. 9, pp. 2352–2449, 2017, PMID: 28599112. [Online]. Available: https://doi.org/10.1162/neco_a_00990

[246] M. I. Razzak, S. Naz, and A. Zaib, Deep Learning for Medical Image Processing: Overview, Challenges and the Future. Cham: Springer International Publishing, 2018, pp. 323–350. [Online]. Available: https://doi.org/10.1007/978-3-319-65981-7_12

[247] A. S. Lundervold and A. Lundervold, “An overview of deep learning in medical imaging focusing on MRI,” Zeitschrift für Medizinische Physik, vol. 29, no. 2, pp. 102–127, 2019. [Online]. Available: http://www.sciencedirect.com/science/article/pii/S0939388918301181

[248] S. Liu, Y. Wang, X. Yang, B. Lei, L. Liu, S. X. Li, D. Ni, and T. Wang, “Deep learning in medical ultrasound analysis: A review,” Engineering, vol. 5, no. 2, pp. 261–275, 2019. [Online]. Available: http://www.sciencedirect.com/science/article/pii/S2095809918301887

[249] N. Akhtar and A. S. Mian, “Threat of adversarial attacks on deep learning in computer vision: A survey,” IEEE Access, vol. 6, pp. 14410–14430, 2018.

[250] D. J. Miller, Z. Xiang, and G. Kesidis, “Adversarial learning in statistical classification: A comprehensive review of defenses against attacks,” CoRR, vol. abs/1904.06292, 2019. [Online]. Available: http://arxiv.org/abs/1904.06292

[251] M. Ozdag, “Adversarial attacks and defenses against deep neural networks: A survey,” Procedia Computer Science, vol. 140, pp. 152–161, 2018, Cyber Physical Systems and Deep Learning, Chicago, Illinois, November 5–7, 2018. [Online]. Available: http://www.sciencedirect.com/science/article/pii/S1877050918319884

[252] S. Thomas and N. Tabrizi, Adversarial Machine Learning: A Literature Review, Jul. 2018, pp. 324–334.

[253] S. Qiu, Q. Liu, S. Zhou, and C. Wu, “Review of artificial intelligence adversarial attack and defense technologies,” Applied Sciences, vol. 9, no. 5, 2019. [Online]. Available: http://www.mdpi.com/2076-3417/9/5/909

[254] V. Duddu, “A survey of adversarial machine learning in cyber warfare,”2018.


[255] W. Schwarting, J. Alonso-Mora, and D. Rus, “Planning and decision-making for autonomous vehicles,” Annual Review of Control, Robotics, and Autonomous Systems, vol. 1, May 2018.

[256] A. Carrio, C. Sampedro, A. Rodriguez-Ramos, and P. C. Cervera, “A review of deep learning methods and applications for unmanned aerial vehicles,” J. Sensors, vol. 2017, pp. 3296874:1–3296874:13, 2017.

[257] A. Fridman, D. E. Brown, M. Glazer, W. Angell, S. Dodd, B. Jenik, J. Terwilliger, J. Kindelsberger, L. Ding, S. Seaman, H. Abraham, A. Mehler, A. Sipperley, A. Pettinato, B. Seppelt, L. Angell, B. Mehler, and B. Reimer, “MIT autonomous vehicle technology study: Large-scale deep learning based analysis of driver behavior and interaction with automation,” CoRR, vol. abs/1711.06976, 2017.

[258] S. D. Pendleton, H. Andersen, X. Du, X. Shen, M. Meghjani, Y. H. Eng, D. Rus, and M. H. Ang, “Perception, planning, control, and coordination for autonomous vehicles,” Machines, vol. 5, no. 1, 2017. [Online]. Available: http://www.mdpi.com/2075-1702/5/1/6

[259] G. von Zitzewitz, “Survey of neural networks in autonomous driving,” Jul. 2017.

[260] C. Badue, R. Guidolini, R. V. Carneiro, P. Azevedo, V. B. Cardoso, A. Forechi, L. F. R. Jesus, R. F. Berriel, T. M. Paixão, F. W. Mutz, T. Oliveira-Santos, and A. F. de Souza, “Self-driving cars: A survey,” CoRR, vol. abs/1901.04407, 2019.

[261] T. Young, D. Hazarika, S. Poria, and E. Cambria, “Recent trends in deep learning based natural language processing [review article],” IEEE Computational Intelligence Magazine, vol. 13, pp. 55–75, 2018.

[262] D. W. Otter, J. R. Medina, and J. K. Kalita, “A survey of the usages of deep learning in natural language processing,” CoRR, vol. abs/1807.10854, 2018.

[263] W. Khan, A. Daud, J. Nasir, and T. Amjad, “A survey on the state-of-the-art machine learning models in the context of NLP,” vol. 43, pp. 95–113, Oct. 2016.

[264] S. Fahad and A. Yahya, “Inflectional review of deep learning on natural language processing,” Jul. 2018.

[265] H. Li, “Deep learning for natural language processing: advantages and challenges,” National Science Review, vol. 5, no. 1, pp. 24–26, Sep. 2017. [Online]. Available: https://doi.org/10.1093/nsr/nwx110

[266] Y. Xie, L. Le, Y. Zhou, and V. V. Raghavan, “Deep learning for natural language processing,” 2018.

[267] S. Zhang, L. Yao, and A. Sun, “Deep learning based recommender system: A survey and new perspectives,” ACM Comput. Surv., vol. 52, pp. 5:1–5:38, 2019.

[268] R. Mu, “A survey of recommender systems based on deep learning,” IEEE Access, vol. 6, pp. 69009–69022, 2018.

[269] Z. Batmaz, A. Yurekli, A. Bilge, and C. Kaleli, “A review on deep learning for recommender systems: challenges and remedies,” Artificial Intelligence Review, Aug. 2018. [Online]. Available: https://doi.org/10.1007/s10462-018-9654-y

[270] B. T. Betru, C. A. Onana, and B. Batchakui, “Deep learning methods on recommender system: A survey of state-of-the-art,” 2017.

[271] R. Fakhfakh, A. Ben Ammar, and C. Ben Amar, “Deep learning-based recommendation: Current issues and challenges,” International Journal of Advanced Computer Science and Applications, vol. 8, Jan. 2017.

[272] L. Zheng, “A survey and critique of deep learning on recommender systems,” 2016.

[273] O. Y. Al-Jarrah, P. D. Yoo, S. Muhaidat, G. K. Karagiannidis, and K. Taha, “Efficient machine learning for big data: A review,” Big Data Research, vol. 2, no. 3, pp. 87–93, 2015, Big Data, Analytics, and High-Performance Computing. [Online]. Available: http://www.sciencedirect.com/science/article/pii/S2214579615000271

[274] Q. Zhang, L. T. Yang, Z. Chen, and P. Li, “A survey on deep learning for big data,” Information Fusion, vol. 42, pp. 146–157, 2018. [Online]. Available: http://www.sciencedirect.com/science/article/pii/S1566253517305328

[275] Y. Roh, G. Heo, and S. E. Whang, “A survey on data collection for machine learning: a big data - AI integration perspective,” CoRR, vol. abs/1811.03402, 2018. [Online]. Available: http://arxiv.org/abs/1811.03402

[276] J. Qiu, Q. Wu, G. Ding, Y. Xu, and S. Feng, “A survey of machine learning for big data processing,” EURASIP Journal on Advances in Signal Processing, vol. 2016, no. 1, p. 67, May 2016. [Online]. Available: https://doi.org/10.1186/s13634-016-0355-x

[277] B. Jan, H. Farman, M. Khan, M. Imran, I. U. Islam, A. Ahmad, S. Ali, and G. Jeon, “Deep learning in big data analytics: A comparative study,” Computers & Electrical Engineering, vol. 75, pp. 275–287, 2019. [Online]. Available: http://www.sciencedirect.com/science/article/pii/S0045790617315835

[278] M. M. Najafabadi, F. Villanustre, T. M. Khoshgoftaar, N. Seliya, R. Wald, and E. Muharemagic, “Deep learning applications and challenges in big data analytics,” Journal of Big Data, vol. 2, no. 1, p. 1, Feb. 2015. [Online]. Available: https://doi.org/10.1186/s40537-014-0007-7