Statistical Analysis of Various Measures in Auditing Practices Using Optimization Techniques
Dabbeeru Priyanka1, Javvadi Vishnu Priyanka2, Sura Papa Rao3
Department of Computer Science and Engineering, 1Sri Sivani College of Engineering, Srikakulam, AP-532410, India
2RGUKT-Srikakulam, 3RGUKT-Nuzvid, Andhra Pradesh, India
E-mail: {dabbeerupriyanka90, priyajv34, surapaparao}@gmail.com
ABSTRACT
Nowadays, AI plays a fundamental role in improving the quality of audit field
work. A data audit refers to the inspection of data to evaluate its quality or
utility for a particular purpose. Auditing data, unlike auditing accounts,
involves looking at key metrics, other than quantity, to draw conclusions about
the properties of a data set. Auditing practices are responsible for fraud
detection. In this paper, various optimization algorithms are executed by
analysing statistical measures of the audit risk and trail data to find the best
accuracy for suspicious firm classification and to produce an optimal or
satisfactory solution.
1. INTRODUCTION
1.1 Introduction to Audit data
One major area of auditing is making useful predictions about a dataset.
Auditing standards require auditors to assess the status of a dataset and to
predict whether the dataset is able to continue operating as a going concern
[1]. The greatest concerns in using audit data for research are missing and
implausible data, which are inevitable when large volumes of data are collected
[2]. Determining the status of a dataset is a very difficult task, so auditors
have been trying to develop statistical methods that make it easier. In recent
years, machine learning has matured and received major attention in predictive
analytics for audit research. The main objective is to produce an efficient and
effective prediction
Science, Technology and Development
Volume IX Issue VI JUNE 2020
ISSN : 0950-0707
Page No : 37
model that is a hybrid of various machine learning algorithms and is capable of
predicting whether fraud has been committed in a firm. A data audit refers to
the auditing of data to assess its quality or utility for a specific purpose
[1]. Auditing data, unlike auditing finances, involves looking at key metrics,
other than quantity, to draw conclusions about the properties of a data set. The
experiments are carried out on high-dimensional audit data, and the accuracy of
the results is assessed. Audits are an important compliance component that
needs to be set up properly but quickly, so that an organization does not miss
out on business opportunities requiring the storage of user data. Auditing these
activities helps companies identify a data breach before it is too late, or at
least helps them implement better security configurations to stop losses from
occurring.
To apply the machine learning techniques and their algorithms, we used MATLAB,
a numerical computing environment that supports data analysis, data mining and
interactive data visualization. MATLAB can also call code written in Fortran,
Python, C++ and C. Cleve Moler, then at the University of New Mexico, started
developing MATLAB in the late 1970s [3]. It is an interactive system whose basic
data element is an array that does not require dimensioning. This makes it
possible to solve many technical computing problems, especially those with
matrix and vector formulations, in a fraction of the time.
Machine learning involves algorithms that improve automatically through
experience based on data [1]. We use three machine learning classification
algorithms: Logistic Regression, Support Vector Machine and K-Nearest Neighbor.
Logistic Regression is a statistical analysis method used to predict a data
value based on prior observations of a dataset in which one or more independent
variables determine an outcome. The outcome is measured with a dichotomous
variable (one with only two possible outcomes). KNN (K-Nearest Neighbor) is a
very simple machine learning algorithm: it stores all the available cases and
classifies each new case based on its similarity to them [4]. SVM (Support
Vector Machine) is one of the most widely used supervised machine learning
algorithms for classification problems. Each data item is plotted as a point in
n-dimensional space, and classification is then performed by finding the
hyperplane that best separates the two classes [4]. This project focuses on
classification algorithms, which are a popular and active area of research and
can adapt to novel changes. The scope of this project is to define the
challenges of the existing system and to highlight new research directions. The
main objective is to produce an efficient and effective prediction model that is
a hybrid of various machine learning classification algorithms. Such a model
also helps ensure the financial reliability of an organization; detection of
fraud is only an incidental objective.
1.2 Audit data
Auditing is one of the major areas of concern. It is defined as the process of
examining the financial records of a business to corroborate that its financial
statements are in compliance with standard accounting laws and principles [1].
During a data audit, the origin, creation or format of data may be reviewed to
assess its value and utility. Various agencies and associations, such as the
Joint Information Systems Committee (JISC), promote data audit protocols in
different fields. In academia, auditing research data has become an important
component of work. Government programs may also focus on government-funded or
work-related data audit protocols. Typically, a data audit depends on a
registry, which is a storage space for data assets. A thorough data audit
involves identifying a registry or repository, often in a specific business
department or organization. In this work we use two auditing datasets:
audit_risk and trail. The audit_risk dataset has 776 instances and 27 attributes.
The trail dataset also has 776 instances, with 18 attributes.
1.3 Algorithms
In this project, we used three classification algorithms: KNN, Logistic
Regression and SVM. Using these algorithms, we computed analytical results based
on statistical measures such as AUC, the confusion matrix and recall.
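To make these measures concrete, the following is a minimal sketch in plain Python of how a confusion matrix, recall and AUC can be computed for a binary classifier. The labels, scores, the 0.5 threshold and the function names are hypothetical illustrations and are not taken from the audit datasets or from the implementation used in this work.

```python
def confusion_counts(y_true, y_pred):
    """Return (tp, fp, fn, tn) for binary labels, where 1 is the positive class."""
    tp = fp = fn = tn = 0
    for actual, predicted in zip(y_true, y_pred):
        if actual == 1 and predicted == 1:
            tp += 1
        elif actual == 0 and predicted == 1:
            fp += 1
        elif actual == 1 and predicted == 0:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


def recall(y_true, y_pred):
    """Fraction of actual positives that were predicted positive."""
    tp, _, fn, _ = confusion_counts(y_true, y_pred)
    return tp / (tp + fn) if (tp + fn) else 0.0


def auc(y_true, scores):
    """AUC as the chance that a random positive outscores a random negative."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative label")
    # Ties between a positive and a negative score count as half a win.
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical labels (1 = suspicious firm) and classifier scores.
y_true = [1, 0, 1, 1, 0, 0]
scores = [0.9, 0.4, 0.35, 0.8, 0.1, 0.6]
y_pred = [1 if s >= 0.5 else 0 for s in scores]

print(confusion_counts(y_true, y_pred))   # (2, 1, 1, 2)
print(round(recall(y_true, y_pred), 3))   # 0.667
print(round(auc(y_true, scores), 3))      # 0.778
```

Recall depends on the chosen threshold, whereas AUC is computed from the raw scores and summarises ranking quality across all thresholds.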
1.3.1 K-Nearest Neighbor
In pattern recognition, the k-nearest neighbors algorithm (k-NN) is a
non-parametric method used for classification and regression [5]. In both cases,
the input consists of the k closest training examples in the feature space. The
output depends on whether k-NN is used for classification or regression.
In k-NN classification, the output is a class membership. An object is
classified by a majority vote of its neighbors, with the object being assigned
to the class most common among its k nearest neighbors (k is a positive integer,
typically small). If k = 1, the object is simply assigned to the class of that
single nearest neighbor. In k-NN regression, the output is the property value
for the object. This value is the average of the values of its k nearest
neighbors.
k-NN is a type of instance-based learning, or lazy learning, where the function
is only approximated locally and all computation is deferred until
classification. The k-NN algorithm is among the simplest of all machine learning
algorithms. For both classification and regression, a useful technique is to
assign weights to the contributions of the neighbors, so that nearer neighbors
contribute more to the average than more distant ones. For example, a common
weighting scheme gives each neighbor a weight of 1/d, where d is the distance to
the neighbor.
The neighbors are taken from a set of objects for which the class (for k-NN
classification) or the object property value (for k-NN regression) is known.
This can be thought of as the training set for the algorithm, though no explicit
training step is required. K-Nearest Neighbors is one of the most basic yet
essential classification algorithms in machine learning. It belongs to the
supervised learning domain and finds intense application in pattern recognition,
data mining and intrusion detection [6].
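The voting rule and the 1/d weighting scheme described above can be sketched in a few lines of Python. This is an illustrative sketch only: the function name `knn_predict`, the Euclidean distance and the toy points are assumptions made for the example, not the implementation or data used in this work.

```python
import math
from collections import defaultdict


def knn_predict(train_X, train_y, query, k=3, weighted=False):
    """Classify `query` by a vote of its k nearest training points.

    With weighted=True each neighbor votes with weight 1/d, where d is its
    Euclidean distance to the query; otherwise every neighbor has one vote.
    """
    neighbors = sorted(
        (math.dist(x, query), label) for x, label in zip(train_X, train_y)
    )[:k]
    votes = defaultdict(float)
    for d, label in neighbors:
        if weighted:
            # An exact match (d == 0) gets an infinite weight instead of
            # triggering a division by zero.
            votes[label] += 1.0 / d if d > 0 else float("inf")
        else:
            votes[label] += 1.0
    return max(votes, key=votes.get)


# Hypothetical two-feature points; label 1 = suspicious, 0 = not suspicious.
train_X = [(0.0, 0.0), (3.0, 0.0), (0.0, 3.0)]
train_y = [1, 0, 0]
query = (0.5, 0.0)

print(knn_predict(train_X, train_y, query, k=3))                 # 0
print(knn_predict(train_X, train_y, query, k=3, weighted=True))  # 1
```

In this toy case the unweighted vote follows the two distant neighbors, while the 1/d weighting lets the single very close neighbor decide, which is the effect the weighting scheme is meant to have.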
1.3.2 Logistic Regression
Logistic regression is a statistical analysis method used to predict a data
value based on prior observations of a dataset. It has become an important tool
in the discipline of machine learning. The approach allows an algorithm used in
a machine learning application to classify incoming data based on historical
data. As more relevant data comes in, the algorithm should get better at
predicting classifications within data sets. It analyzes a dataset in which one
or more independent variables determine an outcome. The outcome is measured with
a dichotomous variable (one with only two possible outcomes) [7]. LR combines
the independent variables to estimate the probability that a particular event
will occur, such as a subject being a member of one of the groups defined by the
dichotomous dependent variable. Logistic regression is similar to linear
regression. The main difference is that the output of logistic regression is
categorical, because only a limited number of values is possible, whereas the
output of linear regression is continuous and can take infinitely many values.
It is used for binary classification problems, including predictions such as
“this or that”, “yes or no” and “A or B”.
Logistic regression rests on three fundamental concepts: the hypothesis, the
decision boundary, and the cost function. The hypothesis of logistic regression
is limited to values between 0 and 1. A linear function therefore fails to
represent it, because a linear function can take a value greater than 1 or less
than 0, which is not possible under the hypothesis of logistic regression. The
hypothesis is instead defined with the sigmoid function, also known as the
logistic function, which maps predicted values to probabilities. A decision
boundary is then applied on top of the predicted probabilities to assign class
labels, so that the predicted labels can be compared with the actual labels.
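The three concepts can be illustrated with a small Python sketch: a sigmoid hypothesis, a 0.5 decision boundary, and a cost function. The log-loss cost, the gradient-descent settings, the function names and the single standardised feature are all assumptions chosen for illustration; the text above does not specify them, and this is not the implementation used in this work.

```python
import math


def sigmoid(z):
    """Logistic function, written to avoid overflow for large negative z."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def predict_proba(weights, bias, x):
    """Hypothesis: a probability in (0, 1) obtained from a linear score."""
    return sigmoid(bias + sum(w * xi for w, xi in zip(weights, x)))


def predict_label(weights, bias, x, threshold=0.5):
    """Decision boundary: label 1 when the probability reaches the threshold."""
    return 1 if predict_proba(weights, bias, x) >= threshold else 0


def log_loss(weights, bias, X, y):
    """Cost function (assumed here): average negative log-likelihood."""
    eps = 1e-12
    total = 0.0
    for x, t in zip(X, y):
        p = min(max(predict_proba(weights, bias, x), eps), 1.0 - eps)
        total -= t * math.log(p) + (1 - t) * math.log(1.0 - p)
    return total / len(X)


def fit(X, y, lr=0.5, epochs=500):
    """Minimise the log loss with plain batch gradient descent."""
    weights, bias, n = [0.0] * len(X[0]), 0.0, len(X)
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * len(weights), 0.0
        for x, t in zip(X, y):
            err = predict_proba(weights, bias, x) - t
            grad_w = [g + err * xi for g, xi in zip(grad_w, x)]
            grad_b += err
        weights = [w - lr * g / n for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / n
    return weights, bias


# Hypothetical standardised risk score per firm; 1 = suspicious.
X = [[-2.0], [-1.5], [-1.0], [1.0], [1.5], [2.0]]
y = [0, 0, 0, 1, 1, 1]
weights, bias = fit(X, y)
print(predict_label(weights, bias, [-1.5]))  # 0
print(predict_label(weights, bias, [1.5]))   # 1
```

A threshold other than 0.5 shifts the decision boundary and trades recall against false positives.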
1.3.3 Support Vector Machine
In machine learning, support-vector machines (SVMs, also support-vector
networks) are supervised learning models with associated learning algorithms
that analyze data used for classification and regression analysis. Given a set
of training examples, each marked as belonging to one of two classes, an SVM
training algorithm builds a model that assigns new examples to one class or the
other, making it a non-probabilistic binary linear classifier (although methods
such as Platt scaling exist to use SVM in a probabilistic classification
setting). An SVM model is a representation of the examples as points in space,
mapped so that the examples of the separate classes are divided by a clear gap
that is as wide as possible. New examples are then mapped into that same space
and predicted to belong to a class based on the side of the gap on which they
fall.
In addition to performing linear classification, SVMs can efficiently perform
non-linear classification using what is known as the kernel trick, implicitly
mapping their inputs into high-dimensional feature spaces [8]. When data is
unlabelled, supervised learning is not possible, and an unsupervised learning
approach is required, which attempts to find a natural clustering of the data
into groups and then maps new data to these groups.
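As an illustration of the linear case, the sketch below trains a linear SVM by batch subgradient descent on the regularised hinge loss. This is one common way to fit such a model and is an assumption for the example, not the solver used in this work; the function names, the hyperparameters `lam`, `lr` and `epochs`, and the toy points are hypothetical.

```python
def train_linear_svm(X, y, lam=0.01, lr=0.1, epochs=200):
    """Fit w, b by subgradient descent on lam/2 * ||w||^2 + mean hinge loss.

    Labels in y must be -1 or +1.
    """
    w, b, n = [0.0] * len(X[0]), 0.0, len(X)
    for _ in range(epochs):
        grad_w = [lam * wj for wj in w]  # gradient of the regulariser
        grad_b = 0.0
        for x, t in zip(X, y):
            margin = t * (sum(wj * xj for wj, xj in zip(w, x)) + b)
            if margin < 1:  # point is inside the margin or misclassified
                grad_w = [g - t * xj / n for g, xj in zip(grad_w, x)]
                grad_b -= t / n
        w = [wj - lr * g for wj, g in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b


def svm_predict(w, b, x):
    """Assign a class from the side of the hyperplane on which x falls."""
    return 1 if sum(wj * xj for wj, xj in zip(w, x)) + b >= 0 else -1


# Hypothetical standardised features; +1 = suspicious, -1 = not suspicious.
X = [(-2.0, -1.0), (-1.0, -2.0), (-2.0, -2.0), (1.0, 2.0), (2.0, 1.0), (2.0, 2.0)]
y = [-1, -1, -1, 1, 1, 1]
w, b = train_linear_svm(X, y)
print(svm_predict(w, b, (1.5, 1.5)))    # 1
print(svm_predict(w, b, (-1.5, -1.5)))  # -1
```

Only points with a margin below 1 contribute to the update, which reflects the idea that the separating gap is determined by the examples closest to it. The kernel trick is not shown here.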
2. LITERATURE SURVEY
2.1 Introduction
In our survey, we reviewed papers from the previous 15 years to understand
audit data from different perspectives and to learn about the technologies,
methods and methodologies used in those studies.
2.2 Survey on audit data using various algorithms
Sawsan Saadi Halbouni [9] considers auditor perceptions regarding their
responsibilities for preventing, detecting, and reporting fraud. The main
objective of the study is to investigate internal and external auditor
perceptions of being proactive in their efforts to identify fraud during an
audit. The study covers 53 auditors in the United Arab Emirates. The principal
finding is that internal auditors are primarily responsible for identifying
fraud and are consequently more concerned about reporting incidents related to
fraud. The results provide empirical support for the notion that external
auditors should also increase the degree to which they seek to detect and report
incidents of fraud.
Chad A. Simon [10] examines whether focusing on management’s goals helps
individual auditors identify ways management may be committing fraud (i.e.,
fraud schemes). He also examines whether forming independent expectations for a
client’s reported results enhances the benefits of focusing on management’s
goals. The author found that auditors who identify a higher number of relevant
fraud schemes identify more audit procedures that target the case fraud.
Auditors who are prompted to identify and focus on management’s goals identify
more relevant fraud schemes than auditors who are not. Auditors are responsible
for identifying the specific ways they believe the client could be committing
fraud and for adjusting the audit plan to address the identified schemes.
Stefan Hoyer et al. [11] introduce a generic architectural model that unifies
the classic fraud audit approach with human behavior, taking into account the
fraud triangle, in order to achieve better fraud detection and prevention. This
provides added value because the transactions examined by the auditor can be
better differentiated and prioritized. The proposed architecture is implemented
as a prototype and applied, by way of example, to an SAP ERP system. The
inclusion of the human factor focuses on the manifestation of various behavior
patterns, which can be found in user data. The generic architectural model can
be expanded in a modular way by adding plug-ins that connect to social
networking sites. In this way, ideally, additional information on the social
networks of employees can be extracted, and this information can be useful for
fraud analysis.
Joseph F. Brazel et al. [12] note that the standard Consideration of Fraud in a
Financial Statement Audit requires brainstorming sessions on every audit as a
means of improving auditors’ fraud judgments. They develop a measure of
brainstorming quality to examine how it affects auditors’ fraud decision-making
processes, using field survey data on auditors’ actual brainstorming sessions
for 179 audit engagements. They found that when brainstorming quality is
perceived to be higher, risk assessments are more positively related to the
nature, staffing, timing, and extent of audit procedures. Achieving a high level
of brainstorming quality appears to improve the audit team’s consideration of
fraud by cultivating a broader set of responses to identified fraud risks.
Harold Hassink et al. [13] study the detection, reporting and redress of fraud
by auditors. They examine which auditing standards auditors use once they
encounter fraud and whether compliance is associated with particular fraud
characteristics, and they aim to provide evidence on the role of auditors in
redressing fraud. Redress refers to the auditor taking measures to nullify the
consequences of the fraud, insofar as possible, and to prevent any recurrence of
such fraud. It is difficult for individual auditors to build up expertise in
fraud detection. There appears to be a need for specific training programs that
help auditors detect fraud and that emphasize the need for mandatory
consultation with the technical department of the audit firm once “red flags”
indicating fraud are found. Finally, they reveal that auditors fail to comply
with some important elements of fraud standards, and that there are substantial
differences among audit firms regarding compliance with the relevant auditing
standards.
Glen Gray et al. [14] explore the application of data mining techniques to
fraud detection in the audit of financial statements and propose a taxonomy to
support and guide future research. They develop a structure for research in data
mining and create a taxonomy that combines research on patterns of observed
fraud schemes with an appreciation of areas that benefit from productive
application of data mining. Auditors can include non-accounting (non-financial)
data in their data mining domain. There are, however, no external standards that
apply to non-financial data. As such, auditors will need to develop their own
set of processes to determine the quality of non-financial data and the
consistency of that data with the financial statements. Fraud detection is a
vital component of the modern financial statement audit, and many forms of data
mining are available to auditors as part of their fraud detection activities.
Roberto García-Fernández and Antonio [15] aim to analyze the impact of auditor
tenure on audit quality. The research is motivated by the absence of consensus
in published works and by the scarcity of studies on nonprofit organizations.
Although foundation audit quality decreases as tenure length increases, this
quality loss does not become apparent until the sixth year of the
foundation–auditor relationship, after an initial five years of improvement in
quality. Long auditor tenure may increase competence because the auditor’s
client-specific knowledge increases over the years. This allows auditors to
improve the quality of their auditing, but it could also reduce their degree of
independence. Their independence could also be compromised because they need to
keep new clients in order to recover their initial client-specific investment,
which cannot be transferred to other contracts. Industry specialization,
however, can improve both competence and independence, leading to higher audit
quality, since specialist auditors know their sector far better than
non-specialist firms. The study contributes to the literature on the
relationship between auditor tenure and audit quality in an environment where
there is no mandatory auditor rotation and in a sector, non-profit making, where
empirical research is very limited.
Satirenjit Kaur Johl et al. [16] test the impact of the internal audit function
(IAF), an increasingly common internal governance mechanism, on a firm’s
financial reporting quality. The paper uses a unique dataset of survey responses
and archival data from a developing nation where certain corporations are
politically influenced, and regression analysis is used to test the hypotheses.
The authors examine the association between internal audit quality and abnormal
accruals, a proxy for financial reporting quality, and the moderating role of
the board on this relationship. They test the prediction that internal audit
quality is related to increased financial reporting quality and that this
relationship is affected by the quality of the board. The findings show that
certain internal audit attributes play an important role in the financial
reporting process. These findings are expected to inform the Institute of
Internal Auditors and other regulatory bodies on the role of internal audit in
financial reporting, which in turn can assist in market and regulatory reforms
and inform the revised Malaysian Code of Corporate Governance.
Habbash and Murya [17] note that agency theory predicts that corporate
governance and external audit enhance the convergence of interests between
shareholders and managers. The primary objective of the thesis is to investigate
the effect of corporate governance and external audit on constraining earnings
management practice in the UK. The corporate governance attributes are organized
in four categories: 1) Board Composition; 2) Audit Committee Effectiveness;
3) Non-Executive Directors’ (NEDs) Commitment; and 4) Ownership Structures. The
external audit factors include auditor independence and audit quality. The
results reveal that board size and independence, audit committee independence
and expertise, nomination committee independence, chairman independence, the
level of NED fees and an independent and specialized external auditor are
negatively associated with earnings management at significant levels.
Tulus Suryanto [18] studies the impact of company size, profit or loss, and
information systems on audit delay in companies listed on the Indonesia Stock
Exchange. The study also scrutinizes the implications of audit delay for
fraudulent financial reporting. Audit delay is the time difference between the
date of the financial statements and the date of the audit opinion stated in the
financial statements, which indicates the length of the auditing process. The
data were analysed using Amos software. The results indicate that the
information system, company size, and operating loss or profit have a
significant influence on audit delay. The use of an information system
integrated with technology simplifies administrative and financial transaction
records, so that financial statements are issued faster and audit delay can be
reduced.
Clive Lennox et al. [19] note that empirical studies of earnings management,
audit pricing, and audit reporting provide extensive evidence that the Big Five
public accounting firms are associated with higher quality financial statements.
Measuring trends in absolute audit quality is typically infeasible, so they
focus more narrowly on discriminating between the relative performance of the
Big Five and non-Big Five audit firms in preventing companies from orchestrating
accounting fraud. They report multivariate evidence on the relation between Big
Five audits and accounting fraud. They focus on the link between auditor choice
and the incidence of accounting fraud because criticism of the Big Five stresses
their apparent role in prominent financial reporting failures, and they suggest
that future research could complement their analysis by examining whether the
evidence extends to other settings.
Siti Rochmah Ika et al. [20] examine the association between audit committee
effectiveness and timeliness of reporting. Audit committee effectiveness is
measured by an index based on the framework developed by DeZoort. Timeliness of
reporting is defined as the number of days that elapse between a company’s
financial year-end and the day on which its audited financial statement is
received by the IDX. Multivariate regression analysis was performed to analyze
the relationship between audit committee effectiveness and timeliness of
reporting. The findings show that timeliness of reporting is associated with
audit committee effectiveness. The results suggest that audit committee
effectiveness is likely to reduce the financial reporting lead time and is a
significant factor in ensuring timely submission of audited financial
statements. The authors suggest that companies could look again at how to
further improve audit committee effectiveness in order to enhance the timeliness
of financial reporting.
Matthew S. Ege [21] notes that standard-setters believe high-quality internal
audit functions (IAFs) serve as a key resource to audit committees for
monitoring senior management. A difference-in-differences analysis indicates
that misconduct firms have low IAF quality and competence during misconduct
years and improve IAF quality and competence in the post-misconduct years. The
author reports that this result is robust to a variety of other specifications,
including controlling for internal control quality and separate estimation
during the pre- and post-SOX time periods. Internal auditors serve as the
‘‘eyes and ears’’ of the audit committee and are able to deter management
misconduct by being able to detect it effectively. Internal auditors test
internal controls throughout the organization and focus their efforts on areas
with the highest risk. The results suggest that regulators and audit committees
should consider ways to improve IAF quality and, more specifically, IAF
competence.
Joseph V et al. [22] observe that the corporate governance literature in
accounting and auditing has grown rapidly. Their corporate governance focus is
primarily on corporate board and audit committee issues. A few studies have
examined the relation between corporate governance and auditor risk assessment,
audit planning, and audit opinions. These studies find that governance
characteristics affect auditors’ risk assessments and planning decisions, and
that governance quality is associated with audit opinions. Researchers have also
continued to examine the relation between corporate governance characteristics
and audit fees: strong governance could increase the demand for auditing
(increasing fees) and/or reduce auditors’ assessments of risk. The authors note
that some research approaches are viewed by some as less rigorous than archival,
econometric-based research methods. They state that researchers can continue to
rely predominantly on archival research methods and on improving the rigor and
elegance of these methods, and they stress the need to better understand the
issues that matter most to accountants, attorneys, practitioners, regulators,
and investors.
H. D. Kuna et al. [23] address outlier detection in audit logs for application
systems. An outlier is defined as an observation that is significantly different
from the other data in its set. An auditor employs many techniques, processes
and tools to identify such entries, and data mining is one medium through which
the auditor can analyze information. The enormous amount of information
contained in the logs of transactional processing systems means that auditors
must employ automated systems for anomalous data detection. Several data mining
algorithms were tested, especially those that deal specifically with
classification and outlier detection. The authors succeeded in creating and
validating an outlier detection process that works on the alphanumeric fields of
the audit logs from an information system, thus constituting a useful tool for
system auditors performing data analysis tasks.
Antoni Gual et al. [24] aim to identify suitable short versions of the Alcohol
Use Disorders Identification Test (AUDIT) and to evaluate their effectiveness as
screening tests for ‘risky drinking’ among men and women in primary health care
(PHC) settings. Logistic regression analysis was carried out to identify those
items that minimized the probability of misclassification between subjects with
and without risky drinking. Cross-tabulations and t-tests were used for group
comparisons. Relationships between the three forms of AUDIT and reported mean
weekly alcohol consumption were examined by regression analysis. Sensitivity,
specificity, positive predictive and overall accuracy (OA) values were
calculated for the full, 3-item and 4-item AUDIT forms in relation to the gold
standard (diagnosis of risky drinking). Receiver operating characteristic (ROC)
curves and their areas under the curve (AUROCs) were inspected to choose the
optimal screening test and the best cut-off scores according to sensitivity and
specificity levels. The results show that AUDIT-3 and AUDIT-4 performed
similarly to AUDIT-10 in detecting risky drinking and had equivalent ROC curves
and AUROCs. Both short forms of AUDIT seem to be as effective as the full AUDIT
for detecting risky drinking among men and women in PHC settings.
Dewiani D. et al. [25] note that financial fraud is an important problem because
it can be detrimental to firms in the modern business world. An audit is carried
out to prevent fraud and is responsible for detecting it. External audit is one
of the audit practices conducted outside the firm's internal audit, by visiting
firms to audit their financial report data. Machine learning can provide the
data analysis methods needed to solve these problems. The study proposes an SVM
method combined with an Ensemble Bagging model to improve single-classifier
performance. The results showed that the Ensemble Bagging model could improve
the classification accuracy of the SVM method and achieved the highest accuracy
of 9.95%. Based on the accuracy obtained, the SVM method with the Ensemble
Bagging model can be used to detect fraud in the firm.
Table 2.1: Survey on Audit data
| Author Name | Title | Publication Year | Algorithms/Methods | Output | Ref |
|---|---|---|---|---|---|
| Tulus Suryanto et al. | Audit delay and its implication for fraudulent financial reporting: A study of companies listed in the Indonesian stock exchange | 2016 | Non-parametric statistical test; multivariate technique of structural equation model (SEM) | Shows that use of information technology has a negative effect on audit delay. | [26] |
| SH. Sawsan et al. | The Role of Auditors in Preventing, Detecting, and Reporting Fraud: The Case of the United Arab Emirates (UAE) | 2015 | T-tests and analysis of variance (ANOVA) | The results suggest that external auditors should increase the degree to which they seek to detect and report incidents of fraud. | [27] |
| Roberto García-Fernández et al. | Auditor tenure and audit quality in Spanish state-owned foundations | 2015 | Logistic regression models | The results show the relationship between auditor tenure and audit quality in an environment without mandatory auditor rotation. | [28] |
| Harold Hassink et al. | Fraud detection, redress and reporting by auditors | 2015 | Survey conducted among all the audit partners | The result reveals that auditors fail to comply with some important elements of fraud standards. | [29] |
Science, Technology and Development
Volume IX Issue VI JUNE 2020
ISSN : 0950-0707
Page No : 51
3.Proposed Methodology
3.1 Introduction
Machine learning involves an algorithm that improves automatically through
experience based on data. One major area of auditing is making of beneficial
predictions about a dataset. In recent years, machine learning has developed and
Glen
L.Gray
et.al
A taxonomy to
guide research
on the related
application of
data mining to
fraud detection
in financial
statement
audits
2014 Scatter shot
Method
The result
shows create a
taxonomy that
combines
research on
patterns of
observed fraud
schemes with
appropriate
areas.
[30]
Satirenjit
Kaur et.al
Internal audit
function, board
quality and
financial
reporting
quality:
evidence from
Malaysia
2013 Archival
data and
regression
analysis
It shows
unexpected
positive
relationship
between internal
audit quality and
abnormal
accruals.
[31]
Siti
Rochmah
Ika et.al
Audit
committee
effectiveness
and, timeliness
of reporting:
Indonesian
evidence
2012 Index
based
framework
Finding the
timeliness of
reporting is
associated with
audit committee
is effectiveness.
[32]
Chad
A. Simon
Individual
Auditors
Identification
of Relevant
Fraud Schemes
2012
Client-Server
Method.
Auditors are
responsible for
identifying the
specific ways
they believe the
client could
committing
fraud and
adjusting the
audit plan to
address
identified
[33]
Science, Technology and Development
Volume IX Issue VI JUNE 2020
ISSN : 0950-0707
Page No : 52
received major attention in the predictive analytics of audit research. Predictive
analytics is also implemented using machine learning methods because it provides
actionable insights for the audit companies. Audit is defined as the process of
examining the financial records of any business to corroborate that their financial
statements are incompliance with the standard accounting laws and principles.
3.2 Existing system
In the existing system, the researchers took the Audit data set and applied the
Particle Swarm Optimization (PSO) algorithm to understand the complete risk
assessment process. The results of two multi-criteria methods, TOPSIS and SAW,
indicated that BayesNet and J48 perform the best for this particular audit
dataset and also give stable results on K-fold validation testing, serving as
proof of the eligibility of these classifiers to perform an efficient risk
assessment of the suspicious firms in the audit field work decision-making
process. J48 builds a decision tree based on the theory of information entropy.
It is an open source Java implementation of the C4.5 algorithm. Bayesian networks
aim to model conditional dependence, and therefore causation, by representing
conditional dependence by edges in a directed graph. Through these relationships,
one can efficiently conduct inference on the random variables in the graph
through the use of factors. This model is based on probability theory and
directed acyclic graph theory. In the existing system, the researchers obtained
93% accuracy by using the BayesNet and J48 algorithms.
3.3 Motivation
Nowadays, it is extremely difficult to understand the reasons why a service is
failing or the real reasons why company sales increase or decrease at a given
time. This makes it difficult to produce the desired result and also reduces the
accuracy of data classification. To overcome this drawback, we use machine
learning algorithms to find the solution that gives the best accuracy in
classifying the data.
3.4 Proposed framework
In the proposed system, the main focus is to use various optimization
algorithms for analysing statistical measures of the audit risk and trial data to
find the best accuracy for suspicious firm classification and produce an optimum
or a satisfactory solution. We then analyzed the results and compared them with
the existing system to determine which algorithm is best.
3.5 Description
We have taken the data set, i.e. the Audit data, from the UCI Repository, which
is freely available on the internet, and applied various classification
algorithms to it: K-Nearest Neighbor, Logistic Regression and Support Vector
Machine. We then applied various statistical methods to analyze the results and
generated the final report. The following subsections give a brief description
of the proposed system and its methodology, the methods and algorithms used in
it, and the approach followed to give an accurate solution.
3.5.1 Flow chart
Figure 3.1 represents the step-by-step procedure of the proposed system. It
gives a complete view of the proposed system, that is, where the data were taken
from and which methods and techniques were applied. We select the data from the
UCI repository and then apply data preprocessing techniques for data cleaning and
transformation. After that, the results are evaluated using machine learning
classification algorithms.
Fig 3.1: Architecture for proposed system
3.5.2 System Architecture
Figure 3.2 shows the architecture of the proposed system as implemented in
MATLAB. It handles three different classification algorithms: K-Nearest
Neighbor, Support Vector Machine and Logistic Regression.
Choose among various algorithms to train and validate classification
models for binary or multiclass problems. After training multiple models,
compare their validation errors side-by-side, and then choose the best model.
To help you decide which algorithm to use, see Train Classification Models in
Classification Learner App.
This flow chart shows a common workflow for training classification models,
or classifiers, in the Classification Learner app.
Fig 3.2 Proposed architecture using Matlab
3.5.2.1 Select Data and Validation for Classification Problem
Import data into Classification Learner from the workspace or files, find example
data sets, and choose cross-validation or holdout validation options.
3.5.2.2 Choose Classifier Options
In Classification Learner, automatically train a selection of models, or compare
and tune options in decision tree, discriminant analysis, logistic regression,
naive Bayes, support vector machine, nearest neighbor, and ensemble models.
3.5.2.3 Train the classifier
You can use Classification Learner to automatically train a selection of different
classification models on your data.
3.5.2.4 Assess Classifier Performance in Classification Learner
Compare model accuracy scores, visualize results by plotting class predictions,
and check performance per class in the Confusion Matrix.
3.5.2.5 Export Classification Model to Predict New Data
After training in Classification Learner, export models to the workspace and,
optionally, generate MATLAB® code.
4. Experimental Setup and result analysis
4.1 Simulative Environment
The development environment for the proposed method is MATLAB R2018b on a
system with any Intel or AMD x86-64 processor, 1.8 GHz, 4 GB RAM and Microsoft
Windows with Family.
4.2 Experimental Setup
In the experimental setup, we describe the data set, i.e. the Audit data set,
and MATLAB.
4.2.1 Dataset
This study was carried out on a dataset containing information about the
evaluation of a person, organization, system, process, enterprise or product. The
term audit most commonly refers to audits in accounting, but similar concepts
also exist in project management, quality management, water management, and
energy conservation. This dataset and its related information have been drawn
from the UCI Machine Learning Repository. The dataset covers two auditing
practices: the audit_risk and trial datasets. The audit_risk dataset has 776
instances and 27 attributes, while the trial dataset has 776 instances and 18
attributes. The dataset consists of records related to scores, locations, risks,
money value, loss, etc. The audit work is comprehensive and considers all
aspects of the organization, both financial and non-financial, with an emphasis
on constructive improvement. A sample of the raw data can be seen in the table
below. The database contains exhaustive one-year non-confidential data on firms
for the year 2015 to 2016, collected from the Auditor Office of India to build a
predictor for classifying suspicious firms. Using this data set, we have analyzed
the main factors in the Audit data by applying various machine learning
algorithms.
Table 4.1 Description of data set

| Data Set Characteristics: | Multivariate | Number of Instances: | 777 | Area: | N/A |
|---|---|---|---|---|---|
| Attribute Characteristics: | Real | Number of Attributes: | 18 | Date Donated: | 2018-07-14 |
| Associated Tasks: | Classification | Missing Values? | Yes | Number of Web Hits: | 29866 |

4.2.2 Data set information
The goal of the research is to help the auditors by building a classification
model that can predict the fraudulent firm on the basis of the present and
historical risk factors. The sectors and the counts of firms are listed
respectively as Irrigation (114), Public Health (77), Buildings and Roads (82),
Forest (70), Corporate (47), Animal Husbandry (95), Communication (1),
Electrical (4), Land (5), Science and Technology (3), Tourism (1), Fisheries (41),
Industries (37), Agriculture (200).
4.2.2.1 Attribute Information
Many risk factors are examined from various areas like:
• Past records of audit office
• Audit-paras
• Environmental conditions reports
• Firm reputation summary
• On-going issues report
• Profit-value records
• Loss-value records, follow-up reports etc.
• Public health reports
• Agriculture issues
After in-depth interviews with the auditors, important risk factors are
evaluated and their probability of existence is calculated from the present and
past records.
4.3 Parameter setting
In parameter setting, we define the different methods and measures used: cross
validation, random sampling, training time and prediction speed.
4.3.1 Cross validation
Cross validation is a technique for evaluating how the results of a statistical
analysis generalize to an independent dataset. It is a way of estimating the
error rate, which is necessary for a fair evaluation of the accuracy of a model.
A model with a low error rate is considered good; otherwise it is not. Cross
validation therefore helps to select the best-fitting model. It splits the data
into a given number of folds (usually 5 or 10). The algorithm is tested by
holding out the examples from one fold at a time; the model is induced from the
other folds and the examples from the held-out fold are classified. This is
repeated for all the folds.
4.3.1.1 Steps followed in 10-fold cross validation
i. Divide the dataset into 10 parts, also called folds.
ii. Hold out each part in turn, one per iteration.
iii. Average the results.
iv. Each data point is used once for testing and 9 times for training.
v. Stratified cross validation ensures that each fold has the right
proportion of each class value.
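The steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the procedure MATLAB's Classification Learner runs internally: the function names `stratified_k_fold` and `cross_validate`, the `evaluate` callback and the example labels are all hypothetical.

```python
import random
from collections import defaultdict


def stratified_k_fold(labels, k=10, seed=0):
    """Return k lists of indices; each class is spread evenly over the folds."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    folds = [[] for _ in range(k)]
    position = 0
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        # Deal the indices of this class round-robin so every fold keeps
        # roughly the same class proportions as the full dataset.
        for index in indices:
            folds[position % k].append(index)
            position += 1
    return folds


def cross_validate(labels, evaluate, k=10, seed=0):
    """Average the score returned by evaluate(train_idx, test_idx) over k folds."""
    folds = stratified_k_fold(labels, k, seed)
    scores = []
    for held_out in range(k):
        test_idx = folds[held_out]
        train_idx = [i for f, fold in enumerate(folds) if f != held_out for i in fold]
        scores.append(evaluate(train_idx, test_idx))
    return sum(scores) / k


# Hypothetical labels: 60 non-suspicious (0) and 40 suspicious (1) firms.
labels = [0] * 60 + [1] * 40
folds = stratified_k_fold(labels, k=10)
```

In practice `evaluate` would train a classifier on `train_idx` and return its accuracy on `test_idx`; here it is left as a callback so that the fold logic stays independent of any particular model.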
4.3.2 Random Sampling
Random sampling is a way of choosing a sample of observations from a population
in order to make inferences about the population. When we select a group of
subjects from a larger population, every possible sample of a given size has the
same probability of selection. Random sampling splits the data randomly into a
training set and a testing set in a given proportion, and the whole procedure is
repeated a specified number of times. In statistics, a random sample is a subset
of individuals chosen from a larger set. Each individual is chosen randomly and
entirely by chance, such that each individual has the same probability of being
chosen.
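A minimal sketch of such a repeated random split is given below. The 25% test fraction, the number of repeats and the function names are assumptions chosen for illustration; only the figure of 776 instances comes from the dataset description.

```python
import random


def random_split(n_samples, test_fraction=0.25, seed=None):
    """Randomly split the indices 0..n_samples-1 into training and testing sets."""
    rng = random.Random(seed)
    indices = list(range(n_samples))
    rng.shuffle(indices)  # every ordering of the indices is equally likely
    n_test = round(n_samples * test_fraction)
    return indices[n_test:], indices[:n_test]


def repeated_random_sampling(n_samples, repeats=5, test_fraction=0.25, seed=0):
    """Repeat the random split a fixed number of times with different seeds."""
    return [random_split(n_samples, test_fraction, seed + r) for r in range(repeats)]


# 776 instances as reported for the audit data; the other values are assumed.
splits = repeated_random_sampling(n_samples=776, repeats=5, test_fraction=0.25)
```

Each element of `splits` is a `(train_indices, test_indices)` pair; a model would be trained and scored once per pair and the scores averaged.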
4.3.3 Training speed
Training time is the time taken to train a model on the data using a
classification algorithm. Owing to the large range of training times, only the
order relative to the fastest algorithm for each dataset is reported. The
fastest algorithm is indicated by a '0'. An algorithm that is between 10^(x-1)
and 10^x times as slow is indicated by the value of x. Training time is measured
in seconds.
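The relative order described here can be computed as a rounded-up base-10 logarithm of the slowdown ratio. The sketch below assumes this reading of the rule; the function name and the training times are hypothetical and are not measurements from this study.

```python
import math


def relative_speed_order(training_time, fastest_time):
    """Return x such that the algorithm is between 10^(x-1) and 10^x times as slow."""
    ratio = training_time / fastest_time
    if ratio <= 1:
        return 0  # the fastest algorithm itself
    return math.ceil(math.log10(ratio))


# Hypothetical training times in seconds.
times = {"A": 0.8, "B": 4.0, "C": 120.0}
fastest = min(times.values())
orders = {name: relative_speed_order(t, fastest) for name, t in times.items()}
```

With these assumed times, B is about 5 times as slow as A (order 1) and C is about 150 times as slow (order 3).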
4.3.4 Prediction speed
Prediction refers to the output of an algorithm after it has been trained on a
historical dataset. It is the process of using data, together with analysis,
statistics and machine learning techniques, to create a predictive model that
predicts a future value (for example, how long a machine can run before
requiring maintenance) or estimates a probability. Prediction speed is the rate
at which the trained model produces predictions. It is measured in observations
per second (obs/sec).
4.4 Performance measures
Several performance measures are used to assess classification. Using the
MATLAB software kit, we have implemented and analyzed the results of Logistic
Regression, Support Vector Machine and K-Nearest Neighbor.
4.4.1 Classification accuracy
Accuracy is calculated as the number of correct predictions divided by the total
number of samples in the dataset. It works well on balanced data; on imbalanced
data it can give a misleading picture of performance.
Accuracy = (Number of correctly classified samples) / (Number of total test data) × 100
4.4.2 Confusion matrix
A confusion matrix is a table that is often used to describe the performance
of a classification model (or "classifier") on a set of test data for which the
true values are known. It helps to identify the areas where the classifier has
performed poorly. It is a summary of prediction results on a classification
problem. The numbers of correct and incorrect predictions are summarized with
count values and broken down by class. The confusion matrix shows the ways in
which the classification model is confused when it makes predictions. It gives
insight not only into the errors being made by a classifier but, more
importantly, into the types of errors that are being made.

| N | Predicted class1 | Predicted class2 |
|---|---|---|
| Actual class1 | TP | FN |
| Actual class2 | FP | TN |

Accuracy (or) Classification Rate = (TP + TN) / N
where N = TP + TN + FP + FN
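As a small worked example, the four cells and the classification rate can be counted directly from paired actual and predicted labels. The labels below are hypothetical (1 = suspicious firm, 0 = non-suspicious firm) and the function names are illustrative only.

```python
def confusion_counts(actual, predicted, positive=1):
    """Count TP, FN, FP and TN for a binary problem."""
    tp = fn = fp = tn = 0
    for a, p in zip(actual, predicted):
        if a == positive and p == positive:
            tp += 1
        elif a == positive:
            fn += 1
        elif p == positive:
            fp += 1
        else:
            tn += 1
    return tp, fn, fp, tn


def accuracy(tp, fn, fp, tn):
    """Accuracy = (TP + TN) / N with N = TP + TN + FP + FN."""
    return (tp + tn) / (tp + tn + fp + fn)


# Hypothetical labels, not data from this study.
actual    = [1, 1, 1, 0, 0, 0, 0, 1]
predicted = [1, 0, 1, 0, 1, 0, 0, 1]
tp, fn, fp, tn = confusion_counts(actual, predicted)
acc = accuracy(tp, fn, fp, tn)
```

For these eight samples the counts are TP = 3, FN = 1, FP = 1 and TN = 3, so the accuracy is 6/8 = 0.75.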
4.4.3 Receiver Operating Characteristic (ROC) curve
A ROC curve (receiver operating characteristic curve) is a graph showing the
performance of a classification model at all classification thresholds.
It is a metric used to check the quality of classifiers. For each class of a
classifier, ROC analysis applies threshold values across the interval [0, 1] to
the outputs.
This curve plots two parameters:
TPR = True Positive Rate
FPR = False Positive Rate
The True Positive Rate (TPR) is plotted against the False Positive Rate (FPR) for
the probabilities of the classifier predictions. Then, the area under the plot is
calculated. The Area under Curve (AUC) is a measure of the overall quality of
the classifier.
True Positive Rate (TPR) is a synonym for recall and is therefore defined as
follows:
TPR = TP / (TP + FN) (1)
False Positive Rate (FPR) is defined as follows:
FPR = FP / (FP + TN) (2)
4.4.4 Area under Curve (AUC)
The area under the ROC curve is a standard evaluation measure for
classifiers. It is equivalent to the concordance probability: it can be
interpreted as the probability that a random positive is assigned a higher score
than a random negative.
4.4.4.1 AUC is essential for the following two reasons
• AUC is scale-invariant. It measures how well predictions are ranked,
rather than their absolute values.
• AUC is classification-threshold-invariant. It measures the quality of
the model's predictions irrespective of what classification threshold
is chosen.
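The ROC construction and the two readings of AUC can be checked on a toy example. The sketch below is an assumption-laden illustration: the scores and labels are hypothetical, and the function names are not taken from MATLAB. It computes the ROC points by sweeping the threshold, the area by the trapezoidal rule, and the same quantity as the fraction of positive-negative pairs that are ranked correctly.

```python
def roc_points(scores, labels):
    """Return (FPR, TPR) points obtained by lowering the threshold step by step."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(ranked):
        threshold = ranked[i][0]
        # Consume all samples tied at this score before emitting a point.
        while i < len(ranked) and ranked[i][0] == threshold:
            if ranked[i][1] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        points.append((fp / n_neg, tp / n_pos))
    return points


def auc_trapezoid(points):
    """Area under the ROC curve by the trapezoidal rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area


def auc_by_ranking(scores, labels):
    """Probability that a random positive outscores a random negative (ties count 1/2)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical classifier scores and true labels (1 = positive).
scores = [0.9, 0.8, 0.7, 0.6, 0.55, 0.4, 0.3, 0.2]
labels = [1, 1, 0, 1, 1, 0, 0, 0]
area = auc_trapezoid(roc_points(scores, labels))
rank_auc = auc_by_ranking(scores, labels)
```

Both computations give 0.875 for this example (14 of the 16 positive-negative pairs are ranked correctly), and multiplying every score by a constant leaves the value unchanged, which illustrates the scale-invariance property.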
4.4.5 Precision (Positive predictive value)
Precision is a metric for classification models. It measures how often the
model is correct when it predicts the positive class. High precision relates to
a low number of false positives.
In information retrieval, the positives are the documents that are retrieved in
response to a query, and the true positives are those that are really relevant
to the query. The precision value is calculated by using the following
formula.
Precision = TP / (TP + FP) (3)
4.4.6 Recall (True positive rate)
Recall is the proportion of correctly predicted positive observations to all
observations in the actual positive class. Mathematically, the recall value is
calculated by using the following formula:
Recall = TP / (TP + FN) (4)
4.4.7 False discovery rate
The false discovery rate (FDR) is the expected proportion of type I errors among
the positive predictions. A type I error is where you incorrectly reject the
null hypothesis; in other words, you get a false positive. Closely related to
the FDR is the family-wise error rate (FWER). The FWER is the probability of
making at least one false conclusion (i.e. at least one type I error).
False discovery rate = FP / (TP + FP) (5)
4.4.8 False negative rate
The false negative rate is the proportion of positives which yield negative
test outcomes, i.e. the conditional probability of a negative test result given
that the condition being looked for is present.
False negative rate = FN / (FN + TP) (6)
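Equations (1) to (6) can be evaluated together from the four confusion-matrix counts. The sketch below uses hypothetical counts, not results from this study, and does not guard against empty classes (a zero denominator).

```python
def rates(tp, fn, fp, tn):
    """Rates derived from confusion-matrix counts, following equations (1)-(6)."""
    return {
        "TPR (recall)": tp / (tp + fn),
        "FPR": fp / (fp + tn),
        "precision (PPV)": tp / (tp + fp),
        "FDR": fp / (tp + fp),
        "FNR": fn / (fn + tp),
    }


# Hypothetical counts chosen for illustration only.
r = rates(tp=90, fn=10, fp=5, tn=95)
```

Two identities are useful as a sanity check: precision and FDR sum to 1, and recall and FNR sum to 1, because each pair shares a denominator.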
4.4.9 Scatter plot
After a classifier is trained, the scatter plot switches from displaying the
data to showing model predictions. scatterplot(x) produces a scatter plot for
the signal x. The interpretation of x depends on its shape and
complexity:
• If x is a real two-column matrix, scatterplot interprets the first column
as in-phase components and the second column as quadrature
components.
• If x is a complex vector, scatterplot interprets the real part as in-phase
components and the imaginary part as quadrature components.
• If x is a real vector, scatterplot interprets it as a real signal.
4.5 Experimental result
In the experimental analysis, we have taken the results from the MATLAB tool
for performing the analysis.
4.5.1 Statistical analysis for machine learning algorithms
We applied various statistical methods to analyze the classification algorithms
used in our project: K-Nearest Neighbor, Support Vector Machine and Logistic
Regression.
4.5.1.1 Results of classification
The different classification algorithms are evaluated using procedures such as
cross validation.
4.5.1.1.1 Cross validation
Cross validation is a technique for evaluating how the results of a statistical
analysis generalize to an independent dataset. It is a resampling procedure used
to evaluate machine learning models on a limited data sample. The procedure has
a single parameter called k that refers to the number of groups that a given
data sample is to be split into. As such, the procedure is often called k-fold
cross-validation.
When a specific value for k is chosen, it may be used in place of k in the
reference to the model, such as k=10 becoming 10-fold cross-validation. 10-fold
cross validation divides the dataset into 10 folds, of which 9 folds are used
for training and one fold is used for testing.
Fig 4.8: Classification using cross validation
4.5.2 Confusion matrix
The confusion matrix itself is relatively simple to understand, but the related
terminology can be confusing. The confusion matrix plays a major role in
finding the accuracy and error rate of the algorithms. The confusion matrix
displays the total number of observations in each cell. The rows of the
confusion matrix correspond to the true class, and the columns correspond to the
predicted class. Diagonal and off-diagonal cells correspond to correctly and
incorrectly classified observations, respectively.
4.5.2.1 K-Nearest Neighbors
K-Nearest Neighbors is one of the simplest classification algorithms. It
stores all available cases and classifies new cases based on a similarity
measure such as a distance function. K-nearest neighbors is mostly used for
statistical analysis as well as for pattern recognition.
4.5.2.1.1 Steps for K-Nearest Neighbors
1. Determine the parameter k = number of nearest neighbors.
2. Calculate the distance between the query instance and all the training samples.
3. Sort the distances and determine the nearest neighbors based on the minimum
distance.
4. Gather the categories of the nearest neighbors.
5. Use the simple majority of the categories of the nearest neighbors as the
prediction value of the query instance.
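The steps above translate almost line by line into code. The following sketch uses the Euclidean distance and a hypothetical two-feature training set; the function name `knn_predict` and the class labels are illustrative assumptions, not the MATLAB implementation used in the experiments.

```python
import math
from collections import Counter


def knn_predict(train_x, train_y, query, k=3):
    """Classify `query` by majority vote among its k nearest training samples."""
    # Step 2: distance between the query instance and every training sample.
    distances = [(math.dist(x, query), y) for x, y in zip(train_x, train_y)]
    # Step 3: sort by distance and keep the k nearest neighbours.
    nearest = sorted(distances, key=lambda pair: pair[0])[:k]
    # Steps 4-5: gather the neighbours' categories and take the simple majority.
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical two-feature training data (e.g. two risk scores per firm).
train_x = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (4.0, 4.2), (4.1, 3.9), (3.8, 4.0)]
train_y = ["no-risk", "no-risk", "no-risk", "risk", "risk", "risk"]
prediction = knn_predict(train_x, train_y, query=(1.1, 1.0), k=3)
```

An odd k is usually chosen for two-class problems so that the majority vote cannot tie.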
4.5.2.1.2 Formulas for KNN using Matlab
The k nearest neighbors are selected based on a distance metric. Given an
mx-by-n data matrix X, which is treated as mx (1-by-n) row vectors
$x_1, x_2, \ldots, x_{mx}$, and an my-by-n data matrix Y, which is treated as my
(1-by-n) row vectors $y_1, y_2, \ldots, y_{my}$, the various distances between
the vectors $x_s$ and $y_t$ are defined as follows:
1. Euclidean distance
$d_{st}^2 = (x_s - y_t)(x_s - y_t)'$
The Euclidean distance is a special case of the Minkowski distance.
2. Standardized Euclidean distance
$d_{st}^2 = (x_s - y_t) V^{-1} (x_s - y_t)'$
where V is the n-by-n diagonal matrix.
3. Mahalanobis distance
$d_{st}^2 = (x_s - y_t) C^{-1} (x_s - y_t)'$
where C is the covariance matrix.
4. City block distance
$d_{st} = \sum_{j=1}^{n} |x_{sj} - y_{tj}|$
5. Minkowski distance
$d_{st} = \sqrt[p]{\sum_{j=1}^{n} |x_{sj} - y_{tj}|^p}$
For the special case of p = 1, the Minkowski distance gives the city block
distance. For the special case of p = 2, the Minkowski distance gives the
Euclidean distance. For the special case of p = ∞, the Minkowski distance gives
the Chebyshev distance.
6. Chebyshev distance
$d_{st} = \max_j |x_{sj} - y_{tj}|$
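The relationships between these distances can be verified numerically. The sketch below implements the city block, Euclidean, standardized Euclidean, Minkowski and Chebyshev distances in plain Python. The Mahalanobis distance is omitted because it needs a matrix inverse. The vectors and the per-feature scaling values are hypothetical, and the standardized version assumes that V holds one positive scaling value per feature on its diagonal.

```python
def minkowski(x, y, p):
    """Minkowski distance of order p between two equal-length vectors."""
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1.0 / p)


def city_block(x, y):
    """Sum of absolute coordinate differences (Minkowski with p = 1)."""
    return sum(abs(a - b) for a, b in zip(x, y))


def euclidean(x, y):
    """Square root of the sum of squared differences (Minkowski with p = 2)."""
    return sum((a - b) ** 2 for a, b in zip(x, y)) ** 0.5


def chebyshev(x, y):
    """Largest absolute coordinate difference (limit of Minkowski as p grows)."""
    return max(abs(a - b) for a, b in zip(x, y))


def standardized_euclidean(x, y, v_diag):
    """Euclidean distance with each squared difference divided by V's diagonal entry."""
    return sum((a - b) ** 2 / v for a, b, v in zip(x, y, v_diag)) ** 0.5


# Hypothetical vectors; coordinate differences are 3, 4 and 0.
x_s, y_t = (1.0, 2.0, 3.0), (4.0, 6.0, 3.0)
d_city = city_block(x_s, y_t)
d_eucl = euclidean(x_s, y_t)
d_cheb = chebyshev(x_s, y_t)
```

For these vectors the city block distance is 7, the Euclidean distance is 5 and the Chebyshev distance is 4, and `minkowski` reproduces the first two for p = 1 and p = 2 and approaches the third as p becomes large.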
i. Create a confusion matrix chart and sort the classes of the chart
according to the class-wise number of observations.
Fig 4.9: Confusion matrix for KNN based on number of observations
ii. Create a confusion matrix chart and sort the classes of the chart
according to the class-wise true positive rate (recall) and the class-wise
false negative rate.
Fig 4.10: Confusion matrix for KNN based on TPR and FNR
iii. Create a confusion matrix chart and sort the classes of the chart according
to the class-wise positive predictive value (precision) and the class-wise
false discovery rate.
Fig 4.11: Confusion matrix for KNN based on PPV and FDR
4.5.2.2 Logistic Regression
Logistic Regression is a machine learning algorithm used for classification
problems. It is a predictive analysis algorithm based on the concept of
probability. Logistic regression can be viewed as a linear regression model
whose output is passed through the 'sigmoid function', also known as the
'logistic function', instead of being used directly as a linear function. The
hypothesis of logistic regression is limited to values between 0 and 1. A linear
function cannot represent it, because a linear function can take a value greater
than 1 or less than 0, which is not possible under the hypothesis of logistic
regression.
$0 \le h_\theta(x) \le 1$
Logistic regression hypothesis expectation
4.5.2.2.1 Sigmoid Function:
In order to map predicted values to probabilities, we use the sigmoid function.
The function maps any real value into another value between 0 and 1.
$f(z) = \frac{1}{1 + e^{-z}}$
f(z) = output between 0 and 1 (probability estimate)
z = input to the function (your algorithm's prediction, e.g. mx + b)
e = base of the natural logarithm
The function has an "S"-shaped curve.
Logistic regression mainly follows three fundamental concepts:
1. Logistic regression hypothesis
2. Logistic regression decision boundary
3. Logistic regression cost function
4.5.2.2.2 Hypothesis Representation
When using linear regression, the hypothesis has the form
$h_\theta(x) = \beta_0 + \beta_1 X$
For logistic regression, we modify it slightly by applying the sigmoid function:
$\sigma(Z) = \sigma(\beta_0 + \beta_1 X)$
We expect our hypothesis to give values between 0 and 1:
$Z = \beta_0 + \beta_1 X$
$h_\theta(x) = \mathrm{sigmoid}(Z)$
$h_\theta(x) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 X)}}$
4.5.2.2.3 Decision boundary:
Our current prediction function returns a probability score between 0 and 1. In
order to map this to a discrete class (true/false, cat/dog), we select a
threshold value or tipping point above which we classify values into class 1 and
below which we classify values into class 0.
p ≥ 0.5, class = 1
p < 0.5, class = 0
4.5.2.2.4 Cost function
For logistic regression, the cost function is defined as:
$\mathrm{Cost}(h_\theta(x), y) = \begin{cases} -\log(h_\theta(x)) & \text{if } y = 1 \\ -\log(1 - h_\theta(x)) & \text{if } y = 0 \end{cases}$
This is the cost the algorithm pays if it predicts the value $h_\theta(x)$ while
the actual label turns out to be y. Using this function guarantees that the
function the gradient descent algorithm has to process is convex. Gradient
descent is then used to minimize the cost function.
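The three concepts of hypothesis, decision boundary and cost can be put side by side in a short sketch. The coefficients below are hypothetical values chosen only for illustration (they are not fitted to the audit data), and the simple `sigmoid` shown here can overflow for very large negative inputs.

```python
import math


def sigmoid(z):
    """Logistic function: maps any real z into the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def hypothesis(x, beta0, beta1):
    """h(x) = sigmoid(beta0 + beta1 * x)."""
    return sigmoid(beta0 + beta1 * x)


def predict_class(p, threshold=0.5):
    """Decision boundary: class 1 if p >= threshold, otherwise class 0."""
    return 1 if p >= threshold else 0


def cost(h, y):
    """Per-sample cost: -log(h) if y = 1, -log(1 - h) if y = 0."""
    return -math.log(h) if y == 1 else -math.log(1.0 - h)


# Hypothetical coefficients, chosen only for illustration.
beta0, beta1 = -4.0, 2.0
p = hypothesis(3.0, beta0, beta1)  # sigmoid(2.0), roughly 0.88
```

With these assumed coefficients the input x = 3 is assigned to class 1, and the cost is small if the true label is 1 and large if it is 0, which is the behaviour the piecewise definition is designed to produce.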
i. Create a confusion matrix chart and sort the classes of the chart
according to the class-wise number of observations.
Fig 4.12: Confusion matrix for LR based on number of observations
ii. Create a confusion matrix chart and sort the classes of the chart
according to the class-wise true positive rate (recall) and the class-wise
false negative rate.
Fig 4.13: Confusion matrix for LR based on TPR and FNR
iii. Create a confusion matrix chart and sort the classes of the chart according
to the class-wise positive predictive value (precision) and the class-wise
false discovery rate.
Fig 4.14: Confusion matrix for LR based on PPV and FDR
4.5.2.3 Support Vector Machine
A Support Vector Machine (SVM) performs classification by finding the hyperplane
that maximizes the margin between the two classes. The vectors (cases) that
define the hyperplane are the support vectors. An SVM can be built in three
ways: linear SVM, nonlinear SVM and kernel-function SVM. A support vector
machine is a discriminative classifier formally defined by a separating
hyperplane. In other words, given labeled training data (supervised learning),
the algorithm outputs an optimal hyperplane which categorizes new examples. An
SVM model is a representation of the examples in which the examples of the
separate categories are divided by a clear gap that is as wide as possible. New
examples are then mapped into that same space and predicted to belong to a
category based on which side of the gap they fall. The operation of the SVM
algorithm is based on finding the hyperplane that gives the largest minimum
distance to the training examples. Twice this distance is known as the margin
in SVM theory. Therefore, the optimal separating hyperplane maximizes the margin
of the training data.
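The notion of margin can be made concrete without training an SVM. The sketch below does not implement SVM training; it only evaluates, for a given candidate hyperplane w·x + b = 0, the smallest distance to the training examples and returns twice that distance. The data points, the two candidate hyperplanes and the function names are hypothetical.

```python
import math


def signed_distance(w, b, x):
    """Signed distance from point x to the hyperplane w.x + b = 0."""
    norm = math.sqrt(sum(wi * wi for wi in w))
    return (sum(wi * xi for wi, xi in zip(w, x)) + b) / norm


def margin(w, b, points, labels):
    """Twice the smallest distance from the hyperplane to any training example.

    Labels are +1 / -1. Returns None if the hyperplane misclassifies a point.
    """
    distances = [y * signed_distance(w, b, x) for x, y in zip(points, labels)]
    if min(distances) <= 0:
        return None
    return 2 * min(distances)


def classify(w, b, x):
    """Assign a new example to the side of the hyperplane on which it falls."""
    return 1 if signed_distance(w, b, x) >= 0 else -1


# Hypothetical linearly separable data and two candidate hyperplanes.
points = [(1.0, 1.0), (2.0, 1.0), (4.0, 4.0), (5.0, 4.0)]
labels = [-1, -1, 1, 1]
good = margin((1.0, 1.0), -5.5, points, labels)  # hyperplane x1 + x2 = 5.5
poor = margin((1.0, 1.0), -3.5, points, labels)  # hyperplane x1 + x2 = 3.5
```

Both hyperplanes separate the two classes, but the first lies midway between the closest points of each class and therefore has the larger margin; an SVM solver searches for the hyperplane with the largest such value.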
I. Create a confusion matrix chart and sort the classes of the chart according
to the class-wise number of observations.
Fig 4.15: Confusion matrix for SVM based on number of observations
II. Create a confusion matrix chart and sort the classes of the chart according
to the class-wise true positive rate (recall) and the class-wise false
negative rate.
Fig 4.16: Confusion matrix for SVM based on TPR and FNR
III. Create a confusion matrix chart and sort the classes of the chart according
to the class-wise positive predictive value (precision) and the class-wise
false discovery rate.
Fig 4.17: Confusion matrix for SVM based on PPV and FDR
4.5.3 ROC chart
By varying the value of the threshold, a family of different binary classifiers
is obtained and, for each of them, we can calculate the sensitivity and the
specificity. The ROC curve is a graphical plot that illustrates the diagnostic
ability of a binary classifier system as its discrimination threshold is varied.
That is, the ROC curve is a graph that plots 1.0 - specificity on the x-axis
and the sensitivity on the y-axis, calculated for every different threshold.
In order to calculate the points of the curve, the threshold is varied along the
output probabilities of the testing instances in ascending order. A perfect
model that classifies every instance correctly has a ROC curve that passes
through the upper left corner, which is the point at which the sensitivity and
the specificity both take the value 1. The closer the ROC curve passes to the
upper left corner, the better its discrimination capacity. The base line
represents the ROC curve for a random classifier.
4.5.3.1 Area under curve
As a perfect classifier passes through the upper left corner, i.e. the point
(0, 1), its area under the curve is 1. A random classifier, represented by the
base line, has an area under the curve of 0.5. If the area under the curve takes
a value lower than 0.5, the classifier is worse than random. In practice, this
measure should take a value between 0.5 and 1.0. The closer the area under the
curve is to 1, the better the classifier.
4.5.3.1.1 KNN
The k-NN model creates confidences based on the mean of the k nearest
neighbors. For k=1, the confidences can be only 0 or 1, leading to a coarse but
correct ROC plot. For larger values of k, the granularity of the confidence
values increases. In this algorithm there is no decision threshold as such: an
instance is classified as the dominant class of the k nearest instances to it.
In KNN, the area under the curve is 0.97 and the current classifier is at the
point (0.04, 0.99). The maximum AUC is 1, which corresponds to a perfect
classifier. Larger AUC values indicate better classifier performance.
Fig 4.18: ROC curve for K-Nearest Neighbors
4.5.3.1.2 Logistic Regression
Use only the first two features as predictor variables. Define a binary
classification problem by using only the measurements that correspond to the
species. To define the ROC curve, use the probability estimates from the
logistic regression model as scores.
In logistic regression, the area under the curve is 0.97 and the current
classifier is at the point (0.04, 0.98). The maximum AUC is 1, which corresponds
to a perfect classifier. Larger AUC values indicate better classifier
performance.
Fig 4.19: ROC curve for Logistic Regression
4.5.3.1.3 SVM
The AUC SVM generally achieves a higher AUC than the regular SVM, and the
optimal setting of its trade-off parameter is smaller than for the regular
SVM. For linear kernels, the AUC SVM is more effective at maximizing the AUC
than the regular SVM. For SVM, ROC analysis can be applied to the corresponding
binary one-versus-one or one-versus-rest problems.
In SVM, the area under the curve is 1.00 and the current classifier point is
(0.05, 0.99). The maximum AUC is 1, which corresponds to a perfect classifier.
Larger AUC values indicate better classifier performance.
Fig 4.20: ROC curve for Support Vector Machine
4.5.4 Scatter plot
Use the scatter plot to view the classifier results. After you train a classifier, the
scatter plot switches from displaying the data to showing model predictions. If
you are using holdout or cross-validation, then these predictions are the
predictions on the held-out observations. In other words, each prediction is
obtained using a model that was trained without using the corresponding
observation. To investigate your results, use the controls on the right. You can:
a. Choose whether to plot model predictions or the data alone.
b. Show or hide correct or incorrect results using the check boxes under Model
predictions.
c. Choose features to plot using the X and Y lists under Predictors.
d. Visualize results by class by showing or hiding specific classes using the
check boxes under Show.
e. Change the stacking order of the plotted classes by selecting a class
under Classes and then clicking Move to Front.
f. Zoom in and out, or pan across the plot. To enable zooming and panning,
hover the mouse over the scatter plot and click one of the buttons that appear
near the top-right corner of the plot.
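The idea that each plotted prediction comes from a model trained without the corresponding observation can be sketched in Python as follows. This is only an illustration of k-fold held-out prediction: the one-dimensional data, the 1-nearest-neighbor rule and the function names are assumptions, not the procedure implemented inside the Classification Learner app.

```python
def predict_1nn(train_x, train_y, x):
    """Label of the single closest training point (one numeric feature)."""
    nearest = min(range(len(train_x)), key=lambda i: abs(train_x[i] - x))
    return train_y[nearest]


def cross_val_predict(xs, ys, n_folds):
    """Predict every observation with a model that never saw it."""
    predictions = [None] * len(xs)
    for fold in range(n_folds):
        # Observations whose index falls in this fold are held out.
        held_out = [i for i in range(len(xs)) if i % n_folds == fold]
        train = [i for i in range(len(xs)) if i % n_folds != fold]
        train_x = [xs[i] for i in train]
        train_y = [ys[i] for i in train]
        for i in held_out:
            predictions[i] = predict_1nn(train_x, train_y, xs[i])
    return predictions


# Hypothetical, well-separated data with two classes.
xs = [0.1, 0.2, 0.3, 0.4, 2.0, 2.1, 2.2, 2.3]
ys = [0, 0, 0, 0, 1, 1, 1, 1]
held_out_predictions = cross_val_predict(xs, ys, n_folds=4)
print(held_out_predictions)
```

Comparing held_out_predictions with ys gives the correct and incorrect points that the scatter plot distinguishes.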
Fig 4.21: Scatter plot in the Classification Learner app
4.5.5 Result Analysis
For the result analysis, we took the results that were calculated and analyzed
at the experimental stage and represented them graphically, so that they are
clearer and easier for the user to understand.
4.5.5.1 Analysis on various classification algorithms
Table 4.2 and Table 4.3 give a comparative analysis of the classification
algorithms considered, namely K-Nearest Neighbor, Support Vector Machine, and
Logistic Regression, in terms of recall and accuracy rate. In the Classification
Learner app, when compared with the other algorithms, the accuracy rate for
Support Vector Machine is the highest. So, we observed that SVM is better than
the remaining algorithms in terms of accuracy rate and recall.
From these results we obtain the accuracy, prediction speed and training time.
Based on these statistical measures we determine which algorithm is the best
classifier.
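As a rough sketch of how training time and prediction speed (observations per second) can be measured, the Python code below times a fit step and a prediction step separately. The threshold classifier, the data and all function names are hypothetical stand-ins; the values reported in this paper come from the Classification Learner app, so timings from this sketch are not comparable with them.

```python
import time


def fit_threshold(xs, ys):
    """Toy model: midpoint between the two class means (one numeric feature)."""
    pos = [x for x, y in zip(xs, ys) if y == 1]
    neg = [x for x, y in zip(xs, ys) if y == 0]
    return (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2.0


def predict_threshold(threshold, x):
    return 1 if x > threshold else 0


def time_classifier(fit, predict, train_x, train_y, test_x):
    """Return (training time in seconds, observations per second, predictions)."""
    start = time.perf_counter()
    model = fit(train_x, train_y)
    training_time = time.perf_counter() - start

    start = time.perf_counter()
    predictions = [predict(model, x) for x in test_x]
    elapsed = time.perf_counter() - start
    obs_per_sec = len(test_x) / elapsed if elapsed > 0 else float("inf")
    return training_time, obs_per_sec, predictions


# Hypothetical data.
train_x = [0.1, 0.2, 0.3, 2.0, 2.1, 2.2]
train_y = [0, 0, 0, 1, 1, 1]
test_x = [0.15, 2.05] * 500

training_time, obs_per_sec, predictions = time_classifier(
    fit_threshold, predict_threshold, train_x, train_y, test_x
)
print(f"training time: {training_time:.6f} sec, speed: {obs_per_sec:.0f} obs/sec")
```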
Table 4.2: Analysis of various classification algorithms for audit_risk data
Table 4.3: Analysis of various classification algorithms for trail data
4.5.5.2 Analyze the confusion matrix
The confusion matrix displays the total number of observations in each cell.
The rows of the confusion matrix correspond to the true class, and the columns
correspond to the predicted class. Diagonal and off-diagonal cells correspond to
correctly and incorrectly classified observations, respectively.
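The layout described above (rows for the true class, columns for the predicted class) can be sketched in Python as follows. The label vectors are hypothetical and the function name confusion_matrix is chosen for illustration only.

```python
def confusion_matrix(true_labels, predicted_labels, classes):
    """Rows correspond to the true class, columns to the predicted class."""
    index = {c: i for i, c in enumerate(classes)}
    matrix = [[0] * len(classes) for _ in classes]
    for t, p in zip(true_labels, predicted_labels):
        matrix[index[t]][index[p]] += 1
    return matrix


# Hypothetical labels for seven observations.
true_labels = [0, 0, 0, 1, 1, 1, 1]
predicted_labels = [0, 0, 1, 1, 1, 0, 1]

cm = confusion_matrix(true_labels, predicted_labels, classes=[0, 1])
correct = sum(cm[i][i] for i in range(len(cm)))          # diagonal cells
incorrect = sum(sum(row) for row in cm) - correct        # off-diagonal cells
print(cm, correct, incorrect)
```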
Table 4.2 (audit_risk data)
Classification Learner          Accuracy   Prediction Speed   Training Time
Logistic Regression             97.3%      ~6700 obs/sec      15.135 sec
Support Vector Machine (SVM)    97.7%      ~9800 obs/sec      3.9264 sec
K-Nearest Neighbor (KNN)        97.2%      ~6200 obs/sec      2.877 sec

Table 4.3 (trail data)
Classification Learner          Accuracy   Prediction Speed   Training Time
Logistic Regression             98.7%      ~7400 obs/sec      31.486 sec
Support Vector Machine (SVM)    99.2%      ~1700 obs/sec      1.3901 sec
K-Nearest Neighbor (KNN)        99.4%      ~9100 obs/sec      1.0169 sec
The confusion matrix has three plotting options:
1. Number of observations
2. True positive rate and false negative rate
3. Positive predictive values and false discovery rate
To see how the classifier performed based on the number of observations in
each class, under Plot, select the Number of Observations option.
Table 4.4: Analyze the confusion matrix based on number of observation.
To see how the classifier performed per class, under Plot, select the True
Positive Rates, False Negative Rates option. The plot shows summaries per true
class in the last two columns on the right.
Table 4.5: Analyze the confusion matrix based on TPR and FNR
To see results per predicted class, under Plot, select the Positive Predictive
Values, False Discovery Rates option. The confusion matrix now shows
summary rows underneath the table. Positive predictive values are shown in green
for the correctly predicted points in each class, and false discovery rates are
shown below it in red for the incorrectly predicted points in each class.
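The rates discussed in this section follow directly from the four counts of a binary confusion matrix. As a hedged sketch in Python, the code below derives them from the TP, FP, TN and FN counts reported in Table 4.4; it assumes that TP and FN refer to the same true class and FP and TN to the other, which is our reading of the table rather than something stated explicitly.

```python
# Counts as reported in Table 4.4.
counts = {
    "SVM": {"TP": 466, "FP": 14, "TN": 291, "FN": 5},
    "LR": {"TP": 461, "FP": 12, "TN": 293, "FN": 10},
    "KNN": {"TP": 464, "FP": 13, "TN": 292, "FN": 7},
}


def rates(c):
    """Per-class rates derived from binary confusion-matrix counts."""
    tp, fp, tn, fn = c["TP"], c["FP"], c["TN"], c["FN"]
    return {
        "TPR": tp / (tp + fn),  # true positive rate (recall)
        "FNR": fn / (tp + fn),  # false negative rate = 1 - TPR
        "FPR": fp / (fp + tn),  # false positive rate
        "PPV": tp / (tp + fp),  # positive predictive value (precision)
        "FDR": fp / (tp + fp),  # false discovery rate = 1 - PPV
    }


for name, c in counts.items():
    r = rates(c)
    print(name, {key: f"{100 * value:.1f}%" for key, value in r.items()})
```

Rounded to two decimals, the pair (FPR, TPR) computed this way gives the operating point of each classifier on its ROC curve.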
Number of observations
Algorithms   TP    FP   TN    FN
SVM          466   14   291   5
LR           461   12   293   10
KNN          464   13   292   7

True positive rates and false positive rates
Algorithms   TP    FP   TN    FN
SVM          99%   5%   95%   1%
LR           98%   4%   96%   2%
KNN          99%   4%   96%   1%
Table 4.6: Analyze the confusion matrix based on PPV and FDR
The class labels of the confusion matrix chart are specified as a categorical
vector, numeric vector, string vector, character array, cell array of character
vectors, or logical vector. If the class labels are given as a vector, it must
have the same number of elements as the confusion matrix has rows and columns.
If they are given as a character array, it must be two-dimensional, with each
row corresponding to the label of one class.
Table 4.7: Analyze TPR and FPR using class labels
Table 4.8: Analyze PPV and FDR using class labels
The line chart and pie chart below depict the accuracy rate for the various
classification methods; the results show that the support vector machine
achieves the highest accuracy rate in MATLAB.
Positive predictive values and false discovery rates
Algorithms   TP    FP   TN    FN
SVM          97%   3%   98%   2%
LR           97%   3%   97%   3%
KNN          97%   3%   98%   2%
True positive rate and false positive rate
Classes   Algorithms   TPR   FNR
Class0    SVM          99%   1%
          LR           98%   2%
          KNN          99%   1%
Class1    SVM          95%   5%
          LR           96%   4%
          KNN          96%   4%
Positive predictive values and false discovery rate
Classes   Algorithms   TPR   FNR
Class0    SVM          99%   1%
          LR           98%   2%
          KNN          99%   1%
Class1    SVM          95%   5%
          LR           96%   4%
          KNN          96%   4%
Fig 4.22: Line chart of accuracy rate for various classification algorithms
Fig 4.23: Pie chart of accuracy rate for various classification algorithms
(Chart: Data Set Accuracy; x-axis categories 1, 2, 3 with data labels 97.30%, 97.70%, 97.20%; legend: Accuracy.)
The figure below shows the analysis of various parameters, namely true
positives, true negatives, false positives and false negatives, for the various
classification algorithms in machine learning.
Fig 4.24: Analysis of classification algorithms using various parameters
5. CONCLUSION AND FUTURE WORK
In the present scenario, audit is the systematic process of obtaining and
objectively evaluating evidence about the correspondence between information,
situations or procedures and established criteria. Its purpose is to ensure that
appropriate attention is devoted to important areas, that potential problems are
promptly identified, and that work is completed expeditiously and properly
coordinated. In the existing system, the researchers took the Audit data set and
applied the Particle Swarm Optimization (PSO) algorithm to understand the
complete risk assessment process, obtaining 93% accuracy with the Bayes Net and
J48 algorithms. Meanwhile, we have identified the best algorithm among K-Nearest
Neighbor, Support Vector Machine and Logistic Regression by analyzing these
three algorithms, and we found that Support Vector Machine is the best among
them. From the result analysis, considering the recall value, the accuracy and
precision of SVM are higher when
compared with the other algorithms. We also identified which algorithm is the
best-performing classifier.
For future work, we aim to improve the performance of the classifiers with an
ensemble machine learning approach; effort can also be made to obtain higher
accuracy and better execution time for prediction.
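One simple form such an ensemble could take is majority voting over the three classifiers studied here. The Python sketch below is only an illustration of that idea under our own assumptions; the prediction vectors are hypothetical, and no ensemble was built or evaluated in this work.

```python
from collections import Counter


def majority_vote(predictions_per_model):
    """Combine per-model label lists into one label list by majority vote."""
    combined = []
    for votes in zip(*predictions_per_model):
        # most_common(1) returns the label with the highest vote count.
        combined.append(Counter(votes).most_common(1)[0][0])
    return combined


# Hypothetical predictions of three classifiers on four firms.
svm_pred = [0, 1, 1, 0]
lr_pred = [0, 1, 0, 0]
knn_pred = [1, 1, 1, 0]

ensemble_pred = majority_vote([svm_pred, lr_pred, knn_pred])
print(ensemble_pred)
```

With an odd number of binary classifiers, as here, the vote can never end in a tie.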
REFERENCES
[1] Hooda, N., Bawa, S., & Rana, P. S. (2018). Fraudulent firm classification:
case study of an external audit. Applied Artificial Intelligence, 32(1), 48-64.
[2] Cattle, B. A., Baxter, P. D., Greenwood, D. C., Gale, C. P., & West, R. M.
(2011). Multiple imputation for completion of a national clinical audit
dataset. Statistics in medicine, 30(22), 2736-2753.
[3] Evans-Lacko, S., & Knapp, M. (2016). Global patterns of workplace
productivity for people with depression: absenteeism and presenteeism costs
across eight diverse countries. Social psychiatry and psychiatric epidemiology,
51(11), 1525-1537.
[4] Kamiński, B., Jakubczyk, M., & Szufel, P. (2017). A framework for sensitivity
analysis of decision trees. Central European Journal of Operations Research,
26(1), 135-159. doi:10.1007/s10100-017-0479-6. PMC 5767274.
PMID 29375266.
[5] Altman, N. S. (1992). An introduction to kernel and nearest-neighbor
nonparametric regression. The American Statistician, 46(3), 175-185.
doi:10.1080/00031305.1992.10475879. hdl:1813/31637.
[6] Cunningham, P., & Delany, S. J. (2007). k-Nearest neighbour
classifiers. Multiple Classifier Systems, 34(8), 1-17.
[7] Dreiseitl, S., & Ohno-Machado, L. (2002). Logistic regression and artificial
neural network classification models: a methodology review. Journal of
biomedical informatics, 35(5-6), 352-359.
[8] Cortes, C., & Vapnik, V. N. (1995). Support-vector networks.
Machine Learning, 20(3), 273-297. CiteSeerX 10.1.1.15.9362.
doi:10.1007/BF00994018.
[9] Halbouni, S. S. (2015). The Role of Auditors in Preventing, Detecting, and
Reporting Fraud: The Case of the United Arab Emirates (UAE). International
Journal of Auditing, 19(2), 117-130.
[10] Simon, C. A. (2012). Individual auditors' identification of relevant fraud
schemes. Auditing: A Journal of Practice & Theory, 31(1), 1-16.
[11] Hoyer, S., Zakhariya, H., Sandner, T., & Breitner, M. H. (2012, January).
Fraud prediction and the human factor: An approach to include human behavior in
an automated fraud audit. In 2012 45th Hawaii International Conference on
System Sciences (pp. 2382-2391). IEEE.
[12] Brazel, J. F., Carpenter, T. D., & Jenkins, J. G. (2010). Auditors’ use of
brainstorming in the consideration of fraud: Reports from the field. The
Accounting Review, 85(4), 1273-1301.
[13] Hassink, H., Meuwissen, R., & Bollen, L. (2010). Fraud detection, redress
and reporting by auditors. Managerial Auditing Journal, 25(9), 861-881.
[14] Gray, G. L., & Debreceny, R. S. (2014). A taxonomy to guide research on
the application of data mining to fraud detection in financial statement
audits. International Journal of Accounting Information Systems, 15(4), 357-380.
[15] González-Díaz, B., García-Fernández, R., & López-Díaz, A. (2015). Auditor
tenure and audit quality in Spanish state-owned foundations. Revista de
Contabilidad, 18(2), 115-126.
[16] K. Johl, S., Kaur Johl, S., Subramaniam, N., & Cooper, B. (2013). Internal
audit function, board quality and financial reporting quality: evidence from
Malaysia. Managerial Auditing Journal, 28(9), 780-814.
[17] Habbash, M. (2010). The effectiveness of corporate governance and external
audit on constraining earnings management practice in the UK (Doctoral
dissertation, Durham University).
[18] Suryanto, T. (2016). Audit delay and its implication for fraudulent financial
reporting: a study of companies listed in the Indonesian stock exchange. European
Research Studies, 19(1), 18.
[19] Lennox, C., & Pittman, J. A. (2010). Big Five audits and accounting
fraud. Contemporary Accounting Research, 27(1), 209-247.
[20] Rochmah Ika, S., & Mohd Ghazali, N. A. (2012). Audit committee
effectiveness and timeliness of reporting: Indonesian evidence. Managerial
Auditing Journal, 27(4), 403-424.
[21] Ege, M. S. (2014). Does internal audit function quality deter management
misconduct?. The Accounting Review, 90(2), 495-527.
[22] Carcello, J. V., Hermanson, D. R., & Ye, Z. (2011). Corporate governance
research in accounting and auditing: Insights, practice implications, and future
research directions. Auditing: A Journal of Practice & Theory, 30(3), 1-31.
[23] Kuna, H. D., García-Martinez, R., & Villatoro, F. R. (2014). Outlier
detection in audit logs for application systems. Information Systems, 44, 22-33.
[24] Gual, A., Segura, L., Contel, M., Heather, N., & Colom, J. (2002). Audit-3
and audit-4: effectiveness of two short forms of the alcohol use disorders
identification test. Alcohol and alcoholism, 37(6), 591-596.
[25] Dewiani, D., Lawi, A., Sarro, M., & Aziz, F. (2019). Classification of Firm
External Audit Using Ensemble Support Vector Machine Method. In 1st
International Conference on Science and Technology, ICOST 2019. European
Alliance for Innovation (EAI).
[26] Suryanto, T. (2016). Audit delay and its implication for fraudulent financial
reporting: a study of companies listed in the Indonesian stock exchange. European
Research Studies, 19(1), 18.
[27] Halbouni, S. S. (2015). The Role of Auditors in Preventing, Detecting, and
Reporting Fraud: The Case of the United Arab Emirates (UAE). International
Journal of Auditing, 19(2), 117-130.
[28] González-Díaz, B., García-Fernández, R., & López-Díaz, A. (2015). Auditor
tenure and audit quality in Spanish state-owned foundations. Revista de
Contabilidad, 18(2), 115-126.
[29] Hassink, H., Meuwissen, R., & Bollen, L. (2010). Fraud detection, redress
and reporting by auditors. Managerial Auditing Journal, 25(9), 861-881.
[30] Gray, G. L., & Debreceny, R. S. (2014). A taxonomy to guide research on
the application of data mining to fraud detection in financial statement
audits. International Journal of Accounting Information Systems, 15(4), 357-380.
[31] K. Johl, S., Kaur Johl, S., Subramaniam, N., & Cooper, B. (2013). Internal
audit function, board quality and financial reporting quality: evidence from
Malaysia. Managerial Auditing Journal, 28(9), 780-814.
[32] Rochmah Ika, S., & Mohd Ghazali, N. A. (2012). Audit committee
effectiveness and timeliness of reporting: Indonesian evidence. Managerial
Auditing Journal, 27(4), 403-424.
[33] Simon, C. A. (2012). Individual auditors' identification of relevant fraud
schemes. Auditing: A Journal of Practice & Theory, 31(1), 1-16.