
Statistical Analysis of Various Measures in Auditing Practices Using Optimization Techniques

Dabbeeru Priyanka1, Javvadi Vishnu Priyanka2, Sura Papa Rao3

Department of Computer Science and Engineering, 1Sri Sivani College of Engineering, Srikakulam, AP-532410, India

2RGUKT-Srikakulam, 3 RGUKT-Nuzvid, Andhra Pradesh, India

E-mail : {dabbeerupriyanka90, priyajv34, surapaparao}@gmail.com

ABSTRACT

Nowadays, AI plays a fundamental role in improving the quality of audit field work. A data audit refers to the inspection of data to assess its quality or utility for a specific purpose. Auditing data, unlike auditing accounts, involves looking at key metrics, other than quantity, to draw conclusions about the properties of a data set. Auditing practices are responsible for fraud detection. In this paper, various optimization algorithms are executed by analysing statistical measures of the audit risk and trail data to find the best accuracy for suspicious firm classification and to produce an optimum or satisfactory solution.

1. INTRODUCTION

1.1 Introduction to Audit data

One major area of auditing is making beneficial predictions about a dataset. Auditors are required by auditing standards to assess the status of a dataset and predict whether the dataset is able to continue operating as a going concern [1]. The greatest concerns in using audit data for research are missing and implausible data, which are inevitable when large volumes of data are collected [2]. Determining the advantageous status of a dataset is a very difficult task, so auditors have been trying to come up with statistical methods to make it easier. In recent years, machine learning has developed and received major attention in the predictive analytics of audit research. The main objective is to produce an efficient and effective prediction

Science, Technology and Development

Volume IX Issue VI JUNE 2020

ISSN : 0950-0707

Page No : 37

model that is a hybrid of various machine learning algorithmic characteristics and is capable of predicting whether fraud has been committed in a firm. A data audit refers to the auditing of data to assess its quality or utility for a specific purpose [1]. Auditing data, unlike auditing finances, involves looking at key metrics, other than quantity, to draw conclusions about the properties of a data set. The experiments are carried out on high-dimensional audit data and the accuracy of the results is measured. Audits are an important compliance component that needs to be set up properly but quickly, so that an organization does not miss out on business opportunities requiring the storage of user data. Auditing these activities helps companies identify a data breach before it is too late, or at least assists with implementing better security configurations to stop losses from occurring.
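As a minimal sketch of what a basic data audit might check, the Python snippet below counts missing and implausible values in a single column. The function name `audit_column`, the use of `None` to mark a missing entry and the plausibility range are assumptions made for illustration; they are not taken from the paper or its datasets.

```python
from typing import Dict, List, Optional, Tuple


def audit_column(values: List[Optional[float]],
                 valid_range: Tuple[float, float]) -> Dict[str, int]:
    """Count missing and out-of-range entries in one column.

    `None` marks a missing value (an assumption for this sketch);
    `valid_range` is the inclusive (low, high) interval treated as plausible.
    """
    low, high = valid_range
    missing = sum(1 for v in values if v is None)
    implausible = sum(
        1 for v in values if v is not None and not (low <= v <= high)
    )
    return {"total": len(values), "missing": missing, "implausible": implausible}


# Hypothetical column of risk scores that should lie in [0, 1].
report = audit_column([0.2, None, 1.5, -3.0, 0.9], (0.0, 1.0))
print(report)
```

A report of this kind only flags quality problems; deciding how to treat the flagged rows (drop, impute, or investigate) is a separate step.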

To apply the machine learning techniques and their algorithms, we used MATLAB, a data visualization and data mining tool. It is used to present data at the front end through data analysis and interactive visualization. In MATLAB, data mining algorithms can be constructed using FORTRAN 77, Python, C++ and C. Cleve Moler, at the University of New Mexico, started developing MATLAB in the late 1970s [3]. It is an interactive system whose basic data element is an array that does not require dimensioning. This allows many technical computing problems, especially those with matrix and vector formulations, to be solved in a fraction of the time.

Machine learning involves an algorithm that improves automatically through experience based on data [1]. We use different machine learning classification algorithms: Logistic Regression, Support Vector Machine and K-Nearest Neighbor. Logistic Regression is a statistical analysis method used to predict a data value based on prior observations of a dataset in which one or more independent variables determine an outcome. The outcome is measured with a dichotomous variable (one with only two possible outcomes). KNN (K-Nearest Neighbor) is a very simple machine learning




algorithm that stores all the available cases and classifies each new case based on its similarity to them [4]. SVM (support vector machine) is one of the best supervised machine learning algorithms and is most popularly used to solve classification problems. It plots every data item as a point in n-dimensional space. After plotting the points, classification is performed by finding the hyperplane that best separates the two classes [4]. This project focuses on classification algorithms, which are among the most popular and interesting areas of research and which can adapt to novel changes. The scope of this project is to define the challenges of the existing system and to highlight new research directions. The main objective is to produce an efficient and effective prediction model that is a hybrid of various machine learning or classification algorithms and that helps ensure the financial reliability of an organization; detection of fraud is just an incidental objective.

1.2 Audit data

Audit is one of the major areas of concern. It is defined as the process of examining the financial records of a business to corroborate that its financial statements are in compliance with standard accounting laws and principles [1]. During a data audit, the origin, creation or format of data may be reviewed to assess its value and utility. Various agencies and associations, such as the Joint Information Systems Committee (JISC), promote data audit protocols in different fields. In academia, auditing research data has become an important work component. Government programs may also focus on government-funded or work-related data audit protocols. Typically, a data audit depends on a registry, which is a storage space for data assets. A thorough data audit involves identifying a registry or repository, often in a specific business department or organization. The data used here cover two auditing practices: the audit_risk and trail datasets. The audit risk dataset has 776 instances and 27 attributes.




On the other hand, the trail dataset has 776 instances and 18 attributes.

1.3 Algorithms

In this project, we have used three different classification algorithms; they

are KNN, Logistic regression and SVM. By using these algorithms, we have

computed the analytical results based on various statistical parameters like AUC,

Confusion matrix and Recall.
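The three evaluation measures named above can be sketched in a few lines. The snippet below is written in Python rather than the MATLAB used by the authors, and the function names, the label convention (1 for a suspicious firm, 0 otherwise) and the toy data are assumptions for illustration only. AUC is computed here from its rank interpretation: the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one.

```python
from typing import List, Tuple


def confusion_matrix(y_true: List[int], y_pred: List[int]) -> Tuple[int, int, int, int]:
    """Return (tp, fp, fn, tn) for binary labels, with 1 as the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    return tp, fp, fn, tn


def recall(y_true: List[int], y_pred: List[int]) -> float:
    """Fraction of actual positives that were predicted positive."""
    tp, _, fn, _ = confusion_matrix(y_true, y_pred)
    return tp / (tp + fn) if (tp + fn) else 0.0


def auc(y_true: List[int], scores: List[float]) -> float:
    """AUC as P(score of a positive > score of a negative); ties count as 0.5."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical labels, hard predictions and classifier scores.
y_true = [1, 0, 1, 1, 0, 0]
y_pred = [1, 0, 0, 1, 1, 0]
scores = [0.9, 0.2, 0.4, 0.8, 0.6, 0.1]
print(confusion_matrix(y_true, y_pred), recall(y_true, y_pred), auc(y_true, scores))
```

The pairwise AUC above is quadratic in the number of examples, which is acceptable for a dataset of a few hundred instances; a sort-based computation would be preferable for larger data.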

1.3.1 K-Nearest Neighbor

In pattern recognition, the k-nearest neighbors algorithm (k-NN) is a non-parametric method used for classification and regression [5]. In both cases, the input consists of the k closest training examples in the feature space. The output depends on whether k-NN is used for classification or regression.

In k-NN classification, the output is a class membership. An object is classified by a majority vote of its neighbors, with the object being assigned to the class most common among its k nearest neighbors (k is a positive integer, typically small). If k = 1, the object is simply assigned to the class of that single nearest neighbor. In k-NN regression, the output is the property value for the object. This value is the average of the values of its k nearest neighbors.

k-NN is a type of instance-based learning, or lazy learning, where the function is only approximated locally and all computation is deferred until classification. The k-NN algorithm is among the simplest of all machine learning algorithms. For both classification and regression, a useful technique is to assign weights to the contributions of the neighbors, so that nearer neighbors contribute more to the average than more distant ones. For example, a common weighting scheme gives each neighbor a weight of 1/d, where d is the distance to the neighbor.

The neighbors are taken from a set of objects for which the class (for k-NN classification) or the object property value (for k-NN regression) is known. This can be thought of as the training set for the algorithm, though no explicit training step is required. K-Nearest Neighbors is one of the most




basic yet essential classification algorithms in Machine Learning. It belongs to the

supervised learning domain and finds intense application in pattern recognition,

data mining and intrusion detection [6].
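The voting rule and the 1/d weighting described in this section can be illustrated with a short Python sketch. It is not the authors' MATLAB implementation; the function name `knn_predict`, the Euclidean distance and the toy points are assumptions chosen for the example.

```python
import math
from collections import defaultdict
from typing import List, Sequence, Tuple


def knn_predict(train: List[Tuple[Sequence[float], int]],
                query: Sequence[float],
                k: int = 3,
                weighted: bool = True) -> int:
    """Classify `query` by a vote among its k nearest training points.

    With weighted=True each neighbour votes with weight 1/d (d = Euclidean
    distance); a neighbour at distance 0 gets infinite weight. With
    weighted=False every neighbour has one vote (plain majority).
    """
    # Sort training points by distance to the query and keep the k closest.
    nearest = sorted((math.dist(x, query), label) for x, label in train)[:k]
    votes = defaultdict(float)
    for d, label in nearest:
        if weighted:
            votes[label] += 1.0 / d if d > 0 else float("inf")
        else:
            votes[label] += 1.0
    return max(votes, key=votes.get)


# Hypothetical 1-D example: one close neighbour of class 1, two farther of class 0.
train = [((0.0,), 1), ((2.0,), 0), ((2.1,), 0)]
print(knn_predict(train, (0.5,), k=3, weighted=False))  # majority vote
print(knn_predict(train, (0.5,), k=3, weighted=True))   # 1/d weighting
```

In this toy case the two schemes disagree: the unweighted vote follows the two distant class-0 points, while the weighted vote follows the single nearby class-1 point, which is the effect the 1/d weighting is meant to have.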

1.3.2 Logistic Regression

Logistic regression is a statistical analysis method used to predict a data value based on prior observations of a dataset. It has become an important tool in the discipline of machine learning. The approach allows an algorithm used in a machine learning application to classify incoming data based on historical data. As more relevant data comes in, the algorithm should get better at predicting classifications within data sets. It analyzes a dataset in which one or more independent variables determine an outcome. The outcome is measured with a dichotomous variable (one with only two possible outcomes) [7]. LR combines the independent variables to estimate the probability that a particular event will occur, such as a subject being a member of one of the groups defined by the dichotomous dependent variable. Logistic regression is similar to linear regression. The main difference is that the output of logistic regression is discrete, because only a limited number of values are possible, whereas the output of linear regression is continuous and can take infinitely many values. It is used for binary classification problems, including predictions such as “this or that”, “yes or no” and “A or B”.

Logistic regression rests on three fundamental concepts: the hypothesis, the decision boundary, and the cost function. The hypothesis of logistic regression limits its output to values between 0 and 1. A linear function cannot represent it, because a linear function can take values greater than 1 or less than 0, which the hypothesis of logistic regression does not allow. The hypothesis is therefore defined with the ‘sigmoid function’, also known as the ‘logistic function’, instead of a linear function; the sigmoid maps predicted values to probabilities. A decision boundary is then applied on top of the predicted probabilities to assign class labels, which can be compared with the actual labels.
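The three concepts of hypothesis, decision boundary and cost function can be made concrete with a small Python sketch. The weights, bias and the 0.5 threshold below are hypothetical values picked for illustration, and the cost shown is the standard log loss (cross-entropy), which is assumed here because the text does not spell out which cost function was used.

```python
import math
from typing import Sequence


def sigmoid(z: float) -> float:
    """Logistic function; written in a form that avoids overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def predict_proba(weights: Sequence[float], bias: float, x: Sequence[float]) -> float:
    """Hypothesis: sigmoid of a linear combination of the inputs."""
    return sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)


def classify(prob: float, threshold: float = 0.5) -> int:
    """Decision boundary: label 1 when the probability reaches the threshold."""
    return 1 if prob >= threshold else 0


def log_loss(y_true: Sequence[int], probs: Sequence[float], eps: float = 1e-12) -> float:
    """Mean cross-entropy between true labels and predicted probabilities."""
    total = 0.0
    for y, p in zip(y_true, probs):
        p = min(max(p, eps), 1.0 - eps)  # keep log() finite
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(y_true)


# Hypothetical model with two features.
p = predict_proba([1.0, -1.0], 0.0, [2.0, 0.5])
print(p, classify(p))
```

Lowering or raising the threshold trades recall against false alarms, which is one reason AUC, being threshold-independent, is often reported alongside accuracy.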




1.3.3 Support Vector Machine

In machine learning, support-vector machines (SVMs, also support-vector networks) are supervised learning models with associated learning algorithms that analyze data used for classification and regression analysis. Given a set of training examples, each marked as belonging to one of two classes, an SVM training algorithm builds a model that assigns new examples to one class or the other, making it a non-probabilistic binary linear classifier (although methods such as Platt scaling exist to use SVM in a probabilistic classification setting). An SVM model is a representation of the examples as points in space, mapped so that the examples of the separate classes are divided by a clear gap that is as wide as possible. New examples are then mapped into that same space and predicted to belong to a class based on the side of the gap on which they fall.

In addition to performing linear classification, SVMs can efficiently perform non-linear classification using what is known as the kernel trick, implicitly mapping their inputs into high-dimensional feature spaces [8]. When data are unlabelled, supervised learning is not possible, and an unsupervised learning approach is required, which attempts to find a natural clustering of the data into groups and then maps new data to these formed groups.
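To illustrate the "side of the gap" rule for a linear SVM, the Python sketch below evaluates an already-given separating hyperplane: it predicts a class from the sign of w·x + b, reports the width of the gap as 2/||w||, and computes the hinge loss that SVM training typically minimizes. No training is performed; the hyperplane parameters `w` and `b` are hand-picked hypothetical values, and the labels are assumed to be coded as +1 and -1.

```python
import math
from typing import Sequence


def decision_value(w: Sequence[float], b: float, x: Sequence[float]) -> float:
    """Signed score w.x + b; its sign tells on which side of the hyperplane x lies."""
    return sum(wj * xj for wj, xj in zip(w, x)) + b


def svm_predict(w: Sequence[float], b: float, x: Sequence[float]) -> int:
    """Assign +1 or -1 according to the side of the hyperplane."""
    return 1 if decision_value(w, b, x) >= 0 else -1


def margin_width(w: Sequence[float]) -> float:
    """Width of the gap between the lines w.x + b = +1 and w.x + b = -1."""
    return 2.0 / math.sqrt(sum(wj * wj for wj in w))


def hinge_loss(w: Sequence[float], b: float, x: Sequence[float], y: int) -> float:
    """Zero when x is on the correct side and outside the gap, positive otherwise."""
    return max(0.0, 1.0 - y * decision_value(w, b, x))


# Hypothetical hyperplane x1 = 3 in two dimensions.
w, b = (1.0, 0.0), -3.0
print(svm_predict(w, b, (3.5, 0.0)), margin_width(w), hinge_loss(w, b, (3.5, 0.0), 1))
```

A point such as (3.5, 0) is classified correctly but lies inside the gap, so it still incurs a hinge loss; this is what pushes a trained SVM toward the widest possible separation.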

2. LITERATURE SURVEY

2.1 Introduction

In our survey, we reviewed papers from the previous 15 years to understand audit data from different perspectives and to learn about the technologies, methods and methodologies used in those studies.

2.2 Survey on audit data using various algorithms

Sawsan Saadi Halbouni [9] mainly considers auditor perceptions regarding

their responsibilities related to preventing, detecting, and reporting fraud. The




main objective of this study is to investigate internal and external auditor perceptions regarding being proactive in their efforts to identify fraud during an audit. The sample consists of 53 auditors in the United Arab Emirates. The principal finding is that internal auditors are primarily responsible for identifying fraud and are consequently more concerned about reporting incidents related to fraud. The results provide empirical support for the notion that external auditors should increase the degree to which they seek to detect and report incidents of fraud as well.

Chad A. Simon [10] examines whether focusing on management’s goals helps individual auditors identify ways management may be committing fraud (i.e., fraud schemes). He also examines whether forming independent expectations for a client’s reported results enhances the benefits of focusing on management’s goals. The author found that auditors who identify a higher number of relevant fraud schemes identify more audit procedures that target the case fraud. Auditors who are prompted to identify and focus on management’s goals identify more relevant fraud schemes than auditors who are not prompted to do so. Auditors are responsible for identifying the specific ways they believe the client could be committing fraud and adjusting the audit plan to address the identified schemes.

Stefan Hoyer et al. [11] introduce a generic architectural model that unifies the classic fraud audit approach with human behavior, taking into account the fraud triangle, in order to achieve better fraud detection and prevention. The approach provides added value because the transactions examined by the auditor can be better differentiated and prioritized. The proposed architecture is implemented as a prototype and applied, as an example, to an SAP ERP system. The inclusion of the human factor focuses on the manifestation of various behavior patterns. These behavior patterns can be found in user data. The generic architectural model can be expanded in a modular way by adding plug-ins that connect to social networking sites. In this way, ideally, additional information on the social




networks of employees can be extracted, and this information can be useful for

fraud analysis.

Joseph F. Brazel et al. [12] note that the standard Consideration of Fraud in a Financial Statement Audit requires brainstorming sessions on every audit as a means of improving auditors’ fraud judgments. They develop a measure of brainstorming quality to examine how it affects auditors’ fraud decision-making processes, using field survey data of auditors’ actual brainstorming sessions for 179 audit engagements. They found that when brainstorming quality is perceived to be higher, risk assessments are more positively related to the nature, staffing, timing, and extent of audit procedures. Achieving a high level of brainstorming quality appears to improve the audit team’s consideration of fraud by cultivating a broader set of responses to identified fraud risks.

Harold Hassink et al. [13] aim to build expertise in the area of fraud detection, reporting and redress of detected fraud. They examine whether auditors comply with the auditing standards once they encounter fraud and whether compliance is associated with particular fraud characteristics, and they also aim to provide evidence on the role of auditors in redressing fraud. Redress refers to the measures taken to nullify the consequences of the fraud, insofar as possible, and to prevent any recurrence of such fraud. It is difficult for individual auditors to build up expertise in fraud detection. There appears to be a need for specific training programs to help auditors detect fraud, emphasizing the need for mandatory consultation with the technical department of the audit firm once “red flags” indicating fraud are found. Finally, they reveal that auditors fail to comply with some important elements of fraud standards. There are substantial differences among audit firms regarding compliance with the relevant auditing standards.

Glen Gray et al. [14] examine the application of data mining techniques to fraud detection in the audit of financial statements and propose a taxonomy to support and guide future research. They aim to give structure to research




in data mining and create a taxonomy that combines research on patterns of

observed fraud schemes with an appreciation of areas that benefit from productive

application of data mining. Auditors can include non-accounting (non-financial)

data in their data mining domain. There are, however, no external standards that

apply to non-financial data. As such, auditors will need to develop their own set of

processes to determine the quality of non-financial data and the consistency of

that data with the financial statements. Fraud detection is a vital component of the

modern financial statement audit. There are many forms of data mining that are

available to auditors as part of their fraud detection activities.

Roberto García-Fernández and Antonio [15] aim to analyze the impact of auditor tenure on audit quality. The research is motivated by the absence of consensus in published works and by the scarcity of studies carried out on nonprofit organizations. Although foundation audit quality decreases as tenure length increases, this quality loss does not become apparent until the sixth year of the foundation–auditor relationship, after an initial five years of improvement in quality. Long auditor tenure may increase competence because the auditor’s client-specific knowledge increases over the years. This allows auditors to improve the quality of their auditing, but it could also reduce their degree of independence. Their independence could also be compromised since they need to keep new clients in order to recover their initial client-specific investment, which cannot be transferred to other contracts. Industry specialization, however, can improve both competence and independence, leading to higher audit quality, since specialist auditors know their sector far better than non-specialist firms. The study contributes to the literature on the relationship between auditor tenure and audit quality in an environment where there is no mandatory auditor rotation and in a sector, non-profit making, where empirical research is very limited.

Satirenjit Kaur Johl et al. [16] test the impact of the internal audit function (IAF), an increasingly common internal governance mechanism, on a firm’s financial reporting quality. The paper uses a unique dataset of survey




responses and archival data. Regression analysis was used to test the hypotheses. The findings show that certain internal audit attributes play an important role in the financial reporting process; these findings are expected to inform the Institute of Internal Auditors and other regulatory bodies on the role of internal audit in financial reporting, which in turn can assist in market and regulatory reforms and inform the revised Malaysian Code of Corporate Governance. The authors examine the association between internal audit quality and abnormal accruals, a proxy for financial reporting quality. They also examine the moderating role of the board on this relationship. Using data from a developing nation where certain corporations are politically influenced, they test the prediction that internal audit quality is related to increased financial reporting quality and that this relationship is affected by the quality of the board.

Habbash and Murya [17] note that agency theory predicts that corporate governance and external audit enhance the convergence of interests between shareholders and managers. The primary objective of this thesis is to investigate the effect of corporate governance and external audit on constraining earnings management practice in the UK. The corporate governance attributes are organized in four categories: 1) Board Composition; 2) Audit Committee Effectiveness; 3) Non-Executive Directors’ (NEDs) Commitment; and 4) Ownership Structures. The external audit factors include auditor independence and audit quality. The results reveal that board size and independence, audit committee independence and expertise, nomination committee independence, chairman independence, the level of NED fees and an independent and specialized external auditor are negatively associated with earnings management at significant levels.

Tulus Suryanto [18] examines the impact of company size, profit or loss, and information systems on audit delay in companies listed on the Indonesia




Stock Exchange. Additionally, the study scrutinizes the implications of audit delay for fraudulent financial reporting. Audit delay is the time difference between the dates of financial statements and the dates of audit opinions stated in the financial statements, which indicates the length of the auditing process. The data analysis used Amos software. The results of the study indicate that the information system, company size, and operating loss and profit have a significant influence on audit delay. The use of an information system integrated with the application of technology simplifies administrative and financial transaction records, so that financial statements are issued faster and audit delay can be reduced.

Clive Lennox et al. [19] note that empirical studies of earnings management, audit pricing, and audit reporting provide extensive evidence that the Big Five public accounting firms are associated with higher quality financial statements. Measuring trends in absolute audit quality is typically infeasible, so they focus more narrowly on discriminating between the relative performance of the Big Five and non-Big Five audit firms in preventing companies from orchestrating accounting fraud. They report multivariate evidence on the relation between Big Five audits and accounting fraud. Finally, they focus on the link between auditor choice and the incidence of accounting fraud, given that criticism of the Big Five stresses their apparent role in prominent financial reporting failures; they add that future research could complement their analysis by examining whether the evidence extends to other settings.

Siti Rochmah Ika et al. [20] examine the association between audit committee effectiveness and timeliness of reporting. Audit committee effectiveness is measured by an index based on the framework developed by DeZoort. Timeliness of reporting is defined as the number of days that elapse between a company’s financial year-end and the day on which its audited financial statement is received by the IDX. Multivariate regression analysis was performed to analyze the relationship between audit committee effectiveness and




timeliness of reporting. The findings show that timeliness of reporting is associated with audit committee effectiveness. The results suggest that audit committee effectiveness is likely to reduce the financial reporting lead time and is a significant factor in ensuring timely submission of audited financial statements. The authors suggest that companies can re-examine how to further improve audit committee effectiveness in order to enhance the timeliness of financial reporting.

Matthew S. Ege [21] notes that standard-setters believe high-quality internal audit functions (IAFs) serve as a key resource to audit committees for monitoring senior management. A difference-in-differences analysis indicates that misconduct firms have low IAF quality and competence during misconduct years and improve IAF quality and competence in the post-misconduct years. Internal auditors serve as the “eyes and ears” of the audit committee and are able to deter management misconduct by being able to detect it effectively. Internal auditors test internal controls throughout the organization and focus their efforts on areas with the highest risk. The author reports that this result is robust to a variety of other specifications, including controlling for internal control quality and separate estimation during the pre- and post-SOX time periods. The results suggest that regulators and audit committees should consider ways to improve IAF quality and, more specifically, IAF competence.

Joseph V et al. [22] note that the corporate governance literature in accounting and auditing has grown rapidly; their corporate governance focus is primarily on corporate board and audit committee issues. A few studies have examined the relation between corporate governance and auditor risk assessment, audit planning, and audit opinions. These studies find that governance characteristics affect auditors’ risk assessments and planning decisions, and that governance quality is associated with audit opinions. Researchers have continued to examine the relation between corporate governance characteristics




and audit fees: strong governance could increase the demand for auditing (increasing fees) and/or reduce auditors’ assessments of risk. The latter two approaches, in particular, are viewed by some as less rigorous than archival, econometric-based research methods. Researchers can continue to rely predominantly on archival research methods and to improve the rigor and elegance of these methods and, as a result, begin to better understand the issues that matter most to accountants, attorneys, practitioners, regulators, and investors.

H. D. Kuna et al. [23] study outlier detection in audit logs for application systems. An outlier is defined as an observation that is significantly different from the other data in its set. An auditor employs many techniques, processes and tools to identify such entries, and data mining is one such medium through which the auditor can analyze information. The enormous amount of information contained within transactional processing systems' logs means that auditors must employ automated systems for anomalous data detection. Several data mining algorithms have been tested, especially those that deal specifically with classification and outlier detection. The authors succeeded in creating and validating an outlier detection process that works on the alphanumeric fields of the audit logs from an information system, thus constituting a useful tool for system auditors performing data analysis tasks.

Antoni Gual et al. [24] aim to identify suitable short versions of the Alcohol Use Disorders Identification Test (AUDIT) and to evaluate their effectiveness as screening tests for ‘risky drinking’ among men and women in primary health care (PHC) settings. Logistic regression analysis was carried out to identify those items that minimized the probability of misclassification between subjects with and without risky drinking. Cross-tabulations and t-tests were used for group comparisons. Relationships between the three forms of AUDIT and reported mean weekly alcohol consumption were examined by regression analysis. Sensitivity, specificity, positive predictive and overall accuracy (OA) values were calculated for the full, 3-item and 4-item AUDIT forms in relation to




the gold standard (diagnosis of risky drinking). Receiver operating characteristics

(ROC) curves and their areas under the curve (AUROCs) were inspected to

choose the optimal screening test and the best cut-off scores according to

sensitivity and specificity levels. The results show that AUDIT-3 and AUDIT-4

performed similarly to AUDIT-10 in detecting risky drinking and had equivalent

receiver operating characteristics curves and their areas under the curve. Both

short forms of AUDIT seem to be as effective as the full AUDIT for detecting

risky drinking among men and women in PHC settings.

Dewiani D. et al. [25] note that financial fraud is an important problem because it can be detrimental to firms in the modern business world. An audit is carried out to prevent fraud and is responsible for detecting it. External audit is one of the audit practices conducted outside of the firm's internal audit, by visiting firms to audit their financial report data. Machine learning can provide the data analysis methods needed to solve these problems. The study proposes an SVM method combined with the Ensemble Bagging model to improve single-classifier performance. The results showed that the Ensemble Bagging model could improve the classification accuracy of the SVM method and achieved the highest accuracy of 9.95%. Based on the accuracy obtained, the SVM method with the Ensemble Bagging model can be used to detect fraud in firms.




Table 2.1: Survey on Audit data

| Author Name | Title | Publication Year | Algorithms/Methods | Output | Ref |
|---|---|---|---|---|---|
| Tulus Suryanto et al. | Audit delay and its implication for fraudulent financial reporting: A study of companies listed in the Indonesian stock exchange | 2016 | Non-parametric statistical test; multivariate technique of structural equation model (SEM) | Shows that the use of information technology has a negative effect on audit delay. | [26] |
| SH. Sawsan et al. | The Role of Auditors in Preventing, Detecting, and Reporting Fraud: The Case of the United Arab Emirates (UAE) | 2015 | T-tests and analysis of variance (ANOVA) | The results suggest that external auditors should increase the degree to which they seek to detect and report incidents of fraud. | [27] |
| Roberto García-Fernández et al. | Auditor tenure and audit quality in Spanish state owned foundations | 2015 | Logistic regression models | The results show the relationship between auditor tenure and audit quality in an environment | [28] |
| Harold Hassink et al. | Fraud detection, redress and reporting by auditors | 2015 | Survey conducted among all the audit partners | The result reveals that auditors fail to comply with some important elements of fraud standards. | [29] |
| Glen L. Gray et al. | A taxonomy to guide research on the related application of data mining to fraud detection in financial statement audits | 2014 | Scatter shot method | Creates a taxonomy that combines research on patterns of observed fraud schemes with appropriate areas. | [30] |
| Satirenjit Kaur et al. | Internal audit function, board quality and financial reporting quality: evidence from Malaysia | 2013 | Archival data and regression analysis | Shows an unexpected positive relationship between internal audit quality and abnormal accruals. | [31] |
| Siti Rochmah Ika et al. | Audit committee effectiveness and timeliness of reporting: Indonesian evidence | 2012 | Index-based framework | Finds that the timeliness of reporting is associated with audit committee effectiveness. | [32] |
| Chad A. Simon | Individual Auditors Identification of Relevant Fraud Schemes | 2012 | Client-Server method | Auditors are responsible for identifying the specific ways they believe the client could be committing fraud and adjusting the audit plan to address identified | [33] |

3. Proposed Methodology

3.1 Introduction

Machine learning involves algorithms that improve automatically through experience based on data. One major area of auditing is the making of beneficial predictions about a dataset. In recent years, machine learning has developed and received major attention in the predictive analytics of audit research. Predictive analytics is also implemented using machine learning methods because it provides actionable insights for audit companies. An audit is defined as the process of examining the financial records of a business to corroborate that its financial statements are in compliance with the standard accounting laws and principles.

3.2 Existing system

In the existing system, the authors took the Audit data set and applied the Particle Swarm Optimization (PSO) algorithm to understand the complete risk assessment process. The results of two multi-criteria methods, TOPSIS and SAW, indicated that BayesNet and J48 perform best for this particular audit dataset and also give stable results on K-fold validation testing, serving as a proof of eligibility of these classifiers to perform an efficient risk assessment of suspicious firms in the audit field work decision-making process. J48 builds a decision tree based on the theory of information entropy. It is an open-source Java implementation of the C4.5 algorithm. Bayesian networks aim to model conditional dependence, and in some settings causation, by representing conditional dependence by edges in a directed graph. Through these relationships, one can efficiently conduct inference on the random variables in the graph through the use of factors. The model is based on probability theory and directed acyclic graph theory. In the existing system, the researchers obtained 93% accuracy using the BayesNet and J48 algorithms.
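To make the entropy criterion behind J48 more concrete, the following minimal Python sketch computes the Shannon entropy of a set of labels and the information gain of a candidate split. It is an illustration only: the function names and the toy labels are assumptions made here, and J48/C4.5 itself additionally uses the gain ratio and pruning, which are omitted.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    counts = Counter(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def information_gain(parent, children):
    """Entropy reduction obtained by splitting `parent` into `children`."""
    total = len(parent)
    weighted = sum(len(ch) / total * entropy(ch) for ch in children)
    return entropy(parent) - weighted

# Toy labels (hypothetical): 1 = suspicious firm, 0 = non-suspicious firm.
parent = [1, 1, 1, 1, 0, 0, 0, 0]
pure_split = [[1, 1, 1, 1], [0, 0, 0, 0]]   # separates the classes completely
poor_split = [[1, 1, 0, 0], [1, 1, 0, 0]]   # leaves both branches mixed

gain_pure = information_gain(parent, pure_split)
gain_poor = information_gain(parent, poor_split)
```

A tree learner of this kind prefers the split with the larger gain, here the one that separates the two classes completely.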

3.3 Motivation

Nowadays, it is extremely difficult to understand the reasons why services fail, or why company sales increase or decrease at a given time. This makes it difficult to produce the desired result and also reduces the accuracy of data classification. To overcome this drawback, we use machine learning algorithms to find an optimal solution that classifies the data with better accuracy.




3.4 Proposed framework

In the proposed system, the main focus is to use various optimization algorithms to analyse statistical measures of the audit risk and trial data, in order to find a better accuracy for suspicious firm classification and produce an optimum or satisfactory solution. We then analyzed the results and compared them with the existing system to determine which algorithm is best.

3.5 Description

We took the data set, i.e. the Audit data from the UCI Repository, which is freely available on the internet, and applied various classification algorithms to it: K-Nearest Neighbor, Logistic Regression and Support Vector Machine. We then applied various statistical methods to analyze the results and generated the final report. The following subsections give a brief description of the proposed system and its methodology, the methods and algorithms used in it, and the approach followed to reach an accurate solution.

3.5.1 Flow chart

Figure 3.1 shows the step-by-step procedure of the proposed system. It gives a complete view of the system: where the data is taken from and which methods and techniques are applied. The data is selected from the UCI repository, data preprocessing techniques are then applied for data cleaning and transformation, and finally the results are evaluated using machine learning classification algorithms.




Fig 3.1: Architecture for proposed system

3.5.2 System Architecture

Figure 3.2 shows the architecture of the proposed system in MATLAB. It handles three different classification algorithms: K-Nearest Neighbor, Support Vector Machine and Logistic Regression.

The Classification Learner app allows choosing among various algorithms to train and validate classification models for binary or multiclass problems. After training multiple models, their validation errors are compared side by side and the best model is chosen. For help in deciding which algorithm to use, see "Train Classification Models in Classification Learner App" (https://www.mathworks.com/help/stats/train-classification-models-in-classification-learner-app.html).

The flow chart in Fig 3.2 shows a common workflow for training classification models, or classifiers, in the Classification Learner app.

Fig 3.2 Proposed architecture using Matlab

3.5.2.1 Select Data and Validation for Classification Problem

Import data into Classification Learner from the workspace or files, find example data sets, and choose cross-validation or holdout validation options.

3.5.2.2 Choose Classifier Options

In Classification Learner, automatically train a selection of models, or compare and tune options in decision tree, discriminant analysis, logistic regression, naive Bayes, support vector machine, nearest neighbor, and ensemble models.

3.5.2.3 Train the Classifier

Classification Learner can be used to automatically train a selection of different classification models on the data.

3.5.2.4 Assess Classifier Performance in Classification Learner

Compare model accuracy scores, visualize results by plotting class predictions, and check performance per class in the confusion matrix.

3.5.2.5 Export Classification Model to Predict New Data

After training in Classification Learner, export models to the workspace and, if desired, generate MATLAB® code.

4. Experimental Setup and result analysis

4.1 Simulative Environment

The development environment for the proposed method is MATLAB R2018b on a system with an Intel or AMD x86-64 processor, 1.8 GHz, 4 GB RAM and a Microsoft Windows family operating system.

4.2 Experimental Setup

This section describes the experimental setup: the data set, i.e. the Audit data set, and MATLAB.

4.2.1 Dataset

This study was carried out on a dataset containing audit information. An audit is an evaluation of a person, organization, system, process, enterprise or product. The term most commonly refers to audits in accounting, but similar concepts also exist in project management, quality management, water management, and energy conservation. The audit work is comprehensive and considers all aspects of the organization, both financial and non-financial, with an emphasis on constructive improvement.

The dataset and its related information were drawn from the UCI Machine Learning Repository. It covers two auditing practices, the audit_risk and trial datasets. The audit_risk dataset has 776 instances and 27 attributes, and the trial dataset has 776 instances and 18 attributes. The records relate to scores, locations, risks, money value, loss and so on. A description of the data set is given in Table 4.1 below. Exhaustive one-year non-confidential data on firms for the year 2015 to 2016 was collected from the Auditor Office of India to build a predictor for classifying suspicious firms. Using this data set, we analyzed the main factors in the audit data by applying various machine learning algorithms.

Table 4.1 Description of data set

| Data Set Characteristics: | Multivariate | Number of Instances: | 777 | Area: | N/A |
| Attribute Characteristics: | Real | Number of Attributes: | 18 | Date Donated: | 2018-07-14 |
| Associated Tasks: | Classification | Missing Values? | Yes | Number of Web Hits: | 29866 |

4.2.2 Data set information

The goal of the research is to help auditors by building a classification model that can predict fraudulent firms on the basis of present and historical risk factors. The sectors and the number of firms in each are: Irrigation (114), Public Health (77), Buildings and Roads (82), Forest (70), Corporate (47), Animal Husbandry (95), Communication (1), Electrical (4), Land (5), Science and Technology (3), Tourism (1), Fisheries (41), Industries (37), Agriculture (200).

4.2.2.1 Attribute information

Many risk factors are examined from various areas, such as:

• Past records of audit office

• Audit-paras

• Environmental conditions reports

• Firm reputation summary

• On-going issues report

• Profit-value records

• Loss-value records, follow-up reports etc.

• Public health reports

• Agriculture issues

After in-depth interviews with the auditors, the important risk factors were evaluated and their probability of existence was calculated from the present and past records.

4.3 Parameter setting

This section defines the evaluation settings used: cross validation, random sampling, training time, and prediction speed.

4.3.1 Cross validation

Cross validation is a technique for evaluating how the results of a statistical analysis generalize to an independent dataset. It estimates the error rate of a model, which is necessary for a fair assessment of its accuracy: a model with a low error rate is considered good, and one with a high error rate is not. It therefore helps to select the best-fitting model. Cross validation splits the data into a given number of folds (usually 5 or 10). The algorithm is tested by holding out the examples from one fold at a time; the model is induced from the other folds and the examples from the held-out fold are classified. This is repeated for all the folds.

4.3.1.1 Steps followed in 10-fold cross validation

i. Divide the dataset into 10 parts, also called folds.

ii. Hold out one part in each iteration and train on the remaining nine.

iii. Average the results over the 10 iterations.

iv. Each data point is used once for testing and 9 times for training.

v. Stratified cross validation ensures that each fold has the right proportion of each class value.
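The steps above can be sketched in a few lines of Python. This is a minimal illustration using only the standard library, not the MATLAB procedure used in this study; the helper names, the toy labels and the majority-class baseline that stands in for a real classifier are all assumptions.

```python
import random
from collections import defaultdict

def stratified_folds(labels, k=10, seed=0):
    """Assign each sample index to one of k folds, keeping class proportions."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal the indices of each class round-robin across the folds.
        for position, idx in enumerate(indices):
            folds[position % k].append(idx)
    return folds

def cross_validate(features, labels, train_and_score, k=10):
    """Average the score over k folds; each fold is held out exactly once."""
    folds = stratified_folds(labels, k)
    scores = []
    for held_out in folds:
        held = set(held_out)
        train_idx = [i for i in range(len(labels)) if i not in held]
        scores.append(train_and_score(features, labels, train_idx, held_out))
    return sum(scores) / k

# Hypothetical stand-in for a real classifier: predict the majority training class.
def majority_baseline(features, labels, train_idx, test_idx):
    train_labels = [labels[i] for i in train_idx]
    majority = max(set(train_labels), key=train_labels.count)
    return sum(labels[i] == majority for i in test_idx) / len(test_idx)

labels = [1] * 30 + [0] * 70          # toy data: 30% risky firms
features = [[0.0]] * len(labels)      # features are unused by the baseline
folds = stratified_folds(labels)
mean_accuracy = cross_validate(features, labels, majority_baseline)
```

With these toy labels every fold holds 3 risky and 7 non-risky samples, so the class proportion of the full data is preserved in each fold, which is the property that step v describes.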

4.3.2 Random Sampling

Random sampling is a way of choosing a sample of observations from a population in order to make inferences about the population. When a group of subjects is selected from a larger population, every possible sample of a given size has the same probability of selection. In this evaluation setting, the data is randomly split into a training set and a testing set in a given proportion, and the whole procedure is repeated a specified number of times. In statistics, a random sample is a subset of individuals chosen from a larger set. Each individual is chosen randomly and entirely by chance, such that each individual has the same probability of being chosen.
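As an illustration of repeated random sampling, the sketch below draws several independent train/test partitions with Python's standard `random` module. The 70/30 proportion, the five repetitions and the seed are assumptions chosen for the example; only the sample count of 776 comes from the dataset description.

```python
import random

def random_split(n_samples, train_fraction, rng):
    """Return (train_idx, test_idx) drawn uniformly at random without replacement."""
    indices = list(range(n_samples))
    rng.shuffle(indices)
    cut = int(round(train_fraction * n_samples))
    return indices[:cut], indices[cut:]

def repeated_random_sampling(n_samples, train_fraction=0.7, repeats=5, seed=42):
    """Generate `repeats` independent train/test partitions."""
    rng = random.Random(seed)
    return [random_split(n_samples, train_fraction, rng) for _ in range(repeats)]

splits = repeated_random_sampling(n_samples=776)
train_idx, test_idx = splits[0]
```

Each partition uses every sample exactly once, either for training or for testing, and a classifier's score would be averaged over the repetitions.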

4.3.3 Training speed

Training speed is the time taken to train a classification algorithm on the data. Owing to the large range of training times, only the order relative to the fastest algorithm for each dataset is reported. The fastest algorithm is indicated by a '0'. An algorithm that is between 10^(x-1) and 10^x times as slow is indicated by the value of x. Training time is measured in seconds.
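The relative-order convention can be written as x = ceil(log10(t / t_fastest)), with 0 reserved for the fastest algorithm. The short Python sketch below applies this rule; the function name, the algorithm labels and the timings are hypothetical and are not measurements from this study.

```python
import math

def relative_order(times):
    """Map each training time to x, where the algorithm is between
    10**(x-1) and 10**x times as slow as the fastest one (fastest -> 0)."""
    fastest = min(times.values())
    orders = {}
    for name, seconds in times.items():
        ratio = seconds / fastest
        orders[name] = 0 if ratio == 1 else math.ceil(math.log10(ratio))
    return orders

# Hypothetical training times in seconds (not measurements from this study).
times = {"A": 0.5, "B": 3.0, "C": 40.0}
orders = relative_order(times)
```

Here algorithm B is 6 times as slow as A and receives order 1, while C is 80 times as slow and receives order 2.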

4.3.4 Prediction speed

Prediction speed measures how quickly a trained model produces predictions for new observations. Prediction itself is the process of using data, together with analysis, statistics, and machine learning techniques, to create a predictive model that forecasts a future value (for example, how long a machine can run before requiring maintenance) or estimates a probability. Prediction speed is reported in observations per second (obs/sec).

4.4 Performance measures

Several performance measures are used to evaluate the classifiers. We implemented Logistic Regression, Support Vector Machine and K-Nearest Neighbor in MATLAB and analyzed their results.

4.4.1 Classification accuracy

Classification accuracy is the number of correct predictions divided by the total number of samples in the test data. It works well on balanced data; on imbalanced data it can be misleading, because a classifier that always predicts the majority class still obtains a high accuracy.

Accuracy = (Number of correctly classified samples) / (Total number of test samples) × 100

4.4.2 Confusion matrix

A confusion matrix is a table that is often used to describe the performance of a classification model (or "classifier") on a set of test data for which the true values are known. It summarizes the prediction results: the numbers of correct and incorrect predictions are counted and broken down by class. The confusion matrix therefore shows the ways in which the classification model is confused when it makes predictions, helps identify the areas where the classifier has performed poorly, and gives insight not only into the errors being made but, more importantly, into the types of errors.

| N | Predicted class 1 | Predicted class 2 |
| Actual class 1 | TP | FN |
| Actual class 2 | FP | TN |

Here TP, FN, FP and TN denote the numbers of true positives, false negatives, false positives and true negatives, with class 1 taken as the positive class.

Accuracy (or Classification Rate) = (TP + TN) / N

where N = TP + TN + FP + FN.
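The following Python sketch counts the four cells of a binary confusion matrix and derives the accuracy from them. The function name and the toy label vectors are assumptions used only to illustrate the arithmetic.

```python
def confusion_counts(actual, predicted, positive=1):
    """Count TP, FN, FP, TN for a binary problem."""
    tp = fn = fp = tn = 0
    for a, p in zip(actual, predicted):
        if a == positive and p == positive:
            tp += 1
        elif a == positive:
            fn += 1
        elif p == positive:
            fp += 1
        else:
            tn += 1
    return tp, fn, fp, tn

# Toy labels (hypothetical): 1 = class 1, 0 = class 2.
actual    = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
predicted = [1, 1, 1, 0, 0, 0, 0, 0, 1, 0]

tp, fn, fp, tn = confusion_counts(actual, predicted)
n = tp + tn + fp + fn
accuracy = (tp + tn) / n
```

For these labels the counts are TP = 3, FN = 1, FP = 1 and TN = 5, giving an accuracy of 8/10.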

4.4.3 Receiver Operating Characteristic (ROC) curve

A ROC curve (receiver operating characteristic curve) is a graph showing the performance of a classification model at all classification thresholds. It is used to check the quality of classifiers. For each class of a classifier, threshold values across the interval [0, 1] are applied to the classifier outputs.

The curve plots two parameters:

TPR = True Positive Rate

FPR = False Positive Rate

The True Positive Rate (TPR) is plotted against the False Positive Rate (FPR) for the probabilities of the classifier predictions. Then the area under the plot is calculated. The Area Under the Curve (AUC) is a measure of the overall quality of the classifier.

True Positive Rate (TPR) is a synonym for recall and is therefore defined as follows:

TPR = TP / (TP + FN) (1)

False Positive Rate (FPR) is defined as follows:

FPR = FP / (FP + TN) (2)
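A ROC curve can be traced by sweeping the threshold over the classifier scores and evaluating Eqs. (1) and (2) at each step. The Python sketch below does this for a handful of hypothetical scores; the function name and the data are assumptions, and a library routine would normally be used in practice.

```python
def roc_points(labels, scores):
    """Return (FPR, TPR) pairs obtained by sweeping the decision threshold."""
    positives = sum(labels)
    negatives = len(labels) - positives
    points = [(0.0, 0.0)]               # threshold above every score
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(labels, scores) if s >= threshold and y == 1)
        fp = sum(1 for y, s in zip(labels, scores) if s >= threshold and y == 0)
        points.append((fp / negatives, tp / positives))   # Eq. (2), Eq. (1)
    return points

# Toy scores in [0, 1] (hypothetical classifier outputs).
labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
points = roc_points(labels, scores)
```

The curve starts at (0, 0), where nothing is predicted positive, and ends at (1, 1), where everything is.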

4.4.4 Area under Curve (AUC)

The area under the ROC curve is a standard evaluation measure for classifiers. It is equivalent to the concordance: it can be interpreted as the probability that a randomly chosen positive is assigned a higher score than a randomly chosen negative.

4.4.4.1 AUC is desirable for the following two reasons

• AUC is scale-invariant. It measures how well predictions are ranked, rather than their absolute values.

• AUC is classification-threshold-invariant. It measures the quality of the model's predictions irrespective of what classification threshold is chosen.
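The ranking interpretation of AUC can be checked directly by counting, over all positive/negative pairs, how often the positive receives the higher score. The Python sketch below is a minimal illustration under assumed toy data; the function name is hypothetical and the quadratic loop is meant for clarity, not efficiency.

```python
def auc_by_ranking(labels, scores):
    """Fraction of (positive, negative) pairs in which the positive scores higher;
    ties count as one half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in pos:
        for q in neg:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    return wins / (len(pos) * len(neg))

labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]

auc = auc_by_ranking(labels, scores)
# Rescaling the scores keeps their order, so the AUC is unchanged.
auc_rescaled = auc_by_ranking(labels, [10 * s + 3 for s in scores])
```

Eight of the nine pairs are ranked correctly, so the AUC is 8/9, and the rescaled scores give the same value, which illustrates scale invariance.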

4.4.5 Precision (Positive predictive value)

Precision is a metric for classification models. It measures how often the model is correct when it predicts the positive class. High precision means that few of the predicted positives are false positives.

In information retrieval, the positives are the documents retrieved in response to a query, and the true positives are the retrieved documents that are really relevant to the query. Precision is calculated using the following formula.

Precision = TP / (TP + FP) (3)




4.4.6 Recall (True positive rate)

Recall is the proportion of correctly predicted positive observations to all observations that actually belong to the positive class. Mathematically, recall is calculated using the following formula:

Recall = TP / (TP + FN) (4)

4.4.7 False discovery rate

The false discovery rate (FDR) is the expected proportion of type I errors among all positive predictions. A type I error occurs when the null hypothesis is incorrectly rejected; in other words, the result is a false positive. Closely related to the FDR is the family-wise error rate (FWER). The FWER is the probability of making at least one false conclusion, i.e. at least one type I error, across all the tests performed.

False discovery rate = FP / (TP + FP) (5)

4.4.8 False negative rate

The false negative rate is the proportion of positives that yield negative test outcomes, i.e., the conditional probability of a negative test result given that the condition being looked for is present.

False negative rate = FN / (FN + TP) (6)
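The rates defined in Eqs. (1) to (6) all follow from the four confusion-matrix counts. The Python sketch below collects them in one function; the function name, the dictionary keys and the counts are hypothetical and serve only to illustrate the arithmetic.

```python
def classification_rates(tp, fn, fp, tn):
    """Rates derived from confusion-matrix counts, following Eqs. (1)-(6)."""
    return {
        "tpr_recall": tp / (tp + fn),      # Eqs. (1) and (4)
        "fpr":        fp / (fp + tn),      # Eq. (2)
        "precision":  tp / (tp + fp),      # Eq. (3)
        "fdr":        fp / (tp + fp),      # Eq. (5)
        "fnr":        fn / (fn + tp),      # Eq. (6)
    }

# Hypothetical counts, chosen only to illustrate the arithmetic.
rates = classification_rates(tp=90, fn=10, fp=5, tn=95)
```

Two identities are visible in the result: precision and FDR sum to 1, as do recall and FNR, because each pair shares a denominator.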

4.4.9 Scatter plot

After a classifier is trained, the scatter plot switches from displaying the data to showing model predictions. In MATLAB, scatterplot(x) produces a scatter plot for the signal x. The interpretation of x depends on its shape and complexity:

• If x is a real two-column matrix, scatterplot interprets the first column as in-phase components and the second column as quadrature components.

• If x is a complex vector, scatterplot interprets the real part as in-phase components and the imaginary part as quadrature components.

• If x is a real vector, scatterplot interprets it as a real signal.

4.5 Experimental result

For the experimental analysis, we took the results produced by the MATLAB tool.

4.5.1 Statistical analysis for machine learning algorithms

We applied various statistical methods to analyze the classification algorithms used in our project: K-Nearest Neighbor, Support Vector Machine and Logistic Regression.

4.5.1.1 Results of classification

The classification algorithms are evaluated using different procedures, such as cross validation.

4.5.1.1.1 Cross validation

Cross validation is a technique for evaluating how the results of a statistical analysis generalize to an independent dataset. It is a resampling procedure used to evaluate machine learning models on a limited data sample. The procedure has a single parameter called k that refers to the number of groups that a given data sample is to be split into. As such, the procedure is often called k-fold cross-validation.

When a specific value for k is chosen, it may be used in place of k in the reference to the model, such as k=10 becoming 10-fold cross-validation. 10-fold cross validation divides the dataset into 10 folds; in each iteration, 9 folds are used for training and the remaining fold is used for testing.



Fig 4.8: Classification using cross validation

4.5.2 Confusion matrix

The confusion matrix itself is relatively simple to understand, but the related terminology can be confusing. The confusion matrix plays a major role in determining the accuracy and error rate of the algorithms. It displays the total number of observations in each cell. The rows of the confusion matrix correspond to the true class, and the columns correspond to the predicted class. Diagonal and off-diagonal cells correspond to correctly and incorrectly classified observations, respectively.

4.5.2.1 K-Nearest Neighbors

K-Nearest Neighbors is one of the simplest classification algorithms. It stores all available cases and classifies new cases based on a similarity measure, such as a distance function. K-Nearest Neighbors is mostly used for statistical analysis as well as for pattern recognition.

4.5.2.1.1 Steps for K-Nearest Neighbors

1. Determine the parameter k, the number of nearest neighbors.

2. Calculate the distance between the query instance and all the training samples.

3. Sort the distances and determine the k nearest neighbors based on the minimum distance.

4. Gather the categories of the nearest neighbors.

5. Use the simple majority of the categories of the nearest neighbors as the prediction value of the query instance.
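The five steps can be sketched as a short Python function. This is a minimal illustration with the Euclidean distance and the standard library only, not the MATLAB classifier used in the experiments; the function name and the toy data are assumptions.

```python
import math
from collections import Counter

def knn_predict(train_x, train_y, query, k=3):
    """Classify `query` by majority vote among its k nearest training samples."""
    # Step 2: distance from the query to every training sample.
    distances = [(math.dist(x, query), y) for x, y in zip(train_x, train_y)]
    # Step 3: sort by distance and keep the k closest.
    nearest = sorted(distances, key=lambda pair: pair[0])[:k]
    # Steps 4-5: gather the neighbors' categories and take the majority.
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Toy two-feature data (hypothetical): label 1 = risky, 0 = not risky.
train_x = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (4.0, 4.2), (4.1, 3.9), (3.8, 4.0)]
train_y = [0, 0, 0, 1, 1, 1]

label_low = knn_predict(train_x, train_y, (1.1, 1.0), k=3)
label_high = knn_predict(train_x, train_y, (4.0, 4.0), k=3)
```

An odd k is a common choice for two-class problems because it avoids tied votes.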

4.5.2.1.2 Formulas for KNN using Matlab

The k nearest neighbors are selected based on a distance metric. Given an mx-by-n data matrix X, which is treated as mx (1-by-n) row vectors $x_1, x_2, \ldots, x_{mx}$, and an my-by-n data matrix Y, which is treated as my (1-by-n) row vectors $y_1, y_2, \ldots, y_{my}$, the various distances between the vectors $x_s$ and $y_t$ are defined as follows:

1. Euclidean distance

$d_{st}^2 = (x_s - y_t)(x_s - y_t)'$

The Euclidean distance is a special case of the Minkowski distance.

2. Standardized Euclidean distance

$d_{st}^2 = (x_s - y_t) V^{-1} (x_s - y_t)'$,

where V is the n-by-n diagonal matrix.

3. Mahalanobis distance

$d_{st}^2 = (x_s - y_t) C^{-1} (x_s - y_t)'$,

where C is the covariance matrix.

4. City block distance

$d_{st} = \sum_{j=1}^{n} |x_{sj} - y_{tj}|$

5. Minkowski distance

$d_{st} = \sqrt[p]{\sum_{j=1}^{n} |x_{sj} - y_{tj}|^p}$

For the special case of p = 1, the Minkowski distance gives the city block distance. For the special case of p = 2, the Minkowski distance gives the Euclidean distance. For the special case of p = ∞, the Minkowski distance gives the Chebyshev distance.

6. Chebyshev distance

$d_{st} = \max_j \{ |x_{sj} - y_{tj}| \}$
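The relationship between these distances can be verified numerically. The Python sketch below implements the city block, Euclidean, Chebyshev and Minkowski distances for plain tuples; the function names and the two example vectors are assumptions, and the standardized Euclidean and Mahalanobis distances are left out because they need the matrices V and C.

```python
def minkowski(x, y, p):
    """Minkowski distance of order p between two equal-length vectors."""
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1.0 / p)

def city_block(x, y):
    return sum(abs(a - b) for a, b in zip(x, y))

def euclidean(x, y):
    return sum((a - b) ** 2 for a, b in zip(x, y)) ** 0.5

def chebyshev(x, y):
    return max(abs(a - b) for a, b in zip(x, y))

xs = (1.0, 5.0, 2.0)
yt = (4.0, 1.0, 2.0)

d_city = city_block(xs, yt)        # |3| + |4| + |0|
d_eucl = euclidean(xs, yt)         # sqrt(9 + 16 + 0)
d_cheb = chebyshev(xs, yt)         # max(3, 4, 0)
d_large_p = minkowski(xs, yt, 50)  # approaches the Chebyshev distance
```

With p = 1 and p = 2 the Minkowski function reproduces the city block and Euclidean values, and for a large p it comes very close to the Chebyshev value, matching the special cases listed above.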

i. Create a confusion matrix chart and sort the classes of the chart

according to the class-wise number of observations.

Fig 4.9: Confusion matrix for KNN based on number of observations

ii. Create a confusion matrix chart and sort the classes of the chart

according to the class-wise true positive rate (recall) and the class-wise

false negative rate.

Fig 4.10: Confusion matrix for KNN based on TPR and FNR

iii. Create a confusion matrix chart and sort the classes of the chart according to the class-wise positive predictive value (precision) and the class-wise false discovery rate.

Fig 4.11: Confusion matrix for KNN based on PPV and FDR

4.5.2.2 Logistic Regression

Logistic Regression is a machine learning algorithm used for classification problems. It is a predictive analysis algorithm based on the concept of probability. Logistic regression can be viewed as a linear regression model whose output is passed through the 'sigmoid function', also known as the 'logistic function', instead of being used directly, and which uses a correspondingly more complex cost function. The hypothesis of logistic regression limits the output to values between 0 and 1. A linear function cannot represent it, because a linear function can take values greater than 1 or less than 0, which is not possible under the hypothesis of logistic regression.

$0 \le h_\theta(x) \le 1$

Logistic regression hypothesis expectation

4.5.2.2.1 Sigmoid Function:

In order to map predicted values to probabilities, we use the sigmoid function. The function maps any real value into another value between 0 and 1.

$f(z) = \frac{1}{1 + e^{-z}}$

f(z) = output between 0 and 1 (probability estimate)

z = input to the function (the algorithm's prediction, e.g. mx + b)

e = base of the natural logarithm.

The function has an "S"-shaped curve.

Logistic regression mainly follows three fundamental concepts:

1. Logistic regression hypothesis

2. Logistic regression decision boundary

3. Logistic regression cost function




4.5.2.2.2 Hypothesis Representation

When using linear regression, the hypothesis has the form

$h_\theta(x) = \beta_0 + \beta_1 X$

For logistic regression we modify it slightly by passing the linear term through the sigmoid function:

$\sigma(Z) = \sigma(\beta_0 + \beta_1 X)$

We expect the hypothesis to give values between 0 and 1.

$Z = \beta_0 + \beta_1 X$

$h_\theta(x) = \mathrm{sigmoid}(Z)$

$h_\theta(x) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 X)}}$

4.5.2.2.3 Decision boundary:

Our current prediction function returns a probability score between 0 and 1. In order to map this to a discrete class (true/false, cat/dog), we select a threshold value or tipping point, at or above which we classify values into class 1 and below which we classify values into class 0.

p ≥ 0.5, class = 1

p < 0.5, class = 0
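The hypothesis and the decision boundary can be combined in a few lines of Python. The sketch below is illustrative only: the function names are assumptions and the coefficients are hypothetical values, not coefficients fitted to the audit data.

```python
import math

def sigmoid(z):
    """Logistic function: maps any real z into the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def hypothesis(x, beta0, beta1):
    """h(x) = sigmoid(beta0 + beta1 * x) for a single predictor x."""
    return sigmoid(beta0 + beta1 * x)

def classify(x, beta0, beta1, threshold=0.5):
    """Apply the decision boundary: class 1 if p >= threshold, else class 0."""
    return 1 if hypothesis(x, beta0, beta1) >= threshold else 0

# Hypothetical coefficients, not fitted to the audit data.
beta0, beta1 = -4.0, 2.0
p_low = hypothesis(1.0, beta0, beta1)    # z = -2
p_mid = hypothesis(2.0, beta0, beta1)    # z = 0
p_high = hypothesis(3.0, beta0, beta1)   # z = 2
```

The probability equals 0.5 exactly where the linear term is zero, so with a threshold of 0.5 the decision boundary is the point x = -beta0 / beta1.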

4.5.2.2.4 Cost function

For logistic regression, the cost function is defined as:

$\mathrm{Cost}(h_\theta(x), y) = \begin{cases} -\log(h_\theta(x)) & \text{if } y = 1 \\ -\log(1 - h_\theta(x)) & \text{if } y = 0 \end{cases}$

This is the cost the algorithm pays if it predicts the value $h_\theta(x)$ while the actual label turns out to be y. Using this function guarantees the convexity of the function that the gradient descent algorithm has to process. Gradient descent is then used to minimize the cost function.
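To show how gradient descent reduces this cost, the following Python sketch fits a one-predictor logistic model to a small toy data set. The learning rate, the number of steps, the function names and the data are all assumptions; MATLAB's own fitting routine may use a different optimizer.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def cost(h, y):
    """Per-sample cost: -log(h) if y == 1, -log(1 - h) if y == 0."""
    return -math.log(h) if y == 1 else -math.log(1.0 - h)

def mean_cost(xs, ys, beta0, beta1):
    return sum(cost(sigmoid(beta0 + beta1 * x), y) for x, y in zip(xs, ys)) / len(xs)

def gradient_descent(xs, ys, rate=0.1, steps=500):
    """Batch gradient descent on the mean logistic cost."""
    beta0 = beta1 = 0.0
    n = len(xs)
    for _ in range(steps):
        errors = [sigmoid(beta0 + beta1 * x) - y for x, y in zip(xs, ys)]
        beta0 -= rate * sum(errors) / n
        beta1 -= rate * sum(e * x for e, x in zip(errors, xs)) / n
    return beta0, beta1

# Toy one-feature data (hypothetical); the classes overlap slightly.
xs = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0]
ys = [0,   0,   0,   1,   0,   1,   1,   1]

initial_cost = mean_cost(xs, ys, 0.0, 0.0)      # log(2) at the starting point
beta0, beta1 = gradient_descent(xs, ys)
final_cost = mean_cost(xs, ys, beta0, beta1)
```

Because the cost is convex, each sufficiently small step moves toward the single global minimum, so the final cost is lower than the initial value of log 2.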

i. Create a confusion matrix chart and sort the classes of the chart according to the class-wise number of observations.

Fig 4.12: Confusion matrix for LR based on number of observations

ii. Create a confusion matrix chart and sort the classes of the chart according to the class-wise true positive rate (recall) and the class-wise false negative rate.

Fig 4.13: Confusion matrix for LR based on TPR and FNR

iii. Create a confusion matrix chart and sort the classes of the chart according to the class-wise positive predictive value (precision) and the class-wise false discovery rate.

Fig 4.14: Confusion matrix for LR based on PPV and FDR

4.5.2.3 Support Vector Machine

A Support Vector Machine (SVM) performs classification by finding the hyperplane that maximizes the margin between the two classes. The vectors (cases) that define the hyperplane are the support vectors. An SVM can be built in three ways: linear SVM, nonlinear SVM and kernel-function SVM. An SVM is a discriminative classifier formally defined by a separating hyperplane. In other words, given labeled training data (supervised learning), the algorithm outputs an optimal hyperplane which categorizes new examples. An SVM model represents the examples as points in space, mapped so that the examples of the separate categories are divided by a clear gap that is as wide as possible. New examples are then mapped into that same space and predicted to belong to a category based on which side of the gap they fall. The operation of the SVM algorithm is based on finding the hyperplane that gives the largest minimum distance to the training examples. Twice this distance is called the margin in SVM theory. Therefore, the optimal separating hyperplane maximizes the margin of the training data.
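The margin idea can be illustrated without training an SVM: the Python sketch below measures the margin of two candidate separating hyperplanes on a toy data set and shows that the centered one is wider. The function names, the points and the two hyperplanes are assumptions for illustration; an actual SVM solver would search over all hyperplanes.

```python
import math

def distance_to_hyperplane(w, b, x):
    """Perpendicular distance from point x to the hyperplane w.x + b = 0."""
    return abs(sum(wi * xi for wi, xi in zip(w, x)) + b) / math.sqrt(sum(wi * wi for wi in w))

def margin(w, b, points):
    """Twice the smallest distance from any training point to the hyperplane."""
    return 2.0 * min(distance_to_hyperplane(w, b, x) for x in points)

# Toy separable data (hypothetical): class A near x1 = 1, class B near x1 = 5.
points = [(1.0, 1.0), (1.0, 3.0), (5.0, 1.0), (5.0, 3.0)]

# Two candidate separating hyperplanes, both vertical lines x1 = c.
margin_centered = margin((1.0, 0.0), -3.0, points)   # x1 = 3
margin_offset = margin((1.0, 0.0), -2.0, points)     # x1 = 2
```

Both lines separate the two groups, but the line x1 = 3 lies midway between them and yields the larger margin, so it is the one a maximum-margin classifier would prefer.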

I. Create a confusion matrix chart and sort the classes of the chart according to the class-wise number of observations.

Fig 4.15: Confusion matrix for SVM based on number of observations

II. Create a confusion matrix chart and sort the classes of the chart according to the class-wise true positive rate (recall) and the class-wise false negative rate.

Fig 4.16: Confusion matrix for SVM based on TPR and FNR

III. Create a confusion matrix chart and sort the classes of the chart according to the class-wise positive predictive value (precision) and the class-wise false discovery rate.

Fig 4.17: Confusion matrix for SVM based on PPV and FDR




4.5.3 ROC chart

By varying the value of the threshold, a family of different binary classifiers can be obtained and, for each of them, the sensitivity and specificity can be calculated. The ROC curve is a graphical plot that illustrates the diagnostic ability of a binary classifier system as its discrimination threshold is varied. That is, the ROC curve plots 1 - specificity on the x-axis and the sensitivity on the y-axis, calculated for every different threshold.

In order to calculate the points of the curve, the threshold is varied along the output probabilities of the testing instances in ascending order. A perfect model, which classifies every instance correctly, has a ROC curve that passes through the upper left corner, the point at which both sensitivity and specificity take the value 1. The closer the ROC curve passes to the upper left corner, the better its discrimination capacity. The base line represents the ROC curve for a random classifier.

4.5.3.1 Area under curve

As a perfect classifier passes through the upper left corner, i.e., the point (0, 1), its area under curve is 1. A random classifier, represented by the base line, has an area under curve of 0.5. If the area under curve takes a value lower than 0.5, the classifier is worse than random. In practice, this measure should take a value between 0.5 and 1.0. The closer the area under curve is to 1, the better the classifier.
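These two reference values can be reproduced with the trapezoidal rule. The Python sketch below computes the area under a ROC curve given as a list of (FPR, TPR) points; the function name and the two idealized curves are assumptions used for illustration.

```python
def area_under_curve(points):
    """Trapezoidal area under a ROC curve given as (FPR, TPR) points sorted by FPR."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area

perfect = [(0.0, 0.0), (0.0, 1.0), (1.0, 1.0)]   # passes through (0, 1)
baseline = [(0.0, 0.0), (1.0, 1.0)]              # diagonal of a random classifier

auc_perfect = area_under_curve(perfect)
auc_baseline = area_under_curve(baseline)
```

The perfect curve encloses the whole unit square and the diagonal encloses half of it, which gives the values 1 and 0.5 quoted above.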

4.5.3.1.1 KNN

The k-NN model creates confidences based on the mean of the k nearest neighbors. For k=1, the confidences can be only 0 or 1, leading to an ugly but correct ROC plot. For larger values of k, the granularity of the confidence values increases. The algorithm itself has no decision threshold: an instance is classified as the dominant class among its k nearest instances.

For KNN, the area under the curve is 0.97 and the current classifier is at the point (0.04, 0.99). The maximum AUC is 1, which corresponds to a perfect classifier. Larger AUC values indicate better classifier performance.

Fig 4.18: ROC curve for K-Nearest Neighbors

4.5.3.1.2 Logistic Regression

Only the first two features are used as predictor variables. A binary classification problem is defined by using only the measurements that correspond to the two classes. The ROC curve is computed using the probability estimates from the logistic regression model as scores.

For logistic regression, the area under the curve is 0.97 and the current classifier is at the point (0.04, 0.98). The maximum AUC is 1, which corresponds to a perfect classifier. Larger AUC values indicate better classifier performance.

Fig 4.19: ROC curve for Logistic Regression




4.5.3.1.3 SVM

The AUC SVM generally achieves a higher AUC than the regular SVM, and its optimal setting of the trade-off parameter is smaller than for the regular SVM. For linear kernels, the AUC SVM is more effective at maximizing the AUC than the regular SVM. The ROC curve for an SVM can be applied to the corresponding binary one-versus-one or one-versus-rest problems.
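A one-versus-rest problem is obtained by relabelling one class as positive and all other classes as negative, after which an ordinary binary ROC curve can be drawn for each class. The sketch below shows only this relabelling step; the class names are hypothetical, and the audit data in this paper is already binary.

```python
def one_vs_rest_labels(labels, positive_class):
    """Binary labels for the problem 'positive_class against all other classes'."""
    return [1 if label == positive_class else 0 for label in labels]


# Hypothetical three-class labels, used only to illustrate the relabelling.
labels = ["A", "B", "C", "A", "B"]
binary_problems = {c: one_vs_rest_labels(labels, c) for c in sorted(set(labels))}
print(binary_problems)
```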

In SVM, the area under the curve is 1.00 and the current classifier point is (0.05, 0.99). The maximum AUC is 1, which corresponds to a perfect classifier. Larger AUC values indicate better classifier performance.

Fig 4.20: ROC curve for Support Vector Machine

4.5.4 Scatter plot

Use the scatter plot to view the classifier results. After you train a classifier, the

scatter plot switches from displaying the data to showing model predictions. If

you are using holdout or cross-validation, then these predictions are the

predictions on the held-out observations. In other words, each prediction is

obtained using a model that was trained without using the corresponding

observation. To investigate your results, use the controls on the right. You can:

a. Choose whether to plot model predictions or the data alone.




b. Show or hide correct or incorrect results using the check boxes under Model

predictions.

c. Choose features to plot using the X and Y lists under Predictors.

d. Visualize results by class by showing or hiding specific classes using the

check boxes under Show.

e. Change the stacking order of the plotted classes by selecting a class

under Classes and then clicking Move to Front.

f. Zoom in and out, or pan across the plot. To enable zooming and panning,

hover the mouse over the scatter plot and click one of the buttons that appear

near the top-right corner of the plot.
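The held-out predictions described above, where each prediction comes from a model trained without the corresponding observation, can be sketched as follows. This is not the Classification Learner implementation: the nearest-centroid model, the interleaved fold assignment and the toy data are assumptions chosen to keep the example short.

```python
def nearest_centroid_fit(X, y):
    """Train a deliberately simple stand-in model: one mean value per class."""
    centroids = {}
    for label in set(y):
        values = [x for x, t in zip(X, y) if t == label]
        centroids[label] = sum(values) / len(values)
    return centroids


def nearest_centroid_predict(centroids, x):
    """Predict the class whose centroid is closest to x."""
    return min(centroids, key=lambda label: abs(centroids[label] - x))


def out_of_fold_predictions(X, y, n_folds):
    """Predict every observation with a model trained on the other folds only."""
    predictions = [None] * len(X)
    for fold in range(n_folds):
        held_out = [i for i in range(len(X)) if i % n_folds == fold]
        train = [i for i in range(len(X)) if i % n_folds != fold]
        model = nearest_centroid_fit([X[i] for i in train], [y[i] for i in train])
        for i in held_out:
            predictions[i] = nearest_centroid_predict(model, X[i])
    return predictions


# Hypothetical one-dimensional data with two well-separated classes.
X = [0.1, 0.2, 0.3, 0.4, 1.1, 1.2, 1.3, 1.4]
y = [0, 0, 0, 0, 1, 1, 1, 1]
preds = out_of_fold_predictions(X, y, n_folds=4)
print(preds)
```

Every observation receives exactly one prediction, and the model that produced it never saw that observation during training.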

Fig 4.21: Scatter plot in the Classification Learner app

4.5.5 Result Analysis

For the result analysis, we took the result data calculated and analyzed at the experimental results stage and represented them graphically, so that the results are clearer and easier for the user to understand.




4.5.5.1 Analysis on various classification algorithms

Table 4.2 and Table 4.3 give a comparative analysis of the classification algorithms considered, namely K-Nearest Neighbor, Support Vector Machine and Logistic Regression, in terms of recall and accuracy rate. In the Classification Learner app, the accuracy rate for Support Vector Machine is the highest among the compared algorithms on the audit_risk data (97.7%). So, we observed that SVM is better than the remaining algorithms in terms of accuracy rate and recall.

These results report the accuracy, prediction speed and training time of each algorithm. Based on these statistical measures, we determine which algorithm is the best classifier.
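The three measures named above can be obtained for any classifier along the following lines. This is only a sketch of the measurement idea, not how the Classification Learner app computes its figures: the threshold model, the function names and the synthetic data are hypothetical, and timings depend on the machine.

```python
import time


def train_threshold(X, y):
    """Hypothetical stand-in model: threshold halfway between the two class means."""
    mean0 = sum(x for x, t in zip(X, y) if t == 0) / y.count(0)
    mean1 = sum(x for x, t in zip(X, y) if t == 1) / y.count(1)
    return (mean0 + mean1) / 2.0


def predict_threshold(threshold, x):
    return 1 if x > threshold else 0


def evaluate(train_fn, predict_fn, X_train, y_train, X_test, y_test):
    """Return accuracy, prediction speed (obs/sec) and training time (sec)."""
    start = time.perf_counter()
    model = train_fn(X_train, y_train)
    training_time = time.perf_counter() - start

    start = time.perf_counter()
    predictions = [predict_fn(model, x) for x in X_test]
    elapsed = time.perf_counter() - start

    correct = sum(1 for p, t in zip(predictions, y_test) if p == t)
    return {
        "accuracy": correct / len(y_test),
        "prediction_speed": len(X_test) / elapsed if elapsed > 0 else float("inf"),
        "training_time": training_time,
    }


# Synthetic data: class 1 whenever the single feature is at least 0.5.
X_train = [i / 1000 for i in range(1000)]
y_train = [0 if x < 0.5 else 1 for x in X_train]
X_test = [0.05, 0.25, 0.45, 0.55, 0.75, 0.95] * 500
y_test = [0, 0, 0, 1, 1, 1] * 500

summary = evaluate(train_threshold, predict_threshold, X_train, y_train, X_test, y_test)
print(summary)
```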

Table 4.2: Analysis of various classification algorithms for audit_risk data

Classification Learner          Accuracy   Prediction Speed   Training Time
Logistic Regression             97.3%      ~6700 obs/sec      15.135 sec
Support Vector Machine (SVM)    97.7%      ~9800 obs/sec      3.9264 sec
K-Nearest Neighbor (KNN)        97.2%      ~6200 obs/sec      2.877 sec

Table 4.3: Analysis of various classification algorithms for trail data

Classification Learner          Accuracy   Prediction Speed   Training Time
Logistic Regression             98.7%      ~7400 obs/sec      31.486 sec
Support Vector Machine (SVM)    99.2%      ~1700 obs/sec      1.3901 sec
K-Nearest Neighbor (KNN)        99.4%      ~9100 obs/sec      1.0169 sec

4.5.5.2 Analyze the confusion matrix

The confusion matrix displays the total number of observations in each cell. The rows of the confusion matrix correspond to the true class, and the columns correspond to the predicted class. Diagonal and off-diagonal cells correspond to correctly and incorrectly classified observations, respectively.
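The rates used to analyze a confusion matrix all derive from its four counts. The sketch below applies the standard definitions to the SVM counts reported in Table 4.4 (TP = 466, FP = 14, TN = 291, FN = 5). The function name is hypothetical, and treating these counts as a single positive class is an assumption made for the illustration.

```python
def confusion_rates(tp, fp, tn, fn):
    """Standard rates derived from the four cells of a binary confusion matrix."""
    total = tp + fp + tn + fn
    return {
        "accuracy": (tp + tn) / total,  # share of all observations classified correctly
        "TPR": tp / (tp + fn),          # true positive rate (recall)
        "FNR": fn / (tp + fn),          # false negative rate = 1 - TPR
        "PPV": tp / (tp + fp),          # positive predictive value (precision)
        "FDR": fp / (tp + fp),          # false discovery rate = 1 - PPV
    }


# Counts taken from the SVM row of Table 4.4.
svm = confusion_rates(tp=466, fp=14, tn=291, fn=5)
for name, value in svm.items():
    print(f"{name}: {value:.1%}")
```

Rounded to whole percentages, the TPR, FNR, PPV and FDR computed this way are 99%, 1%, 97% and 3%, which agree with the SVM values listed in the TPR/FNR and PPV/FDR tables.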




The confusion matrix has three plotting options:

1. Number of observations

2. True positive rate and false negative rate

3. Positive predictive values and false discovery rate

To see how the classifier performed based on the number of observations in each class, under Plot, select the Number of observations option.

Table 4.4: Analyze the confusion matrix based on number of observations.

Number of observations

Algorithms   TP    FP   TN    FN
SVM          466   14   291   5
LR           461   12   293   10
KNN          464   13   292   7

To see how the classifier performed per class, under Plot, select the True Positive Rates, False Negative Rates option. The plot shows summaries per true class in the last two columns on the right.

Table 4.5: Analyze the confusion matrix based on TPR and FNR

True positive rates and false positive rates

Algorithms   TPR   FPR   TNR   FNR
SVM          99%   5%    95%   1%
LR           98%   4%    96%   2%
KNN          99%   4%    96%   1%

To see results per predicted class, under Plot, select the Positive Predictive Values, False Discovery Rates option. The confusion matrix now shows summary rows underneath the table. Positive predictive values are shown in green for the correctly predicted points in each class, and false discovery rates are shown below them in red for the incorrectly predicted points in each class.

Table 4.6: Analyze the confusion matrix based on PPV and FDR

Positive predictive values and false discovery rates

Algorithms   PPV (Class0)   FDR (Class0)   PPV (Class1)   FDR (Class1)
SVM          97%            3%             98%            2%
LR           97%            3%             97%            3%
KNN          97%            3%             98%            12%

Class labels of the confusion matrix chart are specified as a categorical vector, numeric vector, string vector, character array, cell array of character vectors, or logical vector. If the class labels are a vector, it must have the same number of elements as the confusion matrix has rows and columns. If the class labels are a character array, it must be two-dimensional, with each row corresponding to the label of one class.

Table 4.7: Analyze TPR and FPR using class labels

True positive rate and false positive rate

Classes   Algorithms   TPR   FNR
Class0    SVM          99%   1%
          LR           98%   2%
          KNN          99%   1%
Class1    SVM          95%   5%
          LR           96%   4%
          KNN          96%   4%

Table 4.7: Analyze PPV and FDR using class labels

Positive predictive values and false discovery rate

Classes   Algorithms   TPR   FNR
Class0    SVM          99%   1%
          LR           98%   2%
          KNN          99%   1%
Class1    SVM          95%   5%
          LR           96%   4%
          KNN          96%   4%

The line chart and pie chart below depict the accuracy rate of the various classification methods; the results show that the support vector machine has the highest accuracy rate in Matlab.




Fig 4.22: Accuracy rate for various classification algorithms based on line chart

Fig 4.23: Accuracy rate for various classification algorithms based on pie chart

Chart title: Data Set Accuracy. Plotted accuracy values: 97.30%, 97.70% and 97.20% for data points 1, 2 and 3.




The figure below shows the analysis of the parameters true positive, true negative, false positive and false negative for the various classification algorithms in machine learning.

Fig 4.22: Analysis of classification algorithms using various parameters (bar chart of TP, FP, TN and FN counts for SVM, LR and KNN)

5. CONCLUSION AND FUTURE WORK

In the present scenario, an audit is the systematic process of obtaining and objectively evaluating evidence about the correspondence between information, situations or procedures and established criteria. It ensures that appropriate attention is devoted to important areas, that potential problems are promptly identified, and that work is completed expeditiously and properly coordinated. In the existing system, the audit data set was taken and the Particle Swarm Optimization (PSO) algorithm was applied to understand the complete risk assessment process, and the researchers obtained 93% accuracy using the Bayes Net and J48 algorithms. Meanwhile, we identified the best algorithm among K-Nearest Neighbor, Support Vector Machine and Logistic Regression by analyzing these three algorithms, and found that Support Vector Machine is the best among them. From the result analysis, considering the recall value, the accuracy and the precision, SVM scores higher than the other algorithms. The analysis also identifies which algorithm is the best-performing classifier.

For future work, we aim to improve the performance of the classifiers using an ensemble machine learning approach; effort can also be made to obtain higher accuracy and better execution time for prediction.

