finished implementation draft 1

2017-08-22 19:48:34 +01:00
parent 9600330a76
commit a828ce3fae
+177 -40
@@ -177,7 +177,7 @@ Classification of Heart Abnormalities} \par}\\
\renewcommand{\abstractname}{Acknowledgements}
\begin{abstract}
I'd like to thank anyone and everyone...
I'd like to thank John Thompson - What a babe...
\end{abstract}
\tableofcontents
@@ -596,14 +596,14 @@ the test set for the challenge at 86.02\%.\\
Zabihi et al.\ took an alternative approach by choosing not to segment PCG data
in the pre-processing stage~\parencite{Zabihi2016}. This was with the intention
of reducing computational complexity of the resulting algorithm. In addition,
the proposed method utilizes a wrapper sequential forward feature selection
(SFS) and Linear Predictive Coefficients (LPC) for the reduction of features
used for classification. This benefits the system by removing correlated and
irrelevant features, thus reducing computational complexity and removing
irrelevant noise from feature vectors prior to training. Final classifications
are determined through cascaded ensembles of ANNs. The signal is first
classified as either of high or low sound quality, and then as normal or
abnormal. The system achieved a final score of 85.9\% on the hidden test set.\\
the proposed method utilizes a sequential forward feature selection (SFS) and
Linear Predictive Coefficients (LPC) for the reduction of features used for
classification. This benefits the system by removing correlated and irrelevant
features, thus reducing computational complexity and removing irrelevant noise
from feature vectors prior to training. Final classifications are determined
through cascaded ensembles of ANNs. The signal is first classified as either of
high or low sound quality, and then as normal or abnormal. The system achieved
a final score of 85.9\% on the hidden test set.\\
Plesinger et al.\ opted to develop a new form of machine learning algorithm
based on probability assessment~\parencite{Plesinger2017}. In this method,
@@ -1129,8 +1129,11 @@ SFFS is an iterative wrapper method that adds features and re-trains the chosen
model sequentially, choosing features that increase the accuracy of the model
output (using 3-fold cross validation to avoid overfitting). Final models used
as few as 40 features, improving the accuracy of classifications and reducing the
computation time of models significantly. For further details on SFFS please
refer to~\parencite[p.3]{Ferri1994}.
computation time of models significantly.\\
It should be understood that this method does not guarantee a globally optimal
set of features. An exhaustive feature selection algorithm would provide this
guarantee, but would incur significant computational cost. For further details on
SFFS please refer to~\parencite[p.3]{Ferri1994}.
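\par
The greedy behaviour described above can be illustrated with a minimal sketch.
It is not the implementation used in this project: it shows plain sequential
forward selection without the floating (backtracking) step of SFFS, and the
function names and the toy scoring table are hypothetical stand-ins for a
cross-validated model score.
\begin{lstlisting}[language=Python, numbers=none]
from itertools import combinations


def forward_select(features, score, k):
    """Greedily add the feature that most improves the score, k times."""
    selected = []
    remaining = list(features)
    while remaining and len(selected) < k:
        best = max(remaining, key=lambda f: score(selected + [f]))
        selected.append(best)
        remaining.remove(best)
    return selected


def exhaustive_select(features, score, k):
    """Score every subset of size k; optimal but combinatorially expensive."""
    return list(max(combinations(features, k), key=lambda s: score(list(s))))


# Hypothetical scores: "a" is the best single feature, but the pair
# ("b", "c") outperforms every pair that contains "a".
TOY_SCORES = {
    frozenset("a"): 0.70, frozenset("b"): 0.60, frozenset("c"): 0.60,
    frozenset("ab"): 0.75, frozenset("ac"): 0.75, frozenset("bc"): 0.90,
}


def toy_score(subset):
    return TOY_SCORES.get(frozenset(subset), 0.0)


greedy = forward_select(["a", "b", "c"], toy_score, k=2)
optimal = exhaustive_select(["a", "b", "c"], toy_score, k=2)
\end{lstlisting}
In this toy case the greedy search commits to feature \texttt{a} first and
therefore misses the better pair found by the exhaustive search.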
\subsubsection{Particle Swarm Hyperparameter Optimisation}\label{PSOp}
The particle swarm optimisation algorithm is an iterative meta-heuristic algorithm that
@@ -1184,7 +1187,11 @@ algorithm to provide a locally optimal selection of base classifiers for the
model~\parencite{Sesmero2015}. This technique was used to pick the 3 final
models described in section~\ref{class} from a selection of 8 models. This
dynamic selection of models was seen to be one of the key contributors to the
overall success of the algorithm.
overall success of the algorithm. As with the chosen feature selection
algorithm, this method of optimisation is not guaranteed to provide a globally
optimal solution. It was thought that for the proposed system a locally optimal
solution would suffice, particularly given the highly complex parameter space
used in implementation. This is discussed in detail in Section~\ref{ModOp}.
\subsection{Model Performance Evaluation Method}\label{metrics}
In order to fully understand the performance of the system (and to evaluate the
@@ -1258,7 +1265,7 @@ This section describes the tools used in the realisation of the
proposed system, the practical issues encountered throughout the
implementation process and the development strategy taken to address and avoid
such issues. Rationale is given for decisions made throughout
production of the proposed system and any issues with current implementation are
production of the proposed system and any known issues with the current implementation are
outlined.
\subsection{Development Strategy}
@@ -1272,7 +1279,7 @@ readily available packages and libraries, make the language a good choice for
the fast, flexible development approach taken throughout this project.\\
The most significant objective from the outset of the project was to provide a
system that could classify pathological systems with a degree of accuracy that
system that could classify pathological signals with a degree of accuracy that
was comparable to the current state of research in the field of PCG analysis.
Given this focus and that the performance of the final product was initially
unknown, it was recognised that the design of the project would need to adapt
@@ -1297,34 +1304,163 @@ throughout the project alongside other packages detailed in the following
sections.
\subsection{System overview}
The proposed system can be broken down into a number of key components, each of
which performs a specific task, interacting with other components to produce
the final result. The main components are the user interface, feature
generation module, classification and optimisation module and evaluation
module. Implementation of each is detailed in the following sections.
The proposed system can be broken down into 4 key components: the user
interface, the feature generation module, the classification and optimisation
module, and the evaluation module. The overall architecture of the system follows a
common design pattern for machine learning based systems: taking a set of input
data, augmenting it to produce associated data, extracting patterns from said
data and evaluating the results to understand the effects of various implementation
choices. The reason for this is to maintain focus on the central innovations of the
project, which are thought to be in the implementation of models and the
analysis of input data.
\subsubsection{User interface and project framework}
As a project focused on producing a functional proof of concept for a specific
problem, this project was not aimed at producing a user-facing application. For
this reason, interactions with the system are based primarily around a simple,
well documented command-line interface (CLI) that allows a user to specify
relevant parameters through the use of CLI flags (CLI output flags are detailed in
Appendix~\ref{appendixB}). This gives the user control over optional,
situation-specific parameters (such as multiprocessing and the locations in
which to save data) through an interface that is both easy to maintain throughout
development and intuitive for a user familiar with command-line applications.
Relevant information is then presented to the user as the program progresses
through the use of standard output and file-based logging systems. Providing
informative feedback to the user was essential, both during development for
debugging processes, and to provide an understanding of the program's progress,
particularly in long-running iterative processes used for optimisation.\\
A file-based logging system was developed using Python's built-in logging
module to allow for the monitoring of threaded processes. This allowed for
detailed monitoring of the system's progress, even when running multiple
operations concurrently.\\
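A minimal sketch of such a file-based logger is given here for illustration. It
is an assumption about how the standard \texttt{logging} module could be
configured rather than the project's actual code; the logger name, the log file
location and the worker function are hypothetical. Including the thread name in
each record is what makes concurrent output traceable.
\begin{lstlisting}[language=Python, numbers=none]
import logging
import os
import tempfile
import threading


def make_file_logger(path, name="pcg"):
    """Create a logger that writes timestamped, thread-tagged records to path."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    handler = logging.FileHandler(path)
    handler.setFormatter(logging.Formatter(
        "%(asctime)s [%(threadName)s] %(levelname)s: %(message)s"))
    logger.addHandler(handler)
    return logger


log_path = os.path.join(tempfile.mkdtemp(), "run.log")
logger = make_file_logger(log_path)


def worker(sample_id):
    # The logging module is thread-safe, so no extra locking is needed here.
    logger.info("processing sample %s", sample_id)


threads = [
    threading.Thread(target=worker, args=(i,), name="worker-{}".format(i))
    for i in range(2)
]
for thread in threads:
    thread.start()
for thread in threads:
    thread.join()

with open(log_path) as fh:
    log_lines = fh.read().splitlines()
\end{lstlisting}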
A significant issue that developed as the project grew in size and complexity
was the running time. As more complex methods were implemented for feature
extraction and model optimisation, the time taken to process the relatively
large dataset grew considerably. Methods were implemented, primarily using
Python's object pickling functionality, to handle the creation and maintenance
of intermediate feature and parameter files, allowing the program to load
previously generated data. This significantly cut the computation time of
subsequent runs. It also provided a convenient facility for
storing models in a portable fashion, for example for transfer between
computers.
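\par
The caching of intermediate results can be sketched as below. This is a
simplified illustration under stated assumptions, not the project's code: the
helper name \texttt{load\_or\_compute}, the file name and the placeholder
feature dictionary are all hypothetical.
\begin{lstlisting}[language=Python, numbers=none]
import os
import pickle
import tempfile


def load_or_compute(path, compute):
    """Load a pickled result from path if present, else compute and store it."""
    if os.path.exists(path):
        with open(path, "rb") as fh:
            return pickle.load(fh)
    result = compute()
    with open(path, "wb") as fh:
        pickle.dump(result, fh)
    return result


calls = []


def expensive_features():
    # Stand-in for a slow feature extraction step.
    calls.append(1)
    return {"a0001": [0.12, 0.98, 3.4]}


cache_path = os.path.join(tempfile.mkdtemp(), "features.pkl")
first = load_or_compute(cache_path, expensive_features)   # computed and saved
second = load_or_compute(cache_path, expensive_features)  # loaded from disk
\end{lstlisting}
The second call returns the stored result without re-running the expensive
step. Pickle files should only be loaded from trusted sources, since
unpickling can execute arbitrary code.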
\subsubsection{User interface}
- Implementation of simple CLI for quick control of system parameters
- High computational cost - Multiprocessing, logging issues
\subsubsection{Features extraction}
- Data Manipulation
- - Pandas and NumPy for basic handling and manipulation of data
- - Splitting of data using sklearn
- Joining of existing segmentation script and python code
- pyWavelets for wavelet features
- librosa for MFCCs
\subsubsection{Classification model generation}
- Use of sklearn for base classifiers, use of pipelines
- Addition of stacking classifier using mlxtend - use of probabilities
- Saving of features and models to pickles, allowing for direct running of
intermediate section of system and for development and portability of generated models
Implementation of optimisations
\paragraph{Model optimisation}
- Optunity for Hyperparameter optimisation
- Mlxtend for SFS
As a fundamental component of the classification system, the feature extraction
method required careful planning and development to realise the extraction of
the large number of features. As segmentation data is a feature of particular
importance, the integration of the MATLAB segmentation algorithm was a key part
of the feature extraction method. Given a premade MATLAB implementation, the
most problematic step was integrating this code seamlessly with the largely
incompatible Python framework. This was achieved through the use of Python's
subprocess module, allowing the code to be called through an external Unix
process. The nature of this implementation restricts the system to Unix-like
platforms. However, given the nature of the project, this is not considered an
issue.\\
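The following sketch illustrates how a MATLAB routine might be invoked through
the \texttt{subprocess} module. It rests on several assumptions: a
\texttt{matlab} executable is available on the system path, the MATLAB entry
point is called \texttt{segment\_recording} (a hypothetical name), and the
paths contain no single quotes. Only the command construction is exercised
here; the project's actual call may differ.
\begin{lstlisting}[language=Python, numbers=none]
import subprocess


def build_matlab_command(script_dir, wav_path):
    """Build the argument list for a non-interactive MATLAB call."""
    # segment_recording is a hypothetical MATLAB function name.
    call = "addpath('{0}'); segment_recording('{1}'); exit;".format(
        script_dir, wav_path)
    # -nodisplay, -nosplash and -r are standard MATLAB startup options.
    return ["matlab", "-nodisplay", "-nosplash", "-r", call]


def run_segmentation(script_dir, wav_path):
    """Run the segmentation; raises CalledProcessError on a non-zero exit."""
    subprocess.check_call(build_matlab_command(script_dir, wav_path))


command = build_matlab_command("segmentation", "a0001.wav")
\end{lstlisting}
Passing the arguments as a list avoids invoking a shell, so file names are not
subject to shell interpretation.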
Having extracted segmentation data, methods for the efficient calculation and
storage of other features were considered. It was recognised that in order to
maintain such a large number of features, an organised and efficient data
format would be needed. The DataFrame object from the popular Pandas library
was chosen for this task. Built on the NumPy array object, DataFrames provide a
labelled data structure, ideal for storing large quantities of labelled data.
As the DataFrame is based on the NumPy array, it is able to combine the
powerful matrix operations of NumPy with intuitive database-style queries of
the kind commonly seen in languages such as SQL. This data structure also
allows for simple storing and loading of DataFrames through built-in interfaces
to the HDF5 file format, simplifying the storage of large quantities of
intermediate data.\\
A minor issue that was found during development was the lack of compatibility
between DataFrames and Scikit-learn (described in Section~\ref{Sklearn}). This
simply required careful handling of operations between these APIs in order to
avoid unintentional mishandling of data.\\
As mentioned previously, the calculation of features often required many
mathematical operations to be performed on vectors. The use of NumPy array-like
containers simplified this operation, allowing many features to be calculated
in a single line of code. Standard operations were sufficient for a number of
the features. However, mathematically complex features such as MFCCs and
wavelet transforms required dedicated processing libraries. The open-source
PyWavelets and Librosa projects were used for initial generation of the DWT and
MFCCs respectively~\parencite{pyWave, Mcfee2015}. Further processing, such as
segmentation and entropy analysis, is then applied to produce the final
features. For details of all features please refer to
Appendix~\ref{appendixA}.\\
Given the large number of operations required for feature extraction, a long
computation time was an unavoidable consequence of
the design. To help alleviate this issue, processing of features was
parallelized, using each sample as an individual job. The speedup gained
through parallelization is inherently dependent on the system running the
program; however, this significantly reduced the computation time of features.
A modified implementation of Python's multiprocessing module was used for task
management.
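\par
The per-sample parallelisation can be sketched as follows. This illustration
uses the standard \texttt{multiprocessing.Pool} rather than the modified
implementation mentioned above, and the two summary statistics are placeholder
features chosen only to keep the example short; all names are hypothetical.
\begin{lstlisting}[language=Python, numbers=none]
from multiprocessing import Pool


def extract_features(sample):
    """Compute placeholder features for one (name, signal) sample."""
    name, signal = sample
    n = len(signal)
    mean = sum(signal) / n
    energy = sum(x * x for x in signal) / n
    return name, {"mean": mean, "energy": energy}


def extract_all(samples, processes=2):
    """Distribute samples across worker processes, one job per sample."""
    with Pool(processes=processes) as pool:
        return dict(pool.map(extract_features, samples))


if __name__ == "__main__":
    samples = [
        ("a0001", [0.0, 1.0, -1.0, 2.0]),
        ("a0002", [0.5, 0.5, 0.5, 0.5]),
    ]
    print(extract_all(samples))
\end{lstlisting}
The worker function is defined at module level so that it can be pickled and
sent to the worker processes.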
\subsubsection{Classification model generation}\label{Sklearn}
The Scikit-learn machine learning package is a popular choice for
implementation of machine learning algorithms in
Python~\parencite{Pedregosa2011}. As a result, it is well maintained and
benefits from a number of smaller projects that are designed to be compatible
with its API. Such projects include Mlxtend, designed to expand Scikit-learn's
range of features with more esoteric models and tools~\parencite{Raschka2016},
and Imbalanced-learn, which addresses the lack of sample balancing
transformations in Scikit-learn~\parencite{Lemaitre2017}. Scikit-learn's aim to
provide a standard interface to each of its algorithms also aids in the quick
prototyping of many machine learning algorithms on the features generated.
During the initial stages of development, this allowed for models to be tested
quickly to gain a rudimentary understanding of their performance. From
this, it was possible to explore more complex models, which ultimately resulted in
the use of Mlxtend's stacking ensemble classifier.\\
As the size of the classification codebase grew, so did the number of chained
operations in the classification process. Scikit-learn's Pipeline utility was
used to maintain correct handling of these process chains, allowing models
to be easily formed of multiple transforms and classifiers in a manageable
manner. This gained particular significance with the implementation of model
optimisation methods, as discussed in the following section.
\paragraph{Model selection/optimisation}\label{ModOp}
The decision to use an ensemble classifier also presented a complication,
through the need to choose multiple complementary base classifiers. A further
issue was in the growing number of parameters that would need to be tuned to
obtain optimal results from each of the chosen base classifiers. The adaptation
of a particle swarm optimisation algorithm was found to be an effective
solution to both of these problems~\parencite{Claesen2014}. Using model
pipelines to encapsulate all transformations and classifiers into a single
object, it was possible to construct a dictionary of multiple potential
Scikit-learn models for each base classifier. Implementation of a wrapper
function around the full stacking classifier then allowed classifier pipelines
to be treated as hyperparameters, which could be optimized on a training set
alongside all active parameters for each base classifier. This resulted in the
simultaneous selection and optimization of model combinations and
hyperparameters.\\
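The idea of treating the choice of pipeline as a hyperparameter can be sketched
as follows. This is an illustrative assumption rather than the code used in the
project: a single classifier stands in for the full stacking ensemble, a small
exhaustive grid stands in for the particle swarm optimiser, the data is
synthetic, and the names \texttt{CANDIDATES} and \texttt{objective} are
hypothetical.
\begin{lstlisting}[language=Python, numbers=none]
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Dictionary of candidate pipelines, keyed by a name that the optimiser
# can treat as a categorical hyperparameter.
CANDIDATES = {
    "svm": lambda C: Pipeline([
        ("scale", StandardScaler()), ("clf", SVC(C=C))]),
    "logreg": lambda C: Pipeline([
        ("scale", StandardScaler()), ("clf", LogisticRegression(C=C))]),
}


def objective(model_name, C, X, y):
    """Wrapper scored by the optimiser: model choice and C are both inputs."""
    model = CANDIDATES[model_name](C)
    return cross_val_score(model, X, y, cv=3).mean()


# Synthetic data in place of the PCG feature set.
X, y = make_classification(n_samples=120, n_features=8, random_state=0)

# Exhaustive grid in place of particle swarm optimisation.
grid = [(name, C) for name in sorted(CANDIDATES) for C in (0.1, 1.0, 10.0)]
best_name, best_C = max(grid, key=lambda p: objective(p[0], p[1], X, y))
best_score = objective(best_name, best_C, X, y)
\end{lstlisting}
Because each candidate is a complete pipeline, the optimiser only ever sees one
function of a model name and its parameters, regardless of how many transforms
a candidate contains.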
Mlxtend's implementation of SFFS was also used to apply feature reduction to
the dataset. This was initially implemented as part of the processing pipeline,
resulting in feature reduction being applied on every iteration of parameter
optimisation. This initially provided a score for each model based on the
selection of its optimal parameters. However, as models grew in complexity,
the growing computational cost made this method infeasible.
This was addressed by instead applying the feature selection algorithm after
optimisation on the full dataset. The reduction in performance from this is not
thought to be significant.\\
As with parameters, models were saved using a combination of Python's pickling
functionality and Pandas' HDF5 export methods, to create fully portable models.
\subsubsection{Automatic system evaluation}
In order to accurately place the system in the context of current research,
evaluation metrics were needed to perform automatic testing of the system.
Metrics were implemented as described in Section~\ref{metrics} using a custom
multi-scorer object that was adapted to allow for the calculation of the 3
metrics: sensitivity, specificity and score. Using this object in conjunction
with a selection of Scikit-learn's cross-validation objects provided a mechanism
for quickly evaluating models in an equivalent fashion to those presented in
the literature. In addition, a simple train/test split was implemented, allowing
for the use of a hold-out dataset. Performance on this dataset was evaluated
directly using the custom scoring functions and Scikit-learn's model scoring
methods.\\
Finally, results were formatted into tables and logged to provide instant
feedback to the user on the performance of the current model.
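\par
The three metrics can be computed from a confusion matrix as in the sketch
below. It is not the project's multi-scorer object: the function names are
hypothetical, abnormal recordings are assumed to be labelled 1, and the score
is taken here to be the unweighted mean of sensitivity and specificity, which
is an assumption. The definitions in Section~\ref{metrics} take precedence.
\begin{lstlisting}[language=Python, numbers=none]
def confusion_counts(y_true, y_pred, positive=1):
    """Return (tp, fn, tn, fp) for binary labels."""
    tp = fn = tn = fp = 0
    for truth, pred in zip(y_true, y_pred):
        if truth == positive:
            if pred == positive:
                tp += 1
            else:
                fn += 1
        else:
            if pred == positive:
                fp += 1
            else:
                tn += 1
    return tp, fn, tn, fp


def multi_score(y_true, y_pred):
    """Compute sensitivity, specificity and their unweighted mean."""
    tp, fn, tn, fp = confusion_counts(y_true, y_pred)
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "score": (sensitivity + specificity) / 2,
    }


y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 1, 1]
results = multi_score(y_true, y_pred)
\end{lstlisting}
For these labels the sketch gives a sensitivity of 0.75, a specificity of 0.5
and a score of 0.625.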
\section{Evaluation}\label{Eval}
Weighted specificity and weighted Accuracy measures
@@ -1338,7 +1474,10 @@ performed well
Relationships between features likely with features such as wavelets, perhaps
captured by SVMs
Discuss issues with database e
\section{Further Work}\label{FurtherWork}
Further research to be done into resampling - inclusion as hyperparameter in
optimization
Handle silent sections of audio such as those highlighted by Goda et
al.~\parencite{Goda2016}
Synthesis of synthetic PCG signals
@@ -1350,7 +1489,7 @@ Particle swarm Would ideally be placed inside feature selection
\addcontentsline{toc}{section}{Appendices}
\renewcommand{\thesubsection}{\Alph{subsection}}
\subsection{Table of Features}\label{appendixA}
\subsection{Commandline Interface}
\subsection{Commandline Interface}\label{appendixB}
\singlespacing
\lstset{basicstyle=\scriptsize, style=mystyle}
\begin{lstlisting}[numbers=none]
@@ -1407,8 +1546,6 @@ optional arguments:
by current process
\end{lstlisting}
\doublespacing
\pagebreak{}
\printbibliography{}