Structure implementation section

2017-08-22 12:28:25 +01:00
parent 0d70cdc302
commit 9600330a76
+135 -89
@@ -746,6 +746,8 @@ $e$~\parencite{Bobillo2016}. The recording of normal and pathological signals us
separate devices is likely to cause issues and is discussed in
Section~\ref{Eval}
%BEGIN NEW MATERIAL
\section{Design}
This project aims to provide robust heart abnormality detection for PCG
signals, such that use of the system could reliably recommend further medical
@@ -825,8 +827,8 @@ submitted to the challenge. Results produced by the proposed system will
generally not be coloured by the differences in quality of segmentation
algorithms, allowing for more direct comparison of classification methods.
However, it is noted that despite the high performance of the algorithm, errors
in segmentation will still occur that may have a negative impact on feature
quality. As methods proposed by previous literature such as hand correction by
in segmentation will still occur, which may have a negative impact on feature
quality. As methods proposed by previous literature, such as hand correction by
a professional~\parencite[p.2203]{Liu2016}, are not feasible in this context,
and considering the low number of erroneous results produced by the
algorithm~\parencite[p.2]{Goda2016}, it was decided that these errors would not
@@ -859,21 +861,22 @@ Features such as:
\item A selection of envelope based features for each heart sound
\end{itemize}
18 feature provided by the Physionet challenge focused on timings between
18 features provided by the Physionet challenge focused on timings between
segments of the heart cycles. It was thought that these features would be
useful in capturing irregularities caused by conditions such as arrhythmias,
atrial septal defect and other conditions that are likely to affect relative
timing of heart sounds, such as Mitral valve prolapse or regurgitation.
timing of heart sounds.
Many conditions that can be detected by traditional auscultation are
characterised by an increase in loudness of the S1 and/or S2 heart
sounds~\parencite{Brown2008}. This suggests that features relating to human
perception of loudness may aid in the detection of such conditions. Simple
envelope based features such as RMS, peak loudness and the Shannon energy
envelope (Equation~\ref{ShanEQ}, popular in previous literature, were extracted
for this reason~\parencite[p.73-77]{Lerch2012}. In addition, statistical
features such as sample entropy and skewness (Equation ~\ref{SkewEQ}) were used
to evaluate the distribution of samples for each heart sound, these were
selected to provide a representation of the temporal ``shape'' of each sound.
envelope (Equation~\ref{ShanEQ}), which have proved popular in previous
literature, were extracted for this reason~\parencite[p.73--77]{Lerch2012}. In
addition, statistical features such as sample entropy and skewness
(Equation~\ref{SkewEQ}) were used to evaluate the distribution of samples for
each heart sound; these were selected to provide a representation of the
temporal ``shape'' of each sound.
\begin{equation}\label{ShanEQ}
SE = \frac{-1}{N}\sum\limits_{n=0}^{N-1} x(n)^2\cdot \log{x(n)^2}
@@ -892,12 +895,12 @@ It was recognised that a time domain representation alone was unlikely to
provide a sufficient representation for discerning a wide variety of
conditions. Using a time-frequency representation to characterise the spectral
components of the signal has proven effective in the majority of literature.
The classic method for producing a spectral representation of a signal is the
Fourier transform (as defined in Equation~\ref{FFTEQ}) over a sliding window of size
$N$. By decomposing the signal into a series of sine and cosine
waves, a representation of the signal across a range of frequency bands is
produced. This can be used for further analysis of heart sounds
based on their spectral characteristics.
The classic method for producing a spectral representation of a signal is to
apply the Discrete Fourier Transform (DFT) (as defined in Equation~\ref{FFTEQ})
over a sliding window of size $N$. By decomposing the signal into a series of
sine and cosine waves, a representation of the signal's spectral content across
a range of frequency bands is produced. This can be used for further analysis
of heart sounds based on their spectral characteristics.
\begin{equation}\label{FFTEQ}
X(k)=\sum\limits_{n=0}^{N-1}x(n)e^{\frac{-j2\pi kn}{N}}
\end{equation}
@@ -912,7 +915,7 @@ signal's spectral shape. MFCCs are calculated by first applying $N$ (a
user-defined parameter) triangular filter banks, spaced using the mel scale to
the magnitude spectrum. Applying a discrete cosine transform to the log of the
filterbank outputs provides the final set of coefficients (for further details,
please refer to~\parencite{Lerch2012}). This representation
please refer to~\parencite{Lerch2012}). This analysis
creates a perceptually relevant representation of spectral shape, in effect
mimicking the way in which humans might perceive the spectral shape of heart
sounds. The reasoning for this is that, as the aim is to provide a system with
@@ -921,7 +924,7 @@ what a human percieves may prove effective at distinguishing conditions in the
way that a human does. This has been shown to be effective in previous literature,
with multiple systems utilising perceptual features with
success~\parencite{Ortiz2016, Rubin2016, Quiceno-Manrique2010a}. 13 MFCCs were
calculated for each heart sound and averaged per sample to provide 13 features
calculated for each heart sound and averaged to provide 13 features
per sample.\\
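As an illustrative sketch of the calculation described above, one commonly used
definition of the mel scale maps a frequency $f$ (in Hz) to
\begin{equation}
m(f) = 2595\log_{10}\bigg(1+\frac{f}{700}\bigg)
\end{equation}
Given the outputs $E_m$ of $M$ mel-spaced triangular filters applied to the
magnitude spectrum, the $c$-th coefficient can then be written as
\begin{equation}
\mathrm{MFCC}(c) = \sum\limits_{m=1}^{M}\log(E_m)\cos\bigg(\frac{\pi
c}{M}\Big(m-\frac{1}{2}\Big)\bigg)
\end{equation}
The symbols $E_m$, $M$ and $c$ are notation assumed for this illustration only
($M$ is used for the number of filters to avoid confusion with the window size
$N$). The mel scale variant and the scaling of the cosine transform differ
between implementations, so this should be read as a sketch rather than the
exact calculation performed by the system.\\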
%TODO: Generate MFCC spectrum
@@ -947,10 +950,10 @@ time-frequency representation to fourier methods. The fundamental concept of
the wavelet transform is to represent an input signal as a set of scaled and
shifted finite oscillations. By comparing the signal with each scale of wavelet
at all points in time, a set of $N\times A$ (where $A$ is the number of scales)
coefficients are generated that represent the scale and position needed for
coefficients is generated. These define the scale and position needed for
each wavelet in order to fully reconstruct the signal (for further details,
refer to~\parencite{Polikar1994}) The benefit of this transform is that it is
well localized in both time and frequency. This allows for accurate
refer to~\parencite{Polikar1994}). The benefit of this transform is that it is
well localized in both time and frequency domains. This allows for accurate
representation of transient events such as clicks and snaps that are
characteristic of heart conditions such as Mitral valve prolapse or
stenosis~\parencite{Brown2008}.\\
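For reference, the comparison of a signal with scaled and shifted wavelets
described above is commonly written, in its continuous form, as
\begin{equation}
W(s,\tau) = \frac{1}{\sqrt{s}}\int_{-\infty}^{\infty}
x(t)\,\psi^{*}\bigg(\frac{t-\tau}{s}\bigg)\,dt
\end{equation}
Where:\\
$\psi$ is the mother wavelet and $\psi^{*}$ its complex conjugate\\
$s>0$ is the scale\\
$\tau$ is the shift (position in time)\\
This is the textbook definition and the symbols $s$, $\tau$ and $\psi$ are
assumed for illustration; the choice of wavelet and the discretisation of $s$
and $\tau$ are not specified by it.\\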
@@ -962,44 +965,44 @@ coefficients to attain a total of 48 features.~\parencite{Homsi2016}
\subsubsection{Feature Scaling and Imputing}
A common problem when working with multiple features is the difference in scale
Dbetween features. This problem can cause many machine learning algorithms to place
between features. This problem can cause many machine learning algorithms to place
bias on larger scale features and can significantly impact the time taken for
certain algorithms to converge. This is particularly significant when applying
algorithms sensitive to feature scale such as SVMs (described in
Section~\ref{SVM}). To address this, a Min-Max scaler was applied
to training and test sets prior to training models. This scales all values to within a
0--1 range producing a set of features on a common scale.\\
It is also common to encounter missing values in features. these can occur as a
It is also common to encounter missing values in features. These can occur as a
result of $\log(0)$ or division by 0 calculations, amongst other edge cases. A
standard method for handling these values is to apply an imputer, replacing
values with the mean of the feature vector.~\parencite{VanderPlas2017}
values with the mean of the feature vector~\parencite{VanderPlas2017}.
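As a brief illustration of both steps, Min-Max scaling maps each value $x$ of a
feature to
\begin{equation}
x' = \frac{x - x_{\min}}{x_{\max} - x_{\min}}
\end{equation}
where $x_{\min}$ and $x_{\max}$ are the minimum and maximum values of that
feature. It is assumed here that these are estimated from the training set and
reused unchanged for the test set, in which case test values may fall slightly
outside the 0--1 range. Mean imputation replaces each missing value of a
feature with
\begin{equation}
\bar{x} = \frac{1}{|O|}\sum\limits_{i\in O} x_i
\end{equation}
where $O$ is the set of samples for which the feature was observed. The symbols
$x'$, $x_{\min}$, $x_{\max}$ and $O$ are notation introduced for this
illustration only.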
\subsection{Stacking Classifier with Cross-Validation}\label{class}
The stacking classifier is an ensemble classifier that uses the results of
multiple base classifiers as input to a 2nd level meta-classifier, used to
generate a final predicition. $k$-fold cross validation is used accross base
classifiers, training on $k-1$ folds of input data, and applying to the
remaining hold out set. The results of these predictions from each base
classifier are combined and used to train the 2nd level classifier which
produces the final preditions.\\
Given it's considerable performance accross a range of tasks, it was expected
that this classification model could be applied effectively to produce an
alternative method for abnormality detection than those presented in previous
literature.
multiple base classifiers as input to a 2nd level meta-classifier, which in
turn is used to generate a final prediction. $k$-fold cross-validation is used
across base classifiers, training on $k-1$ folds of input data, and applying
to the remaining validation set. The results of these predictions from each
base classifier are combined and used to train the 2nd level classifier, which
produces the final predictions based on the probabilities and predictions
provided.\\
Given its proven accuracy across a range of tasks, it was
expected that this classification model could be applied effectively to produce
an alternative method for abnormality detection to those presented in
previous literature.
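
The stacking procedure can be summarised using the following notation, which is
introduced here for illustration only. Given $M$ base classifiers
$h_1,\dots,h_M$ and a meta-classifier $g$, each training sample $x_i$ belonging
to fold $j$ is mapped to a vector of out-of-fold predictions
\begin{equation}
z_i = \Big(h_1^{(-j)}(x_i),\dots,h_M^{(-j)}(x_i)\Big)
\end{equation}
where $h_m^{(-j)}$ denotes base classifier $m$ trained on all folds except $j$.
The meta-classifier is then trained on the pairs $(z_i, y_i)$, and a new sample
$x$ is classified as
\begin{equation}
\hat{y} = g\big(h_1(x),\dots,h_M(x)\big)
\end{equation}
It is assumed in this sketch that the base classifiers are refitted on the full
training set before being applied to new samples, as is common practice.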
% TODO:Insert stacking classifier diagram
\subsubsection{Base Classifiers}
Clearly, an important consideration when using any ensemble method is the
selection of the base classifiers. In order for any ensemble method to perform
well, it must be constructed using a selection of classifiers that individually
provide useful models for the data~\parencite[p.484]{Tobergte2013a}. The final
optimized model consisted of 3 base models. A wide variety of models were
considered for use as base and meta models. These included models such as Tree
based, $k$-Nearest Neighbor, and AdaBoost classifiers. Selection of these
models was based on a novel approach using hyperparameter optimization as
discussed in Section~\ref{optimise}. The following sections detail the final
selection used; A combination of SVM and Naive-Bayes classifiers, with a
Logistic Regression meta classifier.
provide useful models for the data~\parencite[p.484]{Tobergte2013a}. A wide
variety of models were considered for use as base and meta models, including
tree-based, $k$-Nearest Neighbor, and AdaBoost classifiers.
Selection of these models was based on a novel approach using hyperparameter
optimisation as discussed in Section~\ref{optimise}. The following sections
detail the 3 final models selected by the optimisation algorithm: a combination
of SVM and Naive-Bayes classifiers, with a Logistic Regression meta classifier.
\paragraph{SVM}\label{SVM}
The SVM classifier aims to fit a hyperplane to data that maximises the
@@ -1009,8 +1012,8 @@ also likely to increase the margin for error in separation of classes. This
type of classifier is also able to generate hyperplanes in non-linear space,
using a technique known as the `kernel trick'. This works by mapping the data
to a higher dimension, allowing non-linearly separable classes to be separated
by the same method. The details of the SVM and Kernal-SVM are involved and
outside the scope of this report. Further details can be found
by the same method. The details of the SVM and Kernel-SVM are complex and
outside the scope of this report. Further information can be found
in~\parencite[p.187]{Tobergte2013a}.\\
% TODO: Create Hyperplane plot
SVMs have been prevalent in previous literature, shown to be effective in
@@ -1018,7 +1021,7 @@ separation of a variety of heart conditions~\parencite{Ari2010} The use of
kernels to map parameters to higher dimensions is a key advantage of this
model, allowing for non-linear relationships that are likely to be present in
the large variety of features to be well represented in classification. Choice
of kernals, and relevant hyperparameters is detailed in Section~\ref{optimise}.
of kernels and relevant hyperparameters is detailed in Section~\ref{PSOp}.
\paragraph{Naive-Bayes}
Commonly used in text classification problems, where there is typically a
@@ -1034,9 +1037,9 @@ calculating the probability of a feature as:
P(x_i\mid y)=\frac{1}{\sqrt{2\pi
\sigma_y^2}}\exp\bigg(-\frac{(x_i-\mu_y)^2}{2\sigma^2_y}\bigg)
\end{equation}
Where:
$\mu$ is the mean of the distribution
$\sigma^2$ is the varaince
Where:\\
$\mu$ is the mean of the distribution\\
$\sigma^2$ is the variance\\
Using Maximum Likelihood estimation to estimate $\sigma$ and $\mu$ given the
feature vector, a classification for new features can then be calculated as:
\begin{equation}
@@ -1052,7 +1055,7 @@ completely independant allows for extremely fast classification and scalability
to large datasets, with many dimensions~\parencite[p.300]{Zhang2004}. It was
thought that these benefits would make the classifier suitable for the proposed system, as the relatively high
dimensionality of features and quantity of datapoints could then be classified
quickly to obtain initial results. Despite the inclussion of more complex
quickly, to obtain initial results. Despite the inclusion of more complex
models, this model was chosen via automatic selection for the final model.
Refer to Section~\ref{PSOp} for further details.
@@ -1082,25 +1085,27 @@ $\lambda$ is the regularization parameter used to help prevent overfitting\\
By minimizing the cost function, classification predictions can then be made
using the hypothesis function~\parencite{Ng2012}.\\
Logistic regression was chosen as the meta-classifier primarily due to its
simplicity and performance in testing. Choice of meta-classifier is a potential
area for improvement and it is noted that a range of meta-classifiers have been
proposed for different tasks that utilise
simplicity and performance in testing. It is thought that this algorithm
performed particularly well as output from base classifiers was linearly
separable and relatively simple (in comparison to input features). The choice of
meta-classifier is a potential area for improvement and it is noted that a
range of meta-classifiers have been proposed for different tasks that utilise
stacking~\parencite[p.29]{Sesmero2015}. Further work in this area could
potentially provide improved results.
% TODO: Replace this section
% \subsubsection{Signal quality classification}\label{Quality}
\subsection{Model Optimization}\label{optimise}
As discussed in previous section, two of the most important aspects that affect
\subsection{Model Optimisation}\label{optimise}
As discussed in previous sections, two of the most important aspects that affect
the performance of a classification system are its models and the input
features. A combination of relevant features and well tuned models is therefore
likely to provide an accurate classification system. However, it is not always
immdiately clear which values to choose for parameters, or features to use as
input. This is especially true when given such a wide selection of models to
choose from, and high such dimensional feature spaces, as are used in the
proposed method. To address this issue, two automatic optimisation approaches
were implemented with the aim of maximising the accuracy of the proposed
immediately clear which values to choose for parameters, or which features to use as
inputs. This is especially true when given such a wide selection of models to
choose from, and such high dimensional feature spaces, as are used in the
proposed system. To address this issue, two automatic optimisation approaches
were implemented, with the aim of maximising the accuracy of the proposed
system.
\subsubsection{Sequential Feature Selection}\label{SFS}
@@ -1110,16 +1115,17 @@ There are two commonly used methods for addressing this problem: feature
reduction and feature selection. Feature reduction involves reducing features
to a lower dimensionality using techniques such as PCA. Conversely, feature
selection involves selectively removing features entirely via methods such as
Sequential Floating Selection (SFFS). Both aim to reduce the amount of redundant
information in features by removing or reducing features that are expected not
to benefit the model. As a selection of models were to be used, each
Sequential Floating Forward Selection (SFFS). Both aim to reduce the amount of
redundant information in features by removing or reducing features that are not
expected to benefit the model. As a selection of models were to be used, each
potentially handling dimensionality differently (SVMs in particular), it was
decided that feature selection would be most appropriate for this application.\\
decided that feature selection would be most appropriate for this
application.\\
Through experimentation, the chosen method was SFFS. This method is an adaptation
of traditional sequential forward selection that also uses sequential backward
selection to allow for subsequent removal of added features when necessary.
SFFS is an iterative wrapper method that adds features and retrains the chosen
SFFS is an iterative wrapper method that adds features and re-trains the chosen
model sequentially, choosing features that increase the accuracy of the model
output (using 3-fold cross validation to avoid overfitting). Final models used
as few as 40 features, increasing both accuracy of classifications and
@@ -1127,7 +1133,7 @@ computation time of models significantly. For further details on SFFS please
refer to~\parencite[p.3]{Ferri1994}.
\subsubsection{Particle Swarm Hyperparameter Optimisation}\label{PSOp}
The particle swarm optimization algorithm is an iterative meta-heuristic algorithm that
The particle swarm optimisation algorithm is an iterative meta-heuristic algorithm that
aims to find the set of parameters that maximises a given function. Given an
$n$-dimensional parameter space, the algorithm randomly initialises sets of
`particles' representing random combinations of parameters. As the algorithm
@@ -1140,7 +1146,7 @@ algorithm is shown in code block~\ref{PSCode}~\parencite{Clerc2002}
\onehalfspacing
\begin{lstlisting}[escapeinside={(*}{*)}, label={PSCode}, caption={Particle
Swarm Optimization Pseudocode}]
Swarm Optimisation Pseudocode}]
Do
//For all particles...
For (*$i$*)=1 to Population Size
@@ -1164,9 +1170,9 @@ Until termination criterion is met
\doublespacing
The use of this algorithm allowed for the efficient optimisation of all parameters
relating to the stacking classifier, and it's base classifiers, resulting in a
relating to the stacking classifier and its base classifiers, resulting in a
finely tuned classification model that could not have been produced using
traditional trial and error methods.\\
traditional trial and error methods to search for optimal parameters.\\
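For illustration, the update applied to each particle $i$ on every iteration
can be written, in the widely used inertia weight formulation, as
\begin{equation}
\mathbf{v}_i \leftarrow \omega\mathbf{v}_i +
c_1 r_1(\mathbf{p}_i-\mathbf{x}_i) + c_2 r_2(\mathbf{g}-\mathbf{x}_i)
\end{equation}
\begin{equation}
\mathbf{x}_i \leftarrow \mathbf{x}_i + \mathbf{v}_i
\end{equation}
Where:\\
$\mathbf{x}_i$ and $\mathbf{v}_i$ are the position and velocity of particle $i$
in the parameter space\\
$\mathbf{p}_i$ is the best position found so far by particle $i$\\
$\mathbf{g}$ is the best position found so far by the swarm\\
$r_1$ and $r_2$ are random numbers drawn uniformly from $[0,1]$\\
$\omega$, $c_1$ and $c_2$ are user-defined coefficients\\
These symbols are assumed for this sketch and the coefficient values used by
the system are not implied by it. The constriction variant of the algorithm
differs in how the terms of the velocity update are scaled.\\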
During the initial design phase, it was found that the abundance of machine
learning algorithms available makes selection of the optimal model difficult,
requiring in-depth knowledge of a range of machine learning techniques. A novel
@@ -1184,7 +1190,7 @@ overall success of the agorithm.
In order to fully understand the performance of the system (and to evaluate the
impact of design decisions throughout development), a group of scoring methods
were implemented to test the system's performance in a selection of scenarios.
The aim was to provide reliable metrics that would highlight the systems
The aim was to provide reliable metrics that would highlight the system's
strengths and weaknesses and to provide quantifiable measures with which to
compare the system to the range of alternative methods proposed in the
literature.\\
@@ -1194,7 +1200,7 @@ separate hold-out dataset. By reserving a selection of samples from accross the
databases, a trained model could then be scored on this dataset for accuracy, sensitivity and
specificity (metrics described in Section~\ref{ChallengeEnt}) to determine the
system's performance on an unseen set of samples. This method is widely used to
provide a basic understanding of a model's ability to generalise to new data, A
provide a basic understanding of a model's ability to generalise to new data, a
crucial requirement of the system. Data was split using a grouped stratified shuffle
split, grouping by database. This ensured an equal number of randomly selected
classes were taken from each database to produce training and test sets. This
@@ -1211,12 +1217,14 @@ the full dataset into multiple folds, and training models on each, metrics can
be calculated on each fold, and an average can be taken to provide a measure of
the system's performance over all folds. 10-fold cross validation, stratified
by class, was chosen for evaluation of the system. This provides an insight
into the performance of the algorithm accross the dataset.\\
It is highlighted by Homsi et.\ al, that a large amount of variance may be observed
accross folds~\parencite[p.1637]{Homsi2017}. Homsi et.\ al attribute this to the
into the performance of the algorithm across the dataset. This is a common
method used by all participants of the Physionet challenge and is commonly found
in prior literature.\\
It is highlighted by Homsi et al.\ that a large amount of variance may be
observed across folds~\parencite[p.1637]{Homsi2017}. This is attributed to the
variations across databases, making generalisation difficult. To account for
this, it is suggested that cross-validation is repeated multiple times and
average to provide a more accurate measurement of performance accross folds.
averaged to provide a more accurate measurement of performance across folds.
For the proposed system, cross validation was repeated 10 times for each fold
and averaged to produce the final results. Standard deviation is also
calculated across these iterations to illustrate the possible prevalence of
@@ -1228,7 +1236,9 @@ accuracies in standard cross-validation, but performed significantly worse when
testing on unseen databases~\parencite{Homsi2017, Bobillo2016}. For this
reason, leave-one-out cross-validation was used to form a better understanding
of the system's ability to generalise to unseen data from different sources. On
each fold, a single database is removed, training on all other databases.\\
each fold, a single database is removed, training on all other databases. This
is a useful method as it can be used to determine the level to which
information extracted from databases is representative of other databases.\\
The evaluation of models using cross-validation was not limited to final
evaluation. Evaluation of intermediate models generated by both the SFFS and Particle Swarm
@@ -1242,39 +1252,75 @@ Discussion on the performance of the proposed system using these methods can be
found in Section~\ref{Eval}.
% TODO: Insert cross validation diagram from data science handbook
% END NEW MATERIAL
\section{Implementation}
This section describes the tools used in the realisation of the
proposed system and the practical issues encountered throught the
implementation process. Rationale is given for decisions made throughout
proposed system, the practical issues encountered throughout the
implementation process and the development strategy taken to address and avoid
such issues. Rationale is given for decisions made throughout
production of the proposed system and any issues with the current implementation are
outlined.
\subsection{System Structure}
From the outset, the project aimed to
\subsection{Development Strategy}
Early in the design process it became apparent that in order for this project
to produce reasonable results, it would need to utilise a number of complex
algorithms to handle the various non-trivial problems that were encountered
throughout development. Python was chosen as the most suitable language for the
implementation of the system. High level language features such as dynamic
types and automatic garbage collection, combined with the large variety of
readily available packages and libraries, make the language a good choice for
the fast, flexible development approach taken throughout this project.\\
focus on using open source libraries throughout the project to avoid
`reinventing the wheel'. Integration of external libraries
Use of Python - quick development, wide variet of third party libraries to
allow for rapid prototyping
The most significant objective from the outset of the project was to provide a
system that could classify pathological signals with a degree of accuracy that
was comparable to the current state of research in the field of PCG analysis.
Given this focus and that the performance of the final product was initially
unknown, it was recognised that the design of the project would need to adapt
as the project progressed, implementing and testing high level concepts to
iteratively improve performance of the system. For this reason a high level
view of production was taken, choosing to focus on the overall system
architecture, rather than spending great amounts of time on any one specific
element of the project (as any component of the project could be
removed/replaced entirely, if this facilitated the improvement of results).\\
With this design ethos in mind, it was decided that external packages and
libraries would be used/adapted wherever necessary to avoid spending large
amounts of time developing proprietary implementations of proven concepts. By
not `reinventing the wheel', it would be possible to rapidly prototype and
evaluate high level concepts, such as the variety of machine learning and
optimisation algorithms detailed in previous sections, quickly and effectively.
Due to Python's active developer community, a wide range of scientific computing
and machine learning packages are available, such as
NumPy~\parencite{VanDerWalt2011}, SciPy~\parencite{Millman2011} and
Scikit-Learn~\parencite{Pedregosa2011}, each of which was used extensively
throughout the project alongside other packages detailed in the following
sections.
Interface
\subsection{System overview}
The proposed system can be broken down into a number of key components, each of
which performs a specific task, interacting with other components to produce
the final result. The main components are the user interface, feature
generation module, classification and optimisation module and evaluation
module. Implementation of each is detailed in the following sections.
\subsubsection{User interface}
- Implementation of simple CLI for quick control of system parameters
- High computational cost - Multiprocessing, logging issues
Data Manipulation
- Pandas and Numpy for basic handeling and manipulation of data
- Splitting of data using sklearn
Implementation of features
\subsubsection{Features extraction}
- Data Manipulation
- - Pandas and NumPy for basic handling and manipulation of data
- - Splitting of data using sklearn
- Joining of existing segmentation script and python code
- pyWavelets for wavelet features
- librosa for MFCCs
Implementation of machine learning classifiers
\subsubsection{Classification model generation}
- Use of sklearn for base classifiers, use of pipelines
- Addition of stacking classifier using mlxtend - use of probabilities
- Saving of features and models to pickles, allowing for direct running of
intermediate section of system and for development and portability of generated models
Implementation of optimisatons
- Optunity for Hyperparameter optimization
\paragraph{Model optimisation}
- Optunity for Hyperparameter optimisation
- Mlxtend for SFS