Structure implementation section

2017-08-22 12:28:25 +01:00
parent 0d70cdc302
commit 9600330a76
1 changed files with 135 additions and 89 deletions
@@ -746,6 +746,8 @@ $e$~\parencite{Bobillo2016}. The recording of normal and pathological signals us
 separate devices is likely to cause issues and is discussed in
 Section~\ref{Eval}

+%BEGIN NEW MATERIAL
+
 \section{Design}
 This project aims to provide robust heart abnormality detection for PCG
 signals, such that use of the system could reliably recommend further medical
@@ -825,8 +827,8 @@ submitted to the challenge. Results produced by the proposed system will
 generally not be coloured by the differences in quality of segmentation
 algorithms, allowing for more direct comparison of classification methods.
 However, it is noted that despite the high performance of the algorithm, errors
-in segmentation will still occur that may have a negative impact on feature
-quality. As methods proposed by previous literature such as hand correction by
+in segmentation will still occur, which may have a negative impact on feature
+quality. As methods proposed by previous literature, such as hand correction by
 a professional~\parencite[p.2203]{Liu2016}, are not feasible in this context,
 and considering the low number of erroneous results produced by the
 algorithm~\parencite[p.2]{Goda2016}, it was decided that these errors would not
@@ -859,21 +861,22 @@ Features such as:
    \item A selection of envelope-based features for each heart sound
 \end{itemize}

-18 feature provided by the Physionet challenge focused on timings between
+18 features provided by the Physionet challenge focused on timings between
 segments of the heart cycles. It was thought that these features would be
 useful in capturing irregularities caused by conditions such as arrhythmias,
 atrial septal defect and other conditions that are likely to affect relative
-timing of heart sounds, such as Mitral valve prolapse or regurgitation.
+timing of heart sounds.
 Many conditions that can be detected by traditional auscultation are
 characterised by an increase in loudness of the S1 and/or S2 heart
 sounds~\parencite{Brown2008}. This suggests that features relating to human
 perception of loudness may aid in the detection of such conditions. Simple
 envelope-based features such as RMS, peak loudness and the Shannon energy
-envelope (Equation~\ref{ShanEQ}, popular in previous literature, were extracted
-for this reason~\parencite[p.73-77]{Lerch2012}. In addition, statistical
-features such as sample entropy and skewness (Equation ~\ref{SkewEQ}) were used
-to evaluate the distribution of samples for each heart sound, these were
-selected to provide a representation of the temporal ``shape'' of each sound.
+envelope (Equation~\ref{ShanEQ}), which proved popular in previous literature,
+were extracted for this reason~\parencite[pp.73--77]{Lerch2012}. In addition,
+statistical features such as sample entropy and skewness
+(Equation~\ref{SkewEQ}) were used to evaluate the distribution of samples for
+each heart sound; these were selected to provide a representation of the
+temporal ``shape'' of each sound.

 \begin{equation}\label{ShanEQ}
    SE = \frac{-1}{N}\sum\limits_{n=0}^{N-1} x(n)^2\cdot \log{x(n)^2}
@@ -892,12 +895,12 @@ It was recognised that a time domain representation alone was unlikely to
 provide a sufficient representation for discerning a wide variety of
 conditions. Using a time-frequency representation to characterise the spectral
 components of the signal has proven effective in the majority of literature.
-The classic method for producing a spectral representation of a signal is the
-Fourier transform (as defined in Equation~\ref{FFTEQ}) over a sliding window of size
-$N$. By decomposing the signal into a series of sine and cosine
-waves, a representation of the signal across a range of frequency bands is
-produced. This can be used for further analysis of heart sounds
-based on their spectral characteristics.
+The classic method for producing a spectral representation of a signal is to
+apply the Discrete Fourier Transform (DFT) (as defined in Equation~\ref{FFTEQ})
+over a sliding window of size $N$. By decomposing the signal into a series of
+sine and cosine waves, a representation of the signal's spectral content across
+a range of frequency bands is produced. This can be used for further analysis
+of heart sounds based on their spectral characteristics.
 \begin{equation}\label{FFTEQ}
 X(k)=\sum\limits_{n=0}^{N-1}x(n)e^{\frac{-j2\pi kn}{N}}
 \end{equation}
@@ -912,7 +915,7 @@ signal's spectral shape. MFCCs are calculated by first applying $N$ (a
 user-defined parameter) triangular filters, spaced using the mel scale, to
 the magnitude spectrum. Applying a discrete cosine transform to the log of the
 filterbank outputs provides the final set of coefficients (for further details,
-please refer to~\parencite{Lerch2012}). This representation
+please refer to~\parencite{Lerch2012}). This analysis
 creates a perceptually relevant representation of spectral shape, in effect
 mimicking the way in which humans might perceive the spectral shape of heart
 sounds. The reasoning for this is that, as the aim is to provide a system with
@@ -921,7 +924,7 @@ what a human percieves may prove effective at distinguishing conditions in the
 way that a human does. This has been shown to be effective in previous literature,
 with multiple systems utilising perceptual features with
 success~\parencite{Ortiz2016, Rubin2016, Quiceno-Manrique2010a}. 13 MFCCs were
-calculated for each heart sound and averaged per sample to provide 13 features
+calculated for each heart sound and averaged to provide 13 features
 per sample.\\
 %TODO: Generate MFCC spectrum
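+
+For reference, one widely used approximation of the mel scale maps a frequency
+$f$ (in Hz) to a mel value $m$ as shown below. This form is given for
+illustration only: several variants of the mel scale exist, and the mapping
+actually applied depends on the library used to compute the MFCCs.
+\begin{equation}
+   m = 2595\log_{10}\bigg(1+\frac{f}{700}\bigg)
+\end{equation}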

@@ -947,10 +950,10 @@ time-frequency representation to fourier methods. The fundamental concept of
 the wavelet transform is to represent an input signal as a set of scaled and
 shifted finite oscillations. By comparing the signal with each scale of wavelet
 at all points in time, a set of $N\times A$ (where $A$ is the number of scales)
-coefficients are generated that represent the scale and position needed for
+coefficients is generated. These define the scale and position needed for
 each wavelet in order to fully reconstruct the signal (for further details,
-refer to~\parencite{Polikar1994}) The benefit of this transform is that it is
-well localized in both time and frequency. This allows for accurate
+refer to~\parencite{Polikar1994}). The benefit of this transform is that it is
+well localized in both time and frequency domains. This allows for accurate
 representation of transient events such as clicks and snaps that are
 characteristic of heart conditions such as Mitral valve prolapse or
 stenosis~\parencite{Brown2008}.\\
@@ -962,44 +965,44 @@ coefficients to attain a total of 48 features.~\parencite{Homsi2016}

 \subsubsection{Feature Scaling and Imputing}
 A common problem when working with multiple features is the difference in scale
-Dbetween features. This problem can cause many machine learning algorithms to place
+between features. This problem can cause many machine learning algorithms to place
 bias on larger scale features and can significantly impact the time taken for
 certain algorithms to converge. This is particularly significant when applying
 algorithms sensitive to feature scale such as SVMs (described in
 Section~\ref{SVM}). To address this, a Min-Max scaler was applied
 to training and test sets prior to training models. This scales all values to within a
 0--1 range, producing a set of features on a common scale.\\
-It is also common to encounter missing values in features. these can occur as a
+It is also common to encounter missing values in features. These can occur as a
 result of $\log(0)$ or division by 0 calculations, amongst other edge cases. A
 standard method for handling these values is to apply an imputer, replacing
-values with the mean of the feature vector.~\parencite{VanderPlas2017}
+values with the mean of the feature vector~\parencite{VanderPlas2017}.
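+
+As an illustrative sketch (the symbols below are introduced for this example
+only), let $x_{i,j}$ denote the value of feature $j$ for sample $i$. Min-Max
+scaling to the 0--1 range can then be written as:
+\begin{equation}
+   x'_{i,j} = \frac{x_{i,j}-\min_k x_{k,j}}{\max_k x_{k,j}-\min_k x_{k,j}}
+\end{equation}
+Assuming mean imputation is applied per feature, a missing value of feature
+$j$ is replaced by the mean of the observed values of that feature:
+\begin{equation}
+   \hat{x}_{i,j} = \frac{1}{|O_j|}\sum\limits_{k\in O_j} x_{k,j}
+\end{equation}
+where $O_j$ is the set of samples for which feature $j$ is observed.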

 \subsection{Stacking Classifier with Cross-Validation}\label{class}
 The stacking classifier is an ensemble classifier that uses the results of
-multiple base classifiers as input to a 2nd level meta-classifier, used to
-generate a final predicition. $k$-fold cross validation is used accross base
-classifiers, training on $k-1$ folds of input data, and applying to the
-remaining hold out set. The results of these predictions from each base
-classifier are combined and used to train the 2nd level classifier which
-produces the final preditions.\\
-Given it's considerable performance accross a range of tasks, it was expected
-that this classification model could be applied effectively to produce an
-alternative method for abnormality detection than those presented in previous
-literature.
+multiple base classifiers as input to a 2nd level meta-classifier, which in
+turn is used to generate a final prediction. $k$-fold cross-validation is used
+across base classifiers, training on $k-1$ folds of input data, and applying
+to the remaining validation set. The results of these predictions from each
+base classifier are combined and used to train the 2nd level classifier, which
+produces the final predictions based on the probabilities and predictions
+provided.\\
+Given its proven accurate performance across a range of tasks, it was
+expected that this classification model could be applied effectively to produce
+an alternative method for abnormality detection to those presented in
+previous literature.
 % TODO:Insert stacking classifier diagram
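+
+As an illustrative sketch (the notation below is introduced for this example
+and is not taken from the cited literature), let $h_m^{(-k)}$ denote base
+classifier $m$ trained with fold $k$ held out, and let $\kappa(i)$ be the fold
+containing sample $x_i$. With $M$ base classifiers, the input $z_i$ to the
+meta-classifier $h_{\mathrm{meta}}$ and the final prediction $\hat{y}_i$ can be
+written as:
+\begin{equation}
+   z_i = \Big(h_1^{(-\kappa(i))}(x_i),\dots,h_M^{(-\kappa(i))}(x_i)\Big),
+   \qquad \hat{y}_i = h_{\mathrm{meta}}(z_i)
+\end{equation}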

 \subsubsection{Base Classifiers}
 Clearly, an important consideration when using any ensemble method is the
 selection of the base classifiers. In order for any ensemble method to perform
 well, it must be constructed using a selection of classifiers that individually
-provide useful models for the data~\parencite[p.484]{Tobergte2013a}.  The final
-optimized model consisted of 3 base models. A wide variety of models were
-considered for use as base and meta models. These included models such as Tree
-based, $k$-Nearest Neighbor, and AdaBoost classifiers. Selection of these
-models was based on a novel approach using hyperparameter optimization as
-discussed in Section~\ref{optimise}. The following sections detail the final
-selection used; A combination of SVM and Naive-Bayes classifiers, with a
-Logistic Regression meta classifier.
+provide useful models for the data~\parencite[p.484]{Tobergte2013a}. A wide
+variety of models were considered for use as base and meta models, including
+tree-based, $k$-Nearest Neighbor, and AdaBoost classifiers.
+Selection of these models was based on a novel approach using hyperparameter
+optimisation as discussed in Section~\ref{optimise}. The following sections
+detail the 3 final models selected by the optimisation algorithm: a combination
+of SVM and Naive-Bayes classifiers, with a Logistic Regression meta classifier.

 \paragraph{SVM}\label{SVM}
 The SVM classifier aims to fit a hyperplane to data that maximises the
@@ -1009,8 +1012,8 @@ also likely to increase the margin for error in separation of classes. This
 type of classifier is also able to generate hyperplanes in non-linear space,
 using a technique known as the `kernel trick'. This works by mapping the data
 to a higher dimension, allowing non-linearly separable classes to be separated
-by the same method. The details of the SVM and Kernal-SVM are involved and
-outside the scope of this report. Further details can be found
+by the same method. The details of the SVM and kernel SVM are complex and
+outside the scope of this report. Further information can be found
 in~\parencite[p.187]{Tobergte2013a}.\\
 % TODO: Create Hyperplane plot
 SVMs have been prevalent in previous literature, shown to be effective in
@@ -1018,7 +1021,7 @@ separation of a variety of heart conditions~\parencite{Ari2010} The use of
 kernels to map parameters to higher dimensions is a key advantage of this
 model, allowing for non-linear relationships that are likely to be present in
 the large variety of features to be well represented in classification. Choice
-of kernals, and relevant hyperparameters is detailed in Section~\ref{optimise}.
+of kernels and relevant hyperparameters is detailed in Section~\ref{PSOp}.

 \paragraph{Naive-Bayes}
 Commonly used in text classification problems, where there is typically a
@@ -1034,9 +1037,9 @@ calculating the probability of a feature as:
    P(x_i\mid y)=\frac{1}{\sqrt{2\pi
    \sigma_y^2}}\exp\bigg(-\frac{(x_i-\mu_y)^2}{2\sigma^2_y}\bigg)
 \end{equation}
-Where:
-$\mu$ is the mean of the distribution
-$\sigma^2$ is the varaince
+Where:\\
+$\mu$ is the mean of the distribution\\
+$\sigma^2$ is the variance\\
 Using maximum likelihood estimation to estimate $\sigma$ and $\mu$ given the
 feature vector, a classification for new features can then be calculated as:
 \begin{equation}
@@ -1052,7 +1055,7 @@ completely independant allows for extremely fast classification and scalability
 to large datasets, with many dimensions~\parencite[p.300]{Zhang2004}. It was
 thought that these benefits would make the classifier suitable for the proposed system, as the relatively high
 dimensionality of features and quantity of datapoints could then be classified
-quickly to obtain initial results. Despite the inclussion of more complex
+quickly to obtain initial results. Despite the inclusion of more complex
 models among the candidates, this classifier was selected automatically for the final model.
 Refer to Section~\ref{PSOp} for further details.

@@ -1082,25 +1085,27 @@ $\lambda$ is the regularization parameter used to help prevent overfitting\\
 By minimizing the cost function, classification predictions can then be made
 using the hypothesis function~\parencite{Ng2012}.\\
 Logistic regression was chosen as the meta-classifier primarily due to its
-simplicity and performance in testing. Choice of meta-classifier is a potential
-area for improvement and it is noted that a range of meta-classifiers have been
-proposed for different tasks that utilise
+simplicity and performance in testing. It is thought that this algorithm
+performed particularly well as output from base classifiers was linearly
+separable and relatively simple (in comparison to input features). The choice of
+meta-classifier is a potential area for improvement and it is noted that a
+range of meta-classifiers have been proposed for different tasks that utilise
 stacking~\parencite[p.29]{Sesmero2015}. Further work in this area could
 potentially provide improved results.

 % TODO: Replace this section
 % \subsubsection{Signal quality classification}\label{Quality}

-\subsection{Model Optimization}\label{optimise}
-As discussed in previous section, two of the most important aspects that affect
+\subsection{Model Optimisation}\label{optimise}
+As discussed in previous sections, two of the most important aspects that affect
 the performance of a classification system are its models and the input
 features. A combination of relevant features and well-tuned models is therefore
 likely to provide an accurate classification system. However, it is not always
-immdiately clear which values to choose for parameters, or features to use as
-input. This is especially true when given such a wide selection of models to
-choose from, and high such dimensional feature spaces, as are used in the
-proposed method. To address this issue, two automatic optimisation approaches
-were implemented with the aim of maximising the accuracy of the proposed
+immediately clear which values to choose for parameters, or which features to use as
+inputs. This is especially true when given such a wide selection of models to
+choose from, and such high-dimensional feature spaces, as are used in the
+proposed system. To address this issue, two automatic optimisation approaches
+were implemented, with the aim of maximising the accuracy of the proposed
 system. 

 \subsubsection{Sequential Feature Selection}\label{SFS}
@@ -1110,16 +1115,17 @@ There are two commonly used methods for addressing this problem: feature
 reduction and feature selection. Feature reduction involves reducing features
 to a lower dimensionality using techniques such as PCA. Conversely, feature
 selection involves selectively removing features entirely via methods such as
-Sequential Floating Selection (SFFS). Both aim to reduce the amount of redundant
-information in features by removing or reducing features that are expected not
-to benefit the model. As a selection of models were to be used, each
+Sequential Floating Forward Selection (SFFS). Both aim to reduce the amount of
+redundant information in features by removing or reducing features that are not
+expected to benefit the model. As a selection of models were to be used, each
 potentially handling dimensionality differently (SVMs in particular), it was
-decided that feature selection would be most appropriate for this application.\\
+decided that feature selection would be most appropriate for this
+application.\\

 Through experimentation, the chosen method was SFFS. This method is an adaptation
 of traditional sequential forward selection that also uses sequential backward
 selection to allow for subsequent removal of added features when necessary.
-SFFS is an iterative wrapper method that adds features and retrains the chosen
+SFFS is an iterative wrapper method that adds features and re-trains the chosen
 model sequentially, choosing features that increase the accuracy of the model
 output (using 3-fold cross validation to avoid overfitting). Final models used
 as few as 40 features, increasing both accuracy of classifications and
@@ -1127,7 +1133,7 @@ computation time of models significantly. For further details on SFFS please
 refer to~\parencite[p.3]{Ferri1994}.

 \subsubsection{Particle Swarm Hyperparameter Optimisation}\label{PSOp}
-The particle swarm optimization algorithm is an iterative meta-heuristic algorithm that
+The particle swarm optimisation algorithm is an iterative meta-heuristic algorithm that
 aims to find the set of parameters that maximises a given function. Given an
 $n$-dimensional parameter space, the algorithm randomly initialises sets of
 `particles' representing random combinations of parameters. As the algorithm
@@ -1140,7 +1146,7 @@ algorithm is shown in code block~\ref{PSCode}~\parencite{Clerc2002}

 \onehalfspacing
 \begin{lstlisting}[escapeinside={(*}{*)}, label={PSCode}, caption={Particle
-Swarm Optimization Pseudocode}]
+Swarm Optimisation Pseudocode}]
 Do
    //For all particles...
    For (*$i$*)=1 to Population Size
@@ -1164,9 +1170,9 @@ Until termination criterion is met
 \doublespacing
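+
+For reference, the particle update at the core of this algorithm can be
+sketched in the constriction form commonly associated
+with~\parencite{Clerc2002}. The notation is introduced here for illustration,
+and the exact variant applied by a given optimisation library may differ:
+\begin{equation}
+   v_i \leftarrow \chi\big[v_i + \varphi_1 r_1 (p_i - x_i)
+       + \varphi_2 r_2 (g - x_i)\big], \qquad x_i \leftarrow x_i + v_i
+\end{equation}
+where $x_i$ and $v_i$ are the position and velocity of particle $i$, $p_i$ is
+the best position found so far by that particle, $g$ is the best position
+found by the swarm, $r_1$ and $r_2$ are random numbers drawn uniformly from
+$[0,1]$, $\varphi_1$ and $\varphi_2$ are acceleration coefficients and $\chi$
+is the constriction coefficient.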

 The use of this algorithm allowed for the efficient optimisation of all parameters
-relating to the stacking classifier, and it's base classifiers, resulting in a
+relating to the stacking classifier and its base classifiers, resulting in a
 finely tuned classification model that would not have been producible using
-traditional trial and error methods.\\
+traditional trial-and-error methods to search for optimal parameters.\\
 During the initial design phase, it was found that the abundance of machine
 learning algorithms available makes selection of the optimal model difficult,
 requiring in-depth knowledge of a range of machine learning techniques. A novel
@@ -1184,7 +1190,7 @@ overall success of the agorithm.
 In order to fully understand the performance of the system (and to evaluate the
 impact of design decisions throughout development), a group of scoring methods
 were implemented to test the system's performance in a selection of scenarios.
-The aim was to provide reliable metrics that would highlight the systems
+The aim was to provide reliable metrics that would highlight the system's
 strengths and weaknesses and to provide quantifiable measures with which to
 compare the system to the range of alternative methods proposed in the
 literature.\\
@@ -1194,7 +1200,7 @@ separate hold-out dataset. By reserving a selection of samples from accross the
 databases, a trained model could then be scored on this dataset for accuracy, sensitivity and
 specificity (metrics described in Section~\ref{ChallengeEnt}) to determine the
 system's performance on an unseen set of samples. This method is widely used to
-provide a basic understanding of a model's ability to generalise to new data, A
+provide a basic understanding of a model's ability to generalise to new data, a
 crucial requirement of the system. Data was split using a grouped stratified shuffle
 split, grouping by database. This ensured an equal number of randomly selected
 classes were taken from each database to produce training and test sets. This
@@ -1211,12 +1217,14 @@ the full dataset into multiple folds, and training models on each, metrics can
 be calculated on each fold, and an average can be taken to provide a measure of
 the system's performance over all folds. 10-fold cross validation, stratified
 by class, was chosen for evaluation of the system. This provides an insight
-into the performance of the algorithm accross the dataset.\\
-It is highlighted by Homsi et.\ al, that a large amount of variance may be observed
-accross folds~\parencite[p.1637]{Homsi2017}. Homsi et.\ al attribute this to the
+into the performance of the algorithm across the dataset. This method was
+used by all participants of the Physionet challenge and is commonly found
+in prior literature.\\
+It is highlighted by Homsi et al.\ that a large amount of variance may be
+observed across folds~\parencite[p.1637]{Homsi2017}. This is attributed to the
 variations across databases, making generalisation difficult. To account for
 this, it is suggested that cross-validation is repeated multiple times and
-average to provide a more accurate measurement of performance accross folds.
+averaged to provide a more accurate measurement of performance across folds.
 For the proposed system, cross validation was repeated 10 times for each fold
 and averaged to produce the final results. Standard deviation is also
 calculated across these iterations to illustrate the possible prevalence of
@@ -1228,7 +1236,9 @@ accuracies in standard cross-validation, but performed significantly worse when
 testing on unseen databases~\parencite{Homsi2017, Bobillo2016}. For this
 reason, leave-one-out cross-validation was used to form a better understanding
 of the system's ability to generalise to unseen data from different sources. On
-each fold, a single database is removed, training on all other databases.\\
+each fold, a single database is removed, training on all other databases. This
+is a useful method as it can be used to determine the level to which
+information extracted from databases is representative of other databases.\\

 The evaluation of models using cross-validation was not limited to final
 evaluation. Evaluation of intermediate models generated by both the SFFS and Particle Swarm
@@ -1242,39 +1252,75 @@ Discussion on the performance of the proposed system using these methods can be
 found in Section~\ref{Eval}.
 % TODO: Insert cross validation diagram from data science handbook

+% END NEW MATERIAL
 \section{Implementation}
 This section describes the tools used in the realisation of the
-proposed system and the practical issues encountered throught the
-implementation process. Rationale is given for decisions made throughout
+proposed system, the practical issues encountered throughout the
+implementation process and the development strategy taken to address and avoid
+such issues. Rationale is given for decisions made throughout
 production of the proposed system and any issues with the current implementation are
 outlined.

-\subsection{System Structure}
-From the outset, the project aimed to 
+\subsection{Development Strategy}
+Early in the design process it became apparent that in order for this project
+to produce reasonable results, it would need to utilise a number of complex
+algorithms to handle the various non-trivial problems that were encountered
+throughout development. Python was chosen as the most suitable language for the
+implementation of the system. High-level language features such as dynamic
+types and automatic garbage collection, combined with the large variety of
+readily available packages and libraries, make the language a good choice for
+the fast, flexible development approach taken throughout this project.\\

-focus on using open source libraries throughout the project to avoid
-`reinventing the wheel'. Integration of external libraries
-Use of Python - quick development, wide variet of third party libraries to
-allow for rapid prototyping
+The most significant objective from the outset of the project was to provide a
+system that could classify pathological signals with a degree of accuracy that
+was comparable to the current state of research in the field of PCG analysis.
+Given this focus and that the performance of the final product was initially
+unknown, it was recognised that the design of the project would need to adapt
+as the project progressed, implementing and testing high-level concepts to
+iteratively improve performance of the system. For this reason a high-level
+view of production was taken, choosing to focus on the overall system
+architecture, rather than spending great amounts of time on any one specific
+element of the project (as any component of the project could be
+removed/replaced entirely, if this facilitated the improvement of results).\\

+With this design ethos in mind, it was decided that external packages and
+libraries would be used/adapted wherever necessary to avoid spending large
+amounts of time developing proprietary implementations of proven concepts. By
+not `reinventing the wheel', it would be possible to prototype and
+evaluate high-level concepts, such as the variety of machine learning and
+optimisation algorithms detailed in previous sections, quickly and effectively.
+Due to Python's active developer community, a wide range of scientific computing
+and machine learning packages are available, such as
+NumPy~\parencite{VanDerWalt2011}, SciPy~\parencite{Millman2011} and
+Scikit-Learn~\parencite{Pedregosa2011}, each of which was used extensively
+throughout the project alongside other packages detailed in the following
+sections.

-Interface
+\subsection{System overview}
+The proposed system can be broken down into a number of key components, each of
+which performs a specific task, interacting with other components to produce
+the final result. The main components are the user interface, feature
+generation module, classification and optimisation module and evaluation
+module. Implementation of each is detailed in the following sections.
+
+\subsubsection{User interface}
 - Implementation of simple CLI for quick control of system parameters
 - High computational cost - Multiprocessing, logging issues
-Data Manipulation
- Pandas and Numpy for basic handeling and manipulation of data
- Splitting of data using sklearn
-Implementation of features
+\subsubsection{Feature extraction}
+- Data Manipulation
+- - Pandas and NumPy for basic handling and manipulation of data
+- - Splitting of data using sklearn
 - Joining of existing segmentation script and python code
 - pyWavelets for wavelet features
 - librosa for MFCCs
-Implementation of machine learning classifiers
+\subsubsection{Classification model generation}
 - Use of sklearn for base classifiers, use of pipelines
 - Addition of stacking classifier using mlxtend - use of probabilities
 - Saving of features and models to pickles, allowing for direct running of
 intermediate section of system and for development and portability of generated models
 Implementation of optimisations
- Optunity for Hyperparameter optimization
+\paragraph{Model optimisation}
+- Optunity for Hyperparameter optimisation
 - Mlxtend for SFS