Writing ML design

2017-08-21 13:55:40 +01:00
parent bdd4e52081
commit 364c6274e6
1 changed files with 122 additions and 13 deletions
@@ -33,6 +33,8 @@
 \graphicspath{{./resources/}}
 \addbibresource{~/Documents/library.bib}

+\DeclareMathOperator*{\argmax}{arg\,max}
+\DeclareMathOperator*{\argmin}{arg\,min}
 % Fix for Mendeley's rubbish underscore handling in generated bib files
 \DeclareSourcemap{
    \maps{
@@ -487,7 +489,7 @@ Reed et.~al \citeyearpar{Reed2004}             & ---
 \end{landscape}
 \restoregeometry

-\subsubsection{Physionet challenge entries}
+\subsubsection{Physionet challenge entries}\label{ChallengeEnt}
 \doublespacing
 The 2016 Physionet/CinC Challenge aimed to encourage development of heart
 abnormality detection algorithms by providing a large open database of PCG
@@ -943,26 +945,122 @@ stenosis~\parencite{Brown2008}.\\
 For the proposed system, a 5-level DWT using a Daubechies-4 mother wavelet was
 used for decomposition and reconstruction. Statistical features such as entropy
 were then calculated, both on the reconstructed signal and directly on
-coefficients to attain a total of 48 features.
+coefficients to attain a total of 48 features~\parencite{Homsi2016}.
 % TODO: Insert wavelet diagram here
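The decompose-then-summarise idea above can be sketched in plain Python. This is only an illustration under stated assumptions: it swaps in the much simpler Haar wavelet for the Daubechies-4 wavelet named in the text, computes a single statistic (Shannon entropy of normalised coefficient energies) rather than the full 48 features, and all function names (`haar_dwt`, `wavedec`, `shannon_entropy`) are hypothetical.

```python
import math

def haar_dwt(signal):
    """One level of the Haar DWT: returns (approximation, detail) coefficients."""
    signal = list(signal)
    if len(signal) % 2:
        signal.append(signal[-1])  # pad odd-length input by repeating the last sample
    s = math.sqrt(2.0)
    approx = [(signal[i] + signal[i + 1]) / s for i in range(0, len(signal), 2)]
    detail = [(signal[i] - signal[i + 1]) / s for i in range(0, len(signal), 2)]
    return approx, detail

def wavedec(signal, levels):
    """Multi-level decomposition, returned coarsest first: [cA_n, cD_n, ..., cD_1]."""
    coeffs = []
    approx = list(signal)
    for _ in range(levels):
        approx, detail = haar_dwt(approx)
        coeffs.insert(0, detail)
    coeffs.insert(0, approx)
    return coeffs

def shannon_entropy(coeffs):
    """Shannon entropy (in bits) of the normalised coefficient energies."""
    energy = [c * c for c in coeffs]
    total = sum(energy)
    if total == 0:
        return 0.0
    return -sum((e / total) * math.log2(e / total) for e in energy if e > 0)

# Toy signal standing in for a PCG segment (assumption: 64 samples, 5 levels).
sig = [math.sin(0.3 * i) for i in range(64)]
bands = wavedec(sig, 5)
features = [shannon_entropy(band) for band in bands]  # one entropy value per sub-band
```

A 5-level decomposition yields six sub-bands (one approximation and five detail bands), so this sketch produces six entropy features.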

 \subsubsection{Feature Scaling and Imputing}
-particularly when using methods
-that are sensitive to such as SVMs described in section
+A common problem when working with multiple features is the difference in scale
+between features. This can cause many machine learning algorithms to place
+bias on larger-scale features and can significantly impact the time taken for
+certain algorithms to converge. This is particularly significant when applying
+algorithms sensitive to feature scale such as SVMs (described in
+Section~\ref{SVM}). To address this, a Min-Max scaler was applied
+to training and test sets prior to training models. This scales all values to within a
+0--1 range, producing a set of features on a common scale.\\
+It is also common to encounter missing values in features. These can occur as a
+result of $\log(0)$ or division by 0 calculations, amongst other edge cases. A
+standard method for handling these values is to apply an imputer, replacing
+missing values with the mean of the feature vector~\parencite{VanderPlas2017}.
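The two preprocessing steps described above can be sketched as follows. This is a minimal plain-Python illustration, not the report's implementation. It assumes that missing values are represented as NaN and that the scaling range is learned from the training data and then reused on the test data; the function names are hypothetical.

```python
import math

def impute_mean(column):
    """Replace NaN entries with the mean of the observed values in the column."""
    observed = [v for v in column if not math.isnan(v)]
    mean = sum(observed) / len(observed)
    return [mean if math.isnan(v) else v for v in column]

def fit_min_max(column):
    """Learn the scaling range from a (training) feature column."""
    return min(column), max(column)

def apply_min_max(column, lo, hi):
    """Map values so that lo -> 0 and hi -> 1 (constant columns map to 0)."""
    span = hi - lo
    if span == 0:
        return [0.0 for _ in column]
    return [(v - lo) / span for v in column]

# Toy feature column with one missing value (assumed data).
train = impute_mean([2.0, float("nan"), 4.0, 6.0])
lo, hi = fit_min_max(train)
train_scaled = apply_min_max(train, lo, hi)
test_scaled = apply_min_max([8.0], lo, hi)  # unseen values may fall outside 0--1
```

Reusing the training range on the test set keeps both sets on the same scale, at the cost that test values outside the training range are not confined to 0--1.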

 \subsection{Stacking Classifier with Cross-Validation}\label{class}
-This meta-learning approach
-has shown significantly success, with robust performance across a variety of classification
-tasks~\parencite[p.498]{Tobergte2013a}.For this reason it was chosen
+The stacking classifier is an ensemble classifier that uses the results of
+multiple base classifiers as input to a second-level meta-classifier, which
+generates a final prediction. $k$-fold cross-validation is used across base
+classifiers, training on $k-1$ folds of input data and applying the trained
+model to the remaining hold-out fold. The predictions from each base
+classifier are combined and used to train the second-level classifier, which
+produces the final predictions.\\
+Given its considerable performance across a range of tasks, it was expected
+that this classification model could be applied effectively to produce an
+alternative method for abnormality detection to those presented in previous
+literature.
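The stacking procedure described above can be sketched in plain Python. This is an illustration only: the base learners here are toy nearest-class-mean classifiers on a single feature and the meta-classifier is a simple vote table, both stand-ins for the real models, and every name in the block is hypothetical.

```python
from collections import Counter, defaultdict

def k_fold_indices(n_samples, k):
    """Split range(n_samples) into k contiguous folds of near-equal size."""
    folds, start = [], 0
    for i in range(k):
        size = n_samples // k + (1 if i < n_samples % k else 0)
        folds.append(list(range(start, start + size)))
        start += size
    return folds

def out_of_fold_predictions(fit, X, y, k):
    """Train on k-1 folds and predict the held-out fold, for every fold in turn."""
    preds = [None] * len(X)
    for held_out in k_fold_indices(len(X), k):
        held = set(held_out)
        train_X = [x for i, x in enumerate(X) if i not in held]
        train_y = [t for i, t in enumerate(y) if i not in held]
        model = fit(train_X, train_y)
        for i in held_out:
            preds[i] = model(X[i])
    return preds

def make_centroid_fitter(feature):
    """Toy base learner: nearest class mean on one feature."""
    def fit(X, y):
        sums, counts = defaultdict(float), Counter()
        for x, t in zip(X, y):
            sums[t] += x[feature]
            counts[t] += 1
        means = {t: sums[t] / counts[t] for t in counts}
        return lambda x: min(means, key=lambda t: abs(x[feature] - means[t]))
    return fit

def fit_vote_table(meta_X, y):
    """Toy meta-classifier: most common label seen for each base-prediction tuple."""
    votes = defaultdict(Counter)
    for row, t in zip(meta_X, y):
        votes[tuple(row)][t] += 1
    default = Counter(y).most_common(1)[0][0]
    def predict(row):
        key = tuple(row)
        return votes[key].most_common(1)[0][0] if key in votes else default
    return predict

def fit_stacking(base_fitters, fit_meta, X, y, k=5):
    """Train the meta-classifier on out-of-fold base predictions."""
    columns = [out_of_fold_predictions(f, X, y, k) for f in base_fitters]
    meta_X = [list(row) for row in zip(*columns)]
    meta = fit_meta(meta_X, y)
    bases = [f(X, y) for f in base_fitters]  # refit each base model on all data
    return lambda x: meta([b(x) for b in bases])

# Toy two-feature data with interleaved classes (assumed data).
X = [[0.0, 5.0], [1.0, 0.2], [0.1, 4.8], [0.9, 0.1], [0.2, 5.1],
     [1.1, 0.0], [0.0, 4.9], [1.0, 0.3], [0.1, 5.2], [0.8, 0.1]]
y = [0, 1, 0, 1, 0, 1, 0, 1, 0, 1]
stack = fit_stacking([make_centroid_fitter(0), make_centroid_fitter(1)],
                     fit_vote_table, X, y, k=5)
```

Because the meta-classifier only ever sees predictions made on data the base models were not trained on, it is less likely to learn from overfitted base outputs.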
 % TODO:Insert stacking classifier diagram

 \subsection{Base Classifiers}
+Clearly, an important consideration when using any ensemble method is the
+selection of the base classifiers. In order for any ensemble method to perform
+well, it must be constructed using a selection of classifiers that individually
+provide useful models for the data~\parencite[p.484]{Tobergte2013a}.  The final
+optimized model consisted of 3 base models. A wide variety of models were
+considered for use as base and meta models. These included tree-based,
+$k$-Nearest Neighbor, and AdaBoost classifiers. Selection of these
+models was based on a novel approach using hyperparameter optimization as
+discussed in Section~\ref{optimise}. The following sections detail the final
+selection used: a combination of SVM and Naive-Bayes classifiers, with a
+Logistic Regression meta classifier.

-\subsubsection{SVM}
-
-\subsubsection{Logistic Regression}
+\subsubsection{SVM}\label{SVM}
+The SVM classifier aims to fit a hyperplane to data that maximises the
+separability between classes. This results in a model that has been shown to
+generalise well in many cases, as maximising separability between classes is
+also likely to increase the margin for error in separation of classes. This
+type of classifier is also able to generate non-linear decision boundaries,
+using a technique known as the `kernel trick'. This works by implicitly mapping
+data to a higher-dimensional space, allowing classes that are not linearly
+separable to be separated by the same method. The details of the SVM and
+kernel SVM are involved and
+outside the scope of this report. Further details can be found
+in~\parencite[p.187]{Tobergte2013a}.\\
+% TODO: Create Hyperplane plot
+SVMs have been prevalent in previous literature, shown to be effective in
+separation of a variety of heart conditions~\parencite{Ari2010}. The use of
+kernels to map features to higher dimensions is a key advantage of this
+model, allowing non-linear relationships that are likely to be present in
+the large variety of features to be well represented in classification. Choice
+of kernels and relevant hyperparameters is detailed in Section~\ref{optimise}.
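The idea of mapping data to a higher dimension can be illustrated with a small plain-Python sketch. It assumes a homogeneous degree-2 polynomial kernel on two features, chosen purely for illustration (the kernel actually used in the system is selected later by optimisation), and all names are hypothetical. It shows that the kernel value equals an inner product in a 3-dimensional mapped space, and that two concentric rings, which no straight line separates in 2D, are separated by a plane after mapping.

```python
import math

def poly2_kernel(x, z):
    """Homogeneous degree-2 polynomial kernel: k(x, z) = (x . z)^2."""
    return (x[0] * z[0] + x[1] * z[1]) ** 2

def phi(x):
    """Explicit feature map whose inner product equals poly2_kernel."""
    return (x[0] ** 2, math.sqrt(2.0) * x[0] * x[1], x[1] ** 2)

def dot(a, b):
    return sum(p * q for p, q in zip(a, b))

def ring(radius, n=8):
    """n points evenly spaced on a circle of the given radius."""
    return [(radius * math.cos(2 * math.pi * i / n),
             radius * math.sin(2 * math.pi * i / n)) for i in range(n)]

inner, outer = ring(1.0), ring(2.0)

# In the mapped space the first and third coordinates sum to the squared radius,
# so the plane w . phi(x) + b = 0 with w = (1, 0, 1), b = -2.5 splits the rings.
w, b = (1.0, 0.0, 1.0), -2.5
inner_side = [dot(w, phi(p)) + b for p in inner]  # all negative
outer_side = [dot(w, phi(p)) + b for p in outer]  # all positive
```

The kernel trick relies on the first property: an SVM only needs inner products, so it can work with `poly2_kernel` directly and never has to compute `phi` explicitly.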

 \subsubsection{Naive-Bayes}
+Commonly used in text classification problems, where there is typically a
+high-dimensional feature space, Naive Bayes classification uses Bayes' rule to
+determine the probability of a class $y$ given a vector of features
+$x_1,\ldots,x_n$. This is calculated as:
+\begin{equation}
+    P(y\mid x_1,\ldots,x_n)=\frac{P(y)\prod\limits_{i=1}^{n}P(x_i\mid y)}{P(x_1,\ldots,x_n)}
+\end{equation}
+The implementation used assumes a Gaussian distribution for all features,
+calculating the likelihood of a feature as:
+\begin{equation}
+    P(x_i\mid y)=\frac{1}{\sqrt{2\pi
+    \sigma_y^2}}\exp\bigg(-\frac{(x_i-\mu_y)^2}{2\sigma^2_y}\bigg)
+\end{equation}
+Where:\\
+$\mu_y$ is the mean of feature $x_i$ for class $y$\\
+$\sigma_y^2$ is the variance of feature $x_i$ for class $y$\\
+Using maximum likelihood estimation to estimate $\sigma_y$ and $\mu_y$ from the
+training data, a classification for new features can then be calculated as:
+\begin{equation}
+    \hat{y}=\argmax\limits_y P(y)\prod\limits_{i=1}^nP(x_i\mid y)
+\end{equation}
+Where:\\
+$x$ is the feature vector to be classified\\
+$\hat{y}$ is the estimated classification\\
+
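The Gaussian Naive Bayes equations above can be sketched in plain Python. This is a minimal illustration rather than the implementation used in the system. It assumes log-probabilities are summed instead of multiplying probabilities (which gives the same argmax but avoids numerical underflow) and adds a small variance floor; the function names and toy data are hypothetical.

```python
import math
from collections import defaultdict

def fit_gaussian_nb(X, y):
    """Maximum-likelihood estimates of the prior, means and variances per class."""
    by_class = defaultdict(list)
    for row, label in zip(X, y):
        by_class[label].append(row)
    model = {}
    for label, rows in by_class.items():
        n = len(rows)
        columns = list(zip(*rows))
        means = [sum(col) / n for col in columns]
        # The MLE variance divides by n; the floor avoids division by zero.
        variances = [max(sum((v - m) ** 2 for v in col) / n, 1e-9)
                     for col, m in zip(columns, means)]
        model[label] = (n / len(X), means, variances)
    return model

def log_gaussian(x, mean, var):
    """Log of the Gaussian density with the given mean and variance."""
    return -0.5 * math.log(2.0 * math.pi * var) - (x - mean) ** 2 / (2.0 * var)

def predict(model, x):
    """argmax over classes of log P(y) + sum_i log P(x_i | y)."""
    def score(label):
        prior, means, variances = model[label]
        return math.log(prior) + sum(
            log_gaussian(v, m, s) for v, m, s in zip(x, means, variances))
    return max(model, key=score)

# Toy two-feature training set (assumed data).
X = [[1.0, 2.0], [1.2, 1.8], [0.8, 2.2], [3.0, 0.5], [3.2, 0.4], [2.8, 0.6]]
y = ["normal"] * 3 + ["abnormal"] * 3
model = fit_gaussian_nb(X, y)
```

The denominator $P(x_1,\ldots,x_n)$ is omitted because it is the same for every class and so does not change which class attains the maximum.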
+Despite their computational simplicity, Naive Bayes classifiers have been shown
+to produce highly accurate classification models. The assumption that each feature is
+independent of the others given the class allows for extremely fast classification and scalability
+to large datasets with many dimensions~\parencite[p.300]{Zhang2004}. It was
+thought that these benefits would make the classifier suitable for the proposed system, as the relatively high
+dimensionality of features and quantity of datapoints could then be classified
+quickly to obtain initial results. Despite the inclusion of more complex
+models, this model remained one of the selected base classifiers for the final
+model.
+
+\subsubsection{Logistic Regression}
+Logistic regression is a regression model that aims to fit a hyperplane to
+data points by minimizing a cost function using weighted features.
+By applying weights to feature vectors then applying a sigmoid function, a
+hypothesis function is defined as:
+\begin{equation}
+    h_\theta(x)=\frac{1}{1+e^{-\theta^{T}x}}
+\end{equation}
+Where:\\
+$x$ is a feature vector\\
+$\theta$ is a weight vector\\
+A cost function can then be defined as:
+\begin{equation}
+    J(\theta)=\frac{1}{2m}\sum\limits_{i=1}^{m}(h_\theta(x^{(i)})-y^{(i)})^2
+\end{equation}
+with the weights chosen as $\argmin_\theta J(\theta)$.
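The hypothesis and cost function can be sketched in plain Python as follows. This is an illustration under stated assumptions: it uses the standard sigmoid $1/(1+e^{-z})$, the squared-error form of the cost with no further terms, and plain gradient descent as the minimiser, which is an assumed choice since the text does not name one. All names and the toy data are hypothetical.

```python
import math

def hypothesis(theta, x):
    """Sigmoid of the weighted sum theta^T x."""
    z = sum(t * v for t, v in zip(theta, x))
    return 1.0 / (1.0 + math.exp(-z))

def cost(theta, X, y):
    """Squared-error cost J(theta) = 1/(2m) * sum_i (h(x_i) - y_i)^2."""
    m = len(X)
    return sum((hypothesis(theta, x) - t) ** 2 for x, t in zip(X, y)) / (2.0 * m)

def gradient_step(theta, X, y, rate):
    """One gradient-descent update of theta on the cost above."""
    m = len(X)
    grad = [0.0] * len(theta)
    for x, t in zip(X, y):
        h = hypothesis(theta, x)
        common = (h - t) * h * (1.0 - h)  # chain rule through the sigmoid
        for j, v in enumerate(x):
            grad[j] += common * v / m
    return [tj - rate * g for tj, g in zip(theta, grad)]

# Toy data: a constant bias input of 1.0 plus one feature (assumed data).
X = [[1.0, -2.0], [1.0, -1.0], [1.0, 1.0], [1.0, 2.0]]
y = [0, 0, 1, 1]
theta = [0.0, 0.0]
initial_cost = cost(theta, X, y)
for _ in range(200):
    theta = gradient_step(theta, X, y, rate=1.0)
final_cost = cost(theta, X, y)
```

With all weights at zero every prediction is 0.5, and the cost falls as the weight on the informative feature grows.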
+
+

 % TODO: Replace this section
 % \subsubsection{Signal quality classification}\label{Quality}
@@ -974,10 +1072,14 @@ A wrapper method

 \subsubsection{Particle Swarm Hyperparameter Optimisation}
 Would ideally be placed inside feature selection
+Given the abundance of
+machine learning algorithms readily available, it can be difficult to select
+the best model quickly, with 


-\subsection{Model Performance Metrics}\label{metrics}
+\subsection{Model Performance Evaluation}\label{metrics}
 % TODO: Insert cross validation diagram from data science handbook
+~\ref{ChallengeEnt}
 Group cross-validation
 $k$-fold cross validation

@@ -988,6 +1090,7 @@ focus on using open source libraries throughout the project to avoid
 Use of Python - quick development, wide variety of third-party libraries to
 allow for rapid prototyping

+
 Interface
 - Implementation of simple CLI for quick control of system parameters
 - High computational cost - Multiprocessing, logging issues
@@ -999,8 +1102,8 @@ Implementation of features
 - pyWavelets for wavelet features
 - librosa for MFCCs
 Implementation of machine learning classifiers
- Use of sklearn for base classifiers
- Addition of stacking classifier using mlxtend
+- Use of sklearn for base classifiers, use of pipelines
+- Addition of stacking classifier using mlxtend - use of probabilities
 - Saving of features and models to pickles, allowing for direct running of
 intermediate section of system and for development and portability of generated models
 Implementation of optimisatons
@@ -1014,6 +1117,12 @@ Implementation of optimisatons
 Weighted specificity and weighted Accuracy measures
 Computational cost was not considered, unlike other entries to the physionet
 challenge
+Could be used as cloud based system
+Features were selected for their individual relevance to classification
+problem, Naive Bayes treats features individually. Could explain why it
+performed well
+Relationships between features likely with features such as wavelets, perhaps
+captured by SVMs
 \section{Further Work}\label{FurtherWork}
 Handle silent sections of audio such as those highlighted by Goda et.\
 al~\parencite{Goda2016}