Final proof changes

Sam Perry
2017-02-12 21:33:03 +00:00
parent 604132cb0b
commit f98231f9a2
Executable → Regular
+472 -430
@@ -1,430 +1,472 @@
\documentclass[titlepage]{scrartcl}
\usepackage{enumitem}
\usepackage[british]{babel}
\usepackage[style=apa, backend=biber]{biblatex}
\DeclareLanguageMapping{british}{british-apa}
\usepackage{url}
\usepackage{float}
\restylefloat{table}
\usepackage{perpage}
\MakePerPage{footnote}
\usepackage{abstract}
\usepackage{graphicx}
% Create hyperlinks in bibliography
\usepackage{hyperref}
\renewcommand{\familydefault}{\sfdefault}
\usepackage{fontspec}
\setmainfont{Arial}
\usepackage{blindtext}
\setkomafont{disposition}{\normalfont\fontsize{12}{17}\bfseries}
\setkomafont{section}{\normalfont\fontsize{12}{17}\bfseries}
\setkomafont{subsection}{\normalfont\fontsize{12}{17}\itshape}
\setkomafont{subsubsection}{\normalfont\fontsize{12}{17}\itshape}
\graphicspath{{./resources/}}
\addbibresource{~/Documents/library.bib}
\usepackage[affil-it]{authblk}
% \usepackage{etoolbox}
% \makeatletter
% \expandafter\patchcmd\csname\string\maketitle\endcsname
% {\vskip\z@\@plus3fill}
% {\vskip\z@\@plus2fill\box\abstractbox\vskip\z@\@plus1fill}
% {}{}
% \makeatother
%
\DeclareCiteCommand{\citeyearpar}
{}
{\mkbibparens{\bibhyperref{\printdate}}}
{\multicitedelim}
{}
\begin{document}
\title{Descriptor Driven Concatenative Synthesis Tool for Python}
% \subtitle{\LARGE{Abstract Draft}}
\author{S. Perry\thanks{E-mail: \texttt{\href{mailto:u1265119@unimail.hud.ac.uk}{u1265119@unimail.hud.ac.uk}}}}
\date{Dated: \today}
\maketitle
\begin{abstract}
A command-line tool and Python framework are proposed for the exploration of
a new form of audio synthesis known as ``concatenative synthesis'': a form
of synthesis that uses perceptual audio analyses to arrange small segments
of audio based on their characteristics. The tool is designed to
synthesise representations of an input sound using a database of source
sounds. This involves the segmentation and analysis of both the input sound
and database, matching of input segments to their closest segment from the
database, and the re-synthesis of the closest matches to produce the final
result. The project aims to provide a tool capable of generating high
quality sonic representations of an input, to present a variety of examples
that demonstrate the breadth of possibilities that this style of synthesis
has to offer and to provide a robust framework on which concatenative
synthesis projects can be developed easily.\\
Results demonstrate the wide variety of sounds that can be produced using
this method of synthesis. A number of technical issues are outlined that
impeded the overall quality of results and efficiency of the software.
However, the project clearly demonstrates the strong potential for this
type of synthesis to be used for creative purposes.
\end{abstract}
\section*{Background}
The concept of constructing a new sound by arranging collections of smaller
sounds has gained popularity in the past 30 years through the introduction
of ``Granular Synthesis''. Granular synthesis works on the theory that any
sound can be described through the arrangement of smaller samples (referred
to as ``grains''). This representation of sound allows for the temporal
decomposition and re-arranging of real-world samples, with the potential to
create new ``complex, dynamically-evolving
sounds''~\parencite[p.1]{Roads1988}.\\
Concatenative synthesis (CS) is a form of synthesis that has developed
significantly over the past 15 years, driven by recent advancements in
technology. Key advancements have been in easy access to large databases of
audio and the development of methods for extracting useful information from
these databases automatically~\parencite[p.1]{Schwarz2006a}. CS utilises
these technologies to provide a content-based extension to granular
synthesis; by analysing a database of source grains, grains can be
differentiated based on their characteristics. These characteristics can
then be used for grain selection in the process of synthesizing output for
a wide range of applications~\parencite[p.102]{Schwarz2007}.
\section*{Related Works}
A number of programs utilize CS to achieve various goals. The process has
been used for applications in areas such as speech synthesis, instrument
synthesis and for applications in creative sound design.\\
The wide range of applications demonstrates the versatility of this
synthesis technique. It differs from traditional synthesis methods through
the use of real recorded samples, as opposed to traditional methods that
focus on defining sets of rules for emulating real sounds. By transforming
samples that have been directly recorded from a source, the subtle nuances
of the source's sound are preserved. These would be difficult to reproduce
using other synthetic methods for modelling an
instrument~\parencite[p.24]{Maestre2009}.
\subsection*{Speech Synthesis}
Creating a natural and intelligible realisation is an important factor when
developing a speech synthesis system. The Talkapillar project is one such
example of how highly convincing results are possible with CS. Through
careful analysis of a vocal database, the project aims to impose the
qualities of the database voice on an input voice. This would result in the
words of the input speaker being transformed to appear as if they were
spoken by the voice in the database~\parencite{Hueber}.
\subsection*{Instrument Synthesis}
Progress has also been made in improving the quality of instrumental
synthesis. As with speech synthesis, the use of samples directly allows for
natural sounding results, which provides a method for reproducing real
instruments convincingly. Another important aspect of instrumental synthesis is that of performer
expression. The reproduction of performance qualities such as dynamics,
timbre and timing are essential when emulating a real instrument and CS has
been used to effectively reproduce these aspects. This is achieved through
splicing of grains based on their expressive characteristics to form
musical phrases. For example, just as a violinist might transition
seamlessly from one articulation to the next, the CS software will join
grains to produce the variation in articulations. This contrasts with the
traditional approach to sampling, where samples are played in isolation,
resulting in a discontinuity between adjacent samples~\parencite[p.82]{Lindemann2007}.
The Caterpillar project is one such example of this use of CS.
By using a Viterbi algorithm, the project is able to calculate the
smoothest overall transition between grains across the output, resulting
in convincing synthesis of orchestral instrument performances~\parencite[p.5]{Schwarz2003}.
\subsection*{Creative Sound Design}
The flexibility of CS allows for creativity in a broader context than simply
emulating real-world instruments and speech. It can also be used as a tool
to explore the possibilities for synthesizing new abstract sounds for
creative purposes.\\
A prominent project in this area of CS is IRCAM's CataRT
project~\parencite{Schwarz2006a}. The project focuses on the playback of
source grains based on their proximity to a target in multi-dimensional
descriptor space. By providing a target point in the descriptor space, the
user is able to navigate the database, playing selections of samples that
are nearest to the target. This allows the user to explore the database
intuitively through a graphic user interface, selecting a point in
2-dimensional space with the mouse. Grains are then played back in
real-time to create an ``audio mosaic''.\\
Alternatively, target audio can be provided and analysed to create a target
location based on its location in the descriptor space. Tremblay and
Schwarz's~\citeyearpar{Tremblay2010} use of CataRT to explore
electroacoustic sample banks demonstrates the creative potential of this
method. CS is used in this context as a means for synthesizing matches in a
corpus database to real-time input from an electric bass. Significance is
placed on linking the playback of grains to the expressivity of the
performer. The use of perceptually based audio descriptors to match the
source to the target allows the performer to navigate the database
naturally based on factors such as the pitch and timbre of the bass
guitar. The result is a performance that mixes characteristics of both the
bass guitar output and the qualities of the corpus database to create a
hybrid of the two.\\
This is by no means an exhaustive overview of the projects and techniques
that explore the vast possibilities of CS. For further information, please
refer to: ``Concatenative Synthesis - The Early
Years''~\parencite{Schwarz2006b}.
\section*{Concatenator}
The Concatenator project aims to provide an open-source set of tools that
allows composers to generate a variety of CS-driven realisations for
sound design purposes. In addition, the project aims to provide an
intuitive API that Python programmers can use as the building
blocks for further concatenative synthesis applications.
The result is a framework and command-line interface, built in Python, for
easy access to basic CS techniques.
The current implementation can be used for the concatenation of a source
database onto target audio files, using a range of perceptual audio
descriptors for matching. Database management, simple matching and
synthesis algorithms are used to achieve this, and are described in the
following sections.
\section*{Program Design and Implementation}
The Concatenator project consists of a number of components that work
together to produce the final output. A complete description of all
components and their usage in the Concatenator project can be found in its
documentation at:\\
*PERMANENT URL FOR DOCUMENTATION NEEDED*\\
Output is generated by analysing overlapping segments of audio (known as
grains) from both the target sound and the source database, then searching
for the closest matching grain in the source database to the target sound.
Finally, the output is generated by applying a Hann window and
overlap-adding the best matches. Each component will be discussed in detail
in the following sections.\\
When designing the Concatenator framework, ease of development, use and
extensibility were primary considerations. It was for these reasons that
the framework was written in the Python programming language. Python has
grown in popularity in the scientific community recently, primarily due to
its focus on productivity, readability and the large number of efficient
numeric processing libraries available (NumPy, SciPy, scikit-learn,
etc.)~\parencite[p.11]{Fangohr2014}. This makes Python a good choice for
quickly developing ideas in the context of audio signal processing.
Unfortunately, the language does sacrifice processing speed for simplicity
and as a result is not suitable for real-time signal processing. Other
performance-focused languages such as C++ are better suited to this type of
processing. However, it was decided that the increase in productivity, the lack
of prior CS research in Python and the author's previous experience
made it the most suitable choice for this project.\\
The choice to limit the project to offline processing has both positive and
negative implications for the function of the project. A key disadvantage of
this type of processing is that it rules out any live performance
aspect. This method provides no way of exploring the feedback between
performer and system in a live environment, comparable to the work of
Tremblay and Schwarz~\citeyearpar{Tremblay2010}.
However, there are advantages to offline processing that would not be
possible in a real-time context.\\
One significant advantage is that databases can afford to be far larger
than they could in real time. Without the requirement to process output in
a short period of time, more time can be taken to search vast databases in
the hope that the closest match to a target will be found.\\
Another advantage is in the global view of a target that can be taken in an
offline approach. Because the complete audio file is available from the
start of processing, techniques can be applied that consider the output as
a whole rather than on a grain-by-grain basis. This allows for algorithms
such as the Viterbi algorithm to find the sequence of grains that provides
the best continuity, as demonstrated in the Caterpillar
project~\parencite[p.4]{Schwarz2003}. This would not be possible in
real time, as audio is processed on-the-fly.\\
An additional consideration was the method to be used for controlling the
target to be matched to. It was decided that the most interesting results
would be produced through the matching of grains to a target audio file, as
opposed to other approaches such as matching to MIDI scores. In this sense,
the project is an offline audio-mosaicking tool similar to
CataRT.
\subsection*{Descriptor Implementation}
In order to differentiate between grains, a number of audio descriptors
were implemented. Audio descriptors are used to measure a specific
characteristic of a signal~\parencite[p.31]{Lerch2012}. For example, an RMS
descriptor was implemented to give an indication of the overall intensity
of the grain. Another example is the F0 descriptor implemented to give a
value relating to pitch for harmonic grains. These values could then be
used by the matching algorithm in order to find the best match between the
source and target grains. A full description of all descriptors implemented
can be found in the Concatenator documentation.\\
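As an illustration, one common definition of the RMS of a grain $x$ of $N$
samples (the notation here is an assumption and is not taken from the
Concatenator source) is:
\[
  \mathrm{RMS}(x) = \sqrt{\frac{1}{N}\sum_{n=0}^{N-1} x[n]^{2}}
\]
A constant grain with $x[n] = 0.5$ for all $n$ therefore has an RMS of $0.5$,
while a full-scale sine wave measured over a whole number of periods has an
RMS of $1/\sqrt{2} \approx 0.707$.\\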
Due to time constraints on the project, only a limited number of basic
descriptors were implemented. For this reason, it was ensured that new
descriptors could be added easily to the project. The object-oriented
design of the descriptors allows future descriptors to be developed and
added to the project quickly.
\subsection*{Database Design}
When generating descriptors for a large database, large amounts of data are
produced and so an efficient method of storing and retrieving the data was
needed. The Python interface to the HDF5
file format~\parencite{Collette2016} was chosen for its simplicity and
ability to compress the data automatically. Storing NumPy arrays of
descriptors in groups allowed for quick and easy access to analyses from a
single, organised source.
\subsection*{Matching Algorithms}
In order to match grains using the descriptor values, a matching algorithm
was required. Initially a brute-force matcher was used to compare each
descriptor value in the target to all values of the same descriptor type in
the source. However, it quickly became apparent that this approach would be
far too slow, particularly for larger databases.\\
For this reason, a k-dimensional tree search algorithm was used in an
effort to improve matching efficiency. This approach produced the same
results as the brute-force matcher, but by arranging descriptors in a tree
structure, a far more efficient search to find the best match was possible.
This reduced matching time considerably.
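\\
As a sketch of the matching step, assume each grain is summarised by a vector
of $k$ descriptor values and that similarity is measured by Euclidean distance
(both the notation and the choice of distance are assumptions made for
illustration; the metric used by Concatenator is not stated here). For a
target grain vector $\mathbf{t}_i$ and source grain vectors
$\mathbf{s}_1, \dots, \mathbf{s}_M$, the selected grain is:
\[
  m(i) = \arg\min_{1 \le j \le M} \|\mathbf{t}_i - \mathbf{s}_j\|_2
\]
A brute-force search evaluates all $M$ distances for each of the $N$ target
grains, giving $O(NM)$ comparisons. A k-d tree is built once over the source
vectors and, for low-dimensional data, answers each query in roughly
$O(\log M)$ time on average, which is consistent with the reduction in
matching time reported for the tree search.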
\subsection*{Synthesis and Transformations}
The final step in the program is to synthesize the matched output.
This process consisted of:
\begin{enumerate}
\item Retrieving the best grain matches returned by the matching algorithm
\item Applying a window function
\item Overlapping the grains
\item Transforming grains to match target
\item Saving the result to a file
\end{enumerate}
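The windowing and overlapping steps can be sketched as an overlap-add sum.
The notation below is illustrative, and the periodic Hann window and hop size
of $H = N/2$ are assumptions rather than details taken from the Concatenator
source. For matched grains $g_m$ of length $N$ placed $H$ samples apart:
\[
  w[n] = \frac{1}{2}\left(1 - \cos\frac{2\pi n}{N}\right), \qquad
  y[n] = \sum_{m} w[n - mH]\, g_m[n - mH]
\]
where each term contributes only for $0 \le n - mH < N$. With $H = N/2$ the
overlapping windows sum to one, since $w[n] + w[n + N/2] = 1$, so the combined
window envelope is constant.\\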
Initially, grains were not transformed to better match the target. This
worked effectively for large databases; however, it was observed that
results synthesized using small databases were of a lower quality as the
chance of finding a closely matched grain was lower. To account for this, methods
for altering grains to better match their target were implemented. It was
decided that the two most significant characteristics to alter were the
pitch and intensity of the grains. By scaling the grains by the difference
between the source and target RMS, it was possible to impose a closer
intensity on a grain. Likewise, by shifting the pitch of a grain by the
difference between the source and target pitch, it was possible to better
match the pitch contour of the output
to that of the target audio. This improved the results significantly in
smaller databases, as poor matches could be improved to match the target
more convincingly.
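\\
The two transformations can be written as simple ratios. This is a sketch
under assumptions: the text above describes scaling ``by the difference''
between source and target values, and the ratio form below is one common way
to realise that, not necessarily the one used in Concatenator. For a source
grain $x_s$ matched to a target grain $x_t$, with fundamental frequencies
$f_s$ and $f_t$:
\[
  g = \frac{\mathrm{RMS}(x_t)}{\mathrm{RMS}(x_s)}, \qquad
  x_s'[n] = g\, x_s[n], \qquad
  \Delta p = 12 \log_2\frac{f_t}{f_s}\ \mathrm{semitones}
\]
For example, a source grain at $220$\,Hz matched to a target at $440$\,Hz
would be shifted up by $\Delta p = 12$ semitones (one octave), and a source
RMS of $0.1$ against a target RMS of $0.2$ gives a gain of $g = 2$.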
\subsection*{Command-line Interface}
In order to make the framework accessible to users, a command-line interface
was developed. By supplying arguments to the program, users could alter
parameters and experiment freely with the tool. Although this interface
was sufficient for testing and experimentation, it quickly became apparent
that there were too many parameters to pass to the program via the
command line on each run. A configuration file parser was created to
address this issue, allowing users to specify default parameters that would
be used by the program on each run. The combination of these interfaces
provided an effective means for accessing all of the framework's features.
\subsection*{Documentation and API}
Complete documentation for the project was created in order to make the
project as user-friendly as possible for both developers and users. As a
result, a full API reference is available alongside examples of use and instructions
for command-line operation. This was created in the hope that it might form
a usable package that developers can build on quickly and effectively to
create other CS projects, allowing for easier access to Python-based CS than
is currently available. The command-line interface is also documented to
allow users to create their own realisations quickly and easily so that
this project may be used for creative sound design purposes.
\section*{Results and Evaluation}
Overall, results generated by this project showed promise; a variety of
transformations were generated using open source instrument databases to
demonstrate the project's potential for sound design applications. This
tested the project's ability to convincingly impose qualities of an
instrument onto target sounds. A variety of examples are provided that
outline the style of synthesis aimed for. These range from imposing
acoustic guitar qualities on an electric guitar to imposing stringed
instrument qualities on vocal melodies. Current results have a clear
synthetic nature, but still clearly exhibit some of the main
characteristics of the database used.\\
\noindent
Concatenator project examples that demonstrate current results can be found at:\\
*PERMANENT URL FOR RESULTS NEEDED*\\
\section*{Research Limitations/Potential Development}
In retrospect, a great deal of time was spent trying to improve the
efficiency of the project. Although this was necessary, as initial tests
were not feasible on most databases, it had a negative impact on the time
available for developing perceptual qualities of the output. As a result,
the overall quality of output may not be as natural as that of
other projects in this area. This is apparent in the vocal-to-string
instrument examples. Phrases tend to begin and end abruptly, failing to
replicate any defined attack or decay of the string instruments, as would
be expected when hearing a string instrument naturally. Conversely, this
does give the output its own synthetic characteristic, which may be desirable
as perfect reproduction of an instrument may not be the reason for using
this tool.\\
In addition, the high computational cost means that a large amount of time is
needed to produce high-quality results. An end user may not have the
patience required to reach the quality of results that might be
possible. This is in part a limitation of the Python language, and could be
better addressed with further work on profiling the performance of the
tool.\\
However, the fundamental concepts such as descriptor matching and
transforming matches to better fit the target, that are used in the most
sophisticated CS projects, have been implemented in this project to
satisfying creative effect. As a proof of concept, this project displays
the possibilities for CS in Python and there is evidently potential for
further development in this area.\\
There are a number of further improvements that could be made to this
project in order to improve the quality of results and extend its overall
usefulness. Some initial ideas for improvements are detailed
below. These range from reasonably simple modifications that could
not be implemented purely due to time constraints, to more complex ideas
that may take a considerable amount of work.\\
The current implementation uses only a small and relatively basic subset of
the audio descriptors available. This limits the analysis of audio and thus
the quality of matches. Using a larger set of more advanced descriptors may
improve quality from this perspective. One way would be to incorporate the
open source Essentia audio descriptors~\parencite{Essentia2016}, giving the
project access to a vast quantity of descriptors for analysis.\\
Replacing the Hann window function used for grain windowing with a short
cross-fade at grain overlaps should reduce amplitude modulation, resulting
in smoother transitions between grains. This might be further improved
through calculating the point of maximum similarity by cross-correlating
overlapping sections, as described by~\textcite[pp.~191--193]{Zolzer2011} in
the SOLA algorithm.\\
A lack of continuity between grains was observed in results, most likely
due to the lack of any comparison of selected grains. A Viterbi algorithm
could be used to account for this, allowing for a search to be done amongst
the top matches to find the optimal set of grains. This takes advantage of
the offline nature of the project and has been shown to work effectively in
the Talkapillar project~\parencite{Hueber}.\\
Although the HDF5 file format allows for easy storage of descriptor values,
it also has drawbacks that limit the functionality of the project. One
significant problem is that it is difficult to implement parallel
processing using the library and for this reason asynchronous processing was
not implemented in the project. An alternative method of storage may
accommodate this more easily, allowing for the speed-ups possible through
asynchronous processing. The overall design of the database management was
also relatively naive and may benefit from being replaced by a technology
such as an SQL database or similar. This has been shown to work effectively
in work such as the CataRT project~\parencite[p.3]{Schwarz2006a}.
\section*{Conclusion}
This project has provided a functioning Python-based CS tool with much
potential for further development. Given the number of technical issues
faced with this style of synthesis (from the big data issues faced with
analysis storage, to high efficiency requirements for processing the large
quantities of data), overall this project appears to work effectively. It
provides a new and accessible means for tapping some of the vast amount of
potential that concatenative synthesis has to offer.\\ With the ever
increasing quality of technology, it is predicted that new techniques such
as concatenative synthesis may grow further in popularity, leading to an
increasing number of possibilities in this area of sound synthesis. It is
hoped that this project might aid in highlighting the possibilities
offered by this form of synthesis and demonstrate some of the technical
obstacles that must be addressed to design a CS project successfully.
\section*{Acknowledgments}
The author would like to thank A. Harker for his advice and guidance
as a mentor throughout the project, and A. Harker and P. Chen for access
to their vocal samples database. Thanks also to D. Chaplin for his
creative input in generating results.
\printbibliography
\end{document}
\documentclass{scrartcl}
\usepackage{enumitem}
\usepackage[british]{babel}
\usepackage[style=apa, backend=biber, maxnames=99]{biblatex}
\DeclareLanguageMapping{british}{british-apa}
\usepackage{filecontents}
\usepackage{url}
\usepackage{float}
\restylefloat{table}
\usepackage{perpage}
\MakePerPage{footnote}
\usepackage{abstract}
\usepackage{graphicx}
% Create hyperlinks in bibliography
\usepackage{hyperref}
\renewcommand{\familydefault}{\sfdefault}
\usepackage{fontspec}
\setmainfont{Arial}
\usepackage{blindtext}
\setkomafont{disposition}{\normalfont\fontsize{12}{17}\bfseries}
\setkomafont{section}{\normalfont\fontsize{12}{17}\bfseries}
\setkomafont{subsection}{\normalfont\fontsize{12}{17}\itshape}
\setkomafont{subsubsection}{\normalfont\fontsize{12}{17}\itshape}
\graphicspath{{./resources/}}
\addbibresource{~/Documents/library.bib}
% Hack to fix problem with underscores and other special characters in
% Mendeley bibliography.
\DeclareSourcemap{% Used when .bib/Bibliography is compiled, not when document is
\maps{
\map{% Replaces '{\_}', '{_}' or '\_' with just '_'
\step[fieldsource=url,
match=\regexp{\{\\\_\}|\{\_\}|\\\_},
replace=\regexp{\_}]
}
\map{% Replaces '{'$\sim$'}', '$\sim$' or '{~}' with just '~'
\step[fieldsource=url,
match=\regexp{\{\$\\sim\$\}|\{\~\}|\$\\sim\$},
replace=\regexp{\~}]
}
}
}
\usepackage[affil-it]{authblk}
% \usepackage{etoolbox}
% \makeatletter
% \expandafter\patchcmd\csname\string\maketitle\endcsname
% {\vskip\z@\@plus3fill}
% {\vskip\z@\@plus2fill\box\abstractbox\vskip\z@\@plus1fill}
% {}{}
% \makeatother
%
\DeclareCiteCommand{\citeyearpar}
{}
{\mkbibparens{\bibhyperref{\printdate}}}
{\multicitedelim}
{}
\newenvironment{keywords}%
{\begin{trivlist}\item[]{\bfseries\sffamily Keywords:}\ }%
{\end{trivlist}}
\begin{document}
\section*{Descriptor driven concatenative synthesis tool for Python}
Sam Perry\\
E-mail: \href{mailto:u1265119@unimail.hud.ac.uk}{u1265119@unimail.hud.ac.uk}
\section*{Abstract}
A command-line tool and Python framework are proposed for the exploration of
a new form of audio synthesis known as `concatenative synthesis', a form
of synthesis that uses perceptual audio analyses to arrange small segments
of audio based on their characteristics. The tool is designed to
synthesise representations of an input target sound using a source database
of sounds. This involves the segmentation and analysis of both the input
sound and database, the matching of input segments to their closest segment
from the database, and the re-synthesis of the closest matches to produce
the final result.\\
The project aims to provide a tool capable of generating high-quality
sonic representations of an input, to present a variety of examples that
demonstrate the breadth of possibilities that this style of synthesis has
to offer and to provide a robust framework on which concatenative synthesis
projects can be developed easily. The purpose of this project was primarily
to highlight the potential for further development in the area of
concatenative synthesis, and to provide a simple and intuitive tool that
could be used by composers for sound design and experimentation. The
breadth of possibilities for creating new sounds offered by this method of
synthesis makes it ideal for digital sound design and electroacoustic
composition.\\
Results demonstrate the wide variety of sounds that can be produced using
this method of synthesis. A number of technical issues are outlined that
impeded the overall quality of results and efficiency of the software.
However, the project clearly demonstrates the strong potential for this
type of synthesis to be used for creative purposes.
\begin{keywords}
Concatenative synthesis; Python; audio descriptor; audio analysis; command line tool; Python framework; Python sound
\end{keywords}
\section*{Acknowledgments}
I would like to thank A Harker for his advice and guidance as a mentor
throughout the project, and A Harker and P Chen for access to their
vocal samples database. Thanks also to D Chaplin for his creative input
in generating results.
\pagebreak
\section*{Background}
The concept of constructing a new sound by arranging collections of smaller
sounds has gained popularity in the past 30 years through the introduction
of granular synthesis, which works on the theory that any sound can be
described through the arrangement of smaller samples (referred to as
`grains'). This representation of sound allows for the temporal
decomposition and re-arranging of real-world samples, with the potential to
create new `complex, dynamically-evolving
sounds'~\parencite[p.1]{Roads1988}.\\
Concatenative synthesis (CS) is a form of synthesis that has developed
significantly over the past 15 years, driven by recent advancements in
technology. The key advancements have been in ease of access to large databases of
audio and the development of methods for extracting useful information from
these databases automatically~\parencite[p. 1]{Schwarz2006a}. CS utilises
these technologies to provide a content-based extension to granular
synthesis; analysis of a database of source grains enables them to be
differentiated based on their characteristics. These characteristics can
then be used for grain selection in the process of synthesising output for
a wide range of applications~\parencite[p. 102]{Schwarz2007}.
\section*{Related works}
A number of programs utilise CS to achieve various goals. The process has
been used for applications in areas such as speech synthesis, instrument
synthesis and creative sound design.\\
The wide range of applications demonstrates the versatility of this
synthesis technique. It differs from traditional synthesis methods as it
uses real recorded samples, as opposed to traditional methods that focus on
defining sets of rules for emulating real sounds. By transforming samples
that have been directly recorded from a source, the subtle nuances of the
source's sound are preserved. These would be difficult to reproduce using
other synthetic methods for modelling an
instrument~\parencite[p. 24]{Maestre2009}.
\subsection*{Speech synthesis}
Creating a natural and intelligible realisation is an important factor when
developing a speech-synthesis system. The Talkapillar project is one such
example of how highly convincing results are possible with CS. Through
careful analysis of a vocal database, the project aims to impose the
qualities of the database voice on an input voice. This would result in the
words of the input speaker being transformed to appear as if they were
spoken by the voice in the database~\parencite{Hueber}.
\subsection*{Instrument synthesis}
Progress has also been made in improving the quality of instrumental
synthesis. As with speech synthesis, the use of samples directly allows for
natural-sounding results, which provides a method for reproducing real
instruments convincingly. Another important aspect of instrumental synthesis is that of performer
expression. The reproduction of performance qualities such as dynamics,
timbre and timing is essential when emulating a real instrument and CS has
been used to effectively reproduce these aspects. This is achieved through
splicing of grains based on their expressive characteristics to form
musical phrases. For example, just as a violinist might transition
seamlessly from one articulation to the next, the CS software will join
grains to produce the variation in articulations. This contrasts with the
traditional approach to sampling, where samples are played in isolation,
resulting in a discontinuity between adjacent samples~\parencite[p. 82]{Lindemann2007}.
The Caterpillar project is one such example of this use of CS.
By using a Viterbi algorithm, the project is able to calculate the
smoothest overall transition between grains across the output, resulting
in convincing synthesis of orchestral instrument performances~\parencite[p. 5]{Schwarz2003}.
\subsection*{Creative sound design}
The flexibility of CS allows for creativity in a broader context than simply
emulating real-world instruments and speech. It can also be used as a tool
to explore the possibilities for synthesising new abstract sounds for
creative purposes.\\
A prominent project in this area of CS is IRCAM's CataRT
project~\parencite{Schwarz2006a}. The project focuses on the playback of
source grains based on their proximity to a target in multi-dimensional
descriptor space. Providing a target point in the descriptor space enables the
user to navigate the database, playing selections of samples that
are nearest to the target. This allows the user to explore the database
intuitively through a graphical user interface, selecting a point in
2-dimensional space with the mouse. Grains are then played back in
real-time to create an `audio mosaic'.\\
Alternatively, target audio can be provided and analysed to determine its
location in the descriptor space. Tremblay and
Schwarz's~\citeyearpar{Tremblay2010} use of CataRT to explore
electroacoustic sample banks demonstrates the creative potential of this
method. CS is used in this context as a means of synthesising matches in a
corpus database to real-time input from an electric bass. Significance is
placed on linking the playback of grains to the expressivity of the
performer. The use of perceptually based audio descriptors to match the
source to the target allows the performer to navigate the database
naturally based on factors such as the pitch and timbre of the bass
guitar. The result is a performance that mixes characteristics of both the
bass guitar output and the qualities of the corpus database to create a
hybrid of the two.\\
This is by no means an exhaustive overview of the projects and techniques
that explore the vast possibilities of CS. Further information can be found
in the article by~\textcite{Schwarz2006b}.
\pagebreak
\section*{Concatenator}
The Concatenator project aims to provide an open source tool that allows
composers to generate a variety of CS driven realisations for sound design
purposes. In addition, the project aims to provide an intuitive API that
Python programmers might use as the fundamental building blocks on which to
build further CS applications. The result is a framework and command-line
interface, built in Python, for easy access to basic CS techniques. All
relevant material including source code, results, and documentation can be
found in the official online project repository~\parencite{perry2016a}.
The current implementation can be used for the concatenation of a source
database onto target audio files, using a range of perceptual audio
descriptors for matching. Database management, simple matching and
synthesis algorithms are used to achieve this, and are described in the
following sections. \\
The features and uses of this tool are most comparable to those of the
MATConcat project~\parencite{sturm2004}, which was developed to provide an
open source tool for generating similar representations of audio in MATLAB.
Although there are technical differences such as the number of descriptors
available for each project, both share a similar focus on the
electroacoustic compositional applications of CS. Results produced for the
MATConcat project are comparable to those of the Concatenator project, and
both work offline to produce results. The Concatenator project builds on
this by providing a wider variety of descriptors and the ability to
artificially enhance matches (as discussed in the~\hyperref[sat]{Synthesis
and Transformations section}).
\section*{Program design and implementation}
The Concatenator project consists of a number of components that work
together to produce the final output. A complete description of all
components and their usage in the Concatenator project can be found in its
documentation.\\
Output is generated by analysing overlapping segments of audio (known as
grains) from both the target sound and the source database, then searching
for the closest matching grain in the source database to the target sound.
Finally, the output is generated by applying a Hann window and
overlap-adding the best matches. Each component is discussed in detail
in the following sections.\\
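The segmentation, windowing and overlap-add steps described above can be
sketched in a few lines of Python. This is a minimal illustration rather than
the project's actual code: the function names, the half-overlapping grains and
the periodic form of the Hann window are assumptions made for the example, and
plain lists are used so that only the standard library is needed.

```python
import math


def hann_window(length):
    """Return a periodic Hann window of the given length."""
    return [0.5 - 0.5 * math.cos(2.0 * math.pi * n / length) for n in range(length)]


def segment(signal, grain_size, hop):
    """Split a signal into overlapping grains, dropping any incomplete tail."""
    return [signal[start:start + grain_size]
            for start in range(0, len(signal) - grain_size + 1, hop)]


def overlap_add(grains, hop):
    """Window each grain and sum the grains at hop-sized intervals."""
    grain_size = len(grains[0])
    window = hann_window(grain_size)
    output = [0.0] * (hop * (len(grains) - 1) + grain_size)
    for index, grain in enumerate(grains):
        offset = index * hop
        for n, sample in enumerate(grain):
            output[offset + n] += sample * window[n]
    return output


# A constant signal cut into half-overlapping grains is reconstructed
# exactly wherever two windowed grains overlap.
signal = [1.0] * 64
grains = segment(signal, grain_size=16, hop=8)
resynthesised = overlap_add(grains, hop=8)
```

In a full synthesis pass, each grain passed to \texttt{overlap\_add} would be
the best source match for the corresponding target grain rather than a segment
of the original signal.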
When designing the Concatenator framework, ease of development, use and
extensibility were primary considerations. It was for these reasons that
the framework was written in the Python programming language. Python has
grown in popularity in the scientific community recently, primarily due to
its focus on productivity, readability and the large number of efficient
numeric processing libraries available~\parencite{Pedregosa2011,
Fangohr2014, Scipy}. This makes Python a good choice for
quickly developing ideas in the context of audio signal processing.
Unfortunately, the language sacrifices processing speed for simplicity
and, as a result, is not well suited to real-time signal processing. Other
performance-focused languages such as C++ are better suited to this type of
processing. However, it was decided that the increase in productivity, lack
of prior CS research in Python and the author's previous experience made
it the most suitable choice for this project.\\
The choice to limit the project to offline processing has both positive and
negative implications for the function of the project. A key disadvantage
of this type of processing is the lack of possibility for any live
performance aspect. This method provides no way of exploring the feedback
between performer and system in a live environment, as in the work
of Tremblay and Schwarz~\citeyearpar{Tremblay2010}.
However, there are advantages to offline processing that would not be
possible in a real-time context.\\
One significant advantage is that databases can afford to be far larger
than they could be in real time. Without the requirement to process output in
a short period of time, more time can be taken to search vast databases in
the hope that the closest match to a target will be found.\\
Another advantage is in the global view of a target that can be taken in an
offline approach. Because the complete audio file is available from the
start of processing, techniques can be applied that consider the output as
a whole, rather than on a grain-by-grain basis. This allows for algorithms
such as the Viterbi algorithm to find the sequence of grains that provide
the best continuity, as demonstrated in the Caterpillar
project~\parencite[p. 4]{Schwarz2003}. This would not be possible in
real time, as audio is processed `on the fly'.\\
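To illustrate why a global view of the target matters, the following sketch
shows a Viterbi-style search over a small set of candidate grains. It is not
part of the current implementation and is given only under stated assumptions:
the function \texttt{viterbi\_select}, the candidate data and the pitch-based
joining cost are all hypothetical.

```python
def viterbi_select(target_costs, transition_cost):
    """Pick one candidate per frame, minimising target plus transition cost.

    target_costs[t][k] is the descriptor distance between target frame t and
    its k-th candidate grain; transition_cost(t, j, k) is the cost of joining
    candidate j of frame t - 1 to candidate k of frame t.
    """
    # best[k] holds the cheapest total cost of any path ending in candidate k.
    best = list(target_costs[0])
    back_pointers = []
    for t in range(1, len(target_costs)):
        new_best, pointers = [], []
        for k, cost in enumerate(target_costs[t]):
            # Cheapest predecessor for candidate k of frame t.
            j = min(range(len(best)),
                    key=lambda j: best[j] + transition_cost(t, j, k))
            new_best.append(best[j] + transition_cost(t, j, k) + cost)
            pointers.append(j)
        best = new_best
        back_pointers.append(pointers)
    # Trace the cheapest path backwards from the final frame.
    k = min(range(len(best)), key=best.__getitem__)
    path = [k]
    for pointers in reversed(back_pointers):
        k = pointers[k]
        path.append(k)
    path.reverse()
    return path, min(best)


# Hypothetical data: each candidate is (distance_to_target, pitch_in_hz).
candidates = [
    [(0.1, 440.0), (0.2, 220.0)],
    [(0.1, 220.0), (0.3, 440.0)],
    [(0.2, 440.0), (0.1, 220.0)],
]
target_costs = [[distance for distance, _ in frame] for frame in candidates]


def pitch_jump(t, j, k):
    """Penalise pitch discontinuities between consecutive grains."""
    return abs(candidates[t - 1][j][1] - candidates[t][k][1]) / 100.0


path, total = viterbi_select(target_costs, pitch_jump)
# Grain-by-grain selection ignores continuity and simply takes the closest match.
greedy = [min(range(len(frame)), key=frame.__getitem__) for frame in target_costs]
```

Here the grain-by-grain choice starts on the 440 Hz candidate and then jumps an
octave, whereas the global search accepts a slightly worse first match in order
to keep the whole sequence at a continuous pitch.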
An additional consideration was the method to be used for controlling the
target to which the grains would be matched. It was decided that the most
interesting results would be produced through the matching of grains to a
target audio file, as opposed to other approaches such as matching to MIDI
scores. In this sense, the project is an offline audio-mosaicking
tool similar to CataRT.
\subsection*{Descriptor implementation}
In order to differentiate between grains, a number of audio descriptors
were implemented. Audio descriptors are used to measure a specific
characteristic of a signal~\parencite[p. 31]{Lerch2012}. For example, a
root mean square (RMS) descriptor was implemented to give an indication of
the overall intensity of the grain. Another example is the fundamental
frequency (F0) descriptor, which was implemented to give a value relating
to pitch for harmonic grains. These values could then be used by the
matching algorithm in order to find the best match between the source and
target grains.\\
Owing to time constraints on the project, only a limited number of basic
descriptors were implemented. For this reason, the project was designed so that new
descriptors could easily be added. The object-oriented
design of the descriptors provides the potential for quick development of
any future descriptors to be added.
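As a rough illustration of how an object-oriented design keeps new descriptors
cheap to add, consider the sketch below. The class and method names are
hypothetical and do not reflect the project's API. The RMS calculation follows
its standard definition, and the zero-crossing rate is included only as an
example of a descriptor that could be added later.

```python
import math
from abc import ABC, abstractmethod


class Descriptor(ABC):
    """Hypothetical base class: a descriptor maps one grain to one number."""

    name = "descriptor"

    @abstractmethod
    def analyse(self, grain, sample_rate):
        """Return a single value describing the grain."""


class RMS(Descriptor):
    """Root mean square: an indication of the overall intensity of a grain."""

    name = "rms"

    def analyse(self, grain, sample_rate):
        return math.sqrt(sum(x * x for x in grain) / len(grain))


class ZeroCrossingRate(Descriptor):
    """Adding a descriptor only requires one further subclass."""

    name = "zcr"

    def analyse(self, grain, sample_rate):
        crossings = sum(1 for a, b in zip(grain, grain[1:]) if (a < 0) != (b < 0))
        return crossings * sample_rate / len(grain)  # crossings per second


def analyse_grain(grain, sample_rate, descriptors):
    """Run every descriptor on a grain and collect the values by name."""
    return {d.name: d.analyse(grain, sample_rate) for d in descriptors}


# A 441 Hz sine at 44.1 kHz: RMS is about 0.707 and the signal crosses zero
# roughly twice per cycle.
grain = [math.sin(2.0 * math.pi * n / 100.0) for n in range(4400)]
values = analyse_grain(grain, 44100, [RMS(), ZeroCrossingRate()])
```

Because the matching stage only needs the dictionary of named values, code
written against this kind of interface does not change when a descriptor is
added.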
\subsection*{Database design}
When generating descriptors for large databases, large amounts of data are
produced and so an efficient method of storing and retrieving the data was
needed in order to manage this. The Python interface to the HDF5
file format~\parencite{Collette2016} was chosen for its simplicity and
ability to compress the data automatically. Storing NumPy arrays of
descriptors in groups allowed for quick and easy access to analyses from a
single, organised source.
\subsection*{Matching algorithms}
In order to match grains using the descriptor values, a matching algorithm
was required. Initially a brute-force matcher was used to compare each
descriptor value in the target to all values of the same descriptor type in
the source. However, it quickly became apparent that this approach would be
far too slow, particularly for a larger database.\\
For this reason, a k-dimensional (k-d) tree search algorithm was used in an
effort to improve matching efficiency. This approach produced the same
results as the brute-force matcher but, by arranging descriptors in a tree
structure, allowed a far more efficient search for the best match.
This reduced matching time considerably.
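The equivalence of the two matchers can be demonstrated with a small,
self-contained sketch. The code below is an illustrative pure-Python k-d tree
and not the project's implementation. The function names and the random
three-dimensional descriptor vectors are assumptions, and in practice a library
implementation such as SciPy's \texttt{cKDTree} would normally be preferred.

```python
import random


def build_kdtree(points, depth=0):
    """Recursively build a k-d tree from (vector, grain_index) pairs."""
    if not points:
        return None
    axis = depth % len(points[0][0])
    points = sorted(points, key=lambda p: p[0][axis])
    mid = len(points) // 2
    return {
        "point": points[mid],
        "axis": axis,
        "left": build_kdtree(points[:mid], depth + 1),
        "right": build_kdtree(points[mid + 1:], depth + 1),
    }


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest(node, target, best=None):
    """Return the (vector, grain_index) pair closest to the target vector."""
    if node is None:
        return best
    if best is None or (squared_distance(node["point"][0], target)
                        < squared_distance(best[0], target)):
        best = node["point"]
    diff = target[node["axis"]] - node["point"][0][node["axis"]]
    near, far = ((node["left"], node["right"]) if diff < 0
                 else (node["right"], node["left"]))
    best = nearest(near, target, best)
    # Search the far side only if the splitting plane is closer than the best match.
    if diff ** 2 < squared_distance(best[0], target):
        best = nearest(far, target, best)
    return best


def brute_force(points, target):
    """Compare the target against every source vector."""
    return min(points, key=lambda p: squared_distance(p[0], target))


random.seed(1)
source = [([random.random() for _ in range(3)], i) for i in range(500)]
targets = [[random.random() for _ in range(3)] for _ in range(20)]
tree = build_kdtree(source)
tree_matches = [nearest(tree, t)[1] for t in targets]
brute_matches = [brute_force(source, t)[1] for t in targets]
```

Both approaches return the same grain indices. The saving comes from the
pruning test in \texttt{nearest}, which lets whole branches of the tree be
skipped for each target.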
\subsection*{Synthesis and transformations} \label{sat}
The final step in the program was to synthesise the matched output.
This process consisted of:
\begin{enumerate}
\item Retrieving the best grain matches returned by the matching algorithm
\item Applying a window function
\item Overlapping the grains
\item Transforming grains to match the target
\item Saving the result to a file
\end{enumerate}
Initially, grains were not transformed to better match the target. This
worked effectively for large databases; however, it was observed that
results synthesised using small databases were of a lower quality, as the
chance of a closely matched grain was lower. To account for this, methods
for altering grains to better match their target were implemented. It was
decided that the two most significant characteristics to alter were the
pitch and intensity of the grains. By scaling the grains by the difference
between the source and target RMS, it was possible to impose a closer
intensity on a grain. Likewise, by shifting the pitch of a grain by the
difference between the source and target F0, it was possible to better
match the pitch contour of the output
to that of the target audio. This improved the results significantly in
smaller databases, as poor matches could be improved to match the target
more convincingly.
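The two transformations can be sketched as follows. This is an illustration
under stated assumptions rather than the project's own code: the gain is
expressed as the ratio of target to source RMS, the pitch difference is
expressed in semitones, and the shift itself is performed by naive
linear-interpolation resampling, which also changes the grain length. The
function names and the \texttt{max\_gain} limit are hypothetical.

```python
import math


def rms(grain):
    return math.sqrt(sum(x * x for x in grain) / len(grain))


def match_intensity(source_grain, target_rms, max_gain=8.0):
    """Scale a grain so that its RMS approaches the target RMS."""
    source_rms = rms(source_grain)
    if source_rms == 0.0:
        return list(source_grain)  # silence cannot be scaled up
    gain = min(target_rms / source_rms, max_gain)  # limit extreme boosts
    return [x * gain for x in source_grain]


def pitch_difference_semitones(source_f0, target_f0):
    """Return the shift, in semitones, that moves source_f0 onto target_f0."""
    return 12.0 * math.log2(target_f0 / source_f0)


def resample(grain, ratio):
    """Naive pitch shift: read through the grain `ratio` times faster."""
    length = int(len(grain) / ratio)
    shifted = []
    for n in range(length):
        position = n * ratio
        i = int(position)
        frac = position - i
        following = grain[min(i + 1, len(grain) - 1)]
        shifted.append(grain[i] * (1.0 - frac) + following * frac)
    return shifted


# A quiet grain with a 200-sample period is matched to a louder target
# one octave higher.
source_grain = [0.2 * math.sin(2.0 * math.pi * n / 200.0) for n in range(2000)]
scaled = match_intensity(source_grain, target_rms=0.5)
semitones = pitch_difference_semitones(source_f0=220.0, target_f0=440.0)
shifted = resample(scaled, ratio=2.0 ** (semitones / 12.0))
```

Limiting the gain is one way of preventing very quiet source grains from being
amplified into audible noise.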
\subsection*{Command-line interface}
In order to make the framework accessible to users, a command-line interface
was developed. By supplying arguments to the program, users could alter
parameters and experiment freely with the tool. Although this interface
was sufficient for testing and experimentation, it quickly became apparent
that there were too many parameters to pass to the program via the command
line on each run. A configuration file parser was created to
address this issue, allowing users to specify default parameters that would
be used by the program on each run. The combination of these interfaces
provided an effective means for accessing all of the framework's features.
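One common way of combining the two interfaces is to let values from the
configuration file become the defaults of the command-line parser, so that
explicit arguments always take precedence. The sketch below uses only the
standard \texttt{argparse} and \texttt{configparser} modules. The option
names, section name and default values are hypothetical and are not taken from
the project.

```python
import argparse
import configparser

# Hypothetical parameter names and defaults, for illustration only.
BUILT_IN_DEFAULTS = {"grain_size": "100", "overlap": "2", "descriptors": "rms,f0"}


def load_settings(config_text, argv):
    """Merge built-in defaults, a config file and command-line arguments."""
    config = configparser.ConfigParser()
    config.read_dict({"defaults": BUILT_IN_DEFAULTS})
    config.read_string(config_text)  # values in the file override built-ins
    defaults = config["defaults"]

    parser = argparse.ArgumentParser(description="Sketch of a CS command line")
    parser.add_argument("source")
    parser.add_argument("target")
    parser.add_argument("output")
    parser.add_argument("--grain-size", type=int,
                        default=defaults.getint("grain_size"))
    parser.add_argument("--overlap", type=int,
                        default=defaults.getint("overlap"))
    parser.add_argument("--descriptors",
                        type=lambda text: text.split(","),
                        default=defaults["descriptors"].split(","))
    # Anything given explicitly in argv overrides the file and the built-ins.
    return parser.parse_args(argv)


config_text = "[defaults]\ngrain_size = 40\n"
args = load_settings(config_text,
                     ["db.hdf5", "target.wav", "out.wav", "--overlap", "4"])
```

In this example the grain size comes from the configuration file, the overlap
from the command line and the descriptor list from the built-in defaults.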
\subsection*{Documentation and API}
Complete documentation for the project was created in order to make the
project as user-friendly as possible for both developers and users. As a
result, a full API is available alongside examples of use and instructions
for command-line operation. This was created in the hope that it might form
a usable package that developers can build on quickly and effectively to
build other CS projects, allowing for easier access to Python-based CS than
is currently available. The command-line interface is equally documented to
allow users to create their own realisations quickly and easily so that
this project may be used for creative sound design purposes.
\section*{Results and evaluation}
Overall, the results generated by this project showed promise; a variety of
transformations were generated using open source instrument databases to
demonstrate the project's potential for sound design applications. This
tested the project's ability to convincingly impose qualities of an
instrument onto target sounds. A variety of examples are provided that
outline the style of synthesis aimed for. These range from imposing
acoustic guitar qualities on an electric guitar to imposing stringed
instrument qualities on vocal melodies. Current results have a clear
synthetic nature, but still clearly exhibit some of the main
characteristics of the database used.
\section*{Research Limitations/Potential Development}
In retrospect, a great deal of time was spent trying to improve the
efficiency of the project. Although this was necessary, as initial tests
were not feasible on most databases, it had a negative impact on the time
available for developing perceptual qualities of the output. As a result of
this, the overall quality of output may not be as natural as that
of other projects in this area. This is apparent in the
vocal~\textrightarrow~string instrument examples. Phrases tend to begin and
end abruptly, failing to replicate any defined attack or decay of the
string instruments, as would be expected when hearing a string instrument
naturally. Conversely, this does give output its own synthetic
characteristic, which may be desirable as perfect reproduction of an
instrument may not be the reason for using this tool.\\
In addition, the amount of computation required means that a large amount of
time is needed to produce high-quality results. An end user may not have the
patience required to reach the quality of results that might be
possible. This is in part a drawback of the Python language, and could be
better accounted for with further work on profiling the performance of the
tool.\\
However, the fundamental concepts such as descriptor matching and
transforming matches to better fit the target, which are used in the most
sophisticated CS projects, have been implemented in this project to
satisfying creative effect. As a proof of concept, this project displays
the possibilities for CS in Python and there is evidently potential for
further development in this area.\\
There are a number of further improvements that could be made to this
project in order to improve the quality of results and extend its overall
usefulness. These range from reasonably simple modifications that could not
be implemented purely due to time constraints, to more complex ideas that
may take a considerable amount of work. The following is a list of some
initial ideas for improvements.
\begin{itemize}
\item The current implementation uses only a small and relatively basic
subset of the audio descriptors available. This limits the analysis
of audio and thus the quality of matches. Using a larger set of
more advanced descriptors may improve quality from this
perspective. One way would be to incorporate the open source
Essentia audio descriptors~\parencite{Essentia2016} giving the
project access to a vast quantity of descriptors for analysis.
\item Replacing the Hann window function used for grain windowing
with a short cross-fade at grain overlaps should reduce amplitude
modulation, resulting in smoother transitions between grains. This
might be further improved through calculating the point of maximum
similarity by cross-correlating overlapping sections, as described
by~\textcite[pp. 191--193]{Zolzer2011} in the Synchronous OverLap Add
(SOLA) algorithm.
\item A lack of continuity between grains was observed in results, most
likely owing to the lack of any comparison of selected grains. A
Viterbi algorithm could be used to account for this, allowing for a
search to be done amongst the top matches to find the optimal set
of grains. This takes advantage of the offline nature of the
project and has been shown to work effectively in the Talkapillar
project~\parencite{Hueber}.
\item Although the HDF5 file format allows for easy storage of
descriptor values, it also has drawbacks that limit the
functionality of the project. One significant problem is that it is
difficult to implement parallel processing using the library and
for this reason asynchronous processing was not implemented in the
project. An alternative method of storage may accommodate this more
easily, allowing for the speed-ups possible through asynchronous
processing. The overall design of the database management was also
relatively naive and may benefit from being replaced by a
technology such as an SQL database or similar. This has been shown
to work effectively in work such as the CataRT
project~\parencite[p. 3]{Schwarz2006a}.
\end{itemize}
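To make the cross-correlation idea from the list above more concrete, the
following sketch searches for the shift at which the start of the next grain
best lines up with the tail of the previous one. It is a simplified
illustration of the alignment step only and not a full SOLA implementation. The
function name, the normalised correlation measure and the sinusoidal test
signals are assumptions.

```python
import math


def best_overlap_shift(previous_tail, next_grain, max_shift):
    """Find the shift of next_grain that best aligns it with previous_tail."""
    overlap = len(previous_tail)
    tail_energy = sum(a * a for a in previous_tail)
    best_shift, best_score = 0, float("-inf")
    for shift in range(max_shift + 1):
        candidate = next_grain[shift:shift + overlap]
        if len(candidate) < overlap:
            break  # ran out of samples in the next grain
        energy = math.sqrt(tail_energy * sum(b * b for b in candidate))
        correlation = sum(a * b for a, b in zip(previous_tail, candidate))
        # Normalised cross-correlation: 1.0 means the segments line up exactly.
        score = correlation / energy if energy > 0.0 else 0.0
        if score > best_score:
            best_shift, best_score = shift, score
    return best_shift, best_score


# Two sinusoidal grains with the same period but a 10-sample phase offset.
period = 50
previous_tail = [math.sin(2.0 * math.pi * (n + 10) / period) for n in range(40)]
next_grain = [math.sin(2.0 * math.pi * n / period) for n in range(200)]
shift, score = best_overlap_shift(previous_tail, next_grain, max_shift=49)
```

Starting the cross-fade at the returned shift joins the two grains in phase,
which is the effect the SOLA suggestion aims for.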
\section*{Conclusion}
This project has provided a functioning Python-based CS project with much
potential for further development. Given the number of technical issues
faced with this style of synthesis (from the big data issues faced with
analysis storage, to high efficiency requirements for processing the large
quantities of data), overall this project appears to work effectively. It
provides a new and accessible means for tapping some of the vast amount of
potential that concatenative synthesis has to offer.\\
With the ever-increasing quality of technology, it is predicted that new
techniques such as concatenative synthesis may grow further in popularity,
leading to an increasing number of possibilities in this area of sound
synthesis. It is hoped that this project might aid in highlighting the
possibilities offered by this form of synthesis and demonstrate some of the
technical obstacles that must be addressed to design a CS project
successfully.
\pagebreak
\printbibliography
\end{document}