\documentclass{scrartcl}
\usepackage{enumitem}
\usepackage[british]{babel}
\usepackage[style=apa, backend=biber]{biblatex}
\DeclareLanguageMapping{british}{british-apa}
\usepackage{url}
\usepackage{float}
\restylefloat{table}
\usepackage{perpage}
\MakePerPage{footnote}
\usepackage{abstract}
\usepackage{graphicx}
% Create hyperlinks in bibliography
\usepackage{hyperref}

\renewcommand{\familydefault}{\sfdefault}
\usepackage{fontspec}
\setmainfont{Arial}

\usepackage{blindtext}
\setkomafont{disposition}{\normalfont\fontsize{12}{17}\bfseries}
\setkomafont{section}{\normalfont\fontsize{12}{17}\bfseries}
\setkomafont{subsection}{\normalfont\fontsize{12}{17}\itshape}
\setkomafont{subsubsection}{\normalfont\fontsize{12}{17}\itshape}

\graphicspath{{./resources/}}
\addbibresource{~/Documents/library.bib}

\usepackage{etoolbox}
\makeatletter
\expandafter\patchcmd\csname\string\maketitle\endcsname
{\vskip\z@\@plus3fill}
{\vskip\z@\@plus2fill\box\abstractbox\vskip\z@\@plus1fill}
{}{}
\makeatother

\DeclareCiteCommand{\citeyearpar}
{}
{\mkbibparens{\bibhyperref{\printdate}}}
{\multicitedelim}
{}
\begin{document}
\title{Descriptor Driven Concatenative Synthesis Tool for Python}
% \subtitle{\LARGE{Abstract Draft}}
\author{Sam Perry}

\maketitle
\begin{abstract}
A command-line tool and Python framework are proposed for the exploration of
a new form of audio synthesis known as ``concatenative synthesis'': a
form of synthesis that uses perceptual audio analyses to arrange small
segments of audio based on their characteristics. The tool is designed to
synthesise representations of an input sound using a database of source
sounds. This involves the segmentation and analysis of both the input sound
and database, the matching of input segments to their closest segments in the
database, and the re-synthesis of the closest matches from the database to
produce the final result. The project aims to provide a tool capable of
generating high-quality sonic representations of an input, to present a
variety of examples that demonstrate the breadth of possibilities that
this style of synthesis has to offer, and to provide a robust framework on
which concatenative synthesis projects can be developed easily.\\

Results demonstrate the wide variety of sounds that can be produced using
this method of synthesis. A number of technical issues are outlined that
impeded the overall quality of results and efficiency of the software.
However, the project clearly demonstrates the strong potential for this
type of synthesis to be used for creative purposes.
\end{abstract}
\section*{Background}
The concept of constructing a new sound by arranging collections of smaller
sounds has gained popularity in the past 30 years through the introduction
of ``Granular Synthesis''. Granular synthesis works on the theory that any
sound can be described through the arrangement of smaller samples (referred
to as ``grains''). This representation of sound allows for the temporal
decomposition and re-arranging of real-world samples, with the potential to
create new ``complex, dynamically-evolving
sounds''~\parencite[1]{Roads1988}.\\

Concatenative synthesis (CS) is a form of synthesis that has developed
significantly over the past 15 years, driven by recent advancements in
technology. Key advancements have been easy access to large databases of
audio and the development of methods for extracting useful information from
these databases automatically~\parencite[1]{Schwarz2006a}. CS utilises
these technologies to provide a content-based extension to granular
synthesis: by analysing a database of source grains, grains can be
differentiated based on their characteristics. These characteristics can
then be used for grain selection in the process of synthesising output for
a wide range of applications~\parencite[102]{Schwarz2007}.
\section*{Related Works}
A number of programs utilise CS to achieve various goals. The process has
been used in areas such as speech synthesis, instrument
synthesis and creative sound design.\\
The wide range of applications demonstrates the versatility of this
synthesis technique. It differs from traditional synthesis methods through
the use of real recorded samples, as opposed to traditional methods that
focus on defining sets of rules for emulating real sounds. By transforming
samples that have been directly recorded from a source, the subtle nuances
of the source's sound are preserved. These would be difficult to reproduce
using other synthetic methods for modelling an
instrument~\parencite[24]{Maestre2009}.
\subsection*{Speech Synthesis}
Creating a natural and intelligible realisation is an important factor when
developing a speech synthesis system. The Talkapillar project is one
example of how highly convincing results are possible with CS. Through
careful analysis of a vocal database, the project aims to impose the
qualities of the database voice on an input voice. This results in the
words of the input speaker being transformed to appear as if they were
spoken by the voice in the database~\parencite{Hueber}.
\subsection*{Instrument Synthesis}
Progress has also been made in improving the quality of instrument
synthesis. As with speech synthesis, the direct use of samples allows for
natural-sounding results, which provides a method for reproducing real
instruments convincingly. Another important aspect of instrument synthesis is
that of performer
expression. The reproduction of performance qualities such as dynamics,
timbre and timing is essential when emulating a real instrument, and CS has
been used to reproduce these aspects effectively. This is achieved through
splicing of grains based on their expressive characteristics to form
musical phrases. For example, just as a violinist might transition
seamlessly from one articulation to the next, the CS software will join
grains to produce the variation in articulations. This contrasts with the
traditional approach to sampling, where samples are played in isolation,
resulting in a discontinuity between adjacent
samples~\parencite[82]{Lindemann2007}.
The Caterpillar project is one example of this use of CS.
By using a Viterbi algorithm, the project is able to calculate the
smoothest overall transition between grains across the output, resulting
in convincing synthesis of orchestral instrument
performances~\parencite[5]{Schwarz2003}.
\subsection*{Creative Sound Design}
The flexibility of CS allows for creativity in a broader context than simply
emulating real-world instruments and speech. It can also be used as a tool
to explore the possibilities for synthesising new abstract sounds for
creative purposes.\\
A prominent project in this area of CS is IRCAM's CataRT
project~\parencite{Schwarz2006a}. The project focuses on the playback of
source grains based on their proximity to a target in multi-dimensional
descriptor space. By providing a target point in the descriptor space, the
user is able to navigate the database, playing selections of samples that
are nearest to the target. This allows the user to explore the database
intuitively through a graphical user interface, selecting a point in
2-dimensional space with the mouse. Grains are then played back in
real time to create an ``audio mosaic''.\\
Alternatively, target audio can be provided and analysed to create a target
point based on its location in the descriptor space. Tremblay and
Schwarz's~\citeyearpar{Tremblay2010} use of CataRT to explore
electroacoustic sample banks demonstrates the creative potential of this
method. CS is used in this context as a means of synthesising matches in a
corpus database to real-time input from an electric bass. Significance is
placed on linking the playback of grains to the expressivity of the
performer. The use of perceptually based audio descriptors to match the
source to the target allows the performer to navigate the database
naturally based on factors such as the pitch and timbre of the bass
guitar. The result is a performance that mixes characteristics of both the
bass guitar output and the corpus database to create a
hybrid of the two.\\

This is by no means an exhaustive overview of the projects and techniques
that explore the vast possibilities of CS. For further information, please
refer to ``Concatenative Synthesis - The Early
Years''~\parencite{Schwarz2006b}.
\section*{Concatenator}
The Concatenator project aims to provide an open-source set of tools that
allows composers to generate a variety of CS-driven realisations for
sound design purposes. In addition, the project aims to provide an
intuitive API that Python programmers might use as the fundamental building
blocks on which to build further concatenative synthesis applications.
The result is a framework and command-line interface, built in Python, for
easy access to basic CS techniques.
The current implementation can be used for the concatenation of a source
database onto target audio files, using a range of perceptual audio
descriptors for matching. Database management, simple matching and
synthesis algorithms are used to achieve this, and are described in the
following sections.
\section*{Program Design and Implementation}
The Concatenator project consists of a number of components, as shown below:\\

*INSERT Concatenator OVERVIEW DIAGRAM*\\

Output is generated by analysing overlapping segments of audio (known as
grains) from both the target sound and the source database, then searching
the source database for the grain that most closely matches each target grain.
Finally, the output is generated by applying a Hann window to the best
matches and overlap-adding them. Each component will be discussed in detail
in the following sections.\\

When designing the Concatenator framework, ease of development, ease of use
and extensibility were primary considerations. It was for these reasons that
the framework was written in the Python programming language. Python has
grown in popularity in the scientific community recently, primarily due to
its focus on productivity, readability and the large number of efficient
numeric processing libraries available (NumPy, SciPy, scikit-learn,
etc.)~\parencite[11]{Fangohr2014}. This makes Python a good choice for
quickly developing ideas in the context of audio signal processing.
Unfortunately, the language sacrifices processing speed for simplicity
and as a result is not suitable for real-time signal processing. Other
performance-focused languages such as C++ are better suited to this type of
processing. However, it was decided that the increase in productivity, the
lack of prior CS research in Python and the author's previous experience
made it the most suitable choice for this project.\\

The choice to limit the project to offline processing has both positive and
negative implications for the function of the project. A key disadvantage of
this type of processing is that it rules out any live performance
aspect. This method provides no way of exploring the feedback between
performer and system in a live environment, comparable to the work of
Tremblay and Schwarz~\citeyearpar{Tremblay2010}.
However, there are advantages to offline processing that would not be
possible in a real-time context.\\
One significant advantage is that databases can afford to be far larger
than they could in real time. Without the requirement to process output in
a short period of time, more time can be taken to search vast databases in
the hope that the closest match to a target will be found.\\
Another advantage is the global view of a target that can be taken in an
offline approach. Because the complete audio file is available from the
start of processing, techniques can be applied that consider the output as
a whole rather than on a grain-by-grain basis. This allows
algorithms such as the Viterbi algorithm to find the sequence of grains that
provides the best continuity, as demonstrated in the Caterpillar
project~\parencite[4]{Schwarz2003}. This would not be possible in
real time, as audio is processed on the fly.\\

An additional consideration was the method to be used for controlling the
target to be matched to. It was decided that the most interesting results
would be produced through the matching of grains to a target audio file, as
opposed to other approaches such as matching to MIDI scores. In this sense
the project is a form of offline audio-mosaicking tool similar to
CataRT.
\subsection*{Descriptor Implementation}
In order to differentiate between grains, a number of audio descriptors
were implemented. An audio descriptor measures a specific
characteristic of a signal~\parencite[31]{Lerch2012}. For example, an RMS
descriptor was implemented to give an indication of the overall intensity
of the grain. Another example is the F0 descriptor, implemented to give a
value relating to pitch for harmonic grains. These values could then be
used by the matching algorithm in order to find the best match between the
source and target grains. A full description of all descriptors implemented
can be found in the Concatenator documentation.\\
Due to time constraints on the project, only a limited number of basic
descriptors were implemented. For this reason, care was taken to ensure that
new descriptors could be added easily to the project. The object-oriented
design of the descriptors allows for quick development of
any future descriptors to be added to the project.
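As an illustration of the kind of measurement a descriptor makes, the RMS of a
grain can be written as follows. This is the standard textbook definition
rather than a statement of the exact Concatenator implementation; the symbols
$x[n]$ (the samples of one grain) and $N$ (the grain length in samples) are
introduced here for illustration only.
\[
  \mathrm{RMS}(x) = \sqrt{\frac{1}{N}\sum_{n=0}^{N-1} x[n]^{2}}
\]
Under this definition, a grain of constant amplitude $a$ has an RMS of $|a|$,
while a full-scale sine wave measured over a whole number of periods has an
RMS of $1/\sqrt{2} \approx 0.707$.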
\subsection*{Database Design}
Generating descriptors for a large database produces large amounts of data,
so an efficient method of storing and retrieving the data was
needed. The Python interface to the HDF5
file format~\parencite{Collette2016} was chosen for its simplicity and
ability to compress the data automatically. Storing NumPy arrays of
descriptors in groups allowed for quick and easy access to analyses from a
single, organised source.
\subsection*{Matching Algorithms}
In order to match grains using the descriptor values, a matching algorithm
was required. Initially a brute-force matcher was used to compare each
descriptor value in the target to all values of the same descriptor type in
the source. However, it quickly became apparent that this approach would be
far too slow, particularly for larger databases.\\
For this reason, a k-dimensional tree (k-d tree) search algorithm was used in
an effort to improve matching efficiency. This approach produced the same
results as the brute-force matcher, but by arranging descriptors in a tree
structure, a far more efficient search for the best match was possible.
This reduced matching time considerably.
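The matching step can be summarised as a nearest-neighbour search in
descriptor space. The following is a sketch under stated assumptions: each
grain is represented by a vector of $D$ descriptor values, $\mathbf{t}_i$ for
target grain $i$ and $\mathbf{s}_j$ for source grain $j$, and distance is
measured with the Euclidean norm. These symbols, and the choice of an
unweighted Euclidean distance, are assumptions made for illustration and are
not taken from the Concatenator source.
\[
  m(i) = \arg\min_{j \in \{1,\dots,M\}} \| \mathbf{t}_i - \mathbf{s}_j \|_2
       = \arg\min_{j \in \{1,\dots,M\}} \sqrt{\sum_{d=1}^{D} (t_{i,d} - s_{j,d})^{2}}
\]
A brute-force matcher evaluates all $M$ distances for each of the $T$ target
grains, giving a cost of $O(TMD)$, whereas a k-d tree built once over the
source descriptors typically answers each query in roughly $O(\log M)$ time
when $D$ is small. Both approaches return the same $m(i)$, which is consistent
with the identical results and reduced matching time reported above.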
\subsection*{Synthesis and Transformations}
The final step in the program is to synthesise the matched output.
This process consisted of:
\begin{enumerate}
\item Retrieving the best grain matches returned by the matching algorithm
\item Applying a window function
\item Overlapping the grains
\item Transforming grains to match the target
\item Saving the result to a file
\end{enumerate}
Initially, grains were not transformed to better match the target. This
worked effectively for large databases; however, it was observed that
results synthesised using small databases were of a lower quality, as the
chance of finding a closely matched grain was lower. To account for this,
methods for altering grains to better match their target were implemented. It
was decided that the two most significant characteristics to alter were the
pitch and intensity of the grains. By scaling the grains by the difference
between the source and target RMS, it was possible to impose a closer
intensity on a grain. Likewise, by shifting the pitch of a grain by the
difference between the source and target F0, it was possible to better match
the pitch contour of the output
to that of the target audio. This improved the results significantly for
smaller databases, as poor matches could be improved to match the target
more convincingly.
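The windowing, overlap-add and transformation steps above can be sketched as
follows. The notation is introduced here for illustration: $g_i[n]$ is the
(optionally pitch-shifted) source grain selected for target position $i$, $N$
is the grain length, $H$ is the hop size between successive grains, and
$\mathrm{RMS}_{t,i}$, $\mathrm{RMS}_{s,i}$, $f_{t,i}$ and $f_{s,i}$ are the
RMS and F0 values of the target and source grains at position $i$. The text
describes scaling by the ``difference'' between source and target; the ratio
forms below are one common interpretation and are an assumption, not a
description of the exact implementation.
\[
  w[n] = \frac{1}{2}\left(1 - \cos\frac{2\pi n}{N-1}\right),
  \qquad 0 \le n \le N-1
\]
\[
  y[n] = \sum_{i} \alpha_i \, w[n - iH] \, g_i[n - iH],
  \qquad \alpha_i = \frac{\mathrm{RMS}_{t,i}}{\mathrm{RMS}_{s,i}}
\]
\[
  \Delta p_i = 12 \log_2 \frac{f_{t,i}}{f_{s,i}} \quad \mbox{semitones}
\]
Here $w[n]$ and $g_i[n]$ are taken to be zero outside $0 \le n \le N-1$. For
example, a source grain at 220~Hz matched to a target at 440~Hz would be
shifted up by $12 \log_2 2 = 12$ semitones, and a source grain with half the
target RMS would be scaled by $\alpha_i = 2$.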
\subsection*{Command-Line Interface}
In order to make the framework accessible to users, a command-line interface
was developed. By supplying arguments to the program, users could alter
parameters and experiment freely with the tool. Although this interface
was sufficient for testing and experimentation, it quickly became apparent
that there were too many parameters to pass to the program via the
command-line interface on each run. A configuration file parser was created to
address this issue, allowing users to specify default parameters that would
be used by the program on each run. The combination of these interfaces
provided an effective means of accessing all of the framework's features.
\subsection*{Documentation and API}
In order to make the project as user-friendly as possible for both
developers and users, a significant amount of time was spent documenting
the code properly. As a result, full API documentation is available alongside
examples of use. This was written in the hope that it might form a usable
package on which developers can quickly and effectively build other CS
projects, allowing for easier access to Python-based CS than is currently
available. The command-line interface is documented to the same standard to
allow users to create their own realisations quickly and easily, so that this
project may be used for creative sound design purposes.
\section*{Results and Evaluation}
Overall, results generated by this project showed promise; a variety of
transformations were generated using open-source instrument databases to
demonstrate the project's potential for sound design applications. This
tested the project's ability to convincingly impose qualities of an
instrument onto target sounds.

In retrospect, a great deal of time was spent trying to improve the
efficiency of the project. Although this was necessary, as initial tests
were not feasible on most databases, it had a negative impact on the time
available for developing perceptual qualities of the output. As a result,
the overall quality of output may not be as high as that of
other projects in this area.
A further issue is the high computation required, resulting in
large amounts of time needed to produce high-quality results. An end user
may not have the patience required to reach the quality of results that
might be possible. However, the fundamental concepts that are used
in the most sophisticated CS projects, such as descriptor
matching and transforming matches to better fit the target,
have been implemented in this
project to reasonable effect. As a proof of concept, this project displays
the possibilities for CS in Python and there is evidently potential for
further development in this area.
\section*{Research Limitations/Potential Development}
There are a number of further improvements that could be made to this
project in order to improve the quality of results and extend its overall
usefulness. Some initial ideas for improvements are detailed in this
section. These range from reasonably simple modifications that could not be
implemented purely due to time constraints, to more complex ideas that may
take a considerable amount of work.\\

The current implementation uses only a small and relatively basic subset of
the audio descriptors available. This limits the analysis of audio and thus
the quality of matches. Using a larger set of more advanced descriptors may
improve quality from this perspective. One way would be to incorporate the
open-source Essentia audio descriptors~\parencite{Essentia2016}, giving the
project access to a vast quantity of descriptors for analysis.\\

Replacing the Hann window function used for grain windowing with a short
cross-fade at grain overlaps should reduce amplitude modulation, resulting
in smoother transitions between grains. This might be further improved
by calculating the point of maximum similarity through cross-correlation of
the overlapping sections, as in the SOLA algorithm described
by~\textcite[191--193]{Zolzer2011}.\\

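The alignment idea can be sketched as a search for the lag that maximises the
cross-correlation of the two overlapping regions. The symbols are assumptions
for illustration: $a[n]$ is the tail of the output so far, $b[n]$ is the head
of the incoming grain, $L$ is the nominal overlap length and $K < L$ is the
maximum lag searched. This is a simplified, unnormalised form; practical
implementations usually normalise the correlation.
\[
  k^{\ast} = \arg\max_{0 \le k \le K} \sum_{n=0}^{L-k-1} a[n+k]\, b[n]
\]
The incoming grain would then be placed $k^{\ast}$ samples later than its
nominal position, with the cross-fade applied over the remaining overlap.
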
A lack of continuity between grains was observed in results, most likely
because the selected grains are never compared with one another. A Viterbi
algorithm could be used to account for this, allowing a search to be done
amongst the top matches to find the optimal sequence of grains. This takes
advantage of the offline nature of the project and has been shown to work
effectively in the Talkapillar project~\parencite{Hueber}.

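A Viterbi-style search of this kind can be sketched as a dynamic-programming
recurrence. The notation is hypothetical: $c_i(j)$ is the target cost of using
candidate source grain $j$ at target position $i$ (for example, the descriptor
distance), $q(k, j)$ is a concatenation cost that penalises a poor join from
grain $k$ to grain $j$, and $C_i(j)$ is the lowest total cost of any grain
sequence ending with grain $j$ at position $i$.
\[
  C_1(j) = c_1(j), \qquad
  C_i(j) = c_i(j) + \min_{k} \left[ C_{i-1}(k) + q(k, j) \right]
\]
The optimal sequence is recovered by taking $\arg\min_j C_T(j)$ at the final
position $T$ and following the stored minimising $k$ backwards. Restricting
$j$ and $k$ to the top $n$ matches at each position keeps the cost at
$O(Tn^{2})$ rather than growing with the square of the database size.
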
Although the HDF5 format allows for easy storage of descriptor values,
it also has drawbacks that limit the functionality of the project. One
significant problem is that it is difficult to implement parallel
processing using the library, and for this reason asynchronous processing was
not implemented in the project. An alternative method of storage may
accommodate this more easily, allowing for the speed-ups possible through
asynchronous processing. The overall design of the database management was
also relatively naive and may benefit from being replaced by a technology
such as an SQL database. This has been shown to work effectively
in work such as the CataRT project~\parencite[3]{Schwarz2006a}.
\section*{Conclusion}
Given the limited time frame for the project and the complexity of modern
approaches to this form of synthesis, only a basic implementation of CS is
presented. Nevertheless, this project has provided a functioning
Python-based CS project with much potential for further development. Given the
high number of technical issues faced with this style of synthesis (from
the big data issues faced with analysis storage, to high efficiency
requirements for processing the large quantities of data), overall this
project appears to perform to a reasonable standard.\\
With the ever-increasing quality of technology, it is predicted that new
techniques such as concatenative synthesis may grow further in popularity,
leading to an increasing number of possibilities in this area of sound
synthesis. It is hoped that this project might aid in highlighting the
possibilities offered by this form of synthesis and demonstrate some of the
technical obstacles that must be addressed to design a CS project
successfully.
\section*{Acknowledgments}
The author would like to thank A. Harker for his advice and guidance
as a mentor throughout the project, and A. Harker and P. Chen for access
to their vocal samples database. Thanks also to D. Chaplin for his
creative input in generating results.

\printbibliography
\end{document}