
\documentclass{scrartcl}
\usepackage{enumitem}
\usepackage[british]{babel}
\usepackage[style=apa, backend=biber]{biblatex}
\DeclareLanguageMapping{british}{british-apa}
\usepackage{url}
\usepackage{float}
\restylefloat{table}
\usepackage{perpage}
\MakePerPage{footnote}
\usepackage{abstract}
\usepackage{graphicx}
% Create hyperlinks in bibliography
\usepackage{hyperref}

\renewcommand{\familydefault}{\sfdefault}
\usepackage{fontspec}
\setmainfont{Arial}

\usepackage{blindtext}
\setkomafont{disposition}{\normalfont\fontsize{12}{17}\bfseries}
\setkomafont{section}{\normalfont\fontsize{12}{17}\bfseries}
\setkomafont{subsection}{\normalfont\fontsize{12}{17}\itshape}
\setkomafont{subsubsection}{\normalfont\fontsize{12}{17}\itshape}

\graphicspath{{./resources/}}
\addbibresource{~/Documents/library.bib}

\usepackage{etoolbox}
\makeatletter
\expandafter\patchcmd\csname\string\maketitle\endcsname
  {\vskip\z@\@plus3fill}
  {\vskip\z@\@plus2fill\box\abstractbox\vskip\z@\@plus1fill}
  {}{}
\makeatother

\DeclareCiteCommand{\citeyearpar}
    {}
    {\mkbibparens{\bibhyperref{\printdate}}}
    {\multicitedelim}
    {}

\begin{document}
    \title{Descriptor Driven Concatenative Synthesis Tool for Python}
    % \subtitle{\LARGE{Abstract Draft}}
    \author{Sam Perry}

    \maketitle

    \begin{abstract}
    A command-line tool and Python framework are proposed for the exploration of
    a new form of audio synthesis known as ``concatenative synthesis'': a
    form of synthesis that uses perceptual audio analyses to arrange small
    segments of audio based on their characteristics.  The tool is designed to
    synthesise representations of an input sound using a database of source
    sounds. This involves the segmentation and analysis of both the input sound
    and the database, the matching of input segments to their closest segments
    in the database, and the re-synthesis of the closest matches from the
    database to produce the final result. The project aims to provide a tool
    capable of generating high-quality sonic representations of an input, to
    present a variety of examples that demonstrate the breadth of possibilities
    that this style of synthesis has to offer, and to provide a robust
    framework on which concatenative synthesis projects can be developed
    easily.\\

    Results demonstrate the wide variety of sounds that can be produced using
    this method of synthesis. A number of technical issues are outlined that
    impeded the overall quality of results and efficiency of the software.
    However, the project clearly demonstrates the strong potential for this
    type of synthesis to be used for creative purposes.
    \end{abstract}

    \section*{Background}
    The concept of constructing a new sound by arranging collections of smaller
    sounds has gained popularity in the past 30 years through the introduction
    of ``Granular Synthesis''. Granular synthesis works on the theory that any
    sound can be described through the arrangement of smaller samples (referred
    to as ``grains''). This representation of sound allows for the temporal
    decomposition and re-arranging of real-world samples, with the potential to
    create new ``complex, dynamically-evolving
    sounds''~\parencite[p.1]{Roads1988}.\\

    Concatenative synthesis (CS) is a form of synthesis that has developed
    significantly over the past 15 years, driven by recent advancements in
    technology. Key advancements have been in easy access to large databases of
    audio and the development of methods for extracting useful information from
    these databases automatically~\parencite[p.1]{Schwarz2006a}.  CS utilises
    these technologies to provide a content-based extension to granular
    synthesis; by analysing a database of source grains, grains can be
    differentiated based on their characteristics.  These characteristics can
    then be used for grain selection in the process of synthesizing output for
    a wide range of applications~\parencite[p.102]{Schwarz2007}.

    \section*{Related Works}
    A number of programs utilize CS to achieve various goals. The process has
    been used for applications in areas such as speech synthesis, instrument
    synthesis and for applications in creative sound design.\\
    The wide range of applications demonstrates the versatility of this
    synthesis technique. It differs from traditional synthesis methods through
    the use of real recorded samples, as opposed to traditional methods that
    focus on defining sets of rules for emulating real sounds. By transforming
    samples that have been directly recorded from a source, the subtle nuances
    of the source's sound are preserved. These would be difficult to reproduce
    using other synthetic methods for modelling an
    instrument~\parencite[p.24]{Maestre2009}.

    \subsection*{Speech Synthesis}
    Creating a natural and intelligible realisation is an important factor when
    developing a speech synthesis system. The Talkapillar project is one such
    example of how highly convincing results are possible with CS. Through
    careful analysis of a vocal database, the project aims to impose the
    qualities of the database voice on an input voice. This would result in the
    words of the input speaker being transformed to appear as if they were
    spoken by the voice in the database~\parencite{Hueber}.

    \subsection*{Instrument Synthesis}
    Progress has also been made in improving the quality of instrument
    synthesis. As with speech synthesis, the direct use of samples allows for
    natural-sounding results, which provides a method for reproducing real
    instruments convincingly. Another important aspect of instrument synthesis
    is that of performer expression. The reproduction of performance qualities
    such as dynamics, timbre and timing is essential when emulating a real
    instrument, and CS has been used to reproduce these aspects effectively.
    This is achieved by splicing grains based on their expressive
    characteristics to form musical phrases.  For example, just as a violinist
    might transition seamlessly from one articulation to the next, the CS
    software will join grains to produce the variation in articulations. This
    contrasts with the traditional approach to sampling, where samples are
    played in isolation, resulting in a discontinuity between adjacent
    samples~\parencite[p.82]{Lindemann2007}.
    The Caterpillar project is one such example of this use of CS.
    By using a Viterbi algorithm, the project is able to calculate the
    smoothest overall transition between grains across the output, resulting
    in convincing synthesis of orchestral instrument
    performances~\parencite[p.5]{Schwarz2003}.

    \subsection*{Creative Sound Design}
    The flexibility of CS allows for creativity in a broader context than simply
    emulating real-world instruments and speech. It can also be used as a tool
    to explore the possibilities for synthesizing new abstract sounds for
    creative purposes.\\
    A prominent project in this area of CS is IRCAM's CataRT
    project~\parencite{Schwarz2006a}. The project focuses on the playback of
    source grains based on their proximity to a target in multi-dimensional
    descriptor space.  By providing a target point in the descriptor space, the
    user is able to navigate the database, playing selections of samples that
    are nearest to the target. This allows the user to explore the database
    intuitively through a graphical user interface, selecting a point in
    2-dimensional space with the mouse. Grains are then played back in
    real time to create an ``audio mosaic''.\\
    Alternatively, target audio can be provided and analysed to create a target
    point based on its location in the descriptor space.  Tremblay and
    Schwarz's~\citeyearpar{Tremblay2010} use of CataRT to explore
    electroacoustic sample banks demonstrates the creative potential of this
    method. CS is used in this context as a means for synthesizing matches in a
    corpus database to real-time input from an electric bass.  Significance is
    placed on linking the playback of grains to the expressivity of the
    performer. The use of perceptually based audio descriptors to match the
    source to the target allows the performer to navigate the database
    naturally based on factors such as the pitch and timbre of the bass
    guitar. The result is a performance that mixes characteristics of both the
    bass guitar output and the qualities of the corpus database to create a
    hybrid of the two.\\

    This is by no means an exhaustive overview of the projects and techniques
    that explore the vast possibilities of CS. For further information, please
    refer to ``Concatenative Sound Synthesis: The Early
    Years''~\parencite{Schwarz2006b}.

    \section*{Concatenator}
    The Concatenator project aims to provide an open-source set of tools that
    allows composers to generate a variety of CS-driven realisations for
    sound design purposes.  In addition, the project aims to provide an
    intuitive API that Python programmers can use as a foundation for
    building further concatenative synthesis applications.
    The result is a framework and command-line interface, built in Python, for
    easy access to basic CS techniques.
    The current implementation can be used for the concatenation of a source
    database onto target audio files, using a range of perceptual audio
    descriptors for matching. Database management, simple matching and
    synthesis algorithms are used to achieve this, and are described in the
    following sections.

    \section*{Program Design and Implementation}
    The Concatenator project consists of a number of components, as shown below:\\

    *INSERT Concatenator OVERVIEW DIAGRAM*\\

    Output is generated by analysing overlapping segments of audio (known as
    grains) from both the target sound and the source database, then searching
    for the closest matching grain in the source database to the target sound.
    Finally, the output is generated by applying a Hann window and
    overlap-adding the best matches. Each component will be discussed in detail
    in the following sections.\\
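
    As a sketch of this final stage, and assuming grains of length $N$ samples
    placed at a fixed hop size of $H$ samples (both symbols are introduced here
    for illustration and are not taken from the implementation), the window $w$
    and the overlap-added output $y$ can be written as
    \[
        w[n] = \frac{1}{2}\left(1 - \cos\frac{2\pi n}{N-1}\right),
        \qquad 0 \le n \le N-1,
    \]
    \[
        y[n] = \sum_{k} w[n - kH]\, g_{k}[n - kH],
    \]
    where $g_{k}$ is the source grain selected for the $k$-th target segment,
    and both $w$ and $g_{k}$ are taken to be zero outside $0 \le n \le N-1$.
    The grain length and overlap actually used are parameters of the tool and
    are not fixed by this sketch.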

    When designing the Concatenator framework, ease of development, use and
    extensibility were primary considerations. It was for these reasons that
    the framework was written in the Python programming language. Python has
    grown in popularity in the scientific community recently, primarily due to
    its focus on productivity, readability and the large number of efficient
    numeric processing libraries available (NumPy, SciPy, scikit-learn,
    etc.)~\parencite[p.11]{Fangohr2014}. This makes Python a good choice for
    quickly developing ideas in the context of audio signal processing.
    Unfortunately, the language does sacrifice processing speed for simplicity
    and as a result is less well suited to real-time signal processing than
    performance-focused languages such as C++. However, it was decided that the
    increase in productivity, the lack of prior CS research in Python and the
    author's previous experience made it the most suitable choice for this
    project.\\

    The choice to limit the project to offline processing has both positive and
    negative implications for the function of the project. A key disadvantage
    of this type of processing is that it rules out any live performance
    aspect. This method provides no way of exploring the feedback between
    performer and system in a live environment, comparable to the work of
    Tremblay and Schwarz~\citeyearpar{Tremblay2010}.
    However, there are advantages to offline processing that would not be
    possible in a real-time context.\\
    One significant advantage is that databases can afford to be far larger
    than they could in real time. Without the requirement to process output in
    a short period of time, more time can be taken to search vast databases in
    the hope that the closest match to a target will be found.\\
    Another advantage is in the global view of a target that can be taken in an
    offline approach. Because the complete audio file is available from the
    start of processing, techniques can be applied that consider the output as
    a whole rather than on a grain-by-grain basis. This allows for algorithms
    such as the Viterbi algorithm to find the sequence of grains that provides
    the best continuity, as demonstrated in the Caterpillar
    project~\parencite[p.4]{Schwarz2003}. This would not be possible in
    real time, as audio is processed on the fly.\\

    An additional consideration was the method to be used for controlling the
    target to be matched to. It was decided that the most interesting results
    would be produced through the matching of grains to a target audio file, as
    opposed to other approaches such as matching to MIDI scores. In this sense
    the project is a form of offline audio-mosaicking tool similar to
    CataRT.

    \subsection*{Descriptor Implementation}
    In order to differentiate between grains, a number of audio descriptors
    were implemented. Audio descriptors are used to measure a specific
    characteristic of a signal~\parencite[p.31]{Lerch2012}. For example, an RMS
    descriptor was implemented to give an indication of the overall intensity
    of the grain. Another example is the F0 descriptor implemented to give a
    value relating to pitch for harmonic grains. These values could then be
    used by the matching algorithm in order to find the best match between the
    source and target grains. A full description of all descriptors implemented
    can be found in the Concatenator documentation.\\
    Due to time constraints on the project, only a limited number of basic
    descriptors were implemented. For this reason, it was ensured that new
    descriptors could be added easily to the project. The object-oriented
    design of the descriptors provides the potential for quick development of
    any future descriptors to be added to the project.
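
    For illustration, the RMS descriptor mentioned above is conventionally
    defined for a grain $x$ of $N$ samples as
    \[
        x_{\mathrm{RMS}} = \sqrt{\frac{1}{N}\sum_{n=0}^{N-1} x[n]^{2}}.
    \]
    This is the standard textbook definition; the notation is introduced here,
    and the implementation may apply further scaling (for example, conversion
    to decibels).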

    \subsection*{Database Design}
    When generating descriptors for large databases, large amounts of data are
    produced, and so an efficient method of storing and retrieving the data was
    needed. The Python interface to the HDF5
    file format~\parencite{Collette2016} was chosen for its simplicity and
    ability to compress the data automatically. Storing NumPy arrays of
    descriptors in groups allowed for quick and easy access to analyses from a
    single, organised source.

    \subsection*{Matching Algorithms}
    In order to match grains using the descriptor values, a matching algorithm
    was required. Initially a brute force matcher was used to compare each
    descriptor value in the target to all values of the same descriptor type in
    the source. However, it quickly became apparent that this approach would be
    far too slow, particularly for larger databases.\\
    For this reason, a k-dimensional tree (k-d tree) search algorithm was used in an
    effort to improve matching efficiency.  This approach produced the same
    results as the brute force matcher, but by arranging descriptors in a tree
    structure, a far more efficient search to find the best match was possible.
    This reduced matching time considerably.
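
    Both matchers can be viewed as solving the same nearest-neighbour problem.
    As a sketch, assuming that each grain is summarised by a vector of $D$
    descriptor values and that closeness is measured by Euclidean distance (the
    notation and the choice of metric are assumptions made here for
    illustration), the source grain chosen for target grain $t$ is
    \[
        m(t) = \arg\min_{1 \le s \le S}
        \left\| \mathbf{d}^{\mathrm{tgt}}_{t} - \mathbf{d}^{\mathrm{src}}_{s} \right\|_{2},
    \]
    where $S$ is the number of source grains. A brute-force search evaluates
    all $S$ distances for every one of the $T$ target grains, giving a cost of
    $O(TS)$. A k-dimensional tree is built once over the source vectors and
    then answers each query in roughly $O(\log S)$ time on average when $D$ is
    small, which is consistent with the reduction in matching time described
    above.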

    \subsection*{Synthesis and Transformations}
    The final step in the program is to synthesize the matched output.
    This process consisted of:
    \begin{enumerate}
        \item Retrieving the best grain matches returned by the matching algorithm
        \item Applying a window function
        \item Overlapping the grains
        \item Transforming grains to match target
        \item Saving the result to a file
    \end{enumerate}
    Initially, grains were not transformed to better match the target.  This
    worked effectively for large databases; however, it was observed that
    results synthesised using small databases were of a lower quality, as the
    chance of finding a closely matched grain was lower. To account for this,
    methods for altering grains to better match their target were implemented.
    It was decided that the two most significant characteristics to alter were
    the pitch and intensity of the grains.  By scaling the grains by the
    difference between the source and target RMS, it was possible to impose a
    closer intensity on a grain. Likewise, by shifting the pitch of a grain by
    the difference between the source and target F0, it was possible to better
    match the pitch contour of the output to that of the target audio.  This
    improved the results significantly for smaller databases, as poor matches
    could be improved to match the target more convincingly.
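
    These two transformations can be sketched as follows. The text describes
    both corrections in terms of a difference; the ratio forms below are one
    common way of realising them and are an assumption, as are the symbols.
    For a source grain $x$ matched to a target grain, with RMS values
    $r_{\mathrm{src}}$ and $r_{\mathrm{tgt}}$ and fundamental frequencies
    $f_{\mathrm{src}}$ and $f_{\mathrm{tgt}}$,
    \[
        \tilde{x}[n] = \frac{r_{\mathrm{tgt}}}{r_{\mathrm{src}}}\, x[n],
        \qquad
        \Delta p = 12 \log_{2} \frac{f_{\mathrm{tgt}}}{f_{\mathrm{src}}},
    \]
    where $\tilde{x}$ is the intensity-corrected grain and $\Delta p$ is the
    pitch shift in semitones. For example, a source grain at 220\,Hz matched
    to a target at 440\,Hz would be shifted by $+12$ semitones, and a source
    grain with half the RMS of its target would be scaled by a factor of 2.
    Silent or unpitched grains ($r_{\mathrm{src}} = 0$, or no F0 estimate)
    would need separate handling.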

    \subsection*{Command-Line Interface}
    In order to make the framework accessible to users, a command-line interface
    was developed. By supplying arguments to the program, users could alter
    parameters and experiment freely with the tool.  Although this interface
    was sufficient for testing and experimentation, it quickly became apparent
    that there were too many parameters to pass to the program via the
    command-line interface on each run. A configuration file parser was created to
    address this issue, allowing users to specify default parameters that would
    be used by the program on each run. The combination of these interfaces
    provided an effective means for accessing all of the framework's features.

    \subsection*{Documentation and API}
    In order to make the project as user friendly as possible for both
    developers and users, a significant amount of time was spent documenting
    the code properly. As a result, a full API is available alongside examples
    of use. This was written in the hope that it might form a usable package
    that developers can build on quickly and effectively to build other CS
    projects, allowing for easier access to Python based CS than is currently
    available. The command line interface is equally documented to allow users
    to create their own realisations quickly and easily so that this project
    may be used for creative sound design purposes.

    \section*{Results and Evaluation}
    Overall, results generated by this project showed promise; a variety of
    transformations were generated using open-source instrument databases to
    demonstrate the project's potential for sound design applications. This
    tested the project's ability to convincingly impose qualities of an
    instrument onto target sounds.

    In retrospect, a great deal of time was spent trying to improve the
    efficiency of the project. Although this was necessary, as initial tests
    were not feasible on most databases, it had a negative impact on the time
    available for developing perceptual qualities of the output. As a result of
    this, the overall quality of output may perhaps not be as high as that of
    other projects in this area.
    A further issue is the high computation required, resulting in
    large amounts of time needed to produce high-quality results. An end user
    may not have the patience required to reach the quality of results that
    might be possible. However, fundamental concepts used
    in the most sophisticated CS projects, such as descriptor matching and
    transforming matches to better fit the target, have been implemented in this
    project to reasonable effect. As a proof of concept, this project displays
    the possibilities for CS in Python and there is evidently potential for
    further development in this area.

    \section*{Research Limitations/Potential Development}
    There are a number of further improvements that could be made to this
    project in order to improve the quality of results and extend its overall
    usefulness. Some initial ideas for improvements are detailed in this
    section. These range from reasonably simple modifications that could not be
    implemented purely due to time constraints, to more complex ideas that may
    take a considerable amount of work.\\

    The current implementation uses only a small and relatively basic subset of
    the audio descriptors available. This limits the analysis of audio and thus
    the quality of matches. Using a larger set of more advanced descriptors may
    improve quality from this perspective. One way would be to incorporate the
    open-source Essentia audio descriptors~\parencite{Essentia2016}, giving the
    project access to a vast quantity of descriptors for analysis.\\

    Replacing the Hann window function used for grain windowing with a short
    cross-fade at grain overlaps should reduce amplitude modulation, resulting
    in smoother transitions between grains. This might be further improved
    by calculating the point of maximum similarity by cross-correlating
    overlapping sections, as described by~\textcite[pp.~191--193]{Zolzer2011} in
    the SOLA algorithm.\\

    A lack of continuity between grains was observed in results, most likely
    due to the lack of any comparison between selected grains. A Viterbi algorithm
    could be used to account for this, allowing for a search to be done amongst
    the top matches to find the optimal set of grains. This takes advantage of
    the offline nature of the project and has been shown to work effectively in
    the Talkapillar project~\parencite{Hueber}.

    Although the HDF5 file format allows for easy storage of descriptor values,
    it also has drawbacks that limit the functionality of the project. One
    significant problem is that it is difficult to implement parallel
    processing using the library and for this reason asynchronous processing was
    not implemented in the project. An alternative method of storage may
    accommodate this more easily, allowing for the speed-ups possible through
    asynchronous processing. The overall design of the database management was
    also relatively naive and may benefit from being replaced by a technology
    such as an SQL database or similar. This has been shown to work effectively
    in work such as the CataRT project~\parencite[p.3]{Schwarz2006a}.

    \section*{Conclusion}
    Given the limited time frame for the project and complexity of modern
    approaches to this form of synthesis, only a basic implementation of CS is
    presented. Nevertheless, this project has provided a functioning
    Python-based CS project with much potential for further development. Given
    the high number of technical issues faced with this style of synthesis
    (from the big data issues faced with analysis storage, to high efficiency
    requirements for processing the large quantities of data), overall this
    project appears to perform to a reasonable standard.\\
    With the ever-increasing quality of technology, it is predicted that new
    techniques such as concatenative synthesis may grow further in popularity,
    leading to an increasing number of possibilities in this area of sound
    synthesis. It is hoped that this project might aid in highlighting the
    possibilities offered by this form of synthesis and demonstrate some of the
    technical obstacles that must be addressed to design a CS project
    successfully.

    \section*{Acknowledgments}
    The author would like to thank A. Harker for his advice and guidance
    as a mentor throughout the project, and A. Harker and P. Chen for access
    to their vocal samples database.  Thanks also to D. Chaplin for his
    creative input in generating results.

    \printbibliography
\end{document}