Measuring the Accuracy of Microtonal Synthesizers: A Detailed Analysis of Pianoteq & Vogue


Tuning accuracy
Fundamental Frequency
Octave Stretching
Measuring methodology
Physical Modeling & Virtual Synthesizer

A condensed version of this paper was presented at DAGA 2013.

by Timour Klouche, Teresa Samulewicz and L. Jakob Bergner
(Staatliches Institut für Musikforschung PK, Berlin, Germany)


This paper provides a detailed analysis of the tuning accuracy of two microtonal synthesizers, i.e. sound generators that can tune musical pitches in ways other than the ubiquitous 12 equal steps per octave.
It is a follow-up study that broadens the experiment we presented at DAGA 2012 by validating the systems as a whole with a streamlined methodology. This time we take a detailed look at the synthesizers across the complete ambitus of a large concert grand piano in several temperaments. We also consider octave stretching as well as the frequency trend and range of each tone.
The measuring system and both synthesizers were selected from the more extensive precursor study: Praat was chosen because it matched the expected values best, Pianoteq for its musical realism, and Vogue for its accuracy.
This study is relevant to all uses of microtonal synthesizers, including but not limited to scientific, experimental and artistic ones. The study focuses on methodology in order to avoid human influence on the measurements as much as possible.

Setup and Methods

Stimuli / Generators
In order to reduce the variables and the experimenters' influence to a minimum, we decided on a controlled experimental setup (cf. figure 1). Accordingly, we created two data files that can be regarded as stimuli, in a behavioristic or rather cybernetic sense, driving the tone generators. One of these files is a MIDI file generated with Finale that provides a chromatic scale over eight octaves with a 6 s signal/silence interval. The other file is a tuning definition in Scala format. Here, two considerably disparate tunings were chosen: 12-tone equal temperament (12-tet) and the 2/7-comma meantone scale published by Zarlino in 1558 (ZarMean). The resulting audio files, i.e. the behavior of the generators, are what is measured. The generators are Pianoteq, a virtual physical piano model, and Vogue, a software synthesizer providing subtractive synthesis. In Pianoteq a D4 grand piano model was chosen; for Vogue we used a simple sine without any envelope or modulation, expecting this to yield more precise measurements. As an alternative to the Scala-driven tuning instructions we also tested comparable “internal” tuning presets provided in both synthesizers. For each tuning instruction in Pianoteq we additionally generated six stimulus series, each with a different octave stretching.
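The reference values for the two Scala-driven tunings can be sketched as follows. This is a minimal illustration, not the authors' tuning files: the concert pitch A4 = 440 Hz, the MIDI numbering (C0 = 12, C8 = 108) and the position of the wolf fifth (chain of fifths from Eb to G#) are assumptions that the text does not state.

```python
import math

A4_HZ = 440.0                # assumed concert pitch (not stated in the paper)
C0_MIDI, C8_MIDI = 12, 108   # assumed MIDI numbering with C4 = 60


def cents(ratio):
    """Size of a frequency ratio in cents."""
    return 1200.0 * math.log2(ratio)


def tet12_frequency(midi_note, a4_hz=A4_HZ):
    """12-tone equal temperament frequency of a MIDI note."""
    return a4_hz * 2.0 ** ((midi_note - 69) / 12.0)


# 2/7-comma meantone: every fifth is narrowed by 2/7 of the syntonic comma (81/80).
SYNTONIC_COMMA = cents(81 / 80)
MEANTONE_FIFTH = cents(3 / 2) - (2.0 / 7.0) * SYNTONIC_COMMA


def meantone_degrees_cents(fifth=MEANTONE_FIFTH):
    """Twelve scale degrees above C, built from a chain of fifths Eb..G#.

    k counts fifths away from C (Eb = -3, G# = +8); this wolf position
    is an assumption made for the sketch.
    """
    return sorted((k * fifth) % 1200.0 for k in range(-3, 9))


# Chromatic 12-tet reference over eight octaves, C0 to C8.
reference_hz = [tet12_frequency(n) for n in range(C0_MIDI, C8_MIDI + 1)]
zarmean_cents = meantone_degrees_cents()
```

Under these assumptions the chromatic scale comprises 97 reference frequencies, and the meantone fifth comes out roughly 6 cents narrower than the pure fifth.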

Figure 1: Schema of the experimental setup

Tuning Instructions
The idea of testing both external (Scala files with 11-digit decimal precision) and internally represented tuning instructions is twofold. First, we wanted common, controlled microtonal stimuli for the tone generators. Second, the question arose whether both are treated equally, as should be assumed. 12tet serves as the external equivalent of the always present internal default tuning (equal). Additionally, Pianoteq offers an alternative to this default tuning, labeled flat tuning, in which octave stretching is supposedly disabled.
ZarMean was meant to be analogous to the internal Zarlino tuning of Pianoteq. A pretest analysis revealed that these are different tunings, and an extensive search in microtonal databases showed no match of this Pianoteq representation with any tuning attributed to Zarlino.

MATLAB Preprocessing
Preprocessing is done with a MATLAB script that segments the generated audio files into parts each comprising a single tone of the scale, resulting in 97 segments. This treats each note of the audio files equally and minimizes the experimenters' influence on the measurements. Segmentation is based on a threshold-dependent RMS estimation with adjustable parameters such as window size and threshold, plus fixed durational offsets to compensate for high-threshold artifacts.
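The general idea of threshold-dependent RMS segmentation can be sketched as follows. The authors' MATLAB script is not reproduced in the paper, so this Python version is only an illustration under assumptions: non-overlapping windows, a single fixed threshold, and an offset expressed in whole windows. All function and parameter names are hypothetical.

```python
import math


def frame_rms(samples, window):
    """RMS of consecutive, non-overlapping windows (assumed framing)."""
    return [
        math.sqrt(sum(x * x for x in samples[i:i + window]) / window)
        for i in range(0, len(samples) - window + 1, window)
    ]


def segment_by_rms(samples, window, threshold, offset_frames=0):
    """Return (start, end) sample indices of runs whose RMS exceeds threshold.

    offset_frames widens (positive) or narrows (negative) every segment
    by a fixed number of windows at both ends.
    """
    rms = frame_rms(samples, window)
    segments, start = [], None
    for i, value in enumerate(rms + [0.0]):  # sentinel closes a trailing run
        if value > threshold and start is None:
            start = i
        elif value <= threshold and start is not None:
            first = max(0, start - offset_frames)
            last = min(len(rms), i + offset_frames)
            if last > first:
                segments.append((first * window, last * window))
            start = None
    return segments


# Synthetic example: two sine bursts separated by silence.
tone = [math.sin(2 * math.pi * 0.1 * n) for n in range(200)]
signal = [0.0] * 100 + tone + [0.0] * 100 + tone + [0.0] * 100
segments = segment_by_rms(signal, window=50, threshold=0.1)
widened = segment_by_rms(signal, window=50, threshold=0.1, offset_frames=1)
```

In this toy signal each burst yields exactly one segment; with a real piano recording, resonances that decay slowly around the threshold can split or merge segments, which is the kind of artifact the fixed offsets are meant to compensate.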

Praat Analysis
Because of the quantity of research material we batch-processed the files using Praat's scripting function. This also ensures consistent conditions in the analysis of every single tone. The script determines the average, minimum, maximum, and range of the fundamental frequency of the tone in each file and collects these data in one text file.
Regrettably, this automated analysis does not allow us to detect (supposedly) erroneous measurements or other failures at a glance. On the other hand, visualizing the measured data in a spreadsheet application reveals possible errors through outliers or a suspiciously wide range within some tones.
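The per-tone statistics described above can be illustrated with a short sketch. It does not reproduce the Praat script; it assumes that an F0 track in Hz is already available for one segment and that deviation and range are expressed in cents via the usual logarithmic conversion. The function name and the sample values are hypothetical.

```python
import math


def tone_statistics(f0_track_hz, reference_hz):
    """Average, minimum, maximum and range of one tone's F0 track.

    The deviation of the mean from the reference and the max/min range
    are additionally converted to cents.
    """
    mean_hz = sum(f0_track_hz) / len(f0_track_hz)
    lo, hi = min(f0_track_hz), max(f0_track_hz)
    return {
        "mean_hz": mean_hz,
        "min_hz": lo,
        "max_hz": hi,
        "deviation_cents": 1200.0 * math.log2(mean_hz / reference_hz),
        "range_cents": 1200.0 * math.log2(hi / lo),
    }


# Hypothetical F0 track around A4 (values invented for illustration).
stats = tone_statistics([439.0, 440.0, 441.0], reference_hz=440.0)
```

A tone whose mean sits exactly on the reference can still have a range of several cents, which is why the range is treated as a separate indicator of measurement reliability.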


Pretesting the Segmentation Script / Preprocessing Parameters
Pretests of the segmentation were carried out to search iteratively for optimal parameters that capture most of the relevant signal parts. Some of the audio files to be measured produced isolated erroneous segments. Interestingly, in the case of Pianoteq these appear to be sensitive to fine-tuning. The segmentation errors occur on different pitches in different tunings and octave stretching settings in a non-systematic manner. A possible explanation is that even small deviations in fine-tuning lead to a build-up of resonances that exceeds the threshold of the segmentation preprocessing. In the end, no setting could be found that eliminated all artifacts, so the isolated segments had to be deleted manually.

Variation of Parameters in Praat
Because of the batch handling in Praat, we searched for one measurement setting that fits all pitches to be analyzed. First of all, the pitch range was set to 10-5000 Hz.
As we expected, the accuracy of the measured test scale seems to vary over the whole bandwidth. The range and the deviation from the expected values indicate a higher error probability in the very low and the very high pitches. This could indeed be improved by narrowing the frequency range to be measured. However, batch handling and methodological reasoning require one setting for the whole range of 10-5000 Hz.
Regarding the other measuring parameters, the pretest yielded the following findings:
  • The cross-correlation method leads to many more measuring errors
  • Autocorrelation is more exact, but the first three pitches (C0, C#0, D0) were not detected before the segmentation parameters were modified (and still seem to be erroneous)
  • Variations in “Voicing Threshold”, “Silence Threshold” and “Octave cost” never led to refined results for the whole ambitus. Some frequency areas seem to be measured slightly more accurately, but more failures then appear in other frequency areas. This also applies to the “very accurate” setting.
That is why we decided to use the autocorrelation method with the default settings and a range of 10-5000 Hz for the main experiment.

“Axiomatic” Dependencies: Variations of Some Fundamental Parameters
In this part of the pretest we wanted to test whether the measurements also depend on factors “external” to the experimental setup. These factors are commonly not taken into account, mainly in good faith that altering them makes no difference, i.e. they are treated as unquestioned axioms. To quantify the effects these factors may have on the measurements, we systematically altered our test files, the segmentation parameters, the software version and the computer platform. We found the following dependencies, which can be viewed as factors in the function of pitch. It should be noted that we measured musical pitch on the signal description level, not a psychoacoustic interpretation thereof.

-- Level --
The audio test files were changed in level by +/-10 dB and, after passing through the same chain of processing steps, compared with the unchanged version. Below are some statistics - maximum deviation and mean absolute deviation (MAD) - of these differences for a -10 dB Pianoteq stimulus (flat tuning) to exemplify the effect on pitch. This finding is technically not surprising, as the analysis involves two level-dependent interpretations of time-variant pitch, one in the MATLAB preprocessing and one in the Praat analysis. However, the amount of dependence is surprising. This could be of high relevance for validating the tuning accuracy of a given system.
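The two statistics named above, maximum deviation and MAD, can be computed between two measurement series as in the following sketch. It assumes that both series hold one mean F0 value in Hz per tone and that differences are expressed in cents; the example frequencies are invented for illustration and are not the paper's data.

```python
import math


def cents_between(f_a, f_b):
    """Interval from f_b to f_a in cents."""
    return 1200.0 * math.log2(f_a / f_b)


def compare_series(measured_hz, baseline_hz):
    """Mean absolute deviation and maximum deviation (both in cents)
    between two series measured on the same tones."""
    diffs = [abs(cents_between(m, b)) for m, b in zip(measured_hz, baseline_hz)]
    return {"mad_cents": sum(diffs) / len(diffs), "max_cents": max(diffs)}


# Hypothetical example: three tones measured at the original level
# and again after a level change (values invented for illustration).
baseline = [220.0, 440.0, 880.0]
attenuated = [220.0, 440.0 * 2 ** (1 / 1200), 880.0 * 2 ** (-2 / 1200)]
result = compare_series(attenuated, baseline)
```

Because the deviations are taken as absolute values, a series that is sharp on some tones and flat on others does not average out to zero.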

The influence of signal level on measured pitch is less obvious in the case of our Vogue sine signals. These values are still remarkable, because this dependence is not to be expected with a supposedly time-frequency constant signal.

-- Duration / Analyzed Time Period --
With this pretest we wanted to track possible influences of audio file length on measured pitch. This was done by altering the duration parameter in our MATLAB segmentation script, which appends or deletes a fixed offset duration before and after the threshold-dependent points and thus changes the starting and ending times of the audio files to be measured. We decided to shorten the signals with the aim of cutting transients and cropping to quasi-stationary areas. The standard version is prolonged by 0.8 s. In addition, segmentation was tested for Vogue with offsets of -0.3 s and -0.6 s; the latter means that 0.6 s is cut from both the beginning and the end of the signal. Unfortunately, high-pitched Pianoteq signals were too short, so 0.2 s is the largest possible duration reduction for Pianoteq. In addition to this obstacle, Praat is not able to compute pitch with the chosen measurement setting on some shortened notes from E7 upwards.
Below are exemplary comparative statistics for the Pianoteq test signal shortened by 0.2 s (pitch range D0-D#7; flat tuning; excluding the strongest outliers) and the Vogue sine signals shortened by 0.6 s.

Frequency deviations increase particularly at the beginning and the end of the Vogue signals. It is highly remarkable to find such a durational dependence, i.e. an influence of the transients, even for supposedly time-frequency constant sine signals.

-- Operating System and Software Versions --
In this part of the pretest we compared measurements made in exactly the same experimental setup, exchanging only the computer's operating system or the software revision of the measurement system and leaving everything else untouched.
Regarding the software revision, we compared the measurements of Praat (Windows) version 5.2.33 with those of version 5.3.23. The F0 MAD is 4.7*10^-5 cent and the maximum deviation approximately 0.002 cent. The measured pitch ranges differ on average (MAD) by 2.5*10^-4 cent, with a maximum of roughly 2.1*10^-3 cent.
The residuals for the exchange of the operating system (Windows vs. Mac) amount to an F0 MAD of approximately 4.7*10^-5 cent with a maximum of roughly 2.1*10^-3 cent. The difference of the ranges is higher: the MAD is about 2.6*10^-4 cent and the maximum 0.012 cent. Two observations hold true for all comparisons: first, some residuals are exactly zero, and second, the note G#7 produces all the maxima.
A possible explanation for these surprising findings could lie in differences in the representation and handling of numbers between software versions and operating systems.


For each tuning - equal, flat, 12tet, Zarlino and ZarMean - we tested the six mentioned stretching levels provided by Pianoteq. In almost all of these 30 test series the first two pitches show noticeable differences from the respective reference frequency. Praat averages frequency over several measuring points. For C0 and C#0 these single points were measured very differently, resulting in an unrealistically wide frequency range, which indicates a potentially unreliable frequency mean.
Taking the range as an indication of accuracy, our measurements are consistently accurate for the ambitus from D0 to C4 (mean frequency range about 4 cents), while outside this region the frequency range increases severely, up to 4577 cents. The mean frequency range over the complete ambitus of all test series is about 100 cents. In all test series, regardless of the stretching, the range curves run largely parallel, sharing all their local maxima and minima. Correlation in all combinations of stretching levels within all tunings is very high (0.99) except for flat, where correlation between the stretching series is about 0.65 to 0.99 (mean: 0.81).
In real pianos, octave stretching is applied to compensate for the inharmonicity of the strings: the fundamental frequency of the upper octave is matched to the first overtone of the lower note. The amount of stretch therefore depends on string length and gauge. This tuning convention is usually applied to the lowest and highest two octaves of a grand piano. Pianoteq provides stretching settings on a scale from 0.95 to 3. However, it turned out that these settings are not related to inharmonicity in terms of an increasing or decreasing frequency of the strings' overtones. The strings are implemented as ideal harmonic strings regardless of the stretch setting. With respect to the dependence of octave stretching on the strings' inharmonicity, the Pianoteq model thus differs severely and obviously from reality.
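The link between inharmonicity and octave stretching can be made concrete with the textbook stiff-string approximation, in which the n-th partial lies at n * f0 * sqrt(1 + B * n^2) for an inharmonicity coefficient B. This model is not taken from the paper and is used here only as an assumption for illustration; the value of B in the example is hypothetical.

```python
import math


def partial_frequency(n, f0_hz, b_coeff):
    """n-th partial of a stiff string: n * f0 * sqrt(1 + B * n^2)."""
    return n * f0_hz * math.sqrt(1.0 + b_coeff * n * n)


def octave_stretch_cents(b_coeff):
    """Cents by which the second partial exceeds a pure octave above the first.

    Tuning the upper note to this partial stretches the octave by the
    same amount; an ideal harmonic string (B = 0) needs no stretch.
    """
    first = partial_frequency(1, 1.0, b_coeff)
    second = partial_frequency(2, 1.0, b_coeff)
    return 1200.0 * math.log2(second / (2.0 * first))


ideal = octave_stretch_cents(0.0)        # harmonic string
stiff = octave_stretch_cents(0.0004)     # hypothetical B, for illustration only
```

In this model the stretch vanishes for B = 0, so a synthesizer with ideal harmonic strings would have no physical reason for any stretching; this is the mismatch described above.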
Considering the deviation of all six series from their reference frequencies, the resulting curves show a positive deviation in the high frequencies due to the stretching. The deviation increases up to 160 cents for equal, 12tet, ZarMean and Zarlino tuning, in each case starting from about A4. This magnitude seems exceptionally high. Even the smallest possible stretching level (0.95) still shows a distinct deviation of about 60 cents in the high frequencies. Unexpectedly, even the flat tuning, which is explicitly supposed to provide no stretching at all, shows an increasing deviation of up to 90 cents. Moreover, the pitches below about A4 systematically differ from the reference in such a way that the octave stretching levels form a radial shape extending over the whole ambitus (cf. figure 2). Thus, the flat curves strongly deviate from their nominal values.

Figure 2: Measurements of all tested stretching levels in flat tuning (deviation from reference values in cents). The y-axis is focused on a range between -120 and 120 cents in order to visualize details in the middle of the ambitus more clearly.

In the lower frequencies, the equal, 12tet, ZarMean and Zarlino measurements do not show the negative deviation we would expect from octave stretching. There is only a comparatively small negative deviation of at most 5 cents for the stronger stretching levels (1.5, 2.0, 2.5 and 3.0) and even a positive deviation for the 0.95 and 1.0 stretching levels, so that the frequency curves form a U-shape. Below C1 the measured deviations rise up to 780 cents, which probably reflects a measuring error as mentioned above.
Comparing the measurements of the internal and external (Scala-controlled) tuning instructions (equal - 12tet and ZarMean - Zarlino), there is a very high correlation (0.97 and more) of the deviations from the reference values for each stretching level. However, even the correlation between the equal and the Zarlino tuning is still above 0.97. These high values are probably due to the large range of values: a few huge outliers dominate the correlation, so that the small deviation values in the middle part of the ambitus carry too little weight. Nevertheless, visual inspection also suggests that the deviation curves run very similarly overall in equal, 12tet, ZarMean and Zarlino (cf. figures 3-6).
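How a few large outliers can dominate a correlation coefficient is shown in the following sketch. The deviation values are invented for illustration and are not the paper's data; the Pearson coefficient is implemented by hand so that the example has no dependencies.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long series."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical deviation curves in cents: the first five values are the
# "middle ambitus" and run in opposite directions, while the last two are
# large outliers that both curves share.
curve_a = [1.0, -1.0, 2.0, -2.0, 0.0, 700.0, 160.0]
curve_b = [-1.0, 1.0, -2.0, 2.0, 0.0, 690.0, 150.0]

r_all = pearson(curve_a, curve_b)          # outliers included
r_mid = pearson(curve_a[:5], curve_b[:5])  # middle values only
```

With the outliers included the coefficient is close to 1 although the middle values are anticorrelated, which illustrates why a high overall correlation says little about the fine structure in the middle of the ambitus.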

Figure 3: Measurements of all tested stretching levels in equal tuning in Pianoteq (deviations from reference values in cents)

Figure 4: Measurements of all tested stretching levels in 12tet tuning in Pianoteq (deviations from reference values in cents)

Figure 5: Measurements of all tested stretching levels in ZarMean tuning in Pianoteq (deviations from reference values in cents)

Figure 6: Measurements of all tested stretching levels in Zarlino tuning in Pianoteq (deviations from reference values in cents)

Selected detailed analyses
The situation with Pianoteq is very complex because of its many tunings and octave stretching levels. No single region stands out consistently across all cases. For the internal equal tuning, a relatively homogeneous region is present from G#1 to F#5 with absolute F0 deviations below 5 cents (cf. table 1 below). The other tunings, including all tested octave stretching levels, are relatively similar in this region with regard to amount and tendency, except for the flat tuning. In the flat tuning, the tendency of F0 deviation to increase with stronger stretching levels seems especially pronounced. The mean range of deviation, on the other hand, interestingly remains constant over all tunings and stretching levels, averaging roughly 1.8 cents. The F0 MAD averages roughly 3.4 cents in this region.

Table 1: Selected statistics of Pianoteq's tunings and octave stretching levels in the region G#1 to F#5: MAD of F0 deviations from reference values and range in cents

In Vogue we measured the fundamental frequency of the equal, 12tet and ZarMean tunings. The deviation from the reference tuning is considerably smaller than in the Pianoteq measurements.

Figure 7: Measurements of all tested tuning instructions in Vogue (deviations from reference values and range in cents)

In the very low and very high frequency regions, both the deviation from the reference in cents and the range within each tone increase. Peaks of the range curve coincide exactly with the outliers in frequency deviation, which supports the assumption that those outliers are erroneous.
Remarkably, the equal and 12tet tunings were measured as exactly identical except for a few low pitches. Even the high-pitched outliers are the same. The deviations from the reference for ZarMean also form a very similar curve. These observations suggest that there is a systematic pitch-dependent error, probably in the sound synthesis.
A further abnormality discovered by visual examination is that some pitches within the middle ambitus have a few slight overtones. Together with the stronger deviation from the reference frequency at the beginning and end of the tone noted above, this shows that Vogue evidently fails to generate an ideal sine.

Selected detailed analyses
An in-depth look at the deviations of the measured F0 from the expected values singles out some interesting regions. A relatively precise continuous region with errors below 0.1 cent is present from G1 to A3. Errors exceeding 1 cent, on the other hand, occur exclusively in the highest octave (E7, A7 and C8), in accordance with the trend of increased error probability for higher pitches. These findings hold true for measurements of both the 12-tet and the equal tuning. The single exception is an extension of the precise area by two additional tones in the 12-tet tuning. ZarMean measures slightly less precisely in the mentioned area, with three notes exceeding 0.1 cent, but generally follows the same trend, including the very same outlier notes.
Regarding the range there are comparable regions, too: the most precise continuous region extends from G1 to F#4 with values below approximately 4 cents (mean approximately 3 cents). The exceptions are two additional notes for 12-tet and two outliers in ZarMean. The four peak ranges again occur exclusively in the highest octave: C7, E7, A7 and C8 in all three tested tunings.

Comparing both Systems
To estimate how accurate a given microtonal system is as a whole, we calculate statistical measures on the markers discussed above, such as range and F0 deviation from the expected values. It must be noted that these measures validate not only the tested generator but also the accuracy of the measurement (and involve the axiomatic dependencies). If the whole ambitus is taken into account, it is self-evident that all errors and outliers go into the result. Due to averaging over the whole ambitus this provides good unbiased estimates of the tone generators' general behavior.
Depicted below are exemplary statistics for both systems (tables 2 and 3). Vogue measurements are relatively robust with respect to the tested tunings. The highest deviation from the expected values is 1.4 cents (or 3.4 Hz). The F0 MAD is roughly 0.14 cents. The maximum range, notably, amounts to approximately 52 cents due to isolated outliers. Accordingly, the MAD of the range gives a better result of roughly 5.3 cents over the complete ambitus.
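Because cents are a logarithmic unit, the same deviation in cents corresponds to very different deviations in hertz depending on the pitch. The following sketch shows the conversion; the reference frequencies assume 12-tet with A4 = 440 Hz, which the text does not state, and the choice of notes is only illustrative.

```python
def cents_to_hz(deviation_cents, reference_hz):
    """Absolute deviation in Hz for a deviation in cents at a given pitch."""
    return reference_hz * (2.0 ** (deviation_cents / 1200.0) - 1.0)


# Assumed 12-tet reference frequencies (A4 = 440 Hz).
A4_HZ = 440.0
C8_HZ = 4186.01

at_a4 = cents_to_hz(1.4, A4_HZ)   # deviation of 1.4 cents in the middle register
at_c8 = cents_to_hz(1.4, C8_HZ)   # same deviation at the top of the ambitus
```

Under these assumptions, 1.4 cents amounts to about 0.36 Hz at A4 but about 3.4 Hz at C8, so a pairing of 1.4 cents with 3.4 Hz would be consistent with a tone in the highest octave.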
The overall statistics of Pianoteq are comparably homogeneous despite outliers in five different tunings. Below are exemplary statistics (cf. fig. 8) of the internal flat tuning with octave stretching set to 1.0. Pianoteq generally shows lower measured accuracy than Vogue, confirming the expectation we had before conducting the experiment. Outliers, the time variability of pitch due to the physical model employed, and omnipresent octave stretching all contribute to this finding. It goes without saying that the absolute deviation values are accordingly excessive (maximum range almost 46 halftones, maximum F0 deviation more than 7 halftones). Despite these obstacles, we measured Pianoteq with an overall F0 MAD of roughly 30 cents. The mean range of deviation is roughly 2 halftones. The most accurate tuning, excluding the main outliers C0, C#0, and G7, gives an F0 MAD of 4.3 cents (range 7.8 cents).

Table 2: Overall statistics for Vogue: MAD, min and max F0 deviation from reference values and range in cents and hertz

Table 3: Overall statistics for Pianoteq: MAD, min and max F0 deviation from reference values and range in cents and hertz

Figure 8: Comparison of mean overall measurements for Vogue (all tunings) and Pianoteq (flat, octave stretching = 1): F0 deviation from reference values and range in cents


Statistically speaking, the measured pitch accuracy of the analyzed microtonal synthesizers is, on average, on the order of a thousandth of a halftone for Vogue and one third of a halftone for Pianoteq. This can be improved by hand-selecting regions within the ambitus and eliminating assumed measuring errors: the accuracy thus gained is 3.5 cents over four octaves for Pianoteq, which brings this system, too, into the realm of reasonable microtonality.
More meaningful for the practical use of these systems in research or art is the actual deviation, as opposed to statistical accuracy probabilities. This is most important when one relies on a system without being able to validate it beforehand, or even uses its output for generalized analytical statements.
From this perspective, we have to deal with possible deviations of up to 1.5 cents (Vogue) and more than 700 cents, i.e. 7 halftones (Pianoteq). Even more critical are the measured ranges per note, which are on the order of a tenth of a halftone for the most accurate system and almost 46 halftones for the worst. Furthermore, octave stretching within Pianoteq is not implemented in a realistic manner.
In addition to these obstacles, the measurements depend on commonly overlooked fundamental factors such as the computer's operating system, the software version, or secondary stimulus parameters (e.g. level or duration); such experimental setups therefore seem in general to be very prone to disturbances. The same holds true for the interpretation of the measurements if the statements are meant to secure the very fundamentals of scientific enquiry.
Furthermore, it must be noted that the accuracy measured in one sample octave cannot be extrapolated to the system as a whole.
In summary, the work presented here leads to the conclusion that plain usage of even the most accurate system we measured safely guarantees no finer tuning than about 2 cents when all possible sources of error are taken into account. Any finer accuracy requires human intervention, namely manual measurement, validation, and selection based on expert knowledge and experience.