An Objective Measure of Quality for Time-Scale Modification of Audio

Objective evaluation of audio processed with Time-Scale Modification (TSM) remains an open problem. Recently, a dataset of time-scaled audio with subjective quality labels was published and used to create an initial objective measure of quality. In this paper, an improved objective measure of quality for time-scaled audio is proposed. The measure uses hand-crafted features and a fully connected network to predict subjective mean opinion scores. Basic and Advanced Perceptual Evaluation of Audio Quality features are used in addition to nine features specific to TSM artefacts. Six methods of alignment are explored, with interpolation of the reference magnitude spectrum to the length of the test magnitude spectrum giving the best performance. The proposed measure achieves a mean Root Mean Squared Error of 0.487 and a mean Pearson correlation of 0.865, equivalent to the 98th and 82nd percentiles of subjective sessions respectively. The proposed measure is used to evaluate time-scale modification algorithms, finding that Elastique gives the highest objective quality for solo instrument and voice signals, while the Identity Phase-Locking Phase Vocoder gives the highest objective quality for music signals and the best overall quality. The objective measure is available at https://www.github.com/zygurt/TSM.


I. INTRODUCTION
Time-Scale Modification (TSM) is the process of modifying the duration of a signal without modifying its pitch. Justifying the quality of the processing requires subjective testing, which is expensive and time consuming. Objective methods are available for the evaluation of audio quality; however, these methods require reference and test signals of identical duration. Consequently, most published objective measures cannot be applied in this context. Two objective measures, SER by Verhelst and Roelands (1993) and $D_M$ by Laroche and Dolson (1999), have been proposed. However, they have been shown to be only high-level indicators of 'phasiness' or quality (Laroche and Dolson, 1999). In this work, we propose the first effective objective measure of quality for time-scale modified audio. It uses hand-crafted features with deep-learning methods and is trained using a recently published dataset (Roberts, 2020).
Objective measures of quality seek to predict the quality of a test signal and can be broadly classified into two classes: traditional and machine learning. Traditional measures such as Perceptual Evaluation of Speech Quality (ITU-T, 2001b), STOI (Gomez et al., 2011) and the TSM-specific measures SER and $D_M$ are purely analytical in nature. Machine learning methods use neural networks to develop a relationship between subjective evaluations of the test signal and hand-crafted or data-driven features extracted from reference and test signals, as in Perceptual Evaluation of Audio Quality (PEAQ) (ITU-T, 2001a). Deep learning allows for objective measures that do not require a reference file, as in Avila et al. (2019) for speech quality. However, these methods have not yet been applied to TSM.
Training of deep learning methods requires a large number of labelled signals. Recently, a dataset of time-scaled audio with subjective labels was published for this purpose (Roberts, 2020). Reference files were drawn from a large variety of sources including speech, singing, solo harmonic and percussive instruments as well as a variety of musical genres. The training subset, containing 5,280 processed files, was generated using six methods to time-scale 88 reference files at 10 ratios. The methods used were the Phase Vocoder (PV) by Portnoff (1976), the Identity Phase-Locking Phase Vocoder (IPL) by Laroche and Dolson (1999), Waveform Similarity Overlap-Add (WSOLA) by Verhelst and Roelands (1993), Fuzzy Epoch Synchronous Overlap-Add (FESOLA) by Roberts and Paliwal (2019), Mel-Scale Sub-band Modelling (uTVS) by Sharma et al. (2017) and Harmonic Percussive Separation Time-Scale Modification (HPTSM). Playback speeds of 0.3838, 0.4427, 0.5383, 0.6524, 0.7821, 0.8258, 0.9961, 1.381, 1.667 and 1.924 were used as time-scale ratios (β = 1/α) for the training subset. The testing subset, containing 240 files, was created using three additional methods to time-scale 20 reference files at a random β in each of the bands 0.25 < β < 0.5, 0.5 < β < 0.8, 0.8 < β < 1 and 1 < β < 2. Elastique by Zplane Development, the Phase Vocoder using fuzzy classification of bins (FuzzyPV) by Damskägg and Välimäki (2017) and Non-Negative Matrix Factorisation Time-Scale Modification (NMFTSM) by Roma et al. (2019) were used to generate the testing subset. Finally, an evaluation subset was generated by processing the testing subset reference files with all previously mentioned methods, in addition to the Scaled Phase-Locking Phase Vocoder (SPL) by Laroche and Dolson (1999), IPL and SPL variants of PhaVoRIT (IPL and SPL) by Karrer et al. (2006) and Epoch Synchronous Overlap-Add (ESOLA) by Rudresh et al. (2018). Twenty time-scale ratios in the interval 0.22 < β < 2.2 were used, resulting in 5,200 files with 400 files per method. During subjective testing, 42,529 ratings were collected from 263 participants in 633 sessions, resulting in a minimum of 7 ratings per file. Subjective median opinion scores (MedianOS) and subjective mean opinion scores (SMOS), both before and after normalization, were provided as labels. The dataset was published under the Creative Commons Attribution 4.0 International (CC BY 4.0) license at http://ieee-dataport.org/1987 (Roberts, 2020).
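As a quick arithmetic check, the subset sizes quoted above follow directly from the number of methods, reference files and ratios. The short sketch below only restates those counts; the variable names are illustrative and not part of the dataset.

```python
# Subset sizes implied by the dataset description (methods x files x ratios).
training_files = 6 * 88 * 10        # six TSM methods, 88 references, 10 ratios
testing_files = 3 * 20 * 4          # three methods, 20 references, one ratio per band

# Evaluation subset: six training methods, three testing methods,
# SPL, two PhaVoRIT variants and ESOLA.
evaluation_methods = 6 + 3 + 1 + 2 + 1
files_per_method = 20 * 20          # 20 references at 20 ratios
evaluation_files = evaluation_methods * files_per_method

print(training_files, testing_files, files_per_method, evaluation_files)
```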
The International Telecommunication Union (ITU) Recommendation BS.1387, more commonly known as Perceptual Evaluation of Audio Quality (PEAQ) (ITU-T, 2001a), is an objective measure of quality (OMOQ) developed primarily for evaluation of audio codecs. It combines research from multiple groups and was released as an ITU standard in 2001. PEAQ has two modes of operation, Basic and Advanced. The Basic version (PEAQB) consists of a Fast Fourier Transform (FFT)-based peripheral ear model, pre-processing, calculation of 11 Model Output Variables (MOVs) and a small neural network. The Advanced version (PEAQA) follows the same framework, but with a filter-bank-based ear model and 5 MOVs.
The FFT-based ear model aims to process the input signals in a similar way to the ear. The model contains an FFT, rectification, scaling of the input signal, outer and middle ear weighting, auditory filter bands, internal noise, frequency-domain spreading and time-domain spreading. The filter-bank model is identical in aim and contains scaling of the input signals, DC rejection, auditory filter band decomposition, outer and middle ear weighting, frequency-domain spreading, rectification, time-domain spreading, adding of internal noise and additional time-domain spreading. Pre-processing of the resulting excitation patterns for both ear models creates patterns used in the calculation of the MOVs, the details of which can be found in ITU-T (2001a), Thiede et al. (2000) and Kabal et al. (2002).
The basic MOVs can be categorised into six groups. The modulation difference MOVs, WinModDiff1B, AvgModDiff1B and AvgModDiff2B, are the windowed and linear averages of the modulation differences. Of the noise loudness MOVs, only RmsNoiseLoudB is used in the basic method; it is the squared average of the noise loudness and takes masking into account. The bandwidth MOVs, BandwidthRefB and BandwidthTestB, estimate the mean bandwidth of the reference and test signals, considering only frames with a bandwidth greater than 8 kHz. Pseudocode for the calculation is given in ITU-T (2001a). For auditory masking, Total NMRB is the linear mean of the noise-to-mask ratio, while Relative Disturbed Frames Basic, RelDistFramesB, is the number of frames with a noise-to-mask ratio above 1.5 dB as a fraction of the number of frames in the signal. For detection probability, Maximum Filtered Probability of Detection (MFPDB) models the smaller impact of distortions at the beginning of the file on quality assessment. Average Distorted Block (ADBB) uses the number of frames with a distortion detection probability above 0.5 and is calculated according to section 4.7.2 of ITU-T (2001a). Finally, the Harmonic Structure of Error (EHSB) MOV measures the harmonic structure of the error signal, as strong harmonic structure may be transferred to the error signal. The advanced model uses four additional MOVs alongside EHSB. RmsModDiffA, RmsNoiseLoudAsymA and AvgLinDistA are all calculated from the filter-bank ear model excitation patterns, while SegmentalNMRB is calculated from the FFT model. For full details, see ITU-T (2001a) and Kabal et al. (2002).
PEAQ makes use of a neural network to map the MOVs to a single Distortion Index (DI) value. The network used with the basic model is a fully connected network with a single hidden layer of three nodes and sigmoid activation. Features are scaled to between 0 and 1 before input to the network using $\hat{x} = (x - a_{min}) / (a_{max} - a_{min})$ (1), where $a_{min}$ and $a_{max}$ are scaling factors. Finally, the DI is mapped to the final Objective Difference Grade (ODG), minimizing the Root Mean Square Error (RMSE). The initial PEAQ standard (ITU-T, 2001a) was found to contain errors and omit vital information required for a proper implementation of the standard. Kabal et al. (2002) clarified these errors and omissions, and provided a MATLAB implementation of the standard. Available implementations include PQeval (Kabal et al., 2002), gstpeaq (Holters, 2017), EAQUAL (godock, 2017), peaqb-fast (Gottardi, 2013) and PEAQPython (Welch and Cohen, 2015).
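The scaling of features to the interval [0,1] can be illustrated with a minimal sketch. The function name and the choice to default the scaling factors to the observed minimum and maximum are assumptions made for illustration; PEAQ itself uses fixed scaling factors from the standard.

```python
def min_max_scale(values, a_min=None, a_max=None):
    """Scale values to [0, 1] as (x - a_min) / (a_max - a_min)."""
    # Assumption: fall back to the observed extremes when no factors are given.
    a_min = min(values) if a_min is None else a_min
    a_max = max(values) if a_max is None else a_max
    if a_max == a_min:
        # A constant feature carries no information; map it to zero.
        return [0.0 for _ in values]
    return [(x - a_min) / (a_max - a_min) for x in values]


scaled = min_max_scale([2.0, 4.0, 6.0])
fixed = min_max_scale([2.0, 4.0, 6.0], a_min=0.0, a_max=8.0)
print(scaled, fixed)
```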
The paper is organized as follows: Section II presents the proposed OMOQ method; Section III presents feature and network results as well as a comparison of TSM algorithms. Availability, future research and conclusions are presented in Sections IV, V and VI respectively.

II. METHOD
In this section, the proposed TSM objective measure is described. It uses a neural net to infer the SMOS score from hand-crafted features computed from audio processed by TSM. It includes PEAQ features, with modifications described in subsection II A and additional features specific to TSM artefacts described in subsection II B. Feature preparation is described in subsection II C and the neural network is described in subsection II D.

A. Changes to PEAQ
Changes were made to PEAQ MOV generation to allow for the use of time-scaled signals assuming that a constant time-scale ratio was applied while processing the signal.
Signal preparation begins by summing all input channels, before DC removal and normalization to the maximum absolute value. The proposed method uses full scale as ±1 rather than the 16-bit integers of PEAQ. A single channel is used in the proposed method as multi-channel TSM is rarely considered (Roberts and Paliwal, 2018). Consequently, a single channel is used for the detection probability calculations in ITU-T (2001a) section 4.7. The beginning and end of the test and reference files are determined as the first and last time the sum of the absolute values of four consecutive samples exceeds 0.0061, as per ITU-T (2001a) section 5.2.4.4. Signals are then truncated to these lengths. This removes frames with low energy at the beginning and end of the signals during averaging calculations, and synchronises the time-scaling starting point.
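The truncation step can be sketched as follows. This is a minimal illustration rather than the published implementation: the function name is hypothetical, and taking the first sample of the first active window as the start and the last sample of the last active window as the end is an assumption.

```python
def active_bounds(signal, threshold=0.0061, window=4):
    """Return (start, end) sample indices of the active region, end exclusive."""
    # Sum of absolute values over every run of `window` consecutive samples.
    sums = [sum(abs(s) for s in signal[i:i + window])
            for i in range(len(signal) - window + 1)]
    active = [i for i, total in enumerate(sums) if total > threshold]
    if not active:
        return 0, 0
    # Assumption: keep everything from the first to the last active window.
    return active[0], active[-1] + window


signal = [0.0] * 10 + [0.5, -0.5, 0.25, -0.25] + [0.0] * 10
start, end = active_bounds(signal)
trimmed = signal[start:end]
print(start, end, len(trimmed))
```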
PEAQ assumes an input sample rate of 48 kHz; however, the proposed method calculates features based on the sample rate of the input signals. In the processing of this dataset a sample rate of 44.1 kHz is used. As a result, the proposed method uses frequency values calculated from the given bin values of ITU-T (2001a). In the calculation of BandwidthRefB and BandwidthTestB, frequencies are used rather than bin indices. The noise floor is calculated above 21 kHz, with 8 kHz used as the bandwidth cutoff for bin inclusion during averaging. PEAQ and the proposed method both assume that bandwidth will be reduced due to processing.
The reference signal before and after spectral adaptation is used as input for AvgLinDistA calculation. However, the ITU specification is unclear as to which filter envelope modulation (Mod[k, n]) is intended. The final change to the ITU standard in the proposed method is the calculation of RelDistFramesB. The proposed method follows the interpretation of Kabal et al. (2002), in which 'related to' means the fraction of frames exceeding 1.5 dB.
Six methods of alignment were investigated during development: time-instance framing anchored to either the reference or the test signal, and four methods of interpolating magnitude spectrum frequency bins along the time axis.
Time-instance framing extracts frames from the reference and test signals at identical time-instances by scaling the frame locations by β, such that $S_R = u \beta S_T$, where u is the frame number, $S_R$ is the reference signal shift in samples and $S_T$ is the test signal shift in samples. In cases where β is not known, the ratio between the lengths of the truncated input signals is used.
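A minimal sketch of time-instance framing anchored to the test signal is given here. It assumes that $S_T$ is the test hop size, that the reference frame position is rounded to the nearest sample, and that an unknown β is estimated as reference length divided by test length; the function names are hypothetical.

```python
def frame_starts(num_frames, test_hop, beta):
    """Pair test and reference frame start positions (in samples)."""
    pairs = []
    for u in range(num_frames):
        test_start = u * test_hop
        ref_start = round(u * beta * test_hop)  # S_R = u * beta * S_T
        pairs.append((test_start, ref_start))
    return pairs


def estimate_beta(ref_length, test_length):
    """Fallback when beta is unknown: ratio of the truncated signal lengths."""
    # Assumption: beta is playback speed, so a slowed-down test file is longer.
    return ref_length / test_length


beta = estimate_beta(ref_length=48000, test_length=96000)
pairs = frame_starts(num_frames=4, test_hop=1024, beta=beta)
print(beta, pairs)
```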
While alignment through re-sampling either the reference or test signal to be the same duration is not suitable, due to resulting changes in pitch, it is possible to re-sample or interpolate low bandwidth representations of the signals, as shown by Sharma et al. (2017). In the proposed method, interpolation for basic PEAQ features is applied prior to the ear model using one of four targets: the longest signal, the shortest signal, the reference signal or the test signal. For advanced PEAQ features, interpolation to the test signal is applied after application of the ear model. There is no requirement for the time-scale to be known during calculation of features, only that the time-scale of the processed signal has been modified by a constant amount. Through a simple thought experiment, we can observe that as we extend signals through interpolation the transient components of the signal will also be extended, while the same transients will not be extended through time-instance framing. As such it is necessary to consider all, and combinations of, alignment methods.
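Interpolation of a magnitude spectrogram along the time axis can be sketched as follows. Linear interpolation and the list-of-frames layout are assumptions for illustration; the text above only specifies that frequency bins are interpolated along time to a target length.

```python
def interpolate_frames(frames, target_length):
    """Linearly interpolate a list of spectra (frames x bins) along time."""
    source_length = len(frames)
    if target_length == 1 or source_length == 1:
        return [list(frames[0]) for _ in range(target_length)]
    result = []
    for t in range(target_length):
        # Fractional position of output frame t on the source time axis.
        position = t * (source_length - 1) / (target_length - 1)
        lower = int(position)
        upper = min(lower + 1, source_length - 1)
        weight = position - lower
        result.append([(1 - weight) * a + weight * b
                       for a, b in zip(frames[lower], frames[upper])])
    return result


reference = [[0.0, 1.0], [2.0, 3.0], [4.0, 5.0]]   # 3 frames, 2 bins
stretched = interpolate_frames(reference, target_length=5)
print(stretched)
```

Interpolating to the test signal corresponds to choosing `target_length` as the number of test frames, so that every test frame has a matching reference frame.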

B. Additional Features
When calculating PEAQ Bandwidth features, asymmetric thresholds are used with +10dB used for BandwidthRefB and +5dB used for BandwidthTestB. Test Bandwidth calculated with a +10dB threshold (BandwidthTestNew ) has been included as an additional feature.
The two published traditional OMOQ were included as features in the proposed method. Roucos and Wilgus (1985) used the Signal to Error Ratio (SER), a logarithmic ratio ($10 \log_{10}$) computed over all frames and frequency bins, where X is shorthand for X(u, k), u is the frame number, k is the frequency bin, U is the total number of frames, N is the frame size, $X_R$ is the Short Time Fourier Transform (STFT) of the reference signal and $X_T$ is the STFT of the test signal. It is a measure of the difference between the magnitude spectra of the reference and test signals. In practice, SER is bounded to a maximum of 80 to avoid infinite results when processing identical files. This empirical value was the maximum finite feature value for identical files. Laroche and Dolson (1999) proposed an objective 'phasiness' measure ($D_M$) that measures the consistency of the synthesis reconstruction. Neither of these measures has seen continued use, with each noted to be only a high-level indicator of signal 'phasiness' (Laroche and Dolson, 1999).
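The bounding of SER can be illustrated with a sketch of one common form of the measure. The exact expression is not reproduced here, so the choice of reference energy as the numerator is an assumption, as are the function name and the list-based spectrogram layout; only the 80 dB ceiling and the use of magnitude differences come from the description above.

```python
import math


def signal_to_error_ratio(ref_mag, test_mag, ceiling=80.0):
    """SER in dB between two equally sized magnitude spectrograms."""
    # Assumption: numerator is the energy of the reference magnitudes.
    signal = sum(r * r for frame in ref_mag for r in frame)
    error = sum((r - t) ** 2
                for ref_frame, test_frame in zip(ref_mag, test_mag)
                for r, t in zip(ref_frame, test_frame))
    if error == 0.0:
        return ceiling              # identical spectra would give infinity
    if signal == 0.0:
        return float("-inf")
    return min(ceiling, 10.0 * math.log10(signal / error))


ref = [[1.0, 2.0], [3.0, 4.0]]
test = [[1.0, 2.0], [3.0, 3.0]]
ser = signal_to_error_ratio(ref, test)
ser_identical = signal_to_error_ratio(ref, ref)
print(round(ser, 2), ser_identical)
```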
One cause of 'phasiness' is phase unwrapping errors that occur when the time-scaling parameter (α) is not an integer (Laroche and Dolson, 1999). In this work we propose a method for estimating the level of 'phasiness' by considering the phase progression of reference and test signals. The proposed 'phasiness' features track phase progression through time for the reference and test signals, account for the change of time-scale and calculate the difference between the resulting unwrapped phase progressions. Weighting is applied to the phase difference, with unity and magnitude spectrum weighting applied in separate features within the proposed method. These features are calculated in the following manner. The phase spectra of the reference and test signals are calculated using the STFT and adjusted to lie between 0 and 2π, forming ∠X. 2π is then successively added to each bin until it is greater than the same frequency bin in the previous frame, which is equivalent to adding 2πP where P ∈ Z. The longer of the two resulting phase progressions is then interpolated to match the length of the shorter. The weighted angle difference (∆ϕ) is then calculated between the reference and test phase progressions, using either magnitude spectrum weighting or W(k) = 1 for no weighting. Once the angle differences have been calculated, the mean 'phasiness' features, using No Weighting (MPhNW) and Magnitude Weighting (MPhMW), are calculated by taking the means of the absolute weighted difference in the time and frequency dimensions. The standard deviation features (SPhNW and SPhMW) are calculated by taking the standard deviation of the frequency mean of the absolute weighted difference. A number of additional measures were explored, including power spectrum weighting, Fletcher-Munson curve weighting and the mean first difference along the time dimension; however, they were found to be poor measures or to contribute little towards the training of the prediction network. Figure 1 shows the 'phasiness' features compared to both SMOS and TSM ratio.
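The phase progression step can be sketched as follows. This is an illustrative reading of the description, not the published code: the function names are hypothetical, the inputs are assumed to be lists of per-frame phase values, and the two progressions are assumed to have already been interpolated to the same number of frames. Only the unweighted case, W(k) = 1, is shown.

```python
import math

TWO_PI = 2.0 * math.pi


def phase_progression(phase_frames):
    """Unwrap phase along time so each bin exceeds its previous-frame value."""
    progressed = []
    for u, frame in enumerate(phase_frames):
        wrapped = [p % TWO_PI for p in frame]          # map to [0, 2*pi)
        if u > 0:
            previous = progressed[-1]
            for k, value in enumerate(wrapped):
                while value <= previous[k]:            # add 2*pi until greater
                    value += TWO_PI
                wrapped[k] = value
        progressed.append(wrapped)
    return progressed


def mean_abs_difference(ref_prog, test_prog):
    """Unweighted mean absolute phase difference over time and frequency."""
    diffs = [abs(r - t)
             for ref_frame, test_frame in zip(ref_prog, test_prog)
             for r, t in zip(ref_frame, test_frame)]
    return sum(diffs) / len(diffs)


progressed = phase_progression([[0.5], [0.2], [-1.0]])   # 3 frames, 1 bin
shifted = [[value + 0.1 for value in frame] for frame in progressed]
difference = mean_abs_difference(progressed, shifted)
print(progressed, difference)
```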
'Phasiness' can be seen to increase as the TSM ratio moves away from 100% and as the SMOS decreases, as expected. Animated 3-dimensional plots for all features, color coded to each TSM method, can be found in the supplementary material and at zygurt.github.io/TSM/objective. 'Phasiness' causes spectral coloration of the signal (Laroche and Dolson, 1999), allowing for spectral similarity to be used as an indicator of phasiness. Two features (SSMAD and SSMD) were developed using differences in the smoothed spectrum between reference and test signals. Frames, aligned using reference frame anchors, are converted to normalized magnitude spectra using the STFT and Hann windowing. Third-order polynomials are then fit to the spectra. The resulting polynomials, without the intercept term, are applied to a linearly spaced vector N/2 in length. Removal of the intercept term removes any overall level difference between the frames. The mean absolute difference and mean difference between reference and test signals are calculated for each frame, with the means of these values forming the two spectral similarity features. These features also give a measure of the signal coloration introduced by the TSM algorithm. Figure 2 shows the spectral similarity features in relation to the SMOS and TSM ratio. Further analysis found groupings for individual TSM methods and classes of TSM methods within the features. Time-domain methods inherently introduce less or no 'phasiness', and FESOLA and WSOLA tend to have better spectral similarity than frequency-domain methods. Refer to the SSMAD and SSMD animated graphs in the supplementary material for examples.

FIG. 2. [Color Online] Spectral similarity features as functions of SMOS and TSM Ratio.
Changes in the transient content of the signal are common TSM artefacts. Three features that require no alignment between signals have been developed for the proposed method: Peak Delta, Transient Ratio and Harmonic Percussive Separation Transient Ratio. Peak Delta (∆P) is the difference in the number of onsets per second between the reference and test signals. Onset detection is applied to both signals using the spectral features method described by Bello et al. (2005). A weighting function, W[k] = |k|, is applied to the power spectrum before the first backward difference of the logarithmic transform is calculated. Peak picking is then applied to the onset results, where a peak is greater than its four surrounding values. Finally, the difference in the number of peaks per second is used as the feature, where $f_s$ is the sampling frequency and dim($x_R$) is the length of the reference signal in samples. The Transient Ratio (TrRat) is a measure of the change in transients due to processing and makes use of the peak locations calculated previously in equation 10. Peaks are selected where the onset peak level is greater than one standard deviation above the mean onset level. The selected peak values are then used to calculate the ratio of mean transient level between the reference and test signals, where V is the total number of selected peaks and v is the index of the selected peaks.
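The peak picking and transient selection steps can be sketched on a precomputed onset detection function. The sketch assumes that "four surrounding values" means two neighbours on each side and that the population standard deviation is used; the function names and the toy onset values are hypothetical.

```python
import statistics


def pick_peaks(onset, reach=2):
    """Indices whose value exceeds the `reach` neighbours on each side."""
    peaks = []
    for i in range(reach, len(onset) - reach):
        neighbours = onset[i - reach:i] + onset[i + 1:i + reach + 1]
        if all(onset[i] > n for n in neighbours):
            peaks.append(i)
    return peaks


def strong_peaks(onset, peaks):
    """Keep peaks more than one standard deviation above the mean onset level."""
    threshold = statistics.mean(onset) + statistics.pstdev(onset)
    return [i for i in peaks if onset[i] > threshold]


onset = [0.0, 0.1, 0.9, 0.1, 0.0, 0.2, 0.3, 0.2, 0.0, 0.0]
peaks = pick_peaks(onset)
transients = strong_peaks(onset, peaks)
print(peaks, transients)
```

Counting `peaks` for the reference and test signals gives the quantities compared by Peak Delta, while the onset levels at `transients` are the values averaged for the Transient Ratio.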
The Harmonic Percussive Separation Transient Ratio (HPSTrRat) compares the Root Mean Square (RMS) levels of reference and test transients. Transients are extracted from the reference and test signals using a median filtering method. The RMS levels of the extracted signals are calculated before the final feature is computed as the ratio of reference to test. Figure 3 compares each of the transient features to SMOS and TSM ratio.
As musical noise is a known artefact introduced by TSM, it was also explored as a possible feature. Spectral Kurtosis, as proposed by Torcoli (2019), was explored using all previously discussed methods of alignment. Lower, middle and upper frequency bands were used in addition to the maximum across all bands. As all time-alignment methods produced highly correlated results, interpolation to test was chosen as the alignment method. However, inclusion of these features reduced neural network performance and as a result they were removed from the features used in the final proposed network.

C. Feature Preparation
Prior to network training, features were normalized for faster network convergence. Each feature was scaled to the interval [0,1] using equation 1, where $a_{min}$ and $a_{max}$ are the minimum and maximum values for each feature. Target scores were also scaled to the interval [0,1].

D. Neural Network
Estimation of opinion scores was formulated as a regression problem using a fully-connected neural network with three hidden layers of 128 nodes, shown in figure 4. Layer normalization and ReLU activation were used with residual connections around the second and third layers. Residual connections are facilitated by adding the input of a layer to its output. Sigmoid activation is applied to the final output. 10% of the training dataset was reserved for validation. The network was trained for 800 epochs using a single batch, RMSE loss (L), AdamW optimization (Loshchilov and Hutter, 2017) and a learning rate of 1e-4. Networks that were still improving after 800 epochs were trained for an additional 800 epochs. Internal loss values were calculated using estimates in the interval [0,1], while reported loss values were calculated using estimates scaled back to the original interval of [1,5]. As prediction of opinion scores for novel TSM methods is the network aim, early stopping based on validation loss was not used. The optimal epoch was chosen as the epoch with the minimum overall distance (D), calculated from $\hat{\rho}$ and $\hat{L}$. These are in turn calculated from $\rho = [\rho_{tr}, \rho_{val}, \rho_{te}]$ and $L = [L_{tr}, L_{val}, L_{te}]$, where tr, val and te denote training, validation and testing, $\bar{L}$ is the mean of L, $\bar{\rho}$ is the mean of ρ, $\Delta\rho = \max(\rho) - \min(\rho)$ and $\Delta L = \max(L) - \min(L)$. This allowed the novel artefacts of the testing subset to inform the chosen optimal network, without their use in training the network.
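The relationship between internal and reported loss values can be shown with a short sketch. It assumes a linear mapping between [0,1] and [1,5]; the function names and the example values are hypothetical and are not results from the trained network.

```python
import math


def rmse(estimates, targets):
    """Root mean squared error between two equally long sequences."""
    squared = [(e - t) ** 2 for e, t in zip(estimates, targets)]
    return math.sqrt(sum(squared) / len(squared))


def to_opinion_scale(values, low=1.0, high=5.0):
    """Map values in [0, 1] back to the opinion score interval [low, high]."""
    return [low + v * (high - low) for v in values]


estimates = [0.25, 0.50, 0.90]   # hypothetical network outputs in [0, 1]
targets = [0.30, 0.50, 0.80]     # hypothetical scaled target scores
internal_loss = rmse(estimates, targets)
reported_loss = rmse(to_opinion_scale(estimates), to_opinion_scale(targets))
print(internal_loss, reported_loss)
```

Because the mapping is linear, the reported RMSE is the internal RMSE multiplied by the width of the opinion score interval, which is 4 under this assumption.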

III. RESULTS

A. Feature Results
Features were manually pruned if the Pearson Correlation Coefficient (PCC) between features of the same type was above approximately 0.95, with figure 5 showing the correlation between each of the features in the proposed measure. This pruning increased the performance of the trained network. Due to the non-linear nature of the relationship between β and SMOS, correlation was calculated separately for β < 1 and β > 1 before averaging. The additional features were found to have a greater correlation to the SMOS than most PEAQ features. Of interest is the lack of individual features highly correlated with the SMOS or β, while still resulting in excellent network performance.
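The split correlation described above can be sketched as follows. The function names and the toy data are hypothetical; the sketch only illustrates why the correlation is computed separately on each side of β = 1 before averaging.

```python
import math


def pearson(x, y):
    """Pearson Correlation Coefficient between two equally long sequences."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def split_correlation(feature, smos, beta):
    """Average of the PCC computed separately for beta < 1 and beta > 1."""
    slow = [i for i, b in enumerate(beta) if b < 1]
    fast = [i for i, b in enumerate(beta) if b > 1]
    parts = [pearson([feature[i] for i in idx], [smos[i] for i in idx])
             for idx in (slow, fast)]
    return sum(parts) / len(parts)


beta = [0.4, 0.6, 0.8, 1.2, 1.6, 2.0]
smos = [2.0, 3.0, 4.0, 4.0, 3.0, 2.0]        # toy scores peaking near beta = 1
whole = pearson(beta, smos)                   # correlation over all ratios
split = split_correlation(beta, smos, beta)   # +1 and -1 average to zero
distance = [abs(b - 1.0) for b in beta]       # toy feature: distance from 1
split_distance = split_correlation(distance, smos, beta)
print(whole, split, split_distance)
```

In this toy case β itself is perfectly correlated with the scores on each side of 1 but with opposite signs, which a single correlation over all ratios cannot express.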

B. Network Performance
A wide range of testing and network configurations were considered during the development of the proposed method. Network hyperparameters were optimized through a systematic non-exhaustive search. Each method of alignment was trained to SMOS, MedianOS, raw SMOS and raw MedianOS targets, where raw values were calculated prior to session normalization. Additionally, baseline conditions, the inclusion of reference files within the training set, concatenation of logarithmic transforms of features and combinations of multiple alignment methods were considered. Training of the network was conducted deterministically using seeds from 0 to 99. Figure 6 shows the boxplot distribution of the best D for each of the seed values used while training to SMOS. Lower values are better, with a smaller range meaning less reliance on the initial seed. Across all test cases, the network was more successful when training to mean, rather than median, targets. Consequently, the results discussed below focus solely on networks trained to mean targets. To increase readability, median D and best case D with L, ∆L, ρ and ∆ρ values can be found in table I. Values were calculated as per section II D.
The baseline performance for the traditional methods was determined by correlation with the target. SER and $D_M$ gave overall ρ with subjective scores of 0.1445 and 0.0274 respectively. Machine learning baseline performance was obtained by applying time-aligned PEAQB features to the original PEAQB network described by ITU-T (2001a), shown as 'Original PEAQB (To Test)'. By increasing the complexity of the network to that in section II D, L and ρ were improved, shown as 'PEAQB (To Test)'. Performance was further improved through the

TABLE I. Mean RMSE loss ($\bar{L}$) and range (∆L), mean PCC ($\bar{\rho}$) and range (∆ρ), median overall distance ($\tilde{D}$) and minimum overall distance (min(D)). Best results in bold.

Including reference signals as test material, with targets set to 5, improved network performance and gave the best median overall distance for the seeds tested. Combinations of features generated using interpolation to test and time-instance anchoring to test were also applied to the network. This produced the best minimum distance after additional hyperparameter tuning, but was highly reliant on initial seed selection and gave inconsistent results during TSM method evaluation. Finally, combinations of concatenating logarithmic features, including reference signals and combining different alignment features were applied to the network, but all resulted in reduced performance.
Given the network performance in predicting raw SMOS outperforms prediction of normalized SMOS, investigation of Objective Mean Opinion Score (OMOS) differences was undertaken. The mean difference between normalized and raw SMOS was found to be -0.0023, while the mean difference was found to be -0.004 for OMOS. Normalizing was found to slightly extend the range of the SMOS values, with higher ratings for high quality files, and lower ratings for low quality files. Given the ITU-T (2019) recommendation of normalization, and the improvement in all metrics after normalization (Roberts and Paliwal, 2020), the final proposed objective measure of quality was trained to normalized SMOS using features aligned using interpolation to test, including reference files.
The proposed network achieved a best mean PCC of 0.865 and RMSE of 0.487. These results place the proposed network at the 82nd and 98th percentiles of subjective sessions for PCC and RMSE respectively.

C. TSM Algorithm Evaluation
TSM algorithms were compared using the evaluation subset of Roberts (2020), described in section I. The uTVS implementation used in subjective testing (uTVS) and an IPL implementation by Driedger (DIPL) have also been included. Although β = 1 was used in the evaluation, in practice time-scaling is only applied at ratios other than 1. Additionally, the minimum β available for Elastique is 0.25. Consequently, all results for β = 1 and β < 0.25 were excluded from averaging calculations. Table II shows the mean OMOS for each of the TSM methods tested, in addition to means for each file class, ordered by ascending overall mean.
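The exclusion rule used when averaging can be sketched as follows. The function name, the tuple layout and the example scores are hypothetical; only the rule of skipping β = 1 and β < 0.25 comes from the text.

```python
def mean_omos(results, min_beta=0.25):
    """Mean OMOS per method, skipping beta = 1 and beta below min_beta."""
    scores = {}
    for method, beta, omos in results:
        if beta == 1 or beta < min_beta:
            continue
        scores.setdefault(method, []).append(omos)
    return {method: sum(v) / len(v) for method, v in scores.items()}


# Hypothetical (method, beta, OMOS) entries, not values from the paper.
results = [
    ("A", 0.22, 1.5), ("A", 0.5, 3.0), ("A", 1.0, 5.0), ("A", 2.0, 2.0),
    ("B", 0.5, 3.5), ("B", 1.5, 2.5),
]
means = mean_omos(results)
print(means)
```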
Analysis is split into each class of reference file, followed by overall average results. In all cases, the noisy nature of the results for the testing TSM methods in Roberts and Paliwal (2020) has been smoothed. The poor performance of the uTVS subjective testing implementation for β close to 1 is also visible, with the updated implementation showing monotonic improvement towards β = 1.
For musical files, the OMOQ effectively differentiates between frequency-domain and time-domain methods, where the quality worsens faster for time-domain methods. WSOLA fares the best of the time-domain methods, diverging from frequency-domain methods for β < 0.8, as shown in figure 7(a). When averaged, the OMOQ rates IPL highest followed by Elastique. All other frequency-domain methods gave similar results.
For solo files all methods except NMFTSM perform similarly with a maximum difference between methods of approximately 0.5 around β = 0.9. Method means at each time-scale can be seen in figure 7(b). Driedger's IPL has the highest mean OMOS, followed by Elastique, WSOLA and IPL as shown in table II. The strong performance of WSOLA is expected, due to individual harmonic and percussive signals.
Voice file OMOS shows the greatest variance between methods. Of interest is the exponential shape of the curve for β < 1 compared to the logarithmic shape for musical and solo classes, indicating harsher subjective evaluation of voice files was learned by the network. Method means at each time-scale can be seen in figure 7(c). Elastique has the highest mean OMOS, followed by IPL and WSOLA. ESOLA and FESOLA give significantly improved performance for this class.
By averaging all OMOS, IPL has the highest average rating followed by DIPL and Elastique. Only 0.049 separates the following methods: SPL, WSOLA, HPTSM and uTVS. The overall low performance of FuzzyPV is unexpected, given that it builds on IPL. However, other methods that perform decomposition of the signal, such as NMFTSM and HPTSM, also perform below the methods they build upon. The overall means can be seen in figure 7(d).

IV. AVAILABILITY
The proposed tool is available from github.com/zygurt/TSM. This includes the MATLAB scripts for feature generation, PyTorch code for the trained network and features for all dataset files in '.csv' and '.mat' formats. A bash script is also included that creates a virtual environment and installs required modules. The features are also available with the subjective dataset at http://ieee-dataport.org/1987.

V. FUTURE RESEARCH
Future research is multi-faceted. Firstly, evaluation of a wide range of commercial and lesser known published TSM methods should be considered, in addition to comparisons of different implementations of the same TSM method. Secondly, expansion into alternative and deeper neural networks should also be considered. Initial testing resulted in a $\rho_{te}$ of 0.71 for a random forest network using the hand-crafted features, while using blind data-driven features created by a CNN as input to a fully connected network resulted in a $\rho_{te}$ of 0.65.

VI. CONCLUSION
An objective measure for time-scaled audio was proposed with performance superior to most subjective listeners. The measure used handcrafted features and a fully connected network to predict subjective mean opinion scores. PEAQ Basic and Advanced features were used in addition to nine novel features specific to TSM artefacts. Six methods of alignment were explored, with interpolation of the magnitude spectrum to the length of the test signal giving the best performance. The proposed measure achieved a mean RMSE of 0.487 and a mean PCC of 0.865. Using the proposed method to evaluate algorithms, it was found that Elastique gave the highest objective quality for solo and voice signals, while the Identity Phase-Locking Phase Vocoder gave the highest objective quality for music signals and the best overall performance. Future work includes optimisation of feature generation, exploration of other network structures and evaluation of more TSM algorithms.