A Time-Scale Modification Dataset with Subjective Quality Labels

Time-Scale Modification (TSM) is a well-researched field; however, no effective objective measure of quality exists. This paper details the creation, subjective evaluation and analysis of a dataset for use in the development of an objective measure of quality for TSM. Comprising two parts, the training component contains 88 source files processed using six TSM methods at 10 time-scales, while the testing component contains 20 source files processed using three additional methods at four time-scales. The source material contains speech, solo harmonic and percussive instruments, sound effects and a range of music genres. 42,529 ratings were collected from 633 sessions using laboratory and remote collection methods. Analysis of the results shows no correlation between participant age and rating quality; equivalence between expert and non-expert listeners; minor differences between participants with and without hearing issues; and minimal differences between testing modalities. Comparison of published objective measures and subjective scores shows the objective measures to be poor indicators of subjective quality. A modified version of PEAQ Basic was retrained using the subjective results, achieving a root mean squared error loss of 0.668 and a Pearson correlation of 0.719 for all files. The labelled dataset is available at http://ieee-dataport.org/1987.


I. INTRODUCTION
Time-Scale Modification (TSM) is the process of modifying the duration of a signal without modifying the pitch. It has found use in areas including music production, language learning and speech recognition systems. Despite being a well-researched field, an effective objective measure of quality has not yet been published, limiting comparisons between TSM algorithms. When subjective evaluation has been used, each paper has used a unique set of source material and methods, further reducing comparison to only the methods involved in the evaluation. In order to develop an effective objective measure, a dataset with subjective quality labels is required. This work details the creation, subjective evaluation and analysis of the first dataset created for this purpose, and gives preliminary results for a neural-network-based objective measure of quality.
TSM algorithms most commonly modify the temporal domain by varying the ratio between analysis (S_a) and synthesis (S_s) shift sizes within an Analysis-Modification-Synthesis framework. This ratio, given by α = S_s/S_a = 1/β, shows α to be the change in signal duration (Roucos and Wilgus, 1985), while β is the playback speed (Sylvestre and Kabal, 1992) and will be used within this paper. Algorithms for TSM can be classified into three main categories: frequency domain, time domain and hybrid methods.
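For concreteness, the short sketch below (Python; not from the paper, with hypothetical names) shows the conversion between the shift sizes and the two ratio conventions defined above.

```python
def tsm_ratios(S_a: int, S_s: int):
    """Return (alpha, beta) for analysis shift S_a and synthesis shift S_s.

    alpha = S_s / S_a is the change in signal duration (alpha > 1 lengthens),
    and beta = 1 / alpha is the corresponding playback speed.
    """
    alpha = S_s / S_a
    return alpha, 1.0 / alpha

# Example: a 256-sample analysis hop with a 512-sample synthesis hop doubles
# the duration (alpha = 2), i.e. half playback speed (beta = 0.5).
alpha, beta = tsm_ratios(256, 512)
```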
In general, frequency-domain methods excel at scaling harmonically complex material but struggle to produce high quality results for highly transient signals. Time-domain methods are more effective at scaling transient signals but give poor results for polyphonic signals. Hybrid methods leverage the strengths of frequency- and time-domain methods to produce higher quality results.
Common artefacts produced during TSM include 'phasiness' and reverberation (Laroche and Dolson, 1997; Portnoff, 1981), musical and metallic noise or undesirable roughness (Laroche and Dolson, 1999), a buzzy quality (Laroche, 2002) and transient smearing (Laroche and Dolson, 1999). Phasiness and reverberation are heard as a loss of spectral definition and are most commonly associated with frequency-domain methods. Laroche and Dolson (1999) suggest that this is due to a change in the relationship between the phases of bins in the spectral domain. Musical noise, also known as musical artefacts or musical tones, is due to isolated holes and/or peaks within the power spectrum (Torcoli, 2019). Within TSM, these artefacts are caused by periodicity introduced to noise bins during phase progression, due to the sum-of-sines model of the Short-Time Fourier Transform (STFT). Depending on the frequency relationships between these periodic signals, the noise will be perceived as musical for simple harmonic relationships and metallic for complex harmonic relationships. Transient smearing occurs due to the trade-off between STFT spectral and temporal resolution in frequency-domain algorithms: as the frame size increases to improve spectral resolution, temporal resolution decreases, leading to smearing of transients in time. The buzzy quality, also known as transient skipping or duplication, is an artefact of time-domain methods in which transients may be skipped for β > 1 or duplicated for β < 1.
The aim of TSM is often noted; however, an exploration of ideal TSM has not been published. For the purpose of subjective evaluation, we describe ideal TSM as indistinguishable from a change by the sound source, that is: the processing should be transparent. A musician changing tempo or a speaker changing cadence would therefore be ideal and should be the goal for TSM algorithms. Consequently, ideal TSM should be determined by the sound source being scaled. For example, a dry recording of individual clicks simply requires temporal realignment of each click, whereas a recording of sustained notes played on a violin would require extension of the sustain section of each note's envelope. Further, in the case of a piano, one must consider whether the transient or harmonic nature of the source should be maintained. If a staccato melody played in the upper register without damping is to be slowed, should note decay be lengthened, or should the decay be maintained with each note shifted to the new time-scale? We argue that as the piano is a percussive instrument and unable to modify its amplitude envelope, the note decay should be maintained. This is counter to the processing applied by all published TSM algorithms. We propose that an ideal TSM algorithm would be sensitive to the signal source and be capable of modifying only the sustain portion of the amplitude envelope. This raises many questions in the processing of reverberation, vibrato, specific phonemes and more. We consider that content-aware or source-sensitive TSM is an area with considerable potential for improving the quality of TSM.
The remainder of the paper is laid out as follows. Section II describes the TSM algorithms used to create the dataset and previous methodologies for quality evaluation. Section III describes the source files used in the creation of the dataset and the processing of the source material to create the processed dataset. Section IV describes the subjective testing methodology, opinion score normalization, results and analysis of the subjective testing and dataset availability. Section V compares subjective results with published objective measures and provides preliminary results for an objective measure of quality based on Perceptual Evaluation of Audio Quality (PEAQ) (ITU-T, 2001). Finally, Section VI summarises and draws conclusions from this research.

II. TSM ALGORITHMS AND QUALITY EVALUATION
The Phase Vocoder (PV) (Flanagan and Golden, 1966) is a frequency-domain method that uses the known phase progression between frames at the original time-scale to calculate the phase progression between frames at the adjusted time-scale. The digital implementation by Portnoff (1976) uses the STFT to calculate phase spectra and forms the basis for all PV methods published since. A detailed explanation of the PV can be found in Laroche and Dolson (1999). The PV is effective at scaling signals with a complex harmonic structure; however, it introduces 'phasiness' for non-integer values of α and is prone to transient smearing.
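As an illustration of the phase-propagation idea (not the authors' implementation), a minimal phase vocoder sketch in Python/NumPy follows; it estimates instantaneous frequency from the inter-frame phase increment and advances the synthesis phase at the new hop size. Window-gain compensation and boundary handling are omitted for brevity.

```python
import numpy as np

def phase_vocoder(x, alpha, N=2048, Ss=512):
    """Minimal phase vocoder TSM sketch.

    x: mono signal, alpha: change in duration, N: frame length,
    Ss: synthesis hop; the analysis hop Sa follows from alpha = Ss / Sa.
    """
    Sa = max(1, int(round(Ss / alpha)))
    win = np.hanning(N)
    k = np.arange(N // 2 + 1)
    omega = 2 * np.pi * k * Sa / N        # expected phase advance per analysis hop
    y = np.zeros(int(len(x) * alpha) + 2 * N)
    syn_phase = prev_phase = None
    for out_idx, n in enumerate(range(0, len(x) - N, Sa)):
        X = np.fft.rfft(win * x[n:n + N])
        mag, phase = np.abs(X), np.angle(X)
        if prev_phase is None:
            syn_phase = phase.copy()       # first frame: copy analysis phase
        else:
            dphi = phase - prev_phase - omega                 # heterodyned increment
            dphi -= 2 * np.pi * np.round(dphi / (2 * np.pi))  # wrap to [-pi, pi]
            syn_phase = syn_phase + (omega + dphi) * (Ss / Sa)
        prev_phase = phase
        m = out_idx * Ss
        y[m:m + N] += win * np.fft.irfft(mag * np.exp(1j * syn_phase))
    return y
```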
The Identity Phase Locking Phase Vocoder (IPL) (Laroche and Dolson, 1999) attempts to reduce the 'phasiness' introduced by the PV algorithm. The PV maintains horizontal phase coherence within each STFT bin; however, the vertical phase relationship between bins is not maintained. In IPL, only the phases of peaks in the magnitude spectrum are modified, with nearby bins locked to the phase progression of the closest peak. This method was extended by Karrer et al. (2006) through multi-resolution peak-picking and accounting for added and removed peaks. These methods reduce phasiness; however, they can introduce metallic or musical noise, also called spectral roughness.
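A sketch of the identity phase-locking step is given below (illustrative Python/NumPy; peak_syn_phase is a hypothetical helper standing in for the PV peak update).

```python
import numpy as np

def identity_phase_lock(mag, ana_phase, peak_syn_phase):
    """One frame of identity phase locking (sketch).

    mag, ana_phase : magnitude and analysis phase spectra of the frame.
    peak_syn_phase : callable returning the propagated synthesis phase for
                     a peak bin (e.g. the PV update from the sketch above).
    """
    # Treat local maxima of the magnitude spectrum as peaks.
    peaks = np.where((mag[1:-1] > mag[:-2]) & (mag[1:-1] > mag[2:]))[0] + 1
    if peaks.size == 0:
        return ana_phase
    # Assign every bin to its nearest peak (its "region of influence").
    bins = np.arange(len(mag))
    nearest = peaks[np.argmin(np.abs(bins[None, :] - peaks[:, None]), axis=0)]
    # Propagate phase only at peaks; other bins keep their analysis-phase
    # offset relative to the peak, locking them to the peak's rotation.
    syn_phase = np.zeros_like(ana_phase)
    for p in peaks:
        syn_phase[p] = peak_syn_phase(p)
    return syn_phase[nearest] + (ana_phase - ana_phase[nearest])
```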
The Waveform Similarity Overlap-Add algorithm (WSOLA) (Verhelst and Roelands, 1993) is a time-domain method that uses the similarity between a frame and its natural progression in the input signal to minimize discontinuities between frames at the new time-scale. This is in contrast to previous methods that compare with the output signal (Moulines and Charpentier, 1990; Roucos and Wilgus, 1985). WSOLA effectively processes speech and monophonic musical signals; however, due to its reliance on the fundamental frequency for alignment, it produces low quality results for polyphonic signals.
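The frame-alignment idea can be sketched as follows (illustrative Python/NumPy, not the TSM Toolbox implementation; the coarse 8-sample search grid and plain inner-product similarity are simplifications).

```python
import numpy as np

def wsola(x, alpha, N=1024, Ss=512, tol=512):
    """Minimal WSOLA sketch: each analysis frame is chosen within +/- tol
    samples of its nominal position so that it best matches the natural
    continuation of the previously copied frame."""
    win = np.hanning(N)
    y = np.zeros(int(len(x) * alpha) + N)
    prev_start = 0
    m = 0
    while True:
        n = int(round(m * Ss / alpha))         # nominal analysis position
        if n + tol + Ss + N >= len(x):
            break
        if m == 0:
            start = n
        else:
            # Ideal continuation: the frame Ss samples after the previous one.
            target = x[prev_start + Ss:prev_start + Ss + N]
            lo = max(0, n - tol)
            cands = range(lo, n + tol + 1, 8)  # coarse search grid for brevity
            scores = [np.dot(target, x[c:c + N]) for c in cands]
            start = lo + 8 * int(np.argmax(scores))
        y[m * Ss:m * Ss + N] += win * x[start:start + N]
        prev_start = start
        m += 1
    return y
```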
Fuzzy Epoch Synchronous Overlap-Add (FESOLA) (Roberts and Paliwal, 2019) uses cross-correlation of glottal closure instants, known as epochs, to align frames of speech. Epochs are calculated using a Zero Frequency Resonator before being spread in the time domain. The spreading improves the cross-correlation of epochs, such that small changes in the speaker's fundamental frequency are accounted for. This method works well for speech and monophonic signals; however, it is not effective at processing polyphonic signals or signals that lack strong harmonic content.
The Harmonic-Percussive Separation Time-Scale Modification (HPTSM) method (Driedger et al., 2014) is a hybrid method that uses median filtering of spectrograms to split the signal into harmonic and percussive components. WSOLA and IPL are used for the percussive and harmonic components respectively. As a result, quality is improved over both individual methods. The method was also shown to compete with contemporary commercial state-of-the-art algorithms.
Multi-component Time-Varying Sinusoidal decomposition (uTVS) (Sharma et al., 2017) uses oversampling, a Mel-scale filter-bank and the Hilbert transform to calculate instantaneous phase and frequency, bypassing the error-prone phase unwrapping and quasi-stationary assumption of traditional frequency-domain methods. As a result, temporal smearing, transient skipping and duplication, and phasiness artefacts are reduced. This method slightly improves quality over HPTSM, with large improvements over traditional methods (Sharma et al., 2017).
Elastique (Zplane Development) is a commercial TSM method widely used in digital audio workstations. While the exact algorithm used is not publicly available, it is currently a state-of-the-art method and has been used in recent TSM subjective comparisons.
Fuzzy classification of spectral bins (FuzzyPV) (Damskägg and Välimäki, 2017) is an extension of the IPL. Spectral bins are given a degree of membership to three classes (sinusoidal, noise and transient), resulting in a fuzzy classification of each bin. Sinusoidal bins are scaled using IPL with phase locking applied only to sinusoidal bins, while random phase is added to noise bins. The analysis phases of transient bins are simply relocated in time. Subjective evaluation shows improvement over HPTSM and similar performance to Elastique.
Non-Negative Matrix Factorization Time-Scale Modification (NMFTSM) by Roma et al. (2019) decomposes the signal into percussive events and harmonic components. Percussive events are copied directly to the output signal, while IPL is used for harmonic components. It is effective in preserving the duration of percussive events; however, it is highly reliant on correct detection of the events and introduces novel TSM artefacts.
Comparatively little formal subjective testing has been used to evaluate proposed methods, with most publications providing results from informal testing. A wide variety of time-scales and algorithms are used, with little consistency. Time-scales are often limited, with two to five time-scales (0.5 ≤ β ≤ 2) reported in formal testing and a bias towards β < 1. This reduces the number of files that require rating, but also limits algorithm evaluation. The difference in quality between β < 1 and β > 1 was mentioned briefly by Sylvestre and Kabal (1992) and shown in early results from this testing in (Roberts and Paliwal, 2019). Since the release of the MATLAB TSM Toolbox (Driedger and Müller, 2014), its included algorithms (PV, IPL, WSOLA and HPTSM) have been used in most evaluations, while comparisons to commercial algorithms are rare (Damskägg and Välimäki, 2017; Karrer et al., 2006). The source audio used during testing also varies between papers, with some papers using the files provided with the MATLAB TSM Toolbox. It was noted by Moulines and Laroche (1995) that a thorough perceptual evaluation of TSM approaches and algorithms had not yet been undertaken.
Two objective measures have been proposed: the Signal to Error Ratio (SER) of Roucos and Wilgus (1985) and the synthesis reconstruction consistency measure D_M of Laroche and Dolson (1999). SER accounts only for successive magnitude spectra, with no attention paid to phase spectra. Synthesis consistency (D_M) compares the output frame's magnitude and phase to the reconstructed signal's magnitude and phase; however, the "measure is not a clear indicator of phasiness" (Laroche and Dolson, 1999). Neither of these measures has seen continued use in published literature.

III. DATASET DESCRIPTION
The source material for the dataset was collated from the author's previous creative projects, including films, concert and field recordings, as well as music written specifically for the dataset. Files were selected to give a broad spectrum of content with variation in TSM difficulty. The number of source files, methods and time-scales was determined by balancing the amount of content required to train a neural network and the number of ratings required for a 'true' Mean Opinion Score (MOS). All content was converted to mono by averaging each pair of samples, to remove the influence of poor handling of multichannel files (Roberts and Paliwal, 2018), and normalized to ±1 before TSM. All files have a sample rate of 44.1kHz and a bit depth of 16 bits, and range in SPL from 56.62dB to 86.92dB with a mean and standard deviation of 73.37dB and 6.75dB respectively.
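This preprocessing amounts to a few operations; a sketch follows (Python, using the third-party soundfile package as an assumed reader, which the paper does not specify).

```python
import numpy as np
import soundfile as sf  # assumed reader; the paper does not name its tooling

def prepare_source(path):
    """Convert a source file to mono and peak-normalize it, as described above."""
    x, fs = sf.read(path)              # float samples, shape (n,) or (n, channels)
    if x.ndim == 2:
        x = x.mean(axis=1)             # average the channels down to mono
    return x / np.max(np.abs(x)), fs   # normalize to +/-1 before TSM
```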
The full dataset contains 34 musical files, 37 solo instrument files and 37 voice files, with a complete listing provided with the dataset. The total playback length of the source dataset is 6 minutes and 42 seconds. The durations of the audio files were kept short, with a mean of 3.7 seconds and standard deviation of 1.6 seconds, to limit the duration of files after time-scaling. Files were recorded using a combination of close microphone placement, multi-microphone concert recording, digital synthesis and sampling techniques, and shotgun, lapel and large diaphragm condenser microphones.
The musical files consist of six synthetic music excerpts, 17 organic excerpts and three excerpts containing a mix of synthetic and organic sound sources. This classification contains examples of classical, rock, jazz, and a variety of electronic genres. Six files contain brass instruments, seven contain percussion, four contain piano, four contain a rhythm section, three contain stringed instruments, five contain synthesizers and 10 contain woodwind instruments. The solo instrument files consist of five synthetic and 25 organic instruments. 11 files contain percussion, four contain rhythm instruments, one contains violin, three contain synthesizers and 11 contain woodwind instruments. The voice files can be further classified, with 14 male, eight female, three child and five singing files. Finally, the objective source files contain a mix of the aforementioned file types and are used in the generation of the test and evaluation sets.
To form the training set, the source dataset was processed using the first six methods previously mentioned at 10 time-scale ratios, resulting in 5,280 processed files. Time-scale ratios of 0.3838, 0.4427, 0.5383, 0.6524, 0.7821, 0.8258, 0.9961, 1.381, 1.667, and 1.924 were generated randomly, but adjusted to ensure coverage across the range of interest. The testing set used the final three methods mentioned previously at four random time-scales in four bands across 0.25 ≤ β ≤ 2, resulting in 240 testing files. Subjective evaluation was conducted for both the training and testing sets. An additional evaluation set was created and is discussed in Section V. Dataset generation took approximately three days on a medium- to high-end workstation.

The MATLAB TSM Toolbox (Driedger and Müller, 2014) was used with default settings for WSOLA, HPTSM and Elastique time-scaling. FuzzyPV and NMFTSM time-scaling used the implementations provided by their authors with default settings. Author implementations of PV, IPL, uTVS and FESOLA were used, with Hann windowing throughout and parameters chosen to maximize quality in informal subjective evaluation. All files were normalized after processing. The PV and IPL both used a frame length of 2,048 samples (46.4ms) and a synthesis hop of 512 samples. FESOLA used a frame length of 1,024 samples (23.2ms) and a zero-frequency resonator for epoch extraction. WSOLA used a frame length of 1,024 samples (23.2ms), a synthesis hop of 512 samples and a tolerance of 512 samples. HPTSM used parameters identical to those above for the IPL for the harmonic signal component, while the percussive signal component was scaled using the WSOLA algorithm with a frame size of 256 samples (5.8ms) and a hop of 64 samples. uTVS was implemented using six-times oversampling and a filterbank containing 88 filters, to maintain the relationship between the signal sample rate and filterbank length of the original paper. During testing, an error in the uTVS implementation was found that introduced discontinuities within spectra during processing at 0.9 ≤ β ≤ 1.1 for some files. However, as the purpose of the subjective testing was to rate multiple files exhibiting a variety of artefacts, the decision was made to not remove these files from the dataset. The error was rectified before creation of the evaluation subset.

IV. SUBJECTIVE TESTING
Subjective testing was undertaken in two phases. Initial testing was conducted internally within the laboratory. Due to the large number of responses needed per file, testing transitioned to an online browser-based test using the Web Audio Evaluation Tool (WAET) (Jillings et al., 2015), shown in figure 1. Remote testing greatly increased the number of participants in the study. Participants were contacted in person, directly through social media and email, through mailing lists and public posts on websites such as Reddit and Facebook.
Testing followed the ITU-R BS.1284-1 (ITU-T, 2019) recommendations for general methods for the subjective assessment of sound quality as closely as practicable, resulting in the following testing parameters. Files were presented in reference-processed pairs with no limits placed on the amount of playback before moving to the next file. Checks were included to ensure both files were played at least once. A continuous grading scale was used in conjunction with a quality scale, where Poor-Excellent corresponds to scores of 1-5. Sessions contained a randomised selection of processed files, presented in random order, with participants free to choose the session they would evaluate. The amount of content per session was refined during testing, for a maximum session duration of 20 minutes. Towards the end of testing, sessions were restricted to files that had limited responses, to reduce MOS standard deviation.
Initial testing was undertaken using a bespoke MATLAB GUI that presented individual reference-processed pairs, allowed saving and restoring of sessions, collected the participant's name and sound transducer, and included a check that the participant had no known hearing issues. Participants received training before beginning testing, including explanations of the purpose of TSM and of common artefacts, with audio examples. A small initial test session of 33 files was completed before a random session was assigned. Each session contained 18 minutes of audio, approximately 200 files, randomly selected from the pool of processed audio files. Participants could elect to evaluate additional sessions following a break equal in length to the previously completed session.
To increase the number of participants, the WAET was used. A small number of sessions containing 100 files were evaluated before sessions were reduced to 60 files, based on participant feedback regarding session duration. Training identical to that given in laboratory testing was available from the index page, which contained links to each test session. The index page contained reminders to use headphones in a quiet space during testing and a random number generator to suggest which test session the participant should complete. Before each session, the participant's name, age, sound transducer, experience in critical evaluation of sound and any known hearing issues were collected. Participants could also elect to provide an email address to be contacted for future studies. Each session was split into pages containing six reference-processed pairs.
To remove bias and variability between sessions, opinion scores were normalized according to ITU-R BS.1284 (ITU-T, 2019) using

Z_i = σ_s (x_i − x̄_si)/σ_si + x̄_s,

where Z_i is the normalized result, x_i is the opinion score of subject i, x̄_si is the mean score for subject i in session s, x̄_s is the mean score of all subjects in session s, σ_s is the standard deviation for all subjects in session s and σ_si is the standard deviation for subject i in session s. As the files in each session were unique, means and standard deviations were calculated on the subset of files matching those in the session. Normalized opinion scores were not truncated; however, MOS were limited to the interval of 1-5.
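A direct transcription of this normalization (illustrative Python/NumPy; argument names mirror the symbols above) is:

```python
import numpy as np

def normalize_scores(x, x_si, x_s, sigma_si, sigma_s):
    """Normalize subject i's opinion scores for session s (BS.1284 style).

    x        : opinion scores given by subject i in session s
    x_si     : mean score of subject i in session s
    x_s      : mean score of all subjects in session s
    sigma_si : standard deviation of subject i in session s
    sigma_s  : standard deviation of all subjects in session s
    """
    return sigma_s * (np.asarray(x, float) - x_si) / sigma_si + x_s

# Per-file MOS is the mean of the normalized ratings, clipped to the 1-5 scale:
# mos = np.clip(np.mean(z_ratings_for_file), 1, 5)
```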

A. Results
A total of 42,529 file ratings were collected from 263 participants across 633 sessions, with 10,354 ratings collected during laboratory testing. Participants ranged in age from 16 to 66 with a median age of 30. 52.36% of ratings were contributed by expert listeners. 12 files were limited to a MOS of 1, while 28 files were limited to a MOS of 5.
Due to the different files and time-scale ratios used for the testing subset, direct comparison between methods in the training and testing subsets was not appropriate. However, a general comparison was achieved through local averaging of MOS, centered around the training time-scale ratios. Means of adjacent time-scale ratios, bounded by 0.3 and 3, defined the local areas. While 0.3 is greater than some time-scales used within the testing set, it was set empirically to include enough data points while limiting the impact of much slower time-scales. Mean MOS for testing subset methods are noisier due to the smaller number of files and the non-uniform difficulty of processing each signal.
Two measures of reliability were used for each session. The Root Mean Squared Error (RMSE), denoted by L, is given by

L = √( (1/N) Σ_i (x_i − x̄_i)² ),

where N is the number of files within the session, x_i is the participant's opinion score for the i-th file and x̄_i is the overall MOS for that file. The Pearson Correlation Coefficient (PCC), denoted by ρ and given by

ρ = cov(x, x̄) / (σ_x σ_x̄),

was also used, where x and x̄ denote the sets of opinion scores and MOS for the session and σ_x and σ_x̄ are their standard deviations. These measures were calculated for each session before and after normalization. Outliers, calculated prior to normalization and shown in figure 2, were defined as sessions in which L or ρ was further than three scaled median absolute deviations from the respective median. This resulted in the removal of 45 sessions containing a total of 2,102 ratings (4.94%) from the final pool of sessions. The use of Intraclass Correlation Coefficients (ICC) was explored; however, as the subjective results are neither fully crossed nor fully nested, ICC cannot be used. Instead, the interrater reliability for Ill-Structured Measurement Designs proposed by Putka et al. (2008) was used:

G(q, k̃) = σ̂²_T / (σ̂²_T + q·σ̂²_R + σ̂²_TR,e / k̃),

where k̃ is the harmonic mean of the number of participants per file and q is a function of the rater overlap across files, computed from c_i,i′ (the number of participants that each pair of files (i, i′) share), k_i and k_i′ (the number of participants who rated files i and i′ respectively) and N_t (the total number of participants in the sample), as defined by Putka et al. (2008). σ̂²_T is the estimated variance for file main effects (true score), σ̂²_R is the estimated variance for participant main effects and σ̂²_TR,e is the estimated variance component for the combination of residual effects and file-participant interaction. This measure gives an overall rater reliability G(q, k̃) of 0.871 prior to normalization and 0.909 post normalization.
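Both session-reliability measures and the MAD-based outlier rule can be written compactly (illustrative Python/NumPy; the 1.4826 factor is the usual scaling that makes the MAD consistent with the standard deviation for normal data).

```python
import numpy as np

def session_reliability(scores, mos):
    """Session RMSE (L) and PCC (rho) against the overall per-file MOS."""
    scores, mos = np.asarray(scores, float), np.asarray(mos, float)
    L = np.sqrt(np.mean((scores - mos) ** 2))
    rho = np.corrcoef(scores, mos)[0, 1]
    return L, rho

def mad_outliers(values, thresh=3.0):
    """Flag values further than `thresh` scaled MADs from the median."""
    values = np.asarray(values, float)
    med = np.median(values)
    smad = 1.4826 * np.median(np.abs(values - med))  # scaled MAD
    return np.abs(values - med) > thresh * smad
```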
For an overview of all results, figure 4 shows all normalized file ratings ordered by ascending MOS. All opinion scores are shown in the histogram, with the overlaid red line showing the MOS for each file. It can be seen that when the TSM quality is very high or very low there is greater consensus amongst participants; however, there is a large variance in opinion for files with mid-range quality. It can also be seen that the MOS tracks below the majority of responses in the Good to Excellent range, suggesting a difference between MOS and the majority of opinion scores. Median opinion scores were explored, based on (Jamieson et al., 2004), resulting in tighter groupings; however, there was no significant change in averaged scores nor improvement in session reliability. Median opinion scores have nonetheless been included as labels with the dataset, along with mean and median opinion scores calculated before normalization. Figure 5 shows the results of each method for each time-scale, averaged across all files. All methods show improvement in quality as β approaches 1, as is to be expected. However, the implementation of uTVS gave poor performance when time-scaling at 0.9961 (see Section III), but achieved state-of-the-art performance for all other time-scales. When comparing two inverse time-scale ratios, for example β = 0.5 and β = 2, the slower of the pair is lower in quality, suggesting that slowing a file down is perceptually more difficult than increasing its speed. This is consistent with the testing of Sharma et al. (2017); however, the effect is more pronounced within this testing. Of interest are two specific cases, those of PV and WSOLA. For β < 1, PV is perceived to have a higher quality than WSOLA; however, this is reversed for β > 1. It can then be inferred that different artefacts are perceived as having a greater impact on the quality of the TSM. We propose that for β < 1, the transient doubling of WSOLA is perceived as worse than the 'phasiness' and transient smearing of the PV, while for β > 1 transient skipping is less detrimental than the artefacts introduced by the PV. This is a similar finding to Moinet and Dutoit (2011), who noted that some listeners preferred PV artefacts in some cases. Similarly, comparison of PV and IPL shows a change in preference towards the smeary PV artefacts, over the metallic artefacts of IPL, for large reductions in speed. The PV can be seen to be comparable to state-of-the-art methods for the three slowest time-scale ratios.
A surprising result is the high performance of IPL in comparison to HPTSM and uTVS. HPTSM achieved numerically similar results to those given in the original evaluation (Driedger et al., 2014). However, while HPTSM was previously reported to rate approximately 1 MOS higher, our testing found IPL to be rated higher for all except the two slowest time-scale ratios. Artefacts due to harmonic-percussive separation, the use of WSOLA with a very short frame length, or the lower sample rate of the files used in the MATLAB TSM Toolbox may be the cause. Similarly, the reduced sample rate in the original uTVS testing may have contributed to the variance in MOS between tests. Future research should include comparisons between different IPL implementations. Algorithm performance per class generally follows that of the overall results. As expected, however, there are differences in performance quality between methods dependent on the source material. When the mean MOS for each class are considered and the β = 0.9961 results excluded, uTVS is preferred for music and solo instrument sources while WSOLA is preferred for voice sources. However, the differences in averaged ratings are minor in most cases. Exact mean results have not been reported here as the primary focus is rating time-scaled files, rather than definitive evaluation of different TSM methods.
Perception of processing quality for musical sources, shown in figure 6, confirms the improved quality of frequency-domain over time-domain methods, with FESOLA and WSOLA giving poor results. The most interesting result here is that the PV is consistently rated higher than other methods for β < 0.7 and is comparable for other β. If ratings are averaged for each source file, it is possible to identify 'difficult' files to process. Files with uncorrelated high frequency content were rated poorly, while clean, harmonically simple musical excerpts were rated highly. Signals containing more transient material were rated lower than those containing less. Mean file ratings ranged from 2.76 for Jazz 1.wav to 3.94 for Yellow 2.wav. Mean MOS results for the solo instrument class of signals, shown in figure 7, improve over the musical and voice classes, with the exception of the PV for β > 1. Synthesizer bass sounds were the lowest rated, followed by noisy percussion, polyphonic instruments and tuned percussion, with monophonic harmonic instruments rated highest. The combination of low frequencies with significant transients within the synthesizer bass was particularly troublesome for all TSM methods. Mean file ratings ranged from 2. In considering mean MOS for voice signals, shown in figure 8, WSOLA is preferred for β > 1, while the preference is less clear for β < 1. Most methods, except the PV and NMFTSM, were rated similarly for 0.6 < β < 1; however, the PV is clearly preferred for β < 0.6. After this point, smoothness is preferred over transient doubling and metallic artefacts. When considering mean file ratings, the 11 lowest rated files were all male voices, with female and child voices as the seven highest rated files. This mirrors results by Sylvestre and Kabal (1992), who suggested poor frequency resolution at lower frequencies, as well as short frame sizes, as causes of lower quality. Mean file ratings ranged from 2.73 for Male 18.wav to 3.59 for Child 01.wav.
The mean standard deviation across all files was 0.802 before normalization and 0.718 after. As can be seen in figure 9, the range of standard deviation values converges as the number of responses for a file increases. During testing (at around 19,000 ratings), this graph showed convergence at around seven ratings per file. As a result, a minimum of seven ratings per file was set as the target to give a 'true' representation of the quality of the audio file. While there are files that have yet to converge, these are a small subset of the total dataset.
Comparisons between expert and non-expert listeners, between participants with and without known hearing issues, and between testing modalities were undertaken using the two one-sided tests (TOST) procedure of Hauck and Anderson (1984) and Lakens (2017). TOST begins with the null hypothesis of non-equivalent means and uses two one-sided tests to show equivalence within a given interval. The interval can be given as a raw score or a standardized difference. If the confidence interval for the difference of the means falls within the equivalence interval, the null hypothesis is rejected and equivalence can be claimed. Analysis was undertaken on session RMSE and PCC values before normalization. The equivalence interval was calculated as 5% of the reference sample's mean, and Confidence Intervals (CI) of 95% were used throughout. The sample Cohen's d is also given as an indication of effect size, where d ≈ 0.2 is a small effect.
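A Welch-style TOST can be sketched as below (illustrative Python/SciPy; the paper does not specify the exact variant used).

```python
import numpy as np
from scipy import stats

def tost(a, b, margin):
    """Two one-sided tests for equivalence of means (Welch form, sketch).

    a, b   : samples for the two groups (e.g. per-session RMSE values)
    margin : half-width of the equivalence interval in raw-score units
    Returns the larger one-sided p value; equivalence within +/- margin is
    claimed when this value falls below the significance level.
    """
    a, b = np.asarray(a, float), np.asarray(b, float)
    va, vb = a.var(ddof=1) / len(a), b.var(ddof=1) / len(b)
    diff, se = a.mean() - b.mean(), np.sqrt(va + vb)
    # Welch-Satterthwaite degrees of freedom
    df = (va + vb) ** 2 / (va**2 / (len(a) - 1) + vb**2 / (len(b) - 1))
    p_lower = 1 - stats.t.cdf((diff + margin) / se, df)  # H0: diff <= -margin
    p_upper = stats.t.cdf((diff - margin) / se, df)      # H0: diff >= +margin
    return max(p_lower, p_upper)

# e.g. a margin of 5% of the reference group's mean, as used in this study:
# p = tost(rmse_expert, rmse_non_expert, margin=0.05 * np.mean(rmse_expert))
```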
ITU Recommendation BS.1284 (ITU-T, 2019) recommends investigation of the relationship between expert and non-expert listeners. In this testing, participants self-selected whether they had experience in critically evaluating the quality of audio. RMSE and PCC for non-expert listeners were found to be equivalent to those of expert listeners, with equivalence intervals shown in figure 10. Testing RMSE gave a maximum p value of 0.0498 and d of 0.1273. Testing PCC gave a maximum p value of 4.67e-06 and d of 0.1059. We propose that equivalence is a result of the reference-test style of testing and the medium to large impairment in the processed signal, reducing the importance of highly trained critical listening skills for this type of subjective testing. Participants also reported any known hearing issues, with an open-answer text box given for responses. Results were not excluded if known issues were reported, but were instead manually sorted into a binary classification of 'No known hearing issues' and 'Any known hearing issues'. Hearing issues included highly descriptive explanations such as "-6dB above 14kHz", a range of tinnitus severity, age-related hearing changes and "I like punk music". PCC for participants with any hearing issues was found to be equivalent to that of participants without issues, while RMSE was not found to be equivalent. Equivalence intervals are shown in figure 11. Testing RMSE gave a maximum p value of 0.2467 and d of 0.0958. Testing PCC gave a maximum p value of 0.0245 and d of 0.1219. Our proposed explanation is two-fold. Those participants who reported known hearing issues in great detail were also expert listeners, familiar with the shortcomings of their own auditory systems. Additionally, as participants were presented with the source and processed files and asked to rate the quality of the processing, any issue within the auditory system would affect perception of both files. The small number of sessions classified as 'any issue' (33, compared to 554 for 'no issue') also impacts this result, greatly increasing the standard error. A t-test applied to RMSE was unable to reject that the means are equal, with a p value of 0.4985. Increasing the equivalence interval to ±9.32% allows RMSE equivalence to be claimed. Due to the strong PCC equivalence and close RMSE equivalence, we find no reason to reject sessions in which hearing issues were reported.
As testing was undertaken in different modalities, comparative analysis of the results is necessary. PCC for remote participants was found to be equivalent to that of laboratory participants, while RMSE was not found to be equivalent. Equivalence intervals are shown in figure 12. Testing RMSE gave a maximum p value of 0.3474 and d of 0.2126. Testing PCC gave a maximum p value of 0.0013 and d of 0.0931. A t-test applied to RMSE was unable to reject that the means are equal, with a p value of 0.4693. Increasing the equivalence interval to ±8.14% allowed RMSE equivalence to be claimed. Due to the strong PCC equivalence and close RMSE equivalence, we found no reason to reject either testing mode.
As participants also reported their age, it was possible to analyse the impact of age on the quality of responses. Correlations of 0.108 and -0.001 were found between the age of the participant and the session RMSE and PCC respectively, showing no impact of age on evaluation ability.
The labeled dataset is available, under the Creative Commons Attribution 4.0 International (CC BY 4.0) license, through IEEE-Dataport at http://ieee-dataport.org/1987. Implementation and additional source code is available at github.com/zygurt/TSM.

V. COMPARISON TO OBJECTIVE MEASURES
Comparison between the subjective results and the previous objective measures, SER and D_M, found correlations of 0.1445 and 0.0274 respectively. Signals were aligned through time-axis interpolation of the reference magnitude spectrum to the duration of the test spectrum before calculating each measure. Correlation was calculated as the mean PCC for β < 1 and β > 1 to account for the non-linear nature of MOS.
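The alignment step and the SER comparison can be sketched as follows (illustrative Python/NumPy; the exact SER formulation is assumed from its description as a magnitude-only measure, and spectrograms are assumed to have shape (frequency bins, frames)).

```python
import numpy as np

def align_to_test(ref_mag, test_mag):
    """Interpolate the reference magnitude spectrogram along the time axis
    so it has the same number of frames as the test spectrogram."""
    t_ref = np.linspace(0, 1, ref_mag.shape[1])
    t_test = np.linspace(0, 1, test_mag.shape[1])
    return np.stack([np.interp(t_test, t_ref, row) for row in ref_mag])

def ser_db(ref_mag, test_mag):
    """Signal to Error Ratio over aligned magnitude spectra, in dB
    (assumed form, following its description as a magnitude-only measure)."""
    ref = align_to_test(ref_mag, test_mag)
    return 10 * np.log10(np.sum(ref ** 2) / np.sum((ref - test_mag) ** 2))
```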
Due to the change in duration of processed files, direct comparison between PEAQ and subjective results is not possible. As such, initial testing was undertaken using a modified version of PEAQ to allow for objective evaluation of time-scaled signals. Signals were aligned as previously mentioned for feature extraction, and the PEAQ neural network was retrained to the subjective MOS. 10% of the training set was reserved for validation, with the optimal epoch being that with the minimum overall distance D computed from ρ = [ρ_tr, ρ_val, ρ_te] and L = [L_tr, L_val, L_te], where tr, val and te denote the training, validation and testing sets. The trained network achieved an L of 0.668 and ρ of 0.719, placing it at the 11th and 17th percentiles of subjective sessions. An evaluation set was created by processing the testing subset source files with all methods previously mentioned, at 20 time-scale ratios in the range of 0.22 < β < 2.2. The mean objective output for each method across the range of time-scales is shown in figure 13. The output exhibits a similar shape to that of the subjective results; however, it only moves away from the mean for β < 0.75 and β = 1. Development of an accurate objective measure of quality for TSM algorithms is now an achievable goal.
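The exact definition of D is not reproduced here; the sketch below assumes one plausible form, the Euclidean distance from the ideal operating point of ρ = 1 and L = 0 across the three partitions.

```python
import numpy as np

def overall_distance(rho, L):
    """Assumed form of the epoch-selection distance D: Euclidean distance
    from the ideal point rho = 1, L = 0 over [tr, val, te]. The paper's
    exact definition may differ."""
    rho, L = np.asarray(rho, float), np.asarray(L, float)
    return np.sqrt(np.sum((1 - rho) ** 2) + np.sum(L ** 2))

# best_epoch = min(range(n_epochs),
#                  key=lambda e: overall_distance(rho_hist[e], L_hist[e]))
```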

VI. CONCLUSION
This paper detailed the creation, subjective evaluation and analysis of a dataset and its use in the development of an objective measure of quality for time-scaled audio. Six TSM methods processed 88 source files at 10 time-scales resulting in 5,280 processed signals for a training subset. Three additional methods at four random time-scales resulted in 240 signals for a testing subset. 42,529 ratings were collected from 633 sessions using laboratory and remote collection methods. Preliminary results for an objective measure of quality were presented, which achieved an RMSE loss of 0.668 and PCC of 0.719. Future work includes using the dataset to develop an improved objective measure of quality for TSM, to assist in comparative evaluation of novel methods.