APPENDIX

A. DETAILS OF THE NEURAL NETWORK

A.1 Activation Function
We have found that one major reason for over-fitting, where the network memorizes training samples rather than generalizing, was the non-linearity applied after the parametrized layers, the Rectified Linear Unit (ReLU): relu(x) = max(0, x).

The phenomenon of "dead neurons" is a well-known and frequent issue in networks that apply ReLU. Generally speaking, ReLU suppresses any inhibiting activation (by clipping off negative values), excluding a notable portion of the neurons in the preceding layer from the succeeding layers; thus, the activation spectrum of the layer becomes saturated. In practice it is acceptable if different samples mute different neurons; however, during training some of the neurons can become completely silent, since gradient optimization cannot update muted neurons (their gradient is 0). Because of this property, there is a considerable risk that numerous nodes will not be able to influence the response of the classifier neurons.

Instead of ReLU, we used the SELU activation function published in Self-Normalizing Neural Networks [21]. By changing the activation function, we were able to overcome the variance problem of the networks applied, i.e. the distance between training and test performance was reduced for identical architectures. For benchmarks on VGG models see Figure 4.
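For reference, a minimal NumPy sketch of the two activations discussed above follows; the SELU constants are the ones published in [21] (rounded), the function names are ours for illustration only, and frameworks such as PyTorch also ship the activation as a built-in (torch.nn.SELU).

import numpy as np

# ReLU: negative pre-activations are clipped to zero, so their gradient is 0
# and the corresponding units can become permanently silent ("dead neurons").
def relu(x):
    return np.maximum(0.0, x)

# SELU constants from Klambauer et al. [21] (rounded).
SELU_ALPHA = 1.6732632423543772
SELU_LAMBDA = 1.0507009873554805

# SELU keeps a non-zero gradient for negative inputs and, with the constants
# above, pushes activations towards zero mean and unit variance.
def selu(x):
    return SELU_LAMBDA * np.where(x > 0, x, SELU_ALPHA * np.expm1(x))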
On the left side of Figure 4 two separate trends are revealed. Apparently, ReLU (bold lines) outperforms the identical networks applied with SELU (light lines), almost reaching ideal performance. On the right side of Figure 4 we can see the ReLU networks reaching their top test performance in the early stages of training, after which their accuracy decreases as training continues. In contrast, the accuracy of the SELU networks gradually improves throughout the entire training. Naming convention: ADAM stands for the gradient optimization method, 16/19 for the number of layers that have adjustable weights, and the double/halved/quart suffixes refer to the depth of each convolutional filter applied in the corresponding VGG baseline networks.

A.2 Dilated Convolution
The receptive field was another obstacle we encountered while setting up the baseline experiments. Simply by changing the 2-dimensional convolutions (3x3 filters) to their 1-dimensional equivalent (1x9 filters), we ended up with a network that could barely cover multiple heartbeats. Since we had learned that atrial fibrillation can be episodic, it was essential to extend the search space to architectures that could cover entire episodes. By applying the causal dilated convolutional filters used by [26], the receptive field was increased exponentially, further improving our models' accuracy without introducing variance problems (as max-pooling does) or sacrificing evaluation speed, since applying a dilated convolution results in minimal overhead compared to the traditional operation. For a visual example see Figure 5.
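A minimal PyTorch sketch of such a causal dilated stack is given below; the channel count, kernel size and number of layers are illustrative assumptions, not the exact configuration of our networks.

import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalDilatedConv1d(nn.Module):
    # 1-D convolution that is causal (left padding only) and dilated.
    def __init__(self, channels, kernel_size=9, dilation=1):
        super().__init__()
        self.left_pad = (kernel_size - 1) * dilation
        self.conv = nn.Conv1d(channels, channels, kernel_size, dilation=dilation)

    def forward(self, x):                      # x: (batch, channels, time)
        x = F.pad(x, (self.left_pad, 0))       # no output sample sees future inputs
        return self.conv(x)

# Doubling the dilation per layer grows the receptive field exponentially:
# with kernel size k and dilations 1, 2, 4, ..., 2**(L-1) the receptive field
# is 1 + (k - 1) * (2**L - 1) samples, while the output keeps the input length.
net = nn.Sequential(*[CausalDilatedConv1d(16, kernel_size=9, dilation=2 ** i)
                      for i in range(6)])
y = net(torch.randn(1, 16, 3000))              # y has shape (1, 16, 3000)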
A.3 Spectrogram
Transforming a prerecorded signal to the frequency domain with the Fast Fourier Transform (FFT) is a favoured approach in the field of signal processing. Frequency analysis reveals each frequency band's presence in the signal, which may expose periodic and aperiodic traits of the sample. In practice, when a sample of length N is transformed with the FFT, it produces two arrays of values of the same length N representing the complex (Im and Re) valued frequency coefficients. Usually, these two arrays are merged by the formula r = Im² + Re², resulting in the Power Spectrum (PS), while the phase (being less informative about the signal) is omitted, thus making the transformation irreversible. The problem with taking the PS is that it discards temporal patterns (such as the Q-T or R-R distance), making convolutional layers useless. Furthermore, the PS is not causal by design, meaning that the whole signal must be available before the PS can be obtained. A frequently applied technique in speech recognition is to take multiple FFTs of short overlapping windows sliding over the input audio sample and to concatenate the short samples' PS into a multi-channel array. Another small detail is to apply the natural logarithm element-wise to the resulting array, which increases the variance of the signal and prevents strong frequencies from suppressing those that are weaker by orders of magnitude.
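The following NumPy sketch illustrates the procedure described above; the window size, window function and the small constant added before the logarithm are illustrative choices rather than the exact values used in our pipeline.

import numpy as np

def log_power_spectrogram(signal, window_size=128, stride=1):
    # Sliding-window log power spectrum: every window is weighted, transformed
    # with an FFT, reduced to r = Im^2 + Re^2 (phase discarded) and compressed
    # with a natural logarithm so strong bands do not drown out weak ones.
    window = np.hanning(window_size)
    starts = range(0, len(signal) - window_size + 1, stride)
    frames = np.stack([signal[s:s + window_size] * window for s in starts])
    coeffs = np.fft.rfft(frames, axis=1)
    power = coeffs.real ** 2 + coeffs.imag ** 2
    return np.log(power + 1e-12)    # shape: (num_windows, window_size // 2 + 1)

With stride=1 the function produces almost one spectral column per input sample, which matches the high-overlap configuration described next.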
The main advantage of that method is that it preserves time-domain (temporal) patterns while it reveals the presence of different frequencies in the signal. Furthermore, it makes only a slight difference which weighting we apply to the values inside the sliding window, while the window size and stride (i.e. the inverse degree of overlap) heavily influence how long the resulting array will be and how many frequency bands will represent it (i.e. the resolution of the PS at a given time instance). We found it not just incredibly convenient, but also surprisingly effective, to choose the highest possible degree of overlap (a window stride of 1) and to resample the resulting spectrogram to match the length of the original signal. Taking the original sample and a redundant representation of the ECG recording (a 64-channel spectrogram) of the same length allowed us to apply two concurrent neural networks, one per domain (temporal and spectral), and to concatenate the resulting representations in depth without being forced to reduce the temporal dimension, since both feature vectors were of the same length.
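As an illustration of the depth-wise fusion described above, the hypothetical two-branch module below concatenates the outputs of a temporal and a spectral feature extractor of equal temporal length along the channel axis; the module and parameter names are ours and the classifier head is an assumption, not the paper's exact architecture.

import torch
import torch.nn as nn

class MultiDomainNet(nn.Module):
    # Two-branch fusion: one branch processes the raw ECG, the other its
    # spectrogram of equal temporal length; features are concatenated in depth
    # (channel axis), so the temporal dimension never has to be reduced.
    def __init__(self, temporal_net, spectral_net, feat_channels, num_classes):
        super().__init__()
        self.temporal_net = temporal_net   # e.g. the causal dilated stack sketched earlier
        self.spectral_net = spectral_net
        self.head = nn.Conv1d(2 * feat_channels, num_classes, kernel_size=1)

    def forward(self, ecg, spectrogram):   # both shaped (batch, channels, time)
        t = self.temporal_net(ecg)
        s = self.spectral_net(spectrogram)
        fused = torch.cat([t, s], dim=1)   # depth-wise concatenation
        return self.head(fused)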
A.4 Multi-Domain Representation Learning
While both input spaces, temporal and spectral, had their challenges, we saw that, by designing the preprocessing steps for spectrogram training accordingly, the feature extractor network produced output of the same length as its time-domain equivalent. That led us to concatenate these pretrained feature extractors to test whether a multi-domain representation could help the final layer overcome issues specific to single-domain classification by letting the domains complement each other. Indeed, we found that the typical behaviour of the time-domain network, increasing the accuracy of a single class at the expense of severe forgetting in the other classes, disappeared when using multi-domain features. At the same time, spectral-domain networks struggled with variance problems, even with extremely low capacity. Also, networks trained in the frequency domain were more dependent on the choice of training / evaluation data. These traits are absent when the feature extractors are working in