This paper was accepted at the workshop "I Can't Believe It's Not Better: Understanding Deep Learning Through Empirical Falsification".
Continuous pseudo-labeling (PL) algorithms such as slimIPL have recently emerged as a powerful strategy for semi-supervised learning in speech recognition. In contrast with earlier strategies that alternated between training a model and generating pseudo-labels (PLs) with it, here PLs are generated end-to-end as training proceeds, improving training speed and the accuracy of the final model. PL shares a common theme with teacher-student models such as distillation, in that a teacher model generates targets that must be mimicked by the student model being trained. However, interestingly, PL strategies generally use hard-labels, whereas distillation uses the distribution over labels as the target to mimic. Inspired by distillation, we expect that specifying the whole distribution (aka soft-labels) over sequences as the target for unlabeled data, instead of a single best-path pseudo-labeled transcript (hard-labels), should improve PL performance and convergence. Surprisingly, we find that soft-label targets can lead to training divergence, with the model collapsing to a degenerate token distribution per frame. We hypothesize that the reason this does not happen with hard-labels is that the training loss on hard-labels imposes sequence-level consistency that keeps the model from collapsing to the degenerate solution. In this paper, we present several experiments that support this hypothesis, and we experiment with several regularization approaches that can ameliorate the degenerate collapse when using soft-labels. These approaches can bring the accuracy of soft-labels closer to that of hard-labels, and while they are unable to outperform them yet, they serve as a useful framework for further improvements.
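To make the hard-label vs. soft-label distinction concrete, the following is a minimal NumPy sketch of the two kinds of targets a teacher can provide on unlabeled audio. It assumes a CTC-style frame-level model; the function names and the greedy best-path decoding are illustrative simplifications, not the actual slimIPL implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax over the token dimension.
    z = x - x.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def hard_pseudo_label(teacher_logits, blank=0):
    """Hard target: greedy best path (per-frame argmax),
    with repeated tokens collapsed and blanks removed,
    yielding a single pseudo-labeled transcript."""
    path = teacher_logits.argmax(axis=-1)
    out, prev = [], None
    for t in path:
        if t != prev and t != blank:
            out.append(int(t))
        prev = t
    return out

def soft_pseudo_label_loss(student_logits, teacher_logits):
    """Soft target: frame-wise cross-entropy of the student's
    distribution against the teacher's full distribution,
    averaged over frames (a distillation-style objective)."""
    p = softmax(teacher_logits)
    log_q = np.log(softmax(student_logits))
    return float(-(p * log_q).sum(axis=-1).mean())
```

Note that the hard target is a sequence, so the training loss (e.g., CTC on the transcript) couples frames together, whereas the soft loss above is purely frame-local; this difference is the basis of the sequence-level-consistency hypothesis discussed in the abstract.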