Fig. 3: Identifying most estate-informative sections of the chromatograms.
From: Predicting Bordeaux red wine origins and vintages from raw gas chromatograms

a A “survival of the fittest” algorithm was applied to the ester chromatogram: the bin (2% of the data) whose removal had the least effect on estate decoding accuracy was removed, and this was repeated one bin at a time until a single bin was left. Importantly, the best 10% of the data achieved the same decoding accuracy as the complete chromatogram, showing that these sections carry all the estate information needed for decoding. The top panel shows decoding accuracy as a function of the fraction of the data retained, keeping the best-decoding bins at each step. The lower panel shows the five most important sections (red) on top of an example ester chromatogram (blue). The darkness of the red indicates each section's rank in the survival algorithm, with darker bins being more informative. b Estate decoding accuracy per data bin (red bars) with an overlaid example chromatogram (blue), for each chromatogram type. After dividing the chromatogram into 50 equal bins, estate decoding was performed using only single bins with LDA as in Fig. 2. Test decoding accuracy, shown in red for each bin, varies fairly smoothly across the chromatogram, and most bins decode above chance (0.14). This indicates that estate chemical identity is not defined by just a few bins of the chromatogram but is distributed throughout. That full decoding performance requires only 5 bins (a) suggests that the information is highly redundant across bins. Similar results were obtained for oak and offFla (Figs. S8, S9). c, d show the results of the same analyses performed for vintage decoding. Note that the reduction of the oak chromatogram led to a 20% increase in performance, indicating that our decoder was subject to overfitting when applied to the whole chromatogram.
Decoding performance from individual bins is lower than for estate decoding yet still clearly above chance for most segments, again suggesting that vintage information is distributed throughout the chromatogram and that there is a high level of redundancy across bins.
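
The two analyses described in this caption, single-bin decoding and the “survival of the fittest” bin elimination, can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes a nearest-centroid classifier in place of LDA (so that it runs on the standard library alone), leave-one-out evaluation, and a tiny synthetic dataset. All function names (`split_bins`, `loo_accuracy`, `per_bin_accuracy`, `backward_elimination`, `toy_data`) are hypothetical.

```python
from statistics import fmean


def split_bins(n_points, n_bins):
    """Return (start, stop) index pairs for n_bins near-equal contiguous bins."""
    edges = [round(i * n_points / n_bins) for i in range(n_bins + 1)]
    return list(zip(edges[:-1], edges[1:]))


def select(x, bins, keep):
    """Concatenate the points of one chromatogram that fall in the kept bins."""
    return [v for b in keep for v in x[bins[b][0]:bins[b][1]]]


def loo_accuracy(X, y, bins, keep):
    """Leave-one-out accuracy of a nearest-centroid classifier.

    Assumption: nearest-centroid stands in for the LDA decoder of the paper.
    """
    feats = [select(x, bins, keep) for x in X]
    correct = 0
    for i, sample in enumerate(feats):
        centroids = {}
        for label in set(y):
            rows = [f for j, f in enumerate(feats) if j != i and y[j] == label]
            if rows:
                centroids[label] = [fmean(col) for col in zip(*rows)]
        # Predict the class whose centroid is closest (squared Euclidean).
        pred = min(
            sorted(centroids),
            key=lambda c: sum((a - b) ** 2 for a, b in zip(sample, centroids[c])),
        )
        correct += pred == y[i]
    return correct / len(feats)


def per_bin_accuracy(X, y, n_bins):
    """Decode using one bin at a time (the analysis of panels b and d)."""
    bins = split_bins(len(X[0]), n_bins)
    return [loo_accuracy(X, y, bins, [b]) for b in range(n_bins)]


def backward_elimination(X, y, n_bins):
    """Repeatedly drop the bin whose removal hurts accuracy least (panels a, c).

    Returns a list of (kept_bins, accuracy) pairs, from all bins down to one.
    Ties are broken in favour of dropping the highest-index bin.
    """
    bins = split_bins(len(X[0]), n_bins)
    keep = list(range(n_bins))
    history = [(list(keep), loo_accuracy(X, y, bins, keep))]
    while len(keep) > 1:
        scores = [
            (loo_accuracy(X, y, bins, [k for k in keep if k != b]), b)
            for b in keep
        ]
        best_acc, drop = max(scores)
        keep.remove(drop)
        history.append((list(keep), best_acc))
    return history


def toy_data():
    """Synthetic data: 3 'estates' x 4 replicates, 8 points per chromatogram.

    Only points 2 and 3 (bin 1 of 4) depend on the estate; the other points
    depend on the replicate alone, so they carry no estate information.
    """
    X, y = [], []
    for estate in range(3):
        for rep in range(4):
            base = 0.1 * rep
            x = [base] * 8
            x[2] = x[3] = estate + base
            X.append(x)
            y.append(estate)
    return X, y


if __name__ == "__main__":
    X, y = toy_data()
    print(per_bin_accuracy(X, y, n_bins=4))
    for kept, acc in backward_elimination(X, y, n_bins=4):
        print(kept, acc)
```

On this toy data, only bin 1 decodes the estate when used alone, and it is the last bin left by the elimination loop, with accuracy unchanged as the uninformative bins are removed. Real chromatograms would instead show the redundancy described above, where many single bins decode above chance. With LDA and a held-out test split, as in the paper, the same loop structure applies with only `loo_accuracy` swapped out.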