Figure 4

Schematic of the ML method introduced in the present paper for functional protein studies. (a) Binary representation of an amino-acid sequence. Let M = 20 be the number of amino-acid species and N the number of residues considered in the present study. The amino-acid sequence of a protein is then represented by M × N binary variables, each indicating whether a given amino-acid species occupies a given residue. (b) Writing the MN binary variables as x_{i,j}, i = 1, …, M, j = 1, …, N, we consider an MN-dimensional linear model with an intercept parameter β_0 and MN coefficient parameters β_{i,j}, i = 1, …, M, j = 1, …, N. (c) The linear model is fitted under a group-wise sparsity constraint, with the M coefficients of each residue forming one group. As a result, for many residues all M corresponding coefficients are estimated to be zero, and only a small number of residues retain nonzero coefficients. The latter are called active residues. The choice of amino-acid species at these active residues is expected to play an important role in determining molecular properties such as absorption wavelength.
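
The three panels can be illustrated with a minimal Python sketch. It is not the authors' implementation. The amino-acid alphabet ordering, the function names, and the threshold t are hypothetical choices made for illustration. The caption says only that a group-wise sparsity constraint is used, so the group soft-thresholding step below (the shrinkage operation commonly used for group-lasso penalties) is an assumption about how such a constraint could be realized.

```python
import numpy as np

# Assumed ordering of the M = 20 standard amino acids (one-letter codes).
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
M = len(AMINO_ACIDS)


def one_hot(seq):
    """Panel (a): return an (M, N) binary array x where
    x[i, j] = 1 if residue j is amino-acid species i, else 0."""
    x = np.zeros((M, len(seq)))
    for j, aa in enumerate(seq):
        x[AMINO_ACIDS.index(aa), j] = 1.0
    return x


def predict(x, beta0, beta):
    """Panel (b): linear model beta0 + sum over i, j of beta[i, j] * x[i, j]."""
    return beta0 + np.sum(beta * x)


def group_soft_threshold(beta, t):
    """Panel (c), assumed mechanism: shrink each residue's column of M
    coefficients by its Euclidean norm. Columns whose norm is at most t
    become exactly zero; the others are scaled toward zero."""
    norms = np.linalg.norm(beta, axis=0)
    scale = np.maximum(0.0, 1.0 - t / np.maximum(norms, 1e-12))
    return beta * scale


def active_residues(beta):
    """Indices (0-based) of residues with at least one nonzero coefficient."""
    return np.flatnonzero(np.any(beta != 0.0, axis=0))
```

In an actual fit, a shrinkage step of this kind would alternate with gradient updates on the squared prediction error. The sketch isolates only the part that makes whole residues drop out of the model.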