The data is publicly available in the supplementary section of the article "Multivariate-based classification of predicting cooking quality ideotypes in rice (Oryza sativa L.) indica germplasm" by Rosa Paula Cuevas (me), Cyril John Domingo, and Nese Sreenivasulu, published in Rice 11: 56 (2018).
In that paper, the analyses were done in R. I used Ward's cluster analysis to classify the rice varieties into quality types, then multinomial logistic regression to build a model that differentiates those quality types based on the non-collinear variables characterising the rice samples. Finally, a random forest algorithm was applied to identify the variables that were most important in classifying the samples.
I am now exploring the dataset using different approaches implemented in Python. The results, of course, differ because here I use deep learning with neural networks.
There are 25 continuous variables in the dataset:
Variable | Meaning | Description |
---|---|---|
AC | Amylose content (%) | Predicts hardness and stickiness of cooked rice based on the relative concentration of amylose (starch type with straight chains) |
GT_DSC | Gelatinisation temperature (ºC) | Indicates the temperature range at which rice begins to cook based on the melting of amylopectin (crystalline starch type with hyperbranched chains) |
PC | Protein content (%) | Indicates the relative amount of proteins inside the rice endosperm based on Kjeldahl N measurements |
HRD | Hardness (g) | Force required to bite onto a sample, simulated by compression |
ADH | Adhesiveness (g•sec) | Degree of stickiness of a sample, simulated by the work required to separate the probe from the base platform |
COH | Cohesiveness | Capacity of a sample to remain intact rather than to break during compression |
SPR | Springiness | Capacity of a sample to return to its original shape after compression |
SMMAX | Maximum storage modulus (Pa) | Maximum elastic response of a sample (solid-like behaviour) |
TEMP_SMMAX | Temperature at maximum storage modulus (ºC) | Temperature reading when a sample exhibits maximum solid-like behaviour |
TD_SMMAX | Tan delta at max storage modulus | Ratio of loss to storage modulus at maximum storage modulus |
LM_SMMAX | Loss modulus at max storage modulus (Pa) | Viscous response of a sample at the maximum storage modulus |
TEMP_GELPT | Temperature at Gel Point (ºC) | Temperature reading when the loss and the storage moduli are equal (tan delta = 1) |
TROUGH_SM | Lowest storage modulus (Pa) | Lowest storage modulus value after reaching the maximum |
SLOPE1_SM | Increasing storage modulus | Measured from gel point to SMMAX |
SLOPE2_SM | Decreasing storage modulus | Measured from the highest to the lowest points of storage modulus after SMMAX |
SLOPE3_LM | Increasing loss modulus | Measured from gel point to maximum loss modulus |
SLOPE4_LM | Decreasing loss modulus | Measured from maximum loss modulus until it levelled off |
PV | Peak viscosity (RVU) | Highest viscosity recorded as the sample is cooked |
TV | Trough viscosity (RVU) | Lowest viscosity recorded as the sample is kept at a high temperature |
FV | Final viscosity (RVU) | Last viscosity reading when the sample is cooled |
BD | Breakdown (RVU) | Difference between peak viscosity and trough viscosity |
SB | Setback (RVU) | Difference between final viscosity and peak viscosity |
LO | Lift-off (RVU) | Difference between final viscosity and trough viscosity |
PASTEMP_RECALC | Pasting temperature (ºC) | Temperature when a sample starts thickening (as temperature is increased) |
PT | Pasting time (min) | How long it takes to reach peak viscosity |
First, I calculate Pearson correlation coefficients between all variable pairs and flag the pairs with high coefficients (r > 0.70). From these variable pairs, I pick variables to exclude from the analysis.
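As a rough illustration, this screening step could look like the snippet below. It assumes the 25 variables sit in a pandas DataFrame named `df`; the name and the use of absolute correlations are my choices for the sketch, not details taken from the original analysis.

```python
# Minimal sketch of the collinearity screen, assuming `df` holds the 25 variables.
import pandas as pd

corr = df.corr(method="pearson")

# Collect variable pairs whose correlation exceeds the 0.70 threshold
# (absolute value used here so strongly negative pairs are also flagged).
high_pairs = [
    (a, b, corr.loc[a, b])
    for i, a in enumerate(corr.columns)
    for b in corr.columns[i + 1:]
    if abs(corr.loc[a, b]) > 0.70
]

for a, b, r in high_pairs:
    print(f"{a} - {b}: r = {r:.2f}")
```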
Second, I conduct K-means clustering. To determine the number of clusters, I use the elbow method (the sum of squared distances within clusters for each candidate number of clusters), the silhouette method, and a dendrogram. These methods indicate that a five-cluster solution is best; hence, the subsequent deep learning neural network is built around five clusters.
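A minimal sketch of these diagnostics is shown below, assuming `X` is the scaled feature matrix after dropping the collinear variables; the variable name and the range of candidate cluster counts are assumptions for illustration.

```python
# Sketch of the cluster-number diagnostics, assuming `X` is the scaled feature matrix.
import matplotlib.pyplot as plt
import scipy.cluster.hierarchy as sch
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

inertias, silhouettes = [], []
ks = range(2, 11)
for k in ks:
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    inertias.append(km.inertia_)                          # elbow method
    silhouettes.append(silhouette_score(X, km.labels_))   # silhouette method

# Dendrogram (Ward linkage) as a visual check on the cluster count.
sch.dendrogram(sch.linkage(X, method="ward"))
plt.show()

# Fit the final five-cluster solution; its labels become the classification target.
kmeans5 = KMeans(n_clusters=5, n_init=10, random_state=42).fit(X)
```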
Neural networks are a programming approach that learns from observational data by loosely imitating the way the brain's neurons connect and process information. Deep learning, in turn, is a set of techniques for training such networks. I use these techniques to classify the samples.
I divide the samples into a training set and a test set, then scale the data so that all variables are on a comparable scale. The input layer is composed of the 18 variables that remain after excluding the collinear ones. The hidden layer has 100 units and uses ReLU as its activation function; the output layer has five units (one per cluster) and uses softmax.
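A minimal Keras sketch of this architecture follows. The 80/20 split, optimiser, epochs, and batch size are my assumptions for illustration, not necessarily the settings used in the actual analysis; `y` is assumed to hold the five-cluster labels from the K-means step.

```python
# Sketch of the network: 18 inputs -> 100 ReLU units -> 5 softmax units.
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from tensorflow import keras

# Assumed 80/20 split, stratified by cluster label.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

# Scale so all variables are on a comparable scale (fit on training data only).
scaler = StandardScaler().fit(X_train)
X_train, X_test = scaler.transform(X_train), scaler.transform(X_test)

model = keras.Sequential([
    keras.layers.Input(shape=(18,)),              # 18 non-collinear variables
    keras.layers.Dense(100, activation="relu"),   # hidden layer
    keras.layers.Dense(5, activation="softmax"),  # one unit per quality cluster
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.fit(X_train, y_train, epochs=100, batch_size=16,
          validation_split=0.1, verbose=0)
```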
The deep learning model's performance is then evaluated on the DS2014PHY data.
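A sketch of that evaluation step, under the assumption that DS2014PHY has been loaded into a DataFrame `ds2014phy`, scaled with the same scaler, and that its cluster labels are available as `y_phy` (the names `ds2014phy`, `selected_columns`, and `y_phy` are hypothetical):

```python
# Evaluate the trained model on the held-out DS2014PHY samples.
from sklearn.metrics import classification_report

X_phy = scaler.transform(ds2014phy[selected_columns])  # same 18 variables, same scaling
y_pred = model.predict(X_phy).argmax(axis=1)           # predicted cluster per sample
print(classification_report(y_phy, y_pred))
```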