Introduction

Koompassia malaccensis Maingay ex Benth. (Leguminosae) is one of the major commercial timber species traded in South-East Asia, distributed in Peninsular Malaysia, Sumatra, and Borneo1. It is a very large tree that can grow up to 55 m in height and 200 cm in diameter. In Peninsular Malaysia, K. malaccensis is commonly known as kempas, whereas in Sabah and Sarawak, it is called impas and menggris, respectively. It is found in lowland, hill, peat and freshwater swamp forests, usually below 150 m, although its occurrence up to 610 m has been reported1. Despite being listed as Least Concern (LC) under the International Union for Conservation of Nature (IUCN) Red List, version 3.12, K. malaccensis is a protected tree species in the states of Sabah3 and Sarawak4.

The Standard Malaysian Name for its timber is Kempas, a medium hardwood with an air-dry density of 770–1120 kg/m³. Kempas is a monospecific timber, as it refers only to K. malaccensis, unlike most timber trade names, which typically encompass a genus or a group of related species. Kempas wood is suitable for interior as well as exterior use5. It has a fine, interlocked grain pattern, making it very popular for furniture, flooring, and cabinetry. K. malaccensis is potentially vulnerable to illegal logging because of its widespread distribution and commercial value. Among the medium hardwoods traded domestically in Malaysia, Kempas sawn timber is the second most expensive, after Keruing, which is the timber trade name for Dipterocarpus species. In 2024, the price per cubic metre under General Market Specification (GMS) was approximately USD$513 for Kempas, compared to USD$700 for Keruing6.

Illegal logging refers to all activities related to harvesting, processing and trading of timber that violate national and sub-national laws7, including logging without a license. It not only causes deforestation, but also leads to significant revenue loss for the governments of timber-producing countries. To curb illegal logging and deforestation, a growing number of consumer countries have enacted legislation to prohibit trade in illegal timber, such as the U.S. Lacey Act, the European Union Timber Regulation, and the more recent Korea Act on the Sustainable Use of Timbers and Japan Clean Wood Act. Several positive impacts have been realized in the United States (US) and European Union (EU) markets. The volumes of illegal wood-based products imported by the US fell by one-third in 2013, compared to their peak in 2006, and volumes halved over the same period in three EU countries8.

Validating timber species and origin is key to documenting and demonstrating compliance with timber laws and regulations. Over the past decade, new tools and resources have been developed to support due diligence throughout supply chains9, including timber tracking based on structural (wood anatomy), chemical (direct analysis in real-time time-of-flight mass spectrometry (DART-TOFMS) and near-infrared spectroscopy (NIRS)), and genetic (DNA barcoding and profiling) methods. These tools vary in their scalability (identification of genus, species, individual and geographical origin), applicability in the field (screening on the front line or as diagnostic laboratory methods), and cost-effectiveness (processing speed, equipment and expertise required)10.

The “Best Practice Guide for Forensic Timber Identification” published by the United Nations Office on Drugs and Crime (UNODC) provides information on appropriate procedures and methods involved in the entire investigation process, from the crime scene to the court room, to ensure consistent and quality results across jurisdictions11. Meanwhile, World Forest ID focuses on building a comprehensive reference database of geo-referenced plant samples, in partnership with regulatory and enforcement agencies worldwide, to enable species identification and tracing of harvest location12, facilitating verification of the products’ claimed geographic origins.

There is promising progress in the development of scientific verification technologies for identifying species and geographic origin based on the priority tree species listed by the Global Timber Tracking Network (GTTN)13,14. For the 322 taxa (36 genera, 286 species) on the species priority list, the status of technology for species identification has been assessed: reference data exist for all taxa using wood anatomy, followed by 86% using DNA barcoding loci, 41% using DART-TOFMS spectra, and 6% using NIRS spectra14. In comparison, progress in identifying geographic origin is lagging, with data existing for only 24% of taxa, mainly from genetic approaches (23%)14. This disparity highlights the need for more studies on determining the geographical origin of timber, especially in the Asia, Pacific, and Oceania region, in addition to Central and South America14.

In Malaysia, the National Forestry Act 1984 (Amendment 2022) serves as the current legal framework for the administration, management and conservation of forests, protecting against illegal logging and deforestation. To support the national effort to combat illegal logging, the Forest Research Institute Malaysia (FRIM) has developed comprehensive short tandem repeat (STR) DNA profiling databases for several important tropical timber species, including Neobalanocarpus heimii15, Gonystylus bancanus16, Rubroshorea platyclados17, R. leprosula18, Intsia palembanica19 and Aquilaria malaccensis20. These databases have been applied to assist forest enforcement authorities in forest crime investigations, providing forensic evidence to convict illegal loggers. In this paper, we report the establishment of an STR database of K. malaccensis in Malaysia, and present a case study on its application as an investigative tool in a forensic context.

Results

Population structure

The Bayesian-based STRUCTURE analysis divided the 56 populations of K. malaccensis into two genetic clusters, i.e. Cluster 1 and Cluster 2 (Fig. 1A). All 36 populations grouped under Cluster 1 are from West Malaysia, while Cluster 2 consists of all 13 populations from East Malaysia plus the seven peat swamp populations from West Malaysia. Upon further STRUCTURE analysis, Cluster 2 was subdivided into Cluster 2a (all populations from East Malaysia except Maludam) and Cluster 2b (all peat swamp populations in this study) (Fig. 1B). The eight peat swamp populations are SKarangB, RMusa, KLangat, Pekan, Nenasi, Resak, AHitam, and Maludam, with Maludam being the only peat swamp population sampled from East Malaysia. Notably, the cluster analysis based on genetic distance (DA) concurs with this finding: the UPGMA dendrogram displays the same grouping pattern (Cluster 1, 2a, 2b), with bootstrap support ≥ 90% (Fig. 2).

Fig. 1

Bayesian clustering results. (A) The 56 populations of Koompassia malaccensis from Malaysia were partitioned into two clusters: Cluster 1 and 2 (K = 2). Cluster 1 consists of 36 populations from West Malaysia (WM); Cluster 2 consists of East Malaysia (EM) populations and Peat Swamp (PS) populations. (B) Cluster 2 was further subdivided into Cluster 2a (all EM populations except Maludam) and 2b (all PS populations) (K = 2). (C) Map of Malaysia showing the sampling locations and the population structure inferred from STRUCTURE analysis; PS populations are indicated in red font.

Fig. 2

Dendrogram generated based on UPGMA cluster analysis showing the relationships among the 56 populations of Koompassia malaccensis in Malaysia. Corresponding to the Bayesian clustering results, these populations were partitioned into Cluster 1, 2a and 2b with bootstrap ≥ 90%.

Individual identification database

Correspondingly, the STR database for K. malaccensis was partitioned according to the aforementioned three regions (geographical/ecological), designated as the WM (West Malaysia), EM (East Malaysia) and PS (Peat Swamp) Databases. The forensic parameters for the nine STR loci by region are given in Table 1. The allele frequencies of each STR locus by region are provided in Tables S1, S2 and S3. In total, 170, 162 and 82 alleles were observed in the WM, EM and PS regions, respectively. The power of discrimination (PD) for each STR locus ranged from 0.6905 to 0.9762 (WM), 0.6509 to 0.9708 (EM) and 0.3402 to 0.9076 (PS). In West Malaysia, 56% of the STR loci departed from Hardy-Weinberg equilibrium (HWE) after Bonferroni correction, whereas 44% and 22% of loci deviated from HWE in East Malaysia and Peat Swamp, respectively. Nevertheless, these results reflect the reality in natural populations, whether of humans, animals, or plants, in which the assumptions of completely random mating and zero migration, necessary for HWE, are unlikely to be met21,22,23.

Table 1 Forensic parameters of the nine Koompassia malaccensis STR loci for the three regional databases: West Malaysia (WM), East Malaysia (EM) and Peat Swamp (PS). A: number of alleles; Ho: observed heterozygosity; He: expected heterozygosity; PIC: polymorphism information content; HWE: Hardy-Weinberg equilibrium; MP: matching probability; PD: power of discrimination. b: significant departure from Hardy-Weinberg equilibrium after Bonferroni correction (P < 0.05/9 = 0.0056).

Mean self-assignment, i.e. the proportion of individuals correctly assigned to the population of origin, was 36.7%, ranging from 0% (Klau, Som) to 96.6% (Maludam) (Table S4). However, at the regional level, the mean successful assignment rate of an individual to the region of origin was very high (WM 99.3%, EM 100.0% and PS 99.6%).

Conservativeness of the database

The minimum allele frequency was adjusted for alleles falling below the thresholds of 0.0026 (WM), 0.0093 (EM) and 0.0100 (PS) (Tables S1, S2 and S3). The coancestry coefficient (θ) for EM (0.0898) was the highest, followed by PS (0.0528) and WM (0.0254) (Table 2). The inbreeding coefficient (f) for EM was also the highest (f = 0.1122), compared with PS (f = 0.0763) and WM (f = 0.0249). All the θ and f values were significantly greater than zero, as shown by the 95% confidence intervals not overlapping with zero. Both the θ and f values were used to evaluate the conservativeness of each database by testing the cognate database (Porigin) against the regional database (Pcombined). The databases were not fully conservative at the calculated θ values. Hence, to ensure conservativeness, the value of θ was adjusted for WM (from 0.0254 to 0.3772), EM (from 0.0898 to 0.4285) and PS (from 0.0528 to 0.2909) (Fig. S1).
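The three thresholds quoted here are consistent with the 5/2n rule described in the Methods when applied to the regional sample sizes (947, 268 and 250 individuals). The short sketch below reproduces that arithmetic. The helper names and the idea of raising rare allele frequencies to the floor with `max` are illustrative assumptions, not the study's actual pipeline.

```python
def minimum_allele_frequency(n_individuals: int) -> float:
    """Conservative floor of 5/2n for a diploid species (2n allele copies)."""
    return 5 / (2 * n_individuals)


def apply_minimum_frequency(freqs: dict, n_individuals: int) -> dict:
    """Raise any allele frequency below the floor up to the floor (hypothetical helper)."""
    floor = minimum_allele_frequency(n_individuals)
    return {allele: max(freq, floor) for allele, freq in freqs.items()}


# Regional sample sizes as reported in the Methods.
region_sizes = {"WM": 947, "EM": 268, "PS": 250}
thresholds = {
    region: round(minimum_allele_frequency(n), 4)
    for region, n in region_sizes.items()
}
print(thresholds)

# Toy example: a made-up locus with one very rare allele in the PS database.
adjusted = apply_minimum_frequency({"a1": 0.998, "a2": 0.002}, region_sizes["PS"])
print(adjusted)
```

Note that adjusted frequencies no longer sum to one; this is deliberate in a forensic setting, where overstating the frequency of rare alleles favours the defendant.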

Table 2 Coancestry coefficients (θ) and inbreeding coefficients (f) of Koompassia malaccensis by region. Probabilities of the mean θ and f were determined using bootstrap analysis (1,000 replications) with a 95% confidence interval. N is the number of samples.

Discussion

Although East Malaysia is larger (60% of the total land area of Malaysia) than West Malaysia (Peninsular Malaysia), about 80% of the samples in this study were collected from Peninsular Malaysia (43 populations, n = 1168), with only four populations from Sabah and nine from Sarawak (n = 297). A key reason for the smaller sample size in East Malaysia is that sampling trips there cost significantly more than trips within Peninsular Malaysia. Besides flights, ferries and land transportation, boats were needed to reach some of the remote populations such as Batang Ai, Amang and Putai, as well as Maludam (the only peat swamp forest from Borneo in this study). Moreover, several additional locations were sampled in Sabah (Madai, Bukit Tawau and Siaunggau Forest Reserves) and Sarawak (Sampadi, Gunung Singai, Similajau, Bukit Lima Nature Reserve and Klingkang Range), but they were excluded from the study because of their small sample sizes (four per location on average). A few other forest reserves were also surveyed, but no K. malaccensis was found there (Bukit Gemuk, Crocker Range and Rafflessia in Sabah; Gading and Gunung Santubung in Sarawak).

Koompassia malaccensis from peat swamp areas made up 14% of the populations sampled (8 of 56; n = 250). Genetic diversity in PS is markedly the lowest among the regions, with a mean of only nine alleles observed per locus, compared to 19 (WM) and 18 (EM). The coancestry coefficient (θ) for EM (0.0898) was the highest, followed by PS (0.0528) and WM (0.0254) (Table 2). In other words, only 8.98%, 5.28% and 2.54% of the genetic variability was distributed among populations within EM, PS and WM, respectively.

Based on the allele frequency distributions of the candidate populations, an assignment test can determine whether an individual with a given STR profile is likely to originate from a given population38. The average self-assignment, i.e. the average proportion of individuals correctly assigned to the population of origin, was 36.7% (Table S4). Among the 36 populations in WM, only a minority have correct assignment rates above 50% (PSelatan 63.9%, Pelagat 59.1% and Bbandi 52.4%). By comparison, in EM and PS, 75% of the populations in each region have self-assignment rates above 50%. This is due to the weaker population differentiation in WM (θ = 0.0254) compared to EM (θ = 0.0898) and PS (θ = 0.0528). In contrast, at the regional level, the successful assignment rate of individuals to the region of origin was very high, with an average of 99.6% (West Malaysia 99.3%, East Malaysia 100.0% and Peat Swamp 99.6%).

The STR analysis revealed that K. malaccensis populations in West Malaysia possess genotypes different from those in East Malaysia. For example, at locus Kma082, alleles 271, 273, 275, 277, 281, 285 and 292 were found only in West Malaysia. The UPGMA cluster analysis showed that K. malaccensis populations from peat swamp forests and populations in East Malaysia clustered together as Cluster 2. This finding is congruent with the result of the STRUCTURE analysis, which also yielded two clusters with the same groupings, i.e., Cluster 1 (WM) and Cluster 2 (EM & PS). In the subsequent sub-structure analysis, two genetic clusters were further delineated using both approaches: K. malaccensis from East Malaysia (Cluster 2a; excluding Maludam) and those from the peat swamp forests (Cluster 2b). East Malaysia populations exhibit some unique genotypes compared to the peat swamp populations. For example, at locus Kma147, alleles smaller than 331 bp were found only in EM.

Locus Kma127 exhibited the highest PD, with an average of 0.9515 across the three regions. In contrast, Kma172a exhibited the lowest PD in each region (average PD = 0.5605). Three regional STR individual identification databases were developed according to the respective genetic clusters, namely the WM, EM and PS Databases.
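For readers less familiar with these parameters: the matching probability (MP) at a locus is conventionally the sum of squared observed genotype frequencies, and PD is one minus MP. The sketch below assumes that standard definition (the source does not describe FORSTAT's internals) and uses made-up genotypes, not data from Table 1.

```python
from collections import Counter


def matching_probability(genotypes):
    """Chance that two random individuals share a genotype at one locus:
    the sum of squared observed genotype frequencies."""
    # Sort alleles so that (271, 273) and (273, 271) count as the same genotype.
    counts = Counter(tuple(sorted(g)) for g in genotypes)
    total = sum(counts.values())
    return sum((count / total) ** 2 for count in counts.values())


def power_of_discrimination(genotypes):
    """PD = 1 - MP."""
    return 1.0 - matching_probability(genotypes)


# Hypothetical genotypes at a single locus for four individuals.
toy_genotypes = [(271, 273), (273, 271), (271, 271), (275, 277)]
mp = matching_probability(toy_genotypes)
pd_value = power_of_discrimination(toy_genotypes)
print(mp, pd_value)
```

A locus with many evenly represented genotypes drives MP towards zero and PD towards one, which is why highly polymorphic loci such as Kma127 contribute most to individual identification.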

In the past, the enforcement authorities relied solely on wood anatomy and morphometric evidence to link suspected illegal loggers to the crime scene. This conventional approach has its limitations: species identification is often feasible only to the genus level, or to the group of timber species sharing a trade name, while seized logs and stumps were matched using morphological characteristics such as log or stump diameter and shape. While species identification via DNA barcoding could solve the former, DNA profile databases will help tackle the latter. As demonstrated in the following case study, the STR database developed in this study is useful for individual identification and geographic traceability of K. malaccensis wood in forensic applications.

Case study

Herein, we demonstrate how the STR database of K. malaccensis was applied to assist an investigation officer of an enforcement agency (Department of Wildlife and National Parks Peninsular Malaysia) in solving an illegal logging case. The case involved the felling of valuable timber trees within one of the gazetted wildlife reserves in Pahang state, Peninsular Malaysia. Based on morphological traits and wood anatomy, K. malaccensis was identified as one of the timber species among the seized logs. We received a request from the investigation officer to assist in carrying out DNA analysis.

An extensive forested area adjacent to the stack of felled logs was surveyed to find as many K. malaccensis stumps as possible. Sampling was carried out with the help of the local indigenous people, who are familiar with the forest. Following FRIM’s Standard Operating Procedure on “Forensic DNA Testing for Plant Species Identification and Timber Tracking”, SOP 1: “Specimen Collection in the Field”25, a total of 170 seized K. malaccensis logs (designated L1 to L170) and 22 potential stumps (designated S1 to S22) were sampled. For each sample, 5–10 bark discs were obtained using a hollow steel punch of 2 cm diameter. After every sample, the hollow punch was wiped with 70% ethanol to avoid contamination. Locating and sampling the K. malaccensis stumps scattered throughout the vast forest was considerably more challenging and arduous than sampling the suspected logs at a single locality.

DNA extraction and subsequent forensic DNA testing were conducted in the FRIM Genetics Laboratory. Using SOP 2: “DNA Isolation and Purification from Wood”, the total DNA of the samples was extracted and purified. The purified DNA samples were then genotyped for nine STR loci using SOP 4: “STR Genotyping for Population and Individual Identification”. Among the logs sampled, 65 unique 9-locus STR profiles were obtained, each shared by one to four logs. The STR profile of logs L75, L79 and L80 matched that of stump S12 (171/188, 156/162, 230/230, 255/255, 271/271, 342/344, 258/264, 148/150 and 278/278 at STR loci Kma011a to Kma057; refer to Table 1 for the order of the loci genotyped). The STR profiles of the remaining logs did not match those of the other 21 stumps.
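The log-to-stump comparison described above amounts to grouping samples by identical multilocus genotype. The sketch below illustrates this with the S12 profile quoted in the text; the non-matching profiles for the hypothetical log `L1` and stump `S1` are invented for illustration, and the function names are assumptions rather than part of FRIM's SOPs.

```python
from collections import defaultdict

# S12 profile as reported in the text (nine loci, Kma011a to Kma057).
S12_PROFILE = (
    (171, 188), (156, 162), (230, 230), (255, 255), (271, 271),
    (342, 344), (258, 264), (148, 150), (278, 278),
)


def normalise(profile):
    """Sort alleles within each locus so 171/188 and 188/171 compare equal."""
    return tuple(tuple(sorted(locus)) for locus in profile)


def match_logs_to_stumps(logs, stumps):
    """Return {stump_id: [log_ids]} for stumps sharing a full profile with logs."""
    logs_by_profile = defaultdict(list)
    for log_id, profile in logs.items():
        logs_by_profile[normalise(profile)].append(log_id)
    matches = {}
    for stump_id, profile in stumps.items():
        key = normalise(profile)
        if key in logs_by_profile:
            matches[stump_id] = sorted(logs_by_profile[key])
    return matches


# Hypothetical non-matching profiles: each differs from S12 at the first locus.
other_log = ((171, 171),) + S12_PROFILE[1:]
other_stump = ((169, 188),) + S12_PROFILE[1:]

logs = {"L75": S12_PROFILE, "L79": S12_PROFILE, "L80": S12_PROFILE, "L1": other_log}
stumps = {"S12": S12_PROFILE, "S1": other_stump}
matches = match_logs_to_stumps(logs, stumps)
print(matches)
```

This exact-match approach only handles complete profiles; partial profiles from degraded samples would need a locus-by-locus comparison that tolerates missing data.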

A random match probability between the log and the potential stumps can be established by using the frequency database. Random match probability (RMP) is the probability that an unknown timber and a potential stump of origin match by chance. It corresponds to the profile frequency, i.e. the estimated frequency at which a particular STR profile is present in a population21, and is commonly reported as its reciprocal (one in 1/profile frequency). By considering both population sub-structuring and the inbreeding coefficient, the coancestry coefficient (θ) value was adjusted to increase the profile frequency, thereby reducing the weight of the DNA evidence against a defendant in a court proceeding24. In this case study, the WM database was applied for the calculation of RMP because the wildlife reserve is located in Peninsular Malaysia and is not a peat swamp forest. The estimated RMP between logs L75, L79 and L80 and stump S12 was 3.2981 × 10⁻¹⁰.

To establish in a court proceeding that the STR profiles of the stump and logs did not match by chance, statistical methods such as likelihood ratios (LR) are necessary to quantify the evidentiary value of this match21. LR compares the probabilities of the evidence under two hypotheses. The first hypothesis represents the position of the prosecution, that logs L75, L79 and L80 originated from stump S12. The second hypothesis represents the position of the defendant, that the STR profiles matched by chance and logs L75, L79 and L80 did not originate from stump S12. The LR equals the probability of the evidence under the hypothesis of the prosecution, Hp (numerator), divided by its probability under the hypothesis of the defendant, Hd (denominator) [LR = P(E|Hp)/P(E|Hd)]. The probability of a match under Hp equals one, assuming that logs originating from stump S12 are certain to share its profile, whereas the probability under Hd, that logs L75, L79 and L80 originated from an unknown stump, is the RMP calculated as described above.

In short, LR is the inverse of the estimated profile frequency, in this case 1/(3.2981 × 10⁻¹⁰) = 3.0320 × 10⁹, thus providing extremely strong support for the proposition that logs L75, L79 and L80 originated from stump S12. Although only three suspected logs were traced to one of the stumps located in the wildlife reserve, this is enough to prove that a forest offence was committed by the suspect.
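The arithmetic behind this likelihood ratio can be checked in a few lines. The sketch takes the reported RMP as given and assumes, as the text does, that the probability of the match under the prosecution hypothesis is one; the variable names are illustrative.

```python
# RMP reported for logs L75, L79 and L80 against stump S12 (WM database).
rmp = 3.2981e-10

# Probability of observing the match if the logs came from stump S12
# (assumed to be 1, i.e. the same tree always yields the same profile).
p_evidence_given_hp = 1.0

# Probability of the match under the defence hypothesis is the RMP.
p_evidence_given_hd = rmp

likelihood_ratio = p_evidence_given_hp / p_evidence_given_hd
print(f"LR = {likelihood_ratio:.4e}")  # LR = 3.0320e+09
```

In words, the match is roughly three billion times more probable if the logs came from stump S12 than if they came from an unrelated tree in the WM region.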

To test the efficiency of the STR database for geographic traceability, the 9-locus STR profiles of these 22 K. malaccensis stumps were subjected to an assignment test, and all individuals were assigned to West Malaysia [non-peat-swamp forests], with a mean percentage of 99.9996%.

Conclusion

We report on the development of a DNA profiling database for an important timber species, K. malaccensis, in Malaysia, using STR markers. The STR database is robust and has been validated for specificity and accuracy, enabling the calculation of RMP and LR when an unknown log is traced to a potential stump of origin, using the appropriate database (WM, EM or PS). Where the sample source is unknown, the geographic/ecological origin of K. malaccensis samples can be traced with high accuracy (99.6%) via an assignment test, after which the corresponding regional database can be used if individual identification is needed.

Combined with other timber reference databases, STR databases for indigenous timber species will serve as an impetus for the uptake of DNA technology in forestry forensics. The technology adoption rate depends largely on effective dissemination of information to forest managers and relevant enforcement agencies. Besides raising awareness of the availability of DNA-based timber identification systems, active engagement, technology transfer and cooperation between researchers and the relevant stakeholders are crucial to leverage DNA technology in the fight against illegal logging. In addition, given that a timber STR database is species-specific, it is imperative to establish more databases for other timber species of economic importance, to curb illegal logging activities.

Having said that, acquisition of quality DNA from the alleged stolen wood sample is a prerequisite for forensic DNA analysis. In the present case of K. malaccensis, we were able to extract intact DNA from the suspected logs despite short-term exposure to outdoor weather. Moreover, it is much easier to obtain DNA samples from living stumps. In other scenarios, where seized wood or timber might have been dried or processed, extracting DNA of sufficient quantity and quality for analysis could be challenging. For such difficult wood samples, the DNA yield could be increased by improving the DNA extraction method, while the use of single nucleotide polymorphisms (SNPs) could overcome the limitation of degraded DNA, which impedes the acquisition of a full STR profile. However, establishing SNP databases for timber species of interest using a next-generation sequencing (NGS) approach would require a reference genome sequence.

Methods

Sample collection and DNA extraction

In total, we collected 1,465 K. malaccensis samples from 56 forested areas, with an average sample size of 26 per population (Table 3). The sampling locations spanned East and West Malaysia (Fig. 1). Of the 56 populations, eight were of peat swamp habitat (seven from Peninsular Malaysia, one from Sarawak). Ramli Ponyoh from FRIM assisted in species identification during sampling. A voucher specimen from Mukah Hill has been deposited in the FRIM herbarium: A1686 (KEP). For the samples from Peninsular Malaysia, approximately 5 g of leaf or cambium tissue per sample was weighed, wrapped in aluminium foil and kept in liquid nitrogen after processing during sampling trips, prior to DNA extraction. For those from Sabah and Sarawak, the leaf or inner bark samples were collected, weighed and kept in silica gel during transportation to the laboratory. For the logs and stumps in the case study, the cambium samples were collected using a hollow steel punch.

Table 3 Location and respective sample size of the 56 Koompassia malaccensis populations in Malaysia. * denotes population of peat swamp habitat.

The total DNA was extracted using the cetyltrimethyl ammonium bromide (CTAB) method26 with modifications. The frozen leaf or cambium samples (~ 5 g each) were cryogenically ground with a SPEX® SamplePrep 6875 Freezer/Mill (New Jersey, USA) for 1 min. Each grindate was immediately transferred into a 50 mL Nunc tube (Falcon) with 20 mL of prewarmed (60 °C) CTAB extraction buffer (20 mM Na2EDTA, 100 mM Tris-HCl pH 8.0, 1.4 M NaCl, 1% [w/v] PVP-40, 2% [w/v] CTAB, 0.2% [v/v] 2-mercaptoethanol) and incubated at 60 °C for 30–60 min. Subsequently, 20 mL of chloroform-isoamyl alcohol (24:1) was added and mixed gently for 15 min. After centrifuging at 3000 rpm for 10 min at room temperature, the aqueous layer was transferred to a new tube. Two-thirds volume of cold (− 20 °C) propan-2-ol was added and mixed gently to precipitate the nucleic acids. Precipitated DNA was dissolved in TE (10 mM Tris-HCl pH 8.0, 1 mM Na2EDTA).

Short tandem repeat genotyping

Genotyping of the 1,465 K. malaccensis samples was carried out using nine STR loci27, in two multiplex sets (Table S5). These STR markers are reproducible and robust in allele size calling; they were stringently selected from among the 24 markers developed27, excluding those with suspected null alleles or allele drop-out. The forward primers were fluorescently labelled with 6-FAM (Kma011a, Kma096, Kma109, Kma172a and Kma147), HEX (Kma127 and Kma057) or NED (Kma082 and Kma026). The multiplex PCR consisted of 1x Type-it Multiplex PCR Master Mix (Qiagen), 0.4 µM of each primer and 10 ng of template DNA. PCR amplification was performed using the following programme: an activation step at 95 °C for 5 min, followed by 35 cycles of denaturation at 95 °C for 30 s, annealing at 52–57 °C for 90 s, and extension at 72 °C for 30 s, and a final extension at 60 °C for 30 min. The PCR products were electrophoresed along with GeneScan 400HD ROX Standard as the internal size standard on an ABI 3130xl capillary sequencer (Applied Biosystems). Genotyping was carried out using GeneMarker v2.6.4 software (Soft Genetics LLC, Pennsylvania, USA). The reproducibility of all STR markers was tested by comparing the genotypes from five independent PCR amplifications on one individual28.

Data analysis

A model-based clustering analysis employing the Bayesian algorithm in STRUCTURE v2.3.4 was used to infer the genetic structure of K. malaccensis in Malaysia (56 populations), and the substructuring of the populations in East Malaysia and peat swamp (20 populations). In each STRUCTURE analysis, 10 independent runs were performed for each K value from 1 to 10, with a burn-in length of 250,000 followed by 500,000 Markov Chain Monte Carlo (MCMC)29 steps. We applied the admixture model with correlated allele frequencies, with sampling locations included as prior population information. The most likely number of genetic clusters was chosen based on the Delta K statistic30 using the online version of STRUCTURE SELECTOR31. After the best K value was selected, a graphical representation of the results from the 10 independent runs of STRUCTURE analyses was generated using CLUMPAK32. In addition, genetic relatedness among populations was inferred from the UPGMA dendrogram based on chord distance, DA, generated using the program POPTREE233. The relative strengths of the nodes were determined based on 1000 bootstrap replicates.
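As a rough illustration of how the Delta K statistic selects K, the sketch below computes the absolute second-order difference of the mean log-likelihood across runs, divided by the standard deviation of the log-likelihoods at that K. Computing the second difference from run means is an assumption about the implementation (STRUCTURE SELECTOR handles this internally), and the log-likelihood values are invented, not taken from this study.

```python
from statistics import mean, stdev


def delta_k(lnp_by_k):
    """lnp_by_k maps K to the list of ln P(D) values from independent runs.
    Returns {K: Delta K} for every K that has both neighbours K-1 and K+1."""
    means = {k: mean(values) for k, values in lnp_by_k.items()}
    result = {}
    for k in sorted(lnp_by_k):
        if k - 1 not in means or k + 1 not in means:
            continue  # Delta K is undefined at the ends of the K range
        spread = stdev(lnp_by_k[k])
        if spread == 0:
            continue  # avoid division by zero when all runs agree exactly
        second_difference = abs(means[k + 1] - 2 * means[k] + means[k - 1])
        result[k] = second_difference / spread
    return result


# Hypothetical ln P(D) values for three runs at each K.
toy_runs = {
    1: [-5000, -5010, -4990],
    2: [-4000, -4005, -3995],
    3: [-3950, -3990, -3910],
    4: [-3940, -3900, -3980],
}
dk = delta_k(toy_runs)
best_k = max(dk, key=dk.get)
print(dk, best_k)
```

Because Delta K needs both neighbours, it cannot be evaluated at K = 1, which is one reason the tested range starts below the smallest K of interest.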

Establishment and characterization of STR database

An STR database for individual identification of K. malaccensis was established following the approach of Tnah et al.15. After STR genotyping of all the K. malaccensis samples was completed, the STR database was divided, based on the results from the STRUCTURE and cluster analyses, into three regions designated as the West Malaysia (WM), East Malaysia (EM) and Peat Swamp (PS) Databases, corresponding to Cluster 1 (947 individuals), 2a (268 individuals) and 2b (250 individuals). The genetic diversity parameters of these STR loci for each region were assessed by calculating the number of alleles per locus (A) and the observed (Ho) and expected heterozygosity (He), using the program Genetic Data Analysis (GDA) v1.134. Hardy-Weinberg equilibrium (HWE) for each population was tested, with the p value for departure from HWE adjusted by Bonferroni correction35.
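The diversity parameters named above can be illustrated for a single locus as follows. This is a sketch only: it assumes Nei's unbiased estimator for He (the source does not state which estimator GDA applies), and the genotypes are made up. The last line shows the Bonferroni-adjusted significance threshold for nine loci, as used in Table 1.

```python
from collections import Counter


def locus_diversity(genotypes):
    """Return (A, Ho, He) for one locus from a list of diploid genotypes."""
    n = len(genotypes)
    allele_counts = Counter(allele for genotype in genotypes for allele in genotype)
    total_copies = 2 * n
    freqs = [count / total_copies for count in allele_counts.values()]
    observed_het = sum(1 for a, b in genotypes if a != b) / n
    # Nei's unbiased expected heterozygosity (assumed estimator).
    expected_het = (total_copies / (total_copies - 1)) * (1 - sum(p * p for p in freqs))
    return len(allele_counts), observed_het, expected_het


# Hypothetical genotypes for four individuals at one locus.
toy_genotypes = [(1, 2), (1, 1), (2, 3), (1, 2)]
n_alleles, ho, he = locus_diversity(toy_genotypes)
print(n_alleles, ho, he)

# Bonferroni-adjusted significance level for HWE tests across nine loci.
bonferroni_alpha = 0.05 / 9
print(round(bonferroni_alpha, 4))
```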

Forensic parameters for each regional database, viz., polymorphic information content (PIC), matching probability (MP) and power of discrimination (PD) were calculated using FORSTAT36. The coancestry coefficient (θ) and inbreeding coefficient (f) for each region were estimated using GDA37, with 1000 bootstrap replicates. Self-assignment tests were used to evaluate the proportion of correctly assigned individuals to population and regional levels, using GENECLASS238. The first level was at the designated population. The second level was at the genetic cluster revealed through the clustering analyses, corresponding to the three regions (WM, EM & PS).
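To make the assignment step concrete, the sketch below assigns a multilocus genotype to whichever candidate population gives it the highest likelihood under Hardy-Weinberg proportions. GENECLASS2 offers several assignment criteria and the source does not state which was used, so this simple frequency-based criterion is a stand-in; the population names, allele labels, frequencies and the `floor` value for unseen alleles are all hypothetical.

```python
import math


def log_likelihood(profile, locus_freqs_list, floor=0.01):
    """Log-likelihood of a multilocus genotype in one population under HWE.
    Alleles absent from the population are given the frequency `floor`."""
    total = 0.0
    for (a, b), locus_freqs in zip(profile, locus_freqs_list):
        p_a = locus_freqs.get(a, floor)
        p_b = locus_freqs.get(b, floor)
        genotype_freq = p_a * p_a if a == b else 2 * p_a * p_b
        total += math.log(genotype_freq)
    return total


def assign(profile, populations):
    """Return the best-scoring population and all log-likelihood scores."""
    scores = {
        name: log_likelihood(profile, freqs) for name, freqs in populations.items()
    }
    return max(scores, key=scores.get), scores


# Two hypothetical populations, two loci each.
populations = {
    "pop_A": [{100: 0.4, 102: 0.3, 104: 0.3}, {200: 0.5, 202: 0.5}],
    "pop_B": [{100: 0.1, 104: 0.9}, {200: 0.2, 206: 0.8}],
}
profile = ((100, 102), (200, 202))
best_population, scores = assign(profile, populations)
print(best_population, scores)
```

Self-assignment, as reported in Table S4, would repeat this for every reference individual, ideally after removing that individual from its own population's frequency estimates.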

The allele frequency for each locus was computed using the program Microsatellite Toolkit39. On the assumption that K. malaccensis is diploid, a conservative minimum allele frequency of 5/2n was applied to ensure that an allele has been sampled sufficiently to be used reliably in the statistical tests. Here, n is the number of individuals sampled from a population, and 2n is the number of chromosomes counted, because each individual inherits one allele from its maternal parent and one from its paternal parent. The profile frequency was calculated by multiplying the genotype frequencies across all nine STR loci, based on the subpopulation-cum-inbreeding model40.
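The multiplication across loci can be sketched as below. Note the substitution: instead of the subpopulation-cum-inbreeding model40, whose formulas are not given in the text, this example uses the widely known Balding-Nichols θ correction, which accounts for coancestry but not for the inbreeding coefficient f. The allele labels and frequencies are invented; only the WM sample size and θ values come from the text.

```python
def genotype_match_probability(p_a, p_b, homozygous, theta):
    """Balding-Nichols theta-corrected genotype probability (stand-in model)."""
    denominator = (1 + theta) * (1 + 2 * theta)
    if homozygous:
        return (2 * theta + (1 - theta) * p_a) * (3 * theta + (1 - theta) * p_a) / denominator
    return 2 * (theta + (1 - theta) * p_a) * (theta + (1 - theta) * p_b) / denominator


def profile_frequency(profile, locus_freqs_list, theta, min_freq):
    """Multiply per-locus genotype probabilities, flooring rare alleles at min_freq."""
    result = 1.0
    for (a, b), locus_freqs in zip(profile, locus_freqs_list):
        p_a = max(locus_freqs.get(a, 0.0), min_freq)
        p_b = max(locus_freqs.get(b, 0.0), min_freq)
        result *= genotype_match_probability(p_a, p_b, a == b, theta)
    return result


# Hypothetical two-locus profile and allele frequencies.
profile = ((100, 100), (200, 202))
freqs = [{100: 0.1}, {200: 0.1, 202: 0.2}]
min_freq = 5 / (2 * 947)  # 5/2n floor for the WM sample size

plain = profile_frequency(profile, freqs, theta=0.0, min_freq=min_freq)
adjusted = profile_frequency(profile, freqs, theta=0.3772, min_freq=min_freq)
print(plain, adjusted)
```

With θ = 0 the calculation reduces to the ordinary product rule (p² for homozygotes, 2pq for heterozygotes); raising θ inflates the profile frequency for rare genotypes, which is the conservative direction in a forensic setting.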

The conservativeness of each database was estimated by calculating the full profile frequency of each individual using the genotype frequencies derived from the cognate database (Porigin), which is the population database, against profile frequency of each individual using genotype frequencies derived from the combined database (Pcombined), i.e., the corresponding regional database. The relative difference (d) between the databases was defined as d = log10 (Porigin/Pcombined); d value is negative when Porigin < Pcombined, indicating that the database is conservative24. In order to ensure that each regional database is conservative, a series of θ adjustments were applied to recalculate Pcombined until all individuals within the respective region present a negative d value.
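The θ adjustment loop described above can be sketched as a simple grid search: increase θ until every individual has a negative d. In this sketch each individual is represented by its Porigin value and a function returning Pcombined for a given θ. The single-locus homozygote formula used for Pcombined is the Balding-Nichols correction, chosen only to keep the toy example short; the step size and all numerical values are assumptions.

```python
import math


def relative_difference(p_origin, p_combined):
    """d = log10(Porigin / Pcombined); d < 0 means the database is conservative."""
    return math.log10(p_origin / p_combined)


def smallest_conservative_theta(individuals, theta_grid):
    """Return the first theta in the grid at which all individuals have d < 0.
    `individuals` is a list of (p_origin, p_combined_at) pairs, where
    p_combined_at(theta) returns the regional profile frequency."""
    for theta in theta_grid:
        if all(
            relative_difference(p_origin, p_combined_at(theta)) < 0
            for p_origin, p_combined_at in individuals
        ):
            return theta
    return None  # no theta in the grid makes the database conservative


def toy_homozygote_frequency(p, theta):
    """Theta-corrected single-locus homozygote probability (illustrative only)."""
    return (2 * theta + (1 - theta) * p) * (3 * theta + (1 - theta) * p) / (
        (1 + theta) * (1 + 2 * theta)
    )


# Two hypothetical individuals: (Porigin, function giving Pcombined at theta).
individuals = [
    (0.05, lambda theta: toy_homozygote_frequency(0.1, theta)),
    (0.02, lambda theta: toy_homozygote_frequency(0.2, theta)),
]
theta_grid = [step * 0.05 for step in range(11)]  # 0.00, 0.05, ..., 0.50
theta_star = smallest_conservative_theta(individuals, theta_grid)
print(theta_star)
```

A finer grid would give a tighter adjusted θ; the coarse step here is only for readability.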

Plant collection declaration

We declare that all our experimental research and field sampling of plant materials comply with local, national or international guidelines and legislation.