• DOI: 10.1109/TSIPN.2020.3044913
  • Corpus ID: 231644663

Multiscale Representation Learning of Graph Data With Node Affinity

  • Xing Gao , Wenrui Dai , +2 authors P. Frossard
  • Published in IEEE Transactions on Signal and Information Processing over Networks, 2021
  • Computer Science, Mathematics






Xing Gao, Wenrui Dai, Chenglin Li, Hongkai Xiong, Pascal Frossard. Multiscale Representation Learning of Graph Data With Node Affinity. IEEE Transactions on Signal and Information Processing over Networks, 7: 30–44, 2021.


Key takeaway

The proposed graph pooling strategy leverages node affinity to improve hierarchical representation learning of graph data in graph neural networks, achieving state-of-the-art performance on public graph classification benchmark datasets.

Graph neural networks have emerged as a popular and powerful tool for learning hierarchical representation of graph data. In complement to graph convolution operators, graph pooling is crucial for extracting hierarchical representation of data in graph neural networks. However, most recent graph pooling methods still fail to efficiently exploit the geometry of graph data. In this paper, we propose a novel graph pooling strategy that leverages node affinity to improve the hierarchical representation learning of graph data. Node affinity is computed by harmonizing the kernel representation of topology information and node features. In particular, a structure-aware kernel representation is introduced to explicitly exploit advanced topological information for efficient graph pooling without eigendecomposition of the graph Laplacian. Similarities of node signals are evaluated using the Gaussian radial basis function (RBF) in an adaptive way. Experimental results demonstrate that the proposed graph pooling strategy is able to achieve state-of-the-art performance on a collection of public graph classification benchmark datasets.
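The abstract's recipe — a structure-aware kernel harmonized with a Gaussian RBF on node features — can be illustrated with a toy sketch. The neighbourhood-Jaccard kernel below is a hypothetical stand-in for the paper's structure-aware kernel (which likewise avoids eigendecomposition of the graph Laplacian), and the fixed-bandwidth RBF is a stand-in for its adaptive one:

```python
import math

def rbf_similarity(x_i, x_j, sigma=1.0):
    """Gaussian RBF similarity between two node feature vectors."""
    d2 = sum((a - b) ** 2 for a, b in zip(x_i, x_j))
    return math.exp(-d2 / (2 * sigma ** 2))

def structural_kernel(adj, i, j):
    """Toy structure kernel: Jaccard overlap of closed 1-hop neighbourhoods.
    (A stand-in for the paper's structure-aware kernel.)"""
    n_i, n_j = set(adj[i]) | {i}, set(adj[j]) | {j}
    return len(n_i & n_j) / len(n_i | n_j)

def node_affinity(adj, feats, i, j, sigma=1.0):
    """Affinity as the harmonized product of topology and feature similarity."""
    return structural_kernel(adj, i, j) * rbf_similarity(feats[i], feats[j], sigma)

# Triangle graph (nodes 0, 1, 2) with one pendant node (3).
adj = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2]}
feats = [[1.0, 0.0], [1.0, 0.1], [0.0, 1.0], [0.0, 1.0]]
a01 = node_affinity(adj, feats, 0, 1)  # close in topology and features
a03 = node_affinity(adj, feats, 0, 3)  # far in both respects
```

Nodes 0 and 1 share a triangle and nearly identical features, so their affinity dominates that of the distant pair (0, 3); a pooling step would then merge high-affinity pairs first.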


Royal Society of Chemistry

MGraphDTA: deep multiscale graph neural network for explainable drug–target binding affinity prediction †


First published on 5th January 2022

Predicting drug–target affinity (DTA) is beneficial for accelerating drug discovery. Graph neural networks (GNNs) have been widely used in DTA prediction. However, existing shallow GNNs are insufficient to capture the global structure of compounds. Besides, the interpretability of the graph-based DTA models highly relies on the graph attention mechanism, which can not reveal the global relationship between each atom of a molecule. In this study, we proposed a deep multiscale graph neural network based on chemical intuition for DTA prediction (MGraphDTA). We introduced a dense connection into the GNN and built a super-deep GNN with 27 graph convolutional layers to capture the local and global structure of the compound simultaneously. We also developed a novel visual explanation method, gradient-weighted affinity activation mapping (Grad-AAM), to analyze a deep learning model from the chemical perspective. We evaluated our approach using seven benchmark datasets and compared the proposed method to the state-of-the-art deep learning (DL) models. MGraphDTA outperforms other DL-based approaches significantly on various datasets. Moreover, we show that Grad-AAM creates explanations that are consistent with pharmacologists, which may help us gain chemical insights directly from data beyond human perception. These advantages demonstrate that the proposed method improves the generalization and interpretation capability of DTA prediction modeling.

1 Introduction

Structure-based methods can explore the potential binding sites by considering the 3D structure of a small molecule and a protein. Docking is a well-established structure-based method that uses numerous mode definitions and scoring functions to minimize free energy for binding. Molecular dynamics simulation is another popular structure-based method that can provide the ultimate detail concerning individual particle motions as a function of time. 6 However, the structure-based methods are time-consuming and can not be employed if the 3D structure of the protein is unknown. 7

Feature-based methods for DTA prediction modeling are also known as proteochemometrics (PCM), 8–10 which relies on a combination of explicit ligand and protein descriptors. Any pairs of drugs and targets can be represented in terms of biological feature vectors with a certain length, often with binary labels that determine whether the drug can bind to the target or not. The extracted biological feature vectors can be used to train machine/deep learning models such as feed-forward neural networks (FNNs), support vector machine (SVM), random forest (RF), and other kernel-based methods. 11–19 For example, DeepDTIs 20 chose the most common and simple features: extended connectivity fingerprints (ECFP) and protein sequence composition descriptors (PSC) for drugs and targets representation, and then used a deep belief network for DTA prediction. Lenselink et al. 11 compared FNNs with different machine learning methods such as logistic regression, RF, and SVM on one single standardized dataset and found that FNNs are the top-performing classifiers. A study conducted by Mayr et al. 12 also found a similar result that FNNs outperform other competing methods. MDeePred 21 represented protein descriptors by the combination of various types of protein features such as sequence, structural, evolutionary, and physicochemical properties, and a hybrid deep neural network was used to predict binding affinities from the compound and protein descriptors. MoleculeNet 22 introduced a featurization method called grid featurizer that used structural information of both ligand and target. The grid featurizer considers not only features of the protein and ligand individually but also the chemical interaction within the binding pocket.

Over the past few years, there has been a remarkable increase in the amount of available compound activity and biomedical data owing to the emergence of novel experimental techniques such as high-throughput screening and parallel synthesis, among others. 23–25 The high demand for exploring and analyzing massive data has encouraged the development of data-hungry algorithms like deep learning. 26,27 Many types of deep learning frameworks have been adopted in DTA prediction. DeepDTA 2 established two convolutional neural networks (CNNs) to learn the representations of the drug and protein, respectively. The learned drug and protein representations are then concatenated and fed into a multi-layer perceptron (MLP) for DTA prediction. WideDTA 28 further improved the performance of DeepDTA by integrating two additional text-based inputs and using four CNNs to encode them into four representations. Lee et al. 29 also utilized a CNN on the protein sequence to learn local residue patterns and conducted extensive experiments to demonstrate the effectiveness of CNN-based methods. On the other hand, DEEPScreen represented compounds as 2-D structural images and used a CNN to learn complex features from these 2-D structural drawings to produce highly accurate DTA predictions. 30

Although CNN-based methods have achieved remarkable performance in DTA prediction, most of these models represent the drugs as strings, which is not a natural way to represent compounds. 31 When using strings, the structural information of the molecule is lost, which could impair the predictive power of a model as well as the functional relevance of the learned latent space. To address this problem, graph neural networks (GNNs) have been adopted in DTA prediction. 31–36 The GNN-based methods represent the drugs as graphs and use GNN for DTA prediction. For instance, Tsubaki et al. 34 proposed to use GNN and CNN to learn low-dimensional vector representation of compound graphs and protein sequences, respectively. They formulated the DTA prediction as a classification problem and conducted experiments on three datasets. The experimental results demonstrate that the GNN-based method outperforms PCM methods. GraphDTA 31 evaluated several types of GNNs including GCN, GAT, GIN, and GAT–GCN for DTA prediction, in which DTA was regarded as a regression problem. The experimental results confirm that deep learning methods are capable of DTA prediction, and representing drugs as graphs can lead to further improvement. DGraphDTA 37 represented both compounds and proteins as graphs and used GNNs on both the compound and protein sides to obtain their representations. Moreover, to increase the model interpretability, attention mechanisms have been introduced into DTA prediction models. 32,36,38–40

On the other hand, some studies have focused on improving DTA prediction by using structure-related features of the protein as input. 37,41 For example, DGraphDTA 37 utilized contact maps predicted from protein sequences as the input of the protein encoder to improve the performance of DTA prediction. Since protein structural information is not always available, they used contact maps predicted from the sequences, which enables the model to take all sorts of proteins as input.

Overall, many novel models for DTA prediction based on shallow GNNs have been developed and show promising performance on various datasets. However, at least three problems have not been well addressed for GNN-based methods in DTA prediction. First, we argue that GNNs with few layers are insufficient to capture the global structure of the compounds. As shown in Fig. 1(a), a GNN with two layers is unable to know whether a ring exists in the molecule, and the graph embedding will be generated without considering the information about the ring. The graph convolutional layers should be stacked deeply in order to capture the global structure of a graph. Concretely, to capture structures made up of k-hop neighbors, k graph convolutional layers should be stacked. 42 However, building a deep architecture of GNNs is currently infeasible due to the over-smoothing and vanishing gradient problems. 43,44 As a result, most state-of-the-art (SOTA) GNN models are no deeper than 3 or 4 layers. Second, a well-constructed GNN should be able to preserve the local structure of a compound. As shown in Fig. 1(b), the methyl carboxylate moiety is crucial for methyl decanoate, and the GNN should distinguish it from the less essential substituents in order to make a reasonable inference. Third, the interpretability of graph-based DTA models relies heavily on the attention mechanism. Although the attention mechanism provides an effective visual explanation, it increases the computational cost. In addition, the graph attention mechanism only considers the neighborhood of a vertex (also called masked attention), 45,46 and so cannot capture the global relationship between the atoms of a molecule.

Both global and local structure information is important for GNN. (a) The sight of GNNs in the second layer is shown in green as we take the carbon with orange as the center. In this example, a GNN with two layers fails to identify the ring structure of zearalenone. (b) The GNN should preserve local structure information in order to distinguish the methyl carboxylate moiety (orange ellipse) from other less essential substituents.

To address the above problems, we proposed a multiscale graph neural network (MGNN) and a novel visual explanation method called gradient-weighted affinity activation mapping (Grad-AAM) for DTA prediction and interpretation. An overview of the proposed MGraphDTA is shown in Fig. 2. The MGNN, with 27 graph convolutional layers, and a multiscale convolutional neural network (MCNN) were used to extract the multiscale features of the drug and target, respectively. The multiscale features of the drug contain rich information about the molecule's structure at different scales and enable the GNN to make a more accurate prediction. The extracted multiscale features of the drug and target were fused respectively and then concatenated to obtain a combined descriptor for a given drug–target pair. The combined descriptor was fed into an MLP to predict binding affinity. Grad-AAM uses the gradients of the affinity flowing into the final graph convolutional layer of MGNN to produce a probability map highlighting the important atoms that contribute most to the DTA. The proposed Grad-AAM was motivated by gradient-weighted class activation mapping (Grad-CAM), which can produce a coarse localization map highlighting the important regions in an image. 47 However, Grad-CAM was designed for neural network classification tasks based on CNNs. Unlike Grad-CAM, Grad-AAM is activated by the binding affinity score based on GNNs. The main contributions of this paper are twofold:

Overview of the proposed MGraphDTA. The MGNN and MCNN were used to extract multiscale features of the input drug graph and protein sequence, respectively. The output multiscale features of the two encoders were fused respectively and then concatenated to obtain a combined representation of the drug–target pair. Finally, the combined representation was fed into a MLP to predict binding affinity. The Grad-AAM uses the gradient information flowing into the last graph convolutional layer of MGNN to understand the importance of each neuron for a decision of affinity.

(a) We construct a very deep GNN for DTA prediction and rationalize it from the chemical perspective.

(b) We propose a simple but effective visualization method called Grad-AAM to investigate how a GNN makes decisions in DTA prediction.

2.1 Input representation

Molecule representation and graph embedding. (a) Representing a molecule as a graph. (b) Graph message passing phase. (c) Graph readout phase.

2.2 Graph neural network

 
(1)
 
(2)
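Equations (1) and (2) did not survive extraction; the surrounding text describes a graph message-passing phase followed by a readout phase. The sketch below is a generic instance of that scheme, not the paper's exact operators — neighbour-mean aggregation and mean readout are assumptions:

```python
def gcn_layer(adj, feats):
    """One generic message-passing step: each node averages its own features
    with those of its neighbours (a simplified stand-in for the paper's
    update rule, eqn (1))."""
    out = []
    for v, x in enumerate(feats):
        agg = list(x)                      # start from the node's own features
        for u in adj[v]:                   # accumulate neighbour features
            for d, val in enumerate(feats[u]):
                agg[d] += val
        k = len(adj[v]) + 1
        out.append([a / k for a in agg])
    return out

def readout(feats):
    """Graph-level embedding as the mean of node embeddings (eqn (2) analogue)."""
    n, dim = len(feats), len(feats[0])
    return [sum(x[d] for x in feats) / n for d in range(dim)]

# Path graph 0-1-2 with scalar node features.
adj = {0: [1], 1: [0, 2], 2: [1]}
h = [[1.0], [0.0], [1.0]]
h1 = gcn_layer(adj, h)            # [[0.5], [2/3], [0.5]]
g = readout(gcn_layer(adj, h1))   # graph embedding after two layers
```

Stacking k such layers lets information propagate k hops, which is exactly why the introduction argues that deep stacks are needed to see ring-scale structure.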

2.3 Multiscale graph neural network for drug encoding

Overview of the MGNN. (a) The network architecture of the proposed MGNN. (b) The detailed design of the multiscale block.
 
(3)
 
(4)
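The MGNN reaches 27 graph convolutional layers by introducing dense connections. A hedged sketch of the idea, assuming DenseNet-style concatenation (each layer sees the concatenated outputs of all preceding layers); the neighbour-mean "convolution" is a placeholder for the paper's actual graph convolution:

```python
def dense_gnn_forward(adj, feats, num_layers=4):
    """Dense connectivity: the input to layer l is the concatenation of the
    outputs of all preceding layers, which eases gradient flow through very
    deep stacks (the paper uses 27 layers; 4 here for illustration)."""
    features = [feats]                     # per-layer node feature blocks
    for _ in range(num_layers):
        # concatenate all previous feature blocks, node by node
        concat = [sum((blk[v] for blk in features), []) for v in range(len(feats))]
        # simple neighbour-mean "convolution" on the concatenated features
        nxt = []
        for v in range(len(feats)):
            nbrs = adj[v] + [v]
            dim = len(concat[0])
            nxt.append([sum(concat[u][d] for u in nbrs) / len(nbrs)
                        for d in range(dim)])
        features.append(nxt)
    return features

adj = {0: [1], 1: [0]}
out = dense_gnn_forward(adj, [[1.0], [3.0]], num_layers=3)
```

Note how the feature width grows with depth because earlier outputs are carried forward — the dense shortcut is what the ablation study later removes ("without dense connection").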

2.4 Multiscale convolutional neural network for target encoding

 
(5)
The network architecture of the proposed MCNN.
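Equation (5) is missing from this extract, but the MCNN's principle — parallel convolutional branches of different widths over the encoded protein sequence — can be sketched. The mean filters, scalar residue encoding, and global max pooling below are illustrative assumptions, not the paper's trained filters:

```python
def conv1d_mean(seq, k):
    """Valid 1-D mean filter of width k over a scalar sequence."""
    return [sum(seq[i:i + k]) / k for i in range(len(seq) - k + 1)]

def multiscale_encode(seq, kernel_sizes=(3, 5, 7)):
    """Run filters of several widths in parallel and global-max-pool each
    branch, then concatenate: small kernels catch local residue patterns,
    large ones broader context (a stand-in for the paper's MCNN)."""
    return [max(conv1d_mean(seq, k)) for k in kernel_sizes]

# Toy "protein": residues mapped to scalars (hypothetical encoding).
seq = [0.0, 1.0, 4.0, 1.0, 0.0, 0.0, 2.0, 2.0, 2.0, 0.0]
desc = multiscale_encode(seq)   # one pooled feature per scale
```

This mirrors the later ablation finding: small receptive fields suffice because only a few residues typically participate in the interaction.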

2.5 MGraphDTA network architecture

 
(6)

2.6 Gradient-weighted affinity activation mapping

 
(7)
 
(8)

Finally, min–max normalization was used to map the probability map P Grad-AAM to the range [0, 1]. The chemical probability map P Grad-AAM can be thought of as a weighted aggregation of the important geometric substructures of a molecule that are captured by a GNN, as shown in Fig. 6.

The chemical probability map is a weighted sum of vital substructures of a molecule captured by a GNN.
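Equations (7) and (8) did not survive extraction, but the surrounding text pins down the recipe: a Grad-CAM-style weighting of the last graph convolutional layer by affinity gradients, followed by min–max normalization. A sketch under those assumptions, with hypothetical activation and gradient arrays:

```python
def grad_aam(activations, gradients):
    """Grad-AAM-style probability map (following the Grad-CAM recipe the
    paper adapts): channel weights are the affinity gradients averaged over
    nodes; each atom's score is the ReLU'd weighted sum of its channel
    activations, min–max normalised to [0, 1]."""
    n, c = len(activations), len(activations[0])
    weights = [sum(gradients[v][ch] for v in range(n)) / n for ch in range(c)]
    raw = [max(0.0, sum(weights[ch] * activations[v][ch] for ch in range(c)))
           for v in range(n)]
    lo, hi = min(raw), max(raw)
    if hi == lo:                      # degenerate map: all atoms equal
        return [0.0] * n
    return [(r - lo) / (hi - lo) for r in raw]

# Hypothetical per-atom activations (3 atoms x 2 channels) and gradients of
# the predicted affinity with respect to those activations.
acts = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]
grads = [[0.2, 0.0], [0.2, 0.0], [0.2, 0.0]]   # affinity sensitive to channel 0
p_map = grad_aam(acts, grads)
```

In the real model the activations come from the final graph convolutional layer of the MGNN and the gradients from backpropagating the scalar affinity, so no attention mechanism is needed for the explanation.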

2.7 Dataset

We also formulated DTA prediction as a binary classification problem and evaluated the proposed MGraphDTA in two widely used classification datasets, Human and Caenorhabditis elegans ( C. elegans ). 34,38,46

Moreover, we conducted a case study to evaluate Grad-AAM using the ToxCast dataset. 35 The ToxCast dataset contains multiple assays, which means that one drug–target pair may have different binding affinities depending on the type of assay. For simplicity, we selected only the assay containing the largest number of drug–target pairs. Table 1 summarizes these datasets. Fig. S1–S3† show the distributions of binding affinities, SMILES lengths, and protein sequence lengths of these datasets.

Table 1. Summary of the datasets.
Dataset Task type Compounds Proteins Interactions
Davis Regression 68 442 30056
Filtered Davis Regression 68 379 9125
KIBA Regression 2111 229 118254
Metz Regression 1423 170 35259
Human Classification 2726 2001 6728
C. elegans Classification 1767 1876 7786
ToxCast Regression 3098 37 114626

2.8 Experimental setup

3 Results and discussion

3.1 Comparison with SOTA DTA prediction models in classification tasks

Table 2 summarizes the quantitative results. For the Human dataset, the proposed method yielded a significantly higher precision than that of other methods for DTA prediction. For the C. elegans dataset, the proposed method achieved considerable improvements in both precision and recall. These results reveal MGraphDTA's potential to master molecular representation learning for drug discovery. Besides, we observed that replacing CNN with MCNN can yield a slight improvement, which corroborates the efficacy of the proposed MCNN.

Table 2. Comparison with SOTA models on the Human and C. elegans classification datasets.
Dataset Model Precision Recall AUC
Human GNN-CNN 0.923 0.918 0.970
TrimNet-CNN 0.918 0.953 0.974
GraphDTA 0.882 (0.040) 0.912 (0.040) 0.960 (0.005)
DrugVQA(VQA-seq) 0.897 (0.004) 0.948 (0.003) 0.964 (0.005)
TransformerCPI 0.916 (0.006) 0.925 (0.006) 0.973 (0.002)
MGNN-CNN (ours) 0.953 (0.006) 0.950 (0.004) 0.982 (0.001)
MGNN-MCNN (ours) 0.955 (0.005) 0.956 (0.003) 0.983 (0.003)
C. elegans GNN-CNN 0.938 0.929 0.978
TrimNet-CNN 0.946 0.945 0.987
GraphDTA 0.927 (0.015) 0.912 (0.023) 0.974 (0.004)
TransformerCPI 0.952 (0.006) 0.953 (0.005) 0.988 (0.002)
MGNN-CNN (ours) 0.979 (0.005) 0.961 (0.002) 0.991 (0.002)
MGNN-MCNN (ours) 0.980 (0.004) 0.967 (0.005) 0.991 (0.001)

3.2 Comparison with SOTA DTA prediction models in regression tasks

For the regression task on the filtered Davis dataset, we compared the proposed MGraphDTA with SOTA methods in this dataset, which were MDeePred, 21 CGKronRLS, 63 and DeepDTA. 2 We used root mean square error (RMSE, the smaller the better), CI, and Spearman rank correlation (the higher the better) as performance indicators following MDeePred. The whole dataset was randomly divided into six parts; five of them were used for fivefold cross-validation and the remaining part was used as the independent test dataset. The final performance was evaluated on the independent test dataset following MDeePred. Note that the data points in each fold are exactly the same as MDeePred for a fair comparison.
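The metrics used in this section can be made concrete with a short sketch. This is a generic implementation of RMSE and the concordance index (CI), not code from the paper; Spearman rank correlation is omitted for brevity:

```python
import math

def rmse(y_true, y_pred):
    """Root mean square error (smaller is better)."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

def concordance_index(y_true, y_pred):
    """CI (higher is better): over all pairs with distinct true affinities,
    the fraction whose predictions are ordered the same way; prediction
    ties count as 0.5."""
    num, den = 0.0, 0
    for i in range(len(y_true)):
        for j in range(i + 1, len(y_true)):
            if y_true[i] == y_true[j]:
                continue                      # skip tied true affinities
            den += 1
            if y_pred[i] == y_pred[j]:
                num += 0.5
            elif (y_true[i] > y_true[j]) == (y_pred[i] > y_pred[j]):
                num += 1.0
    return num / den

y_true = [5.0, 6.0, 7.0, 8.0]
y_pred = [5.2, 6.1, 7.9, 7.5]           # one discordant pair (last two)
ci = concordance_index(y_true, y_pred)  # 5 of 6 pairs concordant
err = rmse(y_true, y_pred)
```

A CI of 0.5 corresponds to random ranking and 1.0 to perfect ranking, which is why values such as 0.900 in Table 3 indicate strong ordering of binding affinities.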

Tables 3 and 4 summarize the predictive performance of MGraphDTA and previous models on the Davis, KIBA, and Metz datasets. The graph-based methods surpassed CNN-based and recurrent neural network (RNN) based methods, which demonstrates the potential of graph neural networks in DTA prediction. Since CNN-based and RNN-based models represent the compounds as strings, the predictive capability of a model may be weakened without considering the structural information of the molecule. In contrast, the graph-based methods represent compounds as graphs and capture the dependence of graphs via message passing between the vertices of graphs. Compared to other graph-based methods, MGraphDTA achieved the best performances as shown in Tables 3 and 4 . The paired Student's t -test shows that the differences between MGraphDTA and other graph-based methods are statistically significant on the Metz dataset ( p < 0.05). Moreover, MGraphDTA was significantly better than traditional PCM models on three datasets ( p < 0.01). It is worth noting that FNN was superior to other traditional PCM models ( p < 0.01), which is consistent with the previous studies. 11,12 Table 5 summarizes the results of four methods in the filtered Davis dataset. It can be observed that MGraphDTA achieved the lowest RMSE. Overall, MGraphDTA showed impressive results on four benchmark datasets that exceed other SOTA DTA prediction models significantly, which reveals the validity of the proposed MGraphDTA.

Table 3. Predictive performance on the Davis (left triplet) and KIBA (right triplet) datasets.
Model Proteins Compounds MSE CI r index MSE CI r index
DeepDTA CNN CNN 0.261 0.878 0.630 0.194 0.863 0.673
WideDTA CNN + PDM CNN + LMCS 0.262 0.886 — 0.179 0.875 —
GraphDTA CNN GCN 0.254 0.880 — 0.139 0.889 —
GraphDTA CNN GAT 0.232 0.892 — 0.179 0.866 —
GraphDTA CNN GIN 0.229 0.893 — 0.147 0.882 —
GraphDTA CNN GAT–GCN 0.245 0.881 — 0.139 0.891 —
DeepAffinity RNN RNN 0.253 0.900 — 0.188 0.842 —
DeepAffinity RNN GCN 0.260 0.881 — 0.288 0.797 —
DeepAffinity CNN GCN 0.657 0.737 — 0.680 0.576 —
DeepAffinity HRNN GCN 0.252 0.881 — 0.201 0.842 —
DeepAffinity HRNN GIN 0.436 0.822 — 0.445 0.689 —
KronRLS SW PS 0.379 0.871 0.407 0.411 0.782 0.342
SimBoost SW PS 0.282 0.872 0.655 0.222 0.836 0.629
RF ECFP PSC 0.359 (0.003) 0.854 (0.002) 0.549 (0.005) 0.245 (0.001) 0.837 (0.000) 0.581 (0.000)
SVM ECFP PSC 0.383 (0.002) 0.857 (0.001) 0.513 (0.003) 0.308 (0.003) 0.799 (0.001) 0.513 (0.004)
FNN ECFP PSC 0.244 (0.009) 0.893 (0.003) 0.685 (0.015) 0.216 (0.010) 0.818 (0.005) 0.659 (0.015)
MGraphDTA MCNN MGNN 0.207 (0.001) 0.900 (0.004) 0.710 (0.005) 0.128 (0.001) 0.902 (0.001) 0.801 (0.001)
Baseline results are taken from DeepDTA, WideDTA, GraphDTA, and DeepAffinity; — denotes results not reported in the original studies.

Table 4. Predictive performance on the Metz dataset.
Model Proteins Compounds MSE CI r index
DeepDTA CNN CNN 0.286 (0.001) 0.815 (0.001) 0.678 (0.003)
GraphDTA CNN GCN 0.282 (0.007) 0.815 (0.002) 0.679 (0.008)
GraphDTA CNN GAT 0.323 (0.003) 0.800 (0.001) 0.625 (0.010)
GraphDTA CNN GIN 0.313 (0.002) 0.803 (0.001) 0.632 (0.001)
GraphDTA CNN GAT–GCN 0.282 (0.011) 0.816 (0.004) 0.681 (0.026)
RF ECFP PSC 0.351 (0.002) 0.793 (0.001) 0.565 (0.001)
SVM ECFP PSC 0.361 (0.001) 0.794 (0.000) 0.590 (0.001)
FNN ECFP PSC 0.316 (0.001) 0.805 (0.001) 0.660 (0.003)
MGraphDTA MCNN MGNN 0.265 (0.002) 0.822 (0.001) 0.701 (0.001)
Table 5. Comparison on the filtered Davis dataset.
Model RMSE CI Spearman
These results are taken from MDeePred.
MDeePred 0.742 (0.009) 0.733 (0.004) 0.618 (0.009)
CGKronRLS 0.769 (0.010) 0.740 (0.003) 0.643 (0.008)
DeepDTA 0.931 (0.015) 0.653 (0.005) 0.430 (0.013)
MGraphDTA 0.695 (0.009) 0.740 (0.002) 0.654 (0.005)

3.3 Performance evaluation on more realistic experimental settings

(1) Orphan–target split: each protein in the test set is unavailable in the training set.

(2) Orphan–drug split: each drug in the test set is inaccessible in the training set.

(3) Cluster-based split: compounds in the training and test sets are structurally different ( i.e. , the two sets have guaranteed minimum distances in terms of structure similarity). We used Jaccard distance on binarized ECFP4 features to measure the distance between any two compounds following the previous study. 12 Single-linkage clustering 12 was applied to find a clustering with guaranteed minimum distances between any two clusters.

Given that the DTA prediction models are typically used to discover drugs or targets that are absent from the training set, the orphan splits provide realistic and more challenging evaluation schemes for the models. The cluster-based split further prevents the structural information of compounds from leaking to the test set. We compared the proposed MGraphDTA to GraphDTA and three traditional PCM models (RF, SVM, and FNN). For a fair comparison, we replaced the MGNN in MGraphDTA with GCN, GAT, GIN, and GAT–GCN using the source code provided by GraphDTA with the hyper-parameters they reported. We used the five-fold cross-validation strategy to analyze model performance. In each fold, all methods shared the same training, validation, and test sets. Note that the experimental settings remain the same for the eight methods.

Fig. 7 shows the experimental results for eight methods using the orphan-based and cluster-based split settings. Compared with the results using the random split setting shown in Tables 3 and 4 , we found that the model's performance decreases greatly in the orphan-based and cluster-based split settings. Furthermore, as shown in Fig. 7(a) and (c) , the MSE for MGraphDTA on Davis, KIBA, and Metz datasets using the orphan–drug split were 0.572 ± 0.088, 0.390 ± 0.023, and 0.555 ± 0.043, respectively while those using the cluster-based split were 0.654 ± 0.207, 0.493 ± 0.097, and 0.640 ± 0.078, respectively. In other words, the cluster-based split is more challenging to the DTA prediction model compared to the orphan–drug split, which is consistent with the fact that the cluster-based split setting can prevent the structural information of compounds from leaking to the test set. These results suggest that improving the generalization ability of the DTA model is still a challenge. From Fig. 7(a) , we observed that MGNN exceeded other methods significantly in the Davis dataset using the orphan–drug split setting ( p < 0.01). On the other hand, there were no statistical differences between MGNN, GAT, and RF ( p > 0.05) in the KIBA dataset while these three methods surpassed other methods significantly ( p < 0.01). In addition, SVM and FNN methods were superior to other methods significantly in the Metz dataset ( p < 0.01). Overall, the traditional PCM models showed impressive results that even surpassed graph-based methods in the KIBA and Metz datasets using the orphan–drug split setting as shown in Fig. 7(a) . These results suggest that it may be enough to use simple feature-based methods like RF in this scenario, which is consistent with a recent study. 
64 Since the number of drugs in the Davis dataset is significantly smaller than in the KIBA and Metz datasets, as shown in Table 1, the generalization ability of a model trained on limited drugs cannot be guaranteed for unseen drugs. Fig. 8 shows the correlations between predicted values and ground truths for five graph-based models on the Davis dataset using the orphan–drug split. The predicted values of MGNN spanned a broader range than those of the other graph-based models, as shown in Fig. 8(a). We also noticed that the ground truths and predicted values of MGNN have the most similar distributions, as shown in Fig. 8(b). The Pearson correlation coefficients of GCN, GAT, GIN, GAT–GCN, and MGNN for DTA prediction were 0.427, 0.420, 0.462, 0.411, and 0.552, respectively. These results further confirm that MGNN has the potential to increase the generalization ability of the DTA model. From Fig. 7(b), we observed that MGNN outperforms the other models significantly on all three datasets using the orphan–target split setting ( p < 0.01). MGNN also exceeded the other methods significantly on the KIBA and Metz datasets using the cluster-based split setting, as shown in Fig. 7(c) ( p < 0.05). It is worth noting that graph-based methods outperformed traditional PCM models in the random split setting, as shown in Tables 3 and 4, while the superiority of the graph-based methods was less obvious in the orphan-based and cluster-based split settings, as shown in Fig. 7. Overall, the results show the robustness of MGNN across different split schemes and indicate that both local and nonlocal properties of a given molecule are essential for a GNN to make accurate predictions.

Comparisons of MGNN and other seven models in Davis, KIBA, and Metz datasets in terms of MSE, CI, and r index (from left to right) using the (a) orphan–drug, (b) orphan–target, and (c) cluster-based split settings.
(a) Scatter and (b) kernel density estimate plots of binding affinities between predictive values and ground truths in Davis dataset using the orphan–drug split setting.

3.4 Ablation study

Table 6. Ablation study on the filtered Davis dataset.
Model RMSE CI Spearman
Without dense connection 0.726 (0.008) 0.726 (0.008) 0.620 (0.019)
Without batch normalization 0.746 (0.032) 0.719 (0.014) 0.604 (0.008)
MGraphDTA 0.695 (0.009) 0.740 (0.002) 0.654 (0.005)

Furthermore, an ablation study was performed on the filtered Davis dataset to investigate the effect of the receptive field of the MCNN on performance. Specifically, we increased the receptive field gradually by using convolutional layers with progressively larger kernels ( i.e., 7, 15, 23, 31). From the results shown in Table 7, it can be observed that model performance decreased slightly as the receptive field increased. Since usually only a few residues are involved in the protein–ligand interaction, 65 increasing the receptive field to cover more regions may introduce noise from portions of the sequence that are not involved in DTA.

Table 7. Effect of the MCNN receptive field on the filtered Davis dataset.
Max receptive field RMSE CI Spearman
31 0.718 (0.002) 0.732 (0.005) 0.636 (0.013)
23 0.713 (0.008) 0.732 (0.004) 0.635 (0.008)
15 0.710 (0.006) 0.734 (0.005) 0.639 (0.011)
7 0.695 (0.009) 0.740 (0.002) 0.654 (0.005)
Distribution of activation values of the last layers in the ligand and protein encoders on the Davis, filtered Davis, KIBA, and Metz datasets.

3.5 Grad-AAM provides the visual explanation

(1) Visualizing MGNN model based on Grad-AAM.

(2) Visualizing GAT model based on Grad-AAM.

(3) Visualizing GAT model based on graph attention mechanism.

Specifically, we first replaced MGNN with a two-layer GAT, in which the first graph convolution layer had ten parallel attention heads, using the source code provided by GraphDTA. 31 We then trained the MGNN-based and GAT-based DTA prediction models with five-fold cross-validation under the random split setting. Finally, we calculated atom importance using Grad-AAM and the graph attention mechanism, and rendered the probability maps with RDKit. 48
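Grad-AAM follows the Grad-CAM recipe adapted to atoms: weight each feature channel of the last GNN layer by the mean gradient of the affinity score with respect to that channel, take a channel-weighted sum per atom, apply ReLU, and normalise. The NumPy sketch below illustrates this idea on random arrays; the paper's exact formulation may differ in detail:

```python
import numpy as np

def grad_aam(node_feats, node_grads):
    """Grad-CAM-style atom importance: channel weights from mean gradients,
    weighted sum over channels per atom, ReLU, then min-max normalisation.
    node_feats / node_grads: (num_atoms, num_channels) arrays."""
    alpha = node_grads.mean(axis=0)            # one weight per feature channel
    cam = np.maximum(node_feats @ alpha, 0.0)  # weighted sum + ReLU, per atom
    rng = cam.max() - cam.min()
    return (cam - cam.min()) / rng if rng > 0 else cam

rng = np.random.default_rng(0)
feats, grads = rng.random((9, 16)), rng.random((9, 16))  # 9 atoms, 16 channels
importance = grad_aam(feats, grads)  # one score per atom, in [0, 1]
print(importance.shape)
```

The resulting per-atom scores are what RDKit renders as the probability maps discussed below.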

Table 8 shows the quantitative results of MGNN and GAT. MGNN outperformed GAT by a notable margin (p < 0.01), which further corroborates the superiority of the proposed MGNN. Fig. 10 shows visualization results for several molecules based on Grad-AAM (MGNN), Grad-AAM (GAT), and graph attention (more examples can be found in ESI Fig. S4 and S5†). According to previous studies, 71–75 epoxide, 73 fatty acid, 72,75 sulfonate, 71 and aromatic nitroso 74 groups are structural alerts that correlate with specific toxicological endpoints. We found that Grad-AAM (MGNN) indeed assigns the highest weights to these structural alerts. Grad-AAM (MGNN) not only identifies important small moieties (Fig. 10(a)–(d)) but also reveals large moieties (Fig. 10(f)), indicating that MGNN captures local and global structures simultaneously. Grad-AAM (GAT) also discerned structural alerts, as shown in Fig. 10(b), (c), (e), and (f). However, Grad-AAM (GAT) sometimes failed to detect structural alerts (Fig. 10(a) and (d)), and its highlighted regions were often more diffuse, not corresponding exactly to the structural alerts (Fig. 10(b), (c), and (e)). These results suggest that the hidden representations learned by GAT are insufficient to describe the molecules well. On the other hand, graph attention revealed only some atoms of the structural alerts (Fig. 10(c), (d), and (f)); its attention maps contain little information about the global structure of a molecule, since graph attention only considers the neighborhood of an atom. 45 One advantage of graph attention is that it highlights atoms and bonds simultaneously, whereas Grad-AAM highlights only atoms. Fig. 11 shows the distribution of atom importance for Grad-AAM (MGNN), Grad-AAM (GAT), and graph attention.
The distribution for Grad-AAM (MGNN) was left-skewed, suggesting that MGNN concentrates on the particular substituents that contribute most to toxicity while suppressing less essential ones. We also found from the distribution that Grad-AAM (GAT) tends to highlight many atoms indiscriminately, consistent with the results in Fig. 10(b), (c), and (e). Conversely, the distribution for graph attention was narrow, with most values below 0.5, suggesting that graph attention often fails to detect important substructures. It is worth noting that some studies visualize a model using global attention mechanisms that discard all structural information of the graph, and these may also provide reasonable visual explanations. 46,76 However, such global attention-based methods are model-specific and cannot easily be transferred to other graph models. In contrast, Grad-AAM is a universal visual interpretation method that can readily be applied to other graph models. Moreover, the visual explanations produced by Grad-AAM may be further improved by applying regularization techniques during the training of MGraphDTA. 77

Model Proteins Compounds MSE CI r index
GraphDTA MCNN GAT 0.215 (0.007) 0.843 (0.005) 0.330 (0.007)
MGraphDTA MCNN MGNN 0.176 (0.007) 0.902 (0.005) 0.430 (0.006)
Atom importance revealed by Grad-AAM (MGNN), Grad-AAM (GAT), and graph attention in structural alerts of (a) and (b) epoxide, (c) and (d) fatty acid, (e) sulfonate, and (f) aromatic nitroso.
Distribution of atom importance for Grad-AAM (MGNN), Grad-AAM (GAT), and graph attention. Note that we do not consider the bond importance for Grad-AAM (GAT).

Overall, Grad-AAM tends to produce more accurate explanations than the graph attention mechanism, which may offer biological insight to help us understand DL-based DTA prediction. Fig. 12 shows Grad-AAM (MGNN) on compounds with symmetrical structures. The resulting importance distributions were also symmetrical, suggesting that representing compounds as graphs and extracting their patterns with GNNs preserves the structures of the compounds.

Grad-AAM (MGNN) for molecules with symmetrical structures.

3.6 How does MGNN solve over-smoothing problems?

The receptive field of layer 1, layer 2, and layer 3 of GNN in compound 4-propylcyclohexan-1-one. (a) The receptive field of atom C2. (b) The receptive field of atom C1.
Grad-AAM (MGNN) for molecules with similar structures.

3.7 Limitations

4 Conclusion

Code availability
Data availability
Author contributions
Conflicts of interest
Acknowledgements

  • T. Zhao, Y. Hu, L. R. Valsdottir, T. Zang and J. Peng, Brief. Bioinform. , 2021, 22 , 2141–2150  CrossRef   CAS   PubMed .
  • H. Öztürk, A. Özgür and E. Ozkirimli, Bioinformatics , 2018, 34 , i821–i829  CrossRef   PubMed .
  • H. Lee and J. W. Lee, Arch. Pharm. Res. , 2016, 39 , 1193–1201  CrossRef   CAS   PubMed .
  • M. Schirle and J. L. Jenkins, Drug Discov. Today , 2016, 21 , 82–89  CrossRef   CAS   PubMed .
  • J. Peng, Y. Wang, J. Guan, J. Li, R. Han, J. Hao, Z. Wei and X. Shang, Brief. Bioinform. , 2021, 22 (5)  DOI: 10.1093/bib/bbaa430 .
  • M. Karplus and J. A. McCammon, Nat. Struct. Biol. , 2002, 9 , 646–652  CrossRef   CAS   PubMed .
  • Y. Yamanishi, M. Araki, A. Gutteridge, W. Honda and M. Kanehisa, Bioinformatics , 2008, 24 , i232–i240  CrossRef   CAS   PubMed .
  • B. J. Bongers, A. P. IJzerman and G. J. P. Van Westen, Drug Discov. Today Technol. , 2019, 32 , 89–98  CrossRef   PubMed .
  • G. J. P. van Westen, J. K. Wegner, A. P. IJzerman, H. W. T. van Vlijmen and A. Bender, Medchemcomm , 2011, 2 , 16–30  RSC .
  • I. Cortés-Ciriano, Q. U. Ain, V. Subramanian, E. B. Lenselink, O. Méndez-Lucio, A. P. IJzerman, G. Wohlfahrt, P. Prusis, T. E. Malliavin, G. J. P. van Westen and others, Medchemcomm , 2015, 6 , 24–50  RSC .
  • E. B. Lenselink, N. Ten Dijke, B. Bongers, G. Papadatos, H. W. T. Van Vlijmen, W. Kowalczyk, A. P. IJzerman and G. J. P. Van Westen, J. Cheminform. , 2017, 9 , 1–14  CrossRef   PubMed .
  • A. Mayr, G. Klambauer, T. Unterthiner, M. Steijaert, J. K. Wegner, H. Ceulemans, D.-A. Clevert and S. Hochreiter, Chem. Sci. , 2018, 9 , 5441–5451  RSC .
  • R. S. Olayan, H. Ashoor and V. B. Bajic, Bioinformatics , 2018, 34 , 1164–1173  CrossRef   CAS   PubMed .
  • T. He, M. Heidemeyer, F. Ban, A. Cherkasov and M. Ester, J. Cheminform. , 2017, 9 , 1–14  CrossRef   PubMed .
  • Y. Chu, A. C. Kaushik, X. Wang, W. Wang, Y. Zhang, X. Shan, D. R. Salahub, Y. Xiong and D. Q. Wei, Brief. Bioinform. , 2021, 22 , 451–462  CrossRef   PubMed .
  • A. Ezzat, M. Wu, X.-L. Li and C.-K. Kwoh, Methods , 2017, 129 , 81–88  CrossRef   CAS   PubMed .
  • T. Pahikkala, A. Airola, S. Pietilä, S. Shakyawar, A. Szwajda, J. Tang and T. Aittokallio, Brief. Bioinform. , 2015, 16 , 325–337  CrossRef   CAS   PubMed .
  • Q. Kuang, Y. Li, Y. Wu, R. Li, Y. Dong, Y. Li, Q. Xiong, Z. Huang and M. Li, Chemom. Intell. Lab. Syst. , 2017, 162 , 104–110  CrossRef   CAS .
  • Y. Chu, X. Shan, T. Chen, M. Jiang, Y. Wang, Q. Wang, D. R. Salahub, Y. Xiong and D.-Q. Wei, Brief. Bioinform , 2021, 22 (3)  DOI: 10.1093/bib/bbaa205 .
  • M. Wen, Z. Zhang, S. Niu, H. Sha, R. Yang, Y. Yun and H. Lu, J. Proteome Res. , 2017, 16 , 1401–1409  CrossRef   CAS   PubMed .
  • A. S. Rifaioglu, R. Cetin Atalay, D. Cansen Kahraman, T. Doğan, M. Martin and V. Atalay, Bioinformatics , 2021, 37 (5), 693–704  CrossRef   CAS   PubMed .
  • Z. Wu, B. Ramsundar, E. N. Feinberg, J. Gomes, C. Geniesse, A. S. Pappu, K. Leswing and V. Pande, Chem. Sci. , 2018, 9 , 513–530  RSC .
  • M. K. Gilson, T. Liu, M. Baitaluk, G. Nicola, L. Hwang and J. Chong, Nucleic Acids Res. , 2016, 44 , D1045–D1053  CrossRef   CAS   PubMed .
  • G. Papadatos, A. Gaulton, A. Hersey and J. P. Overington, J. Comput. Aided. Mol. Des. , 2015, 29 , 885–896  CrossRef   CAS   PubMed .
  • S. Kim, P. A. Thiessen, E. E. Bolton, J. Chen, G. Fu, A. Gindulyte, L. Han, J. He, S. He, B. A. Shoemaker and others, Nucleic Acids Res. , 2016, 44 , D1202–D1213  CrossRef   CAS   PubMed .
  • H. Chen, O. Engkvist, Y. Wang, M. Olivecrona and T. Blaschke, Drug Discov. Today , 2018, 23 , 1241–1250  CrossRef   PubMed .
  • H. Altae-Tran, B. Ramsundar, A. S. Pappu and V. Pande, ACS Cent. Sci. , 2017, 3 , 283–293  CrossRef   CAS   PubMed .
  • H. Öztürk, E. Ozkirimli and A. Özgür, arXiv preprint arXiv:1902.04166, 2019.
  • I. Lee, J. Keum and H. Nam, PLoS Comput. Biol. , 2019, 15 , e1007129  CrossRef   CAS   PubMed .
  • A. S. Rifaioglu, E. Nalbat, V. Atalay, M. J. Martin, R. Cetin-Atalay and T. Doğan, Chem. Sci. , 2020, 11 , 2531–2557  RSC .
  • T. Nguyen, H. Le, T. P. Quinn, T. Nguyen, T. D. Le and S. Venkatesh, Bioinformatics , 2021, 37 , 1140–1147  CrossRef   CAS   PubMed .
  • M. Karimi, D. Wu, Z. Wang and Y. Shen, J. Chem. Inf. Model. , 2020, 61 , 46–66  CrossRef   PubMed .
  • M. Karimi, D. Wu, Z. Wang and Y. Shen, Bioinformatics , 2019, 35 , 3329–3338  CrossRef   CAS   PubMed .
  • M. Tsubaki, K. Tomii and J. Sese, Bioinformatics , 2019, 35 , 309–318  CrossRef   CAS   PubMed .
  • Q. Feng, E. Dueva, A. Cherkasov and M. Ester, arXiv preprint arXiv:1807.09741, 2018.
  • W. Torng and R. B. Altman, J. Chem. Inf. Model. , 2019, 59 , 4131–4149  CrossRef   CAS   PubMed .
  • M. Jiang, Z. Li, S. Zhang, S. Wang, X. Wang, Q. Yuan and Z. Wei, RSC Adv. , 2020, 10 , 20701–20712  RSC .
  • L. Chen, X. Tan, D. Wang, F. Zhong, X. Liu, T. Yang, X. Luo, K. Chen, H. Jiang and M. Zheng, Bioinformatics , 2020, 36 , 4406–4414  CrossRef   CAS   PubMed .
  • B. Agyemang, W.-P. Wu, M. Y. Kpiebaareh, Z. Lei, E. Nanor and L. Chen, J. Biomed. Inform. , 2020, 110 , 103547  CrossRef   PubMed .
  • Z. Yang, W. Zhong, L. Zhao and C. Y.-C. Chen, J. Phys. Chem. Lett. , 2021, 12 , 4247–4261  CrossRef   CAS   PubMed .
  • S. Zheng, Y. Li, S. Chen, J. Xu and Y. Yang, Nat. Mach. Intell. , 2020, 2 , 134–140  CrossRef .
  • G. S. Na, H. W. Kim and H. Chang, J. Chem. Inf. Model. , 2020, 60 , 1137–1145  CrossRef   CAS   PubMed .
  • G. Li, M. Muller, A. Thabet and B. Ghanem, in Proceedings of the IEEE/CVF International Conference on Computer Vision , 2019, pp. 9267–9276  Search PubMed .
  • Y. Li, P. Li, X. Yang, C.-Y. Hsieh, S. Zhang, X. Wang, R. Lu, H. Liu and X. Yao, Chem. Eng. J. , 2021, 414 , 128817  CrossRef   CAS .
  • P. Veličković, A. Casanova, P. Liò, G. Cucurull, A. Romero and Y. Bengio, 6th International Conference on Learning Representations, ICLR 2018 - Conference Track Proceedings , 2018  Search PubMed .
  • P. Li, Y. Li, C.-Y. Hsieh, S. Zhang, X. Liu, H. Liu, S. Song and X. Yao, Brief. Bioinform. , 2021, 22 (4)  DOI: 10.1093/bib/bbaa266 .
  • R. R. Selvaraju, M. Cogswell, A. Das, R. Vedantam, D. Parikh and D. Batra, in Proceedings of the IEEE international conference on computer vision , 2017, pp. 618–626  Search PubMed .
  • A. P. Bento, A. Hersey, E. Félix, G. Landrum, A. Gaulton, F. Atkinson, L. J. Bellis, M. De Veij and A. R. Leach, J. Cheminform. , 2020, 12 , 1–16  CrossRef   PubMed .
  • J. Gilmer, S. S. Schoenholz, P. F. Riley, O. Vinyals and G. E. Dahl, in International Conference on Machine Learning , 2017, pp. 1263–1272  Search PubMed .
  • C. Morris, M. Ritzert, M. Fey, W. L. Hamilton, J. E. Lenssen, G. Rattan and M. Grohe, in Proceedings of the AAAI Conference on Artificial Intelligence , 2019, vol. 33, pp. 4602–4609  Search PubMed .
  • K. He, X. Zhang, S. Ren and J. Sun, in European conference on computer vision , 2016, pp. 630–645  Search PubMed .
  • G. Huang, Z. Liu, L. Van Der Maaten and K. Q. Weinberger, in Proceedings of the IEEE conference on computer vision and pattern recognition , 2017, pp. 4700–4708  Search PubMed .
  • Z. Yang, L. Zhao, S. Wu and C. Y.-C. Chen, IEEE J. Biomed. Heal. Informatics , 2021, 25 , 1864–1872  Search PubMed .
  • J. T. Metz, E. F. Johnson, N. B. Soni, P. J. Merta, L. Kifle and P. J. Hajduk, Nat. Chem. Biol. , 2011, 7 , 200–202  CrossRef   CAS   PubMed .
  • J. Tang, A. Szwajda, S. Shakyawar, T. Xu, P. Hintsanen, K. Wennerberg and T. Aittokallio, J. Chem. Inf. Model. , 2014, 54 , 735–743  CrossRef   CAS   PubMed .
  • M. I. Davis, J. P. Hunt, S. Herrgard, P. Ciceri, L. M. Wodicka, G. Pallares, M. Hocker, D. K. Treiber and P. P. Zarrinkar, Nat. Biotechnol. , 2011, 29 , 1046–1051  CrossRef   CAS   PubMed .
  • D. P. Kingma and J. L. Ba, 3rd International Conference on Learning Representations, ICLR 2015 - Conference Track Proceedings , 2015  Search PubMed .
  • T. Akiba, S. Sano, T. Yanase, T. Ohta and M. Koyama, in Proceedings of the 25th ACM SIGKDD international conference on knowledge discovery & data mining , 2019, pp. 2623–2631  Search PubMed .
  • M. Gönen and G. Heller, Biometrika , 2005, 92 , 965–970  CrossRef .
  • K. Roy, P. Chakraborty, I. Mitra, P. K. Ojha, S. Kar and R. N. Das, J. Comput. Chem. , 2013, 34 , 1071–1082  CrossRef   CAS   PubMed .
  • F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg and others, J. Mach. Learn. Res. , 2011, 12 , 2825–2830  Search PubMed .
  • A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga, A. Desmaison, A. Köpf, E. Yang, Z. DeVito, M. Raison, A. Tejani, S. Chilamkurthy, B. Steiner, L. Fang, J. Bai and S. Chintala, Adv. Neural Inf. Process. Syst. , 2019, 32 , 8026–8037  Search PubMed .
  • A. Airola and T. Pahikkala, IEEE Trans. Neural Networks Learn. Syst. , 2017, 29 , 3374–3387  Search PubMed .
  • Q. Ye, C.-Y. Hsieh, Z. Yang, Y. Kang, J. Chen, D. Cao, S. He and T. Hou, Nat. Commun. , 2021, 12 , 6775  CrossRef   CAS   PubMed .
  • B. K. C. Dukka, Comput. Struct. Biotechnol. J. , 2013, 8 , e201308005  CrossRef   PubMed .
  • L. Chen, A. Cruz, S. Ramsey, C. J. Dickson, J. S. Duca, V. Hornak, D. R. Koes and T. Kurtzman, PLoS One , 2019, 14 , e0220113  CrossRef   CAS   PubMed .
  • J. Sieg, F. Flachsenberg and M. Rarey, J. Chem. Inf. Model. , 2019, 59 , 947–961  CrossRef   CAS   PubMed .
  • C. Cao, X. Liu, Y. Yang, Y. Yu, J. Wang, Z. Wang, Y. Huang, L. Wang, C. Huang, W. Xu and others, in Proceedings of the IEEE international conference on computer vision , 2015, pp. 2956–2964  Search PubMed .
  • Z. Wu, D. Jiang, J. Wang, C.-Y. Hsieh, D. Cao and T. Hou, J. Med. Chem. , 2021, 64 , 6924–6936  CrossRef   CAS   PubMed .
  • A. Mukherjee, A. Su and K. Rajan, J. Chem. Inf. Model. , 2021, 61 , 2187–2197  CrossRef   CAS   PubMed .
  • M. D. Barratt, D. A. Basketter, M. Chamberlain, G. D. Admans and J. J. Langowski, Toxicol. Vitr. , 1994, 8 , 1053–1060  CrossRef   CAS .
  • A. S. Kalgutkar and J. R. Soglia, Expert Opin. Drug Metab. Toxicol. , 2005, 1 , 91–142  CrossRef   CAS   PubMed .
  • M. P. Payne and P. T. Walsh, J. Chem. Inf. Comput. Sci. , 1994, 34 , 154–161  CrossRef   CAS   PubMed .
  • J. Kazius, R. McGuire and R. Bursi, J. Med. Chem. , 2005, 48 , 312–320  CrossRef   CAS   PubMed .
  • V. Poitout, J. Amyot, M. Semache, B. Zarrouki, D. Hagman and G. Fontés, Biochim. Biophys. Acta, Mol. Cell Biol. Lipids , 2010, 1801 , 289–298  CrossRef   CAS   PubMed .
  • Z. Xiong, D. Wang, X. Liu, F. Zhong, X. Wan, X. Li, Z. Li, X. Luo, K. Chen, H. Jiang and others, J. Med. Chem. , 2019, 63 , 8749–8760  CrossRef   PubMed .
  • R. Henderson, D.-A. Clevert and F. Montanari, in Proceedings of the 38th International Conference on Machine Learning , ed. M. Meila and T. Zhang, PMLR, 2021, vol. 139, pp. 4203–4213  Search PubMed .
  • K. Oono and T. Suzuki, arXiv preprint arXiv:1905.10947, 2019.
Electronic supplementary information (ESI) available: Details of machine learning construction, vertex features of graphs, data distributions, hyperparameters tuning, and additional visualization results. See DOI:
Equal contribution.

Computer Science > Machine Learning

Title: Multi-Scale Representation Learning on Proteins

Abstract: Proteins are fundamental biological entities mediating key roles in cellular function and disease. This paper introduces a multi-scale graph construction of a protein -- HoloProt -- connecting surface to structure and sequence. The surface captures coarser details of the protein, while sequence as primary component and structure -- comprising secondary and tertiary components -- capture finer details. Our graph encoder then learns a multi-scale representation by allowing each level to integrate the encoding from level(s) below with the graph at that level. We test the learned representation on two tasks: (i) ligand binding affinity (regression) and (ii) protein function prediction (classification). On the regression task, contrary to previous methods, our model performs consistently and reliably across different dataset splits, outperforming all baselines on most splits. On the classification task, it achieves a performance close to the top-performing model while using 10x fewer parameters. To improve the memory efficiency of our construction, we segment the multiplex protein surface manifold into molecular superpixels and substitute the surface with these superpixels at little to no performance loss.
Comments: Neural Information Processing Systems 2021
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Biomolecules (q-bio.BM)



  • Open access
  • Published: 21 August 2024

Context-embedded hypergraph attention network and self-attention for session recommendation

  • Zhigao Zhang 1 , 2 ,
  • Hongmei Zhang 1 ,
  • Zhifeng Zhang 1 &
  • Bin Wang 2  

Scientific Reports volume 14, Article number: 19413 (2024)


  • Computer science
  • Electrical and electronic engineering
  • Information technology

Modeling user intention with limited evidence in short-term historical sequences is a major challenge in session recommendation. Research in this domain extends from traditional methods to deep learning. However, most approaches concentrate solely on the sequential dependence or pairwise relations within the session, disregarding the inherent consistency among items. Additionally, context adaptation in session intention learning remains underexplored. To this end, we propose a novel session-based model named C-HAN, which consists of two parallel modules: a context-embedded hypergraph attention network and self-attention. These modules are designed to capture the inherent consistency and the sequential dependencies between items, respectively. In the hypergraph attention network module, different types of interaction contexts are introduced to enhance the model's contextual awareness. Finally, a soft-attention mechanism efficiently integrates the two types of information, collaboratively constructing the session representation. Experimental validation on three real-world datasets demonstrates the superior performance of C-HAN compared to state-of-the-art methods: C-HAN achieves average improvements of 6.55%, 5.91%, and 6.17% over the runner-up baseline on the Precision@K, Recall@K, and MRR evaluation metrics, respectively.
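Of the metrics above, MRR is the simplest to state precisely: the mean of the reciprocal rank of the target item in each recommendation list. A toy sketch (not the authors' evaluation code):

```python
def mrr(ranked_lists, targets):
    """Mean Reciprocal Rank: average of 1/rank of the target item in each
    recommendation list (contributes 0 when the target is absent)."""
    total = 0.0
    for ranking, target in zip(ranked_lists, targets):
        total += 1.0 / (ranking.index(target) + 1) if target in ranking else 0.0
    return total / len(targets)

# Targets appear at ranks 1, 2, and not at all: MRR = (1 + 0.5 + 0) / 3 = 0.5
print(mrr([["a", "b"], ["c", "a"], ["b", "c"]], ["a", "a", "a"]))  # -> 0.5
```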


Introduction

With the rapid growth of the Internet and the abundance of available information, recommendation systems (RS) have become crucial in helping users navigate this vast amount of data 1 . RS aim to provide personalized and relevant recommendations, enabling users to discover new content, products, or services that align with their interests and preferences. A key appeal of RS is that they free people from information overload 2 . Conventional methods 3,4,5 typically rely on complete user interaction records or log files to achieve personalized recommendation. However, due to increasing privacy concerns, obtaining complete user profiles has become more difficult, which significantly limits the effectiveness and feasibility of traditional recommendation systems in practical applications. To deal with this dilemma, session-based recommendation systems (SRS) have emerged, which differ significantly from these traditional studies: SRS address the unavailability of user profiles by relying solely on short-term historical data.

Nowadays, SRS has emerged as a popular topic in the field of RS, attracting attention from both academia and industry. Generally, in SRS, an anonymous user's session (e.g., a sequence of purchases, clicks, or page views) is modeled as a short, chronologically ordered sequence of items that the user purchases or clicks. The fundamental idea behind SRS is to predict the user's next action by analyzing this short sequence. Given the limited information available, the main difficulty is how to effectively and precisely capture the intricate relations between items. Modeling sequential dependency is critical to SRS, especially in sessions with strong sequential dependencies like Fig. 1 a: after clicking on the camera, \(u_1\) clicked a series of accessories such as a lens, a charger, and a camera bag, so it is clear from Fig. 1 that the items in session 1 have sequential dependencies. The Markov chain (MC) 6 is a classical method based on the sequential-dependence assumption: it predicts the user's next behavior from the conditional transition probabilities of previous actions. However, MC-based methods strictly adhere to point-level sequential patterns between individual steps 7 , overlooking long-range global dependencies and failing to capture certain complex user behavior patterns, which leads to less accurate or comprehensive recommendations. Figure 2 shows an instance of a first-order Markov chain model over a series of electronic devices: the system recommends the camera lens, while the user's real intention is the display. The disagreement stems from the fact that the model generates recommendations based only on the last point of the user's interaction trajectory (i.e., local short-term dependence), without considering the items the user interacted with earlier.
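The first-order MC behaviour criticised here is easy to reproduce: transition counts are accumulated over sessions, and the recommendation depends only on the last clicked item. A toy sketch using the camera example (item names are illustrative):

```python
from collections import defaultdict

# First-order Markov chain over item sessions: count item-to-item
# transitions, then recommend the most probable successor of the
# *last* item only, ignoring all earlier context.
sessions = [
    ["camera", "lens", "charger"],
    ["camera", "lens", "bag"],
    ["camera", "display"],
]

counts = defaultdict(lambda: defaultdict(int))
for s in sessions:
    for prev, nxt in zip(s, s[1:]):
        counts[prev][nxt] += 1

def recommend(last_item):
    succ = counts[last_item]
    return max(succ, key=succ.get) if succ else None

print(recommend("camera"))  # -> lens (2 of 3 observed transitions)
```

Because the prediction conditions on a single item, the model cannot distinguish a user assembling a camera kit from one comparing display devices, which is exactly the failure mode Fig. 2 illustrates.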

figure 1

An illustration outlines two distinct connections frequently observed among items within sessions. In ( a ), the sequence-dependent trait is the primary influence, while ( b ) emphasizes the predominance of consistency.

Recently, RNNs (recurrent neural networks) have been successfully applied to session-based recommendation, owing to their advantage in modeling sequential data, and have achieved remarkable results. Hidasi et al. 8 first proposed the RNN-based model GRU4Rec, which models the user click list as a sequence and employs an RNN to learn user intention features from it. Li et al. 9 implemented an attention mechanism within the RNN architecture to improve recognition of the user's main intent, yielding the NARM model. STAMP 10 proposed a short-term interest prioritization strategy combined with an attention mechanism to capture important information in user behavior sequences. While RNN-based approaches effectively handle sequential data within a session, their reliance on strict sequential dependencies prevents them from adequately modeling other crucial relationships, such as consistency. In this paper, consistency refers to the common characteristics of the items a user interacts with during a session. These characteristics are not limited to external attributes of the items but reflect the invariance of the user's intent. For instance, as illustrated in Fig. 1 b, user \(u_2\) clicked on a range of bags, including handbags, backpacks, and shoulder bags, with no clear sequential dependence but evident consistency within session 2; the consistency reflected here may be simplicity and vintage design. Accurately capturing the consistency in a session leads to a better understanding of the user's interaction intentions, especially for modeling long-term user preferences.

figure 2

An instance of 1-order MC-based recommendation.

Context information plays a pivotal role in improving the accuracy and personalization of recommendation systems. Rich contextual information such as holidays, time, festivals, and item types has already been incorporated into SRS in the form of implicit or explicit feedback 11 . The core intuition is that user behaviors and preferences vary across times, locations, and scenarios. For instance, during holidays users are more inclined towards recreational activities, such as watching movies, which are more appealing than reading product reports. Hence, how to effectively leverage the abundance of contextual information in SRS to model user behavior more accurately is a crucial and challenging issue.

In recent years, graph neural networks (GNN) have stood out in numerous applications owing to their powerful representation learning capabilities, and GNN-based methods have also begun to gain popularity in SRS. These methods model user session data as directed graphs and leverage the relational structure between nodes to propagate and aggregate information when learning node representations. The GNN approach relaxes the strong temporal dependencies between consecutive items assumed in RNNs and instead treats item transitions as pairwise relations. SR-GNN 12 is the first model to implement SRS using graph neural networks. GC-SAN 13 incorporates both GGNN and self-attention mechanisms to enhance contextualized item representations. MSGIFSR 14 enriches item representations with a variety of context information and graph neural networks, and proposes a multi-granularity consecutive user intent unit method for SRS. While GNN-based approaches effectively capture complex relations among items by propagating information over the graph, they relax the modeling of temporal dependencies in user session sequences, which may limit their performance on session recommendation tasks with strong sequential dependence, as shown in Fig. 1 a.

In summary, the problems with existing methods can be summarized as follows: (1) over-reliance on local sequential dependencies, while neglecting the global long-term dependencies of user behavior in sessions; (2) difficulty capturing and understanding certain complex patterns in user behavior, such as consistency, leading to inaccurate recommendations; (3) the lack of effective mechanisms to integrate and utilize interaction context information, failing to capture the dynamic changes in user interests.

To solve the problems mentioned above, we propose a novel Context-embedded Hypergraph Attention Network with self-attention for session recommendation (named C-HAN). It captures two kinds of complex relationships among items, i.e., sequential dependency and consistency. In particular, we learn item representations with a hypergraph attention network over the session hypergraph and context information. Simultaneously, we employ self-attention to capture global sequential dependencies between session items. Next, an attention mechanism integrates both types of information into the final session representation. Finally, the model predicts the probability of each candidate item being clicked. Our work presents the following key contributions:
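The fusion step described above can be sketched as plain soft attention: score every item representation (from either branch) against a query vector such as the last item's embedding, and take the weighted sum as the session representation. This is a generic sketch, not C-HAN's exact formulation:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def soft_attention_session(item_reps, query):
    """Soft attention over item representations: scores from a dot product
    with a query vector, weighted sum as the session representation."""
    weights = softmax(item_reps @ query)
    return weights @ item_reps

rng = np.random.default_rng(1)
consistency_reps = rng.random((5, 8))  # e.g., from the hypergraph branch
sequence_reps = rng.random((5, 8))     # e.g., from the self-attention branch
query = sequence_reps[-1]              # last clicked item as the query
session_vec = soft_attention_session(
    np.concatenate([consistency_reps, sequence_reps]), query)
print(session_vec.shape)  # (8,)
```

The attention weights let the model emphasize whichever branch's representations best explain the current intent, rather than concatenating them blindly.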

We present C-HAN, a novel session recommendation method that effectively incorporates interactive context information and captures consistency and sequential dependencies within a session.

C-HAN utilizes hypergraph and self-attention to learn two types of item representations, and collaboratively generates the final representation in the stage of session representation learning, with the help of soft attention coordination.

We use context information to improve item representation learning, using attention mechanisms to identify those contexts that are important to the user’s interest.

We conducted extensive experiments on the ML-1M, Delicious, and Yoochoose datasets. The findings of the experiment indicate that C-HAN surpasses the state-of-the-art approaches.

Related works

Conventional methods

Traditional approaches typically rely on session-based state transitions or predefined rules to produce suggestions. The Markov chain is the basic tool for implementing state-transition-based recommendation. For example, the FPMC model 6 builds a first-order Markov transition matrix to learn user preference information. Shani et al. 15 proposed an MDP model, formulating the recommender system's procedure as a Markov decision process tailored for sequential recommendation. SMF 16 is a sequence-aware recommendation method that captures the dynamic changes of user preferences using a hidden Markov model. However, these MC-based methods concentrate only on the local sequence-transition relationship between adjacent actions, neglecting global dependencies within the entire sequence. Both S-POP 17 and Item-KNN 18 are classic rule-based recommendation methods: S-POP leverages item-popularity analysis to predict user preferences and identify trending items, while Item-KNN recommends the K most similar items by establishing similarity rules. Nevertheless, neither technique accounts for the sequential order of interactions within sessions.
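For concreteness, Item-KNN's "similarity rule" can be realised as cosine similarity between items' binary session co-occurrence vectors; the sketch below uses made-up sessions:

```python
import math

# Item-KNN over binary session-item co-occurrence: cosine similarity
# between items, then recommend the K nearest neighbours of an item.
sessions = [
    {"handbag", "backpack"},
    {"handbag", "shoulder_bag"},
    {"handbag", "backpack", "shoulder_bag"},
    {"camera", "lens"},
]

items = sorted(set().union(*sessions))

def cosine(a, b):
    inter = sum(1 for s in sessions if a in s and b in s)
    na = sum(1 for s in sessions if a in s)
    nb = sum(1 for s in sessions if b in s)
    return inter / math.sqrt(na * nb) if na and nb else 0.0

def knn(item, k=2):
    sims = [(other, cosine(item, other)) for other in items if other != item]
    return [it for it, s in sorted(sims, key=lambda t: -t[1])[:k] if s > 0]

print(knn("handbag"))
```

Note that the session sets discard ordering entirely, which is exactly the limitation the text points out: interaction order never enters the similarity computation.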

Deep learning-based methods

With their natural ability to handle sequential data, RNNs are widely used in session-based recommendation tasks and have achieved remarkable results. GRU4Rec 8 is the first SRS model to use the Gated Recurrent Unit (GRU) to process the sequential relationships of items in a session. Tan et al. 19 proposed an improved version of GRU4Rec that enhances recommendation performance by introducing data augmentation and accounting for temporal shifts in user behavior. RNN-based session recommendation gained popularity in the subsequent years. NARM 9 and STAMP 10 are two notable works proposed after GRU4Rec; compared to GRU4Rec, both incorporate attention mechanisms with an RNN to construct hybrid recommendation models. NARM uses an attention mechanism to assign varying weights to items in the user's current session, helping to capture changes in user interests. STAMP introduces a short-term memory priority mechanism that prioritizes the impact of the most recent session item on subsequent user behavior. Since then, RNN-based session recommendation has been further explored. Wang et al. 20 devised an approach that employs RNNs at both the cross-domain and individual-user levels to trace users' collective preferences across multiple domains. Sheng et al. 21 proposed time-based directional attention incorporated with an RNN to enhance the accuracy of user preference modeling by detecting sequential signals within sessions. Zhang et al. 22 presented the MBPI hybrid model, based on a concurrent GRU framework. While RNN-based strategies successfully model item relations within a session, they often prioritize the temporal order of items to a degree that can cause overfitting.

Caser 7 is the pioneering model that introduced convolutional neural networks (CNN) into SRS. It marks the start of research into other intricate relationships in SRS that are difficult to uncover with RNNs. Among these, CNN-based and self-attention-based methods have been well explored. HMN 23 utilizes CNNs to analyze item representations and extract multi-scale features, enabling the capture of users' preferences at the feature level. CSRM 24 proposes a hybrid framework that utilizes two parallel memory encoders to model session and neighborhood information. Despite their significant success, these methods share a noteworthy limitation: an overreliance on sequential connections between neighboring items, which causes higher-order relationships among non-adjacent items to be ignored. SASRec 25 effectively captures the inherent dependencies between items within a user session using self-attention, regardless of sequence length, and achieves superior performance compared to MC-, RNN-, and CNN-based methods.

Graph neural networks (GNN) have been successfully incorporated into SRS, where they excel at capturing intricate transition relationships between vertices. SR-GNN 12 represents the pioneering application of GNNs in SRS. It uses gated graph neural networks (GGNN) to model sequential item-transition patterns and integrates transient and long-term session features with an attention mechanism. GC-SAN 26 designs a hybrid framework combining a GNN and self-attention, where the GNN captures local dependencies while self-attention learns long-distance dependencies. FGNN 27 proposes a weighted attention graph layer as an encoder for the items in a session and adopts a Readout function to generate session embeddings. TAGNN 28 uses a target-aware attention GNN to adaptively activate users' different interests with respect to different target items and to capture rich item transitions in the session. DSGNN 29 uses lightweight gating networks to combine dynamic and static intents to improve prediction accuracy. Nevertheless, although these methods relax the strict temporal order of RNN-based approaches in favor of pairwise item relations, they overlook the intricate many-to-many correlations among items within a session. In real-world situations, item transitions are frequently influenced by the combined impact of preceding items and the complex interrelations among items.

Due to the inherent capability of hypergraphs to represent complex high-order relationships among items, methods based on Hypergraph Neural Networks (HGNNs) 30 , 31 , 32 have garnered significant interest among researchers. Research on this topic is still at an initial stage, with only a limited number of relevant studies available. HGNN 30 is recognized as the first hypergraph convolutional network. It applies the clique expansion technique to approximate hypergraphs as graphs, thereby reducing the problem to the graph embedding framework. UniGNN 33 proposes a unified framework that models the message-passing process of graph neural networks as a two-stage process on hypergraphs, which it claims can generalize GNN models to hypergraphs for downstream tasks. In more recent research, Wang et al. 34 proposed the HyperRec model, which uses multi-layer hypergraph convolution to capture users' dynamic preferences and short-term intentions for session recommendation. Gao et al. 35 proposed SDHID, a self-supervised dual hypergraph learning model with intention disentanglement, which fuses a hypergraph and a capsule network to learn vertex embeddings and then forms the final session embedding through a self-attention aggregation mechanism. Xia et al. 36 introduced the self-supervised hypergraph transformer (SHT) framework, which enhances user representations by incorporating global collaborative relationships through a hypergraph transformer network. Li et al. 37 utilized a heterogeneous HGNN for friend recommendation on human mobility data, demonstrating the flexibility of hypergraph models in capturing intricate spatiotemporal information. Overall, hypergraph neural networks have demonstrated effectiveness in various recommendation scenarios, providing a new perspective for a more comprehensive understanding of user interactions and preferences and yielding improved recommendation accuracy and novelty.

Closely relevant to our work is SHARE 32 , which constructs a sub-hypergraph for each session and employs a hypergraph attention layer to aggregate item information into a session representation. Although SHARE can capture high-order item relationships, it largely ignores the sequence-dependent information between items in a session and does not consider the impact of different types of interaction context on user interest. In contrast to SHARE, our proposed C-HAN method accounts for both consistency and sequential dependencies among items. Another difference is that interaction context information is introduced so that the model can perceive the scenario in which the user interacts.

Problem statement

A hypergraph is a generalization of a graph used to model multi-way relationships. Unlike a traditional graph, where each edge connects exactly two vertices, a hypergraph allows an edge to connect multiple vertices, making it better suited for representing higher-order, complex relationships. The definition of a hypergraph is given as follows:

Definition 1

( Hypergraph ). A hypergraph 32 G can be defined as a tuple \(G=(V, E)\) , where \(V=\{v_i\}_{i=1}^N\) is a vertex set and \(E=\{e_i\}_{i=1}^M\) is a hyperedge set. A hyperedge is a nonempty subset of vertices, i.e., \(e_i \subseteq V\) . \(H\in \mathbb {R}^{N\times M}\) denotes the incidence matrix, with \(H_{ij}=1\) if vertex \(v_i\) belongs to hyperedge \(e_j\) and \(H_{ij}=0\) otherwise. \(W_{ii}\) is a positive weight for hyperedge \(e_i\) , and the diagonal matrix \(W\in \mathbb {R}^{M\times M}\) collects the weights of all hyperedges. The degrees of the vertices and hyperedges form two diagonal matrices \(D\in \mathbb {R}^{N\times N}\) and \(B\in \mathbb {R}^{M\times M}\) , where \(D_{ii}=\sum _{j}^{M}W_{jj}H_{ij}\) and \(B_{jj}=\sum _{i}^{N}H_{ij}\) .
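As a minimal illustration of Definition 1, the sketch below builds the incidence matrix \(H\) and the degree matrices \(D\) and \(B\) for a toy set of sessions; the session data and uniform hyperedge weights are assumptions made for illustration.

```python
import numpy as np

# Toy sessions (hypothetical); each session becomes one hyperedge,
# and every unique item is a vertex.
sessions = [[0, 1, 2], [1, 3], [2, 3, 4]]
n_items, n_edges = 5, len(sessions)

# Incidence matrix H: H[i, j] = 1 iff vertex v_i belongs to hyperedge e_j.
H = np.zeros((n_items, n_edges))
for j, sess in enumerate(sessions):
    for i in set(sess):
        H[i, j] = 1.0

# Diagonal hyperedge-weight matrix W; the paper sets W_jj = 1 for each hyperedge.
W = np.eye(n_edges)

# Vertex degrees D_ii = sum_j W_jj H_ij and hyperedge degrees B_jj = sum_i H_ij.
D = np.diag(H @ W.diagonal())
B = np.diag(H.sum(axis=0))

print(D.diagonal())  # item 1 lies on two hyperedges, so D[1, 1] == 2
```

Here \(D\) and \(B\) come out as diagonal matrices exactly as in the definition; in practice they are used to normalize hypergraph convolutions.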

Definition 2 ( Session recommendation ). Let \(S=\{S_1, S_2, S_3,..., S_{|S|}\}\) be the set of all sessions, and \(V=\{v_1, v_2, v_3,..., v_N\}\) be the set of all unique items involved in the sessions. A user's session can be represented as a chronological list of clicked items, i.e., \(S_i=[v_1,v_2,...,v_{n_i}]\) , where \(n_i\) is the session length. Let \(C=[C_1, C_2,...., C_K]\) denote the set of interaction context categories, where K is the number of categories. In this paper, interaction context refers to contextual information during session interactions, such as time, holidays, weekdays, etc. Each interaction context \(C_i (1\le i \le K)\) is a set containing a series of contextual values. Given a session \(S_i\) and the corresponding interaction context sequence \(T=[T_1, T_2,..., T_{n_i}]\) , the task of session recommendation is to suggest the top-N items that the user is likely to interact with next. Here \(T_i\) is itself a sequence recording the context when the user interacts with item \(v_i\) , denoted \(T_i =[t_i^1,t_i^2,....t_i^K]\) , where \(t_i^k (1 \le k \le K)\) is determined by the context type.

Methodology

In this section, we present the proposed model C-HAN in detail; its pipeline is shown in Fig. 3 . It has four components: (1) a self-attention module for sequential information learning; (2) a hypergraph information embedding module with interaction context; (3) a session representation learning module that learns the final representation of the session; and (4) a prediction layer that uses the refined session embedding to predict the top-N items the user is likely to click next.

figure 3

The overview of the proposed C-HAN model.

Sequential information embedding

It is widely recognized that sequential transition patterns are essential for SRS, as they encompass the temporal correlation of user behavior, interest evolution, long-term dependency relationships, and other relevant information. This information effectively enhances our understanding of user behavior, improves recommendation accuracy, and contributes to a better user recommendation experience. In this paper, we adopt the self-attention mechanism to model the sequence dependency transition patterns in sessions.

Technically, we represent each item \(v_i\) in a session \(S_i=[v_1, v_2,..., v_{n_i}]\) as a d-dimensional embedding vector, obtained by querying the learnable item embedding matrix \(E^V=[e_1^v,e_2^v,...,e_{|V|}^v]\) with the item's ID through a look-up layer. The embedding representation of session \(S_i\) is therefore \(E_{s_i}=[e_1^v,e_2^v,...,e_{n_i}^v]\) , \(E_{s_i}\in \mathbb R^{n_i\times d}\) . Then, following the standard self-attention formulation, we transform \(E_{s_i}\) into different latent spaces and apply the sigmoid activation function to inject non-linearity, generating the query Q and key K respectively: \(Q=\mathrm{sigmoid}(E_{s_i}W^q)\) and \(K=\mathrm{sigmoid}(E_{s_i}W^k)\) .

Here, \(W^q\in \mathbb {R}^{d\times d}, W^k\in \mathbb {R}^{d\times d}\) are learnable parameters used to implement the spatial transformation.

Following the acquisition of the query and key, the embedding similarity between each pair of items is calculated using the dot product with scaling. This process helps establish a correlation matrix that represents the relationships between items. To prevent high similarity scores for identical items, we utilize a masking operation inspired by the approach described in 25 . The correlation matrix is calculated as follows:

Here \(C\in \mathbb {R}^{n_i\times n_i}\) is computed as \(C=QK^\top /\sqrt{d}\) (with the diagonal masked), where \(\sqrt{d}\) scales the attention scores.

By analyzing the affinity matrix, our objective is to assess the importance of an item by evaluating its similarity scores with other items. Lower similarity scores indicate that the item is not particularly important, possibly resulting from accidental or curious user interaction. Conversely, if an item shows high similarity to most items in a session, it signifies that the item represents the user’s primary preference and holds greater importance. Building on this understanding, we measure an item’s importance by calculating the average similarity between the item and other items within a session.

\(\alpha _i\) denotes the importance score assigned to item \(v_i\) within a session, computed as the mean of its similarity scores \(C_{ij}\in C\) with the other items. To normalize the scores, we apply a softmax layer; the overall item importance is \(\beta = \mathrm{softmax}([\alpha _1,\alpha _2,...,\alpha _{n_i}])\) .
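A minimal numeric sketch of this importance scoring, with random matrices standing in for the learned parameters \(W^q\) and \(W^k\) (the dimensions and data are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 4, 8                       # session length and embedding size (illustrative)
E = rng.normal(size=(n, d))       # session embedding E_{s_i}
Wq = rng.normal(size=(d, d))      # stand-ins for the learnable W^q, W^k
Wk = rng.normal(size=(d, d))

sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))
Q, K = sigmoid(E @ Wq), sigmoid(E @ Wk)

# Scaled dot-product correlation matrix; the diagonal is masked so that an
# item's similarity with itself does not inflate its importance.
C = Q @ K.T / np.sqrt(d)
np.fill_diagonal(C, 0.0)

# alpha_i: average similarity of item i to the other items in the session.
alpha = C.sum(axis=1) / (n - 1)

# beta: importance scores normalized with a softmax layer.
beta = np.exp(alpha - alpha.max())
beta /= beta.sum()
```

Items whose embeddings align with most of the session receive larger \(\beta\) values, matching the intuition described above.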

Context-embedded hypergraph attention networks

Hypergraph construction.

Drawing inspiration from 31 , we adopt a hypergraph to model sessions, formally expressed as \(G=(V, E)\) , where each session is modeled as a hyperedge and the vertex set V contains all unique items across the sessions. Figure 4 shows the construction process from sessions to the hypergraph. In contrast to the sequential dependence captured by traditional methods, our approach connects all items within each hyperedge to better capture the complex relationships and transitions among them.

figure 4

An example of a hypergraph created by three sessions, with each hyperedge marked with a dashed line of a different color.

Attention-based context embedding

To capture the impact of contextual information on item learning during user interactions and enhance the adaptability of user interest features to the context, we incorporate a weighted contextual representation for each item within the hyperedge using soft attention. This contextual representation indicates the impact of different types of contextual scenarios.

Specifically, for each item \(v_i\) in a session with its corresponding interaction context \(T_i = [t_i^1, t_i^2, ..., t_i^K]\) , where \(t_i^k (1 \le k \le K)\) is the value of a certain type of interaction context, we initially represent each context value as a one-hot vector, which is then converted into a dense embedding representation \(e_i^{T,k}\in \mathbb {R}^d\) of dimension d by querying a learnable parameter matrix \(C_k\) . All context embeddings in \(T_i\) thus form a context embedding matrix \(E_i^T\in \mathbb R^{K \times d}\) . Next, we use an attention mechanism, as shown in Eq. ( 6 ), to generate a weight vector \(W_i\in \mathbb R^K\) , where each element \(w_i^k\) represents the degree of influence of the corresponding context type on user interest when interacting with item \(v_i\) . Using this weight vector, the weighted contextual representation based on the interaction context types is obtained with Eq. ( 7 ). The mathematical representations of Eqs. ( 6 ) and ( 7 ) are shown below:

where \(q_c\in \mathbb R^d\) , \(W_T \in \mathbb R^{d \times d}\) , \(E_i^T \in \mathbb R^{d \times K}\) .

Finally, in a session, the embedding representation of an item \(v_i\) is modeled as an integration \(e_i^{T,*}\) , which is denoted as the initial embedding vector \(e_i^v\) and the weighted interaction context representation \(e_i^{T}\) as shown in Eq. ( 8 ). It means that the item has a contextual-aware embedding representation.

where \(\oplus \) denotes concatenation, and \(W\in \mathbb R^{d\times 2d}\) and \(b_0 \in \mathbb R^d\) are a learnable parameter matrix and parameter vector, respectively.
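The context-aware embedding of Eqs. (6)-(8) can be sketched as follows. The tanh scoring inside the attention and all parameter values are assumptions, since the exact form of Eq. (6) is not reproduced here:

```python
import numpy as np

rng = np.random.default_rng(1)
d, K = 8, 3                        # embedding size and number of context types (illustrative)
e_v = rng.normal(size=d)           # initial item embedding e_i^v
E_T = rng.normal(size=(K, d))      # dense context embeddings e_i^{T,k}, one per type

# Hypothetical soft-attention parameters (q_c and W_T in the paper).
q_c = rng.normal(size=d)
W_T = rng.normal(size=(d, d))

# Eq. (6) sketch: one attention weight per context type (tanh scoring assumed).
scores = np.tanh(E_T @ W_T) @ q_c
w = np.exp(scores - scores.max())
w /= w.sum()

# Eq. (7): weighted contextual representation e_i^T.
e_T = w @ E_T

# Eq. (8): concatenate item and context embeddings, then project back to d dims.
W0 = rng.normal(size=(d, 2 * d))
b0 = np.zeros(d)
e_star = W0 @ np.concatenate([e_v, e_T]) + b0   # context-aware embedding e_i^{T,*}
```

The result `e_star` plays the role of \(e_i^{T,*}\): an item embedding modulated by the attended context types.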

Hypergraph attention networks

We extend the graph attention network to hypergraphs and employ hyperedges as the medium for information propagation to update vertex representations. This process consists of two stages: aggregating information from vertices to hyperedges, and aggregating information from hyperedges to vertices. The mathematical expressions for these two stages are presented below:

By viewing GNNs as a two-stage aggregation process, we can naturally extend GNN designs to hypergraphs. In this framework, \(\varphi _1\) and \(\varphi _2\) are permutation-invariant functions that aggregate messages from vertices and hyperedges, respectively. This insight allows us to seamlessly adapt GNN designs to the hypergraph setting. Figure 5 illustrates the update process of the HAN (Hypergraph Attention Network).

figure 5

An illustration of how the GAT can be applied to hypergraphs. ( a ) Toy examples of a hypergraph. ( b ) Two-stage message passing for hypergraph H . Note that edges showing how messages flow to vertex \(v_3\) are marked in red dotted lines.

Specifically, in the first stage, for each session (hyperedge), we employ a permutation-invariant function \(\varphi _1\) to aggregate the feature information of all vertices connected by it; \(\varphi _1\) satisfies \(\varphi _1(\{x_j\}) = x_j\) . In this paper, we use a summation function as the aggregation function, as shown in Eq. ( 10 ).

where \(x_i=e_i^{T,*}\) , and \(W_{jj}\) is a positive weight of the hyperedge \(h_{e_j}\) , set to 1 for each hyperedge.

In the second stage, we use the function \(\varphi _2\) to update each vertex by aggregating information from its incident hyperedges. Existing GNN approaches can inspire the design of \(\varphi _2\) . Among them, GAT (Graph Attention Network) has been widely successful by assigning different weights to the neighbors of a central node and aggregating their information to update the central node's features. As noted above, Eq. ( 9 ) shows that GNN-based designs can easily be adapted to our framework. We therefore apply GAT to update vertex information in the hypergraph, which works as follows:

where \(\sigma \) is the LeakyRelu function, \(W\in \mathbb R^{d\times d}\) and \(a \in \mathbb R^{2d}\) are the learnable attentional parameters. \(\mathcal {N}(e_i)=\{e\in \mathcal {E}|v_i\in e\} \) is the incident-edges of vertex \(v_i\) , i.e., the set of all hyperedges containing vertex \(v_i\) .

We propagate and update the vertex representations through multiple hypergraph convolution layers. Following 30 , we remove the non-linear activation function and weight matrix between convolutional layers to reduce computational complexity. Based on Eqs. ( 11 ) to ( 13 ), the representation of the t th vertex is defined as:

where \(x_t^{l+1}\) is the t th item’s representation at the \((l+1)\) th layer.

In this paper, we designed L layers to propagate \(x^0\) in HAN and averaged the learned vertex representations \(x_i^l\) of each layer to obtain the item representation based on HAN:

The input \(x_i^0\) in the 0th layer is initialized with the value obtained from Eq. ( 8 ), i.e., \(x_i^0 = e_i^{T,*}\) .
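The two-stage update with layer averaging can be sketched as below. The toy hypergraph and random parameters are assumptions; a real implementation would learn `W` and `a`, batch these operations, and may define the layer average slightly differently:

```python
import numpy as np

rng = np.random.default_rng(2)
d = 8
sessions = [[0, 1, 2], [1, 3]]       # hyperedges as lists of vertex indices (toy)
n = 4
X = rng.normal(size=(n, d))          # vertex features x_i^0 (would be e_i^{T,*})

W = rng.normal(size=(d, d))          # stand-ins for the learnable GAT parameters
a = rng.normal(size=2 * d)
leaky = lambda x: np.where(x > 0, x, 0.01 * x)

def han_layer(X):
    # Stage 1 (Eq. (10)): sum vertex features into each hyperedge (W_jj = 1).
    He = np.stack([X[s].sum(axis=0) for s in sessions])
    # Stage 2: GAT-style attention over each vertex's incident hyperedges.
    out = np.zeros_like(X)
    for i in range(n):
        incident = [j for j, s in enumerate(sessions) if i in s]
        scores = np.array([leaky(a @ np.concatenate([W @ X[i], W @ He[j]]))
                           for j in incident])
        att = np.exp(scores - scores.max())
        att /= att.sum()
        out[i] = sum(att[k] * (W @ He[j]) for k, j in enumerate(incident))
    return out

# L layers without intermediate non-linearities, averaged across layers.
L = 2
layers = [X]
for _ in range(L):
    layers.append(han_layer(layers[-1]))
X_han = np.mean(layers, axis=0)
```

Averaging the per-layer representations mixes local and multi-hop hyperedge information into the final item embeddings.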

Session representation learning

Within this layer, we construct the ultimate embedding representation of a session by considering two key factors: the consistency information and the sequential dependency information. These distinct types of information are amalgamated through soft attention to form the session’s representation.

Sequential pattern representation learning

The self-attention mechanism in the section above provides relevance scores between each item and the intent of the session. We use these relevance scores as weights over the item embeddings and take the weighted average as the long-term interest representation.

Following the approach proposed in 10 , we use the last item's embedding as the instantaneous interest, denoted \(R_S=e_n^v\) . Combining it with the long-term preference, we construct the final sequential pattern representation as follows:

where \(W\in \mathbb R^{d\times 2d}\) is a learnable transformation parameter matrix.

Consistency learning

We propagate the context-weighted item embeddings through our hypergraph attention network across multiple convolutional layers, ultimately forming hypergraph embeddings for each item. Then, for all items in an interactive session, we apply an average aggregation function to model the consistency:

Consistency and sequential information fusion

To model the final intent of user interaction sessions by considering both sequential and consistency, we employ a soft-attention mechanism to automatically blend the two and obtain the final embedding representation as follows:

where \(\alpha \) represents the weight of the fusion of a participating term, and \(W\in \mathbb R^{d\times 2d}\) is a learnable parameter matrix.
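A hedged sketch of this fusion step: a scalar soft-attention gate computed from the concatenation decides how much each view contributes. The paper's gating may use a vector-valued weight; the scalar sigmoid form here is an assumption:

```python
import numpy as np

rng = np.random.default_rng(3)
d = 8
R_seq = rng.normal(size=d)    # sequential-pattern representation
R_con = rng.normal(size=d)    # consistency representation

# Gate parameters (learnable in the real model, random here).
w = rng.normal(size=2 * d)
alpha = 1.0 / (1.0 + np.exp(-(w @ np.concatenate([R_seq, R_con]))))

# Soft-attention blend of the two views into the final session embedding Z_f.
Z_f = alpha * R_seq + (1.0 - alpha) * R_con
```

Because the gate depends on both inputs, the blend can shift toward sequential or consistency information per session, which is what distinguishes it from the fixed concatenation or inner-product variants studied in RQ4.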

Prediction layer

Based on the learned session feature vector \(Z_f\) mentioned above, we calculate the similarity score of an item \(v_j\in V\) in session \(S_i\) using the inner product. This allows us to obtain a score vector \(\hat{Z}_{S_i}\) for all items in the candidate set relative to the session \(S_i\) , where \(\hat{Z}_{S_i,j}\) represents the j -th element in this vector. Then, we use the softmax function to transform it into a click probability vector \(\hat{y}_{s_i}\) . The computation is represented as follows:

where \(\hat{y}_{S_i}=[\hat{y}_{S_i,1},\hat{y}_{S_i,2},...,\hat{y}_{S_i,N}]\) , and \(\hat{y}_{S_i,j}\) represents the probability of the user clicking on item \(v_j\) in the next instance. Finally, the top- N items with the highest score in \(\hat{y}_{S_i}\) will be recommended to the user.
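The prediction layer reduces to an inner product, a softmax, and a top-N cut, sketched below with random embeddings (a real model would use the learned \(Z_f\) and item embedding matrix; the target item is hypothetical):

```python
import numpy as np

rng = np.random.default_rng(4)
d, N = 8, 10
Z_f = rng.normal(size=d)             # learned session representation
E_items = rng.normal(size=(N, d))    # candidate item embeddings

# Inner-product scores over all candidate items, then softmax click probabilities.
scores = E_items @ Z_f
probs = np.exp(scores - scores.max())
probs /= probs.sum()

# Recommend the top-N (here top-5) items with the highest probability.
top5 = np.argsort(-probs)[:5]

# Cross-entropy loss against the (hypothetical) next clicked item.
target = 3
loss = -np.log(probs[target])
```

During training, the cross-entropy term is what Algorithm 1 minimizes over all training sessions.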

For session \(S_i\) , to optimize the parameters of the model, the cross-entropy function is used as the objective function. Algorithm 1 provides the training process of C-HAN and the formal expression for the cross-entropy function is as follows:

figure a

The process of C-HAN training.

Experiments

The purpose of this section is to assess and analyze the effectiveness of C-HAN. Furthermore, we seek to explore and address the following four research questions:

RQ1: Is C-HAN competitive with other baseline methods?

RQ2: What is the performance of the C-HAN model across varying session lengths?

RQ3: Does the inclusion of context information enhance the performance of C-HAN?

RQ4: Is the attention-based fusion approach employed by the model fusion layer able to achieve competitive performance?

Datasets and evaluation metrics

To confirm the effectiveness of C-HAN, we conducted experiments on three benchmark datasets, namely ML-1M, Delicious-2K, and Yoochoose. Their statistics are shown in Table 1 .

We arranged the user-clicked item sequences in chronological order for the ML-1M and Delicious-2K datasets and then split the interaction sequences into a training set and a test set. The test set comprised the last 10 days for ML-1M and the last month for Delicious-2K, while the remaining data formed the training set. Three types of context information were extracted, namely week (7 days), month (12 months), and working-day indicators, yielding a total of 168 context values. For the Yoochoose dataset, we held out the last day of data as the test set and used the remainder for training. Additionally, we filtered out sessions with a length of less than 5 or items clicked fewer than 5 times. To augment the training data, we employed a sliding-window technique to split the sequences, as suggested in previous literature 9 , 21 . Given the extensive size of the Yoochoose dataset, we used only 1/64 of its data for training and testing, known as Yoochoose-1/64. We also collected four types of context information, i.e., the 7 days of the week, working-day indicators, 6 category types, and 4 time periods in a day, for a total of 336 context values. Table 1 presents comprehensive statistics of the three processed datasets.

Evaluation metrics

We utilize three evaluation metrics widely adopted in SRS, i.e., Recall@K (R@K), MRR@K, and Precision@K (P@K), to measure the performance of our approach as well as the comparison methods. These metrics are defined as follows:

here, N refers to the total number of items that the user truly likes in the test set, |hit| is the number of items in the recommendation list that the user likes, and Rank(t) represents the rank of a truly liked item t in the recommendation list.
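Under the definitions above, the three metrics can be sketched for a single session; the toy recommendation list and liked items are illustrative, and in practice the values are averaged over all test sessions:

```python
def recall_at_k(recommended, relevant, k):
    # |hit| / N: fraction of truly liked items that appear in the top-k list.
    return len(set(recommended[:k]) & set(relevant)) / len(relevant)

def precision_at_k(recommended, relevant, k):
    # |hit| / k: fraction of the top-k list that the user truly likes.
    return len(set(recommended[:k]) & set(relevant)) / k

def mrr_at_k(recommended, relevant, k):
    # Reciprocal rank of the first liked item in the top-k list, else 0.
    for rank, item in enumerate(recommended[:k], start=1):
        if item in relevant:
            return 1.0 / rank
    return 0.0

recs = [7, 2, 9, 4, 1]   # ranked recommendation list (toy)
liked = {2, 4}           # items the user truly likes (toy)
print(recall_at_k(recs, liked, 5))     # 1.0
print(precision_at_k(recs, liked, 5))  # 0.4
print(mrr_at_k(recs, liked, 5))        # 0.5
```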

In our comparative analysis, we evaluate our method against the following representative baselines:

CASER 7 proposes a personalized top-N sequential recommendation method based on convolutional sequence embedding.

SRGNN 12 is the first model that utilizes GNN for session recommendation.

HyperRec 34 utilizes hypergraphs to represent complex relationships between items, enabling a more comprehensive understanding of multi-order connections for the next-item recommendation.

SERec 38 learns user and item representations by integrating social network knowledge through heterogeneous graph neural networks.

SDHID 35 introduces hypergraphs and capsule networks to learn vertex embedding, and obtains representations of intra-session patterns by aggregating item embeddings with attention weights.

SHARE 32 employs HGAN to aggregate item information to generate session representation.

Implementation details

We searched the embedding size d over \(\{50,100,150,200,250\}\) and set \(d=100\) . To enhance generalization and mitigate overfitting, we applied dropout with a ratio of 30% together with \(L2=10^{-5}\) regularization. We trained with mini-batches of size 256 using the Adam optimizer with a learning rate of 0.001, decayed by a factor of 0.1 every three epochs to prevent overfitting.

Comparison results (RQ1)

To address RQ1 and showcase the comprehensive performance of C-HAN, we conducted a comparison against various baseline methods across three different datasets, with the findings presented in Tables 2 and 3 . The following are some key findings we have summarized.

Tables 2 and 3 show that C-HAN consistently outperforms the other baseline methods on all three benchmark datasets. This provides strong evidence for the effectiveness and validity of our strategy in addressing RQ1. Further statistical analysis of the data in Tables 2 and 3 shows that C-HAN achieved the highest average improvement in the precision metric across the three datasets, at 6.55%, followed by an average improvement of 6.17% in MRR and 5.92% in Recall. The highest average improvement across all performance metrics was observed on the ML-1M dataset, at 10.52%, followed by Yoochoose-1/64 with 4.52% and Delicious-2K with an average improvement of 3.61%. More detailed statistics on the improvement of C-HAN over the runner-up baseline method are shown in Table 4 .

Overall, among all the compared methods, the GNN-based methods outperform other methods such as Caser. This reflects the particular strength of graph neural networks in sequential recommendation: by modeling user interactions as a graph structure, they can accurately capture the complex transition relationships between vertices. Looking further, we find that introducing hypergraph techniques into graph embedding learning yields even better performance. For example, SDHID, which uses hypergraphs, achieves the best performance among all baseline methods. This directly shows that allowing an edge to connect multiple vertices, as in a hypergraph, better captures the multi-way relationships in the graph.

We now narrow our focus to Table 2 to examine the performance of C-HAN. By comparison, C-HAN achieves its largest improvement over the runner-up method on the ML-1M dataset, a point also supported by the statistics in Table 4 . This gain can be attributed to the relatively limited candidate item set of 3417 items in ML-1M, which allows C-HAN to improve the hit rate and ranking of target items within a smaller candidate pool. Another factor may be ML-1M's large data volume, indicating that our method adapts well to large-scale datasets. We also find that Recall@20 and Precision@20 show roughly comparable average improvement rates (6.04% vs. 6.84%) across the three datasets, which indicates that the model tends to increase the coverage of recommendations while preserving their accuracy; that is, it is good at discovering new items that users have not yet interacted with but may be interested in.

Performance of different session length (RQ2)

To investigate RQ2, we carried out a comparative study on the three datasets with different session lengths, examining the effectiveness of C-HAN, SRGNN, CASER, and SHARE. Their performance on Recall@20 and MRR@20 is depicted in Fig. 6 . Note that, due to space constraints, the examination of RQ2–RQ4 in the remainder of this section relies on these two evaluation measures.

figure 6

The models' effectiveness across various session lengths on the three datasets, where each row shows the Recall@20 and MRR@20 metrics.

Based on Fig. 6 , when considering Recall@20, the performance of SRGNN, CASER, and SHARE follows a common pattern: it initially increases, then declines sharply as the session length grows. In contrast, our model achieves optimal performance and remains stable, declining only gradually. This can be attributed to the fact that, as the session length increases, more interacted items provide the model with additional information about user intentions, improving the accuracy of intention detection. Nonetheless, once session lengths surpass a certain threshold, CASER and SHARE struggle to handle the increased complexity of user behavior patterns and the growing number of irrelevant items, which injects noise and degrades their performance. Generally, SHARE outperforms CASER, primarily due to its ability to identify higher-order relations among items. The robustness of our model comes from its ability to detect sequential signal patterns while maintaining user intent consistency, which mitigates the impact of irrelevant factors and reduces sensitivity to session length. The experimental results further confirm the ability of C-HAN to handle long sessions.

MRR@20 decreases consistently as session length increases for all models, and C-HAN outperforms SHARE, SRGNN, and CASER at all session lengths. It is worth noting that the decline in MRR@20 is more pronounced than the decline in Recall@20 on these datasets. This difference can be attributed to the fact that irrelevant items have a stronger negative impact on MRR@20 than on Recall@20.

Impact of context information (RQ3)

figure 7

Influence of context information on Recall@20 and MRR@20 performance.

To examine the impact of context information and the attention-based mechanism, we introduce two modified versions of C-HAN: C-HAN-C and C-HAN-A. C-HAN-C refers to the variant that excludes context information, while C-HAN-A represents the version that incorporates solely context information without utilizing the attention-based mechanism. Through these variations, we aim to investigate the specific contributions of each component in the C-HAN model.

Figure 7 depicts the comparative analysis between C-HAN and the two variants. The results presented in Fig. 7 showcase the superior performance of C-HAN in comparison to the other two models. Notably, when compared to C-HAN-C, C-HAN exhibits noteworthy enhancements in Recall@20, with improvements of approximately 2.82%, 3.13%, and 2.50% across the three datasets. Additionally, C-HAN demonstrates substantial enhancements in MRR@20, with improvements of about 17.68%, 3.12%, and 7.38% across the same datasets. These significant improvements underscore the crucial role of context information in session-based recommendations. The presence of rich semantic information within the context enables effective modeling of user behaviors, thereby contributing to the enhanced performance observed in C-HAN.

In comparison to C-HAN-A, C-HAN demonstrates improvements in Recall@20 of approximately 2.10%, 0.58%, and 1.22% across the three datasets, and enhancements in MRR@20 of about 12.70%, 2.06%, and 2.56% on the same datasets. These findings highlight the distinct influences of different types of context information on the learning of item representations. They suggest that the combination of context information and attention-based mechanisms in C-HAN contributes to more accurate and effective learning of item representations, which is superior to relying solely on context information without the attention-based mechanism. Therefore, it is essential to integrate both aspects to optimize the session-based recommendation process.

Impact of fusion patterns on session representation learning (RQ4)

figure 8

Comparison of different fusion strategies on session representation learning.

To answer RQ4, we experiment with two linear integration techniques for building the final user intention representation at the session representation stage and contrast them with our chosen approach. The two variants are as follows: (1) C-HAN-CO substitutes the soft attention used in session representation learning with concatenation, i.e., \(Z=R_{f_s} \oplus R_{f_c}\) , where \(\oplus \) denotes vector concatenation; (2) C-HAN-IP employs the inner product as the fusion pattern, i.e., \(Z=R_{f_s} \odot R_{f_c}\) , where \(\odot \) denotes element-wise multiplication. The results of the different fusion strategies are presented in Fig. 8 .

Figure 8 reveals that C-HAN exhibits superior performance across two evaluation metrics on the three datasets, with C-HAN-IP and C-HAN-CO trailing behind. This result can be attributed to the adaptability of C-HAN in adjusting the weights between sequential transition signals and interaction consistency information, which compose the session representation, based on variations in user interaction context or scenarios. This adaptive adjustment enables C-HAN to capture the user’s intent more accurately.

We have developed a novel C-HAN model to capture the intricate relationships between items within a session and to discern user interaction intentions across various contextual interaction scenarios. Our model is distinctive in concurrently capturing sequential dependencies and consistency information among session items, while also considering the influence of diverse interactive contextual information on changes in user interests. C-HAN propagates items with different types of interactive contexts through hypergraph attention convolution layers, iteratively learning consistency information. It additionally leverages self-attention to capture sequential dependencies among the items within each session. Finally, the model adaptively integrates these two types of information with a soft-attention mechanism to comprehensively represent the session. Experiments on three real-world datasets confirm the strong performance of C-HAN compared with baseline methods. In future work, we will extend C-HAN to social recommendation tasks, introduce users' trusted social relationships, and integrate the influence of multi-modal user behaviors such as commenting, adding favorites, and adding items to shopping carts. Such information helps to capture users' intentions accurately and thus improve the accuracy of session-based recommendation.

Data availability

The datasets used in experiments can be downloaded from the following URLs: ML-1M: http://www.grouplens.org/node/73 , Delicious-2K: https://grouplens.org/datasets/hetrec-2011 , Yoochoose: http://2015.recsyschallenge.com/challege.html .


Acknowledgements

This work was supported by the Natural Science Foundation of Inner Mongolia (2023LHMS06025), and the Basic Scientific Research Foundation of Colleges and Universities Directly under the Inner Mongolia Autonomous Region (GXKY22135).

Author information

Authors and Affiliations

College of Computer Science and Technology, Inner Mongolia Minzu University, Tongliao, 028000, China

Zhigao Zhang, Hongmei Zhang & Zhifeng Zhang

School of Computer Science and Engineering, Northeastern University, Shenyang, 110169, China

Zhigao Zhang & Bin Wang


Contributions

Conceptualization, Zhang Z.G.; methodology, Zhang Z.G.; software, Zhang Z.G.; validation, Zhang Z.G., Zhang H.M., Wang B., and Zhang Z.F.; formal analysis, Zhang Z.G.; investigation, Zhang Z.G.; resources, Zhang Z.G.; data curation, Zhang Z.G., Wang B.; writing—original draft preparation, Zhang Z.G.; writing—review and editing, Zhang Z.G., Zhang Z.F., and Wang B.; visualization, Zhang H.M.; supervision, Zhang H.M.; project administration, Zhang H.M.; funding acquisition, Wang B.

Corresponding author

Correspondence to Zhifeng Zhang .

Ethics declarations

Competing interests

The authors declare no competing interests.

Additional information

Publisher's note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/ .


About this article

Cite this article.

Zhang, Z., Zhang, H., Zhang, Z. et al. Context-embedded hypergraph attention network and self-attention for session recommendation. Sci Rep 14, 19413 (2024). https://doi.org/10.1038/s41598-024-66349-7


Received: 05 March 2024

Accepted: 01 July 2024

Published: 21 August 2024




An improved digital soil mapping approach to predict total N by combining machine learning algorithms and open environmental data

  • Original Article
  • Open access
  • Published: 20 August 2024


  • Alessandro Auzzas 1 ,
  • Gian Franco Capra 1 ,
  • Arun Dilipkumar Jani 2 &
  • Antonio Ganga   ORCID: orcid.org/0000-0001-7929-5160 1  


Digital Soil Mapping (DSM) is fundamental for monitoring soil, a resource that is limited and strategic for human activities. The availability of high temporal and spatial resolution data and robust algorithms is essential to map and predict soil properties and characteristics with adequate accuracy, especially at a time when the scientific community, legislators, and land managers are increasingly interested in the protection and rational management of soil.

Proximity and remote sensing, efficient data sampling, and open public environmental data enable the use of innovative tools to create spatial databases and digital soil maps with high spatial and temporal accuracy. Applying machine learning (ML) to soil data prediction can improve the accuracy of maps, especially at scales where geostatistics may be inefficient. The aim of this research was to map nitrogen (N) levels in the soils of the Nurra sub-region (north-western Sardinia, Italy), testing the performance of the Ranger, Random Forest Regression (RFR), and Support Vector Regression (SVR) models, using only open source and open access data. Following the literature, the models include soil chemical-physical characteristics and environmental and topographic parameters as independent variables. Our results showed that predictive models are reliable tools for mapping N in soils, with an accuracy in line with the literature. The average accuracy of the models is high (R² = 0.76), and the highest accuracy in predicting N content in surface horizons was obtained with RFR (R² = 0.79; RMSE = 0.32; MAE = 0.18). Among the predictors, SOM has the highest importance. The results obtained could encourage the integration of this type of approach into the policy and decision-making processes carried out at regional scale for land management.


Introduction

Digital Soil Mapping (DSM) has been the main spatial information practice in soil science for many years. This sub-discipline of soil science received international recognition in 2005 with the establishment of a dedicated working group led by IUSS (Arrouays et al. 2017). Today, the main processes of DSM are based on geostatistical methods, machine learning (ML) models, and algorithms (Heung et al. 2016; Khaledian and Miller 2020; Padarian et al. 2019; Wadoux et al. 2020). Geostatistics refers to methods of studying environmental phenomena based on their spatial variability, starting from real data collected in the field (Hoffimann et al. 2021). These tools are widely used for drafting prediction maps, especially through different Kriging algorithms (Keskin and Grunwald 2018; Santra et al. 2017; Zhang et al. 2020). Alongside them, however, ML tools, which obtain comparable results, are increasingly being used (Taghizadeh-Mehrjardi et al. 2021; Wadoux et al. 2020).

Indeed, ML is applied in several fields, such as monitoring of hydrogeological risk (Jain et al. 2020; Ma et al. 2021), wildfire prevention (Elia et al. 2020), the prediction of soil physical–chemical parameters (Li et al. 2023a, b; Li et al. 2022; Wang et al. 2021, 2022; Xu et al. 2021), and human health (Aghazadeh et al. 2019; Piunti 2019). Consequently, the algorithms to choose from are as numerous as the fields of application; depending on the objective, the sampling characteristics, and the dataset, one algorithm may be preferred over another (Li et al. 2023a, b; Wadoux et al. 2020). A relevant aspect in the application of ML is the abundance and quality of databases (Chen et al. 2022). In environmental science, applying ML requires extensive and costly surveying campaigns, which can be supported by existing databases, often shared by institutions and governmental bodies according to the logic of open data (Hengl et al. 2017). In recent years, open databases have proliferated in the environmental field, especially those published by public institutions (Worthy 2015), including in soil science (Orgiazzi et al. 2018). Furthermore, the increased use of open data in digital soil mapping is recent and strictly related to the use of new spatial analysis tools, such as Google Earth Engine (GEE), and the availability of large datasets of remote sensing data acquired by satellite missions (Copernicus, Landsat) (Poppiel et al. 2021). National and international agencies are developing policies and tools to share soil data, also for scientific purposes, such as the LUCAS soil project implemented by the EU Environment Agency (Orgiazzi et al. 2018). Indeed, today almost all medium/large scale studies focused on digital soil mapping integrate field data with updated, publicly managed, high-resolution open data (Radočaj et al. 2024; Searle et al. 2021). This type of data, coupled with an ML algorithm, appears to be more efficient, also in terms of cost–benefit, than the traditional approach using a geostatistical algorithm (Radočaj et al. 2022a).

Soil mapping can have two main purposes: i) assignment of a class to the observed soil, or ii) identification of one or more soil features (Zhang et al. 2017). Among the latter, physical–chemical parameters have been extensively investigated to create regional (Brungard et al. 2021; Maleki et al. 2023), local, and field-scale distribution maps (Chlingaryan et al. 2018; Söderström et al. 2016; Zhou et al. 2023). Among the chemical parameters, mapping soil macronutrients (N, P, and K) is a pivotal step for environmental and agricultural development agencies, farmers, and other stakeholders to understand their spatial distribution and consequently improve nutrient input management while avoiding soil water pollution. Nitrogen is a fundamental macronutrient for the development of plant species, not least because of the quantities that plants require for sustenance (Högberg et al. 2017). Plant species accumulate N in different forms and through different modalities throughout their life cycle, predominantly during the growth phases (Das et al. 2022). The continuous input of N needed by crops has a significant impact on production cycles and markets (Dimkpa et al. 2020). The use of N fertilizers carries significant economic weight; this calls for careful and constant monitoring over time to highlight the spatial distribution dynamics of N deficits and surpluses (Singh 2018; Wang et al. 2019).

The Nurra sub-region (north-western Sardinia) provides an excellent paradigmatic case for exploring the questions reported above. It encompasses several environmental conditions, ranging from natural areas (parks protected and regulated by law) to highly productive enterprises, mainly located in the plains: the production of famous, high-quality wines exported around the world; intensive to semi-intensive agricultural activities; and cattle and sheep farming for meat and milk-derived products. Additionally, the area has undergone extensive urbanization due to the presence of extended urban areas (Sassari and Alghero) and famous tourist locations (Arru et al. 2019).

The objectives of this research were to: i) assess the effectiveness and performance of selected ML models using only open access environmental databases; ii) predict N values in the soil surface horizons of the Nurra sub-region (Sardinia, Italy); and iii) draw up a sub-regional scale map based on the predicted values. Only open-access data were used, provided and implemented by different bodies and organizations at different hierarchical levels. The variables under investigation were selected through data exploration, i.e., an in-depth analysis of the dataset to study its distribution and main characteristics from a statistical point of view. Random tree models were used since they are in common use and are integrated as algorithms in several statistical software packages, such as "CART", "RF", and "Ranger" in RStudio (RStudio Team 2011). Furthermore, this approach has three important characteristics. It is: i) easy to reproduce with open-source software; ii) powered by public open data; and iii) oriented to produce outputs that can be easily integrated into decision-making processes (Fig. 1).

figure 1

Workflow Diagram

Materials and methods

The study area, which covers 1,330 km², is located in NW Sardinia (Italy, Fig. 2), in the Nurra sub-region (40°48′28.8″N 8°15′14.4″E). Different geological substrates are featured in the area: the most extensive is the limestone formation, followed by pyroclastic flow deposits (south), aeolian sandstones, and gravel (Carmignani et al. 2015). The study area is characterized by high pedodiversity (Aru and Baldaccini 1983), with Alfisols (Rhodoxeralfs, Palexeralfs, Haploxeralfs), Inceptisols (Xerochrepts), and Entisols (Fluvents: Xerofluvents; Aquents: Fluvaquents; Psamments: Xeropsamments) dominating (Keys to Soil Taxonomy, 13th edition, 2022). The main land uses are agriculture (65%), urban settlements (5%), and natural areas (30%; CORINE Land Cover, Copernicus Land Monitoring Service). The vegetative cover is mainly divided into forest vegetation (30%), such as hardwood and coniferous trees, and arable crops (40%), as described by CORINE Land Cover (CLC). Part of the forest is located on the coastline of Asinara's Bay, consisting of relatively recent conifer plantations placed behind the dunes. Approximately 10% of the surface is occupied by olive trees. The central part of the study area is characterized by irrigated arable land.

figure 2

Study area framework

Data collection

The construction, implementation, and validation of the dataset are a pivotal part of the mapping process; the predictive results of the model depend on its characteristics and composition. The availability of quality data determines the accuracy of the model; therefore, it is necessary to build a general dataset that includes a carefully selected range of variables that, as a whole, influence the values of the variable we want to predict (Wadoux et al. 2020). Only open sources were used in this work. The use of open sources increases the replicability of this research, making it possible to compare results. Furthermore, as shown by several authors (Ferreira et al. 2022; Nussbaum et al. 2018; Wadoux et al. 2020), the availability of data, especially data related to soil characteristics, stimulates research on the conditions of this resource. At the same time, the existence and availability of freely accessible data increases society's awareness of soil resource issues (Gorelick et al. 2017; Orgiazzi et al. 2018). In this work, chemical, physical, topographical, and land-use-related predictors are used. Table 1 reports the main characteristics of the predictors (type, source, and resolution).

Soil chemical-physical features

Soil data used in the study are available on the official website of the Sardinian Soil Survey. Footnote 1 These data are provided in ESRI shapefile format with a punctual geometric structure. Each point represents a sample collected by one of the institutions involved in several projects: regional agencies (AGRIS, LAORE) and the Universities of Sassari and Cagliari. There are 1511 sampling points in the study area; each point is associated with a soil profile card code and a link to the profile description and chemical and physical parameters. Unfortunately, 981 of the 1511 cards contained only physical property data, reducing the number of observations available to apply the models. Further data were added from LUCAS. Footnote 2

Topography directly and indirectly affects the dynamics of soil N concentrations (Weintraub et al. 2017). In this research, we studied the spatial variation of the Topographic Position Index (TPI), which expresses the shape of the space making up the landscape. The relationship between topographic indices and soil N concentration has been demonstrated, especially in forest watersheds (Dai et al. 2022; Li et al. 2020). Topographic data were derived from the Digital Terrain Model (DTM) developed by the cartographic office of the Sardinian Region, available on the Regional Geoportal (Regione Autonoma della Sardegna 2023) at a 10 × 10 m resolution. The TPI values were calculated with the SAGA GIS tool (Conrad et al. 2015).

Erosion by water and distance to waterbody

Nitrogen is one of the essential macronutrients for vegetation: the color and vigour of a plant depend on the soil N concentration. Soil N is susceptible to runoff due to water-induced soil erosion (Sequi et al. 2017). A covariate related to the hydrography of the study area consisted of an estimate of soil water erosion, made available by the European Soil Data Centre (ESDAC) and obtained with the Revised Universal Soil Loss Equation (RUSLE) model. This empirical model is defined by the following equation:

\(A = R \times K \times LS \times C \times P\)

where:

A = annual soil loss;

R = erosivity (Panagos et al. 2015a, b, c, d);

K = soil erodibility (Panagos et al. 2014);

LS = slope length and steepness (Panagos et al. 2015b);

C = vegetation cover (Panagos et al. 2015a, b, c, d);

P = support practices (Panagos et al. 2015c).

This model estimates annual soil loss (t ha⁻¹ per year). Another important dataset, related to hydrography, is the Euclidean distance between each cell and the nearest waterbody. The presence of water affects N concentration in the surface horizons of soil (Amicabile 2016) and was therefore included in the dataset. Our aim was to assess the influence of these and other predictors to improve the accuracy of the predictions.
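As a sanity check, the RUSLE estimate reduces to a per-cell product of the factor rasters. The sketch below uses hypothetical factor values on a toy 2×2 grid, not the ESDAC data:

```python
import numpy as np

# hypothetical per-cell RUSLE factors on a 2x2 grid
R = np.array([[900.0, 950.0], [880.0, 910.0]])   # rainfall erosivity
K = np.array([[0.030, 0.032], [0.028, 0.031]])   # soil erodibility
LS = np.array([[1.2, 1.5], [0.9, 1.1]])          # slope length/steepness
C = np.array([[0.20, 0.15], [0.25, 0.18]])       # vegetation cover
P = np.ones((2, 2))                              # support practices

# annual soil loss per cell: A = R * K * LS * C * P
A = R * K * LS * C * P
```

In the actual workflow these factors are continent-scale rasters, but the cell-wise arithmetic is the same.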

Soil N concentrations in the surface horizons are intrinsically linked to vegetation cover conditions (Chen et al. 2014), so vegetation data contribute to assessing land degradation processes (Ridwan et al. 2024), and vegetation indices can help detect and describe soil conditions. Vegetation spectral indices are obtained by combining several satellite image bands (Chlingaryan et al. 2018). The covariate selected to represent vegetation cover was the Normalized Difference Vegetation Index (NDVI), which expresses the vigour of the vegetation, as estimated from photosynthetic activity, with values in the range [−1, +1] (Antognelli 2018). The index was computed from Landsat 8 imagery Footnote 3 using the following bands:

Band 4, Red (0.64–0.67 µm);

Band 5, Near-Infrared (0.85–0.88 µm).

The bands are combined through the following equation:

\(NDVI = \frac{NIR - VIS}{NIR + VIS}\)

where NIR corresponds to band 5 and VIS corresponds to band 4.

The final NDVI layer is the average of the values detected in the summer and winter seasons in the years from 2016 to 2020. The image data are as follows (Table 2):
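The NDVI band arithmetic and seasonal averaging described above can be sketched as follows; the reflectance values are hypothetical toy rasters, not Landsat data:

```python
import numpy as np

def ndvi(nir, red):
    """NDVI = (NIR - Red) / (NIR + Red), with Landsat 8 bands 5 and 4.
    A small epsilon guards against division by zero (e.g. over water)."""
    nir = nir.astype(float)
    red = red.astype(float)
    return (nir - red) / (nir + red + 1e-12)

# toy 2x2 reflectance rasters for two acquisitions (hypothetical)
nir = np.array([[0.5, 0.4], [0.6, 0.3]])
red = np.array([[0.1, 0.2], [0.1, 0.3]])
seasonal = [ndvi(nir, red), ndvi(nir * 0.9, red * 1.1)]

# final layer: per-cell average over the acquisitions
ndvi_mean = np.mean(seasonal, axis=0)
```

In practice each acquisition would be a full Landsat scene clipped to the study area, and the mean is taken over all summer and winter images from 2016 to 2020.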

Exploratory data and spatial analysis

The Exploratory Data and Spatial Analysis (EDA) was implemented using R software. In this study, EDA consisted of analysing the distribution and composition of each predictor through descriptive statistics. It was articulated in five parts: i) data collection, ii) data cleaning, iii) univariate statistics, iv) multivariate statistics, and v) spatial distribution analysis.

Once collected, all data were assembled into a vectorial dataset in the QGIS workspace (QGIS Development Team 2023), covering the study area with a 100 × 100 m cell grid. In the matrix associated with the vectorial grid, each row corresponds to a cell and each column to a variable. Raster datasets were appropriately re-scaled and incorporated into the vectorial dataset using the QGIS raster statistics procedure (QGIS Development Team 2023).

In the final dataset, a general check was carried out to identify and remove the null values (NA) and outliers.

Univariate statistics were used to describe the distribution of the values of the predictor and dependent variable.

To detect multicollinearity, we created a correlation matrix. Multicollinearity arises in regression analysis when multiple variables exhibit significant correlations not only with the dependent variable but also with each other (Shrestha 2020). If two covariates are correlated, the absolute error of the predictions increases (Daoud 2017). This analysis therefore helped identify variables that had no impact on prediction quality or, worse, adversely affected it. Following the literature (Chan et al. 2022; Lindner et al. 2022), we removed covariates with a correlation coefficient above 0.80, since collinearity is probable when the Pearson correlation coefficient approaches 0.8 (Shrestha 2020).
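A minimal sketch of this filtering step, assuming a greedy keep-first strategy over the covariates (one reasonable reading of the procedure; the covariate values below are synthetic):

```python
import numpy as np

def drop_collinear(X, names, threshold=0.80):
    """Greedily keep covariates whose absolute Pearson correlation
    with every already-kept covariate is at most the threshold."""
    corr = np.abs(np.corrcoef(X, rowvar=False))
    keep = []
    for j in range(X.shape[1]):
        if all(corr[j, k] <= threshold for k in keep):
            keep.append(j)
    return X[:, keep], [names[k] for k in keep]

# synthetic example: 'b' is nearly a copy of 'a' and should be dropped
rng = np.random.default_rng(1)
a = rng.normal(size=200)
X = np.column_stack([a,
                     a + rng.normal(scale=0.01, size=200),
                     rng.normal(size=200)])
X_red, kept = drop_collinear(X, ["a", "b", "c"])
```

Which member of a correlated pair gets dropped depends on column order here; in practice the choice can also be informed by domain knowledge.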

Another analysis conducted on the N-value point dataset was the study of spatial autocorrelation, i.e., the presence of a systematic spatial variation in a variable. Positive spatial autocorrelation is the tendency of a site and its neighbours to have similar values (Chlingaryan et al. 2018 ; Li et al. 2016 ; Nguyen and Vu 2019 ). The Moran index (Moran 1948 ) estimates the degree of global spatial autocorrelation. The index is given by:

\(I=\frac{N}{\sum_{i}\sum_{j}{w}_{ij}}\cdot \frac{\sum_{i}\sum_{j}{w}_{ij}\left({X}_{i}-\bar{X}\right)\left({X}_{j}-\bar{X}\right)}{\sum_{i}{\left({X}_{i}-\bar{X}\right)}^{2}}\)

where:

\(N\) is the number of events;

\({X}_{i}\) and \({X}_{j}\) are the values taken by the variable at the points i and j, with \(i\ne j\) ;

\(\bar{X}\) is the average of the covariate considered;

\({w}_{ij}\) is an element of the matrix containing arbitrary event weights.

The weights are determined according to the contiguity of the events. The index I ranges over [−1; +1] (Tybl 2016 ). Values close to +1 indicate clustering of similar values (positive autocorrelation), values close to −1 indicate dispersion (negative autocorrelation), and values close to zero indicate a random spatial distribution. This approach can strengthen model selection: in the absence of high spatial correlation, it is preferable to use multivariate statistical methods rather than geostatistical methods.
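The formula above can be sketched directly in code. The chain-contiguity example below is hypothetical (the study's analysis was run in R) and only illustrates that clustered similar values yield a positive I:

```python
# Minimal global Moran's I from the formula above, with a binary
# contiguity matrix; an illustrative sketch, not the study's implementation.
import numpy as np

def morans_i(x: np.ndarray, w: np.ndarray) -> float:
    n = len(x)
    d = x - x.mean()                  # deviations from the mean
    num = (w * np.outer(d, d)).sum()  # sum_ij w_ij (x_i - xbar)(x_j - xbar)
    return (n / w.sum()) * num / (d ** 2).sum()

# Six sites on a line; neighbours share an edge (chain contiguity)
x = np.array([1.0, 1.0, 1.0, 0.0, 0.0, 0.0])
w = np.zeros((6, 6))
for i in range(5):
    w[i, i + 1] = w[i + 1, i] = 1.0

i_value = morans_i(x, w)  # similar values cluster together, so I is positive
```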

Machine learning algorithms

ML models have been used widely in both classification and regression problems. Wadoux et al. ( 2020 ) analysed a large body of peer-reviewed literature and found that, in the case of classification, 80% of the articles applied at least one random-tree model. More than one model was chosen in this research, as it is common to compare results across several models of different types (Wadoux et al. 2020 ; Zhou et al. 2023 ).

The selection of algorithms was based on the results of previous applications in this field. As described by several authors (Wadoux et al. 2020 ), ML tools do not explicitly consider soil mechanics, phenomena, and properties, but rather learn from the data on which they are trained. For this reason, it is useful to examine the results of model applications in similar situations. Here, to select the models, we searched for similar case studies in which the goal was to predict the values of chemical components in the soil (Dai et al. 2022 ; Flynn et al. 2023 ; Forkuor et al. 2017 ; Hengl et al. 2017 ; Li et al. 2023a , b ; Li et al. 2022 ; Prado Osco et al. 2019 ; van der Westhuizen et al. 2023 ; Wadoux et al. 2020 ; Wang et al. 2022 ; Xiaorui et al. 2023 ; Xu et al. 2021 ; Zhou et al. 2023 ). Following this bibliographic analysis, the algorithms selected were Random Forest Regression (RFR), Ranger, and Support Vector Machine Regression (SVR).

Random forest regression and ranger

While the RF model is often used in fields such as medicine (Sarica et al. 2017 ), it is also widely used in soil mapping (Wadoux et al. 2020 ).

This method builds forests of decision trees to improve the accuracy of predictions, and is therefore classified as an ensemble algorithm, i.e. one that combines a number of other models (Zhou et al. 2023 ). Unlike other ML models, RF randomly selects the subset of independent variables used to split the nodes (leaves), making it more accurate and further reducing the instability of the trees (Forkuor et al. 2017 ; van der Westhuizen et al. 2023 ). It is possible to choose the number of trees that make up the forest (here, 500), each of which is created independently from a sample of the training data.

Ranger is a fast implementation of RF, mostly used for large datasets (Wright and Ziegler 2017 ). Both belong to the class of tree models. The Ranger package, implemented in R, allows additional aspects of model building to be managed.

Specifically, the parameters handled by the function differ from those of RF and allow finer model management and refinement. The main ones used in the training phase are:

Quantreg, which, if enabled, performs quantile prediction through a regression forest;

Num.trees, which sets the number of trees in the forest;

Write.forest, to store the results of the model;

Min.node.size, the minimum size of the leaves; a value of 5 is recommended for regression;

Importance, which ranks the independent variables by their importance in the prediction; for regression, the importance is based on the variance of the results and is coded as “impurity” (Xu et al. 2016 ).

This makes the training phase more refined than in other models. The computational and memory efficiency of Ranger was evident in our R implementation: the algorithm handles many more values and variables in less time than RF, making it very effective and fast compared to other models (Wright and Ziegler 2017 ).
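The tree-model setup described above can be sketched with scikit-learn's `RandomForestRegressor` standing in for the R randomForest/ranger packages (500 trees, minimum leaf size 5, impurity-based variable importance); the covariates and response below are synthetic stand-ins for the real grid data:

```python
# Sketch of the random-forest configuration described above, using
# scikit-learn instead of R's randomForest/ranger; data are synthetic.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X = rng.uniform(size=(200, 4))                           # 4 hypothetical covariates
y = 3 * X[:, 0] + X[:, 1] + 0.1 * rng.normal(size=200)   # N driven mainly by X0

rf = RandomForestRegressor(
    n_estimators=500,      # number of trees, as in the text
    min_samples_leaf=5,    # analogous to ranger's min.node.size = 5
    random_state=0,
)
rf.fit(X, y)
importance = rf.feature_importances_  # analogous to the "impurity" ranking
```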


Algorithm 1 RFR Program Code


Algorithm 2 Ranger Program Code

Support vector regression

SVR, an extension of the Support Vector Machine to regression problems (Lee et al. 2020 ; Ramedani et al. 2014 ), is not widely used in this field, but there are examples of its application to predicting the values of different soil properties (Li et al. 2023a , b ; Wang et al. 2021 ; Xu et al. 2021 ; Zhou et al. 2023 ). The algorithm fits a function whose purpose is to predict the dependent variable. One of the reasons we chose this algorithm is that its inner workings differ from those of the tree models. SVR formulations are analogous to common linear regression, but with some differences (Ramedani et al. 2014 ). The algorithm projects the data into a high-dimensional space through a kernel function (the choice of kernel depends on the characteristics of the data and can have a significant impact on the performance of the model (Forkuor et al. 2017 )) in order to identify a separating hyperplane defined by the support vectors. Predictions fall within a margin around this hyperplane, whose width is managed by the cost parameter (C) (Awad and Khanna 2015 ).
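A minimal sketch of this setup, with scikit-learn's `SVR` (RBF kernel, cost parameter C) standing in for the R implementation; the data are synthetic and the hyperparameter values are illustrative only:

```python
# Sketch of SVR with an RBF kernel and cost parameter C, as described above.
# scikit-learn stands in for the R implementation; data are synthetic.
import numpy as np
from sklearn.svm import SVR
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(1)
X = rng.uniform(-1, 1, size=(150, 3))        # 3 hypothetical covariates
y = np.sin(3 * X[:, 0]) + 0.5 * X[:, 1]      # smooth nonlinear response

# The kernel maps the data to a high-dimensional space; C controls the
# trade-off between flatness and tolerated error around the margin.
model = make_pipeline(StandardScaler(), SVR(kernel="rbf", C=10.0, epsilon=0.05))
model.fit(X, y)
r2 = model.score(X, y)  # coefficient of determination on the training data
```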


Algorithm 3 SVR Program Code

Validation and assessment models

Two different techniques were used to validate the models. The first randomly divided the dataset into two parts: the larger part was used to train the models (training dataset), and the second was used to test model performance on unseen data (test dataset). The split was 75% for the training dataset and the remainder for the test dataset. Cross-validation, or k-fold cross-validation (CV), is a statistical technique that divides the training dataset into k parts to limit overfitting. Overfitting is a key concern when using ML tools, in both classification and regression problems (Berrar 2019 ; Wang et al. 2021 ). According to the literature (Aghazadeh et al. 2019 ; Berrar 2019 ; Dharumarajan 2019 ; Hounkpatin et al. 2022 ; Khaledian and Miller 2020 ; Li et al. 2023a , b ; Liu et al. 2022 ; Maleki et al. 2023 ; Mashaba-Munghemezulu et al. 2021 ; Nolan et al. 2018 ; Radočaj et al. 2022b ; Rahman et al. 2020 ; Uddameri et al. 2020 ; Van Der Westhuizen et al. 2022 , 2023 ; Wadoux et al. 2020 ; Wang et al. 2021 ; Xu et al. 2021 ; Zhang et al. 2021 ; Zhou et al. 2023 ), the most widely used and efficient CVs are those with K = 5 and K = 10. In this paper, we chose K = 10.
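The validation scheme above (random 75/25 split, then 10-fold CV on the training part) can be sketched as follows; the model and synthetic data are illustrative placeholders for the study's R workflow:

```python
# Sketch of the validation scheme: 75/25 random split plus 10-fold CV
# on the training set. Data and model are synthetic placeholders.
import numpy as np
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(2)
X = rng.uniform(size=(300, 5))
y = 2 * X[:, 0] - X[:, 2] + 0.05 * rng.normal(size=300)

# 75% training / 25% test, drawn at random
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.75, random_state=0)

model = RandomForestRegressor(n_estimators=100, random_state=0)

# 10-fold cross-validation on the training set only
cv_scores = cross_val_score(model, X_train, y_train, cv=10, scoring="r2")

# final fit and evaluation on the held-out test set
model.fit(X_train, y_train)
test_r2 = model.score(X_test, y_test)
```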

The metrics used to assess the accuracy of the performance differ according to the problem at hand. In this paper, we use metrics based on the residuals of the prediction, i.e., the difference between actual and predicted values. The most common are the coefficient of determination (R2), the root-mean-square error (RMSE), and the mean absolute error (MAE). These metrics are used in several soil-mapping studies to compare the performance of the different models chosen (Chlingaryan et al. 2018 ; Dai et al. 2022 ; Lee et al. 2020 ; Liang et al. 2018 ; Prado Osco et al. 2019 ; Wadoux et al. 2020 ; Zhang et al. 2019 ). The formulas are as follows:

\({R}^{2}=1-\frac{\sum_{i=1}^{n}{\left({O}_{i}-{P}_{i}\right)}^{2}}{\sum_{i=1}^{n}{\left({O}_{i}-\bar{O}\right)}^{2}}\), \(RMSE=\sqrt{\frac{1}{n}\sum_{i=1}^{n}{\left({O}_{i}-{P}_{i}\right)}^{2}}\), \(MAE=\frac{1}{n}\sum_{i=1}^{n}\left|{O}_{i}-{P}_{i}\right|\)

where:

\(O\) is the real value of N;

\(P\) is the prediction.
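The three metrics can be written out directly from these definitions; the observed/predicted values below are a made-up worked example, not the study data:

```python
# The three accuracy metrics, with O = observed and P = predicted,
# matching the definitions above; a small worked example.
import numpy as np

def r2(o, p):
    o, p = np.asarray(o, float), np.asarray(p, float)
    return 1.0 - ((o - p) ** 2).sum() / ((o - o.mean()) ** 2).sum()

def rmse(o, p):
    o, p = np.asarray(o, float), np.asarray(p, float)
    return np.sqrt(((o - p) ** 2).mean())

def mae(o, p):
    o, p = np.asarray(o, float), np.asarray(p, float)
    return np.abs(o - p).mean()

observed  = [1.0, 2.0, 3.0, 4.0]
predicted = [1.1, 1.9, 3.2, 3.8]
scores = (r2(observed, predicted),
          rmse(observed, predicted),
          mae(observed, predicted))
```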

Results and discussion

Table 3 shows the results of the descriptive statistical analysis.

The final dataset consisted of 300 observations and 18 predictors.

The correlation matrix (Fig.  3 ) did not indicate strong associations between the predictors, so we excluded the potential presence of multicollinearity. The spatial autocorrelation analysis (Fig.  4 ) returned a Moran's I of 0.108. These relationships were therefore close to a random spatial pattern; in such cases, it is more appropriate to apply a multivariate statistical algorithm to study the distribution of the variables than a ‘traditional’ geostatistical approach.

Figure 3: Correlation matrix

Figure 4: Moran's I scatterplot

Covariates importance

In the tree models, it is possible to assess the importance of the variables in the predictions (Figs. 5 and 6 ). Variable importance is defined in models such as RFR and Ranger because its evaluation relies on the internal mechanics of the model as it grows the trees that compose the random forest in the regression process. The statistic analysed by the function is IncNodePurity (increase in node purity), which measures how much the purity of a node (assessed by a metric such as the Gini index or the entropy) increases when the node is split on a specific variable. High values indicate a greater influence of the variable in the node-splitting process.

Figure 5: Covariate importance in the RFR model (RFR2 = standard run; RFR*2 = with tenfold CV)

Figure 6: Covariate importance in the Ranger model (Ranger2 = standard run; Ranger*2 = with tenfold CV)

SOM represents the principal source of organic N in the soil, accounting for approximately 97–98%. Vegetation accumulates N in ammoniacal and nitrate forms and returns it to the soil as organic N after death (Sequi et al. 2017 ).

This justifies the high relevance of SOM. It will be important to verify, in a subsequent phase, whether there is a spatial relationship between the distribution of the prediction and the SOM values. The class of variables with the most influence on the prediction of N values was the same in both models (Table  4 ): the most influential predictors belonged to the class of chemical soil characteristics. Topography, especially altitude, was also important.

Residual analysis

Residual analysis was performed on the predictions made in the test phase to assess the performance of the models and their accuracy on unseen data. As noted above, SOM contributes approximately 98% of the organic nitrogen in the soil; most plants take up N directly from the soil as ammonium and nitrate, and after death plant N returns to the soil in organic form (Sequi et al. 2017 ). For this reason, and because of the importance of the variable in the prediction, we chose to relate the residuals of the results to the SOM values.

The greater density of values in the Ranger prediction corresponded to smaller residuals (Fig.  7 ). This shows that the model generated relatively accurate results, with little deviation from the real values. Most of the residuals lie in the negative part of the plots, i.e., the model tends to underestimate the prediction relative to the real value.

Figure 7: Residuals of the Ranger model (first row: without tenfold CV; second row: with tenfold CV)

In the model without CV, there is an inherent tendency to overestimate values in the range 0 to 0.5. As can be seen in Fig.  7 , this tendency is eliminated in the model with tenfold CV: the values aligned in the first row of the graph are superimposed on the zero value of the y-axis in the second row, so the model predicted these specific values without error. The statistics on the residuals of the two applications of the model are shown in Table  5 .

The residuals of the RFR model were very similar to the Ranger results. Again, the model showed the trend observed previously, but more moderately than Ranger. Unlike Ranger, applying the model with CV did not eliminate all trends, leaving an overestimated prediction at a real value of 0. The density of predicted values was concentrated near zero in both RFR applications, with and without CV (Fig.  8 ). The model with CV had a sharper density curve, indicating smaller residuals between the predictions and the real values.

Figure 8: Residuals of the RFR model (first row: without tenfold CV; second row: with tenfold CV)

The statistics in Tables 5 and 6 show the affinity between the tree models in this application: both the mean and the variance of the residuals were similar. Overall, RFR and Ranger were also aligned in the width of their residual distributions, although, by a small margin, the RFR residuals were more precise than the Ranger residuals.

The SVR was affected by a tendency to overestimate the lowest N values, both with and without CV. While in the previous models CV limited this type of problem, here the opposite occurred: the plots (Fig.  9 ) show an increase in overestimated values, although the trend observed in the plot of the relationship with SOM concentration was decreasing.

Figure 9: Residuals of the SVR model (first row: without tenfold CV; second row: with tenfold CV)

The residual statistics in Table  7 indicate that the model had wider bounds than the other models, suggesting larger prediction errors. The density, although more balanced, was less concentrated near zero on the x-axis, indicating greater dispersion of the residuals and therefore a general increase in error.

This analysis shows that CV has a significant positive influence on the performance of the tree algorithms, reducing some systematic negative trends. This did not happen with the SVR algorithm, which showed difficulties related to the overestimation of the lowest N values.

Accuracy assessment

This analysis demonstrates the reliability of the models in regression prediction. Results close to the real values produce a more solid DSM that reflects the landscape characteristics. Part of the potential of these tools lies in providing a measure of the error underlying the production of spatial information.

Table 8 shows the metrics assessing the quality of the predictions in the training phase.

From the values in Table  8 , the best performance in the training phase was obtained by SVR, which had the highest R2 and the lowest error metrics. RFR performed better than Ranger: RFR had an R2 of 0.86 versus 0.85 for Ranger, and a lower RMSE (0.27 versus 0.29). For the MAE, the opposite occurred, with RFR at 0.17 and Ranger at 0.16.

Table 9 shows the metric values that represent the performance quality of the prediction in the test phase.

In the test phase, the situation was reversed: SVR had the lowest performance on the selected metrics, while RFR had the highest, with predictions that best approximated the real values. The values were slightly lower than in the training phase; the highest R2, obtained by RFR, was 0.79. Our results align with findings from similar works. The R2 of the RFR model predictions was higher than that obtained by Maleki et al. ( 2023 ), even though the error metrics were worse in our case. The R2 of RFR and SVR were comparable to those obtained by other researchers (Lee et al. 2020 ; Liang et al. 2018 ), while the RMSE values showed higher precision than those of Liu et al. ( 2023 ) and Prado Osco et al. ( 2019 ). SVR yielded better RMSE and R2 values than those found by Xiaorui et al. ( 2023 ) for the same model, and the MAE values were more moderate than those obtained by Prado Osco et al. ( 2019 ).

The graphs in Fig.  10 show the quality of the predictions for each model. In an optimal state, the predictions (red) would coincide with the real values (black dots). Here, all models had difficulty predicting the highest N values. RFR accurately predicted N values close to 0, while Ranger and SVR could not accurately predict values around 0 g kg −1 of N in the soil; in particular, SVR predicted negative values.

Figure 10: Graphs of the prediction values

Figure  11 compares the real N values with the predictions. In an optimal state, the points would lie on the diagonal, indicating that the prediction matches the real values. A colour scale shows the error of each prediction point: red indicates a high error, orange and yellow a medium error, and green a prediction close to the real value. The points in the RFR graph are more closely aligned along the diagonal, which, compared with the other graphs, shows the higher quality of its prediction.

Figure 11: Graphs of the distance between predicted and actual values

As the previous graphs show, SVR and Ranger tended to overestimate N values close or equal to 0, which did not happen with RFR. Finally, SVR in some cases produced negative predictions where the real value was 0.

Prediction maps

The models were used to produce prediction maps (Fig.  12 ). They showed the distribution of N concentration over the study area and the influence of some critical patterns:

In the western part, covered by woodland (with a predominance of deciduous trees), the N concentration was higher than in the areas occupied by agricultural activity, where long-cycle vegetation is absent. Even though synthetic N is supplied during fertilisation, N is subject to different types of losses (e.g., denitrification and leaching).

The same scenario characterized the arable crops and pastures that occupy the central part, while the opposite was true for the area occupied by shrub and tree vegetation.

The hinterland of the city of Sassari (east-central sector) was one of the areas with the highest predicted N values, which can be explained by the olive groves occupying most of the area along the city limits.

Figure 12: Prediction maps

The presence of a large area cultivated almost exclusively with olive trees ensures, in this condition, an adequate soil N concentration, partly due to the fertiliser applied. Low levels were concentrated along the coast, where urbanisation is highest. Consistent with Amicabile ( 2016 ), all models showed increased N concentrations corresponding to high SOM levels. The predictions showed an accumulation of N along the course of the rivers, due to leaching, which manifests as storage in the lowest parts. This phenomenon was most evident in the SVR prediction map, with high values close to the hydrographic network of the main river (Riu Mannu), localised in the eastern part.

The relationship between N concentrations in the surface horizons clearly shows that, in the soils of the investigated areas, N concentration increased as the ecosystem's conservation status increased. In areas with forest cover (with a prevalence of broad-leaved trees), N concentration is higher than in comparable areas occupied by agricultural activities, due to the lack of long-cycle, high-coverage vegetation in the latter. Even where synthetic N is input through fertilisation in the field, it should be remembered that N in soils is subject to various types of loss, mainly through leaching and denitrification (Amicabile 2016 ). This holds for agricultural areas with arable crops or pastures for sheep breeding, while on soils with tree-type vegetation the opposite phenomenon occurs. Evidence of this is that all three models identified the maximum content in the areas bordering the city of Sassari, attributable to the massive presence of olive groves.

Concerning the differences between the model predictions, the main difference between the maps produced by the tree models and by SVR was the localisation of the highest values: the tree models placed the highest N values at the boundaries of the city of Sassari, whereas SVR predicted higher values along the western coast of the municipality of Sassari. The RFR and Ranger maps also showed higher N values over the municipality of Sorso (northeast of Sassari) than the SVR map. This behaviour could be explained by differences in performance where the density of sampling points is low.

Conclusions

This research evaluated the effectiveness and performance of several ML models using only open environmental databases. The use of open-source data will be pivotal in the future, especially given the large datasets acquired by remote or proximal sensing; however, the choice of the most effective algorithm is of great importance. The results showed that RFR performed strongly. The main outcomes also revealed that, by coupling ML algorithms with large open environmental databases, it was possible to predict N values at a medium scale with reliable performance. More specifically, the applied models showed approximately the same performance, with RFR achieving the highest R2 and the lowest RMSE. The spatial visualisation of the results showed the distribution of N values on a medium-scale map, making it possible to detect potential critical areas that could require specific actions within the environmental policy framework. Our next steps with this research are to improve the models by incorporating additional data sources to refine the spatio-temporal scale, taking into account the quality of the data, assessed through deep exploratory data analysis. Indeed, high spatio-temporal resolution is crucial for implementing effective soil management policies in areas with a high density of human activity.

Data availability

The data used to support this study are available by contacting the corresponding author.

Available on: http://www.sardegnaportalesuolo.it/opendata , compiled by Agris Sardegna.

Available on: https://esdac.jrc.ec.europa.eu/projects/lucas

Available on: https://earthexplorer.usgs.gov/

Awad M, Khanna R (2015) Efficient learning machines. Springer, New York


Aghazadeh M, Orooji A, Kamkar Haghighi M (2019) Developing an intelligent system for prediction of optimal dose of warfarin in Iranian adult patients with artificial heart valve. Front Health Inform 8(1):25. https://doi.org/10.30699/fhi.v8i1.213


Amicabile S (2016) Manuale di Agricoltura (Terza). Ulrico Hoepli

Antognelli S (2018, May 28) Indici di vegetazione NDVI e NDMI: Istruzioni per l’uso. Agricolus. https://www.agricolus.com/indici-vegetazione-ndvi-ndmi-istruzioni-luso/

Arrouays D, Lagacherie P, Hartemink AE (2017) Digital soil mapping across the globe. Geoderma Reg 9:1–4. https://doi.org/10.1016/j.geodrs.2017.03.002

Arru B, Furesi R, Madau FA, Pulina P (2019) Recreational services provision and farm diversification: a technical efficiency analysis on Italian agritourism. Agriculture 9(2):42. https://doi.org/10.3390/agriculture9020042

Berrar D (2019) Cross-validation. In: Encyclopedia of bioinformatics and computational biology. Elsevier, pp 542–545. https://doi.org/10.1016/B978-0-12-809633-8.20349-X

Brungard C, Nauman T, Duniway M, Veblen K, Nehring K, White D, Salley S, Anchang J (2021) Regional ensemble modeling reduces uncertainty for digital soil mapping. Geoderma 397:114998. https://doi.org/10.1016/j.geoderma.2021.114998

Carmignani L, Oggiano G, Funedda A, Conti P, Pasci S (2015) The geological map of Sardinia (Italy) at 1:250,000 scale. J Maps. https://doi.org/10.1080/17445647.2015.1084544

Chan JY-L, Leow SMH, Bea KT, Cheng WK, Phoong SW, Hong Z-W, Chen Y-L (2022) Mitigating the multicollinearity problem and its machine learning approach: a review. Mathematics 10(8):1283. https://doi.org/10.3390/math10081283

Chen B, Liu E, Tian Q, Yan C, Zhang Y (2014) Soil nitrogen dynamics and crop residues. A review. Agron Sustain Dev 34(2):429–442. https://doi.org/10.1007/s13593-014-0207-8


Chen S, Arrouays D, Leatitia Mulder V, Poggio L, Minasny B, Roudier P, Libohova Z, Lagacherie P, Shi Z, Hannam J, Meersmans J, Richer-de-Forges AC, Walter C (2022) Digital mapping of GlobalSoilMap soil properties at a broad scale: a review. Geoderma 409:115567. https://doi.org/10.1016/j.geoderma.2021.115567

Chlingaryan A, Sukkarieh S, Whelan B (2018) Machine learning approaches for crop yield prediction and nitrogen status estimation in precision agriculture: a review. Comput Electron Agric 151:61–69. https://doi.org/10.1016/j.compag.2018.05.012

Conrad O, Bechtel B, Bock M, Dietrich H, Fischer E, Gerlitz L, Wehberg J, Wichmann V, Böhner J (2015) System for automated geoscientific analyses (SAGA) v. 2.1.4. Geosci Model Dev 8(7):1991–2007. https://doi.org/10.5194/gmd-8-1991-2015

Dai L, Ge J, Wang L, Zhang Q, Liang T, Bolan N, Lischeid G, Rinklebe J (2022) Influence of soil properties, topography, and land cover on soil organic carbon and total nitrogen concentration: a case study in Qinghai-Tibet plateau based on random forest regression and structural equation modeling. Sci Total Environ 821:153440. https://doi.org/10.1016/j.scitotenv.2022.153440

Daoud JI (2017) Multicollinearity and regression analysis. J Phys Conf Ser 949:012009. https://doi.org/10.1088/1742-6596/949/1/012009

Das PP, Singh KR, Nagpure G, Mansoori A, Singh RP, Ghazi IA, Kumar A, Singh J (2022) Plant-soil-microbes: a tripartite interaction for nutrient acquisition and better plant growth for sustainable agricultural practices. Environ Res 214:113821. https://doi.org/10.1016/j.envres.2022.113821

Dharumarajan S (2019) The need for digital soil mapping in India. Geoderma Reg 16:e00204

Dimkpa CO, Fugice J, Singh U, Lewis TD (2020) Development of fertilizers for enhanced nitrogen use efficiency—trends and perspectives. Sci Total Environ 731:139113. https://doi.org/10.1016/j.scitotenv.2020.139113

Elia M, D’Este M, Ascoli D, Giannico V, Spano G, Ganga A, Colangelo G, Lafortezza R, Sanesi G (2020) Estimating the probability of wildfire occurrence in Mediterranean landscapes using artificial neural networks. Environ Impact Assess Rev 85:106474. https://doi.org/10.1016/j.eiar.2020.106474

Ferreira CSS, Seifollahi-Aghmiuni S, Destouni G, Ghajarnia N, Kalantari Z (2022) Soil degradation in the European Mediterranean region: processes, status and consequences. Sci Total Environ 805:150106. https://doi.org/10.1016/j.scitotenv.2021.150106

Flynn KC, Baath G, Lee TO, Gowda P, Northup B (2023) Hyperspectral reflectance and machine learning to monitor legume biomass and nitrogen accumulation. Comput Electron Agric 211:107991. https://doi.org/10.1016/j.compag.2023.107991

Forkuor G, Hounkpatin OKL, Welp G, Thiel M (2017) High resolution mapping of soil properties using remote sensing variables in South-Western Burkina Faso: a comparison of machine learning and multiple linear regression models. PLoS ONE 12(1):e0170478. https://doi.org/10.1371/journal.pone.0170478

Gorelick N, Hancher M, Dixon M, Ilyushchenko S, Thau D, Moore R (2017) Google earth engine: planetary-scale geospatial analysis for everyone. Remote Sens Environ 202:18–27. https://doi.org/10.1016/j.rse.2017.06.031

Hengl T, Leenaars JGB, Shepherd KD, Walsh MG, Heuvelink GBM, Mamo T, Tilahun H, Berkhout E, Cooper M, Fegraus E, Wheeler I, Kwabena NA (2017) Soil nutrient maps of Sub-Saharan Africa: assessment of soil nutrient content at 250 m spatial resolution using machine learning. Nutr Cycl Agroecosyst 109(1):77–102. https://doi.org/10.1007/s10705-017-9870-x

Heung B, Ho HC, Zhang J, Knudby A, Bulmer CE, Schmidt MG (2016) An overview and comparison of machine-learning techniques for classification purposes in digital soil mapping. Geoderma 265:62–77. https://doi.org/10.1016/j.geoderma.2015.11.014

Hoffimann J, Zortea M, De Carvalho B, Zadrozny B (2021) Geostatistical learning: challenges and opportunities. Front Appl Math Stat 7:689393. https://doi.org/10.3389/fams.2021.689393

Högberg P, Näsholm T, Franklin O, Högberg MN (2017) Tamm review: on the nature of the nitrogen limitation to plant growth in Fennoscandian boreal forests. For Ecol Manage 403:161–185. https://doi.org/10.1016/j.foreco.2017.04.045

Hounkpatin KOL, Bossa AY, Yira Y, Igue MA, Sinsin BA (2022) Assessment of the soil fertility status in Benin (West Africa)—digital soil mapping using machine learning. Geoderma Reg 28:e00444. https://doi.org/10.1016/j.geodrs.2021.e00444

Jain P, Coogan SCP, Subramanian SG, Crowley M, Taylor S, Flannigan MD (2020) A review of machine learning applications in wildfire science and management. Environ Rev 28(4):478–505. https://doi.org/10.1139/er-2020-0019

Keskin H, Grunwald S (2018) Regression kriging as a workhorse in the digital soil mapper’s toolbox. Geoderma 326:22–41. https://doi.org/10.1016/j.geoderma.2018.04.004

Soil Survey Staff (2022) Keys to Soil Taxonomy, 13th edn. USDA Natural Resources Conservation Service

Khaledian Y, Miller BA (2020) Selecting appropriate machine learning methods for digital soil mapping. Appl Math Model 81:401–418. https://doi.org/10.1016/j.apm.2019.12.016

Lee H, Wang J, Leblon B (2020) Using linear regression, random forests, and support vector machine with unmanned aerial vehicle multispectral images to predict canopy nitrogen weight in corn. Remote Sensing 12(13):2071. https://doi.org/10.3390/rs12132071

Li C, Li X, Meng X, Xiao Z, Wu X, Wang X, Ren L, Li Y, Zhao C, Yang C (2023a) Hyperspectral estimation of nitrogen content in wheat based on fractional difference and continuous wavelet transform. Agriculture 13(5):1017. https://doi.org/10.3390/agriculture13051017

Li J, Zhang T, Shao Y, Ju Z (2023b) Comparing machine learning algorithms for soil salinity mapping using topographic factors and sentinel-1/2 data: a case study in the yellow river delta of China. Remote Sensing 15(9):2332. https://doi.org/10.3390/rs15092332

Li R, Xu J, Luo J, Yang P, Hu Y, Ning W (2022) Spatial distribution characteristics, influencing factors, and source distribution of soil cadmium in Shantou City, Guangdong Province. Ecotoxicol Environ Saf 244:114064. https://doi.org/10.1016/j.ecoenv.2022.114064

Li X, McCarty GW, Du L, Lee S (2020) Use of topographic models for mapping soil properties and processes. Soil Systems 4(2):32. https://doi.org/10.3390/soilsystems4020032

Li Z, Wang J, Tang H, Huang C, Yang F, Chen B, Wang X, Xin X, Ge Y (2016) Predicting grassland leaf area index in the meadow steppes of northern China: a comparative study of regression approaches and hybrid geostatistical methods. Remote Sensing 8(8):632. https://doi.org/10.3390/rs8080632

Liang L, Di L, Huang T, Wang J, Lin L, Wang L, Yang M (2018) Estimation of leaf nitrogen content in wheat using new hyperspectral indices and a random forest regression algorithm. Remote Sensing 10(12):1940. https://doi.org/10.3390/rs10121940

Lindner T, Puck J, Verbeke A (2022) Beyond addressing multicollinearity: robust quantitative analysis and machine learning in international business research. J Int Bus Stud 53(7):1307–1314. https://doi.org/10.1057/s41267-022-00549-z

Liu F, Wu H, Zhao Y, Li D, Yang J-L, Song X, Shi Z, Zhu A-X, Zhang G-L (2022) Mapping high resolution national soil information grids of China. Sci Bull 67(3):328–340. https://doi.org/10.1016/j.scib.2021.10.013

Liu J, Yang K, Tariq A, Lu L, Soufan W, El Sabagh A (2023) Interaction of climate, topography and soil properties with cropland and cropping pattern using remote sensing data and machine learning methods. Egypt J Remote Sens Space Sci 26(3):415–426. https://doi.org/10.1016/j.ejrs.2023.05.005

Ma Z, Mei G, Piccialli F (2021) Machine learning for landslides prevention: a survey. Neural Comput Appl 33(17):10881–10907. https://doi.org/10.1007/s00521-020-05529-8

Maleki S, Karimi A, Mousavi A, Kerry R, Taghizadeh-Mehrjardi R (2023) Delineation of soil management zone maps at the regional scale using machine learning. Agronomy 13(2):445. https://doi.org/10.3390/agronomy13020445

Mashaba-Munghemezulu Z, Chirima GJ, Munghemezulu C (2021) Modeling the spatial distribution of soil nitrogen content at smallholder maize farms using machine learning regression and sentinel-2 data. Sustainability 13(21):11591. https://doi.org/10.3390/su132111591

Moran PAP (1948) The interpretation of statistical maps. J Roy Stat Soc Ser B 10(2):243–251. https://doi.org/10.1111/j.2517-6161.1948.tb00012.x

Nguyen TT, Vu TD (2019) Identification of multivariate geochemical anomalies using spatial autocorrelation analysis and robust statistics. Ore Geol Rev 111:102985. https://doi.org/10.1016/j.oregeorev.2019.102985

Nolan BT, Green CT, Juckem PF, Liao L, Reddy JE (2018) Metamodeling and mapping of nitrate flux in the unsaturated zone and groundwater, Wisconsin, USA. J Hydrol 559:428–441. https://doi.org/10.1016/j.jhydrol.2018.02.029

Nussbaum M, Spiess K, Baltensweiler A, Grob U, Keller A, Greiner L, Schaepman ME, Papritz A (2018) Evaluation of digital soil mapping approaches with large sets of environmental covariates. Soil 4(1):1–22. https://doi.org/10.5194/soil-4-1-2018

Orgiazzi A, Ballabio C, Panagos P, Jones A, Fernández-Ugalde O (2018) LUCAS soil, the largest expandable soil dataset for Europe: a review. Eur J Soil Sci 69(1):140–153. https://doi.org/10.1111/ejss.12499

Padarian J, Minasny B, McBratney AB (2019) Using deep learning for digital soil mapping. Soil 5(1):79–89. https://doi.org/10.5194/soil-5-79-2019

Panagos P, Ballabio C, Borrelli P, Meusburger K, Klik A, Rousseva S, Tadić MP, Michaelides S, Hrabalíková M, Olsen P, Aalto J, Lakatos M, Rymszewicz A, Dumitrescu A, Beguería S, Alewell C (2015a) Rainfall erosivity in Europe. Sci Total Environ 511:801–814. https://doi.org/10.1016/j.scitotenv.2015.01.008

Panagos P, Borrelli P, Meusburger K (2015b) A new European slope length and steepness factor (LS-Factor) for modeling soil erosion by water. Geosciences 5(2):117–126. https://doi.org/10.3390/geosciences5020117

Panagos P, Borrelli P, Meusburger K, Alewell C, Lugato E, Montanarella L (2015c) Estimating the soil erosion cover-management factor at the European scale. Land Use Policy 48:38–50. https://doi.org/10.1016/j.landusepol.2015.05.021

Panagos P, Borrelli P, Meusburger K, van der Zanden EH, Poesen J, Alewell C (2015d) Modelling the effect of support practices (P-factor) on the reduction of soil erosion by water at European scale. Environ Sci Policy 51:23–34. https://doi.org/10.1016/j.envsci.2015.03.012

Panagos P, Meusburger K, Ballabio C, Borrelli P, Alewell C (2014) Soil erodibility in Europe: a high-resolution dataset based on LUCAS. Sci Total Environ 479–480:189–200. https://doi.org/10.1016/j.scitotenv.2014.02.010

Piunti V (2019) ALGORITMI DI MACHINE LEARNING SUPERVISIONATO: POSSIBILI APPLICAZIONI NEL SETTORE ASSICURATIVOSANITARIO [UNIVERSITÀ POLITECNICA DELLE MARCHE FACOLTÀ DI ECONOMIA “GIORGIO FUÀ”]. https://tesi.univpm.it/bitstream/20.500.12075/7161/2/TESI%20VALENTINO%20PIUNTI.pdf

Poppiel RR, Demattê JAM, Rosin NA, Campos LR, Tayebi M, Bonfatti BR, Ayoubi S, Tajik S, Afshar FA, Jafari A, Hamzehpour N, Taghizadeh-Mehrjardi R, Ostovari Y, Asgari N, Naimi S, Nabiollahi K, Fathizad H, Zeraatpisheh M, Javaheri F, Rahmati M (2021) High resolution middle eastern soil attributes mapping via open data and cloud computing. Geoderma 385:114890. https://doi.org/10.1016/j.geoderma.2020.114890

Prado Osco L, Marques Ramos AP, Roberto Pereira D, Akemi Saito Moriya É, Nobuhiro Imai N, Takashi Matsubara E, Estrabis N, De Souza M, Marcato Junior J, Gonçalves WN, Li J, Liesenberg V, Eduardo Creste J (2019) Predicting canopy nitrogen content in citrus-trees using random forest algorithm associated to spectral vegetation indices from UAV-imagery. Remote Sens 11(24):2925. https://doi.org/10.3390/rs11242925

QGIS Development Team (2023) QGIS [Software]. Open Source Geospatial Foundation Project. http://qgis.osgeo.org

Radočaj D, Gašparović M, Jurišić M (2024) Open remote sensing data in digital soil organic carbon mapping: a review. Agriculture 14(7):1005. https://doi.org/10.3390/agriculture14071005

Radočaj D, Jurišić M, Antonić O, Šiljeg A, Cukrov N, Rapčan I, Plaščak I, Gašparović M (2022a) A multiscale cost-benefit analysis of digital soil mapping methods for sustainable land management. Sustainability 14(19):12170. https://doi.org/10.3390/su141912170

Radočaj D, Jurišić M, Antonić O, Šiljeg A, Cukrov N, Rapčan I, Plaščak I, Gašparović M (2022b) A multiscale cost-benefit analysis of digital soil mapping methods for sustainable land management. Sustainability 14(19):12170. https://doi.org/10.3390/su141912170

Rahman MM, Zhang X, Ahmed I, Iqbal Z, Zeraatpisheh M, Kanzaki M, Xu M (2020) Remote sensing-based mapping of senescent leaf C: N ratio in the sundarbans reserved forest using machine learning techniques. Remote Sens 12(9):1375. https://doi.org/10.3390/rs12091375

Ramedani Z, Omid M, Keyhani A, Shamshirband S, Khoshnevisan B (2014) Potential of radial basis function based support vector regression for global solar radiation prediction. Renew Sustain Energy Rev 39:1005–1011. https://doi.org/10.1016/j.rser.2014.07.108

Regione Autonoma della Sardegna (2023) Sardegna Geoportale [Webgis]. SardegnaMappe. https://www.sardegnageoportale.it/webgis2/sardegnamappe/?map=download_raster

Ridwan I, Kadir S, Nurlina N (2024) Wetland degradation monitoring using multi-temporal remote sensing data and watershed land degradation index. Global J Environ Sci Manag 10(1):83–96. https://doi.org/10.22034/gjesm.2024.01.07

RStudio Team (2011) RStudio: Integrated Development for R [Software]. RStudio Team (2020). http://www.rstudio.com/

Santra P, Kumar M, Panwar N (2017) Digital soil mapping of sand content in arid western India through geostatistical approaches. Geoderma Reg 9:56–72. https://doi.org/10.1016/j.geodrs.2017.03.003

Sarica A, Cerasa A, Quattrone A (2017) Random forest algorithm for the classification of neuroimaging data in alzheimer’s disease: a systematic review. Front Aging Neurosci 9:329. https://doi.org/10.3389/fnagi.2017.00329

Searle R, McBratney A, Grundy M, Kidd D, Malone B, Arrouays D, Stockman U, Zund P, Wilson P, Wilford J, Van Gool D, Triantafilis J, Thomas M, Stower L, Slater B, Robinson N, Ringrose-Voase A, Padarian J, Payne J, Andrews K (2021) Digital soil mapping and assessment for Australia and beyond: a propitious future. Geoderma Reg 24:e00359. https://doi.org/10.1016/j.geodrs.2021.e00359

Sequi P, Ciavatta C, Milano T (2017) Fondamenti della chimica del Suolo. Pàtron Editore

Shrestha N (2020) Detecting Multicollinearity in regression analysis. Am J Appl Math Stat 8(2):39–42. https://doi.org/10.12691/ajams-8-2-1

Singh B (2018) Are nitrogen fertilizers deleterious to soil health? Agronomy 8(4):48. https://doi.org/10.3390/agronomy8040048

Söderström M, Sohlenius G, Rodhe L, Piikki K (2016) Adaptation of regional digital soil mapping for precision agriculture. Precision Agric 17(5):588–607. https://doi.org/10.1007/s11119-016-9439-8

Taghizadeh-Mehrjardi R, Hamzehpour N, Hassanzadeh M, Heung B, Ghebleh Goydaragh M, Schmidt K, Scholten T (2021) Enhancing the accuracy of machine learning models using the super learner technique in digital soil mapping. Geoderma 399:115108. https://doi.org/10.1016/j.geoderma.2021.115108

Tybl A (2016) An overview of spatial econometrics. SSRN Electron J. https://doi.org/10.2139/ssrn.2778679

Uddameri V, Silva A, Singaraju S, Mohammadi G, Hernandez E (2020) Tree-based modeling methods to predict nitrate exceedances in the Ogallala aquifer in Texas. Water 12(4):1023. https://doi.org/10.3390/w12041023

van der Westhuizen S, Heuvelink GBM, Hofmeyr DP (2023) Multivariate random forest for digital soil mapping. Geoderma 431:116365. https://doi.org/10.1016/j.geoderma.2023.116365

Van Der Westhuizen S, Heuvelink GBM, Hofmeyr DP, Poggio L (2022) Measurement error-filtered machine learning in digital soil mapping. Spat Stat 47:100572. https://doi.org/10.1016/j.spasta.2021.100572

Wadoux AMJ-C, Minasny B, McBratney AB (2020) Machine learning for digital soil mapping: applications, challenges and suggested solutions. Earth Sci Rev 210:103359. https://doi.org/10.1016/j.earscirev.2020.103359

Wang L, Chen S, Li D, Wang C, Jiang H, Zheng Q, Peng Z (2021) Estimation of paddy rice nitrogen content and accumulation both at leaf and plant levels from UAV hyperspectral imagery. Remote Sens 13(15):2956. https://doi.org/10.3390/rs13152956

Wang N, Luo Y, Liu Z, Sun Y (2022) Spatial distribution characteristics and evaluation of soil pollution in coal mine areas in Loess Plateau of northern Shaanxi. Sci Rep 12(1):16440. https://doi.org/10.1038/s41598-022-20865-6

Wang X, Fan J, Xing Y, Xu G, Wang H, Deng J, Wang Y, Zhang F, Li P, Li Z (2019) The effects of mulch and nitrogen fertilizer on the soil environment of crop plants. Adv Agron 153:121–173. https://doi.org/10.1016/bs.agron.2018.08.003

Weintraub SR, Brooks PD, Bowen GJ (2017) Interactive effects of vegetation type and topographic position on nitrogen availability and loss in a temperate montane ecosystem. Ecosystems 20(6):1073–1088. https://doi.org/10.1007/s10021-016-0094-8

Worthy B (2015) The impact of open data in the UK: complex, unpredictable, and political. Public Adm 93(3):788–805. https://doi.org/10.1111/padm.12166

Wright MN, Ziegler A (2017) Ranger: a fast implementation of random forests for high dimensional data in C++ and R. J Stat Softw 77:1–17. https://doi.org/10.18637/jss.v077.i01

Xiaorui L, Jiamin Y, Longji Y (2023) Predicting the high heating value and nitrogen content of torrefied biomass using a support vector machine optimized by a sparrow search algorithm. RSC Adv 13(2):802–807. https://doi.org/10.1039/D2RA06869A

Xu R, Nettleton D, Nordman DJ (2016) Case-specific random forests. J Comput Graph Stat 25(1):49–65. https://doi.org/10.1080/10618600.2014.983641

Xu S, Wang M, Shi X, Yu Q, Zhang Z (2021) Integrating hyperspectral imaging with machine learning techniques for the high-resolution mapping of soil nitrogen fractions in soil profiles. Sci Total Environ 754:142135. https://doi.org/10.1016/j.scitotenv.2020.142135

Zhang G, Liu F, Song X (2017) Recent progress and future prospect of digital soil mapping: a review. J Integr Agric 16(12):2871–2885. https://doi.org/10.1016/S2095-3119(17)61762-3

Zhang P, Yin Z-Y, Jin Y-F (2021) State-of-the-art review of machine learning applications in constitutive modeling of soils. Archiv Comput Methods Eng 28(5):3661–3686. https://doi.org/10.1007/s11831-020-09524-z

Zhang Y, Ji W, Saurette DD, Easher TH, Li H, Shi Z, Adamchuk VI, Biswas A (2020) Three-dimensional digital soil mapping of multiple soil properties at a field-scale using regression kriging. Geoderma 366:114253. https://doi.org/10.1016/j.geoderma.2020.114253

Zhang Y, Sui B, Shen H, Ouyang L (2019) Mapping stocks of soil total nitrogen using remote sensing data: a comparison of random forest models with different predictors. Comput Electron Agric 160:23–30. https://doi.org/10.1016/j.compag.2019.03.015

Zhou J, Xu Y, Gu X, Chen T, Sun Q, Zhang S, Pan Y (2023) High-precision mapping of soil organic matter based on UAV imagery using machine learning algorithms. Drones 7(5):290. https://doi.org/10.3390/drones7050290

Download references
Funding

Open access funding provided by Università degli Studi di Sassari within the CRUI-CARE Agreement. Partial financial support was received from the University of Sassari (FAR 2022, 2023, 2024).

The authors have no relevant financial or non-financial interests to disclose.

Author information

Authors and Affiliations

Dipartimento Di Architettura, Design E Urbanistica, Università Di Sassari, Via Piandanna 4, 07100, Sassari, Italy

Alessandro Auzzas, Gian Franco Capra & Antonio Ganga

Department of Biology and Chemistry, California State University, Monterey Bay, Seaside, CA, 93955, USA

Arun Dilipkumar Jani


Corresponding author

Correspondence to Antonio Ganga .

Ethics declarations

Conflict of interest

The authors declare no competing interests.

Compliance with ethical standards

The authors complied with all applicable ethical standards.

Ethical approval

This research meets all applicable standards relating to ethics and research integrity.

Informed consent

All authors provided informed consent.

Additional information

Publisher's note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary Information

Below is the link to the electronic supplementary material.

Supplementary file 1 (DOCX 218 KB)

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/ .


About this article

Auzzas, A., Capra, G.F., Jani, A.D. et al. An improved digital soil mapping approach to predict total N by combining machine learning algorithms and open environmental data. Model. Earth Syst. Environ. (2024). https://doi.org/10.1007/s40808-024-02127-8


Received: 16 May 2024

Accepted: 02 August 2024

Published: 20 August 2024

DOI: https://doi.org/10.1007/s40808-024-02127-8


Keywords

  • Machine learning
  • Random Forest Regression
  • Support Vector Regression
  • Digital soil mapping
