Given a pair of starting and ending proteins, our methodology returns candidate pathway segments between these two proteins with possible missing links (recovered false negatives). In our study, S. cerevisiae (yeast) data is used to demonstrate the effectiveness of our method.
In this study, we also attempt to recover false negative PPIs. To identify these missing links, we utilize protein families. First, we identify the families of the proteins. In short, a protein family is a group of evolutionarily related proteins. Therefore, by grouping the proteins, we capture their possible interaction traits and can infer highly probable missing interactions. We extend the PPIN with these inferred interaction edges and carry out our searches on this extended network as well as on the actual PPIN.
We first checked every interacting protein and identified its families. The protein family data was downloaded from Pfam. Pfam is a database of multiple alignments of protein domains or conserved protein regions. These represent evolutionarily conserved structures that have implications for the protein's function. The curated part of Pfam contains over 8957 protein families. Next, we add inferred interaction edges to our network. We link two proteins in our PPIN with an inferred interaction edge if (1) they do not interact with each other in the PPIN, and (2) there exists at least one (real) interaction between the families of these two proteins. For instance, assume that the genome of our model organism has proteins p1, p2, q1 and q2. Let us also assume that p1 and p2 belong to family A, q1 and q2 belong to family B, and proteins p1 and q1 interact but p2 and q2 do not interact with each other. We link proteins p2 and q2 with an inferred link, since members of A and B have an interaction connecting A and B.
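The inference rule above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `infer_edges`, the input layout (a list of interacting pairs and a dictionary mapping each protein to a set of family identifiers) and the toy data are all hypothetical.

```python
from itertools import combinations

def infer_edges(interactions, families):
    """Return inferred edges as a set of frozensets {p, q}.

    interactions: iterable of (protein, protein) pairs observed in the PPIN.
    families: dict mapping protein -> set of family ids (assumed layout).
    """
    interactions = list(interactions)
    real = {frozenset(edge) for edge in interactions}

    # Family pairs connected by at least one real interaction.
    linked_families = set()
    for a, b in interactions:
        for fa in families.get(a, ()):
            for fb in families.get(b, ()):
                linked_families.add(frozenset((fa, fb)))

    inferred = set()
    for p, q in combinations(sorted(families), 2):
        if frozenset((p, q)) in real:
            continue  # condition (1): the pair must not already interact
        # condition (2): some family of p is linked to some family of q
        if any(frozenset((fp, fq)) in linked_families
               for fp in families[p] for fq in families[q]):
            inferred.add(frozenset((p, q)))
    return inferred

# Toy data mirroring the example in the text (hypothetical).
families = {"p1": {"A"}, "p2": {"A"}, "q1": {"B"}, "q2": {"B"}}
inferred = infer_edges([("p1", "q1")], families)
print(frozenset(("p2", "q2")) in inferred)
```

Applied literally, the two conditions also link p1 with q2 and p2 with q1 in this toy network, since both pairs span families A and B without a real interaction.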
Next, we search for a pathway segment using this extended PPIN. We follow the same search methodology described above. Once again, every possible path with a length between l_lower and l_upper is checked. However, this time we allow at most one inferred edge on each simple path. Since including the inferred edges increases the number of edges, the searches took up to 20 times longer, depending on the degree of the proteins in consideration. Nevertheless, possible missing links can be recovered. Here, to make sure the secondary edges are correctly identified, we have only considered secondary edges with at least five association rules (this is close to the average number of association rules on interacting pathway proteins).
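The constrained search described above can be illustrated with a depth-first enumeration of simple paths. The sketch below assumes that path length is counted in edges and that real and inferred edges are stored in separate adjacency dictionaries; the function name `find_paths` and its parameters are hypothetical, and the association-rule filter on secondary edges is not modeled.

```python
def find_paths(real_adj, inferred_adj, start, end,
               l_lower, l_upper, max_inferred=1):
    """Enumerate simple paths from start to end.

    A path is kept if its edge count lies in [l_lower, l_upper] and it
    uses at most max_inferred inferred edges. Returns a list of
    (path, number_of_inferred_edges) tuples.
    """
    results = []

    def dfs(node, path, used_inferred):
        length = len(path) - 1  # number of edges walked so far
        if node == end:
            if length >= l_lower:
                results.append((list(path), used_inferred))
            return  # a simple path ends once it reaches the target
        if length >= l_upper:
            return  # no budget left to extend the path
        for nxt in sorted(real_adj.get(node, ())):
            if nxt not in path:
                path.append(nxt)
                dfs(nxt, path, used_inferred)
                path.pop()
        if used_inferred < max_inferred:
            for nxt in sorted(inferred_adj.get(node, ())):
                if nxt not in path:
                    path.append(nxt)
                    dfs(nxt, path, used_inferred + 1)
                    path.pop()

    dfs(start, [start], 0)
    return results

# Toy network (hypothetical): C-D exists only as an inferred edge.
real_adj = {"A": {"B"}, "B": {"A", "C"}, "C": {"B"}, "D": set()}
inferred_adj = {"C": {"D"}, "D": {"C"}}
print(find_paths(real_adj, inferred_adj, "A", "D", 1, 4))
```

Setting `max_inferred=0` reduces this to a search on the actual PPIN only. The extra branching through inferred edges is what makes the extended search slower.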
Given that there are missing links, PathFinder recovered all possible links that are available to us on this pathway segment. When compared with previous studies, for this particular example, PathFinder has a 78% recall and 40% precision in recovering this pathway segment (Figure 1). The color-coding algorithm output for this pathway had a 50% recall and 32% precision, whereas the NetSearch program prediction for the pheromone pathway had a 44% recall and 24% precision. Moreover, the resulting pathway segments of PathFinder consist only of proteins from the original pathway and two additional proteins, Kss1 (a MAP kinase) and Sst2.
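For reference, recall and precision can be computed from a predicted set and a known set as sketched below. The text does not state whether the reported figures are measured over proteins or over interactions, so the helper works on any set of items; the function name and the toy sets are hypothetical and do not reproduce the reported values.

```python
def precision_recall(predicted, known):
    """Return (precision, recall) for two collections of items."""
    predicted, known = set(predicted), set(known)
    true_positives = len(predicted & known)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(known) if known else 0.0
    return precision, recall

# Toy sets (hypothetical): two of four predictions are correct,
# and two of three known items are found.
precision, recall = precision_recall({"a", "b", "c", "d"}, {"a", "b", "e"})
print(precision, recall)
```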
In this second example, the filamentation MAPK pathway is searched. The filamentation MAPK pathway (Figure 5A) is activated by glucose or nitrogen starvation and results in filamentous growth. The Sho1-Tec1 protein pair is picked as the starting and ending protein pair. However, this pathway has a missing interaction in the yeast PPIN. Previously, such missing links were noted to prevent attempts to recover signaling pathway segments. After searching for the pathway with our methodology (without inferred links), we acquired the pathway segment shown in Figure 5B. When the results are compared, all known interactions among the interacting proteins, except those involving Cdc42, were recovered. Additional proteins from the pheromone response and high osmolarity glycerol (HOG) MAPK pathways were also recovered. This is most likely due to proteins shared among these pathways.
The absence of the pathway protein Cdc42 in Figure 5B is due to a missing interaction in the PPIN. To recover this false negative interaction edge, we incorporated inferred links into our network (see Recovering false negative interactions) and acquired the pathway segment in Figure 6B.
Here we have shown not only that our methodology can recover missing links on this pathway with additional empirical information, but also how functional properties are utilized to bring pathway segments together. As noted earlier, signaling pathways are not only individual paths but also interconnected functional chains.
NLP is a technique grounded in ML that enables devices to analyze, interpret, and even generate text. NLP and big data analytics tackle huge amounts of text data and can derive value from such a dataset in real time. Some common NLP methods include lexical acquisition (i.e., obtaining information about the lexical units of a language), word sense disambiguation (i.e., determining which sense of a word is used in a sentence when the word has multiple meanings), and part-of-speech (POS) tagging (i.e., determining the function of words by labeling them with categories such as verb, noun, etc.). Several NLP-based techniques have been applied to text mining, including information extraction, topic modeling, text summarization, classification, clustering, question answering, and opinion mining. For example, financial and fraud investigations may involve finding evidence of a crime in massive datasets. NLP techniques (particularly named entity extraction and information retrieval) can help manage and sift through huge amounts of textual information, such as criminal names and bank records, to support fraud investigations. Moreover, NLP techniques can help to create new traceability links and recover traceability links (i.e., missing or broken links at run-time) by finding semantic similarity among available textual artifacts. Furthermore, NLP and big data can be used to analyze news articles and predict rises and falls in the composite stock price index.
In online social networks (OSNs), nodes are organized into communities, where a community represents a group of nodes having similar characteristics, such as similar interests, opinions, or beliefs [1, 2]. The links between nodes belonging to the same community are referred to as intra-community links, and the links between nodes belonging to different communities are referred to as inter-community links. In social networks, intra-community links are driven by the effect of homophily, as similar nodes prefer to connect with each other. The formation of inter-community links is still not well explored in the literature; however, it can be explained by different complex phenomena, such as triadic closure and weak ties. In real-world networks, it is observed that the number of intra-community links is greater than the number of inter-community links. The evolution of social networks is regulated by the formation of new links in the network.
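The distinction between the two link types can be made concrete with a small helper. This is a minimal sketch assuming each node belongs to exactly one community; the function name `split_links` and the toy network are hypothetical.

```python
def split_links(edges, community):
    """Partition edges into intra-community and inter-community links.

    edges: iterable of (u, v) pairs.
    community: dict mapping node -> community label (one label per node).
    """
    intra, inter = [], []
    for u, v in edges:
        if community[u] == community[v]:
            intra.append((u, v))
        else:
            inter.append((u, v))
    return intra, inter

# Toy network (hypothetical): two communities joined by a single link.
community = {1: "x", 2: "x", 3: "x", 4: "y", 5: "y"}
edges = [(1, 2), (2, 3), (1, 3), (4, 5), (3, 4)]
intra, inter = split_links(edges, community)
print(len(intra), len(inter))
```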
Link prediction is a well-known problem in network science and has been applied to predict missing links in different types of networks, such as friendship networks, collaboration networks, and chemical networks. Initially, researchers proposed heuristic methods that only considered the neighborhood information of the nodes for link prediction and did not consider the network topology. These heuristic methods were later extended to also consider network structure properties, such as community structure, when predicting links [15, 16, 20, 21]. However, most of these methods improved the overall accuracy of link prediction by improving the accuracy of intra-community link prediction. The main benefit of heuristic methods is that they do not need any training and are comparatively fast.
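As an illustration of a neighborhood-based heuristic, the sketch below scores every non-adjacent node pair by its number of common neighbors. Common neighbors is a standard example of this class and is used here as an assumption; the text does not say which heuristics the cited works use. The function names and the toy graph are hypothetical.

```python
from itertools import combinations

def common_neighbor_scores(adj):
    """Score each non-adjacent pair by its number of common neighbors.

    adj: dict mapping node -> set of neighboring nodes (undirected graph).
    """
    scores = {}
    for u, v in combinations(sorted(adj), 2):
        if v in adj[u]:
            continue  # already linked, nothing to predict
        scores[(u, v)] = len(adj[u] & adj[v])
    return scores

def top_k(scores, k):
    """Return the k highest-scoring pairs, ties broken by node order."""
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))[:k]

# Toy graph (hypothetical).
adj = {1: {2, 3}, 2: {1, 3, 4}, 3: {1, 2, 4}, 4: {2, 3, 5}, 5: {4}}
print(top_k(common_neighbor_scores(adj), 1))
```

No training step is involved: the score is read directly from the adjacency structure, which is why such heuristics are fast.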
For visualization, we show the NodeSim embedding of the Zachary Karate Network in 2-dimensional and 3-dimensional space. The network and its embeddings are shown in Fig. 11, where nodes having the same color belong to one community. The embedding shows that nodes belonging to different communities are well separated, while more similar nodes are embedded closer together. For example, node 12 is more likely to form inter-community links with nodes 4, 5, 6, and 10; as observed, they are embedded closer together but are still well separated. The embedding of the nodes improves with higher dimension, consistent with our observation in Sect. 4.4 that accuracy increases with a higher dimension. We have also shown embeddings for the Dutch School Friendship Network and the Illinois Highschool Friendship Network in Appendix B.