Building smarter drug discovery with graph databases

The key to drug discovery success lies in a process where researchers can easily connect the dots between diseases, drugs, and their effects. However, this is a monumental task. It requires managing and analyzing vast amounts of complex data from diverse sources, with data models that are constantly evolving. Traditional information management systems are no longer sufficient for this purpose and often fail to give researchers the actionable insights they need. Closing that gap is the promise of combining ontologies with graph databases in the biopharmaceutical world.

What are ontologies and why should you care?

Ontologies help organize information in a way that makes sense and is useful. In biopharmaceuticals, they are the secret sauce for making sense of the complex world of drugs, diseases, and biological processes. They provide a universal language in which everyone agrees on terms, making communication clearer. Each ontology zooms in on a specific area (like chemistry or genetics), providing a detailed description of every concept, property, and relationship in your data.

Ontologies have a few key parts:

Classes: These are like categories. In a drug ontology, you might have classes like "Drug" or "Disease."
Properties: These describe the classes. A drug might have properties like "dosage" or "side effects."
Relationships: These show how classes are connected. For example, a drug "treats" a disease.
Instances: These are specific examples. "Aspirin" would be an instance of the "Drug" class.

Fig. 1: Simplified example of an ontology showing the relationships between drugs, diseases, and genes.
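
To make the four parts concrete, here is a minimal Python sketch of a toy ontology. The type names (`OntologyClass`, `Instance`, `Relationship`) and the sample values are assumptions made for illustration; real ontologies are usually expressed in dedicated formats rather than plain Python objects.

```python
from dataclasses import dataclass, field


@dataclass
class OntologyClass:
    """A category of things, such as Drug or Disease."""
    name: str
    properties: list[str] = field(default_factory=list)


@dataclass
class Instance:
    """A specific member of a class, with values for its properties."""
    name: str
    instance_of: OntologyClass
    values: dict[str, str] = field(default_factory=dict)


@dataclass
class Relationship:
    """A named, directed link between two instances."""
    source: Instance
    kind: str
    target: Instance


# Classes and their properties (toy schema, not a real ontology)
drug = OntologyClass("Drug", ["dosage", "side_effects"])
disease = OntologyClass("Disease")

# Instances (the dosage value is a placeholder, not medical data)
aspirin = Instance("Aspirin", drug, {"dosage": "placeholder"})
headache = Instance("Headache", disease)

# Relationship: a drug "treats" a disease
treats = Relationship(aspirin, "treats", headache)

print(f"{treats.source.name} ({treats.source.instance_of.name}) "
      f"{treats.kind} {treats.target.name} ({treats.target.instance_of.name})")
```

Keeping classes, instances, and relationships as separate types mirrors the distinction the list above draws between a category, a member of that category, and a link between members.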

The connective power of ontologies

Ontologies aren't just organizational tools—they're the glue that binds disparate datasets together. Think of ontologies as universal translators in a galaxy of diverse data languages. They're critical in ensuring that when talking about a specific drug, gene, or disease across different databases, you always refer to the same entity, even if it goes by different names or identifiers.

Below you can see some ontology databases that are used in drug research:

Consider this scenario: You're researching a new drug compound. In one database, it's listed under its chemical name, in another by its brand name, and in a third by a unique identifier. With ontologies, you can connect these dots easily.

For instance, the drug Aspirin might appear as:

"Aspirin" in a clinical trials database
"Acetylsalicylic acid" in a chemical compound database
"CHEBI:15365" in the ChEBI (Chemical Entities of Biological Interest) database

Fig. 2: Illustration of the unification of isolated entities in a single ontology.

An ontology can define that all these terms refer to the same entity (Fig. 2), allowing researchers to pull comprehensive information about the drug from multiple sources without missing crucial data or inadvertently treating them as separate compounds.
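
A rough sketch of that unification in Python follows. The `SYNONYMS` table, the `canonical_id` helper, and the record layout are hypothetical; only the three names for Aspirin come from the example above.

```python
from typing import Optional

# Toy synonym table: every known label or identifier points at one canonical ID.
# The table itself is hypothetical; a real ontology would supply these mappings.
SYNONYMS = {
    "aspirin": "CHEBI:15365",
    "acetylsalicylic acid": "CHEBI:15365",
    "chebi:15365": "CHEBI:15365",
}


def canonical_id(label: str) -> Optional[str]:
    """Return the canonical identifier for a label, or None if it is unknown."""
    return SYNONYMS.get(label.strip().lower())


# Hypothetical records from three separate sources.
records = [
    {"source": "clinical_trials", "drug": "Aspirin"},
    {"source": "chemical_compounds", "drug": "Acetylsalicylic acid"},
    {"source": "chebi", "drug": "CHEBI:15365"},
]

# Group the records from all sources under the same entity.
merged: dict[str, list[str]] = {}
for record in records:
    key = canonical_id(record["drug"])
    if key is not None:
        merged.setdefault(key, []).append(record["source"])

print(merged)
```

All three records end up under a single key, so a query for the compound sees every source at once instead of three apparently unrelated drugs.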

This connective power extends beyond just naming conventions. Ontologies can also bridge gaps in the granularity of concepts across datasets. One database might refer broadly to "NSAIDs" (Non-Steroidal Anti-Inflammatory Drugs), while another lists specific drugs like Aspirin, Ibuprofen, and Naproxen. A well-designed ontology can establish the hierarchical relationship between these concepts, enabling researchers to navigate between general classes of drugs and their specific instances.
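
The hierarchy can be sketched as a simple child-to-parent table. The `PARENT` mapping and both helper functions are assumptions for illustration, not part of any specific ontology.

```python
# Hypothetical "is a" links: child -> parent.
PARENT = {
    "Aspirin": "NSAID",
    "Ibuprofen": "NSAID",
    "Naproxen": "NSAID",
    "NSAID": "Drug",
}


def ancestors(term: str) -> list[str]:
    """Walk up the hierarchy from a term to its most general class."""
    chain = []
    while term in PARENT:
        term = PARENT[term]
        chain.append(term)
    return chain


def descendants(term: str) -> list[str]:
    """Collect every term that sits below the given class, at any depth."""
    found = []
    for child, parent in PARENT.items():
        if parent == term:
            found.append(child)
            found.extend(descendants(child))
    return found


print(ancestors("Ibuprofen"))
print(sorted(descendants("NSAID")))
```

With both directions available, a dataset that only says "NSAIDs" can be expanded to specific drugs, and a dataset that lists specific drugs can be rolled up to the broader class.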

As you can see, ontologies enable researchers to integrate data from diverse sources confidently, perform more comprehensive analyses by ensuring no relevant data is overlooked, and discover non-obvious relationships between entities that might be obscured by inconsistent naming or classification.

Building your ontology

When embarking on the journey to create an ontology, one of the first crossroads you'll encounter is whether to use an existing public ontology or build your own. This decision is critical and can significantly impact the effectiveness and efficiency of your knowledge graph.

Evaluating public vs. custom ontologies

Public ontologies and reference resources like ChEBI, RxNorm, and ClinicalTrials.gov have the advantage of being well-established and widely recognized in the scientific community. They've been refined over years and offer a common language that facilitates data sharing and collaboration. However, they may not always align perfectly with your research needs or data structures.

On the other hand, building your own ontology allows you to tailor it precisely to your requirements. You can define classes and relationships that correspond directly to your data and research focus. But this path comes with challenges: it requires significant time, expertise, and resources.

The key to making this decision is understanding how well existing ontologies map to the datasets you care about. This involves a deep evaluation of your data sources, research objectives, and the capabilities of public ontologies. You might find that a combination of public and custom ontologies best serves you, allowing you to leverage established standards while accommodating your unique needs.

Understanding existing ontologies

Before you decide to build your own ontology from scratch, it's crucial to understand what's already out there. Ask yourself:

Which public ontologies in your domain are the big players?
How well do these ontologies cover the concepts and relationships in your data sets?
Are existing ontologies too zoomed-out or too zoomed-in for what you need?
How often are public ontologies updated? You don't want to rely on outdated information.

This step is gold - it could save you a ton of time and effort if an existing ontology (or a clever combo of a few) fits the bill.
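
One way to put a number on the coverage question is to check how many of your dataset's terms a candidate ontology already defines. The sketch below assumes simple case-insensitive label matching; the ontology names and the term `Compound-X17` are made up for the example.

```python
def coverage(dataset_terms: set[str], ontology_terms: set[str]) -> float:
    """Fraction of dataset terms that the ontology already defines."""
    if not dataset_terms:
        return 0.0
    normalized_ontology = {term.lower() for term in ontology_terms}
    hits = {term for term in dataset_terms if term.lower() in normalized_ontology}
    return len(hits) / len(dataset_terms)


# Hypothetical dataset terms; "Compound-X17" stands in for an in-house name.
dataset = {"Aspirin", "Ibuprofen", "Compound-X17"}

# Hypothetical candidate ontologies and the terms they define.
candidates = {
    "public_ontology_a": {"aspirin", "ibuprofen", "naproxen"},
    "public_ontology_b": {"aspirin"},
}

for name, terms in candidates.items():
    print(f"{name}: {coverage(dataset, terms):.0%} of dataset terms covered")
```

A low score for every candidate suggests a custom or hybrid ontology; a high score suggests that reuse is worth a closer look. Exact label matching undercounts synonyms, so treat the result as a lower bound.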

Going with your own ontology

Creating your own ontology demands some important steps.

Fig. 3: Representation of the ontology creation process

Define the Model: Identify the key concepts, properties, and relationships crucial to your research. This step often involves collaboration between data scientists, domain experts, and ontology specialists to ensure the model is technically solid and scientifically relevant.
Work with SMEs: Subject matter experts (SMEs) bring deep domain knowledge that helps accurately represent complex biological and pharmaceutical concepts. We are talking about biologists, chemists, pharmacologists, genomics experts, and anyone else who can guide you in mapping between existing ontologies and your custom model, ensuring that your ontology remains compatible with wider scientific standards while meeting your specific needs.
Map Ontologies: If you mix public and custom ontologies, you must create bridges between them. This involves identifying equivalent or related concepts across different ontologies and defining how they link up.
Data Processing: Automation plays a significant role in building and maintaining ontologies, especially when dealing with large-scale data. Natural Language Processing (NLP) techniques can be employed to extract concepts and relationships from scientific literature automatically. Machine learning algorithms can help you connect the dots between different ontologies or suggest new relationships based on data patterns. But remember: human expertise is invaluable for validating relationships, resolving ambiguities, and ensuring the ontology accurately reflects current scientific understanding.
Iterative Development: Ontology development is an iterative process. As you build your ontology, continuously test it against your data and the real world. Be prepared to refine and adjust your model as you gain new insights or as your needs evolve.
Documentation: Remember to document the purpose and target end-users, the design decisions, the ontology structure, the reference to external resources, the usage guidelines, and any relevant information that will impact the long-term success and usability of your ontology.
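
The mapping and automation steps above can be sketched with Python's standard `difflib` module. This is a simple string-similarity stand-in for the NLP and machine learning techniques mentioned, not an implementation of them. The identifiers (`PUB:0001`, `PUB:0002`), the term lists, and the 0.6 similarity cutoff are assumptions.

```python
from difflib import get_close_matches
from typing import Optional

# Hypothetical public ontology: normalized label -> identifier.
public_terms = {
    "acetylsalicylic acid": "PUB:0001",
    "ibuprofen": "PUB:0002",
}

# Hypothetical labels from a custom model.
custom_terms = ["Acetylsalicylic Acid", "Ibuprofen tablets", "Compound-X17"]


def suggest_mapping(label: str) -> tuple[Optional[str], str]:
    """Suggest a public identifier for a custom label, with a review status."""
    normalized = label.strip().lower()
    if normalized in public_terms:
        return public_terms[normalized], "exact"
    # Fuzzy candidates are only suggestions; an SME should confirm each one.
    close = get_close_matches(normalized, list(public_terms), n=1, cutoff=0.6)
    if close:
        return public_terms[close[0]], "needs_review"
    return None, "unmapped"


for term in custom_terms:
    print(term, "->", suggest_mapping(term))
```

The `needs_review` status keeps a human in the loop: automation proposes the bridge, and a domain expert decides whether the two concepts are really equivalent.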

Building and maintaining an ontology is a continuous activity. As new research emerges and our understanding of biological systems deepens, ontologies must evolve to reflect this new knowledge. Regular reviews and updates are essential to keeping your ontology relevant and valuable.

Knowledge graphs: Bringing ontologies to life

Knowledge graphs are a natural and practical implementation of ontologies, bringing the theoretical structure of ontologies into a format that computers can easily process and humans can intuitively understand. They serve as a bridge between the conceptual framework of ontologies and the data-driven world of modern biopharmaceutical research.

From ontology to knowledge graph

Remember how we described ontologies earlier? They consist of classes, properties, relationships, and instances. Knowledge graphs take these components and represent them in a graph structure:

Classes become node labels in the graph
Instances are represented as individual nodes
Properties are attributes of these nodes
Relationships are the edges connecting the nodes

Fig. 4: Knowledge graph representing nodes, labels, properties and their relationships.
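
The same mapping can be sketched in a few lines of Python. The `Node` and `Edge` types, the relationship names, and the sample data are illustrative assumptions; a graph database provides equivalent structures natively.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    node_id: str
    labels: set[str]                      # ontology classes become labels
    properties: dict[str, str] = field(default_factory=dict)


@dataclass
class Edge:
    source: str                           # node_id of the start node
    rel_type: str                         # ontology relationship name
    target: str                           # node_id of the end node


# Instances become nodes; their properties become node attributes.
nodes = {
    "aspirin": Node("aspirin", {"Drug"}, {"name": "Aspirin"}),
    "ptgs2": Node("ptgs2", {"Gene"}, {"symbol": "PTGS2"}),
    "inflammation": Node("inflammation", {"Disease"}, {"name": "Inflammation"}),
}

# Relationships become directed, typed edges between nodes (toy data).
edges = [
    Edge("aspirin", "TARGETS", "ptgs2"),
    Edge("ptgs2", "ASSOCIATED_WITH", "inflammation"),
    Edge("aspirin", "TREATS", "inflammation"),
]


def neighbors(node_id: str, rel_type: str) -> list[Node]:
    """Return the nodes reachable from node_id over edges of the given type."""
    return [nodes[edge.target] for edge in edges
            if edge.source == node_id and edge.rel_type == rel_type]


for node in neighbors("aspirin", "TREATS"):
    print(node.labels, node.properties)
```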

Benefits of knowledge graphs in biopharmaceuticals

As you can see, the knowledge graph offers several advantages:

Intuitive Representation: The graph structure mimics how researchers naturally think about biomedical concepts and their relationships.
Flexibility: As new discoveries are made, new nodes and relationships can be easily added without disrupting the existing structure.
Powerful Querying: Complex questions that would require multiple joins in relational databases can often be answered with a single graph traversal.
Discovery of Hidden Connections: The interconnected nature of graphs allows for the discovery of non-obvious relationships between entities.
Integration of Diverse Data: Different types of data (genomic, chemical, clinical) can be integrated into a single, coherent structure.
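
To illustrate the querying point, the sketch below answers a two-hop question ("which diseases are linked to a drug through the genes it targets?") with one traversal. In a relational design this would typically mean joining a drug-gene table to a gene-disease table. The edge list and the names `DrugZ` and `GeneQ` are placeholders.

```python
# Toy edge list: (source, relationship type, target). Data is illustrative only.
EDGES = [
    ("Aspirin", "TARGETS", "PTGS2"),
    ("Ibuprofen", "TARGETS", "PTGS2"),
    ("PTGS2", "ASSOCIATED_WITH", "Inflammation"),
    ("DrugZ", "TARGETS", "GeneQ"),
]


def follow(start: set[str], rel_type: str) -> set[str]:
    """One hop: every target reachable from `start` over `rel_type` edges."""
    return {dst for src, rel, dst in EDGES if src in start and rel == rel_type}


def traverse(start: str, path: list[str]) -> set[str]:
    """Follow a chain of relationship types from a single starting node."""
    frontier = {start}
    for rel_type in path:
        frontier = follow(frontier, rel_type)
    return frontier


# Which diseases are linked to Aspirin through any gene it targets?
print(traverse("Aspirin", ["TARGETS", "ASSOCIATED_WITH"]))
```

Adding a third hop is a matter of appending one more relationship type to the path, which is the flexibility the list above describes.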

Property graph databases

Creating a knowledge graph isn't just about conceptual mapping. It's also about choosing the right database to store and query this vast amount of interconnected information. Property graph databases store information as nodes, relationships, and properties, a structure that mirrors the conceptual model of knowledge graphs. When you use these databases, you get:

Simplicity in Design: Imagine trying to explain your knowledge graph to a colleague who isn't tech-savvy. With property graphs, what you sketch on a whiteboard is what you'll implement in the database. This simplicity streamlines the journey from concept to implementation, making it easier for everyone involved to understand and contribute to the project.
Flexibility for the Future: Need to add a new type of relationship between drugs and targets? Or perhaps you've discovered a new property of genes that you want to include? With property graphs, these changes are easy to manage. You can expand and evolve your knowledge graph incrementally, adapting to discoveries and insights as they emerge.
Performance: By storing relationships directly in the database, these systems can traverse complex chains of connections with impressive speed. It's like giving your researchers a high-speed train to navigate through the vast landscape of biomedical data.
Friendly Querying: With intuitive query languages designed specifically for graph structures, developers can write cleaner, more expressive code. This means less time wrestling with complex queries and more time extracting valuable insights from your knowledge graph.

Neo4j is one well-known and highly scalable example of a property graph database. There are other good choices as well.

Neo4j: A powerful property graph database

Now that we understand the power of knowledge graphs and the benefits of property graph databases, let's zoom in on a leading player in this field: Neo4j.

Neo4j is a native graph database, available in an open-source Community Edition and a commercial Enterprise Edition, that's been optimized for storing and querying highly connected data. It embodies all the advantages we've discussed, making it an excellent choice for implementing ontologies and knowledge graphs in biopharmaceutical research.

Neo4j's Unique Advantages

While several graph databases exist in the market, Neo4j stands out in ways particularly beneficial to biopharmaceutical research:

ACID Compliance: Unlike some NoSQL databases, Neo4j ensures data integrity through ACID (Atomicity, Consistency, Isolation, Durability) compliance. This is crucial when dealing with sensitive pharmaceutical data.
Scalability: Neo4j offers both vertical and horizontal scalability options. Its Fabric feature (succeeded by composite databases in Neo4j 5) allows for sharding graph data across multiple databases, essential for handling the vast datasets in genomics and drug interaction studies.
Graph Algorithms Library: Neo4j provides a comprehensive library of graph algorithms through its Graph Data Science library. This is particularly useful for tasks like pathway analysis, drug target identification, and detecting unexpected interactions in complex biological networks.
Causal Clustering: This feature ensures high availability and fault tolerance, critical for maintaining uninterrupted access to vital research data.
Cypher Query Language: Neo4j uses Cypher, a declarative query language specifically designed for working with graph data. Cypher's syntax is intuitive and visually representative of the graph patterns it queries.
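
As a rough illustration of what a graph algorithm of the kind mentioned above computes, here is a plain-Python breadth-first shortest-path search. It is a generic textbook sketch, not Neo4j's implementation, and the network and its node names are placeholders rather than real biology.

```python
from collections import deque

# Undirected toy network stored as an adjacency list (placeholder names).
GRAPH = {
    "DrugA": ["Protein1"],
    "Protein1": ["DrugA", "Protein2", "Protein3"],
    "Protein2": ["Protein1", "DiseaseX"],
    "Protein3": ["Protein1"],
    "DiseaseX": ["Protein2"],
}


def shortest_path(start: str, goal: str):
    """Breadth-first search: return the shortest node path, or None."""
    queue = deque([[start]])
    visited = {start}
    while queue:
        path = queue.popleft()
        node = path[-1]
        if node == goal:
            return path
        for neighbor in GRAPH.get(node, []):
            if neighbor not in visited:
                visited.add(neighbor)
                queue.append(path + [neighbor])
    return None


print(shortest_path("DrugA", "DiseaseX"))
```

A path like this is the kind of result that pathway analysis builds on: it shows through which intermediate entities a drug and a disease are connected.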

Putting It All Together

So, how do we actually use Neo4j to bring our ontologies to life? Here's a simplified roadmap:

Plan Your Data Model:
- Decide how your ontology classes will be represented as nodes
- Figure out how relationships will be shown
- Plan what properties your nodes and relationships will have
Import Your Data:
- Use Neo4j's tools to bring in your ontology data
- For more complex imports, Neo4j's APOC library can be a big help
Query and Explore:
- Use Cypher queries to navigate your ontology
- Take advantage of Neo4j's graph algorithms for deeper analysis
Visualize:
- Use Neo4j Browser to create visual representations of your data

Fig. 5: An example of a Cypher query to create and visualize the knowledge graph described and illustrated in this article
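
The roadmap above can be sketched from the Python side by generating a parameterized Cypher statement. This is not the query shown in the figure. The helper names, variable names, and sample values are assumptions, and the labels and relationship types are expected to come from your own trusted data model, since they are formatted into the query text rather than passed as parameters.

```python
def merge_node(var: str, label: str, key: str) -> str:
    """Build a MERGE clause that matches or creates a node by one key property."""
    return f"MERGE ({var}:{label} {{{key}: ${var}_{key}}})"


def merge_relationship(src: str, rel_type: str, dst: str) -> str:
    """Build a MERGE clause for a directed relationship between two variables."""
    return f"MERGE ({src})-[:{rel_type}]->({dst})"


def build_statement() -> tuple[str, dict[str, str]]:
    """Assemble one Cypher statement plus its parameters for a toy graph."""
    query = "\n".join([
        merge_node("d", "Drug", "name"),
        merge_node("g", "Gene", "symbol"),
        merge_node("s", "Disease", "name"),
        merge_relationship("d", "TARGETS", "g"),
        merge_relationship("g", "ASSOCIATED_WITH", "s"),
        merge_relationship("d", "TREATS", "s"),
        "RETURN d, g, s",
    ])
    # Values travel as parameters instead of being pasted into the query text.
    params = {"d_name": "Aspirin", "g_symbol": "PTGS2", "s_name": "Inflammation"}
    return query, params


query, params = build_statement()
print(query)
print(params)
```

With the official `neo4j` Python driver, a statement like this could be passed to `session.run(query, params)`; connection details depend on your deployment. Using `MERGE` rather than `CREATE` means that re-running the import does not duplicate nodes or relationships.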

The future of drug discovery

As biomedical data grows rapidly, having powerful tools to organize and analyze this information becomes crucial. Ontologies provide the structure, while Neo4j offers the horsepower to explore and use this knowledge effectively.

For companies in the biopharmaceutical field, adopting these technologies could be a game-changer. It's not just about keeping up with the latest tech trends; it's about unlocking new potential in drug discovery and development. Who knows what breakthroughs are waiting to be uncovered in the vast web of biomedical knowledge?

Published by Álan Gularte in Software Engineering
