
ReaSyn Model Weights

ReaSyn v2 Overview

Description

ReaSyn is a model for predicting the synthesis pathway (the reaction steps from reactants to final products) for a target product molecule. When the target molecule cannot be synthesized directly using known reaction steps, ReaSyn generates pathways for the most structurally similar, synthesizable analog of the target molecule. The model uses an encoder-decoder Transformer architecture, in which a full synthetic pathway is represented as a text sequence. ReaSyn v2 improves on the reconstruction and projection capabilities of ReaSyn v1 by using a more advanced search that combines top-down and bottom-up tree traversal, together with an Edit Flow model that edits generated pathways via deletion, substitution, and insertion operations. This approach allows the model to achieve state-of-the-art (SOTA) performance in tasks such as synthesis planning and incorporating synthesizability into goal-directed molecular property optimization.

This model is ready for commercial use.

License/Terms of Use

GOVERNING TERMS: Use of this model is governed by the NVIDIA Open Model License. ReaSyn source code is licensed under Apache 2.0.

Deployment Geography: Global

Use Case:

ReaSyn v2 is a model for predicting the synthetic pathway (the reaction steps from reactants to final products) for a target product molecule. The model can be used in the pharmaceutical and chemical industries and in academic research to identify how to synthesize a molecule, help chemists plan a first-time synthesis, optimize an existing synthesis pathway, or filter candidate molecules based on ease of synthesis.

Release Date:

GitHub 10/27/2025 via https://github.com/NVIDIA-Digital-Bio/ReaSyn

NGC 10/27/2025

References

Research paper: "Exploring Synthesizable Chemical Space with Iterative Pathway Refinements," https://arxiv.org/abs/2509.16084

Model Architecture

Architecture Type: Encoder-decoder
Network Architecture: Encoder-decoder Transformer
ReaSyn v2 uses an encoder-decoder Transformer architecture that takes a molecular SMILES string as input and outputs its synthetic pathway autoregressively. The encoder contains 6 layers and the decoder contains 10 layers. Both have a hidden size of 768, 16 attention heads, and a feed-forward dimension of 4096.
ReaSyn v2 also includes an Edit Flow model, which uses the same encoder-decoder Transformer architecture as its backbone, with three additional heads. The Edit Flow model takes a molecular SMILES string and a synthetic pathway generated by the autoregressive model as input and outputs the probabilities of edit operations (insertion, deletion, and substitution) that yield a more refined synthetic pathway.
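
The three edit operations can be illustrated on a toy token sequence. The sketch below is only an illustration: the token names, the `(operation, position, token)` edit format, and the `apply_edits` helper are hypothetical and are not taken from the ReaSyn code. The actual Edit Flow model outputs probabilities over such operations rather than a fixed edit list.

```python
from typing import List, Optional, Tuple

# (operation, position, token); token is None for deletions.
# This edit format is an assumption made for illustration only.
Edit = Tuple[str, int, Optional[str]]


def apply_edits(tokens: List[str], edits: List[Edit]) -> List[str]:
    """Apply insert/delete/substitute edits, with positions given
    relative to the ORIGINAL sequence."""
    result = list(tokens)
    # Work from right to left so that earlier positions stay valid
    # after an insertion or deletion further to the right.
    for op, pos, tok in sorted(edits, key=lambda e: e[1], reverse=True):
        if op == "delete":
            del result[pos]
        elif op == "substitute":
            result[pos] = tok
        elif op == "insert":
            result.insert(pos, tok)  # new token goes before position pos
        else:
            raise ValueError(f"unknown edit operation: {op}")
    return result


# Hypothetical pathway tokens: building blocks (BB_*) and reactions (RXN_*).
draft = ["[START]", "BB_A", "BB_B", "RXN_1", "[END]"]
edits = [
    ("substitute", 2, "BB_C"),  # swap one building block
    ("insert", 4, "RXN_2"),     # add a reaction step before [END]
    ("delete", 1, None),        # drop a building block
]
refined = apply_edits(draft, edits)
```

With these inputs, `refined` is `["[START]", "BB_C", "RXN_1", "RXN_2", "[END]"]`.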

The autoregressive model has 166M parameters and the Edit Flow model has 174M parameters.
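
As a rough sanity check, the stated layer counts and dimensions can be turned into a back-of-the-envelope parameter estimate. The sketch below assumes a standard Transformer layer layout (biased linear projections, two LayerNorms per encoder layer, three per decoder layer) and ignores embeddings and output heads. These are assumptions, not details confirmed by the model card, and the function names are made up for this example.

```python
def attention_params(d_model: int) -> int:
    # Q, K, V, and output projections: four d_model x d_model matrices plus biases.
    return 4 * (d_model * d_model + d_model)


def feed_forward_params(d_model: int, d_ff: int) -> int:
    # Two linear layers: d_model -> d_ff and d_ff -> d_model, with biases.
    return (d_model * d_ff + d_ff) + (d_ff * d_model + d_model)


def layer_norm_params(d_model: int) -> int:
    # One scale and one shift vector.
    return 2 * d_model


def encoder_layer_params(d_model: int, d_ff: int) -> int:
    return (attention_params(d_model)
            + feed_forward_params(d_model, d_ff)
            + 2 * layer_norm_params(d_model))


def decoder_layer_params(d_model: int, d_ff: int) -> int:
    # Self-attention plus cross-attention, then the feed-forward block.
    return (2 * attention_params(d_model)
            + feed_forward_params(d_model, d_ff)
            + 3 * layer_norm_params(d_model))


def transformer_body_params(enc_layers: int = 6, dec_layers: int = 10,
                            d_model: int = 768, d_ff: int = 4096) -> int:
    return (enc_layers * encoder_layer_params(d_model, d_ff)
            + dec_layers * decoder_layer_params(d_model, d_ff))


total = transformer_body_params()
head_dim = 768 // 16  # per-head dimension implied by 16 heads
print(f"layers only: {total:,} parameters, head dim {head_dim}")
```

Under these assumptions the layers alone account for about 162M parameters, which is consistent with the stated 166M once embeddings and output layers are included.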

Autoregressive model

Input

Input Types: Text

Input Formats: SMILES string

Input Parameters: One-Dimensional (1D)

Other Properties Related to Input: Maximum input length is 256 tokens.
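
A pre-check against the 256-token input limit might look like the sketch below. The tokenizer used by ReaSyn is not described in this card, so the example assumes a widely used regex-based SMILES tokenization; actual token counts may differ. `tokenize_smiles` and `fits_input_limit` are hypothetical helper names.

```python
import re
from typing import List

MAX_INPUT_TOKENS = 256  # input limit stated in this model card

# Regex-based SMILES tokenization (an assumption; ReaSyn's own tokenizer
# may split strings differently). Bracket atoms, two-letter halogens,
# ring-closure digits, bonds, and branches each become one token.
SMILES_TOKEN_PATTERN = re.compile(
    r"(\[[^\]]+\]|Br?|Cl?|N|O|S|P|F|I|b|c|n|o|s|p|\(|\)|\.|=|#|-|\+"
    r"|\\|/|:|~|@|\?|>|\*|\$|%[0-9]{2}|[0-9])"
)


def tokenize_smiles(smiles: str) -> List[str]:
    tokens = SMILES_TOKEN_PATTERN.findall(smiles)
    # Reject strings containing characters the pattern does not cover.
    if "".join(tokens) != smiles:
        raise ValueError(f"could not tokenize SMILES: {smiles!r}")
    return tokens


def fits_input_limit(smiles: str, limit: int = MAX_INPUT_TOKENS) -> bool:
    return len(tokenize_smiles(smiles)) <= limit


aspirin = "CC(=O)Oc1ccccc1C(=O)O"
print(len(tokenize_smiles(aspirin)), fits_input_limit(aspirin))
```

Checking the length before inference avoids silently truncated inputs, although the authoritative count is whatever the model's own tokenizer produces.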

Output

Output Types: Text

Output Formats: Molecular synthetic pathway

Output Parameters: One-Dimensional (1D)

Other Properties Related to Output: Maximum output length is 512 tokens.

Our AI models are designed and/or optimized to run on NVIDIA GPU-accelerated systems. By leveraging NVIDIA’s hardware (e.g. GPU cores) and software frameworks (e.g., CUDA libraries), the model achieves faster training and inference times compared to CPU-only solutions.

Edit Flow model

Input

Input Types: Text

Input Formats: SMILES string, molecular synthetic pathway

Input Parameters: One-Dimensional (1D)

Other Properties Related to Input: Maximum input length of SMILES string is 256 tokens. Maximum input length of molecular synthetic pathway is 512 tokens.

Output

Output Types: Text

Output Formats: Molecular synthetic pathway

Output Parameters: One-Dimensional (1D)

Other Properties Related to Output: Maximum output length is 512 tokens.

Our AI models are designed and/or optimized to run on NVIDIA GPU-accelerated systems. By leveraging NVIDIA’s hardware (e.g. GPU cores) and software frameworks (e.g., CUDA libraries), the model achieves faster training and inference times compared to CPU-only solutions.

Software Integration

Runtime Engine: Torch

Supported Hardware Microarchitecture Compatibility: NVIDIA Ampere

Preferred Operating System: Linux, Windows

The integration of foundation and fine-tuned models into AI systems requires additional testing using use-case-specific data to ensure safe and effective deployment. Following the V-model methodology, iterative testing and validation at both unit and system levels are essential to mitigate risks, meet technical and functional requirements, and ensure compliance with safety and ethical standards before deployment.

Model Versions

ReaSyn v2

Training and Evaluation Datasets

Training Datasets

SynFormer Reaction Templates

Link: https://github.com/wenhao-gao/synformer/blob/main/data/rxn_templates/comprehensive.txt

Data Modality: Text

Text Training Data Size: 1 Billion to 10 Trillion Tokens

Data Collection Method by dataset: Human

Labeling Method by dataset: Automated

Properties: 115 molecular reaction templates in the SMARTS format

Building Blocks in Enamine US Stock retrieved in October 2023

Link: https://enamine.net/building-blocks/building-blocks-catalog

Data Modality: Text

Text Training Data Size: 1 Billion to 10 Trillion Tokens

Data Collection Method by dataset: Human

Labeling Method by dataset: N/A

Properties: 115 molecular reaction templates in the SMARTS format

Evaluation Dataset

Enamine REAL Test Set

Link: https://github.com/wenhao-gao/synformer/blob/main/data/enamine_smiles_1k.txt

https://enamine.net/compound-collections/real-compounds/real-database

Data Collection Method by dataset: Human

Labeling Method by dataset: N/A

Properties: Randomly selected 1k test molecules from Enamine REAL to evaluate synthesizable molecule reconstruction.

ChEMBL Test Set

Link: https://github.com/wenhao-gao/synformer/blob/main/data/chembl_filtered_1k.txt

https://www.ebi.ac.uk/chembl

Data Collection Method by dataset: Human

Labeling Method by dataset: N/A

Properties: Randomly selected 1k test molecules from ChEMBL to evaluate synthesizable molecule reconstruction.

ZINC250k Test Set

Link: https://www.kaggle.com/datasets/basu369victor/zinc250k

Data Collection Method by dataset: Synthetic

Labeling Method by dataset: N/A

Properties: Randomly selected 1k test molecules from ZINC250k to evaluate synthesizable molecule reconstruction.

Inference

Engine: Torch

Test Hardware: Ampere / NVIDIA A100

Ethical Considerations

NVIDIA believes Trustworthy AI is a shared responsibility and we have established policies and practices to enable development for a wide array of AI applications. When downloaded or used in accordance with our terms of service, developers should work with their internal model team to ensure this model meets requirements for the relevant industry and use case and addresses unforeseen product misuse.

Users are responsible for ensuring the physical properties of model-generated molecules are appropriately evaluated and comply with applicable safety regulations and ethical standards.

For more detailed information on ethical considerations for this model, please see the Model Card++ Explainability, Bias, Safety & Security, and Privacy Subcards.

Please report security vulnerabilities or NVIDIA AI Concerns here.

Bias

Participation considerations from adversely impacted groups (protected classes) in model design and testing: None

Measures taken to mitigate against unwanted bias: None

Explainability

Intended Application(s) & Domain(s): Molecular drug discovery and design

Model Type: Molecular synthesis pathway generation

Intended Users: Developers in the academic or pharmaceutical industries who want to predict synthesis pathways for molecules and who build artificial intelligence applications to perform property guided molecule optimization and novel molecule generation.

Output: Text

Describe how the model works: ReaSyn uses a Transformer encoder-decoder architecture and requires a "target molecule" as its input (SMILES format). The model then generates a synthetic pathway for that molecule or, if needed, for a synthesizable analog of the input molecule.

Name the adversely impacted groups this has been tested to deliver comparable outcomes regardless of: Not Applicable

Technical Limitations: The model's reasoning steps, while incorporating reactants, reaction types, and intermediate products, do not account for other important reaction information like environmental conditions or yields. This could be a limitation in real-world drug discovery scenarios.

Verified to have met prescribed quality standards?: Yes

Performance Metrics: Reconstruction rate, Similarity, Diversity (Pathway), Diversity (BB).

Potential Known Risks: The framework, while effective for generating drug candidates, also has the possibility of generating synthetic pathways for toxic drugs. This requires an additional scheme to be adopted to filter out harmful molecules during the generation search.

Licensing & Terms of Use: The use of this model is governed by the NVIDIA Open Model License Agreement.
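
The reconstruction rate and similarity metrics listed above can be sketched as follows. This is an illustration under stated assumptions, not the evaluation code used for ReaSyn: it assumes SMILES strings are already canonicalized so that string equality means molecular identity, and it represents fingerprints as plain sets of integer feature IDs rather than using a cheminformatics library.

```python
from typing import Sequence, Set


def tanimoto(a: Set[int], b: Set[int]) -> float:
    """Tanimoto (Jaccard) similarity between two fingerprint feature sets."""
    if not a and not b:
        return 1.0  # convention: two empty fingerprints are identical
    return len(a & b) / len(a | b)


def reconstruction_rate(targets: Sequence[str],
                        generated: Sequence[Sequence[str]]) -> float:
    """Fraction of targets for which at least one generated product
    is exactly the target (assumes canonical SMILES)."""
    hits = sum(1 for target, products in zip(targets, generated)
               if target in set(products))
    return hits / len(targets)


# Toy data: two targets, each with a list of generated product SMILES.
targets = ["CCO", "CCN"]
generated = [["CCO", "CC"], ["CCC"]]
rate = reconstruction_rate(targets, generated)

# Toy fingerprints as sets of feature IDs (hypothetical values).
similarity = tanimoto({1, 2, 3}, {2, 3, 4})
```

Here `rate` is 0.5 (one of two targets reconstructed) and `similarity` is 0.5 (two shared features out of four in total).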

Privacy

Generatable or reverse engineerable personal data?: No

Personal data used to create this model?: No

How often is the dataset reviewed?: Before Release

Is there provenance for all datasets used in training?: Yes

Does data labeling (annotation, metadata) comply with privacy laws?: Yes

Applicable Privacy Policy: NVIDIA Privacy Policy

Safety

Model Application(s): Synthesis Pathway Prediction in drug, chemical, and materials design.

Describe life critical impact (if present): Experimental results: Additional in silico and experimental tests are recommended before using the predicted synthesis paths in downstream applications.

Use Case Restrictions: Abide by NVIDIA Open Model License Agreement.

Model and Dataset Restrictions: The Principle of Least Privilege (PoLP) is applied, limiting access for dataset generation and model development. Restrictions enforce dataset access during training, and dataset license constraints are adhered to.


ReaSyn v1 Overview

Description

ReaSyn is a model for predicting the synthesis pathway (the reaction steps from reactants to final products) for a target product molecule. When the target molecule cannot be synthesized directly using known reaction steps, ReaSyn generates pathways for the most structurally similar, synthesizable analog of the target molecule. The model uses an encoder-decoder Transformer architecture and a chain-of-reaction notation, in which a full synthetic pathway is represented as a text sequence. This approach allows the model to achieve state-of-the-art (SOTA) performance in tasks such as synthesis planning and incorporating synthesizability into goal-directed molecular property optimization.
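
To make the idea of a pathway written as a text sequence concrete, the sketch below serializes a list of reaction steps, each with reactants, a reaction type, and a product. The token layout, the `[START]`/`[END]`/`[RXN:...]` markers, and the reaction label are hypothetical and chosen only for illustration; the actual chain-of-reaction format is defined in the ReaSyn paper and code.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ReactionStep:
    reactants: List[str]  # SMILES of building blocks or earlier intermediates
    reaction: str         # reaction-type label (hypothetical naming)
    product: str          # SMILES of the intermediate or final product


def serialize_pathway(steps: List[ReactionStep]) -> str:
    """Flatten a multi-step pathway into a single space-separated string."""
    tokens = ["[START]"]
    for step in steps:
        tokens.extend(step.reactants)
        tokens.append(f"[RXN:{step.reaction}]")
        tokens.append(step.product)
    tokens.append("[END]")
    return " ".join(tokens)


# One-step example: acetic acid + methylamine -> N-methylacetamide.
pathway = [
    ReactionStep(reactants=["CC(=O)O", "CN"],
                 reaction="amide_coupling",
                 product="CC(=O)NC"),
]
sequence = serialize_pathway(pathway)
print(sequence)
```

For this one-step pathway the string is `[START] CC(=O)O CN [RXN:amide_coupling] CC(=O)NC [END]`. Writing each intermediate product explicitly lets a sequence model condition later steps on earlier results.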

This model is ready for commercial use.

License/Terms of Use

GOVERNING TERMS: Use of this model is governed by the NVIDIA Open Model License. ReaSyn source code is licensed under Apache 2.0.

Deployment Geography: Global

Use Case:

ReaSyn is a model for predicting the synthetic pathway (the reaction steps from reactants to final products) for a target product molecule. The model can be used in the pharmaceutical and chemical industries and in academic research to identify how to synthesize a molecule, help chemists plan a first-time synthesis, optimize an existing synthesis pathway, or filter candidate molecules based on ease of synthesis.

Release Date:

GitHub 09/23/2025 via https://github.com/NVIDIA-Digital-Bio/ReaSyn

NGC 09/23/2025

References

Research paper: “Rethinking Molecule Synthesizability with Chain-of-Reaction”

Model Architecture

Architecture Type: Encoder-decoder
Network Architecture: Encoder-decoder Transformer
ReaSyn uses an encoder-decoder Transformer architecture that takes a molecular SMILES string as input and outputs its synthetic pathway. The encoder contains 6 layers and the decoder contains 10 layers. Both have a hidden size of 768, 16 attention heads, and a feed-forward dimension of 4096.

The total number of parameters in ReaSyn is 166M.

Input

Input Types: Text

Input Formats: SMILES string

Input Parameters: One-Dimensional (1D)

Other Properties Related to Input: Maximum input length is 256 tokens.

Output

Output Types: Text

Output Formats: Chain-of-Reaction sequence (molecular synthetic pathway)

Output Parameters: One-Dimensional (1D)

Other Properties Related to Output: Maximum output length is 768 tokens.

Our AI models are designed and/or optimized to run on NVIDIA GPU-accelerated systems. By leveraging NVIDIA’s hardware (e.g. GPU cores) and software frameworks (e.g., CUDA libraries), the model achieves faster training and inference times compared to CPU-only solutions.

Software Integration

Runtime Engine: Torch

Supported Hardware Microarchitecture Compatibility: NVIDIA Ampere

Preferred Operating System: Linux, Windows

The integration of foundation and fine-tuned models into AI systems requires additional testing using use-case-specific data to ensure safe and effective deployment. Following the V-model methodology, iterative testing and validation at both unit and system levels are essential to mitigate risks, meet technical and functional requirements, and ensure compliance with safety and ethical standards before deployment.

Model Versions

ReaSyn v1

Training and Evaluation Datasets

Training Datasets

SynFormer Reaction Templates

Link: https://github.com/wenhao-gao/synformer/blob/main/data/rxn_templates/comprehensive.txt

Data Modality: Text

Text Training Data Size: 1 Billion to 10 Trillion Tokens

Data Collection Method by dataset: Human

Labeling Method by dataset: Automated

Properties: 115 molecular reaction templates in the SMARTS format

Building Blocks in Enamine US Stock retrieved in October 2023

Link: https://enamine.net/building-blocks/building-blocks-catalog

Data Modality: Text

Text Training Data Size: 1 Billion to 10 Trillion Tokens

Data Collection Method by dataset: Human

Labeling Method by dataset: N/A

Properties: 115 molecular reaction templates in the SMARTS format

Evaluation Dataset

Enamine REAL Test Set

Link: https://github.com/wenhao-gao/synformer/blob/main/data/enamine_smiles_1k.txt

https://enamine.net/compound-collections/real-compounds/real-database

Data Collection Method by dataset: Human

Labeling Method by dataset: N/A

Properties: Randomly selected 1k test molecules from Enamine REAL to evaluate synthesizable molecule reconstruction.

ChEMBL Test Set

Link: https://github.com/wenhao-gao/synformer/blob/main/data/chembl_filtered_1k.txt

https://www.ebi.ac.uk/chembl

Data Collection Method by dataset: Human

Labeling Method by dataset: N/A

Properties: Randomly selected 1k test molecules from ChEMBL to evaluate synthesizable molecule reconstruction.

ZINC250k Test Set

Link: https://www.kaggle.com/datasets/basu369victor/zinc250k

Data Collection Method by dataset: Synthetic

Labeling Method by dataset: N/A

Properties: Randomly selected 1k test molecules from ZINC250k to evaluate synthesizable molecule reconstruction.

Inference

Engine: Torch

Test Hardware: Ampere / NVIDIA A100

Ethical Considerations

NVIDIA believes Trustworthy AI is a shared responsibility and we have established policies and practices to enable development for a wide array of AI applications. When downloaded or used in accordance with our terms of service, developers should work with their internal model team to ensure this model meets requirements for the relevant industry and use case and addresses unforeseen product misuse.

Users are responsible for ensuring the physical properties of model-generated molecules are appropriately evaluated and comply with applicable safety regulations and ethical standards.

For more detailed information on ethical considerations for this model, please see the Model Card++ Explainability, Bias, Safety & Security, and Privacy Subcards.

Please report security vulnerabilities or NVIDIA AI Concerns here.

Bias

Participation considerations from adversely impacted groups (protected classes) in model design and testing: None

Measures taken to mitigate against unwanted bias: None

Explainability

Intended Application(s) & Domain(s): Molecular drug discovery and design

Model Type: Molecular synthesis pathway generation

Intended Users: Developers in the academic or pharmaceutical industries who want to predict synthesis pathways for molecules and who build artificial intelligence applications to perform property guided molecule optimization and novel molecule generation.

Output: Text

Describe how the model works: ReaSyn uses a Transformer encoder-decoder architecture and requires a "target molecule" as its input (SMILES format). The model then generates a synthetic pathway for that molecule or, if needed, for a synthesizable analog of the input molecule.

Name the adversely impacted groups this has been tested to deliver comparable outcomes regardless of: Not Applicable

Technical Limitations: The model's reasoning steps, while incorporating reactants, reaction types, and intermediate products, do not account for other important reaction information like environmental conditions or yields. This could be a limitation in real-world drug discovery scenarios.

Verified to have met prescribed quality standards?: Yes

Performance Metrics: Reconstruction rate, Similarity, Diversity (Pathway), Diversity (BB).

Potential Known Risks: The framework, while effective for generating drug candidates, also has the possibility of generating synthetic pathways for toxic drugs. This requires an additional scheme to be adopted to filter out harmful molecules during the generation search.

Licensing & Terms of Use: The use of this model is governed by the NVIDIA Open Model License Agreement.

Privacy

Generatable or reverse engineerable personal data?: No

Personal data used to create this model?: No

How often is the dataset reviewed?: Before Release

Is there provenance for all datasets used in training?: Yes

Does data labeling (annotation, metadata) comply with privacy laws?: Yes

Applicable Privacy Policy: NVIDIA Privacy Policy

Safety

Model Application(s): Synthesis Pathway Prediction in drug, chemical, and materials design.

Describe life critical impact (if present): Experimental results: Additional in silico and experimental tests are recommended before using the predicted synthesis paths in downstream applications.

Use Case Restrictions: Abide by NVIDIA Open Model License Agreement.

Model and Dataset Restrictions: The Principle of Least Privilege (PoLP) is applied, limiting access for dataset generation and model development. Restrictions enforce dataset access during training, and dataset license constraints are adhered to.

Publisher: NVIDIA
Latest Version: 2.0
Updated: January 9, 2026 UTC
Compressed Size: 1.86 GB