# Translatomer
This is our implementation for the paper:
Jialin He, Lei Xiong#, Shaohui Shi, Chengyu Li, Kexuan Chen, Qianchen Fang, Jiuhong Nan, Ke Ding, Yuanhui Mao, Carles A. Boix, Xinyang Hu, Manolis Kellis, Jingyun Li and Xushen Xiong#. Deep learning prediction of ribosome profiling with Translatomer reveals translational regulation and interprets disease variants. (Nature Machine Intelligence)
## Introduction
Translatomer is a transformer-based multi-modal deep learning framework that predicts ribosome profiling tracks using genomic sequence and cell-type-specific RNA-seq data as input.
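For orientation, here is a hypothetical input/output sketch. The tensor shapes follow the defaults used later in this README (a 65,536-bp window binned into 1,024 bins); the class name and call signature are assumptions, not the actual API:

```python
import torch

# Assumed shapes, matching the defaults elsewhere in this README:
# a 65,536-bp DNA window and an RNA-seq track binned into 1,024 bins;
# the model outputs a ribosome profiling track over the same 1,024 bins.
batch, region_len, n_bins = 2, 65536, 1024
sequence = torch.randint(0, 4, (batch, region_len))  # integer-encoded DNA
rnaseq = torch.rand(batch, n_bins)                   # binned RNA-seq signal

# model = TransModel(...)                 # the model class defined in this repository
# riboseq_pred = model(sequence, rnaseq)  # expected shape: (batch, n_bins)
```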
## Citation
If you use our code and datasets in your research, please cite:

He, J., Xiong, L., Shi, S., Li, C., Chen, K., Fang, Q., Nan, J., Ding, K., Mao, Y., Boix, C. A., Hu, X., Kellis, M., Li, J. and Xiong, X. Deep learning prediction of ribosome profiling with Translatomer reveals translational regulation and interprets disease variants. Nature Machine Intelligence.
## Prerequisites
To run this project, you need the following prerequisites:
- Python 3.9
- PyTorch 1.13.1+cu117
- Other required Python libraries (please refer to requirements.txt)
You can install all the required packages using the following command:
```
conda create -n pytorch python=3.9.16
conda activate pytorch
pip install torch==1.13.1+cu117 torchvision==0.14.1+cu117 torchaudio==0.13.1 --extra-index-url https://download.pytorch.org/whl/cu117
pip install -r requirements.txt
```
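After installation, you can quickly confirm that the expected PyTorch build is in place and the GPU is visible (a minimal check, not part of the original pipeline):

```python
import torch

# Expect '1.13.1+cu117' and True on a correctly configured CUDA machine.
print(torch.__version__)
print(torch.cuda.is_available())
```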
## Data Preparation
Example data for model training can be downloaded from Zenodo.
- Put all input files in a data folder. The input files must be organized as follows:
```
- data
    - hg38
        - K562
            - GSE153597
                - input_features
                    - rnaseq.bw
                - output_features
                    - riboseq.bw
            - ...
        - HepG2
            - GSE174419
                - input_features
                    - rnaseq.bw
                - output_features
                    - riboseq.bw
            - ...
        - gencode.v43.annotation.gff3
        - hg38.fa
        - hg38.fai
        - mean.sorted.bw
    - mm10
        - ...
```
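A small sanity check (an optional helper, not part of the repository) can walk the data folder and confirm that each study directory contains the expected input_features/rnaseq.bw and output_features/riboseq.bw files:

```python
import os

# Hypothetical helper: verify the data/<assembly>/<celltype>/<study> layout
# sketched above, reporting any missing bigWig files.
def check_layout(data_root="data"):
    for assembly in os.listdir(data_root):
        assembly_dir = os.path.join(data_root, assembly)
        if not os.path.isdir(assembly_dir):
            continue  # skip assembly-level files such as hg38.fa
        for celltype in os.listdir(assembly_dir):
            celltype_dir = os.path.join(assembly_dir, celltype)
            if not os.path.isdir(celltype_dir):
                continue  # skip files such as gencode.v43.annotation.gff3
            for study in os.listdir(celltype_dir):
                study_dir = os.path.join(celltype_dir, study)
                if not os.path.isdir(study_dir):
                    continue
                for rel in ("input_features/rnaseq.bw", "output_features/riboseq.bw"):
                    path = os.path.join(study_dir, rel)
                    status = "ok" if os.path.exists(path) else "MISSING"
                    print(f"{status}: {path}")

check_layout()
```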
- To generate training data, use the following command:

```
python generate_features_4rv.py [options]
```

[options]:
- --assembly Genome reference for the data. Default = 'hg38'.
- --celltype Name of the cell line. Default = 'K562'.
- --study GEO accession number for the data. Default = 'GSE153597'.
- --region_len The desired sequence length (region length). Default = 65536.
- --nBins The number of bins for dividing the sequence. Default = 1024.
Example to run the codes:

```
find data/ -type d -name 'output_features' -exec mkdir -p '{}/tmp' \;
find data/ -type d -name 'input_features' -exec mkdir -p '{}/tmp' \;
nohup python generate_features_4rv.py --assembly hg38 --celltype HepG2 --study GSE174419 --region_len 65536 --nBins 1024 &
nohup python generate_features_4rv.py --assembly hg38 --celltype K562 --study GSE153597 --region_len 65536 --nBins 1024 &
```
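For intuition on what the --region_len and --nBins options above control, the sketch below (using pyBigWig; an illustration under made-up coordinates, not the script's actual code) averages a bigWig signal over a 65,536-bp window into 1,024 bins of 64 bp each:

```python
import pyBigWig

# Illustration only: average RNA-seq coverage over a 65,536-bp window
# into 1,024 bins (64 bp per bin). Chromosome and start coordinate are made up.
region_len, n_bins = 65536, 1024

bw = pyBigWig.open("data/hg38/K562/GSE153597/input_features/rnaseq.bw")
start = 1_000_000
values = bw.stats("chr1", start, start + region_len, type="mean", nBins=n_bins)
values = [v if v is not None else 0.0 for v in values]  # empty bins come back as None
print(len(values), values[:5])
bw.close()
```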
## Model Training
To train the Translatomer model, use the following command:
```
python train_all_11fold.py [options]
```
[options]:
- --seed Random seed for training. Default = 2077.
- --save_path Path to the model checkpoint. Default = 'checkpoints'.
- --data-root Root path of the training data (required). Default = 'data'.
- --assembly Genome assembly for the training data. Default = 'hg38'.
- --model-type Type of the model to use for training. Default = 'TransModel'.
- --fold Which fold of the model training to run. Default = 0.
- --patience Number of epochs to wait before early stopping. Default = 8.
- --max-epochs Maximum number of training epochs. Default = 128.
- --save-top-n Number of top models to save during training. Default = 20.
- --num-gpu Number of GPUs to use for training. Default = 1.
- --batch-size Batch size for data loading. Default = 32.
- --ddp-disabled Flag to disable DDP (Distributed Data Parallel) for training; when DDP is enabled, the batch size is adjusted accordingly.
- --num-workers Number of dataloader workers. Default = 1.
Example to run the codes:
```
nohup python train_all_11fold.py --save_path results/bigmodel_h512_l12_lr1e-5_wd0.05_ws2k_p32_fold0 --data-root data --assembly hg38 --dataset data_roots_mini.txt --model-type TransModel --fold 0 --patience 6 --max-epochs 128 --save-top-n 128 --num-gpu 1 --batch-size 32 --num-workers 1 >DNA_logs/bigmodel_h512_l12_lr1e-5_wd0.05_ws2k_p32_fold0.log 2>&1 &
nohup python train_all_11fold.py --save_path results/bigmodel_h512_l12_lr1e-5_wd0.05_ws2k_p32_fold1 --data-root data --assembly hg38 --dataset data_roots_mini.txt --model-type TransModel --fold 1 --patience 6 --max-epochs 128 --save-top-n 128 --num-gpu 1 --batch-size 32 --num-workers 1 >DNA_logs/bigmodel_h512_l12_lr1e-5_wd0.05_ws2k_p32_fold1.log 2>&1 &
```
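Since the per-fold commands differ only in --fold and the output paths, a small launcher (a convenience sketch, not part of the repository) can iterate over all 11 folds; it runs the folds sequentially, logging each to its own file:

```python
import subprocess

# Hypothetical convenience wrapper around train_all_11fold.py: one training
# run per fold, sequentially. Adjust paths and flags to your setup.
for fold in range(11):
    tag = f"bigmodel_h512_l12_lr1e-5_wd0.05_ws2k_p32_fold{fold}"
    cmd = [
        "python", "train_all_11fold.py",
        "--save_path", f"results/{tag}",
        "--data-root", "data",
        "--assembly", "hg38",
        "--dataset", "data_roots_mini.txt",
        "--model-type", "TransModel",
        "--fold", str(fold),
        "--patience", "6",
        "--max-epochs", "128",
        "--save-top-n", "128",
        "--num-gpu", "1",
        "--batch-size", "32",
        "--num-workers", "1",
    ]
    with open(f"DNA_logs/{tag}.log", "w") as log:
        subprocess.run(cmd, stdout=log, stderr=subprocess.STDOUT, check=True)
```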
## Tutorial
- Load the pretrained model. The pretrained model can be downloaded from Zenodo.
- An example notebook containing code for applying Translatomer is here.
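Once downloaded, the checkpoint can be inspected with plain PyTorch before wiring it into the notebook (a generic sketch; the file path below is a placeholder for the Zenodo download):

```python
import torch

# Placeholder path: substitute the checkpoint file downloaded from Zenodo.
ckpt = torch.load("checkpoints/translatomer.ckpt", map_location="cpu")

# Lightning-style checkpoints typically store weights under 'state_dict';
# a bare state dict is just a mapping of parameter names to tensors.
state_dict = ckpt.get("state_dict", ckpt) if isinstance(ckpt, dict) else ckpt
for name in list(state_dict)[:5]:
    print(name, tuple(state_dict[name].shape))
```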
## License

This project is licensed under the MIT License.
## Contact
For any questions or inquiries, please contact xiongxs@zju.edu.cn.