Biological Sequence Embedding Extraction Tutorial

What Is an Embedding?

An embedding (embedding vector) is a numerical vector that represents text, sequences, or other unstructured data. In bioinformatics, embedding models convert biological sequences such as DNA and protein sequences into high-dimensional numerical vectors. These vectors can:

Capture semantic information of sequences: Similar sequences produce similar vectors
Support machine learning: Numerical vectors can be used directly as input to machine learning algorithms
Reduce dimensionality: Compress complex sequence information into fixed-length vectors
Calculate similarity: Measure similarity between sequences through vector distance

Why Extract Embeddings?

In bioinformatics research, extracted embeddings support several common tasks:

Sequence classification: Identify functional types of DNA sequences (such as promoters, enhancers, etc.)
Sequence similarity analysis: Quickly find similar biological sequences
Functional prediction: Predict protein function based on sequence embedding
Evolutionary analysis: Study evolutionary relationships of sequences

Note: Due to limited resources, the API currently offers only the 1.2B and 10B models, supports a maximum embedding length of 128k, and returns embeddings from the last layer only.

Import Libraries

[1]:

from genos import create_client

Create Client

[2]:

client = create_client(token="your_token_here")

Before using the service, make sure you have applied for and received an API token.
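To keep the token out of notebooks and version control, it can be read from the environment instead of being typed inline. The following is a minimal sketch; the variable name `GENOS_API_TOKEN` and the helper `load_token` are assumptions made for illustration and are not part of the Genos client.

```python
import os


def load_token(env_var="GENOS_API_TOKEN"):
    """Read the API token from an environment variable (the name is a placeholder)."""
    token = os.environ.get(env_var, "").strip()
    if not token:
        raise RuntimeError(f"Set the {env_var} environment variable to your API token")
    return token


# Usage, mirroring the cell above (requires the genos package and a valid token):
# from genos import create_client
# client = create_client(token=load_token())
```

Failing early with a clear message is usually easier to debug than a rejected request later on.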

Basic Usage

[4]:

# DNA sequence
sequence = "ATCGATCGATCGATCGATCGATCGATCG"

# Extract embedding; the payload sits under the 'result' key of the response
response = client.get_embedding(sequence)
result = response['result']

# View the full response
print(response)

{'result': {'sequence': 'ATCGATCGATCGATCGATCGATCGATCG', 'sequence_length': 28, 'token_count': 30, 'embedding_shape': [1, 1024], 'embedding_dim': 1024, 'pooling_method': 'mean', 'model_type': 'flash', 'device': 'cuda', 'embedding': tensor([[ 0.0015,  0.0085, -0.0737,  ..., -0.4238, -0.1729,  0.0094]])}, 'status': 200, 'message': None}

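The `result` field of the response contains:

sequence: Input sequence
sequence_length: Length of the input sequence (28 in the example above)
token_count: Number of tokens after tokenization
embedding_shape: Shape of the returned embedding (here [1, 1024])
embedding_dim: Embedding dimension
embedding: Embedding vector
pooling_method: Pooling method used
model_type: Model used
device: Device the model ran on

The response also carries a top-level `status` code and `message`.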
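The fields listed above can be unpacked into a plain Python list for downstream use. The sketch below works on a stand-in dictionary with the same layout as the printed output, so it runs without a token. The helper name `unpack_embedding` is hypothetical, and treating `status == 200` as success is an assumption based on the example output.

```python
def unpack_embedding(response):
    """Check the status code and return the pooled embedding as a flat list of floats."""
    # Assumption: 200 signals success, as in the example output
    if response.get("status") != 200:
        raise RuntimeError(f"Embedding request failed: {response.get('message')}")
    result = response["result"]
    embedding = result["embedding"]
    # torch tensors (as in the printed output) and NumPy arrays both offer .tolist()
    if hasattr(embedding, "tolist"):
        embedding = embedding.tolist()
    vector = embedding[0]  # shape is [1, embedding_dim]; take the single row
    if len(vector) != result["embedding_dim"]:
        raise ValueError("embedding length does not match embedding_dim")
    return [float(x) for x in vector]


# Stand-in response with the same layout as the output above (4 dimensions for brevity)
mock_response = {
    "result": {
        "embedding_dim": 4,
        "embedding_shape": [1, 4],
        "embedding": [[0.1, -0.2, 0.3, 0.0]],
    },
    "status": 200,
    "message": None,
}
vector = unpack_embedding(mock_response)
print(len(vector))
```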

Model Parameters

Available Models

Genos supports multiple pre-trained models:

Model	Parameters	Flash Attention	Use Case
`Genos-1.2B`	1.2 billion	✓	Fast inference, general tasks
`Genos-10B`	10 billion	✓	High-precision tasks

Pooling Methods

Pooling controls how the per-token embeddings are aggregated into a single sequence-level representation:

Method	Description	Output
`mean`	Average pooling (default)	Single vector
`max`	Max pooling	Single vector
`min`	Min pooling	Single vector
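To make the three methods concrete, the sketch below applies each one to a tiny set of made-up token embeddings. It illustrates element-wise pooling in general; how the service computes pooling internally is not documented here, and the helper `pool` is a hypothetical name.

```python
def pool(token_embeddings, method="mean"):
    """Collapse a list of per-token vectors into one vector, dimension by dimension."""
    if not token_embeddings:
        raise ValueError("need at least one token embedding")
    columns = list(zip(*token_embeddings))  # one tuple of values per dimension
    if method == "mean":
        return [sum(col) / len(col) for col in columns]
    if method == "max":
        return [max(col) for col in columns]
    if method == "min":
        return [min(col) for col in columns]
    raise ValueError(f"unknown pooling method: {method}")


# Three tokens, each with a 2-dimensional embedding (made-up numbers)
tokens = [[1.0, -2.0], [3.0, 0.0], [2.0, 5.0]]
print(pool(tokens, "mean"))  # [2.0, 1.0]
print(pool(tokens, "max"))   # [3.0, 5.0]
print(pool(tokens, "min"))   # [1.0, -2.0]
```

Whatever the number of tokens, each method returns one vector whose length equals the embedding dimension, which is why the table lists "Single vector" for all three.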

Applications of Embeddings

Extracted embeddings can be used for:

Sequence similarity calculation:

# Calculate cosine similarity between two sequences
similarity = cosine_similarity(embedding1, embedding2)
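The call above assumes a `cosine_similarity` function is already defined. A minimal standard-library sketch is shown here, using short made-up vectors in place of real 1024-dimensional embeddings.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (1.0 means same direction)."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


embedding1 = [1.0, 2.0, 3.0]
embedding2 = [2.0, 4.0, 6.0]   # same direction as embedding1
embedding3 = [-3.0, 0.0, 1.0]  # orthogonal to embedding1
print(cosine_similarity(embedding1, embedding2))
print(cosine_similarity(embedding1, embedding3))
```

Values near 1 indicate vectors pointing in nearly the same direction and values near 0 indicate unrelated directions. With an API result, the flat embedding row would be passed in place of the toy vectors.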

Sequence classification:

# Train classifier using embeddings
classifier = train_classifier(embeddings, labels)
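`train_classifier` above is a placeholder rather than a library function. One simple way to fill it in is a nearest-centroid classifier, sketched below with the standard library only. The 2-dimensional embeddings, the labels, and the `predict` helper are made up for illustration; real work would more likely use an established library.

```python
import math


def train_classifier(embeddings, labels):
    """Nearest-centroid 'training': store the mean embedding of each class."""
    sums, counts = {}, {}
    for vector, label in zip(embeddings, labels):
        if label not in sums:
            sums[label] = [0.0] * len(vector)
            counts[label] = 0
        sums[label] = [s + v for s, v in zip(sums[label], vector)]
        counts[label] += 1
    return {label: [s / counts[label] for s in sums[label]] for label in sums}


def predict(centroids, vector):
    """Return the label whose centroid is closest in Euclidean distance."""
    return min(centroids, key=lambda label: math.dist(centroids[label], vector))


# Made-up 2-dimensional embeddings with hypothetical labels
embeddings = [[0.9, 0.1], [1.1, -0.1], [-1.0, 0.2], [-0.8, -0.2]]
labels = ["promoter", "promoter", "enhancer", "enhancer"]
classifier = train_classifier(embeddings, labels)
print(predict(classifier, [0.7, 0.0]))  # promoter
```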

Clustering analysis:

# Perform clustering on sequences
clusters = kmeans_clustering(embeddings)
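`kmeans_clustering` is likewise a placeholder. The sketch below implements plain Lloyd's k-means with the standard library so the idea can be run end to end. The parameters `k`, `iterations`, and `seed` are choices made for this example, and the input vectors are made up.

```python
import math
import random


def kmeans_clustering(embeddings, k=2, iterations=20, seed=0):
    """Plain Lloyd's k-means; returns one cluster index per embedding."""
    rng = random.Random(seed)
    # Initialise centroids from k distinct input vectors
    centroids = [list(v) for v in rng.sample(embeddings, k)]
    assignments = [0] * len(embeddings)
    for _ in range(iterations):
        # Assignment step: attach each vector to its nearest centroid
        assignments = [
            min(range(k), key=lambda c: math.dist(v, centroids[c])) for v in embeddings
        ]
        # Update step: move each centroid to the mean of its members
        for c in range(k):
            members = [v for v, a in zip(embeddings, assignments) if a == c]
            if members:  # keep the old centroid if a cluster went empty
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return assignments


# Two well-separated groups of made-up 2-dimensional embeddings
embeddings = [[0.0, 0.1], [0.2, 0.0], [5.0, 5.1], [5.2, 4.9]]
clusters = kmeans_clustering(embeddings, k=2)
print(clusters)
```

The cluster indices themselves are arbitrary; what matters is which sequences share an index.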

Dimensionality reduction visualization:

# Use t-SNE or PCA to reduce dimensionality for visualization
reduced_embeddings = tsne.fit_transform(embeddings)
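The line above assumes a fitted `tsne` object, for example from a machine learning library. As a dependency-free alternative, the sketch below uses PCA (not t-SNE) computed by power iteration to project vectors onto their top two principal components. The function `pca_2d` and its parameters are hypothetical, and the input is made-up data.

```python
import math
import random


def pca_2d(embeddings, iterations=200, seed=0):
    """Project vectors onto their top two principal components (power iteration)."""
    rng = random.Random(seed)
    n, dim = len(embeddings), len(embeddings[0])
    means = [sum(col) / n for col in zip(*embeddings)]
    centered = [[x - m for x, m in zip(row, means)] for row in embeddings]

    def top_component(rows):
        vec = [rng.gauss(0.0, 1.0) for _ in range(dim)]  # random starting direction
        for _ in range(iterations):
            # Apply the (unnormalised) covariance matrix implicitly: X^T (X v)
            scores = [sum(x * v for x, v in zip(row, vec)) for row in rows]
            new = [sum(s * row[j] for s, row in zip(scores, rows)) for j in range(dim)]
            norm = math.sqrt(sum(x * x for x in new))
            if norm == 0.0:
                return [0.0] * dim  # no variance left in these rows
            vec = [x / norm for x in new]
        return vec

    components = []
    rows = centered
    for _ in range(2):
        comp = top_component(rows)
        components.append(comp)
        # Deflation: remove the found direction before searching for the next one
        rows = [
            [x - sum(a * b for a, b in zip(row, comp)) * c for x, c in zip(row, comp)]
            for row in rows
        ]
    return [[sum(x * c for x, c in zip(row, comp)) for comp in components] for row in centered]


# Made-up 3-dimensional "embeddings"; real ones would have 1024 dimensions
embeddings = [[0.0, 0.0, 1.0], [1.0, 2.0, -1.0], [2.0, 4.0, 1.0], [3.0, 6.0, -1.0], [4.0, 8.0, 1.0]]
reduced_embeddings = pca_2d(embeddings)
print(len(reduced_embeddings), len(reduced_embeddings[0]))  # 5 2
```

The resulting 2-dimensional points can be passed to any plotting tool. PCA is linear and deterministic given the seed, whereas t-SNE is nonlinear and better suited to revealing local cluster structure.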

Summary

Through this tutorial, we learned:

Core Concepts

Embedding: Technique for converting biological sequences into numerical vectors
Pre-trained models: Large language models specifically trained for biological sequences
Hierarchical features: Different layers capture sequence information at different levels (the API currently returns only the last layer)

Technical Process

Load pre-trained biological sequence models
Encode DNA sequences into tokens
Obtain embeddings through model inference
Analyze the resulting feature representations (currently last-layer embeddings only)

Practical Value

Accelerate research: Quickly analyze large amounts of biological sequences
Improve accuracy: Use pre-trained knowledge to enhance prediction performance
Support downstream tasks: Provide foundation for classification, clustering, similarity analysis, etc.

Next Examples

Use extracted embeddings for downstream population prediction tasks
Downstream variant prediction tasks based on embeddings (API)
RNA coverage trajectory prediction (API)

Congratulations! You have mastered the basic methods of biological sequence embedding extraction!