Biological Sequence Embedding Extraction Tutorial
What Is an Embedding?
An embedding (embedding vector) is a numerical vector that represents text, sequences, or other unstructured data. In bioinformatics, embedding converts biological sequences such as DNA and protein sequences into high-dimensional numerical vectors. These vectors can:
Capture semantic information of sequences: Similar sequences produce similar vectors
Support machine learning: Numerical vectors can be directly used in various machine learning algorithms
Dimensionality reduction: Compress complex sequence information into fixed-length vectors
Similarity calculation: Measure similarity between sequences by vector distance
Why Extract Embeddings?
In bioinformatics research, embedding extraction has important value:
Sequence classification: Identify functional types of DNA sequences (such as promoters, enhancers, etc.)
Sequence similarity analysis: Quickly find similar biological sequences
Functional prediction: Predict protein function based on sequence embedding
Evolutionary analysis: Study evolutionary relationships of sequences
Note: Because resources are currently limited, the API offers only the 1.2B and 10B models, supports a maximum embedding length of 128k, and returns embeddings from the last layer only.
Import Libraries
from genos import create_client
Create Client
client = create_client(token="your_token_here")
Before using the service, make sure you have applied for and received a token.
Basic Usage
# DNA sequence
sequence = "ATCGATCGATCGATCGATCGATCGATCG"
# Extract embedding (keep the full response so the printed output includes status and message)
result = client.get_embedding(sequence)
# View results
print(result)
{'result': {'sequence': 'ATCGATCGATCGATCGATCGATCGATCG', 'sequence_length': 28, 'token_count': 30, 'embedding_shape': [1, 1024], 'embedding_dim': 1024, 'pooling_method': 'mean', 'model_type': 'flash', 'device': 'cuda', 'embedding': tensor([[ 0.0015, 0.0085, -0.0737, ..., -0.4238, -0.1729, 0.0094]])}, 'status': 200, 'message': None}
The `result` field of the response contains:
sequence: Input sequence
sequence_length: Sequence length
token_count: Number of tokens
embedding_dim: Embedding dimension
embedding: Embedding vector
pooling_method: Pooling method used
model_type: Model used
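These fields can be read with ordinary dictionary access. The sketch below does not call the service. It uses a hand-written stand-in, `mock_response` (a hypothetical name), that mirrors the keys of the printed output but shortens the embedding to 4 values and stores it as a nested list rather than a tensor. The helper `extract_vector` is also hypothetical, and treating `status == 200` as success is an assumption based on the output above.

```python
# Hypothetical stand-in that mirrors the keys of the output printed above.
# The embedding is shortened to 4 values and stored as a nested list;
# the real service returns a tensor of shape [1, 1024].
mock_response = {
    "result": {
        "sequence": "ATCGATCGATCGATCGATCGATCGATCG",
        "sequence_length": 28,
        "token_count": 30,
        "embedding_shape": [1, 4],
        "embedding_dim": 4,
        "pooling_method": "mean",
        "model_type": "flash",
        "device": "cuda",
        "embedding": [[0.0015, 0.0085, -0.0737, 0.0094]],
    },
    "status": 200,
    "message": None,
}


def extract_vector(response):
    """Return the pooled embedding as a flat list of floats."""
    # Assumption: a status of 200 signals success, as in the output above.
    if response.get("status") != 200:
        raise RuntimeError(f"Request failed: {response.get('message')}")
    payload = response["result"]
    vector = list(payload["embedding"][0])  # drop the leading batch axis
    if len(vector) != payload["embedding_dim"]:
        raise ValueError("embedding length does not match embedding_dim")
    return vector


vector = extract_vector(mock_response)
print(len(vector), mock_response["result"]["pooling_method"])
```

With a real response, the embedding is a tensor, so it may need converting (for example with `.tolist()`) before it is handled as a plain list.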
Model Parameters
Available Models
Genos supports multiple pre-trained models:
| Model | Parameters | Flash Attention | Use Case |
|---|---|---|---|
| | 1.2 billion | ✓ | Fast inference, general tasks |
| | 10 billion | ✓ | High-precision tasks |
Pooling Methods
Pooling controls how the per-token embeddings are aggregated into a single sequence-level representation:
| Method | Description | Output |
|---|---|---|
| | Average pooling (default) | Single vector |
| | Max pooling | Single vector |
| | Min pooling | Single vector |
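To make the three pooling strategies concrete, here is a minimal pure-Python sketch of what each one computes over a small matrix of token embeddings. It illustrates the arithmetic only, not the service's implementation. The function `pool` and the strings `"max"` and `"min"` are names chosen for this sketch; only `'mean'` appears in the output shown earlier.

```python
def pool(token_embeddings, method="mean"):
    """Collapse per-token vectors into one sequence-level vector."""
    if not token_embeddings:
        raise ValueError("need at least one token embedding")
    # Transpose so that each tuple holds one embedding dimension across all tokens.
    columns = list(zip(*token_embeddings))
    if method == "mean":
        return [sum(col) / len(col) for col in columns]
    if method == "max":
        return [max(col) for col in columns]
    if method == "min":
        return [min(col) for col in columns]
    raise ValueError(f"unknown pooling method: {method}")


# Toy example: 3 tokens, each with a 3-dimensional embedding.
tokens = [
    [1.0, -2.0, 0.5],
    [3.0, 0.0, 0.5],
    [2.0, 4.0, -1.0],
]
mean_vec = pool(tokens, "mean")
max_vec = pool(tokens, "max")
min_vec = pool(tokens, "min")
print(mean_vec, max_vec, min_vec)
```

Whichever method is used, the output has one value per embedding dimension, so the result is a single fixed-length vector regardless of how many tokens the sequence has.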
Applications of Embeddings
Extracted embeddings can be used for:
Sequence similarity calculation:
# Calculate cosine similarity between two sequences
similarity = cosine_similarity(embedding1, embedding2)
Sequence classification:
# Train classifier using embeddings
classifier = train_classifier(embeddings, labels)
Clustering analysis:
# Perform clustering on sequences
clusters = kmeans_clustering(embeddings)
Dimensionality reduction and visualization:
# Use t-SNE or PCA to reduce dimensionality for visualization
reduced_embeddings = tsne.fit_transform(embeddings)
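The snippets above are pseudocode: helpers such as `cosine_similarity`, `train_classifier`, and `kmeans_clustering` are not defined in this tutorial. As one possible reading, the sketch below implements the first three ideas in plain Python on toy 2-D vectors. It swaps in a nearest-centroid classifier and a basic k-means loop with deterministic initialization as simple stand-ins. All function names, the toy vectors, and the labels are illustrative assumptions; in practice a library such as scikit-learn would be the usual choice.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def centroid(vectors):
    """Element-wise mean of a list of vectors."""
    return [sum(col) / len(col) for col in zip(*vectors)]


def train_centroids(embeddings, labels):
    """Nearest-centroid 'training': one mean vector per label."""
    grouped = {}
    for vec, label in zip(embeddings, labels):
        grouped.setdefault(label, []).append(vec)
    return {label: centroid(vecs) for label, vecs in grouped.items()}


def classify(vec, centroids):
    """Return the label whose centroid is most similar to vec."""
    return max(centroids, key=lambda label: cosine_similarity(vec, centroids[label]))


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(embeddings, k, iterations=20):
    """Basic k-means; the first k points serve as initial centers."""
    centers = [list(v) for v in embeddings[:k]]
    assignments = [0] * len(embeddings)
    for _ in range(iterations):
        # Assign each point to its nearest center.
        assignments = [
            min(range(k), key=lambda c: squared_distance(v, centers[c]))
            for v in embeddings
        ]
        # Recompute each center; an empty cluster keeps its old center.
        for c in range(k):
            members = [v for v, a in zip(embeddings, assignments) if a == c]
            if members:
                centers[c] = centroid(members)
    return assignments


# Toy 2-D "embeddings" standing in for real 1024-dimensional vectors.
embeddings = [[1.0, 0.1], [0.9, 0.2], [0.1, 1.0], [0.2, 0.9]]
labels = ["promoter", "promoter", "enhancer", "enhancer"]

similarity = cosine_similarity(embeddings[0], embeddings[1])
centroids = train_centroids(embeddings, labels)
predicted = classify([0.8, 0.3], centroids)
clusters = kmeans(embeddings, k=2)
print(round(similarity, 3), predicted, clusters)
```

Cosine similarity ignores vector length and compares direction only, which is why it is a common choice for comparing pooled embeddings.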
Summary
Through this tutorial, we learned:
Core Concepts
Embedding: Technique for converting biological sequences into numerical vectors
Pre-trained models: Large language models specifically trained for biological sequences
Hierarchical features: Different layers capture sequence information at different levels
Technical Process
Load pre-trained biological sequence models
Encode DNA sequences into tokens
Obtain embeddings through model inference
Analyze feature representations from different layers
Practical Value
Accelerate research: Quickly analyze large amounts of biological sequences
Improve accuracy: Use pre-trained knowledge to enhance prediction performance
Support downstream tasks: Provide foundation for classification, clustering, similarity analysis, etc.
Next Examples
Use extracted embeddings for downstream population prediction tasks
Downstream variant prediction tasks based on embeddings (API)
RNA coverage trajectory prediction (API)
Congratulations! You have mastered the basic methods of biological sequence embedding extraction!