Biological Sequence Embedding Extraction Tutorial

What Is an Embedding?

An embedding (embedding vector) is a numerical vector that represents text, sequences, or other unstructured data. In bioinformatics, embedding converts biological sequences such as DNA and protein sequences into high-dimensional numerical vectors. These vectors can:

  1. Capture semantic information of sequences: similar sequences produce similar vectors

  2. Support machine learning: numerical vectors can be used directly as input to machine learning algorithms

  3. Provide a compact representation: complex sequence information is compressed into fixed-length vectors

  4. Enable similarity calculation: similarity between sequences is measured through vector distance
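
To make the last point concrete, here is a minimal sketch of cosine similarity, one common choice of vector distance. The vectors are short made-up values standing in for real embeddings, and the function name `cosine_similarity` is illustrative only; it is not part of the Genos client.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (1.0 = same direction)."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)

# Toy 4-dimensional vectors standing in for real embeddings (made-up values).
seq_a = [0.2, 0.1, -0.4, 0.7]
seq_b = [0.25, 0.05, -0.35, 0.8]   # nearly the same direction as seq_a
seq_c = [-0.6, 0.9, 0.3, -0.1]     # points in a different direction

print(cosine_similarity(seq_a, seq_b))  # close to 1
print(cosine_similarity(seq_a, seq_c))  # negative
```

Values near 1 indicate vectors pointing in nearly the same direction; values near 0 or below indicate unrelated or opposed directions.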

Why Extract Embeddings?

In bioinformatics research, embedding extraction is useful for:

  • Sequence classification: Identify functional types of DNA sequences (such as promoters, enhancers, etc.)

  • Sequence similarity analysis: Quickly find similar biological sequences

  • Functional prediction: Predict protein function based on sequence embedding

  • Evolutionary analysis: Study evolutionary relationships of sequences

Note: Because resources are currently limited, the API offers only the 1.2B and 10B models, supports a maximum length of 128k for embedding, and returns embeddings from the last layer only.

Import Libraries

[1]:
from genos import create_client

Create Client

[2]:
client = create_client(token="your_token_here")

Before using the service, make sure you have applied for and received a token, then replace "your_token_here" with it.

Basic Usage

[4]:
# DNA sequence
sequence = "ATCGATCGATCGATCGATCGATCGATCG"

# Extract embedding (the response holds 'result', 'status', and 'message')
response = client.get_embedding(sequence)

# View results
print(response)

{'result': {'sequence': 'ATCGATCGATCGATCGATCGATCGATCG', 'sequence_length': 28, 'token_count': 30, 'embedding_shape': [1, 1024], 'embedding_dim': 1024, 'pooling_method': 'mean', 'model_type': 'flash', 'device': 'cuda', 'embedding': tensor([[ 0.0015,  0.0085, -0.0737,  ..., -0.4238, -0.1729,  0.0094]])}, 'status': 200, 'message': None}

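The `result` field contains:

  • sequence: Input sequence

  • sequence_length: Length of the input sequence

  • token_count: Number of tokens after tokenization

  • embedding_shape: Shape of the embedding tensor ([1, 1024] in this example)

  • embedding_dim: Embedding dimension

  • embedding: Embedding vector

  • pooling_method: Pooling method used

  • model_type: Model used

  • device: Device on which inference ran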
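
As a rough sketch of how these fields might be read in practice, the helper below checks the status and returns the embedding as a flat list. It assumes the response has the structure shown in the printed output above (top-level 'result', 'status', and 'message' keys). The helper name `extract_vector` and the mock response are hypothetical; the mock uses a made-up 4-dimensional embedding so the example runs without calling the API.

```python
def extract_vector(response):
    """Return the embedding from a response dict as a flat list of floats."""
    if response.get("status") != 200:
        raise RuntimeError(f"embedding request failed: {response.get('message')}")
    result = response["result"]
    embedding = result["embedding"]
    # Tensors and NumPy arrays expose .tolist(); plain lists pass through unchanged.
    if hasattr(embedding, "tolist"):
        embedding = embedding.tolist()
    # The shape is [1, dim] in the output above, so take the single row.
    vector = embedding[0]
    if len(vector) != result["embedding_dim"]:
        raise ValueError("embedding length does not match embedding_dim")
    return vector

# Mock response mirroring the printed structure (values are made up).
mock_response = {
    "result": {
        "sequence": "ATCGATCG",
        "sequence_length": 8,
        "embedding_shape": [1, 4],
        "embedding_dim": 4,
        "pooling_method": "mean",
        "embedding": [[0.0015, 0.0085, -0.0737, 0.0094]],
    },
    "status": 200,
    "message": None,
}

vector = extract_vector(mock_response)
print(len(vector), vector)
```

Converting to a plain list keeps the vector independent of any tensor library, which can be convenient when storing embeddings or passing them to other tools.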

Model Parameters

Available Models

Genos supports multiple pre-trained models:

Model        Parameters    Flash Attention   Use Case
Genos-1.2B   1.2 billion                     Fast inference, general tasks
Genos-10B    10 billion                      High-precision tasks

Pooling Methods

Pooling controls how the per-token embeddings are aggregated into a single sequence-level representation:

Method   Description                 Output
mean     Average pooling (default)   Single vector
max      Max pooling                 Single vector
min      Min pooling                 Single vector
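
The three methods can be illustrated with a small pure-Python sketch. This is only an illustration of the general idea of pooling across tokens, not the service's actual implementation; the function name `pool` and the token values are made up.

```python
def pool(token_embeddings, method="mean"):
    """Aggregate per-token vectors (a list of equal-length lists) into one vector."""
    if not token_embeddings:
        raise ValueError("need at least one token embedding")
    columns = list(zip(*token_embeddings))  # one tuple per embedding dimension
    if method == "mean":
        return [sum(col) / len(col) for col in columns]
    if method == "max":
        return [max(col) for col in columns]
    if method == "min":
        return [min(col) for col in columns]
    raise ValueError(f"unknown pooling method: {method}")

# Three tokens with 2-dimensional embeddings (made-up values).
tokens = [
    [1.0, -2.0],
    [3.0, 0.0],
    [2.0, 5.0],
]

print(pool(tokens, "mean"))  # [2.0, 1.0]
print(pool(tokens, "max"))   # [3.0, 5.0]
print(pool(tokens, "min"))   # [1.0, -2.0]
```

Whatever the number of tokens, each method returns one vector with the same dimension as a single token embedding, which is why every row of the table lists "Single vector" as the output.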

Applications of Embeddings

Extracted embeddings can be used for:

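  1. Sequence similarity calculation:

    # Calculate cosine similarity between two embeddings of shape [1, dim] (scikit-learn)
    from sklearn.metrics.pairwise import cosine_similarity
    similarity = cosine_similarity(embedding1, embedding2)
    
  2. Sequence classification:

    # Train a classifier on embeddings (logistic regression from scikit-learn as one option)
    from sklearn.linear_model import LogisticRegression
    classifier = LogisticRegression(max_iter=1000).fit(embeddings, labels)
    
  3. Clustering analysis:

    # Cluster sequences with k-means (n_clusters=3 is an arbitrary example value)
    from sklearn.cluster import KMeans
    clusters = KMeans(n_clusters=3).fit_predict(embeddings)
    
  4. Dimensionality reduction and visualization:

    # Project embeddings to 2D with t-SNE (PCA is a common alternative)
    from sklearn.manifold import TSNE
    reduced_embeddings = TSNE(n_components=2).fit_transform(embeddings)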
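
For a self-contained view of the classification use case, the sketch below implements a nearest-centroid classifier with the standard library only. It is a deliberately simple stand-in for a real classifier, not the method used by Genos; the embeddings, labels, and function names are all made up for illustration.

```python
import math

def centroid(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    return [sum(col) / len(col) for col in zip(*vectors)]

def fit_centroids(embeddings, labels):
    """Group embeddings by label and return {label: centroid}."""
    groups = {}
    for vec, label in zip(embeddings, labels):
        groups.setdefault(label, []).append(vec)
    return {label: centroid(vecs) for label, vecs in groups.items()}

def predict(centroids, vector):
    """Return the label whose centroid is closest in Euclidean distance."""
    return min(centroids, key=lambda label: math.dist(centroids[label], vector))

# Made-up 3-dimensional embeddings with hypothetical labels.
train_embeddings = [
    [0.9, 0.1, 0.0], [1.0, 0.2, 0.1],   # labeled "promoter"
    [0.0, 0.8, 0.9], [0.1, 1.0, 0.8],   # labeled "enhancer"
]
train_labels = ["promoter", "promoter", "enhancer", "enhancer"]

centroids = fit_centroids(train_embeddings, train_labels)
print(predict(centroids, [0.8, 0.0, 0.1]))  # promoter
print(predict(centroids, [0.2, 0.9, 1.0]))  # enhancer
```

With real 1024-dimensional embeddings the same structure applies, although a library classifier trained on more data would normally be a better choice.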
    

Summary

Through this tutorial, we learned:

Core Concepts

  • Embedding: Technique for converting biological sequences into numerical vectors

  • Pre-trained models: Large language models specifically trained for biological sequences

  • Hierarchical features: Different layers capture sequence information at different levels

Technical Process

  1. Load pre-trained biological sequence models

  2. Encode DNA sequences into tokens

  3. Obtain embeddings through model inference

  4. Analyze feature representations from different layers

Practical Value

  • Accelerate research: Quickly analyze large amounts of biological sequences

  • Improve accuracy: Use pre-trained knowledge to enhance prediction performance

  • Support downstream tasks: Provide foundation for classification, clustering, similarity analysis, etc.

Next Examples

  1. Use extracted embeddings for downstream population prediction tasks

  2. Downstream variant prediction tasks based on embeddings (API)

  3. RNA coverage trajectory prediction (API)

Congratulations! You have mastered the basic methods of biological sequence embedding extraction!