2022-06-18: DNA to Protein in R7RS Scheme

A Scheme implementation of the DNA to Protein conversion program given in Python 3 at: https://www.geeksforgeeks.org/dna-protein-python-3/

The program defines two functions: one to read a sequence of text from a file, returning all the contents as one string without line breaks; and a second to convert each DNA triple (codon) in the string into the single letter for its amino acid, giving the protein sequence as a string.

For the first function, we use read-line, which returns the text of one line as a string without its line ending, so our read-sequence-file function can append the lines together to make one complete string.
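
As a small sketch of how read-line behaves, the same loop can be run against a string port instead of a file. The name read-sequence-port and the sample text are assumptions made for this illustration only; open-input-string is from (scheme base).

```scheme
(import (scheme base)
        (scheme write))

;; Hypothetical helper: the same loop as read-sequence-file,
;; but reading from whatever port it is given.
(define (read-sequence-port port)
  (do ((line (read-line port) (read-line port))          ; one line per step, no line ending
       (sequence "" (string-append sequence line)))      ; append the previously read line
    ((eof-object? line) sequence)))                      ; stop at end of input

;; Made-up sample text with three lines.
(define joined
  (read-sequence-port (open-input-string "ATGGCC\nATTGTA\nTAA\n")))

(display joined)   ; prints ATGGCCATTGTATAA
(newline)
```

Because read-line drops the line ending, the three lines come back as a single unbroken string.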

For the second function, we use an association list for the table. We could use a hash table or map, but association lists are built into R7RS-small and the example is small enough that efficiency does not matter.
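
A minimal sketch of an association-list lookup, using a cut-down table. The names small-table and lookup-codon are hypothetical and used only for this illustration. Note that assoc compares keys with equal? by default, which is what makes string keys work, and that it returns #f when the key is absent.

```scheme
(import (scheme base)
        (scheme write))

;; Cut-down table for illustration only; the full program maps all 64 codons.
(define small-table
  '(("ATG" . #\M) ("GCC" . #\A) ("TAA" . #\_)))

;; Hypothetical helper: returns the letter for a codon,
;; or #f if the codon is not in the table.
(define (lookup-codon codon)
  (let ((entry (assoc codon small-table)))   ; entry is a (key . value) pair or #f
    (and entry (cdr entry))))

(display (lookup-codon "ATG"))   ; prints M
(newline)
(display (lookup-codon "NNN"))   ; prints #f
(newline)
```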

Final Program

(import (scheme base)
        (scheme file)
        (scheme write))

(define (read-sequence-file filename)
  (with-input-from-file                                     ; <1>
    filename
    (lambda () 
      (do ((line (read-line) (read-line))                   ; <2>
           (sequence "" (string-append sequence line)))     ; <3>
        ((eof-object? line) sequence)))))                   ; <4>

(define (translate sequence)
  (let ((table                                              ; <5>
          '(("ATA" . #\I) ("ATC" . #\I) ("ATT" . #\I) ("ATG" . #\M) 
            ("ACA" . #\T) ("ACC" . #\T) ("ACG" . #\T) ("ACT" . #\T) 
            ("AAC" . #\N) ("AAT" . #\N) ("AAA" . #\K) ("AAG" . #\K) 
            ("AGC" . #\S) ("AGT" . #\S) ("AGA" . #\R) ("AGG" . #\R)
            ("CTA" . #\L) ("CTC" . #\L) ("CTG" . #\L) ("CTT" . #\L) 
            ("CCA" . #\P) ("CCC" . #\P) ("CCG" . #\P) ("CCT" . #\P) 
            ("CAC" . #\H) ("CAT" . #\H) ("CAA" . #\Q) ("CAG" . #\Q) 
            ("CGA" . #\R) ("CGC" . #\R) ("CGG" . #\R) ("CGT" . #\R) 
            ("GTA" . #\V) ("GTC" . #\V) ("GTG" . #\V) ("GTT" . #\V) 
            ("GCA" . #\A) ("GCC" . #\A) ("GCG" . #\A) ("GCT" . #\A) 
            ("GAC" . #\D) ("GAT" . #\D) ("GAA" . #\E) ("GAG" . #\E) 
            ("GGA" . #\G) ("GGC" . #\G) ("GGG" . #\G) ("GGT" . #\G) 
            ("TCA" . #\S) ("TCC" . #\S) ("TCG" . #\S) ("TCT" . #\S) 
            ("TTC" . #\F) ("TTT" . #\F) ("TTA" . #\L) ("TTG" . #\L) 
            ("TAC" . #\Y) ("TAT" . #\Y) ("TAA" . #\_) ("TAG" . #\_) 
            ("TGC" . #\C) ("TGT" . #\C) ("TGA" . #\_) ("TGG" . #\W))))
    (do ((i 0 (+ i 3))                                      ; <6>
         (result '() (cons (cdr (assoc (substring sequence i (+ i 3))
                                       table))              ; <7>
                           result)))
      ((>= i (string-length sequence))                      ; <8>
       (list->string (reverse result))))))

(let* ((dna-sequence (read-sequence-file "dna_sequence.txt"))
       (protein-sequence (translate dna-sequence))
       (target-sequence (read-sequence-file "amino_acid_sequence.txt")))
  (display "Comparing translated with target: ")
  (display (equal? protein-sequence target-sequence))
  (newline))

  1. Opens the given file and makes it the current input port while the lambda runs.
  2. Reads each line from the current input port as a string ...
  3. ... joining the strings together in turn ...
  4. ... until the end of file is reached, when the accumulated sequence is returned.
  5. The conversion table is stored in an association list.
  6. An index takes us through the string, one triple at a time ...
  7. ... looking up each codon in turn, and recording the protein letter.
  8. When the string is fully processed, the list of letters is reversed and turned into a string to return.
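
The translate function above assumes that the sequence length is a multiple of three and that every codon is in the table. Otherwise substring or cdr will signal an error. The following is a sketch of a more defensive variant, not part of the original program: the name translate-safely, the table argument, and the use of #\? for unknown codons are all assumptions made for illustration.

```scheme
(import (scheme base)
        (scheme write))

;; Cut-down table for illustration only.
(define small-table
  '(("ATG" . #\M) ("GCC" . #\A) ("TAA" . #\_)))

;; Hypothetical variant of translate: takes the table as an argument,
;; ignores a trailing partial codon, and marks unknown codons with #\?.
(define (translate-safely sequence table)
  (let ((len (string-length sequence)))
    (do ((i 0 (+ i 3))
         (result '()
                 (let ((entry (assoc (substring sequence i (+ i 3)) table)))
                   (cons (if entry (cdr entry) #\?) result))))
      ((> (+ i 3) len)                       ; stop when no full codon remains
       (list->string (reverse result))))))

(display (translate-safely "ATGGCCTAAG" small-table))   ; prints MA_
(newline)
(display (translate-safely "ATGNNN" small-table))       ; prints M?
(newline)
```

Whether to skip, mark, or reject bad input depends on the data. The Python original, for instance, only translates when the length is a multiple of three.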

Page from Peter's Scrapbook, output from a VimWiki on 2024-01-29.