Design and assembly of DNA molecules using multi-objective optimization

Abstract Rapid engineering of biological systems is currently hindered by limited integration of manufacturing constraints into the design process, ultimately reducing the yield of many synthetic biology workflows. Here we tackle DNA engineering as a multi-objective optimization problem aiming at finding the best tradeoff between design requirements and manufacturing constraints. We developed a new open-source algorithm for DNA engineering, called Multi-Objective Optimisation algorithm for DNA Design and Assembly, available as a Python and Anaconda package, as well as a Docker image. Experimental results show that our method provides near-optimal constructs and scales linearly with design complexity, effectively paving the way to rational engineering of DNA molecules from genes to genomes.


Introduction
Recent advances in synthetic biology and DNA synthesis technologies are enabling significant scientific and biotechnological breakthroughs, including the engineering of pathways for drug production [1], the construction of minimal bacterial cells [2] and the assembly of synthetic eukaryotic chromosomes [3].
Pivotal to these achievements has been the adoption of an iterative engineering workflow, known as the Design-Built-Test-Learn cycle (DBTL). The DBTL workflow starts with a design step where biological components, such as genes or promoters, are selected to be assembled into a construct to obtain a specific phenotype; usually, the output of this process is a sequence of DNA to be synthesized. The designed sequence is then built and cloned into a host organism, and then tested to assess whether the design requirements are met, e.g. gene expression levels and protein abundance. The testing phase then informs the learning step, which in turn aims at improving the design of the initial construct using the phenotypic information acquired.
Interestingly, the inherent waterfall structure of the DBTL workflow introduces dependencies between steps, making engineering biological systems still a complex task. This is especially true for the design and build steps; in particular, the design space is strongly constrained by the DNA synthesis process, since phosphoramidite synthesis poses limits on molecule length and content. These limitations are usually overcome by splitting the designed sequences into shorter fragments, which can be then put together through molecular assembly techniques [4,5], at the cost of increasing complexity both for the design and manufacturing steps. Ultimately, recoding the design to meet manufacturing constraints often leads to molecules with substantially different content and properties, effectively breaking the DBTL workflow.
Software have been developed to assist biological engineers in implementing the DBTL cycle, in particular for the design step, with tools such as Double Dutch [6], Cello [7], j5 [8], Raven [9], BOOST [10] and BioPartsBuilder [11]. Nevertheless, current software simply automates the process of designing and adapting the sequence for synthesis, often leading to suboptimal designs and lacking quantitative measures to evaluate design quality.
Here we build on our experience in mathematical programming methods for electronic design automation [12,13] to solve the conundrum of designing DNA for manufacturability. Similar to how electronic circuits design is informed by physical and silicon manufacturing requirements, we formulated the design and build steps as a multi-objective optimization problem, aiming at finding the best trade-off between design and manufacturing requirements. Thus, rather than a single construct, biological engineers will be presented with a set of manufacturable designs to choose from for downstream work. We also introduce analytical measures to assess design quality and algorithmic performances, which are sorely lacking in the biological design automation field.
We developed a new optimization algorithm to solve this complex multi-objective problem, which is implemented as part of an open-source Python software called Multi-Objective Optimisation algorithm for DNA Design and Assembly (MOODA). The software is released on PyPi, Anaconda, and as a Docker image, whereas the source code, documentation and examples are available on GitHub (https://github.com/stracquadaniolab/mooda).
We tested MOODA on an extensive synthetic DNA constructs dataset to assess the quality of the proposed designs and its computational efficiency. Experimental results show that MOODA can effectively identify near-optimal designs regardless of sequence complexity and its running time scales linearly with the number of objectives and the sequence length.

Materials and methods
Here we introduce a multi-objective formulation of the DNA design and assembly problem. We assume that the input is a DNA sequence, where coding regions have been annotated. We then propose an optimization algorithm to identify trade-off solutions for an arbitrary number of design and manufacturing requirements.

A multi-objective formulation of the DNA design and manufacturing problem
Let x be a sequence over the DNA alphabet Σ = {A, T, C, G} and F = (f 1 , f 2 , · · · , f k ), with f k : Σ → R and k being the number of design and manufacturing requirements, which we also call objectives. We assume that requirements can be evaluated computationally by a function, which returns a fitness measure for the sequence. Without loss of generality and to avoid ambiguities, we assume that all objectives must be minimized.
We can define a multi-objective optimization problem as follows: where for k = 1 the problem reduces to a standard single-objective optimization problem; however, for k > 1, it is usually not possible to find x such that all objectives are simultaneously minimized and, instead, we look for trade-off solutions. Let x 1 , x 2 be two sequences over the DNA alphabet, x 1 dominates x 2 , denoted as In mathematical terms, the set of trade-off solutions, or Pareto optimal set, is made of all the non-dominated solutions for F, which is the set of sequences that cannot improve an objective without worsening at least another one. The image of the nondominated solutions with respect to the mapping F is called Pareto front; geometrically, the Pareto front is bounded by an ideal point, which is the vector defined by all the minima, and the nadir point, which is the vector defined by all the maxima, thus representing the theoretical worst possible solution. In general, we cannot find the true Pareto optimal set unless boundary conditions are met, but approximations are usually sufficient in practice [14].
A plethora of methods have been proposed in literature to solve multi-objective problems, both deterministic [15,16] and stochastic [17][18][19][20]. While deterministic methods provide strong proofs of convergence and optimality, they are usually difficult to apply to non-numerical problems. Conversely, stochastic methods, such as genetic algorithms or evolutionary strategies, are domain agnostic and work well in practice, although lacking theoretical guarantees on convergence and optimality.

A Multi-Objective Optimisation algorithm for DNA Design and Assembly
Here we describe a new stochastic optimization algorithm, called Multi-Objective Optimisation algorithm for DNA Design and Assembly (MOODA). The basic unit of operation is the solution data structure z = (s, b), where s is a DNA sequence and b is the list of DNA fragments (or blocks) required to assemble arbitrary long sequences. Blocks are represented as sequence intervals to take advantage of interval algebra for downstream operations. Hereby we refer to z as the solution for a problem F involving k design and manufacturing objectives.
The algorithm takes as input a DNA sequence s, which is cloned n times to build an initial pool P of n solutions; the initial sequence is randomly split into fragments of approximately the same size, each one of size l min ≤ l ≤ lmax, with l min and lmax being the minimum and maximum DNA fragment that can be synthesized. Then, at each iteration t, each solution in P is cloned, randomly edited and evaluated with respect to the objective functions F. From the resulting pool of 2n solutions, n are selected for the next iteration. The algorithm stops when the maximum number of iterations Tmax is reached. An overview of the algorithm is presented in Alg. 1.
Hereby we describe the edit and selection procedures, which are the key components of our method.
Sequence editing and assembly operators. The edit operators are local search procedures, which take in a solution as input and return a new, possibly better design. We defined procedures to edit both DNA sequences and blocks; sequence edits are limited to coding regions because we can safely introduce silent mutations to match requirements, whereas block edits are limited by the minimum and maximum DNA fragment size that is possible to synthesize. We defined four edit procedures that cover most common scenarios; however, MOODA can be easily extended with custom functions to introduce different types of changes.
GC optimization operator. The GC content of a DNA fragment is often a major hurdle to its synthesis; usually, synthesis providers have stringent admissible ranges on GC content and sequences have to be recoded to meet this requirement. Nonetheless, the GC content is often associated with specific phenotypes [21], thus any GC optimization procedure should edit a DNA sequence in such a way that the encoded biological function is not affected. This requirement restricts the application of GC content optimization only to coding regions, since GC can be optimized by using synonymous codons, which guarantee that the final protein product is not altered.

Algorithm 1
Multi-Objective Optimisation algorithm for DNA Design and Assembly Evaluate(F, R) 9.
return CDSe 15. end procedure Here we define a GC optimization operator, which recodes a random coding sequence (CDS) by probabilistically using codons that bring its GC content closer to a user-defined target T GC (see Alg. 2). The GC procedure acts only on one coding region at the time and allows improvement of at most σ GC percent respect to the original sequence; here, we adopted this strategy to increase design diversity and avoid biases and divergent sequences.

Codon optimization operator. Transplanting genes and path-
ways between organisms often require changing their primary sequence at the codon level to ensure expression. Moreover, coding regions are often recoded to increase gene expression, as a way to maximize the production of a particular protein [22]. However, the relationship between codon usage and gene transcription is poorly understood [23]. Our codon optimization operator probabilistically recodes a fraction σc of the codons of a given gene, by silently replacing the current codons with the most frequent one, in accordance with an input codon usage table T CF . As for the GC optimization operator, to increase the diversity of the pool of designs generated by our method, we do not apply codon optimization to all CDSs at the same time, but only to one at random.
Block split operator. Current technologies do not allow synthesis of arbitrary long DNA molecules, thus requiring a construct to be split into shorter fragments and then reassembled using DNA assembly techniques. [24]. However, excessive fragmentation can be both expensive and increase the turnaround of the assembly process. The block split operator divides a DNA sequence into shorter fragments, whose length is between l min and lmax nucleotides; by design, the operator enforces homogeneity in block length by splitting sequences into blocks of discrete length, which is controlled by a parameter σ b .
Block join operator. The block join operator reduces the number of blocks by joining two consecutive blocks, thus decreasing the number of parts to assemble. The procedure selects two blocks at random and join them into a new longer block; if the new block exceeds the block maximum size, it is divided again into two new blocks with a size that is a multiple of the step size parameter σ b and within the maximum and minimum block length, respectively, lmax and l min . As for all our operators, we enforce diversity in our pool of designs by applying the join procedure only to a pair of blocks at the time.
All our operators are designed to generate overlaps between adjacent blocks compatible with Gibson assembly [24]; however, new assembly methods can be easily defined in Python and integrated with our package.

Selection of trade-off solutions. A crucial step of our method
is the selection procedure, where non-dominated solutions are picked for the next iteration. To do that, all solutions are compared to each other and assigned a rank based on the number of solutions they are dominated by; in this case, non-dominated solutions are those with the lowest rank. The domination criteria give the same weight to every objective function, improving the probability to find balanced trade-offs [25].
Once all solutions are ranked, they are ordered first based on their rank and then based on a distance metric, called crowding distance, defined as follows: where d(z j ) is the crowding distance related to the j solution and w f i (z j ) is the crowding distance with respect to the objective function f i , whereas f i (z j+1 ) and f i (z j−1 ) are the closest solutions to z j with respect to f i . We also denote with f max i and f min i the maximum and the minimum value found by the algorithm for the objective function f i , respectively. The crowding distance is a measure of how close two solutions are on the objective functions space and enable the algorithm to select solutions in unexplored regions of the Pareto Front. After the ranking, the top n solutions are selected for the next iteration.
The selection step is the most critical step for two reasons; first, since the non-dominated sorting procedure has complexity O(kn 2 ) and it is executed at each iteration, using large pool sizes will dramatically increase the running time of the algorithm. Second, since at most n solutions are selected at each iteration, other nondominated solutions can be discarded because of poor crowding distance score, effectively causing loss of information.
Here we address these problems by storing all solutions in an ad hoc data structure, called archive, whose size m >> n is set by the user. When the archive is full, m non-dominated solutions are retained, eventually discarding the others based on their crowding distance value. By setting the pool size smaller than the archive size, we are decreasing the running time of the sorting procedure with only a negligible cost in terms of memory consumption; moreover, by storing m >> n non-dominated solutions found during the optimization process, we are effectively returning more solutions at a fraction of the running time required for optimizing a pool of size m.
Software implementation and availability. We implemented MOODA in Python following object oriented programming best practice. The software requires as input an annotated GenBank file and a yaml configuration file, which specifies objective functions, and design operators and the assembly method to use to join DNA blocks. Our software can be easily customized by implementing new design operators and objective functions. New design operators should extend the base Operator class and override the apply method, which implements a user-defined procedure to manipulate a solution. Analogously, a new objective function should extend the base ObjectiveFunction class and override the eval method, which in turn will return a real value associated with the fitness of the input solution. The source code, documentation and a fully working example are available at https://github.com/stracquadaniolab/mooda, while packages are released on Anaconda and PyPi repositories and as a Docker image.

Design and manufacturing objectives
We assessed the performance of our method by studying four competing design and manufacturing requirements; these are common to most DNA engineering tasks and have an easily interpretable form useful to assess the performance of our method.
GC content objective function. The GC content of each designed DNA fragment must be within the limits specified by a DNA synthesis provider. Here we assume that an ideal GC value, T GC , is provided in input. Thus, we can mathematically define the GC content objective as follows: where z is a solution, and B(z) are the set of blocks defined in z.
The optimal value for f 1 is 0, which is obtained when GC(b) = T GC .
To obtain an upper bound we used a heuristic procedure, where we replaced the codon of each coding region in the input sequence with the one maximizing the difference with respect to T GC ; successively, we divided the sequence into the maximum admissible number of blocks and evaluated the objective function.
Codon usage objective function. One of the most common operations in synthetic biology is the transfer of genes or pathways from one organism to another. Nevertheless, each organism has its codon usage, since for each amino acid some codons are less common than others, and so are the related tRNAs [23], ultimately resulting in slower translation. Thus, we considered an objective function that rewards designs using the most frequent codons as follows: where z is a candidate solution, Q is the frequency of the most frequent codon for the amino acid aa(c) encoded by the codon c in a given species, and q is the frequency of codon c used in z. The lower bound for the codon usage objectives function is 0, which is obtained when each amino acid is encoded by the most frequent codon in the target species; conversely, the upper bound is obtained when all rare codons are used. Although our objective function is not accurate, introducing a new accurate model for evaluating translation efficiency is outside the scope of this research work.
Block length variance objective function. DNA assembly methods work best when the fragments of DNA have approximately the same size. Thus, we reward designs with blocks of homogeneous size by defining the following objective function: where b belongs to the set of blocks B(z) of the solution z, l is the length of the block andl is the average block length in the design z.
The block variance minimum is 0 when each block has the same length, whereas its maximum is (lmax − l min ) 2 /4, with l min and lmax being the minimum and maximum admissible fragment length, respectively.

Block number objective function. A small number of blocks usu-
ally simplifies and speeds up the assembly process. Thus, we evaluated each solution considering the number of blocks required for the assembly as follows: where B(z) is the set of blocks defined by z. Obviously, the minimum value is l(s)/lmax, whereas the maximum number of blocks is simply l(s)/l min . Achieving an optimal design with respect to all four requirements is not trivial, as they have conflicting objectives. For example, optimizing codon usage can introduce AT/GC rich regions in a construct; similarly, while splitting the construct in fragments could overcome GC restrictions, it increases manufacturing complexity. It is clear that as the complexity of the constructs and the number of requirements increase, finding an optimal trade-off is challenging.
Finally, it is important to note that the minimum and maximum of an objective function might be difficult or impossible to compute; however, using heuristic estimates, such as the value of the best known solution, usually works well in practice.

Results
We tested MOODA on an extensive dataset of DNA constructs to assess the quality of solutions and its computational efficiency.
Currently, no benchmark is available to evaluate DNA design methods, effectively hindering a fair assessment of the methods available in literature. Therefore, as part of our work, we developed a testbed to generate DNA sequences with tunable features.
Here we assume that our input sequences represent modular designs consisting of a set of transcription units (TUs) made of a promoter, a CDS and a terminator [26]. We then parameterized our dataset considering the number of TUs encoded, the length of the constructs, their GC content and codon usage. The length of the CDS of each TU was set by sampling the number of codons from a Poisson distribution with λ = 250, which is approximately equal to the average number of codons in Escherichia coli genes (288.67 codons in the HUSEC2011 strain) [27], whereas the amino acid sequence and the frequency of each codon were generated at random. We then set the length of promoters and terminators by sampling from a Poisson distribution with λ = 500 bp. For each TU component, the GC content of the sequence was set at random by sampling from a Beta distribution with α = k × t and β = k × (1 − t) with k = 150 and t = 0.55; this leads to TUs with a GC content of ∼ 55% on average. Finally, we generated three datasets consisting of 10 sequences made of 5, 10, 20 TUs, with a final sequence length ranging from 8481 bp to 35 264 bp.
We then redesigned our 30 sequences with respect to four design problems characterized by a varying number objectives, namely P1, P2, P3 and P4 (see Supplementary Table S1); ultimately, we tested our method on a benchmark of 120 design problems.
We run the standard MOODA implementation on our benchmark using the parameters reported in Supplementary Table S2, and in Supplementary Table S3 for the sequence editing operators. Since MOODA is a stochastic algorithm, we performed five independent runs for each problem and parameters combination to estimate the expected quality and optimality of the designs.

Evaluation of design quality
Evaluating the quality of solutions returned by multi-objective optimization algorithms is not trivial, since standard metrics, such as the root mean square error (RMSE), are poor performance indicators. Instead, we used the hypervolume indicator, which is a robust metric used for assessing the quality of a set of Pareto optimal solutions [28]. Let y ∈ R k be a vector of size k, where y i is the value of the i-th objective function. The hypervolume indicator is a function V k : R k → R returning the volume enclosed by the union of the polytopes p 1 , · · · , p i , · · · , p k , where p i is the intersection of the hyperplanes arising from y i and the axes. In practice, V k provides an approximation of how many solutions are dominated by a set of Pareto optimal solutions, where the higher the values of V k , the better is the quality of the non-dominated set.
Computing the hypervolume requires the definition of a reference point, estimated either analytically or numerically; in our case, we used the minimum value of each objective function as reference point. It is important to note that the hypervolume value is an unscaled metric; thus, its interpretation is not straightforward. To overcome this problem, we first evaluate the hypervolume of the search space, VΩ, by computing the hypervolume for the polytope bounded by the reference point and the nadir point; here we defined the nadir point as the vector of the maxima of each objective function. Then, we computed the normalized hypervolume, NV k , as V k /VΩ; intuitively, NV k values close to 1 are associated with optimal trade-off solutions.
We analyzed the quality of solutions for Problems P1 and P2 and observed that MOODA achieves near-optimal results regardless the number of TUs in each construct, with an average normalized hypervolume of 0.95, ranging from 0.93 to 0.97 (see Figure 1). Interestingly, we observed negligible differences in design quality between parameter settings, although better solutions are usually found with a higher number of iterations rather than large design pools.
We then analyzed solutions for the 3-objective Problems P3 and P4. Consistent with our previous findings, we obtained excellent results for P3 regardless of the number of TUs, with an average NV k = 0.94, ranging from 0.90 to 0.97; as expected, we see a linear decrease in quality with the increasing number of TUs, albeit always approximately > 0.93 on average. As already observed, better solutions are usually obtained by increasing the number of iterations rather than the size of the pool; this difference becomes evident when designing constructs with 20 TUs (see Figure 1C and Supplementary Figure S1). Surprisingly, we found worse performances on P4, which includes the codon usage objective function, with NV k = 0.5 on average (see Figure 1D). Upon inspection of the non-dominated sets, we found that the codon usage objective function was consistently far from the optimal value. We then reasoned that this could be because the codon usage procedure changes very few codons, resulting in extremely suboptimal designs. Thus, we run MOODA on P4 by allowing the codon usage operator to alter more codons, by setting σc = 0.75; as expected, we observed an increase in quality, albeit limited to 0.64 on average (see Supplementary Figure S2). This result suggests that as GC content is taken into account, finding a trade-off with codon usage becomes significantly more difficult. Evaluation of design quality. We report the normalized hypervolume values, NV k , for different parameter settings for the Design Problems A) P1 (GC content, block number), B) P2 (GC content, block variance), C) P3 (GC content, block variance and block number) and D) P4 (GC content, codon usage, block number). The normalized hypervolume, NV k , is the ratio between the hypervolume, V k , of the trade-off solutions generated by MOODA and the hypervolume of the design space, VΩ. We report normalized hypervolume values for each design problem at increasing number of transcription units; here n and Tmax represent the pool size and the number of iterations, respectively.
Taken together, we showed that MOODA provides near-optimal designs for the vast majority of test cases. We found that the algorithm performs remarkably well despite no tuning of the editing operators, suggesting the overall robustness of our framework.

Evaluation of design optimality
The normalized hypervolume indicator provides a quantitative measure of solutions quality, but it does not inform on whether the solutions found by the algorithm are the best trade-offs possible. Here we studied which parameter settings are likely to provide optimal trade-off solutions, that is solutions that are globally Pareto optimal. In general, rigorous proof of global optimality is non-deterministic polynomial acceptable problems (NP)-hard; thus, we relaxed our requirements and reverted to an approximate measure.
We defined the approximately global Pareto optimal set as the union of all non-dominated solutions identified by u independent algorithms for a given set of objective functions. In our experiments, for each design problem, we obtained an approximate global Pareto optimal set,P f , by combining non-dominated solutions obtained by running MOODA with different parameters settings. Then, we computed Rθ, that is the proportion of global Pareto optimal solutions found by running MOODA with parameter setting θ, normalized according to the pool size (see Supplementary Table S2); intuitively, the best parameter setting will have values of Rθ close to 1.
We found that MOODA consistently finds the vast majority of global Pareto optimal solutions when setting the pool size to n = 100 and the maximum number of iterations to Tmax = 200, with Rθ values ranging from 0.3 for Problem P1 to 1 for Problem P4 (see Figure 2). Consistent with our design quality analysis, we observed a linear dependency between the number of iterations and higher Rθ values (0.5 on average), with significant differences depending on the number of TUs in the construct, ranging from 0.05 for P1 to 1 P4. Conversely, we observed that the algorithm requires large pools when increasing the number of objectives in P3 and P4, suggesting that as the design space becomes bigger, more solutions need to be sampled.
Here we showed that the probability of finding globally optimal trade-off depends on the number of iterations the algorithm Figure 2. Evaluation of design optimality. We report the percentage of globally Pareto optimal solutions, R θ , derived from the global Pareto front,P f , for the four Design Problems A) P1 (GC content, block number), B) P2 (GC content, block variance), C) P3 (GC content, block variance and block number) and D) P4 (GC content, codon usage, block number). We report R θ values for each design problem at increasing number of transcription units; here n and Tmax represent the pool size and the number of iterations, respectively. is allowed to perform. This result suggests that promising solutions are likely to be found as a result of iterative improvements, rather than by simple stochastic sampling.

Computational complexity analysis
We then analyzed the running time of our algorithm on all instances of our benchmark. For consistency, we performed all our experiments on a system with 2 Intel Xeon Gold 6130 central processing units (16 cores, 2.10 Ghz), 128 Gb DDR4 random access memory and running Scientific Linux 7; we then recorded the user time and averaged across five independent runs.
We found the running time to scale linearly with the number of TUs and iterations (see Figure 3), with a running time ranging from 100 to 8000 seconds. Moreover, while the time remains comparable across P1, P2 and P3, we found MOODA to be substantially slower on P4; this can be explained by the use of the codon usage operator, which is computationally taxing.
Since the quality and the number of global Pareto optimal solutions depend more on the number of iterations than the pool  size, we decided to test whether we could obtain the same performances at a lower computational cost, by using the same number of iterations but reducing the pool size by 5-fold. To mitigate the risk of finding fewer Pareto optimal solutions, we used the archive system implemented in MOODA, by setting its size m = 100 for all experiments (see Supplementary Table S4); with these settings, the maximum number of non-dominated solutions remains approximately comparable between different experiments. We then evaluated the quality of the designs obtained in terms of normalized hypervolume and compared these values to the normalized hypervolume values obtained with standard parameter settings (see Supplementary Table S2), limiting our analysis to experiments with an equal number of iterations for Problems P1, P3 and P4.
Interestingly, we found that using the MOODA archive system effectively compensates for the smaller pool size; specifically, the algorithm was able to produce near-optimal results (see Figure 4), showing negligible differences compared to the design produced by running MOODA with standard parameter settings (see Figure 5). Conversely, the difference in running time is extremely significant (see Figure 6), with a drop of 2000 seconds on average; in particular, the archiving system exponentially reduces the running time on more complex problems (e.g. TUs= 20), leading to MOODA being up to 2.2 h faster for P4.
Taken together, we showed that MOODA has a running time growing linearly with sequence complexity. The use of an archiving system to keep track of non-dominated solutions effectively reduces the computational burden of our method.

Figure 5.
Comparison of design quality between standard MOODA and MOODA using the archive system. We report the difference in normalized hypervolume, ∆NV k , between standard MOODA and the MOODA with the archive system for the Design Problems A) P1 (GC content, block number), B) P3 (GC content, block variance and block number) and C) P4 (GC content, codon usage, block number). Positive values of ∆NV k are associated with better quality of the standard MOODA solutions compared to the archive version.

Discussion
Advances in chemical synthesis and molecular assembly techniques have enabled a plethora of synthetic biology applications of increasing complexity. Nonetheless, designing a DNA construct that can be easily manufactured remains a complex and time-consuming task.
Here we developed a new mathematical framework and a companion algorithm to tackle the design and assembly of a biological construct as a multi-objective optimization problem, aiming at finding the best trade-offs between conflicting design and manufacturing requirements. To the best of our knowledge, this is the first time that the concept of Pareto optimality has been proposed to simultaneously design and plan the assembly of DNA molecules. Moreover, we introduced quantitative measures of design quality, which provide useful information to speed up the design-build-learn-test cycle.
We performed extensive experiments and showed that our approach can find near-optimal manufacturable designs for arbitrary long and complex DNA molecules. We found that the probability of finding optimal trade-off solutions scales linearly with the number of iterations allowed to our method, and it is only marginally affected by the size of the pool of solutions. We further refined our algorithm by adding an archiving system to keep track of non-dominated solutions found throughout the optimization process, which dramatically reduces the running time of our method and ultimately allows end users to run complex analyses on standard desktop machines. We released our software as an open-source Python package, which can be easily installed from PyPi, Anaconda or Docker and extended with plug-ins. Figure 6. Comparison of the running time between standard MOODA and MOODA using the archive system. We report the difference in running time, measured in seconds, between standard MOODA and MOODA with the archive system for the Design Problems A) P1 (GC content, block number), B) P3 (GC content, block variance and block number) and C) P4 (GC content, codon usage, block number). Positive values of ∆Time are associated with MOODA archive system being faster than the standard implementation.
We are also aware of the limitations of our work. In particular, like every optimization method, the quality of solutions depends on the effectiveness of the search procedures and the accuracy of the objective functions to capture specific requirements; in biology, this has often proved to be a complex problem itself, as we experienced in our codon usage optimization experiments. Nonetheless, as models of biological processes become more accurate, defining objective functions that can exactly capture biological behavior will be feasible, and our method is ready to take advantage of these advances.
Ultimately, with the advent of large-scale synthetic genome projects, our framework provides exciting opportunities to do extensive chromosome editing in mammalian and plant systems.

Supplementary data
Supplementary data are available at SYNBIO Online.

Author contributions
A.G. and G.S. conceived the algorithm. A.G. developed MOODA and performed experiments. V.Z. developed supporting software. A.G. and G.S. analyzed experimental results. A.G. and G.S. wrote the manuscript.