fast sequential Markov coalescent simulation of genomic data under complex evolutionary models

Changes in fsc28 relative to fsc27

New features

New syntax in the .tpl files to deal with sample heterogeneity. We introduce the concept of SFS pools, where the SFS of different samples can be computed as a pool. This allows any spatial or temporal heterogeneity to be taken into account. New keyword “sfspool” in the deme size section
Possibility to record the deme of origin of chromosome segments when implementing an admixture event, so that it is possible to simulate chromosome painting. New keyword “recordAdmOrigin” in historical events
New command line options (-y and -z) to fine tune the parameter estimation procedure
Other changes and bug corrections:

When simulating several data files with definition files (.def) the SFSs are written in different files, either in separate directories with the -j option, or in the same directory without the -j option
Program was crashing when simulating exponential growth and migration. Bug found by Jason Weir.
Optimisation of computations when estimating data from multidimensional SFS
Bad computation of lhood when estimated from the maxL.par files as compared to that computed during parameter estimations, in case of population growth. Bug found by Kyle Lewald
Incorrect simulations from par files when some demes are explicitly killed. Bug found by Kyle Lewald

Changes in fsc27 relative to fsc26

New features

New syntax in the .est files. It is now possible to include previously defined simple parameters as search range delimiters. The keyword paramInRange needs to be specified at the end of lines containing such parameters.
New keyword in .par or .tpl file: absoluteResize. It allows a given sink population to take a new absolute size, independently of its previous size. It eliminates the need to compute this resize as a complex parameter in the .est file
The [RULES] section has been suppressed from input files. It is simply not read anymore. These rules have become obsolete given the new syntax described in point.
SNP data types are not considered anymore, as they led to biased simulations. Use short segments of DNA and the -sX option to generate X SNPs instead
Simulations of large and sparsely occupied structured populations have been optimized and can be up to 10 times faster than in the previous version. There is very little gain for simulations with a small number of migration-connected demes, though.
Simulations of large recombining chromosomes have been optimized when using large values of the -k option
Generation of genotype table (.gen file) as an alternative output to Arlequin (-G option). The additional -g option allows one to generate diploid genotypes (coded as 0, 1 or 2) instead of haploid genotypes (coded as 0 or 1)
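The relation between the two codings can be sketched as follows. This is only an illustration, not the program's code: it assumes that consecutive pairs of haploid genotypes at a site belong to the same individual and that the diploid code counts the copies of the allele coded 1. The function name haploid_to_diploid is hypothetical.

```python
from typing import List


def haploid_to_diploid(haploid_row: List[int]) -> List[int]:
    """Sum consecutive pairs of 0/1 haploid genotypes into 0/1/2 diploid genotypes.

    Assumption (not stated in the release notes): gene copies 2k and 2k+1
    at a site belong to individual k.
    """
    if len(haploid_row) % 2 != 0:
        raise ValueError("need an even number of haploid genotypes")
    if any(g not in (0, 1) for g in haploid_row):
        raise ValueError("haploid genotypes must be coded 0 or 1")
    return [haploid_row[i] + haploid_row[i + 1]
            for i in range(0, len(haploid_row), 2)]


# One site scored in six gene copies, i.e. three diploid individuals.
site = [0, 1, 1, 1, 0, 0]
print(haploid_to_diploid(site))  # [1, 2, 0]
```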
Possibility to “kill” demes, such as to make them inaccessible to migration. Setting a sink deme size to zero (using a sink resize of zero in a historical event) will now prevent further migration to this deme. This is useful as one can keep the same migration matrix after the disappearance of some demes (e.g. due to population fusion backward in time).
Comments are now possible at the end of any line of .est files.
Other changes and bug corrections:
- When a deme size goes to zero (e.g. due to negative growth), a warning is only produced if the deme is occupied (thanks to David Marques for requesting this change).
- Bug corrected when computing likelihood with ghost populations and a single sampled deme.
- Corrected bug (found by David Marques) with options --noSingleton and --foldedSFS in the presence of ghost populations (the max est lhood was larger than the max obs lhood).
- Corrected bug occurring when computing the position of the next recombination event in case of very small recombination rates (thanks to Silvert Martin)
- Corrected important bug (thanks to David Marques) in case of the introduction of population growth at a given point in a population of initial constant size. The population size was adjusted as if there had been growth since generation zero.
- Corrected bug (thanks to Yu Sugihara) when generating diversity based on random parameters and using -Ex option when x >1.
- Corrected bug (thanks to Jason Weir) when simulating scenarios with both migration and exponential growth. It led to program crashes and incorrect migration patterns.
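The growth-related correction above (growth introduced at a given point in a population of initially constant size) can be illustrated with a minimal sketch. It assumes that deme sizes follow N(t) = N0 * exp(r * t) backward in time; this formula, the function names and the numbers are illustrative assumptions, not taken from the program.

```python
import math


def deme_size(t: float, n0: float, rate: float, growth_start: float) -> float:
    """Backward-in-time deme size: constant until growth_start, exponential after.

    Assumed model: N(t) = n0 * exp(rate * (t - growth_start)) for t > growth_start.
    """
    if t <= growth_start:
        return n0
    return n0 * math.exp(rate * (t - growth_start))


def deme_size_as_in_bug(t: float, n0: float, rate: float, growth_start: float) -> float:
    """Same scenario, but with growth counted from generation zero (the reported bug)."""
    if t <= growth_start:
        return n0
    return n0 * math.exp(rate * t)


# Constant size of 1000 until generation 100, then growth rate 0.01.
print(deme_size(150, 1000, 0.01, 100))            # 1000 * exp(0.5)
print(deme_size_as_in_bug(150, 1000, 0.01, 100))  # 1000 * exp(1.5)
```

The two functions agree before the growth starts and diverge afterwards by a factor exp(rate * growth_start).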

Changes in fsc26 relative to fsc25 ver 2.21 (November 2015)

New features

Simple implementation of individual inbreeding
- The average inbreeding coefficient of individuals in a population can now be specified as a third optional parameter in the sample size definition. In this case, the sample age needs to be defined (set to zero in most applications), as:
  <sample size> <sample age> <inbreeding coefficient>
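As an illustration of the three-field layout, a minimal parsing sketch is given below. The names SampleSpec and parse_sample_line are hypothetical, and the defaults (age 0, inbreeding 0.0 when fields are omitted) and the integer sample age are assumptions made for the example, not a description of how the program reads its input.

```python
from dataclasses import dataclass


@dataclass
class SampleSpec:
    size: int                 # number of sampled gene copies
    age: int = 0              # sample age (assumed integer, in generations)
    inbreeding: float = 0.0   # average inbreeding coefficient


def parse_sample_line(line: str) -> SampleSpec:
    """Parse '<sample size> [<sample age> [<inbreeding coefficient>]]'."""
    fields = line.split()
    if not 1 <= len(fields) <= 3:
        raise ValueError(f"expected 1 to 3 fields, got {len(fields)}")
    size = int(fields[0])
    age = int(fields[1]) if len(fields) > 1 else 0
    inbreeding = float(fields[2]) if len(fields) > 2 else 0.0
    return SampleSpec(size, age, inbreeding)


print(parse_sample_line("20 0 0.1"))
```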
Possibility to define initial parameter values for demographic inference
- Option -initvalues file.pv, where file.pv lists initial non-complex parameter values to use. This option is mainly useful when computing bootstrap confidence intervals, as it allows one to use fewer replicates for each bootstrap data set. A *.pv file is now automatically generated after each parameter estimation by fsc26
Computation of MAF 1D and 2D SFS with option --foldedSFS by simply folding the corresponding unfolded SFS (for compatibility with angsd, where the minor allele is computed separately for each SFS)
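Folding in the 1D case can be sketched as follows. This is a minimal illustration of the general idea (entry i and entry n-i of the unfolded SFS are added together), not the program's code. The function name fold_sfs_1d is hypothetical, and keeping the output at the same length with zeros in the upper half is a choice made for the example. The 2D case is not covered here.

```python
from typing import List


def fold_sfs_1d(unfolded: List[float]) -> List[float]:
    """Fold an unfolded 1D SFS with entries 0..n into a minor-allele SFS.

    Entry i (derived allele count) is added to entry min(i, n - i).
    The middle class i == n - i, when n is even, is counted once.
    """
    n = len(unfolded) - 1  # number of sampled gene copies
    folded = [0.0] * (n + 1)
    for i, count in enumerate(unfolded):
        folded[min(i, n - i)] += count
    return folded


# Unfolded SFS for n = 4 gene copies (entries for 0, 1, 2, 3, 4 derived alleles).
print(fold_sfs_1d([100, 30, 20, 10, 5]))  # [105.0, 40.0, 20.0, 0.0, 0.0]
```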
Optional faster but approximate log computations with option --logprecision n, where n is a number between 10 and 23 specifying the precision of the computation of logarithms. 23 means full precision and is the default value.
Optional parameter optimization without taking singletons into account specified with option --nosingleton
Syntax changes
- For parameter optimization,
- -N option has been suppressed, and maximum no. of iteration is now equal to that set by the -n option
- The number of cycles to perform is now fixed and only specified with option -L
- The -l option is now optional and means something different. It is now used to specify the number of cycles where information on monomorphic sites is used. After these initial cycles, likelihood will only be computed (and optimized) on the polymorphic sites. This option needs to be used together with the “reference” keyword in the .est file (see section on est file).
- The -M option is now just a flag indicating that we want to perform parameter estimation from the observed SFS. It should therefore not be followed by any number.
- Removed -D option to produce output in dadi format.
Implementation of instantaneous bottlenecks with keyword instbot added to historical event definition. Only works in absence of recombination for the moment.

Bug corrections

Expected marginal SFS were not computed when computing expected SFS with FREQ data
Wrong likelihoods were computed with option -0

Changes in fastsimcoal25 (fsc25) relative to fastsimcoal21 (August 2014)

In addition to overall polishing and bug corrections, the main innovation of ver 2.5 of fastsimcoal2 is the introduction of multithreading (with the -c option). This option aims at offering the possibility of doing parameter optimization on desktop machines, as most modern machines have multiple cores. Note that there is no strict linear increase in the performance of multithreaded runs and no. of threads (cores), so that it is not recommended to use more than one thread on a linux cluster.

New features

The fastsimcoal2 program ver2.5 has been renamed fsc25 (shorter name is better)
Use of a different random number generator (same seed will produce different results than in fastsimcoal21)
Code optimization resulting in up to 1-75% speed gain for single threaded version (see benchmark)
Multithreading (64 bit only), for more speed gain on a multicore processor desktop machine (see benchmark)
Result files for parameter estimation now output in separate result directory
More options to generate SNP data
New specification for MAF SFS
Added a version for macOSX running in earlier versions (e.g. from 10.6 upwards) (thanks to Iain Mathieson)
More tolerant reading of input files (thanks to Allan Strand)
Rules in est files can now be used for parameter estimations

Bug corrections in fsc25 relative to fsc21

Corrected bug where maximum number of simulations was set to lower number of simulations during Brent optimization
Program crashed when trying to compute SFS when too many polymorphic sites need to be kept in memory (this number can be changed with -k option)
Solved problems when multiple parameter definitions are listed in def files
Corrected bug preventing recombination to be simulated when several block structures were defined in par file (thanks to Thomas Willems)
Incorrect computation of likelihood when using fractional numbers in observed SFS (thanks to Andreas Kautt)

Bug corrections in ver 2.5.0.2 (August 2014)

Growth rates were inactivated in ver 2.5.0, and all simulations were performed with a constant size population (thanks to Melissa Wilson Sayres for pointing this out)

Bug corrections and modifications in ver 2.5.1 (October 2014)

Example files are back in zip files (thanks to Alfredo)
Description of the exact format of the multiSFS format has been modified in the manual (thanks to Vitor Sousa and Raphael Leblois)
Problem in implementing recombination with multiple runs (option -nx where x>1) (thanks to Vitor Sousa and Yang)
More precision on branch lengths when outputting trees in NEXUS format (thanks to Shuo Yang)
New faster way to implement recombination under the SMC' algorithm and its extension to multiple recombinations between sites

Bug corrections and modifications in ver 2.5.2 (March 2015)

1. Bug corrections:

fsc251 asked for a joint SFS when two population samples were listed in tpl file but only one contained active lineages. Bug found by Charleston Chiang.
TMRCA was not found in case of recombination and demes with some inactive lineages. Bug found by Ryan Bohlender.
fsc251 was not generating output files when a path was provided before input file names (par or tpl). Note that fsc25 should always be run from the directory containing the input files, even though the program can be physically located elsewhere. Bug found by Greer Dolby.
fsc251 was not taking into account growth rate changes specified in historical events (bug introduced in 2.5.1, and it was not present in ver 2.5.0).

-k option has no upper limit anymore, and its default value is 100,000
Added new -P command line option, allowing one to get the global pooled SFS obtained by pooling all lineages as if in a single population
Added two new operators in est file for complex parameters: %min% and %max%
Added new functions in est files for complex parameters: abs(), exp(), log(), log10(), pow10()
Added a new "bounded" keyword in est file to specify that the upper range of a simple parameter is bounded. Needs to be listed after the "output" or "hide" keywords.
Added two new keywords for historical events: "keep" and "nomig".
Expected joint SFS is now rescaled such that the sum of SFS entries for polymorphic sites is 1. This should lead to more exact lhood computation from multiple 2D SFS.
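Two of the items above, the global pooled SFS and the rescaling of the joint SFS, can be sketched for a 2D SFS as follows. This is an illustrative sketch under stated assumptions, not the program's code. The function names are hypothetical. Entry (i, j) is assumed to hold the sites with i derived alleles in the first sample and j in the second, and entries (0, 0) and (n1, n2) are assumed to be the monomorphic classes. The notes do not say how monomorphic entries are treated after rescaling; here every entry is simply divided by the polymorphic total.

```python
from typing import List

Matrix = List[List[float]]


def pool_joint_sfs(joint: Matrix) -> List[float]:
    """Pool a 2D SFS into a 1D SFS, as if all lineages came from one population."""
    n1, n2 = len(joint) - 1, len(joint[0]) - 1
    pooled = [0.0] * (n1 + n2 + 1)
    for i, row in enumerate(joint):
        for j, value in enumerate(row):
            pooled[i + j] += value
    return pooled


def rescale_polymorphic(joint: Matrix) -> Matrix:
    """Divide all entries so that the polymorphic entries sum to 1."""
    n1, n2 = len(joint) - 1, len(joint[0]) - 1
    monomorphic = {(0, 0), (n1, n2)}
    total = sum(value
                for i, row in enumerate(joint)
                for j, value in enumerate(row)
                if (i, j) not in monomorphic)
    if total <= 0:
        raise ValueError("no polymorphic sites in the joint SFS")
    return [[value / total for value in row] for row in joint]


joint = [[50, 4, 2],
         [3, 6, 1],
         [1, 3, 10]]
print(pool_joint_sfs(joint))  # [50.0, 7.0, 9.0, 4.0, 10.0]
```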

Bug corrections and modifications in ver 2.5.2.8 (May 2015)

1. Bug corrections:

Incorrect simulation of mutations in case of high recombination rates. There was a strong negative correlation between the recombination rate and the number of polymorphic loci, when adjacent sites were the object of recombination. The number of mutations was underestimated for recombination rates of, say, >1e-7. This bug affected ALL previous fsc releases
Possible overestimation of TMRCA and overall tree size in case of recombination. Bug present since early fsc2 release.
Crash of fsc2 in case of very high recombination rate with DNA data
Incorrect writing of recombination positions in output arp file when simulating several threads
maxObsLhood was not correctly computed when estimating parameters in a scenario with a single population
Change of migration matrix not implemented after first recombination (thanks to Stefano Mona)
Computation of MAF SFS incorrect in case of multiple mutations per site (when -I option not provided and high mutation rates) (thanks to Jason Weir).

Speed optimization
Output of random DNA nucleotides instead of N for monomorphic loci with the -S option.
Possibility to run fsc without command line option if file "fsc_run.txt" is present and contains run path and command line options in current working directory

Bug corrections and modifications in ver 2.5.2.21 (November 2015)

1. Bug corrections:

Non implementation of exponential growth at time zero for the first simulated tree. Initial population size therefore does not change for that tree. Note that specifications of exponential growth rates in historical events are correctly implemented even in the first tree. Exponential growth is then correctly implemented in the next simulated trees (thanks to Anand Bhaskar)
Crash in case of very large sample sizes (e.g. 60,000) (thanks to Anand Bhaskar)
Incorrect computation of the max lhood when non-integers are used in the observed SFS (thanks to Andi Knautt)
Reported expected SFS was that of the last iteration and not that associated to the max lhood parameter estimates
In case of crash due to bad tpl file, parameters reported in file called <generic name>_bad.par were not those leading to the crash

Speed optimization. Up to 30% speed gain.
Output of time to MRCA in file <generic name>_mrca.txt with new compiler directive --recordMRCA. Beware that this option really slows down computations. Note that we also output the deme in which the MRCA occurred.

Changes in fastsimcoal21 relative to fastsimcoal2 (December 2013)

New features

64 bit windows version of fastsimcoal2 (20% speed gain compared to 32 bit version!)
Modified output of monomorphic samples. By default, fastsimcoal2 only outputs polymorphic sites for DNA data. If the coalescent tree is too shallow, no mutation can occur on a given tree. In that case, fastsimcoal2 now outputs a single locus with "N" for all individuals instead of missing data in arlequin files (*.arp). This change prevents a bug when analysing simulated arlequin files with arlsumstat.
Optional use of a manual seed for the random number generator (--seed xxx command line option)
Outputs par file with estimated maximum likelihood parameters. This file can be used to generate pseudo-observed SFS to estimate parametric bootstrap confidence intervals around the ML parameters.

Bug corrections

With -s0 option, the number of reported polymorphic sites in file "<file_name>_numPolymSites.obs" was incorrectly set to zero and the maximum likelihood reported in the file "<file_name>.lhoodObs" was set to INF. These two problems are now corrected.
Multiple whitespace or multiple tabs between parameters in historical events caused erratic behavior. Multiple separators are now allowed in historical events.
When estimating a single parameter from the SFS, the number of performed ECM loops could be smaller (usually 2) than the required minimum.

Changes in fastsimcoal2 relative to fastsimcoal

New features

Optional output of all simulated sites (including monomorphic sites) (-S command line option)
Optional use of a manual seed for the random number generator (--seed xxx command line option)
Simulation of ascertained SNP data
Generation of the (joint) site frequency spectrum (SFS) from DNA sequence data
Generation of multidimensional (>2D) SFS
Generation of Nexus coalescent trees with branch lengths now expressed in fractions of generations (e.g. 1205.123)
Ability to estimate demographic parameters from the site frequency spectrum inferred from DNA sequences or ascertained SNP chips
Need to specify number of SNPs to output with -s option (specify 0 to output all SNPs)

Bug corrections

Potential crash when generating scenarios with historical events and recombination
Crash when simulating samples of size zero with recombination
In the presence of recombination, the pattern of polymorphisms obtained in a population for a given past demography changed depending on whether other samples were simulated as well
Non-convergence to the MRCA when simulating serial samples of age zero

Note that bugs 1-3 were due to the same problem in the code.

Last updated by L. Excoffier on 25.09.2023