fast sequential Markov coalescent
simulation of genomic data under complex evolutionary models
Additional scrpts for the preparation of input files and the
analysis of results
We list below a series of R and bash scripts that have been developed
by several people from our group to facilitate the analysis of the
results of fsc27 as well as for preparing input files.
visualizing scenarios modeled in par files
A new R script called ParFileViewer.r
reads par files and plots the modeled scenario. parFileViewer has been
adapter to reflect the latest chages in fsc27. this script can be
useful to check that the modeled scenario corresponds to what was
intended. It can also be useful to visualize the scenario
after some parameter estimation (use the *_maxL.par file).
on the use of the this program can be found here.
Alexandre Gouy has written a web
application based on the R/Shiny framework h to facilitate the
interpretation of fastsimcoal results. It features various visualisations: the
demographic model, observed and expected marginal and joint SFS, as well as
likelihood comparisons and parameter estimates dispersions if multiple runs are
provided. Source code can be foundgitHub. It can be used locally or
directly from our CMPG web site.
running multiple instances of fsc
Laurent Excoffier and Nina Marchi have written a bash script to launch several instances of fsc under the SLURM worload manager. It requires .tpl, .est and .obs SFS files plus fastsimcoal program.
Vitor Sousa has written a similar bash script to run it on a single multi-core machine, with a an associated readme file here
David Alexander Marques (DAM) , Vitor Sousa
(VS), Nina Marchi (NM) and Laurent Excoffier (LE) have written
several file conversion scripts:
An Rscript (by DAM) SFStool.R to
1) convert: site frequency spectra (SFS), either multidimensional or 2D-SFS to marginal 2D-SFS or 1D-SFS
2) visualize: 2D-SFS and 1D-SFS, including comparison between observed
SFS (ending with .obs) and expected SFS (ending with .txt) if both are
available in a folder
A Python script (sampleKgenotypesPerPop.py, by DAM)
to prepare genotypes in a VCF for conversion into an SFS, in which no
missing data is allowed. The script randomly draws a user-defined
number (K) of non-missing genotypes per population from a VCF file and
thereby subsamples genotypes from each population into a new VCF with
pseudo-individuals. Sites with less than k genotypes per population are
discarded. The script uses a population file with format
Individual<TAB>Population, one per line, and allows to specify
different K's for different populations.
A Python script (vcf2sfs.py, by DAM)
which converts VCF files into site frequency spectra (SFS) files of the
respective dimensions in fastsimcoal2 format (1D, 2D or multi-D,
depending on the number of populations in the VCF file) by simply
counting alleles from genotypes for sites without missing data. It can
also compute SFS for non-overlapping windows in the genome (defined by
distance along the chromosome or number of sequenced sites in the VCF
file), compute bootstrap replicates of the SFS (by resampling single
sites) or block-bootstrap replicates of the SFS (by resampling
non-overlapping windows in the genome). IMPORTANT: Sites with missing
data are discarded, so use of sampleKgenotypesPerPop.py (see above)
recommended prior to get rid of missing data.
A Python script (foldSFS.py, by DAM)
that converts unfolded SFS of any dimension (*_DAF*, *_jointDAF*,
*_DSFS*) into their folded equivalent (*_MAF*, *_jointMAF*, *_MSFS*).
Currently supports only SFS files containing a single SFS.
A series of bash and R scripts to convert VCF files to SFS by VS. They are contained in the following zip file buildSFSFromVCF.zip
An R script
(by NM and LE) to transform a genotype table, as newly produced
by fsc27, into a multidimensional SFS. This script computes from a
genotype table the allele frequencies in user-defined populations and
then build a multidimensional SFS with populations listed in the order
defined by the user. The script uses a genotype table file with genomic
positions as rows, diploid individuals as columns, no missing data and
genotype entries coded as 0 (homozygote ancestral), 1 (heterozygote),
and 2 (homozygote derived). It also computes the number of sites,
monomorphic and polymorphic, to be used to compute the SFS (the
monomorphic sites do not have to be present in the genotype table).