Code compendium

Code compendium

Paul Czechowski ORCID iD icon https://orcid.org/0000-0001-7894-4042

I provide two recent samples of my existing work (of the last two years) concerning environmental DNA analysis in the Prince Charles Mountains and Fiordland. Alongside code comments and structure I include brief remarks on the both projects. Beyond this text note further cross-references:

Further code examples can be made available, e.g. for environmental DNA analyses related to NSF-funded work and ancient genome assembly.

Example 1 – Antarctica

Antarctic biodiversity predictions through substrate qualities and environmental DNA.

Background and motivation

The Prince Charles Mountains are some of the most remote ice-free areas in Antarctica, and Australia is dedicated to implementing new technologies for safeguarding Antarcticas’s biodiversity.

I started this new analysis of my Antarctic data during New Zealand’s first lockdown (31 March 2020). The resulting article is in production at ESA’s well-regarded Frontiers in Ecology and the Environment.

In this project I investigate the distribution of cryptic Antarctic invertebrates living in remote island-like terrestrial habitats of Antarctica in relation to soil geochemical and mineral properties of their substrate. I found neutral substrate pH, low conductivity, and some substrate minerals to be important predictors of presence for basidiomycetes, chlorophytes, ciliophorans, nematodes, or tardigrades. We collected soils in the Prince Charles Mountains, I analyzed soil environmental DNA at the Australian Centre for Ancient DNA. Soil geochemical variables were measured by APAL. I collected X-ray diffraction spectra in collboration with Duanne White at the University of Canberra. Funding was acquired by Mark Stevens (South Australian Museum)

Field work at Mt. Menzies

Code access and remarks

Files listed below and discussed here are available via Github. My analysis initially checks the suitability of the employed PCR primers (000_r_in_silico_pcr.R, with the R language). Subsequently, various tools are used for quality check and quality filtering of data. I use Bash, initially for metadata assembly (e.g. 090_bash_create_manifests.sh ), and then to combine purposeful features of Qiime 1 and 2 with software such as Cutadapt, Gnuplot, Blast, and FastTree. Denoising is implemented via DADA2 in Qiime (110_q2_denoise.sh). Blast output can be inspected with MEGAN. Custom analysis steps (predominantly scripts with numbers higher than 180... ) are implemented using R.

I give a most detailed description of the works entire analysis history in the projects README.md. In this file I included commit hashes to easily revert to earlier code versions. Scripts ought to be run in consecutive order as indicated by their preceding number codes. Additional information, and imagery, as well as a snapshot of the project code at time of pre-print submission I provided via Zenodo.

.
├── 000_r_in_silico_pcr.R
├── 050_q1_demultiplex.sh
├── 060_q1_split_samples.sh
├── 070_bash_check_quality.sh
├── 080_bash_cutadapt.sh
├── 090_bash_create_manifests.sh
├── 100_q2_import.sh
├── 110_q2_denoise.sh
├── 115_gnu_plot_denoise_plate1.gnu
├── 115_gnu_plot_denoise_plate2.gnu
├── 120_q2_merge.sh
├── 130_q2_summary.sh
├── 140_bash_fasta_blast.sh
├── 150_r_get_q2_tax-tab.r
├── 165_r_prep_q2_predictor-tab_with-raster.r
├── 170_q2_summary.sh
├── 175_q2_seq_align.sh
├── 177_q2_mask_align.sh
├── 180_q2_get_fastree.sh
├── 190_q2_export_objects.sh
├── 200_r_get_phyloseq.r
├── 220_r_check_species_occurences.r
├── 230_r_add_blast_scores_to_som.r
└── README.md

Thanks to Fiona, Tessa, Nick, Adrian, Josh, and Mark Stevens

Example 2 - Oceania

Environmental DNA analysis needs local reference data to inform taxonomy-based conservation policy – A case study from Aotearoa / New Zealand

Background and motivation

Fishing is one of the biggest industries in New Zealand, and Fiordland is an UNESCO world heritage site very important for the replenishment of New Zealand fish stocks. The analysis of environmental DNA with regards to fish was motivated by a request of the New Zealand Department of Conservation (Monique Ladds) to explore suitability of recent molecular biological approaches to assesses biodiversity of fish in marine reserves.

I started this analysis in addition to the sequencing of the Haast’s eagle genome at the beginning of my PostDoc at the University of Otago. The article is in review at SCB Conservation Science and Practice.

In this project I evaluated the impact of lacking environmental DNA reference data on descriptions of the Fiordland fish biodiversity I compared eDNA-derived species identifications against Baited Remote Underwater Video (BRUV) data collected at the same time and locations as the eDNA data. Furthermore, I cross referenced both eDNA and BRUV data against species lists for the same region obtained from literature surveys and the Ocean Biodiversity Information System (OBIS). Concordance of taxonomies between the data sources dissolved with lowering taxonomic levels, most decisively so for eDNA data. The work was entirely realized at the University of Otago, lead by me, in collaboration with researchers of the Department of Marine Sciences (Chris Hepburn, Will Rayment), and the Department of Biostatistics (Michel de Lange). Funding was acquired by Michael Knapp (Otago University)

Departure for water sample collection and BRUV deployment

During review, an often-overlooked aspect of this analysis has been the detailed alignment-level analysis of species misidentifications, at a level that is uncommon in other eDNA studies. It is absolutely crucial to have detailed reference data for species level biodiversity surveys.

Summary of binomial regression. Regression analysis of the 142 non-unique eDNA observations (Tjur’s R2 0.027) suggested each additional alignment gap to be associated with a 39% increased probability of observing a non-native species (Odds Ratio 1.39, 95% CI from 1.19 to 1.66, p <0.01). At the same time, a 1% increase in alignment concordance was associated with a 7% increased probability of non-native observation (OR 1.07, 95% CI 1.03 - 1.12, p <0.01).

Code access and remarks

All scripts used for analysis, and listed below are available via Github. Similar to my aforementioned project, I started by evaluating primer suitability (000_r_in_silico_pcr.R), read in metadata from Excel sheets and stored them as R data objects ( 200_r_metadata_management.R), checked and cleaned data for analysis (from 300_bash_cutadapt_demultiplex.sh to 980_q2_export_objects.sh), and implemented project specific analysis steps (using R from 990_r_get_eDNA_long_table.r to 998_r_summarize_results.r ). An effortful task was the retrieval, formatting and mapping of data from OBIS (in 998_r_get_OBIS_and_map.r). As usual, Blast output can be inspected with MEGAN.

The repository’s README.md file gives a comprehensive overview over the project history. I include commit hashes to enable easily going back to an earlier code version. Scripts ought to be run in consecutive order as indicated by their preceding number codes.

Additional information, and imagery, as well as a snapshot of the project code at time of pre-print submission are available via Zenodo.

.
├── 000_r_in_silico_pcr.R
├── 100_r_quantification_analysis.R
├── 200_r_metadata_management.R
├── 300_bash_cutadapt_demultiplex.sh
├── 350_bash_count_reads_and_mv_empty_fastqs.sh
├── 400_r_qiime_manifest.R
├── 500_q2_import.sh
├── 600_q2_denoise.sh
├── 650_r_plot_denoise.R
├── 750_bash_fasta_blast.sh
├── 800_r_get_q2_tax-tab.r
├── 850_r_prep_q2_predictor-tab.r
├── 900_q2_summary.sh
├── 980_q2_export_objects.sh
├── 990_r_get_eDNA_long_table.r
├── 995_r_get_BRUV_long_table.r
├── 995_r_get_PUBL_long_table.r
├── 997_r_format_longtables.r
├── 998_r_get_OBIS_and_map.r
├── 998_r_summarize_results.r
├── 999_get_package_citations.R
├── 999_r_mdl-makeBarPlots_310721.R
└── README.md

Thanks to the divers of the Department of Marine Sciences of the University of Otago, Prof. Chris Hepburn, and our skipper Pete.