The procedure is inspired by the Guidelines and Standards for Evidence Synthesis in Environmental Management27 and includes three main phases: (1) the development of an effective method for searching the literature through an iterative, standardised process; (2) the development of eligibility criteria for an objective and transparent selection of relevant articles; and (3) the development of a metadata-coding framework (i.e. a pilot-tested spreadsheet) used to extract relevant metadata from retained articles and qualitatively characterise the evidence base. For phases 2 and 3, we adapt the guidelines to include assistance from machine learning models following the method of 26: a sample of articles is manually screened and coded, and these decisions are used to train machine learning classifiers that predict the screening and coding classifications for the entire corpus. Following 26, we will also use a geoparser to identify geographic coordinates and extents from place names in study titles and abstracts.
1. Searching for articles
1.1 Estimating comprehensiveness of search
To assess search comprehensiveness, a “test list” of 94 benchmark articles was compiled through contributions from the interdisciplinary synthesis team (Additional File 1).
1.2 Search terms and languages
In evidence synthesis protocols, the search strings used to retrieve relevant literature are assessed and refined through an iterative process of scoping searches. These scoping searches aim first to identify search terms that maximise retrieval of the test list articles (i.e. an assessment of search comprehensiveness), and second to minimise the total number of results retrieved (i.e. an assessment of search specificity)27. In our synthesis, we carried out scoping searches in the Web of Science Core Collection (WOS), using the language tag to limit the search results to English (LA = English). While we acknowledge the drawbacks of language bias28, this methodological choice was necessary due to limitations in the language expertise of our review team and the limited availability of pre-trained machine learning models for other languages.
Our research question encompasses OROs that cover a broad range of concepts. Given this diversity, we determined that a single search string could not achieve full comprehensiveness at a realistic level of specificity (Additional File 2, sheet “General”). To address this, we classified all OROs considered in our research question into a typology modified from 11 and, based on the ORO types identified, developed a search strategy composed of five search strings (termed “substrings”) (Figure 1). Each substring will be run individually and the search results subsequently pooled and de-duplicated to form a single database.
Each substring combined groups of search terms (i.e. “blocks”) that matched critical components of our topic (Table 3); a schematic example of how blocks combine into a substring is given after the list below:
- Population block: terms relevant to oceans and coastal human/natural systems
- Intervention block: this block varied considerably across the substrings, and was often represented through a combination of the following sub-blocks joined with an “AND” operator:
  - Climate_change: terms relevant to climate change or ocean-related climate impact drivers
  - Option: general terms representing an option, intervention or solution
  - Ocean_renewables: terms that specifically target marine renewable energies
  - Mitigation_options: terms representing types of mitigation options other than marine renewable energies
  - Natural_resilience: terms relevant to natural resilience options
  - Societal_adaptation: terms relevant to societal adaptation options
- No_terms: terms excluded because the topics they represent are not within the scope of our review (e.g. pollution reduction, or articles that focus only on reporting climate change impacts/vulnerability).
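To illustrate how the blocks are combined, the sketch below assembles a Web of Science-style topic query from placeholder term blocks in R29. The terms and block composition shown are illustrative only and do not reproduce the actual substrings in Additional File 2.

```r
# Illustrative only: assemble one substring from term blocks.
# These placeholder terms are NOT the actual blocks from Additional File 2.
population <- c("ocean*", "marine", "coast*", "\"sea level\"")
climate    <- c("\"climate change\"", "\"ocean warming\"", "\"sea-level rise\"")
option     <- c("adapt*", "mitigat*", "restor*", "\"nature-based solution*\"")
no_terms   <- c("\"water pollution\"", "\"river runoff\"")

or_block <- function(terms) paste0("(", paste(terms, collapse = " OR "), ")")

substring_query <- paste0(
  "TS=(", or_block(population),
  " AND ", or_block(climate),
  " AND ", or_block(option),
  ") NOT TS=", or_block(no_terms)
)
cat(substring_query)
```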
Each substring was tuned through iterative scoping searches, with the goal of achieving 100% comprehensiveness once the search results from all substrings are pooled. When all the test list articles were retrieved, the number of records across all five substrings (before deduplication) totalled 392,111. This was an acceptable level of specificity for our machine learning-assisted approach (Table 3, Additional File 2), and we therefore judged the substrings suitable for conducting the synthesis.
1.3 Searching the literature
Our machine learning-assisted approach requires title and abstract metadata to be exported for all search results in order to fit the classifiers. At the scale of our research question, this is best achieved using citation-indexed databases. We will conduct our searches in the Web of Science Core Collection and Scopus (Additional Files 2 and 3).
1.4 Reference management
References from the searches will be exported from the citation-indexed databases as BibTeX files. All citation files will be read into the R statistical computing environment29 and deduplicated using the revtools package30. Duplicate records will be identified based on title and year matches: titles will be pre-processed by converting to lowercase and removing punctuation, and records published in the same year whose processed titles have an optimal string alignment distance of ≤ 5 will be tagged as duplicates, with the most complete record retained. Since our machine learning model relies on titles and abstracts to screen and code articles, entries with an “NA” value in either field will be removed.
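A minimal sketch of this deduplication step is shown below, assuming the BibTeX export has been read with revtools::read_bibliography() into a data frame with title, abstract and year columns (the file name and exact field names are hypothetical), and using the stringdist package to compute optimal string alignment distances.

```r
# A rough sketch of the deduplication step, not an optimised implementation.
library(revtools)
library(stringdist)

refs <- read_bibliography("search_results.bib")   # hypothetical file name

# Remove records that cannot be screened on title/abstract
refs <- refs[!is.na(refs$title) & !is.na(refs$abstract), ]

# Pre-process titles: lowercase, strip punctuation
norm_title <- tolower(gsub("[[:punct:]]", "", refs$title))

# Tag as duplicates records from the same year whose processed titles are
# within an optimal string alignment (OSA) distance of 5
dup <- rep(FALSE, nrow(refs))
for (i in seq_len(nrow(refs) - 1)) {
  if (dup[i]) next
  later_same_year <- which(refs$year == refs$year[i] & seq_len(nrow(refs)) > i)
  d <- stringdist(norm_title[i], norm_title[later_same_year], method = "osa")
  dup[later_same_year[d <= 5]] <- TRUE   # in practice, the most complete record is kept
}
refs_unique <- refs[!dup, ]
```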
2. Article screening and study eligibility criteria
2.1 Screening process
The eligibility screening step of an evidence map involves the application of eligibility criteria that determine which of the primary research studies identified in searches are relevant for answering the map question27. Our search queries from section 1 returned 265,436 unique articles. We will therefore use a machine learning-assisted approach to automate the evidence selection process: a machine learning classification model will be trained on article text (title, abstract and keywords) from a subset of articles manually screened by a team of reviewers, and used to predict inclusion with quantified uncertainty26. Manual screening decisions will be based on transparent and replicable eligibility criteria (cf. Section 2.2).
First, a manually-screened sample of articles is needed to train and test the machine learning classification model. Since the number of publications retrieved by each search string was unbalanced (Table 4), sampling randomly from the pool of all search results would likely return a subset biased towards the ORO types that retrieved the most search results. Because the manually-screened sample will be used to train and validate the model, we did not want this imbalance to bias the model’s performance. Therefore, to achieve a more balanced representation of ORO types, we will select 4,000 articles for manual screening from the deduplicated search results (excluding test list articles) using the following procedure: 1,000 articles randomly sampled from the results of the “General” search string, 1,000 from the two mitigation search strings (333 from “Mitigation: renewable energy” and 666 from “Mitigation: other”), 1,000 from the “Natural resilience” string, and 1,000 from the “Societal adaptation” string. These articles, along with the test list articles (4,094 articles in total), will be uploaded into sysrev (www.sysrev.com), a web-based platform for publication screening.
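A sketch of this stratified sampling step is given below, assuming a hypothetical substring_id column recording which search substring retrieved each deduplicated record and a logical test_list flag; both column names are illustrative.

```r
# Stratified sample for manual screening, using the quotas stated above
set.seed(2023)
quota <- c("General"                      = 1000,
           "Mitigation: renewable energy" = 333,
           "Mitigation: other"            = 666,
           "Natural resilience"           = 1000,
           "Societal adaptation"          = 1000)

screening_sample <- do.call(rbind, lapply(names(quota), function(s) {
  pool <- refs_unique[refs_unique$substring_id == s & !refs_unique$test_list, ]
  pool[sample(nrow(pool), quota[[s]]), ]
}))
```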
Reviewer training aims to reduce bias in the manual screening process. During this process, each reviewer will independently screen a randomly-selected training set of titles and abstracts (n = 30), and their inclusion decisions will be used to calculate a Fleiss’ Kappa statistic31,32 to assess inter-reviewer reliability. If the kappa statistic is below 0.6, a group discussion will be conducted to resolve discrepancies and modify/clarify the eligibility criteria until a consensus is reached. A new training set (n = 30) will then be screened, and this process will be repeated until an acceptable Fleiss’ Kappa statistic of ≥ 0.6 is achieved. Any remaining disagreements will be discussed before screening commences.
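The reliability check could be computed with the irr package, as in this sketch; the decision matrix is simulated, with rows for the 30 training articles and columns for reviewers.

```r
# Inter-reviewer reliability on a training set of 30 titles/abstracts
library(irr)

decisions <- matrix(sample(c("include", "exclude"), 30 * 4, replace = TRUE),
                    nrow = 30, ncol = 4,
                    dimnames = list(NULL, paste0("reviewer_", 1:4)))  # simulated decisions

k <- kappam.fleiss(decisions)
if (k$value < 0.6) {
  message("Fleiss' Kappa = ", round(k$value, 2),
          " (< 0.6): discuss discrepancies, refine criteria, and rescreen a new set")
}
```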
After the reviewer training, reviewers will proceed to screen the sample articles. Articles will be screened by a single reviewer, except for 300 randomly-selected articles flagged for double-blind screening; any conflicting screening decisions will be resolved by a third reviewer. We acknowledge that the use of a single reviewer to screen articles is a limitation of this study, but it was necessary given the size of the evidence base. Reviewers will not screen articles that they have authored. We will also use an active learning strategy to screen articles. The titles and abstracts of articles will first be screened by the screening team in random order. Since our aim is to produce a sample of both inclusion and exclusion labels that trains the machine learning classifier effectively, sufficient examples of each classification are needed. To mitigate the high proportion of irrelevant articles often found in systematic evidence syntheses, we draw on the active learning approach of 26, which sorts articles for screening by predicted relevance from a periodically-updated machine learning model embedded in their screening platform. For our study, we will begin the active learning process after at least 1,500 articles have been screened in random order, and will use the predicted relevance from the machine learning model embedded in sysrev to sort the articles for screening. All articles with a predicted relevance above 60% will be screened until no articles above that relevance threshold remain.
Machine learning predictions
This manually-screened sample will be used to train and test a binary machine learning classifier that predicts the inclusion or exclusion of all articles retrieved by the search strategy26. In this approach, a nested cross-validation (CV) procedure is used to compare the performance of a support vector machine (SVM)33 and a pretrained DistilBERT model34 fine-tuned on the manually-screened sample.
In machine learning, data is needed both to train (i.e. fit) a model and to evaluate its performance. If the same data is used for both steps, the model’s performance metrics may be inflated due to overfitting. K-fold CV addresses this issue by splitting the dataset into k non-overlapping folds so that data used for training is never used for evaluation. Over k iterations, each fold is held out in turn as a test set while the remaining folds are pooled and used to train a model, producing k candidate models.
Two nested loops of k-fold CV are used to perform two essential functions: (1) finding the best model configuration by optimising the selection of hyperparameters in the inner loop, and (2) model evaluation in the outer loop. In the outer loop, the dataset is first divided into k folds, with one fold reserved as a test set and the remaining k-1 folds used as the training set. In this study, only randomly-selected articles are included in the test sets (as opposed to those screened by predicted relevance in the active learning procedure) so as not to bias the model evaluation; the remaining randomly-selected articles not included in the outer test set are incorporated into the training set. In the inner loop, a further k test sets are drawn from the randomly-selected articles in the outer loop’s training set, and all remaining articles are allocated to the inner training set. Grid search is used to initialise a model with each combination of hyperparameters and to fit and evaluate it on each inner training/test split. The combination of hyperparameters with the best mean F1 score (the harmonic mean of precision and recall) across the inner folds is selected as the best model. The training and test data from the outer CV is then used to evaluate the performance of the best model from the inner folds. The outer CV therefore returns k scores for each evaluation metric.
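The sketch below illustrates the nested CV structure for the SVM arm in R, using e1071::svm with toy numeric features standing in for the actual text representations; the restriction of test sets to randomly-selected articles is omitted for brevity, and the hyperparameter grid is illustrative.

```r
# Schematic nested cross-validation with grid search, evaluated by F1 score
library(e1071)

set.seed(1)
n <- 300
x <- matrix(rnorm(n * 20), n, 20)                      # placeholder features
y <- factor(sample(c("include", "exclude"), n, TRUE))  # placeholder labels

f1_score <- function(truth, pred, positive = "include") {
  tp <- sum(pred == positive & truth == positive)
  fp <- sum(pred == positive & truth != positive)
  fn <- sum(pred != positive & truth == positive)
  if (tp == 0) return(0)
  p <- tp / (tp + fp); r <- tp / (tp + fn)
  2 * p * r / (p + r)
}

grid <- expand.grid(cost = c(0.1, 1, 10), gamma = c(0.01, 0.1))
outer_folds <- sample(rep(1:5, length.out = n))
outer_scores <- numeric(5)

for (k_out in 1:5) {
  train_idx <- which(outer_folds != k_out)
  test_idx  <- which(outer_folds == k_out)
  inner_folds <- sample(rep(1:5, length.out = length(train_idx)))

  # Inner loop: choose hyperparameters by mean F1 across inner folds
  inner_f1 <- apply(grid, 1, function(par) {
    mean(sapply(1:5, function(k_in) {
      tr <- train_idx[inner_folds != k_in]
      te <- train_idx[inner_folds == k_in]
      fit <- svm(x[tr, ], y[tr], cost = par[["cost"]], gamma = par[["gamma"]])
      f1_score(y[te], predict(fit, x[te, ]))
    }))
  })
  best <- grid[which.max(inner_f1), ]

  # Outer loop: evaluate the selected configuration on the held-out fold
  fit <- svm(x[train_idx, ], y[train_idx], cost = best$cost, gamma = best$gamma)
  outer_scores[k_out] <- f1_score(y[test_idx], predict(fit, x[test_idx, ]))
}
outer_scores   # k = 5 estimates of generalisation performance
```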
The binary inclusion/exclusion classifiers (SVM and DistilBERT) will be evaluated across k = 5 inner and outer folds, and the performance of the two classifiers will be compared using F1 and ROC AUC scores. Once the most appropriate classifier is identified, the final model configuration will be chosen using outer CV, in which each combination of hyperparameter settings is tested on each outer fold. The configuration yielding the highest F1 score will be selected and used for predictions. A confidence interval (mean ± 1 standard deviation) for each prediction will be generated using five versions of the model trained on the five folds of data. All articles with a predicted mean relevance greater than or equal to 0.5 will be included.
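As a simple illustration of this prediction step, assuming a matrix of relevance scores from the five fold-specific models (values simulated here):

```r
# Mean prediction +/- 1 SD across the five fold-trained models; include at mean >= 0.5
pred_matrix <- matrix(runif(10 * 5), nrow = 10, ncol = 5)   # simulated relevance scores

pred_mean <- rowMeans(pred_matrix)
pred_sd   <- apply(pred_matrix, 1, sd)

data.frame(mean     = pred_mean,
           lower    = pred_mean - pred_sd,
           upper    = pred_mean + pred_sd,
           included = pred_mean >= 0.5)
```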
2.2 Eligibility criteria
We developed eligibility criteria to screen for literature that includes the necessary elements of our primary research question (Additional File 4). The inclusion criteria are outlined below.
Relevant population(s)
We focus on marine and coastal ecosystems across all biomes. These include the open ocean (including all water column layers from the surface to the deep ocean), seas and gulfs, natural coastal marine ecosystems (i.e. all intertidal habitats), and artificial habitats (e.g. urban waterfronts). We also focus on associated coastal human systems (e.g. coastal communities, fisheries).
Relevant intervention(s) or exposure(s)
We include literature on OROs (see Table 1 and Figure 1) at any developmental stage from design to implementation, including exposure to natural ORO analogues as part of the “design” stage. We also include natural inventory studies measuring natural carbon fluxes/storage in blue carbon ecosystems, as these provide an indirect measure of conservation OROs (i.e. the management or mitigation of anthropogenic impacts and stressors). We do not consider studies without a direct link to an ORO (e.g. studies documenting climate change impacts, or indirect actions aiming to improve the distribution of wealth to reduce climate change vulnerability), nor actions aiming to reduce anthropogenic stressors without a direct link to climate change (e.g. reducing pollution in river runoff or habitat loss through conversion). We recognise that these indirect factors will likely interact with OROs and play an important role in their current and future outcomes; however, due to resource constraints we do not consider them here.
Relevant comparator(s)
No comparator is required.
Relevant outcome(s)
We include literature where the outcome is directly relevant to primary (i.e. mitigation or adaptation) or secondary (i.e. co-benefits, trade-offs/dis-benefits) outcomes attributed to an ORO. Relevant primary outcomes should relate directly to the effectiveness of the ORO at achieving its primary aim of mitigation and/or adaptation. These outcomes can assess effectiveness directly (e.g. the amount of energy produced by a renewable energy technology, or changes in biodiversity/ecosystem recovery) or through dimensions that modify effectiveness (e.g. lead time, cost, feasibility, governability, upscaling potential, enabling conditions). Relevant secondary outcomes can include co-benefits (e.g. increases in biodiversity across all its components, from infra-specific to ecosystem diversity, associated natural and wilderness values, or synergies with other OROs) or trade-offs/side effects/dis-benefits (e.g. displacing communities, reducing population abundance, or negative impacts on other industrial sectors). Secondary outcomes can be intentional or unintentional. We include all study methodologies, including primary research (e.g. observational, experimental or modelled quantitative or qualitative outcomes, or expert opinion) and evidence syntheses.
We exclude indirectly relevant outcomes such as technological advancements that are not explicitly implemented within the framework/technology of an ORO (e.g. electric battery advancements that are indirectly relevant for increasing the impact/usefulness of marine renewable energies).
2.3 Study validity assessment
No formal validity appraisal of included articles/studies will be performed. All studies deemed eligible at the title and abstract screening stage, based on the eligibility criteria, will be included in the evidence map.
3. Metadata coding
Once articles have been screened and deemed eligible, metadata will be coded from each article in order to characterise the distribution of the evidence base. Given the large size of the search results, we anticipate that even after screening the volume of articles will be too large to code manually. We will therefore continue to implement the machine learning-assisted approach26.
To generate the coding decisions used to train and test the machine learning classifiers, the articles included from the manually-screened sample will be manually coded. All reviewers will code metadata using a standardised data codebook (Additional Files 5 and 6). To reduce inter-reviewer bias, each reviewer will use the codebook instructions and template to code the same randomly-selected set of training articles. Any discrepancies will be discussed and resolved between coders, and the codebook modified where needed for clarity. This process will be repeated until members are able to code articles with consistent agreement. Thereafter, articles will be coded by a single reviewer, except for 2 randomly-selected articles out of every 50 coded, which will be double-checked by a second reviewer, with any conflicts discussed and resolved amongst the team. While it is unknown a priori which metadata variables will be accurately predicted by the machine learning model, metadata will be extracted across the following information categories relevant to our primary and secondary research questions: bibliographic information, ORO description, study design and outcome types. If, after coding, any essential variable labels are under-represented, supplemental non-random screening and coding will be carried out by searching for relevant keyword matches within the corpus and screening and coding the returned articles until a more balanced representation is reached. These additional screening decisions will then be added to the initial manually-screened sample to update the screening model.
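The supplemental, non-random selection of articles for under-represented labels could be implemented with simple keyword matching against the corpus, as in this sketch (keywords and column names are illustrative):

```r
# Retrieve candidate articles mentioning an under-represented topic
keywords <- c("kelp", "seagrass", "salt marsh")   # hypothetical under-represented label
pattern  <- paste(keywords, collapse = "|")

candidates <- refs_unique[grepl(pattern,
                                paste(refs_unique$title, refs_unique$abstract),
                                ignore.case = TRUE), ]
# `candidates` would then be screened/coded manually and added to the training sample
```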
Machine learning predictions
To generate predictions for the coded variables, either a single-label or a multi-label (depending on the variable) machine learning classifier will be trained and tested on the manually coded subset. Since a DistilBERT model has already been shown to out-perform an SVM in a similar field26, if DistilBERT performs similarly well in this study’s binary inclusion/exclusion classification, we will run the model-configuration CV procedure for the DistilBERT model only when predicting the coded variables. Because we aim to code a large number of variables, this methodological choice is necessary to keep computational demands manageable. To select the final DistilBERT model for each variable, we will use k = 3 fold CV and evaluate models based on their macro F1 score (where each label is weighted with equal importance). Unlike for the binary classifier, because all the coded articles represent inclusions they are likely to be representative of the dataset, and all are therefore eligible for inclusion in the test sets used in the CV procedure. All articles with a predicted mean relevance greater than or equal to 0.5 for a given label will be assigned that label. For the variable indicating the type of ORO, if an article is predicted as included but no ORO type is identified by the classifier, a label of “Other” will be assigned.
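Macro F1, the selection metric for the coded-variable models, averages per-label F1 scores with equal weight; a small R sketch:

```r
# Macro F1: per-label F1 averaged with equal weight across labels
macro_f1 <- function(truth, pred) {
  labels <- union(unique(truth), unique(pred))
  per_label <- sapply(labels, function(l) {
    tp <- sum(pred == l & truth == l)
    fp <- sum(pred == l & truth != l)
    fn <- sum(pred != l & truth == l)
    if (tp == 0) return(0)
    p <- tp / (tp + fp); r <- tp / (tp + fn)
    2 * p * r / (p + r)
  })
  mean(per_label)
}

macro_f1(truth = c("A", "A", "B", "C"), pred = c("A", "B", "B", "C"))  # toy example
```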
Geoparser for location extraction
To extract location data for each article’s population element, we will follow the approach of 26 and extract geographical entity names from each article’s text (title, abstract and keywords) using the geoparser Mordecai. By querying the Natural Earth database35 with these entity names, we will identify the smallest-resolution geographical entity mentioned and its corresponding spatial extent.
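Mordecai itself is a Python tool, but the final lookup of a spatial extent for a geoparsed entity name could resemble the following R sketch, which uses Natural Earth country data via the rnaturalearth and sf packages; it assumes a country-level match on the name_long attribute and a hypothetical place name, and finer-resolution entities would need additional layers.

```r
# Look up the bounding box (spatial extent) of a geoparsed entity name
library(rnaturalearth)
library(sf)

countries <- ne_countries(scale = "medium", returnclass = "sf")

place <- "Indonesia"                            # hypothetical geoparsed entity
hit <- countries[countries$name_long == place, ]
if (nrow(hit) == 1) {
  st_bbox(hit)                                  # xmin/ymin/xmax/ymax extent
}
```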
4. Study mapping and presentation
The R statistical computing environment29,36 will be used to plot the distribution of the evidence base. The distribution of evidence across the scientific discipline, intervention, study type and outcome categories will be explored through frequency plots, heatmaps visualising the intersection between multiple factors, geographical maps, and barplots of the number of citations published in a category over time.
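As an example of the planned visualisations, the sketch below builds a heatmap of article counts across two coded variables with ggplot2; the coded data frame and its column names are hypothetical and the values are simulated.

```r
# Heatmap of evidence distribution across two coded variables
library(ggplot2)
library(dplyr)

coded <- data.frame(                                   # simulated coding results
  oro_type = sample(c("Mitigation", "Natural resilience", "Societal adaptation"), 200, TRUE),
  outcome  = sample(c("Primary", "Co-benefit", "Trade-off"), 200, TRUE)
)

coded %>%
  count(oro_type, outcome) %>%
  ggplot(aes(x = outcome, y = oro_type, fill = n)) +
  geom_tile() +
  geom_text(aes(label = n), colour = "white") +
  labs(x = "Outcome category", y = "ORO type", fill = "Articles")
```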
These plots will be used to characterise trends in academic evidence and inform future research priorities for OROs. Across the different types of OROs, we will explore publishing trends across the developmental stages from design to implementation, and identify knowledge gaps such as under-represented scientific disciplines, unexplored outcomes, or under-researched ecosystems and marine systems. We also aim to identify knowledge clusters with sufficient information for subsequent systematic reviews, informed by the distribution of citations across the intervention and outcome variables and the proportion of these citations that represent existing systematic evidence syntheses.
The findings from this map will be summarised in a narrative report accompanied by relevant tables and figures. All associated code and materials will be uploaded to a GitHub repository, which will be made publicly available upon the article’s publication.