NB: as noted above, our protocol is more extensive than required by Protocol Exchange and not all parts fit in Protocol Exchange's pre-determined categories. In addition, tables and figures cannot be inserted in-text. We therefore highly recommend that you read the formatted document instead, which can be found under the Supplementary Files section. Protocol Exchange requests a list here, which we will therefore give first:
1) retrieve documents from scientific databases;
2) manually label a subset of these documents;
3) use the labelled documents to train a binary classifier which selects relevant literature;
4) assign categories to the documents, using a supervised multi-label classifier, topic modelling, and a geoparser;
5) combine the different results of the previous step to synthesise the data.
4 Methods
4.1 A machine learning approach to evidence mapping
As described by Haddaway et al. (2020), a systematic map or evidence map aims to summarise an entire evidence base. Where a systematic review aims to combine the findings of individual studies (“what works where and how?”), an evidence map by contrast aims to describe what is known in an area (“what kinds of research exist?”). Both systematic reviews and maps share a focus on transparent and robust methods; they lay out the rationale for their main research questions in a protocol, which also describes the main methods, as we do here. The basis for the map is an often extensive search query; matching documents are then screened by hand, discarding irrelevant documents and dividing relevant documents into categories. Combined with meta-data, this allows researchers to describe developments in the field. Results often take the form of a searchable or interactive database, as well as visualisations of the data, a list of knowledge gaps and clusters, and a report noting key findings.
Systematic maps in particular may benefit from incorporating computational methods, given their relatively descriptive nature and considerable evidence base (Haddaway et al., 2020). In this project, we use the ability of computers to handle large amounts of data to create an evidence map that is as inclusive as possible. This means that the documents cannot reasonably be screened by hand, which is why we make use of so-called “supervised machine learning” to both select and classify documents. Although we cannot provide a full introduction to machine learning here (we will instead refer the interested reader to the introduction provided by Marshall and Wallace, 2019), it is important to understand that supervised machine learning in essence attempts to mimic human decision making. It does so based on the so-called “training set”, which contains data labelled by humans. For this project, no appropriate pre-labelled dataset exists. A large proportion of this protocol is therefore dedicated to describing how we will create our training set, from which the algorithms will “learn” how to select relevant documents and which categories these documents belong to. We make use of additional machine learning methods to extract further information from our dataset and combine the different layers of information for our final assessment.
Altogether, our approach can be divided into 5 stages, given at the outset -- see also Figure 1. Conceptually, this is similar to earlier computer-assisted evidence maps (including work by members of the project team, e.g. Callaghan et al., 2021, Callaghan et al., 2020, Sietsma et al., 2021).
In the next section, we will detail the search strategy for step 1. We then describe the criteria used to create the training set for steps 3 and 4 in section 4.3 Article selection and classification. More details on our chosen machine learning methods are given in section 4.5 Machine learning considerations. The other machine learning methods used for categorisation – i.e. the geoparser and topic modelling – do not make use of the pre-labelled data; more details on their use are found in section 4.6 Data extraction. The final section, 4.7 Data Synthesis, describes how we create our evidence map with knowledge gaps and clusters.
4.2 Search strategy
4.2.1 Source
The search will be carried out on the Web of Science Core Collection[1], which is a publisher-independent scientific database with a wide coverage. In addition, we will also search the MEDLINE database, which covers life sciences and biomedical research, as well as Scopus, which is maintained by the publisher Elsevier and contains a variety of high-quality journals. Note that these databases have a bias towards natural sciences, as well as towards English-speaking countries and the Global North (Mongeon and Paul-Hus, 2016, Vera-Baceta et al., 2019), which limits the representativeness of our results.
The search will be carried out on title, abstract and keywords for all databases, as well as Keywords Plus for Web of Science. We will not limit the search by date, but will limit our search to articles and reviews, in line with our problem scope.
As is common for a machine learning based screening process (Marshall and Wallace, 2019, Haddaway et al., 2020), we will not assess any full texts. Full text screening would be time-intensive (Haddaway and Westgate, 2019), whereas we will need a substantial number of hand-coded documents for our machine learning algorithms to function. We will therefore retrieve only abstracts, titles and meta-data (publication year, authors, author affiliations, keywords, references, field of research for Web of Science) for each article matching our search string; this information per article is what is meant by the word “document”.
[1]Web of Science Core Collection here includes:
● Science Citation Index Expanded (SCI-EXPANDED) --1900-present
● Social Sciences Citation Index (SSCI) --1900-present
● Arts & Humanities Citation Index (A&HCI) --1975-present
● Conference Proceedings Citation Index- Science (CPCI-S) --1990-present
● Conference Proceedings Citation Index- Social Science & Humanities (CPCI-SSH) --1990-present
● Emerging Sources Citation Index (ESCI) --2015-present
4.2.2 Search string
Our search string has three main components: climate change, adaptation and policy. Each of these parts in turn has multiple sub-components for which a query was constructed. Documents need to match at least one keyword from all strings -- i.e. they are linked by a boolean AND. The majority of sub-components are internally linked by a boolean OR.
Our broad climate search string is a modified version of the query used by Callaghan et al. (2020), which in turn is based on Grieneisen and Zhang (2011). We remove the majority of mitigation-related terms, expand the general climate change part, and add keywords on impacts, vulnerability and risk. Note that this general part of the query already captures all literature which explicitly mentions “climate change”. The added terms around climate impacts are based on the IPCC’s AR6 Table 12.2, which describes changes in natural systems and impacts on human and natural systems that can be at least partially attributed to climate change. By including these terms, we capture literature which does not mention climate change explicitly, but does describe responses to impacts that are primarily driven by climatic changes.
In the adaptation component of our query, we take a similar approach. In line with our Problem Definition above, we need to strike a balance such that we also capture the wider literature where climate change is a recognised contributor to the subject of interest, without capturing the much larger literature that investigates weather phenomena rather than climate change. To do so, we split this part of the query into three parts: 1) recognised changes in natural systems, based on the aforementioned table in AR6; 2) recognised impacts of these changes, based on the same table; and 3) recognised adaptive responses, based on the Final Government Draft of the IPCC AR6 WG2’s Cross-Chapter Box FEASIB, which lists responses to climate change; given that the full WG2 report was not available to us at the time of publishing this protocol, the response options mentioned in AR5 Table 14.1 are added to this. The second and third parts link general weather terms with a boolean AND to the impact and response keywords respectively. These general weather terms are a wider version of the same impacts covered under the first part.
Finally, in the policy component of our query, we again take a broad view of what could be considered relevant. We include terms around policy and governance, including key moments for adaptation in the UNFCCC process. Some governance-related terms (e.g. framework, management/managing) are so widely used that they proved insufficiently selective on their own; similarly, keywords around governance levels (e.g. national/international, cities) also proved too general. We therefore combine these two types of keywords with the NEAR operator. This means that, for example, articles mentioning a “national framework” are still captured.
The search terms are given in Web of Science syntax in the table in the formatted document. The plain-text search strings are given in Supplementary Materials 2.
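To make the boolean structure described above concrete, the sketch below assembles an illustrative query from placeholder keyword lists. The keywords and component names here are assumptions for illustration only; the actual keyword lists are those in the table and in Supplementary Materials 2.

```python
# Illustrative sketch only: placeholder keywords, NOT the actual search string.
# The real keyword lists are given in the formatted document and Supplementary Materials 2.

climate = ['"climate change"', '"global warming"', '"sea level rise"']
adaptation = ['adapt*', 'resilien*', '"risk reduction"']
policy = ['policy', 'governance', '(national NEAR/3 framework)']


def or_block(terms):
    """Join the keywords of one (sub-)component with a boolean OR."""
    return "(" + " OR ".join(terms) + ")"


# The three main components are linked with boolean AND; TS= restricts the search
# to title, abstract and keywords in Web of Science syntax.
query = "TS=(" + " AND ".join(or_block(component) for component in [climate, adaptation, policy]) + ")"
print(query)
```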
4.2.3 Comprehensiveness of the search
Two main factors limit the comprehensiveness of our search strategy. Both are common limitations for works of this kind (Konno and Pullin, 2020). First, keywords are in English only, so we will only capture non-English literature that is indexed in English. Combined with the language bias of the databases (Vera-Baceta et al., 2019), this will lead to a relative over-representation of literature from English-speaking countries. Second, our database does not include grey literature. Although it seems likely that a substantial amount of such literature is relevant to adaptation policy, it is more difficult to access in a reproducible manner and often will not include an abstract. The latter poses a problem for us, as our machine learning will take place at the abstract level. In addition to the different format, the lack of a comprehensive database of grey literature would also make its inclusion more complex and time-intensive. As such, this is left to future projects.
The above two caveats notwithstanding, we will strive for search terms and document selection that are as close to comprehensive as possible. The following factors should help ensure this:
● The project team consists of a diverse range of adaptation researchers, with backgrounds ranging from engineering to policy research. The whole team has been involved in formulating the search query and coding guidelines. A diverse subset will be involved in the coding itself.
● This protocol will be made public prior to completing the work.
● We will cross-check the resulting documents against the references of the Special Report on 1.5°C (IPCC, 2018) and the latest Adaptation Gap Report (United Nations Environment Programme, 2021a). We will then scan the list of missing articles for titles that appear adaptation-relevant, to identify which keywords might be missing.
We will not update the search beyond this within the current project. However, we wish to stress that, once the algorithm is trained, it can be used to filter future searches too. This means that, given proper support, this project could be used as the basis for a so-called “living evidence map” which automatically updates as new scientific papers are published.
4.3 Article selection and classification
4.3.1 Screening strategy
The above search strategy is purposefully kept broad, meaning it includes both relevant and irrelevant articles. We will make use of supervised machine learning to first select the subset that is relevant to climate change adaptation, and then to classify these relevant articles into different categories.
In practice, this leads us to a three-tiered approach for article selection.
In the first step, relevant articles are selected. We first check the articles’ basic bibliographic data against the criteria set out in the PICoST framework outlined earlier, selecting only articles and reviews and filtering out articles where the abstract is missing. After this, a group of coders will determine whether each document is relevant. There are only two possible labels here: relevant or irrelevant. To be considered relevant, the document should meet two content-based criteria:
1) The document must include a substantial focus on a response to climate change, or to a weather phenomenon for which changes can confidently be attributed to climate change as determined by the IPCC.
2) This response must be either enabled by, supported by, or a direct result of at least one policy.
These inclusion/exclusion criteria, alongside the others following from our PICoST criteria, are given in more detail in Table 3 in the formatted text.
In the second step, for the relevant documents only, the type of policy in the document will also be labelled. The categories for these labels are outlined in the subsequent section. If a document contains multiple policies, or one policy fits multiple categories, it will receive multiple labels. If the document does not contain sufficient information to determine whether a given label is appropriate, that label will be left blank. In practice, these first two steps are done concurrently.
In the third step, the labelled data from the first two steps is used to train multiple machine learning algorithms. The documents selected by these algorithms, along with their predicted categories, will form the basis of our analysis.
A separate, more extensive guide for coders, which also contains additional details on the different categories, is available in Supplementary Materials 3. Note that this is a first version; as coding progresses, this guide will be amended with additional information and examples to ensure that the coding guidelines are clear and followed consistently by all coders.
4.3.2 Consistency & independence
Coding will be conducted by multiple researchers from different backgrounds. To ensure that their coding is consistent, 15% of the documents will be coded by two or more researchers. This allows us to find disagreements between researchers; such disagreements will be discussed with the wider team until consensus is reached.
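As an illustration of how the double-coded subset could be checked, the sketch below computes raw agreement and Cohen’s kappa for a pair of coders. The protocol does not prescribe a specific agreement statistic, and the decisions shown are hypothetical, so this is only a sketch under those assumptions.

```python
# Minimal sketch: quantify inter-coder agreement on the double-coded subset for one label.
# The choice of Cohen's kappa and the example decisions below are assumptions for illustration.
from sklearn.metrics import cohen_kappa_score

# Hypothetical inclusion decisions (1 = relevant, 0 = irrelevant) from two coders
# on the same documents.
coder_a = [1, 0, 1, 1, 0, 1, 0, 0, 1, 1]
coder_b = [1, 0, 1, 0, 0, 1, 0, 1, 1, 1]

agreement = sum(a == b for a, b in zip(coder_a, coder_b)) / len(coder_a)
kappa = cohen_kappa_score(coder_a, coder_b)
print(f"raw agreement: {agreement:.2f}, Cohen's kappa: {kappa:.2f}")

# Documents where the coders disagree are flagged for discussion with the wider team.
disagreements = [i for i, (a, b) in enumerate(zip(coder_a, coder_b)) if a != b]
print("documents to discuss:", disagreements)
```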
Furthermore, for a conventional systematic review, it is not unusual to address issues of procedural independence – in practice, mostly ensuring that researchers do not include or exclude work they have themselves authored. Given that the majority of the inclusion/exclusion decisions in this project will be made by an algorithm, this is less of an issue. However, should a researcher encounter their own work during the coding process, they will refrain from coding it.
4.4 Category definitions
4.4.1 Translating the NATO typology to adaptation
As stated, a major part of our policy categorisation is based on the NATO typology. These four types – Nodality, Authority, Treasure, Organisation – form the first level of our categorisation. We also add a more detailed second and third level. This categorization scheme was developed collectively by the research team. See the table in the formatted document given in the supplementary materials.
4.4.2 Maladaptation
As adaptation policies are implemented, there is increasing attention to their often unintended negative consequences, known as maladaptation. Although there is both a political and a scientific debate on the exact meaning and proper use of the term (Juhola et al., 2016, Glover and Granberg, 2021), we take maladaptation to mean situations where “exposure and sensitivity to climate change impacts are increased as a result of action taken” (Schipper, 2020, p. 409). The maladaptive effects do not need to affect the target group of the policy – in other words, if an adaptation policy shifts vulnerability to another group, this is still considered maladaptive. Likewise, policies which negatively impact the general welfare of a population are considered maladaptive, as are actions which increase greenhouse gas emissions.
We do not record the type of maladaptation. Instead, we only record if the document provides any evidence of maladaptation or not.
4.4.3 Constraints and limits
Documents will also be marked according to whether constraints, limits or any synonyms describing “factors making it harder to plan and implement adaptation” are mentioned in the abstract. If the answer is ‘yes’, coders will be asked to specify the type of constraint, choosing from the options in Table 5 in the formatted document. Definitions and categories are based on the IPCC’s AR5 and AR6 (WG2).
4.4.4 Governance Level
Documents will be marked according to the level at which the policy is implemented (in theory or in practice), choosing between:
● International (including supranational and regional bodies such as the EU or ASEAN);
● National;
● Subnational (including local, state, province, region, municipal and city levels).
4.4.5 Type of impact responded to
Some documents specify what kind of climate change effect is being responded to. This includes both observed impacts and potential hazards. These are categorised according to IPCC AR5 SPM2 (p. 7), with the addition of the more specific terms drought, heat waves and storms to reflect the increased confidence since that assessment. This results in:
● Glaciers, snow, ice and permafrost;
● Rivers, lakes and floods;
● Drought;
● Extreme heat;
● Food production;
● Wildfire;
● Coastal erosion and/or sea level effects;
● Storms and hurricanes;
● Terrestrial ecosystems;
● Marine ecosystems;
● Livelihoods, health or economics.
If no specific impact is mentioned, this is left blank. This includes policies which respond to climate change in general.
In general, the most specific category will be the one selected (e.g. forest fires are influenced by droughts but will only be classified under Wildfire; agriculture generally depends on terrestrial ecosystems to some extent, but will be classified under Food production). However, the categories are not mutually exclusive, so in case multiple specific impacts are mentioned, all will be recorded.
4.4.6 Evidence type
Documents will be marked according to whether they provide ex-post or ex-ante evidence on policies. For the purposes of this project, ex-post refers to all studies which analyse the effects of a policy that has already been enacted. Ex-ante refers to all studies which analyse the potential effects of a policy before the policy has started being implemented.
4.4.7 Countries mentioned
We will use a pre-trained algorithm (Halterman, 2017) to identify geographic locations in all search results. As this algorithm has been trained on non-academic texts, however, coders will also note the countries for the labelled data so that we can estimate the accuracy of this algorithm for our particular dataset.
Where documents mention locations or geographical entities, annotators will record the country or countries which contain those geographical entities. For example, if a paper mentions “Berlin”, the annotator will enter “Germany”. Where geographical entities are supranational or non-national (e.g. the European Union, or the Atlantic Ocean) this field shall be left blank.
4.5 Machine learning considerations
4.5.1 Algorithm
To briefly reiterate, in this project we will employ supervised learning of two different types: first, a binary classifier is used for study identification (relevant vs not relevant); on the documents identified as relevant, we will then use a multi-label classifier for each coding level. In both cases, we are using the studies coded by hand as training and validation sets.
Choosing the appropriate algorithm for both these classification tasks is crucial to ensure that the machine learning predictions are fit for purpose. We will therefore test a variety of models from the Scikit Learn package (Pedregosa et al., 2011). Prior work (Berrang-Ford et al., 2021b, Sietsma et al., 2021) had positive results especially using Support Vector Machines (Chang and Lin, 2011); as these do not natively support multi-label predictions, we will use a one-vs-rest set-up for the category predictions. Following Callaghan et al. (2021), we will also test state-of-the-art deep learning approaches based on BERT (Devlin et al., 2018), including a BERT model that has undergone additional pre-training on a corpus of documents related to climate change (Webersinke et al., 2021).
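A minimal sketch of this baseline set-up is given below, assuming TF-IDF features over titles and abstracts; the feature representation and all hyperparameter values are placeholders to be chosen during cross-validation rather than the settings we will actually use.

```python
# Sketch of the baseline classifiers; feature extraction and hyperparameters are
# placeholders to be tuned via (nested) cross-validation.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import Pipeline

# Binary classifier for study identification (relevant vs not relevant).
binary_clf = Pipeline([
    ("tfidf", TfidfVectorizer(ngram_range=(1, 2), min_df=5)),
    ("svm", LinearSVC(C=1.0)),
])

# Multi-label classifier for the category predictions: one SVM per label in a
# one-vs-rest set-up, trained only on documents identified as relevant.
# The label data is expected as a binary indicator matrix (documents x labels).
multilabel_clf = Pipeline([
    ("tfidf", TfidfVectorizer(ngram_range=(1, 2), min_df=5)),
    ("ovr", OneVsRestClassifier(LinearSVC(C=1.0))),
])
```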
We will use a nested cross-validation procedure (Cawley and Talbot, 2010, see also Callaghan et al., 2021) to optimize hyperparameters and measure the accuracy of our classifiers. In simple terms, this entails dividing the labelled data into several subsets. All but one of the subsets are then used to train an algorithm which makes predictions on the remaining subset – known as the “test set”. The procedure is repeated with another subset as the test set until each subset has functioned as the test set once. If, for example, we divide the data into ten subsets, we obtain predictions for all labelled documents based on ten algorithms that were each trained on 90% of the total labelled dataset. Comparing these predictions against the manually created labels provides an estimate of performance with quantified uncertainty. The process can then be repeated with different hyperparameters and different algorithms, allowing us to choose the algorithm and hyperparameters that are most appropriate for our dataset.
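The nested loop can be sketched as below, reusing the `binary_clf` pipeline from the previous sketch; the hyperparameter grid, fold counts, scoring metric and the `texts`/`labels` variables are illustrative assumptions rather than fixed choices.

```python
# Nested cross-validation sketch: the inner loop tunes hyperparameters, the outer
# loop estimates performance on data not used for tuning. The grid, fold counts and
# scoring metric are illustrative; `texts` and `labels` stand for the hand-coded data.
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score

param_grid = {"tfidf__min_df": [1, 5], "svm__C": [0.1, 1.0, 10.0]}

inner_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
outer_cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)

tuned_clf = GridSearchCV(binary_clf, param_grid, cv=inner_cv, scoring="f1")

# Each outer fold yields a score from a model trained on the remaining ~90% of the
# labelled data; the spread across folds quantifies uncertainty in performance.
outer_scores = cross_val_score(tuned_clf, texts, labels, cv=outer_cv, scoring="f1")
print(f"F1 = {outer_scores.mean():.2f} +/- {outer_scores.std():.2f}")
```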
4.5.2 Random and non-random samples
Ensuring sufficient numbers of true positives may be a challenge as our initial dataset will be large and relatively unfocussed. With tens of thousands of documents in total and an evidence base that for some of the more specific categories will not be larger than a few dozen documents in total, we will need a mixture of strategies to provide the machine learning algorithm with sufficient examples to learn from, without biasing results. In practice, we will make use of a mixture of the following types of samples:
● Random – the majority of the coded documents will be selected at random. These documents will also be used to estimate the performance of our classifier.
● Preliminary machine learning – some documents will be selected based on preliminary results from our classifier. This serves two purposes: first, to identify early on which areas the classifier is struggling with; second, to increase the number of positive examples.
● Keyword-based – if a particular area is lacking positive examples, some samples will be drawn based on keywords. Care must be taken here not to bias the results, which in practice means choosing keywords that will be used not just by a large majority of the positive examples we are looking to identify, but which are also still used by a substantial body of other literature.
● From literature – if there is a need to further increase the number of positive examples, we may choose to create a sample based on the references of key literature (e.g. IPCC reports). This will ensure that the classifier performs well on highly-regarded documents.
The degree to which non-random samples will be used is dependent on the performance of the classifier. The type of sample will be recorded for all documents assigned to any given reviewer so that we can ensure a sufficiently large random set is used to evaluate the classifier performance and to prevent bias.
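The sketch below illustrates how random and keyword-based samples could be drawn and their sample type recorded; the DataFrame, its column names, the sample sizes and the example keyword are all hypothetical.

```python
# Illustrative sketch of drawing screening samples and recording the sample type.
# `docs`, its 'title'/'abstract' columns, the sample sizes and the keyword are placeholders.
import pandas as pd


def draw_samples(docs: pd.DataFrame, n_random=200, n_keyword=50, keyword="managed retreat"):
    # Random sample: also used later to evaluate classifier performance.
    random_sample = docs.sample(n=n_random, random_state=0).assign(sample_type="random")

    # Keyword-based sample: only used to boost positive examples in sparse categories.
    mask = docs["abstract"].str.contains(keyword, case=False, na=False)
    keyword_sample = docs[mask].sample(n=min(n_keyword, int(mask.sum())), random_state=0)
    keyword_sample = keyword_sample.assign(sample_type="keyword")

    # The recorded sample_type allows evaluation to be restricted to the random subset.
    return pd.concat([random_sample, keyword_sample]).drop_duplicates(subset="title")
```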
4.5.3 Accuracy targets & size of the training set
When setting accuracy targets for supervised machine learning algorithms, a key consideration is the size of the training set (i.e. the number of hand-coded documents), as this to a large degree determines the performance of the classifier. In theory, it may be possible to set a given level of accuracy (e.g. 95%) and keep increasing the size of the training set until this accuracy is reached. In practice however, using an accuracy target in such a way is often impractical for two reasons: first, hand-coding documents is time-intensive (Haddaway and Westgate, 2019); second, the performance of the classifier cannot increase beyond the quality of the input data. In other words, there is likely to be a substantial grey area on what is considered “relevant” which will lead to inconsistency among coders and therefore inconsistent training data. As a consequence, the algorithm will not be able to accurately distinguish between documents in this grey area either, no matter the size of the training dataset.
The extent to which machine learning classifiers can accurately make predictions is unknown a priori. Previous efforts using a similar strategy to the one outlined here have found that, even with thousands of documents coded, the accuracy of the classifier remains lower than what would ordinarily be expected in science, which is to say that fewer than 90% of true positives are identified (Sietsma et al., 2021, Hsu and Rauber, 2021). Note, however, that traditional systematic reviews likely suffer the same issue – it simply remains unreported, as the “performance” of the human coders is never quantified there. Using the performance metrics obtained through cross-validation, as described earlier, we can provide estimates of the accuracy of the algorithm and will report these results as well as their implications for uncertainty bands.
Although a simple accuracy target cannot be set, we can use these accuracy scores to estimate when the classifier is reaching its maximum potential. More concretely, for this project, the training dataset will consist of at minimum 1,500 hand-coded documents, of which at least 1,000 are from a random sample. This will be used to train a first version of the classifier, which in turn will be used to estimate whether performance is still improving as the training set grows. If this is the case, document screening will continue until the accuracy of the classifier has not increased meaningfully over a minimum of 500 added documents.
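One way to check whether performance is still improving with more labelled documents is a learning curve over the existing training set, sketched below with scikit-learn and the `binary_clf`, `texts` and `labels` placeholders from the earlier sketches; the step sizes, fold count and scoring metric are illustrative.

```python
# Sketch: estimate whether classifier performance is still improving as the
# training set grows; step sizes, fold count and metric are illustrative choices.
import numpy as np
from sklearn.model_selection import learning_curve

train_sizes, _, test_scores = learning_curve(
    binary_clf, texts, labels,
    train_sizes=np.linspace(0.2, 1.0, 5), cv=5, scoring="f1",
)

for size, scores in zip(train_sizes, test_scores):
    print(f"{size} training documents: F1 = {scores.mean():.2f} +/- {scores.std():.2f}")

# If the curve has flattened, adding ~500 more hand-coded documents is unlikely to
# increase accuracy meaningfully, which is the stopping criterion described above.
```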
The performance for most categories in the multi-label classifier is likely to be lower than that for the binary inclusion/exclusion classifier, given that the same number of documents is used here to provide information on multiple options within the category. Since we further expect the data to be unbalanced (i.e. some categories will have relatively few positive examples while others have many), we may use additional targeted samples to increase performance of the multi-label classifier. Foregrounded results will be limited to the category types where the classifier achieves consistently high accuracy. Recall also that the NATO-typology is hierarchical, which is useful here: if we do not have enough examples to make positive predictions at a lower level, we may still get usable predictions at the higher level.
4.6 Data extraction
Our evidence map is primarily based on the categories predicted for all the relevant documents. The criteria for these categories have been described above, though it is worth repeating that, depending on data availability and the associated performance of the classifier, some categories may later be merged. As also stated, the geographic location will be based on a pre-trained geoparser, namely Mordecai (Halterman, 2017). The hand-coded locations will only be used to estimate its accuracy.
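A minimal sketch of geoparsing an abstract with Mordecai is given below. It assumes version 2 of the package together with the Geonames/Elasticsearch index it requires, and the output field names follow Mordecai’s documentation; the example abstract is invented and the exact fields should be verified against the installed version.

```python
# Sketch of country extraction with the Mordecai geoparser (Halterman, 2017).
# Assumes Mordecai 2 and a running Geonames/Elasticsearch index, per its documentation.
from mordecai import Geoparser

geo = Geoparser()
abstract = "We study flood adaptation policy in Jakarta and Rotterdam."  # invented example
results = geo.geoparse(abstract)  # one entry per place name found in the text

# Collect the predicted countries (ISO3 codes) to compare with the hand-coded labels.
countries = {r.get("country_predicted") for r in results if r.get("country_predicted")}
print(countries)
```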
Metadata will also be retrieved for all articles. This includes the publication year, allowing for a temporal analysis. Further, the author affiliations often include an address or place name; this field can also be fed through the geoparser to identify the geographic location of the authors.
In addition to these categories, we will also make use of topic modelling. This is a so-called “unsupervised” machine learning method. In contrast to the supervised methods described earlier, unsupervised learning does not make use of a training set. Rather, the algorithm searches for structures in the data itself – in the case of topic modelling specifically, the algorithm will find clusters of words which frequently occur together for a set number of topics. Each topic will then be named by the researchers and topics can be grouped together into overarching topic groups. These topic names and groups will be determined inductively and in combination with the findings from the classifier; as such, we cannot provide additional details on their content before the final dataset of relevant documents has been compiled.
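As an illustration of this unsupervised step, the sketch below fits a topic model to the abstracts of the relevant documents using scikit-learn’s LDA implementation. The protocol does not commit to a specific topic-modelling algorithm or number of topics, so both, as well as the `relevant_abstracts` variable, are assumptions here.

```python
# Illustrative topic model; the algorithm (LDA here) and the number of topics are
# placeholders -- both are to be decided on the final dataset of relevant documents.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

vectorizer = CountVectorizer(stop_words="english", min_df=10)
doc_term = vectorizer.fit_transform(relevant_abstracts)  # abstracts of relevant documents

lda = LatentDirichletAllocation(n_components=50, random_state=0)
doc_topic = lda.fit_transform(doc_term)  # documents x topics matrix

# Top words per topic, to be named inductively by the researchers and grouped
# into overarching topic groups.
terms = vectorizer.get_feature_names_out()
for k, weights in enumerate(lda.components_):
    top_words = [terms[i] for i in weights.argsort()[::-1][:10]]
    print(k, top_words)
```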
4.7 Data Synthesis
4.7.1 Synthesis strategy
Ordinarily, a systematic review or map will result in a narrative summary, where vote-counting is especially discouraged (Haddaway et al., 2018). For this machine learning-based project, however, more quantitative measures are all but inevitable. Indeed, determining the size of the evidence base for our various categories is among the core objectives of this research and is necessary to provide context to further findings. As such, we expect to start our evidence synthesis with a description of our final dataset, its development over time, and its geographic distribution. These basic descriptors may already point towards biases in the evidence base.
By combining different layers within the final dataset, we can then investigate more complex questions. For example, the NATO categories can be combined with the results of the topic model to investigate what types of tools are most prevalent in different subject areas. We can also further investigate geographic biases, if there are any, by quantifying the prevalence of different topics per region.
4.7.2 Knowledge gaps and clusters
Since topic modelling entails the identification of clusters within a document set, this tool is ideally suited to identifying knowledge clusters within academic literature. Topics within the topic model can also overlap, which can be used to identify larger topic groups. Moreover, since topic modelling assumes that each document consists of a mixture of topics, we can also investigate the co-occurrence of topics within documents. This can be used to further highlight knowledge clusters – e.g. if the topic model were to include topics on typhoons and relocation, and these topics frequently occurred together, this would indicate that the evidence base here is strong.
Such a table of co-occurrences could also be useful for identifying knowledge gaps – e.g. if the same typhoons topic had little overlap with a coastal zone management topic, this would suggest a lack of evidence. In addition, the categorisation of the selected documents should provide some insight into understudied areas, both in terms of subject areas and in terms of geographic distribution.
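A sketch of how such a co-occurrence table could be computed from the document–topic matrix of the topic-modelling sketch above is given below; the threshold for counting a topic as “present” in a document is an arbitrary assumption.

```python
# Sketch: topic co-occurrence counts from the documents x topics matrix `doc_topic`
# produced by the topic-modelling sketch; the presence threshold is an arbitrary choice.
import numpy as np

present = (doc_topic > 0.10).astype(int)  # topic counted as present if its weight exceeds 0.10
cooccurrence = present.T @ present        # topics x topics matrix of co-occurrence counts

# High off-diagonal counts suggest knowledge clusters (topics frequently studied together);
# near-zero counts between substantively related topics point to potential knowledge gaps.
print(cooccurrence)
```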
Lastly, we can compare our dataset against other current assessments of adaptation science generally and adaptation policy in particular. For this, the upcoming IPCC WG2 assessment as well as the Adaptation Gap Report (United Nations Environment Programme, 2021a) should form a solid basis. In the case of the former, we would for example expect to see more evidence for topics highlighted in the Summary for Policymakers and its figures. Pending government approval, a figure comparing adaptation options is expected.
4.7.3 Critical appraisal
It should be noted that we do not control for study quality in any way, except by limiting our search to established databases of peer-reviewed research. This is especially important given the disparate research communities from which we draw. These communities have varying epistemological bases and standards which, despite the diversity of researchers involved in this project, the team may not always be equipped to fully appreciate, especially when judging from abstracts alone. Overall, we aim to be inclusive, which may at times result in the inclusion of documents of a scientific standard that would not be acceptable to all. In our view, this is inevitable given the scale of the project.
This same scale also means that a small number of papers are unlikely to significantly influence the results. On the one hand, this means that the point raised above about scientific quality only becomes a major concern if the general standard of the field is insufficient; on the other hand, there are indications that in some areas of research, the standard may indeed be low (e.g. Scheelbeek et al., 2021) and voices with more fundamental criticisms may get crowded out. Any narrative emerging from the data should therefore be assessed critically in light of the power structures that underpin the science of adaptation policy (Overland and Sovacool, 2020, Nightingale, 2017). More generally, during the analysis, the whole team should keep in mind that we are not evaluating adaptation policies, but rather documenting where research on adaptation policies is published (similar to the proposal by Tompkins et al., 2018). In short, we assess quantity, not quality of adaptation policy literature. Still, if successful, this would result in the most comprehensive evidence map of adaptation policies to date.