In this article, we discuss data sources: best practices for using them and the most important points to understand about them.
Why are data sources important?
Data sources are closely related to research questions in that they often provide answers to those questions. Understanding the very definition of the data and how researchers interact with it is crucial.
Data always has its sources, and there are specific rules and tips on how to define, understand, and use these sources in ways that answer research questions.
What is data and a data source?
Data is any type of information that can be measured, accessed or used in any way in a given study. This information from the data is used to answer certain research questions. Logically, the more data present in a study, the more information can be used to answer the research question.
How is data used?
We use this information to analyze the research and derive results which can help answer the research questions.
Before we explore the best practices and theories on using data sources, we must first understand what a data source is and the nature of the data itself.
Ask the right research questions
Before starting any research or analysis, it is important to define the research questions we are trying to answer in that process. These questions must be clear, and they must be relevant to the study itself.
Use data sources to answer these questions
Data sources can help answer these questions. When developing a research plan, it's very important to understand the relationship between the data source and the questions in a study.
Some important aspects to consider:
- Which data source is relevant to the research questions, and how is it relevant?
- Is the data associated with the research topic?
- Is the data credible enough to be trusted?
- Is the data accurate?
- What is the level of association of the data to the research question(s)?
- Is the data sample large enough to represent the entire population?
Answering ‘yes’ or ‘no’ to these questions requires that you understand the data source and the nature of the data in detail.
Primary data sources
If a researcher is collecting data directly, whatever method they are using, such a data source is called a primary data source.
Here are some ways to gather data directly from primary data sources:
Direct quantitative measurements
One way to get direct quantitative measurements is to use different instruments or methods of measuring values within defined parameters. These measurements can come from different environments, like nature or labs in different experiments.
These measurements also depend on the accuracy of the instrument, so it is very important to keep in mind that any instrument used in research should be validated and credible. Also, information about the instrument is part of the data and should not be ignored or forgotten.
These types of direct measurements are most often quantitative, and they can also come from diagnostic procedures such as medical diagnoses or other types of observations in other life sciences, such as biology. In social sciences, data often comes from surveys, questionnaires, and other forms of answer collection from subjects.
Very often, the data is not measured but observed, and conclusions are drawn from those observations. Determining species, diseases, or biological structures in the life sciences, and building taxonomies or other classifications in the sciences, are examples of deriving information from a researcher's observations.
Observations often require a higher level of knowledge or expertise in a subject in order to interact with the data sources optimally and in a credible way.
The advantages of primary data sources
Primary data sources have many advantages, such as:
- Being able to capture the level of information defined by the authors, or otherwise predefined by the project, with more flexibility.
- Authors can make some observations directly, which can lead to more detailed conclusions from the study.
On the other hand, there are some disadvantages.
Direct measurements, the use of different instruments, the construction of survey systems, communication with the subjects, and the materials needed to collect the data are often costly. This means that the study’s budget plays a role in the amount of data that can be collected.
Secondary data sources
Secondary sources are very important these days. In the digital world, there is more and more data stored in different online databases. In the past few years, the amount of data in online formats has grown so much that there are now thousands of such sources with large amounts of data. So what is the difference between using a primary and secondary source?
When using a secondary source, the data is already collected and stored in a safe environment and can be accessed when needed. The advantages of using such sources are speed, large amounts of data, and ease of use, but there are also some constraints and rules that apply to the use of secondary data. These include:
- Properly citing the data contributors
- The data background is already defined, and it's difficult to complete the context if it's missing
Databases and data repositories
A database is an environment where data is stored, typically in an online format. Databases are also often secure, with new data added and saved for future use.
Using publicly available data sources
Publicly available data has become far more common in recent years and is one of the most important drivers of research today. Different companies and research organizations are making large amounts of data available to researchers around the world.
The sources where this data is stored are most often called data repositories. Here are some repositories and other sources with publicly available data:
- NCBI - Biomedical and Genomic Data
- EMBL - Biomedical and Molecular Biology data
- Data.world - Community enabling free access to different types of data
- UniProt - Protein/amino acid data
- Reactome - Biological interaction and annotation data
- Paleoportal - Paleontology data
- HEP Data - Physics data
- Google Dataset Search - A search engine for finding different data sources online, and one of the easiest ways to find applicable data.
It’s important to look for additional information when using data sources. There may be terms for the data’s use. For example, some data can be used for research purposes but possibly not commercial purposes.
Understanding the context around the data
It's important to understand how the data was collected and the context around it. Examples of context include location, time, or any other data-related circumstances. Other data sources can also be part of the context if they are related to the data source.
Here are some examples of context and why it’s so important:
- Location - Data might vary in different locations.
- Time - A measurement made 10 years ago is not the same as a measurement made in present-day studies, because research might change with time. For most studies it is advisable to use the latest data sources.
- Circumstances - We should always define under what circumstances the data was collected, as these can define the data source. To understand how to use the data efficiently, we must understand these circumstances.
- Measurement/observation methods - This is very important. Different methods might affect the data, and the methods must be validated as correct.
- Limitations in collecting the data - Understanding the limitations in collecting the data from a source tells us about the credibility and accuracy of that data source.
- Information on the researchers - Knowing the background of the researchers who collected the data is very important contextual information.
- Potential sources of bias in the data - Bias can distort any analysis performed on the data, so it's important to list potential sources of bias from the start.
- Information used to determine the data’s sample size - Sample size needs to have a precise definition and should not be chosen arbitrarily. Very often, sample size is defined from previous research.
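One way to keep these context items attached to a data source is to record them in a small structured object. The sketch below is only an illustration: the class name `DataSourceContext`, its field names, and the example values are all assumptions chosen to mirror the list above, not part of any standard.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import List


@dataclass
class DataSourceContext:
    """Context recorded alongside a data source (field names are illustrative)."""

    name: str
    location: str
    collected_on: date
    circumstances: str
    method: str
    limitations: List[str] = field(default_factory=list)
    collected_by: str = ""
    bias_sources: List[str] = field(default_factory=list)
    sample_size_rationale: str = ""

    def missing_context(self) -> List[str]:
        """Return the names of context fields that are still empty."""
        return [key for key, value in vars(self).items() if value in ("", [], None)]


# Hypothetical example: a source whose context is only partly documented.
ctx = DataSourceContext(
    name="river temperature readings",
    location="hypothetical site A",
    collected_on=date(2015, 6, 1),
    circumstances="summer field campaign",
    method="calibrated digital thermometer",
)
print(ctx.missing_context())
# ['limitations', 'collected_by', 'bias_sources', 'sample_size_rationale']
```

A check like `missing_context()` makes gaps visible early, which matters most for secondary sources where missing context is hard to recover later.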
Perspectives on data sources to keep in mind
Data sources are not always representative of a whole data set. Keep in mind that the data might vary if collected from different locations and at different times. In general, the more data a source contains, the better, because larger amounts of data tend to make the source more credible.
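To illustrate why more data usually helps, here is a minimal sketch using the standard margin-of-error formula for an estimated proportion. It assumes simple random sampling from a large population and a 95% confidence level (z = 1.96); the function names are hypothetical. Note that a larger sample only reduces random sampling error. It does not correct for bias in how the data was collected.

```python
import math

Z_95 = 1.96  # z-score for a 95% confidence level (assumed here)


def margin_of_error(n: int, p: float = 0.5, z: float = Z_95) -> float:
    """Margin of error for a proportion p estimated from n random observations."""
    return z * math.sqrt(p * (1 - p) / n)


def required_sample_size(e: float, p: float = 0.5, z: float = Z_95) -> int:
    """Smallest n whose margin of error is at most e (large-population formula)."""
    return math.ceil(z**2 * p * (1 - p) / e**2)


# Quadrupling the sample size halves the margin of error.
for n in (100, 400, 1600):
    print(n, round(margin_of_error(n), 4))

print(required_sample_size(0.05))  # 385
```

Using p = 0.5 is the most conservative choice, because it gives the largest required sample size. A study with prior estimates from earlier research could substitute a more specific value.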
Make sure to have clearly defined rules on which data sources may be included in research or other types of projects. These rules are described in what we call “the inclusion criteria” of the project.
The inclusion criteria of a data source shape the discussion and conclusion of the study, because they show which part of the population or sample the results and analysis refer to.
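Inclusion criteria are easiest to apply consistently when they are written down as explicit rules. The following sketch shows one possible way to do that. The `Record` fields, the specific criteria (country, collection date, peer review), and the example records are all hypothetical and would differ from project to project.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Record:
    """One candidate data source (fields are illustrative)."""

    source: str
    country: str
    collected_on: date
    peer_reviewed: bool


# Hypothetical inclusion criteria for an imagined project.
ALLOWED_COUNTRIES = {"DE", "FR"}
EARLIEST_DATE = date(2015, 1, 1)


def meets_inclusion_criteria(record: Record) -> bool:
    """Return True only if the record satisfies every criterion."""
    return (
        record.country in ALLOWED_COUNTRIES
        and record.collected_on >= EARLIEST_DATE
        and record.peer_reviewed
    )


records = [
    Record("survey_a", "DE", date(2019, 3, 1), True),
    Record("survey_b", "US", date(2020, 5, 1), True),   # excluded: country
    Record("survey_c", "FR", date(2010, 7, 1), True),   # excluded: too old
    Record("survey_d", "FR", date(2021, 9, 1), False),  # excluded: not peer reviewed
]

included = [r for r in records if meets_inclusion_criteria(r)]
print([r.source for r in included])  # ['survey_a']
```

Keeping the criteria in one function also documents exactly which part of the population the study's results refer to.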
Data sources versus reference sources
When results of the analysis are interpreted, they are often compared to the results of other studies or research projects. Studying these references is vital in defining which types of data sources are best compared against them, and what potential differences there are to discuss. References also have their own data sources, which also need to be compared against the data sources from the study.