How to use publicly available datasets for high school research

RISE Research

TL;DR: Publicly available datasets are free, ready-made collections of data that anyone can access and analyze. For high school researchers, they solve the hardest logistical problem in original research: getting real data without a lab, a budget, or institutional access. This post explains what publicly available datasets are, where to find them, how to use them correctly in a research paper, and what separates a strong dataset-driven study from a weak one.

Introduction

Most high school students assume that conducting original research means collecting their own data from scratch. They picture surveys, experiments, or fieldwork. What they do not realize is that some of the most rigorous, publishable high school research is built entirely on data that already exists and is free to access online. Learning how to use publicly available datasets for high school research is not a shortcut. It is a legitimate and widely respected methodology used by professional researchers at every level.

The gap is not in the data. It is in knowing which datasets are credible, how to frame a research question around existing data, and how to analyze and cite that data correctly. Most students who attempt this without guidance either pick the wrong dataset for their question, analyze it without a clear methodology, or draw conclusions the data cannot actually support. This post walks through the full process, step by step, so you can do it right.

What are publicly available datasets and why do they matter for your research paper?

Answer Capsule: A publicly available dataset is a structured collection of data released by a government agency, research institution, or international organization for open access. For high school researchers, these datasets enable original, quantitative analysis without requiring data collection, making university-level research achievable within a school-year timeline.

Publicly available datasets are released by bodies such as the World Health Organization, the U.S. Census Bureau, NASA, and the World Bank, and are hosted in academic repositories like Harvard Dataverse. They cover topics ranging from climate patterns and public health to economic indicators and social behavior. The data has already been collected, cleaned to varying degrees, and documented. Your job as a researcher is to ask a new question of that data and answer it rigorously.

A research paper built on a credible public dataset carries immediate methodological legitimacy. Peer reviewers and admissions readers can verify the source. The sample is not limited to whoever you could survey in your school hallway. When a student submits a paper to a journal like the International Journal of High School Research or the Journal of Student Research, a well-chosen public dataset signals that the work is grounded in real-world evidence. A paper with no dataset, or a dataset of 30 self-administered surveys, signals the opposite.

How to use publicly available datasets for high school research: a step-by-step process

Step 1: Start with your research question, not the dataset. The most common error students make is finding an interesting dataset first and then trying to build a question around it. This produces vague, unfocused papers. Instead, identify a specific, testable question in your subject area. For example: Does access to clean water correlate with under-five mortality rates across low-income countries between 2000 and 2020? That question tells you exactly which datasets to look for: water access data and child mortality data, available from the World Bank and UNICEF, respectively.

Step 2: Identify the right dataset for your question. Once your question is defined, search for datasets that contain the specific variables you need. Use Google Dataset Search (datasetsearch.research.google.com), the U.S. government open data portal at Data.gov, or domain-specific repositories like the CDC's National Center for Health Statistics for health research or NASA's Earthdata for environmental science. Check three things before committing to a dataset: the date range matches your study period, the variables are clearly defined in the accompanying documentation, and the source is a recognized institution rather than an anonymous upload.
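The three checks in Step 2 can be written down as a short screening function. This is only a sketch: the DatasetInfo fields, the variable names, and the RECOGNIZED_SOURCES list are hypothetical placeholders you would fill in by hand after reading a dataset's landing page and codebook.

```python
from dataclasses import dataclass

# Hypothetical allow-list; extend it with institutions relevant to your field.
RECOGNIZED_SOURCES = {"World Bank", "UNICEF", "WHO", "U.S. Census Bureau", "NASA", "CDC"}


@dataclass
class DatasetInfo:
    source: str                # publishing institution
    first_year: int            # earliest year covered
    last_year: int             # latest year covered
    documented_variables: set  # variable names defined in the codebook


def screen_dataset(info, study_start, study_end, needed_variables):
    """Return a list of problems; an empty list means all three checks pass."""
    problems = []
    # Check 1: the dataset's date range must cover the whole study period.
    if info.first_year > study_start or info.last_year < study_end:
        problems.append(
            f"covers {info.first_year}-{info.last_year}, "
            f"study needs {study_start}-{study_end}"
        )
    # Check 2: every variable you need must be defined in the documentation.
    missing = sorted(set(needed_variables) - info.documented_variables)
    if missing:
        problems.append("undocumented variables: " + ", ".join(missing))
    # Check 3: the source must be a recognized institution.
    if info.source not in RECOGNIZED_SOURCES:
        problems.append(f"source not on recognized list: {info.source}")
    return problems


candidate = DatasetInfo(
    "World Bank", 1990, 2022, {"water_access_pct", "under5_mortality"}
)
needed = ["water_access_pct", "under5_mortality"]
print(screen_dataset(candidate, 2000, 2020, needed))
```

Returning a list of problems, rather than a single yes or no, gives you wording you can reuse when your methodology section explains why a dataset was chosen or rejected.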

Step 3: Read the dataset documentation before touching the data. Every credible dataset comes with a codebook or methodology document explaining how the data was collected, what each variable means, and any known limitations. Most students skip this step entirely. Skipping it leads to misinterpreting variables, using data outside its intended scope, or drawing conclusions the dataset was never designed to support. Spend at least one hour reading the documentation. Note the sample size, the geographic scope, the collection period, and any caveats the original researchers flagged.

Step 4: Clean and subset the data for your specific analysis. Large public datasets often contain thousands of rows and dozens of variables. You do not need all of them. Filter the dataset to the rows and columns relevant to your research question. If you are studying a specific country, age group, or time period, subset accordingly. Use a tool like Google Sheets for smaller datasets or Python with the pandas library for larger ones. Document every filtering decision you make. Your methodology section must explain exactly what subset of the data you used and why.
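Here is one way the "document every filtering decision" advice in Step 4 might look in practice. The sketch uses only Python's built-in csv module so it runs without installing anything; the same filters could be written in pandas. The country names, column names, and numbers are invented for illustration and are not real indicator values.

```python
import csv
import io

# Illustrative rows only; these numbers are invented, not real data.
RAW_CSV = """country,income_group,year,water_access_pct,under5_mortality
Aland,Low income,1995,40.0,150.0
Aland,Low income,2005,52.0,110.0
Borduria,Low income,2010,61.0,
Borduria,Low income,2015,66.0,80.0
Cascadia,High income,2010,99.0,5.0
"""


def subset(rows, log):
    """Apply each filter in turn and record how many rows survive it."""
    steps = [
        ("keep low-income countries",
         lambda r: r["income_group"] == "Low income"),
        ("keep years 2000-2020",
         lambda r: 2000 <= int(r["year"]) <= 2020),
        ("drop rows missing mortality",
         lambda r: r["under5_mortality"] != ""),
    ]
    for label, keep in steps:
        before = len(rows)
        rows = [r for r in rows if keep(r)]
        log.append(f"{label}: {before} -> {len(rows)} rows")
    return rows


filter_log = []
all_rows = list(csv.DictReader(io.StringIO(RAW_CSV)))
kept = subset(all_rows, filter_log)
print("\n".join(filter_log))
```

The log lines can be pasted almost directly into a methodology section, because each one states the filter that was applied and how many rows it removed.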

Step 5: Choose an analysis method that matches your question. Descriptive statistics (means, medians, distributions) answer questions about patterns. Correlation analysis answers questions about relationships between two variables. Regression analysis answers questions about the strength and direction of those relationships while controlling for other factors. Match the method to the question. A student asking whether GDP per capita predicts literacy rates needs a regression, not just a bar chart. Free tools like JASP, Google Sheets, or Python's scipy library can run these analyses.

Step 6: Cite the dataset correctly and acknowledge its limitations. A dataset is a source. It must be cited in your bibliography using the format required by your target journal, typically APA or Chicago. More importantly, your discussion section must address what the dataset cannot tell you. All datasets have limitations: they may be self-reported, they may have gaps, they may reflect measurement bias. Acknowledging these limitations is not a weakness. It is what separates a rigorous paper from a naive one.

The single most common mistake at this stage is treating correlation as causation. If your analysis shows that two variables move together, that is a finding worth reporting. It is not proof that one causes the other. State your findings precisely and let the data speak for what it actually shows.

Where most high school students get stuck with publicly available datasets

The first sticking point is variable selection. A dataset may contain fifty variables, and choosing which ones to include in your analysis requires understanding the theoretical relationship between them. Students working alone often include every variable that looks relevant, which produces cluttered, uninterpretable results. They also omit confounding variables that a researcher with domain knowledge would immediately recognize as necessary controls.

The second sticking point is methodology justification. It is not enough to say you ran a regression. You must explain why a regression was the appropriate method for your specific question, what assumptions that method requires, and whether your data meets those assumptions. Most high school students do not know what those assumptions are, let alone how to test them.

The third sticking point is scope. High school students frequently try to answer questions that require data across too many countries, too many years, or too many variables for a paper of their length and experience level. A PhD mentor narrows the scope in the first session, before the student has spent weeks on an unmanageable design.

A PhD mentor with experience in quantitative research makes the most difference at Steps 2, 4, and 5. They identify which dataset is actually appropriate for the question, flag methodological problems before they become embedded in the analysis, and confirm that the conclusions drawn are defensible. This is not guidance a student can replicate by reading a tutorial. It is judgment built from years of doing this work. RISE Research PhD mentors work with students on exactly these decisions, in a 1-on-1 setting, across the full research timeline.

If you are at this stage and want a PhD mentor to guide you through using publicly available datasets for high school research and the full research process, book a free 20-minute Research Assessment to see what is possible before the Summer 2026 Priority Deadline.

What does good use of publicly available datasets look like? A high school example

Answer Capsule: A strong example uses a named, documented dataset to answer a specific, testable question with an appropriate statistical method and honest acknowledgment of limitations. A weak example uses a dataset without reading its documentation, draws causal conclusions from correlational data, or asks a question too broad for the data available.

Consider two students both interested in the relationship between screen time and academic performance.

Weak approach: The student downloads a general survey dataset from a university website, selects variables labeled "screen time" and "grades," runs a correlation, and concludes that screen time causes lower grades. The paper does not specify which age group was studied, does not note that screen time was self-reported, and does not control for socioeconomic status or sleep duration.

Strong approach: The student uses the OECD PISA 2022 dataset, which is publicly available and covers 15-year-olds across 81 participating countries and economies. The research question is: Does self-reported digital device use for leisure exceeding two hours per day correlate with lower mathematics scores among 15-year-olds in OECD countries, after controlling for socioeconomic index? The student runs a multiple regression using Python, includes the PISA socioeconomic composite variable as a control, and acknowledges in the discussion that self-reported screen time carries measurement error and that the cross-sectional design prevents causal inference.

The strong version is specific, methodologically justified, and honest about what it can and cannot claim. It is also directly publishable. You can read examples of what published student research looks like on the RISE Research publications page.

The best tools for using publicly available datasets as a high school student

Google Dataset Search (datasetsearch.research.google.com) is a free search engine specifically for datasets. It indexes datasets from government portals, academic repositories, and research institutions worldwide. It is the fastest way to find out whether a dataset exists for your topic before committing to a research question.

Data.gov is the U.S. government's open data portal, containing over 300,000 datasets across health, education, environment, and economics. Each listing names the publishing agency, though the depth of documentation varies from dataset to dataset. It is particularly useful for research questions focused on the United States. The limitation is geographic scope: it does not cover international comparisons well.

World Bank Open Data (data.worldbank.org) covers economic, social, and environmental indicators for over 200 countries from 1960 to the present. It is the standard source for cross-national research in economics, public health, and development. Data can be downloaded directly as CSV files compatible with Excel, Google Sheets, or Python.
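Downloaded indicator files often arrive in a "wide" layout, with one column per year, and most analyses want one row per country and year instead. The sketch below assumes a layout with a few metadata lines before the header and year-named columns after it. Treat that layout, the column names, and every value in SAMPLE as assumptions, and open your own file to confirm its structure before reusing the code.

```python
import csv

# Assumed layout: metadata lines, then a header with one column per year.
# All values below are invented for illustration.
SAMPLE = '''"Data Source","World Development Indicators",

"Last Updated Date","2024-01-01",

"Country Name","Country Code","Indicator Name","Indicator Code","2019","2020","2021",
"Aland","ALD","Example indicator","EX.IND","10.5","11.0","",
"Borduria","BRD","Example indicator","EX.IND","","20.1","20.4",
'''


def wide_to_long(text, header_marker="Country Name"):
    """Return (country_code, year, value) tuples, skipping empty cells."""
    lines = text.splitlines()
    # Find the header row instead of hard-coding how many lines to skip.
    start = next(
        i for i, line in enumerate(lines)
        if line.startswith(f'"{header_marker}"')
    )
    records = []
    for row in csv.DictReader(lines[start:]):
        for column, value in row.items():
            # Year columns are the ones whose names are all digits.
            if column and column.isdigit() and value:
                records.append((row["Country Code"], int(column), float(value)))
    return records


long_rows = wide_to_long(SAMPLE)
for record in long_rows:
    print(record)
```

Empty cells are dropped rather than turned into zeros, because a missing observation and a value of zero mean different things in the analysis.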

JASP (jasp-stats.org) is a free, open-source statistical software designed for researchers who are not programmers. It runs t-tests, correlations, regressions, and ANOVA with a point-and-click interface and produces output formatted for academic papers. It is the most accessible tool for high school students who need to run real statistical analyses without coding experience.

Harvard Dataverse (dataverse.harvard.edu) is an academic data repository where researchers deposit the datasets from their published studies. It is useful for finding datasets tied to specific published papers, which makes it easier to understand the variables and methodology because the original paper documents them in detail.

Frequently asked questions about using publicly available datasets for high school research

Can high school students use publicly available datasets in published research papers?

Yes. Many peer-reviewed journals that publish high school research, including the Journal of Student Research and the International Journal of High School Research, accept papers based entirely on publicly available datasets. The key requirement is that the dataset is properly cited, the methodology is clearly described, and the analysis is original.

Using a public dataset does not make a paper less original. The originality comes from the research question you ask and the analysis you conduct, not from collecting new data. Journals evaluate the rigor of the analysis, not the source of the data.

How do I cite a publicly available dataset in my research paper?

Cite the dataset as you would any other source, using the citation format required by your target journal. In APA 7th edition, a dataset citation includes the author or organization, the year, the title of the dataset, the version number if applicable, and the URL or DOI. Always cite the dataset directly, not just the organization that produced it.

For example, a World Bank dataset citation in APA format would read: World Bank. (2023). World Development Indicators [Data set]. https://databank.worldbank.org/source/world-development-indicators. Include this in your reference list and cite it in-text wherever you describe or use the data.
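If you cite several datasets, a small helper keeps the entries consistent. This is a sketch of the APA 7th edition pattern for the case where the author is also the publisher. The function name is hypothetical, plain text cannot show the italics APA expects on the title, and your target journal's guide takes priority over this pattern.

```python
def apa_dataset_citation(author, year, title, url, version=None):
    """Build an APA 7 style reference entry for a dataset (plain text)."""
    # The version is optional and goes in parentheses right after the title.
    version_part = f" (Version {version})" if version else ""
    return f"{author}. ({year}). {title}{version_part} [Data set]. {url}"


citation = apa_dataset_citation(
    "World Bank",
    2023,
    "World Development Indicators",
    "https://databank.worldbank.org/source/world-development-indicators",
)
print(citation)
```

Generating every entry from the same function prevents the small inconsistencies, such as a missing year or bracket label, that reviewers tend to notice.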

What subjects work best with publicly available datasets for high school research?

Economics, public health, environmental science, sociology, political science, and psychology all have strong publicly available dataset ecosystems. Biology and medicine are well served by datasets from the CDC, NIH, and WHO. Computer science students can use datasets from Kaggle or the UCI Machine Learning Repository for analysis or model training projects.

Humanities subjects like history and literature are less suited to dataset-driven research, though digital humanities projects using text corpora or historical census data are a growing area. If your subject is primarily qualitative, a different methodology is likely more appropriate.

How do I know if a publicly available dataset is credible enough to use?

A credible dataset comes from an identifiable institution, has documented methodology, specifies how the data was collected, and has a clear date range and sample description. Government agencies, international organizations like the UN and WHO, and academic repositories like Harvard Dataverse all meet this standard. Anonymous datasets uploaded to personal websites or without documentation do not.

Check whether the dataset has been used in published peer-reviewed research by searching Google Scholar for the dataset name. If other researchers have cited it in published papers, that is a strong sign it is credible enough for your purposes, though you should still read its documentation yourself.

Do I need coding skills to analyze publicly available datasets?

Not necessarily. Google Sheets and Microsoft Excel handle descriptive statistics, correlations, and basic regression for datasets up to a few thousand rows. JASP handles more advanced statistics without any coding. Python and R are more powerful and handle larger datasets, but they require learning time. For most high school research projects, Google Sheets or JASP is sufficient.

If your research question requires analysis of a very large dataset, such as the full PISA database, Python with the pandas library is the practical choice. Free tutorials on Python for data analysis are available through Kaggle Learn and Google's Python course.

Conclusion

Using publicly available datasets for high school research is one of the most direct paths to producing original, publishable work without the logistical barriers of primary data collection. The process requires a specific, testable research question, a credible and well-documented dataset, a methodology that matches the question, and honest interpretation of what the data can and cannot show. Getting any one of these wrong undermines the entire paper.

The students who do this well are not necessarily the ones who know the most statistics. They are the ones who had someone experienced tell them which dataset to use, which variables to include, and where their reasoning needed to be tightened. That guidance is what separates a paper that gets published from one that does not. You can see what that outcome looks like in the RISE Research admissions and publication results, and explore the range of research projects RISE scholars have completed. Research built on strong data also contributes directly to a competitive university application, as discussed in detail on the RISE blog covering how high school research helps college admissions.

The Summer 2026 Priority Deadline is approaching. If using publicly available datasets for high school research is a step you want to get right with expert guidance behind you, schedule a free Research Assessment and we will match you with a PhD mentor who has done this in your subject.
