Projects

The goal is to analyze a large, rich dataset to answer an interesting behavioral/social/neural question, with the final product being a potentially publishable paper.

This project is divided into two phases to be implemented in 201a and 201b.

In 201a your goal is to identify a conjunction of an interesting question and a data source that might answer it. You will need to understand the data, clean it, make graphs of the data that might answer the question, and do simple analyses to get your bearings.

In 201b you will do the more complete analyses, likely using more advanced methods that we will cover in 201b, and turn the initial report from 201a into something that could be submitted for publication.

Examples of this sort of thing

You will notice that particularly successful examples usually have a combination of a few things:

  1. a coherent research question, with a good justification for why the naturalistic data maps onto theoretical constructs of interest;
  2. a novel dataset, which might mean data that had not preciously been available, or a dataset that was created by cleverly combining/co-registering previously separate datasets;
  3. and (sometimes or) a fairly sophisticated analysis that adequately grapples with the complicated structure of the data.

The full project will span both 201a and 201b (previously I had it only in 201b, and that was not enough time)

201a

201a Timeline

Figure out groups as soon as possible, so we can assist if folks are group-less.

2021-10-20: Groups due
2021-10-27: Project plan due
2021-11-10: Preliminary data summaries due
2021-12-09: (before final) Write-ups due
2021-12-09: (during final) Project presentations
2021-12-10: Group-evaluation due

Groups

You will be in groups of ~3-5 (5 if group includes an undergrad). Undergrads should not be in a group together (should join a group of grad students). Hopefully you can self-assemble into groups, but I will help if need be.

Due: 2021-10-20 Create a Slack channel titled “project-[groupname]” (can be either public or private). Invite Ed and James into the channel. Upload a CSV file about your group makeup in the channel: 1 row per group member, and columns: last_name, first_name, email, group_name.

201a Project plan

Due: 2021-10-27

This is just a message including the following:

  • Description: (500 words max) describe the research question(s), the data source(s), and how they are related.
  • the data

201a Preliminary data summaries

Due: 2021-11-10

This is an R script (ideally an R markdown file), and output showing coarse summaries of the data (histograms, scatterplots, etc). This should include various validity checks to figure out if any of the data are corrupted, if co-registration of different data sources was done accurately, etc. Send the R script and output in the Slack channel.

The only goal here is to force you to look over the data to make sure you find problems early, rather than at the last minute.

201a Write-ups

Due: 2021-12-09, before class

This should be less than 2000 words, and should include:

  • A description of the research questions: why is this an interesting/important question? Add a bit of lit review to help others understand what is already known, and what missing information your approach aims to provide.
  • A description of the data, where it came from, any peculiarities about the data structure or collection method, etc.
  • Graphs that try to answer the key research questions, and some explanation about why the graphs answer (or fail to answer) these research questions.
  • Some basic statistics that attempt to quantify the answers to these research questions and the uncertainty associated with these answers (only methods covered in 201a are expected, but by all means do something fancier if you are comfortable)
  • Some discussion about which questions are adequately answered by the methods you used, and which will require a more elaborate analysis. (e.g., we fit a regression to all the data, but we know there is important substructure that this analysis is ignoring)

201a Presentation

Due: 2021-12-09, during final time

Tell us about your results.

This should be like a 10 minute conference talk:

  • provide motivation for the question you are asking, and explain why it is worth asking
  • describe (briefly) what was done on this question previously, and why your project fills a gap in knowledge
  • describe the data, any peculiarities therein, and explain why it is a useful data source for answering the question. what are the linking assumptions?
  • present your results (mostly graphs), and explain what we learn from these graphs.
  • describe the caveats, to set the agenda for what you might try to address in 201b.

The goals here are:

  • to teach the audience (us instructors, and the rest of the class) about this research domain/question
  • keep the audience entertained
  • practice working through the logic tying questions to data
  • practice making nice data visualizations in R to highlight the relevant results
  • practice giving talks.

Do not go into agonizing details about the trials and tribulations involved with your R code, your reanalysis, etc. Just give us your results.

201a Group-evaluation

Due: 2021-12-10

After you completed your presentation, each student should independently DM Ed on Slack: your group name, and your estimate as to what percentage of the total work each member of your group did. Basically, I want to know if someone single-handedly carried your group, or if someone was a free-rider and let the rest of the group do all the work. I’m hoping that there will not be such an uneven distribution of effort, and that the mere fact that these evaluations will be sent will motivate your group not to leave you hanging.

201b

In 201b you will start where you left off in 201a: fill in the missing analyses, flesh out the motivation/introduction, add a proper discussion, and hopefully, wind up with a paper that might be submitted for publication with minimal further effort.

Data sources

The lists below are just pointers, you should do your own googling, and hopefully you will find something new and interesting.
In general, you will get the most mileage in doing something new by intelligently combining several sources that have not already been combined for you (e.g., cross-linking crime statistics for a city with city demographics, or fluctuations in attitudes over time/geography, with voting patterns at that time/place, etc).

General social science survey: http://www3.norc.org/Gss+website/

data.gov: http://www.data.gov/

census datasets: https://www.census.gov/data/developers/data-sets.html

FBI crime statistics: http://www.fbi.gov/stats-services/crimestats

Perhaps useful forum: http://opendata.stackexchange.com/

List of random data sets: http://rs.io/100-interesting-data-sets-for-statistics/

FAA datasets: https://www.faa.gov/data_research/aviation_data_statistics/data_downloads/

FDIC (retail banking) datasets: https://www2.fdic.gov/idasp/warp_download_all.asp

Fuel economy information from the EPA: http://www.fueleconomy.gov/feg/download.shtml

NIH funding information (may be difficult to pull data into a usable format): http://report.nih.gov/nihdatabook/index.aspx

Personality test data: http://personality-testing.info/_rawdata/

Drug use data: http://www.icpsr.umich.edu/icpsrweb/ICPSR/studies/34933?q=&paging.rows=25&sortBy=10

CDC health and nutrition: http://www.cdc.gov/nchs/nhanes/nhanes_questionnaires.htm

American Sign Language corpus: http://www.bu.edu/av/asllrp/dai-asllvd.html

OpenPsychometrics: https://openpsychometrics.org/tests/OSRI/

Large, but likely tricky data set to analyze:
https://gigaom.com/2014/05/29/more-than-250-million-global-events-are-now-in-the-cloud-for-anyone-to-analyze/

Datasets available upon request:
Dundee eye-tracking: http://www.dundee.ac.uk/psychology/staff/profile/alan-kennedy

List of somewhat small data sets, more suited to small class examples rather than posing new questions:
http://www.calvin.edu/~stob/data/

More lists here:
http://opendata.stackexchange.com/questions/266/a-database-of-open-databases
http://stats.stackexchange.com/questions/7/locating-freely-available-data-samples

https://github.com/rasbt/pattern_classification/blob/master/resources/dataset_collections.md https://www.kaggle.com/datasets?sortBy=votes&group=all

https://cseweb.ucsd.edu/~jmcauley/datasets.html

https://aminer.org/citation

Other lists I found while googling for “public social science data sources”
http://socsciresearch.com/r6.html
http://ciser.cornell.edu/ASPs/datasource.asp?CATEGORY=2
http://personality-testing.info/_rawdata/ http://veekaybee.github.io/2018/07/23/small-datasets/ http://blog.yhat.com/posts/7-funny-datasets.html https://vincentarelbundock.github.io/Rdatasets/datasets.html http://biostat.mc.vanderbilt.edu/wiki/Main/DataSets https://www.kaggle.com/datasets