Empirical Project 8 Working in Excel
Part 8.1 Cleaning and summarizing the data
Learning objectives for this part
- detect and correct entries in a dataset
- recode variables to make them easier to analyse
- calculate percentiles for subsets of the data.
So far we have been working with data that was formatted correctly. However, sometimes the datasets will be messier and we will need to ‘clean’ the data before starting any analysis. Data cleaning involves checking that all variables are formatted correctly, all observations are entered correctly (e.g. no typos), and missing values are coded in a way that the software you are using can understand. Otherwise, the software will either not be able to analyse your data, or only analyse the observations it recognizes, which could lead to incorrect results and conclusions.
In the data we will use, the European Values Study (EVS) data has been converted to Excel from another program, so there are some entries that were not converted correctly and some variables that need to be recoded (for example, replacing words with numbers, or replacing one number with another).
Download the EVS data and documentation:
- Download the EVS data.
- For the documentation, go to the data download site.
- Click on the ‘Data and Documents’ button in the middle of the page, then click the ‘Other Documents’ button and download the PDF file called ‘ZA4804_EVS_VariableCorrespondence.pdf’. This file contains information about each variable in the dataset (e.g. variable name and what it measures).
- The data spreadsheet contains an incomplete Data dictionary and four tabs with the data collected from each wave of the survey. The variable names currently do not tell us what the variable is, so we need to relabel them to avoid confusion.
- Use the PDF you have downloaded to fill in Column C (Variable description) of the Data dictionary tab. The second column of the PDF lists all variables in the dataset (alphabetically) and tells you what it measures.
- Fill in Column B with an appropriate new name for each variable and rename the variables in the other four tabs accordingly.
Throughout this project we will refer to the variables using their original names, so the Data dictionary tab will come in handy. In general, data dictionaries and variable correspondences are useful because they contain important information about what each variable represents and how it is measured, which usually cannot be summarized in a short variable name.
- Now we will take a closer look at how some of the variables were measured.
- Variable A170 is reported life satisfaction on a scale of 1 (dissatisfied) to 10 (satisfied). Respondents answered the question ‘All things considered, how satisfied are you with your life as a whole these days?’ Discuss the assumptions needed to use this measure in interpersonal and cross-country comparisons, and whether you think they are plausible. (You may find it helpful to refer to Box 2.1 of the ‘OECD guidelines on measuring subjective well-being’.)
- An individual’s employment status (variable X028) was self-reported. Explain whether misreporting of employment status is likely to be an issue, and give some factors that may affect the likelihood of misreporting in this context.
- Variables C036 to C041 ask about an individual’s attitudes towards work. With self-reports, we may also be concerned that individuals are using a heuristic (rule of thumb) to answer the questions. Table 2.1 of the ‘OECD Guidelines on measuring subjective well-being’ lists some response biases and heuristics that individuals could use. Pick three that you think particularly apply to questions about life satisfaction or work ethic and describe how we might check whether this issue may be present in our data.
Before doing any data analysis, it is important to check that all variables are coded in a way that the software can recognize. This process involves checking how missing values are coded (usually these need to be coded in a particular way for each software), and that numerical variables (numbers) only contain numbers and not text (in order to calculate summary statistics).
- We will now check and clean the dataset so it is ready to use. Make the following changes to all relevant data tabs:
- Currently, missing values are recorded as ‘.a’, but we would like them to be blank cells. Use Excel’s Find and Replace tool to change the ‘.a’ to blank cells for variables A009 to X047D. (See Excel walk-through 8.1 for help on how to do this).
- Variable A170 (life satisfaction) is currently a mixture of numbers (2 to 9) and words (‘Satisfied’ and ‘Dissatisfied’), but we would like it to be all numbers. Replace the word ‘Dissatisfied’ with the number 1, and the word ‘Satisfied’ with the number 10.
- Similarly, variable X011_01 (number of children) has recorded no children as a word rather than a number. Replace ‘No children’ with the number 0.
- The variables C036 to C041 should be replaced with numbers ranging from 1 (‘Strongly disagree’) to 5 (‘Strongly agree’) so we can take averages of them later. Similarly, variable
A009
should be recoded as 1 = ‘Very Poor’, 2 = ‘Poor’, 3 = ‘Fair’, 4 = ‘Good’, 5 = ‘Very Good’. (Hint: You may find it helpful to do the find and replace in ascending/descending order i.e. recode ’Strongly Disagree’/‘Very Poor’ as 1, then ‘Disagree’/‘Poor’ as 2, and so on.)
- We would like to split X025A into two variables, one for the number before the colon, and the other containing the words after the colon. Use Excel’s LEFT and/or RIGHT functions to create two new variables accordingly. (See Excel walk-through 8.1 for help on how to do this).
Excel walk-through 8.1 Cleaning data and splitting variables
Follow the walk-through in the Core Economics website video, or in Figure 8.1, to find out how to clean data and split variables in Excel.
Although the paper we are following only considered individuals aged 25–80 who were not students, we will retain the other observations in our analysis. However, we need to remove any observations that have missing values for A170, X028, or any of the other variables except X047D. Removing missing data ensures that any summary statistics or analysis we do is done on the same set of data (without having to always account for the fact that some values are missing), and is fine as long as the data are missing at random (i.e. there is no particular reason why certain observations are missing and not others).
-
In your spreadsheet, remove all rows in all waves that have missing data for A170. Do the same for:
- X003, X028, X007 and X001 in all waves
- A009 in Waves 1, 2, and 4 only
- C036, C037, C038, C039, C041 and X047D in Waves 3 and 4 only
- X011_01 and X025A, in Wave 4.
Excel walk-through 8.2 gives guidance on how to do this.
(Note: Excel may take some time to process the commands.)
Excel walk-through 8.2 Dropping observations that satisfy particular conditions
- We will now create two variables, work ethic and relative income, which we will use in our comparison of life satisfaction.
- Work ethic is measured as the average of C036 to C041. Using Excel’s AVERAGE function, create a new variable showing this work ethic measure.
- Since unemployed individuals and students may not have income, the study calculated relative income as the relative household income of that individual, measured as a deviation from the average income in that individual’s country. Explain the issues with using this method if the income distribution is skewed (for example, a long right tail).
- Instead of using average income, we will define relative income as the percentile of that individual’s household in the income distribution. Create a new variable showing this information. (See Excel walk-through 8.3 for one way to do this).
Excel walk-through 8.3 Calculating percentiles from actual values
- Now we have all the variables we need in the format we need. We will make some tables to summarize the data. Using the data for Wave 4:
Hint: For help on creating this table, see Excel walk-through 3.1. For help on changing the values in the table to percentages of the row total, follow steps 18–19 in Excel walk-through 3.2, but instead of selecting ‘Average’ (as in step 19), select ‘Show Values As’ (or ‘Show data as’) then ‘% of Row Total’.
- Create a table showing the breakdown of each country’s population according to employment status, with country (S003) as the row variable, and employment status (X028) as the column variable. Express the values as percentages of the row total rather than frequencies. Discuss any differences or similarities between countries that you find interesting.
- Create a new table as shown in Figure 8.4 (similar to Table 1 in the study ‘Employment status and subjective well-being’) and fill in the missing values. Reported life satisfaction is measured by A170. Summary tables such as these give a useful overview of what each variable looks like.
Male | Female | |||
---|---|---|---|---|
Mean | Standard deviation | Mean | Standard deviation | |
Life satisfaction | ||||
Self-reported health | ||||
Work ethic | ||||
Age | ||||
Education | ||||
Number of children |
Summary statistics by gender, European Values Study.
Part 8.2 Visualizing the data
Note
You will need to have done Questions 1–5 in Part 8.1 before doing this part.
Learning objectives for this part
- use column charts, line charts, and scatterplots to visualize data
- calculate and interpret correlation coefficients.
We will now create some summary charts of the self-reported measures (work ethic and life satisfaction), starting with column charts to show the distributions of values. Along with employment status, these are the main variables of interest, so it is important to look at them carefully before doing further data analysis.
The distribution of work ethic and life satisfaction may vary across countries but may also change over time within a country, especially since the surveys are conducted around once a decade. To compare distributions for a particular country over time, we have to use the same horizontal axis, so we will first need to make a separate frequency table for each distribution of interest (survey wave). Also, since the number of people surveyed in each wave may differ, we will use percentages instead of frequencies as the vertical axis variable.
- Using the data from Wave 3 and Wave 4 only, for three countries of your choice:
- Using the work ethic measure, create a separate frequency table for Wave 3 and for Wave 4, similar to Figure 8.5. The column ‘Percentage of individuals (%)’ refers to the number of individuals in that wave. The values in the first column should range from 1 to 5, in intervals of 0.2. (Hint: To count observations only for a particular country, you will need to use the IF function and FREQUENCY function together. See Excel walk-through 6.1 for help on how to do this).
Range of work ethic score | Frequency | Percentage of individuals (%) |
---|---|---|
1.00 | ||
1.20 | ||
… | ||
4.80 | ||
5.00 |
Frequency table for self-reported work ethic.
- Plot a separate column chart for each country to show the distribution of work ethic scores in Wave 3, with the percentage of individuals on the vertical axis and the range of work ethic score on the horizontal axis. On each chart, plot the distribution of scores in Wave 4 on top of the Wave 3 distribution. (See Excel walk-through 6.2 in Empirical Project 6 for guidance on what your charts should look like).
- Based on your charts from 1(b), does it appear that attitudes towards work in each country of your choice have changed over time? (Hint: For example, look at where the distribution is centred, the percentages of observations on the left tail or the right tail of the distribution, and how spread out the data is.)
- We will use line charts to make a similar comparison for life satisfaction (A170) over time. Using data for countries that are present in Waves 1 to 4:
- Create a table showing the average life satisfaction, by wave (column variable) and by country (row variable). (Hint: You may find it easier to make four separate pivot tables and copy-paste the results into a new table).
- Plot a line chart with wave number (1 to 4) on the horizontal axis and average life satisfaction on the vertical axis. Make sure to include a legend.
- From your results in 2(a) and (b), how has the distribution of life satisfaction changed over time? What other information about the distribution of life satisfaction could we use to supplement these results?
- Choose one or two countries and research events that could explain the observed changes in average life satisfaction over time shown in 2(a) and (b).
- correlation coefficient
- A numerical measure, ranging between 1 and −1, of how closely associated two variables are—whether they tend to rise and fall together, or move in opposite directions. A positive coefficient indicates that when one variable takes a high (low) value, the other tends to be high (low) too, and a negative coefficient indicates that when one variable is high the other is likely to be low. A value of 1 or −1 indicates that knowing the value of one of the variables would allow you to perfectly predict the value of the other. A value of 0 indicates that knowing one of the variables provides no information about the value of the other.
After describing patterns in our main variables over time, we will use correlation coefficients and scatterplots to look at the relationship between these variables and the other variables in our dataset.
- Using the Wave 4 data:
- Create a table as shown in Figure 8.6 and calculate the required correlation coefficients. For employment status and gender, you will need to create new variables: full-time employment should be equal to 1 if full-time employed and 0 if unemployed, and treated as missing data (left as a blank cell) otherwise. Gender should be 0 if male and 1 if female.
Variable | Life satisfaction | Work ethic |
---|---|---|
Age | ||
Education | ||
Full-time employment | ||
Gender | ||
Self-reported health | ||
Income | ||
Number of children | ||
Relative income | ||
Life satisfaction | 1 | |
Work ethic | 1 |
Correlation between life satisfaction, work ethic and other variables, Wave 4.
- Interpret the coefficients, paying close attention to how the variables are coded. (For example, you could comment on the absolute magnitude and sign of the coefficients). Explain whether the relationships implied by the coefficients are what you expected (for example, would you expect life satisfaction to increase or decrease with health, income, etc.)
Next, we will look at the relationship between employment status and life satisfaction and investigate the paper’s hypothesis that this relationship varies with the average work ethic in a country.
- Using the data from Wave 4, carry out the following:
- Create a table showing the average life satisfaction according to employment status (showing the full-time employed, retired, and unemployed categories only), with country (S003) as the row variable, and employment status (X028) as the column variable. Comment on any differences in average life satisfaction between these three groups, and whether social norms is a plausible explanation for these differences.
- Use the table from 4(a) to calculate the difference in average life satisfaction (full-time employed minus unemployed, and full-time employed minus retired).
- Make a separate scatterplot for each of these differences in life satisfaction, with average work ethic on the horizontal axis and difference in life satisfaction on the vertical axis.
- For each difference (employed vs unemployed, employed vs retired), calculate and interpret the correlation coefficient between average work ethic and difference in life satisfaction.
So far we have described the data using tables and charts, but have not made any statements about whether what we observe is consistent with an assumption that work ethic has no bearing on differences in life satisfaction between different subgroups. In the next part, we will assess the relationship between employment status and life satisfaction and assess whether the observed data lead us to reject the above assumption.
Part 8.3 Confidence intervals for difference in the mean
Note
You will need to have done Questions 1–5 in Part 8.1 before doing this part. Part 8.2 is not necessary but is helpful to get an idea of what the data looks like.
Learning objectives for this part
- calculate and interpret confidence intervals for the difference in means between two groups.
The aim of this project was to look at the empirical relationship between employment and life satisfaction.
When we calculate differences between groups, we collect evidence which may or may not support a hypothesis that life satisfaction is identical between different subgroups. Economists often call this testing for statistical significance. In Part 6.2 of Empirical Project 6, we constructed 95% confidence intervals for means, and used a rule of thumb to assess our hypothesis that the two groups considered were identical (at the population level) in the variable of interest. Now we will learn how to construct confidence intervals for the difference in two means, which allows us to make such a judgment on the basis of a single confidence interval.
Remember that the width of a confidence interval depends on the standard deviation and number of observations. (Read Part 6.2 of Empirical Project 6 to understand why.) When making a confidence interval for a sample mean (such as the mean life satisfaction of the unemployed), we use the standard deviation and number of observations in that sample (unemployed people) to obtain the standard error of the sample mean.
When we look at the difference in means (such as life satisfaction of employed minus unemployed), we are using data from two groups (the unemployed and the employed) to make one confidence interval, so the number of observations is the sum of observations across both groups. To calculate the standard deviation for the difference in means, we use the standard deviations (SD) of each group:
This formula requires the two groups of data to be independent, meaning that the data from one group is not related, paired, or matched with data from the other group. This assumption is reasonable for the life satisfaction data we are using. However, if the two groups of data are not independent, for example if the same people generated both groups of data (as in the public goods experiments of Empirical Project 2), then we cannot use this formula.
Once we have the new standard deviation and number of observations, we can calculate the width of the confidence interval (distance from the mean to one end of the interval) as before. The width of the confidence interval tells us how precisely the difference in means was estimated, and this precision tells us how compatible the sample data is with the hypothesis that there is no difference between the population means. If the value 0 falls outside the 95% confidence interval, this implies that the data is unlikely to have been sampled from a population with equal population means. For example, if the estimated difference is positive, and the 95% confidence interval is not wide enough to include 0, we can be reasonably confident that the true difference is positive too. In other words, it tells us that the p-value for the difference we have found is less than 0.05, so we would be very unlikely to find such a big difference if the population means were in fact identical.
In Figure 8.7, for Great Britain, we can be reasonably confident that the true difference in average life satisfaction between the full-time employed and the unemployed is positive. However, for Spain, we do not have strong evidence of a real difference in mean life satisfaction.
- We will apply this method to make confidence intervals for differences in life satisfaction. Choose three countries: one with an average work ethic in the top third of scores, one in the middle third, and one in the lower third. (See Excel walk-through 6.4 for help on calculating confidence intervals and adding them to a chart.)
- Create a pivot table for these countries, showing the average life satisfaction score, standard deviation of life satisfaction, and number of observations, with country (S003) as the row variable, and employment status (full-time employed, retired, and unemployed only) as the column variable.
- Use your pivot table from 1(a) to calculate the difference in means (full-time employed minus unemployed, and full-time employed minus retired), the standard deviation of these differences, and the number of observations.
- Use Excel’s CONFIDENCE.T function and the calculated values in 1(b) to determine the 95% confidence interval width (distance between the mean and one end of the interval) of the difference in means.
- Plot a column chart for your chosen countries showing the difference in life satisfaction (employed vs unemployed and employed vs retired) on the vertical axis, and country on the horizontal axis (sorted according to low, medium, and high work ethic). Add the confidence intervals from 1(c) to your chart.
- Interpret your findings from Question 1(d), commenting on the size of the observed differences in means, and the precision of your estimates.
- The method we used to compare life satisfaction relied on making comparisons between people with different employment statuses, but a person’s employment status is not entirely random. We cannot therefore make causal statements such as ‘being unemployed causes life satisfaction to decrease’. Describe how we could use the following methods to assess better the effect of being unemployed on life satisfaction, and make some statements about causality:
- a natural experiment
- panel data (data on the same individuals, taken at different points in time).