Empirical Project 9 Working in Google Sheets

Google Sheets-specific learning objectives

In addition to the learning objectives for this project, in this section you will learn how to create and format time variables.

Part 9.1 Households that did not get a loan

Learning objectives for this part

  • identify credit-constrained and credit-excluded households using survey information
  • create dummy (indicator) variables
  • compare characteristics of successful borrowers, discouraged borrowers, credit-constrained households, and credit-excluded households
  • explain why selection bias is an important issue.

The Ethiopian Socioeconomic Survey (ESS) data was collected in 2013–14 from a nationally representative sample of households. Households were asked about topics such as their housing conditions, assets, and access to credit.

Download the ESS data and survey questionnaire:

  1. The data is already in a format clean enough to use, so we will begin by summarizing the information in the ‘All households’ tab, starting with region and household characteristics.
dummy variable (indicator variable)
A variable that takes the value 1 if a certain condition is met, and 0 otherwise.

Now that we have an idea of what our data looks like, we will move on to identify households that are potentially excluded from the credit market or are credit constrained. The former are households who find it impossible to borrow and the latter are households who can only borrow on unfavourable terms (see Section 9.10 of Economy, Society, and Public Policy).

The variables in our dataset that are related to this issue are ‘did_not_apply’ and ‘loan_rejected’. We will soon also look at the responses given in the variables ‘reason_not_apply1’ and ‘reason_not_apply2’.

  1. Using the ‘All households’ tab:

To create operational categories to use throughout this project, we will label households as either:

You should note that the ‘denied’ households are only a subset of the credit-excluded households as there will be households that are credit excluded and do not even apply for a loan. One could, for instance, reason that households who answered ‘Inadequate Collateral’ or ‘Do Not Know Any Lender’ are also likely to be credit excluded.

  1. Using the subset of data from Question 2(b):

Note that arguably other answers are also indicative of being credit constrained, and so the criteria we use is definitely only a subset of all households that are credit constrained. For example, one could include households that have been denied a loan, and it is also indeed likely that some households that have been granted a loan are in fact credit constrained.

We will now analyse the stated reasons for wanting a loan, comparing those households that were successful (‘HH_status’ equal to ‘successful’) with those that were not successful (‘HH_status’ equal to ‘denied’).

  1. For both groups, create one table showing the proportion of households for each loan purpose. (You will realize that in the ‘All households’ tab, the reason for all successful loans is ‘Other’. For that reason, you should use the ‘Got loan’ tab to retrieve the reasons for loan information for successful loans.) Was the purpose of loans for denied and successful borrowers similar? (Hint: It may help to think about the broad categories of spending on consumption and investment).
  1. Using the information in the ‘All households’ tab and ‘Got loan’ tab, for each group of households:
Household characteristic Successful Denied
Age of household head    
Highest education in household    
Number of assets    
Household size    
Number of young children    
Number of working-age adults    

Characteristics of successful and denied borrowers.

Figure 9.1 Characteristics of successful and denied borrowers.

conditional mean
An average of a variable, taken over a subgroup of observations that satisfy certain conditions, rather than all observations.
  1. Using Figure 9.1, without conditioning on ‘rural’ or ‘region’, carry out the following:
  1. Using the information in the ‘All households’ tab:
selection bias
An issue that occurs when the sample or data observed is not representative of the population of interest. For example, individuals with certain characteristics may be more likely to be part of the sample observed (such as students being more likely than CEOs to participate in computer lab experiments).

A study on access to loans in Ethiopia looked at the relationship between loan amount and household characteristics. When doing so, they needed to account for selection bias, because we only observe positive loan amounts for successful borrowers. If we only had data for successful borrowers, then our sample would not be representative of the population of interest (all households), so we would have to interpret our results with caution. In our case, we have information about all households, so we can compare observable characteristics to see whether successful borrowers are similar to other households.

An article by the Institute for Work and Health explains selection bias in more detail, and why it is a problem encountered by all areas of research.

  1. Think of another example where there might be selection bias, in other words, where the data we observe is not representative of the population of interest.

Part 9.2 Households that got a loan

Learning objectives for this part

  • analyse the characteristics of loans obtained by successful borrowers.

For households that successfully got a loan, we will look at:

We will also see if there are any relationships between these loan characteristics and household characteristics.

Now we will use the variables relating to the loan start and end dates to calculate the duration of the loan. Before using these variables, we need to check that the variable entries make sense. Some of this information could be recorded incorrectly (for example, the year is missing a digit, or the month is a number rather than a word).

  1. Using the ‘Got loan’ tab, do the following:

Google Sheets walk-through 9.1 Creating and formatting time variables

How to create and format time variables using CONCATENATE.

Figure 9.2 How to create and format time variables using CONCATENATE.

The data

In this example, we will use the month and year variables in Columns O, P, R, and S to make new date variables in Columns V and W. Before doing this, we need to make sure all the months and years are coded correctly, otherwise Google Sheets may not recognize these values as dates. Also, some cells are blank, so we need to make our cell formulas work regardless of missing data.

Figure 9.2a In this example, we will use the month and year variables in Columns O, P, R, and S to make new date variables in Columns V and W. Before doing this, we need to make sure all the months and years are coded correctly, otherwise Google Sheets may not recognize these values as dates. Also, some cells are blank, so we need to make our cell formulas work regardless of missing data.

Combine the month and year variables into one variable

The CONCATENATE function combines text in cells in the specified order. You can add punctuation and spaces by specifying them in quotation marks (“”). Here, we use the IF function so that Google Sheets only fills in cells in rows where both start and end dates were non-missing (the AND function states these conditions).

Figure 9.2b The CONCATENATE function combines text in cells in the specified order. You can add punctuation and spaces by specifying them in quotation marks (“”). Here, we use the IF function so that Google Sheets only fills in cells in rows where both start and end dates were non-missing (the AND function states these conditions).

Reformat the cells as date variables

After step 5, you may not notice any visible changes to the text in cells, but Google Sheets now recognizes them as dates and you can use them to make calculations, such as counting the number of days between two dates.

Figure 9.2c After step 5, you may not notice any visible changes to the text in cells, but Google Sheets now recognizes them as dates and you can use them to make calculations, such as counting the number of days between two dates.

Use date variables to calculate duration

With the proper formatting, Google Sheets can calculate the number of days between two dates. In this example, some of the start dates are later than the end dates, resulting in negative numbers. We will correct these values in the next step.

Figure 9.2d With the proper formatting, Google Sheets can calculate the number of days between two dates. In this example, some of the start dates are later than the end dates, resulting in negative numbers. We will correct these values in the next step.

Recode incorrect durations

The ABS function converts any value to its positive counterpart. After step 9, the negative values in Column X are now recorded as positive numbers in Column Y. Again, we used the AND function so that Google Sheets only did this for non-blank cells (in this case, the condition AND(X2<>“”) would give the same results).

Figure 9.2e The ABS function converts any value to its positive counterpart. After step 9, the negative values in Column X are now recorded as positive numbers in Column Y. Again, we used the AND function so that Google Sheets only did this for non-blank cells (in this case, the condition AND(X2<>“”) would give the same results).

  1. Using the variables ‘loan_amount’ and ‘loan_interest’:

For help on making scatterplots and calculating correlation coefficients, see Google Sheets walk-through 1.7.

  1. Now we will look at sources of finance and how they are related to loan characteristics.
  1. In this project we have looked at patterns in borrowing and access to credit, but we are not able to make any causal statements such as ‘changes in X will cause households to be credit constrained’ or ‘characteristic Y causes improved access to credit’. Outline a policy intervention that could help improve households’ access to loans, and how to design the implementation so you can assess the causal effects of this policy.