The purpose of this assignment is to perform logistic regression, interpret the results, and analyze whether the information generated can be used to address a specific business problem. This assignment reuses the “Adult Incomes” data set from Topic 1; this time, you will run and analyze the data in R.
The marketing department is interested in creating advertising directed primarily at high-income individuals, and they have come to you seeking very specific customer data. The director of marketing explains that individuals with large amounts of disposable income tend to purchase luxury items. Therefore, understanding which predictors are correlated with high income can be very useful for a marketing department because it can help to tailor messages to the high-earning cohort. For example, individuals who earn capital gains tend to be high-income earners, and advertisements for luxury items can be targeted toward them on realty or investment websites.
As a member of the analytics team, you have been asked to determine a list of predictors and their relative impact on the likelihood of an individual being a high-income earner. Individuals earning more than $50,000 annually are considered high-income earners. In your summary, include discussion of how the marketing department can use your results to devise a smart advertising strategy.
Question 1: Explain the concept of logistic regression. For example, how does it make predictions about which category an observation will fall into? How is it different from linear regression? Be sure to include a discussion of the assumptions and limitations of logistic regression in your response.
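As a point of reference for Question 1, the standard textbook form of a binary logistic regression model can be written as follows. The symbols are generic notation introduced here, not names from the data set: p is the probability that the outcome equals 1, x_1, ..., x_k are predictors, and the betas are coefficients. A common convention, assumed here, is to classify an observation into category 1 when p exceeds a chosen cutoff such as 0.5.

```latex
% Logistic regression: the log-odds are linear in the predictors
\log\!\left(\frac{p}{1-p}\right) = \beta_0 + \beta_1 x_1 + \cdots + \beta_k x_k

% Equivalent form: the probability is a logistic (sigmoid) function of the linear predictor
p = P(Y = 1 \mid x_1, \ldots, x_k)
  = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x_1 + \cdots + \beta_k x_k)}}

% Linear regression, for contrast: the mean of a continuous outcome is itself linear
E[Y \mid x_1, \ldots, x_k] = \beta_0 + \beta_1 x_1 + \cdots + \beta_k x_k
```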
Question 2: Check your working directory to ensure your file is saved in the correct location, and use the appropriate R function(s) to load the “Adult Incomes” data set into your R environment as an object called “incdat.” Check the dimensions of the data to ensure it loaded correctly (you should have 1,500 observations on 9 variables). Also complete the following steps to prepare your data for analysis:
- Create a new variable in the “incdat” data set called “Income_GT50,” where the variable is equal to 1 when the income is greater than $50,000 and 0 otherwise. Using the table function, produce a table showing the number of individuals with incomes greater than $50,000. Include a screenshot of the table as part of the output.
- Partition the data to create a training data set (50%) and a test data set (50%). Check the dimensions of your training and test sets to confirm that each data set has 750 observations.
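The preparation steps above might be sketched in R roughly as follows. This is a minimal illustration on simulated data, not the actual “Adult Incomes” file: the file name in the comment, the numeric `Income` column, and the seed are all assumptions. If your income column is stored as text categories rather than numbers, the comparison used to build “Income_GT50” would need to change accordingly.

```r
# Simulated stand-in for the real data. In practice you would load the file,
# e.g. incdat <- read.csv("adult_incomes.csv")  # file name is hypothetical
set.seed(42)  # arbitrary seed so the random split is reproducible
n <- 1500
incdat <- data.frame(
  Age            = sample(18:80, n, replace = TRUE),
  Hours_Per_Week = sample(10:70, n, replace = TRUE),
  Income         = round(runif(n, 10000, 120000))  # hypothetical numeric income
)
getwd()       # confirm the working directory
dim(incdat)   # rows and columns (the real data should be 1500 x 9)

# Binary outcome: 1 when income exceeds $50,000, 0 otherwise
incdat$Income_GT50 <- ifelse(incdat$Income > 50000, 1, 0)
table(incdat$Income_GT50)

# 50/50 random partition into training and test sets
train_idx <- sample(seq_len(nrow(incdat)), size = nrow(incdat) / 2)
train <- incdat[train_idx, ]
test  <- incdat[-train_idx, ]
dim(train)
dim(test)
```

Setting a seed before sampling is a design choice that lets you reproduce the same partition each time you rerun the script.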
Question 3: Run a logistic regression on the training data with “Income_GT50” as the dependent variable and the following predictors: “Capital_Gain,” “Hours_Per_Week,” “Sex,” “Age,” and “Race.” Use the appropriate R function to summarize the results of the model. Include a screenshot of the R console output as part of the answer. What probability is being modeled?
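One way the model call for Question 3 could look is sketched below. The training data here is simulated purely so the example runs on its own: the race categories "A", "B", and "C", the coefficient values used to generate the outcome, and the seed are all made up and say nothing about the real data. Only the predictor names and “Income_GT50” come from the assignment. With a 0/1 response and `family = binomial`, `glm()` models the probability that the response equals 1.

```r
# Simulated training data (hypothetical values; not the Adult Incomes data)
set.seed(1)
n <- 750
train <- data.frame(
  Capital_Gain   = ifelse(runif(n) < 0.1, round(runif(n, 1000, 20000)), 0),
  Hours_Per_Week = sample(20:60, n, replace = TRUE),
  Sex            = factor(sample(c("Female", "Male"), n, replace = TRUE)),
  Age            = sample(18:75, n, replace = TRUE),
  Race           = factor(sample(c("A", "B", "C"), n, replace = TRUE))
)
# Made-up data-generating process so the outcome depends on some predictors
eta <- -6 + 0.0002 * train$Capital_Gain + 0.04 * train$Hours_Per_Week +
  0.05 * train$Age + 0.8 * (train$Sex == "Male")
train$Income_GT50 <- rbinom(n, 1, plogis(eta))

# Logistic regression: binomial family uses the logit link by default
fit <- glm(Income_GT50 ~ Capital_Gain + Hours_Per_Week + Sex + Age + Race,
           data = train, family = binomial)
summary(fit)  # coefficients, standard errors, z values, and p-values
```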
Question 4: Is “Race” a statistically significant predictor when modeling whether incomes are greater than $50,000 annually? Use a 5% significance level. Make sure to consider the p-value when explaining your answer.
Question 5: Rerun the model without “Race” as a predictor. Show the model summary as part of your output. Using the information in the model summary, manually write the mathematical equation showing how the model calculates probability as a function of the predictors. Interpret the meaning of the coefficients for “Age” and “Sex.” If age were to change by one year, how would that affect the probability of earning more than $50,000? Include a screenshot of the R console output as part of the answer.
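When interpreting coefficients for Question 5, the following general relationships for a logistic model may be useful. The notation is an assumption made for illustration: the betas stand for whatever estimates appear in your model summary, and "Male" denotes a 0/1 indicator, which presumes that R has treated the female category as the reference level for “Sex.”

```latex
% Model without Race, written on the log-odds scale
\log\!\left(\frac{p}{1-p}\right)
  = \beta_0 + \beta_1\,\mathrm{Capital\_Gain} + \beta_2\,\mathrm{Hours\_Per\_Week}
  + \beta_3\,\mathrm{Male} + \beta_4\,\mathrm{Age}

% A one-year increase in Age multiplies the odds by a constant factor
\frac{\mathrm{odds}(\mathrm{Age}+1)}{\mathrm{odds}(\mathrm{Age})} = e^{\beta_4}

% The effect on the probability itself is not constant: it depends on the current p
\frac{\partial p}{\partial\,\mathrm{Age}} = \beta_4\, p\,(1-p)
```

Because the last expression depends on p, the change in probability from one additional year of age varies with the values of the other predictors, whereas the odds ratio does not.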
Question 6: Show the classification table and percent correct for each predicted outcome (>50K and <=50K) for the training data and test data. Why is the percent correctly predicted usually lower in the test data set? Include the training and testing classification table outputs when submitting the answer.
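A classification table for Question 6 could be built along the following lines. This sketch uses simulated data and, for brevity, only two of the predictors. The 0.5 cutoff, the helper name `classify`, and the choice to report percent correct within each actual class (row-wise) are assumptions; a column-wise version would divide by `colSums(tab)` instead.

```r
# Simulated data with a made-up relationship (not the Adult Incomes data)
set.seed(7)
n <- 1500
dat <- data.frame(
  Hours_Per_Week = sample(20:60, n, replace = TRUE),
  Age            = sample(18:75, n, replace = TRUE)
)
dat$Income_GT50 <- rbinom(n, 1,
  plogis(-5 + 0.04 * dat$Hours_Per_Week + 0.05 * dat$Age))

# 50/50 split, then fit on the training half only
idx   <- sample(seq_len(n), n / 2)
train <- dat[idx, ]
test  <- dat[-idx, ]
fit   <- glm(Income_GT50 ~ Hours_Per_Week + Age, data = train, family = binomial)

# Hypothetical helper: cross-tabulate actual vs. predicted class at a cutoff
classify <- function(model, data, cutoff = 0.5) {
  prob <- predict(model, newdata = data, type = "response")
  pred <- factor(as.integer(prob > cutoff), levels = c(0, 1))
  tab  <- table(Actual = factor(data$Income_GT50, levels = c(0, 1)),
                Predicted = pred)
  # Percent of each actual class that was predicted correctly
  list(table = tab, pct_correct = 100 * diag(tab) / rowSums(tab))
}

train_res <- classify(fit, train)
test_res  <- classify(fit, test)
train_res$table; train_res$pct_correct
test_res$table;  test_res$pct_correct
```

Fixing the factor levels to 0 and 1 keeps the table 2 x 2 even if the model never predicts one of the classes.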
Question 7: Based on the logistic model from Question 5 (the model without “Race”), calculate the difference in the probability of earning more than $50,000 between males and females when age is 35, hours per week is 40, and capital gain is 0. Who is more likely to make more than $50,000? Explain your answer.
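The calculation for Question 7 can be sketched as below. The coefficient values are hypothetical placeholders, not estimates from the “Adult Incomes” data, and the `SexMale` name assumes the female category is the reference level; substitute the estimates from your own model summary. Base R's `plogis()` converts a log-odds value into a probability.

```r
# Hypothetical coefficients for illustration only; replace with your estimates
b <- c(Intercept = -5, Capital_Gain = 0.0003, Hours_Per_Week = 0.04,
       Age = 0.05, SexMale = 1.0)

# Linear predictor (log-odds) at Age = 35, Hours_Per_Week = 40, Capital_Gain = 0
eta_female <- b[["Intercept"]] + b[["Capital_Gain"]] * 0 +
  b[["Hours_Per_Week"]] * 40 + b[["Age"]] * 35
eta_male <- eta_female + b[["SexMale"]]  # indicator switches from 0 to 1

# Convert log-odds to probabilities with the logistic function
p_female <- plogis(eta_female)
p_male   <- plogis(eta_male)
p_male - p_female  # difference in predicted probability
```

With a fitted model object, the same probabilities could instead be obtained from `predict()` with `type = "response"` and a two-row `newdata` data frame.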
Question 8: Based upon your analysis, what are the predictors that can determine whether an individual would be considered a high-income earner? Discuss how the marketing department can use this information in formulating its advertising strategy and any recommendations you would make based on your analysis.
Question 9: Read “Black Beauty Products Kept Under Lock and Key at Some Walmart Stores, Raising Complaints” located in the topic Resources. What data and relationships do you think informed Walmart’s decision to use this security measure for these hair care products? If race were a significant predictor in this case, what ethical and legal considerations should the marketing team have been aware of when using race as a variable in targeting its marketing strategy and in reporting its findings?
Part 2 (Analysis of results and recommendations): Present your findings and recommendations in the form of a 250-word (minimum) executive summary that includes relevant data, charts, and tables in Microsoft Word. Be sure to include your R code and R output as a .txt file with your submission.
Submit the answers to Questions 1-9, including the specified screenshots and software outputs, in a Word document. Answer all questions in full sentences and be mindful of proper academic writing. Make sure the screenshots are cropped and edited to meet the rubric guidelines.
APA format is not required, but solid academic writing is expected.
This assignment uses a grading rubric. Please review the rubric prior to beginning the assignment to become familiar with the expectations for successful completion.
You are not required to submit this assignment to LopesWrite.
This benchmark assignment assesses the following programmatic competencies:
MS Business Analytics
1.6: Apply appropriate ethical and legal standards regarding the use and reporting of data.