代做PRACTICUM I代写R语言

2025-03-13 代做PRACTICUM I代写R语言

PRACTICUM I

Nota Bene

Read the instructions full at least twice from beginning to end before you get started.

Complete the practicum individually; while you may discuss your approach and share results, you may not share code or collaborate on the code. All of the work must be your own.

The due date is March 11 at 11:59pm ET. Late submissions are accepted (with the usual per day late penalty) until March 16 at 11:59pm ET. No submissions are accepted after this date.

Work on the practicum for at least three hours every day and use the time during the week prior to the practicum to start working on it, especially the configuration of the MySQL Server and the loading of    the data. Do not wait to get started. The average time to complete the practicum is 18-20 hours. Seek help early. Submit often and as soon as you have enough code that works. We will only grade the last submission. Check your submission before you submit and after. A gentle reminder that the average of both practicums must be above 70% to pass this course, so be sure to complete on time and seek help right away.

Do not procrastinate -- things that appear simple often take more time than expected and, of course, programming is fraught with potholes on the road to success. So plan accordingly. Do not wait until  shortly before it is due.

Learning Objectives

In this practicum you will learn how to:

· configure and connect to a cloud-hosted MySQL from R

· implement a normalized relational schema for an existing data set

· load data from CSV files into a relational database through R

· perform. simple analytics with SQL in R using literate programming

Overview

In this practicum you will build a database to analyze restaurants visits, revenue, and sales transactions. For an existing data set (generated synthetically), you will build a logical data model, a relational schema,realize the relational schema in a MySQL/MariaDB relational database, load data into the database, execute SQL queries, a finally perform. some simple analysis of the data.

More specifically, you will create a (normalized) relational database that has a structure that can accommodate data contained in one (or more) CSV files that may be the result of a data dump from an external party, another database, or generated in some other way. This relational database will be a MySQL (rather than a SQLite) database that is hosted on a cloud-sever so it becomes accessible from anywhere. While we recommend you use Aiven to host the MySQL database, you may use any cloud- based MySQL database server. You will then load the data from the CSV files into the various (normalized) tables in the cloud-based MySQL database. After that the CSV files are no longer used and all of your work and queries will be done against the database.

In practice, it is more likely that you will write a program or SQL script. to create the database tables, a different program to load the data from the CSV files into the database, and then write other programs to use the data in the database. In this practicum we are simulating this pipeline by writing multiple R programs as well as R Notebooks.

Read all practicum instructions first; the questions are not necessarily sequential as the process is iterative. So, if a later question requires fields you didn't build in when you first created the table, then go back to the previous code, update the code to create the correct table, and then re-build the table (drop  it,then create again). We will grade the database and not the sequence of the creation.

Use the provided time estimates for each tasks to time-box your time. Seek assistance if you spend more time than specified on a task -- you are likely not solving the problem correctly. A key objective is to learn how to look things up, how to navigate complex problems, and how to identify and resolve programming errors.

Read the Hints & Tips section frequently and before posting questions.

Key Resources & Prerequisite Lessons

06.103 ┆ Working with Vectors and Data Frames in R (http://artificium.us/lessons/06.r/l-6-103-vecs-and-dfs/l-6-103.html)

06.106 ┆ Import Data into R from CSV, TSV, and Excel Files (http://artificium.us/lessons/06.r/l-6-106-load-csv-tsv-excel-files/l-6-106.html) 06.108 ┆ Loops and Iteration in R (http://artificium.us/lessons/06.r/l-6-108-loops-iteration-in-r/l-6-108.html)

06.112 ┆ Basics of Text & String Processing in R (http://artificium.us/lessons/06.r/l-6-112-text-proc/l-6-112.html)

06.121 ┆ Writing Functions in R (http://artificium.us/lessons/06.r/l-6-121-funcs-in-r/l-6-121.html)

06.204 ┆ Literate Programming with R Notebooks (http://artificium.us/lessons/06.r/l-6-204-r-notebooks/l-6-204.html)

06.191 ┆ Debugging R Code (http://artificium.us/lessons/06.r/l-6-191-debugging/l-6-191.html)

06.301 ┆ Working with Databases in R (http://artificium.us/lessons/06.r/l-6-681-key-value-db-redis-from-r/l-6-301.html)

06.302 ┆ Bulk Load Data from CSV into Database in R (http://artificium.us/lessons/06.r/l-6-302-bulkload-data-into-db/l-6-302.html)

06.306 ┆ Dates in R and SQLite (http://artificium.us/lessons/06.r/l-6-306-dates-in-r-and-sql/l-6-306.html)

70.907 ┆ Stored Procedures in MySQL (http://artificium.us/lessons/70.sql/l-70-907-stored-procs-mysql/l-70-907.html)

Preliminary Tasks & Requirements

Before you start with the tasks below, read the Hints and Tips section below and go back to them often when you encounter problem. Most problems are addressed in that section. Consult the list before contacting us for help as you'll be able to resolve the issue more quickly.

1. Create a new project in R Studio named "CS5200.Practicum-I.LastNameF" where LastName is your last name and F is your first initial, e.g., "CS5200.Practicum-I.SmithJ".

2. Download the CSV file restaurant-visits-139874.csv (https://s3.us-east-

2.amazonaws.com/artificium.us/datasets/restaurant-visits-139874.csv)and save it locally to your R

Project folder. For development, you may use the local file, but your final code submission must load the data from the URL. You may wish to create a new data file that is a subset of the full data that you use for development so loading takes less time or only load the first 50 or 100 rows. This is a common strategy in practice. To download the file, use the right mouse button and choose "Save Link As..." or a similar menu choice on your browser. Do not click on the link as that may cause the browser to try to display the file which is unlikely to work.

The data in the CSV has been artificially generated, so the data is synthetic. If you are interested in the creation process, see 3.981 -- Synthetic Engineering of a Dataset on Restaurant Visits (http://artificium.us/lessons/03.ml/l-3-981-wk-exmpl-synth-restaurant-visit-data/l-3-981.html) .

3. Inspect the CSV data file that you downloaded so you are familiar with its columns, data types, and overall structure. You may wish to create a "sandbox" R Notebook in which you can do your inspection; you do not need to submit this notebook.

4. In R Notebooks, all R and SQL code blocks must be named, as shown in the example below. This is necessary so that you can reference your code blocks in the self-evaluation rubric to be filled out at  the end of the practicum. The names of code blocks must be unique.

```{r nameOfRCodeBlock, eval = T, warning = F}

```{sql nameOfSQLCodeBlock, connection = xDB}

You may add any additional block parameters as needed. Code blocks should be echoed (displayed) in your knitted result documents as instructed.

5. Use functions to structure your code so that it becomes more readable and easier to develop and

debug. Use headers to segment your notebook and add explanations as to what each code block means. Follow common coding practices and format your code so it is readable, and use functions to breakdown complex code. Echo all your code (although this is not always the right thing to do in

practice). Do not print large data frame. or query results -- print only a few rows.

The order of the questions may not necessarily suit the structure of your code, so you can answer questions out of order.

Part A / Configure Cloud Database

All of your data must be stored in a cloud database, so the database is accessible to not just you. We will use a MySQL database hosted on a cloud provider of your choice, although we recommend Aiven. While Aiven has some restrictions, we found it to be an easy-to-configure and use database cloud provider.

1. (0 pts / 2.0 hrs) Set up and configure a MySQL Server database on a cloud host. There are several options and you may choose any of the ones below or any other of your choosing; we recommend Aiven:

a. db4free.net (http://db4free.net/)

b. freemysqlhosting.co.uk (http://www.freemysqlhosting.co.uk/)

c. Aiven (http://aiven.io)

d. Google Cloud or AWS RDS

Note that you may use SQLite instead of MySQL but you will not get credit for this question nor credit for the question below that asks you to create a stored procedure, as those are not supported in SQLite. If you have difficulty setting up a cloud database, you should use SQLite and continue and resolve the cloud database setup once you've completed all steps of the practicum that are possible with SQLite; that way you do not lose any time.

CAUTION: Both Aiven and db4free do not allow the use of dbWriteTable() ; this approach of bulk-loading does not scale, so it should be avoided anyway. So, you must use INSERT SQL statements to write the data into Aiven and db4free. In addition, for Aiven, all text fields must beenclosed in single rather than double quotes.

Both Amazon AWS and Google Cloud offer free credits, but be sure to monitor usage so you do not  exceed the free credit or are prepared to pay (they can be a bit costly, so be careful to suspend your database when not in use and delete it after the Practicum has been graded).

You may collaborate and work with others to set up a cloud MySQL installation for Part A, but not for the remainder of the practicum. A cloud MySQL installation is necessary for us to run your code -- we cannot connect to a local installation of MySQL, although you may use one for testing and development.

If you cannot set up and connect to a MySQL cloud-host, contact us and we will provide you with an Aiven database.

Related Lesson: 6.304 ┆ Configure and Connect to Cloud MySQL from R (http://artificium.us/lessons/06.r/l-6-304-cloudMySQL-from-r/l-6-304.html)

2. (0 pts / 20 min) Write a small "sandbox" R program to connect to your newly created MySQL

database to test the connection. You can use this connection code in Part C below. You do not need to submit this test program. If you have trouble connecting to your cloud MySQL, be sure to disable any firewall or anti-virus software that maybe clocking port 3306 -- or add port 3306 to the list of open ports in your firewall software configuration.

Part B / Design Normalized Database

The data in the CSV file contains information about visits to restaurants owned by a restaurant management group. The file is the result of a "data dump" from a third-party system that the restaurant group wants to replace with an internally-built database and web application. The first step is to design a relational database schema that can hold the data. The schema must be normalized to at least 3NF.

Follow the steps below in your database design and record the steps in a document (you may use any   document writing tool of your choice). Be sure to add your name, course name, semester, and the exact questions you are answering. Be professional in your preparation.

Create an R Notebook named "designDBSchema.PractI.LastNameF.Rmd" where LastName is your last name and FirstInitial is the first letter of your first name in which to write your answers for your design; use embedded LaTeX for the equations and functional dependencies. When knitted, we should only see your design and normalization approach, but no code (as there is no code in this part).

1. (5 pts / 1 hr) For the relation represented by all of the columns in the CSV file, define all functional dependencies and list them.

2. (5 pts / 1 hr) Using the functional dependencies and the rules of normalization, decompose the relational from the CSV into several relations that all satisfy 3NF; give the relations reasonable names.

3. (5 pts / 1 hr) For the relations resulting from the normalization, create an ERD in the IE (Crow's Feet) notation. Add all attributes, attribute name, primary and foreign keys, data types, and entity descriptions. You may use any modeling tool of your choosing, e.g., LucidChart or mermaid. Embed the ERD into your document either an an embedded object, a rendered mermaid graphic, or an externally hosted image. If you render to HTML, then you must host any image on a server, but if you knit to PDF, then the image becomes embedded in the document.

Part C / Realize Database

1. (0 pts / 10 min) Create an R Script (R Program) named "createDB.PractI.LastNameF.R" where LastName is your last name and FirstInitial is the first letter of your first name, e.g., "createDB.PractI.GilesM.R". Add a header comment containing the name of the program, your name, and current semester.

2. (0 pts / 10 min) In your R Program connect to your cloud-hosted MySQL database. If you have

difficulty connecting to or setting up MySQL, then use SQLite and proceed. You can always come back to this question and change your configuration so that you connect to MySQL. This is the benefit of relational databases: you can easily switch between databases without changing your code.

3. ( 10 pts / 1.5 hrs) Using R, realize the 3NF normalized database schema designed in Part B. Add appropriate constraints, primary key and foreign key definitions. Add either lookup tables for categorical fields that are not Boolean or constraints. After creating the schema, be sure to disconnect from the database. Be sure to only create tables if they do not already exist. Add appropriate default values for each column. Add constraints as appropriate and allow NULL values only when appropriate.

Part D / Delete Database

1. (0 pts / 10 min) Create an R Script (R Program) named "deleteDB.PractI.LastNameF.R" where LastName is your last name and FirstInitial is the first letter of your first name, e.g., "deleteDB.PractI.GilesM.R". Add a header comment containing the name of the program, your name, and current semester.

2. (4 pts / 30 min) In your R Program connect to your cloud-hosted MySQL database and delete

(DROP) all tables if they exist. You can use this program to "re-initialize" your database, before you create the schema again using the program from Part C.

Part E / Populate Database

1. (0 pts / 10 min) Create an R Script (R Program) named "loadDB.PractI.LastNameF.R" where LastName is your last name and FirstInitial is the first letter of your first name, e.g., "loadDB.PractI.GilesM.R". Add a header comment containing the name of the program, your name, and current semester.

2. (1 pts / 10 min) Load the data from its CSV file into a dataframe called df.orig. For now, you may load it locally from your project folder or load a subset of the data locally, but eventually the data must be loaded from the URL, so be sure to change this before submission.

3. (30 pts / 6 hrs) Using the table definitions from Part C and the data in the dataframedf.orig from

above, write R code to populate the tables with the data from the appropriate columns. Load all data. Use appropriate default values for missing values. Note that some missing values have a "sentinel value", e.g., 99 is used when the party size is not know or "0000-00-00" is used for some missing dates. There are other such values; identify them and handle them as you believe will work and is reasonable. After loading the data, be sure to disconnect from the database.

All data manipulation and importing work must occur in you R Script. You may not modify the original data outside of R -- that would not be reproducible work. It may be helpful to create a  subset of the data for development and testing as the full file is quite large and takes time to load.

Part F / Test Data Loading Process

1. (0 pts / 10 min) Create an R Script named "testDBLoading.PractI.LastNameF.R" where LastName is your last name and FirstInitial is the first letter of your first name, e.g., "testDBLoading.PractI.GilesM.R". Add a header comment containing the name of the program, your name, and current semester.

2. (5 pts / 1.5 hrs) In this R program, load the original CSV into a dataframe and also connect to your database. Then perform. the following tests with appropriate messages indicating whether the results are what is expected: count the number of unique restaurants, customers, servers, and visits in the CSV and compare against the total number of rows in the appropriate tables. Sum up the total amount spent on food, alcohol, and tips in the CSV and compare against the same in the database.

Part G / Use Data for Reporting & Analytics

1. (0 pts / 10 min) Create an R Notebook (.Rmd file) named "RevenueReport.PractI.LastNameF.Rmd" where LastName is your last name and FirstInitial is the first letter of your first name, e.g.,

"RevenueReport.PractI.GilesM.Rmd". Use "Analyze Sales" as the title parameter, "CS5200

Practicum I" as the subtitle, your name as the author, and the current semester as the date. Write an R code chunk to connect to your database; do NOT load the CSV data. Do not echo any code in the knitted document for this or any of the questions below.

2. (5 pts / 1.5 hrs) Add a level two (##) header with the title "Analysis by Restaurant". Create a SQL

query (using either a SQL or an R code block) against your database to find the total number of

visits, total number of unique customers, total number of customers in the loyalty program, and total spent on food and alcohol (but not tips) for each restaurant. Display the result in a nicely formatted   table using the kableExtra package. Use appropriate headers for the table. The look of the table is  up to you. The restaurants should be the rows of the tables.

3. (5 pts / 1 hrs) Add a level two (##) header with the title "Analysis by Year". Create a SQL query

against your database to find the total revenue (food and alcohol sold, but not tips), average per

party spent, and average party size by year (for all restaurants, i.e., you do not have to slice per

restaurant). Put the years in the columns. Use code to find the years; do not hard code them, so the

document adjusts to new data if re-knitted. Format the result with appropriate table headers using the kableExtra package.

4. (5 pts / 2 hrs) Add a level two (##) header with the title "Trend by Year". Using the dataframe from the prior question, build a line chart that plots year along the x-axis versus total revenue. Adorn the graph with appropriate axis labels, titles, legend, data labels, etc. You should use the standard R  plot()

function; you do not need to use packages such as ggplot, ggplot2, or plotly -- although you may, of course. This tutorial (https://www.statmethods.net/graphs/scatterplot.html)may help you get    started.

5. (5 pts / 30 min) Knit the notebook to a PDF.

N.B. If you cannot knit to PDF after trying to knit on posit.cloud, then you may submit an HTML

document, but will lose 50% of the points for this question. If you knit to HTML, then you need to host your ERD as an image on a server so we can view it.

Part H / Add Business Logic

1. (0 pts / 10 min) Create an R Script named "configBusinessLogic.PractI.LastNameF.R" where

LastName is your last name and FirstInitial is the first letter of your first name, e.g.,

"configBusinessLogic.PractI.GilesM.R". Add a header comment containing the name of the program, your name, and current semester.

2. (5 pts / 2 hrs) Create a stored procedure in MySQL (note that if you used SQLite, then you cannot

complete this step) that adds a new visit to the database. Name the stored procedure `storeVisit`.

The stored procedure should take arguments for the restaurant, customer, date of visit, party size,     food and alcohol bill, and any other information required. You may assume that the server, customer, and restaurant already exist in the appropriate tables (so you can pass their PK values). Show that the stored procedure works.

3. (5 pts / 2 hrs) Create a stored procedure in MySQL (note that if you used SQLite, then you cannot

complete this step) that adds a new visit to the database. Name the stored procedure

`storeNewVisit`. The stored procedure should take arguments for the restaurant, customer, date of     visit, party size, food and alcohol bill, and any other information required., but you should not assume that the server, customer, and restaurant already exist in the appropriate tables (so you can pass

their PK values), but that they may need to be added if they do not already exist. Show that the stored procedure works.

Submission Details

Before submitting your all R programs, R Notebook, and PDF (HTML), complete the self-evaluation rubric (separate "assignment"; see Canvas).

1. All programs must run from start to end and the notebook must knit to PDF, so be sure to test

carefully, load any required libraries, and ensure that the code runs sequentially from start to end.

Clean your environment and knit the notebook to ensure there are no out-of-order dependencies and that the notebook will knit for us.

2. Your code has to run, obviously, but it also has to run somewhat efficiently... if everyone else's code   runs in 10-30 minutes but yours takes several hours then clearly is due to poor programming and not due to the inherent complexity of the problem. We expect that you follow common coding strategies   for writing efficient code such as factoring out invariants from loops, not calling functions repeatedly,   pre-allocating memory, not copying objects needlessly, not calling expensive functions when simpler  ones will do (e.g., call substring() instead of doing regular expressions), use which() when searching, use sqldf only when necessary, and so forth. These practices are not specific to R, although there are R specific performance issues, but those are less likely to be a concern here.

3. Create professionally developed code that is well documented, commented, and all chunks in notebooks are labeled.