Evaluation of pupils is a task that is required of all instructors and is thus something that they all do. At the conclusion of the semester, it is customary to offer each student a letter grade based on their overall performance on all of the assessments they have taken, including tests, homework, quizzes, and projects. This often requires you to do a number of computations, which you might accomplish on a spreadsheet. Alternatively, you may want to think about using Python and pandas.
While utilising a spreadsheet, one of the difficulties that might arise is that it can be difficult to recognise when a mistake has been made in a calculation. It’s possible that you chose the incorrect column and inserted quizzes where exams should have been. It’s possible that you discovered the maximum of two wrong values. Python and pandas, two data analysis tools, may help you speed up the process of finding and correcting errors in your computations, which is necessary to overcome this issue. The purpose of this guide is to teach you how to.
:
-
Load
and
merge
data from multiple sources with pandas -
Filter
and
group
data in a pandas DataFrame -
Calculate
and
plot
grades in a pandas DataFrame
Build It: In this lesson, you’ll construct a comprehensive project from start to finish. To get the code for this pandas project, click the link below, and then follow along as you develop your gradebook script. Check out the Pandas Learning Path if you’re interested in learning more about these cute and cuddly animals.
Demo: What You Will Build
You are going to write a Python script in order to complete this pandas project. This script will import your grade data and produce letter grades for your pupils. Watch the following video to see how the script is supposed to work in practise:
Your script will create CSV output files and run from either the command line or your integrated development environment (IDE). This will allow you to easily copy and paste the results into your institution’s grading system. In addition to this, you will be asked to construct a few plots so that you may examine the distribution of your marks.
Project Overview
This panda project consists of four primary components.
steps:
-
Explore the data
you’ll use in the project to determine which format and data you’ll need to calculate your final grades. -
Load the data
into pandas DataFrames, making sure to connect the grades for the same student across all your data sources. -
Calculate the final grades
and save them as CSV files. -
Plot the grade distribution
and explore how the grades vary among your students.
After you’ve finished going through all of these processes, you’ll have a functioning Python script that can figure out your grades. Your grading will be sent to you in a format that is designed to be compatible with the student administration system used at your institution.
Reading Material as Background
If you have even a tiny amount of expertise dealing with pandas, you will benefit tremendously from participating in our panda project. In the event that you are in need of a refresher, then the tutorials and courses that are provided here will get you up to speed for this.
project:
-
The Pandas DataFrame: Make Working With Data Delightful
-
Basic Pandas Data Structures
-
Pandas: How to Read and Write Files
-
Reading CSVs With Pandas
-
Combining Data in Pandas With
merge()
,
.join()
, and
concat()
Don’t put too much pressure on yourself to learn all that is presented in those lessons from memory. A real-world example of the concepts covered will be presented to you as part of this Pandas project. Let’s have a look at the information that you’ll be putting to use in this project now, shall we?
Investigating the Information About This Pandas Project
You undoubtedly utilised a number of services to manage your class this semester, just as the majority of instructors do.
including:
- The school’s student administration system
- A service to manage assigning and grading homework and exams
- A service to manage assigning and grading quizzes
You will utilise sample data for the purposes of this project. This sample data will reflect the results that you may achieve from using these systems. The information is included in files that use the comma-separated values (CSV) format. This section displays a few representative examples of the data. To begin, there is a file that has all of the information on the class roster saved inside it. Your student administration is the source of this information.
system:
ID | Name | NetID | Email Address | Section |
---|---|---|---|---|
1234567 | “Barrera Jr., Woody” | WXB12345 | WOODY.BARRERA_JR@UNIV.EDU | 1 |
2345678 | “Lambert, Malaika” | MXL12345 | MALAIKA.LAMBERT@UNIV.EDU | 2 |
3456789 | “Joyce, Traci” | TXJ12345 | TRACI.JOYCE@UNIV.EDU | 1 |
4567890 | “Flower, John Gregg” | JGF12345 | JOHN.G.2.FLOWER@UNIV.EDU | 3 |
This table lists the student ID number, name, NetID, and email address of each individual student, as well as the section of the class to which they belong. Throughout this term, you were the instructor for a single class that had many meeting times, and the section number for each of those times was different.
The next thing you have is a file that has your homework and your test results in it. The one that comes from the agency that grades homework and exams has a slightly different arrangement of columns than the one that comes from the
roster:
SID | First Name | Last Name | Homework 1 | Homework 1 – Max Points | Homework 1 – Submission Time | … |
---|---|---|---|---|---|---|
jgf12345 | Gregg | Flower | 69 | 80 | 2019-08-29 08:56:02-07:00 | … |
mxl12345 | Malaika | Lambert | 63 | 80 | 2019-08-29 08:56:02-07:00 | … |
txj12345 | Traci | Joyce | 80 | 2019-08-29 08:56:02-07:00 | … | |
wxb12345 | Woody | Barrera | 55 | 80 | 2019-08-29 08:56:02-07:00 | … |
A Student ID (SID), a first name, and a last name are included for each individual student in this table. In addition, there are three values that will be reported for every piece of homework and test that you take.
gave:
-
The
score
the student received -
The
maximum score
for that assignment -
The
time
the student submitted the assignment
Lastly, you have files that provide information on the grades on the quizzes. These files are divided up so that just one test is saved in each one, and the information that is included in these files is distinct from the information that is found in the roster and in the assignments.
files:
Last Name | First Name | Grade | |
---|---|---|---|
Barrera | Woody | woody.barrera_jr@univ.edu | 4 |
Flower | John | john.g.2.flower@univ.edu | 8 |
Joyce | Traci | traci.joyce@univ.edu | 8 |
Lambert | Malaika | malaika.lambert@univ.edu | 8 |
The table for the quiz includes columns for each student’s last name, first name, email address, and grade on the quiz. Take note that the highest possible score on the test is not included in the data provided in this table. In the next steps, you’ll be shown how to provide such information.
With closer inspection of these statistics, you could find various
features:
-
Each table has different representations of the students’ names
. For instance, in the roster table the names are in the form
"Last Name, First Name"
with quotes so that a
CSV parser
doesn’t interpret the comma as a new column. However, in the homework table, first names and last names each get their own column. -
Each student might use a different name in different data sources
. For instance, the quiz tables don’t include the suffix
Jr.
in Woody Barrera’s name. Another example is that John Flower prefers to be called by his middle name, Gregg, so he adjusted the display in the homework table. -
Each student’s email address doesn’t have the same elements
. The basic email address for a student is
first.last@univ.edu
. However, if an email of that form is already owned by another student, then the email address is modified to be unique. This means you can’t predict a student’s email address just from their name. -
Each column may use a unique name even if it has the same data
. For instance, all the students have an identifier of the form
abc12345
. The roster table calls this their NetID, while the homework table calls this their SID. The quiz tables don’t have this information at all. Similarly, some tables use the column header
Email address
, while others just use
Email
. -
Each table sorts the data differently
. In the roster table, the data are sorted by the
ID
column. In the homework table, the data are sorted by the first letter of the first name. In the quiz tables, the data are sorted in a random order. -
Each of the rows or columns in the tables may have missing data
. For instance, Traci Joyce didn’t submit her work for Homework 1, so her row is blank in the homework table.
The data that you will encounter in the actual world will have all of these characteristics as well as many more. In the next sections of this pandas project, you will learn how to deal with each of these traits and ensure that they do not interfere with your research.
Choosing the Ultimately Acceptable Format for the Data
You are now in a position to consider the final format of the data, having first been familiar with the raw data types. At the conclusion of the process, you will need to assign a letter grade to each student based on the raw scores they received. Your completed data table will have one row for each student, and that row will include all of that student’s information. The total number of pupils in your class will then determine the appropriate number of rows to use.
Each score on each of the homework assignments, quizzes, and exams will be represented in the columns. You will also save certain information about each student, such as their name and a special identification that is unique to them. At the end, you will record the results of each calculation as well as the overall letter grade in different columns.
You may see a preview of your completed product here.
table:
Identifier | Name | Homework | Quizzes | Exams | Final Score | Final Grade |
---|---|---|---|---|---|---|
Student 1 | Last, First | # | # | # | # | A–F |
Student 2 | Last, First | # | # | # | # | A–F |
… | … | … | … | … | … | … |
The database is organised such that each row contains all of the information pertaining to a specific student. The student’s individual identification is located in the first column, while the student’s name may be found in the second column. The results of the assignments, quizzes, exams, and overall grade are then recorded in a succession of columns. The last column is a space for the overall grade.
You are now in a position to begin working with the data after having gained an understanding of how the data will ultimately take form. The loading of the data is the very first stage!
Pandas will be used to load the data.
pandas is widely considered to be one of the most useful Python libraries for dealing with tabular data. You are going to make use of a significant portion of pandas’ capability, particularly with regard to the combining of datasets and the execution of mathematical operations on the data.
The compilation of code snippets that are shown in this part may be found in the 01-loading-the-data.py file. If you click the link that is provided below, you will be able to get the source code:
Create a Python script and save it with the name gradebook.py. In addition to this, you will need to create a folder with the name data, which will be used to hold the files containing the input data for your gradebook script.
After that, open gradebook.py and begin by including a module-level docstring that describes the function of the gradebook file. You also have the ability to import certain libraries, right?
now:
"""Calculate student grades by combining data from many sources.
Using pandas, this script combines data from the:
* Roster
* Homework & Exam grades
* Quiz grades
to calculate final grades for a class.
"""
from pathlib import Path
import pandas as pd
Included in this code is a docstring that provides an explanation of what the script is intended to do. After that, you need to import the pathlib.Path and pandas packages.
The Roster File Is Being Loaded
Now that everything is set up, you can begin loading the data, starting with the
roster:
HERE = Path(__file__).parent
DATA_FOLDER = HERE / "data"
roster = pd.read_csv(
DATA_FOLDER / "roster.csv",
converters={"NetID": str.lower, "Email Address": str.lower},
usecols=["Section", "Email Address", "NetID"],
index_col="NetID",
)
You will need to construct two constants, referred to as HERE and DATA FOLDER, in this section of code in order to maintain track of the location of the file that is now being executed as well as the folder in which the data is kept. Since they make use of the pathlib module, these constants make it simple to construct references to a variety of directories.
After that, you read the roster file by using the pd.read csv() function. In order to make further processing of the data easier, you will first create an index by using the index col function, and then you will use the usecols function to include just the columns that will be beneficial.
You must additionally give the converters option in order to convert the column names to lowercase. This will ensure that you are able to compare strings in the future. The string comparisons you’ll be doing later on will be easier to complete as a result of this.
In the roster DataFrame, part of the data may be seen by the user.
below:
NetID | Email Address | Section |
---|---|---|
wxb12345 | woody.barrera_jr@univ.edu | 1 |
mxl12345 | malaika.lambert@univ.edu | 2 |
txj12345 | traci.joyce@univ.edu | 1 |
jgf12345 | john.g.2.flower@univ.edu | 3 |
These are the first four rows of roster, and you should recognise them since they are identical to the rows of the roster table that you examined in the previous part. Nevertheless, since you gave the str.lower argument to the converters for the NetID and Email Address columns, both of those columns have been transformed to utilise lowercase string representations. You have neglected to include the Name and ID fields as well.
Putting the Homework and Exam Files in the Computer
The next step is to load the CSV file containing the grades for the homework and exams. Keep in mind that in addition to the grades, this file also contains the first and last names of the students, as well as a column labelled SID. You are going to disregard the columns along with the submission, right?
times:
hw_exam_grades = pd.read_csv(
DATA_FOLDER / "hw_exam_grades.csv",
converters={"SID": str.lower},
usecols=lambda x: "Submission" not in x,
index_col="SID",
)
You have to use the converters parameter once more in this piece of code in order to change the data in the SID column and the Email Address column to lowercase. Even if, at first glance, the information in these columns seems to be written in lowercase letters, it is always best practise to check and make sure that everything is consistent. In order to match the roster DataFrame, you will also need to supply the SID field as the index column.
You won’t be making use of the columns in this CSV file that provide the times that assignments were turned in for any further analysis. These columns include the timings. Nonetheless, you want to preserve so many more columns that it would be laborious to put them all clearly here.
In order to get around this limitation, usecols may also take functions that are called with just one parameter, which is the name of the column. In the event that the function returns the value True, the column will be integrated. In such case, the column will not be included. If the word “Submission” is found anywhere in the column name, then the column will not be considered for exclusion using the lambda function that you are passing here.
You can get an idea of how the data appears after it has been cleaned up by looking at the following example from the hw exam grades DataFrame:
loaded:
SID | … | Homework 1 | Homework 1 – Max Points | Homework 2 | … |
---|---|---|---|---|---|
jgf12345 | … | 69 | 80 | 52 | … |
mxl12345 | … | 63 | 80 | 57 | … |
txj12345 | … | nan | 80 | 77 | … |
wxb12345 | … | 55 | 80 | 62 | … |
These are the rows that include the example students’ information from the CSV file including their homework and exam grades that you looked at in the previous section. Take note that the value for the missing data in the Homework 1 column for Traci Joyce (SID txj12345) was interpreted as a nan, which stands for “Not a Number.” In the next part, you are going to learn how to process this sort of data. The ellipses (…) represent columns of data that are loaded from the actual data but are not displayed in the sample here. These columns of data are not shown in the sample here.
The Quiz Files Are Being Loaded
In the last step, you will have to load the data from the tests. You are required to read all five quizzes, however the information that you need may be found in a single DataFrame rather than in five distinct DataFrames. This is the most usable version of this data. The completed data format will seem as follows:
this:
Quiz 5 | Quiz 2 | Quiz 4 | Quiz 1 | Quiz 3 | |
---|---|---|---|---|---|
woody.barrera_jr@univ.edu | 10 | 10 | 7 | 4 | 11 |
john.g.2.flower@univ.edu | 5 | 8 | 13 | 8 | 8 |
traci.joyce@univ.edu | 4 | 6 | 9 | 8 | 14 |
malaika.lambert@univ.edu | 6 | 10 | 13 | 8 | 10 |
The Email column serves as the index for this DataFrame, and each question is contained inside its own individual column. Please take note that the quizzes are presented in the wrong sequence; nevertheless, when you compute your final results, you will realise that the order of the questions is irrelevant. You may load the quiz by using this code, which is located here.
files:
quiz_grades = pd.DataFrame()
for file_path in DATA_FOLDER.glob("quiz_*_grades.csv"):
quiz_name = " ".join(file_path.stem.title().split("_")[:2])
quiz = pd.read_csv(
file_path,
converters={"Email": str.lower},
index_col=["Email"],
usecols=["Email", "Grade"],
).rename(columns={"Grade": quiz_name})
quiz_grades = pd.concat([quiz_grades, quiz], axis=1)
With this code, you will create a DataFrame with no data and name it quiz grades. You need the empty DataFrame for the same reason that you need to construct an empty list before you can use the list.append() function: both are prerequisites.
You use the Path.glob() function to locate all of the quiz CSV files, and then you import them into pandas, ensuring that the email addresses are all converted to lowercase. You also put the email addresses of the students as the index column for each quiz. pd.concat() utilises these email addresses to align the data for each individual student.
Take note that you provide the axis=1 parameter to the pd.concat() function. This results in pandas concatenating columns rather than rows, with the result that each new question is added as a new column to the aggregate DataFrame.
Last but not least, you will need to use the DataFrame.rename() method to alter the name of the grade column from Grade to something more relevant to each individual test.
Combining the DataFrames for the Grades
You are now able to integrate the information included in your roster, hw exam grades, and quiz grades DataFrames as all of your data has successfully been imported. This enables you to do all of your calculations inside a single DataFrame, and then store the results of those calculations in another format when you are through.
The file called 02-merging-dataframes.py contains a compilation of all the changes that were made to the gradebook.py file in this section. If you click the link that is provided below, you will be able to get the source code:
You will combine the data together in one of two ways.
steps:
-
Merge
roster
and
hw_exam_grades
together into a new DataFrame called
final_data
. -
Merge
final_data
and
quiz_grades
together.
You will provide each DataFrame its own unique column to serve as the merging key; this will allow pandas to identify which rows should be kept together. Since each data source assigns a distinct and individual identity to each student in a distinctive way, this procedure is essential.
Combining the Attendance List with Homework Marks
The NetID or SID column serves as a one-of-a-kind identification for each individual student and may be found in both the roster and the hw exam grades table. Having an index is of the utmost importance when pandas operations such as merging or combining DataFrames are being performed. As you were loading the quiz files, you already had an opportunity to see how helpful this was.
After you loaded the roster and the homework grades, you need to keep in mind that you provided the index col option to the pd.read csv() function. You are now able to combine these two DataFrames.
together:
final_data = pd.merge(
roster, hw_exam_grades, left_index=True, right_index=True,
)
You may combine the roster and hw exam grades DataFrames by using the pd.merge() function, which is used in this code.
A representation of the combined DataFrame for the four examples is shown below.
students:
NetID | Email Address | … | Homework 1 | … |
---|---|---|---|---|
wxb12345 | woody.barrera_jr@univ.edu | … | 55 | … |
mxl12345 | malaika.lambert@univ.edu | … | 63 | … |
txj12345 | traci.joyce@univ.edu | … | nan | … |
jgf12345 | john.g.2.flower@univ.edu | … | 69 | … |
As you have seen previously, the ellipses denote columns that aren’t shown in this example but are present in the real DataFrame. These columns may be found in the actual DataFrame. Students who shared a NetID or SID have been grouped together in the example table; as a result, their email addresses and grades for Homework 1 are consistent with those found in the tables that you have already examined.
The Combining of the Test Scores
When you imported the data for the quiz grades, you utilised the email address as a unique identifier for each student. This was done so that the grades could be properly assigned. In contrast to hw exam grades and roster, which made use of the NetID and SID, respectively, this behaves differently.
You may combine quiz grades and final data by using the index that is included within quiz grades as well as the Email Address column that is contained within final data.
:
final_data = pd.merge(
final_data, quiz_grades, left_on="Email Address", right_index=True
)
You may instruct pandas to utilise the Email Address column in final data in the merge by using the left on parameter that is sent to the pd.merge() function in this code. You may also instruct pandas to utilise the index from quiz grades in the merging by using the right index command.
The following is a sample of the combined DataFrame, which displays the four examples:
students:
NetID | Email Address | … | Homework 1 | … | Quiz 3 |
---|---|---|---|---|---|
wxb12345 | woody.barrera_jr@univ.edu | … | 55 | … | 11 |
mxl12345 | malaika.lambert@univ.edu | … | 63 | … | 10 |
txj12345 | traci.joyce@univ.edu | … | nan | … | 14 |
jgf12345 | john.g.2.flower@univ.edu | … | 69 | … | 8 |
Keep in mind that the ellipses denote missing columns in the example table shown here, but that these columns will be available in the merged DataFrame. You may do a careful inspection of the tables that came before to ensure that the corresponding numbers are assigned to the right pupils.
Filling in nan Values
At this point, all of your data has been combined into a single DataFrame. You have one more item of data cleansing to do before you can go on to computing the grades. This is required.
You can see that Traci Joyce’s Homework 1 assignment still has a nan value by looking at the table that is located above this one. Calculations can’t be done with nan values since, well, they’re not numbers! You may assign a number to each of the nan values in final data by making use of the DataFrame.fillna() function.
:
final_data = final_data.fillna(0)
You may use the DataFrame.fillna() method, which is located in this code, to populate all nan values in final data with the number 0. This is an acceptable resolution since the nan value in the column that contains Traci Joyce’s Homework 1 shows that the score is missing, which suggests that she most likely did not turn in the assignment.
A sample of the updated DataFrame, displaying the four examples is shown below.
students:
NetID | … | First Name | Last Name | Homework 1 | … |
---|---|---|---|---|---|
wxb12345 | … | Woody | Barrera | 55 | … |
mxl12345 | … | Malaika | Lambert | 63 | … |
txj12345 | … | Traci | Joyce | 0 | … |
jgf12345 | … | John | Flower | 69 | … |
As can be seen in this table, Traci Joyce’s score on Homework 1 has been altered from nan to 0; nevertheless, the grades given to the other pupils have remained the same.
Pandas DataFrames for the Purpose of Determining Grades
You were responsible for completing three different types of tasks in your.
class:
- Exams
- Homework
- Quizzes
There is a certain amount of importance placed on each of these aspects of the pupils’ overall grade. You delegated the following work to your students for this semester’s class.
weights:
Category | Percent of Final Grade | Weight |
---|---|---|
Exam 1 Score | 5 |
|
Exam 2 Score | 10 |
|
Exam 3 Score | 15 |
|
Quiz Score | 30 |
|
Homework Score | 40 |
|
The overall score may be computed by first adding up the individual category totals, then multiplying each category’s weight by its overall score, and then adding up all of these numbers. After that, an overall letter grade will be determined based on the final score.
The file called 03-calculating-grades.py contains a compilation of all the changes that were made to the gradebook.py file in this section. If you click the link that is provided below, you will be able to get the source code:
This indicates that you will need to compute the sum for each group separately. The sum of a student’s scores in each area is expressed as a floating-point number between 0 and 1; this number shows the amount of points a student earned in comparison to the highest attainable score. You will be responsible for each category of the assignment in turn.
Performing the Overall Score Calculation for the Test
You will begin by determining marks for the various tests. You are able to compute the overall score for each test on its own as a result of the fact that each exam has its own distinct weight. Using a for loop, as can be seen in this example, is the solution that makes the most sense.
code:
n_exams = 3
for n in range(1, n_exams + 1):
final_data[f"Exam {n} Score"] = (
final_data[f"Exam {n}"] / final_data[f"Exam {n} - Max Points"]
)
Since there were three tests spread out over the semester, you need to tell this code to treat n exams as having a value of 3. After that, you iterate over each test to get the score for that test by dividing the raw score by the maximum possible points for that test.
Below is a representation of the data from the test for each of the four examples:
students:
NetID | Exam 1 Score | Exam 2 Score | Exam 3 Score |
---|---|---|---|
wxb12345 | 0.86 | 0.62 | 0.90 |
mxl12345 | 0.60 | 0.91 | 0.93 |
txj12345 | 1.00 | 0.84 | 0.64 |
jgf12345 | 0.72 | 0.83 | 0.77 |
On each of the assessments shown in this table, each student earned a score between 0.0 and 1.0. At the very conclusion of your story, you are going to multiply these scores by the weight in order to figure out what part of your overall grade they make up.
Doing the Calculations on the Homework
The next step is to determine how many points each piece of assignment is worth. The maximum number of points that may be earned for each piece of homework ranges from 50 to 100. This indicates that there are two distinct approaches to figuring out the task.
score:
-
By total score:
Sum the raw scores and maximum points independently, then take the ratio. -
By average score:
Divide each raw score by its respective maximum points, then take the sum of these ratios and divide the total by the number of assignments.
In the first way, students who maintained a high level of performance are rewarded with a higher score, but in the second method, students who did well on tasks that were worth more points are given preference. In order to be of assistance to the pupils, you will offer them the higher of these two scores.
It will take some time to calculate these scores.
steps:
- Collect the columns with homework data.
- Compute the total score.
- Compute the average score.
- Determine which score is larger and will be used in the final score calculation.
To begin, you will need to compile all of the columns with data from your assignment. You may do this by use the DataFrame.filter() method.
this:
homework_scores = final_data.filter(regex=r"^Homework \d\d?$", axis=1)
homework_max_points = final_data.filter(regex=r"^Homework \d\d? -", axis=1)
To filter the final data, you make use of a regular expression (abbreviated as regex) in this code. If the name of a column does not match the regular expression, then the column will not be included in the DataFrame that is produced.
axis is the name of the other input that you provide to the DataFrame.filter() method. You may switch between the two ways of working with a DataFrame by using the axis parameter. A number of the methods that make up a DataFrame can operate either row-wise or column-wise. Pandas will search the index for rows that match the given regular expression if you leave the axis parameter at its default value of 0. You instead want to identify all of the columns that match the regular expression, so you provide in the axis=1 parameter.
You may now proceed with the computations using the columns that you extracted from the DataFrame in order to meet your requirements. To get the overall amount of homework, you must first add the two numbers without regard to one another, and then divide those totals by one another.
score:
sum_of_hw_scores = homework_scores.sum(axis=1)
sum_of_hw_max = homework_max_points.sum(axis=1)
final_data["Total Homework"] = sum_of_hw_scores / sum_of_hw_max
You utilise the DataFrame.sum() method while passing the axis parameter while you are working with this code. The.sum() function’s default behaviour is to total up all of the values that are present in each row and column. Nonetheless, you are interested in the total of all the columns for each row given that each row represents a different student. The axis=1 option instructs pandas to do the specified action.
After that, you will provide the ratio of the two sums to a brand new column in final data that will be named Total Homework.
The following is a representation of one of the possible outcomes of the calculations for the four examples:
students:
NetID | Sum of Homework Scores | Sum of Max Scores | Total Homework |
---|---|---|---|
wxb12345 | 598 | 740 | 0.808108 |
mxl12345 | 612 | 740 | 0.827027 |
txj12345 | 581 | 740 | 0.785135 |
jgf12345 | 570 | 740 | 0.770270 |
This table displays the sum of each student’s homework scores, as well as the sum of their maximum possible points, and their overall score for their homework.
The second way to calculate this is to take each homework score, multiply it by its highest possible score, put all of these individual values together, and then divide the sum by the total number of assignments. You might do this by use a for loop and iteratively going over each column. On the other hand, Pandas makes it possible for you to be more productive since it will match the labels of the columns and indexes and will only do mathematical operations on the labels that match.
In order for this to function, you will need to adjust the column names in the homework max points table so that they are consistent with the names in the homework scores table. You may do this by use the DataFrame.set axis function ()
:
hw_max_renamed = homework_max_points.set_axis(homework_scores.columns, axis=1)
When you run this code, a new DataFrame is generated and given the name hw max renamed. The column axes of this new DataFrame are given the same names as the columns in the homework scores database. This DataFrame is now available for more applications.
calculations:
average_hw_scores = (homework_scores / hw_max_renamed).sum(axis=1)
The average hw scores are determined by dividing the total number of points possible for each assignment by the total number of points possible for that assignment. The next step is to use the DataFrame.sum() function with the axis=1 option to compile the ratios of all of the homework assignments included in each row.
Given that the highest possible point total for any one assignment is a perfect score of 1.0, the maximum possible value for this sum is the same as the total number of tasks to be completed for this unit. To nonetheless figure towards the overall score, you will need a number that is rated on a scale ranging from 0 to 1.
This indicates that you need to divide the total number of assignments by the average number of homework scores, which you may accomplish using this.
code:
final_data["Average Homework"] = average_hw_scores / homework_scores.shape[1]
To get the total number of homework assignments from the homework scores table, you make use of the DataFrame.shape method in this code. DataFrame.shape, which operates similarly to NumPy arrays, produces a tuple consisting of (n rows, n columns). The number of assignments is equal to the number of columns in the homework scores table, which can be obtained by taking the second value from the tuple.
After that, you will put the answer to the division into a brand new column in final data that will be known as the Average Homework.
The calculating result for the four examples is shown below as an example.
students:
NetID | Sum of Average Homework Scores | Average Homework |
---|---|---|
wxb12345 | 7.99405 | 0.799405 |
mxl12345 | 8.18944 | 0.818944 |
txj12345 | 7.85940 | 0.785940 |
jgf12345 | 7.65710 | 0.765710 |
Take note that the range of possible values for the Sum of Average Homework Scores in this table is from 0 to 10, but the range of possible values for the Average Homework column is from 0 to 1. The second column will be utilised to make a comparison to the total amount of homework that was assigned.
You may choose the maximum value to be utilised in the final grade now that you have computed the scores for both of your homework assignments.
calculation:
final_data["Homework Score"] = final_data[
["Total Homework", "Average Homework"]
].max(axis=1)
A new column will be formed and given the name “Homework Score.” In this code, you will pick the two columns that you have just established—Total Homework and Average Homework—and apply the highest possible value to the new column. Please take note that when axis=1, you are taking the maximum for each student.
The following is a representation of the findings that were computed for the four examples:
students:
NetID | Total Homework | Average Homework | Homework Score |
---|---|---|---|
wxb12345 | 0.808108 | 0.799405 | 0.808108 |
mxl12345 | 0.827027 | 0.818944 | 0.827027 |
txj12345 | 0.785135 | 0.785940 | 0.785940 |
jgf12345 | 0.770270 | 0.765710 | 0.770270 |
You are able to examine the differences between the Total Homework, Average Homework, and final Homework Score columns using this table. It is clear that the Homework Score takes into account whichever factor is greater, the total amount of homework or the average amount of homework.
Figuring Out the Results of the Quiz
The next step is to determine the final score for the test. You need to use the same process as you did for the assignment since the quizzes each have a different quantity of points that may be earned in total. The main difference is that the quiz data tables do not indicate the maximum grade that can be earned on each quiz; thus, you will need to build a pandas Series in order to store that information.
information:
quiz_scores = final_data.filter(regex=r"^Quiz \d$", axis=1)
quiz_max_points = pd.Series(
{"Quiz 1": 11, "Quiz 2": 15, "Quiz 3": 17, "Quiz 4": 14, "Quiz 5": 12}
)
sum_of_quiz_scores = quiz_scores.sum(axis=1)
sum_of_quiz_max = quiz_max_points.sum()
final_data["Total Quizzes"] = sum_of_quiz_scores / sum_of_quiz_max
average_quiz_scores = (quiz_scores / quiz_max_points).sum(axis=1)
final_data["Average Quizzes"] = average_quiz_scores / quiz_scores.shape[1]
final_data["Quiz Score"] = final_data[
["Total Quizzes", "Average Quizzes"]
].max(axis=1)
The majority of this code is quite similar to the code that was used for the homework in the previous part. The primary difference between this scenario with the homework situation is that you developed a pandas Series for the quiz max points variable, and you used a dictionary as the input. The dictionary’s keys are transformed into the index’s labels, and the dictionary’s values are repurposed as the Series’ values.
You don’t have to use the DataFrame.set axis() method for the quizzes since the index labels in quiz max points have the same names as the index labels in quiz scores. Moreover, Pandas will broadcast the form of a Series to ensure that it is consistent with the DataFrame.
Here is an example of what the computation for the outcome should look like:
quizzes:
NetID | Total Quizzes | Average Quizzes | Quiz Score |
---|---|---|---|
wxb12345 | 0.608696 | 0.602139 | 0.608696 |
mxl12345 | 0.681159 | 0.682149 | 0.682149 |
txj12345 | 0.594203 | 0.585399 | 0.594203 |
jgf12345 | 0.608696 | 0.615286 | 0.615286 |
As was to be anticipated, the Quiz Score column in this table is always sorted by whatever value, Total Quizzes or Average Quizzes, is greater.
Methods for Figuring Out the Letter Grade
At this point, you have finished all of the necessary computations for the overall grade. You will get a score that ranges from 0 to 1 for each of the assessments, including the assignments and the quizzes. After that, you need to compute the overall grade by multiplying each score by the weighting it was given. After that, you’ll be able to translate that number into a scale with letter grades ranging from A to F.
You will use a pandas Series, just as you used for storing the maximum quiz scores, in order to keep track of the weightings. In this approach, you will be able to automatically multiply by the appropriate columns based on final data. Using this, you may create custom weightings.
code:
weightings = pd.Series(
{
"Exam 1 Score": 0.05,
"Exam 2 Score": 0.1,
"Exam 3 Score": 0.15,
"Quiz Score": 0.30,
"Homework Score": 0.4,
}
)
You assign a value to the importance of each component of the class using this piece of code. As you have seen before, the first exam is worth 5 percent of the overall mark, the second exam is worth 10 percent, the third exam is worth 15 percent, quizzes are worth 30 percent, and homework is worth 40 percent of the total grade.
After that, you may establish the overall score by adding the percentages you just determined to the points you’ve already tallied.
score:
final_data["Final Score"] = (final_data[weightings.index] * weightings).sum(
axis=1
)
final_data["Ceiling Score"] = np.ceil(final_data["Final Score"] * 100)
You need to choose the columns of final data that have the same names as the index in weightings before you can proceed with the rest of this code. You need to do this because some of the other columns in final data have the type str. As a result, pandas will throw a TypeError if you attempt to multiply weightings by all of final data. You can avoid this error by doing what I just described.
Then, you use the DataFrame.sum(axis=1) function to calculate the total of these columns for each student, and then you create a new column and label it Final Score to store the outcome of this calculation. The value that is assigned to each student in this column is a floating-point number that falls somewhere between 0 and 1.
You are going to round each student’s grade up to the next higher number finally since you are such a kind and considerate instructor. The Final Score of each student is converted to a scale that ranges from 0 to 100 by multiplying it by 100, and then the numpy.ceil() function is used to round each score to the next largest integer. You will then enter this number into a brand new column that will be known as the Ceiling Score. Note: You will need to include import numpy as np at the very beginning of your script in order to utilise the np.ceil() function.
An example calculation result for each of these columns, based on the previous four examples, is as follows:
students:
NetID | Final Score | Ceiling Score |
---|---|---|
wxb12345 | 0.745852 | 75 |
mxl12345 | 0.795956 | 80 |
txj12345 | 0.722637 | 73 |
jgf12345 | 0.727194 | 73 |
The very last thing that has to be done is to assign a letter grade based on each student’s potential maximum score. You could make use of these letters in your educational institution.
grades:
-
A
: Score of 90 or higher -
B
: Score between 80 and 90 -
C
: Score between 70 and 80 -
D
: Score between 60 and 70 -
F
: Score below 60
You can’t simply rely on a dictionary to do the mapping for you since each letter grade has to translate to a range of scores. This makes it difficult. Thankfully, pandas has a method called Series.map(), which enables users to apply whatever function they like to the data contained in a Series. You may accomplish a similar goal by using a grading system that differs from the traditional letter grades.
You are able to create a suitable function using this.
way:
grades = {
90: "A",
80: "B",
70: "C",
60: "D",
0: "F",
}
def grade_mapping(value):
for key, letter in grades.items():
if value >= key:
return letter
With this piece of code, you will generate a dictionary that will contain the mapping between the lowest possible letter grade and the corresponding letter. The next step is to define the grade mapping() function, which requires the value of a row from the ceiling score Series as an input. You go through the grades in a loop, comparing each item’s value to the corresponding key from the dictionary. If value is larger than key, then the student is placed in that bracket, and you should return the corresponding letter grade to them. Note that in order for this function to operate, the grades must first be sorted ascendingly, and that this function itself depends on the order of the dictionary being preserved. If you are working with a version of Python that is older than 3.6, then you will need to utilise an older version of the language.
OrderedDict should be used in its place.
When grade mapping() is specified, you may use Series.map() to locate the letter you’re looking for.
grades:
letter_grades = final_data["Ceiling Score"].map(grade_mapping)
final_data["Final Grade"] = pd.Categorical(
letter_grades, categories=grades.values(), ordered=True
)
By using the grade mapping() function to the Ceiling Score column of the final data table, you may generate a new Series that you can later refer to as letter grades. A categorical data type would be appropriate for this situation seeing as how there are five different options for a letter grade. When you have mapped the scores to letters, you will be able to use the Pandas Categorical class to construct a category column for the data.
You need to pass the letter grades as well as two keyword tests in order to establish the categorised column.
arguments:
-
categories
is passed the values from
grades
. The values in
grades
are the possible letter grades in the class. -
ordered
is passed
True
to tell pandas that the categories are ordered. This will help later if you want to sort this column.
The category column that you make is going to be used in a brand new column in final data that’s going to be called Final Grade.
The concluding grades for the four examples are as follows:
students:
NetID | Final Score | Ceiling Score | Final Grade |
---|---|---|---|
wxb12345 | 0.745852 | 75 | C |
mxl12345 | 0.795956 | 80 | B |
txj12345 | 0.722637 | 73 | C |
jgf12345 | 0.727194 | 73 | C |
One of the four example students received a grade of B, while the other three received grades of C, which corresponds to the students’ ceiling scores and the letter grade mapping you developed.
Putting the Data into Groups
When you have finished computing the grades for each individual student, you will most likely need to enter those grades into the administration system for students. This term, the roster table has a field labelled “Section” which indicates that you are the instructor for many “sections” of the same class.
The file called 04-grouping-the-data.py contains a compilation of all of the changes that were made to the gradebook.py file in this section. If you click the link that is provided below, you will be able to get the source code:
In order to input the grades into your student management system, you will first need to divide the students up into their respective sections and then alphabetize them based on their last name. You are in luck since pandas can also take care of this problem for you.
Pandas is equipped with extensive capabilities that allow it to organise and sort data in DataFrames. You will need to sort the grouped data based on the students’ names after first grouping the data according to the students’ section numbers. You are able to do it using this.
code:
for section, table in final_data.groupby("Section"):
section_file = DATA_FOLDER / f"Section {section} Grades.csv"
num_students = table.shape[0]
print(
f"In Section {section} there are {num_students} students saved to "
f"file {section_file}."
)
table.sort_values(by=["Last Name", "First Name"]).to_csv(section_file)
In this section of code, you group the data by the Section column using the DataFrame.groupby() method on final data and then sort the grouped results using the DataFrame.sort values() method. In the last step, you will export the sorted data to a CSV file so that it can be uploaded to the system that manages students. Now that you’ve turned in all of your assignments for the semester, you can finally kick back and enjoy the break!
Developing a Schematic Based on the Statistics
Yet, before you put away the whiteboard marker for the summer, you may find it helpful to learn a little bit more about the general performance of the class. You may visualise some summary statistics for the class by using the pandas and Matplotlib libraries.
The file 05-plotting-summary-statistics.py contains a compilation of all of the changes that were made to the gradebook.py file in this section. If you click the link that is provided below, you will be able to get the source code:
To begin, you may find it helpful to examine a breakdown of the students’ letter grades in the class. You are able to do it using this.
code:
grade_counts = final_data["Final Grade"].value_counts().sort_index()
grade_counts.plot.bar()
plt.show()
In this section of code, you will compute the frequency of the individual letters by using the Series.value counts() function to the final data table’s Final Grade column. While the value counts are shown in descending order by default (from greatest to least), it would be more helpful to view them in alphabetical order. You may sort the grades into the order that you selected while you were defining the Categorical column by making use of the Series.sort index() function.
After that, you make advantage of pandas’ capability to utilise Matplotlib in order to generate a bar plot of the grade counts using the DataFrame.plot.bar() function. Since this is a script, you will need to instruct Matplotlib to show you the plot by using the plt.show() function, which will cause an interactive figure window to popup. Note: In order for this to function, you will need to include the import matplotlib.pyplot as plt at the very beginning of your script.
The following illustration should serve as a guide for how your figure should look:
The number of students who obtained each letter grade is shown along the horizontal axis of this chart, and the height of each bar in the picture corresponds to that number. The majority of your pupils received a grade of C for the unit.
After that, you should look at a histogram of the pupils’ numerical scores, if that is acceptable to you. Matplotlib’s DataFrame.plot.hist() function may be used by pandas to do this.
automatically:
final_data["Final Score"].plot.hist(bins=20, label="Histogram")
In this section of code, you will plot a histogram of the final scores by using the DataFrame.plot.hist() function. After the charting is finished, any keyword parameters are sent over to Matplotlib for processing.
You can make an approximation of the distribution of the data using a histogram, but you may also be interested in more advanced approaches. With the help of the SciPy library, pandas is able to produce a kernel density estimate using the DataFrame.plot.density() function. You could alternatively speculate that the data will follow a normal distribution and manually generate a normal distribution using the mean and standard deviation of your data. Both of these options are available to you. You are welcome to test this code to see how it works.
works:
final_data["Final Score"].plot.density(
linewidth=4, label="Kernel Density Estimate"
)
final_mean = final_data["Final Score"].mean()
final_std = final_data["Final Score"].std()
x = np.linspace(final_mean - 5 * final_std, final_mean + 5 * final_std, 200)
normal_dist = scipy.stats.norm.pdf(x, loc=final_mean, scale=final_std)
plt.plot(x, normal_dist, label="Normal Distribution", linewidth=4)
plt.legend()
plt.show()
When you initially run this code, the first thing you do is plot the kernel density estimate for your data using the DataFrame.plot.density() function. You make some adjustments to the line width and the label for the plot so that it is simpler to view.
Using DataFrame.mean() and DataFrame.std(), you will now compute the mean and standard deviation of your Final Score data, respectively. You may construct a collection of x-values that are five standard deviations below or above the mean by making use of the np.linspace() function. The conventional normal distribution formula is then plugged into to get the normal distribution, which is then calculated and stored in the normal dist variable. Note: In order for this to function, you will need to include import scipy.stats at the very beginning of your script.
The last step is to plot x vs normal dist, after which you will alter the line width and add a label. After revealing the storyline, you ought to obtain a result that looks somewhat like this:
The amount of each grade that is contained inside a certain bin is shown along the vertical axis of this diagram. The highest point is located close to a gradient of 0.78. The kernel density estimate and the normal distribution are two models that have a reasonable amount of success when it comes to fitting the data.