NumPy and pandas are comprehensive, efficient, and flexible Python tools for data manipulation. An important concept for proficient users of these two libraries is how data are referenced as shallow copies (views) and deep copies (or just copies). pandas sometimes issues a SettingWithCopyWarning to warn you about a potentially inappropriate use of views and copies. In this article, you’ll learn:
- What views and copies are in NumPy and pandas
- How to properly work with views and copies in NumPy and pandas
- Why the SettingWithCopyWarning happens in pandas
- How to avoid getting a SettingWithCopyWarning in pandas
You’ll first see a short explanation of what the SettingWithCopyWarning is and how to avoid it. You might find this enough for your needs, but you can also dig a bit deeper into the details of NumPy and pandas to learn more about copies and views.
Prerequisites
To follow the examples in this article, you’ll need Python 3.7 or 3.8, as well as the libraries NumPy and pandas. This article is written for NumPy version 1.18.1 and pandas version 1.0.3. You can install them with pip:
$ python -m pip install -U "numpy==1.18.*" "pandas==1.0.*"
If you prefer the Anaconda or Miniconda distribution, you can use the conda package management system instead. To learn more about this approach, check out the article Setting Up Python for Machine Learning on Windows. For now, it’s enough to install NumPy and pandas in your environment:
$ conda install numpy=1.18.* pandas=1.0.*
Now that you have NumPy and pandas installed, you can import them and check their versions:
>>> import numpy as np
>>> import pandas as pd
>>> np.__version__
'1.18.1'
>>> pd.__version__
'1.0.3'
That’s it. You have all the prerequisites for this article. Your versions might differ slightly, but the information below will still apply. Note that this article requires some prior knowledge of pandas. For the later sections, you’ll also need a basic understanding of NumPy.
To refresh your NumPy skills, you can check out the following resources:
- NumPy Quickstart Tutorial
- Look Ma, No For-Loops: Array Programming With NumPy
- Python Plotting With Matplotlib
To refresh your memory of pandas, you can read the following:
- 10 minutes to pandas
- pandas DataFrames 101
- The pandas DataFrame: Make Working With Data Delightful
- Using pandas and Python to Explore Your Dataset
- Python pandas: Tricks & Features You May Not Know
Now you’re ready to start learning about views, copies, and the SettingWithCopyWarning!
Example of a SettingWithCopyWarning
If you work with pandas, chances are that you’ve already seen a SettingWithCopyWarning in action. It can be annoying and sometimes hard to understand. However, it’s issued for a reason.
The first thing you should know about the SettingWithCopyWarning is that it’s not an error. It’s a warning. It tells you that you’ve probably done something that will result in unwanted behavior in your code.
Let’s see an example. You’ll start by creating a pandas DataFrame:
>>> data = {"x": 2**np.arange(5),
... "y": 3**np.arange(5),
... "z": np.array([45, 98, 24, 11, 64])}
>>> index = ["a", "b", "c", "d", "e"]
>>> df = pd.DataFrame(data=data, index=index)
>>> df
x y z
a 1 1 45
b 2 3 98
c 4 9 24
d 8 27 11
e 16 81 64
This example creates a dictionary referenced by the variable data that contains:
- The keys "x", "y", and "z", which will be the column labels of the DataFrame
- Three NumPy arrays that hold the data of the DataFrame
You create the first two arrays with the function numpy.arange() and the last one with numpy.array(). To learn more about arange(), check out the article NumPy arange(): How to Use np.arange().
The list referenced by the variable index contains the strings "a", "b", "c", "d", and "e", which will be the row labels of the DataFrame.
Finally, you initialize the DataFrame df that contains the information from data and index. The illustration that accompanies this example marks the main parts of the DataFrame as follows:
- Purple box: Data
- Blue box: Column labels
- Red box: Row labels
The DataFrame also stores additional information, or metadata, including its shape, data types, and so on.
Now that you have a DataFrame to work with, let’s try to get a SettingWithCopyWarning. You’ll take all values from column z that are less than 50 and replace them with zeros. You can start by creating a mask, or a filter, with pandas Boolean operators:
>>> mask = df["z"] < 50
>>> mask
a True
b False
c True
d True
e False
Name: z, dtype: bool
>>> df[mask]
x y z
a 1 1 45
c 4 9 24
d 8 27 11
mask is an instance of a pandas Series with Boolean data and the index from df:
- True indicates the rows in df in which the value of z is less than 50.
- False indicates the rows in df in which the value of z is not less than 50.
df[mask] generates a DataFrame that contains all of the rows from df for which the specified mask evaluates to True. In this particular scenario, you are given rows a, c, and d.
If you try to change df by extracting rows a, c, and d using mask and assigning to them, you’ll get a SettingWithCopyWarning, and df will remain the same:
>>> df[mask]["z"] = 0
__main__:1: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
>>> df
x y z
a 1 1 45
b 2 3 98
c 4 9 24
d 8 27 11
e 16 81 64
As you can see, the assignment of zeros to the column z fails. Here’s what happens in the code sample above:
- df[mask] returns a completely new DataFrame. This DataFrame holds a copy of the data from df that correspond to True values from mask.
- df[mask]["z"] = 0 modifies the column z of the new DataFrame to zeros, leaving df untouched.
That’s usually not what you want! You want to modify df and not some intermediate data structure that isn’t referenced by any variable. That’s why pandas issues a SettingWithCopyWarning and warns you about this possible mistake.
In this case, the proper way to modify df is to apply one of the accessors .loc[], .iloc[], .at[], or .iat[]:
>>> df = pd.DataFrame(data=data, index=index)
>>> df.loc[mask, "z"] = 0
>>> df
x y z
a 1 1 0
b 2 3 98
c 4 9 0
d 8 27 0
e 16 81 64
This approach enables you to provide two arguments, mask and "z", to the single method that assigns the values to the DataFrame.
An alternative way to fix this issue is to change the evaluation order:
>>> df = pd.DataFrame(data=data, index=index)
>>> df["z"]
a 45
b 98
c 24
d 11
e 64
Name: z, dtype: int64
>>> df["z"][mask] = 0
>>> df
x y z
a 1 1 0
b 2 3 98
c 4 9 0
d 8 27 0
e 16 81 64
This works! You’ve modified df. Here’s what happens in this process:
- df["z"] returns a Series object that points to the same data as the column z in df, not its copy.
- df["z"][mask] = 0 modifies this Series object by using chained assignment to set the masked values to zero.
- df is modified as well since the Series object df["z"] holds the same data as df.
You’ve seen that df[mask] contains a copy of the data, whereas df["z"] points to the same data as df. The rules pandas uses to determine whether or not it makes a copy are complex. Fortunately, there are some straightforward ways to assign values to DataFrames and avoid a SettingWithCopyWarning.
Invoking accessors is usually considered better practice than chained assignment for these reasons:
- The intention to modify df is clearer to pandas when you use a single method.
- The code is cleaner for readers.
- The accessors tend to have better performance, even though you won’t notice this in most cases.
However, using accessors sometimes isn’t enough. They might also return copies, in which case you can get a SettingWithCopyWarning:
>>> df = pd.DataFrame(data=data, index=index)
>>> df.loc[mask]["z"] = 0
__main__:1: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
>>> df
x y z
a 1 1 45
b 2 3 98
c 4 9 24
d 8 27 11
e 16 81 64
In this example, as in the previous one, you use the accessor .loc[]. The assignment fails because df.loc[mask] returns a new DataFrame with a copy of the data from df. Then df.loc[mask]["z"] = 0 modifies the new DataFrame, not df.
Generally, to avoid a SettingWithCopyWarning in pandas, you should do the following:
- Avoid chained assignments that combine two or more indexing operations like df["z"][mask] = 0 and df.loc[mask]["z"] = 0.
- Apply single assignments with just one indexing operation like df.loc[mask, "z"] = 0. This might (or might not) involve the use of accessors, but they are certainly very useful and are often preferable.
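The two rules above can be sketched as follows, reusing data and index from the earlier example. The helper name zero_out_small is hypothetical and not part of pandas. The warning filter is only there to keep the output quiet, since the exact warning pandas issues for the chained form can differ between versions.

```python
import warnings

import numpy as np
import pandas as pd

data = {"x": 2**np.arange(5),
        "y": 3**np.arange(5),
        "z": np.array([45, 98, 24, 11, 64])}
index = ["a", "b", "c", "d", "e"]


def zero_out_small(frame, column, threshold):
    """Hypothetical helper: set values below threshold to zero in place."""
    mask = frame[column] < threshold
    # One indexing operation: rows and column go to .loc[] together.
    frame.loc[mask, column] = 0
    return frame


df = pd.DataFrame(data=data, index=index)
mask = df["z"] < 50

# Chained assignment: df[mask] is a new DataFrame, so df is left untouched.
with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    df[mask]["z"] = 0
unchanged = df["z"].tolist()

# Single assignment: df itself is modified.
zero_out_small(df, "z", 50)
changed = df["z"].tolist()
```

Here unchanged still holds the original values of z, while changed has zeros in rows a, c, and d.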
With this knowledge, you can successfully avoid the SettingWithCopyWarning and any unwanted behavior in most cases. However, if you want to dive deeper into NumPy, pandas, views, copies, and the issues related to the SettingWithCopyWarning, then continue on with the rest of the article.
Views and Copies in NumPy and pandas
Understanding views and copies is an important part of getting to know how NumPy and pandas manipulate data. It can also help you avoid errors and performance bottlenecks. Sometimes data is copied from one part of memory to another, but in other cases two or more objects can share the same data, saving both time and memory.
Understanding Views and Copies in NumPy
Let’s start by creating a NumPy array:
>>> arr = np.array([1, 2, 4, 8, 16, 32])
>>> arr
array([ 1, 2, 4, 8, 16, 32])
Now that you have arr, you can use it to create other arrays. Let’s first extract the second and fourth elements of arr (2 and 8) as a new array. There are several ways to do this:
>>> arr[1:4:2]
array([2, 8])
>>> arr[[1, 3]]
array([2, 8])
Don’t worry if you’re not familiar with array indexing. You’ll learn more about these and other statements later. For now, it’s important to notice that both statements return array([2, 8]). However, they behave differently under the surface:
>>> arr[1:4:2].base
array([ 1, 2, 4, 8, 16, 32])
>>> arr[1:4:2].flags.owndata
False
>>> arr[[1, 3]].base
>>> arr[[1, 3]].flags.owndata
True
At first glance, this can appear to be strange. The key point of differentiation here is that arr[1:4:2] returns a shallow copy, whereas arr[[1, 3]] returns a deep copy. Not only is it necessary to have an understanding of this distinction in order to cope with the SettingWithCopyWarning, but it is also required in order to manipulate large amounts of data using NumPy and pandas.
The next sections will provide you with additional information regarding shallow and deep copies in NumPy and pandas.
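Besides .base and .flags.owndata, you can probe the same distinction with np.shares_memory(), which reports whether two arrays overlap in memory. The following is a small sketch that reuses arr from above; the variable names sliced and picked are only illustrative.

```python
import numpy as np

arr = np.array([1, 2, 4, 8, 16, 32])

sliced = arr[1:4:2]    # basic slicing
picked = arr[[1, 3]]   # indexing with a list of integers

# Both results hold the same values...
same_values = np.array_equal(sliced, picked)

# ...but only the slice shares memory with arr.
slice_shares = np.shares_memory(arr, sliced)
list_shares = np.shares_memory(arr, picked)
```

In this sketch, slice_shares is True and list_shares is False, which matches the .base and .flags.owndata results shown earlier.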
Views in NumPy
A shallow copy or view is a NumPy array that doesn’t have its own data. It looks at, or “views,” the data contained in the original array. You can create a view of an array with .view():
>>> view_of_arr = arr.view()
>>> view_of_arr
array([ 1, 2, 4, 8, 16, 32])
>>> view_of_arr.base
array([ 1, 2, 4, 8, 16, 32])
>>> view_of_arr.base is arr
True
You’ve obtained the array view_of_arr, which is a view, or shallow copy, of the original array arr. The attribute .base of view_of_arr is arr itself. In other words, view_of_arr doesn’t own any data. It uses the data that belongs to arr. You can also verify this with the attribute .flags:
>>> view_of_arr.flags.owndata
False
As you can see, view_of_arr.flags.owndata is False. This means that view_of_arr doesn’t own data and uses its .base to get the data. In other words, arr and view_of_arr point to the same data values.
Copies in NumPy
A deep copy of a NumPy array, sometimes called just a copy, is a separate NumPy array that has its own data. The data of a deep copy is obtained by copying the elements of the original array into the new array. The original and the copy are two separate instances. You can create a copy of an array with .copy():
>>> copy_of_arr = arr.copy()
>>> copy_of_arr
array([ 1, 2, 4, 8, 16, 32])
>>> copy_of_arr.base is None
True
>>> copy_of_arr.flags.owndata
True
As you can see, copy_of_arr doesn’t have a .base. To be more precise, the value of copy_of_arr.base is None. The attribute .flags.owndata is True. This means that copy_of_arr owns its data, so arr and copy_of_arr contain different instances of the data values.
Differences Between Views and Copies
There are two very important differences between views and copies:
- Views don’t need additional storage for data, but copies do.
- Modifying the original array affects its views, and vice versa. However, modifying the original array will not affect its copy.
To illustrate the first difference between views and copies, let’s compare the sizes of arr, view_of_arr, and copy_of_arr. The attribute .nbytes returns the memory consumed by the elements of the array:
>>> arr.nbytes
48
>>> view_of_arr.nbytes
48
>>> copy_of_arr.nbytes
48
The amount of memory is the same for all arrays: 48 bytes. Each array looks at six integer elements of 8 bytes (64 bits) each. That’s 48 bytes in total.
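The arithmetic behind this number can be checked directly. This is a minimal sketch that assumes 64-bit integers, so the dtype is set explicitly rather than left to the platform default.

```python
import numpy as np

# dtype is fixed explicitly so that each element takes 8 bytes on every platform.
arr = np.array([1, 2, 4, 8, 16, 32], dtype=np.int64)

element_count = arr.size          # number of elements
bytes_per_element = arr.itemsize  # bytes used by one int64 element
total = element_count * bytes_per_element

# A view reports the same .nbytes because it looks at the same elements.
view_total = arr.view().nbytes
```

Both total and view_total equal arr.nbytes, since .nbytes counts the elements an array looks at and not the memory it owns.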
However, if you use sys.getsizeof() to get the memory that’s directly attributed to each array, then you’ll see the difference:
>>> from sys import getsizeof
>>> getsizeof(arr)
144
>>> getsizeof(view_of_arr)
96
>>> getsizeof(copy_of_arr)
144
arr and copy_of_arr hold 144 bytes each. As you’ve seen previously, 48 of those 144 bytes are for the data elements. The remaining 96 bytes are for other attributes. view_of_arr holds only those 96 bytes because it doesn’t have its own data elements.
To illustrate the second difference between views and copies, you can modify any element of the original array:
>>> arr[1] = 64
>>> arr
array([ 1, 64, 4, 8, 16, 32])
>>> view_of_arr
array([ 1, 64, 4, 8, 16, 32])
>>> copy_of_arr
array([ 1, 2, 4, 8, 16, 32])
As you can see, the view is also changed, but the copy stays the same. The view is modified because it looks at the elements of arr, and its .base is the original array. The copy is unchanged because it doesn’t share data with the original, so a change to the original doesn’t affect it at all.
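The “vice versa” part of the second difference works the same way: writing through a view changes the original array. Here’s a short sketch that rebuilds the three arrays from scratch so it can run on its own.

```python
import numpy as np

arr = np.array([1, 2, 4, 8, 16, 32])
view_of_arr = arr.view()
copy_of_arr = arr.copy()

# Writing through the view changes the data that arr owns.
view_of_arr[0] = 100

# Writing to the copy stays local to the copy.
copy_of_arr[-1] = 0
```

After these two assignments, arr starts with 100 and still ends with 32, while only copy_of_arr ends with 0.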
Understanding Views and Copies in pandas
pandas also makes a distinction between views and copies. You can create a view or copy of a DataFrame with .copy(). The parameter deep determines whether you want a view (deep=False) or a copy (deep=True). deep is True by default, so you can omit it to get a copy:
>>> df = pd.DataFrame(data=data, index=index)
>>> df
x y z
a 1 1 45
b 2 3 98
c 4 9 24
d 8 27 11
e 16 81 64
>>> view_of_df = df.copy(deep=False)
>>> view_of_df
x y z
a 1 1 45
b 2 3 98
c 4 9 24
d 8 27 11
e 16 81 64
>>> copy_of_df = df.copy()
>>> copy_of_df
x y z
a 1 1 45
b 2 3 98
c 4 9 24
d 8 27 11
e 16 81 64
At first, the view and the copy of df look the same. If you compare their NumPy representations, though, then you may notice this subtle difference:
>>> view_of_df.to_numpy().base is df.to_numpy().base
True
>>> copy_of_df.to_numpy().base is df.to_numpy().base
False
Here, .to_numpy() returns the NumPy array that holds the data of the DataFrames. You can see that df and view_of_df have the same .base and share the same data. On the other hand, copy_of_df contains different data.
You can verify this by modifying df:
>>> df["z"] = 0
>>> df
x y z
a 1 1 0
b 2 3 0
c 4 9 0
d 8 27 0
e 16 81 0
>>> view_of_df
x y z
a 1 1 0
b 2 3 0
c 4 9 0
d 8 27 0
e 16 81 0
>>> copy_of_df
x y z
a 1 1 45
b 2 3 98
c 4 9 24
d 8 27 11
e 16 81 64
You’ve assigned zeros to all the elements of the column z in df. That causes a change in view_of_df, but copy_of_df remains unmodified.
Row and column labels exhibit the same behavior:
>>> view_of_df.index is df.index
True
>>> view_of_df.columns is df.columns
True
>>> copy_of_df.index is df.index
False
>>> copy_of_df.columns is df.columns
False
df and view_of_df share the same row and column labels, while copy_of_df has separate index instances. Keep in mind that you can’t modify particular elements of .index and .columns. They are immutable objects.
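A brief sketch of what this immutability means in practice follows. It assumes that pandas raises a TypeError on item assignment to an Index, and it uses a small one-column DataFrame built just for this illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"x": 2**np.arange(5)}, index=["a", "b", "c", "d", "e"])

# Assigning to a single label fails because Index objects are immutable.
message = None
try:
    df.index[0] = "q"
except TypeError as error:
    message = str(error)

# Replacing the whole index with a new object is allowed.
df.index = ["q", "b", "c", "d", "e"]
```

So you can’t change a label in place, but you can bind .index or .columns to a completely new set of labels.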
Indices and Slices in NumPy and pandas
Basic indexing and slicing in NumPy is similar to the indexing and slicing of lists and tuples. However, both NumPy and pandas provide additional options to reference and assign values to objects and their parts.
NumPy arrays and pandas objects (DataFrame and Series) implement special methods that enable referencing, assigning, and deleting values in a style similar to that of containers:
- .__getitem__() references values.
- .__setitem__() assigns values.
- .__delitem__() deletes values.
When you reference, assign, or delete data in Python container-like objects, you usually invoke these methods implicitly:
- var = obj[key] is equivalent to var = obj.__getitem__(key).
- obj[key] = value is equivalent to obj.__setitem__(key, value).
- del obj[key] is equivalent to obj.__delitem__(key).
The argument key represents the index, which can be an integer, slice, tuple, list, NumPy array, and so on.
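The equivalences listed above can be tried out directly. This is only an illustration; in everyday code you’d use the bracket syntax rather than calling the special methods yourself. The small DataFrame is made up for the deletion example.

```python
import numpy as np
import pandas as pd

arr = np.array([1, 2, 4, 8, 16, 32])

# arr[1:4:2] passes a slice object as the key.
via_brackets = arr[1:4:2]
via_method = arr.__getitem__(slice(1, 4, 2))

# arr[1] = 64 passes the key and the value to .__setitem__().
arr.__setitem__(1, 64)

# del df["z"] passes the column label to .__delitem__().
df = pd.DataFrame({"x": [1, 2], "z": [3, 4]})
df.__delitem__("z")
```

Both via_brackets and via_method come from the same slice, so they hold the same elements, and after the last statement df is left with only the column x.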
Indexing in NumPy: Copies and Views
NumPy has a strict set of rules related to copies and views when indexing arrays. Whether you get views or copies of the original data depends on the approach you use to index your arrays: slicing, integer indexing, or Boolean indexing.
One-Dimensional Arrays
Slicing is a well-known operation in Python for getting particular data from arrays, lists, or tuples. When you slice a NumPy array, you get a view of the array:
>>> arr = np.array([1, 2, 4, 8, 16, 32])
>>> a = arr[1:3]
>>> a
array([2, 4])
>>> a.base
array([ 1, 2, 4, 8, 16, 32])
>>> a.base is arr
True
>>> a.flags.owndata
False
>>> b = arr[1:4:2]
>>> b
array([2, 8])
>>> b.base
array([ 1, 2, 4, 8, 16, 32])
>>> b.base is arr
True
>>> b.flags.owndata
False
You’ve created the original array arr and sliced it to get two smaller arrays, a and b. Both a and b use arr as their base, and neither has its own data. Instead, they look at the elements of arr that were selected by slicing.
Note: When you have a large original array and need only a small part of it, you can call .copy() after slicing and delete the variable that points to the original with a del statement. This way, you keep the copy and remove the original array from memory.
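The pattern from the note might look like the sketch below. The names big and small and the array size are arbitrary choices for this illustration.

```python
import numpy as np

big = np.arange(1_000_000)

# The slice alone would be a view that keeps the whole buffer of big alive,
# so .copy() is called to give small its own data.
small = big[:10].copy()

# Dropping the last reference to the original lets Python free its memory.
del big
```

After this, small owns its ten elements and no longer depends on the original array.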
While the result of a slicing operation is a view, there are other circumstances in which the creation of one array from another results in a copy.
When you index an array with a list of integers, you get a copy. The copy contains the elements of the original array whose indices are present in the list:
>>> c = arr[[1, 3]]
>>> c
array([2, 8])
>>> c.base is None
True
>>> c.flags.owndata
True
The resulting array c contains the elements of arr with the indices 1 and 3. These elements have the values 2 and 8. In this case, c is a copy of those elements, its .base is None, and it has its own data. After the copying is done, arr and c are independent.
You can also index NumPy arrays with mask arrays or lists. Masks are Boolean arrays or lists of the same shape as the original. You’ll get a copy of the original array that contains only the elements that correspond to the True values of the mask:
>>> mask = [False, True, False, True, False, False]
>>> d = arr[mask]
>>> d
array([2, 8])
>>> d.base is None
True
>>> d.flags.owndata
True
The list mask has True values at the second and fourth positions. This is why the array d contains only the elements from the second and fourth positions of arr. As in the case of c, d is a copy, its .base is None, and it has its own data. After the copying, arr and d are independent.
Note: Instead of a list, you can use another NumPy array of integers, but not a tuple.
To recap, here are the variables you’ve created so far that reference arr:
# `arr` is the original array:
arr = np.array([1, 2, 4, 8, 16, 32])
# `a` and `b` are views created through slicing:
a = arr[1:3]
b = arr[1:4:2]
# `c` and `d` are copies created through integer and Boolean indexing:
c = arr[[1, 3]]
d = arr[[False, True, False, True, False, False]]
Keep in mind that these examples show how you can reference data when accessing an array. Referencing data returns views when slicing arrays and copies when using index and mask arrays.
Assignments, on the other hand, modify the original data of the array.
Now that you have all these arrays, let’s see what happens when you alter the original:
>>> arr[1] = 64
>>> arr
array([ 1, 64, 4, 8, 16, 32])
>>> a
array([64, 4])
>>> b
array([64, 8])
>>> c
array([2, 8])
>>> d
array([2, 8])
You’ve changed the second value of arr from 2 to 64. The value 2 was also present in the derived arrays a, b, c, and d. However, only the views a and b are modified.
The views a and b look at the data of arr, including its second element. That’s why you see the change. The copies c and d remain unchanged because they don’t share data with arr. They are independent of arr.
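The statement that assignments modify the original data also holds when the target is selected with an index list or a mask, even though referencing with the same keys returns copies. Here’s a small sketch with a fresh arr and illustrative variable names:

```python
import numpy as np

arr = np.array([1, 2, 4, 8, 16, 32])

# Assignment through a list of integer indices writes into arr itself.
arr[[1, 3]] = 0
after_index = arr.tolist()

# Assignment through a Boolean mask does the same.
mask = arr > 10
arr[mask] = -1
after_mask = arr.tolist()
```

In both cases, a single indexing operation on the left side of the assignment goes to .__setitem__() of arr, so no intermediate copy is involved.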
Chained Indexing in NumPy
Does this behavior with a and b look at all similar to the earlier pandas examples? It might, because the concept of chained indexing applies in NumPy, too:
>>> arr = np.array([1, 2, 4, 8, 16, 32])
>>> arr[1:4:2][0] = 64
>>> arr
array([ 1, 64, 4, 8, 16, 32])
>>> arr = np.array([1, 2, 4, 8, 16, 32])
>>> arr[[1, 3]][0] = 64
>>> arr
array([ 1, 2, 4, 8, 16, 32])
When employing chained indexing in NumPy, this example demonstrates the distinction between copies and views of the indexed data.
In the first scenario, the value returned by arr[1:4:2] is a view that refers to the data contained in arr and includes the elements 2 and 8. The first of these items has its value changed to 64 as a result of the expression arr[1:4:2][0] = 64. Both arr and the view that is produced by arr[1:4:2] have been updated to reflect the change.
In the second case, arr[[1, 3]] returns a copy that also contains the elements 2 and 8, taken from the indices 1 and 3. But these aren’t the same elements as in arr. They’re new ones. arr[[1, 3]][0] = 64 modifies the copy returned by arr[[1, 3]] and leaves arr unchanged.
This is essentially the same behavior that produces a SettingWithCopyWarning in pandas, but that warning doesn’t exist in NumPy.
Multidimensional Arrays
Referencing multidimensional arrays follows the same principles:
- Slicing arrays returns views.
- Using index and mask arrays returns copies.
Combining index and mask arrays with slicing is also possible. In such cases, you get copies.
Here are a few examples:
>>> arr = np.array([[ 1, 2, 4, 8],
... [ 16, 32, 64, 128],
... [256, 512, 1024, 2048]])
>>> arr
array([[ 1, 2, 4, 8],
[ 16, 32, 64, 128],
[ 256, 512, 1024, 2048]])
>>> a = arr[:, 1:3] # Take columns 1 and 2
>>> a
array([[ 2, 4],
[ 32, 64],
[ 512, 1024]])
>>> a.base
array([[ 1, 2, 4, 8],
[ 16, 32, 64, 128],
[ 256, 512, 1024, 2048]])
>>> a.base is arr
True
>>> b = arr[:, 1:4:2] # Take columns 1 and 3
>>> b
array([[ 2, 8],
[ 32, 128],
[ 512, 2048]])
>>> b.base
array([[ 1, 2, 4, 8],
[ 16, 32, 64, 128],
[ 256, 512, 1024, 2048]])
>>> b.base is arr
True
>>> c = arr[:, [1, 3]] # Take columns 1 and 3
>>> c
array([[ 2, 8],
[ 32, 128],
[ 512, 2048]])
>>> c.base
array([[ 2, 32, 512],
[ 8, 128, 2048]])
>>> c.base is arr
False
>>> d = arr[:, [False, True, False, True]] # Take columns 1 and 3
>>> d
array([[ 2, 8],
[ 32, 128],
[ 512, 2048]])
>>> d.base
array([[ 2, 32, 512],
[ 8, 128, 2048]])
>>> d.base is arr
False
In this example, you start from the two-dimensional array arr. You apply slices for rows. Using the colon syntax (:), which is equivalent to slice(None), means that you want to take all rows.
When you work with the slices 1:3 and 1:4:2 for columns, the views a and b are returned. However, when you apply the list [1, 3] and the mask [False, True, False, True], you get the copies c and d.
The .base of both a and b is arr itself. Both c and d have their own bases unrelated to arr.
As with one-dimensional arrays, when you modify the original, the views change because they see the same data, but the copies remain the same:
>>> arr[0, 1] = 100
>>> arr
array([[ 1, 100, 4, 8],
[ 16, 32, 64, 128],
[ 256, 512, 1024, 2048]])
>>> a
array([[ 100, 4],
[ 32, 64],
[ 512, 1024]])
>>> b
array([[ 100, 8],
[ 32, 128],
[ 512, 2048]])
>>> c
array([[ 2, 8],
[ 32, 128],
[ 512, 2048]])
>>> d
array([[ 2, 8],
[ 32, 128],
[ 512, 2048]])
You changed the value 2 in arr to 100 and altered the corresponding elements of the views a and b. The copies c and d can’t be modified this way.
To learn more about indexing NumPy arrays, you can check out the official quickstart tutorial and indexing tutorial.
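The remark that combining slicing with an index array yields a copy can be checked with np.shares_memory(). This is a small sketch on the same two-dimensional array; the variable names are only illustrative.

```python
import numpy as np

arr = np.array([[  1,   2,    4,    8],
                [ 16,  32,   64,  128],
                [256, 512, 1024, 2048]])

only_slices = arr[1:, 1:4:2]       # slices on both axes
slice_and_list = arr[1:, [1, 3]]   # slice for rows, list for columns

same_values = np.array_equal(only_slices, slice_and_list)
slices_share = np.shares_memory(arr, only_slices)
mixed_share = np.shares_memory(arr, slice_and_list)
```

Both expressions select rows 1 and 2 and columns 1 and 3, but only the version that uses slices alone shares memory with arr.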
Indexing in pandas: Copies and Views
You’ve learned how you can use different indexing options in NumPy to refer to either actual data (a view, or shallow copy) or newly copied data (deep copy, or just copy). NumPy has a set of strict rules about this.
pandas heavily relies on NumPy arrays but offers additional functionality and flexibility. Because of that, the rules for returning views and copies are more complex and less straightforward. They depend on the layout of data, data types, and other details. In fact, pandas often doesn’t guarantee whether a view or copy will be referenced.
Indexing in pandas is a very wide topic. It’s essential for using pandas data structures properly. You can use a variety of techniques:
- Dictionary-like notation
- Attribute-like (dot) notation
- The accessors .loc[], .iloc[], .at[], and .iat[]
To learn more, check out the official documentation and The pandas DataFrame: Make Working With Data Delightful.
Now you’ll see two examples of how pandas behaves similarly to NumPy. First, you can see that accessing the first three rows of df with a slice returns a view:
>>> df = pd.DataFrame(data=data, index=index)
>>> df["a":"c"]
x y z
a 1 1 45
b 2 3 98
c 4 9 24
>>> df["a":"c"].to_numpy().base
array([[ 1, 2, 4, 8, 16],
[ 1, 3, 9, 27, 81],
[45, 98, 24, 11, 64]])
>>> df["a":"c"].to_numpy().base is df.to_numpy().base
True
This view refers to the same data as df.
On the other hand, accessing the first two columns of df with a list of labels returns a copy:
>>> df = pd.DataFrame(data=data, index=index)
>>> df[["x", "y"]]
x y
a 1 1
b 2 3
c 4 9
d 8 27
e 16 81
>>> df[["x", "y"]].to_numpy().base
array([[ 1, 2, 4, 8, 16],
[ 1, 3, 9, 27, 81]])
>>> df[["x", "y"]].to_numpy().base is df.to_numpy().base
False
The copy has a different .base than the original df.
You'll learn more about indexing DataFrames and about returning views and copies later in this article. You'll see some cases where the behavior of pandas becomes more complex and differs from NumPy.
Use of Views and Copies in pandas
As you already know, pandas issues a SettingWithCopyWarning when it detects that you're trying to modify a copy of the data rather than the original. This typically follows chained indexing.
In this section, you'll look at some specific cases that produce a SettingWithCopyWarning. You'll identify the causes and learn how to avoid them by properly using views, copies, and accessors.
Chained Indexing and SettingWithCopyWarning
You've already seen how the SettingWithCopyWarning interacts with chained indexing in the first example. Let's look at it in more detail.
You've created the DataFrame df and the mask Series object that corresponds to df["z"] < 50:
>>> df = pd.DataFrame(data=data, index=index)
>>> df
x y z
a 1 1 45
b 2 3 98
c 4 9 24
d 8 27 11
e 16 81 64
>>> mask = df["z"] < 50
>>> mask
a True
b False
c True
d True
e False
Name: z, dtype: bool
You already know that the assignment df[mask]["z"] = 0 fails. In this case, you get a SettingWithCopyWarning:
>>> df[mask]["z"] = 0
__main__:1: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
>>> df
x y z
a 1 1 45
b 2 3 98
c 4 9 24
d 8 27 11
e 16 81 64
The assignment fails because df[mask] returns a copy. To be more precise, the assignment is made on the copy, and df isn't affected.
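To make this concrete, the chained assignment can be spelled out as two separate steps. This is only a sketch: the name tmp is illustrative, the data is assumed to match the DataFrame printed above, and the warning is silenced for the demonstration so that the output stays readable:

```python
import warnings

import numpy as np
import pandas as pd

# Assumed to mirror the DataFrame shown in this article
data = {
    "x": 2 ** np.arange(5),
    "y": 3 ** np.arange(5),
    "z": np.array([45, 98, 24, 11, 64]),
}
df = pd.DataFrame(data=data, index=["a", "b", "c", "d", "e"])
mask = df["z"] < 50

# df[mask]["z"] = 0 is evaluated in two steps, written out here
with warnings.catch_warnings():
    warnings.simplefilter("ignore")  # silence warnings for this demo only
    tmp = df[mask]  # step 1: boolean indexing builds a new DataFrame
    tmp["z"] = 0    # step 2: the assignment lands on that new object

print(tmp["z"].tolist())  # [0, 0, 0]
print(df["z"].tolist())   # [45, 98, 24, 11, 64]
```

In the chained form, the intermediate object has no name, so the modified copy is discarded right after the assignment.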
You've also seen that evaluation order matters in pandas. In some cases, you can switch the order of the operations to make the code work:
>>> df["z"][mask] = 0
>>> df
x y z
a 1 1 0
b 2 3 98
c 4 9 0
d 8 27 0
e 16 81 64
df["z"][mask] = 0 succeeds, and you get the modified df without a SettingWithCopyWarning.
Using the accessors is recommended, but you can run into trouble with them as well:
>>> df = pd.DataFrame(data=data, index=index)
>>> df.loc[mask]["z"] = 0
__main__:1: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
>>> df
x y z
a 1 1 45
b 2 3 98
c 4 9 24
d 8 27 11
e 16 81 64
In this case, df.loc[mask] returns a copy, the assignment fails, and pandas correctly issues the warning.
In some cases, pandas fails to detect the problem, and the assignment on the copy passes without a SettingWithCopyWarning:
>>> df = pd.DataFrame(data=data, index=index)
>>> df.loc[["a", "c", "e"]]["z"] = 0 # Assignment fails, no warning
>>> df
x y z
a 1 1 45
b 2 3 98
c 4 9 24
d 8 27 11
e 16 81 64
Here, you don't receive a SettingWithCopyWarning, and df isn't changed, because df.loc[["a", "c", "e"]] uses a list of indices and returns a copy, not a view.
There are some cases in which the code works, but pandas issues the warning anyway:
>>> df = pd.DataFrame(data=data, index=index)
>>> df[:3]["z"] = 0 # Assignment succeeds, with warning
__main__:1: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
>>> df
x y z
a 1 1 0
b 2 3 0
c 4 9 0
d 8 27 11
e 16 81 64
>>> df = pd.DataFrame(data=data, index=index)
>>> df.loc["a":"c"]["z"] = 0 # Assignment succeeds, with warning
__main__:1: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
>>> df
x y z
a 1 1 0
b 2 3 0
c 4 9 0
d 8 27 11
e 16 81 64
In these two cases, you select the first three rows with slices and get views. The assignments succeed both on the views and on df. But you still receive a SettingWithCopyWarning.
The recommended way of performing such operations is to avoid chained indexing. Accessors can be of great help with that:
>>> df = pd.DataFrame(data=data, index=index)
>>> df.loc[mask, "z"] = 0
>>> df
x y z
a 1 1 0
b 2 3 98
c 4 9 0
d 8 27 0
e 16 81 64
This approach uses a single method call without chained indexing, so both the code and your intentions are clearer. As a bonus, it's a slightly more efficient way to assign data.
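The same single-call pattern also works when the row condition is more involved or when several columns are set at once. The following sketch assumes the same data as the DataFrame printed above, and the combined condition is only an illustration:

```python
import numpy as np
import pandas as pd

# Assumed to mirror the DataFrame shown in this article
data = {
    "x": 2 ** np.arange(5),
    "y": 3 ** np.arange(5),
    "z": np.array([45, 98, 24, 11, 64]),
}
df = pd.DataFrame(data=data, index=["a", "b", "c", "d", "e"])

# Combine two conditions into one Boolean mask (selects rows c and d)
rows = (df["z"] < 50) & (df["x"] > 1)

# One .loc[] call with a row indexer and a column indexer
df.loc[rows, ["y", "z"]] = 0

print(df["y"].tolist())  # [1, 3, 0, 0, 81]
print(df["z"].tolist())  # [45, 98, 0, 0, 64]
```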
Impact of Data Types on Views, Copies, and the SettingWithCopyWarning
In pandas, the difference between creating views and creating copies also depends on the data types used. When deciding whether to return a view or a copy, pandas handles DataFrames that have a single data type differently from those that have multiple types.
Let's focus on the data types in this example:
>>> df = pd.DataFrame(data=data, index=index)
>>> df
x y z
a 1 1 45
b 2 3 98
c 4 9 24
d 8 27 11
e 16 81 64
>>> df.dtypes
x int64
y int64
z int64
dtype: object
You've created the DataFrame with all integer columns. The fact that all three columns have the same data type is important here. In this case, you can select rows with a slice and get a view:
>>> df["b":"d"]["z"] = 0
__main__:1: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
>>> df
x y z
a 1 1 45
b 2 3 0
c 4 9 0
d 8 27 0
e 16 81 64
This mirrors the behavior that you've seen in this article so far. df["b":"d"] returns a view and allows you to modify the original data. That's why the assignment df["b":"d"]["z"] = 0 succeeds. Notice that in this case you get a SettingWithCopyWarning even though the change to df is successful.
If your DataFrame contains columns of different types, then you might get a copy instead of a view, in which case the same assignment will fail:
>>> df = pd.DataFrame(data=data, index=index).astype(dtype={"z": float})
>>> df
x y z
a 1 1 45.0
b 2 3 98.0
c 4 9 24.0
d 8 27 11.0
e 16 81 64.0
>>> df.dtypes
x int64
y int64
z float64
dtype: object
>>> df["b":"d"]["z"] = 0
__main__:1: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
>>> df
x y z
a 1 1 45.0
b 2 3 98.0
c 4 9 24.0
d 8 27 11.0
e 16 81 64.0
In this case, you used .astype() to create a DataFrame that has two integer columns and one floating-point column. Contrary to the previous example, df["b":"d"] now returns a copy, so the assignment df["b":"d"]["z"] = 0 fails and df remains unchanged.
When in doubt, avoid the confusion and use the .loc[], .iloc[], .at[], and .iat[] accessors throughout your code!
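As a sketch of why the accessors help here, a single .loc[] call gives the same result whether or not the columns share a data type. The helper make_df() is hypothetical and only rebuilds a DataFrame like the one printed above:

```python
import numpy as np
import pandas as pd


def make_df():
    """Build a fresh DataFrame like the one printed in this article."""
    data = {
        "x": 2 ** np.arange(5),
        "y": 3 ** np.arange(5),
        "z": np.array([45, 98, 24, 11, 64]),
    }
    return pd.DataFrame(data=data, index=["a", "b", "c", "d", "e"])


same_types = make_df()                        # x, y, and z are all integers
mixed_types = make_df().astype({"z": float})  # z becomes floating point

# One .loc[] call with a row slice and a column label, no chained indexing
same_types.loc["b":"d", "z"] = 0
mixed_types.loc["b":"d", "z"] = 0

print(same_types["z"].tolist())   # [45, 0, 0, 0, 64]
print(mixed_types["z"].tolist())  # [45.0, 0.0, 0.0, 0.0, 64.0]
```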
Hierarchical Indexing and SettingWithCopyWarning
Hierarchical indexing, or MultiIndex, is a pandas feature that enables you to organize your row or column indices on multiple levels according to a hierarchy. It's a powerful feature that increases the flexibility of pandas and enables working with data in more than two dimensions.
Hierarchical indices are created using tuples as row or column labels:
>>> df = pd.DataFrame(
... data={("powers", "x"): 2**np.arange(5),
... ("powers", "y"): 3**np.arange(5),
... ("random", "z"): np.array([45, 98, 24, 11, 64])},
... index=["a", "b", "c", "d", "e"]
... )
>>> df
powers random
x y z
a 1 1 45
b 2 3 98
c 4 9 24
d 8 27 11
e 16 81 64
Now you have the DataFrame df with two-level column indices:
- The first level contains the labels powers and random.
- The second level has the labels x and y, which belong to powers, and z, which belongs to random.
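The two levels described above can be inspected directly on df.columns. This is a small sketch that rebuilds the same DataFrame; the printed values follow from the tuples used as column labels:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    data={
        ("powers", "x"): 2 ** np.arange(5),
        ("powers", "y"): 3 ** np.arange(5),
        ("random", "z"): np.array([45, 98, 24, 11, 64]),
    },
    index=["a", "b", "c", "d", "e"],
)

level_count = df.columns.nlevels                      # number of column levels
first_level = df.columns.get_level_values(0).tolist()
second_level = df.columns.get_level_values(1).tolist()

print(level_count)   # 2
print(first_level)   # ['powers', 'powers', 'random']
print(second_level)  # ['x', 'y', 'z']
```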
The expression df["powers"] returns a DataFrame containing all the columns below powers, which are the columns x and y. If you want only the column x, then you can pass both powers and x. The proper way to do this is with the expression df["powers", "x"]:
>>> df["powers"]
x y
a 1 1
b 2 3
c 4 9
d 8 27
e 16 81
>>> df["powers", "x"]
a 1
b 2
c 4
d 8
e 16
Name: (powers, x), dtype: int64
>>> df["powers", "x"] = 0
>>> df
powers random
x y z
a 0 1 45
b 0 3 98
c 0 9 24
d 0 27 11
e 0 81 64
That's one way to get and set columns in the case of multilevel column indices. You can also use accessors with multi-indexed DataFrames to get or modify the data:
>>> df = pd.DataFrame(
... data={("powers", "x"): 2**np.arange(5),
... ("powers", "y"): 3**np.arange(5),
... ("random", "z"): np.array([45, 98, 24, 11, 64])},
... index=["a", "b", "c", "d", "e"]
... )
>>> df.loc[["a", "b"], "powers"]
x y
a 1 1
b 2 3
The example above uses .loc[] to return a DataFrame with the rows a and b and the columns x and y, which are below powers. You can get a particular column (or row) similarly:
>>> df.loc[["a", "b"], ("powers", "x")]
a 1
b 2
Name: (powers, x), dtype: int64
In this example, you specify that you want the intersection of the rows a and b with the column x, which is below powers. To get a single column, you pass the tuple of indices ("powers", "x") and get a Series object as the result.
You can use this approach to modify the elements of DataFrames with hierarchical indices:
>>> df.loc[["a", "b"], ("powers", "x")] = 0
>>> df
powers random
x y z
a 0 1 45
b 0 3 98
c 4 9 24
d 8 27 11
e 16 81 64
In the examples above, you avoid chained indexing both with accessors (df.loc[["a", "b"], ("powers", "x")]) and without them (df["powers", "x"]).
As you saw earlier, chained indexing can lead to a SettingWithCopyWarning:
>>> df = pd.DataFrame(
... data={("powers", "x"): 2**np.arange(5),
... ("powers", "y"): 3**np.arange(5),
... ("random", "z"): np.array([45, 98, 24, 11, 64])},
... index=["a", "b", "c", "d", "e"]
... )
>>> df
powers random
x y z
a 1 1 45
b 2 3 98
c 4 9 24
d 8 27 11
e 16 81 64
>>> df["powers"]
x y
a 1 1
b 2 3
c 4 9
d 8 27
e 16 81
>>> df["powers"]["x"] = 0
__main__:1: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
>>> df
powers random
x y z
a 0 1 45
b 0 3 98
c 0 9 24
d 0 27 11
e 0 81 64
Here, df["powers"] returns a DataFrame with the columns x and y. This is just a view that points to the data from df, so the assignment is successful and df is modified. But pandas still issues a SettingWithCopyWarning.
If you repeat the same code, but this time with different data types in the columns of df, then you'll get different behavior:
>>> df = pd.DataFrame(
... data={("powers", "x"): 2**np.arange(5),
... ("powers", "y"): 3**np.arange(5),
... ("random", "z"): np.array([45, 98, 24, 11, 64], dtype=float)},
... index=["a", "b", "c", "d", "e"]
... )
>>> df
powers random
x y z
a 1 1 45.0
b 2 3 98.0
c 4 9 24.0
d 8 27 11.0
e 16 81 64.0
>>> df["powers"]
x y
a 1 1
b 2 3
c 4 9
d 8 27
e 16 81
>>> df["powers"]["x"] = 0
__main__:1: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
>>> df
powers random
x y z
a 1 1 45.0
b 2 3 98.0
c 4 9 24.0
d 8 27 11.0
e 16 81 64.0
This time, df has more than one data type, so df["powers"] returns a copy. The assignment df["powers"]["x"] = 0 makes a change on this copy, df remains unchanged, and pandas issues a SettingWithCopyWarning.
The recommended way to modify df is to avoid chained assignment. You've learned that accessors can be very convenient, but they aren't always needed:
>>> df = pd.DataFrame(
... data={("powers", "x"): 2**np.arange(5),
... ("powers", "y"): 3**np.arange(5),
... ("random", "z"): np.array([45, 98, 24, 11, 64], dtype=float)},
... index=["a", "b", "c", "d", "e"]
... )
>>> df["powers", "x"] = 0
>>> df
powers random
x y z
a 0 1 45.0
b 0 3 98.0
c 0 9 24.0
d 0 27 11.0
e 0 81 64.0
>>> df = pd.DataFrame(
... data={("powers", "x"): 2**np.arange(5),
... ("powers", "y"): 3**np.arange(5),
... ("random", "z"): np.array([45, 98, 24, 11, 64], dtype=float)},
... index=["a", "b", "c", "d", "e"]
... )
>>> df.loc[:, ("powers", "x")] = 0
>>> df
powers random
x y z
a 0 1 45.0
b 0 3 98.0
c 0 9 24.0
d 0 27 11.0
e 0 81 64.0
In both cases, you get the modified DataFrame df without a SettingWithCopyWarning.
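The single-call approach can also combine a row condition with a tuple column label. In this sketch, the Boolean mask rows is an illustrative addition that isn't part of the examples above:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    data={
        ("powers", "x"): 2 ** np.arange(5),
        ("powers", "y"): 3 ** np.arange(5),
        ("random", "z"): np.array([45, 98, 24, 11, 64]),
    },
    index=["a", "b", "c", "d", "e"],
)

# Boolean mask built from a single column of the hierarchical index
rows = df[("random", "z")] < 50  # True for rows a, c, and d

# One .loc[] call: row mask plus the full column tuple
df.loc[rows, ("powers", "x")] = 0

print(df[("powers", "x")].tolist())  # [0, 2, 0, 0, 16]
```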
Change the Default SettingWithCopyWarning Behavior
The SettingWithCopyWarning is a warning, not an error. Your code still executes when it's issued, even though it may not work as intended.
pandas allows you to change this behavior by modifying the mode.chained_assignment option with pandas.set_option(). You can use the following settings:
- pd.set_option("mode.chained_assignment", "raise") raises a SettingWithCopyError.
- pd.set_option("mode.chained_assignment", "warn") issues a SettingWithCopyWarning. This is the default behavior.
- pd.set_option("mode.chained_assignment", None) suppresses both the warning and the error.
For example, this code raises a SettingWithCopyError instead of issuing a SettingWithCopyWarning:
>>> df = pd.DataFrame(
... data={("powers", "x"): 2**np.arange(5),
... ("powers", "y"): 3**np.arange(5),
... ("random", "z"): np.array([45, 98, 24, 11, 64], dtype=float)},
... index=["a", "b", "c", "d", "e"]
... )
>>> pd.set_option("mode.chained_assignment", "raise")
>>> df["powers"]["x"] = 0
In addition to modifying the default behavior, you can use get_option() to retrieve the current setting of mode.chained_assignment:
>>> pd.get_option("mode.chained_assignment")
'raise'
You get "raise" in this case because you changed the behavior with set_option(). Normally, pd.get_option("mode.chained_assignment") returns "warn".
Although you can suppress it, keep in mind that the SettingWithCopyWarning can be very useful in notifying you about improper code.
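Because the warning goes through Python's standard warnings machinery, one possible way to keep it useful in automated checks is to record it instead of reading console output. The sketch below makes no assumption about which warning categories are reported, since that depends on the installed pandas version, so it simply collects their names:

```python
import warnings

import numpy as np
import pandas as pd

# Assumed to mirror the DataFrame shown in this article
data = {
    "x": 2 ** np.arange(5),
    "y": 3 ** np.arange(5),
    "z": np.array([45, 98, 24, 11, 64]),
}
df = pd.DataFrame(data=data, index=["a", "b", "c", "d", "e"])
mask = df["z"] < 50

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")  # record every warning that is issued
    df[mask]["z"] = 0                # chained assignment on a copy

# Names of the warning categories that were recorded, if any
categories = [w.category.__name__ for w in caught]
print(categories)

# df[mask] is a copy, so the original data is untouched
print(df["z"].tolist())  # [45, 98, 24, 11, 64]
```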