If you deal with large data sets, you likely recall the “aha!” moment when you found the pandas library during your Python adventure. pandas is a game-changer for data science and analytics, especially if you came to Python seeking a more powerful alternative to Excel and VBA.
What is it about pandas that has data scientists, analysts, and engineers like me so enthusiastic? According to the pandas documentation, it provides “fast, flexible, and expressive data structures designed to make working with ‘relational’ or ‘labeled’ data both easy and intuitive.”
Fast, flexible, easy, and intuitive? That sounds fantastic! If your job involves building complex data models, you don’t want to spend half of your development time waiting for modules to churn through big data sets. You want to dedicate your time and brainpower to interpreting your data, not stumbling around with less effective tools.
But I heard that pandas is slow…
When I first started using pandas, I was advised that, while it was a great tool for dissecting data, it was too slow to use for statistical modeling. At first, this proved accurate. I spent more than a few minutes twiddling my thumbs while pandas churned through data.
Then I discovered that pandas is built on top of the NumPy array structure, and many of its operations are carried out in C, either through NumPy or pandas’ own library of Python extension modules that are written in Cython and compiled to C. Thus, shouldn’t pandas also be swift?
If you utilise it as it was designed, it most certainly should be!
The paradox is that code that ordinarily “looks” Pythonic may be inefficient in pandas in terms of performance. Similar to NumPy, pandas is optimised for vectorized operations that act on whole columns or datasets in a single pass. Individually considering each “cell” or row should normally be the final choice, not the first.
This Tutorial
This is not a tutorial for overoptimizing your pandas code. pandas is already optimised for speedy execution when used properly. Also, there is a significant distinction between optimization and producing clean code.
This is a tutorial on using pandas Pythonically to get the most out of its powerful and easy-to-use built-in features. In addition, you will learn a few practical time-saving tips, so you won’t be twiddling your thumbs every time you work with your data.
This lesson will cover the following topics:
- Advantages of using `datetime` data with time series
- The most efficient route to doing batch calculations
- Saving time by storing data with HDFStore
To illustrate these concepts, I will use an example from my day job that examines a time series of electricity consumption. After importing the data, you will move through progressively more efficient ways to get to the final result. As with most things in pandas, the proverb that there are several routes from point A to point B holds true. Nevertheless, this does not mean that all of the possible approaches scale equally well to bigger, more demanding datasets.
Assuming you are already familiar with pandas data selection fundamentals, let’s get started.
The Work Before Us
The purpose of this example is to calculate the total yearly cost of energy consumption under a time-of-use tariff. In other words, the price of electricity varies by the hour of the day, so the objective is to multiply the electricity consumed in each hour by the correct price for that hour.
Our data lives in a CSV file with two columns: one for date and time, and one for electrical energy consumed in kilowatt hours (kWh).
There are 365 x 24 = 8760 rows for the whole year, with each row representing one hour of electricity use. Each row reflects the consumption for the “hour beginning” at the given time, so 1/1/13 0:00 denotes the usage for the first hour of January 1.
Saving Time With Datetime Data
First, you must use one of pandas’ I/O routines to read your data from the CSV file:
>>> import pandas as pd
>>> pd.__version__
'0.23.1'
# Make sure that `demand_profile.csv` is in your
# current working directory.
>>> df = pd.read_csv('demand_profile.csv')
>>> df.head()
date_time energy_kwh
0 1/1/13 0:00 0.586
1 1/1/13 1:00 0.580
2 1/1/13 2:00 0.572
3 1/1/13 3:00 0.596
4 1/1/13 4:00 0.592
This looks fine at first inspection, but there is a subtle flaw. pandas and NumPy share the concept of dtypes (data types). If no arguments are specified, the date_time column takes on the following dtype:
>>> df.dtypes
date_time object
energy_kwh float64
dtype: object
>>> type(df.iat[0, 0])
str
This is not an ideal situation. object acts as a container for any column that does not fit cleanly into a single data type, including str. Working with dates as strings would be clumsy and inefficient. (It would also be memory-inefficient.)
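If you want to see this cost for yourself, you can ask pandas for the column’s memory footprint (an illustrative aside; the exact number varies by platform, so no output is shown):

>>> # deep=True also counts the string payloads themselves,
>>> # not just the object pointers held in the column
>>> df['date_time'].memory_usage(deep=True)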
When working with time series data, you’ll want the date_time column to be represented as an array of datetime objects. (pandas calls this a Timestamp.) pandas makes each step of this process simple:
>>> df['date_time'] = pd.to_datetime(df['date_time'])
>>> df['date_time'].dtype
datetime64[ns]
(Note that you could also use a pandas PeriodIndex in this situation.)
You now have a DataFrame named df that closely resembles our CSV file. It has two columns and a numeric index for referencing rows.
>>> df.head()
date_time energy_kwh
0 2013-01-01 00:00:00 0.586
1 2013-01-01 01:00:00 0.580
2 2013-01-01 02:00:00 0.572
3 2013-01-01 03:00:00 0.596
4 2013-01-01 04:00:00 0.592
The code above is basic and straightforward, but how quick is it? Let’s put it to the test using the unoriginally named @timeit timing decorator. This decorator closely resembles timeit.repeat() from Python’s standard library, except it enables you to return the function’s result and show its average execution time over several trials. (The timing results are returned by Python’s timeit.repeat(), not the function result.)
Writing a function and placing the @timeit decorator directly above it causes the function to be timed each time it is called. The decorator runs both an outer and an inner loop.
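The decorator itself is not listed in this lesson. If you want to follow along without the accompanying materials, here is a minimal sketch of how such a decorator could be written; the name and output format are chosen to match the calls below, and this is an assumption rather than the lesson’s exact implementation:

from functools import wraps
from timeit import default_timer

def timeit(repeat=3, number=10):
    """Time a decorated function over `repeat` trials of `number` calls each."""
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            per_call_averages = []
            for _ in range(repeat):          # outer loop: trials
                start = default_timer()
                for _ in range(number):      # inner loop: calls per trial
                    result = func(*args, **kwargs)
                per_call_averages.append((default_timer() - start) / number)
            print(f'Best of {repeat} trials with {number} function calls per trial:')
            print(f'Function `{func.__name__}` ran in average of '
                  f'{min(per_call_averages):.3f} seconds.')
            # Unlike timeit.repeat(), hand back the function's own result
            return result
        return wrapper
    return decorator

With a decorator like this in scope, timing the plain conversion looks like this: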
>>> @timeit(repeat=3, number=10)
... def convert(df, column_name):
... return pd.to_datetime(df[column_name])
>>> # Read in again so that we have `object` dtype to start
>>> df['date_time'] = convert(df, 'date_time')
Best of 3 trials with 10 function calls per trial:
Function `convert` ran in average of 1.610 seconds.
The outcome? 1.6 seconds for 8760 rows of data. “That’s no time at all,” you might say. But what if you encounter bigger data sets, such as a year’s worth of electricity use at one-minute intervals? That is sixty times more data, so you would end up waiting around one and a half minutes. That starts to sound intolerable.
I recently evaluated 10 years’ worth of hourly electricity data from 330 sites. Do you think I was willing to wait 88 minutes for the datetimes to convert? Certainly not!
How can this be sped up? In general, pandas will be much faster the less it has to interpret your data. In this case, describing the format of your date and time data to pandas with the format argument results in a significant speedup. You can do this by passing the appropriate strftime codes:
>>> @timeit(repeat=3, number=100)
... def convert_with_format(df, column_name):
... return pd.to_datetime(df[column_name],
... format='%d/%m/%y %H:%M')
Best of 3 trials with 100 function calls per trial:
Function `convert_with_format` ran in average of 0.032 seconds.
The new outcome? 0.032 seconds, which is a 50-fold improvement! In other words, you have just saved my 330 sites around 86 minutes of processing time. Not a bad improvement!
One nice detail: the datetimes in this CSV file are not in ISO 8601 format; they appear as d/m/yy H:MM rather than YYYY-MM-DD HH:MM. If no format is specified, pandas will use the dateutil package to convert each string to a date.
In contrast, if the raw datetime data is already in ISO 8601 format, pandas can immediately take a fast route to parsing the dates. This is one reason why being explicit about the format pays off so much here. Another option is to pass infer_datetime_format=True, although it is generally preferable to be explicit.
Note that read_csv() can also parse dates as part of the file I/O step. See the parse_dates, infer_datetime_format, and date_parser parameters.
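If you would rather fold the conversion into the read itself, a sketch might look like this (reusing the format string from above via date_parser; illustrative, not the lesson’s listing):

>>> df = pd.read_csv('demand_profile.csv',
...                  parse_dates=['date_time'],
...                  date_parser=lambda col: pd.to_datetime(
...                      col, format='%d/%m/%y %H:%M'))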
Easy Looping Over pandas Data
You are now prepared to calculate your power expenses, since your dates and times are now in a handy format. You must conditionally apply a cost factor to each hour of the day, since costs vary by hour. In this example, we will define the time-of-use expenses as follows:
| Tariff Type | Cents per kWh | Time Range     |
|-------------|---------------|----------------|
| Peak        | 28            | 17:00 to 24:00 |
| Shoulder    | 20            | 7:00 to 17:00  |
| Off-Peak    | 12            | 0:00 to 7:00   |
If the price were a constant 28 cents per kWh for every hour of the day, most people familiar with pandas would know that this calculation can be accomplished in a single line:
>>> df['cost_cents'] = df['energy_kwh'] * 28
This will result in the introduction of a new column containing the hourly cost of electricity:
date_time energy_kwh cost_cents
0 2013-01-01 00:00:00 0.586 16.408
1 2013-01-01 01:00:00 0.580 16.240
2 2013-01-01 02:00:00 0.572 16.016
3 2013-01-01 03:00:00 0.596 16.688
4 2013-01-01 04:00:00 0.592 16.576
# ...
Nevertheless, our pricing computation depends on the time of day. Here is where you will find a large number of individuals employing pandas in an unintended manner: by building a loop to do the conditional computation.
For the remainder of this course, you will go from a less-than-ideal starting point to a Pythonic solution that fully exploits pandas.
But first, what does Pythonic mean in the context of pandas? Ironically, people with experience in other (less user-friendly) languages such as C++ or Java are especially susceptible to looping, because they “think in loops” by habit.
Let’s examine a non-Pythonic approach to loops that many people employ when they are uninformed of how pandas is intended to be utilised. We will use @timeit once again to determine the speed of this method.
First, create a function that applies the appropriate tariff to a given hour:
def apply_tariff(kwh, hour):
"""Calculates cost of electricity for given hour."""
if 0 <= hour < 7:
rate = 12
elif 7 <= hour < 17:
rate = 20
elif 17 <= hour < 24:
rate = 28
else:
raise ValueError(f'Invalid hour: {hour}')
return rate * kwh
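As a quick sanity check (an illustrative call, not part of the original listing), 1.5 kWh consumed at 18:00 falls in the peak range:

>>> apply_tariff(kwh=1.5, hour=18)  # 28 cents/kWh * 1.5 kWh
42.0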
Here is the non-Pythonic loop in all its glory:
>>> # NOTE: Don't do this!
>>> @timeit(repeat=3, number=100)
... def apply_tariff_loop(df):
... """Calculate costs in loop. Modifies `df` inplace."""
... energy_cost_list = []
... for i in range(len(df)):
... # Get electricity used and hour of day
... energy_used = df.iloc[i]['energy_kwh']
... hour = df.iloc[i]['date_time'].hour
... energy_cost = apply_tariff(energy_used, hour)
... energy_cost_list.append(energy_cost)
... df['cost_cents'] = energy_cost_list
...
>>> apply_tariff_loop(df)
Best of 3 trials with 100 function calls per trial:
Function `apply_tariff_loop` ran in average of 3.152 seconds.
This architecture may appear natural to those who have written “pure Python” for some time prior to using pandas: You have a conventional “for each x, if y then z” statement.
Still, this loop is awkward. For several reasons, you could consider the above an “antipattern” in pandas. First, it needs to initialize a list in which the outputs will be recorded.
Second, it iterates over the opaque object range(0, len(df)), and then, after applying apply_tariff(), it has to append the result to a list that is used to build the new DataFrame column. It also uses chained indexing with df.iloc[i]['date_time'], which often leads to unintended results.
Nevertheless, the greatest disadvantage of this method is the time required for computations. With 8760 rows of data, this loop took over 3 seconds on my system. After this, you will examine some enhanced ways for iterating through pandas structures.
Iteration Using .itertuples() and .iterrows()
What alternatives are available? With the DataFrame.itertuples() and DataFrame.iterrows() methods, pandas has effectively rendered the for i in range(len(df)) syntax obsolete. Both are generator methods that yield one row at a time.
.itertuples() returns a namedtuple for each row, with the row’s index value as the first element of the tuple. A namedtuple is a data structure from Python’s collections module that behaves like a tuple but has fields accessible via attribute lookup.
.iterrows() returns (index, Series) pairs for each row in the DataFrame.
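For comparison, here is a sketch of what the same loop body would look like with .itertuples(); it is not one of the timed functions below, just an illustration of the attribute-style access:

>>> for row in df.itertuples():
...     # Fields are attributes; `row.Index` would hold the index value
...     energy_used = row.energy_kwh
...     hour = row.date_time.hour
...     energy_cost = apply_tariff(energy_used, hour)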
While .itertuples() tends to be a bit quicker, for this example let’s stay in pandas and use .iterrows(), since some readers may not have encountered namedtuple before. Consider what this achieves:
>>> @timeit(repeat=3, number=100)
... def apply_tariff_iterrows(df):
...     energy_cost_list = []
...     for index, row in df.iterrows():
...         # Get electricity used and hour of day
...         energy_used = row['energy_kwh']
...         hour = row['date_time'].hour
...         # Append cost to list
...         energy_cost = apply_tariff(energy_used, hour)
...         energy_cost_list.append(energy_cost)
...     df['cost_cents'] = energy_cost_list
...
>>> apply_tariff_iterrows(df)
Best of 3 trials with 100 function calls per trial:
Function `apply_tariff_iterrows` ran in average of 0.713 seconds.
Some modest gains have been made. The syntax is more explicit, and there is less clutter in the row value references, so it is more readable. In terms of time savings, it is almost five times faster!
Yet there is still room for improvement. You are still using some form of Python for-loop, which means that each and every function call is made in Python when it could ideally be done in a faster language built into pandas’ internal architecture.
pandas’ .apply()
You can improve this routine further by using the .apply() method instead of .iterrows(). The .apply() method applies functions (callables) along an axis of a DataFrame (all rows, or all columns). In this example, a lambda function is used to pass apply_tariff() two columns of data:
>>> @timeit(repeat=3, number=100)
... def apply_tariff_withapply(df):
...     df['cost_cents'] = df.apply(
...         lambda row: apply_tariff(
...             kwh=row['energy_kwh'],
...             hour=row['date_time'].hour),
...         axis=1)
...
>>> apply_tariff_withapply(df)
Best of 3 trials with 100 function calls per trial:
Function `apply_tariff_withapply` ran in average of 0.272 seconds.
The syntactic advantages of .apply() are clear: a substantial reduction in the number of lines and very legible, plain code. In this instance, the runtime was roughly half that of the .iterrows() approach.
Still, this is not yet “blazingly fast.” One reason is that .apply() will try internally to loop over Cython iterators. In this case, however, the lambda that was passed cannot be handled in Cython, so it is called in Python, which is consequently much slower.
If you were to use .apply() on my 10 years of hourly data for 330 sites, you would be looking at around 15 minutes of processing time. If this calculation were a minor component of a larger model, you would want to speed things up considerably. This is where vectorized operations come in handy.
Choosing Data Using .isin()
Earlier, you saw that if there were a single electricity price, you could apply that price to all of the power consumption data in one line of code (df['energy_kwh'] * 28). That operation was an example of a vectorized operation, and it is the quickest way to get things done in pandas.
So how can conditional calculations be implemented as vectorized operations in pandas? One technique is to select and group parts of the DataFrame based on your conditions and then apply a vectorized operation to each selected group.
In the next example, you will see how to select rows with pandas’ .isin() method and then apply the appropriate tariff in a vectorized operation. Before doing this, it will be convenient to set the date_time column as the DataFrame’s index:
>>> df.set_index('date_time', inplace=True)
Let’s compare the following:
>>> @timeit(repeat=3, number=100)
... def apply_tariff_isin(df):
...     # Define hour-range Boolean arrays
...     peak_hours = df.index.hour.isin(range(17, 24))
...     shoulder_hours = df.index.hour.isin(range(7, 17))
...     off_peak_hours = df.index.hour.isin(range(0, 7))
...     # Apply tariffs to the selected hour ranges
...     df.loc[peak_hours, 'cost_cents'] = df.loc[peak_hours, 'energy_kwh'] * 28
...     df.loc[shoulder_hours, 'cost_cents'] = df.loc[shoulder_hours, 'energy_kwh'] * 20
...     df.loc[off_peak_hours, 'cost_cents'] = df.loc[off_peak_hours, 'energy_kwh'] * 12
...
>>> apply_tariff_isin(df)
Best of 3 trials with 100 function calls per trial:
Function `apply_tariff_isin` ran in average of 0.010 seconds.
To understand this code, you need to know that the .isin() method returns an array of Boolean values that looks like this:
>>> df.index.hour.isin(range(17, 24))
array([False, False, False, ..., True, True, True])
These values indicate which DataFrame indexes (datetimes) fall inside the chosen hour range. Then, when you feed these Boolean arrays to the DataFrame’s.loc indexer, you get a DataFrame slice containing only rows that match the specified hours. Following that, multiplying the slice by the relevant tariff is a simple vectorized operation that is performed quickly.
How does this compare to the looping operations above? First, you may notice that apply_tariff() is no longer required, since all of the conditional logic is applied during the row selection. This means a significant reduction in the number of lines of code you must write and in the amount of Python code that is called.
What about the processing time? 315 times faster than the non-Pythonic loop, about 71 times faster than .iterrows(), and 27 times faster than .apply(). You are now moving at the speed required to churn through big data sets quickly.
Can We Do Better?
By calling df.loc and df.index.hour.isin() three times each in apply_tariff_isin(), we are admittedly still doing some “manual labour.” You could argue that this technique does not scale if we had a finer-grained range of time slots. (A different rate for each hour would require 24 .isin() calls.) Fortunately, pandas’ pd.cut() function lets you do this much more programmatically:
>>> @timeit(repeat=3, number=100)
... def apply_tariff_cut(df):
...     cents_per_kwh = pd.cut(x=df.index.hour,
...                            bins=[0, 7, 17, 24],
...                            include_lowest=True,
...                            labels=[12, 20, 28]).astype(int)
...     # The labels come back as a Categorical, hence the cast before multiplying
...     df['cost_cents'] = cents_per_kwh * df['energy_kwh']
...
Let’s take a moment to assess what is happening here. pd.cut() applies an array of labels (our costs) to each hour according to which bin that hour belongs in. Note that the include_lowest argument specifies whether the first interval should be left-inclusive. (You want to include time=0 in a group.)
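To see what pd.cut() produces, here is a small illustrative example on a handful of hours (chosen away from the bin edges; not part of the original listing):

>>> pd.cut(x=[0, 6, 8, 16, 18, 23],
...        bins=[0, 7, 17, 24],
...        include_lowest=True,
...        labels=[12, 20, 28])
[12, 12, 20, 20, 28, 28]
Categories (3, int64): [12 < 20 < 28]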
This is a completely vectorized method for achieving the desired outcome, and it is the fastest:
>>> apply_tariff_cut(df)
Best of 3 trials with 100 function calls per trial:
Function `apply_tariff_cut` ran in average of 0.003 seconds.
You’ve reduced the time required to process the full 330-site dataset from potentially over an hour to under a second. Not bad! There is one final option, though: using NumPy functions to manipulate the underlying NumPy arrays of each DataFrame and then integrating the results back into pandas data structures.
Don’t Forget NumPy!
When using pandas, it is important to remember that pandas Series and DataFrames are built on top of NumPy arrays. This gives you much more flexibility in computing, since pandas works seamlessly with NumPy arrays and operations.
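You can see this relationship for yourself (an illustrative aside): the values attribute of a Series hands you the underlying NumPy array.

>>> type(df['energy_kwh'].values)
<class 'numpy.ndarray'>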
In the next scenario, you will use NumPy’s digitize() function. It is similar to pandas’ cut() in that the data will be binned, but this time the bins will be represented as an array of indices indicating which bin each hour belongs to. These indices are then applied to a price array:
>>> import numpy as np
>>> @timeit(repeat=3, number=100)
... def apply_tariff_digitize(df):
...     prices = np.array([12, 20, 28])
...     bins = np.digitize(df.index.hour.values, bins=[7, 17, 24])
...     df['cost_cents'] = prices[bins] * df['energy_kwh'].values
...
Like cut(), the syntax here is pleasantly succinct and easy to read. But how does its speed compare? Let’s see:
>>> apply_tariff_digitize(df)
Best of 3 trials with 100 function calls per trial:
Function `apply_tariff_digitize` ran in average of 0.002 seconds.
There is still a gain in performance at this time, but it is becoming more marginal. It is a good moment to stop working on code enhancements and consider the larger picture.
Using pandas, it is possible to keep, if you will, a “hierarchy” of preferred alternatives for batch computations such as the one you performed here. Typically, they will be listed from quickest to slowest (and most flexible to least flexible):
- Use vectorized operations: pandas methods and functions with no for-loops.
- Use the `.apply()` method with a callable.
- Use `.itertuples()`: iterate over DataFrame rows as `namedtuples` from Python’s `collections` module.
- Use `.iterrows()`: iterate over DataFrame rows as (index, `pd.Series`) pairs. While a pandas Series is a flexible data structure, it can be costly to construct each row into a Series and then access it.
- Use “element-by-element” for loops, updating each cell or row one at a time with `df.loc` or `df.iloc`. (Or, `.at`/`.iat` for fast scalar access.)
Don’t take my word for it: the order of precedence above was suggested by a core pandas developer.
You may read How to Iterate Over Rows in Pandas and Why You Shouldn’t to learn more about your alternatives.
Here’s how the aforementioned “order of precedence” applies to each function you’ve created:
| Function                   | Runtime (seconds) |
|----------------------------|-------------------|
| `apply_tariff_loop()`      | 3.152             |
| `apply_tariff_iterrows()`  | 0.713             |
| `apply_tariff_withapply()` | 0.272             |
| `apply_tariff_isin()`      | 0.010             |
| `apply_tariff_cut()`       | 0.003             |
| `apply_tariff_digitize()`  | 0.002             |
Avoid Reprocessing With HDFStore
Now that you’ve examined fast data processing in pandas, let’s explore how to avoid reprocessing time altogether with HDFStore, which is included with pandas.
While constructing a complicated data model, it is often advantageous to preprocess your data. Even if you use the format option, merely converting the date and time to datetime may take 20 minutes if you have 10 years of minute-frequency power usage data. You should only have to perform this once, not each time your model is executed for testing or analysis.
You may preprocess and then save your data in its processed form so that it can be accessed when necessary. So how can data be stored in the correct format without needing to be reprocessed? If you were to save as CSV, you would lose your datetime objects and be required to reprocess them upon subsequent access.
HDF5 is a high-performance storage format designed primarily for storing tabular arrays of data. pandas provides a built-in solution that uses HDF5 for this purpose. The HDFStore class lets you store your DataFrame in an HDF5 file for fast access while preserving column types and other metadata. It is a dictionary-like class, so you can read and write to it much as you would a Python dict object.
Here is how you would store your preprocessed power consumption DataFrame, df, in an HDF5 file:
>>> # Create storage object with filename `processed_data`
>>> data_store = pd.HDFStore('processed_data.h5')
>>> # Put DataFrame into the object, setting the key as 'preprocessed_df'
>>> data_store['preprocessed_df'] = df
>>> data_store.close()
You may now shut down your computer and take a break, knowing that your processed data will be ready for you when you return. No reprocessing is necessary. Here’s how to get your data from the HDF5 file while preserving data types:
>>> # Access data store
>>> data_store = pd.HDFStore('processed_data.h5')
>>> # Retrieve data using key
>>> preprocessed_df = data_store['preprocessed_df']
>>> data_store.close()
A data store may include several tables, with each table’s name serving as its key.
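For instance, you can list the keys a store currently holds (using the hypothetical store from above):

>>> data_store = pd.HDFStore('processed_data.h5')
>>> data_store.keys()
['/preprocessed_df']
>>> data_store.close()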
A note about using HDFStore in pandas: you need PyTables >= 3.0.0 installed, so after installing pandas, make sure to update PyTables like this:
pip install --upgrade tables
Conclusions
If you do not believe that your pandas project is quick, versatile, simple, and intuitive, you should reconsider your use of the library.
These examples are fairly simple, but they show how the proper application of pandas features can dramatically improve both runtime and code readability. You can apply the following guidelines the next time you work with large data sets in pandas:

- Try vectorized operations first: pandas methods and functions with no for-loops.
- If vectorization is impossible or too difficult to work out efficiently, fall back to the `.apply()` method with a callable.
- If you do have to loop over your data, prefer `.itertuples()` over `.iterrows()`.
- When parsing datetimes, be explicit: pass a `format` string, or parse dates during `read_csv()`.
- Preprocess your data once, then save it with `HDFStore` so you never pay the conversion cost again.