The internet hosts a staggering amount of information, which makes it a valuable resource for any field of research or personal interest. To capture that data efficiently, you’ll need to become skilled at web scraping. The requests and Beautiful Soup Python libraries are both excellent tools for the job. If you have a basic grasp of Python and HTML and like learning through hands-on examples, then this tutorial is for you. By the end of this guide, you’ll know how to:
- Decipher data encoded in URLs
- Use requests and Beautiful Soup for scraping and parsing data from the Web
- Step through a web scraping pipeline from start to finish
- Build a script that fetches job offers from the Web and displays relevant information in your console
Working through this project will give you the knowledge of the process and the tools you need to scrape any static website on the World Wide Web. You can download the project source code by clicking on the link below:
Let’s get started!
What Is Web Scraping?
The process of gathering information from the internet is known as “web scraping.” Even copying and pasting the lyrics of your favorite song can be a form of web scraping! However, the term usually refers to a process that involves automation. While some websites take exception to automatic scrapers gathering their data, others don’t mind the practice.
You most likely won’t run into any issues if you scrape a website politely for educational purposes. Still, it’s a good idea to do some research on your own and make sure that you’re not violating any Terms of Service before you start a large-scale project.
Reasons for Web Scraping
Say you’re someone who enjoys surfing, both online and in real life, and you’re on the hunt for a job. However, you’re not looking for just any job. With a surfer’s mindset, you’re waiting for the perfect opportunity to roll your way!
There’s a website that recruits for exactly the type of work you’re looking for. Unfortunately, a new position only pops up once in a blue moon, and the site doesn’t provide an email notification service. You consider checking it every day, but that doesn’t sound like the most fun or productive way to spend your time.
Thankfully, the world offers other ways to apply that surfer’s mindset! Instead of checking the job site every day, you can use Python to automate the repetitive parts of your job search. Automated web scraping can be a solution to speed up the data collection process. You write your code once, and it will get the information you want many times and from many pages.
In contrast, when you try to get the information you want manually, you might spend a lot of time clicking, scrolling, and searching, especially if you need large amounts of data from websites that are regularly updated with new content. Manual web scraping can take a lot of time and repetition.
There’s so much information on the internet, and new information is constantly added. You’ll probably be interested in at least some of that data, and much of it is freely available for the taking. Whether you’re actually on the job hunt or you just want to download all the lyrics of your favorite artist, automated web scraping can help you accomplish your goals.
Challenges of Web Scraping
The web has grown organically out of many sources. It combines many different technologies, styles, and personalities, and it continues to develop to this day. In other words, the web is kind of a mess! Because of this, you’ll face some challenges when scraping the web:
- Variety: Every website is different. While you’ll encounter general structures that repeat themselves, each website is unique and will need personal treatment if you want to extract the relevant information.
- Durability: Websites constantly change. Say you’ve built a shiny new web scraper that automatically cherry-picks what you want from your resource of interest. The first time you run your script, it works flawlessly. But when you run the same script only a short while later, you run into a discouraging and lengthy stack of tracebacks!
Unstable scripts are a realistic scenario because many websites are in active development. If a site’s structure changes, then your scraper might no longer be able to navigate the sitemap correctly or find the content it’s after. The good news is that changes to websites are often small and incremental, so you’ll likely be able to update your scraper with only minimal adjustments.
However, keep in mind that because the internet is dynamic, the scrapers you’ll build will probably require constant maintenance. You can set up continuous integration to run scraping tests periodically to ensure that your main script doesn’t break without your knowledge.
An Alternative to Web Scraping: APIs
Some website providers offer application programming interfaces (APIs) that allow you to access their data in a predefined manner. With APIs, you can avoid parsing HTML. Instead, you can access the data directly using formats like JSON and XML. HTML is primarily a way to present content to users visually.
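To see why that direct access is convenient, here’s a minimal sketch of parsing a JSON response with Python’s standard library. The payload below is hypothetical; a real API defines its own fields:

```python
import json

# A hypothetical JSON payload, as an API for a job board might return it
payload = '{"jobs": [{"title": "Senior Python Developer", "location": "Stewartbury, AA"}]}'

# json.loads() turns the JSON string into ordinary Python dicts and lists
data = json.loads(payload)

print(data["jobs"][0]["title"])     # Senior Python Developer
print(data["jobs"][0]["location"])  # Stewartbury, AA
```

No HTML parsing is needed: the structure of the data is explicit, and you can index into it directly.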
When you use an API, the process is generally more stable than gathering data through web scraping. That’s because developers create APIs to be consumed by programs rather than by human eyes.
The front-end presentation of a site might change often, but such a change in the website’s design doesn’t necessarily affect its API structure. The structure of an API is usually more permanent, which means it’s a more reliable source of the site’s data.
However, APIs can change as well. The challenges of both variety and durability apply to APIs just as they do to websites. Additionally, it’s much harder to inspect the structure of an API by yourself if the provided documentation lacks quality.
The approach and tools you need to gather information using APIs are outside the scope of this tutorial. Check out API Integration in Python if you want to learn more about this subject.
Scrape the Fake Python Job Site
In this tutorial, you’ll build a web scraper that fetches Python software developer job listings from the Fake Python Jobs site. It’s an example site with fake job postings that you can freely scrape to train your skills. Your web scraper will parse the HTML on the site to pick out the relevant information and filter that content for specific words. Note: A previous version of this tutorial focused on scraping the Monster job board. However, Monster has since changed its website and no longer provides static HTML content. This updated tutorial instead uses a self-hosted static website, which is guaranteed to stay the same and gives you a reliable playground to practice the skills you need for web scraping.
You can scrape any site on the internet that you can look at, but the difficulty of doing so varies from site to site. This tutorial offers you an introduction to web scraping to help you understand the overall process. Then, you can apply this same process for every website you’ll want to scrape.
Throughout the tutorial, you’ll also encounter a few exercise blocks. You can click to expand them and challenge yourself by completing the tasks described there.
Step 1: Inspect Your Data Source
Before you write any Python code, you need to get to know the website that you want to scrape. That should be your first step for any web scraping project you want to tackle. You’ll need to understand the site’s structure to extract the information that’s relevant for you. Start by opening the site you want to scrape with your favorite browser.
Explore the Website
Click through the site and interact with it just like any typical job seeker would. For example, on the site’s home page, you can scroll through the following content:
You’ll see many job postings in a card format, and each of them has two buttons. If you click the Apply button, then you’ll see a new page that contains more detailed descriptions of the selected job. You might also notice that the URL in your browser’s address bar changes when you interact with the website.
Decipher the Information in URLs
A programmer can encode a lot of information in a URL. Your web scraping journey will be much easier if you first become familiar with how URLs work and what they’re made of. For example, you might notice that you’re on a details page with the following URL:
https://realpython.github.io/fake-jobs/jobs/senior-python-developer-0.html
You can break the above URL into two main parts:
- The base URL represents the path to the search functionality of the website. In the example above, the base URL is https://realpython.github.io/fake-jobs/.
- The specific site location that ends with .html is the path to the job description’s unique resource.
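If you want to inspect these components programmatically, Python’s standard library can split a URL for you. Here’s a small sketch using urllib.parse on the URL above:

```python
from urllib.parse import urlparse

url = "https://realpython.github.io/fake-jobs/jobs/senior-python-developer-0.html"
parts = urlparse(url)

# urlparse() separates the scheme, the host, and the resource path
print(parts.scheme)  # https
print(parts.netloc)  # realpython.github.io
print(parts.path)    # /fake-jobs/jobs/senior-python-developer-0.html
```

This comes in handy later if you want to build the full URL of a job posting from the base URL and a relative path.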
Any job posted on this website will use the same base URL. However, the unique resource’s location will differ depending on which specific job posting you’re viewing.
URLs can hold more information than just the location of a file. Some websites use query parameters to encode values that you submit when performing a search. You can think of them as query strings that you send to the database to retrieve specific records.
You’ll find query parameters at the end of a URL. For example, if you go to Indeed and search for “software developer” in “Australia” through their search bar, you’ll see that the URL changes to include these values as query parameters:
https://au.indeed.com/jobs?q=software+developer&l=Australia
The query parameters in this URL are ?q=software+developer&l=Australia. Query parameters consist of three parts:
- Start: The beginning of the query parameters is denoted by a question mark (?).
- Information: The pieces of information constituting one query parameter are encoded in key-value pairs, where related keys and values are joined together by an equals sign (key=value).
- Separator: Every URL can have multiple query parameters, separated by an ampersand symbol (&).
Equipped with this information, you can pick apart the URL’s query parameters into two key-value pairs:
- q=software+developer selects the type of job.
- l=Australia selects the location of the job.
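You can confirm this breakdown with Python’s standard library, which knows how to decode query strings:

```python
from urllib.parse import urlparse, parse_qs

url = "https://au.indeed.com/jobs?q=software+developer&l=Australia"
query = urlparse(url).query  # 'q=software+developer&l=Australia'

# parse_qs() splits the key-value pairs and decodes '+' back into a space
params = parse_qs(query)

print(params["q"])  # ['software developer']
print(params["l"])  # ['Australia']
```

Each key maps to a list of values, because a query parameter may legally appear more than once in a URL.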
Try to change the search parameters and observe how that affects your URL. Go ahead and enter new values in the search bar up top, then modify these values to observe the changes in the URL.
Next, try to change the values directly in your URL. See what happens when you paste the following URL into your browser’s address bar:
https://au.indeed.com/jobs?q=developer&l=perth
If you make changes to the values in the search box on the website and then submit your search, those changes will be immediately reflected in the query parameters of the URL, and vice versa. If you modify one of them, then you will see that the website displays different results for you.
Exploring the URLs of a website, as you can see, may provide you with information on how to access data from the server that hosts the website.
Head back to the Fake Python Jobs site and keep exploring it. This site is a purely static website that doesn’t operate on top of a database, which is why you won’t have to work with query parameters in this tutorial.
Inspect the Site Using Developer Tools
Next, you’ll want to learn more about how the data is structured for display. You’ll need to understand the page structure to pick out the information that you want from the HTML response that you’ll collect in one of the upcoming steps. Developer tools can help you understand the structure of a website. All modern browsers come with developer tools installed. In this section, you’ll see how to work with the developer tools in Chrome. The process will be very similar in other modern browsers.
In Chrome on macOS, you can open the developer tools through the menu by selecting View → Developer → Developer Tools. On Windows and Linux, you can access them by clicking the top-right menu button (⋮) and selecting More Tools → Developer Tools. You can also access your developer tools by right-clicking anywhere on the page and selecting the Inspect option, or by using a keyboard shortcut:
- Mac: Cmd+Alt+I
- Windows/Linux: Ctrl+Shift+I
Developer tools allow you to interactively explore the site’s document object model (DOM) to better understand your source. To dig into your page’s DOM, select the Elements tab in developer tools. You’ll see a structure with clickable HTML elements. You can expand, collapse, and even edit elements right in your browser:
The structure of the page may be seen on the left, while its representation in HTML can be seen on the right.
You can think of the text displayed in your browser as the HTML structure of that page. If you’re interested, then you can read more about the difference between the DOM and HTML on CSS-Tricks.
To zero in on an element’s position on the page, you can right-click it and select the Inspect option from the context menu. You can also hover your mouse pointer over the HTML text on the right and watch the corresponding elements light up on the page.
Click on the following exercise block to expand a task for practicing the use of your developer tools:
Play around and explore! The more you get to know the page you’re working with, the easier it’ll be to scrape it. However, don’t get too overwhelmed by all that HTML text. You’ll use the power of programming to step through this maze and cherry-pick the information that’s relevant to you.
Step 2: Scrape HTML Content From a Page
Now that you have an idea of what you’re working with, it’s time to start using Python. First, you’ll want to get the site’s HTML code into your Python script so that you can interact with it. For this task, you’ll use Python’s requests library.
It’s generally a good idea to create a virtual environment before you install any external package. To install the external requests library, activate your new virtual environment and run the following command in your terminal:
(venv) $ python -m pip install requests
Then open a new file in your favorite text editor. All it takes to retrieve the HTML are a few lines of code:
import requests
URL = "https://realpython.github.io/fake-jobs/"
page = requests.get(URL)
print(page.text)
This code issues an HTTP GET request to the given URL. It retrieves the HTML data that the server sends back and stores that data in a Python object.
If you print the .text attribute of page, then you’ll notice that it looks just like the HTML that you inspected earlier with your browser’s developer tools. You successfully fetched the static site content from the internet! You now have access to the site’s HTML from within your Python script.
Static Websites
The website that you’re scraping in this tutorial serves static HTML content. In this scenario, the server that hosts the site sends back HTML documents that already contain all the content that you’ll get to see as a user.
When you inspected the page with developer tools earlier, you discovered that a job posting consists of the following long and messy-looking HTML:
<div class="card">
<div class="card-content">
<div class="media">
<div class="media-left">
<figure class="image is-48x48">
<img
src="https://files.realpython.com/media/real-python-logo-thumbnail.7f0db70c2ed2.jpg"
alt="Real Python Logo"
/>
</figure>
</div>
<div class="media-content">
<h2 class="title is-5">Senior Python Developer</h2>
<h3 class="subtitle is-6 company">Payne, Roberts and Davis</h3>
</div>
</div>
<div class="content">
<p class="location">Stewartbury, AA</p>
<p class="is-small has-text-grey">
<time datetime="2021-04-08">2021-04-08</time>
</p>
</div>
<footer class="card-footer">
<a
href="https://www.realpython.com"
target="_blank"
class="card-footer-item"
>Learn</a
>
<a
href="https://realpython.github.io/fake-jobs/jobs/senior-python-developer-0.html"
target="_blank"
class="card-footer-item"
>Apply</a
>
</footer>
</div>
</div>
It can be challenging to wrap your head around a long block of HTML code. To make it easier to read, you can use an HTML formatter to clean it up automatically. Good readability helps you better understand the structure of any code block. While it may or may not improve the HTML formatting, it’s always worth a try. Note: Keep in mind that every website will look different. That’s why it’s necessary to inspect and understand the structure of the site you’re currently working with before moving on.
The HTML you’ll encounter will sometimes be confusing. Luckily, the HTML of this job board has descriptive class names on the elements that you’re interested in:
- class="title is-5" contains the title of the job posting.
- class="subtitle is-6 company" contains the name of the company that offers the position.
- class="location" contains the location where you’d be working.
In case you ever get lost in a large pile of HTML, remember that you can always go back to your browser and use the developer tools to further explore the HTML structure interactively.
By now, you’ve successfully harnessed the power and user-friendly design of Python’s requests library. With only a few lines of code, you managed to scrape static HTML content from the web and make it available for further processing.
However, there are a few more challenging situations that you might encounter when you’re scraping websites. Before you begin to pick the relevant information from the HTML that you just scraped, you’ll take a quick look at two of these more challenging situations.
Hidden Websites
Some pages contain information that’s hidden behind a login. That means you’ll need an account to be able to scrape anything from the page. The process for making an HTTP request from your Python script is different from how you access a page from your browser. Just because you can log in to the page through your browser doesn’t mean you’ll be able to scrape it from your Python script.
However, the requests library comes with the built-in capacity to handle authentication. With these techniques, you can log in to websites when making the HTTP request from your Python script and then scrape information that’s hidden behind a login. You won’t need to log in to access the job board information, which is why this tutorial won’t cover authentication.
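As a rough sketch of what that built-in support looks like, here’s how requests attaches HTTP Basic credentials to a request. The URL and credentials below are made up, and no request is actually sent; the example only shows the Authorization header that would accompany it:

```python
import requests
from requests.auth import HTTPBasicAuth

# Hypothetical protected URL and credentials -- nothing is sent over the network
auth = HTTPBasicAuth("user", "secret")
request = requests.Request("GET", "https://example.com/protected-jobs")

# Auth objects are callables that attach the Authorization header to a prepared request
prepared = auth(request.prepare())

print(prepared.headers["Authorization"])  # Basic dXNlcjpzZWNyZXQ=
```

In everyday use, you’d simply pass auth=("user", "secret") to requests.get() and let the library handle this step for you.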
Dynamic Websites
In this tutorial, you’ll learn how to scrape a static website. Static sites are straightforward to work with because the server sends you an HTML page that already contains all the page information in the response. You can parse that HTML response and immediately begin to pick out the relevant data.
On the other hand, with a dynamic website, the server might not send back any HTML at all. Instead, you could receive JavaScript code as a response. This code will look completely different from what you saw when you inspected the page with your browser’s developer tools. Note: In this tutorial, the term dynamic website refers to a website that doesn’t return the same HTML that you see when viewing the page in your browser.
Many modern web applications are designed to provide their functionality in collaboration with the clients’ browsers. Instead of sending HTML pages, these apps send JavaScript code that instructs your browser to create the desired HTML. Web apps deliver dynamic content in this way to offload work from the server to the clients’ machines, avoid page reloads, and improve the overall user experience.
What happens in the browser is not the same as what happens in your script. Your browser will diligently execute the JavaScript code it receives from a server and create the DOM and HTML for you locally. However, if you request a dynamic website in your Python script, then you won’t get the HTML page content, because the page is generated on the client side.
When you use requests, you receive only what the server sends back. In the case of a dynamic website, you’ll end up with JavaScript code instead of HTML. The only way to go from that JavaScript code to the content you’re interested in is to execute the code, just like your browser does. The requests library can’t do that for you, but there are other solutions that can.
For example, requests-html is a project created by the author of the requests library that allows you to render JavaScript using syntax that’s similar to the syntax in requests. It also includes the capability to parse the data by using Beautiful Soup under the hood. Note: Another popular choice for scraping dynamic content is Selenium. You can think of Selenium as a slimmed-down browser that executes the JavaScript code for you before passing on the rendered HTML response to your script.
You won’t go deeper into scraping dynamically generated content in this tutorial. For now, it’s enough to remember to look into one of the options mentioned above if you ever need to scrape a dynamic website.
Step 3: Parse HTML Code With Beautiful Soup
You’ve successfully scraped some HTML from the internet, but when you look at it, it just seems like a huge mess. There are tons of HTML elements here and there, thousands of attributes scattered around, and maybe even some JavaScript mixed in as well. It’s time to parse this lengthy code response with the help of Python to make it more accessible and pick out the data you want. Beautiful Soup is a Python library for parsing structured data. It allows you to interact with HTML in a similar way to how you interact with a web page using developer tools. The library exposes a couple of intuitive functions you can use to explore the HTML you received. To get started, install Beautiful Soup in your terminal:
(venv) $ python -m pip install beautifulsoup4
Then, import the library in your script and create a Beautiful Soup object:
import requests
from bs4 import BeautifulSoup
URL = "https://realpython.github.io/fake-jobs/"
page = requests.get(URL)
soup = BeautifulSoup(page.content, "html.parser")
When you add the two highlighted lines of code, you create a Beautiful Soup object that takes the HTML content you scraped earlier as its input. Note: You’ll want to pass page.content instead of page.text to avoid problems with character encoding. The .content attribute holds raw bytes, which can be decoded better than the text representation you printed earlier using the .text attribute.
The second argument, "html.parser", makes sure that you use the appropriate parser for HTML content.
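To see the parser in action on a manageable scale, you can also feed Beautiful Soup a small HTML snippet directly. The fragment below is modeled on the job board’s markup:

```python
from bs4 import BeautifulSoup

# A small snippet modeled on one element of the job board's markup
snippet = '<p class="location">Stewartbury, AA</p>'

soup = BeautifulSoup(snippet, "html.parser")

# The parsed object gives you structured access to tags, text, and attributes
print(soup.p.text)      # Stewartbury, AA
print(soup.p["class"])  # ['location']
```

The same methods you use on this tiny soup work on the full page, which is what you’ll do next.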
Find Elements by ID
In an HTML web page, every element can have an id attribute assigned. As the name already suggests, that id attribute makes the element uniquely identifiable on the page. You can begin to parse your page by selecting a specific element by its ID.
Switch back to developer tools and identify the HTML object that contains all the job postings. Explore by right-clicking parts of the page and selecting Inspect from the context menu. Remember, it helps to periodically switch back to your browser and interactively explore the page using developer tools. Doing so helps you learn how to find the exact elements you’re looking for.
The element you’re looking for is a div with an id attribute that has the value "ResultsContainer". It has some other attributes as well, but below is the gist of what you’re looking for:
<div id="ResultsContainer">
<!-- all the job listings -->
</div>
Beautiful Soup allows you to find that specific HTML element by its ID:
results = soup.find(id="ResultsContainer")
For easier viewing, you can prettify any Beautiful Soup object when you print it out. If you call .prettify() on the results variable that you just assigned above, then you’ll see all the HTML contained within the <div>:
print(results.prettify())
When you use the element’s ID, you can pick out one element from among the rest of the HTML. Now you can work with only this specific part of the page’s HTML. It looks like the soup just got a little thinner! However, it’s still quite dense.
Find Elements by HTML Class Name
You’ve seen that every job posting is wrapped in a div element with the class card-content. Now you can work with your new object called results and select only the job postings in it. These are, after all, the parts of the HTML that you’re interested in! You can do this in one line of code:
job_elements = results.find_all("div", class_="card-content")
Here, you call .find_all() on the Beautiful Soup object, which returns an iterable containing all the HTML for all the job listings displayed on that page.
Take a look at all of them:
for job_element in job_elements:
print(job_element, end="\n"*2)
That’s already pretty neat, but there’s still a lot of HTML! You saw earlier that your page has descriptive class names on some elements. You can pick out those child elements from each job posting with .find():
for job_element in job_elements:
title_element = job_element.find("h2", class_="title")
company_element = job_element.find("h3", class_="company")
location_element = job_element.find("p", class_="location")
print(title_element)
print(company_element)
print(location_element)
print()
Each job_element is another Beautiful Soup object. Therefore, you can use the same methods on it as you did on its parent element, results.
With this code snippet, you’re getting closer and closer to the data that you’re actually interested in. Still, there’s a lot going on with all those HTML tags and attributes floating around:
<h2 class="title is-5">Senior Python Developer</h2>
<h3 class="subtitle is-6 company">Payne, Roberts and Davis</h3>
<p class="location">Stewartbury, AA</p>
Next, you’ll learn how to narrow down this output to access only the text content you’re interested in.
Extract Text From HTML Elements
You only want to see the title, company, and location of each job posting. And behold! Beautiful Soup has got you covered. You can add .text to a Beautiful Soup object to return only the text content of the HTML elements that the object contains:
for job_element in job_elements:
    title_element = job_element.find("h2", class_="title")
    company_element = job_element.find("h3", class_="company")
    location_element = job_element.find("p", class_="location")
    print(title_element.text)
    print(company_element.text)
    print(location_element.text)
    print()
If you run the code snippet above, then you'll see the text content of each element. However, it's likely that you'll also get some extra whitespace. Since you're now working with Python strings, you can .strip() the superfluous whitespace. You can also apply any other familiar Python string methods to further clean up your text:
for job_element in job_elements:
    title_element = job_element.find("h2", class_="title")
    company_element = job_element.find("h3", class_="company")
    location_element = job_element.find("p", class_="location")
    print(title_element.text.strip())
    print(company_element.text.strip())
    print(location_element.text.strip())
    print()
The results finally look much better:
Senior Python Developer
Payne, Roberts and Davis
Stewartbury, AA
Energy engineer
Vasquez-Davidson
Christopherville, AA
Legal executive
Jackson, Chambers and Levy
Port Ericaburgh, AA
That's a readable list of jobs that also includes the company name and the location of each job. However, even though you're looking for a position as a software developer, these results contain job postings from many other fields as well.
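As an aside, the cleanup step relies only on standard Python string methods, so you can experiment with it in isolation. Here's a small sketch, where the raw string is a made-up example of the whitespace-padded text that .text can return:

```python
# A made-up example of the whitespace-padded text that .text can return.
raw_title = "\n        Senior Python Developer\n    "

# .strip() removes leading and trailing whitespace.
clean_title = raw_title.strip()
print(clean_title)

# Other common string methods can help with further cleanup.
print(clean_title.lower())
```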
Find Elements by Class Name and Text Content
Not all of the listed jobs are developer jobs. Instead of printing out all of them, you'll first filter them using keywords. You know that job titles on the page are kept within <h2> elements. To filter for only specific jobs, you can use the string argument:
python_jobs = results.find_all("h2", string="Python")
This code finds all <h2> elements where the contained string matches "Python" exactly. Note that you're directly calling the method on your results variable. If you go ahead and print() the output of the above code snippet to your console, then you might be disappointed, because it'll be empty:
>>> print(python_jobs)
[]
There was a Python job in the search results, so why isn't it showing up?
When you use string= as you did above, your program looks for that string exactly. Any differences in the spelling, capitalization, or whitespace will prevent the element from matching. In the next section, you'll find a way to make your search string more general.
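You can see this exact-match behavior with plain Python strings. The comparison below mimics what string="Python" does; the first two titles come from the sample output above, and the third is hypothetical:

```python
titles = [
    "Senior Python Developer",
    "Energy engineer",
    "Software Engineer (Python)",  # hypothetical title
]

# string="Python" only matches when the element's full string equals "Python".
exact_matches = [title for title in titles if title == "Python"]
print(exact_matches)  # []
```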
Pass a Function to a Beautiful Soup Method
In addition to strings, you can sometimes pass functions as arguments to Beautiful Soup methods. You can change the previous line of code to use a function instead:
python_jobs = results.find_all(
    "h2", string=lambda text: "python" in text.lower()
)
Now you're passing an anonymous function to the string= argument. The lambda function looks at the text of each <h2> element, converts it to lowercase, and checks whether the substring "python" is found anywhere. You can check whether you managed to identify all the Python jobs with this approach:
>>> print(len(python_jobs))
10
Your program has found ten matching job postings that include the word "python" anywhere in their job title!
Finding elements based on their text content is a powerful way to filter your HTML response for specific information. Beautiful Soup gives you the flexibility to use either exact strings or functions as arguments for filtering text in Beautiful Soup objects.
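Because the check operates on plain strings, you can also test the lambda's logic on its own. This sketch uses sample titles from earlier in the tutorial plus one hypothetical title:

```python
def contains_python(text):
    # The same check that the lambda passed to string= performs.
    return "python" in text.lower()

titles = [
    "Senior Python Developer",
    "Energy engineer",
    "Software Engineer (Python)",  # hypothetical title
]

matches = [title for title in titles if contains_python(title)]
print(matches)  # ['Senior Python Developer', 'Software Engineer (Python)']
```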
This seems like a good moment to run your for loop and print the title, location, and company of the Python jobs you've identified:
# ...

python_jobs = results.find_all(
    "h2", string=lambda text: "python" in text.lower()
)

for job_element in python_jobs:
    title_element = job_element.find("h2", class_="title")
    company_element = job_element.find("h3", class_="company")
    location_element = job_element.find("p", class_="location")
    print(title_element.text.strip())
    print(company_element.text.strip())
    print(location_element.text.strip())
    print()
However, when you try to run your scraper to print out the information of the filtered Python jobs, you'll run into an error:
AttributeError: 'NoneType' object has no attribute 'text'
This is a common error that you'll run into a lot when you're scraping information from the Internet. Inspect the HTML of an element in your python_jobs list. What does it look like? Where do you think the error is coming from?
Identify Error Conditions
If you take a closer look at a single element in python_jobs, you'll see that it consists only of the <h2> element that contains the job title:
<h2 class="title is-5">Senior Python Developer</h2>
When you revisit the code you used to select the items, you'll see that that's exactly what you targeted. You filtered for only the <h2> title elements of the job postings that contain the word "python". As you can see, these elements don't include the rest of the information about the job.
The error message you received earlier was related to this:
AttributeError: 'NoneType' object has no attribute 'text'
You tried to find the job title, the company name, and the job's location in each element of python_jobs, but each element contains only the job title text.
Your diligent parsing library still looks for the other ones and returns None because it can't find them. Then, print() fails with the shown error message when you try to extract the .text attribute from one of these None objects.
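You can reproduce this failure mode in isolation. The sketch below stands in for the situation where .find() returns None because no matching element exists:

```python
# .find() returns None when there's no matching element.
element = None

try:
    print(element.text)
except AttributeError as err:
    # The same error message that the scraper produces.
    message = str(err)
    print(message)
```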
The text you're looking for is nested in sibling elements of the <h2> elements your filter returned. Beautiful Soup can help you to select sibling, child, and parent elements of each Beautiful Soup object.
Access Parent Elements
One way to get access to all the information you need is to step up in the hierarchy of the DOM, starting from the <h2> elements that you identified. Take another look at the HTML of a single job posting. Find the <h2> element that contains the job title, as well as its closest parent element that contains all the information that you're interested in:
<div class="card">
<div class="card-content">
<div class="media">
<div class="media-left">
<figure class="image is-48x48">
<img
src="https://files.realpython.com/media/real-python-logo-thumbnail.7f0db70c2ed2.jpg"
alt="Real Python Logo"
/>
</figure>
</div>
<div class="media-content">
<h2 class="title is-5">Senior Python Developer</h2>
<h3 class="subtitle is-6 company">Payne, Roberts and Davis</h3>
</div>
</div>
<div class="content">
<p class="location">Stewartbury, AA</p>
<p class="is-small has-text-grey">
<time datetime="2021-04-08">2021-04-08</time>
</p>
</div>
<footer class="card-footer">
<a
href="https://www.realpython.com"
target="_blank"
class="card-footer-item"
>Learn</a
>
<a
href="https://realpython.github.io/fake-jobs/jobs/senior-python-developer-0.html"
target="_blank"
class="card-footer-item"
>Apply</a
>
</footer>
</div>
</div>
The <div> element with the card-content class contains all the information you want. It's a third-level parent of the <h2> title element that you found with your filter.
With this information in mind, you can now use the elements in python_jobs and fetch their great-grandparent elements instead to get access to all the information you want:
python_jobs = results.find_all(
    "h2", string=lambda text: "python" in text.lower()
)

python_job_elements = [
    h2_element.parent.parent.parent for h2_element in python_jobs
]
You added a list comprehension that operates on each of the <h2> title elements in python_jobs that you got by filtering with the lambda expression. You're selecting the parent element of the parent element of the parent element of each <h2> title element. That's three generations up!
When you were looking at the HTML of a single job posting, you identified that this particular parent element with the class name card-content contains all the information you need.
Now you can adapt the code in your for loop to iterate over the parent elements instead:
for job_element in python_job_elements:
    # -- snip --
When you run your script another time, you'll see that your code once again has access to all the relevant information. That's because you're now looping over the <div class="card-content"> elements instead of just the <h2> title elements.
Using the .parent attribute that each Beautiful Soup object comes with gives you an intuitive way to step through your document's DOM structure and address the elements you need. You can also access child and sibling elements in a similar manner; read up on navigating the tree for more information.
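If you want to see .parent in action without fetching a page, you can feed Beautiful Soup a small HTML string directly. This is a minimal sketch, assuming you have beautifulsoup4 installed; the markup mirrors the nesting of the job card shown earlier:

```python
from bs4 import BeautifulSoup

html = """
<div class="card-content">
  <div class="media">
    <div class="media-content">
      <h2 class="title">Senior Python Developer</h2>
    </div>
  </div>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
h2_element = soup.find("h2")

# Step up three generations: media-content -> media -> card-content.
card = h2_element.parent.parent.parent
print(card["class"])  # ['card-content']
```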
Extract Attributes From HTML Elements
At this point, your Python script already scrapes the site and filters its HTML for relevant job postings. Well done! However, what's still missing is the link to apply for a job.
While you were inspecting the page, you found two links at the bottom of each card. However, if you handle the link elements in the same way as you handled the other elements, you won't get the URLs that you're interested in:
for job_element in python_job_elements:
    # -- snip --
    links = job_element.find_all("a")
    for link in links:
        print(link.text.strip())
If you run this code snippet, then you'll get the link texts Learn and Apply instead of the associated URLs.
That's because the .text attribute leaves only the visible content of an HTML element. It strips away all HTML tags, including the HTML attributes containing the URL, and leaves you with just the link text. To get the URL instead, you need to extract the value of one of the HTML attributes rather than discard it.
The URL of a linked page is contained in the href attribute of an <a> element. The specific URL that you're looking for is the value of the href attribute of the second <a> tag at the bottom of the HTML of a single job posting:
<!-- snip -->
<footer class="card-footer">
<a href="https://www.realpython.com" target="_blank"
class="card-footer-item">Learn</a>
<a href="https://realpython.github.io/fake-jobs/jobs/senior-python-developer-0.html"
target="_blank"
class="card-footer-item">Apply</a>
</footer>
</div>
</div>
Start by fetching all the <a> elements in a job card. Then extract the value of their href attributes using square-bracket notation:
for job_element in python_job_elements:
    # -- snip --
    links = job_element.find_all("a")
    for link in links:
        link_url = link["href"]
        print(f"Apply here: {link_url}\n")
In this code snippet, you first fetched all the links from each of the filtered job postings. Then you extracted the href attribute, which contains the URL, using ["href"], and printed it to your console.
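To see in isolation why the URL lives in an attribute rather than in the link text, here's a sketch using only Python's standard-library HTML parser. The LinkCollector class is a hypothetical helper for illustration, not part of Beautiful Soup:

```python
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    """Collect the href attribute of every <a> tag (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.urls = []

    def handle_starttag(self, tag, attrs):
        # Attributes arrive as (name, value) pairs, separate from the text.
        if tag == "a":
            attributes = dict(attrs)
            if "href" in attributes:
                self.urls.append(attributes["href"])

collector = LinkCollector()
collector.feed(
    '<a href="https://www.realpython.com" class="card-footer-item">Learn</a>'
)
print(collector.urls)  # ['https://www.realpython.com']
```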
You can use the same square-bracket notation to extract other HTML attributes as well.
Keep Practicing
If you've coded along with this tutorial, then you can run your script as is, and you'll see the fake job information pop up in your terminal. Your next step is to tackle real-life job boards! To keep practicing your new skills, revisit the web scraping process using any of the following sites:
The linked websites return their search results as static HTML responses, similar to the Fake Python job board. That means you can scrape them with just requests and Beautiful Soup.
Start going through this tutorial again from the top, but this time use one of these other websites. You'll see that each website's structure is different, and that you'll need to rebuild the code in a slightly different way to fetch the data you want. Tackling this challenge is a great way to put your new knowledge into practice. While it might make you sweat every so often, your coding skills will be stronger for it!
During your second attempt, you can also explore additional features of Beautiful Soup. Use the documentation as both a handbook and a source of inspiration. Extra practice will help you become more proficient at web scraping using Python, requests, and Beautiful Soup.
To round off your journey into web scraping, you could give your code a final makeover by building a command-line interface (CLI) app that scrapes one of the job boards and filters the results by a keyword that you input on each execution. Your CLI tool could allow you to search for specific types of jobs, or jobs in particular locations.
If you're interested in learning how to adapt your script as a command-line interface, then check out the article on how to build command-line interfaces in Python with argparse.
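As a rough starting point, the keyword handling of such a CLI could look like the sketch below. The scraping step itself is omitted; only the argument parsing is shown, and the sample argument list stands in for real command-line input:

```python
import argparse

parser = argparse.ArgumentParser(
    description="Scrape a job board and filter postings by keyword"
)
parser.add_argument("keyword", help="keyword to filter job titles by")

# In a real script you'd call parser.parse_args() with no arguments so
# that it reads sys.argv; here a sample list stands in for user input.
args = parser.parse_args(["python"])
print(f"Filtering jobs by keyword: {args.keyword}")
```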