Skip to main content

Python Web Scraping Tutorial

In this tutorial, we are going to talk about web scraping using python.

Firstly, we have to discuss about what is web scraping technique? Whenever we need the data (it can be text, images, links and videos) from web to our database. Lets discuss where we should need the web scraping in real world.

  1. Nowadays, we have so many competitors in each and every field for surpassing them we need their data from the website or Blogs to know about products, customers and their facilities.
  2. And Some Admin of Particular website, blogs and youtube channel want the reviews of their customers in database and want to update with this In, this condition they use web scraping

There are many other areas where we need web scraping, we discussed two points for precise this article for readers.

Prerequisites:

You just have basic knowledge of python nothing else so, get ready for learning web scraping.

Which technology we should use to achieve web scraping?

We can do this with JavaScript and python but according to me and most of the peoples, we can do it with python easily just you should know the basic knowledge of python nothing else rest of the things we will learn in this article.

Python Web Scraping Tutorial

1. Retrieving Links and Text from Website and Youtube Channel through Web Scraping

  • In this first point, we will learn how to get the text and the links of any webpage with some methods and classes.

We are going to do this beautiful soup method.

1. Install BS4 and Install lxml parser

  • To install BS4 in windows open your command prompt or windows shell and type: pip install bs4
  • To install lxml in windows open your command prompt or windows shell and type: pip install lxml

Note: “pip is not recognized” if this error occurs, take help from any reference.

To install BS4 in ubuntu open your terminal:

  • If you are using python version 2 type: pip install bs4
  • If you are using python version 3 type: pip3 install bs4  

To install lxml in ubuntu open your terminal

  • If you are using python version 2 type: pip install lxml
  • If you are using python version 3 type: pip3 install lxml

2. Open Pycharm and Import Modules

Import useful modules:

import bs4

import requests

Import useful modules

Then take url of particular website for example www.thecrazyprogrammer.com

url= "https://www.thecrazyprogrammer.com/"
data=requests.get(url)
soup=bs4.BeautifulSoup(data.text,'htm.parser')
print(soup.prettify())

And now you will get the html script with the help of these lines of code of particular link you provided to the program. This is the same data which is in the page source of the website webpage you can check it also.

Python Web Scraping Tutorial 1Python Web Scraping Tutorial 1

Now we talk about find function() with the help of find function we can get the text, links and many more things from our webpage. We can achieve this thing through the python code which is written below of this line:

We just take one loop in our program and comment the previous line.

for para in soup.find('p')
print(para)

Python Web Scraping Tutorial 2

And we will get the first para of our webpage, you can see the output in the below image. See, this is the original website view and see the output of python code in the below image.

Python Web Scraping Tutorial 3

Pycharm Output

Python Web Scraping Tutorial 4

Now, if you want all the paragraph of this webpage you just need to do some changes in this code i.e.

Here, we should use find_all function() instead find function. Let’s do it practically

Python Web Scraping Tutorial 5

You will get all paragraphs of web page.

Now, one problem will occur that is the “<p>” tag will print with the text data for removing the <p> tag we have to again do changes in the code like this:

Python Web Scraping Tutorial 6

We just add “.text” in the print function with para. This will give us only text without any tags. Now see the output there <p> tag has removed with this code.

With the last line we have completed our first point i.e. how we can get the data (text) and the html script of our webpage. In the second point we will learn how we get the hyperlinks from webpage.

2. How to Get All the Links Of Webpage Through Web Scraping

Introduction:

In this, we will learn how we can get the links of the webpage and the youtube channels also or any other web page you want.

All the import modules will be same some changes are there only that changes are:

Take one for loop with the condition of anchor tag ‘a’ and get all the links using href tag and assign them to the object (you can see in the below image) which taken under the for loop and then print the object. Now, you will get all the links of webpage. Practical work:

Python Web Scraping Tutorial 7

You will get all the links with the extra stuff (like “../” and “#” in the starting of the link)

Python Web Scraping Tutorial 8

  • There is only some valid links in this console screen rest of them are also link but because of some extra stuff are not treating like links for removing this bug we have to do change in our python code.
  • We need if and else condition and we will do slicing using python also, “../” if we replace it with our url (you can see the url above images) i.e.  https://www.thecrazyprogrammer.com/, we will get the valid links of the page in output console let see practically in below image.

Python Web Scraping Tutorial 9

In the above image we take the if condition where the link or you can say that the string start with the “../” start with 3 position of the string using slice method and the extra stuff like “#” which is unuseful for us that’s why we don’t  include it in our output and we used the len() function also for printing the string to the last and with the prefix of our webpage url are also adding for producing the link.

In your case you can use your own condition according to your output.

Now you can see we get more than one link using if condition. We get so many links but there is also one problem that is we are not getting the links which are starting with “/” for getting these links also we have to do more changes in our code lets see what should we do.

So, we have to add the condition elif also with the condition of “/” and here also we should give “#” condition also otherwise we will get extra stuff again in below image we have done this.

Python Web Scraping Tutorial 10

After putting this if and elif condition in our program to finding all the links in our particular webpage We have got the links without any error you can see in below image how we increased our links numbers since the program without the if and elif condition.

Python Web Scraping Tutorial 11

In this way we can get all the links the text of our particular page or website you can find the links in same manner of youtube channel also.

Note: If you have any problem to getting the links change the conditions in program as I have done with my problem you can use as your requirement.

So we have done how we can get the links of any webpage or youtube channel page.

3. Log In Facebook Through Web Scraping

Introduction

In this method we can login any account of facebook using Scraping.

Conditions: How we can use this scarping into facebook because the security of Facebook we are unable to do it directly.

So, we can’t login facebook directly we should do change in url of facebook like we should use m.facebook.com or mbasic.facebook.com url instead of www.facebook.com because facebook has high security level we can’t scrap data directly.

Let’s start scrapping.

This Is Webpage Of m.facebook.com URL

Let’s start with python. So first import all these modules:

import http.cookiejar

import urllib.request

import requests

import bs4

Then create one object and use cookiejar method which provides you the cookie into your python browser.

Create another object known as opener and assign the request method to it.

Note: do all the things on your risk don’t hack someone id or else.

Cj=http.cookiejar.Cookiejar()
Opener=urllib.request.build_opener(urllib.request.HTTPcookieProcessor)
Urllib.request.install_opener(opener)
Authentication_url=""

Python Web Scraping Tutorial 12

After this code, you have to find the link of particular login id through inspecting the page of m.facebook.com and then put the link into under commas and remove all the text after the login word and add “.php” with login word now type further code.

Python Web Scraping Tutorial 13

payload= {
        'email':"xyz@gmail.com",
        'pass':"(enter the password of id)"
}

After this use get function give one cookie to it.

Data=urllib.parse.urlencode(payload).encode('utf-8')
Req=urllib.request.Request(authentication_url,data)
Resp=urllib.request.urlopen(req)
Contents=resp.read()
Print(contents)

Python Web Scraping Tutorial 14

With this code we will login into facebook and the important thing I have written above also do it all things on your risk and don’t hack someone.

We can’t learn full concept of web scraping through this article only but still I hope you learned the basics of python web scrapping.

The post Python Web Scraping Tutorial appeared first on The Crazy Programmer.



from The Crazy Programmer https://www.thecrazyprogrammer.com/2019/03/python-web-scraping-tutorial.html

Comments

Popular posts from this blog

dotnet sdk list and dotnet sdk latest

Can someone make .NET Core better with a simple global command? Fanie Reynders did and he did it in a simple and elegant way. I'm envious, in fact, because I spec'ed this exact thing out in a meeting a few months ago but I could have just done it like he did and I would have used fewer keystrokes! Last year when .NET Core was just getting started, there was a "DNVM" helper command that you could use to simplify dealing with multiple versions of the .NET SDK on one machine. Later, rather than 'switching global SDK versions,' switching was simplified to be handled on a folder by folder basis. That meant that if you had a project in a folder with no global.json that pinned the SDK version, your project would use the latest installed version. If you liked, you could create a global.json file and pin your project's folder to a specific version. Great, but I would constantly have to google to remember the format for the global.json file, and I'd constan

R vs Python for Machine Learning

There are so many things to learn before to choose which language is good for Machine Learning. We will discuss each and everything about R as well as Python and the situation or problem in which situation we have to use which language. Let’s start Python and R are the two most Commonly used Programming Languages for Machine Learning and because of the popularity of both the languages Novice or you can say fresher are getting confused, whether they should choose R or Python language to commence their career in the Machine learning domain. Don’t worry guys through this article we will discuss R vs Python for Machine Learning. So, without exaggerating this article let’s get started. We will start it from the very Basics things or definitions. R vs Python for Machine Learning Introduction R is a programming language made by statisticians and data miners for statistical analysis and graphics supported by R foundation for statistical computing. R also provides high-quality graphics and

Top Tips For PCB Design Layout

Are you thinking about designing a printed circuit board? PCBs are quite complicated, and you need to make sure that the layout that you choose is going to operate as well as you want it to. For this reason, we have put together some top tips for PCB design layout. Keep reading if you would like to find out more about this. Leave Enough Space One of the most important design tips for PCB layout is that you need to make sure that you are leaving enough space between the components. While many people might think that packing components closely is the best route to take, this can cause problems further down the line. This is why we suggest leaving extra space for the wires that will spread. This way, you’ll have the perfect PCB design layout. Print Out Your Layout Struggling to find out if your components sizes match? Our next tip is to print out your layout and compare the printed version to your actual components. Datasheets can sometimes come with errors, so it doesn’t hurt to do