As in the previous post, we are going to learn how to extract information from the Internet. Before we can apply data mining techniques, we first have to create a dataset. So, let’s start.
1. What is scraping?
Scraping is a technique that allows us to extract information from the Internet. For example, scraping a web page means that we are going to extract the HTML from that page and then take the ‘useful’ information from the HTML. Useful information is the information that we need, for example, the infobox of a Wikipedia page or the meta tags of a web page, etc. For more information, you can check the definition of Web Scraping.
2. Scraping a Webpage
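For this project we are going to need the following packages: BeautifulSoup4 and lxml, plus urllib2, httplib, re and traceback from the Python 2.7 standard library.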
Like before, I am going to build the project as a Python class callable from any file. In this example, I am going to scrape my previous blog posts, ‘first blog post’ and ‘Mining the social media using python 2.7’.
To begin with, I am going to create a text file that holds all the links that we want to scrape (in our case there will be only two links). So, we need to open the file and put the links into a list for further processing.
File: links
https://thelastdev.com/first-blog-post/
https://thelastdev.com/mining-the-social-media-using-python-2-7-13/
testkdkljfhaslkd jfhaslk
https://thelastdev
As you can see, my file does not contain only URLs, so we are going to sanitize the input to keep our code robust. I am going to use regular expressions (regex) to filter the links, but you can use other libraries or packages like:
- urlparse
- urllib (For Python <2.7.9, urllib does not attempt to validate the server certificates of HTTPS URIs!)
- etc
In any case, when you are building a project, always, but ALWAYS, perform data validation! That way it becomes a part of your life.
import re

# Read the raw lines; the file may contain junk alongside real URLs.
with open(linksPath, 'r') as file:
    WebScrap.urls = file.readlines()

urlFilter = re.compile(
    r'^(?:http)s?://'  # http:// or https://
    r'(?:(?:[A-Z0-9](?:[A-Z0-9-]{0,61}[A-Z0-9])?\.)+(?:[A-Z]{2,6}\.?|[A-Z0-9-]{2,}\.?)|'  # domain
    r'\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})'  # ip
    r'(?:/?|[/?]\S+)$', re.IGNORECASE)

# Keep only the lines that match the URL pattern.
WebScrap.urls = [url.strip() for url in WebScrap.urls if urlFilter.match(url.strip())]
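As an alternative to the regex, the urlparse route mentioned in the list above could look roughly like the sketch below. It assumes Python 3, where the Python 2 urlparse module lives in urllib.parse. The helper names is_valid_url and filter_urls are made up for this illustration, and the rules (http/https scheme, a host with a dot in it, no whitespace) only approximate what the regex checks.

```python
from urllib.parse import urlparse


def is_valid_url(candidate):
    """Return True for http(s) URLs with a dotted host and no whitespace."""
    text = candidate.strip()
    # Reject empty lines and free text that contains spaces.
    if len(text.split()) != 1:
        return False
    try:
        parts = urlparse(text)
    except ValueError:
        # urlparse raises ValueError on some malformed inputs.
        return False
    host = parts.hostname or ""
    # Require a dotted host such as "example.com"; this rejects
    # bare hosts like "https://thelastdev".
    return parts.scheme in ("http", "https") and "." in host.strip(".")


def filter_urls(lines):
    """Keep only the stripped lines that pass is_valid_url."""
    return [line.strip() for line in lines if is_valid_url(line)]


raw_lines = [
    "https://thelastdev.com/first-blog-post/\n",
    "testkdkljfhaslkd jfhaslk\n",
    "https://thelastdev\n",
]
print(filter_urls(raw_lines))
```

Only the first line survives the filter here; the free text and the host without a dot are both dropped.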
The next step is the scraping itself, where we are going to extract the needed information from a web page. If you have different websites in your URL list, then you have to scrape each website separately, because we are aiming for specific content and every site structures its pages differently. In that case, you can extract the HTML and then analyze each website’s pages separately. In our case, I am going to extract information only from my blog posts (only two at the time of writing). By checking the source of the page we can target the exact div/span/a/p where the needed information is.
Here our desired element is a div with the class name ‘entry-content’. So, using urllib2 and BeautifulSoup4, I am going to fetch the HTML content and filter it to get the text of that div.
import httplib
import traceback
import urllib2

from bs4 import BeautifulSoup

content = []
for url in WebScrap.urls:
    try:
        # Present a browser-like User-Agent with the request.
        hdr = {'User-Agent': 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'}
        req = urllib2.Request(url, headers=hdr)
        page = urllib2.urlopen(req)
        soup = BeautifulSoup(page.read(), "lxml")
        # The post body lives in <div class="entry-content">.
        pageContent = soup.find('div', {'class': 'entry-content'})
        content.append(pageContent.text)
    except urllib2.HTTPError as e:
        print 'HTTPError = ' + str(e.code)
    except urllib2.URLError as e:
        print 'URLError = ' + str(e.reason)
    except httplib.HTTPException as e:
        print 'HTTPException'
    except Exception:
        print 'generic exception: ' + traceback.format_exc()
The contents of the ‘content’ list look like this:
Greetings everyone! This is my first blog post ever! In this post, I will try to explain the purpose of this blog along side with the fist topics that I am going to cover. First of all, I would like to thank each and every one of you for the time that you will dedicate to read my posts! Now, let the fun begins! As the title of the web page states, in this blog, I will analyze Data Mining algorithms implemented in Python. In the future, I would like to make some tutorials too, about Python language, C/C++ language, BioInformatics algorithms etc… Considering Data Mining posts, I will start presenting methods and Python libraries that are used to collect data, something like urllib2, beautifulsoup4, lxml, scrapy, tweety and many more! So my first goal is to show you a method to collect data from the Internet, after that we will be able to process them with many more algorithms and methods and in the final process, extract information from them! For each project I do and each line of code I post, I will upload everything to my personal GitHub with an appropriate link provided. Yours, Siaterlis Konstantinos Share this:TwitterFacebookLinkedInMoreGoogleRedditTumblrPinterestEmailPrintLike this:Like Loading... Greetings! In this post, I will show you how to mine the Social Media, to be more precice Twitter! It is a very simple process and I will show you how to do it in Python 2.7 in a couple of steps. Step 1 – Install Python Packages ....
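If you want to try the extraction step offline, here is a minimal sketch that pulls the text of the ‘entry-content’ div out of an HTML string. It is not the BeautifulSoup4 approach used above: it swaps in Python 3’s standard html.parser module so it runs without extra packages. The class EntryContentParser, the function extract_entry_content and the sample HTML are all invented for this illustration.

```python
from html.parser import HTMLParser


class EntryContentParser(HTMLParser):
    """Collect the text inside the first <div class="entry-content">."""

    def __init__(self):
        super().__init__()
        self.depth = 0      # nesting depth of <div> tags inside the target
        self.done = False   # set once the target div has been closed
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag != "div" or self.done:
            return
        if self.depth:
            # A nested div inside the target: track it so we know when
            # the target itself closes.
            self.depth += 1
        elif "entry-content" in (dict(attrs).get("class") or "").split():
            self.depth = 1

    def handle_endtag(self, tag):
        if tag == "div" and self.depth:
            self.depth -= 1
            if self.depth == 0:
                self.done = True

    def handle_data(self, data):
        if self.depth:
            self.chunks.append(data)


def extract_entry_content(html):
    """Return the whitespace-normalized text of the target div, or ''."""
    parser = EntryContentParser()
    parser.feed(html)
    parser.close()
    return " ".join("".join(parser.chunks).split())


sample_html = (
    '<html><body><div class="header">Menu</div>'
    '<div class="entry-content"><p>Greetings <b>everyone</b>!</p>'
    '<div class="share">Share this</div></div>'
    '<div>Footer</div></body></html>'
)
print(extract_entry_content(sample_html))
```

Like the real output above, the text of nested widgets (the share buttons) comes along with the post body, because they sit inside the same div. When the div is missing this sketch returns an empty string, whereas soup.find returns None and the .text access raises an AttributeError.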
And that’s it! The only problem I encountered was with the User-Agent of urllib2, where I had to specify a browser-compatible value.
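For readers on Python 3, where urllib2 has been folded into urllib.request, the User-Agent part could be sketched as follows. By default urllib identifies itself as Python-urllib/x.y, which some servers refuse, so that is presumably what caused the problem. The helper name build_request is made up for this example, and nothing is fetched here: the request object is only built and inspected.

```python
from urllib.request import Request

USER_AGENT = "Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)"


def build_request(url, user_agent=USER_AGENT):
    """Create a Request carrying an explicit User-Agent header."""
    return Request(url, headers={"User-Agent": user_agent})


req = build_request("https://thelastdev.com/first-blog-post/")
# urllib stores header names capitalized, hence "User-agent" here.
print(req.get_header("User-agent"))
# Passing req to urllib.request.urlopen(req) would perform the actual fetch.
```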
Other ways of scraping a web page are:
- Python scrapy – A very powerful tool, which I am going to make a tutorial about.
- Python requests – Simple but effective way of getting the content of a web page
- Python webbrowser – A library that ships with Python and opens a browser with the page you have selected (similar in spirit to selenium). This kind of scraping is useful when you have to deal with JavaScript-generated web pages.
- Dryscrape – An awesome tool for scraping javascript generated web pages.
Bottom line: urllib2 is a nice library for simple scraping. I prefer using scrapy on more complex projects, but always use an API when possible.
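Whichever library you pick, spacing out your requests keeps a scraper gentle on the site it visits. Here is a minimal Python 3 sketch of that idea. The Throttle class and the 0.05 second interval are invented for this illustration, not part of any library, and a real project would pick a much longer delay.

```python
import time


class Throttle:
    """Enforce a minimum delay between consecutive requests."""

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self.clock = clock      # injectable so the class can be tested
        self.sleep = sleep
        self.last_call = None

    def wait(self):
        """Sleep just long enough to keep calls min_interval seconds apart."""
        now = self.clock()
        if self.last_call is not None:
            remaining = self.min_interval - (now - self.last_call)
            if remaining > 0:
                self.sleep(remaining)
                now = self.clock()
        self.last_call = now


throttle = Throttle(min_interval=0.05)
for url in ["https://thelastdev.com/first-blog-post/"] * 2:
    throttle.wait()  # the actual fetch of url would go right after this line
```

The first call returns immediately; every later call waits only for whatever is left of the interval.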
Using a crawler to scrape a website, or running multiple ‘scrapers’ against the same website, can put enough load on it to cause real damage. When one exists, an API is the most convenient method of extracting information. For example, Wikipedia is full of information, but instead of scraping it we can use DBpedia to access everything in a reasonable amount of time. That’s all for today! Until next time, take care and have fun!
Yours,
Siaterlis Konstantinos
P.S. In the next posts we are going to see and implement methods for finding the topics a web page is about, the emotions a tweet has and much more! Also, in the future, we are going to start a secret project called ‘Siakon’!