PACKAGES
import requests
from bs4 import BeautifulSoup as bs
from tld import get_tld
import pandas as pd
import re
import copy
from urllib.parse import urljoin
EXTENDED BEAUTIFULSOUP FUNCTIONS
I've created several functions that should really be in base BeautifulSoup:
def findPreviouses(soup, tag, n):
PURPOSE
The find_previous() function in BeautifulSoup searches for the previous occurrence of a particular tag.
However, there are often cases where we need to apply it several times, chaining calls like soup.find_previous(tag).find_previous(tag).
Before I found out about the find_all_previous() function, I needed a shorthand for repeating that search.
So here it is.
PARAMETERS
soup [BeautifulSoup object]: the soup object that you want to search in
tag [HTML tag like 'p' or 'h3', or a list of them]: the tags that you want to search for
n [int]: the number of times the function find_previous() is to be repeated
OUTPUT
BeautifulSoup object
for x in range(n):
    soup = soup.find_previous(tag)
return soup
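A quick sketch of how this reads in practice (the HTML snippet here is made up purely for illustration):
sample = bs("<h3>A</h3><p>one</p><p>two</p><p>three</p>", 'html.parser')
last = sample.find_all('p')[-1]                 # the <p>three</p> tag
findPreviouses(last, 'p', 2)                    # <p>one</p>, i.e. last.find_previous('p').find_previous('p')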
def findNexts(soup, tag, n):
PURPOSE
The find_next() function in BeautifulSoup searches for the next occurrence of a particular tag.
However, there are often cases where we need to apply it several times, chaining calls like soup.find_next(tag).find_next(tag).
Before I found out about the find_all_next() function, I needed a shorthand for repeating that search.
So here it is.
PARAMETERS
soup [BeautifulSoup object]: the soup object that you want to search in
tag [HTML tag like 'p' or 'h3', or a list of them]: the tags that you want to search for
n [int]: the number of times the function find_next() is to be repeated
OUTPUT
BeautifulSoup object
for x in range(n):
    soup = soup.find_next(tag)
return soup
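The mirror-image call, reusing the made-up snippet from above:
first = sample.find('h3')                       # the <h3>A</h3> tag
findNexts(first, 'p', 3)                        # <p>three</p>, three find_next('p') hops forward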
def allstr(soup, joiner=""):
PURPOSE
The stripped_strings attribute of a BeautifulSoup object yields its text content with the whitespace-only pieces removed.
However, it returns an iterator rather than a single string.
As such, allstr() is shorthand for the cleanup process.
PARAMETERS
soup [BeautifulSoup object]: the soup object that you want to clean
joiner [str]: how you want to join the list
OUTPUT
str
return joiner.join(list(soup.stripped_strings))
def soupstr(soup):
PURPOSE
Similar to the above, soupstr() just returns the stripped strings as a list instead of joining them together.
PARAMETERS
soup [BeautifulSoup object]: the soup object that you want to clean
OUTPUT
list of str
return list(soup.stripped_strings)
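A small sketch of both helpers on a made-up fragment:
frag = bs("<div> <p>Hello</p> <p>world</p> </div>", 'html.parser')
soupstr(frag)        # ['Hello', 'world']
allstr(frag, ' ')    # 'Hello world'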
THE WEBSITE() CLASS
Here are the functionalities of the package:
class website():
PURPOSE
A class to represent data retrieved from a website.
Functions can also be applied to extract more information.
DEFAULT ATTRIBUTES
url [str]: the link itself
domain [str]: the domain link
html [BeautifulSoup object]: the contents of the page, in html format
hrefs [list of str]: all the outward links of the page
OUTPUT
str
def __init__(self, url):
PARAMETERS
url [str]: the link to the site
self.url = url
self.domain = get_tld(url, as_object=True).fld
self.html = bs(requests.get(url).text, 'html.parser')
self.hrefs = [a['href'] for a in self.html(href=True)]
def __str__(self):
PURPOSE
When the object is printed, return a string form of a dictionary containing its url and hrefs (__str__ has to return a str, not a dict).
return str({"url": self.url, "hrefs": self.hrefs})
def attachHrefs(self, hrefOpen = '(href', hrefClose = 'href)', subset = None, edit = False):
PURPOSE
This function aims to take each "a"-tagged element and extract its url and the text it contains.
After that, the url is appended to the text, and the combined string is written back into the BeautifulSoup object.
In order to keep track of what changes I've made to the BeautifulSoup object, as well as to more easily find the urls, a "container" is placed around the urls.
That's where hrefOpen and hrefClose come into play.
They basically surround the url and make it easier to find using search terms.
By default, they convert an "a" tag from <a href="url.com">text</a> into text(hrefurl.comhref).
But of course, they can be changed.
PARAMETERS
hrefOpen [str]: refer to above
hrefClose [str]: refer to above
subset [list of html tags like 'p', 'h3']: controls which tags to conduct this operation on (eg. only want to change the "a" tags within div elements)
edit [boolean]: controls whether to change the self.html attribute (setting it to False will output the result to self.editHtml attribute instead)
OUTPUT
Will change either self.html or self.editHtml
if edit:
    html = self.html
else:
    self.editHtml = copy.copy(self.html)
    html = self.editHtml
for tag in html(subset):
    for a in tag('a', href=True):
        cleaned = hrefOpen + website.cleanHref(a['href'], domain=self.domain, url=self.url) + hrefClose
        # a.string is None when the anchor has child tags, so fall back to replacing it outright
        try:
            a.string = a.string + cleaned
        except TypeError:
            a.string = cleaned
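To make the container concrete, here is a rough sketch of the effect on a single cell, assuming the page from the earlier example contains something like <td><a href="/projects.html">Projects</a></td>:
site.attachHrefs(subset='td', edit=False)
# the anchor text inside self.editHtml becomes:
# Projects(hrefhttp://example.com/projects.htmlhref)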
def getTables(self, merge = True, href = True, hrefOpen = '(href', hrefClose = 'href)', hrefSeparate = True):
PURPOSE
At its core, this function simply retrieves all the tables within the html.
That simple task is done using Pandas' read_html() function.
However, combining it with the previously discussed function yields more useful results.
You see, quite a lot of tables contain links within their cells, and pd.read_html() simply drops those links.
So, this function rips the links out of their "a" tags and places them either into separate columns, or attached to the text.
Under the former, for each row, all the links found are gathered into a single column.
Another purpose of this function is to combine tables with identical column headers.
Under the condition merge = True, the code looks for groups of tables with the exact same set of column headers.
From that, it returns the merged tables.
PARAMETERS
hrefOpen [str]: refer to above
hrefClose [str]: refer to above
merge [boolean]: refer to above
href [boolean]: whether to include the links, or to discard them altogether
hrefSeparate [boolean]: whether to separate the hrefs into a separate column
OUTPUT
Under merge = True, two new attributes will be added: self.mergedTables and self.uniqueTables.
mergedTables contains the tables that can be merged, while uniqueTables contains the rest.
if href:
    self.attachHrefs(subset='td', hrefOpen=hrefOpen, hrefClose=hrefClose, edit=False)
    # the links were written into self.editHtml, so read the tables from there
    self.tables = pd.read_html(re.sub('[\n\t]+', 'Ǟ', str(self.editHtml('table'))))
else:
    self.tables = pd.read_html(re.sub('[\n\t]+', 'Ǟ', str(self.html('table'))))
self.tables = [table.fillna('') for table in self.tables]
if hrefSeparate:
    # escape the delimiters, since the defaults contain regex metacharacters like '('
    hrefPattern = re.compile(re.escape(hrefOpen) + "(.*?)" + re.escape(hrefClose))
    for table in self.tables:
        table['Links'] = table.apply(lambda x: '\n'.join(re.findall(hrefPattern, ''.join(map(str, x)))), axis=1)
        table.loc[:, table.columns != 'Links'] = table.loc[:, table.columns != 'Links'].applymap(lambda x: re.sub(hrefPattern, '', x) if isinstance(x, str) else x)
self.mergedTables = []
self.uniqueTables = []
if merge:
    columnSets = {}
    for table in self.tables:
        # group tables by their (sorted) column headers
        columns = '||'.join(sorted(map(str, table.columns)))
        columnSets.setdefault(columns, []).append(table)
    for group in columnSets.values():
        if len(group) > 1:
            self.mergedTables.append(pd.concat(group))
        else:
            self.uniqueTables.append(group[0])
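A usage sketch, again on the site object from before (what comes back naturally depends on the page):
site.getTables(merge=True, href=True, hrefSeparate=True)
len(site.tables)                         # every table found on the page
[t.shape for t in site.mergedTables]     # groups that shared column headers, merged
[t.shape for t in site.uniqueTables]     # the leftovers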
def getLists(self, href = True, hrefOpen = '(href', hrefClose = 'href)', edit = False):
PURPOSE
Similar to the getTables() function, the getLists() function consolidates the urls found within list elements.
PARAMETERS
hrefOpen [str]: refer to getTables()
hrefClose [str]: refer to getTables()
href [boolean]: whether to include the links, or to discard them altogether
edit [boolean]: there are two different places where html content is stored (self.html and self.editHtml). As such, this parameter will choose which one to use.
OUTPUT
self.rawLists holds the raw ol/ul elements found in the html
self.lists holds one DataFrame per list, with a 'Text' column for the item text and an 'Hrefs' column for the links pulled out of it
if href:
    self.attachHrefs(subset=re.compile('^[ou]l$'), hrefOpen=hrefOpen, hrefClose=hrefClose, edit=edit)
if edit:
    self.rawLists = self.html(re.compile('^[ou]l$'))
else:
    self.rawLists = self.editHtml(re.compile('^[ou]l$'))
hrefPattern = re.compile(re.escape(hrefOpen) + "(.*?)" + re.escape(hrefClose))
self.lists = []
for rawList in self.rawLists:
    ss = pd.DataFrame([li.get_text() for li in rawList('li')], columns=['Text'])
    ss['Hrefs'] = ss['Text'].apply(lambda x: '\n'.join(re.findall(hrefPattern, x)))
    ss['Text'] = ss['Text'].apply(lambda x: re.sub(hrefPattern, '', x))
    self.lists.append(ss)
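And a sketch for the lists, assuming the page actually has at least one ol/ul element:
site.getLists(href=True)
site.rawLists[0]                            # the first list element as raw soup
site.lists[0][['Text', 'Hrefs']].head()     # its items, with the links pulled into their own column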
@staticmethod
def cleanHref(href, domain=None, url=None):
PURPOSE
Sometimes, the href of an "a" element points to an id on the same page (href="#projects" etc.)
Sometimes, the href of an "a" element points to a page in the same domain (href="projects.html" etc.)
Sometimes, the href of an "a" element points to an external source (href="projects.com" etc.)
This function aims to standardise all of these into absolute urls, making it easier to traverse the site.
PARAMETERS
Since this function is mainly used by other functions within this package, there is not really a need to use it on its own.
href [str]: the input url which you want to clean
domain [str]: if you know the domain link, then put it here, else, just leave it be
url [str]: the input the current url of the site, which helps to convert the internal hrefs
OUTPUT
A single str for the cleaned url
if not href:
    return href
if href[0] == '#':
    # same-page anchor: strip any existing fragment from the current url, then re-attach
    return re.split('#[^#]*$', url)[0] + href
elif href[0:2] == '//':
    return 'http:' + href
elif href[0] == '/':
    return 'http://' + domain + href
else:
    # relative links like 'projects.html' resolve against the page url via urljoin;
    # absolute urls and other schemes (mailto: etc.) pass through unchanged
    return urljoin(url, href) if url else href
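The documented cases, with a made-up page url and domain (the last one resolves via urljoin):
page = 'http://example.com/blog/post.html'
website.cleanHref('#projects', url=page)                # 'http://example.com/blog/post.html#projects'
website.cleanHref('//cdn.example.org/x.js')             # 'http://cdn.example.org/x.js'
website.cleanHref('/about.html', domain='example.com')  # 'http://example.com/about.html'
website.cleanHref('contact.html', url=page)             # 'http://example.com/blog/contact.html'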
def cleanHrefs(self):
PURPOSE
This applies the cleanHref() function to all the links in the website.
Simple as that.
OUTPUT
Quite self-explanatory new attributes for the website() object.
allHrefs is for everything, selfHrefs is for ID references on the same page, intHrefs is for links to the same domain, and extHrefs is everything else.
self.allHrefs = []
self.selfHrefs = []
self.intHrefs = []
self.extHrefs = []
for href in self.hrefs:
    try:
        if href[0] == '#':
            self.selfHrefs.append(re.split('#[^#]*$', self.url)[0] + href)
        elif href[0:2] == '//':
            self.extHrefs.append('http:' + href)
        elif href[0] == '/':
            self.intHrefs.append('http://' + self.domain + href)
        else:
            self.extHrefs.append(href)
        self.allHrefs.append(href)
    except IndexError:
        # skip empty hrefs
        pass
self.allHrefs = list(set(self.allHrefs))
self.selfHrefs = list(set(self.selfHrefs))
self.intHrefs = list(set(self.intHrefs))
self.extHrefs = list(set(self.extHrefs))
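Finally, a sketch of categorising every link on the page loaded earlier:
site.cleanHrefs()
len(site.intHrefs), len(site.extHrefs), len(site.selfHrefs)   # internal vs external vs same-page anchors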