Get-TableOfContents -Headings @(
    'PowerShell and Web Content'
    'Browsing Websites using PowerShell'
)

Sometimes you end up in situations where you want to get information from an online source such as a webpage, but the service has no API available for you to get information through and it’s too much data to manually copy and paste. Or maybe you need to register a lot of entries on a website, but don’t have a bored friend to help out. Fear not, PowerShell can be your bored friend if you ask nicely!

If you’re using PowerShell 7 or higher you might not be able to run all examples in this post without modification, as the way that web requests parse the returned data has changed and the HTML parsing behind properties such as Forms is no longer available.

PowerShell and Web Content

PowerShell has several ways of getting data from a source on the web, be it a normal webpage or a REST API. There are two cmdlets available to make web requests, and PowerShell also of course has access to everything that .NET has to offer. If neither Invoke-WebRequest nor Invoke-RestMethod is good enough, you can dig into System.Web and build solutions using that. You may encounter cases where encoding doesn’t work as expected, and making your own functions with classes from .NET can be one way of solving it.

Invoke-WebRequest

Invoke-WebRequest is just what it sounds like: it creates and sends a request to a specified web address and then returns a response. Think of it like opening a web page in your browser; you get all the HTML at the address you put in, but also all the metadata that the browser handles for you when presenting the site.
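
As a minimal sketch (using www.google.com as a stand-in for any page):

    Invoke-WebRequest -Uri 'https://www.google.com'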

You can see that there is a lot of metadata returned with the response. Using Invoke-WebRequest you get everything from the content of the web page to the HTTP status code showing what the server said about your request. This is useful but not always needed; sometimes we only want to look at the actual data on the page, stored in the Content property of the response.

We can of course save the response in a variable and expand it to get our data.
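
A quick sketch of that:

    # Save the whole response, then expand only the HTML content
    $Response = Invoke-WebRequest -Uri 'https://www.google.com'
    $Response.Content

But if we’re not going to use the metadata at all, there’s another cmdlet that we can use.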

Invoke-RestMethod

Invoke-RestMethod behaves and is used in the same way as Invoke-WebRequest; the big difference is that you only get the content and no metadata. If the data is in JSON, it will also automatically parse it into an object. This is especially handy when working with REST APIs that respond with data in JSON, and it removes the need to run the content of the response through ConvertFrom-Json afterwards.
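
Running the same request as before, but with Invoke-RestMethod:

    Invoke-RestMethod -Uri 'https://www.google.com'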

We ran the same command, but this time we only got the actual HTML data of www.google.com. If we take a quick look at a site that has an API with more structured information, we can see the difference more clearly.

I like using the JSONPlaceholder API when demonstrating API requests, it’s a fake API that can be used to test your code.
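
As a sketch, using the /todos/1 endpoint of JSONPlaceholder:

    $Url = 'https://jsonplaceholder.typicode.com/todos/1'

    # Invoke-WebRequest keeps the JSON as a plain string in the Content property
    (Invoke-WebRequest -Uri $Url).Content

    # Invoke-RestMethod parses the same JSON into a PSCustomObject
    Invoke-RestMethod -Uri $Url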

Calling the cmdlets side by side makes the differences clearer. If we take a look at the Content of the data we got from Invoke-WebRequest we see that it’s a simple JSON string, while what we got from Invoke-RestMethod has already been converted to a PSCustomObject with properties parsed from the JSON data.

Browsing Websites using PowerShell

Now that we know how to get data from the web, let’s dive deeper to find out how we can parse data, click buttons and keep an active session after logging into a website.

Just like the fake API from the previous example, there are many sites online simply for the purpose of testing web scraping. We’ll use Quotes to Scrape, which has a login feature.
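
As a starting point, a sketch of grabbing the first page and saving its HTML for later:

    # Get the first page of quotes and keep the raw HTML for parsing
    $Response = Invoke-WebRequest -Uri 'http://quotes.toscrape.com'
    $Html = $Response.Content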

Parsing Data

If we look at the site using a browser we can see that it’s split up into a bunch of quotes, each with tags and an author.

Let’s set our goal to getting all quotes on the first page, saving each quote together with its author and tags to a list. To do this we will need to parse the HTML, and an efficient way of doing that is by using Regular Expressions, or regex.

Looking at the HTML of the site in either PowerShell or by using a browser we can find out the structure of each quote.

  • The quote is in the first <span> tag.
  • The author is in the <small> tag.
  • The tags are in the content attribute of the <meta> tag.

Knowing this lets us create a regular expression to gather these values from a pattern, which we can use with the -match operator in PowerShell.
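
As a sketch, matching the quote text (the class and itemprop attribute values here are taken from the site’s markup at the time of writing and may change):

    # -match returns True if the pattern is found anywhere in the string
    $Html -match '<span class="text" itemprop="text">.*</span>'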

PowerShell returns True or False depending on whether or not we find a match, and also automatically stores any matches in a hashtable called $Matches. The pattern above matches the text since . means “any character” and * means “zero or more times”. We can look at the automatic $Matches variable to verify our results.

We can do better though; to filter out exactly what we need, we can create a so-called named group.
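
The same pattern again, but with a named group called Quote:

    # (?<Quote>...) captures whatever .* matches under the name Quote
    $Html -match '<span class="text" itemprop="text">(?<Quote>.*)</span>'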

If we now look at $Matches again we can see that it has a new entry which we can reference by name to get the value.

Using the same procedure we can create a pattern that gathers all values that we want in named groups, according to the HTML structure of each quote on the page:

  • Match a named group for the quote in the span tag, followed by a new line with anything on it.
  • Match a named group for the author in the small tag, followed by 5 new lines with anything on them.
  • Match a named group for the tags in the content attribute of the meta tag.

I won’t go deep into how regex works in this post (that’s for another time), but the following pattern matches the structure of each quote.
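
Here is a reconstruction of such a pattern following the list above; the attribute values come from the site’s markup, and the exact number of lines to skip is an assumption that may need tweaking:

    # One named group per value, following the HTML structure of a quote
    $Pattern = '<span class="text" itemprop="text">(?<Quote>.*)</span>.*\n' +
        '.*<small class="author" itemprop="author">(?<Author>.*)</small>(.*\n){5}' +
        '.*content="(?<Tags>.*)"'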

This lets us capture the text for each of the named groups we defined earlier.

The problem is that even when we run it on all our HTML data we only get a single quote matched. This is because -match only returns a single match, the first one. There are other ways to match by regex in PowerShell that let us get all matches: we can either use Select-String with the parameter -AllMatches and then look at the Matches property of the return value, or use the .NET class [regex] directly.
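
Both approaches as a sketch:

    # Option 1: Select-String with -AllMatches
    $QuoteMatches = ($Html | Select-String -Pattern $Pattern -AllMatches).Matches

    # Option 2: the static Matches method of the .NET regex class
    $QuoteMatches = [regex]::Matches($Html, $Pattern)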

Both of the commands above produce the same results. Each match comes with some metadata such as its length and index in the total string.

We’re only interested in the matched named groups, so all we need is some magic to get those from each quote. To do this we can loop through all matches and save a custom object of each quote to an array, and we’re done.
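
A sketch of that loop:

    # Build one object per quote from the named groups of each match
    $QuoteList = foreach ($QuoteMatch in $QuoteMatches) {
        [PSCustomObject]@{
            Quote  = $QuoteMatch.Groups['Quote'].Value
            Author = $QuoteMatch.Groups['Author'].Value
            Tags   = $QuoteMatch.Groups['Tags'].Value -split ','
        }
    }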

Looking at $QuoteList we can now see all of the different quotes, each with its author and with the tags from the site split into an array.

Interacting with a Website using PowerShell

So far we’ve gotten data from a website and then looked at or formatted it locally in PowerShell, but sometimes there are cases where the data is locked behind a click. Sometimes you need to log into a website with your credentials before you can access the data, and doing that requires you to have an active session between your web requests.

You can manage a continuous session between requests with both of the cmdlets that we’ve gone through, but you will have an easier time managing things such as entering information in fields and clicking buttons if you use Invoke-WebRequest because of the extra information that it returns.

Let’s use the same site again and try looking at our options for logging in.
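
A sketch of fetching the page and inspecting the response:

    $Site = Invoke-WebRequest -Uri 'http://quotes.toscrape.com'

    # The default view shows Forms, InputFields and Links among the properties
    $Site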

Here we can see that we seem to have no forms to fill out and no input fields, but we do have some links. If we look back to how the site looks we can see that there is a link that leads to a login page. There are a ton of links so I won’t list them all, but we can filter out the one we want. We could also use the links to click the “Next” button to implement paging of all the quotes on the site.

Something to be aware of is that the properties Forms and InputFields may still have content even if they don’t display when looking at the object itself. Let’s have a look at the link and also make sure we’re not missing any fields to fill on the landing page.
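
A sketch of that, filtering the links on their href value:

    # Double-check the form and field properties, then pick out the login link
    $Site.Forms
    $Site.InputFields
    $LoginLink = $Site.Links | Where-Object { $_.href -like '*login*' }
    $LoginLink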

Looks like there are actually no forms or fields. This matches our expectations since there are no visible ones when visiting the main page in a browser either, but it’s a good habit to look through the properties so we know what we have to work with.

We can see that our link has a property called href, if you’ve ever written HTML you probably recognize it as the destination for a link. This is in fact just normal HTML that PowerShell has parsed into an object for us, making it more convenient to browse the content.

We will use the href value of the link we found and simply add it onto our base URL in a string, then use Invoke-WebRequest again on our new combined URL. Then we’ll have a look at the properties to see if we can find any new fields or forms. Let’s also take the opportunity to create a continuous web session that we will use for future web requests. This is done using the -SessionVariable parameter, in which we specify the name of a variable to store our new session in; in our case we’ll end up with a new variable called $DemoSession afterwards.
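
A sketch of that request; note that -SessionVariable takes the variable name without the dollar sign:

    # Browse to the login page and store a new web session in $DemoSession
    $Site = Invoke-WebRequest -Uri "http://quotes.toscrape.com$($LoginLink.href)" -SessionVariable DemoSession

    # See what we have to work with on this page
    $Site.InputFields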

There are definitely some new input fields, but there are actually some hidden forms as well. Forms are generally the way of entering data onto a website, so we want to look for those when trying to log into sites using PowerShell, by accessing the Forms property.
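
For example:

    # The login form doesn't show in the default view, but it's there
    $Site.Forms | Format-List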

There are a few interesting things to note here. Firstly, it shows that the Action of logging in uses the same URL as we just browsed to; this action is what happens when a user clicks the login button in the browser. Actions, just like links, have a path that adds onto the base URL of the website. We can also see that it uses the HTTP method POST, which is used when you want to send data back to the web server. This seems promising, so let’s see if we can set the username and password. This website actually accepts any values since it’s for testing only, so our input doesn’t matter.

Something more to note is that the Forms property is actually a list, so to make sure we get everything right we will access the fields of the first form found on the page, which also happens to be the only one. You access the fields just as you do with values in a hashtable.
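
The field names below are taken from the site’s login form, and since the site accepts anything, the values are made up:

    # Enter our credentials into the fields of the first (and only) form
    $Site.Forms[0].Fields['username'] = 'PowerShellUser'
    $Site.Forms[0].Fields['password'] = 'SuperSecretPassword'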

Great! Now all we need to do is post this back to the website and make sure to use the session that we created so that we can keep browsing the site once logged in. The body of the post will be the entire modified response of the previous web request, in our case our $Site variable that we’ve added our credentials into.
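
A sketch of the login request; in Windows PowerShell, passing a previous response as -Body makes the cmdlet send its form fields:

    # POST the form back using the action path and our existing session
    $LoginResponse = Invoke-WebRequest -Uri "http://quotes.toscrape.com$($Site.Forms[0].Action)" `
        -Method Post -Body $Site -WebSession $DemoSession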

Even though the login action had the same path as our previous link, I used the action as part of the URL instead. This is to be extra clear since they’re not necessarily the same.

If everything worked as expected, as it does in the browser, we should have been redirected to the main page with one of the links now being “Logout” instead.
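
We can check the links of the response we got back:

    # A logout link means the site now considers us logged in
    $LoginResponse.Links | Where-Object { $_.href -like '*logout*' }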

We successfully logged in! We can keep using our web session to navigate deeper into the site if we like, and we’ll stay logged in as we do. Let’s click another link such as the “Next” one and see if we still have the logout button; if we do, it means we kept our session.

We’ll verify that we ended up on a different page by exploring the destination of this page’s “Next” button, and making sure that we still have a link to logout through.
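
A sketch of following the Next link with our session and then checking the new page:

    # Find the Next link on the current page and follow it with our session
    $NextLink = $LoginResponse.Links | Where-Object { $_.href -like '/page/*' }
    $Page = Invoke-WebRequest -Uri "http://quotes.toscrape.com$($NextLink.href)" -WebSession $DemoSession

    # Check where this page's Next button leads, and that we can still log out
    $Page.Links | Where-Object { $_.href -like '/page/*' -or $_.href -like '*logout*' }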

The next page is number 3 and we can still log out! As long as we provide our session we can keep browsing while being logged in. We could even have several browsing sessions active at the same time using different variables, if we wanted to.

I hope you learned something new about working with web content in PowerShell. If you come up with a fun web scraping project, you’re welcome to post a comment below on how and what you did!
